Context-Dependent Affordance Computation in Vision-Language Models

Murad Farzulla

arXiv:2603.04419v1. Abstract: We characterize the phenomenon of context-dependent affordance computation in vision-language models (VLMs). Through a large-scale computational study (n=3,213 scene-context pairs from COCO-2017) using Qwen-VL 30B and LLaVA-1.5-13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that >90% of lexical scene description is context-dependent. Sentence-level cosine similarity confirms substantial drift at the semantic level (mean = 0.415, 58.5% context-dependent). Stochastic baseline experiments (2,384 inference runs across 4 temperatures and 5 seeds) confirm this drift reflects genuine context effects rather than generation noise: within-prime variance is substantially lower than cross-prime variance across all conditions. Tucker decomposition with bootstrap stability analysis (n=1,000 resamples) reveals stable orthogonal latent factors: a "Culinary Manifold" isolated to chef contexts and an "Access Axis" spanning child-mobility contrasts. These findings establish that VLMs compute affordances in a substantially context-dependent manner -- with the difference between lexical (90%) and semantic (58.5%) measures reflecting that surface vocabulary changes more than underlying meaning under context shifts -- and suggest a direction for robotics research: dynamic, query-dependent ontological projection (JIT Ontology) rather than static world modeling. We do not claim to establish processing order or architectural primacy; such claims require internal representational analysis beyond output behavior.

Executive Summary

This article examines context-dependent affordance computation in vision-language models, documenting substantial affordance drift across context conditions (mean Jaccard similarity of 0.095 between context-primed descriptions of the same scene). Over 90% of lexical scene description is context-dependent, with 58.5% drift at the semantic level; the gap between the two measures indicates that surface vocabulary shifts more than underlying meaning. The findings suggest that robotics research move toward dynamic, query-dependent ontological projection (a "JIT Ontology") rather than static world modeling, and they clarify how context shapes the affordances that vision-language models report.

Key Points

  • Context priming induces large affordance drift in vision-language models: mean Jaccard similarity between context conditions is only 0.095
  • Over 90% of lexical scene description is context-dependent, while semantic-level drift (via sentence cosine similarity) is 58.5%
  • The study proposes dynamic, query-dependent ontological projection (JIT Ontology) for robotics research in place of static world modeling
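The lexical-versus-semantic gap in the key points above can be illustrated with a toy sketch. The descriptions below are hypothetical (not from the paper's data), and the paper's semantic measure uses sentence embeddings; a bag-of-words cosine stands in for it here purely for illustration:

```python
from collections import Counter
from math import sqrt

def jaccard(a: set, b: set) -> float:
    """Lexical overlap: |A ∩ B| / |A ∪ B| over description tokens."""
    return len(a & b) / len(a | b) if a | b else 1.0

def cosine(a: Counter, b: Counter) -> float:
    """Bag-of-words cosine similarity (a crude stand-in for
    the sentence-embedding cosine used in the paper)."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical descriptions of one scene under two context primes
chef = "knife and cutting board afford slicing the vegetables".split()
child = "the counter edge is sharp and affords climbing".split()

lex = jaccard(set(chef), set(child))
sem = cosine(Counter(chef), Counter(child))
print(f"lexical Jaccard = {lex:.3f}, bag-of-words cosine = {sem:.3f}")
```

Even in this toy case the cosine measure exceeds the Jaccard measure, mirroring the paper's pattern of semantic similarity sitting above lexical similarity under context shifts.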

Merits

Comprehensive methodology

The study employs a large-scale computational approach (n=3,213 scene-context pairs, two VLMs, seven agentic personas, and stochastic baselines across temperatures and seeds) to demonstrate context-dependent affordance computation and to rule out generation noise as the explanation.
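The stochastic-baseline check underlying this methodology can be sketched as follows. The similarity scores are hypothetical numbers, not the paper's data; the idea is simply that repeated generations under the same prime should vary far less than generations across primes:

```python
from statistics import pvariance

# Hypothetical pairwise similarity scores between generated descriptions:
# repeated runs under the SAME prime (different seeds) vs. different primes.
within_prime = [0.71, 0.69, 0.73, 0.70, 0.72]  # generation noise only
cross_prime = [0.12, 0.08, 0.31, 0.05, 0.44]   # context effect included

v_within = pvariance(within_prime)
v_cross = pvariance(cross_prime)
print(f"within-prime var = {v_within:.4f}, cross-prime var = {v_cross:.4f}")
assert v_within < v_cross  # drift exceeds generation noise
```

When within-prime variance is substantially lower than cross-prime variance, as the paper reports across all conditions, the observed drift cannot be attributed to sampling noise alone.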

Demerits

Limited internal representational analysis

The study analyzes output behavior only, without examining the models' internal representations, so it cannot establish processing order or architectural primacy. The authors acknowledge this limitation explicitly.

Expert Commentary

The article makes a significant contribution to our understanding of context-dependent affordance computation in vision-language models. Its findings underscore the importance of context in the development of AI systems, particularly in robotics and related fields, and the proposed dynamic, query-dependent ontological projection could improve the adaptability of such systems in real-world environments. However, further research into the internal representations of these models is needed to address the limitations of the current, output-only analysis.

Recommendations

  • Future research should focus on internal representational analysis to better understand the processing order and architectural primacy of vision-language models
  • The development of dynamic, query-dependent ontological projection approaches should be prioritized in robotics research to improve the effectiveness and adaptability of AI systems
