Predicting Sentence Acceptability Judgments in Multimodal Contexts

arXiv:2602.20918v1

Abstract: Previous work has examined the capacity of deep neural networks (DNNs), particularly transformers, to predict human sentence acceptability judgments, both independently of context and in document contexts. We consider the effect of prior exposure to visual images (i.e., visual context) on these judgments for humans and large language models (LLMs). Our results suggest that, in contrast to textual context, visual images appear to have little if any impact on human acceptability ratings. However, LLMs display the compression effect seen in previous work on human judgments in document contexts. Different sorts of LLMs are able to predict human acceptability judgments to a high degree of accuracy, but in general, their performance is slightly better when visual contexts are removed. Moreover, the distribution of LLM judgments varies among models, with Qwen resembling human patterns and others diverging from them. LLM-generated predictions on sentence acceptability are highly correlated with their normalised log probabilities in general. However, the correlations decrease when visual contexts are present, suggesting that a wider gap exists between the internal representations of LLMs and their generated predictions in the presence of visual contexts. Our experimental work suggests interesting points of similarity and of difference between human and LLM processing of sentences in multimodal contexts.
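For readers unfamiliar with the normalised log-probability measure the abstract refers to, the sketch below shows one common way such a score is computed for a sentence under a causal language model. This is a minimal illustration, not the paper's method: the choice of gpt2 as the model and mean per-token log probability as the normalisation are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_log_prob(sentence: str) -> float:
    """Average per-token log probability of `sentence` under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy
        # over the predicted tokens, i.e. the negative mean log probability.
        loss = model(ids, labels=ids).loss
    return -loss.item()

# An acceptable sentence should score higher than a scrambled one.
for s in ["The cat sat on the mat.", "Cat the mat on sat the."]:
    print(f"{mean_log_prob(s):8.3f}  {s}")
```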

Executive Summary

This article examines the impact of visual context on sentence acceptability judgments made by humans and by large language models (LLMs). The study finds that visual images have little effect on human judgments, whereas LLMs display a compression effect similar to the one previously observed for human judgments in document contexts. LLMs predict human judgments with high accuracy, though their performance is slightly better when visual contexts are removed. The study highlights interesting similarities and differences between human and LLM processing of sentences in multimodal contexts, with implications for natural language processing and human-computer interaction.
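Prediction accuracy in this literature is typically quantified as a correlation between model scores and mean human ratings. A minimal sketch of that comparison, with made-up numbers standing in for the paper's data:

```python
from scipy.stats import pearsonr, spearmanr

# Illustrative placeholders: mean human acceptability ratings (1-5 scale)
# and model scores (e.g. mean log probabilities) for five sentences.
human_ratings = [4.6, 1.8, 3.9, 2.4, 4.1]
model_scores = [-2.1, -5.3, -2.8, -4.4, -2.5]

rho, _ = spearmanr(human_ratings, model_scores)
r, _ = pearsonr(human_ratings, model_scores)
print(f"Spearman rho = {rho:.2f}, Pearson r = {r:.2f}")
```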

Key Points

  • Visual images have little if any impact on human sentence acceptability judgments
  • LLMs display a compression effect similar to that seen in document contexts (see the sketch after this list)
  • LLMs can predict human judgments with high accuracy, but performance is slightly better when visual contexts are removed
  • LLM-generated acceptability predictions correlate strongly with their normalised log probabilities, though the correlation weakens when visual contexts are present
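The compression effect mentioned above refers to context pulling extreme acceptability ratings toward the middle of the scale. One simple way to detect it, sketched below with illustrative numbers rather than the paper's data, is to regress in-context ratings on null-context ratings and check for a slope below 1.

```python
import numpy as np

# Illustrative placeholders, not the paper's data: mean ratings for the
# same six sentences judged in isolation and after a preceding context.
null_context = np.array([1.2, 2.0, 2.8, 3.5, 4.1, 4.8])
in_context = np.array([1.7, 2.3, 2.9, 3.3, 3.8, 4.3])

# A slope below 1 means low ratings rise and high ratings fall,
# i.e. judgments are compressed toward the middle of the scale.
slope, intercept = np.polyfit(null_context, in_context, 1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```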

Merits

Comprehensive Experimental Design

The study employs a rigorous experimental design, examining the impact of visual context on both human and LLM judgments, and comparing the performance of different LLMs.

Demerits

Limited Generalizability

The study's findings may not generalize to other contexts or populations, and the use of a specific set of visual images and sentences may limit the applicability of the results.

Expert Commentary

The study's findings highlight the complex and nuanced nature of human and LLM processing of sentences in multimodal contexts. The results suggest that LLMs are capable of capturing certain aspects of human judgment, but also exhibit distinct patterns of processing that are influenced by the presence of visual context. Further research is needed to fully understand the implications of these findings and to develop more effective LLMs that can accurately predict human sentence acceptability judgments in a wide range of contexts.

Recommendations

  • Conduct further research on the impact of visual context on human and LLM sentence processing across different task settings and populations
  • Develop more advanced LLMs that can effectively incorporate visual context into their processing and prediction of sentence acceptability judgments

Sources

  • arXiv:2602.20918v1: Predicting Sentence Acceptability Judgments in Multimodal Contexts