
Language-Guided Invariance Probing of Vision-Language Models

Jae Joong Lee

arXiv:2511.13494v1. Abstract: Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.

Executive Summary

The article introduces Language-Guided Invariance Probing (LGIP), a benchmark designed to evaluate the linguistic robustness of vision-language models (VLMs) by measuring their invariance to meaning-preserving paraphrases and sensitivity to meaning-changing semantic flips. Using 40k MS COCO images with five human captions each, the study generates paraphrases and semantic flips to assess model performance. The findings reveal that EVA02-CLIP and large OpenCLIP variants exhibit favorable invariance-sensitivity characteristics, while SigLIP and SigLIP2 show significant invariance errors and often prefer flipped captions over original descriptions. The study highlights the limitations of standard retrieval metrics in capturing these nuances, emphasizing the need for more sophisticated evaluation methods.
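The abstract names three summary statistics: an invariance error, a semantic sensitivity gap and a positive-rate statistic. The paper's exact definitions are not reproduced here, but the sketch below shows one plausible way such statistics could be computed from per-image matching scores; the array shapes, function name and aggregation choices are illustrative assumptions.

```python
import numpy as np

def lgip_statistics(s_orig, s_para, s_flip):
    """Illustrative LGIP-style statistics (assumed forms, not the paper's
    exact definitions).

    s_orig: (N,)   matching score of each image with its original caption
    s_para: (N, P) scores for P meaning-preserving paraphrases per image
    s_flip: (N, F) scores for F meaning-changing flips per image
    """
    # Invariance error: spread of scores across a caption and its
    # paraphrases; a robust model keeps these nearly identical.
    equivalent = np.concatenate([s_orig[:, None], s_para], axis=1)
    invariance_error = float(np.mean(np.std(equivalent, axis=1)))

    # Semantic sensitivity gap: margin by which the original caption
    # outscores its flips (larger is better).
    sensitivity_gap = float(np.mean(s_orig[:, None] - s_flip))

    # Positive rate: fraction of (image, flip) pairs where the original
    # caption wins outright.
    positive_rate = float(np.mean(s_orig[:, None] > s_flip))

    return {"invariance_error": invariance_error,
            "sensitivity_gap": sensitivity_gap,
            "positive_rate": positive_rate}

# Toy usage: 4 images, 3 paraphrases and 2 flips each, random scores.
rng = np.random.default_rng(0)
print(lgip_statistics(rng.normal(0.30, 0.02, 4),
                      rng.normal(0.30, 0.02, (4, 3)),
                      rng.normal(0.20, 0.02, (4, 2))))
```

Under these assumed forms, a low invariance error, a large positive sensitivity gap and a positive rate near 1.0 correspond to the favorable frontier the paper describes.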

Key Points

  • Introduction of LGIP benchmark for evaluating VLMs' linguistic robustness.
  • Assessment of invariance to meaning-preserving paraphrases and sensitivity to meaning-changing semantic flips (a toy sketch of such flips follows this list).
  • EVA02-CLIP and large OpenCLIP variants sit on a favorable invariance-sensitivity frontier.
  • SigLIP and SigLIP2 show significant invariance errors and prefer flipped captions.
  • Standard retrieval metrics fail to capture these linguistic robustness issues.
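The rule-based flips edit a single semantic attribute of a caption. Below is a toy sketch of such edits, assuming simple word-swap tables; the paper's actual rule sets are not specified here and are certainly richer.

```python
import re

# Toy swap tables; illustrative assumptions, not the paper's rule sets.
COLOR_SWAPS  = {"red": "blue", "blue": "green", "white": "black"}
OBJECT_SWAPS = {"dog": "cat", "cat": "dog", "car": "bus"}
COUNT_SWAPS  = {"one": "two", "two": "three", "three": "four"}

def apply_flip(caption, swaps):
    """Apply the first matching swap rule to produce a meaning-changing flip.

    Returns the edited caption, or None if no rule fires.
    """
    for src, dst in swaps.items():
        pattern = r"\b" + re.escape(src) + r"\b"
        if re.search(pattern, caption, flags=re.IGNORECASE):
            return re.sub(pattern, dst, caption, count=1, flags=re.IGNORECASE)
    return None

caption = "a dog chases a red ball"
print(apply_flip(caption, COLOR_SWAPS))   # a dog chases a blue ball
print(apply_flip(caption, OBJECT_SWAPS))  # a cat chases a red ball
print(apply_flip(caption, COUNT_SWAPS))   # None (no count word present)
```

Paraphrases, by contrast, rewrite the caption while keeping object, color and count fixed.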

Merits

Innovative Benchmark

The LGIP benchmark is a novel approach to evaluating the linguistic robustness of VLMs, addressing a critical gap in current assessment methods.

Comprehensive Evaluation

The study provides a thorough evaluation of multiple VLMs, offering insights into their performance on paraphrases and semantic flips.

Practical Implications

The findings have practical implications for improving the robustness and reliability of VLMs in real-world applications.

Demerits

Limited Scope

The study evaluates nine VLMs from a handful of model families, so its findings may not generalize to all vision-language models.

Automated Generation of Perturbations

Automatically generated paraphrases and rule-based flips can carry their own biases, for example templated edits that produce unnatural captions, or paraphrases that subtly shift meaning, which limits what the evaluation can detect.

Human Annotation Dependence

The benchmark is anchored to MS COCO's human-written captions, so subjectivity and variability in those annotations propagate into the scores and the derived statistics.

Expert Commentary

The article presents a significant advance in the evaluation of vision-language models. The LGIP benchmark addresses a gap in current assessment practice, which largely overlooks linguistic robustness: standard retrieval metrics can look healthy even when a model scores a flipped caption above the human one. The findings reveal substantial differences among VLMs. EVA02-CLIP and large OpenCLIP variants sit on a favorable invariance-sensitivity frontier, while SigLIP and SigLIP2 exhibit large invariance errors and a tendency to prefer flipped captions, especially for object and color edits.

These insights are directly useful to developers and researchers aiming to improve the reliability of VLMs. The study's limitations, notably the automated generation of perturbations and the dependence on human-annotated captions, should be kept in mind, and future work could validate the findings with more diverse perturbation methods. Overall, the article makes a valuable contribution by foregrounding linguistic robustness and providing a model-agnostic framework for future evaluations.

Recommendations

  • Developers should incorporate the LGIP benchmark into their evaluation processes to ensure the linguistic robustness of VLMs (a minimal sketch of such a probe follows this list).
  • Future research should explore more diverse and comprehensive methods for generating linguistic perturbations to enhance the robustness of the evaluation.
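As a starting point for the first recommendation, here is a minimal sketch of probing a single image with OpenCLIP via the open_clip package; the image path, captions and flip are placeholders, and this is an illustrative loop rather than the paper's released harness.

```python
import torch
import open_clip
from PIL import Image

# Load an OpenCLIP model (checkpoint tag from the open_clip model zoo).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder path
captions = [
    "a dog chases a red ball",      # original caption
    "a dog runs after a red ball",  # meaning-preserving paraphrase
    "a dog chases a blue ball",     # meaning-changing color flip
]
text = tokenizer(captions)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    scores = (img_feat @ txt_feat.T).squeeze(0)  # cosine similarities

for caption, score in zip(captions, scores):
    print(f"{score.item():.4f}  {caption}")
```

A robust model should score the paraphrase close to the original and the flipped caption clearly below both; aggregating such comparisons over a dataset yields LGIP-style statistics.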

Sources

  • arXiv:2511.13494v1, Language-Guided Invariance Probing of Vision-Language Models: https://arxiv.org/abs/2511.13494