On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

arXiv:2602.12506v1 Abstract: Reinforcement learning (RL) fine-tuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations--misleading captions or incorrect chain-of-thought (CoT) traces--cause substantial drops in robustness and confidence, and that these effects are more pronounced when CoT consistency is taken into account across open-source multimodal reasoning models. Entropy-based metrics further show that these perturbations reshape model uncertainty and probability mass on the correct option, exposing model-specific trends in miscalibration. To better understand these vulnerabilities, we further analyze RL fine-tuning dynamics and uncover an accuracy-faithfulness trade-off: fine-tuning raises benchmark accuracy, but can simultaneously erode the reliability of the accompanying CoT and its robustness to contextual shifts. Although adversarial augmentation improves robustness, it does not by itself prevent faithfulness drift. Incorporating a faithfulness-aware reward can restore alignment between answers and reasoning, but when paired with augmentation, training risks collapsing onto shortcut strategies and robustness remains elusive. Together, these findings highlight the limitations of accuracy-only evaluations and motivate training and assessment protocols that jointly emphasize correctness, robustness, and the faithfulness of visually grounded reasoning.

Executive Summary

The article 'On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs' investigates the vulnerabilities of reinforcement-learning (RL) fine-tuned vision-language models (VLMs) on reasoning-intensive tasks. The study finds that RL-tuned VLMs, while improving benchmark accuracy, remain susceptible to weak visual grounding, hallucinations, and over-reliance on textual cues. The authors demonstrate that controlled textual perturbations, such as misleading captions or incorrect chain-of-thought (CoT) traces, significantly reduce model robustness and confidence, and entropy-based metrics show that these perturbations reshape model uncertainty and shift probability mass away from the correct option. The research uncovers an accuracy-faithfulness trade-off: fine-tuning raises accuracy but can erode the reliability of the accompanying CoT and its robustness to contextual shifts. Adversarial augmentation improves robustness but does not by itself prevent faithfulness drift; a faithfulness-aware reward can restore alignment between answers and reasoning, though pairing it with augmentation risks collapse onto shortcut strategies. The findings underscore the limitations of accuracy-only evaluations and advocate for training and assessment protocols that jointly emphasize correctness, robustness, and faithfulness in visually grounded reasoning.
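
The entropy-based metrics mentioned above can be made concrete with a short sketch. Assuming per-option logits have already been extracted from a VLM for a multiple-choice question (the inputs and shapes here are illustrative assumptions, not the paper's setup), the two quantities of interest are the entropy of the option distribution and the probability mass on the correct option:

```python
# Illustrative sketch: entropy-style uncertainty metrics over answer options.
# `option_logits` and `correct_index` are assumed inputs, not the paper's API.
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the option axis."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def uncertainty_metrics(option_logits: np.ndarray, correct_index: int) -> dict:
    """Entropy of the option distribution and mass on the correct option."""
    probs = softmax(option_logits)
    entropy = -float(np.sum(probs * np.log(probs + 1e-12)))
    return {"entropy": entropy, "p_correct": float(probs[correct_index])}

# Clean prompt vs. a perturbed one (e.g. with a misleading caption):
clean = uncertainty_metrics(np.array([3.2, 0.5, 0.1, -0.4]), correct_index=0)
perturbed = uncertainty_metrics(np.array([1.1, 1.0, 0.6, 0.2]), correct_index=0)
print(clean)      # low entropy, high p_correct
print(perturbed)  # higher entropy, lower p_correct: the miscalibration signal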

Key Points

  • RL fine-tuned VLMs improve benchmark accuracy but are vulnerable to weak visual grounding and hallucinations.
  • Textual perturbations cause substantial drops in robustness and confidence, with pronounced effects when CoT consistency is considered (see the construction sketch after this list).
  • An accuracy-faithfulness trade-off exists, where fine-tuning enhances accuracy but may erode the reliability of CoT.
  • Adversarial augmentation improves robustness but does not prevent faithfulness drift.
  • Incorporating a faithfulness-aware reward can restore alignment between answers and reasoning, though pairing it with augmentation risks collapse onto shortcut strategies.

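The perturbation families named in the abstract (misleading captions and incorrect CoT traces) can be illustrated with a short construction sketch. The prompt template and field names below are assumptions for illustration, not the paper's protocol.

```python
# Illustrative sketch: building clean and perturbed prompts for one VQA item.
# The template and field names are assumptions, not the paper's protocol.
from dataclasses import dataclass

@dataclass
class VQAItem:
    question: str
    options: list[str]
    misleading_caption: str  # caption that contradicts the image content
    incorrect_cot: str       # plausible but wrong reasoning trace

def build_prompts(item: VQAItem) -> dict[str, str]:
    """Return the clean prompt plus the two perturbed variants."""
    base = f"Question: {item.question}\nOptions: {', '.join(item.options)}"
    return {
        "clean": base,
        "misleading_caption": f"Caption: {item.misleading_caption}\n{base}",
        "incorrect_cot": f"{base}\nReasoning: {item.incorrect_cot}",
    }

item = VQAItem(
    question="What color is the traffic light?",
    options=["red", "green", "yellow", "off"],
    misleading_caption="A green traffic light glowing at night.",
    incorrect_cot="The top lamp is lit, and top lamps are green, so the answer is green.",
)
for name, prompt in build_prompts(item).items():
    print(f"--- {name} ---\n{prompt}\n")
```
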
Merits

Comprehensive Analysis

The article provides a thorough examination of the vulnerabilities and trade-offs in RL-finetuned VLMs, offering valuable insights into the limitations of current models.

Empirical Evidence

The study presents empirical evidence of the impact of textual perturbations on model robustness and confidence, strengthening the credibility of its findings.

Practical Recommendations

The research offers practical recommendations for improving the robustness and faithfulness of VLMs, which can guide future model development and evaluation.

Demerits

Limited Scope

The study focuses primarily on open-source multimodal reasoning models, which may limit the generalizability of the findings to other types of models or applications.

Complexity of Implementation

The proposed solutions, such as incorporating a faithfulness-aware reward, may be complex to implement and could require significant computational resources.
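
To make the implementation burden concrete, here is a minimal sketch of what a faithfulness-aware reward could look like: a weighted combination of answer correctness and a check that the CoT actually supports the answer given. The paper does not specify this formulation; the weighting scheme and the `answer_from_cot` extractor are assumptions for illustration.

```python
# Illustrative sketch: a faithfulness-aware reward as a weighted combination of
# answer correctness and answer-CoT consistency. The weighting and the
# `answer_from_cot` extractor are assumptions; the paper does not specify them.
from typing import Callable

def faithfulness_aware_reward(
    predicted_answer: str,
    gold_answer: str,
    cot_trace: str,
    answer_from_cot: Callable[[str], str],  # maps a CoT to the answer it argues for
    weight: float = 0.5,                    # correctness vs. faithfulness trade-off
) -> float:
    correctness = 1.0 if predicted_answer == gold_answer else 0.0
    # Faithfulness: does the chain of thought support the answer actually given?
    faithfulness = 1.0 if answer_from_cot(cot_trace) == predicted_answer else 0.0
    return (1.0 - weight) * correctness + weight * faithfulness

# Toy extractor: the last answer option mentioned in the trace.
def naive_answer_from_cot(cot: str) -> str:
    options = {"red", "green", "yellow", "off"}
    hits = [w.strip(".,") for w in cot.lower().split() if w.strip(".,") in options]
    return hits[-1] if hits else ""

print(faithfulness_aware_reward(
    "red", "red", "The top lamp is lit, so the light is red.", naive_answer_from_cot,
))  # 1.0: the answer is correct and the CoT supports it
```

In practice the extractor would itself be a model or a structured parser, which is exactly where the computational cost the section notes comes from.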

Potential for Overfitting

The use of adversarial augmentation and faithfulness-aware rewards may risk overfitting to specific types of perturbations, potentially limiting the model's performance on diverse tasks.
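
One plausible mitigation, sketched below under stated assumptions, is to sample augmentations from several perturbation families rather than training against a single one; the family functions and sampling probability here are illustrative, not the paper's recipe.

```python
# Illustrative sketch: sampling from several perturbation families so training
# does not overfit to one. Family functions and probabilities are assumptions.
import random

def swap_caption(ex: dict) -> dict:
    return {**ex, "caption": ex["distractor_caption"]}

def inject_wrong_cot(ex: dict) -> dict:
    return {**ex, "cot": ex["incorrect_cot"]}

def shuffle_options(ex: dict) -> dict:
    opts = ex["options"][:]
    random.shuffle(opts)
    return {**ex, "options": opts}

FAMILIES = [swap_caption, inject_wrong_cot, shuffle_options]

def augment(batch: list[dict], p_perturb: float = 0.5) -> list[dict]:
    """Leave each example clean with probability 1 - p_perturb; otherwise
    apply one perturbation family chosen uniformly at random."""
    return [
        random.choice(FAMILIES)(ex) if random.random() < p_perturb else ex
        for ex in batch
    ]

batch = [{
    "caption": "A red light.", "distractor_caption": "A green light.",
    "cot": "", "incorrect_cot": "Top lamps are green.",
    "options": ["red", "green", "yellow", "off"],
}]
print(augment(batch))
```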

Expert Commentary

The article 'On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs' presents a rigorous and well-reasoned analysis of the vulnerabilities and trade-offs in RL-finetuned vision-language models. The study's comprehensive examination of the impact of textual perturbations on model robustness and confidence is particularly noteworthy, as it highlights the limitations of accuracy-only evaluations. The uncovering of an accuracy-faithfulness trade-off is a significant contribution to the field, as it underscores the importance of considering multiple dimensions of model performance. The practical recommendations offered by the study, such as incorporating a faithfulness-aware reward, provide valuable guidance for future model development and evaluation. However, the study's focus on open-source multimodal reasoning models may limit the generalizability of the findings. Additionally, the complexity of implementing the proposed solutions and the potential for overfitting are important considerations that should be addressed in future research. Overall, the article makes a substantial contribution to the understanding of RL-finetuned VLMs and offers valuable insights for the development of more robust and reliable AI systems.

Recommendations

  • Future research should explore the generalizability of the findings to a broader range of models and applications, ensuring that the insights are applicable across different domains.
  • Developers should adopt a holistic approach to model evaluation, considering not only accuracy but also robustness, faithfulness, and other relevant dimensions of performance.
  • Policymakers should promote the ethical and responsible development of AI models, ensuring that they are robust, reliable, and aligned with human values through appropriate guidelines and regulatory frameworks.

Sources

  • arXiv:2602.12506v1, 'On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs' (https://arxiv.org/abs/2602.12506)