MSA-Thinker: Discrimination-Calibration Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis

arXiv:2604.00013v1 Announce Type: cross Abstract: Multimodal sentiment analysis aims to understand human emotions by integrating textual, auditory, and visual modalities. Although Multimodal Large Language Models (MLLMs) have achieved state-of-the-art performance via supervised fine-tuning (SFT), their end-to-end "black-box" nature limits interpretability. Existing methods incorporating Chain-of-Thought (CoT) reasoning are hindered by high annotation costs, while Reinforcement Learning (RL) faces challenges such as low exploration efficiency and sparse rewards, particularly on hard samples. To address these issues, we propose a novel training framework that integrates structured Discrimination-Calibration (DC) reasoning with Hint-based Reinforcement Learning. First, we perform cold-start SFT using high-quality CoT data synthesized by a teacher model (Qwen3Omni-30B), which inherently contains the DC structure. This equips the model with a reasoning paradigm that performs macro discrimination followed by fine-grained calibration from the initial stage. Building on this, we propose Hint-GRPO, which leverages the discrimination phase within the DC structure as a verifiable anchor during RL to provide directional hints for hard samples, guiding policy optimization and effectively mitigating the reward sparsity problem. Experiments on the Qwen2.5Omni-7B model demonstrate that our method not only achieves higher accuracy in fine-grained sentiment regression tasks but also generates high-quality structured reasoning chains. Crucially, it exhibits superior generalization capability in cross-domain evaluations. This enhances model interpretability while validating the positive contribution of explicit reasoning steps to model robustness, offering a new paradigm for building trustworthy and efficient sentiment analysis systems.

Executive Summary

The article introduces MSA-Thinker, a framework that integrates Discrimination-Calibration (DC) reasoning with Hint-Guided Reinforcement Learning to improve both the interpretability and the performance of multimodal sentiment analysis. By cold-start fine-tuning on high-quality CoT data synthesized by a teacher model (Qwen3Omni-30B), the authors equip the model from the outset with a structured reasoning paradigm: macro-level discrimination followed by fine-grained calibration. Hint-GRPO builds on this by using the discrimination phase of the DC structure as a verifiable anchor during RL, providing directional hints on hard samples that mitigate reward sparsity and guide policy optimization. Experiments on the Qwen2.5Omni-7B model show higher accuracy on fine-grained sentiment regression, higher-quality structured reasoning chains, and superior cross-domain generalization. These contributions address key limitations of black-box MLLMs and move the field toward more interpretable and robust sentiment analysis systems.
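To make the "verifiable anchor" idea concrete, here is a minimal sketch of what a DC-structured response and its verification might look like. The tag names (`<discrimination>`, `<calibration>`), the three-way polarity vocabulary, and the score range are illustrative assumptions, not details from the paper; the point is only that a coarse polarity judgment can be checked mechanically against the sign of the gold regression score.

```python
import re

# Hypothetical DC-structured model output (tag names are illustrative):
# the model first emits a coarse polarity (discrimination), then a
# fine-grained sentiment score (calibration).
SAMPLE_RESPONSE = (
    "<discrimination>positive</discrimination>"
    "<calibration>1.8</calibration>"
)

def parse_dc(response: str):
    """Extract the discrimination label and the calibration score."""
    disc = re.search(r"<discrimination>(.*?)</discrimination>", response)
    calib = re.search(r"<calibration>(-?\d+(?:\.\d+)?)</calibration>", response)
    if not disc or not calib:
        return None
    return disc.group(1).strip(), float(calib.group(1))

def discrimination_is_correct(label: str, gold_score: float) -> bool:
    """The discrimination phase is verifiable: a coarse polarity can be
    checked directly against the sign of the gold sentiment score."""
    sign = {"negative": -1, "neutral": 0, "positive": 1}.get(label)
    if sign is None:
        return False
    gold_sign = (gold_score > 0) - (gold_score < 0)
    return sign == gold_sign

label, score = parse_dc(SAMPLE_RESPONSE)
print(label, score, discrimination_is_correct(label, gold_score=2.0))
# → positive 1.8 True
```

Because the discrimination check is a cheap, deterministic predicate, it can serve as an intermediate reward signal even when the fine-grained calibration misses the gold score.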

Key Points

  • Introduction of DC reasoning integrated with RL for multimodal sentiment analysis
  • Use of teacher-model-synthesized CoT data for cold-start SFT to embed structured reasoning
  • Development of Hint-GRPO to leverage DC structure as anchor in RL for improved hard-sample handling
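The cold-start step above relies on filtering the teacher's synthesized traces down to high-quality ones. The paper does not specify its filtering criterion, but a common recipe is rejection sampling: keep only traces whose final answer agrees with the gold label. A minimal sketch under that assumption:

```python
# Hypothetical cold-start data filter (the paper's exact criterion is
# not specified): keep only teacher-generated CoT traces whose final
# predicted score lands within a tolerance of the gold score.
def filter_cot(samples, tol=0.5):
    """samples: list of (cot_text, predicted_score, gold_score) triples.
    Returns (cot_text, gold_score) pairs suitable for SFT."""
    return [
        (cot, gold) for cot, pred, gold in samples
        if abs(pred - gold) <= tol
    ]

raw = [
    ("disc: positive; calib: 1.9", 1.9, 2.0),    # kept: close to gold
    ("disc: negative; calib: -1.0", -1.0, 2.0),  # dropped: wrong answer
]
print(filter_cot(raw))
# → [('disc: positive; calib: 1.9', 2.0)]
```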

Merits

Enhanced Interpretability

The framework explicitly structures reasoning steps, making model behavior more transparent and accountable.

Improved Performance

Experimental validation shows higher accuracy and better generalization across domains.

Addressing Reward Sparsity

Hint-GRPO mitigates the sparse-reward problem in RL by using the verifiable discrimination phase of the DC structure as an anchor for directional hints on hard samples.
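The mechanism can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the `[HINT: ...]` prompt suffix, the group size, the sparse 0/1 reward, and the stub policy are all assumptions. The core idea shown is that when every rollout in a GRPO group earns zero reward (a hard sample), re-sampling with a hint derived from the verifiable discrimination anchor restores a usable learning signal.

```python
def reward(pred: float, gold: float, tol: float = 1.0) -> float:
    """Sparse outcome reward: 1 only when the predicted sentiment score
    is within a tolerance of the gold score, else 0."""
    return 1.0 if abs(pred - gold) <= tol else 0.0

def hint_grpo_group(rollout_fn, prompt: str, gold: float, group_size: int = 4):
    """One GRPO-style rollout group with a hint fallback (illustrative).
    If every rollout earns zero reward -- a hard sample -- re-sample with
    a directional hint derived from the discrimination anchor so the
    group regains a non-zero signal for policy optimization."""
    rewards = [reward(rollout_fn(prompt), gold) for _ in range(group_size)]
    if max(rewards) > 0.0:
        return rewards, False  # normal sample: no hint needed
    polarity = "positive" if gold > 0 else "negative"
    hinted_prompt = f"{prompt} [HINT: overall sentiment is {polarity}]"
    rewards = [reward(rollout_fn(hinted_prompt), gold) for _ in range(group_size)]
    return rewards, True

def stub_rollout(prompt: str) -> float:
    # Deterministic stand-in for the policy model: predicts the wrong
    # polarity until the prompt carries a hint.
    return 2.0 if "[HINT" in prompt else -2.0

print(hint_grpo_group(stub_rollout, "clip_017", gold=2.0))
# → ([1.0, 1.0, 1.0, 1.0], True)
```

In GRPO, advantages are computed relative to the group mean, so an all-zero group contributes no gradient; the hint branch is exactly what rescues such groups.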

Demerits

Complexity of Implementation

Integrating DC reasoning with RL adds computational and design overhead, potentially complicating deployment.

Dependence on Quality of Teacher Model

Performance hinges on the quality and applicability of the teacher model’s CoT data; suboptimal synthesis may limit effectiveness.

Expert Commentary

MSA-Thinker represents a significant conceptual leap in bridging the gap between performance and interpretability in multimodal sentiment analysis. The integration of structured reasoning via DC within the training pipeline marks a departure from ad-hoc CoT approaches that suffer from cost and consistency issues. By anchoring RL on verifiable DC structures, the authors effectively convert a probabilistic, opaque decision process into a semi-deterministic reasoning pathway, which is both theoretically sound and empirically validated. Notably, the cross-domain generalization results suggest that the reasoning architecture is not merely task-specific but captures broader cognitive patterns applicable across modalities and domains—a critical insight for scalable AI deployment. Furthermore, the work implicitly challenges the conventional wisdom that black-box models are inherently incompatible with interpretability; MSA-Thinker demonstrates that with intentional design, even end-to-end models can be structured to reveal internal logic. This sets a precedent for future research in multimodal AI, particularly as regulatory and ethical demands for transparency intensify.

Recommendations

  • Extend MSA-Thinker to other modalities beyond sentiment analysis, such as multimodal reasoning in legal documents or scientific papers.
  • Develop benchmarks specifically evaluating the generalization of reasoning chains across diverse domains to validate scalability.

Sources

Original: arXiv - cs.AI