Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling

arXiv:2604.05445v1 Announce Type: new Abstract: Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque "black boxes." To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism to identify relevant dimensions and adaptively weight them (e.g., Hallucination, Reasoning) for each specific input. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks like VL-RewardBench. Furthermore, we show that VL-MDR-constructed preference pairs effectively enable DPO alignment to mitigate visual hallucinations and improve reliability, providing a scalable solution for VLM alignment.

Executive Summary

The paper introduces VL-MDR, a vision-language reward modeling framework that addresses the interpretability-efficiency trade-off in evaluating and aligning vision-language models (VLMs). By decomposing evaluation into 21 fine-grained, interpretable dimensions (e.g., hallucination, reasoning) and dynamically weighting them via a visual-aware gating mechanism, VL-MDR provides granular insights into model performance while maintaining computational efficiency. The framework is supported by a large-scale dataset of 321k annotated preference pairs and demonstrates superior performance on benchmarks like VL-RewardBench. Additionally, VL-MDR’s outputs enable effective Direct Preference Optimization (DPO) alignment, reducing visual hallucinations and enhancing VLM reliability. This work offers a scalable solution for interpretable and efficient reward modeling in VLMs.

Key Points

  • Introduces VL-MDR, a dynamic, multi-dimensional reward modeling framework for VLMs that decomposes evaluation into interpretable dimensions (e.g., hallucination, reasoning) rather than relying on monolithic scalar rewards.
  • Employs a visual-aware gating mechanism to select relevant dimensions and adaptively weight them for each specific input, balancing interpretability and computational efficiency.
  • Demonstrates superior performance on benchmarks (VL-RewardBench) and enables effective DPO alignment to mitigate visual hallucinations, improving VLM reliability.
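The core idea of the gating mechanism can be illustrated with a minimal sketch. The paper does not publish its architecture in this abstract, so the shapes, names, and linear gate below are assumptions: per-dimension quality scores are aggregated into a scalar reward via weights computed from visual features, and those weights are what make the score interpretable.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def gated_reward(visual_features, dim_scores, gate_matrix):
    """Hypothetical visual-aware gated aggregation.

    visual_features: (d,) pooled image embedding (assumed).
    dim_scores: (k,) per-dimension quality scores, one per
        interpretable dimension (e.g. hallucination, reasoning).
    gate_matrix: (k, d) learned gating weights (assumed linear gate).
    Returns the scalar reward plus the per-dimension weights,
    which expose *why* the reward took its value.
    """
    gate = softmax(gate_matrix @ visual_features)  # (k,) weights summing to 1
    reward = float(gate @ dim_scores)              # convex combination of scores
    return reward, gate

rng = np.random.default_rng(0)
visual = rng.normal(size=8)
scores = np.array([0.9, 0.2, 0.7])   # e.g. hallucination, reasoning, fluency
W = rng.normal(size=(3, 8))
reward, gate = gated_reward(visual, scores, W)
```

Because the gate is a softmax, the reward is a convex combination of the per-dimension scores, and inspecting `gate` shows which dimensions dominated for a given input.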

Merits

Interpretability-Through-Dynamic-Decomposition

The framework’s decomposition of evaluation into fine-grained, human-interpretable dimensions (e.g., hallucination, reasoning) addresses the black-box nature of traditional discriminative reward models, enhancing transparency and trustworthiness in VLM evaluation.

Scalable and Efficient Architecture

VL-MDR leverages a visual-aware gating mechanism to dynamically select and weight dimensions, achieving a balance between interpretability and computational efficiency—unlike generative models that are slow or discriminative models that are opaque.

Empirical Robustness and Generalizability

The model’s performance on VL-RewardBench and its ability to generate preference pairs for DPO alignment highlight its practical utility in real-world VLM alignment tasks, reducing hallucinations and improving reliability at scale.
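For context on how such preference pairs are consumed downstream, the per-pair DPO objective can be sketched as follows; the numeric log-likelihoods are illustrative only, and `beta` is the standard margin-scaling hyperparameter.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair Direct Preference Optimization loss.

    logp_*: policy log-likelihoods of the chosen/rejected responses;
    ref_*: the same quantities under the frozen reference model.
    The loss is -log sigmoid of the beta-scaled implicit reward margin.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative values: the policy prefers the chosen response
# relative to the reference, so the loss falls below log(2).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

Pairs constructed by a reward model like VL-MDR supply the chosen/rejected labels; when the policy and reference agree (zero margin) the loss sits at log 2 and decreases as the policy's preference for the chosen response grows.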

Demerits

Dependence on High-Quality Annotations

The framework relies heavily on the quality and granularity of the 321k annotated preference pairs, which may introduce biases or inconsistencies if the annotations are noisy or subjective, potentially limiting generalizability.

Computational Overhead of Dynamic Gating

While VL-MDR improves interpretability, the adaptive gating mechanism may introduce additional computational overhead compared to static discriminative models, particularly in inference-time scenarios with high-dimensional inputs.

Limited Applicability to Non-Vision Tasks

The framework is tailored specifically for vision-language tasks, and its extensibility to other modalities (e.g., pure text or audio) or domains (e.g., robotics) remains untested, limiting its broader applicability.

Expert Commentary

The authors present a compelling solution to the longstanding dilemma of balancing interpretability and efficiency in vision-language reward modeling. VL-MDR’s dynamic decomposition of evaluation into interpretable dimensions, coupled with its visual-aware gating mechanism, represents a significant advancement over traditional monolithic reward models. The empirical results, particularly the performance on VL-RewardBench and the effective mitigation of visual hallucinations via DPO alignment, underscore the framework’s practical utility. However, the reliance on high-quality annotations and the computational overhead of dynamic gating pose challenges that warrant further investigation. Additionally, while the framework’s focus on VLMs is justified, exploring its extensibility to other modalities could broaden its impact. Overall, VL-MDR is a timely and impactful contribution to the field, with implications for both AI alignment research and real-world deployment of VLMs.

Recommendations

  • Expand the framework to include additional modalities (e.g., text-only or audio-visual tasks) to assess its generalizability beyond vision-language applications.
  • Develop standardized annotation protocols for the 21 fine-grained dimensions to ensure consistency and reduce biases in the preference pair dataset.
  • Investigate the computational efficiency of the visual-aware gating mechanism to optimize inference-time performance, particularly for large-scale deployments.
  • Explore hybrid approaches that combine VL-MDR’s interpretability with the scalability of traditional discriminative models to further enhance practical deployment.

Sources

Original: arXiv - cs.CL