Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT
arXiv:2603.09715v1 Announce Type: new Abstract: Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample's true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model's assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require vision-language joint reasoning while filtering semantic-conflict noise. Experiments on Vision-Flan and The Cauldron show that CVS achieves solid performance across datasets. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, respectively, and remains robust on the highly heterogeneous Cauldron dataset. Moreover, CVS reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS.
Executive Summary
This article presents CVS, a training-free data selection method for improving vision-language large models (VLLMs). CVS uses a frozen VLLM as an evaluator to identify samples that require vision-language joint reasoning, removing the need for costly proxy-model training. The method measures the discrepancy in answer validity with and without conditioning on the question, which also filters out semantic-conflict noise. Experiments on Vision-Flan and The Cauldron demonstrate both effectiveness and efficiency: on Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, and it reduces computational cost by 17.3% and 44.4% relative to COINCIDE and XMAS. This research addresses a significant challenge in multimodal learning by providing a more targeted and efficient approach to data selection.
Key Points
- ▸ CVS is a training-free data selection method that leverages a frozen VLLM as an evaluator.
- ▸ CVS measures the discrepancy in answer validity with and without conditioning on the question to identify relevant samples.
- ▸ Experiments demonstrate CVS's performance and computational efficiency on Vision-Flan and The Cauldron datasets.
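The scoring idea behind CVS can be made concrete with a minimal sketch. This is not the authors' implementation: the log-likelihood values below are toy stand-ins for what a frozen VLLM would assign to an answer conditioned on the image alone versus on the image plus the question, and the function names (`cvs_score`, `select_top_fraction`) are hypothetical. Samples where the question substantially changes the assessed answer validity score highest and are retained.

```python
# Hedged sketch of CVS-style selection. Assumption: a frozen VLLM supplies,
# per sample, two log-likelihoods of the answer -- one conditioned on the
# image alone ("logp_img"), one on image + question ("logp_img_q").

def cvs_score(logp_img: float, logp_img_q: float) -> float:
    """Discrepancy in answer validity with vs. without the question.

    A large positive value means the question genuinely changes the
    model's assessment, suggesting cross-modal reasoning is required.
    """
    return logp_img_q - logp_img

def select_top_fraction(samples: list[dict], fraction: float) -> list[dict]:
    """Keep the top `fraction` of samples ranked by CVS score."""
    ranked = sorted(
        samples,
        key=lambda s: cvs_score(s["logp_img"], s["logp_img_q"]),
        reverse=True,
    )
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]

# Toy log-likelihoods standing in for frozen-VLLM outputs.
pool = [
    {"id": "a", "logp_img": -5.0, "logp_img_q": -1.0},  # question matters a lot
    {"id": "b", "logp_img": -1.2, "logp_img_q": -1.1},  # shortcut-solvable
    {"id": "c", "logp_img": -4.0, "logp_img_q": -2.5},
    {"id": "d", "logp_img": -0.9, "logp_img_q": -3.0},  # possible semantic conflict
]
chosen = select_top_fraction(pool, 0.5)
print([s["id"] for s in chosen])  # → ['a', 'c']
```

Note that a sample like "d", where conditioning on the question *lowers* answer validity, would rank last; this is how a discrepancy score of this shape can double as a filter for semantic-conflict noise.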
Merits
Improved Efficiency
CVS reduces computational cost by avoiding proxy-model training altogether, requiring only forward-pass evaluation with a frozen VLLM, which makes data selection and downstream model training more efficient.
Enhanced Accuracy
CVS effectively identifies samples that require vision-language joint reasoning, improving the accuracy of VLLMs in multimodal learning.
Demerits
Limited Generalizability
The method's performance may be dataset-specific, and further research is needed to evaluate its generalizability to other vision-language datasets.
Overreliance on Frozen VLLM
CVS's reliance on a frozen VLLM may limit its adaptability to different model architectures and training environments.
Expert Commentary
This article offers a compelling answer to a persistent problem in multimodal learning: much instruction-tuning data can be solved through linguistic shortcuts rather than genuine cross-modal reasoning. By framing sample value as the question-induced shift in answer validity, CVS targets exactly the samples that exercise vision-language joint reasoning, and the reported gains on Vision-Flan together with robustness on the heterogeneous Cauldron dataset support this framing. That said, the evidence is limited to two benchmarks and a single evaluator setup, so the method's generalizability to other vision-language datasets, model architectures, and training environments remains an open question.
Recommendations
- ✓ Future research should investigate the method's generalizability to other vision-language datasets and evaluate its performance in more diverse and complex scenarios.
- ✓ CVS should be further refined to ensure its adaptability to different model architectures and training environments, including evaluators other than a single frozen VLLM.