DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

arXiv:2602.16742v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has been shown effective in enhancing the visual reflection and reasoning capabilities of Large Multimodal Models (LMMs). However, existing datasets are predominantly derived from either small-scale manual construction or recombination of prior resources, which limits data diversity and coverage, thereby constraining further gains in model performance. To this end, we introduce DeepVision-103K, a comprehensive dataset for RLVR training that covers diverse K12 mathematical topics, extensive knowledge points, and rich visual elements. Models trained on DeepVision achieve strong performance on multimodal mathematical benchmarks, and generalize effectively to general multimodal reasoning tasks. Further analysis reveals enhanced visual perception, reflection and reasoning capabilities in trained models, validating DeepVision's effectiveness for advancing multimodal reasoning. Data: https://huggingface.co/datasets/skylenage/DeepVision-103K

Executive Summary

This article introduces DeepVision-103K, a dataset designed for Reinforcement Learning with Verifiable Rewards (RLVR) training of Large Multimodal Models (LMMs). The dataset covers diverse K12 mathematical topics, extensive knowledge points, and rich visual elements. According to the abstract, models trained on it achieve strong performance on multimodal mathematical benchmarks and generalize to general multimodal reasoning tasks, and further analysis shows enhanced visual perception, reflection, and reasoning capabilities in those models. The dataset is publicly available on Hugging Face.

Key Points

  • DeepVision-103K is a comprehensive dataset designed for RLVR to enhance multimodal reasoning capabilities of LMMs.
  • The dataset covers diverse K12 mathematical topics, extensive knowledge points, and rich visual elements.
  • Models trained on DeepVision achieve strong performance on multimodal mathematical benchmarks and generalize to general multimodal reasoning tasks.

Merits

Comprehensive Coverage

DeepVision-103K provides extensive coverage of K12 mathematical topics, ensuring that LMMs are trained on a broad range of knowledge points.

Visual Diversity

The dataset includes rich visual elements, enhancing the visual perception and reflection capabilities of trained models.

Verifiable Rewards

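The dataset is built for RLVR, in which a model's answer is checked against a known correct answer to produce the training reward. Verifiable answers make the reward signal more accurate and reliable for training LMMs.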
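
To make the idea of a verifiable reward concrete, here is a minimal sketch of a binary answer-matching reward. It is not the reward function used by the DeepVision authors, which the abstract does not describe. The \boxed{...} answer convention and the function names are assumptions chosen for illustration.

```python
import re
from fractions import Fraction
from typing import Optional

# Assumption: the model marks its final answer as \boxed{...}.
# This simple pattern does not handle nested braces such as \boxed{\frac{3}{4}}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(response: str) -> Optional[str]:
    # Use the last boxed expression, so a self-corrected answer wins.
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def to_number(text: str) -> Optional[Fraction]:
    # Parse integers, decimals, and simple fractions such as "3/4".
    try:
        return Fraction(text.replace(" ", ""))
    except (ValueError, ZeroDivisionError):
        return None


def verifiable_reward(response: str, reference: str) -> float:
    # Binary reward: 1.0 if the extracted answer matches the reference, else 0.0.
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    a, b = to_number(answer), to_number(reference)
    if a is not None and b is not None:
        return 1.0 if a == b else 0.0
    # Non-numeric answers (e.g. multiple-choice letters): compare as text.
    return 1.0 if answer.lower() == reference.strip().lower() else 0.0
```

Comparing numeric answers as exact fractions means that 0.75 and 3/4 count as the same answer. A production verifier would likely need symbolic comparison and unit handling as well.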

Demerits

Data Quality Control

The abstract does not describe how data quality was controlled. Such measures are crucial for ensuring the reliability and accuracy of the dataset.

Scalability

The abstract does not discuss how the dataset or its construction process scales, which may be a concern when training LMMs at larger scale.

Expert Commentary

The introduction of DeepVision-103K is a notable development for multimodal reasoning research. The dataset's broad topic coverage and visual diversity make it a strong candidate resource for RLVR training of LMMs. However, the gaps noted above, namely the lack of detail on data quality control and scalability, need to be addressed in future research. The dataset may also have implications for education and artificial intelligence, where its use could inform policy decisions and interventions. As such, DeepVision-103K is a valuable resource for researchers and practitioners in the field.

Recommendations

  • Future research should focus on addressing the limitations of the dataset, including data quality control and scalability.
  • The dataset should be used to inform policy decisions and interventions in education and artificial intelligence.
