MMKG-RDS: Reasoning Data Synthesis via Deep Mining of Multimodal Knowledge Graphs
arXiv:2602.23632v1
Abstract: Synthesizing high-quality training data is crucial for enhancing domain models' reasoning abilities. Existing methods face limitations in long-tail knowledge coverage, effectiveness verification, and interpretability. Knowledge-graph-based approaches still fall short in functionality, granularity, customizability, and evaluation. To address these issues, we propose MMKG-RDS, a flexible framework for reasoning data synthesis that leverages multimodal knowledge graphs. It supports fine-grained knowledge extraction, customizable path sampling, and multidimensional data quality scoring. We validate MMKG-RDS with the MMKG-RDS-Bench dataset, covering five domains, 17 task types, and 14,950 samples. Experimental results show fine-tuning Qwen3 models (0.6B/8B/32B) on a small number of synthesized samples improves reasoning accuracy by 9.2%. The framework also generates distinct data, challenging existing models on tasks involving tables and formulas, useful for complex benchmark construction. The dataset and code are available at https://github.com/360AILAB-NLP/MMKG-RDS
Executive Summary
The paper proposes MMKG-RDS, a framework for synthesizing reasoning training data from multimodal knowledge graphs. It targets the limitations of existing methods in long-tail knowledge coverage, effectiveness verification, and interpretability by supporting fine-grained knowledge extraction, customizable path sampling, and multidimensional data quality scoring. The authors validate it on MMKG-RDS-Bench (five domains, 17 task types, 14,950 samples) and report a 9.2% gain in reasoning accuracy after fine-tuning Qwen3 models (0.6B/8B/32B) on a small number of synthesized samples. They also note that the generated data challenges existing models on tasks involving tables and formulas. Scalability and generalization beyond the five benchmark domains remain open questions.
Key Points
- ▸ MMKG-RDS leverages multimodal knowledge graphs for reasoning data synthesis
- ▸ The framework supports fine-grained knowledge extraction and customizable path sampling
- ▸ Synthesized data is scored along multiple quality dimensions rather than by a single score
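To make the idea of multidimensional quality scoring concrete, here is a minimal sketch of per-dimension filtering. The abstract does not name the dimensions MMKG-RDS scores, so `faithfulness`, `difficulty`, and `clarity`, the threshold values, and all function names below are assumptions chosen for illustration, not the framework's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityScores:
    # Hypothetical dimensions; the source does not say which ones it uses.
    faithfulness: float
    difficulty: float
    clarity: float


def failed_dimensions(scores, thresholds):
    """Return the names of dimensions whose score is below its threshold."""
    return [dim for dim, t in thresholds.items() if getattr(scores, dim) < t]


def filter_samples(samples, thresholds):
    """Keep a sample only if every scored dimension meets its threshold."""
    return [text for text, scores in samples
            if not failed_dimensions(scores, thresholds)]


# Illustrative data: (sample id, scores). Values are made up.
samples = [
    ("q1", QualityScores(faithfulness=0.9, difficulty=0.7, clarity=0.8)),
    ("q2", QualityScores(faithfulness=0.9, difficulty=0.2, clarity=0.9)),
]
thresholds = {"faithfulness": 0.8, "difficulty": 0.5, "clarity": 0.6}

kept = filter_samples(samples, thresholds)
```

Keeping one score per dimension, instead of a single aggregate, means a rejected sample can be traced to the specific dimension it failed, which is one plausible way such scoring supports verification.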
Merits
Improves reasoning accuracy
Fine-tuning Qwen3 models (0.6B/8B/32B) on a small number of synthesized samples improves reasoning accuracy by 9.2%
Useful for benchmark construction
The framework generates distinct data that challenges existing models on tasks involving tables and formulas, which makes it suitable for building complex benchmarks
Supports interpretability and customizability
The framework's fine-grained knowledge extraction and path sampling facilitate interpretability and customizability
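As an illustration of what customizable path sampling over a multimodal knowledge graph could look like, the sketch below runs a constrained random walk on a toy graph. The graph, the node and relation names, the modality labels, and both functions are hypothetical; the source does not describe the actual sampling algorithm, so this is a stand-in technique, not the authors' method.

```python
import random

# Toy graph: node -> list of (relation, neighbor). All names are illustrative.
GRAPH = {
    "paper": [("has_table", "table_1"), ("mentions", "entity_a")],
    "table_1": [("has_cell", "cell_3_2")],
    "entity_a": [("defined_by", "formula_1")],
    "formula_1": [],
    "cell_3_2": [],
}

# Modality of each node (assumed labels).
MODALITY = {
    "paper": "text",
    "table_1": "table",
    "cell_3_2": "table",
    "entity_a": "text",
    "formula_1": "formula",
}


def sample_path(graph, start, hops, rng):
    """Random walk of at most `hops` edges that never revisits a node."""
    nodes, relations = [start], []
    node = start
    for _ in range(hops):
        options = [(r, n) for r, n in graph.get(node, []) if n not in nodes]
        if not options:
            break
        rel, node = rng.choice(options)
        relations.append(rel)
        nodes.append(node)
    return nodes, relations


def sample_with_modality(graph, modality, start, hops, required, rng, tries=100):
    """Resample until a path touches a node of the required modality."""
    for _ in range(tries):
        nodes, relations = sample_path(graph, start, hops, rng)
        if any(modality[n] == required for n in nodes):
            return nodes, relations
    return None  # constraint could not be met within `tries` attempts


rng = random.Random(0)
formula_path = sample_with_modality(GRAPH, MODALITY, "paper", 2, "formula", rng)
```

Because each synthesized item can carry the node and relation sequence it was built from, the sampled path doubles as an explanation of where the item came from. The hop count and required modality are the customization knobs in this sketch.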
Demerits
Scalability not yet established
The framework's behavior on larger datasets and on domains beyond the five covered by MMKG-RDS-Bench remains to be explored
Dependence on multimodal knowledge graphs
The effectiveness of MMKG-RDS relies on the availability and quality of multimodal knowledge graphs
Expert Commentary
The paper offers a coherent answer to a practical problem: producing reasoning training data whose coverage and quality can be checked. Its main strengths are the combination of fine-grained extraction, configurable path sampling, and multidimensional scoring in one framework, together with a benchmark of 14,950 samples that supports the reported 9.2% accuracy gain. The open questions are scalability and generalization beyond the five benchmark domains, along with the framework's reliance on the availability and quality of multimodal knowledge graphs. With those caveats, the work is a useful contribution to research on data synthesis and multimodal learning and merits follow-up.
Recommendations
- ✓ Future research should focus on improving the scalability and generalizability of MMKG-RDS
- ✓ Further investigation into the dependence of MMKG-RDS on multimodal knowledge graphs is necessary to ensure its effectiveness in diverse domains