
MMKG-RDS: Reasoning Data Synthesis via Deep Mining of Multimodal Knowledge Graphs

Lun Zhan, Feng Xiong, Huanyong Liu, Feng Zhang, Yuhui Yin

arXiv:2602.23632v1

Abstract: Synthesizing high-quality training data is crucial for enhancing domain models' reasoning abilities. Existing methods face limitations in long-tail knowledge coverage, effectiveness verification, and interpretability. Knowledge-graph-based approaches still fall short in functionality, granularity, customizability, and evaluation. To address these issues, we propose MMKG-RDS, a flexible framework for reasoning data synthesis that leverages multimodal knowledge graphs. It supports fine-grained knowledge extraction, customizable path sampling, and multidimensional data quality scoring. We validate MMKG-RDS with the MMKG-RDS-Bench dataset, covering five domains, 17 task types, and 14,950 samples. Experimental results show fine-tuning Qwen3 models (0.6B/8B/32B) on a small number of synthesized samples improves reasoning accuracy by 9.2%. The framework also generates distinct data, challenging existing models on tasks involving tables and formulas, useful for complex benchmark construction. The dataset and code are available at https://github.com/360AILAB-NLP/MMKG-RDS

Executive Summary

This article proposes MMKG-RDS, a framework for synthesizing reasoning training data from multimodal knowledge graphs. It targets three limitations of existing methods: long-tail knowledge coverage, effectiveness verification, and interpretability. To address them, it supports fine-grained knowledge extraction, customizable path sampling, and multidimensional data quality scoring. The authors validate the framework on MMKG-RDS-Bench (five domains, 17 task types, 14,950 samples) and report a 9.2% improvement in reasoning accuracy after fine-tuning Qwen3 models on a small number of synthesized samples. The synthesized data also challenges existing models on tasks involving tables and formulas. Scalability and generalizability beyond the evaluated domains remain areas for further exploration.

Key Points

  • MMKG-RDS leverages multimodal knowledge graphs for reasoning data synthesis
  • The framework supports fine-grained knowledge extraction and customizable path sampling
  • Multidimensional data quality scoring rates each synthesized sample along several dimensions
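
To make path sampling and multidimensional scoring concrete, here is a minimal Python sketch. It is not the MMKG-RDS implementation: the toy graph, the node and relation names, the functions sample_path, score_path, and keep, and the two scoring dimensions (depth and modality_diversity) are all assumptions made for illustration. The sketch walks a small graph whose nodes carry a modality prefix, optionally restricts which relations may be followed, and then scores the sampled path on each dimension separately.

```python
import random
from typing import Dict, List, Optional, Set, Tuple

# Toy multimodal knowledge graph: node -> list of (relation, neighbor).
# All node names, relations, and modality prefixes are hypothetical.
Graph = Dict[str, List[Tuple[str, str]]]
Triple = Tuple[str, str, str]

GRAPH: Graph = {
    "doc:report": [("contains", "table:revenue"), ("contains", "formula:growth")],
    "table:revenue": [("has_cell", "text:q1_revenue"), ("has_cell", "text:q2_revenue")],
    "formula:growth": [("uses", "text:q1_revenue"), ("uses", "text:q2_revenue")],
    "text:q1_revenue": [],
    "text:q2_revenue": [],
}


def sample_path(
    graph: Graph,
    start: str,
    max_hops: int,
    rng: random.Random,
    allowed_relations: Optional[Set[str]] = None,
) -> List[Triple]:
    """Random walk of at most max_hops edges that never revisits a node.

    allowed_relations is the customization knob in this sketch: when given,
    only edges with those relation labels may be followed.
    """
    path: List[Triple] = []
    node = start
    visited = {start}
    for _ in range(max_hops):
        candidates = [
            (rel, nxt)
            for rel, nxt in graph.get(node, [])
            if nxt not in visited
            and (allowed_relations is None or rel in allowed_relations)
        ]
        if not candidates:
            break
        rel, nxt = rng.choice(candidates)
        path.append((node, rel, nxt))
        visited.add(nxt)
        node = nxt
    return path


def modality(node: str) -> str:
    """Modality is encoded as the prefix before ':' in this toy graph."""
    return node.split(":", 1)[0]


def score_path(path: List[Triple], max_hops: int) -> Dict[str, float]:
    """Score a path on two illustrative dimensions, each in [0, 1]."""
    if not path:
        return {"depth": 0.0, "modality_diversity": 0.0}
    nodes = [path[0][0]] + [tail for _, _, tail in path]
    return {
        "depth": len(path) / max_hops,
        "modality_diversity": len({modality(n) for n in nodes}) / len(nodes),
    }


def keep(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Keep a sample only if it meets the threshold on every listed dimension."""
    return all(scores[dim] >= t for dim, t in thresholds.items())


if __name__ == "__main__":
    rng = random.Random(0)
    path = sample_path(GRAPH, "doc:report", max_hops=2, rng=rng)
    scores = score_path(path, max_hops=2)
    print(path, scores, keep(scores, {"depth": 0.5, "modality_diversity": 0.5}))
```

In a real pipeline each retained path would then be turned into a question-answer pair. Keeping the scores per dimension, instead of collapsing them into one number, lets a user filter on whichever properties matter for a given task.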

Merits

Improves reasoning accuracy

Fine-tuning Qwen3 models (0.6B/8B/32B) on a small number of synthesized samples improves reasoning accuracy by 9.2%

Useful for benchmark construction

The framework generates distinct data that challenges existing models on tasks involving tables and formulas, which makes it useful for building complex benchmarks

Supports interpretability and customizability

The framework's fine-grained knowledge extraction and path sampling facilitate interpretability and customizability

Demerits

Scalability limitations

The benchmark covers five domains; the framework's performance on larger datasets and on domains beyond those remains to be explored

Dependence on multimodal knowledge graphs

The effectiveness of MMKG-RDS relies on the availability and quality of multimodal knowledge graphs

Expert Commentary

The article presents a well-constructed approach to reasoning data synthesis. MMKG-RDS combines fine-grained knowledge extraction, customizable path sampling, and multidimensional quality scoring in one framework. Its scalability and its generalizability to domains beyond the five evaluated remain the main open concerns. Even so, the work contributes to ongoing research in data synthesis and multimodal learning and warrants further investigation and development.

Recommendations

  • Future research should focus on improving the scalability and generalizability of MMKG-RDS
  • Further investigation into the dependence of MMKG-RDS on multimodal knowledge graphs is necessary to ensure its effectiveness in diverse domains
