Automatic Inter-document Multi-hop Scientific QA Generation
arXiv:2603.14257v1 Announce Type: new Abstract: Existing automatic scientific question generation studies mainly focus on single-document factoid QA, overlooking the inter-document reasoning crucial for scientific understanding. We present AIM-SciQA, an automated framework for generating multi-document, multi-hop scientific QA datasets. AIM-SciQA extracts single-hop QAs using large language models (LLMs) with machine reading comprehension and constructs cross-document relations based on embedding-based semantic alignment while selectively leveraging citation information. Applied to 8,211 PubMed Central papers, it produced 411,409 single-hop and 13,672 multi-hop QAs, forming the IM-SciQA dataset. Human and automatic validation confirmed high factual consistency, and experimental results demonstrate that IM-SciQA effectively differentiates reasoning capabilities across retrieval and QA stages, providing a realistic and interpretable benchmark for retrieval-augmented scientific reasoning. We further extend this framework to construct CIM-SciQA, a citation-guided variant achieving comparable performance to the Oracle setting, reinforcing the dataset's validity and generality.
Executive Summary
The article presents AIM-SciQA, an automated framework for generating multi-document, multi-hop scientific QA datasets. The framework uses large language models to extract single-hop QAs and constructs cross-document relations through embedding-based semantic alignment, selectively leveraging citation information. Applied to 8,211 PubMed Central papers, it yields the IM-SciQA dataset of 411,409 single-hop and 13,672 multi-hop QAs, which the authors show effectively differentiates reasoning capabilities across retrieval and QA stages. The framework is further extended to construct CIM-SciQA, a citation-guided variant that achieves performance comparable to the Oracle setting. The study provides a realistic and interpretable benchmark for retrieval-augmented scientific reasoning, filling a gap left by prior work on automatic scientific question generation, which has largely been limited to single-document factoid QA.
Key Points
- ▸ AIM-SciQA is an automated framework for generating multi-document, multi-hop scientific QA datasets.
- ▸ The framework extracts single-hop QAs using large language models and constructs cross-document relations based on semantic alignment and citation information.
- ▸ The IM-SciQA dataset includes 411,409 single-hop and 13,672 multi-hop QAs, validated through human and automatic evaluation.
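The cross-document linking step described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' implementation: it assumes that each single-hop QA comes with an embedding of its answer and of its question, and that two QAs from different documents form a two-hop candidate when the answer of one is semantically close to the question of the other. The function name `link_single_hop_qas`, the cosine-similarity measure, and the threshold value are all illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors (plain Python lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def link_single_hop_qas(answer_embs, question_embs, doc_ids, threshold=0.9):
    """Hypothetical semantic-alignment step: pair QA i with QA j when they
    come from different documents and the embedding of QA i's answer is
    close to QA j's question, yielding a two-hop chain candidate (i -> j)."""
    links = []
    for i, ans in enumerate(answer_embs):
        for j, ques in enumerate(question_embs):
            # Only inter-document pairs qualify as multi-hop candidates.
            if doc_ids[i] != doc_ids[j] and cosine(ans, ques) >= threshold:
                links.append((i, j))
    return links
```

In practice a pipeline like this would use a sentence-embedding model and approximate nearest-neighbor search rather than the quadratic scan shown here; citation information (as in the CIM-SciQA variant) would then filter or prioritize candidate pairs.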
Merits
High factual consistency and benchmark utility
The AIM-SciQA framework achieves high factual consistency and effectively differentiates reasoning capabilities across retrieval and QA stages, providing a realistic and interpretable benchmark for retrieval-augmented scientific reasoning.
Novel approach to multi-document, multi-hop scientific QA generation
The framework's ability to extract single-hop QAs and construct cross-document relations based on semantic alignment and citation information fills a gap in existing automatic scientific question generation studies.
Demerits
Limited generalizability to non-scientific domains
The framework's reliance on citation information and semantic alignment may limit its applicability to non-scientific domains or domains with less structured citation patterns.
Potential bias in dataset construction
The use of large language models and citation information may introduce biases in the dataset, particularly if the models are trained on biased or incomplete data.
Expert Commentary
The AIM-SciQA framework is a significant advance in scientific QA generation, addressing a critical gap in existing studies. Its validated factual consistency is impressive, and its ability to differentiate reasoning capabilities across retrieval and QA stages is a notable achievement. The study's limitations must nonetheless be weighed: potential bias in dataset construction and limited generalizability to non-scientific domains. Its implications for benchmark design and evaluation practice are significant, underscoring the need for more diverse and representative datasets and for attention to the limitations of large language models when they are used to generate evaluation data.
Recommendations
- ✓ Future studies should explore the applicability of the AIM-SciQA framework to non-scientific domains and address the potential bias in dataset construction.
- ✓ The authors should consider extending the framework to incorporate more diverse and representative datasets to improve its generalizability.