SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
arXiv:2602.23286v1 Announce Type: new Abstract: Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.
Executive Summary
The article introduces SPARTA, a scalable framework for generating large-scale Table-Text question answering (QA) benchmarks. SPARTA addresses the limitations of existing benchmarks by automatically creating high-fidelity question-answer pairs that require multi-hop reasoning and advanced analytical operations such as aggregation and grouping. The framework constructs a reference fact database by enriching source tables with atomic facts extracted from the accompanying unstructured passages, then synthesizes nested queries whose nesting depth matches the desired hop count. Two novel techniques, provenance-based refinement and realistic-structure enforcement, ensure that generated queries are executable and verbalize into fluent, human-sounding questions. On SPARTA, state-of-the-art models that score over 70 F1 on HybridQA or over 50 F1 on OTT-QA lose more than 30 F1 points, exposing significant weaknesses in current cross-modal reasoning. The benchmark, construction code, and baseline models are openly available.
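The pipeline the summary describes can be sketched end to end. The schema, table contents, and nesting scheme below are illustrative assumptions, not SPARTA's actual implementation: a source table is paired with a "grounding table" of atomic facts, a nested query is built with one predicate per hop, and the query is kept only if it executes and returns rows (a simplified stand-in for provenance-based refinement).

```python
import sqlite3

# Hypothetical miniature of SPARTA's construction loop. Table names,
# columns, and data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (city TEXT, population INTEGER);
    CREATE TABLE grounding (entity TEXT, attribute TEXT, value TEXT);
    INSERT INTO source VALUES ('Oslo', 700000), ('Bergen', 290000);
    INSERT INTO grounding VALUES
        ('Oslo', 'country', 'Norway'),
        ('Bergen', 'country', 'Norway');
""")

def synthesize(hops: int) -> str:
    """Nest one predicate per hop: hop k filters on hop k-1's result."""
    query = "SELECT entity FROM grounding WHERE attribute = 'country'"
    for _ in range(hops - 1):
        # The real pipeline alternates across linked source/grounding
        # tables; this toy schema only supports one wrapping step.
        query = f"SELECT city FROM source WHERE city IN ({query})"
    return query

def refine(query: str):
    """Keep only syntactically valid queries with non-empty results."""
    try:
        rows = conn.execute(query).fetchall()
    except sqlite3.Error:
        return None
    return rows or None

rows = refine(synthesize(hops=2))
```

In the real framework the rejected queries are rewritten rather than discarded, and the accepted SQL is subsequently verbalized into a natural-language question.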
Key Points
- ▸ SPARTA automates the generation of large-scale Table-Text QA benchmarks with minimal human validation.
- ▸ The framework ensures high-fidelity question-answer pairs through provenance-based refinement and realistic-structure enforcement.
- ▸ State-of-the-art models perform significantly worse on SPARTA benchmarks, highlighting their limitations in cross-modal reasoning.
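The F1 figures quoted above presumably follow the token-overlap convention standard in extractive QA evaluation; the snippet below is the common SQuAD-style metric, not necessarily SPARTA's exact scorer.

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Standard token-level F1 between a predicted and gold answer."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

A 30-point drop on this scale means, roughly, that a third of the answer tokens a model previously recovered are now missed or wrong.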
Merits
Scalability
SPARTA's automated framework reduces annotation effort to roughly a quarter of HybridQA's, making large-scale benchmark creation practical with only lightweight human validation.
High-Fidelity Generation
The use of provenance-based refinement and realistic-structure enforcement ensures that the generated queries are both executable and fluent, enhancing the quality of the benchmark.
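Realistic-structure enforcement confines generation to post-order traversals of the query graph, so an inner (nested) predicate is verbalized before the clause that consumes its result. The node representation and templates below are assumptions for illustration, not SPARTA's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class QueryNode:
    """One clause of a nested query; children are its nested predicates."""
    template: str                      # e.g. "cities in {0}"
    children: list = field(default_factory=list)

def verbalize(node: QueryNode) -> str:
    # Post-order: recurse into nested predicates first, then splice
    # their verbalizations into the parent clause's template.
    inner = [verbalize(child) for child in node.children]
    return node.template.format(*inner)

# A 3-hop query graph, deepest predicate innermost.
q = QueryNode("the largest of {0}",
              [QueryNode("cities in {0}", [QueryNode("Norway")])])
print(verbalize(q))  # -> the largest of cities in Norway
```

Because each parent clause can only reference already-verbalized children, the traversal order itself rules out dangling references and yields questions that read naturally from the inside out.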
Comprehensive Evaluation
SPARTA's benchmarks cover a wide range of complex operations, including aggregations, grouping, and deep multi-hop reasoning, providing a thorough evaluation of model capabilities.
Demerits
Dependence on Source Data
The quality of the generated benchmarks is highly dependent on the quality and completeness of the source tables and unstructured passages, which may introduce biases or limitations.
Model Performance Drop
While the sharp drop in model performance on SPARTA is informative, it also suggests that current models will need substantial retraining or architectural changes before they can serve as meaningful baselines on the benchmark.
Expert Commentary
The introduction of SPARTA represents a significant advancement in the field of Table-Text QA benchmarks. By automating the generation process and ensuring high-fidelity question-answer pairs, SPARTA addresses critical limitations of existing benchmarks. The framework's ability to synthesize complex queries that demand multi-hop reasoning and advanced analytical operations provides a more rigorous evaluation of model capabilities. The substantial drop in performance of state-of-the-art models on SPARTA benchmarks underscores the need for further research and development in cross-modal reasoning. However, the dependence on the quality of source data and the immediate performance drop of current models highlight areas for improvement and further investigation. Overall, SPARTA sets a new standard for benchmark creation and evaluation, paving the way for more sophisticated and reliable AI models.
Recommendations
- ✓ Further research should focus on improving model architectures to better handle the complexities introduced by SPARTA benchmarks.
- ✓ The community should establish guidelines for benchmark creation and evaluation to ensure consistency and reliability in AI research.