TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution

arXiv:2602.13059v1 Abstract: Question answering (QA) over structured tables requires not only accurate answers but also transparency about which cells support them. Existing table QA systems rarely provide fine-grained attribution, so even correct answers often lack verifiable grounding, limiting trust in high-stakes settings. We address this with TraceBack, a modular multi-agent framework for scalable, cell-level attribution in single-table QA. TraceBack prunes tables to relevant rows and columns, decomposes questions into semantically coherent sub-questions, and aligns each answer span with its supporting cells, capturing both explicit and implicit evidence used in intermediate reasoning steps. To enable systematic evaluation, we release CITEBench, a benchmark with phrase-to-cell annotations drawn from ToTTo, FetaQA, and AITQA. We further propose FairScore, a reference-less metric that compares atomic facts derived from predicted cells and answers to estimate attribution precision and recall without human cell labels. Experiments show that TraceBack substantially outperforms strong baselines across datasets and granularities, while FairScore closely tracks human judgments and preserves relative method rankings, supporting interpretable and scalable evaluation of table-based QA.

Executive Summary

The article 'TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution' introduces a novel framework, TraceBack, designed to enhance the transparency and trustworthiness of question-answering (QA) systems over structured tables. By providing fine-grained attribution at the cell level, TraceBack addresses the critical need for verifiable grounding in high-stakes settings. The framework decomposes questions into semantically coherent sub-questions and aligns answers with supporting cells, capturing both explicit and implicit evidence. The authors also introduce CITEBench, a benchmark for systematic evaluation, and FairScore, a reference-less metric for estimating attribution precision and recall. Experiments demonstrate TraceBack's superior performance over strong baselines, highlighting its potential for interpretable and scalable evaluation in table-based QA.
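The three-stage pipeline described above (prune the table, decompose the question, align answer spans with cells) can be sketched in simplified form. This is a hypothetical illustration only: the function names, token-overlap pruning heuristic, and exact-match alignment are stand-ins, since the summary does not specify the authors' implementation, which uses LLM agents for each stage.

```python
# Hypothetical sketch of a TraceBack-style pipeline. Stage names,
# signatures, and heuristics are illustrative, not the authors' API.

def prune_table(table, question):
    # Keep only rows whose cells share tokens with the question;
    # the real system would also prune columns and use an agent.
    q_tokens = set(question.lower().split())
    rows = [r for r in table["rows"]
            if any(q_tokens & set(str(c).lower().split()) for c in r)]
    return {"header": table["header"], "rows": rows or table["rows"]}

def decompose(question):
    # Placeholder: the paper describes an agent that splits the question
    # into semantically coherent sub-questions; here we pass it through.
    return [question]

def attribute(answer_span, table):
    # Align an answer span with supporting cells; exact-match lookup
    # stands in for the paper's span-to-cell alignment step.
    return [(i, j) for i, row in enumerate(table["rows"])
            for j, cell in enumerate(row) if answer_span in str(cell)]
```

The output of `attribute` is a list of (row, column) indices, i.e., the cell-level evidence that grounds each part of the answer.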

Key Points

  • TraceBack provides fine-grained, cell-level attribution for table QA systems.
  • The framework decomposes questions into sub-questions and aligns answers with supporting cells.
  • CITEBench (a phrase-to-cell benchmark) and FairScore (a reference-less metric) are introduced for systematic evaluation.
  • TraceBack outperforms strong baselines across datasets and granularities.
  • FairScore closely tracks human judgments and preserves relative method rankings.

Merits

Innovative Framework

TraceBack introduces a modular multi-agent framework that significantly enhances the transparency and trustworthiness of table QA systems by providing fine-grained attribution.

Comprehensive Evaluation

The introduction of CITEBench and FairScore provides a robust and systematic approach to evaluating the performance of table QA systems, addressing a critical gap in the field.
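The summary describes FairScore as comparing atomic facts derived from predicted cells against those in the answer to estimate attribution precision and recall. A minimal sketch of that comparison, assuming facts have already been extracted as sets (the fact-extraction step itself is the paper's contribution and is not reproduced here):

```python
# Illustrative FairScore-style comparison over pre-extracted atomic
# facts. The function name and set-based matching are assumptions;
# the paper's actual fact extraction and matching may differ.

def fact_overlap(cell_facts, answer_facts):
    # Precision: fraction of cell-derived facts that appear in the answer.
    # Recall: fraction of answer facts supported by the predicted cells.
    cells, answers = set(cell_facts), set(answer_facts)
    tp = len(cells & answers)
    precision = tp / len(cells) if cells else 0.0
    recall = tp / len(answers) if answers else 0.0
    return precision, recall
```

Because neither score requires human-annotated cell labels, the metric remains reference-less, which is what makes it scalable.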

Superior Performance

Experiments demonstrate that TraceBack substantially outperforms strong baselines, highlighting its potential for practical applications in high-stakes settings.

Demerits

Complexity

The multi-agent decomposition process may introduce complexity and computational overhead, which could limit its scalability in certain applications.

Dependence on Benchmark

The effectiveness of TraceBack is heavily reliant on the quality and comprehensiveness of CITEBench, which may not cover all possible scenarios in real-world applications.

Human Judgment Validation

While FairScore closely tracks human judgments, it may not fully capture the nuances of human evaluation, potentially producing discrepancies between estimated and true attribution precision and recall.

Expert Commentary

The article presents a significant advancement in the field of table QA systems by introducing TraceBack, a framework that addresses the critical need for fine-grained attribution. The modular multi-agent approach not only enhances the transparency of QA systems but also provides a robust methodology for evaluating their performance. The introduction of CITEBench and FairScore further strengthens the framework's validity, offering a systematic and reference-less evaluation metric. However, the complexity of the multi-agent decomposition process and the dependence on the benchmark's comprehensiveness are notable limitations. Despite these challenges, TraceBack's superior performance and potential for practical applications make it a valuable contribution to the field. The framework's alignment with broader issues in explainable AI, data privacy, and AI ethics underscores its relevance and potential impact on both practical and policy levels. As AI systems continue to evolve, the need for transparent and interpretable frameworks like TraceBack will only grow, making this article a timely and impactful contribution to the scholarly discourse.

Recommendations

  • Further research should explore methods to reduce the computational overhead of the multi-agent decomposition process to enhance scalability.
  • Expanding CITEBench to include a broader range of scenarios and edge cases would improve the robustness of the benchmark and the validity of the evaluation metrics.