$\pi^2$: Structure-Originated Reasoning Data Improves Long-Context Reasoning Ability of Large Language Models
arXiv:2604.05114v1 Abstract: We study a pipeline that curates reasoning data from initial structured data for improving long-context reasoning in large language models (LLMs). Our approach, $\pi^2$, constructs high-quality reasoning data through rigorous QA curation: 1) extracting and expanding tables from Wikipedia, 2) generating, from the collected tables and relevant context, realistic multi-hop analytical reasoning questions whose answers are automatically determined and verified through dual-path code execution, and 3) back-translating step-by-step structured reasoning traces as solutions to the QA pairs given realistic web-search context. Supervised fine-tuning of \textsc{\small{gpt-oss-20b}} and \textsc{\small{Qwen3-4B-Instruct-2507}} on $\pi^2$ yields consistent improvements across four long-context reasoning benchmarks and our analogously constructed $\pi^2$-Bench, with average absolute accuracy gains of +4.3% and +2.7% respectively. Notably, our dataset facilitates self-distillation, where \textsc{\small{gpt-oss-20b}} even improves its average performance by +4.4% with its own reasoning traces, demonstrating $\pi^2$'s usefulness. Our code, data, and models are open-source at https://github.com/vt-pi-squared/pi-squared.
Executive Summary
The paper introduces π², a pipeline that enhances the long-context reasoning capabilities of large language models (LLMs) by generating high-quality reasoning data from structured sources. The pipeline extracts and expands Wikipedia tables, generates multi-hop analytical reasoning questions whose answers are verified via dual-path code execution, and back-translates structured step-by-step reasoning traces to serve as training solutions. Fine-tuning gpt-oss-20b and Qwen3-4B-Instruct-2507 on π² yields consistent accuracy improvements across long-context reasoning benchmarks, including a self-distillation effect in which gpt-oss-20b improves using its own reasoning traces. The authors release the code, data, and models as open source, supporting reproducibility and further research.
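In outline, the three stages might be sketched as follows. Every function here is a toy placeholder invented for illustration — the actual pipeline drives extraction, question generation, and back-translation with LLMs and code execution — but the sketch makes the data flow from table to QA pair to reasoning trace concrete.

```python
def extract_tables(page):
    # Stage 1: table extraction (toy: the page already carries one table).
    return [page["table"]]

def generate_multihop_qa(table):
    # Stage 2: generate a question whose answer is computed directly from
    # the table (the real pipeline additionally cross-checks the answer
    # via a second, independently generated program).
    total = sum(row["pop"] for row in table)
    return [("What is the combined population?", total)]

def backtranslate_trace(question, answer, table):
    # Stage 3: a step-by-step reasoning trace grounded in the table rows.
    steps = [f"{row['city']} has population {row['pop']}" for row in table]
    return "; ".join(steps) + f"; summing gives {answer}"

page = {"table": [{"city": "A", "pop": 3}, {"city": "B", "pop": 5}]}
dataset = [
    (q, a, backtranslate_trace(q, a, table))
    for table in extract_tables(page)
    for q, a in generate_multihop_qa(table)
]
print(dataset[0])
```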
Key Points
- ▸ π² employs a three-step pipeline: structured data extraction (Wikipedia tables), multi-hop QA generation with automatic verification via code execution, and back-translation of reasoning traces for structured solutions.
- ▸ Supervised fine-tuning on π² demonstrates consistent improvements in long-context reasoning, with average accuracy gains of +4.3% (gpt-oss-20b) and +2.7% (Qwen3-4B-Instruct-2507) across four benchmarks.
- ▸ The dataset enables self-distillation, where gpt-oss-20b improves its own performance by +4.4% using its generated reasoning traces, highlighting the scalability and adaptability of the approach.
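The dual-path verification idea in the points above can be illustrated with a minimal, self-contained sketch: two independently written programs answer the same question over a table, and the QA pair is kept only when their outputs agree. The function names and payload format are assumptions for illustration, not the authors' actual interface.

```python
def run_program(code, table):
    """Execute one candidate solution program against the source table."""
    scope = {"table": table}
    exec(code, scope)
    return scope["answer"]

def verify(path_a, path_b, table):
    """Keep the answer only when both solution paths agree."""
    a, b = run_program(path_a, table), run_program(path_b, table)
    return a if a == b else None

table = [{"city": "A", "pop": 3}, {"city": "B", "pop": 5}]
# Two independently written programs for the same question.
path_a = "answer = sum(row['pop'] for row in table)"
path_b = "answer = 0\nfor row in table:\n    answer += row['pop']"
result = verify(path_a, path_b, table)
print(result)  # both paths compute 8, so the answer is accepted
```

When the two paths disagree, `verify` returns `None` and the candidate QA pair would be discarded, which is what filters unreliable synthetic answers out of the training set.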
Merits
Novel Data Curation Pipeline
The systematic extraction of structured data from Wikipedia and generation of verifiable multi-hop reasoning questions represents a significant innovation in automated dataset creation for LLM training.
Automated Verification Mechanism
Dual-path code execution ensures high accuracy and reliability of generated answers, addressing a critical challenge in synthetic data generation.
Open-Source Contribution
The release of code, data, and models fosters transparency, reproducibility, and community-driven advancements in LLM training methodologies.
Empirical Robustness
Consistent performance gains across multiple models and benchmarks, including self-distillation, underscore the robustness and generalizability of the approach.
Demerits
Dependency on Structured Data Sources
The reliance on Wikipedia tables as the primary structured data source may limit the diversity of reasoning tasks and introduce biases inherent to the encyclopedic format.
Scalability of Dual-Path Code Execution
The computational overhead of dual-path code execution for verification could pose challenges for scaling the pipeline to larger or more complex datasets.
Model-Specific Gains
Performance improvements are demonstrated on specific models (e.g., gpt-oss-20b, Qwen3-4B-Instruct-2507), raising questions about generalizability to other architectures or larger models.
Limited Benchmark Diversity
The evaluation focuses on long-context reasoning benchmarks, leaving unaddressed other critical aspects such as factual accuracy, bias mitigation, or cross-domain generalization.
Expert Commentary
The π² pipeline represents a significant advancement in the automated generation of high-quality reasoning data for LLMs, addressing a critical bottleneck in long-context reasoning. The dual-path code execution verification mechanism is particularly noteworthy, as it ensures the reliability of synthetic data—a persistent challenge in the field. The demonstrated self-distillation capability further underscores the scalability of the approach, offering a glimpse into the future of autonomous model improvement. However, the reliance on Wikipedia as the primary structured data source may constrain the diversity of reasoning tasks, and the computational overhead of dual-path verification could limit scalability. Future work should explore more diverse data sources and optimize the verification process to enhance practical applicability. Overall, π² sets a new benchmark for synthetic data generation in LLM training and merits further investigation as a tool for advancing long-context reasoning capabilities.
Recommendations
- ✓ Investigate the integration of alternative structured data sources beyond Wikipedia to enhance the diversity and robustness of the reasoning tasks generated by the π² pipeline.
- ✓ Optimize the dual-path code execution process to reduce computational overhead, potentially through parallelization or lightweight verification methods, to improve scalability for larger datasets.
- ✓ Expand the evaluation framework to include a broader range of benchmarks, such as those assessing factual accuracy, bias mitigation, and cross-domain generalization, to provide a more comprehensive assessment of the π² approach.
- ✓ Explore the applicability of the π² pipeline to other LLM architectures and larger models to validate the generalizability of the observed performance gains.
- ✓ Develop standardized protocols for the ethical and responsible use of synthetic reasoning data, including guidelines for bias mitigation and data provenance tracking, to ensure alignment with emerging regulatory frameworks.
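On the parallelization recommendation: dual-path checks are independent across QA pairs, so they parallelize trivially. A minimal sketch with a worker pool is shown below; in a real deployment each `exec` would likely be a call into a sandboxed executor (making dispatch I/O-bound, which is why a thread pool suffices here), and the worker name and payload format are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def verify_pair(pair):
    """Run both solution paths for one QA pair and report agreement."""
    code_a, code_b, table = pair
    def run(code):
        scope = {"table": table}
        exec(code, scope)  # stands in for a sandboxed-executor call
        return scope["answer"]
    return run(code_a) == run(code_b)

table = [{"pop": 3}, {"pop": 5}]
pairs = [
    ("answer = sum(r['pop'] for r in table)",
     "answer = table[0]['pop'] + table[1]['pop']", table),  # paths agree
    ("answer = max(r['pop'] for r in table)",
     "answer = min(r['pop'] for r in table)", table),       # paths disagree
]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(verify_pair, pairs))
print(results)  # [True, False]
```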
Sources
Original: arXiv - cs.CL