Reasoning Topology Matters: Network-of-Thought for Complex Reasoning Tasks
arXiv:2603.20730v1 Announce Type: new Abstract: Existing prompting paradigms structure LLM reasoning in limited topologies: Chain-of-Thought (CoT) produces linear traces, while Tree-of-Thought (ToT) performs branching search. Yet complex reasoning often requires merging intermediate results, revisiting hypotheses, and integrating evidence from multiple sources. We propose Network-of-Thought (NoT), a framework that models reasoning as a directed graph with typed nodes and edges, guided by a heuristic-based controller policy. Across four benchmarks (GSM8K, Game of 24, HotpotQA, ProofWriter) and three models (GPT-4o-mini, Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct), we investigate when network topology outperforms chain or tree structures, whether LLM-generated heuristics can guide graph-based reasoning search, and the computation-accuracy tradeoff across topologies, evaluating each method on accuracy, topology simplicity, and token efficiency. Our results show that CoT remains effective for sequential tasks with GPT-4o-mini (89.5% on GSM8K), while NoT surpasses ToT on multi-hop reasoning (91.0% vs. 88.0% on HotpotQA with LLM-as-Judge). With 72B open-source models, NoT achieves the highest accuracy on GSM8K (91.5%), and Qwen2.5-72B achieves the best multi-hop QA result overall (91.7% on HotpotQA). Self-generated controller heuristics outperform fixed and random strategies on logical reasoning, with uncertainty-only weighting achieving 57.0% on ProofWriter. We also find that evaluation methodology significantly impacts method rankings: string-match underestimates all methods on open-ended QA, with the largest gap for NoT, a pattern consistent across all three models (14-18 percentage point gap on HotpotQA).
Executive Summary
This study proposes Network-of-Thought (NoT), a framework that models reasoning as a directed graph with typed nodes and edges, and evaluates it on complex reasoning tasks against Chain-of-Thought (CoT) and Tree-of-Thought (ToT) topologies. NoT surpasses ToT on multi-hop reasoning (HotpotQA) and, with 72B open-source models, achieves the highest accuracy on GSM8K, while CoT remains effective for sequential tasks with GPT-4o-mini. The study also highlights the importance of evaluation methodology: string-match scoring underestimates all methods on open-ended QA, with the largest gap for NoT. The findings have implications for building more accurate and efficient reasoning systems with language models.
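The core idea can be illustrated with a minimal sketch of a reasoning graph. The node/edge types, the `uncertainty` field, and the controller heuristic below are illustrative assumptions based on the abstract's description (typed nodes and edges, uncertainty-only weighting), not the paper's actual implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ThoughtNode:
    node_id: str
    node_type: str      # hypothetical types, e.g. "hypothesis", "evidence", "merge"
    content: str
    uncertainty: float  # lower = more confident (assumed scoring signal)

@dataclass
class ThoughtGraph:
    nodes: Dict[str, ThoughtNode] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (src, dst, edge_type)

    def add_node(self, node: ThoughtNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, dst: str, edge_type: str) -> None:
        # Unlike a tree, a node may have multiple parents, which is
        # what lets the graph merge intermediate results.
        self.edges.append((src, dst, edge_type))

    def next_to_expand(self) -> ThoughtNode:
        # Uncertainty-only controller heuristic: expand the node the
        # model is least certain about.
        return max(self.nodes.values(), key=lambda n: n.uncertainty)

g = ThoughtGraph()
g.add_node(ThoughtNode("h1", "hypothesis", "Candidate answer from passage A", 0.6))
g.add_node(ThoughtNode("e1", "evidence", "Supporting fact from passage B", 0.2))
g.add_node(ThoughtNode("m1", "merge", "Combine h1 with e1", 0.8))
g.add_edge("h1", "m1", "supports")
g.add_edge("e1", "m1", "supports")
print(g.next_to_expand().node_id)  # → m1, the least-certain node
```

The key structural difference from CoT and ToT is visible in `add_edge`: a merge node with two incoming edges has no counterpart in a chain or tree, which is exactly the multi-hop integration pattern the paper argues networks capture.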
Key Points
- ▸ NoT outperforms ToT on multi-hop reasoning tasks
- ▸ NoT achieves the highest accuracy on GSM8K with 72B open-source models
- ▸ Evaluation methodology significantly impacts method rankings
Merits
Strength
The study proposes a novel framework for complex reasoning tasks, which has the potential to improve the accuracy and efficiency of language models.
Strength
The authors conduct a thorough evaluation of NoT's performance across multiple benchmarks and models, providing a comprehensive understanding of its strengths and limitations.
Demerits
Limitation
The study relies on a limited set of benchmarks and models, which may not be representative of the broader range of complex reasoning tasks.
Limitation
The evaluation methodology used in the study may not be universally applicable, and may require further refinement for more accurate comparisons.
Expert Commentary
This study contributes to natural language processing research by proposing a novel framework for complex reasoning tasks. Modeling reasoning as a directed graph is its most innovative aspect: it allows intermediate results to be merged and hypotheses revisited, a more flexible representation than linear chains or branching trees. The evaluation across multiple benchmarks and models gives a comprehensive picture of NoT's strengths and limitations, and the finding that evaluation methodology can shift method rankings is a useful caution for the field. Overall, the work points toward more accurate and efficient graph-structured reasoning with language models and motivates further research in this area.
Recommendations
- ✓ Future studies should investigate the use of NoT in more complex and realistic reasoning tasks, such as multi-step problem-solving and decision-making.
- ✓ The development of more robust evaluation methodologies for language models is essential for ensuring the accurate assessment of their performance in real-world applications.
Sources
Original: arXiv - cs.CL