
IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

arXiv:2602.19049v1 Announce Type: new Abstract: Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that our IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at https://github.com/YinhanHe123/IAPO.

Executive Summary

The paper introduces IAPO, an information-theoretic post-training framework that optimizes large language models for token-efficient reasoning by assigning token-wise advantages based on each token's conditional mutual information (MI) with the final answer. Unlike sequence-level reward shaping, which offers limited control over how reasoning effort is allocated across tokens, IAPO provides an explicit per-token mechanism for identifying informative reasoning steps and suppressing low-utility exploration. Empirical evaluations show that IAPO improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. These results suggest that information-aware advantage shaping is a promising direction for building more efficient reasoning models.

Key Points

  • IAPO assigns token-wise advantages based on each token's conditional mutual information with the final answer.
  • IAPO addresses the limited per-token control of sequence-level reward-shaping methods by shaping advantages at the individual-token level.
  • Empirical evaluations demonstrate that IAPO improves reasoning accuracy while reducing reasoning length by up to 36%.
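The core idea can be illustrated with a toy sketch. Note that the function names, the normalization, and the exact shaping rule below are illustrative assumptions, not the paper's implementation: the pointwise conditional MI of a reasoning token with the answer can be estimated as the log-ratio of the answer's probability given the prefix including versus excluding that token, and a sequence-level advantage can then be redistributed across tokens in proportion to it.

```python
import math

def pointwise_cmi(p_answer_with_token, p_answer_without_token):
    """Pointwise conditional MI estimate for one reasoning token:
    log p(answer | prefix + token) - log p(answer | prefix).
    Positive values mean the token made the final answer more likely."""
    return math.log(p_answer_with_token) - math.log(p_answer_without_token)

def shape_advantages(seq_advantage, p_with, p_without):
    """Toy token-wise advantage shaping (assumed, not IAPO's actual rule):
    weight a sequence-level advantage by each token's normalized
    pointwise CMI, so informative tokens receive larger credit."""
    cmi = [pointwise_cmi(w, wo) for w, wo in zip(p_with, p_without)]
    max_abs = max(abs(c) for c in cmi) or 1.0
    return [seq_advantage * c / max_abs for c in cmi]

# Example: three reasoning tokens; the second barely changes the
# answer probability and so receives a near-zero shaped advantage.
p_with = [0.30, 0.21, 0.60]     # p(answer | prefix incl. token)
p_without = [0.20, 0.20, 0.30]  # p(answer | prefix excl. token)
adv = shape_advantages(1.0, p_with, p_without)
```

In a real setup the two probabilities would come from extra forward passes of the policy (or an auxiliary model), which is exactly the computational cost flagged as a limitation below.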

Merits

Principled design

IAPO's use of information theory provides a principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration.
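In standard information-theoretic terms (the summary does not specify the exact estimator IAPO uses), the pointwise quantity for a reasoning token $y_t$ and final answer $a$ is

\[
i(y_t;\, a \mid y_{<t}) \;=\; \log \frac{p(a \mid y_{\le t})}{p(a \mid y_{<t})},
\]

whose expectation over tokens and answers is the conditional mutual information $I(y_t; a \mid y_{<t})$. Tokens with near-zero pointwise value leave the answer distribution essentially unchanged, which is what licenses pruning them without harming correctness.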

Empirical validation

IAPO consistently outperforms existing token-efficient RL methods across various reasoning datasets, demonstrating its effectiveness and generalizability.

Theoretical analysis

The authors' theoretical analysis shows that IAPO can induce monotonic reductions in reasoning verbosity without harming correctness, grounding its efficiency claims.

Demerits

Scalability

The current implementation of IAPO may not scale to very large language models due to the computational cost of estimating token-wise advantages for every token in a reasoning trace.

Expert Commentary

The paper presents a principled approach to optimizing large language models for token-efficient reasoning. Grounding per-token credit assignment in conditional mutual information is its main contribution: it replaces coarse sequence-level reward shaping with an explicit criterion for which reasoning steps are worth their token cost. The empirical results, with consistent accuracy gains alongside length reductions of up to 36%, back the theoretical analysis. The chief open question is scalability, since estimating token-wise MI adds inference cost during training; even so, the work makes a strong case for information-aware advantage shaping in efficient reasoning models.

Recommendations

  • Further investigation into the scalability of IAPO is necessary to ensure its applicability to very large language models.
  • The theoretical analysis should be extended to assess IAPO's robustness across model families and reasoning tasks.

Sources

  • arXiv:2602.19049 (code: https://github.com/YinhanHe123/IAPO)