TARo: Token-level Adaptive Routing for LLM Test-time Alignment
arXiv:2603.18411v1 Announce Type: new Abstract: Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model. Extensive experiments show that TARo significantly improves reasoning performance by up to +22.4% over the base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). Furthermore, TARo also generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.
Executive Summary
The article proposes a novel test-time alignment method called Token-level Adaptive Routing (TARo) for improving the reasoning capabilities of large language models (LLMs). TARo uses a learnable token-level router to control how strongly a reward model, trained on step-wise mathematical traces, guides a frozen LLM toward structured reasoning entirely at inference time. Extensive experiments show significant gains in reasoning performance (up to +22.4% over the base model), out-of-distribution clinical reasoning (MedXpertQA), and instruction following (AlpacaEval). The approach also generalizes from small to large backbone models without retraining, expanding the scope of test-time alignment from preference optimization to robust, cross-domain reasoning.
Key Points
- ▸ TARo is a novel test-time alignment method for improving LLM reasoning capabilities.
- ▸ The approach leverages a learnable token-level router to control the guidance of a reward model.
- ▸ TARo demonstrates significant improvements in reasoning performance and out-of-distribution clinical reasoning.
- ▸ The approach generalizes across different backbone models without retraining.
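The mechanism sketched in the points above, a per-token gate blending reward-model guidance into the frozen base model's next-token distribution, can be illustrated with a minimal sketch. Note that the function and variable names (`routed_next_token_logits`, `router_gate`, `reward_scores`) are illustrative assumptions, not the paper's actual API, and a scalar additive gate is only one plausible form of the learned router:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a logit vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def routed_next_token_logits(base_logits, reward_scores, router_gate):
    """Blend frozen base-model logits with reward-model guidance.

    Hypothetical interface, for illustration only:
    - base_logits: next-token logits from the frozen base LLM
    - reward_scores: per-token scores from a step-wise reward model
    - router_gate: value in [0, 1] produced by a learned token-level
      router; 0 leaves the base model untouched, 1 applies full guidance
    """
    return base_logits + router_gate * reward_scores

# Toy example over a 5-token vocabulary
rng = np.random.default_rng(0)
base = rng.normal(size=5)
reward = rng.normal(size=5)

weak = softmax(routed_next_token_logits(base, reward, router_gate=0.0))
strong = softmax(routed_next_token_logits(base, reward, router_gate=1.0))

# With the gate closed, the routed distribution equals the base model's
assert np.allclose(weak, softmax(base))
```

Because the gate is recomputed per decoding step, the router can apply guidance only at tokens where logical consistency matters (e.g. within a reasoning step) and fall back to the base model elsewhere, which is consistent with the "adaptive" behavior the summary describes.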
Merits
Strength
TARo's ability to improve LLM reasoning capabilities entirely at inference time is a significant advancement over existing methods.
Demerits
Limitation
The approach requires extensive training data for the reward model, which may be a limitation for certain applications.
Expert Commentary
The proposed method is a meaningful advancement in the field of LLMs, offering a lightweight and efficient way to improve reasoning capabilities entirely at inference time. The combination of a learnable token-level router with a reward model trained on step-wise mathematical traces is a novel design that shows test-time alignment can go beyond preference optimization. That TARo transfers across backbone models without retraining is especially notable: it suggests the learned routing signal captures model-agnostic structure, with practical implications for building more robust and efficient LLM-based systems.
Recommendations
- ✓ Future research should explore the applicability of TARo to other areas of natural language processing and to multimodal settings.
- ✓ The development of more efficient and scalable training methods for the reward model is essential for widespread adoption of TARo.