Learning Adaptive LLM Decoding
arXiv:2603.09065v1 Announce Type: new Abstract: Decoding from large language models (LLMs) typically relies on fixed sampling hyperparameters (e.g., temperature, top-p), despite substantial variation in task difficulty and uncertainty across prompts and individual decoding steps. We propose to learn adaptive decoding policies that dynamically select sampling strategies at inference time, conditioned on available compute resources. Rather than fine-tuning the language model itself, we introduce lightweight decoding adapters trained with reinforcement learning and verifiable terminal rewards (e.g. correctness on math and coding tasks). At the sequence level, we frame decoding as a contextual bandit problem: a policy selects a decoding strategy (e.g. greedy, top-k, min-p) for each prompt, conditioned on the prompt embedding and a parallel sampling budget. At the token level, we model decoding as a partially observable Markov decision process (POMDP), where a policy selects sampling actions at each token step based on internal model features and the remaining token budget. Experiments on the MATH and CodeContests benchmarks show that the learned adapters improve the accuracy-budget tradeoff: on MATH, the token-level adapter improves Pass@1 accuracy by up to 10.2% over the best static baseline under a fixed token budget, while the sequence-level adapter yields 2-3% gains under fixed parallel sampling. Ablation analyses support the contribution of both sequence- and token-level adaptation.
Executive Summary
This article presents a novel approach to decoding from large language models (LLMs): adaptive decoding policies that dynamically select sampling strategies at inference time. Rather than fine-tuning the language model itself, lightweight decoding adapters are trained with reinforcement learning against verifiable terminal rewards (e.g., correctness on math and coding tasks). The method frames decoding as a contextual bandit problem at the sequence level and as a partially observable Markov decision process (POMDP) at the token level. Experiments on the MATH and CodeContests benchmarks demonstrate an improved accuracy-budget tradeoff: up to 10.2% higher Pass@1 accuracy for token-level adaptation under a fixed token budget, and 2-3% gains for sequence-level adaptation under fixed parallel sampling. The study highlights the potential of adaptive decoding policies for improving the accuracy-compute tradeoff of LLM inference.
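To make the sequence-level formulation concrete, here is a minimal, hypothetical sketch of a contextual bandit adapter: a linear softmax policy maps a prompt embedding plus a budget feature to a distribution over candidate decoding strategies, and is updated with a REINFORCE-style gradient from a 0/1 terminal reward. The class name, feature construction, and strategy list are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

STRATEGIES = ["greedy", "top_k", "min_p"]  # candidate decoding strategies (arms)

class BanditDecodingAdapter:
    """Linear softmax policy over decoding strategies, trained with REINFORCE."""

    def __init__(self, embed_dim, n_arms=len(STRATEGIES), lr=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        # One weight row per arm; +1 column for the sampling-budget feature.
        self.W = np.zeros((n_arms, embed_dim + 1))
        self.lr = lr

    def _features(self, prompt_embedding, budget):
        return np.append(prompt_embedding, budget)

    def probs(self, prompt_embedding, budget):
        logits = self.W @ self._features(prompt_embedding, budget)
        z = np.exp(logits - logits.max())  # stable softmax
        return z / z.sum()

    def select(self, prompt_embedding, budget):
        p = self.probs(prompt_embedding, budget)
        return int(self.rng.choice(len(p), p=p))

    def update(self, prompt_embedding, budget, arm, reward):
        # REINFORCE gradient of log softmax: (indicator(arm) - p) * features.
        x = self._features(prompt_embedding, budget)
        p = self.probs(prompt_embedding, budget)
        grad = -np.outer(p, x)
        grad[arm] += x
        self.W += self.lr * reward * grad
```

In practice the reward would come from a verifier (a math checker or unit tests) after decoding the full sequence with the chosen strategy; the toy update above only assumes the reward is a scalar observed at the end of the episode.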
Key Points
- ▸ Adaptive decoding policies are proposed to dynamically select sampling strategies at inference time.
- ▸ Lightweight decoding adapters are trained with reinforcement learning and verifiable terminal rewards.
- ▸ Decoding is framed as a contextual bandit problem at the sequence level and a POMDP at the token level.
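The token-level POMDP framing can likewise be sketched as a small policy that, at each decoding step, observes cheap features (here, the entropy of the next-token distribution and the fraction of the token budget remaining) and outputs a sampling action, such as a temperature. The two-layer network, the feature choice, and the temperature range are assumptions for illustration; the paper's internal model features and action space may differ.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a next-token distribution (nats)."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

class TokenLevelAdapter:
    """Maps (entropy, remaining-budget fraction) to a temperature in [t_min, t_max]."""

    def __init__(self, t_min=0.0, t_max=1.2, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        self.t_min, self.t_max = t_min, t_max
        self.W1 = rng.normal(scale=0.5, size=(hidden, 2))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=0.5, size=hidden)
        self.b2 = 0.0

    def temperature(self, next_token_probs, tokens_left, budget):
        x = np.array([entropy(next_token_probs), tokens_left / budget])
        h = np.tanh(self.W1 @ x + self.b1)
        gate = 1.0 / (1.0 + np.exp(-(self.w2 @ h + self.b2)))  # sigmoid in (0, 1)
        return self.t_min + gate * (self.t_max - self.t_min)

def sample_with_temperature(logits, temp, rng):
    """Sample a token id; temp -> 0 recovers greedy decoding."""
    if temp < 1e-4:
        return int(np.argmax(logits))
    z = logits / temp
    z -= z.max()
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```

In the POMDP view, this forward pass is the policy's action at one step; its parameters would be trained end to end with reinforcement learning from the same verifiable terminal reward, credit-assigned across token steps.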
Merits
Improved Accuracy-Budget Tradeoff
The proposed method demonstrates measurable improvements in the accuracy-budget tradeoff, up to 10.2% Pass@1 on MATH under a fixed token budget, making it a valuable contribution to the field of LLM inference.
Flexibility and Adaptability
Because the policies condition on the prompt, internal model features, and the available compute budget, they can be tailored to different tasks and deployment settings without retraining the underlying LLM.
Demerits
Computational Complexity
The proposed method may introduce additional computational overhead, since the adapter must be evaluated for each prompt (and, at the token level, at every decoding step), which could be a limitation in resource-constrained environments.
Overfitting Risk
The lightweight decoding adapters may be prone to overfitting, especially if the training data is limited or noisy.
Expert Commentary
The article presents a well-motivated and well-executed study on adaptive decoding policies for LLMs. Keeping the base model frozen and training only lightweight adapters is a pragmatic design, and the reported gains on MATH and CodeContests support the approach. However, the study also surfaces potential limitations, including the adapter's inference-time overhead and the risk of overfitting to the training distribution. Future research should focus on addressing these limitations and on testing how well the learned policies transfer to other LLM-based applications.
Recommendations
- ✓ Future research should investigate the use of more advanced reinforcement learning techniques to improve the performance of the lightweight decoding adapters.
- ✓ The study should be extended to explore the applicability of adaptive decoding policies to other LLM-based applications, such as dialogue systems and open-ended text generation, where verifiable rewards are harder to define.