
KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem


Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han

arXiv:2602.20217v1 — Abstract: Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. This foundation allows our method to maintain high drafting faithfulness while navigating the shifting bottlenecks of real-world hardware. Our experiments on Qwen3 and Llama3 demonstrate that KnapSpec consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47x wall-clock speedup across various benchmarks. Our plug-and-play approach ensures high-speed inference for long sequences without requiring additional training or compromising the target model's output distribution.

Executive Summary

The article introduces KnapSpec, a training-free framework for self-speculative decoding (SSD) in large language models (LLMs). By reformulating draft model selection as a knapsack problem, KnapSpec adaptively identifies optimal draft configurations, modeling hardware-specific layer latencies as functions of context length and using cosine similarity between hidden states as a proxy for token acceptance. The framework consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47x wall-clock speedup across benchmarks, and its plug-and-play design preserves the target model's output distribution without additional training. The article pairs a rigorous theoretical analysis with empirical evaluations on Qwen3 and Llama3, showcasing KnapSpec's potential for accelerating LLM inference in long-context scenarios.

Key Points

  • KnapSpec reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput.
  • The framework decouples Attention and MLP layers, modeling their hardware-specific latencies as functions of context length.
  • The authors provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a proxy for the token acceptance rate.
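To make the knapsack framing concrete, here is a minimal sketch, not the paper's algorithm: each candidate draft layer carries an integer latency cost (its "weight") and a faithfulness value (its contribution toward hidden-state similarity), and a standard 0/1 knapsack dynamic program picks the layer set that maximizes total value within a latency budget. The function name, costs, and values below are all illustrative assumptions.

```python
def select_draft_layers(costs, values, budget):
    """Pick a subset of layers maximizing total value within a latency budget."""
    n = len(costs)
    # dp[i][b] = best value achievable using the first i layers within budget b
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        c, v = costs[i - 1], values[i - 1]
        for b in range(budget + 1):
            dp[i][b] = dp[i - 1][b]          # skip layer i-1
            if b >= c and dp[i - 1][b - c] + v > dp[i][b]:
                dp[i][b] = dp[i - 1][b - c] + v  # keep layer i-1
    # Backtrack: a layer was kept wherever the value changed
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if dp[i][b] != dp[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return sorted(chosen), dp[n][budget]

# Toy instance: 6 layers with illustrative latency units and faithfulness scores
costs = [3, 2, 4, 2, 3, 1]
values = [0.9, 0.6, 1.2, 0.5, 0.8, 0.2]
layers, total = select_draft_layers(costs, values, budget=8)
```

In the paper's setting the "weights" would come from context-length-dependent latency models for Attention and MLP sublayers, and the DP would be parallelized; this sketch only shows the combinatorial core.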

Merits

Strength in Theoretical Foundation

The article provides a mathematically sound theoretical analysis of cosine similarity between hidden states as a proxy for token acceptance rate, establishing a solid foundation for the KnapSpec framework.
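As an illustration of the proxy (a toy sketch, not the paper's derivation), one can compare the draft model's hidden state against the target model's hidden state at the same position: when the draft state is only a small perturbation of the target state, the cosine similarity stays near 1, suggesting a high token acceptance rate. The vectors below are invented for demonstration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy hidden states: the draft state is a slightly perturbed target state,
# so the similarity (and hence the predicted acceptance rate) stays high.
target = [0.5, -1.2, 0.8, 0.3]
draft = [0.48, -1.15, 0.85, 0.28]
sim = cosine_similarity(target, draft)
```

In practice the hidden states would be high-dimensional activations taken from the full model and the layer-skipped draft, but the proxy computation itself is this cheap.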

Demerits

Potential Overhead of Dynamic Programming

The use of parallel dynamic programming may introduce additional computational overhead, potentially negating some of the benefits of the KnapSpec framework.

Expert Commentary

The KnapSpec framework represents a meaningful advance in efficient inference for large language models. By reformulating draft model selection as a knapsack problem, the authors adapt draft configurations to hardware-specific latencies while using cosine similarity between hidden states to preserve drafting faithfulness. Although the parallel dynamic programming step may introduce overhead of its own, the framework's ability to maintain high faithfulness while navigating shifting hardware bottlenecks is a notable achievement. The theoretical analysis and empirical evaluations on Qwen3 and Llama3 provide a solid foundation, and the approach has clear potential for accelerating LLM inference in long-context scenarios. As models continue to scale, training-free frameworks like KnapSpec will become increasingly important for deploying them in real-world applications.

Recommendations

  • Future research should focus on further optimizing the parallel dynamic programming component to minimize additional overhead.
  • The KnapSpec framework should be explored in other areas where efficient inference is critical, such as computer vision and graph neural networks.
