SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving

arXiv:2603.02599v1 Announce Type: new Abstract: In multi-model LLM serving, decode execution remains inefficient due to model-specific resource partitioning: since cross-model batching is not possible, memory-bound decoding often suffers from severe GPU underutilization, especially under skewed workloads. We propose Shared Use of Next-token Prediction (SUN), the first approach that enables cross-model sharing of decode execution in disaggregated multi-LLM serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, and fine-tunes only the task-specific prefill module, enabling a frozen decode module to be shared across models. This design enables a model-agnostic decode routing policy that balances decode requests across shared workers to maximize utilization. Across diverse tasks and model families, SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers. In particular, SUN improves throughput per GPU by up to 2.0x over conventional disaggregation while keeping time-per-output-token (TPOT) within 5%. SUN inherently enables and facilitates low-bit decoding; with Quantized SUN (QSUN), it achieves a 45% speedup with comparable accuracy to SUN while preserving the benefits of shared decoding.

Executive Summary

The article presents SUN (Shared Use of Next-token Prediction), an approach to improving the efficiency of multi-LLM (Large Language Model) serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, fine-tuning only the task-specific prefill module so that a frozen decode module can be shared across models. This design enables a model-agnostic decode routing policy that balances decode requests across shared workers, maximizing GPU utilization. The study reports accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers: up to 2.0x higher throughput per GPU than conventional disaggregation, with time-per-output-token (TPOT) kept within 5%. SUN also enables low-bit decoding; its quantized variant, QSUN, achieves a further 45% speedup with accuracy comparable to SUN. This approach has significant implications for the scalability and efficiency of multi-LLM serving systems.
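The core idea, fine-tuning only the task-specific prefill module while keeping the decode module frozen and shared, can be sketched as follows. This is a minimal illustration under assumed details: the fixed layer-boundary split, the class-free layer representation, and all names (`partition_layers`, `trainable_parameters`) are hypothetical, not from the paper.

```python
# Sketch: split a stack of Transformer layers into a task-specific
# "prefill module" (fine-tuned per model) and a shared "decode module"
# (frozen, so one copy can serve every model's decode steps).
# The fixed split point and all names here are illustrative assumptions.

def partition_layers(layers, split_idx):
    """Return (prefill_module, decode_module) as two layer lists."""
    return layers[:split_idx], layers[split_idx:]

def trainable_parameters(prefill_module, decode_module):
    """Collect only prefill-module parameters for fine-tuning;
    decode-module parameters are deliberately excluded (frozen/shared)."""
    params = []
    for layer in prefill_module:
        params.extend(layer["params"])  # updated per task
    return params

# Toy model: 4 layers, each with one named parameter.
layers = [{"name": f"layer{i}", "params": [f"w{i}"]} for i in range(4)]
prefill, decode = partition_layers(layers, split_idx=2)
print(trainable_parameters(prefill, decode))  # ['w0', 'w1']
```

Because the frozen decode layers are byte-identical across all fine-tuned models, a single resident copy on a decode worker can serve requests originating from any of them.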

Key Points

  • SUN decomposes the decoder-only Transformer into a prefill module and a decode module for cross-model sharing of decode execution.
  • The model-agnostic decode routing policy balances decode requests across shared workers, maximizing utilization.
  • SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers.
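The model-agnostic routing mentioned above follows directly from the shared decode module: since any worker can serve any model's decode step, the router can balance purely on load. A minimal least-loaded sketch, assuming a simple heap-based policy (the paper's actual routing algorithm is not specified here):

```python
import heapq

# Sketch of a model-agnostic decode router. With a frozen, shared
# decode module, workers are interchangeable across models, so each
# incoming decode request can go to the least-loaded worker. The
# heap-based least-loaded policy is an illustrative assumption.

class DecodeRouter:
    def __init__(self, num_workers):
        # Min-heap of (active_requests, worker_id).
        self.heap = [(0, w) for w in range(num_workers)]
        heapq.heapify(self.heap)

    def route(self, request_id):
        load, worker = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (load + 1, worker))
        return worker

router = DecodeRouter(num_workers=3)
assignments = [router.route(r) for r in range(6)]
print(assignments)  # [0, 1, 2, 0, 1, 2]
```

Even if the six requests came from six different models, the shared decode module lets them spread evenly, which is exactly what per-model partitioning prevents under skewed workloads.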

Merits

Efficiency Improvement

SUN significantly improves the efficiency of multi-LLM serving by reducing GPU underutilization and maximizing system throughput.

Scalability Enhancement

The proposed technique enables cross-model sharing of decode execution, allowing multiple models to be served with fewer decode workers on the same system.

Flexibility and Adaptability

SUN's model-agnostic design allows for seamless integration with various models and tasks, making it a versatile solution for multi-LLM serving.

Demerits

Complexity

The decomposition of the decoder-only Transformer into separate prefill and decode modules may add complexity to training and deployment pipelines.

Limited Context

The article primarily focuses on the technical aspects of SUN, and further research is necessary to fully understand its implications and applications in real-world scenarios.

Expert Commentary

The article presents a significant contribution to efficient LLM serving, offering a novel approach to sharing decode execution across models. The proposed technique has the potential to change how multiple large language models are deployed, enabling higher GPU utilization under skewed multi-model workloads. However, further research is needed to understand SUN's behavior in real-world deployments. The development of QSUN highlights the benefits of quantization and low-bit decoding, which merit exploration in future studies. Overall, the work illustrates how co-designing model fine-tuning and serving infrastructure can yield efficiency gains neither achieves alone.

Recommendations

  • Further research should focus on exploring the applications and implications of SUN in real-world scenarios, such as in the deployment of large language models for conversational AI, text classification, and question-answering tasks.
  • The development of QSUN should be continued, with a focus on optimizing the quantization process and exploring the potential of low-bit decoding to reduce computational costs.
