ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models
arXiv:2603.21237v1

Abstract: Large language models (LLMs) deliver impressive capabilities but incur substantial inference latency and cost, which hinders their deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference has emerged as a promising paradigm by dynamically routing queries to models of different capacities across tiers. In this paper, we propose ConsRoute, a lightweight, semantic-aware, and adaptive routing framework that significantly improves inference efficiency while minimizing impact on response quality. Unlike prior routing methods that rely on predicting coarse-grained output quality gaps, ConsRoute leverages a reranker to directly assess the semantic consistency between responses generated by models at different tiers, yielding fine-grained soft supervision signals for routing. To minimize device-side overhead, ConsRoute reuses hidden states from the LLM prefilling stage as compact query representations, avoiding additional encoders or inference passes. Furthermore, these representations are clustered, and Bayesian optimization is employed to learn cluster-specific routing thresholds that dynamically balance quality, latency, and cost under heterogeneous query distributions. Extensive experiments demonstrate that ConsRoute achieves near-cloud performance (>=95%) while reducing end-to-end latency and inference cost by nearly 40%, consistently outperforming existing routing baselines in both response quality and system efficiency.
Executive Summary
ConsRoute is an adaptive query routing framework that optimizes inference efficiency for large language models in cloud-edge-device collaborative inference. By using a reranker to score the semantic consistency between responses from models at different tiers, and by reusing prefilling-stage hidden states as compact query representations, ConsRoute balances quality, latency, and cost without extra device-side encoders. Experiments show a nearly 40% reduction in end-to-end latency and inference cost while retaining at least 95% of cloud-level response quality. Because routing thresholds are learned per cluster of query representations, the framework adapts to heterogeneous query distributions, making it a practical option for latency-sensitive and resource-constrained deployments.
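The supervision signal described above can be sketched in a few lines. The paper uses a learned reranker to score agreement between a smaller (device/edge) model's response and a larger (cloud) model's response; the sketch below substitutes a trivial token-overlap score as a stand-in, since the actual reranker is not specified here. Function names (`rerank_score`, `soft_routing_label`) are illustrative, not from the paper.

```python
def rerank_score(response_a: str, response_b: str) -> float:
    """Stand-in for a learned reranker: returns a semantic-agreement score
    in [0, 1]. Approximated here by Jaccard token overlap for illustration;
    the paper uses an actual reranker model."""
    a, b = set(response_a.lower().split()), set(response_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def soft_routing_label(device_resp: str, cloud_resp: str) -> float:
    """Soft supervision signal: a high score means the device-tier answer is
    consistent with the cloud-tier answer, so the query could have been
    served locally without a quality loss."""
    return rerank_score(device_resp, cloud_resp)

label = soft_routing_label(
    "paris is the capital of france",
    "the capital of france is paris",
)
# Identical token sets -> score 1.0: fully consistent, route to device.
```

The point of the soft label is that it is fine-grained: instead of a binary "small model succeeds / fails" target, the router is trained on a continuous consistency score.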
Key Points
- ▸ ConsRoute is a lightweight, semantic-aware, and adaptive routing framework for large language models.
- ▸ It leverages a reranker to assess semantic consistency between responses generated by models at different tiers.
- ▸ ConsRoute reuses hidden states from the LLM prefilling stage as compact query representations.
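Putting the key points together, the routing decision at inference time can be sketched as: pool the prefilling-stage hidden states into a query representation, assign it to the nearest cluster, and compare a predicted device-cloud consistency score against that cluster's threshold (which the paper tunes offline via Bayesian optimization). All concrete values below (pooling choice, toy centroids, thresholds) are illustrative assumptions, not details from the paper.

```python
import numpy as np

def route(query_hidden: np.ndarray, centroids: np.ndarray,
          thresholds: np.ndarray, predicted_consistency: float) -> str:
    """Pick the cluster whose centroid is nearest the pooled query
    representation, then keep the query on-device only if the predicted
    consistency clears that cluster's learned threshold."""
    rep = query_hidden.mean(axis=0)  # mean-pool token hidden states -> (d,)
    cluster = int(np.argmin(np.linalg.norm(centroids - rep, axis=1)))
    return "device" if predicted_consistency >= thresholds[cluster] else "cloud"

rng = np.random.default_rng(0)
centroids = np.stack([np.zeros(8), np.ones(8)])  # two toy cluster centers
thresholds = np.array([0.9, 0.5])   # per-cluster, e.g. from Bayesian opt.
hidden = rng.normal(loc=1.0, scale=0.01, size=(5, 8))  # 5 tokens, dim 8
decision = route(hidden, centroids, thresholds, predicted_consistency=0.7)
# The pooled representation lands near cluster 1 (threshold 0.5),
# and 0.7 >= 0.5, so the query stays on the device tier.
```

Note the design choice this illustrates: a single global threshold would over- or under-escalate for some query types, whereas per-cluster thresholds let easy query regions route locally more aggressively than hard ones.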
Merits
Strength in Balancing Quality and Efficiency
ConsRoute effectively balances the trade-off between response quality and system efficiency, making it a practical solution for industries relying on large language models.
Demerits
Limited Evaluation on Handling Adversarial Inputs
The article does not explicitly evaluate ConsRoute's performance on handling adversarial inputs, which is a critical consideration in real-world applications.
Expert Commentary
While ConsRoute demonstrates strong results in balancing quality and efficiency, the absence of an adversarial-input evaluation highlights the need for further study. It is also worth noting that the reranker produces soft supervision signals for training the router, so its cost falls mainly on the offline training pipeline rather than on per-query inference; generating paired responses across tiers for that supervision is, however, itself nontrivial. Nevertheless, ConsRoute represents a meaningful step forward in optimizing the deployment of large language models in cloud-edge-device collaborative inference. As the field evolves, it will be important to address these remaining challenges and to explore applications of ConsRoute across industries.
Recommendations
- ✓ Future research should focus on evaluating ConsRoute's performance on handling adversarial inputs and exploring the potential applications of the framework in different domains.
- ✓ The development of more sophisticated reranking algorithms and compact query representations may further enhance the efficiency and effectiveness of ConsRoute.
Sources
Original: arXiv - cs.AI