SimpleTool: Parallel Decoding for Real-Time LLM Function Calling
arXiv:2603.00030v1 Announce Type: new Abstract: LLM-based function calling enables intelligent agents to interact with external tools and environments, yet autoregressive decoding imposes a fundamental latency bottleneck that limits real-time applications such as embodied intelligence, game AI, and interactive avatars (e.g., 10 Hz control frequency). We observe that function calling differs fundamentally from free-form text generation: structured outputs exhibit substantial token redundancy (delimiters, parameter names), and arguments exhibit weak causal dependencies. Crucially, these two properties must be exploited jointly to achieve real-time performance. We present SimpleTool, which introduces special tokens that serve a dual role: compressing low-entropy tokens (4-6x reduction) while acting as mode selectors that enable independent parallel generation of function name and arguments. This synergistic design achieves 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead. Experiments on five benchmarks across Qwen-series models (0.5B-14B) demonstrate substantial speedup while maintaining competitive or improved accuracy. On Mobile Actions, ST-Qwen-0.5B outperforms Google's FunctionGemma in both accuracy and latency consistency. With quantization on consumer-grade GPU, SimpleTool achieves 61.2ms P50 latency, enabling 16 Hz real-time control at 4B model scale, bridging the gap between LLM function calling and latency-critical real-world deployment.
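The compression idea can be illustrated with a minimal sketch: a structured function call spends many tokens on fixed delimiters and parameter scaffolding, which a small set of special tokens can replace. The special-token names (`<FN>`, `<ARGS>`, etc.) and the template mapping below are hypothetical illustrations, not the paper's actual vocabulary.

```python
# Sketch of the low-entropy-token compression idea. A typical structured
# call spends many tokens on fixed JSON syntax; mapping each fixed
# fragment to a single special token removes that scaffolding, leaving
# only the content tokens (function name, argument names, values).

verbose = '{"name": "set_velocity", "arguments": {"x": 0.5, "y": -0.2}}'

# Hypothetical template: each fixed syntactic fragment -> one special token.
TEMPLATE_TOKENS = {
    '{"name": "': "<FN>",
    '", "arguments": {"': "<ARGS>",
    '": ': "<V>",
    ', "': "<SEP>",
    "}}": "<END>",
}

def compress(call: str) -> str:
    """Replace fixed syntactic fragments with single special tokens."""
    for fragment, token in TEMPLATE_TOKENS.items():
        call = call.replace(fragment, token)
    return call

print(compress(verbose))
# -> <FN>set_velocity<ARGS>x<V>0.5<SEP>y<V>-0.2<END>
```

In this toy example the fixed syntax collapses from dozens of characters to five special tokens, which is the same direction as the 4-6x token reduction the paper reports, though the exact scheme here is only illustrative.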
Executive Summary
The article introduces SimpleTool, a parallel-decoding approach for real-time LLM function calling that achieves a 3-6x end-to-end speedup with only +8.2% parallelization overhead. By jointly exploiting the token redundancy and weak causal dependencies of structured outputs, SimpleTool generates the function name and arguments independently in parallel, making it suitable for latency-critical applications such as embodied intelligence and game AI. With quantization on a consumer-grade GPU, it reaches 61.2 ms P50 latency (16 Hz control at 4B model scale), bridging the gap between LLM function calling and real-world deployment.
Key Points
- ▸ SimpleTool introduces special tokens that both compress low-entropy tokens (4-6x reduction) and act as mode selectors enabling parallel generation of the function name and arguments
- ▸ The approach achieves 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead
- ▸ Experiments on five benchmarks across Qwen-series models (0.5B-14B) show substantial speedup with competitive or improved accuracy
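The parallel-generation idea from these points can be sketched as follows: because arguments exhibit weak causal dependencies, the function name and each argument slot can be decoded concurrently rather than strictly left to right. The `decode_name`/`decode_arg` stubs below stand in for per-slot model calls and are hypothetical, not the paper's API.

```python
# Sketch of independent parallel decoding of function name and arguments.
# Each slot is filled by its own (stubbed) decode call; a real system
# would batch these as concurrent model forward passes.
from concurrent.futures import ThreadPoolExecutor

def decode_name(prompt: str) -> str:
    # Placeholder for a model call that fills the function-name slot.
    return "set_velocity"

def decode_arg(prompt: str, slot: str) -> float:
    # Placeholder for a model call that fills one argument slot.
    return {"x": 0.5, "y": -0.2}[slot]

def parallel_call(prompt: str, arg_slots: list[str]) -> dict:
    """Decode the function name and all argument slots concurrently."""
    with ThreadPoolExecutor() as pool:
        name_future = pool.submit(decode_name, prompt)
        arg_futures = {s: pool.submit(decode_arg, prompt, s) for s in arg_slots}
        return {
            "name": name_future.result(),
            "arguments": {s: f.result() for s, f in arg_futures.items()},
        }

print(parallel_call("move toward the target", ["x", "y"]))
```

With N argument slots, wall-clock decode time approaches the longest single slot rather than the sum of all slots, which is the source of the end-to-end speedup the paper describes.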
Merits
Efficient Parallelization
The dual-role special tokens enable independent parallel decoding of the function name and arguments, cutting end-to-end latency 3-6x (up to 9.6x) and improving real-time performance
Improved Accuracy
The approach maintains competitive or improved accuracy across five benchmarks; on Mobile Actions, ST-Qwen-0.5B outperforms Google's FunctionGemma in both accuracy and latency consistency
Demerits
Limited Model Support
The paper evaluates SimpleTool only on Qwen-series models (0.5B-14B), leaving its applicability to other model families untested
Expert Commentary
SimpleTool represents a significant advance in parallel decoding for real-time LLM function calling. By jointly exploiting the token redundancy and weak causal dependencies of structured outputs, the approach achieves substantial speedup while maintaining competitive accuracy, demonstrating that SimpleTool can bridge the gap between LLM function calling and latency-critical real-world deployment. However, further research is needed to establish its limitations and its applicability to other model families and applications.
Recommendations
- ✓ Further evaluate SimpleTool on a broader range of models and benchmarks to assess its generalizability
- ✓ Explore the potential applications of SimpleTool in various real-time domains, such as game AI and interactive avatars