SimpleTool: Parallel Decoding for Real-Time LLM Function Calling
arXiv:2603.00030v1 Announce Type: new Abstract: LLM-based function calling enables intelligent agents to interact with external tools and environments, yet autoregressive decoding imposes a fundamental latency bottleneck that limits real-time applications such as embodied intelligence, game AI, and interactive avatars (e.g., 10 Hz control frequency). We observe that function calling differs fundamentally from free-form text generation: structured outputs exhibit substantial token redundancy (delimiters, parameter names), and arguments exhibit weak causal dependencies. Crucially, these two properties must be exploited jointly to achieve real-time performance. We present SimpleTool, which introduces special tokens that serve a dual role: compressing low-entropy tokens (4-6x reduction) while acting as mode selectors that enable independent parallel generation of function name and arguments. This synergistic design achieves 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead. Experiments on five benchmarks across Qwen-series models (0.5B-14B) demonstrate substantial speedup while maintaining competitive or improved accuracy. On Mobile Actions, ST-Qwen-0.5B outperforms Google's FunctionGemma in both accuracy and latency consistency. With quantization on consumer-grade GPU, SimpleTool achieves 61.2ms P50 latency, enabling 16 Hz real-time control at 4B model scale, bridging the gap between LLM function calling and latency-critical real-world deployment.
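The compression idea can be illustrated with a minimal sketch: a structured function call spends many tokens on fixed delimiters and parameter scaffolding, which a small set of special tokens can replace. The special-token names (`<FN>`, `<ARGS>`, etc.) and the template mapping below are hypothetical illustrations, not the paper's actual vocabulary.

```python
# Sketch of the low-entropy-token compression idea. A typical structured
# call spends many tokens on fixed JSON syntax; mapping each fixed
# fragment to a single special token removes that scaffolding, leaving
# only the content tokens (function name, argument names, values).

verbose = '{"name": "set_velocity", "arguments": {"x": 0.5, "y": -0.2}}'

# Hypothetical template: each fixed syntactic fragment -> one special token.
TEMPLATE_TOKENS = {
    '{"name": "': "<FN>",
    '", "arguments": {"': "<ARGS>",
    '": ': "<V>",
    ', "': "<SEP>",
    "}}": "<END>",
}

def compress(call: str) -> str:
    """Replace fixed syntactic fragments with single special tokens."""
    for fragment, token in TEMPLATE_TOKENS.items():
        call = call.replace(fragment, token)
    return call

print(compress(verbose))
# -> <FN>set_velocity<ARGS>x<V>0.5<SEP>y<V>-0.2<END>
```

In this toy example the fixed syntax collapses from dozens of characters to five special tokens, which is the same direction as the 4-6x token reduction the paper reports, though the exact scheme here is only illustrative.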
Executive Summary
The article introduces SimpleTool, a parallel-decoding approach for real-time LLM function calling that achieves a 3-6x end-to-end speedup with only +8.2% parallelization overhead. By jointly exploiting the token redundancy and weak causal dependencies of structured outputs, SimpleTool generates the function name and arguments independently in parallel, making it suitable for latency-critical applications such as embodied intelligence and game AI. With quantization on a consumer-grade GPU, it reaches 61.2 ms P50 latency (16 Hz control at 4B model scale), bridging the gap between LLM function calling and real-world deployment.
Key Points
- ▸ SimpleTool introduces special tokens that both compress low-entropy tokens (4-6x reduction) and act as mode selectors enabling parallel generation of the function name and arguments
- ▸ The approach achieves 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead
- ▸ Experiments on five benchmarks across Qwen-series models (0.5B-14B) show substantial speedup with competitive or improved accuracy
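The parallel-generation idea from these points can be sketched as follows: because arguments exhibit weak causal dependencies, the function name and each argument slot can be decoded concurrently rather than strictly left to right. The `decode_name`/`decode_arg` stubs below stand in for per-slot model calls and are hypothetical, not the paper's API.

```python
# Sketch of independent parallel decoding of function name and arguments.
# Each slot is filled by its own (stubbed) decode call; a real system
# would batch these as concurrent model forward passes.
from concurrent.futures import ThreadPoolExecutor

def decode_name(prompt: str) -> str:
    # Placeholder for a model call that fills the function-name slot.
    return "set_velocity"

def decode_arg(prompt: str, slot: str) -> float:
    # Placeholder for a model call that fills one argument slot.
    return {"x": 0.5, "y": -0.2}[slot]

def parallel_call(prompt: str, arg_slots: list[str]) -> dict:
    """Decode the function name and all argument slots concurrently."""
    with ThreadPoolExecutor() as pool:
        name_future = pool.submit(decode_name, prompt)
        arg_futures = {s: pool.submit(decode_arg, prompt, s) for s in arg_slots}
        return {
            "name": name_future.result(),
            "arguments": {s: f.result() for s, f in arg_futures.items()},
        }

print(parallel_call("move toward the target", ["x", "y"]))
```

With N argument slots, wall-clock decode time approaches the longest single slot rather than the sum of all slots, which is the source of the end-to-end speedup the paper describes.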
Merits
Efficient Parallelization
The dual-role special tokens enable independent parallel decoding of the function name and arguments, cutting end-to-end latency 3-6x (up to 9.6x) and improving real-time performance
Improved Accuracy
The approach maintains competitive or improved accuracy across five benchmarks; on Mobile Actions, ST-Qwen-0.5B outperforms Google's FunctionGemma in both accuracy and latency consistency
Demerits
Limited Model Support
The paper evaluates SimpleTool only on Qwen-series models (0.5B-14B), leaving its applicability to other model families untested
Expert Commentary
SimpleTool represents a significant advance in parallel decoding for real-time LLM function calling. By jointly exploiting the token redundancy and weak causal dependencies of structured outputs, the approach achieves substantial speedup while maintaining competitive accuracy, demonstrating that SimpleTool can bridge the gap between LLM function calling and latency-critical real-world deployment. However, further research is needed to establish its limitations and its applicability to other model families and applications.
Recommendations
- ✓ Further evaluate SimpleTool on a broader range of models and benchmarks to assess its generalizability
- ✓ Explore the potential applications of SimpleTool in various real-time domains, such as game AI and interactive avatars