Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

Yakov Pyotr Shkolnikov

Abstract (arXiv:2603.04428v1): Multi-agent LLM systems on edge devices face a memory management problem: device RAM is too small to hold every agent's KV cache simultaneously. On Apple M4 Pro with 10.2 GB of cache budget, only 3 agents fit at 8K context in FP16. A 10-agent workflow must constantly evict and reload caches. Without persistence, every eviction forces a full re-prefill through the model -- 15.7 seconds per agent at 4K context. We address this by persisting each agent's KV cache to disk in 4-bit quantized format and reloading it directly into the attention layer, eliminating redundant O(n) prefill computation via direct cache restoration. The system comprises three components: a block pool providing per-agent isolated Q4 KV caches in safetensors format, a BatchQuantizedKVCache for concurrent inference over multiple agents' quantized caches, and cross-phase context injection that accumulates attention state across conversation phases without re-computation. Evaluated on three architectures (Gemma 3 12B, dense GQA, 48 layers; DeepSeek-Coder-V2-Lite 16B, MoE MLA, 27 layers; Llama 3.1 8B, dense GQA, 32 layers), cache restoration reduces time-to-first-token by up to 136x (Gemma: 22--136x at 4K--32K; DeepSeek: 11--76x at 4K--32K; Llama: 24--111x at 4K--16K; 3--10x at 1K). Q4 quantization fits 4x more agent contexts into fixed device memory than FP16. Perplexity measured with actual Q4 KV caches shows -0.7% for Gemma, +2.8% for Llama, and +3.0% for DeepSeek. Open-source at https://github.com/yshk-mxim/agent-memory

Executive Summary

This article summarizes a proposed approach, titled Agent Memory Below the Prompt, to the memory management problem in multi-agent large language model (LLM) systems on edge devices. The system persists each agent's key-value (KV) cache to disk in 4-bit quantized (Q4) format and restores it directly into the attention layers, eliminating the redundant prefill computation otherwise paid every time an evicted agent is reloaded. Across three model architectures, cache restoration reduces time-to-first-token by up to 136x, Q4 storage fits 4x more agent contexts into the same memory budget than FP16, and perplexity stays within about 3% of the baseline. These results are directly relevant to deploying multi-agent LLM workflows on resource-constrained edge devices.
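
As a rough illustration of what a Q4 KV-cache format involves, the sketch below implements a minimal affine 4-bit quantizer over flat value groups. The paper's actual packing, group size, and scale layout are not reproduced here, so the function names and the group size of 32 are assumptions for illustration only:

```python
def quantize_q4(values, group_size=32):
    """Quantize floats to 4-bit codes (0..15), one (min, scale) pair per group.

    Illustrative affine scheme: real Q4 KV formats also pack two codes per
    byte and choose group sizes for hardware alignment.
    """
    groups = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0          # avoid zero scale for flat groups
        codes = [round((v - lo) / scale) for v in group]  # each in 0..15
        groups.append((lo, scale, codes))
    return groups

def dequantize_q4(groups):
    """Reconstruct approximate floats; error is bounded by scale / 2 per value."""
    out = []
    for lo, scale, codes in groups:
        out.extend(lo + scale * c for c in codes)
    return out
```

The per-value reconstruction error is at most half the group's scale, which is why the paper can report perplexity within a few percent of FP16 while storing a quarter of the bytes.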

Key Points

  • Persists each agent's KV cache to disk in 4-bit quantized safetensors format, so evicted agents can be restored without re-prefill
  • Reduces time-to-first-token by up to 136x on cache restoration, with gains across all three evaluated architectures
  • Fits 4x more agent contexts into fixed device memory than FP16
  • Keeps perplexity within roughly 3% of the FP16 baseline (-0.7% to +3.0% across the three models)
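
To make the restore-versus-re-prefill saving concrete, a toy cache manager can track both paths. The `disk` dict and the counters below are hypothetical stand-ins for the paper's safetensors persistence, not its API:

```python
class ToyAgentCacheManager:
    """Sketch: reload a persisted KV cache when one exists, otherwise pay
    the full O(n) prefill. In the real system the restore cost is disk I/O
    plus dequantization, not attention over the whole prompt."""

    def __init__(self):
        self.disk = {}           # agent_id -> persisted cache ("files on disk")
        self.prefill_tokens = 0  # tokens re-processed through the model
        self.restores = 0        # direct cache restorations

    def evict(self, agent_id, kv_cache):
        self.disk[agent_id] = kv_cache           # persist instead of discarding

    def load(self, agent_id, prompt_tokens):
        if agent_id in self.disk:                # hit: skip prefill entirely
            self.restores += 1
            return self.disk[agent_id]
        self.prefill_tokens += len(prompt_tokens)  # miss: full re-prefill
        return {"kv": list(prompt_tokens)}         # toy stand-in for a KV cache
```

The first load of an agent pays the full prefill; after an eviction, reloading the same agent costs no prefill tokens at all, which is the source of the reported 11-136x time-to-first-token reductions.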

Merits

Strength in Performance

The proposed solution achieves substantial reductions in time-to-first-token across all three evaluated architectures: 22-136x for Gemma 3 12B and 11-76x for DeepSeek-Coder-V2-Lite at 4K-32K context, and 24-111x for Llama 3.1 8B at 4K-16K.

Memory Efficiency

The use of 4-bit quantization allows 4x more agent contexts to fit within a fixed device memory budget than FP16 (e.g., the 10.2 GB cache budget on Apple M4 Pro cited in the abstract).
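
The 4x figure follows directly from element width: a 4-bit cache stores the same tensors in a quarter of the bytes. A back-of-envelope sketch (the layer/head/dimension numbers below are illustrative assumptions, not any evaluated model's real configuration):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    # K and V tensors: 2 * layers * kv_heads * head_dim * seq_len elements
    return 2 * layers * kv_heads * head_dim * seq_len * bits // 8

BUDGET = int(10.2 * 2**30)  # the 10.2 GB cache budget cited in the abstract

# Assumed GQA shape for illustration: 48 layers, 8 KV heads, head_dim 128, 8K context
fp16_ctx = kv_cache_bytes(48, 8, 128, 8192, bits=16)
q4_ctx = kv_cache_bytes(48, 8, 128, 8192, bits=4)

assert fp16_ctx == 4 * q4_ctx  # Q4 is exactly 4x smaller per agent context
print(BUDGET // fp16_ctx, "agents in FP16 vs", BUDGET // q4_ctx, "in Q4")
```

Because per-context size shrinks exactly 4x (ignoring the small scale/zero-point overhead a real format adds), at least 4x as many agent contexts fit in the same budget.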

Scalability

The combination of a per-agent block pool, batched inference over quantized caches (BatchQuantizedKVCache), and disk persistence lets the number of agents grow beyond what device RAM alone can hold, making the approach viable for larger multi-agent workflows.
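
One way such scaling can work is a fixed-capacity resident pool with LRU eviction to persistent storage. The sketch below is an assumed illustration, not the paper's block-pool design; the `disk` dict stands in for persisted safetensors files:

```python
from collections import OrderedDict

class ToyBlockPool:
    """Fixed-capacity in-RAM pool: any number of agents can be served, but
    only `capacity` caches stay resident; the rest live on disk and are
    restored on demand. Hypothetical sketch for illustration."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.ram = OrderedDict()   # agent_id -> cache, kept in LRU order
        self.disk = {}             # stand-in for persisted cache files

    def get(self, agent_id):
        if agent_id in self.ram:
            self.ram.move_to_end(agent_id)               # most recently used
        else:
            cache = self.disk.pop(agent_id, {"kv": []})  # restore or start fresh
            self.ram[agent_id] = cache
            if len(self.ram) > self.capacity:            # evict LRU agent to disk
                victim, victim_cache = self.ram.popitem(last=False)
                self.disk[victim] = victim_cache
        return self.ram[agent_id]
```

With persistence, eviction is cheap bookkeeping rather than lost work: the evicted agent's attention state survives on disk and skips re-prefill on its next turn.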

Demerits

Quantization Limitations

4-bit quantization necessarily introduces some precision loss. The reported perplexity deltas are small (-0.7% for Gemma, +2.8% for Llama, +3.0% for DeepSeek) but could still matter in accuracy-sensitive applications.

Implementation Complexity

The solution combines three custom components (the block pool, BatchQuantizedKVCache, and cross-phase context injection), which raises implementation complexity and requires expertise in the underlying inference stack.

Dependence on Specific Hardware

The reported speedups were measured on a single platform (Apple M4 Pro), so the benefits may depend heavily on the memory and storage characteristics of the target hardware.

Expert Commentary

The proposed solution is a meaningful step toward practical multi-agent LLM inference on edge devices. Persisting Q4 KV caches turns eviction from a 15.7-second re-prefill (at 4K context) into a fast disk restore, while keeping perplexity close to the FP16 baseline. However, the implementation complexity and the single-platform evaluation may limit immediate adoption. Further work should quantify the trade-offs between restore latency, quantization precision, and disk I/O, and examine how the block pool behaves under contention in large-scale multi-agent deployments with careful attention to system architecture and resource allocation.

Recommendations

  • Further research is needed to explore the trade-offs between performance, precision, and implementation complexity in the proposed solution.
  • Careful consideration should be given to system architecture and resource allocation when deploying the proposed solution in large-scale multi-agent systems.

Sources

  • arXiv:2603.04428v1
  • https://github.com/yshk-mxim/agent-memory