
Prompt Compression in Production Task Orchestration: A Pre-Registered Randomized Trial

Warren Johnson, Charles Lee

arXiv:2603.23525v1. Abstract: The economics of prompt compression depend not only on reducing input tokens but on how compression changes output length, which is typically priced several times higher. We evaluate this in a pre-registered six-arm randomized controlled trial of prompt compression on production multi-agent task-orchestration, analyzing 358 successful Claude Sonnet 4.5 runs (59-61 per arm) drawn from a randomized corpus of 1,199 real orchestration instructions. We compare an uncompressed control with three uniform retention rates (r=0.8, 0.5, 0.2) and two structure-aware strategies (entropy-adaptive and recency-weighted), measuring total inference cost (input+output) and embedding-based response similarity. Moderate compression (r=0.5) reduced mean total cost by 27.9%, while aggressive compression (r=0.2) increased mean cost by 1.8% despite substantial input reduction, consistent with small mean output expansion (1.03x vs. control) and heavy-tailed uncertainty. Recency-weighted compression achieved 23.5% savings and, together with moderate compression, occupied the empirical cost-similarity Pareto frontier, whereas aggressive compression was dominated on both cost and similarity. These results show that "compress more" is not a reliable production heuristic and that output tokens must be treated as a first-class outcome when designing compression policies.

Executive Summary

This article summarizes a pre-registered, six-arm randomized controlled trial of prompt compression on production multi-agent task orchestration. The study compares an uncompressed control with three uniform retention rates (r=0.8, 0.5, 0.2) and two structure-aware strategies (entropy-adaptive and recency-weighted), measuring total inference cost (input plus output) and embedding-based response similarity. Moderate compression (r=0.5) reduced mean total cost by 27.9% and recency-weighted compression by 23.5%; together these two arms occupied the empirical cost-similarity Pareto frontier. Aggressive compression (r=0.2) increased mean cost by 1.8% and was dominated on both cost and similarity. The results indicate that 'compress more' is not a reliable production heuristic and that output tokens must be treated as a first-class outcome when designing compression policies.
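The abstract does not say how embedding-based response similarity is computed. A common choice is cosine similarity between two embedding vectors, sketched below as an assumption rather than the authors' method. The function name and the toy vectors are illustrative, and producing the embeddings themselves is out of scope.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors.

    Assumed metric: the source only says "embedding-based response similarity".
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the embeddings of a control response and a
# compressed-prompt response (hypothetical values).
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Under this metric, a score near 1 means the response to the compressed prompt points in nearly the same embedding direction as the control response, and lower scores indicate drift.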

Key Points

  • Prompt compression changes output length as well as input length, so its effect on total inference cost cannot be read off input savings alone.
  • Moderate compression (r=0.5) reduced mean total cost by 27.9% and, together with recency-weighted compression (23.5% savings), sat on the empirical cost-similarity Pareto frontier.
  • Aggressive compression (r=0.2) increased mean cost by 1.8% despite substantial input reduction and was dominated on both cost and similarity.
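To make the input/output trade-off concrete, here is a minimal sketch of a total-cost calculation. The 5x output-to-input price ratio, the token counts, and all names are assumptions chosen for illustration; they are not figures from the paper or from any provider's price sheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-token prices in arbitrary units (hypothetical values)."""
    input_price: float
    output_price: float


def total_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Total inference cost = input cost + output cost."""
    return input_tokens * pricing.input_price + output_tokens * pricing.output_price


def relative_cost_change(control: float, treated: float) -> float:
    """Fractional change versus control; negative values are savings."""
    return (treated - control) / control


def break_even_output_ratio(input_tokens: int, kept_input_tokens: int,
                            output_tokens: int, pricing: Pricing) -> float:
    """Output growth factor at which the input savings are fully cancelled."""
    input_savings = (input_tokens - kept_input_tokens) * pricing.input_price
    return 1.0 + input_savings / (output_tokens * pricing.output_price)


# Assumed: output tokens cost 5x input tokens (illustrative ratio only).
pricing = Pricing(input_price=1.0, output_price=5.0)

# Hypothetical control run: 1,000 prompt tokens, 400 response tokens.
control_cost = total_cost(1000, 400, pricing)

# Hypothetical compressed run: 20% of the prompt kept, response 40% longer.
treated_cost = total_cost(200, 560, pricing)

change = relative_cost_change(control_cost, treated_cost)
ratio = break_even_output_ratio(1000, 200, 400, pricing)
```

In this toy setup an 80% cut in input tokens is cancelled exactly by a 40% longer response, which is why output length has to be measured alongside input reduction.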

Merits

Strength in Methodology

The study is a pre-registered randomized controlled trial. Pre-registration limits analytic flexibility and randomization limits confounding, which strengthens confidence in the reported comparisons.

Insight into Prompt Compression

The trial runs on a corpus of 1,199 real orchestration instructions from a production multi-agent system, so its results bear directly on how compression behaves in deployment.

Empirical Cost-Similarity Pareto Frontier

By measuring both total inference cost and response similarity, the study can place each arm on an empirical cost-similarity Pareto frontier. Moderate (r=0.5) and recency-weighted compression lie on that frontier, while aggressive compression (r=0.2) is dominated on both dimensions.
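The frontier idea can be sketched in a few lines: an arm is on the frontier if no other arm is at least as good on both cost and similarity and strictly better on one. The arm names and numbers below are placeholders invented for the example, not the paper's measurements.

```python
from typing import NamedTuple


class Arm(NamedTuple):
    name: str
    cost: float        # relative total cost, lower is better
    similarity: float  # response similarity, higher is better


def dominates(a: Arm, b: Arm) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.similarity >= b.similarity
    strictly_better = a.cost < b.cost or a.similarity > b.similarity
    return no_worse and strictly_better


def pareto_frontier(arms: list[Arm]) -> list[Arm]:
    """Arms that no other arm dominates, in their original order."""
    return [a for a in arms if not any(dominates(b, a) for b in arms)]


# Placeholder values only; NOT results from the study.
arms = [
    Arm("baseline", 1.00, 1.00),
    Arm("arm_a", 0.70, 0.90),
    Arm("arm_b", 0.80, 0.95),
    Arm("arm_c", 1.05, 0.80),  # costs more and is less similar than baseline
]
frontier_names = [a.name for a in pareto_frontier(arms)]
```

Here `arm_c` drops out because `baseline` beats it on both axes, which is the sense in which the abstract calls aggressive compression "dominated".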

Demerits

Limitation in Generalizability

The analysis covers a single model (Claude Sonnet 4.5) and one production orchestration corpus, so the findings may not carry over to other models, task-orchestration systems, or domains. Further research is needed to establish how widely they apply.

Heavy-Tailed Uncertainty

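The abstract describes the cost increase under aggressive compression as consistent with heavy-tailed uncertainty. With 59-61 runs per arm, a few unusually expensive runs can move the mean, so small differences such as the 1.8% increase should be interpreted with caution.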
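One way to see how wide the uncertainty on a mean difference can be with heavy-tailed costs is a percentile bootstrap. The sketch below uses synthetic log-normal data and a hypothetical helper name; it is an illustration of the general technique, not the paper's analysis or data.

```python
import random
import statistics


def bootstrap_mean_diff_ci(control, treated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for mean(treated) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each arm independently, with replacement.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treated, k=len(treated))
        diffs.append(statistics.fmean(t) - statistics.fmean(c))
    diffs.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return diffs[lo_idx], diffs[hi_idx]


# Synthetic heavy-tailed "costs": 60 draws per arm from the same log-normal
# distribution, so the true mean difference is zero by construction.
data_rng = random.Random(42)
control = [data_rng.lognormvariate(0.0, 1.0) for _ in range(60)]
treated = [data_rng.lognormvariate(0.0, 1.0) for _ in range(60)]

low, high = bootstrap_mean_diff_ci(control, treated)
```

With samples of this size, the interval width can be compared against the observed difference in means to judge whether a small effect is distinguishable from noise.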

Expert Commentary

The central finding is that output tokens belong in compression policy design. Because output tokens are typically priced several times higher than input tokens, even a small change in response length can offset the savings from a shorter prompt: in this trial, the most aggressive arm (r=0.2) cost slightly more than the control despite substantial input reduction. The results call for further research into how compression changes model output, and they are directly relevant to anyone building cost-efficient task-orchestration systems.

Recommendations

  • Future research should investigate the applicability of these results to other task-orchestration systems and domains.
  • Developers should measure output token counts and response similarity, not only input reduction, when evaluating a prompt compression policy.

Sources

Original: arXiv - cs.CL