
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

arXiv:2602.22769v1 Announce Type: new Abstract: Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between practical applications and current evaluation standards for agent memory: existing benchmarks primarily focus on dialogue-centric, human-agent interactions. In reality, agent memory consists of a continuous stream of agent-environment interactions that are primarily composed of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories that scale to arbitrary horizons, paired with rule-based QA. Our comprehensive study shows that existing memory systems underperform on AMA-Bench primarily because they lack causality and objective information and are constrained by the lossy nature of similarity-based retrieval employed by many memory systems. To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval. Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baselines by 11.16%.

Executive Summary

This study bridges the gap between current evaluation standards and the practical deployment of Large Language Models (LLMs) as autonomous agents. AMA-Bench, a new benchmark, assesses long-horizon memory for LLMs in real-world agentic applications and exposes the limitations of existing memory systems. The study also proposes AMA-Agent, a memory system built around a causality graph and tool-augmented retrieval, which achieves 57.22% average accuracy on AMA-Bench. This work has significant implications for the development and deployment of LLMs in complex applications.

Key Points

  • AMA-Bench evaluates long-horizon memory for LLMs in real-world agentic applications
  • Existing memory systems underperform because they lack causality and objective information and rely on lossy similarity-based retrieval
  • AMA-Agent, a memory system featuring a causality graph and tool-augmented retrieval, surpasses the strongest baselines by 11.16%
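The lossy similarity-based retrieval criticized in the abstract can be illustrated with a toy example. The embedding and retrieval functions below are simplifications (a bag-of-words cosine stand-in for dense embeddings), not the paper's implementation: a top-k similarity query about a failure surfaces only the lexically closest trajectory entry and silently drops the causally relevant earlier step.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; real systems use dense vectors.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, memory, k=1):
    # Top-k similarity retrieval: keeps only the k most similar entries,
    # discarding everything else -- including causal antecedents.
    q = embed(query)
    scored = sorted(memory, key=lambda m: cosine(q, embed(m)), reverse=True)
    return scored[:k]

memory = [
    "step 1: agent opened config.yaml and set retries to 0",
    "step 2: agent called the deploy tool",
    "step 3: deploy failed with a timeout error",
]
# The timeout is *caused* by step 1, but a similarity query about the
# failure surfaces only the lexically closest entry.
hits = retrieve("why did the deploy fail with a timeout", memory, k=1)
```

The retrieved context mentions the symptom but not its cause, which is exactly the failure mode the benchmark is designed to surface.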

Merits

Comprehensive Evaluation

AMA-Bench provides a comprehensive evaluation framework for long-horizon memory in LLMs, highlighting the limitations of existing systems.

Innovative Approach

AMA-Agent's causality graph and tool-augmented retrieval offer a novel way to address the limitations of existing memory systems.
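The paper does not detail AMA-Agent's internals, but the two named ideas can be sketched as follows. Every name and data structure here is an assumption for illustration, not the paper's API: trajectory events are stored as nodes with causal edges, and retrieval is exposed as a callable tool (`trace`) that walks the graph instead of relying on similarity search.

```python
# Hypothetical sketch: a causality graph over trajectory events, with
# retrieval exposed as a tool the agent can call. Not the paper's API.
class CausalMemory:
    def __init__(self):
        self.events = {}      # event_id -> description
        self.caused_by = {}   # event_id -> list of antecedent event ids

    def record(self, event_id, description, causes=()):
        self.events[event_id] = description
        self.caused_by[event_id] = list(causes)

    def trace(self, event_id):
        """Walk the causality graph back from an event to its root causes."""
        chain, stack = [], [event_id]
        while stack:
            eid = stack.pop()
            chain.append((eid, self.events[eid]))
            stack.extend(self.caused_by.get(eid, []))
        return chain

mem = CausalMemory()
mem.record("e1", "set retries to 0 in config.yaml")
mem.record("e2", "invoked deploy tool", causes=["e1"])
mem.record("e3", "deploy timed out", causes=["e2"])

# A tool-augmented agent calls trace() to recover the full causal chain
# behind the failure, rather than hoping similarity search finds it.
chain = mem.trace("e3")
```

The design choice this illustrates: causal edges make antecedents reachable by graph traversal, so retrieval completeness no longer depends on lexical or embedding similarity between the question and the root cause.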

Improved Performance

AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baselines by 11.16%.

Demerits

Narrow Scope

AMA-Bench is narrowly focused on LLMs, potentially limiting its applicability to other types of agents.

Scalability

AMA-Agent's performance may degrade as the complexity of the agentic application increases.

Interpretability

The causality graph and tool-augmented retrieval components of AMA-Agent may be challenging to interpret and audit.

Expert Commentary

The study presents a significant advancement in the evaluation and development of long-horizon memory for LLMs. AMA-Bench and AMA-Agent demonstrate a deep understanding of where existing memory systems fall short and propose innovative solutions in response. The work has far-reaching implications for the development and deployment of LLMs in complex applications, though further research is needed to fully explore the scalability and interpretability of AMA-Agent. Overall, this study is a significant contribution to the field of AI and LLMs.

Recommendations

  • Future research should explore the scalability of AMA-Agent in more complex agentic applications.
  • Developers should consider incorporating AMA-Bench and AMA-Agent into their evaluation and development pipelines for LLMs.

Sources