MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
arXiv:2602.16313v1 Announce Type: new
Abstract: Existing evaluations of agents with memory typically assess memorization and action in isolation. One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions. Another class focuses on agents acting in single-session tasks without the need for long-term memory. However, in realistic settings, memorization and action are tightly coupled: agents acquire memory while interacting with the environment, and subsequently rely on that memory to solve future tasks. To capture this setting, we introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops. The benchmark consists of human-crafted agentic tasks with explicitly interdependent subtasks, where agents must learn from earlier actions and feedback by distilling experiences into memory, and subsequently use that memory to guide later actions to solve the overall task. MemoryArena supports evaluation across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning, and reveals that agents with near-saturated performance on existing long-context memory benchmarks like LoCoMo perform poorly in our agentic setting, exposing a gap in current evaluations for agents with memory.
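To make the multi-session Memory-Agent-Environment loop described in the abstract concrete, the sketch below shows one way such an evaluation could be driven: each subtask is a session, the agent acts against the environment, and what it learns is distilled into a memory store that later, dependent subtasks read from. All names here (MemoryStore, run_task, agent.act, env.step) are illustrative assumptions, not MemoryArena's actual API.

```python
# Minimal sketch of a multi-session Memory-Agent-Environment loop.
# All class and method names are hypothetical, not MemoryArena's real API.
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Persistent memory carried across sessions."""
    entries: list[str] = field(default_factory=list)

    def write(self, experience: str) -> None:
        self.entries.append(experience)

    def read(self) -> str:
        return "\n".join(self.entries)


def run_task(agent, env, subtasks):
    """Run interdependent subtasks across sessions, sharing one memory store."""
    memory = MemoryStore()
    results = []
    for subtask in subtasks:  # one session per subtask
        obs = env.reset(subtask)
        done = False
        while not done:
            # The agent conditions each action on the current observation
            # plus everything it has distilled into memory so far.
            action = agent.act(obs, memory.read())
            obs, feedback, done = env.step(action)
        # After the session, distill the experience into memory so that
        # later, dependent subtasks can build on it.
        memory.write(agent.summarize(subtask, feedback))
        results.append(feedback)
    return results
```

The key design point this sketch highlights is that memory is the only state shared across sessions, so an agent that fails to distill useful experience cannot solve the downstream subtasks, which is precisely the coupling the benchmark is built to measure.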
Executive Summary
The article introduces MemoryArena, a novel benchmarking framework designed to evaluate agent memory in interdependent multi-session tasks. Unlike existing benchmarks that assess memorization and action in isolation, MemoryArena integrates these aspects into a unified evaluation gym. It consists of human-crafted tasks with interdependent subtasks, requiring agents to learn from past interactions and use memory to guide future actions. The study reveals that agents performing well on existing long-context memory benchmarks like LoCoMo struggle in this more realistic setting, highlighting a significant gap in current evaluation methods.
Key Points
- ▸ MemoryArena integrates memorization and action in a unified evaluation framework.
- ▸ Existing benchmarks fail to capture the interdependence of memory and action in realistic settings.
- ▸ Agents performing well on current benchmarks struggle in MemoryArena, exposing evaluation gaps.
Merits
Comprehensive Evaluation Framework
MemoryArena provides a holistic approach to evaluating agent memory by integrating memorization and action in a multi-session context, which is more aligned with real-world scenarios.
Realistic Task Design
The benchmark includes human-crafted tasks with interdependent subtasks, requiring agents to learn from past experiences and use memory effectively, thus simulating realistic agent-environment interactions.
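As a purely illustrative example of what interdependent subtasks might look like, the snippet below sketches a hypothetical task specification in which later subtasks declare dependencies on the outcomes of earlier ones. The field names and task content are assumptions for exposition, not MemoryArena's actual schema or data.

```python
# Hypothetical task specification with explicit subtask dependencies.
# Field names and content are illustrative, not MemoryArena's real format.
task = {
    "task_id": "trip-planning-001",
    "subtasks": [
        {
            "id": "s1",
            "goal": "Browse the booking site and note the user's seating preference.",
            "depends_on": [],
        },
        {
            "id": "s2",
            "goal": "Reserve a flight consistent with the preference learned in s1.",
            "depends_on": ["s1"],  # unsolvable without memory of s1
        },
        {
            "id": "s3",
            "goal": "Book a hotel whose dates match the flight chosen in s2.",
            "depends_on": ["s2"],
        },
    ],
}
```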
Exposure of Evaluation Gaps
The study reveals that agents performing well on existing benchmarks like LoCoMo perform poorly in MemoryArena, highlighting the need for more comprehensive evaluation methods.
Demerits
Limited Scope of Tasks
While MemoryArena spans four task domains (web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning), this scope remains narrow relative to the range of real-world scenarios agents may encounter.
Complexity in Implementation
The complexity of designing and implementing tasks that accurately simulate real-world interdependencies might pose challenges for widespread adoption.
Potential Bias in Human-Crafted Tasks
The reliance on human-crafted tasks could introduce biases, potentially affecting the generalizability of the benchmark results.
Expert Commentary
The introduction of MemoryArena represents a significant advance in the evaluation of agent memory. By coupling memorization and action in a multi-session context, the benchmark addresses a critical gap in current evaluation methods, and its finding that agents which near-saturate existing long-context benchmarks such as LoCoMo perform poorly in the agentic setting shows how much that gap matters in practice. However, the complexity of designing tasks that faithfully capture real-world interdependencies, together with the potential for bias in human-crafted evaluations, remains a challenge for widespread adoption. Overall, MemoryArena provides a valuable framework for evaluating agent memory, but further refinement and expansion of its scope will be needed for broad applicability and effectiveness.
Recommendations
- ✓ Further research should focus on expanding the scope of tasks within MemoryArena to cover a broader range of real-world scenarios.
- ✓ Efforts should be made to mitigate potential biases in human-crafted tasks by incorporating diverse and representative task designs.