PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
arXiv:2603.23231v1 Announce Type: new Abstract: Empowering large language models with long-term memory is crucial for building agents that adapt to users' evolving needs. However, prior evaluations typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events that drive the evolution of user preferences. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries inserted over time. We design both multiple-choice and interactive tasks to probe the model's understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems can extract more precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross-domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open-sourced at https://github.com/PolarisLiu1/PERMA.
Executive Summary
The paper introduces PERMA, a benchmark for evaluating personalized memory agents through realistic, temporally ordered interactions spanning multiple sessions and domains. It addresses a critical gap in current evaluations, which interleave preference dialogues with irrelevant conversations and thereby reduce the task to needle-in-a-haystack retrieval; PERMA instead captures how user preferences emerge gradually and accumulate across interactions. By integrating text variability and linguistic alignment, the benchmark mimics erratic user inputs and idiolectal diversity. Experiments show that memory systems that link related interactions identify preferences more precisely and consume fewer tokens than traditional semantic retrieval of raw dialogues. However, persistent challenges remain in maintaining a coherent persona across temporal depth and under cross-domain interference. The study offers a valuable framework for advancing personalized memory research in LLMs.
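The contrast between event-linked memory and raw semantic retrieval can be made concrete with a minimal sketch. The `Event` schema and link-following logic below are hypothetical illustrations of the idea, not PERMA's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """One interaction event (hypothetical schema, not PERMA's actual format)."""
    t: int                 # position on the interaction timeline
    domain: str            # e.g. "dining", "travel"
    text: str              # raw dialogue content
    links: list = field(default_factory=list)  # indices of related earlier events

def linked_chain(events, idx):
    """Follow link pointers backwards to collect the chain of related events."""
    chain, stack, seen = [], [idx], set()
    while stack:
        i = stack.pop()
        if i in seen:
            continue
        seen.add(i)
        chain.append(events[i])
        stack.extend(events[i].links)
    return sorted(chain, key=lambda e: e.t)

# A toy timeline: the user's dining preference evolves across linked events,
# with an unrelated travel event interleaved as noise.
events = [
    Event(0, "dining", "tried spicy ramen, liked it"),
    Event(1, "travel", "booked a flight to Oslo"),
    Event(2, "dining", "spicy food upset my stomach lately", links=[0]),
    Event(3, "dining", "please suggest mild dishes from now on", links=[2]),
]

# Event-linked retrieval surfaces the whole preference evolution, in order,
# while skipping the unrelated travel event entirely:
chain = linked_chain(events, 3)
print([e.text for e in chain])
```

A similarity-based retriever over raw dialogues would instead return whichever turns happen to match the query embedding, with no guarantee of recovering the full chain or its temporal order.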
Key Points
- ▸ Introduction of PERMA as a benchmark for evaluating personalized memory agents
- ▸ Focus on event-driven preference evolution across temporally ordered interactions
- ▸ Simulation of real-world linguistic variability and idiolectal diversity
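The timeline dependence highlighted in the second key point can be illustrated with a toy multiple-choice probe. The scoring logic is a hypothetical sketch, not PERMA's actual task format:

```python
# Toy multiple-choice probe: the correct answer depends on WHERE on the
# timeline the query is inserted, so static preference recall is not enough.

timeline = [
    (0, "user loves spicy food"),
    (5, "user switched to mild dishes for health reasons"),
]

def answer_probe(timeline, query_t, options):
    """Pick the option best matching the latest preference stated at or before query_t."""
    stated = [pref for t, pref in timeline if t <= query_t]
    latest = stated[-1]  # the most recent statement governs the current persona
    # Crude word-overlap scoring, for illustration only:
    return max(options, key=lambda opt: sum(word in latest for word in opt.split()))

options = ["recommend spicy food", "recommend mild dishes"]
print(answer_probe(timeline, 3, options))   # early on the timeline
print(answer_probe(timeline, 9, options))   # after the preference shifted
```

The same query yields different correct answers at different timeline positions, which is exactly the behavior a needle-in-a-haystack retrieval setup fails to test.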
Merits
Innovative Framework
PERMA provides a realistic, temporally contextualized evaluation platform that better reflects user behavior than existing needle-in-a-haystack approaches.
Demerits
Scalability Challenge
Even memory systems that perform well on the benchmark's tasks struggle to maintain coherence across extended temporal depth and under cross-domain interference, which remains a significant limitation.
Expert Commentary
PERMA represents a substantive advance in the evaluation of personalized memory systems within LLMs. The authors rightly identify a critical flaw in prior assessments: interleaving preference dialogues with unrelated conversations reduces memory testing to needle-in-a-haystack retrieval and obscures how preferences actually evolve. By establishing a benchmark that mirrors the temporal, contextual, and linguistic complexity of real user interactions, PERMA raises the rigor of memory-agent evaluation. The inclusion of linguistic alignment and text variability adds a level of realism previously absent from benchmarking. However, the persistent difficulty in sustaining persona coherence across prolonged timelines and domain shifts signals a deeper systemic issue: current memory architectures may lack sufficient contextual embedding or adaptive recalibration mechanisms. This observation points toward future research directions such as dynamic memory pruning, contextual drift detection, and hybrid memory-reasoning architectures. The open-sourcing of code and data further enhances reproducibility and accelerates progress. PERMA is not merely a benchmark; it is a catalyst for recalibrating the trajectory of personalized memory research.
Recommendations
- ✓ 1. Researchers should integrate PERMA into their evaluation pipelines when developing memory-augmented LLMs.
- ✓ 2. Future studies should explore hybrid architectures combining event-driven memory with contextual drift mitigation to address coherence challenges identified in the study.
Sources
Original: arXiv - cs.AI