LMEB: Long-horizon Memory Embedding Benchmark
arXiv:2603.12572v1

Abstract: Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which focus narrowly on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities on complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types (episodic, dialogue, semantic, and procedural), with both AI-generated and human-annotated data. These memory types differ in level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) larger models do not always perform better; and (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advances in text embedding for long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.
Executive Summary
This article introduces the Long-horizon Memory Embedding Benchmark (LMEB), a framework designed to evaluate embedding models' ability to handle the complex, long-horizon memory retrieval tasks faced by memory-augmented systems. LMEB assesses 15 widely used embedding models across 22 datasets, 4 memory types, and 193 zero-shot retrieval tasks, revealing that larger models do not always perform better and that LMEB scores are largely independent of MTEB scores. The results suggest the field has yet to converge on a universal model that excels across all memory retrieval tasks. By providing a standardized and reproducible framework, LMEB fills a crucial gap in the evaluation of long-term, context-dependent memory retrieval.
Key Points
- ▸ LMEB introduces a comprehensive framework for evaluating embedding models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information.
- ▸ The benchmark spans 22 datasets and 193 zero-shot retrieval tasks across four memory types (episodic, dialogue, semantic, and procedural), capturing distinct aspects of memory retrieval; a sketch of this style of evaluation follows this list.
- ▸ Results reveal that larger models do not always perform better, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval.
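The evaluation style the paper describes is zero-shot retrieval over fragmented, temporally distant memories. Below is a minimal sketch of how one such task can be scored with an off-the-shelf embedder; the toy memories, the all-MiniLM-L6-v2 model choice, and the Recall@1 metric are illustrative assumptions, not LMEB's actual harness or data.

```python
# A minimal sketch of zero-shot memory retrieval evaluation, in the spirit of
# LMEB's setup. The toy data, model, and metric below are illustrative
# assumptions, not the benchmark's actual datasets or scoring code.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any off-the-shelf embedder

# Hypothetical memory fragments accumulated over a long horizon.
memories = [
    "2024-01-03: User mentioned they are allergic to peanuts.",
    "2024-02-17: User booked a flight to Osaka for late March.",
    "2024-03-02: User asked how to export notes to Markdown.",
]
query = "What food allergies does the user have?"
relevant = {0}  # index of the memory that answers the query

# Embed, rank by cosine similarity, and score Recall@1.
mem_emb = model.encode(memories, normalize_embeddings=True)
q_emb = model.encode([query], normalize_embeddings=True)
scores = (mem_emb @ q_emb.T).ravel()  # cosine similarity (unit-normalized)
ranking = np.argsort(-scores)         # best-scoring memory first
recall_at_1 = float(ranking[0] in relevant)
print(f"Recall@1 = {recall_at_1}")
```

A real harness would aggregate such scores (e.g., NDCG or Recall@k) over thousands of queries per dataset; this snippet only illustrates the per-query mechanics.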
Merits
Comprehensive Evaluation Framework
LMEB provides a standardized and reproducible evaluation framework, filling a crucial gap in memory embedding evaluation.
Diverse Memory Types
The benchmark assesses embedding models across four memory types (episodic, dialogue, semantic, and procedural), which differ in level of abstraction and temporal dependency and capture distinct aspects of memory retrieval.
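To make the four-type taxonomy concrete, here is one hypothetical way to represent a single LMEB-style task instance in code; the schema, field names, and example data are assumptions for illustration, not the benchmark's actual data format.

```python
# Hypothetical schema for one LMEB-style retrieval task instance. The
# memory-type labels follow the paper's taxonomy, but the structure itself
# is an illustrative assumption, not LMEB's actual format.
from dataclasses import dataclass
from typing import Literal

MemoryType = Literal["episodic", "dialogue", "semantic", "procedural"]

@dataclass
class MemoryRetrievalTask:
    memory_type: MemoryType   # abstraction level / temporal dependency vary by type
    query: str                # what the system needs to recall
    corpus: list[str]         # fragmented, temporally distant memory entries
    relevant_ids: set[int]    # indices of corpus entries that answer the query

task = MemoryRetrievalTask(
    memory_type="episodic",
    query="Where did the user say they parked on Tuesday?",
    corpus=["Tue 09:12 parked on Elm St, level 2", "Wed 18:40 dinner with Sam"],
    relevant_ids={0},
)
```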
Orthogonality with MTEB
Model performance on LMEB is largely uncorrelated with performance on MTEB, indicating that strength in traditional passage retrieval does not transfer to long-horizon memory retrieval and that no universal model yet excels at both.
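One simple way to probe such orthogonality is to rank-correlate per-model scores across the two benchmarks; the sketch below uses made-up score arrays purely to illustrate the check, not the paper's reported numbers.

```python
# Illustrative orthogonality check: if LMEB and MTEB measured the same
# ability, per-model scores would correlate strongly across benchmarks.
# The scores below are placeholders, not the paper's results.
from scipy.stats import spearmanr

# Hypothetical average scores for the same six models on each benchmark.
mteb_scores = [62.1, 58.4, 66.9, 55.2, 64.0, 60.3]
lmeb_scores = [41.7, 48.2, 39.5, 47.0, 42.8, 44.1]

rho, pval = spearmanr(mteb_scores, lmeb_scores)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
# A rho near zero (or negative) would support the claim that strong passage
# retrieval does not transfer to long-horizon memory retrieval.
```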
Demerits
Limited Generalizability
A portion of the benchmark data is AI-generated, so results may not fully reflect real-world memory workloads, and further validation on deployed memory-augmented systems is needed.
Limited Scalability
Evaluating models on 193 zero-shot retrieval tasks across 22 datasets can demand substantial computational resources and engineering effort, which may limit how easily the benchmark is adopted and extended.
Expert Commentary
The introduction of LMEB marks a significant milestone in the evaluation of memory-augmented systems. Its results highlight the need for further work on long-horizon memory retrieval and provide a comprehensive framework for evaluating embedding models across diverse memory types. While LMEB has limitations, it is a valuable contribution to the field, and its findings carry clear implications for building more capable memory-augmented systems.
Recommendations
- ✓ Further research on long-horizon memory retrieval is needed to improve memory-augmented systems, particularly since LMEB shows that model scale alone does not guarantee better performance.
- ✓ More capable memory-augmented systems have significant implications for industries such as healthcare, finance, and education, and should be prioritized in future research.