Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions
arXiv:2603.04191v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly serving as personal assistants, where users share complex and diverse preferences over extended interactions. However, assessing how well LLMs can follow these preferences in realistic, long-term situations remains underexplored. This work proposes RealPref, a benchmark for evaluating realistic preference-following in personalized user-LLM interactions. RealPref features 100 user profiles, 1300 personalized preferences, four types of preference expression (ranging from explicit to implicit), and long-horizon interaction histories. It includes three types of test questions (multiple-choice, true-or-false, and open-ended), with detailed rubrics for LLM-as-a-judge evaluation. Results indicate that LLM performance significantly drops as context length grows and preference expression becomes more implicit, and that generalizing user preference understanding to unseen scenarios poses further challenges. RealPref and these findings provide a foundation for future research to develop user-aware LLM assistants that better adapt to individual needs. The code is available at https://github.com/GG14127/RealPref.
Executive Summary
This article introduces RealPref, a benchmark for evaluating the ability of Large Language Models (LLMs) to follow user preferences in long-term interactions. The benchmark features 100 user profiles, 1300 personalized preferences, and various types of preference expression. The results show that LLM performance drops as context length grows and preference expression becomes more implicit. This work provides a foundation for developing user-aware LLM assistants that better adapt to individual needs.
Key Points
- ▸ Introduction of RealPref benchmark for evaluating LLMs' preference-following abilities
- ▸ LLM performance degrades with increasing context length and implicit preference expression
- ▸ Generalizing user preference understanding to unseen scenarios poses significant challenges
Merits
Comprehensive Benchmark
RealPref provides a thorough evaluation framework for assessing LLMs' ability to follow user preferences in realistic, long-term interactions.
Demerits
Limited Generalizability
The study's findings may not generalize to all types of users or scenarios, as the benchmark is limited to a specific set of user profiles and preferences.
Expert Commentary
The introduction of RealPref is a significant step towards developing more sophisticated and user-aware LLM assistants. The finding that performance degrades on implicit preferences and unseen scenarios highlights a concrete gap: current models struggle to generalize preference understanding beyond the contexts in which preferences were expressed, which calls for continued research in this area. More broadly, evaluating preference following over realistic, long-term interactions underscores the importance of treating the complexities of human-AI interaction as a first-class concern in AI system design.
Recommendations
- ✓ Future research should focus on developing more advanced techniques for generalizing user preference understanding to unseen scenarios
- ✓ Developers of LLM-based systems should prioritize transparency, explainability, and user-centric design to ensure that their systems can effectively adapt to individual needs and preferences