Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions
arXiv:2603.04191v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly serving as personal assistants, where users share complex and diverse preferences over extended interactions. However, assessing how well LLMs can follow these preferences in realistic, long-term situations remains underexplored. This work proposes RealPref, a benchmark for evaluating realistic preference-following in personalized user-LLM interactions. RealPref features 100 user profiles, 1300 personalized preferences, four types of preference expression (ranging from explicit to implicit), and long-horizon interaction histories. It includes three types of test questions (multiple-choice, true-or-false, and open-ended), with detailed rubrics for LLM-as-a-judge evaluation. Results indicate that LLM performance significantly drops as context length grows and preference expression becomes more implicit, and that generalizing user preference understanding to unseen scenarios poses further challenges. RealPref and these findings provide a foundation for future research to develop user-aware LLM assistants that better adapt to individual needs. The code is available at https://github.com/GG14127/RealPref.
Executive Summary
This article introduces RealPref, a benchmark for evaluating the ability of Large Language Models (LLMs) to follow user preferences in long-term interactions. The benchmark features 100 user profiles, 1300 personalized preferences, and various types of preference expression. The results show that LLM performance drops as context length grows and preference expression becomes more implicit. This work provides a foundation for developing user-aware LLM assistants that better adapt to individual needs.
Key Points
- ▸ Introduction of RealPref benchmark for evaluating LLMs' preference-following abilities
- ▸ LLM performance degrades with increasing context length and implicit preference expression
- ▸ Generalizing user preference understanding to unseen scenarios poses significant challenges
Merits
Comprehensive Benchmark
RealPref provides a thorough evaluation framework for assessing LLMs' ability to follow user preferences in realistic, long-term interactions.
Demerits
Limited Generalizability
The study's findings may not generalize to all types of users or scenarios, as the benchmark is limited to a specific set of user profiles and preferences.
Expert Commentary
The introduction of RealPref is a significant step towards developing more sophisticated and user-aware LLM assistants. The finding that performance degrades on implicit preferences and unseen scenarios highlights a concrete gap: current models struggle to generalize preference understanding beyond the contexts in which preferences were expressed, which calls for continued research in this area. More broadly, evaluating preference following over realistic, long-term interactions underscores the importance of treating the complexities of human-AI interaction as a first-class concern in AI system design.
Recommendations
- ✓ Future research should focus on developing more advanced techniques for generalizing user preference understanding to unseen scenarios
- ✓ Developers of LLM-based systems should prioritize transparency, explainability, and user-centric design to ensure that their systems can effectively adapt to individual needs and preferences