
Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

arXiv:2603.16120v1 Abstract: Deep Research (DR) tools (e.g. OpenAI DR) help researchers cope with ballooning publishing counts. Such tools can synthesize scientific papers to answer researchers' queries, but lack understanding of their users. We change that in MyScholarQA (MySQA), a personalized DR tool that: 1) infers a profile of a user's research interests; 2) proposes personalized actions for a user's input query; and 3) writes a multi-section report for the query that follows user-approved actions. We first test MySQA with NLP's standard protocol: we design a benchmark of synthetic users and LLM judges, where MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover all aspects of personalized DR users value, so we interview users in an online version of MySQA to unmask them. We reveal nine nuanced errors of personalized DR undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.

Executive Summary

The article critiques the current state of personalization in deep research (DR) tools, arguing that existing models lack true user understanding. The authors introduce MyScholarQA (MySQA), a personalized DR tool that infers a user's research interests, proposes tailored actions, and writes multi-section reports that follow the actions the user approves. While MySQA beats baselines on synthetic benchmarks judged by LLMs, interviews with real users of an online deployment uncover nine nuanced personalization errors that those judges cannot detect, revealing a critical gap between synthetic evaluation and real-world user experience. The study underscores that real user engagement is indispensable for meaningful personalization in DR systems.
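
To make the three-stage design concrete, below is a minimal Python sketch of such a pipeline, assuming only the workflow the abstract describes. Every name in it (UserProfile, infer_profile, propose_actions, write_report) is an illustrative assumption rather than the paper's API, and the LLM and retrieval calls are stubbed out.

```python
# A minimal sketch of a MySQA-style three-stage pipeline, as described in the
# abstract. Every name below is an illustrative assumption, not the paper's
# API; LLM and retrieval calls are stubbed out.
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Stage 1 output: inferred research interests for one user."""
    interests: list[str] = field(default_factory=list)


def infer_profile(publications: list[str]) -> UserProfile:
    # Stage 1: a real system would use an LLM to extract recurring topics
    # from the user's publications; stubbed here for illustration.
    return UserProfile(interests=["personalization", "LLM-based evaluation"])


def propose_actions(profile: UserProfile, query: str) -> list[str]:
    # Stage 2: draft query-specific actions tailored to the profile; these
    # are shown to the user for approval before any report is written.
    return [f"Relate '{query}' to prior work on {topic}" for topic in profile.interests]


def write_report(query: str, approved_actions: list[str]) -> str:
    # Stage 3: synthesize a multi-section report that follows only the
    # user-approved actions (paper retrieval and generation elided).
    sections = [f"## {action}\n(synthesized from retrieved papers)" for action in approved_actions]
    return f"# Report: {query}\n" + "\n\n".join(sections)


profile = infer_profile(["paper on personalized search", "paper on LLM judges"])
actions = propose_actions(profile, "evaluating deep research tools")
print(write_report("evaluating deep research tools", approved_actions=actions[:1]))
```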

Key Points

  • MySQA personalizes DR by inferring a profile of the user's research interests, proposing tailored actions, and writing reports that follow user-approved actions
  • Benchmarks built from synthetic users and LLM judges show MySQA beating baselines, but cover only part of what users value
  • Interviews with real users surface nine nuanced personalization errors that the LLM judges miss

Merits

Innovative Evaluation Method

The authors effectively combine synthetic benchmarking with qualitative user interviews to identify nuanced personalization issues, offering a more holistic assessment of DR tools.
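
For readers unfamiliar with this protocol, a rough sketch of the synthetic side is shown below: synthetic user profiles paired with an LLM judge that scores each report. The rubric, the 1-5 scale, and all names here are assumptions for illustration, not the paper's actual benchmark.

```python
# A minimal sketch of a synthetic-user + LLM-judge evaluation loop, in the
# spirit of the benchmark the paper describes. The rubric, the 1-5 scale, and
# all names here are assumptions for illustration, not the paper's protocol.
SYNTHETIC_USERS = [
    {"profile": "NLP PhD student working on summarization", "query": "evaluate DR tools"},
    {"profile": "HCI researcher studying user trust", "query": "personalization pitfalls"},
]


def judge_llm(prompt: str) -> int:
    """Placeholder for a real LLM-judge call returning a 1-5 rubric score."""
    return 4  # stub: a real judge would read and score the report text


def evaluate(system) -> float:
    """Average judge score for `system` over the synthetic users."""
    scores = []
    for user in SYNTHETIC_USERS:
        report = system(user["profile"], user["query"])
        prompt = (
            f"Profile: {user['profile']}\nQuery: {user['query']}\n"
            f"Report: {report}\nScore personalized action-following from 1 to 5."
        )
        scores.append(judge_llm(prompt))
    return sum(scores) / len(scores)


# The paper's caution applies here: this average can look strong while
# missing nuanced errors that only interviews with real users surface.
print(evaluate(lambda profile, query: f"Report on {query}, tailored to: {profile}"))
```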

Demerits

Limited Generalizability

The findings are derived from a specific context (NLP) and may not fully translate to broader DR applications across disciplines or user types.

Expert Commentary

This article makes a significant contribution to the discourse on AI-assisted research by challenging the prevailing reliance on synthetic evaluation metrics. The authors' shift from algorithmic performance to user-centric validation is both timely and necessary. While LLM judges offer a scalable evaluation mechanism, their inability to capture the full spectrum of user experience, particularly its subtle, context-dependent nuances, highlights a fundamental limitation in current AI evaluation paradigms. MySQA's approach, though resource-intensive, sets a new benchmark for evaluating personalization by anchoring it in authentic user interaction. The broader implications extend beyond DR tools: any AI system intended to support human decision-making must be validated through direct engagement with its end users. This work may catalyze a shift in AI evaluation across domains, urging stakeholders to prioritize real-world usability over algorithmic metrics alone.

Recommendations

  • Integrate real user co-design and iterative feedback loops into DR tool development pipelines.
  • Develop standardized frameworks for capturing qualitative user insights as complementary metrics to algorithmic performance indicators (one possible record format is sketched below).
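
As one hypothetical shape such a framework could take, the sketch below stores per-session quantitative judge scores alongside coded qualitative interview findings. Every field name, error code, and value is an invented example, not data or a schema from the paper.

```python
# One possible (illustrative) per-session record pairing quantitative judge
# scores with coded qualitative interview findings; every field name, code,
# and value below is a hypothetical example, not data from the paper.
import json

session_record = {
    "session_id": "s-042",
    "quantitative": {"citation_f1": 0.71, "action_following": 4.2},
    "qualitative": [
        {
            "code": "stale-profile",
            "severity": "major",
            "note": "Inferred profile reflected interests from two years ago.",
        },
        {
            "code": "over-personalization",
            "severity": "minor",
            "note": "Every section was forced back to the user's subfield.",
        },
    ],
}
print(json.dumps(session_record, indent=2))
```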

Sources

  • arXiv:2603.16120v1: "Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users", https://arxiv.org/abs/2603.16120