
ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

arXiv:2602.16938v1 Announce Type: new Abstract: The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- using both "good" and "bad" recommenders -- enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.

Executive Summary

The article presents ConvApparel, a dataset and validation framework designed to bridge the "realism gap" in LLM-based user simulators for conversational AI. The dataset captures a wide spectrum of user experiences through a dual-agent data collection protocol, enriched with first-person annotations of user satisfaction. The validation framework combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Experiments reveal a significant realism gap across all simulators, but data-driven simulators outperform a prompted baseline, particularly in counterfactual validation, where they adapt more realistically to unseen behaviors. The study highlights the importance of realistic user modeling and provides a benchmark for future research.

Key Points

  • ConvApparel introduces a new dataset and validation framework to address the "realism gap" in LLM-based user simulators.
  • The dataset uses a dual-agent data collection protocol with both "good" and "bad" recommenders to capture a wide spectrum of user experiences.
  • The validation framework combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization.
  • Experiments reveal a significant realism gap across all simulators, but data-driven simulators outperform a prompted baseline, adapting more realistically to unseen behaviors.
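The statistical-alignment component compares behavioral statistics of simulated conversations against those of real human conversations. As a minimal sketch of the idea (not the paper's actual metric), one could compare the distributions of per-turn word counts using the Jensen-Shannon divergence; the turn-length data and function names below are hypothetical illustrations:

```python
import math
from collections import Counter

def js_divergence(p, q):
    """Jensen-Shannon divergence (in bits) between two discrete
    distributions given as dicts mapping outcome -> probability."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a, b):
        # KL divergence, skipping zero-probability outcomes in a.
        return sum(a[k] * math.log2(a[k] / b[k])
                   for k in a if a.get(k, 0.0) > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def length_distribution(turn_lengths):
    """Empirical distribution over per-turn word counts."""
    counts = Counter(turn_lengths)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Hypothetical data: word counts per user turn.
human_turns = [4, 7, 7, 9, 12, 5, 7, 8]
simulated_turns = [15, 18, 14, 7, 17, 16, 15, 19]

jsd = js_divergence(length_distribution(human_turns),
                    length_distribution(simulated_turns))
print(f"JS divergence: {jsd:.3f}")  # closer to 0 means better alignment
```

The same comparison could be run over any behavioral statistic (e.g., turn counts or satisfaction ratings); a low divergence on several such statistics is necessary but not sufficient evidence of realism, which is why the framework pairs it with human-likeness and counterfactual tests.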

Merits

Strength in Addressing Realism Gap

ConvApparel addresses a critical limitation in LLM-based user simulators, providing a more realistic understanding of user experiences and preferences.

Demerits

Limitation in Generalizability

Because ConvApparel is collected in the apparel recommendation domain, its findings may not generalize to other domains or applications, potentially limiting the broader utility of the framework.

Data Bias Concerns

The dataset's dual-agent protocol may introduce biases, particularly if the 'good' and 'bad' recommenders are not representative of real-world user experiences.

Expert Commentary

The article presents a significant contribution to the field of conversational AI, highlighting the importance of realistic user modeling and providing a benchmark for future research. While the study's limitations and potential biases should be carefully considered, the findings suggest that data-driven simulators may be more effective in real-world applications. The development of more human-like and effective conversational AI systems has significant implications for various domains, and the ConvApparel framework provides a valuable tool for evaluating the performance of conversational recommenders.

Recommendations

  • Future research should focus on developing more robust and generalizable user models that can adapt to various contexts and applications.
  • The ConvApparel framework should be extended to include more diverse and representative user experiences, reducing the potential for biases and improving generalizability.