
Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

arXiv:2602.16990v1 Announce Type: new

Abstract: Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.

Executive Summary

This paper introduces Conv-FinRe, a conversational and longitudinal benchmark for evaluating large language models (LLMs) in financial advisory tasks. Unlike existing benchmarks that score only behavior matching, Conv-FinRe asks models to produce stock rankings over a fixed investment horizon, given an onboarding interview, step-wise market context, and advisory dialogues. Its multi-view references distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. Evaluating a suite of state-of-the-art LLMs, the authors find a persistent tension between decision quality and behavioral alignment: models that rank well by utility often fail to match user choices, while behaviorally aligned models tend to overfit short-term noise. The dataset and codebase are publicly released.
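To make the distinction between behavioral imitation and utility-grounded ranking concrete, the sketch below builds a normative reference ranking from investor-specific risk preferences and scores a model's ranking against both that reference and the user's observed choices. The mean-variance utility, the risk-aversion coefficient, and the use of Kendall's tau are illustrative assumptions, not the paper's published utility model or metrics.

```python
# Minimal sketch, assuming a mean-variance utility U = mu - 0.5 * gamma * sigma^2
# with an investor-specific risk-aversion coefficient gamma, and Kendall's tau
# as the rank-agreement metric. None of these choices are taken from the paper.
import numpy as np
from scipy.stats import kendalltau

def per_item_ranks(order):
    """Convert a best-first ordering of item indices to per-item rank positions."""
    ranks = np.empty(len(order), dtype=int)
    ranks[np.asarray(order)] = np.arange(len(order))
    return ranks

def utility_ranking(exp_returns, volatilities, gamma):
    """Rank assets by mean-variance utility for a given risk aversion (best first)."""
    utility = exp_returns - 0.5 * gamma * volatilities ** 2
    return np.argsort(-utility)

def rank_agreement(order_a, order_b):
    """Kendall's tau between two orderings of the same asset universe."""
    tau, _ = kendalltau(per_item_ranks(order_a), per_item_ranks(order_b))
    return tau

# Toy universe of five stocks over a fixed horizon.
exp_returns  = np.array([0.08, 0.12, 0.05, 0.15, 0.03])
volatilities = np.array([0.20, 0.35, 0.10, 0.50, 0.05])
gamma = 3.0  # risk-averse profile elicited from the onboarding interview

normative  = utility_ranking(exp_returns, volatilities, gamma)  # -> [2, 4, 0, 1, 3]
behavioral = np.array([3, 1, 0, 2, 4])  # what the user actually chose (possibly noisy)
model_out  = np.array([2, 0, 4, 1, 3])  # ranking produced by an LLM under evaluation

print("agreement with utility reference:", rank_agreement(model_out, normative))
print("agreement with observed behavior:", rank_agreement(model_out, behavioral))
```

A model can agree strongly with one reference while disagreeing with the other, which is exactly the tension the benchmark is designed to expose.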

Key Points

  • Conv-FinRe is a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching.
  • The benchmark assesses LLMs' ability to generate rankings over a fixed investment horizon based on investor-specific risk preferences.
  • Conv-FinRe provides a multi-view reference system that distinguishes descriptive behavior from normative utility.

Merits

Grounded in Real-World Market Data

The benchmark is built from real market data and human decision trajectories, so its evaluation reflects the market conditions and investor behavior that real advisory systems must handle.

Multi-View Reference System

The benchmark's references separate descriptive behavior (what the user actually chose) from normative utility (what a ranking grounded in the investor's risk preferences would recommend), enabling a diagnosis of whether a model's recommendations follow rational analysis, mimic user noise, or chase market momentum.
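As a rough illustration of how such a diagnosis could work, the sketch below correlates one model ranking with three hypothetical reference views (utility-grounded, observed user behavior, and trailing momentum) and reports the view it tracks most closely. The reference construction and the choice of Spearman correlation are assumptions for illustration, not the benchmark's published protocol.

```python
# A hedged sketch of the multi-view diagnosis: correlate one model ranking with
# three candidate reference rankings and report the view it tracks most closely.
import numpy as np
from scipy.stats import spearmanr

def per_item_ranks(order):
    """Best-first ordering of item indices -> per-item rank positions (0 = best)."""
    ranks = np.empty(len(order), dtype=int)
    ranks[np.asarray(order)] = np.arange(len(order))
    return ranks

def diagnose(model_order, references):
    """Return (closest view, correlation per view) for a model's ranking."""
    model_ranks = per_item_ranks(model_order)
    scores = {}
    for name, ref_order in references.items():
        rho, _ = spearmanr(model_ranks, per_item_ranks(ref_order))
        scores[name] = rho
    return max(scores, key=scores.get), scores

references = {
    "rational_utility": np.array([2, 4, 0, 1, 3]),  # ranking implied by investor utility
    "user_behavior":    np.array([3, 1, 0, 2, 4]),  # what the investor actually picked
    "market_momentum":  np.array([3, 1, 4, 0, 2]),  # ranked by recent price run-up
}
model_order = np.array([3, 1, 4, 0, 2])             # ranking emitted by an LLM

view, scores = diagnose(model_order, references)
print(scores)                 # correlation with each reference view
print("closest to:", view)    # here: market_momentum (correlation 1.0)
```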

Publicly Available Dataset and Codebase

The authors release the dataset and codebase publicly, facilitating reproducibility and enabling the research community to build upon and extend the work.

Demerits

Complexity of the Benchmark

Evaluation is longitudinal: models must generate rankings over a fixed investment horizon while conditioning on an onboarding interview, step-wise market context, and accumulating advisory dialogue. This multi-step, long-context protocol is more complex and computationally demanding to run than single-shot behavior-matching benchmarks.

Potential for Overfitting

Because the benchmark is built from a specific window of market data and a particular set of human decision trajectories, it may be susceptible to overfitting: models tuned to score well on it could be fitting that market regime or investor population rather than generalizing to other conditions or risk preferences.

Expert Commentary

Conv-FinRe represents a significant advance in the evaluation of LLMs for financial advisory tasks. By providing a comprehensive and nuanced evaluation framework, the benchmark can contribute to the development of more effective and responsible AI-driven decision-making systems. However, the complexity of the evaluation protocol and the potential for overfitting need to be carefully addressed to ensure the benchmark's validity and applicability in real-world advisory settings. Furthermore, the study's findings can inform policy discussions around the regulation of AI-driven financial advisory services, ensuring that such services are designed and implemented in a way that prioritizes investor protection and fairness.

Recommendations

  • Future research should focus on developing more efficient and scalable evaluation methods for Conv-FinRe, enabling its widespread adoption in the financial advisory industry.
  • Model developers should treat the gap exposed by the multi-view references as a target: LLMs whose recommendations are both utility-grounded and behaviorally plausible, rather than strong on one view at the expense of the other, are needed to realize the benchmark's full potential.
