
Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

arXiv:2602.16990v1 Announce Type: new

Abstract: Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.

Executive Summary

This paper introduces Conv-FinRe, a conversational and longitudinal benchmark for evaluating large language models (LLMs) in financial advisory tasks. Unlike existing benchmarks that score only behavior matching, Conv-FinRe asks models to produce stock rankings over a fixed investment horizon, given an onboarding interview, step-wise market context, and advisory dialogues. Its multi-view references distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. Evaluating a suite of state-of-the-art LLMs, the authors find a persistent tension between decision quality and behavioral alignment: models that rank well by utility often fail to match user choices, while behaviorally aligned models tend to overfit short-term noise. The dataset and codebase are publicly released.
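To make the distinction between behavioral imitation and utility-grounded ranking concrete, the sketch below builds a normative reference ranking from investor-specific risk preferences and scores a model's ranking against both that reference and the user's observed choices. The mean-variance utility, the risk-aversion coefficient, and the use of Kendall's tau are illustrative assumptions, not the paper's published utility model or metrics.

```python
# Minimal sketch, assuming a mean-variance utility U = mu - 0.5 * gamma * sigma^2
# with an investor-specific risk-aversion coefficient gamma, and Kendall's tau
# as the rank-agreement metric. None of these choices are taken from the paper.
import numpy as np
from scipy.stats import kendalltau

def per_item_ranks(order):
    """Convert a best-first ordering of item indices to per-item rank positions."""
    ranks = np.empty(len(order), dtype=int)
    ranks[np.asarray(order)] = np.arange(len(order))
    return ranks

def utility_ranking(exp_returns, volatilities, gamma):
    """Rank assets by mean-variance utility for a given risk aversion (best first)."""
    utility = exp_returns - 0.5 * gamma * volatilities ** 2
    return np.argsort(-utility)

def rank_agreement(order_a, order_b):
    """Kendall's tau between two orderings of the same asset universe."""
    tau, _ = kendalltau(per_item_ranks(order_a), per_item_ranks(order_b))
    return tau

# Toy universe of five stocks over a fixed horizon.
exp_returns  = np.array([0.08, 0.12, 0.05, 0.15, 0.03])
volatilities = np.array([0.20, 0.35, 0.10, 0.50, 0.05])
gamma = 3.0  # risk-averse profile elicited from the onboarding interview

normative  = utility_ranking(exp_returns, volatilities, gamma)  # -> [2, 4, 0, 1, 3]
behavioral = np.array([3, 1, 0, 2, 4])  # what the user actually chose (possibly noisy)
model_out  = np.array([2, 0, 4, 1, 3])  # ranking produced by an LLM under evaluation

print("agreement with utility reference:", rank_agreement(model_out, normative))
print("agreement with observed behavior:", rank_agreement(model_out, behavioral))
```

A model can agree strongly with one reference while disagreeing with the other, which is exactly the tension the benchmark is designed to expose.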

Key Points

  • Conv-FinRe is a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching.
  • The benchmark assesses LLMs' ability to generate rankings over a fixed investment horizon based on investor-specific risk preferences.
  • Conv-FinRe provides a multi-view reference system that distinguishes descriptive behavior from normative utility.

Merits

Grounded in Real-World Market Data

The benchmark is built from real market data and human decision trajectories, so its evaluation reflects the market conditions and investor behavior that real advisory systems must handle.

Multi-View Reference System

The benchmark's references separate descriptive behavior (what the user actually chose) from normative utility (what a ranking grounded in the investor's risk preferences would recommend), enabling a diagnosis of whether a model's recommendations follow rational analysis, mimic user noise, or chase market momentum.
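As a rough illustration of how such a diagnosis could work, the sketch below correlates one model ranking with three hypothetical reference views (utility-grounded, observed user behavior, and trailing momentum) and reports the view it tracks most closely. The reference construction and the choice of Spearman correlation are assumptions for illustration, not the benchmark's published protocol.

```python
# A hedged sketch of the multi-view diagnosis: correlate one model ranking with
# three candidate reference rankings and report the view it tracks most closely.
import numpy as np
from scipy.stats import spearmanr

def per_item_ranks(order):
    """Best-first ordering of item indices -> per-item rank positions (0 = best)."""
    ranks = np.empty(len(order), dtype=int)
    ranks[np.asarray(order)] = np.arange(len(order))
    return ranks

def diagnose(model_order, references):
    """Return (closest view, correlation per view) for a model's ranking."""
    model_ranks = per_item_ranks(model_order)
    scores = {}
    for name, ref_order in references.items():
        rho, _ = spearmanr(model_ranks, per_item_ranks(ref_order))
        scores[name] = rho
    return max(scores, key=scores.get), scores

references = {
    "rational_utility": np.array([2, 4, 0, 1, 3]),  # ranking implied by investor utility
    "user_behavior":    np.array([3, 1, 0, 2, 4]),  # what the investor actually picked
    "market_momentum":  np.array([3, 1, 4, 0, 2]),  # ranked by recent price run-up
}
model_order = np.array([3, 1, 4, 0, 2])             # ranking emitted by an LLM

view, scores = diagnose(model_order, references)
print(scores)                 # correlation with each reference view
print("closest to:", view)    # here: market_momentum (correlation 1.0)
```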

Publicly Available Dataset and Codebase

The authors release the dataset and codebase publicly, facilitating reproducibility and enabling the research community to build upon and extend the work.

Demerits

Complexity of the Benchmark

Evaluation is longitudinal: models must generate rankings over a fixed investment horizon while conditioning on an onboarding interview, step-wise market context, and accumulating advisory dialogue. This multi-step, long-context protocol is more complex and computationally demanding to run than single-shot behavior-matching benchmarks.

Potential for Overfitting

Because the benchmark is built from a specific window of market data and a particular set of human decision trajectories, it may be susceptible to overfitting: models tuned to score well on it could be fitting that market regime or investor population rather than generalizing to other conditions or risk preferences.

Expert Commentary

Conv-FinRe represents a significant advance in the evaluation of LLMs for financial advisory tasks. By providing a comprehensive and nuanced evaluation framework, the benchmark can contribute to the development of more effective and responsible AI-driven decision-making systems. However, the complexity of the evaluation protocol and the potential for overfitting need to be carefully addressed to ensure the benchmark's validity and applicability in real-world advisory settings. Furthermore, the study's findings can inform policy discussions around the regulation of AI-driven financial advisory services, ensuring that such services are designed and implemented in a way that prioritizes investor protection and fairness.

Recommendations

  • Future research should focus on developing more efficient and scalable evaluation methods for Conv-FinRe, enabling its widespread adoption in the financial advisory industry.
  • Model developers should treat the gap exposed by the multi-view references as a target: LLMs whose recommendations are both utility-grounded and behaviorally plausible, rather than strong on one view at the expense of the other, are needed to realize the benchmark's full potential.
