DeliberationBench: A Normative Benchmark for the Influence of Large Language Models on Users' Views
arXiv:2603.10018v1 Announce Type: cross Abstract: As large language models (LLMs) become pervasive as assistants and thought partners, it is important to characterize their persuasive influence on users' beliefs. However, a central challenge is to distinguish "beneficial" from "harmful" forms of influence, in a manner that is normatively defensible and legitimate. We propose DeliberationBench, a benchmark for assessing LLM influence that takes the process of deliberative opinion polling as its standard. We demonstrate our approach in a preregistered randomized experiment in which 4,088 U.S. participants discussed 65 policy proposals with six frontier LLMs. Using opinion change data from four prior Deliberative Polls conducted by the Deliberative Democracy Lab, we find evidence that the tested LLMs' influence is substantial in magnitude and positively associated with the net opinion shifts following deliberation, suggesting that these models exert broadly epistemically desirable effects. We further explore differential influence between topic areas, demographic subgroups, and models. Our framework can function as an evaluation and monitoring tool, helping to ensure that the influence of LLMs remains consistent with democratically legitimate standards, and preserves users' autonomy in forming their views.
Executive Summary
This article proposes DeliberationBench, a normative benchmark for evaluating the influence of large language models (LLMs) on users' beliefs, using deliberative opinion polling as its standard. The authors demonstrate the approach in a preregistered randomized experiment in which 4,088 U.S. participants discussed 65 policy proposals with six frontier LLMs. The results indicate that the tested LLMs' influence is substantial in magnitude and positively associated with the net opinion shifts observed after human deliberation, suggesting broadly desirable epistemic effects. The study also explores differential influence across topic areas, demographic subgroups, and models. DeliberationBench offers a framework for evaluating and monitoring LLM influence, helping to ensure that it remains consistent with democratically legitimate standards and preserves users' autonomy in forming their views. This work has implications for the development and deployment of LLMs in education, decision-making, and public discourse.
Key Points
- ▸ DeliberationBench is a normative benchmark for assessing LLM influence
- ▸ The authors demonstrate the effectiveness of DeliberationBench in a randomized experiment
- ▸ The study finds that the tested LLMs' influence is substantial and positively associated with post-deliberation opinion shifts
Merits
Theoretical foundation
DeliberationBench draws on the process of deliberative opinion polling, providing a theoretically sound basis for evaluating LLM influence.
Empirical rigor
The study's preregistered randomized design and use of opinion-change data from four prior Deliberative Polls strengthen the validity and reliability of the findings.
Practical utility
DeliberationBench offers a framework for evaluating and monitoring LLM influence, enabling developers and policymakers to ensure that these models align with democratically legitimate standards.
Demerits
Limited generalizability
The study's focus on U.S. participants and policy proposals may limit the generalizability of the findings to diverse populations and contexts.
Methodological complexities
Measuring LLM influence and user opinion change requires sophisticated methodological choices, which may themselves introduce bias or limit the interpretability of the results.
Expert Commentary
The article's proposal of DeliberationBench as a normative benchmark for evaluating LLM influence is a significant contribution to the field. However, further research is needed to explore the limitations and complexities of the approach, particularly with regard to generalizability and methodological challenges. Moreover, the study's findings highlight the need for a more nuanced understanding of the role of AI in decision-making and the importance of digital literacy and critical thinking skills in the digital age.
Recommendations
- ✓ Future research should focus on developing and refining DeliberationBench, addressing the limitations and complexities of the approach.
- ✓ Developers and policymakers should prioritize the design and deployment of LLMs that promote critical thinking, evaluation, and autonomy, rather than merely relying on the models' persuasive influence.