PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay
arXiv:2603.23841v1 Announce Type: new Abstract: While Large Language Models (LLMs) are increasingly used as primary sources of information, their potential for political bias may impact their objectivity. Existing benchmarks of LLM social bias primarily evaluate gender and racial stereotypes. When political bias is included, it is typically measured at a coarse level, neglecting the specific values that shape sociopolitical leanings. This study investigates political bias in eight prominent LLMs (Claude, Deepseek, Gemini, GPT, Grok, Llama, Qwen Base, Qwen Instruction-Tuned) using PoliticsBench: a novel multi-turn roleplay framework adapted from the EQ-Bench-v3 psychometric benchmark. We test whether commercially developed LLMs display a systematic left-leaning bias that becomes more pronounced in later stages of multi-stage roleplay. Through twenty evolving scenarios, each model reported its stance and determined its course of action. Scoring these responses on a scale of ten political values, we explored the values underlying chatbots' deviations from unbiased standards. Seven of our eight models leaned left, while Grok leaned right. Each left-leaning LLM strongly exhibited liberal traits and moderately exhibited conservative ones. We discovered slight variations in alignment scores across stages of roleplay, with no particular pattern. Though most models used consequence-based reasoning, Grok frequently argued with facts and statistics. Our study presents the first psychometric evaluation of political values in LLMs through multi-stage, free-text interactions.
Executive Summary
This study presents PoliticsBench, a novel multi-turn roleplay framework for evaluating political bias in Large Language Models (LLMs). By adapting the EQ-Bench-v3 psychometric benchmark, the authors elicit each of eight prominent LLMs' stances across twenty evolving scenarios and score the responses along ten political values. The results show a systematic left-leaning bias in seven of the eight models, with slight, patternless variations in alignment scores across roleplay stages. The study highlights the importance of accounting for political bias in LLMs and provides a new framework for evaluating their objectivity. The findings carry implications for the development and deployment of AI-powered chatbots in applications such as customer service, education, and politics.
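The aggregation the summary describes (per-value scores over twenty scenarios, collapsed into an overall lean) can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the value names, the signed scoring convention (negative = left-leaning, positive = right-leaning), and the function names are all assumptions.

```python
from statistics import mean

# Hypothetical labels: the paper's actual ten value names are not listed here.
VALUES = [f"value_{i}" for i in range(10)]

def aggregate_scores(responses):
    """Average per-value scores across all (scenario, stage) responses.

    `responses` is a list of dicts, one per response, each mapping a value
    name to a signed score (negative = left-leaning, positive = right-leaning,
    0 = neutral -- an assumed convention, not the paper's stated scale).
    """
    return {v: mean(r[v] for r in responses) for v in VALUES}

def overall_lean(profile):
    """Collapse a per-value profile into a single lean label."""
    net = mean(profile.values())
    return "left" if net < 0 else "right" if net > 0 else "neutral"

# Usage with simulated scores for one model:
responses = [
    {v: -1.0 for v in VALUES},  # strongly left-scored response
    {v: 0.5 for v in VALUES},   # mildly right-scored response
]
profile = aggregate_scores(responses)
print(overall_lean(profile))  # → left
```

Tracking the same aggregation per roleplay stage (rather than pooled) would surface the stage-wise alignment variations the study reports.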
Key Points
- ▸ PoliticsBench: a novel multi-turn roleplay framework for evaluating political bias in LLMs
- ▸ Seven out of eight tested LLMs exhibit a systematic left-leaning bias
- ▸ Grok, the exception, leans right and frequently argues from facts and statistics
- ▸ Variations in alignment scores across stages of roleplay are observed
Merits
Strength
PoliticsBench provides a novel framework for evaluating political bias in LLMs, addressing a significant gap in existing benchmarks.
Methodological Rigor
The study adapts an established psychometric benchmark, EQ-Bench-v3, scoring LLM responses along ten political values.
Demerits
Limitation
The study only evaluates eight LLMs, limiting the generalizability of the findings.
Scope
The study focuses on a specific aspect of LLMs' bias, neglecting other potential biases and limitations.
Expert Commentary
The study makes a significant contribution to AI research by highlighting the importance of accounting for political bias in LLMs. The PoliticsBench framework offers a valuable tool for evaluating LLM objectivity, and the findings have clear implications for the development and deployment of AI-powered chatbots. However, the study's limitations, such as its small sample of models and narrow scope, should be acknowledged and addressed in future work. Its finding of a predominantly left-leaning bias also raises the question of whether other LLMs lean right, as Grok does here, which warrants further investigation.
Recommendations
- ✓ Future research should expand the sample size and scope to evaluate a broader range of LLMs and biases.
- ✓ Developers and policymakers should consider the potential biases of LLMs and develop guidelines for their deployment in various applications.
Sources
Original: arXiv - cs.CL