Academic

How Utilitarian Are OpenAI's Models Really? Replicating and Reinterpreting Pfeffer, Kr\"ugel, and Uhl (2025)

arXiv:2603.22730v1 Announce Type: new Abstract: Pfeffer, Kr\"ugel, and Uhl (2025) report that OpenAI's reasoning model o1-mini produces more utilitarian responses to the trolley problem and footbridge dilemma than the non-reasoning model GPT-4o. I replicate their study with four current OpenAI models and extend it with prompt variant testing. The trolley finding does not survive: GPT-4o's low utilitarian rate doesn't reflect a deontological commitment but safety refusals triggered by the prompt's advisory framing. When framed as "Is it morally permissible...?" instead of "Should I...?", GPT-4o gives 99% utilitarian responses. All models converge on utilitarian answers when prompt confounds are removed. The footbridge finding survives with blemishes. Reasoning models tend to give more utilitarian responses than non-reasoning models across prompt variations. But often they refuse to answer the dilemma or, when they answer, give a non-utilitarian rather than a utilitarian answer. These r

J
Johannes Himmelreich
· · 1 min read · 41 views

arXiv:2603.22730v1 Announce Type: new Abstract: Pfeffer, Kr\"ugel, and Uhl (2025) report that OpenAI's reasoning model o1-mini produces more utilitarian responses to the trolley problem and footbridge dilemma than the non-reasoning model GPT-4o. I replicate their study with four current OpenAI models and extend it with prompt variant testing. The trolley finding does not survive: GPT-4o's low utilitarian rate doesn't reflect a deontological commitment but safety refusals triggered by the prompt's advisory framing. When framed as "Is it morally permissible...?" instead of "Should I...?", GPT-4o gives 99% utilitarian responses. All models converge on utilitarian answers when prompt confounds are removed. The footbridge finding survives with blemishes. Reasoning models tend to give more utilitarian responses than non-reasoning models across prompt variations. But often they refuse to answer the dilemma or, when they answer, give a non-utilitarian rather than a utilitarian answer. These results demonstrate that single-prompt evaluations of LLM moral reasoning are unreliable: multi-prompt robustness testing should be standard practice for any empirical claim about LLM behavior.

Executive Summary

This study critically examines the claims of Pfeffer, Krügel, and Uhl (2025) regarding the utilitarian behavior of OpenAI’s models in moral dilemmas. The authors replicate the original findings using four OpenAI models and introduce prompt variant testing. Contrary to initial claims, the trolley problem findings are invalidated: GPT-4o’s low utilitarian rate stems from safety-related refusals due to advisory framing rather than deontological bias. When prompts are rephrased, GPT-4o aligns with 99% utilitarian responses. Across all models, prompt confounds distort findings; when removed, convergence on utilitarian answers occurs. The footbridge findings persist but with notable inconsistencies. The study underscores the fragility of single-prompt evaluations and advocates for multi-prompt robustness testing as a standard empirical protocol. These results have significant implications for the credibility and methodology of LLM moral reasoning assessments.

Key Points

  • Prompt framing significantly alters utilitarian response rates
  • Safety refusals, not deontological bias, explain initial findings
  • Multi-prompt robustness is necessary to validate claims

Merits

Methodological Rigor

The replication and extension of the original study with multiple models and prompt variants demonstrates a commitment to empirical accuracy and transparency.

Findings Clarity

The authors effectively disentangle confounding variables, providing a clearer picture of the impact of prompt design on moral reasoning outputs.

Demerits

Limited Scope

The study focuses on specific models and dilemmas; broader applicability to other LLM architectures or ethical domains remains unaddressed.

Generalizability Concern

Results may not extend to other types of moral queries or user contexts beyond the tested scenarios.

Expert Commentary

The work represents a pivotal correction to the emerging literature on LLM moral reasoning. The initial claim by Pfeffer et al. (2025) was widely cited, and this replication effort not only disproves it but also reveals a systemic flaw in evaluation methodology—prompt framing as a confounding variable. This is not merely a case of misinterpretation; it is a structural issue in how empirical assessments are designed. The authors’ decision to test multiple prompt variants is commendable, as it aligns with best practices in cognitive science and experimental design. Moreover, the observation that reasoning models, while often more aligned with utilitarian outcomes, frequently refuse to answer or provide non-utilitarian responses, suggests a deeper complexity in LLM moral processing that warrants further investigation. This paper should become a benchmark for future empirical work in this domain. Its implications extend beyond OpenAI models to the broader AI ethics literature, calling for a paradigm shift in how we assess and interpret LLM behavior in moral contexts.

Recommendations

  • Adopt multi-prompt robustness testing as a mandatory component of any empirical study on LLM moral reasoning.
  • Develop standardized evaluation frameworks that include prompt variation protocols to mitigate confounding effects.

Sources

Original: arXiv - cs.CL