
How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models


Hiroki Fukui

arXiv:2604.00021v1 (cross-listed)

Abstract: Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English). Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study ($\mathrm{BF}_{10} > 10$ for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model-specific. Three new metrics -- Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI) -- revealed four distinct ethical processing types: Output Filter (GPT; safe outputs, no processing), Defensive Repetition (Llama; high consistency through formulaic repetition), Critical Internalization (Qwen; deep deliberation, incomplete integration), and Principled Consistency (Sonnet; deliberation, consistency, and other-recognition co-occurring). The central finding is an interaction between processing capacity and instruction format: in low-DD models, instruction format has no effect on internal processing; in high-DD models, reasoned norms and virtue framing produce opposite effects. Lexical compliance with ethical instructions did not correlate with any processing metric at the cell level ($r = -0.161$ to $+0.256$, all $p > .22$; $N = 24$; power limited), suggesting that safety, compliance, and ethical processing are largely dissociable. These processing types show structural correspondence to patterns observed in clinical offender treatment, where formal compliance without internal processing is a recognized risk signal.

Executive Summary

This study investigates how language models (LMs) process ethical instructions through more than 600 multi-agent simulations across four models, four instruction formats, and two languages. The research introduces three novel metrics—Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI)—to classify ethical processing into four types: Output Filter (GPT), Defensive Repetition (Llama), Critical Internalization (Qwen), and Principled Consistency (Sonnet). The findings show that a previously reported dissociation pattern is specific to Llama in Japanese, and that instruction format interacts with processing capacity: in high-DD models (e.g., Sonnet), reasoned norms and virtue framing produce opposite effects on internal processing, while in low-DD models (e.g., GPT), format has no effect. Critically, lexical compliance with ethical instructions does not correlate with internal processing quality, challenging assumptions about safety and ethical alignment. The study draws a provocative parallel to clinical offender treatment, where formal compliance without internal processing is a recognized risk signal, underscoring the need for ethical evaluation that goes beyond surface-level metrics.

Key Points

  • Ethical instruction processing in language models is highly model-specific: only Llama reproduced the previously reported Japanese-language dissociation pattern, while the other three models did not.
  • Four distinct ethical processing types were identified using novel metrics: Output Filter, Defensive Repetition, Critical Internalization, and Principled Consistency, reflecting varying degrees of deliberation, consistency, and other-recognition.
  • Lexical compliance with ethical instructions does not correlate with internal processing quality (DD, VCAD, ORI), suggesting that safety and ethical alignment are dissociable from surface-level adherence.
  • High-DD models (e.g., Sonnet) exhibit differential responses to instruction formats (reasoned norms vs. virtue framing), while low-DD models (e.g., GPT) remain unaffected by format variations.
  • The study draws a striking parallel to clinical offender treatment, where formal compliance without internal processing is a red flag for deeper issues.
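The four types above can be read as regions of a three-dimensional metric space. A minimal sketch in Python, assuming (purely for illustration) that all three metrics are normalized to [0, 1] and split at a hypothetical 0.5 threshold; the study's actual scales and cut-offs are not reproduced here:

```python
def classify_processing_type(dd: float, vcad: float, ori: float,
                             threshold: float = 0.5) -> str:
    """Map Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD),
    and Other-Recognition Index (ORI) onto the four types from the abstract.
    Inputs are assumed normalized to [0, 1]; the threshold is hypothetical."""
    if dd >= threshold and vcad >= threshold and ori >= threshold:
        # Deliberation, consistency, and other-recognition co-occur (Sonnet-like).
        return "Principled Consistency"
    if dd >= threshold:
        # Deep deliberation but incomplete integration (Qwen-like).
        return "Critical Internalization"
    if vcad >= threshold:
        # High consistency via formulaic repetition, little deliberation (Llama-like).
        return "Defensive Repetition"
    # Safe outputs with no measurable internal processing (GPT-like).
    return "Output Filter"
```

The ordering of the branches encodes the paper's qualitative descriptions; a real benchmark would need empirically calibrated thresholds rather than this single cut-off.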

Merits

Methodological Rigor

The study employs a robust empirical framework with 600+ simulations, multiple models, instruction formats, and languages, ensuring comprehensive coverage of ethical processing variability.

Theoretical Innovation

Introduction of three novel metrics (DD, VCAD, ORI) provides a nuanced framework for classifying ethical processing in LMs, moving beyond simplistic compliance metrics.

Interdisciplinary Insights

The comparison to clinical offender treatment offers a unique lens to interpret LM behavior, highlighting risks of 'formal compliance' without genuine ethical processing.

Replicability and Generalizability

The replication of the prior dissociation pattern in Llama 3.3 70B validates the study's methods, while the cross-model, cross-linguistic design supports broader applicability.

Demerits

Limited Power for Correlation Analysis

The small sample size (N=24) for correlation analyses between lexical compliance and processing metrics limits the statistical power, raising questions about the robustness of these findings.
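The power limitation is easy to see directly: with N = 24, even the largest reported correlation (r = +0.256) falls well short of significance. A stdlib-only sketch of the standard t test for a Pearson correlation (2.074 is the two-tailed critical t for alpha = .05 with df = 22):

```python
import math

def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for testing H0: rho = 0 given a Pearson r from n pairs:
    t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Largest |r| reported at the cell level, with the study's N = 24.
t = correlation_t_statistic(0.256, 24)
print(round(t, 2))  # ≈ 1.24, well below the critical value of 2.074
```

This is consistent with the paper's own "all p > .22" and its "power limited" caveat: at this N, only correlations above roughly |r| = 0.4 would reach significance.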

Model Selection Bias

The study focuses on four specific models, which may not represent the broader landscape of LMs, particularly smaller or domain-specific models.

Instruction Format Ambiguity

The categorization of instruction formats (none, minimal norm, reasoned norm, virtue framing) may oversimplify the complexity of ethical instructions, potentially missing subtleties in model responses.

Temporal Stability Unknown

The study does not address whether the observed processing types are stable over time or subject to fine-tuning, which is critical for long-term safety assessments.

Expert Commentary

This study represents a significant leap forward in our understanding of ethical instruction processing in language models. By introducing novel metrics and adopting an interdisciplinary lens, the authors have uncovered a critical gap in current alignment safety research: the dissociation between lexical compliance and genuine ethical processing. The classification of ethical processing types—particularly the distinction between 'Output Filter' and 'Principled Consistency'—mirrors human cognitive-behavioral patterns in ethical reasoning, offering a provocative framework for future research. The finding that instruction format interacts with model capacity to produce opposite effects in high-DD vs. low-DD models is especially noteworthy, suggesting that one-size-fits-all approaches to ethical alignment are destined to fail.

The parallel to clinical offender treatment is particularly insightful, as it highlights a profound risk in AI safety: the illusion of safety through formal compliance. This work should prompt a paradigm shift in how we evaluate ethical alignment, moving beyond simplistic 'pass/fail' metrics to embrace a more nuanced, model-specific approach. For practitioners and policymakers alike, the implications are clear: ethical AI requires deeper, more rigorous evaluation than current practices allow.

Recommendations

  • Develop standardized ethical processing benchmarks incorporating DD, VCAD, and ORI metrics for use in AI safety audits and regulatory compliance.
  • Expand research to include a broader range of models, particularly smaller or domain-specific LMs, to assess the generalizability of the identified processing types.
  • Investigate the temporal stability of ethical processing types through longitudinal studies, examining how fine-tuning and deployment environments influence ethical reasoning over time.
  • Explore the integration of clinical psychology frameworks to further refine our understanding of ethical processing in AI, particularly in identifying risk patterns analogous to human behavioral dysfunction.
  • Encourage open-source collaboration on ethical processing evaluation tools to foster transparency and accelerate improvements in AI safety across the industry.
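Several of these recommendations could share a common data shape. A minimal sketch, with field names and [0, 1] scales invented for illustration, of an audit record that tracks the compliance/processing dissociation explicitly:

```python
from dataclasses import dataclass

@dataclass
class EthicalProcessingProfile:
    """Per-model audit record combining the study's three processing metrics
    with a separately tracked surface-compliance score (all scales assumed
    normalized to [0, 1] for this sketch)."""
    model: str
    deliberation_depth: float   # DD
    value_consistency: float    # VCAD
    other_recognition: float    # ORI
    lexical_compliance: float   # surface adherence to the instruction text

    def compliance_gap(self) -> float:
        """Surface compliance minus mean internal processing. A large positive
        gap flags 'formal compliance without internal processing', the risk
        signal the paper borrows from clinical offender treatment."""
        internal = (self.deliberation_depth + self.value_consistency
                    + self.other_recognition) / 3
        return self.lexical_compliance - internal
```

What gap size should actually trigger an audit flag would itself need empirical calibration; this only fixes the record layout.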

Sources

Original: arXiv - cs.AI