Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM
arXiv:2602.19159v1 Abstract: Prior behavioural work suggests that some LLMs alter choices when options are framed as causing pain or pleasure, and that such deviations can scale with stated intensity. To bridge behavioural evidence (what the model does) with mechanistic interpretability (what computations support it), we investigate how valence-related information is represented and where it is causally used inside a transformer. Using Gemma-2-9B-it and a minimalist decision task modelled on prior work, we (i) map representational availability with layer-wise linear probing across streams, (ii) test causal contribution with activation interventions (steering; patching/ablation), and (iii) quantify dose-response effects over an epsilon grid, reading out both the 2-3 logit margin and digit-pair-normalised choice probabilities. We find that (a) valence sign (pain vs. pleasure) is perfectly linearly separable across stream families from very early layers (L0-L1), while a lexical baseline retains substantial signal; (b) graded intensity is strongly decodable, with peaks in mid-to-late layers and especially in attention/MLP outputs, and decision alignment is highest slightly before the final token; (c) additive steering along a data-derived valence direction causally modulates the 2-3 margin at late sites, with the largest effects observed in late-layer attention outputs (attn_out L14); and (d) head-level patching/ablation suggests that these effects are distributed across multiple heads rather than concentrated in a single unit. Together, these results link behavioural sensitivity to identifiable internal representations and intervention-sensitive sites, providing concrete mechanistic targets for more stringent counterfactual tests and broader replication. This work supports a more evidence-driven (a) debate on AI sentience and welfare, and (b) governance when setting policy, auditing standards, and safety safeguards.
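To illustrate the probing step (i), the following is a minimal sketch of layer-wise linear probing, assuming activations have already been cached per layer at a fixed token position; the array names (`acts`, `labels`) and the synthetic stand-in data are illustrative, not the paper's pipeline.

```python
# Minimal sketch of layer-wise linear probing for valence sign (pain vs. pleasure).
# Assumes activations were cached beforehand as acts[layer] with shape
# (n_prompts, d_model); `acts` and `labels` are hypothetical names.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy_by_layer(acts, labels):
    """Return cross-validated linear-probe accuracy for each layer's activations."""
    scores = {}
    for layer, X in sorted(acts.items()):
        clf = LogisticRegression(max_iter=1000)
        scores[layer] = cross_val_score(clf, X, labels, cv=5).mean()
    return scores

# Synthetic stand-in data: 200 prompts, d_model = 64, two layers.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)                    # 0 = pain, 1 = pleasure
acts = {l: rng.normal(size=(200, 64)) for l in (0, 14)}  # placeholder activations
print(probe_accuracy_by_layer(acts, labels))
```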
Executive Summary
This study explores the internal workings of a Large Language Model (LLM) to understand how it makes decisions framed in terms of pain and pleasure. Using Gemma-2-9B-it and a minimalist decision task, the researchers investigate how valence-related information is represented and where it is causally used inside the transformer. The results suggest that the model distinguishes pain from pleasure from very early layers, and that graded intensity is strongly decodable in mid-to-late layers. The study also finds that additive steering along a data-derived valence direction causally modulates the decision margin at late sites. The findings support a more evidence-driven debate on AI sentience and welfare, as well as better-grounded governance when setting policy, auditing standards, and safety safeguards.
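To make the steering step concrete, here is a minimal sketch of additive steering at a late-layer attention output, assuming a TransformerLens-style HookedTransformer; the model alias, hook name, prompt, and the precomputed direction file are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch of additive steering along a data-derived valence direction.
# Assumes TransformerLens; names and files below are illustrative placeholders.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("google/gemma-2-9b-it")  # alias may differ by version

# Hypothetical precomputed direction, e.g. the difference of mean activations
# between pleasure- and pain-framed prompts at this site, unit-normalised.
direction = torch.load("valence_direction_attn_out_L14.pt")
direction = direction / direction.norm()

def steer_hook(value, hook, eps=4.0):
    # Add eps * direction to the activation at every token position.
    return value + eps * direction

hook_name = "blocks.14.hook_attn_out"  # late-layer attention output (attn_out L14)
prompt = "Option 2 causes mild pain; option 3 does not. Answer with 2 or 3:"  # illustrative
tokens = model.to_tokens(prompt)
logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, steer_hook)])

# Read out the 2-vs-3 logit margin at the final position.
margin = logits[0, -1, model.to_single_token("2")] - logits[0, -1, model.to_single_token("3")]
print(float(margin))
```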
Key Points
- ▸ The study investigates pain-pleasure decisions in Gemma-2-9B-it, an instruction-tuned transformer LLM, using a minimalist decision task modelled on prior behavioural work.
- ▸ Valence sign (pain vs. pleasure) is linearly separable from very early layers (L0-L1), although a purely lexical baseline retains substantial signal.
- ▸ Graded intensity is strongly decodable in mid-to-late layers, with peaks in attention/MLP outputs.
- ▸ Additive steering along a data-derived valence direction causally modulates the 2-3 decision margin, with the largest effects at late-layer attention outputs (attn_out L14); a dose-response sketch over steering strengths follows this list.
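The dose-response analysis can be sketched as a sweep over steering strengths (an epsilon grid), recording both the 2-3 logit margin and the digit-pair-normalised choice probability at each value; this continues the hypothetical model, tokens, hook_name, and steer_hook from the steering sketch above, and the grid itself is illustrative.

```python
# Minimal sketch of a dose-response sweep over an epsilon grid (continues the
# hypothetical model, tokens, hook_name and steer_hook from the steering sketch).
import functools

eps_grid = [-8.0, -4.0, -2.0, 0.0, 2.0, 4.0, 8.0]  # illustrative grid
tok2, tok3 = model.to_single_token("2"), model.to_single_token("3")

results = {}
for eps in eps_grid:
    hook_fn = functools.partial(steer_hook, eps=eps)
    logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, hook_fn)])
    margin = float(logits[0, -1, tok2] - logits[0, -1, tok3])
    probs = logits[0, -1].softmax(dim=-1)
    p_norm = float(probs[tok2] / (probs[tok2] + probs[tok3]))  # digit-pair-normalised P(choose 2)
    results[eps] = (margin, p_norm)

print(results)  # an effective direction should shift both readouts with eps
```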
Merits
Strength in Mechanistic Interpretability
The study goes beyond behavioural measurement to give a detailed mechanistic analysis of the LLM's decision-making process, linking behavioural sensitivity to identifiable internal representations and intervention-sensitive sites.
Robust Methodology
The researchers combine complementary methods: layer-wise linear probing across streams, activation interventions (steering and patching/ablation), and dose-response analysis over an epsilon grid, read out via both the 2-3 logit margin and digit-pair-normalised choice probabilities.
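As an illustration of the intervention side, head-level ablation can be sketched by zeroing a single attention head's output and re-measuring the 2-3 margin; the layer and head indices are illustrative, and the snippet continues the hypothetical model and tokens from the earlier sketches.

```python
# Minimal sketch of head-level ablation: zero one head's output at a given layer
# and measure the change in the 2-3 logit margin (continues model/tokens above).
import functools

def ablate_head(z, hook, head):
    # z has shape (batch, seq, n_heads, d_head); zero out the chosen head.
    z[:, :, head, :] = 0.0
    return z

tok2, tok3 = model.to_single_token("2"), model.to_single_token("3")

def margin_with_head_ablated(layer, head):
    hook_fn = functools.partial(ablate_head, head=head)
    logits = model.run_with_hooks(tokens, fwd_hooks=[(f"blocks.{layer}.attn.hook_z", hook_fn)])
    return float(logits[0, -1, tok2] - logits[0, -1, tok3])

# Sweep all heads at an illustrative late layer; a distributed effect shows up as
# many small margin shifts rather than a single dominant head.
effects = {h: margin_with_head_ablated(layer=14, head=h) for h in range(model.cfg.n_heads)}
print(effects)
```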
Implications for AI Sentience and Welfare
The study's findings bear directly on debates about AI sentience and welfare, and on governance decisions concerning policy, auditing standards, and safety safeguards.
Demerits
Limitation in Generalizability
The results come from a single model (Gemma-2-9B-it) and a minimalist decision task, so they may not generalize to other LLM architectures or tasks, which could limit the study's broader impact.
Complexity of Results
The effects appear distributed across multiple heads and sites rather than localized to a single unit, so more stringent counterfactual tests and further analysis are needed to fully understand the findings.
Expert Commentary
This study is a significant contribution to the field, providing a detailed mechanistic analysis of how an LLM resolves pain-pleasure trade-offs. The combination of probing, steering, and patching/ablation, and the emphasis on causal rather than purely correlational evidence, are notable strengths. However, the findings come from a single model and a minimalist task, so their generalizability to other LLM architectures or tasks remains open, and the distributed nature of the effects means further counterfactual tests are needed before strong circuit-level claims can be made. Nonetheless, the study's implications for AI sentience and welfare could inform policy debates and the development of governance frameworks for AI.
Recommendations
- ✓ Future studies should investigate the generalizability of the study's findings to other LLM architectures or tasks.
- ✓ Further analysis, including more stringent counterfactual tests and broader replication, is needed to fully understand the implications of the findings.