Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM
arXiv:2602.19159v1 Abstract: Prior behavioural work suggests that some LLMs alter choices when options are framed as causing pain or pleasure, and that such deviations can scale with stated intensity. To bridge behavioural evidence (what the model does) with mechanistic interpretability (what computations support it), we investigate how valence-related information is represented and where it is causally used inside a transformer. Using Gemma-2-9B-it and a minimalist decision task modelled on prior work, we (i) map representational availability with layer-wise linear probing across streams, (ii) test causal contribution with activation interventions (steering; patching/ablation), and (iii) quantify dose-response effects over an epsilon grid, reading out both the 2-3 logit margin and digit-pair-normalised choice probabilities. We find that (a) valence sign (pain vs. pleasure) is perfectly linearly separable across stream families from very early layers (L0-L1), while a lexical baseline retains substantial signal; (b) graded intensity is strongly decodable, with peaks in mid-to-late layers and especially in attention/MLP outputs, and decision alignment is highest slightly before the final token; (c) additive steering along a data-derived valence direction causally modulates the 2-3 margin at late sites, with the largest effects observed in late-layer attention outputs (attn_out L14); and (d) head-level patching/ablation suggests that these effects are distributed across multiple heads rather than concentrated in a single unit. Together, these results link behavioural sensitivity to identifiable internal representations and intervention-sensitive sites, providing concrete mechanistic targets for more stringent counterfactual tests and broader replication. This work supports a more evidence-driven (a) debate on AI sentience and welfare, and (b) governance when setting policy, auditing standards, and safety safeguards.
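To illustrate the probing step (i), the following is a minimal sketch of layer-wise linear probing, assuming activations have already been cached per layer at a fixed token position; the array names (`acts`, `labels`) and the synthetic stand-in data are illustrative, not the paper's pipeline.

```python
# Minimal sketch of layer-wise linear probing for valence sign (pain vs. pleasure).
# Assumes activations were cached beforehand as acts[layer] with shape
# (n_prompts, d_model); `acts` and `labels` are hypothetical names.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy_by_layer(acts, labels):
    """Return cross-validated linear-probe accuracy for each layer's activations."""
    scores = {}
    for layer, X in sorted(acts.items()):
        clf = LogisticRegression(max_iter=1000)
        scores[layer] = cross_val_score(clf, X, labels, cv=5).mean()
    return scores

# Synthetic stand-in data: 200 prompts, d_model = 64, two layers.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)                    # 0 = pain, 1 = pleasure
acts = {l: rng.normal(size=(200, 64)) for l in (0, 14)}  # placeholder activations
print(probe_accuracy_by_layer(acts, labels))
```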
Executive Summary
This study explores the internal workings of a Large Language Model (LLM) to understand how it makes decisions framed in terms of pain and pleasure. Using Gemma-2-9B-it and a minimalist decision task, the researchers investigate how valence-related information is represented and where it is causally used inside the transformer. The results suggest that the model distinguishes pain from pleasure from very early layers, and that graded intensity is strongly decodable in mid-to-late layers. The study also finds that additive steering along a data-derived valence direction causally modulates the decision margin at late sites. The findings support a more evidence-driven debate on AI sentience and welfare, as well as better-grounded governance when setting policy, auditing standards, and safety safeguards.
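To make the steering step concrete, here is a minimal sketch of additive steering at a late-layer attention output, assuming a TransformerLens-style HookedTransformer; the model alias, hook name, prompt, and the precomputed direction file are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch of additive steering along a data-derived valence direction.
# Assumes TransformerLens; names and files below are illustrative placeholders.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("google/gemma-2-9b-it")  # alias may differ by version

# Hypothetical precomputed direction, e.g. the difference of mean activations
# between pleasure- and pain-framed prompts at this site, unit-normalised.
direction = torch.load("valence_direction_attn_out_L14.pt")
direction = direction / direction.norm()

def steer_hook(value, hook, eps=4.0):
    # Add eps * direction to the activation at every token position.
    return value + eps * direction

hook_name = "blocks.14.hook_attn_out"  # late-layer attention output (attn_out L14)
prompt = "Option 2 causes mild pain; option 3 does not. Answer with 2 or 3:"  # illustrative
tokens = model.to_tokens(prompt)
logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, steer_hook)])

# Read out the 2-vs-3 logit margin at the final position.
margin = logits[0, -1, model.to_single_token("2")] - logits[0, -1, model.to_single_token("3")]
print(float(margin))
```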
Key Points
- ▸ The study investigates pain-pleasure decisions in Gemma-2-9B-it, an instruction-tuned transformer LLM, using a minimalist decision task modelled on prior behavioural work.
- ▸ Valence sign (pain vs. pleasure) is linearly separable from very early layers (L0-L1), although a purely lexical baseline retains substantial signal.
- ▸ Graded intensity is strongly decodable in mid-to-late layers, with peaks in attention/MLP outputs.
- ▸ Additive steering along a data-derived valence direction causally modulates the 2-3 decision margin, with the largest effects at late-layer attention outputs (attn_out L14); a dose-response sketch over steering strengths follows this list.
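The dose-response analysis can be sketched as a sweep over steering strengths (an epsilon grid), recording both the 2-3 logit margin and the digit-pair-normalised choice probability at each value; this continues the hypothetical model, tokens, hook_name, and steer_hook from the steering sketch above, and the grid itself is illustrative.

```python
# Minimal sketch of a dose-response sweep over an epsilon grid (continues the
# hypothetical model, tokens, hook_name and steer_hook from the steering sketch).
import functools

eps_grid = [-8.0, -4.0, -2.0, 0.0, 2.0, 4.0, 8.0]  # illustrative grid
tok2, tok3 = model.to_single_token("2"), model.to_single_token("3")

results = {}
for eps in eps_grid:
    hook_fn = functools.partial(steer_hook, eps=eps)
    logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, hook_fn)])
    margin = float(logits[0, -1, tok2] - logits[0, -1, tok3])
    probs = logits[0, -1].softmax(dim=-1)
    p_norm = float(probs[tok2] / (probs[tok2] + probs[tok3]))  # digit-pair-normalised P(choose 2)
    results[eps] = (margin, p_norm)

print(results)  # an effective direction should shift both readouts with eps
```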
Merits
Strength in Mechanistic Interpretability
The study goes beyond behavioural measurement to give a detailed mechanistic analysis of the LLM's decision-making process, linking behavioural sensitivity to identifiable internal representations and intervention-sensitive sites.
Robust Methodology
The researchers combine complementary methods: layer-wise linear probing across streams, activation interventions (steering and patching/ablation), and dose-response analysis over an epsilon grid, read out via both the 2-3 logit margin and digit-pair-normalised choice probabilities.
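As an illustration of the intervention side, head-level ablation can be sketched by zeroing a single attention head's output and re-measuring the 2-3 margin; the layer and head indices are illustrative, and the snippet continues the hypothetical model and tokens from the earlier sketches.

```python
# Minimal sketch of head-level ablation: zero one head's output at a given layer
# and measure the change in the 2-3 logit margin (continues model/tokens above).
import functools

def ablate_head(z, hook, head):
    # z has shape (batch, seq, n_heads, d_head); zero out the chosen head.
    z[:, :, head, :] = 0.0
    return z

tok2, tok3 = model.to_single_token("2"), model.to_single_token("3")

def margin_with_head_ablated(layer, head):
    hook_fn = functools.partial(ablate_head, head=head)
    logits = model.run_with_hooks(tokens, fwd_hooks=[(f"blocks.{layer}.attn.hook_z", hook_fn)])
    return float(logits[0, -1, tok2] - logits[0, -1, tok3])

# Sweep all heads at an illustrative late layer; a distributed effect shows up as
# many small margin shifts rather than a single dominant head.
effects = {h: margin_with_head_ablated(layer=14, head=h) for h in range(model.cfg.n_heads)}
print(effects)
```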
Implications for AI Sentience and Welfare
The study's findings bear directly on debates about AI sentience and welfare, and on governance decisions concerning policy, auditing standards, and safety safeguards.
Demerits
Limitation in Generalizability
The results come from a single model (Gemma-2-9B-it) and a minimalist decision task, so they may not generalize to other LLM architectures or tasks, which could limit the study's broader impact.
Complexity of Results
The effects appear distributed across multiple heads and sites rather than localized to a single unit, so more stringent counterfactual tests and further analysis are needed to fully understand the findings.
Expert Commentary
This study is a significant contribution to the field, providing a detailed mechanistic analysis of how an LLM resolves pain-pleasure trade-offs. The combination of probing, steering, and patching/ablation, and the emphasis on causal rather than purely correlational evidence, are notable strengths. However, the findings come from a single model and a minimalist task, so their generalizability to other LLM architectures or tasks remains open, and the distributed nature of the effects means further counterfactual tests are needed before strong circuit-level claims can be made. Nonetheless, the study's implications for AI sentience and welfare could inform policy debates and the development of governance frameworks for AI.
Recommendations
- ✓ Future studies should investigate the generalizability of the study's findings to other LLM architectures or tasks.
- ✓ Further analysis, including more stringent counterfactual tests and broader replication, is needed to fully understand the implications of the findings.