Weber's Law in Transformer Magnitude Representations: Efficient Coding, Representational Geometry, and Psychophysical Laws in Language Models
arXiv:2603.20642v1 Announce Type: new Abstract: How do transformer language models represent magnitude? Recent work disagrees: some find logarithmic spacing, others linear encoding, others per-digit circular representations. We apply the formal tools of psychophysics to resolve this. Using four converging paradigms (representational similarity analysis, behavioural discrimination, precision gradients, causal intervention) across three magnitude domains in three 7-9B instruction-tuned models spanning three architecture families (Llama, Mistral, Qwen), we report three findings. First, representational geometry is consistently log-compressive: RSA correlations with a Weber-law dissimilarity matrix ranged from .68 to .96 across all 96 model-domain-layer cells, with linear geometry never preferred. Second, this geometry is dissociated from behaviour: one model produces a human-range Weber fraction (WF = 0.20) while the other does not, and both models perform at chance on temporal and spatial discrimination despite possessing logarithmic geometry. Third, causal intervention reveals a layer dissociation: early layers are functionally implicated in magnitude processing (4.1x specificity) while later layers where geometry is strongest are not causally engaged (1.2x). Corpus analysis confirms the efficient coding precondition (alpha = 0.77). These results suggest that training data statistics alone are sufficient to produce log-compressive magnitude geometry, but geometry alone does not guarantee behavioural competence.
Executive Summary
This article presents a multi-paradigm investigation of magnitude representations in transformer language models, framed through the lens of Weber's Law. The authors demonstrate that representational geometry is consistently log-compressive, dissociated from behavioural competence, and layer-dependent. Corpus analysis confirms the efficient coding precondition, suggesting that training-data statistics alone are sufficient to produce log-compressive magnitude geometry, while geometry alone does not guarantee behavioural competence. The study contributes to our understanding of the relationships between representation, behaviour, and training data statistics in transformer models.
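The efficient coding precondition referenced here is a power-law distribution of number frequencies in training corpora, f(n) ∝ n^(−α), with the paper reporting α ≈ 0.77. As a minimal sketch (the function name and synthetic counts below are illustrative, not the paper's corpus or code), such an exponent can be recovered by regression in log-log space:

```python
import numpy as np

def fit_power_law_exponent(numbers, counts):
    """Estimate alpha in counts ~ numbers**(-alpha) via log-log regression."""
    log_n = np.log(np.asarray(numbers, dtype=float))
    log_c = np.log(np.asarray(counts, dtype=float))
    # The slope of log(count) against log(n) is -alpha.
    slope, _intercept = np.polyfit(log_n, log_c, 1)
    return -slope

# Synthetic counts drawn exactly from a 0.77 power law (for illustration only).
ns = np.arange(1, 101, dtype=float)
counts = 1e6 * ns ** -0.77
print(round(fit_power_law_exponent(ns, counts), 2))  # 0.77
```

Real corpus counts would of course be noisy, so a robust fit (or binned regression) is typically preferred over raw least squares.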
Key Points
- ▸ Representational geometry is consistently log-compressive across three magnitude domains and three 7-9B instruction-tuned models.
- ▸ Representational geometry is dissociated from behavioural competence: only one model produces a human-range Weber fraction (WF = 0.20), and models perform at chance on temporal and spatial discrimination despite possessing logarithmic geometry.
- ▸ Causal intervention reveals a layer dissociation: early layers are functionally implicated in magnitude processing (4.1x specificity), while later layers, where geometry is strongest, are not causally engaged (1.2x).
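For context on the Weber fraction (WF) mentioned above: it is the smallest relative difference a system can reliably discriminate at a fixed accuracy threshold (here a human-range WF of 0.20 means roughly a 20% difference is needed). A minimal sketch of how such a threshold can be read off a psychometric curve, assuming a 75% criterion and made-up accuracy values (not the paper's measurements):

```python
import numpy as np

def weber_fraction(deltas, accuracies, threshold=0.75):
    """Interpolate the relative difference at which accuracy crosses threshold."""
    deltas = np.asarray(deltas, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    # np.interp expects monotonically increasing x-values (accuracies here).
    return float(np.interp(threshold, accuracies, deltas))

deltas = [0.05, 0.10, 0.20, 0.40, 0.80]  # relative differences tested
accs   = [0.52, 0.60, 0.75, 0.90, 0.98]  # illustrative discrimination accuracies
print(weber_fraction(deltas, accs))  # 0.2
```

In practice a psychometric function (e.g. a cumulative Gaussian) would be fitted rather than linearly interpolated, but the threshold logic is the same.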
Merits
Strength of methodology
The authors employ a multi-paradigm approach, combining representational similarity analysis, behavioural discrimination, precision gradients, and causal intervention to provide a comprehensive understanding of magnitude representations in transformer models.
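The first of these paradigms, representational similarity analysis (RSA), compares an empirical dissimilarity matrix computed from hidden states against a model matrix derived from a hypothesis; under a Weber-law (log-compressive) hypothesis, d(i, j) = |log i − log j|. The following is a hedged sketch of that comparison on synthetic activations (all names, shapes, and the Euclidean dissimilarity choice are illustrative assumptions, not the paper's implementation):

```python
import numpy as np
from scipy.stats import spearmanr

def model_rdm(values, compress=np.log):
    """Weber-law model RDM: d(i, j) = |log i - log j|."""
    v = compress(np.asarray(values, dtype=float))
    return np.abs(v[:, None] - v[None, :])

def empirical_rdm(states):
    """Pairwise Euclidean distances between hidden-state vectors."""
    diff = states[:, None, :] - states[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def rsa_correlation(rdm_a, rdm_b):
    """Spearman correlation of the off-diagonal upper triangles."""
    iu = np.triu_indices_from(rdm_a, k=1)
    rho, _pval = spearmanr(rdm_a[iu], rdm_b[iu])
    return rho

# Synthetic check: activations that encode log-magnitude along one direction
# should correlate highly with the Weber-law model RDM.
rng = np.random.default_rng(0)
nums = np.arange(1, 21)
direction = rng.standard_normal(16)
states = np.outer(np.log(nums), direction) + 0.01 * rng.standard_normal((20, 16))
print(rsa_correlation(model_rdm(nums), empirical_rdm(states)))
```

A linear-geometry hypothesis is tested the same way with `compress=lambda v: v`, which is how the paper can report that linear geometry is never preferred.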
Depth of analysis
The study delves into the relationships between representation, behaviour, and training data statistics, showing how magnitude processing is distributed across layers and how representational geometry can dissociate from behavioural competence.
Demerits
Limitation of generalizability
The study is limited to a specific set of models and magnitude domains, which may not generalize to other architectures or tasks.
Potential overemphasis on efficient coding
The study's focus on efficient coding as a precondition for log-compressive magnitude geometry may overlook other important factors influencing magnitude representations.
Expert Commentary
The study makes a significant contribution to the psychophysics of transformer models. The converging paradigms and the layer-wise causal analysis provide a nuanced picture of how representational geometry can dissociate from behavioural competence. However, the limited model and domain coverage, and the emphasis on efficient coding over other explanatory factors, should be weighed in follow-up work. The findings bear on interpretability research and on how numerical competence in language models is evaluated: geometric evidence alone should not be taken as evidence of behavioural capability.
Recommendations
- ✓ Future studies should investigate the generalizability of the study's findings to other architectures and tasks.
- ✓ Researchers should consider the potential impact of other factors, such as task-specific training data or cognitive biases, on magnitude representations in transformer models.
Sources
Original: arXiv - cs.CL