
Weber's Law in Transformer Magnitude Representations: Efficient Coding, Representational Geometry, and Psychophysical Laws in Language Models


Jon-Paul Cacioli

arXiv:2603.20642v1. Abstract: How do transformer language models represent magnitude? Recent work disagrees: some find logarithmic spacing, others linear encoding, others per-digit circular representations. We apply the formal tools of psychophysics to resolve this. Using four converging paradigms (representational similarity analysis, behavioural discrimination, precision gradients, causal intervention) across three magnitude domains in three 7-9B instruction-tuned models spanning three architecture families (Llama, Mistral, Qwen), we report three findings. First, representational geometry is consistently log-compressive: RSA correlations with a Weber-law dissimilarity matrix ranged from .68 to .96 across all 96 model-domain-layer cells, with linear geometry never preferred. Second, this geometry is dissociated from behaviour: one model produces a human-range Weber fraction (WF = 0.20) while the other does not, and both models perform at chance on temporal and spatial discrimination despite possessing logarithmic geometry. Third, causal intervention reveals a layer dissociation: early layers are functionally implicated in magnitude processing (4.1x specificity) while later layers where geometry is strongest are not causally engaged (1.2x). Corpus analysis confirms the efficient coding precondition (alpha = 0.77). These results suggest that training data statistics alone are sufficient to produce log-compressive magnitude geometry, but geometry alone does not guarantee behavioural competence.
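The RSA paradigm named in the abstract compares a model's representational dissimilarity matrix (RDM) against candidate geometries such as a Weber-law (logarithmic) versus a linear number line. A minimal sketch of that comparison on synthetic data, assuming nothing about the paper's actual pipeline (all names and numbers below are illustrative):

```python
import numpy as np

# `embeddings` stands in for activations extracted at one layer for the
# numerals 1..N. The toy activations are laid out on a log number line
# plus a little noise, so the Weber-law (log) candidate RDM should win.
rng = np.random.default_rng(0)
N = 20
magnitudes = np.arange(1, N + 1)
embeddings = np.log(magnitudes)[:, None] + 0.01 * rng.standard_normal((N, 4))

def rdm_upper(X):
    """Upper triangle of a Euclidean representational dissimilarity matrix."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.triu_indices(len(X), k=1)
    return d[i, j]

def rank_corr(a, b):
    """Spearman-style rank correlation (tie handling simplified)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

model_rdm = rdm_upper(embeddings)
log_rdm = rdm_upper(np.log(magnitudes)[:, None].astype(float))
lin_rdm = rdm_upper(magnitudes[:, None].astype(float))

r_log = rank_corr(model_rdm, log_rdm)
r_lin = rank_corr(model_rdm, lin_rdm)
print(f"RSA vs log geometry: {r_log:.2f}, vs linear geometry: {r_lin:.2f}")
```

Because the toy activations are log-spaced by construction, the log RDM correlates more strongly with the model RDM than the linear one, mirroring the direction of the paper's reported result.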

Executive Summary

This article summarizes an investigation of magnitude representations in transformer language models through the lens of Weber's Law. Using four converging paradigms, the authors show that representational geometry is consistently log-compressive, that this geometry is dissociated from behavioural competence, and that its causal role depends on layer depth. Corpus analysis confirms the efficient-coding precondition, and the results suggest that training-data statistics alone can produce log-compressive magnitude geometry, while geometry alone does not guarantee behavioural competence. The study advances our understanding of the relationships between representation, behaviour, and training-data statistics in transformer models.

Key Points

  • Representational geometry is consistently log-compressive across three magnitude domains and three 7-9B instruction-tuned models.
  • Representational geometry is dissociated from behavioural competence: only one tested model produces a human-range Weber fraction (WF = 0.20), and both perform at chance on temporal and spatial discrimination despite possessing logarithmic geometry.
  • Causal intervention reveals a layer dissociation: early layers are functionally implicated in magnitude processing (4.1x specificity), while later layers, where geometry is strongest, are not causally engaged (1.2x).
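The specificity ratios cited in the causal-intervention finding can be illustrated with a toy computation. Assuming, hypothetically, that patching a layer's activations reduces accuracy by some amount on a magnitude task and on a matched control task, the specificity is the ratio of the two drops (the numbers below are invented to mirror the reported early-layer figure, not taken from the paper's data):

```python
# Illustrative accuracy drops after patching one layer's activations.
drop_magnitude = 0.41  # drop on the magnitude-discrimination task
drop_control = 0.10    # drop on a matched non-magnitude control task

# Specificity: how much more the intervention hurts magnitude processing
# than unrelated processing. A ratio near 1x means no selective effect.
specificity = drop_magnitude / drop_control
print(f"specificity = {specificity:.1f}x")  # prints "specificity = 4.1x"
```

By this measure, a 4.1x early-layer ratio indicates a selective causal role, whereas the 1.2x late-layer ratio is close to the no-effect baseline.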

Merits

Strength of methodology

The authors employ a multi-paradigm approach, combining representational similarity analysis, behavioural discrimination, precision gradients, and causal intervention to provide a comprehensive understanding of magnitude representations in transformer models.
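Of the paradigms listed above, behavioural discrimination yields the Weber fraction: the proportional change in magnitude at which discrimination reaches a criterion level of accuracy. A minimal sketch with made-up accuracy data (a full treatment would fit a cumulative-Gaussian psychometric function rather than interpolate):

```python
import numpy as np

# Hypothetical discrimination data: proportion correct at judging which
# of two stimuli n1 < n2 is larger, as a function of the relative
# difference delta = (n2 - n1) / n1. Values are illustrative only.
ratios = np.array([0.05, 0.10, 0.20, 0.40, 0.80])
accuracy = np.array([0.54, 0.61, 0.75, 0.90, 0.98])

# Weber fraction: the ratio at which accuracy crosses the 75% criterion,
# found here by linear interpolation (accuracy must be increasing).
wf = np.interp(0.75, accuracy, ratios)
print(f"Weber fraction = {wf:.2f}")
```

In this synthetic example the criterion is crossed at a ratio of 0.20, which happens to sit in the human range, as reported for one of the paper's models.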

Depth of analysis

The study probes the relationships between representation, behavior, and training data statistics, showing that magnitude processing is layer-dependent and that representational geometry can come apart from behavioral competence.

Demerits

Limitation of generalizability

The study covers three 7-9B instruction-tuned models and three magnitude domains, so the findings may not generalize to other architectures, model scales, or tasks.

Potential overemphasis on efficient coding

The study's focus on efficient coding as a precondition for log-compressive magnitude geometry may overlook other important factors influencing magnitude representations.
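For context, the efficient-coding precondition (alpha = 0.77) refers to a power-law relation between a number's magnitude and its corpus frequency, frequency proportional to magnitude^(-alpha). A sketch of the exponent fit on synthetic counts generated with alpha = 0.77 (the real analysis would use actual corpus frequencies):

```python
import numpy as np

# Synthetic corpus counts for number words 1..100, generated to follow
# an exact power law with exponent 0.77; real counts would be noisy.
magnitudes = np.arange(1, 101)
true_alpha = 0.77
freqs = magnitudes.astype(float) ** (-true_alpha)

# Estimate alpha as the negative slope of a log-log linear regression.
slope, intercept = np.polyfit(np.log(magnitudes), np.log(freqs), 1)
alpha_hat = -slope
print(f"alpha = {alpha_hat:.2f}")  # prints "alpha = 0.77"
```

Under efficient coding, such a frequency distribution is exactly the regime in which a log-compressive code minimizes expected error, which is why the exponent matters for the paper's argument.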

Expert Commentary

The study makes a significant contribution to research on magnitude representations in transformer models. The use of multiple converging paradigms and the layer-wise causal analysis provides a nuanced account of the relationship between representation and behavior. However, the limited generalizability and the potential overemphasis on efficient coding should be addressed in future work. The findings have implications for building transformer models that reason reliably about quantity and may inform decisions about AI system development and deployment.

Recommendations

  • Future studies should investigate the generalizability of the study's findings to other architectures and tasks.
  • Researchers should consider the potential impact of other factors, such as task-specific training data or cognitive biases, on magnitude representations in transformer models.

Sources

Original: arXiv - cs.CL