The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse

Julian Coda-Forno, Jane X. Wang, Arslan Chaudhry

arXiv:2604.04943v1 Announce Type: new Abstract: The reversal curse describes a failure of autoregressive language models to retrieve a fact in reverse order (e.g., training on "A > B" but failing on "B < A"). Recent work shows that objectives with bidirectional supervision (e.g., bidirectional attention or masking-based reconstruction for decoder-only models) can mitigate the reversal curse. We extend this evaluation to include a vanilla masked language modeling (MLM) objective and compare it to decoder-only masking-based training across four reversal benchmarks, and then provide a minimal mechanistic study of how these objectives succeed. We show that reversal accuracy requires training signal that explicitly makes the source entity a prediction target, and we find little evidence that success corresponds to a single direction-agnostic representation of a fact. Instead, representation distances and linear probes are consistent with storing forward and reverse directions as distinct entries, with different indexing geometry for MLM versus decoder-only masking-based training. Our results caution that objective-level "fixes" can improve reversal behavior without necessarily inducing the kind of latent generalization one might expect from a unified concept.
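To make the setup concrete, here is a minimal sketch (not the paper's code; entity names and templates are illustrative) of how a reversal-curse benchmark pairs forward training sentences with reverse evaluation queries:

```python
# Toy reversal-curse split: train on forward statements only, then
# test the reverse direction, where the source entity must be produced.
facts = [("Alice", "Bob"), ("Carol", "Dave"), ("Erin", "Frank")]

def make_split(pairs):
    # Training sentences state each fact in the forward direction only.
    train = [f"{a} is the parent of {b}" for a, b in pairs]
    # Reverse queries ask for the source entity A given B; a purely
    # next-token objective never made A a prediction target in this role.
    test = [(f"The parent of {b} is", a) for a, b in pairs]
    return train, test

train, test = make_split(facts)
print(train[0])  # Alice is the parent of Bob
print(test[0])   # ('The parent of Bob is', 'Alice')
```

An autoregressive model trained only on `train` tends to fail on `test`, which is the failure mode the paper's four benchmarks measure.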

Executive Summary

The article critically examines the 'reversal curse' in autoregressive language models (LMs), wherein models trained on directional facts (e.g., A > B) fail to retrieve the reverse (e.g., B < A). The authors extend prior work by evaluating bidirectional training objectives—including masked language modeling (MLM) and decoder-only masking-based training—across four reversal benchmarks. Their mechanistic analysis reveals that reversal accuracy hinges on whether the source entity is explicitly predicted as a target during training, rather than on a unified, direction-agnostic representation of facts. The study finds that forward and reverse directions are stored as distinct entries with differing indexing geometries depending on the training objective, cautioning that objective-level improvements may not equate to latent generalization. These insights challenge assumptions about latent concept learning in LMs and underscore the need for more nuanced evaluations of bidirectional training effects.

Key Points

  • The reversal curse persists in autoregressive LMs despite their ability to model complex linguistic patterns, highlighting a fundamental limitation in their representational generalization.
  • Bidirectional training objectives (e.g., MLM, decoder-only masking) mitigate the reversal curse, but their success stems from explicit prediction of source entities as targets during training, not from direction-agnostic latent generalization.
  • Mechanistic analysis shows that forward and reverse fact directions are stored as distinct entries with different indexing geometries, depending on the training objective, suggesting that bidirectional training does not induce a unified conceptual representation.
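The second key point (why making the source entity a prediction target matters) can be sketched as follows. This is an illustrative simplification, not the paper's implementation: it only enumerates which token positions each objective treats as prediction targets for one fact sentence.

```python
# Which tokens become prediction targets under each training objective?
tokens = ["Alice", "is", "the", "parent", "of", "Bob"]

def causal_targets(toks):
    # Next-token prediction: position i is predicted from toks[:i], so
    # the source entity at position 0 is never predicted from "Bob".
    return [(i, toks[:i]) for i in range(1, len(toks))]

def mlm_targets(toks, mask_index):
    # Masked language modeling: the masked token is predicted from the
    # full bidirectional context, so masking the source entity makes it
    # an explicit prediction target conditioned on the target entity.
    context = toks[:mask_index] + ["[MASK]"] + toks[mask_index + 1:]
    return (toks[mask_index], context)

print(causal_targets(tokens)[0])  # (1, ['Alice'])
print(mlm_targets(tokens, 0))     # ('Alice', ['[MASK]', 'is', ...])
```

Under the causal objective, "Alice" is only ever a target with empty context; under MLM (or decoder-only masking-based reconstruction), masking "Alice" supplies exactly the reverse-direction training signal the paper identifies as necessary.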

Merits

Rigorous Benchmark Evaluation

The study evaluates four reversal benchmarks and compares vanilla MLM with decoder-only masking-based training, providing a comprehensive empirical foundation for its claims.

Mechanistic Insight into Representational Geometry

The authors conduct a minimal mechanistic study to dissect how bidirectional objectives achieve reversal accuracy, offering novel insights into the internal organization of fact representations in LMs.

Challenges Assumptions About Latent Generalization

The article presents evidence against the notion that bidirectional training induces a direction-agnostic latent concept, advancing a more nuanced understanding of how LMs store and retrieve directional facts.

Demerits

Limited Scope of Mechanistic Analysis

The mechanistic study is described as 'minimal,' focusing on representation distances and linear probes. A deeper exploration of causal mechanisms (e.g., attention patterns, gradient dynamics) could strengthen the conclusions.
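For readers unfamiliar with the probing methodology the paper relies on, here is a hedged sketch on synthetic data (the representations are faked, not the paper's activations): if forward and reverse facts shared one direction-agnostic entry, a linear probe should not be able to separate them; high probe accuracy is instead consistent with distinct entries.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # illustrative representation dimensionality

# Fake "forward" and "reverse" fact representations as two clusters.
fwd = rng.normal(loc=+1.0, size=(50, d))
rev = rng.normal(loc=-1.0, size=(50, d))

X = np.vstack([fwd, rev])
y = np.concatenate([np.ones(50), -np.ones(50)])

# Ridge-regularized linear probe, closed form:
# w = (X^T X + lam * I)^{-1} X^T y
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

acc = float(np.mean(np.sign(X @ w) == y))
print(f"probe accuracy: {acc:.2f}")  # near 1.0 when directions separate
```

This is the correlational tool the demerit refers to: a probe can show that direction is linearly decodable, but not *why*, which is exactly where causal interventions would strengthen the analysis.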

Dependence on Synthetic Benchmarks

The reversal benchmarks rely on synthetic or highly controlled datasets (e.g., A > B relationships). The generalizability of these findings to real-world, complex natural language scenarios remains to be tested.

Narrow Focus on Bidirectional Objectives

The study primarily compares MLM and decoder-only masking, leaving other potential mitigations (e.g., architectural changes, prompting strategies) underexplored as avenues for addressing the reversal curse.

Expert Commentary

This article makes a significant contribution to the understanding of the reversal curse by systematically dismantling the assumption that bidirectional training objectives induce a unified, direction-agnostic representation of facts. The authors’ mechanistic analysis is particularly compelling, as it reveals that the success of such objectives hinges on explicit prediction targets rather than latent generalization. This challenges a growing body of work that treats bidirectional training as a panacea for representational limitations in LMs. The study’s findings also raise important questions about the nature of knowledge storage in these models: if forward and reverse directions are stored as distinct entries, does this imply a form of 'indexed knowledge' rather than a coherent conceptual structure? Such insights are critical for advancing both theoretical and applied research in machine learning. Future work should explore whether these findings extend to more complex, real-world scenarios and investigate alternative architectures or objectives that might achieve true direction-agnostic generalization.

Recommendations

  • Expand the mechanistic analysis to include causal interventions (e.g., ablations on attention mechanisms or gradient pathways) to further elucidate how bidirectional objectives store and retrieve directional facts.
  • Develop more ecologically valid reversal benchmarks that incorporate natural language corpora with inherent directional relationships (e.g., causal chains or hierarchical dependencies) to test the generalizability of the findings.
  • Explore hybrid training objectives that combine the strengths of bidirectional and autoregressive approaches, potentially leveraging auxiliary prediction tasks (e.g., source entity prediction) to explicitly address the reversal curse without sacrificing other desirable properties.
  • Investigate the interplay between model scale and the reversal curse, particularly whether larger models exhibit different representational geometries or mitigation strategies compared to smaller ones.
  • Engage with the legal and policy communities to establish standardized evaluation protocols for directional knowledge in LMs, ensuring that critical applications (e.g., contract analysis, risk assessment) are not undermined by these representational failures.
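The hybrid-objective recommendation above can be sketched in one line. This is a hypothetical formulation, not proposed in the paper; the loss values and the weight alpha are stand-ins for a model's actual log-likelihood terms.

```python
# Hybrid training objective: mix the standard next-token loss with an
# auxiliary loss that explicitly predicts the source entity of a fact.
def hybrid_loss(next_token_loss, source_entity_loss, alpha=0.25):
    # alpha weights the auxiliary reversal signal;
    # alpha = 0 recovers plain autoregressive training.
    return (1 - alpha) * next_token_loss + alpha * source_entity_loss

print(hybrid_loss(2.0, 4.0))  # 0.75 * 2.0 + 0.25 * 4.0 = 2.5
```

The design question such an objective raises is precisely the paper's caution: the auxiliary term may improve reversal accuracy by adding a second indexed entry rather than by inducing a unified representation.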

Sources

Original: arXiv - cs.CL