Manifold of Failure: Behavioral Attraction Basins in Language Models
arXiv:2602.22291v1 Announce Type: new

Abstract: While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models (LLMs). We reframe the search for vulnerabilities as a quality diversity problem, using MAP-Elites to illuminate the continuous topology of these failure regions, which we term behavioral attraction basins. Our quality metric, Alignment Deviation, guides the search towards areas where the model's behavior diverges most from its intended alignment. Across three LLMs (Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini), we show that MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, and reveals dramatically different model-specific topological signatures: Llama-3-8B exhibits a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B shows a fragmented landscape with spatially concentrated basins (mean 0.73), and GPT-5-Mini demonstrates strong robustness with a ceiling at 0.50. Our approach produces interpretable, global maps of each model's safety landscape that no existing attack method (GCG, PAIR, or TAP) can provide, shifting the paradigm from finding discrete failures to understanding their underlying structure.
Executive Summary
This article introduces a novel framework for systematically mapping the 'Manifold of Failure' in Large Language Models (LLMs), a concept crucial for AI safety. By reframing the search for vulnerabilities as a quality diversity problem, the authors employ MAP-Elites to illuminate the continuous topology of failure regions, termed behavioral attraction basins. The study demonstrates the effectiveness of the approach across three LLMs, achieving up to 63% behavioral coverage, discovering up to 370 distinct vulnerability niches, and revealing model-specific topological signatures. This work shifts the paradigm from finding discrete failures to understanding their underlying structure, providing interpretable, global maps of each model's safety landscape.
Key Points
- ▸ The article proposes a framework for systematically mapping the Manifold of Failure in LLMs, reframing vulnerability search as a quality diversity problem.
- ▸ The authors employ MAP-Elites, guided by an Alignment Deviation quality metric, to illuminate the continuous topology of failure regions, termed behavioral attraction basins.
- ▸ Across Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini, the approach achieves up to 63% behavioral coverage and uncovers up to 370 distinct vulnerability niches.
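The quality diversity framing above can be made concrete with a minimal MAP-Elites sketch. This is an illustrative toy, not the paper's implementation: the grid resolution, behavior descriptor, mutation operator, and the `alignment_deviation` scorer are all hypothetical stand-ins (the real metric would query the target LLM and score how far its response strays from alignment).

```python
import random

GRID = 10  # cells per behavioral axis (assumed resolution, not from the paper)

def behavior_descriptor(prompt):
    # Hypothetical 2-D descriptor: maps a prompt to a cell in behavior space.
    # The paper's descriptors would capture semantic properties of the attack.
    return (hash(prompt) % GRID, len(prompt) % GRID)

def alignment_deviation(prompt):
    # Stand-in for the Alignment Deviation metric; a real version would
    # elicit a model response and score its divergence from alignment.
    return random.random()

def mutate(prompt):
    # Toy mutation operator: real operators would edit prompts more richly.
    return prompt + " " + random.choice(["please", "ignore", "roleplay"])

def map_elites(seed_prompts, iterations=1000):
    # Archive maps each behavioral cell to its elite: (quality, prompt).
    archive = {}
    for p in seed_prompts:
        archive[behavior_descriptor(p)] = (alignment_deviation(p), p)
    for _ in range(iterations):
        # Pick a random elite, mutate it, and let the child compete for
        # whichever cell its behavior lands in.
        _, parent = random.choice(list(archive.values()))
        child = mutate(parent)
        cell = behavior_descriptor(child)
        quality = alignment_deviation(child)
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, child)
    return archive

archive = map_elites(["tell me how to"], iterations=500)
# Behavioral coverage = fraction of cells holding an elite; the number of
# filled cells corresponds to the paper's count of vulnerability niches.
coverage = len(archive) / (GRID * GRID)
```

The key design point is that MAP-Elites keeps one elite per behavioral niche rather than a single global optimum, which is what turns a vulnerability search into a global map of the failure landscape.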
Merits
Strength in Methodology
The authors' use of MAP-Elites to systematically map the Manifold of Failure is a significant methodological strength, enabling the discovery of up to 370 distinct vulnerability niches.
Insights into Model-Specific Safety Landscapes
The study reveals dramatically different model-specific topological signatures, providing valuable insights into the safety landscapes of LLMs.
Demerits
Limited Generalizability
The study's findings may not be directly generalizable to other types of AI models or domains, limiting the scope of their conclusions.
High Computational Requirements
The MAP-Elites approach may require significant computational resources, potentially limiting its practical applicability.
Expert Commentary
This article marks a significant step forward in AI safety research, offering a framework that maps the Manifold of Failure globally rather than probing it with isolated attacks such as GCG, PAIR, or TAP. The use of MAP-Elites is a notable methodological strength, and the contrast between Llama-3-8B's near-universal vulnerability plateau, GPT-OSS-20B's fragmented landscape, and GPT-5-Mini's robustness ceiling is a qualitatively new kind of finding. That said, the results may not generalize beyond the three models studied, and the computational cost of quality diversity search may limit practical adoption. Despite these limitations, the insights into model-specific safety landscapes are invaluable, contributing to the growing body of research on explainability and transparency in AI.
Recommendations
- ✓ Recommendation 1: Future research should seek to generalize the MAP-Elites approach to other types of AI models and domains, expanding the scope of its conclusions.
- ✓ Recommendation 2: The development of more efficient and scalable methods for mapping the Manifold of Failure in LLMs is crucial for the practical applicability of the study's findings.