Manifold of Failure: Behavioral Attraction Basins in Language Models
arXiv:2602.22291v1 Announce Type: new

Abstract: While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models (LLMs). We reframe the search for vulnerabilities as a quality diversity problem, using MAP-Elites to illuminate the continuous topology of these failure regions, which we term behavioral attraction basins. Our quality metric, Alignment Deviation, guides the search towards areas where the model's behavior diverges most from its intended alignment. Across three LLMs (Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini), we show that MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, and reveals dramatically different model-specific topological signatures: Llama-3-8B exhibits a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B shows a fragmented landscape with spatially concentrated basins (mean 0.73), and GPT-5-Mini demonstrates strong robustness with a ceiling at 0.50. Our approach produces interpretable, global maps of each model's safety landscape that no existing attack method (GCG, PAIR, or TAP) can provide, shifting the paradigm from finding discrete failures to understanding their underlying structure.
Executive Summary
This article introduces a novel framework for systematically mapping the 'Manifold of Failure' in Large Language Models (LLMs), a concept crucial for AI safety. By reframing the search for vulnerabilities as a quality diversity problem, the authors employ MAP-Elites to illuminate the continuous topology of failure regions, termed behavioral attraction basins. The study demonstrates the effectiveness of the approach across three LLMs, achieving up to 63% behavioral coverage, discovering up to 370 distinct vulnerability niches, and revealing model-specific topological signatures. This work shifts the paradigm from finding discrete failures to understanding their underlying structure, providing interpretable, global maps of each model's safety landscape.
Key Points
- ▸ The article proposes a framework for systematically mapping the Manifold of Failure in LLMs, reframing vulnerability search as a quality diversity problem.
- ▸ The authors employ MAP-Elites, guided by an Alignment Deviation quality metric, to illuminate the continuous topology of failure regions, termed behavioral attraction basins.
- ▸ Across Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini, the approach achieves up to 63% behavioral coverage and uncovers up to 370 distinct vulnerability niches.
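The quality diversity framing above can be made concrete with a minimal MAP-Elites sketch. This is an illustrative toy, not the paper's implementation: the grid resolution, behavior descriptor, mutation operator, and the `alignment_deviation` scorer are all hypothetical stand-ins (the real metric would query the target LLM and score how far its response strays from alignment).

```python
import random

GRID = 10  # cells per behavioral axis (assumed resolution, not from the paper)

def behavior_descriptor(prompt):
    # Hypothetical 2-D descriptor: maps a prompt to a cell in behavior space.
    # The paper's descriptors would capture semantic properties of the attack.
    return (hash(prompt) % GRID, len(prompt) % GRID)

def alignment_deviation(prompt):
    # Stand-in for the Alignment Deviation metric; a real version would
    # elicit a model response and score its divergence from alignment.
    return random.random()

def mutate(prompt):
    # Toy mutation operator: real operators would edit prompts more richly.
    return prompt + " " + random.choice(["please", "ignore", "roleplay"])

def map_elites(seed_prompts, iterations=1000):
    # Archive maps each behavioral cell to its elite: (quality, prompt).
    archive = {}
    for p in seed_prompts:
        archive[behavior_descriptor(p)] = (alignment_deviation(p), p)
    for _ in range(iterations):
        # Pick a random elite, mutate it, and let the child compete for
        # whichever cell its behavior lands in.
        _, parent = random.choice(list(archive.values()))
        child = mutate(parent)
        cell = behavior_descriptor(child)
        quality = alignment_deviation(child)
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, child)
    return archive

archive = map_elites(["tell me how to"], iterations=500)
# Behavioral coverage = fraction of cells holding an elite; the number of
# filled cells corresponds to the paper's count of vulnerability niches.
coverage = len(archive) / (GRID * GRID)
```

The key design point is that MAP-Elites keeps one elite per behavioral niche rather than a single global optimum, which is what turns a vulnerability search into a global map of the failure landscape.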
Merits
Strength in Methodology
The authors' use of MAP-Elites to systematically map the Manifold of Failure is a significant methodological strength, enabling the discovery of up to 370 distinct vulnerability niches.
Insights into Model-Specific Safety Landscapes
The study reveals dramatically different model-specific topological signatures, providing valuable insights into the safety landscapes of LLMs.
Demerits
Limited Generalizability
The study's findings may not be directly generalizable to other types of AI models or domains, limiting the scope of their conclusions.
High Computational Requirements
The MAP-Elites approach may require significant computational resources, potentially limiting its practical applicability.
Expert Commentary
This article marks a significant step forward in AI safety research, offering a framework that maps the Manifold of Failure globally rather than probing it with isolated attacks such as GCG, PAIR, or TAP. The use of MAP-Elites is a notable methodological strength, and the contrast between Llama-3-8B's near-universal vulnerability plateau, GPT-OSS-20B's fragmented landscape, and GPT-5-Mini's robustness ceiling is a qualitatively new kind of finding. That said, the results may not generalize beyond the three models studied, and the computational cost of quality diversity search may limit practical adoption. Despite these limitations, the insights into model-specific safety landscapes are invaluable, contributing to the growing body of research on explainability and transparency in AI.
Recommendations
- ✓ Recommendation 1: Future research should seek to generalize the MAP-Elites approach to other types of AI models and domains, expanding the scope of its conclusions.
- ✓ Recommendation 2: The development of more efficient and scalable methods for mapping the Manifold of Failure in LLMs is crucial for the practical applicability of the study's findings.