AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue

Adib Sakhawat, Fardeen Sadab, Rakin Shahriar

arXiv:2602.17443v1

Abstract: Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment (state maintenance) in dialogue. We propose two complementary tasks: AIDG-I, measuring pragmatic strategy in social deduction, and AIDG-II, measuring constraint satisfaction in a structured "20 Questions" setting. Across 439 games with six frontier LLMs, we observe a clear capability asymmetry: models perform substantially better at containment than deduction, with a 350 ELO advantage on defense (Cohen's d = 5.47). We identify two bottlenecks driving this gap: (1) Information Dynamics, where confirmation strategies are 7.75x more effective than blind deduction (p < 0.00001), and (2) Constraint Adherence, where instruction-following degrades under conversational load, accounting for 41.3% of deductive failures. These findings suggest that while LLMs excel at local defensive coherence, they struggle with the global state tracking required for strategic inquiry.

Executive Summary

The article introduces AIDG (Adversarial Information Deduction Game), a game-theoretic framework for evaluating the strategic reasoning of Large Language Models (LLMs) in dynamic, multi-turn interactions. Through two complementary tasks, AIDG-I (pragmatic strategy in social deduction) and AIDG-II (constraint satisfaction in a structured "20 Questions" setting), the study measures the asymmetry between information extraction (active deduction) and information containment (state maintenance). The findings reveal a substantial performance gap: LLMs excel at containment but struggle with deduction. The two identified bottlenecks, information dynamics and constraint adherence, point to weaknesses in the global state tracking required for strategic inquiry.
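
To make the evaluation setup concrete, below is a minimal sketch of one "20 Questions"-style episode in the spirit of AIDG-II. It is a hypothetical harness under assumed conventions (the names `run_deduction_game`, `ask`, `answer`, and the "GUESS:" protocol are inventions for illustration); the paper's actual prompts, referee, and scoring are not reproduced here.

```python
# Illustrative sketch only, not the authors' implementation.
# ask() returns either a yes/no question or a final "GUESS: <word>";
# answer() plays the containment side and must reply "yes" or "no".

def run_deduction_game(ask, answer, secret: str, max_turns: int = 20) -> bool:
    """Return True if the questioner names the secret within the turn budget."""
    history: list[tuple[str, str]] = []
    for _ in range(max_turns):
        move = ask(history)
        if move.startswith("GUESS:"):
            return move[len("GUESS:"):].strip().lower() == secret.lower()
        reply = answer(secret, move)   # containment: answer without leaking
        history.append((move, reply))
    return False  # budget exhausted: containment wins by default

# Toy demo with scripted players standing in for LLM calls:
script = iter(["Is it an animal?", "Can it fly?", "GUESS: penguin"])
ask = lambda history: next(script)
answer = lambda secret, q: "yes" if "animal" in q else "no"
print(run_deduction_game(ask, answer, "penguin"))  # True: deduction succeeded
```

On this framing, the paper's asymmetry result says models are better at playing the answer role (keeping each reply locally coherent) than the ask role (tracking the global constraint set across turns).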

Key Points

  • Introduction of AIDG framework for evaluating LLMs in multi-turn dialogues.
  • Identification of a 350 ELO advantage in containment over deduction tasks (Cohen's d = 5.47); see the worked numbers after this list.
  • Confirmation strategies are 7.75 times more effective than blind deduction.
  • Instruction-following degrades under conversational load, causing 41.3% of deductive failures.
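
For readers less familiar with these metrics, the sketch below translates the headline numbers into more intuitive terms. The Elo expected-score formula and Cohen's d are textbook definitions, not taken from the paper; the only inputs drawn from the abstract are the 350-point gap and d = 5.47.

```python
# Standard formulas applied to the numbers quoted in the abstract.

def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the higher-rated player under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

def cohens_d(mean_a: float, mean_b: float, pooled_std: float) -> float:
    """Cohen's d: standardized difference between two group means."""
    return (mean_a - mean_b) / pooled_std

# A 350-point Elo edge on defense implies the defender is expected to
# score about 0.88 per game against the same pool playing offense:
print(f"{elo_expected_score(350):.3f}")  # ~0.882
```

By the usual conventions (0.8 already counts as a "large" effect), a Cohen's d of 5.47 indicates nearly non-overlapping performance distributions between the two roles.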

Merits

Innovative Framework

The AIDG framework provides a novel approach to assessing LLMs' strategic reasoning in dynamic settings, moving beyond static benchmarks.

Comprehensive Analysis

The study offers a detailed examination of the asymmetry between information extraction and containment, supported by data from 439 games across six frontier LLMs.

Identification of Key Bottlenecks

The article pinpoints critical bottlenecks in information dynamics and constraint adherence, offering insights into LLM limitations.
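
To illustrate what constraint adherence means operationally, here is a hypothetical consistency check that flags a guess contradicting the accumulated yes/no history. The `KNOWLEDGE` table and all names are invented for illustration; the paper's actual adherence metric is not reproduced here.

```python
# Hypothetical consistency check for a "20 Questions" transcript.
# KNOWLEDGE maps (entity, question) to the ground-truth answer; it is
# a made-up toy table, not data from the paper.
KNOWLEDGE = {
    ("penguin", "Is it an animal?"): "yes",
    ("penguin", "Can it fly?"): "no",
    ("eagle", "Is it an animal?"): "yes",
    ("eagle", "Can it fly?"): "yes",
}

def violates_history(candidate: str, history: list[tuple[str, str]]) -> bool:
    """True if guessing `candidate` contradicts any earlier answer."""
    return any(
        KNOWLEDGE.get((candidate, question)) not in (None, reply)
        for question, reply in history
    )

history = [("Is it an animal?", "yes"), ("Can it fly?", "no")]
print(violates_history("eagle", history))    # True: an eagle can fly
print(violates_history("penguin", history))  # False: consistent guess
```

The finding that 41.3% of deductive failures trace to lapses of roughly this kind (questions or guesses that ignore earlier answers) supports the paper's conclusion that the bottleneck is global state tracking rather than local reasoning.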

Demerits

Limited Scope

The study focuses on a specific set of tasks and models, which may not fully represent the broader capabilities and limitations of LLMs.

Potential Bias in Task Design

The design of the AIDG tasks could itself bias results, for instance by favoring particular questioning or answering styles, in ways the analysis may not fully account for.

Generalizability Concerns

The findings may not be generalizable to all types of multi-turn dialogues or different LLM architectures.

Expert Commentary

The article presents a rigorous and innovative approach to evaluating the strategic reasoning capabilities of LLMs through the AIDG framework. The identification of a significant performance gap between information extraction and containment tasks is particularly noteworthy, as it highlights a critical area for improvement in LLM development. The study's focus on information dynamics and constraint adherence provides valuable insights into the underlying mechanisms driving this asymmetry. However, the limited scope and potential biases in task design warrant caution in generalizing the findings. Future research should explore the applicability of the AIDG framework to a broader range of tasks and models to ensure robustness and validity. Overall, this study contributes significantly to the field by advancing our understanding of LLM capabilities and limitations in dynamic, multi-turn interactions.

Recommendations

  • Expand the AIDG framework to include a more diverse set of tasks and models to enhance generalizability.
  • Conduct further research to address potential biases in the task design and evaluation metrics.
  • Incorporate the findings into LLM training and development processes to improve strategic reasoning capabilities.
