
MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games

arXiv:2602.24188v1 (announce type: new)

Abstract: We present a scalable methodology for evaluating language models in multi-turn interactions, using a suite of collaborative games that require effective communication about private information. This enables an interactive scaling analysis, in which a fixed token budget is divided over a variable number of turns. We find that in many cases, language models are unable to use interactive collaboration to improve over the non-interactive baseline scenario in which one agent attempts to summarize its information and the other agent immediately acts, despite substantial headroom. This suggests that state-of-the-art models still suffer from significant weaknesses in planning and executing multi-turn collaborative conversations. We analyze the linguistic features of these dialogues, assessing the roles of sycophancy, information density, and discourse coherence. While there is no single linguistic explanation for the collaborative weaknesses of contemporary language models, we note that humans achieve comparable task success at superior token efficiency by producing dialogues that are more coherent than those produced by most language models. The proactive management of private information is a defining feature of real-world communication, and we hope that MT-PingEval will drive further work towards improving this capability.

Executive Summary

This article introduces MT-PingEval, a scalable methodology for evaluating language models in multi-turn interactions through a suite of collaborative games that require effective communication about private information. The authors run an interactive scaling analysis in which a fixed token budget is divided over a variable number of turns. In many cases, state-of-the-art models fail to improve over the non-interactive baseline, in which one agent summarizes its information and the other acts immediately, despite substantial headroom. The authors read this as evidence of significant weaknesses in planning and executing multi-turn collaborative conversations. Their linguistic analysis of sycophancy, information density, and discourse coherence finds no single explanation, but notes that humans achieve comparable task success at superior token efficiency by producing more coherent dialogues. The study treats proactive management of private information as a defining feature of real-world communication and calls for further work on this capability. The findings bear on the development of language models and human-computer interfaces.

Key Points

  • MT-PingEval evaluates language models in multi-turn interactions using collaborative games built around private information
  • In many cases, state-of-the-art language models fail to improve over a non-interactive baseline, despite substantial headroom
  • Humans achieve comparable task success with fewer tokens by producing more coherent dialogues; human-computer interfaces stand to benefit from models that collaborate better

Merits

Strengths in Methodological Approach

The study uses a scalable methodology and an interactive scaling analysis that divides a fixed token budget over a variable number of turns, which allows interactive collaboration to be compared directly against a non-interactive baseline.
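To make the idea of an interactive scaling analysis concrete, the following minimal sketch splits a fixed token budget over different numbers of turns. The even split, the function names `per_turn_budgets` and `scaling_grid`, and the example numbers are illustrative assumptions; the abstract states only that a fixed budget is divided over a variable number of turns, not how the authors allocate it.

```python
def per_turn_budgets(total_tokens: int, num_turns: int) -> list[int]:
    """Split a fixed token budget as evenly as possible over num_turns.

    Assumption: an even split. The paper may allocate tokens differently.
    """
    if num_turns < 1:
        raise ValueError("num_turns must be at least 1")
    if total_tokens < num_turns:
        raise ValueError("total_tokens must allow at least one token per turn")
    base, extra = divmod(total_tokens, num_turns)
    # The first `extra` turns absorb the remainder, one token each.
    return [base + 1 if i < extra else base for i in range(num_turns)]


def scaling_grid(total_tokens: int, turn_counts: list[int]) -> dict[int, list[int]]:
    """Map each turn count to its per-turn budgets under the same total."""
    return {turns: per_turn_budgets(total_tokens, turns) for turns in turn_counts}


if __name__ == "__main__":
    # Hypothetical settings: 1000 tokens spread over 1, 2, and 4 turns.
    for turns, budgets in scaling_grid(1000, [1, 2, 4]).items():
        print(turns, budgets)
```

Under this sketch, the single-turn setting corresponds to a condition in which the whole budget is spent at once, while larger turn counts trade message length for more rounds of exchange at the same total cost.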

Insights into Human Dialogue Performance

The comparison with human dialogues shows that humans achieve comparable task success at superior token efficiency, and it points to coherence and proactive management of private information as important features of real-world communication.

Demerits

Limitation in Generalizability

The study focuses on a specific suite of collaborative games, so its findings may not generalize to other interaction scenarios.

Dependence on Task-Specific Results

The results may be specific to the tasks studied and may not carry over to other domains or tasks.

Expert Commentary

The MT-PingEval study contributes to natural language processing by exposing the limits of current language models in multi-turn interactions. Its findings underline the role of coherence and proactive information management in real-world communication, which matters for the design of human-computer interfaces. The study also leaves open how far the results generalize beyond the games and tasks it covers. Further research is needed to address these limitations and to build language models that collaborate more effectively.

Recommendations

  • Future research should focus on language models with improved collaboration capabilities, drawing on the observed human dialogue performance and on proactive information management.
  • The study's methodology and findings should be replicated and extended to other domains and tasks to test how far the results generalize.
