Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI Coordination

Rakshit Trivedi, Kartik Sharma, David C Parkes

arXiv:2602.20517v1 Announce Type: new Abstract: Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts. Imitation learning has emerged as one of the prominent approaches to build such agents by training them to mimic human-demonstrated behaviors. However, current methods struggle to capture the inherent diversity and non-Markovian nature of human behavior and lack the ability to steer behavior at inference time. Drawing inspiration from the theory of human cognitive processes, where inner speech guides action selection before execution, we propose MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an internal representation of behavioral intent. MIMIC employs the novel use of vision-language models as linguistic scaffolding to train a conditional variational autoencoder capable of generating inner speech from observations. A diffusion-based behavior cloning policy then selects actions conditioned on current observations and the generated inner speech. MIMIC enables fine-grained steering of behavior at inference time by conditioning the agent on behavior-specific speech. Experiments across robotic manipulation tasks and human-AI collaboration games demonstrate that MIMIC significantly enhances both behavior diversity and fidelity to human demonstrations while enabling nuanced behavioral steering without training on additional demonstrations. We open source our code and provide pre-trained MIMIC agents and qualitative demos at: https://mimic-research.github.io.

Executive Summary

This article summarizes MIMIC (Modeling Inner Motivations for Imitation and Control), a framework for human-AI coordination that uses language as an internal representation of behavioral intent. MIMIC employs vision-language models as linguistic scaffolding to train a conditional variational autoencoder that generates inner speech from observations, and a diffusion-based behavior cloning policy that selects actions conditioned on both the current observation and that inner speech. This design enables fine-grained steering of behavior at inference time and improves both behavior diversity and fidelity to human demonstrations. The authors demonstrate MIMIC's effectiveness across robotic manipulation tasks and human-AI collaboration games. Its generalizability to other domains and scalability to more complex scenarios remain open questions, but the approach is compelling, with potential applications in robotics, healthcare, and education.
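The pipeline described above (observation → CVAE-generated inner speech → speech-conditioned policy) can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the dimensions, the linear encoder/decoder standing in for the CVAE, and the single linear layer standing in for the diffusion policy are all assumptions chosen to show the conditioning structure, nothing more.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, SPEECH_DIM, LATENT_DIM, ACT_DIM = 16, 8, 4, 3

# Stand-in random weights; in MIMIC these would be learned from
# demonstrations annotated with VLM-generated language.
W_enc = rng.normal(size=(OBS_DIM, 2 * LATENT_DIM)) * 0.1
W_dec = rng.normal(size=(LATENT_DIM + OBS_DIM, SPEECH_DIM)) * 0.1
W_pol = rng.normal(size=(OBS_DIM + SPEECH_DIM, ACT_DIM)) * 0.1

def generate_inner_speech(obs):
    """CVAE-style sampler: observation -> latent -> inner-speech embedding."""
    stats = obs @ W_enc
    mu, log_sigma = stats[:LATENT_DIM], stats[LATENT_DIM:]
    z = mu + np.exp(log_sigma) * rng.normal(size=LATENT_DIM)  # reparameterization
    # Decoder is conditioned on both the latent and the observation.
    return np.tanh(np.concatenate([z, obs]) @ W_dec)

def policy(obs, speech):
    """Policy head conditioned on observation AND inner speech
    (stands in for the diffusion-based behavior cloning policy)."""
    return np.tanh(np.concatenate([obs, speech]) @ W_pol)

obs = rng.normal(size=OBS_DIM)
speech = generate_inner_speech(obs)
action = policy(obs, speech)
```

Because the speech embedding is sampled from a latent distribution, repeated calls on the same observation yield different inner speech and hence different actions, which is one way a model of this shape can represent behavioral diversity.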

Key Points

  • MIMIC uses language as an internal representation of behavioral intent
  • MIMIC employs vision-language models to generate inner speech
  • MIMIC enables fine-grained steering of behavior at inference time
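The last bullet, steering at inference time, amounts to swapping the agent's generated inner speech for a behavior-specific embedding while keeping the policy weights fixed. The toy below illustrates that idea; the dimensions, the linear policy, and the two hand-built speech vectors are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
OBS_DIM, SPEECH_DIM, ACT_DIM = 16, 8, 3
W = rng.normal(size=(OBS_DIM + SPEECH_DIM, ACT_DIM)) * 0.1

def act(obs, speech):
    # One fixed policy; only the speech conditioning vector changes.
    return np.tanh(np.concatenate([obs, speech]) @ W)

obs = rng.normal(size=OBS_DIM)

# Hypothetical behavior-specific embeddings (e.g. of "move cautiously"
# vs. "rush ahead"); in MIMIC these would come from encoding real text.
cautious = np.zeros(SPEECH_DIM)
cautious[0] = 1.0
aggressive = np.zeros(SPEECH_DIM)
aggressive[1] = 1.0

a_cautious = act(obs, cautious)
a_aggressive = act(obs, aggressive)
steered_differently = not np.allclose(a_cautious, a_aggressive)
```

Same observation, same weights, different conditioning, different actions: that is the mechanism behind steering "without training on additional demonstrations" claimed in the abstract.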

Merits

Strength

MIMIC's ability to capture the inherent diversity and non-Markovian nature of human behavior is a significant improvement over current imitation learning methods.

Demerits

Limitation

MIMIC's generalizability to other domains and scalability to complex scenarios remain to be explored.

Expert Commentary

MIMIC is a significant contribution to human-AI coordination because it treats language as an internal representation of behavioral intent rather than merely an input or output modality. The use of vision-language models to generate inner speech is particularly innovative, as it injects linguistic scaffolding directly into the imitation learning process. The diffusion-based behavior cloning policy, conditioned on that inner speech, then enables fine-grained behavioral steering at inference time, a critical capability for coordinating with humans whose goals and preferences shift. Open questions about generalizability and scalability aside, MIMIC is likely to influence the design of future human-AI coordination systems.

Recommendations

  • Future research should focus on exploring MIMIC's generalizability to other domains and scalability to complex scenarios.
  • The development of more advanced vision-language models and diffusion-based behavior cloning policies could further enhance MIMIC's capabilities.