CWM: Contrastive World Models for Action Feasibility Learning in Embodied Agent Pipelines
arXiv:2602.22452v1 Announce Type: new Abstract: A reliable action feasibility scorer is a critical bottleneck in embodied agent pipelines: before any planning or reasoning occurs, the agent must identify which candidate actions are physically executable in the current state. Existing approaches use supervised fine-tuning (SFT) to train action scorers, but SFT treats each candidate independently and does not explicitly teach the model to discriminate between actions that are physically correct and those that are subtly wrong. We propose the Contrastive World Model (CWM), which fine-tunes a large language model (LLM) as an action scorer using an InfoNCE contrastive objective with hard-mined negative examples. The key idea is to push valid actions away from invalid ones in scoring space, with special emphasis on hard negatives: semantically similar but physically incompatible candidates. We evaluate CWM on the ScienceWorld benchmark through two studies. First, an intrinsic affordance evaluation on 605 hard-negative test pairs shows that CWM outperforms SFT by +6.76 percentage points on Precision@1 for minimal-edit negatives -- cases where a single word changes the physical outcome -- and achieves a higher AUC-ROC (0.929 vs. 0.906). Second, a live filter characterisation study measures how well CWM ranks gold-path actions against all valid environment actions during task execution. Under out-of-distribution stress conditions, CWM maintains a significantly better safety margin (-2.39) than SFT (-3.96), indicating that the gold action is ranked closer to the top. These results support the hypothesis that contrastive training induces representations that capture physical feasibility more faithfully than SFT alone.
Executive Summary
This article proposes the Contrastive World Model (CWM), an approach to action feasibility learning in embodied agent pipelines. CWM fine-tunes a large language model as an action scorer using an InfoNCE contrastive objective with hard-mined negative examples, pushing valid actions away from invalid ones in scoring space, with emphasis on semantically similar but physically incompatible candidates. On the ScienceWorld benchmark, the authors report that CWM outperforms supervised fine-tuning (SFT) in two studies. In an intrinsic affordance evaluation on 605 hard-negative test pairs, CWM gains +6.76 percentage points in Precision@1 on minimal-edit negatives and reaches a higher AUC-ROC (0.929 vs. 0.906). In a live filter characterisation study under out-of-distribution stress conditions, it keeps a better safety margin (-2.39 vs. -3.96). The authors take these results as support for the hypothesis that contrastive training induces representations that capture physical feasibility more faithfully than SFT alone. A scorer that separates physically correct actions from subtly wrong ones has clear implications for the development of reliable and safe embodied agents.
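To make the notion of a minimal-edit hard negative concrete, here is a small sketch that selects invalid actions differing from a valid action in exactly one word. The abstract does not describe the authors' mining procedure, so the word-level comparison, the function names, and the example action strings are all assumptions made for illustration.

```python
from typing import List


def is_minimal_edit(action: str, candidate: str) -> bool:
    """True when both actions have the same word count and differ in exactly one word.

    Assumption: "minimal edit" is read here as a single-word substitution.
    """
    a, b = action.split(), candidate.split()
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def minimal_edit_negatives(valid_action: str, invalid_actions: List[str]) -> List[str]:
    """Keep only the invalid actions that are one word away from the valid action."""
    return [c for c in invalid_actions if is_minimal_edit(valid_action, c)]


# Hypothetical action strings, not taken from ScienceWorld.
candidates = ["open door to hallway", "close window", "pick up door to kitchen"]
hard_negatives = minimal_edit_negatives("open door to kitchen", candidates)
```

Negatives selected this way are semantically close to the valid action, which is the property the abstract attributes to hard negatives.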
Key Points
- ▸ CWM fine-tunes a large language model as an action scorer using an InfoNCE contrastive objective
- ▸ The approach emphasizes semantically similar but physically incompatible candidates
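- ▸ CWM outperforms SFT in both studies: +6.76 percentage points in Precision@1 on minimal-edit negatives and an AUC-ROC of 0.929 vs. 0.906 in the intrinsic affordance evaluation, and a safety margin of -2.39 vs. -3.96 under out-of-distribution stress in the live filter characterisation study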
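The InfoNCE objective named above can be sketched for a single training example as follows. This is a minimal illustration of the standard InfoNCE form, not the authors' implementation: the scalar scores, the `temperature` parameter, and the function name are assumptions, and in practice the scores would come from the fine-tuned LLM.

```python
import math
from typing import Sequence


def info_nce_loss(pos_score: float, neg_scores: Sequence[float],
                  temperature: float = 1.0) -> float:
    """InfoNCE loss for one valid action scored against a set of invalid ones.

    loss = -log( exp(s_pos / t) / (exp(s_pos / t) + sum_j exp(s_neg_j / t)) )
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]


# A negative scored close to the positive (a "hard" one) yields a larger loss
# than a negative that is already scored far below it.
loss_easy = info_nce_loss(2.0, [-3.0])
loss_hard = info_nce_loss(2.0, [1.8])
```

Because the loss grows as a negative's score approaches the positive's, hard negatives dominate the gradient, which is consistent with the emphasis the abstract places on them.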
Merits
Improved action feasibility scoring
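Contrastive training improves scoring where it is hardest. On minimal-edit negatives, where a single word changes the physical outcome, CWM gains +6.76 percentage points in Precision@1 over SFT, and its AUC-ROC is 0.929 against 0.906 for SFT. Accurate feasibility scoring of this kind is critical for embodied agent pipelines.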
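For readers unfamiliar with the two metrics cited in the evaluation, the sketch below computes Precision@1 and AUC-ROC from raw scores. The exact evaluation protocol is not given in the abstract, so the grouping of each valid action with its negatives and the tie handling are assumptions, and the scores are made up for illustration.

```python
from typing import Sequence, Tuple


def precision_at_1(groups: Sequence[Tuple[float, Sequence[float]]]) -> float:
    """Fraction of groups in which the valid action outscores every negative.

    Each group is (score of the valid action, scores of its negatives).
    """
    hits = sum(1 for pos, negs in groups if all(pos > n for n in negs))
    return hits / len(groups)


def auc_roc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Pairwise AUC: probability that a random valid action outscores a random
    invalid one, counting ties as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Illustrative scores only, not results from the paper.
pairs = [(0.9, [0.2]), (0.4, [0.6])]
p_at_1 = precision_at_1(pairs)          # 0.5: the valid action wins in one of two pairs
auc = auc_roc([0.9, 0.4], [0.2, 0.6])   # 0.75: three of four cross pairs are ordered correctly
```

Precision@1 is a per-pair ranking measure, while AUC-ROC pools all valid and invalid scores, so the two can move independently.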
Enhanced safety margins
Under out-of-distribution stress conditions, CWM keeps a safety margin of -2.39 against -3.96 for SFT, indicating that the gold action is ranked closer to the top among all valid environment actions during task execution.
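The abstract does not define how the safety margin is computed. One plausible reading, assumed here purely for illustration, is the gold action's score minus the highest score among the other valid actions, so that a negative margin closer to zero means the gold action sits nearer the top. The function names and scores below are hypothetical.

```python
from typing import Sequence


def gold_margin(gold_score: float, other_scores: Sequence[float]) -> float:
    """Assumed definition: gold score minus the best competing score.

    Negative when some other valid action outscores the gold action.
    """
    return gold_score - max(other_scores)


def gold_rank(gold_score: float, other_scores: Sequence[float]) -> int:
    """1-based rank of the gold action among all scored actions."""
    return 1 + sum(1 for s in other_scores if s > gold_score)


# Illustrative scores only, not values from the paper.
margin = gold_margin(1.0, [3.5, 0.2])  # -2.5
rank = gold_rank(1.0, [3.5, 0.2])      # 2
```

Under this reading, a filter that keeps the top-ranked actions is less likely to discard the gold action when the margin is closer to zero.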
Demerits
Limited evaluation scope
The article's evaluation is limited to the ScienceWorld benchmark, and it is unclear how CWM's performance would generalize to other environments and tasks.
Dependence on large language models
CWM's performance depends heavily on the quality and pre-training of the underlying large language model, which may limit its use in resource-constrained settings.
Expert Commentary
CWM's approach to action feasibility learning is a significant advance in embodied cognition and artificial intelligence. By combining contrastive training with hard-mined negative examples, the authors demonstrate a more effective and reliable way of distinguishing physically correct actions from subtly wrong ones than SFT alone. The evaluation is confined to the ScienceWorld benchmark, however, so further research is needed to establish how CWM performs in other environments and tasks, and its dependence on large language models may be a limitation in resource-constrained settings. Nevertheless, the findings bear directly on the development of reliable and safe embodied agents.
Recommendations
- ✓ Evaluate CWM in environments and tasks beyond the ScienceWorld benchmark
- ✓ Investigate smaller or alternative pre-trained language models as the base scorer to reduce CWM's dependence on large language models