
Learning Invariant Visual Representations for Planning with Joint-Embedding Predictive World Models

arXiv:2602.18639v1 Abstract: World models learned from high-dimensional visual observations allow agents to make decisions and plan directly in latent space, avoiding pixel-level reconstruction. However, recent latent predictive architectures (JEPAs), including the DINO world model (DINO-WM), display a degradation in test-time robustness due to their sensitivity to "slow features". These include visual variations such as background changes and distractors that are irrelevant to the task being solved. We address this limitation by augmenting the predictive objective with a bisimulation encoder that enforces control-relevant state equivalence, mapping states with similar transition dynamics to nearby latent states while limiting contributions from slow features. We evaluate our model on a simple navigation task under different test-time background changes and visual distractors. Across all benchmarks, our model consistently improves robustness to slow features while operating in a reduced latent space, up to 10x smaller than that of DINO-WM. Moreover, our model is agnostic to the choice of pretrained visual encoder and maintains robustness when paired with DINOv2, SimDINOv2, and iBOT features.

Executive Summary

This article proposes a novel approach to learning invariant visual representations for planning with joint-embedding predictive world models. The authors address a limitation of existing models, such as the DINO world model, which are sensitive to "slow features" like background changes and distractors. By introducing a bisimulation encoder, the model enforces control-relevant state equivalence, yielding improved robustness to slow features and a substantially smaller latent space. The approach is evaluated on a navigation task under test-time background changes and visual distractors, and demonstrates consistent improvements in robustness and efficiency.
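To make the mechanism concrete, below is a minimal sketch of one way a bisimulation-style term could sit alongside a JEPA-style latent prediction loss. The function and module names (`encoder`, `predictor`), the batch-permutation pairing, and the exact distance metric are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def jepa_bisim_loss(encoder, predictor, obs, action, next_obs, gamma=0.99):
    # Encode current and next observations with the bisimulation encoder.
    z, z_next = encoder(obs), encoder(next_obs)
    # Latent transition model predicts the next latent from (z, action).
    z_next_pred = predictor(z, action)

    # JEPA-style predictive term: match the predicted next latent to the
    # encoded next observation, with no pixel-level reconstruction.
    pred_loss = F.mse_loss(z_next_pred, z_next.detach())

    # Bisimulation-style term: states whose predicted transitions are similar
    # should lie close together in latent space. Each sample is paired with a
    # random permutation of the batch.
    perm = torch.randperm(z.size(0))
    d_latent = torch.norm(z - z[perm], p=1, dim=-1)
    with torch.no_grad():
        d_transition = gamma * torch.norm(
            z_next_pred - z_next_pred[perm], p=1, dim=-1
        )
    bisim_loss = F.mse_loss(d_latent, d_transition)

    return pred_loss + bisim_loss
```

The intended effect is that latent distances track differences in transition behaviour rather than appearance, which is what suppresses contributions from slow features such as backgrounds.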

Key Points

  • Introduction of a bisimulation encoder to enforce control-relevant state equivalence
  • Improved robustness to slow features such as background changes and distractors
  • Reduced latent space, up to 10x smaller than that of DINO-WM (see the latent-space planning sketch after this list for why latent size matters at plan time)
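The abstract emphasizes planning directly in latent space without pixel-level reconstruction, which is where a smaller latent pays off: every rollout step operates on vectors of the latent dimension. The sketch below shows a generic random-shooting planner over a learned latent predictor; the planner itself, the goal-image objective, and all function names are assumptions for illustration, not necessarily the paper's procedure.

```python
import torch

def plan_action(encoder, predictor, obs, goal_obs,
                horizon=10, n_samples=256, action_dim=2):
    # Encode the current observation and the goal image once (batch size 1).
    z, z_goal = encoder(obs), encoder(goal_obs)

    # Sample candidate action sequences and roll them out entirely in latent
    # space; per-step cost scales with the latent dimension.
    actions = torch.randn(n_samples, horizon, action_dim)
    z_roll = z.expand(n_samples, -1).clone()
    for t in range(horizon):
        z_roll = predictor(z_roll, actions[:, t])

    # Score each rollout by the distance of its final latent to the goal
    # latent, then execute the first action of the best sequence (MPC style).
    cost = torch.norm(z_roll - z_goal, dim=-1)
    return actions[cost.argmin(), 0]
```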

Merits

Improved Robustness

The proposed model demonstrates improved robustness to task-irrelevant visual variation such as background changes and distractors, making it better suited to real-world settings where visual conditions shift between training and deployment.

Efficient Latent Space

The latent space is up to 10x smaller than DINO-WM's, which lowers the computational cost of latent rollouts during planning as well as storage requirements.

Demerits

Limited Evaluation

The model is evaluated only on a simple navigation task, so its performance on more complex tasks remains unclear.

Expert Commentary

The proposed approach represents a meaningful advance for joint-embedding predictive world models. The bisimulation encoder offers a principled way to discount slow features, and the reported results show gains in both robustness and efficiency. However, further evaluation on more complex tasks is needed to understand the model's capabilities and limitations. The potential implications of this approach for transfer learning and explainability also warrant further investigation.

Recommendations

  • Further evaluation on more complex tasks, such as multi-agent environments and tasks with high-dimensional state spaces
  • Investigation of the applicability of the proposed approach to other areas, such as natural language processing and computer vision
