The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems

arXiv:2602.15382v1

Abstract: Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information quantization loss. While latent state transfer offers a high-bandwidth alternative, existing approaches either assume homogeneous sender-receiver architectures or rely on pair-specific learned translators, limiting scalability and modularity across diverse model families with disjoint manifolds. In this work, we propose the Vision Wormhole, a novel framework that repurposes the visual interface of Vision-Language Models (VLMs) to enable model-agnostic, text-free communication. By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver's visual pathway, effectively treating the vision encoder as a universal port for inter-agent telepathy. Our framework adopts a hub-and-spoke topology to reduce pairwise alignment complexity from O(N^2) to O(N) and leverages a label-free, teacher-student distillation objective to align the high-speed visual channel with the robust reasoning patterns of the text pathway. Extensive experiments across heterogeneous model families (e.g., Qwen-VL, Gemma) demonstrate that the Vision Wormhole reduces end-to-end wall-clock time in controlled comparisons while maintaining reasoning fidelity comparable to standard text-based MAS. Code is available at https://github.com/xz-liu/heterogeneous-latent-mas

Executive Summary

The Vision Wormhole is a framework for latent-space communication in heterogeneous multi-agent systems (MAS) that repurposes the visual interface of Vision-Language Models (VLMs) for model-agnostic, text-free communication. A Universal Visual Codec maps heterogeneous reasoning traces into a shared continuous latent space and injects them directly into the receiver's visual pathway. A hub-and-spoke topology reduces pairwise alignment complexity from O(N^2) to O(N), and a label-free, teacher-student distillation objective aligns the visual channel with the reasoning patterns of the text pathway. In controlled comparisons across heterogeneous model families (e.g., Qwen-VL, Gemma), the authors report reduced end-to-end wall-clock time with reasoning fidelity comparable to standard text-based MAS.
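The abstract does not spell out the distillation objective, so the following is only a minimal sketch of what a label-free, teacher-student loss can look like in general. It assumes the teacher is the receiver reading the message through its text pathway, the student is the same receiver reading it through the visual channel, and the loss is a KL divergence between their next-token distributions. The function names, the temperature parameter, and the choice of KL are all illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Hypothetical label-free loss: match the student's distribution to the teacher's.

    teacher_logits: assumed to come from the text pathway (no ground-truth labels used).
    student_logits: assumed to come from the visual pathway for the same message.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(teacher, student)


if __name__ == "__main__":
    text_pathway = [2.0, 0.5, -1.0]    # toy teacher logits
    visual_pathway = [1.5, 0.7, -0.8]  # toy student logits
    print(distillation_loss(text_pathway, visual_pathway))
```

No labels appear anywhere in the loss: the only supervision signal is the teacher's own output distribution, which is the sense in which such an objective is label-free.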

Key Points

  • The Vision Wormhole framework enables model-agnostic, text-free communication in heterogeneous multi-agent systems.
  • The framework leverages the visual interface of Vision-Language Models to map heterogeneous reasoning traces into a shared continuous latent space.
  • In controlled comparisons, the approach reduces end-to-end wall-clock time while maintaining reasoning fidelity comparable to standard text-based MAS.

Merits

Strength in Scalability

The Vision Wormhole's Universal Visual Codec and hub-and-spoke topology reduce pairwise alignment complexity from O(N^2) to O(N), which lets communication scale across diverse model families with disjoint manifolds.
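The O(N^2) versus O(N) scaling can be made concrete with a small counting sketch. It assumes that pair-specific translation needs one learned translator per ordered sender-receiver pair, and that a hub-and-spoke design needs one encoder into and one decoder out of the shared space per model. Both counting rules and both function names are illustrative assumptions; the source only states the asymptotic complexity.

```python
def pairwise_translators(n_models: int) -> int:
    """Translators needed if every ordered (sender, receiver) pair is aligned separately."""
    return n_models * (n_models - 1)


def hub_adapters(n_models: int) -> int:
    """Adapters needed if each model has one encoder and one decoder to a shared hub space."""
    return 2 * n_models


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(f"N={n:2d}  pairwise={pairwise_translators(n):3d}  hub={hub_adapters(n):2d}")
```

Under these assumptions, adding a new model family to a hub-and-spoke system means training two adapters, while the pairwise scheme requires two new translators for every model already in the system.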

Demerits

Dependence on Vision-Language Models

Because messages are injected through the receiver's visual pathway, the framework may not apply directly to agents that lack a visual interface or that require other forms of communication.

Expert Commentary

The Vision Wormhole framework is a notable step toward efficient communication protocols for heterogeneous multi-agent systems. By routing messages through the visual interface of Vision-Language Models, the authors address the inefficiency of discrete text communication. However, the framework's dependence on that interface may limit its applicability, and further research is needed on alternative communication channels. Nonetheless, its potential for scalability and modularity makes it an exciting development in MAS research.

Recommendations

  • Future research should investigate the applicability of the Vision Wormhole framework to agents that do not possess a visual interface or require alternative forms of communication.
  • The development of the Vision Wormhole framework highlights the need for further research into the intersection of computer vision, natural language processing, and multi-agent systems, with implications for AI and data regulation policy.

Sources