The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
arXiv:2602.15382v1
Abstract: Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information quantization loss. While latent state transfer offers a high-bandwidth alternative, existing approaches either assume homogeneous sender-receiver architectures or rely on pair-specific learned translators, limiting scalability and modularity across diverse model families with disjoint manifolds. In this work, we propose the Vision Wormhole, a novel framework that repurposes the visual interface of Vision-Language Models (VLMs) to enable model-agnostic, text-free communication. By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver's visual pathway, effectively treating the vision encoder as a universal port for inter-agent telepathy. Our framework adopts a hub-and-spoke topology to reduce pairwise alignment complexity from O(N^2) to O(N) and leverages a label-free, teacher-student distillation objective to align the high-speed visual channel with the robust reasoning patterns of the text pathway. Extensive experiments across heterogeneous model families (e.g., Qwen-VL, Gemma) demonstrate that the Vision Wormhole reduces end-to-end wall-clock time in controlled comparisons while maintaining reasoning fidelity comparable to standard text-based MAS. Code is available at https://github.com/xz-liu/heterogeneous-latent-mas
Executive Summary
The paper proposes the Vision Wormhole, a framework for latent-space communication in heterogeneous multi-agent systems (MAS) that repurposes the visual interface of Vision-Language Models (VLMs) as a model-agnostic, text-free channel. A Universal Visual Codec maps heterogeneous reasoning traces into a shared continuous latent space and injects them directly into the receiver's visual pathway. A hub-and-spoke topology reduces pairwise alignment complexity from O(N^2) to O(N), and a label-free, teacher-student distillation objective aligns the visual channel with the reasoning patterns of the text pathway. In controlled comparisons across heterogeneous model families (e.g., Qwen-VL, Gemma), the authors report lower end-to-end wall-clock time with reasoning fidelity comparable to standard text-based MAS.
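The label-free distillation idea can be illustrated with a minimal sketch. It assumes the text pathway acts as the teacher, the visual pathway as the student, and the loss is a KL divergence between their next-token distributions. The source does not state the exact objective, so the loss form, the function names, and the toy logits below are all illustrative assumptions rather than the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over positions.

    Assumption: the teacher is the receiver reading the message as text,
    the student is the same receiver reading it through the visual channel.
    No ground-truth labels are used, only the teacher's own distribution.
    """
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)

# Toy logits over a 3-token vocabulary at two positions (made-up values).
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_close = [[1.9, 0.6, -1.1], [0.0, 1.4, 0.4]]
student_far = [[-1.0, 0.5, 2.0], [1.5, 0.1, 0.3]]

print(distillation_loss(teacher, student_close))
print(distillation_loss(teacher, student_far))
```

A student whose output distribution tracks the teacher's gets a loss near zero, while a mismatched student gets a larger one, which is the signal a codec would be trained to minimize under this assumed objective.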
Key Points
- ▸ The Vision Wormhole framework enables model-agnostic, text-free communication in heterogeneous multi-agent systems.
- ▸ A Universal Visual Codec maps heterogeneous reasoning traces into a shared continuous latent space and injects them into the receiver's visual pathway.
- ▸ In controlled comparisons, the approach reduces end-to-end wall-clock time while maintaining reasoning fidelity comparable to standard text-based MAS.
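The mapping through a shared latent space can be pictured with a toy sketch. It assumes each model has one projection into the shared space and one projection out of it into its visual-token dimension. The class `ToyCodec`, the dimensions, and the random untrained weights are hypothetical and only show how models with different hidden sizes could exchange vectors without a pair-specific translator.

```python
import random

def make_linear(in_dim, out_dim, seed):
    """Random (untrained) weight matrix with out_dim rows and in_dim columns."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def apply_linear(weight, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

class ToyCodec:
    """Hypothetical stand-in for a shared codec: one encoder and one decoder per model."""

    def __init__(self, shared_dim):
        self.shared_dim = shared_dim
        self.encoders = {}  # model name -> projection into the shared space
        self.decoders = {}  # model name -> projection into that model's visual-token space

    def register(self, name, hidden_dim, visual_dim, seed=0):
        self.encoders[name] = make_linear(hidden_dim, self.shared_dim, seed)
        self.decoders[name] = make_linear(self.shared_dim, visual_dim, seed + 1)

    def transmit(self, sender, receiver, hidden_states):
        """Sender hidden states -> shared latent vectors -> receiver visual tokens."""
        shared = [apply_linear(self.encoders[sender], h) for h in hidden_states]
        return [apply_linear(self.decoders[receiver], z) for z in shared]

codec = ToyCodec(shared_dim=8)
codec.register("model_a", hidden_dim=16, visual_dim=12, seed=1)
codec.register("model_b", hidden_dim=24, visual_dim=10, seed=2)

# A made-up reasoning trace: three hidden-state vectors from model_a.
trace = [[0.5] * 16 for _ in range(3)]
visual_tokens = codec.transmit("model_a", "model_b", trace)
print(len(visual_tokens), len(visual_tokens[0]))  # prints "3 10"
```

In a real system the output vectors would be fed to the receiver in place of image embeddings; that step is omitted here because it depends on each model's internals.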
Merits
Strength in Scalability
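Because every model aligns to a single shared latent space through the Universal Visual Codec, the hub-and-spoke topology reduces pairwise alignment complexity from O(N^2) to O(N), which lets communication scale across diverse model families with disjoint manifolds.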
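A small count makes the scaling difference concrete. The sketch assumes one learned translator per ordered sender-receiver pair in the pairwise setting and two adapters per model (one into the hub, one out) in the hub-and-spoke setting. These constant factors are assumptions; the source states only the O(N^2) and O(N) orders.

```python
def pairwise_translator_count(n_models: int) -> int:
    """Assumed pairwise setting: one translator per ordered (sender, receiver) pair."""
    return n_models * (n_models - 1)

def hub_and_spoke_adapter_count(n_models: int) -> int:
    """Assumed hub setting: one adapter into the shared space and one out, per model."""
    return 2 * n_models

for n in (2, 4, 8, 16):
    print(n, pairwise_translator_count(n), hub_and_spoke_adapter_count(n))
```

Under these assumptions, adding a seventeenth model to a pool of sixteen requires 32 new pairwise translators but only 2 new hub adapters.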
Demerits
Dependence on Vision-Language Models
The framework delivers messages through the receiver's visual pathway, so it may not apply to agents that lack a visual interface, such as text-only language models, without some alternative channel.
Expert Commentary
The Vision Wormhole targets a concrete bottleneck in heterogeneous multi-agent systems: the runtime overhead and information loss of passing messages as discrete text. Repurposing the visual interface of Vision-Language Models as a shared input port avoids both the homogeneous-architecture assumption and the pair-specific translators of earlier latent-transfer approaches. The main limitation is the dependence on a visual pathway at the receiver, and further research is needed on comparable channels for agents without one. Even so, the scalability and modularity of the design make it a promising direction for MAS research.
Recommendations
- ✓ Future research should investigate the applicability of the Vision Wormhole framework to agents that do not possess a visual interface or require alternative forms of communication.
- ✓ Further research at the intersection of computer vision, natural language processing, and multi-agent systems is warranted, including its implications for AI and data regulation policy.