OmniGAIA: Towards Native Omni-Modal AI Agents
arXiv:2602.22897v1 Announce Type: new Abstract: Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to …
Tag: cs.MM
arXiv:2602.17871v1 Announce Type: cross Abstract: Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks, spanning visual reasoning, document …
arXiv:2602.15082v1 Announce Type: cross Abstract: Neural audio compression models have recently achieved extreme compression rates, enabling efficient latent generative modeling. Conversely, latent generative models have …
arXiv:2602.16197v1 Announce Type: new Abstract: Multimodal systems are vulnerable to partial or complete loss of input channels at deployment, which undermines reliability in real-world settings. …
arXiv:2602.15707v1 Announce Type: cross Abstract: Real-time conversational assistants for procedural tasks often depend on video input, which can be computationally expensive and compromise user privacy. …