How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning
arXiv:2602.15580v1
Abstract: When a multimodal Transformer answers a visual question, is the prediction driven by visual evidence, linguistic reasoning, or genuinely fused …
Hongxuan Wu, Yukun Zhang, Xueqing Zhou