Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal …
arXiv:2603.09095v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same …
Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai
12 views