Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models
arXiv:2602.12618v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning …
Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan
3 views