Top-K Retrieval with Fixed-Size Linear-Attention Completion: Backbone- and KV-Format-Preserving Attention for KV-Cache Read Reduction
arXiv:2604.05438v1 Announce Type: new Abstract: Long-context generation is increasingly limited by decode-time key-value (KV) cache traffic, particularly when KV is offloaded beyond GPU memory. Query-aware …
Yasuto Hoshi, Daisuke Miyashita, Jun Deguchi
5 views