AXELRAM: Quantize Once, Never Dequantize
arXiv:2604.02638v1 Announce Type: new Abstract: We propose AXELRAM, a smart SRAM macro architecture that computes attention scores directly from quantized KV cache indices without dequantization. …
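The core idea — scoring against quantized keys without ever materializing dequantized values — can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not AXELRAM's actual macro design: the scalar codebook, 4-bit levels, and head dimension are all illustrative, and the index-domain path here is a generic precompute-and-gather trick rather than the paper's SRAM datapath.

```python
import numpy as np

# Hypothetical sketch: dot product between a query and a quantized key
# computed directly from codebook indices. Codebook, sizes, and shapes
# are illustrative assumptions, not AXELRAM's actual design.

rng = np.random.default_rng(0)

D = 8            # head dimension (illustrative)
LEVELS = 16      # 4-bit scalar quantization per element

codebook = np.linspace(-1.0, 1.0, LEVELS)      # shared scalar codebook
key_idx = rng.integers(0, LEVELS, size=D)      # quantized key: indices only
query = rng.standard_normal(D)

# Baseline path: dequantize the key, then take the dot product.
score_ref = float(query @ codebook[key_idx])

# Index-domain path: precompute per-position partial products for every
# codebook level, then gather by index -- no dequantized key tensor
# is ever formed.
partial = np.outer(query, codebook)            # shape (D, LEVELS)
score_idx = float(partial[np.arange(D), key_idx].sum())

assert np.isclose(score_ref, score_idx)
```

Both paths yield the same attention score; the second replaces multiply-after-dequantize with table lookups indexed by the stored quantization codes, which is the flavor of computation the abstract describes moving into the SRAM macro.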
arXiv:2604.03110v1 Announce Type: new Abstract: Knowledge distillation is an effective technique for pre-trained language model compression. However, existing methods only focus on the knowledge distribution …
arXiv:2604.02340v1 Announce Type: new Abstract: Recent advances in masked diffusion language models (MDLMs) narrow the quality gap to autoregressive LMs, but their sampling remains expensive …
arXiv:2604.02362v1 Announce Type: cross Abstract: Decoding speech information from scalp EEG remains difficult due to low SNR and spatial blurring. We present CIPHER (Conformer-based Inference …
arXiv:2604.02355v1 Announce Type: new Abstract: Combining Chain-of-Thought (CoT) with Reinforcement Learning (RL) improves text-to-image (T2I) generation, yet the underlying interaction between CoT's exploration and RL's …
arXiv:2604.02601v1 Announce Type: new Abstract: Data-driven discovery of governing equations from noisy observations remains a fundamental challenge in scientific machine learning. While GENERIC formalism informed …
arXiv:2604.02923v1 Announce Type: new Abstract: Large Language Models (LLMs), particularly those employing Mixture-of-Experts (MoE) architectures, have achieved remarkable capabilities across diverse natural language processing tasks. …
arXiv:2604.02352v1 Announce Type: cross Abstract: Although LLMs are capable of generating functionally correct code, they also tend to produce less energy-efficient code in comparison to …
arXiv:2604.02343v1 Announce Type: cross Abstract: We study the compression of LLM-generated text across lossless and lossy regimes, characterizing a compression-compute frontier where more compression is …
arXiv:2604.02560v1 Announce Type: new Abstract: Discrete diffusion language models (dLLMs) accelerate text generation by unmasking multiple tokens in parallel. However, parallel decoding introduces a distributional …
arXiv:2604.03173v1 Announce Type: new Abstract: Large language models and deep research agents supply citation URLs to support their claims, yet the reliability of these citations …
arXiv:2604.02423v1 Announce Type: new Abstract: Large language models exhibit sycophancy: the tendency to shift outputs toward user-expressed stances, regardless of correctness or consistency. While prior …