
SPQ: An Ensemble Technique for Large Language Model Compression


Jiamin Yao, Eren Gultepe

arXiv:2602.18420v1 Abstract: This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD reduces attention projections into compact low-rank factors, and iii) 8-bit quantization uniformly compresses all linear layers. At matched compression ratios, SPQ outperforms individual methods (SVD-only, pruning-only, or quantization-only) in perplexity, demonstrating the benefit of combining complementary techniques. Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K. Compared to strong baselines like GPTQ and SparseGPT, SPQ offers competitive perplexity and accuracy while using less memory (6.86 GB vs. 7.16 GB for GPTQ). Moreover, SPQ improves inference throughput over GPTQ, achieving up to a 1.9x speedup, which further enhances its practicality for real-world deployment. SPQ's robust, layer-aware combination of complementary compression techniques may enable practical deployment of LLMs in memory-constrained environments. Code is available at: https://github.com/JiaminYao/SPQ_LLM_Compression/
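
To make the variance-retained SVD component concrete, the sketch below truncates a weight matrix to the smallest rank that keeps a target fraction of spectral energy, then stores two thin factors in place of the full matrix. This is a minimal NumPy sketch for illustration only; the paper's actual rank-selection rule and factor layout may differ.

```python
import numpy as np

def truncated_svd(W, variance=0.90):
    """Factor W ~ A @ B, keeping just enough singular values to retain
    `variance` of the total spectral energy (illustrative sketch)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    energy = np.cumsum(S ** 2) / np.sum(S ** 2)   # cumulative energy kept
    r = int(np.searchsorted(energy, variance)) + 1
    A = U[:, :r] * S[:r]   # shape (m, r)
    B = Vt[:r, :]          # shape (r, n)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))   # stand-in for an attention projection
A, B = truncated_svd(W)
# Storing A and B costs r * (m + n) numbers instead of m * n.
```

The memory saving comes from the factored storage: whenever the selected rank r is below m·n / (m + n), the two thin factors are smaller than the original matrix.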

Executive Summary

This study introduces SPQ, an ensemble technique for large language model compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. At matched compression ratios, SPQ outperforms each individual method in perplexity, and it achieves up to 75% memory reduction while maintaining or improving perplexity and accuracy on downstream benchmarks. Compared to strong baselines, SPQ offers competitive perplexity and accuracy while using less memory and improving inference throughput. Its layer-aware combination of complementary compression techniques makes deployment of LLMs practical in memory-constrained environments.

Key Points

  • SPQ is an ensemble technique that combines SVD, pruning, and quantization for large language model compression
  • SPQ outperforms individual methods in perplexity at matched compression ratios
  • SPQ achieves up to 75% memory reduction while maintaining or improving perplexity and accuracy on downstream benchmarks
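
The activation-based pruning component of the ensemble can be sketched as follows: hidden MLP neurons are scored by their mean absolute activation over a calibration batch, and the lowest-scoring neurons are removed from both surrounding projections. This is a simplified stand-in, not the paper's exact criterion; the matrix shapes and scoring rule are assumptions for illustration.

```python
import numpy as np

def prune_mlp_neurons(W_in, W_out, acts, keep_ratio=0.75):
    """Drop the hidden MLP neurons with the smallest mean |activation|.
    W_in:  (hidden, d_model)  rows correspond to neurons
    W_out: (d_model, hidden)  columns correspond to neurons
    acts:  (samples, hidden)  calibration activations"""
    scores = np.abs(acts).mean(axis=0)          # one score per neuron
    k = int(len(scores) * keep_ratio)
    keep = np.sort(np.argsort(scores)[-k:])     # indices of kept neurons
    return W_in[keep, :], W_out[:, keep]

rng = np.random.default_rng(0)
W_in = rng.standard_normal((128, 64))
W_out = rng.standard_normal((64, 128))
acts = rng.standard_normal((32, 128))
W_in_p, W_out_p = prune_mlp_neurons(W_in, W_out, acts)
```

Because entire neurons (rows of the input projection and matching columns of the output projection) are removed, the pruned layers stay dense and need no sparse-matrix support at inference time.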

Merits

Robust Compression

SPQ's combination of SVD, pruning, and quantization provides robust compression that is effective in reducing memory usage while maintaining model performance.
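
The 8-bit post-training quantization component can be illustrated with a symmetric per-tensor scheme: one scale maps the weight range onto signed 8-bit integers, quartering fp32 storage. This is a minimal sketch; the paper's exact scheme (e.g. per-channel scales or asymmetric zero points) may differ.

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor 8-bit linear quantization (illustrative)."""
    scale = np.abs(W).max() / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(W)
W_hat = dequantize_int8(q, scale)   # int8 storage: 4x smaller than fp32
```

With round-to-nearest, the per-weight reconstruction error is bounded by half the scale, which is why a well-chosen scale keeps perplexity close to the original model.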

Layer-Aware Compression

SPQ's technique is layer-aware, allowing for more targeted and efficient compression of large language models.
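
The layer-aware idea can be sketched as a routing rule: attention projections get low-rank SVD, MLP projections get activation-based pruning, and every linear layer is additionally quantized to 8 bits. The module names below follow the Hugging Face LLaMA-2 convention and are an assumption for illustration, not taken from the paper's code.

```python
def choose_methods(layer_name):
    """Layer-aware routing (simplified sketch): pick compression steps
    by layer type. Names assume Hugging Face LLaMA-2 module naming."""
    attn = ("q_proj", "k_proj", "v_proj", "o_proj")
    mlp = ("gate_proj", "up_proj", "down_proj")
    if any(k in layer_name for k in attn):
        return ["svd", "int8"]     # low-rank factors, then quantize
    if any(k in layer_name for k in mlp):
        return ["prune", "int8"]   # drop neurons, then quantize
    return ["int8"]                # everything else: quantize only
```

Routing by layer type matches each technique to the redundancy it targets: low-rank structure in attention projections, neuron redundancy in MLPs.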

Competitive Performance

SPQ offers competitive perplexity and accuracy compared to strong baselines while using less memory and improving inference throughput.

Demerits

Limited Evaluation

The study only evaluates SPQ on a limited set of downstream benchmarks, which may not be representative of all possible applications.

Dependence on Hyperparameters

The performance of SPQ may be sensitive to hyperparameters, which can affect the quality of compression and model performance.

Expert Commentary

The study presents a novel ensemble technique for large language model compression that outperforms individual methods in perplexity at matched compression ratios. Combining SVD, pruning, and quantization yields robust compression that reduces memory usage while maintaining model performance, with clear practical implications for deploying LLMs in memory-constrained environments. However, the limited benchmark coverage and the method's sensitivity to hyperparameters should be addressed in future work. Overall, the study is a meaningful contribution to efficient neural network compression.

Recommendations

  • Future research should focus on evaluating SPQ on a broader range of downstream benchmarks to assess its generalizability.
  • The study's dependence on hyperparameters should be investigated, and methods to optimize hyperparameters should be developed.
