GeneZip: Region-Aware Compression for Long Context DNA Modeling

Jianan Zhao, Xixian Liu, Zhihao Zhan, Xinyu Yuan, Hongyu Guo, Jian Tang · March 7, 2026

#q-bio.GN #cs.AI #cs.LG

arXiv:2602.17739v1

Abstract: Genomic sequences span billions of base pairs (bp), posing a fundamental challenge for genome-scale foundation models. Existing approaches largely sidestep this barrier by either scaling relatively small models to long contexts or relying on heavy multi-GPU parallelism. Here we introduce GeneZip, a DNA compression model that leverages a key biological prior: genomic information is highly imbalanced. Coding regions comprise only a small fraction (about 2 percent) yet are information-dense, whereas most non-coding sequence is comparatively information-sparse. GeneZip couples HNet-style dynamic routing with a region-aware compression-ratio objective, enabling adaptive allocation of representation budget across genomic regions. As a result, GeneZip learns region-aware compression and achieves 137.6x compression with only 0.31 perplexity increase. On downstream long-context benchmarks, GeneZip achieves comparable or better performance on contact map prediction, expression quantitative trait loci prediction, and enhancer-target gene prediction. By reducing effective sequence length, GeneZip unlocks simultaneous scaling of context and capacity: compared to the prior state-of-the-art model JanusDNA, it enables training models 82.6x larger at 1M-bp context, supporting a 636M-parameter GeneZip model at 1M-bp context. All experiments in this paper can be trained on a single A100 80GB GPU.

Executive Summary

GeneZip is a DNA compression model that exploits the imbalance in information density between coding regions (about 2 percent of the genome, information-dense) and non-coding regions (comparatively information-sparse). By adaptively allocating representation budget across genomic regions, it reaches 137.6x compression with only a 0.31 increase in perplexity. The shorter effective sequence allows larger models to be trained at longer contexts: the paper reports a 636M-parameter model at 1M-bp context, with all experiments trainable on a single A100 80GB GPU.

Key Points

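▸ GeneZip leverages a biological prior: coding regions (about 2 percent of the genome) are information-dense, while most non-coding sequence is information-sparse
▸ The model achieves 137.6x compression with only a 0.31 increase in perplexity
▸ The reduced effective sequence length enables models 82.6x larger than JanusDNA at 1M-bp context, up to 636M parameters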
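
To make the compression figure concrete, the sketch below converts a context length in base pairs into the number of compressed positions a downstream model would see. It assumes the reported 137.6x ratio applies as a flat average over the whole sequence; in practice the ratio varies by region, and the function name `compressed_length` is purely illustrative.

```python
import math


def compressed_length(context_bp: int, compression_ratio: float) -> int:
    """Compressed positions needed to cover `context_bp` base pairs.

    Assumes a single average ratio over the whole sequence (a simplification).
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return math.ceil(context_bp / compression_ratio)


REPORTED_RATIO = 137.6  # compression ratio reported in the abstract

for context_bp in (10_000, 100_000, 1_000_000):
    positions = compressed_length(context_bp, REPORTED_RATIO)
    print(f"{context_bp:>9,} bp -> {positions:,} compressed positions")
```

Under this averaging assumption, a 1M-bp context shrinks to roughly 7,300 compressed positions.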

Merits

Efficient Compression

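GeneZip's region-aware compression shortens sequences by 137.6x at a cost of only 0.31 perplexity, while achieving comparable or better performance on contact map prediction, expression quantitative trait loci prediction, and enhancer-target gene prediction.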
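
As a rough illustration of how region-aware compression spends its budget, the sketch below computes the overall compression ratio when different region types are compressed at different rates. The overall ratio is a weighted harmonic mean of the per-region ratios. Only the 2 percent coding fraction comes from the abstract; the per-region ratios (8x and 256x) and both function names are hypothetical values chosen for illustration, not figures from the paper.

```python
def overall_compression(fractions, ratios):
    """Overall compression ratio for regions covering `fractions` of the
    sequence, each compressed `ratios[i]`-fold (weighted harmonic mean)."""
    if len(fractions) != len(ratios):
        raise ValueError("fractions and ratios must have the same length")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    kept = sum(f / r for f, r in zip(fractions, ratios))
    return 1.0 / kept


def budget_share(fractions, ratios):
    """Share of compressed positions that each region type receives."""
    kept = [f / r for f, r in zip(fractions, ratios)]
    total = sum(kept)
    return [k / total for k in kept]


CODING_FRACTION = 0.02   # "about 2 percent", from the abstract
CODING_RATIO = 8.0       # hypothetical: light compression for coding regions
NONCODING_RATIO = 256.0  # hypothetical: heavy compression elsewhere

fractions = [CODING_FRACTION, 1.0 - CODING_FRACTION]
ratios = [CODING_RATIO, NONCODING_RATIO]

print(f"overall compression: {overall_compression(fractions, ratios):.1f}x")
print(f"budget spent on coding regions: {budget_share(fractions, ratios)[0]:.1%}")
```

With these made-up ratios, the 2 percent of the sequence that is coding receives roughly 40 percent of the compressed positions, which is the kind of imbalance a region-aware objective is meant to allow.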

Scalability

Compared to the prior state-of-the-art model JanusDNA, GeneZip enables training models 82.6x larger at 1M-bp context, supporting a 636M-parameter model at that context length.
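
The two scaling figures can be combined in a quick back-of-the-envelope check. The sketch below assumes the 82.6x multiplier refers to parameter count and is measured against the 636M-parameter GeneZip model; the abstract does not state the baseline's size, so the result is an inference, not a reported number.

```python
GENEZIP_PARAMS = 636e6   # largest GeneZip model at 1M-bp context (abstract)
SIZE_MULTIPLIER = 82.6   # "82.6x larger" than JanusDNA at 1M-bp context (abstract)


def implied_baseline_params(params: float, multiplier: float) -> float:
    """Baseline size implied by a model size and a 'times larger' multiplier."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return params / multiplier


baseline = implied_baseline_params(GENEZIP_PARAMS, SIZE_MULTIPLIER)
print(f"implied baseline size: {baseline / 1e6:.2f}M parameters")
```

Read this way, the comparison point at 1M-bp context is a model of under 8M parameters.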

Demerits

Computational Requirements

Although all experiments in the paper can be trained on a single A100 80GB GPU, this is still data-center-class hardware, and models larger than the 636M-parameter configuration may require more.

Expert Commentary

GeneZip advances long-context genomic modeling by using a biological prior to make compression both efficient and scalable. Its key innovation is the adaptive allocation of representation budget across genomic regions, which shortens the effective sequence and so permits larger models at longer context lengths. Computational requirements remain a potential limitation, but the approach could support progress in fields such as precision medicine and synthetic biology.

Recommendations

✓ Further research is needed to explore the applications of GeneZip in various fields, including precision medicine and synthetic biology
✓ Investment in computational resources and infrastructure is necessary to support the development and deployment of models like GeneZip

Sources

arXiv - cs.AI

GeneZip: Region-Aware Compression for Long Context DNA Modeling
