Uncovering a Winning Lottery Ticket with Continuously Relaxed Bernoulli Gates
arXiv:2603.08914v1 Announce Type: new Abstract: Over-parameterized neural networks incur prohibitive memory and computational costs for resource-constrained deployment. The Strong Lottery Ticket (SLT) hypothesis suggests that randomly initialized networks contain sparse subnetworks achieving competitive accuracy without weight training. Existing SLT methods, notably edge-popup, rely on non-differentiable score-based selection, limiting optimization efficiency and scalability. We propose using continuously relaxed Bernoulli gates to discover SLTs through fully differentiable, end-to-end optimization - training only gating parameters while keeping all network weights frozen at their initialized values. Continuous relaxation enables direct gradient-based optimization of an $\ell_0$-regularization objective, eliminating the need for non-differentiable gradient estimators or iterative pruning cycles. To our knowledge, this is the first fully differentiable approach for SLT discovery that avoids straight-through estimator approximations. Experiments across fully connected networks, CNNs (ResNet, Wide-ResNet), and Vision Transformers (ViT, Swin-T) demonstrate up to 90% sparsity with minimal accuracy loss - nearly double the sparsity achieved by edge-popup at comparable accuracy - establishing a scalable framework for pre-training network sparsification.
Executive Summary
This article introduces a novel, fully differentiable method for discovering Strong Lottery Tickets (SLTs) using continuously relaxed Bernoulli gates, replacing non-differentiable score-based selection with gradient-based optimization. By freezing network weights at their random initialization and optimizing only the gating parameters end-to-end, the method achieves up to 90% sparsity across diverse architectures with minimal accuracy degradation, reaching nearly double the sparsity of edge-popup at comparable accuracy. The key innovation is eliminating reliance on straight-through estimators, enabling scalable, efficient pre-training sparsification. This represents a significant advancement in neural network optimization for resource-constrained deployments.
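The article does not reproduce the paper's exact parameterization of the relaxed gates. A common instantiation of a continuously relaxed Bernoulli gate with a differentiable $\ell_0$ surrogate is the hard-concrete distribution of Louizos et al. (2018); the sketch below assumes that formulation, and the function names, default temperature `beta`, and stretch interval `[gamma, zeta]` are illustrative choices, not the paper's:

```python
import numpy as np

def sample_hard_concrete(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1, rng=None):
    """Sample a continuously relaxed Bernoulli ("hard-concrete") gate.

    log_alpha is the trainable location parameter and beta the temperature.
    Stretching the concrete sample to [gamma, zeta] and clipping lets the
    gate hit exactly 0 or 1 while keeping nonzero gradients w.r.t.
    log_alpha elsewhere, so no straight-through estimator is needed.
    """
    rng = np.random.default_rng(rng)
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_alpha))
    # Reparameterized sample: sigmoid((log u - log(1-u) + log_alpha) / beta)
    s = 1.0 / (1.0 + np.exp(-((np.log(u) - np.log(1 - u) + log_alpha) / beta)))
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Differentiable surrogate for the probability a gate is open:
    P(z != 0) = sigmoid(log_alpha - beta * log(-gamma / zeta)).
    Summed over all gates, this gives the L0 regularization term."""
    return 1.0 / (1.0 + np.exp(-(log_alpha - beta * np.log(-gamma / zeta))))
```

Because both the sample and the expected-L0 term are smooth in `log_alpha`, the sparsity objective can be minimized by ordinary gradient descent, which is the property the paper exploits.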
Key Points
- ▸ First fully differentiable SLT discovery method without straight-through estimators
- ▸ Utilizes continuous relaxation of Bernoulli gates for gradient-based optimization
- ▸ Achieves up to 90% sparsity with minimal accuracy loss across CNNs and Transformers
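To make the training setup concrete: the weights stay frozen at their random initialization and only the per-weight gate logits are trainable. This minimal single-layer sketch assumes the same hard-concrete relaxation as above; `gated_forward` and its parameter names are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen, randomly initialized weights: never updated during SLT search.
W = rng.standard_normal((4, 8))

# Trainable per-weight gate logits: the ONLY learned parameters.
log_alpha = np.zeros_like(W)

def gated_forward(x, W, log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """One linear layer with a relaxed Bernoulli gate on every weight.
    Each weight is multiplied by a stochastic gate z in [0, 1]; gradients
    reach log_alpha through the reparameterized sample, so the subnetwork
    is selected by gradient descent rather than score-based ranking."""
    u = rng.uniform(1e-6, 1 - 1e-6, size=W.shape)
    s = 1.0 / (1.0 + np.exp(-((np.log(u) - np.log(1 - u) + log_alpha) / beta)))
    z = np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)
    return x @ (W * z).T

x = rng.standard_normal((2, 8))
y = gated_forward(x, W, log_alpha)
```

At deployment, gates would be thresholded to a binary mask, yielding the sparse subnetwork of frozen weights that constitutes the lottery ticket.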
Merits
Innovation
Introduces a fundamentally different approach to SLT discovery by enabling full differentiability, bypassing prior limitations of non-differentiable selection techniques
Demerits
Scope Constraint
Experiments are limited to vision architectures (fully connected networks, ResNet, Wide-ResNet, ViT, Swin-T); application to other domains, such as language models or reinforcement learning agents, remains to be validated
Expert Commentary
The shift from non-differentiable pruning heuristics to a fully differentiable framework marks a paradigm shift in lottery ticket hypothesis research. The authors effectively address a critical bottleneck: the inability to optimize sparsity via gradient descent due to non-differentiable selection mechanisms. By leveraging continuous relaxation, they open the door to end-to-end optimization of sparsity without compromising model fidelity. This is not merely an incremental improvement but a foundational advancement that may redefine how sparsity is engineered in neural networks. The empirical results are compelling, particularly the near-doubling of sparsity at comparable accuracy, suggesting that prior methods were constrained not by theoretical limits but by algorithmic incompatibility. One caveat: while the method demonstrates efficacy across multiple vision architectures, broader validation across divergent model classes (e.g., transformers for NLP or RL agents) will be essential to confirm generalizability. Overall, this work elevates the SLT discourse from heuristic-driven to mathematically rigorous.
Recommendations
- ✓ 1. Extend validation beyond vision to diverse domains and model classes (e.g., NLP transformers, RL agents)
- ✓ 2. Integrate this framework into standard pre-training pipelines for commercial AI systems to quantify cost-efficiency gains