On the Learning Dynamics of Two-layer Linear Networks with Label Noise SGD
arXiv:2603.10397v1 Announce Type: new

Abstract: One crucial factor behind the success of deep learning lies in the implicit bias induced by noise inherent in gradient-based training algorithms. Motivated by empirical observations that training with noisy labels improves model generalization, we delve into the underlying mechanisms behind stochastic gradient descent (SGD) with label noise. Focusing on a two-layer over-parameterized linear network, we analyze the learning dynamics of label noise SGD, unveiling a two-phase learning behavior. In \emph{Phase I}, the magnitudes of model weights progressively diminish, and the model escapes the lazy regime and enters the rich regime. In \emph{Phase II}, the alignment between model weights and the ground-truth interpolator increases, and the model eventually converges. Our analysis highlights the critical role of label noise in driving the transition from the lazy to the rich regime and minimally explains its empirical success. Furthermore, we extend these insights to Sharpness-Aware Minimization (SAM), showing that the principles governing label noise SGD also apply to broader optimization algorithms. Extensive experiments, conducted under both synthetic and real-world setups, strongly support our theory. Our code is released at https://github.com/a-usually/Label-Noise-SGD.
Executive Summary
This article delves into the learning dynamics of two-layer linear networks trained with stochastic gradient descent (SGD) and label noise. The authors identify a two-phase learning behavior: Phase I, where model weights decrease in magnitude and the model transitions from the lazy to the rich regime, and Phase II, where weights align with the ground-truth interpolator and converge. Label noise is shown to drive this transition, explaining its empirical success. The study extends to Sharpness-Aware Minimization (SAM) and is supported by extensive experiments. The findings highlight the importance of label noise in deep learning and its effects on model generalization.
Key Points
- ▸ The authors analyze the learning dynamics of two-layer linear networks with label noise SGD.
- ▸ A two-phase learning behavior is identified, with Phase I characterized by decreasing model weights and Phase II by increasing alignment with the ground-truth interpolator.
- ▸ Label noise is shown to drive the transition from the lazy to the rich regime, explaining its empirical success.
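The mechanism the key points describe can be made concrete with a toy simulation. The sketch below is our own minimal construction, not the paper's exact setup: an over-parameterized two-layer linear network trained by SGD in which fresh Gaussian noise is added to the sampled label at every step. All dimensions, the learning rate, and the noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative assumptions, not the paper's exact construction):
# over-parameterized two-layer linear net f(x) = a @ (W @ x), with n < d.
n, d, h = 5, 20, 10                # samples, input dim, hidden width
X = rng.normal(size=(n, d))
w_star = np.zeros(d)
w_star[:5] = 1.0                   # sparse ground-truth interpolator
y = X @ w_star                     # clean labels

W = rng.normal(size=(h, d)) * 0.5
a = rng.normal(size=h) * 0.5
lr, sigma, steps = 1e-3, 0.5, 5000

norms = []
for _ in range(steps):
    i = rng.integers(n)
    y_noisy = y[i] + sigma * rng.normal()   # label noise, resampled per step
    hidden = W @ X[i]
    g = a @ hidden - y_noisy                # residual on the noisy label
    # gradients of the per-sample loss 0.5 * g**2,
    # computed before either layer is updated
    grad_a = g * hidden
    grad_W = g * np.outer(a, X[i])
    a -= lr * grad_a
    W -= lr * grad_W
    norms.append(np.linalg.norm(a) ** 2 + np.linalg.norm(W) ** 2)

# Over a long horizon, the total weight norm tends to drift down (Phase I)
# while the fit to the clean labels improves; this short run illustrates
# the update rule rather than the full two-phase trajectory.
clean_mse = float(np.mean(((X @ W.T) @ a - y) ** 2))
print(norms[0], norms[-1], clean_mse)
```

The per-step label resampling is the essential ingredient: the noise never averages out across epochs, so it keeps injecting gradient variance even after the clean labels are fit, which is what drives the implicit-bias effect the paper analyzes.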
Merits
Strength in Analytical Approach
The authors employ a rigorous analytical approach to understanding the learning dynamics of two-layer linear networks, providing insights into the role of label noise in deep learning.
Empirical Support
The study is supported by extensive experiments under both synthetic and real-world setups, lending credibility to the authors' findings.
Demerits
Limited to Linear Networks
The study's focus on two-layer linear networks may limit its generalizability to more complex neural network architectures.
Assumes Stationarity
The analysis assumes stationarity of the noise process, which may not hold in practice, potentially affecting the results' applicability.
Expert Commentary
The article provides a comprehensive analysis of the learning dynamics of two-layer linear networks with label noise SGD. The identification of a two-phase learning behavior and the role of label noise in driving this transition are significant contributions to the field of deep learning. However, the study's limitations, such as its focus on linear networks and assumption of stationarity, should be considered when interpreting the results. The extension to Sharpness-Aware Minimization (SAM) is a notable aspect of the study, highlighting the broader implications of label noise for optimization algorithms. The findings have the potential to inform the development of more effective training protocols for deep learning models, leading to improved performance in real-world applications.
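For readers unfamiliar with SAM, the generic update it performs can be sketched as follows. This is a standard two-step form of the SAM rule, not the paper's specific analysis; the function names, the toy quadratic loss, and all hyperparameters are our own illustrative choices.

```python
import numpy as np

# Hedged sketch of the generic SAM update (names and toy loss are ours):
#   eps = rho * g / ||g||         ascend to a nearby high-loss point
#   w  <- w - lr * grad(w + eps)  descend using the perturbed gradient
def sam_step(w, grad_fn, lr, rho):
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # worst-case perturbation
    return w - lr * grad_fn(w + eps)

# Toy quadratic L(w) = 0.5 * w @ A @ w with one sharp and one flat direction
A = np.diag([10.0, 1.0])
grad = lambda w: A @ w

w = np.array([1.0, 1.0])
for _ in range(20):
    w = sam_step(w, grad, lr=0.05, rho=0.1)
# The sharp direction is suppressed fastest; w ends well inside the unit ball.
print(w)
```

The connection the paper draws is that the ascent perturbation plays a role analogous to label noise: both inject a structured disturbance into the gradient that biases training toward flatter, better-generalizing minima.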
Recommendations
- ✓ Future studies should investigate the generalizability of the findings to more complex neural network architectures.
- ✓ The assumption of stationarity should be relaxed to better capture the non-stationary nature of real-world noise processes.