
Improved Scaling Laws via Weak-to-Strong Generalization in Random Feature Ridge Regression


Diyuan Wu, Lehan Chen, Theodor Misiakiewicz, Marco Mondelli

arXiv:2603.05691v1 Announce Type: new Abstract: It is increasingly common in machine learning to use learned models to label data and then employ such data to train more capable models. The phenomenon of weak-to-strong generalization exemplifies the advantage of this two-stage procedure: a strong student is trained on imperfect labels obtained from a weak teacher, and yet the strong student outperforms the weak teacher. In this paper, we show that the potential improvement is substantial, in the sense that it affects the scaling law followed by the test error. Specifically, we consider students and teachers trained via random feature ridge regression (RFRR). Our main technical contribution is to derive a deterministic equivalent for the excess test error of the student trained on labels obtained via the teacher. Via this deterministic equivalent, we then identify regimes in which the scaling law of the student improves upon that of the teacher, unveiling that the improvement can be achieved both in bias-dominated and variance-dominated settings. Strikingly, the student may attain the minimax optimal rate regardless of the scaling law of the teacher -- in fact, when the test error of the teacher does not even decay with the sample size.

Executive Summary

This article summarizes a study of weak-to-strong generalization in random feature ridge regression (RFRR), in which a strong student model is trained on imperfect labels produced by a weak teacher model. The authors derive a deterministic equivalent for the student's excess test error and use it to identify regimes in which the student's scaling law improves upon the teacher's. Strikingly, the student can attain the minimax optimal rate regardless of the teacher's scaling law, even when the teacher's test error does not decay with the sample size. These results bear directly on training pipelines that label data with learned models, showing that gains are possible in both bias-dominated and variance-dominated settings.

Key Points

  • A strong student trained via random feature ridge regression (RFRR) on labels produced by a weak RFRR teacher can outperform that teacher, to the point of following a better scaling law for the test error.
  • The main technical contribution is a deterministic equivalent for the student's excess test error, giving a precise mathematical handle on when and how the scaling law improves.
  • The improvement arises in both bias-dominated and variance-dominated regimes, and the student can attain the minimax optimal rate even when the teacher's test error does not decay with the sample size.
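To make the two-stage setup concrete, here is a minimal numerical sketch of weak-to-strong training with RFRR. All specifics below — the ReLU random features, the target function, the model sizes, and the ridge penalties — are illustrative choices, not parameters taken from the paper.

```python
# Weak-to-strong sketch: a weak RFRR teacher labels data for a strong RFRR student.
# Target function, feature map, sizes, and ridge values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def rfrr_fit(X, y, n_features, ridge):
    """Fit ridge regression on random ReLU features phi(x) = relu(W x)."""
    d = X.shape[1]
    W = rng.normal(size=(n_features, d)) / np.sqrt(d)   # fixed random weights
    Phi = np.maximum(X @ W.T, 0.0)                      # (n_samples, n_features)
    A = Phi.T @ Phi + ridge * np.eye(n_features)
    coef = np.linalg.solve(A, Phi.T @ y)                # ridge solution
    return W, coef

def rfrr_predict(X, W, coef):
    return np.maximum(X @ W.T, 0.0) @ coef

d = 20
f = lambda X: np.tanh(X[:, 0] + 0.5 * X[:, 1])          # illustrative target

# Weak teacher: few random features, trained on noisy ground-truth labels.
Xt = rng.normal(size=(200, d))
yt = f(Xt) + 0.1 * rng.normal(size=200)
Wt, ct = rfrr_fit(Xt, yt, n_features=50, ridge=1e-1)

# Strong student: many features, trained ONLY on teacher-generated labels.
Xs = rng.normal(size=(2000, d))
ys = rfrr_predict(Xs, Wt, ct)                           # imperfect labels
Ws, cs = rfrr_fit(Xs, ys, n_features=2000, ridge=1e-3)

# Compare test errors on fresh data against the true target.
Xe = rng.normal(size=(5000, d))
err_teacher = np.mean((rfrr_predict(Xe, Wt, ct) - f(Xe)) ** 2)
err_student = np.mean((rfrr_predict(Xe, Ws, cs) - f(Xe)) ** 2)
print(f"teacher MSE: {err_teacher:.4f}  student MSE: {err_student:.4f}")
```

The paper's scaling-law results concern how these errors decay as the sample size grows; a single run like this only illustrates the pipeline, not the asymptotic rates.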

Merits

Rigorous mathematical analysis

The article provides a rigorous mathematical framework for understanding the weak-to-strong generalization phenomenon in RFRR, which is a significant contribution to the field of machine learning.

Demerits

Limited real-world applicability

The article's findings are based on a specific mathematical model and may not directly translate to real-world applications, where complex interactions and noise can affect model performance.

Expert Commentary

The findings are significant and may influence how practitioners reason about training on model-generated labels. That said, the analysis is confined to a specific theoretical model (RFRR), so its transfer to real-world systems, where complex interactions and noise affect performance, deserves scrutiny. The work also underscores the importance of understanding the scaling laws of machine learning models, a critical aspect of model development and deployment.

Recommendations

  • Future research should focus on extending the mathematical framework to more complex scenarios, such as non-linear models and high-dimensional data.
  • The article's findings should be tested and validated in real-world applications to better understand their practical implications and limitations.
