Rolling-Origin Validation Reverses Model Rankings in Multi-Step PM10 Forecasting: XGBoost, SARIMA, and Persistence
arXiv:2603.20315v1 Announce Type: new Abstract: (a) Many air quality forecasting studies report gains from machine learning, but evaluations often use static chronological splits and omit persistence baselines, so the operational added value under routine updating is unclear. (b) Using 2,350 daily PM10 observations from 2017 to 2024 at an urban background monitoring station in southern Europe, we compare XGBoost and SARIMA against persistence under a static split and a rolling-origin protocol with monthly updates. We report horizon-specific skill and the predictability horizon, defined as the maximum horizon with positive persistence-relative skill. Static evaluation suggests XGBoost performs well from one to seven days ahead, but rolling-origin evaluation reverses rankings: XGBoost is not consistently better than persistence at short and intermediate horizons, whereas SARIMA remains positively skilled across the full range. (c) For researchers, static splits can overstate operational usefulness and change rankings. For practitioners, rolling-origin, persistence-referenced skill profiles show which methods stay reliable at each lead time.
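The abstract defines the predictability horizon in terms of persistence-relative skill but does not state the error metric. As a hedged illustration only, assuming an MAE-based skill score, the two quantities could be written as:

```latex
% Illustrative definitions only; the exact error metric is not stated in the
% abstract, and MAE is assumed here purely for concreteness.
S(h) \;=\; 1 - \frac{\mathrm{MAE}_{\mathrm{model}}(h)}{\mathrm{MAE}_{\mathrm{persistence}}(h)},
\qquad
H^{\ast} \;=\; \max\{\, h : S(h) > 0 \,\}
```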
Executive Summary
This article presents a critical evaluation of multi-step PM10 forecasting, comparing a gradient-boosted machine learning model (XGBoost) and a classical statistical model (SARIMA) against a persistence baseline under different evaluation protocols. The study shows that a rolling-origin validation approach with monthly updates, which mimics routine operational retraining, reverses the model rankings obtained under a static chronological split, suggesting that the operational added value of the machine learning model is overstated by static evaluation. The findings matter for both researchers and practitioners, underlining the importance of persistence-referenced, horizon-specific evaluation when assessing how reliable forecasting models remain at each lead time.
Key Points
- ▸ Static chronological splits without a persistence baseline can overstate the operational usefulness of machine learning forecasts
- ▸ Under rolling-origin evaluation with monthly updates, XGBoost is not consistently better than persistence at short and intermediate horizons
- ▸ SARIMA remains positively skilled across the full range of horizons under rolling-origin evaluation
Merits
Strength in methodology
The study employs a robust evaluation protocol, pairing a rolling-origin validation scheme with monthly updates against a persistence baseline, which gives a more realistic picture of operational added value than a single static chronological split.
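As a rough illustration of such a protocol, the sketch below refits a SARIMA model at monthly origins and scores it against a last-value persistence forecast over a seven-day horizon. The series name `pm10`, the SARIMA order, the warm-up period, and the use of absolute errors are assumptions made for illustration, not the paper's exact configuration.

```python
# Hedged sketch of a rolling-origin evaluation with monthly refits.
# Assumes `pm10` is a daily pandas Series with a DatetimeIndex; the SARIMA
# order, warm-up length, and 7-day horizon are illustrative choices.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def rolling_origin_eval(pm10: pd.Series, horizons=range(1, 8)) -> pd.DataFrame:
    """Refit at each monthly origin, forecast 1..7 days ahead, and collect
    absolute errors for the model and the persistence baseline."""
    errors = []  # rows: (origin, horizon, model_abs_err, persistence_abs_err)
    origins = pd.date_range(pm10.index[0] + pd.DateOffset(years=3),   # assumed warm-up
                            pm10.index[-1] - pd.Timedelta(days=7),
                            freq="MS")                                 # monthly update points
    for origin in origins:
        train = pm10.loc[:origin]
        # SARIMA with a weekly seasonal cycle -- an assumed specification.
        model = SARIMAX(train, order=(1, 0, 1),
                        seasonal_order=(1, 0, 1, 7)).fit(disp=False)
        forecast = model.forecast(steps=max(horizons))
        last_obs = train.iloc[-1]  # persistence: carry the last observed value forward
        for h in horizons:
            target_day = origin + pd.Timedelta(days=h)
            if target_day not in pm10.index:
                continue
            actual = pm10.loc[target_day]
            errors.append((origin, h,
                           abs(forecast.iloc[h - 1] - actual),
                           abs(last_obs - actual)))
    return pd.DataFrame(errors, columns=["origin", "horizon",
                                         "model_abs_err", "persistence_abs_err"])
```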
Comprehensive analysis
The study analyses the performance of XGBoost and SARIMA separately at each horizon from one to seven days ahead, offering a nuanced, persistence-referenced view of where each model retains skill and where it does not.
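Continuing the hedged sketch above, the per-origin errors could be aggregated into a horizon-specific skill profile and a predictability horizon as follows (again assuming an MAE-based skill score, which the abstract does not specify):

```python
# Hedged continuation of the sketch above: turn per-origin absolute errors
# into a persistence-referenced skill profile and a predictability horizon.
import pandas as pd

def skill_profile(errors: pd.DataFrame) -> pd.Series:
    """Mean absolute error per horizon, expressed as skill relative to persistence."""
    by_h = errors.groupby("horizon").mean(numeric_only=True)
    return 1.0 - by_h["model_abs_err"] / by_h["persistence_abs_err"]

def predictability_horizon(skill: pd.Series) -> int:
    """Maximum horizon with positive persistence-relative skill (0 if none)."""
    positive = skill[skill > 0]
    return int(positive.index.max()) if not positive.empty else 0
```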
Practical implications
The study translates its findings into practical guidance, emphasizing that the evaluation protocol should mirror how a forecasting model will actually be updated and used in operation.
Demerits
Limitation in data
The study relies on a single dataset from an urban background monitoring station in southern Europe, which may not be representative of other regions or environments.
Assumption of stationarity
The SARIMA specification assumes the series is stationary after (seasonal) differencing, an assumption that may not hold for PM10 under changing emission and meteorological regimes, potentially affecting the evaluation results.
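One way to probe this concern, sketched below under assumed names, is an augmented Dickey-Fuller test on the raw and seasonally differenced PM10 series before fitting SARIMA; this is an illustrative diagnostic, not part of the paper's reported methodology.

```python
# Illustrative stationarity check; not part of the paper's methodology.
# Assumes a daily PM10 series `pm10` as a pandas Series; the weekly
# seasonal lag is an assumption.
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def check_stationarity(pm10: pd.Series, seasonal_lag: int = 7) -> dict:
    """Run an augmented Dickey-Fuller test on the raw and seasonally
    differenced series and return the test statistic and p-value for each."""
    results = {}
    candidates = {"raw": pm10,
                  "seasonal_diff": pm10.diff(seasonal_lag)}
    for name, series in candidates.items():
        stat, pvalue, *_ = adfuller(series.dropna())
        results[name] = {"adf_stat": stat, "p_value": pvalue}
    return results
```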
Expert Commentary
The study is a timely contribution to air quality forecasting, showing that static evaluation protocols can inflate the apparent gains of machine learning methods and that persistence-referenced, rolling-origin evaluation gives a sounder basis for model selection. The findings carry weight for both researchers and practitioners, who need to match the evaluation protocol to how a model will be maintained in operation. The results also underscore the value of interdisciplinary collaboration, with machine learning, statistics, and environmental science contributing jointly to more reliable forecasting models.
Recommendations
- ✓ Researchers should employ robust evaluation protocols, including rolling-origin validation and persistence-referenced skill scores, when assessing forecasting models.
- ✓ Practitioners should account for the limitations of machine learning forecasts relative to simple baselines and choose evaluation protocols that reflect the update cycle and requirements of their applications.
Sources
Original: arXiv - cs.LG