Rolling-Origin Validation Reverses Model Rankings in Multi-Step PM10 Forecasting: XGBoost, SARIMA, and Persistence
arXiv:2603.20315v1 Announce Type: new Abstract: (a) Many air quality forecasting studies report gains from machine learning, but evaluations often use static chronological splits and omit persistence baselines, so the operational added value under routine updating is unclear. (b) Using 2,350 daily PM10 observations from 2017 to 2024 at an urban background monitoring station in southern Europe, we compare XGBoost and SARIMA against persistence under a static split and a rolling-origin protocol with monthly updates. We report horizon-specific skill and the predictability horizon, defined as the maximum horizon with positive persistence-relative skill. Static evaluation suggests XGBoost performs well from one to seven days ahead, but rolling-origin evaluation reverses rankings: XGBoost is not consistently better than persistence at short and intermediate horizons, whereas SARIMA remains positively skilled across the full range. (c) For researchers, static splits can overstate operational usefulness and change rankings. For practitioners, rolling-origin, persistence-referenced skill profiles show which methods stay reliable at each lead time.
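The abstract defines the predictability horizon in terms of persistence-relative skill but does not state the error metric. As a hedged illustration only, assuming an MAE-based skill score, the two quantities could be written as:

```latex
% Illustrative definitions only; the exact error metric is not stated in the
% abstract, and MAE is assumed here purely for concreteness.
S(h) \;=\; 1 - \frac{\mathrm{MAE}_{\mathrm{model}}(h)}{\mathrm{MAE}_{\mathrm{persistence}}(h)},
\qquad
H^{\ast} \;=\; \max\{\, h : S(h) > 0 \,\}
```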
Executive Summary
This article presents a critical evaluation of multi-step PM10 forecasting, comparing a gradient-boosted machine learning model (XGBoost) and a classical statistical model (SARIMA) against a persistence baseline under different evaluation protocols. The study shows that a rolling-origin validation approach with monthly updates, which mimics routine operational retraining, reverses the model rankings obtained under a static chronological split, suggesting that the operational added value of the machine learning model is overstated by static evaluation. The findings matter for both researchers and practitioners, underlining the importance of persistence-referenced, horizon-specific evaluation when assessing how reliable forecasting models remain at each lead time.
Key Points
- ▸ Static chronological splits without a persistence baseline can overstate the operational usefulness of machine learning forecasts
- ▸ Under rolling-origin evaluation with monthly updates, XGBoost is not consistently better than persistence at short and intermediate horizons
- ▸ SARIMA remains positively skilled across the full range of horizons under rolling-origin evaluation
Merits
Strength in methodology
The study employs a robust evaluation protocol, pairing a rolling-origin validation scheme with monthly updates against a persistence baseline, which gives a more realistic picture of operational added value than a single static chronological split.
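As a rough illustration of such a protocol, the sketch below refits a SARIMA model at monthly origins and scores it against a last-value persistence forecast over a seven-day horizon. The series name `pm10`, the SARIMA order, the warm-up period, and the use of absolute errors are assumptions made for illustration, not the paper's exact configuration.

```python
# Hedged sketch of a rolling-origin evaluation with monthly refits.
# Assumes `pm10` is a daily pandas Series with a DatetimeIndex; the SARIMA
# order, warm-up length, and 7-day horizon are illustrative choices.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def rolling_origin_eval(pm10: pd.Series, horizons=range(1, 8)) -> pd.DataFrame:
    """Refit at each monthly origin, forecast 1..7 days ahead, and collect
    absolute errors for the model and the persistence baseline."""
    errors = []  # rows: (origin, horizon, model_abs_err, persistence_abs_err)
    origins = pd.date_range(pm10.index[0] + pd.DateOffset(years=3),   # assumed warm-up
                            pm10.index[-1] - pd.Timedelta(days=7),
                            freq="MS")                                 # monthly update points
    for origin in origins:
        train = pm10.loc[:origin]
        # SARIMA with a weekly seasonal cycle -- an assumed specification.
        model = SARIMAX(train, order=(1, 0, 1),
                        seasonal_order=(1, 0, 1, 7)).fit(disp=False)
        forecast = model.forecast(steps=max(horizons))
        last_obs = train.iloc[-1]  # persistence: carry the last observed value forward
        for h in horizons:
            target_day = origin + pd.Timedelta(days=h)
            if target_day not in pm10.index:
                continue
            actual = pm10.loc[target_day]
            errors.append((origin, h,
                           abs(forecast.iloc[h - 1] - actual),
                           abs(last_obs - actual)))
    return pd.DataFrame(errors, columns=["origin", "horizon",
                                         "model_abs_err", "persistence_abs_err"])
```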
Comprehensive analysis
The study analyses the performance of XGBoost and SARIMA separately at each horizon from one to seven days ahead, offering a nuanced, persistence-referenced view of where each model retains skill and where it does not.
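Continuing the hedged sketch above, the per-origin errors could be aggregated into a horizon-specific skill profile and a predictability horizon as follows (again assuming an MAE-based skill score, which the abstract does not specify):

```python
# Hedged continuation of the sketch above: turn per-origin absolute errors
# into a persistence-referenced skill profile and a predictability horizon.
import pandas as pd

def skill_profile(errors: pd.DataFrame) -> pd.Series:
    """Mean absolute error per horizon, expressed as skill relative to persistence."""
    by_h = errors.groupby("horizon").mean(numeric_only=True)
    return 1.0 - by_h["model_abs_err"] / by_h["persistence_abs_err"]

def predictability_horizon(skill: pd.Series) -> int:
    """Maximum horizon with positive persistence-relative skill (0 if none)."""
    positive = skill[skill > 0]
    return int(positive.index.max()) if not positive.empty else 0
```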
Practical implications
The study translates its findings into practical guidance, emphasizing that the evaluation protocol should mirror how a forecasting model will actually be updated and used in operation.
Demerits
Limitation in data
The study relies on a single dataset from an urban background monitoring station in southern Europe, which may not be representative of other regions or environments.
Assumption of stationarity
The SARIMA specification assumes the series is stationary after (seasonal) differencing, an assumption that may not hold for PM10 under changing emission and meteorological regimes, potentially affecting the evaluation results.
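One way to probe this concern, sketched below under assumed names, is an augmented Dickey-Fuller test on the raw and seasonally differenced PM10 series before fitting SARIMA; this is an illustrative diagnostic, not part of the paper's reported methodology.

```python
# Illustrative stationarity check; not part of the paper's methodology.
# Assumes a daily PM10 series `pm10` as a pandas Series; the weekly
# seasonal lag is an assumption.
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def check_stationarity(pm10: pd.Series, seasonal_lag: int = 7) -> dict:
    """Run an augmented Dickey-Fuller test on the raw and seasonally
    differenced series and return the test statistic and p-value for each."""
    results = {}
    candidates = {"raw": pm10,
                  "seasonal_diff": pm10.diff(seasonal_lag)}
    for name, series in candidates.items():
        stat, pvalue, *_ = adfuller(series.dropna())
        results[name] = {"adf_stat": stat, "p_value": pvalue}
    return results
```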
Expert Commentary
The study is a timely contribution to air quality forecasting, showing that static evaluation protocols can inflate the apparent gains of machine learning methods and that persistence-referenced, rolling-origin evaluation gives a sounder basis for model selection. The findings carry weight for both researchers and practitioners, who need to match the evaluation protocol to how a model will be maintained in operation. The results also underscore the value of interdisciplinary collaboration, with machine learning, statistics, and environmental science contributing jointly to more reliable forecasting models.
Recommendations
- ✓ Researchers should employ robust evaluation protocols, including rolling-origin validation and persistence-referenced skill scores, when assessing forecasting models.
- ✓ Practitioners should account for the limitations of machine learning forecasts relative to simple baselines and choose evaluation protocols that reflect the update cycle and requirements of their applications.
Sources
Original: arXiv - cs.LG