Benchmarking IoT Time-Series AD with Event-Level Augmentations
arXiv:2602.15457v1

Abstract: Anomaly detection (AD) for safety-critical IoT time series should be judged at the event level: reliability and earliness under realistic perturbations. Yet many studies still emphasize point-level results on curated base datasets, limiting their value for model selection in practice. We introduce an evaluation protocol with unified event-level augmentations that simulate real-world issues: calibrated sensor dropout, linear and log drift, additive noise, and window shifts. We also perform sensor-level probing via mask-as-missing zeroing with per-channel influence estimation to support root-cause analysis. We evaluate 14 representative models on five public anomaly datasets (SWaT, WADI, SMD, SKAB, TEP) and two industrial datasets (steam turbine, nuclear turbogenerator) using unified splits and event aggregation. There is no universal winner: graph-structured models transfer best under dropout and long events (e.g., on SWaT under additive noise, F1 drops 0.804 → 0.677 for a graph autoencoder, 0.759 → 0.680 for a graph-attention variant, and 0.762 → 0.756 for a hybrid graph attention model); density/flow models work well on clean stationary plants but can be fragile to monotone drift; spectral CNNs lead when periodicity is strong; reconstruction autoencoders become competitive after basic sensor vetting; predictive/hybrid dynamics help when faults break temporal dependencies but remain window-sensitive. The protocol also informs design choices: on SWaT under log drift, replacing normalizing flows with a Gaussian density reduces high-stress F1 from ~0.75 to ~0.57, and fixing a learned DAG gives a small clean-set gain (~0.5-1.0 points) but increases drift sensitivity by ~8x.
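To make the perturbation suite concrete, here is a minimal sketch of the four augmentation families named in the abstract, assuming each test window is a (time, channels) NumPy array. Function names and default magnitudes are illustrative, not the paper's implementation; a calibrated protocol would tune these per dataset.

```python
import numpy as np

def sensor_dropout(x, p=0.1, rng=None):
    """Zero a random subset of channels for the whole window
    (a calibrated variant would tune p per dataset)."""
    rng = rng or np.random.default_rng(0)
    out = x.copy()
    out[:, rng.random(x.shape[1]) < p] = 0.0
    return out

def linear_drift(x, slope=0.01):
    """Add a linearly growing offset to every channel."""
    return x + slope * np.arange(x.shape[0])[:, None]

def log_drift(x, scale=0.05):
    """Add a slowly saturating logarithmic offset."""
    return x + scale * np.log1p(np.arange(x.shape[0]))[:, None]

def additive_noise(x, sigma=0.05, rng=None):
    """Add i.i.d. Gaussian noise to every reading."""
    rng = rng or np.random.default_rng(0)
    return x + rng.normal(0.0, sigma, size=x.shape)

def window_shift(x, k=5):
    """Delay the window by k steps, repeating the first frame as padding."""
    if k <= 0:
        return x.copy()
    return np.vstack([np.repeat(x[:1], k, axis=0), x[:-k]])
```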
Executive Summary
This article introduces an evaluation protocol for anomaly detection in safety-critical IoT time-series data, built around event-level augmentations that simulate real-world issues such as sensor dropout, drift, noise, and window shifts. The protocol is used to evaluate 14 representative models on seven datasets, both public and industrial, and the findings show that no single model is universally effective: the best choice depends on dataset characteristics such as periodicity, stationarity, and event length. The study contributes insights into the strengths and weaknesses of different anomaly detection families and offers concrete guidance for model selection and design choices.
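As one way to picture the sensor-level probing the abstract describes, the sketch below zeroes one channel at a time (mask-as-missing) and records how far the model's anomaly score moves. The scalar `score_fn` interface is an assumption for illustration, not the paper's API.

```python
import numpy as np

def channel_influence(score_fn, x):
    """Per-channel influence for a (time, channels) window:
    |score with channel c zeroed - score on the full window|."""
    base = score_fn(x)
    influence = np.empty(x.shape[1])
    for c in range(x.shape[1]):
        masked = x.copy()
        masked[:, c] = 0.0            # treat channel c as missing
        influence[c] = abs(score_fn(masked) - base)
    return influence

# Rank channels for root-cause analysis on a flagged window
# (model.score is a placeholder for any window -> scalar scorer):
# top_channels = np.argsort(channel_influence(model.score, window))[::-1][:5]
```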
Key Points
- ▸ The evaluation protocol focuses on event-level augmentations that simulate real-world issues such as sensor dropout, drift, noise, and window shifts, and scores detections per event rather than per point (see the sketch after this list).
- ▸ The study evaluates 14 representative models on seven datasets: five public (SWaT, WADI, SMD, SKAB, TEP) and two industrial (steam turbine, nuclear turbogenerator).
- ▸ No single model is found to be universally effective across all datasets, suggesting that optimal model choice depends on dataset characteristics.
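To illustrate the event-level framing (reliability and earliness rather than per-point hits), here is a simple aggregation sketch. The any-hit credit rule and the delay definition are assumptions for illustration, not the paper's exact scoring.

```python
import numpy as np

def label_events(y):
    """Extract [start, end) intervals of contiguous 1s from a 0/1 array."""
    d = np.diff(np.concatenate(([0], y, [0])))
    return list(zip(np.where(d == 1)[0], np.where(d == -1)[0]))

def event_recall_and_delay(y_true, y_pred):
    """An event counts as detected if any alarm fires inside its interval;
    delay is the number of steps from event onset to the first alarm."""
    events = label_events(y_true)
    detected, delays = 0, []
    for s, e in events:
        hits = np.where(y_pred[s:e] == 1)[0]
        if hits.size:
            detected += 1
            delays.append(int(hits[0]))
    recall = detected / max(len(events), 1)
    mean_delay = float(np.mean(delays)) if delays else float("nan")
    return recall, mean_delay
```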
Merits
Comprehensive Evaluation Protocol
The study introduces a comprehensive evaluation protocol that simulates real-world perturbations, yielding a more realistic assessment of robustness than point-level metrics on curated, clean datasets.
Comparative Evaluation of Representative Models
The study evaluates 14 representative models, providing insights into their strengths and weaknesses across different datasets.
Demerits
Limited Generalizability
The study's findings may not generalize to other domains or applications, highlighting the need for further research to validate these results.
Complexity of Evaluation Protocol
The evaluation protocol introduced in this study may be complex and challenging to replicate, particularly for researchers without extensive experience in anomaly detection.
Expert Commentary
This study makes an important contribution to anomaly detection in IoT time-series data by introducing an evaluation protocol that simulates real-world perturbations at the event level. Its findings give practical guidance on the strengths and weaknesses of different model families: graph-structured models for dropout and long events, spectral CNNs for strongly periodic signals, density models for clean stationary plants. The reported design-choice experiments (for example, the roughly 8x increase in drift sensitivity from fixing a learned DAG) show that such trade-offs can be quantified rather than merely asserted. The protocol's complexity may make replication demanding, and further validation across domains is needed, but the implications for anomaly detection in industrial IoT applications are significant and could inform safety and regulatory practice.
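To make the density-swap result mentioned above concrete: a plain Gaussian density scorer, the simpler alternative the abstract compares against normalizing flows, is only a few lines. This sketch assumes standardized (time, channels) training data and is illustrative, not the paper's implementation.

```python
import numpy as np

def gaussian_score(train, x):
    """Mahalanobis^2 anomaly score per time step under a single Gaussian
    fit to (time, channels) training data; higher means more anomalous."""
    mu = train.mean(axis=0)
    cov = np.cov(train.T) + 1e-6 * np.eye(train.shape[1])  # regularized
    inv = np.linalg.inv(cov)
    d = x - mu
    return np.einsum('ti,ij,tj->t', d, inv, d)
```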
Recommendations
- ✓ Developers and researchers should prioritize the development of anomaly detection models that can effectively handle real-world issues such as sensor dropout, drift, and noise.
- ✓ Policy makers and regulators should require evidence that anomaly detection models deployed in safety-critical applications remain accurate under realistic perturbations, not only on clean benchmark data.