Benchmarking IoT Time-Series AD with Event-Level Augmentations
arXiv:2602.15457v1

Abstract: Anomaly detection (AD) for safety-critical IoT time series should be judged at the event level: reliability and earliness under realistic perturbations. Yet many studies still emphasize point-level results on curated base datasets, limiting their value for model selection in practice. We introduce an evaluation protocol with unified event-level augmentations that simulate real-world issues: calibrated sensor dropout, linear and log drift, additive noise, and window shifts. We also perform sensor-level probing via mask-as-missing zeroing with per-channel influence estimation to support root-cause analysis. We evaluate 14 representative models on five public anomaly datasets (SWaT, WADI, SMD, SKAB, TEP) and two industrial datasets (steam turbine, nuclear turbogenerator) using unified splits and event aggregation. There is no universal winner: graph-structured models transfer best under dropout and long events (e.g., on SWaT under additive noise, F1 drops 0.804 → 0.677 for a graph autoencoder, 0.759 → 0.680 for a graph-attention variant, and 0.762 → 0.756 for a hybrid graph attention model); density/flow models work well on clean stationary plants but can be fragile to monotone drift; spectral CNNs lead when periodicity is strong; reconstruction autoencoders become competitive after basic sensor vetting; predictive/hybrid dynamics help when faults break temporal dependencies but remain window-sensitive. The protocol also informs design choices: on SWaT under log drift, replacing normalizing flows with a Gaussian density reduces high-stress F1 from ~0.75 to ~0.57, and fixing a learned DAG gives a small clean-set gain (~0.5-1.0 points) but increases drift sensitivity by ~8x.
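To make the perturbation suite concrete, here is a minimal sketch of the four augmentation families named in the abstract, assuming each test window is a (time, channels) NumPy array. Function names and default magnitudes are illustrative, not the paper's implementation; a calibrated protocol would tune these per dataset.

```python
import numpy as np

def sensor_dropout(x, p=0.1, rng=None):
    """Zero a random subset of channels for the whole window
    (a calibrated variant would tune p per dataset)."""
    rng = rng or np.random.default_rng(0)
    out = x.copy()
    out[:, rng.random(x.shape[1]) < p] = 0.0
    return out

def linear_drift(x, slope=0.01):
    """Add a linearly growing offset to every channel."""
    return x + slope * np.arange(x.shape[0])[:, None]

def log_drift(x, scale=0.05):
    """Add a slowly saturating logarithmic offset."""
    return x + scale * np.log1p(np.arange(x.shape[0]))[:, None]

def additive_noise(x, sigma=0.05, rng=None):
    """Add i.i.d. Gaussian noise to every reading."""
    rng = rng or np.random.default_rng(0)
    return x + rng.normal(0.0, sigma, size=x.shape)

def window_shift(x, k=5):
    """Delay the window by k steps, repeating the first frame as padding."""
    if k <= 0:
        return x.copy()
    return np.vstack([np.repeat(x[:1], k, axis=0), x[:-k]])
```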
Executive Summary
This article introduces an evaluation protocol for anomaly detection in safety-critical IoT time-series data, built around event-level augmentations that simulate real-world issues such as sensor dropout, drift, noise, and window shifts. The protocol is used to evaluate 14 representative models on seven datasets, both public and industrial, and the findings show that no single model is universally effective: the best choice depends on dataset characteristics such as periodicity, stationarity, and event length. The study contributes insights into the strengths and weaknesses of different anomaly detection families and offers concrete guidance for model selection and design choices.
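As one way to picture the sensor-level probing the abstract describes, the sketch below zeroes one channel at a time (mask-as-missing) and records how far the model's anomaly score moves. The scalar `score_fn` interface is an assumption for illustration, not the paper's API.

```python
import numpy as np

def channel_influence(score_fn, x):
    """Per-channel influence for a (time, channels) window:
    |score with channel c zeroed - score on the full window|."""
    base = score_fn(x)
    influence = np.empty(x.shape[1])
    for c in range(x.shape[1]):
        masked = x.copy()
        masked[:, c] = 0.0            # treat channel c as missing
        influence[c] = abs(score_fn(masked) - base)
    return influence

# Rank channels for root-cause analysis on a flagged window
# (model.score is a placeholder for any window -> scalar scorer):
# top_channels = np.argsort(channel_influence(model.score, window))[::-1][:5]
```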
Key Points
- ▸ The evaluation protocol focuses on event-level augmentations that simulate real-world issues such as sensor dropout, drift, noise, and window shifts, and scores detections per event rather than per point (see the sketch after this list).
- ▸ The study evaluates 14 representative models on seven datasets: five public (SWaT, WADI, SMD, SKAB, TEP) and two industrial (steam turbine, nuclear turbogenerator).
- ▸ No single model is found to be universally effective across all datasets, suggesting that optimal model choice depends on dataset characteristics.
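To illustrate the event-level framing (reliability and earliness rather than per-point hits), here is a simple aggregation sketch. The any-hit credit rule and the delay definition are assumptions for illustration, not the paper's exact scoring.

```python
import numpy as np

def label_events(y):
    """Extract [start, end) intervals of contiguous 1s from a 0/1 array."""
    d = np.diff(np.concatenate(([0], y, [0])))
    return list(zip(np.where(d == 1)[0], np.where(d == -1)[0]))

def event_recall_and_delay(y_true, y_pred):
    """An event counts as detected if any alarm fires inside its interval;
    delay is the number of steps from event onset to the first alarm."""
    events = label_events(y_true)
    detected, delays = 0, []
    for s, e in events:
        hits = np.where(y_pred[s:e] == 1)[0]
        if hits.size:
            detected += 1
            delays.append(int(hits[0]))
    recall = detected / max(len(events), 1)
    mean_delay = float(np.mean(delays)) if delays else float("nan")
    return recall, mean_delay
```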
Merits
Comprehensive Evaluation Protocol
The study introduces a comprehensive evaluation protocol that simulates real-world perturbations, yielding a more realistic assessment of robustness than point-level metrics on curated, clean datasets.
Comparative Evaluation of Representative Models
The study evaluates 14 representative models, providing insights into their strengths and weaknesses across different datasets.
Demerits
Limited Generalizability
The study's findings may not generalize to other domains or applications, highlighting the need for further research to validate these results.
Complexity of Evaluation Protocol
The evaluation protocol introduced in this study may be complex and challenging to replicate, particularly for researchers without extensive experience in anomaly detection.
Expert Commentary
This study makes an important contribution to anomaly detection in IoT time-series data by introducing an evaluation protocol that simulates real-world perturbations at the event level. Its findings give practical guidance on the strengths and weaknesses of different model families: graph-structured models for dropout and long events, spectral CNNs for strongly periodic signals, density models for clean stationary plants. The reported design-choice experiments (for example, the roughly 8x increase in drift sensitivity from fixing a learned DAG) show that such trade-offs can be quantified rather than merely asserted. The protocol's complexity may make replication demanding, and further validation across domains is needed, but the implications for anomaly detection in industrial IoT applications are significant and could inform safety and regulatory practice.
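To make the density-swap result mentioned above concrete: a plain Gaussian density scorer, the simpler alternative the abstract compares against normalizing flows, is only a few lines. This sketch assumes standardized (time, channels) training data and is illustrative, not the paper's implementation.

```python
import numpy as np

def gaussian_score(train, x):
    """Mahalanobis^2 anomaly score per time step under a single Gaussian
    fit to (time, channels) training data; higher means more anomalous."""
    mu = train.mean(axis=0)
    cov = np.cov(train.T) + 1e-6 * np.eye(train.shape[1])  # regularized
    inv = np.linalg.inv(cov)
    d = x - mu
    return np.einsum('ti,ij,tj->t', d, inv, d)
```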
Recommendations
- ✓ Developers and researchers should prioritize the development of anomaly detection models that can effectively handle real-world issues such as sensor dropout, drift, and noise.
- ✓ Policy makers and regulators should require evidence that anomaly detection models deployed in safety-critical applications remain accurate under realistic perturbations, not only on clean benchmark data.