DQE: A Semantic-Aware Evaluation Metric for Time Series Anomaly Detection

arXiv:2603.06131v1 Announce Type: new

Abstract: Time series anomaly detection has achieved remarkable progress in recent years. However, evaluation practices have received comparatively less attention, despite their critical importance. Existing metrics exhibit several limitations: (1) bias toward point-level coverage, (2) insensitivity or inconsistency in near-miss detections, (3) inadequate penalization of false alarms, and (4) inconsistency caused by threshold or threshold-interval selection. These limitations can produce unreliable or counterintuitive results, hindering objective progress. In this work, we revisit the evaluation of time series anomaly detection from the perspective of detection semantics and propose a novel metric for more comprehensive assessment. We first introduce a partitioning strategy grounded in detection semantics, which decomposes the local temporal region of each anomaly into three functionally distinct subregions. Using this partitioning, we evaluate overall detection behavior across events and design finer-grained scoring mechanisms for each subregion, enabling more reliable and interpretable assessment. Through a systematic study of existing metrics, we identify an evaluation bias associated with threshold-interval selection and adopt an approach that aggregates detection qualities across the full threshold spectrum, thereby eliminating evaluation inconsistency. Extensive experiments on synthetic and real-world data demonstrate that our metric provides stable, discriminative, and interpretable evaluation, while achieving robust assessment compared with ten widely used metrics.

Executive Summary

This article proposes DQE, a new evaluation metric for time series anomaly detection. It targets four limitations of existing metrics: bias toward point-level coverage, insensitivity or inconsistency in near-miss detections, inadequate penalization of false alarms, and inconsistency caused by threshold or threshold-interval selection. DQE partitions the local temporal region of each anomaly into three functionally distinct subregions, scores each subregion with its own finer-grained mechanism, evaluates overall detection behavior across events, and aggregates detection quality across the full threshold spectrum. According to the abstract, experiments on synthetic and real-world data show that DQE provides stable, discriminative, and interpretable evaluation and assesses detectors robustly when compared with ten widely used metrics. If these results hold, DQE could make comparisons between anomaly detectors more reliable and objective.

Key Points

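  • Existing evaluation metrics for time series anomaly detection are biased toward point-level coverage, handle near-miss detections insensitively or inconsistently, under-penalize false alarms, and vary with threshold or threshold-interval selection.
  • DQE partitions the local temporal region of each anomaly into three functionally distinct subregions, grounded in detection semantics, and scores each subregion separately.
  • DQE aggregates detection quality across the full threshold spectrum to remove the inconsistency introduced by threshold-interval selection.
  • Experiments on synthetic and real-world data indicate that DQE provides stable, discriminative, and interpretable evaluation when compared with ten widely used metrics.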
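
The first limitation, bias toward point-level coverage, can be illustrated with a small sketch. The code below is not part of DQE; it uses ordinary point-wise F1 on made-up toy data (the labels, the two detectors, and the function name pointwise_f1 are all assumptions for illustration) to show how a detector that misses three of four events can still outscore one that finds every event.

```python
def pointwise_f1(labels, preds):
    """Point-wise F1 over two equal-length 0/1 sequences."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Toy ground truth (assumed): one long event of 20 points
# and three short events of 2 points each.
labels = [0] * 100
for start, length in [(10, 20), (40, 2), (60, 2), (80, 2)]:
    for i in range(start, start + length):
        labels[i] = 1

# Detector A covers the long event fully and misses all three short events.
preds_a = [0] * 100
for i in range(10, 30):
    preds_a[i] = 1

# Detector B flags a single point in each event, so it finds all four events.
preds_b = [0] * 100
for i in (10, 40, 60, 80):
    preds_b[i] = 1

f1_a = pointwise_f1(labels, preds_a)
f1_b = pointwise_f1(labels, preds_b)
print(f"A (1 of 4 events found): {f1_a:.3f}")
print(f"B (4 of 4 events found): {f1_b:.3f}")
```

Detector A scores higher purely because the long event contributes most of the labeled points, which is the kind of counterintuitive ranking an event-aware metric is meant to avoid.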

Merits

Strength

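DQE directly targets the four limitations identified in existing metrics (point-level coverage bias, near-miss handling, false-alarm penalization, and threshold sensitivity), which should make comparisons between anomaly detectors more reliable and objective.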

Comprehensive assessment

The proposed partitioning strategy and finer-grained scoring mechanisms provide a more comprehensive understanding of detection behavior across events.
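
The abstract does not name the three subregions or say how their boundaries are chosen, so the following is only a hypothetical sketch of what partitioning an event's local region could look like. The region names ("before", "inside", "after"), the fixed margin parameter, and both function names are assumptions, not the paper's definitions.

```python
def partition_event(start, end, series_len, margin):
    """Split the neighbourhood of one labeled event [start, end) into three
    half-open index ranges.

    Hypothetical: the region names and the fixed `margin` are illustrative
    assumptions, not taken from the paper.
    """
    before = (max(0, start - margin), start)
    inside = (start, end)
    after = (end, min(series_len, end + margin))
    return {"before": before, "inside": inside, "after": after}


def locate(index, regions):
    """Return the name of the subregion containing `index`, or 'outside'."""
    for name, (lo, hi) in regions.items():
        if lo <= index < hi:
            return name
    return "outside"


# One labeled event covering indices 50..59 in a series of length 100.
regions = partition_event(start=50, end=60, series_len=100, margin=5)

# Classify four detections by where they fall relative to the event.
hits = [locate(i, regions) for i in (47, 55, 62, 80)]
print(hits)
```

With a partition like this, a detection just before or after an event can be scored differently from one inside the event or one far from any event, which is the general idea behind subregion-specific scoring.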

Demerits

Limitation

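Because DQE aggregates detection quality across the full threshold spectrum, it may require significant computational resources on large-scale datasets, which could limit its practical use. The abstract does not report runtime, so this concern remains unconfirmed.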
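
To make the cost concern concrete, here is a naive sketch of aggregating a quality score over every possible threshold. It is not DQE's scoring function: point-wise F1 stands in as a placeholder quality measure, and the names f1_at and mean_quality_over_thresholds, along with the toy scores, are assumptions. The point is only that a full sweep evaluates the series once per distinct score value.

```python
def f1_at(scores, labels, threshold):
    """Point-wise F1 when every score >= threshold is flagged as anomalous."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def mean_quality_over_thresholds(scores, labels, quality=f1_at):
    """Average `quality` over every distinct score used as a threshold.

    Returns the mean and the number of evaluations performed. Each
    evaluation scans the whole series, so the naive cost grows with
    (series length) x (number of distinct scores).
    """
    thresholds = sorted(set(scores))
    values = [quality(scores, labels, t) for t in thresholds]
    return sum(values) / len(values), len(thresholds)


# Toy anomaly scores and labels (assumed for illustration).
scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.1]
labels = [0, 0, 1, 1, 0, 0]

mean_f1, n_evals = mean_quality_over_thresholds(scores, labels)
print(f"mean quality = {mean_f1:.3f} over {n_evals} thresholds")
```

For a series with many distinct score values, this naive sweep becomes quadratic in the worst case, so a practical implementation would likely sort scores once and update counts incrementally or subsample thresholds.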

Expert Commentary

DQE is a meaningful step forward for the evaluation of time series anomaly detection. Its semantic partitioning and per-subregion scoring make results easier to interpret, and aggregating over the full threshold spectrum removes a source of inconsistency that affects existing metrics. The main open question is computational cost, which may limit practical use on large datasets. If that cost proves manageable, DQE could support fairer comparison of detectors in applied settings across industries.

Recommendations

  • Future research should develop more efficient algorithms for computing DQE so that it can be applied to large-scale datasets.
  • Practitioners should use the metric to evaluate and compare anomaly detection algorithms on real-world data, supporting better-informed decisions in deployment.
