From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures

arXiv:2604.06448v1. Abstract: Prime Video regularly conducts load tests to simulate the viewer traffic spikes seen during live events such as Thursday Night Football as well as video-on-demand (VOD) events such as Rings of Power. While these stress tests validate system capacity, they can sometimes miss service behaviors unique to real event traffic. We present a graph-based anomaly detection system that identifies under-represented services using unsupervised node-level graph embeddings. Built on a GCN-GAE, our approach learns structural representations from directed, weighted service graphs at minute-level resolution and flags anomalies based on cosine similarity between load test and event embeddings. The system identifies incident-related services that are documented and demonstrates early detection capability. We also introduce a preliminary synthetic anomaly injection framework for controlled evaluation that shows promising precision (96%) and low false positive rate (0.08%), though recall (58%) remains limited under conservative propagation assumptions. This framework demonstrates practical utility within Prime Video while also surfacing methodological lessons and directions, providing a foundation for broader application across microservice ecosystems.

Executive Summary

This article introduces a graph-based anomaly detection system developed at Prime Video to identify services whose behavior during real events is under-represented in load tests, a gap that capacity-focused load testing can leave open. Using a GCN-GAE, the system learns structural representations from minute-level, directed, weighted service graphs and flags anomalies based on the cosine similarity between load-test and live-event embeddings. It identifies services tied to documented incidents, shows early detection capability, and includes a preliminary synthetic anomaly injection framework for evaluation, which reports high precision (96%) and a low false positive rate (0.08%), though recall remains limited (58%). The work offers a practical framework for microservice anomaly detection and surfaces methodological lessons for broader application in complex distributed systems.

Key Points

  • Addresses the limitations of traditional load testing in capturing unique service behaviors during live events.
  • Utilizes a GCN-GAE model for unsupervised node-level graph embedding on directed, weighted service graphs.
  • Flags anomalies using the cosine similarity between load-test and live-event embeddings, identifying services that load tests under-represent.
  • Features a preliminary synthetic anomaly injection framework for controlled evaluation, reporting strong precision (96%) and a low false positive rate (0.08%) but limited recall (58%).
  • Demonstrates practical utility within Prime Video and offers a foundational approach for broader microservice anomaly detection.
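
The cosine-similarity comparison described in the key points can be sketched as follows. This is a minimal illustration, not the authors' implementation: the service names, the two-dimensional embeddings, and the `threshold` value are all hypothetical, and the abstract does not state how the cutoff is chosen.

```python
import numpy as np

def cosine_similarity_rows(a, b, eps=1e-12):
    """Row-wise cosine similarity between two (n_services, dim) arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dot / np.maximum(norms, eps)  # eps guards against zero vectors

def flag_under_represented(services, load_emb, event_emb, threshold=0.5):
    """Return (service, similarity) pairs whose similarity is below threshold.

    The threshold is an assumed, illustrative value.
    """
    sims = cosine_similarity_rows(load_emb, event_emb)
    return [(s, float(sim)) for s, sim in zip(services, sims) if sim < threshold]

# Hypothetical services and embeddings, for illustration only.
services = ["playback", "auth", "ads"]
load_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
event_emb = np.array([[1.0, 0.1], [0.0, 2.0], [1.0, -1.0]])

flagged = flag_under_represented(services, load_emb, event_emb)
```

Because cosine similarity ignores vector magnitude, a service whose embedding merely scales between the load test and the event (like `auth` here) is not flagged; only a change in direction (like `ads`) lowers the score.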

Merits

Novelty in Application

Applies advanced graph embedding techniques (GCN-GAE) to a critical, real-world problem of anomaly detection in large-scale microservice architectures, particularly for live streaming events, where traditional methods fall short.

Real-World Validation and Utility

Developed and validated within Prime Video, a high-stakes environment, demonstrating its practical utility in identifying incident-related services and offering early detection capabilities for documented issues.

Unsupervised Learning Approach

The unsupervised nature of the node-level graph embeddings makes the system adaptable to evolving microservice landscapes without requiring extensive labeled anomaly data, which is often scarce.
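
To make the "no labels needed" point concrete, here is a sketch of a standard graph autoencoder forward pass with a two-layer GCN encoder and an inner-product decoder. It follows the generic GAE recipe, not the paper's architecture: the abstract does not say how edge direction or weights enter the model, so this example assumes a small symmetric, unweighted graph, identity node features, and random untrained weights.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric GCN normalization: D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])   # add self-loops
    deg = a_hat.sum(axis=1)              # >= 1 thanks to self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gae_forward(adj, features, w0, w1):
    """Two-layer GCN encoder followed by an inner-product decoder."""
    a_norm = normalize_adjacency(adj)
    hidden = np.maximum(a_norm @ features @ w0, 0.0)  # ReLU
    z = a_norm @ hidden @ w1                          # node embeddings
    logits = z @ z.T                                  # edge scores
    return z, logits

def reconstruction_loss(adj, logits):
    """Numerically stable binary cross-entropy between adj and sigmoid(logits)."""
    loss = np.maximum(logits, 0) - logits * adj + np.log1p(np.exp(-np.abs(logits)))
    return float(loss.mean())

# Toy 3-node graph (assumed symmetric and unweighted for this sketch).
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
features = np.eye(3)
w0 = rng.normal(scale=0.1, size=(3, 4))  # untrained, illustrative weights
w1 = rng.normal(scale=0.1, size=(4, 2))

z, logits = gae_forward(adj, features, w0, w1)
loss = reconstruction_loss(adj, logits)
```

The training signal is the graph itself: the loss measures how well the embeddings reconstruct observed edges, so no anomaly labels are required. Actual training would minimize this loss with an autodiff framework, which is omitted here.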

Synthetic Anomaly Framework

Introduction of a synthetic anomaly injection framework is a significant methodological contribution, enabling controlled evaluation and iterative improvement in a domain where real anomaly data is sensitive and difficult to acquire.
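
The abstract does not describe the injection mechanism, so the following is only one plausible shape such a framework could take: perturb the edge weights around a chosen service and record which nodes count as ground-truth anomalies. The function name, the `scale` factor, and the `propagate_hops` parameter are all hypothetical; `propagate_hops=0` stands in for a conservative labeling choice in which only the injected node is labeled.

```python
import numpy as np

def inject_anomaly(adj, node, scale=5.0, propagate_hops=0):
    """Return a perturbed copy of adj and the set of nodes labeled anomalous.

    adj[i, j] is the weight of the directed edge i -> j.
    All parameters are illustrative assumptions, not the paper's method.
    """
    perturbed = adj.copy()
    perturbed[node, :] *= scale      # amplify the node's outgoing call volume
    labeled = {node}
    frontier = {node}
    for _ in range(propagate_hops):  # optionally label downstream callees
        nxt = set()
        for u in frontier:
            nxt.update(int(v) for v in np.nonzero(adj[u])[0])
        frontier = nxt - labeled
        labeled |= frontier
    return perturbed, labeled

# Toy directed graph: 0 -> 1 (weight 10), 1 -> 2 (weight 4).
adj = np.array([[0.0, 10.0, 0.0],
                [0.0, 0.0, 4.0],
                [0.0, 0.0, 0.0]])

perturbed, labels_strict = inject_anomaly(adj, node=0)
_, labels_one_hop = inject_anomaly(adj, node=0, propagate_hops=1)
```

How far labels propagate directly shapes the measured recall: a detector that flags an affected neighbor gets no credit under strict labeling, and a wider label set raises the number of anomalies it must find.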

Granular Resolution

Operating at minute-level resolution for graph construction allows for timely detection of transient anomalies, crucial for live event monitoring.
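
As a rough illustration of minute-level graph construction, the sketch below buckets call records into one directed, weighted edge map per minute. The record format and the choice of call counts as edge weights are assumptions; the abstract does not specify what the weights represent.

```python
from collections import defaultdict
from datetime import datetime

def build_minute_graphs(call_records):
    """Group (timestamp, caller, callee, count) records into per-minute digraphs.

    Returns {minute: {(caller, callee): total_count}}. Edge direction is
    preserved, so (a, b) and (b, a) are distinct edges.
    """
    graphs = defaultdict(lambda: defaultdict(float))
    for ts, caller, callee, count in call_records:
        minute = ts.replace(second=0, microsecond=0)  # truncate to the minute
        graphs[minute][(caller, callee)] += count
    return {minute: dict(edges) for minute, edges in graphs.items()}

# Hypothetical call records, for illustration only.
records = [
    (datetime(2024, 1, 1, 20, 0, 5), "gateway", "playback", 120),
    (datetime(2024, 1, 1, 20, 0, 40), "gateway", "playback", 80),
    (datetime(2024, 1, 1, 20, 0, 41), "playback", "gateway", 3),
    (datetime(2024, 1, 1, 20, 1, 2), "gateway", "playback", 150),
]

graphs = build_minute_graphs(records)
```

Each minute yields its own graph snapshot, which is what would let a per-minute embedding comparison surface short-lived deviations.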

Demerits

Limited Recall

The reported recall of 58% is a significant limitation: about 42% of the synthetic anomalies in the controlled evaluation went undetected 'under conservative propagation assumptions'.
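
For readers less familiar with how these three metrics relate, the sketch below computes them from confusion-matrix counts. The counts are made up purely to show that high precision, a very low false positive rate, and modest recall can coexist; they are not the paper's data.

```python
def detection_metrics(tp, fp, fn, tn):
    """Return (precision, recall, false_positive_rate) from confusion counts."""
    precision = tp / (tp + fp)  # of the flagged services, how many were anomalous
    recall = tp / (tp + fn)     # of the anomalous services, how many were flagged
    fpr = fp / (fp + tn)        # of the normal services, how many were flagged
    return precision, recall, fpr

# Hypothetical counts, chosen only for illustration.
precision, recall, fpr = detection_metrics(tp=58, fp=2, fn=42, tn=2998)
```

With these toy counts, recall is 58/100 while precision is 58/60: the detector rarely raises a false alarm but still misses 42 of the 100 anomalies. Because normal cases vastly outnumber anomalous ones, even a tiny false positive rate does not guarantee high precision in general, so all three figures are worth reporting.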

Dependency on Graph Structure

Effectiveness depends heavily on the quality and representativeness of the constructed service graphs. Changes in service dependencies or architecture might require retraining or adaptation.

Interpretability Challenges

Like many deep learning methods, interpreting *why* a particular service is deemed 'under-represented' or anomalous based on graph embeddings can be challenging, hindering root cause analysis.

Generalizability of Synthetic Framework

While valuable, the 'preliminary' synthetic anomaly injection framework's ability to fully capture the complexity and diversity of real-world anomalies across different microservice ecosystems remains to be thoroughly demonstrated.

Scalability Considerations

Computing graph embeddings at minute-level resolution for potentially thousands of microservices in real time poses significant computational and storage challenges that the abstract does not detail.

Expert Commentary

This work by Prime Video represents a significant step forward in the notoriously challenging field of anomaly detection within large-scale microservice architectures. The shift from traditional load testing, which often fails to capture the 'long tail' of real-world event dynamics, to a graph-embedding approach is both theoretically sound and practically astute. The use of GCN-GAE to learn structural representations is particularly elegant, as it inherently models inter-service dependencies, moving beyond isolated service metrics. The reported high precision and low false positive rate are commendable, crucial for preventing alert fatigue in operational teams. However, the limited recall (58%) is a critical area for further research. While 'conservative propagation assumptions' might explain this, enhancing the system's ability to detect a broader spectrum of anomalies without sacrificing precision will be key to its widespread adoption. The synthetic anomaly injection framework is a methodological gem, addressing a fundamental bottleneck in AIOps: the scarcity of labeled anomaly data. This framework alone offers immense value for iterative model improvement and robust evaluation. This paper lays a robust foundation, demonstrating the power of graph-based methods for proactive incident management in complex, distributed systems.

Recommendations

  • Prioritize research and development to improve the recall rate, potentially through ensemble methods, dynamic weighting of graph features, or more sophisticated anomaly scoring techniques that balance precision and recall.
  • Investigate methods for enhancing the interpretability of anomalies detected by the GCN-GAE model, providing actionable insights for engineers beyond simply flagging an 'under-represented service'.
  • Explore the generalizability of the synthetic anomaly injection framework by applying it to diverse microservice architectures and anomaly types, comparing synthetic anomalies against a wider range of real-world incidents.
  • Publish a more detailed technical paper or open-source components of the graph construction and embedding pipeline to foster broader academic and industry collaboration and accelerate adoption.

Sources

Original: arXiv - cs.LG