Modern industrial systems are often large-scale distributed systems composed of dozens to thousands of services, which makes anomaly detection and localization difficult. KPIs (Key Performance Indicators) record the states of individual services as time series and thereby reflect the status of the system. However, because of the dynamic and complex periodic patterns embedded in KPIs, pinpointing anomalous behavior in these multivariate time series quickly and accurately is challenging. Current state-of-the-art deep-learning-based anomaly detection methods model global inter-KPI dependencies, which limits their ability to detect subtle local anomalies and leads to poor interpretability. In practice, interpreting anomalies can accelerate problem localization and subsequent troubleshooting. In this study, we propose TS-InvarNet, an interpretable end-to-end anomaly detection and diagnosis framework based on tempo-spatial KPI invariants. Extensive empirical studies on three real-world industrial datasets and a widely used open-source system demonstrate that TS-InvarNet outperforms state-of-the-art baseline methods in both detection and diagnosis performance. Specifically, TS-InvarNet increases F1-scores by up to 27% compared to the baselines.
The evolution of an invariant network when a failure occurs.