2024

Mint: Cost-Efficient Tracing with All Requests Collection via Commonality and Variability Analysis

Haiyu Huang, Cheng Chen, Kunyi Chen, Pengfei Chen, Guangba Yu, Zilong He, Yilun Wang, Huxing Zhang, Qi Zhou

ASPLOS'25 (CCF A) In 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems.

Distributed traces contain valuable information but are often massive in volume, posing a core challenge in tracing framework design: balancing the tradeoff between preserving essential trace information and reducing trace volume. To address this tradeoff, previous approaches typically used a '1 or 0' sampling strategy: retaining sampled traces while completely discarding unsampled ones. However, based on an empirical study of real-world production traces, we discover that the '1 or 0' strategy actually fails to effectively balance this tradeoff. To achieve a more balanced outcome, we shift the strategy from the '1 or 0' paradigm to the 'commonality + variability' paradigm. The core of the 'commonality + variability' paradigm is to first parse traces into common patterns and variable parameters, then aggregate the patterns and filter the parameters. We propose a cost-efficient tracing framework, Mint, which implements the 'commonality + variability' paradigm on the agent side to enable capturing all requests. Our experiments show that Mint can capture all traces and retain more trace information while optimizing trace storage (reduced to an average of 2.7%) and network overhead (reduced to an average of 4.2%). Moreover, experiments also demonstrate that Mint is lightweight enough for production use.
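
For intuition, the 'commonality + variability' split can be pictured with a tiny, hypothetical sketch (the span strings, placeholder token, and regex are illustrative, not Mint's actual parser): spans are parsed into a common pattern plus variable parameters, patterns are aggregated with counts, and parameters are kept aside so they can be filtered.

    import re
    from collections import Counter

    # Hypothetical span annotations; Mint parses real trace data on the agent side.
    spans = [
        "GET /api/user/1024 took 12ms",
        "GET /api/user/7 took 9ms",
        "GET /api/order/55 took 31ms",
    ]

    def parse(span):
        # Replace variable tokens (numbers here) with a placeholder to get the pattern.
        return re.sub(r"\d+", "<*>", span), re.findall(r"\d+", span)

    patterns, parameters = Counter(), []
    for s in spans:
        p, v = parse(s)
        patterns[p] += 1        # aggregate the commonality
        parameters.append(v)    # keep the variability for later filtering

    print(patterns.most_common(1))  # [('GET /api/user/<*> took <*>ms', 2)]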

FaaSConf: QoS-aware Hybrid Resources Configuration for Serverless Workflows

Yilun Wang, Pengfei Chen, Hui Dou, Yiwen Zhang, Guangba Yu, Zilong He, Haiyu Huang

ASE'24 (CCF A) In 39th IEEE/ACM International Conference on Automated Software Engineering.

The workflow composition of multiple short-lived functions has emerged as a prominent pattern in Function-as-a-Service (FaaS), exposing a considerable resource configuration challenge compared to individual independent serverless functions. This challenge unfolds in two ways. Firstly, serverless workflows frequently encounter dynamic and concurrent user workloads, increasing the risk of QoS violations. Secondly, the performance of a function can be affected by the resource re-provision of other functions within the workflow. Moreover, with the growing popularity of concurrent processing within a single instance, the concurrency limit becomes a critical configuration parameter that restricts the request capacity of each instance. In this study, we present FaaSConf, a QoS-aware hybrid resource configuration approach that uses multi-agent reinforcement learning (MARL) to configure hybrid resources, including hardware resources and concurrency, thereby ensuring end-to-end QoS while minimizing resource costs. To enhance decision-making, we employ an attention technique in MARL to capture the complex performance dependencies between functions. We further propose a safe exploration strategy to mitigate QoS violations, resulting in safer and more efficient configuration exploration. The experimental results demonstrate that FaaSConf significantly outperforms state-of-the-art approaches. On average, it achieves a 26.5% cost reduction while exhibiting robustness to dynamic load changes.

FaaSRCA: Full Lifecycle Root Cause Analysis for Serverless Applications

Jin Huang, Pengfei Chen, Guangba Yu, Yilun Wang, Haiyu Huang, Zilong He

ISSRE'24 (CCF B) In 35th IEEE International Symposium on Software Reliability Engineering.

Serverless computing has become popular as a novel paradigm for cloud-native services. However, the complexity and dynamic nature of serverless applications present significant challenges to ensuring system availability and performance. There are many root cause analysis (RCA) methods for microservice systems, but they are not suitable for precisely modeling serverless applications. This is because: (1) Compared to microservices, serverless applications exhibit a highly dynamic nature. They have short lifecycles and only generate instantaneous, pulse-like data, lacking long-term continuous information. (2) Existing methods solely focus on analyzing the running stage and overlook other stages, failing to encompass the entire lifecycle of serverless applications. To address these limitations, we propose FaaSRCA, a full lifecycle root cause analysis method for serverless applications. It integrates multi-modal observability data generated from the platform and application sides using a Global Call Graph. We train a Graph Attention Network (GAT) based graph auto-encoder to compute reconstruction scores for the nodes in the global call graph. Based on the scores, we determine the root cause at the granularity of the lifecycle stage of serverless functions. We conduct experimental evaluations on two serverless benchmarks, and the results show that FaaSRCA outperforms other baseline methods with a top-k precision improvement ranging from 21.25% to 81.63%.
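
To make the reconstruction-score idea concrete, here is a minimal, hypothetical sketch (plain NumPy; the node names, features, and reconstructed values are made up, and FaaSRCA's GAT-based auto-encoder is far richer): nodes of the global call graph that the trained auto-encoder reconstructs poorly are ranked as root cause candidates at the lifecycle-stage level.

    import numpy as np

    # One row per (function, lifecycle stage) node; x_hat stands in for the
    # output of a trained graph auto-encoder.
    nodes = ["fnA/cold-start", "fnA/running", "fnB/running", "fnB/shutdown"]
    x = np.array([[0.2, 0.9], [0.4, 0.1], [0.9, 0.8], [0.1, 0.2]])
    x_hat = np.array([[0.21, 0.88], [0.39, 0.12], [0.30, 0.20], [0.12, 0.19]])

    # Per-node reconstruction error: poorly reconstructed nodes are suspicious.
    scores = np.linalg.norm(x - x_hat, axis=1)
    ranking = sorted(zip(nodes, scores), key=lambda t: -t[1])
    print(ranking[0])  # ('fnB/running', ...) -> top root cause candidate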

A Survey on Failure Analysis and Fault Injection in AI Systems

Guangba Yu, Gou Tan, Haojia Huang, Zhenyu Zhang, Pengfei Chen, Roberto Natella, Zibin Zheng, Michael R. Lyu

Under review.

The rapid advancement of Artificial Intelligence (AI) has led to its integration into various areas, especially with Large Language Models (LLMs) significantly enhancing capabilities in Artificial Intelligence Generated Content (AIGC). However, the complexity of AI systems has also exposed their vulnerabilities, necessitating robust methods for failure analysis (FA) and fault injection (FI) to ensure resilience and reliability. Despite the importance of these techniques, a comprehensive review of FA and FI methodologies in AI systems is still lacking. This study fills this gap by presenting a detailed survey of existing FA and FI approaches across six layers of AI systems. We systematically analyze 160 papers and repositories to answer three research questions: (1) what are the prevalent failures in AI systems, (2) what types of faults can current FI tools simulate, and (3) what gaps exist between the simulated faults and real-world failures. Our findings reveal a taxonomy of AI system failures, assess the capabilities of existing FI tools, and highlight discrepancies between real-world and simulated failures. Moreover, this survey contributes to the field by providing a framework for fault diagnosis, evaluating the state-of-the-art in FI, and identifying areas for improvement in FI techniques to enhance the resilience of AI systems.

CTuner: Automatic NoSQL Database Tuning with Causal Reinforcement Learning

Genting Mai, Zilong He, Guangba Yu, Zhiming Chen, Pengfei Chen

Internetware'24 (CCF C) In 15th Asia-Pacific Symposium on Internetware.

The rapid development of information technology has necessitated the management of large volumes of data in modern society, leading to the emergence of NoSQL databases (e.g., MongoDB). To meet the huge demand for efficient data management and querying, optimizing the performance of these databases has become crucial. Currently, some reinforcement learning-based methods have been used to improve the efficiency of databases by tuning customizable database configurations. However, these methods have two limitations: susceptibility to the cold-start effect and low efficiency in configuration search. To address these issues, we propose a novel and effective approach named CTuner for the online performance tuning of NoSQL databases. CTuner avoids cold start through Bayesian optimization-based learning, and improves the exploitation strategy of the TD3 model with causal inference. Practical implementation and experimental evaluations on three prominent NoSQL databases show that CTuner can find a better configuration at the same time cost than state-of-the-art approaches, with up to a 27.4% improvement in throughput and up to a 13.2% reduction in 95th-percentile latency. Moreover, we introduce meta-learning to enhance the adaptability of CTuner and confirm that it is able to reliably improve performance under new environments and workloads.

MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications

Hongyang Chen, Pengfei Chen, Guangba Yu, Xiaoyun Li, Zilong He, Huxing Zhang

TDSC (CCF A) In IEEE Transactions on Dependable and Secure Computing

Microservice is a widely-adopted architecture for constructing cloud-native applications. To test application resiliency, chaos engineering is widely used to inject faults proactively into applications. However, the search space formed by possible injection locations is huge due to the scale and complexity of the application. Although some methods have been proposed to explore the injection space effectively, they cannot prioritize high-impact injection solutions. Additionally, the blast radius of faults injected by existing methods is typically full of uncertainty, causing faults in multiple application functions. Although some tools are designed to conduct request-level injection, they require instrumentation of application code. To tackle these problems, this paper presents MicroFI, a non-intrusive fault injection framework, aiming to efficiently test different application functions with request-level injection. Request-level injection limits the blast radius to specified requests without any source code modification. Additionally, MicroFI leverages historical injection results and parallel techniques to accelerate the search. Moreover, an enhanced PageRank is used to measure the impact of faults and prioritize high-impact faults that fail more functions. Evaluations on three microservice applications show that MicroFI precisely injects faults and reduces up to 91% of redundant faults on average. Additionally, by employing prioritization, MicroFI reduces an average of 47.3% of injection budgets to cover all high-impact faults.
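
As a rough illustration of the prioritization step, the sketch below runs vanilla PageRank over a made-up impact graph in which each edge points from an application function to a fault that fails it, so faults failing more functions accumulate more rank (the graph, labels, and damping value are illustrative; MicroFI's enhanced PageRank differs in its details).

    import numpy as np

    labels = ["fn:search", "fn:checkout", "fault:net-delay", "fault:pod-kill"]
    edges = [(0, 2), (1, 2), (1, 3)]   # net-delay fails two functions

    n = len(labels)
    A = np.zeros((n, n))
    for src, dst in edges:
        A[dst, src] = 1.0
    for j in range(n):                 # make columns stochastic; dangling -> uniform
        s = A[:, j].sum()
        A[:, j] = A[:, j] / s if s else 1.0 / n

    d, r = 0.85, np.full(n, 1.0 / n)
    for _ in range(100):               # power iteration
        r = (1 - d) / n + d * A @ r

    print(sorted(zip(labels, r), key=lambda t: -t[1]))  # high-impact faults rank first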

TraStrainer: Adaptive Sampling for Distributed Traces with System Runtime State

Haiyu Huang, Xiaoyu Zhang, Pengfei Chen, Zilong He, Zhiming Chen, Guangba Yu, Hongyang Chen, Chen Sun

FSE'24 (CCF A) In 32nd ACM International Conference on the Foundations of Software Engineering. 🏆 Distinguished Paper Award

Distributed tracing has been widely adopted in many microservice systems and plays an important role in monitoring and analyzing the system. However, trace data often come in large volumes, incurring substantial computational and storage costs. To reduce the quantity of traces, trace sampling has become a prominent topic of discussion, and several methods have been proposed in prior work. To attain higher-quality sampling outcomes, biased sampling has gained more attention compared to random sampling. Previous biased sampling methods primarily considered the importance of traces based on diversity, aiming to sample more edge-case traces and fewer common-case traces. However, we contend that relying solely on trace diversity for sampling is insufficient; system runtime state is another crucial factor that needs to be considered, especially in cases of system failures. In this study, we introduce TraStrainer, an online sampler that takes into account both system runtime state and trace diversity. TraStrainer employs an interpretable and automated encoding method to represent traces as vectors. Simultaneously, it adaptively determines sampling preferences by analyzing system runtime metrics. When sampling, it combines the results of system-bias and diversity-bias through a dynamic voting mechanism. Experimental results demonstrate that TraStrainer can achieve higher-quality sampling results and significantly improve the performance of downstream root cause analysis (RCA) tasks. It has led to an average increase of 32.63% in Top-1 RCA accuracy compared to four baselines on two datasets.
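
The dynamic voting step can be pictured with a toy sketch (the weighting rule, thresholds, and field names are invented for illustration; TraStrainer's actual mechanism is driven by learned preferences over runtime metrics): when the system looks unhealthy, the system-bias score gets a larger vote, otherwise diversity dominates.

    # Toy combination of system-bias and diversity-bias scores for one trace.
    def should_sample(system_score, diversity_score, system_health, budget=0.1):
        # Weight the system bias more heavily when runtime metrics look unhealthy.
        w = 0.8 if system_health < 0.5 else 0.3
        vote = w * system_score + (1 - w) * diversity_score
        return vote > (1 - budget)

    # A trace matching the current anomaly pattern is kept during bad health.
    print(should_sample(system_score=0.95, diversity_score=0.2, system_health=0.3))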

ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems

Guangba Yu, Pengfei Chen, Zilong He, Qiuyu Yan, Yu Luo, Fangyan Li, Zibin Zheng

FSE'24 (CCF A) In 32nd ACM International Conference on the Foundations of Software Engineering.

In large-scale online service systems, the occurrence of software changes is inevitable and frequent. Despite rigorous pre-deployment testing practices, the presence of defective software changes in the online environment cannot be completely eliminated. Consequently, there is a pressing need for automated techniques that can effectively identify these defective changes. However, the current abnormal change detection (ACD) approaches fall short in accurately pinpointing defective changes, primarily due to their disregard for the propagation of faults. To address the limitations of ACD, we propose a novel concept called root cause change analysis (RCCA) to identify the underlying root causes of change-inducing incidents. In order to apply the RCCA concept to practical scenarios, we have devised an intelligent RCCA framework named ChangeRCA. This framework aims to localize the defective change associated with change-inducing incidents among multiple changes.

2023

Network shortcut in data plane of service mesh with eBPF

Wanqi Yang, Pengfei Chen, Guangba Yu, Haibin Zhang, Huxing Zhang

JNCA (CCF C) In Journal of Network and Computer Applications

In recent years, the adoption of the service mesh as a dedicated infrastructure layer to support cloud-native systems has gained significant popularity. Service meshes involve the incorporation of proxies to handle communication between microservices, thereby speeding up the development and deployment of microservice applications. However, the use of service meshes also increases request latency because they elongate the packet transmission path between services. After investigating the transmission path of packets in Istio, a representative service mesh, we observed that the service mesh dedicates approximately 25% of its time to packet transmission in the Linux kernel network stack. To shorten this process, we propose a non-intrusive solution that enables packets to bypass the kernel network stack through the implementation of socket redirection and tc (traffic control) redirection with eBPF (extended Berkeley Packet Filter). We also conduct comprehensive experiments on the widely-used Istio. The evaluation results show that our approach can significantly reduce request latency by up to 21%. Furthermore, our approach decreases CPU usage by 1.73% and reduces memory consumption by approximately 0.98% when compared to the original service mesh implementation.

Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data

Guangba Yu, Pengfei Chen, Yufeng Li, Hongyang Chen, Xiaoyun Li, Zibin Zheng

FSE'23 (CCF A) In 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging task. To understand and localize root causes of unexpected faults, modern observability tools collect and preserve multi-modal observability data, including metrics, traces, and logs. Since system faults may manifest as anomalies in different data sources, existing RCA approaches that rely on single-modal data are constrained in the granularity and interpretability of root causes. In this study, we present Nezha, an interpretable and fine-grained RCA approach that pinpoints root causes at the code-region and resource-type level through an integrated analysis of multi-modal data. Nezha transforms heterogeneous multi-modal data into a homogeneous event representation and extracts event patterns by constructing and mining event graphs. The core idea of Nezha is to compare event patterns in the fault-free phase with those in the fault-suffering phase to localize root causes in an interpretable way. Practical implementation and experimental evaluations on two microservice applications show that Nezha achieves a high top-1 accuracy (87.5%) on average at the code-region and resource-type level and outperforms state-of-the-art approaches by a large margin. Two ablation studies further confirm the contributions of incorporating multi-modal data.
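
The fault-free vs fault-suffering comparison can be sketched as a frequency diff over event patterns (the event tuples and counts are fabricated; Nezha actually mines event graphs from the multi-modal data): patterns that surge during the fault phase point at the root cause region.

    from collections import Counter

    fault_free = Counter({("recv", "query_db", "reply"): 98,
                          ("recv", "cache_hit", "reply"): 90})
    fault_suffering = Counter({("recv", "query_db", "timeout"): 70,
                               ("recv", "cache_hit", "reply"): 88})

    def pattern_shift(before, after):
        keys = set(before) | set(after)
        # Rank patterns by how much more often they occur in the fault phase.
        return sorted(keys, key=lambda k: after[k] - before[k], reverse=True)

    print(pattern_shift(fault_free, fault_suffering)[0])
    # ('recv', 'query_db', 'timeout') -> implicates the database code region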

DiagConfig: Configuration Diagnosis of Performance Violations in Configurable Software Systems

Zhiming Chen, Pengfei Chen, Peipei Wang, Guangba Yu, Zilong He, Genting Mai

FSE'23 (CCF A) In 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Performance degradation due to misconfiguration that violates SLOs (service-level objectives) is commonplace in software systems. Diagnosing and explaining the root causes of such performance violations in configurable software systems is often challenging due to their increasing complexity. Although there are many tools and techniques for diagnosing performance violations, they provide limited evidence to attribute the causes of observed performance violations to specific configurations, because configuration was not originally considered in those tools. This paper proposes DiagConfig, specifically designed to conduct configuration diagnosis of performance violations. It leverages static code analysis to track configuration option propagation, identifies performance-sensitive options, detects performance violations, and constructs cause-effect chains that help stakeholders better understand the relationship between configuration and performance violations. Through experimental evaluations with eight real-world open-source software systems, we demonstrate that DiagConfig effectively identifies performance-sensitive options and constructs cause-effect chains. Specifically, DiagConfig produces fewer false positives than SafeTune (i.e., 5 vs 77) in the identification of performance-sensitive options, and outperforms Unicorn in the diagnosis of performance violations caused by configuration changes, offering more comprehensive results (recall 0.892 vs 0.289). We also show that DiagConfig can accelerate auto-tuning by compressing the configuration space.

MARS: Fault Localization in Programmable Networking Systems with Low-cost In-Band Network Telemetry

Benran Wang, Hongyang Chen, Pengfei Chen, Zilong He, Guangba Yu

ICPP'23 (CCF B) In 32nd International Conference on Parallel Processing

In this paper, we present MARS, a lightweight system for anomaly detection with dynamic thresholds and automatic root cause localization in programmable networking systems. MARS collects aggregated packet-level telemetry on demand and generates a ranked list of fine-grained fault culprits at multiple levels, including the port level, switch level, and flow level. Experimental evaluations show the cost-effectiveness of MARS, both in terms of network bandwidth and switch memory usage. Moreover, MARS achieves a 0.97 F1 score in anomaly detection, a 0.95 Recall at Top-2, and an overall 0.3 Exam Score in root cause localization.

DeepPower: Deep Reinforcement Learning based Power Management for Latency Critical Applications in Multi-core Systems

Jingrun Zhang, Guangba Yu, Zilong He, Liang Ai, Pengfei Chen

ICPP'23 (CCF B) In 32nd International Conference on Parallel Processing

Latency-critical (LC) applications are widely deployed in modern datacenters. Effective power management for LC applications can yield significant cost savings. However, it poses a significant challenge in maintaining the desired Service Level Agreement (SLA) levels. Prior research has mainly emphasized predicting the service time of requests and utilizing heuristic algorithms for CPU frequency adjustment. Unfortunately, the control granularity is limited to the request level and manual feature selection is needed. This paper proposes DeepPower, a deep reinforcement learning (DRL) based power management solution for LC applications. DeepPower comprises two key components: a DRL agent for monitoring system load changes and a thread controller for CPU frequency adjustment. Considering the high overhead of the neural network and the short service time of requests, it is infeasible to employ DRL to adjust the CPU frequency directly at the request level. Instead, DeepPower proposes a hierarchical control mechanism: the DRL agent adjusts the parameters of the thread controller at longer intervals, and the thread controller adjusts the CPU frequency at shorter intervals. This control mechanism enables DeepPower to adapt to dynamic workloads and achieve fine-grained frequency adjustments. We evaluate DeepPower with some common LC applications under dynamic workloads. The experimental results show that DeepPower saves up to 28.4% power compared with state-of-the-art methods and reduces the percentage of request timeouts.
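
The two-level loop can be pictured with a toy simulation (the frequency steps, target-load actions, and intervals are all invented; the real DRL agent learns its actions rather than sampling them): a slow outer loop stands in for the agent retuning the controller's target, while a fast inner loop steps the CPU frequency toward it.

    import random

    FREQS = [1.2, 1.6, 2.0, 2.4, 2.8]  # GHz steps exposed by the governor

    def inner_step(freq_idx, load, target_load):
        # Thread controller: raise frequency when load exceeds the target, else lower.
        if load > target_load and freq_idx < len(FREQS) - 1:
            return freq_idx + 1
        if load < target_load and freq_idx > 0:
            return freq_idx - 1
        return freq_idx

    freq_idx = 2
    for epoch in range(3):                            # slow loop: agent acts here
        target_load = random.choice([0.6, 0.7, 0.8])  # stand-in for a DRL action
        for _ in range(10):                           # fast loop: frequency steps
            load = random.random()                    # stand-in for observed load
            freq_idx = inner_step(freq_idx, load, target_load)
        print(f"epoch {epoch}: target={target_load}, freq={FREQS[freq_idx]}GHz")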

LogReducer: Identify and Reduce Log Hotspots in Kernel on the Fly

Guangba Yu, Pengfei Chen, Pairui Li, Tianjun Weng, Haibing Zheng, Yuetang Deng, Zibin Zheng

ICSE'23 (CCF A) In 45th IEEE/ACM International Conference on Software Engineering

Modern systems generate a massive amount of logs to detect and diagnose system faults, which incurs expensive storage cost and runtime overhead. After investigating real-world production logs, we observe that most of the logging overhead is due to a small number of log templates, referred to as log hotspots. Therefore, we conduct a systematic study of log hotspots in WeChat, an industrial system, which motivates us to identify log hotspots and reduce them on the fly. In this paper, we propose LogReducer, a non-intrusive and language-independent log reduction framework based on eBPF (Extended Berkeley Packet Filter), consisting of both online and offline processes. After two months of serving the offline process of LogReducer in WeChat, the log storage overhead has dropped from 19.7 PB per day to 12.0 PB (i.e., about a 39.08% decrease). Practical implementation and experimental evaluations in the test environment demonstrate that the online process of LogReducer can control the logging overhead of hotspots while preserving logging effectiveness. Moreover, with the help of LogReducer, the log hotspot handling time can be reduced from an average of 9 days in production to 10 minutes in the test environment.
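
The hotspot observation is easy to reproduce in miniature (the log lines, template regex, and 50% cutoff are illustrative; LogReducer does its filtering in the kernel with eBPF): templating the logs and counting reveals the few templates that dominate volume.

    import re
    from collections import Counter

    logs = [
        "conn 42 retry in 5ms", "conn 43 retry in 5ms", "conn 44 retry in 7ms",
        "disk sda1 degraded",
    ]

    # Collapse variable tokens into a placeholder to recover log templates.
    templates = Counter(re.sub(r"\d+", "<*>", line) for line in logs)
    total = sum(templates.values())
    hotspots = [t for t, c in templates.items() if c / total > 0.5]
    print(hotspots)  # ['conn <*> retry in <*>ms'] -> candidate for throttling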

FaaSDeliver: Cost-efficient and QoS-aware Function Delivery in Computing Continuum

Guangba Yu, Pengfei Chen, Zibin Zheng, Jingrun Zhang, Xiaoyun Li, Zilong He

TSC (CCF A) In IEEE Transactions on Services Computing

Serverless Function-as-a-Service (FaaS) is a rapidly growing computing paradigm in the cloud era. To provide rapid service response and save network bandwidth, traditional cloud-based FaaS platforms have been extended to the edge. However, launching functions in a heterogeneous computing continuum (HCC) that includes the cloud, fog, and edge brings new challenges: determining where functions should be delivered and how many resources should be allocated. To optimize the cost of running functions in the HCC, we propose an adaptive and efficient function delivery engine, named FaaSDeliver, which automatically unearths a cost-efficient function delivery policy (FDP) for each function, including the FaaS platform selection and resource allocation. Real system implementation and evaluations in a practical HCC demonstrate that FaaSDeliver can unearth the most cost-efficient FDPs from among 180,200 FDPs after a few trials. FaaSDeliver reduces the average cost of function execution by 38% to 78% compared to state-of-the-art approaches.
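
At its simplest, choosing an FDP is a constrained cost minimization; the toy enumeration below (platforms, latencies, and costs are all made up, and FaaSDeliver searches the huge policy space far more efficiently than brute force) keeps only candidates meeting the QoS target and picks the cheapest.

    candidates = [
        {"platform": "edge",  "mem_mb": 256, "p95_ms": 45, "cost": 1.8},
        {"platform": "fog",   "mem_mb": 512, "p95_ms": 60, "cost": 1.1},
        {"platform": "cloud", "mem_mb": 512, "p95_ms": 95, "cost": 0.7},
    ]

    qos_ms = 80  # end-to-end latency target for this function
    feasible = [c for c in candidates if c["p95_ms"] <= qos_ms]
    best = min(feasible, key=lambda c: c["cost"])
    print(best)  # {'platform': 'fog', ...} -> cheapest policy meeting QoS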

2022

MicroSketch: Lightweight and Adaptive Sketch based Performance Issue Detection and Localization in Microservice Systems

Xiaoyun Li, Guangba Yu, Pengfei Chen, Hongyang Chen, Zhekang Chen

ICSOC'22 (CCF B) In 20th International Conference on Service-Oriented Computing

With the rapid growth of microservice systems in cloud-native environments, end-to-end traces have become essential data to help diagnose performance issues. However, existing trace-based anomaly detection and root cause analysis (RCA) still suffer from practical issues due to either the massive volume or frequent system changes. In this study, we propose a lightweight and adaptive trace-based anomaly detection and RCA approach, named MicroSketch, which leverages Sketch-based features and Robust Random Cut Forest (RRCForest) to render trace analysis more effective and efficient. In addition, MicroSketch is an unsupervised approach that is able to adapt to changes in microservice systems without any human intervention. We evaluated MicroSketch on a widely-used open-source system and a production system. The results demonstrate the efficiency and effectiveness of MicroSketch. MicroSketch significantly outperforms state-of-the-art approaches, with an average of 40.9% improvement in F1 score on anomaly detection and 25.0% improvement in Recall of Top-1 on RCA. In particular, MicroSketch is at least 60x faster than other methods in terms of diagnosis time.
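
A rough sketch of the feature side (the quantile picks and synthetic latencies are illustrative, and the paper's sketch structure and RRCForest detector are not shown): per-service span latencies are summarized into a compact quantile vector, whose tail components shift sharply when a performance issue inflates latencies.

    import random

    def latency_features(latencies_ms):
        xs = sorted(latencies_ms)
        q = lambda p: xs[min(int(p * len(xs)), len(xs) - 1)]
        return [q(0.50), q(0.90), q(0.99)]  # p50/p90/p99 feature vector

    random.seed(7)
    normal = [random.gauss(20, 3) for _ in range(1000)]
    slow = normal + [200.0] * 50            # tail inflation during an issue
    print(latency_features(normal))
    print(latency_features(slow))           # p99 jumps -> easy for a detector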

Going through the Life Cycle of Faults in Clouds: Guidelines on Fault Handling

Xiaoyun Li, Guangba Yu, Pengfei Chen, Hongyang Chen, Zhekang Chen

ISSRE'22 (CCF B) In 33rd IEEE International Symposium on Software Reliability Engineering

Faults are the primary culprits in breaking the high availability of cloud systems, even leading to costly outages. As the scale and complexity of clouds increase, it becomes extraordinarily difficult to understand, detect, and diagnose faults. During outages, engineers record detailed information about the whole life cycle of faults (i.e., fault occurrence, fault detection, fault identification, and fault mitigation) in the form of post-mortems. In this paper, we conduct a quantitative and qualitative study on 354 public post-mortems collected from three popular large-scale clouds, 97.7% of which span 2015 to 2021. By reviewing and analyzing these post-mortems, we go through the life cycle of faults in clouds and obtain 10 major findings. Based on these findings, we further derive a series of actionable guidelines for better fault handling.

Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems

Zilong He, Pengfei Chen, Yu Luo, Qiuyu Yan, Hongyang Chen, Guangba Yu, Fangyuan Li

ASE'22 (CCF A) In 37th IEEE/ACM International Conference on Automated Software Engineering

With the ever-increasing scale and complexity of online systems, incidents are gradually becoming commonplace. Without appropriate handling, they can seriously harm system availability. However, in large-scale online systems, these incidents are usually drowned in a slew of issues (i.e., something abnormal, while not necessarily an incident), rendering them difficult to handle. Typically, these issues will result in a cascading effect across the system, and proper management of the incidents depends heavily on a thorough analysis of this effect. Therefore, in this paper, we propose a method to automatically analyze the cascading effect of availability issues in online systems and extract the corresponding graph-based issue representations, incorporating both the issue symptoms and affected service attributes. With the extracted representations, we train and utilize a graph neural network based model to perform incident detection. Then, for the detected incident, we leverage the PageRank algorithm with a flexible transition matrix design to locate its root cause. We evaluate our approach using real-world data collected from the WeChat® online service system, the largest instant messaging system in China. The results confirm the effectiveness of our approach. Moreover, our approach has been successfully deployed in the company and eases the burden on operators in the face of a flood of issues and related alert signals.

TS-InvarNet: Anomaly Detection and Localization based on Tempo-spatial KPI Invariants in Distributed Services

Zijun Hu, Pengfei Chen, Guangba Yu, Zilong He, Xiaoyun Li

ICWS'22 (CCF B) In Proceedings of 2022 IEEE International Conference on Web Services

Modern industrial systems are often large-scale distributed systems composed of dozens to thousands of services, which makes anomaly detection and localization difficult. KPIs (Key Performance Indicators) record the states of different services as time series that reflect the status of the system. However, due to the dynamic and complex periodic patterns embedded in KPIs, pinpointing anomalous behavior in these multivariate time series quickly and accurately is a challenging problem. Current state-of-the-art deep-learning-based anomaly detection methods model global inter-KPI dependency, which limits their ability to detect subtle local anomalies and hurts interpretability. In practice, interpreting anomalies can accelerate problem localization and further troubleshooting. In this study, we propose TS-InvarNet, an interpretable end-to-end anomaly detection and diagnosis framework based on tempo-spatial KPI invariants. Extensive empirical studies on three real-world industrial datasets and a widely-used open-source system demonstrate that TS-InvarNet outperforms state-of-the-art baseline methods in detection and diagnosis performance. Specifically, TS-InvarNet increases F1-scores by up to 27% compared to the baselines.
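
As a rough illustration of the invariant idea (an assumed, simplified form; the paper's tempo-spatial invariants are more elaborate), the sketch below fits pairwise linear relations between KPIs offline and flags windows where a relation's residual blows up.

```python
# Minimal sketch: pairwise linear KPI invariants with a 3-sigma residual test.
import numpy as np

def fit_invariant(x: np.ndarray, y: np.ndarray):
    """Least-squares fit y ~ a*x + b plus a residual threshold."""
    a, b = np.polyfit(x, y, deg=1)
    resid = y - (a * x + b)
    return a, b, 3 * resid.std()  # 3-sigma threshold

def broken(x_win, y_win, a, b, thresh) -> bool:
    """Flag a window whose mean residual exceeds the learned threshold."""
    return np.abs(y_win - (a * x_win + b)).mean() > thresh

rng = np.random.default_rng(0)
cpu = rng.normal(50, 5, 500)
lat = 2 * cpu + 10 + rng.normal(0, 1, 500)       # invariant holds in training
a, b, th = fit_invariant(cpu, lat)
print(broken(cpu[:50], lat[:50], a, b, th))       # False: normal window
print(broken(cpu[:50], lat[:50] + 30, a, b, th))  # True: invariant broken
```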

SwissLog: Robust Anomaly Detection and Localization for Interleaved Unstructured Logs

Xiaoyun Li, Pengfei Chen, Linxiao Jing, Zilong He, Guangba Yu

TDSC (CCF A) In IEEE Transactions on Dependable and Secure Computing

Modern distributed systems generate interleaved logs when running in parallel. Identifiers (IDs) are attached to these logs to trace running instances or entities, so log messages can be grouped by shared IDs to support anomaly detection and localization. Existing approaches still fall short in meeting three challenges: 1) logs are processed within single components without mining cross-log dependencies, 2) log formats continually change in modern software systems, and 3) latent performance issues are hard to detect non-intrusively with trivial monitoring tools. To remedy these shortcomings, we propose SwissLog, a robust anomaly detection and localization tool for interleaved unstructured logs. SwissLog focuses on log sequential anomalies and tries to dig out possible performance issues. SwissLog constructs ID relation graphs across distributed components and groups log messages by IDs. Moreover, we propose an online data-driven log parser that requires no parameter tuning. The grouped log messages are parsed via this novel log parser and transformed with semantic and temporal embedding. Finally, SwissLog utilizes an attention-based Bi-LSTM model and a heuristic searching algorithm to detect and localize anomalies at instance granularity, respectively. Experiments on real-world and synthetic datasets confirm the effectiveness, efficiency, and robustness of SwissLog.
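
The grouping step can be illustrated with a small sketch; the ID regex and log lines below are hypothetical, and SwissLog's real ID relation graph links multiple ID types across components rather than matching a single pattern.

```python
# Minimal sketch: group interleaved log lines by the identifiers they carry.
import re
from collections import defaultdict

ID_PATTERN = re.compile(r"\b(?:req|task)-[0-9a-f]{8}\b")  # assumed ID format

def group_by_id(lines):
    groups = defaultdict(list)
    for line in lines:
        for ident in ID_PATTERN.findall(line):
            groups[ident].append(line)   # a line may join several groups
    return groups

logs = [
    "INFO gateway accepted req-1a2b3c4d",
    "INFO worker started task-9f8e7d6c for req-1a2b3c4d",
    "ERROR worker task-9f8e7d6c timed out",
]
for ident, seq in group_by_id(logs).items():
    print(ident, "->", len(seq), "messages")
```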

2021

TraceRank: Abnormal Service Localization with Dis-Aggregated End-to-End Tracing Data in Cloud Native Systems

Guangba Yu, Zicheng Huang, Pengfei Chen

JSEP (CCF B) In Journal of Software Evolution and Process

Modern cloud native applications are generally built with a microservice architecture. To tackle various performance problems among a large number of services and machines, these systems are typically equipped with an end-to-end tracing tool that tracks the execution path of every single request. However, it is nontrivial to conduct root cause analysis of anomalies over such a large volume of tracing data. This paper proposes a novel system named TraceRank to identify and locate abnormal services causing performance problems with dis-aggregated end-to-end traces. TraceRank mainly includes an anomaly detection module and a root cause analysis module, with the latter triggered when an anomaly is detected. To fully leverage the information provided by the tracing data, both spectrum analysis and a PageRank-based random walk are introduced to pinpoint abnormal services. Experiments on the TrainTicket and Bookinfo microservice benchmarks and a real-world system show that TraceRank locates root causes with 90% precision and 86% recall, up to a 10% improvement over several state-of-the-art approaches in both metrics. Finally, TraceRank has good scalability and a low enough overhead to suit large-scale microservice systems.
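
For intuition, the following sketch applies a standard spectrum formula (Ochiai, used here as an assumed stand-in) to per-service coverage counts from normal and abnormal traces; TraceRank's extended spectrum further weights these counts.

```python
# Minimal sketch: rank services by a spectrum-based suspiciousness score.
from math import sqrt

def ochiai(ef, ep, nf):
    """ef: abnormal traces covering the service, ep: normal traces covering
    it, nf: abnormal traces not covering it."""
    denom = sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

# Toy coverage counts per service: (ef, ep, nf)
coverage = {"checkout": (40, 5, 10), "payment": (45, 2, 5), "cart": (8, 50, 42)}
ranked = sorted(coverage, key=lambda s: ochiai(*coverage[s]), reverse=True)
print(ranked)  # most suspicious service first
```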

Sieve: Attention-based Sampling of End-to-End Trace Data in Distributed Microservice Systems

Zicheng Huang, Pengfei Chen, Guangba Yu, Hongyang Chen, Zibin Zheng

ICWS'21 (CCF B) In Proceedings of 2021 IEEE International Conference on Web Services

End-to-end tracing plays an important role in understanding and monitoring distributed microservice systems. Trace data are valuable for uncovering anomalous or erroneous system behavior. However, the volume of trace data is huge, placing a heavy burden on analysis and storage. To reduce this volume, sampling is widely adopted, but existing uniform sampling approaches fail to capture the uncommon traces that are the most interesting and informative. To tackle this problem, we design and implement Sieve, an online sampler that biases sampling toward uncommon traces by taking advantage of the attention mechanism. Evaluation results on trace datasets collected from real-world and experimental microservice systems show that Sieve effectively increases the sampling probability of structurally and temporally uncommon traces and substantially reduces storage space by keeping the sampling rate low.
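
As a rough sketch of the biasing idea (frequency of a trace's structural signature stands in here for Sieve's learned attention scores), rare structures can be sampled with much higher probability than common ones:

```python
# Minimal sketch: bias online sampling toward rarely-seen trace structures.
import random
from collections import Counter

seen = Counter()

def sample(trace_signature: str, base_rate: float = 0.05) -> bool:
    """Keep rare structures with high probability, common ones near base_rate."""
    seen[trace_signature] += 1
    rate = max(base_rate, 1.0 / seen[trace_signature])
    return random.random() < rate

random.seed(0)
kept = [sig for sig in ["A->B"] * 95 + ["A->B->C!err"] * 5 if sample(sig)]
print(Counter(kept))  # the rare erroneous path is kept disproportionately often
```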

T-Rank:A Lightweight Spectrum based Fault Localization Approach for Microservice Systems

Zihao Ye, Pengfei Chen, Guangba Yu

CCGrid'21 (CCF C, CORE A) In Proceedings of 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing

Cloud-native systems are shifting from a traditional monolithic architecture to a microservice architecture because of the loose coupling, better maintainability and availability, faster deployment, and richer ecosystem it brings. Despite these advantages, the architecture has an inevitable weakness: communication over RPC (Remote Procedure Call) between services makes system performance more unpredictable. Moreover, the complex interactions among services make it hard to reveal the root cause of performance issues. To address this challenge, we propose a lightweight spectrum-based performance diagnosis tool named T-Rank. T-Rank produces a ranked list of suspiciousness scores over microservices to localize root causes with very few resources. We demonstrate the high accuracy and low cost of T-Rank through experiments on data collected from a real-world production microservice system. Moreover, comparison results show that T-Rank outperforms other state-of-the-art approaches.

Kmon: An In-kernel Transparent Monitoring System for Microservice Systems with eBPF

Tianjun Weng, Wanqi Yang, Guangba Yu, Pengfei Chen, Jieqi Cui, Chuangfu Zhang

CloudIntelligence'21 In Proceedings of 2021 IEEE/ACM International Workshop on Cloud Intelligence

Currently, the architecture of software systems is shifting from “monolith” to “microservice”, an important enabling technology of cloud native systems. Because of its advantages in agility, efficiency, and scaling, microservice has become the most popular architecture in industry. However, as microservice complexity and scale increase, it becomes challenging to monitor such a large number of microservices. Traditional monitoring techniques such as end-to-end tracing do not fit the microservice environment well, because they require laborious code instrumentation and cannot explore the fine-grained internal states of microservice instances. To tackle this problem, we propose Kmon, an in-kernel transparent monitoring system for microservice systems built on the extended Berkeley Packet Filter (eBPF). Kmon can provide multiple kinds of run-time information about microservices, such as latency, topology, and performance metrics, with low overhead.
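
In the same spirit, a minimal bcc-based example (an assumed illustration, not the paper's tool) can count tcp_sendmsg calls per process from inside the kernel without touching application code; it requires root privileges and the bcc toolkit.

```python
# Minimal sketch: in-kernel per-PID counting of tcp_sendmsg via bcc/eBPF.
from bcc import BPF
import time

prog = r"""
BPF_HASH(counts, u32, u64);

int kprobe__tcp_sendmsg(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *val;
    val = counts.lookup_or_try_init(&pid, &zero);
    if (val) { (*val)++; }
    return 0;
}
"""

b = BPF(text=prog)   # compiles and attaches the kprobe automatically
time.sleep(5)        # let some traffic accumulate
for pid, count in b["counts"].items():
    print(f"pid={pid.value} tcp_sendmsg calls={count.value}")
```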

MicroRank: End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments

Guangba Yu, Pengfei Chen, Hongyang Chen, Zijie Guan, Zicheng Huang, Linxiao Jing, Tianjun Weng, Xinmeng Sun, Xiaoyun Li

WWW'21 (CCF A) In Proceedings of the Web Conference 2021

With the advantages of flexible scalability and fast delivery, microservice has become a popular software architecture in the modern IT industry. However, the explosion in the number of service instances and their complex dependencies makes troubleshooting extremely challenging in microservice environments. To help understand and troubleshoot a microservice system, end-to-end tracing technology has been widely applied to capture the execution path of each request. Nevertheless, tracing data are not fully leveraged by cloud and application providers when conducting latency issue localization in the microservice environment. This paper proposes a novel system, named MicroRank, which analyzes clues provided by normal and abnormal traces to locate root causes of latency issues. Once a latency issue is detected by the Anomaly Detector in MicroRank, the cause localization procedure is triggered. MicroRank first distinguishes which traces are abnormal. Then, MicroRank’s PageRank Scorer module takes the abnormal and normal trace information as input and differentiates the importance of different traces for the extended spectrum techniques. Finally, the spectrum techniques calculate a ranking list based on the weighted spectrum information from the PageRank Scorer to locate root causes more effectively. Experimental evaluations on a widely-used open-source system and a production system show that MicroRank achieves excellent results not only with a single root cause but also when two issues happen at the same time. Moreover, MicroRank achieves a 6% to 22% improvement in recall when localizing root causes compared to current state-of-the-art methods.
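
A simplified sketch of the weighted-spectrum idea: traces carry importance weights (produced by the PageRank Scorer in the paper; fixed toy numbers here), and per-service spectrum counts are accumulated with those weights before ranking.

```python
# Minimal sketch: spectrum scoring where each trace contributes its weight.
from math import sqrt
from collections import defaultdict

# (services covered, is_abnormal, weight) per trace -- toy data
traces = [
    ({"frontend", "cart"}, False, 1.0),
    ({"frontend", "payment"}, True, 2.5),
    ({"frontend", "payment", "db"}, True, 1.5),
]

ef = defaultdict(float)   # weighted abnormal coverage per service
ep = defaultdict(float)   # weighted normal coverage per service
total_f = 0.0
for services, abnormal, w in traces:
    if abnormal:
        total_f += w
    for s in services:
        (ef if abnormal else ep)[s] += w

def score(s):
    nf = total_f - ef[s]                       # abnormal weight not covering s
    denom = sqrt((ef[s] + nf) * (ef[s] + ep[s]))
    return ef[s] / denom if denom else 0.0

svcs = {s for t in traces for s in t[0]}
print(sorted(svcs, key=score, reverse=True))   # 'payment' ranks first
```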

2020

A Learning-based Dynamic Load Balancing Approach for Microservice Systems in Multi-cloud Environment

Jieqi Cui, Pengfei Chen, Guangba Yu

ICPADS'20 (CCF C, CORE B) In Proceedings of IEEE 26th International Conference on Parallel and Distributed Systems

Multi-cloud environments have become common as companies seek to avoid cloud vendor lock-in for security and cost reasons. Meanwhile, the microservice architecture is often adopted for its flexibility. Combining multi-cloud with microservices raises the problem of routing requests among all possible microservice instances across clouds. This paper presents a learning-based approach that routes requests so as to balance the load. In our approach, microservice performance is modeled explicitly through machine learning models that derive the response time from request volume, route decisions, and other cloud metrics. The balanced route decision is then obtained by optimizing over the model with Bayesian Optimization. With this approach, route decisions adjust to dynamic runtime metrics instead of remaining static across circumstances, and explicit performance modeling avoids time-consuming searches on the actual microservice system. Experiments show that our approach reduces average response time by at least 10%.
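
A minimal sketch of the explicit-modeling idea follows, with assumed stand-ins: a random-forest surrogate and a grid search over a single route weight, where the paper uses richer models and Bayesian Optimization.

```python
# Minimal sketch: learn a performance model, then optimize the route weight
# against the model instead of probing the live system.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Historical observations: fraction of traffic routed to cloud A
# (the rest goes to cloud B), with synthetic response times.
w_hist = rng.uniform(0, 1, size=(200, 1))
rt_hist = 120 - 60 * w_hist[:, 0] + 90 * w_hist[:, 0] ** 2 \
          + rng.normal(0, 2, 200)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(w_hist, rt_hist)

# Optimize on the surrogate model: pick the weight predicted to be fastest.
candidates = np.linspace(0, 1, 101).reshape(-1, 1)
best = candidates[np.argmin(model.predict(candidates))]
print(f"route ~{best[0]:.2f} of traffic to cloud A")  # near the true optimum 1/3
```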

SwissLog: Robust and Unified Deep Learning Based Log Anomaly Detection for Diverse Faults

Xiaoyun Li, Pengfei Chen, Linxiao Jing, Zilong He, Guangba Yu

ISSRE'20 (CCF B) In Proceedings of the 2020 IEEE 31st International Symposium on Software Reliability Engineering

Log-based anomaly detection has been widely studied and achieves satisfying performance on stable log data. However, existing approaches still fall short in meeting two challenges: 1) log formats change continually in software systems under active development and maintenance, and 2) performance issues are latent causes that may not be detected by trivial monitoring tools. We thus propose SwissLog, a robust and unified deep learning based anomaly detection model for detecting diverse faults. SwissLog targets faults that result in log sequence order changes and log time interval changes. To achieve this, an advanced log parser is introduced. Moreover, semantic embedding and time embedding are combined to train a unified attention-based BiLSTM model to detect anomalies. Experiments on real-world and synthetic datasets show that SwissLog is robust to changing log data and effective for diverse faults.
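
A minimal PyTorch sketch of an attention-based BiLSTM classifier over log-template sequences is shown below; the dimensions and the single linear attention layer are assumptions, and SwissLog additionally feeds semantic and time embeddings.

```python
# Minimal sketch: attention-based BiLSTM over sequences of log-template ids.
import torch
import torch.nn as nn

class AttnBiLSTM(nn.Module):
    def __init__(self, vocab: int, emb: int = 32, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, 2)      # normal vs. anomalous

    def forward(self, x):                        # x: (batch, seq) template ids
        h, _ = self.lstm(self.embed(x))          # (batch, seq, 2*hidden)
        a = torch.softmax(self.attn(h), dim=1)   # attention over time steps
        ctx = (a * h).sum(dim=1)                 # weighted context vector
        return self.out(ctx)

model = AttnBiLSTM(vocab=100)
logits = model(torch.randint(0, 100, (4, 20)))   # 4 sequences of 20 templates
print(logits.shape)                              # torch.Size([4, 2])
```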

A Spatiotemporal Deep Learning Approach for Unsupervised Anomaly Detection in Cloud Systems

Zilong He, Pengfei Chen, Xiaoyun Li, Yongfeng Wang, Guangba Yu, Cailin Chen, Xinrui Li, Zibin Zheng

TNNLS (Impact Factor 10.4, CCF B) In IEEE Transactions on Neural Networks and Learning Systems

Anomaly detection is a critical task for maintaining the performance of a cloud system, and data-driven methods have become the mainstream approach in recent years. However, due to the lack of labeled training data in practice, an anomaly detection model must be trained on contaminated data in an unsupervised way. Besides, with the increasing complexity of cloud systems, effectively organizing data collected from a wide range of system components and modeling the spatiotemporal dependence among them becomes a challenge. In this article, we propose TopoMAD, a stochastic seq2seq model which can robustly model spatial and temporal dependence among contaminated data. We include system topological information to organize metrics from different components and apply sliding windows over continuously collected metrics to capture the temporal dependence. We extract spatial features with graph neural networks and temporal features with long short-term memory networks. Moreover, we build our model on the variational auto-encoder, enabling it to work robustly even when trained on contaminated data. Our approach is validated on run-time performance data collected from two representative cloud systems, namely a big data batch processing system and a microservice-based transaction processing system. The experimental results show that TopoMAD outperforms some state-of-the-art methods on these two datasets.

Microscaler: Cost-effective Scaling for Microservice Applications in the Cloud with an Online Learning Approach

Guangba Yu, Pengfei Chen, Zibin Zheng

TCC (Impact Factor 5.9) In IEEE Transactions on Cloud Computing

Recently, the microservice architecture has become a popular choice for constructing cloud native systems due to its agility. In cloud native systems, autoscaling is a key enabling technique for adapting to workload changes by acquiring or releasing the right amount of computing resources. However, autoscaling becomes a challenging problem in microservice applications, since such an application usually comprises a large number of different microservices with complex interactions. When performance decreases due to an unpredictable workload peak, it is difficult to pinpoint the services that need to scale out and to evaluate how many resources they need. In this paper, we present a novel system named Microscaler that automatically identifies the scaling-needed services and scales them to meet the Service Level Agreement (SLA) at an optimal cost for microservice applications. Microscaler first collects quality of service (QoS) metrics in the service mesh enabled microservice infrastructure. Then, it determines under-provisioned or over-provisioned service instances along the service dependency graph with a novel scaling-needed service criterion named service power. The service dependency graph is obtained by correlating request flows in the service mesh. By combining an online learning approach with a step-by-step heuristic approach, Microscaler can precisely reach the optimal service scale that meets the SLA requirements. Experimental evaluations in a microservice benchmark show that Microscaler achieves an average 93% precision in scaling-needed service determination and converges to the optimal service scale faster than several state-of-the-art methods. Moreover, Microscaler is lightweight and flexible enough to work in a large-scale microservice system.
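
As a loose illustration only, the sketch below makes an SLA-driven scaling decision with a hypothetical power-style ratio; it is a stand-in for, not a reproduction of, the paper's service power criterion.

```python
# Minimal sketch: SLA-driven per-service scaling decision with an assumed
# latency-to-SLA ratio as the "power"-style signal.
def scaling_decision(p90_latency_ms: float, sla_ms: float,
                     replicas: int, margin: float = 0.1) -> int:
    """Return the suggested replica count for one service."""
    power = p90_latency_ms / sla_ms   # >1: under-provisioned, <1: headroom
    if power > 1 + margin:
        return replicas + 1           # scale out toward the SLA
    if power < 1 - margin and replicas > 1:
        return replicas - 1           # reclaim cost when safely under SLA
    return replicas

print(scaling_decision(p90_latency_ms=480, sla_ms=400, replicas=3))  # -> 4
print(scaling_decision(p90_latency_ms=220, sla_ms=400, replicas=3))  # -> 2
```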

A Framework of Virtual War Room and Matrix Sketch-Based Streaming Anomaly Detection for Microservice Systems

Hongyang Chen, Pengfei Chen, Guangba Yu

Access In IEEE Access

Recently, microservice has become a popular architecture for constructing cloud-native systems. This architecture brings agility and significantly accelerates the software development process, but microservice systems are not easy to manage and operate due to their scale and complexity. Many approaches, such as anomaly detection, have been proposed to operate microservice systems automatically. Nevertheless, these methods cannot be sufficiently validated and compared owing to a lack of real microservice systems, which slows progress on intelligent operation. These challenges inspire us to build VWR, a framework of Virtual War Room for operating microservice applications, which allows users to simulate their microservice architectures with low overhead and inject multiple types of faults into the microservice system with chaos engineering. VWR can mimic user requests and record end-to-end tracing data (i.e., service call chains) for each request in a way consistent with OpenTracing. With easily designed tests and the produced streaming tracing data, users can validate the performance of their intelligent operation algorithms and improve them as needed. In addition, based on the streaming tracing data generated by VWR, we introduce a novel unsupervised anomaly detection algorithm based on Matrix Sketch and set it as the default intelligent operation algorithm in VWR. This algorithm detects anomalies by analyzing high-dimensional performance data collected from a microservice system in a streaming manner. Experimental results in VWR show that the matrix sketch based method can precisely detect anomalies in microservice systems and outperforms some widely used anomaly detection methods, such as isolation forest, in some scenarios. We believe more approaches to the intelligent operation of microservice systems can be built on VWR.
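
For background, Frequent Directions is a standard matrix-sketch algorithm of the kind such a detector can build on; the sketch below (with a residual-energy anomaly score as an assumed add-on) is illustrative and does not reproduce the paper's exact variant.

```python
# Minimal sketch: Frequent Directions over a metric stream, plus an assumed
# anomaly score based on energy outside the sketched subspace.
import numpy as np

def frequent_directions(rows, ell: int = 8, dim: int = 16) -> np.ndarray:
    """Maintain an ell x dim sketch B summarizing a stream of dim-dim rows."""
    B = np.zeros((ell, dim))
    for row in rows:
        B[-1] = row                      # the last row is kept free for inserts
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        s2 = np.maximum(s ** 2 - s[ell // 2] ** 2, 0.0)  # shrink the spectrum
        B = np.diag(np.sqrt(s2)) @ Vt    # lower rows become zero again
    return B

rng = np.random.default_rng(0)
B = frequent_directions(rng.normal(size=(500, 16)))

x = rng.normal(size=16) + 5              # a shifted, anomalous-looking point
proj = x @ np.linalg.pinv(B) @ B         # projection onto the sketch row space
print(round(float(np.linalg.norm(x - proj)), 2))  # large residual -> anomaly
```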

2019

Microscaler: Automatic Scaling for Microservices with an Online Learning Approach

Guangba Yu, Pengfei Chen, Zibin Zheng

ICWS'19 (CCF B) In Proceedings of the 2019 IEEE International Conference on Web Services

Recently, the microservice architecture has become a popular choice for constructing cloud native systems due to its agility. In cloud native systems, autoscaling is a core enabling technique for adapting to workload changes by scaling out or in. However, autoscaling becomes a challenging problem in a microservice system, since such a system usually comprises a large number of different microservices with complex interactions. When bursty and unpredictable workloads arrive, it is difficult to pinpoint the services that need to scale and to evaluate how many resources they need. In this paper, we present a novel system named Microscaler that automatically identifies the scaling-needed services and scales them to meet the service level agreement (SLA) at an optimal cost for microservice systems. Microscaler collects quality of service (QoS) metrics with the help of the service mesh enabled infrastructure. Then, it determines the under-provisioned or over-provisioned services with a novel criterion named service power. By combining an online learning approach with a step-by-step heuristic approach, Microscaler can reach the optimal service scale satisfying the SLA requirements. Experimental evaluations in a microservice benchmark show that Microscaler converges to the optimal service scale faster than several state-of-the-art methods.
