Publications - Guangba's Home Page

2025

iKnow: an Intent-Guided Chatbot for Cloud Operations with Retrieval-Augmented Generation

Junjie Huang, Yuedong Zhong, Guangba Yu^†, Zihan Jiang, Minzhi Yan, Wenfei Luan, Tianyu Yang, Rui Ren, Michael Lyu

ASE'25 (CCF A) In 40th IEEE/ACM International Conference on Automated Software Engineering

Managing complex cloud services requires standard operational documentation, but its sheer volume often hinders cloud engineers from efficient knowledge acquisition. Retrieval-Augmented Generation (RAG) can streamline this process by retrieving relevant knowledge and generating concise, referenced answers. However, deploying a reliable RAG-based chatbot for cloud operations remains a challenge. In this experience paper, we analyze the development and deployment of RAG-based chatbots for operational question answering (OpsQA) at a large-scale cloud vendor. Through an empirical study of 2,000 real-world queries across three operational teams, we identify five unique OpsQA intent types (e.g., symptom analysis and terminology explanation) and their corresponding requirements for a satisfactory answer, which differ from general software engineering queries. Our analysis further uncovers six root causes leading to chatbot failures: over half stem from query issues (i.e., incompleteness, out-of-scope, or invalid queries), while the others stem from retrieval or generation issues. To address these issues, we propose iKnow, an intent-guided RAG-based chatbot that integrates intent detection, query rewriting tailored to each intent, and missing knowledge detection to enhance answer quality. In internal evaluations, iKnow improves average answer accuracy from 65.8% to 81.3% with only a modest increase in latency. iKnow has been deployed for six months at CloudA, supporting thousands of cloud engineers in daily operations. We discuss lessons learned from real-world deployment, providing valuable insights for future research and practical implementations in similar domains.

AlertGuardian: Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems

Guangba Yu, Genting Mai, Rui Wang, Ruipeng Li, Pengfei Chen^†, Long Pan, Ruijie Xu

ASE'25 (CCF A) In 40th IEEE/ACM International Conference on Automated Software Engineering

Alerts are critical for detecting anomalies in large-scale cloud systems, ensuring reliability and user experience. However, current systems generate overwhelming volumes of alerts, degrading operational efficiency due to ineffective alert life-cycle management. This experience paper details the efforts of Company-X to optimize alert life-cycle management, addressing alert fatigue in cloud systems. We propose AlertGuardian, a framework that combines large language models (LLMs) and lightweight graph models to optimize the alert life-cycle through three phases: Alert Denoise uses a graph learning model with virtual noise to filter noise, Alert Summary employs Retrieval-Augmented Generation (RAG) with LLMs to create actionable alert summaries, and Alert Rule Refinement leverages multi-agent iterative feedback to improve alert rule quality. Evaluated on four real-world datasets from Company-X’s services, AlertGuardian significantly mitigates alert fatigue (94.8% alert reduction ratio) and accelerates fault diagnosis (90.5% diagnosis accuracy). Moreover, AlertGuardian improves 1,174 alert rules, with 375 accepted by SREs (32% acceptance rate). Finally, we share success stories and lessons learned about alert life-cycle management from the deployment of AlertGuardian at Company-X.

NetScope: Fault Localization in Programmable Networking Systems With Low-Cost In-Band Network Telemetry and In-Network Detection

Hongyang Chen, Benran Wang, Guangba Yu, Zilong He, Pengfei Chen^†, Chen Sun, Zibin Zheng

ToN (CCF A) In IEEE Transactions on Networking

Recently, Software Defined Networking (SDN) has gained widespread adoption as a network infrastructure. Although the openness and programmability of SDN facilitate the construction of large, complex networks, diagnosing faults in datacenter-scale networks remains challenging. Previous network diagnosis tools impose significant overhead for fine-grained telemetry and typically lack automated fine-grained fault diagnosis capabilities. Although on-demand monitoring methods have been proposed to reduce telemetry overhead, they struggle to set effective fixed thresholds, which requires expert experience. This paper presents NetScope, a lightweight system for real-time anomaly detection with self-adaptive thresholds and automatic root cause localization in programmable networking systems. NetScope estimates latency medians for each Flow (i.e., a pair of source and sink switches) within the switch using the proposed per-Flow quantile sketch and calculates the threshold accordingly for anomaly detection. Upon detecting anomalies, NetScope collects aggregated packet-level telemetry on demand and generates a ranked list of fine-grained fault culprits at multiple levels, including port-level, Flow-level, and switch-level. Extensive experiments demonstrate the effectiveness and efficiency of NetScope in anomaly detection and fault localization. Specifically, NetScope achieves a 32%~116% relative improvement in anomaly detection and a 6%~197% improvement in root cause analysis compared with other baselines, without consuming any network bandwidth for anomaly detection and while consuming 64.2% less telemetry bandwidth for localization.

Subgraphs as First-Class Citizens in Incident Management for Large-Scale Online Systems: An Evolution-Aware Framework

Zilong He, Pengfei Chen^†, Yu Luo, Qiuyu Yan, Hongyang Chen, Guangba Yu, Fangyuan Lo, Xiaoyun Li, Zibin Zheng

TSE (CCF A) In IEEE Transactions on Software Engineering

With the ever-increasing scale and complexity of modern online systems, incidents are becoming inevitable, seriously decreasing system availability and user satisfaction. To enhance incident management, many machine learning based techniques have been proposed to automate incident detection and diagnosis. However, previous studies have mostly ignored the impact of evolution on the practicality of an incident management framework. Specifically, (1) the scale of modern online systems is continually evolving, but most state-of-the-art techniques are overly dependent on a continuous modelling of the entire system, and thus are less practical for online systems that have evolved to tens of thousands of services; (2) the volume of telemetry data is growing massively, while incident records for learning are scarce and slowly generated (sometimes starting from zero), but prior techniques usually neglect this extreme imbalance in data volume evolution, and cannot support the life-cycle evolution (i.e., cold start and continual learning) of their developed models; (3) prior techniques usually require operators to manually select a set of telemetry as inputs for incident diagnosis, but ignore how to automatically evolve this selection to continually improve diagnosis performance. These gaps stem from the unawareness of evolution, including the evolution of the target online system and the evolution of the built incident management models. To fill these gaps, we propose an evolution-aware incident management framework, GEM. Specifically, considering the evolution of system scale and data volume, GEM continually refines the enormous volume of telemetry data collected in real time into individual compact yet expressive graph-based representations, namely issue impact subgraphs, and treats them as first-class citizens in incident management. Centered around these subgraphs, we design a couple of lifelong learning based graph analysis techniques to learn and evolve models for incident detection and diagnosis.

CauseLens: Causality-based Interpretable Root Cause Analysis for Microservice Systems

Qihan Liu, Pengfei Chen^†, Guangba Yu, Yuanhao Lai, Xiaoyun Li

IWQoS'25 (CCF B) In IEEE/ACM International Symposium on Quality of Service

Microservice applications consist of complex API invocation relationships, where a single fault can propagate through multiple paths, leading to widespread failures. The diverse propagation patterns of different faults make efficient and interpretable root cause analysis (RCA) crucial. We propose CauseLens, a causality-based unsupervised RCA framework that improves both accuracy and interpretability. The key insight is that fine-grained causal modeling enhances root cause localization. CauseLens constructs a heterogeneous causal diagram at the operation and entity levels using normal monitoring data (i.e., metrics and traces) and trains a structural causal model. It then integrates reconstruction error and counterfactual analysis to identify root causes while revealing fault propagation paths. Experiments on two microservice datasets demonstrate that CauseLens outperforms state-of-the-art methods in RCA accuracy.

LLMPrism: Black-box Performance Diagnosis for Production LLM Training Platforms

Zhihan Jiang, Rui Ren, Guangba Yu^†, Yulun Wu, Wenwei Gu, Yichen Li, Yujie Huang, Cong Feng, Zengyin Yang, Yongqiang Yang, Michael R. Lyu

DSN'25 (CCF B) In 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks

Large Language Models (LLMs) have brought about revolutionary changes in diverse fields, rendering LLM training of utmost importance for modern enterprises. To meet this demand, multi-tenant large-scale LLM training platforms have been built to offer LLM training services. Nevertheless, due to the complexity and synchronous nature of the LLM training process, performance issues occur frequently and can result in substantial resource wastage. The limited visibility from the perspective of platform providers impedes existing profiling methods and poses challenges to the monitoring and diagnosis of the performance of LLM training jobs. For the first time, this paper proposes the utilization of underlying network flow data to reconstruct the training timelines of jobs based on the distinct characteristics of the LLM training procedure. We design LLMPrism, the first black-box performance diagnosis system for LLM training platforms. By progressively recognizing LLM training jobs, identifying their parallelism strategies, and reconstructing the training timelines, LLMPrism achieves non-intrusive, lightweight, and continuous monitoring of LLM training systems. Leveraging this monitoring capability, it further effectively diagnoses potential performance issues. Since Oct. 2024, LLMPrism has been deployed on our large-scale production Platform-X, in which the evaluations and deployment experiences demonstrate that LLMPrism can achieve accurate timeline reconstruction with an error within 0.3% and effectively diagnose various performance issues.

A Survey on Failure Analysis and Fault Injection in AI Systems

Guangba Yu, Gou Tan, Haojia Huang, Zhenyu Zhang, Pengfei Chen^†, Roberto Natella, Zibin Zheng, Michael R. Lyu

TOSEM (CCF A) In ACM Transactions on Software Engineering and Methodology

The rapid advancement of Artificial Intelligence (AI) has led to its integration into various areas, especially with Large Language Models (LLMs) significantly enhancing capabilities in Artificial Intelligence Generated Content (AIGC). However, the complexity of AI systems has also exposed their vulnerabilities, necessitating robust methods for failure analysis (FA) and fault injection (FI) to ensure resilience and reliability. Despite the importance of these techniques, a comprehensive review of FA and FI methodologies in AI systems is still lacking. This study fills this gap by presenting a detailed survey of existing FA and FI approaches across six layers of AI systems. We systematically analyze 160 papers and repositories to answer three research questions: (1) what are the prevalent failures in AI systems, (2) what types of faults can current FI tools simulate, and (3) what gaps exist between the simulated faults and real-world failures. Our findings reveal a taxonomy of AI system failures, assess the capabilities of existing FI tools, and highlight discrepancies between real-world and simulated failures. Moreover, this survey contributes to the field by providing a framework for fault diagnosis, evaluating the state-of-the-art in FI, and identifying areas for improvement in FI techniques to enhance the resilience of AI systems.

Take Kernel Stack Overhead Out: eBPF-Enhanced Network Acceleration for Distributed Training within Ethernet

Zhenyu Zhang, Pengfei Chen, Guangba Yu^†, Zilong He, Xiaoyun Li

Internetware'25 (CCF C) In 16th International Conference on Internetware.

As deep neural networks (DNNs) continue to scale up in size to achieve greater capabilities, distributed training (DT) has become the prevailing approach to accelerate the training process. However, according to our observations of the network communication overheads in DT within Ethernet, the Linux kernel network stack accounts for 30% to 40% of the total communication time, posing a significant bottleneck to training efficiency. To mitigate the overhead introduced by the kernel network stack, we propose eRAR, an eBPF-based gradient aggregation over Ring-AR for DT tasks in commodity data centers. eRAR exploits Ring-AR's topology for in-kernel gradient aggregation using eBPF, enabling packet-level parallelism and avoiding the overhead of the network stack. It ensures reliability through ring-based retransmission and accelerates computations via SIMD-enabled kfuncs. eRAR is hardware-agnostic, network-topology-independent, and resource-efficient. Our experimental results on four popular DNN models demonstrate that, compared to aggregation based on the TCP/IP network stack, eRAR improves the gradient aggregation throughput by 77.2%. Furthermore, eRAR reduces the communication time by up to 37.4% compared to existing systems.

L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis

Zhihan Jiang, Junjie Huang, Guangba Yu^†, Zhuangbin Chen, Yichen Li, Renyi Zhong, Cong Feng, Yongqiang Yang, Michael R. Lyu

FSE'25 (CCF A) In 33rd ACM International Conference on the Foundations of Software Engineering.

As Large Language Models (LLMs) show their capabilities across various applications, training customized LLMs has become essential for modern enterprises. However, due to the complexity of LLM training, which requires massive computational resources and extensive training time, failures are inevitable during the training process. These failures result in considerable waste of resource and time, highlighting the critical need for effective and efficient failure diagnosis to reduce the cost of LLM training. In this paper, we present the first empirical study on the failure reports of 428 LLM training failures in our production Platform-X between May 2023 and April 2024. Our study reveals that hardware and user faults are the predominant root causes, and current diagnosis processes rely heavily on training logs. Unfortunately, existing log-based diagnostic methods fall short in handling LLM training logs. Considering the unique features of LLM training, we identify three distinct patterns of LLM training logs: cross-job, spatial, and temporal patterns. We then introduce our Log-based Large-scale LLM training failure diagnosis framework, L4, which can automatically extract failure-indicating information (i.e., log events, nodes, stages, and iterations) from extensive training logs, thereby reducing manual effort and facilitating failure recovery. Experimental results on real-world datasets show that L4 outperforms existing approaches in identifying failure-indicating logs and localizing faulty nodes. Furthermore, L4 has been applied in Platform-X and demonstrated its effectiveness in enabling accurate and efficient failure diagnosis.

COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge

Yichen Li, Yulun Wu, Jinyang Liu, Zhihan Jiang, Zhuangbin Chen, Guangba Yu^†, Michael R. Lyu

ICSE'25 (CCF A) In 47th IEEE/ACM International Conference on Software Engineering

Runtime failures are commonplace in modern distributed systems. When such issues arise, users often turn to platforms such as GitHub or JIRA to report them and request assistance. Automatically identifying the root cause of these failures is critical for ensuring high reliability and availability. However, prevailing automatic root cause analysis (RCA) approaches rely significantly on comprehensive runtime monitoring data, which is often not fully available in issue platforms. Recent methods leverage large language models (LLMs) to analyze issue reports, but their effectiveness is limited by incomplete or ambiguous user-provided information. To obtain more accurate and comprehensive RCA results, the core idea of this work is to extract additional diagnostic clues from code to supplement data-limited issue reports. Specifically, we propose COCA, a code knowledge enhanced root cause analysis approach for issue reports. Based on the data within issue reports, COCA intelligently extracts relevant code snippets and reconstructs execution paths, providing a comprehensive execution context for further RCA. Subsequently, COCA constructs a prompt combining historical issue reports along with profiled code knowledge, enabling the LLMs to generate detailed root cause summaries and localize responsible components. Our evaluation on datasets from five real-world distributed systems demonstrates that COCA significantly outperforms existing methods, achieving a 28.3% improvement in root cause localization and a 22.0% improvement in root cause summarization. Furthermore, COCA's performance consistency across various LLMs underscores its robust generalizability.

Conan: Uncover Consensus Issues in Distributed Databases Using Fuzzing-driven Fault Injection

Haojia Huang, Pengfei Chen^†, Guangba Yu, Haiyu Huang, Jia Chang, Jun Li, Jian Han

SANER'25 (CCF B) In IEEE International Conference on Software Analysis, Evolution and Reengineering

Consensus is critical for distributed databases as it ensures the consistency of states across nodes, reinforcing the robustness of the overall system. However, faults related to consensus protocols such as Paxos can lead to serious issues in distributed databases. Such consensus issues impact the correctness and availability of these databases. Therefore, to automatically uncover consensus issues in distributed databases, we propose Conan, a framework designed with fuzzing-driven fault injection. Conan applies a state-guided fuzzing algorithm to effectively explore the fault search space. Moreover, Conan employs hybrid fault sequences that combine fine-grained message-level faults and coarse-grained system-level faults to enhance fault injection. We implement and evaluate Conan on three widely used distributed databases: etcd, rqlite, and openGauss. Finally, Conan has successfully uncovered previously unknown consensus issues, some of which are not detected by existing approaches.

Mint: Cost-Efficient Tracing with All Requests Collection via Commonality and Variability Analysis

Haiyu Huang, Cheng Chen, Kunyi Chen, Pengfei Chen^†, Guangba Yu, Zilong He, Yilun Wang, Huxing Zhang, Qi Zhou

ASPLOS'25 (CCF A) In 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems.

Distributed traces contain valuable information but are often massive in volume, posing a core challenge in tracing framework design: balancing the tradeoff between preserving essential trace information and reducing trace volume. To address this tradeoff, previous approaches typically used a '1 or 0' sampling strategy: retaining sampled traces while completely discarding unsampled ones. However, based on an empirical study on real-world production traces, we discover that the '1 or 0' strategy actually fails to effectively balance this tradeoff. To achieve a more balanced outcome, we shift the strategy from the '1 or 0' paradigm to the 'commonality + variability' paradigm. The core of 'commonality + variability' paradigm is to first parse traces into common patterns and variable parameters, then aggregate the patterns and filter the parameters. We propose a cost-efficient tracing framework, Mint, which implements the 'commonality + variability' paradigm on the agent side to enable all requests capturing. Our experiments show that Mint can capture all traces and retain more trace information while optimizing trace storage (reduced to an average of 2.7%) and network overhead (reduced to an average of 4.2%). Moreover, experiments also demonstrate that Mint is lightweight enough for production use.

2024

FaaSConf: QoS-aware Hybrid Resources Configuration for Serverless Workflows

Yilun Wang, Pengfei Chen, Hui Dou^†, Yiwen Zhang, Guangba Yu, Zilong He, Haiyu Huang

ASE'24 (CCF A) In 39th IEEE/ACM International Conference on Automated Software Engineering.

The workflow composition of multiple short-lived functions has emerged as a prominent pattern in Function-as-a-Service (FaaS), exposing a considerable resources configuration challenge compared to individual independent serverless functions. This challenge unfolds in two ways. Firstly, serverless workflows frequently encounter dynamic and concurrent user workloads, increasing the risk of QoS violations. Secondly, the performance of a function can be affected by the resource re-provision of other functions within the workflow. With the popularity of the mode of concurrent processing in one single instance, concurrency limit as a critical configuration parameter imposes restrictions on the capacity of requests per instance. In this study, we present FaaSConf, a QoS-aware hybrid resource configuration approach that uses multi-agent reinforcement learning (MARL) to configure hybrid resources, including hardware resources and concurrency, thereby ensuring end-to-end QoS while minimizing resource costs. To enhance decision-making, we employ an attention technique in MARL to capture the complex performance dependencies between functions. We further propose a safe exploration strategy to mitigate QoS violations, resulting in a safer and efficient configuration exploration. The experimental results demonstrate that FaaSConf outperforms state-of-the-art approaches significantly. On average, it achieves a 26.5% cost reduction while exhibiting robustness to dynamic load changes.

FaaSRCA: Full Lifecycle Root Cause Analysis for Serverless Applications

Jin Huang, Pengfei Chen^†, Guangba Yu, Yilun Wang, Haiyu Huang, Zilong He

ISSRE'24 (CCF B) In 35th IEEE International Symposium on Software Reliability Engineering.

Serverless has become popular as a novel computing paradigm for cloud-native services. However, the complexity and dynamic nature of serverless applications present significant challenges to ensuring system availability and performance. There are many root cause analysis (RCA) methods for microservice systems, but they are not suitable for precisely modeling serverless applications. This is because: (1) Compared to microservices, serverless applications exhibit a highly dynamic nature. They have short lifecycles and only generate instantaneous pulse-like data, lacking long-term continuous information. (2) Existing methods solely focus on analyzing the running stage and overlook other stages, failing to encompass the entire lifecycle of serverless applications. To address these limitations, we propose FaaSRCA, a full lifecycle root cause analysis method for serverless applications. It integrates multi-modal observability data generated from the platform and application sides by using a Global Call Graph. We train a Graph Attention Network (GAT) based graph auto-encoder to compute reconstruction scores for the nodes in the global call graph. Based on the scores, we determine the root cause at the granularity of the lifecycle stage of serverless functions. We conduct experimental evaluations on two serverless benchmarks; the results show that FaaSRCA outperforms other baseline methods with a top-k precision improvement ranging from 21.25% to 81.63%.

CTuner: Automatic NoSQL Database Tuning with Causal Reinforcement Learning

Genting Mai, Zilong He, Guangba Yu, Zhiming Chen, Pengfei Chen^†

Internetware'24 (CCF C) In 15th Asia-Pacific Symposium on Internetware.

The rapid development of information technology has necessitated the management of large volumes of data in modern society, leading to the emergence of NoSQL databases (e.g., MongoDB). To meet the huge demand for efficient data management and querying, optimizing the performance of these databases has become crucial. Currently, some reinforcement learning-based methods have been used to improve the efficiency of databases by tuning customizable database configurations. However, these methods have two limitations: susceptibility to the cold-start effect and low efficiency in configuration search. To address these issues, we propose a novel and effective approach named CTuner for the online performance tuning of NoSQL databases. CTuner skips cold start by Bayesian optimization-based learning, and improves the exploitation strategy of the TD3 model with causal inference. Practical implementation and experimental evaluations on three prominent NoSQL databases show that CTuner can find a better configuration at the same time cost than state-of-the-art approaches, with up to a 27.4% improvement in throughput and up to a 13.2% reduction in 95th-percentile latency. Moreover, we introduce meta-learning to enhance the adaptability of CTuner and confirm that it is able to reliably improve performance under new environments and workloads.

MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications

Hongyang Chen, Pengfei Chen^†, Guangba Yu, Xiaoyun Li, Zilong He, Huxing Zhang

TDSC (CCF A) In IEEE Transactions on Dependable and Secure Computing

Microservice is a widely adopted architecture for constructing cloud-native applications. To test application resiliency, chaos engineering is widely used to inject faults proactively into applications. However, the search space formed by possible injection locations is huge due to the scale and complexity of the application. Although some methods have been proposed to effectively explore the injection space, they cannot prioritize high-impact injection solutions. Additionally, the blast radius of faults injected by existing methods is typically full of uncertainty, causing faults in multiple application functions. Although some tools are designed to conduct request-level injection, they require instrumentation of application code. To tackle these problems, this paper presents MicroFI, a non-intrusive fault injection framework, aiming to efficiently test different application functions with request-level injection. Request-level injection limits the blast radius to specified requests without any source code modification. Additionally, MicroFI leverages historical injection results and parallel techniques to accelerate the search. Moreover, an enhanced PageRank is used to measure the impact of faults and prioritize high-impact faults that fail more functions. Evaluations on three microservice applications show that MicroFI precisely injects faults and reduces up to 91% redundant faults on average. Additionally, by employing prioritization, MicroFI reduces an average of 47.3% injection budgets to cover all high-impact faults.

TraStrainer: Adaptive Sampling for Distributed Traces with System Runtime State

Haiyu Huang, Xiaoyu Zhang, Pengfei Chen^†, Zilong He, Zhiming Chen, Guangba Yu, Hongyang Chen, Chen Sun

FSE'24 (CCF A) In 32nd ACM International Conference on the Foundations of Software Engineering. 🏆 Distinguished Paper Award

Distributed tracing has been widely adopted in many microservice systems and plays an important role in monitoring and analyzing the system. However, trace data often come in large volumes, incurring substantial computational and storage costs. To reduce the quantity of traces, trace sampling has become a prominent topic of discussion, and several methods have been proposed in prior work. To attain higher-quality sampling outcomes, biased sampling has gained more attention than random sampling. Previous biased sampling methods primarily considered the importance of traces based on diversity, aiming to sample more edge-case traces and fewer common-case traces. However, we contend that relying solely on trace diversity for sampling is insufficient; system runtime state is another crucial factor that needs to be considered, especially in cases of system failures. In this study, we introduce TraStrainer, an online sampler that takes into account both system runtime state and trace diversity. TraStrainer employs an interpretable and automated encoding method to represent traces as vectors. Simultaneously, it adaptively determines sampling preferences by analyzing system runtime metrics. When sampling, it combines the results of system-bias and diversity-bias through a dynamic voting mechanism. Experimental results demonstrate that TraStrainer can achieve higher-quality sampling results and significantly improve the performance of downstream root cause analysis (RCA) tasks. It has led to an average increase of 32.63% in Top-1 RCA accuracy compared to four baselines in two datasets.

ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems

Guangba Yu, Pengfei Chen^†, Zilong He, Qiuyu Yan, Yu Luo, Fangyan Li, Zibin Zheng

FSE'24 (CCF A) In 32nd ACM International Conference on the Foundations of Software Engineering.

In large-scale online service systems, the occurrence of software changes is inevitable and frequent. Despite rigorous pre-deployment testing practices, the presence of defective software changes in the online environment cannot be completely eliminated. Consequently, there is a pressing need for automated techniques that can effectively identify these defective changes. However, the current abnormal change detection (ACD) approaches fall short in accurately pinpointing defective changes, primarily due to their disregard for the propagation of faults. To address the limitations of ACD, we propose a novel concept called root cause change analysis (RCCA) to identify the underlying root causes of change-inducing incidents. In order to apply the RCCA concept to practical scenarios, we have devised an intelligent RCCA framework named ChangeRCA. This framework aims to localize the defective change associated with change-inducing incidents among multiple changes.

2023

Network shortcut in data plane of service mesh with eBPF

Wanqi Yang, Pengfei Chen^†, Guangba Yu, Haibin Zhang, Huxing Zhang

JNCA (CCF C) In Journal of Network and Computer Applications

In recent years, the adoption of the service mesh as a dedicated infrastructure layer to support cloud-native systems has gained significant popularity. Service meshes involve the incorporation of proxies to handle communication between microservices, thereby speeding up the development and deployment of microservice applications. However, the use of service meshes also increases the request latency because they elongate the packet transmission between services. After investigating the transmission path of packets in a representative service mesh Istio, we observed that the service mesh dedicates approximately 25% of its time to packet transmission in the Linux kernel network stack. To shorten this process, we propose a non-intrusive solution that enables packets to bypass the kernel network stack through the implementation of socket redirection and tc (traffic control) redirection with eBPF (extended Berkeley Packet Filter). We also conduct comprehensive experiments on the widely-used Istio. The evaluation results show that our approach can significantly reduce the request latency by up to 21%. Furthermore, our approach decreases CPU usage by 1.73% and reduces memory consumption by approximately 0.98% when compared to the original service mesh implementation.

Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data

Guangba Yu, Pengfei Chen^†, Yufeng Li, Hongyang Chen, Xiaoyun Li, Zibin Zheng

FSE'23 (CCF A) In 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging task. To understand and localize root causes of unexpected faults, modern observability tools collect and preserve multi-modal observability data, including metrics, traces, and logs. Since system faults may manifest as anomalies in different data sources, existing RCA approaches that rely on single-modal data are constrained in the granularity and interpretability of root causes. In this study, we present Nezha, an interpretable and fine-grained RCA approach that pinpoints root causes at the code region and resource type level through joint analysis of multi-modal data. Nezha transforms heterogeneous multi-modal data into a homogeneous event representation and extracts event patterns by constructing and mining event graphs. The core idea of Nezha is to compare event patterns in the fault-free phase with those in the fault-suffering phase to localize root causes in an interpretable way. Practical implementation and experimental evaluations on two microservice applications show that Nezha achieves a high top-1 accuracy (87.5% on average) at the code region and resource type level and outperforms state-of-the-art approaches by a large margin. Two ablation studies further confirm the contributions of incorporating multi-modal data.

DiagConfig: Configuration Diagnosis of Performance Violations in Configurable Software Systems

Zhiming Chen, Pengfei Chen^†, Peipei Wang, Guangba Yu, Zilong He, Genting Mai

FSE'23 (CCF A) In 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Performance degradation due to misconfiguration in software systems that violates SLOs (service-level objectives) is commonplace. Diagnosing and explaining the root causes of such performance violations in configurable software systems is often challenging due to their increasing complexity. Although there are many tools and techniques for diagnosing performance violations, they provide limited evidence to attribute causes of observed performance violations to specific configurations. This is because the configuration is not originally considered in those tools. This paper proposes DiagConfig, specifically designed to conduct configuration diagnosis of performance violations. It leverages static code analysis to track configuration option propagation, identifies performance-sensitive options, detects performance violations, and constructs cause-effect chains that help stakeholders better understand the relationship between configuration and performance violations. Through experimental evaluations with eight real-world open-source software systems, we demonstrate that DiagConfig effectively identifies performance-sensitive options and constructs cause-effect chains. Specifically, DiagConfig produces fewer false positives than SafeTune (i.e., 5 vs 77) in the identification of performance-sensitive options, and outperforms Unicorn in the diagnosis of performance violations caused by configuration changes, offering more comprehensive results (recall 0.892 vs 0.289). We also show that DiagConfig can accelerate auto-tuning by compressing the configuration space.

MARS: Fault Localization in Programmable Networking Systems with Low-cost In-Band Network Telemetry

Benran Wang, Hongyang Chen, Pengfei Chen^†, Zilong He, Guangba Yu

ICPP'23 (CCF B) In 32nd International Conference on Parallel Processing

In this paper, we present MARS, a lightweight system for anomaly detection with dynamic threshold and automatic root cause localization in programmable networking systems. MARS collects aggregated packet level telemetry on demand and generates a ranked list of fine-grained fault culprits at multiple levels, including port level, switch level, and flow level. Experimental evaluations show the cost-effectiveness of MARS, both in terms of network bandwidth and switch memory usage. Moreover, MARS achieves a 0.97 F1 score in anomaly detection, and 0.95 Recall at Top2 and an overall 0.3 Exam Score in root cause localization.

DeepPower: Deep Reinforcement Learning based Power Management for Latency Critical Applications in Multi-core Systems

Jingrun Zhang, Guangba Yu, Zilong He, Liang Ai, Pengfei Chen^†

ICPP'23 (CCF B) In 32nd International Conference on Parallel Processing

Latency-critical (LC) applications are widely deployed in modern datacenters. Effective power management for LC applications can yield significant cost savings. However, it poses a significant challenge in maintaining the desired Service Level Agreement (SLA) levels. Prior research has mainly emphasized predicting the service time of requests and utilizing heuristic algorithms for CPU frequency adjustment. Unfortunately, the control granularity is limited to the request level and manual feature selection is needed. This paper proposes DeepPower, a deep reinforcement learning (DRL) based power management solution for LC applications. DeepPower comprises two key components: a DRL agent for monitoring the system load changes and a thread controller for CPU frequency adjustment. Considering the high overhead of the neural network and the short service time of requests, it is infeasible to employ DRL for direct adjustment of CPU frequency at the request level. Instead, DeepPower proposes a hierarchical control mechanism: the DRL agent adjusts the parameter of the thread controller at longer intervals, and the thread controller adjusts the CPU frequency at shorter intervals. This control mechanism enables DeepPower to adapt to dynamic workloads and achieve fine-grained frequency adjustments. We evaluate DeepPower with some common LC applications under dynamic workload. The experimental results show that DeepPower saves up to 28.4% power compared with state-of-the-art methods and reduces the percentage of request timeouts.

LogReducer: Identify and Reduce Log Hotspots in Kernel on the Fly

Guangba Yu, Pengfei Chen^†, Pairui Li, Tianjun Weng, Haibing Zheng, Yuetang Deng, Zibin Zheng

ICSE'23 (CCF A) In 45th IEEE/ACM International Conference on Software Engineering

Modern systems generate a massive amount of logs to detect and diagnose system faults, which incurs expensive storage cost and runtime overhead. After investigating real-world production logs, we observe that most of the logging overhead is due to a small number of log templates, referred to as log hotspots. Therefore, we conduct a systematic study of log hotspots in an industrial system, WeChat, which motivates us to identify log hotspots and reduce them on the fly. In this paper, we propose LogReducer, a non-intrusive and language-independent log reduction framework based on eBPF (Extended Berkeley Packet Filter), consisting of both online and offline processes. After two months of serving the offline process of LogReducer in WeChat, the log storage overhead has dropped from 19.7 PB per day to 12.0 PB (i.e., about a 39.08% decrease). Practical implementation and experimental evaluations in the test environment demonstrate that the online process of LogReducer can control the logging overhead of hotspots while preserving logging effectiveness. Moreover, the log hotspot handling time can be reduced from an average of 9 days in production to 10 minutes in the test environment with the help of LogReducer.

FaaSDeliver: Cost-efficient and QoS-aware Function Delivery in Computing Continuum

Guangba Yu, Pengfei Chen^†, Zibin Zheng, Jingrun Zhang, Xiaoyun Li, Zilong He

TSC (CCF A) In IEEE Transactions on Services Computing

Serverless Function-as-a-Service (FaaS) is a rapidly growing computing paradigm in the cloud era. To provide rapid service response and save network bandwidth, traditional cloud-based FaaS platforms have been extended to the edge. However, launching functions in a heterogeneous computing continuum (HCC) that includes the cloud, fog, and the edge brings new challenges: determining where functions should be delivered and how many resources should be allocated. To optimize the cost of running functions in the HCC, we propose an adaptive and efficient function delivery engine, named FaaSDeliver, which automatically unearths a cost-efficient function delivery policy (FDP) for each function, including the FaaS platform selection and resource allocation. Real system implementation and evaluations in a practical HCC demonstrate that FaaSDeliver can unearth the most cost-efficient FDPs from among 180,200 FDPs after a few trials. FaaSDeliver reduces the average cost of function execution by 38% to 78% compared to some state-of-the-art approaches.

2022

MicroSketch: Lightweight and Adaptive Sketch based Performance Issue Detection and Localization in Microservice Systems

Xiaoyun Li, Guangba Yu, Pengfei Chen^†, Hongyang Chen, Zhekang Chen

ICSOC'22 (CCF B) In 20th International Conference on Service-Oriented Computing

With the rapid growth of microservice systems in cloud-native environments, end-to-end traces have become essential data to help diagnose performance issues. However, existing trace-based anomaly detection and root cause analysis (RCA) still suffer from practical issues due to either the massive volume or frequent system changes. In this study, we propose a lightweight and adaptive trace-based anomaly detection and RCA approach, named MicroSketch, which leverages Sketch-based features and Robust Random Cut Forest (RRCForest) to render trace analysis more effective and efficient. In addition, MicroSketch is an unsupervised approach that is able to adapt to changes in microservice systems without any human intervention. We evaluated MicroSketch on a widely-used open-source system and a production system. The results demonstrate the efficiency and effectiveness of MicroSketch. MicroSketch significantly outperforms state-of-the-art approaches, with an average of 40.9% improvement in F1 score on anomaly detection and 25.0% improvement in Recall of Top-1 on RCA. In particular, MicroSketch is at least 60x faster than other methods in terms of diagnosis time.

Going through the Life Cycle of Faults in Clouds: Guidelines on Fault Handling

Xiaoyun Li, Guangba Yu, Pengfei Chen^†, Hongyang Chen, Zhekang Chen

ISSRE'22 (CCF B) In 33rd IEEE International Symposium on Software Reliability Engineering

Faults are the primary culprits of breaking the high availability of cloud systems, even leading to costly outages. As the scale and complexity of clouds increase, it becomes extraordinarily difficult to understand, detect, and diagnose faults. During outages, engineers record the detailed information of the whole life cycle of faults (i.e., fault occurrence, fault detection, fault identification, and fault mitigation) in the form of post-mortems. In this paper, we conduct a quantitative and qualitative study on 354 public post-mortems collected from three popular large-scale clouds, 97.7% of which span from 2015 to 2021. By reviewing and analyzing post-mortems, we go through the life cycle of faults in clouds and obtain 10 major findings. Based on these findings, we further derive a series of actionable guidelines for better fault handling.

Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems

Zilong He, Pengfei Chen^†, Yu Luo, Qiuyu Yan, Hongyang Chen, Guangba Yu, Fangyuan Li

ASE'22 (CCF A) In 37th IEEE/ACM International Conference on Automated Software Engineering

With the ever-increasing scale and complexity of online systems, incidents are gradually becoming commonplace. Without appropriate handling, they can seriously harm the system availability. However, in large-scale online systems, these incidents are usually drowning in a slew of issues (i.e., something abnormal, while not necessarily an incident), rendering them difficult to handle. Typically, these issues will result in a cascading effect across the system, and proper management of the incidents depends heavily on a thorough analysis of this effect. Therefore, in this paper, we propose a method to automatically analyze the cascading effect of availability issues in online systems and extract the corresponding graph-based issue representations incorporating both the issue symptoms and affected service attributes. With the extracted representations, we train and utilize a graph neural network based model to perform incident detection. Then, for the detected incident, we leverage the PageRank algorithm with a flexible transition matrix design to locate its root cause. We evaluate our approach using real-world data collected from the WeChat® online service system, the largest instant messaging system in China. The results confirm the effectiveness of our approach. Moreover, our approach is successfully deployed in the company and eases the burden of operators in the face of a flood of issues and related alert signals.

TS-InvarNet: Anomaly Detection and Localization based on Tempo-spatial KPI Invariants in Distributed Services

Zijun Hu, Pengfei Chen^†, Guangba Yu, Zilong He, Xiaoyun Li

ICWS'22 (CCF B) In Proceedings of 2022 IEEE International Conference on Web Services

Modern industrial systems are often large-scale distributed systems composed of dozens to thousands of services, leading to difficulty in anomaly detection and localization. KPIs (Key Performance Indicators) record the states of different services and are presented as time series, which reflect the status of the system. However, due to the dynamic and complex periodic patterns embedded in KPIs, pinpointing anomalous behavior in these multivariate time series data quickly and accurately is a challenging problem. The current state-of-the-art deep-learning-based anomaly detection methods model global inter-KPI dependency, which limits their ability to detect local subtle anomalies and results in poor interpretability. In practice, interpreting anomalies can accelerate problem localization and further troubleshooting. In this study, we propose TS-InvarNet, an interpretable end-to-end anomaly detection and diagnosis framework based on tempo-spatial KPI invariants. Extensive empirical studies on three real-world industrial datasets and a widely-used open-source system demonstrate that TS-InvarNet can outperform state-of-the-art baseline methods in detection and diagnosis performance. Specifically, TS-InvarNet increases F1-scores by up to 27% compared to the baselines.

SwissLog: Robust Anomaly Detection and Localization for Interleaved Unstructured Logs

Xiaoyun Li, Pengfei Chen^†, Linxiao Jing, Zilong He, Guangba Yu

TDSC (CCF A) In IEEE Transactions on Dependable and Secure Computing

Modern distributed systems generate interleaved logs when running in parallel. Identifiers (IDs) are always attached to them to trace running instances or entities in logs. Therefore, log messages can be grouped by the same IDs to help anomaly detection and localization. The existing approaches still fall short of meeting the following challenges: 1) logs are processed solely within single components without mining log dependencies; 2) log formats are continually changing in modern software systems; 3) it is challenging to detect latent performance issues non-intrusively with trivial monitoring tools. To remedy the above shortcomings, we propose SwissLog, a robust anomaly detection and localization tool for interleaved unstructured logs. SwissLog focuses on log sequential anomalies and tries to dig out possible performance issues. SwissLog constructs ID relation graphs across distributed components and groups log messages by IDs. Moreover, we propose an online data-driven log parser without parameter tuning. The grouped log messages are parsed via the novel log parser and transformed with semantic and temporal embedding. Finally, SwissLog utilizes an attention-based Bi-LSTM model and a heuristic searching algorithm to detect and localize anomalies at instance granularity, respectively. The experiments on real-world and synthetic datasets confirm the effectiveness, efficiency, and robustness of SwissLog.

2021

TraceRank: Abnormal Service Localization with Dis-Aggregated End-to-End Tracing Data in Cloud Native Systems

Guangba Yu, Zicheng Huang, Pengfei Chen^†

JSEP (CCF B) In Journal of Software: Evolution and Process

Modern cloud native applications are generally built with a microservice architecture. To tackle various performance problems among a large number of services and machines, an end-to-end tracing tool is always equipped in these systems to track the execution path of every single request. However, it is nontrivial to conduct root cause analysis of anomalies with such a large volume of tracing data. This paper proposes a novel system named TraceRank to identify and locate abnormal services causing performance problems with dis-aggregated end-to-end traces. TraceRank mainly includes an anomaly detection module and a root cause analysis module. The root cause analysis procedure is triggered when an anomaly is detected. To fully leverage the information provided by the tracing data, both the spectrum analysis and the PageRank-based random walk methods are introduced to pinpoint abnormal services. The experiments in TrainTicket and Bookinfo microservice benchmarks and a real-world system show that TraceRank can locate root causes with 90% in Precision and 86% in Recall. TraceRank has up to 10% improvement compared with several state-of-the-art approaches in both Precision and Recall. Finally, TraceRank has good scalability and a low overhead to adapt to large-scale microservice systems.
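The PageRank-based random walk mentioned in the abstract can be illustrated with a plain power-iteration sketch. This is generic PageRank over a hypothetical service call graph, not TraceRank's actual scoring; the graph, the service names, the damping factor, and the `pagerank` helper are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Generic PageRank by power iteration.

    graph maps each node to the list of nodes it points to
    (here: a hypothetical caller -> callee service graph).
    """
    nodes = sorted(set(graph) | {dst for dsts in graph.values() for dst in dsts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the teleport share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for src in nodes:
            targets = graph.get(src, [])
            if targets:
                # Split this node's rank evenly among its out-links.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[src] / n
                for dst in nodes:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical call graph: edges point from caller to callee.
calls = {"frontend": ["cart", "search"], "cart": ["db"], "search": ["db"], "db": []}
ranks = pagerank(calls)
for service, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{service}: {score:.3f}")
```

In this toy graph the shared dependency `db` accumulates the most rank because every request path eventually reaches it, which is the intuition behind using random walks to surface services that many traces depend on.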

Sieve: Attention-based Sampling of End-to-End Trace Data in Distributed Microservice Systems

Zicheng Huang, Pengfei Chen^†, Guangba Yu, Hongyang Chen, Zibin Zheng

ICWS'21 (CCF B) In Proceedings of 2021 IEEE International Conference on Web Services

End-to-end tracing plays an important role in understanding and monitoring distributed microservice systems. The trace data are valuable for finding anomalous or erroneous behavior of the system. However, the volume of trace data is huge, leading to a heavy burden on analyzing and storing them. To reduce the volume of trace data, sampling is widely adopted. However, existing uniform sampling approaches are unable to capture uncommon traces that are more interesting and informative. To tackle this problem, we design and implement Sieve, an online sampler that aims to bias sampling towards uncommon traces by taking advantage of the attention mechanism. The evaluation results on the trace datasets collected from real-world and experimental microservice systems show that Sieve is effective at increasing the sampling probabilities of structurally and temporally uncommon traces and reduces the storage space to a large extent by using a low sampling rate.

T-Rank: A Lightweight Spectrum based Fault Localization Approach for Microservice Systems

Zihao Ye, Pengfei Chen^†, Guangba Yu

CCGrid'21 (CCF C, CORE A) In Proceedings of 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing

The cloud-native system is shifting from traditional monolithic architecture to microservice architecture because of the loose coupling, better maintainability and availability, faster deployment, and richer ecosystem it brings. Despite these advantages, it still has an inevitable weakness: the communication over RPC (Remote Procedure Call) between services makes the system performance more unpredictable. Moreover, the complex interactions amongst services make it hard to reveal the root cause of performance issues. To address this challenge, we propose a lightweight spectrum-based performance diagnosis tool, named T-Rank. T-Rank provides a ranked list of suspiciousness scores for microservices to localize root causes with very few resources. We demonstrate the high accuracy and the low cost of T-Rank by conducting experiments with the data collected from a real-world production microservice system. Moreover, comparison results show that T-Rank outperforms other state-of-the-art approaches.
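As a rough illustration of spectrum-based ranking, the sketch below scores each service with the Ochiai formula, a common choice in spectrum-based fault localization. Whether T-Rank uses Ochiai is not stated in the abstract, so the formula, the `ochiai_scores` helper, and the sample traces are assumptions for illustration only.

```python
import math


def ochiai_scores(traces):
    """Rank services by Ochiai suspiciousness.

    traces is a list of (services_covered, is_abnormal) pairs, where
    services_covered is the set of services a trace passed through.
    """
    total_failed = sum(1 for _, abnormal in traces if abnormal)
    services = set().union(*(covered for covered, _ in traces))
    scores = {}
    for svc in services:
        # ef: abnormal traces covering svc; ep: normal traces covering svc.
        ef = sum(1 for covered, abnormal in traces if abnormal and svc in covered)
        ep = sum(1 for covered, abnormal in traces if not abnormal and svc in covered)
        denom = math.sqrt(total_failed * (ef + ep))
        scores[svc] = ef / denom if denom else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical traces: two slow requests and two healthy ones.
traces = [
    ({"gateway", "order", "payment"}, True),
    ({"gateway", "order", "payment"}, True),
    ({"gateway", "order"}, False),
    ({"gateway", "inventory"}, False),
]
ranking = ochiai_scores(traces)
for service, score in ranking:
    print(f"{service}: {score:.3f}")
```

Here `payment` ranks first because it appears only in abnormal traces, while `gateway` is covered by every trace and is therefore less suspicious.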

Kmon: An In-kernel Transparent Monitoring System for Microservice Systems with eBPF

Tianjun Weng, Wanqi Yang, Guangba Yu, Pengfei Chen^†, Jieqi Cui, Chuangfu Zhang

CloudIntelligence'21 In Proceedings of 2021 IEEE/ACM International Workshop on Cloud Intelligence

Currently, the architecture of software systems is shifting from “monolith” to “microservice”, which is an important enabling technology of cloud native systems. Owing to its advantages in agility, efficiency, and scaling, microservice has become the most popular architecture in the industry. However, as microservice complexity and scale increase, it becomes challenging to monitor such a large number of microservices. Traditional monitoring techniques such as end-to-end tracing do not fit the microservice environment well, because they require considerable code instrumentation effort. Moreover, they cannot explore the fine-grained internal states of microservice instances. To tackle this problem, we propose Kmon, an in-kernel transparent monitoring system for microservice systems built on extended Berkeley Packet Filter (eBPF). Kmon can provide multiple kinds of run-time information about microservices, such as latency, topology, and performance metrics, with a low overhead.

MicroRank: End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments

Guangba Yu, Pengfei Chen^†, Hongyang Chen, Zijie Guan, Zicheng Huang, Linxiao Jing, Tianjun Weng, Xinmeng Sun, Xiaoyun Li

WWW'21 (CCF A) In Proceedings of the Web Conference 2021

With the advantages of flexible scalability and fast delivery, microservice has become a popular software architecture in the modern IT industry. However, the explosion in the number of service instances and complex dependencies makes troubleshooting extremely challenging in microservice environments. To help understand and troubleshoot a microservice system, the end-to-end tracing technology has been widely applied to capture the execution path of each request. Nevertheless, the tracing data are not fully leveraged by cloud and application providers when conducting latency issue localization in the microservice environment. This paper proposes a novel system, named MicroRank, which analyzes clues provided by normal and abnormal traces to locate root causes of latency issues. Once a latency issue is detected by the Anomaly Detector in MicroRank, the cause localization procedure is triggered. MicroRank first distinguishes which traces are abnormal. Then, MicroRank’s PageRank Scorer module uses the abnormal and normal trace information as its input and differentiates the importance of different traces for the extended spectrum techniques. Finally, the spectrum techniques calculate the ranking list based on the weighted spectrum information from PageRank Scorer to locate root causes more effectively. The experimental evaluations on a widely-used open-source system and a production system show that MicroRank achieves excellent results not only when there is a single root cause but also when two issues happen at the same time. Moreover, MicroRank achieves a 6% to 22% improvement in recall in localizing root causes compared to current state-of-the-art methods.

2020

A Learning-based Dynamic Load Balancing Approach for Microservice Systems in Multi-cloud Environment

Jieqi Cui, Pengfei Chen^†, Guangba Yu

ICPADS'20 (CCF C, CORE B) In Proceedings of IEEE 26th International Conference on Parallel and Distributed Systems

Multi-cloud environments have become common as companies seek to avoid cloud vendor lock-in for security and cost reasons. Meanwhile, the microservice architecture is often adopted for its flexibility. Combining multi-cloud with microservices raises the problem of routing requests among all possible microservice instances in a multi-cloud environment. This paper presents a learning-based approach that routes requests so as to balance the load. In our approach, the performance of a microservice is modeled explicitly through machine learning models. The model derives the response time from request volume, route decision, and other cloud metrics. The balanced route decision is then obtained by optimizing the model with Bayesian Optimization. With this approach, the request route decision can adjust to dynamic runtime metrics instead of remaining static under all circumstances. Explicit performance modeling avoids searching on an actual microservice system, which is time-consuming. Experiments show that our approach reduces average response time by at least 10%.
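The "model, then optimize" loop described above can be sketched in a few lines of Python. This is a toy under loudly stated assumptions, not the paper's method. A closed-form M/M/1 queueing formula stands in for the learned performance model, and an exhaustive grid search stands in for Bayesian Optimization. The names `predicted_response_time` and `best_split`, and all of the numbers, are hypothetical.

```python
def predicted_response_time(volume, split, capacities):
    """Stand-in for the learned model (assumption: each cloud behaves like
    an M/M/1 queue). Returns the mean response time when a fraction `split`
    of `volume` requests/s goes to cloud 0 and the rest to cloud 1."""
    loads = (volume * split, volume * (1.0 - split))
    total = 0.0
    for load, capacity in zip(loads, capacities):
        if load >= capacity:
            return float("inf")  # an overloaded cloud makes the split infeasible
        # Weight each cloud's mean response time by its share of the traffic.
        total += (load / volume) * (1.0 / (capacity - load))
    return total


def best_split(volume, capacities, steps=100):
    """Grid search over the route split; a simple stand-in for the
    Bayesian Optimization step named in the abstract."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: predicted_response_time(volume, s, capacities),
    )


# Hypothetical scenario: 90 requests/s across clouds that can serve 100 and 50.
split = best_split(volume=90.0, capacities=(100.0, 50.0))
```

Because the search runs against the model rather than the live system, each candidate split costs only a function call. This is the benefit the abstract attributes to explicit performance modeling.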

SwissLog: Robust and Unified Deep Learning Based Log Anomaly Detection for Diverse Faults

Xiaoyun Li, Pengfei Chen^†, Linxiao Jing, Zilong He, Guangba Yu

ISSRE'20 (CCF B) In Proceedings of the 2020 IEEE 31st International Symposium on Software Reliability Engineering

Log-based anomaly detection has been widely studied and achieves satisfactory performance on stable log data. However, existing approaches still fall short of meeting two challenges: 1) log formats change continually in practice in software systems under active development and maintenance; 2) performance issues are latent causes that may not be detected by trivial monitoring tools. We thus propose SwissLog, a robust and unified deep learning based anomaly detection model for detecting diverse faults. SwissLog targets faults that result in log sequence order changes and log time interval changes. To achieve that, an advanced log parser is introduced. Moreover, the semantic embedding and the time embedding approaches are combined to train a unified attention-based BiLSTM model to detect anomalies. The experiments on real-world datasets and synthetic datasets show that SwissLog is robust to changing log data and effective for diverse faults.

A Spatiotemporal Deep Learning Approach for Unsupervised Anomaly Detection in Cloud Systems

Zilong He, Pengfei Chen^†, Xiaoyun Li, Yongfeng Wang, Guangba Yu, Cailin Chen, Xinrui Li, Zibin Zheng

TNNLS (Impact Factor 10.4, CCF B) In IEEE Transactions on Neural Networks and Learning Systems

Anomaly detection is a critical task for maintaining the performance of a cloud system. Using data-driven methods to address this issue has become mainstream in recent years. However, due to the lack of labeled training data in practice, an anomaly detection model must be trainable on contaminated data in an unsupervised way. Besides, with the increasing complexity of cloud systems, effectively organizing data collected from a wide range of system components and modeling the spatiotemporal dependence among them have become a challenge. In this article, we propose TopoMAD, a stochastic seq2seq model which can robustly model spatial and temporal dependence among contaminated data. We include system topological information to organize metrics from different components and apply sliding windows over continuously collected metrics to capture the temporal dependence. We extract spatial features with the help of graph neural networks and temporal features with long short-term memory networks. Moreover, we base our model on a variational auto-encoder, enabling it to work robustly even when trained on contaminated data. Our approach is validated on the run-time performance data collected from two representative cloud systems, namely, a big data batch processing system and a microservice-based transaction processing system. The experimental results show that TopoMAD outperforms some state-of-the-art methods on these two data sets.
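The sliding-window step mentioned above can be illustrated with a minimal Python sketch. It covers only the windowing of continuously collected metrics, not the graph neural network, LSTM, or variational auto-encoder parts of the model. The function name `sliding_windows`, the window length, and the sample values are assumptions made for illustration.

```python
def sliding_windows(samples, window, stride=1):
    """Split a time-ordered list of metric vectors into overlapping windows.

    samples: list where samples[t] holds the metrics observed at time step t.
    Returns a list of windows, each a list of `window` consecutive samples.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(samples) - window
    return [samples[i:i + window] for i in range(0, last_start + 1, stride)]


# Hypothetical data: five time steps, two metrics per step (e.g. CPU, memory).
samples = [[0.1, 0.2], [0.3, 0.1], [0.9, 0.8], [0.2, 0.2], [0.1, 0.3]]
windows = sliding_windows(samples, window=3)
```

Each window would then be fed to a sequence model as one training or scoring example. A series shorter than the window length yields no windows.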

Microscaler: Cost-effective Scaling for Microservice Applications in the Cloud with an Online Learning Approach

Guangba Yu, Pengfei Chen^†, Zibin Zheng

TCC (Impact Factor 5.9) In IEEE Transactions on Cloud Computing

Recently, the microservice has become a popular architecture for constructing cloud-native systems due to its agility. In cloud-native systems, autoscaling is a key enabling technique to adapt to workload changes by acquiring or releasing the right amount of computing resources. However, autoscaling becomes a challenging problem in microservice applications, since such an application usually comprises a large number of different microservices with complex interactions. When the performance decreases due to an unpredictable workload peak, it is difficult to pinpoint the scaling-needed services that must scale out and to evaluate how many resources they need. In this paper, we present a novel system named Microscaler to automatically identify the scaling-needed services and scale them to meet the Service Level Agreement (SLA) with an optimal cost for microservice applications. Microscaler first collects the quality of service (QoS) metrics in the service mesh enabled microservice infrastructure. Then, it determines under-provisioned or over-provisioned service instances along the service dependency graph with a novel scaling-needed service criterion named service power. The service dependency graph can be obtained by correlating each request flow in the service mesh. By combining an online learning approach and a step-by-step heuristic approach, Microscaler can precisely reach the optimal service scale meeting the SLA requirements. The experimental evaluations in a microservice benchmark show that Microscaler achieves an average precision of 93% in scaling-needed service determination and converges to the optimal service scale faster than several state-of-the-art methods. Moreover, Microscaler is lightweight and flexible enough to work in a large-scale microservice system.

A Framework of Virtual War Room and Matrix Sketch-Based Streaming Anomaly Detection for Microservice Systems

Hongyang Chen, Pengfei Chen^†, Guangba Yu

Access In IEEE Access

Recently, microservice has become a popular architecture for constructing cloud-native systems. This novel architecture brings agility and accelerates the software development process significantly. However, it is not easy to manage and operate microservice systems due to their scale and complexity. Many approaches, such as anomaly detection, have been proposed to operate microservice systems automatically. Nevertheless, those methods cannot be sufficiently validated and compared due to a lack of real microservice systems, which slows the progress of intelligent operation. These challenges inspire us to build a system named “VWR”, a framework of Virtual War Room for operating microservice applications, which allows users to simulate their microservice architectures with low overhead and inject multiple types of faults into the microservice system with chaos engineering. VWR can mimic user requests and record the end-to-end tracing data (i.e., service call chains) for each request in a way consistent with OpenTracing. With easily designed tests and the produced streaming tracing data, users can validate the performance of their intelligent operation algorithms and improve the algorithms as needed. Besides, based on the streaming tracing data generated by VWR, we introduce a novel unsupervised anomaly detection algorithm based on Matrix Sketch and set it as the default intelligent operation algorithm in VWR. This algorithm can detect anomalies by analyzing high-dimensional performance data collected from a microservice system in a streaming manner. The experimental results in VWR show that the matrix sketch based method can precisely detect anomalies in microservice systems and outperforms some widely used anomaly detection methods, such as isolation forest, in some scenarios. We believe more approaches to the intelligent operation of microservice systems can be constructed based on VWR.

2019

Microscaler: Automatic Scaling for Microservices with an Online Learning Approach

Guangba Yu, Pengfei Chen^†, Zibin Zheng

ICWS'19 (CCF B) In Proceedings of the 2019 IEEE International Conference on Web Services

Recently, the microservice has become a popular architecture for constructing cloud-native systems due to its agility. In cloud-native systems, autoscaling is a core enabling technique to adapt to workload changes by scaling out/in. However, autoscaling becomes a challenging problem in a microservice system, since such a system usually comprises a large number of different microservices with complex interactions. When bursty and unpredictable workloads arrive, it is difficult to pinpoint the scaling-needed services that must scale and to evaluate how many resources they need. In this paper, we present a novel system named Microscaler to automatically identify the scaling-needed services and scale them to meet the service level agreement (SLA) with an optimal cost for microservice systems. Microscaler collects the quality of service (QoS) metrics with the help of the service mesh enabled infrastructure. Then, it determines the under-provisioned or over-provisioned services with a novel criterion named service power. By combining an online learning approach and a step-by-step heuristic approach, Microscaler can achieve the optimal service scale satisfying the SLA requirements. The experimental evaluations in a microservice benchmark show that Microscaler converges to the optimal service scale faster than several state-of-the-art methods.
