Welcome to Guangba's HomePage
AIOps
MARS: Fault Localization in Programmable Networking Systems with Low-cost In-Band Network Telemetry
In this paper, we present MARS, a lightweight system for anomaly detection with dynamic threshold and automatic root cause localization in programmable networking systems.
Benran Wang, Hongyang Chen, Pengfei Chen, Zilong He, Guangba Yu
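Fresh Off the Press! A Roundup of WWW 2023 Cloud Computing Papers (Part 1)
WWW 2023, a top academic conference on the research and development of Internet technologies, is about to begin. Follow this post to catch up on the latest cloud computing research at WWW 2023.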
Guangba Yu
Mar 21, 2023
2 min read
Weekly Paper
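Anomalous Change Detection (Part 1)
Microservice architectures and CI/CD let modern applications develop and release new features quickly and frequently, but frequent code and configuration changes also introduce more sources of instability. According to the Google SRE book, 70% of incidents are caused by changes, so it is important to detect anomalous changes promptly during a canary rollout and to roll back as soon as possible. This post introduces three state-of-the-art anomalous change detection methods from academia and industry.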
Guangba Yu
Sep 30, 2022
1 min read
Weekly Paper
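DDS First-Hand News (Part 1)
Our lab recently had papers accepted at ASE 2022, ISSRE 2022, ICSOC 2022, ICWS 2022, and other conferences, and this post is the first report of them anywhere online. Below is a brief introduction to our work; interested readers can download the preprints for the details.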
Guangba Yu
Sep 9, 2022
1 min read
Weekly Paper
MicroSketch: Lightweight and Adaptive Sketch based Performance Issue Detection and Localization in Microservice Systems
In this study, we propose a lightweight and adaptive trace-based anomaly detection and RCA approach, named MicroSketch, which leverages Sketch-based features and Robust Random Cut Forest (RRCForest) to render trace analysis more effective and efficient.
Yufeng Li, Guangba Yu, Pengfei Chen, Chuanfu Zhang
Going through the Life Cycle of Faults in Clouds: Guidelines on Fault Handling
When cloud services experience failures, it is typical to conduct a “post-mortem” analysis after recovery to understand what went wrong, what went right, and how the team could do better in the future. When those failures are public-facing, it is common for some portion of those post-mortem analyses to be made publicly available. The paper analyzes 354 publicly visible post-mortem analyses for three popular large-scale clouds. Based on these findings, the authors suggest guidelines on fault handling that draw on chaos engineering, observability, and intelligent operations.
Xiaoyun Li, Guangba Yu, Pengfei Chen, Hongyang Chen, Zhekang Chen
Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems
This paper proposes GIED, a novel system that automatically analyzes the cascading effect of availability issues in online systems. GIED extracts graph-based issue representations that include both the issue symptoms and the affected service attributes. A neural network performs incident detection, and the PageRank algorithm is then used to locate the root cause of the incident.
Zilong He, Pengfei Chen, Yu Luo, Qiuyu Yan, Hongyang Chen, Guangba Yu, Fangyuan Li
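As a rough illustration of the PageRank step mentioned in the GIED summary, the sketch below runs plain power-iteration PageRank over a small dependency graph. It is not GIED's implementation: the service names, the edge direction (an edge points from an affected service to the service it depends on), the damping factor, and the iteration count are all assumptions made for the example.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if not out[n])
        new_rank = {n: (1 - damping) / count + damping * dangling / count
                    for n in nodes}
        for src in nodes:
            for dst in out[src]:
                new_rank[dst] += damping * rank[src] / len(out[src])
        rank = new_rank
    return rank


# Hypothetical graph: an edge (a, b) means "a is affected and depends on b".
edges = [("frontend", "orders"), ("frontend", "cart"),
         ("orders", "db"), ("cart", "db")]
ranks = pagerank(edges)
root_cause = max(ranks, key=ranks.get)
print(root_cause)
```

In this toy graph the score accumulates on the service that every affected path leads to, which is the intuition behind using PageRank for root cause ranking.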
SwissLog: Robust Anomaly Detection and Localization for Interleaved Unstructured Logs
In this paper, we propose SwissLog, a robust and unified deep-learning-based anomaly detection model for detecting diverse faults based on logs.
Xiaoyun Li, Pengfei Chen, Linxiao Jing, Zilong He, Guangba Yu
TraceRank: Abnormal Service Localization with Dis-Aggregated End-to-End Tracing Data in Cloud Native Systems
This paper proposes a novel system named TraceRank to identify and locate abnormal services causing performance problems with dis-aggregated end-to-end traces.
Guangba Yu, Zicheng Huang, Pengfei Chen
T-Rank: A Lightweight Spectrum based Fault Localization Approach for Microservice Systems
This paper proposes a novel system, named T-Rank, which analyzes clues provided by normal and abnormal traces to locate root causes of latency issues.
Zihao Ye, Pengfei Chen, Guangba Yu
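To give a feel for spectrum-based fault localization over normal and abnormal traces, here is a minimal sketch that scores services with the Ochiai formula, a common choice in the spectrum-based literature. Whether T-Rank uses Ochiai is not stated in the summary above, so the formula, the trace representation (each trace as a set of service names), and the service names are all assumptions.

```python
import math
from collections import Counter


def ochiai_rank(normal_traces, abnormal_traces):
    """Rank services by Ochiai suspiciousness.

    Each trace is an iterable of the service names it passed through.
    """
    total_abnormal = len(abnormal_traces)
    # Number of abnormal / normal traces that cover each service.
    in_abnormal = Counter(s for t in abnormal_traces for s in set(t))
    in_normal = Counter(s for t in normal_traces for s in set(t))

    scores = {}
    for service in set(in_abnormal) | set(in_normal):
        covered = in_abnormal[service] + in_normal[service]
        denom = math.sqrt(total_abnormal * covered)
        scores[service] = in_abnormal[service] / denom if denom else 0.0
    # Highest score first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical traces.
normal = [{"gateway", "cart"}, {"gateway", "catalog"}]
abnormal = [{"gateway", "payment"}, {"gateway", "cart", "payment"}]
ranking = ochiai_rank(normal, abnormal)
print(ranking[0][0])
```

A service that appears in every abnormal trace and in no normal trace gets the maximum score of 1.0, while a service shared by all traces is ranked lower.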