When cloud services experience failures, it is typical to conduct a “post-mortem” analysis after recovery to understand what went wrong, what went right, and how the team could do better in the future. When those failures are public-facing, it is common for some portion of those post-mortem analyses to be made publicly available. The paper describes an analysis of 354 publicly available post-mortem reports for three popular large-scale clouds. Based on their findings, the authors suggest guidelines for fault handling that draw on chaos engineering, observability, and intelligent operations.
Xiaoyun Li,
Guangba Yu,
Pengfei Chen,
Hongyang Chen,
Zhekang Chen