Cloud | Welcome to Guangba's HomePage

When a cloud service experiences a failure, it is typical to conduct a “post-mortem” analysis after recovery to understand what went wrong, what went right, and how the team could do better in the future. When those failures are public-facing, it is common for some portion of those post-mortem analyses to be made publicly available. The paper analyzes 354 publicly visible post-mortems from three popular large-scale clouds. Based on the findings of that analysis, the authors suggest guidelines for fault handling that draw on chaos engineering, observability, and intelligent operations.