I hiked the beautiful coastline south of Adelaide in my youth, carefully watching the waves crashing below. Cliffs are beautiful yet perilous. Unlike the real world, cartoons afford recoverability after the pain of a fall. In 2017, I began using the analogy of a cliff fall to describe incidents with my team at Atlassian working on Confluence. More recently, I've refined and shared this analogy at Qwilr. Building and running a high-scale service, while aiming for fast development, deployment and feedback, comes with risks that resemble being near the edge of a cliff. The analogy became a good way to reason not only about the seriousness of incidents but also about the need to become resilient to them.
This post contains recommendations for incident management, but it is not comprehensive. Consider it an analogy for understanding and tracking incidents, with encouragement to respond pragmatically to incidents in the short term and build in resiliency over the long term.
Understanding the facts of a fall
Start your reflection on an incident by objectively capturing exactly what happened. These details need little discussion unless something unusual stands out.
1. How/when did we know we fell off the cliff?
What told us there was an incident? An alert? A customer report, or a spike of them? Were customers first impacted at that moment, or earlier? If earlier, how did we discover the actual start (e.g. diagnostics)? Capturing these facts, particularly where there were delays, will highlight where improvement is warranted.
If you are tracking data for incidents, TTD (time to detection) is the time between the actual start of the incident and when it was detected.
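For the curious, here's a minimal sketch of that calculation; the record shape and field names are illustrative assumptions, not from any particular tool:

```typescript
// Hypothetical incident record; field names are assumptions for illustration.
interface IncidentTiming {
  startedAt: Date;  // when customers were actually first impacted
  detectedAt: Date; // when we learned about it (alert, customer report, ...)
}

// Time to detection in minutes: actual start -> detection.
function timeToDetectionMinutes(incident: IncidentTiming): number {
  return (incident.detectedAt.getTime() - incident.startedAt.getTime()) / 60_000;
}
```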
2. How bad was our fall?
Was it a sudden 100m drop like the earlier cartoon, or was it a series of painful moments?
The overall impact on customers is called severity. It primarily reflects how bad the impact was (e.g. complete unavailability vs a single feature not working), but it is also informed by how many customers were impacted and how long the incident lasted.
For tracking, severity levels with objective definitions are useful for reviews (low-severity incidents may not warrant a meeting), trend reporting and compliance.
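As a rough sketch, those objective definitions can even live alongside your tooling; the levels and wording below are hypothetical examples to adapt, not a standard:

```typescript
// Hypothetical severity scale; adapt the definitions to your own service.
type Severity = 1 | 2 | 3;

const severityDefinitions: Record<Severity, string> = {
  1: "Complete unavailability, or data loss, for most customers",
  2: "A core feature broken or severely degraded for many customers",
  3: "A minor feature issue or degradation with a workaround",
};

// Example policy hook: only Sev 1 and Sev 2 incidents require a review meeting.
function requiresReviewMeeting(severity: Severity): boolean {
  return severity <= 2;
}
```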
3. How/when did we get back onto the cliff?
Knowing when the service is restored for customers is a critical piece of incident information. Capturing how the service was restored (e.g. feature flag, rollback or otherwise) along with the time helps inform service restoration improvements.
When tracking incidents, TTR (time to resolution) is the time from when the incident started (the service was degraded) to when the incident was resolved (the service was restored). TTR is the most important metric for incidents as it fundamentally equates to downtime.
TTR (Time to Resolution) is the most important metric to track for incidents.
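Extending the earlier sketch, TTR is another simple difference of timestamps, and summing it across incidents gives your downtime for a reporting period (again, the record shape and names are illustrative assumptions):

```typescript
// Hypothetical incident record for resolution tracking.
interface ResolvedIncident {
  startedAt: Date;  // service became degraded
  resolvedAt: Date; // service was restored (feature flag, rollback, ...)
}

// Time to resolution in minutes for a single incident.
function timeToResolutionMinutes(incident: ResolvedIncident): number {
  return (incident.resolvedAt.getTime() - incident.startedAt.getTime()) / 60_000;
}

// Total downtime over a reporting period: the sum of each incident's TTR.
function totalDowntimeMinutes(incidents: ResolvedIncident[]): number {
  return incidents.reduce((sum, i) => sum + timeToResolutionMinutes(i), 0);
}
```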
Responding to a fall
Once your team has established the facts of an incident, focus on the two questions that must be answered when reviewing an incident.
4. What was the cause and reason for falling?
I once managed a product that tracked and reported on downtime events for plant machinery to inform maintenance investment. It attributed production loss (e.g. gold mined) to a two-level code structure of “cause” (e.g. an engine seized) and “reason” (e.g. “maintenance overdue”).
Similarly, while a bad commit may have caused an incident, to get to the reason (also known as the root cause), the team must ask "why?" a few times to reach the right level. It may help to capture the "cause" and "reason" separately, or to focus only on the root cause.
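If you do capture them separately, here's a sketch of what that record might look like; the field names and example values are purely illustrative:

```typescript
// Hypothetical two-level attribution, mirroring the "cause"/"reason" codes
// described above.
interface IncidentAttribution {
  cause: string;  // what directly triggered the incident
  reason: string; // the underlying (root) cause reached by asking "why?"
}

const example: IncidentAttribution = {
  cause: "Bad commit deployed to production",
  reason: "No automated check covered this configuration path",
};
```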
A healthy DevOps culture must keep cause/reason discussions blameless. This ensures objective analysis and open collaboration.
5. How will we prevent this fall in future?
The most important thing you can leave a review with is clear, timely, priority action(s) that will prevent a similar incident from occurring again.
Positively challenge ideas that focus on the “cause” rather than the “reason.” E.g. “we need to write unit tests to catch bad commits” has good intentions, but in many cases won’t fix the specific reason for an incident. If ideas arise that are larger in effort and potentially in impact (maybe they fix a whole class of issues), record them as possible improvements (see resiliency in the next section). However, continue to search for an action that will make a timely impact on the reason.
Ensuring the completion of priority action(s) is the most important thing you can do to ensure incident reviews are worthwhile.
Becoming resilient to falls
Repeatedly completing priority actions tends to stabilise a service by addressing its obvious weaknesses. However, this alone won’t get you to any desired level of availability, nor bring you back “from the edge of the cliff” regarding incidents. Further, incremental changes can create bloated tests and slow your team down anyway. Therefore, if you want to “move fast and try not to break things”, you need investment in both velocity and resilience.
6. How can we more quickly recover from falling off of the cliff?
Velocity is important, and it is critical during an incident. If resolution time (TTR) is systemically slow, determine whether this stems from detection (TTD), troubleshooting difficulties or fundamentally slow development and deployment, and focus improvement accordingly.
If detection (TTD) in particular is slow, this may indicate a lack of observability, but also consider people factors such as communication between Support and Engineering teams.
You may not have hard or contractual targets for incident response. However, consider softer internal targets for TTR and TTD at each severity level, and treat breaches as a signal to explore possible improvements. Also, consider incident frequency trends. When you do get to ranking improvements, ensure you either improve or at least maintain velocity.
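Here's a sketch of what soft targets and breach checks could look like; the numbers are made up and should be tuned to your own service:

```typescript
// Hypothetical soft targets in minutes per severity level; tune to your service.
const softTargets: { [severity: number]: { ttd: number; ttr: number } } = {
  1: { ttd: 5, ttr: 60 },
  2: { ttd: 15, ttr: 240 },
  3: { ttd: 60, ttr: 1440 },
};

// Returns which soft targets (if any) an incident breached, as a prompt to
// explore improvements -- not as a stick to beat the team with.
function breachedTargets(severity: number, ttdMinutes: number, ttrMinutes: number): string[] {
  const target = softTargets[severity];
  const breaches: string[] = [];
  if (ttdMinutes > target.ttd) breaches.push("TTD");
  if (ttrMinutes > target.ttr) breaches.push("TTR");
  return breaches;
}
```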
7. How can we make the cliff smaller?
Another improvement approach is to reduce the severity of incidents should they occur, not just try to better prevent them or recover from them. In our analogy, this is making the cliff itself smaller in certain scenarios.
In a microservices architecture, tactics such as the circuit breaker pattern stop one service’s downtime from triggering widespread downtime. Graceful degradation in UX, where a blank experience is shown if a component errors rather than breaking the whole page, is another tactic. Advanced forms of these tactics are captured in the concept of chaos engineering.
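To make the circuit breaker idea concrete, here's a minimal sketch (not production-ready, and not tied to any particular library) of a wrapper that fails fast instead of hammering a sick dependency:

```typescript
// Minimal circuit breaker sketch: after `failureThreshold` consecutive
// failures the breaker "opens" and fails fast until `cooldownMs` has passed.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly failureThreshold = 5,
    private readonly cooldownMs = 30_000
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.openedAt !== null) {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("Circuit open: failing fast instead of calling a failing dependency");
      }
      // Cool-down elapsed: allow a trial call (half-open).
      this.openedAt = null;
    }
    try {
      const result = await fn();
      this.failures = 0; // success resets the failure count
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

A caller would wrap outbound calls, e.g. `breaker.call(() => fetch(url))`, so repeated failures trip the breaker rather than cascade through the system.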
8. How can we move further away from the cliff and stay fast?
The reason to be “close to the edge” is velocity and feedback. This final question asks if we can move further away, i.e. reduce the probability of an incident, without losing that advantage. The answer: it is possible, but it requires investment.
Future-testing is one way: test your service at 5x your current peak load, or generate test data at 2x your current database’s size and run load tests against it. If you can simulate future scenarios, failures will occur in simulations and not in production.
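As a tiny sketch of the load-testing half of this, using nothing beyond Node's built-in fetch (the endpoint and numbers are assumptions; a dedicated load-testing tool will do far more, such as pacing requests and reporting latency percentiles):

```typescript
// Hypothetical future-test: fire batches of concurrent requests against an
// endpoint and report how many succeed. Endpoint and numbers are illustrative.
async function futureLoadTest(url: string, concurrency: number, batches: number) {
  let ok = 0;
  let failed = 0;
  for (let b = 0; b < batches; b++) {
    const batch = Array.from({ length: concurrency }, () =>
      fetch(url).then(
        (res) => (res.ok ? ok++ : failed++),
        () => failed++
      )
    );
    await Promise.all(batch);
  }
  console.log(`Succeeded: ${ok}, failed: ${failed}`);
}

// e.g. 5x an assumed current peak of 40 concurrent requests, for 60 batches:
// await futureLoadTest("https://staging.example.com/health", 200, 60);
```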
Automated tests remain a legitimate tactic, but consider their run time a budget that needs to be spent wisely. In the Confluence team, I once approved, to the surprise of the DevOps manager who suggested it, the deletion of several automated tests. This freed up our budget for more valuable tests and made the team faster.
Tension between velocity and availability will always exist. Use tactics that strike the right balance for your service, your organisation and, most importantly, your customers.
One more tactic to cover here is error budgets, i.e. the amount of downtime allowable in a period while still meeting availability goals, after which the team switches to more conservative deployment criteria. This can serve as a rearguard against customer impact and is critical if there are hard/contractual targets for availability.
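The arithmetic behind an error budget is simple; here's a sketch assuming a hypothetical 99.9% monthly availability goal (the goal and numbers are illustrative):

```typescript
// Error budget: the downtime a period allows minus the downtime already spent.
// A 99.9% goal over a 30-day month allows ~43.2 minutes of downtime.
function remainingErrorBudgetMinutes(
  availabilityGoal: number,    // e.g. 0.999
  periodMinutes: number,       // e.g. 30 * 24 * 60
  downtimeSoFarMinutes: number // sum of TTR across the period's incidents
): number {
  const allowedMinutes = periodMinutes * (1 - availabilityGoal);
  return allowedMinutes - downtimeSoFarMinutes;
}

// Example: with 35 minutes already spent, ~8.2 minutes of budget remain.
const remaining = remainingErrorBudgetMinutes(0.999, 30 * 24 * 60, 35);
const conservativeDeployments = remaining <= 0; // switch criteria once exhausted
console.log({ remaining, conservativeDeployments });
```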
Learning and persistence
Wile E. Coyote may not be a great example of learning from one's mistakes, but he is a good example of resilience and persistence! Incidents will happen. Combine these dos and don'ts: pick yourself and your team up after responding to an incident, maintain a blameless culture as you reflect on how to prevent it from happening again, and form a strategy for building longer-term resilience into your service!
Credits / shout-outs:
- The many colleagues I sparred with on operational health at Atlassian.
- Haymo for suggesting awesome resiliency tactics in a certain high scale service.
- Pete for suggesting to delete those tests.
- Warner Bros for the classic Wile E Coyote cartoons.