Engineering excellence starts with a production incident.
Learning from mistakes is a topic near and dear to my heart. Both personally and professionally, I am intrigued by the questions: What is the learning here? How can we do better? How can we do things differently?
Over the last 20 years of leading engineering organizations with diverse skill sets and experience levels, I have seen the pitfalls of discounting these learnings in favor of quick delivery.
Learning requires a Curious Mind and an Objective view
While learning from mistakes is cringe-inducing and takes immense vulnerability, the transformative effect that learning from incidents has on engineering culture cannot be ignored.
Every incident is an opportunity to learn and excel at engineering best practices. While this sounds philosophical, incidents and outages are glaring symptoms of the health of engineering teams and organizations, and a direct opportunity to innovate. They offer direct insight into a team's culture, strengths, abilities, and the pressures it operates under.
These visible signs are conveniently ignored after incident remediation and buried in incident trend reports that become mere statistics in the eyes of leaders and team members.
How can you use Incidents to learn about a team's culture?
Well, here is a typical story of engineers battling incidents and software outages over the weekend. It is 2:00 AM on a Saturday, and an engineer is paged: the application is down with 500 errors. Half asleep, the engineer starts triaging, trying to figure out the culprit. Is it disk space? Is it logging? Is it a DDoS attack? Is it a scaling issue? A memory issue? A hardware failure? (Yes, this can happen too.) Is a downstream system not responding? Is it a network connectivity issue? Is it a rate limiter issue?
With a myriad of possible causes, the engineer is under tremendous pressure to identify and mitigate the issue and get the system back up. After hours of push and pull, the engineer triumphs in getting the system up and running.
What happens next is mostly ignored, yet it is crucial for understanding and building engineering culture.
Most engineering leaders examine the incident and conduct an RCA (root cause analysis), detailing the source of the failure and peeling the onion with the 'whys' to understand the root cause. This focuses heavily on why the incident occurred and on generating corrective actions to prevent recurrence. While this sounds promising, it falls short because most root causes (95%, from my 20 years of observation) are unique.
Conduct a Postmortem rather than an RCA.
A thorough Postmortem that extends beyond the root cause empowers the incident response team and engineering leaders to drive meaningful improvements: from design and coding standards, to reducing detection and mitigation times, to innovating scalable designs and recovery techniques.
What is the difference?
While both an RCA and a Postmortem start from the moment the incident is detected, the difference is in the approach. An RCA works inside-out, examining the remediation performed and the code or infrastructure issue that caused the incident.
A Postmortem takes a bi-directional view (outside-in and inside-out), examining the actual customer impact, the blast radius of that impact, the incident timeline from issue introduction to detection to remediation, and the RCA. Blast radius goes beyond directly measured customer impact: it includes damage to brand value and estimates of impact that volume metrics during the outage alone cannot capture.
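To make this concrete, here is a minimal sketch of what such a Postmortem record might capture, written in Python. The field names are illustrative assumptions, not a standard template.

```python
# A minimal sketch of a Postmortem record that goes beyond the root cause.
# Field names are illustrative assumptions, not a standard template.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class PostmortemRecord:
    incident_id: str
    # Outside-in view: what customers actually experienced.
    customer_impact: str            # e.g. "checkout returned 500s for 40 minutes"
    blast_radius: str               # impact beyond metrics: trust, brand, support load
    detected_by_customer: bool      # did customers find it before we did?
    # Timeline: from issue introduction to detection to remediation.
    introduced_at: datetime
    detected_at: datetime
    mitigated_at: datetime
    # Inside-out view: the traditional RCA plus forward-looking actions.
    root_cause: str
    corrective_actions: List[str] = field(default_factory=list)

    @property
    def time_to_detect(self):
        return self.detected_at - self.introduced_at

    @property
    def time_to_mitigate(self):
        return self.mitigated_at - self.detected_at
```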
Reviewing the blast radius of customer impact helps us see beyond the root cause:
Did we detect this incident or did our customers detect it before us?
Did we diminish customer trust?
Did we damage our brand name and value?
Corrective actions from Postmortems go beyond code quality alone and address forward-looking improvements in detection and logging, on-call processes, product feature functional and non-functional requirements, technical design processes, and automated and manual testing and release processes.
Postmortems, if written well, are a powerful tool for propagating learnings about scalable system design, code design, product design, and operational troubleshooting flaws across the entire org and the company as a whole.
They are a treasure trove of learnings that allow us to learn from each other's mistakes without having to repeat them ourselves.
Sounds good, but this takes time, which is limited. How do we get our leaders to invest in these learnings?
From my personal experience a few years ago, writing thorough Postmortems reduced incidents in my engineering team by 80% (from one incident a week to one incident a quarter).
Use customer anecdotes to appeal to the emotional mind
Customer anecdotes from both positive and negative experiences help drive home the point that customer trust and brand value correlate directly with product reliability and quality.
Use metrics-driven rationale to appeal to the logical mind
Measuring and sharing incident detection durations, the percentage of incidents detected by customers, incident mitigation durations, and the number of incidents helps drive home the case for reclaiming developer productivity hours across the org.
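As a rough illustration, here is a minimal sketch of how these metrics could be computed from a hypothetical list of incident records; the record fields and sample values are assumptions for illustration only.

```python
# A minimal sketch of the incident metrics mentioned above, computed from a
# hypothetical list of incident records. Fields and values are illustrative.
from datetime import datetime
from statistics import mean

incidents = [
    {"introduced_at": datetime(2024, 1, 6, 1, 10), "detected_at": datetime(2024, 1, 6, 2, 0),
     "mitigated_at": datetime(2024, 1, 6, 5, 30), "detected_by_customer": True},
    {"introduced_at": datetime(2024, 2, 1, 8, 45), "detected_at": datetime(2024, 2, 1, 9, 0),
     "mitigated_at": datetime(2024, 2, 1, 9, 40), "detected_by_customer": False},
]

detection_minutes = [(i["detected_at"] - i["introduced_at"]).total_seconds() / 60 for i in incidents]
mitigation_minutes = [(i["mitigated_at"] - i["detected_at"]).total_seconds() / 60 for i in incidents]

print("Total incidents:", len(incidents))
print("Mean time to detect (min):", mean(detection_minutes))
print("Mean time to mitigate (min):", mean(mitigation_minutes))
print("% detected by customers:",
      100 * sum(i["detected_by_customer"] for i in incidents) / len(incidents))
```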
How to leverage Postmortems to measure organizational health?
Out of curiosity, I once analyzed the corrective actions and summaries of over 21 high-severity incidents over the course of a quarter, and here are some of the interesting patterns that emerged.
1. Missing Documented Functional & Non-Functional Requirements — Results in ambiguity in engineers’ understanding of product behavior and constraints.
2. Assumptions of no customer impact for backward incompatible changes — Results in losing out on existing customer trust where new code altered existing product behavior drastically.
3. Missing Documented Low-level tech Designs & scaling constraints — Results in lowering availability and reliability due to missing error handling for downstream system and infrastructure dependencies.
4. Missing alerts and monitoring on volume, traffic patterns, latency, disk space — Results in customers detecting issues before engineering teams do, lowering customer trust.
5. No or Missing Functional Tests & Intentional Overriding of Regression Tests — Results in inconsistent product behavior under different conditions, impacting customer perception of product quality, brand value and customer retention.
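As an illustration of this kind of analysis, here is a minimal sketch that tags each Postmortem's corrective actions with a category and counts the recurring themes; the categories and sample data are assumptions, not real incident data.

```python
# A minimal sketch of pattern analysis over Postmortem corrective actions:
# tag each Postmortem with one or more categories, then count recurring themes.
# Categories and sample data are illustrative assumptions.
from collections import Counter

CATEGORIES = [
    "missing requirements",
    "backward-incompatible change",
    "missing low-level design",
    "missing alerts/monitoring",
    "missing or overridden tests",
]

# Each Postmortem's corrective actions, already tagged during review.
postmortem_tags = [
    ["missing alerts/monitoring", "missing requirements"],
    ["missing or overridden tests"],
    ["missing alerts/monitoring"],
]

pattern_counts = Counter(tag for tags in postmortem_tags for tag in tags)
for category in CATEGORIES:
    print(f"{category}: {pattern_counts[category]} incident(s)")
```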
From these five patterns, I could identify opportunities for improvement in communication and in engineering best practices, from documenting, designing, coding, testing, and deployment to troubleshooting and on-call processes. These improvements formed a basis for growing engineering talent in designing and developing more reliable, available, and scalable systems.
This enabled me to present the findings, along with proposed improvements and goals, widely across the organization, leading to a deeper, holistic change in the product development lifecycle.
How to roll out a Postmortem culture?
· Start by recruiting champions who are emotionally bought into customer trust and product quality.
· Invest in training engineers on how to write a detailed Postmortem and foster a culture of pride in sharing learnings from mistakes.
· Establish quality bar-raiser review teams.
· Include Postmortems as part of the engineer calibration guidelines.
· Hold Postmortem learning sessions with engineers from cross-functional teams.
· Measure the effectiveness of Postmortems with quarterly goals, such as reducing org-wide customer-detected incidents, incident detection duration, and incident mitigation duration, and improving customer satisfaction scores on product quality.
Finally, persistent and deliberate investment in Postmortems holistically improves organizational health, product quality, and customer trust. It is the single most proven insight into the inner workings of an engineering organization.