You are in luck: there is a lot of contemporary work out there on conducting constructive incident retrospectives to learn from technical problems.

It's referred to as "incident postmortems," "(post-)incident analysis," or "incident retrospectives." The 2010 book Web Operations included a paper called How Complex Systems Fail by Dr. Richard Cook; it was written about medical systems but was accessible to technologists as well, and it brought current thinking on safety and resilience from the hard engineering disciplines into the view of software engineers. (He presented it at the Velocity conference in 2012; here's video of the session.) Then John Allspaw, who was CTO of Etsy at the time, kicked off the modern wave of postmortem thinking in 2012 with his article Blameless PostMortems and a Just Culture, which established the term. (He currently runs a consultancy specifically oriented towards coaching organizations in performing incident analysis.)

Now that you know what you're looking for and have some terms to use in Google, there's content everywhere. There is a LinkedIn Learning course on Effective Postmortems, and the Google SRE books that are free online all have chapters on the subject: Postmortem Culture: Learning from Failure in "Site Reliability Engineering," an expansion of that in an identically titled chapter in The SRE Workbook, and Chapter 18: Recovery and Aftermath in Building Secure & Reliable Systems. There's a "Learning from Incidents in Software" group on Twitter (@LFISoftware). Companies like Atlassian and PagerDuty have put together basic primers on the topic that are easier reading. There are good books and papers on the topic in general, though not IT specific, from researchers like Sidney Dekker and Erik Hollnagel. You can go as far as enrolling in the Lund University Master's program in Human Factors and System Safety, which some technologists deeply involved in resilience engineering and postmortems have done in order to learn from the knowledge that already exists in the larger safety/investigation space. You really can go down the rabbit hole as far as you want on this topic; it'll lead you into incident response, resilience engineering, chaos engineering, and other related spaces as well.

Top 3 Hot Tips:

  • The idea of a single "root cause" is flawed and overly simplistic for modern complex systems, and the term is deprecated in current postmortem thinking.
  • The idea of "human error" as a root cause is even more flawed and is seldom useful in analysis, hence "blameless" or "blame-aware" postmortems.
  • The goal of a postmortem is to learn as an organization, not necessarily to "fix something" or "make sure that doesn't happen again"; it is people who create safety in systems.

I've been involved in running postmortems for the last 10 years in a variety of organizations, and if you do them well they're your best way to learn from issues. Routine retros are also good, and as an aside, whoever says sprint retros are "only supposed to be about communication" or whatever can hose off - they should be about whatever the team believes needs to improve, and if that's something technical, it's something technical. I'd never dream of calling something "off limits" for a retrospective (except name-calling or something, of course).
