The document discusses responding to outages and system failures in a mature way. It covers several key points:
- Outages are complex and can have cascading effects that are difficult to anticipate. Teams must learn to respond effectively.
- High reliability organizations like air traffic control respond well through close coordination, redundancy, flexibility and learning from both failures and successes.
- Teams can improve their response by practicing through drills, sharing lessons from near misses, and having open post-mortems to prevent future issues.
- Both successes and failures contain valuable lessons, and systems tend to operate at maximum capacity, so constant improvement is needed to handle inevitable stresses and failures.
Responding to Outages Maturely
1. Responding to Outages Maturely
John Allspaw
SVP, Tech Ops
Code As Craft, Berlin
Tuesday, April 24, 12
22. Complex Systems
• Cascading Failures
• Difficult to determine boundaries
• Complex systems may be open
• Complex systems may have a memory
• Complex systems may be nested
• Dynamic network of multiplicity
• May produce emergent phenomena
• Relationships are non-linear
• Relationships contain feedback loops
23. How Can This Happen?
It does happen.
And it will again.
And again.
27. How does team troubleshooting happen?
28. [Diagram: incident timeline. Labels: Problem Starts, Detection, Evaluation, Response, Stable, PostMortem, Confirmation, All Clear; axis: Time]
29. [Diagram: the same incident timeline with Stress added. Labels: Problem Starts, Stress, Detection, Evaluation, Response, Stable, PostMortem, Confirmation, All Clear; axis: Time]
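One way to make the timeline on these two slides concrete is to record when each phase began and measure the gaps between them. The sketch below is a minimal illustration and not anything shown in the talk: the function name `phase_durations` is hypothetical, every timestamp is invented, and the order of phases comes from those invented timestamps rather than from the slide.

```python
from datetime import datetime


def phase_durations(events):
    """Return (from_phase, to_phase, minutes) for consecutive events, ordered by time."""
    ordered = sorted(events.items(), key=lambda item: item[1])
    return [
        (a_name, b_name, (b_time - a_time).total_seconds() / 60)
        for (a_name, a_time), (b_name, b_time) in zip(ordered, ordered[1:])
    ]


# Hypothetical timestamps for one incident (assumed for illustration only).
incident = {
    "Problem Starts": datetime(2012, 4, 24, 10, 0),
    "Detection": datetime(2012, 4, 24, 10, 7),
    "Evaluation": datetime(2012, 4, 24, 10, 12),
    "Response": datetime(2012, 4, 24, 10, 30),
    "Stable": datetime(2012, 4, 24, 11, 5),
}

durations = phase_durations(incident)
for start, end, minutes in durations:
    print(f"{start} -> {end}: {minutes:.0f} min")
```

Comparing these gaps across several incidents is one possible way to see where a team's time tends to go, for example whether detection or evaluation dominates.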
30. Forced beyond learned roles
Actions whose consequences are both important and difficult to see
Cognitively and perceptually noisy
Coordinative load increases exponentially
35. Characteristics of response to escalating scenarios
...tend to neglect how processes develop within time (awareness of rates) versus assessing how things are in the moment
“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
36. Characteristics of response to escalating scenarios
...have difficulty in dealing with exponential developments (hard to imagine how fast something can change, or accelerate)
“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
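The two points about rates and exponential developments can be illustrated with a small worked calculation. This is a sketch under stated assumptions, not material from the talk: the function names `linear_eta` and `exponential_eta` are hypothetical, and the queue depths and limit are made-up numbers. It takes two samples of a growing metric and estimates how long until a limit is reached, first assuming the growth stays linear and then assuming it stays exponential.

```python
import math


def linear_eta(t0, v0, t1, v1, limit):
    """Minutes after t1 until `limit` is reached if growth stays linear."""
    slope = (v1 - v0) / (t1 - t0)  # units per minute
    return (limit - v1) / slope


def exponential_eta(t0, v0, t1, v1, limit):
    """Minutes after t1 until `limit` is reached if growth stays exponential."""
    rate = math.log(v1 / v0) / (t1 - t0)  # per-minute exponential growth rate
    return math.log(limit / v1) / rate


# Hypothetical samples: queue depth 100 at minute 0, 200 at minute 5,
# with trouble assumed to start at a depth of 10,000.
linear = linear_eta(0, 100, 5, 200, 10_000)
exponential = exponential_eta(0, 100, 5, 200, 10_000)

print(f"linear guess:      {linear:.0f} min")
print(f"exponential guess: {exponential:.1f} min")
```

With these invented numbers the linear estimate is 490 minutes, while the exponential estimate is about 28 minutes. The same two samples support both readings, which is why looking only at the current value, or extrapolating in a straight line, can badly overstate the time available.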
37. Characteristics of response to escalating scenarios
...inclined to think in causal series, instead of causal nets.
A therefore B,
instead of
A, therefore B and C (therefore D and E), etc.
“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
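The difference between a causal series and a causal net can be sketched as a graph walk. The example below is illustrative only: the mapping of B to D and C to E is an assumption about how to read the slide's letters, and the function names `first_chain` and `downstream_effects` are hypothetical.

```python
from collections import deque

# Hypothetical causal net using the slide's letters. Assumed reading:
# A leads to B and C, B leads to D, C leads to E.
CAUSES = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}


def first_chain(graph, start):
    """Causal-series view: follow only the first listed effect at each step."""
    chain, node, visited = [], start, {start}
    while graph.get(node):
        node = graph[node][0]
        if node in visited:  # stop if a feedback loop brings us back
            break
        visited.add(node)
        chain.append(node)
    return chain


def downstream_effects(graph, start):
    """Causal-net view: breadth-first walk over every reachable effect."""
    effects, visited, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        for effect in graph.get(node, []):
            if effect not in visited:
                visited.add(effect)
                effects.append(effect)
                queue.append(effect)
    return effects


series = first_chain(CAUSES, "A")
net = downstream_effects(CAUSES, "A")
print("series view:", series)
print("net view:   ", net)
```

In this toy graph the series view reaches only B and D, while the net view also finds C and E, the consequences a single-chain explanation would miss.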
41. Heroism
Non-communicating lone wolf-isms
42. Distraction
Irrelevant noise in comm channels
43. Jens Rasmussen, 1983
Senior Member, IEEE
“Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models”
IEEE Transactions On Systems, Man, and Cybernetics, May 1983
44. SKILL-BASED: Simple, routine
RULE-BASED: Knowable, but unfamiliar
KNOWLEDGE-BASED: WTF IS GOING ON?
(Reason, 1990)
45. Team Troubleshooting
• Which causes did you consider first?
• Which ones did you not consider at all?
• How much of what you considered comes from recent history?
• How much comes from observations from other team members?
46. Team Troubleshooting
• How effective is the response team in communicating to other groups? Users?
• How long does it take to exhaust obvious cause(s)?
48. High Reliability Organizations
• Air Traffic Control
• Naval Air Operations At Sea
• Electrical Power Systems
• Etc.

• Complex Socio-Technical systems
• Efficiency <-> Thoroughness
• Time/Resource Constrained
• Engineering-driven
50. “The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea”
Rochlin, La Porte, and Roberts. Naval War College Review 1987
http://govleaders.org/reliability.htm
71. Postmortems
• Full timelines: What happened, when, who involved
• Review in public, everyone invited
• Search for “second stories” instead of “human error”
• Cultivating a blameless environment
• Giving requisite authority to individuals to improve things
72. Qualifying Response
High signal:noise in comm channels?
Troubleshooting fatigue?
Troubleshooting handoff?
All tools on-hand and working?
Improvised tooling or solutions?
Metrics visibility?
Collaborative and skillful communication?
75. Near Misses
Hey everybody -
Don’t be like me. I tried to X, but
that wasn’t a good idea.
It almost exploded everyone.
So, don’t do: (details about X)
Love,
Joe
76. Near Misses
• Can act like “vaccines” - help system safety without actually hurting anything
• Happen more often, so provide more data on latent failures
• Powerful reminder of hazards, and slow down the process of forgetting to be afraid
77. Practice!
• How we troubleshoot in the moment, as a distributed team
• How we handle time pressure
• How we Observe/Orient/Decide/Act
• How we communicate during emergencies
• How we trust (or not) each other during emergencies
• How we relate to emergencies when things are normal
• How we could detect how we are protected during normal times (i.e., why aren’t we going down RIGHT NOW?)
78. Resilient Response
• Can learn from other fields
• Can train for outages
• Can learn from mistakes
• Can learn from successes as well as failures
85. Proposition #1
“Ways in which things go right are special cases
of the ways in which things go wrong.”
86. Proposition #1
Successes = failures gone wrong
Study the failures, generalize from that.
Potential data sources: 6 out of 100
87. Proposition #2
“Ways in which things go wrong are special
cases of the ways in which things go right.”
88. Proposition #2
Failures = successes gone wrong
Study the successes, generalize from that
Potential data sources: 94 out of 100
90. What and WHY Do Things Go RIGHT?
91. Not just:
why did we fail?
But also:
why did we succeed?
92. Mature Role of Automation
“Ironies of Automation” - Lisanne Bainbridge
http://www.bainbrdg.demon.co.uk/Papers/Ironies.html
93. Mature Role of Automation
• Moves humans from manual operator to supervisor
• Extends and augments human abilities, doesn’t replace them
• Doesn’t remove “human error”
• Is brittle
• Recognizes that there is always discretionary space for humans
• Recognizes the Law of Stretched Systems
94. Law of Stretched Systems
“Every system is stretched to operate at its capacity; as soon as there is some improvement, for example, in the form of new technology, it will be exploited to achieve a new intensity and tempo of activity”
D. Woods, E. Hollnagel, “Joint Cognitive Systems: Patterns” 2006