Resilient Response In Complex Systems
- 1. Resilient Response in Complex Systems
John Allspaw
SVP, Tech Ops
QCon London 2012
Sunday, March 11, 12
- 22. Complex Systems
• Cascading Failures
• Difficult to determine boundaries
• Complex systems may be open
• Complex systems may have a memory
• Complex systems may be nested
• Dynamic network of multiplicity
• May produce emergent phenomena
• Relationships are non-linear
• Relationships contain feedback loops
- 24. How Can This Happen?
It does happen.
And it will again.
And again.
- 29. Problem Starts
Timeline figure (axis: Time): Detection, Evaluation, Response, Stable, PostMortem, Confirmation, All Clear
- 30. Problem Starts
Timeline figure (axis: Time): Stress, Detection, Evaluation, Response, Stable, PostMortem, Confirmation, All Clear
- 31. Forced beyond learned roles
Actions whose consequences are both important and difficult to see
Cognitively and perceptually noisy
Coordinative load increases exponentially
- 33. So What Can We Do?
- 36. Characteristics of response to escalating scenarios
...tend to neglect how processes develop within time (awareness of rates) versus assessing how things are in the moment
“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
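One way to make the rate-versus-snapshot distinction concrete is a small Python sketch. The metric, the sample values, and the `minutes_until_threshold` helper are hypothetical, chosen only for illustration: the latest reading looks comfortable on its own, while the rate of change says the limit is minutes away.

```python
def minutes_until_threshold(samples, threshold):
    """Estimate minutes until `threshold` is reached.

    `samples` is a list of (minute, value) pairs, oldest first.
    Uses the average rate of change across the window, which is a
    deliberate simplification.
    """
    first_minute, first_value = samples[0]
    last_minute, last_value = samples[-1]
    if last_value >= threshold:
        return 0.0
    if last_minute == first_minute:
        return None  # a single point in time carries no rate information
    rate = (last_value - first_value) / (last_minute - first_minute)
    if rate <= 0:
        return None  # flat or falling: no crossing predicted
    return (threshold - last_value) / rate


# Hypothetical disk-usage readings (percent), taken five minutes apart.
readings = [(0, 40.0), (5, 55.0), (10, 70.0)]

snapshot = readings[-1][1]                     # the "in the moment" view
eta = minutes_until_threshold(readings, 85.0)  # the "awareness of rates" view
print(f"now: {snapshot}%, limit reached in about {eta} minutes")
```

A responder looking only at the snapshot sees 70% of an 85% limit; the same data read as a rate gives roughly five minutes of headroom.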
- 37. Characteristics of response to escalating scenarios
...have difficulty in dealing with exponential developments (hard to imagine how fast something can change, or accelerate)
“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
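A short Python sketch can show how far a straight-line guess falls behind an exponential development. The starting backlog, the doubling rate, and both helper names are assumptions made up for this illustration, not figures from the talk.

```python
def linear_projection(start, first_step_increase, steps):
    """Extrapolate by adding the first observed increase at every step."""
    return start + first_step_increase * steps


def exponential_projection(start, growth_factor, steps):
    """Extrapolate by multiplying by the same factor at every step."""
    return start * growth_factor ** steps


start = 100        # hypothetical: queued requests when the problem is detected
growth_factor = 2  # hypothetical: the backlog doubles every minute
# What one minute of observation shows: the backlog grew by this much.
first_increase = start * growth_factor - start

for minute in (1, 5, 10):
    guessed = linear_projection(start, first_increase, minute)
    actual = exponential_projection(start, growth_factor, minute)
    print(f"minute {minute}: linear guess {guessed}, doubling gives {actual}")
```

Both projections agree after the first minute, which is what makes the linear guess feel reasonable; they diverge sharply after that.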
- 38. Characteristics of response to escalating scenarios
...inclined to think in causal series, instead of causal nets.
A therefore B, instead of A, therefore B and C (therefore D and E), etc.
“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
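The difference between a causal series and a causal net can be sketched in Python with a small graph. The edges below are an assumption layered on the slide's letters (A leads to B and C, B to D, C to E); the function names are hypothetical.

```python
from collections import deque

# Hypothetical causal net using the slide's letters.
CAUSAL_NET = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}


def causal_series(net, start):
    """Follow only the first consequence at each step (series thinking)."""
    chain = [start]
    while net.get(chain[-1]):
        next_effect = net[chain[-1]][0]
        if next_effect in chain:  # stop if a feedback loop brings us back
            break
        chain.append(next_effect)
    return chain


def causal_net_effects(net, start):
    """Collect every downstream effect, breadth first (net thinking)."""
    seen = [start]
    queue = deque([start])
    while queue:
        cause = queue.popleft()
        for effect in net.get(cause, []):
            if effect not in seen:
                seen.append(effect)
                queue.append(effect)
    return seen


print("series:", causal_series(CAUSAL_NET, "A"))
print("net:   ", causal_net_effects(CAUSAL_NET, "A"))
```

Series thinking never reaches C or E in this example, even though both follow from A.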
- 42. Heroism
Non-communicating lone wolf-isms
- 43. Distraction
Irrelevant noise in comm channels
- 44. Jens Rasmussen, 1983
Senior Member, IEEE
“Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models”
IEEE Transactions On Systems, Man, and Cybernetics, May 1983
- 45. SKILL-BASED: Simple, routine
RULE-BASED: Knowable, but unfamiliar
KNOWLEDGE-BASED: WTF IS GOING ON?
(Reason, 1990)
- 47. High Reliability Organizations
• Air Traffic Control
• Naval Air Operations At Sea
• Electrical Power Systems
• Etc.
• Complex Socio-Technical systems
• Efficiency <-> Thoroughness
• Time/Resource Constrained
• Engineering-driven
- 49. “The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea”
Rochlin, La Porte, and Roberts. Naval War College Review 1987
http://govleaders.org/reliability.htm
- 50. "So you want to understand an aircraft carrier? Well, just imagine that it's a busy day, and you shrink San Francisco Airport to only one short runway and one ramp and gate. Make planes take off and land at the same time, at half the present time interval, rock the runway from side to side, and require that everyone who leaves in the morning returns that same day. Make sure the equipment is so close to the edge of the envelope that it's fragile. Then turn off the radar to avoid detection, impose strict controls on radios, fuel the aircraft in place with their engines running, put an enemy in the air, and scatter live bombs and rockets around. Now wet the whole thing down with salt water and oil, and man it with 20-year-olds, half of whom have never seen an airplane close-up.
Oh, and by the way, try not to kill anyone."
-- Senior officer, Air Division
- 52. Close reciprocal coordination and information sharing, resulting in overlapping knowledge
- 57. High levels of situation comprehension: maintain constant awareness of the possibility of accidents.
- 59. Maintenance of detailed records of past incidents that are closely examined with a view to learning from them.
- 60. Patterns of authority are changed to meet the demands of the events: organizational flexibility.
- 61. The reporting of errors and faults is rewarded, not punished.
- 70. Postmortems
• Full timelines: What happened, when
• Review in public, everyone invited
• Search for “second stories” instead of “human error”
• Cultivating a blameless environment
• Giving requisite authority to individuals to improve things
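The "full timelines" item can be sketched as a small Python data structure. This is only one possible shape, not the speaker's tooling: `TimelineEntry`, `build_timeline`, `elapsed`, and the incident notes are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime  # when it happened
    what: str     # what happened, in plain words


def build_timeline(entries):
    """Return the entries ordered by time, oldest first."""
    return sorted(entries, key=lambda entry: entry.at)


def elapsed(timeline, start_what, end_what):
    """Time between the first entries matching the two descriptions."""
    start = next(e.at for e in timeline if e.what == start_what)
    end = next(e.at for e in timeline if e.what == end_what)
    return end - start


# Hypothetical incident notes, recorded out of order.
notes = [
    TimelineEntry(datetime(2012, 3, 11, 14, 12), "alert fired"),
    TimelineEntry(datetime(2012, 3, 11, 14, 5), "deploy finished"),
    TimelineEntry(datetime(2012, 3, 11, 14, 40), "rollback complete"),
]

timeline = build_timeline(notes)
for entry in timeline:
    print(entry.at.strftime("%H:%M"), entry.what)
print("deploy to alert:", elapsed(timeline, "deploy finished", "alert fired"))
```

Recording only what happened and when, without a "who is at fault" field, keeps the structure in line with the blameless review described above.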
- 71. Qualifying Response
High signal:noise in comm channels?
Troubleshooting fatigue?
Troubleshooting handoff?
All tools on-hand?
Improvised tooling or solutions?
Metrics visibility?
Collaborative and skillful communication?
- 73. Mature Role of Automation
“Ironies of Automation” - Lisanne Bainbridge
http://www.bainbrdg.demon.co.uk/Papers/Ironies.html
- 74. Mature Role of Automation
• Moves humans from manual operator to supervisor
• Extends and augments human abilities, doesn’t replace them
• Doesn’t remove “human error”
• Is brittle
• Recognizes that there is always discretionary space for humans
• Recognizes the Law of Stretched Systems
- 75. Law of Stretched Systems
“Every system is stretched to operate at its capacity; as soon as there is some improvement, for example, in the form of new technology, it will be exploited to achieve a new intensity and tempo of activity”
D. Woods, E. Hollnagel, “Joint Cognitive Systems: Patterns” 2006
- 77. Near Misses
Hey everybody -
Don’t be like me. I tried to X, but
that wasn’t a good idea.
It almost exploded everyone.
So, don’t do: (details about X)
Love,
Joe
- 78. Near Misses
• Can act like “vaccines”: help system safety without actually hurting anything
• Happen more often, so provide more data on latent failures
• Serve as a powerful reminder of hazards, and slow down the process of forgetting to be afraid
- 81. 100 changes
6 change-related issues
- 83. Proposition #1
“Ways in which things go right are special cases of the ways in which things go wrong.”
- 84. Proposition #1
Successes = failures gone wrong
Study the failures, generalize from that.
Potential data sources: 6 out of 100
- 85. Proposition #2
“Ways in which things go wrong are special cases of the ways in which things go right.”
- 86. Proposition #2
Failures = successes gone wrong
Study the successes, generalize from that
Potential data sources: 94 out of 100
- 87. 94/100 ?
OR
6/100 ?
- 89. Not just:
why did we fail?
But also:
why did we succeed?
- 90. Resilient Response
• Can learn from other fields
• Can train for outages
• Can learn from mistakes
• Can learn from successes as well as failures