52

I am part of a software development team, and I am looking for a way/process to collect information about any technical problems that have recently happened and document them so that we, as a technical team, can easily analyze what went wrong and then change or improve the way we work.

By technical problems I am not referring to small bugs, but to issues that caused serious disruption, either to the business/customers or to other developers.

I was wondering whether this can be solved via some agile process, or whether there are other ways that big companies address this.

I am not looking for the one best approach, but for the various approaches that more experienced teams/companies have been using.

8
  • 17
    Seems like this question is a much better fit for softwareengineering.stackexchange.com . If you want to generalize into how teams of all types learn from mistakes, OK, fine, but the question seems scoped quite specifically to issues of software development. Commented Sep 10, 2020 at 23:59
  • 12
    I don't think you understand agile at all - the retrospective meeting at the end of each sprint is intended exactly for this sort of analysis and reflection, with the end result being concrete suggestions for improvement that you integrate into the next sprint. At the end of that sprint, you use its retro to evaluate whether what you did had the desired outcome, and incorporate that feedback into the next sprint - rinse repeat. Agile is not about defining universal processes that work for every company, it is about being quick to react and willing to adapt as relevant for your company.
    – Ian Kemp
    Commented Sep 11, 2020 at 12:11
  • 1
    Don't you think that comes down to two separate, largely unanswerable Questions? How do big development teams (anything)? How does anyone learn from mistakes? Commented Sep 11, 2020 at 21:56
  • 1
    I just want to comment that it's far better to learn from other companies' mistakes than from your own. Meaning that following established best practices is far better than forging your own path and coming to the same conclusions. Not really an answer, but a focus on preventative analysis is likely to be received better than a post-analysis, where people are fearful of the blame game. Commented Sep 12, 2020 at 8:02
  • 1
    @IanKemp In my experience retrospectives generally only identify problems which occurred in the last sprint or two (which is important, don't get me wrong) but often miss longer-term trending problems, which is what I think OP is trying to ask about. E.g., I watched a talk recently where, after writing and fixing millions of lines of code, they discovered the way they were using if statements was causing 90% of their bugs. I think OP wants to learn ways of discovering those kinds of root causes. Commented Sep 13, 2020 at 1:23

9 Answers

103

The moment people realize that blame is being handed out, they will obfuscate and hide their mistakes as best they can, scapegoating others and generally making it impossible to learn anything useful.

A much better idea is to do a no-blame post-mortem of the project. The goal is to go over the entire project from start to finish, examining at each stage what worked well and what went wrong. At no point should any blame be assigned to any member of the team. Instead the focus should be on identifying failures and putting procedures in place to make sure they don't happen again.

To err is human, which is why we have checks and procedures in place. Therefore, if a mistake goes uncorrected or unnoticed, it is a problem with the procedures, not with the individual (except in extreme circumstances such as gross negligence or misconduct).

In fact, having a no-blame culture is a great way to prevent these things from happening in the first place, because people will feel able to come forward with issues as they happen, not long after they have caused a huge problem.

10
  • 7
    Can't upvote this enough, and I still don't see it happen often enough. Done well and without assigning blame, it's a very effective tool and learning experience. I've been really happy with the outcome of some of those - knowing you've put improvements in place to avoid mistakes is a relief. Much better than everyone keeping their head down or colleagues being shouted at
    – bytepusher
    Commented Sep 10, 2020 at 17:11
  • 2
    But what if you harshly penalize obfuscating and scapegoating? :)
    – nanoman
    Commented Sep 10, 2020 at 19:55
  • 6
    Downvoted because 1) a whole project is far too long a time scale; you want everybody to learn from everybody else as part of the project, so a mistake only happens once within a single project, and 2) this answer seems to indicate that the only way to learn during a project would involve a blame game/scapegoating.
    – l0b0
    Commented Sep 10, 2020 at 20:38
  • 2
    @nanoman Then you lose all of your good people, because now the management strategy is to shovel blame on people when things go wrong and then to use fear and punishment to prevent them from trying to dodge when you start throwing s**t at the fan. This is a strategy used by authoritarians and brutal dictators. I don't think we need that as a foundational element of growing a creative and productive team.
    – J...
    Commented Sep 11, 2020 at 15:38
  • 5
    @nanoman Beatings will continue until morale improves
    – corsiKa
    Commented Sep 11, 2020 at 18:29
39

You are in luck, there is a lot of contemporary work on conducting constructive incident retrospectives out there to learn from technical problems.

It's referred to as "incident postmortems," "(post-)incident analysis," or "incident retrospectives." The 2010 book Web Operations included a paper called How Complex Systems Fail by Dr. Richard Cook - it was written for medical systems but was accessible to technologists as well, and it brought current thinking in safety and resilience from the hard engineering disciplines into the view of software engineers. (He presented it at the Velocity conference in 2012; here's a video of the session.) Then John Allspaw, who was CTO of Etsy at the time, kicked off the modern wave of postmortem thinking in 2012 with his article Blameless PostMortems and a Just Culture, which established the term. (He currently runs a consultancy specifically oriented towards coaching organizations in performing incident analysis.)

Now that you know what you're looking for and have some terms to use in Google, there's content everywhere. There is a LinkedIn Learning course on Effective Postmortems, and the Google SRE books that are free online all have chapters on the topic - Postmortem Culture: Learning from Failure in "Site Reliability Engineering," an expansion on that in an identically titled chapter in The SRE Workbook, and Chapter 18: Recovery and Aftermath in Building Secure & Reliable Systems. There's a "Learning from Incidents in Software" group on Twitter (@LFISoftware). Companies like Atlassian and PagerDuty have put together basic primers on the topic that are easier reading. There are good books and papers on the topic in general, though not IT-specific, from researchers like Sidney Dekker and Erik Hollnagel. You can go so far as to enroll in the Lund University Master's program in Human Factors and System Safety, which some technologists who are really deep into resilience engineering and postmortems have done in order to learn from the knowledge in the larger safety/investigation space. You really can go down the rabbit hole as much as you want on this topic. It'll lead you into incident response, resilience engineering, chaos engineering, and other related spaces as well.

Top 3 Hot Tips:

  • The idea of a single "root cause" is flawed and overly simplistic for modern complex systems and the term is deprecated in current postmortem thinking
  • The idea of "human error" as a root cause is even more flawed and is seldom useful in analysis, hence "blameless" or "blame-aware" postmortems
  • The goal of a postmortem is to learn as an organization, not necessarily to "fix something" or "make sure that doesn't happen again"; it is people who create safety in systems.

I've been involved in running postmortems for the last 10 years in a variety of organizations, and if you do it well it's your best way to learn from issues. Routine retros are also good, and as an aside whoever says sprint retros are "only supposed to be about communication" or whatever can hose off - they should be about whatever the team believes needs to improve to help them, and if that's something technical it's something technical. I'd never dream of calling something "off limits" for a retrospective (except namecalling or something of course).

8

By technical problems I am not referring to small bugs, but to issues that caused serious disruption, either to the business/customers or to other developers.

If you're talking about serious one-off issues, you're looking to have a post-mortem. You can run the post-mortem in whatever way you like, but you're typically looking to drill down into what happened and the exact root cause of the issue, identify all the places where it could have been avoided, and then change protocols and working practices to try to prevent that, or similar problems, from happening again.
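
As a rough illustration (a purely hypothetical structure, not a standard format), the output of such a post-mortem can be captured as a small structured record, so the team ends up with something it can search and compare across incidents later:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PostMortem:
    """Hypothetical post-mortem record; the field names are illustrative only."""
    title: str
    impact: str                                                  # what the business/customers actually experienced
    timeline: List[str] = field(default_factory=list)            # key timestamps, in order
    contributing_factors: List[str] = field(default_factory=list)
    missed_safeguards: List[str] = field(default_factory=list)   # places it could have been caught
    action_items: List[str] = field(default_factory=list)        # protocol/working-practice changes

pm = PostMortem(
    title="Checkout outage (example)",
    impact="Orders failed for roughly 40 minutes",
    timeline=["14:02 deploy started", "14:10 error rate spiked", "14:45 rollback complete"],
    contributing_factors=["migration ran against the live table", "no canary stage"],
    missed_safeguards=["staging data too small to surface the slow query"],
    action_items=["add a canary deploy stage", "require review for migrations on large tables"],
)
print(pm.title, "-", len(pm.action_items), "follow-up actions")
```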

If you're talking about a bunch of issues that happened during a sprint, then it sounds like an extended sprint retrospective is in order (these should really be routine at the end of each sprint if you're doing Scrum properly). The difference here is that it's a general look back on the sprint - what went well, what didn't, what could have been done better, etc. - so you get to pick, discuss and document the things you did right as well as wrong.

Key to both of these is that they should be blameless - if anyone tries to blame others or indeed themselves, you must shut them down immediately, remind them that the session is blameless and bring it back on track.

6
  • "extended sprint retrospective is in order" - I was under the impression that these retrospectives are meant to deal solely with communication and collaboration issues
    – Jim
    Commented Sep 10, 2020 at 8:01
  • "you're looking to have a post-mortem" - is that a standard methodology I can read about?
    – Jim
    Commented Sep 10, 2020 at 8:01
  • 1
    @Jim Typically a retrospective would do just that, but if ways of working and communication could be the root cause of those technical issues, then it could still be appropriate (these things aren't set in stone.) They're both standard terminology, and often used interchangeably - personally I tend to think of a retrospective more as a regular occurrence, and a post-mortem as more of a one-off around a bunch of issues, but YMMV when searching around.
    – berry120
    Commented Sep 10, 2020 at 8:05
  • @Jim, you could try the 5 whys (en.wikipedia.org/wiki/Five_whys) to get to the root cause of the problem. Commented Sep 10, 2020 at 9:00
  • @RobinBennett: Interesting! How would that be applied as a "formal" process?
    – Jim
    Commented Sep 10, 2020 at 9:20
4

Here's how my old QA department did it, based on the 5 whys.

You start with the problem.

"why did we ship a fault widget?"

and find the most obvious cause, let's say:

"because Bob didn't check it properly."

Some companies would just fire Bob and call it a day, but there's no guarantee that the next guy will do any better. So you have to ask:

Why didn't Bob check it properly?

Now maybe Bob is blind, or lazy, or untrained, in which case you have to ask:

Why do we have an unsuitable person doing this important job?

Now you can start looking at your supervision and recruitment procedures. Are they good enough, were they followed, and if not, why not?

An important factor here is not to assign blame, as blame ends the process without finding a solution. Blame is about assigning punishment, not about preventing the problem from happening again. People make mistakes when they're under pressure (as anyone who has played a computer game knows). Your processes need to allow for this rather than applying too much pressure or adding additional checks.

Maybe you decide Bob was the problem and replace him with Charlie - what gives you confidence that Charlie is going to be better if you don't know why Bob made the mistake? If he was blind, you could give Charlie an eye test. If he was lazy you could implement some checks to find out how many other lazy employees you have, or maybe he was pressured into signing off a rush job and didn't have the authority to stand up to his boss.

Just blaming Bob doesn't help, because it assumes that Bob is the only person in the world who makes mistakes and that you can replace him with someone who never makes mistakes. If you focus on why the mistake was made, and why no one else spotted it, you have a better chance of avoiding it. You may also discover that you were making unreasonable demands of Bob, and that he's still the best person for the job.

Asking 'why' five times forces you to get to the root cause - the place where a change has a good chance of preventing the problem happening again. It's important to know that 5 is just a guide, not a magic number. You keep going until you're confident that the problem won't happen again. Also the process may reveal multiple causes, and multiple things that could be improved.

I should also mention that we regularly found that a root cause was that people didn't follow the rigorous process that had been implemented after the last issue, because it was too slow. But that's OK because you can go around the cycle again and either streamline the process or look at why people decided that speed was more important than quality. Either way, you get a better understanding of the total problem.
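
A minimal sketch of how such a why-chain could be written down, assuming a small hypothetical helper rather than any formal 5-whys tooling; it also reflects the point above that one question can fan out into several causes:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Why:
    """One node in a 5-whys chain; a single question can have several answers (causes)."""
    question: str
    causes: List["Why"] = field(default_factory=list)

    def add(self, question: str) -> "Why":
        child = Why(question)
        self.causes.append(child)
        return child

    def leaves(self) -> List["Why"]:
        """The deepest whys are the candidate root causes to act on."""
        if not self.causes:
            return [self]
        return [leaf for cause in self.causes for leaf in cause.leaves()]

root = Why("Why did we ship a faulty widget?")
bob = root.add("Why didn't Bob check it properly?")
bob.add("Why was Bob pressured into signing off a rush job without the authority to push back?")
bob.add("Why is there no second check before a widget ships?")

for leaf in root.leaves():
    print("Candidate root cause:", leaf.question)
```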

10
  • 2
    "An important factor here is not to assign blame ..." - that does not seem that easy, though, unless there is a template format to achieve it. E.g., even in your answer the whys seem to lead to Bob, or to point the finger at Bob
    – Jim
    Commented Sep 10, 2020 at 10:50
  • @Jim - I've expanded that section a bit. The answers did lead to Bob, but only after 2 'whys'. If you keep going you discover how the company can prevent the same thing happening again. Commented Sep 10, 2020 at 11:16
  • 8
    We have used the "5 whys" technique and I found it to be of limited use. The results are very unstable, and with a slight change at level 2/3 you will end up going in completely different directions. For example, I might not question Bob's competence but ask "why does Bob have to check it manually (as opposed to an automated test)?". Anyway, I found that in the end this always ends up somewhere outside or at the border of the team's influence: staffing, the business strategy changing every few months, a product that can do X being sold to solve problem Y, etc.
    – Manziel
    Commented Sep 10, 2020 at 12:26
  • 6
    "An important factor here is not to assign blame" - but you just implied Bob was to blame! Commented Sep 10, 2020 at 18:26
  • 3
    I used to be a specialist in safety-related software. The main point to grasp there is that everyone gets stuff wrong. So asking "why did Bob build something wrong?" after the fact is the wrong question. The right question is "when Bob builds something wrong, who would spot it?" and then you can engineer your system so that two people have to screw up for the bug to leave the building.
    – Graham
    Commented Sep 11, 2020 at 0:28
2

Berry has a solid answer. But there is no 'best approach'. There are other strategies and a lot depends on the company culture, perhaps even locale.

Sometimes the team doesn't get much of a say. They effed up under pressure; an expert evaluates the situation and devises protocols and procedures to keep it from happening again, with minimal input from the team.

Other times people are sacked and the team can have a think about their job security.

Different methods for different environments. In some places the post-mortem method would be useless, and making people visibly accountable is the only realistic solution. I've seen whole teams shown the door.

5
  • I couldn't help but think of this clip... youtube.com/watch?v=pGFGD5pj03M
    – berry120
    Commented Sep 10, 2020 at 9:03
  • Yes, I can understand the cases that you mention, but I was mostly thinking of a team-internal process for team improvement, before management decides that the team has an issue and accountability/an expert should be involved. I guess in most cases teams are not "under fire" after the first mistake, right?
    – Jim
    Commented Sep 10, 2020 at 9:17
  • @Jim theoretically the team should rarely be under fire, their manager is responsible, but it doesn't always work that way
    – Kilisi
    Commented Sep 10, 2020 at 9:43
  • 1
    @berry120 looks like an amusing movie
    – Kilisi
    Commented Sep 10, 2020 at 9:48
  • 3
    @Kilisi TV series. It's worth a watch 😂
    – berry120
    Commented Sep 10, 2020 at 10:17
2

Development standards and guidelines

Large teams can coordinate through shared documents, such as written standards and guidelines for various aspects of development, deployment, etc., which spell out the common understanding of what is best practice in this company, which trade-offs are preferable, and which core standards should be followed.

In this scenario, when you review the causes of a particular problem and it's not just a random one-off mistake but something systematic that repeats (or might repeat), you can 'learn from the mistake' and communicate it by amending these standards and guidelines to facilitate practices that will mitigate such errors - and there's no need to assign blame or cite a specific incident as the cause of the amended standards.

2

As many have said, avoid finger-pointing. People make mistakes, and that is part of the cost of doing business. Or, as I like to say, you pay for training one way or another.

A process that I've seen work very well is, and in this order:

  1. Fix
  2. Repair
  3. Investigate
  4. Resolve

Fix

Do whatever you need to do to get the systems up and running. It can be pretty or ugly, but the priority is to get things up, and a full repair can be done later.

Repair

After the systems are up and running again, do a full repair to make sure the systems are stable and there is no pending threat of failure. Document and resolve any remaining issues.

Investigate

This is an in-depth evaluation. It should be solution-oriented: nobody should get in trouble, and full participation without negative consequences should be the focus.

You want to know:

  1. What went wrong
  2. How did it happen
  3. How did we fix it
  4. What could have prevented it
  5. How do we make sure it doesn't happen again?

Leave egos at the door and just find out what happened. Shoot down any "Bob should have..." and push for "Next time, we will..."

Resolve

This is where you act on the investigation. All contributing issues should be addressed at this point, and preventative measures should be instituted.

The most important thing is that the issues that led up to this are resolved permanently. The focus should be on fixes, not on punishment. EVERY organization has problems.

Get training for those who need it, implement shop standards, and use the mistakes as opportunities to improve.

This will set up a process of constant improvement, and boost employee engagement and morale.
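
If you want the ordering above to be explicit in whatever tracker you use, here is a minimal sketch (hypothetical stage names taken from the list, not any particular tool's API) that refuses to skip a stage:

```python
from enum import IntEnum

class Stage(IntEnum):
    FIX = 1          # get the systems back up, however ugly
    REPAIR = 2       # make the fix proper and stable
    INVESTIGATE = 3  # blameless, solution-oriented review
    RESOLVE = 4      # act on the findings, put prevention in place

class Incident:
    def __init__(self, title: str):
        self.title = title
        self.stage = Stage.FIX

    def advance(self, target: Stage) -> None:
        """Only allow moving to the next stage, so nothing gets skipped."""
        if target != self.stage + 1:
            raise ValueError(f"Cannot jump from {self.stage.name} to {target.name}")
        self.stage = target

incident = Incident("Payment gateway outage (example)")
incident.advance(Stage.REPAIR)
incident.advance(Stage.INVESTIGATE)
incident.advance(Stage.RESOLVE)
print(incident.title, "is now at", incident.stage.name)
```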

1

So, I've done this through 4 companies now as a manager of medium-sized (5-20 person) teams, and I think there are different tools for a couple of different jobs:

Retrospectives (Agile) & Lessons Learned

Good for stuff that works somewhat like a scrum - a way for a team to look back reflectively. Good for finding the stuff that doesn't fit into a single ticket and that can be more about communication. I find it a lot harder to get much out of this when we're talking about a team that is too big to scrum together (like a department-sized group), because the low-formality, discussion elements really need a team size that allows fairly equal person-to-person communication. Doing this with a group of 20+, you end up with deputized speakers (official or unofficial) being the only contributors.

Agile has clear processes and techniques for Retrospectives. Companies that don't use Agile will sometimes do a "Lessons Learned" meeting using their own concoction of procedures, but the good formats generally seek to capture all ideas, avoid blame, and keep a separation between brainstorming (which is non-judgmental) and prioritizing/acting (which seeks to do the work with the most bang for the buck).

Both processes can be biased by the participants' personal experiences. For example, something that is super annoying may not actually be the biggest productivity killer - it's just the squeakiest wheel. And folks with big personalities can sway others if the moderator isn't good at counterbalancing.

Single Incident/Big Impact

Post mortems (and there are many great write-ups here already) are fabulous for single-incident cases. There's a lot of work (as others said) to drive this away from being a blame game and into a useful learning exercise. That is the risk, though, with diving deep - you need to make sure that a single bad situation is not the only reflection of an individual's or group's performance. Performance management has to be kept quite separate from this exercise.

The drawback is that a really good post mortem takes real time. And a superficial post mortem won't be worth the time you put into it - in some ways, bad post-mortem research can be worse than none at all.

So - you end up needing a bar for "what's so impactful that we should do a post mortem?". Each business is different on this, but my advice would be to ground it in the business strategy, and then do your best to find unambiguous metrics for that (as opposed to "what situation was so bad that the CEO was embarrassed/woken up in the middle of the night/etc." ... it may be worth doing that one too, but it shouldn't be the only one).

Post mortems are ... post - the incident resolution process is generally separate. It's easier to judge what deserves a post mortem if you have an incident resolution tracking process that is relatively public, so that leaders can see how other leaders are handling this.

Death by 1,000 paper cuts

I've also been in situations where my technical issues were not delivered in big-bangs that would be assisted by post mortems, but in many small, time draining, soul sucking issues. This can be tough, as you never really get the energy to deal with the root causes and all of it seems trivial.

At that point, drawing out data is a common and useful technique. I would not recommend the full-on process (too burdensome) - but CMMI/CMI can be useful. It's all about tracking and analyzing data for process improvements. Big, huge companies use it, and the downside is that it's process for your process, and as such it can be an impediment to making radical change. But it's got some good techniques for data analysis in there. Steal those and discard the rest before any of it sticks to you.

What I learned most from CMMI is that you can look at these 1,000 paper-cut issues and form some interesting conclusions about them by categorizing the data and looking at it in bulk. The key is that your data tracking has to be consistent and accurate enough for the judgements you are trying to make. For example: how are people tracking time spent solving the issue? Will that matter? If an issue is attributed to a component, is the attribution accurate? Do you know the root cause? Do you have categories for it? Is everyone using these categories the same way (it's usually a human who has to enter this data...)?

This becomes the realm of statistics - but even a determined manager with Excel can sometimes pull out some useful details. The other trick is - know how to use statistics, and/or don't use stats you don't understand.
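
As a minimal sketch of that kind of bulk look (plain Python over a hypothetical ticket export; the field names and numbers are made up), simply counting issues per category and averaging the recorded fix time already answers questions a single retro never would:

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical export from a ticketing system; real data would come from a CSV/JSON dump.
tickets = [
    {"component": "billing",  "root_cause": "config",     "hours_to_fix": 6},
    {"component": "billing",  "root_cause": "config",     "hours_to_fix": 4},
    {"component": "frontend", "root_cause": "regression", "hours_to_fix": 2},
    {"component": "billing",  "root_cause": "deploy",     "hours_to_fix": 9},
    {"component": "api",      "root_cause": "config",     "hours_to_fix": 5},
]

by_cause = Counter(t["root_cause"] for t in tickets)
hours_by_component = defaultdict(list)
for t in tickets:
    hours_by_component[t["component"]].append(t["hours_to_fix"])

print("Issues per root-cause category:", by_cause.most_common())
for component, hours in hours_by_component.items():
    print(f"{component}: {len(hours)} issues, avg {mean(hours):.1f}h to fix")
```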

Back in my days of using an actual process as opposed to a DIY solution, my CMMI expert was a stats goddess. She had tools, skills and the power to communicate - which made the whole thing work well. A huge company may make that kind of investment... a smaller/less organized company may not.

0

There are a lot of great answers here, but one thing I have not seen mentioned yet is the application of Value Stream Mapping (aka Value Stream Analysis) to your SDLC and Development Operations processes.

By mapping out your current workflow of how an idea becomes a requirement... becomes an actively worked story... is developed into a compilable piece of code... is then turned into an artifact... and finally proceeds through the various quality gates of your testing process until it is ready for release... you can begin to paint a picture of your business and technical processes.

Analyze your Technical Value Stream Map

Once you have your development life cycle laid out in a visual document, you can begin to apply what you have learned from your recurring retrospectives and post mortems in order to identify weak areas in your processes and gather ideas for addressing problems. Assuming you use an automated tracking tool for your stories, you can also apply time data to your value stream map to see how long it takes to complete certain steps and to identify bottlenecks where your stories or tasks may bunch up.
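
A minimal sketch of applying that time data, assuming hypothetical status-transition timestamps pulled from the tracker, to see which step work actually sits in the longest:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical ticket history: (ticket, workflow step, entered, left).
history = [
    ("T-1", "In Development", "2020-09-01 09:00", "2020-09-02 17:00"),
    ("T-1", "Code Review",    "2020-09-02 17:00", "2020-09-04 10:00"),
    ("T-1", "QA",             "2020-09-04 10:00", "2020-09-08 12:00"),
    ("T-2", "In Development", "2020-09-02 09:00", "2020-09-03 11:00"),
    ("T-2", "Code Review",    "2020-09-03 11:00", "2020-09-07 16:00"),
]

fmt = "%Y-%m-%d %H:%M"
hours_per_step = defaultdict(list)
for _, step, entered, left in history:
    delta = datetime.strptime(left, fmt) - datetime.strptime(entered, fmt)
    hours_per_step[step].append(delta.total_seconds() / 3600)

# The step with the largest average dwell time is a bottleneck candidate on the map.
for step, hours in sorted(hours_per_step.items(), key=lambda kv: -mean(kv[1])):
    print(f"{step}: avg {mean(hours):.1f}h across {len(hours)} tickets")
```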

Where I find this addition of a Value Stream Map to be superior to a simple retro or post mortem without one is twofold:

  • Retros/post mortems tend to look at a problem in isolation as opposed to the life cycle as a whole - during these ceremonies, teams can tend to hyper-focus on existing problems without accounting for upstream factors or downstream dependencies within the overall workflow. Changing processes in isolation can be a dangerous practice and can even have the opposite effect to what the team intended. But by using a value map when reviewing process issues, the team should be able to immediately recognize areas where caution is required.

  • Without the benefit of map data, evaluating the frequency and duration of a problem or affected process tends to be subjective. This can lead to action items from a retro or post mortem being prioritized not by real time cost/value but rather by how the team feels about an issue. Using a value map with time data applied, on the other hand, a team will immediately know how long a process step takes on average and can then cross-reference their ticketing data to get a good idea of which action item should take priority.

Take advantage of Agile Metrics

Metrics also play a huge role in high-functioning agile teams. There are a number of great tools available to Agile teams in order to identify team problems and improve overall efficiency. This site has compiled a great list of popular ones. Ones that I have personally found to be helpful in managing an agile team:

  • Burn-down (or Burn-up) Charts
  • Control Chart
  • Cumulative Flow Diagram

These reports are extremely helpful, as they can be used with just about any flavor of Agile you follow. And while they may not always point to your exact process issue, they tell a story which can then be used alongside other data to put together the information needed to implement improvements.
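
To give one concrete example of the data behind these reports: the counts underlying a cumulative flow diagram are just "how many tickets were in each state on each day." A minimal sketch over hypothetical snapshot data (any real tool will export this for you):

```python
from collections import Counter

# Hypothetical daily snapshots of ticket states.
snapshots = {
    "2020-09-01": ["To Do", "To Do", "In Progress", "Done"],
    "2020-09-02": ["To Do", "In Progress", "In Progress", "Done"],
    "2020-09-03": ["In Progress", "Done", "Done", "Done"],
}

states = ["To Do", "In Progress", "Done"]
for day, tickets in sorted(snapshots.items()):
    counts = Counter(tickets)
    row = ", ".join(f"{state}: {counts.get(state, 0)}" for state in states)
    print(day, "-", row)   # plot these counts stacked over time and you have a CFD
```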
