Handling a perceived over-reaction to a bug introduced by my team

Question

Background

I run my company's central data team. Other teams depend on the output of our daily refresh pipeline which includes jobs that both my team has written (such as building our core orders and customers tables) and also jobs that other teams have written (specific reporting tables built from those core tables). All failed jobs are pushed to a company-public Slack channel, and we have a support rota to investigate failures and inform affected teams.

One team in particular has some reporting jobs in our pipeline that are required for financial auditing. These jobs often fail due to errors introduced by this team or due to one particularly unreliable external data source. Any failures cause significant disruption to this team and others who depend on them, but they don't seem interested in improving these processes and refuse my help. Their jobs failing doesn't affect any other users of the central data platform so it's not a critical concern of mine.

The alerting Slack channel is often full of this team's job failures, and when the incident in question happened it was the third different failure we'd seen from this team's jobs in that week.

What Happened

Last week one of these jobs failed due to an error introduced by a member of my team - he was doing some routine background optimisation work in an upstream table that duplicated a small number of records. We informed the team immediately when the job failed and fixed it within a few hours. We also held an internal post-incident review and made a couple of technical and process changes to stop this specific issue from happening again.

This was the first time in the 18-month lifespan of my central data platform that any of these jobs had failed because of a bug introduced by my team.

I was therefore quite surprised to see a meeting invite in my calendar involving several senior stakeholders from the other team to discuss how my team can mitigate our impact on their important jobs. They also want us to inform them every time we do work that might impact their jobs.

My Thoughts

I've been doing this job for long enough to know that sometimes things go wrong and that it's a normal part of the engineering process. If there was a prior history of my team breaking things then I'd understand their concern, but there's not. Add to this that they don't seem to care much whenever their jobs break for other reasons, and I'm a bit miffed.

My questions are:

1. What do I mention in the meeting? Do I say "why are we having a meeting when this is the first time we've broken something, and we fixed it quickly anyway?" or do I simply treat it like a post-incident review and walk everyone through what happened and what steps we have in place to prevent it happening again? I don't want to anger anyone or have them think I'm being difficult (I get on well with these people), but I also don't want them thinking that my team is unreliable or risky.
2. Do I agree to inform them every time we do work that might impact their jobs? My team are constantly doing internal tasks around optimisation and refactoring that have no noticeable impact on external teams. I'm worried that telling this team every time we do such a task will a) result in them not understanding the importance when we need to change something that actually impacts them and b) add a tedious "contact this team" step into most of our tasks.

Note

My questions are around how I should handle this specific meeting. I'm not interested in advice about changing data ownership processes, that's already in progress.

The main tedium would be the extra step in each of our tasks to identify whether it impacts this one team in any way, and knowing that we'd be shouting into a void as they almost certainly wouldn't care/do anything about these messages. We don't do this with any other team, so what if other teams start noticing and want us to inform them too? Then every small internal change starts getting scrutinised and we're constantly on the defensive. — user132562, Commented Nov 13, 2023 at 11:50
2) would be very difficult. If the employee KNEW it was going to affect tables further down the line, he would have remedied it. You don't knowingly introduce a bug, that is the nature of the bug. Sometimes a bug will have unforeseen effects. If you agree to this, next time something else breaks the same stakeholders will ask "Why weren't we notified?" — David Lindon, Commented Nov 14, 2023 at 10:06
Part of a mature change control process is informing stakeholders of changes that may affect them, the reason changes are being made, proof the changes have been QA'd, and the date these changes are being made in production. — Tony Ennis, Commented Nov 14, 2023 at 19:26
This may not make sense in your specific context but do you have any separation between “dev” and “production” currently? It sounds like your changes are all happening in production. — bob, Commented Nov 15, 2023 at 2:41
@D.BenKnoble alas, that's not true for Highly Active Questions — Indigenuity, Commented Nov 15, 2023 at 17:47

Joe Strazzere · Accepted Answer · 2023-11-13 11:16:00Z

79

do I simply treat it like a post-incident review and walk everyone through what happened and what steps we have in place to prevent it happening again?

Yes, this.

If you don't want them to make a bigger deal out of this incident than warranted, then don't you make it a bigger deal.

Don't be defensive. Don't point out that "they don't seem to care much whenever their jobs break for other reasons". Don't be miffed.

Go in with facts and explanations. Stay calm. Be sensitive to the disruption that was caused by the issue.

Propose a longer term solution, if the vibe in the room seems amenable.

Do I agree to inform them every time we do work that might impact their jobs?

Maybe. That depends on what they could do with the information, and how difficult it would be to produce.

An email saying "We are changing X tonight" is easy to produce. But it's hard to see how they could derive much value from it.

edited Nov 13, 2023 at 11:16

answered Nov 13, 2023 at 11:09

Joe Strazzere

18

@WorkplaceUser If no-one knows your team’s good record that is not going to be fixed by mentioning it in this meeting. You should maybe think about if you’re doing enough to make your team’s good performance visible to the stakeholders every day instead of just bringing it up when something goes wrong. If everyone knows your team has few mistakes and the other team has a lot, bringing it up probably isn’t constructive.
– ColleenV
Commented Nov 13, 2023 at 12:32
15

@ColleenV I don't understand why it can't be fixed in this meeting. The OP can say something about their sterling track record, show a graph or mention that it's the first incident in 18 months and say that's why they're taking it so seriously.
– matt freake
Commented Nov 13, 2023 at 13:20
33

@mattfreake Only bringing up a team's stellar record when a mistake has been made means the team is not getting the recognition they deserve the other 99% of the time. That's what I thought needed fixing. Assuming everyone notices and appreciates things not being broken is a mistake. Everyone gets used to things not being broken, and the rare mistake becomes a big deal because it's so rare. That's just how humans are.
– ColleenV
Commented Nov 13, 2023 at 14:45
7

"But it's hard to see how they could derive much value from it." They could always say, "Hey, we have to do some big important report tonight, can you please do your change tomorrow instead when we don't care what you break?"
– user3067860
Commented Nov 13, 2023 at 22:18
7

Agree with @ColleenV. As the leader / manager of this team, OP may want to consider how the consistently good work of this team is made visible across their organisation. Saying "it never normally breaks" is not good enough. It is the dichotomy of working in IT. When it works, it's invisible, meaning users don't see it, so they all question why it's so damn expensive. When it breaks, users think it is rubbish, so question why it costs so much. There is no win condition for us. Part of OP's role is to champion their team to the rest of the organisation and show what makes them amazing!
– ThaRobster
Commented Nov 14, 2023 at 8:57

falsedot · Accepted Answer · 2023-11-14 21:13:27Z

I think the important bit here is

reporting jobs in our pipeline that are required for financial auditing

Sometimes, the point isn't whether a job failed or not; the point is whether this was anticipated, reported, or accounted for. You mention that this team's jobs often fail, but those failures are due to their own changes, which they are presumably aware of, so they can take remediation actions based on their runbook.

On the other hand, this failure was unaccounted for and more importantly unexpected; it likely revealed another source of potential errors that, depending on the organisation and the regulations, needs to be accounted for and have procedures in place. In other words, this signalled that "there's someone else that can modify, in a noticeable way, our processes". This isn't necessarily about being possessive; it can be just a realisation of another source of risk.

This is of course assuming best intentions from all parties; situations like these tend to get a bit political, even unintentionally.

I also don't want them thinking that my team is unreliable or risky.

Given what you say about the incident detection, fast response, notification of the team, and follow-up actions, it sounds like not describing what happened and evading the answer would actually hinder your goal. A summary of the timeline, especially how this was detected immediately and intentionally (i.e. not by accident but via automated monitoring), along with the post-incident actions, will showcase your reliability (assuming they are receptive to that) rather than giving the impression that you are cavalier with auditing pipelines.

internal tasks around optimisation and refactoring that have no noticeable impact on external teams

Yet in this case it had. It's true that optimisation and refactoring is often the right thing to do, yet it doesn't come without risk. Perhaps in this case, there's additional risk that needs to be addressed (that would warrant this meeting).

Do I agree to inform them
not understanding the importance
tedious "contact this team" step

Is there a particular reason there's a constant stream of such tasks? Can't they be grouped when they affect this particular set of jobs and the work deployed in an agreed window of time (eg after the market closes)?

Also worth keeping an open mind when it comes to the value of this; you mention optimisation but I've seen teams being happier with signing off extra resources to ensure there are no breaks.

Finally, depending on the tone of the meeting, you could suggest that you take some extra time to work out the exact process. A manual step to notify depending on the description of the ticket isn't great either; isolating those pipelines and requiring, e.g., a manager's sign-off during review, with the manager then contacting the affected team, springs to mind as a somewhat more robust alternative. This allows you to make the decision without being directly pressed (obviously, freeze any changes to those jobs in the meantime).

This is an underappreciated answer. OP (likely) knows as little about their job as they do about his. It may well be that they don't care about the faults introduced by themselves because they see them, know how to fix them, and move on - but a fault due to an outside party that they couldn't fix themselves and which took hours to resolve may have well given their manager a near heart attack. — xLeitix, Commented Nov 14, 2023 at 13:18
If it's financial auditing that "heart attack" might replicate a LONG way up the management chain..... — deep64blue, Commented Nov 16, 2023 at 14:40

Greg Martin · Accepted Answer · 2023-11-16 00:15:07Z

21

Just show the data (if you have it).

I assume you folks have metrics and historical data around failed jobs and some pareto charts of the root causes. Show the data for all jobs and the data for this specific job.

Then explain what you have put in place to prevent the problem from happening again. If you feel particularly miffed, you can add "The other root causes are outside of my domain and I have no visibility into what's being done to address them".
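As a rough illustration of the Pareto idea, here is a minimal Python sketch. It assumes a hypothetical failure log of (job, root cause) pairs; the job names, cause labels and counts are invented for the example and are not taken from the question. The `pareto` helper is likewise just an illustrative name.

```python
from collections import Counter

# Hypothetical failure log: (job name, root cause) pairs.
# Every name and value here is illustrative, not real data.
FAILURES = [
    ("audit_report", "external source"),
    ("audit_report", "owning team change"),
    ("audit_report", "external source"),
    ("audit_report", "central team change"),
    ("sales_report", "owning team change"),
]


def pareto(failures, job=None):
    """Return (cause, count, cumulative share) rows, most frequent cause first.

    When `job` is given, only failures of that job are counted.
    """
    causes = [cause for name, cause in failures if job is None or name == job]
    total = len(causes)
    rows, running = [], 0
    # Sort by descending count, then by cause name, so ties are deterministic.
    for cause, count in sorted(Counter(causes).items(), key=lambda kv: (-kv[1], kv[0])):
        running += count
        rows.append((cause, count, running / total))
    return rows
```

Calling `pareto(FAILURES)` gives the picture for all jobs and `pareto(FAILURES, job="audit_report")` the picture for one job, which matches the "all jobs, then this specific job" comparison suggested above.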

edited Nov 16, 2023 at 0:15

Greg Martin

answered Nov 13, 2023 at 12:42

Hilmar

9

Any graph would show big numbers for external causes and then a single dot for issues caused by my team. I think the intention would look quite clear and might not go down well.
– user132562
Commented Nov 13, 2023 at 12:48
9

If that wouldn't go down well then I think we're tasked with finding reasonable solutions to deal with unreasonable people, it's no wonder that the solutions aren't that palatable.
– Lamar Latrell
Commented Nov 14, 2023 at 1:22
10

@WorkplaceUser I'm an engineer and not a manager, so my messaging strategies are not especially well honed. But I would be tempted to render that graph, leave it out of the main presentation, but hold it in reserve. Make a more general and vague statement in the presentation, to the effect that you are accustomed to that job failing periodically for reasons outside your control, which impacted your response. If they challenge you on that, THEN bring out the slide with the graph. ... honestly, in the end this will probably be more nuclear than including it from the start. But well deserved.
– Glenn Willen
Commented Nov 14, 2023 at 3:48
7

@GlennWillen So basically, don't try to rub that in their face, but lend a helping hand if they try to rub it themselves?
– Frax
Commented Nov 14, 2023 at 18:30

Xavier J · Accepted Answer · 2023-11-15 15:36:42Z

10

At the end of your opportunity to speak at your meeting, announce that you're going to start publishing a weekly incident report that includes all teams. Precisely at that moment, pass out paper copies of a report (or share a PDF) that shows the same info over the last 90 days.

Put the numbers in a columnar (table) format. In the report, don't highlight any one team's activities. Sort by team name alphabetically, and not by number of incidents. This way no one can directly accuse you of throwing shade.

Don't add any commentary to the information you're sharing. Let the numbers speak for themselves. If anyone asks you about the other team's numbers, share your source of just the numbers and absolutely defer any "why" questions to be answered by the persons who run that other team. Again, no one can accuse you of throwing shade, and you can get out of this hot seat that the other team is trying to put you in.
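A report like this does not need much tooling. The following Python sketch is one possible shape for it, assuming a hypothetical list of (date, responsible team) incident records; the team names, dates and the `incident_report` function are made up for illustration. It keeps to the rules above: a plain table, a 90-day window, and rows sorted by team name, not by count.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical incident records: (date, team responsible for the root cause).
INCIDENTS = [
    (date(2023, 11, 8), "Finance Reporting"),
    (date(2023, 11, 6), "Central Data"),
    (date(2023, 10, 30), "Finance Reporting"),
    (date(2023, 7, 1), "Marketing"),  # falls outside a 90-day window in the usage below
]


def incident_report(incidents, today, window_days=90):
    """Build a plain-text table of incident counts per team within the window."""
    cutoff = today - timedelta(days=window_days)
    counts = Counter(team for day, team in incidents if cutoff <= day <= today)
    lines = [f"{'Team':<20}{'Incidents':>10}"]
    # Alphabetical by team name, deliberately not ranked by incident count.
    for team in sorted(counts):
        lines.append(f"{team:<20}{counts[team]:>10}")
    return "\n".join(lines)
```

For example, `print(incident_report(INCIDENTS, today=date(2023, 11, 13)))` produces a header row followed by one row per team, with no highlighting and no commentary.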

edited Nov 15, 2023 at 15:36

answered Nov 13, 2023 at 21:44

Xavier J

This report might be considerable effort to maintain.
– wizzwizz4
Commented Nov 13, 2023 at 23:17
6

@wizzwizz4 Then go about 6-8 weeks, and drop it in favour of "pressing priorities". ;) I'd bet (smile) you're an engineer. This is a different kind of engineering, friend.
– Xavier J
Commented Nov 13, 2023 at 23:22

gnasher729 · Accepted Answer · 2023-11-13 11:32:02Z

8

If this team had 99 failures that were their own fault, and didn't do anything about it, and they had one failure that is your fault, and now they are all upset and want you to take costly actions, then it is clear that they want to shift blame from themselves to you.

So I'd do a quick count, and in the meeting say "team X's job failed yesterday, and that was our fault. In the three months before this, the same job failed 97 times. Each time it was team X's fault. We offered to help them to fix their problems, but they have not taken any action as far as we can see, and they have refused to accept our help. I suggest that it is much more effective to reduce all the other failures in their jobs first."

answered Nov 13, 2023 at 11:32

gnasher729

14

This is definitely the nuclear option! It would be nice to get this point across in some form but it's hard to structure it in a way that isn't so aggressive.
– user132562
Commented Nov 13, 2023 at 11:52
12

@WorkplaceUser I agree this is a nuclear option and should not be used. But it is worth having this information ready to go, in the unlikely event they decide to try and throw your team under the bus, you have an ace you can play.
– Anketam
Commented Nov 13, 2023 at 14:22
2

OP doesn't have a country tag, but at least in my country & work culture, this answer would definitely be the way to go. Simply because it is the factual truth. Dancing around issues to not step on any toes does hurt processes and efficiency in the long run.
– user112367
Commented Nov 14, 2023 at 7:25
8

Just make damned sure to have your facts straight before throwing shade. It would not be good if the other team counters with "but these 97 failures also have no impact because of X, whereas your failure threatened to corrupt the entire monthly reporting" and a bunch of high-level managers nod their heads.
– xLeitix
Commented Nov 14, 2023 at 13:33
1

I would do the same but in a more subtle way. In the meeting, get them to agree that you must report to them whenever their job fails, with a preliminary cause assessment. Then when they later get daily reports of their team's jobs failing, and it being their team's own fault, you've solved both problems. Perhaps state that you'll provide the preliminary cause assessment within 72 hours, so you can do them in batches. The basic idea is not to give a hint in the meeting of what they're about to let themselves in for...
– Martin Kealey
Commented Nov 15, 2023 at 11:25

Chris Schaller · Accepted Answer · 2023-11-14 02:28:07Z

As pointed out in a few other comments, before you go into this meeting, you need to do some self-reflection on your team and their processes.

Last week one of these jobs failed due to an error introduced by a member of my team - he was doing some routine background optimisation work in an upstream table that duplicated a small number of records.

There is no such thing as routine background optimisation. Refactoring and optimisation tasks are critical areas of concern. All logic that depends on your logic assumes by now that it works; confirmation bias kicks in, and other teams may even have removed checks and balances, or might ignore notifications if it does start to fail.

A software developer who assumes that their code is bug-free and does not perform thorough testing or debugging is a common example of confirmation bias in software engineering.

Your metrics for testing are not always likely to meet the organic requirements that other teams now have after using your code for so long. So even if all of your original test cases still succeed, you won't know what boundary conditions or quirks in your logic they may have adapted to.

I once worked on a project porting a product to a new runtime. It was a requirement to replicate all of the existing bugs in the original software. Sounds crazy, but their staff and automated systems had already factored in and worked around those issues. If we fixed them now, all of the logic that depended on those bugs would have to be reviewed and fixed too, which was well outside the scope of the project.

Due to the potential of this domino effect that your changes might have throughout the system, such changes should be coded and tested in isolated environments that simulate or replicate production loads. If your solution is split up so that there are multiple teams, then changes should only be introduced after scheduling deployments with all teams and performed in a way that you can rollback if there are any significant issues.
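To make the testing point concrete for the kind of bug described in the question (an optimisation that duplicated a small number of records), here is a minimal Python sketch of a before/after comparison that could run in a staging environment. It assumes rows are available as dictionaries and that a key column such as `order_id` exists; the column name and the `compare_outputs` function are assumptions for illustration, not something described by the OP.

```python
from collections import Counter


def compare_outputs(before, after, key):
    """Compare a table's rows before and after a refactor, keyed by `key`.

    `before` and `after` are lists of dict rows; `key` names the column that
    is expected to identify a record (a hypothetical choice in this sketch).
    """
    before_keys = Counter(row[key] for row in before)
    after_keys = Counter(row[key] for row in after)
    return {
        # Positive means the refactored job emits more rows than before.
        "row_count_delta": len(after) - len(before),
        # Keys that now appear more than once and more often than they used to.
        "new_duplicates": sorted(
            k for k, n in after_keys.items() if n > 1 and n > before_keys.get(k, 0)
        ),
        # Keys that existed before the refactor but are gone afterwards.
        "missing_keys": sorted(set(before_keys) - set(after_keys)),
    }
```

A deployment step could refuse to promote the change unless the delta is zero and both lists are empty. That would not catch every regression, but it would catch the duplicated-records case before downstream jobs see it.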

If it ain't broke, be careful that you're not making the situation worse!

We informed the team immediately when the job failed and fixed it within a few hours.

It is great that you communicated the issue and had it fixed. The irony of situations like this is that, had you not raised it, they may never have known. But honesty is the best policy: by raising the alarm early you have given them the opportunity to mitigate problems at their end as well.

We also held an internal post-incident review and made a couple of technical and process changes to stop this specific issue from happening again.

The main thing you need to take into this meeting is your findings from that review. You'll need a timeline of events and some detail about your new preventative measures to reassure everyone that it won't happen again, or at least to ensure you can identify the issue and have a strategy to rectify it even sooner.

What you cannot do...

You cannot really bring up any of the history about either your team's success record or the fact that their team gives you all sorts of issues. When their team causes issues, they know how to deal with it; by the sounds of things, this is by contacting you. This situation is different. They can survive their own incompetence, thanks mostly to you, but when you are incompetent they have zero trust in the system: the next time there is an issue, they will not be certain whether it is you or them.

Perceived over-reaction

You haven't really justified why this is an overreaction. For them, your team has been stable for so long that the natural question is "what has changed, why are we suddenly seeing issues now?" It is good that they do have a strong reaction, finance departments often will; it means that your work is important to them and they recognise that and potentially want to help.

In this meeting you need to build back the trust or goodwill that was lost, but don't do it by making promises you can't keep. Just state facts, confirm the communication channels, and ask them if your response was adequate. Also ask them to evaluate your resolution and mitigation strategy. Make them feel like part of the solution and have them confirm that your solution is in fact adequate and can be trusted. Don't tell them it is; make them confirm it.

Do I agree to inform them every time we do work that might impact their jobs? My team are constantly doing internal tasks around optimisation and refactoring that have no noticeable impact on external teams.

YES this is important, they clearly depend on your system for their day-to-day work! If you cannot explain the importance of a change to dependent teams, then is it really that important? It is a fact of your chosen delivery and maintenance model that contacting all affected teams is an important part of the process, because you have demonstrated that your quality control was not good enough to identify this issue before it was rolled out. This is part of earning that trust back.

But this is also a good time to reflect on why you are making so many changes in the production environment. The first step is to batch changes together to reduce deployments, but you should liaise with the other teams and set up a staging or test environment so that you can test deployments before actually rolling them into production.

no noticeable impact on external teams

This is the key: this is your opinion. If you had tested extensively and verified that there will be no noticeable impact, then what was the purpose of the change? In your communication about the deployment you might use language like "We do not expect a noticeable impact" but you still need to follow with a brief explanation to justify that claim and how you have verified it.

The communication is not the tedious step, it is the testing that is causing you issues. Communication of release notes is critical and should never be skipped. Even if you have a great DevOps pipeline and automated deployments, you should still have a collated list of the features that have changed because when the deployment fails, we need to know where to go looking for the root cause of issues.

I appreciate it is more of a longer term objective, but this should be the objective of all teams that deploy into production environments.

nicola · Accepted Answer · 2023-11-13 15:44:32Z

5

I think that you should use this opportunity to really assess your process. In fact, what baffles me is that you push changes in production without even thinking about notifying your users. This is bad, no matter how exceptional your team is. This looks to me not as an unavoidable accident, but rather a flawed process that you should fix. When I read "he was doing some routine background optimisation" I felt almost sick: you don't do that. Please realise that what you call routine is not a good practice and should not happen outside release windows.

As I said, I'd use this occasion to introduce changes to your process. You definitely need to be more formal for new releases (you should agree a calendar with the user), organise proper testing windows and communicate with users the rationale for each new release. And this is what you should say in the meeting: the fact that the other team introduced bugs is not an excuse for you to not follow best practices.

answered Nov 13, 2023 at 15:44

nicola

“When I read "he was doing some routine background optimisation" I felt almost sick”: this sounds like a "you" problem. I'm more interested in actually getting things done rather than bogging everything down in a sea of approval processes, and it's clearly working absolutely fine. You should consider reading about the concept of Continuous Delivery (CD).
– user132562
Commented Nov 13, 2023 at 16:16
1

Please show me where it is written that, within CD, you don't communicate with the user and don't test your code.
– nicola
Commented Nov 13, 2023 at 16:38
4

@WorkplaceUser, CD doesn't mean no reviews, no notifications and no approvals. You can keep the process efficient and still have those properties. If you introduce changes via pull requests to a git repo that are reviewed, and perform database changes that are tested in a staging environment, then you have CD with a much lower chance of breakage. Now it all depends on how critical the system is. If a few hours for a fix is no big deal then the system doesn't sound very critical. It depends on many factors; nicola's points make a lot of sense in many environments.
– akostadinov
Commented Nov 13, 2023 at 21:17
I don't understand how you can "agree a calendar with your users" or have "release windows" if you're doing Continuous Deployment
– matt freake
Commented Nov 13, 2023 at 21:18
2

@mattfreake Feature flags can be useful for this, but CD also doesn't need to go directly into production. If it does, you can still schedule it, for instance on certain days of the week or month; that can still be continuous. CD should also have an extensive review stage that includes both regression and integration testing, and part of that review can easily include notification of pending changes. Finally, CD might not be appropriate for one piece of a large solution if your other teams have a much slower development feedback loop or capacity.
– Chris Schaller
Commented Nov 14, 2023 at 2:16

Blackhawk · Accepted Answer · 2023-11-13 21:47:45Z

I agree with you and Joe that, yes, you should do this:

do I simply treat it like a post-incident review and walk everyone through what happened and what steps we have in place to prevent it happening again?

However let me propose 4 points to hit specifically.

Root Cause

What went wrong? Who, What, When, Why, Where, etc.

Impact

How many stakeholders/applications/hosts/rows are affected? What are the exact effects that stakeholders should expect going forward?

Fix

How do we plan to fix the issue and redress stakeholders in the immediate and long term?

Prevention

How do we ensure that this issue or ones like it will not occur again?

I think on that last point you may get some pushback from the stakeholders, but if you show competence on the first 3 they will likely defer to your expertise, especially if you demonstrate a strong grasp of the impacts and the fix.

I think I would take this approach as well - taking a general approach in the discussion i.e. these jobs are business critical, they must run on the last day of the month etc. Then propose a solution that includes the appropriate maintenance of the jobs so they don't randomly fail and the appropriate controls so your team doesn't impact them. — David Waterworth, Commented Nov 13, 2023 at 23:34

mdfst13 · Accepted Answer · 2023-11-13 23:14:06Z

Do I say "why are we having a meeting when this is the first time we've broken something, and we fixed it quickly anyway?"

As others have already noted, do not do this, however emotionally satisfying it seems like it would be. In fact, I would suggest getting this out of your system. Pick someone not at work who is likely to take your side, a friend, spouse, significant other, or other family member. Say all these kinds of things to that person, then when they agree with you, take the other side and explain why they shouldn't.

I simply treat it like a post-incident review and walk everyone through what happened and what steps we have in place to prevent it happening again?

As already noted, do do this. This is just a perfectly normal meeting after a problem. Already said, but worth repeating.

All that said, you should take data and statistics to the meeting with you. Some things that might help in rebuttal to things that people might say:

How many failures versus how many successes. If your team is deploying without problems 99+% of the time, that's good. Be ready to say that if someone implies that your team is making mistakes.
How many of the failures were caused by the external agent, the team responsible for those reports, and your team. A response to someone complaining about the frequency of problems with those reports and blaming the problems on your team.
What your proposals are for making the reports more robust in general. In case someone asks what you would suggest they do.
How many of the changes that you made this year would have been reported at various standards of likelihood to impact the team. Also note which of those standards would have included this problem. I mean, you say that this was an upstream table, so it's not immediately obvious to me that you should have notified them even if you were notifying them of changes that might impact them. Perhaps it would be obvious if I knew more about your data structure. Anyway, discussing multiple notification standards and their likely impact would be useful if 99% of the changes that you would report had no actual impact on them at all.
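The last item, comparing notification standards, can be estimated from table lineage if you have it. Below is a minimal Python sketch under stated assumptions: the lineage map, the table names, the list of changes and both helper functions are hypothetical, and the two "standards" shown (tables read directly by the team's jobs versus anything upstream of them) are just two examples of where the line could be drawn.

```python
# Hypothetical lineage: each table maps to the tables built directly from it.
DOWNSTREAM = {
    "raw_orders": ["core_orders"],
    "core_orders": ["finance_audit_report", "sales_dashboard"],
    "core_customers": ["sales_dashboard"],
}
# Tables owned by the team asking for notifications (illustrative).
TEAM_TABLES = {"finance_audit_report"}

# Hypothetical record of which table each change touched (one entry per change).
CHANGES = ["raw_orders", "core_customers", "core_orders", "raw_orders"]


def reachable(table, max_depth=None):
    """Tables downstream of `table`, limited to `max_depth` hops when given."""
    seen, frontier, depth = set(), [table], 0
    while frontier and (max_depth is None or depth < max_depth):
        # Step one hop further down the lineage, skipping tables already visited.
        frontier = [c for t in frontier for c in DOWNSTREAM.get(t, []) if c not in seen]
        seen.update(frontier)
        depth += 1
    return seen


def notifications(changes, max_depth=None):
    """Count the changes that would trigger a notice under the given standard."""
    return sum(1 for t in changes if reachable(t, max_depth) & TEAM_TABLES)
```

Here `notifications(CHANGES, max_depth=1)` counts changes to tables the team's jobs read directly, while `notifications(CHANGES)` counts changes anywhere upstream. Comparing the two numbers, and checking which standard would have flagged the actual incident, gives you something concrete to discuss.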

My electric utility company twice warned me of potential power outages from weather that might have caused issues. In both cases, I experienced no problem. Another time, they did not warn me (the wind storm was not predicted beforehand). My power was out for three days and my internet for five (a tree fell on the lines between my house and the street). Somewhat problematic, as I work from home. Net result is as you might expect. I tend to ignore the warnings, as they are prepared for those and there aren't problems. But I am thinking of mitigation steps for unexpected occurrences.

Anyway, my point is that I suspect that you are right, they will quickly learn to ignore your warnings except when their stuff breaks for the normal reasons. Then they'll assume that it was the warned activity and not their own actions. This is likely to make them react slower to problems rather than allow them to better anticipate things that might go wrong.

You might emphasize this by showing them the last twenty or hundred times that you would have notified them under each standard, including the incident. Then ask them which one was the notification related to that incident. See if they can recognize it under the way that you would have notified them. I.e. would "modify process populating table in [this] database to be more efficient" have let them know the potential impact to them? Even if you had added that that data would later be used to generate some data in their report(s).

I want to reiterate that this is probably not the time to be proactive in bringing things to their attention. For one thing, you may get them to start thinking about something that they hadn't considered yet. Also, stating information up front (other than the post-incident review) is likely to make them perceive you as defensive. They'll discount the information. If you use it in rebuttal, it will sound stronger. If you don't have to use it at all, then great. Perhaps they never even had that thought.

user156207 · Accepted Answer · 2023-11-14 22:14:47Z

It's called a 'postmortem'.

It's not an overreaction. It's not an accusation.

The way you are responding is immature, as you are making it personal and tribal.

I was previously part of a team whose reaction to postmortems was always to treat them as a personal attack, which in my view was not constructive. It's not about you or your job. It's about the product. You are there to build and maintain a product and a culture of continuous improvement.

Bring to the meeting: "What happened?" "What can we do to ensure this does not happen again?"

Irrelevant to this meeting are the ongoing problems with the customer team. Totally irrelevant. If it bothers you, call a postmortem with them the next time one of their jobs throws an error, and solve the problem.

MikeB · Accepted Answer · 2023-11-14 09:49:41Z

Others have answered most of the angles, but I'd like to pick up on one part that jumped out at me:

We also held an internal post-incident review and made a couple of technical and process changes to stop this specific issue from happening again.

I hope that I'm reading that wrongly (but if I am, then others will too): you need to make it clear to everyone else that your review attempted to ensure that other issues won't occur either, not JUST this specific issue.

We all know that it is impossible to predict all future issues, but if you can show that you are being proactive, then you really don't need to worry about anything else. If the other team still grumble, I would ask for a representative to attend your meetings for both awareness and input.

Barmar · Accepted Answer · 2023-11-14 15:30:33Z

Others have answered your first question pretty well, so I'll just address the second question about notifying them whenever you make a change.

People don't like surprises. When these jobs fail due to their bad input, it's not a surprise to them; they're used to it. But other types of failures are not normal to them.

When you make a change to an application, the other stakeholders for that application should be kept in the loop. They don't need to know the technical details if it will be over their heads (so they probably shouldn't be included in the VC notifications), but a "heads up" is good practice. Something like

We tweaked the X job to improve performance. We don't expect a problem, but let us know if you see something wrong.

This shouldn't really be an extra step for this application; it should be routine for all your applications that impact other groups. Maybe there's even some way you can automate it.
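One way the automation might look, as a minimal sketch only: if the pipeline's dependency graph and job ownership are available as data, a small script can work out which teams sit downstream of a changed table and draft the heads-up for them. The `downstream` and `owners` mappings, the table and team names, and both function names are assumptions invented for this example; actually delivering the message (Slack, email) is left out.

```python
from collections import deque


def teams_to_notify(changed_table, downstream, owners, changing_team):
    """Walk the dependency graph breadth-first from changed_table and
    collect the owners of every downstream job, excluding the team
    that made the change."""
    seen = set()
    teams = set()
    queue = deque([changed_table])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, ()):
            if child in seen:
                continue
            seen.add(child)
            queue.append(child)
            owner = owners.get(child)
            if owner is not None and owner != changing_team:
                teams.add(owner)
    return sorted(teams)


def heads_up(job):
    """Draft the short, non-technical notice suggested above."""
    return (
        f"We tweaked the {job} job to improve performance. "
        "We don't expect a problem, but let us know if you see something wrong."
    )


# Hypothetical dependency graph: table -> jobs built directly from it.
downstream = {
    "orders": ["customer_orders", "audit_report"],
    "customer_orders": ["marketing_dashboard"],
}
# Hypothetical ownership of each downstream job.
owners = {
    "customer_orders": "central",
    "audit_report": "finance",
    "marketing_dashboard": "marketing",
}

recipients = teams_to_notify("orders", downstream, owners, changing_team="central")
message = heads_up("orders")
print(recipients, message)
```

Because the walk is transitive, teams whose jobs depend only indirectly on the changed table are included as well.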

James · Accepted Answer · 2023-11-14 17:49:08Z

Their jobs failing doesn't affect any other users of the central data platform so it's not a critical concern of mine.

This is the important feature of this issue. The other team depends on the quality and timeliness of your work, not the other way around. It's completely reasonable for them to be concerned with the delay caused by your team, and to follow up with you to make sure it doesn't happen again. How often the other team's jobs fail is not your concern insofar as it has nothing to do with the quality and timeliness of your work.

I also don't want them thinking that my team is unreliable or risky.

Did anyone on the other team explicitly say your team was unreliable or risky? If they explicitly say so then feel free to bring up your team's high reliability over the last 18 months, otherwise this is not even an issue.

My team are constantly doing internal tasks around optimisation and refactoring that have no noticeable impact on external teams. I'm worried that telling this team every time we do such a task will a) result in them not understanding the importance when we need to change something that actually impacts them and b) add a tedious "contact this team" step into most of our tasks.

Bring this up during your meeting with the stakeholders on the other team.

plagiarisedwords · Accepted Answer · 2023-11-15 22:15:56Z

A lot of good answers already but I wanted to add a bit since I do the same job.

In the data world, we've adopted various software practices but have tended to build monoliths. One consideration to raise with stakeholders is the degree to which the financial audit team depends on "central" data models. There's a trade-off: rely on other people's work and risk errors, or build it yourself and have more work to do.

In addition, you might want to limit which tables they can reference. Tables that can be referenced publicly (that is, used by people outside your team) are like interfaces. This means people cannot just reference "shaky table to support fun emoji analysis" when doing serious financial stuff without you knowing. You can simply agree on the rules initially and embed them as code later (as part of your continuous integration flow).

In a past workplace, I grouped tables by which team owned them. Each team's tables are private by default and have to be made public before a different team can reference them. The same goes for models owned by the central team. This stops people creating dependencies without both parties agreeing.
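To make the "embed them as code" idea concrete, here is a minimal sketch of a check that could run in continuous integration and fail the build when a table references another team's private table. It assumes a hypothetical in-memory registry of tables; the `Table` class, `find_violations`, and the example tables are illustrative names, not the API of any particular data tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str
    owner: str              # team that owns the table
    public: bool = False    # private by default
    depends_on: tuple = ()  # names of upstream tables


def find_violations(tables):
    """Return (table, upstream) name pairs where a table references a
    private table owned by a different team."""
    by_name = {t.name: t for t in tables}
    violations = []
    for table in tables:
        for dep_name in table.depends_on:
            dep = by_name.get(dep_name)
            if dep is None:
                continue  # unknown upstreams are out of scope for this sketch
            if dep.owner != table.owner and not dep.public:
                violations.append((table.name, dep.name))
    return violations


# Hypothetical registry: one public core table, one private table,
# and a finance report that references both.
tables = [
    Table("orders", owner="central", public=True),
    Table("emoji_analysis", owner="central"),
    Table("audit_report", owner="finance", depends_on=("orders", "emoji_analysis")),
]

violations = find_violations(tables)
for table_name, upstream_name in violations:
    print(f"{table_name} references private table {upstream_name}")
```

In a CI job, a non-empty `violations` list would be the signal to fail the build, so a new cross-team dependency can only land once the owning team has marked the table public.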

Petter TB · Accepted Answer · 2023-11-20 21:40:31Z

Just answering on a tiny detail.

I also don't want them thinking that my team is unreliable or risky.

Don't let this end up as an evaluation of your worth, either from them or from yourself. Quality of service is a big, unclear, difficult thing to handle. If a process is fragile, or somehow unreliable, that may be OK or very problematic. Neither answer condemns you, or your team, as "bad" in some way.

Desperately trying to make a fragile thing look solid, for ego or politics reasons, is a thankless place to be.
