
Last week I asked what the Codegolf_Temp database was, and it turned out to be left over from a batch job that stalled midway and didn't recover. That was rectified by restarting the batch later that day.

Since the root cause was not found, we had to wait for the problem to turn up again so we could collect more data points. Guess what? It happened again.

This time a leftover StackApps_Temp database is present, and the databases for StackOverflow and SuperUser, to name a few, haven't been refreshed.

After consulting in chat with the blue-footed DBA, I was asked to write up a new bug report.

Can you please investigate the root cause of these repeated failures of the SEDE refresh batch, implement any needed fixes, add a verification task to the monitoring tooling, and document a batch-restart procedure for the on-call SRE in case a failure occurs?

I realize this failure could be related to the out-of-disk-space incident earlier today, but a proactive response would still be preferable to a reactive one.

Can this please be looked at?
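To sketch what I mean by a verification task: since a stalled refresh seems to leave a *_Temp database behind, a monitoring check could alert on such leftovers once the refresh window has passed. This is only a suggestion of the idea, and the six-hour threshold is an assumption:

```sql
-- Sketch of a proactive check (assumed threshold of 6 hours):
-- alert if a *_Temp database from a stalled refresh is still present
-- well after the refresh window should have completed.
SELECT name, create_date
FROM sys.databases
WHERE name LIKE N'%[_]Temp'                            -- [_] escapes the underscore wildcard
  AND create_date < DATEADD(HOUR, -6, SYSUTCDATETIME());
```

Any rows returned would mean a refresh stalled and left its temporary database behind.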

The following databases are impacted:

name                           create_date         database_id 
------------------------------ ------------------- ----------- 
StackApps_Temp                 2018-05-13 04:41:00 133         
StackApps                      2018-05-06 14:05:58 134         
StackExchange.Ubuntu.Meta      2018-05-06 14:06:10 135         
StackExchange.Ubuntu           2018-05-06 14:12:37 136         
StackExchange.Stats.Meta       2018-05-06 14:12:47 137         
StackExchange.Stats            2018-05-06 14:15:21 138         
StackExchange.Photography.Meta 2018-05-06 14:15:30 139         
StackExchange.Photography      2018-05-06 14:16:22 140         
StackExchange.WebApps          2018-05-06 14:16:54 141         
ServerFault.Meta               2018-05-06 14:17:04 142         
SuperUser.Meta                 2018-05-06 14:17:15 143         
StackExchange.Meta             2018-05-06 14:19:10 144         
ServerFault                    2018-05-06 14:23:03 145         
SuperUser                      2018-05-06 14:26:54 146         
StackOverflow                  2018-05-06 17:01:52 147         
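For reference, a listing like the one above can be produced from the sys.databases catalog view; this assumes the refresh recreates each database, so create_date reflects the last successful refresh:

```sql
-- Sketch: list user databases by creation time, assuming create_date
-- indicates when each database was last refreshed (recreated).
SELECT name, create_date, database_id
FROM sys.databases
WHERE database_id > 4          -- skip the system databases
ORDER BY create_date;
```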
Comments:

  • I quickly checked and the job did fail again. It looks like only a handful of databases didn't get refreshed. As to why it failed, I'll investigate a bit further on Monday to see what's causing the issue. – Taryn (May 13, 2018 at 19:19)
  • If I'm doing it right this problem might be back in some fashion. Some SEDE data didn't refresh this morning. (Sep 30, 2018 at 13:49)
  • @JeffSchaller yeah, it stalled. I have a different query to check: data.stackexchange.com/mathoverflow/query/898673 there is an email out I guess and on Monday the SRE will assess what caused it and how to recover. – rene (Sep 30, 2018 at 14:46)

1 Answer


Unfortunately, yes, the weekly refresh did fail two weeks in a row. The "good thing" is that the failures had completely different causes, so ideally this shouldn't happen on a regular basis.

The job on 2018-05-06 failed relatively early in the process, because of a deadlock. The message we received in the error log was:

Transaction was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

Basically, the SQL engine decided that the weekly refresh wasn't as important as another running process and killed it. Being the deadlock victim stinks, because you've got to start over.
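One mitigation for "start over from scratch" is to retry the failed step automatically, since the error message itself says "Rerun the transaction." A sketch of that pattern, where dbo.RefreshDatabaseStep is a hypothetical name for a refresh step, not the actual job's procedure:

```sql
-- Sketch: retry a refresh step when it is chosen as the deadlock victim.
-- Error 1205 is the deadlock-victim error; dbo.RefreshDatabaseStep is hypothetical.
DECLARE @retries int = 0;
WHILE @retries < 3
BEGIN
    BEGIN TRY
        EXEC dbo.RefreshDatabaseStep;   -- hypothetical refresh step
        BREAK;                          -- success, stop retrying
    END TRY
    BEGIN CATCH
        IF ERROR_NUMBER() = 1205        -- deadlock victim: rerun, as the message suggests
            SET @retries += 1;
        ELSE
            THROW;                      -- any other error is fatal
    END CATCH;
END;
```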

The failure this week, 2018-05-13, was not caused by a deadlock. We had an outage on Saturday night that just happened to coincide with the weekly refresh job. While the outage was on the SQL Servers for Chat and the SE network, it impacted the job because the server that holds the SEDE databases uses a linked server to the server involved in the outage.

I re-ran the weekly refresh this morning, so all databases should have fresh data. I also added a step to notify us via email if the job fails. We won't get paged, but we'll see the email by the Monday after the failure (at the latest) and will be able to kick off a restart of the job.
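For anyone curious, failure notifications like this can be configured directly on a SQL Server Agent job; a sketch, where the job and operator names are made up for illustration:

```sql
-- Sketch: have SQL Server Agent email an operator when the job fails.
-- 'SEDE Weekly Refresh' and 'SEDE On-Call' are hypothetical names.
EXEC msdb.dbo.sp_update_job
    @job_name = N'SEDE Weekly Refresh',
    @notify_level_email = 2,                 -- 2 = notify on failure
    @notify_email_operator_name = N'SEDE On-Call';
```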
