Last week someone asked what the Codegolf_Temp database was, and it turned out to be left over from a batch job that stalled midway and didn't recover. That was rectified by restarting the batch later that day.
Because the root cause was not found, we had to wait for the problem to turn up again to collect more data points. Guess what? It happened again.
This time a leftover StackApps_Temp database is present, and the databases for StackOverflow and SuperUser, to name a few, haven't been refreshed.
After consulting the bluefeeted DBA in chat, I was asked to write a new bug report.
Can you please investigate the root cause of the repeated failures of the SEDE refresh batch, implement any needed fixes, add a verify task to the monitoring tooling, and put a batch restart procedure in place for the on-call SRE in case a failure occurs?
I do realize this failure could be related to the out-of-disk-space incident earlier today, but a proactive response would still be preferable to a reactive one.
Can this please be looked at?
The following databases are impacted:
name                           create_date         database_id
------------------------------ ------------------- -----------
StackApps_Temp                 2018-05-13 04:41:00 133
StackApps                      2018-05-06 14:05:58 134
StackExchange.Ubuntu.Meta      2018-05-06 14:06:10 135
StackExchange.Ubuntu           2018-05-06 14:12:37 136
StackExchange.Stats.Meta       2018-05-06 14:12:47 137
StackExchange.Stats            2018-05-06 14:15:21 138
StackExchange.Photography.Meta 2018-05-06 14:15:30 139
StackExchange.Photography      2018-05-06 14:16:22 140
StackExchange.WebApps          2018-05-06 14:16:54 141
ServerFault.Meta               2018-05-06 14:17:04 142
SuperUser.Meta                 2018-05-06 14:17:15 143
StackExchange.Meta             2018-05-06 14:19:10 144
ServerFault                    2018-05-06 14:23:03 145
SuperUser                      2018-05-06 14:26:54 146
StackOverflow                  2018-05-06 17:01:52 147
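
To make the "verify task" request a bit more concrete, here is a minimal sketch of the kind of check I have in mind. It is not how the existing monitoring works. The function name `find_refresh_problems`, the `batch_start` parameter, and the 03:00 start time are all my own assumptions for illustration. The idea is to run it after the refresh should have finished and to flag any leftover `_Temp` database, plus any database whose create_date is older than the batch start.

```python
from datetime import datetime


def find_refresh_problems(rows, batch_start):
    """Return (leftover_temp, stale) for a listing like the one above.

    rows:        iterable of (name, create_date) tuples.
    batch_start: when the latest refresh batch was started (hypothetical
                 input; the real schedule is not known to me).
    """
    rows = list(rows)
    # A *_Temp database that still exists suggests the batch stalled midway.
    leftover_temp = [name for name, _ in rows if name.endswith("_Temp")]
    # Anything created before the batch started was not refreshed by it.
    stale = [
        name
        for name, created in rows
        if not name.endswith("_Temp") and created < batch_start
    ]
    return leftover_temp, stale


# Three rows taken from the listing above; the batch start is assumed.
rows = [
    ("StackApps_Temp", datetime(2018, 5, 13, 4, 41, 0)),
    ("StackApps", datetime(2018, 5, 6, 14, 5, 58)),
    ("StackOverflow", datetime(2018, 5, 6, 17, 1, 52)),
]
leftover, stale = find_refresh_problems(rows, datetime(2018, 5, 13, 3, 0, 0))
print(leftover)  # ['StackApps_Temp']
print(stale)     # ['StackApps', 'StackOverflow']
```

If either list is non-empty, the monitor could alert the on-call SRE, who would then follow the restart procedure.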