103

Some recent incidents caused confusion that we’d like to address. Part of it stems from our status pages and banners, which we’re going to update. The other point of contention involves our status page, StackStatus, and when it’s updated.

New high availability configuration

In the past, if the site ran into a major error and crashed, or if a big enough DDoS attack brought the site down, users were presented with a generic page saying the site was down for maintenance. Since moving to Cloudflare as our CDN, we use one of their features that automatically redirects traffic to our secondary data center, which runs in read-only mode. The end result is that even if the primary site goes down, users will still be able to access the site, albeit in a degraded fashion (read-only mode). When the application recovers in the primary site, traffic should automatically be directed back.
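
As a rough illustration of the behavior described above, here is a minimal Python sketch. It is not Cloudflare's API or our actual configuration; the `Origin` class, the origin names, and the `handle_request` helper are hypothetical and only model the idea that reads keep working on the secondary while writes are refused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Origin:
    name: str
    read_only: bool


# Hypothetical stand-ins for the two data centers.
PRIMARY = Origin("primary", read_only=False)
SECONDARY = Origin("secondary", read_only=True)


def choose_origin(primary_healthy: bool) -> Origin:
    """Use the primary while it is healthy, otherwise the read-only secondary."""
    return PRIMARY if primary_healthy else SECONDARY


def handle_request(method: str, primary_healthy: bool) -> str:
    """Serve reads from whichever origin is chosen; refuse writes in read-only mode."""
    origin = choose_origin(primary_healthy)
    is_write = method.upper() not in {"GET", "HEAD", "OPTIONS"}
    if origin.read_only and is_write:
        return f"{origin.name}: rejected (read-only mode)"
    return f"{origin.name}: served"
```

In this model, `handle_request("GET", primary_healthy=False)` is served by the secondary, while a `POST` in the same situation is rejected until the primary recovers.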

Because of the way we used read-only mode in the past during planned maintenance, the verbiage around read-only mode is now a bit confusing. We’ll be updating it in the coming weeks to reflect these changes.

StackStatus

We use a tool called FireHydrant for incident management; the status page at www.stackstatus.net is one of its built-in features. Because FireHydrant is an incident management tool rather than an application monitoring tool, StackStatus is not automatically updated whenever there is site instability. StackStatus is only updated when someone internally declares a major incident. One of the thresholds for declaring a major incident is the site being down for longer than 5 minutes. That said, infrequent updates have been an issue in the past, so we are working to improve our transparency during incidents by updating StackStatus more often.
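
To make the 5-minute threshold concrete, here is a small Python sketch. It is an illustration only, not FireHydrant functionality: the function name and constant are hypothetical, and in practice a person still has to declare the incident before StackStatus changes.

```python
from datetime import datetime, timedelta

# The threshold mentioned above: down for longer than 5 minutes.
MAJOR_INCIDENT_THRESHOLD = timedelta(minutes=5)


def crosses_downtime_threshold(down_since: datetime, now: datetime) -> bool:
    """Return True once the outage has lasted strictly longer than the threshold."""
    return now - down_since > MAJOR_INCIDENT_THRESHOLD
```

A blip of two minutes would return `False` here, which matches why short periods of instability may never appear on StackStatus.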

New Health Check Dashboard

Because of how StackStatus is used, and in response to feedback from the community, we’ve built a new dashboard that shows live data directly from our application monitoring tool: s.tk/stackstatusdashboard. A link to the dashboard is already up on www.stackstatus.net. The dashboard shows the number of health check errors reported by our monitoring software, which runs tests from its data centers around the world. If our monitoring software detects any errors, they will be reflected in the new dashboard. If no errors are detected, the dashboard will be blank, which is the state we’d like it to remain in!
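
For readers curious about what "number of health check errors" means in practice, the following Python sketch shows one way such a count could be derived. The input format (a timestamp, a probe region, and a pass/fail flag) and the hourly bucketing are assumptions made for illustration; this is not the monitoring tool's API or the dashboard's real query.

```python
from collections import Counter
from datetime import datetime


def count_errors_per_hour(results):
    """Count failed checks per hour.

    `results` is an iterable of (timestamp, probe_region, ok) tuples,
    a hypothetical shape chosen for this example.
    """
    counts = Counter()
    for timestamp, _region, ok in results:
        if not ok:
            # Truncate the timestamp to the start of its hour.
            bucket = timestamp.replace(minute=0, second=0, microsecond=0)
            counts[bucket] += 1
    return dict(counts)
```

With no failed checks the function returns an empty dictionary, which corresponds to the blank dashboard described above.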

6
  • 14
    Wow, all of this looks really cool. Awesome stuff!
    – Spevacus
    Commented Nov 7, 2023 at 15:27
  • 9
    dashboard courtesy of Josh Zhang. quoting about its implementation: "The dashboard I created gets data from our PingDom checks which performs health checks from all their data centers around the world."
    – starball
    Commented Nov 7, 2023 at 17:32
  • 1
    In Stack Overflow for Teams - FBB, here FBB stands for?
    – Arulkumar
    Commented Nov 8, 2023 at 16:11
  • 9
    @Arulkumar Free, Basic, Business. We have an Enterprise tier of Teams that's not relevant on StackStatus.
    – Josh Zhang StaffMod
    Commented Nov 8, 2023 at 17:53
  • @JoshZhang Thank you for your explanation.
    – Arulkumar
    Commented Nov 8, 2023 at 18:31
  • Very cool, my uBlock doesn't like iam-rum-intake.datadoghq.com tho.... No dashboard for me
    – Daviid
    Commented Nov 10, 2023 at 11:59

5 Answers

39

(changed to Stack Exchange Network Health due to length limit)


Header of the dashboard has a typo:

typo in dashboard header heath instead of Health

6
  • 67
    Good to know Stack Exchange keeps an eye on the Scottish wildlands.
    – Jon Custer
    Commented Nov 7, 2023 at 20:31
  • 21
    Thanks, and fixed, sort of. Fixing it caused the title to truncate to "Stack Exchange Network Health Chec..."
    – Josh Zhang StaffMod
    Commented Nov 7, 2023 at 21:37
  • 18
    @JoshZhang remove the word Checks altogether, it still makes sense as a dashboard title.
    – MT1
    Commented Nov 8, 2023 at 8:09
  • 2
    @Josh thanks, I see it's fixed now.
    Commented Nov 8, 2023 at 13:48
  • 6
    Could have been worse. Hoth Checks would have been funny.
    – Joshua
    Commented Nov 9, 2023 at 2:35
  • 3
    @Joshua Hoth check: still frozen, any signs of an ancient battle have long since been covered in snow and ice :P
    – Robotnik
    Commented Nov 13, 2023 at 3:24
24

On the StackStatus site, when hovering over each particular day, a tooltip appears that shows: the date, a status icon, and status text. But the icon doesn't seem to match with the legend at the top.

Hovering over a particular day shows: a green check mark, and "Site unavailable". This contradicts the legend which says the green check mark is for "Operational".

6
  • I wonder if the check means that the incident has been resolved and it is now Operational.
    – muru
    Commented Nov 8, 2023 at 2:37
  • 3
    The icon matches the last state reported in the incident. We usually put the state back to operational once we mitigate an incident.
    – Josh Zhang StaffMod
    Commented Nov 8, 2023 at 12:51
  • 30
    @JoshZhang That's a confusing design.
    Commented Nov 8, 2023 at 12:53
  • 6
    Agreed, it's confusing, and while it's not a bug, I think it's fair to ask for a change. @Josh is it possible to make the icon show the state at the time of the incident, so that it matches the color of the graph bar?
    Commented Nov 8, 2023 at 13:51
  • 2
    Perhaps a verbiage update, then? Rather than Operational, call it Restored.
    – J Scott
    Commented Nov 9, 2023 at 20:00
  • 15
    FireHydrant took everyone's feedback and made a change so that the icon uses the initial incident status instead of the final one. It now behaves the way people expect.
    – Josh Zhang StaffMod
    Commented Nov 10, 2023 at 17:18
11

Disable the click event if there is no pagination.

As the screenshot shows, there is no next page for November. In that case, don't show the hand cursor, and disable the click event. Otherwise, users may click the next button and expect a response.

Screenshot for reference: Disable the button if it is not clickable

1
  • 6
    FireHydrant was really responsive and fixed the issue.
    – Josh Zhang StaffMod
    Commented Nov 10, 2023 at 17:17
11

To maintain consistency, every month with no incidents should display the same message. Currently, some months display the text below, while other months display an empty space.

No incidents found for this month

To ensure uniformity, we should standardize the message to 'No incidents found for this month' for all the months.

Screenshot from a few months:

Notification message

1
  • 5
    FireHydrant also fixed this bug.
    – Josh Zhang StaffMod
    Commented Nov 10, 2023 at 17:17
4

The Stack Exchange Network Health graphs don't have their y-axis labeled (nor the x-axis, but that is obviously datetime). It's entirely unclear to me what is being shown here in the range of 0 to 180. Is this how it's supposed to look? It would be nice if the y-axis were labeled something like "Health check errors".

It would be good if data points were shown even when they are zero. Right now the graphs look empty. It's also unclear what the reporting interval is: every minute, every 5 minutes, or something else? A blank dashboard could mean that the dashboard isn't functioning. Showing zeroes solves the "who's watching the watchmen" problem.

health check dashboard screenshot

Link: https://p.us3.datadoghq.com/sb/bc887ca0-08ce-11ed-a269-da7ad0900003-33d1e48ca7d76da0b86577abb0d28c1d?refresh_mode=sliding&from_ts=1700114956150&to_ts=1700201356150&live=true

4
  • 1
    Yes, that's how it should look, and it's explained in the last paragraph of the announcement: "The dashboard shows the number of health check errors reported by our monitoring software which runs tests from their data centers around the world". So the y-axis is the number of health check errors, plotted against hourly time on the x-axis.
    Commented Nov 17, 2023 at 7:48
  • 2
    I'm not sure why one would consciously not label graph axes
    Commented Nov 17, 2023 at 10:56
  • 3
    This is a quirk of how Datadog generates the graphs. When there is no data to present, the Y axis scales to a default maximum value and the label is hidden. This is what it looks like when there are errors to present: i.imgur.com/loVTP6Q.png. Both the Y axis and the specific bar are labeled. I went ahead and tweaked the dashboard so it's no longer a bar chart but a line graph that defaults to 0 in the absence of data. This is what it will look like with errors: i.imgur.com/N170Jye.png
    – Josh Zhang StaffMod
    Commented Nov 17, 2023 at 15:22
  • 3
    It seems the y-label was fixed, it's now showing a scale of 0 to 1 errors. Hopefully it doesn't show a 0.8 error, that might be more serious to deal with than a 0.2 error :o)
    – pkExec
    Commented Nov 21, 2023 at 5:31
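
The fix discussed in the comments above, a graph that defaults to 0 when there is no data, can be sketched in Python as follows. This is a generic illustration of zero-filling a time series, not how Datadog implements it; the function name and the dictionary input are hypothetical.

```python
from datetime import datetime, timedelta


def fill_missing_with_zero(counts, start, end, step):
    """Return (timestamp, count) pairs for every interval in [start, end).

    Intervals with no recorded errors get an explicit 0, so an empty
    graph can be told apart from a dashboard that has stopped reporting.
    """
    series = []
    current = start
    while current < end:
        series.append((current, counts.get(current, 0)))
        current += step
    return series
```

Plotting the returned series draws a flat line at zero during quiet periods instead of leaving the chart empty.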
