172

On Friday, July 29th, starting at 13:36 UTC, we experienced a very large surge in traffic to our web servers, indicating a DDoS attack. This surge effectively brought down the Stack Exchange Network sites (including Stack Overflow) and Stack Overflow for Teams (Free, Basic, and Business). Stack Overflow for Teams Enterprise was unaffected. We were able to restore service by 15:48 UTC, and have since deployed new defenses to better address these attacks in the future.

We noticed an increase in traffic spikes starting earlier this year, which can sometimes cause site instability. While we’ve gotten better at reducing the overall impact to the site, these traffic spikes are increasing in frequency and scale. These bursts of traffic can cause some users to see a maintenance page or some other error page momentarily, but we continue to work to keep the effects to a minimum.

To address these overall trends, we recently adopted a new web application firewall, both to mitigate vulnerabilities and to act as an intelligent rate limiter. We’re also currently testing new observability tools to help us respond faster and predict future attacks.
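
For readers unfamiliar with the term, the basic idea behind a rate limiter can be sketched with a per-client token bucket. This is only a minimal illustration of the general technique, not a description of the firewall mentioned above: the class names, the `rate` and `capacity` parameters, and the use of a client identifier as the key are all assumptions made for this example.

```python
import time


class TokenBucket:
    """One bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full, so a short burst is allowed
        self.updated = now

    def allow(self, now):
        # Refill according to the time elapsed since the last request.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        # Spend one token per request; reject when the bucket is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """Hypothetical limiter keeping one bucket per client identifier."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, client_id, now=None):
        # `now` can be injected for testing; otherwise use a monotonic clock.
        now = time.monotonic() if now is None else now
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = self.buckets[client_id] = TokenBucket(
                self.rate, self.capacity, now
            )
        return bucket.allow(now)


limiter = PerClientLimiter(rate=1.0, capacity=3.0)
burst = [limiter.allow("client-a", now=0.0) for _ in range(5)]
print(burst)  # [True, True, True, False, False]
```

A per-client limit like this only helps when the traffic comes from a small number of sources. A distributed attack spreads requests across many clients, each of which may stay under its individual limit, which is presumably why an "intelligent" limiter needs more signals than a simple counter.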

Another area we want to improve is our communication during and after incidents. It is difficult for people working on the technical problem to also be providing status updates. I personally apologize for the delay in responding to the various Meta questions with this post.

We are working on a number of improvements. Now that we have an automated status page, we are examining how to get it updated sooner. Other improvements involve additions to the status page itself and changes to how information is displayed. To be clear, the status page reflects that a human is working on the issue, not whether our monitoring system has detected issues. Internally, we are improving our communication processes, standardizing them to be more consistent, and clarifying which events trigger communication.

As always, we would like to thank the community for your patience as we work hard on addressing these issues.

18
  • 13
    Last time, Tor nodes played a major role in these DDoS attacks. Has that changed?
    – Luuklag
    Commented Aug 11, 2022 at 19:49
  • 65
    "It is difficult for people working on the technical problem to also be providing status updates": that is a contender for the understatement of the century :) Of course you can't be dealing with irate users while you're putting out fires!
    – terdon
    Commented Aug 11, 2022 at 20:08
  • 5
    @Luuklag It has not, although blocking exit nodes isn't really useful anymore given the size and scope of recent traffic.
    – Josh Zhang StaffMod
    Commented Aug 11, 2022 at 20:17
  • 5
    The Tor reference (February 2022 through May 2022): "Update on the ongoing DDoS attacks and blocking Tor exit nodes"
    Commented Aug 11, 2022 at 20:39
  • 4
    It seems that "Sites unavailable, intermittent behaviour", from about a week ago, deals with a particular issue related to the overall site instability that this post is describing.
    Commented Aug 11, 2022 at 23:04
  • 16
    While I personally don't understand any of the technical stuff involved here, I want to thank you for openly communicating with the community like this. Lots of websites have outages, and usually it's just annoying, but rarely do we see the site companies communicating directly with their userbase (and even responding quickly to queries!) about outages and the reasons behind them.
    Commented Aug 13, 2022 at 10:55
  • 6
    One suggestion I would make (based on my time dealing with various crises on an open source project, whether our websites being attacked or a security issue) is to have someone "in the room" who is technical enough to understand and explain what is happening but not directly involved. Have that person manage the communications. You may have to tolerate them asking a few questions, but it takes that burden off the people fixing the problem.
    – Elin
    Commented Aug 13, 2022 at 17:12
  • 2
    It seems someone maintaining the status page does not know Stack Exchange likes a space in its name ;-)
    – Arjan
    Commented Aug 13, 2022 at 20:15
  • 3
    Any insight into why one or more parties is trying to take down Stack Overflow and Stack Exchange?
    Commented Aug 16, 2022 at 5:58
  • 2
    There are plenty of ideologues/disaffected-youth/s**t-stirring governments willing to take the time. Then there's the perfectly innocent explanation that we're the evil ones and deserve it. I know which side of the argument I take. :) @RockPaperLz-MaskitorCasket
    – W.O.
    Commented Aug 16, 2022 at 6:03
  • @W.O. Thanks for your thoughts. Except for someone's claim of "look at me, I can take down a popular site", what's in it for anyone (real question)? SE/SO contains a wide range of content, and isn't centered around any particular agenda (except for the owners making a bunch of money). If SE/SO disappeared tomorrow, the biggest change would be on the income of the owners (and temporarily on the employees, of course). Sure, many people would miss the semi-organized content, but much of it is available elsewhere.
    Commented Aug 16, 2022 at 6:15
  • There's the possibility of competitors (not that there really are any). The only site that might receive negative attention from the powerful (excepting the religious sites, which generate that by default) would be Skeptics: debunking government/pharma claims, quite a few posts of people trying to prove the Bible or other text, quite a few posts about Trump's activities, etc. I suppose the Politics stack might come in for some, not sure.
    – W.O.
    Commented Aug 16, 2022 at 6:21
  • 1
    I'm not so good with the programming, but I have wondered for years why an intelligent rate limiter is never used to mitigate DDoS attacks. Plus, surely there are other signs of an attack that can be intelligently detected.
    – n00dles
    Commented Aug 17, 2022 at 0:35
  • 1
    @TravisJ Ah right. So there's a lot of under the radar stuff. I thought I could solve the dark matter problem in a YT comment. It's good that I'm thinking.
    – n00dles
    Commented Aug 18, 2022 at 13:27
  • 1
    @n00dles - It's not a bad thought really. There are ways to use a more counterintuitive approach as well, which is what a more sophisticated company would do. Guessing is never as efficient as knowing.
    – Travis J
    Commented Aug 18, 2022 at 18:13

4 Answers

87

Any chance you guys can at least consider changing the error page from the "site is undergoing maintenance" that currently gets used? It’s very misleading, when the issues are not, in fact, maintenance activities (and apparently just transient errors).

There was a Meta Stack Overflow question recently from somebody confused about why you were "planning maintenance" in peak hours all the time, so I'm not the only person that's been thrown off by that page.

13
  • 25
    While I think the wording could be improved to better reflect the possible causes, the error page just says "offline for maintenance" and never says it was planned.
    – animuson StaffMod
    Commented Aug 13, 2022 at 3:17
  • 2
    @animuson I'm thinking of opening a new meta discussion about this, but before doing it, I want to ask: Sometimes the plain "Service unavailable" page appears. Is that text also customizable?
    Commented Aug 13, 2022 at 13:59
  • 3
    The error page, for reference, says "We are currently offline for maintenance. Routine maintenance usually takes less than an hour. If this turns into an extended outage, we will tweet updates from @StackStatus or post details on the status blog."
    – Ryan M
    Commented Aug 15, 2022 at 5:53
  • 2
    "I think there was a Meta Stack Overflow question recently from somebody confused about why you were "planning maintenance" in peak hours all the time" That would be me. The meta SO thread can be found here.
    – Lundin
    Commented Aug 15, 2022 at 11:24
  • 8
    @animuson Very fair, my bad on that. But the wording about "routine maintenance" sounds very leading in my mind.... I'd agree the wording be updated a bit.
    – mbrig
    Commented Aug 15, 2022 at 16:47
  • "Peak hours" are much in the eye of the beholder. Do you mean actual peak traffic times (in UTC) on the SE servers? Because whenever the USA gets to work, I get off work. For a global site, I can very much understand that, given your site attracts visitors at all hours of the day, you do maintenance whenever it suits your devs most, even if that's USA working hours.
    – Adriaan
    Commented Aug 16, 2022 at 7:42
  • I was there.... routine maintenance!? Only lasted 'till I visited the twitter link, though. Well updated it was.
    – n00dles
    Commented Aug 17, 2022 at 0:41
  • @Adriaan the point being, it's not "maintenance" at all. It's a server error caused by transient overload that will resolve in seconds/minutes.
    – mbrig
    Commented Aug 17, 2022 at 2:05
  • @Lundin thanks, I edited to link directly to your MSO question.
    – mbrig
    Commented Aug 17, 2022 at 2:07
  • 4
    @animuson maintenance, by its definition as far as I know, is something planned. One won't perform maintenance on their site just like that out of the blue. So I disagree with your comment, and think this word should not be used, unless of course the site is down due to, well, maintenance. Better generic wording can be "We are currently offline", that's it. And for more details, one can go to the status page.
    Commented Aug 17, 2022 at 6:57
  • 1
    @Shadow That's a very narrow-minded definition of maintenance that is certainly not the held definition. When something breaks, you take it in for maintenance. It encompasses all forms of preventive care, service, and repair. Planning has nothing to do with it.
    – animuson StaffMod
    Commented Aug 17, 2022 at 7:05
  • 9
    @animuson that's new for me. Maybe it's just a language barrier, but keep in mind there are many people using Stack Exchange sites who don't have English as their native language, so they might share my narrow-minded definition. Thinking about it some more, it might only be due to the Hebrew default definition of this term, but still, this answer proves there are at least a few people who get confused.
    Commented Aug 17, 2022 at 7:41
  • 18
    @animuson It's not narrow-minded at all. Maintenance implies planning. Outside of technology, it implies regular servicing for regular occurrences. If you have a car crash, you take your car for repairs, not maintenance.
    Commented Aug 17, 2022 at 14:31
26

Stack Overflow for Teams Enterprise was unaffected

Was this because the SO Teams Enterprise product has less traffic in general, because it wasn't targeted, or because it has a better security infrastructure?

2
  • 25
    My understanding is that SO Enterprise is software that you can run yourself or pay SO to run for you; the attacker can't attack Enterprise instances running on private intranets (since they can't even connect to them), so those couldn't have been affected. You can also pay SE to run an SO Enterprise instance for you, but such instances are run with different servers/infrastructure/domains (Microsoft Azure is used for SE-hosted SO Enterprise instances; the public network is hosted with servers owned directly by SE), so an attack on SE sites wouldn't affect any Enterprise servers.
    – smitop
    Commented Aug 11, 2022 at 23:26
  • 59
    SOE runs on individual Azure tenants; they're also generally not exposed to the open internet.
    – Josh Zhang StaffMod
    Commented Aug 12, 2022 at 0:28
21

Please say the status page is not hosted on the same infrastructure as the rest of SE... I'd hate for that to go down at the same time as an outage on the rest of SE.

7
  • 39
    The status page is hosted by a third party on AWS. We also have the Twitter account @stackstatus.
    – Josh Zhang StaffMod
    Commented Aug 12, 2022 at 4:03
  • 11
    @JoshZhang more than once I've seen the status page unavailable during an outage of the main site. If it's hosted separately, then maybe it's also being targeted.
    Commented Aug 13, 2022 at 7:35
  • @roaima that's due to some weird DNS cache in Chrome; go to https://www.stackstatus.net/ (with www) and it will work. I still haven't figured out why or how, but without www it just never loads for me, to this day.
    Commented Aug 17, 2022 at 7:00
  • @ShadowTheKidWizard See stackoverflow.com/questions/486621/… for further reading on why www. is required for that (or any) domain
    – TylerH
    Commented Aug 25, 2022 at 19:41
  • @TylerH on a quick read, it does not explain why it's required, but I might dig deeper later. Anyhow, the links on SO itself (e.g. from the down for maintenance page) pointed to the www-less address, thus broken for many (e.g. for me ;)) after the switch to the new page.
    Commented Aug 26, 2022 at 10:05
  • @ShadowTheKidWizard Hmm, any attempt to load the non-www page should auto-forward you to the www. address, from my testing. Are you using a common browser like Firefox or Chrome, or something more esoteric?
    – TylerH
    Commented Aug 26, 2022 at 13:41
  • @TylerH Latest stable Chrome. The developer explained why it happens, but I forgot.
    Commented Aug 26, 2022 at 16:54
9

We noticed an increase in traffic spikes starting earlier this year, which can sometimes cause site instability

Can you elaborate more on site instability? :P

Does it refer to a massive amount of lag? As suggested by:

These bursts of traffic can cause some users to see a maintenance page or some other error page momentarily

1
  • 15
    Yes, some users could experience momentary long page loads during a traffic spike.
    – Josh Zhang StaffMod
    Commented Aug 12, 2022 at 4:04
