38

I work with and on SmokeDetector (SD). Over the last several days (starting last Monday, 2020-05-111), we've noticed times when SmokeDetector has stopped scanning posts on Stack Exchange. After investigating why this was happening, I found there are now times when SmokeDetector does connect to the Stack Exchange WebSocket, but doesn't get any notifications of activity anywhere on the SE network.2

What happens: no errors, but no data for about 9.67% of connection attempts

The WebSocket is opened, a subscription is sent, but the only data that is sent from SE on the WebSocket is the heartbeat.3 No data is sent other than the heartbeat, no mater how long the socket is left open, even hours. No errors are raised on the socket. There's just no data.

In testing, I've been able to duplicate the problem in both Python and JavaScript (multiple browsers and IP addresses). I've run 331 trials of opening the WebSocket, sending the 155-questions-active channel specifier and then waiting for data. Out of those trials, 32 times the WebSocket connected, but failed to send data, other than the heartbeat.4 That works out to failing for 9.67% of connection attempts.

Does this affect other things than SmokeDetector? Yes, it affects SE in general.

Stack Exchange uses WebSockets for push notifications of status change on a wide variety of data across the network. This problem is likely affecting a significant percentage of times where an endpoint connects to the server.

The problem is probably affecting roughly 10% of page views, but it's not something that's going to be readily noticed as a specific, consistent problem, because it affects live page updates after the page is loaded (much less noticeable), and will usually resolve itself when the page is reloaded.

On each page load, the JavaScript on the page connects to the WebSocket in order to keep the page "alive" (i.e. get notifications of actions done elsewhere). On the random page loads where the problem will occur, it will just feel like the page is a bit more static than normal. The page will feel "Dead". Given that it's something that has a 90% chance of being resolved when the page is reloaded, it's likely that, when it's noticed, it's just being written off by people as a glitch.

Over the last few days, I've definitely noticed times when there have been missing notifications within a SE page for events that happen on the network (e.g. not informed of post score changes, a post being deleted or edited, inbox notifications, reputation changes, a substantial difference between updates on the same page in two tabs, etc.). The problem has been noticed by at least one other person. It's likely been noticed by many other people, but if you're not aware that there's a systemic problem, it's easy to just wave it off as some intermittent issue, as it's 90% likely to go away when the page is reloaded.

Mitigation

We've written code to mitigate the issue for SmokeDetector (detect that we're not getting data and reconnect). For normal viewing of Stack Exchange pages, just reload the page.

Duplication

You can try connecting to the SE WebSocket for yourself using the Stack Snippet below:

(() => {
  'use strict';

  let webSocket = null;
  let noAdd = true;
  let startTime = 0;
  const passEl = document.getElementById('pass');
  const failEl = document.getElementById('fail');
  const failPercentEl = document.getElementById('failPercent');
  const startTimeEl = document.getElementById('startTime');
  const elapsedTimeEl = document.getElementById('elapsedTime');
  const diffLabel = {
      days: 'd',
      hours: 'h',
      minutes: 'm',
      seconds: 's',
  };
  const diffOrder = [
      'days',
      'hours',
      'minutes',
      'seconds',
  ];

  function incrementAndShowPercent(addPass, addFail) {
    if (noAdd) {
      return;
    }
    const pass = +passEl.textContent + addPass;
    passEl.textContent = pass;
    const fail = +failEl.textContent + addFail;
    failEl.textContent = fail;
    const failPercent = Math.round((10000 * fail) / (pass + fail))/100
    failPercentEl.textContent = failPercent;
    if (fail) {
      failEl.classList.add('isFailing');
      failPercentEl.classList.add('isFailing');
    }
    /* The elapsed time code was copied, then significantly modified, from
     * https://codereview.stackexchange.com/a/160240/53535 by Przemek
     * https://codereview.stackexchange.com/users/97934/przemek */
    const elapsed = (Date.now() - startTime) / 1000;
    const diff = {
      days: Math.floor(elapsed / 86400),
      hours: Math.floor((elapsed / 3600) % 24),
      minutes: Math.floor((elapsed / 60) % 60),
      seconds: Math.floor(elapsed % 60),
    };
    const elapsedText = diffOrder.reduce((sum, key) => {
      if (sum || diff[key] || key === 'seconds') {
          sum += ` ${diff[key]}${diffLabel[key]}`;
      }
      return sum;
    }, '').trim();
    elapsedTimeEl.textContent = elapsedText;
  }

  function registerWebSocket() { // eslint-disable-line no-unused-vars
    if (webSocket) {
      if (typeof webSocket.removeEventListener === 'function') {
          //There can be additional events that mess up the count.
          webSocket.removeEventListener('message', socketOnMessage);
      }
      if (typeof webSocket.close === 'function') {
          // Close any existing WebSocket.
          webSocket.close();
      }
    }
    webSocket = new WebSocket('wss://qa.sockets.stackexchange.com');
    webSocket.addEventListener('message', socketOnMessage);
    webSocket.addEventListener('open', () => {
      //console.error('WS: Open:');
      webSocket.send('155-questions-active');
      console.log('vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv');
      console.log('WS: Open & Subscribed to active questions');
      noAdd = false;
    });
    webSocket.addEventListener('close', event => {
      //console.log('WS: Closed: event.code', event.code, '::  event:', event);
      console.log('WS: Closed');
    });
    webSocket.addEventListener('error', event => {
      console.error('WS: Error:');
      console.log('WS: Error: event', event);
    });
  }

  function socketOnMessage(event) {
    const messageObject = JSON.parse(event.data);
    const data = messageObject.data;
    let dataObject = data;
    try {
      dataObject = JSON.parse(data);
    } catch(error) {
      // Ignore any errors
    }
    messageObject.data = dataObject;
    console.log('WS: messageObject:', messageObject);
    if (messageObject.action === 'hb') {
      //At least for the 155-questions-active socket, the hb is every 5 minutes.
      incrementAndShowPercent(0, 1);
      noAdd = true;
      console.error('WS: heartbeat Received:');
      //console.error('WS: Responding to heartbeat:');
      //webSocket.send(data);
      registerWebSocket();
    } else {
      incrementAndShowPercent(1, 0);
      noAdd = true;
      registerWebSocket();
    }
  }

  document.getElementById('start').addEventListener('click', () => {
    passEl.textContent = 0;
    failEl.textContent = 0;
    failEl.classList.remove('isFailing');
    failPercentEl.textContent = 0;
    failPercentEl.classList.remove('isFailing');
    startTime = Date.now();
    startTimeEl.textContent = (new Date(startTime)).toISOString();
    registerWebSocket();
  });

  document.getElementById('stop').addEventListener('click', () => {
    webSocket.close();
  });
})();
span:not([id]) {
  margin-left: 10px;
}
span.extraSpace {
  margin-left: 20px;
}
#fail.isFailing,
#failPercent.isFailing {
  color: red;
  font-weight: bold;
}
#pass,
#failPercent {
  color: #00d000;
  font-weight: bold;
}
<button id="start">Start/Restart</button>
<button id="stop">Stop</button>
<span> Pass: <span id="pass">0</span></span>
<span> Fail: <span id="fail">0</span></span>
<span>Fail %: <span id="failPercent">0</span> %</span>
<span class="extraSpace">Started: <span id="startTime"></span></span>
<span class="extraSpace">Elapsed: <spam id="elapsedTime"></span></span>
</br>You may want to open the Web Console (F12), as it's easier to read the output there.</br>
If you don't see anything in the console for 5 minutes after "WS: Open & Subscribed to active questions", that means that attempt to connect (probably) failed.</br>


  1. See gap in "API quota rolled over with" messages in this chat search. There should be one message per day, every day at nearly the same time, unless we've switched instances. This represents days where SmokeDetector stalled out in processing posts. In addition, the following image shows times when there was no scanning or update to the reported SE API quota (both indicated with red circles). The green circle shows what the rising edge of the API quota graph should normally look like. The fact that it isn't completely vertical in the red circled areas is due to data missing as a result of no scans being performed
    enter image description here
    To be clear, the dates and times shown above are only show the points in time at which the problem randomly affected SmokeDetector and the length of time from then to when the bot was rebooted. Those dates and times don't reflect when the problem was more likely to occur.
  2. SD uses the 155-questions-active channel on the WebSocket available from wss://qa.sockets.stackexchange.com/, which notifies about (almost) all post activity on all sites.
  3. The "heartbeat" is a message that SE sends which says "hb". After receiving the heartbeat, the consumer of the WebSocket must send some data to indicate that the WebSocket is still active. The "heartbeat" is administrative and doesn't have any data for the user to consume.
  4. "Failure" here is defined never receiving any data for a full 5 minutes from the opening of the WebSocket to receiving the first heartbeat on the socket. While it's possible that some of these failures were that there just wasn't any activity on any SE site for 5 minutes, that's … very rare.
21
  • 5
    I've also noticed this same issue with Math SE during the last few days, where if I leave the page open at the all questions page, it should normally show when there are new questions. However, sometimes I don't see anything for quite a while, but when I refresh the screen, there are several questions from quite a while ago which I wasn't notified about but, normally, would have been. Commented May 17, 2020 at 3:00
  • 2
    I can reproduce in various places in the site UI. Once, when I voted to close a question, the dialog didn't close after I submitted the vote, and once, the vote count didn't update when I red-flagged a post. I thought those were all individual bugs, but since I didn't see them happening in all cases, I didn't report them. Commented May 17, 2020 at 4:43
  • 1
    If I speculate on that percentage I would say one server in the farm is not behaving. For an SRE to figure out which one it is.
    – rene
    Commented May 17, 2020 at 6:23
  • I've noticed occasional difficulties voting in the app; on the website, I can get a notification (inbox message, say, or rep change) and the fly-out which should contain details is entirely empty. On both platforms, reloading fixes everything. Commented May 17, 2020 at 7:02
  • Nice catch and excellent post. Do you know when it started happening and if that 10% means that 10% of posts did not go through smokey? Commented May 17, 2020 at 7:40
  • @DavidPostill Thanks. The first obvious record of the problem is last Monday, 2020-05-11. We haven't tried to localize it better than that. If really needed, we can probably localize the first instance of SD experiencing a significant issue to within a few minutes, but there could easily be minor glitches prior to that. Also, SD probably wasn't the first place where the problem showed up. As to scanning posts, no, it doesn't mean SD missed scanning 10% of posts. It's not that 10% of the data isn't reported, it's that 10% of the time when you try to start getting data, you get nothing.
    – Makyen
    Commented May 17, 2020 at 9:03
  • 1
    (continued) From SD's POV, a quick reboot fixes the problem 90% of the time (i.e. ~10% of the time you try to start back up, you, again, get nothing). There may be some posts which SD didn't scan, but SD does a good job of picking up from where it leaves off, as opposed to only checking just what's reported via the WebSocket. This is both to handle any reboots that happen as detections are adapted, but is also necessary, because the WebSocket doesn't report every active post on SO.
    – Makyen
    Commented May 17, 2020 at 9:03
  • Also I believe SD re-runs a post when there is any activity on it. So it might be that a post gets "caught" once an edit is made, IIRC @Makyen
    – Luuklag
    Commented May 18, 2020 at 13:45
  • @Luuklag Yes, SD scans the question and all answers whenever there is an edit on the question or any of it's answers, or when a new answer is added to the question (i.e. when the question becomes "active", per how SE defines that).
    – Makyen
    Commented May 18, 2020 at 19:29
  • 1
    This has been happening to me as long as I can remember. I always thought it was just some glitch. However, ever since the follow feature has been implemented it started to happen more often. Don't get me started on API. It has the availability of less than 99%. I keep getting error 503 every second day.
    – Dharman
    Commented May 19, 2020 at 0:07
  • See my answer below. Just pushed a new build, hopefully will take care of it. Going to leave this as status-review until get confirmation that this is fixed. Commented May 25, 2020 at 11:13
  • I lost 2 reputation because I edited a post that later got deleted, but what' strange is that it showed that I lost 2 reputation before showing that I edited it, but it wasn't closed when I edited the post. I'm thinking that SmokeDetector might be the cause here based on what you're saying. I'm certain I didn't do anything else with the post besides a comment, and it looks like this. Commented May 25, 2020 at 11:53
  • 3
    @mindstormsboi SmokeDetector won't be the cause of the problems you mention. SmokeDetector is just a user of Stack Exchange and can't affect how the Stack Exchange system handles your posts or edits. There is an associated system, metasmoke, which can raise spam flags on posts which are detected as spam, but that won't affect the things you're seeing. It is possible that you might be seeing the problem described here, but it's not caused by SmokeDetector. SmokeDetector was just an application that was clearly and unambiguously affected by the problem, making it easier to identify the issue.
    – Makyen
    Commented May 25, 2020 at 13:04
  • 1
    @mindstormsboi I think we can live with an occasional glitch for individual users even if that was caused by Smokey. Its added value out weights temporal incidents that users might like to contribute to the spam detection robot.
    – rene
    Commented May 25, 2020 at 13:51
  • 1
    I'd repost this as a new bug. The SE team prefers that people file new bug reports when a prior fixed bug comes up again. Commented Jul 10, 2020 at 2:40

1 Answer 1

14

The symptoms reported point to an issue on one of the socket servers. And the last time that the servers were built out was on May 11 (corresponding to Makyen's report of this date as the beginning of this). So while I have not been able to detect through our monitoring which server is the culprit, I am hoping that pushing out a new build (which I just completed) will solve the problem.

3
  • 3
    This problem is happening again. It doesn't affect SmokeDetector that much, because we have code which detects and avoids it. However, I'm seeing about a 15% failure rate in testing. Do you want a new Meta post?
    – Makyen
    Commented Jul 9, 2020 at 19:07
  • 1
    This seems to have regressed again. See meta.stackexchange.com/questions/350624/… Commented Jul 10, 2020 at 5:09
  • Did you test whether that build solved the problem? What were the results? Commented Jul 10, 2020 at 20:26

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .