In the spirit of Shog9's recent post, "2016: A year in closing", I thought I'd do something similar for spam. Ladies and gentlemen, I present you with all the statistics about spam you never needed to know.

The Big Number

Network-wide, we saw at least 32,462 spam posts last year. There may be more that aren't in my data, but probably not many. To put that number in perspective, we see around 100,000 posts created or updated each day.
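
For a rough sense of scale, the two figures above can be turned into a spam share. This is only a back-of-the-envelope sketch: it assumes the "around 100,000" daily figure held for all 366 days of 2016, which the post does not state, and the constant names are made up for illustration.

```python
# Rough spam share for 2016, using only the two figures quoted above.
SPAM_POSTS = 32_462       # spam posts seen network-wide (a lower bound)
POSTS_PER_DAY = 100_000   # "around 100,000" posts created or updated daily
DAYS_IN_2016 = 366        # assumption: the daily rate held all (leap) year

total_posts = POSTS_PER_DAY * DAYS_IN_2016
spam_share = SPAM_POSTS / total_posts

print(f"Spam posts per day: {SPAM_POSTS / DAYS_IN_2016:.1f}")
print(f"Spam share of all activity: {spam_share:.4%}")
```

Under those assumptions, spam works out to a little under 0.1% of post activity, or roughly 89 spam posts a day.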

Posts by Site

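Stack Overflow, unsurprisingly, gets the largest share of spam, at just over a quarter of the total. These numbers have been consistent for a while, but started changing towards the end of the year, with Ask Different in particular trending up, and Meta Stack Exchange down. PctOnSite is the percentage of network-wide spam posts that were found on that site.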

+-----------+------------------------------------------+
| PctOnSite | SiteName                                 |
+-----------+------------------------------------------+
| 26.6106%  | Stack Overflow                           |
| 10.6072%  | Drupal Answers                           |
| 10.2012%  | Super User                               |
| 8.7430%   | Ask Ubuntu                               |
| 3.4621%   | Meta Stack Exchange                      |
| 3.1297%   | Information Security                     |
| 2.4006%   | Arqade                                   |
| 2.3477%   | Ask Different                            |
| 2.1702%   | The Workplace                            |
| 1.7849%   | Personal Finance & Money                 |
| 1.4109%   | Android Enthusiasts                      |
| 1.2787%   | English Language & Usage                 |
| 1.2617%   | Travel                                   |
| 1.2258%   | Mathematics                              |
| 1.1805%   | Graphic Design                           |
| 0.9784%   | Web Applications                         |
| 0.8575%   | Movies & TV                              |
| 0.8292%   | Arduino                                  |
| 0.6686%   | MathOverflow                             |
| 0.6214%   | Electrical Engineering                   |
+-----------+------------------------------------------+

Truncated to top 20 sites. View full data.

Posts by Time

Make what you will of this. The majority of spam is posted between 04:00 and 11:00 UTC each day.

+----------+-----------+
| AvgPosts | HourOfDay |
+----------+-----------+
|      512 |         0 |
|      567 |         1 |
|      587 |         2 |
|      765 |         3 |
|     3596 |         4 |
|     4441 |         5 |
|     4477 |         6 |
|     3518 |         7 |
|     3342 |         8 |
|     3502 |         9 |
|     3373 |        10 |
|     3075 |        11 |
|     1855 |        12 |
|     1178 |        13 |
|      914 |        14 |
|      761 |        15 |
|      748 |        16 |
|      707 |        17 |
|      694 |        18 |
|      665 |        19 |
|      569 |        20 |
|      495 |        21 |
|      510 |        22 |
|      469 |        23 |
+----------+-----------+
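
The "majority" claim can be checked directly against the table. The sketch below copies the AvgPosts column and assumes the window means hours 4 through 11 inclusive; that reading, and the helper name `window_share`, are assumptions rather than anything the post specifies.

```python
# Counts copied from the table above, indexed by hour of day (UTC).
POSTS_BY_HOUR = [
    512, 567, 587, 765, 3596, 4441, 4477, 3518,
    3342, 3502, 3373, 3075, 1855, 1178, 914, 761,
    748, 707, 694, 665, 569, 495, 510, 469,
]

def window_share(counts, start_hour, end_hour):
    """Fraction of all posts falling in start_hour..end_hour, inclusive."""
    in_window = sum(counts[start_hour:end_hour + 1])
    return in_window / sum(counts)

share = window_share(POSTS_BY_HOUR, 4, 11)
print(f"{share:.1%} of the posts in this table fall in hours 4 through 11 UTC")
```

With these numbers the window holds about 71% of the posts in the table. The column sums to 41,320, which is not the same as the headline figure, so the result is best read as a ratio within this table only.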

Time to Deletion

On average, it takes just over 5 minutes to delete spam at peak time, but it can take over 10 minutes at quieter times of day.

+----------------------+-----------+
| AvgSecondsToDeletion | HourOfDay |
+----------------------+-----------+
|             623.9267 |         0 |
|             604.2301 |         1 |
|             636.0441 |         2 |
|             571.1575 |         3 |
|             473.8658 |         4 |
|             441.3046 |         5 |
|             380.7654 |         6 |
|             369.7099 |         7 |
|             332.5471 |         8 |
|             315.3328 |         9 |
|             301.5370 |        10 |
|             313.2093 |        11 |
|             332.4646 |        12 |
|             354.3419 |        13 |
|             392.5989 |        14 |
|             424.5681 |        15 |
|             421.0383 |        16 |
|             420.2009 |        17 |
|             438.6229 |        18 |
|             461.5307 |        19 |
|             448.5552 |        20 |
|             478.9103 |        21 |
|             543.7133 |        22 |
|             599.4058 |        23 |
+----------------------+-----------+
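
Since the table is in seconds, a quick conversion makes the "5 minutes" and "10 minutes" figures easier to see. This is a minimal sketch that only re-reads the table above; the names `SECONDS_BY_HOUR` and `to_minutes` are illustrative.

```python
# Average seconds to deletion, copied from the table above (index = UTC hour).
SECONDS_BY_HOUR = [
    623.9267, 604.2301, 636.0441, 571.1575, 473.8658, 441.3046,
    380.7654, 369.7099, 332.5471, 315.3328, 301.5370, 313.2093,
    332.4646, 354.3419, 392.5989, 424.5681, 421.0383, 420.2009,
    438.6229, 461.5307, 448.5552, 478.9103, 543.7133, 599.4058,
]

def to_minutes(seconds):
    """Convert a duration in seconds to minutes."""
    return seconds / 60

# Hours with the shortest and longest average time to deletion.
fastest_hour = min(range(24), key=lambda h: SECONDS_BY_HOUR[h])
slowest_hour = max(range(24), key=lambda h: SECONDS_BY_HOUR[h])

for label, hour in (("Fastest", fastest_hour), ("Slowest", slowest_hour)):
    minutes = to_minutes(SECONDS_BY_HOUR[hour])
    print(f"{label}: hour {hour:02d} UTC, {minutes:.1f} min")
```

The fastest hour is 10 UTC at about 5.0 minutes, and the slowest is 2 UTC at about 10.6 minutes.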

SmokeDetector

Since this is where the stats come from, it's only fair to give the project some credit. I work on SmokeDetector, which is a bot that identifies possible spam and asks humans to flag it and give feedback. Here's the data that shows it works: there's a strong correlation between the number of feedbacks a post gets and how quickly it gets deleted.

+----------------------+---------------+
| AvgSecondsToDeletion | FeedbackCount |
+----------------------+---------------+
|           22978.8568 |             1 |
|            5969.3305 |             2 |
|            2900.3543 |             3 |
|             366.6266 |             4 |
|             328.7800 |             5 |
|             192.9524 |             6 |
|             167.2500 |             7 |
|              25.0000 |             8 |
+----------------------+---------------+
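
One way to put a number on that relationship is a rank correlation over the eight rows above. The sketch below uses Spearman's coefficient, which is my choice for illustration and not necessarily how the claim was originally checked. It works on the per-count averages from the table rather than on individual posts, and the simple rank formula it uses assumes there are no tied values.

```python
# Rows copied from the table above.
FEEDBACK_COUNTS = [1, 2, 3, 4, 5, 6, 7, 8]
AVG_SECONDS = [22978.8568, 5969.3305, 2900.3543, 366.6266,
               328.7800, 192.9524, 167.2500, 25.0000]

def ranks(values):
    """Rank each value from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for equal-length sequences without ties."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

rho = spearman(FEEDBACK_COUNTS, AVG_SECONDS)
print(f"Spearman rank correlation: {rho:.2f}")
```

Because the average time to deletion drops at every step, the coefficient comes out at -1: more feedback goes with faster deletion across every row of the table.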
  • 1
    Poor Movies & TV: relatively small site, much spam. Commented Jan 14, 2017 at 14:13
  • How many spam posts did we see in 2015? Are we getting quicker or slower at deleting them? Are there outliers in the time to detection and is there anything we can usefully learn from the outlier spam? Commented Jan 14, 2017 at 14:13
  • 1
    @Robert we don't have stats from 2015, so I can't do much of that analysis.
    – ArtOfCode
    Commented Jan 14, 2017 at 14:16
  • 12
    You should probably give a disclaimer that this is not entirely accurate data; that there are a lot of posts deleted as spam missing from the statistics because this data is not official.
    – hichris123
    Commented Jan 14, 2017 at 15:23
  • 2
    @randal'thor Drupal gets hit a lot harder, relatively speaking.
    – Glorfindel Mod
    Commented Jan 14, 2017 at 15:39
  • Is PctOnSite the percentage of network spam posts that were found on that site, or the percentage of posts on that site that were found to be spam? Commented Jan 15, 2017 at 6:08
  • The former, @Nathan.
    – ArtOfCode
    Commented Jan 15, 2017 at 8:32
  • @ArtOfCode - Would be nice to see both values described by Nathan: I.e., it would be interesting to see if any site is disproportionately attacked by spam.
    – feetwet
    Commented Jan 15, 2017 at 19:50
  • @feetwet I can tell you without even querying it that Drupal, Apple, and Graphic Design would win that contest, though not necessarily in that order.
    – ArtOfCode
    Commented Jan 15, 2017 at 19:55
  • Any chance of getting a raw data dump of posts (title, body & username), feedback and reasons for spam detection? Would be nice to have a look at
    – ert
    Commented Jan 16, 2017 at 2:21
  • @Rob preferred format? I'll organise that tomorrow.
    – ArtOfCode
    Commented Jan 16, 2017 at 2:23
  • @ArtOfCode A raw database export would be nice (barring sensitive tables if any). Otherwise C(T)SV would work fine
    – ert
    Commented Jan 16, 2017 at 2:25
  • 1
    @Rob I can get you SQL format easily enough. Might end up being two or three tables to make the "post has and belongs to many reasons" association work.
    – ArtOfCode
    Commented Jan 16, 2017 at 2:26
  • 2
    @Rob SQL dumped. 4 tables: posts, posts_reasons, reasons, feedbacks.
    – ArtOfCode
    Commented Jan 16, 2017 at 13:13

1 Answer

As hichris notes,

... this is not entirely accurate data; that there are a lot of posts deleted as spam missing from the statistics because this data is not official.

But at the same time, we see a very, very large percentage of what comes in. I'm comfortable saying that this is probably accurate enough for most use cases.


Also, in an effort to keep the question relatively short, I'll throw some graphs here:

[Graph]

^ (not scaled to site size)

[Graph]

Time to deletion. I had a hard time with the x-axis scales on this one. It's by hour, starting at 0 UTC.

[Graph]
