I like to use Data.SE to view usage stats for some of the sites. However, deleted posts are not included on Data.SE, and I think this skews the numbers quite a bit, particularly on sites with a lot of deletions.

Would it be possible to include some limited data about deleted posts in Data.SE to make statistical queries more accurate?

You could clear the non-public data fields of deleted posts, such as Posts.Title, Posts.Body, and Posts.OwnerUserId to address privacy concerns explained here. You could even leave the PostId set to 0 since I know SE doesn't want to provide users with a way of searching deleted questions.

The data I would mostly be interested in is CreationDate, LastActivityDate, PostHistory (date closed, date reopened, date deleted, etc.), and the fact that there was a deleted question in the first place. Up/Down votes, Score, Tags, and View/Answer/Favorite count would also be preferred.

This would make the statistical queries accurate, while still removing the information not meant to be visible to the public.

  • 5
    I've been meaning to post this for quite a while. I would absolutely love it if this were implemented. Commented Dec 3, 2012 at 16:25
  • 4
    "now that Data.SE is automated"...automation or not was never really a motivator for hiding deleted questions, evident by the fact they're consistently hidden everywhere. We simply don't publish a list of deleted content (which you'd have by Id, with this request) anywhere. Commented Dec 27, 2012 at 11:52
  • 1
    @NickCraver Hrrmm I thought I saw a question like this a while back, and someone declined it because doing a data dump to Data.SE was a manual process and stripping non-public data would be too much of a bother.
    – Rachel
    Commented Dec 27, 2012 at 12:32
  • Hmm, not directly related, but did you ever make a request for keeping successful close/reopen votes in the votes table? I was thinking the other day that the current behaviour is rather useless.
    – Tim Stone
    Commented Dec 31, 2012 at 17:02
  • 2
    @TimStone I have now :)
    – Rachel
    Commented Dec 31, 2012 at 17:10
  • 5
    @NickCraver - I understand the value of not exposing deleted content as it's generally crud with little to no redeeming value. However, not having access to that information makes it difficult to identify behavioral patterns regarding the genesis of that content. It's hard to suggest quality improvement when I can't definitively say where some of the crud is coming from.
    – user194162
    Commented Apr 17, 2013 at 15:10
  • 1
    Can you give some examples of the kind of statistics you're hoping to garner from this? I suspect I know what you're going for, but it's not entirely clear from what you're requesting.
    – Shog9
    Commented Jun 3, 2013 at 19:54
  • 2
    @Shog9 I like to run queries on things like post activity and site usage over a time frame, and new user activity and retention rates. These can be very skewed without deleted posts, especially if the user had a bad start on the site. I've seen other users run statistical queries on Data.SE too, and keep having to remind them that deleted posts are not included. It wasn't a big deal in the past because deleted posts were such a small percentage, but with so many more closed posts now getting deleted with the tweaked auto-delete script, the numbers have shifted quite a bit.
    – Rachel
    Commented Jun 3, 2013 at 21:54
  • @Rachel: actually, the tweaks aren't live yet - they're waiting on a huge pile of other changes that should be going out Real Soon Now. Of course, they will tend to skew things a tad when they do go live, so you may want to investigate whatever oddities you're observing before then.
    – Shog9
    Commented Jun 5, 2013 at 23:25
  • @Shog9 Hrrmmm I thought the changes were live due to the number of posts deleted by Community on Programmers.SE recently
    – Rachel
    Commented Jun 6, 2013 at 3:49
  • 1
    Well, there are two other auto-delete scripts that've been running weekly / monthly for years now...
    – Shog9
    Commented Jun 6, 2013 at 4:26
  • 1
    @Shog9 I was re-running the queries displayed here and noticed a pretty dramatic shift in some of the numbers. Those were posted about a year ago. Around the same time I also noticed the Community user deleting a lot of older posts on Programmers, so thought the script had been updated. It may have just been coincidence though.
    – Rachel
    Commented Jun 10, 2013 at 13:38

1 Answer

Yes

We've added a new table called PostsWithDeleted that includes metadata from all posts, including the deleted ones. If the post is deleted, we've nulled out all fields except:

  • ID
  • PostTypeId
  • ParentId
  • CreationDate
  • DeletionDate
  • Score
  • Tags (later added)
  • ClosedDate (later added)
  • ContentLicense (later added)

That's helpful for doing meaningful research that requires looking at activity on the site without being biased by our aggressive deletion scripts.

As of right now on Stack Overflow, that table shows:

questions answers  posts    deleted extant_questions extant_answers 
--------- -------- -------- ------- ---------------- -------------- 
12779084  19666615 32519143 5576247 10133016         16736439       

That lines up with the current stats:

Here's how far we've come together

10,159,316 programming questions

16,766,219 solutions given

Since the data in SEDE is from September 9 (rev 2015.9.9.45), it has a touch fewer posts in its counts.

At the moment, there's no way to see the titles of deleted questions. Bodies and the various UserIds are also null for deleted posts, for obvious reasons. We also null out Views, LastEditDate, and LastActivityDate. However, I'm quite pleased with this new table as a way for us to be more transparent about the posts that can't be seen on the site.
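
For anyone who wants to reproduce a breakdown like the one above, here is a minimal sketch of the kind of query that could produce it. It is not the query actually used for the numbers in this answer. It assumes the usual SEDE convention that PostTypeId 1 is a question and 2 is an answer, and that DeletionDate is null for posts that are still visible. SEDE itself runs on SQL Server; the sketch uses Python's built-in sqlite3 with a handful of made-up rows only so that it runs on its own.

```python
import sqlite3

# Toy stand-in for SEDE's PostsWithDeleted table. The rows are invented
# for illustration; only the column names come from the list above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE PostsWithDeleted (
        Id           INTEGER PRIMARY KEY,
        PostTypeId   INTEGER,   -- assumed: 1 = question, 2 = answer
        ParentId     INTEGER,
        CreationDate TEXT,
        DeletionDate TEXT,      -- assumed: NULL while the post is visible
        Score        INTEGER
    )
    """
)
rows = [
    (1, 1, None, "2015-01-05", None, 3),
    (2, 2, 1, "2015-01-05", None, 5),
    (3, 2, 1, "2015-01-06", "2015-01-07", -1),
    (4, 1, None, "2015-02-10", "2015-03-01", -2),
    (5, 1, None, "2015-02-11", None, 0),
    (6, 2, 5, "2015-02-12", None, 1),
]
conn.executemany("INSERT INTO PostsWithDeleted VALUES (?, ?, ?, ?, ?, ?)", rows)

# Count questions and answers, with and without the deleted ones.
query = """
SELECT
    SUM(CASE WHEN PostTypeId = 1 THEN 1 ELSE 0 END) AS questions,
    SUM(CASE WHEN PostTypeId = 2 THEN 1 ELSE 0 END) AS answers,
    COUNT(*) AS posts,
    SUM(CASE WHEN DeletionDate IS NOT NULL THEN 1 ELSE 0 END) AS deleted,
    SUM(CASE WHEN PostTypeId = 1 AND DeletionDate IS NULL
             THEN 1 ELSE 0 END) AS extant_questions,
    SUM(CASE WHEN PostTypeId = 2 AND DeletionDate IS NULL
             THEN 1 ELSE 0 END) AS extant_answers
FROM PostsWithDeleted
"""
cur = conn.execute(query)
columns = [d[0] for d in cur.description]
summary = dict(zip(columns, cur.fetchone()))
print(summary)
```

The SELECT statement uses only CASE, SUM, and COUNT, so it should need little or no change to run against the real table on SEDE. Note that posts can be larger than questions plus answers on a real site, because other post types (tag wikis, for example) are also stored in the table.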

  • 3
    Oh, so you don't get tags for deleted posts? Hmm, then that cheering was a bit premature. Commented Sep 16, 2015 at 15:49
  • 3
    Why are votes nulled? That seems it might be useful too?
    – enderland
    Commented Sep 16, 2015 at 15:50
  • @enderland: Well, as I mention in the answer, it's not hard to calculate Score from the Votes table if you are selecting a few posts. We might need to consider leaving Score denormalized if people end up using this table to look at the scores of deleted posts often. Commented Sep 16, 2015 at 15:53
  • 4
    @JonEricson I guess that works, thanks whoever made this a priority! It'll be fun I think to play with :D
    – enderland
    Commented Sep 16, 2015 at 15:55
  • 1
    @ChristianRau: My initial spec included tags, but that was taken out in the name of caution. Commented Sep 16, 2015 at 15:56
  • 3
    I updated the Database schema documentation for the public data dump and SEDE with this new table.
    – rene
    Commented Sep 16, 2015 at 16:11
  • 2
    @JonEricson so, tags were considered somehow sensitive, right? would be interesting to learn why, I thought these were safe. I thought tags data could be useful for burnination / blacklisting but since it's unsafe, that's apparently not an option
    – gnat
    Commented Sep 16, 2015 at 16:11
  • 4
    @gnat: I sympathize with you. As I said, my initial spec included tags, which I think are both safe and useful. But somewhere in this process we decided to only include the five fields I listed above for deleted posts. That doesn't mean we can't add more fields in later. For the moment, we just want to be sure that we're not releasing a djinn who refuses to get back in the jar. Commented Sep 16, 2015 at 16:20
  • 1
    @JonEricson If I remember correctly, votes on deleted posts were not previously included in the Votes table in SEDE. Now they are. Was this change simultaneous with the introduction of PostsWithDeleted, or did I miss something?
    – user259867
    Commented Sep 16, 2015 at 17:22
  • 1
    @NormalHuman: I was a bit surprised that votes for deleted posts existed, but I don't think they were added as a part of this change. From what I can tell looking at the revision history, votes on deleted posts have been available since the beginning. (Don't hold me to that, however. I'm a bit rusty reading other people's code. ;) Commented Sep 16, 2015 at 17:37
  • 1
    Thanks so much!!! Now to find a weekend to brush up on my SQL skills to run some data analysis.... :D
    – Rachel
    Commented Sep 16, 2015 at 18:36
  • 1
    @JonEricson: Is there any interest in adding other content related to deleted posts? Like votes, comments and flags? Again, for similar reasons mentioned in the OP and comments.
    – Werner
    Commented Jun 25, 2019 at 15:34
  • 1
    Is there any way an OwnerUserId could be added to the list of fields populated for the PostsWithDeleted table? I understand the concern about identifying the user, but without post content and title this info is relatively harmless. What's more, it could immensely help active users in helping others asking why they are question/answer banned without the need to involve a moderator. Commented May 4, 2021 at 0:50
  • 2
    @KellyBundy: I don't know the full reason and I no longer work at the company. But the core issue is that SEDE hasn't been actively worked on in years. It's a wonderful tool and I'm glad to have it. It's just not a priority for improvement. Commented Mar 15, 2022 at 21:14
  • 1
    @JonEricson Thanks. Yes, it's nice and I use it for some things. The api as well, since SEDE data sadly lags by a few days. Commented Mar 15, 2022 at 22:05
