54

Over the past year, a lot of meta data has been changed or added to the system:

  • Review tasks and results
  • Duplicate links (separate from close votes and ad-hoc editing conventions)
  • Close votes as a distinct type of flag (previously they were stored as votes)

However, none of this is in the public data dump. There are bits and pieces - suggested edits predate the new review system, and duplicate links can be obtained from PostHistory with a bit of trouble - but not enough for any serious analysis.

Therefore, I recommend the following information be added to the public data dumps and data explorer:

Close and reopen flags

As with close votes, this information should have userIDs and exact times stripped before being made public. Suggested schema:

Id
FlagTypeId (Recommend Close: 13, Close: 14, Reopen: 15)
PostId
CreationDate (date only - no exact time)
CloseReasonId (requires the CloseReasons table)
CloseAsOffTopicReasonId (requires the CloseAsOffTopicReasons table)
DuplicateOfQuestionId
BelongsOnBaseHostAddress
ResultDate   (this and ResultTypeId are only needed if deleted flags should be included)
ResultTypeId

The question of whether or not deleted votes/flags should be included is a big one here: IMHO, it's very useful for doing historical analysis... But right now it's explicitly not included. If you can think of a good reason not to make this information public, post it below.

Links

The PostLinks table is quite useful: it contains not only duplicate links, but also ad-hoc links and even related links (the two sidebar lists next to most questions). I would like to have all of this information made available, but if this isn't feasible for some reason then ad-hoc links and duplicate links would be the most useful. Suggested schema:

Id
CreationDate
PostId
RelatedPostId
LinkTypeId (there is no LinkTypes table, but maybe there should be one)
Rank

Since all of this information is already public, I don't foresee any privacy issues with publishing it in the dump.

Review tasks

This covers everything available under /review. Nothing in this table should really be private, but I would recommend excluding information on audits for space and simplicity's sake. Suggested schema:

Id
ReviewTaskTypeId (requires the ReviewTaskTypes table)
CreationDate
DeletionDate
ReviewTaskState (requires the ReviewTaskStates table)
PostId
SuggestedEditId (this table already exists in the public data)

Note that "deleted" review tasks are still publicly visible - DeletionDate is really just the data on which the task was either completed or invalidated.

Review task results

Recommended schema:

Id
ReviewTaskId
ReviewTaskResultTypeId (requires ReviewTaskResultTypes table)
CreationDate  (date only - no exact time)
RejectionReasonId (requires ReviewRejectionReasons table)
Comment (custom rejection reasons for suggested edits)
CompletedReviewTask (bit) (this doesn't actually exist internally, but I recommend calculating it for convenience - particularly if timestamps are omitted)

This is a tricky one... Most of this information is already public in some fashion, but making it easily accessible is another matter. In particular, First Post and Late Answer reviews can involve actions that are not made public: voting and flagging. Making it too easy to correlate a FP/LA review with specific events on a post could lead to trouble. Therefore, I recommend the following:

  1. As with votes, omit the ID of the user reviewing.
  2. As with votes, omit the exact time that the review took place
  3. Omit any detailed information on the actions that contributed to the review for First Posts and Late Answers reviews.
  4. Omit audit information (if it is left out of ReviewTasks)

I may be over-cautious here... For instance, it's possible that including #1 would be fine as long as #3 was omitted. Post your thoughts below.

Account ID on Users rows

This is actually a very old request, but Tim Stone reminded me of it again tonight. This is public information already; at one point, the ability to disassociate accounts from your network profile may have made it problematic, but given we don't allow that anymore... I don't see any issue here.

It's another integer in each row in the Users table that corresponds to your network ID (mine is 620) and is identical on each site linked to your network account.


Please review the above suggestions. In particular, if you can envision a way in which the release of such data could be used to compromise information that is normally considered private (such as voting data), call it out now - much better to be safe than sorry.

9
  • 5
    For review task results, viewing the task result page in a browser already shows the users, full timestamps, and result, so I don't think it is necessary to leave that information out. Omitting the detailed information should be sufficient.
    – ughoavgfhw
    Commented Nov 25, 2013 at 1:56
  • Please consider this feature request too.
    – Noel
    Commented Nov 25, 2013 at 9:15
  • Wanna post an answer that elaborates on that a bit, @ughoavgfhw? Or just post an answer?
    – Shog9
    Commented Nov 25, 2013 at 19:59
  • Actually I +1'd this because I thought it was to make the user ID's public for the already available data (I clearly didn't read it in remotely enough detail). I guess that's not going to happen. I just really want to generate some (non-user-specific) stats about close voting, for which I need user-specific data. (But if a change is in the works that would provide a long-term fix for the queue size, I guess that's not really needed any more) Commented Nov 28, 2013 at 2:02
  • We don't show pending users for close votes on the site, @Dukeling, so including them in the dump is a bit sketchy; we probably could do so for reviews (as noted above) but even that isn't strictly necessary for analyzing review.
    – Shog9
    Commented Nov 28, 2013 at 2:25
  • @Shog9 Even just adding the user ID for votes cast on already closed questions may allow for some fairly decent stats (e.g. how many 20k+ users vote to close questions how often). Commented Nov 28, 2013 at 3:11
  • If you're only interested in already-closed questions, you can get that data now, @Dukeling - but you'll have to extract the JSON that represents it from the data dump. It's not hard, but chances are you'll only do it if you have a really good reason to want that data.
    – Shog9
    Commented Nov 28, 2013 at 3:20
  • Can you update us if this request is near a status-planned phase?
    – rene
    Commented Mar 9, 2014 at 19:53
  • It's on the list, @rene - but see this update: there are some issues with trying to cram all of this into the data dump, so only the stuff that makes sense to publish in both the dump and SEDE has been completed so far. We'll try to move some of the more transient stuff into SEDE only at some point, but that'll take some extra work to make it happen.
    – Shog9
    Commented Mar 9, 2014 at 19:58

6 Answers 6

18

I don't think it is necessary to leave the user ID or full timestamp out of the review task results table. That information is already available and associated with a general result on the review task's web page, viewable from the history. For example, here is a recent review from the late answer queue (with the username changed):

I circled the username and timestamp, but if you are reading this you probably can't see that anyway.

Since this information is publicly available already, there is no need to leave it out of the data dump. It should be sufficient to leave out detailed information about the actions taken.

1
  • 4
    That darn user12345!
    – Cole Tobin
    Commented Jan 4, 2014 at 2:49
10

After taking a close look at the proposed schemas here, I think we need to make a few clarifications and/or adjustments.

Specifically:

  1. For CloseReasons, we should stick with the same naming conventions as the rest of the tables currently exposed through SEDE and call it CloseReasonTypes.
  2. CloseAsOfftopicReasons should probably also be named CloseAsOfftopicReasonTypes (say that ten times fast!), and it should explicitly include:
    • the userids/names of the moderators involved in creating and approving the close reasons.
    • all currently active and deactivated close reasons. (Unapproved close reasons should be excluded).
  3. Any close/reopen votes, flags, links, and review tasks should be excluded if they are related to a deleted post.
  4. Exclude deleted votes and flags (that is, flags where ResultDate is not null) on the grounds that because a lot of closed posts are also deleted, providing information that's guaranteed to be incomplete will only do a disservice to anyone trying to perform any sort of meaningful analysis on how folks vote or flag.
  5. We should generate a LinkTypes table. Don't need one internally, really, but it'd be nice to have one in the public schema for convenience/clarity.
  6. The CompletedReviewTask field Shog mentions will have to be calculated because there's no other way to figure out what happened to the review item once we zero out the time part of the timestamp.

As far as including information about review tasks goes, I think being more paranoid is going to pay off more in the long run. While it's true that the information like exact timestamps is available on the site, acquiring it in aggregate right now is rather painful. Exposing it via SEDE would make it far easier, and we kind of don't want to make it easier for folks to make any guesses about how others vote or act.

3
  • I have a mild hunch that the custom off-topic close reasons are not going to be so easy to toss in the data dump as site-specific close reasons aren't available in the API (they are all listed as 'off-topic' only, with no sub-reason).
    – jmac
    Commented Jan 29, 2014 at 8:09
  • 1
    @jmac What the API exposes and what the data dump can contain is independent of each other.
    – Adam Lear StaffMod
    Commented Jan 29, 2014 at 16:56
  • Thanks for the clarification Anna. Didn't realize they were unrelated, I assumed there was some sort of mojo linking them. If not, all the better.
    – jmac
    Commented Jan 30, 2014 at 0:21
5

So it's been pointed out that a lot of this data would be effectively worthless in a public data dump, due to the fact that they're generally months out of date. In addition, XML is huge, and the Stack Overflow data dump is already very large.

Therefore, I recommend including all of the information outlined above (with Anna's clarifications and caveats) in the Stack Exchange Data Explorer, but adding only the following to the public data dump:

  • Links but excluding Related links
    • Also a LinkTypes table documenting the two types of links that are included
  • Account ID on Users rows

That is all. Note that I'm explicitly excluding Close and Reopen flag-votes: as Anna noted, including deleted votes gives a skewed picture of what actually happens unless flags related to deleted posts are included as well... Since most close and re-open flags are deleted, this leaves very little relevant information to include in a public, months-out-of-date dump.

It would be nice to figure out a way of including some meta-data on deleted posts, as that would enable including a much larger set of information regarding flags and reviews... However, this would be a fairly major change, and as such is out of scope for this feature request.

4

If the work isn't too difficult, please consider converting NULL in denormalized count fields to a definite number (usually 0). As I noted with CommentCount, some of these fields mean that the count has never been update. But the distinction is rarely useful and there are some strange edge cases that mean it's not terribly reliable. For the sake of simplifying the data and not producing misleading results, please convert nulls in these fields to 0 when exporting to the public:

  • Posts.CommentCount
  • Comments.Score

For questions (answers should remain null):

  • Posts.ViewCount
  • Posts.AnswerCount

There may be others; please feel free to suggest them.

1
  • This will be the case in the next export Commented Apr 1, 2014 at 19:31
3

Nick wrapped this up today, mostly as specified with Anna's adjustments. The one notable difference is the replacement of ReviewTaskResults.CompletedReviewTask with ReviewTask.CompletedByReviewTaskId - this is just easier to use.

Here are a couple of demos to get you started:

0

It would be great if you could add a magic link for the review tasks, like we have for the post itself, comments, and users (maybe others?). Something like review.PostID [Review Link]. Or it may be necessary to specify a specific queue, so [CloseReview Link], [FirstPostReview Link], etc.

2
  • note that questions can go through close review queue twice
    – gnat
    Commented Jan 20, 2014 at 15:21
  • 1
    Such a "magic link" would have to look up the PostID in the ReviewTasks table and pick a task to link to; as gnat notes, there might be more than one (particularly if a specific queue wasn't specified). I tend to think this would be easier handled in the query itself.
    – Shog9
    Commented Jan 20, 2014 at 16:44

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .