19

Is there a particular reason why vote times are omitted in the data dump?

IMO this could be a very useful data-set. Is this one of the steps that was done to anonymize the data, or does it simply not exist? This would be a very good addition to future dumps to help build time-based relationships, which are currently very rigid and full of assumptions.

3
  • It must exist because without it they couldn't do rep recalcs.
    – cletus
    Commented Jun 30, 2009 at 1:46
  • Well technically you could do a recount on the date alone, but it doesn't accurately describe the flow of events. Commented Jun 30, 2009 at 2:15
  • Hi, I'm working with the Sept 2011 data dump. This includes a timestamp in the Votes table. Can you update me on how this has been adjusted? Is there a systematic or random shift added? Thanks
    – paulusm
    Commented Apr 29, 2012 at 20:38

2 Answers

26

One of Jeff's requirements in releasing the data dump was that specific user voting data would not be available. The site goes to great lengths to keep voting data private, and I support that. Jeff strongly had in mind the AOL data dump debacle (google it if you're not familiar) in which AOL thought they had anonymised a search dataset but enterprising researchers were able to correlate data searched for with other information and actually identify real-world individuals. Like, down to their home address, just from what they typed into the AOL search box.

Stack Overflow obviously has less private information and less potentially invasive results if voting data were to be exposed, but if the online site keeps voting data private then the data dump should respect that privacy too.

If the millisecond-resolution vote timestamp were included in the dump, I believe the up/down voting patterns could strongly correlate with other activity on the site (questions, answers, comments). The more history available in the dump, the stronger the correlation can be. People use Stack Overflow during certain times of the day and not others, and the usage pattern will be distinct for each individual. There might be enough pattern information in there to identify who cast a given vote or votes.

I'm not completely certain that one could get useful information out of timestamp correlation in this way, but I think there's enough of a risk that I suggested truncating the timestamps. If somebody can present a convincing argument that there wouldn't be a way to discover user voting patterns, then the data dump can always be changed for future runs. It's certainly not set in stone.
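
To illustrate what truncating the timestamps means, here is a minimal Python sketch. It assumes that truncation means dropping the time-of-day component and keeping only the calendar date; the actual dump process is not shown in this thread and may work differently. The `truncate_to_day` helper and the sample vote records are hypothetical, invented purely for illustration.

```python
from datetime import datetime


def truncate_to_day(ts: datetime) -> datetime:
    """Drop the time-of-day component, keeping only the calendar date."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


# Hypothetical vote records as (post_id, vote timestamp); not real dump data.
votes = [
    (101, datetime(2009, 6, 29, 9, 15, 42, 123000)),
    (101, datetime(2009, 6, 29, 21, 3, 7, 456000)),
    (102, datetime(2009, 6, 30, 1, 46, 0, 0)),
]

# What a published dump row would look like under this assumption.
published = [(post_id, truncate_to_day(ts)) for post_id, ts in votes]
```

After truncation, the two votes on post 101 carry identical timestamps, so they can no longer be lined up against the millisecond-level timing of one person's posts or comments. The cost is the one raised in the question: the order of events within a day is lost.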

6
  • 2
    You are right! Never underestimate the power of statistical evaluation on long-term data. Commented Jun 30, 2009 at 8:22
  • 2
    I'd agree with this in some cases, but due to the hundreds of thousands of users, and thousands of votes per hour there really is no way of correlating this information back to users. In the same statistical sense I could say that a nasty comment left will most likely correlate with the next down-vote, and a positive comment with an up vote. However anyone can assume this information already simply by looking at the comment itself. There's no real way for me to prove that every n comments y user posts, and that because of z's comment it is likely that this comment is also y's. Commented Jun 30, 2009 at 13:16
  • 1
    Without included post times you can build a fairly accurate picture of the times that votes were cast already, not quite milliseconds, but in the realm of minutes. By revealing this you're also not really giving us anything we can't already obtain. Any determined user can already track a user's behaviour through the site by following their recent activity and watching the reputation patterns against them. A user's comment and posting times are more incriminating than vote times, because they not only implicate votes but expose behaviour. Also, the n, x, y, z example above should be about votes, not comments. Commented Jun 30, 2009 at 13:26
  • 1
    I agree with @Ian's argument: provide a vote timestamp rounded to the minute. Commented Jan 17, 2010 at 11:11
  • But now as it seems I cannot even perform a simple query for active bounty questions because a) the BountyStart column is inaccurate and b) the Votes table seems not to be updated on a daily basis. E.g. select v.PostID AS [Post Link], v.CreationDate, v.BountyAmount, DATEADD(day, 7, CreationDate) as CloseDate from Votes v join VoteTypes vt on v.VoteTypeId = vt.Id where vt.Name = 'BountyStart' and v.CreationDate > GETDATE() - 7 will only show a few questions from 6 days ago. This is not very helpful.
    – kriegaex
    Commented Mar 4, 2017 at 11:09
  • I don't get how this obfuscation protects sensitive user data. Any user can see the exact timestamp at which they get an upvote or downvote anyway. Why should this timestamp not become public? Since there is no link to the voting user in SEDE, what harm can be done? The downside of this obfuscation is that we cannot track voting activity over the course of a day.
    – user522966
    Commented Sep 4, 2019 at 18:54
12

This was done to anonymize the data.

6
  • 1
    I suspected this, but then I had a difficult time trying to figure out how knowledge of the vote time would give away who the voter was. Commented Jun 30, 2009 at 2:39
  • How would the time make the votes identifiable? I'm certain there's a good reason, I just can't think what it would be..
    – dbr
    Commented Jun 30, 2009 at 2:58
  • we need Greg Hewgill to comment on this, he suggested it, and we agreed.. Commented Jun 30, 2009 at 3:30
  • I would love that input, because out of everything I can foresee this data is very useful with no major downsides. No other information really alludes to a way of correlating these with users. Commented Jun 30, 2009 at 4:19
  • Would it be possible to "tamper" with the timestamp by adding a little random jitter? Commented Jun 30, 2009 at 9:44
  • The time is not necessary to hide since we already hide UserId to anonymize the data. Also see this comment by duplicate question's OP
    – Himanshu
    Commented Oct 1, 2013 at 11:22
