
My colleague Rosie recently announced that, because of our difficulties predictably getting data dump files to the Internet Archive[1],[2], we are setting more realistic expectations for completing that process.

A comment on that post from Fabio made us realize that it's confusing to have a quarterly data dump for, say, Q1 released in early March, which, because of how time works, couldn't possibly include all of the data for Q1.

“First weekend of the last month of the quarter” has been the publication schedule for as long as I can remember, but it doesn't have to stay that way.

Proposal

We huddled and decided that it would be beneficial to offer you a chance to help us fix that, too, as a part of this change. We've already delivered the data dump for this month, and propose that:

  • We produce a new dump for all of Q1
    Delivered sometime in April, including data up to the end of March (or later).
  • We then resume a quarterly schedule
    Producing dumps in July (for Q2), October (for Q3), and January (for Q4).

Background

The dumps are produced after the Data Explorer refresh on the first Sunday of the month and contain, at a minimum, data up until just before midnight on the last day of the previous month. Sometimes there are a few extra days of data, depending on when generation starts; because the dumps run sequentially, they can't all cut off exactly at midnight.

New schedule

With this proposal, the new schedule would look like this:

Year/Q  | Generation started by | Contains data through | Delivery by
2024 Q2 | July 7th, 2024        | June 30th, 2024       | July 31st, 2024
2024 Q3 | October 6th, 2024     | September 30th, 2024  | October 31st, 2024
2024 Q4 | January 5th, 2025     | December 31st, 2024   | January 31st, 2025
2025 Q1 | April 6th, 2025       | March 31st, 2025      | April 30th, 2025
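
Incidentally, the "Generation started by" dates above all fall on the first Sunday of the month after the quarter ends. A quick Python sketch of that calculation (the helper is purely illustrative, not part of our tooling):

    from datetime import date, timedelta

    def scheduled_start(year: int, quarter: int) -> date:
        """First Sunday of the month after the given calendar quarter,
        matching the "Generation started by" column above."""
        month = quarter * 3 + 1                 # month following the quarter
        if month > 12:
            month, year = 1, year + 1
        first = date(year, month, 1)
        # weekday(): Monday=0 ... Sunday=6; advance to the first Sunday
        return first + timedelta(days=(6 - first.weekday()) % 7)

    # scheduled_start(2024, 2) -> date(2024, 7, 7)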

Because of the points Rosie made in her post, we would still set the expectation that we might not deliver the data dump within a couple of days of generation.

Benefits

  • The dumps for any given quarter will accurately reflect all of the data from that quarter, up to just before midnight UTC on the last day of the quarter (plus, often, a little bit).
  • The Q4 schedule will no longer be affected by the holiday break.

    Mostly a benefit for the Database Reliability Engineering (DBRE) team here, but it also means better predictability for you.

  • You get one extra set of data in 2024. 🎉

We’d like your feedback

We are proposing this change based on your feedback from Rosie's post, but we want to make sure that there aren't any other considerations we're missing. Let us know if you think shifting the schedule as we've proposed makes as much sense to you as it does to us. We'll review feedback/votes over the next couple of weeks and confirm where we land on or before March 22nd, 2024.

  • There is now a 25% chance that you can use April Fool's Day as an excuse for a missing data dump! Smart planning! 😏
    – AMtwo
    Commented Mar 6 at 15:18
  • Thank you so much for the clear communication on something that could easily have been 'assumed' to not need any pre-warning. Makes me feel involved and included, even if I don't have any other feedback than this! Commented Mar 7 at 3:12
  • Need to change "Generation started by" as well, unless... time machine? ;-) Commented Apr 2 at 12:39
  • @Shadow Not sure what you mean? We intended for this dump's data generation to start by April 7th, and we met that goal. :-)
    – Aaron Bertrand Staff
    Commented Apr 2 at 12:41
  • Ah, but it looks weird to see "started by April 7th, 2024" next to an earlier "Delivery by" date, as if it started after it was delivered. So perhaps "Generation Expected Start"? Commented Apr 2 at 12:48
  • @Shadow Yeah, it's hard to mix "planned" and "actual" data in the same column. I'm planning (ha!) to put actuals in the answer, but for future dates in the question, I still want to give some wiggle room on when those will start. I've removed those rows; I hope you can sleep better now. :-)
    – Aaron Bertrand Staff
    Commented Apr 2 at 12:50
  • Thanks! sleeping like a baby Commented Apr 2 at 13:00
  • (I read the table wrong, July 30th is the expected date) Commented Jul 9 at 22:54

4 Answers


Disclaimer: I don't actually use the dumps, so my approval doesn't mean that much.

But, as the author of the comment that sparked this proposal, let me confirm that this is exactly how I'd expect the data dumps to work. And the fact that holidays cease to be a problem is certainly a nice bonus.

Do it!

That said, I'd like to thank you, twice.
First, thank you for thinking about my comment, and for turning a simple question into a suggestion.
And more importantly, thank you for coming here and discussing the change before applying it!

  • +1 especially for the last sentence. This announcement feels like a breath of fresh air. Simple and noncontroversial as this change is, getting community input on it first is a nice return to how things should be.
    – Wildcard
    Commented Mar 6 at 19:59
+1000

Since this was so well received, we have pushed a new data dump in early April and, for the foreseeable future, we plan to adhere to the schedule outlined in the question.

I've set a reminder to update this answer as each dump is delivered; here's the first update under the new schedule:

Year/Q    | Started                       | Data through     | Delivered
2024 "Q0" | March 3rd, 2024               | Feb 29th, 2024   | March 5th, 2024 ✅
2024 Q1   | April 1st, 2024, 12:28 AM UTC | March 31st, 2024 | April 2nd, 2024¹, 8:32 AM UTC ✅

This is a lot earlier than you probably anticipated.

How'd we do that?

Well, in March, the process looked like this - we'd wait until Sunday, then process each Data Explorer database in sequence. Once Data Explorer was fully refreshed, we'd start the Data Dump process - extract each of the 8 tables to XML, then compress, one database at a time. Finally, we'd start the upload process, which would also upload one file at a time:

[Diagram: original serial process]

The diagram isn't to scale, but the times are accurate: it took over 70 hours to deliver the data dump for March once it started (March 3rd), and it required an additional Data Explorer refresh on a non-Sunday.
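
To make that concrete, here's a minimal sketch of the old serial flow in Python; the helper callables are hypothetical stand-ins, not our actual tooling:

    # Illustrative sketch of the old serial flow: after waiting for the
    # Sunday refresh window, every step runs to completion before the next.
    TABLES = ["Badges", "Comments", "PostHistory", "PostLinks",
              "Posts", "Tags", "Users", "Votes"]

    def serial_dump(databases, refresh, extract, compress, upload):
        for db in databases:               # Data Explorer refresh, in sequence
            refresh(db)
        archives = []
        for db in databases:               # then the dump, one database at a time
            for table in TABLES:
                extract(db, table)         # each of the 8 tables -> XML
            archives.append(compress(db))  # then 7z
        for archive in archives:           # finally, one upload at a time
            upload(archive)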


Since then...

...we have made several improvements to the process (a rough sketch of the new flow follows this list):

  • Removed the dependency on Data Explorer, so the data dump can start shortly after midnight (UTC) on the first day of the quarter. It no longer waits for (or disrupts in any way) the Data Explorer processing; even if the first day of the quarter happens to also fall on a Sunday, they can now run concurrently.
  • Made XML file creation and zipping happen in parallel, meaning uploading can start a lot sooner. We observed more than an 8X improvement here, and Data Explorer should see about a 4X improvement going forward.
  • Eliminated ragged end times, so all data ends just before midnight (UTC) at the end of the previous period. This is true for weekly Data Explorer refreshes as well.²
  • Made uploads happen in parallel. I was nervous about throttling and retries but, while we are still a little handcuffed by the upload speed to Internet Archive³, the speeds were consistent - even with multiple files uploading at the same time. The upload completed in just over 28 hours, cutting the last upload time almost in half.
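
Put together, the new flow looks roughly like this (a hypothetical sketch; the helper callables and worker counts are placeholders, not our real settings):

    from concurrent.futures import ThreadPoolExecutor

    def parallel_dump(databases, extract_and_compress, upload):
        # No Data Explorer dependency: start right after midnight UTC,
        # even when day 1 of the quarter falls on a Sunday.
        with ThreadPoolExecutor(max_workers=8) as pool:
            # XML extraction and 7z compression overlap across databases.
            archives = list(pool.map(extract_and_compress, databases))
        with ThreadPoolExecutor(max_workers=4) as pool:
            # Several files upload at the same time.
            list(pool.map(upload, archives))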

All told, almost a 60% reduction in runtime, plus this is measured from the beginning of the quarter, instead of from the quarter's first Sunday at midnight UTC:

[Diagram: new parallel process]

Without question, this was the fastest delivery in our history, and we have more improvements planned.


Minor differences you may have noticed before I did

This is a journey. There were a few minor differences, flagged in an anonymous edit, that I can explain and, for the most part, have already fixed:

  • Some data types were rendered incorrectly. For example, milliseconds were truncated from datetime values that didn't have enough precision. Early on, I discovered an inconsistency with bit columns rendering as 0/1 instead of True/False, but this one slipped by me (mostly because most values do have enough precision).
  • Tags are delimited differently. As reported by @starball here, flattened lists of tags appear in the XML as |this|format| when they should be an XML-safe variation of <this><format>. This will be corrected for the next data dump or maybe sooner (see the sketch after this list).
  • Minor formatting differences. The previous version of the output would have two spaces before each <row element, and a space before the ending />. In the new format, those spaces were missing. Similarly, &#xA; was rendering as &#x0A;.
  • XML files were encoded using UTF-16. I validated the content of files in the new process, but failed to notice they were encoded incorrectly and were larger as a result (at least pre-compression). While this may not affect everyone, it makes little sense to bloat the base files or disrupt the established encoding.
  • Posts included deleted posts from PostsWithDeleted. This again had to do with validation - I spot-checked files for each table, but for sites small enough that I could manually open the files. Those sites I spot checked? Also happened to not have any deleted posts, so they looked exactly like files produced using the old method. (Deleted posts are pushed to SEDE, but not to the data dump, as described here.)
  • XML files are no longer guaranteed to be sorted by Id. This is true (though you might only see a difference on larger sites like Stack Overflow). The reason is that we enjoy incredible performance gains using parallelism, but we'd throw those gains away (and maybe more!) by forcing a re-sort after gathering streams. Presumably, people are loading these big files into their own systems, and not perusing multi-GB XML files in Notepad; those systems should support any subsequent sorting/indexing you need. (In truth, these weren't intentionally sorted by Id before; it was just a coincidence that the execution plan performed a single-threaded scan against the Id-leading clustered index.)
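
In the meantime, if you're adjusting a parser yourself, the tag fix described above amounts to something like this (a hypothetical sketch; our actual implementation may differ):

    # Convert '|this|format|' to '<this><format>'; an XML writer would
    # then escape the angle brackets (e.g. &lt;this&gt;&lt;format&gt;).
    def pipe_tags_to_bracketed(tags: str) -> str:
        return "".join(f"<{t}>" for t in tags.strip("|").split("|") if t)

    assert pipe_tags_to_bracketed("|this|format|") == "<this><format>"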

These corrections, except where noted otherwise above, have been applied, and a new batch of files has been successfully uploaded, with the last file finishing April 7th at 19:38 UTC.


Thank you for your continued patience as we try to further improve this deliverable.


1. Most files were actually available on Internet Archive a couple of hours into April 1st, but a handful of files take a lot longer. Subsequently, a new batch was uploaded with the fixes mentioned above, with most files arriving on April 6th, and the remainder on April 7th. The changes to account for these corrections will - at least initially - result in a longer "dump time" (the 51 minutes in the diagram above). However, since upload speed is our largest bottleneck by orders of magnitude, I plan to offset that bump by also spending a little more time up front on more aggressive compression.

2. This didn't make either process faster; it just feels a whole lot cleaner.

3. We get about 0.25 MB/s, most of the time, with occasional bursts of up to 0.6 MB/s. Our largest file, stackoverflow.com-PostHistory.7z (369 GB / 39.2 GB compressed), took over 28 hours to upload. While splitting that file up into chunks would allow us to deliver it faster, that would probably break folks, so we'll keep hoping IA will work with us on more efficient file transfer instead. In the meantime, higher compression should help a little - though reverting to UTF-8 did not help much at all, as the difference was largely compressed away anyway.
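
(For scale: 39.2 GB is roughly 40,000 MB, which at a flat 0.25 MB/s would take about 45 hours; finishing in just over 28 hours works out to an average near 0.4 MB/s, so the occasional bursts clearly help.)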

  • It might make sense to tag your question post as [status-planned], in this case? Up to you, of course :)
    – V2Blast
    Commented Mar 20 at 22:49
  • 3
    @V2Blast I'm not sure if it should ever sit in status-planned, while it makes sense in English, I think it's more of a holding spot for things that will be completed at some point in the future, and can be reflected as such. No code is changing for this one, so when would it ever transition from status-planned to status-completed? After the next data dump? After the one after that? (Not being combative, just verbalizing my inner conflict. :-))
    – Aaron Bertrand Staff
    Commented Mar 21 at 1:48
  • 6
    Its almost like common sense, logical and better policies are well loved by the community :D. It makes the dumps cover an entire quarter, the exact time of the dumps dosen't 'really' matter - rather that they're regular, and we have clear deliverables. Commented Mar 21 at 2:00
  • @AaronBertrand: Makes sense :)
    – V2Blast
    Commented Mar 21 at 23:35

Overall, I think the proposal makes sense.

If you start your export on or after the first day of the next quarter, why is the cutoff specified only as a minimum? I'm not sure how impactful it would be, but would it be beneficial for anyone to have a hard cutoff for what makes it into the data dump? That is, the Q1 data dump should contain no edits or content after March 31st, 2024 23:59:59.997. It wouldn't matter whether you start the process on April 1st, April 2nd, or April 7th; the most recent possible data in the dump would be the last moment of March.
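
Such a hard cutoff is easy to express. A minimal sketch (the helper is hypothetical, purely to illustrate), using the fact that SQL Server's datetime type cannot represent a time later than 23:59:59.997 within a day:

    from datetime import datetime, timedelta

    def quarter_cutoff(year: int, quarter: int) -> datetime:
        """Last datetime-representable instant of a calendar quarter, e.g.
        quarter_cutoff(2024, 1) -> 2024-03-31 23:59:59.997."""
        month = quarter * 3 + 1            # first month after the quarter
        if month > 12:
            month, year = 1, year + 1
        return datetime(year, month, 1) - timedelta(milliseconds=3)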

If the company is willing to invest resources at some point in the future, this schedule also opens up some flexibility for additional versions of the data dump while still meeting commitments. I think that's a future discussion, but I'd like to see either a better XML structure that is more conducive to working with large data sets, or non-XML formats like JSON or SQLite. Whatever gets decided, it should be possible to start the process early in the month, do any kind of data translation or massaging, and still make end-of-first-month delivery for the prior quarter.

  • It pulls data directly (essentially just a table dump) from the source database in SEDE, whenever it runs for that database. SEDE populates from the source database (again, all data) for a given database. We can't do all databases at the same time, nor can we do all the tables in any given database at the same time. And certainly not at 00:00 on April 1st without taking everything down. We can't use WHERE clauses, as they would be far too slow and disruptive (especially for tables that don't have supporting indexes). Same with "cleaning up" the data after collection but before publication.
    – Aaron Bertrand Staff
    Commented Mar 6 at 15:06
  • As for changing the format, I've mentioned to you before that we couldn't just change the format of the data to suit one consumer when we have to consider all stakeholders. Getting consensus on that change - since it's theoretically the entire world - would be impossible. The only possible way to do what you're asking would be to create a second, separate dump, and, well, I'm not going to comment on the likelihood of that.
    – Aaron Bertrand Staff
    Commented Mar 6 at 15:09
  • This proposal is just about the schedule. If you have feature requests about the format or the data, those should be separate imho.
    – Aaron Bertrand Staff
    Commented Mar 6 at 15:10
  • (And I said .997 instead of .999 because the data type is datetime, and .997 is the "end" of the day for that data type.)
    – Aaron Bertrand Staff
    Commented Mar 6 at 15:12
  • @AaronBertrand This all makes sense, then. This schedule and commitment opens the door to future feature requests if you commit to "last day of the first month of the following quarter". It also lets you handle things like holidays, vacations, issues, whatever comes up. I like it. Commented Mar 6 at 15:17
  • "If the company is willing to invest resources at some point in the future" ... I'm guessing there are pretty long odds there.
    – AMtwo
    Commented Mar 6 at 17:02
  • @AMtwo I agree with that. It seems unlikely that anything will be invested to improve the dumps and make them more widely useful. But if it does happen, there shouldn't be any interference between any data dump feature requests and meeting the delivery commitments. Commented Mar 6 at 17:23
  • @ThomasOwens Just don't want you holding your breath that there will be any functional changes beyond rescheduling it. That'd be a big 180 from last March.
    – AMtwo
    Commented Mar 6 at 17:37

Sounds reasonable

I don't use the data dumps, but have a bit of experience with open data etc. The proposal sounds extremely reasonable to me, and what I would expect from any similar source. I don't think the exact timing of the dump matters much, but it might be helpful to say "It will definitely be up by the end of the following month" or something.

I would have a slight preference for labelling them "Jan-Mar" or similar, rather than "Q1", which always introduces a slight ambiguity (or at least the potential for ambiguity) about whether quarters are numbered from January, or July, etc.

  • Thanks! For some additional background, the files aren't currently named or labeled with quarters or months; the metadata on the archive just indicates when the most recent upload finished. I was just using traditional calendar Qs in the question above to illustrate what the schedule will look like. I do like the idea in general, since it's a little more accurate than relying on the date of completion (and would travel with the files), but it's a potentially breaking change I'd not want to introduce at the moment.
    – Aaron Bertrand Staff
    Commented Mar 7 at 4:49
  • I'm not sure a "Jan-Mar" label would make sense. It implies that the dumps contain only the data generated during that time, when each data dump is a full-sized snapshot for a point in time. If I want to be pedantic (Aaron knows how much I do love that), I'd think of it either as "The March dump" (since it has everything through the end of March), or "The April dump" (since it is processed & posted during the month of April), or "The April 7 dump" (since that's the approximate date the snapshot is taken).
    – AMtwo
    Commented Mar 7 at 20:42

