June 2023 Data Dump is missing

Question

The data dump usually gets refreshed the first weekend of the month, every 3 months.

The current data dump is still from March. Is there just a problem and it's delayed like in the past?

^{Relevant company information is revealed in the answers, first by former employee AMtwo, then by current employee Jody Bailey, and finally by Philippe.}

If the data was published it might make it easier to debunk the nonsense... — Rory Alsop, Commented Jun 7, 2023 at 21:29
I don't think it was ever done on time. It's a very complex process, hence always failing at some point. I consider it a miracle when it's actually working and there is actually a data dump. (Kind of like launching a spaceship.) — Shadow Wizard, Commented Jun 8, 2023 at 7:08
@ShadowWizardStrikesBack rude...I disagree that it was never done on time. :) — Taryn, Commented Jun 8, 2023 at 14:01
@ShadowWizardStrikesBack I can confirm it has been frequently late since Taryn left. 🤣 The upload to the Archive would frequently hang and need babysitting. That said, we actually resolved that recently, so the data dump wouldn't be late anymore. — AMtwo, Commented Jun 8, 2023 at 15:31
I'm hoping someone from the Company gives an official response soon. Having been recently laid off, I'm not in a position to give a complete/official answer here... But if there's no official response soon, I'll jump in with what I can offer. — AMtwo, Commented Jun 8, 2023 at 15:34
Aaron just took the metaphorical slaps for me, by handling the public comms. It was the other redundant DBRE who got the process running smoothly, and me pointing and crying when I had to manually upload files after it got hung. — AMtwo, Commented Jun 8, 2023 at 20:42
An uncharitable observer might conclude the data contradicts the company's narrative. I'm not feeling particularly charitable right now. — Gloweye, Commented Jun 9, 2023 at 10:49
Interestingly, the public availability of the data was a main reason I chose to contribute to Stack Exchange in the first place. If that's not just a delay, I might request deletion of all my contributions. — miku, Commented Jun 9, 2023 at 14:23
@miku You don't have an option to delete your contributions, you've already licensed them out (and the terms don't give you any option for revocation). Future contributions you can decide on, though. — Bryan Krause, Commented Jun 9, 2023 at 14:37
Jody has responded in an answer below. I've updated the status to planned. — Rosie, Commented Jun 9, 2023 at 18:27
@DataDude Acceptance, here, would be for visibility. It's important for people to be able to see the official answer. (Answer quality is what votes are for.) — wizzwizz4, Commented Jun 10, 2023 at 0:52
@Rosie what does the status planned mean here though? That the data dumps would be released in some form, or SE is digging in its heels and there will be serious restrictions to access by the community that has contributed all the content that's actually in the dump? — Journeyman Geek, Commented Jun 10, 2023 at 8:28
Alternatively, you could rephrase this to be the actual statement, with additional info and prompt for questions, Q&A style. Thomas Owens and kaya3 had on-point concerns that probably could be expanded here, rather than a self-answer that is a mirror of another question. — Sébastien Renauld, Commented Jun 13, 2023 at 23:04
I don't understand the strategy of opening up or not opening up of new questions by the company. Sometimes they hijack their old questions and update them with new information, now they open up a new question with the same content as an old question. They should know how this Q&A system is supposed to work. — NoDataDumpNoContribution, Commented Jun 14, 2023 at 4:15

Robotnik · Accepted Answer · 2023-10-27 01:11:27Z

275

+1000

Update 2023-06-18

The Data Dump has been re-enabled. The latest data dump is available on The Internet Archive. It was uploaded on 2023-06-14, and contains data up to 2023-06-03.

Additionally, a comment written by Stack Overflow founder Jeff Atwood under the official response reads (emphasis his):

I have confirmation via email from Prashanth that this is, indeed, the new official policy. I'm glad to see it. Creative Commons is part of our contract with the community, and it should never be broken -- however, CC does need to address the AI issue in an updated license, in my personal opinion.
-- Jeff Atwood

Original Answer

DISCLAIMER: I was recently impacted by the Company's layoff. I am going to carefully respond in a way that ensures I don't reveal anything the Company may feel is confidential--particularly with regard to strategy, or future plans. Any knowledge I have on strategy or future plans is both dated and confidential, and thus it would be irresponsible for me to say more. As a result, this answer may feel incomplete. I suspect that the CM team is rather busy this week with other topics. I'm offering what I can to uphold the Company's values of Transparency & being Community-centric.

The upload to the Internet Archive has been disabled.

The job that uploads the data dump to Archive.org was disabled on 28 March, and marked to not be re-enabled without approval of senior leadership. Had it run as scheduled, it would have completed on the first Monday after the first Sunday in June.

I mention the timing, as this change long pre-dated the current moderator strike and related policy changes. Some comments have suggested otherwise, so I thought it an important detail.
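For readers who want to check the date, the completion rule described above ("the first Monday after the first Sunday" of the month) can be sketched as simple date arithmetic. This is only an illustration; the function name is hypothetical and is not part of the actual upload job.

```python
from datetime import date, timedelta


def first_monday_after_first_sunday(year: int, month: int) -> date:
    """Return the Monday that follows the first Sunday of the given month.

    Hypothetical helper for illustration; not the Company's scheduler.
    """
    first_of_month = date(year, month, 1)
    # date.weekday(): Monday is 0, Sunday is 6
    days_until_sunday = (6 - first_of_month.weekday()) % 7
    first_sunday = first_of_month + timedelta(days=days_until_sunday)
    return first_sunday + timedelta(days=1)


print(first_monday_after_first_sunday(2023, 6))  # 2023-06-05
```

For June 2023 the first Sunday is 4 June, so by this rule the upload would have completed on Monday 5 June.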

Is it going to stay that way?

The following is an excerpt from a different answer provided by Jody Bailey, CTO of Stack Overflow:

We are looking for ways to gate access to the Dump, APIs, and SEDE, that will allow individuals access to the data while preventing misuse by organizations looking to profit from the work of our community. We are working to design and implement appropriate safeguards and still sorting out the details and timelines. We will provide regular updates on our progress to this group.

How can I access that data?

Stack Exchange Data Explorer (aka SEDE) contains a subset of all data for all sites, with PII removed. The same data available in the data dump is also available on SEDE.

SEDE is updated via a weekly full refresh (every weekend). The Data Dump that is uploaded to Archive.org is a dump of the SEDE databases. The weekly SEDE refresh runs, then the data is dumped to XML & 7zipped, then the 7z files are uploaded to the Archive.

SEDE can't address all the use cases of the Data Dump, nor vice versa. However, there is overlap, and the data is at least queryable.
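As a rough sketch of what working with the dump looks like once a 7z file has been extracted: the XML can be streamed rather than loaded whole, which matters for the larger files. The layout assumed here (one `<row .../>` element per record, with a `PostTypeId` attribute) is an assumption for illustration and is not described in this answer; the sample data is made up.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_post_types(xml_source) -> Counter:
    """Stream an extracted dump file and count rows per PostTypeId.

    Assumes each record is a <row .../> element (an assumption about
    the dump layout, not something confirmed in this thread).
    """
    counts = Counter()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            counts[elem.get("PostTypeId", "unknown")] += 1
            elem.clear()  # release the element's data to keep memory low
    return counts


# Made-up sample standing in for an extracted XML file
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="5" />
  <row Id="2" PostTypeId="2" Score="3" />
  <row Id="3" PostTypeId="2" Score="1" />
</posts>
"""

print(count_post_types(io.BytesIO(SAMPLE)))
```

`ET.iterparse` also accepts a file path, so the same function could be pointed at a file on disk instead of the in-memory sample.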

edited Oct 27, 2023 at 1:11 by Robotnik

answered Jun 9, 2023 at 13:01 by AMtwo

31

Well that sucks...
– Nick is tired
Commented Jun 9, 2023 at 13:46
51

I’m sad to see that this has happened. Thank you for providing an answer.
– Taryn
Commented Jun 9, 2023 at 13:48
153

The data dump is also part of the commitment the company has had with the community for years. I would guess this is part of their master plan to sell data but some heads up would have been nice.
– Journeyman Geek
Commented Jun 9, 2023 at 13:50
9

@JourneymanGeek Asking for forgiveness is easier than asking for permission amirite?
– Data Dude
Commented Jun 9, 2023 at 14:11
27

I see next step is going full Reddit and asking payment for API usage...
– Mast
Commented Jun 9, 2023 at 14:16
33

If they're doing this because they plan to sell access to the data, that would make some business sense except that people who are using this data to, say, train AI models, currently believe that doing so is fair-use, and they can do so by crawling the data publicly available. Anyone with a financial interest big enough to spend money on the data is going to be just as capable of obtaining the data without the dump, so this only affects people like academic researchers.
– Bryan Krause
Commented Jun 9, 2023 at 14:36
95

I would never have thought I'd see the day, but this is the end for me on SO. The unwritten contract between me and this platform was that the knowledge remains freely accessible and archivable while I contribute. Since that is no longer the case, after 12 years of membership, that's it. I'll leave and contribute elsewhere. Sorry, but this is a point where no compromise is possible.
– NoDataDumpNoContribution
Commented Jun 9, 2023 at 14:37
56

@AMtwo I admire your commitment to the community even after the company has thrown you under the bus.
– Emil Jeřábek
Commented Jun 9, 2023 at 14:43
73

The Data Dump was literally one of the safeguards in case the company became evil! And I do mean literally - Jeff or Joel said that, in the early days of Stack Overflow.
– S.L. Barth is on codidact.com
Commented Jun 9, 2023 at 14:50
39

This is deeply concerning. If the next step is making all API access paid, I'm going to resign. Numerous community-driven tools, including in moderation, rely heavily on the API to function. Forcing all those projects or even just individuals making QOL tools to pay for something that benefits the platform is outright despicable and self-destructive, and is not the SO I signed up for
– Zoe - Save the data dump
Commented Jun 9, 2023 at 15:12
31

Just sorta stating the obvious here, but the timing of this is unbelievably terrible; I actually can't fathom a worse time for this call to be made than in light of this week.
– zcoop98
Commented Jun 9, 2023 at 15:23
26

@zcoop98 Well, the call was made 3 months ago and hidden from the community. It's just exploding today.
– Restore The Data Dumps Again
Commented Jun 9, 2023 at 15:32
17

So the company unilaterally changed the policy and didn't even tell us about it? Who could have seen this coming?
– kaya3
Commented Jun 9, 2023 at 15:38
20

Here is where it was announced 14 years ago: stackoverflow.blog/2009/06/04/… RIP SE
– Bob
Commented Jun 9, 2023 at 16:08
29

@DimeCadmium -- regardless of my employment status, I personally value transparency and supporting the communities I'm a part of. Those are stated values of the company and part of why I started working there. And they are still things that I value highly. I don't see it as lip service to the company at all--if anything, it's a reminder to everyone involved about the origins of this site. No tea, no shade, just a reminder to everyone to take a breath and look up.
– AMtwo
Commented Jun 10, 2023 at 3:32


curious · Accepted Answer · 2023-06-09 18:53:45Z

Few things have made me more livid than what we've just learned through the answer given by AMtwo.

The data dump is one of the main reasons why I'm still participating to this day and has been part of the core values of this community. Please restore it, or consult with us to establish some equivalent way to protect and export our contributions as a whole.

I would like to bring the following pieces of history to the foreground. This is an excerpt of Joel Spolsky speaking on the Stack Overflow podcast #84

Oh, expropriation of community content that... We created Stack Overflow to be against it. If there's anything that's more in the DNA of Stack Overflow than that, I don't know what it is. That's one of our most core things. You can see this all over the place in the design of Stack Overflow.

First of all, from day one, we use the CC-wiki license. And it's basically a license, it says that we don't own the content that's on there, which is why we make those database dumps that are available.

Because we wanted to make sure that if no matter what happens, literally no matter who we sell to, or raise money from, or turn the site over to, and even if they take Stack Overflow, and make it an evil site where you have to pay to look at things and there's pop-up ads and pop-under ads, and you know, dancing chariots of fire that cross the screen and punch the monkey, and, man, I can take so many evil things anyway. And it just becomes a big gigantic spam site.

Doesn't matter because just take the latest CC-wiki download that we provided and go start your own site saying, you know what, this is gonna be the clean version. And I think a lot of people will follow you. We very, very deliberately built Stack Overflow in a way that there wouldn't be any chance of locking and we're pretty much doing the same thing with Stack Exchange.

Also, on the blog, written by Jeff Atwood in 2009:

The community has selflessly provided all this content in the spirit of sharing and helping each other. In that very same spirit, we are happy to return the favor by providing a database dump of public data.

yes, that was Joel. In the meantime, the current management... Have my vote, it is always useful having people pointing out how the current direction the company is going is against anything the network was meant to stand for in the past. — SPArcheon - on strike, Commented Jun 10, 2023 at 17:48
So who's spinning up the clone with the latest dump? Do we need to get a kickstarter going or something? — BryKKan, Commented Jun 11, 2023 at 5:14

Philippe · Accepted Answer · 2023-06-17 01:59:30Z

127

Friday evening

That's it, and the dump is uploaded. Many thanks to the stackers involved, especially Aaron Bertrand, who babysat the thing for days.

Update as of Friday morning (2023-06-16 12:29:39Z)

We're at 40%, average upload speed is 0.25 Mbps :(

This is not likely to complete today.

Update as of Thursday evening (2023-06-16 00:52:06Z)

We're pleased to say that all but one of the files are completed, ahead of the promised schedule. However, the remaining one file (32 gigs) is uploading extremely slowly due to some network slowness. Unfortunately, I think it is likely that the remaining file will not be uploaded by end of day Friday as we initially hoped. We have folks who are continuing to nursemaid this incredibly slow upload along, but it is very clear that our initial plan is at risk at this point. We will get it up as soon as we can.

(We were up to over 50%, but unfortunately the job was in our NY data center, which experienced a blip today and the job was wiped out in the process. It has been restarted, but at 0.50 Mbps, it's slow going. I wish I could project a completion date at this point, but I hesitate to do so because I don't want to miss another stated target.)

I would also like to apologize to the community here for the lack of clarity in my initial announcement here. Although several people proofed it, we missed the lack of the word "permanently" in it, which caused confusion and frustration. That was not my intent, obviously. I thank those who counseled patience and calm while we resolved the situation, and I thank Catija for actually managing that resolution. In short, we tried, but our editing of my initial draft was insufficient, and I own that. It was my writing, and it was not clear. I'll try to be sure that doesn't happen again.

Update as of Wednesday evening (2023-06-15 02:48:09Z):

By way of status, we began uploading the dumps yesterday. For some reason, uploads to the Internet Archive are throttled right now (seriously so, we're getting around 0.25 Mbps) so it's taking MUCH longer than anticipated. Long enough that I seriously suggested writing to disk and FedEx.

However, I still anticipate that we will meet the promised delivery of end of day Friday.

Posted Tuesday 2023-06-13 22:32:13Z:

Much has been written lately of the company’s decision to pause the distribution of the anonymized data dump that has historically been posted.

Our intention was never to stop posting the data dump permanently, only to pause it while we begin to collect more information on how it was being used and by whom - especially in light of the rise of large language models (LLMs) and questions around how generative AI models are handling attribution. However, it’s clear that many individual users (academics, researchers, etc) have an immediate need to access updated versions of the dumps. So we are re-enabling the automatic data dumps (and uploading the one that’s about a week overdue). We believe that this can happen by end of the day Friday. We will continue to work toward the creation of certain guardrails (for large AI/LLM companies) for both the dumps and the API, but again - we have no intention of restricting/charging community members or other responsible users of the dumps or the API from accessing them.

As part of this project, API users should be on the lookout for a very brief survey that will be coming out (it will be announced here and on stackapps.com) that asks about the features that you most use/would like to see in the API or data dumps moving forward so that we can plan for those, as well as collect general input.

In the meantime, the data dumps will be re-enabled by end of day Friday. We will communicate here when that has been completed or if there are any delays. We will also post here prior to making any future changes to the dumps or distribution of the dumps.

edited Jun 17, 2023 at 1:59

answered Jun 13, 2023 at 22:32 by Philippe (Staff, Mod)

50

"only to begin to collect more information on how it was being used and by whom" ─ how does pausing the dump help you collect this information? If anything, it prevents you from collecting information on who downloads it. If people can download it, that is an opportunity to find out who those people are.
– kaya3
Commented Jun 13, 2023 at 22:35
29

Two questions. (1) Why do you believe that you should or must "continue to work toward the creation of certain guardrails" for the data dumps and why must these guardrails be in place specifically for AI/LLM companies? (2) Will the survey be available for people who want to use the API or data dumps but don't have the features they need today?
– Thomas Owens
Commented Jun 13, 2023 at 22:37
44

The claim that the plan all along was just to collect data on who is using the dump for what purposes, directly contradicts this other answer from an SE staff member who says the dumps were stopped in order to prevent LLM developers from using the data.
– kaya3
Commented Jun 13, 2023 at 22:55
40

I have confirmation via email from Prashanth that this is, indeed, the new official policy. I'm glad to see it. Creative Commons is part of our contract with the community, and it should never be broken -- however, CC does need to address the AI issue in an updated license, in my personal opinion. @wizzwizz4 I also edited the other post to cross link to this one.
– Jeff Atwood
Commented Jun 13, 2023 at 23:54
46

Turning the dumps back on was the right thing to do. +1 for that. But at the same time this post is still gaslighting us as to what happened and why (-1 for that) and also totally wishy-washy on what you commit to doing (-1). You've already illegally relicenced my contributions once and maintained radio silence about it, now you are hinting at a plan to do it again.
– Caleb
Commented Jun 14, 2023 at 5:46
28

So. The data dump releases were stopped without any form of announcement and as usual it took someone to notice before the company admitted it was something already planned for at least a month. You say that it was "to get stats about who uses it" while a different post claims that you were past that phase and trying to gate the access to the data. After a while, coincidentally when users started to plan how to produce the data themselves, the decision is put on hold for a while and you post this answer, once again denying users the ability to reply and forcing them to use comments [cont...]
– SPArcheon - on strike
Commented Jun 14, 2023 at 8:07
15

Hey all - we understand the confusion about how this answer was conveyed. I've worked with Philippe to identify ways we can address your feedback and, as such, I've merged the two questions and moved the comments from one to the other so that they're all in one place. Additionally, I've clarified the statement so that it conveys what we intended as we understand that this was being interpreted as a different story from what Jody shared previously. Apologies for any confusion.
– Catija
Commented Jun 14, 2023 at 13:41
14

@Catija the biggest problem is still trying to figure out what the "truth" actually is. The original statement by the CTO, the original - misleading to put it nicely - statement by Philippe - or your revised one once you saw the backlash from the statement itself. 'that this was being interpreted as a different story from what Jody shared previously.' - it literally was a different story. Jody mentioned the decision to pause/stop the dumps was intentional to prevent "abuse", the original statement said that the intention wasn't to pause. There is/was no misinterpretation.
– Script47
Commented Jun 14, 2023 at 14:06
24

If we want the company to communicate with us, I think it's better to expect back-and-forth communication and clarification as appropriate, rather than to demand that each message is conveyed perfectly at the first release. It's a barrier to actually communicating if they need to spend hours on each and every draft message. I think about how much iteration went into drafting the strike letter: that's okay for a one-off thing, but if every message took that much effort, the will to communicate is going to evaporate pretty quickly.
– Bryan Krause
Commented Jun 14, 2023 at 14:26
11

@BryanKrauseisonstrike are you kidding? How difficult is it to say this is why we disabled the dump, this is what we were thinking, clearly it's not what the community wanted so here's what we're thinking now.... Instead the CTO came out saying we intentionally stopped it for X and then another statement saying that it wasn't our intention to stop the dump, and then claiming that people are misinterpreting the statement.
– Script47
Commented Jun 14, 2023 at 14:43
12

@Script47 I think you're overemphasizing certain phrasings to make them unclear in your head rather than trying to reconcile. I think a better communication approach when you get conflicting information is to point out the potential conflict and ask for clarification; that has now happened, so, what's the problem? I am not playing defense for SO. I am very strongly urging them to make changes I feel are necessary to keep the sites I like to use operational.
– Bryan Krause
Commented Jun 14, 2023 at 14:50
13

@Script47 "Our intention was never to stop posting the data dump" is ambiguous, not different; the word stop does not indicate whether it is permanent or temporary. Jody used the word in the temporary sense, Philippe used it in a permanent sense. In Jody's post, the meaning is clarified by the word "until" later in the sentence. Now, Catija edited Philippe's post to make clear he meant the permanent sense. Now there is no ambiguity in either use of the word. However, if you assumed the statements were consistent to begin with, that interpretation was also available with the words used here.
– Bryan Krause
Commented Jun 14, 2023 at 15:29
25

I would have preferred Philippe's statement to acknowledge that this unannounced pause caused a lot of valid angst in the community and include an apology for this impact, and I would have preferred that it recognize explicitly the previous commitments made by the company to users regarding the release of data, but I do not think the message conflicts with the CTO, particularly after the clarifying edits.
– Bryan Krause
Commented Jun 14, 2023 at 15:57
11

@JeffAtwood I completely agree that CC should address LLM usage. However, ChatGPT did not use a data dump to train, but commoncrawl.org, so that's the place to start to enforce any possible CC changes, not the dump. Also, I think it's intrinsic to CC that we allow any usage that respects attribution, not merely some usage that we prefer at the moment.
– Sklivvz
Commented Jun 15, 2023 at 9:04
17

As the one nursing the process, I can confirm that the final file from the June dump has been fully uploaded, has gone through the usual processing by archive.org, and is now available for consumption. It actually finished at 22:42 UTC so, if we want to be technical, it was still delivered by end of day Friday. :-)
– Aaron Bertrand Staff
Commented Jun 17, 2023 at 1:55


Chenmunka · Accepted Answer · 2023-08-05 09:57:35Z

We are looking for ways to gate access to the Dump, APIs, and SEDE, that will allow individuals access to the data while preventing misuse by organizations looking to profit from the work of our community.

That ship has sailed. OpenAI's currently publicly available models (at time of writing) famously have a knowledge cut-off date of 2021, and few people care. Stack Exchange data is gathered, and fed into these models, via the Common Crawl project. Nobody is using Stack Exchange Data Dumps to train AI models.

Even if you were willing to violate the CC BY-SA licenses (of various versions) under which we contributed all that we have contributed, and release the data dumps under some restrictive license, it would achieve nothing, because these companies are already not respecting the restrictions of the licenses.

Rightly or wrongly (imo, wrongly), All Rights Reserved is being interpreted as legal permission to perform bulk statistical analyses that launder away any hope of attribution, then justify the actions by the fact that attribution becomes technically impossible. Our CC BY-SA license doesn't stand a chance. If Prosus wants a return on its investment in Stack Overflow, it should consider putting its weight behind getting the "SA" part of the license enforced against the likes of OpenAI. (I'd happily waive 20% of my part of the settlement from a collective-action lawsuit, if you want to profit more directly, and I'm happy to negotiate something higher.)

What you're doing now is just… destroying the network. For no good reason. I've stopped recommending people to Stack Overflow – when normally I would be teaching entire classes full of students how to use it. I am now actively avoiding contributing Q&A, which is a shame, because I've got 7 or 8 Q&A pairs tucked away in my drafts folder, meant for Stack Overflow, that I may never have the motivation to polish up now.

What you hope to achieve here is unachievable. Ask your lawyer. Heck, ask on Law Stack Exchange. What you're doing is not helping, and you do not know that you do not know that.

Side note: SEDE is already gated behind a Google ReCAPTCHA for not-logged-in users. That's to prevent people taking down your servers with a trillion high-intensity unauthenticated requests, but it serves this purpose quite competently, too. (If you want to make it slightly stricter, gate un-ReCAPTCHA'd SEDE behind the 15-reputation barrier used for upvoting: we have solid empirical evidence that that's high enough to make abuse detectable and manageable.)

@Andreasdetestscensorship Codidact's website is hard for me to use – harder than modern Stack Exchange, would you believe? But I suppose I could donate improvements and bug fixes to it. — wizzwizz4, Commented Jun 10, 2023 at 17:20
Yes, I looked into Codidact in 2020, and wanted to switch, but the UI was painful. It still is. That can be changed, though. I really like the SE UI, but copying that, and then improving it, likely isn’t legal. — Andreas moved to Codidact, Commented Jun 10, 2023 at 17:21
@Andreas detests censorship: Yes, it was also too buggy (server errors) the last time I looked. But perhaps it is time for another look. It is the most obvious destination for Stack Exchange refugees. — This_is_NOT_a_forum, Commented Jun 11, 2023 at 10:20
Re "violate the CC BY-SA licenses", I suppose you mean "in spirit", but I'll note that sui generis database rights are not necessarily eliminated, especially for posts until 2018 with the older licenses. — Nemo, Commented Jun 14, 2023 at 4:26
I have been switching to Codidact. At the current community size, the Meta there puts you pretty much directly in contact with the actual developers. If you see opportunities to improve the UI, please speak up! — Karl Knechtel, Commented Aug 5, 2023 at 11:51

This_is_NOT_a_forum · Accepted Answer · 2023-06-11 10:06:26Z

To highlight the significance that the data dump may have held for me and others:

I signed up to Stack Overflow in order to contribute to a free library of knowledge for everyone, knowing that the content license meant that I give my knowledge to everyone in the world (even AI companies, if they can abide by the license; the license doesn't differentiate between uses, they are all the same to it). I liked that a lot and wanted that. Specifically, I never wanted to lock my knowledge in only one location. That's why I saw the quarterly data dumps not merely as an add-on but as an essential core part of this service. And this understanding worked for many years.

But now the company is no longer what it used to be (with Jeff Atwood at the helm), and they cancelled the insurance (the data dumps) in order to stop some competition.

That doesn't help me. I want everyone to be able to use my knowledge equally, limited only by the license requirements (attribution, whatever that means nowadays). While legally nothing has changed (although the company was also really bad in this regard in the past), in practice the data dumps were the only independent and accessible storage of the knowledge. I wanted the company to compete on a technological level with better features, not by locking in the content more.

Without the data dumps I trust the company exactly as much as zero. There is no compromise possible for me. I cannot continue under these circumstances.

And I hope I never have to read any of these blog posts about community at the center ever again.

Maybe the blog title was trying to say that the community was being used as the pavement for the road to AI? It was kind of ambiguously worded... — Cody Gray - on strike, Commented Jun 10, 2023 at 13:06
They may have an "alternative" definition of "community": <Censored> labour. — This_is_NOT_a_forum, Commented Jun 11, 2023 at 10:10
"community at the center" - sure, center of this. They throw darts on us, and we just started to realize it now. :/ — Shadow Wizard, Commented Jun 11, 2023 at 10:14

Script47 · Accepted Answer · 2023-06-09 20:51:37Z

Stack Overflow senior leadership is working on a strategy to ~~protect~~ be able to solely monetize Stack Overflow data ~~from being misused~~ by charging companies building LLMs extortionate amounts of money. While working on this strategy, we decided to stop the dump until we could put guardrails in place. Once again ignoring the community instead of being proactive and informing them regarding this.

We are working on setting up the infrastructure to do this correctly in the age of LLMs --- where we continue to be open and share the data with our developer community but work to set up a formal framework for large AI companies that want to leverage the data.

We're quickly scrambling because this would be an amazing opportunity for us to make some more money off your backs.

We are looking for ways to gate access to the Dump, APIs, and SEDE, that will allow individuals access to the data while preventing misuse by organizations looking to profit from the work of our community. We are working to design and implement appropriate safeguards and still sorting out the details and timelines.

We're probably going to degrade the experience for people within the community who used to use these dumps because we need to somehow get companies to pay us to use this data in a more friendly way.

As for this line:

We will provide regular updates on our progress to this group.

Group?

Do they mean group as in group of users interested in the dumps? Or are they so detached from the community that they call these sites groups, like WhatsApp groups?

Nevertheless, I can't believe this was actually unironically written when they didn't even bother to tell the community that this was happening.

I guess this is what comes from having employees that no longer interact with the community.

Anyone who seriously trusts SO after all they've done deserves to be fooled.

"imagine if stackexchange went from free to paid..." All of the content posted by users on Stack Exchange network is licensed under CC-BY-SA. If SE ever blocks free access to the content, anyone can host it elsewhere. Pulling EE-like shenanigans is not practical in SE.

Source

@JCL1178: I agree except the data is open source. If EE dies, the data is gone, especially if the past happens again.

Source

^{Technically speaking, this is an answer, you might not like it or think that it's not a great answer but it doesn't make it any less of an answer. So, if you happen to feel particularly strongly about it, feel free to downvote but I personally don't think it deserves deleting. But hey, what do I know?}

¯\_(ツ)_/¯

kaya3 · Accepted Answer · 2023-06-14 18:14:58Z

45

The resumption of these dumps is very welcome, and I'm glad the company has decided to listen to the community on this issue.

That said, there's something in Philippe's answer which doesn't sound right to me:

Our intention was never to stop posting the data dump, only to begin to collect more information on how it was being used and by whom

Both parts of this statement apparently contradict other statements by SE, Inc. staff (former and current).

According to AMtwo's answer here, "The job that uploads the data dump to Archive.org was disabled on 28 March, and marked to not be re-enabled without approval of senior leadership." This is inconsistent with the claim that the intention was never to stop posting the data dump; it was stopped in March, and the stopping was clearly intentional.

Also, according to Jody Bailey's answer here, SE, Inc. is "working on a strategy to protect Stack Overflow data from being misused by companies building LLMs. While working on this strategy, we decided to stop the dump until we could put guardrails in place." This is inconsistent with the claim that the pause in availability of the dump was "only" so that SE, Inc. could begin to collect information about usage.

If it's true that the intention was only ever to begin collecting information about usage, why was this intention not mentioned earlier, and why was a different intention mentioned instead?

edited Jun 14, 2023 at 18:14

answered Jun 13, 2023 at 23:11


my guess is there is internal disagreement about how to handle this. Hence the mixed messaging. I have ZERO inside insight, that's just my opinion. 🤷 I am a very very VERY strong advocate of sticking to the guarantees we made with the CC license from inception, though. I think CC needs to revise / update their license to cover the AI use case, personally. — Jeff Atwood, Commented Jun 14, 2023 at 0:15
@JeffAtwood First someone has to argue successfully that using copyrighted work to train AI to generate content doesn't fall under Fair Use. I expect that to be quite expensive because all the companies using GenAI are going to be on the other side with deep pockets. — Bryan Krause, Commented Jun 14, 2023 at 3:11
Training a large language model is a derivative work, therefore requires attribution of everyone who contributed to the training data? — gerrit, Commented Jun 14, 2023 at 6:50
@gerrit Is it illegal for a firm to train an AI model on a CC BY-SA 4.0 corpus and make a commercial use of it without distributing the model under CC BY-SA?. Note that before LLMs, we had the same question for word embeddings, but few people raised the issue back then Distributing machine learning models (e.g., word embeddings) based on non-sharable datasets and I don't recall seeing any court cases about it. — Franck Dernoncourt, Commented Jun 14, 2023 at 7:11
Considering that thanks to "some meddling ex employees" the cat was already out of the bag there was no need to try to keep this secret anymore. Even better, by conceding this right now they are giving the userbase the "false" idea that they "won" something, and this could make them more well disposed and yielding in whatever discussion about ending the strike is going on right now. — SPArcheon - on strike, Commented Jun 14, 2023 at 8:36
@JeffAtwood "I think CC needs to revise / update their license to cover the AI use case, personally." But then this newly revised CC licenses maybe wouldn't be compatible with the older CC versions anymore with difficulties for editing content. And how should that be solved practically? I cannot literally acknowledge every single place I learned something from when typing something. Should I write endless lists of footnotes with possible references for why I might say what I say? — NoDataDumpNoContribution, Commented Jun 14, 2023 at 8:45

starball · Accepted Answer · 2023-06-13 23:12:30Z

Our intention was never to stop posting the data dump

Okay that sounds real nice of you... But uh- why was there zero transparency in all of this? It was revealed by ex-staff, and then actual staff jumped in to stop the sh*tstorm.

As part of this project, API users should be on the lookout for a very brief survey [...] about the features that you most use/would like to see in the API or data dumps moving forward so that we can plan for those, as well as collect general input.

Why a survey though? You have this?

Ajedi32 · Accepted Answer · 2023-06-11 02:29:07Z

26

I want to re-iterate what others have said in the comments. The fact that my contributions to Stack Overflow are licensed under CC BY-SA is a significant motivating factor for why I contribute in the first place. I want the knowledge I share here to be accessible and freely usable by everyone, including LLMs, and such use is already allowed by the current license so it's unclear what "misuse" Stack Exchange Inc. is trying to prevent here.

In the past, sites have even gone so far as to mirror user content from Stack Overflow wholesale, and that was always considered okay as long as the sites provided attribution as required by the license and weren't doing anything illegal.

Has this changed? The wording in Stack Exchange's official answer to this question has me concerned. If "organizations looking to profit from the work of our community" is now considered a problem, I suppose the next logical step is to remove the CC BY-SA license and replace it with CC BY-NC-SA, or stop releasing user contributions under an open license entirely? If that's the path things are headed down now, I will be significantly less motivated to contribute to Stack Exchange sites in the future. I want to contribute to an open repository of knowledge that anyone can use, not a proprietary database controlled entirely by Stack Exchange, Inc.

answered Jun 11, 2023 at 2:29


The problem is LLM's are not following the SA part. The solution is to sue them, not to make the data harder to access. — OrangeDog, Commented Jun 11, 2023 at 8:45
Most scraper sites don't provide proper attribution (many of which don't provide any attribution at all (pure plagiarism), to make it seem it is their own content). It could be due to ignorance, but more likely it is malicious. Or they have an "alternative" definition of plagiarism. The company gave up enforcing the license a long time ago (reporting scrapers is busy work, fooling those who want to do the right thing). — This_is_NOT_a_forum, Commented Jun 11, 2023 at 10:07
@OrangeDog +1 you can sue LLM creators (though as per current caselaw you'll almost definitely lose) but you can't pretend like SE owns the content. — JonathanReez, Commented Jun 14, 2023 at 10:40
@JonathanReez indeed they don't own it, we do. But we would really like them to deal with license violators for us, right? Otherwise we all have to do it separately for our own posts, and we'll all lose, if we can even afford it in the first place. But I don't want them to pretend to do it by making stupid decisions like this. — OrangeDog, Commented Jun 14, 2023 at 10:47
@OrangeDog yes, they could certainly start a class action lawsuit and invite affected parties to join. But as per current caselaw they'll lose, so they'll probably have to lobby for a change in copyright law first. — JonathanReez, Commented Jun 14, 2023 at 10:48

tripleee · Accepted Answer · 2023-06-14 06:41:18Z

In an answer to an identical question Philippe today announced that the data dumps will be reinstated again, and the June data dump should be available soon.

Much has been written lately of the company’s decision to pause the distribution of the anonymized data dump that has historically been posted.

Our intention was never to stop posting the data dump, only to begin to collect more information on how it was being used and by whom - especially in light of the rise of LLMs and questions around how genAI models are handling attribution. However, it’s clear that many individual users (academics, researchers, etc) have an immediate need to access updated versions of the dumps. So we are re-enabling the automatic data dumps (and uploading the one that’s about a week overdue). We believe that this can happen by end of the day Friday. We will continue to work toward the creation of certain guardrails (for large AI/LLM companies) for both the dumps and the API, but again - we have no intention of restricting/charging community members or other responsible users of the dumps or the API from accessing them.

As part of this project, API users should be on the lookout for a very brief survey that will be coming out (announced here and on stackapps.com) that asks about the features that you most use/would like to see in the API or data dumps moving forward so that we can plan for those, as well as collect general input.

In the meantime, the data dumps will be re-enabled by end of day Friday. We will communicate here when that has been completed or if there are any delays. We will also post here prior to making any future changes to the dumps or distribution of the dumps.

My take on it: The immediate danger of the data dumps no longer being available is averted, but I find it hard to believe that it was not the intention to "stop posting the data dump". If that wasn't the case, why not announce that back when the decision was made, and why announce a reversal of the stop only after the community had written extensively about it?

This reads rather like damage mitigation, and my trust in the company not doing it again anytime soon is still really close to or at zero.

This is just my personal opinion, but stopping the data dumps even if it caused only a delay in the end is still a bad sign. One doesn't need to stop a service in order to research it, and the company should know very well how important the data dumps are to the community. At the very least, this shows a certain disinterest in the feature.

I actually missed that the answer by Philippe has also been posted here, because when I saw it over there I stopped searching. So I quoted it in this answer. Maybe not so much point in it now, but also shows that cross-postings maybe aren't the best idea. — NoDataDumpNoContribution, Commented Jun 14, 2023 at 9:14
Note that the phrasing of this statement was amended by Catija, apparently to make Philippe's actual intent more clear, to say that their "intention was never to stop posting the data dump permanently". Skepticism is certainly justified; I do feel like that message is pretty in line with the original statement on the matter, however, which said from the beginning it was a measure "until guardrails were put into place"... for as much value as that was worth at the time. — zcoop98, Commented Jun 14, 2023 at 15:24
@zcoop98 I'm also skeptical about the guardrails; the corresponding question convinced me that it's basically impossible, and I don't see any urgency. The data dump isn't a risk for anyone; it's just the common possession of all contributors who collect their contributions there (or let them be collected, to be precise, but that is part of the deal). — NoDataDumpNoContribution, Commented Jun 14, 2023 at 16:14

SPArcheon - on strike · Accepted Answer · 2023-06-15 09:38:59Z

Disclaimer: sorry, the current format is due to a merge, and originally this was a reply to the linked company post.

So, let's look at the official reply.

Much has been written lately of the company’s decision to pause the distribution of the anonymized data dump that has historically been posted.

Personally I have nothing to say here, but I can foresee some users pointing out that the chosen wording seems to imply that it was the userbase specifically that said too much about the issue and hallucinated conspiracies that did not exist (before this message the company limited all interaction to just one post). Therefore I accept that this could feel a little condescending to some, but in my opinion I would just ignore it.

Our intention was never to stop posting the data dump, only to begin to collect more information on how it was being used and by whom - especially in light of the rise of LLMs and questions around how genAI models are handling attribution.

This is in direct contrast with the information the community previously got from the company CTO, who said:

We are looking for ways to gate access to the Dump, APIs, and SEDE, that will allow individuals access to the data while preventing misuse by organizations looking to profit from the work of our community

It is a similar but different message. It is also worth noting that if the intent was indeed to "begin to collect more information on how it was being used":

I don't see how you could do that while the dump was not accessible anymore.
I don't see why you should do that without any warning.

But wait! Maybe that was the test? Getting data about how long it would take the community to notice? If that was your plan, great! Now we know that someone in the community is using the dump and will notice when it goes missing.

That said, there is also another logical association some will probably make. This happened at about the same time the company announced to the press some plan about the monetization of the data dump.

However, it’s clear that many individual users (academics, researchers, etc) have an immediate need to access updated versions of the dumps. So we are re-enabling the automatic data dumps (and uploading the one that’s about a week overdue).

Again, I won't argue here, but I'll let you know that this reads a tad like "But since you folk can't wait five minutes [...]". The problem is not the wait; the problem is that the stop came out of nowhere since you "forgot" to announce it and even if you probably are not legally bound to provide said data, users still expect to be warned if the implicit agreement changes.

We believe that this can happen by end of the day Friday. We will continue to work toward the creation of certain guardrails (for large AI/LLM companies) for both the dumps and the API, but again - we have no intention of restricting/charging community members or other responsible users of the dumps or the API from accessing them.

I fear you... can't? Either you provide the dump "as is", and nothing can stop anyone from reposting it even if the original download required a login, or you try to "watermark" it in some way that would work around the current license and make it a derivative work that would not be reproducible, but in that case... That is not the dump anymore.

As part of this project, API users should be on the lookout for a very brief survey that will be coming out (announced here and on stackapps.com) that asks about the features that you most use/would like to see in the API or data dumps moving forward so that we can plan for those, as well as collect general input.

In the meantime, the data dumps will be re-enabled by end of day Friday. We will communicate here when that has been completed or if there are any delays. We will also post here prior to making any future changes to the dumps or distribution of the dumps.

Fine, I guess. Why wasn't this the original way to go?

Conclusions

This announcement leaves me with some mixed feelings.
On the surface, this seems like a "victory" for rationality: the dump has been restored, everything is fine now and you even promised to work on a solution together (albeit in the form of a survey).

Yet, in the current circumstances of the ongoing strike, I can't help but feel like this has been "staged" in some way as a part of a bigger strategy to show that the company did its part to "reach out" to the moderators on strike and now they should do their part too, unless they are "bad actors with an agenda to prolong a conflict that the company was trying to solve rapidly".

The decision to shut down the dump seems to precede the strike, and that was confirmed by a neutral third party, an ex-employee, so the two should not be related. But at the same time, this victory only restores the initial status quo without any real gain for the community. With this stunt, the company has managed to "do its part toward an agreement" without actually changing anything, and now it feels only expected that the moderators should concede and "do their part" on something too. To put it bluntly, while it makes little sense if we assume all the related information the community got is accurate, it all manages to feel very strategic and in a way even convenient for the company.

I would modify 'from another staff member that said' to 'from the CTO that said' as that's quite important considering the position of person that posted it as opposed to it being just another employee. — Script47, Commented Jun 14, 2023 at 14:00

user1376343 · Accepted Answer · 2023-06-10 01:29:07Z

1

To answer the question of "will Stack Exchange eventually charge for API or data dump use", this can be found in the WIRED article from April 20th: Stack Overflow Will Charge AI Giants for Training Data

It has such passages as:

Stack Overflow’s decision to seek compensation from companies tapping its data, part of a broader generative AI strategy, has not been previously reported. It follows an announcement by Reddit this week that it will begin charging some AI developers to access its own content starting in June.

and contains information on company direction from the CEO:

“Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive,” Stack Overflow’s Chandrasekar says. “We're very supportive of Reddit’s approach.”

Chandrasekar described the potential additional revenue as vital to ensuring Stack Overflow can keep attracting users and maintaining high-quality information. He argues that will also help future chatbots, which need “to be trained on something that's progressing knowledge forward. They need new knowledge to be created.” But fencing off valuable data also could deter some AI training and slow improvement of LLMs, which are a threat to any service that people turn to for information and conversation. Chandrasekar says proper licensing will only help accelerate development of high-quality LLMs.

answered Jun 10, 2023 at 1:29


Thanks! But this was discussed at the time, and isn't really relevant to this question. — wizzwizz4, Commented Jun 10, 2023 at 1:39
It was, and people seem to have missed it and have questions about the company direction or expecting the next step to "full Reddit and asking payment for API usage" or statements like "This is deeply concerning. If the next step is making all API access paid, I'm going to resign." For people making those statements, that should be seen as inevitable rather than just possible. — user1376343, Commented Jun 10, 2023 at 1:44
Some people making those statements have access to behind-the-scenes information that you are not party to. It's not entirely hopeless yet. Several things are basically set in stone, but I don't think "make all API access paid" is one of them. — wizzwizz4, Commented Jun 10, 2023 at 1:46
That was clearly the case that none of the recent decisions have caught any of those who have closer access to staff channels off guard. And for those who do have that access and such statements have been made in those channels, they tend to be very tight lipped about any of the goings on there until the company makes those changes public. Or that previous appeals to the upper management for issues have resulted in significant changes or about faces to planned operations after such statements have been made to the press, investors, and parent companies. — user1376343, Commented Jun 10, 2023 at 1:52
Okay, but the article you're citing is from 1⅔ months ago. I feel like quoting it here (now that the context of the Reddit situation has changed) is misleading. Regardless, it neither answers the question nor responds to the CTO's answer. Stack Exchange sites – even meta – are not general forums. Please read the tour. — wizzwizz4, Commented Jun 10, 2023 at 1:58

Jody Bailey (Staff) · Accepted Answer · 2023-06-14 15:29:19Z

-104

Stack Overflow senior leadership is working on a strategy to protect Stack Overflow data from being misused by companies building LLMs. While working on this strategy, we decided to stop the dump until we could put guardrails in place.

We are working on setting up the infrastructure to do this correctly in the age of LLMs --- where we continue to be open and share the data with our developer community but work to set up a formal framework for large AI companies that want to leverage the data.

We are looking for ways to gate access to the Dump, APIs, and SEDE, that will allow individuals access to the data while preventing misuse by organizations looking to profit from the work of our community. We are working to design and implement appropriate safeguards and still sorting out the details and timelines. We will provide regular updates on our progress to this group.

UPDATE: See this post for an update from Philippe.

edited Jun 14, 2023 at 15:29 by Random Person

answered Jun 9, 2023 at 18:23 by Jody Bailey (Staff)

In order to prevent companies building LLMs from getting our data (which they already have, and have had for years) your plan is to keep all of us from getting our data, am I reading that right? — Restore The Data Dumps Again, Commented Jun 9, 2023 at 18:31
Just as context for casual readers since it may not be obvious, Jody is our CTO. (I am not commenting on the matter at hand, just providing this info.) — balpha (Staff, Mod), Commented Jun 9, 2023 at 18:32
Why are you not doing this openly? This concerns the data provided by the community; why are you not letting the community have a say in how this data is protected against misuse? Why don't you use our knowledge for this purpose? Excluding us from the process only brings anger, negative speculations, and mistrust. We get so much negative by doing it this way, when we could've cooperated, and gotten so much good from it instead. What is your motivation for not including us in the process? Why is there no transparency in this? — Andreas moved to Codidact, Commented Jun 9, 2023 at 18:34
Why was this not stated in March, when the decision was made to turn off future data dumps? Or, even better, why was this not discussed to get feedback from impacted members of the community to better assess the risks of turning off the data dump versus not turning off the data dump? — Thomas Owens, Commented Jun 9, 2023 at 18:36
So you could say that. You say that now only because you were caught red handed, that's what I think. — Shadow Wizard, Commented Jun 9, 2023 at 18:37
Also, it's not "Stack Overflow data". The content here is our data that we have chosen to license. So perhaps the creators and owners of that data should be more involved in making decisions about how their data is distributed. — Thomas Owens, Commented Jun 9, 2023 at 18:39
This question has been sitting on Meta for two days. This answer would look more credible if it was the first one posted. Now it just looks like poor damage control. And even then there is still a question why has this not been communicated before? — Resistance Is Futile, Commented Jun 9, 2023 at 18:41
"misuse by organizations looking to profit from the work of our community" ─ I hope this means you won't start charging for API access, because most of us would consider that a misuse by an organisation looking to profit from our work. — kaya3, Commented Jun 9, 2023 at 18:44
"organizations looking to profit from the work of our community" — that's funny when you say it like that, you know :) — Levente, Commented Jun 9, 2023 at 18:46
How can data licensed under the CC-BY-SA licenses that SE content is licensed under be "misused"? The license explicitly allows others to do essentially anything they want with the data as long as attribution is given, in particular profit off of it. It can't be "misused"; the users can only fail to give attribution or follow the share-alike requirement. It is entirely unclear in what ways restricting access to the data would improve the situation with respect to attribution. — ACuriousMind, Commented Jun 9, 2023 at 18:54
I really, really want to take this at full face value. I also think the Company really needs to understand how important this data dump is to the community, and how easy it is to read this as a message that the dumps are "temporarily postponed" indefinitely, and will never return. I don't think the Company gets to have the benefit of the doubt with messaging like this, this week of all weeks... I beg you bear in mind how the messaging comes across, and how easy it feels, from the outside, to read this as a way to misdirect and soften the blow of doing away with the data dumps for good. — zcoop98, Commented Jun 9, 2023 at 19:18
What happened to at least giving heads-up that something like this was going to happen? That's what community used to mean. This new way of just doing stuff without any communication isn't the best community experience — vbnet3d, Commented Jun 9, 2023 at 19:30
Is there any reason you think the data dumps would be the way that LLMs have used SE training data, versus crawling/scraping the site? I would have assumed the latter, since then they can use the same tool over many sites versus digesting the SE data format specifically. — Bryan Krause, Commented Jun 9, 2023 at 20:51
“No downstream restrictions. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material.” — Erik Darling, Commented Jun 9, 2023 at 21:21
A good time to have announced this change was back in April when the CEO addressed the concern with LLMs using Stack Overflow data. There was even a handy meta question about it. Hoping nobody would notice was never a good idea. — Jon Ericson, Commented Jun 9, 2023 at 22:55
