96

The company has refused to honestly explain their plans to the community, but the data dump was late, and a former employee graciously did the community the favour of confirming that the company has discontinued the data dump on orders of senior management:

DISCLAIMER: I was recently impacted by the Company's layoff. I am going to carefully respond in a way that ensures I don't reveal anything the Company may feel is confidential--particularly with regard to strategy, or future plans. Any knowledge I have on strategy or future plans is both dated and confidential, and thus it would be irresponsible for me to say more. As a result, this answer may feel incomplete. I suspect that the CM team is rather busy this week with other topics. I'm offering what I can to uphold the Company's values of Transparency & being Community-centric.

The upload to the Internet Archive has been disabled.

The job that uploads the data dump to Archive.org was disabled in March, and marked to not be re-enabled without approval of senior leadership. Had it run as scheduled, it would have completed on the first Monday after the first Sunday in June.

I mention the timing, as this change long pre-dated the current moderator strike and related policy changes. Some comments have suggested otherwise, so I thought it an important detail.
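The schedule rule described above ("the first Monday after the first Sunday in June") can be worked out mechanically. The following is only an illustrative sketch: the function name is hypothetical and is not taken from the actual upload job, whose implementation is not described here.

```python
from datetime import date, timedelta


def first_monday_after_first_sunday(year: int, month: int) -> date:
    """Return the Monday that immediately follows the first Sunday of a month.

    Hypothetical helper for illustration; not the company's scheduling code.
    """
    first_of_month = date(year, month, 1)
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    days_until_sunday = (6 - first_of_month.weekday()) % 7
    first_sunday = first_of_month + timedelta(days=days_until_sunday)
    return first_sunday + timedelta(days=1)


print(first_monday_after_first_sunday(2023, 6))  # 2023-06-05
```

Note that this is not always the first Monday of the month: when a month begins on a Monday, the first Sunday falls on the 7th, so the rule yields the 8th.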

Is it going to stay that way?

Hopefully the Company will provide an answer that includes this.

How can I access that data?

Stack Exchange Data Explorer (aka SEDE) contains a subset of all data for all sites, with PII removed. The same data available in the data dump is also available on SEDE.

SEDE is updated via a weekly full refresh (every weekend). The Data Dump that is uploaded to Archive.org is a dump of the SEDE databases. The weekly SEDE refresh runs, then the data is dumped to XML & 7zipped, then the 7z files are uploaded to the Archive.

SEDE can't address all the use cases of the Data Dump, nor vice versa. However, there is overlap, and the data is at least queryable.


Hopefully any questions around "why" or "what's next" will be addressed by an official response.

Original Question

Wired just published a new article, Stack Overflow Will Charge AI Giants for Training Data with some interesting new statements from CEO Prashanth Chandrasekar. I've quoted the pieces that seem most relevant to me below, but I encourage you to read the full article in case I've accidentally omitted any important context.

Stack Overflow, a popular internet forum for computer programming help, plans to begin charging large AI developers as soon as the middle of this year for access to the 50 million questions and answers on its service, CEO Prashanth Chandrasekar says.

“Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive,” Stack Overflow’s Chandrasekar says. “We're very supportive of Reddit’s approach.”

Chandrasekar says proper licensing will only help accelerate development of high-quality LLMs.

They offer downloadable “data dumps” or real-time data portals to help software to access their content known as APIs. In Stack Overflow’s case, LLM developers are getting their hands on data through a mix of dumps, APIs, and scraping, Chandrasekar says, all of which today can be done for free.

But Chandrasekar says that LLM developers are violating Stack Overflow’s terms of service. Users own the content they post on Stack Overflow, as outlined in its TOS, but it all falls under a Creative Commons license that requires anyone later using the data to mention where it came from. When AI companies sell their models to customers, they “are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license,” Chandrasekar says.

Neither Stack Overflow nor Reddit has released pricing information.

Stack Overflow and Reddit will continue to license data for free to some people and companies. Chandrasekar says Stack Overflow wants remuneration only from companies developing LLMs for big, commercial purposes. “When people start charging for products that are built on community-built sites like ours, that's where it's not fair use,” he says.


I understand that plans are preliminary and lots of things are up-in-the-air, but this is touching on some topics that the community can be very sensitive to. If possible, any clarification on what we can expect would be appreciated. In particular:

  1. Is the company intending to change the licensing of user content again, and if so, would they attempt to apply those changes retroactively?

    Prashanth's statement argues that the existing Creative Commons license would require direct attribution of any information sourced from Stack Overflow, but that AI models don't make it practical to do that. Does this imply that they're planning to be able to offer a different license arrangement to paying customers, which would in turn require contributors to agree to that new license?

  2. Will the company be maintaining its commitment to (roughly-) quarterly data dumps, or are those at risk due to this situation?

    This has been a pillar of the company's commitment to the community. Is the company planning to restrict it in some way, or are licensing and real-time-API restrictions sufficient for their objectives?

Again, I understand that matters are still at a preliminary stage and the company is probably unable to give specifics. But these are very sensitive subjects to a lot of us, and it would really be appreciated if you could help set expectations, before we have to find out what you've decided from the press.

14
  • 15
    There's a similar question on MSE, but it doesn't discuss the continuity of Data Dumps, which are my top concern: Is SE [going to be] selling our content for AI model training? And what exactly does "reinvest back into our communities" mean?
    – user19567871
    Commented Apr 21, 2023 at 0:34
  • 32
    "“Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive,”" So, the cheque is in the mail, then? No? Huh. Fancy that. Commented Apr 21, 2023 at 4:07
  • 10
    Kinda sucks how our community effort of helping other programmers has been totally monetised by big corporations doesn't it
    – user438383
    Commented Apr 21, 2023 at 9:08
  • 4
    @user438383: Are you really surprised? Stack Exchange is and always has been a for-profit corporation, and the business model is and always has been to monetize the data that we as an aggregate collect and curate. This action of theirs doesn't surprise me in the least and doesn't upset me, since if this site doesn't reap some financial benefit from our efforts, then the site will not survive and go the way of Yahoo! Answers. Commented Apr 21, 2023 at 11:20
  • 2
    While it doesn't surprise me, the prospect of them stopping the offline data dump (or at least monetising it) is pretty sad @HovercraftFullOfEels Commented Apr 21, 2023 at 14:13
  • 15
    @HovercraftFullOfEels The business model was originally to monetize the fact that highly skilled programmers were all frequently collected in one online space (see: ads, SO jobs, sponsored tags, sponsored sites, etc.), not to monetize the content we created. Monetizing the content is completely new with this initiative.
    – TylerH
    Commented Apr 21, 2023 at 14:36
  • @TylerH: I agree that the initial thrust was to get coders' eyeballs to view the site, but it has been the content that is present or that gets created that did this. Now those same eyeballs are being drawn elsewhere by the very same content that has been repackaged by AI. Again, the site is and has always been for-profit, and so do you disagree with the actions of the site owners? I don't and would do the same thing if I were in their shoes. Commented Apr 21, 2023 at 15:11
  • 4
    @HovercraftFullOfEels I do disagree with their actions; if I were Joel and Jeff, I would have filed Stack Overflow the company as a 501c3 non-profit back in 2008 (or at any point between then and selling the company to an investment firm). I also would seek to do the same if I were Prashanth, but I know he's much more limited based on the board's makeup today. There are plenty of ways to make money for SO without doing it for profit or off the backs of the community in such an... exploitative... way.
    – TylerH
    Commented Apr 21, 2023 at 15:19
  • @TylerH: but they didn't, and it has never been such as long as you've been a member nor as long as I've been a member, and so are you surprised by the current owner's actions? And looking at it from a business point of view, does it make sense? Commented Apr 21, 2023 at 15:21
  • 5
    @HovercraftFullOfEels Well, you keep asking different questions. Am I surprised? No, although I am disappointed. And yes, it does "make sense" as a business decision, but that's a pretty vague statement; it also "makes sense" for a company like Stack Overflow to file as a 501c3 non-profit organization. Lots of bad things companies do "make sense" (and are perfectly legal, too), but I live in a country where I have the right to complain about such things, and am on a site that so far allows it, too, so I am availing myself of that opportunity.
    – TylerH
    Commented Apr 21, 2023 at 15:25
  • 9
    "Stack Overflow, a popular internet forum for computer programming help" man that stings, could someone say something about this 😔
    – Makoto
    Commented Apr 21, 2023 at 15:29
  • 4
    @Makoto It's not clear from the article if that fragment came from the CEO's own words or if it is a description added by the article's author, since it is not shown as a direct quote/in quotation marks. Otherwise I would have reached out somewhere to the CEO directly to inform him of that factual inaccuracy.
    – TylerH
    Commented Apr 21, 2023 at 15:30
  • 3
    @TylerH it's more a feeble attempt here since I realize we're not going to get anyone to really see that this isn't a forum, so meh. I shouldn't be adding too much noise to this.
    – Makoto
    Commented Apr 21, 2023 at 15:32
  • 1
    Nobody knows what will happen in the future. But using future content for training could result in degradation and low quality feedback cycles unless we find a way to clearly identify human generated content. Commented May 31, 2023 at 7:12

2 Answers

35

So on my tirade against AI on Meta Stack Exchange, I did a little extra digging around about what was and wasn't kosher with respect to LLM training.

In short, IANAL but I don't think that the CEO's sentiments actually hold water. At least, not from Stack Overflow's lawyer's perspective.

First, Creative Commons doesn't prescribe a medium for using the data. It only expects that the license is preserved. If an exception applies (including but not necessarily limited to fair use) then there's no requirement to adhere to the license.

Our licenses do not restrict reuse to any particular types of reuse or technologies, so long as the attribution (BY), share-alike (SA), no-derivatives (ND) and non-commercial (NC) terms are respected. Therefore, strictly from a copyright perspective, no special or explicit permission is required from the licensor to use CC-licensed content to train AI applications to the extent that copyright permission is required at all. In addition, our licenses do not override limitations and exceptions, such as fair use. If a use is not one that requires permission under copyright or sui generis database rights (e.g. text and data mining allowed under an exception), one may conduct the AI training activity without regard to the CC license.

Okay, so then if an LLM uses Stack Overflow content, they have to abide by the license. Easy peasy.

Except. Stack Overflow Inc. does not enforce the license on our behalf!

So...this sentiment from the CEO looks pretty toothless...

...When AI companies sell their models to customers, they “are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license,” Chandrasekar says.

...mostly because it's a mathematically correct answer. If someone doesn't attribute where they got their data from, they're violating the license. Is Stack Overflow going to enforce it? Probably not. It's not like they're doing so right now.

So, if you find something you wrote in an LLM, and you're not attributed, you're on your own to pursue legal action or remedy for their violation of your license. Good luck!


But this is all a red herring since that's not the main focus. They want to charge for large amounts of data use. Which is... dubious at best? Yes, they can charge people for use of the data since that's not restricted in any of the CC-by-SA licenses.

So I think by the book they could do this.


This is my personal perspective on it, but when there's so little trust and harmony left between the company and the community, a move or decision like this couldn't possibly erode it further. If this was 2018 there'd be a whole movement against this. But in 2023, because we've already been put in a tenuous place with where we stand with the company, this feels like another corporate decision being made, and we're just going to be powerless to do anything about it.

11
  • Ownership of the underlying content is also given to SO directly when you submit it, for SO to sell, modify, etc. as it sees fit. This data can be provided without attribution legally by SO if they want to provide it directly to some company who wants to pay for it (e.g. a prepackaged subset of questions and answers in a certain language used for a specific LLM). The only time the CC attribution requirement comes into play, IIRC, is if someone takes the content from the website directly, themselves, and uses it somewhere.
    – TylerH
    Commented Apr 21, 2023 at 15:59
  • 13
    @TylerH: It's not owned, it's irrevocably licensed which is a bit different than ownership. And yes, they do stipulate they can sell the content. I believe in any context - even with the company - they are explicitly bound to CC-by-SA, even if it is commercially resold (since CC-by-SA doesn't restrict that anyway).
    – Makoto
    Commented Apr 21, 2023 at 16:05
  • 2
    @Makoto: That is the elephant in the room. "Chandrasekar says proper licensing will only help ..." but he has no authority whatsoever to change the license to what he thinks is proper. The terms and conditions may give SO some rights to perform contributed content beyond CC BY-SA, but SO cannot turn around and license those rights to others.
    – Ben Voigt
    Commented Apr 21, 2023 at 16:40
  • Here's an intrusive thought @BenVoigt. I mean, they've done this in the past and they could do it again - they could just change the license for future contributions. It was more innocuous when they stayed with the same general license and just "upgraded" it, but if they decided to change the license...that'd be a way to get around this. Wouldn't be a good way though, since that's very much not in line with CC-by-SA.
    – Makoto
    Commented Apr 21, 2023 at 16:48
  • 1
    @Makoto: Today the amount of content that would be under a grandfathered license is even greater than any of those past cases. Do you think that SO is going to offer AI modelers a package that only includes content from May 2023 and later?
    – Ben Voigt
    Commented Apr 21, 2023 at 16:50
  • 1
    Cynically, Stack Overflow could just offer a "license" that says "we, Stack Overflow Inc, won't sue you", because that's the actual risk that companies want to mitigate. Even if the existing Creative Commons license doesn't permit this use (for the sake of argument), the risk of Stack Overflow users collectively going after one of these companies (with a class action or whatever) for misusing old content seems small enough to ignore, while Prashanth is making it clear that the risk of being sued by the company is very real, and they have $2B in "value" to "protect".
    – user19567871
    Commented Apr 21, 2023 at 16:56
  • @Makoto Sorry, yes, not ownership, but rather a perpetual and irrevocable right to do what they want with it.
    – TylerH
    Commented Apr 21, 2023 at 20:13
  • 1
    @TylerH: Yes, but even still they are still bound by CC-by-SA. They can introduce additional licenses (which is where the perpetual/irrevocable license comes from) but they can't be more restrictive than CC-by-SA. I interpret this to mean that, so long as Stack Exchange Inc. abides by the terms of the license and requires attribution with my data, I can't just demand they take it down.
    – Makoto
    Commented Apr 22, 2023 at 3:29
  • 1
    Isn't the point of an LLM to arrive at a set of weights (probabilities), and then produce the likely "next word"? Model weights may be adjusted as influenced by SO content, but the model doesn't retain that content, and therefore cannot re-publish it, so attribution shouldn't be necessary. Commented May 15, 2023 at 16:18
  • IANAL, but IMO if CC content is made available (e.g., published online), then a human or machine can read or observe it, learn from it, and that learning is not license-encumbered. Imagine saying humans could read SO, but not learn from or make use of any learning gained. And why, if a tool (web browser, hardware, LLM model, etc.) is used to help with the learning, does that alter anything? Commented May 15, 2023 at 16:30
  • 4
    Selling licensed and attributed user material to LLM companies who don't include attributions in their derived content implies that users have lost something and requires that something is received in exchange. This is a crossroads - if SO does not pay users for content, users will just create content on platforms that will share the wealth. Commented May 16, 2023 at 17:19
20

The same concern was brought up on the network-wide Meta site: June 2023 Data Dump is missing. Although I have answered there, I have also copied my answer here for maximum visibility:


Much has been written lately of the company’s decision to pause the distribution of the anonymized data dump that has historically been posted.

Our intention was never to stop posting the data dump permanently, only to pause it while we begin to collect more information on how it was being used and by whom - especially in light of the rise of LLMs and questions around how genAI models are handling attribution. However, it’s clear that many individual users (academics, researchers, etc) have an immediate need to access updated versions of the dumps. So we are re-enabling the automatic data dumps (and uploading the one that’s about a week overdue). We believe that this can happen by end of the day Friday. We will continue to work toward the creation of certain guardrails (for large AI/LLM companies) for both the dumps and the API, but again - we have no intention of restricting/charging community members or other responsible users of the dumps or the API from accessing them.

As part of this project, API users should be on the lookout for a very brief survey that will be coming out (it will be announced here and on stackapps.com) that asks about the features that you most use/would like to see in the API or data dumps moving forward so that we can plan for those, as well as collect general input.

In the meantime, the data dumps will be re-enabled by end of day Friday. We will communicate here when that has been completed or if there are any delays. We will also post here prior to making any future changes to the dumps or distribution of the dumps.

0
