63

Many words have been written around the company's commitment to the ongoing existence of the data dumps, the API, and the Stack Exchange Data Explorer (SEDE). Much of that text can be confusing or conflicting.

So in order to make clear the company's position, we're publishing this note.

The company is committed to the long-term (foreseeable future) survival of the data dumps, the API, and SEDE. We will continue to maintain them, and assure that community members have free access to them for legitimate usages that support the community, broadly construed (including for study in classrooms, for instance). We retain the right to place guardrails around them to ensure that companies constructing language models, etc, are charged for access, but community users, including the Charcoal anti-spam network and similar projects will continue to have free access.

13
  • 15
    Related: June 2023 Data Dump is missing
    – Machavity
    Commented Jul 26, 2023 at 20:09
  • 25
    "foreseeable future" is as vague as it could be... Commented Jul 27, 2023 at 5:15
  • 15
    Is taking all content and reuploading it elsewhere to restart the community there considered a "legitimate usage", I wonder. (It should be.) Commented Jul 27, 2023 at 9:21
  • 24
    "We retain the right to place guardrails around them..." I would advice to be very careful to not restrict any legitimate use by those "guardrails". I think it's hardly possible given the license of the content to meaningfully restrict data dumps to only some organizations but not everyone. I think now that I will need to keep my user name for longer. Also long-term and foreseeable future is not really very assuring. The September data dump will be the next test. Commented Jul 27, 2023 at 10:55
  • 18
    After reading again and taking past actions into account. This is more like the preparation of a departure of free access than any kind of reassurance. I don't trust it much. Commented Jul 27, 2023 at 10:59
  • 6
    Will you "assure" that there'll be that access, or "ensure" it? Commented Jul 27, 2023 at 11:58
  • 6
    Is placing these "guardrails" consistent with the CC-BY-SA licensing of SO content? I'm no open source lawyer, but that strikes me as contrary to the licensing itself.
    – vbnet3d
    Commented Jul 27, 2023 at 12:28
  • 6
    @vbnet3d See meta.stackexchange.com/a/390138/332043 - it's scummy, but likely not a violation of the license (obligatory I'm not a lawyer) Commented Jul 27, 2023 at 12:48
  • 4
    I don't believe this commitment. It doesn't convince me that you're being honest. The answers show that you're not being honest. SE's communication in more private channels (Discord) has signaled you're not willing to commit to this for much long. The wording in this statement doesn't bind you to the commitment. I cannot in good conscience support this, or acknowledge it. This receives a downvote from me. I have no reason to trust what you write until I believe you're being honest. This does not provoke trust. It's too vague, not enough commitment, and not transparent enough. Commented Jul 29, 2023 at 22:20
  • 1
    Since the API access for legitimate usages is covered, and the community users section merely says "Charcoal anti-spam network and similar projects" it might be helpful to give a better understanding of what is considered "legitimate." As one example, I'm given to understand that nearly all the user scripts, used by users, curators, mods, and even staff, rely on API access. Is it the intention to evaluate usages to decide if "this one" is legitimate, or more to presume API usage by users is "legitimate" access by definition?
    – Chindraba
    Commented Jul 30, 2023 at 20:37
  • 1
    @Res adding another one to AMtwo? ;) Commented Jul 31, 2023 at 12:56
  • @ShadowWizardStrikesBack Well, of course. What else ;) Commented Jul 31, 2023 at 13:01
  • 1
    "legitimate usages that support the community". You don't get to dictate what it's used for, you don't have any rights to it.
    – Basic
    Commented Nov 2, 2023 at 22:48

11 Answers

154
+600

(Note: the quoted sentence was edited after I posted my answer. I am leaving this answer unedited and undeleted, so that it is available for folks reading the revision history.)


From the original post:

Many words have been written around the company's commitment (or perceived lack thereof) to the ongoing existence of the data dumps, the API, and the Stack Exchange Data Explorer (SEDE).

I find the inclusion of the "or perceived lack thereof" parenthetical to be incorrect, and unnecessary. At best, it is an unnecessary "dig" at members of the community who questioned the missing data dump earlier this year.

As I mentioned in my answer on that question, I was the individual who pressed the metaphorical & literal button to disable the data dump upload job. As such, I take that dig as directed (at least partially) at me specifically.

Because I find that statement both factually incorrect and targeted at me specifically, I feel the need to address it.

Leading up to the disabling of the Stack Overflow Data Dump, I had a number of conversations with Prashanth, specifically around the Data Dump, the API, and SEDE. I have specific firsthand knowledge of the decision to disable the Data Dump, as well as some of the research that went into that decision.

I believe that the implication made by Philippe, that the Data Dump was never going to be disabled long term, is both untrue, and damaging to my reputation as a data professional, as a dutiful employee of Stack Overflow, and as a member of this online community.

The questions I fielded from Prashanth included:

  • Can we pull down the existing data dump from the Archive?
  • Can we limit who can access the Archive?
  • How do we limit what data people can download from SEDE?

None of those are direct quotations, mind you. They are simply based on my memory of the calls we had discussing the topic.

March 28, 2023

On the morning of March 28, Prashanth contacted me via Slack DM to ask me to disable the data dump, and mark it as not to be re-enabled without his specific approval.

During that conversation Prashanth said (this is a direct quote, with my emphasis adding italics): "how long does it take to re-establish the link [to Archive.org] if we want to open it back up?"

As part of my response, I said (also a direct quote): "But if we don't upload on schedule, we're likely to have someone notice and ask about it on Meta. So we need to be prepared to respond--or better proactively explain it on a Meta post of our own. The Community Team probably just needs to be in the loop to not be caught off guard with the customer service side of it."

I was assured that the good folks on the Community Team would be informed and given the chance to communicate, after which point I disabled the data dump upload job.

...IF WE WANT TO OPEN IT BACK UP

I don't think it's necessary for me to say any more. Folks who questioned the company's commitment to the Data Dump were not only justified to ask questions, but they were correct.

The company was NOT fully committed to continuing the Data Dumps, nor were they committed to involving the community in that decision, nor was the promise to me fulfilled to work with the CM team to ensure things were properly communicated in advance.

I am thankful that the company has reversed the plan to disable the Data Dump, and that they are now committed to ensuring they continue going forward. But it is not true to imply that commitment has been unwavering, or that people were incorrect to doubt them.

10
  • 73
    Reminds me of Shog9's tweet where he came out as the one who pressed the button to remove Monica's moderator status. He was falsely told that it was "extremely urgent" and that "details would be forthcoming". Jon Ericson subsequently wrote in his blog that he knew what was actually involved and would have refused to press the button had he been instructed (he happened to be out of office when the order came down). Moral of the story: don't press any buttons on the basis of mere assurances. I hope other employees see this and know what to do. Commented Jul 26, 2023 at 21:54
  • 18
    Per your profile "I'm trying to save the world one database at a time."... Thank you for your service!
    – curious
    Commented Jul 26, 2023 at 22:05
  • 20
    @Sonic -- my ability to provide a direct quote on the topic speaks to my expectations.
    – AMtwo
    Commented Jul 26, 2023 at 22:38
  • 30
    The last few years have shown, in bright glaring detail, that Prashanth is serving the desires of shareholders, to the detriment of employees AND the community of SO. His failure to inform the community team and have them take preemptive action on this point was a tremendous failure in leadership. And what bothers me is that this is a pattern that occurs multiple times a year now.
    – vbnet3d
    Commented Jul 27, 2023 at 12:34
  • 3
    Exactly how do you think they are "now committing"? The wording of the post seems to indicate that they are still committed to what they were committed to before; to use John's words, "how we could cut off access to some of these things, either fully, or to require payment". The post keeps repeating an implicit concept: "assure that community members have free access to them". Everyone else they want to gate out somehow. So there is no commitment anywhere, just the realization that if members can read the posts, and the posts are free to copy, then the members can get a copy of the content. Commented Jul 27, 2023 at 15:10
  • 2
    (continue). Basically, it is useless to try to remove access to the data to the ones who have to be able to read said data in order to contribute and polish and / or provide more data. But this doesn't mean they aren't planning to gate off everyone else. The fact that every time they have to specify that "community users" should still have access is a clear sign that they see the appurtenance to the community as the only discriminator as to who should get the data/API. Commented Jul 27, 2023 at 15:30
  • 14
    @SPArcheon Since the Data Dump is explicitly and directly licensed under CC BY-SA in the site TOS, if the Dump is gated somehow, there's no reason the community wouldn't be able to re-post it elsewhere without gating (ex The Internet Archive) so long as the license terms are preserved. Unless they change the license, there's no effective way for them to gate it. If they continue to produce it under CC BY-SA, then it will continue to be freely available somewhere.
    – AMtwo
    Commented Jul 27, 2023 at 15:59
  • 1
    @AMtwo correct, see my answer too. As I said I read the notice as "We will continue the data dumps, the API, and SEDE for 6-8 something. We are looking for a way to require login to access the data so everyone else is locked out, but we couldn't (yet) figure out a legal trap that wouldn't allow a company to just make an account or some user to just repost the dump elsewhere because of that meddling Creative Commons license." Commented Jul 27, 2023 at 16:03
  • 2
    @AMtwo no need for this, don't worry. I just wanted to point out that despite the technical impossibility of negating the ability to repost the content, the wording they use seems to indicate they haven't given up on that. I suspect they are planning to create a dump that mixes the CC community data with some copyrighted data in a way that makes it hard to separate them, then give a "free license" to said copyrighted data to the site users. Commented Jul 27, 2023 at 16:57
  • 6
    User content is licensed to the world under CC BY-SA, but licensed to the company under a perpetual, worldwide, irrevocable, unlimited license. The company can remix however they want (including new formats), and be well within THEIR rights. But it remains everyone else's rights to use the original post content and the Data Dump itself under the terms of CC BY-SA.
    – AMtwo
    Commented Jul 27, 2023 at 17:02
54

I echo starball's statement, but I want to elaborate a bit more than a comment would allow.

The statements of "long-term" and "foreseeable future" are incredibly vague and often inconsistent. For me, long-term is measured in years and foreseeable future is weeks and maybe months. That's not an insignificant difference.

When added to the comment in Discord where these mean until "the people involved are no longer with the company", that could be days. Consider a recent history of layoffs at the company. We don't know who "the people" are and there are no reasons to believe they won't get a better job or be laid off at any point in the future.

We deserve actual long-term commitments. And if - or, rather, when - situations arise where something about the availability of the data (via the dump, API, and SEDE) changes, it needs to be done as a discussion around the problems, potential solutions, and rationale rather than silently making changes until it is randomly discovered.

4
  • 10
    Many commitments are planned to be linked to the mod agreement; but these ones should be commitments to contributors, not just mods.
    – wizzwizz4
    Commented Jul 26, 2023 at 20:43
  • 5
    I agree, but I also find myself wondering if it would really feel more genuine if they said "forever" instead of "for the foreseeable future". I'm unsure how much the community is even willing to take the words at face value at this point; personally, I feel like the commitment itself over the next x years matters more than the wording of the announcement, and I'm not sure what a more binding arrangement would even look like.
    – zcoop98
    Commented Jul 26, 2023 at 22:29
  • 13
    I absolutely want the data dump to exist until the end of the site, whenever that day comes; it's a good-faith arrangement with the community that honors the content license and founding principles in a tangible way. But from the beginning the promise was worded as "ideally permanently" as far as I can tell; the difference is and always will be future leadership. I'm not sure what can be promised at this point to give us any more assurance.
    – zcoop98
    Commented Jul 26, 2023 at 22:31
  • 5
    @zcoop98 The "ideally permanently" phrasing is good. The important thing is that when changes do need to be made, it's clearly communicated in advance to the community, especially active users of these data tools, and is something open for discussion and not just dictated. Commented Jul 27, 2023 at 11:54
48

Many words have been written around the company's commitment (or perceived lack thereof) to the ongoing existence of the data dumps, the API, and the Stack Exchange Data Explorer (SEDE). Much of that text can be confusing or conflicting.

Here's your chance to correct them. As in, clear the air around what was said.

Offer a comprehensive, clear, detailed catalog of events that led to the removal of the SEDE data dumps and what the decisions were around it. Otherwise, there'll always be something said about it that can be seen as confusing or conflicting.

If you want more trust, you start by being honest. With just this blurb, I don't know how honest you're actually being. I mean, we can wait...watching the flag count grow isn't really impacting my day-to-day...

We will continue to maintain them, and assure that community members have free access to them for legitimate usages that support the community, broadly construed (including for study in classrooms, for instance).

The only illegitimate usages that you could legally contest would be if someone were to violate CC-by-SA. As in, so long as someone is sure to attribute where they got this content from, they could do whatever they wanted with it. That's kind of the whole...spirit...of Creative Commons. I may not be a lawyer but I don't have to be to know that.

One more time, for the folks a bit higher up the org chart: if I have the data, and I abide by CC-by-SA (versions 3 or 4 or both), I can do whatever I want with it. Is that clear?

We retain the right to place guardrails around them to ensure that companies constructing language models, etc, are charged for access, but community users, including the Charcoal anti-spam network and similar projects will continue to have free access.

Sure, but those guardrails are going to be flimsy. Again, as long as someone abides by CC-by-SA, you don't really have a dog in the fight. To complicate things, if some actor were found to be doing this, this would imply that you have resources to fight these as DMCA requests, which in the past, you have explicitly stated that the content owner could only do that. How do you plan to rectify this confusing legal position? We're in serious and shaky territory with your need to monetize this data somehow and your legal capacity to be able to enforce it. (Although if you pulled a Red Hat that would be quite hilarious in all of this.)

3
  • 7
    In relation to the last paragraph, I find it very suspicious that every time they feel the need to specify that "community users will continue to have free access". This implies the need to be a member of the site. Up to now the dump was made available outside the site, but this seems to betray an intent to gate access to the dump behind some form of login requirement. Commented Jul 27, 2023 at 15:42
  • 1
    "them" in "We retain the right to place guardrails around them" refers not just to the data dump but also to "the API, and SEDE" which can be easily throttled so as to make them useless. Commented Aug 7, 2023 at 13:25
  • 1
    @CorneliusRoemer: It's such a trifle to set up rate limits on APIs that this hardly ever needed to be called out. The thing is, the volume of data I want would never be satisfied by the APIs anyway.
    – Makoto
    Commented Aug 7, 2023 at 15:54
41

If you are really committed to maintaining the data dumps, free API and SEDE, you don't need to use the phrase "foreseeable future". Probably not even the phrase "long-term".

It just sounds like a good excuse for some point in time where you will undo that commitment and then you can just say "We never said this will not change".

The Stack Exchange network was founded on community work and under the assumption that this work will be forever available to the community, no matter what happens to the company. Reference: The Data Dumps in: Moderation Strike update: Data dumps, choosing representatives, GPT data, and where we’re holding

If the company ever tries to pull the data dumps again, the community response will be just the same as it was now. If the company is not willing to unconditionally commit to our ability to have access to our content, how can you expect people here to continue their participation by providing new content?


P.S. If someone at the company ever starts wondering again why participation (particularly answering) is in decline, you can freely point them to the data dumps incident and the vague wording in this post, which does not inspire any confidence.

And yes, trying to sell our content to the AI companies which violate the CC license and attribution is not exactly helping participation either. Especially when such a move was not discussed with the community first. Is SE [going to be] selling our content for AI model training? And what exactly does "reinvest back into our communities" mean?

30

I will avoid further commenting on the issue described by AMtwo and the implications in the first version of the post about the "perceived" lack of understanding from the community. Just know that the umpteenth subtle bashing hasn't gone unnoticed.

I will instead focus on the core of this post.

The company is committed to the long-term (foreseeable future) survival of the data dumps, the API, and SEDE. We will continue to maintain them, and assure that community members have free access to them for legitimate usages that support the community, broadly construed (including for study in classrooms, for instance). We retain the right to place guardrails around them to ensure that companies [cut] are charged for access, but community users, [cut] will continue to have free access.

I can't help but notice how you keep mentioning the requirement that "(only) community members have free access". The more I think about it, the more it seems evident that you are trying to find a way to require some form of login in order to access the data and the dump. All while being purposely vague on the timing so that if you can't devise some legal trap to stop companies from just becoming users you can still go back and block the access in about 6-8 time-units from now.

Basically, I can't help but read the message as:

We will continue the data dumps, the API, and SEDE for 6-8 something. We are looking for a way to require login to access the data so everyone else is locked out, but we couldn't (yet) figure out a legal trap that wouldn't allow a company to just make an account or some user to just repost the dump elsewhere because of that meddling Creative Commons license.

Because of this, I can't see this post as anything else than a vague half promise, and can't really upvote it as it stands.

I will also point out that, as Jon Ericson noted on his blog, you probably are already late to the party - Stack Exchange content has already been scraped. Not only does ChatGPT probably already include Stack Exchange datasets up to 2022, but multiple datasets that don't seem to derive from the dump are available online. Just go on Hugging Face or other similar sites and have a look.

All you can hope to achieve by restricting the API/data dump access is still getting scraped by good old web crawling.


PS: Once upon a time Joel & Jeff used to make fun of that other site, the one that required you to log-in to see the answers... Maybe you could buy them and become one big "Stack Exchange Experts", because I don't see another way to prevent accessing data that is freely available on the web.

1
  • 24
    Licensing under CC-BY-SA was one of the ways in which Jeff and Joel assured us that Stack Overflow would not become a walled garden. Requiring login to see the contents is against the spirit of the site's rules. Commented Jul 27, 2023 at 10:39
25

I asked in the Meta Discord about what "foreseeable future" means, to which Mithical replied:

As I understand it, until the people involved are no longer with the company.

I don't find that particularly satisfying. It's like "we've learned our lesson, but we won't make sure that that lesson is passed down".

What exactly does "foreseeable future" mean here?

3
  • 8
    Depending on one's personal outlook on the current leadership, or the definition of "the people involved" this is even more amorphous of a definition than I would have expected.
    – AMtwo
    Commented Jul 26, 2023 at 21:12
  • 1
    This is a really good question, particularly since we've seen a pattern of throwing good employees with great community rapport under the bus over the last few years if they didn't 100% toe the line, followed by heavy-handed policy changes.
    – vbnet3d
    Commented Jul 27, 2023 at 12:37
  • 1
    It is a word in the ancient tongue of the goblin tribe. It means "until it is useful to us" Commented Jul 27, 2023 at 14:26
23

I'm rather concerned by the 'two' narratives - and the disconnect between what was publicly released, and honestly the seemingly more credible version we've gotten from current and former staff.

We were told during the acquisition that its advantages were that SE would be in a more financially stable position and that it would retain its independence to an extent.

If it's true that the decision was initiated by Prosus - we have a situation where core values of Stack Exchange can be compromised, on the hope of potentially making a profit off said data. This has rarely actually resulted in profit, it seems, looking at the various rounds of downsizing, and the inability to invest in many long running community feature requests.

I hate to break it to y'all, but Google's scraping your data and they're big enough they can kill a site that doesn't comply. OpenAI works with Microsoft and Bing, and search engines scrape data. Most of the larger generative AI companies have no incentive to buy what they can just take. While there are lawsuits over theft of art and literature, it could go either way - and SE might not be able to outspend bigger companies even if legally in the right.

It feels like the company might have chosen to sell its values, and spend already overdrawn community capital in the hope of a payoff. There's no profit here, and there may be no guardrails in place that could lead to somehow magically having these firms pay for this data.

Trust is earned, as is respect. Things like this both don't engender trust, and don't engender or show respect.

Maybe it's time for the company to take a step back - and ponder what went right in the past. Treating folks who've spent years contributing to the community well is likely to add more value than, well, trying to pick the most convenient narrative to quieten down dissent. Considering the consequences of actions such as this and the attempted gen AI policy beforehand does more than fixing the damage after the fact; and if these consequences go unforeseen or, worse, the fallout is foreseen but ignored, I feel like there needs to be a fresh look at community strategy.

19

I don't think that anything the community has written was confusing or conflicting. We realized the June data dump was missing, we asked for it, and an external, very courageous source confirmed that the data dumps had been stopped. Then there came the official response that the data dumps have been switched on again and confirmation that the June data dump indeed ran through. So far, so clear. The conflict about non-continued data dumps seemed to have been averted.

The only unclear thing is some vague talk about mysterious "guard rails" that the company is thinking about implementing, whatever that really means. Unfortunately this post here doesn't clarify the details behind that either. Maybe it was a bit premature and should have been posted later when more details are available.

As a reassurance it comes across rather weak. I wanted to hear something like "we are doing all we can to enable the free availability of the data dumps and other content access pipelines for now and all future." What I heard instead was some half-hearted commitment, some kind of "yes, but with restrictions for maybe some time". There is not much positive effect coming from that. My trust hasn't increased. I will wait if the September data dumps arrive in the usual format. Either it will or it won't. That will be my test.

Just two comments more:

Building additional services around the existing data offers and charging for this service is fine. If that's what you aim for, please say so.

The data isn't yours. It's a gift from all content creators to everyone (including you) in the world. We want it to be used as much as possible and that probably also includes your competitors. The license and the data dumps were supposed to be the means of that. Sure, there may be legal battles about the legality of using such material for AI training coming, but guard rails of the content aren't our concern really. We don't need them. This feature actually wasn't broken and to me the conclusion is that the company tinkering with it has a disproportionately high risk of making it worse. I'm very much afraid of what bad things you might have in mind there with something I value highly and this post didn't reassure me in any way.

1
  • 10
    "but guard rails of the content aren't our concern really. We don't need them. This feature actually wasn't broken and to me" - this seems the shared sentiment many users. So, it seems clear that the only ones worried about our content being used to train LLM is Stack Exchange itself, or to be more specific the man behind the curtain: Prosus. The reason is evident: Prosus want to find a way to be the only vendor with the competitive advantage of a LLM trained on Stack Exchange data. Protecting the users interest was never part of the picture. Commented Jul 28, 2023 at 8:54
8

The company is committed to the long-term (foreseeable future) survival of the data dumps, the API, and SEDE.

Will the license of the data dump change?

16
  • 4
    It can't, though. Not legally, anyway. The data provided in the dumps is all CC-by-SA.
    – Makoto
    Commented Jul 27, 2023 at 17:51
  • 2
    There's some real nuance to the licensing. The user-generated content (ie, the markdown within the post) is licensed to the world under CC BY-SA. But it is licensed to the company with a perpetual, worldwide, irrevocable license. The company can release an anthology of that content, retain the CC BY-SA license on each individual post, but license the specific format of that anthology differently, if they so choose.
    – AMtwo
    Commented Jul 27, 2023 at 18:24
  • 1
    The Data Dump is one of those anthologies, and it is enriched with additional content beyond the markdown user contributions. PostIds and whatnot are included to make it easy for folks to piece it back together, link questions and answers, etc. They would be free to license that Data Dump anthology differently, acknowledging that the "chrome" was licensed differently, but everyone can still use the markdown post content under the original CC BY-SA format. HOWEVER the site TOS explicitly licenses the Data Dump anthology with CC BY-SA, as well. Covering those PostIds etc with CC BY-SA.
    – AMtwo
    Commented Jul 27, 2023 at 18:29
  • 4
    TL;DR; It is possible for the company to change the site TOS, and change the license on future versions of the Data Dump, or create alternative formats of the data dump that are licensed differently. I'm not a lawyer, and have no specialty in Intellectual property law, so I'll leave implementation details to the IP lawyers... but I do think this is a valid question... perhaps a question worthy of its own meta post?
    – AMtwo
    Commented Jul 27, 2023 at 18:31
  • @AMtwo thanks, yes, that's indeed my understanding. Commented Jul 27, 2023 at 18:33
  • 2
    For convenience – the relevant bit of the TOS is here.
    – V2Blast
    Commented Jul 31, 2023 at 18:49
  • 1
    @AMtwo I have explicitly looked at the dumps content, and all those ids wouldn't be copyrightable. The XML are just a plain mapping of tables. The column names (xml attributes) are barely original as well. The ids are machine generated, so not original at all. Non-tag-based Badge names might be copyrightable, it's the part most likely, and even that seems dubious.
    – Ángel
    Commented Aug 12, 2023 at 15:20
  • @Ángel The ID values themselves are nothing special. But as part of the file format, they allow you to stitch together Q&A, and also construct all sorts of URLs directing you at any given question/answer/comment. The IDs as part of those files & format are valuable IP, and it's valuable to have them released in the data dump. Without it, there is no document or public info that allows someone to connect CC BY-SA Questions with the CC BY-SA answers. It's critical to make the data useful; without it, you have a stack of questions, a stack of answers, with no links.
    – AMtwo
    Commented Aug 14, 2023 at 13:17
  • @Ángel Obviously you can't copyright a bunch of integers. But you can copyright data sets, including proprietary combinations of data elements.... by licensing the entire dump at CC BY-SA, it ensures that there is no ambiguity in the use of the public data set. Without it, the company could make public 77 million markdown files with random names, but no links between Qs and As.... and they would be fulfilling their part of the CC BY-SA license granted by the authors of posts, while also preventing others from properly "rehydrating" the data.
    – AMtwo
    Commented Aug 14, 2023 at 13:23
  • @Ángel And if they obfuscate the IDs out of the public-facing URLs, they could make it nearly impossible to figure out the linkages.
    – AMtwo
    Commented Aug 14, 2023 at 13:25
  • @AMtwo it is indeed good that the dumps are explicitly licensed as CC-BY-SA. And it is of course highly valuable to link questions and answers, to the point that they would not be fulfilling their commitment if providing them separately with no way to connect them (just as they wouldn't if they shuffled the words of the Q&A in alphabetical order). My point is that the ids themselves are not copyrightable, and not proprietary information either. And in fact, the links were created by the users, not the company.
    – Ángel
    Commented Aug 14, 2023 at 15:31
  • 1
    As for holding a copyright on the data set, copyright on collections is trickier (and varies more across jurisdictions), but I don't think it would apply to a full dump. If you published an excerpt of "The 100 best Q&A from StackOverflow, as curated by @AMtwo", you could hold copyright on that work, even though it is a compendium of other works, but here this is a dumb dump of all data. See also Feist Publications, Inc., v. Rural Telephone Service Co.
    – Ángel
    Commented Aug 14, 2023 at 15:37
  • The data dump isn't a phone book.
    – AMtwo
    Commented Aug 17, 2023 at 13:17
  • @Ángel This is the kind of dumb "arguing for the sake of arguing" that is super unwelcoming to new folks, and makes seasoned users stop coming back. You can argue if the Data Dump is a phone book if you want, but the fact that the company explicitly licenses the Data Dump (via the TOS) is proof that the company's lawyers think it matters, and so does Franck.
    – AMtwo
    Commented Aug 17, 2023 at 13:22
  • 1
    @AMtwo I was not trying to argue for the sake of arguing. That case was THE landmark decision by the US Supreme Court establishing that copyright requires creativity. It's not my fault that it was about phone books. It is cited for all kinds of cases unrelated to phone numbers (quick example: for copyrightability of AI models), although I should note that in this case the automatically generated ids for the different rows are much more akin to phone numbers assigned to customers than most other items for which this case is used.
    – Ángel
    Commented Aug 19, 2023 at 23:52
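
To make the point in the comments above about IDs concrete, here is a minimal sketch of how answers can be re-linked to their questions from a dump-style XML file. It assumes the posts file uses `row` elements with `Id`, `PostTypeId` (1 for a question, 2 for an answer) and `ParentId` attributes; the sample rows and the `link_answers` helper are made up for illustration and are not taken from a real dump.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, tiny stand-in for a posts file; a real dump has many more
# rows and attributes. The attribute names are an assumption about the schema.
SAMPLE = """<posts>
  <row Id="1" PostTypeId="1" Title="How do I parse XML?" />
  <row Id="2" PostTypeId="2" ParentId="1" Body="Use a parser." />
  <row Id="3" PostTypeId="1" Title="What is CC BY-SA?" />
  <row Id="4" PostTypeId="2" ParentId="1" Body="Try ElementTree." />
</posts>"""


def link_answers(xml_text):
    """Group answer ids under the question id they point to via ParentId."""
    root = ET.fromstring(xml_text)
    questions = {}
    answers = defaultdict(list)
    for row in root.iter("row"):
        if row.get("PostTypeId") == "1":      # question row
            questions[row.get("Id")] = row.get("Title")
        elif row.get("PostTypeId") == "2":    # answer row
            answers[row.get("ParentId")].append(row.get("Id"))
    # Questions without answers get an empty list.
    return {
        qid: {"title": title, "answer_ids": answers.get(qid, [])}
        for qid, title in questions.items()
    }


linked = link_answers(SAMPLE)
```

If the `ParentId` values were stripped or scrambled, the same rows would still carry all the post text, but this grouping step would have nothing to join on, which is the "stack of questions, stack of answers" situation described above.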
4

We retain the right to place guardrails around them to ensure that companies constructing language models, etc, are charged for access...

This use is allowed at no cost by CC BY-SA (although they do need to follow the license). Will SE try to interfere with legitimate, authorized usage of user content?

12
  • 23
    I think this might be a misunderstanding of CC BY-SA. I don't believe CC BY-SA requires the Stack Exchange company to make all content available to everyone at no charge.
    – D.W.
    Commented Jul 26, 2023 at 20:11
  • 5
    @D.W. it doesn't require them to, but any attempt to make these companies pay could be easily circumvented by someone downloading the data dump and republishing it. This would be within that person's legal right. Will SE try to stop that?
    – Someone
    Commented Jul 26, 2023 at 20:12
  • 7
    That seems like a different question to me. I would suggest that you ask about that specific situation. Right now the answer seems to contain faulty premises or misleading information.
    – D.W.
    Commented Jul 26, 2023 at 20:13
  • 2
    I don't believe anyone training an AI model has attempted to argue they are using the content under the terms of CC-BY-SA, rather they think it's all fair use so the license doesn't matter. Commented Jul 26, 2023 at 20:13
  • 10
    What companies can use for training data, what requirements there are for the use and attribution, and all the other legal stuff around any of that is only going to be settled by court cases, or legislative acts (probably then interpreted by more court cases). What SE wants, in terms of guard rails or payments, and what the users want, think or believe, is likely to end up meaning nothing, or everything, once the courts have their final say. Probably in another 6-8.
    – Chindraba
    Commented Jul 26, 2023 at 20:18
  • @D.W. In a sense you're right, but practically speaking that is not correct. They can remove content fully at their discretion, so technically in that sense there is no requirement for them to make all contributed content available for free.
    – BryKKan
    Commented Aug 21, 2023 at 4:04
  • @D.W. However, if they share the data with anyone, they do have to share it with everyone on request. They do not have to release it in identical formats, nor specifically share any unique sub-collections. It is sufficient that they publish the full raw dump, in some readily parseable format. A full dump is intrinsically a superset of such private collections, and so they could treat them proprietarily. But they do have to make the data available, for free, if they use it anywhere. So if they don't publish full dumps, then they would be required to offer the other collections for free.
    – BryKKan
    Commented Aug 21, 2023 at 4:14
  • @BryKKan No; they aren't obligated to share it with anyone. The fact that they plan to limit who they give the DB to directly is OK; I was just making sure they don't plan to try to stop those who do have it from passing it on to others. If they choose to stop OpenAI from downloading it, that's fine; if they let me download it but then try to stop me from sharing it with OpenAI if I want to, that's not OK. Because doing the "OK" part but not the "not OK" part doesn't accomplish much for them, these plans call into question whether they might consider doing the "not OK" part.
    – Someone
    Commented Aug 21, 2023 at 4:28
  • @Someone Have you actually read the CC-BY-SA license? Because there is no practical sense in which that is true. You're entirely right about the restrictions on redistribution, and I explicitly noted that in the previous answer I linked to. I think I see where you're coming from in fear of "muddying the waters", but to my mind these are steps along the same slope. We need never let it get "that far". IFF they use contributions (in any way that gets published), they are obligated by the "SA" part of the license to share the whole thing that results. Only a larger "full dump" substitutes
    – BryKKan
    Commented Aug 21, 2023 at 4:47
  • @someone (continued) This answer gives a detailed explanation of the applicable license (CC-BY-SA 4.0), with references and verbatim quotes of the license text. Most of what you're thinking of would fall under the definition of an "adapted material", so you can skip to that if you like. Also note that this is specific to the "SA" series of CC licenses. Not all Creative Commons licenses require such "resharing", but the one which applies to SE contributions does.
    – BryKKan
    Commented Aug 21, 2023 at 4:55
  • @BryKKan if they try to get data dump downloaders to agree not to share the dumps, or if they apply DRM, then yes, that is illegal. If they only limit who can directly access them, that is legal. Do you believe that private beta sites are violating this because only certain users can access them?
    – Someone
    Commented Aug 21, 2023 at 16:34
2

This is a problem:

We retain the right to place guardrails around them

You never really had that right, and asserting that you "retain" it suggests SE fails to grasp this key point. You can modulate access to the API - albeit at risk of community revolt if done unreasonably. You can't limit the distribution of the dumps. Not now, not ever.

This claim "sets conditions" for SE to attempt some form of access control in the future, and by your wording, the quarterly archive dumps are claimed to be fully within the scope of your discretion. I explained at length in this answer, but suffice to say they aren't. (Unless you give truly unlimited access to the same data by other means - which is probably a worse value proposition for SE in terms of bandwidth, resources, and future product development.)

It seems like you're really working hard to reestablish community trust. What we need is a specific re-commitment to maintaining the public data dumps, without any access control, ad infinitum. Ideally this should come from Mr. Chandrasekar himself, accompanied by an apology, as we are all well aware of his role in disabling them to begin with.

Your question (Interim) Policy on AI-content detection reports was a welcome surprise, and I am willing to extend some grace here. However, as a statement of policy, this response simply doesn't cut it. At a minimum, you need to revise this question to exclude the data dumps from your "reservations", and explicitly state that you will maintain them perpetually, in proportion with the perpetual use rights SE seeks for contributions.

You have everything to gain by such commitment, and - because of the practical and legal requirements of the CC license - absolutely nothing "real" to lose. As an aside, it's also something that the company already committed to in the past, so walking it back, even rhetorically, is a particularly bad look. Because your immediate actions are trending in the right direction, I'm willing to (provisionally) accept the strike agreement. But I'm not going to have much interest in making future contributions as long as this particular "threat" is dangling.

For a full resolution, we need SE to recognize at an organizational level that making the data dumps available publicly, entirely unencumbered by access control or financial expectations, is the "price" of our data. It's not something which you can "work around" or pay lip service to. It's a legal and moral obligation. It's also a fantastic deal for SE. Can you show us that you understand this?
