-1028

Update May 14, 2024

I know there have been a lot of questions and comments around attribution. I recently answered a question related to both and am linking to it here for visibility.


Today, we announced an exciting new partnership with OpenAI. We’re pleased that OpenAI shares our commitment to socially responsible AI.

You can find more details about this in the press release.

We share updates on partnerships here on Meta because we believe in providing a space for you to ask questions about them. Partnerships are another revenue stream for us, similar to Stack Overflow for Teams or Advertising, which allows us to fund initiatives to benefit the community.

As work begins in the future, we will have more to share about how integrations with our partners will work.

In terms of project phases, we’re at the very beginning - there’s a great deal of discovery work and actual coding to do, and during that time, anything specific promised about how this will work here could change, so we want to be sure that we aren’t inadvertently misleading folks in these early announcements.

59
  • 301
    Please consider including more information directly in this post instead of just linking to the blog post. It would also be worth including here preemptive responses to the most obvious concerns that users are going to have in response to this announcement.
    – Mithical
    Commented May 6 at 13:05
  • 115
    This post is on Stack Exchange Meta, but it sounds like the partnership is only for Stack Overflow and not the rest of the network. Can you clarify?
    – Adám
    Commented May 6 at 13:07
  • 120
    I second @Mithical. I'd also recommend keeping in mind that the audience of the blog and the audience of Meta SE are very different. The blog, based on the past posts, tends to be the general public, including various stakeholders. It's a marketing tool. Meta SE should be focused on the community. It may take some more work, but I'd strongly recommend drafting two versions of these announcements and making sure the Meta SE version doesn't have business/marketing speak, but targets our concerns as active participants in the communities. Commented May 6 at 13:08
  • 19
    @Cerbrus - no, not at all. The two are unrelated, and the situation on SO was never a factor in the timing of this release.
    – Philippe StaffMod
    Commented May 6 at 13:11
  • 197
    Do the people who actually contributed the answers get anything out of this deal (attribution? access to the trained model? part of the profit?) Commented May 6 at 14:02
  • 37
    @Rosie That's the kind of content that should be in the body of the post on Meta SE. Information about what phase the work is in, when we should expect more information (days, weeks, months), what the next things that the community will see are. This is the important stuff. Also an explicit statement that someone or some people are watching and will be addressing questions (and then, of course, following through by updating the post with answers to questions). I think if you do these things, you may alleviate some down votes that result in these posts being hidden from the front page. Commented May 6 at 15:05
  • 288
    Trusting OpenAI to be socially responsible, is like trusting Facebook to respect your privacy. Commented May 6 at 18:23
  • 33
    Folks, let's not close this. While more details in the post would have been nice, it's not productive to close staff announcements, especially featured ones.
    – cocomac
    Commented May 6 at 20:24
  • 73
    Guess my contributions to the community have reached an end. Godspeed, stack overflow. Commented May 6 at 21:31
  • 54
    I mean... I knew all my answers were being stolen before, but having it cemented like this is repugnant. My contributions will be made elsewhere from now on. Commented May 6 at 22:04
  • 20
    why did you not partner with open source foundations instead of closedai
    – Rainb
    Commented May 7 at 7:53
  • 39
    @Rainb Because money. "Partnerships are another revenue stream for us," Which is understandable, but unfortunately they partner with companies whose goals are not in line with the needs and desires of the community. Commented May 7 at 7:59
  • 64
    I hate this. I'm just going to delete/deface my answers one by one. I don't care if this is against your silly policies, because as this announcement shows, your policies can change at a whim without prior consultation of your stakeholders. You don't care about your users, I don't care about you. Commented May 7 at 14:11
  • 36
    @henning Please don't do that; you'd only be making more work for the folks around you that also care about the site(s) you contribute to. The company did a spectacularly horrendous job of announcing this, whatever it's supposed to be, but we truly have zero information at all about what this will practically mean. I think we should get a little bit more of an idea of concrete impact before anyone considers nuclear options.
    – zcoop98
    Commented May 7 at 15:42

30 Answers

559

OpenAI will also surface validated technical knowledge from Stack Overflow directly into ChatGPT, giving users easy access to trusted, attributed, accurate, and highly technical knowledge and code backed by the millions of developers that have contributed to the Stack Overflow platform for 15 years.

One of the major selling points of "OverflowAI" is (was?) attribution; most LLMs don't provide any sort of attribution for where they're getting their information from. Directly feeding Stack Exchange content into ChatGPT or other OpenAI products sounds, to me, like that attribution is not going to be retained. It says "attributed", but doesn't give any details: Will it cite Stack Overflow in general? The specific post? The specific user? Will this abide by the CC BY-SA license that posts are under?

Will the users who actually spent time writing and contributing their content be attributed in any way when their content is fed into OpenAI's products?

32
  • 201
    LLMs, by their very nature, don't have a concept of "source". Attribution is pretty much impossible. Attribution only really works if you use language models as "search engine". The moment you start generating output, the source is lost.
    – Cerbrus
    Commented May 6 at 13:15
  • 37
    @Cerbrus Enterprise Copilot would like a word with you. LLMs do what they are trained for, if you train them on attributing, they will attribute.
    – tomdemuyt
    Commented May 6 at 13:46
  • 114
    @tomdemuyt No. GenAIs are not capable of citing stuff. Even if it did, there's no guarantee that the source has anything to do with the topic in question, or that it says the same thing as the generated content. Citing stuff is trivial if you don't have to care whether the citation is relevant to the content, or whether it says the same as you. (source) Commented May 6 at 13:51
  • 109
    There are plenty of cases where genAI cites stuff incorrectly, that says something different, or citations that simply do not exist at all. Guaranteeing citations are included is easy, but guaranteeing correctness is an unsolved problem Commented May 6 at 13:55
  • 20
    @Zoe If you ask ChatGPT to cite it will provide random citations. That's different from actually training a model to cite (e.g. use supervised finetuning on citations with human raters checking whether sources match, which would also allow you to verify how accurately a model cites). This is something OpenAI could do, it just doesn't.
    – Erik A
    Commented May 6 at 13:58
  • 22
    @ErikA I've seen people use models that generate citations, and they still manage to incorrectly cite sources. Those models are certainly better than ChatGPT, but they still screw up correctness of the content they cite regularly. Overflow AI had this very same problem while it was still running. Commented May 6 at 14:01
  • 156
    Guys, you seem to misunderstand how LLMs generate text. They don't copy text verbatim. They generate it word by word. When generating, LLMs don't even have any knowledge about their original training data. They just have the model that was built from the training data. It is literally impossible for them to know where the words came from, unless you actually built a separate system just for attribution.
    – Cerbrus
    Commented May 6 at 14:01
  • 17
    It is more likely they will use some kind of RAG i.e. search engine + summarization (rather than directly training on new data). In that case, attribution is possible (with limited to their retrieved contents, not attribution of original training data).
    – pcpthm
    Commented May 6 at 17:16
  • 67
    What could possibly go wrong? Dear Stack Overflow denizens, thanks for helping train OpenAI's billion-dollar LLMs. Seems that many have been drinking the AI koolaid or mixing psychedelics into their happy tea. So much for being part of a "community", seems that was just happy talk for "being exploited to generate LLM training data..." The corrupting influence of the profit-motive is never far away. Commented May 7 at 9:39
  • 4
    @Will you're nitpicking my terminology. My point is that while it is true that an LLM might output an exact replica of some source text, it's just as likely to output something completely different for the exact same input, if you try it again. It could change terminology, or even hallucinate good-looking nonsense. But no, it doesn't "recite" training data. It doesn't work with large chunks of text from source to output. It works word-by-word.
    – Cerbrus
    Commented May 8 at 19:11
  • 12
    Do I get an upvote for my answers every time that an AI uses it? This is what we encourage from the (human) community, and I'd accept that if the AIs played along too. Otherwise, this becomes like Reddit, and all the content contributed by users is just feeding a nasty business model.
    – mike
    Commented May 9 at 1:27
  • 9
    @Will no, LLMs, by definition, do not "recite". They could luck into an exact replica text, but there's absolutely no form of "intent" to duplicate exact text. But again, you're nitpicking terminology. My point is that LLMs are inherently incapable of knowing where their output comes from. Your terminology remarks don't change any of that.
    – Cerbrus
    Commented May 9 at 4:16
  • 6
    Well now I understand the sudden spate of post vandalism
    – Nick
    Commented May 10 at 6:52
  • 25
    Those who say it's impossible for LMs to cite their sources are correct. If a model scans a few million documents, counts frequencies, and learns that the probability of "major" following "one of the" is 0.0023 (an oversimplification), what do you cite for that? The real ethical concern with OpenAI is that they build on a bunch of research that's freely available, and a bunch of data that's freely available, and then they post a press release instead of a paper detailing their own contributions. They stand on the shoulders of giants and refuse to tell the giants what they see.
    – Ray
    Commented May 10 at 14:09
  • 4
    I think the nature of an LLM's understanding of its sources is best summed up by a name for them that I've seen on some of the sites I frequent: "spicy autocomplete". (Or, if you are looking at it from the perspective of the rationale for banning GenAI here, "plausible gibberish generators".) Funny enough, I think I saw both of those on OSNews, though I'm only certain about that for the latter one.
    – ssokolow
    Commented May 10 at 19:53
444

I'm altering the deal.

Pray I don't alter it any further.

― Darth Vader on Bespin https://www.youtube.com/watch?v=jsW9MlYu31g

Most replies focus on certain process and legal aspects. What I want to bring up is the emotional aspects. Irrespective of whether the license agreement allows this to be done with our content or not, the whole thing is triggering an emotional response with me, and it's not a good one.

I feel violated, cheated upon, betrayed, and exploited.

I believe a lot of others feel the same. If you feel the same, upvote this.

Attribution is one aspect of it, but not the only aspect. It is the entire way how the community works and how we, the humans, created content on this platform to make the world a little bit better.

This whole thing feels ethically wrong, and emotionally damaging. Humans are meant to exploit machines, not the other way round. Exploiting us, who helped make the world a little bit better, in this way, is a turning point. It makes the world for us worse instead of better.

People already start contemplating actions against this, like purging their questions, answers, comments, and even deleting their accounts. I was contemplating deleting my account over this. I still am.

Congratulations, Stack Exchange.

Update: I know of a user whose account has been suspended for 7 days because they started deleting their stuff.

I know that deleting my account won't delete my content. But it will at least prevent me from giving more content in the future.

Consider this, where do you think Stack Overflow is in its journey? Has most of its content already been created? Or will most of its content still be created in the future?

Update 2: I know that there is an argument to be made that we've basically agreed to this right from the start. No, we haven't. We have agreed to whatever type of processing a reasonable person could expect to happen from the license conditions at the time of "signing" the agreement with the technology available at the time. Training LLMs and such was neither available nor reasonably predictable at the time of "signing" the agreement.

As someone rightly pointed out in the comments, this is like Darth Vader on Bespin saying "I'm altering the deal. Pray I don't alter it any further."

16
  • 15
    Removing helpful contributions once they have been submitted is a form of vandalism. If you don't agree with the license covering the contribution you are submitting, simply don't submit it in the first place. So the user you speak of was suspended for vandalism, not for editing the helpful content out of their contribution
    – Ramhound
    Commented May 9 at 6:21
  • 17
    "I know of a user whose account has been suspended for 7 days because they started deleting their stuff." for reference, the suspension has no relation to the OpenAI deal. It's standard practice and has been enacted every time an account has tried to mass remove their content. There is monitoring in place for such occasions for many years now. Only the deletion in this case is related to the OpenAI deal. The presentation here may make somebody believe that the suspension is somehow "abnormal" rather than a standard.
    – VLAZ
    Commented May 9 at 6:26
  • 42
    @ramhound I submitted those contributions well before this "AI" LLM bubble was ever a thing. If they decide to retroactively alter the terms of the agreement, they don't get to be upset when I decide to retroactively withdraw my agreement to it. Commented May 9 at 6:58
  • 8
    @Shadur-don't-feed-the-AI - Your agreement cannot be revoked; you might take a moment to read the agreement you are trying to unilaterally revoke. It doesn't change the fact that if you vandalize contributions, the community will just reverse your vandalism. Based on the level of contributions you have made, I assume you know what that agreement actually is, so that advice isn't necessarily specific to your situation
    – Ramhound
    Commented May 9 at 7:06
  • 8
    "I feel violated, cheated upon, betrayed, and exploited." Violated and cheated upon, not really; but betrayed and exploited, yes. Although I kind of expected that, at the very least since the company was sold by its initial founders in 2019. Commented May 10 at 7:41
  • 16
    @Danubian That was never the point though; from the beginning Stack was the company in charge of keeping the lights on, but contributing here was always about building a library that could benefit others first (as emblazoned in the opening of every site tour), the company second. I believe that hasn't been altered by this partnership, though I completely understand why some feel differently, or are angry regardless. But nothing fundamental has changed about the nature of your "free work"; if anyone was contributing to benefit the company, they were, and are, doing it for the wrong reasons.
    – zcoop98
    Commented May 10 at 15:19
  • 27
    One way to look at it is that corporations are never your friend. They love talking about building communities and ecosystems, but eventually they need to monetize user-generated content and change licensing to make your content their property. When their policies and promises change 180° overnight, all you get is "we are sorry you feel that way", "our hopes and prayers", and "that was a deliberate business decision we had to make with a heavy heart". And then they laugh all the way to the bank. Commented May 11 at 3:21
  • 13
    @DanubianSailor We contributed free work to the company because the content is under a CC BY-SA license. It is fine to make money off our content as long as they adhere to the license. This forbids selling the content to OpenAI, though, since they do not provide attribution or release their derivative works under a compatible license.
    – endolith
    Commented May 11 at 18:17
  • 6
    Personally I don't care about the LLM training (I'm rather pro-AI, honestly) but the way that this change in policy has been handled was definitely very poor. The feeling of exploitation is why I logged out for the last time in Jan 2020 and stopped producing content the previous October. Commented May 11 at 21:10
  • 12
    Given that OpenAI gets all the credit for all my past, present and future contributions and I get nothing, I will do this. I will only post answers generated by OpenAI. Two years down the road, it will cannibalize itself. Commented May 13 at 0:10
  • I see a lot of people saying that deleting your account won't result in content deletion but it sounds like it's violating some internet laws in the EU? Which is not a good thing Commented May 13 at 13:38
  • 10
    @ChrisZeThird GDPR (the "internet law in the EU", if you will) deals with personal data. Your name, your address, your email, etc. Not just anything you've happened to upload ever. GDPR does require for profiles to be removable along with other personal information related to them. In effect, after the account is deleted, it should not be traceable to you. That is perfectly within the scope of GDPR. But it never requires all data to be deleted. If you have any other applicable laws that might influence this, feel free to share them, though. But GDPR is not a magic "erase" button.
    – VLAZ
    Commented May 13 at 13:47
  • 8
    It's exploitative, almost guaranteed to violate the CC BY-SA license, and I've already caught ChatGPT spitting my own open source code back at me, also directly violating the MIT license I publish it all under. Just waiting for the lawsuit; I'd be a member of the injured class. Eagerly awaiting case law. Deletion of your account merely disassociates your content from your name. That also smells like an attribution clause violation.
    – GothAlice
    Commented May 15 at 9:19
  • 2
    Deleting content does not retract your licensing it in CC BY-SA. Then, feeding it to LLM pretty much violates that also.
    – Lily White
    Commented May 16 at 12:05
  • 4
    When a product is free, you become the product. Happened with GitHub, is happening now with SO, will happen again unless you, programmers, part with that childishly naive trust in corporate-backed "communities". Otherwise you'll just keep working as mechanical turks who train commercial AI for free, and then become obsolete in AI-dominated IT industry you've built with your own hands and brains.
    – sunny moon
    Commented May 17 at 18:19
317

I know - viscerally - that y'all are shorthanded. I also know that things sometimes get dropped in your laps unexpectedly with little time to prepare, so I'm hoping that this was merely an oversight... but I'd have to echo Mithical's point, particularly as link-only answers are a core concern on the platform.

Y'all have, unfortunately, posted a link-only question, which relies on external links for all of the important content. This means you're expecting people to visit two different links and read long-ish posts that aren't written for this audience to get all of the information they need to understand and respond to this post.

I think we're all well aware that AI is an extremely contentious subject on the network and the core community have very strong concerns about it. I appreciate that you link to the February blog post about socially responsible AI but, again, summarizing the core ideas from that post in this context rather than linking to it alone would have given people much more community-centric context about the content in those two sources. Particularly since the MSE announcement of that post was similarly poorly received and few questions have been answered.

Right now I'm seeing a lot of concern about whether this means that y'all are more likely to override the rules about posting AI-generated content on sites or changing your thoughts about attribution. While these points are addressed in the Commitment blog post, I'm going to be honest - that post is written in a very indirect and unclear way, so I'd really appreciate a synopsis in simpler words. For example:

In addition, community answers should be derived from quality, accurate, sourced data.

On the surface, this is a very nice sentiment - one that we can all get behind. Unfortunately, it doesn't actually say that AI-generated answers can't be posted by community members, reviewed or not - if people assume AI generated content is "quality, accurate, sourced data" (it's not), then they will think it's fine. It's also unclear what "community answers" means - two very different interpretations are:

  • Answers written by community members
  • Answers on a Stack Exchange community site

The former leaves huge openings for AI-generated answers, since they're not created by community members and the prior (much clearer) statement about questions uses "community" to mean "community member" as opposed to "assisted and curated by AI", further leading to the former interpretation.

The full statement about questions is:

questions on Stack Overflow (whether by a community member or assisted and curated by AI) are posted only after human review.

Is it possible to say something similarly definite about answers, indicating that the company commits to ensuring that all answers have at minimum human review for accuracy, while sites' communities retain oversight to further restrict using these tools to answer questions?

It'd be really great, now that the post is out and you are getting feedback of confusion and questions from the community, if you could edit a community-focused explanation/overview of the two posts into the question, so that a reader doesn't have to read the full press releases to understand what's happening.

To be clear, this isn't about specifics - it's about the most general big-picture things. You're building on a foundation that's not clear to the users, even if it's obvious to you. Perhaps you find what's been written in these posts to be crystal clear. You know what the company intends and that the interest in the community connection is genuine. The community doesn't know that.

You can dismiss my statements as me looking for the negative interpretation in everything. I've been there. I've struggled with people trying too hard to read into and over-analyze every word I've written and finding cracks when I'm trying to be open and honest. I know it's difficult. But that's why I spent time clarifying things rather than just brushing those concerns aside.

If you can't clearly and succinctly say things like, "We will not create or incorporate tools to automatically post answers using AI to the sites." and "We will discourage users from manually posting or converting AI-generated answers without human verification and review to the sites." - over and over until people actually start believing it, people will continue to look for indications that you're stating things in ways that allow you to point at your prior statements and say that you're not "technically" violating them.

5
  • 111
    "that post is written in a very indirect and unclear way" -- that is intentional, no? The company has been communicating in this style for quite some time now. Lots of grandiose phrases to bamboozle the audience while very little is actually being said. It's infuriating.
    – Dan Mašek
    Commented May 7 at 9:42
  • 8
    Note that posting AI-generated content is still banned regardless of this policy.
    – Nzall
    Commented May 7 at 13:25
  • 6
    @Nzall That's true of SO and a handful of other sites - I don't think it's network-wide by any means. Sites can set that policy - for now. The question is whether the company will continue to allow users to control that into the future. At this point, there's no direct indication that this partnership will lead to posting of AI content, either automatically or manually. My guess is that the intention is to improve AI search tools to make it easier for users to find answers on SO rather than using off-site tools. If the SO bot is better/more accurate/sourced, people will use it.
    – Catija
    Commented May 8 at 15:38
  • 3
    I wouldn't focus too much on "posted only after human review" - it's worth noting that's worth nothing. We literally just saw a case of obviously ridiculous AI images in a scientific paper breezing through peer review with no one caring, so quality will necessarily go down, because Brandolini's law combined with AI is a death sentence for communities like SE, and I doubt they'll employ people to review content from the money they'll make.
    – Izzy
    Commented May 11 at 15:21
  • 3
    @Nzall of course it is, AI techbros are well aware that one of the worst things that can poison LLM training data is output from itself or another LLM. They're buying SE because it's one of the most pristine sources of original, curated human content. Commented May 14 at 4:31
187

No, please don't do this.

ChatGPT and SO are two mutually exclusive things.

If we bring ChatGPT to SO then what will be the difference between SO and any other GenAI site answering questions?

SO has maintained the trust of its dedicated users, moderators, and contributors in its "quality content" for over 15 years.

Please do not break this trust with ChatGPT.

13
  • 13
    It makes money for owners of SO. Everything you post can and will be sold as training data. Commented May 11 at 3:29
  • 27
    @MaximEgorushkin The owners of SO are free to make money off our content, but they must abide by the CC BY-SA license, which requires that derivative works be shared under a compatible license and include attribution. OpenAI doesn't currently provide attribution, and almost certainly is not going to open-source their LLMs.
    – endolith
    Commented May 11 at 18:09
  • 1
    @endolith What would make SO abide by the license? Any precedents? Any pending litigation? Didn't SO change content licensing overnight in the past? How can it attribute anything to a source when it doesn't cite anything, but just generates the next words? Commented May 12 at 4:54
    @endolith Do you attribute anything in your speech to what you read and heard? Or is that just fair use? It is an open issue, I suppose. LLMs will make human answers more valuable, IMO: get these words for free, or ask a human to read your drivel and point you to a duplicate answer for a fee? Commented May 12 at 5:04
  • @MaximEgorushkin guides.library.emerson.edu/FairUse
    – endolith
    Commented May 12 at 22:15
  • 8
    They are not going to listen to you. Money >>>> all of us! Commented May 13 at 10:10
  • 1
    @MaximEgorushkin Nobody was going to sue SO over a minor change which retained the spirit of the license despite violating the letter. This violates both. Commented May 15 at 1:21
  • @endolith AI generated works cannot be copyrighted, at least in the US, they are automatically in the public domain.
    – Poscat
    Commented May 15 at 10:04
  • 1
    @Poscat So if an AI "generates" a verbatim copy of a copyrighted work, then it becomes public domain?
    – endolith
    Commented May 16 at 22:59
    that's certainly an open question, since so far no generative AI has generated a copyrightable work (a work needs to be sufficiently long to demonstrate creativity to be copyrightable).
    – Poscat
    Commented May 17 at 1:22
  • @Poscat How is it an open question? Foundation models have been shown to reproduce verbatim copies of copyrighted works, and have been sued for it...
    – endolith
    Commented Jun 2 at 22:38
  • @endolith Do you have a link to the case?
    – Poscat
    Commented Jun 3 at 6:31
  • @Poscat thefashionlaw.com/…
    – endolith
    Commented Jun 3 at 19:37
143

Some questions that come to mind:

  • Is it going to be possible for individual users to opt out of having answers touched by OverflowAPI?

    • Follow up: can sites opt out of being included in OverflowAPI?
    • Related: What sites are included/excluded from OverflowAPI?
  • What applications do y'all plan to use OpenAI's models for?

  • Is SE receiving financial compensation for this partnership, or is it purely a software exchange? If yes, how much?

    • Follow up: if yes, what'll the money be used for? Funding more AI things or giving resources to fixing/improving the SE platform without AI?
  • How does OverflowAPI actually work?

  • Is there a plan for if users start deleting answers en masse, editing/writing-in terminology to try and throw off any LLMs, or otherwise try to disrupt the quality of answers in OverflowAPI?

4
  • 30
    The most important part is the last question; if OverflowAPI can disrupt the answer quality, I will definitely leave this rubbish site. Commented May 6 at 13:53
  • 5
    There are also additional financial aspects here, especially regarding SO: If OpenAI decides that only their Paid Tier gets the newest SE/SO data, you essentially need to pay for efficient GenAI use cases. This makes a unique selling point for ChatGPT, where basically another company is profiting off of the contributors' work. Apart from that, there needs to be a consideration of why the knowledge was given. SE thrives on people providing knowledge for free. Not really sure how I should feel about that yet...
    – A7exSchin
    Commented May 6 at 14:42
  • 8
    On the compensation: does anyone else interpret the press release as "OAI gets SO's data in exchange for SO agreeing to beta test OAI features"? Doesn't really seem like a balanced trade.
    – AShelly
    Commented May 7 at 0:12
  • 2
    @AShelly I think SO also gets money from OpenAI (or at least get cheaper access to their models if not) though they probably don't want to disclose details.
    – dan1st
    Commented May 7 at 4:48
118

How does this impact the previously-announced partnership with Google? It seems like OpenAI and Google are both getting similar things out of their respective relationships, and that is access to the use of Network data for training. OpenAI also recently announced an entry into search. Can we expect both OpenAI's services as well as Google Cloud AI technologies to be integrated into the platform, since both announcements appear to indicate this?

There was a past blog post on "socially responsible AI". One of the key elements was "attribution is non-negotiable". OpenAI, historically, has done a poor job of attributing parts of a response to the content that the response was based on. Google's Gemini has some built-in capabilities for citing content with related web content, but there are still no guarantees that it relates the statement to the correct source material or that attribution meets the CC BY-SA requirements for attribution. Given the company's stance on the need for attribution and OpenAI's past problems in that space, why this partnership? When can we expect to see more details about how OpenAI is solving the attribution problem?

Are we going to continue to see AI services blocked if they do not sign a partnership, especially with no public announcement? This also includes the mentioned-but-not-detailed "guardrails" on the API, SEDE, and data dumps.

4
  • 7
    Despite the "CC BY-SA" technicalities, which are definitely worth better visibility to the community as the current bounty suggests, this question raises an even more important topic: The company has said, "attribution is non-negotiable." I'm actually not against a partnership with OpenAI but I think it is important for the company to stay true to this promise to continue maintaining some level of trust with us, the community. Commented May 9 at 18:03
  • 9
    One way to predict future behavior is to look at the past actions of OpenAI. At this point, the foundational principles have been gutted. Once money enters the equation, any talk of "socially responsible" guidelines falls aside. This, along with the obvious points that have been better expressed already by community members here, are the reasons why this is an ill-advised path forward.
    – ILMostro_7
    Commented May 10 at 4:51
  • "Attribution is non-negotiable" could be read as "we were not able to negotiate attribution into the contract" ;-) Commented May 23 at 23:11
  • "why this partnership?" - money, of course. SE is a profit-driven company, and that's not going to change regardless of how it impacts the community
    – Anonymous
    Commented May 24 at 14:39
114

I've been talking to a few people/reading a few posts offsite and I think there's a great deal of confusion over it.

There are a couple of key issues here with messaging:

  • the moderator teams were not given any prior notice of any of this, and we're likely the ones dealing with the fallout of people getting mad about it

  • a lot of people think this is "getting genAI on the main sites 2.0"; the press release links to Teams but lacks clarity on its own merits.

  • the press releases don't actually say anything about how OpenAI, Google and others intend to, or could use the API.

  • to get any clarity over what the StackOverflowAI API is, I'd need to waste someone's time registering as a potential end user. While the endpoints might be restricted, there might be value in having the API documentation open and the actual API restricted.

  • there's a bit of offsite soapboxing and people wanting to delete all their content which could/should have been avoided.

Essentially, we've got a ton of fluff and no clarity over what this actually means for us - nor do we have the tools to communicate with our communities over concerns they have. Considering the rocky road that we've had on the topic, I'm surprised no one thought about how to message this to the community.

As excited as y'all are about new partnerships, I feel like old ones got missed out - and an opportunity for open communication.

As a moderator, I don't know what to tell my community about this. Not in terms of "Oh god, it's so bad", more of "I've got nothing at all." This isn't a good state of things to be in.

10
  • 76
    "the moderator teams were not given any prior notice of any of this" - this seems a violation of the agreement that was reached less than a year ago. Commented May 7 at 15:02
  • 14
    @S.L.Barthisoncodidact.com Lots and lots of empty words, who would have guessed. Unless it hurts the company financially (and they understand that this is the case), nothing will change. There's no reasoning with them; everyone should have realized that by the 99th attempt. How the board can sit passive and watch their flagship product getting dragged in the dirt over and over again, I just don't understand. All other companies in the world tend to cherish their flagship product.
    – Lundin
    Commented May 7 at 15:12
  • 5
    @Lundin To be fair, I can understand there's a dilemma. Negotiating a partnership is likely a business secret until the contracts are signed. Informing the moderators ahead of time would make this difficult. Not sure what is the right solution here.... Commented May 7 at 15:18
  • 2
    Well, its out now. I'm seeing a fair amount of confusion, so I'd say as much as a press release plan, having one to provide us with basic information, communicate with end users of various engagement levels and bidirectional handling of social media might be handy. Essentially, giving us the tools to help, and not needing us to. Maybe talking points, an FAQ.... Commented May 7 at 15:26
  • 4
    I'm not surprised they didn't bother thinking about how to message this to the community - it's very much a pattern of behavior at this point. SO should never have gone public.
    – Gloweye
    Commented May 7 at 18:40
  • 5
    SE never did go public. If it had, having a community representative on the board might be a good counterbalance, and I suspect we could have made it happen collectively Commented May 7 at 23:15
  • 1
    In my opinion there was no way to message this announcement such that individuals who have contributed many years to a community would have taken it positively, since the announcement is primarily negative if you have contributed to a community significantly. A partnership with ChatGPT, no matter how it was handled, would be received negatively (as it should be) because nothing about ChatGPT is positive
    – Ramhound
    Commented May 9 at 6:26
  • 4
    @S.L.Barthisoncodidact.com - This agreement is neither a major product change nor a policy change and so is not required to be announced first to mods. It would have been good if they had, though.
    – Mithical
    Commented May 9 at 20:20
  • 2
    Re "StackOverflowAI API": There is OverflowAI and the press release uses "Stack Overflow’s OverflowAPI". Do you mean OverflowAPI? "Improve the performance of AI models & products" says "OverflowAPI is a subscription-based API service that provides continuous access to Stack Overflow’s public dataset to train and fine-tune large language models." Commented May 10 at 0:47
  • 1
    @This_is_NOT_a_forum this I did, though its entirely plausible I got OverflowAI and OverflowAPI conflated Commented May 10 at 3:50
100

Most people are focused on attribution (and rightfully so), but it seems that not much attention is being paid to the share alike part of the CC license. In AI contexts, copyright law is still being tested in court and many things are uncertain. There is a very real risk that training an AI on this site's data will not necessarily be considered "fair use" (it fails the "serves as a substitute for the original" test, among other things), which means there's a risk that the trained model will be considered a derivative work and thus required to carry a license similar to CC-BY-SA 4.0. When you make agreements with these AI companies, are you including provisions and processes to ensure their AI models will be properly open-sourced should these sorts of court decisions happen?

11
  • 17
    LLMs are pretty much the same as a lossy compression algorithm combined with a compiler, so the output naturally is a mechanically combined derivative of the input. This means all licences on all of the input must be honoured, which means they’d have to open up all the “training data” if they include copylefted works in them.
    – mirabilos
    Commented May 7 at 1:47
  • 7
    @mirabilos I'm more concerned with the situation where the trained model itself is ruled to be a derivative work.
    – bta
    Commented May 7 at 1:53
  • 17
    it is, of course; just because you can lossily JPEG-compress a picture doesn’t make the JPEG file not a derived work of the original picture, and you can decompress it and get a substantially similar result back, which people have proven for LLMs as well (by now, really substantial amounts of “training data”)
    – mirabilos
    Commented May 7 at 2:31
  • 2
    @mirabilos I agree with the technical description although there may be an argument that such LLMs compress so strongly that maybe only a tiny amount is taken from each work and that may make a difference. Anyway the courts or the legislation may well decide one way or the other. We should wait for a few of such court cases to see how it's seen legally. Commented May 7 at 6:42
  • 4
    The argument from OpenAI might be that a model should be treated the same as a person that re-expresses the ideas in an SE post, but doesn't reproduce or remix the content. In that sense, the data and the model may need to be relicensed, if they were ever distributed. However, the only thing OpenAI publishes is the model's answers, and they only need to be CC-BY-SA if they overlap enough with the training data to count as derivative.
    – Peter
    Commented May 7 at 13:20
  • 2
    All of this is pretty orthogonal to the deal above, though, since the license applies just the same whether OpenAI scrapes the data from the website or gets it directly from SE.
    – Peter
    Commented May 7 at 13:21
  • 1
    @NoDataDumpNoContribution there has recently been work showing that, no, the compression is nowhere even remotely as heavy to reach that. But having read where someone reimplemented a slightly older ChatGPT in a 498-line PostgreSQL query, I now understand even better why this is so, and how the output is naturally derived from the inputs.
    – mirabilos
    Commented May 7 at 21:26
  • 5
    @Peter - We don't know whether SE licensed the content under the same CC license or not, that's one of the open questions. They may have re-licensed it under something more company-friendly. That would be technically illegal, but the site's ToS includes mandatory arbitration so we couldn't really do anything about it if they tried.
    – bta
    Commented May 8 at 22:53
  • 1
    @bta That's interesting, I didn't know about the arbitration thing. It seems a very flimsy basis to commit such a massive copyright violation (considering all the exceptions), but I suppose it's hard to put anything past them at this point.
    – Peter
    Commented May 9 at 9:13
  • 2
    @bta: Many of us opted out of mandatory arbitration. The fact that the clause is there probably prevents a class action, unfortunately, but there still is room for a massive lawsuit with hundreds of plaintiffs.
    – Ben Voigt
    Commented May 9 at 21:15
94

In which case, why should any of us bother to contribute our time, energy and expertise to enrich you any further?

For that matter, why honestly did we bother in the first place, to provide unpaid labour for you, without which this website has no value whatsoever? It's made worse when you're going to just feed that effort into the digital equivalent of a compost heap, for it to be decomposed into excrement, but really why bother with this place at all instead of, say, a not-for-profit?

On an unrelated note, I'm not contributing anything further to this place ever again, and putting stuff on https://software.codidact.com/ instead. Which just so happens to be basically the same as this site only run by a not-for-profit!

6
  • 1
    "..why should any of us bother.." Well, people contributed in the past and some will probably also contribute in the future. Maybe they don't care about all that. The future will show what will happen to SO. Commented May 10 at 7:42
  • 8
    "why should any of us bother to contribute our time" — exactly. Which will cause the website to degrade over time. ChatGPT may train on the content that is here today, but will it be able to answer questions about the new framework that people will create in 3 years? If there are no answers from users on SO anymore, I highly doubt it. I don't understand why SO made that partnership decision... Commented May 10 at 8:21
  • 2
    We contribute because, despite time and time again being shown how stupid we are, we continue being stupid. Commented May 14 at 8:11
  • 6
    My upvote to this post for letting me know of Codidact. This post needs to be promoted so everyone will be aware of this and jump this sinking ship soon. Once we stop using it, they won't have the data to sell. I think that's the only solution for this imo. Commented May 14 at 10:55
  • 4
    When this place was first started, they promised not to give out content without our attribution and that our content would never be behind a pay wall. Broken promises....
    – Travis J
    Commented May 14 at 20:04
  • 5
    I don't know about you folks, but the reason I post questions and answers on this site is so that my answers can help people, as a payback for the fact that I often find answers (from other people) on this site to problems that I have. I frankly don't care if the company manages to make money, even lots of money, using what I write, as long as the questions and answers I provide remain publicly available to help people who would benefit from them. So that's "why I bother". If I wanted to get credit for or monetize text that I had written, I would have posted it on my own website, not here.
    – Some Guy
    Commented May 20 at 22:17
86
+50

Licensing....

When a user contributes content (i.e., questions and answers: the data that would be shared with LLMs), the terms of service grant two licenses.

  1. User content "is perpetually and irrevocably licensed to Stack Overflow on a worldwide, royalty-free, non-exclusive basis pursuant to Creative Commons licensing terms (CC BY-SA 4.0)"
  2. "you grant Stack Overflow the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit such Subscriber Content, even if such Subscriber Content has been contributed and subsequently removed by you"

I often simplify the lawyer version in the ToS by saying that by posting on the site, users grant licenses:

  1. To anyone in the world: under CC-BY-SA 4.0.
  2. To Stack Overflow (the company): to use however they want, unrestricted by the CC-BY-SA license.

Stack Exchange Inc can do whatever they want with Q&A data.

I know many users are asking about attribution, and this post maintains that attribution is a non-negotiable requirement of the partnership.

However, I want to point out that there is no requirement that Stack Overflow provide the data to OpenAI with the CC-BY-SA license intact, because Stack Exchange, Inc is not bound to the CC-BY-SA license. Quoting (again) from the Terms of service:

"Stack Overflow [has] the perpetual and irrevocable right and license to ... distribute, export, display and to commercially exploit [Q&A] Content"

How will the company be licensing the data to OpenAI and other future customers who wish to use it for LLM training data?

Stack Overflow can ensure that data provided to OpenAI and other LLM creators is licensed under the CC-BY-SA license, in order to ensure those companies are required to attribute it accordingly.

Attaching the CC-BY-SA license to data provided to OpenAI would additionally bind OpenAI to the "ShareAlike" clause of that license, effectively requiring that in addition to attributing specific ChatGPT responses to Stack Overflow, those ChatGPT responses also carry the CC-BY-SA license.

It seems highly unlikely to me that OpenAI would be willing to have certain GenAI output (whether ChatGPT, or coding suggestions, or whatever) be covered under CC-BY-SA. Therefore, I am assuming that Stack Overflow will be providing LLM training data as a product that uses a license different from CC-BY-SA. Stack Overflow users have granted the company the "perpetual and irrevocable right and license to...distribute..and commercially exploit" Q&A data, so it would be well within the company's rights to offer the new product (LLM training data) under a completely different license.

This joint press release presumably means that the ink on the contract is dry, and thus the terms under which the data will be shared have been determined. I think many of the questions on this meta post would be answered simply by letting us, the community, know how Stack Overflow is licensing the data to OpenAI.

22
  • 15
    On the last collaboration, SE explicitly stated that they are not providing the content under a different, non-CC license when I asked about that. I assumed this case would be the same, but it certainly would be useful to ask explicitly for this case as well. Commented May 7 at 21:13
  • 41
    @MadScientist If I've learned one thing in recent years, it's that yesterday's promises are not guaranteed to be tomorrow's promises. Data Dump was free forever, then the ceo unilaterally decided to shut it off, then they pretended it was just temporary and turned it back on when everyone was upset.... And how many times have they "forgotten" to keep Mods in the loop on changes?
    – AMtwo
    Commented May 8 at 12:58
  • 8
    I understand that legally, Stack Overflow can do whatever they want with the data. But practically, to maintain some level of community trust, the company DOES need to maintain an attitude of keeping promises such as "Attribution is non-negotiable". Commented May 9 at 18:04
  • 10
    Legally, Stack Overflow has to abide by CC BY-SA. The answer quoted that. The other paragraph doesn't create a new license, it's just restating rights granted under CC BY-SA. Note that "irrevocable" does not mean "unconditional".
    – Ben Voigt
    Commented May 9 at 21:57
  • 12
    And even if you argue that it does create a second license, relicensing is not one of the rights granted, so that supposed second license cannot be used for any partnership, OpenAI or otherwise.
    – Ben Voigt
    Commented May 9 at 22:01
  • 5
    @BenVoigt that might be your reading of it, but I assure you that Stack Overflow considers the data "dual licensed" and that they have an irrevocable right as described in my post. I say this as a former staff member.
    – AMtwo
    Commented May 10 at 21:49
  • 5
    I disagree with your analysis of the licence granted. The core grant is under CC BY-SA. You fail to quote "as reasonably necessary to, for example", and therefore your summary is inadequate, as are the conclusions you draw from it. The scope of this is not "however they want". I cannot reply, but I request any use of the content from the network comply with the full terms of the CC BY-SA licence, including full attribution by name with a link to the post AND the full continuation of the licence in the training set and output based on this content, as per the terms of the licence. Commented May 12 at 20:30
  • 2
    gpt-4-turbo, given the current ToS: "It’s also important to note that although Stack Overflow has the right to exploit the content commercially, this must still align with the constraints of the CC BY-SA 4.0 license. This includes the stipulation that anyone who uses the content — including Stack Overflow if they are providing your content to third parties — must also distribute it under the same CC license, thus preventing any exclusive commercial rights or non-CC licensing to other companies without additional permission."
    – endolith
    Commented May 12 at 22:46
  • 1
    Claude Opus: 'While the Terms of Service (ToS) could be interpreted as implying a dual licensing scheme, with content being licensed under both the CC BY-SA 4.0 and a separate commercial license, this interpretation is based on the ambiguous "commercially exploit" clause and is not explicitly stated in the ToS. … If Stack Overflow sells the content to AI companies without ensuring that they provide attribution and release their LLMs under an open-source license, it would undermine the principles of openness and attribution that are central to the CC BY-SA 4.0 license.'
    – endolith
    Commented May 12 at 22:46
  • 1
    The business is not in a privileged position with regards to the content, except for these maintenance/analytics/regulatory items listed or others which might reasonably be construed as related; otherwise this would make the grant under CC BY-SA moot. The "commercially exploit" language is no big deal, as the non-NC licences allow for commercial use already. Commercial use is possible, but what is not possible is to create a derivative work from the content where the CC BY-SA licence would have been purged. Anything this partnership creates which incorporates the content is under this licence. Commented May 12 at 23:47
  • 3
    @ninja米étoilé I think that your interpretation varies quite a bit from the opinion of the Stack Overflow lawyers. Unless someone decides to use their lawyers to challenge the opinion of Stack's lawyers, the Company's path forward will likely follow their lawyers' interpretation.
    – AMtwo
    Commented May 13 at 1:22
  • 3
    If this interpretation is correct (and I don't doubt it either flatly is, or can easily be so interpreted by lawyers), then first stating CC-BY-SA as license and later different terms (which completely invalidate the earlier license) is incredibly deceiving. The only way to interpret that is that this is a deliberate attempt at deceiving people about the real license. Commented May 14 at 18:37
  • 1
    No, "Stack Exchange Inc can do whatever they want with Q&A data" is wrong; they cannot. That only applies to relevantly licensed material, and not 100% of the exchange is covered. Changing the current license does not change the terms that applied to previous content; only the relevant license at the time would apply. For example, they couldn't change their definitions right now to entirely remove licensing and have that apply to the entirety of content produced by the user base.
    – Travis J
    Commented May 14 at 20:01
66

When did you consult the users and moderators of this website, who have created all of its value? Why do we learn of this decision via a press release? After the whole episode about the new AI policy and the moderators' strike, haven't you learned anything? I don't think your attitude to us, the creators of this site, has changed at all. I don't think it ever will.

I'm just going to delete/deface my answers one by one. I don't care if this is against your silly policies, because as this announcement shows, your policies can change at a whim without prior consultation. You don't care about us, I don't care about you.

9
  • 9
    You'd be putting most of the strain on the community mods rather than the staff or the gen AI providers. There are probably better ways to remove all your content from the network than that if it came to it - but it's probably better to have a considered reaction, backing up your posts and such, than threatening to delete all your posts. I'd note defacement/deletion doesn't actually remove SE's, or potentially others', access to your post either. Commented May 7 at 14:47
  • 10
    Vandalism is usually reversed. If you no longer want to be associated with the content, there are better paths to go. Disassociating the content from your account, for example. Or deleting the account altogether, which will dissociate everything at once.
    – Mast
    Commented May 7 at 14:55
  • 9
    Porting the content to Codidact would actually be viable, although mass imports from SE were attempted and didn't really end up well. A proposal for a new community would have to be made in this case, but once that is done, there are ways to import SE content. You could pick your best posts to serve as example questions for the new community. Although since this is getting off-topic, a separate question about it on the appropriate Codidact meta would be the best place to continue that discussion.
    – Lundin
    Commented May 7 at 14:58
  • 12
    @Mast I don't care about dissociation, I want my contributions removed so they can't be misused. Commented May 7 at 18:46
  • 15
    @henning -- of note the Terms of service give Stack Overflow license to use your post content, even if they are deleted.
    – AMtwo
    Commented May 7 at 19:41
  • 5
    @henning It might cause volunteer moderators some pain if they want to preserve your answers for the site's users; it will not affect the company and you've already licensed the content. Commented May 8 at 18:03
  • 1
    @henning It's possible you have: meta.stackexchange.com/a/399674/401068 Commented May 8 at 21:38
  • 1
    @SilentCloud Yes, that's exactly what I suggest. Do put in a link to the original on SEI, to comply with the attribution requirement. Commented May 10 at 7:35
  • 4
    @S.L.Barthisoncodidact.com: No such link is required for your own posts, since you still own them and can distribute however and wherever you want. You shouldn't be copying other peoples' content to other Q&A sites (if they want to display it under CC BY-SA, fine, but you can't paste it into the textbox intended for your own contributions)
    – Ben Voigt
    Commented May 10 at 15:43
58

How is the socially responsible aspect taking into consideration the CO2 emissions from LLMs?

I saw no mention of it in your linked article "Defining socially responsible AI: How we select partners."

1
  • 41
    Some might say it's just another example of genAI being a load of hot air Commented May 8 at 13:26
52

I'm just frustrated with the continued poor communication.

GenAI is arguably the most heated discussion in tech there is right now; it's polarizing, confusing, and scary to a lot of people, especially surrounding attribution and apparent content theft (whether legal or simply perceived).

I find it unbelievably frustrating that apparently nothing has been learned by Stack Exchange on the comms front since the first round of this with the Google AI partnership, which was also poorly received (scored +35/-455 at time of writing).

Just like that one, all we've been presented here is a link to the blog with effectively zero meaningful tailored messaging; this post could have been a completely automated notification of the blog post and it would have had exactly the same amount of actual content.

It just feels like the company keeps shooting itself in the foot here, and then is shocked at how much it hurts when it happens again in exactly the same way the next time.

We, your community, care about this platform. We want to stay updated with partnerships. We even want you to get paid and keep the lights on!

But that also comes with wanting to know more than what the public press release full of buzzwords provides, to have some insight into how the partnership might practically affect us and the platform, and maybe most importantly, wanting assurances about issues important to us (like attribution), which should have been stressed in the original post from the beginning.

I strongly disagree with their approach, but the fact that there are folks already going nuclear from just this announcement illustrates just how badly the company has botched the messaging on this one. The first shot at explaining this in terms the community is willing to hear has been missed by a mile. Again.

3
  • 3
    I'm beginning to realize that all I've really done here is accidentally rehash Catija's (better) answer...
    – zcoop98
    Commented May 8 at 21:07
  • Not sure if the company is really shocked. Maybe they genuinely do not care enough about their community to bother writing nicer messages? Maybe they think that we need them more than they need us, or that we are too demanding to be worth the effort. And also, people go ballistic all the time. Many of the people who announce they will never contribute again didn't contribute much in the past. The true test will be the usage of SE in the coming months. Many contributors might not even be aware of the company's deal with OpenAI. Commented May 10 at 7:48
  • 4
    The continued poor communication is intentional, because the moment they have to start giving solid answers they'd have to admit how bad this is going to be. Commented May 14 at 4:34
45

For reference, the correct thing to do was to defend us from AI, to work to block and sue OpenAI and companies like them and cause them as much pain and frustration and headache and misery as legally possible. You would have been heroes even if you failed.

Instead, you sold out everyone for a very short-term dollar.

It makes me sad to say that Stack Exchange is now in free-fall, being gutted and exploited and sold for every last buck you can scrape together before its dying gasp.

3
  • 3
    Well, they are a for-profit company. What do you expect? It's their decision how to best make money, and it would be nice if they had sued OpenAI instead and been heroes, but we couldn't really expect this from a for-profit company. It doesn't seem to make economic sense for them. Commented May 10 at 8:05
  • 3
    This site has never been run as a charity! Nothing is gutted, since all your answers are still here, with multiple backup copies. Deletion and alteration of the answers will do nothing. The only power you have is to stop giving free answers. Commented May 13 at 10:28
  • 1
    This is the most correct post about the entire topic. You should defend your data, not sell. Commented May 26 at 11:21
45

So when is the class action starting?

And let's appreciate the ironic juxtaposition:

[image]

9
  • 8
    If you dive into the site Terms of Service, you'll see that "Stack Overflow [has] the perpetual and irrevocable right and license to ... distribute, export, display and to commercially exploit [Q&A] Content" — i.e., Stack can do whatever they want with Q&A data. There's really no legal basis for a user to sue the company over what they do with Q&A data.
    – AMtwo
    Commented May 7 at 19:57
  • 19
    "...with attribution." @AMtwo
    – W.O.
    Commented May 7 at 19:58
  • 4
    @W.O. -- Stack Overflow isn't bound by CC-BY-SA.
    – AMtwo
    Commented May 7 at 20:27
  • 5
    I see @AMtwo: "unrestricted by the CC-BY-SA license". Thanks for that detail that I, like a number of others, had clearly missed.
    – W.O.
    Commented May 7 at 20:45
  • 5
    @W.O. Agreed. I think many folks don't fully understand how the "dual license" works on contributions.
    – AMtwo
    Commented May 8 at 13:03
  • 1
    @AMtwo license or not, the GDPR gives you a right to have your answers deleted. See here how to have your answers deleted for you: meta.stackexchange.com/a/399735/297249 Commented May 9 at 7:31
  • 3
    it is not deletion, just removing your link to the post by anonymizing you. Commented May 9 at 7:46
  • 1
    Unless you opted out - I think there's also a binding arbitration agreement. While it's entirely possible to bring a company to its knees with binding arbitration, a class action suit would be tricky Commented May 9 at 13:54
  • 7
    @henningnolongerfeedsAI If a user submits a gdpr request to be deleted, their Q&A contributions are disassociated from their user account, and the user deleted, but the q&a content remains online, but deidentified. Questions and Answers are NOT deleted. That's how Stack's lawyers determined was the right way to comply.
    – AMtwo
    Commented May 10 at 21:43
40

My answers have been read by over 151,000 people on the physics Stack Exchange site (PSE). That puts me in the top 3% of users. I wrote those answers to help the community. I am not okay with my answers being used for the benefit and profit of OpenAI, a company I do not like.

I was banned yesterday from PSE for deleting my answers, which were then edited back to their former state.

For some context, I am of the belief that the goal of advancing AI to an intelligence greater than humans doesn't lead us to a brighter future. We can agree that that is OpenAI's stated goal, even if we disagree about how realistic it is or what the timeline might be. So, I don't want to contribute to their effort at all. That is the practical purpose behind editing away my answers. I believe this partnership may be a result of the added profit motive after the purchase of SE by investment firm Prosus.

14
  • 13
    The only sad thing is that you would delete your answers instead of just not contributing anymore.
    – Ramhound
    Commented May 9 at 6:16
  • 18
    when you participate on this site and contribute content, you are agreeing to the terms of service, which includes giving SO Inc. the license to commercially exploit your content. you may not like it (it doesn't fill me with a warm and fuzzy feeling), but it's a fact.
    – starball
    Commented May 9 at 6:43
  • 15
    Note that users trying to remove and/or deface their contributions and being suspended over it has been happening for years. There is monitoring in place for just such occasions. It all follows standard policy, as well. Neither what you tried to do, nor the result is anything new for the platform.
    – VLAZ
    Commented May 9 at 6:49
  • 3
    "This is sad." You probably thought your answers belong to you, but after posting they belong to everyone. And with the dual-licensing in the TOS you even agreed that they also potentially belonged to OpenAI. If you are not okay with that you should not have taken part in SE and did more intelligence and read the TOS more carefully before. Okay, this AI breakthrough in 2022 was a bit difficult to foresee really. The only action left now is stopping contributions in the future, but the past cannot be changed. Commented May 10 at 7:57
  • 8
    Your content has already been used for training AI; this happened long before anyone here, even the company itself, knew about it. Vandalizing or deleting your existing content does not hurt AI training in the slightest; it just hurts people who don't want to use AI and need a reliable source of human-written information. You can only stop contributing new content in the future. Doing anything else is an exercise in futility. Commented May 10 at 8:04
  • @ResistanceIsFutile Editing posts is likely to also change the training data once the site gets scraped the next time. That way, it's likely to keep the real post out of the data. So, resistance seems like it's not futile :) Commented May 10 at 13:50
  • 3
    I doubt that they will waste time training it over and over again on the same data. Still, if we destroy human-created data, how will that help people stay away from AI? It will only undermine the credibility of SO as a source, and AI will still be trained on anything that can be scraped. Poisoning the well will only kill us; the AI will survive. This is the most useless kind of protest there is. Will you start burning books so that AI cannot be trained on them? Commented May 10 at 14:01
  • 3
    New models would likely only be trained on the most up-to-date version of stack posts, so those would be affected - I'm not imagining double training of the same model. At minimum, I would be happy if it becomes known that there is public resistance against the idea of letting OpenAI train freely, so that other companies factor that into their decision to partner with OpenAI as well. I don't see that as fruitless. Commented May 10 at 14:12
  • 5
    I guess you missed everything that was happening here last year if you think that your protest will have any kind of effect on SE or any other company's decisions. It is not even a drop of water in the ocean. Again, the only people you are hurting are the very ones that don't want to use AI. And not even them, because all the vandalization efforts are in vain. Right now this protest prevents the most active moderators from working on removing AI-generated content from sites; instead they have to deal with your and other users' mischief. Commented May 10 at 14:40
  • 1
    @doublefelix Retraining a GPT from scratch is very expensive and time-consuming. GPT training happens in several stages. First, a word embedding is created. Next, the transformer is fed the main bulk of the training data. Finally, fine-tuning is applied, primarily Reinforcement learning from human feedback.
    – PM 2Ring
    Commented May 11 at 2:16
  • (cont) It's relatively easy to add to the top layer, but you can't update the middle layer without invalidating the upper layer. And that upper layer is expensive because it needs human interaction and supervision. Stephen Wolfram has written some excellent articles about how ChatGPT works, and its limitations, which I linked at the end of meta.stackoverflow.com/a/422397/4014959
    – PM 2Ring
    Commented May 11 at 2:17
  • 1
    What made you think that content you posted on a corporate owned web-site belonged to you? Commented May 11 at 3:32
  • 5
    openai's stated goal was to not support military applications, until they silently dropped it to make a deal with the military. their stated goals are not worth the pixels they are displayed with. Commented May 14 at 8:15
  • I guess we can still mess with their data quality by upvoting bad answers and downvoting good ones
    – Dinisaur
    Commented yesterday
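PM 2Ring's point above, that the expensive lower stages of a GPT are reused while only the top layer is cheaply updated, can be illustrated with a toy sketch. This is a deliberately tiny stand-in for staged training in general, not OpenAI's actual pipeline: two frozen random layers play the role of the embedding and bulk-pretraining stages, and only a small linear head is fine-tuned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy staged model: frozen layers stand in for the expensive early stages.
W_embed = rng.normal(size=(4, 8))   # frozen (stage 1: embedding)
W_body = rng.normal(size=(8, 8))    # frozen (stage 2: bulk pretraining)
w_head = np.zeros(8)                # trainable (stage 3: fine-tuning)

def features(x):
    # Frozen forward pass: redoing these stages from scratch is the
    # expensive part; fine-tuning reuses them unchanged.
    return np.tanh(np.tanh(x @ W_embed) @ W_body)

X = rng.normal(size=(64, 4))             # tiny fine-tuning set
y = features(X) @ rng.normal(size=8)     # synthetic target

F = features(X)                          # computed once, never retrained
mse0 = float(np.mean((F @ w_head - y) ** 2))  # error before fine-tuning

for _ in range(500):                     # gradient descent on the head only
    grad = F.T @ (F @ w_head - y) / len(X)
    w_head -= 0.1 * grad

mse = float(np.mean((F @ w_head - y) ** 2))
print(mse < mse0)  # fine-tuning reduced the error without touching the frozen stages
```

The asymmetry this shows is the one PM 2Ring describes: the head can be retrained cheaply against cached features, but changing `W_embed` or `W_body` would invalidate everything built on top of them.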
38

I love using SE because SE is not a forum, and I use Mathematics Stack Exchange and English Language Learners Stack Exchange because I can get more accurate answers at those sites.

But why would I and other users want to contribute anything to this site if ChatGPT (OpenAI) is being used on it?

The community needs to know that SE is not the only place for Q&A; there are other places as well.

For example, if I don't use Mathematics Stack Exchange, I can go to AOPS to ask my mathematics questions instead.

So SE is definitely not the only choice for Q&A, and more sites/forums will be developed in the future.

I really hope that the community thinks wisely and does not share our content with OpenAI’s products.


Note: Quora has actually also introduced ChatGPT; see here. That's the reason why I don't like using Quora: I'm afraid of getting AI-generated or bad-quality answers.


Reply @Some Guy,

I know that is a strawman argument; in my answer, I don't mean that ChatGPT has been used to create posts. But there are many things we don't understand in this announcement.

@Rosie has just answered the question about attribution and OverflowAPI, but many things, like licensing and how OverflowAPI will be used, remain open questions.

As we know, ChatGPT and SE are fundamentally at odds. So I don't believe it is a good idea for OpenAI to be in a partnership with SE.


As I wrote under @Iyxal's answer,

The most important part is the last question: if OverflowAPI can disrupt answer quality, I will definitely leave this rubbish site.

So I will wait for further updates about OpenAI first.

1
  • 2
    This seems like a strawman argument, or you haven't read the announcement carefully. Nobody so far has proposed that ChatGPT be used to create answers here. You haven't explained why ChatGPT training on (as opposed to being used to create) text here is a problem. Personally, I post text on this site, for free, to help people, and I don't see how ChatGPT being trained on that text does anything other than enlarge the group of people who might be helped. I don't really care if I get credit for what I wrote, so long as nothing prevents it from being read by the public it was intended for.
    – Some Guy
    Commented May 20 at 22:32
33

Wow, you guys made me log in again.

(And that's not a good thing)

I, personally, am not anti-AI, but this partnership was not discussed with us, your users. You previously banned ChatGPT's web crawler, but have since lifted that ban and have now granted access to the Overflow API directly.

I totally understand people's objection to this change in policy and their attempts to remove their questions and answers in protest, even if I won't be doing so. After all, that lack of transparency is why I logged out in early 2020 and haven't been back since. Well, OK, I did have to return and log in so I could download my Jobs profile when that shut down.

So long and thanks for all the fish.

0
28

How can you be so sure that our content will be properly attributed on the ChatGPT end? Even if they intend to, intentions can fail. ChatGPT is intended to reject certain queries, but as we know we can 'jailbreak' ChatGPT and receive answers to queries intended to be blocked. So no 'guarantees' can be made in terms of ChatGPT responses as far as I can tell.

You assert that attribution will be given... but is there any concrete evidence to back that up, given ChatGPT has a poor track record of giving correct attribution, let alone attribution at all?

I would like to allude to How not to be a spammer of all places:

Don't tell – show! The best way to avoid being seen as a snake-oil salesman is to demonstrate a solution, rather than simply asserting that the problem can be solved.

So can you show/prove to us that attribution will be given, rather than just asserting so?

And what recourse do we have if our content is being reproduced by ChatGPT, and the attribution we were promised was not given?

3
  • 1
    A promise is non-binding and depends on how much faith you have in the other side. Some users still might have enough faith in the SO brand to believe them if they promise not to do something, whatever they actually end up doing. Commented May 10 at 7:50
  • @NoDataDumpNoContribution So for all we know, the promise of attribution is just an empty promise in an attempt to make us 'feel better' or something.
    – CPlus
    Commented May 10 at 16:22
  • 1
    Yes, it's a half-truth at most. They probably care about attribution, but only in cases involving someone they did not sell the content to, and they tell us everything they believe we want to hear. They conveniently forget to mention aspects that we could see as critical. I mean, that's how almost everyone does it (adapt the message to the audience). Do you believe people always tell you the truth when it's about business? Commented May 10 at 19:28
27

So let's say I ask OpenAI a question that I know the answer to and have actually answered on this site. OpenAI spits out a response, and it clearly starts off with my correct answer, but then hallucinates up some random details and gets it all horribly wrong.

Can I flag that up with OpenAI? Will Stack Exchange see the flag? Should I flag it up on the relevant Meta here instead as a nudge for SE to likewise nudge OpenAI to fix the bug?

2
  • 2
    Easy: no/no/cannot be fixed even if anybody wanted to Commented May 14 at 8:17
  • Anybody using the current generation of AI tools should be aware that hallucination is a possibility. They should always be checking the output for correctness before relying on it. That is their problem, not yours. If they can't be bothered to do those checks, then they should get information from carefully curated human-generated sources instead. Humans are still capable of hallucination, mind you (ask any religious or political zealot you know), and still need to be checked for accuracy. Critical thinking and fact checking skills are always necessary. Humans make mistakes too.
    – Some Guy
    Commented May 20 at 22:42
21

Since a number of users seem to be under the impression that they can prevent their posts from being scraped by OpenAI by deleting or vandalising them, it feels pertinent to ask: how much of Stack Overflow has OpenAI already scraped? Is it possible that my posts have already been fed into ChatGPT, and therefore, deleting them (hypothetically speaking) would be a waste of time?

For that matter, what happens if a Stack Overflow post is fed into ChatGPT and later deleted from SO (because, say, it's spam, or plagiarised, or both)? Would it be removed from the LLM as well? If not, I can easily see a vicious cycle where users post ChatGPT answers on SO, those answers get fed back into ChatGPT, and the model slowly degrades instead of improving.

As a final point, I'm not sure whether this has already been addressed, but is this only going to affect Stack Overflow, or are OpenAI going to scrape the entire network? As mentioned above, I've seen several users deleting or vandalising posts across the entire network, not just SO, in an attempt to prevent them from being scraped.

There are so few details that I can't tell how well this has been thought through, but my immediate instinct is "not very well".

7
  • 5
    That's a bit short-sighted. Once you write a question or an answer on an SE site, it's stored; if you delete or vandalise the post, it still exists on SE infrastructure. You can't control that.
    – user692942
    Commented May 9 at 8:59
  • 13
    You know the datadumps exist, right? Much easier to use that instead of scraping.
    – rene
    Commented May 9 at 9:08
  • 4
    I would be extremely surprised if there were no SE data in the GPT training datasets, eg en.wikipedia.org/wiki/Common_Crawl
    – PM 2Ring
    Commented May 9 at 9:58
  • 1
    @rene Nope, I forgot about those. Someone needs to let all the users defacing their content know about the datadumps as well, I guess...
    – F1Krazy
    Commented May 9 at 13:37
  • 3
    From someone who tried deleting his answers, and thought it through: I'm aware that those posts have already been scraped, but by updating them on the live site, the next scrape is likely to overwrite the previous scrape in the training data. So by editing, rather than deleting your answers, you are more likely to have them effectively removed from OpenAI (and other) training data. Commented May 9 at 16:43
  • "Is it possible that my posts have already been fed into ChatGPT..." I would be very surprised it that wasn't so. However, your posts may have been fed illegally into ChatGPT making them vulnerable to get sued (by you) while now with this partnership the usage is on more firm ground (legally). Also there are still your future posts, which can or cannot be fed to ChatGPT. Commented May 10 at 8:01
  • 1
    " those answers get fed back into ChatGPT, and the model slowly degrades instead of improving" It can't get much dumber as it is. There is hype, damned hype and there is ChatGPT. There is no Artificial Intelligence there, just a deluxe chat bot for the purpose of killing time.
    – Lundin
    Commented May 13 at 6:36
21

Just comparing how the community reacted to banning GPT vs. the partnership with OpenAI shows how it feels about this relationship and these two different products!

I’m not going to talk any further about attribution, but the future of this community really concerns me! It seems to me that generating anything (even a simple reworded message), for any purpose (from a simple search aid to a complete answer), kills the TRUST in SO that took years to build, and people tend not to contribute when there is a lack of trust.

Fun fact: there is a little banner above this text that reminds me of something :)

20

"Do the people who actually contributed the answers get anything out of this deal (attribution? access to the trained model? part of the profit?) – Aykhan Hagverdili"

I saw this comment, and it's the one thing many would like to know.

5
  • 2
  • 8
    To the best of my understanding, the value SE users get out of this is that the company is claiming they will spend the income to invest in the public platform, which is something that is sorely needed.
    – Catija
    Commented May 9 at 14:52
  • 1
    Very, very probably not directly as in "you contributed X% of the training material, here are Y% of the profits". Commented May 10 at 7:59
  • 1
    Pretty sure we will get nothing out of it and will need to pay OpenAI if we want to use the model trained on data we contributed to Commented May 10 at 11:02
  • I never expected to get anything out of posting on SO. I provided work to the general public, for free, and SO provided me website hosting and the tech to make my questions and answers available to a wider public audience, for free, which maximizes the number of people that can be helped by what I post. That has always been the deal. I don't know why you would think that would change with AI. If you want to monetize your questions and answers, get your own site, pay for your own hosting and site design, and monetize away! Personally I am happy that SO distributes information for me for free.
    – Some Guy
    Commented May 20 at 22:49
18

This is incredibly sad. I'm not a contributor here, but I've learned SO much from this resource and value it highly. I'm putting a lot of effort right now into trying to protect original content from being scavenged, and I thought Stack Exchange would be against this for some reason. Apart from the obvious issue of people's original thoughts and expertise being used to 'train' LLMs, has anyone noticed how bland and soulless A.I. content is? Even when the information is technically correct and not mangled, the tone is so off-putting - 'it's important to understand... blah blah blah'. Maybe we should have two internets: one where all the bots talk to each other and everything is an algorithm, and one for humans?

3
  • 4
    rfc3514 would like to know your location
    – starball
    Commented May 9 at 1:30
  • 1
    It's soulless, but do you care if your problem gets solved faster? That will really be the question of the future. Do people value interaction with other real humans so much that they will live with slower reactions and maybe even less precision (on average)? Or will people use AI selectively where it helps them, leaving even more free time to personally contact other humans? Or will humans become isolated, with AI as their only contact (especially for lonely people), as sad as that sounds. Commented May 10 at 7:53
  • I'm not seeing your point here. If the AI doesn't work well, then people won't use it, and will use the SO site's human answers instead. People will be able to use whichever interface to that information that they feel works best for them (not necessarily the interface that YOU would imagine would work the best for them). Some people may be able to access the info more efficiently using an AI as a front-end, and may be willing to tolerate occasional AI mistakes (as opposed to human mistakes). Why is this sad? (We could travel to meet experts and ask them face to face, too, but we don't...)
    – Some Guy
    Commented May 20 at 23:01
16

Regarding attribution, the Stack Overflow team used to be very clear about the value of it:

https://stackoverflow.blog/2010/08/11/defending-attribution-required/

Please help us defend your right to have your name and source attached to the content you've so generously contributed to our sites. We will absolutely do our part, but many hands make light work

Does Stack Overflow still care about attribution? Will my many contributions to Stack Overflow over the years be correctly attributed, when used by AI products trained on those contributions?

2
15
  1. This violates the user agreement as well as the content license, and if this sees the light of day, you will be on the receiving end of a very large class action from a very informed group of people.

  2. If this goes forward, giving out content to be consumed by GenAI is to admit that you do not own the content, and thus there is no enforceability for protecting SO Network content from simply being copy pasted to a separate platform without attribution.

Proper attribution is defined in the license agreement, and that is what must be provided. There needs to be direct citation for works used; that is the license. Without direct citation, it is plagiarism unless the entirety of the work is remixed solely from one source and the source is referenced (which isn't the case here). AI remixes from multiple sources, and therefore it must explicitly cite those sources discretely in order to abide by the license.

Curious about the terms that apply to your content? Download it! Find the date your post was created, look up the ToS from Stack Overflow itself here: https://web.archive.org/web/20150701000000*/https://stackexchange.com/legal/terms-of-service .
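The lookup suggested above can be scripted against the Internet Archive's availability API, which returns the archived snapshot closest to a given timestamp. A hedged sketch: `archive.org/wayback/available` is a real public endpoint, but which snapshot it returns depends on what the Archive happened to crawl; to stay self-contained, the code only builds the query URL rather than fetching it.

```python
from urllib.parse import urlencode

WAYBACK_API = "https://archive.org/wayback/available"
TOS_URL = "https://stackexchange.com/legal/terms-of-service"

def tos_snapshot_query(post_date_yyyymmdd: str) -> str:
    # The availability API returns the snapshot closest to the timestamp,
    # i.e. roughly the ToS version in effect when the post was created.
    return WAYBACK_API + "?" + urlencode(
        {"url": TOS_URL, "timestamp": post_date_yyyymmdd}
    )

# Example: a post created on 15 June 2019
query = tos_snapshot_query("20190615")
print(query)
```

Fetching the resulting URL (e.g. with `urllib.request`) returns JSON whose closest-snapshot entry links to the archived ToS page for that date.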

10
  • 11
    Can you link or quote where the partnership violates the license?
    – rene
    Commented May 8 at 22:29
  • 7
    @rene See stackoverflow.com/help/licensing CC BY-SA requires attribution. LLMs and attribution are fundamentally incompatible, and they clearly have no idea what they are talking about when they pretend that they can give attribution. OpenAI would rather ask for forgiveness than permission, and SE cares little about licenses.
    – Gantendo
    Commented May 9 at 14:19
  • 10
    @Gantendo the content is dual-licensed. SE also has a license to our content, and that license doesn't require attribution. I'm not arguing whether I agree with all these shenanigans or whether I trust one tech company over another; I'm just trying to get it straight whether this violates anything. I don't believe it does, so this argument is not going to win it. We need to come up with a better one, potentially one that will survive in court.
    – rene
    Commented May 9 at 15:05
  • 5
    @rene but OpenAI used data from SE before there was any partnership/agreement between SE and OpenAI. They operate on a "better to ask forgiveness than permission" model. They admitted to using the Common Crawl data, and stackoverflow is one of the domains in the dataset. commoncrawl.org/blog/…
    – Gantendo
    Commented May 9 at 15:59
  • 3
    But that OpenAI used your content is a problem between you and OpenAI, not something SE can or needs to fix. What SE can do is use their license to your data to get reimbursed for OpenAI's use of the body of knowledge going forward, so both SE and its communities get somewhat compensated: SE in money, the community by having more and better features on the public platform as a result.
    – rene
    Commented May 9 at 16:30
  • 5
    @rene - Been over this every other time this comes up. Proper attribution is required in order for the license to be honored, which GenAI doesn't do. It is a clear violation. There are numerous lawsuits which are about to become legal landmark cases, and this legal basis will be used here against Stack Exchange should it come to that.
    – Travis J
    Commented May 9 at 18:02
  • doesn't a post getting edited change the license on the latest rev to the most recent content license? I.e. is it not last activity date instead of creation date that matters?
    – starball
    Commented May 14 at 20:50
    @super-starball-ultra - Only if new content was added would the new content itself fall under the current ToS contract. The revision date would be relevant to the content contract. For example, changing one character at the end of a long post would not make the whole post subject to the newest ToS contract, just as an employee changing everyone's last-activity date would not change the revision dates.
    – Travis J
    Commented May 14 at 21:48
  • The consensus on generative AI is that the use of the training data is transformative and thus constitutes fair use in the US: arl.org/blog/…. Even Creative Commons themselves consider it to be fair use: creativecommons.org/2023/02/17/fair-use-training-generative-ai.
    – Poscat
    Commented May 15 at 9:54
  • 1
    @Poscat - The problem there is that on edge cases, of which Stack Overflow has an abundance, AIs are not trained with the depth that would be desirable. Quite the opposite, and as a result, frequent verbatim reproduction occurs, especially when it comes to code. Training is rather benign so long as it is never used to generate. However, when generation occurs and that generation contains verbatim reproduction, then the training does come into question with regard to sourcing. If it was trained (sourced) on material which was licensed and is later plagiarized, then it will have harmed that author.
    – Travis J
    Commented May 15 at 19:28
14

Update after @Rosie answered:

I still don’t understand why Stack Exchange is in a partnership with OpenAI. I feel like I’ve been cheated. Fortunately, I spent more time editing posts than asking or answering questions. But what about the users who contributed (asked or answered) a great deal on Stack Exchange?

I’m actually not familiar with the Terms of Service of Stack Exchange, but I felt hopeless when I saw this in the comments and a question. It looks like the users who contributed a lot can’t do anything; deleting their accounts or posts is useless!!!

the Terms of service give Stack Overflow license to use your post content, even if they are deleted. -AMtwo

note that deleting your account does not result in deletion of your content. -super-starball-ultra

You can also see this question for some information: How can a user unhappy with the LLM partnerships protest constructively (and ideally- effectively)?

I feel like Stack Exchange has now started disrespecting its users’ values and efforts.

I hope the management can do better than going into partnerships with such companies just to get some benefits. These benefits won’t last long. Who will believe Stack Exchange from now on?


For my part, I no longer trust Stack Exchange’s management.

I don’t understand what this stupid announcement is for, but one thing it confirmed for me is that Stack Exchange is in a very bad state currently.

4
  • 2
    "I feel like now". the things you're complaining about (licencing, account deletion) have been like that for a long time now... if you want bad, see meta.stackexchange.com/a/394774/997587
    – starball
    Commented May 8 at 16:23
  • @super-starball-ultra I hadn't seen that before. Commented May 8 at 16:24
  • 1
    Why is simple. SO was purchased by investors who think -- or think they can convince others -- that AI is something other than a stupid fad. Commented May 11 at 19:11
  • Please don't edit just to bump, it's not what edits are for. Commented May 14 at 17:08
5

I despair ... SO has been on strike over ChatGPT use, and the company just goes for more!!!

8
  • 4
    To be fair, the strike was about how SO didn't let mods handle ChatGPT content, and that was resolved. So, this is a different case, though more severe in some aspects. Commented May 16 at 6:49
  • @ShadowWizardLoveZelda Was that really resolved? The moderators went back to work, but I've not seen any statistics about the number of deleted or recognized AI-generated posts since then, nor the list of heuristic criteria that are allowed to be used. I guess everyone can have an opinion of their own on whether the strike really was successfully resolved. Commented May 24 at 9:59
  • 3
    @NoData suspending the second top user of Stack Overflow for indirect usage of GenAI (not even necessarily ChatGPT) is proof enough. If the company had wanted to keep their initial approach, they would have intervened and let the user keep doing that, but they didn't, letting the site moderators handle it their way. And it worked. Commented May 24 at 10:59
  • @ShadowWizardLoveZelda Proof may be too strong a word, but yes, it seems like an example of action being taken. On the other hand, I imagine that the second top user on SO, at some point after not getting caught a thousand times, had so many potential cases that he/she simply admitted the use of GenAI. This case might be exceptional and not the best for proving the effectiveness of the policy in general. I remember people saying that there are huge backlogs of flags and that the current rules hold moderators back. Commented May 24 at 12:05
  • @NoDataDumpNoContribution You will not see the list of approved heuristics as it is kept private. See meta.stackexchange.com/q/391990 Commented May 29 at 18:32
  • @ResistanceIsFutile I understand, but it's not ideal. How can I know if this list makes sense, or whether it is rather restrictive or too generous? I can't form an opinion, so it simply comes down to how much trust I have in the company and mods, which is debatable. And whenever I cannot know something, I automatically get a bit suspicious. The company and mods may have kicked the can down the road; they may not have fully solved the problem that caused the strike originally, but simply settled on some kind of good/bad compromise. To me it's not clear if the strike issues really were resolved. Commented May 30 at 6:28
  • @NoDataDumpNoContribution Well, I understand that you don't trust the company. If you don't trust the mods either, then that is the problem. But the heuristics project started because the company thought moderators were not adequately moderating AI content (removing too much), so you have two opposing sides working on something, so it should be balanced. Commented May 30 at 6:40
  • @ResistanceIsFutile Balancing can come out more on one side or the other. And I don't know much there, so my instinct is to mostly trust what I can see for myself, which I think isn't a problem in itself. Maybe presenting more statistics would help. Anyway, I just wanted to say that everyone can decide on their own whether they think the strike was successfully resolved. I think it might not have been, because the company might have tilted the balance in their favor. Commented May 30 at 9:17
-11

Long ago we tended crops, hunted, traded, and crafted. It was hard work, but we took pride in what we did. We built a civilization in spite of this labor. Then the machines came, and put metal between us and the fruits of our labor. We were afraid then, as we are now, that those machines would replace us. What shall we do if we can't farm, hunt, and craft?

But humans weren't replaced, and we certainly didn't run out of things to do. The nature of work changed. Humans remained central to civilization. Our tools changed, and just in time, I might add. Our population exploded. Our needs evolved and became more complex. We needed better tools. Those machines — that metal — did not remove the humans from humanity. It helped us to care for a burgeoning population, which is something we struggle with to this very day.

I feel those echoes today with artificial intelligence. It feels like 150 years ago, when the machines first took hold, when we were afraid of how we would make a living. If there is one thing I've learned in 20 years of writing code, it is that demand for software has not gone down. It has only gone up. We need better tools. Artificial intelligence is the next iteration of The Machine. No longer made from metal, our silicon hammers will do for modern civilization what our metal hammers did 150 years ago.

I guess this is a long-winded way of saying I'm thankful for two specific things from Stack Exchange:

  1. Defending their business and our community from being polluted and diluted by the first generation of large language models. LLMs are amazing, but let's be honest, they essentially regurgitate the Internet, which is already mostly vomit. There are a few strongholds of good content, and I believe SE is one of them. Banning LLM-generated answers was essential to protect contributors and the company.

  2. Not sticking their heads in the sand. AI ain't goin' away. This is revolutionary technology, and it will change the nature of work, and how we interact with our machines. At least Stack Exchange is engaging this revolution to help shape it, rather than being trampled by it. This is going to happen, folks. The best thing we can do is get a seat at the table, and that's what SE is doing.

I completely understand the emotional response from people. Every one of us is hand-crafting code now. We are no different than the village craftsmen from 200 years ago. We care about what we do. We feed our families and put roofs over our heads, and these large companies come along and vacuum up our hard work so it can be vomited back up for a profit. Remember that AI is new. They (OpenAI) Got It To Work™ and now they need to make it suitable for the real world. Partnerships with content creators who realize the value of that content, who realize what it takes to generate trustable content, need to be part of this revolution.

OpenAI gave us a tool which simulates a conversation. Partnerships with companies like SE, media, and governments will polish the rough edges of this tool. It is unknown territory, and that scares the daylights out of us. Now ChatGPT has a concrete use case for proper attribution. It's a problem they need to solve, and lord knows every one of us will point out its flaws. We will QA test the crap out of this thing. It needs it.

So, I for one, welcome our new AI overlords... cordially but not enthusiastically.

C’est la vie.

15
  • 11
    We aren't getting a seat at the table. SEI is getting a seat at the table, after contributors paid the bill.
    – Conrado
    Commented May 11 at 13:56
  • @Conrado: contributors never got paid. That was never in the plan, nor was that ever promised. If people want to get paid, don't post here. It might sound harsh, but it was that way from the beginning. Nothing has changed in that regard. Commented May 11 at 17:27
  • 21
    @GregBurghardt It's not about getting paid. The point is that the users created all of the content here - they made the content that gives SO the ability to make huge deals like this. Since users are the reason the company can even make money on the content, it's reasonable for them to feel like they have some voice in how the content is used. Particularly if there's fear that the decisions being made will eventually harm the platform and the company isn't even bothering to communicate plans or assuage user fears.
    – Catija
    Commented May 11 at 20:10
  • 4
    Users were okay not getting paid BECAUSE attribution was assured and no profit was allowed to be made off our content. If that changes now, then users are allowed to feel differently.
    – Milind R
    Commented May 13 at 14:16
  • 1
    People are allowed to feel however they want on this subject. The trouble is, Stack Exchange is a for-profit company. They were always making money off our contributions. What makes me upset about this has nothing to do with Stack Exchange. It has everything to do with large companies vacuuming up our hard work so they can make a soul-crushing money machine. At least for me, that's why I am not excited about AI. It has nothing to do with AI, and everything to do with feeling like I am being sapped dry, and I don't think SE is the one doing the sucking. Commented May 13 at 14:21
  • 3
    @MilindR Anyone who felt that way, unfortunately, fundamentally misunderstood the nature of posting content here then. There are lots of legitimate reasons to feel frustrated, but commercial use (and profiting off of) our CC-by-SA contributions is explicitly allowed by the license. I certainly don't mean to excuse the unacceptable communication from SE the company; "no profit was allowed to be made off our content" is just categorically untrue, however.
    – zcoop98
    Commented May 13 at 17:14
  • 11
    It's not "revolutionary technology". It's the latest bubble, with a bunch of techbros doing their best to hype up the 'limitless possibilities' to convince the venture capitalists to invest one more round of capital before they cash out and leave everyone else holding the bag. Commented May 14 at 4:37
  • @Shadur-don't-feed-the-AI I'm afraid you're completely wrong. Companies are not spending astronomical amounts on AI for the fun of it. Leaders are well aware that engineering often represents the highest cost to their companies and is extraordinarily expensive. AI represents a golden opportunity to virtually eliminate this cost. Our demise is coming, and very soon. Look at the current capabilities. AI is already far more efficient at writing code than many developers combined, and it doesn't even need developer guidance! Look at the studies being conducted and the sobering results. Commented May 23 at 17:57
  • 1
    @java-addict301, "writing code" and "designing software to meet customer requirements" are entirely different skillsets. AI is good at writing code, but pretty darn lousy at turning a customer's plain language description of a business process into a cohesive software ecosystem that performs well, is easy to maintain, and is secure. Commented May 23 at 18:49
  • 1
    To my point, how many junior developers could write an order of magnitude more code in one day than an old-timer? And how many of those same junior developers wrote useful code that was secure, easy to maintain, and actually implemented the customer's requirements? We aren't taking the human out of software engineering any time soon. AI will be a tool, just like a good IDE or a keyboard. Commented May 23 at 18:51
  • They're spending astronomical amounts on AI because they're being promised that Very Soon Now, AI will be "Good enough" to replace employees entirely, if only they support development with another funding round. It's a bubble. Commented May 23 at 19:22
  • @GregBurghardt I'm afraid us developers are sorely underestimating the capabilities AI will have in a year or so. Some companies are reporting that AI is already replacing whole teams of developers. Good luck to you all. Gentlemen, it has been a privilege playing with you tonight. Commented May 25 at 18:16
  • @Shadur-don't-feed-the-AI who are the ones spending astronomical amounts on AI? Hint: It's tech people - those who know what AI is capable of and are developing it to replace developers. They're some of the smartest engineers in the world, and they aren't playing! Commented May 25 at 18:20
  • They really, really aren't. They're vulture capitalists and con artists -- and they've already flat out admitted they can't make LLMs profitable unless they're allowed to steal everyone else's work without having to worry about attribution or compensation. It's a scam. Always has been. Commented May 25 at 20:33
  • @Shadur-don't-feed-the-AI attribution for what? your open source project that the AI drew inspiration from (just like any other developer)? AIs aren't copying content - they're learning from the content people intentionally made public for learning purposes. Are you saying the CEO of OpenAI, Google, or Microsoft and other tech firms are venture capitalists and con artists? Not following. Commented May 27 at 15:20
-52

There are a lot of questions and comments that have come up around attribution. Attribution is something that we believe strongly in. Having credit attributed is a non-negotiable for us, and is a critical part of any and all partnerships of this type. There aren’t specific details yet because the work is just starting, but making sure attribution is happening (in a license-compliant way) is a commitment we require and have received from our partners. This is the very heart of socially responsible AI.

15
  • 26
    This makes me more comfortable about the liability accrued by OpenAI's actions (which, unless copyright law is changed, Stack Exchange can not afford to take on). Unless they have secret technology they haven't released (not unlikely), OpenAI does not have the capacity to provide attribution in their AI systems. Are all parties comfortable with a partnership where one of the partners isn't allowed to do anything?
    – wizzwizz4
    Commented May 6 at 17:11
  • 85
    Why isn't this in the body of the question post?
    – zcoop98
    Commented May 6 at 18:09
  • 22
    Will this really be attribution, or will the output of ChatGPT just be used to find random posts that happen to resemble the answer? I suspect the latter. Commented May 6 at 18:29
  • 16
    This statement is very weak, especially the "in a license-compliant way". The attribution clauses in the CC licenses have a lot of ambiguity and the "reasonable manner based on the medium, means, and context" leaves outs for the person receiving the content. It's very possible to be fully compliant with the license with respect to attribution and have a very weak form of it. Commented May 6 at 21:30
  • 56
    I will be expecting my username to be attributed to every output made by this AI model (which itself will be licensed CC-BY-SA) assuming any content I have contributed is used to train it, as dictated by CC-BY-SA. A note giving attribution to SO/SE as a whole is not enough, it has to be my username, otherwise it is a breach of license and I will take appropriate action and issue takedown notices to stop any parties violating my copyright. CC-BY-SA is very explicit about this. I doubt I'm the only one going to do this. Are you going to comply with this?
    – Kryomaani
    Commented May 6 at 22:22
  • 45
    Please also remember that according to section 6 of the CC-BY-SA license, any breach of the license not cured within 30 days of you being informed of it will terminate the license. This would render displaying affected contributions on any of your sites and using your AI illegal, with or without attribution. I promise that I will do everything in my power to see my copyright upheld, and I urge all users to do the same. This kind of willful ignorance of copyright on the part of SE should not be tolerated.
    – Kryomaani
    Commented May 7 at 1:30
  • 5
    "There aren’t specific details yet because the work is just starting..." It would be very nice if more specific details could be shared when the work has progressed more, just to be more assured. Commented May 7 at 6:39
  • 30
    "There aren’t specific details yet because the work is just starting...". It is pretty scary (to say the least) that these kinds of core questions were not answered prior to any agreements and any announcements. When such a dividing topic is announced, I think we could expect to have some FAQ with explicit answers about such obvious issues for SO/SE users Commented May 7 at 8:47
  • 13
    Problem is, we heard the same assurances lots of times already for many of the past projects that crashed and burned. Another "we don't have anything to share and work is just starting but don't worry, it will be totally fine when the feature is rolled out" doesn't cut it at this point. I'd bet a full year's salary that this will not be delivered in a way that satisfies the community, and probably not even in a way that satisfies the letter of the license.
    – l4mpi
    Commented May 7 at 10:45
  • 20
    And also, the "socially responsible AI" thing is a bad joke. What's supposed to be responsible about wasting tons of energy and resources on an inaccurate text generator that (assuming this will be based on GPT-4 or later) has a foundation stemming from egregious data slurping and copyright abuse, and includes all manner of problematic content in its training data (racism, conspiracies, etc).
    – l4mpi
    Commented May 7 at 10:51
  • 5
    So attribution for the training data should be given to the person who posted an AI-generated answer, which was itself obtained by training the AI on data for which attribution should be given... Why doesn't OpenAI see that this is a loop that only serves to make their AI increasingly dumber with every iteration? SO the company was first insisting that AI-generated posts are good for the site, only stepping back after massive criticism. And now you are trying to sell access to the data which the company itself has actively encouraged to get polluted with AI content. What's the long-term plan here?
    – Lundin
    Commented May 7 at 11:11
  • 30
    @Rosie, I find it hard to believe that Stack & OpenAI have gotten as far as announcing the partnership in a joint press release if the terms & conditions of the data licensing were not already sorted out as part of any contractual agreement.
    – AMtwo
    Commented May 7 at 19:43
  • 8
    As an additional point, AI uses vast amount of energy. I'd like to see a commitment that OpenAI can't use SO training data on servers not powered by renewable energy under any circumstances.
    – Copilot
    Commented May 10 at 13:01
  • 2
    When I use Windows Copilot or Bing Chat, all of the code coming from Stack Overflow is clearly linked there; that's exactly how I've found many answers. So I don't know why some people are worried about attribution. One of Windows Copilot's main features is showing the source of the information so we can verify it and not blindly trust whatever it tells us.
    – SpyNet
    Commented May 14 at 17:41
  • 2
    @Kryomaani "Please also remember that according to section 6 of the CC-BY-SA license, any breach of the license not cured within 30 days of you being informed of it will terminate the license." did you post that as an answer? I think more people need to know about all this. Commented May 14 at 18:20
