https://meta.stackexchange.com/q/388551/178179 mentions that SE will force some firms to pay to be allowed to train an AI model on the SE data dump (CC BY-SA licensed) and make commercial use of it without distributing the model under CC BY-SA.

This makes me wonder: Is it illegal for a firm to train an AI model on a CC BY-SA 4.0 corpus and make commercial use of it without distributing the model under CC BY-SA?

I found https://creativecommons.org/2021/03/04/should-cc-licensed-content-be-used-to-train-ai-it-depends/:

"At CC, we believe that, as a matter of copyright law, the use of works to train AI should be considered non-infringing by default, assuming that access to the copyright works was lawful at the point of input."

Is that belief correct?

More specifically, regarding the share-alike clause in CC licenses: from my understanding of https://creativecommons.org/faq/#artificial-intelligence-and-cc-licenses, it is legal for a firm to train an AI model on a CC BY-SA 4.0 corpus and make commercial use of it without distributing the model under CC BY-SA, except perhaps when the output is shared. This raises two questions: Is the output of an LLM considered an adaptation or derivative work under copyright? And does the "output" in the flowchart below mean LLM output in the case of a trained LLM?

[Image: flowchart, not reproduced here]

1 Answer

The flowchart included in the question is trying to summarize a rather large amount of legal uncertainty into one image. It must be emphasized that each decision point represents an unsettled area of law. Nobody knows which path through that flowchart the law will take, or even if different forms or implementations of AI might take different paths. The short and disappointing answer to your question is that nobody knows what is or isn't legal yet.

To further elaborate on each decision point:

  • The first point is asking whether the training process requires a license at all. There are two possible reasons to think that it does not:
    • AI training is protected by fair use (see 17 USC 107). This is a case-by-case inquiry that would have to be decided by a judge.
    • AI training is nothing more than the collection of statistical information relating to a work, and does not involve "copying" the work within the meaning of 17 USC 106 (except for a de minimis period which is similar to the caching done by a web browser, and therefore subject to a fair use defense).
  • The second point is, I think, asking whether the model is subject to copyright protection under Feist v. Rural and related case law. Because the model is trained by a purely automated process, there's a case to be made that the model is not the product of human creativity, and is therefore unprotected by copyright altogether.
    • Dicta in Feist suggest that the person or entity directing the training might be able to obtain a "thin" copyright in the "selection or organization" of training data, but no court has ever addressed this to my knowledge.
    • This branch can also be read as asking whether the output of the model is copyrightable, when the model is run with some prompt or input. The Copyright Office seems to think the answer to that question is "no, because a human didn't create it."
  • The third decision point is, uniquely, not a legal question, but a practical question: Do you intend to distribute anything, or are you just using it for your own private entertainment? This determines whether you need to consult the rest of the flowchart or not.
  • The final decision point is whether the "output" (i.e. either the model itself, or its output) is a derivative work of the training input.
    • This would likely be decided on the basis of substantial similarity, which is a rather complicated area of law. To grossly oversimplify, the trier of fact would be shown both the training input and the allegedly infringing output, and asked to determine whether the two items have enough copyrightable elements in common that copying can reasonably be inferred.
  • "except for a de minimis period which is similar to the caching done by a web browser, and therefore subject to a fair use defense" But that fair use defense is contingent on caching being noncommercial and having "minimal impact on the potential market for the original work", which is not true of commercial language models used to generate content that mimics the original.
    – endolith
    Commented Nov 29, 2023 at 17:10
  • @endolith: Incorrect, fair use is not inherently contingent on being noncommercial. See Campbell v. Acuff-Rose Music, Inc., Authors Guild, Inc. v. Google, Inc., and several other cases. Furthermore, OpenAI is ultimately controlled by a nonprofit entity, so it may not make a difference in their case.
    – Kevin
    Commented Nov 29, 2023 at 20:23
  • Yes, but it's one of the four factors and is cited in the rationale for why web browser caches are fair use. OpenAI Global, LLC is a for-profit corporation and ChatGPT is a commercial product.
    – endolith
    Commented Nov 29, 2023 at 22:21
  • @endolith: No, it is part of one of the four factors, which in full is the "purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes." But under US case law, this factor also includes whether and to what extent the use is transformative. The AI companies can somewhat credibly argue that the model is significantly transformative, because it (the model, not its outputs) serves a fundamentally different function from the training works. This would mitigate any commercial motive the court might ascribe to the defendants.
    – Kevin
    Commented Nov 30, 2023 at 18:39
