
Many corpora (= datasets containing texts) are not freely available, and/or their licenses do not allow redistribution or commercial use, and/or they impose share-alike terms. For example, the Linguistic Data Consortium (LDC) is notorious for selling corpora at hefty prices (despite being publicly funded).

Can I legally distribute word embeddings that I computed based on such corpora? And more generally, can I legally distribute any sort of machine learning model trained on such corpora?

In the case of a share-alike-licensed dataset, must all models trained on it be redistributed under the same or similar license?


I am mostly interested in the following locations:

  • California, United States
  • Massachusetts, United States
  • Paris, France
  • Seoul, South Korea
  • While I think the answer is that it's allowed, (1) I don't know whether you can reconstruct the original data from a word embedding (WE) and (2) there isn't a uniform license across LDC corpora.
    – user6726
    Commented Jun 25, 2016 at 0:04
  • @user6726 (1) One cannot fully reconstruct the original data, but one can infer some information the text contained, e.g. that word X and word Y tend to appear in the same context. (2) Most LDC corpora are governed by the LDC User Agreement for Non-Members, but I am interested in other licenses as well. Commented Jun 25, 2016 at 0:09
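
To make the comment above concrete: roughly what a word embedding exposes is distributional information, i.e. which words tend to share contexts. The toy sketch below (plain Python, with invented sentences standing in for any LDC corpus) counts such co-occurrences; frequently co-occurring word pairs can be inferred from the output, but the original running text cannot be read back from it.

    # Minimal sketch, not an actual embedding trainer: co-occurrence counts are
    # roughly the distributional information a word embedding encodes.
    from collections import Counter

    def cooccurrence_counts(sentences, window=2):
        counts = Counter()
        for sentence in sentences:
            words = sentence.lower().split()
            for i, w in enumerate(words):
                neighbors = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
                for neighbor in neighbors:
                    counts[(w, neighbor)] += 1
        return counts

    # Made-up stand-in corpus, not licensed data.
    corpus = ["the model was trained on licensed text",
              "the licensed text trained the model"]
    # Which words co-occur is inferable; the original sentences are not
    # recoverable from these counts alone.
    print(cooccurrence_counts(corpus).most_common(5))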

2 Answers


From what I can tell, the non-member agreement contains the core, common language governing what one can do with the data (databases with their own licenses aside). Reproducing the data is prohibited, but an analysis of the data should be consistent with the license. The core wording is:

User shall not publish, retransmit, display, redistribute, reproduce or commercially exploit the Data in any form

with exceptions for short excerpts.

If the analysis produces "546869732069732074657874", that would be a violation of the license, since that is just the original text in a different encoding, from which the text can be reproduced (a short decoding sketch follows the quoted clause below). At least one of the special licenses explicitly permits analysis that does not allow reconstruction of the text:

summaries, analyses and interpretations of the linguistic properties of the Data may be derived and published provided it is not possible to reconstruct the Data from such summaries
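
For the record, the hex string in the example above is simply an ASCII encoding of a short sentence; a one-line decode (sketched below in plain Python) recovers that sentence verbatim, which is exactly why publishing such output would amount to reproducing the Data rather than analysing it.

    # The hex string from the example above is merely the original text in a
    # different encoding; decoding it reproduces that text verbatim.
    encoded = "546869732069732074657874"
    print(bytes.fromhex(encoded).decode("ascii"))  # prints: This is text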

Another of the special licenses says something similar:

Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is not possible to reconstruct the information from these summaries.

but it uses the troubling term "information" rather than "data"; nobody knows what "information" is.

The MS Indian Language databases are more restrictive and prohibit distributing derivative works without permission, and the licensor might well deem a mapping of the original data to be such a derivative work. The COMLEX databases require permission to "redistribute any product or derivative work based on the Database" (emphasis added).


From https://www.reddit.com/r/MachineLearning/comments/7eor11/d_do_the_weights_trained_from_a_dataset_also_come/dq6m2su/:

We operate based on an interpretation (reviewed by lawyers but without precedent in local courts) that our copyright law (Latvia, EU) does not consider such weights to be derived works (since they don't include parts of the original work) but rather the equivalent of databases/fact compilations (e.g. the n-gram frequency model built from a literary work is a good illustrative example: it's a factual statement about the work). Under this interpretation, the models are independent works whose distribution doesn't require permission from the copyright owners of the source data; the models aren't copyrightable as creative works, but they are protected by the (lesser) set of rights granted to databases/fact compilations.

Contractual agreements supersede this: if we've agreed not to distribute models based on a particular dataset, then of course we're bound by that agreement, and such agreements are in place for some of our datasets. This doesn't include shrink-wrap/click-wrap/EULA licenses, which the user can refuse, especially since local law allows me to use (but not redistribute) copyrighted material for research purposes without permission or license; what counts here are the actual contracts/NDAs signed with, e.g., industrial partners for access to and use of their data.

Your mileage may vary, this is not legal advice, and it will likely be different in your country; the more litigious the society, the more careful you have to be.
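
To make the quoted n-gram-frequency analogy concrete, here is a minimal sketch (plain Python, with a made-up sentence standing in for any licensed work) showing that such a model is a table of counts about the text rather than a copy of the text:

    from collections import Counter

    def ngram_frequencies(text, n=2):
        # Word n-gram counts: statements of fact about the text
        # (e.g. "'the cat' occurs twice"), not a reproduction of it.
        words = text.lower().split()
        return Counter(zip(*(words[i:] for i in range(n))))

    sample = "the cat sat on the mat and the cat slept"  # stand-in corpus
    print(ngram_frequencies(sample).most_common(3))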

However, following recent AI breakthroughs, many companies are now suing AI firms for training models on their content without permission. E.g., https://www.bloomberg.com/news/articles/2023-02-17/openai-is-faulted-by-media-for-using-articles-to-train-chatgpt?srnd=technology-vp&leadSource=uverify%20wall:

News organizations aren’t the first companies to raise questions about whether their content is being used without authorization by artificial intelligence systems. In November, GitHub, Microsoft Corp. and OpenAI were sued in a case that alleged a tool called GitHub Copilot was essentially plagiarizing human developers in violation of their licenses.

In January, a group of artists sued AI generators Stability AI Ltd., Midjourney Inc. and DeviantArt Inc., claiming those companies downloaded and used billions of copyrighted images without compensating or obtaining the consent of the artists.

Like the Journal, CNN believes that using its articles to train ChatGPT violates the network’s terms of service, according to a person with knowledge of the matter. The network, owned by Warner Bros. Discovery Inc., plans to reach out to OpenAI about being paid to license the content, said the person, who asked not to be identified discussing a legal matter.
