For a while now, there has been a lot of discussion about large language models such as OpenAI's ChatGPT. A major issue has been whether newly trained, open-source models can be used in a commercial setting if they were trained on synthetic data generated with OpenAI's tools.

I read through all the related OpenAI materials that I could find.

The only commercially restrictive "legal speak" that I can find is in the terms of use, 2c.iii:

[You may not] use output from the Services to develop models that compete with OpenAI

That is clear: the models that you create cannot be commercially usable. But what does that say about the data itself?

As far as I can tell, that does not imply that the license of a dataset should change (in the case of changing, augmenting, or updating an initial dataset and redistributing it) or that a dataset cannot be commercially licensable. The Terms of Use are very specific in that respect: 1. models may not be developed; 2. these models may not compete with OpenAI. But they do not mention data. So if you generated a dataset with OpenAI tools, I believe that you can still use that data in a commercial setting, just not to build a model that competes with OpenAI's services.

Is my interpretation correct, and is it therefore still allowed to publish fully open datasets? As mentioned in the Sharing policy, it is of course a good idea to state in a README file (or similar) that the content was synthetically generated and to credit OpenAI's tools.

1 Answer

The premise of "open-source" is a bit imprecise, since some such licenses preclude "commercial use", so first read the license. I would have to say "you may not use my model for commercial purposes" if that is my intent; I assume you aren't talking about a licensing restriction which I impose on the model that I create. I also assume that I did something in the course of creating the model that constitutes "originality", i.e. it is not just an automatic computer response to an automatic input (I have to select some of the training data).

Then it becomes a licensing question regarding your use of OpenAI to create the training data. The program and its use are subject to copyright protection, and I agree that OpenAI has not prohibited use of their software for commercial purposes. The data itself is not (under current law) protected by copyright, because it was automatically created by a computer. An exception is the UK, where §9(3) of the Copyright, Designs and Patents Act 1988 provides: "In the case of a literary, dramatic, musical or artistic work which is computer-generated, the author shall be taken to be the person by whom the arrangements necessary for the creation of the work are undertaken", so the data may be protected there.

Nevertheless, the wording of the documents offered as terms of use etc. indicates, by a preponderance of evidence, that OpenAI permits copying of data created by their program. That would thwart any attempt on their part to prohibit you from using their bot to create data that is commercially exploited, leaving aside the "no-compete" clause that you mentioned. Of course, you have to hire a lawyer to do a specialized analysis of your proposed use, but that is the general outline of relevant copyright law.
