For a while now, there has been a lot of discussion about large language models such as ChatGPT from OpenAI. A major issue has been whether newly trained, open-source models can be used in a commercial setting if they were trained on synthetic data generated by OpenAI.
I read through all the related materials from OpenAI that I could find:
- https://openai.com/policies/terms-of-use
- https://openai.com/policies/service-terms
- https://openai.com/policies/sharing-publication-policy
- https://openai.com/policies/usage-policies
The only commercially restrictive "legal speak" that I can find is in the Terms of Use, section 2c.iii:
[You may not] use output from the Services to develop models that compete with OpenAI
That part is clear: you may not use the output to develop models that compete with OpenAI. But what does that say about the data itself?
As far as I can tell, that does not imply that the license of a dataset must change (in the case of modifying, augmenting, or updating an initial dataset and redistributing it), nor that a dataset cannot be licensed commercially. The Terms of Use are very specific in that respect: the restriction applies to (1) developing models (2) that compete with OpenAI. Data is not mentioned. So if you generated a dataset with OpenAI tools, I believe that you can still use that data in a commercial setting - just not specifically to build a model that competes with OpenAI's services.
Is my interpretation correct, and is it therefore still allowed to publish fully open datasets? As mentioned in the Sharing & publication policy, it is of course a good idea to state in a README file (or similar) that the content was synthetically generated, and to attribute OpenAI for the use of their tools.