What's the legal state of data learnt from open-source datasets?

Question

Is it legal to write software that "learns" from open-source datasets like Freebase or WordNet and to use what it learns to build an open-source knowledge-base system?

Freebase license (CC-BY) - Requires credit and an indication of whether changes were made.
WordNet license - Requires the same license for copies or modifications.

Both Freebase and WordNet are released under open-source licenses but require some kind of acknowledgment (or credit) for any change made to the dataset. My project will most likely learn from those datasets and will represent the data in a very different way.

Are there any restrictions on the data generated by "learning"? Do I have to acknowledge those sources?

Here is some information about databases in general: opensource.stackexchange.com/a/2001. It makes me wonder whether there are any restrictions on databases at all. - Moshe Simantov, Commented May 5, 2018 at 13:38

user6726 · Accepted Answer · 2018-05-05 16:18:31Z

There are two separate questions: whether the underlying databases are indeed protected by copyright, and whether your product has "substantial similarity" to the database. A database might be a phone book or some other automatically-generated thing, where there is no creativity involved in creating the database, and no copyright protection. It might also be the product of a painstaking process of judgment and data-arrangement based on reading the entire corpus of Ancient Greek, in which case it would be protected by copyright (even though the underlying text material is not protected).

It would likely be found to be infringement if you were to copy the Greek database and perform a mechanical format conversion. The test that would be applied is whether your product is substantially similar to the original in terms of its organization (and not its data). Because of functional requirements, there are only so many ways that a database can be constructed, so some degree of similarity would be logically necessary. This study from the copyright office may be of some use in figuring out where your plan falls, given the target databases.

Max Xiong · Accepted Answer · 2020-05-29 17:15:31Z

Reading the WordNet license, I do not see a copyleft condition. It is actually more like the MIT license.

This is a hard question to answer in general. There are two issues at hand here.

Firstly, it is possible that the format of a data file is copyrighted (not always the case, of course), so that code reading the data file might be considered a derivative work of the format. If the license allows it, e.g. if the database is free (as in free speech), this can be circumvented by converting the data to a different format for your own needs.

Secondly, the selection of data, or the data itself, can be copyrighted. When a program reads data, the program is usually not considered a derivative work of said data. If output is generated from the data, it usually inherits the copyright of the data. There may be exceptions, such as when the output is a summary of the data.

In your case, both licenses only require some kind of attribution and are not viral. As long as you comply with those requirements, you are fine. The only thing to watch out for is that there may be compatibility issues with the GPL.
