Commercial usage of sentence-transformers/multi-qa-mpnet-base-cos-v1 [closed]

Question

Closed. This question needs details or clarity. It is not currently accepting answers.

Closed 12 months ago.

We are trying to use a Hugging Face embedding model, multi-qa-mpnet-base-cos-v1, for an internal application powered by a large language model. The model's documentation says (not verbatim):

Some of the datasets used to train the multi-qa mpnet model are not licensed for commercial use.

For instance, GOOAQ and Yahoo! Answers.

For GOOAQ, it states:

NOTE: This dataset should not be used for any commercial purposes.

For one of the Yahoo datasets used for multi-qa-mpnet-base-cos-v1, in the readme.txt file, it states:

The original Yahoo! Answers corpus can be obtained through the Yahoo! Research Alliance Webscope program. The dataset is to be used for approved non-commercial research purposes by recipients who have signed a Data Sharing Agreement with Yahoo!.

multi-qa-mpnet-base-cos-v1 was also trained on MS MARCO, which has the same licence issues.

Does this automatically mean that the model itself is "tainted" and therefore we cannot use it for embeddings?

Before I use a flag to summon a mod: Is there a reason that this can't be moved to OSS SE (opensource.stackexchange.com)? This seems like it would be on-topic and easier to answer there. — Corbin, Commented Jul 5, 2023 at 21:45
@Corbin I don't think so; on that site, "Open" and "Open Source" refer to content that is licensed under an open source license, such as GPL, MIT, BSD, Creative Commons, etc. opensource.stackexchange.com/tour — Brandin, Commented Jul 25, 2023 at 13:52

MSalters · Accepted Answer · 2023-07-07 10:47:14Z

I've investigated this earlier in the context of an EU subsidy project. It's my belief that current law in both the US and the EU does not provide copyright protection for neural networks. The reason is simple: copyright is restricted to works made by humans, extending to derivative works (such as computer programs compiled from human-written source code). But works produced by fully automated means, such as neural networks and their outputs (for example, ChatGPT transcripts), are not covered by copyright law.

The question may be tagged "open-source software", but neural networks are not software. They certainly do not have source code, so that cannot be open.

It's quite likely that the dataset used to train the networks is partially copyrighted. That would make it impossible to reproduce the training, which could be a concern in academic research. For that purpose, non-commercial reproduction of the dataset is a valuable right. Again, this is not source and therefore not open-source software.
