
We are trying to use a Hugging Face embedding model, multi-qa-mpnet-base-cos-v1, for an internal application powered by a large language model. The documentation says (not verbatim):

Some of the datasets used to train the multi-qa mpnet model are not for commercial use.

For instance, GOOAQ and Yahoo! Answers.

For GOOAQ, it states:

NOTE: This dataset should not be used for any commercial purposes.

For one of the Yahoo datasets used for multi-qa-mpnet-base-cos-v1, in the readme.txt file, it states:

The original Yahoo! Answers corpus can be obtained through the Yahoo! Research Alliance Webscope program. The dataset is to be used for approved non-commercial research purposes by recipients who have signed a Data Sharing Agreement with Yahoo!.

multi-qa-mpnet-base-cos-v1 was also trained on MS MARCO, which has the same licence issues.

Does this automatically mean that the model itself is "tainted" and therefore we cannot use it for embeddings?

  • The technical jargon needs to be better explained.
    – ohwilleke
    Commented Jul 5, 2023 at 19:22
  • Before I use a flag to summon a mod: Is there a reason that this can't be moved to OSS SE (opensource.stackexchange.com)? This seems like it would be on-topic and easier to answer there.
    – Corbin
    Commented Jul 5, 2023 at 21:45
  • @Corbin I don't think so; on that site, "Open" and "Open Source" refer to content that is licensed under an open source license, such as GPL, MIT, BSD, Creative Commons, etc. opensource.stackexchange.com/tour
    – Brandin
    Commented Jul 25, 2023 at 13:52

1 Answer


I've investigated this earlier in the context of an EU subsidy project. It's my belief that current law in both the US and the EU does not provide copyright protection for neural networks. The reason is simple: copyright is restricted to works made by humans, extending to derivative works (such as computer programs compiled from human-written source code). But works produced by fully automated means, such as neural networks and their outputs (for example, ChatGPT transcripts), are not covered by copyright law.

The question may be tagged "open-source software", but neural networks are not software. They certainly do not have source code, so that cannot be open.

It's quite likely that the dataset used to train the networks is partially copyrighted. That would make it impossible to reproduce the training, which could be a concern in academic research. For that purpose, non-commercial reproduction of the dataset is a valuable right. Again, this is not source and therefore not open-source software.
