Post Closed as "Needs details or clarity" by ohwilleke, user35069, Jen, Brian, IKnowNothing

Commercial usage of sentence-transformers/multi-qa-mpnet-base-cos-v1

We are trying to use a Hugging Face embedding model, multi-qa-mpnet-base-cos-v1, for an internal LLM-powered application. The model's documentation says (not verbatim):

Some of the datasets used to train the multi-qa mpnet model are not for commercial use.

For instance, GOOAQ and Yahoo! Answers.

For GOOAQ, it states: "NOTE: This dataset should not be used for any commercial purposes."

For one of the Yahoo datasets used for multi-qa-mpnet-base-cos-v1, in the readme.txt file, it states:

"The original Yahoo! Answers corpus can be obtained through the Yahoo! Research Alliance Webscope program. The dataset is to be used for approved non-commercial research purposes by recipients who have signed a Data Sharing Agreement with Yahoo!."

multi-qa-mpnet-base-cos-v1 was also trained on MS MARCO, which has the same licence issues.

Does this automatically mean that the model itself is "tainted" and therefore we cannot use it for embeddings?