In data science, researchers need large corpora of text, image, or video data. Luckily, these can be acquired en masse today by scraping the web in general or specific sites, such as YouTube. For reproducibility and benchmarking (as well as for sharing expensive resources), it is desirable to publish these data sets so that other researchers can use them, too. There are plenty of corpora like this out there, such as https://commoncrawl.org/ (basically scrapes the whole written web) or http://moments.csail.mit.edu/ (scraped from various video sites). These sites often have some kind of disclaimer (e.g., do not use for commercial purposes). Also, the content is often curated in some form (e.g., normalized, filtered, trimmed, or converted). Certainly the maintainers of the data set have no individual approval from the producers of the content, and for the data sets that use more than one data source, there is likely no contract with the source platforms either.

Is publishing other people's data in this manner legal, or can it be made legal? Or is it just that no content creator bothers to have their short video removed from a large data set which has likely spread already (and is publicly accessible anyway)? What are the rights and the roles of the original content platforms?

    +1. I think the critical question is "What are the rights and the roles of the original content creators?" For example, YouTube has a license to distribute, but the copyright is still held by the original content creator. They are very unlikely to have given permission for this redistribution. It is likely to be a fair use question.
  • While clearly related, I do not think this is a duplicate. In the linked question, the OP was scraping info about the user and using it to fill in a form, and the info was not stored until the user confirmed it. Specific source sites were mentioned, whose TOS docs figure in the only answer to date. This question deals with data from unknown sources, used for research purposes. Answers may well not be the same. Commented Jun 28, 2022 at 14:16

Harvesting protected works from websites and making them publicly available, without permission of the copyright holder (author), is generally copyright infringement. Depending on jurisdiction, there can be exceptions such as the US concept of "fair use", where one can massively copy and analyze data for research purposes, but not redistribute the corpus. I don't believe that any jurisdiction allows the free redistribution of protected works without author permission simply because the use is said to be "for research".

It is true that there are fewer infringement lawsuits over data scraping than one would expect, given the amount of scraping that goes on. The reasons for under-litigation are mostly non-legal (mainly, the rights-holder did not know the work was copied), quasi-legal (the rights-holder does not object to copying but does not want to bother including a license, or does not understand that permission has to be given, i.e. thinks that "putting it out there" is enough), and legal-practical (the hassle of a lawsuit exceeds the possible rewards). Many works whose rights-holder actively cares are scrape-proof – hidden behind a password. In principle, a person who finds that their cat video was scraped from a web site and wrapped up in a massive and unparsable data structure could sue the researcher.

In some cases, such as Stack Exchange, the website secures a license from the content creator. In order to use LSE to answer your question, I have to allow copying of my content as provided in the TOS. Whether or not a particular website has such a license is up to that website.


It depends on the jurisdiction. Some jurisdictions recognize a compilation copyright even if the individual pieces of information are free to use. Scraping a compilation could infringe that site's copyright.

