In data science, research often needs large corpora of text, image, or video data. Luckily, these can be acquired en masse today by scraping the web in general or specific sites, such as YouTube. For reproducibility and benchmarking (as well as for sharing expensive resources), it is desirable to publish these data sets so that other researchers can use them, too. There are plenty of corpora like this out there, for example https://commoncrawl.org/ (which scrapes essentially the whole written web) or http://moments.csail.mit.edu/ (scraped from various video sites). These sites often come with some kind of disclaimer (e.g., do not use for commercial purposes). The content is also often curated in some form (e.g., normalized, filtered, trimmed, or converted). The maintainers of such data sets certainly have no individual approval from the producers of the content, and for those that draw on more than one data source, there is likely no contract with the source platforms either.
Is this legal, or can it be made legal (i.e., the publication of third-party data in this manner)? Or is it simply that no content creator bothers to have their short video removed from a large data set that has likely already spread anyway (and is publicly accessible anyway)? What are the rights and roles of the original content platforms?