Ethical, legal, and privacy concerns for MS-Celeb face recognition dataset, consider delisting #20

Closed
adamhrv opened this issue Jun 11, 2020 · 5 comments

Comments

@adamhrv

adamhrv commented Jun 11, 2020

MS-Celeb dataset contains millions of copyrighted images scraped from the Internet, along with the personally identifiable Knowledge Graph IDs. The dataset was terminated by Microsoft after this article in the Financial Times exposed how problematic it is: https://www.ft.com/content/cf19b956-60a2-11e9-b285-3acd5d43599e.

Distributing the images may now be illegal under GDPR or BIPA (Illinois Biometric Information Privacy Act) because the Knowledge Graph IDs are easily associated with the real names of the individuals in the dataset, and many of them (there are 100,000) may reside in Illinois or the EU.

I wrote a lot about it here https://megapixels.cc/msceleb/

The MS-Celeb face recognition training dataset is currently listed on Academic Torrents as the 39th most popular download:
https://academictorrents.com/details/9e67eb7cc23c9417f39778a8e06cca5e26196a97

@ieee8023
Member

Adam,

We support datasets which are used for academic research. This is still a very popular dataset in academia and is used in many papers. Here is a list from 2020:

https://scholar.google.com/scholar?as_ylo=2020&hl=en&as_sdt=2005&sciodt=0,5&cites=7096719334274798105&scipsc=

We don't believe this torrent is hurting anyone or illegal, otherwise we would moderate it. Can you provide more details on how this dataset is illegal?

-Joseph

@adamhrv
Author

adamhrv commented Jun 11, 2020

I point out in the megapixels essay that many of those aren't very academic. They are often joint research projects done in collaboration with, or in some cases directly funded by, tech companies.

Take, for example, "Face Re-Identification Challenge: Are Face Recognition Models Good Enough?", which lists Vision Semantics as an author. Vision Semantics has worked with the UK Ministry of Defence. That doesn't sound very academic.

There's also a citation from the paper "Knot Magnify Loss for Face Recognition", a research project from "Noah’s Ark Laboratory, Huawei Technologies Co., Ltd" that doesn't have any academic connection. More research affiliations are listed toward the bottom of https://megapixels.cc/msceleb/ and include Megvii Inc., Hitachi, and IBM T.J. Watson Research Center.

The original terms of the dataset prohibit any commercial use: "The data is released for non-commercial research purpose only." https://web.archive.org/web/20180218212120/http://www.msceleb.org/download/sampleset

It also requires agreeing to their MSRA before downloading: "You have to read and agree the MSR Data License Agreement before you downloading the data;". That agreement is not included in the torrent.

For BIPA, see https://www.perkinscoie.com/en/news-insights/new-biometrics-lawsuits-signal-potential-legal-risks-in-AI.html
"The IBM and Clearview cases demonstrate that, even for publicly available data, a plaintiff may claim that processing personal information without consent violates the law."

For GDPR it will depend on whether the Knowledge Graph identifier constitutes personally identifiable information. If so, the fine could be large since there are many EU citizens in the dataset who never agreed, and their "celebrity" status is questionable.

It could be helpful to bring in a lawyer's perspective.

@julescarbon

+1 for the legal concerns - clearly MS-CELEB is being used in ways that violate Creative Commons licenses prohibiting commercial use. You have no way to control whether your website is used purely by academics. You should be concerned that publishing this torrent allows copyright violation on a massive scale.

But Joseph, aren't you concerned that distributing this dataset facilitates oppression? If peaceful protesters are targeted with facial recognition by police, and then later suffer retribution for expressing legitimate dissent, is the technology still neutral? If it is used to classify people by race, and then this is used by racists, is it still neutral? These are not idle concerns - there is ample evidence that corporations and governments have already used MS-CELEB to drive research into these areas. Do you want this to keep happening?

We need only look to the history of the 20th century to see how supposedly "neutral" bureaucratic technologies are used against whole populations. MS-CELEB is no ordinary image dataset. If you want to see who a technology is hurting, you must consider how it can be used against marginalized people. It should not be in the interest of Academic Torrents or the University of Montreal to facilitate repression.

@ieee8023
Member

Censorship of this legal (as it appears so far) data is not the answer. There are many other uses, such as animation and movie production. To stop facial recognition by police, just regulate the use of it.

@adamhrv
Author

adamhrv commented Jun 15, 2020

Please provide examples of "animation and movie production" to support your claim.

Your reference to its usefulness in academia on Google Scholar shows that the overwhelming majority of usage is for developing surveillance technologies.

Regulation is easier said than done. Would you be OK being complicit in providing this dataset if there were any evidence that it contributed to facial recognition systems used by police?

3 participants