Ethics of scraping "public" data sources to obtain email addresses

Question

I am wondering whether the following research practice is ethical.

A software engineering researcher downloads source code repositories from GitHub, a large source of publicly available open source code. The researcher searches the git commit logs to find email addresses of software developers who have committed to a project, and uses these email addresses to send them an email asking them to participate in a survey. If the recipient clicks on the link to the survey, the survey contains an appropriate briefing and obtains informed consent. The researcher follows all institutional and legal requirements related to human subjects research. The researcher limits the number of emails sent to only the number of participants they think they will need to test their hypotheses. However, at least one recipient of this email is annoyed that the researcher obtained their email address in this fashion and sent them unsolicited email.

Is this an ethical research practice? In particular, what would be the relevant ethical principles or ethical framework for analyzing this question? I've read a bunch of papers and backgrounders on ethics in human subject research and in engineering research, but they seem focused on other issues. Are there accepted norms or guidelines relating to this sort of situation? Has it been considered in other fields, such as the social sciences?

A possible argument that the practice is ethical: The data source is publicly available, and the email addresses were collected from this publicly available data. Developers chose to make their software repository publicly available, and they should assume that any information contained in it is public. Developers who don't want to be contacted could have configured their git client specially to use a different email address. The research will benefit our understanding of the science of software development. Subjects have an opportunity to decide whether or not to participate in the survey. Participant confidentiality will be protected, and all responses will be treated anonymously. The research complies with all legal and compliance requirements. From a legal perspective, the emails are not "spam", since the unsolicited email was not sent for a commercial purpose.

A possible argument that the practice is unethical: Software developers probably would not expect someone to scrape email addresses from the git commit logs. Their email address might be contained in a publicly available data set, but some developers might expect/consider the information private, or at least not public and free for unrestricted use. Some developers might object that it is one thing to use email addresses that are publicly listed on their GitHub profile page, but it is another thing to extract private email addresses that are provided as part of their git configuration, and that their understanding of social norms is that the email addresses automatically inserted into the commit logs by their git client were not intended for this purpose. Some software developers might object to having an unwanted email message in their inbox or find the practices "creepy".

Please note: I am not asking about IRBs, legal requirements, or compliance. I am super-familiar with those considerations. Assume that the researcher has complied faithfully with all of those requirements that are applicable in their country. I'm not asking about that aspect. In my view, researchers have an independent obligation to conduct research in an ethical manner, and to exercise their own judgement in avoiding unethical behavior, even if it is legally permitted or approved by an IRB.

In addition to being unethical according to the guidelines in ff524's answer, this also appears to violate GitHub's ToS and, thus, is not legal. See section G.10, which explicitly bans transmitting unsolicited e-mail. Also, according to section F.3, if someone were to sue GitHub over this, you agree to indemnify GitHub for any awarded damages and for attorney's fees. — reirab, Commented Oct 21, 2015 at 1:54
We have done this once ourselves, as it is not an uncommon method in our field. We now prefer to stay away from it. While the response rate was good (indicating that plenty of people were indeed OK with our unsolicited mail), there is most definitely a non-trivial number of people who are very annoyed by this spam. Let's be clear - if you do this, you are in for some pissed-off mailing, and that fact alone already tells you quite a bit about the ethicality of the whole endeavour. — xLeitix, Commented Oct 21, 2015 at 6:35
Your idea of what spam is appears to be too narrow. Any mass mailing to people who didn't sign up for it in the first place could reasonably be considered to be spam. — kasperd, Commented Oct 21, 2015 at 6:49
@xLeitix G.6 also says "You agree not to reproduce, duplicate, copy, sell, resell or exploit any portion of the Service, use of the Service, or access to the Service without the express written permission by GitHub." Scraping for e-mail addresses certainly seems like exploitation of the service to me, but, again, it's sufficiently ambiguous that I suppose an argument could be made either way. — reirab, Commented Oct 21, 2015 at 7:28
"Software developers probably would not expect someone to scrape email addresses from the git commit logs." It's been a long time since I've associated with any software developers who would not expect that. — user2338816, Commented Oct 21, 2015 at 10:03

ff524 · Accepted Answer · 2015-10-20 20:17:46Z

34

A relevant guideline from the Council of American Survey Research Organizations' Code of Standards and Ethics for Survey Research:

Research Organizations are required to verify that individuals contacted for research by email have a reasonable expectation that they will receive email contact for research. Such agreement can be assumed when ALL of the following conditions exist:

A substantive pre-existing relationship exists between the individuals contacted and the Research Organization, the Client supplying email addresses, or the Internet Sample Providers supplying the email addresses (the latter being so identified in the email invitation);

Survey email invitees have a reasonable expectation, based on the pre-existing relationship where survey email invitees have specifically opted in for Internet research with the research company or Sample Provider, or in the case of Client-supplied lists that they may be contacted for research and invitees have not opted out of email communications;

Survey email invitations clearly communicate the name of the sample provider, the relationship of the individual to that provider, and clearly offer the choice to be removed from future email contact.

The email sample list excludes all individuals who have previously requested removal from future email contact in an appropriate and timely manner.

Participants in the email sample were not recruited via unsolicited email invitations.

It would seem here that GitHub users do not have a reasonable expectation that they will be contacted for research.


1

By agreeing to receive communication related to open source from SourceForge, I have always expected that these communications include information on particular open source tools, tips on how to conduct and publish open source projects, and general information on the open source world (all of which may or may not be interactive). I don't think that expectation is particularly unreasonable, or would be fundamentally different on GitHub. It might make a difference, however, that the invitation in question was not sent by GitHub on behalf of the researchers, but by the researchers themselves.
– O. R. Mapper
Commented Oct 20, 2015 at 20:48
5

@O.R.Mapper Also, afaik, you can choose what types of e-mail you want SourceForge/GitHub/etc. to send you. At least in the U.S., I'm fairly certain that there are laws requiring them to at least let you opt out. At any rate, the first point in the above says that there's a substantive pre-existing relationship between the individuals and the research organization, not between the individuals and the website the RO scraped their address from. It's also quite likely that this violates the terms of service of GitHub.
– reirab
Commented Oct 21, 2015 at 1:40
@reirab: "the Research Organization, the Client supplying email addresses, or the Internet Sample Providers supplying the email addresses"
– O. R. Mapper
Commented Oct 21, 2015 at 5:10
1

@O.R.Mapper True, but the point remains that none of those is the case here, at least insomuch as 'supplying' seems to imply that the supplying was intentional, rather than the addresses having been scraped from their website in what seems to be a violation of their ToS.
– reirab
Commented Oct 21, 2015 at 5:36
3

That's an interesting guideline. I wonder how many actual survey studies really follow all of that. Based on my experience in Software Engineering, I would wager less than 50%. Specifically, the OP's approach of mining GitHub repos for addresses of potential invitees is not at all uncommon, although I agree that this practice already leans very strongly towards "spam".
– xLeitix
Commented Oct 21, 2015 at 6:30


reirab · Accepted Answer · 2022-04-20 02:44:54Z

Being curious about this question, I asked GitHub about such uses. Here was their response:

Thanks for reaching out! We always encourage users who have questions about our Terms to contact GitHub Support directly, so we can learn more about their specific situation.

In the meantime, I'll be happy to point you to GitHub's Terms of Service and address the general issue of sending unsolicited emails.

Section G10 of our terms states:

You must not upload, post, host, or transmit unsolicited email, SMSs, or "spam" messages.

So, it appears that they also interpret section G.10 as banning this practice, though they encourage anyone with a question about what uses of their service are allowed to contact them directly.

April 2022 update:

As polm23 pointed out in a comment, GitHub's terms have been updated since my answer was originally written. This concern is now addressed in section 7 of GitHub's Acceptable Use Policy, which includes the following (emphasis mine):

You may not use information from the Service (whether scraped, collected through our API, or obtained otherwise) for spamming purposes, including for the purposes of sending unsolicited emails to users or selling User Personal Information (as defined in the GitHub Privacy Statement), such as to recruiters, headhunters, and job boards.

So, sending unsolicited e-mails to users is still banned, just under a different section of their policies.

It's now in a different place in their Acceptable Use Policies. docs.github.com/en/site-policy/acceptable-use-policies/… — polm23, Commented Apr 13, 2022 at 15:29

einpoklum · Accepted Answer · 2020-06-01 21:21:01Z

In my opinion, the specific case you described is ethical, with the following qualifications:

It depends on what the survey is about and what it is for.
The scraper must allow easy opting out of any further email from them.
This is somewhat in an ethical "gray area", and a slight change in the circumstances of the scraping and its use might push it past the line of acceptability.

When I put my name and email as a contributor in a file on some public repository (ignoring the case of the email getting in there against my wishes), I am making myself somewhat available to be reached for issues regarding that source file. Now, it's true that a survey about developers' habits is not something I had in mind; but if it's for an arguably-socially-beneficial cause - I don't believe it's an abuse of my putting my email there in the file.
