36

Due to the recent controversies regarding StackExchange, a couple of other users and I were discussing the legality of scraping SE's content and creating a copy of the site.

If we did not copy SE's actual code, just the content that users put on the site, created another public site that was completely nonprofit, and attributed all content taken to StackExchange, would it be legal? Do we need permission from every single user on SE? Do we need SE's permission?

Some relevant portions of the ToS:

You agree that any and all content, including without limitation any and all text, graphics, logos, tools, photographs, images, illustrations, software or source code, audio and video, animations, and product feedback (collectively, “Content”) that you provide to the public Network (collectively, “Subscriber Content”), is perpetually and irrevocably licensed to Stack Overflow on a worldwide, royalty-free, non-exclusive basis pursuant to Creative Commons licensing terms (CC-BY-SA), and you grant Stack Overflow the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit such Subscriber Content


Any other downloading, copying, or storing of any public Network Content (other than Subscriber Content or content made available via the Stack Overflow API) for other than personal, noncommercial use is expressly prohibited without prior written permission from Stack Overflow or from the copyright holder identified in the copyright notice per the Creative Commons License. In the event you download software from the public Network (other than Subscriber Content or content made available by the Stack Overflow API) the software including any files, images incorporated in or generated by the software, the data accompanying the software (collectively, the “Software”) is licensed to you by Stack Overflow or third party licensors for your personal, noncommercial use, and no title to the Software shall transfer to you. Stack Overflow or third party licensors retain full and complete title to the Software and all intellectual property rights therein.

8
  • 9
    Has this not been answered by the links at the bottom of every page? cc by-sa 4.0 and attribution required Commented Oct 5, 2019 at 21:13
  • 10
    Note that scraping each site is not a good way to get the content. You should first download the weekly database dump. You can then fill in missing information using data from the Stack Exchange API. You could exclusively use the SE API to get all the content data, but that would take longer, due to quota limitations. If you tried to actually scrape all the pages, you would likely hit a rate limit block quite quickly. The SE API provides the content data about 500 to 1,000 times faster than you can get it by scraping pages (due to rate limiting on both methods).
    – Makyen
    Commented Oct 6, 2019 at 7:58
  • 2
    @AndrewLeach, note that apparently the license mentioned in the footer was just changed a while ago, it used to be CC-BY-SA 3.0 or something like that.
    – ilkkachu
    Commented Oct 6, 2019 at 12:02
  • 1
    @ilkkachu Yes, I know that. So what? It's still the answer to the question. Commented Oct 6, 2019 at 12:15
  • 2
    @AndrewLeach although what happens to you if SE is sued (due to not having sought permission to change the licence) and has to go back to 3.0 is unclear.
    – Tim
    Commented Oct 6, 2019 at 12:31

2 Answers

26

Stack Exchange has already covered this in a couple of places. From MSE's "A site (or scraper) is copying content from Stack Exchange. What should I do?":

When should I not report these sites?

  • They follow all the attribution requirements. As mentioned before, there is nothing wrong with copying our content elsewhere on the web, so long as they are following all the attribution requirements given. There is no action we can take against a scraper who follows all the rules.

And the old Attribution Required blog post mentions that the actual requirements are:

  1. Visually indicate that the content is from Stack Overflow or the Stack Exchange network in some way. It doesn’t have to be obnoxious; a discreet text blurb is fine.
  2. Hyperlink directly to the original question on the source site (e.g., http://stackoverflow.com/questions/12345)
  3. Show the author names for every question and answer
  4. Hyperlink each author name directly back to their user profile page on the source site (e.g., http://stackoverflow.com/users/1234567890/username)

By “directly”, I mean each hyperlink must point directly to our domain in standard HTML visible even with JavaScript disabled, and not use a tinyurl or any other form of obfuscation or redirection. Furthermore, the links must not be nofollowed.
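
For anyone building such a copy, the four requirements above can be sketched as a small Python helper that renders a plain-HTML attribution line. This is only an illustration, not anything SE publishes: the `Author` class, the `attribution_html` function, and the CSS class name are hypothetical, and the license link is an extra assumption on my part (the blog post does not list it, though the site footer names CC BY-SA 4.0).

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Author:
    """Hypothetical record for one question or answer author."""
    name: str
    profile_url: str  # direct link to the user's profile on the source site


def attribution_html(
    site_name,
    question_url,
    authors,
    # Assumption: a license reference is included; not part of the blog post's list.
    license_name="CC BY-SA 4.0",
    license_url="https://creativecommons.org/licenses/by-sa/4.0/",
):
    """Render a static attribution line with direct, followed links."""
    # Requirements 3 and 4: show every author and link each name to the profile.
    author_links = ", ".join(
        f'<a href="{escape(a.profile_url, quote=True)}">{escape(a.name)}</a>'
        for a in authors
    )
    # Requirements 1 and 2: name the source site and link to the original question.
    # Plain <a> tags: no JavaScript, no URL shortener, no rel attribute of any kind.
    return (
        '<p class="attribution">'
        f'Source: <a href="{escape(question_url, quote=True)}">{escape(site_name)}</a>'
        f" by {author_links}, licensed under "
        f'<a href="{escape(license_url, quote=True)}">{escape(license_name)}</a>.'
        "</p>"
    )


if __name__ == "__main__":
    print(
        attribution_html(
            "Stack Overflow",
            "http://stackoverflow.com/questions/12345",
            [Author("username", "http://stackoverflow.com/users/1234567890/username")],
        )
    )
```

Because the output is static markup, the links stay visible with JavaScript disabled, which is what "directly" asks for in the quote above.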

7
  • 60
    The nofollow restriction sounds entirely unenforceable. The license requires "attribution", not SEO boosting.
    – hobbs
    Commented Oct 6, 2019 at 5:54
  • 31
    And I believe SE employs nofollow on the links we include in answers, so that’s a bit of a double standard.
    – Tim
    Commented Oct 6, 2019 at 12:29
  • 14
    I don't see how this is relevant. The OP posits copying user content licensed to SE under CC-by-sa licenses. I do not see the theory under which SE can place any additional restrictions on how anyone can use that content. The attribution is a bit tricky, since user identification is mediated by SE, but that's about it. Commented Oct 6, 2019 at 12:51
  • 3
    @hobbs Yep, this has come up on law.stackexchange before: law.stackexchange.com/a/429
    – Brilliand
    Commented Oct 6, 2019 at 19:51
  • 8
    Note that the quoted blog post is not a good source for what the attribution must look like. It lists things that aren’t required by the license, and it misses the requirement to reference the license.
    – unor
    Commented Oct 7, 2019 at 7:43
-2

Intended to complement '947's excellent answer, a direct response:

Yes. No. You already have it, per CC-BY-SA. (In response to "... would it be legal? Do we need permission from every single user on SE? Do we need SE's permission?")

All of the Subscriber Content is available under a CC-BY-SA license. Also, because it was provided to SE only "pursuant to Creative Commons licensing terms", you don't have to scrape it. Because of the "-SA", SE can't use "Effective Technological Measures" (see this clause et seq.), and I would consider rate limiting of HTTP or the API, if kept in place despite a request that it be lifted for this purpose, to be one. So if you want all the Subscriber Content, SE would be unwise not to give it to you on request, in a mutually convenient form such as a compressed database dump (like the ones Wikipedia/the WMF provides).

P.S. Err, it looks like this info was largely already provided in a comment by @Ángel - "You need permission from every single user on SE [that posted something you are copying] and you have that permission by way of their releasing the content under CC-BY-SA. Note you should better attribute the users themselves as authors, not StackExchange itself (e.g. attribute to Law StackExchange user JBis (22305), not attribute it as if authored by SE, which doesn't)" (edited for clarity)

IANAL,BIPOOTI.

1
  • Haters gonna hate. -1 and no comment. Commented Dec 14, 2019 at 19:50
