8

I was talking to my friends a few days ago about search engines, and we started discussing whether or not they need permission from websites to crawl them. I looked this up, and an answer on Quora said that you don't really need permission to crawl websites, since they are in the public domain and are public property. If the websites want privacy, they can change their settings appropriately.

However, later on, I talked to a few other people who said that search engines like Google and Bing don't just crawl all websites. They only crawl websites which are registered on their SEO or are on their radar and ranking–I didn't really understand this part very much–but that doesn't make much sense either since they'd still need to crawl other websites to be updated.

My question is: if I had a search engine that worked roughly like Google, Bing, etc., could I just start crawling and displaying results, or would I need to get special permission from the trillions of websites out there before I can actually run my search engine?

3 Answers

5

"Public domain" refers to things in principle copyrightable but where protection has lapsed, been repudiated, or is a statutory exception (such as government works). A website is not "in the public domain". The idea that a website is "public property" is (*cough*) mistaken.

There are basically two ways in which a web interaction could be illegal. The first regards whether accessing another person's computer is authorized: illegally accessing a computer is a crime. Authorization essentially comes down to "permission": if the owner permits me to access the computer, I am authorized. Putting stuff out there on a web server is an open-ended grant of permission to look at a web page. That simply means that if I create a web page (with a bunch of links or not), I am granting you permission to interact with my computer to that extent. It does not create permission to hack into a password-protected subdirectory. An ordinary web crawler automates what a clicking human does.
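
To make that last point concrete, here is a minimal sketch of such a crawler using only Python's standard library; the seed URL and the page limit are illustrative placeholders, not anything a real engine uses. All it does is request public pages and follow the links it finds, the same way a person clicking around would:

    # Minimal crawler sketch: fetch a page, collect its links, visit them.
    # This only automates what a human with a browser already does.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects href targets from <a> tags, like a reader spotting links."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=10):
        queue, seen = [seed_url], set()
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except OSError:
                continue  # unreachable page: skip it, as a person would
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith(("http://", "https://")):
                    queue.append(absolute)
        return seen

    # crawl("https://example.com")  # placeholder seed URL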

Copyright law is also relevant, in that the stuff I put on my webpage is not to be copied without permission. Any webpage access necessarily involves automatic copying from machine to machine: in putting stuff out there for the world to see, I am saying that the world can do that level of automatic copying that arises from normal html-and-click interactions. It does not mean that you can download and do stuff with my copyrighted content (i.e., it is not an abandonment of copyright: I did not put that stuff in the public domain). Putting a web page out there in an unrestricted fashion means that you've given a certain level of permission to "copy" (at least in the automatic server-to-browser viewing sense).

I may want to impose conditions on people's access to my stuff, so I can impose terms on such material. For instance, I may require users to agree to certain conditions before accessing the CoolStuff subdirectory. Users then have to jump through a minor hoop and agree to those terms. In that case, my permission is conditional, and if you violate the terms of that agreement, I may be able to sue you for copyright infringement. It could then be a violation of my terms of service (TOS) if I say "you may not crawl my website" (in less vague language). A TOS gets its legal power from copyright law, because every webpage interaction involves copying (I assume that technical point is obvious), and copying can only be done with permission. You may technologically overcome my weak click-through technology so that the bot just says "sure whatever" and proceeds to illegally use my web page: I can then sue you for copyright infringement.

The robot-specific methods of meta-tags and robots.txt have no legal force. Although there is a way to say "no, you may not" that is tailored to automated access, the meaning and enforcement of these devices have not yet reached the law. If my page uses NOFOLLOW and your program doesn't know or care, you (your program) do not (yet) have a duty to understand, detect and respect that tag. Prior registration is also not a legal requirement, and very many pages on the master crawl list get there from being linked to by someone else's web page. Again, there is at present no legal requirement of pre-registration (and there is no effective mechanism for verifying that the site owner has registered the site).
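
For what it's worth, honoring these signals is easy even though it is voluntary; Python's standard library ships a robots.txt parser. A small sketch, with a placeholder site and a hypothetical bot name:

    # Voluntary politeness check: consult robots.txt before fetching.
    # Nothing in the law compels this; it is a convention crawlers opt into.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # placeholder site
    rp.read()  # fetch and parse the file

    # Ask whether our (hypothetical) bot may fetch a given page.
    if rp.can_fetch("MyCrawlerBot", "https://example.com/CoolStuff/page.html"):
        print("robots.txt permits this fetch")
    else:
        print("robots.txt asks bots not to fetch this page")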

Archiving and especially re-displaying someone's content is, on the other hand, not legal. It would be plainly copyright infringement if you were to scoop up someone else's webpage and host it. You can analyze their material, somehow associate it with some search terms, and display a link to that page, but you cannot copy and republish their material. You can put very short snippets out there taken from a web page under the "fair use" doctrine, but you can't wholesale republish a webpage. (It should be noted that archive.org is an internationally recognized library, and libraries have extra statutory powers to archive.)

  • Look, I'll be honest: I didn't understand a lot of the things you said. A few things I understood even before you said them, such as that you can't just copy someone's property. But I'm not asking about any kind of copying or stealing of property. I want to know: if I had a brand new search engine which I wanted to put out there to display results, do I have to get permission from the trillions of owners out there to crawl their websites, or can I crawl their websites just the same, and if they want privacy they can password-protect (or whatever) their websites? Please clarify.
    – user17346
    Commented Apr 1, 2018 at 17:12
  • So I can just put out my search engine and start crawling the websites and displaying their results without any worries? That's what you are saying, right?
    – user17346
    Commented Apr 1, 2018 at 17:15
  • @Kevin The worst that is likely to happen is that a site operator will ban your crawlers. If your site's results show significant copyrighted portions of other pages (e.g. images, verbatim text from the web site), the owner could claim copyright infringement and sue you.
    – Brandin
    Commented Apr 1, 2018 at 19:20
  • There have been ongoing battles over what content can be displayed in search results. Google has largely won the right to display copies of images from websites, but recently agreed to no longer offer the full-resolution files in image search: arstechnica.com/gadgets/2018/02/… Also, beyond robots.txt files, a server administrator also has mechanisms to prevent crawlers from accessing their site, since having all your pages trawled on a regular basis by multiple bots can be annoying. Commented Apr 1, 2018 at 22:23
  • "Archiving and especially re-displaying someone's content is [...] not legal" Are you sure? This sounds like exactly what Google's cache does.
    – Laurel
    Commented Apr 4, 2018 at 2:15
0

Not really an answer, but too long for a comment.

First, I strongly suggest the OP do a little reading online about SEO (Search Engine Optimization) and look at the history of search engines by using search engines. The OP could easily get a grasp of the legal framework that allows search engines to do what they do.

Google, Bing, Yahoo, etc. do not copy web pages. They index web pages. How exactly they do this is a trade secret. However, indexing is the process of breaking down a body of text into relevant parts to make looking up a particular body of text quicker. If you were to look at the index for this page as it might be stored someplace like Google, you would not recognize any part of the page. The part of the page they keep is broken down into phrases and words, losing almost all original content and human context.

(There is an Internet archive, https://web.archive.org/; they only have 325 billion pages, including a few sites I used to have. They might have a page explaining how they are able to do this without getting sued.)

In the index they also store a pointer, which is simply the URL of the page, and a small snippet of the original page, which they display with the results. If they actually archived pages they would need more data storage than they have, hundreds of times more, since the actual content of a page is only a small amount of the data that makes up the whole page. Something like the weight of the ink compared to the weight of the book.
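
A toy version of the idea in the last two paragraphs might look like the following sketch (the URLs and page texts are made up): each page is reduced to words pointing back to a URL and a short snippet, and the page itself is thrown away.

    # Toy inverted index: each word maps to the URLs containing it; per
    # page we keep only a pointer (the URL) and a short snippet, and the
    # original page is not stored anywhere.
    from collections import defaultdict

    index = defaultdict(set)   # word -> set of URLs
    snippets = {}              # URL -> short excerpt shown with results

    def index_page(url, text, snippet_len=80):
        snippets[url] = text[:snippet_len]
        for word in text.lower().split():
            index[word.strip(".,!?")].add(url)

    def search(query):
        """Return (url, snippet) pairs for pages containing every query word."""
        hits = None
        for word in query.lower().split():
            urls = index.get(word, set())
            hits = urls if hits is None else hits & urls
        return [(url, snippets[url]) for url in (hits or set())]

    # Made-up pages, just to show the shape of the data that is kept:
    index_page("https://example.com/a", "Crawling and indexing are different things.")
    index_page("https://example.com/b", "Indexing breaks text into words and phrases.")
    print(search("indexing"))  # both URLs, each with its snippet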

They can take and use the parts they do to make money because it is considered "fair use", much in the same way a newspaper could do a book review using the title and excerpts from the book it is reviewing. Indeed, in the early days of the internet, when the basic legal framework of searching and indexing was being figured out, I am sure some lawyer someplace made that very comparison.

"They only crawl websites which are registered on their SEO or are on their radar and ranking–I didn't really understand this part very much–but that doesn't make much sense either since they'd still need to crawl other websites to be updated."

A site gets on their radar by being linked to from a site already in their index.

So yes, you can index all the websites you want. You just can't copy them and use those copies for your own ends without permission. But before you do, you should study up on the difference between copying and fair use, and on the current dynamics of indexing parts of a site and its related content, like images. You also might want to get a grasp of data mining, which is the art of writing a program that wanders around the internet collecting data; that is all a search engine crawler is.

-3

No. Search engines usually crawl your website without your permission; that is the default. But if you want to block robots from crawling your website, you can add a robots meta tag, such as <meta name="robots" content="noindex, nofollow">, to your website's head.

  • What you say is true, but it fails to give any legal analysis of why it's true.
    – Mark
    Commented Mar 23, 2021 at 18:28
