6

(I'm reposting this from workplace S.E. where I was advised to ask here; apologies in advance for any inconvenience)

tl;dr: in the USA, is copying the HTML code from a site (any site whose code is presumably copyrighted) and storing it, for a limited or unlimited amount of time, a violation of copyright? Are there prior lawsuits related to this? I'm mostly interested in the particular case where the copy is not reproduced, but kept private.


I've recently learned that this is indeed the case. This came as a huge surprise to me since:

  1. Most browsers retain a copy of the HTML (for the duration of the visit, or much longer if caching is enabled)
  2. Proxy servers often keep cached copies of these files
  3. Web archives (like Google's) not only copy all the assets of a site they find, to keep historical versions of its pages, but also make these historical copies available to the general public.
  4. Programs that scrape external sites often have in their repositories copies of (likely copyrighted) HTML for testing purposes

Number (4) is the one that directly affects the company I work for, since we do web analysis and therefore write programs that visit other sites. For example, we make extensive use of the vcrpy library to record external accesses and test our code against these "frozen" HTML files.
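To make the record-and-replay idea concrete, here is a minimal sketch using only the Python standard library. It is not vcrpy and does not reproduce its API; the function name `fetch_with_cassette`, the JSON file layout, and the `fetch` callback are all hypothetical, chosen only to show where the "frozen" HTML ends up on disk.

```python
import json
import os


def fetch_with_cassette(url, cassette_path, fetch):
    """Return the body for `url`, replaying it from disk if already recorded.

    `fetch` is a caller-supplied function (hypothetical here) that performs
    the real network access and returns the response body as a string.
    """
    recordings = {}
    if os.path.exists(cassette_path):
        # Replay mode: load whatever was recorded on an earlier run.
        with open(cassette_path, encoding="utf-8") as fh:
            recordings = json.load(fh)

    if url not in recordings:
        # Record mode: hit the real site once and freeze the HTML on disk.
        recordings[url] = fetch(url)
        with open(cassette_path, "w", encoding="utf-8") as fh:
            json.dump(recordings, fh)

    return recordings[url]
```

The file written to `cassette_path` is exactly the kind of stored copy the question is about: it typically sits in the test repository alongside the code and is never served to anyone.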

Also, specifically in our case, we don't really copy the entirety of any site, since we are only concerned with a subset of its pages. From what I've been told, though, that doesn't seem to qualify as "fair use" in the way that quoting a passage of a book would (where, in a sense, the book would be analogous to the entire site with all its public assets). We don't even copy assets like CSS files or images, so we can't reproduce the actual content in full.

Since I was told that such copies are likely unlawful, we have not only been held back from exploring more robust testing methodologies (which would likely keep a large amount of HTML copied from the web in local storage), but our current use of the vcrpy library has also become something that needs analysis (as it's not clear whether our use of it is unlawful).

1
  • What is the case law you found on this?
    – Dale M
    Commented Aug 13, 2015 at 21:30

2 Answers

4

You are clearly seeking legal advice. Answers on this site come from anonymous people on the internet and are not legal advice. You should not act based on information from this site.

I am unaware of any lawsuit where one would be sued for merely storing and reading HTML for personal use.

Downloading a webpage is probably not a copyright violation. Most things you create, including HTML source code, are protected by copyright and copyright includes the exclusive right to choose who can read what you created. I couldn't find any actual reference for this, but I would hazard a guess that publishing an HTML webpage online implicitly allows others to read that code. I believe this guess is correct because all modern web browsers have a "view source" feature that nobody considers illegal, and browsers also include the ability to save webpages to disk. These browsers are made by companies with large legal departments; I doubt Internet Explorer would include this function if using it were a copyright violation.

Here begins speculation:

However, your expanded question says that you wish not only to read the HTML code but also to process it, extract information from it and use what you learn this way. This could, I think, be prevented by the copyright holder. Still, what you are describing is commonly done in the world. Services such as Google, Bing or the Wayback Machine go far beyond what you are doing. In theory, I can see this as being a copyright violation, but the fact that these big companies keep doing it, without any kind of contract with the website owners, is strong evidence in favor of the legality of storing webpages.

You should be careful about how you use the stored data, though. For example, computer programs often have a stipulation in their EULA that prohibits reverse engineering the code. I could see the use of some websites being restricted in a similar manner.

Further (not authoritative) internet pages on this topic:

1
  • "copyright includes the exclusive right to choose who can read what you created" No it doesn't, at least not under US law, and I think not in any country. Copyright allows the owner to choose whether and how to distribute copies of the work. But once a copy has been validly transferred to another, with permission, the first sale doctrine allows further transfers, and no permission is needed, nor can the copyright owner forbid them. The law on implied licenses is a bit less clear, but this answer is probably roughly correct on those. Commented Oct 26, 2022 at 19:19
2

Copyright is the specific right to make copies of a protected work. The owner of a webserver is presumably the copyright owner of the content on that webserver. So when the webserver automatically makes a copy of an HTML document because a web browser requested it, this is generally not a copyright violation.

The web browser does not make any promises about how long it will keep those copies, so retaining them does not constitute a contract violation. (Retaining or using copyrighted documents isn't a protected act in copyright law.)

Web servers can give caching hints, including "don't cache", to proxy servers. Similar protocols exist for web archives. When a site owner does not use these protocols, that implies permission to copy, as that is the generally understood meaning of publicly posting web pages.
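As a rough illustration of what those hints look like in practice, the sketch below reads two of them: the HTTP `Cache-Control` response header (where `no-store` is the directive that asks caches not to keep a copy) and a site's robots.txt, parsed with Python's standard `urllib.robotparser`. The helper names `may_store` and `may_crawl` are hypothetical, the header lookup assumes the key is spelled exactly `Cache-Control`, and the result only tells you what the site signalled, not whether a given copy is lawful.

```python
from urllib.robotparser import RobotFileParser


def cache_directives(header_value):
    """Split a Cache-Control header value into lowercase directive names."""
    return {
        part.split("=", 1)[0].strip().lower()
        for part in header_value.split(",")
        if part.strip()
    }


def may_store(headers):
    """True unless the response carries the `no-store` directive.

    Assumes `headers` is a plain dict keyed by "Cache-Control".
    """
    return "no-store" not in cache_directives(headers.get("Cache-Control", ""))


def may_crawl(robots_txt, user_agent, url):
    """True if the given robots.txt text allows `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A scraper or test recorder could call both checks before writing a page to disk and skip anything the site has asked not to be stored or crawled.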

Your 4th use might indeed be the trickiest. You may have a fair use defense, in that you're using the data for technical compatibility purposes and you are not interfering with the author's rights that copyright law intends to protect. (Fair use is not a synonym for "small parts OK".) But fair use is a defense, not an absolute right, and must be considered in context.
