
Web Archiving Program

Frequently Asked Questions

How does the Library select web content to archive?

The Library archives websites that are selected by the Library’s subject experts, known as Recommending Officers, based on guidance set forth in subject-focused Collection Policy Statements and the format-focused Supplementary Guidelines for Web Archiving and Social Media. Collecting occurs around subjects, themes, and events identified by Library staff. Recommending Officers select “seed” URLs, which serve as starting points for the crawler; a seed can be a full domain, a subdomain, or a single page or document – whatever scope captures the desired web content. Depending on its topic, the same content may be selected for multiple thematic or event archives, resulting in captures at various points in time. Content is also collected at various frequencies, based on determinations made by Library staff. Our archives cover a wide variety of subjects and topics, with web content published in the United States and internationally.
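
For illustration only, the sketch below shows how seeds with different scopes might be represented and checked during a crawl. The URLs, field names, and scope labels are invented for this example and do not reflect the Library's internal systems:

```python
from urllib.parse import urlparse

# Hypothetical seed list: a "domain" seed covers a site and its subdomains,
# a "subdomain" seed covers a single host, and a "page" seed exactly one URL.
SEEDS = [
    {"url": "https://example.gov/", "scope": "domain"},
    {"url": "https://blog.example.gov/", "scope": "subdomain"},
    {"url": "https://example.gov/reports/annual.pdf", "scope": "page"},
]

def in_scope(candidate: str, seed: dict) -> bool:
    """Return True if a discovered URL falls within a seed's scope."""
    cand_host = urlparse(candidate).hostname or ""
    seed_host = urlparse(seed["url"]).hostname or ""
    if seed["scope"] == "page":
        return candidate == seed["url"]
    if seed["scope"] == "subdomain":
        return cand_host == seed_host
    # "domain": the seed host itself or any subdomain of it
    return cand_host == seed_host or cand_host.endswith("." + seed_host)
```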

How does the web archiving process work?

The Library's Web Archiving Section manages the overall program and ensures that selected content is archived and preserved. The Library’s goal is to create a reproducible copy of how a site appeared at a particular point in time. The Library attempts to archive as much of the site as possible, including HTML, CSS, JavaScript, images, PDFs, and audio and video files, to provide context for future researchers. The Library (and its agents) use specialized software to download copies of web content and preserve it in a standard format. The crawling tools start with a "seed URL" – for instance, a homepage – and the crawler follows the links it finds, preserving content as it goes. Library staff can also add scoping instructions that direct the crawler to follow links to an organization's content hosted on related domains, such as third-party sites, in accordance with the Library's permissions policies.
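
As a rough illustration of the crawl-and-capture loop described above – not the Library's actual Heritrix pipeline – the following Python sketch fetches pages starting from a hypothetical seed, follows same-host links, and writes each response to a WARC file using the third-party requests and warcio packages:

```python
import re
from collections import deque
from io import BytesIO
from urllib.parse import urljoin, urlparse

import requests
from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

SEED = "https://example.gov/"              # hypothetical seed URL
HOST = urlparse(SEED).hostname

with open("capture.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    queue, seen = deque([SEED]), {SEED}
    while queue:
        url = queue.popleft()
        resp = requests.get(url, timeout=30)
        # Wrap the fetch in a WARC "response" record. (This sketch ignores
        # content-encoding details that a real crawler must handle.)
        status = f"{resp.status_code} {resp.reason}"
        headers = StatusAndHeaders(status, list(resp.headers.items()),
                                   protocol="HTTP/1.1")
        record = writer.create_warc_record(url, "response",
                                           payload=BytesIO(resp.content),
                                           http_headers=headers)
        writer.write_record(record)
        # Follow in-scope links; this toy scope is "same host only".
        for href in re.findall(r'href="([^"]+)"', resp.text):
            link = urljoin(url, href)
            if urlparse(link).hostname == HOST and link not in seen:
                seen.add(link)
                queue.append(link)
```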

Archiving is not a perfect process – there are a number of technical challenges that make it difficult to preserve some content. For instance, the Library is currently unable to collect content streamed through third-party web applications, "deep web" or database content requiring user input, data visualizations that dynamically render by querying external data sources, GIS and some interactive maps, and content requiring payment or a subscription for access. In addition, there will always be some websites that take advantage of emerging or unusual technologies that the crawler cannot anticipate. Social media sites and some common publishing platforms can be difficult to preserve.

How frequently is content collected for the web archives?

The Library’s goal is to document changes in a website over time. This means that most seed URLs are archived more than once. The frequency of collection varies and depends on decisions made when the URL is nominated for collection. These decisions are periodically re-evaluated, and the frequency of collection may be adjusted.

What tools does the Library's Web Archiving Program use?

The Library of Congress uses open source and custom-developed software to manage different stages of the overall workflow. The Library has developed and implemented an in-house workflow tool called Digiboard, which enables staff to select websites for archiving, manage and track required permissions and notices, and perform quality review, among other tasks. To harvest the content, we primarily use the Heritrix archival web crawler. For replay of archived content, the Library has deployed a version of OpenWayback that allows researchers to view the archives. Additionally, the program uses Library-wide digital library services to transfer, manage, and store digital content. Institutions and others interested in learning more about Digiboard and other tools that the Library uses can contact the Web Archiving Program for more information. The Library continually evaluates available open-source tools that might be helpful for preserving and providing access to web content.
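
Replay systems in the OpenWayback family address archived pages by combining a 14-digit capture timestamp with the original URL. The sketch below assumes a wayback-style URL layout and uses the Library's public replay host as a plausible default; treat both as assumptions to verify:

```python
def replay_url(timestamp: str, original: str,
               host: str = "https://webarchive.loc.gov",
               collection: str = "all") -> str:
    """Build a wayback-style replay URL: <host>/<collection>/<timestamp>/<original>.

    The host and collection defaults are assumptions about the Library's
    public replay service; the timestamp is 14 digits (YYYYMMDDhhmmss, UTC).
    """
    return f"{host}/{collection}/{timestamp}/{original}"

# Hypothetical usage:
# replay_url("20010911000000", "http://www.example.com/")
# -> "https://webarchive.loc.gov/all/20010911000000/http://www.example.com/"
```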

How are the web archives stored?

Web archives are created and stored in the Web ARChive (WARC) and (for some older collections) the Internet Archive ARC container file formats. Multiple copies (for long-term preservation and access) are stored and managed by the Library of Congress. 
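
WARC is an open standard, so archived records can be inspected with common open-source tooling. For example, the warcio Python package can iterate over the records in a (possibly gzipped) WARC file:

```python
from warcio.archiveiterator import ArchiveIterator

# Print the target URI and capture date of every response record in a WARC.
# "capture.warc.gz" is a placeholder filename.
with open("capture.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"),
                  record.rec_headers.get_header("WARC-Date"))
```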

Does the Library deduplicate its archive?

Since mid-2009, the Library has used a crawler that allows for deduplication of content to reduce the storage size of the archives. The Library's general strategy regarding deduplication has been to do baseline crawls at least once per year of all content identified for archiving, and subsequent crawls "dedupe" against the baseline crawls. Only new content is stored.
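
In crawler terms, deduplication usually means comparing a digest of each downloaded payload against an index built from the baseline crawl; duplicates are noted with a lightweight pointer (in WARC, a "revisit" record) rather than stored again. A minimal sketch of that digest check, with an invented in-memory index, might look like:

```python
import hashlib

# Invented in-memory index: payload SHA-1 digest -> URL of the stored copy.
baseline_index: dict[str, str] = {}

def dedupe(url: str, payload: bytes) -> str:
    """Store first-seen payloads; flag exact duplicates of baseline content."""
    digest = hashlib.sha1(payload).hexdigest()
    if digest in baseline_index:
        # Duplicate: a WARC writer would emit a small "revisit" record
        # pointing at the baseline capture instead of storing the bytes again.
        return f"revisit -> {baseline_index[digest]}"
    baseline_index[digest] = url       # first capture becomes the baseline
    return "stored"
```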

Is the Library legally required to archive websites?

No. Currently, the Library is not legally required to archive websites. However, the Library has been archiving born-digital online content through its Web Archiving Program since 2000 in an effort to preserve and provide access to such materials, as we have done with print materials throughout the Library’s history. 

Can I suggest web content to be collected by the Library?

Yes. Recommending Officers will review suggestions, but the Library cannot guarantee that suggested content will be added to the archives. Contact us and your suggestion will be forwarded to one of our Recommending Officers for consideration.

How do I view the Library's web archives?

For details on how to view accessible content in the Library's web archives, visit For Researchers.

What resources does the Library provide for API access?

The Library of Congress makes three different loc.gov APIs available to the public. For more information, please see the Library's guide to its APIs.
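
As a small illustrative example, many loc.gov endpoints return JSON when the fo=json parameter is added to the request; the search query below is hypothetical:

```python
import requests

# Request loc.gov search results as JSON (fo=json selects the JSON format).
# The query string is illustrative.
resp = requests.get("https://www.loc.gov/search/",
                    params={"q": "web archiving", "fo": "json"},
                    timeout=30)
for item in resp.json().get("results", []):
    print(item.get("title"), "-", item.get("url"))
```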

Contact Us

Comments, questions, and suggestions related to Web Archiving and this website can be sent to us online.

Location

Web Archiving Program
Library of Congress
101 Independence Avenue, SE
Washington DC 20540-1310