Internet Archive
Web & Data Services

Domain-scale Web Harvests

The Internet Archive has a long tradition of providing domain-scale web harvests, often on behalf of national libraries. As a key provider of web archiving technologies and services, the Internet Archive has made open-source software for crawling and access available, enabling national bodies to undertake web archiving locally.

National Libraries

Example crawl report from a national library partner.

The Internet Archive has worked with national libraries and archives on domain-scale crawling since 1998, often performing full country code top-level domain (ccTLD) harvests of over 1 billion URLs. Our partners have included the Library of Congress, National Library of Australia, National Library of Israel, National Library of New Zealand, National Library of Spain, National Library of Luxembourg, Swiss National Library, National Library of Sweden, and National Library of Ireland, as well as national archives such as the U.S. National Archives and Records Administration.
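To make the idea of a ccTLD harvest concrete, the following is a minimal sketch of how a crawl scope might test whether a URL belongs to a given country code domain. It is an illustration only, not the scoping logic used by Heritrix or by the Internet Archive; the function name `in_cctld_scope` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def in_cctld_scope(url: str, cctld: str) -> bool:
    """Return True if the URL's host falls under the given ccTLD (e.g. "nz").

    Hypothetical helper for illustration; not part of any crawler's API.
    """
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if absent
    if host is None:
        return False
    host = host.rstrip(".")  # tolerate a trailing root dot
    suffix = cctld.lower().lstrip(".")
    # Match on a label boundary so "example.nz.evil.com" is not accepted.
    return host == suffix or host.endswith("." + suffix)


# Made-up candidate URLs, filtered for a hypothetical .nz harvest.
candidates = [
    "https://www.example.nz/page",
    "http://EXAMPLE.NZ:8080/",
    "https://example.nz.evil.com/",
    "mailto:someone@example.nz",
]
in_scope = [u for u in candidates if in_cctld_scope(u, "nz")]
```

A real domain harvest would likely need broader rules than this suffix test, for example to capture embedded resources hosted outside the ccTLD or sites of national interest registered under other domains.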

Harvesting tools

We use the open-source Heritrix software to perform web crawling for harvests, along with Umbra and Brozzler, browser-based tools that allow the crawler to imitate human interactions with the Web, such as triggering JavaScript by clicking or hovering the mouse over different page elements and scrolling down a page. This allows for the discovery and archiving of content generated by user actions. Heritrix, Umbra, and Brozzler were all developed at the Internet Archive, and our engineers continue to lead development of these tools.

Government Web Harvesting

A web archive portal page designed by the Internet Archive for the U.S. National Archives and Records Administration.

Grant funded crawls and Special Projects

News Measures Research Project

Wikipedia

WordPress