Internet Archive
Web & Data Services

ARCH (Archives Research Compute Hub)

One of the biggest challenges in computational research is getting access to data and supporting its use. ARCH (Archives Research Compute Hub) aims to address this challenge by making the process easier for collection maintainers and researchers. ARCH users can derive machine-actionable data from collections at scale using more than a dozen dataset jobs, each of which can be run at the click of a button (e.g., extract all text, spreadsheets, PDFs, images, audio, named entities, and more). These jobs run on user-defined combinations of pre-existing collections and/or collections created by the user. The resulting data can be visualized in the browser and downloaded from ARCH to a user-defined work environment.

Recent research efforts using ARCH include but are not limited to the analysis of:

ARCH currently works with web archive collections at the Internet Archive (625 billion web pages in total, with 585 million web pages added per day, containing a broad range of data types). ARCH will soon support computational analysis of additional digital collections: digitized and born-digital books, newspapers, government documents, audio recordings, video recordings, and more. As a point of distinction, ARCH leverages IA’s non-profit-owned, high-performance computing infrastructure for data processing.

If you would like to learn more about ARCH, please let us know: arch@archive.org

ARCH user collection overview panel
Sample ARCH dataset generation jobs
Sample ARCH dataset in-browser visualization
ARCH dataset download