Top Sites

About the project | Donate

Project Summary

The Whole Earth Web Archive (WEWA) is a proof-of-concept to explore ways to improve access to the archived websites of underrepresented nations around the world. Starting with a sample set of 50 small nations and extracting their archived web content from the Internet Archive’s total web archive, we have built special search and access features on top of this subcollection and created a dedicated discovery portal for searching and browsing. Further work will focus on improving IA’s harvesting of the national webs of these and other underrepresented countries as well as exploring collaborations with libraries and heritage organizations within these countries, and via international organizations, to contribute technical capacity to local experts who can identify websites of value that document the lives and activities of their citizens.

Project Context

Archived materials from the web play an increasingly necessary role in representation, evidence, historical documentation, and accountability. However, the web’s scale is vast, it changes and disappears quickly, and it requires significant infrastructure and expertise to collect and make permanently accessible. Thus, the community of National Libraries and Governments preserving the web remains overwhelmingly represented by well-resourced institutions from Europe and North America. We hope the WEWA project helps provide enhanced access to archived material otherwise hard to find and browse in the massive 25+ petabytes of IA’s web archive. More importantly, we hope the project provokes a broader reflection upon the lack of national diversity in institutions collecting the web and also spurs collective action towards diminishing the overrepresentation of “first world” nations and peoples in the overall global web archive.

Technical Approach

As with prior special projects by the Web Archiving & Data Services team, such as GifCities (search engine for animated Gifs from the Geocities web collection) or Military Industrial Powerpoint Complex (ebooks of Powerpoints from the archive of the .mil (military) web domain), the project builds on our exploratory work to provide improved access to specific, valuable subsets of the overall global web archive behind the Wayback Machine. The preliminary set of countries in WEWA were determined by selecting the 50 “smallest” countries as measured by number of websites registered on their national web domain (aka ccTLD) (a somewhat arbitrary measurement, we realize). The underlying search index (in Elasticsearch) is based on internally-developed search tools for search of both text and media. Indices are built from features like page titles or descriptive hyperlinks from other pages, with relevance ranking boosted by criteria such as number of inbound links and popularity and include a temporal dimension to account for the historicity of web archives. Additional technical information on search engineering can be found in "Exploring Web Archives Through Temporal Anchor Texts".

Future Work

We intend both to do more targeted, high-quality archiving of these and other smaller national webs and also have undertaking active outreach to national and heritage institutions in these nations, and to related international organizations, to ensure this work is guided by broader community input. If you are interested in contributing to this effort or have any questions, feel free to email us at webservices [at] archive [dot] org. Thanks for browsing the WEWA!

Credits

Engineering/Development: Helge Holzmann
Design/UI: Brenton Chang & Jim Shelton
Project/Concept: Jefferson Bailey