At every tick of the clock, pages and sites on the World Wide Web appear, evolve and disappear. The volatile nature of the web has led to numerous initiatives to archive it, such as the Internet Archive’s Wayback Machine and the Dutch web archive. The resulting web archives are a valuable source for researchers aiming to understand our current times, but they are also inherently incomplete.
In this paper, we worked with the Dutch web archive, showing that web archives contain more than meets the eye. Using the underlying link structure, we detected roughly as many unarchived pages as the archive actually contains. A closer look at these ‘lost’ pages showed that they complement the selection-based core of the archive. In the second part of the paper, we used the link URLs and link text (‘anchor text’) to generate representations of unarchived pages. Although the richness of the data is skewed, meaning that we have only a shallow representation of the majority of pages, we found that these pages could be retrieved effectively in a ‘known item’ setting.
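As a rough illustration of the idea, and not the implementation used in the paper, the sketch below assumes that links have already been extracted from the archive as `(source, target, anchor_text)` tuples and that the set of archived URLs is known. The function names, the tuple format, the example URLs and the simple tokenizer are all hypothetical choices made for this example.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def find_unarchived(links, archived_urls):
    """Collect anchor texts per link target that is absent from the archive.

    `links` is an iterable of (source, target, anchor_text) tuples; this
    format is an assumption made for the sketch.
    """
    anchors = defaultdict(list)
    for _source, target, anchor_text in links:
        if target not in archived_urls:
            anchors[target].append(anchor_text)
    return dict(anchors)


def build_representation(url, anchor_texts):
    """Describe an unarchived page by words from its URL and its anchor texts."""
    parts = urlsplit(url)
    url_words = tokenize(parts.netloc + " " + parts.path)
    anchor_words = [word for text in anchor_texts for word in tokenize(text)]
    return {"url_words": url_words, "anchor_words": anchor_words}


# Hypothetical sample data: one archived page linking to an unarchived one.
archived = {"http://example.nl/"}
links = [
    ("http://example.nl/", "http://museum.example.org/expo-2012", "Expo 2012"),
    ("http://example.nl/", "http://museum.example.org/expo-2012", "museum exhibition"),
    ("http://example.nl/", "http://example.nl/", "home"),
]

unarchived = find_unarchived(links, archived)
representations = {
    url: build_representation(url, texts) for url, texts in unarchived.items()
}
```

A page that receives only one or two links ends up with a very short word list, which is the skew mentioned above. Even such a shallow representation can be indexed with a standard retrieval model and queried for a specific, known page.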
The results of this paper have value for institutions creating web archives, which can use our findings to enrich their selection criteria and potentially to provide context to existing archives. Our findings can also be useful for researchers using web archives as a source (e.g. historians and media scholars): uncovering and partially recovering unarchived contents can provide increased transparency about the coverage of web archives.
Awards: best paper award honorable mention.
Read and cite the paper: