Lost but Not Forgotten – Finding Pages on the Unarchived Web (2015)

At every tick of the clock, pages and sites on the World Wide Web appear, evolve and disappear. The volatile nature of the web has led to numerous initiatives to archive it, including the Internet Archive’s Wayback Machine and the Dutch web archive. The resulting web archives are a valuable source for researchers aiming to understand our current times, but they are also inherently incomplete.

In this paper, we worked with the Dutch web archive, showing that web archives contain more than meets the eye. Using the underlying link structure, we detected roughly as many unarchived pages as the archive actually contains. A further look into the ‘lost’ pages showed that they complement the selection-based core of the archive. In the second part of the paper, we used the link URLs and link text (‘anchor text’) to generate representations of unarchived pages. The richness of this data is skewed, meaning that we have only a shallow representation of the majority of pages, but we found that these pages could still be retrieved effectively in a ‘known item’ setting.
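The general idea can be illustrated with a minimal Python sketch. It is not the pipeline used in the paper: the function names, the input format (a set of archived URLs plus `(source, target, anchor text)` tuples) and the example URLs are all assumptions made for illustration. The sketch first collects link targets that are missing from the archive, then builds a simple word-level representation of each such page from its URL and the anchor text of its inlinks.

```python
import re
from collections import defaultdict


def find_unarchived(archived_urls, links):
    """Group link evidence by target URL for targets missing from the archive.

    archived_urls: set of URLs held in the archive (hypothetical input format).
    links: iterable of (source_url, target_url, anchor_text) tuples
           extracted from archived pages (hypothetical input format).
    """
    evidence = defaultdict(list)
    for source, target, anchor in links:
        if target not in archived_urls:
            evidence[target].append((source, anchor))
    return dict(evidence)


def represent(target_url, inlinks):
    """Build a bag-of-words representation from URL tokens and anchor text."""
    url_words = [w for w in re.split(r"[^a-z0-9]+", target_url.lower()) if w]
    anchor_words = []
    for _source, anchor in inlinks:
        anchor_words.extend(re.findall(r"\w+", anchor.lower()))
    return {"url_words": url_words, "anchor_words": anchor_words}


# Made-up example data, not taken from the Dutch web archive.
archived = {"http://example.nl/", "http://example.nl/news"}
links = [
    ("http://example.nl/", "http://example.nl/news", "Nieuws"),
    ("http://example.nl/", "http://lost.example.org/about", "About the project"),
    ("http://example.nl/news", "http://lost.example.org/about", "project page"),
]

unarchived = find_unarchived(archived, links)
representations = {url: represent(url, inl) for url, inl in unarchived.items()}
```

A page with a single inlink and a short anchor text yields only a handful of words, which is the ‘shallow representation’ case described above; pages with many inlinks accumulate richer evidence.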

The results of this paper have value for institutions creating web archives, since they can use our findings to enrich selection criteria and potentially to provide context to existing archives. Our findings can also be useful for researchers using web archives as a source (e.g. historians and media scholars): uncovering and partially recovering unarchived contents can provide increased transparency about the coverage of web archives.

Awards: Digital Libraries conference 2014 best paper award honorable mention.

Read and cite the journal paper:

Huurdeman, H. C., Kamps, J., Samar, T., Vries, A. P. de, Ben-David, A., & Rogers, R. A. (2015). Lost but not forgotten: finding pages on the unarchived web. International Journal on Digital Libraries, 1–19. https://doi.org/10.1007/s00799-015-0153-3 (Preprint PDF)

Related conference papers:

Huurdeman, H. C., Ben-David, A., Kamps, J., Samar, T., & de Vries, A. P. (2014). Finding pages on the unarchived Web. 2014 IEEE/ACM Joint Conference on Digital Libraries (JCDL), 331–340. https://doi.org/10.1109/JCDL.2014.6970188

Samar, T., Huurdeman, H. C., Ben-David, A., Kamps, J., & de Vries, A. (2014). Uncovering the Unarchived Web. Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 1199–1202. https://doi.org/10.1145/2600428.2609544

Related materials:

View Digital Libraries 2014 conference presentation
View other publications about web archives