Abstract
The Internet Archive pioneered web archiving and remains the largest publicly accessible web archive, hosting archived copies of web pages (Mementos) going back as far as early 1996. Its holdings have grown steadily since, and it hosts more than 881 billion URIs as of September 2019. However, the landscape of web archiving has changed significantly over the last two decades. Today we can freely access Mementos from more than 20 web archives around the world, operated by for-profit and nonprofit organisations, national libraries and academic institutions, as well as individuals. The resulting diversity improves the odds that archived records survive but also requires technical standards to ensure interoperability between archival systems. To date, the Memento Protocol and the WARC file format are the main enablers of interoperability between web archives. We describe a variety of tools and services that leverage the broad adoption of the Memento Protocol and discuss a selection of research efforts that would likely not have been possible without these interoperability standards. In addition, we outline examples of technical specifications that build on the ability of machines to access resource versions on the Web in an automatic, standardised and interoperable manner.
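To make the idea of machine access to resource versions more concrete, the following is a minimal Python sketch of the client side of datetime negotiation as specified in RFC 7089. It builds an Accept-Datetime request header and selects the closest Memento from a Link header. The host names (a.example.org, archive.example.org), the sample header and all function names are illustrative assumptions rather than anything described in the chapter, and the parser only handles quoted link parameters.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Simplified parser: assumes every link parameter is written as name="value".
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')

# Hypothetical Link header; hosts and URIs are made up for illustration.
SAMPLE = (
    '<http://a.example.org/>; rel="original", '
    '<http://archive.example.org/timemap/http://a.example.org/>; '
    'rel="timemap"; type="application/link-format", '
    '<http://archive.example.org/20010911203500/http://a.example.org/>; '
    'rel="first memento"; datetime="Tue, 11 Sep 2001 20:35:00 GMT", '
    '<http://archive.example.org/20070531203500/http://a.example.org/>; '
    'rel="last memento"; datetime="Thu, 31 May 2007 20:35:00 GMT"'
)


def accept_datetime(when: datetime) -> dict:
    """Build the header a client sends to a TimeGate (expects an aware datetime)."""
    utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc, usegmt=True)}


def parse_links(header: str) -> list:
    """Split a Link header value into dicts with uri, rels and datetime."""
    links = []
    for uri, raw_params in LINK_RE.findall(header):
        params = dict(PARAM_RE.findall(raw_params))
        links.append({
            "uri": uri,
            "rels": set(params.get("rel", "").split()),
            "datetime": params.get("datetime"),
        })
    return links


def closest_memento(links: list, when: datetime) -> str:
    """Return the URI of the Memento whose datetime is nearest to `when`."""
    mementos = [l for l in links if "memento" in l["rels"] and l["datetime"]]
    best = min(
        mementos,
        key=lambda l: abs(parsedate_to_datetime(l["datetime"]) - when),
    )
    return best["uri"]


if __name__ == "__main__":
    target = datetime(2006, 1, 1, tzinfo=timezone.utc)
    print(accept_datetime(target))
    print(closest_memento(parse_links(SAMPLE), target))
```

In practice a TimeGate performs this selection on the server side; the client-side selection here only illustrates how the link relations and datetime attributes let software navigate between versions without archive-specific logic.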
Copyright information
© 2021 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Jones, S.M., Klein, M., Sompel, H.V.d., Nelson, M.L., Weigle, M.C. (2021). Interoperability for Accessing Versions of Web Resources with the Memento Protocol. In: Gomes, D., Demidova, E., Winters, J., Risse, T. (eds) The Past Web. Springer, Cham. https://doi.org/10.1007/978-3-030-63291-5_9
DOI: https://doi.org/10.1007/978-3-030-63291-5_9
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63290-8
Online ISBN: 978-3-030-63291-5
eBook Packages: Computer Science (R0)