Report on the 5th International Web Archiving Workshop (IWAW)

D-Lib Magazine
November 2005

Volume 11 Number 11

ISSN 1082-9873

Andreas Aschenbrenner
SAT Research Studio, Austria
<andreas.aschenbrenner@researchstudio.at>

Olaf Brandt
SUB Göttingen, Germany
<brandt@mail.sub.uni-goettingen.de>

Stephan Strodl
Vienna University of Technology, Austria
<strodl@ifs.tuwien.ac.at>

For the fifth time, the European Conference on Digital Libraries (ECDL) [1] hosted the International Web Archiving Workshop (IWAW) [2], this time in Vienna, Austria. About sixty participants joined the event, mainly from Europe and the USA, but also from as far away as Australia and Japan. In the four preceding years, IWAW focused purely on web archiving-related issues during one day of presentations and discussion. This year, the workshop was extended to two days, and it set out to also address the broad field of digital preservation. Still, more than half of the presentations came from current web archiving initiatives, in keeping with the workshop's origin and name.

The IWAW'05 agenda was organized into five sessions of varying length. Papers related to web archiving and digital preservation were presented on both days. Some papers attempted to relate the two issues, for example, by assuming a web archive perspective within a general digital preservation approach. Overall, the workshop was interactive, with time allotted for questions and answers following each presentation and with additional discussion between sessions.

IIPC Results

The first session was dedicated to the activities of the International Internet Preservation Consortium (IIPC) [3]. The IIPC was founded in 2003 by 11 national libraries along with the Internet Archive [4] in order to develop tools and join forces in their web archiving activities. Currently, the consortium is most active in developing a toolset for establishing and maintaining a web archive. This toolset comprises the web crawler Heritrix [5], the archive format manipulation tool BAT [6], the access tool WERA [7], and the search engine NutchWAX [8]. All these tools are open source, and they are being developed cooperatively by IIPC members.
Michael Stack from the Internet Archive introduced the Heritrix crawler, and Kristinn Sigurðsson from the National and University Library of Iceland presented a Heritrix module for incremental crawling. Wherever possible, the toolset is built upon existing tools in this area. Specifically, the WERA access tool stems from the Nordic Web Archive (NWA) access tool, and the responsible partner within the IIPC is the National Library of Norway, which also led NWA development [9]. NutchWAX is a web archive extension of the Apache Nutch project and the Apache Lucene search software [10]. A primary developer of Lucene and Nutch, Doug Cutting, is also the main developer of NutchWAX at the Internet Archive, and he participated in IWAW'05 and introduced the tool.

Besides the presentations of the IIPC activities and tools, and a technical discussion of implementation-related issues, the IIPC session featured one presentation on the web archiving format WARC, and another one on metadata in web archiving. The WARC format is an initiative to define a standard web archiving format. Triggered by the IIPC, WARC builds on the ARC format, which has been in use at the Internet Archive since 1996 [11]. Each WARC file is a sequence of content blocks with metadata headers. New features of WARC include duplicate detection, support for data migration, and enhanced metadata. The definition of WARC is still a work in progress, and comments on the draft WARC definition are encouraged [12].

Julien Masanès, IWAW organizer and former IIPC coordinator, presented the IIPC Web Archiving Metadata Set. The set is modeled in a number of layers, from low-level server interactions and individual files up to the collection level. As one of the layers, the set assumes successive crawl sessions. It goes beyond the technical metadata associated with collection activities: it also records the selection criteria and, more generally, documents contextual metadata on a technical and organizational level.
After abstract modeling and definition of the individual metadata fields on all levels, the set may be integrated in a standard representation format such as METS [13].

The active discussion during this session touched on various issues. Technical issues included the crawling of dynamic pages, and the distribution of crawling activities over multiple servers for maximum scalability and flexibility. Distribution is currently high on the agenda of the IIPC implementation efforts. Meta-searching facilities across distinct web archives are not yet scheduled as part of IIPC activities, though this may be addressed in the future.

The legal aspects of web archiving were also raised for discussion. Some of the institutions represented at the workshop have a legal deposit obligation, mainly on a national level. However, the situation for smaller web archiving initiatives, or those without adequate legal deposit regulations, remains largely unclear. This is a particularly touchy issue for web sites with a commercial stake in their resources, specifically in the case of streaming media. For now, direct arrangements with the author or publisher of a website appear to be the most viable option from a legal perspective.

Interestingly enough, the web archiving initiatives represented during the workshop all followed a broad-scale harvesting approach. However, the clear distinction between the two archetypal web archiving strategies of exhaustive harvesting and selective collection appears to be fading. Technologies appear to be converging: institutions following a selective approach attempt to maximize automation, while exhaustive approaches increasingly investigate thematic harvesting techniques and provide curation tools for directing the harvester to specific sites.

Audio and Video Web Archiving

After the nine presentations of the previous session, the audio and video session, which is a specialist topic, featured only one presentation.
Thomas Drugeon from the French audiovisual archive center (INA) [14] presented the INA's technical approach to web crawling and archival storage. INA is one of the three largest digital image and sound banks in the world. Under an extension to the French legal deposit regulations that is soon to be endorsed, the INA will be in charge of French web sites related to media institutions and audiovisual resources. In their estimation, this comprises about 10,000 to 15,000 web sites. The INA plans to automatically harvest specific domains, such as the site of the French media channel TF1 (http://www.tf1.fr).

Time Dimension

Following the two sessions that targeted web archiving specifically, the Time Dimension session and the session after it were geared toward more general digital preservation issues, though web-related initiatives were represented in them as well.

For this session, Tiphaine Accary-Barbier was invited to introduce a formal model for extracting temporal knowledge from a body of documents. The expression of chronology enables the construction of a relative temporal model, which establishes the context of a document within a community. Thereby, dependencies and inconsistencies between documents can be made explicit. Temporal models may be relevant for any collection of digital resources, and they certainly are in web archiving, where successive crawl sessions and web site 'snapshots' taken over extended periods of time raise questions of consistency and authenticity.

In the second presentation of this session, Frank McCown presented a study of the stability of URLs. The study is based on external links from D-Lib Magazine articles for the period July 1999 to August 2004. It showed a steady increase of inaccessible URLs. Even in the year of publication, some links to external resources were already "dead". The half-life of a link from an article to a referenced resource, that is, the time after which half of the links from the article are inaccessible, is about 10 years.
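To make the half-life figure concrete, the following minimal sketch estimates what share of an article's links would still resolve after a given number of years. It assumes a simple exponential decay model, which is an illustrative assumption and not necessarily the model used in the study; the function name `surviving_fraction` is hypothetical, and only the 10-year half-life comes from the report.

```python
def surviving_fraction(years: float, half_life_years: float = 10.0) -> float:
    """Expected share of links still reachable after `years`.

    Assumes exponential decay (an illustrative assumption): every
    `half_life_years`, half of the remaining links become inaccessible.
    """
    return 0.5 ** (years / half_life_years)


if __name__ == "__main__":
    for age in (1, 5, 10, 20):
        share = surviving_fraction(age)
        print(f"after {age:2d} years: about {share:.0%} of links expected to resolve")
```

Under this assumption, half of the links remain after 10 years and a quarter after 20. A single decay rate for all links is a simplification.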
However, factors like link depth, file format, and personal vs. institutional site have a great impact on the statistical lifetime of a linked URL. Surprisingly, of the 59 referenced PURLs [15], more than half were inaccessible. However, of the Handle [16] and DOI [17] persistent identifiers referenced in D-Lib articles that pointed to external sources, all were still accessible.

Digital Preservation

The digital preservation session, held on the second day of IWAW '05, featured a wide range of different perspectives and approaches in digital preservation. Jeffrey van der Hoeven started the day with an update on preservation-related developments at the Dutch National Library (Koninklijke Bibliotheek, KB). Their effort to build an ultimate solution for "long-term access" is based on an emulation strategy. While the KB takes a long-term perspective, Frank McCown from Old Dominion University focuses on a short-term solution and aims to slightly enlarge the 'time bubble' of supported formats. The Grace tool for dynamic file format transformation being developed there operates as a proxy web server and is capable of converting files on-the-fly before rendering, in case a web browser fails to support the original format.

The following two presentations in the digital preservation session were dedicated to archive modeling and planning. Niels Christensen from the Kongelige Bibliotek in Denmark outlined a simulation approach for estimating the Mean Time To Failure (MTTF) of a repository subject to a variety of conceivable hardware failures and human errors. Subsequently, Stephan Strodl from the Vienna University of Technology explained the benefits of Utility Analysis in selecting a preservation strategy. The Utility Analysis model emphasizes the importance of specifying user needs and diligently defining preservation requirements, and it provides a comprehensive framework for specification, planning, and evaluation.
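The idea of comparing preservation strategies against explicitly stated requirements can be illustrated with a small sketch. It assumes a plain weighted-sum aggregation, which is one common way to carry out a utility analysis and not necessarily the exact procedure presented at the workshop. The criteria names, weights, strategies, and scores below are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # relative importance; normalised inside utility()


def utility(scores: dict[str, float], criteria: list[Criterion]) -> float:
    """Weighted average of the per-criterion scores (hypothetical 0-5 scale)."""
    total_weight = sum(c.weight for c in criteria)
    return sum(c.weight * scores[c.name] for c in criteria) / total_weight


# Hypothetical requirements and candidate strategies, for illustration only.
criteria = [
    Criterion("content preserved", 0.5),
    Criterion("appearance preserved", 0.3),
    Criterion("cost", 0.2),
]
strategies = {
    "migration": {"content preserved": 5, "appearance preserved": 3, "cost": 4},
    "emulation": {"content preserved": 4, "appearance preserved": 5, "cost": 2},
}

# Rank the candidates from highest to lowest aggregated utility.
ranking = sorted(strategies, key=lambda s: utility(strategies[s], criteria), reverse=True)

if __name__ == "__main__":
    for name in ranking:
        print(f"{name}: {utility(strategies[name], criteria):.2f}")
```

The weights make the stated user needs explicit, so changing a requirement's weight visibly changes the ranking of the strategies.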
The final two presentations of the digital preservation session returned to discussing specific preservation approaches from a technical perspective. Jane Hunter introduced the PANIC project [18] underway at a number of Australian universities and institutions. PANIC aims to establish a framework and web service-based middleware to interconnect existing preservation services such as format and software registries, and metadata extraction and validation tools. The project aims to build an infrastructure that establishes a platform for collaborative preservation efforts over an extended period of time.

In the last presentation of this session, Shigeo Sugimoto presented the Enclose-and-Deposit method for digital preservation. This method essentially follows the encapsulation concept, where digital resources are annotated with sufficient metadata to facilitate their interpretation in the future. Dr. Sugimoto underlined the benefit of a simple approach that is independent of technology and organizational boundaries. A prototype of the Enclose-and-Deposit method, which was implemented on top of the DSpace repository system [19], caught the interest of Robert Tansley, who is an architect of DSpace. He is currently working on a similar project for inclusion in the DSpace open source software.

Current Projects and Issues

The last session of IWAW '05 provided updates on other current projects in the domain. The Danish web archiving project netarchive.dk is in the fortunate position of having an adequate legal framework for its activities since the revision of the Danish legal deposit law in July 2005. Also, netarchive.dk is currently funded with 400,000 euros annually. Bjarne Andersen provided an update on the project's current activities.

In his presentation, Julien Masanès called for a syndication of web archives in a range of activities, including content exchange and functional collaboration.
The recently founded European Archive could act as a catalyst for cooperative action, and it already contributes to various European web archiving projects as a technology partner. The European Archive is an Amsterdam-based non-profit foundation modeled after the Internet Archive, with which it cooperates closely.

The German project Kopal [20] aims to cooperatively develop and run a long-term preservation system based on the DIAS system, which was developed at the National Library of the Netherlands together with IBM [21]. Olaf Brandt described the ongoing implementation work in the scope of this large German project.

Overall, the workshop presented a variety of topical initiatives and triggered active discussion among the participants. Presentations and papers are available from the workshop website [22]. The IWAW workshop series is planned to continue next year [23].

Notes and References

1. 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2005). Vienna, Austria, <http://www.ecdl2005.org/>.
2. International Web Archiving Workshop (IWAW), <http://www.iwaw.net/>.
3. International Internet Preservation Consortium (IIPC), <http://www.netpreserve.org/>.
4. Internet Archive, <http://www.archive.org/>.
5. Heritrix: Internet Archive Web Crawler, <http://sourceforge.net/projects/archive-crawler>.
6. BAT - BnF (Bibliothèque nationale de France) Archive Tool.
7. WERA - Web ARchive Access, <http://sourceforge.net/projects/nwatoolset/>.
8. NutchWAX - Nutch + Web Archive eXtensions, <http://sourceforge.net/projects/archive-access/>.
9. Nordic Web Archive (NWA), <http://nwa.nb.no/>.
10. Apache Lucene, <http://lucene.apache.org/>.
11. Mike Burner and Brewster Kahle: Internet Archive ARC File Format, <http://www.archive.org/web/researcher/ArcFileFormat.php>.
12. WARC - web archiving format, <http://cvs.sourceforge.net/viewcvs.py/archive-access/archive-access/src/docs/warc/>.
13. Metadata Encoding and Transmission Standard (METS), <http://www.loc.gov/standards/mets/>.
14. Institut National de l'Audiovisuel (INA), <http://www.ina.fr/>.
15. PURL - Persistent URL, <http://www.purl.org/>.
16. Handle® identifier system, <http://www.handle.net/>.
17. DOI - Digital Object Identifier System®, <http://www.doi.org/>.
18. PANIC - Preservation webservices Architecture for Newmedia and Interactive Collections, <http://metadata.net/panic/>.
19. DSpace™, <http://www.dspace.org/>.
20. Kopal - Cooperative development of a long-term digital information archive, <http://kopal.langzeitarchivierung.de/>.
21. See also the presentation by Jeffrey van der Hoeven in the previous session.
22. International Web Archiving Workshop (IWAW), <http://www.iwaw.net/>.
23. 10th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2006). Alicante, Spain, <http://www.ecdl2006.org/>.

(On January 5, 2006, the figure for the half-life of a referenced resource in D-Lib Magazine was corrected to 10 years as stated in Section 5 of Frank McCown's paper and on slide 12 of his presentation slides, see [2].)

Copyright © 2005 Andreas Aschenbrenner, Olaf Brandt, and Stephan Strodl

doi:10.1045/november2005-aschenbrenner