After having been held in conjunction with JCDL last year, the International Workshop on Web Archiving (IWAW) this year was hosted in Europe again together with the European Conference on Digital Libraries (ECDL) on September 18-19 in Århus, Denmark. Adopting a two-day layout, IWAW 2008 attracted almost 70 participants from Europe, North America and Asia.
On the first day of the workshop, two scientific sessions and a dedicated session of the EU-funded LiWA (Living Web Archives) project provided an update on current developments and urgent issues. The focus was strongly on access to Web Archives, replacing crawling and collection building, which were the dominant topics of earlier years. This shift hints at the level of progress made in the Web archiving domain, which has successfully mastered many of the challenges of collection building, as was also highlighted by a series of case studies on the second day of the workshop.
A special session on practical matters of Web Archiving rounded off the second day, providing practical advice as well as detailed insights into operational matters.
The workshop commenced with a talk by Adam Jatowt from Kyoto University, who presented three interaction types with page histories. To this end, he introduced a set of tools offering combined browsing between current and historic pages. His first system supports passive browsing of page histories as a slide show with change detection and indication. With the page-history explorer, users get a visual summary of page evolution based on term clouds and thumbnail presentations in a spatial layout, which also provides fascinating possibilities for comparing multiple web pages to see how they evolved. In a third prototype, page views are enriched with history-derived information, such as displaying the age of certain content elements on a web page.
Sangchul Song from the University of Maryland addressed the question of how to retrieve stored documents from a web crawl more efficiently than simple sequential wrapping into container formats such as WARC allows. His approach uses the popularity of pages to determine packaging: pages that are closely linked and likely to be retrieved in a single navigation session are encapsulated together, which reduces the number of containers that must be loaded from disk. The packaging itself is computed with efficient graph partitioning techniques.
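To make the packaging idea concrete, the following is a minimal sketch, not the method presented in the talk. It swaps the graph partitioning step for a simple greedy heuristic: each container is seeded with the most popular page not yet packed and then filled with pages reachable over links. The function name `pack_pages`, the input dictionaries and the page-count `capacity` are all assumptions made for illustration.

```python
from collections import deque


def pack_pages(links, popularity, capacity):
    """Group pages into containers of at most `capacity` pages.

    links:      dict page -> iterable of linked pages (hypothetical input)
    popularity: dict page -> numeric score (hypothetical input)
    capacity:   maximum number of pages per container (assumed unit)
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")

    # Treat links as undirected so neighbours in either direction count.
    neighbours = {page: set() for page in popularity}
    for src, targets in links.items():
        for dst in targets:
            if src in neighbours and dst in neighbours and src != dst:
                neighbours[src].add(dst)
                neighbours[dst].add(src)

    def by_popularity(pages):
        # Most popular first; ties broken by name for determinism.
        return sorted(pages, key=lambda p: (-popularity[p], p))

    unassigned = set(popularity)
    containers = []
    for seed in by_popularity(popularity):
        if seed not in unassigned:
            continue
        unassigned.discard(seed)
        container, queue = [], deque([seed])
        while queue:
            page = queue.popleft()
            container.append(page)
            for nxt in by_popularity(neighbours[page]):
                # Reserve a slot only while the container still has room.
                if nxt in unassigned and len(container) + len(queue) < capacity:
                    unassigned.discard(nxt)
                    queue.append(nxt)
        containers.append(container)
    return containers


if __name__ == "__main__":
    links = {"home": ["a", "b"], "a": ["c"], "x": ["y"]}
    popularity = {"home": 10, "a": 5, "b": 4, "c": 1, "x": 3, "y": 2}
    print(pack_pages(links, popularity, capacity=3))
```

A navigation session that starts at a popular page and follows a few links then tends to touch a single container. A real partitioner would additionally minimise the number of links cut between containers, which this heuristic does not attempt.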
In his talk, Andreas Rauber from the Vienna University of Technology raised the highly controversial issue of the ethical implications of providing access to Web Archives. He analyzed a range of assumptions underlying the motivation for collecting and serving historic Web data. Rauber identified potential ethical dangers when it comes to content created, e.g., by children, or content meant to be ephemeral when created, as well as the characteristics of the Web as a communication rather than a publication venue. He calls for intensified research into technical solutions to inform decisions on how best to make Web archives available as valuable information resources that respect high ethical expectations of privacy.
France Lasfargues and Clément Oury from the Bibliothèque nationale de France (BnF) presented guidelines for designing domain crawls that combine different strategies for obtaining a harvest of the French web. The crawl was performed by the Internet Archive, with the BnF coordinating its design.
The National Archives in the UK presented an innovative project (Web Continuity), in which all central governmental websites will be provided with an archiving and redirection service. Instead of getting a 404 (file not found) error message, users of these governmental sites will automatically be redirected to the archived page for which they are looking. The service, powered by the European Archive Foundation, is already working on several large websites, significantly enhancing the user's experience by providing seamless navigation and helping the site's producers by integrating an archiving function automatically.
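The redirection behaviour can be sketched as follows. This is an illustration of the general idea only, not the Web Continuity implementation: the function name `respond`, the in-memory sets of live and archived paths, the choice of a 302 status and the archive base URL `https://archive.example.org/gov` are all assumptions.

```python
def respond(path, live_pages, archived_pages,
            archive_base="https://archive.example.org/gov"):
    """Return (status, location) for a request to `path`.

    live_pages / archived_pages: sets of known paths (hypothetical inputs).
    archive_base: hypothetical prefix of the archived copy of the site.
    """
    if path in live_pages:
        return 200, None                   # page still exists on the live site
    if path in archived_pages:
        return 302, archive_base + path    # send the user to the archived copy
    return 404, None                       # neither live nor archived


if __name__ == "__main__":
    live = {"/", "/contact"}
    archived = {"/", "/reports/2005"}
    for p in ("/contact", "/reports/2005", "/unknown"):
        print(p, respond(p, live, archived))
```

The live site is checked first, so the archive is only consulted for pages that would otherwise produce an error.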
A specific workshop session was devoted to the presentation of activities from the Living Web Archives (LiWA) project <http://liwa-project.eu>. LiWA is an EU-funded project that is part of FP7 (the Seventh Framework Programme). As part of this session, András Benczúr from the Hungarian Academy of Sciences first addressed the important topic of Web Spam analysis and its influence on Web Archiving initiatives. He discussed the various techniques used to achieve higher rankings as part of search engine optimization, as well as the effects of domain parking. Building models of spam sites will make it possible to identify them at crawl time and to treat them accordingly in crawl management.
Next, Nina Tahmasebi from the L3S Research Center described an approach for handling terminology evolution in Web Archive search and access. Examples of the problems addressed are different names for the same concept at different times, e.g., St. Petersburg and Leningrad referring to the same city, or chairman being replaced in usage by more politically correct terms such as Chair or Chairperson. This work is based on initial concept identification and a temporal association between a term and its associated concepts.
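A temporal association between terms and concepts can be pictured as a small lookup table. The sketch below is only an illustration of that idea, not the approach presented in the talk: the table `TERM_PERIODS`, the concept identifier and the function `terms_for` are hypothetical, and the years are approximate.

```python
# (term, concept id, first year in use, year replaced or None if still in use)
# Hypothetical hand-made table; a real system would derive it from the archive.
TERM_PERIODS = [
    ("St. Petersburg", "city:saint-petersburg", 1703, 1914),
    ("Petrograd", "city:saint-petersburg", 1914, 1924),
    ("Leningrad", "city:saint-petersburg", 1924, 1991),
    ("St. Petersburg", "city:saint-petersburg", 1991, None),
]


def terms_for(query_term, year):
    """Return the terms used in `year` for the concept(s) behind `query_term`."""
    concepts = {c for t, c, _, _ in TERM_PERIODS if t == query_term}
    return sorted({
        t for t, c, start, end in TERM_PERIODS
        if c in concepts and start <= year and (end is None or year < end)
    })


if __name__ == "__main__":
    # A user searching with today's name in documents from 1950.
    print(terms_for("St. Petersburg", 1950))
```

A search interface could use such a lookup to rewrite a query with the term that was current at the time the archived pages were written.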
Radu Pop from the European Archive presented the general architectural concepts and development structure for the LiWA project, which builds in part on the IIPC web archiving reference platform (which includes Heritrix, the Wayback Machine and Nutch).
Marc Spaniol from the Max Planck Institute for Informatics addressed issues of temporal coherence in Web crawling (i.e., all pages referring to the same version of the site). This coherence is difficult to achieve because updates may occur during the crawl. Spaniol presented early results in this domain, specifically a method to visualize in which parts of a website pages changed while the whole site was being crawled.
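As a rough illustration of what such a visualization needs as input, the sketch below counts, per top-level section of a site, the pages that changed while the crawl was running. It is not the method presented in the talk: the function `changed_sections`, the use of plain numeric timestamps and the assumption that change times are already known for each URL are all made up for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def changed_sections(change_times, crawl_start, crawl_end):
    """Count pages per top-level path section that changed during the crawl.

    change_times: dict url -> list of change timestamps (hypothetical input,
                  e.g. obtained by revisiting pages after the crawl)
    crawl_start, crawl_end: timestamps in the same unit as the change times
    """
    counts = Counter()
    for url, times in change_times.items():
        if any(crawl_start <= t <= crawl_end for t in times):
            path = urlsplit(url).path
            # "/news/a.html" -> "/news"; the root page maps to "/".
            section = "/" + path.strip("/").split("/")[0]
            counts[section] += 1
    return dict(counts)


if __name__ == "__main__":
    changes = {
        "http://example.org/news/a.html": [105],
        "http://example.org/news/b.html": [50, 130],
        "http://example.org/about/index.html": [300],
        "http://example.org/": [],
    }
    print(changed_sections(changes, crawl_start=100, crawl_end=200))
```

Sections with high counts are those where the archived snapshot is most likely to mix different versions of the site.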
The final LiWA presentation was made by Mark Williamson from Hanzo Archives Ltd. He described advances in detecting and extracting links from streaming media such as Flash or RealMedia, or from dynamic links and the deep web, in order to feed them into the crawling process. This is achieved by integrating helper applications that interpret the respective formats and extract the links there, instead of using the traditional method of link parsing (on which search engine and other crawlers have been based since 1993).
The second day of IWAW 2008 started with reports on case studies from various national Web archiving initiatives in France, the Czech Republic, Portugal and Taiwan. The discussions covered differences in the legal settings under which these initiatives operate and the resulting project structures, as well as the technical aspects of collection building and maintenance.
Daniel Gomes from FCCN presented the current Portuguese Web Archiving initiative, describing the storage solutions and crawling set-up. He proposed a distributed architecture for LOCKSS-style storage of the resulting ARC files [1] in cooperation with external partners, including ciphers for protecting the distributed data.
Yen-liang Chen from the National Taiwan University Library presented their Web Archiving Initiative, NTUWAS, which had collected 6.5 TB of data by September 2008 from more than 42,000 websites, based on selective crawling. The NTUWAS navigation interface includes timelines and other views, and lets users document errors they identify on the displayed pages. It supports advanced search interfaces for various metadata fields as well as full-text search.
In the afternoon of the second day of the workshop, a special IWAW session was organized to discuss practical issues related to Web Archiving, focussing specifically on the experiences of the Danish Netarchive project and its supporting software suite. Presentations covered the combination of snapshot and selective harvesting of the Danish Internet, where 80 sites are collected more frequently than the snapshot harvest would permit. Selection of the sites to archive is based on news media analysis, Internet statistics and an advisory board. Philip Beresford from the British Library shared his experiences with the Web Curator tool, while Paul Wu from Nanyang Technological University in Singapore presented tools for annotating Web Archive content. France Lasfargues described the BnF's experience of archiving weblogs in collaboration with their authors and discussed the authors' reactions to the idea of their works being preserved in a library.
Overall, the workshop benefited tremendously from the intensive discussions that took place after each individual presentation. The discussions allowed ample time for exchanging experiences and debating different approaches. All peer-reviewed papers as well as the presentations are available via the IWAW website at <http://www.iwaw.net>.
1. ARC web site: <http://www.archive.org/web/researcher/ArcFileFormat.php>.
Copyright © 2008 Andreas Rauber and Julien Masanès