This article presents an overview of the LEAF project (Linking and Exploring Authority Files)1, which has set out to provide a framework for international, collaborative work in the sector of authority data with respect to authority control.
Elaborating the virtues of authority control in today's Web environment is an almost futile exercise, since so much has been said and written about it in the last few years.2 The World Wide Web is generally understood to be poorly structured, both with regard to content and to locating required information. Highly structured databases might be viewed as small islands of precision within this chaotic environment. Though the Web in general or any particular structured database would greatly benefit from increased authority control, it should be noted that the following considerations refer only to authority control with regard to databases of "memory institutions" (i.e., libraries, archives, and museums). Moreover, when talking about authority records, we refer exclusively to personal name authority records that describe a specific person. Although different types of authority records could indeed be used in similar ways to the ones presented in this article, discussing those different types is outside the scope of both the LEAF project and this article.
Personal name authority records, like all other "authorities", are maintained as separate records and linked to various kinds of descriptive records. Name authority records are usually kept either in independent databases or in separate tables in the database containing the descriptive records. This practice points to a crucial benefit: by linking any number of descriptive records to an authorized name record, the records related to this entity are collocated in the database. Variant forms of the authorized name are referenced in the authority records and thus ensure the consistency of the database while enabling search and retrieval operations that produce accurate results.
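The collocation effect described above can be illustrated with a minimal sketch. The class, the function and the sample data are hypothetical and not taken from any LEAF component; the point is only that a search on any variant name form reaches every descriptive record linked to the authority record.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorityRecord:
    record_id: str
    authorized_name: str
    variant_names: list = field(default_factory=list)  # "see-references"

    def matches(self, query):
        # Compare case-insensitively against the heading and all variants.
        names = [self.authorized_name, *self.variant_names]
        return query.casefold() in (n.casefold() for n in names)


def find_descriptive_records(query, authorities, descriptive_records):
    """Return titles of descriptive records linked to any matching authority."""
    matching_ids = {a.record_id for a in authorities if a.matches(query)}
    return [d["title"] for d in descriptive_records
            if d["authority_id"] in matching_ids]


# Illustrative data only.
authorities = [
    AuthorityRecord("a1", "Ivan Sergeevich Turgenev",
                    ["Iwan Turgenjew", "Ivan Sergueevitch Tourguenev"]),
]
descriptive_records = [
    {"title": "Fathers and Sons", "authority_id": "a1"},
    {"title": "Letter to Pauline Viardot", "authority_id": "a1"},
]
```

Because descriptive records carry only the authority ID, a corrected or extended name form has to be maintained in one place only.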
On one hand, authority control may be viewed as a positive prerequisite of a consistent catalogue; on the other, the creation of new authority records is a very time consuming and expensive undertaking. As a consequence, various models of providing access to existing authority records have emerged: the Library of Congress and the French National Library (Bibliothèque nationale de France), for example, make their authority records available to all via a web-based search service.3 In Germany, the Personal Name Authority File (PND, Personennamendatei4) maintained by the German National Library (Die Deutsche Bibliothek, Frankfurt/Main) offers a different approach to shared access: within a closed network, participating institutions have online access to their pooled data.
The number of recent projects and initiatives that have addressed the issue of authority control in one way or another is considerable.5 Two important current initiatives should be mentioned here: The Name Authority Cooperative (NACO) and Virtual International Authority File (VIAF).
NACO was established in 1976 and is hosted by the Library of Congress. At the beginning of 2003, nearly 400 institutions were involved in this undertaking, including 43 institutions from outside the United States.6 Despite the enormous success of NACO and the impressive annual growth of the initiative, there are requirements for participation that form an obstacle for many institutions: they have to follow the Anglo-American Cataloguing Rules (AACR2) and employ the MARC217 data format. Participating institutions also have to belong to either OCLC (Online Computer Library Center) or RLG (Research Libraries Group) in order to be able to contribute records, and they have to provide a specified minimum number of authority records per year.
A recent proof-of-concept project of the Library of Congress, OCLC and the German National Library, the Virtual International Authority File (VIAF),8 will, in its first phase, test automatic linking of the records of the Library of Congress Name Authority File (LCNAF) and the German Personal Name Authority File by using matching algorithms and software developed by OCLC. The results are expected to form the basis of a "Virtual International Authority File". The project will then test the maintenance of the virtual authority file by employing the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)9 to harvest the metadata for new, updated, and deleted records. When using the "Virtual International Authority File", a cataloguer will be able to check the system to see whether the authority record to be established already exists. The final phase of the project will test possibilities for displaying records in the preferred language and script of the end user.
Currently, there are still some clear limitations associated with the ways in which authority records are used by memory institutions. One of the main problems has to do with limited access: generally only large institutions or those that are part of a library network have unlimited online access to permanently updated authority records. Smaller institutions outside these networks usually have to fall back on less efficient ways of obtaining authority data, or have no access at all.
Cross-domain sharing of authority data between libraries, archives, museums and other memory institutions simply does not happen at present. Public users are, by and large, not even aware that such things as name authority records exist and are excluded from access to these information resources.
The LEAF project was founded to improve the benefits of authority control in a number of ways. Co-funded through the "Information Society Technologies" research programme within the European Commission's Fifth Framework for research and development, the LEAF project began in March 2001 and has brought together 15 partners (libraries, archives, documentation and research centres, universities and system developers) in 10 European countries.10 The Berlin State Library (Staatsbibliothek zu Berlin) acts as the coordinator of the project, while the technical development is spearheaded by Joanneum Research (Graz, Austria).
With respect to the nature and characteristics of the authority data used in the LEAF project, the consortium partners are very diverse. Standardized records from national and local authority files exist alongside very differently structured, but also standardized archival descriptions, together with records following local rules only. Formats include various forms of MARC, the German MAB211, EAC12 and local formats.
The scope and size of the data provided to LEAF also differ considerably from partner to partner. The National Library of Portugal provides the personal name authority records of the Portuguese PORBASE union catalogue (around 550,000 records in UNIMARC format). The Berlin State Library participates with the biographical data of Kalliope, a union catalogue mainly providing descriptions of manuscript material kept in more than 100 different institutions throughout the country (around 250,000 records in MAB2 format). The name records in Kalliope are part of the PND which, in turn, is partly used by the Austrian National Library as well. The Austrian National Library additionally provides internally created, detailed, biographical descriptions. The PND is also partly used by the German Literary Archives, which provide around 120,000 authority records (MAB2) from its Kallías database. The Swiss National Library participates with records from its Index of Archival Collections (non-standard format). The Swedish National Archives provide the biographical descriptions of their national archival system (around 170,000 records in EAC XML). The Slovenian National and University Library provides authority records from its manuscript catalogue (COMARC and MARC21). The Library of the University Complutense of Madrid makes records of its CISNE system available to LEAF (MARC21). Smaller institutions provide further varieties of data: the Research and Documentation Centre of Austrian Philosophy provides data designed to cater for the very specific needs of its user community, as does the French publishing archive Institut Mémoires de L'Édition Contemporaine.
LEAF's primary objective has been to develop a system through which distributed name authority records are gathered, automatically linked together in meaningful ways, made available to a variety of operations and opened up for multiple analysis. The main steps comprised in the LEAF scenario are as follows:
On a very generic level the expected benefits are multiple: public users of LEAF will either benefit by retrieving data from the LEAF system or by improving search precision in other applications with the help of the LEAF service. All users, but in particular professionals (such as librarians, archivists and other specialists), will have access to rich biographical information that can be used in many ways.
System Architecture
LEAF is based on a central system that harvests, stores, processes and makes available the authority records of the LEAF Data Providers.
To accomplish its various operations, the LEAF system consists of several modules (see Figure 1): offline components harvest local authority files, convert the local data formats into an EAC XML representation, import the data into the central repository and automatically link records describing the same person.
The online components consist of a range of user interfaces that allow for browsing, searching and annotating records or linked descriptions.
A Maintenance Suite allows data providers' administrators to add, configure and remove connectivity details for their local service to the LEAF system. The suite is also responsible for monitoring the status of LEAF Data Providers' server systems. Furthermore, a list of connected and queried servers can be kept up-to-date with the assistance of these tools. The Maintenance Suite notifies the LEAF system whenever technical or administrative settings are added or changed.
Interfaces to external systems make the functionality of LEAF available to other services like MALVINE (Manuscripts And Letters Via Integrated Networks in Europe).13
Different technologies are employed by the LEAF system. The central system, provided by Joanneum Research (Graz, Austria), is mostly based on Java technology. The conversion module, provided by the University of Bergen (Norway), is based on Python, Perl and XSLT. The Maintenance Suite, provided by Crossnet Systems (Newbury, United Kingdom), consists of a web application (PHP scripting and a relational database) and Java programs for monitoring the local servers. The modular design of the LEAF system allows the different development teams to perform their tasks independently and caters for future flexibility when adding or replacing system parts.
The central system is based on a Linux environment, while the repository and the data import mechanisms make use of an Oracle Relational Data Base Management System and a range of PL/SQL stored procedures. Incoming HTTP requests are accepted by a JRun application server acting as web server and hosting the LEAF online components. On top of this application server the Apache Cocoon XML publish framework controls the flow between the various requested URIs and also provides conversions via XSLT style sheets for output of the web pages. Interactions with the LEAF database and external sites are done via the business logic including actions (Java classes) used by the Cocoon configuration.
The different components of the overall architecture of the LEAF system communicate via defined interfaces (see Figure 2). The Harvesting Interfaces to the local servers of the LEAF Data Providers use the communication protocols available at the local systems. The Conversion Tool Interface enables the integration of data conversion components developed by different partners in the LEAF consortium and also allows conversion routines to be added for future partners. The languages coupled through this interface for processing the incoming local records are Java, Python and Perl. In addition, XSLT style sheets are integrated for presenting or exporting records. The External Systems Interface is provided by means of XML, which allows external systems to make use of the LEAF service easily. The interface to the Maintenance Suite, which monitors and maintains the connections to external resources, is realised via the exchange of XML-formatted information.
The central LEAF system is updated with new or modified local records on a regular basis. In addition, records deleted in local systems are removed from the central system. Each update procedure leads to an iteration of the automatic linking process in the central system: new links between records are created and links that are no longer valid are removed. Three different protocols are implemented for communication between the central and the local systems: FTP (upload and download), OAI-PMH, and Z39.50.
The FTP upload procedure is made up of the following steps:
When a LEAF Data Provider chooses FTP download as the update procedure, the export and version files are stored in a dedicated directory of the Provider's FTP server. In the event of a new version, the LEAF system picks up the export file from the LEAF Data Provider's FTP server and removes both the XML version file and the export file to signal the successful transfer to the local facility.
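A minimal sketch of the FTP download step, using Python's standard `ftplib`. The file names `version.xml` and `export.xml` are hypothetical, and the sketch assumes that the mere presence of the version file announces a new version; the actual LEAF conventions are not spelled out here.

```python
import io
from ftplib import FTP


def fetch_export_if_new(ftp, version_file="version.xml", export_file="export.xml"):
    """Download the export file if a version file is present, then delete both.

    `ftp` is an ftplib.FTP-like object that is already logged in and
    positioned in the provider's dedicated directory. Returns the export
    file's bytes, or None when no new version is announced.
    """
    if version_file not in ftp.nlst():
        return None  # assumption: no version file means nothing new
    buffer = io.BytesIO()
    ftp.retrbinary(f"RETR {export_file}", buffer.write)
    # Removing both files signals the successful transfer to the provider.
    ftp.delete(version_file)
    ftp.delete(export_file)
    return buffer.getvalue()


# Usage against a real server (host, credentials and directory are placeholders):
# with FTP("ftp.example.org") as ftp:
#     ftp.login("user", "password")
#     ftp.cwd("leaf")
#     data = fetch_export_if_new(ftp)
```

Deleting the files only after the download has completed means an interrupted transfer leaves the provider's directory untouched, so the next run can simply retry.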
For harvesting via OAI-PMH, the LEAF Data Provider has to provide an OAI server compatible with version 2.0 of the protocol.9
Data will be harvested by an OAI client module of the central LEAF system on a regular basis. Insert and update are processed within the same harvesting operation. The LEAF system performs selective harvesting with Datestamps ("
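In OAI-PMH 2.0, selective harvesting by datestamp is expressed through the `from` and `until` arguments of the `ListRecords` verb. The following sketch only builds such a request URL; the base URL and the metadata prefix are placeholders, and nothing here reflects the actual LEAF harvester.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None, until_date=None):
    """Build an OAI-PMH ListRecords request for selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date    # harvest records changed on or after this date
    if until_date:
        params["until"] = until_date  # ...and on or before this date
    return f"{base_url}?{urlencode(params)}"


# Placeholder provider address; "oai_dc" is the prefix every OAI-PMH server must support.
url = list_records_url("http://provider.example.org/oai", "oai_dc",
                       from_date="2003-10-01")
```

A harvester would typically pass the date of its last successful run as `from_date`, so that each run returns only new and changed records.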
For LEAF Data Provider resources that do not support either harvesting via OAI-PMH or bulk transfer via FTP, a selective integration of records via Z39.50 is foreseen by the LEAF central system. Each user search in the LEAF system is also targeted against the Z39.50 servers of those LEAF Data Providers that support this method for their authority databases. The retrieved authority records are added to the search results. These records are integrated into the LEAF system (conversion, database insert and automatic linking) during the next regular LEAF system update. This mechanism implies that the user who retrieves a specific record via Z39.50 will not get a linked search result at first because the automatic linking process will only be triggered when the next system update occurs. The records integrated in LEAF via Z39.50 will be kept up-to-date through further searches.
In order to be able to compare individual records and thus make them available for further operations, LEAF had to define one common exchange format into which all records, independently of their native format, can be converted. LEAF has adopted the emerging standard EAC (Encoded Archival Context) for this purpose. Consequently, individual conversion scripts were written for each format used by LEAF Data Providers.
Whereas common data formats exist for the exchange of authority data (MARC21 and UNIMARC, MAB2 in Germany) in the library community, a common format is not yet available for the exchange of authority data within the archival community or for the cross-domain exchange of such data between libraries, archives, museums and related institutions. EAC is a new XML DTD designed to complement the EAD (Encoded Archival Description14) format and is able to cater for these needs.15 The format fulfils the requirements of the LEAF project with regard to the disparate structure of the formats used locally (data emanating from different domains and encoded according to several different descriptive rules and formats) that have to be mapped for linking. As the authority records from the archival LEAF partners often contain rich (biographical and historical) context information, common library data exchange formats were not flexible enough to include this information.
The SGML/XML DTD of EAD was first released in 1998 and has since become the de facto standard for the exchange of descriptive archival data; a revised version of EAD was released in 2002. EAD contains elements for names of persons, families and corporate bodies with attributes allowing links to authority records16, but the format does not support separate files of authority and context information. EAC is intended to add this feature to EAD by facilitating the separate description of the context under which archival records have been created.
An international group of archival and information specialists laid down the first principles of EAC at a meeting in Toronto in March 2001.17 A DTD was developed with the cooperation and support of the LEAF project. The alpha version of the DTD (available since July 2001) was extensively tested within LEAF. The test results went into the EAC beta version that was pre-released in October 2003.18 The development of EAC is related to the current revision process of the International Standard Archival Authority Record for Corporate Bodies, Persons and Families (ISAAR(CPF)) first published by the International Council on Archives (ICA) in 1996. Several members of the ICA Committee on Descriptive Standards (ICA/CDS19), which is currently revising20 ISAAR (CPF), are part of the EAC working group. EAC has also been adapted to library standards for authority data in order to facilitate the exchange of authority data between the archival and library domains. A special attribute in the EAC elements (
The conversion module of the central LEAF system consists of data conversion routines for each local data structure that convert the harvested local records into EAC and the different character sets into Unicode (UTF-8). The converted data are then further processed in the LEAF system. Records are saved in EAC and in their local formats as provided by the LEAF Data Providers.
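The two tasks of a conversion routine, mapping a local structure to XML and re-encoding the character set as UTF-8, can be sketched as follows. The pipe-separated input format is invented for illustration, and the element names are a simplified, EAC-like stand-in rather than the actual EAC DTD.

```python
import xml.etree.ElementTree as ET


def convert_to_eac_like(raw, source_encoding):
    """Convert a hypothetical 'name|birth|death' record to UTF-8 encoded XML."""
    text = raw.decode(source_encoding)  # local character set -> Unicode
    name, birth, death = (part.strip() for part in text.split("|"))
    root = ET.Element("eac")            # simplified structure, not the real DTD
    identity = ET.SubElement(root, "identity")
    ET.SubElement(identity, "name").text = name
    ET.SubElement(identity, "existdate").text = f"{birth}-{death}"
    return ET.tostring(root, encoding="utf-8")


# A record delivered in Latin-1 by a hypothetical local system.
local_record = "Mörike, Eduard|1804|1875".encode("latin-1")
eac_bytes = convert_to_eac_like(local_record, "latin-1")
```

Keeping the original bytes alongside the converted record, as LEAF does, allows a conversion routine to be corrected and re-run later without harvesting again.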
For presentation of the records by the LEAF user interfaces, XSLT scripts are employed that transform the XML data into HTML representations according to the specific requests of users. Besides a view in the specific "LEAF Presentation Format", users can view the records in "MARC21-like", "UNIMARC-like" and "MAB2-like" formats. These presentations are accomplished by XSLT transformations of the EAC representations of single records. (These transformations can, of course, not be done for aggregated records.) However, the results of the transformations are not "true" MARC or MAB records: creating "true" MARC or MAB records would be highly complicated not only due to incompatibilities between cataloguing rules and data formats but also because of idiosyncratic character sets. The results may nevertheless be useful for further (manual) processing within the users' local systems. Registered LEAF users can import selected records into their individual LEAF online user work space from where they can be manually edited, stored and downloaded. Users will also be able to download records either in their original local formats or in the EAC XML format and then further process them locally.
After conversion, the LEAF Authority Records (LARs) are subjected to an automatic linking process. Links are established between records when these are interpreted as highly likely to refer to the same person. When a link is established between two or more LEAF Authority Records, these are merged into a Shared LEAF Authority Record (SLAR). Each SLAR represents all the information associated with the LARs contributing to it. Upon retrieval by a public user, the status of that record will be changed to "Central Name Authority Record" (CNAR) that makes all the information from all the records in LEAF referring to the same person accessible to the user at one time. The allocation of a specific status to retrieved records is designed to allow LEAF Data Providers to specifically identify the records being of relevance to users and, if need be, to improve the quality of those records.21
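The relationship between LARs, SLARs and CNARs can be modelled roughly as below. The class and attribute names are hypothetical; the sketch assumes only what is stated above, namely that a SLAR aggregates the information of its contributing LARs and that retrieval by a public user changes its status to CNAR.

```python
from dataclasses import dataclass, field


@dataclass
class LAR:
    provider: str
    name_forms: set
    life_dates: str = None


@dataclass
class SLAR:
    members: list = field(default_factory=list)
    status: str = "SLAR"

    @property
    def name_forms(self):
        # A SLAR exposes the union of all name forms of its member LARs.
        return set().union(*(m.name_forms for m in self.members))

    @property
    def providers(self):
        return sorted({m.provider for m in self.members})

    def retrieve(self):
        """Simulate retrieval by a public user: the record becomes a CNAR."""
        self.status = "CNAR"
        return self


shared = SLAR([
    LAR("Provider A", {"Ivan Sergeevich Turgenev"}, "1818-1883"),
    LAR("Provider B", {"Ivan S. Turgenev", "Iwan Turgenjew"}, "1818-1883"),
])
```

Calling `shared.retrieve()` flips the status, which is the hook that lets Data Providers see which of their records are actually being used.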
The update and linking process is a fairly complicated set of single steps which have to be performed in a well defined and ordered way. Figure 3 points out the principles and corresponding results in a descriptive form.
It is worthwhile at this point to elaborate on the differences in structure and content of the uploaded records. Library authority records may consist of one name form only or may include numerous "see-references", i.e., other name forms than the main heading. They may also contain life dates, information about the professional background, family and other relations, etc. Although more or less rich in information, library authority records have a similar structure that is determined by their main purpose of controlling names and disambiguating records and thus, by implication, the identification of one specific person. The use of authority data in archives has a wider scope: it is intended not only to control headings but also to provide biographical and historical information about the entities that created the archival records described in the archival finding aids, as well as information about the relations between these entities.
Library and archival authority records share very few types of information. Common elements are IDs, names and life dates only. These elements form the basis for the LEAF linking rules. The information that can be used for the linking process is clearly also determined by the type of descriptive records linked to the name authority records: when these describe material that is not unique, e.g., books, the information contained in the bibliographic records linked to the authority records might also be utilized for the linking process. This approach is chosen by the VIAF project. It is, however, not suitable for LEAF because many of the authority records used by LEAF are taken from archival data providers and are linked to unique resources. Because standardized and controlled vocabularies are not available on an international, multilingual level, other elements of authority records, such as professions, are not suitable for linking purposes either.
Whenever single records include identical (national) authority file IDs (e.g., the same PND ID or the same LCNAF ID), the links will be established through those IDs. In all other cases, name forms and life dates, in various combinations, are the only parameters suited to establish links. Clearly, see-references (i.e., other name forms) are, in many cases, of crucial importance for the linking process: if there is a certain likelihood that persons are described by different name forms in different countries in accordance with different cataloguing and transliteration rules22, there is also a certain likelihood that one record or another will contain the name form "required" for the linking process as a see-reference.
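The two kinds of linking criteria can be sketched as follows. Rule 1 follows the text directly; rule 2 is an assumed simplification (one shared name form plus identical, known life dates) standing in for the "various combinations" that LEAF actually evaluates. Record layout and IDs are hypothetical. Linked records are grouped transitively with a small union-find structure.

```python
from itertools import combinations


def normalize(name):
    return " ".join(name.casefold().split())


def should_link(a, b):
    # Rule 1: identical authority file IDs (e.g., the same PND or LCNAF ID).
    if a["authority_ids"] & b["authority_ids"]:
        return True
    # Rule 2 (assumed): a shared name form plus identical, known life dates.
    shared = {normalize(n) for n in a["names"]} & {normalize(n) for n in b["names"]}
    return bool(shared) and a["life_dates"] is not None and a["life_dates"] == b["life_dates"]


def link_records(records):
    """Return groups of record indices that are transitively linked."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if should_link(records[i], records[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


records = [
    {"names": ["Ivan Sergeevich Turgenev", "Iwan Turgenjew"],
     "life_dates": "1818-1883", "authority_ids": {("LCNAF", "x1")}},
    {"names": ["Ivan S. Turgenev", "Iwan Turgenjew"],
     "life_dates": "1818-1883", "authority_ids": {("PND", "p1")}},
    {"names": ["Ivan Sergueevitch Tourguenev"],
     "life_dates": "1818-1883", "authority_ids": {("PND", "p1")}},
    {"names": ["Iwan Turgenjew"], "life_dates": None, "authority_ids": set()},
]
groups = link_records(records)
```

In this example the first two records are linked through a shared see-reference and matching life dates, the second and third through an identical ID, and the fourth stays unlinked because it lacks discriminating information.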
The general steps of a repeated linking process during the update of the LEAF system with new or modified records are shown in the following state transition diagram (Figure 4).
The retrieved CNARs receive IDs that are persistent. This was deemed necessary so that external resources can link to any CNAR. However, since data may be continuously modified or updated locally, the content of a CNAR cannot be assumed to be persistent at all. A faulty LAR, for example, may have created a "wrong" SLAR. The erroneous data is identified at a later point and the record in question changed locally. Upon updating LEAF with this modified record, an existing SLAR is automatically identified as invalid. In turn, the locally modified record and other LARs may instead form a different SLAR. Therefore, IDs of broken CNARs will be maintained and, if applicable, point to a newly created CNAR.
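One way to keep IDs persistent while contents change is a registry that never reuses an ID and records where a broken CNAR now points. This is a sketch under that assumption, with hypothetical names and ID syntax; how LEAF implements persistence internally is not described here.

```python
class CnarRegistry:
    def __init__(self):
        self._records = {}    # live CNAR ID -> content
        self._redirects = {}  # broken CNAR ID -> successor ID (or None)
        self._counter = 0

    def create(self, content):
        self._counter += 1
        cnar_id = f"cnar-{self._counter}"  # IDs are never reused
        self._records[cnar_id] = content
        return cnar_id

    def supersede(self, old_id, new_content=None):
        """Mark a CNAR as broken; optionally point it at a newly created CNAR."""
        del self._records[old_id]
        new_id = self.create(new_content) if new_content is not None else None
        self._redirects[old_id] = new_id
        return new_id

    def resolve(self, cnar_id):
        """Follow redirects; return (current_id, content) or (None, None)."""
        while cnar_id in self._redirects:
            cnar_id = self._redirects[cnar_id]
        if cnar_id is None:
            return None, None
        return cnar_id, self._records[cnar_id]


registry = CnarRegistry()
first = registry.create({"names": ["A. Example"]})
second = registry.supersede(first, {"names": ["A. Example", "Anna Example"]})
```

An external resource that stored `first` still resolves to a live record, even though the CNAR it originally pointed to no longer exists.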
It is inevitable that in some instances the linking process will produce incorrect results. Records describing two different persons might be automatically linked because they do not contain enough discriminating information. On the other hand, two records representing the same person might not be linked because they do not share an identical name form. Recalling the main purpose of library authority records, the disambiguation of the persons described, it may be argued that those records leading to wrong links are not sufficiently rich in content to serve their original purpose.
The LEAF system offers extensive annotation facilities that serve a variety of purposes. Three types of annotations are distinguished:
External Systems and Online Resources
One of the main features of LEAF is the possibility to connect the service to external systems: these can query the LEAF system and extract name information required to search descriptive records with greater efficiency. The connection to external systems is realised via an XML over HTTP interface. An example of an external system connected in this way is the distributed MALVINE search and retrieval system. In those cases where MALVINE Data Providers are also LEAF Data Providers, the particular name form used in that Data Provider's authority record and descriptive record(s) can be identified via LEAF and used to significantly improve search precision. In cases where MALVINE Data Providers do not act as LEAF Data Providers, additional search arguments (i.e., all relevant name forms identified in LEAF, including all see-references) can be used to broaden a search and thus improve chances of a successful retrieval. It is sufficient for LEAF to make name search results and full name authority records available to MALVINE; all other features of LEAF are not applicable in this scenario.
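The two cases can be expressed as a small selection function. The mapping from provider to name forms is a hypothetical stand-in for what an external system would extract from a LEAF response; the XML over HTTP exchange itself is not modelled.

```python
def search_terms(name_forms_by_provider, target_provider):
    """Choose search arguments for one target Data Provider.

    `name_forms_by_provider` maps a provider name to the name forms found in
    its authority record, main heading first (hypothetical structure).
    """
    if target_provider in name_forms_by_provider:
        # The target is also a LEAF Data Provider: use exactly its own
        # main heading to improve precision.
        return [name_forms_by_provider[target_provider][0]]
    # Otherwise broaden the search with every known name form.
    terms = []
    for forms in name_forms_by_provider.values():
        for form in forms:
            if form not in terms:
                terms.append(form)
    return terms


leaf_result = {
    "Provider A": ["Ivan Sergeevich Turgenev", "Iwan Turgenjew"],
    "Provider B": ["Ivan S. Turgenev", "Iwan Turgenjew"],
}
```

A distributed search service would call this once per target, sending a single precise term to participating providers and the full set of variants to all others.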
Of course, this is just one possible example. Other applications may use LEAF in a similar way. Within the project's duration, the integration of the LEAF service into the portal being developed within the European TEL project will be tested. TEL (The European Library)23 is a collaboration of a number of European national libraries under the auspices of the Conference of European National Libraries (CENL) which will establish a single access point to selected parts of the holdings of the partner libraries. The two projects will test the integration of the LEAF service into the TEL portal via XML over HTTP requests in order to enhance the search results in TEL by employing the linked authority records provided by LEAF.24
Besides the possible integration of LEAF into external search services, the LEAF system itself can also actively extend searches to external systems. In these cases, a query is sent to the external system and results are retrieved via Z39.50. Within the project's duration, this functionality is demonstrated with the Kalliope database: when a user submits a query to LEAF, the search is simultaneously performed in the LEAF database and in Kalliope. Other systems could, of course, be addressed via the same protocol. The implementation of additional search and retrieval protocols (e.g., SRW25) will be possible in the future.
Another possibility to connect to external systems is available if the external online system supports searches via URIs and the bibliographic records contain authority file IDs. In these cases, the relevant URI search string details are stored in the Data Provider configuration of the LEAF Maintenance Suite. A user can then extend the search from LEAF to the bibliographic database of a Data Provider and retrieve all bibliographic records linked to a specific authority record.
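A stored URI search string can be treated as a template into which the authority file ID is inserted. The `{id}` placeholder syntax, the OPAC address and the sample ID are assumptions made for this sketch; the format actually used by the Maintenance Suite is not specified here.

```python
from urllib.parse import quote


def build_search_uri(template, authority_id):
    """Insert a percent-encoded authority file ID into a provider's URI template."""
    return template.replace("{id}", quote(authority_id, safe=""))


# Hypothetical entry from a Data Provider configuration.
template = "http://opac.example.org/search?authority={id}"
uri = build_search_uri(template, "PND 123/4")
```

Percent-encoding the ID keeps spaces, slashes and similar characters from breaking the query string of the target system.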
Lastly, external systems and online resources can reliably link to records in the LEAF system. This is possible because all CNARs in LEAF have stable IDs. External services can thus provide current authority data to their user communities.
At the time of writing this article, the LEAF consortium is testing a prototype of the LEAF system. The LEAF project itself is scheduled to finish in May 2004. An improved prototype version will be made available for public testing before that time and announced via listservs and on the LEAF website.
The authors wish to thank Pierre Clavel (Swiss National Library), Christopher Fletcher (The British Library), and Per-Gunnar Ottosson (Swedish National Archives), all of the LEAF project, for their contributions.
2 Current developments in this sector were recently summarized and future trends formulated in: Barbara B. Tillett: Authority Control: State of the Art and New Perspectives (Paper presented at the international conference "Authority Control: Definition and International Experiences", Florence, February 10-12, 2003), <http://eprints.rclis.org/archive/00000332>.
5 For a current overview of the different international initiatives, see the papers presented at the international conference "Authority Control: Definitions and International Experience" held in Florence, February 10-12, 2003, <http://www.unifi.it/biblioteche/ac/en/program.htm>.
6 See the overview in John D. Byrum, Jr.: NACO: A Cooperative Model for Building and Maintaining a Shared Name Authority Database (Paper presented at the international conference "Authority Control: Definition and International Experiences", Florence, February 10-12, 2003), <http://eprints.rclis.org/archive/00000272>.
9 Open Archives Initiative - Protocol for Metadata Harvesting, <http://www.openarchives.org/OAI/openarchivesprotocol.html>.
12 Encoded Archival Context. See the section "Data Conversion" in this article.
15 See Daniel V. Pitti: Creator Description - Encoded Archival Context (Paper presented at the international conference "Authority Control: Definition and International Experiences", Florence, February 10-12, 2003), <http://eprints.rclis.org/archive/00000316/>
21 It is obvious from the above that the result of a particular user interaction (this is also true for the provision of services) largely depends on the type of user who triggers this interaction. The system is of course open for unregistered users, who can perform simple and advanced search as well as retrieval operations. Additionally, LEAF distinguishes four other types of registered users, ranging from registered public users to professionals (who may either proactively provide authority data to LEAF or not) and external service providers. LEAF allocates different rights to each of these user types.
22 An example: the Russian writer Ivan Sergeevich Turgenev (Library of Congress main heading), is listed in the German PND as "Ivan S. Turgenev" (a German user will probably search for "Iwan Turgenjew") and in the French RAMEAU (Répertoire d'autorité-matière encyclopédique et alphabétique unifié) as "Ivan Sergueevitch Tourguenev".
24 See Genevieve Clavel-Merrin: National Libraries as Access points: The Role of TEL and MACS (Paper presented at the 69th IFLA General Conference and Council, Berlin, Germany, August 1-9, 2003), <http://www.ifla.org/IV/ifla69/papers/028e-Clavel-Merrin.pdf>.
25 SRW - Search/Retrieve Web Service, <http://www.loc.gov/z3950/agency/zing/srw/background.html> .
Copyright © Max Kaiser, Hans-Jörg Lieder, Kurt Majcen, and Heribert Vallant