Search   |   Back Issues   |   Author Index   |   Title Index   |   Contents

Articles

spacer

D-Lib Magazine
January 2005

Volume 11 Number 1

ISSN 1082-9873

Trend Analysis of the Digital Library Community

 

Johan Bollen, Michael L. Nelson, Giridhar Manepalli, Giridhar Nandigam, and Suchitra Manepalli
{jbollen,mln,gmanepal,gnandiga,svijayak}@cs.odu.edu

Department of Computer Science
Old Dominion University
Norfolk VA 23508

Red Line

spacer

Abstract

Since its inception in 1995, D-Lib Magazine's articles, reviews, and editorials have documented the rapid evolution in the digital library (DL) community. To quantify the longitudinal evolution of novel ideas and concepts in this community, we mined the contents of D-Lib Magazine. For each year since 1995, we analyzed how terms co-occurred. These patterns of term co-occurrence reveal not only which terms have been used more or less frequently over the past 10 years, but also the context in which they occurred. This contextual information allows us to more precisely examine the manifold aspects of trends and evolutions in DLs.

1. Introduction

1.1 Problem Statement

D-Lib Magazine is "an electronic publication with a primary focus on digital library research and development, including but not limited to new technologies, applications, and contextual social and economic issues" and its primary goal is "timely and efficient information exchange for the digital library community" (Wilson, 2004). As such, D-Lib Magazine has evolved into the primary venue for "awareness"-type publications within the DL community. This is evidenced from its publication style: announcements and articles about novel, theoretical and practical research results are published comparatively quickly after acceptance and submission.

D-Lib Magazine is published 11 months a year and contains about 5 articles per edition in conjunction with numerous reviews and conference reports. The back issues of D-Lib Magazine represent an essential sample of trends and evolutions in the DL community. However, the determination of what has constituted or constitutes a genuine trend or evolution in the DL community, and what has conversely been found to be a fad or hype, remains a highly subjective affair.

Using the D-Lib Magazine corpus, we performed a quantitative trend analysis of D-Lib Magazine's text content for the past ten years. We identified shifting patterns of how words co-occur in documents over time and use these patterns to pinpoint some trends in the community.

1.2 General Approach

Our analysis is based on the assumption that the frequency of term or word usage in a document collection serves as an indication of how important that term and the corresponding notion is to the community at the time of publication. However, words in isolation do not fully express the views and perspectives of a community. To better evaluate the context of the occurrence of terms, we examined their co-occurrence with other terms. In that sense, we determine that when two terms frequently co-occur in, for example, the same sentence, they are related. The more frequently they co-occur compared to how often they each occur individually the stronger we will assume their relationship to be. The resulting networks of term relationships can then be examined for each publication year and sequenced to determine how the context of a particular term changed over time.

2. Methodology

2.1 Data Collection

All issues of D-Lib Magazine are available on the web in HTML. We wrote a crawler which harvested each article in a particular year, converted it to plain text (removing all HTML source code) and stored the result in a directory according to the year in which it was published.

All full D-Lib articles were harvested including briefs, editorials, and conference reports on the basis of the documents listed in the author index. A total of 675 text documents were downloaded and saved. The number of unique articles for each year was distributed as shown in Table 1.

Year Articles
1995 29
1996 77
1997 70
1998 73
1999 66
2000 62
2001 67
2002 78
2003 78
2004 75

Table 1 : Number of unique articles published each year

2.2 Data Pruning

For each year of D-Lib Magazine's publication, we concatenated all articles and separated their content into individual terms.

We then removed stopwords (e.g. "the", "a", "I", "we", etc.). Proper names were retained since they provide essential information on which persons are most strongly associated with a particular term or concepts.

All extracted terms were stemmed, meaning they were reduced to their root form by removing all changeable endings (Porter, 1980). For example, "operation", "operating" and "operator" can all be reduced to the common root "operat". This procedure allows the grouping of related terms regardless of their linguistic position or context.

No amount of data pruning can remove all irrelevant and spurious terms extracted from a corpus such as D-Lib Magazine. We decided to err on the side of caution and accept a higher rate of false positives in exchange for capturing as wide a variety of terms as possible.

2.3 Term association networks

From the list of terms extracted from each year's articles we then generated term associations. We did so on the assumption that terms occurring within close proximity, e.g. same sentence or sequence of words, are related. Since sentence boundaries are difficult to automatically and consistently detect, we applied a 8 term window to the resulting text whose position was shifted down the text, one word after the other. This window did not cover the transitions between concatenated articles.

A term co-occurrence was defined as any two words jointly occurring within that sliding window of 8 words. Term co-occurrence frequency data was generated for each year and used to determine term association data on the basis that often co-occurring terms are more related than those that do not co-occur.

We thus defined the association weight A for each pair of term (ti,tj) as follows:

Formula

where P(ti & tj) represent the probability that the terms will co-occur (normalized co-occurrence frequency), and P(ti) and P(tj) represent respectively the probability that the terms will occur independently of each other (normalized occurrence frequency). This metric of term association is also known as Jaccard similarity and can be understood as calculating the ratio between the number of times the terms co-occur over the number of times they occur independently of each other. If the frequency of co-occurrence approaches the independent occurrence of a pair of terms, A(ti,tj) approaches 1, i.e. they always co-occur. If they never co-occur, regardless of how often they occur independently, A(ti,tj) approaches 0.

When applied to D-Lib Magazine's publications, we can generate a network of term associations for each year. We assume that each term indicates a particular topic or concept of interest, and that these yearly networks represent how these topics related to each other. Using the term association data we can therefore not only determine which terms were used, but the context in which they were used.

2.4 Term selection

When studying the "zeitgeist" of a community two types of conceptual entities seem to matter most:

  1. Central concepts: concepts central to the community's main mission and focus. For example, in D-Lib Magazine such concepts may be captured by the terms "digital library", "information", and "access". These concepts are assumed to be stable and experience slow shifts over time.
  2. Innovations: novel concepts are introduced at a particular time and are associated with the rapid changes and trends that a community experiences, e.g. "XML" and "Preservation". This idea is similar to that of "bursty words" as introduced by (Kleinberg 2002). Such concepts are highly unstable, shift meaning and context, and may either fade into oblivion or eventually become "central concepts" themselves.

We adopted two different approaches to associate a set of terms with "central concepts" and "innovations". First, frequency tables were generated for each year in which D-Lib was published including 2004 (all issues) and 1995 (six issues starting with July of that year). The frequency of each term was normalized by dividing it by the total of all term occurrences in that year. Those terms that appeared in the top 10 ranked terms for each of the recorded years were adopted as indicators of "central concepts".

Second, terms were selected from the frequency tables for each year by ranking each term by the values of their absolute changes in frequency from one consecutive year to another. In other words, the 10 terms that underwent the largest, positive changes in their occurrence frequency from one year to another were adopted as "innovations" within the community. These terms were selected and the evolution of their frequency examined over the 10 years of D-Lib publication.

3 Results

3.1 Central Concepts

The set of terms associated with "central concepts", as listed in Table 2, demonstrates a strong focus on a limited number of concepts as evidenced by the fact that out of 100 possible terms (10 years multiplied by the top 10 most frequent terms each year) only 22 unique terms occurred. The lists of each year's most frequent terms were therefore highly stable. For example, as expected the terms "digital", "library" and "information" are among the ten most frequent terms in all 10 years D-Lib publication.

 

Most Frequent Used Terms
1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
digit librari inform inform inform librari librari librari librari librari
inform inform librari user librari digit digit digit digit digit
librari digit user librari digit inform inform inform inform inform
user project digit digit system collect metadata metadata servic access
object access data servic user search servic user provid search
system user search system servic archiv provid develop metadata metadata
access document imag resourc access provid access access system resourc
research provid system access provid access user provid user develop
project collect access provid research data system collect refer servic
search system document search work system collect resourc develop data

 

spacer

Table 2. Central concepts selected on the basis of their top 10 frequency ranking in all years (1995-2004).

 

To determine how stably a term occurred in each year's list of the most frequent terms, we counted the number of years that a particular term occurred among the 10 most frequent terms. This number indicates its "centrality", i.e. how central and stable the term represented the content of D-Lib Magazine. Table 3 lists the number of times a term occurred among the 10 most frequent terms over 10 years.

Frequency Term
10 librari
10 inform
10 digit
9 access
8 user
8 system
7 provid
5 servic
5 search
4 metadata
4 collect
3 resourc
3 develop
3 data
2 research
2 project
2 document
1 work
1 refer
1 object
1 imag
1 archiv

Table 3. Frequency of Terms Appearing in the Top 10 Each Year

This data suggests that some of the most frequent terms are relative newcomers. The term "metadata" only started occurring among the 10 most frequent terms in 2001, but maintained a stable top position in the following 4 years. In other words, 10 years after the introduction of Dublin Core (Weibel, 1995), metadata remains a central concept within the content of D-Lib Magazine.

Although the lists of the 10 most frequent terms are quite stable across the past 10 years, this stability hides the fact that these terms have undergone significant shifts in context and meaning. We invite the reader to explore the networks of associative relations that have been generated from term co-occurrences in Table 2.

3.2 Innovations

As mentioned, we selected a set of terms which over the years have experienced the largest positive shifts in occurrence frequency from one consecutive year to another, i.e. 1995 to 1996, 1996 to 1997, etc. as indicators of "innovations" occurring in the D-Lib corpus. These sets indicate which terms have most strongly entered the discussions at D-Lib Magazine at some point in time, after which they may have continued to become a stable part of the D-Lib content or disappeared after a short initial spike in interest.

Table 4 lists the strongest growing terms for each pair of years since D-Lib Magazine started production in 1995. Each term is linked to its network graphs and a graph of its normalized frequency in the 1995 to 2004 period.

 

Terms With the Greatest Frequency Increase from the Previous Years
1995-1996 1996-1997 1997-1998 1998-1999 1999-2000 2000-2001 2001-2002 2002-2003 2003-2004
librari 1995
1996
search 1996
1997
servic 1997
1998
imag 1998
1999
collect 1999
2000
metadata 2000
2001
web 2001
2002
question 2002
2003
search 2003
2004
collect 1995
1996
data 1996
1997
link 1997
1998
digit 1998
1999
archiv 1999
2000
digit 2000
2001
evalu 2001
2002
refer 2002
2003
preserv 2003
2004
metadata 1995
1996
databas 1996
1997
resourc 1997
1998
librari 1998
1999
search 1999
2000
servic 2000
2001
resourc 2001
2002
librari 2002
2003
access 2003
2004
student 1995
1996
languag 1996
1997
cost 1997
1998
univers 1998
1999
librari 1999
2000
univers 2000
2001
game 2001
2002
servic 2002
2003
tool 2003
2004
resourc 1995
1996
text 1996
1997
access 1997
1998
doi 1998
1999
cost 1999
2000
model 2000
2001
cultur 2001
2002
doi 2002
2003
digit 2003
2004
inform 1995
1996
user 1996
1997
journal 1997
1998
content 1998
1999
data 1999
2000
publish 2000
2001
commun 2001
2002
xml 2002
2003
cost 2003
model 1995
1996
watermark 1996
1997
index 1997
1998
articl 1998
1999
document 1999
2000
inform 2000
2001
educ 2001
2002
answer 2002
2003
nsdl 2003
2004
technolog 1995
1996
imag 1996
1997
authent 1997
1998
local 1998
1999
catalog 1999
2000
openurl 2000
2001
site 2001
2002
system 2002
2003
titl 2003
2004
support 1995
1996
protect 1996
1997
softwar 1997
1998
databas 1998
1999
record 1999
2000
learn 2000
2001
interact 2001
2002
journal 2002
2003
rss 2003
2004
web 1995
1996
charact 1996
1997
server 1997
1998
figur 1998
1999
descript 1999
2000
user 2000
2001
preserv 2001
2002
identifi 2002
2003
particip 2003
2004

 

Table 4. Terms that underwent strongest positive shifts in occurrence frequency from one consecutive year to another

 

spacer

These terms strongly overlap with the stable concepts listed in Table 2. In fact, only 6 of the "central concepts" never occur in the list of "innovations": "develop", "object", "project", "provid", "research" and "work". This indicates that the "central" concepts, even though among the top most frequent terms, still experienced strong shifts in their relative frequencies from one consecutive year to the other.

In fact, a term could be among the ten most frequent terms in years 1 and 2, suddenly strongly increase its frequency in year 3, and thus be registered among the group of "innovations". Other terms may silently grow in frequency over the years, never to achieve particular spikes in occurrence frequency and thus not be detected by this methodology as "innovations". Conversely, a term may not be among the most frequent in any year, but may have experienced sufficient spikes in occurrence frequency so that it can still be counted among the "innovations". Indeed, this is a short-coming of this approach. Term frequency is an intuitive but coarse indicator of a term's "importance" or "innovation" value. We are currently developing more advanced methodologies which exploit the structural, temporal features of term-term relationships to avoid these shortcomings and provide a more nuanced, detailed analysis.

Table 4 nevertheless highlights a number of interesting trends associated with innovation and shifts of interest in the DL community. To elucidate these trends, we plotted the 10-year frequency evolution of all terms which experienced short spikes in their occurrence frequency. These graphs indicate how the frequency of these terms has evolved over the years, and thus how the DL community has changed its focus.

A particularly interesting case is the term "preserv" (Figure 1). Its steadily increasing frequency throughout the years corresponds to our observation of it becoming a more important topic in the study of digital libraries (NDIIPP, 2002). However, it is not until 2004 that it gains enough of a positive increase to emerge as an "innovation".

A chart showing the frequency of articles for each year of publication

Figure 1. Frequency of "preserv".

The term "openurl" (Figure 2) reveals an interesting pattern: it first appeared in 2001 when a number of seminal papers on the subject were published (Van de Sompel & Beit-Arie, 2001a; Van de Sompel & Beit-Arie, 2001b). Discussion dropped off in the corpus in 2002 but began to climb again in 2003 and 2004 as the institutional acceptance of OpenURL increased.

A chart showing the frequency of the term 'openurl'

Figure 2. Frequency of "openurl".

Some terms display other patterns of frequency shifts. The term "doi" (Figure 3) steadily increased in frequency from 1998 to a set of spikes in 1999, 2001 and 2003 when status and project reports were published (Paskin, 1999; Beit-Arie et al., 2001; Paskin, 2003). The aggregate pattern is one of growth and subsequent stabilization.

A chart showing the frequency of the term 'doi'

Figure 3. Frequency of "doi".

An example of a term that has increased its frequency in a steady, stable pattern of growth is "xml" (Figure 4). Starting in 1997 to 2004 it has continuously grown in frequency. Although there have been small declines and large jumps in particular years, the trend is clearly one of stable growth.

A chart showing the frequency of the term 'xml'

Figure 4. Frequency of "xml".

"RSS" (Figure 5) has existed in its current form since 1999 (Reagle, 2004). The frequency graph of the term, and its appearance in the "innovations" list in 2004 however indicates that it has only recently attracted the attention of the DL community.

A chart showing the frequency of the term 'rss'

Figure 5. Frequency of "rss".

The maturation of the NSDL community is evidenced from the evolution of the term "nsdl" (Figure 6) in D-Lib Magazine. Its appearance in the list of "innovations" is due to a spike in term frequency occurring in 2003-2004. It was first mentioned in 1998, was discussed in earnest in 2001 and has since demonstrated continued growth.

A chart showing the frequency of the term 'nsdl'

Figure 6. Frequency of "nsdl".

Oddly, the term "user" (Figure 7), although highly general, has experienced a gradual decline in term frequency since its spikes in 1995 and 1998. Several hypothesis could be proposed to explain this phenomenon. New terms may have supplanted "user" to denote similar concepts, or this area is receiving less attention because it is generally considered to be a "solved" problem. A similar phenomenon may be observed for the term "image" which has experienced a declining frequency over the past 10 years.

A chart showing the frequency of the term 'user'

Figure 7. Frequency of "user".

Some terms are conspicuous by their absence. The Open Archives Initiative (OAI) (Lagoze & Van de Sompel, 2003). just missed being an "innovation" in 2001, scoring as the 11th term undergoing the largest positive frequency jump from 2000. However, the multitude of terms that authors have used to describe this concept ("OAI", "OAI-PMH", "OAI-Compliant", "OAI-Compatible", etc.) limits the visibility of the concept in the frequency-based methods.

3. Conclusions and Future Work

D-Lib Magazine constitutes an attractive corpus for detecting trends: publications are characterized by a low publication latency, it is published monthly, and its focus is on summary and "awareness" types of articles. In our analysis, we have identified terms that are both of stable importance to the community and the ones that have experienced rapid positive shifts in interest.

We observed that D-Lib Magazine has a strong, stable focus on a relatively small set of central concepts that were unsurprisingly represented by the terms "digital", "library" and "information". The number of unique terms associated with innovations over the past 10 years was higher but overlapped to a large degree with the list of stable "central" concepts, indicating that certain terms, such as "metadata", which have been discussed for extensive periods will steadily grow in importance to be established as a central focus of the D-Lib community.

Future work will focus on expanding this methodology to avoid the limitations of frequency-based indicators. Frequency simply corresponds to how often something is mentioned, not necessarily its true underlying importance to a community. Shifts in occurrence frequency are therefore not accurate indicators of innovations and changes in interest. Nevertheless, some interesting patterns have been detected and term frequency is easily and efficiently derived from any corpus. It can therefore be used to compare and cross-validate trends in other publications. In other words, we can study trends within the texts of a particular publication, or apply the same analysis to multiple collections so that we can compare and cross-validate the observed trends for different communities.

Long-time readers of D-Lib Magazine may find that the identified "central concepts" and "innovations" match their intuition and recollection of the past 10 years. However, newcomers who did not have the benefit of perusing every issue the day it originally was published may benefit from this automated analysis and summarization of the main lines of thought and evolutions within the DL community. We are working to expand the tools presented here for more general corpus summarization and exploration purposes.

References

Beit-Arie, O., Blake, M., Caplan, P., Flecker, D., Ingoldsby, T., Lannom, L., Mischo, W. H., Pentz, E., Rogers, S. & Van de Sompel, H. (2001). "Linking to the Appropriate Copy: Report of a DOI-Based Prototype", D-Lib Magazine 7(9). Available at: <doi:10.1045/september2001-caplan>.

Kleinberg, J. "Bursty and hierarchical structure in streams." In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 91-101. ACM Press, 2002

Lagoze, C. & Van de Sompel, H. (2003). "The Making of the Open Archives Initiative Protocol for Metadata Harvesting", Library Hi-Tech 21(2), pp. 118-128.

NDIIPP, (2002). "Preserving Our Digital Heritage: Plan for the National Digital Information Infrastructure and Preservation Program". Available at: <http://www.digitalpreservation.gov/index.php?nav=3&subnav=1>.

Paskin, N. (1999). "DOI: Current Status and Outlook", D-Lib Magazine 5(5). Available at: <doi:10.1045/may99-paskin>.

Paskin, N. (2003). "DOI: A 2003 Progress Report", D-Lib Magazine 9(6). Available at: <doi:10.1045/june2003-paskin>.

Porter, M. (1980). "An Algorithm for Suffix Stripping", Automated Library and Information Systems, 14(3), pp. 130-137.

Reagle, J. (2004). "Web RSS (Syndication) History". Available at: <http://goatee.net/2003/rss-history.html>.

Van de Sompel & Beit-Arie (2001a). "Open Linking in the Scholarly Information Environment Using the OpenURL Framework", D-Lib Magazine 7(3). Available at: <doi:10.1045/march2001-vandsompel>.

Van de Sompel & Beit-Arie (2001b). "Generalizing the OpenURL Framework Beyond References to Scholarly Works: The Bison-Futé Mode", D-Lib Magazine 7(7/8). Available at: <doi:10.1045/july2001-vandesompel>.

Weibel, S. (1995). "Metadata: The Foundations of Resource Description", D-Lib Magazine 1(1). Available at: <doi:10.1045/july95-weibel>.

Wilson, B. (2004). "About D-Lib Magazine", Available at: <http://www.dlib.org/about.html>.

(On January 18, 2005, at the authors' request an incorrect reference was removed from this article.)

Copyright © 2005 Johan Bollen, Michael L. Nelson, Giridhar Manepalli, Giridhar Nandigam and Suchitra Manepalli
spacer
spacer

Top | Contents
Search | Author Index | Title Index | Back Issues
Previous Article | Next article
Home | E-mail the Editor

spacer
spacer

D-Lib Magazine Access Terms and Conditions

doi:10.1045/january2005-bollen