Trends in Large-Scale Subject Repositories
Noting a lack of broad empirical studies on subject repositories, the authors investigate subject repository trends that reveal common practices despite their apparently isolated development. Data collected on year founded, subjects, software, content types, deposit policy, copyright policy, host, funding, and governance are analyzed for the top ten most-populated subject repositories. Among them, several trends emerge, including a multi- and interdisciplinary scope, strong representation in the sciences and social sciences, use of open source repository software for newer repositories, acceptance of pre- and post-prints, moderated deposits, submitter responsibility for copyright, university library or departmental hosting, and discouraged withdrawal of materials. In addition, there is a loose correlation between repository size and age. Recognizing the diversity of all subject repositories, the authors recommend that tools for assessment and evaluation be developed to guide subject repository management to best serve their respective communities.
In library and information science literature, the subject repository frequently receives credit for its longevity and success (Armbruster & Romary, 2009; Hey & Hey, 2006; Xia, 2008). At the same time, there is little practical information available on the development and management of subject repositories (Adamick & Reznik-Zellen, 2010). A growing number of subject repositories are actively collecting and disseminating resources, and there is a need for empirical literature that identifies their commonalities so that project managers or librarians charged with building subject repositories can make informed decisions about repository development.
While there are best practices for digital collections that guide issues such as interoperability, strong metadata, and usability, there is a lack of general resources specific to the subject repository environment. This may be due to their organic development within disciplines, or due to an expectation that subject repositories can fulfill their missions using the same technical and organizational approaches that work for institutional or funder repositories. The general literature on subject repositories uses the success of PubMed Central (PMC), arXiv, and RePEc to illustrate the success of subject repositories as a whole. Yet these three iconic repositories operate on fundamentally different technical and business models, their deposit policies are distinct, and they support and engage their communities in different ways.
As repositories of all types continue to grow in size and number, some documentation and standardization is needed to ensure informed work and to discourage redundancy, particularly with regard to subject repositories, where duplication of content and effort is a real risk (Harnad, 2010). To understand the management of subject repositories from a broader perspective, public data was collected to identify general trends. Identifying points of commonality among successful repositories could offer guidance to repository managers seeking to build domain-specific collections. A follow-up to the September 2010 D-Lib Magazine article "Representation and Recognition of Subject Repositories", this study illustrates subject repository trends and concludes by articulating a need to determine mission-appropriate, domain-agnostic standards against which to measure their development.
An Analysis of the Top Ten Most-Populated Subject Repositories
The top ten subject repositories by size across all disciplinary domains were identified by comparing data between Open DOAR, ROAR, and Ranking Web of World Repositories. The repositories selected host English-language scholarly documents such as pre-prints or post-prints, are disciplinary, multi-disciplinary or interdisciplinary, and are national or international in scope. Institutional repositories, archives, image collections and strict data collections were excluded. The following subject repositories were identified as the ten most-populated: PMC, CiteSeerx, arXiv, Research Papers in Economics (RePEc), Social Science Research Network (SSRN), AgEcon Search, Policy Archive, E-prints in Library and Information Science (E-LIS), Archive of European Integration (AEI), and Organic EPrints [n1]. (See Figure 1.)
Fig. 1: Total items in the top ten repositories
Data on nine basic metrics for each repository was collected during Spring 2010 through the repository registries and from the repository sites themselves: year founded, subjects, software, content types, deposit policy, copyright policy, host, funding, and governance. This data was then verified by contacting each repository's manager. Each metric was analyzed to determine what trends may exist among the ten repositories.
Year Founded. The establishment of the top ten most-populated repositories spans 17 years, from 1991 to 2008, with an average of 2.8 years between the launches. Half of the top ten repositories were founded in the decade before 2000 and half in the decade following the millennium (See Table 1).
Table 1: Repositories sorted by year founded
Subjects. The subjects of the top ten are quite diverse, though they tend to fall among the sciences and social sciences (see Table 2). Only SSRN incorporates any humanities subjects: classics, English literature, and philosophy. Apart from that exception, the subjects represented are the biomedical and life sciences, chemistry and chemical technology, library science, information science, business, economics, physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, agriculture, public policy, geography and regional studies, law and politics, organic agriculture, food and veterinary science, and ecology and environmental studies. Only the Policy Archive is dedicated to a single subject. Five repositories (SSRN, arXiv, CiteSeerx, RePEc, and E-LIS) are multi-disciplinary (where multidisciplinarity refers to a non-integrative array of disciplines). Four are interdisciplinary (where interdisciplinarity describes a field where traditional academic boundaries are crossed or blended): PMC, Organic EPrints, AEI, and AgEcon Search (see Table 2).
Table 2: Disciplinary coverage of the top ten subject repositories
Software. Five repositories use local software platforms and five use the open source software DSpace or EPrints. The five repositories using local platforms are PMC, CiteSeerx, arXiv, RePEc, and SSRN. AgEcon Search and Policy Archive use DSpace, and E-LIS, AEI, and Organic EPrints use EPrints. None of the top ten repositories use hosted software.
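The platform assignments reported in the Software paragraph can be recorded as a small lookup table and tallied, which makes the five-to-five split between local and open source platforms easy to re-check. The following is a minimal sketch only: the names `PLATFORM`, `OPEN_SOURCE`, `tally_platforms`, and `group_by_platform_type` are illustrative and are not part of the study's method.

```python
from collections import Counter, defaultdict

# Platform per repository, as reported in the Software paragraph.
# "local" stands for a locally developed platform.
PLATFORM = {
    "PMC": "local",
    "CiteSeerx": "local",
    "arXiv": "local",
    "RePEc": "local",
    "SSRN": "local",
    "AgEcon Search": "DSpace",
    "Policy Archive": "DSpace",
    "E-LIS": "EPrints",
    "AEI": "EPrints",
    "Organic EPrints": "EPrints",
}

# The two open source packages named in the text.
OPEN_SOURCE = {"DSpace", "EPrints"}


def tally_platforms(platform_by_repo):
    """Count how many repositories run on each platform."""
    return Counter(platform_by_repo.values())


def group_by_platform_type(platform_by_repo):
    """Split repository names into 'open source' and 'local' groups."""
    groups = defaultdict(list)
    for repo, platform in platform_by_repo.items():
        kind = "open source" if platform in OPEN_SOURCE else "local"
        groups[kind].append(repo)
    return dict(groups)


counts = tally_platforms(PLATFORM)
groups = group_by_platform_type(PLATFORM)

for platform, n in counts.most_common():
    print(f"{platform}: {n}")
```

Keeping the assignments in one table means the same structure could be extended with other metrics from this study (for example, year founded) without changing the tallying functions.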
Content Types. While most of the repositories profiled can accommodate several types of content (working papers, reports, theses, conference materials, multimedia, grey literature, and data), "articles" are the only common content type among the top ten repositories (where articles includes pre-, post-, and publisher versions of published journal articles).
Deposit Policy. Most of the repositories moderate submissions. AEI, Policy Archive, Organic EPrints, and E-LIS all require registration and review submissions before they are posted. SSRN and AgEcon Search both require user registration, but submissions are unmoderated. The remaining repositories have unique deposit policies. PMC accepts peer-reviewed material from life sciences journals that are of high scientific and technical quality, as defined by the National Library of Medicine (NLM). Individual author manuscripts are also accepted through approved processing systems when they are products of approved funding agencies with public access policies such as NIH, Wellcome Trust, or Howard Hughes Medical Institute. RePEc manages submissions via institutional or departmental deposit, where institutions or departments must build a contributing RePEc archive. arXiv requires registration and endorsement by a registered contributor. arXiv endorsers have authored a defined number of papers within an archive or subject class's endorsement domain. CiteSeerx crawls the web for materials, and users can also make a submission by providing a content link and email address. The item will then be automatically processed and indexed by CiteSeerx.
Copyright Policy. None of the repositories require a copyright transfer agreement, but most require the depositor to agree to a non-exclusive right to distribute the submission. Journal contributors can deposit their complete contents (Full Participation), all NIH-funded articles, or all NIH-funded articles with other selected articles (NIH Portfolio), or selective submissions such as open access articles (Selective Deposit) into PMC without copyright transfer. Individual authors grant PMC the non-exclusive right to disseminate their work through the repository. arXiv requires that submitters grant the repository an irrevocable non-exclusive right to distribute submissions if they are not in the public domain. SSRN requires a non-exclusive, revocable license to distribute the submission. AgEcon Search generates a license as part of the submission process that grants the repository a non-exclusive right to distribute the submission. Both SSRN and AgEcon Search will remove materials from their repositories upon request. Organic EPrints, E-LIS, and AEI require submitters to grant the repository permission to disseminate the work, and will remove materials if requested, though it is discouraged. SSRN, Policy Archive, AEI, E-LIS, Organic EPrints, and AgEcon Search all state that the submitter is responsible for verifying that he/she has the right to post materials. CiteSeerx, Policy Archive, and RePEc do not provide information about copyright.
Host. Top ten repository hosts include university libraries, university systems, multiple organizations, a consortium, a publisher, and a research center. Eight of ten repository hosts are institutions of higher education. Four repositories are hosted by university libraries: the Cornell University Library hosts arXiv, the University of Minnesota Library (with the University of Minnesota Department of Applied Economics) hosts AgEcon Search, the Indiana University-Purdue University Indianapolis Library hosts Policy Archive, and the University of Pittsburgh Library hosts AEI. PMC is also hosted by a library: the National Library of Medicine through its National Center for Biotechnology Information. CiteSeerx, RePEc, and E-LIS are all hosted by university systems. CiteSeerx is hosted by the Pennsylvania State University's College of Information Sciences and Technology. RePEc services are hosted by multiple organizations: the University of Connecticut hosts Internet Documents in Economics Access Service (IDEAS), the RePEc Author Service and Economics Departments, Institutes and Research Centers in the World (EDIRC), the Swedish Business School at Örebro University hosts EconPapers and LogEc, the Munich University Library hosts the Personal RePEc Archive (MPRA), SUNY-Oswego hosts New Economics Papers, the Valencian Economic Research Institute in Spain hosts the CitEc service, and the RePEc domain itself is hosted at Boston College. In addition, the Economists Online service is hosted by the organization Nereus. E-LIS is hosted by a university supercomputing consortium's advanced E-Publishing team (AEPIC) at the Italian Consorzio Interuniversitario Lombardo per l'Elaborazione Automatica (CILEA). Finally, SSRN is hosted by the publisher Social Science Electronic Publishing Inc., and Organic EPrints is hosted by a research center, the International Centre for Research in Organic Food Systems (ICROFS).
Funding. Funding for the top ten repositories is federal, institutional, organizational, or corporate. PMC and CiteSeerx are federally funded by NIH and NSF, respectively. AgEcon Search receives funding from the USDA Economic Research Service (among other funders). In addition, Organic EPrints receives some of its funding from the German government's Geschäftsstelle Bundesprogramm Ökologischer Landbau in der Bundesanstalt für Landwirtschaft und Ernährung. AgEcon Search, Policy Archive, and Organic EPrints are all funded by societies, organizations, or research centers. AgEcon Search receives support from the Agricultural and Applied Economics Association, the European Association of Agricultural Economists, the Farm Foundation, and the International Association of Agricultural Economists. Policy Archive receives support from the Center for Governmental Studies, and Organic EPrints receives support from the International Centre for Research in Organic Food Systems and the Research Institute of Organic Agriculture. arXiv, AEI, and Policy Archive all receive funding from their host libraries, which are the Cornell University Library, the University of Pittsburgh Library, and the IUPUI Library, respectively. In addition, arXiv has a distinctive new voluntary collaborative business model that solicits annual contributions from the top 200 user institutions (Cornell University Library, 2010). SSRN is unique in that it is supported by a publishing corporation, Social Science Electronic Publishing Inc. Four of the repositories above currently receive funding from multiple sources: arXiv, AgEcon Search, Policy Archive, and Organic EPrints. E-LIS and RePEc are volunteer-driven.
Governance. Five of the top ten most-populated subject repositories have an external governance structure. PMC has a national advisory committee that advises the Directors of NIH, the NLM, and the National Center for Biotechnology Information on the scope and development of the repository. The advisory board consists of information science, biomedical, and general public appointees by the NIH director. Cornell University Library staff receive consulting services from arXiv's advisory board of representatives from the physics, astronomy, and mathematics communities, and ex officio subject-based advisory committee chairs. Advisory committees exist for the physics, mathematics, computer science, quantitative biology, quantitative finance, and statistics archives. The SSRN Board of Trustees is composed of members of the business, law, psychology, and economics disciplines. AgEcon Search's advisory board includes economic, agricultural organizational, and federal representation. E-LIS receives guidance from an administrative board, a technical board, and an editorial board with representatives from library and information science. CiteSeerx, RePEc, Policy Archive, AEI, and Organic EPrints do not have external governance structures.
The survey above gives a small snapshot of the management of subject repositories, and should not be considered representative of the 150-400 existing subject repositories [n2]. Of the top ten repositories, there is a decrease of almost 500,000 items from the first to the second most-populated repository (PMC and CiteSeerx, respectively) and an even more significant decrease of over 750,000 between CiteSeerx and RePEc. There is a third notable drop between SSRN and the five remaining repositories. Even among ten subject repositories, a long tail is already visible (see Figure 1). The difference in size between the first and second most-populated repositories alone is larger than the combined number of items in the five smallest of the top ten repositories. When the size of the top ten repositories is compared to the remaining 150-400, it is clear that the remaining repositories differ in scope. All subject repositories have a similar mission to collect and disseminate their fields' research output (Darby et al., 2008), but the practices of the top ten may not be applicable to all subject repositories given their diversity. That being stated, it would not be ill-advised to imitate some of the practices of the top subject repositories: they all have large collections, which implies strong community adoption, though the two may not be correlated.
Half of the top ten subject repositories were launched in the last five years, which indicates a current demand for storing and disseminating open access research through repositories. Though the five oldest repositories include two of the most-populated repositories, the most populated repository, PMC, was established in the year 2000. The quick population of PMC can be attributed to a National Institutes of Health mandate that requires the deposit of all publications produced with NIH funding. When PMC is excluded, the distribution of content (total items) may be loosely attributable to age, as the top five most-populated repositories were all established by 2000 (See Figure 2). The science and social science disciplines are predominant, while humanities and arts have the lowest representation amongst the top ten. This finding can be attributed to disciplinary differences in publication rates, pre-print culture, and adoption of electronic media, where, in general, the humanities and arts lack a pre-print culture, have lower publication rates, and adopt electronic media at a lower rate than the science, social science and engineering disciplines (Crow, 2006; Kling & McKim, 2000; Schonfeld & Housewright, 2010; Sparks, 2005). Most of the top ten repositories are multi- or interdisciplinary, which indicates that users will identify with a larger-scale resource site, will identify with a topic that cuts across multiple disciplines, or with the interdisciplinary nature of their fields.
Fig. 2: Repository size and year launched
In general, the oldest repositories use locally developed software, and the newer repositories use open source software. Software development was necessary for the oldest repositories, as most open source repository or digital collection software used today was not developed until 1997 [n3]. All of the older and larger repositories manage far fewer types of content than the younger, smaller repositories, an interesting reverse trend. For example, PMC, arXiv, and CiteSeerx manage articles and citations only. RePEc adds software, working papers, and contact information to this core group. AEI, E-LIS, and Organic EPrints manage the widest variety of content. But these younger, smaller repositories are operating on software that was not made available until 2000. What is unclear is the extent to which software options might impact the kinds of content types that are collected and managed in a given repository. For example, AEI, E-LIS, and Organic EPrints all use EPrints and, from a technical point of view, can manage a wide range of content types, whereas the older, larger repositories all operate on local software platforms and perhaps were not designed for a range of content types.
Eight of the repositories monitor submissions for metadata, scope, or quality, or filter submitters in some way. While the repositories do not facilitate a peer review process the way that scholarly journals do, the monitoring process may be surprising to those who mistakenly associate open access with lower quality research. The filtering process affirms that the open access movement works to remove access barriers to research, but not quality filters (Suber, 2009). For example, assuring strong metadata on author submissions has a positive impact on the discoverability of the work once in the repository. PMC, RePEc, arXiv, and CiteSeerx each have a unique deposit policy, and they are the four largest subject repositories, which indicates that a variety of collection management strategies can be effective in building a subject repository.
Naturally, none of the repositories require authors to sign a copyright transfer agreement, but once materials are submitted, most of the repositories discourage authors from withdrawing them. arXiv is the only repository that requires an irrevocable non-exclusive right to distribute materials, and the other repositories will withdraw materials upon request. Because a repository's purpose is to share research broadly, it is logical that removal of materials would be avoided, especially considering the potential workload of submission removal for repository managers. Most of the repositories state that they are not responsible for copyright infringement, and instead require that authors or submitters understand publisher permissions. Tracking copyright permissions at the article level for hundreds of authors and publishers would be logistically complex and daunting for large scale repositories. In lieu of providing one-on-one reference help for copyright navigation, many of the repositories offer information about author rights.
University libraries are frequent hosts of the top ten subject repositories, which is intuitive considering that institutional repositories are nearly always library-managed. When a repository is hosted by a university department, another common scenario, the department's discipline is reflected, naturally, in the repository's scope. For example, AgEcon Search, which is co-hosted by the University of Minnesota Library and the University's Department of Applied Economics, is a repository for agricultural and applied economics. Similarly, CiteSeerx, which is hosted by the Pennsylvania State University's College of Information Sciences and Technology, collects materials in computer and information science. Funding can come through the repository's host, as in the case of AEI, and it can also come through an external source, as with CiteSeerx. Most frequently, however, the host is not the single source of funding. Several of the repositories' funding sources have changed over time. For example, the Joint Information Systems Committee (JISC) of the United Kingdom Higher Education Funding Councils originally funded Working Papers in Economics (from which RePEc emerged) through its Electronic Libraries Programme (eLib), awarded to the NetEc group. Another example is SSRN: while it is currently funded by Social Science Electronic Publishing, Inc., it was initially co-sponsored by the University of Chicago Booth School of Business, the European Corporate Governance Institute, Korea University, and the Stanford Law School. Varied funding histories illustrate that sustainability is a real concern for subject repositories, as it is for many digital projects. arXiv's voluntary collaborative business model, where institutional users are asked for support, is a high-profile response to the sustainability issue of concern to many subject repositories. Other top repositories address sustainability by receiving support from multiple sources.
Half of the repositories have an external governance structure, but there is no evident connection between governance bodies and disciplinarity, size, or funder type. The hosting, funding, and governance models of the top subject repositories indicate a collaborative nature of subject repository development: it is rare that a single body hosts, funds, and governs a subject repository.
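This study illustrates that there are a number of trends among the ten largest subject repositories: a multi- or interdisciplinary scope; strong representation in the sciences and social sciences; use of open source repository software among newer repositories; acceptance of pre- and post-prints; moderated deposits; submitter responsibility for copyright; university library or departmental hosting; discouraged withdrawal of materials; and a loose correlation between repository size and age.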
Although the subject repositories investigated for this analysis were selected for their high content counts, size alone cannot measure the success or failure of subject repositories in a meaningful way. All repositories hold a similar mission to disseminate the research outputs of a given scholarly community; what is missing from their environment are practical metrics for evaluating the impact of subject repositories within a community. Collection size is almost irrelevant insofar as the size and activity of a given scholarly community will determine the size of its repository, in number of items. It would be a disservice to evaluate the success or failure of subject repositories based solely on their collection size. While size is a useful metric and can correlate to the activity of a given research community, it disadvantages smaller or emerging communities of practice, particularly in research environments that are increasingly interdisciplinary (Novarese & Zimmermann, 2008). All repositories can benefit from the development of standardized and customizable assessment tools, although this may be more of a concern for smaller repositories that may have fewer resources with which to build repository services, let alone evaluate them. Subject repositories could be evaluated on how often they are recognized by their respective communities, by the number of items and item downloads relative to the size of the community, by the willingness of authors to submit to and use materials from the collection, or by their ability to document a growing or changing field.
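One of the measures suggested above, the number of items and item downloads relative to the size of the community, can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the `RepositoryStats` record, the two function names, and all of the figures are hypothetical and are not drawn from the data collected for this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepositoryStats:
    """Hypothetical record of the inputs a size-relative measure would need."""
    name: str
    items: int            # total items held
    downloads: int        # total item downloads over some period
    community_size: int   # estimated researchers in the field (an assumption)


def items_per_member(stats: RepositoryStats) -> float:
    """Items held, relative to the size of the community served."""
    if stats.community_size <= 0:
        raise ValueError("community_size must be positive")
    return stats.items / stats.community_size


def downloads_per_item(stats: RepositoryStats) -> float:
    """Average downloads per item, a rough indicator of use."""
    if stats.items <= 0:
        raise ValueError("items must be positive")
    return stats.downloads / stats.items


# Invented figures for two imaginary repositories; not real data.
large = RepositoryStats("Large repository (hypothetical)", 500_000, 5_000_000, 1_000_000)
small = RepositoryStats("Small repository (hypothetical)", 10_000, 200_000, 8_000)

# Rank by size-relative coverage rather than by raw item count.
ranked = sorted([large, small], key=items_per_member, reverse=True)

for repo in ranked:
    print(repo.name, items_per_member(repo), downloads_per_item(repo))
```

With these invented figures the smaller repository ranks first on both measures even though it holds far fewer items, which is the kind of outcome a raw size ranking would hide. Estimating the size of a scholarly community is itself a hard problem, so any such ratio should be read as indicative only.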
Broad-based research on subject repositories will be a welcome addition to the vibrant existing scholarly communication literature, and this examination of large subject repositories serves as a jumping-off point for further research and the development of assessment tools.
Funding for this project comes from the National Science Foundation through grant numbers 0936857 and 0531171. Any opinions, findings, conclusions or recommendations expressed here are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
[n1] The databases track information about subject repositories in different ways. RWWR has a top 400 list and an institutional repository list; subject repositories were manually selected from their 400 list. Open DOAR lists "disciplinary" repositories (202), and ROAR lists "research cross-institutional" repositories (146). In addition, the data on the number of items is inconsistent between the repository registries.
[n2] Identified by comparing data between Open DOAR, ROAR, and RWWR.
[n3] Fedora, EPrints, and DSpace weren't created and marketed as repository platforms until 1997, 2000, and 2002, respectively. AgEcon Search was initially founded in 1995 on local software, but switched to DSpace in 2008.
Armbruster, C., & Romary, L. (2009). Comparing repository types: Challenges and barriers for subject-based repositories, research repositories, national repository systems and institutional repositories in serving scholarly communications. Retrieved from: http://ssrn.com/abstract=1506905.
Crow, R. (2006). The case for institutional repositories: A SPARC position paper. Washington, D.C.: Scholarly Publishing and Academic Resources Coalition. Retrieved from: http://scholarship.utm.edu/20/.
Darby, R. M., Jones, C. M., Gilbert, L. D., & Lambert, S. C. (2008). Increasing the productivity of interactions between subject and institutional repositories. New Review of Information Networking, 14(2), 117-135. doi:10.1080/13614570903359381.
Harnad, S. (2010). "Subject" is not a repository but a tag [Web log comment]. Retrieved from: http://www.xpapers.org/2010/04/definition-of-subject-repository.html.
Kling, R., & McKim, G. (2000). Not just a matter of time: Field differences and the shaping of electronic media in supporting scientific communication. Journal of the American Society for Information Science, 51(14). Retrieved from: http://arXiv.org/ftp/cs/papers/9909/9909008.pdf.
Novarese, M., & Zimmermann, C. (2008). Heterodox economics and dissemination of research through the internet: The experience of RePEc and NEP. On The Horizon, 16(4), 198-204. doi:10.1108/10748120810912529.
Schonfeld, R., & Housewright, R. (2010). Faculty survey 2009: Key strategic insights for libraries, publishers, and societies. Ithaka S+R. Retrieved from: http://www.ithaka.org/ithaka-s-r/research/faculty-surveys-2000-2009/faculty-survey-2009.
Sparks, S. (2005). JISC disciplinary differences report. Rightscom Ltd.
Suber, P. (2009). A field guide to misunderstandings about open access. April 2009 SPARC Open Access Newsletter, 132. Retrieved from: http://www.arl.org/sparc/publications/articles/openaccess_fieldguide.shtml.
About the Authors