The Five Stars of Online Journal Articles: a Framework for Article Evaluation
I propose five factors (peer review, open access, enriched content, available datasets and machine-readable metadata) as the Five Stars of Online Journal Articles, a constellation of five independent criteria within a multi-dimensional publishing universe against which online journal articles can be evaluated, to see how well they match up with current visions for enhanced research communications. Achievement along each of these publishing axes can vary, analogous to the different stars within the constellation shining with varying luminosities. I suggest a five-point scale for each, by which a journal article can be evaluated, and provide diagrammatic representations for such evaluations. While the criteria adopted for these scales are somewhat arbitrary, and while the rating of a particular article on each axis may involve elements of subjective judgment, these Five Stars of Online Journal Articles provide a conceptual framework by which to judge the degree to which any article achieves or falls short of the ideal, which should be useful to authors, editors and publishers. I exemplify such evaluations using my own recent publications of relevance to semantic publishing.
Many people will be familiar with Tim Berners-Lee's Five Stars of Linked Open Data (Text Box 1), incremental steps that categorise the publication of open data on the Web in levels of increasing usefulness, that encapsulate the present shared vision of the Semantic Web as a Web of Linked Open Data, and that individuals can use to rate their own data publication.
To complement these, I wish to propose the Five Stars of Online Journal Articles, in particular to characterize the potential for improvement to the primary medium of scholarly communication made possible by Web technologies. The background to these Five Stars of Online Journal Articles involves semantic publishing, considerations of the future of research communications, and the Semantic Web itself.
The Semantic Web
While proponents of the Semantic Web have on occasion appeared to resemble Old Testament prophets, whose messages of truth went unheeded by the general populace, uptake by a number of influential parties such as the BBC and skilful marketing of Semantic Web concepts under the banner of 'Linked Data' have recently brought Semantic Web technologies into more widespread acceptance. The principles are quite simple. If entities and their relationships can be identified and defined in machine-readable form by the use of unique URIs referencing publicly available and commonly accepted structured defined vocabularies (ontologies), and if each of these relationships is expressed as a simple subject-predicate-object statement (a 'triple'), following the syntax of the Resource Description Framework (RDF), then such statements can be combined into interconnected information networks (RDF graphs) in which the truth content of each original statement is maintained, thereby creating a web of knowledge, the Semantic Web.
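The way triples from separate sources combine into a single graph can be illustrated with a minimal sketch in plain Python, assuming triples are modelled as tuples and graphs as sets. All URIs under `http://example.org/` and the predicate names `hasAuthor` and `cites` are hypothetical, chosen only for illustration; a real application would use an RDF library and terms from published ontologies.

```python
# A triple is (subject, predicate, object); all URIs here are hypothetical examples.
EX = "http://example.org/"

graph_a = {
    (EX + "article1", EX + "hasAuthor", EX + "personA"),
    (EX + "article1", EX + "cites", EX + "article2"),
}
graph_b = {
    (EX + "article2", EX + "hasAuthor", EX + "personB"),
    (EX + "article1", EX + "cites", EX + "article2"),  # same statement from a second source
}

def merge(*graphs):
    """Combine graphs by set union, so every original statement is kept exactly once."""
    merged = set()
    for g in graphs:
        merged |= g
    return merged

def objects(graph, subject, predicate):
    """Return all objects o such that (subject, predicate, o) is in the graph."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}

merged = merge(graph_a, graph_b)

# Follow a link that spans both sources: authors of the papers cited by article1.
cited_authors = set()
for cited in objects(merged, EX + "article1", EX + "cites"):
    cited_authors |= objects(merged, cited, EX + "hasAuthor")
```

Because both sources identify `article2` by the same URI, the merged graph answers a question that neither source could answer alone.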
Ontological descriptions of entities and their relationships enable data from independent sources to be integrated without ambiguity or loss of precision of meaning, a situation that would be impossible if the entities were to be described in other ways such as XML, where the lack of universally agreed meanings for markup terms frequently leads to confusion with respect to synonyms (for example, whether "creator" in one schema is equivalent to "composer" or to "choreographer" in another) or homonyms (for example, the potentially different meanings of the markup term "gift", meaning "present" in an English database but "poison" in a German one).
There now exist many powerful examples of how the use of Semantic Web technologies 'under the hood' permits data originally encoded using non-compatible metadata models and housed in heterogeneous databases to be integrated into unified and coherent services. The best example with which I have personally been involved is CLAROS, "The World of Art on the Semantic Web", in which information describing ancient art objects housed in the world's museums is integrated from a number of scholarly sources (Kurtz et al., 2009). The benefits of Semantic Web technologies for libraries were recently discussed at the 2011 annual Semantic Web in Libraries meeting entitled Scholarly Communication in the Web of Data, held in Hamburg, Germany, in November 2011.
Journal publication, as the primary dissemination channel and public record of new research results, is a vital ingredient of the scholarly workflow, and its key commodity, the original research article, is of primary importance, since it provides a dated 'version of record' of the authors' hypotheses, supporting results and conclusions at the time of publication, validated by peer review, and as such becomes an immutable part of the scientific record. The basic format of the scientific journal article has changed little since its inception some 350 years ago. It remains a linear rhetorical narrative, in which authors attempt to persuade readers of the correctness of specific hypotheses by the presentation of experimental evidence selected from larger bodies of data.
At present, the majority of journal publishers use the Internet simply as a convenient mechanism for distributing journal articles in PDF format, providing electronic facsimiles of printed pages. While PDF documents are convenient for printing and off-line reading, the typical lack of any form of semantic enhancement or user interactivity, and the difficulties they present for machine interpretation, presently inhibit the development of automated services that could enrich the content of journal articles or link information between articles.
However, various initiatives have recently started to change this status quo, exploring how the Web can be used to enrich online scholarly communications in various ways that are not possible in print. For example, exemplars bearing semantic enhancements have been made of HTML versions of journal articles (Shotton, 2009; Shotton et al., 2009), text-mining Web services have been created that can automatically add semantic markup to named entities within HTML text (Pafilis et al., 2009) or pull back contextual information from cited papers (Wan et al., 2010), and 'smart' PDF readers such as Utopia Documents have been developed that provide annotation overlays to enrich the otherwise-static content of PDF articles (Attwood et al., 2010). Silvio Peroni and I have developed the SPAR (Semantic Publishing and Referencing) Ontologies to facilitate such developments (Shotton, 2010; Peroni and Shotton, 2011), and publishers, including the Royal Society of Chemistry's Project Prospect, Elsevier's Article of the Future, and Pensoft Journals, are starting to provide semantically enriched journal articles as part of their routine publishing workflows.
The publication of journal articles with such enhancements has come to be known as 'semantic publishing', a term that I define as the use of simple Web and Semantic Web technologies:
The goal of such semantic publishing is that the data, information and knowledge described in the online article can more easily be found, extracted, combined and reused.
The future of research communication
Four key meetings were held during 2011, bringing together academics, computer scientists and scholarly publishers to discuss the future of scholarly communication. The first of these, a workshop entitled Beyond the PDF, organized and hosted in January 2011 by Philip Bourne at the University of California, San Diego, itself built on an earlier HyPER workshop organized in May 2010 in Amsterdam by Anita de Waard of Elsevier Labs (de Waard et al., 2009). It was followed by a meeting entitled Beyond Impact, organized by Cameron Neylon of STFC at the Wellcome Trust headquarters in London in May 2011, that considered alternative metrics to the journal impact factor for the evaluation of research and particularly researcher merit. In August 2011, a further meeting on The Future of Research Communication, organized by Phil Bourne of UCSD, Tim Clark of Harvard University, Robert Dale of Macquarie University, Anita de Waard of Elsevier Labs, Ivan Herman of the W3C, Eduard Hovy of the University of Southern California and myself, was held in Germany as a Schloss Dagstuhl Perspectives Workshop. This led to the formation of the Force11 Community dedicated to the improvement of research communication and e-scholarship, and to the publication in October 2011 of the Force11 White Paper (Bourne et al., 2011), that was submitted as evidence both to the Royal Society's Science as a Public Enterprise project and to the UK Cabinet Office's public consultation Making Open Data Real. Most recently, in October 2011, Microsoft Research and Harvard University jointly hosted a meeting in Cambridge, Massachusetts, entitled Transforming Scholarly Communication that took these ideas forward. The thinking undertaken at these meetings has contributed significantly to the formulation of the Five Stars of Online Journal Articles.
2. The Five Stars of Online Journal Articles
I propose five factors (peer review, open access, enriched content, available datasets and machine-readable metadata) as the Five Stars of Online Journal Articles, a constellation of five independent criteria within a multi-dimensional publishing universe against which online journal articles can be evaluated, designed to characterize the potential for improvement to the journal article made possible by Web technologies.
While Tim Berners-Lee's Five Stars of Linked Open Data build one upon the other, representing degrees of achievement or completeness along the single axis of online data publication, the proposed Five Stars of Online Journal Articles are complementary, forming a constellation arranged along five independent axes within a multi-dimensional publishing universe, each of which can be evaluated on its own merits. Of course, the degree of achievement along each of these publishing axes can vary, equivalent to the different stars within the constellation shining with varying luminosities.
The Five Stars of Online Journal Articles thus encapsulate a richer vision. Each star is highly desirable in its own right, but it is only by achieving them all in combination that we will truly advance scholarly communication. Let us now consider how we might score performance of individual articles against each star. My comments are addressed primarily to authors, but it should be clear to everyone that realization of these publishing goals will require the active and enthusiastic collaboration of journal publishers and editors.
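The independence of the five axes can be made concrete with a small data structure. The following is only a sketch: the class name `FiveStarRating`, the helper `weakest_axes` and the example scores are hypothetical, and it assumes that every axis uses the same integer range of 0 to 4. It deliberately keeps the five scores separate rather than collapsing them into a single total, since each star is meant to be judged on its own merits.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class FiveStarRating:
    """One score per axis, each assumed to lie on a five-point scale from 0 to 4."""
    peer_review: int
    open_access: int
    enriched_content: int
    available_datasets: int
    machine_readable_metadata: int

    def __post_init__(self):
        # Reject any score outside the assumed 0..4 range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, int) or not 0 <= value <= 4:
                raise ValueError(f"{f.name} must be an integer from 0 to 4, got {value!r}")

    def weakest_axes(self):
        """Names of the axes sharing the lowest score, in declaration order."""
        scores = {f.name: getattr(self, f.name) for f in fields(self)}
        lowest = min(scores.values())
        return [name for name, score in scores.items() if score == lowest]

# Hypothetical scores for an imaginary article, for illustration only.
rating = FiveStarRating(peer_review=2, open_access=4, enriched_content=1,
                        available_datasets=1, machine_readable_metadata=3)
```

Calling `rating.weakest_axes()` on the example points an author to the axes where improvement would matter most.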
2.1 Peer review
Ensure your article is peer reviewed, to provide assurance of its scholarly value, quality and integrity.
Quality assurance of journal articles has traditionally been provided by anonymous pre-publication peer review. In my own experience, peer review has always been a positive experience, the reviews being fair and the modification made in response to reviewers' comments invariably improving the overall readability and quality of the articles. However, the practice is currently being seriously challenged, for several reasons. First, the system for undertaking pre-publication peer review is inefficient and protracted, labouring under the strain of an ever-increasing number of papers being submitted for publication. Second, the academics who are expected to undertake this activity for the benefit of their fellow academics, and without payment from the scholarly publishers, are increasingly reluctant to do so, since good reviewing takes effort for which reviewers receive scant reward in terms of academic recognition, when they are under pressure from other quarters. Third, the service has been criticised for failing to achieve its objective of ensuring that those papers accepted for publication are consistently of high quality. Finally, it has on occasion provided scope for extreme academic malpractice, by giving opportunity for the reviewer to delay the publication of a competitor's work while undertaking research based on stolen ideas, thereby giving the reviewer academic credit that rightly belongs to the authors of the paper under review.
Three approaches have been proposed to improve the peer-review process and guard against such malpractice. First, that reviewer anonymity should be withdrawn, not only to reduce misconduct, but also to permit academic credit to be awarded more transparently to the majority of good reviewers who give their time to this process. Second, that the reviews should be published along with the reviewed article, so that readers can see the contributions made by the reviewers to the final text. Third, most controversially, that the process of quality assurance should be decoupled from the act of publication.
A small number of journals have now adopted fully open reviewing for all their papers, with apparently satisfactory results. However, critics of this policy point out that, at least in some disciplines such as the humanities that have less openly critical cultures, the lack of anonymity may inhibit reviewers from delivering more forthright comments, thus weakening the review process.
Since publication can now be undertaken entirely on line, at considerably reduced cost compared to printing paper journals, there is no requirement that an article be prepared in its final form before being published. Post-publication peer review enables the responsibility for reviewing to be broadened from the two or three individuals selected by a journal's editorial staff to the wider academic community, whose feedback can then be incorporated into a revised version of the paper which is then re-published.
Such post-publication peer review is considered 'light-weight' by many scientists, and has been criticized as working well for controversial or high-interest papers, but less effectively for sound papers of more limited interest, not least because readers are under time pressures of their own, and are reluctant to engage in activities for which there are no established academic reward mechanisms. However, post-publication peer review is the rigorous norm for those who publish Internet specifications and Web standards documents (RFCs (Requests for Comments) published by the Internet Engineering Task Force (IETF) and Candidate Recommendations published by the World Wide Web Consortium (W3C)), where the whole purpose of the initial publication is to make these documents available for post-publication peer review for a specified period of time, during which comments and criticisms from any interested parties are received and acted upon, before the new standards are formally agreed and published.
Papers published on line can additionally receive comments and be subject to approval ratings by readers, with the quality of the paper being determined at least in part by its perceived usefulness, although in practice the uptake of opportunities to make such comments on published articles has been limited.
These alternative possibilities enable us to evaluate the process of peer review, the first of the Five Stars of Online Journal Articles, in terms of its effectiveness and openness, using the following simple five-point scale, from 0 to 4:
2.2 Open Access
Ensure others have cost-free open access both to read and to reuse your published article, to ensure its greatest possible readership and usefulness.
The most fundamental change that the Internet has brought to scholarly publishing in recent years, over and above the move from print to online provision of journal articles, and the greatest challenge to the traditional business models of subscription access publishers, has been the growth of open access (OA) provision, in which articles are made available to readers without subscription or fee barriers. Without the technical possibility of using the Internet to deliver content cheaply, the open access movement would have been still-born.
As with peer review, varying degrees of access openness are possible, and careful distinctions have to be made. In particular, an open access article may be available to read without payment, but such an article may remain covered by copyright and license restrictions. These prevent all forms of transmission, reproduction and reuse beyond that allowed by the 'fair use' or 'fair dealing' principles of copyright law, thereby preventing reuse of the content for text mining, for the production of derivative works, or for commercial purposes, without the written permission of the copyright owners.
The nomenclature used to characterise different types of open access is confusing and variably employed. My understanding in this area has been guided by two particularly helpful blog posts on this subject by Peter Suber (2008) and Peter Murray-Rust (2011), who clearly distinguish two orthogonal axes of classification:
While both imply 'free' (a potentially ambiguous word), gratis open access equates to 'free as in beer', while libre open access equates to 'free as in speech'. Gratis open access is thus a necessary but not a sufficient condition for libre open access.
The fundamental Open Access declarations that relate to scholarly publishing (the 2002 Budapest Open Access Initiative, the 2003 Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities, and the 2003 Bethesda Statement on Open Access Publishing) all defined OA in 'libre'-oriented prose, but many publishers' definitions of open access equate only to gratis open access. Clarity of what it means to be fully open is given by the Open Definition of the Open Knowledge Foundation:
"A piece of content or data is open if anyone is free to use, reuse, and redistribute it subject only, at most, to the requirement to attribute and share-alike."
Both green open access and gold open access articles can be either gratis open access or libre open access. Libre open access is most clearly specified by use of an explicit license, such as the Creative Commons Attribution License, that states clearly what rights are given to the reader/user of the article, or by the use of a rights waiver that places the article in the public domain. Without such a clear specification, it is wise to assume that any OA article is only 'gratis open access'.
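The two orthogonal axes, and the conservative default of assuming gratis access when no explicit license is stated, can be sketched as follows. The names `Route`, `Rights`, `LIBRE_LICENSES` and `classify` are hypothetical, as are the license labels; the sketch assumes the common usage in which 'green' denotes a self-archived copy and 'gold' a copy that is open on the publisher's own site.

```python
from enum import Enum
from typing import Optional, Tuple

class Route(Enum):
    GREEN = "green"   # assumed meaning: self-archived copy in a repository
    GOLD = "gold"     # assumed meaning: open on the publisher's own site

class Rights(Enum):
    GRATIS = "gratis"  # free to read only
    LIBRE = "libre"    # free to read and to reuse

# Illustrative labels only; a real system would need an agreed license vocabulary.
LIBRE_LICENSES = {"CC BY", "CC0"}

def classify(route: Route, license_label: Optional[str]) -> Tuple[Route, Rights]:
    """Return (route, rights), assuming gratis unless an explicit libre license is stated."""
    rights = Rights.LIBRE if license_label in LIBRE_LICENSES else Rights.GRATIS
    return (route, rights)

# An article with no stated license is treated as gratis, whichever route it took.
unlicensed = classify(Route.GREEN, None)
licensed = classify(Route.GOLD, "CC BY")
```

Keeping `Route` and `Rights` as separate types mirrors the point that either route can carry either level of rights.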
Clearly, it is difficult to bring these two orthogonal classifications together into the single evaluation scale required for the second of the Five Stars, so I have taken the conservative approach of assuming that OA articles are only gratis open access, unless otherwise stated:
These categorizations require further explanatory comment:
No open access
A few subscription access journals, mostly in the biomedical sciences, permit the author to upload the published version of the article to an institutional repository or personal Web site (i.e. self-archiving green/gratis open access) after an embargo period of typically six to twelve months. Information about this is given by SHERPA/RoMEO. However, timely access to newly published research information remains strictly limited to journal subscribers.
Self-archiving green/gratis open access
In physics, mathematics and computer science, use of Cornell University's ArXiv preprint repository is the norm. ArXiv is an exemplary repository, since all its content is either made available under the Creative Commons Attribution license or the Creative Commons Attribution-Noncommercial-ShareAlike license, or has been placed in the public domain by association with the Creative Commons Public Domain Declaration. Thus clearly ArXiv content is green/libre open access.
However, for most research disciplines, where there is no culture of depositing preprints in a single subject-specific open archive before submission to journals for publication, green open access is a poor fourth among the open access choices. This is both because of the difficulty that potential readers have in finding the open access versions of articles scattered across institutional repositories (although new cross-repository search services such as CORE are improving that situation), and because the license arrangements for reuse of such content are typically unclear.
Consider, for example, the Oxford Research Archive, the institutional repository of the University of Oxford. ORA provides a helpful Copyright Guide for the benefit of authors depositing works in ORA, in which the copyright restrictions that publishers may place on published works are discussed, and where the possibility of using a Creative Commons licence is mentioned. However, its guidance to readers concerning ORA content is as follows:
"The full text of many of these items is freely available to be used in accordance with copyright and end-user permissions." (My emphasis)
Eprints Soton, the University of Southampton Institutional Research Repository, carries an identical statement, while Dspace@Cambridge, the institutional repository of the University of Cambridge, has a more restrictive blanket statement:
"Copyright and other intellectual property rights subsist in this site, in the Deposited Works and in any accompanying documentation and metadata. Unless otherwise noted, Deposited Works in DSpace@Cambridge are made freely available for access, printing and download for the purposes of non-commercial research or private study only. You may not further copy, reproduce, publish, ... or otherwise use a Deposited Work in whole or in part or in any manner or in any media without the express written permission from the appropriate rights owner(s) of the Deposited Work(s)." (My emphasis)
The metadata for the most recently deposited ORA research article, a PDF copy of Knight et al. (2011), The Puzzle of Migrant Labour Shortage and Rural Labour Surplus in China, contains no information about the open access status of this article, which one must therefore assume to be only gratis open access. That is confirmed by going to the original article on the Elsevier journal web site, China Economic Review 22 (4): 585-600 (December 2011) doi:10.1016/j.chieco.2011.01.006, where there is a link "Permissions and Reprints" that takes you to a page entitled "RightsLink Copyright Clearance Centre". There one can calculate the cost of reusing the article for purposes other than reading for research or private study. To use 15 print copies of the article as course material for teaching within the University of Oxford would cost £26.74, while to use 15 print copies of the article for training purposes within a commercial organization would cost £384.34.
One must assume that the situation is similar in other institutional repositories, and that items are available only as green/gratis open access unless there is an explicit libre open access license. Therefore, as Peter Murray-Rust concludes: "By default, unless the author/self-archivist makes a special effort, the reader (of an item in an institutional repository) has no rights of use over the deposited item." (Murray-Rust, 2011).
Funder-mandated green/gratis open access
Although the fee paid by the funding agency to the publisher to deposit a copy of the article in PubMed Central is substantial (typically $3,000-$5,000 per article), it is important to realize that this gains the reader only gratis open access rights. In particular, content is not available for text mining. Of the entire content of PubMed Central, some 2.3 million articles, only about 10% are within what PubMed Central terms the Open Access Subset, having some form of libre open access license, these coming mainly from publishers that themselves have a gold/libre open access policy. It is from the reference lists of those articles, rather than the full PubMed Central corpus, that we have created the Open Citations Corpus of some 6.3 million bibliographic references, referencing about 20% of all the biomedical articles published between 1950 and 2010, including the most important papers in every discipline. Published under a libre Creative Commons attribution license and expressed in RDF, these citations are available for human scrutiny and for automated querying via a SPARQL endpoint, and the entire corpus can also be downloaded for re-use.
Author-pays gold/gratis open access
The fees for author-pays gold/gratis open access are typically substantial, in the range of $500-$3,250 per article. The SHERPA/RoMEO Web site provides details. Since my Five Stars are designed to operate at the article level rather than the journal level, I do not distinguish here between articles that are made open access on an individual basis within what is otherwise a subscription access journal (sometimes called a 'hybrid' open access journal), and articles within a 'true' open access journal in which all the articles are open access. Elsevier calls the former arrangement 'sponsored access', and a journal in which all the articles are open access an 'author pays journal'. At the individual article level, however, there is no difference: the author pays a fee, and everyone can read the article freely on the publisher's Web site.
Rather, the issue is whether or not readers have reuse rights over the article, or whether only gold/gratis open access is granted. For example, Elsevier's policy concerning author-pays open access articles is stated on its Terms and Conditions page, which one can reach by clicking the Terms and Conditions footnote on the home page of its only open access journal, International Journal of Surgery Case Reports. Among other restrictions, this states:
"All content contained on or accessed from the Site ... is owned by Elsevier or its licensors and is protected by copyright, trademark and other intellectual property and unfair competition laws. You may not copy, display, distribute, ... or create other derivative works from ... all or any part of the Content ... except as otherwise expressly permitted under these Terms and Conditions, relevant license or subscription agreement or authorization by us. Unless expressly authorized by us, you may not ... automatically search, scrape, extract, deep link or index any Content."
Clearly, in exchange for the ~$3,000 Elsevier article sponsorship fee or author-pays journal article fee, readers are only getting gold/gratis open access. This low value for money is perhaps reflected in the fact that, in 2009, from within the 450 Elsevier journals offering 'sponsored access' to their articles, only 515 'sponsored access' articles were published.
Author-pays gold/libre open access
The fees for author-pays gold/libre open access are also typically substantial, again in the range of $500-$3,250 per article, but in this case the publisher allows third parties both to read all its articles on the journal Web site free of charge, and to reuse the content. As Peter Suber (2008) points out, libre open access encompasses a range of possibilities, corresponding to which permissions for reuse have been granted. For example, it might be possible to use the content of the article to create a derivative work, but not if it is to be used for commercial purposes. The scope for reuse is determined by the nature of the license under which the article is published. It is thus an over-simplification to call libre open access 'full open access'.
BioMed Central and The Public Library of Science (PLoS), the two major publishers of OA journals in the biomedical sciences, both use the most permissive open access attribution license for all the works they publish. (The BioMed Central Open Access license agreement bears a different name, but is otherwise identical to the Creative Commons Attribution License employed by PLoS.) Under that license, authors retain ownership of the copyright for their content, but allow third parties to download, reuse, reprint, modify, distribute, and/or copy the content for any purpose including commercial, as long as the original authors and source are cited. No permission is required from the authors or the publishers. This is clearly the most helpful situation for potential reusers of published articles. The semantic enhancements we were able to apply to the Reis et al. (2008) article in PLoS Neglected Tropical Diseases (Shotton et al., 2009) were made possible because the article was published under such a license.
2.3 Enriched content
Use the full potential of Web technologies and Web standards to provide interactivity and semantic enrichment to the content of your online article.
Web technology can be used to provide various semantic enhancements of scholarly journal articles, links to external information sources of relevance to the textual context, and different types of user interactivity, as outlined in the Semantic publishing section of the Introduction.
Since the various types of semantic enrichment possible and some of the means for achieving them have been detailed elsewhere (Shotton et al., 2009; Shotton and Portwin, 2009), they will not be discussed further here. As stated above, several publishers and journal editors are undertaking such enrichments. However, these would best be achieved during authoring. When writing an article, authors can easily achieve quick wins in terms of functionality by ensuring plentiful links to external Web resources are provided (e.g. to their own home pages, to reagent suppliers' catalogues, and to cited articles). An open source plugin to Word 2007 has been published that permits semantic markup of named entities according to chosen ontologies (Fink et al., 2010), and it is hoped that other such semantic authoring tools will soon become available.
2.4 Available datasets
Ensure that all the data supporting the results you report are fully published under an open license, with sufficient metadata to enable their re-interpretation and reuse.
Through the Brussels Declaration of STM Publishing, academic publishers have strongly endorsed the principle that research data relating to journal articles should be made freely available, to enable inspection of the data and validation of the claims made in the article, and to permit data reuse in other contexts. Particularly if the research has been undertaken with public funding, it is now increasingly held that research data should be regarded as a common good (Boulton et al., 2011; Wood et al., 2010), and mechanisms to facilitate their publication are being proposed (Greenberg et al., 2009; Van der Graaf and Waaijers, 2011; Bourne et al., 2011). However, in this commendable enthusiasm for openness, it is important to acknowledge the personal time and effort invested by the researchers who discover or create the data, and their moral right to have the first chance to explore, publish on and benefit academically from the data, before publishing them for the benefit of others.
It is also important to emphasize that the term 'data' should be interpreted here in very general terms, to encompass any outputs from a research investigation over and above the text of journal articles. Thus 'data' can include images, sound recordings and videos, graphs and diagrams, animations and simulations, mathematical models, protocols and workflows, and software, as well as numerical datasets.
The principles of how best to make data available on the Web have already been described by Tim Berners-Lee in his Five Stars of Linked Open Data. Some overlap is inevitable, but the following ratings are intended to reflect the nature of the data made available, and where, when and to whom that availability is granted.
Where data are published is of great importance. Authors should bear in mind the very unsatisfactory nature of journal supplementary information files as repositories for valuable research data, in terms of openness, discoverability, curation, and reliable persistence (Evangelou et al., 2005; Anderson et al., 2006; Smit, 2011). As safer havens for published data, they should look instead to institutional repositories or, better, to subject-specific databases and repositories. For example, the Dryad Data Repository curates biological datasets linked to peer-reviewed journal articles, makes them available pre-publication to peer reviewers, and then publishes them, either at the same time as the article or after an optional embargo period, under a Creative Commons CC Zero open data waiver, with DataCite DOIs to permit proper citation and the award of academic credit.
2.5 Machine-readable metadata
Publish machine-readable metadata describing both your article and your cited references, so that these can be discovered automatically.
To date, publishers have employed a variety of proprietary XML-based informational models and document type definitions (DTDs) to mark up component parts of an electronic document (title, author list, abstract, etc.) in ways that assist the publishing process, but all too often even these basic metadata are not made available to readers, who are given only a PDF version of the article.
Modern Web information management techniques employing W3C standards such as RDF and OWL 2 permit such information to be encoded using standard vocabularies in ways that permit computers to query metadata and integrate Web-based information from multiple resources in an automated manner. The SPAR (Semantic Publishing and Referencing) Ontologies are just some of the vocabularies being used for this purpose to describe scholarly publications (Peroni and Shotton, 2011).
Using these Web standards and vocabularies, it is possible to provide semantic descriptions of the structural and rhetorical components of the article using DoCO, the Document Components Ontology, and to create and publish machine-readable RDF metadata that describe the journal article itself, i.e. that encode the standard bibliographic information defining the article (authors, publication year, title, journal name, volume number, page numbers, DOI, etc.) using FaBiO, the FRBR-aligned Bibliographic Ontology and BiRO, the Bibliographic Reference Ontology. It is also possible similarly to encode bibliographic information for all the references within the article's reference list, and to use CiTO, the Citation Typing Ontology, both to assert the existence of a citation between the citing and the cited papers (i.e. <Paper A> cito:cites <Paper B> . ) and also to characterise the type or nature of that citation both factually and rhetorically (Shotton, 2010; Peroni and Shotton, 2011).
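As a minimal sketch of what such metadata might look like, the following Python fragment (standard library only) assembles a few Turtle statements describing two of the papers listed in the References and asserts a CiTO citation between them. The helper names (doi_iri, article_triples, citation_triple, to_turtle) are hypothetical, and the choice of fabio:JournalArticle and dcterms:title is an assumption made for illustration; it is not necessarily how the metadata accompanying the published papers are encoded.

```python
# Namespace IRIs for the vocabularies used in this sketch.
# Which terms to use from each vocabulary is an illustrative assumption.
PREFIXES = {
    "cito": "http://purl.org/spar/cito/",
    "fabio": "http://purl.org/spar/fabio/",
    "dcterms": "http://purl.org/dc/terms/",
}


def doi_iri(doi):
    """Turn a bare DOI into the resolver IRI form used in this article."""
    return "http://dx.doi.org/" + doi


def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def article_triples(doi, title):
    """Basic bibliographic statements for one journal article."""
    subject = "<" + doi_iri(doi) + ">"
    return [
        f"{subject} a fabio:JournalArticle .",
        f'{subject} dcterms:title "{escape_literal(title)}" .',
    ]


def citation_triple(citing_doi, cited_doi, relation="cites"):
    """Assert a citation; 'relation' is the local name of a CiTO property."""
    return f"<{doi_iri(citing_doi)}> cito:{relation} <{doi_iri(cited_doi)}> ."


def to_turtle(triples):
    """Prepend prefix declarations to a list of Turtle statements."""
    header = [f"@prefix {name}: <{iri}> ." for name, iri in PREFIXES.items()]
    return "\n".join(header + [""] + triples)


# DOIs and titles taken from the reference list of this article.
SHOTTON_ET_AL_2009 = "10.1371/journal.pcbi.1000361"
REIS_ET_AL_2008 = "10.1371/journal.pntd.0000228"

triples = (
    article_triples(
        SHOTTON_ET_AL_2009,
        "Adventures in semantic publishing: exemplar semantic enhancement "
        "of a research article",
    )
    + article_triples(
        REIS_ET_AL_2008,
        "Impact of environment and social gradient on Leptospira infection "
        "in urban slums",
    )
    + [citation_triple(SHOTTON_ET_AL_2009, REIS_ET_AL_2008)]
)

turtle = to_turtle(triples)
print(turtle)
```

The generic cito:cites property only records that the citation exists; passing the local name of a more specific CiTO property as the relation argument would characterise the nature of the citation as well.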
Of course, machine-readable metadata need not stop there. There is a growing number of checklists and minimum information standards specifying the information that should be included within research publications, or defining the metadata to describe articles or datasets within particular domains. One such example is MIIDI, a Minimal Information standard for reporting an Infectious Disease Investigation. Using the MIIDI Editor, metadata may be structured according to MIIDI to describe an infectious disease investigation and its research outputs, including journal articles and research datasets. For the former, the metadata can include statements about the main hypotheses of the research investigation and the principal conclusions described in the article, in addition to providing factual statements concerning the nature of the disease, the number of patients, etc.
The availability of article metadata can be rated on the following scale:
There are several ways in which such metadata may be made available. As indicated above, structural markup may be included within the XHTML document itself. By using RDFa, it is also possible to embed semantic markup within the Web document in such a way that these machine-readable metadata become part of the Web of Linked Open Data. Other possibilities of embedded markup exist using microdata within HTML5 documents. Alternatively, bibliographic and citation metadata can accompany the relevant journal article as supplementary online RDF files: such files accompany Shotton (2010) and our enhanced version of Reis et al. (2008). However, as for the research datasets relating to the article, it is advantageous if the relevant metadata files are also submitted to appropriate linked open data repositories, such as those of the Open Bibliography Project and the Open Citation Corpus.
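To illustrate the RDFa option mentioned above, here is a small Python sketch that wraps an in-text reference in markup carrying a CiTO citation statement. The function name rdfa_citation is hypothetical, and the use of the RDFa 1.1 prefix, about, rel and href attributes is one possible arrangement assumed for this example, not the markup used in any of the published papers.

```python
from html import escape

CITO = "http://purl.org/spar/cito/"


def rdfa_citation(citing_iri, cited_iri, link_text, relation="cites"):
    """Return a paragraph whose hyperlink also states a CiTO citation.

    'about' names the citing article (the subject), 'rel' names the CiTO
    property, and 'href' names the cited article (the object).
    """
    return (
        f'<p prefix="cito: {CITO}" about="{escape(citing_iri, quote=True)}">'
        f'<a rel="cito:{relation}" href="{escape(cited_iri, quote=True)}">'
        f"{escape(link_text)}</a></p>"
    )


# Shotton et al. (2009) citing Reis et al. (2008); DOIs from the References.
fragment = rdfa_citation(
    "http://dx.doi.org/10.1371/journal.pcbi.1000361",
    "http://dx.doi.org/10.1371/journal.pntd.0000228",
    "Reis et al. (2008)",
)
print(fragment)
```

A human reader sees only an ordinary hyperlink, while an RDFa-aware parser should be able to extract the same citation statement that a separate RDF file would carry.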
Detailed metadata describing the content of a paper can form the basis for a structured digital summary describing the essence of an article in both human- and machine-readable form, which can be published as an Open Research Report in an open access data journal (more strictly, in this case, a 'metadata journal'), while individual factual statements from a paper can be published as nanopublications (Groth et al., 2010).
3. Evaluating published articles against the Five Stars
While the criteria adopted for the evaluation scales presented in Section 2 are somewhat arbitrary, and while the rating of a particular article on each axis may involve elements of subjective judgments, these Five Stars of Online Journal Articles provide a conceptual framework by which to judge the degree to which any article achieves or falls short of the ideal, which should be useful to authors, editors and publishers, who should now ask themselves:
"How do my online journal articles rate against these Five Stars?"
As an exercise in 'drinking my own champagne', I have evaluated the Reis et al. (2008) article as it was before and after our semantic enhancements, and also my own recent publications, including this article, to provide exemplars. Each is rated with respect to each of the Five Stars of Online Journal Articles on the five-point scales given in Section 2. I present the results both by means of constellation diagrams, within which the stars have different magnitudes, and in tabular form, with an overall numerical rating for each paper. (Full bibliographic details of the following papers are given in the References section.)
Reis et al. (2008).
Journal: PLoS Neglected Tropical Diseases. Publisher: Public Library of Science.
This research paper contains different types of analysed data concerning the risk factors of contracting the disease leptospirosis for inhabitants of an urban slum in Salvador, Brazil. The underlying unpublished raw datasets contain confidential information about individuals' health, financial, familial and employment status.
Rating: Original version: http://dx.doi.org/10.1371/journal.pntd.0000228.
One week after its publication, I chose the Reis et al. (2008) paper for semantic enhancement, which was undertaken with the support of the authors and PLoS and is described in Shotton et al. (2009) and Shotton and Portwin (2009). I then republished the paper with these enhancements to act as an exemplar.
Rating: Semantically enhanced version: http://dx.doi.org/10.1371/journal.pntd.0000228.x001.
Shotton (2009).
Journal: Learned Publishing. Publisher: Association of Learned and Professional Society Publishers.
This paper describes and reviews the state of semantic publishing. There are no numerical data in the paper.
Shotton et al. (2009).
Journal: PLoS Computational Biology. Publisher: Public Library of Science.
This paper describes the semantic enhancements applied to Reis et al. (2008). As such, it contains no primary research data of its own.
Shotton (2010).
Journal: J. Biomedical Semantics. Publisher: BioMed Central.
Shotton (2012) (This article.)
Journal: D-Lib Magazine. Publisher: Corporation for National Research Initiatives (CNRI).
This article is a position paper presenting ideas (in terms of the FaBiO ontology, a fabio:proposition), and contains no research data.
While D-Lib Magazine does not undertake formal peer review of its articles, and has a policy of not publishing articles that have previously appeared elsewhere, this particular article has benefited from substantial comments made by colleagues (see Acknowledgements for further details) on a preprint of this paper that was published in Nature Precedings in preparation for the Microsoft Research/Harvard University meeting in October 2010 entitled Transforming Scholarly Communication. I am grateful to the editorial team of D-Lib Magazine for their flexibility in accepting this article despite the fact that the preprint had already been published, since the comments received have constituted an effective post-publication responsive peer review of the preprint, and have stimulated a major revision and expansion of the text, resulting in significant enhancements to both the content and the quality of the resulting D-Lib article. As a result of these enhancements, the evaluation scales for some of the Five Stars have been amended, causing the ratings given above for Shotton (2009) and for Shotton et al. (2009) to be lowered relative to the scores given to those papers in the preprint.
The above ratings show that the nature of the article will influence the overall rating obtained. For example, reviews and position papers with no primary research data will always score low in terms of available datasets.
The Five Stars of Online Journal Articles Ontology, available from http://purl.org/spar/fivestars/, is a simple ontology written in OWL 2 DL that forms part of SPAR, a suite of Semantic Publishing and Referencing Ontologies (http://purl.org/spar/). It is intended for use by publishers and others wishing to encode Five Stars ratings, such as those shown above, in machine-readable form, so they can accompany other machine-readable metadata for the article. The following RDF graph, shown in Turtle notation, gives the Five Stars ratings for this article:
Ubiquity Press has indicated that it wishes to adopt such evaluations and give each of its published articles a Five Star rating. I encourage other publishers to do so too.
Acknowledgements
I am most grateful to Bob DuCharme who, inspired by Berners-Lee's Five Stars of Linked Open Data, challenged me to come up with five stars for semantic publishing, following a talk entitled Applying XML and Semantic Technologies to Liberate Infectious Disease Data that I gave at the recent Oxford XML Summer School. I thank Tanya Gray and Katherine Fletcher for feedback after reading a preliminary draft of this paper. I wish particularly to acknowledge the input made by Silvio Peroni, who insisted that I specify evaluation scales for all five stars, and whose proposals concerning peer review and open access I have incorporated; by those who participated in a brief but lively discussion of the Five Stars preprint on the Beyond the PDF mail list, particularly Cameron Neylon who suggested a radical revision of my original evaluation scale for peer review, Phillip Lord for wise remarks concerning post-publication peer review and RFCs, and Peter Murray-Rust for his insistence that the type of license under which Open Access publications are published is critical; and by Brian Hole of Ubiquity Press, both for his comments and for his general enthusiasm for the Five Star concept. Their suggestions have constituted an effective post-publication peer review of a preprint of this paper, as mentioned above, and I thank them all sincerely for taking the time and making the effort to supply these valuable critiques.
References
 Anderson NR, Tarczy-Hornoch P and Bumgarner RE (2006). On the persistence of supplementary resources in biomedical publications. BMC Bioinformatics 7: 260. http://dx.doi.org/10.1186/1471-2105-7-260.
 Attwood TK, Kell DB, McDermott P, Marsh J, Pettifer SR and Thorne D (2010). Utopia documents: linking scholarly literature with research data. Bioinformatics 26: i568-i574. http://dx.doi.org/10.1093/bioinformatics/btq383.
 Bourne P, Clark T, Dale R, de Waard A, Herman I, Hovy E and Shotton D, on behalf of the Force11 community (2011). Force11 White Paper: Improving the Future of Research Communication and e-Scholarship. (Published 28 October 2011). http://force11.org/white_paper.
 de Waard A, Buckingham Shum S, Carusi A, Park J, Samwald M, and Sándor Á (2009). Hypotheses, Evidence and Relationships: The HypER Approach for Representing Scientific Knowledge Claims. In: Proceedings 8th International Semantic Web Conference, Workshop on Semantic Web Applications in Scientific Discourse (26 Oct 2009, Washington DC.). Lecture Notes in Computer Science, Springer Verlag: Berlin. http://oro.open.ac.uk/18563/.
 Evangelou E, Trikalinos TA and Ioannidis JP (2005). Unavailability of online supplementary scientific information from articles published in major journals. FASEB J. 19: 1943-1944. http://dx.doi.org/10.1096/fj.05-4784lsf.
 Fink JL, Fernicola P, Chandran R, Parastatidis S, Wade A, Naim O, Quinn GB and Bourne PE (2010). Word add-in for ontology recognition: semantic enrichment of scientific literature. BMC Bioinformatics 11: 103. http://dx.doi.org/10.1186/1471-2105-11-103.
 Greenberg J, White HC, Carrier S and Scherle R (2009). A metadata best practice for a scientific data repository. Journal of Library Metadata 9 (3-4): 194-212. http://dx.doi.org/10.1080/19386380903405090.
 Kurtz D, Parker G, Shotton D, Klyne G, Schroff F, Zisserman A and Wilks Y (2009). CLAROS: bringing classical art to a global public. Proc. IEEE e-Science Conference, Oxford, 9-11 December 2009, pp 20-27. http://doi.ieeecomputersociety.org/10.1109/e-Science.2009.11.
 Pafilis E, O'Donoghue SI, Jensen LJ, Horn H, Kuhn M, Brown NP and Schneider R (2009). Reflect: augmented browsing for the life scientist. Nature Biotechnology 27: 508-510. http://dx.doi.org/10.1038/nbt0609-508.
 Reis RB, Ribeiro GS, Felzemburgh RDM, Santana FS, Mohr S, Melendez AXTO, Queiroz A, Santos AC, Ravines RR, Tassinari WS, Carvalho MS, Reis MG and Ko AI (2008). Impact of environment and social gradient on Leptospira infection in urban slums. PLoS Neglected Tropical Diseases 2: e228. http://dx.doi.org/10.1371/journal.pntd.0000228.
 Shotton D and Portwin K (2009). Technical implementation of the semantic enhancements applied to Reis et al. (2008) Impact of environment and social gradient on Leptospira infection in urban slums. PLoS Neglected Tropical Diseases 2(4): e228. Supporting Information File S1 to Shotton et al. (2009). http://dx.doi.org/10.1371/journal.pntd.0000228.x009.
 Shotton D, Portwin K, Klyne G, and Miles A (2009). Adventures in semantic publishing: exemplar semantic enhancement of a research article. PLoS Computational Biology 5: e1000361. http://dx.doi.org/10.1371/journal.pcbi.1000361.
 Suber P (2008). Gratis and libre Open Access. SPARC Open Access Newsletter (August 2008 issue). http://www.arl.org/sparc/publications/articles/gratisandlibre.shtml.
 Van der Graaf M and Waaijers L (2011). A Surfboard for Riding the Wave. Towards a four country action programme on research data. A Knowledge Exchange Report. http://www.knowledge-exchange.info/Default.aspx?ID=469.
 Wan S, Paris C and Dale R (2010). Supporting browsing-specific information needs: Introducing the Citation-Sensitive In-Browser Summariser. Web Semantics: Science, Services and Agents on the World Wide Web 8: 196-202. http://dx.doi.org/10.1016/j.websem.2010.03.002.
 Wood J, Andersson T, Bachem A, Best C, Genova F, Lopez DR, Los W, Marinucci M, Romary L, Van de Sompel H, Vigen J, Wittenburg P, Giaretta D and Hudson RL (2010). Riding the wave: How Europe can gain from the rising tide of scientific data. Final report of the High Level Expert Group on Scientific Data; A submission to the European Commission, October 2010. Available from http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf.
About the Author