Supporting Science through the Interoperability of Data and Articles
IJsbrand Jan Aalbersberg
Whereas it is established practice to publish relevant findings of a research project in a scientific article, there are no standards yet as to whether and how to make the underlying research data publicly accessible. According to the recent PARSE.Insight study of the EU, over 84% of scientists think it is useful to link underlying digital research data to peer-reviewed literature. This trend is reinforced by funding bodies, which increasingly require grantees to deposit their raw datasets at freely accessible repositories. The publishing industry, too, believes that raw datasets should be made freely accessible. This article presents an overview of how Elsevier, as a scientific publisher with over 2,000 journals, gives context to articles that are available on its full-text platform SciVerse ScienceDirect, by linking out to externally hosted data at the article level, at the entity level, and in a deeply integrated way. With this overview, Elsevier invites dataset repositories to collaborate with publishers to create optimal interoperability between the formal scientific literature and the associated research data, improving the scientific workflow and ultimately supporting science.
Content innovation at Elsevier is about improving the peer-reviewed scientific communication between the author and the reader. For centuries this communication has taken place in the traditional print format, and more recently in the form of the PDF. Whereas the digital revolution brought great improvements to many processes around scholarly communication (like submission, discoverability, access, and archiving), it has had almost no impact on the content, format, and presentation of the scientific article itself. One of the ways Elsevier aims to use the new digital possibilities to add value to the scientific article is by putting it in context with external resources that are related to the article and relevant to the respective research community. This can be achieved in three ways.
In the following we describe these three types of "internet wiring" between scientific articles and research datasets and give examples of each. This article thus serves as an invitation to dataset repositories to actively collaborate with publishers to provide seamless interoperability between scientific articles and their associated datasets, which will result in an improved workflow for the scientist.
Scientific research is all about collaboration and is never limited to a single information source. In recent years, more and more digital repositories have been set up to host research data in the different fields. Even among information scientists, there is no consensus on the number of such repositories, but "there are clearly thousands in existence". The amount of actual data stored in these repositories is larger still, as individual datasets can easily amount to many terabytes in fields like earth and planetary sciences.
On the web, information that is not sufficiently interlinked with other relevant information tends to be invisible and thus unused. This also applies to research data: if data in a repository is not connected to the relevant literature, it is invisible, and its use and re-use are limited.
"Dataset linking" aims at bridging the gap that occurs when a research article is available on a publisher's full-text platform such as SciVerse ScienceDirect, while the underlying dataset is hosted on an entirely different service. It is Elsevier's objective to connect such datasets with the article in a bi-directional way: both from the article at SciVerse ScienceDirect to the research dataset and from the dataset to the research article.
Linking to a dataset
SciVerse ScienceDirect has created a generic mechanism to link from articles to datasets. It is based on using the article's DOI as the common linking pin between an article and its dataset, and it enables the dataset repository to position a link to the dataset on the SciVerse ScienceDirect article page. The associated workflow consists of the following five distinct steps:
Figure 1: Dataset linking using image-based linking.
The linking mechanism being used between SciVerse ScienceDirect and the data repository is so-called "image-based linking" (see Figure 1). Every time a user views the HTML version of an article, an image-link request containing the article DOI is sent to the external repository, asking whether a dataset is available for that article.
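The request described above can be illustrated with a minimal sketch. The endpoint URL, the `doi` parameter name, the example DOI, and the two response labels are assumptions made for illustration only; the actual request format agreed between SciVerse ScienceDirect and a repository is not specified in this article.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical repository endpoint (an assumption, not a real service URL).
REPOSITORY_ENDPOINT = "https://repository.example.org/linkimage"

# Hypothetical repository-side index: article DOI -> dataset landing page.
KNOWN_DATASETS = {
    "10.1016/j.example.2010.01.001": "https://repository.example.org/dataset/42",
}


def build_image_link_request(article_doi: str) -> str:
    """Publisher side: build the image URL that carries the article DOI."""
    return f"{REPOSITORY_ENDPOINT}?{urlencode({'doi': article_doi})}"


def answer_image_link_request(request_url: str) -> str:
    """Repository side: decide which image to serve for the requested DOI.

    Assumed behaviour: a visible banner if a dataset exists for the DOI,
    otherwise a blank image so that nothing appears on the article page.
    """
    query = parse_qs(urlparse(request_url).query)
    doi = query.get("doi", [""])[0]
    return "banner" if doi in KNOWN_DATASETS else "blank"
```

In this sketch the article page only needs to embed the generated URL as an image source; the repository alone decides, per DOI, whether a visible link appears.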
Current status of dataset linking
So far, Elsevier has established linking cooperation with two dataset repositories.
Note: in July 2010, the interoperability with PANGAEA was expanded by also embedding a Google Maps application created by PANGAEA in those articles for which such a map is available (see further below).
Figure 2: Dataset linking to PANGAEA (l) and CCDC (r).
Next to dataset linking, which usually concerns linking from the article as a whole, SciVerse ScienceDirect also allows linking from "entities" mentioned in the full text to related datasets. "Entities" in this context are occurrences of a discipline-specific concept used by researchers to communicate and categorize the objects in their research. A concept can be of different origins:
Linking from entities in an article on SciVerse ScienceDirect to external resources (e.g., the Protein Data Bank or a biodiversity database) requires two distinct steps:
Step 1: Identify the entities in the article
Entity identification can be done either manually (by the author) or automatically (through text-mining).
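As a rough illustration of the automatic route, the sketch below scans text for candidate Protein Data Bank identifiers with a regular expression. It is a naive heuristic under stated assumptions (four-character identifiers that start with a digit from 1 to 9 and contain at least one letter), and the function name is hypothetical; production text-mining tools rely on curated dictionaries and context, and this is not how any particular tool works.

```python
import re

# Assumption: a classic PDB identifier is a digit 1-9 followed by three
# alphanumeric characters, e.g. "2KAF".
PDB_CANDIDATE = re.compile(r"\b[1-9][A-Za-z0-9]{3}\b")


def find_pdb_candidates(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, identifier) for each candidate found in text."""
    hits = []
    for match in PDB_CANDIDATE.finditer(text):
        token = match.group()
        # Skip purely numeric tokens such as years ("2010").
        if any(ch.isalpha() for ch in token):
            hits.append((match.start(), match.end(), token.upper()))
    return hits
```

The character offsets make it possible to wrap each hit in entity mark-up later. A heuristic like this will still produce false positives, which is one reason manual mark-up by authors remains useful.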
Step 2: Create the actual links in the HTML
Once a user opens an article page on SciVerse ScienceDirect, the marked-up entities are rendered as hyperlinks pointing at the defined target resource (see Figure 3). A necessary condition is that the URL structure of the target database supports entity-related URLs, which is unfortunately not yet the case for all domain-specific databases.
Figure 3: Article on SciVerse ScienceDirect, where entities (in this case: Genbank Accession numbers) appear as hyperlinks in a table as well as in the text.
More specifically, the URL of the dataset associated with the entity has to be constructible from an entity-type-specific URL base (determined by the entity mark-up) plus an entity-specific modifier. For example, for all protein identifiers from PDB (Protein Data Bank), the shared URL base is "http://www.rcsb.org/pdb/explore.do?structureId=", and the marked-up protein entity (e.g., "2KAF") is appended to it as the modifier.
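The base-plus-modifier construction can be sketched as follows. The PDB URL base is the one quoted above; the dictionary key "pdb" and the function names are illustrative assumptions, not part of the platform.

```python
from html import escape
from urllib.parse import quote

# URL bases per entity type. The PDB base is quoted from the text; the key
# "pdb" is only an illustrative label for the entity type.
ENTITY_URL_BASES = {
    "pdb": "http://www.rcsb.org/pdb/explore.do?structureId=",
}


def entity_url(entity_type: str, identifier: str) -> str:
    """Append the entity-specific modifier to the type-specific URL base."""
    base = ENTITY_URL_BASES[entity_type]
    return base + quote(identifier, safe="")


def entity_hyperlink(entity_type: str, identifier: str) -> str:
    """Render a marked-up entity as an HTML hyperlink to its target resource."""
    url = entity_url(entity_type, identifier)
    return f'<a href="{escape(url, quote=True)}">{escape(identifier)}</a>'
```

For the example above, `entity_url("pdb", "2KAF")` yields the URL base followed by "2KAF". A database whose URLs cannot be composed this way has no entry to add to the table, which is the limitation noted earlier.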
Current status of entity linking
Currently, Elsevier asks authors to mark up occurrences of a variety of entities, including the following (to learn more about the target resource, click on the linked resource name):
In addition to using manual identification of entities by authors, SciVerse ScienceDirect recently also integrated two entity-linking tools that rely exclusively on text-mining: Reflect [6, 7] and NextBio [8, 9] (see Figure 4).
Figure 4: Entity links as created through text mining, using Reflect (l) and NextBio (r).
Application-based dataset linking
So far, all solutions described to connect an original research article with underlying datasets (as deposited at external data repositories) consist of creating links from the article to the dataset. However, it would be much more user-friendly if the scientist could see within the article itself what to find in the data repository, or even already see the data inside the article: this would require fewer clicks and would keep the context of the article in place. Elsevier offers this possibility in its SciVerse ScienceDirect platform through the capability of applications (see Figure 5).
Figure 5: Application-based dataset linking.
This capability opens a window in the HTML article page, which an application can use to present article-specific data that is extracted in real time from an external dataset repository. Here, too, two different situations can be distinguished:
Figure 6: Article on SciVerse ScienceDirect with a Google Maps application created by PANGAEA. The data used to construct the map also comes from PANGAEA.
Figure 7: Article on SciVerse ScienceDirect with a Protein Viewer application created by Elsevier. The data is pulled in real time from PDB.
As established by, for example, the PARSE.Insight study, the preservation and therefore deposition of research datasets is of crucial importance to the progress of science. However, preservation and deposition by themselves are not sufficient. Research data that cannot be found through, or is not connected to, the associated peer-reviewed research articles is effectively not part of the "official" research information that researchers can use in their work.
Fortunately, publishers like Elsevier now provide the means to connect research articles with datasets deposited at external data repositories. Obviously, this also requires that such data repositories provide the means to get connected with (i.e., to and from) articles at publisher sites. Organisations like DataCite do play an important role in getting the latter accomplished, by being a single point of contact for publishers and creating a single mechanism to interlink between publishers and data repositories.
The approach of providing links between research articles and datasets can be expanded even further by giving the scientist access to (an outline of) the data inside the associated research article, with the appropriate links to the external datasets. For this, too, there is a multitude of possibilities, as shown by SciVerse ScienceDirect applications like the PDB Protein Viewer or the PANGAEA Google Maps. The future of intense and seamless interoperability between publishers and data repositories lies wide open in front of us, and Elsevier invites all parties interested in that future to collaborate!
Scott Weidman, Thomas Arrison (eds.) (2010): Steps Toward Large-Scale Data Integration in the Sciences: Summary of a Workshop.