Special Issue on Mining Scientific Publications
Petr Knoth and Zdenek Zdrahal
Digital libraries that store scientific publications are becoming increasingly important in research. They are used not only for traditional tasks such as finding and storing research outputs, but also as sources for discovering new research trends and evaluating research excellence. With the rapid growth in the number of scientific publications deposited in digital libraries, providing access to content alone is no longer sufficient. It is equally important to improve the processes by which research is carried out. Recent developments in natural language processing, information retrieval and the semantic web make it possible to transform the way we work with scientific publications. However, to improve these technologies and carry out experiments, researchers need to be able to access and use large databases of scientific publications easily.
The papers in this issue of D-Lib Magazine were presented at the 1st International Workshop on Mining Scientific Publications, held during JCDL 2012. The workshop's aim was to bring together people from different backgrounds who are interested in analysing and mining databases of scientific publications, who develop systems that enable the analysis and mining of scientific databases, and who develop novel technologies that improve the way research is being done.
The papers in this special issue deal with three themes: technical infrastructures, semantic enrichment, and the analysis of research publications.
Technical infrastructures that provide access to content are an enabling component of mining research publications. The workshop participants recognised that the existing technical infrastructures do not provide sufficient support for mining. Among the main issues are missing or restricted access to publications via APIs, an insufficient number of freely available datasets from different domains, and restricted access to full text due to technical and legal issues. The article "Specialized Research Datasets in the CiteSeerX Digital Library", written by Sumit Bhatia and his colleagues at The Pennsylvania State University, addresses technical infrastructure. It gives an overview of datasets offered by CiteSeerX that can be used to carry out selected semantic enrichment tasks, including author disambiguation and information extraction.
Four papers in the issue fall under the area of semantic enrichment, and together they address a representative set of challenges. The paper entitled "Automatic and Interactive Browsing Hierarchy Construction for Scientific Publication Collections", by Hui Yang of Georgetown University, presents a novel algorithm for the automatic construction of browsing hierarchies. Automatically extracted hierarchies are scalable and could prove to be more effective than manually constructed ones. Roman Kern of Graz University of Technology, with his colleagues from Mendeley Ltd. UK and the University of Passau, wrote the paper "TeamBeam Meta-Data Extraction from Scientific Literature", which proposes a new approach to extracting metadata from scientific papers by analysing the full text of publications. This contribution to the workshop received the Best Paper Award. "Domain-Independent Trigger Phrases for Mining Abstracts", authored by Ron Daniel of Elsevier Labs, describes a scalable method for extracting phrases that indicate semantic classes, such as hypothesis, method or goal. Finally, Marc Bertin and Iana Atanassova of the Paris Sorbonne University explore the problems that arise when citations in publications are used for different purposes. In their article "Semantic Enrichment of Scientific Publications and Metadata", they develop a method for the semantic annotation and classification of citations. This publication can be seen as a bridge between the area of semantic enrichment and the analysis of citations for measuring impact.
In the area of analysing research publications, Robert Patton and his colleagues at the Oak Ridge National Laboratory presented a paper entitled "Identification of User Facility Related Publications", describing a new approach to measuring the impact of scientific user facilities. Automatic methods for impact measurement are needed because evidence of impact may form part of the justification for funding requested by a research facility. The last two papers in this area use visual information to help people analyse and explore information stored in scientific databases. In "Extraction and Visualization of Technical Trend Information from Research Papers and Patents", Satoshi Fukuda and colleagues proposed a method for automatically creating a technical trend map from information in research papers and patents. Finally, Drahomira Herrmannova and Petr Knoth presented a novel visual search interface that supports exploring, comparing and contrasting content in research databases, and that serves as an aid for discovering implicit relationships in data. Their paper is titled "Visual Search for Supporting Content Exploration in Large Document Collections".
We believe that the papers in this special issue provide an excellent overview of the research challenges in the domain of mining research publications, and describe some state-of-the-art solutions. We hope readers will find this special issue useful.
About the Guest Editors