D-Lib Magazine

July/August 2012
Volume 18, Number 7/8

Guest Editorial

Special Issue on Mining Scientific Publications

Petr Knoth and Zdenek Zdrahal
KMi, The Open University
{z.zdrahal, p.knoth}@open.ac.uk

Andreas Juffinger
The European Library
andreas.juffinger@kb.nl

doi:10.1045/july2012-guest_editorial

Digital libraries that store scientific publications are becoming increasingly important in research. They are used not only for traditional tasks such as finding and storing research outputs, but also as sources for discovering new research trends and evaluating research excellence. With the number of scientific publications deposited in digital libraries growing rapidly, providing access to content alone is no longer sufficient. It is equally important to improve the processes by which research is carried out. Recent developments in natural language processing, information retrieval and the semantic web make it possible to transform the way we work with scientific publications. However, in order to improve these technologies and carry out experiments, researchers need to be able to easily access and use large databases of scientific publications.

The papers in this issue of D-Lib Magazine were presented at the 1st International Workshop on Mining Scientific Publications, held during JCDL 2012. The workshop's aim was to bring together people from different backgrounds who are interested in analysing and mining databases of scientific publications, who develop systems that enable the analysis and mining of scientific databases, and who develop novel technologies that improve the way research is being done.

The papers in this special issue deal with the following three themes:

Infrastructures, systems, datasets or APIs that enable analysis of large volumes of scientific publications;
Semantic enrichment of scientific publications by means of text-mining, crowdsourcing or other methods;
Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines and research excellence, and to aid content exploration.

Technical infrastructures that provide access to content are an enabling component of mining research publications. The workshop participants recognised that the existing technical infrastructures do not provide sufficient support for mining. Among the main issues are missing or restricted access to publications via APIs, the insufficient number of freely available datasets from different domains, and restricted access to full-text due to technical and legal issues. The article Specialized Research Datasets in the CiteSeerX Digital Library, written by Sumit Bhatia and his colleagues at The Pennsylvania State University, addresses technical infrastructure. It provides an overview of datasets provided by CiteSeerX that can be used to carry out selected semantic enrichment tasks including author disambiguation and information extraction.

Four papers in the issue fall under the area of semantic enrichment, and they address a representative set of challenges. The paper entitled Automatic and Interactive Browsing Hierarchy Construction for Scientific Publication Collections, by Hui Yang of Georgetown University, presents a novel algorithm for automatic construction of browsing hierarchies. Automatically extracted hierarchies are scalable and could prove to be more effective than manually constructed ones. Roman Kern of Graz University of Technology, with his colleagues from Mendeley Ltd. UK and the University of Passau, wrote the paper TeamBeam — Meta-Data Extraction from Scientific Literature, which proposes a new approach for extracting metadata from scientific papers by analysing the full text of publications. This contribution to the workshop received the Best Paper Award. Domain-Independent Trigger Phrases for Mining Abstracts, authored by Ron Daniel of Elsevier Labs, describes a scalable method for extracting phrases that indicate semantic classes, such as hypothesis, method or goal. Finally, Marc Bertin and Iana Atanassova of the Paris Sorbonne University explore the problems that arise when citations in publications are used for different purposes. In their article Semantic Enrichment of Scientific Publications and Metadata, they develop a method for semantic annotation and classification of citations. In principle, this publication can be seen as a bridge between the area of semantic enrichment and the analysis of citations for measuring impact.

In the area of analysing research publications, Robert Patton and his colleagues at the Oak Ridge National Laboratory presented a paper entitled Identification of User Facility Related Publications, describing a new approach to measuring the impact of scientific user facilities. Automatic methods for impact measurement are needed because such measurements may form part of the justification for funding requested by a research facility. The last two papers in this area utilise visual information to help people analyse and explore information stored in scientific databases. In Extraction and Visualization of Technical Trend Information from Research Papers and Patents, Satoshi Fukuda and colleagues proposed a method for automatically creating a technical trend map from information in research papers and patents. Finally, Drahomira Herrmannova and Petr Knoth presented a novel visual search interface for exploring, comparing and contrasting content in research databases and for aiding the discovery of implicit relationships in data. Their paper is titled Visual Search for Supporting Content Exploration in Large Document Collections.

We believe that the papers in this special issue provide an excellent overview of the research challenges, and describe some state-of-the-art solutions, in the domain of mining research publications. We hope readers will find this special issue useful.

About the Guest Editors

Petr Knoth is a Research Associate in the Knowledge Media Institute, The Open University, focusing on various topics in natural language processing and information retrieval. He has been involved in four European Commission funded projects (KiWi, Eurogene, Tech-IT-EASY and DECIPHER) and four JISC funded projects (CORE, ServiceCORE, DiggiCORE and RETAIN) and has a number of publications at international conferences based on this work. Petr received his master's degree from the Brno University of Technology.

Zdenek Zdrahal is a Senior Research Fellow at the Knowledge Media Institute of The Open University and Associate Professor at the Faculty of Electrical Engineering, Czech Technical University. He has been a project leader and principal investigator in a number of research projects in the UK, Czech Republic, and Mexico. His research interests include knowledge modelling and management, reasoning, KBS in engineering design, and Web technology. He is an Associate Editor of IEEE Transactions on Systems, Man and Cybernetics.

Andreas Juffinger is Technical and Operations Manager at The European Library. He graduated from the Technical University Graz with a specialization in Web 2.0 Crawling, Information Retrieval and Machine Learning (Statistical Learning Theory and Reinforcement Learning). He has more than 10 years of professional work experience as a consultant, enterprise Java developer and system architect. He is an expert in Oracle Databases, Enterprise Java Applications, and Java Web Applications. Since 2003 he has worked on several European and nationally funded projects as technical and project manager. He has been a key researcher at the Know-Center, Austria's Competence Center for Knowledge Management, for graphs, graph mining and complex network analysis. His research work focuses on graph pattern analysis in large graph databases and complex networks. During his research career he has published more than 30 scientific articles in journals and proceedings of high-level conferences, such as NIPS, WWW, JCDL, ECDL, Hypertext and TPDL.