 

D-Lib Magazine

September/October 2016
Volume 22, Number 9/10

 

Rhetorical Classification of Anchor Text for Citation Recommendation

Daniel Duma, University of Edinburgh
danielduma@gmail.com

Maria Liakata, University of Warwick
m.liakata@warwick.ac.uk

Amanda Clare, Aberystwyth University
afc@aber.ac.uk

James Ravenscroft, University of Warwick
ravenscroft@papro.org.uk

Ewan Klein, University of Edinburgh
ewan@inf.ed.ac.uk

DOI: 10.1045/september2016-duma

 

Abstract

Wouldn't it be helpful if your text editor automatically suggested papers that are contextually relevant to your work? We concern ourselves with this task: we desire to recommend contextually relevant citations to the author of a paper. A number of rhetorical annotation schemes for academic articles have been developed over the years, and it has often been suggested that they could find application in Information Retrieval scenarios such as this one. In this paper we investigate the usefulness for this task of CoreSC, a sentence-based, functional, scientific discourse annotation scheme (e.g. Hypothesis, Method, Result, etc.). We specifically apply this to anchor text, that is, the text surrounding a citation, which is an important source of data for building document representations. By annotating each sentence in every document with CoreSC and indexing them separately by sentence class, we aim to build a more useful vector-space representation of documents in our collection. Our results show consistent links between types of citing sentences and types of cited sentences in anchor text, which we argue can indeed be exploited to increase the relevance of recommendations.

Keywords: Core Scientific Concepts; CoreSC; Context Based; Citation Recommendation; Anchor Text; Incoming Link Contexts

 

1 Introduction

Scientific papers follow a formal structure, and the language of academia requires clear argumentation [9]. This has led to the creation of classification schemes for the rhetorical and argumentative structure of scientific papers, of which two of the most prominent are Argumentative Zoning [19] and Core Scientific Concepts (CoreSC, [11]). The former focusses on the relation between current and previous work, whereas the latter focusses mostly on the content of a scientific investigation. Both were among the first schemes to be paired with successful automatic classification of sentences in full scientific papers using supervised machine learning.

It has often been suggested that these rhetorical schemes could be applied in information retrieval scenarios [19], [12], [3]. Indeed, some experimental academic retrieval tools have tried applying them to different retrieval modes [18], [14], [1], and here we explore how they could support a deeper integration with the writing process.

Our aim is to make automatic citation recommendation as relevant as possible to the author's needs and to integrate it into the authoring workflow. Automatically recommending contextually relevant academic literature can help the author identify relevant previous work and find contrasting methods and results. In this work we focus on the domain of biomedical science and examine the usefulness of CoreSC for this purpose.

 

2 Previous Work

The volume of scientific literature is ever increasing, and so is the need to navigate it. This has brought much attention to the task of Context-Based Citation Recommendation (CBCR) over the last few years [6, 5, 3, 7]. The task consists of recommending relevant papers to be cited at a specific point in a draft scientific paper, and it is generally framed as an information retrieval scenario.

We need to recommend a citation for each citation placeholder: a special token inserted in the text of a draft paper where the citation should appear. In a standard IR approach, the corpus of potential papers to recommend (the document collection) is indexed using a vector-space model. Then, for each citation placeholder, a query is generated from the textual context around it (the citing context), and a similarity measure between the query and each document is applied to rank the documents in the collection. A list of documents ranked by relevance is returned in reply to the query, so as to maximise the chance of including the most useful paper to cite.

The citing sentence is the sentence in which the prospective citation must appear. It determines the function of this citation and therefore provides information that can be exploited to increase the relevance of the suggested citations.

As is common practice, we evaluate our performance at this task by attempting to recover the original citations found in papers that have already been published.

Perhaps the seminal work in this area is that of He et al. [6], who built an experimental citation recommendation system using the documents indexed by the CiteSeerX search engine as a test collection (over 450,000 documents); it was later deployed as a testable system [8]. Recently, all metrics on this task and dataset were improved by applying multi-layered neural networks [7]. Other techniques have also been applied to this task, such as collaborative filtering [2] and translation models [5], and other aspects of it have been explored, such as document representation [3] and context extraction [16].

 

2.1 Incoming link contexts

In order to make contextual suggestions as useful and relevant as possible, we argue here that we need to apply a measure of understanding to the text of the draft paper. Specifically, we hypothesize that there is a consistent relation between the type of citing sentence and the type of cited sentence.

In this paper, we specifically target incoming link contexts, also known as "anchor text" in the information retrieval literature, which is text that occurs in the vicinity of a citation to a document. Incoming link contexts (henceforth ILCs) have previously been used to generate vector-space representations of documents for the purpose of context-based citation recommendation. The idea is intuitive: a citation to a paper is accompanied by text that often summarizes a key point in the cited paper, or its contribution to the field. It has been found experimentally that there is useful information in these ILCs that is not found in the cited paper itself [17], and using them exclusively to generate a document's representation has proven superior to using the contents of the actual document [3]. Typically these contexts are treated as a single bag-of-words, often simply concatenated.

We propose a different approach here, in which we separate the text in these contexts according to the type of sentence. All sentences of the same type from all ILCs to the same document are then indexed into the same field of that document in our index, allowing us to query by the type of sentence in which the keywords appeared. Figure 1 illustrates our approach: the class of the citing sentence is the query type, and for each query type we learn a set of weights to apply when matching the extracted keywords against different types of cited sentences in ILCs.


Figure 1: A high-level illustration of our approach. The class of the citing sentence is the query type and it determines a set of weights to apply to the classes of sentences in the anchor paragraphs of links to documents in our collection. In this example, for Bac, only three classes have non-zero weights: Bac, Met and Res. We show extracts from three different citing papers, exemplifying terms matching in different classes of sentences.

Our approach is to apply existing rhetorical annotation schemes to classify sentences in citing documents and use this segmentation of the anchor text to a citation to increase the relevance of recommendations.

For the task of recommending a citation for a given span of text, the ideal resource for classifying these spans would deal with the function of a citation within its argumentative context. While specific schemes for classifying the function of a citation have been developed (e.g. [20]), we are not aware of a scheme tailored to our domain of biomedical science. Instead, we employ the CoreSC class of a citing sentence as a proxy for the function of all citations found inside it, which we have previously shown to be a reasonable approach [4]. CoreSC takes the sentence as the minimum unit of annotation, continuing the standard approach to date, which we maintain in this work.

 

3 Methodology

We label each sentence in our corpus with its CoreSC class (see Table 1), which captures its rhetorical function in the document, and we aim to find whether there is a particular link between the class of the citing sentence and the class of the cited text, that is, the classes of sentences found in ILCs.

Category: Description
Hypothesis: A statement not yet confirmed rather than a factual statement
Motivation: The reasons behind an investigation
Background: Generally accepted background knowledge and previous work
Goal: A target state of the investigation where intended discoveries are made
Object-New: An entity which is a product or main theme of the investigation
Method-New: Means by which authors seek to achieve a goal of the investigation
Method-Old: A method mentioned pertaining to previous work
Experiment: An experimental method
Model: A statement about a theoretical model or framework
Observation: The data/phenomena recorded in an investigation
Result: Factual statements about the outputs of an investigation
Conclusion: Statements inferred from observations & results relating to research hypothesis

Table 1: CoreSC classes and their description. CoreSC is a content-focussed rhetorical annotation scheme developed and tested in the biomedical domain [11, 10]. Note that in this work we treat Method-Old and Method-New as a single category.

As illustrated in Figure 2, we apply a cut-off date to separate our corpus into a large document collection and a smaller test set from which we extract our queries for evaluation. We index each document in the document collection into a Lucene index, creating a field in each document for each CoreSC class (Hypothesis, Background, Method, etc.). We collect incoming link contexts to all the documents in the document collection (the potential documents to recommend) only from other documents in the collection, excluding documents in our test set. We extract the paragraph where the incoming citation occurs as the ILC, keeping each sentence's CoreSC label. All the text from sentences of a given class, across all ILCs to a document, is indexed into that document's corresponding field. This allows us to apply different weights to the same keywords depending on the class of the ILC sentence they originally appeared in.
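As a concrete illustration of this collection step, the following Python sketch groups ILC sentence text by cited paper and CoreSC class. It is a simplified reconstruction under stated assumptions, not the released code: it presumes an in-memory corpus in which every sentence already carries a CoreSC label and the ids of the papers it cites, and all attribute names here are ours.

    from collections import defaultdict

    def collect_ilc_text(document_collection, test_ids):
        """Group ILC sentence text by (cited paper, CoreSC class).

        Returns a mapping cited_id -> CoreSC class -> list of sentence strings,
        ready to be indexed into one field per class for each cited document.
        """
        fields = defaultdict(lambda: defaultdict(list))
        for paper in document_collection:
            if paper.id in test_ids:            # ILCs come only from the document
                continue                        # collection, never from the test set
            for paragraph in paper.paragraphs:  # the ILC is the whole citing paragraph
                cited_here = {cid for sent in paragraph for cid in sent.cited_ids}
                for cited_id in cited_here:
                    for sent in paragraph:      # every sentence keeps its own label
                        fields[cited_id][sent.coresc_class].append(sent.text)
        return fields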


Figure 2: Indexing and query generation for evaluation using the same corpus. We use a cut-off year of publication to create our document collection and our test set. Each document in the collection is indexed containing only text from Incoming Link Contexts (ILCs) citing it from other documents in the document collection. Text from all sentences of the same CoreSC class from all ILCs to this document is indexed into a single Lucene field. Citations to this document from the test set are then used to generate the queries to evaluate on, where the keywords are extracted from the citing context (one sentence up, one down, including the citing sentence) and the query type is the class of the citing sentence.

 

3.1 Evaluation

In order to reduce purpose-specific annotation, we use the implicit judgements found in existing scientific publications as our ground truth. That is, we substitute all citations in the text of each paper in our test set with citation placeholders and make it our task to match each placeholder with the correct reference that was originally cited. We only consider resolvable citations, that is, citations to references that point to a paper that is in our collection, which means we have access to its metadata and full machine-readable contents.

The task then becomes, for each citation placeholder, to:

  1. extract its citing context, and from it the query terms (see Figure 2), and
  2. attempt to retrieve the original paper cited in the context from the whole document collection

We measure how well we do at this task by how far down the ranked list of retrieval results we find the originally cited paper. We use two metrics: Normalized Discounted Cumulative Gain (NDCG), a smooth discounting scheme over ranks, and top-1 accuracy, the fraction of queries for which the original paper is retrieved in the first position.
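For illustration, both metrics can be computed as follows when each query has exactly one relevant document (the originally cited paper). This is a minimal sketch with function names of our own choosing, not the evaluation code itself.

    import math

    def ndcg_single(rank):
        """NDCG for a query with exactly one relevant document.

        `rank` is the 1-based position of the originally cited paper in the
        ranked results, or None if it was not retrieved; the ideal DCG is 1.
        """
        if rank is None:
            return 0.0
        return 1.0 / math.log2(rank + 1)

    def evaluate(ranks):
        """Average NDCG and top-1 accuracy over a list of per-query ranks."""
        ndcg = sum(ndcg_single(r) for r in ranks) / len(ranks)
        top1 = sum(1 for r in ranks if r == 1) / len(ranks)
        return ndcg, top1

    # Three test citations: found at rank 1, at rank 4, and not retrieved at all.
    print(evaluate([1, 4, None]))   # -> (0.477..., 0.333...)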

 

3.2 Query extraction

For evaluation, the class of citing sentence becomes the query type, and for each type we apply a different set of per-field weights to each extracted term. We extract the context of the citation using a symmetric window of three sentences: one before the citation, the sentence containing the citation and one after. This is a frequently applied method [7] and is close to what has been assumed to be the optimal window of two sentences up, two down [13], while yielding fewer query terms and therefore allowing us more experimental freedom through faster queries.
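A minimal sketch of this extraction step is given below. It assumes the draft paper is available as a list of tokenised sentences and that the index of the citing sentence is known; the function and variable names are ours, not the paper's code.

    def extract_query_terms(sentences, cite_idx, stopwords):
        """Build query terms from a symmetric three-sentence window:
        the citing sentence plus one sentence before and one after."""
        window = sentences[max(0, cite_idx - 1): cite_idx + 2]
        terms = [tok.lower() for sent in window for tok in sent]
        # drop stopwords and non-alphanumeric tokens (punctuation, placeholders)
        return [t for t in terms if t.isalnum() and t not in stopwords]

    # Example: the citing sentence is sentences[5].
    # query_terms = extract_query_terms(sentences, 5, {"the", "a", "of", "and"})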

 

3.3 Similarity

We use the default Lucene similarity formula for assessing the similarity between a query and a document (Figure 3).


Figure 3: Default Lucene similarity formula

In this formula, the coord term is a multiplier reflecting how many of the terms in the query q are found in the document d, tf is the term frequency of term t in document d, idf(t) is the inverse document frequency of term t, and norm is a normalization factor that divides the overall score by the length of document d. Note that all these quantities are computed per field, not per document.
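Since Figure 3 is only available as an image, the classic Lucene practical scoring function (as documented for Lucene's TFIDFSimilarity, the default similarity at the time) can be written as follows; the queryNorm and per-term boost factors belong to that standard formula even though they are not discussed above:

    \mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot \sum_{t \in q}\Big(\mathrm{tf}(t,d)\cdot \mathrm{idf}(t)^{2}\cdot \mathrm{boost}(t)\cdot \mathrm{norm}(t,d)\Big)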

 

3.4 Technical implementation

We index the document collection using the Apache Lucene retrieval engine, specifically through the helpful interface provided by Elasticsearch 2.2. For each document, we create one field for each CoreSC class, and index into each field all the words from all sentences in the document that have been labelled with that class.
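As an illustration of this set-up, the sketch below creates one string field per CoreSC class and indexes the concatenated ILC sentences of each class into it, using the Elasticsearch 2.x Python client. The index name, document type and helper names are ours, not those of the released code; the class abbreviations follow Figure 4 (with Method-Old and Method-New collapsed into Met).

    from elasticsearch import Elasticsearch

    CORESC_CLASSES = ["Hyp", "Mot", "Bac", "Goa", "Obj", "Met",
                      "Exp", "Mod", "Obs", "Res", "Con"]

    es = Elasticsearch()
    es.indices.create(index="ilc_index", body={
        "mappings": {"paper": {
            # one analysed text field per CoreSC class ("string" in ES 2.x)
            "properties": {c: {"type": "string"} for c in CORESC_CLASSES}
        }}
    })

    def index_paper(doc_id, ilc_sentences):
        """ilc_sentences: list of (coresc_class, sentence_text) pairs gathered
        from all incoming link contexts to this paper."""
        grouped = {c: [] for c in CORESC_CLASSES}
        for cls, text in ilc_sentences:
            grouped[cls].append(text)
        es.index(index="ilc_index", doc_type="paper", id=doc_id,
                 body={c: " ".join(v) for c, v in grouped.items()})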

The query is formed of all the terms in the citation's context that are not in a short list of stopwords. Lucene queries take the basic form field:term, where each combination of field and term form a unique term in the query. We want to match the set of extracted terms to all fields in the document, as each field represents one class of CoreSC.

The default Lucene similarity formula (Figure 3) gives a boost to a term matching across multiple fields, which in our case would introduce spurious results. To avoid this, we employ DisjunctionMax queries, in which only the top-scoring match among a group of sub-queries is counted. Since there is one sub-query for each CoreSC class for each distinct token (e.g. Bac:"method", Goa:"method", Hyp:"method", etc.), only the one with the highest score is counted as a match.
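Continuing the indexing sketch above, a query of this shape can be expressed with Elasticsearch's dis_max query, with the per-field boosts carrying the learnt per-class weights. This is a sketch under our own naming, not the exact query structure used in the released code.

    def build_query(terms, class_weights):
        """One dis_max clause per extracted term: the term is matched against
        every CoreSC field, but only the best-scoring field contributes."""
        clauses = [{
            "dis_max": {
                "queries": [
                    {"match": {cls: {"query": term, "boost": weight}}}
                    for cls, weight in class_weights.items() if weight > 0
                ]
            }
        } for term in terms]
        return {"query": {"bool": {"should": clauses}}}

    # Example: non-zero weights learnt for Bac-type queries (cf. Figure 4, fold 1).
    weights = {"Bac": 2, "Met": 1, "Res": 1}
    results = es.search(index="ilc_index",
                        body=build_query(["mutation", "pathway"], weights))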

 

3.5 Weight training

Testing all possible weight combinations is infeasible due to the combinatorial explosion, so we adopt the greedy heuristic of trying to maximise the objective function at each step.

Our weight training algorithm can be summarized as "hill climbing with restarts". For each fold, and for each citation type, we aim to find the combination of weights over sentence classes that maximises our metric, in this case the NDCG score computed by trying to recover the original citation. We keep the queries the same in structure and term content and only change the weights applied to each field of a candidate document. Each field, as explained above, contains only the terms from sentences of one CoreSC class.

The weights are initialized at 1 and are moved by -1, 6, and -2 in sequence, going through a minimum of three iterations. Each time a weight movement is applied, it is only kept if the score increases; otherwise the previous weight value is restored.
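A minimal reconstruction of this greedy search is sketched below. It is an illustration only, not the authors' exact implementation (restarts are omitted, and the order in which moves are applied is an assumption); score_fn stands in for running the training queries of one citation type with a given set of per-field weights and returning the averaged NDCG.

    MOVES = [-1, 6, -2]   # step sizes applied in sequence, as described above

    def greedy_weight_search(classes, score_fn, min_iterations=3):
        """Greedy hill climbing over per-class field weights."""
        weights = {c: 1 for c in classes}        # all weights start at 1
        score = score_fn(weights)
        for _ in range(min_iterations):
            for delta in MOVES:
                for c in classes:
                    weights[c] += delta          # tentatively move one weight
                    new_score = score_fn(weights)
                    if new_score > score:
                        score = new_score        # keep the improving move
                    else:
                        weights[c] -= delta      # otherwise restore the value
        return weights, score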

This simple algorithm is not guaranteed to find a globally optimal combination of parameters for the very complex function we are optimizing, but it is sufficient for our current objective. We aim to apply more robust parameter tuning techniques to learning the weights in future work.

 

4 Experiments

Our corpus is formed of one million papers from the PubMed Central Open Access collection. These papers are provided in a clean, hand-authored XML format with a well-defined XML schema. For our experiments we used all papers published up to and including 2014 as our document collection (~950K documents), and selected 1,000 random papers published in or after 2015 as our test set. We treat the documents in the test set as our "draft" documents, from which we extract the citations we aim to recover together with their citation contexts. We generate the queries from these contexts, and the query type is the CoreSC class of the citing sentence. These are, to our knowledge, the largest experiments of this kind carried out on this corpus.

We need to test whether our conditional weighting of text spans based on CoreSC classification is actually reflecting some underlying truth and is not just a random effect of the dataset. To this end, we employ four-fold cross-validation, where we learn the weights for three folds and test their impact on one fold, and we report the averaged gains over each fold.

The full source code employed to run these experiments and instructions on how to replicate them are available on GitHub. The automatically annotated corpus is currently available on request, and we aim to make it publicly available shortly.

 

5 Results and Discussion

Figure 4 shows the results for the seven classes of citing sentences for which there was consistent improvement across all four folds, with a matrix of the best weight values that were found for each fold. On the right-hand side are the testing scores obtained for each fold and the percentage increase over the baseline, in which all weights are set to 1. For the remaining four classes (Experiment, Model, Motivation, Observation) the experiments failed to find consistent improvement, with wild variation across folds.

Query type  Num. queries  Fold  Weights: Bac Con Exp Goa Hyp Met Mod Mot Obj Obs Res  Scores: NDCG*  Accuracy*  % imp
Bac 1000 1 2 0 0 0 0 1 0 0 0 0 1 0.290 0.120 14.54
    2 2 1 0 0 0 1 1 0 0 0 1 0.215 0.080 36.22
    3 2 0 0 0 0 1 0 0 0 0 0 0.270 0.100 31.68
    4 2 0 0 0 0 1 0 0 0 0 1 0.209 0.068 21.07
Con 278 1 2 0 0 0 0 1 0 0 0 0 1 0.242 0.100 24.14
    2 2 0 0 0 0 1 0 0 0 0 0 0.249 0.100 27.74
    3 7 6 0 0 0 1 0 0 0 0 0 0.115 0.014 30.03
    4 7 0 0 0 0 0 0 0 0 0 0 0.149 0.072 7.78
Goa 49 1 7 1 0 1 0 1 0 0 1 0 0 0.128 0.000 113.27
    2 6 1 0 1 0 1 0 1 1 0 0 0.314 0.167 49.28
    3 7 1 0 1 0 1 0 0 1 0 0 0.252 0.083 11.50
    4 7 1 0 0 0 1 0 0 0 0 0 0.148 0.083 23.14
Hyp 91 1 7 1 0 0 0 1 1 0 0 0 1 0.182 0.087 201.92
    2 7 1 0 0 0 1 0 0 0 0 1 0.240 0.087 53.01
    3 7 1 0 0 0 0 1 0 0 0 1 0.209 0.087 36.77
    4 7 0 0 0 0 1 1 0 0 0 1 0.258 0.045 85.81
Met 893 1 1 1 0 0 0 1 0 0 0 1 0 0.328 0.138 2.92
    2 1 0 0 0 1 1 0 0 0 0 1 0.328 0.130 11.39
    3 1 0 0 0 0 1 0 0 0 0 1 0.377 0.139 9.35
    4 1 1 0 0 0 1 0 0 0 0 1 0.292 0.112 10.70
Obj 70 1 1 0 0 0 0 1 0 0 0 0 1 0.148 0.056 23.29
    2 1 0 0 0 0 1 0 0 0 0 1 0.302 0.167 29.32
    3 1 0 0 0 0 1 0 0 0 0 1 0.281 0.118 8.04
    4 1 0 0 0 0 1 0 0 0 0 1 0.157 0.059 40.53
Res 420 1 2 1 1 0 0 0 0 0 0 0 1 0.150 0.057 22.81
    2 2 1 0 0 0 1 0 0 0 1 1 0.219 0.086 36.52
    3 2 1 0 0 0 0 0 0 0 0 1 0.283 0.133 34.13
    4 7 0 0 0 1 1 1 0 0 0 1 0.263 0.105 3.44

Figure 4: Weight values for the query types (types of citing sentences) that improved across all folds. The weight values for the four folds are shown, together with test scores and improvement over the baseline. These weights apply to text indexed from sentences in ILCs to the same document, and the weight cells are shaded according to their value (darker is higher). In bold, citation types that consistently improve across folds. On the right-hand side are the scores obtained through testing and the percentage increase over the baseline, in which all weights were set to 1. (*NDCG and Accuracy (top-1) are averaged scores over all citations in the test set for that fold.)

As is to be expected, the citations are skewed towards some CoreSC classes. A majority of citations occur within sentences that were automatically labelled Background, Method and Result, no doubt due to the typical layout of article content. This yields many more Bac, Met and Res citations to evaluate on, and for this reason we set a hard limit of 1,000 citations per CoreSC class in these experiments.

A number of patterns are immediately evident from these initial results. For all query types, it is almost universally useful to know that Background or Method sentences in a document's incoming link contexts match the query terms extracted from the citation context. It is possible that this is partly an effect of there being more sentences of type Background and Method in our collection.

Similarly, it seems better to ignore other classes of sentences in the incoming link contexts of candidate papers, specifically Experiment, Hypothesis, Motivation and Observation. Also notably, Conclusion seems to be relevant only to queries of type Goal, Hypothesis and Result. Even more notably, Goal and Object seem relevant to Goal queries, and exclusively to them.

Note that when the best weight found for a class of cited sentence is zero, this does not merely mean that information from that CoreSC class is not useful; rather, it is detrimental, since eliminating it actually increased the average NDCG score. These are of course averaged results, and the weights we find are certainly not optimal for each individual test case, only better on average.

It is important to note that our evaluation pipeline necessarily consists of many steps, and encounters issues with XML conversion, matching of citations with references, matching of references in papers to references in the collection, etc., where each step in the pipeline introduces a measure of error that we have not estimated here. The one we can offer an estimate for is that of the automatic sentence classifier. The Sapienta classifier we employ here has recently been independently evaluated on a different corpus from the originally annotated corpus used to train it. It yielded 51.9% accuracy over all eleven classes, improving on the 50.4% nine-fold cross-validation accuracy over its training corpus [15].

Further to this, we judge that the consistency of the correlations we find indicates that what we see in Figure 4 is not due to random noise, but rather hints at underlying patterns in the connections between scientific articles in the corpus.

Figure 5 shows our results as a graph, with the per-class weights following from the class of citing sentence to the class of cited sentence. For this graph, we take a "majority vote" for the weights from Figure 4: if three folds agree and a fourth differs by a small value, we take this to be noise and use the majority value. If folds agree in two groups we average the values.
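A small sketch of this combination rule, with names of our own choosing (and ignoring the "differs by a small value" qualification, which we cannot make precise here):

    from collections import Counter

    def combine_fold_weights(values):
        """Combine the four per-fold weights for one cell of Figure 4:
        majority value if at least three folds agree, otherwise the mean
        (the case of two groups of two)."""
        value, count = Counter(values).most_common(1)[0]
        return value if count >= 3 else sum(values) / len(values)

    print(combine_fold_weights([2, 2, 2, 7]))   # -> 2 (majority)
    print(combine_fold_weights([2, 2, 7, 7]))   # -> 4.5 (averaged)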


Figure 5: Citation network: links between query types and classes of cited sentences. On the left, the results presented here, using CoreSC-labelled incoming link contexts. On the right, a comparison with previous work (see [4]), where we explored the link between citing sentences and the CoreSC-labelled contents of the cited document. The thickness of the lines represents the weight given to terms indexed from that class of cited sentence.

We show a side-by-side comparison of these new results with our previous results, where we indexed a document's actual contents instead of the incoming link contexts to it. We had previously proposed that there is an observable link between the class of the citing sentence and the class of sentence in the cited document [4]. Now we find the same kind of evidence for a link between the class of the citing sentence and the class of sentence within incoming link contexts, that is, inside other documents citing a given document.

There are both similarities and differences between the weights found for incoming link contexts and for document text. Background and Method are almost as universally relevant for one as for the other, and Result is equally irrelevant for citing sentences of classes Conclusion and Goal. However, we also find that whereas sentences of type Observation found inside a document's text are useful (for Background, Object and Result queries), they are not useful when found inside incoming link contexts to that document.

 

6 Conclusion and Future Work

We have presented a novel application of CoreSC discourse function classification to context-based citation recommendation, an information retrieval application. We have carried out experiments on the full PubMed Central Open Access corpus and found strong indications of a correlation between the class of a citing sentence and the classes of sentences in the Incoming Link Contexts of the cited document. We also find that these relationships are not intuitively predictable, yet consistent.

This suggests that there are gains to be reaped in a practical application of CoreSC to context-based citation recommendation. In future work we aim to evaluate this against more standard approaches, such as concatenating and indexing the anchor text and the document text together.

 

References

[1] M. Angrosh, S. Cranefield, and N. Stanger. Context identification of sentences in research articles: Towards developing intelligent tools for the research community. Natural Language Engineering, 19(04):481-515, 2013.
[2] C. Caragea, A. Silvescu, P. Mitra, and C. L. Giles. Can't see the forest for the trees?: a citation recommendation system. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 111-114. ACM, 2013.
[3] D. Duma and E. Klein. Citation resolution: A method for evaluating context-based citation recommendation systems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 358-363, Baltimore, Maryland, USA, 2014.
[4] D. Duma, M. Liakata, A. Clare, J. Ravenscroft, and E. Klein. Applying core scientific concepts to context-based citation recommendation. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference, 2016.
[5] J. He, J.-Y. Nie, Y. Lu, and W. X. Zhao. Position-aligned translation model for citation recommendation. In String Processing and Information Retrieval, pages 251-263. Springer, 2012.
[6] Q. He, J. Pei, D. Kifer, P. Mitra, and L. Giles. Context-aware citation recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 421-430. ACM, 2010. http://doi.org/10.1145/1772690.1772734
[7] W. Huang, Z. Wu, C. Liang, P. Mitra, and C. L. Giles. A neural probabilistic model for context based citation recommendation. In AAAI'15 Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[8] W. Huang, Z. Wu, P. Mitra, and C. L. Giles. Refseer: A citation recommendation system. In Digital Libraries (JCDL), 2014 IEEE/ACM Joint Conference on Digital Libraries, pages 371-374. IEEE, 2014. http://doi.org/10.1109/JCDL.2014.6970192
[9] K. Hyland. Academic discourse: English in a global context. Bloomsbury Publishing, 2009.
[10] M. Liakata, S. Saha, S. Dobnik, C. Batchelor, and D. Rebholz-Schuhmann. Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics, 28(7):991-1000, 2012. http://doi.org/10.1093/bioinformatics/bts071
[11] M. Liakata, S. Teufel, A. Siddharthan, and C. R. Batchelor. Corpora for the conceptualisation and zoning of scientific papers. In LREC, 2010.
[12] P. I. Nakov, A. S. Schwartz, and M. Hearst. Citances: Citation sentences for semantic analysis of bioscience text. In Proceedings of the SIGIR'04 workshop on Search and Discovery in Bioinformatics, pages 81-88, 2004.
[13] V. Qazvinian and D. R. Radev. Identifying non-explicit citing sentences for citation-based summarization. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 555-564. Association for Computational Linguistics, 2010.
[14] J. Ravenscroft, M. Liakata, and A. Clare. Partridge: An effective system for the automatic classification of the types of academic papers. In Research and Development in Intelligent Systems XXX, pages 351-358. Springer, 2013. http://hdl.handle.net/2160/12893
[15] J. Ravenscroft, A. Oellrich, S. Saha, and M. Liakata. Multi-label annotation in scientific articles — the multi-label cancer risk assessment corpus. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference, 2016.
[16] A. Ritchie. Citation context analysis for information retrieval. Technical report, University of Cambridge Computer Laboratory, 2009.
[17] A. Ritchie, S. Teufel, and S. Robertson. Creating a test collection for citation-based IR experiments. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 391-398. Association for Computational Linguistics, 2006. http://doi.org/10.3115/1220835.1220885
[18] U. Schäfer and U. Kasterka. Scientific authoring support: A tool to navigate in typed citation graphs. In Proceedings of the NAACL HLT 2010 workshop on computational linguistics and writing: Writing processes and authoring aids, pages 7-14. Association for Computational Linguistics, 2010.
[19] S. Teufel. Argumentative zoning: Information extraction from scientific text. PhD thesis, University of Edinburgh, 2000.
[20] S. Teufel, A. Siddharthan, and D. Tidhar. Automatic classification of citation function. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 103-110. Association for Computational Linguistics, 2006.
 

About the Authors

Daniel Duma is a PhD student at the University of Edinburgh. His work on context-based citation recommendation currently straddles Natural Language Processing and Information Retrieval.

 

Maria Liakata is Assistant Professor in Computer Science at the University of Warwick. She has a DPhil from the University of Oxford on learning pragmatic knowledge from text and her research interests include natural language processing (NLP), text mining, related social and biomedical applications, analysis of multi-modal and heterogeneous data and biological text mining. Her work has contributed to advances in knowledge discovery from corpora, automation of scientific experimentation and automatic extraction of information from the scientific literature. She has published widely both in NLP and interdisciplinary venues.

 

Amanda Clare is a lecturer in the Department of Computer Science at Aberystwyth University. Her research areas include bioinformatics, data mining and data analysis, and the representation of science.

 

James Ravenscroft is the Chief Technology Officer at Filament Ltd, a Machine Learning Consultancy Group. He has over five years of industry experience, having previously worked as an Architect for IBM's Watson division. He is also studying part-time for a PhD in Natural Language Processing at the University of Warwick, where he specialises in classification and semantic enrichment of scientific discourse.

 

Ewan Klein is Professor of Language Technology in the University of Edinburgh's School of Informatics, where his areas of research include the semantic web, text mining, and the human-data interface. He is a co-author of the O'Reilly book "Natural Language Processing with Python" (2009) and a project leader of the open source Natural Language Toolkit (NLTK).

 
 