Ad-Hoc Classification of Electronic Clinical Documents

This material is based on work supported in part by the National Science Foundation, Library of Congress and Department of Commerce under cooperative agreement number EEC-9209623. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.

This material is also based on work supported in part by DARPA NRaD Contract Number N66001-94-D-6054.

Ad-hoc classification is an information management approach taken when a user needs to sort a large number of documents into non-standard categories. The classification is typically conducted only a limited number of times, because no long-standing information need is being addressed. Ad-hoc classification systems must be easy and efficient to use if non-technical domain-expert analysts are to adopt them.

CIIR's research into ad-hoc classification of clinical documents was initiated by membership of Harvard Pilgrim Health Plan (HPHC), a large Boston-based HMO, which has had computerized patient medical records for more than 25 years. Coded portions of the automated medical record are used extensively for health care quality improvement. However, the unstructured text portions of the record, which include both dictated and hand-written provider notes, have been relatively inaccessible, except by resource-intensive manual chart review.

HPHC brought to CIIR the following question: to what extent can automated information systems reduce the burden of manual chart review in support of quality measurement? The pilot topic for this research, aspects of which were reported at MedInfo'95 and SCAMC 1995, concerned identification of electronic medical record encounter notes documenting acute exacerbations of pediatric asthma. This was an ad-hoc classification task: the need was to sort thousands of encounter notes into two categories, asthma exacerbation or no exacerbation. Once sorted and counted, these documents would not need to be classified again. Following proof of concept, a prototype PC-based classifier application has been developed and is being implemented at HPHC as part of CIIR's technology transfer mission.

Current ad-hoc classification research is part of a Defense Advanced Research Projects Agency contract on Text Analysis and Access Techniques under the Healthcare Information Infrastructure Program. The testbed data consists of mammography reports from Naval Medical Centers. The goal of the project is to develop a prototype classifier, known as HTC, which could assist in the automated understanding and manipulation of radiology reports through analysis of the language in those reports. The types of tasks to which HTC could be applied include identifying the words radiologists use to document findings for which they recommend biopsy, ultrasound, or special mammographic views. The classifier could also create screening profiles for detection of mammogram reports with evidence of suspicious calcifications, but for which the appropriate follow-up recommendations may not have been made.

On a conceptual level, HTC classification is modeled according to the following work flow.

Processes - similar to other machine learning tasks, using advanced IR techniques:

The classification processes are based on Inquery, an advanced full text information retrieval system developed by CIIR. Given a collection of text documents, Inquery indexes the documents and enables retrieval from the collection of documents which are relevant to a user's query. Application as a classification engine rests on Inquery's Feedback functionality, which provides a means for automatically refining a query to more accurately reflect the user's interests, using a set of "good/bad" relevance judgments of documents reviewed by the user. Detailed description of the classifier operation is beyond the scope of this discussion. However, we would like to present several of the issues and system features with which we are working.

Working in the medical domain in general, and with radiology reports in particular, presents several challenges to information systems developers. The first challenge, found throughout the medical domain, is that of poor data quality. Medical data is generally not collected with an intention of extensive electronic manipulation, and clinical text data, in particular, is expected to be accessed via the coded and numeric data fields associated with it, rather than by the content of the text data itself.

In theory, every aspect of every health care activity could be documented in coded form. In practice, however, this is neither feasible nor always desirable, for a variety of reasons. In summary, coded data often serves well to answer health care questions of Who, When, Where and What, but often cannot reveal the Why of health care practice. The knowledge essential for understanding the rationale of health care decision making is usually embedded in unstructured text. Advanced information retrieval systems, rather than conventional database management systems, must be developed for this task.

In our situation, despite the fact that the five Naval Medical Centers supplying data use identical clinical information systems, and used the same utility programs to extract their data, there were widespread data quality problems, within and between sites, concerning inconsistent field names, inconsistent use of controlled vocabularies, unpredictable use of all upper case letters in some blocks of text, null fields, and duplicate records. An extensive effort was required to analyze, normalize, and structure the data in order to use it as a system development and research testbed.

In addition to preprocessing of the data to normalize the report structure, the general medical data challenge is largely addressed with the expansion of a small number of common medical abbreviations. This processing is primarily applicable to the "Reason For Exam" section of the report, which is generally a transcription of a concise hand-written note from the patient's primary physician. This expansion allows more of the text to be handled appropriately by NegExpander, below.
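
The abbreviation expansion step can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the function name `expand_abbreviations` and the contents of the `ABBREVIATIONS` table are hypothetical, since the abbreviations actually expanded by the system are not listed here.

```python
import re

# Hypothetical abbreviation table (an assumption for illustration only);
# the real list used in preprocessing is not given in this document.
ABBREVIATIONS = {
    "PLS": "PLEASE",
    "MAMMO": "MAMMOGRAM",
    "HX": "HISTORY",
    "BX": "BIOPSY",
    "W/O": "WITHOUT",
}

def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace each whole token found in the table with its expansion."""
    def replace(match):
        token = match.group(0)
        # Look tokens up case-insensitively; leave unknown tokens untouched.
        return table.get(token.upper(), token)
    # A token is a run of letters that may contain "/" (as in "W/O").
    return re.sub(r"[A-Za-z][A-Za-z/]*", replace, text)

print(expand_abbreviations("PLS EVAL MAMMO, HX OF BX W/O MASS."))
```

Expanding "W/O" to "WITHOUT" in this way would expose a negation word that a later negation-handling step could act on.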

Two further challenges are particular to radiology and similar test reports, in which expert observations are documented in words: absent findings and modifier permutations. This document type contains extensive recording of absent findings through the use of negation and conjunction. This practice is appropriate in interpretive results reporting, because it specifically documents that untoward findings were looked for and not found. There is also wide use of permutations of modifier words in standard descriptive phrases, which need to be grouped and treated in a similar fashion.

The challenge of the documentation of absent findings is addressed by a new classification feature called NegExpander. We need to represent instances of positive and negative evidence that happen to include the same key words differently within Inquery, and to expand that representation across all components of conjunctive phrases. To do this, we detect in the text the occurrences of a set of negation words (no, absent, without, etc.) and conjunctions. We use a part-of-speech tagger to identify noun phrases in the conjunctive phrases, and replace the original text with tokenized noun phrase negations.

"NO SUSPICIOUS MASSES, SUSPICIOUS CALCIFICATIONS OR SECONDARY SIGNS OF MALIGNANCY ARE SEEN."

"NO_SUSPICIOUS_MASSES, NO_SUSPICIOUS_CALCIFICATIONS OR NO_SECONDARY_SIGNS_OF_MALIGNANCY ARE SEEN."

"No suspicious masses" will not be 'confused' with "suspicious masses" in indexing, retrieval or classification.

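The negation expansion described above can be approximated in a few lines of Python. This is a minimal sketch and not the NegExpander implementation: in place of a part-of-speech tagger it assumes that a negated noun phrase ends at a comma, a conjunction, a small hypothetical list of verbs, or other punctuation, and it assumes that every negation word is normalized to a "NO_" prefix. The function name `neg_expand` and the three word lists are assumptions made for the example.

```python
import re

NEGATIONS = {"NO", "WITHOUT"}                # subset of the negation words named in the text
CONJUNCTIONS = {"OR", "AND"}                 # assumed conjunction list
TERMINATORS = {"IS", "ARE", "WAS", "WERE"}   # crude stand-in for a part-of-speech tagger

def neg_expand(sentence):
    """Rewrite negated, conjoined noun phrases as single NO_... tokens."""
    tokens = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", sentence)
    out = []          # output tokens
    phrase = []       # words of the noun phrase currently being negated
    negating = False  # True while inside the scope of a negation word

    def flush():
        # Emit the collected noun phrase as one negated token.
        if phrase:
            out.append("NO_" + "_".join(phrase))
            phrase.clear()

    for tok in tokens:
        upper = tok.upper()
        if upper in NEGATIONS:
            flush()
            negating = True           # drop the negation word itself
        elif not negating:
            out.append(tok)
        elif upper in CONJUNCTIONS or tok == ",":
            flush()                   # negation carries across the conjunction
            out.append(tok)
        elif upper in TERMINATORS or not tok.isalnum():
            flush()                   # a verb or other punctuation ends the scope
            negating = False
            out.append(tok)
        else:
            phrase.append(upper)
    flush()
    # Re-join tokens and remove the space left before punctuation.
    return re.sub(r"\s+([,.;:])", r"\1", " ".join(out))

print(neg_expand("NO SUSPICIOUS MASSES, SUSPICIOUS CALCIFICATIONS OR "
                 "SECONDARY SIGNS OF MALIGNANCY ARE SEEN."))
# NO_SUSPICIOUS_MASSES, NO_SUSPICIOUS_CALCIFICATIONS OR NO_SECONDARY_SIGNS_OF_MALIGNANCY ARE SEEN.
```

Because the sketch has no syntactic analysis, it will over-extend the negation whenever a conjunction joins clauses rather than noun phrases, which is the kind of case a tagger is meant to prevent.
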
The challenge of similar noun phrases with variable modifiers is addressed as one step (Step 2 below) of a multi-step interface feature that facilitates user creation of benchmark data. Called RelHelper, this interface presents the domain expert user with documents most likely to be relevant to their classification question and creates relevant document files (needed for training and testing the classifier) while the user is reviewing and scoring selected documents.

After scoring documents with RelHelper, a classification query profile is created by the HTC training process using Inquery's Feedback functionality, and applied to test or target documents. The output of the classification is a three-category (or bin) sort, with each document being placed in either a Positive, Uncertain, or Negative Bin. Distinctions between the Bins are based on user-selected cutoff values for the desired correct and incorrect rates of document assignment to the Bins. For example, the user may seek a True Positive assignment rate of 95% (tolerating up to 5% False Positives) and a True Negative rate of 90%.

During classification, documents whose likelihood of relevance is greater than the positive cutoff are assigned to the Positive Bin, and those with likelihood less than the negative cutoff are assigned to the Negative Bin. Documents whose likelihood of relevance falls in-between the cutoffs are assigned to the Uncertain Bin. Note that the document ranking provided by Inquery, which we have called a "likelihood of relevance", is actually a belief value generated according to the system's internal calculus, not a probability. The belief values corresponding to the user's desired cutoff values for the Bin sort are determined using logistic regression parameters from the benchmark relevance scores of the training collection.
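
The relationship between cutoffs, belief values and Bins can be sketched as follows. This is one plausible reading, not the HTC code: it assumes a fitted logistic model of the form p = 1 / (1 + exp(-(intercept + slope * belief))) with a positive slope, and it assumes that the user's target rates are applied as pointwise probabilities of relevance. The function names and the regression parameters are hypothetical values chosen only to make the example run.

```python
import math

def belief_cutoff(target_probability, intercept, slope):
    """Invert p = 1 / (1 + exp(-(intercept + slope * belief))) for belief."""
    logit = math.log(target_probability / (1.0 - target_probability))
    return (logit - intercept) / slope

def assign_bin(belief, positive_cutoff, negative_cutoff):
    """Three-way sort of one document by its belief value."""
    if belief > positive_cutoff:
        return "Positive"
    if belief < negative_cutoff:
        return "Negative"
    return "Uncertain"

# Hypothetical logistic regression parameters (assumed, not from the paper).
INTERCEPT, SLOPE = -20.0, 40.0

# 90% correct assignment for both Bins: relevance probability of at least 0.90
# for the Positive Bin, and at most 0.10 for the Negative Bin.
positive_cutoff = belief_cutoff(0.90, INTERCEPT, SLOPE)
negative_cutoff = belief_cutoff(0.10, INTERCEPT, SLOPE)

for belief in (0.60, 0.50, 0.40):
    print(belief, assign_bin(belief, positive_cutoff, negative_cutoff))
```

Raising either target rate moves the two cutoffs further apart, so more documents fall into the Uncertain Bin.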

When testing a classification profile, HTC uses the relevance scores of the test collection to assess the performance of the profile. It is often the case that the first application of HTC generates a classification profile which does not meet the desired target cutoffs. In this case, the user may review and refine the profile by directly editing the profile, by increasing the number of training documents, by modifying the type and number of words used by Feedback, or by modifying one or more Feedback system parameters. The HTC interface facilitates each of these approaches.

The classification question reported in this paper concerns suspicious calcifications and may be described as follows: Classify screening mammogram reports that include findings of calcification according to whether the radiologist recommends continued routine screening appropriate for the patient's age versus recommendations for more urgent and diagnostic radiographic or surgical procedures. Reports in which no specific recommendations are made are excluded from these experiments.

This pilot question represents a prime potential application of HTC as an automated quality assurance monitor. HTC can create classification profiles designed to detect specific evidence of suspicious conditions in mammograms. Vast numbers of mammograms can be automatically classified for risk of these conditions according to user-defined confidence levels. For high risk cases, coded data, such as diagnosis and procedure codes, could be accessed in other institutional information systems and reviewed for occurrences of codes signifying appropriate follow-up for the suspected conditions. Cases without evidence of appropriate follow-up would then be manually reviewed.

We have conducted a number of preliminary experiments to evaluate the performance of HTC. The Calcification Question training collection consists of 82 relevant and 65 irrelevant documents, while the test collection has 76 relevant and 24 irrelevant documents. We report our results for two experiments in terms of the IR metrics of precision and recall for the Positive and Negative Bins.

Applied to this classification research, in each Bin, precision is the ratio of documents correctly classified into that Bin (i.e., relevant documents into the Positive Bin and irrelevant documents into the Negative Bin) compared to the total number of documents classified into that Bin. Recall is the ratio of documents classified into their correct Bin compared to the number of that category of document in the total collection.
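
These per-Bin definitions can be written out directly. The sketch below is illustrative only: the function name `bin_metrics` and the ten toy assignments are invented for the example and are not results from these experiments.

```python
from collections import Counter

def bin_metrics(assignments):
    """Compute per-Bin precision and recall.

    assignments: iterable of (bin_name, is_relevant) pairs, where bin_name is
    "Positive", "Uncertain" or "Negative" and is_relevant is the benchmark score.
    """
    counts = Counter(assignments)

    def ratio(numerator, denominator):
        # Undefined ratios (empty Bin or empty category) are reported as None.
        return numerator / denominator if denominator else None

    pos_correct = counts[("Positive", True)]    # relevant documents in the Positive Bin
    neg_correct = counts[("Negative", False)]   # irrelevant documents in the Negative Bin
    pos_total = pos_correct + counts[("Positive", False)]
    neg_total = neg_correct + counts[("Negative", True)]
    relevant_total = sum(n for (_, rel), n in counts.items() if rel)
    irrelevant_total = sum(n for (_, rel), n in counts.items() if not rel)

    return {
        "positive_precision": ratio(pos_correct, pos_total),
        "positive_recall": ratio(pos_correct, relevant_total),
        "negative_precision": ratio(neg_correct, neg_total),
        "negative_recall": ratio(neg_correct, irrelevant_total),
    }

# Toy data (invented): 6 relevant and 4 irrelevant documents.
toy = ([("Positive", True)] * 4 + [("Positive", False)]
       + [("Negative", False)] * 2 + [("Negative", True)]
       + [("Uncertain", True), ("Uncertain", False)])
print(bin_metrics(toy))
```

Note that documents left in the Uncertain Bin lower recall for both Bins without affecting either precision figure.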

We expected Inquery's Feedback module to generate a fairly good classification query profile, with which HTC could rank most of the relevant test documents high and most of the irrelevant test documents low. We could not expect a perfect classification separation, with all the relevant documents ranked before all irrelevant documents, although this is the goal of iterative refinement of the classifier. Feedback in HTC has fifteen operational parameters whose values may be set by the user. We explored many of the combinations of these settings in this preliminary research. In the following experiments the cutoff values are set at 90% correct assignment for both the Positive and Negative Bins.

The first experiment concerns the types of linguistic features to be included in the classification profile for best performance. Linguistic features are the structural units of language and are made up of one or more individual terms. A term is a string of characters, separated from another string by blank spaces or certain punctuation. The most familiar terms are single English words and numbers, such as "MAMMOGRAM" and "1997". More unusual single terms found in this research include ad-hoc abbreviations such as "PLS" and "MAMMO".

Feedback can identify and process features which are either single terms (words, abbreviations, numbers, codes, etc.) or more complex features, called co-occurring terms, which are pairs of single terms. Co-occurring terms can be either ordered or unordered, and with a variable number of other terms between them, measured as their proximity to each other.

As an example of a co-occurring term feature, consider the pair of words "BENIGN CALCIFICATION". If Feedback is set to evaluate features which are unordered pairs with proximity 2, it will identify as an instance of the co-occurring term "BENIGN CALCIFICATION" any occurrence of the two words, in either order, within that proximity of each other.
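
Unordered proximity matching of this kind can be sketched as follows. This is an illustration only, not Inquery's Feedback code: the function name `cooccurrences` is hypothetical, tokens are compared exactly with no stemming, and "proximity 2" is assumed to mean that the two terms lie at most two token positions apart (that is, with at most one other term between them). The sample phrases are invented for the example.

```python
import re

def cooccurrences(text, first, second, proximity, ordered=False):
    """Return (i, j) token positions where `first` and `second` co-occur.

    Assumption: two terms co-occur when their positions differ by at most
    `proximity`. With ordered=True, `first` must precede `second`.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text.upper())
    first, second = first.upper(), second.upper()
    matches = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        # Only positions inside the proximity window around i can match.
        start = max(0, i - proximity)
        stop = min(len(tokens), i + proximity + 1)
        for j in range(start, stop):
            if j == i or tokens[j] != second:
                continue
            if ordered and j < i:
                continue
            matches.append((i, j))
    return matches

# Invented sample phrases; the last one is too far apart to match.
for phrase in ("BENIGN CALCIFICATION",
               "CALCIFICATION, BENIGN",
               "BENIGN APPEARING CALCIFICATION",
               "BENIGN FINDINGS WITHOUT CALCIFICATION"):
    print(phrase, "->", bool(cooccurrences(phrase, "BENIGN", "CALCIFICATION", 2)))
```

Setting `ordered=True` in the same call would reject the second phrase, which illustrates how ordered pairs are the more restrictive feature type.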

Table 1 presents the results of the first experiment: the ratio of single terms to co-occurring terms, the number of retrieved documents, and the precision and recall of the Positive Bin and the Negative Bin. We have fixed the total number of features in the classification query profile at 70 for these experiments.

Table 1. Precision and Recall by Ratio of Single to Co-occurring Terms

ratio of single/co-occurring terms	70/0	60/10	50/20	40/30	35/35	30/40	20/50	10/60	0/70
number of retrieved documents	100	100	100	100	100	97	93	99	100
precision of Positive Bin	98%	98%	97%	97%	97%	96%	95%	97%	96%
recall of Positive Bin	84%	84%	89%	90%	92%	94%	90%	85%	80%
precision of Negative Bin	76%	78%	73%	75%	86%	88%	78%	90%	85%
recall of Negative Bin	95%	91%	91%	91%	83%	76%	64%	78%	70%

In these experiments, an almost equal mix of single terms and co-occurring terms provides the best overall result. Precision in the Positive Bin remains nearly constant, between 95% and 98%, across all ratios of features. Recall in the Positive Bin is greatest near an equal mix of feature types, peaking at 94% for the 30/40 ratio. Negative Bin precision generally improves with increasing co-occurring features; however, recall in the Negative Bin falls significantly.

For the radiology reports used in this study, evidence for document relevance, reflected in the Positive Bin, comes about equally from single and co-occurring terms, and a combination of both improves performance.

Evidence for document irrelevance, manifest in the Negative Bin, is more complex and has to be understood as a trade-off, frequently seen in IR, of precision for recall. For profiles constructed exclusively of single-term features, the Negative Bin contains a low proportion of truly irrelevant documents compared to relevant ones (the low precision in the Negative Bin). However, the overall recall in the Negative Bin is high because, along with the incorrectly classified relevant documents, almost all of the irrelevant documents have been classified to this Bin.

Shifting the feature type proportions from single to co-occurring terms increases the precision in the Negative Bin, meaning that a greater proportion of the documents classified to that Bin are correctly sorted irrelevant documents. However, the co-occurring term features appear to be a more restrictive form of evidence, and overall, fewer documents, relevant or irrelevant, are being sorted to the Negative Bin. As more co-occurring terms replace single terms, increasing numbers of irrelevant documents are not being classified to the Negative Bin, and recall in that Bin drops.

The second experiment concerns the number of features in the classification query profile for best performance. Table 2 presents the number of features (in this experiment, using single terms only), the number of retrieved documents, and the precision and recall of the Positive Bin and the Negative Bin.

Table 2. Precision and Recall by Number of Classification Features (Single Terms)

number of features	15	25	35	45	55	65	75
number of retrieved documents	87	94	100	100	100	100	100
precision of Positive Bin	92%	93%	97%	98%	98%	98%	98%
recall of Positive Bin	89%	90%	89%	89%	82%	89%	84%
precision of Negative Bin	77%	64%	75%	74%	76%	79%	76%
recall of Negative Bin	53%	61%	91%	95%	95%	95%	95%

Initially, increasing the number of features increased the number of documents retrieved and the precision of the Positive Bin. Thirty-five features led to retrieval of all one hundred documents in the test collection. Neither precision nor recall improved with the inclusion of more than sixty-five features because, for the size of this training collection, all the significant relevant and irrelevant features had been added to the query profile. If the user is most concerned with the Positive Bin, forty-five features show the best result; otherwise, inclusion of sixty-five features shows the best overall performance.

These preliminary experiments were conducted to evaluate the performance of HTC as an ad-hoc classifier of electronic clinical documents. Using a limited number of the operational parameters available, and a single classification topic and testbed, we find that HTC performs well. Manipulation of independent variables produces understandable effects in the dependent variables, which serves as proof of concept of the prototype application. HTC is easy to use through the complex processes of creating training and testing data collections, and of creating and applying a classification profile.

There are five areas in which we would like to continue development of ad-hoc classification systems for clinical data. The first concerns making use of the internal structure of clinical documents. As evident in the original mammogram report, radiology reports contain many data fields. Although in our experience these fields are used in highly unpredictable ways, they may be manipulated to be more useful for automated classification tasks.

We have built two subfields, <REASON FOR EXAM> and <IMPRESSION>, into the unstructured text portions of our data in creating our normalized mammogram report, in the anticipation that we may want to differentially consider the evidence they contain. We have yet to explore this, although parallel work at CIIR in automated ICD-9-CM code assignment to hospital discharge summaries has taken advantage of building internal structure into those clinical documents.

The second area for further development concerns increasing the degree of natural language understanding used by the classifier. Our approach of detection and expansion of negative evidence across conjunctive phrases has shown promising results. However, we have incorporated only a small number of negation variants and no analytic logic to their application. We have frequent incorrect actions from NegExpander due to the word 'NOT', which we do not currently normalize, and due to the use of prepositional phrases in the conjunctive phrases.

Third, we would like to further explore the many classification parameters available with Inquery's Feedback module. We need to compare a variety of feature weighting methodologies, study the learning curves for variably sized training sets and further analyze the effect of combinations of feature types. In particular, we have had some success modifying the parameters governing the relative importance of positive and negative evidence within individual documents, and will continue working with this variable.

Fourth, we would like to upgrade the HTC interface to allow the user to inspect the contribution of every feature to a retrieved document. The user can then decide if particular features are skewing the classification, and can reweight the features in the query profile.

Finally, we will expand our research topics to other mammography questions, and to other testbeds, including automated medical record encounter notes and hospital discharge summaries.

Copyright © 1997 Center for Intelligent Information Retrieval