1 All authors: Division of Musculoskeletal Radiology, Department of Radiology, Massachusetts General Hospital, 32 Fruit St., YAW 6E, Boston, MA 02114.
Received October 30, 2003;
accepted after revision June 30, 2004.
Address correspondence to B. J. Thomas.
Abstract
MATERIALS AND METHODS. Using commercially available software with embedded Boolean logic, we created a text search algorithm to categorize reports of radiography examinations into "fracture," "normal," and "neither normal nor fracture." The algorithm was refined and optimized through repeated testing on 512 consecutive ankle radiography reports from a single clinical imaging center. The final algorithm was applied to a different set of 750 consecutive radiography reports of the spine and extremities produced at three different clinical imaging sites and interpreted by 44 different radiologists. Expert reviewers assessed the accuracy of the final classification. The chi-square test or Fisher's exact test was performed to determine the reproducibility of results across different clinical imaging sites.
RESULTS. The computerized classification was highly accurate for the classification of radiography reports into "normal" (specificity, 91.6%; sensitivity, 91.3%), "neither normal nor fracture" (sensitivity, 87.8%; specificity, 94.9%), and "fracture" (sensitivity, 94.1%; specificity, 98.1%) categories. This performance showed no significant difference across the three sites (p > 0.05).
CONCLUSION. Computerized categorization of narrative-text radiography reports is highly sensitive and specific and can be used to classify reports from different imaging sites generated by different radiologists. This method can be an extremely powerful tool in future cost-effectiveness studies, health care policy studies, operations assessments, and quality control.
Introduction
Boolean analysis of text reports is a binary-based data-mining technique that uses dependency and association rules. It enables complex search strategies using Boolean operators, such as "AND," "OR," and "NOT," which can be combined with specific words (e.g., "fracture") either in isolation or in sequence. The occurrence of words within a predetermined number of words of each other can be identified and either included or excluded. For example, the search string "no fracture"/4 identifies the ordered occurrence of the words "no" and "fracture" within four words of each other. This search will thus identify statements such as "no fracture," "no new fracture," "no visible fracture," "no acute fracture," and "no evidence of fracture."
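The ordered proximity search described above can be sketched in a few lines of Python. This is only an illustration, not the commercial software used in the study: the function names `tokenize` and `ordered_within` are hypothetical, and the reading of "within four words" as "at most four word positions later" is an assumption.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def ordered_within(text, first, second, max_gap):
    """True if `second` occurs at most `max_gap` word positions after `first`.

    Assumed semantics for a search string such as "no fracture"/4.
    """
    words = tokenize(text)
    for i, word in enumerate(words):
        if word == first and second in words[i + 1 : i + 1 + max_gap]:
            return True
    return False

# The five statements listed in the text.
phrases = [
    "no fracture",
    "no new fracture",
    "no visible fracture",
    "no acute fracture",
    "no evidence of fracture",
]
matches = [ordered_within(p, "no", "fracture", 4) for p in phrases]

# A longer negative construction falls outside a four-word window.
too_far = ordered_within(
    "no radiographic evidence of acute or displaced fracture", "no", "fracture", 4
)
```

Under these assumptions every phrase in `phrases` matches, while the longer construction does not, which is why the window size matters in the sections that follow.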
The purpose of this study is to create and validate Boolean language search strings capable of classifying radiography reports into one of three categories: normal, fracture, and "neither normal nor fracture." Although some of the clinical "rules" look only at normal and fracture outcomes [24], we have decided to include a "neither normal nor fracture" category because of our belief in the significance of other types of abnormalities.
Materials and Methods
Creating the Search Strategy
This study was limited to radiographs of the ankle obtained in 2001 and all radiographs of the spine and extremities obtained in 2003. Construction of a search algorithm was an iterative process, as we attempted to deal with idiosyncrasies of syntax and with certain classification difficulties on the boundary between normal and abnormal.
A consecutive series of 512 radiography reports from ankle examinations performed at a single health center in 2001 was used as the test set. An initial pass through the data identified fracture cases if the word "fracture" was used in the Impression field, while excluding those reports with a negative statement ("no evidence of fracture," "without evidence of fracture," and so on). For example, the search string "fracture NOT `no fracture'/8" identifies all reports that contain the word "fracture" and excludes from the selected set the reports that have the words "no" and "fracture" within eight words of each other. All reports that say "No fracture," "No acute fracture," "No evidence of fracture," "No radiographic evidence of acute or displaced fracture," and so on are therefore excluded, yielding a set of reports with positive findings for fracture.
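A minimal sketch of the "fracture NOT `no fracture'/8" string, under the same assumed proximity semantics (the negated term lies at most eight word positions after "no"), might look as follows. The helper names and the positive example report are hypothetical; the four negative statements are those quoted above.

```python
import re

def words_of(report):
    """Lowercase the report and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", report.lower())

def negated(words, negator, term, window):
    """True if `term` appears up to `window` positions after `negator`."""
    return any(
        w == negator and term in words[i + 1 : i + 1 + window]
        for i, w in enumerate(words)
    )

def is_fracture_report(report, window=8):
    """Sketch of: fracture NOT `no fracture'/8 (report-level exclusion)."""
    words = words_of(report)
    return "fracture" in words and not negated(words, "no", "fracture", window)

negative_reports = [
    "No fracture.",
    "No acute fracture.",
    "No evidence of fracture.",
    "No radiographic evidence of acute or displaced fracture.",
]
# Hypothetical positive report, not taken from the study data.
positive_report = "Oblique fracture of the distal fibula."

flags = [is_fracture_report(r) for r in negative_reports]
```

Note that the exclusion is applied to the whole report, as in the text: a single negated mention removes the report from the fracture set even if another sentence describes a fracture.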
Cases were classified as neither normal nor fracture using a list of terms indicating generic abnormalities, not necessarily specific to the ankle. These included terms ending in -osis, -otic, -itis, -itic, and so on. For example, the search string "itis NOT `no *itis'/8" detects reports that have words ending in -itis, such as arthritis, myositis, tendinitis, and bursitis, and excludes negative constructions as described earlier.
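The suffix-based search can be sketched in the same way. The following is an illustration only: the function name is hypothetical, the suffix list contains just the four endings named in the text (the study's full term list is not given), and the example reports are invented.

```python
import re

# Only the four endings named in the text; the study's full list is not reproduced here.
SUFFIXES = ("osis", "otic", "itis", "itic")

def flag_generic_abnormality(report, window=8):
    """Sketch of strings such as: itis NOT `no *itis'/8."""
    words = re.findall(r"[a-z]+", report.lower())
    hits = [i for i, w in enumerate(words) if w.endswith(SUFFIXES)]
    if not hits:
        return False
    # Exclude the report if any hit lies within `window` positions after "no".
    for i, w in enumerate(words):
        if w == "no" and any(i < j <= i + window for j in hits):
            return False
    return True

# Hypothetical example reports.
positive = flag_generic_abnormality("Findings consistent with Achilles tendinitis.")
negative = flag_generic_abnormality("No evidence of arthritis.")
unremarkable = flag_generic_abnormality("Normal alignment.")
```

A pure suffix match is deliberately crude: an ordinary word such as "diagnosis" also ends in -osis and would be flagged by this sketch, which is the kind of idiosyncrasy that iterative refinement has to handle.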
If neither fracture nor one of the "neither normal nor fracture" diagnoses applied, the case was classified as normal.
The validity of the classification was tested by two authors who subsequently read and manually classified the reports. Several problems quickly emerged from this analysis. First, certain observations did not make sense when classified in this manner. Although perhaps not normal, they were unlikely to justify the imaging examination, either because of their prevalence in the asymptomatic population (heel spurs, normal variants, osteopenia, enthesopathy, calcified vessels) or because they would have been known without imaging (soft-tissue swelling, old or healed fractures). These findings were specifically excluded from subsequent searches.
Second, because different radiologists have different dictation styles, the optimal proximity between words (for example, between "no" and "fracture," as described earlier) had to be obtained by trial and error. For example, by testing different proximities between four and 12, it was determined that permitting an eight-word separation between "no" and "fracture" gave the greatest overall accuracy. Shorter separations missed some negative constructions; longer ones included too many coincidental occurrences, and thus missed some fractures.
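The trial-and-error search over proximities can be pictured as a small sweep over candidate window sizes, scored against manually classified reports. The sketch below is an assumption-laden toy: the four labelled reports are hypothetical and were constructed so that the trade-off described above is visible; they are not study data and do not reproduce the study's result.

```python
import re

def is_fracture_report(report, window):
    """Report contains "fracture" and no "no ... fracture" within `window` words."""
    words = re.findall(r"[a-z]+", report.lower())
    if "fracture" not in words:
        return False
    return not any(
        w == "no" and "fracture" in words[i + 1 : i + 1 + window]
        for i, w in enumerate(words)
    )

# Hypothetical (report text, reviewer says fracture?) pairs.
labelled = [
    ("No fracture.", False),
    ("No definite radiographic evidence of acute or displaced fracture.", False),
    ("No joint effusion or soft tissue swelling is seen. "
     "Fracture of the lateral malleolus.", True),
    ("Transverse fracture of the fifth metatarsal base.", True),
]

def accuracy(window):
    """Fraction of labelled reports classified correctly at this window size."""
    correct = sum(is_fracture_report(text, window) == truth for text, truth in labelled)
    return correct / len(labelled)

# Try every proximity from four to 12 words, as in the text.
scores = {window: accuracy(window) for window in range(4, 13)}
best_window = max(scores, key=scores.get)
```

In this toy set, short windows fail to recognise the long negative construction, and long windows let the "no" of one sentence cancel the fracture in the next, so an intermediate window scores best.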
Third, reports that applied to more than one examination ("associated" examinations) had to be excluded because one examination might be positive and one negative. For example, a single report for a foot and an ankle examination might say "No fractures of the ankle. Fracture of the base of the fifth metatarsal present." Classification of such a report proved to be impossible by our method.
Fourth, in some instances, important findings were omitted from the Impression field. In others, no impression was given. Therefore, the search algorithm was applied to both the Impression and the Body fields of the report in a sequential manner to pick as many positive reports as possible in each category.
Fifth, certain errors of classification persisted, despite repeated modifications to the search algorithm. For example, a report that says, "No evidence of effusion. Fracture of the fifth metatarsal present" is excluded from the fracture category in our search, because the words "no" and "fracture" are within eight words of each other, even though these words are in different sentences.
Ultimately, a cascade of search strings was used. Based on the presence or absence of text in the Impression section, the ankle radiography reports were divided into reports with an impression and reports without an impression. The reports with an impression were initially categorized into "Fracture 1" and "Filtered Reports 1" using a search of the Impression field. The "Filtered Reports 1" category was then filtered further by searching the Body section of the report, which reassigned erroneously categorized fracture cases to the "Fracture 2" category and left the "Filtered Reports 2" category. Then we performed a search of the Impression field in the "Filtered Reports 2" category to obtain the "Neither Normal Nor Fracture 1" (NNNF 1) and "Filtered Reports 3" categories. A search of the Body field was done in the "Filtered Reports 3" category to pick out "Neither Normal Nor Fracture 2" (NNNF 2) cases erroneously left in that category, and the remainder formed the "Normal 1" category. For the group of reports without an impression, a sequential search was done on the Body section of the report alone, thereby generating the "Fracture 3," "Neither Normal Nor Fracture 3" (NNNF 3), and "Normal 2" categories. The results for each category were then added together. The sequence of searches is illustrated in Figure 1.
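The cascade can be sketched as a chain of early returns. This is a simplified illustration rather than the study's search strings: the two predicates are stand-ins that use only the single "no"/8 negation and the four suffixes mentioned earlier, all function names are hypothetical, and the example reports are invented.

```python
import re

def _words(text):
    return re.findall(r"[a-z]+", text.lower())

def _negated(words, hits, window=8):
    """True if any hit lies within `window` positions after the word "no"."""
    return any(
        w == "no" and any(i < j <= i + window for j in hits)
        for i, w in enumerate(words)
    )

def has_fracture(text):
    words = _words(text)
    hits = [i for i, w in enumerate(words) if w == "fracture"]
    return bool(hits) and not _negated(words, hits)

def has_other_abnormality(text):
    words = _words(text)
    hits = [i for i, w in enumerate(words)
            if w.endswith(("osis", "otic", "itis", "itic"))]
    return bool(hits) and not _negated(words, hits)

def classify(impression, body):
    """Return the cascade bucket for one report."""
    if impression.strip():
        if has_fracture(impression):
            return "Fracture 1"
        if has_fracture(body):            # fracture cases recovered from the Body
            return "Fracture 2"
        if has_other_abnormality(impression):
            return "NNNF 1"
        if has_other_abnormality(body):   # NNNF cases recovered from the Body
            return "NNNF 2"
        return "Normal 1"
    # Reports without an impression: search the Body alone.
    if has_fracture(body):
        return "Fracture 3"
    if has_other_abnormality(body):
        return "NNNF 3"
    return "Normal 2"

def final_category(bucket):
    """Merge numbered buckets, e.g. "Fracture 2" -> "Fracture"."""
    return bucket.rsplit(" ", 1)[0]

# Hypothetical reports as (impression, body) pairs.
examples = [
    ("Soft tissue swelling.", "There is a fracture of the lateral malleolus."),
    ("Osteoarthritis of the ankle.", ""),
    ("", "No fracture or dislocation."),
]
buckets = [classify(impression, body) for impression, body in examples]
```

The order of the checks encodes the priority used in the text: fracture is sought in both fields before any "neither normal nor fracture" term is considered, and "normal" is whatever remains.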
Testing the Search Strategy
The validity of the final computer classification was tested against an entirely new set of consecutive cases, consisting of radiography reports of the spine and extremities drawn from three different clinical sites: a hospital emergency department, a community-based outpatient health center, and a hospital-based walk-in department. A different panel of radiologists served each site, although there was some overlap. A consecutive series of 750 reports from the year 2003 was identified.
Each report was read by two of the authors. Their consensus classification as normal, fracture, or "neither normal nor fracture" was used as the gold standard for comparison. The final computerized classification algorithm was run for the same cases, and the results were compared with our gold standard to calculate specificity, sensitivity, positive predictive value, and negative predictive value.
To determine whether the accuracy of the search strings differed significantly among the three clinical settings, we used the chi-square test, or Fisher's exact test when the chi-square test was not valid owing to an inadequate sample size.
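As an illustration of the site comparison, the Pearson chi-square statistic for a 2 x 3 table (correct versus incorrect classifications at three sites) can be computed with the standard library alone, because for 2 degrees of freedom the upper-tail probability is exactly exp(-x/2). The per-site counts below are hypothetical (the study does not report them here), the function name is invented, and the "expected count below 5" rule used to flag when Fisher's exact test would be preferred is a common convention assumed for this sketch, not a criterion stated in the text.

```python
import math

def chi_square_three_sites(correct, incorrect):
    """Pearson chi-square test on a 2 x 3 table of classification outcomes.

    Returns (statistic, p_value, smallest_expected_count).
    """
    if len(correct) != 3 or len(incorrect) != 3:
        raise ValueError("this sketch handles exactly three sites")
    site_totals = [c + i for c, i in zip(correct, incorrect)]
    grand_total = sum(site_totals)
    statistic = 0.0
    smallest_expected = float("inf")
    for row in (correct, incorrect):
        row_total = sum(row)
        for observed, site_total in zip(row, site_totals):
            expected = row_total * site_total / grand_total
            smallest_expected = min(smallest_expected, expected)
            statistic += (observed - expected) ** 2 / expected
    # With (2 - 1) * (3 - 1) = 2 degrees of freedom, the chi-square
    # upper-tail probability is exactly exp(-x / 2).
    p_value = math.exp(-statistic / 2)
    return statistic, p_value, smallest_expected

# Hypothetical counts for three sites of 250 reports each.
statistic, p_value, smallest_expected = chi_square_three_sites(
    correct=[240, 235, 230], incorrect=[10, 15, 20]
)
# Assumed rule of thumb: prefer Fisher's exact test if any expected count is below 5.
use_fisher_instead = smallest_expected < 5
```

With these made-up counts the p value is above 0.05 and all expected counts are large enough for the chi-square approximation; a table with very few misclassifications at one site would trip the flag instead.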
Results
Similarly, the search results for "neither normal nor fracture" were compared with the gold standard and yielded 281 correctly categorized reports of a possible 320. The sensitivity of the search was 87.8% (95% CI, 83.7–91.2%), the overall specificity was 94.9%, the positive predictive value was 92.7%, and the negative predictive value was 91.3%. There was no difference in the ability to identify "neither normal nor fracture" examinations among the three sites (chi-square test, p = 0.2321).
When compared with our gold standard, the search algorithm for the fracture category identified correctly 112 of 119 fractures, resulting in a sensitivity of 94.1% (95% CI, 88.3–97.6%), an overall specificity of 98.1%, a positive predictive value of 90.3%, and a negative predictive value of 98.9%. There was no difference in the ability to identify fracture examinations among the three sites (Fisher's exact test, p = 0.3252).
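The four reported measures follow from a 2 x 2 table of computer classification against the gold standard. In the sketch below, the true-positive and false-negative counts come from the text (112 of 119 fractures; 281 of 320 "neither normal nor fracture" reports). The false-positive and true-negative counts are not stated in the text; they are inferred here on the assumption that all 750 reports were classified and that the reported positive predictive values are exact to one decimal place.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, PPV, and NPV from a 2 x 2 confusion table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def as_percent(metrics):
    """Convert proportions to percentages rounded to one decimal place."""
    return {name: round(100 * value, 1) for name, value in metrics.items()}

# tp and fn are from the text; fp and tn are inferred (see the assumptions above).
fracture = as_percent(diagnostic_metrics(tp=112, fn=7, fp=12, tn=619))
nnnf = as_percent(diagnostic_metrics(tp=281, fn=39, fp=22, tn=408))
```

Under these assumptions the inferred tables reproduce all eight reported percentages, which suggests the published figures are internally consistent.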
No significant difference was detected in the ability of the search algorithm to separate normal from abnormal (combined fracture and "neither normal nor fracture") or to separate cases of "neither normal nor fracture" and fracture across the three sites.
Discussion
Because the use and associated costs of medical imaging have increased, controversy has developed over its role in the delivery of health care services. Efforts to control costs have focused on the notion of "appropriateness," which is an outgrowth of the traditional medical concept of "indication." Various attempts have been made to limit imaging to those indications deemed appropriate by various authorities [26], but these efforts have had limited success, in part because of their subjectivity. Further, appropriate applications of technology are constantly evolving. Finally, the appropriateness of a particular examination may depend on the availability of alternative resources. For example, in certain instances, imaging may serve as a substitute for a referral to a specialist.
An alternative approach has been to limit the use of imaging to those clinical instances in which the expected yield of positive findings is above certain thresholds. Various clinical "rules" have been developed to determine when imaging should be used [27].
In theory, evaluation of the results of imaging could help to determine indications and whether the use of particular techniques by individual physicians or groups of physicians is consistent with that of their peers. Unfortunately, large-scale applications of this method are laborious, because they require ongoing evaluation of the outcomes of large numbers of imaging studies. Manual coding ("results coding") by trained personnel is rarely performed because it is time-consuming and costly [9], and coding by radiologists has been generally resisted by the radiology community [9].
A successful approach to this problem has been the use of natural language analysis [9–23] and Boolean language analysis [24, 25] to classify text reports into a structured outcome format to make the data useful for clinical research.
Such natural language analysis has been recognized as a promising research tool [5]. The challenge of this approach is to devise a search strategy that can classify results into the desired categories with a high degree of accuracy across a broad range of users and body parts. Different authors have described the use of natural language analysis for the detection of patients with suspected disease conditions such as pneumonia [9, 15–17], inhalational anthrax [18], tuberculosis [19, 20], and breast cancer [21]. This type of analysis has also been successfully used to study an emergency radiology report database [24], head trauma database [25], stroke database [22], and ventilation-perfusion lung scan report database [23].
Our results are comparable to those of previously described natural language processing systems. For example, the natural language processor described by Hripcsak et al. [13] showed a sensitivity of 81% and a specificity of 99% in coding narrative chest radiography reports. Similarly, the natural language processing system described by Zingmond et al. [14] had a sensitivity of 90% and a specificity of 82% in the classification of narrative chest radiography reports.
We believe that our method can, with further refinement and validation, be used to provide data about the effectiveness with which medical imaging is used. For instance, individual practices or groups with a high percentage of normal imaging studies may be overusing a technique. An important caveat is that a high percentage of follow-up examinations will tend to cause an overestimation of the contribution of the radiographs. For example, an orthopedic practice in which every patient has a fracture will produce an extremely small number of normal radiographs.
In summary, we have shown that an automated analysis of radiology reports of radiography examinations can classify findings into normal and significantly abnormal categories. Additional work will be required to determine whether it can be used to perform comparisons of ordering practices among physician groups.