Original Research
1 All authors: Department of Radiology, Massachusetts General Hospital, 25 New Chardon St., Ste. 400E, Boston, MA 02114.
Received December 6, 2007;
accepted after revision February 4, 2008.
K. J. Dreyer and T. J. Schultz receive royalties from patent licensing of
the Leximer natural language processing engine to Nuance, which is the
commercial vendor of the product. The other authors have no financial
disclosure to make and had complete and independent access to the study data
and the manuscript.
Abstract
MATERIALS AND METHODS. This study was performed on a radiology reports database covering the years 1995–2004. From this database, 120 reports with and without recommendations were selected and randomized. Two radiologists independently classified these reports according to presence of recommendations, time frame, and imaging technique suggested for follow-up or repeated examinations. The natural language processing program then was used to classify the reports according to the same criteria used by the radiologists. The accuracy of classification of recommendation features was determined. The program then was used to determine the patterns of recommendation features for different patients and imaging features in the entire database of 4,211,503 reports.
RESULTS. The natural language processing program had an accuracy of 93.2% (82/88) for identifying the imaging technique recommended by the radiologists for further evaluation. Categorization of recommended time frames in the reports with the 88 recommendations obtained with the program resulted in 83 (94.3%) accurate classifications and five (5.7%) inaccurate classifications. Recommendations of CT were most common (27.9%, 105,076 of 376,918 reports) followed by those for MRI (17.8%). In most (85.4%, 322,074/376,918) of the reports with imaging recommendations, however, radiologists did not specify the time frame.
CONCLUSION. Accurate determination of recommended imaging techniques and time frames in a large database of radiology reports is possible with a natural language processing program. Most imaging recommendations are for high-cost but more accurate radiologic studies.
Keywords: radiology practice, recommendations, recommended imaging techniques
Analysis of recommendations by means of manual auditing of a large number of radiology reports would be time consuming and therefore most likely impractical. In this respect, results of a study [6] have validated the use of a natural language processing engine (Leximer, Nuance) for classification of electronic structured and unstructured radiology reports on the basis of the presence or absence of recommendations. In the version used in that study [6], the program was trained merely to determine the presence or absence of recommendations. The purposes of this study were to validate the natural language processing program for extraction of recommendation features, such as time frame and imaging technique, from electronic radiology reports and to assess patterns of recommendation features in a large database of radiology reports.
Validation Study
Results of a previous study [6] validated the natural
language processing engine Leximer (Nuance) for categorization of radiology
reports according to the presence of recommendations. Recommendations in
radiology reports were defined as recommendations, requests, or suggestions
for any further actions, such as imaging, clinical correlation, and surgical
or pathologic correlation, in a specified or unspecified time frame.
In this validation study, the Leximer program was used to first categorize radiology reports for the presence or absence of recommendations in a database comprising 4,279,179 electronic radiology reports made from 1995 to 2004. Of these reports, 67,676 had incomplete or no text and therefore could not be processed with the program. These reports were excluded from data analysis. Thus a total of 4,211,503 reports were analyzed. This database comprised reports of all imaging techniques, including CT, MRI, radiography, fluoroscopy, nuclear medicine, sonography, angiography, special procedures, and unspecified imaging examinations.
From the database, one of the investigators selected 88 consecutive radiology reports with recommendations and 32 consecutive reports without recommendations from the year 2005. These 120 reports covered all of the aforementioned imaging techniques and were interpreted by 42 radiologists at our institution. Reports with and those without recommendations were randomized for evaluation by radiologists and with the natural language processing engine.
Report Analysis by Radiologists
To validate the accuracy of the natural language processing engine for
classifying the recommendation features, two radiologists with 11 and 7 years
of experience independently analyzed the 120 radiology reports. Each
radiologist classified the reports into those with and those without
recommendations. In addition, reports also were categorized according to
recommended imaging technique and time frame. These two radiologists were not
involved in the training of the natural language processing program, and they
were not aware of the clinical records, previous radiology reports, or results
of classification with the program.
Report Analysis with Natural Language Processing Engine
The natural language processing program was run to analyze the unstructured
radiology reports by reducing the entropy or noise (data without much
diagnostic value) and preserving the outcome or signal (data with meaning or
intent) through use of natural language processing principles [6]. The
program separated specified signals or outcomes, such as recommendations, from
other content through phrase-level extraction, text parsing (breaking text
into smaller parts with punctuation-based phrase isolation through use of an
internally developed parser), and syntactic algorithms (created to group
phrases). The
natural language processing principles are described in the report of the
validation study [6].
In this study, the natural language processing engine was further trained to identify recommendation features such as recommended imaging technique (for example, CT, MRI) and time frame by further integrating the resulting signal phrases with syntactic extraction algorithms [6]. If the recommendation was "perform a CT in 3 months to assess for stability," then the recommended time frame was defined as 90 days, and the recommended imaging technique was defined as CT.
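The worked example above (a recommendation of CT in 3 months becoming CT at 90 days) can be illustrated with a minimal Python sketch. This is not the Leximer implementation, which the paper does not publish; the lookup tables, the regular expression, and the function name `extract_features` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical lookup tables; the actual Leximer rules are not published.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}
TECHNIQUE_TERMS = {"ct": "CT", "mri": "MR", "mr": "MR", "ultrasound": "Sonography"}

# A number followed directly by a time unit, e.g. "3 months" or "6-week".
TIME_PATTERN = re.compile(r"\b(\d+)\s*-?\s*(day|week|month|year)s?\b", re.IGNORECASE)


def extract_features(sentence):
    """Return (technique, days) for one recommendation sentence.

    Either element is None when the sentence does not state it.
    """
    technique = None
    for word in re.findall(r"[A-Za-z]+", sentence):
        if word.lower() in TECHNIQUE_TERMS:
            technique = TECHNIQUE_TERMS[word.lower()]
            break  # the first technique mentioned wins (a simplification)

    match = TIME_PATTERN.search(sentence)
    days = None
    if match:
        days = int(match.group(1)) * DAYS_PER_UNIT[match.group(2).lower()]
    return technique, days


print(extract_features("perform a CT in 3 months to assess for stability"))
# ('CT', 90)
```

Taking the first technique mentioned is a deliberate simplification; a sentence that names more than one technique would need additional rules.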
The natural language processing program was run to identify reports with recommendations and isolated sentences that had high signal for the recommendation concepts. These sentences were used to generate another statistical histogram for obtaining terms describing recommendations. Terms such as "recommend," "suggest," "follow up," "CT," "MR," "one year," and "6 weeks" obtained by the histogram as strong signals for the recommendation features were checked and verified by radiologists. This final list of terms was used for forming decision trees.
Separate rules were defined for this version of the program to extract specific recommendation features from the syntax of identified sentences. This classification was then validated with multiple iterations to achieve higher accuracy. One of the authors, who had not participated in the validation study, used approximately 2,000 radiology reports from examinations with different imaging techniques and 70 decision tree–optimizing iterations to train the program for extraction of recommendation features. These training sets of reports were selected to include all imaging techniques and body regions and were distinct from the set of reports used in the validation study.
With a computer program written in the C# programming language, a report was obtained from a text file and broken down into its component elements or phrases (text parsing for phrase-level extraction), and these phrases were processed for signal extraction with the decision trees. Terms that matched those in the algorithms of the decision trees increased or decreased the likelihood that the phrase represented a recommendation signal. These algorithms were used to find matches between the phrases parsed with the natural language processing program (raw concepts) and leaf nodes in the decision tree. For example, if a report stated "MRI is recommended," "MRI" was the raw concept in the report that matched the leaf node "MRI" in the decision tree, which then rolled up to "MRI" (parent node), "Magnetic Resonance," and "imaging modality" (root node), in that order.
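The parse, match, and roll-up sequence described above can be sketched in Python. The authors' program was written in C# and its tree is not published, so the miniature tree `PARENT` and the helper names below are hypothetical stand-ins rather than the actual implementation.

```python
import re

# Hypothetical miniature concept tree stored as child -> parent.
PARENT = {
    "MRI": "MR",
    "MR angiography": "MR",
    "MR": "imaging modality",
    "CT": "imaging modality",
}
# Leaf nodes are the nodes that are nobody's parent.
LEAVES = [node for node in PARENT if node not in set(PARENT.values())]


def split_phrases(report_text):
    """Punctuation-based phrase isolation."""
    return [p.strip() for p in re.split(r"[.;,:]", report_text) if p.strip()]


def match_leaves(phrase):
    """Return the leaf terms that occur as whole words in the phrase."""
    return [leaf for leaf in LEAVES
            if re.search(r"\b" + re.escape(leaf) + r"\b", phrase)]


def roll_up(leaf):
    """Return the path from a leaf node up to the root node."""
    path = [leaf]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


for phrase in split_phrases("No acute findings. MRI is recommended; follow up in 6 weeks."):
    for leaf in match_leaves(phrase):
        print(phrase, "->", roll_up(leaf))
# MRI is recommended -> ['MRI', 'MR', 'imaging modality']
```

Storing the tree as a child-to-parent map keeps the upward traversal to a single loop, which is the only direction the description above requires.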
The tree is structured in such a way that generalized concepts such as "imaging modality" are at a higher level, "MR" is a middle-level concept, and specific concepts such as "MRI," "MR angiography," and "MR spectroscopy" are at a lower level, each level representing a branch. The tree is seven layers deep and has a base of 170 discrete nodes. The stemming algorithms increase the leaf-node hits within the radiology reports to more than 1,000. For example, terms such as "MR angiogram," "MR angiography," and "MR angiographic" may not match a specific leaf node but are mapped to "MR angio" with stemming algorithms. These stemming algorithms determine the hits at each leaf node while traversing upward and stopping at the root node.
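As an illustration of how stemming can map several surface forms to one leaf node and then count hits while traversing upward, the following Python sketch uses a hypothetical suffix list and a hypothetical three-node branch. The paper does not describe the stemming rules actually used, so every name here is an assumption.

```python
from collections import Counter

# Hypothetical suffixes and tree fragment, for illustration only.
SUFFIXES = ("graphy", "graphic", "gram")
PARENT = {"MR angio": "MR", "MR": "imaging modality"}
KNOWN_NODES = set(PARENT) | set(PARENT.values())


def stem_term(term):
    """Strip a known suffix from the last word of a term."""
    words = term.split()
    last = words[-1].lower()
    for suffix in SUFFIXES:
        if last.endswith(suffix) and len(last) > len(suffix):
            words[-1] = last[: -len(suffix)]
            break
    return " ".join(words)


def count_hits(terms):
    """Count one hit at the matched leaf and at every ancestor up to the root."""
    hits = Counter()
    for term in terms:
        node = stem_term(term)
        while node in KNOWN_NODES:
            hits[node] += 1
            node = PARENT.get(node)  # None once the root has been counted
    return hits


hits = count_hits(["MR angiogram", "MR angiography", "MR angiographic"])
print(hits["MR angio"], hits["MR"], hits["imaging modality"])
# 3 3 3
```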
Decision tree logic was used to normalize these terms to standard ontologies for statistical analysis. For example, all time frames recommended were converted to days. Different concepts used for the same imaging technique were grouped together. For example, MRI, MR, magnetic resonance imaging, and MR imaging all were grouped under MR.
The radiology reports selected for the validation study were assessed with the natural language processing program first for classification into two categories, reports with and those without recommendations. The reports with recommendations then were classified into those containing imaging recommendations and those containing nonimaging recommendations. The recommended imaging techniques extracted included CT, MRI, radiography, sonography, fluoroscopy, nuclear medicine, mammography, angiography, special procedures (such as myelography, ERCP, imaging-guided biopsy, other imaging procedures, and arthrography), and other, unspecified imaging techniques. Nonimaging recommendations included unspecified nonimaging recommendation, surgery, clinical correlation, pathology, histopathology, and endoscopy. The time frame identified was converted to a number of days. When a report had a recommendation for the same day or within hours, the time frame was classified as 0. When no time frame was specified, it was classified as –1.
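The time-frame coding just described can be written down as a small Python sketch. The conversion factors of 30 days per month and 365 days per year are assumptions, chosen because they agree with the figures quoted in this article (3 months as 90 days, 5 years as 1,825 days); the function name and constants are hypothetical.

```python
SAME_DAY = 0       # recommendation for the same day or within hours
UNSPECIFIED = -1   # no time frame stated in the report

# Assumed conversion factors; hours collapse to day 0.
DAYS_PER_UNIT = {"hour": 0, "day": 1, "week": 7, "month": 30, "year": 365}


def code_time_frame(quantity, unit):
    """Convert a (quantity, unit) pair to days, or -1 when nothing was stated."""
    if quantity is None or unit is None:
        return UNSPECIFIED
    return quantity * DAYS_PER_UNIT[unit]


print(code_time_frame(3, "month"))    # 90
print(code_time_frame(6, "hour"))     # 0
print(code_time_frame(None, None))    # -1
```

Using a sentinel of –1 keeps unspecified time frames in the same numeric column as real intervals, so they must be filtered out before any averaging.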
Recommendation Features in Entire Database
After the validation study, the entire database of 4,211,503 radiology
reports from 1995–2004 was analyzed with the natural language processing
program for recommendations, recommended imaging technique, and time frame.
For construction of this database, all reports from the radiology information
system (RIS) were transferred through a Health Level 7 link. This link is an
automated interface for transferring all unstructured and structured radiology
reports from the RIS to the database repository within 5 minutes of their
entry in the RIS.
In 2001, radiologist order entry was introduced at our institution to assist physicians in ordering examinations. The radiologist order entry database has clinical indications and International Classification of Diseases 9 codes for the examinations performed. The radiologist order entry data were transferred to the database repository through an open database connectivity link, a standard database access method that allows access to data from any application. Therefore, the source of clinical indications for a study was the radiologist order entry component of our RIS.
The natural language processing program initially categorized the radiology reports on the basis of presence of imaging and nonimaging recommendations. The results of the analysis and other information from our RIS were stratified in the comprehensive database into data fields such as age group, patient type (inpatient or outpatient), referring physician, imaging technique, and clinical indication.
For reports with imaging recommendations, the patterns of the recommendation features, recommended imaging technique, and recommended time frame were determined for different age groups (0–9 years, 10–19 years, 20–29 years, 30–39 years, 40–49 years, 50–59 years, 60–69 years, and older than 70 years), radiology subspecialties (nuclear medicine, thoracic radiology, pediatric radiology, neuroradiology, emergency radiology, cardiac imaging, breast imaging, musculoskeletal imaging, and abdominal imaging), clinical indications (text obtained from the radiologist order entry, such as lung cancer and back pain), patient types (inpatient and outpatient), referring physicians (name obtained from radiologist order entry), and imaging techniques. Results of the natural language processing analysis were displayed and exported in the form of graphs and pivot tables with an online analytic processing server, which provided rapid results for various analytic multidimensional queries performed on the database. Temporal trends of the volume of imaging examinations and recommendation features such as recommended imaging technique and recommended time frame from 1995 to 2004 also were assessed.
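The stratification described above amounts to counting reports per cell of a pivot table. The following Python sketch shows the idea for age group by recommended technique. The records are made-up rows for illustration only, the function names are hypothetical, and placing an age of exactly 70 years in the oldest band is an assumption, because the text lists "60–69 years" and "older than 70 years".

```python
from collections import Counter


def age_group(age):
    """Map an age in years to the decade bands listed above."""
    if age >= 70:  # assumption: age 70 itself falls in the oldest band
        return "70+"
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def pivot(records):
    """Count reports per (age group, recommended technique) cell."""
    return Counter((age_group(r["age"]), r["technique"]) for r in records)


# Invented example rows, not data from the study.
records = [
    {"age": 7, "technique": "Radiography"},
    {"age": 45, "technique": "CT"},
    {"age": 48, "technique": "CT"},
    {"age": 82, "technique": "MR"},
]
table = pivot(records)
print(table[("40-49", "CT")])  # 2
```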
Different radiologists in tertiary health centers often report different imaging techniques and examinations performed on different body regions. To ensure that the differences in recommended imaging techniques and time frames for different radiologists were not due to interpretation of different imaging techniques, we stratified the data for imaging techniques and body regions.
Data Analysis
SAS statistical software (version 9.1, SAS) and Excel spreadsheet software
(Microsoft) were used to analyze recommendation features in the entire
database of radiology reports. Interobserver agreement between the two
radiologists was determined with the kappa test. The two radiologists'
classifications were considered the standard of reference for determining the
accuracy of classification of the recommendation features, such as recommended
imaging technique and recommended time frame, with the natural language
processing program. The program result was labeled inaccurate when the
recommended technique or time frame was wrongly classified. For example, for a
report with the recommendation "CT is recommended at 6 weeks,"
categorization of recommendation type as a non-CT technique or time frame
other than 6 weeks was labeled inaccurate.
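The two quantities used in the validation, interobserver agreement and program accuracy against the reference standard, can be sketched in Python as follows. The authors used SAS and Excel; this stand-alone version of Cohen's kappa and of simple accuracy is only an illustration, and the label lists are placeholders that reproduce the 82/88 count rather than the study's report-level data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def accuracy(program_labels, reference_labels):
    """Fraction of reports for which the program matches the reference."""
    correct = sum(p == r for p, r in zip(program_labels, reference_labels))
    return correct / len(reference_labels)


# Placeholder labels: 82 of 88 reports classified correctly.
reference = ["CT"] * 88
program = ["CT"] * 82 + ["MR"] * 6
print(round(accuracy(program, reference), 3))  # 0.932

rater_1 = ["CT", "MR", "CT", "none"]
rater_2 = ["CT", "MR", "CT", "none"]
print(cohen_kappa(rater_1, rater_2))  # 1.0
```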
Statistical analysis of the data with multiple logistic regression tests was performed to determine the effect of predictor variables (age group, sex, imaging technique, radiology subspecialty, patient type, clinical indication, International Classification of Diseases 9 code) on the outcome variables (recommended time frame, recommended imaging technique). A Mantel-Haenszel chi-square trend test was used to determine significant differences in trends of radiology examination volumes and recommendations over the duration of the study. Bonferroni correction of the multiple logistic regression and Mantel-Haenszel trend tests was not performed because most p values were extremely small (p < 0.0001), possibly owing to exaggeration of statistical differences by the large sample used in our study.
Interobserver agreement was complete (κ = 1, p < 0.05) between the two radiologists for presence and absence
of recommendations, recommended imaging technique, and time frame associated
with the recommendations. As reported in our previous study [6], the natural language
processing engine had an accuracy of 100% for classifying the reports on the
basis of the presence and absence of recommendations. According to both radiologists, the range of time frames recommended in the reports was day 0 (same day) (2.3%, 2/88) to 720 days (2.3%, 2/88). For most (73.9%, 65/88) of the radiology reports in the validation sample, however, the time frame for recommendation, whether nonimaging or imaging, was not specified. After unspecified time frame, 3 months (4.5%, 4/88) and 1 week (3.4%, 3/88) were the most frequently recommended follow-up time frames in reports with imaging recommendations.
The natural language processing engine accurately classified the recommended time frame in 83 (94.3%) of the 88 reports with recommendations. The time frame was inaccurate in five (5.7%) of the reports. Inaccurate detection of recommended time frame resulted mainly from the presence of numbers other than time intervals in the impression section of radiology reports. Such errors occurred in reports such as one with an impression section that read "3 cm rounded fluid-filled cyst in the right pelvis, likely physiologic ovarian cyst; recommend 6 week follow up ultrasound." For such reports the program falsely classified the time frame recommended as 3 weeks instead of 6 weeks. It misclassified a time frame as 12 years instead of 12 months for a report stating "follow-up CT chest is suggested in 12 months to ensure stability and complete 2 years of surveillance."
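The kind of error described above, in which a number that is not a time interval is picked up as the time frame, can be reproduced with a deliberately naive extractor. The Python sketch below is hypothetical and does not represent how Leximer works internally; it only contrasts pairing the first number with the first time unit against requiring the unit to follow the number directly.

```python
import re

DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}
UNIT = r"(day|week|month|year)s?"


def naive_days(text):
    """Pair the first number with the first time unit, wherever each occurs."""
    number = re.search(r"\d+", text)
    unit = re.search(UNIT, text, re.IGNORECASE)
    if number is None or unit is None:
        return None
    return int(number.group()) * DAYS_PER_UNIT[unit.group(1).lower()]


def adjacent_days(text):
    """Accept a number only when a time unit follows it directly."""
    match = re.search(r"\b(\d+)[\s-]*" + UNIT + r"\b", text, re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1)) * DAYS_PER_UNIT[match.group(2).lower()]


cyst = ("3 cm rounded fluid-filled cyst in the right pelvis, likely physiologic "
        "ovarian cyst; recommend 6 week follow up ultrasound.")
print(naive_days(cyst))     # 21 (the lesion size is mistaken for 3 weeks)
print(adjacent_days(cyst))  # 42 (6 weeks)
```

The adjacency rule handles the cyst example but is still only a heuristic; a sentence with two valid intervals, such as the 12-month follow-up within 2 years of surveillance, needs a rule for choosing between them.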
For recommended imaging technique, the program accurately classified 93.2% (82/88) and inaccurately classified 6.8% (6/88) of the reports. When imaging tests in addition to the recommended imaging technique were mentioned in the impression section, the program falsely classified the other technique as the recommended technique. For example, a report with the statement "hepatocellular carcinoma can be PET negative, and therefore continued follow up with MRI is advised" was misclassified, and the recommended imaging technique was labeled nuclear medicine rather than MRI. In a report that stated "CTA is recommended for further evaluation of the PCA, basilar artery, and the ophthalmic arteries, which are not visualized on the MR angiography," the program wrongly classified the recommended technique as MRI rather than CT, resulting in inaccurate classification. For a report in which sonography was referred to as a scan, the program falsely categorized the technique as a nuclear medicine recommendation because of the use of the term "scan."
Recommendation Features in Entire Database
A total of 348,689 radiologic examinations were performed in 1995 and
547,310 examinations in 2004 with an average annual increase of 5.2% ±
1.6%. The average annual increase in the volume of CT examinations was 14.6%
± 2.5% (30,852 examinations in 1995, 103,390 examinations in 2004), in
MRI examinations was 26.0% ± 6.9% (7,513 to 54,237), and in sonographic
examinations was 9.8% ± 1.4% (28,482 to 65,770). Most (87.5%,
3,683,901/4,211,503) of the radiology reports had no recommendations of any
sort, imaging or nonimaging. Only 12.5% (527,602/4,211,503) of the radiology
reports had recommendations for subsequent action.
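The growth figures above are reported as a mean ± SD of year-over-year increases, which requires the yearly counts that are not listed here. As a rough cross-check using only the 1995 and 2004 endpoints, a compound annual growth rate can be computed. This is a different summary statistic from the one reported, so the values need not match exactly; the function name is hypothetical.

```python
def compound_annual_growth(start_count, end_count, years):
    """Constant yearly growth rate that turns start_count into end_count."""
    return (end_count / start_count) ** (1 / years) - 1


# Examination counts for 1995 and 2004 as quoted in the text (9 yearly steps).
volumes = {
    "all examinations": (348_689, 547_310),
    "CT": (30_852, 103_390),
    "MRI": (7_513, 54_237),
    "sonography": (28_482, 65_770),
}
for name, (count_1995, count_2004) in volumes.items():
    rate = compound_annual_growth(count_1995, count_2004, 9)
    print(f"{name}: {rate:.1%} per year")
```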
Of the reports with recommendations, 71.4% (376,918/527,602) contained recommendations of further imaging and 28.6% (150,684/527,602) had nonimaging recommendations. Figure 1 summarizes the frequency of recommended imaging techniques. Patterns for recommended imaging techniques by age group (Fig. 2) showed that CT was the most frequently recommended imaging technique for patients older than 20 years and that radiography was most frequently recommended for younger patients. Among the reports of different imaging studies, there was a statistically significant difference among recommended imaging techniques (p < 0.001) (Table 1). Recommendations of nonimaging evaluation and CT were the most common types of recommendations regardless of the specialty of the ordering physician. Recommendations of different imaging techniques, including CT, MRI, nuclear medicine, radiography, and sonography, in reports from different radiology subspecialties are summarized in Table 2. Within each radiology subspecialty, there were differences between radiologists (p < 0.001) in rates of recommended imaging techniques and time frames, although the examination types reported were not different.
The most frequently recommended imaging techniques among inpatients and outpatients are summarized in Figure 3. The patterns for the recommendation features in radiology examinations performed for common presenting clinical indications are summarized in Table 3. The trends for recommended imaging techniques from 1995 to 2004 are summarized in Figure 4. There was a significant increase in recommendations of high-cost imaging examinations such as CT, MRI, and sonography (p < 0.0001).
In the reports containing imaging recommendations, the time frame recommended ranged from 0 to 1,825 days (same day to 5 years). Reports such as those on CT colonography had much longer intervals for performing follow-up imaging, such as "follow-up colonography at 3–5 years," than did other reports. There were also recommendations of mammographic screening in 5 years. However, recommended time frames were not specified in 85.4% (322,074/376,918) of the radiology reports with imaging recommendations. Only 14.6% (54,844/376,918) of the radiology reports contained a specified time frame, 6 months being the most frequent time frame recommended (28.6%, 15,703 of 54,844 reports with a specific recommended time frame). Irrespective of the age group, the time frames in most reports were not specified. Table 4 summarizes the commonly recommended time frames in the reports of different imaging examinations. Regardless of the radiologic subspecialty origin of the report, the most frequently recommended time frame was unspecified.
Natural language processing has been used to identify clinical information in radiology reports and to map it to structured representations containing medical terms [11]. It also has been used to automatically create an enriched document containing structured components obtained from free-text reports [11]. The document provided reliable and efficient access to clinical information in patient reports for a broad range of clinical applications. The natural language processing program Leximer has been used to assess the presence or absence of findings and recommendations in a database of radiology reports [6]. Extraction of clinical information from reports can help in a variety of applications, such as quality assessment, clinical research, and development of decision support guidelines. Because natural language processing makes it possible to analyze millions of documents in a matter of a few hours, it is a practical and appealing method of data mining for relevant clinical information.
Studies [6, 7, 11, 12] have shown that natural language processing is an accurate technique for assessing unstructured free-text clinical reports and documents with a positive predictive value (precision) of 76.0–99.4%, sensitivity (recall) of 70.0–98.2%, and specificity of 85.0–99.9%. The results of this study show that the current version of the natural language processing program Leximer is accurate for classifying radiology reports on the basis of recommended imaging techniques (accuracy, 93.2%) and recommended time frames (accuracy, 94.3%).
Results with the natural language processing program revealed that most radiology reports with imaging recommendations (85.4%) did not have a specified time frame irrespective of patient age, clinical indication, radiology subspecialty, and imaging technique. Many reports with unspecified time frames had suggestions for follow-up or repeated imaging based on clinical correlation or course of disease. As of this writing, the natural language processing program does not discriminate whether an unspecified time frame is for follow-up imaging to assess disease progression or for obtaining additional information to help with diagnosis. In the latter case it is conceivable that the radiologist intends to perform imaging at the earliest convenience. In the former scenario, it is unlikely that immediate imaging is intended, and some referring physicians may prefer that the radiologist specify the time frame in which follow-up would be relevant.
More than one fourth (28.6%) of the overall recommendations in our study were for nonimaging follow-up or evaluations. Nonimaging recommendations contributed to more than one third of all recommendations in reports of patients younger than 20 years and only one fourth of all recommendations for patients older than 50 years. This finding underscores the importance of the most recent version of the natural language processing program versus the previous version, which only discriminated presence or absence of recommendations. For cost studies of recommendation practices, the version with recommendation feature extraction can offer substantial advantages.
Among reports with imaging recommendations, fewer than 10% of radiology
reports lacked recommendations of specific imaging studies. Approximately one
fourth of all recommended imaging techniques were CT. Over the 10-year period
of the database, there was an average annual increase of 14.6% in the volume
of CT examinations performed at our institution and an 18.7% increase in
recommendations of CT. Rates of recommendation of CT, however, were lower for
children and younger adults ( 10%) compared with those for patients older
than 50 years ( 20%). Nonetheless we noticed that radiology reports of CT
examinations were accompanied by a high rate of follow-up CT. Compared with
recommendation of most other imaging techniques, higher recommendation rates
for CT were found in most radiology subspecialties and age groups. Substantial
differences in recommending CT also were found between different radiologists
in a given subspecialty. Use of the natural language processing program helped
identify clinical indications with high rates of recommendations of CT, such
as lung malignancy and abdominal or pelvic pain. High rates of recommendation
of CT may have implications for the risk associated with radiation dose because
collective or cumulative radiation dose increases with multiple CT and
radiation-based examinations.
High-cost imaging tests, such as CT, MRI, and sonography, constituted approximately 56% of all recommended imaging techniques and 40% of all recommendations in our database. Compared with 1995, in 2004, there was a 14.4% ± 1.5% increase in the volume of these examinations and an increase of 21.0% ± 4.2% in overall recommendations of CT, MRI, and sonography. An interesting finding was that there also was a decrease in the growth of recommendations of radiography starting in 1998. These patterns may be due to increasing perception or experience among radiologists about the role of CT, MRI, and sonography in obtaining additional information in view of remarkable technologic advances in these techniques. The findings raise concern, however, about possible inappropriate use of these expensive imaging techniques. The integral consulting role of radiologists and the desire and possibly the expectation of some referring physicians for clinical management guidance from radiology reports must be borne in mind. A similar trend toward an increase in use of CT studies and the effect on increasing radiation exposure has been described [3].
There were considerations associated with the natural language processing program used in our study. To the best of our knowledge, the program has not been validated for analysis of reports from multiple imaging centers. It also is not integrated with the clinical or health information system. Therefore, it cannot be used to gather critical clinical information to gauge the effect of imaging recommendations on patient care and management. Furthermore, the program lacks the ability to track reports for distinct medical record numbers. For example, for an individual patient, it does not find how many examinations were performed and what the recommendation features were. This lack of individual information makes it difficult to infer whether a patient or physician has complied with a recommended technique and the time frame for follow-up or repeated imaging. Another limitation of the program is its inability to categorize reports with periodic or multiple follow-up recommendations. Multiple interval or periodic recommendations in an individual radiology report are counted as a single recommendation. This limitation might have led to underestimation of recommendation rates in our study.
A limitation of our study was that we included reports of all imaging techniques but did not perform power analysis to determine the number of cases necessary for accurate validation. In our study, the standard error of accuracy in analysis of 120 reports was 3.3%. Thus even at the lowest limit of the confidence interval (–3.3%), the accuracy would still be close to 90% for both time frame recommended and recommended imaging technique. Standard error of the accuracy estimated with 120 reports depends largely on the number of subjects and reports studied and not on the size of the population or number of reports to which it is applied. Thus an estimate based on analysis of 120 reports is valid regardless of the number of reports to which it was applied, such as numbers as large as the more than 4 million reports used in our study.
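The point that the standard error depends on the validation sample size and not on the size of the database can be seen from the usual binomial formula, SE = sqrt(p(1 − p)/n), in which the population size does not appear. The paper does not state which formula or inputs produced the 3.3% figure, so the Python sketch below is an assumption-laden illustration of the principle rather than a reproduction of that number.

```python
import math


def standard_error(p, n):
    """Binomial standard error of a proportion p estimated from n reports."""
    return math.sqrt(p * (1 - p) / n)


# Accuracies reported for the 88 reports with recommendations.
for label, correct in (("imaging technique", 82), ("time frame", 83)):
    p = correct / 88
    print(f"{label}: accuracy {p:.3f}, SE {standard_error(p, 88):.3f}")

# The database size never enters the calculation: the estimate is the same
# whether the program is later applied to thousands or millions of reports.
```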
Another limitation of our study was that it was a retrospective analysis of radiology reports. Our study was limited in that it represented trends in radiology reports from a single institution. Another consideration may be that we did not determine the effect of the imaging recommendations. Whether an imaging recommendation materialized into an actual examination or was followed up in the recommended time was not assessed. We also did not determine how physicians interpreted a particular recommendation in cases of unspecified imaging requests.
A study of recommendation features for different clinical indications and patient demographic features may aid in assessment and establishment of recommendation policies for varied clinical and patient attributes. It also may help radiologists consistently follow recommendation guidelines for various clinical indications and patient demographic features.
The version of the natural language processing program used in this study is accurate for determining specific recommendation features, such as imaging technique and time frame, in large databases of radiology reports. Further studies can help to determine the best and most conclusive imaging technique as the first study in the presence of clinical indications associated with high recommendation rates to restrict, if possible, the risk and costs of multiple serial imaging studies. Assessment of recommendation features with the natural language processing program may help radiologists limit inconsistencies in recommendation practices regarding an imaging examination and its timing.