1 Department of Radiology and Radiation Oncology, Division of Radiological
Science, Nagasaki University Graduate School of Biomedical Sciences, 1-7-1,
Sakamoto, Nagasaki 852-8501, Japan.
2 Department of Radiation Epidemiology, Atomic Bomb Disease Institute, Nagasaki
University School of Medicine, 1-12-4 Sakamoto, Nagasaki 852-8501,
Japan.
3 General Research Center, Nippon Bunri University, Ichiki 1727, Oita 870-0397,
Japan.
4 Department of Radiology, Kurt Rossmann Laboratories for Radiologic Image
Research, The University of Chicago, 5841 S Maryland Ave., Chicago, IL
60637.
Received September 15, 2003;
accepted after revision February 4, 2004.
Supported by a grant-in-aid for scientific research to K. Ashizawa from the
Ministry of Education in Japan (no. 12670886) and by a grant from the U.S.
Public Health Service (no. CA62625).
Abstract
MATERIALS AND METHODS. We selected 130 clinical cases of diffuse lung disease. We used a single three-layer, feed-forward ANN with a back-propagation algorithm. The ANN was designed to differentiate among 11 diffuse lung diseases by using 10 clinical parameters and 23 HRCT features. Therefore, the ANN consisted of 33 input units and 11 output units. Subjective ratings for 23 HRCT features were provided independently by eight radiologists. All clinical cases were used for training and testing of the ANN by implementing a round-robin technique. In the observer test, a subset of 45 cases was selected from the database of 130 cases. HRCT images were viewed by eight radiologists first without and then with ANN output. The radiologists' performance was evaluated with receiver operating characteristic (ROC) analysis with a continuous rating scale.
RESULTS. The average area under the ROC curve for ANN performance obtained with all clinical parameters and HRCT features was 0.956. When the radiologists used the ANN output based on their own feature ratings, the average area under the ROC curve increased from 0.986 to 0.992 (p = 0.071) for the four chest radiologists and from 0.958 to 0.971 (p < 0.001) for the four general radiologists.
CONCLUSION. The ANN can provide a useful output as a second opinion to improve general radiologists' diagnostic performance in the differential diagnosis of certain diffuse lung diseases using HRCT.
Introduction
An artificial neural network (ANN), which is a computational model based on neurons in the human brain, has recently been applied to a variety of pattern recognition and data classification tasks in medical imaging such as chest radiography, chest CT, and mammography [8-16]. In differential diagnosis, an ANN can merge information, such as radiologic and clinical findings, and learn the relationship between input data and output data from the different patterns found in a large number of clinical cases. Grenier et al. [17] established a baseline for development of a computer-aided diagnosis (CAD) system to specifically diagnose chronic diffuse infiltrative lung diseases with Bayesian analysis. In their article, they assessed the diagnostic value of clinical data, chest radiography, and chest CT for the differential diagnosis of chronic diffuse infiltrative lung diseases; their results showed the supplementary contribution of CT to clinical and radiographic data for the diagnosis. However, the effect of this diagnostic tool as CAD on radiologists' performance was not evaluated.
In this study, we applied an ANN to the differential diagnosis of diffuse lung disease on HRCT, and we evaluated the diagnostic performance of the ANN by using receiver operating characteristic (ROC) analysis. We also evaluated the effect of the ANN output on radiologists' diagnostic performance.
Materials and Methods
The 11 diffuse lung diseases selected for differential diagnosis were sarcoidosis, diffuse panbronchiolitis, nonspecific interstitial pneumonia, lymphangitic carcinomatosis, usual interstitial pneumonia, silicosis, bronchiolitis obliterans with organizing pneumonia (BOOP) or chronic eosinophilic pneumonia, pulmonary alveolar proteinosis, miliary tuberculosis, lymphangiomyomatosis, and Pneumocystis carinii pneumonia or cytomegalovirus pneumonia. These 11 diseases were selected because they include relatively common diffuse lung diseases. We placed BOOP and chronic eosinophilic pneumonia into one group because the HRCT findings for these two diseases are similar and overlap, and differentiation between them using HRCT is difficult [18, 19]. Both P. carinii pneumonia and cytomegalovirus pneumonia are opportunistic infections in an immunocompromised host and often occur simultaneously [20]. In addition, the HRCT findings for both diseases are similar [21]; thus, we also placed them into one group.
The 10 clinical parameters included the patient's age and sex; duration and severity of symptoms; temperature; immune status; underlying malignancy; and history of smoking, dust exposure, and drug treatment. The HRCT features were classified into four categories: the extent, distribution, and characteristics of the pulmonary lesion and other thoracic abnormality. The distribution of the pulmonary lesion included upper or lower, right or left, central or peripheral, and dorsal or ventral predominance. The characteristics of the pulmonary lesion were grouped into four subtypes using a modification of the classification described by Webb et al. [22] as follows: linear opacity (peribronchovascular interstitial thickening, interlobular septal thickening, centrilobular branching opacity, intralobular reticular opacity, and nonseptal line), nodular opacity (centrilobular and subpleural small nodules, random nodules, and nodules or masses), decreased lung opacity (bronchiectasis, honeycombing, lung cysts, and cavitary lesions), and increased lung opacity (ground-glass opacity and consolidation). Other thoracic abnormalities included pleural effusion, lymphadenopathy, and heart size. We used these 10 clinical parameters and 23 HRCT features as input data for the ANN because the chest radiologists considered them important for the differential diagnosis of diffuse lung diseases in clinical practice.
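As a rough illustration of the network these inputs imply, the following NumPy sketch builds a three-layer, feed-forward network with 33 input units and 11 output units and performs back-propagation updates on squared error. The hidden-layer size (`N_HIDDEN`), the sigmoid activation, the weight initialization, and the learning rate are assumptions made for the example; the text here does not specify them, and this is not the authors' implementation.

```python
import numpy as np

N_INPUTS = 33    # 10 clinical parameters + 23 HRCT features (from the text)
N_OUTPUTS = 11   # one output unit per disease group (from the text)
N_HIDDEN = 18    # assumption: the hidden-layer size is not given here


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_network(rng):
    """Small random weights and zero biases for both layers (assumed scheme)."""
    return {
        "W1": rng.normal(0.0, 0.1, (N_INPUTS, N_HIDDEN)),
        "b1": np.zeros(N_HIDDEN),
        "W2": rng.normal(0.0, 0.1, (N_HIDDEN, N_OUTPUTS)),
        "b2": np.zeros(N_OUTPUTS),
    }


def forward(net, x):
    """Return hidden activations and the 11 output values for a batch x."""
    hidden = sigmoid(x @ net["W1"] + net["b1"])
    output = sigmoid(hidden @ net["W2"] + net["b2"])
    return hidden, output


def backprop_step(net, x, target, lr=0.1):
    """One gradient-descent update on squared error.

    Returns the loss measured before the update.
    """
    hidden, output = forward(net, x)
    error = output - target
    delta_out = error * output * (1.0 - output)
    delta_hid = (delta_out @ net["W2"].T) * hidden * (1.0 - hidden)
    net["W2"] -= lr * hidden.T @ delta_out
    net["b2"] -= lr * delta_out.sum(axis=0)
    net["W1"] -= lr * x.T @ delta_hid
    net["b1"] -= lr * delta_hid.sum(axis=0)
    return float(0.5 * np.sum(error ** 2))
```

Each row of `x` is one case's 33 normalized inputs, and each row of `target` is a one-hot vector marking the confirmed diagnosis among the 11 disease groups.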
Database
We selected 130 actual clinical cases of diffuse lung disease for training
and testing the ANN. These patients had undergone chest HRCT and received a
definite diagnosis between January 1997 and December 2000. All 130 patients
had only one disease entity; patients with two or more disease entities were
excluded. The study group was composed of 57 men and 73 women who ranged in
age from 18 to 85 years (mean, 52 years). The number of cases for each disease
ranged from four to 28. There were 28 cases of sarcoidosis, 17 of diffuse
panbronchiolitis, 16 of nonspecific interstitial pneumonia, 15 of lymphangitic
carcinomatosis, 12 of usual interstitial pneumonia, 11 of silicosis, 11 of
BOOP or chronic eosinophilic pneumonia, seven of pulmonary alveolar
proteinosis, five of miliary tuberculosis, four of lymphangiomyomatosis, and
four of P. carinii pneumonia or cytomegalovirus pneumonia. All cases
were diagnosed on the basis of clinical criteria: pathologic proof for the
pulmonary lesions (for all patients with sarcoidosis, nonspecific interstitial
pneumonia, usual interstitial pneumonia, BOOP, and lymphangiomyomatosis; and
for three patients with chronic eosinophilic pneumonia, three with diffuse
panbronchiolitis, and three with pulmonary alveolar proteinosis),
bacteriologic proof (all patients with miliary tuberculosis, Pneumocystis
carinii pneumonia, and cytomegalovirus pneumonia), or detailed clinical
correlation (all patients with lymphangitic carcinomatosis and silicosis; and
the remaining patients with chronic eosinophilic pneumonia, diffuse
panbronchiolitis, and pulmonary alveolar proteinosis).
An example of ratings for each clinical parameter is shown in Table 1. All clinical parameters in each case were available in our medical records. The absolute value was used for age, duration of symptoms, temperature, and history of smoking. The severity of symptoms was classified into five grades on the basis of the Hugh-Jones classification as described by Xu et al. [23] (Appendix 1). Sex, immune status, underlying malignancy, history of dust exposure, and history of drug treatment were coded as 0 or 1.
Subjective ratings for the 23 HRCT features were provided independently by eight radiologists: four chest radiologists with 13, 10, 6, and 5 years of experience and four general radiologists with 9, 4, 3, and 3 years of experience. To eliminate bias, they were not informed of the correct diagnosis or of any clinical parameters other than patient age and sex. After two chest radiologists confirmed that lung lesions were distributed diffusely in all cases, six images (three HRCT images and three images obtained using mediastinal window settings) at the level of the aortic arch, the tracheal carina, and 2 cm above the right hemidiaphragm were selected for the ratings to minimize the observers' review time, as shown in Figure 2. Several authors have reported that an accurate diagnosis could be made by using a limited number of slices for HRCT images as well as HRCT images at 1-cm intervals in the differential diagnosis of diffuse lung disease [24, 25]. In addition to these three anatomic levels, when a pulmonary lesion existed predominantly in the apex or lung base, we also showed the HRCT image of these levels. Table 2 shows examples of two radiologists' ratings for 23 HRCT features for the patient with sarcoidosis shown in Figure 2. All subjective ratings for HRCT features except the distribution of the pulmonary lesion were rated on a scale of 0 to 4. The distribution of the lesion was rated from 1 to 5 (1, upper >> lower; 2, upper > lower; 3, upper = lower; 4, upper < lower; and 5, upper << lower). All input data for the ANN were normalized to a range from 0 to 1.
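The normalization step can be sketched as follows. The text states only that all inputs were normalized to a range from 0 to 1; the linear (min-max) mapping and the function names below are assumptions for illustration.

```python
def scale_rating(value, low, high):
    """Linearly map a value from [low, high] onto [0, 1]."""
    if high <= low:
        raise ValueError("high must exceed low")
    return (value - low) / (high - low)


def normalize_hrct_ratings(feature_ratings, distribution_ratings):
    """Scale 0-4 feature ratings and 1-5 distribution ratings to [0, 1]."""
    features = [scale_rating(r, 0, 4) for r in feature_ratings]
    distribution = [scale_rating(r, 1, 5) for r in distribution_ratings]
    return features + distribution


def normalize_continuous(values):
    """Min-max scale an absolute-valued parameter (e.g., age) over the database.

    Assumption: the source does not say which bounds were used for
    absolute-valued parameters; the observed minimum and maximum are used here.
    """
    low, high = min(values), max(values)
    return [scale_rating(v, low, high) for v in values]
```

With this mapping, a distribution rating of 3 (upper = lower) becomes 0.5, and the youngest and oldest patients in the database map to 0 and 1.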
Evaluation of ANN Performance
We implemented a round-robin (or leave-one-out) method for training and
testing the ANN by using all clinical cases. With this method, ratings for all
the cases in the database except one were used for training, and ratings for
the omitted case were applied to the testing with the trained ANN. The ANN was
trained on a combination of the input data obtained from clinical parameters
and subjective ratings for HRCT features for each case from the eight
radiologists and was tested with the input data from each radiologist's
feature ratings. In this method, we did not use the eight radiologists'
ratings of the same case independently because of the potential correlation
among them. If we had used one radiologist's ratings for training and the
other radiologists' ratings of the same case for testing the ANN, this overlap
could have produced a positive bias in the evaluation of ANN performance.
Therefore, the ANN was trained and tested on a per-case basis. This procedure
was repeated until every case in the database was used once as a testing
case.
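The per-case round-robin split described above can be sketched as a generator that holds out every rating of one case at a time. The tuple layout `(case_id, reader_id, features)` is a hypothetical data structure chosen for the example.

```python
def leave_one_case_out(ratings):
    """Yield (held_out_case, training_rows, test_rows) for each case.

    `ratings` is a list of (case_id, reader_id, features) tuples. All rows that
    share the held-out case_id go to the test side, so no reader's ratings of a
    test case are ever seen during training.
    """
    case_ids = sorted({case_id for case_id, _, _ in ratings})
    for held_out in case_ids:
        train = [row for row in ratings if row[0] != held_out]
        test = [row for row in ratings if row[0] == held_out]
        yield held_out, train, test
```

With 130 cases and eight readers, each fold would train on 129 × 8 rating sets and test on the eight rating sets of the omitted case.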
The ANN performance was evaluated with ROC analysis. Binormal ROC curves for the diagnosis of diffuse lung disease were estimated by using the LABROC1 algorithm developed by Metz [26-28]. An ROC curve for detecting each particular disease in the presence of the other 10 diseases was obtained by examining the output values from the single output unit that corresponded to the single disease in question and by considering cases of the disease as "actual positives" and cases of any other disease as "actual negatives." To assess ANN performance for each disease (disease-specific classification), we calculated the areas under each of these 11 ROC curves (i.e., the Az values) for the eight radiologists. We also evaluated the ANN performance for each radiologist by calculating the Az values for each of these eight ROC curves.
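The one-disease-versus-the-rest labeling can be illustrated with a short sketch. Note that it computes the nonparametric (empirical, Mann-Whitney) area under the ROC curve rather than the binormal Az fitted by LABROC1, so its values would generally differ somewhat from those reported; the function names are hypothetical.

```python
def empirical_auc(positive_scores, negative_scores):
    """Area under the empirical ROC curve (Mann-Whitney statistic, ties = 0.5)."""
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def one_vs_rest_auc(outputs, true_labels, disease_index):
    """AUC for one disease, using only that disease's output unit.

    `outputs[i]` holds the output values for case i and `true_labels[i]` is
    the index of its confirmed diagnosis. Cases of the disease in question are
    actual positives; cases of any other disease are actual negatives.
    """
    pos = [o[disease_index] for o, t in zip(outputs, true_labels)
           if t == disease_index]
    neg = [o[disease_index] for o, t in zip(outputs, true_labels)
           if t != disease_index]
    return empirical_auc(pos, neg)
```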
Observer Test
An observer test was performed 6 months after the eight radiologists had
provided subjective ratings for the HRCT features in 130 cases. For the
observer test, a limited subset of 45 cases was selected from the database of
130 cases by two chest radiologists who did not participate in the observer
test. The reason for reducing the number of cases for the observer test was to
decrease the time required for observers. The 45 cases included 17 men and 28
women who ranged in age from 19 to 85 years (mean, 52 years). For these 45
cases, the ANN performance (Az value) was comparable to
that obtained for all 130 cases. In addition, the distribution of disease
categories in the subset was similar to that of the complete set of 130 cases
(10 cases of sarcoidosis, six of diffuse panbronchiolitis, six of nonspecific
interstitial pneumonia, five of lymphangitic carcinomatosis, four of usual
interstitial pneumonia, four of silicosis, four of BOOP or chronic
eosinophilic pneumonia, two of miliary tuberculosis, two of pulmonary alveolar
proteinosis, one of lymphangiomyomatosis, and one of P. carinii
pneumonia or cytomegalovirus pneumonia).
The eight radiologists who had provided subjective ratings for HRCT features in advance participated in the observer test. The observers were told that the subset of 45 cases used for the observer test had been selected from the 130 cases for which they had extracted HRCT features 6 months earlier; that only one of the 11 possible diseases was the correct diagnosis for each case and normal cases were not included; and that the ANN outputs presented to observers were obtained by using their own feature ratings as input data for the ANN. The observers were not informed of the distribution of each disease category in the subset of 45 cases.
Before the test, three training cases were shown to the observers to familiarize them with the rating method and with the use of the ANN output as a second opinion. Initially, each observer was presented HRCT images and clinical parameters and rated the likelihood of each of the 11 diffuse lung diseases. The observer's confidence level was represented on an analog continuous-rating scale with a line-checking method [10, 15, 29]. Observers marked their confidence levels along the 11 lines on the score sheet. Ratings of definitely absent and definitely present were marked above the left and the right ends of the line, respectively. Subsequently, the ANN outputs were presented to the observer. Figures 3A and 3B show examples of graphs of the ANN output used in this observer test. In the second interpretations, observers used a red pencil to mark their confidence levels along the same 11 lines if they changed their confidence levels as a result of the ANN outputs.
Data Analysis
For data analysis, the confidence level was scored by measurement of the
distance from the left end of the line to the marked point and converting the
measurement to a scale from 0 to 100.
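As a minimal sketch of this conversion, the score is the marked distance expressed as a percentage of the line length. The physical length of the line is not stated in the text, so `line_length` is a parameter of the example rather than a known value.

```python
def confidence_score(mark_distance, line_length):
    """Convert a mark's distance from the left end of the line to a 0-100 score."""
    if line_length <= 0:
        raise ValueError("line_length must be positive")
    if not 0 <= mark_distance <= line_length:
        raise ValueError("mark must lie on the line")
    return 100.0 * mark_distance / line_length
```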
The radiologists' diagnostic performance without and with ANN output was evaluated by ROC analysis [26]. We defined confidence ratings data with the correct diagnosis as actual positives and those with any other diseases as actual negatives. For each observer and each interpretation condition (with and without ANN output), we used a maximum-likelihood estimation to fit a binormal ROC curve to the confidence ratings data for all 11 possible diseases in the 45 cases [27]. We combined data for all diseases because of the small number of cases of each disease. The Az value was then calculated for each fitted ROC curve. The statistical significance of differences between ROC curves for each interpretation condition was determined by applying a two-tailed Student's t test for paired data to the observer-specific Az values. Average ROC curves were generated to represent the overall performance for each group of observers (four chest radiologists and four general radiologists) by averaging the two binormal parameters of their individual ROC curves [28]. We also calculated the sensitivity and specificity for each of the eight radiologists using confidence ratings data. A case in which the correct diagnosis received the highest confidence rating was judged as one true-positive and 10 true-negative findings. A case in which the correct diagnosis received the second-highest or a lower confidence rating was judged as one false-negative, one false-positive, and nine true-negative findings.
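The counting rule for sensitivity and specificity can be sketched as below. The treatment of tied top ratings is an assumption (a tie is counted as a miss), because the text does not say how ties were handled; the function names are hypothetical.

```python
def tally_case(confidences, correct_index):
    """Return (tp, fn, fp, tn) for one case under the counting rule in the text.

    `confidences` is a list with one rating per candidate disease.
    """
    top = max(confidences)
    n_diseases = len(confidences)
    # Assumption: a tie for the top rating is treated as a miss.
    is_unique_top = (confidences[correct_index] == top
                     and confidences.count(top) == 1)
    if is_unique_top:
        return 1, 0, 0, n_diseases - 1
    return 0, 1, 1, n_diseases - 2


def sensitivity_specificity(cases):
    """`cases` is a list of (confidences, correct_index) pairs for one reader."""
    tp = fn = fp = tn = 0
    for confidences, correct_index in cases:
        a, b, c, d = tally_case(confidences, correct_index)
        tp, fn, fp, tn = tp + a, fn + b, fp + c, tn + d
    return tp / (tp + fn), tn / (tn + fp)
```

With 11 candidate diseases, a correctly top-ranked case contributes 1 true-positive and 10 true-negative findings, and any other case contributes 1 false-negative, 1 false-positive, and 9 true-negative findings, matching the rule stated above.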
Another indication of observer performance was the number of correctly diagnosed cases for which the observer's ranking was changed by the ANN output. We used four rankings: 1, 2, 3, and 4 and more, where 1 corresponded to a case that the observer diagnosed correctly with the highest confidence rating, 2 corresponded to a case diagnosed correctly with the second highest confidence rating, and so on. If a ranking was improved, such as a change from 2 to 1, by the ANN output, the ANN affected the diagnostic performance beneficially; the opposite indicated a detrimental effect. The statistical significance of the difference between the number of cases affected beneficially and that affected detrimentally was analyzed by a two-tailed t test for paired data.
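The ranking comparison can be sketched as follows. Ranks of 4 or worse are collapsed into a single category as described above; resolving ties in favor of the correct diagnosis is an assumption of the example, and the function names are hypothetical.

```python
def rank_of_correct(confidences, correct_index):
    """Rank of the correct diagnosis: 1 = highest rating; 4 means "4 and more".

    Assumption: ratings tied with the correct diagnosis do not worsen its rank.
    """
    higher = sum(1 for c in confidences if c > confidences[correct_index])
    return min(higher + 1, 4)


def effect_of_ann(before, after, correct_index):
    """Classify the change in rank between the two readings of one case."""
    r_before = rank_of_correct(before, correct_index)
    r_after = rank_of_correct(after, correct_index)
    if r_after < r_before:
        return "beneficial"
    if r_after > r_before:
        return "detrimental"
    return "unchanged"
```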
Results
Table 3 shows the ANN performance for each radiologist based on their own feature ratings. The Az values for the eight radiologists ranged from 0.946 to 0.969 (average Az = 0.956). These Az values show a relatively high performance.
When we evaluated the ANN performance, we used not only all 33 features, but also the 10 clinical parameters alone or the 23 HRCT features alone as input units for the ANN. The average Az value for the eight radiologists obtained with HRCT features alone was 0.910, and the Az value obtained with clinical parameters alone was 0.884. The ANN performance with all 33 features was superior to that with 10 clinical parameters alone or 23 HRCT features alone.
Observer Test
Table 4 shows the Az values for ROC curves of eight radiologists obtained
without and with ANN output. The performance of each of the two groups of
observers (i.e., four chest radiologists and four general radiologists) is
illustrated by the average Az values in Figures 5A and 5B. The average Az
value for the four chest radiologists without and with ANN output was 0.986
and 0.992, respectively (Fig. 5A). The improvement did not reach statistical
significance for the chest radiologists (p = 0.071). The average Az value for
the four general radiologists without and with ANN output was 0.958 and 0.971,
respectively (Fig. 5B). The Az value for the general radiologists' subgroup
increased significantly when the ANN output was available (p < 0.001). The
sensitivity and specificity for each of eight radiologists without and with
ANN output are shown in Table 5. The average values for both sensitivity and
specificity for general radiologists were improved significantly with the use
of the ANN output (p < 0.05), whereas the average sensitivity and specificity
for chest radiologists were not increased significantly (p = 0.25 and
p = 0.50, respectively).
The number of cases affected either beneficially or detrimentally by the ANN output for each radiologist is shown in Figure 6. Cumulatively, the observers changed their ranking for the correct diagnosis in 34 of 360 (45 × 8) readings. Individual observers changed their responses in 2.2-15.6% of the 45 cases. The confidence level was affected beneficially in 29 cases and detrimentally in five cases. The average numbers of cases affected beneficially and detrimentally by ANN output for all radiologists were 3.6 and 0.6, respectively; this difference was statistically significant (p < 0.05).
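The figures in this paragraph can be cross-checked with a few lines of arithmetic. The interpretation of the per-observer range as 1 and 7 changed cases out of 45 is an inference from the percentages, not something the text states.

```python
cases_per_reader = 45
readers = 8
beneficial, detrimental = 29, 5

total_readings = cases_per_reader * readers          # 360
changed = beneficial + detrimental                   # 34
mean_beneficial = beneficial / readers               # 3.625, reported as 3.6
mean_detrimental = detrimental / readers             # 0.625, reported as 0.6

# Assumption: the reported per-observer range corresponds to 1 and 7 changed
# cases out of 45 for the least and most affected observer.
lowest_pct = round(100 * 1 / cases_per_reader, 1)    # 2.2
highest_pct = round(100 * 7 / cases_per_reader, 1)   # 15.6
```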
Discussion
In this study, we used a round-robin (or leave-one-out) method for training and testing the ANN. With this method, the ANN was trained on a combination of the input data obtained from HRCT feature ratings by eight radiologists with various years of experience and was tested with the input data from each radiologist's feature ratings. When we use the ANN in an actual clinical situation, each radiologist will likely be required to extract features and make a differential diagnosis on the basis of their own extracted features. Therefore, this study simulated the potential application of the ANN in the clinical situation. In previous studies on the differential diagnosis of diffuse lung disease on chest radiographs or solitary pulmonary nodules on chest radiographs and HRCT images, however, feature ratings as input data were given by experienced radiologists, and an ANN trained on the basis of these data had a good performance and significantly improved the diagnostic performance of observers who did not extract features [10, 14, 15]. Therefore, it would be clinically more feasible to develop an ANN that is trained with input data provided by experienced radiologists in the future.
Because each radiologist extracted features subjectively, interobserver variation exists. In particular, less experienced radiologists could not always extract features consistently. We examined the effect of interobserver variation based on correlation coefficients on subjective ratings for 23 HRCT features among the eight radiologists. The median of the correlation coefficients was 0.566, showing that the correlation was not strong. However, the Az values of the ANN for each radiologist ranged from 0.946 to 0.969, which shows a relatively high performance. These results seem to indicate that the ANN can learn certain specific patterns even if interobserver variation for HRCT feature ratings exists to some degree, although 10 clinical parameters were also included as input data.
Researchers have reported that an accurate diagnosis could be made with clinical parameters such as age and sex and HRCT features only in the differentiation of diffuse lung disease [7]. However, Grenier et al. [17] reported that a higher diagnostic performance was obtained after HRCT features were used with clinical parameters and radiographic findings. Therefore, we evaluated ANN performance by using not only all 33 features, but also the 10 clinical parameters alone or the 23 HRCT features alone as input data. Compared with the diagnostic performance of the ANN using all 33 input data (Az = 0.956), the Az values with 23 HRCT features alone and 10 clinical parameters alone were 0.910 and 0.884, respectively. These results indicate that the ANN performance with all or some of the clinical parameters in addition to HRCT features was higher than that with HRCT features alone. However, the ANN with all input data used in this study may not necessarily be suitable for differential diagnosis of diffuse lung disease. For clinical application of the ANN, it is desirable that only a small number of essential input data would be applied while maintaining a high Az value for diagnostic performance. Therefore, further study is needed for examining the minimum number of input units required. Although all clinical parameters were used as input data for the ANN in the present study, these data are not always available in actual clinical settings. In the future, it may be necessary to design an ANN in which only the clinical parameters available can be used as input data.
Because training of the ANN depends strongly on the database, a comprehensive database that covers a wide distribution of patterns for each disease is desirable. It is impractical to select many types of diffuse lung diseases including uncommon diseases as output units. In addition, collecting a sufficiently large number of clinical cases at one institution, especially for less common diseases, would be difficult. Thus, we selected 11 types of relatively common diffuse lung diseases for differential diagnosis, and 130 cases for which HRCT was performed and a definite diagnosis was established at our institution in a certain period were available as a database. Although other differential diagnoses may be seen in some situations, these 11 disease entities account for most of the diffuse lung diseases that we encounter in our daily work. Therefore, the use of this ANN is helpful in providing the radiologists with a list as well as a likelihood of these 11 types of disease. Although the number of clinical cases was not large in this study, the number of cases for each disease correlated relatively well with the actual incidence of clinical cases. We need to increase the number of cases to represent a wide distribution of radiologic patterns for each disease in the future.
The effect of the ANN on radiologists' performance in the differentiation of diffuse lung disease was evaluated by an observer test. The difference in Az values without and with ANN output was larger for general radiologists than for chest radiologists, and only the former reached statistical significance (p < 0.001). It should be noted that ANN outputs shown to observers were based on their own feature ratings and that this seems to simulate an actual clinical situation. The average Az value for general radiologists with ANN output was relatively close to that for chest radiologists without ANN output. This result indicates that the ANN may be useful in clinical settings for assisting radiology residents or less experienced radiologists who do not specialize in chest radiology in making a correct diagnosis.
The diagnostic performance of the ANN alone was lower than that of radiologists without ANN output. Nevertheless, the diagnostic performance of general radiologists using the ANN was significantly improved. Similar results were reported in studies on the detection of abnormalities such as lung nodules and interstitial opacities on chest radiographs and microcalcifications on mammograms [29-31]. This finding can be interpreted as follows: compared with chest radiographs, HRCT images can be used by radiologists alone to make a correct diagnosis in the differentiation of diffuse lung disease with relatively high performance. However, less experienced radiologists may fail to recognize important findings; in these situations, the ANN output could prompt radiologists to carefully reconsider the HRCT features together with the clinical parameters, resulting in a correct diagnosis. These interpretations are supported by the fact that the number of cases affected beneficially was significantly higher than that affected detrimentally (p < 0.05).
In conclusion, an ANN can differentiate among certain diffuse lung diseases using HRCT, and it can provide a useful output as a second opinion to improve the diagnostic accuracy of general radiologists.
Acknowledgments
We thank E. Lanzl for editing this manuscript.