1 Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University, Central Radiology Viewing Area, Rm. 117, 600 N. Wolfe St., Baltimore, MD 21287.
Received March 11, 2002;
accepted after revision April 1, 2002.
Supported in part by the Robert Wood Johnson Clinical Scholars Program.
Abstract
MATERIALS AND METHODS. Data from the 1064 patients who received an angiographically based diagnosis of pulmonary embolism in the Prospective Investigation of Pulmonary Embolism Diagnosis study were encoded using a previously described method. The 21 input variables represented abnormalities identified on each patient's ventilation-perfusion scan and chest radiograph. Two methods, an artificial neural network with one hidden layer and a multivariate logistic regression, were compared for accuracy in predicting the presence or absence of pulmonary embolism on subsequent pulmonary arteriography.
RESULTS. No significant difference was observed between the two methods. Areas under the receiver operating characteristic curves ± standard deviation were 0.78 ± 0.02 for the artificial neural network model and 0.79 ± 0.02 for the logistic regression model. Furthermore, use of these two methods resulted in no more diagnostic accuracy than did the use of a simple threshold model based only on the number of subsegmental perfusion defects, which was the dominant input variable.
CONCLUSION. In the study population, the usefulness of data from ventilation-perfusion scans as predictors of the presence of a pulmonary embolism was similar for the three analytic methods, a finding that reinforces the importance of making comparisons to simpler or more established methods when performing studies involving complex analytic models, such as artificial neural networks.
Introduction
Applications of artificial neural networks commonly involve supplying a collection of numeric values as input to the network. These input values may represent physiologic measurements, pixel values within a region of interest, or other clinical data. The neural network is then constructed to provide a single output value indicating the likelihood of a binary outcome, usually whether or not a particular disease is present.
Modeling a binary outcome based on a collection of input values is not a unique feature of artificial neural networks. Perhaps the best-known conventional example of a mathematic model of binary outcomes is logistic regression [12, 13]. Even though the logistic regression model can be represented as a single, relatively simple equation relating the input values to the outcome value, it is nevertheless a powerful method for analyzing multivariate data from many sources.
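For reference, the single equation mentioned above can be written in its standard textbook form as follows. The notation is chosen here for illustration and does not appear in the original text: p is the predicted probability of the outcome, x_1 through x_n are the input values, and the β terms are the fitted coefficients.

```latex
p = \frac{1}{1 + \exp\!\left[-\left(\beta_0 + \sum_{i=1}^{n} \beta_i x_i\right)\right]}
\quad\Longleftrightarrow\quad
\ln\frac{p}{1-p} = \beta_0 + \sum_{i=1}^{n} \beta_i x_i
```

The two forms are equivalent; the second shows that the model is linear in the log odds of the outcome.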
In evaluating sophisticated analytic methods such as artificial neural networks, one should consider how these complex methods perform in comparison to simpler, more conventional ones like logistic regression. The purpose of this study is to compare the performance of a neural network with a more conventional analysis method, logistic regression, concerning an important diagnostic problem in a well-known clinical data set.
The diagnosis of pulmonary embolism based on data from ventilation-perfusion scans was chosen as the clinical domain for this study. The disease continues to be a source of significant morbidity and mortality, and its diagnosis remains challenging [14]. Several artificial neural networks have been developed for the diagnosis of pulmonary embolism from findings observed on ventilation-perfusion scans [15,16,17,18,19,20,21]. These studies incorporated comparisons with the diagnostic accuracy of experienced physicians as a performance benchmark. In most cases [15,16,17,18,19,20], the overall performance of the neural network was found to be similar to that of the physicians. The conclusions of these studies may be that the neural network is a successful or promising method of data analysis for the diagnosis of pulmonary embolism; however, whether the neural network is unique in its power to predict the presence of this disease is unknown because none of the studies involved a direct comparison with any other data analysis method. In other domains such as cancer diagnosis [22] and selected large medical data sets [23], some evidence indicates that artificial neural networks may not outperform more conventional statistical methods.
Materials and Methods
The Prospective Investigation of Pulmonary Embolism Diagnosis (PIOPED) study involved 1493 patients with acute symptoms suggestive of pulmonary embolism for whom a ventilation-perfusion scan was requested. In all patients, a 133Xe ventilation scan, a 99mTc perfusion scan, and a chest radiograph were obtained. Of the 1493 patients, 1099 underwent pulmonary arteriography, the results of which established a diagnosis in 1064. (The other 35 patients had nondiagnostic arteriographic results.) I analyzed the final group of 1064 patients, which included 383 patients with an arteriographically based diagnosis of pulmonary embolism.
Data Conversion
Before subjecting any data to a mathematic model, one must define the input variables of the model and establish the methods to convert the data into the numeric input variables of the model. I used a previously published method for converting the PIOPED imaging findings into input variables [21], a technique based on the scoring of findings on ventilation scans, perfusion scans, and chest radiographs in the PIOPED study.
The data conversion process transformed the PIOPED imaging data into a total of 21 variables that served as input to the mathematic models evaluated in this study. Eighteen of these variables were taken directly from the PIOPED data input forms: Each lung was divided into upper, middle, and lower thirds, producing a total of six zones for two lungs. These six zones were evaluated for all three of the PIOPED imaging modalities (ventilation scanning, perfusion scanning, and chest radiography), resulting in 18 variables. Each of these 18 variables was given a value between 0 and 4, depending on the size of any abnormality in the corresponding lung zone visible on the corresponding imaging modality. The scoring system was as follows: 0, if the findings were normal; 1, if less than 25% of a zone was affected; 2, if between 25% and 50% of a zone was affected; 3, if between 51% and 75% of a zone was affected; and 4, if more than 75% of a zone was affected.
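The zone scoring described above can be sketched in a few lines of Python. This is an illustration only: the function and variable names are hypothetical, and the use of a numeric percentage as input is an assumption (the PIOPED forms may have recorded the categories directly). Values between 50% and 51%, which the stated scale does not cover, are assigned a score of 3 here.

```python
MODALITIES = ("ventilation", "perfusion", "radiograph")
ZONES = [f"{side}_{third}"
         for side in ("right", "left")
         for third in ("upper", "middle", "lower")]


def zone_score(percent_affected):
    """Map the percentage of a lung zone that is abnormal to a 0-4 score."""
    if not 0 <= percent_affected <= 100:
        raise ValueError("percent_affected must be between 0 and 100")
    if percent_affected == 0:
        return 0  # normal findings
    if percent_affected < 25:
        return 1
    if percent_affected <= 50:
        return 2
    if percent_affected <= 75:
        return 3  # assumption: anything above 50% and up to 75%
    return 4


def encode_zones(findings):
    """Return the 18 zone variables (3 modalities x 6 zones).

    `findings` maps (modality, zone) to a percentage; missing entries
    are treated as normal (hypothetical convention).
    """
    return [zone_score(findings.get((m, z), 0))
            for m in MODALITIES for z in ZONES]
```

For example, `encode_zones({("perfusion", "right_lower"): 60})` returns 18 values that are all 0 except for a single 3.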
Three additional variables were included in the data conversion process [21], bringing the total number of input variables to 21. The first additional variable indicated the total number of subsegmental perfusion mismatches, regardless of which lung had the mismatches. A mismatched subsegment was considered one in which the ventilation-perfusion scan showed less perfusion than ventilation.
The second additional variable indicated the presence or absence of symmetry in the number of mismatched lung subsegments. Because pulmonary emboli are most often bilateral [25], symmetry may help in differentiating perfusion defects caused by emboli from those caused by other diseases. NR and NL represent the number of mismatched subsegments in the right and left lungs, respectively, and the second variable was encoded as 0, if NR and NL both equaled 0; 1, if NR was more than NL; 2, if NR equaled NL and neither equaled 0; and 3, if NR was less than NL. The third additional variable indicated the size of the largest pleural effusion, if present, and was encoded as follows: 0, if no effusion was present; 1, if the effusion was small; 2, if the effusion was medium; and 3, if the effusion was large. The data were converted with the Stata program (version 7; Stata, College Station, TX).
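The three additional variables can be illustrated with the short Python sketch below. It follows the encoding rules stated above, but the function names and the text labels used for effusion size are hypothetical, and it is not the Stata code used in the study.

```python
def symmetry_code(n_right, n_left):
    """Encode the symmetry of mismatched subsegments (NR versus NL)."""
    if n_right == 0 and n_left == 0:
        return 0  # no mismatches in either lung
    if n_right > n_left:
        return 1
    if n_right == n_left:
        return 2  # equal and nonzero
    return 3      # fewer on the right than on the left


# Hypothetical labels for the size of the largest pleural effusion.
EFFUSION_CODES = {"none": 0, "small": 1, "medium": 2, "large": 3}


def extra_variables(n_right, n_left, effusion):
    """Return the three additional variables in the order described."""
    return [n_right + n_left,                # total subsegmental mismatches
            symmetry_code(n_right, n_left),  # symmetry of mismatches
            EFFUSION_CODES[effusion]]        # largest pleural effusion
```

For instance, a patient with two mismatched subsegments on the right, three on the left, and a small effusion would be encoded as `[5, 3, 1]`.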
Artificial Neural Network Model
The artificial neural network used in this study has been previously described [21]. This model consists of 21 input units, a layer of 15 hidden units, and one outcome unit (Fig. 1). Each of the 15 hidden units is linked to each of the 21 input units and to the outcome unit, which is a common neural network structure that differs from many neural network applications only in the number of input and hidden units. Each input variable from the data is assigned to one input unit in the neural network structure. The neural network's output unit represents the outcome variable, which is the presence (1) or absence (0) of pulmonary embolism predicted by the model. The development of a neural network depends on the determination of a parameter, called a weight, associated with each of the links between nodes. The process by which the weights are determined is called training, a process conceptually analogous to the fitting of statistical regression models.
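As an illustration of the 21-15-1 structure described above, the following Python sketch computes the output of such a network for one case. The sigmoid activation, the bias terms, and the range of the random starting weights are assumptions made for this example; the source does not specify them, and all names are hypothetical.

```python
import math
import random

N_INPUT, N_HIDDEN = 21, 15


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_weights(rng):
    """Small random starting weights; the last entry of each row is a bias
    term (assumption: the source does not say whether biases were used)."""
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(N_INPUT + 1)]
                for _ in range(N_HIDDEN)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN + 1)]
    return w_hidden, w_out


def forward(x, w_hidden, w_out):
    """Propagate 21 input values through the hidden layer to one output."""
    hidden = [sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x)))
              for w in w_hidden]
    return sigmoid(w_out[-1] + sum(wi * hi for wi, hi in zip(w_out, hidden)))


def count_links():
    """Links between units: input-to-hidden plus hidden-to-output."""
    return N_INPUT * N_HIDDEN + N_HIDDEN
```

With this structure there are 21 × 15 + 15 = 330 links, each with its own weight, before counting any bias terms.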
For this study, a standard back-propagation algorithm [1] was implemented as a C program compiled in the CodeWarrior programming environment (Metrowerks, Austin, TX) and run on a Power Macintosh G3 computer (Apple, Cupertino, CA). The neural network was trained for 200 iterations using the back-propagation algorithm. The number of iterations was intended to be an intermediate value so that there would be significant learning without loss of the ability to generalize. This technique was drawn from previous results [21].
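The training step can be sketched as follows. This is a minimal Python illustration of back-propagation for a network with one hidden layer, not the C program used in the study. Several details are assumptions: one iteration is taken to mean one pass through all training cases, weights are updated after each case, the error is squared error with sigmoid units, and the learning rate and starting weights are arbitrary.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_backprop(X, y, n_hidden=15, iterations=200, rate=0.1, seed=0):
    """Fit a one-hidden-layer network and return a prediction function."""
    rng = random.Random(seed)
    n_in = len(X[0])
    # The last weight in each row multiplies a constant 1.0 (bias, assumed).
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
           for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    for _ in range(iterations):
        for x, target in zip(X, y):
            xb = list(x) + [1.0]
            h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_h]
            hb = h + [1.0]
            out = sigmoid(sum(w * v for w, v in zip(w_o, hb)))
            # Error signal at the output unit (squared error, sigmoid unit).
            d_out = (target - out) * out * (1.0 - out)
            # Error signals at the hidden units, computed from the output
            # weights before those weights are updated.
            d_h = [d_out * w_o[j] * h[j] * (1.0 - h[j])
                   for j in range(n_hidden)]
            for j in range(n_hidden + 1):
                w_o[j] += rate * d_out * hb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_h[j][i] += rate * d_h[j] * xb[i]

    def predict(x):
        xb = list(x) + [1.0]
        hb = [sigmoid(sum(w * v for w, v in zip(row, xb)))
              for row in w_h] + [1.0]
        return sigmoid(sum(w * v for w, v in zip(w_o, hb)))

    return predict
```

Stopping after a fixed, moderate number of iterations, as described above, is one simple way to limit overfitting; the value of 200 comes from the text, whereas the other defaults are placeholders.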
Evaluation of Model Performance
The neural network was trained using the PIOPED study data. To test the model, I used a jackknife procedure, which allows training and testing the model using the entire data set while ensuring that a case is never used for both training and testing in the same repetition. In the jackknife implementation, one of the 1064 cases was selected. The model was then trained with the back-propagation algorithm using the 21 input variables and one outcome variable (the reference diagnosis) from each of the remaining 1063 cases. After training, the model was tested on the input data from the one selected case that was not used in the training process. The model calculated a predicted outcome on the basis of the input data for the selected case, and this predicted outcome was compared with the reference diagnosis for the selected case. The entire procedure was repeated 1064 times, with a different selected case left out each time. The reference diagnosis is the presence or absence of pulmonary embolism on pulmonary arteriography in the PIOPED study.
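The jackknife (leave-one-out) procedure described above can be expressed generically. In the Python sketch below, `fit` and `predict` are placeholders for any modeling method; these names and the function signature are hypothetical and are not taken from the study's software.

```python
def jackknife_predictions(cases, outcomes, fit, predict):
    """Leave each case out in turn, train on the rest, and predict it.

    cases    : list of input-variable lists, one per patient
    outcomes : list of reference diagnoses (1 = embolism, 0 = none)
    fit      : function(train_cases, train_outcomes) -> model
    predict  : function(model, case) -> predicted outcome between 0 and 1
    """
    predictions = []
    for i in range(len(cases)):
        train_cases = cases[:i] + cases[i + 1:]
        train_outcomes = outcomes[:i] + outcomes[i + 1:]
        model = fit(train_cases, train_outcomes)      # trained without case i
        predictions.append(predict(model, cases[i]))  # tested on case i only
    return predictions
```

With 1064 cases the model is fitted 1064 times, which is why the cost of fitting a single model matters when the procedure is applied to an iterative method.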
The predicted outcome variable for each repetition is a number between 0 (indicating absence of pulmonary embolism) and 1 (indicating presence of pulmonary embolism) that can be interpreted as representing the degree of certainty of the diagnosis that the model assigns to the selected case corresponding to the jackknife repetition. These assigned certainties for the diagnosis of pulmonary embolism along with the reference diagnosis were used to generate a receiver operating characteristic (ROC) curve indicating the capability of the neural network to recognize a case of pulmonary embolism.
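To make the construction of the ROC curve concrete, the Python sketch below converts a list of predicted certainties and reference diagnoses into the points of an empirical ROC curve. It is an illustration with hypothetical names, not the Stata routine used in the study, and it assumes that both positive and negative cases are present.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) pairs.

    Each distinct score is used as a cut point: a case is called positive
    when its score is at or above the cut point.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # strictest cut point: nothing called positive
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, d in zip(scores, labels) if s >= cut and d == 1)
        fp = sum(1 for s, d in zip(scores, labels) if s >= cut and d == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points
```

The curve always starts at (0, 0) and ends at (1, 1), where every case is called positive.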
Other Data Models
In similar fashion to the artificial neural network, a logistic regression model with 21 independent variables was fitted and tested using the PIOPED study data. The logistic regression model was fitted using the Stata program. An identical jackknife protocol was used. Like the neural network, the logistic regression provides a predicted outcome value that can be interpreted as the degree of certainty of disease presence, and this predicted value was used to generate a ROC curve indicating the capability of logistic regression to recognize the presence of pulmonary embolism. The jackknife procedure has been shown to be a valid evaluation of both artificial neural networks and logistic regression [26].
To provide a baseline comparison, I also evaluated a simple threshold model. In this model, the value of the most influential input variable, the number of subsegmental perfusion mismatches, was interpreted as a simple indicator of certainty of the presence of an embolism. This simple indicator of certainty was used to generate a ROC curve indicating the capability of a simple threshold model to predict the presence of pulmonary embolism.
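The threshold model needs no fitting: the mismatch count itself serves as the score, and each possible cut point yields one pair of sensitivity and specificity values. The Python sketch below illustrates this with hypothetical names and assumes that both positive and negative cases are present.

```python
def threshold_table(mismatch_counts, has_embolism):
    """Return (cut point, sensitivity, specificity) for every cut point.

    A case is called positive when its number of subsegmental perfusion
    mismatches is at or above the cut point.
    """
    n_pos = sum(has_embolism)
    n_neg = len(has_embolism) - n_pos
    rows = []
    for cut in range(0, max(mismatch_counts) + 2):
        called_pos = [c >= cut for c in mismatch_counts]
        tp = sum(1 for p, d in zip(called_pos, has_embolism) if p and d)
        tn = sum(1 for p, d in zip(called_pos, has_embolism)
                 if not p and not d)
        rows.append((cut, tp / n_pos, tn / n_neg))
    return rows
```

Plotting sensitivity against 1 minus specificity for these rows traces the ROC curve of the threshold model.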
Statistical Analysis
The area under the ROC curve was used as the main index of performance for each of the data models. The area was calculated using a nonparametric method (trapezoidal rule) instead of being derived from a fit of a binormal distribution because preliminary data analysis indicated deviation of the data from a binormal distribution. With the nonparametric method, it was not necessary to assume an underlying binormal distribution. The areas under the ROC curves were calculated and compared statistically using Stata's roctab and roccomp commands, respectively. The roccomp command compares the areas under two or more ROC curves using a nonparametric algorithm based on the chi-square distribution [27]. This method accounts for correlated data in which multiple ROC curves are calculated for the same set of cases.
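The trapezoidal-rule calculation of the area under an empirical ROC curve can be sketched in Python as follows. This is an illustration of the general technique with hypothetical names; it is not Stata's implementation, and it does not reproduce the statistical comparison of correlated areas performed by roccomp.

```python
def auc_trapezoid(scores, labels):
    """Area under the empirical ROC curve by the trapezoidal rule."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, d in zip(scores, labels) if s >= cut and d == 1)
        fp = sum(1 for s, d in zip(scores, labels) if s >= cut and d == 0)
        points.append((fp / n_neg, tp / n_pos))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # one trapezoid per segment
    return area
```

Because it works directly from the observed cut points, this calculation makes no assumption about the distribution of the scores.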
Results
Also listed in Figure 3 are the results of three additional ROC analyses, which are provided for comparison. To evaluate the importance of the number of hidden units, I constructed a neural network with only one hidden unit instead of 15. The neural network with one hidden unit was evaluated with a procedure identical to that of the model with 15 hidden units. No statistically significant difference (p = 0.76) was observed between the two neural networks.
ROC curves were also constructed for the overall probability of pulmonary embolism expressed by the consensus of the reviewers and referring physicians of the PIOPED study. The areas for these two ROC curves are given in Figure 3 and represent performance obtained without computerized modeling. The PIOPED consensus interpretations of the ventilation-perfusion scans were more accurate than the results from the neural network, but this difference was not statistically significant (p = 0.085). The referring physician's clinical impression before ventilation-perfusion scanning was statistically significantly less accurate than the prediction of the neural network with 15 hidden units (p = 0.002).
Discussion
Computationally intensive methods, such as artificial neural networks, should be compared with simpler or more conventional models if the superiority of the more complex method is to be truly established. One reason for comparing a complex model with simpler ones is exemplified by the main result of this study: there may be no significant difference among models of various complexities. In the setting of equivalent performance, one would favor a simple model over a more complex one in the interests of computational efficiency and economy. The artificial neural network requires the most computation because the back-propagation algorithm [1] used to fit the model is highly iterative. Computing a logistic regression is also iterative, but the amount of iteration is generally less than that required for the back-propagation algorithm. The threshold model is computationally simple because it requires no iteration. For example, depending on the number of input variables and the available computing power, a neural network may require many minutes to fit. Fitting a logistic regression may require a few seconds, whereas performing a simple threshold analysis is essentially instantaneous.
Efficiency and economy are not the only reasons to favor simpler models. Simpler models are often better understood than complex ones. For example, in the case of logistic regression versus neural network, the coefficients of the logistic regression model are related to the mathematic odds of having the disease in question, whereas the weights associated with each unit in the neural network have no defined meaning. As a result, the relative importance of each variable in the logistic regression model can be quantified. Such quantification is not possible with the artificial neural network. The neural network, with its complex collection of weights and units, must be viewed to some degree as a "black box" [28].
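The relationship between a logistic regression coefficient and the odds of disease can be stated explicitly. In the standard formulation, and using notation introduced here for illustration (p for the predicted probability, β_i for the coefficient of input variable x_i), a one-unit increase in x_i with all other variables held fixed multiplies the odds by a constant factor:

```latex
\text{odds} = \frac{p}{1-p},
\qquad
\frac{\text{odds}(x_i + 1)}{\text{odds}(x_i)} = e^{\beta_i}
```

This factor is the odds ratio for variable x_i. No comparably direct reading exists for an individual weight in a network with a hidden layer.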
In a review pointing out similar performances between the neural network and various statistical models for other domains, some speculations are made regarding the cause of this similarity despite the theoretic advantages of neural networks [23]. These hypothetical explanations include inadequate modeling by neural networks and several potential limitations in the data on which the models are based. In this study, the explanation of the similarity in performance of the data models is that it is a fortuitous result arising from the method by which the PIOPED study data were encoded into input variables. The mean value of each of the 21 input variables is shown in Figure 4 according to the presence or absence of pulmonary embolism. As depicted in the figure, variable 19, the total number of subsegmental perfusion mismatches, showed a large difference between its value for positive and negative cases relative to the difference between positive and negative cases for the other input variables. The dominance of a single input variable explains why even the univariate model of a simple threshold could perform on a par with logistic regression and a neural network. This explanation also illustrates the importance of devoting as much care to examining the raw data itself as to modeling it, a principle applicable to any research data.
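The kind of raw-data check described above, comparing the mean of each input variable between positive and negative cases, can be sketched in Python as follows. The function names are hypothetical, and the sketch compares raw (unstandardized) differences, which is an assumption made here; variables measured on different scales may warrant standardization before comparison.

```python
def mean_by_outcome(rows, outcomes):
    """Return (means for positive cases, means for negative cases),
    with one mean per input variable."""
    def column_means(subset):
        return [sum(col) / len(subset) for col in zip(*subset)]

    pos = [r for r, d in zip(rows, outcomes) if d == 1]
    neg = [r for r, d in zip(rows, outcomes) if d == 0]
    return column_means(pos), column_means(neg)


def largest_gap(rows, outcomes):
    """Index of the variable whose group means differ the most."""
    pos, neg = mean_by_outcome(rows, outcomes)
    gaps = [abs(p - n) for p, n in zip(pos, neg)]
    return gaps.index(max(gaps))
```

A single variable with a much larger gap than the others is a hint that a univariate model may perform nearly as well as a multivariate one.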
Some important limitations of this study are related to whether the results can be generalized. The PIOPED study data represent a population with a relatively high (383/1064, 36%) prevalence of disease, raising the possibility of both spectrum bias [29] and context bias [30]. Studies that are subject to both of these types of bias show results that would be different had the study been performed on a more representative population with lower prevalence of disease. Although results based on ROC analysis are theoretically unaffected by variations in disease prevalence, the results may nevertheless be affected if the variation in prevalence is associated with major differences in other characteristics. In my study, however, all of the data modeling methods would have been subjected to the same bias or biases, so their comparative difference (or lack thereof) would not be expected to have been affected to the same degree.
My results are based on a comparison of only three models (artificial neural network, logistic regression, and threshold), and other models may exist that could interpret the PIOPED data set significantly better than the ones evaluated in this study. Also, I used a single method of encoding the raw data into input variables; a different encoding method potentially could produce different comparative results. However, the area under the ROC curve associated with the neural network in my study, 0.78 (SD, 0.02), is similar to those of previous studies evaluating neural network analysis of ventilation-perfusion scanning data [15,16,17,18,19,20,21], in which the mean area under the ROC curve for neural networks was 0.81 (SD, 0.06). This similarity between the performance of the neural networks previously described in the literature and the performance of the neural network in this study exists despite heterogeneity in the neural network structures, encoding of input variables, and patient populations among these studies. This similarity in neural network performance suggests that data analysis methods other than the three I used would have performed much as these three did in this study.
Performance of the various artificial neural networks [15,16,17,18,19,20,21] may be similarly accurate in the prediction of the presence of pulmonary embolism, but the models differ according to the way in which their input variables are derived from the imaging data. The first type [19,20,21] relies on input variables that encode observations and interpretations made by humans. The second type [15,16,17,18] relies on computer-generated values derived directly from image processing and is therefore not subject to variations associated with human observation. Human judgment does affect both types of models, however, because humans still determine which input variables are created or chosen when the models are developed.
In conclusion, the accuracy of ventilation-perfusion scanning data in predicting the presence of pulmonary embolism in the PIOPED study population is similar for a number of analytic methods. The similarity of results is consistent with the dominant role of a single variable, the number of ventilation-perfusion mismatches, in the PIOPED data set, with little additional predictive information provided by the other variables. The results reinforce the importance of including comparisons to simpler or more conventional methods when performing studies using complex data analysis models. However, these results should not discourage the development of complex data models. It may well be that such models have not yet been developed to their full potential.