Fundamentals of Clinical Research for Radiologists
1 Department of Biostatistics and Epidemiology, Cleveland Clinic Foundation, 9500 Euclid Ave., Cleveland, OH.
Received October 28, 2004; accepted after revision November 3, 2004.
Series editors: Nancy Obuchowski, C. Craig Blackmore, Steven Karlik, and
Caroline Reinhold.
Why ROC?
In module 13 [1], we defined the basic measures of accuracy: sensitivity (the probability the diagnostic test is positive for disease for a patient who truly has the disease) and specificity (the probability the diagnostic test is negative for disease for a patient who truly does not have the disease). These measures require a decision rule (or positivity threshold) for classifying the test results as either positive or negative. For example, in mammography the BI-RADS (Breast Imaging Reporting and Data System) scoring system is used to classify mammograms as normal, benign, probably benign, suspicious, or malignant. One positivity threshold is classifying probably benign, suspicious, and malignant findings as positive (and classifying normal and benign findings as negative). Another positivity threshold is classifying suspicious and malignant findings as positive. Each threshold leads to different estimates of sensitivity and specificity. Here, the second threshold would have higher specificity than the first but lower sensitivity. Also, note that trained mammographers use the scoring system differently. Even the same mammographer may use the scoring system differently on different reviewing occasions (e.g., classifying the same mammogram as probably benign on one interpretation and as suspicious on another), leading to different estimates of sensitivity and specificity even with the same threshold.
Which decision threshold should be used to classify test results? How will the choice of a decision threshold affect comparisons between two diagnostic tests or between two radiologists? These are critical questions when computing sensitivity and specificity, yet the choice for the decision threshold is often arbitrary.
ROC curves, although constructed from sensitivity and specificity, do not depend on the decision threshold. In an ROC curve, every possible decision threshold is considered. An ROC curve is a plot of a test's false-positive rate (FPR), or 1 − specificity (plotted on the horizontal axis), versus its sensitivity (plotted on the vertical axis). Each point on the curve represents the sensitivity and FPR at a different decision threshold. The plotted (FPR, sensitivity) coordinates are connected with line segments to construct an empiric ROC curve. Figure 1 illustrates an empiric ROC curve constructed from the fictitious mammography data in Table 1. The empiric ROC curve has four points corresponding to the four decision thresholds described in Table 1.
An ROC curve begins at the (0, 0) coordinate, corresponding to the strictest decision threshold whereby all test results are negative for disease (Fig. 1). The ROC curve ends at the (1, 1) coordinate, corresponding to the most lenient decision threshold whereby all test results are positive for disease. An empiric ROC curve has h − 1 additional coordinates, where h is the number of unique test results in the sample. In Table 1 there are 200 test results, one for each of the 200 patients in the sample, but there are only five unique results: normal, benign, probably benign, suspicious, and malignant. Thus, h = 5, and there are four coordinates plotted in Figure 1 corresponding to the four decision thresholds described in Table 1.
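As a rough illustration of how the coordinates of an empiric ROC curve arise, the following Python sketch tallies sensitivity and FPR at each decision threshold for a five-category rating scale. The function name and the patient counts are hypothetical and are not the values in Table 1.

```python
from itertools import accumulate


def empiric_roc_points(diseased_counts, nondiseased_counts):
    """Return (FPR, sensitivity) pairs from the strictest to the most lenient threshold.

    Each argument lists the number of patients in every rating category,
    ordered from least to most suspicious.
    """
    n_diseased = sum(diseased_counts)
    n_nondiseased = sum(nondiseased_counts)
    # Running totals of positives as the threshold is relaxed one category at a
    # time, starting with "nothing is called positive" (0 true, 0 false positives).
    true_pos = [0] + list(accumulate(reversed(diseased_counts)))
    false_pos = [0] + list(accumulate(reversed(nondiseased_counts)))
    return [(fp / n_nondiseased, tp / n_diseased)
            for fp, tp in zip(false_pos, true_pos)]


# Hypothetical counts for normal, benign, probably benign, suspicious, malignant.
# These are illustrative only and are NOT the values in Table 1.
diseased = [5, 10, 15, 30, 40]
nondiseased = [50, 25, 15, 7, 3]

points = empiric_roc_points(diseased, nondiseased)
for fpr, sens in points:
    print(f"FPR = {fpr:.2f}, sensitivity = {sens:.2f}")
```

With h = 5 categories the list holds six points: the fixed (0, 0) and (1, 1) end points plus the h − 1 = 4 coordinates that correspond to the four decision thresholds.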
The line connecting the (0, 0) and (1, 1) coordinates is called the "chance diagonal" and represents the ROC curve of a diagnostic test with no ability to distinguish patients with versus those without disease. An ROC curve that lies above the chance diagonal, such as the ROC curve for our fictitious mammography example, has some diagnostic ability. The further away an ROC curve is from the chance diagonal, and therefore, the closer to the upper left-hand corner, the better discriminating power and diagnostic accuracy the test has.
In characterizing the accuracy of a diagnostic (or screening) test, the ROC curve of the test provides much more information about how the test performs than just a single estimate of the test's sensitivity and specificity [1, 2]. Given a test's ROC curve, a clinician can examine the trade-offs in sensitivity versus specificity for various decision thresholds. Based on the relative costs of false-positive and false-negative errors and the pretest probability of disease, the clinician can choose the optimal decision threshold for each patient. This idea is discussed in more detail in a later section of this article. Often, patient management is more complex than is allowed with a decision threshold that classifies the test results into positive or negative. For example, in mammography suspicious and malignant findings are usually followed up with biopsy, probably benign findings usually result in a follow-up mammogram in 3-6 months, and normal and benign findings are considered negative.
When comparing two or more diagnostic tests, ROC curves are often the only valid method of comparison. Figure 2A, 2B illustrates two scenarios in which an investigator, comparing two diagnostic tests, could be misled by relying on only a single sensitivity-specificity pair. Consider Figure 2A. Suppose a more expensive or risky test (represented by ROC curve Y) was reported to have the following accuracy: sensitivity = 0.40, specificity = 0.90 (labeled as coordinate 1 in Fig. 2A); a less expensive or less risky test (represented by ROC curve X) was reported to have the following accuracy: sensitivity = 0.80, specificity = 0.65 (labeled as coordinate 2 in Fig. 2A). If the investigator is looking for the test with better specificity, then he or she may choose the more expensive, risky test, not realizing that a simple change in the decision threshold of the less expensive, less risky test could provide the desired specificity at an even higher sensitivity (coordinate 3 in Fig. 2A).
Now consider Figure 2B. The ROC curve for test Z is superior to that of test X for a narrow range of FPRs (0.0-0.08); otherwise, diagnostic test X has superior accuracy. A comparison of the tests' sensitivities at low FPRs would be misleading unless the diagnostic tests are useful only at these low FPRs.
To compare two or more diagnostic tests, it is convenient to summarize the tests' accuracies with a single summary measure. Several such summary measures are used in the literature. One is Youden's index, defined as sensitivity + specificity − 1 [2]. Note, however, that Youden's index is affected by the choice of the decision threshold used to define sensitivity and specificity. Thus, different decision thresholds yield different values of the Youden's index for the same diagnostic test.
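Another summary measure commonly used is the probability of a correct diagnosis, often referred to simply as "accuracy" in the literature. It can be shown that the probability of a correct diagnosis is equivalent to
$$P(\text{correct diagnosis}) = \text{PREV} \times \text{sensitivity} + (1 - \text{PREV}) \times \text{specificity}, \tag{1}$$
where PREV is the prevalence of disease in the study sample.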
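To see why these two measures are tied to the threshold and to the sample, the short Python sketch below evaluates Youden's index and the probability of a correct diagnosis (prevalence × sensitivity + (1 − prevalence) × specificity) for the sensitivity-specificity pair quoted earlier for test X. The function names and the two prevalence values are assumptions chosen for illustration.

```python
def youden_index(sensitivity, specificity):
    """Youden's index: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def probability_correct(sensitivity, specificity, prevalence):
    """Probability of a correct diagnosis at a given sample prevalence."""
    return prevalence * sensitivity + (1.0 - prevalence) * specificity


# Sensitivity and specificity reported for test X at one decision threshold.
sens, spec = 0.80, 0.65

j = youden_index(sens, spec)
# Hypothetical sample prevalences of 10% and 50%.
acc_low = probability_correct(sens, spec, prevalence=0.10)
acc_high = probability_correct(sens, spec, prevalence=0.50)

print(f"Youden's index: {j:.3f}")
print(f"P(correct) at 10% prevalence: {acc_low:.3f}")
print(f"P(correct) at 50% prevalence: {acc_high:.3f}")
```

The same test at the same threshold yields a different "accuracy" in the two samples, which is the dependence on prevalence that ROC-based summary measures avoid.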
Summary measures of accuracy derived from the ROC curve describe the inherent accuracy of a diagnostic test because they are not affected by the choice of the decision threshold and they are not affected by the prevalence of disease in the study sample. Thus, these summary measures are preferable to Youden's index and the probability of a correct diagnosis [2]. The most popular summary measure of accuracy is the area under the ROC curve, often denoted as "AUC" for area under curve. It ranges in value from 0.5 (chance) to 1.0 (perfect discrimination or accuracy). The chance diagonal in Figure 1 has an AUC of 0.5. In Figure 2A the areas under both ROC curves are the same, 0.841. There are three interpretations for the AUC: the average sensitivity over all false-positive rates; the average specificity over all sensitivities [3]; and the probability that, when presented with a randomly chosen patient with disease and a randomly chosen patient without disease, the results of the diagnostic test will rank the patient with disease as having higher suspicion for disease than the patient without disease [4].
The AUC is often too global a summary measure. Instead, for a particular clinical application, a decision threshold is chosen so that the diagnostic test will have a low FPR (e.g., FPR < 0.10) or a high sensitivity (e.g., sensitivity > 0.80). In these circumstances, the accuracy of the test at the specified FPRs (or specified sensitivities) is a more meaningful summary measure than the area under the entire ROC curve. The partial area under the ROC curve, PAUC (e.g., the PAUC where FPR < 0.10, or the PAUC where sensitivity > 0.80), is then an appropriate summary measure of the diagnostic test's accuracy. In Figure 2B, the PAUCs for the two tests where the FPR is between 0.0 and 0.20 are the same, 0.112. For interpretation purposes, the PAUC is often divided by its maximum value, given by the range (i.e., maximum − minimum) of the FPRs (or false-negative rates [FNRs]) [5]. The PAUC divided by its maximum value is called the partial area index and takes on values between 0.5 and 1.0, as does the AUC. It is interpreted as the average sensitivity for the FPRs examined (or average specificity for the FNRs examined). In our example, the range of the FPRs of interest is 0.20 − 0.0 = 0.20; thus, the average sensitivity for FPRs less than 0.20 for diagnostic tests X and Z in Figure 2B is 0.56.
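The partial area and its index can be sketched in Python for a piecewise-linear (empiric) ROC curve as follows. This is a minimal illustration under stated assumptions: the function name and the ROC coordinates are hypothetical, and the trapezoid rule stands in for whatever estimator a given software package uses.

```python
def partial_auc(points, fpr_max):
    """Trapezoid area under a piecewise-linear ROC curve for 0 <= FPR <= fpr_max.

    points: (FPR, sensitivity) pairs sorted by increasing FPR, starting at (0, 0).
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= fpr_max:
            break
        if x1 > fpr_max:
            # Cut the last segment at fpr_max by linear interpolation.
            y1 = y0 + (y1 - y0) * (fpr_max - x0) / (x1 - x0)
            x1 = fpr_max
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC coordinates, not taken from Figure 2B.
roc = [(0.0, 0.0), (0.05, 0.50), (0.20, 0.80), (1.0, 1.0)]

pauc = partial_auc(roc, fpr_max=0.20)
partial_area_index = pauc / (0.20 - 0.0)   # divide by the range of FPRs
print(f"PAUC = {pauc:.3f}, partial area index = {partial_area_index:.2f}")

# The arithmetic quoted in the text for Figure 2B: 0.112 / 0.20
print(f"{0.112 / 0.20:.2f}")
```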
Although the ROC curve has many advantages in characterizing the accuracy of a diagnostic test, it also has some limitations. One criticism is that the ROC curve extends beyond the clinically relevant area of potential clinical interpretation. Of course, the PAUC was developed to address this criticism. Another criticism is that it is possible for a diagnostic test with perfect discrimination between diseased and nondiseased patients to have an AUC of 0.5. Hilden [6] describes this unusual situation and offers solutions. When comparing two diagnostic tests' accuracies, the tests' ROC curves can cross, as in Figure 2A, 2B. A comparison of these tests based only on their AUCs can be misleading. Again, the PAUC attempts to address this limitation. Last, some [6, 7] criticize the ROC curve, and especially the AUC, for not incorporating the pretest probability of disease and the costs of misdiagnoses.
The ROC Study
Weinstein et al. [1] describe the common features of a study of the accuracy of a diagnostic test. These include samples from both patients with and those without the disease of interest and a reference standard for determining whether positive test results are true-positives or false-positives, and whether negative test results are true-negatives or false-negatives. They also discuss the need to blind reviewers who are interpreting test images and other relevant biases common to these types of studies.
In ROC studies we also require that the test results, or the interpretations of the test images, be assigned a numeric value or rank. These numeric measurements or ranks are the basis for defining the decision thresholds that yield the estimates of sensitivity and specificity that are plotted to form the ROC curve. Some diagnostic tests yield an objective measurement (e.g., attenuation value of a lesion). The decision thresholds for constructing the ROC curve are based on increasing the values of the attenuation coefficient. Other diagnostic tests must be interpreted by a trained observer, often a radiologist, and so the interpretation is subjective. Two general scales are often used in radiology for observers to assign a value to their subjective interpretation of an image. One scale is the 5-point rank scale: 1 = definitely normal, 2 = probably normal, 3 = possibly abnormal or equivocal, 4 = probably abnormal, and 5 = definitely abnormal.
The other popular scale is the 0-100% confidence scale, where 0% implies that the observer is completely confident in the absence of the disease of interest, and 100% implies that the observer is completely confident in the presence of the disease of interest. The two scales have strengths and weaknesses [2, 8], but both are reasonably well suited to radiology research. In mammography a rating scale already exists, the BI-RADS score, which can be used to form decision thresholds from least to most suspicion for the presence of breast cancer.
When the diagnostic test requires a subjective interpretation by a trained reviewer, the reviewer becomes part of the diagnostic process [9]. Thus, to properly characterize the accuracy of the diagnostic test, we must include multiple reviewers in the study. This is the so-called MRMC, multiple-reader multiple-case, ROC study. Much has been written about the design and analysis of MRMC studies [10-20]. We mention here only the basic design of MRMC studies, and in a later subsection we describe their statistical analysis.
The usual design for the MRMC study is a factorial design, in which every reviewer interprets the image (or images if there is more than one test) of every patient. Thus, if there are R reviewers, C patients, and I diagnostic tests, then each reviewer interprets C × I images, and the study involves R × C × I total interpretations. The accuracy of each reviewer with each diagnostic test is characterized by an ROC curve, so R × I ROC curves are constructed. Constructing pooled or consensus ROC curves is not the goal of these studies. Rather, the primary goals are to document the variability in diagnostic test accuracy between reviewers and report the average, or typical, accuracy of reviewers. In order for the results of the study to be generalizable to the relevant patient and reviewer populations, representative samples from both populations are needed for the study. Often expert reviewers take part in studies of diagnostic test accuracy, but the accuracy for a nonexpert may be considerably less. An excellent illustration of the issues involved in sampling reviewers for an MRMC study can be found in the study by Beam et al. [21].
Examples of ROC Studies in Radiology
The radiology literature, and the clinical laboratory and more general medical literature, contain many excellent examples of how ROC curves are used to characterize the accuracy of a diagnostic test and to compare accuracies of diagnostic tests. We briefly describe here three recent examples of ROC curves being used in the radiology literature.
Kim et al. [22] conducted a prospective study to determine if rectal distention using warm water improves the accuracy of MRI for preoperative staging of rectal cancer. After MRI, the patients underwent surgical resection, considered the gold standard regarding the invasion of adjacent structures and regional lymph node involvement. Four observers, unaware of the pathology results, independently scored the MR images using 4- and 5-point rating scales. Using statistical methods for MRMC studies [13], the authors determined that typical reviewers' accuracy for determining outer wall penetration is improved with rectum distention, but that reviewer accuracy for determining regional lymph node involvement is not affected.
Osada et al. [23] used ROC analysis to assess the ability of MRI to predict fetal pulmonary hypoplasia. They imaged 87 fetuses, measuring both lung volume and signal intensity. An ROC curve based on lung volume showed that lung volume has some ability to discriminate between fetuses who will have good versus those who will have poor respiratory outcome after birth. An ROC curve based on the combined information from lung volume and signal intensity, however, has superior accuracy. For more information on the optimal way to combine measures or test results, see the article by Pepe and Thompson [24].
In a third study, Zheng et al. [25] assessed how the accuracy of a mammographic computer-aided detection (CAD) scheme was affected by restricting the maximum number of regions that could be identified as positive. Using a sample of 300 cases with a malignant mass and 200 normals, the investigators applied their CAD system, each time reducing the maximum number of positive regions that the CAD system could identify from seven to one. A special ROC technique called "free-response receiver operating characteristic curves" (FROC) was used. The horizontal axis of the FROC curve differs from the traditional ROC curve in that it gives the average number of false-positives per image. Zheng et al. concluded that limiting the maximum number of positive regions that the CAD could identify improves the overall accuracy of CAD in mammography. For more information on FROC curves and related methods, I refer you to other articles [26-29].
Statistical Methods for ROC Analysis
Fitting Smooth ROC Curves
In Figure 1 we saw the empiric ROC curve for the test results in Table 1. The curve was constructed with line segments connecting the observed points on the ROC curve. Empiric ROC curves often have a jagged appearance, as seen in Figure 1, and often lie slightly below the "true," smooth ROC curve, that is, the test's ROC curve if it were constructed with an infinite number of points (not just the four points in Fig. 1) and an infinitely large sample size. A smooth curve gives us a better idea of the relationship between the diagnostic test and the disease. In this subsection we describe some methods for constructing smooth ROC curves.
The most popular method of fitting a smooth ROC curve is to assume that the test results (e.g., the BI-RADS scores in Table 1) come from two unobserved distributions, one distribution for the patients with disease and one for the patients without the disease. Usually it is assumed that these two distributions can be transformed to normal distributions, referred to as the binormal assumption. It is the unobserved, underlying distributions that we assume can be transformed to follow a binormal distribution, and not the observed test results. Figure 3 illustrates the hypothesized unobserved binormal distribution estimated for the observed BI-RADS results in Table 1. Note how the distributions for the diseased and nondiseased patients overlap.
Let the unobserved binormal variables for the nondiseased and diseased patients have means $\mu_0$ and $\mu_1$, and variances $\sigma_0^2$ and $\sigma_1^2$, respectively. Then it can be shown [30] that the ROC curve is completely described by two parameters:
$$A = \frac{\mu_1 - \mu_0}{\sigma_1} \tag{2}$$
$$B = \frac{\sigma_0}{\sigma_1} \tag{3}$$
(See Appendix 1 for a formula that links parameters A and B to the ROC curve.) Figure 4 illustrates three ROC curves. Parameter A was set to be constant at 1.0 and parameter B varies as follows: 0.33 (the underlying distribution of the diseased patients is three times more variable than that of the nondiseased patients), 1.0 (the two distributions have the same SD), and 3.0 (the underlying distribution of the nondiseased patients is three times more variable than that of the diseased patients). As one can see, the curves differ dramatically with changes in parameter B. Parameter A, on the other hand, determines how far the curve is above the chance diagonal (where A = 0); for a constant B parameter, the greater the value of A, the higher the ROC curve lies (i.e., greater accuracy).
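The effect of parameters A and B on the shape of the curve can be explored with a few lines of Python. The sketch below uses the relationship given in Appendix 1, in which a threshold c gives FPR = 1 − Φ(c) and sensitivity = 1 − Φ(B × c − A); the function name and the grid of FPRs are assumptions made for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def binormal_sensitivity(fpr, a, b):
    """Sensitivity on the binormal ROC curve at a given FPR (0 < fpr < 1)."""
    c = STD_NORMAL.inv_cdf(1.0 - fpr)       # threshold c with 1 - Phi(c) = FPR
    return 1.0 - STD_NORMAL.cdf(b * c - a)  # sensitivity = 1 - Phi(B*c - A)


# Parameter A fixed at 1.0, parameter B varied as in Figure 4.
for b in (0.33, 1.0, 3.0):
    curve = [(fpr, binormal_sensitivity(fpr, a=1.0, b=b))
             for fpr in (0.05, 0.10, 0.25, 0.50, 0.75)]
    print(f"B = {b}: " + ", ".join(f"({x:.2f}, {y:.2f})" for x, y in curve))
```

Setting A = 0 and B = 1 returns sensitivity equal to the FPR at every threshold, which is the chance diagonal.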
|
Parameters A and B can be estimated from data such as in Table 1 using maximum likelihood methods [30, 31]. For the data in Table 1, the maximum likelihood estimates (MLEs) of parameters A and B are 2.27 and 1.70, respectively; the smooth ROC curve is given in Figure 1. Fortunately, some useful software [32] has been written to perform the necessary calculations of A and B, along with estimation of the area under the smooth curve (see next subsection), its SE and confidence interval (CI), and CIs for the ROC curve itself (see Appendix 1).
Dorfman and Alf [30] suggested a statistical test to evaluate whether the binormal assumption was reasonable for a given data set. Others [33, 34] have shown through empiric investigation and simulation studies that many different underlying distributions are well approximated by the binormal assumption.
When the diagnostic test results are themselves a continuous measurement (e.g., CT attenuation values, or measured lesion diameter), it may not be necessary to assume the existence of an unobserved, underlying distribution. Sometimes continuous-scale test results themselves follow a binormal distribution, but caution should be taken that the fit is good (see the article by Goddard and Hinberg [35] for a discussion of the resulting bias when the distribution is not truly binormal yet the binormal distribution is assumed). Zou et al. [36] suggest using a Box-Cox transformation to transform data to binormality. Alternatively, one can use software like ROCKIT [32] that will bin the test results into an optimal number of categories and apply the same maximum likelihood methods as mentioned earlier for rating data like the BI-RADS scores.
More elaborate models for the ROC curve that can take into account covariates (e.g., the patient's age, symptoms) have also been developed in the statistics literature [37-39] and will become more accessible as new software is written.
Estimating the Area Under the ROC Curve
Estimation of the area under the smooth curve, assuming a binormal distribution, is described in Appendix 1. In this subsection, we describe and illustrate estimation of the area under the empiric ROC curve. The process of estimating the area under the empiric ROC curve is nonparametric, meaning that no assumptions are made about the distribution of the test results or about any hypothesized underlying distribution. The estimation works for tests scored with a rating scale, a 0-100% confidence scale, or a true continuous-scale variable.
The process of estimating the area under the empiric ROC curve involves four simple steps: First, the test result of a patient with disease is compared with the test result of a patient without disease. If the former test result indicates more suspicion of disease than the latter test result, then a score of 1 is assigned. If the test results are identical, then a score of 1/2 is assigned. If the diseased patient has a test result indicating less suspicion for disease than the test result of the nondiseased patient, then a score of 0 is assigned. It does not matter which diseased and nondiseased patient you begin with. Using the data in Table 1 as an illustration, suppose we start with a diseased patient assigned a test result of "normal" and a nondiseased patient assigned a test result of "normal." Because their test results are the same, this pair is assigned a score of 1/2.
Second, repeat the first step for every possible pair of diseased and nondiseased patients in your sample. In Table 1 there are 100 diseased patients and 100 nondiseased patients, thus 10,000 possible pairs. Because there are only five unique test results, the 10,000 possible pairs can be scored easily, as in Table 2.
Third, sum the scores of all possible pairs. From Table 2, the sum is 8,632.5.
Fourth, divide the sum from step 3 by the number of pairs in the study sample. In our example we have 10,000 pairs. Dividing the sum from step 3 by 10,000 gives us 0.86325, which is our estimate of the area under the empiric ROC curve. Note that this method of estimating the area under the empiric ROC curve gives the same result as one would obtain by fitting trapezoids under the curve and summing the areas of the trapezoids (so-called trapezoid method).
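The four steps can be sketched in Python as shown here. Because rating data have only a few unique results, the pairs are scored category by category rather than patient by patient. The function names and the counts are hypothetical; the counts are not the values in Table 1, so the result differs from the 0.86325 obtained above.

```python
def pairwise_auc(diseased_counts, nondiseased_counts):
    """Nonparametric AUC from counts per category (least to most suspicious)."""
    total = 0.0
    for i, n_dis in enumerate(diseased_counts):
        for j, n_non in enumerate(nondiseased_counts):
            if i > j:        # diseased patient rated more suspicious
                score = 1.0
            elif i == j:     # identical test results
                score = 0.5
            else:            # diseased patient rated less suspicious
                score = 0.0
            total += score * n_dis * n_non   # every pair in this cell gets the score
    n_pairs = sum(diseased_counts) * sum(nondiseased_counts)
    return total / n_pairs


def trapezoid_auc(diseased_counts, nondiseased_counts):
    """Area under the empiric ROC curve by summing trapezoids."""
    n_dis, n_non = sum(diseased_counts), sum(nondiseased_counts)
    area, fpr, sens = 0.0, 0.0, 0.0
    # Relax the threshold one category at a time, most suspicious first.
    for d, n in zip(reversed(diseased_counts), reversed(nondiseased_counts)):
        new_fpr, new_sens = fpr + n / n_non, sens + d / n_dis
        area += (new_fpr - fpr) * (sens + new_sens) / 2.0
        fpr, sens = new_fpr, new_sens
    return area


# Hypothetical counts for five rating categories (NOT the Table 1 data).
diseased = [5, 10, 15, 30, 40]
nondiseased = [50, 25, 15, 7, 3]

auc_pairs = pairwise_auc(diseased, nondiseased)
auc_trap = trapezoid_auc(diseased, nondiseased)
print(f"pairwise AUC = {auc_pairs:.5f}, trapezoid AUC = {auc_trap:.5f}")
```

The two functions return the same value, illustrating the equivalence between the pairwise scoring and the trapezoid method.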
The variance of the estimated area under the empiric ROC curve is given by DeLong et al. [40] and can be used for constructing CIs; software programs are available for estimating the nonparametric AUC and its variance [41].
Comparing the AUCs or PAUCs of Two Diagnostic Tests
To test whether the AUC (or PAUC) of one diagnostic test (denoted by AUC1) equals the AUC (or PAUC) of another diagnostic test (AUC2), the following test statistic is calculated:
$$Z = \frac{\mathrm{AUC}_1 - \mathrm{AUC}_2}{\sqrt{\mathrm{var}_1 + \mathrm{var}_2 - 2\,\mathrm{cov}}} \tag{4}$$
where var1 and var2 are the estimated variances of AUC1 and AUC2, and cov is the estimated covariance between them (zero when the two tests are evaluated on independent samples of patients).
Under the null hypothesis of equal areas, the test statistic Z follows a standard normal distribution. For a two-tailed test with significance level of 0.05, the critical values are −1.96 and +1.96. If Z is less than −1.96, then we conclude that the accuracy of diagnostic test 2 is superior to that of diagnostic test 1; if Z exceeds +1.96, then we conclude that the accuracy of diagnostic test 1 is superior to that of diagnostic test 2.
A two-sided CI for the difference in AUC (or PAUC) between two diagnostic tests can be calculated from
$$LL = (\mathrm{AUC}_1 - \mathrm{AUC}_2) - z_{\alpha/2} \times \sqrt{\mathrm{var}_1 + \mathrm{var}_2 - 2\,\mathrm{cov}} \tag{5}$$
$$UL = (\mathrm{AUC}_1 - \mathrm{AUC}_2) + z_{\alpha/2} \times \sqrt{\mathrm{var}_1 + \mathrm{var}_2 - 2\,\mathrm{cov}} \tag{6}$$
where LL and UL are the lower and upper limits of the CI, and $z_{\alpha/2}$ is a value from the standard normal distribution corresponding to a probability of $\alpha/2$. For example, to construct a 95% CI, $\alpha$ = 0.05, thus $z_{\alpha/2}$ = 1.96.
Consider the ROC curves in Figure 2A. The estimated areas under the smooth ROC curves of the two tests are the same, 0.841. The PAUCs where the FPR is less than 0.20, however, differ. From the estimated variances and covariance in Table 3, the value of the Z statistic for comparing the PAUCs is 1.77, which is not statistically significant. The 95% CI for the difference in PAUCs is more informative: (−0.004 to 0.086); the CI for the partial area index is (−0.02 to 0.43). The CI contains large positive differences, suggesting that more research is needed to investigate the relative accuracies of these two diagnostic tests for FPRs less than 0.20.
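A comparison of this kind can be sketched in Python as follows, assuming the usual form in which the difference in areas is divided by the square root of var1 + var2 − 2 × cov. The function name and all numeric inputs are hypothetical; they are not the estimates in Table 3.

```python
from math import sqrt
from statistics import NormalDist


def compare_aucs(auc1, auc2, var1, var2, cov=0.0, alpha=0.05):
    """Return the Z statistic and a two-sided CI for AUC1 - AUC2.

    var1, var2: estimated variances of the two areas.
    cov: estimated covariance (0 for independent patient samples).
    """
    diff = auc1 - auc2
    se_diff = sqrt(var1 + var2 - 2.0 * cov)
    z = diff / se_diff
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # 1.96 for alpha = 0.05
    return z, (diff - z_crit * se_diff, diff + z_crit * se_diff)


# Hypothetical estimates from a paired design (same patients, both tests).
z, (lower, upper) = compare_aucs(auc1=0.85, auc2=0.80,
                                 var1=0.0009, var2=0.0010, cov=0.0003)
print(f"Z = {z:.2f}, 95% CI for the difference = ({lower:.3f} to {upper:.3f})")
```

Because the interval here spans zero, these hypothetical data would not show a statistically significant difference at the 0.05 level, in agreement with |Z| < 1.96.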
Analysis of MRMC ROC Studies
Multiple published methods discuss performing the statistical analysis of MRMC studies [13-20]. The methods are used to construct CIs for diagnostic accuracy and statistical tests for assessing differences in accuracy between tests. A statistical overview of the methods is given elsewhere [10]. Here, we briefly mention some of the key issues of MRMC ROC analyses.
Fixed- or random-effects models. The MRMC study has two samples, a sample of patients and a sample of reviewers. If the study results are to be generalized to patients similar to ones in the study sample and to reviewers similar to ones in the study sample, then a statistical analysis that treats both patients and reviewers as random effects should be used [13, 14, 17-20]. If the study results are to be generalized to just patients similar to ones in the study sample, then the patients are treated as random effects but the reviewers should be treated as fixed effects [13-20]. Some of the statistical methods can treat reviewers as either random or fixed, whereas other methods treat reviewers only as fixed effects.
Parametric or nonparametric. Some of the methods rely on models that make strong assumptions about how the accuracies of the reviewers are correlated and distributed (parametric methods) [13, 14], other methods are more flexible [15, 20], and still others make no assumptions [16-19] (nonparametric methods). The parametric methods may be more powerful when their assumptions are met, but often it is difficult to determine if the assumptions are met.
Covariates. Reviewers' accuracy may be affected by their training or experience or by characteristics of the patients (e.g., age, sex, stage of disease, comorbidities). These variables are called covariates. Some of the statistical methods [15, 20] have models that can include covariates. These models provide valuable insight into the variability between reviewers and between patients.
Software. Software is available for public use for some of the methods [32, 42, 43]; the authors of the other methods may be able to provide software if contacted.
Determining Sample Size for ROC Studies
Many issues must be considered in determining the number of patients needed for an ROC study. We list several of the key issues and some useful references here, followed by a simple illustration. Software is also available for determining the required sample size for some ROC study designs [32, 41].
Consider the following example. Suppose an investigator wants to conduct a study to determine if MRI can distinguish benign from malignant breast lesions. Patients with a suspicious lesion detected on mammography will be prospectively recruited to undergo MRI before biopsy. The pathology results will be the reference standard. The MR images will be interpreted independently by two reviewers; they will score the lesions using a 0-100% confidence scale. An ROC curve will be constructed for each reviewer; AUCs will be estimated, and 95% CIs for the AUCs will be constructed. If MRI shows some promise, the investigator will plan a larger MRMC study.
The investigator expects 20-40% of patients to have pathologically confirmed breast cancer (PREVp = 0.2-0.4); thus, k = 1.5-4.0, where k is the ratio of patients without disease to patients with disease in the study sample. The investigator expects the AUC of MRI to be approximately 0.80 or higher. The variance function of the AUC often used for sample size calculations is as follows:
$$VF = \left(0.0099 \times e^{-A^2/2}\right) \times \left[\left(5A^2 + 8\right) + \frac{A^2 + 8}{k}\right] \tag{7}$$
where $A = \Phi^{-1}(\mathrm{AUC}) \times 1.414$, and $\Phi^{-1}$ is the inverse of the cumulative normal distribution function [2]. For our example, AUC = 0.80; thus $\Phi^{-1}(0.80)$ = 0.84 and A = 1.18776. The variance function, VF, equals (0.00489) × [(15.05387) + (9.41077) / 4.0] = 0.08512, where we have set k = 4.0. For k = 1.5, the VF = 0.10429.
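The variance function can be checked numerically with a short Python sketch, assuming the form VF = (0.0099 × e^(−A²/2)) × [(5A² + 8) + (A² + 8)/k] with A = Φ⁻¹(AUC) × 1.414. The function name is hypothetical. The code uses the unrounded value of Φ⁻¹(0.80), so its results differ from the hand calculation above in roughly the fourth decimal place.

```python
from math import exp, sqrt
from statistics import NormalDist


def variance_function(auc, k):
    """Variance function of the AUC for sample size planning.

    auc: conjectured area under the ROC curve.
    k: ratio of patients without disease to patients with disease.
    """
    a = NormalDist().inv_cdf(auc) * sqrt(2.0)   # A = inverse normal CDF x 1.414
    return 0.0099 * exp(-a * a / 2.0) * ((5.0 * a * a + 8.0) + (a * a + 8.0) / k)


vf_k4 = variance_function(0.80, k=4.0)     # sample prevalence of 20%
vf_k15 = variance_function(0.80, k=1.5)    # sample prevalence of 40%
print(f"VF (k = 4.0) = {vf_k4:.4f}")
print(f"VF (k = 1.5) = {vf_k15:.4f}")
```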
Suppose the investigator wants a 95% CI no wider than 0.10. That is, if the estimated AUC from the study is 0.80, then the lower bound of the CI should not be less than 0.75 and the upper bound should not exceed 0.85. A formula for calculating the required sample size for a CI is
$$N = \frac{z_{\alpha/2}^{2} \times VF}{L^{2}} \tag{8}$$
where $z_{\alpha/2}$ = 1.96 for a 95% CI and L is the desired half-width of the CI. Here, L = 0.05. N is the number of patients with disease needed for the study; the total number of patients needed for the study is N × (1 + k). For our example, N equals [1.96² × 0.08512] / 0.05² = 130.8 for k = 4.0, and 160.3 for k = 1.5. Thus, depending on the unknown prevalence of breast cancer in the study sample, the investigator needs to recruit perhaps as few as 401 total patients (if the sample prevalence is 40%) but perhaps as many as 654 (if the sample prevalence is only 20%).
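The sample size arithmetic can be reproduced with the following Python sketch, assuming N = z² × VF / L² for the number of patients with disease and N × (1 + k) for the total. The function name is hypothetical, and the VF values are those quoted in the text.

```python
from math import ceil
from statistics import NormalDist


def diseased_patients_needed(vf, half_width, alpha=0.05):
    """Number of patients with disease for a CI of the given half-width."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # about 1.96 for a 95% CI
    return z * z * vf / (half_width * half_width)


# (k, VF) pairs from the worked example: prevalence 20% and 40%.
results = {}
for k, vf in ((4.0, 0.08512), (1.5, 0.10429)):
    n_diseased = diseased_patients_needed(vf, half_width=0.05)
    total = ceil(n_diseased * (1.0 + k))          # round up to whole patients
    results[k] = (n_diseased, total)
    print(f"k = {k}: N = {n_diseased:.1f} with disease, {total} patients in total")
```

Rounding up at the last step gives the totals of 654 and 401 patients quoted above.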
Finding the Optimal Point on the Curve
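Metz [46] derived a formula for determining the optimal decision threshold on the ROC curve, where "optimal" is in terms of minimizing the overall costs. "Costs" can be defined as monetary costs, patient morbidity and mortality, or both. The slope, m, of the ROC curve at the optimal decision threshold is
$$m = \frac{1 - \mathrm{PREV}_p}{\mathrm{PREV}_p} \times \frac{C_{FP} - C_{TN}}{C_{FN} - C_{TP}} \tag{9}$$
where $\mathrm{PREV}_p$ is the pretest probability of disease and $C_{FP}$, $C_{TN}$, $C_{FN}$, and $C_{TP}$ are the costs of false-positive, true-negative, false-negative, and true-positive results, respectively.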
Examining the ROC curve labeled X in Figure 2A, 2B, we see that the slope is very steep in the lower left where both the sensitivity and FPR are low, and is close to zero at the upper right where the sensitivity and FPR are high. The slope takes on a high value when the patient is unlikely to have the disease or the cost of a false-positive is large; for these situations, a low FPR is optimal. The slope takes on a value near zero when the patient is likely to have the disease or treatment for the disease is beneficial and carries little risk to healthy patients; in these situations, a high sensitivity is optimal [3]. A nice example of a study using this equation is given in [48]. See also work by Greenhouse and Mantel [49] and Linnet [50] for determining the optimal decision threshold when a desired level for the sensitivity, specificity, or both is specified a priori.
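The behavior of the optimal slope can be illustrated with a small Python sketch, assuming the commonly cited form of the result in which m equals the odds against disease, (1 − pretest probability) / pretest probability, multiplied by the ratio of net costs (C_FP − C_TN) / (C_FN − C_TP). The function name, the cost values, and the pretest probabilities are hypothetical.

```python
def optimal_slope(pretest_probability, cost_fp, cost_tn, cost_fn, cost_tp):
    """Slope of the ROC curve at the cost-minimizing decision threshold."""
    odds_against_disease = (1.0 - pretest_probability) / pretest_probability
    net_cost_ratio = (cost_fp - cost_tn) / (cost_fn - cost_tp)
    return odds_against_disease * net_cost_ratio


# Hypothetical costs in arbitrary units: a missed disease is assumed to be
# nine times as costly as a false alarm; correct results carry no cost.
costs = dict(cost_fp=1.0, cost_tn=0.0, cost_fn=9.0, cost_tp=0.0)

m_rare = optimal_slope(0.10, **costs)     # disease unlikely before the test
m_common = optimal_slope(0.50, **costs)   # disease likely before the test
print(f"slope at 10% pretest probability: {m_rare:.2f}")
print(f"slope at 50% pretest probability: {m_common:.2f}")
```

The lower pretest probability produces the steeper slope, which corresponds to a stricter threshold with a lower FPR.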
Conclusion
Applications of ROC curves in the medical literature have increased greatly in the past few decades, and with this expansion many new statistical methods of ROC analysis have been developed. These include methods that correct for common biases like verification bias and imperfect gold standard bias, methods for combining the information from multiple diagnostic tests (i.e., optimal combinations of tests) and multiple studies (i.e., meta-analysis), and methods for analyzing clustered data (i.e., multiple observations from the same patient). Interested readers can search directly for these statistical methods or consult two recently published books on ROC curve analysis and related topics [2, 39]. Available software for ROC analysis allows investigators to easily fit, evaluate, and compare ROC curves [41, 51], although users should be cautious about the validity of the software and check the underlying methods and assumptions.
APPENDIX 1. Area Under the Curve and Confidence Intervals with Binormal Model
Under the binormal assumption, the receiver operating characteristic (ROC) curve is the collection of points given by

$$[\,1 - \Phi(c),\; 1 - \Phi(B \times c - A)\,]$$

where c ranges from $-\infty$ to $+\infty$ and represents all the possible values of the underlying binormal distribution, and $\Phi(c)$ is the cumulative normal distribution evaluated at c. For example, for a false-positive rate of 0.10, $\Phi(c)$ is set equal to 0.90; from tables of the cumulative normal distribution, we have $\Phi(1.28) = 0.90$. Suppose A = 2.0 and B = 1.0; then the sensitivity = $1 - \Phi(-0.72) = 1 - 0.2358 = 0.7642$. ROCKIT [32] gives a confidence interval (CI) for sensitivity at particular false-positive rates (i.e., pointwise CIs). A CI for the entire ROC curve (i.e., simultaneous CI) is described by Ma and Hall [52].
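The worked example can be reproduced numerically. The sketch below assumes the binormal curve has the form [1 - Φ(c), 1 - Φ(B × c - A)], with Φ the standard normal cumulative distribution; the function name `binormal_sensitivity` is hypothetical, and Python's standard-library `statistics.NormalDist` stands in for the printed normal tables.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1


def binormal_sensitivity(fpr, a, b):
    """Sensitivity on the binormal ROC curve at a given false-positive rate.

    Solves 1 - Phi(c) = fpr for c, then returns 1 - Phi(b * c - a).
    """
    if not 0.0 < fpr < 1.0:
        raise ValueError("fpr must be strictly between 0 and 1")
    c = STD_NORMAL.inv_cdf(1.0 - fpr)   # threshold with Phi(c) = 1 - fpr
    return 1.0 - STD_NORMAL.cdf(b * c - a)


sens = binormal_sensitivity(0.10, a=2.0, b=1.0)
print(f"sensitivity at FPR = 0.10: {sens:.4f}")
```

The result is about 0.764. It differs slightly in the fourth decimal from the hand calculation because the code does not round c to 1.28 before evaluating Φ.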
Under the binormal distribution assumption, the area under the smooth ROC curve (AUC) is given by

$$\text{AUC} = \Phi\left[\frac{A}{\sqrt{1 + B^{2}}}\right]$$

For the example above (A = 2.0 and B = 1.0), AUC = $\Phi[2.0 / \sqrt{2.0}] = \Phi[1.414] = 0.921$. The variance of the full area under the ROC curve is given as standard output in programs like ROCKIT [32]. An estimator for the variance of the partial area under the curve (PAUC) was given by McClish [5]; a Fortran program is available for estimating the PAUC and its variance [41].
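The area calculation can also be checked in a few lines. This sketch assumes the closed form AUC = Φ[A / √(1 + B²)] for the binormal model and, as a cross-check, integrates the binormal curve numerically with the trapezoidal rule. The numeric partial-area routine is only an illustrative approximation under that assumption; it is not McClish's estimator and does not estimate a variance. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1


def binormal_auc(a, b):
    """Closed-form area under the binormal ROC curve: Phi(a / sqrt(1 + b^2))."""
    return STD_NORMAL.cdf(a / sqrt(1.0 + b * b))


def binormal_sensitivity(fpr, a, b):
    """Sensitivity at a given FPR; the curve is pinned at (0, 0) and (1, 1)."""
    p = 1.0 - fpr
    if p >= 1.0:
        return 0.0
    if p <= 0.0:
        return 1.0
    c = STD_NORMAL.inv_cdf(p)
    return 1.0 - STD_NORMAL.cdf(b * c - a)


def partial_auc_numeric(a, b, fpr_low, fpr_high, steps=2000):
    """Trapezoidal approximation of the area between two false-positive rates."""
    width = (fpr_high - fpr_low) / steps
    total = 0.0
    for i in range(steps):
        left = fpr_low + i * width
        right = left + width
        total += 0.5 * width * (binormal_sensitivity(left, a, b)
                                + binormal_sensitivity(right, a, b))
    return total


auc = binormal_auc(2.0, 1.0)
auc_check = partial_auc_numeric(2.0, 1.0, 0.0, 1.0)
pauc = partial_auc_numeric(2.0, 1.0, 0.0, 0.2)
print(f"closed-form AUC: {auc:.3f}")
print(f"numeric AUC:     {auc_check:.3f}")
print(f"partial area for FPR in [0, 0.2]: {pauc:.3f}")
```

The closed-form value matches the 0.921 computed above, and the numeric integral over the full FPR range should agree with it to about three decimals.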
Acknowledgments
I thank the two series' coeditors and an outside statistician for their helpful comments on an earlier draft of this manuscript.
References