August 2006, Volume 187, Number 2

Research

Fundamentals of Clinical Research for Radiologists

Meta-Analysis of Diagnostic and Screening Test Accuracy Evaluations: Methodologic Primer

Affiliations:
1Center for Statistical Sciences, Brown University, Box G-H, Providence RI, 02912.

2Present address: Department of Psychiatry, Yale University, New Haven, CT 06511.

Citation: American Journal of Roentgenology 2006; 187:271-281. doi: 10.2214/AJR.06.0226

ABSTRACT

OBJECTIVE. Interest in evidence-based diagnosis is growing rapidly as diagnostic and screening techniques proliferate. In this article we provide an overview of systematic reviews of diagnostic performance and discuss in detail statistical methods for the most common variant of the problem: meta-analysis of studies in which a pair of estimates of sensitivity and specificity is reported. The need to account for possible variations in threshold for test positivity across studies led to the formulation of the Summary ROC (SROC) curve method. We discuss graphical and model-based ways to estimate, summarize, and compare SROC curves, and we present an example from a meta-analysis of data on techniques for staging cervical cancer. We also present a brief survey of the methodologic literature for addressing heterogeneity, correlated data, multiple thresholds per study, and systematic reviews of ROC studies. We conclude with a discussion of the significant methodologic challenges that continue to face investigators in this area of diagnostic medicine research.

CONCLUSION. Systematic reviews of diagnostic performance are a rigorous approach to examining and synthesizing evidence in the evaluation of diagnostic and screening tests. The information from such reviews is needed by clinicians, health policy makers, researchers in diagnostic medicine, developers of diagnostic techniques, and the general public. However, despite progress in study quality and reporting and in methodologic development, major challenges confront investigators undertaking these reviews.

Keywords: diagnostic accuracy, evidence, meta-analysis, statistical methods, Summary ROC curve, systematic reviews

Introduction

The need for systematic reviews of diagnostic and screening tests has grown markedly in recent years as technologic advances have brought forth a vast array of such techniques. Patients, physicians, and policy makers all need information on the reliability and performance of tests and the interpretation of results. In addition, the increased availability of a plethora of diagnostic and screening techniques has meant increased use of tests and a dramatic increase in health care costs.

As evidence-based medicine expands from therapy to diagnosis, the role of systematic reviews acquires added importance [1]. The information from systematic reviews of diagnostic and screening tests is necessary for the following purposes: determination of the proper and efficacious use of diagnostic and screening tests in the clinical setting; decision making about health care policy and financing; evaluation of the performance and status of a diagnostic technique to determine areas for further research, development, and evaluation; and evaluation of the quality and scope of available primary studies of diagnostic and screening techniques and thus development of information necessary for determining directions of future research in diagnostic medicine.

A taxonomy of the important aspects of evaluation of diagnostic and screening tests would distinguish three broad areas of end points: the diagnostic performance of the test, assessed with measures of test accuracy and predictive value; the impact of the test on the process of care, assessed by metrics of the effect of the test on subsequent diagnostic and therapeutic decision making; and the impact of the test on patient-level outcomes, including mortality, morbidity, satisfaction and health-related quality of life, health care utilization, and cost [2-4].

It is also possible, although not formally practiced, to distinguish developmental levels for a technique, following the trajectory from early development to broad dissemination. For example, a four-stage categorization would include stage 1 (discovery), in which the technical parameters and diagnostic criteria of a technique are established; stage 2 (introduction), in which diagnostic performance is assessed and fine tuning of the technology is performed in single-institution studies; stage 3 (maturity), in which the technique is evaluated in comparative, multicenter, prospective clinical studies (efficacy); and stage 4 (dissemination), in which the technique is evaluated as used by the community at large (effectiveness) [3].

Appropriate end points can be selected for each developmental level of a technique. In general, however, evaluation of diagnostic performance is a relevant end point for studies at any stage. The most commonly used metric of diagnostic performance, and the one discussed in detail in this primer, is the pair of estimated sensitivity and specificity values for a test. Others include receiver operating characteristic (ROC)-based measures and measures of the predictive value of a test.

This primer focuses exclusively on systematic reviews of the diagnostic performance of tests. We provide a brief description of the main steps in conducting systematic reviews, from formulating the research question through primary study retrieval and data collection to data analysis and interpretation of results. We also discuss statistical methods for deriving summaries of diagnostic performance data and give an example of an application to meta-analysis of the diagnostic accuracy of tests in the detection of lymph node involvement in women with cervical cancer. The article considers methods for meta-analysis of studies in which a single pair of sensitivity and specificity estimates is reported. Extensions of the basic method are described, and a brief guide to the methodologic literature is provided. We summarize our recommendations and discuss methodologic and subject-matter challenges in the last section.

Overview of Systematic Reviews of Diagnostic Accuracy

The conduct of a systematic review of diagnostic test accuracy proceeds through the following major steps [5, 6]:

  1. Definition of the objectives of the review.

  2. Literature search and retrieval of studies.

  3. Assessment of study quality and applicability to the clinical problem at hand.

  4. Extraction of data.

  5. Statistical analysis.

  6. Interpretation of results and development of recommendations.

Each of the six steps in the process involves its own challenges and can be further refined with more detailed flowcharts [7]. We provide a brief description of the tasks involved in each step.

Definition of the Objectives of the Review

A systematic review of diagnostic accuracy begins with defining the clinical context and developing a precise description of the diagnostic question for which test accuracy is to be assessed. This part of the process is similar to the development of the protocol for a primary study. It includes specification of the clinical question giving rise to the potential use of the test or tests under investigation, the technical characteristics of the tests, the conditions under which the tests are interpreted, and the reference information used in the assessment of test accuracy [8]. Because systematic reviews of diagnostic accuracy are called on to inform the use of diagnostic tests in clinical care, comparisons of alternative tests are most valuable.

Literature Search and Retrieval of Studies

Although an extensive literature on search strategies is available for studies of therapy, the corresponding body of literature on diagnostic test evaluation is relatively small. Deville et al. [9] and Bachmann et al. [10] discuss strategies relating to diagnostic and screening tests.

The search for appropriate studies must be comprehensive, objective, and reproducible, and the searcher must consider all available evidence. The search should not be limited to documents in English and should cover publications beyond journals, such as conference proceedings and other reports. Hand searching through publications, reference checking, and searching for unpublished reports often are necessary, especially to assess the extent of publication bias. Finally, it is important to document the process and the outcome of each search.

Assessment of Study Quality and Applicability

The scope of assessment of study quality is broad and not generally well defined. In the context of studies of diagnostic performance, assessment of quality has to consider the important features of the design and execution of the study, including factors such as definition of the research question and clinical context, specification of appropriate patient population, description of the diagnostic techniques under study and their interpretation, detailed accounting of how the reference standard information was defined and obtained, and any other factors that can affect the integrity of the study and the generalizability of the results.

Methods of quality assessment may focus on the absence or presence of key qualities in the study report (checklist approach), use scores developed for this purpose (scale approach), or use the levels-of-evidence methods by which a level or grade is assigned to studies fulfilling a predefined set of criteria. The literature on assessment of the quality of therapy studies is extensive, at least in comparison with the literature on diagnostic test evaluations [11, 12]. Two developments in the diagnostic area are the Standards for Reporting of Diagnostic Accuracy (STARD) checklist for reporting of studies of diagnostic accuracy [4, 13, 14] and the quality assessment tool for diagnostic accuracy (QUADAS) for assessing the quality of studies of diagnostic accuracy [12, 15]. The former may be beneficial in improving the quality of published reports and, indirectly, in improving the quality of primary studies. The latter is a rigorously constructed tool that can be used by investigators undertaking new systematic reviews.

Incorporation of quality assessment results into meta-analysis is a matter of debate. A simple and perhaps draconian approach is to exclude studies of poor quality. A less drastic alternative is to use quality scores as weights in the statistical analysis. However, the exact definition of the weights is often a matter of disagreement, and the statistical rationale for their use is shaky. Another alternative, which we recommend to investigators, is to conduct sensitivity analysis. The goal of sensitivity analysis is to assess the contribution of poor-quality studies to the results of the full meta-analysis. The assessment is made by comparing the results from the statistical analysis with the results of the specific studies included and excluded. Sensitivity analysis also can be used to assess the effect on diagnostic accuracy of a study characteristic or a combination of study characteristics.

Extraction of Data

In studies of imaging techniques, test results are most commonly reported as binary (yes or no) or ordinal categoric. An example of the latter often used in ROC studies is a five-category scale for degree of suspicion about the presence of a target condition. The categories are commonly described as follows: 1 = definitely normal, 2 = probably normal, 3 = equivocal, 4 = probably abnormal, and 5 = definitely abnormal. In recent years, degree of suspicion assessments also have been made on nearly continuous scales, for example, scales from 1 to 100. Continuous test results are typically reported in the evaluation of laboratory tests, such as the concentration of a substance.

A binary test result is typically obtained by dichotomizing a test outcome measured on a continuous scale. The continuous scale can be observed directly, as is the case with many laboratory tests. As an alternative, the scale can be a latent, unobservable one, as is the case with the observer's degree of suspicion in ROC studies. In either case, the binary test result is obtained by application of a threshold for test positivity. The presence of such a threshold is a fundamental theme in the evaluation of diagnostic and screening tests.

In this primer, as in most published work on diagnostic and screening test evaluation, disease status is assumed to be binary. Thus, for a particular threshold of test positivity, the study results can be presented in the familiar two-by-two table showing cross classification of disease status and test outcome (Table 1).

TABLE 1: Two-by-Two Table of Binary Test Results Versus Disease Status

Although it may seem reasonable to expect that obtaining an appropriate two-by-two table from a published study should be rather straightforward, practical experience suggests that this is not always the case. Investigators need to consider carefully the data report and may also need to contact the authors of the report to obtain the necessary information.

Measures of test performance are defined either conditionally on disease status (sensitivity, specificity) or conditionally on test result (predictive value). Commonly used metrics include test sensitivity = P(T+|D+); specificity = P(T-|D-); positive predictive value = P(D+|T+); and negative predictive value = P(D-|T-), where P(...) is the probability of the event in parentheses, T is the test result, and D is the true disease status. In addition, studies may report other metrics, such as the diagnostic odds ratio, OR = (sens × spec) / [(1 - sens)(1 - spec)]; the positive likelihood ratio, LR+ = P(T+|D+) / P(T+|D-) = sens / (1 - spec); and the negative likelihood ratio, LR- = P(T-|D+) / P(T-|D-) = (1 - sens) / spec. See also the recent article in the AJR by Weinstein et al. [16].
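
These definitions are simple to compute from the four cells of the two-by-two table. The short SAS data step below uses hypothetical counts (the cell counts and variable names are ours, chosen only for illustration) to evaluate each of the metrics just listed.

/* Hypothetical 2 x 2 table: tp, fp, fn, tn are illustrative counts,
   not data from any study discussed in this article. */
data accuracy;
  tp = 45; fp = 10; fn = 5; tn = 40;
  sens   = tp / (tp + fn);                              /* P(T+|D+) */
  spec   = tn / (tn + fp);                              /* P(T-|D-) */
  ppv    = tp / (tp + fp);                              /* P(D+|T+) */
  npv    = tn / (tn + fn);                              /* P(D-|T-) */
  dor    = (sens * spec) / ((1 - sens) * (1 - spec));   /* diagnostic odds ratio */
  lr_pos = sens / (1 - spec);                           /* positive likelihood ratio */
  lr_neg = (1 - sens) / spec;                           /* negative likelihood ratio */
run;

proc print data=accuracy noobs;
  var sens spec ppv npv dor lr_pos lr_neg;
run;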

This primer is concerned mainly with meta-analysis of studies reporting estimates of pairs of sensitivity and specificity. The methods discussed in the next section assume the availability of a single two-by-two table from each study. However, the results of some studies are reported with more than one threshold of test positivity and even more than one definition of disease status. It is important for investigators to record all the information on alternative thresholds reported in retrieved studies and to determine which of the thresholds of test positivity is the most relevant for the purposes of the systematic review. Methods for combining data when several thresholds are used in each study are beyond the scope of this primer but are discussed briefly later in the Other Methods section.

Statistical Analysis

Because binary test outcomes are defined on the basis of an explicit or implicit threshold for test positivity, it follows that measures of binary test performance depend on the particular threshold used to generate the binary test outcomes. This dependence is a fundamental aspect of diagnostic test evaluation. In the case of test sensitivity and specificity, dependence on the threshold induces a tradeoff between the two quantities as the threshold for positivity is moved across all possible values. The curve of all pairs of sensitivity and specificity values achieved by moving the threshold across its possible range is the ROC curve [17, 18].

Comparison of tests on the basis of ROC curves takes into consideration the actual curves and is aided by summary measures that have been proposed in the literature. The area under the curve (AUC) is the most commonly used summary and can be interpreted as average sensitivity for the test, taken over all specificity values. Strictly speaking, the AUC is equal to the probability that if a pair of diseased and nondiseased subjects is selected at random, the diseased subject will be ranked correctly by the test. Other summaries of the ROC curve include partial areas under the curve, values of sensitivity corresponding to selected values of specificity (and vice versa), and optimal operating points, defined according to specific criteria. ROC analysis and other statistical methods for diagnostic test evaluation are described in textbooks by Zhou et al. [19] and Pepe [20] and in chapters by Toledano et al. [21] and Toledano [22].
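
The ranking interpretation of the AUC can be checked with a small simulation. In the sketch below (our own illustration with arbitrary binormal parameters, not data from any study discussed here), diseased and nondiseased test values are drawn in pairs, and the proportion of pairs in which the diseased subject has the higher value estimates the AUC; for these settings the theoretical value is probnorm(1.5 / sqrt(2)), approximately 0.86.

/* Monte Carlo check of the ranking interpretation of the AUC.
   The two normal distributions are hypothetical. */
data ranking;
  call streaminit(2006);
  do pair = 1 to 200000;
    x_dis = rand('normal', 1.5, 1);     /* diseased subject of the pair */
    x_non = rand('normal', 0.0, 1);     /* nondiseased subject of the pair */
    correct = (x_dis > x_non);          /* 1 if the pair is ranked correctly */
    output;
  end;
run;

proc means data=ranking mean;
  var correct;    /* should be close to probnorm(1.5/sqrt(2)), about 0.86 */
run;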

This digression into ROC analysis is necessary to highlight the role of the positivity threshold and its consequences. A direct implication of this issue in meta-analysis of sensitivity and specificity estimates is that the method has to account for the possibility of different thresholds across studies. The use of simple or weighted averages of sensitivity and specificity to draw statistical conclusions is not methodologically defensible. A simple example to illustrate this point is a meta-analysis of three studies with the sensitivity and specificity estimates shown in Figure 1. The estimated sensitivity and specificity pairs are (0.1, 0.9), (0.8, 0.8), and (0.9, 0.1). The average pair is (0.6, 0.6). Clearly, the (0.6, 0.6) pair does not represent these data in any useful way; thus, a simple averaging of sensitivity and specificity is not an adequate approach.

“Average” values of sensitivity and specificity sometimes are used as descriptive summaries of the observed data. Typically, this approach would be the case when the observed variability in one or both of the two quantities is small.

Interpretation of Results

Interpretation of the findings from a meta-analysis of diagnostic performance must address the relevance of the results to the four general aims stated earlier. That is, this section of the report should highlight the specific ways in which the data provide information about the proper use of the particular test, preferably in comparison with alternative techniques; discuss how the findings can be used to make decisions about health care policy and financing; summarize the quality of the available studies, pointing to areas in which more research is needed; and provide information about possible areas of improvement in the performance of the techniques under review.

Statistical Methods for Meta-Analysis of Sensitivity and Specificity Data
Summary ROC (SROC) Curve for a Single Test

Our focus is on meta-analyses in which each study contributes a two-by-two table of data, on the basis of which a pair of estimates of sensitivity and specificity can be obtained. To introduce statistical notation, the ith study (i = 1,...I) contributes data in the format shown in Table 2. With the notation of Table 2, the estimates of sensitivity and 1 - specificity from the ith study are TPRi = di / ni1 and FPRi = bi / ni0, where TPR is the true-positive rate and FPR is the false-positive rate.

TABLE 2: Format for ith Study

Fig. 1 Graph shows that averaging sensitivities and specificities can be misleading. TPR = true-positive rate, FPR = false-positive rate.

The display of paired estimates of sensitivity and specificity in ROC coordinates (FPR, TPR) is a key step in the process of statistical analysis. Such plots ideally should include error bars for each of the two estimates. However, the bars often make the plot rather busy. An additional plot to consider is a forest plot, which shows the sensitivity and specificity estimates of each study side by side and may also include the numerators and denominators used to construct the estimates (Figure 2).

Simple derivation of an SROC curve—An easy way to construct a graphical summary of (FPR, TPR) estimates was introduced by Moses and colleagues in 1993 [23]. In this approach, the original data are first transformed into new variables S and D, defined as follows for the ith study: Di = logit(TPRi) - logit(FPRi) and Si = logit(TPRi) + logit(FPRi), where logit(a) = log [a / (1 - a)]. The next step is to fit a linear regression model of the form D = a + bS.

The fitted model provides a value of D for each value of S. In the final step, the D and S pairs are transformed back into ROC coordinates to obtain an SROC curve.
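
To make the three steps concrete, the following SAS sketch applies them to a few hypothetical two-by-two tables (the counts are invented for illustration and are not data from the cervical cancer example).

/* Step 1: hypothetical study-level counts (tp of n_dis diseased and fp of
   n_nondis nondiseased test positive); compute TPR, FPR, then D and S. */
data moses;
  input study tp n_dis fp n_nondis;
  tpr = tp / n_dis;
  fpr = fp / n_nondis;
  d = log(tpr / (1 - tpr)) - log(fpr / (1 - fpr));   /* logit(TPR) - logit(FPR) */
  s = log(tpr / (1 - tpr)) + log(fpr / (1 - fpr));   /* logit(TPR) + logit(FPR) */
  datalines;
1 40 50 10 60
2 33 45  6 55
3 28 40  4 50
4 18 30  9 70
;
run;

/* Step 2: unweighted linear regression D = a + b*S. */
proc reg data=moses;
  model d = s;
run;

/* Step 3: back-transformation. With estimates a and b, the SROC curve is
   logit(TPR) = a/(1 - b) + [(1 + b)/(1 - b)]*logit(FPR),
   evaluated over a grid of FPR values. */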

The transformed variable D is the logarithm of the diagnostic odds ratio estimated from each individual primary study in the meta-analysis. The variable S has a less straightforward interpretation. A little algebra shows that S increases when the probability of a positive test result increases in both the diseased and nondiseased populations. Hence, S can be interpreted as a proxy for the test positivity threshold operating in the particular study. This way of constructing an SROC curve is roughly based on an implicit assumption that the variation in diagnostic odds ratio across studies is a function of the threshold for test positivity.

The foregoing model can be easily extended to incorporate covariates measuring study characteristics or group characteristics of the participants in the individual primary studies. The linear model would then have the following form: D = a + bS + c1X1 + c2X2 + ..., where the X variables can be suitably defined to represent characteristics of the study design, the test technology, and study participant characteristics as used in subgroup analyses. A model with appropriately defined indicator variables X can also be used to compare tests.

SROC summaries—In analogy with the usual ROC curve, a natural summary of the SROC is the AUC. However, the choice of the exact limits for defining the area is a matter of some debate. In particular, some authors prefer to compute the area only over the range of the observed FPR values to avoid the inherent uncertainties about extrapolating beyond the range of the observed data. Other authors support the use of a partial area over a range of FPR values of interest in the context of the particular test. In this primer we report the full AUC estimates because of their simplicity, intuitive interpretation, and avoidance of arbitrary choices of limits of FPR values.

Another global summary of the SROC curve is the so-called Q* (“Q-star”) statistic, which measures the value of TPR at the point where the curve intersects the x + y = 1 diagonal line. This is the point on the curve where sensitivity equals specificity. For a symmetric curve, this value is also the point at which the curve is closest to the ideal point (FPR = 0, TPR = 1).
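
Both summaries are easy to compute numerically once an SROC curve has been fitted. The data step below is a sketch that assumes hypothetical values of the intercept a and slope b from the linear fit, traces the curve over a fine grid of FPR values, accumulates the full AUC with the trapezoidal rule, and records the TPR where the curve crosses the sensitivity-equals-specificity line (the Q* statistic).

data sroc_summary;
  a = 2.5;  b = 0.10;          /* hypothetical SROC parameters from the linear fit */
  n = 2000;
  auc = 0;  qstar = .;  gap = 1e10;
  prev_fpr = 0;  prev_tpr = 0;
  do i = 1 to n - 1;
    fpr = i / n;
    logit_tpr = a / (1 - b) + ((1 + b) / (1 - b)) * log(fpr / (1 - fpr));
    tpr = 1 / (1 + exp(-logit_tpr));
    auc = auc + (tpr + prev_tpr) / 2 * (fpr - prev_fpr);   /* trapezoidal rule */
    if abs(tpr - (1 - fpr)) < gap then do;                 /* point where sens = spec */
      gap = abs(tpr - (1 - fpr));
      qstar = tpr;
    end;
    prev_fpr = fpr;  prev_tpr = tpr;
  end;
  auc = auc + (1 + prev_tpr) / 2 * (1 - prev_fpr);         /* close the last segment at (1, 1) */
  /* note: for this model Q* also equals 1/(1 + exp(-a/2)) */
  keep a b auc qstar;
run;

proc print data=sroc_summary noobs; run;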

In addition to the global summary measures, the SROC curve can be used to estimate TPR for each fixed value of FPR (and, conversely, FPR for each fixed value of TPR). Standard errors of these estimates can be obtained using the delta method. We include such estimates in the analysis of the cervical cancer data (Fig. 5).

SROC summaries can be used to compare the performance of alternative diagnostic and screening tests for a particular diagnostic question. These comparisons are relatively straightforward when statistical independence can be assumed to hold, as when SROC curves of alternative tests are derived from separate sets of studies or from overlapping sets of studies in which test results were not correlated. However, the situation is technically more complex when test results within a study are correlated, as is the case when a paired design has been used to compare tests. We discuss this issue later, in Other Methods.

SROC properties and limitations—The shape of the SROC curve derived from the foregoing linear regression model depends on the values of the linear model parameters a and b [24]. The special case of b = 0 corresponds to the situation in which the true diagnostic odds ratio is assumed to be constant across all studies. In this case, the SROC curve is symmetric along the x + y = 1 diagonal line. If b ≠ 0, the curve is not symmetric. Indeed, it turns out that when | b | > 1, the SROC curve derived from the linear regression model has a counterintuitive property: According to the curve, the sensitivity of the test decreases as the FPR increases. Estimated values of b greater than 1 or less than -1 indicate that the simple linear regression model is not adequate for constructing an SROC curve.

Fig. 2 Forest plot of CT sensitivity and specificity estimates and their confidence intervals. LAG = lymphangiography.

Fig. 3 Observed true-positive rates (TPR) and false-positive rates (FPR) for three imaging techniques. LAG = lymphangiography.

SROC curve computations based on the linear regression model are a simple and useful method for developing such curves. There are, however, potentially important technical difficulties to overcome if the results of this approach are used to draw formal statistical inferences. First, the presence of sampling error in the variable S on the right hand side of the linear model may affect the magnitude of the estimates of b and its SE. The sampling error may increase the uncertainty in the estimate of b, leading erroneously to the conclusion that the SROC is symmetric. Second, the linear model uses summaries from the two-by-two tables of the individual studies and ignores the statistical precision of these summaries. Unfortunately, the precision of TPR and FPR estimates is somewhat complex because it depends not only on overall sample size but also on the sample sizes for diseased and nondiseased subjects in the study. Hence, simple weighting by sample size is not sufficient. In addition, the left-hand-side and right-hand-side variables in the linear model have their own estimates of statistical precision, making it difficult to decide on a single weight for the particular study. Third, the linear model does not account for the presence of correlations in the data, such as those resulting from the use of paired designs within individual primary studies.

Binary regression for SROC analysis—Because of the methodologic difficulties described, it is prudent for investigators to consider the use of alternative approaches to estimating SROC parameters for purposes of formal statistical inference. An early such approach predated the linear regression method and used the bivariate normal distribution of the estimates of sensitivity and specificity from each study, with a linear relation between the true values of sensitivity and specificity to account for the effect of threshold [25].

A streamlined alternative to the linear regression model is to use a variant of logistic regression, which models directly the data in each two-by-two table [26-28]. If Y is the binary test result (yes = 1, no = 0) and D the binary disease status for an individual patient in a given study, the form of the model is as follows: logit P(Y = 1 | D) = (θ - αD) exp(-βD), where D is coded as 1/2 for diseased and -1/2 for nondiseased subjects (Appendix 1). The binary regression model is intuitively based on the usual conceptualization of the binary test outcome resulting from a positivity threshold (denoted here by θ). In other words, the binary test outcome is obtained by dichotomizing a continuous variable that has different distributions for diseased and nondiseased subjects. The parameter α measures the distance between the centers of the diseased and nondiseased populations, and the parameter β measures the ratio of the SDs in the two populations. The mathematic details of the model and its relation to the linear model approach are sketched in Appendix 1.

The use of binary regression allows investigators to avoid key difficulties associated with the linear model approach, notably the errors-in-variables problem and the need to account for differences in sample size across studies. As shown in Appendix 1, it is possible to translate the findings of binary regression analysis into linear model parametrization. However, the SROC curves obtained from binary regression analysis always lead to values of the slope between -1 and 1 and hence avoid the counterintuitive properties of curves with | b | > 1 obtained from the linear model. Binary regression models can be fitted with standard software, such as Proc NLMixed in SAS [29]. The SAS code for fitting a binary regression model using Proc NLMixed is in Appendix 2.

Example: Meta-Analysis of Cervical Cancer Staging Data

To illustrate the SROC method we use data from a meta-analysis of diagnostic imaging tests in the detection of lymph node metastasis in patients with cervical cancer [30]. This systematic review was conducted to compare the performance of three imaging techniques: lymphangiography (LAG), CT, and MRI. The published report describes how the problem was formulated, how the relevant studies were identified and reviewed, and how the diagnostic performance data were extracted. Briefly, studies were located with a MEDLINE literature search combined with hand searching of bibliographies from retrieved articles. Included studies had histologic confirmation of cervical cancer, uniformly appropriate reference standard information, and evidence of blinding in study design. In addition, included studies had a minimum sample size of 20 patients, reported criteria for test positivity, and presented sufficient data to complete the necessary two-by-two table.

In our example we included data from 42 studies, 13 of which evaluated LAG, 19 evaluated CT, and 10 evaluated MRI. Nine studies evaluated more than one test, but this feature of the data is ignored for the purposes of this analysis. The pairs of observed values of sensitivity and specificity are presented in ROC coordinates in Figure 3. We are not using exactly the same set of studies presented in the published paper, and hence the results of this example may differ from those in the article, particularly in the case of the LAG evaluation.

SROC curves were derived separately for each test by both the binary and the linear regression methods. The results of the binary regression fit are presented in detail and are followed by summary tables from the linear regression fit. The latter are included for comparison purposes. For each test, the binary regression model assumed common location (α) and scale (β) parameters across the studies but a separate threshold value for each study. Table 3 summarizes the results from the binary regression fit.

TABLE 3: Estimates of Summary Receiver Operating Characteristic Curve Parameters, Area Under the Curve (AUC), and Q* Statistic for Each Technique (Binary Regression Model)

The scale parameter is not statistically different from zero for any of the three techniques. Instead of assuming it is zero and plotting the SROC curves as symmetric, we used the estimated value of β to derive the curves shown in Figures 4 and 5. The SROC curves are superimposed on the observed data in Figure 4. The SROC curves with superimposed 95% confidence intervals for TPR and FPR at three points are shown in Figure 5.

The SROC curve for LAG stays consistently below the curves of the other two techniques. The MRI curve dominates the CT curve, and its AUC and Q* estimates are higher than those of the other two techniques. However, only one of the paired comparisons of the AUC estimates (LAG vs MRI) is statistically significant. A comparison of the confidence intervals for TPR and FPR also shows overlap at each of the three points chosen in Figure 5. We conclude that although there is a trend for MRI to have better performance than CT and LAG, only the AUC of MRI is statistically different from that of LAG.
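
One simple way to carry out such a pairwise comparison, assuming the AUC estimates come from independent sets of studies and are approximately normally distributed, is a two-sided z test on their difference. The sketch below uses hypothetical AUC values and standard errors for two generic techniques, not the estimates reported in Table 3.

/* Hypothetical AUC estimates and standard errors for two techniques,
   assumed to be estimated from independent sets of studies. */
data auc_compare;
  auc1 = 0.91; se1 = 0.03;     /* technique A */
  auc2 = 0.80; se2 = 0.04;     /* technique B */
  z = (auc1 - auc2) / sqrt(se1**2 + se2**2);
  p_value = 2 * (1 - probnorm(abs(z)));
run;

proc print data=auc_compare noobs;
  var auc1 auc2 z p_value;
run;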

For comparative purposes, we present the numeric results from the linear regression fit of the SROC curve (Table 4). The actual curves and summary estimates of AUC and Q* are close but not identical to those derived from the binary regression analysis. For a more detailed view of the comparison, we converted the SROC equation from the binary regression to the form that would be obtained from a linear regression fit, using the formulas in Appendix 1. Table 5 shows the results for CT.

TABLE 4: Estimates of Summary Receiver Operating Characteristic Curve Parameters, Area Under the Curve (AUC), and Q* Statistic for Each of Three Techniques (Linear Regression Model, Unweighted)

TABLE 5: Comparison of Binary and Linear Regression Summary Receiver Operating Characteristic Analyses for CT Data

Other Methods

The SROC method is limited in two important respects. First, the statistical framework does not consider the presence of random variation between studies. This fixed-effects framework implicitly assumes that the universe of all studies to which inferences apply is only the specific studies used in the meta-analysis and that in addition to sampling variation within studies, the only other possible variation can be explained by study-level covariates. As a result of its assumptions, a fixed-effects approach to meta-analysis is generally expected to provide artificially more precise results than an approach that provides a fuller account of variability in the data [31]. The second important limitation of the specific fixed-effects approach is that it ignores correlations in the data within studies. In this section, we briefly discuss statistical methods based on hierarchical models designed to address these limitations. We also provide references to the literature on meta-analysis of ROC studies.

Fig. 4 Estimated SROC curves and original data points for three imaging techniques. TPR = true-positive rate, FPR = false-positive rate, LAG = lymphangiography.

Fig. 5 Summary Receiver Operating Characteristic curves with confidence intervals for selected (TPR, FPR) points. TPR = true-positive rate, FPR = false-positive rate, LAG = lymphangiography.

Hierarchical Summary ROC Analysis

The binary regression model is the building block for a hierarchical model describing the full range of variation in the data. In particular, the hierarchical model differentiates within-study from between-studies variability and systematic from random variability. For example, a model for the cervical cancer data accounts for two levels of variability. In level 1, within-study variation is modeled by binary regression. In level 2, between-studies variation is modeled by distributions of the threshold and location parameters. The mean of the distribution of the parameters may depend on study-level covariates (e.g., test type).

A hierarchical model can be fitted with fully bayesian methods [27] or likelihood-based approximations as implemented in the Proc NLMixed procedure of SAS [28]. A Hierarchical SROC (HSROC) curve can be derived by use of the population means of the parameters. In addition to providing a full account of the variability in the data, the hierarchical model accounts implicitly for correlations within studies. If information exists for such correlations, it can be included explicitly by suitable extensions of the model. In particular, such formulations are useful for modeling data from studies conducted with paired designs.
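
For readers who wish to see what such a two-level model looks like in practice, the following Proc NLMixed sketch extends the fixed-effects code of Appendix 2 by replacing the study-specific threshold parameters with a population threshold plus a study-level random effect (ut) and by adding a random effect (ua) to the location parameter. The parameter names, starting values, and variance structure are ours and are intended only as a starting point, not as the exact model fitted in the cited work.

/* Sketch of a hierarchical (random-effects) version of the Appendix 2 model.
   Assumes the same data layout as Appendix 2: one record per study and disease
   group, n_tp events out of n_pos trials, dis1 coded +0.5 (diseased) or
   -0.5 (nondiseased), and records sorted by study. */
proc nlmixed data=final cov;
  parms theta=0 a=0 b=0 s2t=1 s2a=1;
  logitp = (theta + ut - (a + ua)*dis1) / exp(b*dis1);
  p = exp(logitp) / (1 + exp(logitp));
  model n_tp ~ binomial(n_pos, p);
  random ut ua ~ normal([0, 0], [s2t, 0, s2a]) subject=study;
run;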

An alternative way to build hierarchical models for diagnostic accuracy data is to consider a variant of the Kardaun approach and use the bivariate asymptotic normal distribution of the estimates of sensitivity and specificity from each study [32]. Although this approach has been used to derive “average” estimates of sensitivity and specificity, a practice criticized earlier in this article, it is easy to modify the model to derive SROC curves.

Meta-Analysis with Multiple Thresholds from Individual Studies

In the SROC methods discussed earlier, it is assumed that a single two-by-two table is obtained from each study. If multiple thresholds for test positivity are used in the primary studies, ordinal regression methods and their hierarchical formulations can be used to perform the statistical analysis [33].

Meta-Analysis of ROC Data

The choice of suitable statistical methods for combining data from ROC studies depends on the type of data considered. If the full ROC data are available—for example, the complete two-by-five table of disease status by test results when a five-point ordinal categoric scale is used—then ordinal regression methods can be used. It is not necessary for all studies to use the same number of categories in reporting of test results [33, 34].

If the emphasis is on meta-analysis of summaries of the ROC curve, the appropriate methods have to be tailored to the specific summary. For meta-analysis of estimates of the AUC from independent studies, McClish [35] describes weighted average estimators, Zhou [36] describes a generalized estimating equation approach, and Hellmich et al. [37] describe a bayesian method. A hierarchical model for such data can be constructed in a straightforward manner with the asymptotic distribution of the estimate of the AUC for the first level of the model and proceeding as in the HSROC model for the other levels. Because the distributions involved are all normal, the process of fitting and checking such models is fairly routine [31, 38].
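
As a simple illustration of this type of pooling, the data step and query below compute a fixed-effects, inverse-variance weighted average of study-specific AUC estimates. The AUC values and standard errors are hypothetical, and this generic estimator is in the spirit of, but not identical to, the estimators discussed in the references above.

/* Hypothetical study-specific AUC estimates and standard errors. */
data auc_meta;
  input study auc se;
  w = 1 / se**2;                     /* inverse-variance weight */
  datalines;
1 0.88 0.03
2 0.91 0.05
3 0.84 0.04
4 0.90 0.06
;
run;

proc sql;
  select sum(w*auc) / sum(w)  as pooled_auc format=6.3,
         sqrt(1 / sum(w))     as pooled_se  format=6.3
  from auc_meta;
quit;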

Discussion

As interest in evidence-based diagnosis increases, so does the demand for information from systematic reviews of studies of diagnostic accuracy. The information from such reviews is a key ingredient for all subsequent evaluation of diagnostic techniques. Because empiric studies of test outcomes can be prohibitively difficult to conduct in practice, research synthesis and modeling of health outcomes and costs often remain the only viable options. For such undertakings, the information from meta-analysis of diagnostic performance is crucial.

Meta-analysis of accuracy evaluations is not as streamlined or easy to perform and summarize as meta-analysis of therapy evaluations. A key difference is the nature of the summary measure. In therapy studies, the summary can be as simple as an overall success rate with appropriately quantified variability and uncertainty. In diagnostic accuracy studies, however, the summary is a curve (or several curves if patient subsets are considered). Comparisons of curves are inherently more complex and nuanced than comparisons of means or proportions. Thus, systematic reviews of diagnostic accuracy present the research community with a challenging set of questions about how best to summarize the information and how to use it in analysis and decision making. For example, the methodology for incorporation of SROC curves in modeling outcomes and costs is not fully developed, and practical experience in this type of analysis is relatively scarce. In most published modeling exercises, the sensitivity and specificity of tests are assumed to be a single pair of numbers.

Two major determinants of the success of systematic reviews of diagnostic accuracy are the availability of relevant studies of adequate quality and the development of a consensus around the methods for such reviews. In recent years, the quality of diagnostic and screening test evaluations has improved, but the hill still seems steep [14, 39-41]. In the same period, the methods for systematic review of diagnostic accuracy have progressed and matured. Evidence of methodologic progress is the growing list of published work and the formation of the Cochrane Diagnostic Reviews initiative [42] late in 2003. The researchers involved in this initiative are at work preparing the methodologic infrastructure for performing diagnostic accuracy reviews and including them in a new division of the Cochrane Library.

Despite progress in study quality and reporting and in methodologic development, major challenges confront investigators venturing into the world of systematic reviews of diagnostic and screening tests. The following is a partial list of challenges:

  • The literature contains many small studies, which are usually retrospective and of uncertain quality.

  • The detail and accuracy of reporting on study methods and results vary greatly. It is often impossible to determine key study characteristics, such as study cohort, technical aspects of the techniques involved, and definition of gold standard information.

  • Even for relatively tightly defined clinical questions, multiple sources of heterogeneity among studies are operating, threshold differences being only one. It is therefore important for the review to explore such sources of variation and to use appropriate statistical techniques.

  • An important source of heterogeneity not addressed in this article is heterogeneity due to observer. Empiric data suggest that within-study observer variability can be of the same order of magnitude as variability across studies. Hierarchical modeling can be a powerful framework for incorporating observer variability in the analysis of individual studies [43, 44]. However, detailed data on observer variability are not usually reported, making it necessary for investigators to contact the authors of studies if such an analysis is to be undertaken.

  • As is the case with most technology, diagnostic and screening techniques evolve rapidly. In the absence of a consensus on a framework for diagnostic technology assessment, there is risk of increasing the heterogeneity in a systematic review by inclusion of studies that clearly do not reflect the current state of a technique. By contrast, such a framework is in place for the evaluation of therapy. In that context, a systematic review, for example, would not combine estimates of effects reported in phase 1 and 2 studies with those reported in phase 3 studies.

  • Particular forms of bias exist within many primary studies [45]. The effect of such within-study bias on systematic reviews has to be considered. Methods for handling bias within the primary studies need to be developed.

In confronting methodologic and practical challenges, investigators conducting systematic reviews of diagnostic accuracy are likely to find colleagues and collaborators. The era of evidence-based diagnosis is here to stay.

APPENDIX 1: Binary Regression Model

For a single study, the model can be described as follows.

Let Yij represent the test result (1 = positive, 0 = negative) and Dij the true disease status on the jth individual in the ith study. In our notation, we code D = 1/2 if diseased and -1/2 if nondiseased. The binary regression model is based on the assumption that the response arises from the discretization of an underlying continuous latent variable with threshold θi. The latent variable follows logistic distributions for diseased and nondiseased subjects, and the two distributions can be distinguished by a location parameter (αi) and a scale parameter (βi). The diagnostic performance of the test in the ith study is a function of the location and the scale parameters. Formally, logit P(Yij = 1 | Dij) = (θi - αi Dij) exp(-βi Dij).

The binary regression model is closely related to the usual ROC model and implies that for the ith study: logit(TPRi) = (θi - αi/2) exp(-βi/2) and logit(FPRi) = (θi + αi/2) exp(βi/2).

If the location and scale parameters are assumed to be constant across all studies, the model reduces to a relation between the true-positive rate (TPR) and false-positive rate (FPR) that is similar to the relation postulated in the model described by Moses et al. [23]. In particular, logit(TPR) = c0 + c1 logit(FPR), where c0 = -α exp(-β/2) and c1 = exp(-β). It is clear that c1 is greater than 0 and that b, which is equal to (c1 - 1) / (c1 + 1), takes values between -1 and 1.
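
The translation just described is a short computation. The data step below is a sketch that applies these formulas to hypothetical values of α and β (the variable names are ours); it evaluates c0, c1, and b, together with the corresponding intercept a = 2c0 / (c1 + 1), which follows from the same algebra that yields b.

/* Hypothetical binary regression estimates; alpha is the location parameter
   and beta the scale parameter, in the parameterization of this appendix. */
data convert;
  alpha = -2.8;  beta = 0.15;
  c0 = -alpha * exp(-beta/2);
  c1 = exp(-beta);
  b  = (c1 - 1) / (c1 + 1);      /* slope of D = a + b*S; always between -1 and 1 */
  a  = 2*c0 / (c1 + 1);          /* corresponding intercept of D = a + b*S */
run;

proc print data=convert noobs; run;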

An SROC curve and its summary measures can be estimated from the binary regression model. In addition, study-level and subject-level covariates can be easily incorporated, resulting in models in which the threshold, location, and scale parameters are functions of the covariates; such a model corresponds to simultaneously fitting several SROC curves to subsets of the data.

The large number of parameters in the binary regression model creates identifiability problems without additional assumptions. For example, the model without covariates has three parameters for each table; hence, it is not identifiable for a single table. However, with suitable assumptions, such as the one leading to the analogue of the Moses model, the binary regression model can be made identifiable. Other assumptions about the parameters allow the exploration of heterogeneity across studies. For example, studies may have different location parameters (thus different overall accuracies) but the same scale parameter and the same threshold. Such exploration of heterogeneity is rather limited within the fixed-effects type of approach we present in this article. More elaborate exploration of heterogeneity requires the use of hierarchical models.

APPENDIX 2: Software for Fitting a Binary Regression Model

SAS Code for Binary Regression Model (for CT)

data binreg1;
  input study test n_pos n_tp dis dis1;
  cards;
1 1 10 8 1 0.5
...
42 3 24 2 0 -0.5
;
run;

data final;
  set binreg1;
  if test=2;                       /* keep the records for the test being analyzed (CT) */
  /* create indicator variable for each study */
  if study=1 then s1=1; else s1=0;
  ...
  if study=42 then s42=1; else s42=0;
run;

proc nlmixed data=final maxiter=5000 cov;
  parms a=0 b=0
        t18=0 t19=0 t20=0 t21=0 t22=0 t23=0 t24=0 t25=0 t26=0 t27=0 t28=0
        t29=0 t30=0 t31=0 t32=0 t33=0 t34=0 t35=0 t36=0;
  logitp = (t18*s18 + t19*s19 + t20*s20 + t21*s21 + t22*s22 + t23*s23
            + t24*s24 + t25*s25 + t26*s26 + t27*s27 + t28*s28 + t29*s29
            + t30*s30 + t31*s31 + t32*s32 + t33*s33 + t34*s34 + t35*s35
            + t36*s36 - a*dis1) / exp(b*dis1);
  p = exp(logitp) / (1 + exp(logitp));
  model n_tp ~ binomial(n_pos, p);
run;

Series editors: Craig A. Beam, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the 23rd and final article in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series has been designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software, available on the ACR Web site (www.acr.org), that permits the user to work with what he or she has learned.

Project coordinator: Bruce J. Hillman, Chair, ACR Commission on Research and Technology Assessment.

Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.

Address correspondence to C. Gatsonis.

We thank the editors for inviting us to prepare this review and the referees for their comments and suggestions.

References
1. Knottnerus JA, ed. The evidence base of clinical diagnosis. London, UK: BMJ Books, 2002
2. Thornbury JR. Clinical efficacy of diagnostic imaging: love it or leave it. AJR 1994; 162:1-8
3. Gatsonis C. Design of evaluations of imaging technologies: development of a paradigm. Acad Radiol 2000; 7:681-683
4. Gatsonis C. Do we need a checklist for reporting the results of diagnostic test evaluations? Acad Radiol 2003; 10:599-600
5. Irwig L, Tosteson AN, Gatsonis CA, et al. Guidelines for meta-analyses evaluating diagnostic tests. Ann Intern Med 1994; 120:667-676
6. Irwig L, Macaskill P, Glasziou P, Fahey M. Meta-analytic methods for diagnostic test accuracy. J Clin Epidemiol 1995; 48:119-130
7. Pai M. Systematic reviews of diagnostic test evaluations: what's behind the scenes? ACP J Club 2004; 141:11-13
8. Gatsonis C, McNeil B. Collaborative evaluation of diagnostic tests: experience of the Radiologic Diagnostic Oncology Group. Radiology 1990; 175:571-575
9. Deville WL, Bezemer PD, Bouter LM. Publications on diagnostic test evaluation in family medicine journals: an optimal search strategy. J Clin Epidemiol 2000; 53:65-69
10. Bachmann LM, Coray R, Estermann P, Ter Riet G. Identifying diagnostic studies in MEDLINE: reducing the number needed to read. J Am Med Inform Assoc 2002; 9:653-658
11. Moher D, Jadad AR, Tugwell P. Assessing the quality of randomized controlled trials: current issues and future directions. Int J Technol Assess Health Care 1996; 12:195-208
12. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003; 3:25
13. Bossuyt P, Reitsma J, Bruns D, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Clin Chem 2003; 49:7-18
14. Bossuyt P, Reitsma J, Bruns D, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. BMJ 2003; 326:41-44
15. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004; 140:189-202
16. Weinstein S, Obuchowski N, Lieber M. Clinical evaluation of diagnostic tests. AJR 2005; 184:14-19
17. Hanley J. Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev Diagn Imaging 1989; 29:307-335
18. Obuchowski N. ROC analysis. AJR 2005; 184:364-372
19. Zhou XH, Obuchowski N, McClish D. Statistical methods in diagnostic medicine. New York, NY: Wiley, 2002
20. Pepe M. The statistical evaluation of medical tests for misclassification and prediction. New York, NY: Oxford University Press, 2003
21. Toledano AY, Herman BA. Case study: evaluating accuracy of cancer diagnostic tests. In: Beam C, ed. Biostatistical applications in cancer research. Boston, MA: Kluwer, 2002:219-232
22. Toledano AY. Cancer diagnostics: statistical methods. In: Beam C, ed. Biostatistical applications in cancer research. Boston, MA: Kluwer, 2002:183-218
23. Moses LE, Littenberg B, Shapiro D. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 1993; 12:1293-1316
24. Walter S. Properties of the SROC for diagnostic test data. Stat Med 2002; 21:1237-1256
25. Kardaun JWPF, Kardaun OJWF. Comparative diagnostic performance of three radiological procedures for the detection of lumbar disk herniation. Methods Inf Med 1990; 29:12-22
26. Rutter C, Gatsonis C. Regression methods for meta-analysis of diagnostic test data. Acad Radiol 1995; 2:S48-S56
27. Rutter C, Gatsonis C. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med 2001; 20:2865-2884
28. Macaskill P. Empirical Bayes estimates generated in a hierarchical summary ROC analysis agreed closely with those of a full Bayesian analysis. J Clin Epidemiol 2004; 57:925-932
29. SAS/STAT 9.1 user's guide. Cary, NC: SAS Institute, 2004
30. Scheidler J, Hricak H, Yu KK, Subak L, Segal MR. Radiological evaluation of lymph node metastases in patients with cervical cancer: a meta-analysis. JAMA 1997; 278:1096-1101
31. Normand SL. Tutorial in biostatistics: meta-analysis—formulating, evaluating, combining, and reporting. Stat Med 1999; 18:321-359
32. Reitsma J, Glas A, Rutjes A, Scholten R, Bossuyt P, Zwinderman A. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005; 58:982-990
33. Dukic V, Gatsonis C. Meta-analysis of diagnostic test accuracy assessment studies with varying number of thresholds. Biometrics 2003; 59:936-946
34. Kester ADM, Buntinx F. Meta-analysis of ROC curves. Med Decis Making 2000; 20:430-439
35. McClish DK. Combining and comparing area estimates across studies or strata. Med Decis Making 1992; 12:274-279
36. Zhou X. Empirical Bayes combination of estimated areas under ROC curves using estimating equations. Med Decis Making 1996; 16:24-28
37. Hellmich M, Abrams KR, Sutton AJ. Bayesian approaches to meta-analysis of ROC curves. Med Decis Making 1999; 19:252-264
38. DuMouchel W, Normand SL. Computer modeling and graphical strategies for meta-analysis. In: Stangle D, Berry D, eds. Meta-analysis in medicine and health policy. New York, NY: Dekker, 2000
39. Beam C, Sostman HD, Zheng J-Y. Status of clinical MR evaluations 1985-1988: baseline and design for further assessments. Radiology 1991; 180:265-270
40. Black WC. How to evaluate the radiology literature. AJR 1990; 154:17-22
41. Cooper LS, Chalmers TC, McCally M, Berrier J, Sacks HS. The poor quality of early evaluations of magnetic resonance imaging. JAMA 1988; 259:3277-3280
42. Cochrane reviews of diagnostic test accuracy. The Cochrane Collaboration Web site. Available at: www.cochrane.org/newslett/ccnews31-lowres.pdf. Accessed May 31, 2006
43. Gatsonis CA. Random effects models for diagnostic accuracy data. Acad Radiol 1995; 2:S14-S21
44. Ishwaran H, Gatsonis C. A general class of hierarchical ordinal regression models with applications to correlated ROC analysis. Can J Stat 2000; 28:731-750
45. Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999; 282:1061-1066
