Fundamentals of Clinical Research for Radiologists
1 Center for Statistical Sciences, Brown University, Box G-H, Providence RI,
02912.
2 Present address: Department of Psychiatry, Yale University, New Haven, CT
06511.
Received February 13, 2006; accepted after revision February 20, 2006.
Series editors: Craig A. Beam, C. Craig Blackmore, Steven Karlik, and
Caroline Reinhold.
OBJECTIVE. Interest in evidence-based diagnosis is growing rapidly as diagnostic and screening techniques proliferate. In this article we provide an overview of systematic reviews of diagnostic performance and discuss in detail statistical methods for the most common variant of the problem: meta-analysis of studies in which a pair of estimates of sensitivity and specificity is reported. The need to account for possible variations in threshold for test positivity across studies led to the formulation of the Summary ROC (SROC) curve method. We discuss graphical and model-based ways to estimate, summarize, and compare SROC curves, and we present an example from a meta-analysis of data on techniques for staging cervical cancer. We also present a brief survey of the methodologic literature for addressing heterogeneity, correlated data, multiple thresholds per study, and systematic reviews of ROC studies. We conclude with a discussion of the significant methodologic challenges that continue to face investigators in this area of diagnostic medicine research.
CONCLUSION. Systematic reviews of diagnostic performance are a rigorous approach to examining and synthesizing evidence in the evaluation of diagnostic and screening tests. The information from such reviews is needed by clinicians, health policy makers, researchers in diagnostic medicine, developers of diagnostic techniques, and the general public. However, despite progress in study quality and reporting and in methodologic development, major challenges confront investigators undertaking these reviews.
Keywords: diagnostic accuracy, evidence, meta-analysis, statistical methods, Summary ROC curve, systematic reviews
The need for systematic reviews of diagnostic and screening tests has grown markedly in recent years as technologic advances have brought forth a vast array of such techniques. Patients, physicians, and policy makers all need information on the reliability and performance of tests and the interpretation of results. In addition, the increased availability of a plethora of diagnostic and screening techniques has meant increased use of tests and a dramatic increase in health care costs.
As evidence-based medicine expands from therapy to diagnosis, the role of systematic reviews acquires added importance [1]. The information from systematic reviews of diagnostic and screening tests is necessary for the following purposes: determination of the proper and efficacious use of diagnostic and screening tests in the clinical setting; decision making about health care policy and financing; evaluation of the performance and status of a diagnostic technique to determine areas for further research, development, and evaluation; and evaluation of the quality and scope of available primary studies of diagnostic and screening techniques and thus development of information necessary for determining directions of future research in diagnostic medicine.
A taxonomy of the important aspects of evaluation of diagnostic and screening tests would distinguish three broad areas of end points: the diagnostic performance of the test, assessed with measures of test accuracy and predictive value; the impact of the test on the process of care, assessed by metrics of the effect of the test on subsequent diagnostic and therapeutic decision making; and the impact of the test on patient-level outcomes, including mortality, morbidity, satisfaction and health-related quality of life, health care utilization, and cost [2-4].
It is also possible, although not formally practiced, to distinguish developmental levels for a technique, following the trajectory from early development to broad dissemination. For example, a four-stage categorization would include stage 1 (discovery), in which the technical parameters and diagnostic criteria of a technique are established; stage 2 (introduction), in which diagnostic performance is assessed and fine tuning of the technology is performed in single-institution studies; stage 3 (maturity), in which the technique is evaluated in comparative, multicenter, prospective clinical studies (efficacy); and stage 4 (dissemination), in which the technique is evaluated as used by the community at large (effectiveness) [3].
Appropriate end points can be selected for each developmental level of a technique. In general, however, evaluation of diagnostic performance is a relevant end point for studies at any stage. The most commonly used metric of diagnostic performance, and the one discussed in detail in this primer, is the pair of estimated sensitivity and specificity values for a test. Others include receiver operating characteristic (ROC)-based measures and measures of the predictive value of a test.
This primer focuses exclusively on systematic reviews of the diagnostic performance of tests. We provide a brief description of the main steps in conducting systematic reviews, from formulating the research question through primary study retrieval and data collection to data analysis and interpretation of results. We also discuss statistical methods for deriving summaries of diagnostic performance data and give an example of an application to meta-analysis of the diagnostic accuracy of tests in the detection of lymph node involvement in women with cervical cancer. The article considers methods for meta-analysis of studies in which a single pair of sensitivity and specificity estimates is reported. Extensions of the basic method are described, and a brief guide to the methodologic literature is provided. We summarize our recommendations and discuss methodologic and subject-matter challenges in the last section.
Overview of Systematic Reviews of Diagnostic Accuracy
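The conduct of a systematic review of diagnostic test accuracy proceeds through the following major steps [5, 6]: definition of the objectives of the review; literature search and retrieval of studies; assessment of study quality and applicability; extraction of data; statistical analysis; and interpretation of results.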
Each of the six steps in the process involves its own challenges and can be further refined with more detailed flowcharts [7]. We provide a brief description of the tasks involved in each step.
Definition of the Objectives of the Review
A systematic review of diagnostic accuracy begins with defining the
clinical context and developing a precise description of the diagnostic
question for which test accuracy is to be assessed. This part of the process
is similar to the development of the protocol for a primary study. It includes
specification of the clinical question giving rise to the potential use of the
test or tests under investigation, the technical characteristics of the tests,
the conditions under which the tests are interpreted, and the reference
information used in the assessment of test accuracy
[8]. Because systematic reviews
of diagnostic accuracy are called on to inform the use of diagnostic tests in
clinical care, comparisons of alternative tests are most valuable.
Literature Search and Retrieval of Studies
Although an extensive literature on search strategies for studies of therapy
is available, the corresponding body of literature on diagnostic test
evaluation is relatively small. Deville et al. [9] and Bachmann et al. [10]
discuss strategies relating to diagnostic and screening tests.
The search for appropriate studies must be comprehensive, objective, and reproducible, and the searcher must consider all available evidence. The search should not simply be for documents in English and should cover publications beyond journals, such as conference proceedings and other reports. Hand searching through publications, reference checking, and searching for unpublished reports often is necessary, especially to assess the extent of publication bias. Finally, it is important to document the process and the outcome of each search.
Assessment of Study Quality and Applicability
The scope of assessment of study quality is broad and not generally well
defined. In the context of studies of diagnostic performance, assessment of
quality has to consider the important features of the design and execution of
the study, including factors such as definition of the research question and
clinical context, specification of appropriate patient population, description
of the diagnostic techniques under study and their interpretation, detailed
accounting of how the reference standard information was defined and obtained,
and any other factors that can affect the integrity of the study and the
generalizability of the results.
Methods of quality assessment may focus on the absence or presence of key qualities in the study report (checklist approach), use scores developed for this purpose (scale approach), or use the levels-of-evidence methods by which a level or grade is assigned to studies fulfilling a predefined set of criteria. The literature on assessment of the quality of therapy studies is extensive, at least in comparison with the literature on diagnostic test evaluations [11, 12]. Two developments in the diagnostic area are the Standards for Reporting of Diagnostic Accuracy (STARD) checklist for reporting of studies of diagnostic accuracy [4, 13, 14] and the quality assessment tool for diagnostic accuracy (QUADAS) for assessing the quality of studies of diagnostic accuracy [12, 15]. The former may be beneficial in improving the quality of published reports and, indirectly, in improving the quality of primary studies. The latter is a rigorously constructed tool that can be used by investigators undertaking new systematic reviews.
Incorporation of quality assessment results into meta-analysis is a matter of debate. A simple and perhaps draconian approach is to exclude studies of poor quality. A less drastic alternative is to use quality scores as weights in the statistical analysis. However, the exact definition of the weights is often a matter of disagreement, and the statistical rationale for their use is shaky. Another alternative, which we recommend to investigators, is to conduct sensitivity analysis. The goal of sensitivity analysis is to assess the contribution of poor-quality studies to the results of the full meta-analysis. The assessment is made by comparing the results of the statistical analysis obtained with and without the studies in question. Sensitivity analysis also can be used to assess the effect on diagnostic accuracy of a study characteristic or a combination of study characteristics.
Extraction of Data
In studies of imaging techniques, test results are most commonly reported
as binary (yes or no) or ordinal categoric. An example of the latter often
used in ROC studies is a five-category scale for degree of suspicion about the
presence of a target condition. The categories are commonly described as
follows: 1 = definitely normal, 2 = probably normal, 3 = equivocal, 4 =
probably abnormal, and 5 = definitely abnormal. In recent years, degree of
suspicion assessments also have been made on nearly continuous scales, for
example, scales from 1 to 100. Continuous test results are typically reported
in the evaluation of laboratory tests, such as the concentration of a
substance.
A binary test result is typically obtained by dichotomizing a test outcome measured on a continuous scale. The continuous scale can be observed directly, as is the case with many laboratory tests. As an alternative, the scale can be a latent, unobservable one, as is the case with the observer's degree of suspicion in ROC studies. In either case, the binary test result is obtained by application of a threshold for test positivity. The presence of such a threshold is a fundamental theme in the evaluation of diagnostic and screening tests.
In this primer, as in most published work on diagnostic and screening test evaluation, disease status is assumed to be binary. Thus, for a particular threshold of test positivity, the study results can be presented in the familiar two-by-two table showing cross classification of disease status and test outcome (Table 1).
Although it may seem reasonable to expect that obtaining an appropriate two-by-two table from a published study should be rather straightforward, practical experience suggests that this is not always the case. Investigators need to consider carefully the data report and may also need to contact the authors of the report to obtain the necessary information.
Measures of test performance are defined either conditionally on disease status (sensitivity, specificity) or conditionally on test result (predictive value). Commonly used metrics include test sensitivity = P(T+|D+); specificity = P(T-|D-); positive predictive value = P(D+|T+); and negative predictive value = P(D-|T-), where P(...) is the probability of the event in parentheses, T is the test result, and D is the true disease status. In addition, studies may report other metrics, such as diagnostic odds ratio: OR = (sens × spec) / [(1 - sens)(1 - spec)]; positive likelihood ratio: LR+ = P(T+|D+) / P(T+|D-) = sens / (1 - spec); and negative likelihood ratio: LR- = P(T-|D+) / P(T-|D-) = (1 - sens) / spec. See also the recent article in the AJR by Weinstein et al. [16].
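The definitions above can be illustrated with a short computation. The following Python sketch is not part of the original article; the function name, the argument names, and the counts are hypothetical and serve only to show how each measure follows from the four cells of a two-by-two table.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Accuracy measures from the four cells of a two-by-two table.

    tp, fn: diseased subjects with positive / negative test results.
    fp, tn: nondiseased subjects with positive / negative test results.
    Cells equal to zero can cause division by zero; handling them
    (for example, with a continuity correction) is outside this sketch.
    """
    sens = tp / (tp + fn)            # P(T+|D+)
    spec = tn / (tn + fp)            # P(T-|D-)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),       # P(D+|T+)
        "npv": tn / (tn + fn),       # P(D-|T-)
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
        "dor": (tp * tn) / (fp * fn),
    }


# Hypothetical counts, chosen only for illustration.
measures = diagnostic_measures(tp=80, fp=10, fn=20, tn=90)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Note that the diagnostic odds ratio equals LR+ divided by LR-, and that the predictive values computed this way reflect the disease prevalence in the study sample.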
This primer is concerned mainly with meta-analysis of studies reporting estimates of pairs of sensitivity and specificity. The methods discussed in the next section assume the availability of a single two-by-two table from each study. However, the results of some studies are reported with more than one threshold of test positivity and even more than one definition of disease status. It is important for investigators to record all the information on alternative thresholds reported in retrieved studies and to determine which of the thresholds of test positivity is the most relevant for the purposes of the systematic review. Methods for combining data when several thresholds are used in each study are beyond the scope of this primer but are discussed briefly later in the Other Methods section.
Statistical Analysis
Because binary test outcomes are defined on the basis of an explicit or
implicit threshold for test positivity, it follows that measures of binary
test performance depend on the particular threshold used to generate the
binary test outcomes. This dependence is a fundamental aspect of diagnostic
test evaluation. In the case of test sensitivity and specificity, dependence
on the threshold induces a tradeoff between the two quantities as the
threshold for positivity is moved across all possible values. The curve of all
pairs of sensitivity and specificity values achieved by moving the threshold
across its possible range is the ROC curve
[17,
18].
Comparison of tests on the basis of ROC curves takes into consideration the actual curves and is aided by summary measures that have been proposed in the literature. The area under the curve (AUC) is the most commonly used summary and can be interpreted as average sensitivity for the test, taken over all specificity values. Strictly speaking, the AUC is equal to the probability that if a pair of diseased and nondiseased subjects is selected at random, the diseased subject will be ranked correctly by the test. Other summaries of the ROC curve include partial areas under the curve, values of sensitivity corresponding to selected values of specificity (and vice versa), and optimal operating points, defined according to specific criteria. ROC analysis and other statistical methods for diagnostic test evaluation are described in textbooks by Zhou et al. [19] and Pepe [20] and in chapters by Toledano et al. [21] and Toledano [22].
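The probabilistic interpretation of the AUC can be made concrete with a small sketch. The Python code below is illustrative only: the function name and the ratings are hypothetical, and the convention of giving half credit to tied pairs is an assumption (it corresponds to the trapezoidal area under the empirical ROC curve).

```python
def empirical_auc(diseased_scores, nondiseased_scores):
    """Proportion of (diseased, nondiseased) pairs ranked correctly.

    A pair is ranked correctly when the diseased subject has the higher
    score. Tied pairs receive half credit (an assumed convention).
    """
    credit = 0.0
    for x in diseased_scores:
        for y in nondiseased_scores:
            if x > y:
                credit += 1.0
            elif x == y:
                credit += 0.5
    return credit / (len(diseased_scores) * len(nondiseased_scores))


# Hypothetical ratings on the five-category degree-of-suspicion scale.
diseased = [5, 4, 4, 3, 2]
nondiseased = [1, 2, 2, 3, 4]
auc = empirical_auc(diseased, nondiseased)
print(f"empirical AUC = {auc:.2f}")
```

With these ratings, 19.5 of the 25 possible pairs are credited, giving an empirical AUC of 0.78.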
Digression to ROC analysis is necessary to highlight the role of the positivity threshold and its consequences. A direct implication of this issue in meta-analysis of sensitivity and specificity estimates is that the method has to account for the possibility of different thresholds across studies. The use of simple or weighted averages of sensitivity and specificity to draw statistical conclusions is not methodologically defensible. A simple example to illustrate this point is a meta-analysis of three studies with the sensitivity and specificity estimates described in Figure 1. The estimated sensitivity and specificity pairs are (0.1, 0.9), (0.8, 0.8), and (0.9, 0.1). The average pair is (0.6, 0.6). Clearly, the (0.6, 0.6) pair does not represent these data in any useful way; thus, a simple averaging of sensitivity and specificity is not an adequate approach.
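The arithmetic of this illustration is easy to reproduce. The sketch below (not from the original article; the helper name is hypothetical) averages the three pairs and, as an additional check, computes the diagnostic odds ratio implied by each study and by the averaged pair.

```python
# (sensitivity, specificity) pairs from the three-study illustration.
pairs = [(0.1, 0.9), (0.8, 0.8), (0.9, 0.1)]

mean_sens = sum(sens for sens, _ in pairs) / len(pairs)
mean_spec = sum(spec for _, spec in pairs) / len(pairs)
print(f"average pair: ({mean_sens:.1f}, {mean_spec:.1f})")


def odds_ratio(sens, spec):
    """Diagnostic odds ratio implied by a sensitivity/specificity pair."""
    return (sens * spec) / ((1 - sens) * (1 - spec))


for sens, spec in pairs:
    print(f"study ({sens}, {spec}): DOR = {odds_ratio(sens, spec):.2f}")
print(f"averaged pair: DOR = {odds_ratio(mean_sens, mean_spec):.2f}")
```

The individual studies imply odds ratios of 1, 16, and 1, whereas the averaged pair implies 2.25, a value that describes none of the three studies.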
Interpretation of Results
Interpretation of the findings from a meta-analysis of diagnostic
performance must address the relevance of the results to the four general aims
stated earlier. That is, this section of the report should highlight the
specific ways in which the data provide information about the proper use of
the particular test, preferably in comparison with alternative techniques;
discuss how the findings can be used to make decisions about health care
policy and financing; summarize the quality of the available studies, pointing
to areas in which more research is needed; and provide information about
possible areas of improvement in the performance of the techniques under
review.
Statistical Methods for Meta-Analysis of Sensitivity and Specificity Data
Summary ROC (SROC) Curve for a Single Test
Our focus is on meta-analyses in which each study contributes a two-by-two
table of data, on the basis of which a pair of estimates of sensitivity and
specificity can be obtained. To introduce statistical notation, the
ith study (i = 1, ..., I) contributes data in the format shown
in Table 2. With the notation of Table 2, the estimates of
sensitivity and 1 - specificity from the ith study are
TPR_i = d_i / n_i1 and FPR_i = b_i / n_i0, where TPR is the
true-positive rate and FPR is the false-positive rate.
The display of paired estimates of sensitivity and specificity in ROC coordinates (FPR, TPR) is a key step in the process of statistical analysis. Such plots ideally should include error bars for each of the two estimates. However, the bars often make the plot rather busy. An additional plot to consider is a forest plot, which shows the sensitivity and specificity estimates of each study side by side and may also include the numerators and denominators used to construct the estimates (Figure 2).
The fitted model provides a value of D for each value of S. In the final step, the D and S pairs are transformed back into ROC coordinates to obtain an SROC curve.
The transformed variable D is the logarithm of the diagnostic odds ratio estimated from each individual primary study in the meta-analysis. The variable S has a less straightforward interpretation. A little algebra shows that S increases when the probability of a positive test result increases in both the diseased and nondiseased populations. Hence, S can be interpreted as a proxy for the test positivity threshold operating in the particular study. This way of constructing an SROC curve is based, roughly, on an implicit assumption that the variation in diagnostic odds ratio across studies is a function of the threshold for test positivity.
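The steps of the linear regression approach can be sketched in code. The Python example below assumes the widely used definitions D = logit(TPR) - logit(FPR) and S = logit(TPR) + logit(FPR) and an unweighted least-squares fit of D = a + bS; these definitions, the function names, and the data points are assumptions made for illustration and are not taken from the article's own equations.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


def fit_sroc_linear(points):
    """Unweighted least-squares fit of D = a + b*S.

    points: list of (fpr, tpr) pairs, each strictly between 0 and 1.
    Returns the intercept a and slope b.
    """
    d = [logit(tpr) - logit(fpr) for fpr, tpr in points]
    s = [logit(tpr) + logit(fpr) for fpr, tpr in points]
    s_bar = sum(s) / len(s)
    d_bar = sum(d) / len(d)
    sxx = sum((si - s_bar) ** 2 for si in s)
    sxy = sum((si - s_bar) * (di - d_bar) for si, di in zip(s, d))
    b = sxy / sxx
    a = d_bar - b * s_bar
    return a, b


def sroc_tpr(a, b, fpr):
    """Back-transform the fitted line to ROC coordinates (requires b != 1)."""
    return expit((a + (1 + b) * logit(fpr)) / (1 - b))


# Hypothetical (FPR, TPR) estimates from four studies.
studies = [(0.05, 0.60), (0.10, 0.70), (0.20, 0.85), (0.30, 0.88)]
a_hat, b_hat = fit_sroc_linear(studies)
for fpr in (0.1, 0.2, 0.3):
    print(f"FPR = {fpr:.1f}: TPR on SROC = {sroc_tpr(a_hat, b_hat, fpr):.3f}")
```

The back-transformation follows from solving D = a + bS for logit(TPR), which gives logit(TPR) = [a + (1 + b) logit(FPR)] / (1 - b). Studies with an observed rate of 0 or 1 would need a continuity correction before the logit transformation, which this sketch omits.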
The foregoing model can be easily extended to incorporate covariates
measuring study characteristics or group characteristics of the participants
in the individual primary studies. The covariates then enter the linear model
as additional regressors alongside S.
SROC summaries
In analogy with the usual ROC curve, a natural summary of the SROC is the AUC. However, the choice of the exact limits for defining the area is a matter of some debate. In particular, some authors prefer to compute the area only over the range of the observed FPR values to avoid the inherent uncertainties about extrapolating beyond the range of the observed data. Other authors support the use of a partial area over a range of FPR values of interest in the context of the particular test. In this primer we report the full AUC estimates because of their simplicity, intuitive interpretation, and avoidance of arbitrary choices of limits of FPR values.
Another global summary of the SROC curve is the so-called Q* ("Q-star") statistic, which measures the value of TPR at the point where the curve intersects the x + y = 1 diagonal line. This is the point on the curve where sensitivity equals specificity. For a symmetric curve, this value is also the point at which the curve is closest to the ideal point (FPR = 0, TPR = 1).
In addition to the global summary measures, the SROC curve can be used to estimate TPR for each fixed value of FPR and vice versa. Standard errors of the estimates can be obtained using the delta method. We include such estimates in the analysis of the cervical cancer data (Figure 5).
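Both global summaries can be computed directly from the parameters of a fitted curve. The sketch below assumes the linear model D = a + bS with D = logit(TPR) - logit(FPR) and S = logit(TPR) + logit(FPR); under that assumption S = 0 on the line TPR = 1 - FPR, so Q* depends only on a. The function names and the numeric integration scheme are illustrative choices, not the authors' implementation.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


def q_star(a):
    """TPR at the point where sensitivity equals specificity.

    On the line TPR = 1 - FPR, S = 0, so D = a and 2 * logit(TPR) = a.
    """
    return expit(a / 2)


def sroc_auc(a, b, n=20000):
    """Full area under the SROC curve, approximated by the midpoint rule."""
    total = 0.0
    for k in range(n):
        fpr = (k + 0.5) / n  # midpoints avoid FPR = 0 and FPR = 1
        total += expit((a + (1 + b) * logit(fpr)) / (1 - b))
    return total / n


# Hypothetical parameter values for illustration.
a, b = 2.0, 0.0
print(f"Q* = {q_star(a):.3f}, AUC = {sroc_auc(a, b):.3f}")
```

For the symmetric case b = 0, the area has the closed form OR [(OR - 1) - ln(OR)] / (OR - 1)^2 with OR = exp(a), which provides a convenient check on the numeric integration.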
SROC properties and limitations
The shape of the SROC curve derived from the foregoing linear regression model depends on the values of the linear model parameters a and b [24]. The special case of b = 0 corresponds to the situation in which the true diagnostic odds ratio is assumed to be constant across all studies. In this case, the SROC curve is symmetric about the x + y = 1 diagonal line. If b ≠ 0, the curve is not symmetric. Indeed, it turns out that when | b | > 1, the SROC curve derived from the linear regression model has a counterintuitive property: According to the curve, the sensitivity of the test decreases as the FPR increases. Estimated values of b greater than 1 or less than -1 indicate that the simple linear regression model is not adequate for constructing an SROC curve.
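The counterintuitive behavior for | b | > 1 can be seen from the slope of the curve on the logit scale. Assuming D = logit(TPR) - logit(FPR) and S = logit(TPR) + logit(FPR), solving D = a + bS for logit(TPR) gives a slope of (1 + b) / (1 - b) against logit(FPR). The short sketch below, with a hypothetical function name, evaluates the sign of that slope for several values of b.

```python
def logit_scale_slope(b):
    """Slope of logit(TPR) against logit(FPR) implied by D = a + b*S."""
    if b == 1:
        raise ValueError("slope is undefined when b equals 1")
    return (1 + b) / (1 - b)


for b in (0.0, 0.5, -0.5, 1.5, -1.5):
    slope = logit_scale_slope(b)
    trend = "increases" if slope > 0 else "decreases"
    print(f"b = {b:+.1f}: slope = {slope:+.2f}, TPR {trend} with FPR")
```

The slope is positive whenever b lies strictly between -1 and 1 and negative when | b | > 1, which is the situation described above.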
Binary regression for SROC analysis
Because of the methodologic difficulties described, it is prudent for investigators to consider the use of alternative approaches to estimating SROC parameters for purposes of formal statistical inference. An early such approach predated the linear regression method and used the bivariate normal distribution of the estimates of sensitivity and specificity from each study, with a linear relation between the true values of sensitivity and specificity to account for the effect of threshold [25].
A streamlined alternative to the linear regression model is to use a variant of logistic regression, which directly models the data in each two-by-two table [26-28]. If Y is the binary test result (yes = 1, no = 0) and D the binary disease status for an individual patient in a given study, the model expresses the probability of a positive test result in terms of D, a study-specific threshold, a location parameter, and a scale parameter ß. In other words, the binary test outcome is obtained by dichotomizing a continuous variable that has different distributions for diseased and nondiseased subjects. The location parameter measures the distance between the centers of the diseased and nondiseased populations, and the scale parameter ß measures the ratio of the SDs in the two populations. The mathematical details of the model and its relation to the linear model approach are sketched in Appendix 1. The use of binary regression allows investigators to avoid key difficulties associated with the linear model approach, notably the errors-in-variables problem and the need to account for differences in sample size across studies. As shown in Appendix 1, it is possible to translate the findings of binary regression analysis into linear model parametrization. However, the SROC curves obtained from binary regression analysis always lead to values of the slope between -1 and 1 and hence avoid the counterintuitive properties of curves with | b | > 1 obtained from the linear model. Binary regression models can be fitted with standard software, such as Proc NLMixed in SAS [29]. The SAS code for fitting a binary regression model using Proc NLMixed is in Appendix 2.
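To make the relation between the two parametrizations concrete, the sketch below assumes one common way of writing such a model: logit P(Y = 1 | D) = (theta + alpha × D) × exp(-ß × D), with D coded 1/2 for diseased and -1/2 for nondiseased subjects. The symbol names theta (threshold) and alpha (location), the function names, and the conversion formulas are assumptions derived under this parametrization; they may differ in form from the formulas in Appendix 1. The code does not fit the model; it only traces the curve implied by given parameter values.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


def positive_rates(theta, alpha, beta):
    """(FPR, TPR) implied by threshold theta under the assumed model
    logit P(Y = 1 | D) = (theta + alpha * D) * exp(-beta * D), D = +/- 1/2."""
    tpr = expit((theta + alpha / 2) * math.exp(-beta / 2))
    fpr = expit((theta - alpha / 2) * math.exp(beta / 2))
    return fpr, tpr


def sroc_tpr(alpha, beta, fpr):
    """SROC curve obtained by eliminating theta from the two rates."""
    return expit(alpha * math.exp(-beta / 2) + math.exp(-beta) * logit(fpr))


def linear_model_parameters(alpha, beta):
    """Intercept a and slope b of D = a + b*S tracing the same curve."""
    return alpha / math.cosh(beta / 2), -math.tanh(beta / 2)


# Hypothetical parameter values for illustration.
alpha, beta = 2.0, 0.4
a, b = linear_model_parameters(alpha, beta)
print(f"a = {a:.3f}, b = {b:.3f}")
for theta in (-1.0, 0.0, 1.0):
    fpr, tpr = positive_rates(theta, alpha, beta)
    print(f"theta = {theta:+.1f}: FPR = {fpr:.3f}, TPR = {tpr:.3f}")
```

Under this parametrization the implied slope is b = -tanh(ß/2), which always lies strictly between -1 and 1, in agreement with the property stated above.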
Example: Meta-Analysis of Cervical Cancer Staging Data
To illustrate the SROC method we use data from a meta-analysis of
diagnostic imaging tests in the detection of lymph node metastasis in patients
with cervical cancer [30].
This systematic review was conducted to compare the performance of three
imaging techniques: lymphangiography (LAG), CT, and MRI. The published report
describes how the problem was formulated, how the relevant studies were
identified and reviewed, and how the diagnostic performance data were
extracted. Briefly, studies were located with a MEDLINE literature search
combined with hand searching of bibliographies from retrieved articles.
Included studies had histologic confirmation of cervical cancer, uniformly
appropriate reference standard information, and evidence of blinding in study
design. In addition, included studies had a minimum sample size of 20
patients, reported criteria for test positivity, and presented sufficient data
to complete the necessary two-by-two table.
In our example we included data from 42 studies, 13 of which evaluated LAG, 19 evaluated CT, and 10 evaluated MRI. Nine studies evaluated more than one test, but this feature of the data is ignored for the purposes of this analysis. The pairs of observed values of sensitivity and specificity are presented in ROC coordinates in Figure 3. We are not using exactly the same set of studies presented in the published paper, and hence the results of this example may differ from those in the article, particularly in the case of the LAG evaluation.
SROC curves were derived separately for each test by both the binary and
the linear regression methods. The results of the binary regression fit are
presented in detail and are followed by summary tables from the linear
regression fit. The latter are included for comparison purposes. For each
test, the binary regression model assumed common location and scale (ß)
parameters across the studies but a separate threshold value for each
study. Table 3 summarizes the results from the binary regression fit.
For all three techniques, the estimated scale parameter is not statistically significantly different from zero. Rather than assuming it is zero and plotting the SROC curves as symmetric, we used the estimated value of ß to derive the plots in Figures 3 and 4. The SROC curves are superimposed on the observed data in Figure 4. The SROC curves with superimposed 95% confidence intervals for TPR and FPR at three points are shown in Figure 5.
For comparative purposes, we present the numeric results from the linear regression fit of the SROC curve (Table 4). The actual curves and summary estimates of AUC and Q* are close but not identical to those derived from the binary regression analysis. For a more detailed view of the comparison, we converted the SROC equation from the binary regression to the form that would be obtained from a linear regression fit, using the formulas in Appendix 1. Table 5 shows the results for CT.
Other Methods
The SROC method is limited in two important respects. First, the statistical framework does not consider the presence of random variation between studies. This fixed-effects framework implicitly assumes that the universe of all studies to which inferences apply is only the specific studies used in the meta-analysis and that in addition to sampling variation within studies, the only other possible variation can be explained by study-level covariates. As a result of its assumptions, a fixed-effects approach to meta-analysis is generally expected to provide artificially more precise results than an approach that provides a fuller account of variability in the data [31]. The second important limitation of the specific fixed-effects approach is that it ignores correlations in the data within studies. In this section, we briefly discuss statistical methods based on hierarchical models designed to address these limitations. We also provide references to the literature on meta-analysis of ROC studies.
Hierarchical Summary ROC Analysis
The binary regression model is the building block for a hierarchical model
describing the full range of variation in the data. In particular, the
hierarchical model differentiates within-study from between-studies
variability and systematic from random variability. For example, a model for
the cervical cancer data accounts for two levels of variability. In level 1,
within-study variation is modeled by binary regression. In level 2,
between-studies variation is modeled by distributions of the threshold and
location parameters. The mean of the distribution of the parameters may depend
on study-level covariates (e.g., test type).
A hierarchical model can be fitted with fully bayesian methods [27] or likelihood-based approximations as implemented in the Proc NLMixed procedure of SAS [28]. A Hierarchical SROC (HSROC) curve can be derived by use of the population means of the parameters. In addition to providing a full account of the variability in the data, the hierarchical model accounts implicitly for correlations within studies. If information exists for such correlations, it can be included explicitly by suitable extensions of the model. In particular, such formulations are useful for modeling data from studies conducted with paired designs.
An alternative way to build hierarchical models for diagnostic accuracy data is to consider a variant of the Kardaun approach and use the bivariate asymptotic normal distribution of the estimates of sensitivity and specificity from each study [32]. Although this approach has been used to derive "average" estimates of sensitivity and specificity, a practice criticized earlier in this article, it is easy to modify the model to derive SROC curves.
Meta-Analysis with Multiple Thresholds from Individual Studies
In the SROC methods discussed earlier, it is assumed that a single
two-by-two table is obtained from each study. If multiple thresholds for test
positivity are used in the primary studies, ordinal regression methods and
their hierarchical formulations can be used to perform the statistical
analysis [33].
Meta-Analysis of ROC Data
The choice of suitable statistical methods for combining data from ROC
studies depends on the type of data considered. If the full ROC data are
available (for example, the complete two-by-five table of disease status
by test results when a five-point ordinal categoric scale is used), then
ordinal regression methods can be used. It is not necessary for all studies to
use the same number of categories in reporting of test results [33, 34].
If the emphasis is on meta-analysis of summaries of the ROC curve, the appropriate methods have to be tailored to the specific summary. For meta-analysis of estimates of the AUC from independent studies, McClish [35] describes weighted average estimators, Zhou [36] describes a generalized estimating equation approach, and Hellmich et al. [37] describe a bayesian method. A hierarchical model for such data can be constructed in a straightforward manner with the asymptotic distribution of the estimate of the AUC for the first level of the model and proceeding as in the HSROC model for the other levels. Because the distributions involved are all normal, the process of fitting and checking such models is fairly routine [31, 38].
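As a generic illustration of combining AUC estimates, the sketch below applies inverse-variance pooling under a fixed-effects model and under a random-effects model with the DerSimonian-Laird moment estimate of between-study variance. This is a standard general-purpose technique shown for orientation only; it is not the specific estimator of McClish, Zhou, or Hellmich et al., and the function name and data are hypothetical.

```python
import math


def pool_estimates(estimates, variances):
    """Inverse-variance pooling with fixed and random effects.

    Requires at least two studies. Returns
    (fixed_mean, fixed_se, random_mean, random_se, tau2), where tau2 is
    the DerSimonian-Laird estimate of between-study variance.
    """
    w = [1 / v for v in variances]
    sw = sum(w)
    fixed_mean = sum(wi * y for wi, y in zip(w, estimates)) / sw
    q = sum(wi * (y - fixed_mean) ** 2 for wi, y in zip(w, estimates))
    c = sw - sum(wi ** 2 for wi in w) / sw
    k = len(estimates)
    tau2 = max(0.0, (q - (k - 1)) / c)
    w_re = [1 / (v + tau2) for v in variances]
    sw_re = sum(w_re)
    random_mean = sum(wi * y for wi, y in zip(w_re, estimates)) / sw_re
    return fixed_mean, math.sqrt(1 / sw), random_mean, math.sqrt(1 / sw_re), tau2


# Hypothetical AUC estimates and their within-study variances.
aucs = [0.80, 0.85, 0.90]
variances = [0.001, 0.001, 0.001]
fixed_mean, fixed_se, random_mean, random_se, tau2 = pool_estimates(aucs, variances)
print(f"fixed:  {fixed_mean:.3f} (SE {fixed_se:.4f})")
print(f"random: {random_mean:.3f} (SE {random_se:.4f}), tau^2 = {tau2:.4f}")
```

When the estimates vary more than within-study sampling error would predict, the random-effects standard error is larger than the fixed-effects one, which reflects the point made earlier that a fixed-effects analysis can give artificially precise results.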
Discussion
As interest in evidence-based diagnosis increases, so does the demand for information from systematic reviews of studies of diagnostic accuracy. The information from such reviews is a key ingredient for all subsequent evaluation of diagnostic techniques. Because empiric studies of test outcomes can be prohibitively difficult to conduct in practice, research synthesis and modeling of health outcomes and costs often remain the only viable options. For such undertakings, the information from meta-analysis of diagnostic performance is crucial.
Meta-analysis of accuracy evaluations is not as streamlined or easy to perform and summarize as meta-analysis of therapy evaluations. A key difference is the nature of the summary measure. In therapy studies, the summary can be as simple as an overall success rate with appropriately quantified variability and uncertainty. In diagnostic accuracy studies, however, the summary is a curve (or several curves if patient subsets are considered). Comparisons of curves are inherently more complex and nuanced than comparisons of means or proportions. Thus, systematic reviews of diagnostic accuracy present the research community with a challenging set of questions about how best to summarize the information and how to use it in analysis and decision making. For example, the methodology for incorporation of SROC curves in modeling outcomes and costs is not fully developed, and practical experience in this type of analysis is relatively scarce. In most published modeling exercises, the sensitivity and specificity of tests are assumed to be a single pair of numbers.
Two major determinants of the success of systematic reviews of diagnostic accuracy are the availability of relevant studies of adequate quality and the development of a consensus around the methods for such reviews. In recent years, the quality of diagnostic and screening test evaluations has improved, but considerable room for improvement remains [14, 39-41]. In the same period, the methods for systematic review of diagnostic accuracy have progressed and matured. Evidence of methodologic progress is the growing list of published work and the formation of the Cochrane Diagnostic Reviews initiative [42] late in 2003. The researchers involved in this initiative are at work preparing the methodologic infrastructure for performing diagnostic accuracy reviews and including them in a new division of the Cochrane Library.
Despite progress in study quality and reporting and in methodologic development, major challenges confront investigators venturing into the world of systematic reviews of diagnostic and screening tests. The following is a partial list of challenges:
In confronting methodologic and practical challenges, investigators conducting systematic reviews of diagnostic accuracy are likely to find colleagues and collaborators. The era of evidence-based diagnosis is here to stay.
APPENDIX 1: Binary Regression Model
For a single study, the model can be described as follows. Let $Y_{ij}$ represent the test result (1 = positive, 0 = negative) and $D_{ij}$ the true disease status of the jth individual in the ith study. In our notation, we code $D = 1/2$ if diseased and $D = -1/2$ if nondiseased. The binary regression model is based on the assumption that the response arises from the discretization of an underlying continuous latent variable with threshold $\theta_i$. The latent variable follows logistic distributions for diseased and nondiseased subjects, and the two distributions can be distinguished by a location parameter ($\alpha_i$) and a scale parameter ($\beta_i$). The diagnostic performance of the test in the ith study is a function of the location and the scale parameters. Formally,
$$\mathrm{logit}\, P(Y_{ij} = 1 \mid D_{ij}) = (\theta_i - \alpha_i D_{ij})\, e^{-\beta_i D_{ij}}$$
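The binary regression model is closely related to the usual ROC model and implies that for the ith study:
$$\mathrm{logit}\, P(Y_{ij} = 1 \mid D_{ij} = 1/2) = (\theta_i - \alpha_i / 2)\, e^{-\beta_i / 2}$$
$$\mathrm{logit}\, P(Y_{ij} = 1 \mid D_{ij} = -1/2) = (\theta_i + \alpha_i / 2)\, e^{\beta_i / 2}$$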
If the location and scale parameters are assumed to be constant across all studies, the model reduces to a relation between the true-positive rate (TPR) and false-positive rate (FPR) that is similar to the relation postulated in the model described by Moses et al. [23]. In particular:
$$\mathrm{logit}(\mathrm{TPR}) = c_0 + c_1\, \mathrm{logit}(\mathrm{FPR})$$
where $c_0 = -\alpha e^{-\beta / 2}$ and $c_1 = e^{-\beta}$. It is clear that $c_1$ is greater than 0 and that $b$, which is equal to $(c_1 - 1) / (c_1 + 1)$, takes values between -1 and 1.
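The link between the binary regression parameters and the Moses-type linear relation can be checked numerically. The Python sketch below assumes the parameterization used in the SAS code of Appendix 2, logit p = (θ - αD) / exp(βD) with D = ±1/2, and a common α and β across studies. The function names and the parameter values are hypothetical and chosen only for illustration. Under this sign convention a negative α corresponds to a test whose TPR exceeds its FPR.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def rates(theta, alpha, beta):
    """TPR and FPR implied by logit p = (theta - alpha*D) / exp(beta*D)."""
    tpr = expit((theta - alpha * 0.5) / math.exp(beta * 0.5))    # D = +1/2
    fpr = expit((theta + alpha * 0.5) / math.exp(-beta * 0.5))   # D = -1/2
    return tpr, fpr

def moses_coefficients(alpha, beta):
    """Intercept c0 and slope c1 of logit(TPR) on logit(FPR), and the slope b."""
    c0 = -alpha * math.exp(-beta / 2.0)
    c1 = math.exp(-beta)
    b = (c1 - 1.0) / (c1 + 1.0)
    return c0, c1, b

# Hypothetical common location and scale parameters.
alpha, beta = -3.0, 0.4
c0, c1, b = moses_coefficients(alpha, beta)

# Varying the threshold traces out the curve; every point should satisfy
# logit(TPR) = c0 + c1 * logit(FPR) up to rounding error.
max_gap = 0.0
for theta in (-2.0, -1.0, 0.0, 1.0):
    tpr, fpr = rates(theta, alpha, beta)
    max_gap = max(max_gap, abs(logit(tpr) - (c0 + c1 * logit(fpr))))
```

Because only the threshold varies between the points, the pairs (FPR, TPR) lie on a single curve determined by α and β, which is the sense in which the model yields an SROC curve.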
An SROC curve and its summary measures can be estimated from the binary
regression model. In addition, study-level and subject-level covariates can be
easily incorporated, resulting in models of the form:
[Equation image not recoverable from the source.]
The large number of parameters in the binary regression model creates identifiability problems without additional assumptions. For example, the model without covariates has three parameters for each table; hence, it is not identifiable for a single table. However, with suitable assumptions, such as the one leading to the analogue of the Moses model, the binary regression model can be made identifiable. Other assumptions about the parameters allow the exploration of heterogeneity across studies. For example, studies may have different location parameters (thus different overall accuracies) but the same scale parameter and the same threshold. Such exploration of heterogeneity is rather limited within the fixed-effects type of approach we present in this article. More elaborate exploration of heterogeneity requires the use of hierarchical models.
APPENDIX 2: Software for Fitting a Binary Regression Model
SAS Code for Binary Regression Model (for CT)

data binreg1;
  input study test n_pos n_tp dis dis1;
  cards;
1 1 10 8 1 0.5
......................................................
42 3 24 2 0 -0.5
;
run;

data final;
  set binreg1;
  if test=2;  /* keep only the rows for the test being analyzed */
  /* create indicator variable for each study */
  if study=1 then s1=1; else s1=0;
  ...............
  if study=42 then s42=1; else s42=0;
run;

proc nlmixed data=final maxiter=5000 cov;
  /* starting values for the study-specific thresholds; a and b are not */
  /* listed here, so NLMIXED uses its default starting value of 1 for them */
  parms t18=0 t19=0 t20=0 t21=0 t22=0
        t23=0 t24=0 t25=0 t26=0 t27=0 t28=0
        t29=0 t30=0 t31=0 t32=0 t33=0 t34=0
        t35=0 t36=0;
  /* linear predictor: (threshold - location*D) / exp(scale*D) */
  logitp = (t18*s18+t19*s19+t20*s20+t21*s21+t22*s22+t23*s23+t24*s24
           +t25*s25+t26*s26+t27*s27+t28*s28+t29*s29+t30*s30+t31*s31
           +t32*s32+t33*s33+t34*s34+t35*s35+t36*s36 - a*dis1)
           / exp(b*dis1);
  p = exp(logitp) / (1 + exp(logitp));
  model n_tp ~ binomial(n_pos, p);
run;
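For readers who do not use SAS, the objective that the NLMIXED step maximizes can be written out directly. The Python sketch below evaluates the negative binomial log-likelihood for the same model form, logit p = (θ_i - a·dis1) / exp(b·dis1). It is an illustration only, not a port of the authors' program: the function names are hypothetical, the second data row is invented, and the optimization itself (which a general-purpose minimizer would perform) is not shown.

```python
import math

def log_binom_coef(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def neg_log_likelihood(thetas, a, b, rows):
    """Negative binomial log-likelihood of the binary regression model.

    thetas: dict mapping study id to its threshold parameter
    a, b:   common location and scale parameters
    rows:   tuples (study, group_size, n_test_positive, dis1),
            with dis1 = +0.5 for diseased and -0.5 for nondiseased
    """
    total = 0.0
    for study, n, k, d in rows:
        eta = (thetas[study] - a * d) / math.exp(b * d)
        log_p = -math.log1p(math.exp(-eta))   # log of P(test positive)
        log_q = -math.log1p(math.exp(eta))    # log of P(test negative)
        total += log_binom_coef(n, k) + k * log_p + (n - k) * log_q
    return -total

# First row follows the layout of the SAS data step (10 diseased, 8 positive);
# the second row (20 nondiseased, 3 positive) is hypothetical.
rows = [(1, 10, 8, 0.5), (1, 20, 3, -0.5)]

nll_null = neg_log_likelihood({1: 0.0}, 0.0, 0.0, rows)   # p = 0.5 in both groups
nll_alt = neg_log_likelihood({1: 0.0}, -3.0, 0.0, rows)   # groups separated
```

A smaller value of the function indicates a better fit, so parameter values that separate the diseased and nondiseased groups in line with the data give a lower value than the null setting.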
Acknowledgments
We thank the editors for inviting us to prepare this review and the referees for their comments and suggestions.
References