Fundamentals of Clinical Research for Radiologists
1 Departments of Radiology, Neurosurgery and Health Services, and the Center for Cost and Outcomes Research, University of Washington, Seattle, WA.
Received September 21, 2000;
accepted after revision October 5, 2000.
Supported in part by grant HS-094990 from the Agency for Healthcare
Research and Quality and by a Veterans Administration ERIC (Epidemiology
Research and Information Center) grant.
Introduction
In 1977, Fineberg [1] described a hierarchical scheme for evaluating diagnostic tests that consisted of four levels of efficacy. Fryback and Thornbury [2] and Thornbury [3] later revised this scheme into a model consisting of six tiers of diagnostic efficacy (Table 1).
In addition to a hierarchy for what to evaluate, there is also a hierarchy for how to evaluate it. The randomized clinical trial is the "gold standard" in the realm of clinical trials, although few have actually been performed for diagnostic tests. This is in part because of the expense and difficulty of conducting randomized clinical trials. Although the randomized clinical trial is the best scientific method to combat bias, other strategies exist for evaluating diagnostic tests. These strategies include case series, case-control studies, cohort studies, and modeling.
In this article, I review the hierarchical scheme for assessing the efficacy of diagnostic technologies and the various study designs that can be used to evaluate the different levels of efficacy. I end with a brief introduction to some of the issues involved in diagnostic screening.
Once the decision has been made to concentrate on efficacy, the next question is on which aspect of efficacy to focus. Guyatt et al. [5] made the observation that "...we must go beyond accuracy and try to determine if our patients are better off as a result of new technologies." However, the link between patient outcomes and a diagnostic test is frequently tenuous. One may fail to observe a beneficial effect on patient outcome because a test is truly worthless, meaning that it is not accurate. However, there are other possibilities. The information from an accurate test may be used incorrectly by the clinician. Or there may be no effective therapy. Or the patient does not comply with effective therapy. Or the patient may not have adequate access to effective therapy. The six-tiered model disaggregates the overall effect of a diagnostic test in an attempt to discern and account for these various possibilities.
Technical Efficacy
Technical efficacy refers to the ability to produce an image and is
generally measured through the physical characteristics of the image (e.g.,
signal-to-noise ratio, resolution). This phase of investigation should be
exploratory to determine the possible uses for a diagnostic test. One should
explore a wide range of conditions and patients. At this stage, blinded
interpretations should be avoided to allow the discovery of unexpected
correlations and to refine interpretations. The danger of being too stringent
at this stage of evaluation is that the development of promising technologies
might actually be delayed if a rigorous but inappropriately early evaluation
is negative. This phase can also be thought of as the laboratory phase of
investigation, at which time technical parameters are optimized for clinical
use.
Diagnostic Accuracy Efficacy
To be useful, not only must an image be produced, it also must be
interpreted. The ability to differentiate normal from abnormal in the
interpretation of a test is diagnostic accuracy. Diagnostic tests are ideally
compared with a gold standard to determine accuracy. The two-by-two table is
the standard way to display the comparison of a new diagnostic
test, usually called the index test, with that of a gold standard
test, called the reference test (Table 2). The results of the
reference test determine the presence or
absence of disease. The parameters of sensitivity, specificity, positive
predictive value, and negative predictive value can all be derived from a
two-by-two table. The cells of the two-by-two table define four possible test
results: true-positives, false-positives, false-negatives, and true-negatives.
A case is a true-positive (TP) result when the diagnostic test is positive and
the subject has the disease. Similarly, a true-negative (TN) result is when
the diagnostic test is negative and the subject does not have the disease.
False-positive (FP) results occur when a patient without the disease has
positive test findings, and false-negative (FN) results occur when a patient
with the disease has negative test findings.
The sensitivity of a diagnostic test is defined as the number of true-positive cases divided by all cases with the disease (TP / (TP + FN)) (Fig. 1). Specificity is the number of true-negative cases divided by all cases without the disease (TN / (TN + FP)) (Fig. 2). Sensitivity and specificity are related to the columns of the two-by-two table and are stable characteristics of a diagnostic test. This means that they do not change with varying disease prevalence. Positive predictive value refers to the number of patients with the disease with a positive test divided by all those with a positive test (TP / (TP + FP)) (Fig. 3). Negative predictive value is the number of patients without the disease with a negative test divided by all those with negative findings (TN / (TN + FN)) (Fig. 4).
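The four definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `two_by_two_metrics` and the counts passed to it are hypothetical and are not taken from the tables or figures of this article.

```python
def two_by_two_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV, and NPV from two-by-two counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true-positives / all with disease
        "specificity": tn / (tn + fp),  # true-negatives / all without disease
        "ppv": tp / (tp + fp),          # true-positives / all positive tests
        "npv": tn / (tn + fn),          # true-negatives / all negative tests
    }


# Hypothetical counts chosen only for illustration.
metrics = two_by_two_metrics(tp=80, fp=30, fn=20, tn=170)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity each use one disease-status group (diseased or nondiseased), whereas the predictive values each use one test-result group (positive or negative), which is why only the latter depend on how many diseased subjects are in the sample.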
Predictive values are in one respect more clinically relevant than sensitivity and specificity because they answer the question, "If a test is positive or negative, what is the likelihood of a patient having the disease?" In contrast, sensitivity and specificity address the question, "Given that the patient does or doesn't have the disease, what is the probability that the test will be positive or negative?" One important characteristic of predictive values is that, unlike sensitivity and specificity, they vary with disease prevalence. Tables 3 and 4 illustrate this point. Table 3 is a two-by-two table for a diagnostic test with 90% sensitivity and specificity that is applied to a population with a high (50%) prevalence of disease. In this setting, the predictive values are also quite high (90%). However, take the same diagnostic test and apply it to a population with a much lower disease prevalence (1%), and the positive predictive value decreases precipitously.
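The dependence of predictive values on prevalence can be checked with a short calculation. The sketch below uses the 90% sensitivity and specificity and the 50% and 1% prevalences quoted above; the function name `predictive_values` is an illustrative choice, and the calculation works with population fractions rather than the actual cell counts of Tables 3 and 4.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    # Express each cell of the two-by-two table as a fraction of the population.
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


for prevalence in (0.50, 0.01):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

Under these inputs the positive predictive value is 90% at 50% prevalence but falls to roughly 8% at 1% prevalence, because false-positives from the large nondiseased group outnumber the true-positives.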
The two-by-two table assumes that a test result is dichotomous (either positive or negative). However, there are frequently many cut points to define a positive or negative test. This situation can be summarized using a receiver operating characteristic (ROC) curve. The ROC curve is a plot of sensitivity versus 1 - specificity for a family of cut points that define positive and negative for a test. For example, a degenerated disk loses signal on T2-weighted MR images. One can create a scale of 1 to 5 to describe this signal loss, with 1 being no signal loss and 5 being complete signal loss. Now assume that we have a direct line to a divine, omniscient being who tells us gold standard truth as to whether a disk is desiccated. We could then construct an ROC curve using each level of signal abnormality as a cutoff for normal versus abnormal. In the first instance, 1 represents normal and 2 through 5 represent abnormal. The second cutoff would be 1 or 2 are normal and 3 through 5 are abnormal, and so forth. An advantage of ROC curves is that diagnostic accuracy can be quantified for the complete range of cut points by calculating the area under the curve (Az). A perfect diagnostic test would have an Az of 1. A diagnostic test that conveyed no useful information would have an Az of 0.5. Such quantification facilitates the comparison of diagnostic tests.
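The construction of an ROC curve from a five-level rating scale can be sketched as follows. The ratings are invented for illustration, the function names are hypothetical, and the area is estimated with the simple trapezoidal rule, which is one common empirical approach and may differ from a fitted Az.

```python
def roc_points(diseased, healthy, levels=(1, 2, 3, 4, 5)):
    """Return (1 - specificity, sensitivity) pairs, one per cut point."""
    points = [(0.0, 0.0)]  # strictest rule: no rating is called abnormal
    for cut in sorted(levels, reverse=True):
        # Ratings at or above the cut point are called abnormal.
        sensitivity = sum(r >= cut for r in diseased) / len(diseased)
        false_positive_rate = sum(r >= cut for r in healthy) / len(healthy)
        points.append((false_positive_rate, sensitivity))
    return points


def area_under_curve(points):
    """Trapezoidal estimate of the area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical signal-loss ratings (1 = none, 5 = complete).
desiccated = [5, 5, 4, 4, 4, 3, 3, 2, 2, 1]
normal = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]

points = roc_points(desiccated, normal)
az = area_under_curve(points)
print(points)
print(f"Az = {az:.3f}")
```

Each cut point contributes one point on the curve; lowering the cut point raises sensitivity at the cost of specificity, and the area summarizes performance across all of them.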
Diagnostic Impact Efficacy
A diagnostic test can be quite accurate and yet still not provide
clinically useful information. Measures of diagnostic impact attempt to
quantify the importance of a diagnostic test to diagnostic thinking. This is
usually assessed using questionnaires that clinicians complete before and
after receiving the results of the diagnostic test. Clinicians can be asked to
rank diagnostic possibilities or even to assign probabilities to given
diagnoses. If the probabilities converge on a given diagnosis, or important
diagnoses are excluded, then the test has diagnostic merit. Diagnostic entropy
is a concept that stems from the work of Shannon and Weaver
[6] in the 1940s, based on
engineering information theory. The probability for a given diagnosis is
compared with the spread of probabilities over all diagnoses. Diagnostic
entropy increases as the probabilities become more evenly spread across the
diagnoses. Entropy decreases as probabilities concentrate around a single or a
few possibilities. The problem with assessing diagnostic entropy, as well as
other schemes to quantify diagnostic impact, is that it requires clinicians to
make reliable and valid estimates of disease probabilities, something in which
few physicians have training.
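As a rough illustration of diagnostic entropy, the sketch below assumes the usual Shannon form, the negative sum of p times log2(p) over all diagnoses. The article does not state the exact formula, and the probabilities before and after the test are hypothetical.

```python
import math


def diagnostic_entropy(probabilities):
    """Shannon entropy, in bits, of a set of diagnosis probabilities."""
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical clinician estimates for four candidate diagnoses.
before_test = [0.25, 0.25, 0.25, 0.25]  # evenly spread: maximal uncertainty
after_test = [0.85, 0.05, 0.05, 0.05]   # concentrated on one diagnosis

entropy_before = diagnostic_entropy(before_test)
entropy_after = diagnostic_entropy(after_test)
print(f"before: {entropy_before:.2f} bits, after: {entropy_after:.2f} bits")
```

The drop in entropy after the test reflects probabilities concentrating on a single diagnosis, but the result is only as trustworthy as the clinicians' probability estimates that feed it.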
Therapeutic Impact Efficacy
Just as diagnostic impact assesses the ability of a diagnostic test to
affect a diagnosis, therapeutic impact assesses the degree to which a
diagnostic test influences subsequent therapeutic choices. This is also
generally measured with questionnaires given to physicians, but with appropriate
study design, subsequent therapies can be measured, and differences in
therapies can be attributed to diagnostic tests.
Fineberg [1] examined the impact that CT of the head had on diagnostic and therapeutic plans. All physicians requesting a head CT were asked to list the probabilities of the diagnoses being considered. They were also asked, if no CT were available, what diagnostic tests they would definitely and probably require and what their treatment plan would be. Medical records were then reviewed at discharge to determine which diagnostic tests were actually performed and what therapies were instituted. Fineberg found that between 41% and 73% fewer diagnostic tests were performed than were projected by the physician before CT. The therapeutic plan changed in 19% of patients. This study was one of the first published examples measuring the diagnostic and therapeutic impact of a radiologic intervention, and it helped to define the paradigm later adopted by Fryback and Thornbury [2].
Patient Outcome Efficacy
Measures of patient outcome have traditionally been limited to mortality
and morbidity. However, in recent years researchers have focused more
attention on health-related quality of life, which refers to the patient's
appraisal of and satisfaction with his or her current level of functioning as
compared with what the patient perceives to be possible or ideal
[7]. A physician's estimate of
the success or failure of an intervention is no longer sufficient. The
patient's perspective as well has become important in determining efficacy.
This is seen in the study by Dixon et al.
[8], in which the researchers
compared quality-adjusted life years (QALYs), as well as diagnostic and
therapeutic impact, before and after brain and spine MR imaging. A QALY
indicates a patient's willingness to trade off length of life for quality of
life. There are a variety of methods to quality-adjust life years, including
the standard gamble, time trade-off, and rating scales
[9]. These methods will be
described in detail in future articles. Dixon et al.
[8] used a questionnaire (the
QALY toolkit [10]) to estimate
the adjusted quality of life for different health states. The key point is
that quality adjustment is from the patient's and not the physician's
perspective. Although Dixon et al. found important effects on the clinicians'
diagnostic confidence and therapeutic plans, there was no change in the
patients' quality of life.
Societal Efficacy
In the era of constrained resources, those who pay for health care demand
value. This implies that a new technology not only must improve patient
outcomes, but also must maximize the health that can be bought for a dollar.
Cost-effectiveness analyses are now commonly incorporated into the evaluation
of new technologies and in all likelihood will remain an important aspect of
technology assessment. An excellent example of this sort of study was
described by Colice et al.
[11]. The researchers used
decision analytic modeling to compare the cost-effectiveness of screening
asymptomatic patients with lung cancer for brain metastases using head CT
versus scanning patients only when they became symptomatic. They determined
that the cost per QALY ($70,000) with the screening strategy would be
substantially higher than that of many accepted medical interventions, and
thus not justified given the assumptions used in their model.
Methods of Assessing Diagnostic Technologies
In choosing a study design, the first decision for researchers is whether they have a question that should be answered with a descriptive or an analytic study. Descriptive studies, which can also be regarded as hypothesis generating, include case reports, case series, and cross-sectional studies. They usually describe the epidemiologic characteristics of diseases, or in the case of radiology, how imaging findings relate to patient characteristics. Measuring all variables at a single time is the distinguishing characteristic of cross-sectional studies. The classic study by Jensen et al. [12] of MR imaging findings in patients without lower back pain is an example of a cross-sectional study. The researchers identified 98 subjects, performed MR imaging on them, and determined the lack of lower back pain at one time point. In fact, most imaging investigations are cross-sectional in nature. Although cross-sectional studies are relatively easy to perform, a disadvantage is that it is frequently impossible to determine if the exposure preceded the disease or the disease preceded the exposure. For example, it has been observed that individuals with spinal stenosis are more likely to have lower activity levels, but it is impossible to determine from cross-sectional data if it is the stenosis that leads to less activity or less activity that leads to spinal stenosis.
Unlike descriptive studies, analytic studies allow hypothesis testing to determine the association between an exposure (risk factor) and an outcome (disease). Analytic studies can be divided into observational and experimental. Observational studies can be further divided into case-control and cohort studies. Patients in case-control studies are selected on the basis of whether they have the disease (or outcome) of interest. The proportion of cases with the exposure of interest is then compared with that of controls. For example, if sciatica is the disease of interest and nerve root compression is the exposure, a case-control study would identify patients with sciatica and then a matched group of patients without sciatica, and would compare the proportion of each group with nerve root compression.
In contrast, a cohort study chooses subjects on the basis of the exposure (or risk factor) and then examines the proportion of subjects in each exposure group with and without the outcome of interest. These studies are usually done prospectively, with the exposure identified and the subjects then followed up over time for the development of an outcome. However, cohort studies can also be retrospective. Risk factors can be identified in the past and then the cohort assembled on the basis of these past data. One can then look at the subjects' current disease status to determine if a relevant outcome has occurred. An example of a prospective cohort study in radiology is the study by Nevitt et al. [13], who assembled a cohort of subjects with and without new osteoporotic vertebral compression fractures (the risk factor) and looked at the proportion of patients in each group who developed subsequent back pain and functional limitation (the outcomes). They found that new vertebral fractures were strongly associated with increased pain and limitations in functional status.
Case-control studies are particularly useful for examining rare outcomes, because subjects are selected on the basis of their having the outcome of interest. Conversely, cohort studies are useful for rare risk factors, because subjects are chosen on the basis of their having a particular exposure.
Experimental or intervention studies are also prospective cohort studies, because participants are enrolled on the basis of risk factors. However, experimental studies differ from observational studies in that the exposure status is assigned by the investigator. We at the University of Washington are currently conducting a randomized trial comparing rapid MR imaging with radiography as the initial imaging technique in patients with lower back pain. The exposure we are studying is the imaging study, to which patients are randomly assigned. We are measuring a variety of outcomes, but a back-pain-specific functional status measure, the modified Roland scale [14], is our primary outcome of interest. We will monitor patients for 1 year and determine if one exposure group has significantly different outcomes from the other.
Although observational studies can control for known risk factors, both at the design and the analysis stages, a researcher can never be confident that all important risk factors that influence outcome have been identified. The unique strength of a randomized trial is that, on average, all factors, known and unknown, are controlled. Deyo [15] provided the interesting example of comparing two batches of fruit and matching them on characteristics that would seem important, such as shape, source, edibility, size, and weight (Table 5). It might appear to some that the two groups were well matched, but ultimately you're still comparing apples with oranges.
Randomized trials are the most powerful study design for excluding bias, but because they are generally difficult to conduct and are quite expensive, it is neither practical nor desirable to do randomized trials for every diagnostic imaging question. An alternative study design that is potentially widely applicable is modeling. Modeling refers to the use of decision analytic techniques to model clinical situations. Frequently used for cost-effectiveness analysis, decision modeling usually refers to constructing a decision tree that incorporates, in a quantifiable manner, various aspects of clinical practice. The advantage of decision analysis is that it deals systematically with complex situations, although failure to account for all aspects of a complex situation is a potential weakness.
The first step in constructing a decision model is to identify the clinical starting point, which identifies the group of patients for whom the analysis is conducted. Second, the diagnostic and therapeutic choices that can be applied to that population are defined. Third, probabilities are assigned to the information derived from diagnostic tests and intermediate clinical states resulting from treatments. Fourth, patient outcomes are defined that form the end points for the analysis.
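The four steps above can be sketched as a very small decision tree that is "rolled back" to an expected outcome for each strategy. Everything numeric here is a hypothetical assumption made for illustration (a 20% disease prevalence, a test with 90% sensitivity and specificity, and outcome values expressed in QALYs); none of it comes from a published model.

```python
def expected_value(node):
    """Roll back a tree: a leaf is a number, a chance node is a list of
    (probability, subtree) pairs."""
    if isinstance(node, (int, float)):
        return float(node)
    return sum(p * expected_value(child) for p, child in node)


# Step 1: starting population, hypothetical patients with 20% disease prevalence.
# Step 2: strategies, image everyone and treat positives, or image no one.
# Step 3: probabilities, test sensitivity and specificity of 0.9 (assumed).
# Step 4: outcomes, hypothetical QALYs at each end point.
image_and_treat_positives = [
    (0.2, [(0.9, 8.0), (0.1, 5.0)]),    # diseased: treated vs. missed
    (0.8, [(0.9, 10.0), (0.1, 9.5)]),   # healthy: left alone vs. overtreated
]
no_imaging = [
    (0.2, 5.0),   # diseased and untreated
    (0.8, 10.0),  # healthy and untreated
]

for name, tree in (("image first", image_and_treat_positives),
                   ("no imaging", no_imaging)):
    print(f"{name}: {expected_value(tree):.2f} expected QALYs")
```

A real analysis would also attach costs to each branch and vary the assumed probabilities in a sensitivity analysis, since the conclusion can change when any input changes.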
Screening refers to examining people who do not have signs or symptoms for the presence of disease. Black and Welch [16, 17] have highlighted three problems with screening: lead-time bias, length bias, and pseudodisease.
Lead time refers to the interval between detection of clinically occult disease by screening and the point when the disease would have manifested clinically. This lead time causes an apparent increase in survival, known as lead-time bias, in all screening programs. This increase in survival would be equal to the lead time if testing were continuous, but is one-half the lead time for single episodes of screening [18]. Adjusting for lead-time bias usually is not possible, because lead times for new tests are not known, and there is no guarantee that disease detected by screening progresses at the same rate as disease that appears clinically.
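The size of the apparent survival gain can be illustrated with a small simulation. It is a sketch under stated assumptions: the date of death is unchanged by screening, the detectable preclinical phase lasts a fixed, hypothetical 4 years, and a single screen falls at a uniformly random moment within that phase. The function name is illustrative.

```python
import random


def mean_apparent_gain(preclinical_years, n_patients=100_000, seed=0):
    """Average apparent survival gain when each patient is screened once.

    Assumes the screen falls at a uniformly random moment within the
    detectable preclinical phase and that the date of death is unchanged.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_patients):
        time_into_phase = rng.uniform(0.0, preclinical_years)
        # Lead time is whatever remains of the preclinical phase at detection.
        total += preclinical_years - time_into_phase
    return total / n_patients


preclinical_years = 4.0              # hypothetical duration
continuous_gain = preclinical_years  # continuous testing detects at phase onset
single_screen_gain = mean_apparent_gain(preclinical_years)
print(f"continuous screening: {continuous_gain:.2f} years of apparent gain")
print(f"single screen: {single_screen_gain:.2f} years of apparent gain")
```

Under these assumptions the single-screen average comes out at about half of the full lead time, even though no patient in the simulation lives a day longer.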
Disease that progresses more slowly will be more likely to be identified by a screening test than rapidly progressive disease simply because slower-growing cases are in the detectable preclinical stage for a longer time. Thus, screening preferentially detects disease with slower progression compared with disease that manifests clinically. Not surprisingly, this bias, termed length bias, may result in an apparent improvement in survival, when in fact the screening program has only increased the identification of slowly progressive cases relative to the clinically more important rapidly progressive ones.
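Length bias can be illustrated with a similar simulation. The assumptions are hypothetical: slow and fast cases arise equally often, their detectable preclinical phases last 10 years and 1 year respectively, onset is spread uniformly over 100 years, and a single screen takes place at year 50.

```python
import random


def screen_detected_slow_fraction(n_cases=100_000, seed=0):
    """Fraction of screen-detected cases that are slowly progressive."""
    rng = random.Random(seed)
    screen_time = 50.0
    preclinical_years = {"slow": 10.0, "fast": 1.0}  # hypothetical durations
    detected = {"slow": 0, "fast": 0}
    for _ in range(n_cases):
        kind = rng.choice(("slow", "fast"))  # equal incidence of each type
        onset = rng.uniform(0.0, 100.0)
        # A case is detectable only while it is in its preclinical phase.
        if onset <= screen_time < onset + preclinical_years[kind]:
            detected[kind] += 1
    return detected["slow"] / (detected["slow"] + detected["fast"])


slow_fraction = screen_detected_slow_fraction()
print(f"slow cases among screen-detected disease: {slow_fraction:.0%}")
```

Although half of all simulated cases are rapidly progressive, roughly 10 of every 11 screen-detected cases are slow ones, so survival among screen-detected patients looks better without any benefit from earlier treatment.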
Perhaps the ultimate example of length bias is when a screening test detects "disease" that would never manifest itself clinically. Some subjects may have disease that progresses so slowly that the individual would have died from other causes before the disease became clinically apparent. This effect is termed pseudodisease, and it causes an apparent improvement in survival attributable to screening.
I have reviewed a variety of research methods that can be applied to evaluating diagnostic tests. Each has relative advantages and disadvantages that must be weighed before deciding which to use. In addition, a test can be evaluated at several possible levels ranging from diagnostic accuracy to cost-effectiveness. Without a doubt, demand will be increasing for data that can show that a new technology improves patient outcomes. As Guyatt [19] has written:
We must go beyond accuracy and try to determine if our patients are better off as a result of new technologies. Randomized trials focusing on patient outcomes are the only way to investigate these issues convincingly and definitively and should be conducted when the stakes are high enough.
Acknowledgments
I thank Peter Jucovy and Craig Blackmore for their insightful comments.