Fundamentals of Clinical Research for Radiologists
1 All authors: The Russell Morgan Department of Radiology and Radiological Sciences, JHOC Rm. 4155, P. O. Box 0814, Johns Hopkins Medical Institutions, 601 N. Caroline St., Baltimore, MD 21287.
Received March 20, 2002;
accepted after revision April 22, 2002.
This is the seventh in a series designed by the American College of
Radiology (ACR), the Canadian Association of Radiologists,
and the American Journal of Roentgenology. The series, which will
ultimately comprise 22 articles, is designed to progressively educate
radiologists in the methodologies of rigorous clinical research, from the most
basic principles to a level of considerable sophistication. The articles are
intended to complement interactive software that permits the user to work with
what he or she has learned, which is available on the ACR Web site
(www.acr.org).
Introduction
Although medical imaging is used in the diagnosis of most human ailments, mammography is the only diagnostic imaging examination currently in widespread use as a screening tool [7]. Multidetector CT is being evaluated as a means of detecting early-stage lung carcinoma [8, 9] and colorectal adenomatous polyps [10, 11], but it is not yet an accepted routine screening examination. Indeed, the concept of disease screening, including its appropriateness and evaluation, is not as straightforward as it may first appear. Even the basic assumption that early treatment will improve prognosis may not be true in all circumstances. Moreover, even if this assumption is justifiable for a particular condition, the risks or costs that are associated with any screening test (and any consequent "induced" procedures) must be weighed against the benefits. Thus, any new application of an imaging procedure to screen for disease should be considered an unproven method of disease control until its risks, benefits, and costs have been rigorously evaluated. Ideally, such evaluations should be completed before widespread use of the procedure for disease screening is undertaken or recommended [12].
Making and evaluating recommendations on the use of imaging studies for disease screening is one of the more difficult problems in medical imaging and clinical medicine. This article will discuss the use of screening tests for detecting early disease or for detecting risk factors for developing disease. Consideration will be given to the appropriateness criteria for two major elements of health screening programs: the condition or disease for which screening is being performed and the screening test itself. Within the context of these two elements, potential biases and other critical issues in the evaluation of screening programs will be presented.
Appropriateness Criteria: The Disease or Condition Being Screened
Substantial Morbidity or Mortality If Untreated
The criterion of seriousness relates primarily to issues of both
cost-effectiveness and ethics. The elimination or amelioration of adverse
health consequences must justify resource expenditures on radiologic imaging
for disease screening. Likewise, the consequences of failing to detect and
treat the disease early must be sufficiently grave to ethically warrant
exposing individuals to the risks (e.g., radiation exposure or false-positive
diagnosis) and discomforts of the screening procedure itself. Life-threatening
conditions, such as heart disease and cancer, and those known to have serious
and irreversible consequences, such as congenital hypothyroidism and
phenylketonuria, clearly meet the criterion of seriousness. On the other hand,
medical imaging tests should be thoroughly evaluated for risks and benefits
before being used to screen for certain asymptomatic conditions, such as
gallstones. Although asymptomatic gallstones are fairly prevalent, rarely are
they life-threatening and, in fact, the condition may never become
symptomatic.
High Preclinical Prevalence
For a screening test to be effective, it must reveal a sufficient number of
preclinical disease cases to justify the testing costs. Thus, the prevalence
of preclinical disease must be high in the population for which screening is
recommended. Targeting high-risk populations can increase the prevalence of
the detectable preclinical phase of the disease and thus the number of cases
detected on screening. This strategy will likely be applied to the emerging
approaches to lung cancer screening using multidetector CT. Exceptions to the
criterion concerning high prevalence of the detectable preclinical disease
should be made if screening for rare conditions can be accomplished using
tests that are accurate, inexpensive, and noninvasive. Although
phenylketonuria occurs in only one of 15,000 neonates, widespread screening is
justified by the effectiveness and low cost of the test and by the serious
public health consequences of not detecting the disease in its preclinical
phase.
Existence of a Critical Point and Appropriate Therapy
Screening tests are only effective if the condition or disease has a
critical point (point CP in Fig.
1) so that treatment instituted before the critical point is more
efficacious than treatment provided later. In the case of screening for
preclinical neoplastic conditions, the critical point coincides with the onset
of metastasis [12]. Thus, the
critical point must occur during the detectable preclinical phase of the
disease because screening is ineffective (and, indeed, unnecessary) after the
onset of symptoms (i.e., during the clinical phase of the disease). If the
critical point occurs soon after the onset of the detectable preclinical
phase, screening may be too late to be useful. Conversely, screening may also
be less effective early in the onset of the detectable preclinical phase if
lesions are extremely small and are just at the threshold of
detectability.
For screening to improve patient outcomes, an effective treatment for the disease must be available. A critical question in evaluating the importance of screening for a condition is whether treatment of the preclinical disease detected on screening is more effective than intervention initiated after the disease becomes symptomatic. Here, the natural history of the disease should be carefully considered. Figure 1 illustrates that the natural history of disease can be divided into preclinical and clinical phases. The preclinical phase is the period from the biologic onset of disease to the onset of clinical manifestations of the disease. During this phase, the condition is asymptomatic but detectable on a screening test. The detectable preclinical phase of disease is defined as the interval between the point at which the disease can be detected on screening (point B in Fig. 1) and the point at which symptoms develop [13] (point S in Fig. 1).
For screening to be beneficial, treatment initiated during the detectable preclinical phase must result in a better prognosis than therapy given after symptoms develop. For example, some subtypes of breast cancer develop for 3-8 years before becoming palpable at routine clinical breast examinations. During this stage, nonpalpable breast carcinomas may be detected on mammography. Many of these carcinomas are confined to the breast and are not associated with lymph node metastasis. Diagnosing and treating breast cancer during the preclinical phase result in a higher percentage of the cases remaining noninvasive (i.e., ductal carcinoma in situ), a lower percentage of cases of axillary lymph node metastasis, and a better 5-year patient survival rate than when breast cancer is diagnosed during the clinical phase [14].
Conversely, if early treatment engenders no difference in the patient's prognosis or health outcome, then the application of a screening test is neither necessary nor effective. For example, screening for lung carcinoma with chest radiography has historically been discouraged because the disease has a poor prognosis regardless of the phase during which treatment is initiated. Similarly, little justification exists in screening for conditions that are completely curable during the clinical phase of their natural history.
Low Incidence of Pseudodisease
A pseudodisease is a disease that does not require treatment because it
does not affect patients' length or quality of life in a significant way.
Screening for a disease will be ineffective if the screening test reveals
substantial pseudodisease. Two sources of pseudodisease have been described
[12,
15]. A type I pseudodisease is
a condition that is diagnosed via a screening test and does not progress to
symptomatic disease; it may even regress over time. This is a recognized
phenomenon in screening for breast carcinoma; not all cases of ductal
carcinoma in situ progress to invasive or metastatic disease
[16,
17]. A type II pseudodisease
is an indolent, slowly progressive disease found in conditions with long
detectable preclinical phases or among patients with short life expectancies
who may die from other causes
[12]. This latter type of
pseudodisease has been described in prostate carcinoma. Although the
prevalence of clinically apparent prostate carcinoma in men aged 60-70 years
is only about 1% [18], more
than 40% of men in their 60s who have normal findings at rectal examinations
have histologic evidence of disease
[19] when prostate tissue is
removed during cystectomy performed for bladder cancer. Because patients with
pseudodisease do not die from the disease for which screening is performed,
the survival of these patients is erroneously attributed to early treatment.
If adjustments are not made for the detection of pseudodisease in a screening
program, an overdiagnosis bias occurs
[12]. For both types of
pseudodisease, a screening test with positive results may cause the patients
to undergo unnecessary tests and therapy. For these reasons, screening for
conditions with a high frequency of pseudodisease is not cost-effective.
Appropriateness Criteria: The Screening Test
Test Accuracy
A screening test is 100% accurate if it can be used to correctly classify
individuals having preclinical disease as test-positive and those without
preclinical disease as test-negative. In its simplest form, the assessment of
the accuracy of a diagnostic technology involves two dichotomies: disease that
is present (+) or absent (-) and test results that are positive (+) or
negative (-). A 2 x 2 matrix (Fig.
2) is frequently used to illustrate the four outcome combinations
in which n, the total number of test results examined, is expressed
by the equation n = a + b + c + d. Two of the counts, a and
d, correspond to correct test results (true-positive and
true-negative, respectively), whereas b is the number of
false-positive results and c is the number of false-negative
results.
Because the counts for the four outcomes are highly dependent on the sample size, it is customary to express them as rates. For example, a / (a + c) is equal to the proportion of individuals who have the disease and who have positive test results, or the rate of true-positives, also known as the sensitivity of the test; d / (b + d) is equal to the proportion of individuals who do not have the disease and who have negative test results, or the rate of true-negatives, also known as the specificity of the test; c / (a + c) is equal to the proportion of individuals who have the disease but have falsely negative test results, or the rate of false-negatives; and b / (b + d) is equal to the proportion of individuals who do not have the disease but who have falsely positive test results, or the rate of false-positives. Thus, sensitivity is the probability of an individual having positive test results when the disease is truly present, and specificity is the probability of an individual having negative test results when the disease is truly absent.
The usefulness of a screening test is evaluated by its positive and negative predictive values. The predictive value of a negative test (d / [c + d]) is the probability that a patient with a negative result on the diagnostic test truly does not have the disease for which the screening was conducted. Conversely, the predictive value of a positive test (a / [a + b]) is the probability that a patient with a positive result on the screening test truly has the disease for which the screening was conducted. The positive and negative predictive values of a test are dependent on the prevalence of the disease.
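The rates defined above can be illustrated with a short Python sketch. The counts and the function names are hypothetical and chosen only for illustration; a, b, c, and d follow the 2 x 2 matrix convention used in the text. The second function shows, via Bayes' theorem, how the predictive values of the same test change with disease prevalence.

```python
def accuracy_measures(a, b, c, d):
    """Rates from the 2 x 2 matrix.

    a = true-positives, b = false-positives,
    c = false-negatives, d = true-negatives.
    """
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "false_negative_rate": c / (a + c),
        "false_positive_rate": b / (b + d),
        "ppv": a / (a + b),  # predictive value of a positive test
        "npv": d / (c + d),  # predictive value of a negative test
    }


def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values for a given prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical counts: 1,000 people screened, 100 with preclinical disease.
m = accuracy_measures(a=90, b=45, c=10, d=855)

# The same test (sensitivity 0.90, specificity 0.95) at two prevalences.
ppv_high, npv_high = predictive_values(0.90, 0.95, prevalence=0.10)
ppv_low, npv_low = predictive_values(0.90, 0.95, prevalence=0.01)

print(m)
print(f"PPV at 10% prevalence: {ppv_high:.3f}; at 1% prevalence: {ppv_low:.3f}")
```

With these illustrative inputs, the positive predictive value falls sharply when prevalence drops from 10% to 1%, even though sensitivity and specificity are unchanged.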
As the sensitivity of a screening test increases, the number of individuals with preclinical disease not diagnosed by the test decreases. A highly specific test has a low percentage of healthy individuals who are misclassified as having positive test results. Decisions regarding specific criteria for acceptable levels of sensitivity and specificity for a given preclinical disease involve weighing the consequences of leaving cases undetected (false-negatives) against erroneously classifying healthy persons as having the disease (false-positives). In general, sensitivity should be increased at the expense of specificity if the consequences of missing preclinical disease are great, such as when the disease is serious, detectable during its preclinical phase, and curable. Conversely, high specificity is desirable when the costs or risks associated with further diagnostic tests (e.g., surgical biopsy) are substantial. In this circumstance, ethics require that the screened population be informed that a negative result on the screening test does not absolutely guarantee that the disease is not present, only that the likelihood of having the disease is low.
One way to address the problem of the trade-off between the sensitivity and specificity is by administering several screening tests in parallel or sequentially. The former involves performing all the screening tests at the same time and considering individuals with positive results on any of the tests to be true-positive cases. This approach gives greater sensitivity than that achievable by performing each test alone because the condition is less likely to be missed; however, the approach lowers specificity because false-positive diagnoses are also more likely. When screening tests are administered sequentially, an initial screening test is performed, and only those individuals with positive test results undergo an additional screening procedure. Generally, sequential testing results in higher specificity than that achievable with a single test because positive results on a series of tests are more likely to represent a true-positive finding. This method, however, also lowers sensitivity.
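The trade-off described above can be sketched numerically. The following Python example assumes that the two tests err independently of each other given the true disease status, an assumption that is not stated in the text and that rarely holds exactly in practice. The function names and the input values are hypothetical.

```python
def parallel(sens1, spec1, sens2, spec2):
    """Both tests performed; a positive result on EITHER counts as positive.

    Assumes the tests are conditionally independent given disease status.
    """
    sensitivity = 1 - (1 - sens1) * (1 - sens2)
    specificity = spec1 * spec2
    return sensitivity, specificity


def sequential(sens1, spec1, sens2, spec2):
    """Second test only after a positive first test; BOTH must be positive.

    Assumes the tests are conditionally independent given disease status.
    """
    sensitivity = sens1 * sens2
    specificity = 1 - (1 - spec1) * (1 - spec2)
    return sensitivity, specificity


# Hypothetical tests: 0.80/0.90 and 0.85/0.95 (sensitivity/specificity).
par_sens, par_spec = parallel(0.80, 0.90, 0.85, 0.95)
seq_sens, seq_spec = sequential(0.80, 0.90, 0.85, 0.95)

print(f"parallel:   sensitivity {par_sens:.3f}, specificity {par_spec:.3f}")
print(f"sequential: sensitivity {seq_sens:.3f}, specificity {seq_spec:.3f}")
```

Under the independence assumption, parallel testing yields a sensitivity higher than either test alone and a specificity lower than either, and sequential testing does the reverse.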
Test Reproducibility
Any test being considered for use in a screening program must have
reproducible results. For imaging tests, four important sources of variability
can affect the reproducibility of results. The first relates to a biologic
variation that might affect the performance of the test (e.g., patient size or
cardiac motion). The second relates to the reproducibility of the test itself
(e.g., patient positioning or film processing in the acquisition and
production of mammograms). Third, intraobserver variability refers to
differences in the way the same radiologist interprets a specific screening
test at different times. Finally, interobserver variability refers to
inconsistencies attributable to differences in the way different radiologists
interpret the same screening examination. Interobserver variability is
minimized if the interpretation criteria and end points are defined and
quantifiable and is greater if the criteria are vague and subjective. Both
intra- and interobserver variabilities have been reported
[20,21,22]
in the interpretation of screening mammograms, description of specific
lesions, and recommendations for follow-up examinations, using the American
College of Radiology Breast Imaging Reporting and Data System (BI-RADS)
[23].
A common but flawed approach to measuring the accuracy of a potential screening test is to extrapolate data on tests performed in populations with symptomatic disease to screening populations [13]. Measuring accuracy directly in an asymptomatic population, however, involves testing many subjects to identify a group with disease and following up those subjects to ascertain the true disease status. Both positive and negative test results in the subjects should be verified by acceptable methods such as histopathology and clinical or imaging follow-up. With respect to the latter, a follow-up period of sufficient length is critical. If the follow-up period is too short, false-negative cases may be missed; if it is too long, new cases of disease (e.g., "interval cancer") may be inaccurately classified as false-negatives.
Test Safety, Availability, and Cost-Effectiveness
Because screening tests are performed on asymptomatic
individuals, most of whom are healthy and do not have preclinical
disease, the tests must not be associated with significant morbidity or
mortality. Even a minor side effect or adverse consequence to the screened
population will likely offset the benefits of screening
[12]. Radiation dose and the
likelihood that the screening test itself may induce malignancies are
frequently considered adverse consequences of screening tests involving
imaging
[24,25,26].
Other sources of morbidity that affect an individual's decision to undergo or
forego screening include the discomfort associated with the test (e.g.,
compression with screening mammography or bowel preparation for screening
barium enema examinations).
The screening test should be accessible to the population for whom it is indicated. Screening cannot be effective if the screening test is available only at large medical centers. Likewise, if the examination is costly, insurers may choose not to provide screening coverage, and patients may be unwilling or unable to pay for the test out of pocket.
Evaluating the Effectiveness of Screening
Comparability of the Screened and Unscreened Groups
In determining the efficacy of a screening test, the screened and
unscreened groups must be comparable with regard to all factors affecting the
end point under evaluation, with the exception of the screening experience. In
this regard, patient recruitment and self-selection bias (volunteer bias)
should be taken into account. People who choose to participate in a screening
program are likely to differ from those who do not volunteer in several ways
that may affect survival [27,
28]. Volunteers tend to have
better health and lower mortality rates than the general population and are
more likely to adhere to prescribed medical regimens. Consequently, an
observational study design comparing mortality rates of screened and
unscreened groups is likely to show that those who volunteer to undergo
screening have lower mortality rates, regardless of any effect of screening.
On the other hand, those who volunteer for screening programs may represent
the "worried well," or asymptomatic individuals who are at higher
risk of developing disease because of medical or family history or lifestyle
factors. Such individuals might have an increased risk of mortality regardless
of the efficacy of the screening program. Thus, the direction of potential
patient selection bias may be difficult to predict and the magnitude of such
events even more difficult to quantify. Randomization schemes are used to
overcome self-selection bias in studies evaluating potential screening tests
by assigning individuals to screened and unscreened study groups after they
agree to participate in the study.
Lead-Time and Length-Time Biases
Showing the benefit of treatment initiated during the preclinical phase of
a disease is surprisingly difficult. Two widely recognized problems that arise
when the benefits of screening are evaluated by comparing screened to
unscreened populations are lead-time bias and length-time bias.
Lead time is the interval between the diagnosis of a disease at screening and the time at which it would have been detected via the onset of clinical symptoms [29]. Lead time, therefore, is the amount of time that the diagnosis was advanced as a result of screening (Figs. 1 and 3). Because screening is applied to asymptomatic individuals, every case of disease detected at screening has had its time of diagnosis advanced. Whether that lead time is a matter of days, months, or years varies by disease, individual, and screening procedure. For a disease that progresses rapidly from the preclinical to the clinical phase, less lead time will be gained from screening than for a disease that develops slowly and has a longer preclinical phase.
Lead time also varies with how soon the screening test is performed after the preclinical disease becomes detectable. For screened patients, cause-specific survival is measured as the length of time from disease detection on the screening test to death from the disease. For patients not screened, cause-specific survival is measured as the length of time from clinical diagnosis to death from the disease. For example, Figure 3 illustrates the hypothetical histories of two women with breast cancer. We assumed that the age of both women at the biologic onset of disease was 35 years and that the disease was detectable on screening when the women were 44 years old. One woman (A) was screened at age 47, and her breast cancer was detected at that time. The other woman (B) did not undergo screening mammography; her breast cancer was diagnosed when she was 50 after she discovered a lump in her breast. Both women died at the age of 53. Because woman A survived 3 years longer after detection of breast cancer than woman B, screening appears to be beneficial when in fact it only pushed the time of diagnosis forward. This phenomenon is commonly referred to as lead-time bias [30,31,32,33,34,35,36]. If an estimate of lead time is not taken into account when comparing mortality between screened and unscreened populations, survival will be erroneously overestimated for the screening-detected cases simply because the diagnosis was made earlier in the natural history of the disease. A second way to account for the effect of lead time on the efficacy of a screening program is to compare the age-specific death rates in the screened and unscreened groups rather than the length of survival from diagnosis to death.
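The arithmetic of the two-woman example can be written out as a minimal Python sketch. The ages are those given in the text; the function and variable names are hypothetical.

```python
def survival_from_diagnosis(age_at_diagnosis, age_at_death):
    """Years survived after diagnosis."""
    return age_at_death - age_at_diagnosis


# Ages taken from the hypothetical histories in the text.
AGE_SCREEN_DETECTED = 47   # woman A, detected on screening
AGE_CLINICAL_DIAGNOSIS = 50  # woman B, detected after symptoms
AGE_AT_DEATH = 53          # both women

survival_a = survival_from_diagnosis(AGE_SCREEN_DETECTED, AGE_AT_DEATH)
survival_b = survival_from_diagnosis(AGE_CLINICAL_DIAGNOSIS, AGE_AT_DEATH)

# Lead time: how far screening advanced the diagnosis.
lead_time = AGE_CLINICAL_DIAGNOSIS - AGE_SCREEN_DETECTED

apparent_gain = survival_a - survival_b
adjusted_survival_a = survival_a - lead_time

print(f"Survival after diagnosis: A = {survival_a} y, B = {survival_b} y")
print(f"Apparent gain = {apparent_gain} y, lead time = {lead_time} y")
print(f"Lead-time-adjusted survival for A = {adjusted_survival_a} y")
```

The apparent survival gain equals the lead time, so after adjustment woman A's survival matches woman B's. Both women die at the same age, which is why comparing age-specific death rates avoids the bias.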
Length-time bias refers to the overrepresentation among screening-detected cases of those diseases with long preclinical phases and thus more favorable prognoses. Diseases with a long preclinical phase are more readily detected on screening tests than are the more rapidly progressing diseases with shorter preclinical phases. An assumption underlying the concept of length-time bias is that diseases with long preclinical phases are more indolent and would have more favorable prognoses, regardless of any effect of the screening program itself. Thus, length-time bias could lead to an erroneous conclusion that screening is beneficial when, in fact, observed differences in mortality rates resulted merely from detection of cases of less rapidly fatal diseases, whereas cases of diseases that are more rapidly fatal were diagnosed after symptoms developed. Length-time bias is difficult to quantify. Its effect is greatest for cases detected at the initial screening; thus, one method of controlling for length-time bias is to compare cases detected at a subsequent screening (i.e., after the initial screening) to those detected clinically (when the patient develops symptoms).
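Length-time bias can be illustrated with a small simulation. The following Python sketch is a simplified model, not a method from this article: it assumes that the durations of the detectable preclinical phase are exponentially distributed, that disease onset is uniformly spread over a time window, and that a single screening examination with perfect sensitivity is performed at the midpoint of that window. All names and parameter values are hypothetical.

```python
import random


def simulate_length_time(n_cases=100_000, mean_duration=2.0,
                         window=40.0, seed=0):
    """Compare mean preclinical duration in all cases vs. screen-detected cases.

    A case is screen-detected if the single screening examination falls
    within its detectable preclinical phase.
    """
    rng = random.Random(seed)
    screen_time = window / 2
    all_durations = []
    detected_durations = []
    for _ in range(n_cases):
        onset = rng.uniform(0.0, window)                 # start of detectable phase
        duration = rng.expovariate(1.0 / mean_duration)  # length of detectable phase
        all_durations.append(duration)
        if onset <= screen_time < onset + duration:
            detected_durations.append(duration)
    mean_all = sum(all_durations) / len(all_durations)
    mean_detected = sum(detected_durations) / len(detected_durations)
    return mean_all, mean_detected


mean_all, mean_detected = simulate_length_time()
print(f"Mean preclinical duration, all cases:       {mean_all:.2f} y")
print(f"Mean preclinical duration, screen-detected: {mean_detected:.2f} y")
```

Under these assumptions, the screen-detected cases have a markedly longer average preclinical phase than the full set of cases, even though screening had no effect on any individual case. If a longer preclinical phase also implies a more favorable prognosis, the screen-detected group will appear to do better for that reason alone.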
Comparison of Cause-Specific and All-Cause Mortality Rates
The most definitive measure of the efficacy of the screening program is a
comparison of the cause-specific mortality rates of those whose disease was
diagnosed on screening and those whose diagnosis was made after the
development of symptoms. Because the target disease causes only a small
proportion of deaths in a screening-eligible population, a statistically
precise estimate of differences in mortality rates or a statistically
significant effect of screening on all-cause mortality rates can rarely be
shown. However, evaluating the all-cause mortality rates may help to ensure
that a major harm or benefit is not being missed. An all-cause mortality rate
is all-inclusive and provides data relevant to the question of whether other
risks are somehow changed along the continuum of the application of the
screening test, the diagnosis of a disease, and the treatment. Second, an
all-cause mortality rate provides an important perspective on the magnitude of
benefit from screening. It puts cause-specific mortality reduction in the
context of other competing risks and thus permits an estimate of the overall
benefit to be reasonably expected by a particular individual who undergoes a
screening evaluation [35].
Absolute Risk Versus Relative Risk
The effectiveness of screening can be expressed in terms of the relative
risk, which is the ratio of the cause-specific mortality rate in the study
group to that in the control group, or of the relative risk reduction, which
is 1 minus this ratio. Although calculations of relative risk are valid, they
can be misleading because they convey no information about an individual's
baseline risk. The absolute risk reduction is increasingly recognized as a
more appropriate measure of effectiveness of screening interventions
[37]. Absolute risk reduction
is expressed as the product of baseline risk and relative risk reduction. For example,
suppose a screening-eligible individual has a 2% probability of dying of a
particular disease over the next 20 years. If the relative risk reduction from
screening is 50%, the absolute risk reduction is 1%. Reporting absolute risk
reduction is especially appropriate for screening because the overall risk to
be averted is usually small. The absolute risk reduction puts the potential
benefit in proper perspective so that an individual or his or her health care
provider can weigh it against the potential side effects and costs. The
reciprocal of the absolute risk reduction is the number of individuals who
must be screened to prevent one death or adverse event. In our example, this
number is 100 or 1/0.01. The perception of the absolute risk reduction from
screening may be significantly affected by the detection of a pseudodisease
that, as discussed previously, falsely increases the perceived risk of
developing the disease and the perceived effectiveness of earlier
treatment.
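The worked example above can be reproduced with a short Python sketch. The function names are hypothetical; the first set of inputs is taken from the text, and the second set (a baseline risk of 0.2%) is an added illustration of how the same relative risk reduction translates into a much smaller absolute benefit when the baseline risk is lower.

```python
def absolute_risk_reduction(baseline_risk, relative_risk_reduction):
    """Absolute risk reduction = baseline risk x relative risk reduction."""
    return baseline_risk * relative_risk_reduction


def number_needed_to_screen(arr):
    """Reciprocal of the absolute risk reduction."""
    return 1 / arr


# Example from the text: 2% baseline risk over 20 years, 50% relative reduction.
arr = absolute_risk_reduction(0.02, 0.50)
nns = number_needed_to_screen(arr)

# Added illustration: same relative reduction, tenfold lower baseline risk.
arr_low = absolute_risk_reduction(0.002, 0.50)
nns_low = number_needed_to_screen(arr_low)

print(f"Baseline 2%:   ARR = {arr:.3%}, number needed to screen = {nns:.0f}")
print(f"Baseline 0.2%: ARR = {arr_low:.3%}, number needed to screen = {nns_low:.0f}")
```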
Study Designs for Evaluation of Screening Tests
Observational analytic studies, both case-control and cohort, are also used to evaluate the efficacy of screening programs. In the case-control design, individuals with and without the disease are compared with respect to their prior exposure to the screening test. As with any case-control study, the definition and selection of the cases and controls are of critical importance to the validity of the findings [38, 39]. In a cohort study, the case-fatality rate of those who chose to be screened is compared with the case-fatality rate among those whose diagnoses were made due to the onset of symptoms. Interpretation of the results of cohort studies requires consideration of the potential effects of the self-selection of participants as well as lead-time and length-time biases [40].
Because the chief threat to validity is that screened and unscreened cases cannot be compared, the optimal assessment of the efficacy of a screening program derives from randomized trials. If the sample size is sufficiently large, the process of randomization controls any potential confounding variables. Patient self-selection or volunteer bias, a problem when comparing screened and unscreened groups in observational studies, does not influence the validity of randomized trials: after a group of volunteers agrees to participate in the study, individuals who are to undergo screening are chosen at random from the group by the investigators. Adjusting for the lead-time average can eliminate lead-time bias in comparisons of survival rates of patients whose disease was detected via screening versus those whose disease was detected clinically or, preferably, in comparisons of the age-specific mortality rates for the screened and the unscreened groups. Trials can also address the potential for length-time bias by comparing the mortality experience of the groups after repeated screenings.
In the United States, few randomized trials have evaluated programs that use imaging to screen for preclinical disease. The Health Insurance Plan Breast Cancer Screening Project [41] was a randomized trial conducted to evaluate whether periodic breast cancer screening with mammography and physical examination would result in reduced breast cancer mortality rates among women whose ages ranged from 40 to 64 years old. After 9 years of follow-up, an overall statistically significant reduction in breast cancer mortality was found among women who were offered screening compared with women who were assigned to usual medical care.
Although randomized trials provide the best and most valid data on the efficacy of screening programs, a fair amount of evidence on screening programs has come from nonexperimental study designs. Cost, feasibility, and ethical concerns can make randomized trials controversial. As radiologic screening for disease becomes more common, considerations of new evaluation methodologies to determine costs and benefits may be needed. The challenge for the future is to better identify which screening tests are appropriate for which populations. Emerging quantitative techniques of eliciting patient preferences [42] and of analyzing benefits, harms, and costs over time [43, 44] may help radiology meet this challenge.
APPENDIX 1. Screening for Preclinical Disease: Glossary of Terms