How Many Observers Are Needed in Clinical Studies of Medical Imaging?
Many examples in the literature testify to the existence of differences in how trained observers interpret medical images. Beam et al. [1] conducted a study of 108 American College of Radiology–certified mammographers who interpreted the same 79 mammograms; they found that sensitivities ranged from 0.47 to 1.0 and specificities ranged from 0.36 to 0.99. In a nine-observer study of digital and conventional chest imaging, Thaete et al. [2] reported the observers' receiver operating characteristic (ROC) curve areas. An ROC curve is a plot of the sensitivity of a diagnostic test versus its false-positive rate. The area under the ROC curve is a global measure of a test's accuracy [3]. Thaete et al. found differences in ROC curve areas between observers of 0.23 for rib fractures, 0.19 for pneumothorax, and 0.13 for alveolar infiltrate. In an initial experience with a commercially available MR coronary angiographic unit, Bogaert et al. [4] concluded that the examination was premature for routine use partly because of the poor agreement between observers (a kappa value of 0.35).
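As a concrete illustration of these measures, the following sketch (with hypothetical ratings that are not taken from any of the cited studies) computes sensitivity and specificity at a single decision threshold and the area under the empirical ROC curve using the Mann-Whitney form of the statistic.

```python
import numpy as np

# Hypothetical data: disease status (1 = diseased, 0 = healthy) and one
# observer's confidence ratings on a 5-point scale (5 = definitely diseased).
truth = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
rating = np.array([5, 4, 4, 3, 2, 3, 2, 2, 1, 1])

# Sensitivity and specificity at one decision threshold (rating >= 3 is "positive").
positive = rating >= 3
sensitivity = np.mean(positive[truth == 1])   # true-positive fraction
specificity = np.mean(~positive[truth == 0])  # true-negative fraction

# Empirical ROC area via the Mann-Whitney statistic: the probability that a
# randomly chosen diseased case is rated higher than a healthy one (ties count 1/2).
diseased = rating[truth == 1]
healthy = rating[truth == 0]
pairs = diseased[:, None] - healthy[None, :]
auc = np.mean(pairs > 0) + 0.5 * np.mean(pairs == 0)

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}, AUC = {auc:.2f}")
```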
Why do we care about these differences? Because imaging techniques do not make diagnoses; rather, they aid observers who make the diagnosis [5, 6]. Observers possess different cognitive, visual, and perceptual abilities. To understand the performance of medical imaging technology, we need to study the critical components of the technology, including the observers.
Beam et al. [5] listed five questions unanswered for most imaging techniques: How much does an imaging technique improve the diagnostic ability of the average radiologist? How much of an improvement over the use of a reference technique will this new technique typically make? How much variability in diagnostic abilities is there to be found in the general population of radiologists and in subspecialties? Is gain in diagnostic performance dependent on characteristics of the radiologists (e.g., years of experience, specialty, training) and if so, how? How much disagreement in diagnosis is to be naturally expected between radiologists using the same imaging technique or for the typical radiologist when reinterpreting the same images?
None of these questions can be addressed by a study with a single observer. In this article, I address a simple but important question: How many observers are needed in clinical studies of medical imaging?
Phases in the Clinical Assessment of the Performance of Medical Imaging
Fryback and Thornbury [7] proposed a six-level hierarchic model for assessing the efficacy of diagnostic tests. Level 2 is the level at which the diagnostic performance of the imaging system, including the observers, is evaluated. Within this level, different types of studies with different purposes are arranged in a hierarchic assessment of diagnostic performance [3]. The different types of studies require different numbers of patients and observers to meet the purposes of the study.
Phase I is the exploratory phase of the clinical assessment of diagnostic performance during which a new test is first evaluated on human subjects. These studies usually include a small sample of patients, often 10–50; the sample is composed of patients with known, clinically manifest disease and healthy volunteers. The goal is to determine whether the test can distinguish between those with clear disease and healthy subjects. If it cannot, then there is no reason to pursue the test further.
Phase I studies should usually include two or three observers. The rationale is that two observers is the minimum number needed to assess interobserver differences, and a third observer allows two additional interobserver comparisons. More observers are useful, but because this study is a phase I study, we do not want to spend resources on a test with potentially limited diagnostic ability.
Phase I studies are the first opportunities to collect data on interobserver differences in a clinical setting (the fifth question posed by Beam et al. [5]). Based on these data, we may look for ways to change certain attributes of the imaging device to reduce these differences, just as Bogaert et al. [4] did in their MR coronary angiography study. However, eliminating interobserver differences is unlikely. Phase I studies should also assess intraobserver differences and compare the magnitudes of intra- and interobserver differences.
Phase II is the challenge phase. The patient sample includes difficult cases, in terms of their pathologic, clinical, or comorbid features, that might confound the imaging system [8]. Usually, the patient sample size is between 50 and 200. The goals of these studies are to determine whether the test fails (i.e., excessive false-positives or false-negatives) for any patient subgroups and to compare its performance with other competing tests on these difficult cases. A nice example of a phase II study is the nine-observer chest imaging study by Thaete et al. [2], which included patients with single and multiple, subtle and typical disorders.
Phase II studies commonly include between five and 10 observers, as in the nine-observer study of Thaete et al. [2]. Here, we want to compare the accuracy of the tests and examine the relationship between accuracy and the pathologic, clinical, and comorbid features of the patient. Sometimes investigators report only the majority opinion of the observers; this method of reporting data weakens the study because important differences between tests and important relationships with patients' characteristics can be missed. Furthermore, patients are rarely treated on the basis of the majority finding of multiple observers; differences in observers are to be expected and reported, not concealed by a majority [6].
Phase III, or advanced phase, studies are for mature tests. Usually several hundred cases are collected through a prospective sampling plan. The goals of these studies are to estimate the test's performance for a well-defined clinical population and often to compare it with other tests; the questions posed by Beam et al. [5] are answered during this phase of testing. Reliable reports of the test's performance (e.g., sensitivity, specificity, ROC curve area) come from these studies, not from phase I or II studies, which use atypically easy or atypically difficult cases, respectively.
In general, phase III studies require more than 10 observers from several institutions for the sample of observers to be generalizable to a population of observers. For example, the mammography study by Beam et al. [1] included 108 observers from 50 medical centers. A formula for estimating the required sample sizes for these studies (both patient and observer sample sizes) is available [9–11]; it requires an estimate of the accuracy of the tests, the difference expected between tests [12], and the correlations and variability between observers and tests. Estimates of these correlations and variances from several studies are discussed in an article by Rockette et al. [13]. Other methods [14, 15] are available for estimating the parameters of multiobserver studies; these estimates can be used to size future studies.
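As a rough illustration of the kind of calculation behind these sample size formulas, the sketch below computes approximate power for a two-modality comparison as a function of the number of readers, using a normal approximation and an Obuchowski-Rockette-style expression for the variance of the difference in mean ROC areas. The function and its inputs are placeholders of the sort one would estimate from pilot data or published tables [13]; it is a simplified sketch, not the full method of references [9–11].

```python
from scipy.stats import norm

def mrmc_power(n_readers, delta, var_tr, var_err, cov1, cov2, cov3, alpha=0.05):
    """Approximate power to detect a difference `delta` in mean ROC area
    between two modalities with `n_readers` readers, using a normal
    approximation to an Obuchowski-Rockette-style variance of the difference.

    var_tr  : reader-by-modality interaction variance
    var_err : error variance of one reader's ROC area (finite case sample)
    cov1-3  : error covariances (same reader/different modality,
              different reader/same modality, different reader/different modality)
    """
    var_diff = (2.0 / n_readers) * (
        var_tr + var_err - cov1 + (n_readers - 1) * (cov2 - cov3)
    )
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(delta / var_diff**0.5 - z_crit)

# Placeholder variance components; in an actual study they would be estimated,
# not assumed.
for readers in (4, 6, 10, 16):
    p = mrmc_power(readers, delta=0.05, var_tr=0.0006, var_err=0.0015,
                   cov1=0.0007, cov2=0.0004, cov3=0.0003)
    print(f"{readers:2d} readers: approximate power = {p:.2f}")
```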
Sometimes investigators conduct phase I and occasionally phase II studies with only one observer. These single-observer studies provide no opportunity to assess the frequency, nature, and magnitude of observer differences. However, these differences are real and inherent to the imaging system. Thus, studies that overlook these differences make a smaller contribution to the assessment of an imaging system than studies that have more than one observer.
Statistical Analysis of Multiple-Observer Studies
In phase I studies we want to test whether the imaging system has any diagnostic ability (e.g., whether its ROC curve area exceeds 0.5); we also want to assess the nature and frequency of, and often the explanation for, differences in the observers' findings and diagnoses. In phase I studies, we are interested in observer differences for individual patients; in contrast, in phases II and III we focus more on observers' accumulated performance over the sample of patients. The frequency of agreement between observers can be conveniently quantified by the kappa statistic [16]. The kappa statistic describes the frequency of agreement beyond mere chance agreement; it ranges from –1 (complete disagreement) to 1 (complete agreement), with a value of zero indicating agreement no greater than chance. Other aspects of the analysis are often descriptive, such as the characteristics of the images on which the observers disagreed.
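To make the agreement measure concrete, here is a minimal sketch that computes Cohen's kappa for two observers' binary readings of the same hypothetical set of images; extensions to more than two observers and to weighted agreement are described by Fleiss [16].

```python
import numpy as np

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two observers' categorical readings of the same cases:
    observed agreement corrected for the agreement expected by chance."""
    a = np.asarray(ratings_a)
    b = np.asarray(ratings_b)
    categories = np.union1d(a, b)
    p_observed = np.mean(a == b)
    # Chance agreement: product of the two observers' marginal frequencies,
    # summed over categories.
    p_chance = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical readings of 10 images (1 = abnormal, 0 = normal).
observer_1 = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
observer_2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
print(f"kappa = {cohens_kappa(observer_1, observer_2):.2f}")
```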
In phase II studies, we want to know how observer performance with the new test compares with the existing tests and how observer performance is affected by characteristics of the patients and the disease. Regression methods are ideally suited for this sort of analysis; however, standard regression methods do not accommodate multiple observers. Fortunately, special regression methods have been developed for multiple-observer imaging studies [17, 18].
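As a simplified stand-in for those methods, the sketch below regresses the correctness of each reading on the modality, a patient characteristic, and the reader, and uses generalized estimating equations (here via the statsmodels package) so that repeated readings of the same patient are not treated as independent. The data set, variable names, and effect sizes are simulated for illustration only; the ordinal ROC regression models of references [17, 18] operate on the observers' confidence ratings rather than on a binary correct/incorrect outcome.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated phase II data set: 100 patients, 6 observers, 2 modalities.
# `subtle` flags the difficult cases; `correct` is whether the reading was right.
rows = []
for patient in range(100):
    subtle = rng.random() < 0.4
    for reader in range(6):
        for modality in ("standard", "new"):
            p_correct = 0.9 - 0.25 * subtle + 0.05 * (modality == "new")
            rows.append({
                "patient": patient,
                "reader": reader,
                "modality": modality,
                "subtle": int(subtle),
                "correct": int(rng.random() < p_correct),
            })
df = pd.DataFrame(rows)

# Logistic regression of reading correctness on modality, case subtlety, and
# reader, with GEE and an exchangeable working correlation to account for the
# fact that the same patient is read many times.
model = smf.gee("correct ~ C(modality) + subtle + C(reader)", "patient", data=df,
                family=sm.families.Binomial(),
                cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())
```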
Phase III studies are designed to estimate the performance of the imaging system for a well-defined population of patients and a well-defined population of observers who will use the medical device. These studies are often referred to as “MRMC”—or multiple-reader, multiple-case—studies to emphasize the two populations. Several methods exist for estimating and comparing the performance of imaging systems for the relevant populations of patients and observers [14, 15, 17, 19, 20].
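The sketch below shows the basic building block of such an analysis under stated assumptions: simulated confidence ratings for eight readers reading the same 100 cases under two modalities are reduced to a reader-by-modality table of ROC areas, and the modalities are compared with a paired test across readers. All of the simulation settings are hypothetical, and the paired test treats the case sample as fixed; the MRMC methods cited above [14, 15, 17, 19, 20] additionally account for variability in the case sample so that conclusions generalize to both populations.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)

def empirical_auc(ratings, truth):
    """Area under the empirical ROC curve (Mann-Whitney form)."""
    diseased, healthy = ratings[truth == 1], ratings[truth == 0]
    diff = diseased[:, None] - healthy[None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

# Simulated MRMC data: 8 readers rate the same 100 cases (40 diseased)
# under two modalities on a continuous confidence scale.
n_readers, n_cases = 8, 100
truth = np.array([1] * 40 + [0] * 60)
case_signal = truth * 1.2 + rng.normal(0, 0.3, n_cases)  # shared case effects
auc = np.zeros((n_readers, 2))
for r in range(n_readers):
    for m, shift in enumerate((0.0, 0.25)):          # modality 2 slightly better
        ratings = case_signal + truth * shift + rng.normal(0, 1, n_cases)
        auc[r, m] = empirical_auc(ratings, truth)

print("mean AUC by modality:", auc.mean(axis=0).round(3))
# Paired comparison across readers; the case sample is treated as fixed here.
res = ttest_rel(auc[:, 1], auc[:, 0])
print(f"paired t = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```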
Conclusion
From the earliest phase to the final phases of assessment, multiple-observer studies are critical to clinical studies of medical imaging. Multiple observers are needed to document and assess the nature of observer differences (phase I), to evaluate how observer differences relate to patient and disease characteristics (phase II), and to estimate the imaging system's performance for populations of relevant patients and observers (phase III). Study designs of medical imaging should routinely include more than one observer, the appropriate number depending on the goals of the study. We should avoid reporting only the consensus or majority findings of the observers.
Footnote
Address correspondence to N. A. Obuchowski ([email protected]).
References
1. Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists. Arch Intern Med 1996; 156:209–213
2. Thaete FL, Fuhrman CR, Oliver JH, et al. Digital radiography and conventional imaging of the chest: a comparison of observer performance. AJR 1994; 162:575–581
3. Zhou XH, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. New York, NY: Wiley & Sons, 2002
4. Bogaert J, Kuzo R, Dymarkowski S, Beckers R, Piessens J, Rademakers FE. Coronary artery imaging with real-time navigator three-dimensional turbo-field-echo MR coronary angiography: initial experience. Radiology 2003; 226:707–716
5. Beam CA, Baker ME, Paine SS, Sostman HD, Sullivan DC. Answering unanswered questions: proposal for a shared resource in clinical diagnostic radiology research. Radiology 1992; 183:619–620
6. Obuchowski NA, Zepp RC. Simple steps for improving multiple-reader studies in radiology. AJR 1996; 166:517–521
7. Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decis Making 1991; 11:88–94
8. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978; 299:926–930
9. Obuchowski NA. Multireader, multimodality receiver operating characteristic curve studies: hypothesis testing and sample size estimation using an analysis of variance approach with dependent observations. Acad Radiol 1995; 2[suppl 1]:S22–S29
10. Obuchowski NA. Multireader ROC studies: a comparison of study designs. Acad Radiol 1995; 2:709–716
11. Obuchowski NA. Sample size tables for receiver operating characteristic studies. AJR 2000; 175:603–608
12. Obuchowski NA. Determining sample size for ROC studies: what is reasonable for the expected difference in tests' ROC areas? (letter) Acad Radiol 2003; 10:1327–1328
13. Rockette HE, Campbell WL, Britton CA, Holbert JM, King JL, Gur D. Empiric assessment of parameters that affect the design of multiobserver receiver operating characteristic studies. Acad Radiol 1999; 6:723–729
14. Beiden SV, Wagner RF, Campbell G. Components-of-variance models and multiple-bootstrap experiments: an alternative method for random-effects, receiver operating characteristic analysis. Acad Radiol 2000; 7:341–349
15. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest Radiol 1992; 27:723–731
16. Fleiss JL. Statistical methods for rates and proportions. New York, NY: Wiley & Sons, 1981
17. Ishwaran H, Gatsonis CA. A general class of hierarchical ordinal regression models with applications to correlated ROC analysis. Can J Stat 2000; 28:731–750
18. Toledano AY, Gatsonis C. Ordinal regression methodology for ROC curves derived from correlated data. Stat Med 1996; 15:1807–1826
19. Obuchowski NA, Rockette HE. Hypothesis testing of the diagnostic accuracy for multiple diagnostic tests: an ANOVA approach with dependent observations. Commun Statist Simula 1995; 24:285–308
20. Beiden SV, Wagner RF, Campbell G, Metz CE, Jiang Y. Components-of-variance models for random-effects ROC analysis: the case of unequal variance structure across modalities. Acad Radiol 2001; 8:605–615