AJR 2002; 179:1551-1553
© American Roentgen Ray Society
Observer Performance Studies: Detection of Single Versus Multiple Abnormalities of the Chest
Carl R. Fuhrman1,
Cynthia A. Britton1,
Thomas Bender1,
Jules H. Sumkin1,
Manuel L. Brown2,
J. Michael Holbert3,
Thomas S. Chang1,
Howard E. Rockette4 and
David Gur1
1 Department of Radiology, University of Pittsburgh, 200 Lothrop St.,
Pittsburgh, PA 15213-2582.
2 Department of Radiology, Henry Ford Hospital, 2799 W. Grand Blvd., Detroit, MI
48202.
3 Department of Radiology, Scott & White Clinic, Temple, TX 76508.
4 Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA
15213-2582.
Received April 3, 2002;
accepted after revision May 16, 2002.
Supported in part by grants CA66594, CA67947, and CA84507 from the National
Cancer Institute, National Institutes of Health.
Address correspondence to D. Gur.
Abstract
OBJECTIVE. We used receiver operating characteristic (ROC) analysis
to compare two methods of evaluating observer performance in detecting an
abnormality on chest radiographs. In the first method, the abnormality in
question, rib fracture, was one of five abnormalities rated; in the second
method, it was the only abnormality of interest.
MATERIALS AND METHODS. Eight experienced observers viewed 117
posteroanterior chest radiographs in two interpretation modes. Fifty-four of
these images depicted rib fractures that had been rated as subtle for
detection. Observers rated the likelihood that a rib fracture was present as
one of five abnormalities in question in one mode and as the sole abnormality
of interest in the other mode.
RESULTS. Six of the observers performed better during the
single-abnormality mode, one performed equally well in both modes, and one
performed better during the multiple-abnormality mode. The average area under
the ROC curves (Az) was 0.73 ± 0.07 for the
multiple-abnormality mode and 0.80 ± 0.04 for the single-abnormality
mode. The results were significantly different (p < 0.05).
CONCLUSION. Study methodology can significantly affect the results
in ROC studies, particularly for abnormalities that may not be perceived as
primary or important. The order in which abnormalities appear on a checklist
report form may be important.
Introduction
Evaluations and comparisons of imaging systems frequently use observer
performance studies that include interpretations of cases with both positive
and negative findings. The nature of these studies forces a methodology of
reporting (scoring) that is similar to a checklist, because each abnormality
in question has to be addressed, and the observer is forced to estimate the
likelihood of its presence [1].
This type of reporting may be similar to that of some clinical practices, but
in general, it is very different from the "freestyle" reporting
that radiologists are accustomed to in most environments.
Asking the general question of whether a case shows negative (or positive)
findings for any of one or more specific abnormalities may be comparable to
asking abnormality-specific questions and using the highest score as an
indication (summary index) for the overall status of the case
[2]. However, to the best of
our knowledge, observer performance in detecting a specific abnormality in a
multiple-abnormality environment has not previously been compared with
performance in detecting the same abnormality when observers are asked to rate
only that abnormality. We undertook this comparison to determine what
implications the choice of methodology might have for the questionnaire format
that should be used in such studies and for the possible effect of a
"satisfaction-of-search" phenomenon in these studies. We further
sought to better understand the potential of methodology-dependent effects for
maximizing observer performance in the detection of a specific finding as
compared with other findings that may be included in a study.
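The summary index mentioned above, in which the highest abnormality-specific score stands in for the overall status of a case, can be sketched in a few lines of Python. This is only an illustration: the function name, the 0-100 scale, and the example ratings are assumptions made for the sketch and are not data from this study.

```python
from typing import Dict


def case_summary_index(ratings: Dict[str, float]) -> float:
    """Return the highest abnormality-specific rating as the case-level score."""
    if not ratings:
        raise ValueError("at least one abnormality rating is required")
    return max(ratings.values())


# Hypothetical ratings for a single case (illustrative values only).
example_ratings = {
    "interstitial disease": 10.0,
    "nodule": 5.0,
    "pneumothorax": 0.0,
    "alveolar infiltrate": 20.0,
    "rib fracture": 65.0,
}

summary_score = case_summary_index(example_ratings)
print(summary_score)
```

A case-level index of this kind answers whether anything is abnormal; it does not isolate performance for one specific abnormality, which is the question addressed here.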
In an attempt to explore some of these issues, we performed a two-mode
observer-performance study in which 117 posteroanterior chest images were
reviewed twice by eight experienced radiologists. In one mode, they were asked
to indicate on an ordinal continuous rating scale the likelihood of the
presence (or absence) of each of five abnormalities in a specific order
(interstitial disease, nodule, pneumothorax, alveolar infiltrate, and rib
fracture). In the other mode, they were told to focus solely on the presence
(or absence) of rib fracture and ignore all other findings. The results of
these observations were analyzed and compared using receiver operating
characteristic (ROC) methodology.
Materials and Methods
Our study cases were selected in a historical prospective mode from the
pool of high-quality posteroanterior chest images acquired at
Presbyterian University Hospital of the University of Pittsburgh Medical
Center Health System under a protocol approved by the institutional review
board. For this project, we selected a subset of 117 cases that were verified
using a comprehensive protocol that has been described previously
[1, 3, 4].
Table 1 summarizes the
distribution of depicted abnormalities in the cases selected for this study.
Sixty-three of the 117 cases had negative findings for rib fracture and were
rated "difficult" for the determination of the absence of rib
fracture by experienced reviewers who did not participate as observers in the
study. Fifty-four cases had positive findings for rib fracture and were rated
"subtle" by at least one of the reviewers. At least two of the
reviewers rated 38 of the negative cases and 38 of the positive cases
"difficult" or "subtle," respectively.
Our study consisted of two interpretation modes. In the first mode,
observers were asked to rate the probability of the presence or absence of
each of five abnormalities on a checklist type of questionnaire. Questions
about the abnormalities appeared in the following order: interstitial disease,
nodule, pneumothorax, alveolar infiltrates, and rib fracture. In the second
mode, observers were asked a single focused question related only to the
presence (or absence) of rib fracture being depicted on the images. We allowed
at least 2 years to elapse between the time an observer viewed a case the
first time and repeated viewing of the case for the other mode. Given the
large number of cases and the complexity of the interpretation tasks, our
experience indicated that this would be more than sufficient time to ensure
that the cases were not remembered.
Eight experienced radiologists viewed and rated each of the 117 radiographs
twice. Reviewing sessions included approximately 60 cases displayed in one of
the interpretation modes. During each session, observers were presented with a
stack of envelopes, each containing one radiograph. The images were arranged
in the order of interpretations for that session. Observers reported the
results for each case on a scoring form.
In the first mode, sliding scales were presented, one for each abnormality
in question. The observers were asked to indicate the likelihood of the
presence (or absence) of the abnormality by selecting a rating between 0 and
100 that corresponded to their own estimated probability that the abnormality
in question was present. The form included additional subordinate questions
for interstitial disease and nodules only that were presented on the basis of
answers to the primary detection questions. We required a response to all
items before the observer could proceed to the next case. Observers were
allowed to spend as much time as desired viewing and rating each image.
In the second mode, observers were instructed to focus on detection
(determining the presence) of a rib fracture and to ignore any other
abnormalities that might be depicted. The same sliding scale was used for
indicating the probability that a rib fracture was present. The areas under
the ROC curves (Az) for each observer and each abnormality
were computed using the computer program ROCFIT (University of Chicago,
Chicago, IL) [5]. The two
interpretation modes of interest were compared using the multireader multicase
approach [6], in which all
cases with rib fracture were called "abnormal cases" and cases
without rib fracture, even if these depicted other abnormalities, were labeled
"negative cases."
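As a rough illustration of how an area under the ROC curve follows from 0-100 ratings and the binary rib-fracture labeling described above, the Python sketch below computes the empirical (nonparametric, Mann-Whitney) area. This is not the procedure used in the study: ROCFIT fits a parametric curve by maximum likelihood, and the empirical area is a different estimator that can give a different value. The function name and the example ratings are hypothetical.

```python
from typing import Sequence


def empirical_auc(ratings: Sequence[float], has_fracture: Sequence[bool]) -> float:
    """Empirical area under the ROC curve (Mann-Whitney statistic).

    A case counts as "abnormal" only if it depicts a rib fracture; cases with
    other abnormalities but no rib fracture are treated as negative.
    """
    if len(ratings) != len(has_fracture):
        raise ValueError("ratings and labels must have the same length")
    positives = [r for r, y in zip(ratings, has_fracture) if y]
    negatives = [r for r, y in zip(ratings, has_fracture) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one abnormal and one negative case")

    # Count abnormal/negative pairs in which the abnormal case received the
    # higher rating; ties contribute one half.
    score = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                score += 1.0
            elif p == n:
                score += 0.5
    return score / (len(positives) * len(negatives))


# Hypothetical ratings from one observer (not study data).
ratings = [80, 55, 30, 60, 20, 10]
labels = [True, True, True, False, False, False]
auc = empirical_auc(ratings, labels)
print(round(auc, 3))
```

The value can be read as the probability that a randomly chosen rib-fracture case receives a higher rating than a randomly chosen case without rib fracture.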
Results
Table 2 summarizes the
performances of individual observers and the average for the group in the two
interpretation modes. As can be seen from this table, interobserver
variability is high for this group of subtle cases. Six observers performed
better when the interpretation mode focused solely on the presence or absence
of a rib fracture, one observer performed as well in both modes, and one
observer decreased in performance when interpretation was done in the
single-abnormality mode. Overall, the group increased performance (from
Az = 0.73 ± 0.07 to Az = 0.80
± 0.04), and the multireader multicase analysis indicated that as a
group they performed significantly better in detecting subtle rib fractures
(p < 0.05) during the single-abnormality mode.
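The direction-of-change counts reported above (six observers better, one worse, one unchanged) can be examined with a simple exact sign test, sketched below in Python. This is a swapped-in illustration and not the multireader multicase analysis used in the study; the sign test ignores the size of each observer's change and the variability across cases, and it drops the tied observer. The function name is hypothetical.

```python
from math import comb


def exact_sign_test_p(n_better: int, n_worse: int) -> float:
    """Two-sided exact sign test p-value, with ties excluded beforehand."""
    n = n_better + n_worse
    if n == 0:
        return 1.0
    k = max(n_better, n_worse)
    # Probability of k or more changes in one direction under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Six observers improved and one declined; the unchanged observer is omitted.
p_value = exact_sign_test_p(6, 1)
print(p_value)
```

With only seven informative observers, the counts alone give a two-sided p value of 0.125. The significance reported above therefore rests on the multireader multicase analysis, which also uses the magnitude of the differences in Az.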
TABLE 2 Performances of Individual Observers and Group Average in Detection of
Abnormalities on 117 Chest Radiographs in Two Interpretation Modes
Discussion
This preliminary and relatively simple study highlighted the need for
careful design of observer performance studies. On one hand, these studies
need to be efficient from a data collection point of view; to this end, a
multiple-abnormality approach can be undertaken. At the same time, the search
for multiple abnormalities may distract observers from the abnormality in
question (in our study, presence or absence of rib fracture was the last
question to be answered) to a point that the methodology may reduce an
observer's ability to objectively address the clinical question of
interest.
The results presented here may be the outcome of several experimental
conditions that should not be ignored. First, we did not randomize (or
counterbalance) the interpretation modes. However, the second interpretation
took place more than 2 years after the initial one; therefore, we do not think
that case retention was a factor. Second, observers were not told that the
presence of rib fracture was of special interest during the initial
interpretation. During the multiple-abnormality mode, a satisfaction-of-search
phenomenon may have resulted in a less than optimal search when observers
looked specifically for rib fracture, despite the checklist type of reporting.
This factor may have been important in our study. Because rib fracture was the
last in a series of questions during the multiple-abnormality mode, observers
could have perceived other abnormalities as being more important. If that is
true, once those abnormalities were identified, observers may not have been as
attentive in their search for rib fracture. Last, the questions appeared in
the same fixed order, and the question regarding the presence or absence of a
rib fracture appeared last in the multiple-abnormality interpretation mode.
Hence, the results presented here may represent an upper limit on the
magnitude of the effect. A much larger study would have been required if the
questions regarding the different abnormalities had been randomized.
Preformatted (structured) reporting in the clinical environment using
templates will similarly result in a fixed ordering of reportable findings.
For this reason, the results of our study may be relevant to the use of such
an approach in the clinical environment. Although questions regarding the
appropriate reporting format remain, using a checklist type of rating in the
laboratory environment may prevent several practical problems associated with
multitasking ROC experiments.
References
- Herron JM, Bender T, Campbell WL, Sumkin JH, Rockette HE, Gur D.
Effects of luminance and resolution on observer performance with chest
radiographs. Radiology 2000;215:169-174
- Rockette HE, Gur D, Cooperstein LA, et al. Effect of two rating
formats in multi-disease ROC study of chest images. Invest
Radiol 1990;25:225-229
- Slasky BS, Gur D, Good WF, et al. Receiver operating characteristic
analysis of chest image interpretation with conventional, laser-printed, and
high-resolution workstation images. Radiology 1990;174:775-780
- Thaete FL, Fuhrman CR, Oliver JH, et al. Digital radiography and
conventional imaging of the chest: a comparison of observer performance.
AJR 1994;162:575-581
- Metz CE, Herman BA, Shen JH. Maximum likelihood estimation of
receiver operating characteristic (ROC) curves from continuously-distributed
data. Stat Med 1998;17:1033-1053
- Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic
rating analysis: generalization to the population of readers and patients with
the jackknife method. Invest Radiol 1992;27:723-731
