AJR 2002; 179:1551-1553
© American Roentgen Ray Society


Observer Performance Studies: Detection of Single Versus Multiple Abnormalities of the Chest

Carl R. Fuhrman1, Cynthia A. Britton1, Thomas Bender1, Jules H. Sumkin1, Manuel L. Brown2, J. Michael Holbert3, Thomas S. Chang1, Howard E. Rockette4 and David Gur1

1 Department of Radiology, University of Pittsburgh, 200 Lothrop St., Pittsburgh, PA 15213-2582.
2 Department of Radiology, Henry Ford Hospital, 2799 W. Grand Blvd., Detroit, MI 48202.
3 Department of Radiology, Scott & White Clinic, Temple, TX 76508.
4 Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15213-2582.

Received April 3, 2002; accepted after revision May 16, 2002.

 
Supported in part by grants CA66594, CA67947, and CA84507 from the National Cancer Institute, National Institutes of Health.

Address correspondence to D. Gur.


Abstract
OBJECTIVE. We used receiver operating characteristic (ROC) analysis to compare two methods of evaluating observer performance in detecting an abnormality on chest radiographs. In the first method, the abnormality in question, rib fracture, was one of five investigated; in the second, it was the only abnormality of interest.

MATERIALS AND METHODS. Eight experienced observers viewed 117 posteroanterior chest radiographs in two interpretation modes. Fifty-four of these images depicted rib fractures that had been rated as subtle for detection. Observers rated the likelihood that a rib fracture was present as one of five abnormalities in question in one mode and as the sole abnormality of interest in the other mode.

RESULTS. Six of the observers performed better during the single-abnormality mode, one performed equally well in both modes, and one performed better during the multiple-abnormality mode. The average area under the ROC curves (Az) was 0.73 ± 0.07 for the multiple-abnormality mode and 0.80 ± 0.04 for the single-abnormality mode. The results were significantly different (p < 0.05).

CONCLUSION. Study methodology can significantly affect the results in ROC studies, particularly for abnormalities that may not be perceived as primary or important. The order in which abnormalities appear on a checklist report form may be important.


Introduction
Evaluations and comparisons of imaging systems frequently use observer performance studies that include interpretations of cases with both positive and negative findings. The nature of these studies forces a methodology of reporting (scoring) that is similar to a checklist, because each abnormality in question has to be addressed, and the observer is forced to estimate the likelihood of its presence [1]. This type of reporting may be similar to that of some clinical practices, but in general, it is very different from the "freestyle" reporting that radiologists are accustomed to in most environments.

Asking the general question of whether a case shows negative (or positive) findings for any of one or more specific abnormalities may be comparable to asking abnormality-specific questions and using the highest score as an indication (summary index) for the overall status of the case [2]. However, to the best of our knowledge, observer performance in detecting a specific abnormality in a multiple-abnormality environment has not previously been compared with performance in detecting the same abnormality when observers are asked to rate only that abnormality. We undertook this comparison to determine what implications the choice of methodology might have for the questionnaire format that should be used in such studies and for the effect of a possible "satisfaction-of-search" phenomenon in these studies. We further sought to better understand the potential of methodology-dependent effects for maximizing observer performance in the detection of a specific finding as compared with other findings that may be included in a study.
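As a minimal sketch of the summary-index idea mentioned above, the following Python snippet takes the highest abnormality-specific rating as the case-level score. The function name `summary_index`, the dictionary layout, the 0 to 100 scale, and the example ratings are illustrative assumptions, not the software or data used in the cited work.

```python
from typing import Dict

# Abnormality names follow the checklist used in this study; the
# data layout itself is a hypothetical choice for illustration.
ABNORMALITIES = [
    "interstitial disease",
    "nodule",
    "pneumothorax",
    "alveolar infiltrate",
    "rib fracture",
]


def summary_index(ratings: Dict[str, float]) -> float:
    """Return the highest abnormality-specific rating as the case-level score."""
    missing = [name for name in ABNORMALITIES if name not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    return max(ratings[name] for name in ABNORMALITIES)


# Hypothetical ratings (0 to 100) for one case; not data from any study.
example_case = {
    "interstitial disease": 5.0,
    "nodule": 62.0,
    "pneumothorax": 0.0,
    "alveolar infiltrate": 10.0,
    "rib fracture": 35.0,
}
print(summary_index(example_case))
```

Under this convention a case is scored by its most suspicious finding, so a low rating for one abnormality (here, rib fracture) does not lower the overall case score.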

In an attempt to explore some of these issues, we performed a two-mode observer-performance study in which 117 posteroanterior chest images were reviewed twice by eight experienced radiologists. In one mode, they were asked to indicate on a continuous rating scale the likelihood of the presence (or absence) of each of five abnormalities in a specific order (interstitial disease, nodule, pneumothorax, alveolar infiltrate, and rib fracture). In the other mode, they were told to focus solely on the presence (or absence) of rib fracture and ignore all other findings. The results of these observations were analyzed and compared using receiver operating characteristic (ROC) methodology.


Materials and Methods
Our study cases were selected in a historical prospective mode from the pool of high-quality posteroanterior chest images acquired at Presbyterian-University Hospital of the University of Pittsburgh Medical Center Health System under a protocol approved by the institutional review board. For this project, we selected a subset of 117 cases that were verified using a comprehensive protocol that has been described previously [1, 3, 4]. Table 1 summarizes the distribution of depicted abnormalities in the cases selected for this study. Sixty-three of the 117 cases had negative findings for rib fracture and were rated "difficult" for the determination of the absence of rib fracture by experienced reviewers who did not participate as observers in the study. Fifty-four cases had positive findings for rib fracture and were rated "subtle" by at least one of the reviewers. At least two of the reviewers rated 38 of the negative cases and 38 of the positive cases "difficult" or "subtle," respectively.


TABLE 1 Distribution of Abnormalities Detected in 117 Chest Radiographs by Type

 

Our study consisted of two interpretation modes. In the first mode, observers were asked to rate the probability of the presence or absence of each of five abnormalities on a checklist type of questionnaire. Questions about the abnormalities appeared in the following order: interstitial disease, nodule, pneumothorax, alveolar infiltrates, and rib fracture. In the second mode, observers were asked a single focused question related only to the presence (or absence) of rib fracture being depicted on the images. We allowed at least 2 years to elapse between the time an observer viewed a case the first time and repeated viewing of the case for the other mode. Given the large number of cases and the complexity of the interpretation tasks, our experience indicated that this would be more than sufficient time to ensure that the cases were not remembered.

Eight experienced radiologists viewed and rated each of the 117 radiographs twice. Reviewing sessions included approximately 60 cases displayed in one of the interpretation modes. During each session, observers were presented with a stack of envelopes, each containing one radiograph. The images were arranged in the order of interpretations for that session. Observers reported the results for each case on a scoring form.

In the first mode, sliding scales were presented, one for each abnormality in question. The observers were asked to indicate the likelihood of the presence (or absence) of the abnormality by selecting a rating between 0 and 100 that corresponded to their own estimated probability that the abnormality in question was present. The form included additional subordinate questions for interstitial disease and nodules only that were presented on the basis of answers to the primary detection questions. We required a response to all items before the observer could proceed to the next case. Observers were allowed to spend as much time as desired viewing and rating each image.

In the second mode, observers were instructed to focus on detection (determining the presence) of a rib fracture and to ignore any other abnormalities that might be depicted. The same sliding scale was used for indicating the probability that a rib fracture was present. The areas under the ROC curves (Az) for each observer and each abnormality were computed using the computer program ROCFIT (University of Chicago, Chicago, IL) [5]. The two interpretation modes of interest were compared using the multireader multicase approach [6], in which all cases with rib fracture were called "abnormal cases" and cases without rib fracture, even if these depicted other abnormalities, were labeled "negative cases."
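The Az values in this study were obtained with ROCFIT, which fits an ROC curve to the ratings by maximum likelihood. As a simpler stand-in that illustrates what an area under the ROC curve measures, the sketch below computes the nonparametric (Mann-Whitney) area from 0 to 100 ratings: the fraction of (rib fracture, no rib fracture) case pairs in which the fracture case received the higher rating, with ties counted as one half. This is not the ROCFIT algorithm and will generally give somewhat different values; the function names and the ratings are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_by_truth(
    ratings: Sequence[float], has_rib_fracture: Sequence[bool]
) -> Tuple[List[float], List[float]]:
    """Split ratings into abnormal (rib fracture) and negative cases.

    Cases without rib fracture count as negative even if they depict
    other abnormalities, as in the analysis described above.
    """
    if len(ratings) != len(has_rib_fracture):
        raise ValueError("ratings and truth labels must have the same length")
    positives = [r for r, t in zip(ratings, has_rib_fracture) if t]
    negatives = [r for r, t in zip(ratings, has_rib_fracture) if not t]
    return positives, negatives


def empirical_auc(positives: Sequence[float], negatives: Sequence[float]) -> float:
    """Mann-Whitney estimate of the area under the ROC curve."""
    if not positives or not negatives:
        raise ValueError("need at least one abnormal and one negative case")
    score = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                score += 1.0   # fracture case rated higher: correct ordering
            elif p == n:
                score += 0.5   # tie counts as half
    return score / (len(positives) * len(negatives))


# Hypothetical ratings for nine cases (not data from the study).
ratings = [80, 10, 55, 55, 30, 20, 90, 40, 5]
truth = [True, False, True, False, True, False, True, False, False]
positives, negatives = split_by_truth(ratings, truth)
print(f"empirical AUC = {empirical_auc(positives, negatives):.3f}")
```

An area of 0.5 corresponds to chance-level ordering of fracture and non-fracture cases, and 1.0 to perfect separation.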


Results
Table 2 summarizes the performances of individual observers and the average for the group in the two interpretation modes. As can be seen from this table, interobserver variability is high for this group of subtle cases. Six observers performed better when the interpretation mode focused solely on the presence or absence of a rib fracture, one observer performed as well in both modes, and one observer decreased in performance when interpretation was done in the single-abnormality mode. Overall, the group increased performance (from Az = 0.73 ± 0.07 to Az = 0.80 ± 0.04), and the multireader multicase analysis indicated that as a group they performed significantly better in detecting subtle rib fractures (p < 0.05) during the single-abnormality mode.


TABLE 2 Performances of Individual Observers and Group Average in Detection of Abnormalities on 117 Chest Radiographs in Two Interpretation Modes
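The reader-level summary reported in the Results (how many observers did better in each mode, and the group mean and spread of Az) can be sketched as follows. The Az values in the example are hypothetical placeholders for four imaginary readers, not the entries of Table 2, and using the sample standard deviation for the "±" term is an assumption. This descriptive summary is also not the multireader multicase analysis used in the study, which additionally accounts for variability among cases.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def compare_modes(
    multi_az: Sequence[float], single_az: Sequence[float], tol: float = 1e-9
) -> Dict[str, float]:
    """Summarize paired per-reader Az values from two interpretation modes."""
    if len(multi_az) != len(single_az) or len(multi_az) < 2:
        raise ValueError("need paired Az values for at least two readers")
    # Positive difference: the reader did better in the single-abnormality mode.
    diffs = [s - m for m, s in zip(multi_az, single_az)]
    return {
        "better_single": sum(d > tol for d in diffs),
        "equal": sum(abs(d) <= tol for d in diffs),
        "better_multiple": sum(d < -tol for d in diffs),
        "mean_multiple": mean(multi_az),
        "sd_multiple": stdev(multi_az),    # assumed meaning of the "±" term
        "mean_single": mean(single_az),
        "sd_single": stdev(single_az),
        "mean_difference": mean(diffs),
    }


# Hypothetical Az values for four imaginary readers (not Table 2).
multi = [0.70, 0.75, 0.80, 0.65]
single = [0.78, 0.75, 0.77, 0.74]
for key, value in compare_modes(multi, single).items():
    print(f"{key}: {value:.3f}")
```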

 


Discussion
This preliminary and relatively simple study highlighted the need for careful design of observer performance studies. On one hand, these studies need to be efficient from a data collection point of view; to this end, a multiple-abnormality approach can be undertaken. At the same time, the search for multiple abnormalities may distract observers from the abnormality in question (in our study, presence or absence of rib fracture was the last question to be answered) to a point that the methodology may reduce an observer's ability to objectively address the clinical question of interest.

The results presented here may be the outcome of several experimental conditions that should not be ignored. First, we did not randomize (or counterbalance) the interpretation modes. However, the second interpretation took place more than 2 years after the initial one; therefore, we do not think that case retention was a factor. Second, observers were not told that the presence of rib fracture was of special interest during the initial interpretation. During the multiple-abnormality mode, a satisfaction-of-search phenomenon may have resulted in a less than optimal search when observers looked specifically for rib fracture, despite the checklist type of reporting. This factor may have been important in our study. Because rib fracture was the last in a series of questions during the multiple-abnormality mode, observers could have perceived other abnormalities as being more important. If that is true, once those abnormalities were identified, observers may not have been as attentive in their search for rib fracture. Last, the questions appeared in the same fixed order, and the question regarding the presence or absence of a rib fracture appeared last in the multiple-abnormality interpretation mode. Hence, the results presented here may represent an upper limit on the magnitude of the effect. A much larger study would have been required if the questions regarding the different abnormalities had been randomized.

Preformatted (structured) reporting in the clinical environment using templates will similarly result in a fixed ordering of reportable findings. For this reason, the results of our study may be relevant to the use of such an approach in the clinical environment. Although questions regarding the appropriate reporting format remain, using a checklist type of rating in the laboratory environment may prevent several practical problems associated with multitasking ROC experiments.


References

  1. Herron JM, Bender T, Campbell WL, Sumkin JH, Rockette HE, Gur D. Effects of luminance and resolution on observer performance with chest radiographs. Radiology 2000;215:169-174
  2. Rockette HE, Gur D, Cooperstein LA, et al. Effect of two rating formats in multi-disease ROC study of chest images. Invest Radiol 1990;25:225-229
  3. Slasky BS, Gur D, Good WF, et al. Receiver operating characteristic analysis of chest image interpretation with conventional, laser-printed, and high-resolution workstation images. Radiology 1990;174:775-780
  4. Thaete FL, Fuhrman CR, Oliver JH, et al. Digital radiography and conventional imaging of the chest: a comparison of observer performance. AJR 1994;162:575-581
  5. Metz CE, Herman BA, Shen JH. Maximum likelihood estimation of receiver operating characteristic (ROC) curves from continuously-distributed data. Stat Med 1998;17:1033-1053
  6. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest Radiol 1992;27:723-731
