1
Department of Radiology, University of Pennsylvania, 3400 Spruce St.,
Philadelphia, PA 19104.
2
MCP Hahnemann University, School of Public Health, Broad and Vine Sts.,
Philadelphia, PA 19102.
3
Department of Radiology, Thomas Jefferson University, 11th and Walnut Sts.,
Philadelphia, PA 19107.
Received November 17, 2000;
accepted after revision March 9, 2001.
Supported by grant P01 - CA 53141 from the National Cancer Institute, U. S.
Public Health Service, Department of Health and Human Services.
Abstract
MATERIALS AND METHODS. We collected radiographs from a stratified sample of 100 patients seen in the emergency department. The images were obtained using computed radiography, and the digital images were printed on film and stored for display on a workstation. A group of seven experienced radiologists reported the cases using both film and the workstation display. The results were analyzed using mixture distribution analysis (MDA).
RESULTS. The reliability expressed as the percentage of agreement of a typical observer relative to the majority was computed from the MDA. The result was 90% for both hard copy and soft copy with bootstrap confidence intervals of 86-94%.
CONCLUSION. We conclude that, in the emergency department, soft-copy interpretation is as reliable as hard-copy interpretation. The strength of this conclusion depends on the validity of the MDA approach as well as the extent to which the observer sample and case sample are representative of the emergency department.
We propose a measure of reliability based on observer agreement that uses the mathematical technique of mixture distribution analysis (MDA) to compute the relative percentage of agreement [6,7]. Reliability is defined as the likelihood of agreement of one typical observer relative to the interpretation of the majority. This can be contrasted with accuracy, which is the likelihood of agreement with an external standard. The relative percentage of agreement has a range of 0-100. A value of 0 means that there is no agreement with the majority, 50 means that agreement occurs 50% of the time, and 100 means that agreement occurs all the time. The MDA methodology requires a minimum of five observers, but it does not require proven cases in the test sample. It does not estimate accuracy because there is no gold standard. The relative percentage of agreement is an estimate of the consistency of performance over cases and over observers. We have found a good correlation between accuracy as measured by receiver operating characteristic (ROC) analysis and reliability as measured by the MDA [8].
The study reported here uses MDA to compare the reliability of soft-copy and hard-copy interpretations on a sample of 100 film radiographs drawn from the emergency department and interpreted by seven experienced radiologists who regularly interpret emergency department cases. The sample was stratified by the type of examination.
The Case Sample
A data coordinator, who was also a radiology technologist and was
thoroughly familiar with radiography in the emergency department, assembled
the sample of 100 cases. The emergency department is divided into a major
trauma unit, an emergency unit (hereafter called the emergency department),
and a unit called Penn Rapid Care that takes care of less critically ill
patients. Only those images from the emergency department obtained during the
day shift were included in the study. (The case mix is similar during the
evening and night shifts.) The images were categorized as depicting the chest,
abdomen, spine and pelvis, and extremities. The cases were acquired in the
order of patient arrival to the imaging facility. Most emergency department
patients undergo imaging for a single body region. When noncontiguous regions
were imaged, for example, ankle and chest or ankle and wrist, each region was
considered separately. All the standard views (e.g., posteroanterior and
lateral chest) were included with the case. The general mix of cases in the
emergency department and the number of cases selected in each category are
shown in Table 1.
Of 313 successive patients, 55 were excluded in advance: 38 trauma cases, five nasal and facial bones, two postreduction of fractures, and 10 cases interpreted by potential observers in the study. The data coordinator selected a case, printed a duplicate copy, and assigned it a sequential number. That number was written on a label and pasted over the patient identifying information in the corner of the film, as required by the University of Pennsylvania's institutional review board. The images in each case were also retrieved from the local archive and stored in a research database before they were sent to the PACS archive, where data compression is applied. The data coordinator verified the reason for requesting the examination (the clinical history) by retrieving the emergency department request form. Information on the form was compared with the entry in the radiology information system. A total of 115 cases were excluded after selection because the clinical history either could not be retrieved or was not verifiable (n = 53) or because uncompressed soft-copy images could not be obtained (n = 62). The remaining 143 cases were used to fill the categories in the test set.
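The case flow described above can be checked with a short tally. This is only an arithmetic sketch using the counts stated in the text; the dictionary keys are shorthand labels introduced here for illustration.

```python
# Case-flow tally using only the counts given in the text.
successive_patients = 313

# Exclusions made before case selection.
excluded_in_advance = {
    "trauma": 38,
    "nasal and facial bones": 5,
    "postreduction of fractures": 2,
    "interpreted by potential observers": 10,
}

# Exclusions made after case selection.
excluded_after_selection = {
    "clinical history not retrievable or verifiable": 53,
    "uncompressed soft-copy images unavailable": 62,
}

n_advance = sum(excluded_in_advance.values())
n_after = sum(excluded_after_selection.values())
remaining = successive_patients - n_advance - n_after

print(n_advance, n_after, remaining)  # 55 115 143
```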
The Interpretation Test
Seven observers participated in the study. All were staff radiologists who
either interpreted emergency department cases on the day shift or regularly
rotated through the emergency department reading room on the evening shift
(5:00 P.M.-11:00 P.M.). None had given the primary interpretation or had
previously reviewed any of the cases. The hard-copy cases were divided into 10
groups of 10 each and given to the radiologists to interpret using a standard
form (Appendix) with check boxes and a place to write a comment. The reason
for requesting the examination was stated for each case, and the observer was
asked to decide, with respect to the clinical history, whether findings were
positive, equivocal, or negative. If equivocal, the observer was asked to
state whether findings were ruled equivocal for technical reasons or for
diagnostic reasons. The observer was also asked to report any findings not
related to the clinical history and to indicate whether patient care would be
affected by them.
After 6 months, the soft-copy cases were transferred to removable disks (Jaz 2GB; Iomega, Roy, UT) that were given to each of the observers. Two Advantage display workstations (General Electric Medical Systems, Milwaukee, WI) in the radiology interpretation rooms were equipped with the disk drives, and the cases were interpreted again. Both workstations had 2 × 2.5K monitors and were located in dimly lit rooms. The luminance range of the workstation display was 0.2-200 cd/m². The luminance range and linearity were checked weekly during the study with a J17 illuminometer (Tektronix, Beaverton, OR). The viewing distance was not controlled. Observers were free to use any image-processing functions on the workstation.
Data Reduction and Analysis
The method for performing the MDA using the expectation-maximization
algorithm has been described in the literature [6,7].
The objective of the analysis is to estimate the parameters of three binomial
distributions representing three groups of cases: easy normal cases, hard
cases (normal and abnormal), and easy abnormal cases. The parameters that are
computed are the probability that a case in the group will be interpreted as
positive (m) and the proportion of cases in that group (p).
Hard cases are defined as those on which observers do not agree. Because
observers cannot distinguish normal from abnormal among the hard cases, hard
normal and hard abnormal cases are grouped together.
A finer distinction into hard normals and hard abnormals can be made, but
doing so requires estimating the parameters of four binomial distributions and
requires more cases and more observers than were used in our study.
The input to the expectation-maximization algorithm is a set of interpretations coded as positive and negative. The observers in our study preferred to use three categories of findings: positive, equivocal, and negative. An analysis of the distribution of equivocal reports showed that they were equally likely to be associated with a positive or a negative interpretation. Our study had more negative than positive interpretations, so the equivocal interpretations were combined with the negative interpretations. The standard deviation of the mixture distribution parameters was computed using 1000 bootstrap samples with replacement [9].
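To make the estimation step concrete, the following is a minimal sketch of a generic expectation-maximization fit for a mixture of three binomial distributions. It is not the implementation described in [6,7]. The function names, starting values, iteration count, and the synthetic data are all assumptions made for illustration. Each case is summarized by the number of observers who called it positive after equivocal reports have been recoded as negative.

```python
import math
import random


def binom_pmf(k, n, m):
    """Probability that k of n observers call a case positive, given m."""
    return math.comb(n, k) * m**k * (1 - m) ** (n - k)


def em_binomial_mixture(counts, n_obs, m_init=(0.05, 0.5, 0.95), n_iter=200):
    """Fit m (probability of a positive call) and p (proportion of cases)
    for each group. Starting values and n_iter are illustrative choices."""
    m = list(m_init)
    p = [1.0 / len(m)] * len(m)
    for _ in range(n_iter):
        # E-step: posterior probability that each case belongs to each group.
        resp = []
        for k in counts:
            w = [p[j] * binom_pmf(k, n_obs, m[j]) for j in range(len(m))]
            total = sum(w)
            resp.append([x / total for x in w])
        # M-step: update group proportions and positive-call probabilities.
        for j in range(len(m)):
            weight = sum(r[j] for r in resp)
            p[j] = weight / len(counts)
            if weight > 0:
                mj = sum(r[j] * k for r, k in zip(resp, counts)) / (weight * n_obs)
                # Keep m strictly inside (0, 1) so the likelihood stays positive.
                m[j] = min(max(mj, 1e-6), 1 - 1e-6)
    return m, p


# Synthetic data only (NOT the study data): 100 cases read by 7 observers.
rng = random.Random(0)
true_m, true_p, n_obs = (0.03, 0.5, 0.97), (0.5, 0.2, 0.3), 7
counts = []
for _ in range(100):
    group = rng.choices(range(3), weights=true_p)[0]
    counts.append(sum(rng.random() < true_m[group] for _ in range(n_obs)))

m_hat, p_hat = em_binomial_mixture(counts, n_obs)
print([round(x, 2) for x in m_hat], [round(x, 2) for x in p_hat])
```

The three components correspond to easy normal, hard, and easy abnormal cases. Fixing the starting values near 0, 0.5, and 1 is a convenience of this sketch and not something the text specifies.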
The relative percentage of agreement was calculated as
[Equation image missing from the source.]
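One plausible reading of the definition of reliability given earlier (agreement of a typical observer with the majority) can be sketched as follows. This is an assumption and not necessarily the formula the authors used. Within a group whose probability of a positive call is m, the majority call is positive when m > 0.5, so a typical observer agrees with the majority with probability max(m, 1 - m). Weighting by the group proportions p gives an overall percentage. The function name and the parameter values are hypothetical.

```python
def relative_percentage_agreement(m, p):
    """Assumed form: 100 * sum over groups of p_j * max(m_j, 1 - m_j)."""
    return 100.0 * sum(pj * max(mj, 1.0 - mj) for mj, pj in zip(m, p))


# Illustrative parameters only, NOT the estimates from this study:
# easy normal, hard, and easy abnormal groups.
m_example = (0.02, 0.50, 0.98)
p_example = (0.60, 0.10, 0.30)

rpa_example = relative_percentage_agreement(m_example, p_example)
print(round(rpa_example, 1))  # 93.2
```

Under this reading, the hard group (m near 0.5) contributes agreement no better than a coin toss, so the overall value falls as the proportion of hard cases rises.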
The kappa statistic was calculated for pairs of observers using the three response categories [10]. The mean and standard deviation of the kappa statistic were computed for the 21 possible pairings of seven observers for hard copy and soft copy.
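The pairwise calculation can be sketched as below, assuming the unweighted Cohen kappa (the text does not say whether a weighted form was used). The observer labels and readings are invented toy data, and the function name is hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev

CATEGORIES = ("pos", "eq", "neg")


def cohen_kappa(a, b, categories=CATEGORIES):
    """Unweighted Cohen kappa for two observers reading the same cases."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each observer's marginal category frequencies.
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if expected == 1.0:
        return float("nan")  # undefined when both use one identical category
    return (observed - expected) / (1.0 - expected)


# Toy readings (NOT study data): 7 observers, 8 cases each.
readings = {
    "R1": ["neg", "neg", "pos", "eq", "neg", "pos", "neg", "neg"],
    "R2": ["neg", "neg", "pos", "neg", "neg", "pos", "neg", "eq"],
    "R3": ["neg", "eq", "pos", "eq", "neg", "pos", "neg", "neg"],
    "R4": ["neg", "neg", "pos", "pos", "neg", "eq", "neg", "neg"],
    "R5": ["neg", "neg", "eq", "eq", "neg", "pos", "neg", "neg"],
    "R6": ["neg", "neg", "pos", "neg", "neg", "pos", "pos", "neg"],
    "R7": ["neg", "neg", "pos", "eq", "eq", "pos", "neg", "neg"],
}

# Seven observers give 7 * 6 / 2 = 21 distinct pairs.
kappas = [cohen_kappa(readings[i], readings[j])
          for i, j in combinations(readings, 2)]

print(len(kappas), round(mean(kappas), 2), round(stdev(kappas), 2))
```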
Mixture Distribution Analysis
The results of the MDA are shown in
Table 4 and plotted in
Figure 1 as the means and the
95% confidence intervals (CIs) calculated from the bootstrap standard
deviations. The relative percentage of agreement was 90.1 (86-94, 95% CI) for
hard copy and 89.8 (86-94, 95% CI) for soft copy.
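A confidence interval derived from a bootstrap standard deviation can be sketched as follows, assuming a normal approximation (estimate ± 1.96 SD), which the text does not state explicitly. The function name, the toy data, and the toy statistic are hypothetical. In the study the resampled statistic would be the relative percentage of agreement recomputed from each bootstrap sample of cases.

```python
import random
from statistics import mean, stdev


def bootstrap_ci(cases, statistic, n_boot=1000, z=1.96, seed=0):
    """Resample cases with replacement, take the SD of the replicated
    statistic, and form estimate +/- z * SD (normal approximation)."""
    rng = random.Random(seed)
    estimate = statistic(cases)
    replicates = [statistic(rng.choices(cases, k=len(cases)))
                  for _ in range(n_boot)]
    sd = stdev(replicates)
    return estimate, sd, (estimate - z * sd, estimate + z * sd)


# Toy per-case scores (NOT study data): 1 = agreement, 0 = disagreement.
toy_cases = [1] * 85 + [0] * 15

est, sd, (lo, hi) = bootstrap_ci(toy_cases, lambda xs: 100 * mean(xs))
print(round(est, 1), round(sd, 1), round(lo, 1), round(hi, 1))
```

Under the same assumption, an interval of 86-94 around a value near 90 would correspond to a bootstrap standard deviation of roughly 2 percentage points.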
Kappa
The mean kappa statistic for 21 pairs of observers was 0.48 (SD, 0.07) for
the hard copy and 0.49 (SD, 0.08) for the soft copy.
The value of the kappa statistic is also difficult to interpret because it varies with the ratio of abnormal to normal cases, even when the basic level of agreement does not change [12]. The calculation of kappa assumes that agreement must be due to chance when there is a high prevalence of either normal or abnormal cases. In imaging studies, observers actually agree on high prevalence cases. When two radiologists report a case as showing normal findings in a population in whom most of the studies being done show normal findings, their interpretations should not be classified as chance agreement.
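The dependence of kappa on prevalence can be illustrated with two hypothetical 2 × 2 agreement tables in which two observers agree on 90 of 100 cases. The counts are invented for illustration, and the function name is hypothetical.

```python
def kappa_2x2(both_pos, a_pos_only, b_pos_only, both_neg):
    """Return (observed agreement, Cohen kappa) for two observers."""
    n = both_pos + a_pos_only + b_pos_only + both_neg
    observed = (both_pos + both_neg) / n
    a_pos = (both_pos + a_pos_only) / n
    b_pos = (both_pos + b_pos_only) / n
    # Agreement expected by chance from the marginal positive rates.
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    return observed, (observed - expected) / (1 - expected)


# Balanced sample: half of the calls are positive.
obs_balanced, kappa_balanced = kappa_2x2(45, 5, 5, 45)

# Mostly normal sample: one tenth of the calls are positive.
obs_skewed, kappa_skewed = kappa_2x2(5, 5, 5, 85)

print(obs_balanced, round(kappa_balanced, 2))  # 0.9 0.8
print(obs_skewed, round(kappa_skewed, 2))      # 0.9 0.44
```

The observed agreement is 90% in both tables, yet kappa drops from 0.80 to about 0.44 when most cases are normal, because the chance-agreement term rises from 0.50 to 0.82.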
We conclude that, in the emergency department, soft-copy interpretation is as reliable as hard-copy interpretation. The strength of this conclusion depends on the validity of the MDA approach and the extent to which the observer sample and case sample represent the cases seen in the emergency department. Table 4 and Figure 1 show that when cases were interpreted using hard copy, the proportion of cases called negative was slightly lower and the proportion of cases called positive was slightly higher. The differences were well within the 95% confidence limits. Even if they were real, it would take a very large study to determine whether the observed difference reflected a true difference between hard copy and soft copy or was just chance variation. Gur [13] has pointed out that very small differences in the accuracy of image interpretation may have a "significant effect on the actual added value of expert image interpretation." Using the MDA in a field study eliminates the need for a verification panel and frees up expert time for case interpretation. A larger study can be done using the added observers without an increase in total effort. The MDA methodology is useful for comparing imaging systems in situations in which independent proof is not available.
APPENDIX: Example of Question on Response Form