1
Klinikum Krefeld, Institute of Diagnostic Radiology, Lutherplatz 40, 47805
Krefeld, Germany.
2
Present address: Siemens AG, Medical Engineering,
Völklinger Str. 2, 40219
Düsseldorf, Germany
3
German National Research Center for Information Technology (GMD), Institute
for Applied Information Technology, Schloss Birlinghoven, 53754 Sankt
Augustin, Germany.
Received July 24, 1998;
accepted after revision September 21, 1999.
B. Graf is a consultant to Siemens.
Abstract
MATERIALS AND METHODS. The study was conducted with clinical digital chest radiographs of 48 patients. CT images of the same patient group served as the gold standard. Data on four different monitor conditions (1K overview, 2K overview, 1K with postprocessing, and 2K with postprocessing) were collected using a 6-point confidence-rating scale and interpreted with an alternative free-response receiver operating characteristic.
RESULTS. When magnification and window settings were applied on the 1K monitor at the expense of an increased interpretation time, observer performance with the 1K monitor was not significantly different from that with the 2K monitor. A significant difference only occurred between the 1K monitor postprocessing condition and the 1K monitor overview condition.
CONCLUSION. Considering diagnostic accuracy, the 1K monitor is sufficient for the detection of pulmonary nodules, provided that postprocessing options, especially magnification, are applied. Further comparative monitor studies on the detectability of other abnormalities (e.g., fine interstitial structures) need to be performed.
Introduction
Because the spatial resolution of the monitor is a main perceptual factor of the diagnostic quality in soft-copy interpretation [5], our first hypothesis was that the higher resolution of the 2K monitor compared with that of the 1K monitor would change the diagnostic quality. Our second hypothesis was that the use of voluntary image postprocessing would further change the diagnostic decision quality when the perceptual factors are the main influence on the diagnostic decision. We made no assumption about the direction of change because diagnostic decisions tend to be influenced not only by perceptual factors, but also by individual cognitive reaction styles, the stress of a diagnostic situation, and the differences in user performance. Our hypotheses were tested in an alternative free-response receiver operating characteristic (ROC) design.
Materials and Methods
In the final patient sample group, the chest radiographs of 36 patients (17 women, 19 men) revealed at least one pulmonary nodule, and the chest radiographs of 12 patients (eight women, four men) did not indicate any pulmonary nodules. For all patients, CT scans of the chest served as gold standard. The patients with nodules were 36-77 years old (mean, 61 ± 10 years), and the patients without nodules were 34-80 years old (mean, 60 ± 14 years). The period between CT and digital luminescence radiography examination was 0-14 days (mean, 4 ± 4 days) for the patients with nodules and 0-13 days (mean, 4 ± 4 days) for the patients without nodules. The indications for the CT examinations for all patients are shown in Figure 1. Figure 2 shows a histogram of nodule diameter for the 68 pulmonary nodules seen on the CT scans. Pulmonary nodules were caused by bronchus carcinoma (n = 8), metastasis (n = 36), inflammatory infiltration (n = 6), lymphoma (n = 6), scar (n = 11), and histiocytoma (n = 1). The period between CT and digital luminescence radiography examination of inflammatory infiltrations was 3 days; no substantial evolution of the inflammatory process occurred during this time. The localization of the nodules was registered schematically by drawing them in standardized lung diagrams for every patient record.
For all patients, the set of digital radiographs consisted of a posteroanterior and a lateral image. The digital luminescence radiographs were obtained with 125 kVp, 200-cm focus-detector distance, and a 12:1 grid (40 lines per centimeter, focused to 180 cm, 2.5 mm Al filter) on a Digiscan 2H digital imaging plate system (Siemens, Erlangen, Germany) with ST-V screens and a sensitivity class of 200. According to the body size of each patient, a screen format of 35 x 35 cm with 1760 x 1760 pixels at 10 bits per pixel for smaller patients or 35 x 43 cm with 1760 x 2144 pixels at 10 bits per pixel for larger patients was applied. The radiographs had a pixel size of 0.2 mm and spatial resolution of 2.5 line pairs/mm.
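The reported pixel size, matrix sizes, and spatial resolution can be cross-checked with a short calculation. The sketch below assumes that the stated resolution corresponds to the Nyquist limit of the pixel grid (one line pair per two pixels); the function names are illustrative and not part of the imaging system's software.

```python
# Cross-check of the acquisition geometry reported in the text.
# Assumption: the quoted 2.5 line pairs/mm is the Nyquist limit of the pixel grid.
PIXEL_SIZE_MM = 0.2  # pixel size stated in the text


def nyquist_lp_per_mm(pixel_size_mm: float) -> float:
    """Highest spatial frequency (line pairs/mm) a pixel grid can represent."""
    return 1.0 / (2.0 * pixel_size_mm)


def field_size_mm(n_pixels: int, pixel_size_mm: float = PIXEL_SIZE_MM) -> float:
    """Physical extent covered by n_pixels along one axis."""
    return n_pixels * pixel_size_mm


if __name__ == "__main__":
    print("Nyquist limit (lp/mm):", nyquist_lp_per_mm(PIXEL_SIZE_MM))
    print("1760 pixels span (mm):", field_size_mm(1760))
    print("2144 pixels span (mm):", field_size_mm(2144))
```

Under this assumption, 1760 and 2144 pixels span roughly 352 mm and 429 mm, which agrees with the nominal 35-cm and 43-cm screen dimensions.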
Compared Monitors
In our radiology department, 1K monitors (high-contrast grayscale Simomed
monitors, type: SMM 2183 L; Siemens, Karlsruhe, Germany) are used for
soft-copy reporting. In this study, the conventional 1K monitor was compared
with a newly developed high-resolution 2K monitor (high-contrast gray-scale
Simomed monitor, type: SMM 21190 P; Siemens, Karlsruhe, Germany). The 1K
monitor was connected to a SUN Sparc 5 computer (TRITEC Electronic, Mainz,
Germany) and the 2K monitor to a SUN Ultra Sparc 1 computer (TRITEC
Electronic). On the 2K monitor, a later software version was implemented with
a user interface slightly different from that of the 1K monitor.
Table 1 summarizes the relevant
monitor parameters. The different software versions on the 1K and 2K monitors
did not affect the study because the observers applied only a few identical
software functions. The button design was composed of text on the 1K monitor
and icons on the 2K monitor.
Brightness of both monitors was adjusted to 0.06 foot-lamberts for the dark control field and 73 foot-lamberts for the bright control field. A monitor test pattern (standard RP-133) of the Society of Motion Pictures and Television Engineers (New York, NY) was used for the adjustment of brightness. The brightness values were regularly controlled using a Mavo-Monitor digital instrument (Gossen, Erlangen, Germany). The ambient room lighting was dimmed during image viewing.
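For readers more familiar with SI units, the luminance settings can be converted to candelas per square meter. The conversion factor (1 foot-lambert is approximately 3.426 cd/m2) is a standard photometric constant; the helper names are illustrative only.

```python
# Convert the monitor luminance settings from foot-lamberts to cd/m^2.
CD_PER_M2_PER_FOOT_LAMBERT = 3.4262591  # standard conversion factor

DARK_FIELD_FL = 0.06   # dark control field, from the text
BRIGHT_FIELD_FL = 73.0  # bright control field, from the text


def foot_lamberts_to_cd_per_m2(foot_lamberts: float) -> float:
    """Convert a luminance value from foot-lamberts to cd/m^2."""
    return foot_lamberts * CD_PER_M2_PER_FOOT_LAMBERT


def luminance_ratio(bright: float, dark: float) -> float:
    """Ratio of bright-field to dark-field luminance (unit independent)."""
    return bright / dark


if __name__ == "__main__":
    print("Dark field (cd/m^2):", round(foot_lamberts_to_cd_per_m2(DARK_FIELD_FL), 3))
    print("Bright field (cd/m^2):", round(foot_lamberts_to_cd_per_m2(BRIGHT_FIELD_FL), 1))
    print("Luminance ratio:", round(luminance_ratio(BRIGHT_FIELD_FL, DARK_FIELD_FL)))
```

The settings correspond to about 0.21 cd/m2 and 250 cd/m2, a luminance ratio of roughly 1200:1.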
On both monitors, digital luminescence radiographs were presented in two different ways: overview images (a) without postprocessing and (b) with voluntary postprocessing (double magnification and window-level adjustment). The digital luminescence radiographs were displayed in full resolution on the 2K monitor overview and on the 1K monitor overview with half acquisition size. The maximum size of the overview image segment was 1024 x 1024 pixels on the 1K monitor and 2048 x 2048 pixels on the 2K monitor (area on both monitors, 300 x 300 mm2). The rest of the display was used for the application menu. The different digital luminescence radiography image formats (quadratic, 1760 x 1760 pixels; rectangular, 1760 x 2144 pixels) were displayed on the monitors under the overview condition; areas of 880 x 880 pixels (half acquisition size, factor 0.5) on the 1K monitor and 1760 x 1760 pixels (full resolution, factor 1) on the 2K monitor were assigned to the quadratic digital luminescence radiographs. The rectangular images were visualized with a slightly different display scale factor on both monitors. The images were fit into the quadratic segment and visualized at an area of 841 x 1024 pixels (factor 0.48) on the 1K monitor and 1682 x 2048 pixels (factor 0.96) on the 2K monitor.
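The display scale factors above can be reproduced with a simple fit-to-segment rule. The sketch below is an inference from the reported numbers, not the monitor software's documented behavior: it assumes the image is scaled to fit the square segment but never above a maximum factor (0.5 on the 1K monitor, 1.0 on the 2K monitor), and that pixel counts are rounded to the nearest integer.

```python
# Hypothetical reconstruction of the overview scaling described in the text.
# Assumptions: fit-to-segment scaling, capped at max_factor, nearest-integer rounding.
def overview_size(width: int, height: int, segment: int, max_factor: float):
    """Return (factor, displayed_width, displayed_height) for a square segment."""
    fit_factor = segment / max(width, height)  # factor that makes the long side fit
    factor = min(max_factor, fit_factor)
    return factor, round(width * factor), round(height * factor)


if __name__ == "__main__":
    formats = {"quadratic": (1760, 1760), "rectangular": (1760, 2144)}
    monitors = {"1K": (1024, 0.5), "2K": (2048, 1.0)}
    for monitor, (segment, cap) in monitors.items():
        for name, (w, h) in formats.items():
            factor, dw, dh = overview_size(w, h, segment, cap)
            print(monitor, name, round(factor, 2), dw, dh)
```

With these assumptions the rule yields the reported 880 x 880 and 841 x 1024 pixels on the 1K monitor and 1760 x 1760 pixels on the 2K monitor. For the rectangular format on the 2K monitor it gives a width within one pixel of the reported 1682, so the actual rounding convention may differ from the one assumed here.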
Study Design
Six certified radiologists with a minimum of 9 years' experience in
interpreting chest radiographs participated in the study. Eight viewing
sessions per observer were separated by a relatively short interval of 0-8
days (mean, 2 ± 2 days) because the study had to be performed in a
period of only 7 weeks. Image repetition occurred at an interval of 1-12 days
(mean, 5 ± 3 days). The examination period was limited because the 2K
monitor was on loan from Siemens.
The sessions took place in our regular interpretation room. Before the sessions, each observer was given written instructions. Each of the radiologists had at least 2 years of experience with soft-copy interpretation.
All radiographs of the 48 patients (68 nodules) were presented under four different monitor conditions: 1K and 2K monitors each with and without voluntary postprocessing (magnification and window-level adjustment). The monitor types could not be anonymous because of the different display formats (1K monitor, landscape; 2K monitor, portrait). Furthermore, the 2K monitor was easy to identify with the naked eye.
All digital luminescence radiographs were made anonymous, and the image name was hidden during the interpretation sessions. For every observer, the cases were randomly divided into four subsets of 12 images each. In every session, each observer viewed one subset on the 1K monitor and one subset on the 2K monitor, both either with or without voluntary postprocessing. A specific image was reviewed only once per session. The sessions were interchanged in such a manner that three observers started with the 1K monitor and the other three observers started with the 2K monitor. Image presentation order was designed to offset any practice effects.
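The case allocation can be illustrated with a minimal sketch. The actual randomization procedure and session schedule are not given in the text, so the seeded shuffle below is an assumption chosen only for reproducibility, and the function names are hypothetical.

```python
# Illustrative random division of 48 cases into four subsets of 12 per observer.
# Assumption: a simple seeded shuffle; the study's real procedure is not described.
import random


def split_cases(case_ids, n_subsets: int, seed: int):
    """Randomly divide case_ids into n_subsets equally sized, disjoint subsets."""
    rng = random.Random(seed)
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    size = len(shuffled) // n_subsets
    return [shuffled[i * size:(i + 1) * size] for i in range(n_subsets)]


def sessions_needed(n_subsets: int, n_conditions: int, subsets_per_session: int) -> int:
    """Sessions required if every subset is read once under every condition."""
    return n_subsets * n_conditions // subsets_per_session


if __name__ == "__main__":
    for observer in range(1, 7):
        subsets = split_cases(range(1, 49), n_subsets=4, seed=observer)
        print("observer", observer, [len(s) for s in subsets])
    print("sessions per observer:", sessions_needed(4, 4, 2))
```

Reading four subsets under four monitor conditions, two subsets per session, requires eight sessions per observer, which is consistent with the eight viewing sessions reported.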
The observers were asked to detect every pulmonary nodule and subjectively estimate the nodule's suspected presence on a six-level confidence-rating scale: 1 = no, 2 = weak, 3 = possible, 4 = probable, 5 = strong, and 6 = unequivocal suspected presence of a pulmonary nodule. Furthermore, observers specified the location and estimated the diameter of the nodules. All specifications were assessed by the chairperson of the session in a standardized lung diagram with the help of anatomic landmarks such as the chest and bone structures. In all sessions, the observer could page forward and backward between posteroanterior and lateral radiographs. When image interpretation was finished for one image and interpretation time was stopped, the nodule diameter was measured by the observer. The observers were given unlimited interpretation time. Interpretation time was recorded with the digital clock displayed on the monitor. Timing began as soon as the image appeared on the screen and ended when the observer had completed the diagnostic evaluation.
In the viewing sessions with postprocessing, window settings and magnification could be used by the observers. On both monitors, the observers chose between two different magnifier functions: "magic glass" and "magnify." With the magic glass function, the enlarged part of the image is precisely one quarter of the image segment. Within this image detail, the original image is displayed magnified by two. Using the magnify function, the whole image is displayed with an enlarged matrix (double magnification). Thus, only a part of the image is displayed in the image segment. No other option (e.g., unsharp masking) was permitted. Generally, in our department, window levels (width and center) of chest radiographs are adjusted to width = 820, center = 613 for the posteroanterior radiographs and width = 604, center = 651 for the lateral radiographs. For some radiographs, these values were changed and then archived by the radiologist during initial reporting. For all images, the window values were not reset before our retrospective study. The archived window values served as base values. In four of 48 patients, the lateral image was not optimally adjusted because window-level changes were not saved by the radiologists during initial interpretation. The settings for the monitors and the interpretation environment were standardized for all sessions.
Statistical Analysis
Pretest (alternative free-response ROC). As previous studies
on the threshold detectability of pulmonary nodules have already shown, the
visibility limit of soft-tissue nodules in the lungs on conventional
radiographs is a diameter of 3 mm
[6]. With a nodule size of 8-10
mm, observers could separate true cancer from noise that mimics the appearance
of cancer on radiography. The results of hard- and soft-copy interpretation
did not differ [7]. Also,
statistic and psychologic item test theory makes it necessary to evaluate
whether a set of stimulus items (pulmonary lung nodules) is adequate in the
representation of the abilities of test subjects
[8]; therefore, it is necessary
to have a broad spectrum of items in the middle degrees of difficulty.
Therefore, we conducted an alternative free-response ROC testing for the
complete sample of 68 lung nodules, including nodules with a diameter of 3-6
mm, as a statistical pretest to see whether our set of stimuli (pulmonary
nodules) is adequate to represent our observers' abilities and is in
concordance with the experimental results.
ROC analysis and alternative free-response ROC analysis. ROC and alternative free-response ROC analyses are methods originally developed in psychophysics. The main advantage of ROC and alternative free-response ROC analysis compared with other psychophysical scaling methods is the separation between the sensory threshold function and the cognitive decision bias, with four classes of possible answers (true-positive, true-negative, false-positive, and false-negative) and with the assumption of two different probability distributions for positive and negative reactions [9]. The value of simple one-dimensional ROC testing for evaluating diagnostic competence has been proven in several clinical studies about medical diagnostic quality control and management [10,11,12], but it neither requires correct localization of nodules nor allows for multiple abnormalities to be present in the same image.
Correct localization of nodules and multiple abnormalities in the same image could be handled by a multidimensional ROC analysis, but this would raise mathematical problems of curve estimation, consistency, and separation. The free-response ROC analysis [13] and, as a specific form of it, the alternative free-response ROC analysis [14, 15] have been developed as approximate solutions. Both analyses allow an arbitrary number of nodules per image. Observers indicate the confidence levels and the locations of all perceived nodules. Alternative free-response ROC differs from free-response ROC only in how false-positive responses are weighted: instead of false-positive detections, false-positive images are counted. Therefore, only the highest confidence false-positive decision per image is included, regardless of how many lower confidence false-positive decisions are made on images with and without nodules.
In this study, the total number of positive cases was equal to the number of pulmonary nodules, and the total number of negative cases was determined by counting false-positive images and true-negative images. The standard computer program ROCFIT (Metz CE, Chicago, IL) was applied to calculate the area under the alternative free-response ROC curve (Az), which represents an index of observer performance. This software is adapted and improved from the original RSCORE II computer program (Dorfman DD, Iowa City, IA) developed by Dorfman and Alf [16]. The application of ROCFIT to alternative free-response ROC data was suggested by Chakraborty and Winter [15] and has already been used by other researchers [14,15,16,17,18,19].
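The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the procedure implemented in ROCFIT: the data structures and names are hypothetical, and treating an unmarked nodule or an image without false-positive marks as rating 1 ("no") is an assumption made so that every case carries a rating.

```python
# Sketch of alternative free-response ROC (AFROC) scoring.
# Assumption: missed nodules and images without false positives receive rating 1.
from dataclasses import dataclass
from typing import List, Optional, Tuple

NO_FINDING = 1  # lowest level of the six-point confidence scale


@dataclass
class Mark:
    confidence: int                  # 2..6 on the confidence scale
    nodule_id: Optional[str] = None  # None means the mark hit no true nodule


@dataclass
class ImageReading:
    nodule_ids: List[str]  # true nodules on the image (per the gold standard)
    marks: List[Mark]      # all marks the observer placed on the image


def afroc_ratings(readings: List[ImageReading]) -> Tuple[List[int], List[int]]:
    """Return (ratings of positive cases, ratings of negative cases)."""
    nodule_ratings: List[int] = []
    image_fp_ratings: List[int] = []
    for reading in readings:
        # Positive cases: one rating per true nodule.
        for nodule_id in reading.nodule_ids:
            hits = [m.confidence for m in reading.marks if m.nodule_id == nodule_id]
            nodule_ratings.append(max(hits, default=NO_FINDING))
        # Negative cases: only the highest-confidence false positive per image.
        false_positives = [m.confidence for m in reading.marks if m.nodule_id is None]
        image_fp_ratings.append(max(false_positives, default=NO_FINDING))
    return nodule_ratings, image_fp_ratings


if __name__ == "__main__":
    readings = [
        ImageReading(["a", "b"], [Mark(5, "a"), Mark(3), Mark(4)]),
        ImageReading([], []),
    ]
    print(afroc_ratings(readings))
```

In this toy example, nodule "a" is detected with confidence 5, nodule "b" is missed, the first image counts as a false-positive image at confidence 4 (the lower-confidence mark at 3 is ignored), and the second image is a true-negative image. The two rating lists are the kind of input a binormal curve-fitting program would then use.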
The alternative free-response ROC parameters a and b for
the binormal ROC distribution were averaged across the observers to produce an
average alternative free-response ROC. The standard deviations (SD) were
computed by the formula [20]:
$$\mathrm{SD} = \sqrt{\frac{\sum_{i=1}^{N} \left(A_{z,i} - A_{z,\mathrm{av}}\right)^{2}}{N - 1}}$$
Az,av is the average area under the curve of a monitor
condition and Az,i is the area under the curve for one
observer. N is the number of observers. The alternative free-response
ROC curves were calculated, and statistically significant differences between
monitor conditions and between observers were determined by using a two-tailed
Student's t test for paired observations (α = 0.05).
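The averaging and paired comparison can be sketched as follows. The Az values are invented for illustration and are not the study's data. The sketch assumes the sample standard deviation (N - 1 denominator) and compares the paired t statistic with the standard two-tailed critical value for 5 degrees of freedom at the 0.05 level (2.571).

```python
# Paired comparison of two monitor conditions across six observers.
# The Az values below are hypothetical, NOT results from the study.
import math
import statistics

T_CRIT_DF5_TWO_TAILED_05 = 2.571  # t(0.975, df = 5) from standard tables


def paired_t_statistic(x, y) -> float:
    """t statistic of Student's t test for paired observations."""
    diffs = [a - b for a, b in zip(x, y)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD, N - 1 denominator
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


if __name__ == "__main__":
    az_overview = [0.70, 0.60, 0.72, 0.74, 0.73, 0.85]        # hypothetical
    az_postprocessing = [0.75, 0.63, 0.78, 0.77, 0.80, 0.88]  # hypothetical
    for name, values in (("overview", az_overview),
                         ("postprocessing", az_postprocessing)):
        print(name, round(statistics.mean(values), 3),
              round(statistics.stdev(values), 3))
    t = paired_t_statistic(az_postprocessing, az_overview)
    print("t =", round(t, 2), "significant:", abs(t) > T_CRIT_DF5_TWO_TAILED_05)
```

Because each observer reads under both conditions, the test operates on the six within-observer differences, which removes the large between-observer spread from the comparison.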
Further analyses. To test for further differences in observer
performance, we evaluated interpretation times
with two-tailed Student's t tests for paired observations
(α = 0.05) between the four monitor conditions (six pairs) and between the
six observers (15 pairs). We also calculated sensitivity and specificity with
equally weighted positive responses
[21] for the two sample sizes
(n = 45 nodules, n = 68 nodules). For the sample of nodules
with a diameter of 7-35 mm (n = 45), we tested differences in
sensitivity and specificity between monitor conditions and between observers
for statistical significance with a two-tailed Student's
t test for paired observations (α = 0.05). Finally, the application of the
postprocessing options was investigated.
Results
Alternative Free-Response ROC Analysis
After the statistical pretest for n = 68 nodules with an
alternative free-response ROC analysis, 23 nodules with a diameter of 3-6 mm
were excluded from the alternative free-response ROC analysis because of their
poor detectability. According to Oestmann and Galanski
[20] and the statistical test
item theory, the area under the curve values with a range of 0.65-0.90 are
desirable.
An average alternative free-response ROC curve was determined for each monitor condition (Fig. 3). For the four monitor conditions and the six observers, alternative free-response ROC curves were calculated (Figs. 4,5,6,7).
Considering the six pairs of monitor conditions, only one pair showed a significant difference. On the 1K monitor with postprocessing, the average observer performance was significantly better than on the 1K monitor overview (p < 0.05) (Table 2). With regard to observer performance, two observers stood out. The results of observer 2 were significantly worse than those of observers 4, 5, and 6, and observer 6 performed significantly better than all other observers. The performances of observers 1, 3, 4, and 5 did not significantly differ (p < 0.05) (Table 2).
Analysis of Interpretation Times
The average interpretation times for the observers and for the monitor
conditions are shown in Table
3. Observer 2 had the longest average reaction times, ranging from
96 to 157 sec under all monitor conditions, and observer 6 had the shortest
average reaction times, ranging from 31 to 64 sec. Observer 6 had
significantly shorter interpretation times over all monitor conditions
compared with observers 2, 3, 4, and 5, and observer 2 had significantly
longer interpretation times over all monitor conditions compared with
observers 1, 4, and 6. Observers 3, 4, and 5 were not significantly different.
All calculations were conducted for p < 0.05.
Averaged over all observers, interpretation times differed significantly between three pairs of monitor conditions: the 1K monitor without postprocessing versus the 1K monitor with postprocessing, the 1K monitor without postprocessing versus the 2K monitor with postprocessing, and the 2K monitor without postprocessing versus the 2K monitor with postprocessing. In each of these comparisons, the application of postprocessing functions led to a significant increase in the observers' reaction times.
In comparison with the 2K monitor without postprocessing, interpretation time increased by a mean factor of 1.5 ± 0.57 when both postprocessing options were applied on the 1K monitor. The monitor conditions factors are presented in Table 4.
Sensitivity and Specificity
If all nodules (n = 68) are included in the analysis, an average
sensitivity of 44% (SD = 3.5%) is calculated. If small nodules with a diameter
smaller than 7 mm are excluded (n = 45), average sensitivity
increases to 61% (SD = 4.6%). Average specificity for both sample sizes
amounts to 68% (SD = 16.9%).
The following values were determined for the sample of n = 45 nodules with a diameter larger than 6 mm. The differences in sensitivity and specificity between the four monitor conditions were not significant. Observers 2 (p = 0.04) and 4 (p = 0.01) had a significantly higher sensitivity than observer 6. The specificity values averaged over the four monitor conditions reflect the alternative free-response ROC results, indicating similar differences between the observers. Observer 2 had the worst average specificity (38%) while observer 6 achieved the best average specificity (91%). The other observers achieved an average specificity of 63% (observer 1), 70% (observer 3), 69% (observer 4), and 75% (observer 5). Observer 2 achieved a significantly lower specificity than all other observers and observer 6 reached a significantly higher specificity than all other observers (p < 0.05). The specificity of observer 1 was significantly lower than the specificity of observer 4 (p = 0.014) and observer 5 (p = 0.043).
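The per-observer specificities listed above can be checked against the reported average with a short calculation. The sketch uses only the rounded percentages given in the text, so it reproduces the average only to rounding accuracy; the false-positive rate is taken as the complement of specificity, which is the standard definition.

```python
# Consistency check on the per-observer specificities quoted in the text.
import statistics

# Average specificity (%) per observer over the four monitor conditions.
specificity_percent = {1: 63, 2: 38, 3: 70, 4: 69, 5: 75, 6: 91}


def false_positive_rate(specificity: float) -> float:
    """False-positive rate (%) as the complement of specificity (%)."""
    return 100 - specificity


mean_specificity = statistics.mean(specificity_percent.values())

if __name__ == "__main__":
    print("mean specificity (%):", round(mean_specificity, 1))
    for observer, spec in specificity_percent.items():
        print("observer", observer, "false-positive rate (%):",
              false_positive_rate(spec))
```

The mean of the six rounded values is about 67.7%, in line with the reported average specificity of 68%.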
Application of Postprocessing Options
The application of the postprocessing options (window-level adjustment or
magnification or both) was averaged for all observers and cases, and the
results are presented in Figure
8. Three observers (observers 1, 3, and 5) applied the magic glass
on both monitor types. The other three observers preferred the magnify option
but used it only on the 1K monitor because it was too slow on the 2K monitor,
which has a larger number of pixels. These observers applied the magic glass
on the 2K monitor. When the lateral image was not optimally adjusted (4/48),
all observers applied window-level adjustment on the 1K monitor for all views
and on the 2K monitor for 83% of the cases (average across the observers).
Differences were detected between the observers. Observer 3 used both
postprocessing options for all images. Observer 6 used neither magnification
nor window settings in considerably more cases than the other observers.
Discussion
Among our six observers, a high interobserver variability was detected in all investigative conditions, but the accuracy of performance was consistent for each observer under the different viewing conditions. Observers 2 and 6 stood out among the six observers. Under all four monitor conditions, the maximum area under the curve value was calculated for observer 6 and the minimum area under the curve value was determined for observer 2 (Figs. 4,5,6,7). These results are also reflected in the calculation of the specificity in which observer 2 had a high false-positive rate of 62% and observer 6 had a low false-positive rate of 9%. On the other hand, observer 2 achieved a significantly higher sensitivity than observer 6 (p = 0.04). Overall sensitivity was poor (61%), even after exclusion of nodules with a diameter smaller than 7 mm.
The analysis of interpretation (reaction) times points to two influencing factors. The first is the monitor condition in conjunction with the use of image processing functions. As expected, interpretation times were significantly longer when image processing functions were used on the 1K and 2K monitors than when the same monitors were used without them. Average interpretation time increased by a factor of 1.5 in the sessions with postprocessing compared with image interpretation without postprocessing on both monitor types, and observer 6, the observer with the shortest interpretation time, used neither magnification nor window settings in considerably more cases than the other observers.
Second, reaction times differed between observers, independent of the use of image processing functions or the type of monitor used. Although observer 3 always used postprocessing functions, his reaction times did not differ significantly from those of observers 1, 4, and 5, indicating that the use of postprocessing functions is not the main factor determining the length of interpretation times. Also, the range of mean interpretation times for each observer under different monitor conditions was relatively stable. This interpretation is supported by the observation that the two observers with the most noticeable interpretation times, observers 2 and 6, also differed most from the other observers in diagnostic decision style, according to the results of both the area under the curve analysis and the calculation of specificity. Because reaction times are indicative of individual differences in processing information and intellectual ability, our results may indicate some cognitive bias in the diagnostic decisions of our observers. Here, further examinations are necessary.
There are five limitations to our study that should be discussed. First, the chest radiographs were selected during a 1-year period from one digital luminescence radiography unit in our radiology department. Only a limited number of patients were available who had additionally undergone a CT examination (gold standard) within 0-14 days of the digital luminescence radiography examination. To obtain the maximum number of patients for the study, we included nodules with a location-dependent poor detectability positioned in the vicinity of anatomic structures: vascular pattern, heart shadow, and diaphragm in the lower quadrants, and bony structures in the upper quadrants. If such nodules had been excluded from the analysis, better ROC results might have been achieved by our observers.
Second, because the magnification is the most relevant factor in comparing 1K and 2K monitors, window-level adjustment should have been included in both the overview image and the postprocessing image to simply compare voluntary magnification of images versus nonmagnification.
Third, the study had to be performed in only 7 weeks because the 2K monitor was only available for that time period. If the study period had been longer, we could have defined longer intervals between the ROC sessions and, thus, may have been able to completely exclude practice effects.
Fourth, the overall applicability of the results of this study is somewhat limited because only one abnormality was investigated. The results are not generalizable to the routine interpretation of chest radiographs until other abnormalities (e.g., interstitial line structures, diffuse lung structures, pneumothoraces) are also evaluated.
Fifth, our lung nodule sample belonged to the clinical sample in our radiology department, so it is possible that the statistical sample distribution characteristics of the prevalences are not identical with the population distributions. Also, the distributions of gray values for the chosen lung nodules may not be characteristic for the diagnosed syndromes.
In conclusion, the results of this study indicate that no statistically significant differences in observer performance exist between 1K and 2K monitors for one specific abnormality (pulmonary nodules) provided that magnification and window settings are applied on the 1K monitor at the expense of an increased interpretation time. Considerable differences were detected among the six observers, but not among the four monitor conditions. Five of six observers subjectively preferred the 2K monitor because of its high resolution. Significant differences were not detected among the four monitor conditions with regard to sensitivity and specificity. Our results indicate that the use of 2K monitors without postprocessing could possibly expedite soft-copy interpretation in daily clinical routine.
Our initial hypothesis that different types of monitors in combination with different types of image enhancement functions will lead to statistically significant differences in diagnostic decisions could not be confirmed. Further comparative monitor studies on the detectability of other abnormalities (e.g., interstitial line structures, diffuse lung structures, pneumothoraces) still need to be performed. Cognitive and perceptual bias and their influence on diagnostic decision making need to be further investigated. Studies using 4K digital luminescence radiography screens should be performed as soon as 4K screens are widely available to test for a significant increase of diagnostic accuracy when using the 2K monitor with magnification.
Acknowledgments
We thank C. M. Schaefer-Prokop, M. Koschut, C. Paselk, K. D. Neubauer, H.
W. Goergens, and H. J. Persicke for their participation in this study. We also
thank B. Holzki, U. Bick, and E. A. Krupinski for their valuable
assistance.