Original Research
1 Department of Radiology, Seoul National University Bundang Hospital, 300
Gumi-dong, Bundang-gu, Seongnam-si, Gyeonggi-do, Seoul 463-707, Korea.
2 Institute of Radiation Medicine, Seoul National University College of
Medicine, Seoul National University Medical Research Center, Seoul,
Korea.
3 Max-Planck-Institut für Informatik, Saarbrücken, Germany.
4 Medical Research Collaborating Center, Seoul National University Hospital,
Seoul National University College of Medicine, Seoul, Korea.
Received May 2, 2007;
accepted after revision June 24, 2007.
Address correspondence to K. H. Lee
(kholee{at}snubhrad.snu.ac.kr).
Abstract
MATERIALS AND METHODS. One hundred chest CT images were compressed to 5:1, 8:1, 10:1, and 15:1. Five radiologists determined if the original and compressed images were identical (negative response) or different (positive response). The correlation between the results for each metric and the number of readers with positive responses was evaluated using Spearman's rank correlation test. Using the pooled readers' responses as the reference standard, we performed receiver operating characteristic (ROC) analysis to determine the cutoff values balancing sensitivity and specificity and yielding 100% sensitivity in each metric. These cutoff values were then used to estimate the visually lossless thresholds for the compressions for the 100 original images, and the accuracy of the estimates of the two metrics was compared (McNemar test).
RESULTS. The correlation coefficients were –0.918 and 0.925 for PSNR and the HDR-VDP, respectively. The areas under the ROC curves for the two metrics were 0.983 and 0.984, respectively (p = 0.11). The PSNR and HDR-VDP accurately predicted the visually lossless threshold for 69% and 72% of the 100 images (p = 0.68), respectively, at the cutoff values balancing sensitivity and specificity and for 43% and 47% (p = 0.22), respectively, at the cutoff values reaching 100% sensitivity.
CONCLUSION. Both metrics are promising in predicting the perceptible compression artifacts and therefore can potentially be used to estimate the visually lossless threshold.
Keywords: artifacts, CT, data compression, image quality metric, JPEG 2000, visually lossless threshold
If a compressed image is indistinguishable from its original by radiologists, there is no basis for arguing that this visually lossless compression hinders diagnostic accuracy [9]. Although the visually lossless criterion allows relatively lower compression levels, this conservative criterion would be more readily acceptable even by skeptical radiologists and has been gaining support as a practicable compression level [1, 9–14].
To estimate the visually lossless threshold, human observers need to determine whether a compressed image is distinguishable from its original at various compression levels. Because compression tolerance varies by image content [2], the establishment of a robust visually lossless threshold for various images would require a very large study. Instead, image quality metrics can be used for this image discrimination task. These metrics include traditional mathematical metrics, such as peak signal-to-noise ratio (PSNR), and computer-based perceptual metrics modeling the human visual system [15].
The purpose of this study was to determine whether PSNR and a perceptual metric, the High–Dynamic Range Visual Difference Predictor (HDR-VDP) [16], can predict the presence of perceptible artifacts in Joint Photographic Experts Group (JPEG) 2000–compressed chest CT images and therefore can be used to estimate the visually lossless threshold for such compressions.
CT Scanning
This study included 100 consecutive adult patients (60 men and 40 women;
age range, 19–94 years) who underwent contrast-enhanced chest CT using
16-MDCT scanners (Brilliance, Philips Medical Systems) during a period of 7
days in January 2006. The scanning parameters were as follows: detector
collimation, 1.5 mm; gantry rotation time, 0.5 second; pitch, 1.19–1.25;
tube potential, 120 kVp; and effective mAs, 109–185 (mean ± SD,
161 ± 18) using automatic tube current modulation. Reconstruction
parameters were as follows: section thickness, 2 mm; section interval, 1 mm;
medium-sharp reconstruction algorithm (filter type C); matrix, 512 × 512;
and field of view, 210–391 mm.
One image containing the lung was randomly selected per patient to form a 100-image set. These images included 52 and 48 sections above and below the carina, respectively. The types of lesions shown in the images based on two body radiologists' subjective classification are tabulated in Table 1; these radiologists reviewed the images together after completing their visual analyses, which we describe later. If an image contained more than two lesions, they chose the most prominent lesion by consensus.
Image Compression
The 100 original images, having a bit depth of 12 bits/pixel aligned on a
2-byte boundary, were irreversibly compressed to four levels (5:1, 8:1, 10:1,
and 15:1) using a JPEG 2000 algorithm (PICS Tools, Pegasus Imaging Company).
These compressed images were then decompressed, yielding 400 compressed (and
then decompressed) images for comparison with their originals. The JPEG 2000
encoder was set to default settings: 9–7 wavelet filter for irreversible
compression; single tile; six levels of wavelet decomposition; size of
code-block, 64 × 64; size of precinct, 32,768 × 32,768; and a
single layer. The actual compression levels—that is, the ratio of the
original 16 bits/pixel to the compressed size in bits/pixel—achieved for
the four nominal levels were 5.00 ± 0.02 (mean ± SD), 8.00
± 0.04, 10.01 ± 0.05, and 14.99 ± 0.11, respectively. The
variations from the nominal levels were considered unimportant in this study.
For subsequent analyses, window level and width were fixed at –600 and
1,500 H, respectively, which are the default lung window settings in our
practice.
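The "actual compression level" defined above (the original 16 bits/pixel divided by the compressed size in bits/pixel) can be illustrated with a minimal sketch. The function name and the example byte count are hypothetical and are not taken from the study; the 512 × 512 matrix and 16 stored bits per pixel follow the text.

```python
def actual_compression_ratio(compressed_bytes, rows=512, cols=512, stored_bits=16):
    """Ratio of original stored bits/pixel to compressed bits/pixel."""
    # Compressed size expressed in bits per pixel.
    compressed_bpp = compressed_bytes * 8 / (rows * cols)
    return stored_bits / compressed_bpp


# Hypothetical example: a 512 x 512 image whose compressed
# code stream occupies 52,429 bytes (roughly a 10:1 ratio).
ratio = actual_compression_ratio(52429)
```

Because the ratio depends on the achieved file size rather than the requested level, small deviations from the nominal levels, such as those reported above, are to be expected.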
Human Observer Analysis
Five board-certified body radiologists participated. They had 4, 5, 5, 7,
and 8 years of working experience in interpreting body CT findings,
respectively.
Each of the 400 compressed images was paired with its original. The 400 image pairs were randomly assigned to one of eight reading sessions, avoiding repetition of a patient in a session. The order of reading sessions changed among the readers. Sessions were separated by a minimum of 2 weeks.
Each image pair was alternately displayed on a single monitor, and the order of the original and its compression was randomized. The reader selectively toggled between the two images, returning to the first image as desired. Each reader was blinded to the tested compression levels and independently determined whether the two images were identical (or indistinguishable) or different (or distinguishable). When making comparisons, the readers were asked to pay attention particularly to structural details, such as the small airways, pulmonary vessels, interlobular septa, and interlobar fissures, and to the texture of the organs. They were unaware that all images had been irreversibly compressed.
Images were displayed in a one-by-one format (1,483 × 1,483 pixels) using viewing software (PiView STAR, SmartPACS), a monochrome monitor (ME315, Totoku) with a matrix size of 1,536 × 2,048 and display size of 31.8 × 42.3 cm, and matching video hardware (LV32P1, Totoku). The display system was calibrated [17] with software (Medivisor Gray-Scale, Totoku) and a luminance meter (Minolta LS-110, Konica Minolta). The maximum and minimum luminances were 408.8 and 0.8 cd/m², respectively. Ambient room light was subdued.
Images were presented with the lung window setting. Each reader reviewed the images without time constraints. The reading distance was limited to a range of 32–78 cm by aiming a laser beam in front of each reader's forehead onto a ruler perpendicular to the monitor screen. This range had been measured during 30 minutes of the readers' clinical work. The reading distance was limited to reproduce our clinical practice because a reading distance that was too close and one that was too far would artificially enhance and degrade the readers' sensitivity to compression artifacts, respectively [10].
PSNR
After converting the images to 8-bit images by adjusting the window
settings, PSNR (in decibels [dB]) was calculated as follows:
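$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{255^{2}}{\mathrm{MSE}}\right), \qquad \mathrm{MSE} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left[f(i,j) - g(i,j)\right]^{2},$$

where f and g are the 8-bit original and compressed images, respectively, and M × N is the image matrix size.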
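As a minimal sketch of this step, the code below maps CT numbers to 8-bit gray levels under the lung window (level –600, width 1,500) and computes PSNR with the standard 8-bit definition, 10 log10(255² / MSE). The linear window mapping, the rounding, and the function names are assumptions for illustration; the viewing software used in the study may convert values differently.

```python
import math


def window_to_8bit(ct_number, level=-600.0, width=1500.0):
    """Map a CT number to 0-255 with a linear window (assumed mapping)."""
    low = level - width / 2.0
    frac = (ct_number - low) / width
    frac = min(max(frac, 0.0), 1.0)  # clip values outside the window
    return int(round(frac * 255))


def psnr_8bit(original, compressed):
    """PSNR in dB between two equal-sized 8-bit images given as lists of rows."""
    squared_error = 0.0
    count = 0
    for row_o, row_c in zip(original, compressed):
        for a, b in zip(row_o, row_c):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(255.0 ** 2 / mse)
```

Note that PSNR decreases as distortion grows, which is why it correlates negatively with the number of readers who perceive a difference.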
Perceptual Model
Similar to other perceptual metrics
[15], the HDR-VDP is a
computational model that simulates low-level retinal processing of the human
visual system. It is an extension of the Visual Difference Predictor (VDP)
described by Daly [18] that
improved prediction of perceptible image differences in the full visible range
of luminance (high–dynamic range) by modeling local adaptation,
nonlinear response of photoreceptors, and optics of the human eyes
[16]. Because modern medical
display systems offer higher–dynamic range and are significantly
brighter than older cathode ray tube displays, the extended metric is more
suitable for our application. The HDR-VDP takes two images as input and then
outputs a probability-of-detection map in which the pixel value indicates the
probability, ranging from 0 to 1, that an observer viewing the two images will
detect the difference at that pixel location.
The model prediction was performed for the 400 image pairs after transforming each 8-bit image to high–dynamic range luminance format according to the display function of our display system. We set the same viewing conditions (matrix size, display size, reading distance range, and maximum luminance) for the model observer as those for the human observers. The Minkowski metric [19] with a summation parameter (β) of 2.4 was used to summarize the probability-of-detection map in a single numeric value [15].
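The summarizing step can be sketched as a Minkowski summation over the probability-of-detection map with β = 2.4. Whether the sum is normalized by the number of pixels is not stated in the text, so the unnormalized form used here is an assumption, as is the function name.

```python
def minkowski_summation(prob_map, beta=2.4):
    """Collapse a probability-of-detection map (lists of rows) into one number.

    Computes (sum of p**beta) ** (1 / beta) without normalization (assumed).
    """
    total = 0.0
    for row in prob_map:
        for p in row:
            total += p ** beta
    return total ** (1.0 / beta)
```

With β = 1 this reduces to a plain sum of the probabilities, and as β grows the result approaches the maximum value in the map, so β = 2.4 weights strongly visible locations more than a simple average would.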
Statistical Analysis
A biostatistician participated in the study design and performed
statistical analyses using statistical software (SAS software, version 9.1,
SAS Institute). If a reader rated a compressed image as identical to the
original, the response was coded as negative; otherwise, it was coded as
positive. Interobserver agreement for the 400 image pairs was measured using
kappa statistics for multiple readers
[20]. The five readers'
responses were pooled: If three or more readers responded positively, the
pooled response was considered positive; otherwise, it was considered
negative. For each of the 100 original images, the visually lossless threshold
range was determined: If the pooled response was positive at a given
compression level, the visually lossless threshold was regarded as below that
level; otherwise, it was regarded as above that level. Therefore, the visually
lossless threshold of each image could lie in one of the following ranges:
< 5:1, 5:1–8:1, 8:1–10:1, 10:1–15:1, and > 15:1.
The correlation between the results of each metric and the number of readers with positive responses was evaluated using Spearman's rank correlation test. Regarding the pooled readers' responses as the reference standard, we performed receiver operating characteristic (ROC) analysis for the PSNR and HDR-VDP results. For each metric, we recorded cutoff values balancing sensitivity and specificity (where the sum of sensitivity and specificity is the maximum) and yielding 100% sensitivity (where no false-negative prediction occurs). In these analyses, the 95% CIs and p values were adjusted for the clustering effect, which could be introduced by compressing an image to multiple compression levels [21–23].
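The two cutoff definitions can be sketched as follows, assuming that a higher score means a more visible difference (as for the HDR-VDP) and that an image pair is predicted positive when its score is at or above the cutoff. For PSNR, where lower values indicate more visible artifacts, the scores could be negated first. The function names and example data are hypothetical, and the clustering adjustment described above is not modeled.

```python
def sens_spec(scores, labels, cutoff):
    """Sensitivity and specificity when score >= cutoff is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_cutoff(scores, labels):
    """Cutoff that maximizes sensitivity + specificity."""
    best_sum, best_cutoff = None, None
    for c in sorted(set(scores)):
        se, sp = sens_spec(scores, labels, c)
        if best_sum is None or se + sp > best_sum:
            best_sum, best_cutoff = se + sp, c
    return best_cutoff


def full_sensitivity_cutoff(scores, labels):
    """Largest cutoff that produces no false-negative prediction."""
    return min(s for s, y in zip(scores, labels) if y)


# Hypothetical metric scores and pooled reader responses (True = positive).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
labels = [False, False, True, False, False, False, True, True, True, True]
balanced = balanced_cutoff(scores, labels)
conservative = full_sensitivity_cutoff(scores, labels)
```

In this toy data set the 100%-sensitivity cutoff is lower than the balanced cutoff, so it labels more image pairs as distinguishable at the cost of specificity.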
Finally, the visually lossless threshold range of each original image was estimated by each metric using the cutoff values determined in the ROC analyses. This was to simulate the visually lossless threshold estimation process in a real situation, during which the compression level would be adjusted iteratively until the visually lossless threshold is found. The accuracy of the estimated visually lossless threshold was compared between the two metrics using the McNemar test, regarding the pooled readers' decisions as the reference standard. A p value of less than 0.05 was considered a statistically significant difference.
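The paired comparison of the two metrics can be sketched with a McNemar test on the discordant images, that is, those for which one metric estimated the threshold range correctly and the other did not. The text does not state which variant of the test was used; the exact (binomial) form below is an assumption, and the function names are hypothetical.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count images where only metric A, or only metric B, was correct."""
    only_a = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    only_b = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    return only_a, only_b


def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p value from the two discordant counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the discordant images contribute to the test, so two metrics with similar overall accuracy can still differ significantly if they tend to fail on different images.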
Estimation of Visually Lossless Thresholds
The visually lossless threshold ranges determined by the pooled readers'
responses were 5:1–8:1, 8:1–10:1, and 10:1–15:1 for 38
(38%), 55 (55%), and seven (7%) of the 100 original images, respectively. No
image showed a positive response (distinguishable) at a certain compression
level and a negative response (indistinguishable) at a higher compression
level in the pooled readers' responses, although such a case occurred
sporadically in the individual readers' responses (in 16 [3.2%] of 100 ×
5 image–reader combinations).
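The pooling rule and the mapping from pooled responses to a threshold range, as defined in the statistical analysis, can be sketched as follows. The function names are hypothetical. The sketch places the threshold just below the lowest compression level with a positive pooled response, which assumes that responses do not reverse at higher levels (as was the case for the pooled responses here).

```python
LEVELS = [5, 8, 10, 15]  # nominal compression ratios tested


def pooled_response(reader_votes):
    """Positive (True) if three or more of the five readers saw a difference."""
    return sum(1 for vote in reader_votes if vote) >= 3


def threshold_range(positive_by_level):
    """Map pooled responses {level: bool} to a visually lossless threshold range."""
    previous = None
    for level in LEVELS:
        if positive_by_level[level]:
            if previous is None:
                return "< 5:1"
            return f"{previous}:1-{level}:1"
        previous = level
    return "> 15:1"
```

For example, an image judged indistinguishable at 5:1 but distinguishable at 8:1 and above falls in the 5:1 to 8:1 range.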
With the cutoff values balancing sensitivity and specificity, the PSNR and HDR-VDP metrics accurately predicted the visually lossless threshold range for 69% and 72% of the 100 original images, respectively (p = 0.68); underestimated it for 2% and 9%; and overestimated it for 29% and 19%. With the cutoff values yielding 100% sensitivity, the PSNR and HDR-VDP metrics accurately predicted the visually lossless threshold range for 43% and 47% of the images, respectively (p = 0.22); underestimated it for 57% and 53%; and overestimated it for none of the images.
However, the metrics' predictions were not perfect: they overestimated the visually lossless threshold for some images and underestimated it for others, indicating insufficient and excessive conservativeness, respectively. Whether our results can be generalized to different anatomic regions or different imaging techniques remains uncertain. We should also note that we tested only a single window setting (lung); whether the prediction of perceptible artifacts and the estimated visually lossless threshold at that window setting would be valid at other window settings is uncertain.
The PSNR has been widely used to measure compressed image quality because of its computational simplicity. Although PSNR is a reliable image quality metric for homogeneous distortions [15, 24], its accuracy is known to be limited across a wide range of image content [15]. To overcome this limitation, several perceptual metrics that incorporate perceptual factors of the human visual system have been proposed [15, 25]. Of the perceptual metrics, the VDP proposed by Daly [18] and Visual Discrimination Model (VDM) [26] are the most popular and have been the most extensively validated [27–29]. These metrics take different approaches in modeling the human visual system, which has been summarized by Li et al. [28]. The VDM has been reported to be limited in detecting the signal of an arbitrary frequency because it operates solely in the spatial domain and uses a limited number of discrete frequency bands [28]. However, the prediction performances of these two metrics are known to be comparably accurate for nonmedical images [28]. Although others have introduced the proprietary VDM to medical fields [30–33], we used the HDR-VDP, a publicly available extension [34] of Daly's VDP.
For nonmedical images, many researchers have claimed that perceptual metrics outperform PSNR [15, 24, 35], and others have reported no significant difference [36]. For medical images, Siddiqui et al. [31, 32] reported that VDM correlated with human readers' subjective image quality ratings better than PSNR in four chest CT and six radiography images compressed up to 90:1 using the JPEG [31, 32] or JPEG 2000 [31] algorithm. Our ROC analysis suggests no significant difference between PSNR and the HDR-VDP results. With the cutoff values balancing sensitivity and specificity, the HDR-VDP showed higher sensitivity, and PSNR provided higher specificity. The discrepancy between the studies by Siddiqui et al. and ours is not likely explained by the difference in the perceptual model being used (VDM vs HDR-VDP) because the discrepancy lies in the performance of PSNR, which was much better in our results.
We postulated several reasons as to why the two metrics did not show a significant difference in our results. First, we compressed a set of homogeneous images (chest CT images with fixed scanning parameters) using a single compression algorithm to a narrow range of compression levels and then displayed them with a single window setting. This experimental setting might have caused relatively uniform compression artifacts, diluting the aforementioned drawback of PSNR and the advantage of the HDR-VDP in robustness. Second, because many aspects of the human visual system are already taken into account in the design of JPEG 2000, so that distortions are introduced gradually to minimize their visibility [37], a simple metric such as PSNR might correlate well with human perception near the visually lossless threshold. Third, most perceptual metrics, including the HDR-VDP, rely partly on unverified assumptions in modeling the human visual system [25]. Many psychovisual studies to validate the perceptual metrics used test images with relatively simple patterns [27–29]. Therefore, perceptual metrics are not foolproof measures of the perceptible artifacts in complex medical images. Nevertheless, further investigations on perceptual metrics are needed given their potential to cope with the data explosion in the radiology field [3].
Although we did not formally analyze regional variation in the perceptible compression artifacts in an image, our impression was that the artifacts perceived by the radiologists and HDR-VDP were more pronounced at the chest wall and mediastinum than at the lungs, despite the mathematical artifacts being more evenly distributed (Fig. 3). Although this finding needs to be confirmed by another experiment, several considerations should be raised. First, from a perceptual viewpoint, the lung areas, which usually have more clinical importance than the chest wall and mediastinum in an image with a lung window setting, might be more tolerant to the compression, and therefore compressible to a higher level, than the chest wall and mediastinum. Second, if the lung and chest wall–mediastinum were analyzed separately, the HDR-VDP might show significantly better predictions than PSNR. Nevertheless, our study results on the predictions of the two metrics remain valid from the conservative standpoint that we intended to eliminate possible diagnostic inaccuracy due to perceptible artifacts regardless of their locations in an image.
Our study has limitations. First, during the visual analysis, the readers might have learned characteristic artifact patterns that are not clinically important and then relied on these patterns to make decisions, which is a different process from real diagnostic interpretation. However, this limitation seems unavoidable and is common to investigations on the visually lossless threshold [9–14]. Second, because we randomly selected images with the intent to generalize our results throughout the chest, many images necessarily contained only normal structures. Nevertheless, we believe that our results would be reproducible even with a study sample containing more abnormalities because our analysis results of the human visual comparison, PSNR, and HDR-VDP are not likely to be affected by the presence of abnormalities. The alternate displaying method used in this study is known to be very sensitive to image differences regardless of image content [9]. Third, to avoid a possible clustering effect, we tested only a single image per patient, which is unlike a real clinical situation wherein radiologists scroll through a series of images. Video quality metrics [35, 36] may more accurately reflect the real clinical situation. Fourth, because the two compared images were alternately displayed, the readers could use temporal contrast (i.e., luminance change in time at a given region), which is not explicitly modeled by the HDR-VDP. However, because this displaying method is more sensitive to image differences than a side-by-side comparison [9], the determined visually lossless threshold should be more conservative and readily acceptable.
In conclusion, both PSNR and the tested perceptual metric, the HDR-VDP, are promising in predicting perceptible artifacts in JPEG 2000–compressed chest CT images and therefore can potentially be used to estimate the visually lossless threshold for such compressions.
Acknowledgments
We thank the radiologists in our department who participated as
readers.