Original Research
1 Department of Radiology, Breast Imaging, Brigham and Women's Hospital, 75
Francis St., Boston, MA 02115.
2 AVON Breast Center, Boston, MA.
3 Department of Radiology, Institute of Technology Assessment, Massachusetts
General Hospital, Boston, MA.
Received April 10, 2007;
accepted after revision July 6, 2007.
Address correspondence to D. Georgian-Smith
(dgeorgiansmith{at}partners.org).
Abstract
MATERIALS AND METHODS. A total of 6,381 consecutive screening mammograms were interpreted by a primary reader. This radiologist then reinterpreted the studies using CAD ("CAD reader"). A second human reader who was blinded to the CAD results but aware of the primary reader's findings reviewed the studies, looking for abnormalities not seen by the first reader.
RESULTS. The second human reader called back two cancers that the CAD reader did not; the CAD system had marked both findings, but the primary reader dismissed them. Because of the small numbers, the difference between the CAD and second human readers was not statistically significant. The CAD and human second readers increased the recall rates 6.4% and 7.2% (p = 0.70), respectively, and the biopsy rates 10% and 14.7%. The positive predictive value was 0% (0/3) for the CAD reader and was 40% (2/5) for the human second reader. The relative increases in the cancer detection rate compared with the primary reader's detection rate were 0% for the CAD reader and 15.4% (2/13) for the human second reader (p = 0.50).
CONCLUSION. A human second reader or the use of a CAD system can increase the cancer detection rate, but we found no statistical difference between the two because of the small sample size. A possible benefit from a human second reader is that CAD systems can only point to possible abnormalities, whereas a human must determine the significance of the finding. Having two humans review a study may increase detection rates due to interpreter—hence, perceptual—variability and not just increased detection.
Keywords: breast cancer, computer-aided detection, mammography, mammography recall rates, screening mammography
One method shown to improve cancer detection in screening mammography is the use of double reading by two radiologists. Previously, we reported the benefit of double reading: among approximately 6,000 screening mammograms, a second human reader detected 7.7% additional cancers that had been missed by the primary reader (Hulka C et al., presented at the 1994 annual meeting of the Radiological Society of North America [RSNA]). Although double reading requires more resources and delays interpretation, referring physicians and women preferred this system to immediate online interpretation once they were informed of the benefits [4]. Other investigators have reported an increase in detection rates of 5–15% as a result of double reading [5–7].
Another proven method to increase the sensitivity of screening mammography is the use of a CAD system, whether in a community or an academic practice. In a private practice, Freer and Ulissey [8] reported an almost 20% increase in cancer detection with the use of a CAD system. Birdwell et al. [9] reported an overall detection rate of 29 cancers in 8,000 cases (four cases per 1,000 women), in which the use of CAD prompted the detection of two additional cancers and thereby increased detection sensitivity by 7.4%. Most recently, in the largest study to date of more than 21,000 screening studies, Morton et al. [10] had similar results: a 7.6% increase in the number of cancers detected by the addition of CAD.
Because both double reading and CAD have been independently shown to improve sensitivity for breast cancer detection, we retrospectively reviewed our clinical academic practice to compare CAD with a blinded human second reader for the detection of additional breast cancer not seen by a primary radiologist. The purpose of this study was to compare the practice of a human second reader with a CAD reader for the reduction of the number of false-negative cases resulting from review by a primary radiologist.
Radiologists
Eight academic, dedicated breast imagers with an average of 14 years
(range, 3–26 years) of experience independently reviewed the mammograms
under three conditions. No one radiologist interpreted a dominant number of
mammograms. The mean percentage of cases per radiologist was 12% (range,
7–20%). The interpretations represented a fair cross section of our
academic practice to ensure the generalizability of our study results.
Review Sequence
We prospectively designed the review sequence so that the human second
reader was blinded to the results of the CAD, but the clinical care of the
patient was determined by the results of both. Therefore, the clinical
practice was based on these two additional reviews. The results of this study
were determined from a retrospective review of the clinical practice.
Institutional review board approval was obtained before the start of the
study. This study was conducted before HIPAA regulations were in effect.
We set three review conditions as follows: first, the primary radiologist; second, the CAD reader who was the primary radiologist along with input from the CAD system; and, third, the second human reader, a different radiologist than the primary one, who reinterpreted the images without knowledge of the CAD results but with knowledge of the primary reader's results.
Study Design
The primary reader—The screening films were batch read.
Residents or fellows in training were not permitted to review the films so
that the primary reader would not be influenced by another human reader. The
primary reader interpreted each film-screen mammogram and then locked his or
her impression into a computer. Previously obtained films, preferably 2 years
old or older, were available for comparison but were not interrogated by the
CAD system. Prior mammograms were available in 78% (3,930/5,049) of the
patients at the time of interpretation. When an abnormality was detected that
warranted the patient to be called back for diagnostic evaluation, the primary
reader marked the area or areas of concern on the films with a wax marker.
These notations were important so that the second human reader would know
which areas were areas of concern to the primary reader.
The CAD reader—After recording an impression without CAD, the primary reader turned on the CAD system for the four views obtained of the current year. The mammograms were reinterpreted by the primary radiologist with the assistance of the CAD markings. These impressions were then entered into the computer in a separate file, and any new areas of concern were not marked on the films with a wax marker so that the second human reader would not be alerted to the CAD markings.
The second human reader—After the primary reader completed the initial interpretation and the interpretation with CAD, a different radiologist quickly scanned each case to look for areas of suspicion that had not been detected by the primary reader. The intention of this reader was to find mammographic signs of suspicion that had not been detected by the primary reader. It was not the function of the second radiologist to judge the appropriateness of the calls made by the primary reader. Therefore, the primary reader's markings were never reversed by the second human reader.
Callback Workflow
Patients were recalled for additional diagnostic workup based on findings
from any of the three interpretations. All patients who were called back for
additional workup came to the diagnostic breast imaging center at the
hospital.
Pathology and Follow-Up
The histologic results for all of the cases recommended for biopsy were
obtained. Malignant cases were defined as those with ductal carcinoma in situ,
invasive ductal carcinoma, or invasive lobular carcinoma at either core biopsy
or surgical biopsy. All of the screening patients were closely tracked for at
least 12 months to identify potential false-negative cases.
Statistical Analysis
Statistical significance was determined using a two-sided McNemar test to compare the performance of the CAD reader to the second human reader. There were no cases of multifocal or multicentric malignancies.
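As an illustration of the test named above, the following is a minimal sketch of a two-sided exact McNemar test in Python. The function name `mcnemar_exact` is hypothetical, and the use of the exact binomial form is an assumption: the text states only that a two-sided McNemar test was used. The counts are those reported in this article (30 and 34 additional callbacks with one case in common; zero and two additional cancers).

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null hypothesis, each discordant case is equally likely
    # to belong to either reader, so the smaller count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Cancers detected by only one of the two additional readers
# (CAD reader: 0, second human reader: 2).
p_cancer = mcnemar_exact(0, 2)

# Callbacks made by only one of the two additional readers, assuming
# 30 CAD callbacks and 34 second-reader callbacks with 1 shared case.
p_recall = mcnemar_exact(30 - 1, 34 - 1)

print(f"cancer detection: p = {p_cancer:.2f}")
print(f"recall: p = {p_recall:.2f}")
```

Only the discordant cases enter the calculation; cases called back by both readers or by neither do not affect the result.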
The CAD reader called back an additional 30 cases, for an additional callback rate of 0.47% (30/6,381) (Table 1). Of these 30 callbacks, three (10%, 3/30) were recommended for biopsy. There were no malignancies.
There were 34 cases (34/6,381, 0.53%) that were called back by the second human reader that were in addition to the primary reader's callbacks (Table 1). Except for one case, these cases were different from those that were called back on the basis of CAD markings. Five of the 34 cases (14.7%) were recommended for biopsy. Of the second reader's recommendations that went to biopsy, two of the five (40%) were malignant. The increase in cancer detection contributed by the second reader was from 13 by the primary reader to a total of 15 cases for a relative increase in the cancer detection rate of 15.4% (2/13) (Table 1).
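The percentages reported for the two additional readers follow directly from the counts given in the text. The short Python sketch below recomputes them; the helper `pct` and the dictionary layout are illustrative choices, not part of the study.

```python
def pct(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage."""
    return 100.0 * numerator / denominator


total_screens = 6381      # consecutive screening mammograms
primary_cancers = 13      # cancers detected by the primary reader

# Counts reported in the text for each additional reader.
readers = {
    "CAD reader": {"callbacks": 30, "biopsies": 3, "cancers": 0},
    "second human reader": {"callbacks": 34, "biopsies": 5, "cancers": 2},
}

for name, r in readers.items():
    callback_rate = pct(r["callbacks"], total_screens)
    biopsy_share = pct(r["biopsies"], r["callbacks"])
    ppv = pct(r["cancers"], r["biopsies"])
    relative_gain = pct(r["cancers"], primary_cancers)
    print(
        f"{name}: additional callbacks {callback_rate:.2f}%, "
        f"biopsy recommended {biopsy_share:.1f}%, "
        f"PPV {ppv:.0f}%, relative detection increase {relative_gain:.1f}%"
    )
```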
There was no statistically significant difference in performance between the CAD and second human readers as measured by the recall rates and by the relative increase in recall rates (p = 0.70) and in cancer detection rates (p = 0.50) (Table 1). However, because the CAD and second human readers differed by only two cancers, the number was too small to show statistical significance. In contrast, the difference in the number of call-back cases between the two readers was much greater. With 63 cases of disagreement, we would have had 80% power to detect the difference if either reader were associated with twice as many callbacks as the other reader.
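The power statement can be explored with a small Python sketch. This is not the authors' calculation, which the text does not describe. It assumes an exact two-sided McNemar test at a significance level of 0.05 and reads "twice as many callbacks" as a 2:1 split of the discordant cases; the function names are hypothetical. Because of these assumptions, the printed value need not match the reported 80% exactly.

```python
from math import comb


def exact_p(n: int, k: int) -> float:
    """Two-sided exact p-value when k of n discordant cases favor one reader."""
    lower = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * min(lower, upper))


def mcnemar_power(n: int, ratio: float, alpha: float = 0.05) -> float:
    """Chance of a significant result when one reader has `ratio` times
    as many discordant callbacks as the other (assumed interpretation)."""
    p = ratio / (1 + ratio)  # probability a discordant case belongs to the busier reader
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if exact_p(n, k) < alpha
    )


power = mcnemar_power(n=63, ratio=2.0)
print(f"power with 63 discordant cases and a 2:1 split: {power:.2f}")
```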
The overall screening cancer detection rate for all three readers was 2.35 per 1,000 (15/6,381). In addition, there were three interval malignancies not detected by any of the readers that developed within 1 year. There were, thus, a total of 18 cancers at the time of imaging in the cohort, for a prior probability of 2.82 cases per 1,000 women.
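For reference, the two per-1,000 figures in this paragraph can be recomputed as follows; the helper name `per_thousand` is illustrative only.

```python
def per_thousand(cases: int, screens: int) -> float:
    """Express a case count as a rate per 1,000 screening examinations."""
    return 1000.0 * cases / screens


total_screens = 6381
screen_detected = 15   # cancers detected by any of the three readers
interval_cancers = 3   # cancers that surfaced within 1 year, missed by all readers

detection_rate = per_thousand(screen_detected, total_screens)
prevalence = per_thousand(screen_detected + interval_cancers, total_screens)

print(f"screening detection rate: {detection_rate:.2f} per 1,000")
print(f"cancers present at imaging: {prevalence:.2f} per 1,000")
```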
Malignancies Detected by the Second Human Reader
Two additional cancers not seen by the primary reader or the CAD reader
were called back by the second reader (Figs. 1A–1C and 2A–2D). Of clinical significance
is that the CAD system had marked both of the lesions, but the markings were
dismissed by the primary reader. Both cases were called back for diagnostic
workup by the second reader that resulted in recommendations for
short-interval follow-up. Biopsies were subsequently recommended on the basis
of findings at the 6-month workup, and malignancies were then diagnosed.
Because cancer was detected within 1 year of the incident screening as a
result of the second human reader's calls, these cases were counted as
true-positive calls for the second reader.
Our results show that in our academic screening practice the performance of a CAD system and a human second reader were not statistically significantly different as measured by malignancy detection rates or call-back rates, with both readers adding cases to those identified by the primary reader. However, the differences between these two additional readers were too small to detect statistical significance.
A recent large study by Fenton et al. [11] of more than 200,000 women also showed no change in the detection of breast cancer with CAD in a consortium of community practices. The cancer detection rate before and after CAD was 4.15 and 4.20 cases per 1,000 women, respectively; however, the biopsy rate increased from 14.7 to 17.6 cases per 1,000 women (p < 0.001) and the recall rate increased from 10.1% to 13.2% (p < 0.001). Therefore, there was an overall decrease in accuracy after the use of CAD, although admittedly the proportion of ductal carcinoma in situ relative to invasive cancer increased because of the detection of calcifications by CAD. Although the study by Fenton et al. was clearly different in design and purpose from ours, both studies showed a lack of improvement in cancer detection with the use of CAD.
Although the number of cases was too small to determine if the CAD or if the human reader was superior, the results of this study nevertheless highlight an important issue about the use of CAD: CAD can identify a lesion, but the human reader still must determine its significance. Two cancers highlighted by CAD were dismissed by a human, the primary reader. A possible advantage of a human second reader is not only that more lesions may be detected but also that a second human may have a different interpretation (reader variability) of the findings and that difference may lead to earlier diagnosis. Variability in human perception is highlighted by the fact that 64 cases were called back by the CAD reader and human second reader but that only one case was called back by both readers. This finding suggests that, although the clinical outcomes may look the same statistically, the two forms of double reading may be vastly different and, therefore, may not be equivalent.
No previous studies have used this specific methodology to measure CAD against a second reader, to our knowledge, but the authors of several reports have compared CAD with double readings in case-controlled environments. Destounis et al. [12] questioned the effect of CAD on the false-negative rate of screening-detected cancers that had double reading in which the second reader was aware of the initial reader's responses. They retrospectively reviewed prior mammograms in 318 cancer cases that had been clinically interpreted as negative by double reading. The prior years' mammograms were interrogated by CAD if three of five radiologists on a panel had deemed findings to be "actionable." CAD marked 37 of the 52 cases (71%) as showing actionable findings. Therefore, the theoretic value of the CAD markings is that CAD would have reduced the false-negative rate from 31% (98/318) to 19% (61/318).
Birdwell et al. [13] also conducted a study with similar results evaluating the potential effect of CAD on the false-negative rate, but unlike the cases in the Destounis et al. study [12], the clinical cases had been reviewed by only one radiologist. A panel reviewed the prior years' mammograms that had been interpreted as negative in 427 patients with screening-detected cancers and found that 115 cases had lesions visible that were deemed "actionable," for a false-negative rate of 27%. CAD marked the majority of findings, 77% (88/115), similar to the previous study. In addition, there was a theoretic reduction in the false-negative rate by using CAD from 27% (115/427) to 6.3% (27/427).
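The theoretic reductions quoted for the two retrospective studies above rest on the same arithmetic: the false-negative cases that CAD marked are subtracted from the false-negative total. The Python sketch below recomputes them from the counts given in the text. It assumes, as the word "theoretic" implies, that every CAD-marked case would have been acted on; the function name is hypothetical.

```python
def false_negative_rates(total_cancers: int, missed: int, cad_marked: int):
    """Return (rate before CAD, theoretic rate after CAD) as percentages.

    Assumes every missed case that CAD marked would have been recalled.
    """
    before = 100.0 * missed / total_cancers
    after = 100.0 * (missed - cad_marked) / total_cancers
    return before, after


# (total cancer cases, false-negative cases, false-negative cases marked by CAD)
studies = {
    "Destounis et al. [12]": (318, 98, 37),
    "Birdwell et al. [13]": (427, 115, 88),
}

for name, counts in studies.items():
    before, after = false_negative_rates(*counts)
    print(f"{name}: {before:.1f}% before CAD, {after:.1f}% after CAD")
```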
Our usual clinical practice of screening mammography is that of double reading in which the second reader is aware of the initial reader's interpretation. The ultimate goal of a second reader is to reduce the false-negative rate of the first reader. We have reported detection of an additional 7.7% of cancers with the use of double reading before the use of CAD (Hulka C et al., presented at the 1994 annual meeting of RSNA). This finding has been corroborated in community practice by Taplin et al. [14], who showed that the average increase in sensitivity with double reading was 7%. With the introduction of CAD to our clinical setting, we did not find CAD helpful in reducing the false-negative rate of the primary reader. Keep in mind, though, that our methods of evaluating CAD differed significantly from those of the previous two studies of Destounis et al. [12] and Birdwell et al. [13] in that most of our cases were negative and our study group did not stem from a cohort of malignant cases.
As we previously noted, the primary reader in our study dismissed two additional malignant cases that were identified by our second human reader. To conclude clinical significance would be hasty (p =0.50) (Table 1). However, note that the CAD system marked those two cases found by the human second reader and marked the one case of interval cancer considered by consensus to have detectable mammographic findings, but the human reader of the CAD marks, the primary reader, dismissed the findings. So the potential remains, despite our results, for CAD to impact the clinical outcome of false-negative cases.
Nevertheless, there is still no clear consensus in the literature as to the benefit of CAD versus double reading, particularly because of false-positive marks and their effect on specificity. Karssemeijer et al. [15] performed a retrospective study of 250 cancer cases and 250 normal cases that had been independently double read clinically. They evaluated the sensitivity of CAD for these cases. However, they looked at only the findings that had been marked by the radiologists and ignored the potential false-negative interpretations of cases with CAD-only marks. Their conclusion was that the performance of an independent double reading was significantly better than that of CAD (Tukey-Kramer, p = 0.009) because of fewer false-positive marks by a human reader and, hence, a higher specificity for double reading than CAD. These results are in contrast to those of Ciatto et al. [16], who found CAD to be significantly more specific than double reading. Those investigators retrospectively reviewed the screening mammograms of patients with findings interpreted as negative who developed interval cancers. They looked at the effects of double reading and CAD on those cases. CAD was almost as sensitive as independent double reading but, in contrast to the previous study, CAD was more specific. These disparate results as to whether CAD or double reading is superior may be due to differences in methods and study populations.
Most of the findings highlighted by CAD in our study were false-positive markings. False-positives are distracting, and undoubtedly the large number of them contributed to the lack of appreciation by the radiologists in the three "actionable" cases that were falsely interpreted as negative yet marked by CAD: two false-negatives by the CAD reader and one false-negative interval case that, by consensus, showed abnormal findings. With the initial CAD software, there was approximately one mark per film (four per patient), and obviously benign findings, such as axillary lymph nodes and vascular calcifications, were noted. A large number of false-positive marks by CAD have been noted in several studies [17–20].
The effects of superfluous markings on the detection of true findings, which the results of our study illustrate, were discussed by Ikeda et al. [17] and Astley [18]. To calculate a false-positive CAD marking rate, we will use our optimum average number of two marks per case. We know that of the 18 malignant cases in a population of approximately 6,000 patients, there were two malignant cases that had no mammographic findings as determined by consensus retrospective review, leaving 16 cases with mammographic findings. If CAD had marked both projections in all of the cases, there would have been a total of approximately 12,000 marks for the entire population and only 32 marks would have indicated findings of malignancy, resulting in 11,968 marks that did not indicate malignancy and a false-positive rate of 99.7% (11,968/12,000). In this setting, the radiologist becomes numb to the CAD markings, rendering each to be insignificant and easily dismissed. This problem explains why the results of studies using retrospective study cases differ from those using prospective clinical studies.
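The false-positive marking estimate in the preceding paragraph can be written out step by step. The Python sketch below uses only the round figures given in the text (approximately 6,000 patients, two marks per case, 16 malignant cases with mammographic findings); the variable names are illustrative.

```python
patients = 6000          # approximate screening population cited in the text
marks_per_case = 2       # average number of CAD marks per case used in the text
visible_cancers = 16     # 18 malignant cases minus 2 with no mammographic findings

total_marks = patients * marks_per_case          # all CAD marks in the population
true_marks = visible_cancers * marks_per_case    # marks on both projections of each cancer
false_marks = total_marks - true_marks           # marks that do not indicate malignancy
false_fraction = 100.0 * false_marks / total_marks

print(f"{false_marks:,} of {total_marks:,} marks ({false_fraction:.1f}%) "
      "would not indicate malignancy")
```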
The human second reader is rapidly processing findings and unconsciously or consciously dismissing most of them. The CAD marks must be actively evaluated. As Ikeda et al. [17] noted, "it is the radiologist's knowledge of breast cancer imaging and diagnostic acumen that influences the choice to recall a finding, not the marking of the CAD system." Our results clearly corroborate this opinion. Moreover, we also agree that the lack of action on CAD markings that subsequently prove to be malignant is not an indication of radiologists performing below the standard of care, particularly because more than 99% of the markings should be dismissed.
Limitations to this study may be the sample size and the inability to show statistical significance between the human second reader and the CAD reader. To show statistical significance, one would need a difference of at least six cases between the groups; with any fewer, the power is zero. If the difference of two cancers per group were maintained, we would have needed at least four times the number of screening cases, more than 24,000, and all of the malignancies would need to have been identified by only one reader, an unlikely scenario. Another limitation is that the CAD was used on only analogue films. There may be differences with the new digital algorithms, reducing the number of false-positive CAD marks and thereby reducing the confounding impact of the false-positive CAD marks. CAD has always excelled in detecting calcifications. Now, with the large size of a given mammographic image and marked contrast of digital imagery, detection of calcifications by humans has perhaps improved over film-screen mammography, thereby possibly reducing the effect of CAD. All of these issues deserve further study and would affect the results of our form of study.
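The threshold of six cases can be checked with a short Python sketch. It assumes an exact two-sided McNemar test at a significance level of 0.05 and the most favorable situation described above, in which every discordant cancer is found by the same reader; the function names are hypothetical.

```python
def one_sided_sweep_p(n: int) -> float:
    """Two-sided exact p-value when all n discordant cases favor one reader."""
    return min(1.0, 2 * 0.5 ** n)


def minimum_discordant_cases(alpha: float = 0.05) -> int:
    """Smallest number of one-sided discordant cases that reaches significance."""
    n = 1
    while one_sided_sweep_p(n) >= alpha:
        n += 1
    return n


needed = minimum_discordant_cases()
print(f"discordant cases needed for p < 0.05: {needed}")
print(f"p-value with two discordant cases: {one_sided_sweep_p(2):.2f}")
```

With fewer discordant cases than this minimum, no split of the cases between the readers can reach significance, which is why the power is zero below that point.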
The workflow of our current practice has changed because of the advent of digital screening mammography. We currently do not use a second human reader, but we do use the CAD system to supplement the primary reader. This change has occurred because of the increase in time it now takes to review a digital screening mammogram over an analogue one, which anecdotally has been measured in our practice to be twice as long. The additional time needed by two human readers, the primary and second, is too long to be acceptable in our institution, which strives to complete interpretations within 24 hours of acquisition. This practical consideration led to the discontinuation of the second human reader in our practice. Each practice should evaluate the merits of a second reader, whether human or computer, given its own workflow.
In conclusion, the results of our study are concordant with those of previous studies showing that either a second human reader or a CAD system can increase the detection of cancers in a screening program. However, our experience highlights a phenomenon that has been minimally emphasized—namely, how the human who uses CAD interprets the CAD markings. We showed that the two cases detected by the human second reader were not recalled by the primary radiologist despite their identification by the CAD system; the primary reader was perhaps influenced by the very large number of false-positive marks.
CAD only identifies. The interpretation of a mammogram still depends on the judgment of a radiologist, relying on experience and knowledge, to determine its importance. Variability in human performance is highlighted by the fact that the CAD reader and the human second reader called back different cases from one another. A CAD system can minimize perceptual failure but cannot compensate for interpretation failure. The use of a human second reader has the advantage of offering a second interpretation and an increase in perception.
Acknowledgments
We thank Donna Burgess for her contributions to this study in managing the
operational aspects and supporting data preparation.