DOI:10.2214/AJR.05.0940
AJR 2006; 187:1475-1482
© American Roentgen Ray Society


Original Research

Testing the Effect of Computer-Assisted Detection on Interpretive Performance in Screening Mammography

Stephen H. Taplin1,2, Carolyn M. Rutter1 and Constance D. Lehman3

1 Group Health Cooperative, Center for Health Studies, Seattle, WA 98101.
2 National Cancer Institute, Applied Research Program, Division of Cancer Control and Population Sciences, 6130 Executive Blvd., MSC 7004, EPN 4500, Bethesda, MD 20892.
3 Department of Radiology, University of Washington, Seattle Cancer Care Alliance, Seattle, WA 98109.

Received June 2, 2005; accepted after revision January 6, 2006.

 
This work was funded by grant CA63731 from the National Cancer Institute (NCI), but the opinions are solely those of the authors and do not imply any endorsement by NCI or the federal government. In addition, R2 Technology provided equipment and technical assistance for the project. The findings are those of the authors and cannot be construed to reflect the thoughts or opinions of R2 Technology.

Address correspondence to S. H. Taplin.


Abstract
 
OBJECTIVE. The objective of our study was to test whether the use of computer-assisted detection (CAD) improves sensitivity at no cost to specificity for the detection of breast cancer and enables more accurate assessment of fatty breast tissue compared with dense breast tissue.

MATERIALS AND METHODS. We created a stratified random sample of screening mammograms weighted with difficult cases and split evenly between women with fatty breast tissue and women with dense breast tissue: 114 patients were cancer-free, 114 had cancer diagnosed within 1 year of screening, and 113 had cancer diagnosed 13-24 months after screening. In test settings 6 months apart, 19 community radiologists interpreted 341 bilateral screening mammograms with and without CAD. We compared the sensitivity and specificity using regression models adjusting for repeated measures.

RESULTS. CAD assistance did not affect overall sensitivity (cancer by 1 year: 63.2% without CAD and 62.0% with CAD; cancer in 13-24 months: 33.5% without CAD and 32.3% with CAD), but its effect differed for visible masses that were marked by CAD compared with those that were not marked by CAD (hereafter referred to as "unmarked"). CAD was associated with improved sensitivity for marked visible cancers and decreased sensitivity for unmarked visible masses; the sensitivities without and with CAD, respectively, were as follows: marked cancer by 1 year, 82.7% versus 83.1%; marked cancer in 13-24 months, 55.2% versus 57.9%; unmarked cancer by 1 year, 37.4% versus 30.1%; unmarked cancer in 13-24 months, 29.7% versus 23.0% (p < 0.03 for both interactions between assistance and CAD marking for cancer by 1 year and cancer in 13-24 months). CAD marked 77% (70/91) of the visible cancers by 1 year and 67% (37/55) of the visible cancers in 13-24 months. CAD marked more visible calcified lesions (86%) than masses and asymmetric densities (67%) (p < 0.05). Overall specificity was 72% without and 75% with CAD (p < 0.02). CAD had a greater effect on both specificity (p < 0.02) and sensitivity (p < 0.03) among radiologists who interpret more than 50 mammograms per week. The results were the same for fatty breast tissue and dense breast tissue.

CONCLUSION. In this experiment, CAD increased interpretive specificity but did not affect sensitivity because visible noncalcified lesions that went unmarked by CAD were less likely to be assessed as abnormal by radiologists. Breast density did not affect CAD's performance.

Keywords: BI-RADS • breast cancer • computer-assisted detection • diagnosis • screening mammography


Introduction
 
Although screening mammography is widely used throughout the United States, there is concern about its accuracy in practice [1, 2]. Efforts to improve technical image quality since implementation of legislation to accredit United States mammography facilities in 1992 [1, 3] have proven successful, but evaluating and improving interpretive quality are greater challenges [4-7]. Reports show that radiologists interpret the same screening mammograms with only moderate levels of agreement and have a wide range of interpretive accuracy as measured by the area under the receiver operating characteristic curve [6, 8-10]. To maximize mammography's contribution to reducing breast cancer mortality, we need ways to improve mammography interpretation [11].

Methods for improving mammography interpretation are challenging because increasing radiologists' ability to find cancer when it is present (sensitivity) often comes at the cost of decreasing the ability to conclude that cancer is absent in someone who is disease-free (specificity). True improvements in mammography screening performance reflect either an increased ability to find lesions (detection) or better ability to distinguish benign from malignant lesions (discrimination) once detection occurs. When individuals simply interpret a higher proportion of lesions as being abnormal without improving detection or discrimination, sensitivity increases at a cost to specificity.
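The trade-off described above can be illustrated with a small sketch. The suspicion scores and cancer labels below are hypothetical toy data, not study data; the point is only that lowering the threshold for calling a case positive, with no change in the underlying scores, raises sensitivity while lowering specificity.

```python
def sensitivity_specificity(scores, has_cancer, threshold):
    """Call a case positive when its suspicion score is >= threshold.

    Returns (sensitivity, specificity) as fractions.
    """
    pairs = list(zip(scores, has_cancer))
    tp = sum(1 for s, c in pairs if c and s >= threshold)      # cancers called positive
    fn = sum(1 for s, c in pairs if c and s < threshold)       # cancers called negative
    tn = sum(1 for s, c in pairs if not c and s < threshold)   # noncancers called negative
    fp = sum(1 for s, c in pairs if not c and s >= threshold)  # noncancers called positive
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical suspicion scores (1-5) for 4 cancers followed by 6 noncancers.
scores = [5, 4, 3, 2, 4, 3, 3, 2, 1, 1]
has_cancer = [True] * 4 + [False] * 6

strict = sensitivity_specificity(scores, has_cancer, threshold=4)
lenient = sensitivity_specificity(scores, has_cancer, threshold=3)
print("strict threshold: ", strict)
print("lenient threshold:", lenient)
```

A true improvement in detection or discrimination would instead change the scores themselves, so that sensitivity could rise without this loss of specificity.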

One promising approach to improving mammography interpretation is computer-assisted detection (CAD) [12-14]. This approach uses computer software to identify and mark areas of concern on digitized versions of film-screen mammograms [15] or soft-copy images obtained with full-field digital mammography equipment [16]. The marked images can be displayed on a small monitor at the base of the reviewer board after the initial interpretive review of the original films or on a large monitor as soft-copy images for full-field digital image review [12, 14].

Early evaluations of CAD showed improvements in cancer detection [13, 17, 18], although a more recent study has raised questions about CAD's performance [19]. CAD is now a U.S. Food and Drug Administration-approved and reimbursed service [5, 15, 18], but several issues remain unresolved including the effect of CAD on specificity, whether CAD can be used to find cancers earlier than otherwise would occur, and CAD's performance in dense versus fatty breast tissue [20-22]. This study was undertaken to evaluate whether CAD improves interpretive performance in a carefully constructed set of cases that allow the evaluation of cancer detection within 1 versus 2 years after screening, characterization of cancers, and the evaluation of the differential effects of CAD in fatty versus dense breasts. We hypothesized that CAD would improve sensitivity at no cost to specificity and would enable more accurate assessment of fatty compared with dense breast tissue.


Materials and Methods
 
Setting
This study included 19 radiologists practicing at six accredited facilities in a single integrated health plan. The 19 radiologists included 13 (68%) with 10 or more years of experience interpreting mammograms, four (21%) with 5-9 years of experience, and two (11%) with 1-4 years of experience. Together, these radiologists interpret approximately 40,000 screening and diagnostic mammography examinations per year, although their individual reported experience varied: ≤ 50/wk (n = 10); 51-100/wk (n = 7); > 100/wk (n = 2). Radiologists consented to participate in this study and were given continuing medical education credit for their time. The study was reviewed and approved by the health plan's human subjects and study review committees.

Sample
We randomly sampled women with at least 2 years of health plan enrollment subsequent to screening mammograms interpreted in the health plan from 1996 through 1998. Each selected woman contributed at most one examination to the potential sample, with the most recent examination selected for inclusion when more than one was available. We excluded examinations from women with a diagnosis of breast cancer before 1996, those missing a breast density assessment or an interpretation consistent with the American College of Radiology lexicon in use at the time [23], and those with a history of breast augmentation.

We identified a set of 56,387 screening mammography examinations that met eligibility criteria, including 28,965 (51%) from women with fatty breasts (BI-RADS density 1 or 2, almost entirely fat and scattered fibroglandular densities) and 27,422 (49%) from women with dense breasts (BI-RADS density 3 or 4, heterogeneously dense and extremely dense) [23]. For these women, we linked data from screening mammograms to data from the Surveillance Epidemiology and End Results Reporting (SEER) Registry. We identified 527 women with invasive cancer or ductal carcinoma in situ diagnosed within 2 years of the selected screening mammogram: 226 women with fatty breasts and 301 women with dense breasts. Including all tumors diagnosed within 2 years of screening meant that the entire set of tumors detected at the time of the screening examination or in the subsequent interval was included as potential cases. We excluded women with lobular in situ lesions.

We used stratified sampling to select films, allowing us to create a difficult test set that focuses on three key issues: first, the effect of CAD on sensitivity and specificity; second, the differential effect of CAD by breast density; and, third, the ability of CAD to enable radiologists to find cancers earlier than using conventional mammography alone. Films were stratified by breast density (dense vs fatty), cancer status (noncancer, cancer by 1 year, or cancer in 13-24 months), and the original radiologist's assessment in practice (BI-RADS categories 0, 4, or 5 = positive; or BI-RADS categories 1, 2, or 3 = negative). We randomly selected films within each stratum. Power calculations were based on a chi-square test to detect the overall effect of assistance by simultaneously calculating the statistical significance of changes in true-positive and false-positive rates [24]. We planned to include at least 240 cancers (60 in each density-time-to-diagnosis stratum) and 120 noncancers (60 in each density stratum) to achieve 80% power to find a 2.5-point difference in either sensitivity or specificity.
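The stratified selection described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the field names (`density`, `cancer_status`, `original_read`), the `stratified_sample` helper, and the fixed per-stratum count are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(films, per_stratum, seed=0):
    """Randomly draw up to `per_stratum` films from each stratum.

    A stratum is the combination of breast density, cancer status, and the
    original clinical interpretation (hypothetical field names).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for film in films:
        key = (film["density"], film["cancer_status"], film["original_read"])
        strata[key].append(film)

    sample = []
    for key in sorted(strata):  # fixed order keeps the draw reproducible
        members = strata[key]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample
```

Sampling a fixed number of films per stratum, rather than in proportion to stratum size, is what over-represents the rare combinations (for example, cancers originally read as negative) and makes the resulting test set harder than routine practice.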

Table 1 shows the distribution of selected films across the sampling strata. We took a random sample among women in each quadrant of a table of cancer by 12 months (yes or no) after screening, cancer in 13-24 months (yes or no) after screening, and interpretation (negative or positive) for each level of density (fatty or dense). This test set was more difficult than a radiologist would encounter in practice. Table 2 shows the expected sensitivity and specificity values based on our sampling, and those numbers (e.g., 71.4% and 75.8% for cancers by 1 year in fatty breasts) are substantially lower than would occur in clinical practice, where sensitivity and specificity are reported to be 78% and 92%, respectively [25].



 
TABLE 1: Study Sample by Time of Cancer Occurrence, Breast Density, and Clinical Interpretation

 


 
TABLE 2: Average Sensitivity and Specificity of Reviewers With and Without Computer Assistance

 

Expert Review
For analysis, all films were reviewed for BI-RADS density by an expert radiologist on our team. Films from women with cancer were also reviewed for visibility of the cancer and whether a CAD mark corresponded to the area of the cancer. The radiologist had all prior films and those leading to the diagnosis during her evaluation of the study films. She was also told the location of the cancer as recorded in SEER. She used the 4-point density scale recommended in BI-RADS and recorded visibility as "easily visible," "visible (should have been found by most radiologists)," "subtle (could have been found and worked up on a good day)," "visible only in retrospect," and "not visible." A tracing of the mammogram was made on a clear overlay of the image and the cancer was circled. At a separate time, a printout of the CAD-marked image was compared with the overlay to establish if the CAD mark was within the area of the cancer. The visible cancers were morphologically classified by one investigator as one of the following: mass, architectural distortion, asymmetric density, asymmetric density and architectural distortion, calcification, or mixed type with calcification.

Training in CAD
We used the ImageChecker M2 1000 system (version 2.2, R2 Technology) for this test. A 2-hour course in CAD was required before the reviewers began interpreting images, and a shorter version was given before resuming the readings 6 months later. The course included practice cases reviewed by representatives of R2 Technology on the study viewing board and an explanation of CAD. Before each study was reviewed by a radiologist, 10 cases were reviewed to reacquaint the radiologist with the R2 equipment and the data-recording procedures. The results of these interpretations were not included in the analysis.

Interpretive Testing
Figure 1 shows the design of the reinterpretation study. Each radiologist rated all the films both with and without assistance in two independent 4-hour sessions at the baseline (first interpretation) and then again 6 months later (second interpretation). To control for the order of computer assistance, radiologists were divided into two groups (A and B) and films were divided into two sets (I and II), as noted in Figure 1. The two film sets included an approximately equal number of films from each stratum. The order of the sets was changed for the two radiology groups so that we could evaluate whether findings were affected by seeing the films with CAD assistance at the first or second interpretation. Within the test sets, the order was not altered between the first and second interpretations.


 
Fig. 1 Study design showing that order of films was altered for each set of radiologists. The crossover design included two rounds separated by at least 6 months. Similar numbers of original interpretations were included in each set to attain similar levels of difficulty. Each case included one prior film set if available and summary sheet with age of woman at time of mammogram and her family history of breast cancer.

 
At each session, study radiologists interpreted approximately 90 examinations without assistance and 100 examinations with assistance including the 10 practice examinations. The radiologists viewed the film on the reviewer board and then, after a delay, pressed a button to call up the digitized image on the ImageChecker monitor just as occurs in practice when interpreting images with assistance. Each case included craniocaudal and mediolateral oblique views of each breast. The reviewer board included comparator films from bilateral screening examinations within the last 3 years when they were available and a summary sheet with the woman's age and family history of breast cancer. If more than one set of comparators existed, we included the one that occurred 2-3 years before the index film. Radiologists were informed that the film set included more cancers than would be expected in a typical screening population, but they were not told the proportion of cancers in the set.

At each review session, radiologists recorded their assessments on data sheets that included separate ratings for each breast using BI-RADS coding (category 0, 1, 2, 3, 4, or 5). BI-RADS coding in use at the time associated a descriptive summary and recommendations with each of the following assessments: category 0 = incomplete, 1 = negative, 2 = benign, 3 = probably benign, 4 = suspicious for malignancy, and 5 = highly suggestive of malignancy. Radiologists also recorded their start and end time for each review session.

Analysis
Our analyses examine the effect of computer assistance on the average sensitivity and specificity of assessments made by a group of radiologists. We dichotomized the BI-RADS scale by considering BI-RADS categories 1, 2, and 3 assessments as "negative" and BI-RADS categories 0, 4, and 5 assessments as "positive." We combined the two breast assessments into a single assessment per woman using the assessment in the breast with cancer for women with unilateral cancer, the minimum assessment for women with bilateral cancer (i.e., a negative assessment if one was given), and the maximum assessment for women without cancer (i.e., the positive assessment if one was given). When calculating maximum and minimum assessment for a woman, category 0 cases were treated as a 3.5 in the ordered BI-RADS categories from 1 to 5.
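The dichotomization and woman-level combination rules above can be written out as a short sketch. The function and argument names are illustrative assumptions; the rules themselves (categories 0, 4, and 5 positive; category 0 ranked as 3.5; minimum for bilateral cancer; maximum for no cancer) follow the text.

```python
POSITIVE_CATEGORIES = {0, 4, 5}  # BI-RADS categories counted as a positive assessment


def rank(category):
    """Order BI-RADS categories, placing category 0 at 3.5 (between 3 and 4)."""
    return 3.5 if category == 0 else float(category)


def woman_level_positive(left, right, cancer_side):
    """Combine two breast-level BI-RADS categories into one positive/negative call.

    cancer_side is 'left', 'right', 'bilateral', or None (no cancer).
    """
    if cancer_side == "left":
        chosen = left                          # breast with cancer
    elif cancer_side == "right":
        chosen = right
    elif cancer_side == "bilateral":
        chosen = min(left, right, key=rank)    # least abnormal assessment
    else:
        chosen = max(left, right, key=rank)    # most abnormal assessment
    return chosen in POSITIVE_CATEGORIES


# A woman with cancer in the left breast, read as 1 (left) and 4 (right),
# counts as a negative assessment because the breast with cancer was read as negative.
print(woman_level_positive(1, 4, "left"))
```

Under these rules a positive call in the wrong breast does not count as detecting a unilateral cancer, and a bilateral cancer counts as detected only when both breasts are called positive.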

We describe the overall test performance using simple averages of sensitivity and specificity across radiologists. We tested for differences in sensitivity and specificity using generalized estimating equations that adjust for clustering due to repeated assessment of mammograms both between and within radiologists [26]. This approach focuses on correct estimation of SEs of regression coefficients for non-nested data, but it does not allow estimation of CIs for estimated sensitivity and specificity. We separately estimated the effects of covariates on sensitivity and specificity using logistic regression models, and we tested for both overall effects of assistance and for differential effects of assistance as a function of mammogram and radiologist characteristics. We tested for differential intervention effects (i.e., effect modification) by including interactions between assistance and covariates.

Covariates considered in the separate models included breast density (dense or fatty), reviewer experience (< 10 vs ≥ 10 years), interpretation volume (≤ 50 vs > 50 mammograms/wk), and order of presentation of test films (assistance first or second). For women with cancer, we also used separate models to test the effect on sensitivity of the following: lesion visibility, presence of calcifications, and presence of a mass. Because our sample included relatively few radiologists, covariance estimates adjust for the small cluster size [27]. Statistical tests are based on corresponding regression coefficients and reflect two-sided tests.

Tests for the association between marking of cancers by the CAD algorithm and lesion visibility, breast density, and timing of cancer diagnosis are based on the chi-square statistic or Fisher's exact test.


Results
 
Table 1 shows the distribution of study examinations among women with fatty and dense breasts using expert density assessment. We identified 354 cases but excluded 13 cases because of specific procedural problems (e.g., marks appeared on films, prior films fell from the reviewer board between review sessions, or the index examination was misidentified). The final film set included 341 examinations (227 of patients with cancer, and 114 of patients without cancer). Sixty-five percent of the patients had a prior mammogram that could be used as a comparator film at the time of interpretation. The cancer cases included four bilateral cancers.

Of the 21 radiologists who initially agreed to participate in this study, 19 interpreted films during both review sessions including four who had prior CAD experience. Most (n = 17, 89%) interpret mammograms less than 40% of their work time and about half (n = 10, 53%) interpret 50 or fewer mammograms per week. Their average review time for approximately 90 examinations was 1 hour 35 minutes for unassisted review and 1 hour 32 minutes for assisted review.


 
Fig. 2A Graphs show true-positive versus false-positive rates for each of the 19 radiologists with (red) and without (black) computer-assisted detection (CAD). Graphs show patients with cancer by 1 year (A) and patients with cancer in 13-24 months (B) after mammography screening.

 


 
Fig. 2B Graphs show true-positive versus false-positive rates for each of the 19 radiologists with (red) and without (black) computer-assisted detection (CAD). Graphs show patients with cancer by 1 year (A) and patients with cancer in 13-24 months (B) after mammography screening.

 
Table 2 shows the sensitivity and specificity of study radiologists with and without assistance. The table also shows that the sensitivity and specificity in the test setting were close to or higher than the proportion of positive interpretations among cancers and negative interpretations among the noncancers that occurs in clinical practice. The overall proportion of positive tests in clinical practice among the cancers by 1 year after screening was 57% and was 29% for cancers detected 13-24 months after screening.

Sensitivity
Sensitivity on the test set was higher than the positive proportion in practice. On the test set, sensitivity dropped slightly with computer assistance (cancer by 1 year: 63.2% vs 62.0%; cancer in 13-24 months: 33.5% vs 32.3%, respectively), but the differences with and without assistance were not statistically significant for either cancer group. As shown in Figures 2A and 2B, there was no apparent separation of the performance of assisted and unassisted interpretation when graphed as their true-positive versus false-positive rate for cancers by 1 year or cancer in 13-24 months. However, in the modeling of sensitivity, reading volume significantly modified the effect of assistance on sensitivity for detection of cancers within 1 year (p < 0.02, based on the interaction between assistance and reading volume).

Among radiologists who interpret 50 or fewer mammograms per week, CAD was associated with small improvements in sensitivity (cancer by 1 year: 1.7 points higher with assistance [63.3% without vs 65.0% with]; cancer in 13-24 months: 0.1 point higher with assistance [34.8% vs 34.9%]). Among radiologists who interpret more than 50 mammograms per week, CAD was associated with decreased sensitivity (cancer by 1 year: 4.5 points lower with assistance [63.2% vs 58.7%]; cancer in 13-24 months: 2.7 points lower [32.2% vs 29.5%]). We did not find evidence that the effect of assistance on sensitivity was modified by either breast density (p > 0.5, both cancer groups) or lesion visibility (p > 0.15, both cancer groups). Finally, the effect of computer assistance was not altered by whether CAD was used with the test set during the first or second round of review (p > 0.15, both cancer groups).

Specificity
Specificity was higher with assistance than without assistance (75% vs 72%, respectively; p < 0.02). Interpretation volume significantly modified the effect of assistance on specificity (p < 0.03, based on an interaction between assistance and volume). Assistance was associated with a larger increase in specificity among radiologists who interpret more than 50 mammograms per week (6.6 points higher with assistance [78.4% with vs 71.8% without]) relative to those who interpret 50 or fewer mammograms per week (0.8 points higher with assistance [72.4% vs 71.6%]). We did not find evidence that the effect of assistance on specificity was modified either by breast density (p > 0.95) or by whether the mammogram was first seen with or without CAD (i.e., whether CAD was used at the first or second interpretation) (p > 0.75).
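As a rough consistency check, the volume-subgroup figures can be recombined into the overall figures. The sketch below assumes that each subgroup value is a simple average across its radiologists and that the groups contain 10 radiologists (50 or fewer mammograms per week) and 9 radiologists (more than 50 per week), as reported in the Setting section; `pooled_mean` is a hypothetical helper.

```python
def pooled_mean(groups):
    """Radiologist-weighted mean of subgroup averages.

    groups: list of (number_of_radiologists, subgroup_mean_percent).
    """
    total = sum(n for n, _ in groups)
    return sum(n * mean for n, mean in groups) / total


# (n radiologists, percent) for <= 50/wk and > 50/wk readers
sens_1yr_without = pooled_mean([(10, 63.3), (9, 63.2)])
sens_1yr_with = pooled_mean([(10, 65.0), (9, 58.7)])
sens_2yr_without = pooled_mean([(10, 34.8), (9, 32.2)])
sens_2yr_with = pooled_mean([(10, 34.9), (9, 29.5)])
spec_without = pooled_mean([(10, 71.6), (9, 71.8)])
spec_with = pooled_mean([(10, 72.4), (9, 78.4)])

for label, value in [
    ("sensitivity, cancer by 1 year, without CAD", sens_1yr_without),
    ("sensitivity, cancer by 1 year, with CAD", sens_1yr_with),
    ("sensitivity, cancer in 13-24 months, without CAD", sens_2yr_without),
    ("sensitivity, cancer in 13-24 months, with CAD", sens_2yr_with),
    ("specificity without CAD", spec_without),
    ("specificity with CAD", spec_with),
]:
    print(f"{label}: {value:.2f}%")
```

Under these assumptions the recombined values fall within about 0.1 point of the reported overall sensitivities (63.2%, 62.0%, 33.5%, and 32.3%) and round to the reported overall specificities (72% and 75%).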

Table 3 shows CAD marking by lesion visibility among the visible lesions. Sixty-seven percent (n = 152/227) of the cancers were visible but only 146 are considered in Tables 3 and 4 because six were missing information on markings (one cancer by 12 months and five cancers in 13-24 months). The proportion of visible masses differed by density and time to the cancer diagnosis. As expected, visibility differed significantly for masses in fatty versus dense breasts (p < 0.05). Among the cancers within fatty breasts (n = 124), 23% (n = 28) were easily visible, 18% (n = 23) were visible, 23% (n = 28) were subtle, 9% (n = 11) were visible in retrospect, and 27% (n = 34) were not visible. Among the cancers within dense breasts (n = 103), 7% (n = 7) were easily visible, 26% (n = 27) were visible, 22% (n = 23) were subtle, 5% (n = 5) were visible in retrospect, and 40% (n = 41) were not visible.



 
TABLE 3: Proportion of Visible Lesions Within Each Density Level and Cancer Group That Were Marked by Computer-Assisted Detection (CAD)

 


 
TABLE 4: Characteristics and Marking by Computer-Assisted Detection (CAD) of Visible Lesions

 

Among the cancers found within 12 months (n = 114), 25% (n = 28) were easily visible, 31% (n = 35) were visible, 22% (n = 25) were subtle, 4% (n = 4) were visible in retrospect, and 19% (n = 22) were not visible. Among the cancers found 13-24 months (n = 113), 6% (n = 7) were easily visible, 13% (n = 15) were visible, 23% (n = 26) were subtle, 11% (n = 12) were visible in retrospect, and 47% (n = 53) were not visible.

Among the visible cancers with both lesion type and marking information (n = 145), 50% (n = 72) were masses without calcifications, 16% (n = 23) were calcifications only, 14% (n = 21) were mixed types with calcifications, 13% (n = 19) were asymmetric densities, 4% (n = 6) were architectural distortions, and 3% (n = 4) included both architectural distortions and asymmetric densities. The analysis of visibility and CAD marking excluded films missing CAD marking information. Among known visible cancers, the likelihood that a cancer was marked by CAD increased with the degree of visibility reported by the expert radiologist (p < 0.001) and did not differ between fatty and dense breasts (p > 0.50).

Across all cases, radiologists changed their interpretation from negative when unassisted to positive when assisted for an average of 8.4% of the cases and they changed from positive when unassisted to negative when assisted in an average of 10.5% of the cases. On average, 81% of cases were interpreted the same whether assisted or unassisted. Among the visible cancers radiologists changed their interpretation from negative when unassisted to positive when assisted in an average of 9.2% of the cases and they changed from positive when unassisted to negative when assisted in an average of 10.2% of the cases.

Table 4 shows CAD marking by cancer characteristics. Overall, 73% (107/146) of the visible cancers with marking information were marked by CAD. The CAD algorithm marked 77% (70/91) of the visible cancers diagnosed within 1 year and 67% (37/55) of the visible cancers found in 13-24 months. Computer marking of the lesions differed by lesion characteristics. CAD markings were more likely among cancers associated with calcifications (86%, 38/44) than among cancers without calcifications (i.e., mass, asymmetric density, or architectural distortion) (67%, 68/101) (p < 0.03).
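The comparison of marking rates for calcified versus noncalcified cancers can be reproduced approximately from the counts above. The text states only that such tests used the chi-square statistic or Fisher's exact test, so the uncorrected Pearson chi-square used here is an assumption, and `chi_square_2x2` is a hypothetical helper.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, two-sided p value) with 1 degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p


# Rows: calcified, noncalcified. Columns: marked by CAD, not marked.
stat, p = chi_square_2x2(38, 44 - 38, 68, 101 - 68)
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
```

With these counts the p value falls below the reported threshold of 0.03; a continuity-corrected or exact test would give a somewhat different value.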

For cancers diagnosed by 1 year and those diagnosed in 13-24 months, the impact of CAD on sensitivity differed significantly (p < 0.03) depending on whether the technology marked the lesions. Among the visible cancers, CAD assistance was associated with increases in sensitivity when cancers were marked (from 82.7% to 83.1% for marked cancer by 1 year; from 55.2% to 57.9% for marked cancer in 13-24 months) and with decreases in sensitivity when cancers were not marked (from 37.4% to 30.1% for unmarked cancer by 1 year; from 29.7% to 23.0% for unmarked cancer in 13-24 months).


Discussion
 
Our study raises a question about whether CAD is working as expected and demonstrates a critical approach to the evaluation of the technology. To our knowledge, this study is the first test using a stratified random sample of cancer cases from a screened population rather than a selected case series of visible cancers. Although CAD technology provides considerable hope for radiologists faced with the daunting task of finding three to five cancers in every thousand screening mammograms they interpret, that hope must be met by performance in practice [14]. We evaluated CAD's additional impact among cancers in dense versus fatty breasts and cancers diagnosed within 1 year or in 13-24 months. In both fatty and dense breasts and for cancer diagnosed by 1 year versus cancer diagnosed in 13-24 months, CAD did not improve overall sensitivity. On the contrary, lack of CAD marks may result in missed cancers. We also found that, contrary to expectations [14], CAD improved specificity.

Although our results are surprising, they are consistent with recent work that showed no significant effect of CAD in an academic setting [18], but they differ from earlier reports showing improved detection [2, 10, 12, 13]. Our study offers an opportunity to better understand CAD, but comparing it with other studies is difficult because several factors influence the results of CAD evaluations: first, how improvement was measured; second, the mix of masses, asymmetric lesions, and calcified lesions involved; third, the mix of breast densities; fourth, the CAD algorithm; and, fifth, the interpretive abilities of the radiologists. How each study addresses each of these factors influences its comparability to the others.

Other work has looked at whether CAD marks lesions [4, 11, 13, 17, 28]. Our study tested the effect of CAD on radiologists' recommendations and hence on sensitivity and specificity. Our findings suggest that the radiologists did not act on all CAD marks. With CAD assistance, specificity improved or was unchanged. However, not all the visible tumors were marked by CAD and it appears that radiologists relied on the presence of CAD markings when assistance was present. One previous study showed a reduction in sensitivity with CAD assistance among cancers diagnosed by 1 year [29]. In our study, sensitivity decreases were restricted to radiologists who interpret more than 50 mammograms per week.

The early reports that describe the marking of visible lesions [13, 17] found a rate of CAD marking similar to the one we report: CAD did not mark 23% of the lesions that were visible in retrospect [13], compared with 27% of the visible cancers in our set. Training for our study included instructions to the radiologists to recommend evaluation even if there were no CAD marks, but, as in other studies, we found that radiologists were less likely to recommend evaluation of visible lesions that were not marked by CAD [8, 20, 22]. The reaction of radiologists to markings is critical to the success of this technology. Careful attention to this aspect of CAD instruction is essential in future training and research, and the tendency to defer to technology needs acknowledgment as a risk of its use.

The case mix, including the distribution of breast densities and tumor characteristics across films, also influences a CAD evaluation. Breast density can affect the performance of CAD because of its influence on lesion visibility [29]. Therefore, studies that have a higher proportion of dense breasts may show a lower impact of CAD on overall cancer detection because fewer lesions are visible. However, we found that marking of visible cancers occurs with similar frequency in fatty and dense breast tissue. Therefore, in studies restricted to the effect of CAD on detection of visible lesions, density is unlikely to have an effect.

In keeping with previous reports, we found that calcifications are more likely to be marked by CAD than masses, asymmetric densities, or architectural distortions [12, 17, 30]. In practice, masses account for most nonpalpable breast lesions (59%) and most missed lesions (70%) [17, 31]. Within the film set examined by Warren Burhenne et al. [13] and Birdwell et al. [17], 47% of cancers were masses and 51% were calcified lesions. Because their set included a higher proportion of calcifications than expected in clinical practice, their study results may overestimate the potential impact of CAD [4, 13, 17].
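
The case-mix argument above can be illustrated with a weighted average: if CAD marks calcifications more often than masses, a film set enriched with calcifications yields a higher overall marking rate. The sketch below is a minimal Python illustration under that assumption. The function name, the lesion categories, and all rates and proportions are hypothetical and are not taken from this study or the cited reports.

```python
def overall_marking_rate(case_mix, marking_rate):
    """Weighted average of per-lesion-type CAD marking rates.

    case_mix: lesion type -> proportion of cancers of that type (sums to 1).
    marking_rate: lesion type -> proportion of that type marked by CAD.
    """
    total = sum(case_mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("case-mix proportions must sum to 1")
    return sum(share * marking_rate[kind] for kind, share in case_mix.items())


# Hypothetical marking rates and case mixes, for illustration only.
rates = {"calcification": 0.9, "mass": 0.6}
enriched_set = {"calcification": 0.5, "mass": 0.5}
practice_like_set = {"calcification": 0.3, "mass": 0.7}

print(overall_marking_rate(enriched_set, rates))
print(overall_marking_rate(practice_like_set, rates))
```

With these made-up inputs the enriched set produces the higher overall rate, which is the direction of bias described in the text.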

Another consideration in our test set is the presence of bilateral cancers. The scoring system we used combines the breast-level interpretations into a single interpretation for each patient. For women without cancer, we took the most abnormal interpretation. For women with cancer, we took the most abnormal interpretation from the breast with cancer. In a woman with bilateral cancer, we took the least abnormal interpretation. The four cases of bilateral cancer included in our set resulted in 113 assessments (57 unassisted and 56 assisted). Of these, 67% were in agreement for both breasts (i.e., both positive or both negative). Only 37 (33%) of the bilateral examinations were given a negative assessment because the radiologist did not identify both cancers. All bilateral cancers occurred within 12 months of screening, and these 37 discordant assessments represent only 0.9% of the 4,234 assessments of films with cancers diagnosed within 12 months. Because these interpretations were so rare (0.9% of the total number of interpretations), it is unlikely that the "minimum score" we applied to bilateral cancers affected our results.
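
The scoring rule described above can be sketched in a few lines of Python. This is an illustration, not the code used in the study: the function name, the side labels, and the use of a single numeric score per breast (larger meaning more abnormal) are assumptions made for the example. The last lines simply recheck the percentages quoted in the text.

```python
def patient_level_score(left, right, cancer_sides):
    """Combine two breast-level scores into one patient-level score.

    left, right: breast-level interpretations on an ordinal scale where a
        larger number means more abnormal (the numeric scale is assumed).
    cancer_sides: set of sides with cancer: set(), {"L"}, {"R"}, or {"L", "R"}.
    """
    scores = {"L": left, "R": right}
    if not cancer_sides:
        # No cancer: take the most abnormal interpretation.
        return max(left, right)
    if len(cancer_sides) == 1:
        # Unilateral cancer: take the interpretation of the breast with cancer.
        (side,) = cancer_sides
        return scores[side]
    # Bilateral cancer: take the least abnormal interpretation ("minimum score").
    return min(left, right)


# Recheck the proportions reported in the text.
discordant, bilateral_total, cancer_assessments = 37, 57 + 56, 4234
print(round(100 * discordant / bilateral_total))        # 33
print(round(100 * discordant / cancer_assessments, 1))  # 0.9
```

Taking the minimum for bilateral cases is the conservative choice: the examination counts as positive only if both cancers were identified.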

CAD algorithms continue to evolve, and therefore the particular version of a vendor's product that is tested will influence conclusions about CAD performance [13, 22, 32]. Product differences between vendors further complicate the evaluation of this technology. Because we used version 2.2 of ImageChecker by R2 Technology, we expected some improvement in performance relative to the findings of Warren Burhenne et al. [13], who used an earlier version of ImageChecker (version 1.2). Although a newer version of ImageChecker is now on the market, our study demonstrates an approach to assessment that needs replication to test the new algorithm's performance with respect to visible noncalcified masses and asymmetric densities.

Finally, radiologist experience as reflected in training and interpreting volume may also affect CAD evaluation. Whether the study involves breast radiologists or general radiologists may affect performance measures [33]. Furthermore, the volume of images that radiologists interpret is expected to affect their performance, although the evidence for this is thin to date [14, 34, 35]. High-volume breast specialists may not have as much potential for improvement with the use of CAD as low-volume general radiologists. The CAD studies reported to date include primarily breast specialists [4, 10, 12, 13, 17]. One report including a breast specialist is from community practice [28]. Our study included mostly general radiologists and approximately half interpret 50 or fewer mammograms per week.

A potential limitation of our study is the test setting; the radiologists knew they were being evaluated and that the cancer rate in the test set was higher than that in practice. This knowledge might have made them pay closer attention to the films, and there is some evidence that the radiologists in this study elevated their performance. Sensitivity and specificity on the test set were the same or higher than expected from sampling. This elevation of practice would make it harder to show an advantage with computer assistance, but it should not affect the differential findings with respect to marked and unmarked lesions. In addition, it is reassuring that a recent study showed that the prevalence of lung abnormalities did not affect estimates of interpretive performance on chest radiographs [36]. The authors of that lung study suggested that the use of a case set enriched with cancers would not bias estimates of CAD performance [36]. Finally, a recent evaluation of CAD in a practice setting showed no change in specificity and no significant improvement in detection among 24 radiologists interpreting 59,139 examinations compared with the same radiologists interpreting 56,432 examinations without CAD [18].

Another limitation of the study is that we used one radiologist as the standard for assessment of breast density and visibility of cancers. Although visualization of a gradient of markings by CAD across the radiologist's assessment of degree of visibility provides some internal validation, our findings with respect to the effect of CAD on the diagnosis of visible cancers might differ slightly if another radiologist had decided which cancers were visible.

Finally, we recognize there is no gold standard for the lesion characteristics and density designations identified in this study. The proportion of "visible" cancers and lesions with markings across lesion characteristics might change using other "experts" as the standard; however, the findings that CAD had no effect on sensitivity and specificity would not be altered. In studies in which individuals or panels of experts select visible lesions, the ambiguities of "visibility" and the unknown pool from which they are drawn (all cancers found at diagnosis, all cancers reported within a set period of time, or all cancers that occur among all screened women) make it difficult to generalize to the full range of cancers that occur in a population of women seen in a center. That generalization is what is critical to understanding whether CAD will improve practice and how well it is working.

Despite the challenges of evaluating CAD, progress is being made. The fears of making specificity worse are not being borne out. Neither Warren Burhenne et al. [13], Freer and Ulissey [12], nor Gur et al. [18] show clinically significant increases in recall among radiologists using CAD in mammographic interpretations. Our study suggests that specificity may actually improve, especially among experienced radiologists, but some of this improvement may come at a cost to sensitivity. The challenge is that to improve performance, radiologists must use judgment regarding whether to act on their findings or on the computer algorithm's markings. Our study also suggests that when CAD does not mark a lesion, reviewers may ignore their own findings, thereby decreasing their sensitivity. This finding raises the question of whether there is a potential for CAD to do harm.

To mitigate this potential harm, training should emphasize lesion characteristics of masses, asymmetric densities, and architectural distortions that may be missed by CAD while efforts continue to improve algorithms used to identify these lesions. To improve the algorithms, new versions should first be tested on a common test set with a known proportion of cases in high- versus low-density breast tissue and a known proportion of masses, calcifications, and mixed lesions, allowing comparison with existing algorithms before testing in a clinical setting. Then in clinical tests, studies should report the lesion characteristics and the proportion of lesions marked by CAD within the lesion subgroups. Ideally, studies would be based on film sets that are representative of the population to which CAD would be applied (e.g., screening mammograms or diagnostic mammograms). If sampling is done, the distribution of films across true-positive and true-negative interpretations among cancers and noncancers should also be included.
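
The subgroup reporting recommended above amounts to a simple tabulation. The following Python sketch shows one possible way to compute the proportion of lesions marked by CAD within each lesion subgroup. The function name, the record layout, and the example records are hypothetical and are not study data.

```python
from collections import Counter


def marked_proportion_by_subgroup(lesions):
    """Return lesion type -> proportion of lesions of that type marked by CAD.

    lesions: iterable of (lesion_type, marked_by_cad) pairs, where
        marked_by_cad is True if CAD placed a mark on the lesion.
    """
    totals, marked = Counter(), Counter()
    for kind, is_marked in lesions:
        totals[kind] += 1
        if is_marked:
            marked[kind] += 1
    return {kind: marked[kind] / totals[kind] for kind in totals}


# Hypothetical records for illustration only.
records = [
    ("mass", True),
    ("mass", False),
    ("calcification", True),
    ("calcification", True),
    ("architectural distortion", False),
]
for kind, share in marked_proportion_by_subgroup(records).items():
    print(f"{kind}: {share:.0%} marked")
```

Reporting these proportions alongside the case mix lets readers judge how far an overall result depends on the lesion types in a particular film set.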

Visible lesions that are unmarked by CAD offer a potential for harm if radiologists do not act on them because of the absence of CAD markings. These visible unmarked lesions also offer the best chance to improve mammography performance if the CAD evaluation of masses, asymmetric densities, and architectural distortions is improved through further research [32].


Acknowledgments
 
We wish to acknowledge the careful work of Deb Seger, Alice Park, and Letitia Hodgkinson, who handled the many details of implementing this study within an active health care environment, and Tammy Dodd for shepherding the manuscript to completion. We also greatly appreciate the assistance of R2 Technology, which provided equipment and technical assistance for the project.


References

  1. U.S. Food and Drug Administration. Quality Mammography Standards: Final Rule. 21 CFR § 16 and 900 [docket no. 95N-0192], RIN 0910-AA24 ed. Washington, DC: Department of Health and Human Services, 1997
  2. Fintor L, Brown M, Fischer R, et al. The impact of mammography quality improvement legislation in Michigan: implication for the National Quality Standards Act. Am J Public Health 1998; 88:667-671
  3. McLelland R, Hendrick RE, Zinninger MD, Wilcox PA. The American College of Radiology Mammography Accreditation Program. AJR 1991; 157:473-479
  4. Pisano ED, Schell M, Rollins J, et al. Has the Mammography Quality Standards Act affected the mammography quality in North Carolina? AJR 2000; 174:1089-1091
  5. Hendrick RE, Chrvala CA, Plott CM, Cutter GR, Jessop NW, Wilcox-Buchalla P. Improvement in mammography quality control: 1987-1995. Radiology 1998; 207:663-668
  6. Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists: findings from a national sample. Arch Intern Med 1996; 156:209-213
  7. Elmore JG, Miglioretti DL, Reisch LM, et al. Screening mammograms by community radiologists: variability in false-positive rates. J Natl Cancer Inst 2002; 94:1373-1380
  8. Beam CA, Guse CE, Sullivan DC. A sequential chart for the audit-based evaluation of screening mammogram interpretation. Acad Radiol 1999; 6:216-223
  9. Elmore JG, Wells CK, Lee CH, Howard DH, Feinstein AR. Variability in radiologists' interpretations of mammograms. N Engl J Med 1994; 331:1493-1499
  10. Kerlikowske K, Grady D, Barclay J, et al. Variability and accuracy in mammographic interpretation using the American College of Radiology Breast Imaging Reporting and Data System. J Natl Cancer Inst 1998; 90:1801-1809
  11. Taplin SH, Ichikawa L, Yood MU, et al. Reason for late-stage breast cancer: absence of screening or detection, or breakdown in follow-up? J Natl Cancer Inst 2004; 96:1518-1527
  12. Freer TW, Ulissey MJ. Screening mammography with computer-aided detection: prospective study of 12,860 patients in a community breast center. Radiology 2001; 220:781-786
  13. Warren Burhenne LJ, Wood SA, D'Orsi CJ, et al. Potential contribution of computer-aided detection to the sensitivity of screening mammography. Radiology 2000; 215:554-562 [Erratum in Radiology 2000; 216:306]
  14. Astley SM, Gilbert FJ. Computer-aided detection in mammography. Clin Radiol 2004; 59:390-399
  15. Roque AC, Andre TC. Mammography and computerized decision systems: a review. Ann N Y Acad Sci 2002; 980:83-94
  16. Nawano S, Murakami K, Moriyama N, Kobatake H, Takeo H, Shimura K. Computer-aided diagnosis in full digital mammography. Invest Radiol 1999; 34:310-316
  17. Birdwell RL, Ikeda DM, O'Shaughnessy KF, Sickles EA. Mammographic characteristics of 115 missed cancers later detected with screening mammography and the potential utility of computer-aided detection. Radiology 2001; 219:192-202
  18. Gur D, Sumkin JH, Rockette HE, et al. Changes in breast cancer detection and mammography recall rates after the introduction of a computer-aided detection system. J Natl Cancer Inst 2004; 96:185-190
  19. Elmore JG, Carney PA. Computer-aided detection of breast cancer: has promise outstripped performance? J Natl Cancer Inst 2004; 96:162-163
  20. Ciatto S, Rosselli Del Turco M, Burke P, Visioli C, Paci E, Zappa M. Comparison of standard and double reading and computer-aided detection (CAD) of interval cancers at prior negative screening mammograms: blind review. Br J Cancer 2003; 89:1645-1649
  21. D'Orsi CJ. Computer-aided detection: there is no free lunch. Radiology 2001; 221:585-586
  22. Zheng B, Ganott MA, Britton CA, et al. Soft-copy mammographic readings with different computer-assisted detection cuing environments: preliminary findings. Radiology 2001; 221:633-640
  23. American College of Radiology (ACR). Breast imaging reporting and data system (BI-RADS), 3rd ed. Reston, VA: ACR, 1998
  24. Pepe MS, Urban N, Rutter C, Longton G. Design of a study to improve accuracy in reading mammograms. J Clin Epidemiol 1997; 50:1327-1338
  25. Yankaskas BC, Taplin SH, Ichikawa L, et al. Association between mammography timing and measures of screening performance in the U.S. Radiology 2005; 234:363-373
  26. Miglioretti DL, Heagerty PJ. Marginal modeling of multilevel binary data with time-varying covariates. Biostatistics 2004; 5:381-398
  27. Mancl LA, DeRouen TA. A covariance estimator for GEE with improved small-sample properties. Biometrics 2001; 57:126-134
  28. Destounis SV, DiNitto P, Logan-Young W, Bonaccio E, Zuley ML, Willison KM. Can computer-aided detection with double reading of screening mammograms help decrease the false-negative rate? Initial experience. Radiology 2004; 232:578-584
  29. Ho WT, Lam PW. Clinical performance of computer-assisted detection (CAD) system in detecting carcinoma in breasts of different densities. Clin Radiol 2003; 58:133-136
  30. Baker JA, Rosen EL, Lo JY, Gimenez EI, Walsh R, Soo MS. Computer-aided detection (CAD) in screening mammography: sensitivity of commercial CAD systems for detecting architectural distortion. AJR 2003; 181:1083-1088
  31. Liberman L, Abramson AF, Squires FB, Glassman JR, Morris EA, Dershaw DD. The Breast Imaging Reporting and Data System: positive predictive value of mammographic features and final assessment categories. AJR 1998; 171:35-40
  32. Qian W, Sun X, Song D, Clark RA. Digital mammography: wavelet transform and Kalman-filtering neural network in mass segmentation and detection. Acad Radiol 2001; 8:1074-1082
  33. Sickles EA, Wolverton DE, Dee KE. Performance parameters for screening and diagnostic mammography: specialist and general radiologists. Radiology 2002; 224:861-869
  34. Beam CA, Conant EF, Sickles EA. Association of volume and volume-independent factors with accuracy in screening mammogram interpretation. J Natl Cancer Inst 2003; 95:282-290
  35. Barlow WE, Chi C, Carney PA, et al. Accuracy of screening mammography interpretation by characteristics of radiologists. J Natl Cancer Inst 2004; 96:1840-1850
  36. Gur D, Rockette HE, Armfield DR, et al. Prevalence effect in a laboratory environment. Radiology 2003; 228:10-14
