AJR 2000; 174:1257-1262
© American Roentgen Ray Society


Accuracy of Screening Mammography Using Single Versus Independent Double Interpretation

S. H. Taplin1,2,3, C. M. Rutter1, J. G. Elmore4, D. Seger1, D. White5 and R. James Brenner6

1 Center for Health Studies, Group Health Cooperative of Puget Sound, P. O. Box 34021, Seattle, WA 98124-1448.
2 Department of Family Medicine, University of Washington, Box 356390, Seattle, WA 98195.
3 Group Health Cooperative of Puget Sound, 1730 Minor Ave., Ste. 160, Seattle WA 98101.
4 Department of Medicine, University of Washington, Box 356420, Seattle, WA 98195.
5 Department of Radiology, Group Health Cooperative of Puget Sound, 209 Martin Luther King Jr. Way, Tacoma, WA 98405.
6 Joyce Eisenberg Keefer Breast Center, Saint John's Hospital, 1328 - 22nd St., Santa Monica, CA 90404.

Received June 14, 1999; accepted after revision October 11, 1999.

 
Supported by grant CA6371 from the National Cancer Institute and a Generalist Faculty grant for J. G. Elmore from the Robert Wood Johnson Foundation. However, all opinions and findings are the sole responsibility of the authors.

Address correspondence to S. H. Taplin.


Abstract
OBJECTIVE. We conducted an analysis among 31 community radiologists to identify the average change in screening mammography interpretive accuracy afforded by independent double interpretation.

MATERIALS AND METHODS. We assessed interpretive accuracy using a stratified random sample of test mammograms that included 30 women with cancer and 83 without. Radiologists were unaware of clinical information and of each other's assessments. We describe accuracy for individual radiologists and for double interpretation, including average sensitivity, specificity, diagnostic likelihood ratios positive and negative, and area under the receiver operating characteristic (ROC) curve. We also assessed weighted and nonweighted kappa statistics among all 465 pairs of radiologists and among 31,465 pairs of radiologist pairs in which no radiologist appeared in both pairs. The assessment for double interpretations used the "highest" (i.e., most abnormal) assessment of the two radiologists. We calculated the difference between each radiologist's individual accuracy and the average accuracy across that radiologist's 30 double interpretations.

RESULTS. We found the following average accuracy statistics for individual radiologists: sensitivity, 79%; specificity, 81%; diagnostic likelihood ratio positive, 5.53; diagnostic likelihood ratio negative, 0.26; and area under the ROC curve, 0.85. The mean kappa statistic among radiologists for cancer cases increased with double interpretation from 0.59 to 0.70, and for noncancer cases from 0.30 to 0.34. Double interpretation resulted in an average increase in sensitivity of 7%, an average decrease in specificity of 11%, a decrease in diagnostic likelihood ratio positive of 2.35, a decrease in diagnostic likelihood ratio negative of 0.06, and an increase in area under the ROC curve of 0.02.

CONCLUSION. Independent double interpretation does not increase accuracy as measured by the area under the ROC curve.


Introduction
Achieving mortality reductions through screening mammography depends on both technical image quality (film characteristics, equipment, and processing) and skillful interpretation [1, 2]. The United States Congress passed the Mammography Quality Standards Act to ensure minimum quality standards among facilities providing screening to Medicare-eligible women [3]. Efforts to improve technical image quality have led to improvements, but evaluating and improving interpretive quality have been a greater challenge [2, 4]. Several reports have indicated that screening mammography interpretations show only moderate levels of agreement on analysis of the same images, and radiologists have a broad distribution of skills as measured by the area under the receiver operating characteristic (ROC) curve [5,6,7]. To improve screening mammography, several studies advocate that two radiologists interpret every case, but the average effect of this approach is not clear [8,9,10,11].

Double interpretation can occur in a variety of ways, but usually means that the assessments of two radiologists are combined or adjudicated for a single set of images [12]. Objective independent double interpretation occurs when the radiologists are unaware of each other's interpretations and a rule is used to resolve differences that occur [8, 12]. Consensus interpretation occurs when the two radiologists discuss their differences and come to a mutually acceptable interpretation [9]. Beam et al. [12] noted that consensus interpretation introduced influences that are hard to measure. These researchers also found that the interpretive ability of each radiologist affects the accuracy of the pair, so that some pairs may be better than others. In large practices selective pairing may not be possible. In small practices the number of choices may be restricted. Studies to date have included volunteer pairs from community facilities or special programs so that the average impact of double interpretation is unclear. Obuchowski and Zepp [13] pointed out that estimation of the average radiologist's accuracy from multiradiologist studies more accurately reflects expectations for an imaging technology. In this analysis, we take advantage of data collected in an objective independent fashion on a set of test mammograms read by 31 radiologists. We use these data to compare the interpretive accuracy of individuals and of all possible unique pairs.


Materials and Methods
Study Overview
We assessed objective independent double interpretation using randomly sampled images from a known population of patients with and without cancer. We compared individual and double interpretation in three ways: by examining the distribution across five interpretive categories; by assessing agreement using kappa statistics; and by measuring the sensitivity, specificity, diagnostic likelihood ratio statistics, and area under ROC curve. Kappa statistics measure case-by-case agreement between radiologists, but two radiologists can agree when both are wrong [14, 15]. To understand the reasons for agreement and disagreement that influence double interpretation, we also evaluated all possible pairs of assessments by individual radiologists among cancer and noncancer patients.

Setting and Sample Selection
This study took place at the Group Health Cooperative of Puget Sound, a predominantly staff model consumer-controlled health maintenance organization with 398,000 enrollees who are racially similar to the surrounding community [16]. Screening mammography occurs through an organized breast cancer screening program using five centers and an automated database that can be linked to a regional cancer Surveillance Epidemiology and End Results registry [17, 18]. All mammography facilities were accredited by the American College of Radiology (ACR) in 1990 and use dedicated equipment, two views per breast, and the screen-film technique with a grid.

We used stratified random sampling to select 120 bilateral mammography examinations. The selected examinations are referred to as the "index" and include mediolateral oblique and craniocaudal views for each breast of women seen for screening through Group Health's screening program described by Taplin et al. [18]. All examinations occurred among asymptomatic women 40 years old or older. Except as noted, these women were screened between January 1, 1990 and December 31, 1991, and remained continuously enrolled for at least 2 years subsequent to the screening dates (n = 21,567). Among these women, we used the Surveillance Epidemiology and End Results linkage to identify all ductal carcinoma in situ and invasive breast cancer diagnosed within the year after the index examination.

For the 21,567 screening examinations, we categorized the original radiologists' interpretations in a manner consistent with published recommendations regarding mammography audits [19, 20]. The interpretations were considered positive if the radiologist recommended short-interval follow-up, additional workup, surgical evaluation, or biopsy; other interpretations were considered negative.

We randomly selected a sample of 120 examinations from four strata (true-positive, false-positive, true-negative, and false-negative). Among the 21,567 women, the sensitivity and specificity of the original interpretations were 82% and 84%, respectively. One original purpose of the study was to distinguish among radiologists, so we created a test set that would be more difficult than the usual practice. We included a higher proportion of false-negatives and false-positives than occurred among the original 21,567 examinations. This sampling created a screening test set that included 33 women with cancer and 87 without (24 true-positive and nine false-negative, 16 false-positive and 71 true-negative). To identify enough false-negative mammograms for the sample, we randomly selected nine from 51 negative screening cases that were imaged from 1985 through 1992.

We reviewed the medical record of each woman selected to verify that the radiologist's recommendation for the index examination was accurately captured in the automated record. We also identified the appropriate image and reviewed it for the presence of pencil or other markings. All index examinations with missing mammograms, misclassified interpretations, or marked mammograms that could not be cleaned were replaced by random selection of another examination with the same interpretive classification and cancer status.

Study Protocol
We conducted this study among 31 volunteer radiologists from three group health centers after obtaining approval from the appropriate research and human subjects review boards. Before viewing mammograms, radiologists were given a brief survey to assess their experience interpreting mammograms.

We rotated the original films through the three radiology centers in groups of approximately 30 examinations each. During a total of four screening sessions, each radiologist independently reviewed each group of examinations on a multipanel viewer in a private room. The radiologist was allowed up to 1 hr to review the two images of each breast. Screening views from the immediate prior screen were available and were shown for 62% of the cases. All images were masked to remove patient identifiers. The radiologists were not given any clinical information and were unaware of the prevalence of cancer in the test sets. All participating radiologists were assigned anonymous study identifiers, and the investigators remained unaware of individual results.

Study radiologists provided separate assessments for each breast for each woman included in the study sample. The interpretive options were arrayed on a five-point ordinal scale modified from the ACR lexicon and shown in Table 1 [21]. Our rating scale linked disease ratings with follow-up recommendations but combined ACR categories 1 (negative) and 2 (benign) because they had the same clinical implication (normal interval follow-up).


TABLE 1 Average Percentages of Use of Interpretive Categories for Single (n = 31) and Double (n = 465) Interpretations Among Patients With and Without Cancer

 

Data Analysis
We identified the radiologist's interpretation for each woman when interpreting alone and when interpreting in combination with every other radiologist.

For the individual radiologist interpreting alone (i.e., single interpretation), we systematically combined his or her separate assessment for each breast to create a single assessment for each woman. Women with cancer were assigned the assessment given for the breast with cancer. The study group included no cases of bilateral breast cancer. Women without cancer were assigned the most abnormal assessment from the two breast images.
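The per-woman rule described above might be sketched in Python as follows. This is only an illustration: the function name `assessment_for_woman`, the `cancer_side` argument, and the integer coding of ratings (larger meaning more abnormal) are assumptions and are not taken from the study.

```python
def assessment_for_woman(left_rating, right_rating, cancer_side=None):
    """Collapse two per-breast ratings into one per-woman rating.

    cancer_side is "left", "right", or None for a woman without cancer.
    Ratings are assumed to be integers where larger means more abnormal.
    """
    if cancer_side is None:
        # No cancer: keep the most abnormal of the two breast assessments.
        return max(left_rating, right_rating)
    if cancer_side == "left":
        return left_rating
    if cancer_side == "right":
        return right_rating
    raise ValueError("cancer_side must be 'left', 'right', or None")


print(assessment_for_woman(1, 3))          # woman without cancer -> 3
print(assessment_for_woman(2, 4, "left"))  # cancer in the left breast -> 2
```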

For each unique pair of radiologists (double interpretation), we also identified an interpretation for all women in the study set. There were 465 (31 x 30 / 2) possible unique pairs of radiologists. For each unique pair of radiologists, we assigned the most abnormal of the two radiologists' interpretations to each woman in the study set and then calculated accuracy statistics for the pair as noted in the following text.
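As a minimal sketch of this pairing rule, assuming ratings are coded as integers from 1 to 5 with larger values meaning more abnormal, the double interpretation and the count of unique pairs could be computed as shown here. The names `double_interpretation` and `reader_ids` are hypothetical.

```python
from itertools import combinations


def double_interpretation(ratings_a, ratings_b):
    """Combine two readers' per-woman ratings by keeping the more abnormal one."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both readers must rate the same set of women")
    return [max(a, b) for a, b in zip(ratings_a, ratings_b)]


# Hypothetical reader identifiers; the study had 31 radiologists.
reader_ids = list(range(31))
unique_pairs = list(combinations(reader_ids, 2))

print(len(unique_pairs))                                  # 465
print(double_interpretation([1, 3, 5, 2], [2, 2, 4, 2]))  # [2, 3, 5, 2]
```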

Analyses describe the average percentage of interpretations in each assessment category across all 31 individual and 465 double interpretations. In Tables 2 and 3, we show the proportion of all assessments accounted for by each possible pair of interpretations. These percentages are shown separately for assessments of women with and without cancer.


TABLE 2 Agreement Among 31 Radiologists for 13,950 Possible Pairwise Comparisons When Cancer Was Present

 

TABLE 3 Agreement Among 31 Radiologists for 38,595 Possible Pairwise Comparisons When Cancer Was Not Present

 

Agreement among individual radiologists and agreement among radiologist pairs were described using nonweighted and weighted kappa statistics. The kappa statistic measures agreement between pairs of ratings after accounting for chance [22]. The weighted kappa statistic allows partial agreement that depends on the ordinal relationship between two assessments. For our weighted analyses, we counted assessments that differed by one category as 75% agreement, assessments that differed by two categories as 50% agreement, and assessments that differed by three categories as 25% agreement. We report the average kappa statistics across all pairs of radiologists and across all unique pairs of radiologist pairs (i.e., no radiologist is included in both pairs).
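The weighting scheme above can be illustrated with a short Python sketch of Cohen's kappa for one pair of raters. It assumes a 1 to 5 integer coding of the five categories and assumes that ratings four categories apart receive zero credit, which the text does not state. The function names are hypothetical, and this is not the authors' software.

```python
from collections import Counter

CATEGORIES = (1, 2, 3, 4, 5)  # assumed coding of the five-point scale


def identity_weight(i, j):
    """Nonweighted kappa: credit only for exact agreement."""
    return 1.0 if i == j else 0.0


def linear_weight(i, j):
    """75%, 50%, 25% credit for ratings 1, 2, 3 categories apart (0% for 4)."""
    return 1.0 - 0.25 * abs(i - j)


def kappa(ratings_a, ratings_b, weight=identity_weight):
    """Chance-corrected agreement between two raters on the same cases."""
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("need two equally long, non-empty rating lists")
    joint = Counter(zip(ratings_a, ratings_b))
    marg_a, marg_b = Counter(ratings_a), Counter(ratings_b)
    observed = sum(weight(i, j) * c / n for (i, j), c in joint.items())
    expected = sum(
        weight(i, j) * (marg_a[i] / n) * (marg_b[j] / n)
        for i in CATEGORIES
        for j in CATEGORIES
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is complete")
    return (observed - expected) / (1.0 - expected)


a = [1, 1, 2, 3, 5, 4]
b = [1, 2, 2, 3, 4, 4]
print(kappa(a, b))                 # nonweighted
print(kappa(a, b, linear_weight))  # weighted
```

Averaging `kappa` over every pair of raters would then give a mean statistic of the kind reported in Table 4.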

We evaluated changes in accuracy resulting from double interpretation using five different measures: sensitivity, specificity, diagnostic likelihood ratio positive, diagnostic likelihood ratio negative, and the area under the ROC curve. Sensitivity is the proportion of women with cancer who are correctly identified as disease-positive by the radiologist (or radiologist pair). Specificity is the proportion of women without cancer correctly identified as disease-negative by the radiologist (or radiologist pair). Diagnostic likelihood ratios reflect the odds of disease given an assessment and can be viewed as measures of the clinical information provided by test results [23]. Because a positive test result should increase the odds of disease, the diagnostic likelihood ratio positive should be greater than 1. Similarly, a negative test should mean that disease is less likely, and the diagnostic likelihood ratio negative should be from 0 to 1. Finally, we used the area under the ROC curve for an overall measure of accuracy. The area under the ROC curve captures differences in the separate distributions of radiologist ratings for disease-positive and disease-negative examinations [24].

We defined positive results as those which the radiologist assessed as "possibly abnormal," "suspicious abnormality," and "highly suggestive," because these were linked with immediate follow-up (either additional images or biopsy). This definition is consistent with the first of three definitions proposed by Linver [25] and the definition of a positive test used to identify the sample.
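The five accuracy measures can be sketched in Python as follows. The sketch assumes ratings coded 1 to 5 with the three most abnormal categories (3 and above) counted as positive, uses the standard definitions of the likelihood ratios from sensitivity and specificity, and computes the area under the ROC curve nonparametrically as the probability that a cancer case is rated higher than a noncancer case. The function names and the small data set are invented for illustration, and the authors' own estimation method for the ROC area may differ.

```python
def sensitivity_specificity(ratings, has_cancer, positive_threshold=3):
    """Return (sensitivity, specificity); both groups must be present."""
    tp = fn = tn = fp = 0
    for rating, cancer in zip(ratings, has_cancer):
        positive = rating >= positive_threshold
        if cancer and positive:
            tp += 1
        elif cancer:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def likelihood_ratios(sens, spec):
    """Return (DLR positive, DLR negative) from sensitivity and specificity."""
    dlr_pos = sens / (1 - spec) if spec < 1 else float("inf")
    dlr_neg = (1 - sens) / spec if spec > 0 else float("inf")
    return dlr_pos, dlr_neg


def roc_area(ratings, has_cancer):
    """Chance that a cancer case outranks a noncancer case (ties count half)."""
    cancer = [r for r, c in zip(ratings, has_cancer) if c]
    normal = [r for r, c in zip(ratings, has_cancer) if not c]
    wins = sum((c > n) + 0.5 * (c == n) for c in cancer for n in normal)
    return wins / (len(cancer) * len(normal))


# Invented example: three women with cancer, five without.
example_ratings = [5, 4, 2, 1, 3, 1, 1, 2]
example_truth = [True, True, True, False, False, False, False, False]

sens, spec = sensitivity_specificity(example_ratings, example_truth)
print(sens, spec)
print(likelihood_ratios(sens, spec))
print(roc_area(example_ratings, example_truth))
```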

We report the average accuracy statistics for the 31 radiologists and the 465 radiologist pairs with an interval ranging from the 25th percentile to the 75th percentile. More direct comparisons between radiologists and radiologist pairs were made using change scores. Each radiologist contributes to 30 radiologist pairs. Change scores are the difference between each radiologist's individual accuracy and the average accuracy across that radiologist's 30 paired interpretations. We report the average kappa statistic across all possible pairs of radiologists and all possible pairs of separate pairs.
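A change score of this kind could be computed as in the following sketch. It assumes the sign convention of paired average minus individual value, so that a positive score means double interpretation scored higher. The dictionary layout, the function name `change_scores`, and the numbers are hypothetical.

```python
from statistics import mean


def change_scores(individual, paired):
    """Average paired accuracy minus individual accuracy, per radiologist.

    individual maps reader -> accuracy measure (for example, sensitivity).
    paired maps (reader, reader) tuples -> the same measure for that pair.
    """
    scores = {}
    for reader, own_value in individual.items():
        with_partners = [v for pair, v in paired.items() if reader in pair]
        scores[reader] = mean(with_partners) - own_value
    return scores


# Invented values for three readers; the study had 31 readers, so each
# reader there contributes to 30 pairs rather than 2.
individual = {"A": 0.70, "B": 0.80, "C": 0.90}
paired = {("A", "B"): 0.85, ("A", "C"): 0.95, ("B", "C"): 0.95}
print(change_scores(individual, paired))
```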


Results
Radiologists' Characteristics
All 31 radiologists included in this study completed diagnostic radiology residencies. Two thirds of the radiologists (n = 21) completed residency before 1985. Of the participating radiologists, 45% were women, and all participants were 33-64 years old (average, 45 years). The radiologists reported interpreting an average of 42 mammograms per week (range, 10-150) during the year before the study, and had an average of 9 years 10 months of experience interpreting mammograms (range, 1-30 years of experience). The radiologists had completed an average of 41 hr (range, 0-129) of continuing medical education in mammography by the time of the study.

Mammographic Examinations
Mammograms of three women with cancer and four women without cancer were removed from the study before the analysis because of marks placed on their mammograms by the radiologists during the study. The remaining 113 examinations included 30 examinations of women with cancer and 83 examinations of women without cancer. Within this set of films, original radiologists were 70% sensitive and 86% specific. Among the 30 women with cancer, the mammograms of one were identified as having normal findings by all observers.

There are 930 (31 x 30) and 2573 (31 x 83) possible individual interpretations for examinations with and without cancer, respectively. There are 465 (31 x 30/2) separate pairs of radiologists. Across these radiologist pairs, there are 13,950 (465 x 30) double interpretations for examinations with cancer and 38,595 (465 x 83) possible double interpretations for examinations without cancer. There are 31,465 ([31 x 30 x 29 x 28] / [4 x 3 x 2 x 1]) possible pairs of unique pairs.
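These counts can be checked with a few lines of Python. The variable names are arbitrary, and the last quantity is evaluated exactly as the formula is written in the text, which equals the binomial coefficient C(31, 4).

```python
from math import comb

readers, cancer_cases, noncancer_cases = 31, 30, 83

single_cancer = readers * cancer_cases        # individual reads, cancer
single_noncancer = readers * noncancer_cases  # individual reads, no cancer
reader_pairs = comb(readers, 2)               # 31 x 30 / 2
double_cancer = reader_pairs * cancer_cases
double_noncancer = reader_pairs * noncancer_cases
pairs_of_pairs = (31 * 30 * 29 * 28) // (4 * 3 * 2 * 1)

print(single_cancer, single_noncancer)  # 930 2573
print(reader_pairs)                     # 465
print(double_cancer, double_noncancer)  # 13950 38595
print(pairs_of_pairs)                   # 31465
```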

The women who contributed these 113 mammography examinations were an average of 50 years old and those with cancer had primarily small (i.e., < 2 cm in diameter; 18/30) invasive (25/30) tumors.

Distribution of Ratings
Table 1 shows the overall distribution of assessments for single and double interpretations. There is an upward shift toward more suspicious assessments for double interpretations. Among women with cancer, the shift is primarily into the "highly suggestive" category, with 11.3% more of the women rated as highly suggestive when radiologists were paired. Across examinations of women without cancer, the shift was primarily into the "needs additional evaluation" category, with 10.3% more examinations rated as needing additional assessment when radiologists were paired.

Agreement Among Ratings
Tables 2 (cancer patients) and 3 (noncancer patients) provide a detailed description of the correspondence between pairs of individual radiologists' ratings.

Among women with cancer, 56.2% of all paired interpretations agree (boldface in Table 2) and 42.9% of interpretations disagree. Most disagreement occurs between suspicious and highly suggestive assessments (30.5% [13.1 / 42.9] of disagreements). "Needs additional evaluation" and "suspicious" assessment pairs account for 16.3% (7 / 42.9) of disagreements, and "needs additional evaluation" and "highly suggestive" assessments account for 19.1% (8.2/42.9) of disagreements. Finally, "negative or benign" and "needs additional evaluation" pairs account for 19.3% (8.3/42.9) of all disagreements. Among all cancer patients for whom disagreements occur, additional imaging is recommended 25.6% of the time, biopsy consideration 72.3% of the time, and short-interval follow-up 2.1% of the time.

Among women without cancer (Table 3), 71.9% of all paired assessments agree (boldface in Table 3) and 28.0% disagree. The disagreement primarily occurs between "negative or benign" and "needs additional evaluation" assessments (66.4% [18.6/28.0] of all disagreements). Among all cases in which disagreement occurs, the more abnormal interpretation resulted in a recommendation for additional imaging 77.1% of the time, biopsy consideration 6.5% of the time, and short-interval follow-up 16.2% of the time.
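The shares of disagreement quoted in the two preceding paragraphs follow from dividing each cell percentage by the total percentage of disagreement. A small Python check, using only the figures given in the text and a hypothetical helper named `share`, is:

```python
def share(part, whole):
    """Percentage of `whole` accounted for by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


cancer_disagree = 42.9     # % of paired interpretations that disagree (cancer)
noncancer_disagree = 28.0  # % that disagree (no cancer)

print(share(13.1, cancer_disagree))     # suspicious vs highly suggestive: 30.5
print(share(7, cancer_disagree))        # additional evaluation vs suspicious: 16.3
print(share(8.2, cancer_disagree))      # additional evaluation vs highly suggestive: 19.1
print(share(8.3, cancer_disagree))      # negative/benign vs additional evaluation: 19.3
print(share(18.6, noncancer_disagree))  # negative/benign vs additional evaluation: 66.4
```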

Agreement is higher among double interpretations than among single interpretations (Table 4), especially among cancer patients. Weighting improved agreement among individual radiologists for women with cancer but otherwise resulted in only small changes.


TABLE 4 Agreement Beyond Chance Between 465 Pairs of Single Interpretations and 31,465 Distinct Pairs of Radiologist Pairs

 

Distribution of Accuracy Statistics
Table 5 shows the five accuracy measures for individual and paired radiologists and the change in the average score between single and double interpretations. Sensitivity increased, specificity decreased, diagnostic likelihood ratio positive decreased, diagnostic likelihood ratio negative decreased, and area under the ROC curve increased slightly.


TABLE 5 Average Performance Measures for Single and Double Interpretations

 


Discussion
Using a large number of radiologists and all possible pairs, we can measure average accuracy for individual and double interpretations. To our knowledge, ours is the largest study of independent double interpretation of this kind reported to date and provides a needed opportunity to examine how this technology works among multiple radiologists [13]. We assess differences between the accuracy of individuals and the accuracy of independent double interpretations. Use of independent double interpretations resulted in an average increase in sensitivity of 7%, decrease in specificity of 11%, decrease in diagnostic likelihood ratio positive of 2.35, decrease in diagnostic likelihood ratio negative of 0.06, and increase in area under the ROC curve of 0.015.

Double interpretation results in higher sensitivity than interpretation by a single radiologist, but lower specificity. The average area under the ROC curve for individual radiologists is similar to the average area under the ROC curve for radiologist pairs, suggesting that the main effect of double interpretation is a trade-off between sensitivity and specificity. In terms of ROC analysis, this means that the cut point (the point at which a test is called positive) of the test was changed, but the overall accuracy was not.

The observed trade-off in sensitivity and specificity is a well-recognized characteristic, but the effect of this trade-off on single measures of accuracy has not been previously reported to our knowledge. To use sensitivity and specificity to decide whether double interpretation is better, one must make a judgment that improvement in one characteristic is more valued than a decrement in the other. Several studies have shown double interpretation leads to an 8-15% improvement in sensitivity and a 2-5% decrease in specificity when two radiologists interpret the images independently [8,9,10,11,12, 26,27,28]. Most efforts to implement double interpretation have emphasized improvements in sensitivity, and the loss of specificity is accepted [12].

Use of the area under the ROC curve avoids a judgment regarding the relative merits of sensitivity and specificity and provides a single measure of accuracy [29, 30]. This single measure captures a radiologist's ability to distinguish between disease and nondisease across all definitions of a positive test outcome [25, 29, 30]. The small average increase in area under the ROC curve for an individual radiologist compared with his or her accuracy in a pair suggests that independent double interpretation results in little improvement in overall accuracy. This finding is reinforced by likelihood ratio statistics. Average diagnostic likelihood ratios show that positive results from radiologist pairs tend to be less informative than positive results from individual radiologists, whereas negative results from radiologist pairs tend to be more informative than negative results from individual radiologists. Our findings suggest that although independent double interpretations can improve sensitivity, the overall accuracy relative to single radiologists shows no increase.

Nonetheless, the average change in sensitivity and specificity that we observed from double interpretation would have a significant impact in practice. In a population of 24,000 screened women seen through our organized program, approximately 144 cancers are found per year [18]. A 7% improvement in sensitivity means that 10 additional cases of cancer could be found earlier. However, the corresponding 11% decrease in specificity would mean that an additional 2640 women would receive unnecessary additional evaluation. As shown in Table 2, most of the additional evaluation would be imaging, but a small proportion would be biopsy. The implications of these biopsies and the cumulative probability of biopsy need closer evaluation before we assume that the increase in sensitivity benefits women in general. Evidence now exists that the cumulative false-positive rate is substantial for women [6].
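The arithmetic behind these practice estimates can be reproduced in Python as a rough check. The sketch follows the text in applying the 11-point drop in specificity to all 24,000 screened women, which treats nearly all of them as free of cancer. The variable names are arbitrary.

```python
screened_women = 24_000
cancers_per_year = 144
sensitivity_gain = 0.07   # average increase with double interpretation
specificity_loss = 0.11   # average decrease with double interpretation

extra_cancers_found = sensitivity_gain * cancers_per_year
extra_false_positives = specificity_loss * screened_women

print(round(extra_cancers_found))    # about 10 additional cancers
print(round(extra_false_positives))  # about 2640 additional work-ups
```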

We found a somewhat lower change in sensitivity and a higher change in specificity compared with other reports [8,9,10,11,12, 26, 28]. The reasons for the differences are unclear, but several possibilities exist: our cancer and noncancer cases were selected for their difficulty; prior work involved a limited number of radiologist pairs; our study involved a test set, whereas prior reports were from clinical practice; and we used independent interpretations and systematically used the highest rating for the assessment rather than consensus. Whether our tumors were more difficult to detect than others in published studies is not clear because the characteristics are not reported elsewhere. Even if our cases were more difficult than those in other published studies, the impact of including them is unclear. A more difficult set may have reduced the average accuracy of an individual but increased the potential for improvement with double interpretation.

The strength of this study is its use of a large number of radiologists, a test set, and unbiased pairing of radiologists. Our collection of 31 community radiologists is an important difference from prior work because it represents a wider potential spectrum of skill among an unselected group. Although the relationship between test sets and clinical practice may be unclear, the former avoids important biases by having a known cancer outcome for all women and standardization of the mammograms and cancers across all radiologists. Use of all possible pairs of radiologists, and a rule for assigning the assessment of a double interpretation, avoid the issues of hierarchy and experience that influence the interpretations from consensus interpretations [12].

This study shows what may occur on average, but it also has some weaknesses. We cannot account for expected statistical correlation between paired interpretations, and therefore cannot estimate the standard error of our measures and compare accuracy measures with a level of precision. The study included images from the early 1990s. Although the facilities were accredited, current technology might result in better images and detection. The potential for improvement might also be decreased, so we do not expect that the use of older mammograms resulted in an underestimate of the difference between individual and paired interpretations. Finally, we used a rating scale that compressed the first two interpretive categories and therefore may have artificially increased agreement. Estimates of agreement using current ACR terminology might show more disagreement.

Our findings support the idea that agreement on assessments is moderate according to the criteria proposed by Landis and Koch [31]. The average agreement beyond chance for cancer cases was lower than that reported for two radiologists in a detailed study by Kerlikowske et al. [7] and comparable to that found by Elmore et al. [6]. The agreement reported by Kerlikowske et al. was between two academic radiologists who were both trained by the same expert radiologist (Kerlikowske K, personal communication). Elmore et al. included a more diverse group of volunteers but did not report separate levels of agreement for cancer and noncancer patients.

Our work adds to the literature by showing where individuals agree and disagree and demonstrating that though agreement may be moderate with respect to individual assessments, overall accuracy may still be high, as reflected in the area under the ROC curve. As shown in Table 3, any two radiologists will recognize and agree that a woman is without breast cancer almost two thirds of the time. However, agreement can also occur when both radiologists wrongly identify findings as malignant (7% of pairs). Most of this agreement results in additional imaging. Only a small proportion (0.2%) of the pairings showed agreement in which both radiologists thought a biopsy was necessary in a woman without cancer. Disagreement among radiologists regarding noncancer cases would lead to additional imaging 77.1% of the time. Although such disagreement between radiologists may lead to anxiety, unnecessary biopsy will be a relatively unusual event. Even though disagreement between radiologists occurs, their average accuracy as measured by the area under the ROC curve is high (0.85).

Agreement is lower among women with cancer. As shown in Table 2, only 44.6% (3.4+13.1+28.1) of all pairs agree that a biopsy should be considered when cancer is present. As with women without cancer, radiologists sometimes agree and are both wrong (14.2% of all pairings), but this is relatively unusual. Disagreement between radiologists would result in biopsy consideration most commonly (72.3% [(1+0.5+7+1+0.2+8.2+13.1)/42.9]) if the most aggressive recommendation was taken, but additional imaging is also a common recommendation when a second radiologist disagrees (25.6% [(8.3+2.7)/42.9]). The implication of this disagreement is that women with cancer will have a better chance of having it found.
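The percentages in this paragraph can be recomputed from the Table 2 cell values quoted in the text. In the following Python check the list names are arbitrary and the cell values are taken as printed above.

```python
agree_biopsy_cells = [3.4, 13.1, 28.1]
disagree_biopsy_cells = [1, 0.5, 7, 1, 0.2, 8.2, 13.1]
disagree_imaging_cells = [8.3, 2.7]
total_disagreement = 42.9

agree_biopsy = round(sum(agree_biopsy_cells), 1)
biopsy_share = round(100 * sum(disagree_biopsy_cells) / total_disagreement, 1)
imaging_share = round(100 * sum(disagree_imaging_cells) / total_disagreement, 1)

print(agree_biopsy)   # 44.6
print(biopsy_share)   # 72.3
print(imaging_share)  # 25.6
```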

Despite the apparent advantage of independent double interpretation for women with cancer, its impact in general is not clear. The diagnostic likelihood ratio for positive and negative tests reflects the probability of cancer once a test is complete [7, 23]. Independent double interpretation makes it much less likely that cancer is present when a test result is positive. This reduction in the likelihood of cancer is a result of the reduction in specificity. Like the area under the ROC curve, this single measure of accuracy suggests that accuracy is not improved substantially by independent double interpretation. Given this limitation, more should be done to explore alternative forms of double interpretation before it becomes widespread practice. Considerations should include the use of new computer scanner software and consensus approaches to double interpretation that may limit the unnecessary evaluation of benign lesions.


Acknowledgments
 
We thank Lou Grothaus for his assistance with design, K. Rosvik for her careful implementation of the study, and the many radiologists who remain anonymous but who gave their time to be involved.


References

  1. Bassett LW. Clinical image evaluation. Radiol Clin North Am 1995;33:1027-1039
  2. Fintor L, Brown M, Fischer R, et al. The impact of mammography quality improvement legislation in Michigan: implication for the National Quality Standards Act. Am J Public Health 1998;88:667-671
  3. Food and Drug Administration. Quality mammography standards: final rule. 21 CFR parts 16 and 900. Washington, DC: United States Dept. of Health and Human Services, 1997
  4. Hendrick RE, Chrvala CA, Plott CM, Cutter GR, Jessop NW, Wilcox-Buchalla P. Improvement in mammography quality control: 1987-1995. Radiology 1998;207:663-668
  5. Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists. Arch Intern Med 1996;156:209-213
  6. Elmore JG, Wells CK, Lee CH, Howard DH, Feinstein AR. Variability in radiologists' interpretations of mammograms. N Engl J Med 1994;331:1493-1499
  7. Kerlikowske K, Grady D, Barclay J, et al. Variability and accuracy in mammographic interpretation using the American College of Radiology breast imaging reporting and data system. J Natl Cancer Inst 1998;90:1801-1809
  8. Thurfjell EL, Lernevall KA, Taube AAS. Benefit of independent double reading in a population-based mammography screening program. Radiology 1994;191:241-244
  9. Anttinen I, Pamilo M, Soiva M, Roiha M. Double reading of mammography screening films: one radiologist or two? Clin Radiol 1993;48:414-421
  10. Brown J, Bryan S, Warren R. Mammography screening: an incremental cost-effectiveness analysis of double versus single reading of mammograms. BMJ 1996;312:809-812
  11. Warren RML, Duffy SW. Comparison of single reading with double reading of mammograms, and change in effectiveness with experience. Br J Radiol 1995;68:958-962
  12. Beam CA, Sullivan DC, Layde PM. Effect of human variability on independent double reading in screening mammography. Acad Radiol 1996;3:891-897
  13. Obuchowski NA, Zepp RC. Simple steps for improving multiple-reader studies in radiology. AJR 1996;166:517-521
  14. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 1990;43:543-549
  15. Boyd NF, Wolfson C, Moskowitz M, et al. Observer variation in the interpretation of xeromammograms. J Natl Cancer Inst 1982;68:357-363
  16. Campbell KM, Holm K. Preventive service utilization: among older women—a comparison of HMO members, private insurance policy holders, and Medicare recipients. Olympia, WA: Washington State Dept. of Health, Center for Health Statistics, 1995
  17. Miller BA, Ries LAG, Hankey BF. SEER cancer statistics review 1973-1990. Bethesda, MD: National Cancer Institute, 1993. NIH publication 93-2789
  18. Taplin SH, Mandelson MT, Anderman C, et al. Mammography diffusion and trends in late-stage breast cancer: evaluating outcomes in a population. Cancer Epidemiol Biomarkers Prev 1997;6:625-631
  19. Linver MN, Osuch JR, Brenner RJ, Smith RA. The mammography audit: a primer for the Mammography Quality Standards Act (MQSA). AJR 1995;165:19-25
  20. Fletcher RH, Fletcher SW, Wagner EH. Clinical epidemiology: the essentials. Baltimore: Williams & Wilkins, 1996
  21. Kopans DB, D'Orsi CJ, Adler DD, et al. Breast imaging reporting and data system. Reston, VA: American College of Radiology, 1993
  22. Fleiss JL. Statistical methods for rates and proportions. New York: Wiley, 1981
  23. Boyko EJ. Ruling out or ruling in disease with the most sensitive or specific diagnostic test: short cut or wrong turn? Med Decis Making 1994;14:175-179
  24. Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Diagn Imaging 1989;29:307-335
  25. Linver MN. Diagnosis of diseases of the breast. Philadelphia: Saunders, 1997
  26. Anderson EDC, Muir BB, Walsh JS, Kirkpatrick AE. The efficacy of double reading mammograms in breast screening. Clin Radiol 1994;49:248-251
  27. Blanks RG, Wallis MG, Moss SM. A comparison of cancer detection rates achieved by breast cancer screening programmes by number of readers, for one and two view mammography: results from the UK National Health Service breast screening programme. J Med Screen 1998;5:195-201
  28. Williams LJ, Hartswood M, Prescott RJ. Methodological issues in mammography double reading studies. J Med Screen 1998;5:202-206
  29. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29-36
  30. Tosteson ANA, Weinstein MC, Wittenberg J, Begg CB. ROC curve regression analysis: the use of ordinal regression models for diagnostic test assessment. Environ Health Perspect 1994;102:73-78
  31. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-174
