1 Center for Health Studies, Group Health Cooperative of Puget Sound, P. O. Box 34021, Seattle, WA 98124-1448.
2 Department of Family Medicine, University of Washington, Box 356390, Seattle, WA 98195.
3 Group Health Cooperative of Puget Sound, 1730 Minor Ave., Ste. 160, Seattle, WA 98101.
4 Department of Medicine, University of Washington, Box 356420, Seattle, WA 98195.
5 Department of Radiology, Group Health Cooperative of Puget Sound, 209 Martin Luther King Jr. Way, Tacoma, WA 98405.
6 Joyce Eisenberg Keefer Breast Center, Saint John's Hospital, 1328 - 22nd St., Santa Monica, CA 90404.
Received June 14, 1999;
accepted after revision October 11, 1999.
Supported by grant CA6371 from the National Cancer Institute and a
Generalist Faculty grant for J. G. Elmore from the Robert Wood Johnson
Foundation. However, all opinions and findings are the sole responsibility of
the authors.
Abstract
MATERIALS AND METHODS. We assessed interpretive accuracy using a stratified random sample of test mammograms that included 30 women with cancer and 83 without. Radiologists were unaware of clinical information and of each other's assessments. We describe accuracy for individual radiologists and for double interpretation, including average sensitivity, specificity, diagnostic likelihood ratios positive and negative, and area under the receiver operating characteristic (ROC) curve. We also assessed weighted and nonweighted kappa statistics among all 465 pairs of radiologists and 31,465 pairs of unique pairs. The assessment for double interpretations used the "highest" (i.e., most abnormal) assessment of the two radiologists. We calculated the difference between each radiologist's individual accuracy and the average accuracy across that radiologist's 30 double interpretations.
RESULTS. We found the following average accuracy statistics for individual radiologists: sensitivity, 79%; specificity, 81%; diagnostic likelihood ratio positive, 5.53; diagnostic likelihood ratio negative, 0.26; and area under the ROC curve, 0.85. The mean kappa statistic among radiologists for cancer cases increased with double interpretation from 0.59 to 0.70, and for noncancer cases from 0.30 to 0.34. Double interpretation resulted in an average increase in sensitivity of 7%, an average decrease in specificity of 11%, a decrease in diagnostic likelihood ratio positive of 2.35, a decrease in diagnostic likelihood ratio negative of 0.06, and an increase in area under the ROC curve of 0.02.
CONCLUSION. Independent double interpretation does not increase accuracy as measured by the area under the ROC curve.
Double interpretation can occur in a variety of ways, but usually means that the assessments of two radiologists are combined or adjudicated for a single set of images [12]. Objective independent double interpretation occurs when the radiologists are unaware of each other's interpretations and a rule is used to resolve differences that occur [8, 12]. Consensus interpretation occurs when the two radiologists discuss their differences and come to a mutually acceptable interpretation [9]. Beam et al. [12] noted that consensus interpretation introduced influences that are hard to measure. These researchers also found that the interpretive ability of each radiologist affects the accuracy of the pair, so that some pairs may be better than others. In large practices, selective pairing may not be possible. In small practices, the number of choices may be restricted. Studies to date have included volunteer pairs from community facilities or special programs, so the average impact of double interpretation is unclear. Obuchowski and Zepp [13] pointed out that estimation of the average radiologist's accuracy from multiradiologist studies more accurately reflects expectations for an imaging technology. In this analysis, we take advantage of data collected in an objective independent fashion on a set of test mammograms read by 31 radiologists. We use these data to compare the interpretive accuracy of individuals and of all possible unique pairs.
Setting and Sample Selection
This study took place at the Group Health Cooperative of Puget Sound, a predominantly staff model consumer-controlled health maintenance organization with 398,000 enrollees who are racially similar to the surrounding community [16]. Screening mammography occurs through an organized breast cancer screening program using five centers and an automated database that can be linked to a regional cancer Surveillance Epidemiology and End Results registry [17, 18]. All mammography facilities were accredited by the American College of Radiology (ACR) in 1990 and use dedicated equipment, two views per breast, and the screen-film technique with a grid.
We used stratified random sampling to select 120 bilateral mammography examinations. The selected examinations are referred to as the "index" and include mediolateral oblique and craniocaudal views for each breast of women seen for screening through Group Health's screening program described by Taplin et al. [18]. All examinations occurred among asymptomatic women 40 years old or older. Except as noted, these women were screened between January 1, 1990 and December 31, 1991, and remained continuously enrolled for at least 2 years subsequent to the screening dates (n = 21,567). Among these women, we used the Surveillance Epidemiology and End Results linkage to identify all ductal carcinoma in situ and invasive breast cancer diagnosed within the year after the index examination.
For the 21,567 screening examinations, we categorized the original radiologists' interpretations in a manner consistent with published recommendations regarding mammography audits [19, 20]. The interpretations were considered positive if the radiologist recommended short-interval follow-up, additional workup, surgical evaluation, or biopsy; other interpretations were considered negative.
We randomly selected a sample of 120 examinations from four strata (true-positive, false-positive, true-negative, and false-negative). Among the 21,567 women, the sensitivity and specificity of the original interpretations were 82% and 84%, respectively. One original purpose of the study was to distinguish among radiologists, so we created a test set that would be more difficult than the usual practice. We included a higher proportion of false-negatives and false-positives than occurred among the original 21,567 examinations. This sampling created a screening test set that included 33 women with cancer and 87 without (24 true-positive and nine false-negative, 16 false-positive and 71 true-negative). To identify enough false-negative mammograms for the sample, we randomly selected nine from 51 negative screening cases that were imaged from 1985 through 1992.
We reviewed the medical record of each woman selected to verify that the radiologist's recommendation for the index examination was accurately captured in the automated record. We also identified the appropriate image and reviewed it for the presence of pencil or other markings. All index examinations with missing mammograms, misclassified interpretations, or marked mammograms that could not be cleaned were replaced by random selection of another examination with the same interpretive classification and cancer status.
Study Protocol
We conducted this study among 31 volunteer radiologists from three group
health centers after obtaining approval from the appropriate research and
human subjects review boards. Before viewing mammograms, radiologists were
given a brief survey to assess their experience interpreting mammograms.
We rotated the original films through the three radiology centers in groups of approximately 30 examinations each. During a total of four screening sessions, each radiologist independently reviewed each group of examinations on a multipanel viewer in a private room. The radiologist was allowed up to 1 hr to review the two images of each breast. Screening views from the immediate prior screen were available and were shown for 62% of the cases. All images were masked to remove patient identifiers. The radiologists were not given any clinical information and were unaware of the prevalence of cancer in the test sets. All participating radiologists were assigned anonymous study identifiers, and the investigators remained unaware of individual results.
Study radiologists provided separate assessments for each breast for each woman included in the study sample. The interpretive options were arrayed on a five-point ordinal scale modified from the ACR lexicon and shown in Table 1 [21]. Our rating scale linked disease ratings with follow-up recommendations but combined ACR categories 1 (negative) and 2 (benign) because they had the same clinical implication (normal interval follow-up).
Data Analysis
We identified the radiologist's interpretation for each woman when
interpreting alone and when interpreting in combination with every other
radiologist.
For the individual radiologist interpreting alone (i.e., single interpretation), we systematically combined his or her separate assessment for each breast to create a single assessment for each woman. Women with cancer were assigned the assessment given for the breast with cancer. The study group included no cases of bilateral breast cancer. Women without cancer were assigned the most abnormal assessment from the two breast images.
For each unique pair of radiologists (double interpretation), we also identified an interpretation for all women in the study set. There were 465 (31 x 30 / 2) possible unique pairs of radiologists. For each unique pair of radiologists, we assigned the most abnormal of the two radiologists' interpretations to each woman in the study set and then calculated accuracy statistics for the pair as noted in the following text.
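The pairing rule described above can be sketched in a few lines. This is only an illustration, not the analysis code used in the study: the integer coding of the ordinal scale (higher meaning more abnormal), the reader identifiers, and the example ratings are all hypothetical.

```python
from itertools import combinations

def double_interpretation(rating_a, rating_b):
    """Return the most abnormal (highest) of two ordinal ratings."""
    return max(rating_a, rating_b)

def pair_ratings(ratings_by_reader):
    """Map each unique reader pair to its per-woman double interpretation.

    ratings_by_reader: dict of reader id -> list of ordinal ratings,
    one rating per woman, in the same order for every reader.
    """
    paired = {}
    for a, b in combinations(sorted(ratings_by_reader), 2):
        paired[(a, b)] = [
            double_interpretation(x, y)
            for x, y in zip(ratings_by_reader[a], ratings_by_reader[b])
        ]
    return paired

# Hypothetical ratings (assumed 1-5 coding) for three readers and four women.
example = {"r1": [1, 3, 5, 2], "r2": [2, 1, 4, 2], "r3": [1, 1, 5, 4]}
pairs = pair_ratings(example)

# Number of unique pairs among 31 readers: 31 x 30 / 2.
n_pairs_31 = len(list(combinations(range(31), 2)))
```

With 31 readers the same enumeration yields the 465 unique pairs noted in the text.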
Analyses describe the average percentage of interpretations in each assessment category across all 31 individual and 465 double interpretations. In Tables 2 and 3, we show the proportion of all assessments accounted for by each possible pair of interpretations. These percentages are shown separately for assessments of women with and without cancer.
Agreement among individual radiologists and agreement among radiologist pairs were described using nonweighted and weighted kappa statistics. The kappa statistic measures agreement between pairs of ratings after accounting for chance [22]. The weighted kappa statistic allows partial agreement that depends on the ordinal relationship between two assessments. For our weighted analyses, we counted assessments that differed by one category as 75% agreement, assessments that differed by two categories as 50% agreement, and assessments that differed by three categories as 25% agreement. We report the average kappa statistics across all pairs of radiologists and across all unique pairs of radiologist pairs (i.e., no radiologist is included in both pairs).
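A minimal sketch of a weighted kappa calculation with the partial-agreement weights described above (1, 0.75, 0.50, and 0.25 for differences of zero to three categories) follows. It assumes ratings coded as integers 1 through 5 and assumes that a difference of four categories receives no credit; the function names are ours, and the sketch is not the software used in the study.

```python
def agreement_weight(i, j, step=0.25):
    """Credit for a pair of ratings: 1 on exact agreement, minus `step`
    per category of difference, never below 0."""
    return max(0.0, 1.0 - step * abs(i - j))

def weighted_kappa(ratings_a, ratings_b, categories=(1, 2, 3, 4, 5), step=0.25):
    """Chance-corrected agreement between two lists of ordinal ratings.

    step=0.25 gives the weighted statistic; step=1.0 gives the
    nonweighted statistic (credit only for exact agreement).
    """
    ratings_a, ratings_b = list(ratings_a), list(ratings_b)
    n = len(ratings_a)
    # Observed (weighted) agreement across all rated examinations.
    p_obs = sum(agreement_weight(a, b, step)
                for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's marginal distribution.
    p_a = {c: ratings_a.count(c) / n for c in categories}
    p_b = {c: ratings_b.count(c) / n for c in categories}
    p_exp = sum(p_a[i] * p_b[j] * agreement_weight(i, j, step)
                for i in categories for j in categories)
    # Undefined (division by zero) if chance agreement is already perfect.
    return (p_obs - p_exp) / (1.0 - p_exp)

# Hypothetical examples: perfect agreement, and agreement no better than chance.
kappa_perfect = weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
kappa_chance = weighted_kappa([1, 1, 2, 2], [1, 2, 1, 2], step=1.0)
```

Averaging this statistic over every pair of raters would give a summary comparable in form to the averages reported here.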
We evaluated changes in accuracy resulting from double interpretation using five different measures: sensitivity, specificity, diagnostic likelihood ratio positive, diagnostic likelihood ratio negative, and the area under the ROC curve. Sensitivity is the proportion of women with cancer who are correctly identified as disease-positive by the radiologist (or radiologist pair). Specificity is the proportion of women without cancer correctly identified as disease-negative by the radiologist (or radiologist pair). Diagnostic likelihood ratios reflect the odds of disease given an assessment and can be viewed as measures of the clinical information provided by test results [23]. Because a positive test result should increase the odds of disease, the diagnostic likelihood ratio positive should be greater than 1. Similarly, a negative test should mean that disease is less likely, and the diagnostic likelihood ratio negative should be from 0 to 1. Finally, we used the area under the ROC curve for an overall measure of accuracy. The area under the ROC curve captures differences in the separate distributions of radiologist ratings for disease-positive and disease-negative examinations [24].
We defined positive results as those which the radiologist assessed as "possibly abnormal," "suspicious abnormality," and "highly suggestive," because these were linked with immediate follow-up (either additional images or biopsy). This definition is consistent with the first of three definitions proposed by Linver [25] and the definition of a positive test used to identify the sample.
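The five measures can be illustrated with a short sketch. It assumes the ratings are coded 1 through 5 with the three positive categories coded 3 or higher; that coding, the function names, and the example data are hypothetical. The area under the ROC curve is computed here with the nonparametric (Mann-Whitney) estimate, which may differ from the estimation method used in the study.

```python
def accuracy_measures(ratings, has_cancer, positive_threshold=3):
    """Sensitivity, specificity, and diagnostic likelihood ratios for one
    reader (or reader pair), treating ratings >= threshold as positive."""
    tp = sum(1 for r, c in zip(ratings, has_cancer) if c and r >= positive_threshold)
    fn = sum(1 for r, c in zip(ratings, has_cancer) if c and r < positive_threshold)
    tn = sum(1 for r, c in zip(ratings, has_cancer) if not c and r < positive_threshold)
    fp = sum(1 for r, c in zip(ratings, has_cancer) if not c and r >= positive_threshold)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        # DLR+ = P(positive | cancer) / P(positive | no cancer)
        "dlr_positive": sens / (1 - spec) if spec < 1 else float("inf"),
        # DLR- = P(negative | cancer) / P(negative | no cancer)
        "dlr_negative": (1 - sens) / spec if spec > 0 else float("inf"),
    }

def roc_area(ratings, has_cancer):
    """Probability that a cancer case is rated higher than a noncancer
    case, counting ties as one half (Mann-Whitney estimate)."""
    pos = [r for r, c in zip(ratings, has_cancer) if c]
    neg = [r for r, c in zip(ratings, has_cancer) if not c]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical ratings for four women with cancer and four without.
example_ratings = [5, 4, 2, 3, 1, 1, 3, 2]
example_cancer = [True, True, True, True, False, False, False, False]
measures = accuracy_measures(example_ratings, example_cancer)
auc = roc_area(example_ratings, example_cancer)
```

Unlike sensitivity and specificity, the area under the ROC curve in this sketch does not depend on the threshold chosen for a positive result.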
We report the average accuracy statistics for the 31 radiologists and the 465 radiologist pairs with an interval ranging from the 25th percentile to the 75th percentile. More direct comparisons between radiologists and radiologist pairs were made using change scores. Each radiologist contributes to 30 radiologist pairs. Change scores are the difference between each radiologist's individual accuracy and the average accuracy across that radiologist's 30 paired interpretations. We report the average kappa statistic across all possible pairs of radiologists and all possible pairs of separate pairs.
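The change score can be sketched as follows. The sign convention (paired average minus individual value, so that a positive score indicates an increase with double interpretation) is our assumption, as are the function name and the example values.

```python
from statistics import mean

def change_scores(individual, paired):
    """Change score per reader: average accuracy across every pair that
    includes the reader, minus that reader's individual accuracy.

    individual: dict of reader id -> accuracy statistic
    paired: dict of (reader_a, reader_b) -> accuracy statistic for the pair
    """
    scores = {}
    for reader, own_value in individual.items():
        pair_values = [v for pair, v in paired.items() if reader in pair]
        scores[reader] = mean(pair_values) - own_value
    return scores

# Hypothetical sensitivities for three readers and their three pairs.
individual = {"r1": 0.70, "r2": 0.80, "r3": 0.90}
paired = {("r1", "r2"): 0.85, ("r1", "r3"): 0.95, ("r2", "r3"): 0.90}
scores = change_scores(individual, paired)
```

With 31 readers, each reader's paired average is taken over 30 pairs, as described above.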
Mammographic Examinations
Mammograms of three women with cancer and four women without cancer were
removed from the study before the analysis because of marks placed on their
mammograms by the radiologists during the study. The remaining 113
examinations included 30 examinations of women with cancer and 83 examinations
of women without cancer. Within this set of films, original radiologists were
70% sensitive and 86% specific. Among the 30 women with cancer, the mammograms
of one were identified as having normal findings by all observers.
There are 930 (31 x 30) and 2573 (31 x 83) possible individual interpretations for examinations with and without cancer, respectively. There are 465 (31 x 30/2) separate pairs of radiologists. Across these radiologist pairs, there are 13,950 (465 x 30) double interpretations for examinations with cancer and 38,595 (465 x 83) possible double interpretations for examinations without cancer. There are 31,465 ([31 x 30 x 29 x 28] / [4 x 3 x 2 x 1]) possible pairs of unique pairs.
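These counts can be checked with a few lines of arithmetic. The variable names are ours. The last expression in the text, (31 x 30 x 29 x 28) / (4 x 3 x 2 x 1), is the binomial coefficient for choosing four radiologists from 31, and the sketch computes it in that form.

```python
from math import comb

n_readers, n_cancer, n_noncancer = 31, 30, 83

# Individual interpretations: one per reader per examination.
single_cancer = n_readers * n_cancer          # 930
single_noncancer = n_readers * n_noncancer    # 2573

# Unique pairs of readers: 31 x 30 / 2.
n_pairs = comb(n_readers, 2)                  # 465

# Double interpretations: one per pair per examination.
double_cancer = n_pairs * n_cancer            # 13,950
double_noncancer = n_pairs * n_noncancer      # 38,595

# (31 x 30 x 29 x 28) / (4 x 3 x 2 x 1), as written in the text.
four_reader_groups = comb(n_readers, 4)       # 31,465
```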
The women who contributed these 113 mammography examinations were an average of 50 years old and those with cancer had primarily small (i.e., < 2 cm in diameter; 18/30) invasive (25/30) tumors.
Distribution of Ratings
Table 1 shows the overall distribution of assessments for single and double interpretations. There is an upward shift toward more suspicious assessments for double interpretations. Among women with cancer, the shift is primarily into the "highly suggestive" category, with 11.3% more of the women rated as highly suggestive when radiologists were paired. Across examinations of women without cancer, the shift was primarily into the "needs additional evaluation" category, with 10.3% more examinations rated as needing additional assessment when radiologists were paired.
Agreement Among Ratings
Tables 2 (cancer patients)
and 3 (noncancer patients)
provide a detailed description of the correspondence between pairs of
individual radiologists' ratings.
Among women with cancer, 56.2% of all paired interpretations agree (boldface in Table 2) and 42.9% of interpretations disagree. Most disagreement occurs between suspicious and highly suggestive assessments (30.5% [13.1 / 42.9] of disagreements). "Needs additional evaluation" and "suspicious" assessment pairs account for 16.3% (7 / 42.9) of disagreements, and "needs additional evaluation" and "highly suggestive" assessments account for 19.1% (8.2/42.9) of disagreements. Finally, "negative or benign" and "needs additional evaluation" pairs account for 19.3% (8.3/42.9) of all disagreements. Among all cancer patients for whom disagreements occur, additional imaging is recommended 25.6% of the time, biopsy consideration 72.3% of the time, and short-interval follow-up 2.1% of the time.
Among women without cancer (Table 3), 71.9% of all paired assessments agree (boldface in Table 3) and 28.0% disagree. The disagreement primarily occurs between "negative or benign" and "needs additional evaluation" assessments (66.4% [18.6/28.0] of all disagreements). Among all cases in which disagreement occurs, the more abnormal interpretation resulted in a recommendation for additional imaging 77.1% of the time, biopsy consideration 6.5% of the time, and short-interval follow-up 16.2% of the time.
Agreement is higher among double interpretations than among single interpretations (Table 4), especially among cancer patients. Weighting improved agreement among individual radiologists for women with cancer but otherwise resulted in only small changes.
Distribution of Accuracy Statistics
Table 5 shows the five
accuracy measures for individual and paired radiologists and the change in the
average score between single and double interpretations. Sensitivity
increased, specificity decreased, diagnostic likelihood ratio positive
decreased, diagnostic likelihood ratio negative decreased, and area under the
ROC curve increased slightly.
Double interpretation results in higher sensitivity than interpretation by a single radiologist, but lower specificity. The average area under the ROC curve for individual radiologists is similar to the average area under the ROC curve for radiologist pairs, suggesting that the main effect of double interpretation is a trade-off between sensitivity and specificity. In terms of ROC analysis, this means that the cut point (the point at which a test is called positive) of the test was changed, but the overall accuracy was not.
The observed trade-off in sensitivity and specificity is a well-recognized characteristic, but to our knowledge the effect of this trade-off on single measures of accuracy has not been previously reported. To use sensitivity and specificity to decide whether double interpretation is better, one must make a judgment that improvement in one characteristic is more valued than a decrement in the other. Several studies have shown that double interpretation leads to an 8-15% improvement in sensitivity and a 2-5% decrease in specificity when two radiologists interpret the images independently [8, 9, 10, 11, 12, 26, 27, 28]. Most efforts to implement double interpretation have emphasized improvements in sensitivity, and the loss of specificity is accepted [12].
Use of the area under the ROC curve avoids a judgment regarding the relative merits of sensitivity and specificity and provides a single measure of accuracy [29, 30]. This single measure captures a radiologist's ability to distinguish between disease and nondisease across all definitions of a positive test outcome [25, 29, 30]. The small average increase in area under the ROC curve for an individual radiologist compared with his or her accuracy in a pair suggests that independent double interpretation results in little improvement in overall accuracy. This finding is reinforced by likelihood ratio statistics. Average diagnostic likelihood ratios show that positive results from radiologist pairs tend to be less informative than positive results from individual radiologists, whereas negative results from radiologist pairs tend to be more informative than negative results from individual radiologists. Our findings suggest that although independent double interpretations can improve sensitivity, the overall accuracy relative to single radiologists shows no increase.
Nonetheless, the average change in sensitivity and specificity that we observed from double interpretation would have a significant impact in practice. In a population of 24,000 screened women seen through our organized program, approximately 144 cancers are found per year [18]. A 7% improvement in sensitivity means that 10 additional cases of cancer could be found earlier. However, the corresponding 11% decrease in specificity would mean that an additional 2640 women would receive unnecessary additional evaluation. As shown in Table 2, most of the additional evaluation would be imaging, but a small proportion would be biopsy. The implications of these biopsies and the cumulative probability of biopsy need closer evaluation before we assume that the increase in sensitivity benefits women in general. Evidence now exists that the cumulative false-positive rate is substantial for women [6].
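The arithmetic behind these estimates can be laid out explicitly. The variable names are ours, and the sketch follows the text in applying the 11% change in specificity to all 24,000 screened women; applying it only to the women without cancer (24,000 minus 144) gives a slightly smaller figure.

```python
screened_women = 24_000      # women screened per year in the program
cancers_per_year = 144       # approximate cancers found per year

sensitivity_gain = 0.07      # average increase with double interpretation
specificity_loss = 0.11      # average decrease with double interpretation

# Additional cancers that could be found earlier.
extra_cancers = sensitivity_gain * cancers_per_year

# Additional women recalled without cancer, using all screened women
# as the base, as in the text.
extra_false_positives = specificity_loss * screened_women

# The same calculation restricted to women without cancer.
extra_false_positives_noncancer = specificity_loss * (screened_women - cancers_per_year)

# Additional false-positive evaluations per additional cancer found.
false_positives_per_cancer = extra_false_positives / extra_cancers
```

Under these assumptions, roughly 260 women would undergo unnecessary additional evaluation for each additional cancer found earlier.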
We found a somewhat lower change in sensitivity and a higher change in specificity compared with other reports [8, 9, 10, 11, 12, 26, 28]. The reasons for the differences are unclear, but several possibilities exist: our cancer and noncancer cases were selected for their difficulty; prior work involved a limited number of radiologist pairs; our study involved a test set, whereas prior reports were from clinical practice; and we used independent interpretations and systematically used the highest rating for the assessment rather than consensus. Whether our tumors were more difficult to detect than others in published studies is not clear because the characteristics are not reported elsewhere. Even if our cases were more difficult than those in other published studies, the impact of including them is unclear. A more difficult set may have reduced the average accuracy of an individual but increased the potential for improvement with double interpretation.
The strength of this study is its use of a large number of radiologists, a test set, and unbiased pairing of radiologists. Our collection of 31 community radiologists is an important difference from prior work because it represents a wider potential spectrum of skill among an unselected group. Although the relationship between test sets and clinical practice may be unclear, the former avoids important biases by having a known cancer outcome for all women and standardization of the mammograms and cancers across all radiologists. Use of all possible pairs of radiologists, and a rule for assigning the assessment of a double interpretation, avoid the issues of hierarchy and experience that influence the interpretations from consensus interpretations [12].
This study shows what may occur on average, but it also has some weaknesses. We cannot account for expected statistical correlation between paired interpretations, and therefore cannot estimate the standard error of our measures and compare accuracy measures with a level of precision. The study included images from the early 1990s. Although the facilities were accredited, current technology might result in better images and detection. The potential for improvement might also be decreased, so we do not expect that the use of older mammograms resulted in an underestimate of the difference between individual and paired interpretations. Finally, we used a rating scale that compressed the first two interpretive categories and therefore may have artificially increased agreement. Estimates of agreement using current ACR terminology might show more disagreement.
Our findings support the idea that agreement on assessments is moderate according to the criteria proposed by Landis and Koch [31]. The average agreement beyond chance for cancer cases was lower than that reported for two radiologists in a detailed study by Kerlikowske et al. [7] and comparable to that found by Elmore et al. [6]. The agreement reported by Kerlikowske et al. was between two academic radiologists who were both trained by the same expert radiologist (Kerlikowske K, personal communication). Elmore et al. included a more diverse group of volunteers but did not report separate levels of agreement for cancer and noncancer patients.
Our work adds to the literature by showing where individuals agree and disagree and demonstrating that though agreement may be moderate with respect to individual assessments, overall accuracy may still be high, as reflected in the area under the ROC curve. As shown in Table 3, any two radiologists will recognize and agree that a woman is without breast cancer almost two thirds of the time. However, agreement can also occur when both radiologists wrongly identify findings as malignant (7% of pairs). Most of this agreement results in additional imaging. Only a small proportion (0.2%) of the pairings showed agreement in which both radiologists thought a biopsy was necessary in a woman without cancer. Disagreement among radiologists regarding noncancer cases would lead to additional imaging 77.1% of the time. Although such disagreement between radiologists may lead to anxiety, unnecessary biopsy will be a relatively unusual event. Even though disagreement between radiologists occurs, their average accuracy as measured by the area under the ROC curve is high (0.85).
Agreement is lower among women with cancer. As shown in Table 2, only 44.6% (3.4 + 13.1 + 28.1) of all pairs agree that a biopsy should be considered when cancer is present. As with the women without cancer, radiologists sometimes agree and are wrong (14.2% of all pairings), but this is relatively unusual. If the most aggressive recommendation were taken, disagreement between radiologists would most commonly result in biopsy consideration (72.3% [(1 + 0.5 + 7 + 1 + 0.2 + 8.2 + 13.1) / 42.9]), but additional imaging is also a common recommendation when a second radiologist disagrees (25.6% [(8.3 + 2.7) / 42.9]). The implication of this disagreement is that women with cancer will have a better chance of having it found.
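The percentages in this paragraph follow directly from the Table 2 cell values quoted in the text, as the short check below shows. The variable names are ours, and the grouping of cells into biopsy and imaging outcomes simply mirrors the sums given above.

```python
# Cell percentages quoted from Table 2 (women with cancer).
agree_biopsy_cells = [3.4, 13.1, 28.1]
disagree_biopsy_cells = [1, 0.5, 7, 1, 0.2, 8.2, 13.1]
disagree_imaging_cells = [8.3, 2.7]
total_disagreement = 42.9

# Pairs in which both radiologists would consider biopsy.
agree_biopsy = round(sum(agree_biopsy_cells), 1)

# Share of disagreements resolved as biopsy consideration or additional
# imaging when the more abnormal assessment is used.
biopsy_share = round(100 * sum(disagree_biopsy_cells) / total_disagreement, 1)
imaging_share = round(100 * sum(disagree_imaging_cells) / total_disagreement, 1)
```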
Despite the apparent advantage of independent double interpretation for women with cancer, its impact in general is not clear. The diagnostic likelihood ratio for positive and negative tests reflects the probability of cancer once a test is complete [7, 23]. Independent double interpretation makes it much less likely that cancer is present when a test result is positive. This reduction in the likelihood of cancer is a result of the reduction in specificity. Like the area under the ROC curve, this single measure of accuracy suggests that accuracy is not improved substantially by independent double interpretation. Given this limitation, more should be done to explore alternative forms of double interpretation before it becomes widespread practice. Considerations should include the use of new computer scanner software and consensus approaches to double interpretation that may limit the unnecessary evaluation of benign lesions.
Acknowledgments
We thank Lou Grothaus for his assistance with design, K. Rosvik for her
careful implementation of the study, and the many radiologists who remain
anonymous but who gave their time to be involved.