AJR Women's Imaging Online
AJR 2000; 174:1769-1777
© American Roentgen Ray Society


Breast Imaging Reporting and Data System

Inter- and Intraobserver Variability in Feature Analysis and Final Assessment

Wendie A. Berg1,2, Cristina Campassi1, Patricia Langenberg3 and Mary J. Sexton3

1 Department of Radiology, University of Maryland School of Medicine, 22 S. Greene St., Baltimore, MD 21201.
2 The Greenebaum Cancer Center, University of Maryland School of Medicine, Baltimore, MD 21201.
3 Department of Epidemiology and Preventive Medicine, University of Maryland School of Medicine, Baltimore, MD 21201.

Received September 22, 1999; accepted after revision November 5, 1999.

 
Presented in part at the 1997 and 1998 Radiological Society of North America annual meetings, Chicago, November 1997 and November 1998.

Supported by a grant from the Susan G. Komen Breast Cancer Foundation.

Address correspondence to W. A. Berg.


Abstract
 
OBJECTIVE. We sought to evaluate the use of the Breast Imaging Reporting and Data System (BI-RADS) standardized mammography lexicon among and within observers and to distinguish variability in feature analysis from variability in lesion management.

MATERIALS AND METHODS. Five experienced mammographers, not specifically trained in BI-RADS, used the lexicon to describe and assess 103 screening mammograms, including 30 (29%) showing cancer, and a subset of 86 mammograms with diagnostic evaluation, including 23 (27%) showing cancer. A subset of 13 screening mammograms (two with malignant findings, 11 with diagnostic evaluation) was rereviewed by each observer 2 months later. Kappa statistics were calculated as measures of agreement beyond chance.

RESULTS. After diagnostic evaluation, the interobserver kappa values for describing features were as follows: breast density, 0.43; lesion type, 0.75; mass borders, 0.40; special cases, 0.56; mass density, 0.40; mass shape, 0.28; microcalcification morphology, 0.36; and microcalcification distribution, 0.47. Lesion management was highly variable, with a kappa value for final assessment of 0.37. When we grouped assessments recommending immediate additional evaluation and biopsy (BI-RADS categories 0, 4, and 5 combined) versus follow-up (categories 1, 2, and 3 combined), five observers agreed on management for only 47 (55%) of 86 lesions. Intraobserver agreement on management (additional evaluation or biopsy versus follow-up) was seen in 47 (85%) of 55 interpretations, with a kappa value of 0.35-1.0 (mean, 0.60) for final assessment.

CONCLUSION. Inter- and intraobserver variability in mammographic interpretation is substantial for both feature analysis and management. Continued development of methods to improve standardization in mammographic interpretation is needed.


Introduction
 
In an effort to standardize mammographic reporting, the American College of Radiology and experts in mammography developed the Breast Imaging Reporting and Data System (BI-RADS) lexicon [1, 2]. Terms have been developed to describe breast density, lesion features, impression, and recommendations. Improved accuracy in mammographic reporting has been shown [3] with use of a computer decision aid when patient age and observers' scaled values for a variety of morphologic features were entered. These descriptors were weighted for predictive power, and the output to the observer was an advisory estimate of the probability of malignancy [3]. D'Orsi and Kopans [4] further refined this initial success by ranking descriptors by likelihood of malignancy, and at least one recent study adds further validation to this approach [5]. Similarly, input of BI-RADS descriptors into an artificial neural network has been shown to improve the positive predictive value of breast biopsy [6]. The final rule of the Mammography Quality Standards Act requires the use of the BI-RADS final assessment categories on all mammographic interpretations. Variability in the application of the BI-RADS terminology in practice has not been widely studied. One study designed to validate use of the BI-RADS lexicon among five observers showed moderate agreement [7] except in the terms used to describe associated findings and special cases. We sought to extend those results by evaluating observers who were not trained in the same facility and to measure intraobserver variability among several observers.

Our study had the following three goals: to assess variability in feature analysis (description of lesion) and lesion management (threshold for biopsy); to identify which lesion descriptors are consistently used and to determine the positive predictive value of each major descriptor in the BI-RADS lexicon in our series of cases; and to provide guidance for possible areas of improvement in either terminology or training of interpreting physicians.


Materials and Methods
 
Five experienced mammographers reviewed 103 proven cases having prospectively nonpalpable lesions. Experience of the reviewers ranged from interpreting 100 mammograms per week for 4 years to interpreting 600 mammograms per week for 10 years. Three observers worked together in the same private practice, and a fourth observer had previously worked in the same group. All had been trained independently, and all met all standards as qualified interpreting physicians under the Mammography Quality Standards Act. Two observers each had at least 5 years' experience as a mammographer in an academic institution.

The case population was selected to maintain the expected rate of malignancy among lesions referred for biopsy. We had 30 cases of cancer in our series, representing 29% of lesions. Sixty-nine benign lesions and four high-risk lesions (three cases of atypical ductal hyperplasia and one of lobular carcinoma in situ) were included. At the same time, we sought to include at least three examples of each of the BI-RADS major descriptors for mass borders, asymmetric densities, microcalcification morphology, and distribution, except that all typically benign calcifications were condensed into the term "coarse," with only three such examples included. Ninety lesions were biopsy-proven (44 by 14-gauge core needle biopsy with 2 years of follow-up showing stability or regression, and 46 by surgical excision). Thirteen lesions were shown to be stable on 4-year follow-up mammography.

Good-quality copy mammograms were used, with the lesion marked on both craniocaudal and mediolateral oblique views. Observers were asked to review the screening films initially without comparison films and to complete a form detailing the findings using the BI-RADS lexicon and assessment and recommendation categories. Observers were asked to choose the single most worrisome applicable descriptor from each category. Assessment was finalized in BI-RADS categories together with corresponding recommendations: 1, negative, routine screening; 2, benign findings, routine screening; 3, probably benign, short interval follow-up (6 months); 4, suspicious, biopsy; 5, highly suggestive of malignancy, biopsy; and 0, needs additional evaluation.

For a subset of 86 lesions, additional diagnostic evaluation was immediately available under separate cover, and the observers were asked to again describe features and make a final assessment. Additional evaluation consisted of magnification images for 50 lesions, spot compression images for 14 lesions and sonograms for seven of these 14, true lateral images for five, one laterally exaggerated craniocaudal image, and one mammogram after aspiration. For 15 lesions, including four cases of microcalcifications, the only additional evaluation was comparison films from at least 4 years earlier. Comparison films were also supplied for 13 lesions with other additional evaluation. This subset of 86 cases included 23 cases of cancer (27% of lesions), the four high-risk lesions, and 59 benign lesions. Observers were again permitted to use category 0, needs additional evaluation, because personal preferences varied as to the extent of additional evaluation deemed necessary to make a final assessment.

To assess intraobserver variability, all observers rereviewed 13 randomly selected cases after a minimum of 2 months had elapsed since the first interpretation, longer than the 6 weeks advocated by Metz [8]. Two cancerous lesions were included, as were one high-risk lesion and nine benign lesions; 11 cases had diagnostic evaluation.

Kappa statistics were calculated using Stata software (Stata Press, College Station, TX) to assess the proportion of inter- and intraobserver agreement beyond that expected by chance [9]. The method for estimating an overall kappa value in the case of multiple observers and multiple categories is based on the work of Landis and Koch [10, 11] as follows: for each category j, a kappa statistic (for multiple raters) is calculated comparing category j with the other categories pooled. A weighted average is used to combine these kappa values, where the weight for a given kappa value is the product of pj, the proportion of ratings in category j, and (1 - pj), the proportion of ratings not in category j. BI-RADS final assessment categories 1 and 2 were combined for this analysis. A value of κ = 1.0 corresponds to complete agreement, 0 to agreement no better than that expected by chance, and less than 0 to agreement worse than chance. Svanholm et al. [12] have suggested that a kappa value of equal to or less than 0.50 be taken as poor and equal to or greater than 0.75 as excellent reproducibility. Landis and Koch [10] have suggested that a kappa value of equal to or less than 0.20 indicates slight agreement; 0.21-0.40, fair agreement; 0.41-0.60, moderate agreement; 0.61-0.80, substantial agreement; and 0.81-1.00, almost perfect agreement.
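The weighted-average estimate described above can be sketched in Python. This is an illustrative reconstruction, not the Stata routine that was run for the study: it assumes the per-category kappa takes the Fleiss-type form for multiple raters, and the function names (`category_kappas`, `overall_kappa`, `landis_koch_label`) and the toy ratings are hypothetical.

```python
from collections import Counter


def category_kappas(ratings):
    """Return {category: (kappa_j, weight_j)} for a table of ratings.

    ratings: one list per lesion, holding one category label per observer.
    Every lesion must be rated by the same number of observers (m >= 2).
    Assumption: kappa_j uses a Fleiss-type multirater form that compares
    category j with all other categories pooled.
    """
    n_lesions = len(ratings)
    m = len(ratings[0])
    if m < 2 or any(len(row) != m for row in ratings):
        raise ValueError("each lesion needs the same number (>= 2) of ratings")
    counts = [Counter(row) for row in ratings]
    categories = sorted({label for row in ratings for label in row})
    result = {}
    for cat in categories:
        # p_j: proportion of all ratings that fall in category j
        p = sum(c[cat] for c in counts) / (n_lesions * m)
        weight = p * (1 - p)  # p_j * (1 - p_j), the weight named in the text
        if weight == 0:
            continue  # only one category was ever used; kappa_j is undefined
        # Observer pairs that split on "category j versus not j", over all lesions
        split = sum(c[cat] * (m - c[cat]) for c in counts)
        kappa = 1 - split / (n_lesions * m * (m - 1) * weight)
        result[cat] = (kappa, weight)
    return result


def overall_kappa(ratings):
    """Weighted average of the per-category kappas, weights p_j * (1 - p_j)."""
    per_cat = category_kappas(ratings)
    total_weight = sum(w for _, w in per_cat.values())
    if total_weight == 0:
        raise ValueError("kappa is undefined when only one category is used")
    return sum(k * w for k, w in per_cat.values()) / total_weight


def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch bands quoted in the text."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical example: three lesions, each described by five observers.
toy = [
    ["mass", "mass", "mass", "mass", "mass"],
    ["calcifications"] * 5,
    ["mass", "mass", "special case", "special case", "mass"],
]
print(round(overall_kappa(toy), 2), landis_koch_label(overall_kappa(toy)))
```

With this form, unanimous ratings give a kappa of 1.0, and two observers who never agree give -1.0, which matches the interpretation of the scale given above.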


Results
 
Observer Performance
On the basis of the screening images alone, 27-29 of 30 cancers (90-97%) were recommended for additional evaluation or biopsy across the five observers. After diagnostic evaluation, 21 or 22 of 23 cancers (91-96%) were recommended for further additional evaluation or biopsy. The malignancy rate of lesions recommended for biopsy (BI-RADS final assessment categories 4 and 5) ranged from 43% to 55% across observers.

Interobserver Variability
Breast density.—Moderate agreement was seen across our observers in describing breast density, with an overall kappa value of 0.43. Moderate agreement was seen in the use of the terms "fatty" (κ = 0.76), "minimal scattered fibroglandular elements" (κ = 0.43), and "extremely dense" (κ = 0.45). Poor agreement was seen in use of the term "heterogeneously dense" (κ = 0.17).

Lesion type.—The five observers agreed on lesion type (mass, mass with calcifications, microcalcifications, special case, or calcifications with associated density) for 75 of 103 lesions, with a kappa value of 0.75. As has previously been observed [7], distinction of a "focal asymmetric density" from an "indistinct" mass was problematic (Fig. 1A,1B,1C). For eight (29%) of 28 lesions, the presence or absence of microcalcifications in a mass or focal density was the source of disagreement. Similarly, another eight lesions (29%) represented disagreement on the presence or absence of a mass or focal density associated with definite microcalcifications. In detailing results for individual terms describing feature analysis (Tables 1 and 2), we will consider only the observers' interpretations characterizing the lesion as a "pure" mass, density, or microcalcifications; we will exclude mixed lesions.



 
Fig. 1A. —57-year-old woman with lesion variably termed "focal asymmetric density" versus "indistinct" mass. Bilateral routine craniocaudal (A) and mediolateral oblique (B) mammograms show focal normal variant asymmetric parenchyma (arrows) that was unchanged for 5 years.

 


 
Fig. 1B. —57-year-old woman with lesion variably termed "focal asymmetric density" versus "indistinct" mass. Bilateral routine craniocaudal (A) and mediolateral oblique (B) mammograms show focal normal variant asymmetric parenchyma (arrows) that was unchanged for 5 years.

 


 
Fig. 1C. —57-year-old woman with lesion variably termed "focal asymmetric density" versus "indistinct" mass. Spot compression mammogram was provided with comparison films. Three observers described finding as focal asymmetric density, considered normal (BI-RADS category 1, two observers), or needing sonography (BI-RADS category 0, one observer). Remaining two observers viewed finding as mass with indistinct borders and suspicious, needing biopsy (BI-RADS category 4). This distinction of focal asymmetric density from indistinct mass is one of the most problematic in mammographic interpretation, as was also seen in study of Baker et al. [7].

 


 
TABLE 1 Agreement in Description of 34 Noncalcified Lesions After Diagnostic Evaluation and Correlation with Final Assessment (27, Benign; Seven, Malignant)

 


 
TABLE 2 Agreement in Description of 36 Foci of Microcalcifications After Diagnostic Evaluation and Correlation with Final Assessment Categories (26, Benign; Three, High-Risk; Seven, Malignant)

 

Mass borders.—On the basis of the screening evaluation, complete agreement in feature analysis of mass borders or special cases was seen across all five observers in only seven (16%) of 45 lesions. After diagnostic evaluation, complete agreement of all five observers was seen for 12 (38%) of 32 masses or special cases (κ = 0.40).

The greatest agreement was in the use of the term "circumscribed round" (κ = 0.64) after diagnostic evaluation. Only two interpretations describing a lesion as a circumscribed mass assessed the lesion as suspicious, biopsy (BI-RADS category 4). Indeed, only one (4%) of 28 interpretations describing a circumscribed round mass was a malignant lesion (Fig. 2, Table 1), consistent with the work of Sickles [13], which showed a 1.4% risk of malignancy in such lesions, and with the recommendation that such lesions be considered probably benign (BI-RADS category 3) [1, 4]. Sonograms had been provided for two masses that the authors considered circumscribed, but none of the observers described either of these masses as circumscribed. Not surprisingly, sonography was recommended for additional evaluation of nine (32%) of 28 masses considered circumscribed by the observers before they would render a final assessment.



 
Fig. 2. —65-year-old woman with infiltrating ductal (colloid) carcinoma. Spot magnification mammogram shows mass considered circumscribed and lobulated by three observers, microlobulated by one observer, and circumscribed and round by fifth observer. Sonography was recommended by three observers describing mass as circumscribed; the other observers assessed mass as suspicious (BI-RADS category 4).

 

Masses termed "microlobulated" by one observer were more likely to be considered indistinct or spiculated by another observer (κ = 0.32). All were considered suspicious (BI-RADS category 4) or in need of additional evaluation (BI-RADS category 0), and one (10%) of 10 such lesions was malignant (Fig. 2). Very few lesions were termed "obscured"; essentially no agreement was seen in use of this term (κ = 0.10). A mass considered obscured by one observer was more than twice as likely to be considered indistinct or circumscribed by another.

Only fair agreement was seen on use of the term "indistinct" (κ = 0.28). Surprisingly, assessments of indistinct masses ranged from outright benign (BI-RADS category 2) through highly suggestive of malignancy (BI-RADS category 5) (Table 1). None of those assessed as category 2 (three interpretations) or category 3 (two interpretations) proved to be malignant, whereas six (50%) of 12 of those assessed as category 4 and one (50%) of two of those assessed as category 5 proved malignant.

Moderate agreement was seen for the term "spiculated" (Fig. 3A,3B,3C) (κ = 0.58). All lesions so described were assessed as suspicious (BI-RADS category 4) or highly suggestive of malignancy (BI-RADS category 5) by all observers, and indeed 12 (75%) of 16 interpretations describing a spiculated mass were of malignant lesions.



 
Fig. 3A. —57-year-old woman with infiltrating ductal carcinoma. Craniocaudal (A) and mediolateral oblique mammograms (B) show focal density (arrows). From screening images alone, one observer deemed this negative (BI-RADS category 1) although lesion was marked on films.

 


 
Fig. 3B. —57-year-old woman with infiltrating ductal carcinoma. Craniocaudal (A) and mediolateral oblique mammograms (B) show focal density (arrows). From screening images alone, one observer deemed this negative (BI-RADS category 1) although lesion was marked on films.

 


 
Fig. 3C. —57-year-old woman with infiltrating ductal carcinoma. Spot magnification mammogram in mediolateral projection reveals spiculated mass, considered highly suggestive of malignancy (BI-RADS category 5) by all observers.

 

Special cases.—Complete agreement was seen for the one dilated duct. The four lymph nodes included were as likely to be considered benign circumscribed masses as lymph nodes (κ = 0.50). Lesions termed "asymmetric breast tissue" (κ = 0.38) or "focal asymmetric densities" (κ = 0.60) by one observer were both more likely to be considered indistinct masses by another (Fig. 1A,1B,1C), as seen in the study of Baker et al. [7]. If the description was asymmetric breast tissue, the lesion was always considered normal or benign and proved to be. Focal asymmetric densities were more problematic, with assessments ranging from negative (BI-RADS category 1) to highly suggestive of malignancy (BI-RADS category 5). Overall, five (25%) of 20 lesions interpreted as a pure focal asymmetric density without calcifications proved to be malignancies (Table 1).

Mass shape.—In describing mass shape, a random distribution of use of descriptors was seen, with all terms (round, oval, lobulated, and irregular) equally likely to be used (κ = 0.28).

Mass density.—The overall kappa value of 0.40 that we found for mass density is comparable to the relatively low interobserver agreement previously shown by Jackson et al. [14]. In particular, no agreement was seen for density termed "lower than that of surrounding parenchyma" (κ = -0.01).

Microcalcification morphology.—Overall agreement was fair (κ = 0.36) for microcalcification morphology and moderate (κ = 0.47) for description of microcalcification distribution (Table 2). For lesions considered pure calcifications (without mass or associated density), only fair agreement was seen in use of the term "coarse, benign" (κ = 0.29). When calcifications were described as coarse and benign by one observer, they were more likely to be considered punctate or pleomorphic by another (Fig. 4). One unusual recurrent malignancy proved particularly problematic (Fig. 5), being considered benign or probably benign by all observers. This one lesion accounted for four of five of the interpretations of malignant calcifications considered coarse or benign.



 
Fig. 4. —77-year-old woman with calcifications caused by ductal carcinoma in situ (DCIS). Global magnification mammogram in true lateral projection shows several groups of calcifications (arrows) due to multifocal DCIS. One observer considered these coarse in morphology, multiple groups in distribution, and probably benign (short-term follow-up recommended, BI-RADS category 3). Another considered these to be punctate, multiple groups, and suspicious (biopsy recommended, BI-RADS category 4). Three observers considered these pleomorphic (one clustered, one linearly distributed, and one multiple groups) with assessments suspicious (BI-RADS category 4), highly suggestive of malignancy (BI-RADS category 5), and needs additional evaluation (BI-RADS category 0), respectively.

 


 
Fig. 5. —56-year-old woman with recurrent carcinoma manifesting as unusual microcalcifications (arrows) on spot magnification mammogram. At histopathology, calcifications were in necrotic center of infiltrating ductal carcinoma that developed at site of lumpectomy 8 years earlier for ductal carcinoma in situ. Four observers considered these calcifications to be coarse: three assessed them as benign (BI-RADS category 2) and one as probably benign (BI-RADS category 3). One observer described these as amorphous and benign (BI-RADS category 2).

 

Although good agreement was seen on use of the term "milk of calcium" (κ = 0.71), such calcifications were more likely to be described as punctate or pleomorphic by another observer (Fig. 6A,6B). All lesions described as milk of calcium were considered benign (BI-RADS category 2) or probably benign (BI-RADS category 3) and proved to be benign.



 
Fig. 6A. —57-year-old woman with microcalcifications in apocrine metaplasia. Spot magnification mammograms in craniocaudal (A) and mediolateral oblique (B) projections show several clusters of microcalcifications (arrows) due to fibrocystic changes, with microcysts and apocrine metaplasia. Observers variably described these as milk of calcium, benign (BI-RADS category 2), punctate or pleomorphic in multiple clusters and suspicious (BI-RADS category 4), or pleomorphic in segmental distribution and highly suggestive of malignancy (BI-RADS category 5).

 


 
Fig. 6B. —57-year-old woman with microcalcifications in apocrine metaplasia. Spot magnification mammograms in craniocaudal (A) and mediolateral oblique (B) projections show several clusters of microcalcifications (arrows) due to fibrocystic changes, with microcysts and apocrine metaplasia. Observers variably described these as milk of calcium, benign (BI-RADS category 2), punctate or pleomorphic in multiple clusters and suspicious (BI-RADS category 4), or pleomorphic in segmental distribution and highly suggestive of malignancy (BI-RADS category 5).

 

Calcifications described as coarse and benign, milk of calcium, or amorphous were more likely to be considered punctate by another observer (Figs. 3A,3B,3C, 5, and 6A,6B). Agreement on "punctate" morphology was only fair (κ = 0.36). Although it has been suggested that this term implies uniform round calcifications that can be considered probably benign (BI-RADS category 3) [15], eight (13%) of 61 interpretations describing punctate microcalcifications considered them suspicious (BI-RADS category 4) and recommended biopsy. Those eight interpretations included one cancerous lesion (Fig. 4) and one lobular carcinoma in situ. Another seven interpretations described cases of atypical ductal hyperplasia as punctate and probably benign. Thus, the positive predictive value of a description of punctate calcifications was one (1.6%) of 61 considering only malignancies, and nine (15%) of 61 when including high-risk lesions as well. Although distribution of calcifications clearly also influences assessment, the analysis is not straightforward: one lesion considered punctate calcifications in a segmental distribution by two observers was assessed as suspicious by one observer and benign by the other. The distribution of the other punctate calcifications assessed as suspicious was described as clustered (n = 4), linear (n = 1), or multiple clusters (n = 2).
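The positive predictive values quoted for punctate calcifications are simple proportions of interpretations. A short check of that arithmetic, using only the counts reported in this paragraph (the helper name `ppv_percent` is hypothetical):

```python
def ppv_percent(positive_interpretations, total_interpretations):
    """Positive predictive value of a descriptor, as a percentage of interpretations."""
    return 100.0 * positive_interpretations / total_interpretations


punctate_total = 61        # interpretations describing punctate calcifications
malignant = 1              # one interpretation of a cancerous lesion
high_risk = 1 + 7          # one lobular carcinoma in situ + seven atypical ductal hyperplasia
biopsy_recommended = 8     # interpretations assessed as BI-RADS category 4

print(round(ppv_percent(biopsy_recommended, punctate_total)))     # 13
print(round(ppv_percent(malignant, punctate_total), 1))           # 1.6
print(round(ppv_percent(malignant + high_risk, punctate_total)))  # 15
```

Note that the denominator is the number of interpretations, not the number of lesions, so a single lesion read by several observers contributes more than once.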

"Amorphous" calcifications were problematic, with only fair agreement on use of the term (κ = 0.25). Again, calcifications described as amorphous by one observer were more likely to be described as pleomorphic or punctate by another. Assessments ranged from benign (BI-RADS category 2) to suspicious, biopsy (BI-RADS category 4), and one (5%) of 20 was malignant.

Only fair agreement was seen in use of the term "pleomorphic" (κ = 0.37). Calcifications described as pleomorphic were given assessments ranging from benign (BI-RADS category 2) to highly suggestive of malignancy (BI-RADS category 5). Of 55 interpretations describing the lesion as pure pleomorphic calcifications, 22 (40%) were malignant (Table 2).

All "branching or fine linear" calcifications were appropriately considered suspicious (BI-RADS category 4) or highly suggestive of malignancy (BI-RADS category 5), and six (43%) of 14 pure branching calcifications without associated mass were malignant. A lesion described by one observer as branching calcifications was more likely described as pleomorphic by another observer (κ = 0.37).

Microcalcification distribution.—The distribution of most lesions manifesting as microcalcifications was termed "clustered" (κ = 0.58). This description was used nearly in proportion to the overall case mix, with 18 (16%) of 113 clustered microcalcifications actually being malignant and final management assessments spanning the spectrum of BI-RADS categories (Table 2).

Very few cases were termed "linear" in distribution, and little agreement existed on use of the term (κ = 0.19). Such calcifications were more likely to be considered clustered or segmental by another observer. Three (30%) of 10 calcifications described as linear were malignant, and only one interpretation assessed (punctate) calcifications in a linear distribution as probably benign (BI-RADS category 3), with the rest considered suspicious (BI-RADS category 4).

Moderate agreement was observed in the use of the descriptor "segmental" (κ = 0.46). Only one observer assessed (punctate) segmental calcifications as benign, as mentioned; one case described as segmental in distribution was considered suspicious, and 12 (86%) of 14 were considered highly suggestive of malignancy (BI-RADS category 5). Ten (71%) of 14 segmental calcifications were actually malignant. This figure compares favorably with the results of Liberman et al. [5], in which 74% of lesions described as segmental and sent for biopsy proved to be malignant.

"Regional" and "diffuse" distributions have been associated with a high likelihood of benignity [4]. Very few lesions were so characterized, and little agreement was seen in the use of either term (κ = 0.29 for regional and κ = 0.08 for diffuse). If calcifications were described as regional in distribution by one observer, they were more likely considered segmental or clustered in distribution by another. Of 10 interpretations describing calcifications as regional, one (10%) was of a malignant lesion. None of those described as diffuse was malignant, and they were more likely to be described as clustered by another observer.

Final assessment.—Table 3 presents interobserver agreement on lesion assessment and management based on screening images and after diagnostic evaluation. Overall kappa value was 0.21 after screening and 0.38 after diagnostic evaluation for BI-RADS final assessment categories, with assessment categories 1 and 2 considered equivalent. Disagreement on final assessment was greatest for lesions placed in BI-RADS categories 3, probably benign (recommend short-interval follow-up), and 4, suspicious (consider biopsy). This disagreement likely reflects a variation in the intervention threshold of individual observers, with some observers biopsying lesions having a low probability of malignancy.



 
TABLE 3 Agreement on Final Assessment for 515 Interpretations of 103 Screening Mammograms and 430 Interpretations of 86 Mammograms for Diagnostic Evaluation

 

Considering together those interpretations with a benign (BI-RADS category 2) or probably benign (BI-RADS category 3) assessment as one group and those recommending either biopsy or immediate additional evaluation as another group, agreement was seen for 70 (68%) of 103 lesions from screening films. Of 69 benign lesions, observers concurred for 39 (57%), recommending additional evaluation in 37 (54%) or classifying them as normal or benign in two (3%) lesions. Agreement was also seen in the recommendation for additional evaluation or biopsy for all four (100%) high-risk lesions and for 27 (90%) of the 30 cancerous lesions. The third observer initially rated one cancerous lesion as normal (Fig. 3A,3B,3C), one as benign, and one as probably benign; and the fifth observer rated one of the same lesions as benign (Fig. 4). As noted, a third cancerous lesion (Fig. 5) was considered benign or probably benign by all observers.

Agreement diminished after additional evaluation compared with screening assessments, with agreement seen in 47 (55%) of 86 cases. Observers concurred for 25 (42%) of 59 benign lesions, one (25%) of four high-risk lesions, and 21 (91%) of 23 cancerous lesions. As mentioned, of 23 cancerous lesions, 21 (91%) were recommended for additional evaluation or biopsy by three observers and 22 (96%) by two observers.
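The management comparison used here can be sketched as follows: each BI-RADS final assessment is collapsed into one of two management groups, and a lesion counts as agreed only when all five observers fall in the same group. The names (`management`, `unanimous`, `count_unanimous`) and the example readings are hypothetical; only the category grouping and the reported counts come from the text.

```python
IMMEDIATE_ACTION = {0, 4, 5}  # additional evaluation or biopsy
FOLLOW_UP = {1, 2, 3}         # routine screening or short-interval follow-up


def management(category):
    """Collapse a BI-RADS final assessment category into a management group."""
    if category in IMMEDIATE_ACTION:
        return "act"
    if category in FOLLOW_UP:
        return "follow"
    raise ValueError(f"unknown BI-RADS category: {category}")


def unanimous(categories):
    """True when every observer's assessment implies the same management."""
    return len({management(c) for c in categories}) == 1


def count_unanimous(lesions):
    """Number of lesions on which all observers agree about management."""
    return sum(unanimous(categories) for categories in lesions)


# Hypothetical readings: one row per lesion, one category per observer.
example = [
    [4, 4, 5, 0, 4],  # all recommend evaluation or biopsy -> agreement
    [2, 3, 1, 2, 2],  # all recommend follow-up -> agreement
    [3, 4, 2, 0, 3],  # split between follow-up and action -> disagreement
]
print(count_unanimous(example))  # 2

# Arithmetic check of the reported post-evaluation agreement.
agreed = 25 + 1 + 21             # benign + high-risk + cancerous lesions
print(agreed, round(100 * agreed / 86))  # 47 55
```

Because agreement is required of all five observers at once, one dissenting reader is enough to move a lesion into the disagreement group.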

Intraobserver Variability
When cases were reassessed by the same observer, substantial disagreement persisted, although to a lesser degree than among different observers (Fig. 7). After diagnostic evaluation, inconsistency in description of mass borders and focal asymmetries ranged from 29% to 57% (mean, 40%) and of microcalcification morphology from 14% to 71% (mean, 49%). Major disagreement in final assessment or patient management—defined as immediate additional evaluation or biopsy (BI-RADS categories 0, 4, or 5) versus follow-up (BI-RADS categories 2 or 3)—occurred in 0-27% of cases. Overall, observers' final assessment was unchanged in 36 (65%) of 55 second interpretations, whereas major disagreement between first and second diagnostic impression was seen in eight (15%) of the 55 repeated interpretations.



Fig. 7. —53-year-old woman with fibroadenoma shown on spot magnification mammogram. Variability was great with this patient. On initial interpretation, two observers considered this a focal asymmetric density (one as highly suggestive of malignancy [BI-RADS category 5], and one as probably benign [BI-RADS category 3]). Another three observers considered this an indistinct mass and suspicious (n = 2, BI-RADS category 4) or highly suggestive (n = 1, BI-RADS category 5) of malignancy. Three observers described associated pleomorphic calcifications. On second interpretation, all observers changed their description. Two observers considered the mass circumscribed and lobulated; one considered it obscured; one, spiculated; and one, microlobulated. Final assessment changed for one observer who initially classified the mass as indistinct and highly suggestive of malignancy (BI-RADS category 5), then reclassified it as circumscribed and benign (BI-RADS category 2).

 


Discussion
Variability in mammographic interpretation can be attributed both to differences in detection of lesions and to variation in lesion characterization and subsequent management. This topic received great attention after publication of results by Elmore et al. [16] in 1995, in which 10 radiologists reviewed 150 mammograms (27 of patients with cancer). Immediate workup of women with cancer was recommended for 74-96% of patients; biopsy was recommended for 33-82% of patients with cancer on the basis of standard images; immediate workup was recommended for 11-65% (and biopsy for 3-20%) of women without cancer. It has been proposed [17, 18] that much of this variation is attributable to variation in intervention threshold, with observers being plotted at differing points along nearly the same receiver operating characteristic curve (plot of true-positive fraction against false-positive fraction) [19]. A high sensitivity (true-positive rate) for some observers was achieved at the expense of a low specificity (high false-positive rate). Further, eight of the cancerous lesions selected were misinterpreted initially, suggesting a bias toward subtle cases [17]. Similar results were observed in a study by Beam et al. [20] in which a broad sample of 108 radiologists reviewed screening mammograms from 79 women (45 with cancer). Again, variations in both detection and intervention threshold contributed to interobserver variability. Screening sensitivity ranged from 47% to 100% and specificity from 36% to 99% [20].
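The intervention-threshold argument can be illustrated with a minimal sketch. The suspicion scores below are hypothetical (they are not data from Elmore et al. or from this study), and `operating_point` is an illustrative helper name. Two readers who score the cases identically but act at different thresholds land on different points of the same receiver operating characteristic curve.

```python
def operating_point(scores_cancer, scores_benign, threshold):
    """Return (sensitivity, specificity) when every case scored at or above
    `threshold` is called positive (recommended for workup or biopsy)."""
    true_pos = sum(s >= threshold for s in scores_cancer)
    false_pos = sum(s >= threshold for s in scores_benign)
    sensitivity = true_pos / len(scores_cancer)
    specificity = 1 - false_pos / len(scores_benign)
    return sensitivity, specificity

# Hypothetical suspicion scores on a 1-5 scale (assumed for illustration only).
cancer = [2, 3, 4, 4, 5, 5, 5, 3, 4, 5]
benign = [1, 1, 2, 2, 2, 3, 3, 1, 4, 2]

# Same scores, different intervention thresholds.
aggressive = operating_point(cancer, benign, threshold=3)
conservative = operating_point(cancer, benign, threshold=4)

print("threshold 3: sensitivity %.2f, specificity %.2f" % aggressive)
print("threshold 4: sensitivity %.2f, specificity %.2f" % conservative)
```

With these assumed scores, lowering the threshold raises sensitivity and lowers specificity, which is the trade-off described above; neither reader has ranked the cases differently from the other.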

Kerlikowske et al. [21] reported 78% agreement (κ = 0.58) in the final assessments of two observers and 86% intraobserver agreement (κ = 0.73) in evaluation of 2616 mammograms. Those investigators did not use category 3, probably benign, because theirs was a screening assessment. From 73% to 78% of cases with cancer were considered abnormal by those observers, and fair to moderate agreement was described for feature analysis [21]. High breast density was shown to increase disagreement twofold in both lesion detection and final assessment [21]. Far less agreement was seen in the earlier work of Boyd et al. [22, 23], in which nine radiologists reviewed 100 xeromammograms. Between pairs of radiologists, kappa values ranged from 0.17 to 0.55 for diagnostic assessment, from 0.16 to 0.70 for the presence or absence of a mass, and from 0.34 to 0.73 for the presence or absence of calcifications [23].
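Percent agreement and kappa can diverge because kappa discounts the agreement expected by chance. The following is a minimal sketch of the two-reader kappa statistic of Cohen [9]; the tables of counts are hypothetical, chosen only to show the effect, and `cohen_kappa` is an illustrative name rather than the software used in any of the cited studies.

```python
def cohen_kappa(table):
    """Return (observed agreement, kappa) for a square table of counts.
    Rows are reader A's categories, columns are reader B's categories."""
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(table[i]) for i in range(k)]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance if the two readers rated independently.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical 2 x 2 tables (abnormal vs normal), 100 cases each.
balanced = [[40, 10],
            [10, 40]]
skewed = [[85, 5],
          [5, 5]]

for name, table in (("balanced", balanced), ("skewed", skewed)):
    observed, kappa = cohen_kappa(table)
    print("%s: agreement %.0f%%, kappa %.2f" % (name, 100 * observed, kappa))
```

In the skewed table, raw agreement is higher than in the balanced one, yet kappa is lower, because most of that agreement would be expected by chance when one category dominates.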

We sought to simplify the analysis of variability in mammographic interpretation by excluding detection as a variable and evaluating only feature analysis and lesion management for marked lesions, using BI-RADS standardized terminology. As with many clinicians in practice, our investigators were not trained specifically in the terminology but do use the final assessment categories in their routine practice. Further, they were given ordered lists of the terminology by increasing likelihood of malignancy according to the criteria set forth by D'Orsi and Kopans [4] and were asked to use the single most worrisome applicable descriptor from each category.

We found disagreements between observers in clinically significant management (biopsy versus follow-up) in 32% (33/103) of screening interpretations and in 45% (39/86) after diagnostic evaluation. Observers disagreed with themselves in management 15% (8/55) of the time. The larger source of variation appeared to be lesion description, with disagreement on description of 38 (84%) of 45 mass borders on screening and 20 (63%) of 32 after diagnostic evaluation. Similarly, we found disagreement on 33 (70%) of 47 descriptions of microcalcification morphology on screening and 33 (75%) of 44 after diagnostic evaluation. Even when the same lesions were described using the same terminology, clinically significant variations in management occurred. In particular, one ductal carcinoma in situ was described as clustered pleomorphic microcalcifications by all five observers and was interpreted as probably benign by one, whereas the others all recommended biopsy. Despite the variability, the overall performance of our observers was outstanding, with recommendations for additional evaluation or biopsy (BI-RADS categories 4, 5, or 0) of 90-97% of cancerous lesions on screening and 91-96% after diagnostic evaluation.
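The management disagreement rates quoted here are the complements of the agreement counts given in the Results (70 of 103 on screening, 47 of 86 after diagnostic evaluation). A short arithmetic check, with `pct` as an illustrative helper name:

```python
def pct(count, total):
    """Percentage rounded to the nearest whole number."""
    return round(100 * count / total)

# (lesions with agreement, total lesions), as reported in the Results.
agreement = {"screening": (70, 103), "diagnostic": (47, 86)}

# Disagreement is whatever remains after the lesions with agreement.
disagreement = {label: (total - agreed, total)
                for label, (agreed, total) in agreement.items()}

for label, (count, total) in disagreement.items():
    print("%s: disagreement %d/%d = %d%%" % (label, count, total, pct(count, total)))

# Intraobserver major disagreement, reported directly as 8 of 55.
print("intraobserver: %d%%" % pct(8, 55))
```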

Our results are similar in magnitude to variability in chest radiograph interpretation as studied in the 1950s. Yerushalmy et al. [24] found a 1-in-3 chance that two observers disagreed as to progression, regression, or stability of inflammatory disease and a 1-in-5 chance of the same observer disagreeing with him- or herself. Although such variation is inherent in observational studies, training in mammographic lesion description at least has the potential to reduce variability of feature analysis. If correlation of feature analysis and management can be further substantiated, we may be able to reduce the variability in management as well.

Our study likely underestimates true variations in practice for two major reasons. We did not allow variation in detection, because the lesions were marked on the films. Further, overlap in professional activities of our observers would be expected to diminish variability in lesion management.

From a medicolegal perspective, the more specifically descriptors are correlated with a specific risk of malignancy, the more a standard of management will apply when specific terminology is used. Thus it would be beneficial for the mammography community to establish which terminology is useful in predicting management (probably largely at the extremes of benign and highly suspicious) and to expect consistency in those cases. Obviously, variability will exist in the management of indeterminate lesions as a function of interventional thresholds. Again, clarification of the appearances of such lesions, understanding of this variability, and definition of reasonable practice standards for those lesions are in the best interest of all radiologists.

We found little consistency for the use of descriptors of mass shape or density and would propose that management be based largely on the appearance of the border of mass lesions. Indeed, we found that despite variability in describing individual cases, aggregate use of particular terms correlated with the expected risk of malignancy. "Circumscribed" masses have been shown to have a risk of malignancy of less than 2% [13, 15] and were consistently described and managed as benign or probably benign in our study, with variability largely due to the use of immediate sonography or short-term follow-up. Only one (4%) of 28 interpretations describing a mass as circumscribed, round, or oval was of a malignancy in our series. Shape may play a role with circumscribed masses, in that circumscribed gently lobulated masses were more likely malignant (4/33 [12%]) in our series. Similarly, "spiculated" masses were uniformly considered suspicious or highly suggestive of malignancy, and 12 (75%) of 16 such lesions were malignant. From the work of D'Orsi and Kopans [4] and others, it seems reasonable that a mass that is solid but not circumscribed on a baseline mammogram generally warrants biopsy. Thus, we were surprised to see several lesions described as "indistinct" yet assessed as benign. The inappropriate classification of such lesions as benign (BI-RADS category 2) would appear to be an area for greater training, and obviously sonography will play a substantial role in the evaluation of mammographically and clinically indeterminate lesions.

Distinguishing normal variant asymmetric densities from indistinct masses would appear to be another area for greater training. Sickles [25] has recently shown success in short interval follow-up of densities seen on only one view, although relatively little has been published to provide criteria for such a screening assessment, and spot compression or sonography may be needed to adequately evaluate many such densities. As in the study of Baker et al. [7], we found that even expert mammographers have difficulty distinguishing normal variant focal asymmetric densities from indistinct masses. Management is predicated on such perceptual differences. In one study [26], most cases of missed breast cancer that were visible in retrospect but not on blinded review were asymmetric densities. Comparison with old films may be helpful for such lesions.

For calcifications, training in identifying suspicious morphology or distribution is needed. When either suspicious morphology or suspicious distribution is seen, biopsy is appropriate [26]. We expect to continue to see variability in the use of the term "punctate" as has been seen in other work [7], but we may see improved performance if appropriate management is clarified. For example, a cluster of round punctate calcifications may be probably benign (BI-RADS category 3) [15], but similar calcifications in a segmental or linear distribution would be at least suspicious (BI-RADS category 4) [27]. Diffuse punctate calcifications may reasonably be considered benign [1]. Further data on this topic are needed. Increased familiarity with the illustrated BI-RADS lexicon [2] and training of interpreting physicians in the use of BI-RADS are expected to enhance consistency in description of lesions and will allow data analysis across multiple sites, ensuring congruency between description and subsequent management.

We noted great variation in the management of clustered amorphous calcifications, which likely reflects variation in the interventional threshold for these low-suspicion lesions. Further data establishing the likelihood of malignancy of indeterminate calcifications with description of distribution are needed, as is consensus about the appropriate threshold for intervention (e.g., >2% risk of malignancy). Because it is the decision to monitor or to biopsy that most influences patient care, greater efforts are needed to develop standardized cases emphasizing probably benign and suspicious lesions to minimize variation in interventional thresholds among radiologists.

In our study, inconsistency was greatest for probably benign and suspicious assessments. As stated, the results of Sickles [13, 15, 25] and Varas et al. [28] suggest reliable criteria exist for probably benign lesions, and Orel et al. [29] recently confirmed that only three (2%) of 141 lesions prospectively classified as probably benign (BI-RADS category 3) before biopsy proved to be malignant. Broader verification of the use of category 3 and the availability of examples of appropriate lesions for this assessment are needed. Such an assessment is not to be made generally on screening images alone, but only after a complete diagnostic workup. Furthermore, practitioners are encouraged to audit their outcomes [30] for lesions placed in category 3 and may reconsider their individual practice if the rate of malignancy exceeds 2% or if delays in diagnosis are affecting patient outcomes (i.e., lymph node status and stage of disease at diagnosis).
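The audit suggested above amounts to comparing the observed malignancy rate among category 3 lesions with the 2% figure. A minimal sketch follows; the counts are hypothetical audit results, not data from this study or from the cited series, and `audit_category3` is an illustrative name.

```python
def audit_category3(malignant, total, threshold=0.02):
    """Return (malignancy rate, whether the rate exceeds the threshold)
    for lesions assessed as probably benign (BI-RADS category 3)."""
    if total <= 0:
        raise ValueError("total must be a positive number of lesions")
    rate = malignant / total
    return rate, rate > threshold

# Hypothetical audits of two practices (assumed counts).
for name, malignant, total in (("practice A", 4, 300), ("practice B", 9, 300)):
    rate, exceeds = audit_category3(malignant, total)
    print("%s: %.1f%% malignant, exceeds 2%%: %s" % (name, 100 * rate, exceeds))
```

With small series the observed rate is imprecise, so a result near the threshold is better read as a prompt for review than as a verdict.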

In summary, the BI-RADS lexicon is an important step forward in standardizing reporting. Variability is inherent in the practice of radiology and is not necessarily problematic. Indeed, agreement was seen in the recommendation for additional evaluation or biopsy for all four high-risk lesions and for 27 (90%) of 30 malignant lesions despite substantial variability in feature analysis and case-by-case management. It is our collective responsibility to identify areas in existing practices in which even experts have difficulty and to improve on those areas with broad data collection, training, and education. Ultimately, such efforts will help guide us all to improve our accuracy in reporting, to diagnose cancer at its earliest stage, and to avoid unnecessary biopsy of benign lesions.


Acknowledgments
 
We thank the observers who participated in this study: Cecilia Brennecke, Judy Destouet, Nagi Khouri, Barbara Savader, and Rosy Singh. Without their support, this study could not have been performed. We also thank the Susan G. Komen Breast Cancer Foundation for their continuing support.


References

  1. American College of Radiology. Breast Imaging Reporting and Data System, 2nd ed. Reston, VA: American College of Radiology, 1995
  2. American College of Radiology. Illustrated Breast Imaging Reporting and Data System (BI-RADS), 3rd ed. Reston, VA: American College of Radiology, 1998
  3. D'Orsi CJ, Getty DJ, Swets JA, Pickett RM, Seltzer SE, McNeil BJ. Reading and decision aids for improved accuracy and standardization of mammographic diagnosis. Radiology 1992;184:619-622
  4. D'Orsi CJ, Kopans DB. Mammographic feature analysis. Semin Roentgenol 1993;28:204-230
  5. Liberman L, Abramson AF, Squires CB, Glassman JR, Morris EA, Dershaw DD. The Breast Imaging Reporting and Data System: positive predictive value of mammographic features and final assessment categories. AJR 1998;171:35-40
  6. Baker JA, Kornguth PJ, Lo JY, Williford ME, Floyd CE Jr. Breast cancer: prediction with artificial neural network based on BI-RADS standardized lexicon. Radiology 1995;196:817-822
  7. Baker JA, Kornguth PJ, Floyd CE Jr. Breast Imaging Reporting and Data System standardized mammography lexicon: observer variability in lesion description. AJR 1996;166:773-778
  8. Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986;21:720-733
  9. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37-46
  10. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-174
  11. Landis JR, Koch GG. A one-way component of variance model for categorical data. Biometrics 1977;33:671-679
  12. Svanholm H, Starklint H, Gunderen HJG, Fabricius J, Barlebo H, Olsen S. Reproducibility of histomorphologic diagnoses with special reference to the kappa statistic. APMIS 1989;97:689-698
  13. Sickles EA. Nonpalpable, circumscribed, noncalcified solid breast masses: likelihood of malignancy based on lesion size and age of patient. Radiology 1994;192:439-442
  14. Jackson VP, Dines KA, Bassett LW, Gold RH, Reynolds HE. Diagnostic importance of the radiographic density of noncalcified breast masses: analysis of 91 lesions. AJR 1991;157:25-28
  15. Sickles EA. Periodic mammographic follow-up of probably benign lesions: results in 3184 consecutive cases. Radiology 1991;179:463-468
  16. Elmore JG, Wells CK, Lee CH, Howard DH, Feinstein AR. Variability in radiologists' interpretations of mammograms. New Engl J Med 1995;331:1493-1499
  17. Kopans DB. The accuracy of mammographic interpretation. New Engl J Med 1994;331:1521-1522
  18. D'Orsi CJ, Swets JA. Variability in the interpretation of mammograms (letter). New Engl J Med 1995;332:1172
  19. Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 1989;24:234-245
  20. Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists: findings from a national sample. Arch Intern Med 1996;156:209-213
  21. Kerlikowske K, Grady D, Barclay J, et al. Variability and accuracy in mammographic interpretation using the American College of Radiology Breast Imaging Reporting and Data System. J Natl Cancer Inst 1998;90:1801-1809
  22. Boyd NF, Wolfson C, Moskowitz M, et al. Observer variation in the interpretation of xeromammograms. J Natl Cancer Inst 1982;68:357-363
  23. Boyd NF, Wolfson C, Moskowitz M, et al. Observer variation in the classification of mammographic parenchymal patterns. J Chron Dis 1986;39:465-472
  24. Yerushalmy J, Garland LH, Harkness JT, et al. An evaluation of the role of serial chest roentgenograms in estimating the progress of disease in patients with pulmonary tuberculosis. Am Rev Tuberc 1951;64:225-248
  25. Sickles EA. Findings at mammographic screening on only one projection: outcomes analysis. Radiology 1998;208:471-475
  26. Harvey JA, Fajardo LL, Innis CA. Previous mammograms in patients with impalpable breast carcinoma: retrospective vs blinded interpretation. AJR 1993;161:1167-1172
  27. D'Orsi CJ. The American College of Radiology mammography lexicon: an initial attempt to standardize terminology (commentary). AJR 1996;166:779-780
  28. Varas X, Leborgne F, Leborgne JH. Nonpalpable, probably benign lesions: role of follow-up mammography. Radiology 1992;184:409-414
  29. Orel SG, Kay N, Reynolds C, Sullivan DC. BI-RADS categorization as a predictor of malignancy. Radiology 1999;211:845-850
  30. Sickles E. Auditing your practice. In: Categorical course in breast imaging. Oak Brook, IL: Radiological Society of North America, 1995:81-91
