May 2005, Volume 184, Number 5

Research

Fundamentals of Clinical Research for Radiologists

Reader Agreement Studies

Affiliation: Health Services Research and Development Service (124), Department of Veterans Affairs, 810 Vermont Ave., NW, Washington, DC 20420.

Citation: American Journal of Roentgenology 2005;184:1391–1397. DOI: 10.2214/ajr.184.5.01841391


This article presents several approaches for evaluating reader agreement. The dominant technique in the radiology literature is weighted and unweighted Cohen's kappa and the associated measure, percent agreement. Percent agreement is an intuitive approach to measuring agreement but does not adjust for chance. Kappa provides a measure of agreement beyond that which would be expected by chance, as estimated by the observed data. Both the bi-rater and multirater kappa statistics have several limitations that are difficult to resolve. Although there are alternative approaches to measuring agreement, kappa remains the most commonly used measure.

Reader agreement studies have an important role in advancing radiology practice, technique, training, and quality control. Extremely common in the radiology literature, reader agreement studies determine the magnitude of agreement between or among readers. Potential applications include developing reliable diagnostic rules [1], understanding variability in treatment recommendations [2], evaluating the effects of training on interpretation consistency [3], determining the reliability of classification systems (lexicon development) [4], and comparing the consistency of different sources of medical information [5]. Agreement studies should not be confused with studies of accuracy, in which measures of sensitivity and specificity and ROC curves are commonplace for comparisons when a reference standard (known truth) exists. Whereas accuracy studies evaluate the validity of a measure and require a reference standard, agreement studies focus on the reliability of evaluations between different readers or in the same reader on different occasions and do not require a reference standard.

Several methods are available for evaluating reader agreement, but the dominant technique in the radiology literature is weighted and unweighted Cohen's kappa and the associated measure, percent agreement. Because of the popularity of kappa in radiology research, this paper will focus on bi-rater and multirater kappa. Included in this presentation will be a discussion of the basic data requirements, calculation formulas, interpretation of the kappa coefficient as a measure of strength of agreement, and statistical significance testing. This discussion will be followed by an exploration of several limitations of kappa, especially those that pertain to comparability across studies. Formulas are provided in sufficient detail for those who wish to replicate the calculations, but an in-depth understanding of the mathematics is not necessary to appreciate the application and limitations of kappa.

Bi-Rater Kappa

Cohen's kappa is a common technique for estimating paired interrater agreement for nominal and ordinal-level data [6]. Kappa is a coefficient that represents agreement obtained between two readers beyond that which would be expected by chance alone [7]. A value of 1.0 represents perfect agreement. A value of 0.0 represents no agreement. Although such instances are rare, kappa can also exhibit negative values when observed agreement is less (worse) than chance. Key assumptions for using kappa include the following: elements being rated (images, diagnoses, clinical indications, and so forth) are independent of each other, one rater's classifications are made independently of the other rater's classifications, the same two raters provide the classifications used to determine kappa, and the rating categories are independent of one another [8]. The last assumption may be difficult to satisfy in some imaging studies in which there are subtle differences in lesion characteristics and decision criteria. When differences between rating categories are not clear, careful study design is essential to maximize the independence among rating categories. Alternatives include dropping confusing categories or merging related categories. Although not always possible, adjustments in the classification scheme should be consistent with clinical practice.

Bi-rater kappa is used to test the hypothesis that agreement exists between two raters beyond that which would be expected by chance. Bi-rater kappa provides a measure of the relative intensity of agreement or disagreement between two readers rating the same elements using an identical classification system. A two-by-two contingency table illustrates hypothetic data in which two readers independently viewed the same set of 100 images from diagnostic mammograms with a simple classification criterion, malignant or benign (Table 1). To estimate kappa, both raters must use the same number of rating criteria so that the number of columns representing the rating categories used by rater 1 equals the number of rows representing the rating categories used by rater 2. Kappa is calculated using the formula κ = (po − pe) / (1 − pe), where po is the proportion of cases in which agreement exists between two raters, and pe is the proportion of cases in which raters would agree by chance.

TABLE 1 Two Readers Evaluating 100 Images (Counts)

If we divide each cell count by the total sample size (n = 100), a matrix of probabilities is created (Table 2). Each cell contains the proportion of the total number of images (n = 100), not the count. As an example, the proportion of images in which reader 1 and reader 2 agree that an image is benign is 0.20 (20/100) or 20% (0.20 × 100).

TABLE 2 Two Readers Evaluating 100 Images (Proportions)

The overall proportion of readings in which reader 1 and reader 2 agree is calculated by summing the diagonal probabilities in Table 2: po = 0.20 + 0.60 = 0.80.

This “proportion agreement” is converted to a percentage and reported as “percent agreement.” The interpretation of percent agreement is straightforward: Reader 1 and reader 2 agreed with each other on 80% of the classifications. The approach and calculations are the same for larger tables in which readers must consider more than two options in their decision making. Using as an example the American College of Radiology's BI-RADS lexicon [9] for final assessment, agreement could be based on each reader assigning each case to one of four categories: benign, probably benign, suspicious, or highly suggestive of malignancy. The resulting data would be reported in a four-by-four table in which the sum of the probabilities in the four diagonal cells represents the proportion agreement (po).

The advantage of the kappa statistic over percent agreement is its adjustment for the proportion of cases in which the raters would agree by chance alone. Because we are unlikely to know the true value of chance, the marginal probabilities from the observed data are used to estimate a surrogate for chance. The proportions in the total column and in the total row represent the marginal probabilities. Chance agreement is derived from the observed data, so it will likely change if different readers evaluate the same images. Using the marginal probabilities in Table 2, the proportion of chance agreement (pe) is computed as follows: pe = (0.30 × 0.30) + (0.70 × 0.70) = 0.09 + 0.49 = 0.58.

Once the proportion of observed agreement (po) and the proportion of chance agreement (pe) are established, kappa is calculated using the formula: κ = (po − pe) / (1 − pe) = (0.80 − 0.58) / (1 − 0.58) ≈ 0.52.

Using a common interpretation guideline offered by Landis and Koch [7], a kappa of 0.52 reflects a moderate level of agreement (Table 3).
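The calculation above can be sketched in a few lines of Python. The 2 × 2 counts used here are a hypothetical table chosen to be consistent with the values reported in the text (po = 0.80, pe = 0.58, κ ≈ 0.52); the function works for any square contingency table, including the four-by-four BI-RADS example.

```python
# Percent agreement and Cohen's kappa from a square contingency table of counts.
# The counts below are hypothetical but consistent with the worked example
# in the text (po = 0.80, pe = 0.58, kappa = 0.52).

def cohens_kappa(table):
    """Return (po, pe, kappa) for a square table of agreement counts."""
    n = sum(sum(row) for row in table)
    k = len(table)
    # observed agreement: sum of the diagonal proportions
    po = sum(table[i][i] for i in range(k)) / n
    # chance agreement: product of the marginal proportions, summed over categories
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    pe = sum(row_marg[i] * col_marg[i] for i in range(k))
    return po, pe, (po - pe) / (1 - pe)

table = [[20, 10],   # reader 1 benign:    reader 2 benign, reader 2 malignant
         [10, 60]]   # reader 1 malignant: reader 2 benign, reader 2 malignant
po, pe, kappa = cohens_kappa(table)
```

Note that the function makes no distinction between two and more rating categories; a four-category BI-RADS table would simply be passed in as a 4 × 4 list of counts.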

TABLE 3 Interpretation Guidance for Strength of Agreement (Landis and Koch [7]): κ < 0.00, poor; 0.00–0.20, slight; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; 0.81–1.00, almost perfect

Statistical Significance

To test the null hypothesis that the kappa coefficient is not different from zero (i.e., no better than chance), an estimate of the standard error (SE) for a one-sample test is calculated from the formula [10]: SE = √[pe / (n(1 − pe))], where n is the total number of cases.

A kappa test statistic is compared with the standard normal distribution. The equation for obtaining the test statistic is as follows: z = κ / SE.

Using a one-tailed test, the test statistic is statistically significant because it exceeds the critical value of 1.645 (alpha, 0.05) [6]. This result supports the alternative hypothesis that the kappa coefficient is different from zero (i.e., better than chance).

Although some effort has been directed toward estimating sample size requirements for comparisons among two or more kappa coefficients [11, 12], methods for calculating power for one kappa coefficient have not received much attention [11]. As a general rule of thumb, 30 cases with two readers is a reasonable minimum sample size as long as a moderate-level or better kappa coefficient (κ > 0.40) is expected and you want to show that kappa is different from a value of zero.

Confidence Intervals

For estimating confidence intervals, a different formula is used for SE [10]; other more accurate but more complicated formulas for SE are also available [6, 13, 14]: SE = √[po(1 − po) / (n(1 − pe)²)] = √[(0.80 × 0.20) / (100 × 0.42²)] ≈ 0.095.

Given an estimate of kappa of 0.52, the 95% confidence interval would be 0.33–0.71: 0.52 ± (1.96 × 0.095) = 0.52 ± 0.19.
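The significance test and confidence interval can be sketched together, assuming the simple large-sample SE formulas used here (the article's exact expressions may differ slightly); the inputs are the worked-example values po = 0.80, pe = 0.58, and n = 100.

```python
import math

# Hypothesis test and 95% CI for the worked example. The SE formulas below
# are simple large-sample approximations, used here as an illustrative sketch.

n, po, pe = 100, 0.80, 0.58
kappa = round((po - pe) / (1 - pe), 2)   # 0.52, rounded as reported in the text

# SE under the null hypothesis (kappa = 0), for the one-sample z-test
se_null = math.sqrt(pe / (n * (1 - pe)))
z = kappa / se_null                      # compare with 1.645 (one-tailed, alpha 0.05)

# SE used for confidence intervals
se_ci = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
lower = kappa - 1.96 * se_ci
upper = kappa + 1.96 * se_ci             # reproduces the reported 0.33-0.71 interval
```

Rounding kappa to 0.52 before constructing the interval, as done here, reproduces the 0.33–0.71 bounds reported in the text.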

Weighted Kappa

Kappa treats disagreements the same regardless of whether a close decision on a rank-ordered classification system has clinical relevance. As an example, on a rank-ordered rating scale of benign, probably benign, suspicious, and highly suspicious of malignancy, one rater's conclusion that a lesion is suspicious and the other rater's conclusion that the same lesion is highly suspicious may lead to the same clinical decision: immediate follow-up with biopsy. A disagreement between these two categories is therefore much less important than one in which one rater rates a lesion as highly suspicious and the other rates the same lesion as probably benign.

Weighted kappa was developed to provide partial credit for disagreements that are close on an ordered scale. The observed and expected proportions in each cell are multiplied by a weight before kappa is calculated. Weights can be established a priori (before data collection) using clinical experience [10], or they can be calculated after data collection using a simple algorithm that applies the same weighting strategy regardless of the data characteristics or rating criteria. Weighted kappa and unweighted kappa will be the same when there are only two decision categories. An example based on the BI-RADS classification system is provided in Appendix 1. For another example of calculating kappa weights, see Kundel and Polansky [15].

APPENDIX 1. Weighted Kappa

TABLE 6 Calculations for Weighted Kappa: Cell Counts

TABLE 7 Calculations for Weighted Kappa: Proportions Observed and Proportions Expected

TABLE 8 Calculations for Weighted Kappa: Quadratic Cell Weights (w)

TABLE 9 Calculations for Weighted Kappa: Weighted Proportions Observed and Weighted Proportions Expected

Special Considerations When Using Bi-Rater Kappa

For small sample sizes, kappa may be underestimated. In this case, a resampling technique (jackknifing) can be used to calculate an unbiased estimate of kappa [8]. Kappa may also be lower if the number of decision categories is excessive. Possible responses to compensate for this effect are to use weighted kappa if the categories are rank-ordered or to combine similar categories, or both. In any good study design, the choice of a weighting or classification scheme should be addressed and resolved before data collection. Overall, the precision (SE) of kappa is expected to improve as the number of patients and raters increases [16]. Although the preceding discussion was limited to two raters, the next section presents a technique for improving precision by comparing more than two raters.
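The jackknife resampling mentioned above can be sketched as follows; the case-level pairs are expanded from the hypothetical 100-case table used earlier, and with a sample this large the bias correction is, as expected, very small.

```python
# Leave-one-out jackknife bias correction for Cohen's kappa, one resampling
# approach to the small-sample bias noted in the text. The 2x2 counts are the
# hypothetical worked example (po = 0.80, kappa = 0.52).

def kappa_from_pairs(pairs):
    """Cohen's kappa from a list of (reader1, reader2) classifications."""
    n = len(pairs)
    cats = sorted({c for p in pairs for c in p})
    po = sum(a == b for a, b in pairs) / n
    pe = sum((sum(a == c for a, _ in pairs) / n) * (sum(b == c for _, b in pairs) / n)
             for c in cats)
    return (po - pe) / (1 - pe)

# expand the contingency table into one (reader1, reader2) pair per case
table = {("b", "b"): 20, ("b", "m"): 10, ("m", "b"): 10, ("m", "m"): 60}
pairs = [cell for cell, count in table.items() for _ in range(count)]

k_full = kappa_from_pairs(pairs)
n = len(pairs)
# recompute kappa with each case left out, then apply the jackknife correction
loo = [kappa_from_pairs(pairs[:i] + pairs[i + 1:]) for i in range(n)]
k_jack = n * k_full - (n - 1) * (sum(loo) / n)   # bias-corrected estimate
```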

Multirater Generalized Kappa

When there are more than two raters, generalized kappa is the recommended approach for evaluating interrater agreement [6, 13, 17]. This statistic measures the degree to which interpretation variability arises from differences among cases relative to differences among readers interpreting the same case. It is analogous to the analysis of variance and the intraclass correlation coefficient used to assess agreement when ratings are made on a continuous scale.

The discussion that follows focuses on estimating agreement among more than two raters, when the number of raters is kept constant and the number of rating categories is greater than two. Slight modifications in the calculations are required when generalized kappa is estimated for only two rating categories or when the number of raters does not remain constant from one classification to another (see Fleiss [6] for alternative calculations). The approach presented here satisfies the likely characteristics of a prospective imaging study design [3].

Table 4 presents hypothetic data for five raters evaluating imaging from 10 patients using three decision categories—benign, suspicious, and malignant. The formulas that follow are from Woolson [13]. Assume the following notation: N = total number of patients, K = total number of raters, R = number of decision categories, and nij = number of raters who classified patient i (rows in Table 4) in category j (columns in Table 4).

TABLE 4 Ratings by Five Radiologists for 10 Patients

The proportion (pj) of all classifications that fall within each decision category is presented at the bottom of each column. In this example, 0.40 (40%) of the classifications are in the benign category, 0.24 (24%) are suspicious, and 0.36 (36%) are classified as malignant.

For each patient, the proportion of all possible pairings on which radiologists agree is calculated using the formula: Pi = [Σj nij(nij − 1)] / [K(K − 1)].

For patient 1, this calculation is applied to the counts in the first row of Table 4.

The proportion of pairs agreeing for each patient is provided in the right column of Table 4. The overall proportion of agreement (P̄) is the mean agreement across all patients, or 0.62. In other words, we estimate that, on average, any two of the five radiologists will agree on a classification about 62% of the time.

As in bi-rater kappa, a correction for chance agreement is necessary to calculate the kappa coefficient. To estimate chance agreement for generalized kappa, the proportion (pj) of classifications in each decision category is squared and summed. For Table 4, the expected chance agreement is: pe = (0.40)² + (0.24)² + (0.36)² = 0.3472 ≈ 0.35.

Using the proportion of observed agreement and chance agreement, the generalized kappa statistic is: κ = (0.62 − 0.3472) / (1 − 0.3472) ≈ 0.42.
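The multirater calculation can be sketched as follows. Because the case-level counts of Table 4 are not reproduced here, the rating matrix below is hypothetical, but it is constructed so that its column proportions (0.40, 0.24, 0.36) and mean pairwise agreement (0.62) match the values reported in the text.

```python
# Generalized (Fleiss') kappa for K raters classifying N patients into R
# categories. The rating matrix is hypothetical but consistent with the
# proportions reported in the text for Table 4.

def fleiss_kappa(counts):
    """counts[i][j] = number of raters placing patient i in category j."""
    n = len(counts)                       # number of patients
    k = sum(counts[0])                    # raters per patient (held constant)
    total = n * k
    # proportion of all classifications falling in each category
    p = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    # per-patient proportion of agreeing rater pairs: (sum nij^2 - K) / (K(K-1))
    p_i = [(sum(c * c for c in row) - k) / (k * (k - 1)) for row in counts]
    p_bar = sum(p_i) / n                  # overall observed agreement
    p_e = sum(pj ** 2 for pj in p)        # chance agreement
    return p_bar, p_e, (p_bar - p_e) / (1 - p_e)

ratings = [  # columns: benign, suspicious, malignant (5 raters, 10 patients)
    [5, 0, 0], [5, 0, 0], [0, 5, 0], [0, 0, 5], [0, 1, 4],
    [0, 2, 3], [3, 1, 1], [3, 1, 1], [1, 1, 3], [3, 1, 1],
]
p_bar, p_e, kappa = fleiss_kappa(ratings)
```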

Statistical Significance

To test the null hypothesis that the kappa coefficient is not different from zero (i.e., no better than chance), the generalized kappa statistic is compared with the standard normal distribution. The equation for obtaining the test statistic is as follows (see Appendix 2 for SE calculations): z = κ / SE.

APPENDIX 2. SE for Generalized Kappa

For a one-tailed test (alpha = 0.05), the kappa coefficient is statistically significantly different from zero. Because of rounding, the SE and z-test statistic will be slightly different when calculated by computer algorithm, and there are other calculation methods for SE not presented here [6]. A confidence interval is created using the same procedure as that presented for bi-rater kappa using the generalized kappa coefficient and its SE.

Limitations of Kappa

Considerable debate surrounds the use of bi-rater kappa and generalized kappa as a measure of agreement [18]. As a result, several alternative approaches to measuring agreement have been proposed but have yet to gain wide acceptance in the peer-reviewed literature. A convenient listing of several alternative approaches and references is available on the Internet [19]. Given the dominance of kappa as a measure of agreement in imaging studies, it is important for both investigators and consumers of the literature to understand the limitations of kappa. Following is a brief discussion of the negative effects resulting from variations in case distribution, improper use of weights, and restrictions on the overall generalizability (external validity) of studies using kappa. This is not a complete listing of all the limitations, but rather basic considerations in interpreting any agreement study that uses kappa.

Effects of Case Distribution

A fundamental aspect of agreement studies is the distribution of cases. Because it is unlikely that a study reflects the population prevalence, marginals (row and column totals) based on reader agreement patterns are routinely used as surrogates for prevalence [18]. This surrogate measure of chance agreement is based on the distribution of the cases classified by readers (both bi-rater kappa and generalized kappa). It is possible to find a consistently high level of percent agreement while reporting widely differing kappa values from one study or one comparison to another because of the case distributions. Table 5 provides an example in which two readers with the same percent agreement are presented with differing distributions of cases. The examples provided in Table 5 assume a high level of accuracy by both readers, so that the marginal probabilities match the study case distribution. In both examples, the readers agree in 90% of the classifications; however, kappa is substantially reduced if one classification category dominates. As shown, an increase in the dominance of malignant cases from 50% to 90% resulted in kappa dropping from 0.80 to 0.44.

TABLE 5 Implications of Case Distribution
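The contrast in Table 5 can be reproduced with hypothetical counts chosen to match the reported values: both tables show 90% raw agreement, yet the chance correction drives kappa from 0.80 down to 0.44 when malignant cases dominate.

```python
# Same 90% percent agreement, very different kappa: a sketch of how case mix
# affects the chance correction. Counts are hypothetical tables consistent
# with the example described in the text (kappa 0.80 vs. 0.44).

def agreement_and_kappa(table):
    n = sum(sum(row) for row in table)
    k = len(table)
    po = sum(table[i][i] for i in range(k)) / n
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    pe = sum(row_marg[i] * col_marg[i] for i in range(k))
    return po, (po - pe) / (1 - pe)

balanced = [[45, 5], [5, 45]]   # 50% benign, 50% malignant
skewed   = [[5, 5], [5, 85]]    # malignant cases dominate (90%)

po_b, kappa_b = agreement_and_kappa(balanced)
po_s, kappa_s = agreement_and_kappa(skewed)
```

With the balanced case mix, chance agreement is only 0.50; with the skewed mix it rises to 0.82, which leaves far less room for kappa to credit the same 90% observed agreement.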

Limitation 1.—Because of variations in case mix, reported kappa values may vary dramatically from one study to another even when the overall percent agreement is similar.

Limitation 2.—Because varying rater pairs will likely change the category distributions, bi-rater kappa values on the same set of elements may vary dramatically from one reader pair to another, even when percent agreement is relatively stable.

Weighted Kappa

Adding to the limited comparability of the kappa statistic from one study to another is the use of weighted kappa. There are multiple methods to weight kappa, so the comparability between studies is often limited. This concern, however, is minor when compared with the problem of weight justification [13]. The assignment of weights is an arbitrary exercise, even when an established algorithm is used [6, 7]. The subjectivity of assigning weights should be balanced with a clear explanation of why and how the weights are used [10]. Unfortunately, it is not rare for agreement studies to report weighted kappa with little if any discussion regarding the justification for the weighting scheme used in the study.

Limitation 3.—Weighting schemes are often subjective.

Generalizability

Several factors affect the generalizability (external validity) of an agreement study. These include rater background, clarity of the decision categories, and clinical relevance.

Rater Background.—When using kappa, we assume that the raters have similar levels of experience, training, and specialization (e.g., general radiology residents are not paired with seasoned subspecialists). If this is not the case, kappa may not be an appropriate technique [6].

Limitation 4.—Agreement is likely to be underestimated when raters have dissimilar experience and training.

Characteristic Clarity.—Clear classification definitions and independence are essential in an agreement study. As a result, if a general understanding regarding the basic concepts being rated has not been reached, conducting an agreement study is premature and inappropriate. Similarly, if the difference between classification categories is not clear, agreement will suffer and may not reflect the actual domain of interest. As an example, is there an actual difference between “probably benign” and “suspicious,” or do radiologists treat them clinically the same? In this case, reasons for possible differences among radiologists may include variation in attitudes toward the risk associated with false-negatives and unfamiliarity with subtle differences among the rating categories [2, 20]. It is unwise to give much credence to an agreement study that was based on a questionable classification scheme. An exception would be pilot studies such as lexicon development efforts, but they should be treated as experimental (efficacy) studies.

Limitation 5.—Agreement is likely to be underestimated and not generalizable when rating categories have questionable face validity.

Clinical Relevance.—A general question for any agreement study is whether the observed agreement is representative of clinical practice. Factors to consider include the type of imaging technology used, amount of background information provided, type of imaging (diagnostic or screening), prior imaging results, time allowed for interpretation, prior risk of disease, and comorbidity.

Limitation 6.—Agreement studies often do not reflect actual clinical practice (less information) or imaging prevalence (case mix), so the generalizability of the findings may be overstated.

Conclusion

Reader agreement studies have an important role in advancing radiology practice, technique, training, and quality control. Although the limitations of kappa are known, it remains a common statistical technique for estimating agreement for nominal and ordinal scale variables. The purpose of this article has been to build a better understanding of both the bi-rater and multirater kappa statistic. As has been shown, several weaknesses are intrinsic to kappa that are difficult to resolve. Although there are alternative approaches to measuring agreement, kappa will likely remain the most commonly used measure. Issues hindering the use of alternatives include mathematic complexity, reduced understanding and interpretability, and lack of consistency with prior research.

At present, agreement studies will continue to use bi-rater kappa, multirater kappa, and weighted kappa as a measure of agreement. However, it is essential that researchers respond to the limitations of kappa not only by improving study design but also by reporting and interpreting the findings appropriately. Recommended steps to improve the quality and usefulness of published reader agreement studies include reporting the characteristics of the raters and their similarities and differences; reporting the source and characteristics of the elements (images) presented to raters; including percent agreement with any kappa coefficient, and including both percent agreement and unweighted kappa if weighted kappa is used; and tempering overgeneralization by reflecting on how the raters, the elements they rated, and the study design differ from general clinical practice. Although the limitations of the kappa statistic may seem insurmountable, the key to proper use and interpretation of kappa, and any other statistic, is understanding its limitations and reporting sufficient data so that others may judge the results.

This is the 17th in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment.

Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.

Address correspondence to P. E. Crewson ().

I thank Caryn Cohen, the series editors, and anonymous reviewers for comments on earlier drafts of the manuscript.

References
1. Kinkel K, Helbich TH, Esserman LJ, et al. Dynamic high-spatial-resolution MR imaging of suspicious breast lesions: diagnostic criteria and interobserver variability. AJR 2000;175:35–43
2. Elmore JG, Wells CK, Lee CH, et al. Variability in radiologists' interpretations of mammograms. N Engl J Med 1994;331:1493–1499
3. Berg WA, D'Orsi CJ, Jackson VP, et al. Does training in the Breast Imaging Reporting and Data System (BI-RADS) improve biopsy recommendations or feature analysis agreement with experienced breast imagers at mammography? Radiology 2002;224:871–880
4. Ikeda DM, Hylton NM, Kinkel K, et al. Development, standardization, and testing of a lexicon for reporting contrast-enhanced breast magnetic resonance imaging studies. J Magn Reson Imaging 2001;13:889–895
5. Kashner TM. Agreement between administrative files and written medical records: a case of the Department of Veterans Affairs. Med Care 1998;36:1324–1336
6. Fleiss JL. Statistical methods for rates and proportions, 2nd ed. New York, NY: Wiley, 1981
7. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–174
8. Cyr L, Francis K. Measures of clinical agreement for nominal and categorical data: the kappa coefficient. Comput Biol Med 1992;22:239–246
9. American College of Radiology. Illustrated Breast Imaging Reporting and Data System (BI-RADS), 3rd ed. Reston, VA: American College of Radiology, 1998
10. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 1968;70:213–220
11. Lin H-M, Williamson JM, Lipsitz SR. Calculating power for the comparison of dependent K-coefficients. Appl Stat 2003;52:391–404
12. Donner A. Sample size requirements for the comparison of two or more coefficients of inter-observer agreement. Stat Med 1998;17:1157–1168
13. Woolson RF. Statistical methods for the analysis of biomedical data. New York, NY: Wiley, 1987
14. Lee JJ, Tu ZN. A better confidence interval for kappa on measuring agreement between two raters with binary outcomes. J Comput Graph Stat 1994;3:301–321
15. Kundel HL, Polansky M. Measurement of observer agreement. Radiology 2003;228:303–308
16. Kraemer HC. Evaluating medical tests: objective and quantitative guidelines. Newbury Park, CA: Sage Publications, 1992
17. Landis JR, Koch GG. A one-way components of variance model for categorical data. Biometrics 1977;33:671–679
18. Feinstein AR, Cicchetti DV. High agreement but low kappa. I. The problems of two paradoxes. J Clin Epidemiol 1990;43:543–549
20. Beam CA, Sullivan DC, Layde PM. Effect of human variability on independent double reading in screening mammography. Acad Radiol 1996;3:891–897
