Research
Fundamentals of Clinical Research for Radiologists
Reader Agreement Studies
This article presents several approaches for evaluating reader agreement. The dominant technique in the radiology literature is weighted and unweighted Cohen's kappa and the associated measure, percent agreement. Percent agreement is an intuitive approach to measuring agreement but does not adjust for chance. Kappa provides a measure of agreement beyond that which would be expected by chance, as estimated by the observed data. Both the bi-rater and multirater kappa statistics have several limitations that are difficult to resolve. Although there are alternative approaches to measuring agreement, kappa remains the most commonly used measure.
Reader agreement studies have an important role in advancing radiology practice, technique, training, and quality control. Extremely common in the radiology literature, reader agreement studies determine the magnitude of agreement between or among readers. Potential applications include developing reliable diagnostic rules [1], understanding variability in treatment recommendations [2], evaluating the effects of training on interpretation consistency [3], determining the reliability of classification systems (lexicon development) [4], and comparing the consistency of different sources of medical information [5]. Agreement studies should not be confused with studies of accuracy, in which sensitivity, specificity, and ROC curves are the usual measures of comparison when a reference standard (known truth) exists. Whereas accuracy studies evaluate the validity of a measure and require a reference standard, agreement studies most commonly focus on the reliability of evaluations between different readers, or within the same reader on different occasions, and do not require a reference standard.
Several methods are available for evaluating reader agreement, but the dominant technique in the radiology literature is weighted and unweighted Cohen's kappa and the associated measure, percent agreement. Because of the popularity of kappa in radiology research, this paper will focus on bi-rater and multirater kappa. Included in this presentation will be a discussion of the basic data requirements, calculation formulas, interpretation of the kappa coefficient as a measure of strength of agreement, and statistical significance testing. This discussion will be followed by an exploration of several limitations of kappa, especially those that pertain to comparability across studies. Formulas are provided in sufficient detail for those who wish to replicate the calculations, but an in-depth understanding of the mathematics is not necessary to appreciate the application and limitations of kappa.
Cohen's kappa is a common technique for estimating paired interrater agreement for nominal and ordinal-level data [6]. Kappa is a coefficient that represents agreement obtained between two readers beyond that which would be expected by chance alone [7]. A value of 1.0 represents perfect agreement. A value of 0.0 represents no agreement. Although such instances are rare, kappa can also exhibit negative values when observed agreement is less (worse) than chance. Key assumptions for using kappa include the following: elements being rated (images, diagnoses, clinical indications, and so forth) are independent of each other, one rater's classifications are made independently of the other rater's classifications, the same two raters provide the classifications used to determine kappa, and the rating categories are independent of one another [8]. The last assumption may be difficult to satisfy in some imaging studies in which there are subtle differences in lesion characteristics and decision criteria. When differences between rating categories are not clear, careful study design is essential to maximize the independence among rating categories. Alternatives include dropping confusing categories or merging related categories. Although not always possible, adjustments in the classification scheme should be consistent with clinical practice.
Bi-rater kappa is used to test the hypothesis that agreement exists between two raters beyond that which would be expected by chance. Bi-rater kappa provides a measure of the relative intensity of agreement or disagreement between two readers rating the same elements using an identical classification system. A two-by-two contingency table illustrates hypothetic data in which two readers independently viewed the same set of 100 images from diagnostic mammograms with a simple classification criterion, malignant or benign (Table 1). To estimate kappa, both raters must use the same number of rating criteria so that the number of columns representing the rating categories used by rater 1 equals the number of rows representing the rating categories used by rater 2. Kappa is calculated using the formula:

\kappa = \frac{p_o - p_e}{1 - p_e}
where po is the proportion of cases in which agreement exists between two raters, and pe is the proportion of cases in which raters would agree by chance.
If we divide each cell count by the total sample size (n = 100), a matrix of probabilities is created (Table 2). Each cell contains the proportion of the total number of images (n = 100), not the count. As an example, the proportion of images in which reader 1 and reader 2 agree that an image is benign is 0.20 (20/100) or 20% (0.20 × 100).
The overall proportion of readings in which reader 1 and reader 2 agree is calculated by summing the diagonal probabilities in Table 2:

p_o = 0.20 + 0.60 = 0.80
This “proportion agreement” is converted to a percentage and reported as “percent agreement.” The interpretation of percent agreement is straightforward: Reader 1 and reader 2 agreed with each other on 80% of the classifications. The approach and calculations are the same for larger tables in which readers must consider more than two options in their decision making. Using as an example the American College of Radiology's BI-RADS lexicon [9] for final assessment, agreement could be based on each reader assigning each case to one of four categories: benign, probably benign, suspicious, or highly suggestive of malignancy. The resulting data would be reported in a four-by-four table in which the sum of the probabilities in the four diagonal cells represents the proportion agreement (po).
The advantage of the kappa statistic over percent agreement is its adjustment for the proportion of cases in which the raters would agree by chance alone. Because we are unlikely to know the true value of chance, the marginal probabilities from the observed data are used to estimate a surrogate for chance. The proportions in the total column and in the total row represent the marginal probabilities. Chance agreement is derived from the observed data, so it will likely change if different readers evaluate the same images. Using Table 2, the proportion of chance agreement (pe) is computed as follows:

p_e = (p_{1\cdot} \times p_{\cdot 1}) + (p_{2\cdot} \times p_{\cdot 2})

where p_{j\cdot} is the total row proportion and p_{\cdot j} is the total column proportion for category j (here, benign and malignant).
Once the proportion of observed agreement (po) and the proportion of chance agreement (pe) are established, kappa is calculated using the formula:

\kappa = \frac{p_o - p_e}{1 - p_e} = 0.52
Using a common interpretation guideline offered by Landis and Koch [7], a kappa of 0.52 reflects a moderate level of agreement (Table 3).
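For readers who want to reproduce these calculations, the following Python sketch computes percent agreement, chance agreement, and unweighted kappa from a square table of counts. The function name and the cell counts are illustrative assumptions chosen to be consistent with the worked example (20 benign–benign agreements, 80% overall agreement), not the actual data of Table 1.

```python
import numpy as np

def unweighted_kappa(table):
    """Cohen's unweighted kappa from a square contingency table of counts.

    Rows hold one reader's categories and columns the other reader's;
    the diagonal holds the cases on which the two readers agree.
    """
    table = np.asarray(table, dtype=float)
    probs = table / table.sum()         # convert counts to proportions
    p_o = np.trace(probs)               # observed (percent) agreement
    row_marg = probs.sum(axis=1)        # marginal proportions for the row reader
    col_marg = probs.sum(axis=0)        # marginal proportions for the column reader
    p_e = np.sum(row_marg * col_marg)   # chance agreement
    kappa = (p_o - p_e) / (1.0 - p_e)
    return p_o, p_e, kappa

# Illustrative 2 x 2 table of counts (benign vs. malignant) for 100 images.
counts = [[20, 10],
          [10, 60]]
p_o, p_e, kappa = unweighted_kappa(counts)
print(f"percent agreement = {100 * p_o:.0f}%, chance = {p_e:.2f}, kappa = {kappa:.2f}")
```

With these illustrative counts, the script prints 80% agreement and a kappa of 0.52, the moderate range of Table 3.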
To test the null hypothesis that the kappa coefficient is not different from zero (i.e., no better than chance), an estimate of the standard error (SE) for a one-sample test is calculated from the formula [10]:

SE_0 = \sqrt{\frac{p_e}{n(1 - p_e)}}

where n is the total number of cases rated.
A kappa test statistic is compared with the standard normal distribution. The equation for obtaining the test statistic is as follows:

z = \frac{\hat{\kappa}}{SE_0}
Using a one-tailed test, the test statistic is statistically significant because it exceeds the critical value of 1.645 (alpha, 0.05) [6]. This result supports the alternative hypothesis that the kappa coefficient is different from zero (i.e., better than chance).
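A minimal sketch of this one-sample test is shown below. It assumes the simple large-sample SE formula given above and plugs in a chance-agreement value (0.58) that is consistent with the worked example; both values are assumptions for illustration.

```python
import math

n = 100       # number of images rated by both readers
p_o = 0.80    # observed agreement from the worked example
p_e = 0.58    # illustrative chance agreement consistent with a kappa of about 0.52

kappa = (p_o - p_e) / (1 - p_e)
se_null = math.sqrt(p_e / (n * (1 - p_e)))   # SE of kappa under the null (kappa = 0)
z = kappa / se_null

print(f"kappa = {kappa:.2f}, SE0 = {se_null:.3f}, z = {z:.2f}")
# One-tailed test at alpha = 0.05: the null hypothesis is rejected if z > 1.645.
```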
Although some effort has been directed toward estimating sample size requirements for comparisons among two or more kappa coefficients [11, 12], methods for calculating power for one kappa coefficient have not received much attention [11]. As a general rule of thumb, 30 cases with two readers is a reasonable minimum sample size, provided that a moderate or better kappa coefficient (κ > 0.40) is expected and the goal is simply to show that kappa differs from zero.
For estimating confidence intervals, a different formula is used for SE [10]. There are other more accurate and complicated formulas for SE [6, 13, 14]:

SE = \sqrt{\frac{p_o(1 - p_o)}{n(1 - p_e)^2}}
Given an estimate of kappa of 0.52, the 95% confidence interval would be 0.33–0.71:

95\% \text{ CI} = \hat{\kappa} \pm 1.96 \times SE
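The same illustrative values can be carried forward to the interval estimate; the sketch below uses the SE formula for confidence intervals given above.

```python
import math

n = 100
p_o = 0.80
p_e = 0.58    # illustrative chance agreement, as before

kappa = (p_o - p_e) / (1 - p_e)
se_ci = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))   # SE for interval estimation
lower, upper = kappa - 1.96 * se_ci, kappa + 1.96 * se_ci

print(f"kappa = {kappa:.2f}, 95% CI = {lower:.2f} to {upper:.2f}")
```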
Kappa treats all disagreements the same, regardless of whether a near miss on a rank-ordered classification system has any clinical relevance. For example, on a rank-ordered scale of benign, probably benign, suspicious, and highly suggestive of malignancy, one rater may conclude that a lesion is suspicious while the other concludes that it is highly suggestive of malignancy, yet both classifications may lead to the same clinical decision, immediate follow-up with biopsy. A disagreement between these two categories is much less important than one in which one rater classifies a lesion as highly suggestive of malignancy and the other classifies the same lesion as probably benign.
Weighted kappa was developed to provide partial credit for such near misses. The observed and expected proportions in each cell are multiplied by a weight before kappa is calculated. Weights can be established a priori (before data collection) on the basis of clinical judgment [10], or they can be assigned after data collection using a standard algorithm that applies the same weighting scheme regardless of the data characteristics or rating criteria. Weighted and unweighted kappa are identical when there are only two decision categories. An example based on the BI-RADS classification system is provided in Appendix 1. For another example of calculating kappa weights, see Kundel and Polansky [15].
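As one example of an after-the-fact weighting algorithm, the sketch below computes weighted kappa with linear weights (full credit on the diagonal, proportionally less credit as the ratings fall farther apart on the ordered scale). The weighting choice and the four-by-four counts are assumptions for illustration, not the scheme or data used in Appendix 1.

```python
import numpy as np

def weighted_kappa(table, scheme="linear"):
    """Weighted kappa for a square count table with rank-ordered categories.

    Agreement weights equal 1 on the diagonal and decrease with the distance
    between categories (linear or quadratic). With only two categories this
    reduces to unweighted kappa.
    """
    table = np.asarray(table, dtype=float)
    r = table.shape[0]
    probs = table / table.sum()
    expected = np.outer(probs.sum(axis=1), probs.sum(axis=0))  # chance proportions

    i, j = np.indices((r, r))
    distance = np.abs(i - j) / (r - 1)
    weights = 1 - distance if scheme == "linear" else 1 - distance ** 2

    p_o = np.sum(weights * probs)       # weighted observed agreement
    p_e = np.sum(weights * expected)    # weighted chance agreement
    return (p_o - p_e) / (1 - p_e)

# Illustrative 4 x 4 table for a BI-RADS-like scale (benign, probably benign,
# suspicious, highly suggestive of malignancy).
counts = [[30,  5,  2,  0],
          [ 6, 15,  4,  1],
          [ 1,  5, 12,  4],
          [ 0,  1,  3, 11]]
print(f"linear weighted kappa = {weighted_kappa(counts):.2f}")
```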
For small sample sizes, kappa may be underestimated. In this case, a resampling technique (jackknifing, sketched after this paragraph) can be used to calculate an unbiased estimate of kappa [8]. Kappa may also be lower if the number of decision categories is excessive. Possible responses to compensate for this effect are to use weighted kappa if the categories are rank-ordered, to combine similar categories, or both. In any good study design, the choice of a weighting or classification scheme should be addressed and resolved before data collection. Overall, the precision (SE) of kappa is expected to improve as the number of patients and raters increases [16]. Although the preceding discussion was limited to two raters, the next section presents a technique for improving precision by comparing more than two raters.
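A leave-one-case-out jackknife correction is sketched below for the two-reader case; this is one standard form of the jackknife, and whether it matches the specific procedure of reference [8] is an assumption.

```python
import numpy as np

def kappa_from_ratings(r1, r2):
    """Unweighted kappa from two readers' case-by-case ratings."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    categories = np.union1d(r1, r2)
    p_o = np.mean(r1 == r2)
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (p_o - p_e) / (1 - p_e)

def jackknife_kappa(r1, r2):
    """Jackknife bias-corrected kappa: n*kappa - (n - 1)*mean of leave-one-out kappas."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    n = len(r1)
    full = kappa_from_ratings(r1, r2)
    loo = [kappa_from_ratings(np.delete(r1, i), np.delete(r2, i)) for i in range(n)]
    return n * full - (n - 1) * np.mean(loo)

# Illustrative ratings for 12 cases (0 = benign, 1 = malignant).
reader1 = [0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
reader2 = [0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
print(f"kappa = {kappa_from_ratings(reader1, reader2):.2f}, "
      f"jackknife-corrected = {jackknife_kappa(reader1, reader2):.2f}")
```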
When there are more than two raters, generalized kappa is the recommended approach for evaluating interrater agreement [6, 13, 17]. This statistic measures the degree to which interpretation variability arises from differences among cases relative to differences among readers interpreting the same case. It is analogous to the analysis of variance and to the intraclass correlation coefficient used to assess agreement when the measurements are continuous.
The discussion that follows focuses on estimating agreement among more than two raters, when the number of raters is kept constant and the number of rating categories is greater than two. Slight modifications in the calculations are required when generalized kappa is estimated for only two rating categories or when the number of raters does not remain constant from one classification to another (see Fleiss [6] for alternative calculations). The approach presented here satisfies the likely characteristics of a prospective imaging study design [3].
Table 4 presents hypothetic data for five raters evaluating imaging from 10 patients using three decision categories—benign, suspicious, and malignant. The formulas that follow are from Woolson [13]. Assume the following notation: N = total number of patients, K = total number of raters, R = number of decision categories, and nij = number of raters who classified patient i (rows in Table 4) in category j (columns in Table 4).
The proportion (p̂j) of all classifications that fall within each decision category is presented at the bottom of each column. In this example, 0.40 (40%) of the classifications are in the benign category, 0.24 (24%) are suspicious, and 0.36 (36%) are classified as malignant.
For each patient, the proportion of all possible pairings on which radiologists agree is calculated using the formula:

P_i = \frac{1}{K(K-1)} \sum_{j=1}^{R} n_{ij}(n_{ij} - 1)
For patient 1, this formula is applied to the number of raters choosing each decision category in the first row of Table 4.
The proportion of pairs agreeing for each patient is provided in Table 4 in the right column. The overall proportion of agreement (p̄) is the mean agreement of all patients, or 0.62. In other words, we estimate that, on average, any two of the five radiologists will agree on a classification about 62% of the time.
As in bi-rater kappa, a correction for chance agreement is necessary to calculate the kappa coefficient. To estimate chance agreement for generalized kappa, the proportion (p̂j) of classifications in each decision category is squared and summed. For Table 4, the expected chance agreement is:

p_e = \sum_{j=1}^{R} \hat{p}_j^2 = (0.40)^2 + (0.24)^2 + (0.36)^2 \approx 0.35
Using the proportion of observed agreement and chance agreement, the generalized kappa statistic is:

\hat{\kappa} = \frac{\bar{p} - p_e}{1 - p_e} = \frac{0.62 - 0.35}{1 - 0.35} \approx 0.42
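The full multirater calculation can be scripted directly from a patient-by-category matrix of rater counts. The sketch below follows the formulas above; the illustrative matrix was constructed to reproduce the summary values quoted in the text (category proportions of 0.40, 0.24, and 0.36 and mean pairwise agreement of 0.62) and is not the actual data of Table 4.

```python
import numpy as np

def generalized_kappa(counts):
    """Generalized (multirater) kappa from an N x R matrix in which counts[i, j]
    is the number of raters who placed patient i in category j; every patient
    must be rated by the same number of raters K."""
    counts = np.asarray(counts, dtype=float)
    k = counts[0].sum()                                  # raters per patient

    # Proportion of agreeing rater pairs for each patient, then the overall mean.
    p_i = np.sum(counts * (counts - 1), axis=1) / (k * (k - 1))
    p_bar = p_i.mean()

    # Chance agreement from the overall category proportions.
    p_j = counts.sum(axis=0) / counts.sum()
    p_e = np.sum(p_j ** 2)

    return (p_bar - p_e) / (1 - p_e), p_bar, p_e

# Illustrative matrix: 10 patients (rows), 5 raters, 3 categories
# (benign, suspicious, malignant).
table = [[5, 0, 0], [0, 5, 0], [0, 0, 5], [0, 0, 5], [4, 0, 1],
         [4, 1, 0], [3, 0, 2], [2, 2, 1], [1, 2, 2], [1, 2, 2]]
kappa, p_bar, p_e = generalized_kappa(table)
print(f"mean pairwise agreement = {p_bar:.2f}, chance = {p_e:.2f}, kappa = {kappa:.2f}")
```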
To test the null hypothesis that the kappa coefficient is not different from zero (i.e., no better than chance), the generalized kappa statistic is compared with the standard normal distribution. The equation for obtaining the test statistic is as follows (see Appendix 2 for SE calculations):

z = \frac{\hat{\kappa}}{SE(\hat{\kappa})}
For a one-tailed test (alpha = 0.05), the kappa coefficient is statistically significantly different from zero. Because of rounding, the SE and z-test statistic will be slightly different when calculated by computer algorithm, and there are other calculation methods for SE not presented here [6]. A confidence interval is created using the same procedure as that presented for bi-rater kappa using the generalized kappa coefficient and its SE.
Considerable debate surrounds the use of bi-rater kappa and generalized kappa as measures of agreement [18]. As a result, several alternative approaches to measuring agreement have been proposed but have yet to gain wide acceptance in the peer-reviewed literature. A convenient listing of several alternative approaches and references is available on the Internet [19]. Given the dominance of kappa as a measure of agreement in imaging studies, it is important for both investigators and consumers of the literature to understand the limitations of kappa. Following is a brief discussion of the negative effects resulting from variations in case distribution, improper use of weights, and restrictions on the overall generalizability (external validity) of studies using kappa. This is not a complete listing of all the limitations, but rather basic considerations in interpreting any agreement study that uses kappa.
A fundamental aspect of agreement studies is the distribution of cases. Because it is unlikely that a study reflects the population prevalence, marginals (row and column totals) based on reader agreement patterns are routinely used as surrogates for prevalence [18]. This surrogate measure of chance agreement is based on the distribution of the cases classified by readers (both bi-rater kappa and generalized kappa). It is possible to find a consistently high level of percent agreement while reporting widely differing kappa values from one study or one comparison to another because of the case distributions. Table 5 provides an example in which two readers with the same percent agreement are presented with differing distributions of cases. The examples provided in Table 5 assume a high level of accuracy by both readers, so that the marginal probabilities match the study case distribution. In both examples, the readers agree in 90% of the classifications; however, kappa is significantly reduced if one classification category dominates. As shown, an increase in the dominance of malignant cases from 50% to 90% resulted in kappa dropping from 0.80 to 0.44.
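The arithmetic behind these two scenarios follows directly from the kappa formula, under the stated assumption that the marginal probabilities match the case distribution:

\text{50/50 case mix: } p_e = (0.50)^2 + (0.50)^2 = 0.50, \quad \kappa = \frac{0.90 - 0.50}{1 - 0.50} = 0.80

\text{90/10 case mix: } p_e = (0.90)^2 + (0.10)^2 = 0.82, \quad \kappa = \frac{0.90 - 0.82}{1 - 0.82} \approx 0.44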
Limitation 1.—Because of variations in case mix, reported kappa values may vary dramatically from one study to another even when the overall percent agreement is similar.
Limitation 2.—Because varying rater pairs will likely change the category distributions, bi-rater kappa values on the same set of elements may vary dramatically from one reader pair to another, even when percent agreement is relatively stable.
Adding to the limited comparability of the kappa statistic from one study to another is the use of weighted kappa. There are multiple methods to weight kappa, so the comparability between studies is often limited. This concern, however, is minor when compared with the problem of weight justification [13]. The assignment of weights is an arbitrary exercise, even when an established algorithm is used [6, 7]. The subjectivity of assigning weights should be balanced with a clear explanation of why and how the weights are used [10]. Unfortunately, it is not rare for agreement studies to report weighted kappa with little if any discussion regarding the justification for the weighting scheme used in the study.
Limitation 3.—Weighting schemes are often subjective.
Several factors affect the generalizability (external validity) of an agreement study. These include rater background, clarity of the decision categories, and clinical relevance.
Rater Background.—When using kappa, we assume that the raters have similar levels of experience, training, and specialization (e.g., general radiology residents are not paired with seasoned subspecialists). If this is not the case, kappa may not be an appropriate technique [6].
Limitation 4.—Agreement is likely to be underestimated when raters have dissimilar experience and training.
Characteristic Clarity.—Clear classification definitions and independence are essential in an agreement study. As a result, if a general understanding regarding the basic concepts being rated has not been reached, conducting an agreement study is premature and inappropriate. Similarly, if the difference between classification categories is not clear, agreement will suffer and may not reflect the actual domain of interest. As an example, is there an actual difference between “probably benign” and “suspicious,” or do radiologists treat them clinically the same? In this case, reasons for possible differences among radiologists may include variation in attitudes toward the risk associated with false-negatives and unfamiliarity with subtle differences among the rating categories [2, 20]. It is unwise to give much credence to an agreement study that was based on a questionable classification scheme. An exception would be pilot studies such as lexicon development efforts, but they should be treated as experimental (efficacy) studies.
Limitation 5.—Agreement is likely to be underestimated and not generalizable when rating categories have questionable face validity.
Clinical Relevance.—A general question for any agreement study is whether the observed agreement is representative of clinical practice. Factors to consider include the type of imaging technology used, amount of background information provided, type of imaging (diagnostic or screening), prior imaging results, time allowed for interpretation, prior risk of disease, and comorbidity.
Limitation 6.—Agreement studies often do not reflect actual clinical practice (less information) or imaging prevalence (case mix), so the generalizability of the findings may be overstated.
Reader agreement studies have an important role in advancing radiology practice, technique, training, and quality control. Although the limitations of kappa are known, it remains a common statistical technique for estimating agreement for nominal and ordinal scale variables. The purpose of this article has been to build a better understanding of both the bi-rater and multirater kappa statistic. As has been shown, several weaknesses are intrinsic to kappa that are difficult to resolve. Although there are alternative approaches to measuring agreement, kappa will likely remain the most commonly used measure. Issues hindering the use of alternatives include mathematic complexity, reduced understanding and interpretability, and lack of consistency with prior research.
At present, agreement studies will continue to use bi-rater kappa, multirater kappa, and weighted kappa as a measure of agreement. However, it is essential that researchers respond to the limitations of kappa not only by improving study design but also by reporting and interpreting the findings appropriately. Recommended steps to improve the quality and usefulness of published reader agreement studies include reporting the characteristics of the raters and their similarities and differences; reporting the source and characteristics of the elements (images) presented to raters; including percent agreement with any kappa coefficient, and including both percent agreement and unweighted kappa if weighted kappa is used; and tempering overgeneralization by reflecting on how the raters, the elements they rated, and the study design differ from general clinical practice. Although the limitations of the kappa statistic may seem insurmountable, the key to proper use and interpretation of kappa, and any other statistic, is understanding its limitations and reporting sufficient data so that others may judge the results.
This is the 17th in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).
Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment.
Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.
Address correspondence to P. E. Crewson ([email protected]).
I thank Caryn Cohen, the series editors, and anonymous reviewers for comments on earlier drafts of the manuscript.

