Fundamentals of Clinical Research for Radiologists
1 Health Services Research and Development Service (124), Department of Veterans Affairs, 810 Vermont Ave., NW, Washington, DC 20420.
Received November 17, 2004; accepted after revision November 23, 2004.
This is the 17th in the series designed by the American College of
Radiology (ACR), the Canadian Association of Radiologists, and the
American Journal of Roentgenology. The series, which will ultimately
comprise 22 articles, is designed to progressively educate radiologists in the
methodologies of rigorous clinical research, from the most basic principles to
a level of considerable sophistication. The articles are intended to
complement interactive software that permits the user to work with what he or
she has learned, which is available on the ACR Web site
(www.acr.org).
Reader agreement studies have an important role in advancing radiology practice, technique, training, and quality control. Extremely common in the radiology literature, reader agreement studies determine the magnitude of agreement between or among readers. Potential applications include developing reliable diagnostic rules [1], understanding variability in treatment recommendations [2], evaluating the effects of training on interpretation consistency [3], determining the reliability of classification systems (lexicon development) [4], and comparing the consistency of different sources of medical information [5]. Agreement studies should not be confused with studies of accuracy, in which measures of sensitivity and specificity and ROC curves are commonplace for comparisons when a reference standard (known truth) exists. Accuracy studies evaluate the validity of a measure and require a reference standard, whereas agreement studies most commonly focus on the reliability of evaluations between different readers or by the same reader on different occasions and do not require a reference standard.
Several methods are available for evaluating reader agreement, but the dominant technique in the radiology literature is weighted and unweighted Cohen's kappa and the associated measure, percent agreement. Because of the popularity of kappa in radiology research, this paper will focus on bi-rater and multirater kappa. Included in this presentation will be a discussion of the basic data requirements, calculation formulas, interpretation of the kappa coefficient as a measure of strength of agreement, and statistical significance testing. This discussion will be followed by an exploration of several limitations of kappa, especially those that pertain to comparability across studies. Formulas are provided in sufficient detail for those who wish to replicate the calculations, but an in-depth understanding of the mathematics is not necessary to appreciate the application and limitations of kappa.
Bi-Rater Kappa
Cohen's kappa is a common technique for estimating paired interrater agreement for nominal and ordinal-level data [6]. Kappa is a coefficient that represents agreement obtained between two readers beyond that which would be expected by chance alone [7]. A value of 1.0 represents perfect agreement. A value of 0.0 represents agreement no better than that expected by chance. Although such instances are rare, kappa can also exhibit negative values when observed agreement is less (worse) than chance. Key assumptions for using kappa include the following: elements being rated (images, diagnoses, clinical indications, and so forth) are independent of each other, one rater's classifications are made independently of the other rater's classifications, the same two raters provide the classifications used to determine kappa, and the rating categories are independent of one another [8]. The last assumption may be difficult to satisfy in some imaging studies in which there are subtle differences in lesion characteristics and decision criteria. When differences between rating categories are not clear, careful study design is essential to maximize the independence among rating categories. Alternatives include dropping confusing categories or merging related categories. Although not always possible, adjustments in the classification scheme should be consistent with clinical practice.
Bi-rater kappa is used to test the hypothesis that agreement exists between
two raters beyond that which would be expected by chance. Bi-rater kappa
provides a measure of the relative intensity of agreement or disagreement
between two readers rating the same elements using an identical classification
system. A two-by-two contingency table illustrates hypothetic data in which
two readers independently viewed the same set of 100 images from diagnostic
mammograms with a simple classification criterion, malignant or benign
(Table 1). To estimate kappa,
both raters must use the same number of rating criteria so that the number of
columns representing the rating categories used by rater 1 equals the number
of rows representing the rating categories used by rater 2. Kappa is
calculated using the formula:
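$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
where po is the observed proportion of agreement and pe is the proportion of agreement expected by chance.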
If we divide each cell count by the total sample size (n = 100), a matrix of probabilities is created (Table 2). Each cell contains the proportion of the total number of images (n = 100), not the count. As an example, the proportion of images in which reader 1 and reader 2 agree that an image is benign is 0.20 (20/100) or 20% (0.20 x 100).
The overall proportion of readings in which reader 1 and reader 2 agree is
calculated by summing the diagonal probabilities in
Table 2:
$$p_o = 0.20 + 0.60 = 0.80$$
This "proportion agreement" is converted to a percentage and reported as "percent agreement." The interpretation of percent agreement is straightforward: Reader 1 and reader 2 agreed with each other on 80% of the classifications. The approach and calculations are the same for larger tables in which readers must consider more than two options in their decision making. Using as an example the American College of Radiology's BI-RADS lexicon [9] for final assessment, agreement could be based on each reader assigning each case to one of four categories: benign, probably benign, suspicious, or highly suggestive of malignancy. The resulting data would be reported in a four-by-four table in which the sum of the probabilities in the four diagonal cells represents the proportion agreement (po).
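The percent-agreement calculation can be sketched in a few lines of Python. The cell counts below are an assumption: the text fixes only the benign/benign cell (20 of 100) and the 80% total agreement, so the even 10/10 split of the disagreements is hypothetical and is not a copy of Table 1.

```python
# Hypothetical counts for 100 images: rows = reader 2, columns = reader 1,
# categories ordered (benign, malignant). Only the benign/benign cell (20)
# and the 80% overall agreement come from the text; the 10/10 split of the
# disagreements is an assumption made for illustration.
counts = [
    [20, 10],
    [10, 60],
]


def proportion_agreement(table):
    """Sum of the diagonal cells divided by the total number of ratings."""
    total = sum(sum(row) for row in table)
    agree = sum(table[i][i] for i in range(len(table)))
    return agree / total


po = proportion_agreement(counts)
print(f"proportion agreement = {po:.2f}, percent agreement = {po * 100:.0f}%")
# prints: proportion agreement = 0.80, percent agreement = 80%
```

The same function works unchanged for a four-by-four table such as one built from BI-RADS final assessment categories, because it only sums the diagonal.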
The advantage of the kappa statistic over percent agreement is its
adjustment for the proportion of cases in which the raters would agree by
chance alone. Because we are unlikely to know the true value of chance, the
marginal probabilities from the observed data are used to estimate a surrogate
for chance. The proportions in the total column and in the total row represent
the marginal probabilities. Chance agreement is derived from the observed
data, so it will likely change if different readers evaluate the same images.
Using Table 2, the proportion
of chance agreement (pe) is computed as follows:
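$$p_e = \sum_{i} p_{i+}\, p_{+i}$$
where, for each rating category i, the marginal proportion for one reader (row total) is multiplied by the marginal proportion for the other reader (column total), and the products are summed over the categories.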
Once the proportion of observed agreement (po)
and the proportion of chance agreement (pe) are
established, kappa is calculated using the formula:
$$\kappa = \frac{p_o - p_e}{1 - p_e} = 0.52$$
Using a common interpretation guideline offered by Landis and Koch [7], a kappa of 0.52 reflects a moderate level of agreement (Table 3).
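As a rough illustration, the full bi-rater calculation might look like the following Python sketch. The counts are hypothetical (only the benign/benign cell of 20 and the 80% agreement are stated in the text), and the labels in `landis_koch` follow the commonly quoted Landis and Koch bands, which may be worded differently in Table 3.

```python
def cohen_kappa(table):
    """Return (po, pe, kappa) for a square table of counts."""
    n = sum(sum(row) for row in table)
    size = len(table)
    # Marginal proportions for the row reader and the column reader.
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(size)) / n for j in range(size)]
    po = sum(table[i][i] for i in range(size)) / n
    pe = sum(row_marg[i] * col_marg[i] for i in range(size))
    return po, pe, (po - pe) / (1 - pe)


def landis_koch(kappa):
    """Map kappa to the commonly quoted Landis and Koch strength labels."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical 2 x 2 counts (benign, malignant); the off-diagonal split
# is an assumption, chosen so that kappa rounds to the reported 0.52.
counts = [[20, 10], [10, 60]]
po, pe, kappa = cohen_kappa(counts)
print(f"po = {po:.2f}, pe = {pe:.2f}, kappa = {kappa:.2f} ({landis_koch(kappa)})")
```

With these assumed counts the marginals are 0.30 and 0.70 for both readers, so the chance agreement works out to 0.58 and kappa rounds to 0.52.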
Statistical Significance
To test the null hypothesis that the kappa coefficient is not different
from zero (i.e., no better than chance), an estimate of the standard error
(SE) for a one-sample test is calculated from the formula
[10]:
[Equation image missing from this copy: SE of kappa under the null hypothesis, per reference 10.]
A kappa test statistic is compared with the standard normal distribution.
The equation for obtaining the test statistic is as follows:
$$z = \frac{\kappa}{SE(\kappa)}$$
Using a one-tailed test, the test statistic is statistically significant because it exceeds the critical value of 1.645 (alpha = 0.05) [6]. This result supports the alternative hypothesis that the kappa coefficient is greater than zero (i.e., better than chance).
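The test can be sketched in Python as follows. Two assumptions are made here. First, the null-hypothesis SE uses Cohen's simple large-sample approximation, sqrt(pe / (n(1 - pe))); the formula the article takes from reference 10 may differ. Second, pe = 0.58 comes from hypothetical marginals of 0.30 and 0.70 for both readers rather than from Table 2.

```python
import math


def kappa_z_test(po, pe, n):
    """One-sample z statistic for H0: kappa = 0.

    Uses Cohen's simple large-sample approximation for the SE under the
    null hypothesis (an assumption; other SE formulas exist).
    """
    kappa = (po - pe) / (1 - pe)
    se0 = math.sqrt(pe / (n * (1 - pe)))
    return kappa, se0, kappa / se0


# po is from the text; pe = 0.58 is an assumed value (see lead-in).
kappa, se0, z = kappa_z_test(po=0.80, pe=0.58, n=100)
significant = z > 1.645  # one-tailed critical value at alpha = 0.05
print(f"kappa = {kappa:.2f}, SE0 = {se0:.3f}, z = {z:.2f}, significant = {significant}")
```

Under these assumptions z is well above 1.645, which agrees with the conclusion in the text, although the exact value depends on the SE formula chosen.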
Although some effort has been directed toward estimating sample size requirements for comparisons among two or more kappa coefficients [11, 12], methods for calculating power for one kappa coefficient have not received much attention [11]. As a general rule of thumb, 30 cases with two readers is a reasonable minimum sample size as long as a moderate-level or better kappa coefficient (kappa > 0.40) is expected and you want to show that kappa is different from a value of zero.
Confidence Intervals
For estimating confidence intervals, a different formula is used for SE [10]. There are other more accurate and complicated formulas for SE [6, 13, 14]:
$$SE(\kappa) = \sqrt{\frac{p_o (1 - p_o)}{n (1 - p_e)^2}}$$
Given an estimate of kappa of 0.52, the 95% confidence interval would be 0.33 to 0.71:
$$\kappa \pm 1.96 \times SE(\kappa) = 0.52 \pm 0.19 = 0.33 \text{ to } 0.71$$
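A short Python sketch of the interval calculation is given below. It assumes the simple large-sample SE, sqrt(po(1 - po) / (n(1 - pe)^2)), and an assumed chance agreement of pe = 0.58 (hypothetical marginals of 0.30 and 0.70 for both readers). Under those assumptions it reproduces the reported interval.

```python
import math


def kappa_confidence_interval(kappa, po, pe, n, z_crit=1.96):
    """Large-sample confidence interval for kappa.

    The SE formula is the simple approximation; more accurate and more
    complicated formulas exist.
    """
    se = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
    return kappa - z_crit * se, kappa + z_crit * se


# kappa and po as reported in the text; pe = 0.58 is an assumption.
low, high = kappa_confidence_interval(kappa=0.52, po=0.80, pe=0.58, n=100)
print(f"95% CI: {low:.2f} to {high:.2f}")
```

Because the interval excludes zero, it leads to the same conclusion as the one-sample significance test.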
Weighted Kappa
Kappa treats disagreements the same regardless of whether a close decision
on a rank-ordered classification system has clinical relevance. As an example,
a rank-ordered rating scale from benign, probably benign, suspicious, and
highly suspicious of malignancy, from which one rater concludes a lesion is
suspicious and the other rater concludes that the lesion is highly suspicious,
may result in the same clinical decision, immediate follow-up with biopsy. In
this event, a disagreement between these two categories is much less important
than a disagreement in which one rater rates a lesion as highly suspicious and
the other rater rates the same lesion as probably benign.
Weighted kappa was developed to provide partial credit. The observed and expected proportions of each cell are multiplied by a weight before using them to calculate kappa. Weights can be established a priori (before data collection) using clinical experience [10], or they can be calculated after data collection using a simple algorithm for assigning weights that uses the same weighting strategy regardless of the data characteristics or rating criteria. Weighted kappa and unweighted kappa will be the same when there are only two decision categories. An example based on the BI-RADS classification system is provided in Appendix 1. For another example of calculating kappa weights, see Kundel and Polansky [15].
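One widely used algorithmic scheme is linear weighting, in which full credit is given on the diagonal and credit falls off in equal steps with distance from it. The Python sketch below assumes that scheme; it is not necessarily the one used in Appendix 1, and the four-by-four counts are invented for illustration.

```python
def linear_weights(size):
    """Agreement weights: 1 on the diagonal, 0 for the most distant cells."""
    return [[1 - abs(i - j) / (size - 1) for j in range(size)]
            for i in range(size)]


def weighted_kappa(table, weights):
    """Weighted kappa for a square table of counts and a weight matrix."""
    size = len(table)
    n = sum(sum(row) for row in table)
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(size)) / n for j in range(size)]
    cells = [(i, j) for i in range(size) for j in range(size)]
    po_w = sum(weights[i][j] * table[i][j] / n for i, j in cells)
    pe_w = sum(weights[i][j] * row_marg[i] * col_marg[j] for i, j in cells)
    return (po_w - pe_w) / (1 - pe_w)


# Hypothetical counts for four ordered categories (benign, probably benign,
# suspicious, highly suggestive of malignancy); all disagreements here are
# between adjacent categories.
counts = [
    [20, 5, 0, 0],
    [5, 20, 5, 0],
    [0, 5, 20, 5],
    [0, 0, 5, 10],
]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
unweighted = weighted_kappa(counts, identity)
weighted = weighted_kappa(counts, linear_weights(4))
print(f"unweighted kappa = {unweighted:.2f}, linear weighted kappa = {weighted:.2f}")
```

Passing an identity weight matrix recovers unweighted kappa, and with only two categories the linear weights reduce to the identity matrix, which is why the two statistics coincide in that case.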
Special Considerations When Using Bi-Rater Kappa
For small sample sizes, kappa may be underestimated. In this case, a
resampling technique (jackknifing) can be used to calculate an unbiased
estimate of kappa [8]. Kappa
may also be lower if the number of decision categories is excessive. Possible
responses to compensate for this effect are to use weighted kappa if the
categories are rank-ordered or to combine similar categories, or both. In any
good study design, the choice of a weighting or classification scheme should
be addressed and resolved before data collection. Overall, the precision (SE)
of kappa is expected to improve as the number of patients and raters increases
[16]. Although the preceding
discussion was limited to two raters, the next section presents a technique
for improving precision by comparing more than two raters.
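The jackknife adjustment mentioned above can be sketched in Python as a standard delete-one bias correction. The article does not spell out the procedure from reference 8, so the form used here is an assumption, and the 30 paired ratings are hypothetical.

```python
def kappa_from_pairs(pairs):
    """Unweighted kappa from a list of (reader 1, reader 2) ratings."""
    n = len(pairs)
    categories = sorted({c for pair in pairs for c in pair})
    po = sum(a == b for a, b in pairs) / n
    pe = sum(
        (sum(a == c for a, _ in pairs) / n) * (sum(b == c for _, b in pairs) / n)
        for c in categories
    )
    return (po - pe) / (1 - pe)


def jackknife_kappa(pairs):
    """Delete-one jackknife bias-corrected estimate of kappa."""
    n = len(pairs)
    full = kappa_from_pairs(pairs)
    # Recompute kappa n times, leaving out one case each time.
    leave_one_out = [kappa_from_pairs(pairs[:i] + pairs[i + 1:]) for i in range(n)]
    return n * full - (n - 1) * sum(leave_one_out) / n


# Hypothetical ratings for 30 cases (B = benign, M = malignant).
pairs = ([("B", "B")] * 6 + [("M", "M")] * 18
         + [("B", "M")] * 3 + [("M", "B")] * 3)
full = kappa_from_pairs(pairs)
corrected = jackknife_kappa(pairs)
print(f"kappa = {full:.3f}, jackknife estimate = {corrected:.3f}")
```

The sketch assumes that every leave-one-out sample still contains more than one rating category; otherwise chance agreement equals 1 and kappa is undefined.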
Multirater Generalized Kappa
When there are more than two raters, generalized kappa is the recommended approach for evaluating interrater agreement [6, 13, 17]. This statistic measures the degree to which interpretation variability arises from differences among cases relative to differences among readers interpreting the same case. It is analogous to analysis of variance and the intraclass correlation used in the assessment of agreement when measured on a continuous scale.
The discussion that follows focuses on estimating agreement among more than two raters, when the number of raters is kept constant and the number of rating categories is greater than two. Slight modifications in the calculations are required when generalized kappa is estimated for only two rating categories or when the number of raters does not remain constant from one classification to another (see Fleiss [6] for alternative calculations). The approach presented here satisfies the likely characteristics of a prospective imaging study design [3].
Table 4 presents hypothetic data for five raters evaluating imaging from 10 patients using three decision categories: benign, suspicious, and malignant. The formulas that follow are from Woolson [13]. Assume the following notation: N = total number of patients, K = total number of raters, R = number of decision categories, and nij = number of raters who classified patient i (rows in Table 4) in category j (columns in Table 4).
The proportion (pj) of all classifications that fall within each decision category is presented at the bottom of each column. In this example, 0.40 (40%) of the classifications are in the benign category, 0.24 (24%) are suspicious, and 0.36 (36%) are classified as malignant.
For each patient, the proportion of all possible pairings on which
radiologists agree is calculated using the formula:
$$p_i = \frac{\sum_{j=1}^{R} n_{ij}(n_{ij} - 1)}{K(K - 1)}$$
For patient 1, this would be calculated as shown in:
The proportion of pairs agreeing for each patient is provided in
Table 4 in the right column.
The overall proportion of agreement is the mean agreement of all patients, or 0.62. In other words, we estimate that, on average, any two of the five radiologists will agree on a classification about 62% of the time.
As in bi-rater kappa, a correction for chance agreement is necessary to calculate the kappa coefficient. To estimate chance agreement for generalized kappa, the proportion (pj) of classifications in each decision category is squared and summed. For Table 4, the expected chance agreement is:
$$p_e = \sum_{j=1}^{R} p_j^2 = 0.40^2 + 0.24^2 + 0.36^2 \approx 0.35$$
Using the proportion of observed agreement and chance agreement, the
generalized kappa statistic is:
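$$\kappa = \frac{\bar{p} - p_e}{1 - p_e} = \frac{0.62 - 0.35}{1 - 0.35} \approx 0.42$$
where $\bar{p}$ is the overall proportion of observed agreement and $p_e$ is the expected chance agreement.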
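The generalized kappa calculation can be sketched in Python as shown below. The count matrix is hypothetical: it was constructed to match the summary figures quoted in the text (column proportions of 0.40, 0.24, and 0.36 and a mean pairwise agreement of 0.62) and is not a copy of Table 4.

```python
# Hypothetical counts n_ij: 10 patients (rows), 5 raters per patient,
# columns = (benign, suspicious, malignant). Built to reproduce the
# summary values in the text, not taken from Table 4.
ratings = [
    [5, 0, 0], [5, 0, 0], [0, 0, 5], [4, 1, 0], [0, 1, 4],
    [0, 1, 4], [2, 3, 0], [0, 2, 3], [1, 3, 1], [3, 1, 1],
]


def generalized_kappa(table):
    """Return (p_j, p_bar, p_e, kappa) for a constant number of raters."""
    n_patients = len(table)
    n_raters = sum(table[0])
    n_categories = len(table[0])
    total = n_patients * n_raters
    # Proportion of all classifications falling in each category.
    p_j = [sum(row[j] for row in table) / total for j in range(n_categories)]
    # Per-patient proportion of rater pairs that agree.
    p_i = [sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1))
           for row in table]
    p_bar = sum(p_i) / n_patients
    p_e = sum(p ** 2 for p in p_j)
    return p_j, p_bar, p_e, (p_bar - p_e) / (1 - p_e)


p_j, p_bar, p_e, kappa = generalized_kappa(ratings)
print(f"p_j = {[round(p, 2) for p in p_j]}, observed = {p_bar:.2f}, "
      f"chance = {p_e:.2f}, kappa = {kappa:.2f}")
```

The function assumes that every patient is rated by the same number of raters, which matches the design discussed in this section.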
Statistical Significance
To test the null hypothesis that the kappa coefficient is not different
from zero (i.e., no better than chance), the generalized kappa statistic is
compared with the standard normal distribution. The equation for obtaining the
test statistic is as follows (see Appendix
2 for SE calculations):
$$z = \frac{\kappa}{SE(\kappa)}$$
For a one-tailed test (alpha = 0.05), the kappa coefficient is statistically significantly different from zero. Because of rounding, the SE and z-test statistic will be slightly different when calculated by computer algorithm, and there are other calculation methods for SE not presented here [6]. A confidence interval is created using the same procedure as that presented for bi-rater kappa using the generalized kappa coefficient and its SE.
Limitations of Kappa
Considerable debate surrounds the use of bi-rater kappa and generalized kappa as a measure of agreement [18]. As a result, several alternative approaches to measuring agreement have been proposed but have yet to gain wide acceptance in the peer-reviewed literature. A convenient listing of several alternative approaches and references is available on the Internet [19]. Given the dominance of kappa as a measure of agreement in imaging studies, it is important for both investigators and consumers of the literature to understand the limitations of kappa. Following is a brief discussion of the negative effects resulting from variations in case distribution, improper use of weights, and restrictions on the overall generalizability (external validity) of studies using kappa. This is not a complete listing of all the limitations, but rather basic considerations in interpreting any agreement study that uses kappa.
Effects of Case Distribution
A fundamental aspect of agreement studies is the distribution of cases.
Because it is unlikely that a study reflects the population prevalence,
marginals (row and column totals) based on reader agreement patterns are
routinely used as surrogates for prevalence
[18]. This surrogate measure
of chance agreement is based on the distribution of the cases classified by
readers (both bi-rater kappa and generalized kappa). It is possible to find a
consistently high level of percent agreement while reporting widely differing
kappa values from one study or one comparison to another because of the case
distributions. Table 5 provides
an example in which two readers with the same percent agreement are presented
with differing distributions of cases. The examples provided in
Table 5 assume a high level of
accuracy by both readers, so that the marginal probabilities match the study
case distribution. In both examples, the readers agree on 90% of the classifications; however, kappa is substantially reduced if one classification category dominates. As shown, an increase in the dominance of malignant cases from 50% to 90% resulted in kappa dropping from 0.80 to 0.44.
Limitation 1. Because of variations in case mix, reported kappa values may vary dramatically from one study to another even when the overall percent agreement is similar.
Limitation 2. Because varying rater pairs will likely change the category distributions, bi-rater kappa values on the same set of elements may vary dramatically from one reader pair to another, even when percent agreement is relatively stable.
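The case-distribution effect can be reproduced with a brief Python sketch. The two tables are hypothetical reconstructions that assume both readers have identical marginals matching the case mix (50% and 90% malignant) and 90% agreement; they are not copied from Table 5.

```python
def cohen_kappa(table):
    """Return (percent agreement as a proportion, kappa) for a square table."""
    n = sum(sum(row) for row in table)
    size = len(table)
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(size)) / n for j in range(size)]
    po = sum(table[i][i] for i in range(size)) / n
    pe = sum(row_marg[i] * col_marg[i] for i in range(size))
    return po, (po - pe) / (1 - pe)


# Hypothetical counts per 100 cases, categories ordered (malignant, benign).
balanced = [[45, 5], [5, 45]]  # 50% malignant
skewed = [[85, 5], [5, 5]]     # 90% malignant

for name, table in [("50% malignant", balanced), ("90% malignant", skewed)]:
    po, kappa = cohen_kappa(table)
    print(f"{name}: percent agreement = {po * 100:.0f}%, kappa = {kappa:.2f}")
```

Percent agreement is identical in the two tables, but chance agreement rises from 0.50 to 0.82 as one category comes to dominate, and that increase alone drives kappa down.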
Weighted Kappa
Adding to the limited comparability of the kappa statistic from one study to another is the use of weighted kappa. There are multiple methods to weight kappa, so the comparability between studies is often limited. This concern, however, is minor when compared with the problem of weight justification [13]. The assignment of weights is an arbitrary exercise, even when an established algorithm is used [6, 7]. The subjectivity of assigning weights should be balanced with a clear explanation of why and how the weights are used [10]. Unfortunately, it is not rare for agreement studies to report weighted kappa with little if any discussion regarding the justification for the weighting scheme used in the study.
Limitation 3. Weighting schemes are often subjective.
Generalizability
Several factors affect the generalizability (external validity) of an
agreement study. These include rater background, clarity of the decision
categories, and clinical relevance.
Rater Background. When using kappa, we assume that the raters have similar levels of experience, training, and specialization (e.g., general radiology residents are not paired with seasoned subspecialists). If this is not the case, kappa may not be an appropriate technique [6].
Limitation 4. Agreement is likely to be underestimated when raters have dissimilar experience and training.
Characteristic Clarity. Clear classification definitions and independence are essential in an agreement study. As a result, if a general understanding regarding the basic concepts being rated has not been reached, conducting an agreement study is premature and inappropriate. Similarly, if the difference between classification categories is not clear, agreement will suffer and may not reflect the actual domain of interest. As an example, is there an actual difference between "probably benign" and "suspicious," or do radiologists treat them clinically the same? In this case, reasons for possible differences among radiologists may include variation in attitudes toward the risk associated with false-negatives and unfamiliarity with subtle differences among the rating categories [2, 20]. It is unwise to give much credence to an agreement study that was based on a questionable classification scheme. An exception would be pilot studies such as lexicon development efforts, but they should be treated as experimental (efficacy) studies.
Limitation 5. Agreement is likely to be underestimated and not generalizable when rating categories have questionable face validity.
Clinical Relevance. A general question for any agreement study is whether the observed agreement is representative of clinical practice. Factors to consider include the type of imaging technology used, amount of background information provided, type of imaging (diagnostic or screening), prior imaging results, time allowed for interpretation, prior risk of disease, and comorbidity.
Limitation 6. Agreement studies often do not reflect actual clinical practice (less information) or imaging prevalence (case mix), so the generalizability of the findings may be overstated.
Conclusion
Reader agreement studies have an important role in advancing radiology practice, technique, training, and quality control. Although the limitations of kappa are known, it remains a common statistical technique for estimating agreement for nominal and ordinal scale variables. The purpose of this article has been to build a better understanding of both the bi-rater and multirater kappa statistic. As has been shown, several weaknesses are intrinsic to kappa that are difficult to resolve. Although there are alternative approaches to measuring agreement, kappa will likely remain the most commonly used measure. Issues hindering the use of alternatives include mathematic complexity, reduced understanding and interpretability, and lack of consistency with prior research.
At present, agreement studies will continue to use bi-rater kappa, multirater kappa, and weighted kappa as a measure of agreement. However, it is essential that researchers respond to the limitations of kappa not only by improving study design but also by reporting and interpreting the findings appropriately. Recommended steps to improve the quality and usefulness of published reader agreement studies include reporting the characteristics of the raters and their similarities and differences; reporting the source and characteristics of the elements (images) presented to raters; including percent agreement with any kappa coefficient, and including both percent agreement and unweighted kappa if weighted kappa is used; and tempering overgeneralization by reflecting on how the raters, the elements they rated, and the study design differ from general clinical practice. Although the limitations of the kappa statistic may seem insurmountable, the key to proper use and interpretation of kappa, and any other statistic, is understanding its limitations and reporting sufficient data so that others may judge the results.
Acknowledgments
I thank Caryn Cohen, the series editors, and anonymous reviewers for comments on earlier drafts of the manuscript.