Cohen's Kappa (κ)
Core idea
Overview
Cohen's Kappa is a statistical measure used to assess inter-rater reliability for categorical scales by accounting for the agreement occurring by chance. It provides a more robust metric than simple percent agreement by comparing the observed agreement against the probability of random consensus.
When to use: This statistic is applied when two independent raters or observers are classifying items into mutually exclusive categories. It is particularly useful in clinical psychology and medical diagnosis where subjective judgment must be standardized across different practitioners.
Why it matters: Simply reporting the percentage of agreement can be misleading if certain categories occur very frequently, as raters might agree by luck alone. Cohen's Kappa adjusts for this, ensuring that high reliability scores reflect genuine diagnostic consistency, which is vital for the validity of research findings.
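A quick illustration with hypothetical numbers: suppose two raters each label 95% of cases 'No Diagnosis' and 5% 'Clinical Depression', and they agree on 90% of cases overall. The agreement expected by chance alone is 0.95 × 0.95 + 0.05 × 0.05 = 0.905, so κ = (0.90 - 0.905) / (1 - 0.905) ≈ -0.05. Despite 90% raw agreement, Kappa shows the raters doing no better than chance.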
Remember it
Memory Aid
Phrase: Kappa: People Observed minus People Expected, Over One minus People Expected.
Visual Analogy: Imagine two judges sorting cards. Kappa is the 'pure' agreement left over after you sweep away all the matches that happened just by lucky coincidence or random chance.
Exam Tip: Kappa ranges from -1 to 1. If your result comes out above 1, you've made a calculation error; this usually happens when you forget to subtract the expected agreement from the observed agreement in the numerator.
Why it makes sense
Intuition
Imagine a scale where 0 represents agreement purely by chance, and 1 represents perfect agreement. Cohen's Kappa positions the observed agreement on this scale, after removing the portion attributable to random luck.
Symbols
Variables
κ = Cohen's Kappa, p_o = Observed Agreement, p_e = Expected Agreement
Walkthrough
Derivation
Formula: Cohen's Kappa (κ)
Measures inter-rater agreement for categorical data, correcting for chance agreement.
- Two raters classify items into the same set of categories.
- Ratings are independent.
Calculate observed agreement:
The proportion of cases where the two raters agree: p_o = (number of agreements) / (total number of items).
Calculate expected (chance) agreement:
For each category k, multiply the proportions of items that each rater assigns to that category and sum across categories: p_e = Σ_k p_1k × p_2k, where p_1k and p_2k are rater 1's and rater 2's proportions for category k.
Compute kappa: κ = (p_o - p_e) / (1 - p_e).
κ = 0 means agreement no better than chance; κ = 1 means perfect agreement.
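The steps above translate directly into a short computation. Here is a minimal sketch in Python (not from the source; the function and variable names are illustrative), assuming two equal-length lists of category labels, one per rater:

    from collections import Counter

    def cohens_kappa(ratings_a, ratings_b):
        # Assumes both lists have the same length and cover the same items.
        n = len(ratings_a)
        # Observed agreement: proportion of items both raters labelled identically.
        p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
        # Marginal proportions for each rater.
        freq_a = Counter(ratings_a)
        freq_b = Counter(ratings_b)
        categories = set(freq_a) | set(freq_b)
        # Expected (chance) agreement: sum of the products of the raters'
        # category proportions.
        p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in categories)
        return (p_o - p_e) / (1 - p_e)

    # Example with made-up labels:
    # cohens_kappa(['Dep', 'No', 'No', 'Dep'], ['Dep', 'No', 'Dep', 'Dep'])  # 0.5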
Result
Source: University Psychology — Research Methods
Where it shows up
Real-World Context
Two clinical psychologists independently diagnosing 50 patients with either 'Clinical Depression' or 'No Diagnosis' to test the reliability of a new screening tool.
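With hypothetical counts for that scenario: suppose both psychologists say 'Clinical Depression' for 15 patients, both say 'No Diagnosis' for 25, and they split on the remaining 10 (5 each way). Observed agreement is p_o = 40/50 = 0.80; each rater labels 20/50 = 0.40 of patients 'Clinical Depression' and 0.60 'No Diagnosis', so p_e = 0.40 × 0.40 + 0.60 × 0.60 = 0.52 and κ = (0.80 - 0.52) / (1 - 0.52) ≈ 0.58, moderate agreement.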
Avoid these traps
Common Mistakes
- Using Cohen's Kappa for ordinal data without weighting (use Weighted Kappa instead).
- Confusing observed agreement with the final Kappa score.
- Applying it to more than two raters (Fleiss' kappa handles that case).
Study smarter
Tips
- Interpret values between 0.61 and 0.80 as substantial agreement and 0.81 and above as almost perfect (the Landis and Koch benchmarks); a small helper sketch follows this list.
- Be cautious when one category is very rare, as Kappa can behave unpredictably in skewed distributions.
- Ensure that the two raters are performing their evaluations independently and without knowledge of the other's score.
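As referenced in the first tip, here is a small helper (illustration only, using the common Landis and Koch benchmark labels) for turning a Kappa value into a verbal rating:

    def interpret_kappa(kappa: float) -> str:
        # Landis & Koch (1977) benchmark labels; the boundaries are conventional.
        if kappa < 0:
            return 'poor (worse than chance)'
        if kappa <= 0.20:
            return 'slight'
        if kappa <= 0.40:
            return 'fair'
        if kappa <= 0.60:
            return 'moderate'
        if kappa <= 0.80:
            return 'substantial'
        return 'almost perfect'

    # interpret_kappa(0.58)  # 'moderate'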
Common questions
Frequently Asked Questions
What does Cohen's Kappa measure?
It measures inter-rater agreement for categorical data, correcting for chance agreement.

When should it be used?
When two independent raters or observers are classifying items into mutually exclusive categories. It is particularly useful in clinical psychology and medical diagnosis, where subjective judgment must be standardized across different practitioners.

Why does it matter?
Simply reporting the percentage of agreement can be misleading if certain categories occur very frequently, as raters might agree by luck alone. Cohen's Kappa adjusts for this, ensuring that high reliability scores reflect genuine diagnostic consistency, which is vital for the validity of research findings.

What are the common mistakes?
Using Cohen's Kappa for ordinal data without weighting (use Weighted Kappa instead); confusing observed agreement with the final Kappa score; and applying it to more than two raters.

Where does it show up in practice?
Two clinical psychologists independently diagnosing 50 patients with either 'Clinical Depression' or 'No Diagnosis' to test the reliability of a new screening tool.

How should values be interpreted?
Values between 0.61 and 0.80 indicate substantial agreement, and 0.81 and above almost perfect agreement. Be cautious when one category is very rare, as Kappa can behave unpredictably in skewed distributions, and make sure the two raters evaluate independently, without knowledge of each other's scores.