Cohen's Kappa (κ)
Core idea
Overview
Cohen's Kappa is a statistical measure used to assess inter-rater reliability for categorical scales by accounting for the agreement occurring by chance. It provides a more robust metric than simple percent agreement by comparing the observed agreement against the probability of random consensus.
When to use: This statistic is applied when two independent raters or observers are classifying items into mutually exclusive categories. It is particularly useful in clinical psychology and medical diagnosis where subjective judgment must be standardized across different practitioners.
Why it matters: Simply reporting the percentage of agreement can be misleading if certain categories occur very frequently, as raters might agree by luck alone. Cohen's Kappa adjusts for this, ensuring that high reliability scores reflect genuine diagnostic consistency, which is vital for the validity of research findings.
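A quick illustration with hypothetical numbers: suppose two raters each label 95% of cases 'No Diagnosis' and 5% 'Clinical Depression', and they agree on 90% of cases overall. The agreement expected by chance alone is 0.95 × 0.95 + 0.05 × 0.05 = 0.905, so κ = (0.90 - 0.905) / (1 - 0.905) ≈ -0.05. Despite 90% raw agreement, Kappa shows the raters doing no better than chance.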
Remember it
Memory Aid
Phrase: Kappa: People Observed minus People Expected, Over One minus People Expected.
Visual Analogy: Imagine two judges sorting cards. Kappa is the 'pure' agreement left over after you sweep away all the matches that happened just by lucky coincidence or random chance.
Exam Tip: Kappa ranges from -1 to 1. If your result comes out above 1, you've made a calculation error; this usually happens when you forget to subtract the expected agreement from the observed agreement in the numerator.
Why it makes sense
Intuition
Imagine a scale where 0 represents agreement purely by chance, and 1 represents perfect agreement. Cohen's Kappa positions the observed agreement on this scale, after removing the portion attributable to random luck.
Symbols
Variables
κ = Cohen's Kappa, p_o = Observed Agreement, p_e = Expected Agreement
Walkthrough
Derivation
Formula: Cohen's Kappa (κ)
Measures inter-rater agreement for categorical data, correcting for chance agreement.
- Two raters classify items into the same set of categories.
- Ratings are independent.
Calculate observed agreement:
The proportion of cases where the two raters agree: p_o = (number of agreements) / (total number of items).
Calculate expected (chance) agreement:
For each category k, multiply the proportions of items that each rater assigns to that category and sum across categories: p_e = Σ_k p_1k × p_2k, where p_1k and p_2k are rater 1's and rater 2's proportions for category k.
Compute kappa: κ = (p_o - p_e) / (1 - p_e).
κ = 0 means agreement no better than chance; κ = 1 means perfect agreement.
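The steps above translate directly into a short computation. Here is a minimal sketch in Python (not from the source; the function and variable names are illustrative), assuming two equal-length lists of category labels, one per rater:

    from collections import Counter

    def cohens_kappa(ratings_a, ratings_b):
        # Assumes both lists have the same length and cover the same items.
        n = len(ratings_a)
        # Observed agreement: proportion of items both raters labelled identically.
        p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
        # Marginal proportions for each rater.
        freq_a = Counter(ratings_a)
        freq_b = Counter(ratings_b)
        categories = set(freq_a) | set(freq_b)
        # Expected (chance) agreement: sum of the products of the raters'
        # category proportions.
        p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in categories)
        return (p_o - p_e) / (1 - p_e)

    # Example with made-up labels:
    # cohens_kappa(['Dep', 'No', 'No', 'Dep'], ['Dep', 'No', 'Dep', 'Dep'])  # 0.5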
Result
Source: University Psychology — Research Methods
Where it shows up
Real-World Context
Two clinical psychologists independently diagnosing 50 patients with either 'Clinical Depression' or 'No Diagnosis' to test the reliability of a new screening tool.
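With hypothetical counts for that scenario: suppose both psychologists say 'Clinical Depression' for 15 patients, both say 'No Diagnosis' for 25, and they split on the remaining 10 (5 each way). Observed agreement is p_o = 40/50 = 0.80; each rater labels 20/50 = 0.40 of patients 'Clinical Depression' and 0.60 'No Diagnosis', so p_e = 0.40 × 0.40 + 0.60 × 0.60 = 0.52 and κ = (0.80 - 0.52) / (1 - 0.52) ≈ 0.58, moderate agreement.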
Avoid these traps
Common Mistakes
- Using Cohen's Kappa for ordinal data without weighting (use Weighted Kappa instead).
- Confusing observed agreement with the final Kappa score.
- Applying it to more than two raters (Fleiss' kappa handles that case).
Study smarter
Tips
- Interpret values between 0.61 and 0.80 as substantial agreement and 0.81 and above as almost perfect (the Landis and Koch benchmarks); a small helper sketch follows this list.
- Be cautious when one category is very rare, as Kappa can behave unpredictably in skewed distributions.
- Ensure that the two raters are performing their evaluations independently and without knowledge of the other's score.
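As referenced in the first tip, here is a small helper (illustration only, using the common Landis and Koch benchmark labels) for turning a Kappa value into a verbal rating:

    def interpret_kappa(kappa: float) -> str:
        # Landis & Koch (1977) benchmark labels; the boundaries are conventional.
        if kappa < 0:
            return 'poor (worse than chance)'
        if kappa <= 0.20:
            return 'slight'
        if kappa <= 0.40:
            return 'fair'
        if kappa <= 0.60:
            return 'moderate'
        if kappa <= 0.80:
            return 'substantial'
        return 'almost perfect'

    # interpret_kappa(0.58)  # 'moderate'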
Common questions
Frequently Asked Questions
What does Cohen's Kappa measure?
It measures inter-rater agreement for categorical data, correcting for chance agreement.

When should it be used?
When two independent raters or observers are classifying items into mutually exclusive categories. It is particularly useful in clinical psychology and medical diagnosis, where subjective judgment must be standardized across different practitioners.

Why does it matter?
Simply reporting the percentage of agreement can be misleading if certain categories occur very frequently, as raters might agree by luck alone. Cohen's Kappa adjusts for this, ensuring that high reliability scores reflect genuine diagnostic consistency, which is vital for the validity of research findings.

What are the common mistakes?
Using Cohen's Kappa for ordinal data without weighting (use Weighted Kappa instead); confusing observed agreement with the final Kappa score; and applying it to more than two raters.

Where does it show up in practice?
Two clinical psychologists independently diagnosing 50 patients with either 'Clinical Depression' or 'No Diagnosis' to test the reliability of a new screening tool.

How should values be interpreted?
Values between 0.61 and 0.80 indicate substantial agreement, and 0.81 and above almost perfect agreement. Be cautious when one category is very rare, as Kappa can behave unpredictably in skewed distributions, and make sure the two raters evaluate independently, without knowledge of each other's scores.