
Mutual Information (2×2)

Mutual information between two binary variables from joint probabilities.


This public page keeps the free explanation visible and leaves premium worked solving, advanced walkthroughs, and saved study tools inside the app.

Core idea

Overview

Mutual Information quantifies the statistical dependence between two discrete random variables by measuring how much information is shared between them. In the 2×2 contingency case, it equals the Kullback-Leibler divergence between the joint probability distribution and the product of the marginal distributions of the two binary variables.

When to use: Apply this formula when analyzing the relationship between two binary variables, such as comparing a test result with the presence of a disease. It is preferred over linear correlation when you need to capture non-linear dependencies or general statistical association.

Why it matters: It is a foundational concept in communication theory for calculating channel capacity and in machine learning for feature selection. High mutual information indicates that knowing the state of one variable significantly reduces uncertainty about the other.
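A minimal sketch of the computation, assuming the four cell probabilities are already known (the function name and example numbers below are illustrative, not taken from this page):

```python
from math import log

def mutual_information_2x2(p00, p01, p10, p11):
    """Mutual information (in nats) of two binary variables,
    given the four cells of their joint probability table."""
    joint = {(0, 0): p00, (0, 1): p01, (1, 0): p10, (1, 1): p11}
    px = {0: p00 + p01, 1: p10 + p11}   # marginal of X (row sums)
    py = {0: p00 + p10, 1: p01 + p11}   # marginal of Y (column sums)
    mi = 0.0
    for (x, y), pxy in joint.items():
        if pxy > 0:                      # convention: 0 * ln(0) = 0
            mi += pxy * log(pxy / (px[x] * py[y]))
    return mi

# Example: a binary test that usually agrees with the condition it screens for
print(mutual_information_2x2(0.4, 0.1, 0.1, 0.4))  # ≈ 0.193 nats
```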

Symbols

Variables

I(X;Y) = Mutual Information, p00 = P(X=0,Y=0), p01 = P(X=0,Y=1), p10 = P(X=1,Y=0), p11 = P(X=1,Y=1)

I(X;Y): Mutual Information (nats)
p00 = P(X=0,Y=0): joint probability that X = 0 and Y = 0 (dimensionless)
p01 = P(X=0,Y=1): joint probability that X = 0 and Y = 1 (dimensionless)
p10 = P(X=1,Y=0): joint probability that X = 1 and Y = 0 (dimensionless)
p11 = P(X=1,Y=1): joint probability that X = 1 and Y = 1 (dimensionless)

Walkthrough

Derivation

Derivation of Mutual Information from a 2×2 Joint Table

Mutual information sums p(x,y) ln(p(x,y)/(p(x)p(y))) over all pairs.

  • X and Y are binary.
  • Joint probabilities p00, p01, p10, p11 sum to 1.
1

Start from the definition:

I(X;Y) = Σ_x Σ_y p(x,y) ln(p(x,y)/(p(x)p(y)))

Mutual information quantifies dependence between X and Y.

2

Compute marginals from the 2×2 table:

p(X=0) = p00 + p01,  p(X=1) = p10 + p11,  p(Y=0) = p00 + p10,  p(Y=1) = p01 + p11

You need p(x) and p(y) to form the ratio p(x,y)/(p(x)p(y)).

3

Sum the four terms (p00, p01, p10, p11):

I(X;Y) = p00 ln(p00/(p(X=0)p(Y=0))) + p01 ln(p01/(p(X=0)p(Y=1))) + p10 ln(p10/(p(X=1)p(Y=0))) + p11 ln(p11/(p(X=1)p(Y=1)))

Each non-zero joint probability contributes a term. By convention, 0·ln(0) = 0.

Result

I(X;Y) = p00 ln(p00/((p00+p01)(p00+p10))) + p01 ln(p01/((p00+p01)(p01+p11))) + p10 ln(p10/((p10+p11)(p00+p10))) + p11 ln(p11/((p10+p11)(p01+p11)))
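The three steps above can be mirrored in a short script; the table values here are hypothetical and chosen only so the marginals and the four terms are easy to trace:

```python
from math import log

# Hypothetical 2x2 joint table (the four cells must sum to 1)
p00, p01, p10, p11 = 0.30, 0.20, 0.10, 0.40

# Step 2: marginals from the row and column sums
px0, px1 = p00 + p01, p10 + p11   # P(X=0), P(X=1)
py0, py1 = p00 + p10, p01 + p11   # P(Y=0), P(Y=1)

# Step 3: sum the four terms p(x,y) * ln(p(x,y) / (p(x) p(y)))
terms = [
    p00 * log(p00 / (px0 * py0)),
    p01 * log(p01 / (px0 * py1)),
    p10 * log(p10 / (px1 * py0)),
    p11 * log(p11 / (px1 * py1)),
]
print(sum(terms))   # I(X;Y) ≈ 0.086 nats for these numbers
```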

Visual intuition

Graph

Graph unavailable for this formula.

Plotted against the strength of dependence between X and Y (with the marginals held fixed), mutual information rises non-linearly from zero at independence to a maximum bounded by the smaller of the two marginal entropies, which for binary variables is at most ln 2 ≈ 0.693 nats. The curvature reflects that the information gain is constrained by the marginal distributions, and the slope shows how quickly uncertainty about one variable falls as its dependence on the other strengthens.
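One way to see this behaviour numerically is to sweep a single dependence parameter while holding both marginals at 0.5. The parameterization below (p00 = p11 = (1+d)/4, p01 = p10 = (1-d)/4) is only one illustrative path from independence to perfect correlation, not the only possible one:

```python
from math import log

def mi_uniform_marginals(d):
    """MI in nats when both marginals are 0.5/0.5 and d in [0, 1]
    moves the table from independence (d = 0) to perfect agreement
    between X and Y (d = 1)."""
    p_same = (1 + d) / 4    # p00 = p11
    p_diff = (1 - d) / 4    # p01 = p10
    mi = 0.0
    for p in (p_same, p_same, p_diff, p_diff):
        if p > 0:                       # convention: 0 * ln(0) = 0
            mi += p * log(p / 0.25)     # every marginal product is 0.25
    return mi

for d in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(d, round(mi_uniform_marginals(d), 3))
# Rises from 0.0 at independence to ln 2 ≈ 0.693 nats at perfect correlation.
```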

Why it behaves this way

Intuition

Imagine a statistical landscape where the 'height' at each (x,y) point represents the deviation from independence. Mutual information is the total 'volume' of these deviations, weighted by how frequently each combination occurs.

I(X;Y)
The amount of information that one random variable (X) provides about another (Y).
A high value means knowing X significantly reduces uncertainty about Y (and vice versa); zero means they are statistically independent.
p(x,y)
The joint probability of observing a specific outcome 'x' for variable X and a specific outcome 'y' for variable Y simultaneously.
How frequently a particular combination of states (x,y) occurs together in the observed data.
p(x)p(y)
The product of the marginal probabilities of X taking outcome 'x' and Y taking outcome 'y', representing their joint probability if X and Y were statistically independent.
The baseline frequency of a combination (x,y) if there were no relationship or shared information between X and Y.
ln(p(x,y)/(p(x)p(y)))
The 'information content' or 'surprise' associated with a specific (x,y) pair, relative to the expectation of independence, in units of nats.
Measures how much more (or less) likely a specific (x,y) combination is than if X and Y were unrelated. A positive value means more likely, a negative value means less likely.
Σ_x Σ_y
Summation over all possible discrete outcomes for X and Y.
Aggregates the information contributions from every possible combination of X and Y to calculate the total shared information.

Signs and relationships

  • ln(p(x,y)/(p(x)p(y))): The natural logarithm transforms the ratio of probabilities into an additive measure of information. If the observed joint probability p(x,y) is larger than p(x)p(y), the log term is positive; if it is smaller, the term is negative.
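A quick numeric check of the sign behaviour, using a hypothetical table with uniform marginals so that every product p(x)p(y) equals 0.25:

```python
from math import log

# With p00 = p11 = 0.4 and p01 = p10 = 0.1, each marginal is 0.5, so p(x)p(y) = 0.25.
print(log(0.4 / 0.25))   # ≈ +0.47: the (0,0) cell is more likely than independence predicts
print(log(0.1 / 0.25))   # ≈ -0.92: the (0,1) cell is less likely than independence predicts
```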

Free study cues

Insight

Canonical usage

Mutual information is a dimensionless quantity, representing a measure of statistical dependence. It is conventionally expressed in 'nats' when the natural logarithm (ln) is used, or 'bits' when the base-2 logarithm (log2) is used.

Common confusion

A common confusion is treating 'nats' or 'bits' as physical units rather than as conventional units for information content, whose choice depends on the logarithm base used in the calculation.

Dimension note

Mutual information is inherently dimensionless because it is calculated from ratios of probabilities, which are themselves dimensionless.

Unit systems

p(x,y): none - Joint probabilities are inherently dimensionless, ranging from 0 to 1.
p(x): none - Marginal probabilities are inherently dimensionless, ranging from 0 to 1.
p(y): none - Marginal probabilities are inherently dimensionless, ranging from 0 to 1.
I(X;Y): nats - The result is a dimensionless measure of shared information. The conventional unit 'nats' is used when the natural logarithm (ln) is applied, as in the given formula.
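If a result in bits is needed instead of nats, dividing by ln 2 converts it; the starting value below is just an illustrative figure, not one computed on this page:

```python
from math import log

nats = 0.193            # an MI value computed with the natural logarithm
bits = nats / log(2)    # dividing by ln 2 converts nats to bits
print(round(bits, 3))   # ≈ 0.278 bits
```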

One free problem

Practice Problem

A researcher is studying the link between a specific gene mutation and a rare trait. In a perfectly balanced population, the joint probabilities are all equal (0.25 each). Calculate the Mutual Information.

P(X=0,Y=0) = 0.25
P(X=0,Y=1) = 0.25
P(X=1,Y=0) = 0.25
P(X=1,Y=1) = 0.25

Solve for: I(X;Y)

Hint: If the joint probability of every cell is equal to the product of its marginal probabilities, the variables are independent.

The full worked solution stays in the interactive walkthrough.

Where it shows up

Real-World Context

In quantifying how informative a medical test result is about disease status, Mutual Information (2×2) is computed from the four joint probabilities P(X=0,Y=0), P(X=0,Y=1), P(X=1,Y=0), and P(X=1,Y=1). The result matters because it helps evaluate model behaviour, algorithm cost, or prediction quality before relying on the output.
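As a rough sketch of such a scenario, assume (hypothetically) a disease with 10% prevalence and a test with 90% sensitivity and 95% specificity; all of these figures are illustrative assumptions rather than data from this page:

```python
from math import log

# Hypothetical screening scenario (illustrative numbers only)
prevalence, sensitivity, specificity = 0.10, 0.90, 0.95

# Build the 2x2 joint table over (disease D, test result T)
p11 = prevalence * sensitivity              # D=1, T=1
p10 = prevalence * (1 - sensitivity)        # D=1, T=0
p01 = (1 - prevalence) * (1 - specificity)  # D=0, T=1
p00 = (1 - prevalence) * specificity        # D=0, T=0

pd1, pd0 = p11 + p10, p01 + p00             # marginals of D
pt1, pt0 = p11 + p01, p10 + p00             # marginals of T

mi = sum(p * log(p / (pd * pt))
         for p, pd, pt in [(p00, pd0, pt0), (p01, pd0, pt1),
                           (p10, pd1, pt0), (p11, pd1, pt1)]
         if p > 0)
print(round(mi, 3))   # ≈ 0.185 nats: the test result is informative about disease status
```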

Study smarter

Tips

  • Ensure the sum of joint probabilities (p00, p01, p10, p11) equals exactly 1.0 before starting.
  • Calculate the marginal probabilities for X and Y by summing the rows and columns of the contingency table.
  • Treat terms where p(x,y) is zero as zero, as the limit of p log(p) as p approaches zero is zero.
  • The result is measured in nats when using the natural logarithm (ln) or bits when using log base 2.

Avoid these traps

Common Mistakes

  • Forgetting to normalize probabilities to sum to 1.
  • Mixing logs (ln vs log2) and units (nats vs bits).

Common questions

Frequently Asked Questions

What does the Mutual Information (2×2) formula state?

Mutual information sums p(x,y) ln(p(x,y)/(p(x)p(y))) over all pairs.

When should I use this formula?

Apply this formula when analyzing the relationship between two binary variables, such as comparing a test result with the presence of a disease. It is preferred over linear correlation when you need to capture non-linear dependencies or general statistical association.

Why does this formula matter?

It is a foundational concept in communication theory for calculating channel capacity and in machine learning for feature selection. High mutual information indicates that knowing the state of one variable significantly reduces uncertainty about the other.

What mistakes should I avoid?

Forgetting to normalize the probabilities to sum to 1, and mixing logarithms (ln vs log2) and units (nats vs bits).

Where does this formula show up in practice?

In quantifying how informative a medical test result is about disease status: Mutual Information (2×2) is computed from the four joint probabilities P(X=0,Y=0), P(X=0,Y=1), P(X=1,Y=0), and P(X=1,Y=1). The result helps evaluate model behaviour, algorithm cost, or prediction quality before relying on the output.

How do I apply this formula correctly?

Check that the joint probabilities (p00, p01, p10, p11) sum to exactly 1.0, compute the marginal probabilities for X and Y by summing the rows and columns of the contingency table, treat terms where p(x,y) is zero as zero, and report the result in nats when using the natural logarithm (ln) or bits when using log base 2.

References

Sources

  1. Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience.
  2. Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379-423.
  3. Wikipedia: Mutual Information