
KL Divergence (Bernoulli)

D_KL(p||q) for Bernoulli distributions.


This public page keeps the free explanation visible; premium worked solutions, advanced walkthroughs, and saved study tools live inside the app.

Core idea

Overview

The Bernoulli KL divergence measures the relative entropy between two Bernoulli distributions, quantifying the information lost when distribution q is used to approximate distribution p. It is a non-symmetric measure (not a true distance metric) of the statistical discrepancy between two distributions over the same binary outcome.

When to use: This equation is essential when evaluating the performance of binary classifiers or when comparing a theoretical model to observed binary frequencies. It is frequently applied in machine learning as a component of loss functions like Binary Cross-Entropy and in the context of information-theoretic model selection.

Why it matters: It provides a rigorous way to measure the 'surprise' or extra cost incurred by assuming one set of probabilities when the reality is different. In practice, minimizing this divergence optimizes data transmission and ensures that predictive models are as close to the true data generation process as possible.
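The link to Binary Cross-Entropy mentioned above can be sketched numerically: cross-entropy decomposes as H(p, q) = H(p) + D_KL(p||q), so minimizing cross-entropy with respect to q is equivalent to minimizing the KL divergence. A minimal Python check (function names are illustrative, not from any particular library):

```python
import math

def entropy(p):
    """Entropy H(p) of a Bernoulli(p) variable, in nats."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) between Bernoulli(p) and Bernoulli(q), in nats."""
    return -p * math.log(q) - (1 - p) * math.log(1 - q)

def kl_bernoulli(p, q):
    """D_KL(p||q) for Bernoulli distributions, in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p, q = 0.7, 0.4
# Decomposition H(p, q) = H(p) + D_KL(p||q): the entropy term does not
# depend on q, so optimizing q drives the KL term toward zero.
assert abs(cross_entropy(p, q) - (entropy(p) + kl_bernoulli(p, q))) < 1e-12
```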

Symbols

Variables

D_KL(p||q) = KL Divergence, p = True Probability, q = Model Probability

D_KL(p||q): KL Divergence (nats)
p: True Probability (dimensionless variable)
q: Model Probability (dimensionless variable)

Walkthrough

Derivation

Derivation of KL Divergence for Bernoulli Variables

KL divergence measures mismatch between true probability p and model probability q.

  • Binary variable X ∈ {0, 1}.
  • True distribution: P(X=1) = p, so P(X=0) = 1 − p.
  • Model distribution: Q(X=1) = q, so Q(X=0) = 1 − q.

1. Start from the definition of KL divergence:

   D_KL(P||Q) = Σ_{x ∈ {0,1}} P(x) ln( P(x) / Q(x) )

   KL is an expected log ratio of probabilities.

2. Write probabilities for X=1 and X=0:

   P(1) = p, P(0) = 1 − p;  Q(1) = q, Q(0) = 1 − q

   Bernoulli distributions are determined by their success probabilities.

3. Expand the expectation over the two outcomes:

   D_KL(p||q) = p ln(p/q) + (1 − p) ln((1 − p)/(1 − q))

   This is the standard closed form for Bernoulli KL divergence.

Result
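The closed form can be checked against the definition directly by summing over the two outcomes. A small Python sketch (function names are illustrative):

```python
import math

def kl_from_definition(p, q):
    """Sum P(x) * ln(P(x)/Q(x)) over the two outcomes x in {0, 1}."""
    total = 0.0
    for P_x, Q_x in [(p, q), (1 - p, 1 - q)]:
        total += P_x * math.log(P_x / Q_x)
    return total

def kl_closed_form(p, q):
    """Step-3 closed form: p ln(p/q) + (1-p) ln((1-p)/(1-q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# The definition and the closed form agree for any valid (p, q) pair.
for p, q in [(0.5, 0.2), (0.9, 0.8), (0.3, 0.3)]:
    assert abs(kl_from_definition(p, q) - kl_closed_form(p, q)) < 1e-12

# Identical distributions give zero divergence.
assert kl_closed_form(0.3, 0.3) == 0.0
```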

Visual intuition

Graph

The graph depicts a convex curve showing the divergence as the model probability q varies with the true probability p held fixed. The curve has a global minimum of zero at q = p, where the two distributions are identical, and rises toward vertical asymptotes as q approaches 0 or 1 (for 0 < p < 1). This shape illustrates that information loss increases non-linearly, and without bound, as the predicted probability deviates from the true probability. Despite its U shape, the curve is not a parabola: a quadratic cannot have vertical asymptotes.

Graph type: convex
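This behaviour can be checked numerically: with p fixed at 0.5, the divergence is zero at q = p and grows without bound as q approaches the boundary. A quick sketch:

```python
import math

def kl_bernoulli(p, q):
    """D_KL(p||q) for Bernoulli distributions, in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p = 0.5
values = [kl_bernoulli(p, q) for q in (0.5, 0.1, 0.01, 0.001)]

# Zero at q = p, then strictly increasing as q marches toward 0.
assert values[0] == 0.0
assert values[1] < values[2] < values[3]
```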

Why it behaves this way

Intuition

Imagine two distinct bar charts, each representing a Bernoulli distribution with two bars (success and failure). The KL divergence quantifies the 'extra space' or 'distance' required to describe the first bar chart using the probabilities of the second: the closer the second chart matches the first, the less extra description is needed.

  • p: the true probability of the 'success' outcome for the reference Bernoulli distribution. This is the actual likelihood of the event, as observed or known from the true data-generating process.
  • q: the predicted or approximating probability of the 'success' outcome for the model Bernoulli distribution. This is our model's estimate or hypothesis for the likelihood of the same event.
  • D_KL(p||q): the Kullback-Leibler divergence between the true distribution p and the approximating distribution q. This is the total 'information loss' or 'relative entropy' when we use the probabilities from q to describe outcomes that truly follow p. A higher value means q is a poorer approximation of p.
  • p ln(p/q): the contribution to the total divergence from the 'success' outcome. This term quantifies the 'surprise' when the true probability of success is p but we expected q, weighted by how often success actually occurs.
  • (1-p) ln((1-p)/(1-q)): the contribution from the 'failure' outcome. It measures the same kind of 'surprise' for failure, weighted by its true probability 1-p.

Signs and relationships

  • \ln: The logarithmic function transforms probability ratios into units of information (nats, for the natural logarithm). An individual term such as `p\ln(p/q)` or `(1-p)\ln((1-p)/(1-q))` can be negative, but Gibbs' inequality guarantees their sum is always non-negative.
  • p: The true probabilities p and (1-p) act as weighting factors. They weight the information discrepancy for each outcome (success or failure) by how often that outcome actually occurs.
  • +: The two terms are summed to account for the total expected information discrepancy across both possible outcomes (success and failure).
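The sign behaviour is easy to verify numerically: for p = 0.2 and q = 0.5, the success term is negative, yet the total stays non-negative. A short check:

```python
import math

p, q = 0.2, 0.5
success_term = p * math.log(p / q)                    # 0.2 * ln(0.4) < 0
failure_term = (1 - p) * math.log((1 - p) / (1 - q))  # 0.8 * ln(1.6) > 0

# A single term can dip below zero; Gibbs' inequality guarantees
# the sum (the full KL divergence) never does.
assert success_term < 0
assert success_term + failure_term >= 0
```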

Free study cues

Insight

Canonical usage

KL Divergence is a dimensionless quantity, often expressed in 'nats' or 'bits' depending on the base of the logarithm used, but fundamentally represents a unitless measure of information.

Common confusion

Students might confuse 'nats' or 'bits' as physical units rather than as indicators of the logarithm's base, leading to attempts to convert them to other physical units or to expect dimensional consistency with physical quantities.

Dimension note

The KL divergence is inherently dimensionless as it is calculated from probabilities, which are themselves dimensionless ratios. While 'nats' or 'bits' are often used to denote the unit of information, these are not physical units.

One free problem

Practice Problem

A coin is known to have a true probability of landing heads of p = 0.5. If a researcher models this coin with an estimated probability q = 0.2, calculate the resulting KL Divergence in nats.

True Probability: p = 0.5
Model Probability: q = 0.2

Solve for: D_KL(p||q)

Hint: Plug the values into the formula using natural logarithms for both the p/q and (1-p)/(1-q) terms.
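Following the hint, the substitution can be sketched in a few lines of Python (a quick numeric check, not a substitute for working through the algebra):

```python
import math

p, q = 0.5, 0.2  # true and model probabilities from the problem

# Plug into D_KL(p||q) = p ln(p/q) + (1-p) ln((1-p)/(1-q))
kl = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# 0.5 * ln(2.5) + 0.5 * ln(0.625) = 0.2231... nats
assert abs(kl - 0.22314) < 1e-4
```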

The full worked solution stays in the interactive walkthrough.

Where it shows up

Real-World Context

KL Divergence (Bernoulli) quantifies how much a model's predicted probability q differs from the true probability p. The result matters because it supports likelihood estimates and risk-aware decision statements, rather than treating the model's number as certainty.

Study smarter

Tips

  • Ensure p and q values remain strictly between 0 and 1 to avoid natural logs of zero or infinity.
  • Remember that D(p||q) is not equal to D(q||p); the order represents the direction from the truth p to the model q.
  • A divergence of 0 always implies that the two distributions are perfectly identical.
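The first tip can be enforced in code by clamping inputs away from 0 and 1 before taking logarithms (the epsilon value below is an illustrative choice, not a universal constant):

```python
import math

EPS = 1e-12  # illustrative clamp; choose a tolerance suited to your data

def kl_bernoulli_safe(p, q):
    """D_KL(p||q) with p and q clamped to (EPS, 1 - EPS) to avoid log(0)."""
    p = min(max(p, EPS), 1 - EPS)
    q = min(max(q, EPS), 1 - EPS)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Without clamping, q = 0 would raise a ZeroDivisionError;
# with it, the result is large but finite.
assert math.isfinite(kl_bernoulli_safe(0.5, 0.0))
```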

Avoid these traps

Common Mistakes

  • Swapping p and q (changes the value).
  • Assuming KL is a distance metric (it isn’t symmetric).
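Both mistakes can be demonstrated in two lines: swapping the arguments changes the value, which is exactly why KL fails the symmetry requirement of a distance metric. A minimal sketch:

```python
import math

def kl_bernoulli(p, q):
    """D_KL(p||q) for Bernoulli distributions, in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

forward = kl_bernoulli(0.5, 0.2)  # D(p||q)
reverse = kl_bernoulli(0.2, 0.5)  # D(q||p)

# The two directions disagree, so KL is not symmetric.
assert abs(forward - reverse) > 1e-3
```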


References

Sources

  1. T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed.
  2. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning
  3. Wikipedia: Kullback-Leibler divergence
  4. Wikipedia: Bernoulli distribution