
Cross-Entropy (Bernoulli)

Cross-entropy between true Bernoulli(p) and model Bernoulli(q).



Core idea

Overview

Cross-entropy for a Bernoulli distribution quantifies the divergence between the true binary probability p and the predicted probability q. It is the standard metric used in binary classification to penalize models based on how much their predicted distribution differs from the actual target distribution.

When to use: Apply this equation when evaluating binary classification models where outcomes are mutually exclusive. It is the primary loss function for training logistic regression models and neural-network binary classifiers.

Why it matters: This function is superior to mean squared error for classification because it provides stronger gradients when the model is confidently wrong. This results in faster convergence during optimization processes like gradient descent.
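To make the gradient claim concrete, here is a minimal sketch (assuming a single example whose true label is 1, so the cross-entropy reduces to −ln q and the squared error to (q − 1)²). The derivatives show the cross-entropy gradient growing without bound as the model becomes confidently wrong, while the squared-error gradient stays bounded:

    # True label is 1, so cross-entropy reduces to -ln(q) and squared error to (q - 1)**2.
    # Their derivatives with respect to q:
    #   cross-entropy: -1/q        (magnitude grows without bound as q -> 0)
    #   squared error:  2*(q - 1)  (magnitude never exceeds 2)
    for q in (0.9, 0.5, 0.1, 0.01):
        print(f"q={q:<5}  d(CE)/dq={-1.0 / q:>8.2f}  d(MSE)/dq={2.0 * (q - 1.0):>6.2f}")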

Symbols

Variables

H(p,q) = Cross-Entropy, p = True Probability, q = Model Probability

H(p, q) — Cross-Entropy (nats)
p — True Probability (dimensionless)
q — Model Probability (dimensionless)

Walkthrough

Derivation

Derivation of Cross-Entropy for Bernoulli Variables

Cross-entropy is the expected negative log-probability under a model q when data follow true probability p.

  • Binary variable X∈{0,1}.
  • True distribution: P(X=1)=p.
  • Model distribution: Q(X=1)=q.
1

Start from the definition of cross-entropy:

H(p, q) = −E_P[ln Q(X)]

Cross-entropy is expected negative log-likelihood under the model Q.

2

Write the expectation over X = 1 and X = 0:

H(p, q) = −[p ln q + (1 − p) ln(1 − q)]

With probability p you observe 1 (log-likelihood ln q), otherwise 0 (log-likelihood ln(1 − q)).

Result

H(p, q) = −p ln q − (1 − p) ln(1 − q)
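A minimal sketch of this result in code (a plain illustrative implementation, not any particular library's API):

    import math

    def bernoulli_cross_entropy(p: float, q: float) -> float:
        """Cross-entropy H(p, q) in nats between Bernoulli(p) and Bernoulli(q).

        p is the true probability of the positive class, q is the model's
        prediction; q must lie strictly between 0 and 1.
        """
        return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))

    print(bernoulli_cross_entropy(1.0, 0.99))  # ~0.010 nats: confident and correct
    print(bernoulli_cross_entropy(1.0, 0.01))  # ~4.605 nats: confident and wrong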

Visual intuition

Graph


The plot shows a convex curve that trends toward infinity as the predicted probability approaches the opposite of the true label. The curve has a global minimum of zero when the prediction perfectly matches the target, and the penalty grows without bound as the prediction moves away from the ground truth. This shape reflects the core principle of cross-entropy: the model is heavily penalized for being confidently wrong.

Graph type: logarithmic
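As a small illustration of the curve described above (assuming the true label is positive, p = 1, so the loss is simply −ln q):

    import math

    # Near zero when q is close to the true label, unbounded as q approaches 0.
    for q in (0.99, 0.9, 0.5, 0.1, 0.01, 0.001):
        print(f"q = {q:<6}  loss = {-math.log(q):.3f} nats")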

Why it behaves this way

Intuition

Imagine two bar charts: one representing the true probabilities p and 1 − p, and another representing the model's predicted probabilities q and 1 − q. Cross-entropy measures how costly it is, on average, to encode outcomes drawn from the first chart using a code sized for the second; the more the two charts disagree, the larger the value.

H(p, q)
A measure of the average number of nats (or bits, with a base-2 logarithm) needed to encode an event from a true distribution p when using a code optimized for a predicted distribution q.
Quantifies how 'surprised' a model is by the actual outcome, averaged over all possible outcomes, when its predictions are q and the true probabilities are p. A higher value means greater divergence or 'surprise'.
p
The true probability of the positive class (e.g., the actual label is 1).
Represents the actual, observed likelihood of an event occurring.
q
The predicted probability of the positive class (e.g., the model's output for label 1).
Represents the model's estimated likelihood of an event occurring.
ln q
The logarithm of the predicted probability of the positive class.
This term contributes to the loss when the true outcome is positive (p = 1). It heavily penalizes the model when it predicts a low q for a true positive event, since ln q becomes very negative for small q.
ln(1 − q)
The logarithm of the predicted probability of the negative class.
This term contributes to the loss when the true outcome is negative (p = 0). It heavily penalizes the model when it predicts a high q (meaning a low 1 − q) for a true negative event.

Signs and relationships

  • -: The logarithm of a probability (a value between 0 and 1) is always negative or zero. The leading negative sign ensures that the cross-entropy loss is non-negative, which is the convention for loss functions.

Free study cues

Insight

Canonical usage

This equation calculates a dimensionless value, often interpreted in 'nats' when using the natural logarithm, quantifying the divergence between two probability distributions.

Common confusion

A common confusion is attempting to assign physical units to cross-entropy or incorrectly interchanging 'nats' (natural logarithm) with 'bits' (base-2 logarithm) without proper conversion or context.
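A one-line conversion sketch (the 0.3567 figure is just an example value in nats): dividing by ln 2 converts nats to bits, because log2 x = ln x / ln 2.

    import math

    h_nats = 0.3567                 # example cross-entropy value in nats
    h_bits = h_nats / math.log(2)   # same quantity expressed in bits
    print(round(h_bits, 4))         # ~0.5146 bits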

Dimension note

Cross-entropy is a dimensionless measure of the average number of nats (or bits, if a base-2 logarithm is used) required to identify an event from a true distribution, given an encoding optimized for a predicted distribution.

Unit systems

p — dimensionless. Represents the true probability of an event, a value between 0 and 1.
q — dimensionless. Represents the predicted probability of an event, a value between 0 and 1.
H(p, q) — dimensionless. The cross-entropy value itself carries no physical unit; it is usually read in nats when the natural logarithm is used.

One free problem

Practice Problem

A machine learning model predicts a 0.7 probability (q) that an image contains a cat. The actual image is indeed a cat (p = 1.0). Calculate the binary cross-entropy for this prediction in nats.

True Probability: p = 1
Model Probability: q = 0.7

Solve for: H(p, q)

Hint: Since p = 1, the term (1-p) becomes zero, meaning you only need to calculate -ln(q).
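A quick numeric check following the hint (a minimal sketch of the calculation, not a full step-by-step walkthrough):

    import math

    p, q = 1.0, 0.7
    # With p = 1 the (1 - p) * ln(1 - q) term vanishes, leaving -ln(q).
    loss = -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))
    print(round(loss, 4))  # 0.3567 nats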


Where it shows up

Real-World Context

A typical example is the expected log-loss of a spam filter that over- or under-estimates the spam probability: Cross-Entropy (Bernoulli) measures the cost of acting on the Model Probability q when events actually follow the True Probability p. The result matters because it helps evaluate model behaviour, algorithm cost, or prediction quality before relying on the output.
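To make that concrete, here is a hedged sketch with assumed numbers: suppose the true spam rate is p = 0.20 but the filter assigns q = 0.35 to every message.

    import math

    p, q = 0.20, 0.35   # assumed true spam rate and the filter's predicted spam probability
    expected_log_loss = -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))
    print(round(expected_log_loss, 4))  # ~0.5546 nats for these assumed numbers

For comparison, a perfectly calibrated filter (q = p = 0.20) would score about 0.5004 nats, the entropy of the true distribution and the lowest value any model can achieve here.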

Study smarter

Tips

  • Ensure the predicted value q is strictly between 0 and 1 to avoid undefined log operations (a minimal clipping sketch follows this list).
  • Note that p usually represents the ground truth label and is typically 0 or 1.
  • Lower cross-entropy values indicate a model that is more closely aligned with the true data distribution.
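As referenced in the first tip, a minimal clipping sketch (the epsilon value 1e-12 is an arbitrary illustrative choice, not a standard):

    import math

    def safe_bernoulli_cross_entropy(p: float, q: float, eps: float = 1e-12) -> float:
        """Clip q into [eps, 1 - eps] so ln(0) can never occur."""
        q = min(max(q, eps), 1.0 - eps)
        return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))

    print(safe_bernoulli_cross_entropy(1.0, 1.0))  # finite, instead of a math domain error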

Avoid these traps

Common Mistakes

  • Using percentages instead of probabilities (0.7 not 70).
  • Taking ln of 0 (q must be strictly between 0 and 1).

Common questions

Frequently Asked Questions

What does the Bernoulli cross-entropy measure?
Cross-entropy is the expected negative log-probability under a model q when data follow true probability p.

When should this formula be used?
Apply this equation when evaluating binary classification models where outcomes are mutually exclusive. It is the primary loss function for training logistic regression models and neural-network binary classifiers.

Why use cross-entropy instead of mean squared error?
This function is superior to mean squared error for classification because it provides stronger gradients when the model is confidently wrong. This results in faster convergence during optimization processes like gradient descent.

What are the most common mistakes?
Using percentages instead of probabilities (0.7, not 70), and taking ln of 0 (q must be strictly between 0 and 1).

Where does it show up in practice?
In the expected log-loss of a spam filter that over- or under-estimates the spam probability: the formula quantifies the cost of acting on the Model Probability q when events actually follow the True Probability p, which helps evaluate model behaviour, algorithm cost, or prediction quality before relying on the output.

How should I study it?
Ensure the predicted value q is strictly between 0 and 1 to avoid undefined log operations, remember that p usually represents the ground-truth label and is typically 0 or 1, and note that lower cross-entropy values indicate a model more closely aligned with the true data distribution.

References

Sources

  1. Wikipedia: Cross-entropy
  2. Cover, Thomas M., and Joy A. Thomas. Elements of Information Theory. 2nd ed. Wiley-Interscience, 2006.
  3. Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
  6. Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.