Cross-Entropy (Bernoulli)
Cross-entropy between true Bernoulli(p) and model Bernoulli(q).
Core idea
Overview
Cross-entropy for a Bernoulli distribution quantifies the divergence between the true binary probability p and the predicted probability q. It is the standard metric used in binary classification to penalize models based on how much their predicted distribution differs from the actual target distribution.
When to use: Apply this equation when evaluating binary classification models where outcomes are mutually exclusive. It is the primary loss function used during the training of logistic regression models and binary neural networks.
Why it matters: This function is superior to mean squared error for classification because it provides stronger gradients when the model is confidently wrong. This results in faster convergence during optimization processes like gradient descent.
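A minimal numerical sketch of the gradient claim, assuming a sigmoid output unit (the values of z and p are illustrative): for a "confidently wrong" prediction, the cross-entropy gradient with respect to the logit stays near 1, while the squared-error gradient is damped by the sigmoid's saturation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A "confidently wrong" prediction: true label p = 1, model output q ≈ 0.002
p = 1.0
z = -6.0                 # pre-activation (logit), chosen for illustration
q = sigmoid(z)

# Gradient of each loss with respect to z for a sigmoid output:
grad_ce = q - p                    # cross-entropy: stays large when wrong
grad_mse = (q - p) * q * (1 - q)   # squared error: vanishes at saturation
```

Here `grad_ce` is roughly 400 times larger in magnitude than `grad_mse`, which is why gradient descent escapes confident mistakes much faster under cross-entropy.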
Symbols
Variables
- H(p, q): cross-entropy
- p: true probability
- q: model (predicted) probability
Walkthrough
Derivation
Derivation of Cross-Entropy for Bernoulli Variables
Cross-entropy is the expected negative log-probability under a model q when data follow true probability p.
- Binary variable X∈{0,1}.
- True distribution: P(X=1)=p.
- Model distribution: Q(X=1)=q.
Start from the definition of cross-entropy: the expected negative log-likelihood under the model Q, with the expectation taken under the true distribution P:
H(p, q) = −E[ln Q(X)]
Write the expectation over X=1 and X=0:
With probability p you observe 1 (log-likelihood ln q); with probability 1−p you observe 0 (log-likelihood ln(1−q)).
Result
H(p, q) = −[p ln q + (1 − p) ln(1 − q)]
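In code, the result H(p, q) = −[p ln q + (1 − p) ln(1 − q)] can be sketched directly (no numerical safeguards here; q must lie strictly between 0 and 1):

```python
import math

def bernoulli_cross_entropy(p, q):
    """Bernoulli cross-entropy H(p, q) in nats.

    p: true probability of the event (often a hard label, 0 or 1).
    q: model's predicted probability, strictly inside (0, 1).
    """
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))
```

For example, with a hard negative label (p = 0) and a prediction q = 0.2, the loss reduces to −ln(1 − 0.2) = −ln(0.8).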
Visual intuition
Graph
The plot of the loss is a convex curve that tends toward infinity as the predicted probability approaches the opposite of the true label. It has a global minimum of zero when the prediction exactly matches a hard (0 or 1) target, and the penalty grows without bound as the prediction moves away from the ground truth. This shape reflects the core principle of cross-entropy: the model is heavily penalized for being confidently wrong.
Graph type: logarithmic
Why it behaves this way
Intuition
Imagine two bar charts: one representing the true probabilities p and 1−p, and another representing the model's predicted probabilities q and 1−q. Cross-entropy measures the mismatch between the two charts: the closer the model's bars are to the true ones, the lower the loss.
Signs and relationships
- Minus sign: the logarithm of a probability (a value between 0 and 1) is always negative or zero. The leading negative sign therefore makes the cross-entropy non-negative, which is conventional for loss functions.
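A small sketch of both properties (the choice p = 0.3 and the grid of q values are illustrative): the loss stays positive for every prediction, and among the candidates it is smallest exactly when q equals p.

```python
import math

def h(p, q):
    """Bernoulli cross-entropy H(p, q) in nats."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

p = 0.3
qs = [0.1, 0.2, 0.3, 0.4, 0.5]
values = [h(p, q) for q in qs]

# Every value is positive (the minus sign flips the negative logs),
# and the minimum over the grid occurs at q == p.
best_q = min(zip(values, qs))[1]
```

Note that the minimum value H(p, p) is the entropy of the true distribution, which is zero only when p is exactly 0 or 1.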
Free study cues
Insight
Canonical usage
This equation calculates a dimensionless value, often interpreted in 'nats' when using the natural logarithm, quantifying the divergence between two probability distributions.
Common confusion
A common confusion is attempting to assign physical units to cross-entropy or incorrectly interchanging 'nats' (natural logarithm) with 'bits' (base-2 logarithm) without proper conversion or context.
Dimension note
Cross-entropy is a dimensionless measure of the average number of nats (or bits, if a base-2 logarithm is used) required to identify an event from a true distribution, given an encoding optimized for a predicted distribution.
Unit systems
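The conversion between the two unit systems is a fixed factor of ln 2 ≈ 0.693: divide a nats value by ln 2 to get bits, and multiply to go back. A tiny sketch (the value 0.25 is arbitrary):

```python
import math

h_nats = 0.25                  # a cross-entropy value measured in nats
h_bits = h_nats / math.log(2)  # the same quantity expressed in bits
h_back = h_bits * math.log(2)  # converting back recovers the nats value
```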
One free problem
Practice Problem
A machine learning model predicts a 0.7 probability (q) that an image contains a cat. The actual image is indeed a cat (p = 1.0). Calculate the binary cross-entropy for this prediction in nats.
Solve for: H(p, q), the cross-entropy in nats.
Hint: Since p = 1, the term (1-p) becomes zero, meaning you only need to calculate -ln(q).
The full worked solution stays in the interactive walkthrough.
Where it shows up
Real-World Context
Consider the expected log-loss when a spam filter over- or underestimates the probability that a message is spam: Bernoulli cross-entropy turns the true probability and the model's predicted probability into a single number measuring how costly those miscalibrated predictions are. The result matters because it helps evaluate model behaviour, algorithm cost, or prediction quality before relying on the output.
Study smarter
Tips
- Ensure predicted value q is strictly between 0 and 1 to avoid undefined log operations.
- Note that p usually represents the ground truth label and is typically 0 or 1.
- Lower cross-entropy values indicate a model that is more closely aligned with the true data distribution.
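The tips above can be folded into one small helper that computes the mean loss over a batch, as used during training; the epsilon clamp value here is an illustrative choice, not a standard:

```python
import math

def mean_cross_entropy(labels, preds, eps=1e-12):
    """Mean Bernoulli cross-entropy over a batch, in nats.

    labels: ground-truth values p, typically 0 or 1.
    preds:  predicted probabilities q; each is clamped into
            (eps, 1 - eps) so ln never receives 0, per the tips above.
    """
    total = 0.0
    for p, q in zip(labels, preds):
        q = min(max(q, eps), 1.0 - eps)
        total += -(p * math.log(q) + (1 - p) * math.log(1 - q))
    return total / len(labels)
```

A well-calibrated batch such as `mean_cross_entropy([1, 0], [0.9, 0.1])` gives a low loss, and thanks to the clamp even an extreme prediction of exactly 1.0 or 0.0 stays finite.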
Avoid these traps
Common Mistakes
- Using percentages instead of probabilities (0.7 not 70).
- Taking ln of 0 (q must be strictly between 0 and 1).
References
Sources
- Wikipedia: Cross-entropy
- Cover, Thomas M., and Joy A. Thomas. Elements of Information Theory. 2nd ed. Wiley-Interscience, 2006.
- Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.