
Binary Cross-Entropy

Loss function for binary classification.


Core idea

Overview

Binary Cross-Entropy measures the divergence between two probability distributions: the true labels and the predicted probabilities in a binary classification task. It produces a loss value that grows steeply, and without bound, as a confident prediction diverges from the actual class value.

When to use: This equation is the standard loss function for binary classification problems where the output is a single probability between 0 and 1. It is most effective when paired with a sigmoid activation function in the final layer of a neural network.

Why it matters: It provides a smooth, convex surface for optimization, allowing gradient descent to effectively update model weights. By heavily penalizing confident but incorrect predictions, it forces the model to learn more distinct boundaries between classes.
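One reason the pairing with a sigmoid works so well: the gradient of the loss with respect to the pre-sigmoid logit z simplifies to p - y. A minimal finite-difference check of that identity (plain Python; the function names are illustrative, not from any library):

```python
import math

def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, p):
    """Binary cross-entropy for a single example, natural log."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Analytic gradient of BCE w.r.t. the logit z is (p - y).
y, z = 1.0, 0.3
p = sigmoid(z)
analytic = p - y

# Compare against a central finite difference.
h = 1e-6
numeric = (bce(y, sigmoid(z + h)) - bce(y, sigmoid(z - h))) / (2 * h)
print(abs(analytic - numeric) < 1e-6)  # True
```

This clean gradient is why frameworks usually fuse the sigmoid and the loss into one numerically stable operation.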

Symbols

Variables

  • L = Loss
  • y = Actual label (0 or 1)
  • p = Predicted probability

Walkthrough

Derivation

Formula: Binary Cross-Entropy (Log Loss)

Binary cross-entropy measures how well predicted probabilities match true binary labels y, heavily penalising confident wrong predictions.

  • Binary labels y ∈ {0, 1}.
  • Predictions p are probabilities in (0, 1), commonly from a sigmoid.
  • Logarithms are natural logs unless specified otherwise (the choice changes only the scale).
Step 1. Write the loss for one example:

L = -[y ln(p) + (1 - y) ln(1 - p)]

If y = 1, only -ln(p) contributes; if y = 0, only -ln(1 - p) contributes.

Step 2. Average across N examples:

L = -(1/N) Σᵢ [ yᵢ ln(pᵢ) + (1 - yᵢ) ln(1 - pᵢ) ]

The dataset loss is the mean of the individual losses, giving a single number to minimise during training.

Note: In practice, probabilities are clipped away from 0 and 1 to avoid ln(0).
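The two steps above can be sketched in plain Python (a minimal illustration, not a library implementation; the `eps` clipping value of 1e-12 is an arbitrary but common choice):

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy over a dataset, natural log."""
    # Clip each probability away from 0 and 1 to avoid ln(0).
    clipped = [min(max(p, eps), 1 - eps) for p in p_pred]
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, clipped)]
    return sum(losses) / len(losses)

# Confident correct predictions give a small loss...
print(round(binary_cross_entropy([1, 0], [0.9, 0.1]), 3))  # 0.105
# ...while a confident wrong prediction is penalised heavily.
print(round(binary_cross_entropy([1], [0.01]), 3))         # 4.605
```

Because of the clipping step, even a pathological input like p = 0 with y = 1 yields a large finite loss instead of a crash.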

Result

Source: Standard curriculum — Machine Learning (Classification Losses)

Visual intuition

Graph

The graph shows the loss plotted against the predicted probability p. For a true label y = 1, the loss -ln(p) tends toward infinity as p approaches 0; for y = 0, the loss -ln(1 - p) tends toward infinity as p approaches 1. As the prediction moves away from the true label, the loss rises sharply, reflecting the penalty for confident incorrect predictions.

Graph type: logarithmic

Why it behaves this way

Intuition

A landscape where the model aims to find the lowest point, representing minimal divergence between its predicted probabilities and the true class labels, with steep gradients that severely penalize confident incorrect predictions.

  • L: A scalar quantifying the discrepancy between the true label and the predicted probability for a single data point. A higher value indicates a worse prediction, meaning the model was more wrong or less confident in the correct answer.
  • y: The actual, correct binary class label (0 or 1) for the input data; the target value the model is trying to learn and predict.
  • p: The model's estimated probability that the true label is 1, representing its confidence in the positive class.
  • ln(p): The natural logarithm of the predicted probability. It penalizes the model more heavily as p approaches 0 when the true class is 1 (a confident wrong prediction).
  • ln(1 - p): The natural logarithm of the probability assigned to class 0, i.e. 1 - p. It penalizes the model more heavily as p approaches 1 when the true class is 0 (a confident wrong prediction).

Signs and relationships

  • Negative sign: The natural logarithm of a probability (a value between 0 and 1) is always negative or zero, so the bracketed sum is non-positive. Multiplying the entire expression by -1 makes the loss L a non-negative value that can be minimised towards zero.

Free study cues

Insight

Canonical usage

This equation calculates a dimensionless loss value, representing the divergence between a true binary label and a predicted probability.

Common confusion

A common mistake is to input probabilities as percentages (e.g., 75%) instead of decimal values (e.g., 0.75), which would lead to incorrect logarithmic calculations.
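A quick illustration of why the scale matters (plain Python): a percentage input produces a negative "loss", which is nonsense for this function.

```python
import math

p_wrong = 75    # percentage entered by mistake
p_right = 0.75  # correct decimal form

# With y = 1, the loss is -ln(p).
print(round(-math.log(p_wrong), 3))  # -4.317  (negative: impossible for BCE)
print(round(-math.log(p_right), 3))  # 0.288
```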

Dimension note

All variables in the Binary Cross-Entropy formula (true label 'y', predicted probability 'p', and the resulting loss 'L') are dimensionless quantities.

Unit systems

  • y: dimensionless; the true binary label, typically 0 or 1.
  • p: dimensionless; the predicted probability, a value between 0 and 1.
  • L: dimensionless; the calculated Binary Cross-Entropy loss score.

One free problem

Practice Problem

A machine learning model identifies a transaction as fraudulent (y = 1). The model's predicted probability of fraud is 0.85. Calculate the binary cross-entropy loss for this specific prediction.

Actual label: y = 1
Predicted probability: p = 0.85

Solve for: L

Hint: When y = 1, the formula simplifies to L = -ln(p).
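Checking the hint numerically (plain Python, natural log):

```python
import math

y, p = 1, 0.85
# With y = 1, the (1 - y) term vanishes and L = -ln(p).
loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
print(round(loss, 4))  # 0.1625
```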


Where it shows up

Real-World Context

In training a spam classifier with probabilistic output, Binary Cross-Entropy computes the loss from the actual label (0/1) and the predicted probability. The result matters because it tells the optimiser how far each prediction is from the truth, driving the weight updates that improve the classifier.

Study smarter

Tips

  • Ensure predicted values p stay within (0, 1) to avoid undefined natural logs at 0 or 1.
  • The loss is 0 only if the prediction perfectly matches the label.
  • For multi-class targets, use the Categorical Cross-Entropy variant instead.

Avoid these traps

Common Mistakes

  • Using p=0 or p=1 directly.
  • Forgetting the (1-y) term.
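Both traps are easy to reproduce (a minimal sketch in plain Python):

```python
import math

# Trap 1: p = 0 with y = 1 means math.log(0), which raises a ValueError.
try:
    loss = -math.log(0.0)
except ValueError:
    loss = -math.log(1e-12)  # standard fix: clip p away from 0 first
print(round(loss, 2))  # 27.63

# Trap 2: dropping the (1 - y) term scores every y = 0 example as zero loss,
# so a model that confidently predicts p = 0.99 looks perfect on negatives.
y, p = 0, 0.99
wrong = -(y * math.log(p))                              # 0.0 (missing term)
right = -(y * math.log(p) + (1 - y) * math.log(1 - p))  # ≈ 4.605
```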


References

Sources

  1. Wikipedia: Cross-entropy
  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. (Chapter 6, Section 6.2.2.2)
  3. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. (Chapter 4, Section 4.3.4)
  4. Standard curriculum — Machine Learning (Classification Losses)