
Gradient Descent

Optimization update rule.



Core idea

Overview

Gradient descent is a first-order iterative optimization algorithm used to find the local minimum of a differentiable function. It functions by taking steps proportional to the negative of the gradient of the function at the current point.

When to use: This algorithm is used when training machine learning models like linear regression or neural networks to minimize a loss function. It is preferred when an analytical solution is too computationally expensive or impossible to derive due to high dimensionality.

Why it matters: It is the fundamental engine behind modern artificial intelligence, allowing models to 'learn' by incrementally reducing error. Its efficiency makes it possible to optimize functions with millions of parameters across massive datasets.

Symbols

Variables

  • θ_new — New Weight
  • θ_old — Old Weight
  • α — Learning Rate
  • ∇J(θ) — Gradient

Walkthrough

Derivation

Understanding Gradient Descent

Gradient descent is an iterative method for minimising a differentiable cost function by stepping in the direction of steepest decrease.

  • The cost function J(θ) is differentiable.
  • A learning rate α is chosen to ensure stable progress.
  • For non-convex functions, the method may converge only to a local minimum.

Step 1 — Identify the gradient direction:

The gradient ∇J(θ) points in the direction of steepest increase of J; to minimise J we move in the opposite direction.

Step 2 — State the update rule:

θ_new = θ_old − α ∇J(θ_old)

Subtract a fraction of the gradient to take a step downhill in parameter space.

Note: If α is too large, updates can overshoot the minimum; if α is too small, convergence is very slow.

Result

θ_new = θ_old − α ∇J(θ_old)

Source: Standard curriculum — A-Level Data Science (Optimisation)
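The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the page's worked solver: the cost J(θ) = (θ − 3)² and its gradient 2(θ − 3) are illustrative choices, not from the page.

```python
# Minimal sketch of the update rule theta_new = theta_old - alpha * grad_J(theta_old),
# applied to the convex cost J(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).

def gradient_descent(grad_J, theta0, alpha=0.1, steps=100):
    """Repeatedly step opposite the gradient, starting from theta0."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad_J(theta)  # the update rule
    return theta

grad_J = lambda theta: 2 * (theta - 3)  # gradient of (theta - 3)^2

theta_min = gradient_descent(grad_J, theta0=0.0)
print(round(theta_min, 4))  # approaches the minimiser theta = 3
```

Note that the loop never evaluates J itself; only the gradient is needed to take a step, which is what makes the method cheap at scale.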

Free formulas

Rearrangements

Solve for G

Gradient Descent: Substituting the Gradient

This sequence of steps introduces a shorthand symbol `G` for the gradient ∇J(θ) of the cost function `J(θ)`, simplifying the update rule to θ_new = θ_old − αG. Rearranging for the gradient gives G = (θ_old − θ_new) / α.

Difficulty: 2/5


Visual intuition

Graph


The plot displays a convex, U-shaped parabolic curve representing the cost function relative to a model parameter. The curve features a distinct global minimum at its turning point, which signifies the optimal parameter value where the gradient is zero. In the context of Gradient Descent, the downward slope indicates the direction the algorithm travels to iteratively minimize the loss function.

Graph type: polynomial
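The U-shaped curve described above can be tabulated numerically. This sketch uses the illustrative quadratic J(θ) = θ² (not a formula from the page) to show that the gradient is negative left of the minimum, zero at the turning point, and positive to the right.

```python
# Sample a convex, U-shaped cost and its gradient around the minimum.
J = lambda theta: theta ** 2       # illustrative parabolic cost
grad = lambda theta: 2 * theta     # its gradient: zero at the turning point

for theta in [-2, -1, 0, 1, 2]:
    print(theta, J(theta), grad(theta))
# The sign of the gradient tells gradient descent which way is downhill.
```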

Why it behaves this way

Intuition

Imagine a blindfolded person trying to find the lowest point in a valley by always taking a small step downhill in the steepest possible direction.

θ — The set of adjustable parameters (e.g., weights and biases) of the machine learning model.
These are the 'settings' the model tunes to learn patterns from data.
J(θ) — The loss or cost function, which quantifies the error or discrepancy between the model's predictions and the actual data.
It measures 'how wrong' the model is for a given set of parameters; the goal is to make this value as small as possible.
∇J(θ) — The gradient of the loss function J with respect to the parameters θ. It indicates the direction of the steepest increase in the loss.
This vector points towards where the error would grow fastest if the parameters were changed in that direction.
α — The learning rate, a hyperparameter that determines the size of the step taken in the direction opposite to the gradient.
It controls how aggressively the model updates its parameters. A larger α means bigger steps, potentially faster but riskier convergence; a smaller α means smaller, more stable steps.
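The trade-off in the last point can be demonstrated concretely. This hedged sketch compares three learning rates on the illustrative cost J(θ) = θ² (not from the page); for this quadratic each update multiplies θ by (1 − 2α), so α > 1 makes the iterates diverge.

```python
# Compare learning rates on J(theta) = theta^2, where grad J(theta) = 2*theta.
def run(alpha, theta=1.0, steps=20):
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # update rule with grad J = 2*theta
    return theta

small = run(0.01)   # too small: still far from 0 after 20 steps
good = run(0.4)     # well chosen: converges rapidly towards 0
large = run(1.1)    # too large: each step overshoots and |theta| grows

print(small, good, large)
```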

Signs and relationships

  • The negative sign preceding α ∇J(θ): The gradient ∇J(θ) points in the direction of steepest ascent of the loss function. To minimize the loss, the parameters must be updated by moving in the opposite direction, hence the negative sign.
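The role of the sign can be verified numerically. A small check on the illustrative quadratic J(θ) = θ² (an assumption, not the page's cost): stepping against the gradient lowers the cost, stepping with it raises the cost.

```python
# One step with each sign from theta = 1.0 on J(theta) = theta^2.
J = lambda t: t ** 2
grad = lambda t: 2 * t

theta = 1.0
down = theta - 0.1 * grad(theta)  # correct: move against the gradient
up = theta + 0.1 * grad(theta)    # wrong sign: move with the gradient

print(J(down) < J(theta))  # True: the cost decreased
print(J(up) > J(theta))    # True: the cost increased
```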

Free study cues

Insight

Canonical usage

All terms in the gradient descent update rule must be dimensionally consistent, meaning they must have the same units.

Common confusion

Students often assume all quantities are dimensionless, especially when working with normalized data, and overlook the dimensional requirements for the learning rate `α` when the parameters or loss function have units.

Unit systems

θ — Varies · Model parameters. Units depend on the specific model and features. Often dimensionless if features are normalized or represent probabilities, but they can have units if parameters directly scale physical quantities.
J(θ) — Varies · Loss function. Units depend on the chosen loss function and the units of the model's output. For example, Mean Squared Error (MSE) has the units of the squared output variable.
∇J(θ) — units(J) / units(θ) · Gradient of the loss function with respect to the parameters. Its units are the units of the loss function divided by the units of the parameters.
α — units(θ)² / units(J) · Learning rate. The learning rate scales the gradient term. Its units must ensure that the product α ∇J(θ) has the same units as θ, keeping the update rule dimensionally consistent.

One free problem

Practice Problem

A model parameter is currently at 5.0. If the learning rate is 0.1 and the gradient of the loss function is 4.0, calculate the updated parameter value.

Old Weight = 5.0
Learning Rate = 0.1
Gradient = 4.0

Solve for: θ_new (New Weight)

Hint: Subtract the product of the learning rate and the gradient from the old value.

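Plugging the given numbers straight into the update rule gives the answer (a direct calculation, not an app walkthrough):

```python
# Practice problem: theta_new = theta_old - alpha * gradient
theta_old = 5.0   # Old Weight
alpha = 0.1       # Learning Rate
gradient = 4.0    # Gradient of the loss function

theta_new = theta_old - alpha * gradient
print(theta_new)  # 4.6
```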

Where it shows up

Real-World Context

In updating weights in a neural network, Gradient Descent is used to calculate the New Weight from the Old Weight, Learning Rate, and Gradient. The result matters because each update nudges the weights towards lower loss; repeated across many iterations, these small steps are how the network learns.

Study smarter

Tips

  • Choose a learning rate that is neither too large to cause divergence nor too small to stall progress.
  • Feature scaling or normalization helps the algorithm converge significantly faster by evening out the cost surface.
  • Check the cost function value at each iteration to ensure it is consistently decreasing.
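The last tip can be automated. A minimal sketch, again using the illustrative quadratic cost J(θ) = θ² (not from the page): record the cost at every iteration and check that it never increases.

```python
# Gradient descent with per-iteration cost monitoring.
def descend_with_monitoring(alpha, theta=1.0, steps=50):
    J = lambda t: t ** 2       # illustrative cost
    grad = lambda t: 2 * t     # its gradient
    costs = [J(theta)]
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
        costs.append(J(theta))
    # A healthy run never sees the cost rise between iterations.
    monotone = all(b <= a for a, b in zip(costs, costs[1:]))
    return theta, monotone

theta, ok = descend_with_monitoring(alpha=0.1)
print(ok)  # True: the cost decreased at every step
```

Running the same check with too large a learning rate (e.g. α = 1.1) reports a non-decreasing cost, which is the practical signal to reduce α.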

Avoid these traps

Common Mistakes

  • Using a learning rate that is too large.
  • Adding gradient instead of subtracting.

Common questions

Frequently Asked Questions

What is gradient descent?

Gradient descent is an iterative method for minimising a differentiable cost function by stepping in the direction of steepest decrease.

When should it be used?

This algorithm is used when training machine learning models such as linear regression or neural networks to minimize a loss function. It is preferred when an analytical solution is too computationally expensive or impossible to derive due to high dimensionality.

Why does it matter?

It is the fundamental engine behind modern artificial intelligence, allowing models to 'learn' by incrementally reducing error. Its efficiency makes it possible to optimize functions with millions of parameters across massive datasets.

What are the common mistakes?

Using a learning rate that is too large, and adding the gradient instead of subtracting it.

Where does it show up in practice?

In updating weights in a neural network, Gradient Descent calculates the New Weight from the Old Weight, Learning Rate, and Gradient; repeated across many iterations, these updates are how the network reduces its error.

How can it be applied effectively?

Choose a learning rate that is neither so large it causes divergence nor so small it stalls progress. Feature scaling or normalization helps the algorithm converge significantly faster by evening out the cost surface. Check the cost function value at each iteration to ensure it is consistently decreasing.

References

Sources

  1. Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
  2. Wikipedia: Gradient descent
  3. Pattern Recognition and Machine Learning by Christopher M. Bishop
  4. The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
  5. Standard curriculum — A-Level Data Science (Optimisation)