Gradient Descent
Optimization update rule.
Core idea
Overview
Gradient descent is a first-order iterative optimization algorithm used to find a local minimum of a differentiable function. It works by taking steps proportional to the negative of the gradient of the function at the current point.
When to use: This algorithm is used when training machine learning models like linear regression or neural networks to minimize a loss function. It is preferred when an analytical solution is too computationally expensive or impossible to derive due to high dimensionality.
Why it matters: It is the fundamental engine behind modern artificial intelligence, allowing models to 'learn' by incrementally reducing error. Its efficiency makes it possible to optimize functions with millions of parameters across massive datasets.
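As a concrete sketch of the loop (the quadratic, starting point, and learning rate below are illustrative assumptions, not part of the standard definition):

```python
# Gradient descent on J(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).
def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0    # arbitrary starting point
alpha = 0.1    # learning rate
for _ in range(50):
    theta -= alpha * grad_J(theta)   # step against the gradient

print(theta)   # approaches the minimum at theta = 3.0
```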
Symbols
Variables
θₙ = New Weight, θₒ = Old Weight, α = Learning Rate, ∇J(θ) = Gradient
Walkthrough
Derivation
Understanding Gradient Descent
Gradient descent is an iterative method for minimising a differentiable cost function by stepping in the direction of steepest decrease.
- The cost function J(θ) is differentiable.
- A learning rate α is chosen to ensure stable progress.
- The method may converge to a local minimum for non-convex functions.
Identify the gradient direction:
The gradient points in the direction of steepest increase of J; to minimise J we move opposite to it.
State the update rule: θₙ = θₒ − α ∇J(θₒ).
Subtract a fraction of the gradient to take a step downhill in parameter space.
Note: If α is too large, updates can overshoot; if α is too small, convergence is very slow.
Result
θₙ = θₒ − α ∇J(θₒ)
Source: Standard curriculum — A-Level Data Science (Optimisation)
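The note about step size is easy to see numerically. This sketch (illustrative values only) applies the derived rule to J(θ) = θ² with three learning rates:

```python
# Effect of alpha on J(theta) = theta^2, whose gradient is 2*theta.
def grad(theta):
    return 2.0 * theta

for alpha in (1.1, 0.001, 0.1):
    theta = 1.0
    for _ in range(30):
        theta -= alpha * grad(theta)   # the update rule derived above
    print(f"alpha={alpha}: theta after 30 steps = {theta:.3g}")
```

Here alpha = 1.1 overshoots and diverges, 0.001 barely moves, and 0.1 converges quickly, matching the note above.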
Free formulas
Rearrangements
Solve for G
Gradient Descent: Substituting the Gradient
This sequence of steps introduces a shorthand symbol `G` for the gradient of the cost function `J(θ)` to simplify the gradient descent formula.
Difficulty: 2/5
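The finished algebra, using the symbols from the Variables list (θₙ, θₒ, α) and the shorthand G = ∇J(θ):

```latex
\theta_n = \theta_o - \alpha G
\;\Longrightarrow\;
\alpha G = \theta_o - \theta_n
\;\Longrightarrow\;
G = \frac{\theta_o - \theta_n}{\alpha}
```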
Visual intuition
Graph
The plot displays a convex, U-shaped parabolic curve representing the cost function relative to a model parameter. The curve features a distinct global minimum at its turning point, which signifies the optimal parameter value where the gradient is zero. In the context of Gradient Descent, the downward slope indicates the direction the algorithm travels to iteratively minimize the loss function.
Graph type: polynomial
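The described curve can be reproduced with a short matplotlib sketch; the quadratic cost and the plotted descent steps are illustrative assumptions, not the app's actual figure.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative convex cost J(theta) = (theta - 3)^2 with a few descent steps.
theta_grid = np.linspace(-1, 7, 200)
plt.plot(theta_grid, (theta_grid - 3) ** 2, label="J(theta)")

theta, alpha = 0.0, 0.2
points = [theta]
for _ in range(6):
    theta -= alpha * 2 * (theta - 3)   # gradient descent step
    points.append(theta)
points = np.array(points)
plt.plot(points, (points - 3) ** 2, "o-", label="descent steps")

plt.xlabel("theta")
plt.ylabel("cost")
plt.legend()
plt.show()
```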
Why it behaves this way
Intuition
Imagine a blindfolded person trying to find the lowest point in a valley by always taking a small step downhill in the steepest possible direction.
Signs and relationships
- The negative sign preceding α ∇J(θ): The gradient ∇J(θ) points in the direction of steepest ascent of the loss function. To minimize the loss, the parameters must be updated by moving in the opposite direction, hence the negative sign. For example, if ∇J(θ) = 4 at the current point, the loss rises as θ increases, so subtracting α · 4 moves θ downhill.
Free study cues
Insight
Canonical usage
All terms in the gradient descent update rule must be dimensionally consistent, meaning they must have the same units.
Common confusion
Students often assume all quantities are dimensionless, especially when working with normalized data, and overlook the dimensional requirements for the learning rate `α` if parameters or loss functions have units.
Unit systems
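One way to see the constraint (a standard dimensional-analysis argument, not specific to any one unit system): ∇J carries units of J per unit of θ, so α must carry whatever units make α ∇J(θ) match θ.

```latex
[\nabla J] = \frac{[J]}{[\theta]}
\qquad\Longrightarrow\qquad
[\alpha]\,\frac{[J]}{[\theta]} = [\theta]
\qquad\Longrightarrow\qquad
[\alpha] = \frac{[\theta]^{2}}{[J]}
```

In particular, α is dimensionless only when J is measured in units of θ².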
One free problem
Practice Problem
A model parameter is currently at 5.0. If the learning rate is 0.1 and the gradient of the loss function is 4.0, calculate the updated parameter value.
Solve for: θₙ
Hint: Subtract the product of the learning rate and the gradient from the old value.
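To check your work, the hint translates directly into a line of Python (values taken from the problem statement):

```python
theta_old, alpha, gradient = 5.0, 0.1, 4.0
theta_new = theta_old - alpha * gradient   # subtract learning rate * gradient
print(theta_new)  # 4.6
```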
Where it shows up
Real-World Context
In updating weights in a neural network, Gradient Descent is used to calculate New Weight from Old Weight, Learning Rate, and Gradient. The result matters because each update moves the weights toward values that reduce the network's error on the training data.
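As a sketch of that weight update in practice, here is plain-numpy gradient descent fitting a one-layer linear model; the synthetic data and hyperparameters are assumptions for illustration, not a prescribed setup.

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0    # old weights
alpha = 0.1        # learning rate
for _ in range(200):
    err = (w * x + b) - y          # prediction error
    grad_w = 2 * np.mean(err * x)  # dJ/dw for mean squared error
    grad_b = 2 * np.mean(err)      # dJ/db
    w -= alpha * grad_w            # new weight = old weight - alpha * gradient
    b -= alpha * grad_b

print(w, b)  # approaches (2, 1)
```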
Study smarter
Tips
- Choose a learning rate that is neither too large to cause divergence nor too small to stall progress.
- Feature scaling or normalization helps the algorithm converge significantly faster by evening out the cost surface.
- Check the cost function value at each iteration to ensure it is consistently decreasing (see the sketch after this list).
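The first and third tips can be wired together in a few lines; the standardization and decreasing-loss check below are a generic sketch, not app-specific code.

```python
import numpy as np

def standardize(X):
    # Zero-mean, unit-variance columns even out the cost surface.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def fit(X, y, alpha=0.1, steps=200):
    w = np.zeros(X.shape[1])
    prev_loss = float("inf")
    for _ in range(steps):
        err = X @ w - y
        loss = np.mean(err ** 2)
        if loss > prev_loss:                  # tip 3: loss should keep falling
            raise RuntimeError("loss increased: lower the learning rate")
        prev_loss = loss
        w -= alpha * 2 * (X.T @ err) / len(y)  # gradient descent step
    return w

# fit(standardize(X), y) typically converges in far fewer steps than fit(X, y).
```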
Avoid these traps
Common Mistakes
- Using a learning rate that is too large.
- Adding the gradient instead of subtracting it (demonstrated below).
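The second mistake is easy to demonstrate on the same toy quadratic used earlier (values are illustrative):

```python
# Flipping the sign turns descent into ascent on J(theta) = theta^2.
theta = 1.0
for _ in range(10):
    theta += 0.1 * (2 * theta)   # WRONG: adds alpha * gradient
print(theta)  # ~6.19, moving uphill and away from the minimum at 0
```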
Common questions
Frequently Asked Questions
What is gradient descent?
Gradient descent is an iterative method for minimising a differentiable cost function by stepping in the direction of steepest decrease.
When should I use it?
This algorithm is used when training machine learning models like linear regression or neural networks to minimize a loss function. It is preferred when an analytical solution is too computationally expensive or impossible to derive due to high dimensionality.
Why does it matter?
It is the fundamental engine behind modern artificial intelligence, allowing models to 'learn' by incrementally reducing error. Its efficiency makes it possible to optimize functions with millions of parameters across massive datasets.
What are the most common mistakes?
Using a learning rate that is too large, and adding the gradient instead of subtracting it.
Where does it show up in practice?
In updating weights in a neural network, Gradient Descent calculates the New Weight from the Old Weight, Learning Rate, and Gradient, so that each update moves the weights toward values that reduce the network's error.
How can I make it work well?
Choose a learning rate that is neither too large to cause divergence nor too small to stall progress. Feature scaling or normalization helps the algorithm converge significantly faster by evening out the cost surface. Check the cost function value at each iteration to ensure it is consistently decreasing.
References
Sources
- Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- Wikipedia: Gradient descent
- Pattern Recognition and Machine Learning by Christopher M. Bishop
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- Standard curriculum — A-Level Data Science (Optimisation)