Learning rate
The learning rate is a positive step-size parameter.
Core idea
Overview
The learning rate is a scalar hyperparameter that determines the step size at each iteration of an optimization algorithm. It scales the gradient of the loss function, controlling how significantly the model's weights are adjusted in response to estimated error.
When to use: Apply this during the training of machine learning models with gradient-based optimizers such as SGD, Adam, or RMSProp. It balances the trade-off between training speed and the precision of convergence toward a minimum.
Why it matters: The learning rate is arguably the most critical hyperparameter; setting it too high causes the model to overshoot the minimum and diverge, while setting it too low results in inefficient training or getting stuck in local minima.
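For instance, in PyTorch (one common framework; the toy parameter, loss, and lr value below are illustrative, not from the source), the learning rate is passed directly to the optimizer:

```python
import torch

# One trainable parameter and a toy quadratic loss.
w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.01)  # lr is the learning rate α

loss = (w ** 2).sum()
loss.backward()     # compute dL/dw = 2w = 2.0
optimizer.step()    # apply w ← w − 0.01 · 2.0
print(w.item())     # ≈ 0.98
```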
Symbols
Variables
α = Learning Rate
Walkthrough
Derivation
Learning Rate
The learning rate α is a positive scalar hyperparameter that controls the step size of each weight update during gradient-based optimisation. Too large and the model overshoots; too small and training is slow or stalls.
Assumptions:
- α > 0.
- A differentiable loss function exists so gradients can be computed.
- The learning rate may be fixed or follow a decay schedule.
Gradient Descent Update Rule
Each weight w is adjusted by subtracting the gradient of the loss L multiplied by α. α scales how far we move in the direction of steepest descent.
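In symbols, this is the standard gradient descent update:

```latex
w \leftarrow w - \alpha \frac{\partial L}{\partial w}
```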
Effect of a Large α
A very large learning rate causes large weight updates, making the loss oscillate or diverge rather than converge.
Effect of a Small α
A very small learning rate leads to tiny updates; training may converge very slowly or settle in a local minimum.
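A minimal pure-Python sketch of both behaviours on the toy loss L(w) = w² (the α values and step count are illustrative assumptions):

```python
# Gradient descent on L(w) = w**2, whose gradient is 2w.
def descend(alpha, w=1.0, steps=20):
    for _ in range(steps):
        w = w - alpha * 2 * w  # w ← w − α · dL/dw
    return w

for alpha in (1.1, 0.001, 0.1):
    print(f"alpha={alpha}: w after 20 steps = {descend(alpha):.6f}")
# alpha=1.1 diverges (|w| grows every step), alpha=0.001 barely moves,
# alpha=0.1 converges close to the minimum at w = 0.
```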
Learning Rate Decay Example
A common schedule halves α every k iterations, starting from an initial rate α₀, allowing faster early training and finer tuning later.
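One way to write that halving schedule, with ⌊t/k⌋ counting the completed decay periods after t iterations:

```latex
\alpha_t = \alpha_0 \cdot \left(\frac{1}{2}\right)^{\lfloor t / k \rfloor}
```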
Note: Common starting values: 0.1, 0.01, 0.001. Search logarithmically to find the best range.
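A log-spaced candidate list is a one-liner (the exponent range here is an illustrative assumption):

```python
candidates = [10.0 ** -k for k in range(1, 6)]
print(candidates)  # [0.1, 0.01, 0.001, 0.0001, 1e-05]
```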
Result
A well-chosen α balances speed against stability: large enough to make fast progress, small enough that the updates converge.
Source: A-Level Data & Computing — Machine Learning
Visual intuition
Graph
No graph is rendered for this formula. Plotted against any independent variable, a fixed learning rate α is a horizontal line: constant everywhere, parallel to the x-axis, and confined to the region α > 0.
Graph type: constant
Why it behaves this way
Intuition
Imagine a blindfolded person trying to find the lowest point in a hilly landscape; the learning rate determines the size of each step they take in the direction they feel is downhill.
Signs and relationships
- α > 0: The update rule already subtracts the gradient, so a positive α moves the model parameters *down* the loss landscape towards a minimum; a negative α would climb it instead.
Free study cues
Insight
Canonical usage
The learning rate is a dimensionless scalar used to control the step size during gradient-based optimization in machine learning.
Common confusion
A common confusion is trying to assign physical units to the learning rate. It is a hyperparameter whose value is tuned empirically, not derived from physical dimensions.
Dimension note
The learning rate is a dimensionless scalar. It acts as a pure scaling factor for the gradient vector, determining the magnitude of weight updates.
Unit systems
Not applicable; as a dimensionless quantity, α has the same value in every unit system.
One free problem
Practice Problem
A machine learning practitioner is training a neural network with an initial learning rate alpha of 0.01. After observing that the loss is oscillating, they decide to reduce the learning rate to one-fifth of its current value. Calculate the new value for alpha.
Solve for: alpha
Hint: Divide the initial value by 5 to find the reduced rate.
The full worked solution stays in the interactive walkthrough.
Where it shows up
Real-World Context
In a training run, tuning α means adjusting how aggressively the weights respond to each computed gradient. The choice matters because it governs whether training converges at all, how quickly it does so, and how precisely the final weights settle near a minimum.
Study smarter
Tips
- Start with a logarithmic search (e.g., 0.1, 0.01, 0.001) to find the best range for your specific architecture.
- Use learning rate decay or schedules to reduce α as training progresses, allowing finer weight adjustments late in training.
- Monitor the loss curve; consistent oscillation or rising loss usually indicates the learning rate is too high (see the sketch after this list).
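As a sketch of that monitoring tip, a toy check (the function name and window size are hypothetical illustrations, not a standard API) might look like:

```python
# Flag a learning rate as suspect when the loss rises monotonically
# over the last few recorded values.
def loss_rising(losses, window=5):
    recent = losses[-window:]
    return all(b > a for a, b in zip(recent, recent[1:]))

print(loss_rising([0.9, 0.8, 1.1, 1.5, 2.0, 2.7]))  # True: loss is climbing
```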
Avoid these traps
Common Mistakes
- Choosing a rate that is too large, so the loss oscillates or diverges.
- Assuming one rate fits all models; the best value depends on the architecture, data, and optimizer.
Common questions
Frequently Asked Questions
What is the learning rate?
The learning rate α is a positive scalar hyperparameter that controls the step size of each weight update during gradient-based optimisation. Too large and the model overshoots; too small and training is slow or stalls.
When should it be used?
During the training of machine learning models with gradient-based optimizers such as SGD, Adam, or RMSProp, to balance the trade-off between training speed and the precision of convergence toward a minimum.
Why does it matter?
The learning rate is arguably the most critical hyperparameter; setting it too high causes the model to overshoot the minimum and diverge, while setting it too low results in inefficient training or getting stuck in local minima.
What mistakes are common?
Choosing a rate that is too large, and assuming one rate fits all models.
Where does it show up in practice?
In any training run: α determines how aggressively the weights respond to each computed gradient, which governs whether and how quickly training converges.
How should you pick a value?
Start with a logarithmic search (e.g., 0.1, 0.01, 0.001) to find the best range for your specific architecture, use decay schedules to reduce α as training progresses, and monitor the loss curve: consistent oscillation or rising loss usually indicates the rate is too high.
References
Sources
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8: Optimization for Training Deep Models.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Wikipedia: Learning rate
- Wikipedia: Gradient descent
- A-Level Data & Computing — Machine Learning