Experience the optimization algorithms behind machine-learning training. Adjust the learning rate and momentum coefficient and watch how plain gradient descent, momentum and Nesterov accelerated gradient descend an elongated valley, shown by the trajectory on a contour plot and the loss convergence curve.
Parameters
Learning rate η
How far each step moves along the gradient direction
Momentum coefficient γ
How much past velocity is carried forward. 0 = plain GD
Iterations
steps
Optimization method
Three update rules that use the gradient differently
Loss surface
Switch the valley elongation (condition number)
Results
Final loss f
Final x
Final y
Iterations
Converged / Diverged
Loss reduction (%)
Loss-surface contour and optimisation trajectory
The concentric ellipses are contours of the loss f, and the central dot is the minimum. The coloured path from the start point (−9, 4) is the update trajectory, and a marker travels along it in a loop.
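The objective function used here is a quadratic bowl with curvature coefficients a and b along x and y and its minimum at the origin. Nesterov accelerated gradient differs from plain momentum only in evaluating the gradient at the look-ahead point θ+γv instead of the current point. The condition number is b/a.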
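For reference, the three update rules described on this page can be written side by side. The symbols θ, v, η and γ follow the text; the iteration index k is added here only for clarity.

```latex
\begin{aligned}
\text{Plain GD:}\quad & \theta_{k+1} = \theta_k - \eta\,\nabla f(\theta_k) \\
\text{Momentum:}\quad & v_{k+1} = \gamma v_k - \eta\,\nabla f(\theta_k), \quad \theta_{k+1} = \theta_k + v_{k+1} \\
\text{Nesterov:}\quad & v_{k+1} = \gamma v_k - \eta\,\nabla f(\theta_k + \gamma v_k), \quad \theta_{k+1} = \theta_k + v_{k+1}
\end{aligned}
```

Setting γ = 0 in either momentum rule recovers plain gradient descent.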
What is Gradient Descent with Momentum?
🙋
I hear "gradient descent" all the time in machine learning. What is it actually doing?
🎓
Roughly speaking, it just "walks downhill". Think of the loss f — which measures how good the model is — as a landscape; the lowest valley floor is the best set of parameters. You measure the slope at your current spot (the gradient ∇f) and take one step of size η in the opposite direction. Repeat that and you gradually descend to the bottom. As a formula it is θ ← θ − η∇f. Try setting the method on the left to "Plain gradient descent".
🙋
Oh, the trajectory zig-zags. It doesn't head straight for the bottom?
🎓
That's the key point. This "elongated valley" is steep in the vertical direction (y) and shallow in the horizontal direction (x). Plain GD overshoots in the steep y direction, crosses to the other side, comes back — and oscillates. Meanwhile it advances only a little at a time in the shallow x direction. So it bounces between the valley walls while slowly descending sideways. That is the weakness of gradient descent on "ill-conditioned" terrain.
🙋
I see... So what changes if I switch to "Momentum"?
🎓
Picture a "ball" rolling down the slope. Momentum remembers a velocity v and updates with v ← γv − η∇f, θ ← θ + v. In the x direction the gradient keeps pointing the same way, so velocity piles up and the motion accelerates. In the y direction the steps swing back and forth across the valley, so outgoing and returning velocities cancel and the oscillation is damped. Switch the method to Momentum: the zig-zag drops sharply and the loss curve slides down smoothly.
🙋
It really did get smoother! There's also Nesterov accelerated gradient — is that something else again?
🎓
Same idea, only the "place" where you measure the gradient differs. Plain momentum measures the gradient where you are now, but Nesterov predicts "where will I be once the velocity carries me forward" and measures the gradient at that look-ahead point. Because it anticipates, it brakes earlier as it nears the bottom and overshoot is smaller. For the same η and γ, Nesterov's convergence curve is often cleaner. In practice the standard tuning is to fix γ = 0.9 and make η as large as it can go without diverging.
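The three rules from this conversation can be tried outside the tool with a minimal sketch. It assumes a stand-in objective f(x, y) = 0.5·(a·x² + b·y²) with a = 1 and b = 20, which may differ from the tool's exact formula by a constant factor; the function names and the settings η = 0.02, γ = 0.9, 200 steps are illustrative choices, not the tool's defaults.

```python
# Sketch only: the quadratic below is an assumed stand-in for the tool's objective.
A, B = 1.0, 20.0          # assumed curvatures: shallow in x, steep in y


def loss(p):
    x, y = p
    return 0.5 * (A * x * x + B * y * y)


def grad(p):
    x, y = p
    return (A * x, B * y)


def run(method, eta=0.02, gamma=0.9, steps=200, start=(-9.0, 4.0)):
    """Return the loss history for 'gd', 'momentum' or 'nesterov'."""
    theta = start
    v = (0.0, 0.0)
    losses = [loss(theta)]
    for _ in range(steps):
        if method == "gd":
            g = grad(theta)
            v = (-eta * g[0], -eta * g[1])            # no memory of past steps
        elif method == "momentum":
            g = grad(theta)                            # gradient at the current point
            v = (gamma * v[0] - eta * g[0], gamma * v[1] - eta * g[1])
        elif method == "nesterov":
            look = (theta[0] + gamma * v[0], theta[1] + gamma * v[1])
            g = grad(look)                             # gradient at the look-ahead point
            v = (gamma * v[0] - eta * g[0], gamma * v[1] - eta * g[1])
        else:
            raise ValueError(method)
        theta = (theta[0] + v[0], theta[1] + v[1])
        losses.append(loss(theta))
    return losses


final = {m: run(m)[-1] for m in ("gd", "momentum", "nesterov")}
for name, value in final.items():
    print(f"{name:9s} final loss = {value:.3e}")
```

Under these assumed settings both momentum variants finish well below plain gradient descent after the same number of steps; other values of η and γ can reverse the ranking, which is what the sliders are for.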
Frequently Asked Questions
Momentum is an upgrade of gradient descent that adds a memory called "velocity". The update is v ← γv − η∇f, θ ← θ + v, carrying the previous update direction v forward with a coefficient γ (0 to 0.99). Like a ball rolling down a slope, velocity builds up and accelerates along directions where the gradient consistently points the same way, while back-and-forth oscillation is damped because outgoing and returning velocities cancel. As a result it reaches the minimum faster and more smoothly than plain gradient descent even on hard terrain such as an elongated valley.
The learning rate η sets how far each step moves. Too small and convergence is slow; too large and the steps overshoot the minimum and diverge. On this tool's elongated valley (condition number 20), raising η first triggers oscillation along the steep direction and then causes divergence. The momentum coefficient γ controls how much past velocity is retained, and a value around 0.9 is the standard choice. Raising γ speeds up progress, but too high a value increases overshoot. In practice, fix γ = 0.9 and take η as large as possible without diverging.
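The progression from smooth descent to oscillation to divergence can be sketched for plain gradient descent along a single axis. The sketch assumes a quadratic whose gradient along that axis is curvature × coordinate, so each step multiplies the coordinate by (1 − η·curvature); the curvature value 20 and the function name are illustrative assumptions.

```python
def gd_regime(eta, curvature):
    """Classify plain GD along one axis of a quadratic with the given curvature.

    Each step multiplies the coordinate by (1 - eta * curvature).
    """
    factor = 1.0 - eta * curvature
    if abs(factor) >= 1.0:
        return "diverges"      # no shrinkage (stalls exactly at |factor| == 1)
    if factor < 0.0:
        return "oscillates"    # sign flips every step while the size shrinks
    return "monotone"          # steady approach with no sign flips


STEEP = 20.0   # assumed curvature of the steep direction
for eta in (0.02, 0.08, 0.12):
    print(eta, gd_regime(eta, STEEP))
```

The same function applied to a shallow curvature such as 1.0 stays in the monotone regime for all three values, which is why the steep direction is the one that limits η.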
Plain momentum computes the gradient at the current point θ, while Nesterov accelerated gradient (NAG) computes it at the look-ahead point θ + γv, i.e. where the velocity is about to carry you. By looking ahead, it can brake earlier as it approaches the minimum, so overshoot is smaller. With this tool you can check that, for the same η and γ, Nesterov typically shows a smoother loss curve and converges faster. Many deep-learning frameworks offer NAG as an option.
The two loss surfaces behave differently because they have different condition numbers. The isotropic paraboloid (a=b=5) has condition number 1: every direction has the same curvature, so the gradient always points toward the minimum and even plain gradient descent converges cleanly. The elongated valley (a=1, b=20) has condition number 20: one direction is steep and another is shallow. Plain GD oscillates in the steep direction and barely advances in the shallow one. Momentum cancels that oscillation and accelerates progress in the shallow direction, so the larger the condition number, the more clearly its benefit shows.
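The effect of the condition number on plain gradient descent can be sketched by counting steps on both surfaces. This assumes the objective f(x, y) = 0.5·(a·x² + b·y²), the start point (−9, 4) and the 1% relative threshold mentioned on this page; the function name and the choice η = 2/(a + b), which is the best fixed step for plain GD on such a quadratic, are illustrative.

```python
def steps_to_one_percent(a, b, eta, start=(-9.0, 4.0), max_steps=10_000):
    """Count plain-GD steps until the loss falls below 1% of its initial value.

    Assumes f(x, y) = 0.5 * (a*x**2 + b*y**2); returns None if never reached.
    """
    x, y = start
    initial = 0.5 * (a * x * x + b * y * y)
    for k in range(1, max_steps + 1):
        x -= eta * a * x          # gradient along x is a * x
        y -= eta * b * y          # gradient along y is b * y
        if 0.5 * (a * x * x + b * y * y) < 0.01 * initial:
            return k
    return None


# Each surface gets its own best fixed step eta = 2 / (a + b).
iso = steps_to_one_percent(5.0, 5.0, eta=2.0 / 10.0)
valley = steps_to_one_percent(1.0, 20.0, eta=2.0 / 21.0)
print("isotropic:", iso, "steps; valley:", valley, "steps")
```

Even with the step size tuned separately for each surface, the valley needs many more steps than the isotropic bowl, so the slowdown comes from the shape of the surface rather than from a poor choice of η.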
Real-World Applications
Training neural networks: A deep network's weights number in the millions to billions, and their loss landscape is far more complex, elongated and winding than this tool's valley. "SGD with momentum" — stochastic gradient descent combined with momentum — has long been the bread-and-butter optimizer for many classic models such as the ResNet family in image recognition. Without momentum, oscillation makes learning slow and the optimizer is easily thrown off by noisy mini-batch gradients.
The foundation of modern optimizers like Adam: Adam and AdamW carry momentum internally, as an exponential moving average of the gradient (the first moment). RMSProp by itself instead tracks a moving average of squared gradients and can optionally be combined with momentum. The idea you experience here, "accumulate velocity to accelerate and damp oscillation", is at the core of these widely used algorithms. A solid intuition for momentum makes hyperparameter tuning far easier when training does not go well.
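The link between the velocity used on this page and Adam's first moment can be sketched with scalar gradients. The function names are hypothetical, the defaults β₁ = 0.9, η = 0.1 and γ = 0.9 are illustrative, and the second-moment scaling that Adam also applies is deliberately left out.

```python
def first_moment_ema(grads, beta1=0.9):
    """Bias-corrected exponential moving average of gradients (Adam's first moment)."""
    m = 0.0
    out = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g      # moving average of the gradient
        out.append(m / (1.0 - beta1 ** t))     # correct the bias from starting at zero
    return out


def momentum_velocity(grads, eta=0.1, gamma=0.9):
    """Velocity from the rule v <- gamma*v - eta*g used on this page."""
    v = 0.0
    out = []
    for g in grads:
        v = gamma * v - eta * g
        out.append(v)
    return out


print(first_moment_ema([2.0] * 5))
print(momentum_velocity([2.0] * 5))
```

Both quantities are running sums of past gradients with geometrically decaying weights; the moving average is normalized so that it stays on the scale of a single gradient, while the velocity keeps growing under a steady gradient.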
Design exploration in CAE and numerical optimization: Structural, topology and shape optimization also use gradient methods to drive an objective (mass, stress, compliance) down via its gradient. When the design variables are not on a common scale, the loss landscape becomes an elongated valley and convergence slows dramatically. The notion of coping with a poor condition number through momentum, or improving it directly through variable normalization (scaling), is shared by engineering design optimization, not just machine learning.
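How scaling improves the condition number can be sketched for an axis-aligned quadratic. The example assumes an objective of the form 0.5·Σ cᵢ·xᵢ² with curvatures 1 and 20; the function names are hypothetical and real design problems are rarely this cleanly separable.

```python
import math


def condition_number(curvatures):
    """Ratio of largest to smallest curvature of an axis-aligned quadratic."""
    return max(curvatures) / min(curvatures)


def rescaled_curvatures(curvatures):
    """Curvatures after substituting u_i = sqrt(c_i) * x_i for each variable.

    0.5 * c_i * x_i**2 becomes 0.5 * u_i**2, so every curvature becomes 1.
    """
    scales = [math.sqrt(c) for c in curvatures]
    return [c / (s * s) for c, s in zip(curvatures, scales)]


before = condition_number([1.0, 20.0])
after = condition_number(rescaled_curvatures([1.0, 20.0]))
print(before, after)
```

After the change of variables the valley becomes a round bowl, on which plain gradient descent heads straight for the minimum.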
Inverse analysis and parameter identification: Inverse problems that estimate material-model or boundary-condition parameters to fit experimental data also minimize the error as a loss with gradient descent. Observation noise and uneven sensitivity easily make the loss landscape pathological, and momentum and learning-rate schedules help stabilize convergence. This tool's "Diverged" verdict is the same kind of failure that occurs in real inverse analysis when the learning rate is set too high.
Common Misconceptions and Pitfalls
The most common pitfall is believing that "a larger learning rate always converges faster". It is true that too small an η slows convergence, but too large an η overshoots the minimum, the loss rises instead, and it ultimately diverges. On this tool's elongated valley, pushing η to its maximum of 0.5 makes the loss explode within just a few steps and the verdict turns to "Diverged". Adding momentum further increases the effective step size, so when γ is large you actually need a more modest η. The golden rule is "if it diverges, lower η first".
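The remark about effective step size can be made concrete under a simplifying assumption: if the gradient stayed at a constant value g (an idealization, not what happens on the tool's surface) and the velocity started at zero, the velocity would follow a geometric series.

```latex
v_k = -\eta\, g \sum_{i=0}^{k-1} \gamma^{i}
    = -\eta\, g\, \frac{1-\gamma^{k}}{1-\gamma}
    \;\xrightarrow[k\to\infty]{}\; -\frac{\eta}{1-\gamma}\, g
```

With γ = 0.9 the limiting step is ten times the plain step ηg, and with γ = 0.99 it is a hundred times.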
Next, the assumption that "momentum is a cure-all and oscillation always disappears". Momentum damps oscillation, but pushing γ too high makes momentum itself produce overshoot. The iterate blasts past the minimum, travels to the far side and only then returns, so a different kind of oscillation appears. With an extreme value like γ = 0.99, the path can take needless detours before converging. The look-ahead braking of Nesterov accelerated gradient is precisely a device to suppress this overshoot.
Finally, the misconception that "once it converges, the loss should reach zero". This tool's objective has its minimum at the origin with f = 0, but with a finite number of iterations the final loss is never exactly zero. The convergence test uses a relative criterion, "final loss is below 1% of the initial loss". In real machine learning the minimum loss is not necessarily zero in the first place (data noise and limited model capacity), and convergence is judged by whether you have reached a "flat floor where the loss no longer drops". The practical instinct is to watch whether the decrease has stalled, not the absolute value of the loss.
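The relative criterion can be sketched as a small helper. The 1% threshold comes from the text above; the rule used for "Diverged" (non-finite loss or a loss above the starting value) and the "Not converged" label are assumptions, since the tool's exact tests are not stated here.

```python
import math


def verdict(initial_loss, final_loss, threshold=0.01):
    """Classify a run using a relative convergence criterion.

    The 'Diverged' rule and the 'Not converged' label are assumptions.
    """
    if not math.isfinite(final_loss) or final_loss > initial_loss:
        return "Diverged"
    if final_loss < threshold * initial_loss:
        return "Converged"
    return "Not converged"


def loss_reduction_percent(initial_loss, final_loss):
    """Percentage by which the loss dropped from its initial value."""
    return 100.0 * (1.0 - final_loss / initial_loss)


print(verdict(200.5, 1.0), loss_reduction_percent(200.0, 2.0))
```

A relative test like this never requires the loss to reach zero, which matches the point that a finite run only gets close to the minimum.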
How to Use
Set the learning rate η between 0.01 and 0.5; values around 0.1–0.2 are a typical starting point, and lower values are needed if the run diverges
Set the momentum coefficient γ between 0.0 and 0.99; values around 0.9 speed up progress along the shallow direction and damp oscillation across the steep one
Configure the maximum iterations (steps) from 10 to 1000; a small η or an ill-conditioned surface needs more steps
Click simulate to observe the optimization path; monitor Final loss f, Final x/y coordinates, and Loss reduction (%) in real time
Observe the Converged/Diverged status; divergence indicates that η is too large for the steep direction of the surface at the current γ
Worked Example
On the Rosenbrock function f(x,y) = (1−x)² + 100(y−x²)², a standard non-convex test function rather than one of this tool's surfaces, with initial point x₀ = −1.2, y₀ = 1.0 (initial loss 24.2): setting η = 0.001 and γ = 0.9 reaches roughly x ≈ 0.85, y ≈ 0.72 within 500 iterations, where f ≈ 0.023, a loss reduction of about 99.9%. Without momentum (γ = 0), the same η requires more than 800 iterations. Increasing η to 0.005 with γ = 0.9 takes larger steps but risks oscillation or divergence near the optimum, where the curvature is high.
Practical Notes
Neural networks typically use γ = 0.9–0.95, with the learning rate often scaled with batch size; adaptive methods such as Adam reduce the manual tuning burden
High momentum (γ > 0.95) on ill-conditioned problems can overshoot narrow valleys; change γ in small increments such as 0.05 and compare runs
Monitor divergence: if the loss explodes (for example within the first 50 iterations), halve η and rerun
Momentum can help carry the iterate through shallow local minima and flat regions on non-convex surfaces, where pure gradient descent (γ = 0) tends to stall more often
For production training, use validation loss to detect overfitting; this simulator shows training loss only
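The "halve η when it diverges" note can be sketched as a small loop. The example assumes plain gradient descent on the stand-in objective 0.5·(a·x² + b·y²) with a = 1, b = 20 and start point (−9, 4); the function names and the 50-step check are illustrative choices.

```python
def final_loss(eta, steps=50, a=1.0, b=20.0, start=(-9.0, 4.0)):
    """Loss after plain GD on the assumed quadratic 0.5 * (a*x**2 + b*y**2)."""
    x, y = start
    for _ in range(steps):
        x, y = x - eta * a * x, y - eta * b * y
    return 0.5 * (a * x * x + b * y * y)


def halve_until_stable(eta, max_halvings=20):
    """Halve eta until the loss after 50 steps is below the starting loss."""
    initial = final_loss(eta, steps=0)
    for _ in range(max_halvings):
        if final_loss(eta) < initial:
            return eta
        eta *= 0.5                 # diverged: retry with half the step size
    raise RuntimeError("no stable learning rate found")


print(halve_until_stable(0.5))
```

Starting from the slider maximum of 0.5, the loop rejects each value that makes the loss grow and stops at the first one that does not. A finer search between the last rejected and the first accepted value would get closer to the largest stable η.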