
Gradient Descent Optimizer Visualizer

Visualize how SGD, Adam, RMSprop and other optimization algorithms traverse 2D loss landscapes in real time. Understand machine learning optimization intuitively.

Controls

Loss Function: choose a preset function
Algorithm: SGD, Adam, RMSprop, and others
Learning Rate α (default 0.010)
β₁ (Momentum / Adam, default 0.900)
β₂ (RMSprop / Adam, default 0.999)
Max Steps (default 500)

Click the canvas to set the starting point.

Readouts: Steps, Loss, ‖∇L‖

Algorithm Note

Adam tracks the first moment m and second moment v of the gradient to apply an adaptive, per-parameter learning rate. Default optimizer in most deep-learning frameworks.

CAE Connection: The same gradient-based approach is used in structural and shape optimization. Sensitivity analysis computes gradients of an objective (weight, compliance) with respect to design parameters, then an optimizer like Adam or L-BFGS updates the design.

Click to set start point  |  Color: blue = low loss → red = high loss

What is Gradient Descent?

🧑‍🎓
What exactly is gradient descent? I hear it's how AI "learns," but I don't get how moving around a landscape teaches it anything.
🎓
Basically, think of it like this: the "landscape" in this simulator is a graph of a loss function. The height is the error or cost. The AI's goal is to find the lowest valley—the minimum error. Gradient descent is the algorithm that takes small steps downhill. At each step, it calculates the slope (the gradient) at its current position and moves in the opposite direction. Try selecting the "Simple Bowl" function above and watch the dot roll down to the center.
🧑‍🎓
Wait, really? So the "Learning Rate" slider is just how big a step it takes? What happens if I set it too high on a steep function?
🎓
Exactly! That's a key insight. The learning rate, often called alpha (α), controls step size. If it's too small, the optimizer is slow. If it's too large, it can overshoot the minimum and even diverge, bouncing out of control. A common case is on the "Steep Valley" function—set α to 0.5 with SGD and watch it jump across the canyon instead of descending smoothly. Now try 0.01; it'll be slow but stable.
🧑‍🎓
Okay, so SGD can be jumpy. That's why we have Adam and RMSprop? What do the β₁ and β₂ parameters actually do?
🎓
Great question. SGD uses only the current gradient. Adam and RMSprop are smarter—they use memory. β₁ controls momentum; it's like giving the optimizer inertia so it doesn't get stuck in tiny bumps. β₂ controls how it adapts the step size for each direction based on past squared gradients, which helps navigate ravines. For instance, on the "Saddle Point" function, compare SGD (gets stuck) to Adam (escapes quickly). Play with the β sliders to see how they smooth or dampen the path.

Physical Model & Key Equations

The core of gradient descent is updating a parameter vector $\theta$ (your position on the landscape) by subtracting the gradient of the loss function $J(\theta)$, scaled by the learning rate $\alpha$.

$$\theta_{t+1}= \theta_t - \alpha \nabla J(\theta_t)$$

Here, $\theta_t$ is the current position, $\nabla J(\theta_t)$ is the gradient (a vector pointing steepest uphill), and $\alpha$ is the learning rate you control in the simulator. This is the vanilla Stochastic Gradient Descent (SGD) update rule.
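This one-line update is easy to sketch in code. Below is a minimal NumPy version (a sketch, not the simulator's actual source) applied to a bowl-shaped loss $J(\theta) = \|\theta\|^2$, whose gradient is $2\theta$:

```python
import numpy as np

def sgd_step(theta, grad_fn, alpha=0.01):
    """One vanilla gradient-descent update: theta <- theta - alpha * grad J(theta)."""
    return theta - alpha * grad_fn(theta)

# "Simple Bowl" loss J(theta) = x^2 + y^2, whose gradient is 2 * theta.
grad_bowl = lambda theta: 2.0 * theta

theta = np.array([2.0, -1.5])   # starting point, as if clicked on the canvas
for _ in range(100):
    theta = sgd_step(theta, grad_bowl, alpha=0.1)
print(theta)  # approaches the minimum at (0, 0)
```

Each iteration shrinks every coordinate by the factor (1 − 2α), which is exactly the "rolling downhill" behavior the bowl function shows in the simulator.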

Advanced optimizers like Adam build on this by estimating the first moment (mean, $m_t$) and second moment (uncentered variance, $v_t$) of the gradients, introducing hyperparameters $\beta_1$ and $\beta_2$ for exponential decay.

$$ \begin{aligned}m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\ \theta_{t+1} &= \theta_t - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\end{aligned}$$

$g_t$ is the gradient at step $t$. $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected estimates. $\beta_1$ (Momentum) smooths the path, $\beta_2$ adapts the learning rate per dimension. This is why Adam handles noisy, ill-conditioned landscapes—like the "Complex Terrain" in the simulator—much more effectively.
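These equations translate almost line-for-line into code. The sketch below (variable names are mine, not the simulator's) runs Adam on an ill-conditioned quadratic bowl, where the curvature in $y$ is 100× that in $x$:

```python
import numpy as np

def adam(grad_fn, theta0, alpha=0.01, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=500):
    """Adam, following the update equations above (with bias correction)."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)  # first-moment estimate
    v = np.zeros_like(theta)  # second-moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)   # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Ill-conditioned bowl: J = x^2 + 100*y^2 (a stand-in for a "ravine")
grad = lambda th: np.array([2 * th[0], 200 * th[1]])
theta_star = adam(grad, [3.0, 3.0], alpha=0.05, steps=1000)
print(theta_star)  # near the minimum at (0, 0)
```

Because $\hat{v}_t$ normalizes each coordinate by its own gradient scale, the step size in the steep $y$ direction is automatically damped: exactly why Adam handles ravines better than plain SGD.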

Real-World Applications

Training Neural Networks: This is the direct application. Every time you hear about a model like GPT or a ResNet being "trained," an optimizer like Adam is navigating a loss landscape with billions of parameters, adjusting weights to minimize prediction error. The choice of optimizer and learning rate schedule is critical for convergence speed and final performance.

CAE Structural Optimization: In Computer-Aided Engineering, the same principle is used for shape or topology optimization. The objective might be to minimize the weight of a car bracket while maintaining strength. Sensitivity analysis provides the gradient of stress with respect to design parameters, and a gradient-based optimizer (like the ones here) iteratively updates the design. This is automated, simulation-driven design.

Robotics & Control Systems: Optimizers are used to tune the parameters of controllers (like PID gains) by minimizing an error function that quantifies how well a robot arm follows a desired trajectory. The landscape is the error over parameter space, and gradient descent finds the best settings.

Financial Modeling: In quantitative finance, models for option pricing or risk assessment have parameters calibrated to market data. This calibration is often framed as an optimization problem—minimizing the difference between model predictions and observed prices—solved using stochastic gradient methods similar to those visualized here.

Common Misconceptions and Points to Note

First, the belief that "a larger learning rate leads to faster convergence" is a dangerous misconception. Try using SGD on the Rosenbrock function in the simulator and increase the learning rate α from 0.1 to 1.0. While the initial few steps descend quickly, you'll soon see it bounce off the valley walls and diverge, moving away from the minimum. In practice, the golden rule is to start with a relatively small value (e.g., 0.001 or 0.01) and use "scheduling" to decay the learning rate when the decrease in the loss value slows down.
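You can reproduce this divergence outside the simulator too. The sketch below runs plain gradient descent on the standard Rosenbrock function (a = 1, b = 100; the simulator may scale the function differently, so the exact α threshold will differ):

```python
import numpy as np

def rosenbrock_grad(th, a=1.0, b=100.0):
    """Gradient of the Rosenbrock function f = (a - x)^2 + b*(y - x^2)^2."""
    x, y = th
    return np.array([-2 * (a - x) - 4 * b * x * (y - x**2),
                     2 * b * (y - x**2)])

def run_sgd(alpha, steps=200, start=(-1.0, 1.0)):
    th = np.array(start)
    for _ in range(steps):
        th = th - alpha * rosenbrock_grad(th)
        if not np.all(np.isfinite(th)) or np.linalg.norm(th) > 1e6:
            return "diverged"   # bounced out of the valley entirely
    return th

print(run_sgd(0.001))  # small alpha: slow but stable
print(run_sgd(0.1))    # large alpha: blows up within a few steps
```

With the standard scaling, the curvature across the valley is so steep that even α = 0.1 overshoots the walls and escalates to infinity, while α = 0.001 creeps along the valley floor safely.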

Next, beware of the overconfidence that "Adam works well on anything with its default parameters". While Adam is indeed powerful, the defaults (β₁ = 0.9, β₂ = 0.999) are not optimal for every landscape. On problems with very noisy gradients, a large β₂ makes the adaptive learning rate average over a long gradient history, so it can shrink too far and stall; lowering β₂ (e.g., to 0.99) lets the optimizer "forget" old gradients sooner. Try β₂ = 0.9 in the tool and you will see the movement become noticeably more active.

Finally, reaching a minimum does not mean the problem is solved. The Himmelblau function in this simulator has four local minima of equal value, and changing the starting point changes which one the algorithm reaches. In practical machine learning, too, you must always ask whether the "minimum" you found is truly the best solution (the global optimum) or merely an undesirable local one. Running the optimization several times from different initial values ("random restarts") is a fundamental technique for mitigating this risk.
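Random restarts are easy to demonstrate in code. The sketch below (my own implementation of the Himmelblau function, not the simulator's) descends from 20 random starting points and collects the distinct minima reached:

```python
import numpy as np

def himmelblau_grad(th):
    """Gradient of f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2."""
    x, y = th
    u, w = x**2 + y - 11, x + y**2 - 7
    return np.array([4 * x * u + 2 * w, 2 * u + 4 * y * w])

def descend(start, alpha=0.005, steps=5000):
    """Plain gradient descent from a given start point."""
    th = np.array(start, dtype=float)
    for _ in range(steps):
        th = th - alpha * himmelblau_grad(th)
    return th

# Restart from 20 random points in [-5, 5]^2 and record distinct endpoints.
rng = np.random.default_rng(0)
minima = {tuple(np.round(descend(rng.uniform(-5, 5, size=2)), 1))
          for _ in range(20)}
print(minima)  # several distinct minima, depending on the start point
```

The set typically contains more than one of the four known minima, e.g. (3.0, 2.0) and (−2.8, 3.1): direct evidence that the endpoint depends entirely on where you start.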

Related Engineering Fields

The concept of "gradient-based optimization" you are experiencing with this tool appears as a common challenge—"finding parameters that minimize an objective function"—in various engineering fields beyond machine learning. For example, in structural optimization (topology optimization), using material placement as parameters, "weight reduction" as the objective function, and "strength" as a constraint, gradient methods are used to automatically generate optimal skeletal shapes. This is essential for lightweight design in automotive and aerospace engineering.

Furthermore, in the field of control engineering, adjusting controller parameters so a system's response follows a target value is also an optimization problem of minimizing error (the loss function). Particularly in "Model Predictive Control (MPC)", fast gradient methods are at the core, calculating optimal control inputs online while predicting future behavior.

Moreover, gradient-based thinking is deeply connected to CAE (computational mechanics) simulation itself. When the finite element method solves for structural deformation, finding the state of force equilibrium (energy minimization) is itself an optimization calculation. In nonlinear analysis, the Newton-Raphson method, an iterative solver that uses gradient information (the tangent stiffness matrix), is employed. The underlying philosophy of exploring solutions via gradients is shared with the algorithms in this tool.

For Further Learning

Once you've developed intuition with this simulator, the next recommended step is to understand at a mathematical level why these algorithms are efficient. A good place to start is the update rule for Momentum SGD, which adds "momentum" to SGD: $$v_{t+1} = \gamma v_t + \alpha \nabla J(\theta_t), \quad \theta_{t+1} = \theta_t - v_{t+1}$$ Here $v_t$ plays the role of "velocity" and $\gamma$ that of "friction", suppressing zigzag motion along valley floors. Changing the momentum coefficient in the tool and observing the behavior lets you physically grasp the meaning of the equations.
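To see the "friction" analogy quantitatively, the sketch below uses a hypothetical narrow valley f = x² + 50y² (not one of the simulator's presets) and counts how many steps plain SGD versus Momentum SGD need to reach the bottom:

```python
import numpy as np

def steps_to_converge(gamma, alpha=0.01, tol=1e-3, max_steps=5000):
    """Steps for (momentum) gradient descent to settle in f = x^2 + 50*y^2.

    gamma = 0 reduces the momentum update to plain SGD.
    """
    th = np.array([2.0, 1.0])
    v = np.zeros(2)                         # velocity
    for t in range(1, max_steps + 1):
        g = np.array([2 * th[0], 100 * th[1]])  # gradient of f
        v = gamma * v + alpha * g
        th = th - v
        if np.linalg.norm(th) < tol:
            return t
    return max_steps

print(steps_to_converge(gamma=0.0))  # plain SGD
print(steps_to_converge(gamma=0.9))  # momentum: typically far fewer steps
```

Along the shallow x direction, momentum accumulates speed instead of taking the same tiny step each time, which is exactly the inertia effect described above.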

Then, look back at Adam's equations. You should see that $\hat{m}_t$ is an evolved form of Momentum (with bias correction) and that $\hat{v}_t$ adjusts the learning rate per parameter. This concept of "adaptive learning rate" is also shared by other algorithms like RMSprop and AdaGrad.

Ultimately, understanding through implementation is the most profound. Try simply recreating what this simulator does using just NumPy in Python. By computing the gradient of a 2D function using numerical differentiation and writing the update equations for SGD, Momentum, and RMSprop in a few dozen lines, the algorithm's "movement" will become completely your own. Once you can do that, you'll be able to confidently infer what the optimizers in actual deep learning frameworks (PyTorch/TensorFlow) are doing internally.
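As a starting point for that exercise, here is one possible sketch (function and parameter names are mine): gradients via central-difference numerical differentiation, plus SGD, Momentum, and RMSprop updates, tried on an elongated bowl:

```python
import numpy as np

def num_grad(f, th, h=1e-5):
    """Central-difference gradient of a scalar function of a 2-D point."""
    g = np.zeros_like(th)
    for i in range(len(th)):
        e = np.zeros_like(th)
        e[i] = h
        g[i] = (f(th + e) - f(th - e)) / (2 * h)
    return g

def optimize(f, th0, method="sgd", alpha=0.01, gamma=0.9,
             beta2=0.9, eps=1e-8, steps=1000):
    """Run one of three update rules; v holds velocity (momentum)
    or the running average of squared gradients (rmsprop)."""
    th = np.array(th0, dtype=float)
    v = np.zeros_like(th)
    for _ in range(steps):
        g = num_grad(f, th)
        if method == "sgd":
            th = th - alpha * g
        elif method == "momentum":
            v = gamma * v + alpha * g
            th = th - v
        elif method == "rmsprop":
            v = beta2 * v + (1 - beta2) * g**2
            th = th - alpha * g / (np.sqrt(v) + eps)
    return th

bowl = lambda th: th[0]**2 + 10 * th[1]**2  # elongated test landscape
for m in ("sgd", "momentum", "rmsprop"):
    print(m, optimize(bowl, [3.0, 2.0], method=m))
```

All three should land near the origin; swapping in Rosenbrock or Himmelblau as `f`, and plotting the visited points, essentially reproduces what this simulator animates.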