Visualize how SGD, Adam, RMSprop and other optimization algorithms traverse 2D loss landscapes in real time. Understand machine learning optimization intuitively.
Controls: Loss Function (select a function), Optimization Algorithm, Learning Rate α, β₁ (Momentum / Adam), β₂ (RMSprop / Adam), Max Steps.
Click the canvas to set the starting point.
CAE Connection
The same gradient-based approach is used in structural and shape optimization. Sensitivity analysis computes gradients of an objective (weight, compliance) with respect to design parameters, then an optimizer like Adam or L-BFGS updates the design.
Results (initial readouts): Steps 0 · Loss and ‖∇L‖ empty until a run starts · Learning Rate 0.010 · β₁ 0.900 · β₂ 0.9990 · Max Steps 500
Canvas color scale: blue = low loss → red = high loss.
Theory & Key Formulas
Adam tracks the first moment m and the second moment v of the gradient and uses them to apply an adaptive, per-parameter learning rate. It is a common default choice in deep-learning practice.
What is Gradient Descent?
🙋
What exactly is gradient descent? I hear it's how AI "learns," but I don't get how moving around a landscape teaches it anything.
🎓
Basically, think of it like this: the "landscape" in this simulator is a graph of a loss function. The height is the error or cost. The AI's goal is to find the lowest valley—the minimum error. Gradient descent is the algorithm that takes small steps downhill. In practice, it calculates the slope (the gradient) at its current position and moves opposite to it. Try selecting the "Simple Bowl" function above and watch the dot roll down to the center.
🙋
Wait, really? So the "Learning Rate" slider is just how big a step it takes? What happens if I set it too high on a steep function?
🎓
Exactly! That's a key insight. The learning rate, often called alpha (α), controls step size. If it's too small, the optimizer is slow. If it's too large, it can overshoot the minimum and even diverge, bouncing out of control. A common case is on the "Steep Valley" function—set α to 0.5 with SGD and watch it jump across the canyon instead of descending smoothly. Now try 0.01; it'll be slow but stable.
🙋
Okay, so SGD can be jumpy. That's why we have Adam and RMSprop? What do the β₁ and β₂ parameters actually do?
🎓
Great question. SGD uses only the current gradient. Adam and RMSprop also keep a memory of past gradients. β₁ controls momentum: it gives the optimizer inertia so it doesn't get stuck in tiny bumps. β₂ controls how the step size adapts in each direction based on past squared gradients, which helps in ravines. For instance, on the "Saddle Point" function, compare SGD (gets stuck) to Adam (escapes quickly). Play with the β sliders to see how they smooth or dampen the path.
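To make the saddle comparison concrete, here is a minimal Python sketch. It assumes an illustrative saddle f(x, y) = x² − y², which is not necessarily the simulator's "Saddle Point" preset, and the helper names are hypothetical. It counts how many steps plain gradient descent and a textbook Adam update need before |y| exceeds 1 when starting very close to the flat direction.

```python
import math

def saddle_grad(x, y):
    """Gradient of the illustrative saddle f(x, y) = x**2 - y**2 (an assumption)."""
    return 2.0 * x, -2.0 * y

def steps_to_escape_gd(start, alpha=0.01, max_steps=5000):
    """Plain gradient descent; returns the first step at which |y| > 1."""
    x, y = start
    for t in range(1, max_steps + 1):
        gx, gy = saddle_grad(x, y)
        x, y = x - alpha * gx, y - alpha * gy
        if abs(y) > 1.0:
            return t
    return None

def steps_to_escape_adam(start, alpha=0.01, beta1=0.9, beta2=0.999,
                         eps=1e-8, max_steps=5000):
    """Textbook Adam update; returns the first step at which |y| > 1."""
    theta = list(start)
    m = [0.0, 0.0]  # first-moment estimates
    v = [0.0, 0.0]  # second-moment estimates
    for t in range(1, max_steps + 1):
        g = saddle_grad(*theta)
        for i in range(2):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
        if abs(theta[1]) > 1.0:
            return t
    return None

start = (1.0, 1e-3)  # almost exactly on the ridge line y = 0
print("gradient descent steps:", steps_to_escape_gd(start))
print("Adam steps:", steps_to_escape_adam(start))
```

Near the saddle the gradient in y is tiny, so plain gradient descent creeps away slowly, whereas Adam normalizes each coordinate by its gradient history and keeps a step size close to α.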
Physical Model & Key Equations
The core of gradient descent is updating a parameter vector $\theta$ (your position on the landscape) by subtracting the gradient of the loss function $J(\theta)$, scaled by the learning rate $\alpha$:
$$\theta_{t+1} = \theta_t - \alpha \nabla J(\theta_t)$$
Here, $\theta_t$ is the current position, $\nabla J(\theta_t)$ is the gradient (a vector pointing in the direction of steepest ascent), and $\alpha$ is the learning rate you control in the simulator. This is the vanilla update rule that the simulator labels Stochastic Gradient Descent (SGD); in real training the gradient is estimated from a mini-batch, which is where "stochastic" comes from.
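The update rule fits in a few lines of Python. The sketch below assumes an illustrative bowl f(x, y) = x² + 10y² (not necessarily one of the simulator's presets) and uses hypothetical function names. It runs the same rule with a small and a slightly too large learning rate.

```python
def grad_bowl(x, y):
    """Gradient of the illustrative bowl f(x, y) = x**2 + 10 * y**2 (an assumption)."""
    return 2.0 * x, 20.0 * y

def gradient_descent(grad, start, alpha, steps):
    """Apply theta <- theta - alpha * grad(theta) and record every position."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - alpha * gx, y - alpha * gy
        path.append((x, y))
    return path

stable = gradient_descent(grad_bowl, (2.0, 1.0), alpha=0.01, steps=200)
unstable = gradient_descent(grad_bowl, (2.0, 1.0), alpha=0.11, steps=50)
print("alpha=0.01 ends at", stable[-1])
print("alpha=0.11 ends at", unstable[-1])
```

On this bowl each step multiplies y by (1 − 20α), so any α above 0.1 makes |y| grow instead of shrink. That is the overshoot-and-diverge behavior described in the dialogue.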
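Advanced optimizers like Adam build on this by estimating the first moment (mean, $m_t$) and second moment (uncentered variance, $v_t$) of the gradients, introducing hyperparameters $\beta_1$ and $\beta_2$ for exponential decay:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_{t+1} = \theta_t - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
$g_t = \nabla J(\theta_t)$ is the gradient at step $t$; the square, square root, and division are taken element-wise. $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected estimates, and $\epsilon$ is a small constant that prevents division by zero. $\beta_1$ (momentum) smooths the path, while $\beta_2$ adapts the learning rate per dimension. This is why Adam handles noisy, ill-conditioned landscapes, like the "Complex Terrain" in the simulator, much more effectively than plain SGD.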
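As a rough illustration of the Adam update, the following Python sketch implements the textbook rule with bias correction. The function names, the small constant `eps`, and the test bowl f(x, y) = x² + 10y² are assumptions made for this example, not details taken from the simulator.

```python
import math

def adam_minimize(grad, start, alpha=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Run a textbook Adam update for a fixed number of steps."""
    theta = list(start)
    m = [0.0] * len(theta)  # first moment (running mean of gradients)
    v = [0.0] * len(theta)  # second moment (running mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad(*theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return tuple(theta)

def grad_bowl(x, y):
    """Gradient of the illustrative bowl f(x, y) = x**2 + 10 * y**2 (an assumption)."""
    return 2.0 * x, 20.0 * y

print("after 1 step:", adam_minimize(grad_bowl, (2.0, 1.0), steps=1))
print("after 500 steps:", adam_minimize(grad_bowl, (2.0, 1.0)))
```

On the very first step the bias-corrected ratio is the sign of the gradient, so every coordinate moves by almost exactly α regardless of how steep that direction is. This is the per-dimension adaptation that β₂ controls.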
Frequently Asked Questions
The learning rate may be too high. Try reducing the learning rate (e.g., from 0.01 to 0.001) using the slider at the bottom of the screen. Additionally, using Adam or RMSprop instead of SGD often results in less oscillation and more stable convergence.
First, observe the behavior of SGD by changing the starting point, then switch to Adam or RMSprop for comparison. In particular, using the 'ravine' or 'steep valley' loss function presets will make the effects of momentum and adaptive learning rates more pronounced.
The basic update rules are the same, but actual training involves behavior in high-dimensional parameter spaces and the effects of mini-batches. This simulator visualizes the essential dynamics using a 2D loss function, which helps in intuitively understanding optimization algorithms.
Click anywhere on the loss function map displayed on the screen. The red dot will move to that position, and optimization will restart from there. Use this feature to investigate convergence to local minima or behavior at saddle points.
Real-World Applications
Training Neural Networks: This is the direct application. Every time you hear about a model like GPT or a ResNet being "trained," an optimizer like Adam is navigating a loss landscape with billions of parameters, adjusting weights to minimize prediction error. The choice of optimizer and learning rate schedule is critical for convergence speed and final performance.
CAE Structural Optimization: In Computer-Aided Engineering, the same principle is used for shape or topology optimization. The objective might be to minimize the weight of a car bracket while maintaining strength. Sensitivity analysis provides the gradients of the objective and constraints (for example weight and stress) with respect to the design parameters, and a gradient-based optimizer (like the ones here) iteratively updates the design. This is automated, simulation-driven design.
Robotics & Control Systems: Optimizers are used to tune the parameters of controllers (like PID gains) by minimizing an error function that quantifies how well a robot arm follows a desired trajectory. The landscape is the error over parameter space, and gradient descent finds the best settings.
Financial Modeling: In quantitative finance, models for option pricing or risk assessment have parameters calibrated to market data. This calibration is often framed as an optimization problem—minimizing the difference between model predictions and observed prices—solved using stochastic gradient methods similar to those visualized here.
Common Misconceptions and Points to Note
First, the belief that "a larger learning rate leads to faster convergence" is a dangerous misconception. Try using SGD on the Rosenbrock function in the simulator and raise the learning rate α step by step. While the first few steps descend quickly, you'll soon see the path bounce off the valley walls and diverge, moving away from the minimum. In practice, a common rule of thumb is to start with a relatively small value (e.g., 0.001 or 0.01) and use "scheduling" to decay the learning rate when the decrease in the loss value slows down.
Next, it is overconfident to assume that "Adam works well for anything with its default parameters". While Adam is indeed powerful, the default values (β₁=0.9, β₂=0.999) are not always optimal; the best values depend on the function's shape. For instance, when gradients are very noisy, a large β₂ forgets old squared gradients slowly, so the adaptive learning rate can become too small and progress stalls; a smaller β₂ (e.g., 0.99) forgets them faster. You can confirm that movement becomes more active by trying β₂=0.9 in the tool.
Finally, reaching a minimum does not mean "all problems are solved". The "Himmelblau function" in this simulator has four local minima of equal value (f = 0 at each), and changing the starting point changes which one the algorithm reaches. Here all four are equally good, but in practical machine learning you must always question whether the found "minimum" is truly the best solution (the global optimum) or an undesirable local optimum. "Random restarts", that is, trying multiple runs with different initial values, is a fundamental technique to mitigate this risk.
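Random restarts can be sketched in Python as follows. The code assumes the standard Himmelblau definition f(x, y) = (x² + y − 11)² + (x + y² − 7)² and plain gradient descent; the step size, step count, sampling box, and function names are illustrative choices, not the simulator's settings.

```python
import random

def himmelblau(x, y):
    return (x * x + y - 11) ** 2 + (x + y * y - 7) ** 2

def himmelblau_grad(x, y):
    a = x * x + y - 11
    b = x + y * y - 7
    return 4 * x * a + 2 * b, 2 * a + 4 * y * b

def descend(start, alpha=0.002, steps=2000):
    """Plain gradient descent from one starting point."""
    x, y = start
    for _ in range(steps):
        gx, gy = himmelblau_grad(x, y)
        x, y = x - alpha * gx, y - alpha * gy
    return x, y

def random_restarts(n_runs=8, seed=0):
    """Run gradient descent from several random starts in [-5, 5] x [-5, 5]."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        start = (rng.uniform(-5, 5), rng.uniform(-5, 5))
        end = descend(start)
        results.append((start, end, himmelblau(*end)))
    return results

for start, end, loss in random_restarts():
    print(f"start=({start[0]:+.2f}, {start[1]:+.2f}) -> "
          f"end=({end[0]:+.3f}, {end[1]:+.3f}), loss={loss:.2e}")
```

Each run ends in whichever basin its start belongs to. On a function whose minima differ in value, you would keep the run with the lowest final loss.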
Select an optimizer from the dropdown (SGD, Adam, or RMSprop) to initialize its algorithm-specific parameters.
Adjust the Learning Rate slider (lrSlider: 0.001–0.1) and, for Adam/RMSprop, the coefficients β₁ and β₂ with their respective sliders.
Set Max Steps (maxStepSlider: 10–1000) and click Start to watch the optimizer traverse the 2D loss surface while Steps, Loss, ‖∇L‖ (gradient norm), and the convergence metrics update in real time.
Worked Example
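Configure the Adam optimizer with Learning Rate = 0.01, β₁ = 0.9, β₂ = 0.999, and Max Steps = 500 on the Rosenbrock loss function f(x,y) = (1−x)² + 100(y−x²)², starting from (−1.2, 1.0). The figures reported for this setup are: after 127 steps, Adam reaches Loss = 0.0018 with ‖∇L‖ = 0.0042, while SGD with the same learning rate needs 412 steps to reach a comparable Loss = 0.0025. Treat these as simulator readouts rather than textbook values: with the unmodified update θ ← θ − α∇f, the first SGD step from (−1.2, 1.0) at α = 0.01 jumps to about (0.956, 1.88) and the following steps diverge, so the SGD figure can only hold if the simulator clips or rescales the gradient. The comparison still illustrates the intended point: adaptive methods cope better with ill-conditioned surfaces.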
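To check a configuration like this outside the simulator, a minimal Python sketch is given below. It uses the Rosenbrock formula and start point quoted above with textbook SGD and Adam updates and no clipping. The function names and the blow-up threshold are hypothetical, and the simulator may process gradients differently, so its readouts need not match.

```python
import math

def rosenbrock(x, y):
    return (1 - x) * (1 - x) + 100 * (y - x * x) * (y - x * x)

def rosenbrock_grad(x, y):
    return -2 * (1 - x) - 400 * x * (y - x * x), 200 * (y - x * x)

def sgd_step(theta, g, alpha):
    """One unclipped update: theta <- theta - alpha * g."""
    return tuple(p - alpha * gi for p, gi in zip(theta, g))

def run_sgd(start, alpha, max_steps, blowup=1e10):
    """Return (steps taken, final point, final loss, diverged flag)."""
    theta = start
    for t in range(1, max_steps + 1):
        theta = sgd_step(theta, rosenbrock_grad(*theta), alpha)
        loss = rosenbrock(*theta)
        if not math.isfinite(loss) or loss > blowup:
            return t, theta, loss, True
    return max_steps, theta, rosenbrock(*theta), False

def run_adam(start, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, max_steps=500):
    """Return the list of visited points, including the start."""
    theta = list(start)
    m, v = [0.0, 0.0], [0.0, 0.0]
    path = [tuple(theta)]
    for t in range(1, max_steps + 1):
        g = rosenbrock_grad(*theta)
        for i in range(2):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
        path.append(tuple(theta))
    return path

start = (-1.2, 1.0)
steps, theta, loss, diverged = run_sgd(start, alpha=0.01, max_steps=500)
print("SGD: steps =", steps, "diverged =", diverged)
adam_path = run_adam(start)
for t in (127, 500):
    print(f"Adam step {t}: loss = {rosenbrock(*adam_path[t]):.4g}")
```

Running it prints the loss Adam actually reaches at steps 127 and 500 under these assumptions, which can be compared with the values shown in the tool.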
Practical Notes
For ill-conditioned (Rosenbrock) or highly multimodal (Rastrigin) problems, Adam with default β₁=0.9, β₂=0.999 typically converges about 3–4× faster than vanilla SGD; RMSprop sits in between at roughly 2×.
Increasing the Learning Rate above 0.05 can cause oscillation and divergence; watch for spikes in ‖∇L‖ as an early sign of overshooting in steep ravines.
In production ML workflows, start with Adam (lr=0.001) and multiply the Learning Rate by 0.1 every 100 steps while the Loss plateaus; β₁=0.9 suits most convex landscapes, and raising it to 0.95 retains more momentum when updates are sparse.
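The plateau-based decay in the last note could be sketched in Python like this. The class name, the `min_delta` threshold, and the definition of "plateau" (the best loss improved by less than `min_delta` over the last window of steps) are assumptions made for illustration.

```python
class PlateauDecay:
    """Multiply lr by `factor` at each `every`-step check where the best loss has stalled."""

    def __init__(self, lr=0.001, factor=0.1, every=100, min_delta=1e-4):
        self.lr = lr
        self.factor = factor
        self.every = every
        self.min_delta = min_delta
        self.best = float("inf")          # best loss seen so far
        self.best_at_check = float("inf") # best loss at the previous check
        self.step_count = 0

    def update(self, loss):
        """Record one loss value and return the learning rate to use next."""
        self.step_count += 1
        self.best = min(self.best, loss)
        if self.step_count % self.every == 0:
            # Decay only if the last window improved the best loss by less than min_delta.
            if self.best_at_check - self.best < self.min_delta:
                self.lr *= self.factor
            self.best_at_check = self.best
        return self.lr

flat = PlateauDecay()
for _ in range(200):
    flat_lr = flat.update(1.0)       # loss never improves
improving = PlateauDecay()
for k in range(1, 201):
    improving_lr = improving.update(1.0 / k)  # loss keeps falling
print("flat loss lr:", flat_lr)
print("improving loss lr:", improving_lr)
```

The first check never triggers a decay because there is no earlier window to compare against. After that, a window without meaningful improvement reduces the rate by the chosen factor.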