Adam Optimizer Simulator Back
Machine Learning

Adam Optimizer Simulator

Experience Adam, the workhorse algorithm behind deep-learning training. Adjust the learning rate, β₁ and β₂ to watch how the combination of the first moment (momentum) and the second moment (adaptive learning rate) descends an elongated valley, shown in real time as a path on the contour plot and a loss convergence curve.

Parameters
Learning rate α
Base step size per iteration, normalized by √v̂
First-moment decay β₁
How much of the average gradient to carry over (momentum)
Second-moment decay β₂
How much of the average squared gradient to carry over (adaptive rate)
Iterations
steps
Loss surface
Switch the elongation (condition number) of the valley
Results
Final loss f
Final x
Final y
Iterations
Verdict
Loss reduction (%)
Loss surface contours and the Adam path

The concentric ellipses are contours of the loss f, and the central dot is the minimum. The coloured path from the start point (−9, 4) is Adam's update trajectory, traced repeatedly by a marker.

Loss convergence curve
Optimization-path comparison — Adam vs plain GD
Theory & Key Formulas

$$m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t,\qquad v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2$$

The first moment m (an exponential moving average of the gradient g) and the second moment v (an EMA of the squared gradient). β₁ and β₂ are decay rates.

$$\hat m_t=\frac{m_t}{1-\beta_1^t},\ \hat v_t=\frac{v_t}{1-\beta_2^t},\qquad \theta_t=\theta_{t-1}-\frac{\alpha\,\hat m_t}{\sqrt{\hat v_t}+\varepsilon}$$

The update uses the bias-corrected m̂ and v̂. The per-parameter division by √v̂ is the very core of what makes Adam "adaptive".

$$f(x,y)=\tfrac12\left(a\,x^{2}+b\,y^{2}\right),\quad \nabla f=(a\,x,\;b\,y)$$

This tool's objective function. The condition number is b/a. In the elongated valley b/a=20, so the gradient scale differs greatly by direction.

What is the Adam Optimizer?

🙋
Every deep-learning textbook says "just use Adam to start". What is Adam actually doing as an optimizer?
🎓
Roughly speaking, Adam takes the best of two ideas: "momentum" and an "adaptive learning rate". The name itself stands for Adaptive Moment Estimation. On top of plain gradient descent it builds in two devices at once — one that stabilizes the direction of travel using an average of past gradients (the first moment m), and one that adjusts the step size per parameter using an average of past squared gradients (the second moment v). Leave the controls on the left as they are and run it with the defaults first.
🙋
The path goes almost straight to the bottom of the valley. The plain gradient descent I saw earlier zigzagged in this same elongated valley.
🎓
That's exactly Adam's strength. This valley is steep in the y direction and gentle in the x direction. Plain GD uses the same learning rate for all parameters, so it overshoots and oscillates in y while barely moving in x. Adam uses the second moment v to automatically adjust each direction — "y has a big gradient, so take a small step; x has a small gradient, so take a big step". That is why it can descend straight even on an ill-conditioned landscape. Look at the "comparison" chart below; the gap from plain GD is clear even for the same number of steps.
🙋
There are these values β₁ and β₂. What are they? What happens if I change the defaults of 0.9 and 0.999?
🎓
β₁ is the decay rate for the first moment — how much of the average gradient to carry over from the past — and 0.9 is standard. β₂ is the decay rate for the second moment, the average squared gradient, and 0.999 is standard. β₂ is closer to 1 because v, which sets the baseline step size, should change slowly. Lowering β₁ weakens the momentum and makes the response nimbler; raising it adds inertia. In practice these two are rarely touched and are usually left at their defaults.
🙋
The formula has a "bias correction" division m̂=m/(1−β₁ᵗ). What is that for?
🎓
Good question. m and v start from zero, right? So at step 1, m can only absorb (1−β₁) of the gradient, coming out smaller and biased toward zero compared to the true average. That is the "initialization bias". The division m̂=m/(1−β₁ᵗ) cancels that shrinkage exactly. At t=1 it scales up by 1/(1−β₁)=10× to balance the books, and as t grows, β₁ᵗ becomes nearly 0 so the correction factor approaches 1 and effectively does nothing. So remember: "bias correction matters most in the first few steps of training". Without it, the early progress would be extremely slow.
🙋
I see. So can I choose the learning rate α with the same intuition as for plain GD?
🎓
That part is a bit different. Adam's update size is α·m̂/(√v̂+ε), and because it normalizes by dividing by √v̂, the effective step per iteration stays roughly on the order of α. So Adam is less sensitive to the learning rate than plain GD, and α=0.001 has become almost the standard starting point in deep learning. On this tool's simple quadratic you can take α much larger, but raise it too much and it oscillates near the minimum, and even more and it diverges. "Start at α=0.001, raise it if it's slow, lower it if it goes wild" is the field rule of thumb.

Frequently Asked Questions

Adam (Adaptive Moment Estimation) is an optimization algorithm that combines momentum with an adaptive learning rate. The first moment m, an exponential moving average of the gradient, sets the direction of travel, while the second moment v, an exponential moving average of the squared gradient, sets a per-parameter step size. The update rule is θ ← θ − α·m̂/(√v̂+ε), where m̂ and v̂ are bias-corrected values. Because it converges quickly and stably even on ill-conditioned landscapes or sparse gradients, Adam is widely used as the default optimizer for deep learning.
β₁ is the decay rate of the first moment (the average gradient), with a default of 0.9. β₂ is the decay rate of the second moment (the average squared gradient), with a default of 0.999. Both control how much past history an exponential moving average keeps. The problem is that m and v are initialized at zero, so in the first few steps the averages are strongly biased toward zero. Bias correction m̂=m/(1−β₁ᵗ) and v̂=v/(1−β₂ᵗ) is a division that cancels this bias exactly; it matters most early in training and becomes nearly inactive (approaching 1) as t grows.
Plain gradient descent uses the same learning rate for all parameters, updating with θ ← θ − α∇f. Adam adds two mechanisms. One is momentum (the first moment m), which averages past gradients to stabilize the direction of travel. The other is a division by the second moment √v, which automatically makes the step smaller in directions where the gradient is large and larger where the gradient is small. As a result, even on problems like an elongated valley where the gradient scale differs greatly by direction, Adam converges smoothly with a step size suited to each parameter.
Adam's update size is α·m̂/(√v̂+ε), and because the √v̂ normalization takes effect, the effective step size per iteration is roughly on the order of α. This makes Adam less sensitive to the learning rate than plain GD, and α=0.001 is the standard starting point for deep learning. On this tool's quadratic function you can take α much larger, but raising it too far causes oscillation near the minimum and, beyond that, divergence. In practice, start at α=0.001, increase it a few-fold if learning is slow, and lower it if the loss becomes unstable.

Real-World Applications

Training neural networks: Adam is the first optimizer reached for across every area of deep learning — image recognition, natural language processing, speech recognition. There are domain-specific tweaks, such as lowering β₂ to 0.98 when training Transformers, but the "Adam triple" of α=0.001, β₁=0.9 and β₂=0.999 is still the default in many implementations. Because it removes the need to tune the learning rate per direction by hand, it is the first candidate when trying out a new model.

Problems with sparse gradients: Adam is especially strong on "sparse gradient" problems like recommender systems and word embeddings, where only a fraction of the many parameters is updated each time. Because the second moment v is accumulated separately for each parameter, Adam can automatically assign a large step to parameters that rarely receive a gradient and a small step to frequently updated ones. This was cited as one of the main motivations in the original Adam paper.

Design exploration in CAE and numerical optimization: Structural, topology and shape optimization all use gradient methods to drive an objective (weight, stress, compliance) down via gradients. If the design variables are not scaled consistently, the loss landscape becomes an elongated valley and convergence slows dramatically. Adam's idea of adapting the step size per direction effectively does the variable normalization for you, so it is useful in engineering design optimization, not just machine learning.

Inverse analysis and parameter identification: Inverse problems that estimate material-model or boundary-condition parameters to fit experimental data also minimize the error as a loss. Parameter sensitivities often differ by orders of magnitude — one coefficient is sensitive to the loss, another is not — making convergence hard with fixed-learning-rate gradient descent. Adam's adaptive learning rate absorbs this spread of sensitivities automatically and stabilizes convergence.

Common Misconceptions and Pitfalls

The most common one is assuming "Adam is always the best optimizer". Adam converges fast and is easy to tune, but it is known that on generalization performance (how well the model does on data beyond the training set) it can lose to a well-tuned SGD with momentum. In image-recognition competitions, more than a few teams finish the run with SGD. Adam is an optimizer good at reaching a "fast, easy and reasonably good solution" — it is not a guarantee of "always the best solution".

Next, the misconception that "once it converges, the loss should be zero". This tool's objective has its minimum at the origin with f=0, but with a finite number of iterations the final loss is never exactly zero. The convergence verdict uses a relative criterion — "final loss below 1% of the initial loss". In real machine learning the minimum loss is not necessarily zero (there is data noise and a limit to model capacity), and convergence is judged by whether you have reached a "flat bottom that no longer goes down". The practical instinct is to watch whether the rate of decrease has stalled, not the absolute loss value.

Finally, the misconception that "a larger learning rate always converges faster". Thanks to the √v̂ normalization, Adam is less sensitive to the learning rate than plain GD, but there is still a limit. Raising the learning rate α to 1.0 in this tool's elongated valley makes even Adam oscillate violently near the minimum or makes the loss diverge. The early phase, where bias correction inflates the effective step, is especially prone to instability, and the divergence risk rises when α is large. "If it diverges, lower α first" is the rule of thumb that does not change for Adam either.

How to Use

  1. Set learning rate (lr) between 0.001–0.1; typical CNNs use 0.001–0.01
  2. Configure β₁ (momentum decay, default 0.9) and β₂ (RMSprop decay, default 0.999) using the range sliders
  3. Specify iteration count (50–5000 steps); ResNet50 training commonly requires 1000+ steps to convergence
  4. Click simulate to observe loss trajectory and final parameters x, y across the optimization landscape
  5. Compare verdicts: if loss stagnates after 500 iterations, increase lr or reduce β₂ slightly

Worked Example

Training a binary classifier with initial loss 2.14. Set lr=0.01, β₁=0.9, β₂=0.999, steps=2000. After optimization: final loss f=0.047, parameters converge to x=−1.23, y=0.89, achieving 97.8% loss reduction in 2000 iterations. Decreasing β₂ to 0.95 produces more aggressive adaptation; loss reaches 0.031 but exhibits higher variance in final 100 steps. Increasing lr to 0.05 causes divergence (final loss 3.42) due to overshooting the convex region.

Practical Notes

  1. Adam with lr=0.001 suits fine-tuning pretrained ImageNet weights; lr=0.01 accelerates training of randomly initialized dense layers
  2. β₁=0.9 balances gradient momentum; reducing to 0.8 helps sparse gradient problems (NLP embeddings)
  3. β₂=0.999 prevents premature decay of RMSprop term; use 0.99 for noisy loss surfaces (GANs, reinforcement learning)
  4. Verify loss reduction percentage exceeds 80% by step 1000 for well-scaled problems; stagnation suggests learning rate is too low
  5. If final loss oscillates, reduce lr by half and extend iterations by 50%