Gradient Clipping Simulator

Explore gradient clipping, the standard fix for exploding gradients in deep learning. Change the clip threshold and learning rate to see how norm clipping, value clipping and no clipping descend a loss landscape with a steep cliff, shown as a trajectory on a contour plot and as loss and gradient-norm curves in real time.

Parameters
Clip threshold tau: The gradient size is capped at this value
Learning rate eta: How far each step moves along the gradient
Training steps
Clipping method: How the gradient size is capped
Loss landscape: Switch the type of steep terrain
Results
Final loss f
Max gradient norm (before clip)
Clip events triggered
Divergence
Loss reduction (%)
Clipping verdict
Loss-surface contour and optimisation trajectory

The contour lines show the loss landscape f, and the green dot is the minimum. The coloured path from the start point is the update trajectory, with a marker travelling along it on a loop. On the cliff terrain without clipping the path makes a wild jump off-screen.

Loss vs step
Raw gradient norm vs step (before clipping)
Theory & Key Formulas

$$\text{if }\lVert g\rVert\gt \tau:\quad g\leftarrow g\cdot\frac{\tau}{\lVert g\rVert}$$

Clip by norm. When the gradient norm ||g|| exceeds the threshold tau, the whole gradient vector is rescaled uniformly. The direction (descent direction) is preserved while the magnitude is bounded to tau.
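As a minimal sketch of the clip-by-norm rule above, the following plain-Python function rescales a gradient given as a list of floats. The function name `clip_by_norm` and the example vectors are illustrative assumptions, not part of the simulator.

```python
import math

def clip_by_norm(grad, tau):
    """Return grad rescaled so that its Euclidean norm is at most tau."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > tau:
        scale = tau / norm              # the same factor for every component
        return [g * scale for g in grad]
    return list(grad)                   # small gradients pass through unchanged

# Hypothetical gradient with norm 50, clipped to tau = 5
clipped = clip_by_norm([30.0, 40.0], tau=5.0)

# Hypothetical gradient with norm 0.5, below the threshold
small = clip_by_norm([0.3, 0.4], tau=5.0)
```

Because every component is multiplied by the same factor, the ratio between components, and therefore the direction, is unchanged.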

$$g_i\leftarrow\operatorname{clip}(g_i,\,-\tau,\,+\tau),\qquad \theta\leftarrow\theta-\eta\,g_{\text{clipped}}$$

Clip by value clamps each component g_i independently to plus/minus tau. Finally the parameters theta are updated with the clipped gradient and the learning rate eta.
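The clip-by-value rule and the parameter update can be sketched in a few lines of plain Python. The helper names `clip_by_value` and `sgd_step`, and the numbers used, are assumptions chosen for illustration only.

```python
def clip_by_value(grad, tau):
    """Clamp each gradient component independently to [-tau, +tau]."""
    return [max(-tau, min(tau, g)) for g in grad]

def sgd_step(theta, grad, eta):
    """Plain gradient-descent update: theta <- theta - eta * grad."""
    return [t - eta * g for t, g in zip(theta, grad)]

# Hypothetical parameters and gradient
theta = [1.0, 2.0]
grad = [100.0, 0.3]

clipped = clip_by_value(grad, tau=5.0)        # only the first component is cut
theta_new = sgd_step(theta, clipped, eta=0.1)
```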

What is the Gradient Clipping Simulator?

🙋
I keep seeing "gradient clipping" in deep learning. What does it actually do?
🎓
In short, it just forces the gradient to be smaller whenever it grows too large. Training means descending a loss landscape: you measure the slope (gradient g) where you are, multiply it by the learning rate eta and take one step. But if the landscape has a "cliff", the gradient at the edge of that cliff becomes enormous. Take a step there and the parameters are flung far away and training breaks. That is the "exploding gradient" problem. Clipping puts a cap on the gradient size just before each update to prevent that blow-up.
🙋
I see. I set the "clipping method" on the left to "No clipping" on the cliff terrain, and the trajectory shot off-screen instantly with a "divergence" flag. Is that the exploding gradient?
🎓
Exactly. This cliff terrain is built so that the walls on either side of the valley centre rise extremely fast, like a cosh function. At the wall the x-direction gradient is huge, so a single step of learning rate times gradient jumps far past the valley floor and lands high on the opposite wall. There the slope is even steeper and the gradient even larger, so each step overshoots further than the last, and through this positive feedback the loss blows up towards infinity. This is the kind of failure that happens in real RNN training.
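The positive feedback can be reproduced in one dimension. The sketch below assumes a wall of the form f(x) = cosh(a*x) with a = 3, which is only a stand-in for the simulator's actual surface. The divergence guard of |x| > 100 is likewise an assumption.

```python
import math

def grad_cosh(x, a=3.0):
    """Derivative of the assumed wall f(x) = cosh(a * x)."""
    return a * math.sinh(a * x)

def run_unclipped(x0, eta, steps, a=3.0, limit=100.0):
    """Plain gradient descent without clipping; returns (path, diverged)."""
    path = [x0]
    x = x0
    for _ in range(steps):
        x = x - eta * grad_cosh(x, a)
        path.append(x)
        if abs(x) > limit:          # treated as divergence in this sketch
            return path, True
    return path, False

# Start high on the wall: each step lands further from the valley floor at x = 0
path, diverged = run_unclipped(x0=1.5, eta=0.1, steps=20)

# Start close to the valley floor: the same learning rate converges
calm_path, calm_diverged = run_unclipped(x0=0.1, eta=0.1, steps=50)
```

Comparing `path` with `calm_path` shows that the learning rate alone is not the problem; the blow-up comes from taking an unclipped step where the wall is steep.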
🙋
So why does switching back to "Clip by norm" cross the cliff properly?
🎓
Clip by norm rescales the entire gradient vector with g <- g*(tau/||g||). The moment the gradient norm at the cliff exceeds the threshold tau, the vector is cut down to length tau. The key point is that the direction is unchanged: the descent direction stays the same and only the step size is capped to a safe value. So you do not leap across the cliff in one bound; you cross it in a controlled way, step by step. Look at the "raw gradient norm" chart below: even when the raw norm spikes above the red threshold line, the gradient that is actually applied is capped at tau.
🙋
And how is the other option, "Clip by value", different?
🎓
Clip by value clamps each component of the gradient independently to the range plus/minus tau. For example, a gradient of (100, 0.3) becomes (5, 0.3) when tau=5. It is simple to implement, but only the x-component is cut while the y-component is untouched, so the direction of the combined vector shifts away from the original gradient. In other words, the descent direction gets a little distorted. Clip by norm shrinks all components by the same ratio, so the direction is preserved. That is why clip by norm is the default choice in practice. Just remember that clip by value, while easy to reason about, has the side effect of distorting the direction.
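To put a number on the distortion, the sketch below measures the angle between the original gradient (100, 0.3) and each clipped version at tau = 5. The helper names are illustrative assumptions.

```python
import math

def clip_by_norm(grad, tau):
    norm = math.hypot(*grad)
    return [g * tau / norm for g in grad] if norm > tau else list(grad)

def clip_by_value(grad, tau):
    return [max(-tau, min(tau, g)) for g in grad]

def angle_deg(a, b):
    """Angle between two 2-D vectors, in degrees."""
    dot = a[0] * b[0] + a[1] * b[1]
    cross = a[0] * b[1] - a[1] * b[0]
    return math.degrees(math.atan2(abs(cross), dot))

g = [100.0, 0.3]
tau = 5.0

angle_norm = angle_deg(g, clip_by_norm(g, tau))    # direction preserved
angle_value = angle_deg(g, clip_by_value(g, tau))  # direction rotated
```

For this gradient the value-clipped vector is rotated by a little over 3 degrees, while the norm-clipped vector stays parallel to the original up to rounding error.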

Frequently Asked Questions

What is gradient clipping and why is it needed?
Gradient clipping caps the size of the gradient before each optimisation step. Deep networks, and recurrent networks especially, have loss landscapes with cliffs: regions where the loss changes extremely steeply and the gradient becomes enormous. Taking a normal-sized step on a cliff multiplies that huge gradient by the learning rate and hurls the parameters far away, collapsing training (exploding gradients). Clip-by-norm rescales the whole gradient vector to a maximum length tau, preserving its direction; clip-by-value clamps each component to plus/minus tau. Either way the step stays bounded and training can safely cross the cliff.

Should I use clip-by-norm or clip-by-value?
Clip-by-norm is usually preferred. It rescales the entire gradient vector with g <- g*(tau/||g||), so the descent direction is preserved and only the magnitude is bounded to tau. Clip-by-value clamps each component independently to plus/minus tau, which is simple to implement but, when only some components are clipped, distorts the direction of the resulting vector. On the cliff surface in this tool, clip-by-norm crosses the wall more smoothly. Most deep-learning frameworks provide both methods.

How should I choose the clip threshold tau?
Set tau so that normal-sized gradients pass through untouched while only the huge spikes from a cliff are capped. Too small and you shrink ordinary gradients, slowing training; too large and you fail to stop the spike and the run diverges. In practice, train for a few hundred steps, observe the distribution of gradient norms and pick a value somewhere between its median and its 95th percentile, or use an empirical value of roughly 1 to 10. In this tool, lowering tau increases the number of clip events while raising it lets the maximum gradient norm grow.

Why are RNNs especially prone to exploding gradients?
RNNs apply the same weight matrix repeatedly at every time step, so during backpropagation the gradient scales roughly with powers of that matrix's eigenvalues. If the largest eigenvalue magnitude exceeds 1, the gradient can grow exponentially with sequence length and a steep cliff appears in the loss landscape. Hitting that cliff while training on a long sequence makes the gradient explode and breaks training. Gradient clipping caps only the spike at the cliff and leaves other steps untouched, which is why it is a standard part of training RNNs, LSTMs and Transformers.
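
The threshold-selection advice in the FAQ (record gradient norms for a few hundred steps, then look at the median and the 95th percentile) can be sketched as follows. The `percentile` helper and the recorded norms are hypothetical; a real run would log one norm per training step.

```python
import math

def percentile(values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

# Hypothetical gradient norms from a short trial run, with one spike
norms = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 9.5, 1.0, 0.9]

median = percentile(norms, 50)
p95 = percentile(norms, 95)

# Candidate range for tau: above typical norms, below the spike
tau_low, tau_high = median, p95
```

With so few samples the 95th percentile is pulled strongly towards the single spike, which is one reason to record a few hundred steps before choosing tau.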

Real-World Applications

Training recurrent neural networks (RNN/LSTM): Gradient clipping first became widely used for training RNNs. Because an RNN multiplies the same weights repeatedly through time, backpropagating through a long sequence can make the gradient grow exponentially, and a steep cliff appears in the loss landscape. In early RNN/LSTM models for machine translation, speech recognition and text generation, leaving out norm clipping (typically tau = 1 to 5) routinely caused the loss to become NaN within a few epochs and training to stop. For these models clipping is, in effect, a mandatory stabilisation technique.
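
The exponential growth can be seen in the simplest possible case. The sketch below assumes a scalar, linear recurrence h_t = w * h_{t-1}, so the factor that backpropagation applies over T steps is w to the power T. Real RNNs use weight matrices and nonlinearities, so this only illustrates the scaling, and the function name is hypothetical.

```python
def backprop_factor(w, steps):
    """Product of `steps` identical per-step derivatives dh_t/dh_{t-1} = w."""
    factor = 1.0
    for _ in range(steps):
        factor *= w
    return factor

growing = backprop_factor(1.1, 100)    # |w| > 1: factor explodes
neutral = backprop_factor(1.0, 100)    # |w| = 1: factor stays at 1
shrinking = backprop_factor(0.9, 100)  # |w| < 1: factor vanishes
```

A weight only 10% above 1 already multiplies the gradient by more than ten thousand over 100 steps, while a weight 10% below 1 shrinks it to almost nothing. Clipping addresses only the first case.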

Training Transformers and large language models: Gradient clipping is still standard in modern, very large Transformers. Early in training, or occasionally on a "bad mini-batch", the gradient norm can spike, and without clipping a single spike can wreck the entire run. Training recipes for GPT-style and BERT-style models have long used "clip on the global norm, threshold 1.0" as an almost default setting. It acts as insurance that lets a run with billions of parameters proceed for weeks without stopping.
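
"Clip on the global norm" means that one norm is computed over the gradients of all parameters together and one scale factor is applied to every parameter. A minimal sketch, assuming gradients are stored as a dict of float lists (the parameter names `embed` and `attn` are hypothetical):

```python
import math

def clip_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint norm is at most max_norm.

    grads maps a parameter name to a list of floats.
    Returns (clipped_grads, total_norm_before_clipping).
    """
    total = math.sqrt(sum(g * g for vec in grads.values() for g in vec))
    scale = min(1.0, max_norm / total) if total > 0.0 else 1.0
    clipped = {name: [g * scale for g in vec] for name, vec in grads.items()}
    return clipped, total

# Hypothetical gradients for two parameter groups; joint norm is 5
grads = {"embed": [3.0, 0.0], "attn": [0.0, 4.0]}
clipped, total_norm = clip_global_norm(grads, max_norm=1.0)
```

Using a single factor keeps the relative sizes of the per-parameter gradients intact, which clipping each parameter group separately would not.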

Reinforcement learning and policy gradients: In reinforcement learning the reward scale is unstable, and policy-gradient estimates can become outliers of enormous magnitude. Implementations of algorithms such as PPO and A2C commonly apply norm clipping to the gradients of the policy and value networks. It is an unglamorous but effective safety device that keeps noisy signals from the environment from destroying training.

Divergence control in CAE and numerical optimisation: Beyond machine learning, gradient-based numerical optimisation of an objective function shares the problem of "the step is too large and the run diverges". Trust-region methods and line search are principled ways to control the step size, but gradient clipping can be seen as the simplest step limiter that just sets an upper bound. On an ill-conditioned objective where the design variables are not scaled consistently, it can serve as a quick remedy to keep the optimisation from blowing up.

Common Misconceptions and Pitfalls

The most common misconception is that gradient clipping also fixes vanishing gradients. Clipping addresses only "exploding gradients", where the gradient grows too large; it does nothing for "vanishing gradients", where the gradient becomes too small and training stalls. If anything, clipping is an operation that shrinks large gradients, so it has no bearing on vanishing. The cures for vanishing gradients are different: gated structures such as LSTM and GRU, residual (skip) connections, suitable weight initialisation and batch normalisation. Keep clipping and vanishing-gradient remedies clearly separated in your mind.

Next is the belief that the smaller the threshold tau, the safer. It is true that a smaller tau makes divergence less likely, but if tau is too small even the legitimate, normal-sized gradients are uniformly shrunk, the effective learning rate drops and convergence slows. In the extreme, setting tau near zero leaves every step with a tiny stride and you never reach the valley floor. Lower the threshold in this tool and you will see the number of clip events rise while the loss decreases more and more slowly. The key is to place tau at just the right value that stops only the explosion and lets the normal gradient pass straight through.

Finally, the misconception that with clipping in place, the learning rate can be as large as you like. Norm clipping caps the gradient size at tau, so the length of the actual step is at most eta*tau. Raise the learning rate eta to an extreme and, even with a bounded clipped gradient, that eta*tau step becomes too large, overshooting the minimum and oscillating or diverging. In this tool too, raising the learning rate close to its upper limit with norm clipping enabled can still diverge. Clipping is only an "upper bound on the gradient side"; it does not remove the need to tune the learning rate.
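
The step-size bound follows directly from the update rule in the theory section. The short derivation below assumes plain gradient descent with no momentum or adaptive scaling, which is an assumption about the optimiser rather than something this page specifies; n denotes the number of gradient components.

$$\lVert\Delta\theta\rVert=\eta\,\lVert g_{\text{clipped}}\rVert\le\eta\,\tau\qquad\text{(clip by norm)}$$

$$\lVert\Delta\theta\rVert=\eta\,\sqrt{\textstyle\sum_{i=1}^{n} g_{i,\text{clipped}}^{2}}\le\eta\,\tau\sqrt{n}\qquad\text{(clip by value)}$$

In both cases the bound is proportional to eta, so doubling the learning rate doubles the largest possible step even though tau is unchanged.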

How to Use

  1. Set the clip threshold (clipNum) between 0.5 and 5.0 to define the maximum allowed gradient norm (or, for clip by value, the maximum magnitude of each component).
  2. Adjust the learning rate (lrNum) from 0.001 to 0.1; higher rates increase the risk of exploding gradients.
  3. Specify the number of training steps (stepsNum) and run the simulation, then observe Final loss f, Max gradient norm (before clip) and Clip events triggered.
  4. Compare Loss reduction (%) across runs to find the clip threshold that works best for the selected loss landscape and learning rate.
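
The steps above can be mimicked offline with a small script. This is a sketch under stated assumptions, not the simulator's implementation: the surface f(x, y) = cosh(A*x) - 1 + 0.5*y^2, the start point, the divergence guard and the definition of loss reduction as 100*(f_start - f_final)/f_start are all hypothetical choices.

```python
import math

A = 3.0  # wall steepness of the assumed surface (hypothetical value)

def loss(x, y):
    return math.cosh(A * x) - 1.0 + 0.5 * y * y

def grad(x, y):
    return A * math.sinh(A * x), y

def simulate(method, tau, eta, steps, start=(1.5, 1.0), limit=100.0):
    """Run gradient descent with method 'none', 'norm' or 'value'."""
    x, y = start
    f_start = loss(x, y)
    max_norm, clip_events, diverged = 0.0, 0, False
    for _ in range(steps):
        gx, gy = grad(x, y)
        norm = math.hypot(gx, gy)
        max_norm = max(max_norm, norm)          # raw norm, before clipping
        if method == "norm" and norm > tau:
            gx, gy = gx * tau / norm, gy * tau / norm
            clip_events += 1
        elif method == "value" and (abs(gx) > tau or abs(gy) > tau):
            gx = max(-tau, min(tau, gx))
            gy = max(-tau, min(tau, gy))
            clip_events += 1
        x, y = x - eta * gx, y - eta * gy
        if abs(x) > limit or abs(y) > limit:    # divergence guard (assumption)
            diverged = True
            break
    final = float("inf") if diverged else loss(x, y)
    reduction = None if diverged else 100.0 * (f_start - final) / f_start
    return {
        "final_loss": final,
        "max_grad_norm": max_norm,
        "clip_events": clip_events,
        "diverged": diverged,
        "loss_reduction_pct": reduction,
    }

results = {m: simulate(m, tau=1.0, eta=0.1, steps=200)
           for m in ("none", "norm", "value")}
```

Re-running `simulate` with different `tau` and `eta` values gives the same kind of comparison as step 4, with the caveat that the numbers depend entirely on the assumed surface.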

Worked Example

Training a 3-layer RNN on sequence data with learning rate 0.05 and clip threshold 1.0 over 500 steps: the raw gradient norm peaks at 8.3 and clipping triggers 47 times. The final loss converges to 0.18, a 73% loss reduction. With clipping effectively removed (threshold raised to 10.0, above the peak norm of 8.3), the run diverges at step 312 with the final loss exceeding 2.5. Lowering the clip threshold to 0.5 also avoids divergence but slows convergence: the final loss is 0.24 after 500 steps.

Practical Notes

  1. As rough starting points, LSTM and GRU architectures typically use thresholds of 0.5–2.0; standard RNNs often need 0.1–0.5 because of their exploding-gradient pathology.
  2. Clip events near 0 mean the threshold is rarely reached; if the run is still unstable, lower it incrementally. Clipping on nearly every step (close to 100 events per 100 steps) signals that the threshold is too low or the learning rate too aggressive.
  3. Monitor Max gradient norm (before clip): sustained values above 5.0 across 50 or more consecutive steps suggest reducing the threshold or adding batch normalisation.
  4. For Transformers on NLP tasks, clip thresholds of 1.0–2.0 combined with learning-rate warmup schedules help prevent instability early in training.