Visualize the learning rate schedules used to train neural networks. Switch between step decay, exponential decay and cosine annealing, move the initial learning rate and decay parameter, and watch in real time how the rate changes per epoch and how the stride of a ball descending a loss valley shrinks with it.
Parameters
Schedule type
Three representative ways to decay the learning rate
Initial learning rate η₀
Learning rate at the start of training (the stride reference)
Total epochs T
Number of epochs for the whole run
Decay parameter
Step: decay rate γ / Exp: decay coefficient k / Cosine: affects the curve shape
Current epoch t
Shows the learning rate and progress at this point
Results
Current learning rate η(t)
Initial learning rate η₀
Final learning rate η(T)
Average learning rate
Current / initial ratio
Schedule type
Ball descending the loss valley — stride proportional to the LR
As the epoch sweeps 0 → T, the learning rate drops and the ball's stride shrinks with it: big bouncy steps early, tiny careful steps in the valley late. The curve along the top is the learning rate over time.
Step decay: in its standard form η(t) = η₀·γ^⌊t/s⌋, so the learning rate drops to γ times its value in a discrete jump every s epochs. Exponential decay: η(t) = η₀·e^(−kt), a continuous, smooth decrease governed by the decay coefficient k. Here t is the current epoch and η₀ is the initial learning rate.
Cosine annealing: in its standard form η(t) = η_min + ½(η₀ − η_min)(1 + cos(πt/T)), a smooth descent from η₀ to a minimum η_min following a half-cosine, where T is the total number of epochs. Every schedule trades early speed for late precision.
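The three schedules described above can be written as small functions. This is a minimal sketch that assumes the textbook forms of each schedule; the function and parameter names are illustrative, and the simulator's own implementation (in particular how its decay parameter shapes the cosine curve) may differ.

```python
import math

def step_decay(t, lr0, gamma, s):
    """Multiply the rate by gamma once every s epochs (t and s are integers)."""
    return lr0 * gamma ** (t // s)

def exponential_decay(t, lr0, k):
    """Smooth decay lr0 * exp(-k * t) with decay coefficient k."""
    return lr0 * math.exp(-k * t)

def cosine_annealing(t, lr0, total, lr_min=0.0):
    """Half-cosine from lr0 at t = 0 down to lr_min at t = total."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + math.cos(math.pi * t / total))

# Print the three rates at a few epochs of a 100-epoch run.
for epoch in (0, 25, 50, 75, 100):
    print(epoch,
          step_decay(epoch, 0.1, 0.1, 30),
          exponential_decay(epoch, 0.1, 0.05),
          cosine_annealing(epoch, 0.1, 100, lr_min=0.001))
```

Passing each function the same η₀ and plotting the printed values should reproduce the staircase, the smooth exponential and the half-cosine shapes.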
What is the Learning Rate Schedule Simulator?
🙋
In deep learning the "learning rate" is the value that controls how big a weight update is, right? Why would you have a "schedule" that changes it during training?
🎓
Good question. The learning rate decides "how big a stride you take down the loss slope". The catch is that the optimal stride is completely different at the start and at the end of training. Early on the loss is large, so you want to take big strides and descend fast. But once you are near the valley floor, big strides keep overshooting the minimum and you bounce back and forth. So you shrink the learning rate as training proceeds — that is the schedule.
🙋
I see — on the canvas top-left the ball is descending a valley. It bounces in big arcs at first, and near the end it only inches around the valley floor.
🎓
That is exactly the effect of a learning rate schedule. The size of each step that ball takes is proportional to the learning rate at that moment. The big early strides carry it near the valley fast, and the small late strides let it settle right into the bottom. If you kept the learning rate large forever, the ball would bounce around the valley floor and never settle. If you kept it small forever, training would end before it even reached the valley.
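The ball picture can be reproduced numerically. The sketch below assumes a toy valley f(x) = x² (not the simulator's actual loss surface) and compares a rate that stays large, a rate that stays small, and a rate that decays over time; the specific values are chosen only to make the three behaviours visible.

```python
def descend(lr_at, x0=4.0, steps=50):
    """Gradient descent on the toy valley f(x) = x**2, whose gradient is 2 * x."""
    x = x0
    for t in range(steps):
        x -= lr_at(t) * 2.0 * x   # stride is proportional to the current rate
    return x

always_large = descend(lambda t: 1.0)            # jumps across the valley every step
always_small = descend(lambda t: 0.005)          # creeps and runs out of steps
scheduled = descend(lambda t: 1.0 * 0.9 ** t)    # big strides first, small ones later

print(abs(always_large), abs(always_small), abs(scheduled))
```

With the large constant rate the ball ends as far from the floor as it started, with the small constant rate it is still well up the slope, and with the decaying rate it ends very close to the minimum.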
🙋
There are three ways to decay it. With step decay the learning rate graph is a jagged staircase, while exponential and cosine are smooth. How do you choose between them?
🎓
Roughly speaking, step decay drops the rate all at once at fixed moments — like "0.1x every 30 epochs" — and it has long been the classic choice: easy to reason about and simple to implement. Exponential decay shrinks continuously with no jumps. Cosine annealing follows a half-cosine curve, and because it eases gently into the minimum near the end it has become almost standard in modern image-recognition models. Line all three up on the "Schedule comparison" chart and the differences in character are obvious.
🙋
When I raise the decay parameter, every schedule drops faster. Is bigger always better?
🎓
No — too large causes problems. Raising the decay parameter makes the learning rate small sooner, so in the second half of training the stride is almost zero and you leave loss on the table that you could still have removed. Too small, and the learning rate stays large until the end, so the ball oscillates at the valley floor and never converges. In practice people tune it so the "final learning rate" lands at roughly 1-10% of the initial value. Watch the final-learning-rate card and look for the right amount of decay.
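The rule of thumb about the final learning rate can be turned around: pick the target ratio first and solve for the decay parameter. A minimal sketch, assuming the exponential form η₀·e^(−kt) and a step decay with a known number of drops; the helper names are hypothetical.

```python
import math

def exp_coefficient_for_ratio(ratio, total_epochs):
    """Return k such that exp(-k * total_epochs) equals the target final/initial ratio."""
    return -math.log(ratio) / total_epochs

def step_gamma_for_ratio(ratio, num_drops):
    """Return gamma such that gamma ** num_drops equals the target final/initial ratio."""
    return ratio ** (1.0 / num_drops)

# Aim for a final rate of 1% of the initial rate.
k = exp_coefficient_for_ratio(0.01, total_epochs=100)
gamma = step_gamma_for_ratio(0.01, num_drops=2)
print(k, gamma)
```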
🙋
If you just pick the optimal learning rate once at the start, you wouldn't need a schedule at all, would you?
🎓
If only it were that easy. A fixed learning rate has a fundamental weakness: the value that is optimal early and the value that is optimal late cannot both be satisfied. A single number always sacrifices one of them. A schedule resolves that contradiction by switching the optimal value over time. That is why in modern deep learning practically every training run uses some kind of schedule.
Frequently Asked Questions
If the learning rate is kept constant from start to finish, the value that is optimal early on is too large later: the loss oscillates around the minimum and never settles. Conversely, the small value that is optimal late on makes early progress painfully slow. A learning rate schedule uses a large learning rate early to drop the loss quickly, then shrinks it as training proceeds so the model converges precisely to the minimum, getting both fast convergence and a good final accuracy.
Step decay multiplies the learning rate by a factor γ in discrete jumps every fixed number of epochs (for example 0.1x every 30 epochs). It is simple to implement and easy to reason about, but the loss curve shows a visible step at each drop. Exponential decay uses η=η₀·e^(−kt) to shrink the rate smoothly with no abrupt jumps. Cosine annealing follows a half-cosine curve from the initial value down to a minimum, easing slowly into the minimum near the end, and is widely used in modern image-recognition models.
The initial learning rate depends on the model, optimizer and batch size: too large and the loss diverges (NaN), too small and training barely moves. In practice, a learning rate range test gradually raises the rate while watching where the loss falls the fastest, and a value just below that region is chosen as the starting point. Around 0.1 is common for SGD and around 0.001 for Adam, after which a schedule decays it.
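The idea behind a learning rate range test can be sketched on a toy problem. The code below assumes a one-dimensional quadratic loss as a stand-in for a real model, ramps the rate geometrically, and reports the rate at which a single step reduced the loss the most. Real range tests run on mini-batches and usually smooth the loss curve, so treat this only as an illustration of the procedure.

```python
def lr_range_test(curvature=10.0, lr_lo=1e-4, lr_hi=1.0, steps=100, w0=1.0):
    """Toy range test on loss(w) = 0.5 * curvature * w**2 (an assumed stand-in model)."""
    def loss(w):
        return 0.5 * curvature * w * w

    growth = (lr_hi / lr_lo) ** (1.0 / (steps - 1))   # geometric ramp factor
    w, lr = w0, lr_lo
    prev = loss(w)
    best_lr, best_drop = lr_lo, 0.0
    for _ in range(steps):
        w -= lr * curvature * w            # one gradient step at the current rate
        cur = loss(w)
        if prev - cur > best_drop:         # largest single-step loss decrease so far
            best_lr, best_drop = lr, prev - cur
        prev = cur
        lr *= growth                       # raise the rate for the next step
    return best_lr

fastest = lr_range_test()
print(fastest)
```

Following the advice above, the starting learning rate would then be chosen somewhat below the reported value rather than exactly at it.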
Warmup ramps the learning rate linearly from near zero up to the target value over the first few epochs, preventing divergence from unstable early gradients. It is almost mandatory for large batch sizes and Transformer training. Warm restarts (SGDR) repeat cosine annealing several times and reset the learning rate to a large value at the start of each cycle. This makes it easier to escape into a different valley of the loss landscape and gives an ensemble-like effect by averaging several solutions.
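Warm restarts are easy to express on top of cosine annealing. This is a simplified sketch that assumes every cycle has the same length; SGDR as usually described can also lengthen each successive cycle, which is omitted here. The function name is illustrative.

```python
import math

def cosine_warm_restarts(t, lr_max, cycle_len, lr_min=0.0):
    """Cosine annealing that jumps back to lr_max every cycle_len epochs."""
    t_cur = t % cycle_len   # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t_cur / cycle_len))

# Three cycles of 10 epochs each: the rate resets at epochs 10 and 20.
rates = [cosine_warm_restarts(t, lr_max=0.1, cycle_len=10) for t in range(30)]
print(rates[0], rates[9], rates[10])
```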
Real-World Applications
Training image-recognition models: When convolutional neural networks such as ResNet or EfficientNet are trained on ImageNet, the learning rate schedule is one of the most important hyperparameters for final accuracy. Step decay (0.1x at epochs 30, 60, 90) was the classic choice, but cosine annealing is now mainstream and frequently improves the reported top-1 accuracy for the same model and the same number of epochs.
Pre-training large language models: Pre-training Transformer-based language models almost always relies on a schedule with warmup. The learning rate ramps linearly from zero to the target value over the first few thousand steps, then decays slowly with inverse-square-root or cosine decay. Skip the warmup and the large early gradients make the loss diverge, so training never even gets started.
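A warmup followed by inverse-square-root decay can be sketched as a function of the optimizer step. The form below is one common way to join the two phases so that the rate is continuous at the end of warmup; the names and the example values are assumptions, not taken from any particular model's recipe.

```python
def warmup_inverse_sqrt(step, peak_lr, warmup_steps):
    """Linear ramp to peak_lr over warmup_steps, then decay proportional to 1 / sqrt(step)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

# Hypothetical run: peak rate 1e-3 reached after 4000 steps.
for step in (0, 2000, 4000, 16000):
    print(step, warmup_inverse_sqrt(step, peak_lr=1e-3, warmup_steps=4000))
```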
Transfer learning and fine-tuning: When a pre-trained model is adapted to a new task, fine-tuning uses an initial learning rate much smaller than the one used for pre-training and decays it quickly over a short schedule. If the learning rate is too large, the good pre-trained weights get destroyed — the "catastrophic forgetting" problem. This is often combined with assigning different learning rates to different layers.
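Assigning different learning rates to different layers is often done with a geometric rule: the layer nearest the output keeps the base rate and each layer below it gets a smaller one. The sketch below assumes that rule; the decay factor of 0.8 and the function name are illustrative choices.

```python
def layerwise_learning_rates(base_lr, num_layers, decay=0.8):
    """Return one rate per layer, index 0 being the layer closest to the input.

    The last layer gets base_lr; each layer below it gets `decay` times
    the rate of the layer above.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

layer_rates = layerwise_learning_rates(base_lr=1e-4, num_layers=4, decay=0.5)
print(layer_rates)
```

The early layers, which hold the most general pre-trained features, therefore move the least.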
Pre-study before hyperparameter search: Visualizing the learning rate trajectory in advance with a tool like this one lets you sanity-check, before spending GPU hours, whether the final learning rate has shrunk too far or the decay is too fast. The shape of a schedule should be checked on paper before consuming large amounts of training time, and it also serves as a sanity check against the actual loss curve.
Common Misconceptions and Pitfalls
The first big misconception is "a smaller learning rate is always safer and better". It is true that a smaller learning rate lowers the risk of the loss diverging, but using a learning rate that is too small from the very start makes training extremely slow and the loss never drops far enough within the fixed epoch budget. Worse, small learning rates tend to get stuck in shallow local minima of the loss landscape and generalization (accuracy on unseen data) suffers as well. Be bold with a large learning rate early and let the schedule shrink it gradually — that order matters.
Next, a pitfall rather than a misconception: a schedule should be thought of in steps (iterations), not epochs. This tool displays things per epoch for clarity, but most real frameworks update the learning rate every iteration (every mini-batch). Changing the batch size changes the number of steps per epoch, so the same "0.1x at 30 epochs" lands the decay at a different real moment for a different batch size. When designing a schedule it is safer to think in terms of the total number of steps.
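The dependence on batch size is easy to see by converting epoch milestones into optimizer steps. A minimal sketch, assuming the final partial batch of each epoch is kept; the dataset size and batch sizes are made-up example values.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of mini-batch updates in one epoch, counting a final partial batch."""
    return math.ceil(num_samples / batch_size)

def milestone_steps(epoch_milestones, num_samples, batch_size):
    """Convert decay points given in epochs into optimizer-step counts."""
    per_epoch = steps_per_epoch(num_samples, batch_size)
    return [e * per_epoch for e in epoch_milestones]

# The same "decay at epochs 30, 60, 90" rule for two batch sizes.
small_batch = milestone_steps([30, 60, 90], num_samples=50_000, batch_size=128)
large_batch = milestone_steps([30, 60, 90], num_samples=50_000, batch_size=256)
print(small_batch, large_batch)
```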
Finally, the misconception that "as long as the schedule is good, the initial learning rate can be anything". A schedule only decays from the initial learning rate as its starting point, so a bad starting point cannot be rescued by any schedule. Too large and the loss becomes NaN in the first epoch; too small and the stride is never enough even at the end. In practice, first find a suitable starting value with a learning rate range test, and only then tune the shape and amount of decay. The initial value and the schedule must be designed together.
How to Use
Set the initial learning rate η₀ in the lr0Num field (typically around 0.001 for Adam and around 0.1 for SGD)
Enter total training epochs in totalEpochsNum (e.g., 100, 200, 500 for ResNet or BERT fine-tuning)
Configure decay parameter: step size for step decay (every N epochs), decay rate for exponential (0.1–0.99), or T_max for cosine annealing
Adjust curEpochNum slider to visualize η(t) at any epoch during training
Compare Current/Initial ratio to track relative learning rate reduction across your schedule type
Worked Example
Training a ResNet-50 on ImageNet with initial η₀=0.1 over 120 epochs using step decay (γ=0.1, decay parameter=30): At epoch 0, η(t)=0.1000 and ratio=1.00. At epoch 30, η(t)=0.01 (ratio=0.10). At epoch 60, η(t)=0.001 (ratio=0.01). From epoch 90 through the final epoch of training (119), η(t)=0.0001 (ratio=0.001), and the learning rate averaged over epochs 0 to 119 is ≈0.0278. For comparison, exponential decay that multiplies the rate by 0.95 each epoch (k≈0.051) over the same setup descends smoothly: η(90)≈0.00099 vs step decay's 0.0001.
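The arithmetic of this example can be recomputed with a short script. The sketch assumes step decay of the form η₀·γ^⌊t/s⌋ with γ = 0.1 and s = 30, and it assumes the average is taken over the epochs actually trained (0 to T−1); the simulator may define its cards slightly differently. The dictionary keys are illustrative names.

```python
def step_decay(t, lr0, gamma, s):
    """Multiply the rate by gamma once every s epochs."""
    return lr0 * gamma ** (t // s)

def summarize(schedule, total_epochs, current_epoch):
    """Recompute the result cards for a schedule given as a function of the epoch."""
    rates = [schedule(t) for t in range(total_epochs)]   # epochs 0 .. T-1
    current = schedule(current_epoch)
    return {
        "current": current,
        "initial": rates[0],
        "last_trained": rates[-1],
        "average": sum(rates) / len(rates),
        "ratio": current / rates[0],
    }

cards = summarize(lambda t: step_decay(t, 0.1, 0.1, 30), total_epochs=120, current_epoch=60)
exp_at_90 = 0.1 * 0.95 ** 90   # exponential comparison: factor 0.95 per epoch
print(cards, exp_at_90)
```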
Practical Notes
Cosine annealing (T_max=120) from η₀=0.01 with η_min=0.0001 produces η(60)≈0.005 (about half at the midpoint) and η(120)=0.0001, a good fit for convergence without abrupt drops when training with SGD or AdamW
Step decay suits transfer learning: reduce by 0.1× every 10–15 epochs for BERT or Vision Transformer fine-tuning on small datasets (10k–100k samples)
Exponential decay (rate=0.99) applies well to distributed training; verify that the final η(T) stays ≥1e-6 so that weight updates do not become too small to register in 32-bit float precision
Monitor Average learning rate to validate total optimization work; if <0.001 for 100+ epochs, training may converge prematurely