Learning Rate Schedule Simulator

Visualize the learning rate schedules used to train neural networks. Switch between step decay, exponential decay, and cosine annealing, adjust the initial learning rate and decay parameter, and watch in real time how the rate changes per epoch and how the stride of a ball descending a loss valley shrinks with it.

Parameters

Schedule type

Three representative ways to decay the learning rate

Initial learning rate η₀

Learning rate at the start of training (the stride reference)

Total epochs T

Number of epochs for the whole run

Decay parameter

Step: decay rate γ / Exp: decay coefficient k / Cosine: affects the curve shape

Current epoch t

Shows the learning rate and progress at this point

Results

Current learning rate η(t)

Initial learning rate η₀

Final learning rate η(T)

Average learning rate

Current / initial ratio

Schedule type

Ball descending the loss valley — stride proportional to the LR

As the epoch sweeps 0 → T, the learning rate drops and the ball's stride shrinks with it: big bouncy steps early, tiny careful steps in the valley late. The curve along the top is the learning rate over time.

Learning rate vs epoch

Schedule comparison (step, exponential, cosine)

Theory & Key Formulas

$$\text{step: }\eta=\eta_0\,\gamma^{\lfloor t/s\rfloor}\qquad \text{exp: }\eta=\eta_0\,e^{-kt}$$

Step decay: the learning rate drops to γ times its value in a discrete jump every s epochs. Exponential decay: a continuous, smooth decrease governed by the decay coefficient k. t: current epoch, η₀: initial learning rate.
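The two formulas above translate almost directly into code. The following is a minimal Python sketch, not the simulator's own implementation; the function names and the sample values (η₀ = 0.1, γ = 0.1, s = 30, k = 0.05) are assumptions chosen for illustration.

```python
import math

def step_decay(t: int, eta0: float, gamma: float, s: int) -> float:
    """eta0 * gamma ** floor(t / s): multiply by gamma once every s epochs."""
    return eta0 * gamma ** (t // s)

def exp_decay(t: float, eta0: float, k: float) -> float:
    """eta0 * exp(-k * t): smooth decay governed by the coefficient k."""
    return eta0 * math.exp(-k * t)

# Illustrative values only (not taken from the simulator).
for t in (0, 29, 30, 60):
    print(t, step_decay(t, 0.1, 0.1, 30), exp_decay(t, 0.1, 0.05))
```

The step schedule stays flat through epoch 29 and drops at epoch 30, while the exponential schedule has already been shrinking since epoch 0.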

$$\text{cosine: }\eta=\eta_{min}+\tfrac{1}{2}(\eta_0-\eta_{min})\!\left(1+\cos\frac{\pi t}{T}\right)$$

Cosine annealing: a smooth descent from η₀ to a minimum η_min following a half-cosine. T: total epochs. Every schedule trades early speed for late precision.
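The cosine formula can be checked the same way. This is a small Python sketch under assumed values (η₀ = 0.1, η_min = 0.001, T = 100); the function name is hypothetical and the code is not the simulator's own.

```python
import math

def cosine_annealing(t: float, eta0: float, eta_min: float, T: int) -> float:
    """eta_min + 0.5 * (eta0 - eta_min) * (1 + cos(pi * t / T))."""
    return eta_min + 0.5 * (eta0 - eta_min) * (1.0 + math.cos(math.pi * t / T))

# Illustrative values only.
eta0, eta_min, T = 0.1, 0.001, 100
for t in (0, 25, 50, 75, 100):
    print(t, round(cosine_annealing(t, eta0, eta_min, T), 5))
```

At t = 0 the expression returns η₀, at t = T/2 it returns the midpoint between η₀ and η_min, and at t = T it returns η_min, which is the half-cosine shape described above.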

What is the Learning Rate Schedule Simulator?

🙋

In deep learning the "learning rate" is the value that controls how big a weight update is, right? Why would you have a "schedule" that changes it during training?

🎓

Good question. The learning rate decides "how big a stride you take down the loss slope". The catch is that the optimal stride is completely different at the start and at the end of training. Early on the loss is large, so you want to take big strides and descend fast. But once you are near the valley floor, big strides keep overshooting the minimum and you bounce back and forth. So you shrink the learning rate as training proceeds — that is the schedule.

🙋

I see — on the canvas top-left the ball is descending a valley. It bounces in big arcs at first, and near the end it only inches around the valley floor.

🎓

That is exactly the effect of a learning rate schedule. The size of each step that ball takes is proportional to the learning rate at that moment. The big early strides carry it near the valley fast, and the small late strides let it settle right into the bottom. If you kept the learning rate large forever, the ball would bounce around the valley floor and never settle. If you kept it small forever, training would end before it even reached the valley.

🙋

There are three ways to decay it. With step decay the learning rate graph is a jagged staircase, while exponential and cosine are smooth. How do you choose between them?

🎓

Roughly speaking, step decay drops the rate all at once at fixed moments — like "0.1x every 30 epochs" — and it has long been the classic choice: easy to reason about and simple to implement. Exponential decay shrinks continuously with no jumps. Cosine annealing follows a half-cosine curve, and because it eases gently into the minimum near the end it has become almost standard in modern image-recognition models. Line all three up on the "Schedule comparison" chart and the differences in character are obvious.

🙋

When I raise the decay parameter, every schedule drops faster. Is bigger always better?

🎓

No — too large causes problems. Raising the decay parameter makes the learning rate small sooner, so in the second half of training the stride is almost zero and you leave loss on the table that you could still have removed. Too small, and the learning rate stays large until the end, so the ball oscillates at the valley floor and never converges. In practice people tune it so the "final learning rate" lands at roughly 1-10% of the initial value. Watch the final-learning-rate card and look for the right amount of decay.
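The rule of thumb about the final learning rate can be turned around: pick the target ratio η(T)/η₀ first and solve for the decay parameter. The sketch below does this for the step and exponential formulas; the function names and the example run (T = 100, s = 30, target ratio 1%) are assumptions for illustration.

```python
import math

def exp_k_for_ratio(ratio: float, T: int) -> float:
    """Solve exp(-k * T) = ratio for k."""
    return -math.log(ratio) / T

def step_gamma_for_ratio(ratio: float, T: int, s: int) -> float:
    """Solve gamma ** floor(T / s) = ratio for gamma."""
    drops = T // s
    if drops == 0:
        raise ValueError("no drop happens within T epochs; choose s <= T")
    return ratio ** (1.0 / drops)

# Hypothetical run: the final rate should be 1% of the initial rate.
T, s, ratio = 100, 30, 0.01
k = exp_k_for_ratio(ratio, T)
gamma = step_gamma_for_ratio(ratio, T, s)
print(f"k = {k:.4f}, gamma = {gamma:.4f}")
```

With these values the step schedule drops three times within the run, so γ is the cube root of the target ratio.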

🙋

If you just pick the optimal learning rate once at the start, you wouldn't need a schedule at all, would you?

🎓

If only it were that easy. A fixed learning rate has a fundamental weakness: no single value is optimal both early and late in training, so a constant always sacrifices one of them. A schedule resolves that conflict by changing the value over time. That is why most modern deep learning training runs use some kind of schedule.
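The ball picture from this conversation can be reproduced in a few lines. The sketch below is a deliberately simplified model, not the simulator's code: the loss is assumed to be |x|, so the slope is always ±1 and each stride equals the learning rate exactly. The starting point 10.3, the 100 epochs, and the three rates are arbitrary choices.

```python
import math

def descend(lr_at, x0: float = 10.3, epochs: int = 100) -> float:
    """Walk down loss(x) = |x| with stride lr_at(t); return the final distance to the minimum."""
    x = x0
    for t in range(epochs):
        grad = (x > 0) - (x < 0)   # slope of |x|: +1, -1, or 0 exactly at the bottom
        x -= lr_at(t) * grad       # stride equals the learning rate at epoch t
    return abs(x)

big = descend(lambda t: 1.0)                            # constant, large
small = descend(lambda t: 0.01)                         # constant, small
decayed = descend(lambda t: 1.0 * math.exp(-0.05 * t))  # exponential schedule

print(f"constant 1.0 : {big:.3f}")
print(f"constant 0.01: {small:.3f}")
print(f"exp schedule : {decayed:.4f}")
```

In this toy setting the large constant rate reaches the floor quickly but keeps hopping from one side to the other, the small constant rate is still more than 9 units away when the epochs run out, and the decaying rate ends within 0.01 of the minimum.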

Frequently Asked Questions

Why is a learning rate schedule needed?

If the learning rate is kept constant from start to finish, the value that is optimal early on is too large later: the loss oscillates around the minimum and never settles. Conversely, the small value that is optimal late on makes early progress painfully slow. A learning rate schedule uses a large learning rate early to drop the loss quickly, then shrinks it as training proceeds so the model settles close to the minimum, which gives both fast convergence and good final accuracy.

What is the difference between step decay, exponential decay, and cosine annealing?

Step decay multiplies the learning rate by a factor γ in discrete jumps every fixed number of epochs (for example 0.1x every 30 epochs). It is simple to implement and easy to reason about, but the loss curve shows a visible step at each drop. Exponential decay uses η=η₀·e^(−kt) to shrink the rate smoothly with no abrupt jumps. Cosine annealing follows a half-cosine curve from the initial value down to a minimum, easing slowly into the minimum near the end, and is widely used in modern image-recognition models.

How should the initial learning rate be chosen?

The initial learning rate depends on the model, optimizer, and batch size: too large and the loss diverges (often showing up as NaN), too small and training barely moves. In practice, a learning rate range test raises the rate gradually over a short run while recording the loss, and a value in or just below the region where the loss falls fastest, well below the point where the loss starts to rise, is chosen as the starting point. Around 0.1 is a common starting value for SGD and around 0.001 for Adam; a schedule then decays the rate from there.

What are warmup and warm restarts?

Warmup ramps the learning rate linearly from near zero up to the target value over the first few epochs, which helps prevent divergence from unstable early gradients. It is close to standard practice for large batch sizes and Transformer training. Warm restarts (SGDR) repeat cosine annealing several times and reset the learning rate to a large value at the start of each cycle. This makes it easier to escape into a different valley of the loss landscape, and the models saved at the end of each cycle can be combined for an ensemble-like effect.
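Both ideas are small variations on the cosine formula. The following Python sketch shows one possible form of each; the function names, the `(t + 1) / warmup` ramp, and the fixed cycle length are simplifying assumptions (SGDR as published also allows cycles that grow in length).

```python
import math

def warmup_cosine(t: int, eta_max: float, eta_min: float, warmup: int, T: int) -> float:
    """Linear warmup for `warmup` epochs, then a single cosine decay to eta_min at epoch T."""
    if t < warmup:
        return eta_max * (t + 1) / warmup          # reaches eta_max at t = warmup - 1
    progress = (t - warmup) / max(1, T - warmup)   # 0 right after warmup, 1 at epoch T
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))

def warm_restarts(t: int, eta_max: float, eta_min: float, cycle: int) -> float:
    """Cosine annealing that jumps back to eta_max every `cycle` epochs."""
    t_cur = t % cycle                              # position inside the current cycle
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / cycle))

# Illustrative values only.
print([round(warmup_cosine(t, 0.1, 0.0, 5, 100), 4) for t in (0, 4, 50, 100)])
print([round(warm_restarts(t, 0.1, 0.001, 10), 4) for t in (0, 9, 10, 19, 20)])
```

The second printout shows the rate falling close to η_min at the end of each cycle and jumping back to the maximum at epochs 10 and 20.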

Real-World Applications

Training image-recognition models: When convolutional neural networks such as ResNet or EfficientNet are trained on ImageNet, the learning rate schedule is one of the most important hyperparameters for final accuracy. Step decay (0.1x at epochs 30, 60, 90) was the classic choice, but cosine annealing is now mainstream and frequently improves the reported top-1 accuracy for the same model and the same number of epochs.

Pre-training large language models: Pre-training Transformer-based language models almost always relies on a schedule with warmup. The learning rate ramps linearly from zero to the target value over the first few thousand steps, then decays slowly with inverse-square-root or cosine decay. Without warmup, large and unstable early gradients can make the loss diverge before training gets going.
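A warmup followed by inverse-square-root decay can be sketched per step as follows. This is one common way to write it, not a specific model's recipe; the function name, the peak rate, and the warmup length are assumptions.

```python
def warmup_inverse_sqrt(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear ramp to peak_lr over warmup_steps, then decay proportional to 1 / sqrt(step)."""
    step = max(step, 1)                               # avoid a zero rate and division by zero
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warmup
    return peak_lr * (warmup_steps / step) ** 0.5     # equals peak_lr at step == warmup_steps

# Illustrative values only: peak 1e-3, 4000 warmup steps.
for step in (1, 2000, 4000, 16000, 64000):
    print(step, warmup_inverse_sqrt(step, 1e-3, 4000))
```

The two branches meet at the end of warmup, and after that the rate halves each time the step count quadruples.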

Transfer learning and fine-tuning: When a pre-trained model is adapted to a new task, fine-tuning uses an initial learning rate much smaller than the one used for pre-training and decays it quickly over a short schedule. If the learning rate is too large, the good pre-trained weights get destroyed — the "catastrophic forgetting" problem. This is often combined with assigning different learning rates to different layers.
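One simple way to assign different learning rates to different layers is to shrink the rate geometrically toward the input side, so that early pre-trained layers change the least. The sketch below shows only that arithmetic; the function name, the base rate, and the decay factor are hypothetical, and real setups vary.

```python
def layerwise_lrs(base_lr: float, num_layers: int, decay: float) -> list[float]:
    """Return one learning rate per layer; index 0 is the layer closest to the input.

    The last layer gets base_lr, and each earlier layer is scaled by `decay` once more.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# Illustrative values only: 4 layers, base rate 1e-4, factor 0.5 per layer.
print(layerwise_lrs(1e-4, 4, 0.5))
```

Each per-layer rate would then be decayed over time by whichever schedule is in use.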

Checking a schedule before hyperparameter search: Plotting the learning rate trajectory in advance with a tool like this one shows, before any GPU hours are spent, whether the final learning rate shrinks too far or the decay is too fast. The same plot is also useful later as a reference when reading the actual loss curve.
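The same check can be scripted. The sketch below computes figures like the ones on the result cards for any schedule function; how the simulator itself defines the average is not stated, so the mean over integer epochs 0..T is an assumption, as are the function name and the example parameters.

```python
import math

def summarize(schedule, T: int, t: int) -> dict:
    """Summarize a schedule eta(epoch) over epochs 0..T, viewed from current epoch t."""
    rates = [schedule(e) for e in range(T + 1)]
    return {
        "current": rates[t],
        "initial": rates[0],
        "final": rates[T],
        "average": sum(rates) / len(rates),   # assumed definition: mean over integer epochs
        "ratio": rates[t] / rates[0],
    }

# Hypothetical run: exponential decay with eta0 = 0.1, k = 0.2, 100 epochs.
stats = summarize(lambda e: 0.1 * math.exp(-0.2 * e), T=100, t=50)
print(stats)
if stats["final"] < 0.01 * stats["initial"]:
    print("final rate is below 1% of the initial rate; the decay may be too fast")
```

With k = 0.2 the final rate is many orders of magnitude below the initial one, so the warning fires; lowering k brings the final rate back into a usable range.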

Common Misconceptions and Pitfalls

The first big misconception is "a smaller learning rate is always safer and better". A smaller learning rate does lower the risk of the loss diverging, but one that is too small from the very start makes training extremely slow, and the loss never drops far enough within the fixed epoch budget. Worse, a small learning rate can get stuck in shallow local minima of the loss landscape, so generalization (accuracy on unseen data) may suffer as well. Be bold with a large learning rate early and let the schedule shrink it gradually; that order matters.

Next is the habit of thinking about a schedule only in epochs, when it is better thought of in steps (iterations). This tool displays things per epoch for clarity, but many real frameworks update the learning rate every iteration (every mini-batch). Changing the batch size changes the number of steps per epoch, so the same "0.1x at 30 epochs" lands the decay after a different number of weight updates for a different batch size. When designing a schedule it is safer to think in terms of the total number of steps.
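A quick conversion makes the point concrete. In this sketch the dataset size of 50,000 samples and the two batch sizes are hypothetical, and the last partial batch of each epoch is assumed to count as a step.

```python
import math

def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of optimizer steps in one epoch, counting the last partial batch."""
    return math.ceil(num_samples / batch_size)

def epoch_to_step(epoch: int, num_samples: int, batch_size: int) -> int:
    """Global step at which the given epoch begins."""
    return epoch * steps_per_epoch(num_samples, batch_size)

# Hypothetical dataset of 50,000 samples; the decay is scheduled "at epoch 30".
for batch_size in (64, 256):
    print(batch_size, epoch_to_step(30, 50_000, batch_size))
```

With batch size 64 the drop at epoch 30 arrives after 23,460 weight updates, while with batch size 256 it arrives after only 5,880, even though the epoch-based description is identical.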

Finally, there is the misconception that "as long as the schedule is good, the initial learning rate can be anything". A schedule only decays the rate from its initial value, so a bad starting point cannot be rescued by any schedule. Too large and the loss can become NaN in the first epoch; too small and the stride is never large enough, even at the start. In practice, first find a suitable starting value with a learning rate range test, and only then tune the shape and amount of decay. The initial value and the schedule must be designed together.
