Visualize batch normalization, the technique that stabilizes the training of deep neural networks. Adjust the mini-batch mean and standard deviation, the scale gamma and the shift beta to watch an activation travel from its normalized value x-hat to the BN output y, and to see how the distribution changes from layer to layer.
Parameters
Batch mean muB
Mean of the raw activations in the mini-batch
Batch standard deviation sigmaB
Spread of the raw activations
Scale parameter gamma
Learnable. Sets the output standard deviation
Shift parameter beta
Learnable. Sets the output mean
Tracked activation x
A single sample value to follow through x-hat and y
Results
Normalized value x̂
BN output y
Normalized batch mean
Normalized batch std dev
Output batch mean
Output batch std dev
Distribution transform — raw batch → normalized → scale & shift
From left: the raw batch, the normalized batch (mean 0, std 1) and the output batch after the affine transform (mean beta, std gamma). Dots mark the tracked sample x → x-hat → y.
Normalized value: x̂ = (x − μB) / √(σB² + ε). BN output: y = γ·x̂ + β. The batch mean μB and variance σB² are computed over the whole mini-batch, and the scale γ and shift β are learned by backpropagation. ε is a small numerical-stability constant (1e-5 in this tool).
After normalization the batch has mean ≈ 0 and standard deviation ≈ 1. After the affine transform the batch has mean = β and standard deviation ≈ γ (exactly γ·σB/√(σB² + ε), which falls just short of γ because of ε).
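The two formulas can be checked on a small batch. The following is a minimal sketch in plain Python, not the simulator's own code: the function names `batch_norm` and `mean_std` and the sample batch are made up for illustration, and the variance is the biased (divide by n) estimate.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of activations with its own batch statistics,
    then apply the affine transform y = gamma * x_hat + beta."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # biased (population) variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    y = [gamma * v + beta for v in x_hat]
    return x_hat, y

def mean_std(vals):
    """Mean and population standard deviation of a list."""
    m = sum(vals) / len(vals)
    s = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
    return m, s

batch = [0.5, 3.0, 5.5, 1.0, 6.0, 2.0]  # made-up raw activations
x_hat, y = batch_norm(batch, gamma=2.0, beta=-1.0)

norm_mean, norm_std = mean_std(x_hat)  # close to 0 and 1
out_mean, out_std = mean_std(y)        # close to beta and gamma
print(norm_mean, norm_std)
print(out_mean, out_std)
```

The normalized standard deviation lands slightly below 1 because ε is added under the square root; the gap shrinks as the batch variance grows.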
What is the Batch Normalization Simulator?
🙋
"Batch normalization" comes up everywhere in neural-network tutorials. What does it actually do?
🎓
In plain terms, it keeps the scale of each layer's outputs under control. In a deep network, a layer's outputs (the activations) tend to drift larger or smaller as they pass through more layers. Batch normalization looks at the activations inside a mini-batch and subtracts and divides so they end up with mean 0 and standard deviation 1. Move the "Batch mean muB" slider on the left and you will see the raw bump shift left and right — BN is the step that pulls it back to zero each time.
🙋
Can just rescaling the numbers really change training that much?
🎓
It can. When the weights of an earlier layer update, the distribution of inputs the next layer receives shifts as well — this is called internal covariate shift. From the next layer's point of view, its target keeps moving every step, which makes learning hard. By pinning the distribution to mean 0 and standard deviation 1, BN smooths the loss landscape and keeps gradients from blowing up. That is why you can use a larger learning rate and converge faster.
🙋
I see. But if you always force mean 0 and std 1, doesn't that limit what each layer can express?
🎓
Great question — that is exactly what gamma (scale) and beta (shift) solve. After normalizing, BN applies one affine transform, y = gamma * x-hat + beta. gamma and beta are learned just like the network weights, so if normalization gets in the way, the network can learn gamma = sigmaB and beta = muB to recover the original distribution. Move gamma on the left and watch the output standard deviation change; move beta and watch the output mean change. gamma=1, beta=0 is exactly "pure normalization".
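As a quick check of the "undo the normalization" claim, the sketch below sets gamma to √(σB² + ε) and beta to μB and confirms that the output matches the raw batch. The helper name `batch_norm` and the data are illustrative assumptions; note that the exact recovery needs ε inside the square root, which is why gamma = sigmaB is only approximately right.

```python
import math

def batch_norm(xs, gamma, beta, eps=1e-5):
    """Batch-normalize a list with its own (biased) batch statistics."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

batch = [0.5, 3.0, 5.5, 1.0, 6.0, 2.0]  # made-up raw activations
eps = 1e-5
mu = sum(batch) / len(batch)
var = sum((x - mu) ** 2 for x in batch) / len(batch)

# gamma and beta chosen so that the affine step inverts the normalization
recovered = batch_norm(batch, gamma=math.sqrt(var + eps), beta=mu, eps=eps)
max_err = max(abs(a - b) for a, b in zip(batch, recovered))
print(max_err)  # tiny floating-point residue
```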
🙋
Training uses the batch mean. So what happens at inference, when I feed in just one image?
🎓
That is where people trip up in practice. A single sample gives no usable batch statistics: its mean is just the sample itself and its standard deviation is zero. So during training a running average of muB and sigmaB^2 is quietly accumulated, and at inference those fixed values are used. In PyTorch you switch with model.eval(); in TensorFlow with training=False. Forget it and run inference in training mode, and the same image can give a different result depending on which other samples share its batch, which makes for a baffling bug.
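The train/inference switch can be sketched with a toy class. `TinyBatchNorm` below is a hypothetical, single-feature stand-in and not any framework's implementation; it assumes the update rule running = (1 − momentum)·running + momentum·batch and, for simplicity, uses the biased batch variance for the running estimate.

```python
import math

class TinyBatchNorm:
    """Toy single-feature batch norm with running statistics (illustrative only)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, xs):
        if self.training:
            # Training mode: use this batch's statistics and update the running ones.
            n = len(xs)
            mu = sum(xs) / n
            var = sum((x - mu) ** 2 for x in xs) / n
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference mode: use the frozen running statistics.
            mu, var = self.running_mean, self.running_var
        return [self.gamma * (x - mu) / math.sqrt(var + self.eps) + self.beta
                for x in xs]

bn = TinyBatchNorm()
for _ in range(200):           # pretend training on the same made-up batch
    bn([1.0, 3.0, 5.0])

bn.training = False            # analogous to switching to inference mode
out1 = bn([4.0])               # a single sample is fine now
out2 = bn([4.0])
print(out1 == out2)            # same input, same output
```

With `training` left at `True`, the single-sample call would instead normalize 4.0 against itself and return beta.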
🙋
I also hear about "layer normalization" lately. Do you choose between it and batch normalization?
🎓
Yes, there is a family of normalization methods. Batch normalization computes statistics across the batch dimension (many samples), so a tiny batch makes those statistics unstable. Layer normalization instead normalizes across the feature dimension within a single sample, so it does not depend on batch size. That is why Transformers and RNNs, which handle variable-length sequences, mostly use layer normalization, while batch normalization is still the default for CNN image classification.
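The difference in normalization axis can be shown on a tiny matrix whose rows are samples and whose columns are features. This is a sketch under simplifying assumptions: the function names are invented here, and the learnable scale and shift are left out of both methods.

```python
import math

def normalize(vals, eps=1e-5):
    """Zero-mean, unit-variance normalization of one list of numbers."""
    mu = sum(vals) / len(vals)
    var = sum((v - mu) ** 2 for v in vals) / len(vals)
    return [(v - mu) / math.sqrt(var + eps) for v in vals]

def batch_norm_2d(x):
    """Normalize each feature (column) across the samples in the batch."""
    cols = [normalize(list(col)) for col in zip(*x)]
    return [list(row) for row in zip(*cols)]

def layer_norm_2d(x):
    """Normalize each sample (row) across its own features."""
    return [normalize(row) for row in x]

x = [[1.0, 2.0, 3.0],
     [4.0, 6.0, 8.0]]
single = [x[0]]  # the first sample on its own, i.e. batch size 1

ln_full = layer_norm_2d(x)[0]
ln_single = layer_norm_2d(single)[0]
bn_single = batch_norm_2d(single)[0]

print(ln_full == ln_single)  # layer norm ignores the rest of the batch
print(bn_single)             # batch statistics of one sample carry no information
```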
Frequently Asked Questions
To batch-normalize an activation x, first use the mini-batch mean muB and standard deviation sigmaB: x-hat = (x - muB) / sqrt(sigmaB^2 + epsilon). Epsilon is a small constant for numerical stability (1e-5 in this tool). Then apply the learnable affine transform with scale gamma and shift beta: y = gamma * x-hat + beta. For example, with muB=3, sigmaB=2.5, gamma=1, beta=0 and x=6, x-hat = (6-3)/sqrt(6.25+1e-5) ~ 1.200 and y = 1.200.
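The arithmetic in this example can be reproduced in a few lines. The function name `bn_tracked` is an assumption made for this sketch; it takes the batch statistics as given, the way the tool's sliders do.

```python
import math

def bn_tracked(x, mu_b, sigma_b, gamma, beta, eps=1e-5):
    """Follow one activation x through normalization and the affine transform."""
    x_hat = (x - mu_b) / math.sqrt(sigma_b ** 2 + eps)
    y = gamma * x_hat + beta
    return x_hat, y

x_hat, y = bn_tracked(x=6.0, mu_b=3.0, sigma_b=2.5, gamma=1.0, beta=0.0)
print(f"{x_hat:.3f} {y:.3f}")  # 1.200 1.200
```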
The distribution of each layer's activations shifts every time the preceding layer's parameters are updated (internal covariate shift). Batch normalization re-centres and re-scales each layer's output to mean 0 and standard deviation 1, so the input distribution seen by the next layer stays stable. This smooths the loss landscape and keeps gradients well-behaved, which allows a higher learning rate and faster convergence. It also reduces sensitivity to weight initialization and provides a mild regularizing effect.
Normalization alone forces activations to mean 0 and standard deviation 1, which restricts what a layer can represent. The learnable scale gamma and shift beta let the network undo the normalization when needed and recover any required mean and variance. After the affine transform the batch has mean = beta and standard deviation = gamma. gamma=1 and beta=0 is equivalent to pure normalization; gamma and beta are learned by backpropagation alongside the other weights.
During training the per-batch muB and sigmaB^2 are used, but at inference you often feed a single sample, so batch statistics become unstable. To handle this, a running mean and running variance of muB and sigmaB^2 are accumulated during training, and those fixed values are used at inference. This guarantees the same output for the same input. BN behaving differently in training versus inference mode is a common source of framework bugs.
Real-World Applications
CNNs for image classification: In convolutional networks such as ResNet and Inception, batch normalization is inserted right after almost every convolutional layer. A major reason networks more than 100 layers deep can train stably is that BN keeps the activation distribution of every layer in check (in ResNet it works together with the residual connections). Once BN became standard equipment, training became far less sensitive to weight initialization and to the choice of learning rate.
Fusing with the convolution (faster inference): At inference, batch normalization becomes a plain linear transform that uses the frozen statistics (mean and variance) and gamma and beta. Because it is linear, BN can be folded into the weights of the preceding convolution and fused into a single convolution operation (BN folding / fusion). This cuts inference-time compute and memory traffic and speeds up models on edge devices.
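The folding algebra can be sketched with a small dense layer standing in for the convolution, since the per-output-channel arithmetic is the same. All names and numbers below are illustrative assumptions: each output channel's weights are multiplied by s = gamma/√(var + ε), and the bias becomes (b − mean)·s + beta.

```python
import math

def linear(x, W, b):
    """Dense layer: one output per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def bn_inference(z, mean, var, gamma, beta, eps=1e-5):
    """Inference-mode BN with frozen per-channel statistics."""
    return [g * (zi - m) / math.sqrt(v + eps) + bt
            for zi, m, v, g, bt in zip(z, mean, var, gamma, beta)]

def fold_bn(W, b, mean, var, gamma, beta, eps=1e-5):
    """Fold frozen BN parameters into the preceding layer's weights and bias."""
    W_f, b_f = [], []
    for row, bi, m, v, g, bt in zip(W, b, mean, var, gamma, beta):
        s = g / math.sqrt(v + eps)        # per-channel scale
        W_f.append([w * s for w in row])
        b_f.append((bi - m) * s + bt)
    return W_f, b_f

# Made-up layer and BN parameters for two output channels
W = [[0.5, -1.0], [2.0, 0.25]]
b = [0.1, -0.3]
mean, var = [0.2, 1.0], [4.0, 0.5]
gamma, beta = [1.5, 0.8], [0.0, -0.2]
x = [1.0, 2.0]

two_step = bn_inference(linear(x, W, b), mean, var, gamma, beta)
W_f, b_f = fold_bn(W, b, mean, var, gamma, beta)
fused = linear(x, W_f, b_f)

max_diff = max(abs(a - c) for a, c in zip(two_step, fused))
print(max_diff)  # only floating-point rounding remains
```

The fused layer does one multiply-add pass instead of two, which is where the inference saving comes from.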
Transfer learning and fine-tuning: When a pre-trained model is reused on a different dataset, how the BN running statistics (running mean and variance) are treated affects the result. If the new domain's image statistics differ greatly from the original, strategies such as re-training only the BN layers or freezing BN in inference mode are used. Without understanding BN behaviour, it is easy to miss why a transfer-learning run fails to reach good accuracy.
Exploiting the regularizing effect: Because batch statistics vary slightly from mini-batch to mini-batch, batch normalization adds a small amount of noise to each sample and acts as a regularizer that curbs overfitting. Models with BN are known to need Dropout less, and whether to combine the two is decided by validation per task. BN contributes not only to training stability but also to generalization performance.
Common Misconceptions and Pitfalls
The biggest pitfall is running inference while still in training mode. Batch normalization uses mini-batch statistics during training and, at inference, the running averages accumulated during training. If you forget to switch to model.eval() (PyTorch) or training=False (Keras), inference still uses batch statistics, so the same input gives different outputs depending on what else is in the batch. With batch size 1 it is worse: for a fully connected layer the batch variance is exactly zero, so x-hat collapses to 0 and the output to beta whatever the input was. Conversely, leaving eval() on and resuming training is a frequent bug where the BN statistics stop updating.
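The batch-dependence described here is easy to reproduce. In the sketch below, `bn_train_mode` and `bn_eval_mode` are hypothetical helpers (not framework calls), and the fixed statistics passed to the second one are made-up stand-ins for running averages.

```python
import math

def bn_train_mode(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """BN using the statistics of the batch it is given."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

def bn_eval_mode(xs, running_mean, running_var, gamma=1.0, beta=0.0, eps=1e-5):
    """BN using fixed statistics, independent of the batch contents."""
    return [gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta
            for x in xs]

sample = 2.0

# Same sample, different batch companions, training-mode statistics
in_batch_a = bn_train_mode([sample, 0.0, 1.0])[0]
in_batch_b = bn_train_mode([sample, 8.0, 9.0])[0]

# Batch of one: zero variance, so the output collapses to beta
alone = bn_train_mode([sample])[0]

# Fixed statistics: the companions no longer matter
fixed_a = bn_eval_mode([sample, 0.0, 1.0], running_mean=3.0, running_var=4.0)[0]
fixed_b = bn_eval_mode([sample, 8.0, 9.0], running_mean=3.0, running_var=4.0)[0]

print(in_batch_a, in_batch_b)  # opposite signs for the same input
print(alone)                   # 0.0, which equals beta here
print(fixed_a == fixed_b)      # True
```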
Next, assuming BN works the same with a small batch size. Batch normalization estimates the mean and variance from the whole mini-batch, so with a batch as small as 2 to 4 the estimates become very unstable and can actually hurt training. In object detection and segmentation, where GPU memory limits batch size, group normalization or layer normalization is chosen to avoid this problem. This tool lets you set muB and sigmaB directly, but remember the real-BN premise: the smaller the batch, the less you can trust these estimates.
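How much the estimates wobble with batch size can be seen in a small simulation. This is a sketch under stated assumptions: activations are drawn from a standard normal distribution with a fixed seed, and the helper name `spread_of_batch_means` is invented for the example.

```python
import random
import statistics

def spread_of_batch_means(batch_size, trials=2000, seed=0):
    """Standard deviation of the batch-mean estimate over many random batches."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.fmean(batch))
    return statistics.pstdev(means)

spread_small = spread_of_batch_means(2)
spread_large = spread_of_batch_means(32)
print(spread_small, spread_large)  # the small-batch estimate scatters far more
```

The true mean is 0 in both cases, so any spread is pure estimation noise; in theory it scales as 1/√(batch size).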
Finally, "BN works because it removes internal covariate shift" is not the whole truth. The original paper attributed the benefit to reducing internal covariate shift, but later research made a strong case that the essence of BN is smoothing the loss landscape and making gradients more predictable. There are also implementation subtleties: a bias term placed right before BN is redundant, because the mean subtraction cancels it, and placing BN before versus after the activation function changes behaviour. BN is a convenient "drop it in and it helps" tool, but knowing that the reason it works is still an open research question keeps you from misreading your results.
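The redundancy of a bias right before BN follows from the mean subtraction: a constant added to every activation shifts the batch mean by the same amount and cancels. A minimal sketch with made-up values (the helper name and the bias 3.7 are assumptions):

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize a list with its own (biased) batch statistics."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

z = [0.5, 3.0, 5.5, 1.0, 6.0, 2.0]   # pre-activations without a bias
z_biased = [v + 3.7 for v in z]      # same pre-activations with a bias added

out_plain = batch_norm(z)
out_biased = batch_norm(z_biased)
max_diff = max(abs(a - b) for a, b in zip(out_plain, out_biased))
print(max_diff)  # only floating-point rounding remains
```

The shift beta already plays the role of the bias, which is why the preceding layer's bias is usually dropped.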
How to Use
Enter mini-batch statistics in the input fields: set mu (mean) between -5 and 5, sigma (standard deviation) between 0.1 and 3.0, gamma (scale) between 0.5 and 2.0, and beta (shift) between -2 and 2
The simulator normalizes the raw batch with a z-score that includes the stability constant: x̂ = (x − μ)/√(σ² + ε), so the normalized batch has mean ≈ 0 and standard deviation ≈ 1
Apply the learned affine parameters gamma and beta to produce the final output, y = γx̂ + β, then check the results: the normalized batch mean and standard deviation sit at 0 and 1, while the output batch mean and standard deviation follow β and γ
Worked Example
Consider a convolutional layer processing 32 image samples with batch mean μ = 2.3 and standard deviation σ = 1.8. Set gamma = 1.5 and beta = 0.2. After normalization, the batch mean becomes 0.0 and the standard deviation becomes approximately 1.0. The affine transform then moves the mean to 1.5(0) + 0.2 = 0.2 and scales the standard deviation to 1.5(1.0) = 1.5. Keeping every layer's activations in a known range like this is what stabilizes gradient flow and permits larger learning rates; how much faster training converges depends on the architecture and dataset.
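The output statistics in this example can also be written in closed form, without sampling a batch. The sketch below assumes the batch statistics are exact and uses a hypothetical helper `output_stats`; it shows that the raw mean drops out entirely and that the output standard deviation is γ·σ/√(σ² + ε).

```python
import math

def output_stats(sigma, gamma, beta, eps=1e-5):
    """Mean and standard deviation of the BN output for a batch with std sigma.
    The raw batch mean does not appear because normalization removes it."""
    out_mean = beta
    out_std = abs(gamma) * sigma / math.sqrt(sigma ** 2 + eps)
    return out_mean, out_std

out_mean, out_std = output_stats(sigma=1.8, gamma=1.5, beta=0.2)
print(f"{out_mean:.1f} {out_std:.4f}")  # 0.2 1.5000
```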
Practical Notes
Batch size matters: very small mini-batches (a handful of samples) produce unreliable batch statistics; batch sizes of roughly 32–256 are common when training ResNet and VGG architectures
Gamma and beta are learned parameters optimized during backpropagation; they are conventionally initialized to gamma = 1.0 and beta = 0.0, so that BN starts out as pure normalization
During inference, use the exponential moving average of the training batch statistics (momentum coefficient 0.1, the PyTorch default) rather than single-batch estimates to prevent prediction instability
Placing batch normalization before the activation function (BN-ReLU) is the arrangement used in the original paper and is common in classification networks; whether it beats post-activation normalization depends on the task, so compare both on validation data