Weight Initialization Simulator — Xavier & He

Parameters

Initialization scheme

How the weight variance Var(W) is set

Activation function

Per-layer non-linear transform

Number of layers

Fully-connected layers the signal passes through

Units per layer n

Layer width (the fan-in)

Input signal std

Std of the activation entering layer 1

Results

Reported outputs: Initialization, Final-layer activation std, Change vs input ratio, Signal state, Recommended init, Match verdict.

Signal propagation — deep network view

Layers run left to right. Each node's brightness and size encode the activation std of that layer, and a signal pulse travels through them. A vanishing signal darkens toward the right; a diverging one swells blindingly bright.

Activation variability vs layer

Final-layer std per initialization scheme

Theory & Key Formulas

$$\text{Xavier: }\mathrm{Var}(W)=\frac{1}{n_{in}},\qquad \text{He: }\mathrm{Var}(W)=\frac{2}{n_{in}}$$

How the weight variance is set. Here n_in is the fan-in (the number of input units of the layer). He is designed for ReLU; Xavier is designed for tanh and is also the usual choice for sigmoid.

$$\mathrm{Var}(z_\ell)=n\cdot\mathrm{Var}(W)\cdot\mathrm{Var}(a_{\ell-1})$$

Variance of the pre-activation z after the linear layer, where z is the sum of the n inputs aᵢ weighted by wᵢ. This assumes independent, zero-mean weights and inputs.

$$\mathrm{Var}(a_\ell)=g\cdot\mathrm{Var}(z_\ell)$$

Variance after the activation. ReLU discards about half the variance (g≈0.5), so He doubles Var(W) to compensate, keeping the variance constant across layers.
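The two recurrences above can be chained into a small calculator. The following is a minimal sketch of that model, not the tool's actual code: the function names, the scheme labels, and the width n=256 are illustrative assumptions, and the gain g=1.0 for a "linear" activation is assumed for comparison only (the text gives g≈0.5 for ReLU).

```python
import math

# Activation gain g in Var(a) = g * Var(z).
# g = 0.5 for ReLU comes from the text; g = 1.0 ("linear") is an assumption
# standing in for an activation that passes the variance through unchanged.
GAIN = {"relu": 0.5, "linear": 1.0}


def weight_variance(scheme, n):
    """Return Var(W) for a scheme, given fan-in n (scheme labels are hypothetical)."""
    table = {
        "xavier": 1.0 / n,
        "he": 2.0 / n,
        "small_random": 0.01 ** 2,  # fixed std = 0.01
        "large_random": 1.0 ** 2,   # fixed std = 1.0
        "zero": 0.0,
    }
    return table[scheme]


def activation_std_per_layer(scheme, activation, layers, n, input_std=1.0):
    """Apply Var(a_l) = g * n * Var(W) * Var(a_{l-1}) and return the std after each layer."""
    var = input_std ** 2
    factor = GAIN[activation] * n * weight_variance(scheme, n)
    stds = []
    for _ in range(layers):
        var *= factor
        stds.append(math.sqrt(var))
    return stds


he = activation_std_per_layer("he", "relu", layers=12, n=256)
xavier = activation_std_per_layer("xavier", "relu", layers=12, n=256)
large = activation_std_per_layer("large_random", "relu", layers=12, n=256)
print("He:", he[-1], "Xavier:", xavier[-1], "Large random:", large[-1])
```

Under this model the per-layer factor is g·n·Var(W), so He with ReLU gives a factor of exactly one, while any other combination compounds geometrically with depth.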

What is Weight Initialization?

🙋

Weight initialization just means setting the weight values before training starts, right? Surely any small random number will do?

🎓

Here's the thing: in a deep network "any small random number" fails spectacularly. Each layer multiplies the previous layer's output by weights and sums it, so the "spread" (variance) of the signal gets multiplied by some factor at every layer. If the weights are too small, that factor is below one, the variance shrinks step after step, and by the time you've passed twenty layers the signal is nearly zero. Then the gradient reaching the very first layer is nearly zero too, and training stops dead. That's the vanishing gradient.

🙋

Oh — so I should just make the weights bigger? I picked "Large random (std=1.0)" on the left and the final-layer std became an absurd number...

🎓

Right, and that's the failure in the opposite direction: the exploding gradient. Now the variance is multiplied by a factor far larger than one at every layer (with std=1.0 the factor grows with the layer width n), so it diverges exponentially. With thirty layers the numbers can overflow to infinity and then turn into NaN, and training is impossible. So initialization needs to be "neither too small nor too large", a value that keeps the variance roughly constant across layers. Xavier and He initialization are the theoretically derived "just-right" values.

🙋

What's the difference between Xavier and He? I don't know which one to pick.

🎓

It's decided by the activation function. Xavier sets the weight variance to Var(W)=1/n, where n is the number of input units. That value assumes a symmetric function whose slope near the origin is about one, like tanh; it is also the usual choice for sigmoid, although sigmoid fits that assumption less closely. He, on the other hand, uses Var(W)=2/n, twice Xavier. Why twice? Because ReLU zeros every negative input and throws away roughly half the variance. To recover that half, you double the weight variance in advance. Set the activation to ReLU and pick He on the left, and you'll see the chart above become an almost flat horizontal line.

🙋

I see! So what happens if I use Xavier with ReLU by mistake?

🎓

Try it. Keep the activation on ReLU but switch the initialization to Xavier. Since Var(W)=1/n, the variance right after the linear layer equals the variance going in. But ReLU then throws away half, so the variance halves cleanly every layer. With twelve layers that's a factor of 1/2¹² = 1/4096, or 1/64 in terms of std. That is a full-blown vanishing signal, and the gradient vanishes along with it. "He for ReLU" is the theoretical answer to closing exactly that factor-of-two gap.
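Written out with the formulas from the theory section, the twelve-layer arithmetic looks as follows. This is a sketch under the same simplified model (g = 1/2 for ReLU); the symbol a₀ for the input to layer 1 is notation introduced here for illustration.

$$\mathrm{Var}(a_\ell)=\underbrace{\tfrac{1}{2}}_{g}\cdot n\cdot\underbrace{\tfrac{1}{n}}_{\mathrm{Var}(W)}\cdot\mathrm{Var}(a_{\ell-1})=\tfrac{1}{2}\,\mathrm{Var}(a_{\ell-1})\quad\Longrightarrow\quad\frac{\mathrm{Var}(a_{12})}{\mathrm{Var}(a_0)}=2^{-12}=\frac{1}{4096},\qquad\frac{\mathrm{std}(a_{12})}{\mathrm{std}(a_0)}=2^{-6}=\frac{1}{64}$$

With He the bracketed factor becomes (1/2)·n·(2/n) = 1, so the ratio stays at one for any depth.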

🙋

One last thing. There's also a "Zero initialization" option — what's wrong with starting from all zeros?

🎓

That's the textbook example of an initialization you must never use. If every weight is zero, all neurons in a layer return exactly the same output. During backpropagation they all receive the same gradient, so they stay at the same value forever. Hundreds of neurons end up doing the work of just one. We call this "symmetry that never breaks". To make training work you need to break the symmetry — randomize the weights so each neuron takes on a different role. Biases can be zero, but the weights must not be.
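The symmetry argument can be checked on a toy network. The sketch below is an illustration only: the two-layer tanh network, the squared-error loss, the input values, and every name in it are assumptions made for this example. It computes the gradient that each hidden neuron's incoming weights receive and compares the rows.

```python
import math
import random


def hidden_unit_gradients(w1, w2, x, target):
    """Gradient of 0.5 * (y - target)^2 w.r.t. each hidden unit's input weights.

    Toy network: h_j = tanh(sum_i w1[j][i] * x[i]),  y = sum_j w2[j] * h_j.
    """
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    y = sum(w * hj for w, hj in zip(w2, h))
    err = y - target
    grads = []
    for j in range(len(w1)):
        # Backpropagate through the output weight and the tanh derivative.
        delta_j = err * w2[j] * (1.0 - h[j] ** 2)
        grads.append([delta_j * xi for xi in x])
    return grads


x, target = [0.5, -1.0, 2.0], 1.0
hidden = 4

# Zero initialization: every hidden unit starts identical.
zero_w1 = [[0.0] * len(x) for _ in range(hidden)]
zero_w2 = [0.0] * hidden
zero_grads = hidden_unit_gradients(zero_w1, zero_w2, x, target)

# Random initialization: each hidden unit starts different.
rng = random.Random(0)
rand_w1 = [[rng.gauss(0.0, 0.5) for _ in x] for _ in range(hidden)]
rand_w2 = [rng.gauss(0.0, 0.5) for _ in range(hidden)]
rand_grads = hidden_unit_gradients(rand_w1, rand_w2, x, target)

zero_rows_identical = all(row == zero_grads[0] for row in zero_grads)
rand_rows_identical = all(row == rand_grads[0] for row in rand_grads)
print("zero init, identical gradients:", zero_rows_identical)
print("random init, identical gradients:", rand_rows_identical)
```

With identical starting weights and identical gradients, every update leaves the hidden units identical again, which is why the symmetry never breaks on its own.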

Frequently Asked Questions

Why does weight initialization matter so much?

In a deep network, each layer's linear transform z = Σwᵢaᵢ rescales the activation variance Var(a), and the effect compounds layer by layer. If the weights are too small, Var(a) shrinks at every layer and the gradient reaching the front layers becomes essentially zero, so training stalls (vanishing gradient). If they are too large, Var(a) blows up exponentially until the values overflow and turn into NaN (exploding gradient). Xavier and He initialization design the weight variance Var(W) so that this variance stays roughly constant across layers, which is why simply changing the initial values decides whether training "makes no progress at all" or "converges".

Should I use Xavier or He?

Choose by activation function. For tanh, a symmetric function whose slope near the origin is roughly one, use Xavier (Glorot) initialization Var(W)=1/n; Xavier is also the usual choice for sigmoid, although sigmoid fits that assumption less closely. For ReLU, which zeros negative inputs and discards about half of the variance, use He initialization Var(W)=2/n. He carries twice the variance of Xavier precisely to compensate for the half that ReLU throws away. In this tool, set the activation to ReLU and select Xavier, and you will see the variance halve at every layer until the signal vanishes.

Why can't all the weights start at zero?

If every weight is set to zero, all neurons in a layer produce the same output and receive identical gradients during backpropagation. They therefore stay at the same value forever, so a layer of hundreds of neurons has only the expressive power of one. This is called a failure to break symmetry. To make training work, the weights must be small random values so that each neuron takes on a different role. Biases, by contrast, can safely be initialized to zero.

Does Batch Normalization make initialization unnecessary?

Batch Normalization (BN) normalizes each layer's output to zero mean and unit variance, so it absorbs much of the effect of initialization. A network with BN can still train even if the initialization is somewhat off. However, BN has a computational cost and a batch dependency, and many architectures such as Transformers do not use BN, so a proper initialization (He, Xavier) remains the baseline. Combining BN or Layer Normalization with a good initialization lets you train deeper networks stably.
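To make "zero mean and unit variance" concrete, here is a minimal sketch of the normalization step alone. It is an assumption-laden illustration: it handles a single feature over one batch, omits BN's learnable scale and shift and its running statistics, and the function name, the `eps` value, and the sample numbers are hypothetical.

```python
import math


def batch_normalize(values, eps=1e-5):
    """Normalize one feature across a batch to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# Badly scaled pre-activations, as a too-large initialization might produce.
pre_activations = [120.0, -340.0, 75.0, 510.0, -95.0]
normalized = batch_normalize(pre_activations)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
print("mean:", norm_mean, "variance:", norm_var)
```

Whatever scale the initialization produced, the normalized values come out on the same scale, which is the sense in which BN absorbs a rough initialization.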

Real-World Applications

Deep CNNs for image recognition: Convolutional networks such as ResNet and VGG rely heavily on ReLU, so He initialization is the de facto standard. He initialization (He et al., 2015) was proposed precisely to train 30-layer-class ReLU networks stably from the start. In convolutional layers the fan-in is taken as "kernel height × kernel width × input channels" when computing Var(W)=2/n. Train a deep CNN with Xavier left in place and the loss often refuses to drop early on, leaving you wondering "is the learning rate wrong?", when the cause is frequently the initialization.
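The convolutional fan-in rule can be sketched in a few lines. The helper names and the 3×3, 64-channel layer below are illustrative assumptions, not taken from any particular framework.

```python
import math


def conv_fan_in(kernel_h, kernel_w, in_channels):
    """Fan-in of a 2D convolution: how many inputs feed each output value."""
    return kernel_h * kernel_w * in_channels


def he_std(fan_in):
    """Weight std for He initialization, from Var(W) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


# Hypothetical layer: 3x3 kernel over 64 input channels.
fan_in = conv_fan_in(3, 3, 64)
std = he_std(fan_in)
print("fan-in:", fan_in, "He std:", round(std, 4))
```

Note that the fan-in here is 576 rather than 64, so treating the channel count alone as n would make the weights three times too large in std.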

NLP and Transformers: Transformers contain many fully-connected layers, and many of them (BERT and the GPT family, for example) use GELU, a smooth cousin of ReLU, as the activation; the original Transformer used ReLU. Beyond Xavier-style initialization, the GPT-2 paper scales the weights of residual layers down at initialization by 1/√N, where N is the number of residual layers. Common GPT-style implementations apply this as a 1/√(2L) factor on the output-projection weights of an L-layer model, since each layer contributes two residual branches. This kind of "depth-aware initialization" is a key to stable training. Weight initialization is not mere preprocessing; it is part of architecture design.
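As a rough illustration of the 1/√(2L) scaling, the sketch below computes the reduced std for an output projection. The function name is hypothetical, and the base std of 0.02 is an assumed starting value chosen for the example rather than something stated in the text.

```python
import math


def residual_projection_std(base_std, num_layers):
    """Scale a base weight std by 1 / sqrt(2 * L) for an L-layer model."""
    return base_std / math.sqrt(2 * num_layers)


base_std = 0.02  # assumed base std, for illustration only
for num_layers in (12, 24, 48):
    scaled = residual_projection_std(base_std, num_layers)
    print(num_layers, "layers ->", round(scaled, 5))
```

The deeper the model, the smaller each residual branch starts, so the sum of all branch contributions stays at a comparable scale regardless of depth.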

Transfer learning and fine-tuning: When you use a pretrained model, the backbone weights are already trained and need no initialization, but the newly attached output head (a classification layer, for instance) does. Use initial values that are too large here and the gradients run wild early in fine-tuning, destroying the good features learned during pretraining. The standard practice is a smaller initialization for the head, or a learning rate lower than the backbone's.

Diagnosing training that won't progress: When you hit "the loss doesn't drop at all" or "it becomes NaN within a few epochs", the first move is to log the standard deviation of each layer's activations. If, as in this tool, the std fades toward zero layer by layer, the initialization is too small; if it diverges, it is too large. Checking the initialization and activation variance before suspecting the learning rate or optimizer is an efficient debugging workflow.
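A minimal version of that check might look like the sketch below. Everything in it is an assumption for illustration: the function names, the rule of comparing the last layer's std to the first, and the thresholds 0.1 and 10.0 are arbitrary choices, not values from the tool.

```python
import statistics


def layer_std(activations):
    """Population standard deviation of one layer's logged activations."""
    return statistics.pstdev(activations)


def diagnose_signal(layer_stds, shrink_ratio=0.1, grow_ratio=10.0):
    """Classify a forward pass from per-layer stds (thresholds are hypothetical)."""
    first, last = layer_stds[0], layer_stds[-1]
    if first == 0.0:
        return "vanishing"  # no signal even at the first layer
    ratio = last / first
    if ratio < shrink_ratio:
        return "vanishing"  # initialization is probably too small
    if ratio > grow_ratio:
        return "exploding"  # initialization is probably too large
    return "stable"


# Toy logged activations for three layers, fading by 10x per layer.
logged = [
    [1.0, -1.0, 0.5, -0.5],
    [0.1, -0.1, 0.05, -0.05],
    [0.01, -0.01, 0.005, -0.005],
]
stds = [layer_std(layer) for layer in logged]
print(stds, diagnose_signal(stds))
```

In a real training script the lists would come from activations captured during one forward pass on a batch, logged once before training starts.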

Common Misconceptions and Pitfalls

The most common misconception is that "initializations are all roughly the same, and any of them will converge as long as you tune the learning rate". As this tool shows, for a 12-layer ReLU network with an input std of 1.0, He keeps the final-layer std around 1.0, whereas a small random scheme with std=0.01 shrinks the signal to an astronomically small value and std=1.0 makes it diverge. No learning rate can rescue that. If the gradient has vanished, increasing the learning rate does not help, because the update stays essentially zero. Initialization does not merely "decide the starting point"; it builds the very arena in which gradients can flow.

Next, confusing the fan-in (n_in) with the fan-out (n_out). This tool focuses on preserving the forward-pass variance, so it uses the fan-in forms Var(W)=1/n_in (Xavier) and Var(W)=2/n_in (He), but the original Xavier paper wants to preserve both the forward and backward passes and proposes the averaged form Var(W)=2/(n_in+n_out). Implementations differ on which convention they adopt, and some frameworks offer both a fan_in mode and a fan_out mode. In networks where the layer width differs greatly between the input and output, this distinction can change the variance by a factor of several, so always check the default of the library you use.
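The size of that gap is easy to see numerically. The sketch below compares the conventions for one layer; the dictionary keys and the 1024-to-64 layer shape are illustrative assumptions, and "he_fan_out" is taken here to mean Var(W)=2/n_out.

```python
def init_variances(n_in, n_out):
    """Var(W) under several conventions for a layer with n_in inputs and n_out outputs."""
    return {
        "xavier_fan_in": 1.0 / n_in,             # form used in this tool
        "xavier_average": 2.0 / (n_in + n_out),  # averaged form from the Xavier paper
        "he_fan_in": 2.0 / n_in,                 # forward-pass form used in this tool
        "he_fan_out": 2.0 / n_out,               # assumed meaning of a fan_out mode
    }


# Hypothetical layer that narrows sharply from 1024 inputs to 64 outputs.
variances = init_variances(1024, 64)
he_ratio = variances["he_fan_out"] / variances["he_fan_in"]
xavier_ratio = variances["xavier_average"] / variances["xavier_fan_in"]
for name, value in variances.items():
    print(name, value)
print("he fan_out / fan_in:", he_ratio)
```

For a square layer (n_in = n_out) the fan-in and averaged Xavier forms coincide, so the choice only starts to matter when the layer changes width.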

Finally, the assumption that "with Batch Normalization, initialization no longer matters". It is true that BN normalizes each layer's output and absorbs much of a rough initialization. But BN itself has learnable scale and shift parameters, and their initial values (usually scale 1, shift 0) are a form of initialization too. Moreover, in networks with residual connections there is a deliberate "use zero initialization on purpose" technique: initializing the scale of the last BN in the residual branch to zero makes the network start as an identity map and stabilizes early training. Even with BN, the design of initialization does not vanish — it changes shape and persists.
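The identity-map effect of a zero scale can be shown with a toy residual block. This is a simplified sketch: the block is reduced to `x + gamma * branch(x)`, where `gamma` stands in for the scale of the last BN in the branch, and `toy_branch` is a hypothetical placeholder for the branch's layers.

```python
def residual_block(x, branch, gamma):
    """Toy residual block: output = x + gamma * branch(x), element by element."""
    return [xi + gamma * bi for xi, bi in zip(x, branch(x))]


def toy_branch(x):
    """Hypothetical stand-in for the convolution and normalization layers in the branch."""
    return [3.0 * xi + 1.0 for xi in x]


x = [0.2, -1.5, 4.0]
out_zero = residual_block(x, toy_branch, gamma=0.0)  # scale initialized to zero
out_one = residual_block(x, toy_branch, gamma=1.0)   # usual scale of one

print("gamma=0:", out_zero)
print("gamma=1:", out_one)
```

With the scale at zero the block passes its input through untouched, so a stack of such blocks starts out as an identity map and the branches fade in as training moves the scale away from zero.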
