Cross-Entropy vs MSE Loss Simulator

Parameters

Logit z (sigmoid input)

The raw model output. It passes through the sigmoid, p = σ(z), to become the predicted probability.

True label y

The true class this input belongs to (0 or 1)

Results

Predicted probability p

Cross-entropy loss CE

MSE loss

CE gradient |dCE/dz|

MSE gradient |dMSE/dz|

Gradient ratio CE/MSE

Loss-curve comparison — operating-point animation

Both loss curves are drawn against the predicted probability p for the current true label y — the cross-entropy curve (rising steeply) and the MSE curve (a gentle parabola) — with the current operating point p marked on each.

Loss vs predicted probability p

Gradient vs logit z

Theory & Key Formulas

$$\text{CE}=-\big[y\ln p+(1-y)\ln(1-p)\big],\qquad \text{MSE}=(p-y)^2$$

The two losses for binary classification. p: the predicted probability from the sigmoid, p = σ(z) = 1/(1+e^(−z)); y: the true label (0 or 1).

$$\frac{\partial \text{CE}}{\partial z}=p-y,\qquad \frac{\partial \text{MSE}}{\partial z}=2(p-y)\,p(1-p)$$

Gradients with respect to the logit z. The CE gradient is the error p−y itself; the MSE gradient carries the sigmoid derivative p(1−p).

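When the output saturates (p≈0 or p≈1), p(1−p)→0, so the MSE gradient vanishes whether the prediction is right or wrong. The CE gradient p−y vanishes only when the saturated prediction is correct; when the model is confidently wrong (p≈0 with y=1, or p≈1 with y=0) it stays near −1 or +1, a magnitude of about 1. This is why cross-entropy is the standard loss for classification.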
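The formulas above can be evaluated directly. The following is a minimal sketch, not the tool's actual source: the function names `sigmoid` and `losses_and_grads` are made up for illustration, and the gradient ratio is assumed to be the ratio of magnitudes |dCE/dz| / |dMSE/dz|.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid 1/(1+e^-z), branched so exp() never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def losses_and_grads(z: float, y: int, eps: float = 1e-12) -> dict:
    """Evaluate both losses and their logit gradients at one operating point."""
    p = sigmoid(z)
    pc = min(max(p, eps), 1.0 - eps)          # clamp only for the logarithms
    ce = -(y * math.log(pc) + (1 - y) * math.log(1.0 - pc))
    mse = (p - y) ** 2
    grad_ce = p - y                           # dCE/dz
    grad_mse = 2.0 * (p - y) * p * (1.0 - p)  # dMSE/dz
    # Assumed definition of the ratio: magnitudes, i.e. 1 / (2 p (1 - p)).
    ratio = abs(grad_ce) / abs(grad_mse) if grad_mse != 0.0 else math.inf
    return {"p": p, "ce": ce, "mse": mse,
            "grad_ce": grad_ce, "grad_mse": grad_mse, "ratio": ratio}

# Sweep a confidently wrong, an undecided, and a confidently right logit for y = 1.
for z in (-8.0, 0.0, 8.0):
    r = losses_and_grads(z, y=1)
    print(f"z={z:+.1f}  p={r['p']:.6f}  CE={r['ce']:.4f}  MSE={r['mse']:.4f}  "
          f"dCE/dz={r['grad_ce']:+.6f}  dMSE/dz={r['grad_mse']:+.6f}  ratio={r['ratio']:.1f}")
```

At z = 0 the ratio is exactly 2. At z = −8 with y = 1 it grows to roughly 1,500, because the MSE gradient has shrunk by the factor 2p(1−p) while the CE gradient is still close to −1.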

What is Cross-Entropy vs MSE?

🙋

A "loss function" in machine learning measures the gap between the prediction and the truth, right? I get why regression uses MSE (mean squared error), but why does classification use cross-entropy? Is squared error not allowed?

🎓

Good question. The short answer is that MSE "can" be used for classification too. It just squares the difference between the predicted probability p and the truth y, so it stays non-negative and is minimal when p = y. That is why beginners sometimes accidentally use MSE for classification. But when you actually train, the model learns noticeably more slowly than with cross-entropy. The cause lies in the gradient.

🙋

The gradient — so it is the derivative that matters, not the loss value itself?

🎓

Exactly. Training progresses by nudging the parameters a small step against the gradient, that is, downhill on the loss. So what really determines how fast the model learns is not the loss value but the size of the gradient. Try setting the logit z to −8 on the left. With the true label y=1, a strongly negative z means the model is confidently wrong: it is sure the input is "class 0". Look at the "MSE gradient" on the right at that point; it should be nearly zero.

🙋

You're right, the MSE gradient is close to 0.000. It is most wrong of all, yet the gradient has disappeared… that is a problem.

🎓

That is precisely the "vanishing gradient". The reason is the shape of the sigmoid. The MSE gradient is dMSE/dz = 2(p−y)·p(1−p), with the sigmoid derivative p(1−p) as a factor. When p saturates near 0 or 1, that p(1−p) becomes almost zero. So at the very moment you want a big correction, when the model is confidently wrong, a brake is applied. The cross-entropy gradient dCE/dz = p−y has no such factor, so even at z=−8 with y=1 the gradient is still about −1, a magnitude of nearly 1. The "Gradient vs logit" chart on the right makes the difference obvious at a glance.

🙋

Why does the CE gradient become such a simple expression, p − y? It is too clean to be a coincidence.

🎓

It is no coincidence. When you compose the sigmoid with cross-entropy and apply the chain rule, the 1/p and 1/(1−p) coming from cross-entropy and the sigmoid derivative p(1−p) cancel out exactly. The result is p−y. It is a piece of design elegance: sigmoid and cross-entropy are a statistically "matched" pair, because cross-entropy is the negative log-likelihood of a Bernoulli distribution and the sigmoid is the natural way to turn a logit into that distribution's probability. So in practice you reach for this pair without hesitation: MSE for regression, sigmoid + cross-entropy for binary classification, softmax + cross-entropy for multi-class. Remember that and you will rarely go wrong.

Frequently Asked Questions

Why is cross-entropy, rather than MSE, used for classification?

The reason becomes clear when you look at the gradient with respect to the logit z. The cross-entropy gradient is dCE/dz = p − y, which is the error itself. The MSE gradient is dMSE/dz = 2(p−y)·p(1−p), which carries the sigmoid derivative p(1−p) as a factor. When the model is confidently wrong, p saturates near 0 or 1 so p(1−p)→0, and the MSE gradient collapses to nearly zero. The cross-entropy gradient grows with the error instead of shrinking, so it is the standard loss for classification.
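The effect on training speed can be illustrated with a toy experiment: gradient descent applied directly to a single logit that starts confidently wrong. This is a sketch under assumed settings (start at z = −8, label y = 1, learning rate 1.0, all chosen for illustration); the helper names are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, branched so exp() never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def grad_ce(p: float, y: int) -> float:
    """dCE/dz for a sigmoid output."""
    return p - y

def grad_mse(p: float, y: int) -> float:
    """dMSE/dz for a sigmoid output."""
    return 2.0 * (p - y) * p * (1.0 - p)

def steps_until_correct(grad_fn, z0: float = -8.0, y: int = 1,
                        lr: float = 1.0, max_steps: int = 100_000) -> int:
    """Run gradient descent on the logit and count the updates needed
    until the prediction lands on the correct side of p = 0.5."""
    z = z0
    for step in range(max_steps):
        p = sigmoid(z)
        if (p > 0.5) == (y == 1):
            return step
        z -= lr * grad_fn(p, y)   # step against the gradient
    return max_steps

ce_steps = steps_until_correct(grad_ce)
mse_steps = steps_until_correct(grad_mse)
print(f"cross-entropy: {ce_steps} steps, MSE: {mse_steps} steps")
```

With these settings cross-entropy recovers in about ten updates, while MSE needs on the order of a thousand, almost all of them spent crawling out of the saturated region.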

What is a vanishing gradient?

A vanishing gradient is when the loss is large but the gradient driving the parameter update is nearly zero, so training stalls. With a sigmoid output combined with MSE loss, the sigmoid derivative p(1−p) becomes extremely small in the saturated region (p≈0 or p≈1). The MSE gradient carries this p(1−p) factor, so exactly when you most want a large update, when the model is confidently wrong, the gradient disappears. Cross-entropy avoids this because p(1−p) does not appear in its gradient.

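Why can the cross-entropy loss become NaN or Infinity, and how is that prevented?

In cross-entropy CE = −[y·ln p + (1−y)·ln(1−p)], if p is exactly 0 or 1 one of the logarithms diverges to −∞ and the loss becomes Infinity or NaN (NaN arises from the floating-point product 0·(−∞) in the term whose coefficient is zero). The standard fix is to clamp (clip) the predicted probability p to a tiny interval such as [1e−12, 1−1e−12] before taking the logarithm. That margin assumes 64-bit floats; in 32-bit floats 1−1e−12 rounds to 1, so a wider margin is needed there. This tool applies the same clamp, so the loss never becomes NaN or Infinity even when the logit z is swept to ±8.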
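A minimal sketch of the clamp described above, assuming 64-bit floats. For comparison it also shows a different technique, not taken from this tool: computing the loss straight from the logit as max(z, 0) − y·z + ln(1 + e^(−|z|)), which is algebraically the same loss and needs no clamp. Both function names are hypothetical.

```python
import math

EPS = 1e-12  # clamp margin quoted in the text; assumes 64-bit floats

def ce_clamped(p: float, y: int) -> float:
    """Cross-entropy from a probability, clamped so log() never sees 0."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def ce_from_logit(z: float, y: int) -> float:
    """Same loss computed from the logit: softplus(z) - y*z,
    with softplus written as max(z, 0) + log1p(exp(-|z|)) to avoid overflow."""
    return max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))

# The clamp keeps the loss finite even for p exactly 0 or 1.
print(ce_clamped(0.0, 1), ce_clamped(1.0, 1))

# Both forms agree at a moderate logit ...
p = 1.0 / (1.0 + math.exp(8.0))          # sigmoid(-8)
print(ce_clamped(p, 1), ce_from_logit(-8.0, 1))

# ... and the logit form stays finite for extreme logits as well.
print(ce_from_logit(1000.0, 0))
```

The clamp caps the loss at −ln(1e−12), about 27.6, whereas the logit form keeps growing linearly with |z|, so it preserves the size of the error for very large logits.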

Why is the cross-entropy gradient simply p − y?

When you compose the sigmoid p = σ(z) with cross-entropy CE and apply the chain rule, the sigmoid derivative p(1−p) and the cross-entropy derivative with respect to p cancel each other out exactly, leaving the simple form dCE/dz = p − y. This means the difference between prediction and truth is itself the gradient, a desirable property where larger error means a larger update. The cancellation happens because sigmoid and cross-entropy form a matched pair: cross-entropy is the negative log-likelihood of a Bernoulli distribution, and the sigmoid is the canonical link function for that distribution.
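Written out, the cancellation takes two lines. This is the standard chain-rule derivation, using the same symbols as the formulas in the theory section:

$$\frac{\partial \text{CE}}{\partial p}=-\frac{y}{p}+\frac{1-y}{1-p}=\frac{p-y}{p(1-p)},\qquad \frac{dp}{dz}=p(1-p)$$

$$\frac{\partial \text{CE}}{\partial z}=\frac{\partial \text{CE}}{\partial p}\cdot\frac{dp}{dz}=\frac{p-y}{p(1-p)}\cdot p(1-p)=p-y$$

For MSE the first factor is 2(p−y) instead, so nothing cancels and the p(1−p) survives in the gradient.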

Real-World Applications

Classification tasks in neural networks: For tasks whose output is a class probability (image classification, spam detection, benign/malignant decisions on medical images), cross-entropy is the de facto standard loss. Binary classification uses sigmoid + binary cross-entropy; multi-class classification uses softmax + categorical cross-entropy. PyTorch's BCEWithLogitsLoss follows this design, taking logits directly and computing the loss in a numerically stable form. TensorFlow's categorical_crossentropy does the same for multi-class outputs when called with from_logits=True; by default it expects probabilities.

Logistic regression: Logistic regression, a classical statistical model, is itself "sigmoid + cross-entropy". Minimizing the cross-entropy loss is mathematically equivalent to maximum-likelihood estimation of a Bernoulli distribution. In other words, using cross-entropy is the statistically sound procedure of "choosing parameters that make the observed data most likely". MSE has no such interpretation for 0/1 labels: it corresponds to maximum likelihood under Gaussian noise, which does not describe binary outcomes.
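The equivalence can be checked in one line. Treating the label as a Bernoulli variable with success probability p = σ(z), the negative log-likelihood of a single observation is exactly the CE formula from the theory section:

$$P(y\mid z)=p^{\,y}(1-p)^{1-y}\quad\Longrightarrow\quad -\ln P(y\mid z)=-\big[y\ln p+(1-y)\ln(1-p)\big]=\text{CE}$$

For a data set of independent samples the likelihoods multiply, so the negative log-likelihood is the sum of the per-sample CE values, and minimizing one minimizes the other.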

Stabilizing gradient-based optimization: In deep learning, the deeper the stack of layers, the smaller gradients tend to become as they propagate backward. Putting MSE on a sigmoid output layer adds one more shrinking factor, p(1−p), at the very start of backpropagation, and training can nearly stall when the output saturates. Simply switching the output-layer loss to cross-entropy makes at least the gradient leaving the output layer a healthy size, proportional to the error, and training becomes that much more stable. This is a basic countermeasure against vanishing gradients, alongside ReLU activation and batch normalization.

ML education and model debugging: One classic cause of "the classification model won't train" is a mismatched loss function. Once you have felt firsthand — through a visualization like this tool — that "the MSE gradient vanishes in the saturated region", you will know to suspect the loss/activation combination first when training stalls. When the loss curve stays nearly flat and refuses to move, the standard move is to check whether the design causes a vanishing gradient.

Common Misconceptions and Pitfalls

A common misconception is the flat claim that "MSE cannot be used for classification". The accurate statement is "it can be used, but training is slow and tends to be unstable". MSE loss itself has the correct shape — minimal at p=y — and judged by the loss value alone it is not broken as a classification metric. The problem is in the gradient, not the loss value. Because the gradient vanishes in the region where the model is confidently wrong, optimization fails to progress. The key perspective is not "small loss = good model" but "choose a loss whose gradient scales appropriately with the error".

Next, the assumption that "cross-entropy gives a p−y gradient even without the sigmoid". The clean form dCE/dz = p−y holds only when cross-entropy is paired with the sigmoid (or with softmax for multi-class). If you put ReLU or an identity map on the output and layer cross-entropy on top, the cancellation does not happen, and the output is not even guaranteed to lie in (0, 1), where the logarithms are defined. Furthermore, applying a sigmoid a second time to a value that is already a probability, or otherwise confusing logits with probabilities, is a classic implementation bug. Be careful not to mix up a function that expects logits (like PyTorch's BCEWithLogitsLoss) with one that expects probabilities (like PyTorch's BCELoss).
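The double-sigmoid bug is easy to reproduce in a few lines. The sketch below uses two hypothetical helpers, `bce_from_prob` and `bce_from_logit`, that stand in for a "pass probabilities" and a "pass logits" loss; they are illustrations, not any library's API.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, branched so exp() never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bce_from_prob(p: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy that expects a probability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def bce_from_logit(z: float, y: int) -> float:
    """Binary cross-entropy that expects a logit and applies the sigmoid itself."""
    return bce_from_prob(sigmoid(z), y)

z, y = 4.0, 1
p = sigmoid(z)                   # the model is confidently right: p is close to 1

correct = bce_from_logit(z, y)   # logit passed to the logit-based loss
buggy = bce_from_logit(p, y)     # BUG: probability passed as if it were a logit

# A probability lies in (0, 1), so the second sigmoid can only output values
# between sigmoid(0) = 0.5 and sigmoid(1); the buggy loss cannot go below this.
floor = -math.log(sigmoid(1.0))
print(f"correct={correct:.4f}  buggy={buggy:.4f}  floor={floor:.4f}")
```

Because the second sigmoid squeezes every prediction into the range 0.5 to about 0.731, the buggy loss for y = 1 never drops below about 0.313 however confident the model becomes. A loss that plateaus well above zero is a typical symptom of this mistake.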

Finally, it is dangerous to think "vanishing gradients are only an MSE problem". The root cause of vanishing gradients is the saturation of sigmoid/tanh; even with cross-entropy as the loss, stacking saturating activations in many hidden layers still makes gradients small. What cross-entropy solves is only the "vanishing gradient originating from the output-layer loss". Countering vanishing gradients across a whole deep network requires a combination of ReLU-family activations, residual (skip) connections, proper weight initialization and batch normalization. Understand that cross-entropy is not a cure-all; it is just one move among the countermeasures against vanishing gradients.
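The hidden-layer effect can be seen in a deliberately simplified toy: a chain of scalar sigmoid layers, each with an assumed weight of 1, where the gradient reaching the input is the product of the per-layer sigmoid derivatives. The function `chain_gradient` is hypothetical and is not a model of any real network.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, branched so exp() never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def chain_gradient(x: float, depth: int, w: float = 1.0) -> float:
    """d(output)/d(input) for `depth` stacked scalar layers a_next = sigmoid(w * a)."""
    a, grad = x, 1.0
    for _ in range(depth):
        a = sigmoid(w * a)
        grad *= w * a * (1.0 - a)   # chain rule: weight times sigmoid derivative
    return grad

# Each sigmoid derivative is at most 0.25, so with w = 1 the gradient is
# bounded by 0.25 ** depth no matter which loss sits on top.
for depth in (1, 5, 10, 20):
    print(depth, chain_gradient(0.0, depth), 0.25 ** depth)
```

The shrinkage here comes entirely from the hidden activations, which is why the choice of output loss cannot fix it.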
