Softmax and Cross-Entropy Loss — Core of Classification
Convert three-class logits through Softmax into probabilities and compute the cross-entropy loss against the true class in real time. Move the temperature T to feel how the distribution sharpens or smooths.
Parameters
Logit z_1 (Class 1: true)
Logit z_2 (Class 2)
Logit z_3 (Class 3)
Temperature T
The true class is fixed at c = 1 ("Class 1"). One-hot label y = (1, 0, 0).
Results
P(Class 1) ← true
P(Class 2)
P(Class 3)
Cross-entropy loss L
Probability Distribution and Effect of Temperature
Top = current P(class i) at this T (green = true class, blue = others) / Bottom = P at T = 0.5, 1, 2, 5 / Bottom-right = loss and predicted class
Theory & Key Formulas
Softmax turns a real vector z into a probability distribution p; cross-entropy measures the mismatch between this prediction and the one-hot label y. For true class c it reduces to L = -Σ y_i log(p_i) = -log(p_c).
Softmax with temperature T. Each logit is divided by T, exponentiated and normalized by the sum: p_i = exp(z_i / T) / Σ_j exp(z_j / T).
Smaller T gives a sharper distribution (concentrated on argmax); larger T approaches a uniform distribution. This is the key property used in knowledge distillation.
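As a minimal sketch of the Softmax and cross-entropy definitions above, the following pure-Python snippet evaluates both for the example logits z = (2, 1, 0) at the four temperatures shown in the lower chart. The function names `softmax` and `cross_entropy` are chosen here for illustration and are not taken from the simulator's source.

```python
import math

def softmax(logits, temperature=1.0):
    """Return p_i = exp(z_i / T) / sum_j exp(z_j / T) as a list."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # constant shift: same result, no overflow
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Cross-entropy against a one-hot label, i.e. -log(p_c)."""
    return -math.log(probs[true_index])

logits = [2.0, 1.0, 0.0]  # z_1, z_2, z_3; Class 1 (index 0) is the true class

for T in (0.5, 1.0, 2.0, 5.0):  # the four temperatures in the lower chart
    p = softmax(logits, T)
    print(f"T={T}: p={[round(x, 3) for x in p]}, loss={cross_entropy(p, 0):.3f}")
```

At T = 1 this gives p ≈ (0.665, 0.245, 0.090) and L ≈ 0.408, the values quoted on this page.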
What Is the Softmax and Cross-Entropy Loss Simulator?
🙋
I have heard that classifiers put a "Softmax" at the very end. What does it actually do?
🎓
Roughly, Softmax turns "scores (logits)" into "probabilities". In the simulator above, with z_1=2, z_2=1, z_3=0 the probabilities are 0.665, 0.245, 0.090, which sum to 1. We just exponentiate each score and divide by the total. The key trick is that the exponential turns differences between scores into ratios between probabilities: a gap of 1 in the logits becomes a factor of e ≈ 2.72 (here 0.665 / 0.245 ≈ 2.72).
🙋
Got it! And what is the "cross-entropy loss" right below it?
🎓
It is a number that measures the gap between the predicted probabilities and the true label. If the correct class is c, the loss is L = -log(p_c), using the natural logarithm. Right now the true class is Class 1 and p_1 = 0.665, so L ≈ 0.408. As p_c approaches 1, L → 0; if you are confidently wrong with p_c = 0.01, then L ≈ 4.6. The design says "being confidently wrong hurts a lot".
🙋
When I move the temperature T slider, all the probabilities change at once. With the logits still at 2, 1, 0, at T=5 they are about 0.40, 0.33, 0.27, all heading toward 1/3.
🎓
Nice catch. T is the parameter you divide every logit by before passing them to Softmax. Large T shrinks the differences and pushes us toward a uniform distribution; small T amplifies the differences and concentrates probability on the largest entry. The lower chart shows T=0.5, 1, 2, 5 side by side — compare them. In practice "knowledge distillation" uses a large T to transmit the shape of the teacher distribution to the student.
🙋
Doesn't the computation break if the logits get too large?
🎓
Sharp question. At z=1000 the value exp(1000) overflows to infinity. So in implementations the standard trick is to subtract the maximum logit before exponentiation. Softmax is invariant under constant shifts, so the answer is the same, and the largest entry becomes exp(0)=1, which never overflows. The same shift underlies the log-sum-exp trick. This simulator uses it too, so the math stays safe across the whole slider range, up to z=5 (the upper bound). Try sliding the sliders all the way to the ends.
Frequently Asked Questions
In the limit T→0, probability mass collapses onto the class with the largest logit and the others go to zero — a hard distribution. With z_1=2, z_2=1, z_3=0 and T=0.1, p_1 is essentially 1. As T→∞ all classes become equally likely; for three classes that is 1/3 ≈ 0.333 — a soft distribution. T is essentially the knob that directly controls the sharpness of the distribution.
In Hinton-style distillation, both the teacher and the student apply Softmax at a large temperature (for example T=4 to 10), and the student is trained to match the shape of the teacher distribution. At T=1 a well-trained teacher puts almost all mass on the top class, so its output carries little more information than the hard label. At larger T the runner-up probabilities, which encode what the teacher has learned about class similarity, become visible. At inference time the student's T is reset to 1.
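The effect of the temperature on the teacher's soft targets can be sketched as follows. The logits are hypothetical values invented for this example, and `distillation_loss` is an illustrative name. The loss is written as a KL divergence between the softened teacher and student distributions, multiplied by T², the scaling suggested in the original distillation paper to keep gradient magnitudes comparable across temperatures. Real training pipelines usually combine this term with an ordinary cross-entropy on the hard labels, which is omitted here.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # max-subtraction for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature):
    """T^2 * KL(teacher || student), both softened at the same temperature."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    return temperature ** 2 * kl

# Hypothetical logits, invented for this example only.
teacher_logits = [8.0, 3.0, -2.0]
student_logits = [5.0, 4.0, 0.0]

for T in (1.0, 4.0):
    soft_targets = softmax(teacher_logits, T)
    loss = distillation_loss(teacher_logits, student_logits, T)
    print(f"T={T}: teacher={[round(x, 3) for x in soft_targets]}, loss={loss:.3f}")
```

With these made-up logits the teacher's top probability is above 0.99 at T = 1, while at T = 4 the runner-up class receives roughly a fifth of the mass, which is the extra signal the student learns from.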
Label smoothing is a regularization trick that replaces a hard one-hot label (1.0 on the true class) with, say, 0.9 on the true class and 0.1 spread equally over the others. By smoothing y in L = -Σ y_i log(p_i), it discourages the model from becoming overconfident. It often improves generalization and calibration, and is a standard technique in image and speech classification.
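A small sketch of the smoothed loss, using the 0.9 / 0.1 split described above on the example logits z = (2, 1, 0). The helper names `smooth_labels` and `soft_cross_entropy` are hypothetical. Note that some libraries use a slightly different convention and spread the smoothing mass over all classes, including the true one.

```python
import math

def log_softmax(logits):
    shift = max(logits)
    log_sum = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_sum for z in logits]

def smooth_labels(true_index, num_classes, epsilon=0.1):
    """Put 1 - epsilon on the true class and spread epsilon over the others."""
    off_value = epsilon / (num_classes - 1)
    return [1.0 - epsilon if i == true_index else off_value
            for i in range(num_classes)]

def soft_cross_entropy(target, logits):
    """L = -sum_i y_i * log(p_i) for an arbitrary target distribution y."""
    return -sum(y * lp for y, lp in zip(target, log_softmax(logits)))

logits = [2.0, 1.0, 0.0]
hard = [1.0, 0.0, 0.0]                  # one-hot label for Class 1
soft = smooth_labels(0, 3, epsilon=0.1)  # (0.9, 0.05, 0.05)

print(f"hard-label loss:     {soft_cross_entropy(hard, logits):.4f}")
print(f"smoothed-label loss: {soft_cross_entropy(soft, logits):.4f}")
```

The smoothed loss is larger for the same prediction because the target now asks for some probability on the other classes, so pushing p_1 all the way to 1 is no longer optimal.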
For large logits exp(z) overflows in double precision (at about z≈709 it becomes +Inf), and very negative ones underflow to zero. We use p_i = exp(z_i − max(z)) / Σ exp(z_j − max(z)) instead. Softmax is invariant under constant shifts so the result matches, and the largest term becomes exp(0)=1, which is stable. PyTorch and TensorFlow implement log_softmax this way internally.
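The difference between the naive and the shifted computation can be seen in a short sketch. It uses Python's standard `math` module, where `math.exp` raises `OverflowError` for very large arguments; array libraries typically return +Inf and then NaN instead. The function names are illustrative only.

```python
import math

def naive_softmax(logits):
    exps = [math.exp(z) for z in logits]  # overflows for large z
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]  # largest term is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

def stable_log_softmax(logits):
    # log-sum-exp: log(sum exp(z_j)) = m + log(sum exp(z_j - m)), m = max(z)
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_sum_exp for z in logits]

big = [1000.0, 999.0, 0.0]

try:
    naive_softmax(big)
except OverflowError as err:
    print("naive version failed:", err)

print([round(p, 4) for p in stable_softmax(big)])
print([round(lp, 4) for lp in stable_log_softmax(big)])
```

The very negative shifted term exp(-1000) underflows to 0.0, which is harmless here: the third probability is simply reported as zero, while its log-probability is still finite in `stable_log_softmax`.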
Real-World Applications
Image classification and object detection: The final layer of image classifiers such as ResNet or ViT outputs Softmax probabilities and is trained with cross-entropy loss. For ImageNet (1000 classes), z lives in 1000 dimensions and the output is a per-class probability. Object detectors (YOLO, Faster R-CNN) use the same framework for the class prediction of each candidate box.
Language modeling and machine translation: Autoregressive language models such as GPT emit a logit vector of vocabulary size at each position, and Softmax turns it into a distribution over the next token. Masked language models such as BERT apply the same Softmax over the vocabulary to predict masked tokens. Training minimizes -log p_c at the true token; at inference the temperature T controls sampling diversity (raising T gives more variety, lowering it makes the output more deterministic). The "temperature" parameter exposed by LLM APIs, such as the one behind ChatGPT, plays this role.
Knowledge distillation (model compression): A small student model is trained to mimic the high-temperature Softmax distribution of a large teacher. From BERT to DistilBERT to quantized image classifiers, distillation is widely used to compress models for edge deployment. The key is "raise the temperature": at T=1 the teacher's distribution is nearly one-hot and most of the information about the other classes is lost.
Reinforcement learning and policy networks: Policy-gradient methods (REINFORCE, PPO) emit action probabilities with a Softmax. A higher temperature favors exploration, a lower one favors exploitation, and some implementations schedule T during training. AlphaGo's MCTS uses a similar soft selection rule.
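The temperature-controlled sampling mentioned for language models and policy networks above can be sketched with the standard library alone. The three-token vocabulary and the logits are hypothetical, and `sample_token` is an illustrative helper, not part of any real model API.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, temperature, rng):
    """Draw one token from the temperature-scaled Softmax distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["cat", "dog", "fish"]  # hypothetical three-token vocabulary
logits = [2.0, 1.0, 0.0]        # hypothetical next-token logits
rng = random.Random(0)          # fixed seed so the run is repeatable

for T in (0.1, 1.0, 5.0):
    draws = [sample_token(vocab, logits, T, rng) for _ in range(1000)]
    counts = {tok: draws.count(tok) for tok in vocab}
    print(f"T={T}: {counts}")
```

At low temperature nearly every draw is the top-scoring token, which corresponds to exploitation or near-deterministic decoding; at high temperature the counts spread out, which corresponds to exploration or more varied text.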
Common Misconceptions and Cautions
The most common misconception is to treat Softmax outputs as if they were true probabilities. They satisfy the mathematical properties of a probability (non-negative, summing to 1), but they are not necessarily calibrated. Modern deep networks tend to be overconfident — predicting p=0.99 when the actual accuracy is closer to 80% is routine. In safety-critical domains such as medical diagnosis or autonomous driving, post-hoc methods like temperature scaling are used to correct this.
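A rough sketch of post-hoc temperature scaling: pick the single T that minimizes the average negative log-likelihood on held-out data, leaving the trained weights untouched. The six validation examples are hypothetical and deliberately overconfident, and the grid search is a simplification assumed for this example; practical implementations usually fit T with a gradient-based optimizer.

```python
import math

def log_softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    log_sum = shift + math.log(sum(math.exp(s - shift) for s in scaled))
    return [s - log_sum for s in scaled]

def mean_nll(dataset, temperature):
    """Average -log p_c over (logits, true_index) pairs at the given T."""
    return sum(-log_softmax(z, temperature)[c] for z, c in dataset) / len(dataset)

# Hypothetical held-out set: confident logits, two of six predictions wrong.
validation = [
    ([6.0, 0.0, 0.0], 0), ([5.0, 1.0, 0.0], 0), ([0.0, 6.0, 0.0], 1),
    ([6.0, 0.0, 0.0], 1), ([0.0, 0.0, 5.0], 2), ([0.0, 5.0, 0.0], 2),
]

candidates = [0.5 + 0.1 * k for k in range(96)]  # T from 0.5 to 10.0
best_T = min(candidates, key=lambda T: mean_nll(validation, T))

print(f"NLL at T=1:  {mean_nll(validation, 1.0):.3f}")
print(f"best T={best_T:.1f}, NLL={mean_nll(validation, best_T):.3f}")
```

For an overconfident model the selected T comes out above 1, which softens the probabilities without changing which class is predicted.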
The next most common mistake is to think cross-entropy loss can be used just like MSE (mean squared error). Cross-entropy assumes a probability distribution as input; passing raw logits directly into -y log z without normalization will return NaN or negative values. PyTorch's nn.CrossEntropyLoss takes raw logits (it computes log_softmax internally), while nn.NLLLoss takes log-probabilities — always check what the function expects.
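The failure mode described above is easy to reproduce in plain Python. The names `wrong_loss` and `right_loss` are made up for this sketch. With the standard `math` module the logarithm of a non-positive number raises `ValueError`; array libraries would typically return NaN instead.

```python
import math

def log_softmax(logits):
    shift = max(logits)
    log_sum = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_sum for z in logits]

def wrong_loss(logits, true_index):
    """Misuse: treats the raw logit as if it were a probability."""
    return -math.log(logits[true_index])

def right_loss(logits, true_index):
    """Normalize with log-softmax first, then take -log p_c."""
    return -log_softmax(logits)[true_index]

print(f"wrong: {wrong_loss([2.0, 1.0, 0.0], 0):.3f}")  # negative "loss"
print(f"right: {right_loss([2.0, 1.0, 0.0], 0):.3f}")

try:
    wrong_loss([-1.0, 1.0, 0.0], 0)  # log of a negative logit
except ValueError as err:
    print("invalid input to log:", err)
```

A cross-entropy value below zero or a NaN in the training log is therefore a strong hint that logits and probabilities have been mixed up somewhere.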
Finally, do not treat temperature T as a magic accuracy knob. Dividing every logit by the same T > 0 never changes which logit is largest, so the predicted class, and therefore the accuracy, is identical at every T; only the confidence changes. Changing T at training time merely reshapes the loss; the test-time accuracy evaluated at T=1 is essentially unaffected. T earns its keep in tasks that genuinely use the shape of the distribution: transmitting the teacher's distribution in distillation, controlling sampling diversity in generative models, recalibrating uncertainty after training. Move T in the simulator and watch what changes and what does not; this distinction is the point.
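What changes and what does not can also be checked numerically. In this sketch the logits are hypothetical and `argmax` is a small helper defined here; the predicted class stays the same at every temperature while the confidence attached to it moves.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [1.0, 2.5, 0.3]  # hypothetical logits; index 1 has the largest score

for T in (0.1, 1.0, 10.0):
    p = softmax(logits, T)
    print(f"T={T}: predicted index={argmax(p)}, max probability={max(p):.3f}")
```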