Visualize a 2D Gaussian Naive Bayes classifier. Each class is modeled as a product of independent 1D Gaussians, one per feature, and the prediction is the class that maximizes the log-posterior. The decision boundary and posteriors update in real time.
Parameters
Query point x
Query point y
Sigma shift (uniform on all data)
Samples per class N (pts)
Training data are generated by a deterministic LCG (seed = 42). You can also set the query point by clicking on the canvas.
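As a rough illustration of how a seeded LCG can produce reproducible Gaussian training data, here is a minimal sketch. The multiplier, increment and modulus, the Box-Muller step, the class mean and the helper names are all assumptions made for this example; the simulator's actual constants are not stated on this page.

```python
import math

class LCG:
    """Linear congruential generator; the constants are illustrative, not the simulator's."""

    def __init__(self, seed=42):
        self.state = seed

    def uniform(self):
        # Advance the state and map it to a float in [0, 1).
        self.state = (1664525 * self.state + 1013904223) % 2**32
        return self.state / 2**32

    def normal(self, mu=0.0, sigma=1.0):
        # Box-Muller transform; 1 - u1 lies in (0, 1], so the log is always finite.
        u1, u2 = self.uniform(), self.uniform()
        z = math.sqrt(-2.0 * math.log(1.0 - u1)) * math.cos(2.0 * math.pi * u2)
        return mu + sigma * z

def make_class(rng, mean, sigma, n):
    """Draw n 2D points whose coordinates are independent Gaussians."""
    return [(rng.normal(mean[0], sigma), rng.normal(mean[1], sigma)) for _ in range(n)]

# The same seed always yields the same point cloud (hypothetical class mean).
points_a = make_class(LCG(seed=42), mean=(0.0, 2.0), sigma=1.0, n=30)
points_b = make_class(LCG(seed=42), mean=(0.0, 2.0), sigma=1.0, n=30)
print(points_a == points_b)
```

Because the generator is deterministic, every reload with the same seed and N reproduces the same training set, which is what makes the displayed accuracy repeatable.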
Results
Predicted class
Top posterior P(c*|x)
Training resubstitution accuracy
Top log-posterior log P(c) + log L
Feature space and decision boundary
Background color = argmax class region / small dots = training data / large X = class mean / black X = query point (click the canvas to move)
Posterior at the query point
Per-class posterior P(c|x). Values update whenever the query point moves.
Theory & Key Formulas
A Gaussian Naive Bayes classifier assumes that each feature $x_j$ is conditionally independent given the class $c$ and follows a normal distribution $\mathcal{N}(\mu_{cj},\sigma_{cj}^2)$.
Class-conditional log-likelihood under feature independence:
$$\log p(\mathbf{x}\mid c) = \sum_{j=1}^{d}\left[-\tfrac{1}{2}\log(2\pi\sigma_{cj}^{2})-\tfrac{(x_j-\mu_{cj})^2}{2\sigma_{cj}^{2}}\right]$$
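Log-posterior from Bayes' theorem (dropping the constant $\log p(\mathbf{x})$, which does not depend on $c$):
$$\hat{c} = \arg\max_{c}\left[\log P(c) + \log p(\mathbf{x}\mid c)\right]$$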
If each feature's sigma is shared across classes, the boundary is linear (as in LDA restricted to a diagonal covariance); otherwise it is a quadric surface, which in this 2D simulator is a conic section.
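The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the simulator's code: the function names `fit`, `log_posteriors` and `predict` and the tiny data set are hypothetical, and the variance is the maximum-likelihood estimate with no floor applied.

```python
import math

def fit(points_by_class):
    """Estimate the prior, per-feature mean and per-feature variance of each class."""
    total = sum(len(pts) for pts in points_by_class.values())
    model = {}
    for label, pts in points_by_class.items():
        n = len(pts)
        dims = range(len(pts[0]))
        mean = [sum(p[j] for p in pts) / n for j in dims]
        # Maximum-likelihood variance (divides by n, not n - 1).
        var = [sum((p[j] - mean[j]) ** 2 for p in pts) / n for j in dims]
        model[label] = {"log_prior": math.log(n / total), "mean": mean, "var": var}
    return model

def log_posteriors(model, x):
    """Unnormalized log-posterior log P(c) + log p(x | c) for every class."""
    scores = {}
    for label, m in model.items():
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * v) - (xj - mu) ** 2 / (2 * v)
            for xj, mu, v in zip(x, m["mean"], m["var"])
        )
        scores[label] = m["log_prior"] + log_lik
    return scores

def predict(model, x):
    """Return the class with the largest log-posterior."""
    scores = log_posteriors(model, x)
    return max(scores, key=scores.get)

# Tiny hand-made data set (hypothetical values).
data = {
    "green": [(0.0, 1.0), (1.0, 3.0), (-1.0, 2.0), (0.0, 2.0)],
    "blue": [(2.0, -2.0), (3.0, -1.0), (1.0, 0.0), (2.0, -1.0)],
}
model = fit(data)
print(predict(model, (0.0, 1.5)))   # green
print(predict(model, (2.0, -0.5)))  # blue
```

Working in log space keeps the per-feature terms additive and avoids underflow when many small densities would otherwise be multiplied.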
What is the Gaussian Naive Bayes Classifier Simulator
🙋
I hear "Naive Bayes" all the time. Why is it called "naive"?
🎓
Roughly, because it assumes every feature is independent given the class, an assumption that almost never holds in reality. The classifier pretends it does anyway, hence "naive". In the simulator above, x and y are treated as independent 1D Gaussians per class, so the joint log-likelihood is just the sum of two 1D log-likelihoods.
🙋
When I move the "sigma shift" slider, the background regions change a lot. Increasing sigma makes the red-green border fuzzier and the green region spreads.
🎓
Increasing sigma weakens each class's "confidence", so the prediction is governed by the prior (sample ratio per class) and distance to the means. With N=30 for all three, the priors are equal, so the boundary becomes essentially "nearest class mean". Pushing sigma too small does the opposite — the likelihood collapses for points slightly off-mean and the boundary becomes overly intricate.
🙋
At the query (0,0), red and blue are equidistant (sqrt(5)), but green is closer at distance 2, so the prediction is green. The top posterior is about 45%, higher than the others.
🎓
Right. The log-posteriors are about -4.94 for green and -5.44 for red and blue. The gap is small, so softmax gives about 45 : 27 : 27 percent. This is the classic "low-confidence prediction". In practice you often add a threshold like "withhold a decision if the top probability is below 50%". Move the query near (2,-1) and blue jumps above 90%, becoming a confident prediction.
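The numbers in this exchange can be checked with a short calculation. The sketch below assumes idealized parameters: sigma = 1 for every class, equal priors of 1/3, and hypothetical class means chosen only to match the quoted distances (green at distance 2, red and blue at sqrt(5) from the origin). The simulator estimates its parameters from sampled data, so its displayed values will differ slightly.

```python
import math

# Hypothetical class means chosen to match the distances quoted above.
means = {"green": (0.0, 2.0), "red": (-2.0, -1.0), "blue": (2.0, -1.0)}
sigma = 1.0
log_prior = math.log(1.0 / 3.0)  # equal priors

def log_posterior(x, mu):
    # Sum of two independent 1D Gaussian log-densities plus the log-prior.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mu))
    return log_prior - math.log(2 * math.pi * sigma**2) - sq_dist / (2 * sigma**2)

def posteriors(x):
    scores = {c: log_posterior(x, mu) for c, mu in means.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(weights.values())
    return scores, {c: w / total for c, w in weights.items()}

def decide(x, threshold=0.5):
    # Reject option: return None when the top posterior is below the threshold.
    _, post = posteriors(x)
    best = max(post, key=post.get)
    return best if post[best] >= threshold else None

scores, post = posteriors((0.0, 0.0))
print({c: round(s, 2) for c, s in scores.items()})  # green -4.94, red and blue -5.44
print({c: round(p, 2) for c, p in post.items()})    # green 0.45, red and blue 0.27
print(decide((0.0, 0.0)))   # None: top posterior is below 50%
print(decide((2.0, -1.0)))  # blue
```

The 0.5 gap in log-posterior corresponds to a likelihood ratio of only e^0.5, about 1.65, which is why the top class reaches just 45%.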
🙋
If I lower N to 10, the training data are sparse and the estimated sigma fluctuates, so the boundary warps. How do I use this insight?
🎓
Good observation. With small samples, errors in the mean and variance estimates show up directly in the boundary. In practice you regularize the estimates of sigma and the prior, for example with a variance floor or a Bayesian prior on the parameters, and use cross-validation to choose how strong the regularization should be, which suppresses overfitting. Try toggling N between 10 and 60 and watch the training resubstitution accuracy change; you will feel the trade-off quickly.
Frequently Asked Questions
Strictly independent features are rare, and most real problems show correlation between features. Even so, Naive Bayes delivers practical accuracy in many applications such as text classification and medical diagnosis. The reason is that argmax_c log P(c|x) determines the decision boundary, so only the ranking of the class scores matters, not the calibration of the probabilities. Whether you need a model with covariance (LDA, QDA, GMM) should be judged by comparing AUC and log-loss on held-out data.
Represent each document as a vector of word counts and learn per-class word probabilities with a multinomial distribution (Multinomial Naive Bayes), or represent it as binary word-presence indicators and use a Bernoulli distribution (Bernoulli Naive Bayes). Because the input is discrete counts rather than continuous values, the multinomial model rather than the Gaussian model is the standard choice. In spam filtering, words such as "free" and "winner" have very different conditional probabilities in spam and non-spam, and the sum of log-likelihoods gives strong discrimination. It has long been a workhorse for email classification where short messages must be classified accurately.
If a word w never appears in any training document of class c, its probability becomes zero, and log(0) = minus infinity makes the log-posterior of that class minus infinity for every test example containing w. Laplace smoothing P(w|c) = (N_wc + alpha)/(N_c + alpha*V) with alpha greater than zero prevents this. The continuous (Gaussian) version in this simulator has a similar issue — the likelihood diverges as sigma approaches zero — which is why the sigma-shift slider has a lower limit.
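The smoothing formula above can be tried on a toy corpus. This is a minimal sketch; the documents, the vocabulary and the function name `word_log_probs` are made up for illustration.

```python
import math
from collections import Counter

def word_log_probs(docs, vocab, alpha=1.0):
    """Smoothed log P(w | c) = log((N_wc + alpha) / (N_c + alpha * V))."""
    counts = Counter(w for doc in docs for w in doc)
    n_c = sum(counts[w] for w in vocab)  # total word count in the class
    v = len(vocab)                       # vocabulary size V
    return {w: math.log((counts[w] + alpha) / (n_c + alpha * v)) for w in vocab}

# Toy "spam" class: "meeting" never occurs in it.
spam_docs = [["free", "winner", "free"], ["winner", "prize"]]
vocab = ["free", "winner", "prize", "meeting"]

smoothed = word_log_probs(spam_docs, vocab, alpha=1.0)
print(math.exp(smoothed["meeting"]))  # 1/9 instead of 0
print(math.exp(smoothed["free"]))     # 3/9
```

With alpha = 0 the unseen word would get probability zero, and taking its logarithm fails (Python's `math.log(0)` raises `ValueError`), which is the code-level face of the minus-infinity problem described above.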
When each feature's sigma is the same across all classes, the quadratic terms cancel and the difference of log-posteriors becomes linear in x, giving a straight-line boundary (equivalent to Linear Discriminant Analysis with a diagonal covariance). When sigma differs across classes, the terms -0.5*log(sigma^2) and (x-mu)^2/(2*sigma^2) do not cancel and the boundary becomes a quadric surface (in 2D a conic section: hyperbola, parabola or ellipse). The default of this simulator uses sigma=1 for every class, so the initial boundary is essentially linear; increasing the sigma shift bends it slightly, and changing the sample size makes per-class sigma estimates vary, which curves the boundary.
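To see why the shared-sigma case is linear, take two classes $a$ and $b$ and assume $\sigma_{aj}=\sigma_{bj}=\sigma_j$ for every feature $j$ (the labels $a$, $b$ and the symbol $\sigma_j$ are introduced here for illustration). Subtracting the two log-posteriors cancels both the normalization terms and the $x_j^2$ terms:
$$\log\frac{P(a\mid\mathbf{x})}{P(b\mid\mathbf{x})} = \log\frac{P(a)}{P(b)} + \sum_{j=1}^{d}\left[\frac{(\mu_{aj}-\mu_{bj})\,x_j}{\sigma_j^{2}} - \frac{\mu_{aj}^{2}-\mu_{bj}^{2}}{2\sigma_j^{2}}\right]$$
The boundary is the set where this expression equals zero, which is a hyperplane in $\mathbf{x}$ (a straight line in 2D). If $\sigma_{aj}\neq\sigma_{bj}$, a term proportional to $x_j^2\left(\sigma_{bj}^{-2}-\sigma_{aj}^{-2}\right)/2$ survives and the boundary becomes quadratic.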
Real-World Applications
Spam filtering and text classification: This is the most celebrated application of Naive Bayes. Each document is represented as a word-count vector, words are assumed independent, and the log-posterior becomes a sum of per-word contributions, which is extremely fast to evaluate. Combined with Bag-of-Words or TF-IDF, it served as the workhorse of commercial spam filters for many years before more complex models took over.
Decision support in medical diagnosis: With symptoms or test results as features and diseases as classes, assuming each symptom is independent yields a ranking of the most plausible diseases from a patient's findings. The conditional probabilities are explicit and easy for clinicians to inspect, which is why Naive Bayes is often used as a transparent baseline in decision-support systems.
Initial screening for anomaly detection: A two-class (normal/abnormal) Gaussian Naive Bayes requires almost no computation and is easy to deploy, so it is widely used to pre-filter outlier candidates from manufacturing sensors or IT system logs. A two-stage pipeline that follows it with a more accurate model (XGBoost, deep learning) is a practical pattern.
Baseline model for natural language processing: In tasks such as sentiment analysis, topic classification and language identification, Naive Bayes is still used as the "baseline" against which new deep models are compared. Linear-time training, high interpretability and robust performance with little data make it a fair reference for claiming the superiority of state-of-the-art models.
Common Misconceptions and Cautions
The most common misconception is to dismiss Naive Bayes as "useless" simply because the independence assumption is violated. In practice, even on text data where independence clearly fails, Naive Bayes delivers surprisingly good accuracy. The reason is that the predicted class depends only on which class score is largest, not on the exact probability values. Calibrated probabilities matter when you need them for risk assessment, but for hard classification Naive Bayes can still work well with correlated features. In the simulator, varying the sigma shift while watching the training accuracy reveals that the boundary may bend, yet accuracy hardly drops.
The next most common error is to read the posterior probabilities as "true probabilities". Under a violated independence assumption, Naive Bayes probabilities tend to be pushed toward 0 and 1, so a "90% confident" prediction can turn out to be correct far less often (around 65% of the time, for example). When probabilities are needed, calibration with Platt scaling or isotonic regression is standard. Watching the simulator quickly climb above 90% confidence even near boundary regions makes the risk of overconfidence concrete.
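One way to see where the overconfidence comes from is to feed Naive Bayes the same feature several times. The sketch below is a constructed toy case, not the simulator's setup: two classes with means +1 and -1, sigma = 1, equal priors, and a hypothetical helper `nb_posterior`. Perfectly correlated copies carry no new information, yet each one adds the same log-likelihood ratio again.

```python
import math

def nb_posterior(x, copies, mu_pos=1.0, mu_neg=-1.0, sigma=1.0):
    """P(positive | x) when Naive Bayes sees `copies` identical copies of x."""
    def log_density(value, mu):
        return -0.5 * math.log(2 * math.pi * sigma**2) - (value - mu) ** 2 / (2 * sigma**2)

    # Equal priors cancel; each copy adds the same log-likelihood ratio again.
    log_odds = copies * (log_density(x, mu_pos) - log_density(x, mu_neg))
    return 1.0 / (1.0 + math.exp(-log_odds))

honest = nb_posterior(0.5, copies=1)    # one feature: about 0.73
inflated = nb_posterior(0.5, copies=3)  # same evidence counted three times: about 0.95
print(round(honest, 2), round(inflated, 2))
```

The predicted class is the same in both cases, which matches the point that ranking survives correlation while the probability values do not.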
Finally, maximum-likelihood estimates of sigma and the prior become unstable on small samples. Drop N to 10 and observe the boundary warp. In practice, a standard remedy is MAP estimation with a conjugate prior, or a Laplace-like floor imposed on sigma after standardizing the features. In the simulator, adding +0.05 to the sigma shift raises the variance floor and smooths the boundary. Just because the model is "naive" does not mean you can skip robustifying the estimates; overfitting is the direct cost.
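A variance floor of the kind mentioned above can be sketched as follows. The function name `floored_variance` and the default floor of 0.05 squared are assumptions for illustration; how the simulator actually applies its sigma shift is not specified here.

```python
def floored_variance(values, floor=0.05 ** 2):
    """Maximum-likelihood variance, clamped from below by `floor`."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return max(var, floor)

# Degenerate sample: every value identical, so the raw variance is 0.
print(floored_variance([1.0, 1.0, 1.0]))  # equals the floor, not 0
# Well-spread sample: the floor has no effect.
print(floored_variance([0.0, 2.0]))       # 1.0
```

The floor only matters when the estimate is tiny, which is exactly the small-sample situation where an unregularized Gaussian would produce extreme log-likelihoods.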