A tool that decomposes the generalisation error of a machine-learning model into three parts: bias², variance and noise. Change the degree of the polynomial model and watch, in real time, the tradeoff between underfitting (high bias) and overfitting (high variance), and how the total error forms a U-shape.
Parameters
True function
The correct curve behind the data
Polynomial model degree
Model complexity. Low = underfit, high = overfit
Points per dataset
Number of observed points in each training set
Noise standard deviation
Magnitude σ of the Gaussian noise on observations
Number of datasets
Independent training sets generated to take the expectation
Results
Bias²
Variance
Irreducible error (noise)
Expected total error
Model degree
Regime
Model fit — ensemble and spread
One bold curve is the true function (the reference), the faint curves are the per-dataset fits (their spread = variance), and the other bold curve is the mean prediction (its gap from the true curve = bias). The scattered points are one highlighted dataset.
Bias-variance decomposition vs degree
Model spread — true function, mean prediction and ±1σ band
The expected squared error at a test point decomposes into three parts: bias², variance and the irreducible error (noise variance σ²). $\hat f$ is the learned model, $\overline{\hat f}$ is the mean prediction over datasets, and $f$ is the true function.
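Written out with the symbols defined above, the decomposition at a test point $x$ takes the standard form below. The observation model $y = f(x) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ and the notation $\mathbb{E}_D$ (expectation over training sets) are introduced here for illustration.

$$
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\big(\overline{\hat f}(x) - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\big[\big(\hat f(x) - \overline{\hat f}(x)\big)^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}},
\qquad \overline{\hat f}(x) = \mathbb{E}_D\big[\hat f(x)\big]
$$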
Bias² and variance averaged over the test grid. $\hat f_D$ is the model trained on dataset $D$.
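One plausible way to write these grid averages is shown below, assuming $K$ independent datasets $D_1, \dots, D_K$ and $M$ test-grid points $x_1, \dots, x_M$. The symbols $K$ and $M$ and the $1/K$ normalisation are assumptions made for illustration; the simulator may normalise differently.

$$
\overline{\hat f}(x_j) \approx \frac{1}{K} \sum_{k=1}^{K} \hat f_{D_k}(x_j), \qquad
\text{Bias}^2 \approx \frac{1}{M} \sum_{j=1}^{M} \big(\overline{\hat f}(x_j) - f(x_j)\big)^2, \qquad
\text{Variance} \approx \frac{1}{M} \sum_{j=1}^{M} \frac{1}{K} \sum_{k=1}^{K} \big(\hat f_{D_k}(x_j) - \overline{\hat f}(x_j)\big)^2
$$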
Raising the model complexity lowers bias and raises variance. Because of this tradeoff, the expected total error traces a U-shape against complexity, and its minimum gives the best generalisation performance.
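The decomposition can be estimated numerically by repeating the fit over many simulated training sets. The following is a minimal sketch, not the simulator's actual code: the true function sin(πx), the interval [-1, 1], the fixed training inputs, the 101-point test grid and the function name `decompose` are all assumptions chosen for illustration, and the fit is plain least squares via `np.polyfit`.

```python
import numpy as np

def decompose(degree, n_points=20, sigma=0.3, n_datasets=200, seed=0):
    """Monte Carlo estimate of (bias^2, variance, noise) for a polynomial fit."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(np.pi * x)              # assumed true function
    x_train = np.linspace(-1.0, 1.0, n_points)   # assumed fixed training inputs
    x_test = np.linspace(-1.0, 1.0, 101)         # assumed test grid

    preds = np.empty((n_datasets, x_test.size))
    for k in range(n_datasets):
        # Every dataset shares the inputs but draws fresh Gaussian noise.
        y = f(x_train) + rng.normal(0.0, sigma, n_points)
        coefs = np.polyfit(x_train, y, degree)
        preds[k] = np.polyval(coefs, x_test)

    mean_pred = preds.mean(axis=0)                        # mean prediction over datasets
    bias2 = float(np.mean((mean_pred - f(x_test)) ** 2))  # grid-averaged squared bias
    variance = float(np.mean(preds.var(axis=0)))          # grid-averaged spread of the fits
    return bias2, variance, sigma ** 2

for degree in (1, 3, 9):
    b2, var, noise = decompose(degree)
    print(f"degree={degree}: bias2={b2:.4f} variance={var:.4f} noise={noise:.4f}")
```

Under these assumptions the low-degree fit should show the larger bias² and the high-degree fit the larger variance; the exact numbers depend on the seed and settings.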
What is the Bias-Variance Tradeoff?
🙋
In machine learning everyone says "watch out for overfitting" — but what is actually bad about it? Fitting the data perfectly sounds like a good thing to me.
🎓
Good question. The key is that "fitting the data you have" and "being right on data you haven't seen yet" are two different things. An overfitted model works hard to trace even the random noise in the training data. So the training error drops to nearly zero, but on new data it misses badly. Raise the degree to 12 in this tool — see how the faint curves thrash around chaotically? That is the state where "predictions swing wildly when the data changes".
🙋
You're right, the curves come out completely different every time. If I set the degree to 1, they all become almost the same straight line. Isn't that bad in a different way?
🎓
Exactly — that is where it gets interesting. The degree-1 line barely moves even when the data changes, so its variance is small. But representing a wavy true function like a sine wave with a straight line is fundamentally impossible. See how the mean prediction is far off the true curve? That systematic gap is the "bias". A model that is too simple is stable but fundamentally wrong. We call that underfitting.
🙋
I see — too simple means high bias, too complex means high variance. So there must be a "just right" degree somewhere in the middle?
🎓
That is precisely the bias-variance tradeoff. Look at the "decomposition vs degree" chart below. Bias² slopes downward, variance slopes upward, and their sum (the total error) traces a U-shape. The bottom of the U is the degree that gives the best generalisation. For the default sine wave, the valley should sit around degree 3 to 5. "The more complex, the better" is an illusion.
🙋
Even at the bottom of the U, the error doesn't reach zero. Why doesn't the curve touch the floor?
🎓
That is the third component — the irreducible error. The observed data itself carries noise σ, so no model, however perfect, can push the error below σ². It is not a model problem but a data problem. So the U-shape of the total error sits on a "pedestal" the height of σ². Set the noise standard deviation on the left to 0, and you will see that pedestal vanish and the U reach all the way to the ground.
🙋
In practice, how do you find the "just right" complexity? You can't always draw this chart, can you?
🎓
In the field you don't know the true function, so you use cross-validation. You repeatedly split the data, train on one part and test on the other, and pick the complexity that minimises the test error. If variance looks high, strengthen regularisation or gather more data; if bias looks high, add features or make the model more complex. This tool is a practice ground that shows what is happening from a "god's-eye view" where the true function is known.
Frequently Asked Questions
Bias is the systematic error that measures how far the model is, on average, from the true function. It is large when the model is too simple to represent the underlying shape — like approximating a sine wave with a straight line. Variance is the spread that measures how much the prediction wobbles when the training data changes. When the model is too complex, like a high-degree polynomial, it chases even the noise in the data, so its predictions vary wildly from one dataset to another. The generalisation error is the sum of these two plus the irreducible noise that cannot be removed.
Raising the model degree (complexity) lowers the bias, but in exchange it raises the variance. Too low a degree cannot capture the true shape and underfits; too high a degree overfits the noise in the training data. The expected generalisation error is bias² + variance + noise, and it traces a U-shape against complexity. The training error keeps falling as you add complexity, but the test error reaches a minimum at some optimal point and then gets worse. That is why a more complex model is not always better.
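The U-shape can be traced without random sampling when the training inputs are held fixed, because a least-squares fit is linear in the targets and its bias and variance then have closed forms. The sketch below works under that assumption; the true function sin(πx), the interval, the grid sizes and the name `exact_decomposition` are hypothetical choices, not the simulator's settings.

```python
import numpy as np

def exact_decomposition(degree, n_points=20, sigma=0.3):
    """Exact (bias^2, variance, noise) for least squares on fixed training inputs."""
    f = lambda x: np.sin(np.pi * x)              # assumed true function
    x_train = np.linspace(-1.0, 1.0, n_points)   # assumed fixed training inputs
    x_test = np.linspace(-1.0, 1.0, 201)         # assumed test grid
    V_train = np.vander(x_train, degree + 1)
    V_test = np.vander(x_test, degree + 1)
    # H maps training targets y to test predictions: y_hat_test = H @ y.
    H = V_test @ np.linalg.pinv(V_train)
    mean_pred = H @ f(x_train)                   # E[y] = f(x_train): the noise has mean zero
    bias2 = float(np.mean((mean_pred - f(x_test)) ** 2))
    # Var of (H @ noise) at each test point is sigma^2 times the row's squared norm.
    variance = float(np.mean(sigma ** 2 * np.sum(H ** 2, axis=1)))
    return bias2, variance, sigma ** 2

totals = {d: sum(exact_decomposition(d)) for d in range(1, 13)}
best_degree = min(totals, key=totals.get)
for d, total in totals.items():
    marker = "  <- minimum" if d == best_degree else ""
    print(f"degree={d:2d} total={total:.4f}{marker}")
```

The printed totals should fall, bottom out at a moderate degree, and rise again, with every value sitting at or above σ².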
The irreducible error is the variance σ² of the random noise contained in the observed data itself — a lower bound on the error that no model, however good, can remove. This simulator generates each observation y as the true function f(x) plus Gaussian noise of standard deviation noiseLv, so the irreducible error is exactly noiseLv². Even if a model recovers the true function f(x) perfectly, the observed value at a test point still deviates by the noise, so the mean squared error can never drop below σ². Model selection can only reduce bias and variance; lowering the noise requires better data.
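The floor can be checked directly by scoring the true function itself against noisy observations. This is a small sketch using only the standard library; the true function sin(πx), the sampling interval and the name `oracle_mse` are assumptions for illustration.

```python
import math
import random

def oracle_mse(sigma, n_samples=100_000, seed=0):
    """MSE of the true function itself against noisy observations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0)
        f_x = math.sin(math.pi * x)          # assumed true function
        y = f_x + rng.gauss(0.0, sigma)      # observation = f(x) + Gaussian noise
        total += (y - f_x) ** 2              # error of a "perfect" model
    return total / n_samples

print(oracle_mse(0.5))   # close to 0.5 ** 2 = 0.25
print(oracle_mse(0.0))   # 0.0: with no noise there is no floor
```

Even this perfect predictor cannot score below roughly σ², which is why the remaining error has to be addressed through the data rather than the model.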
In practice you use cross-validation to find the model complexity that minimises the test error. When bias is high (underfitting), you add features, make the model more complex, or weaken regularisation. When variance is high (overfitting), it helps to gather more data, strengthen regularisation (L2/L1, dropout), simplify the model, or average predictions with ensembling and bagging. Note that adding more data reduces variance but does little for bias.
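As a rough illustration of choosing complexity by cross-validation, the sketch below runs 5-fold cross-validation over polynomial degrees on a single simulated dataset. The data-generating settings and the helper name `cv_mse` are hypothetical; in a real task `x` and `y` would be your own measurements.

```python
import numpy as np

def cv_mse(x, y, degree, k=5):
    """Mean held-out squared error of a polynomial fit under k-fold cross-validation."""
    idx = np.arange(x.size)
    folds = np.array_split(idx, k)           # assumes the rows are already in random order
    errors = []
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)  # every index not in the held-out fold
        coefs = np.polyfit(x[train], y[train], degree)
        resid = y[held_out] - np.polyval(coefs, x[held_out])
        errors.append(np.mean(resid ** 2))
    return float(np.mean(errors))

# One simulated dataset stands in for real measurements (assumed settings).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 60)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.3, 60)

scores = {d: cv_mse(x, y, d) for d in range(1, 11)}
best_degree = min(scores, key=scores.get)
print("degree chosen by cross-validation:", best_degree)
```

The selection uses only held-out error, so it needs neither the true function nor additional datasets.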
Real-World Applications
Model selection and hyperparameter tuning: The depth of a decision tree, the number and width of layers in a neural network, the kernel parameters of a support vector machine — each is a knob that controls "complexity" and, fundamentally, moves the bias-variance tradeoff. Searching for the bottom of the test-error U-shape with cross-validation is exactly the act of estimating the valley of the decomposition curve this tool draws, but without knowing the true function.
Designing regularisation: The λ of ridge regression and Lasso, and the weight decay and dropout of neural networks, are all tools that lower the effective complexity of a model to suppress variance. This simulator adds only a tiny ridge (λ=1e-6) for numerical stability; with a larger λ the bias rises while the variance falls, which is the very effect of regularisation. Raising λ when overfitting walks back from the right side of the U toward its bottom.
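The effect of λ can be sketched with the closed-form ridge solution on fixed training inputs. How the simulator applies its ridge term is not specified, so the code below assumes a penalty on all coefficients of a monomial basis; the true function, the degree, the grid sizes and the name `ridge_decomposition` are likewise illustrative.

```python
import numpy as np

def ridge_decomposition(lam, degree=9, n_points=20, sigma=0.3):
    """Exact (bias^2, variance) for ridge-penalised polynomial least squares."""
    f = lambda x: np.sin(np.pi * x)              # assumed true function
    x_train = np.linspace(-1.0, 1.0, n_points)   # assumed fixed training inputs
    x_test = np.linspace(-1.0, 1.0, 201)         # assumed test grid
    V = np.vander(x_train, degree + 1)
    V_test = np.vander(x_test, degree + 1)
    # Ridge solution w = (V^T V + lam I)^{-1} V^T y, so test predictions are H @ y.
    A = V.T @ V + lam * np.eye(degree + 1)
    H = V_test @ np.linalg.solve(A, V.T)
    bias2 = float(np.mean((H @ f(x_train) - f(x_test)) ** 2))
    variance = float(np.mean(sigma ** 2 * np.sum(H ** 2, axis=1)))
    return bias2, variance

for lam in (1e-6, 1e-2, 1.0):
    b2, var = ridge_decomposition(lam)
    print(f"lambda={lam:g}: bias2={b2:.5f} variance={var:.5f}")
```

In this setup, moving from the smallest to the largest λ should trade a lower variance for a higher bias².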
Data-collection decisions: "Should we gather more data, or change the model?" is a question that comes up constantly in the field. When variance dominates, more data effectively reduces the prediction spread. When bias dominates, the error plateaus however much data you add, and feature engineering or a model rethink is required. Increasing the points per dataset in this tool lowers the variance but leaves the bias essentially unchanged, which illustrates the reasoning behind this decision.
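The plateau can be illustrated with a straight-line fit to a sine wave as the number of training points grows. The sketch assumes fixed, evenly spaced training inputs so that bias² and variance can be computed exactly; the true function and the name `linear_fit_decomposition` are hypothetical.

```python
import numpy as np

def linear_fit_decomposition(n_points, sigma=0.3):
    """Exact (bias^2, variance) of a straight-line fit on fixed training inputs."""
    f = lambda x: np.sin(np.pi * x)              # assumed true function
    x_train = np.linspace(-1.0, 1.0, n_points)
    x_test = np.linspace(-1.0, 1.0, 201)
    V = np.vander(x_train, 2)                    # columns: x, 1
    V_test = np.vander(x_test, 2)
    H = V_test @ np.linalg.pinv(V)               # maps training targets to test predictions
    bias2 = float(np.mean((H @ f(x_train) - f(x_test)) ** 2))
    variance = float(np.mean(sigma ** 2 * np.sum(H ** 2, axis=1)))
    return bias2, variance

for n_points in (20, 200, 2000):
    b2, var = linear_fit_decomposition(n_points)
    print(f"n={n_points:4d}: bias2={b2:.4f} variance={var:.6f}")
```

The variance should shrink roughly in proportion to 1/n while bias² stays near the same value, so extra data alone cannot rescue an underfitting model.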
Understanding ensemble learning: Bagging and random forests reduce variance by averaging the predictions of many models. The same effect is visible in this simulator, where the "mean prediction" is far closer to the true function than the wild per-dataset curves. Boosting, conversely, works toward reducing bias by summing weak learners. The design philosophy of ensembles is explained cleanly by the bias-variance decomposition.
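The variance-averaging idea can be sketched by comparing individual high-degree fits with their average. Note the simplification: the code averages fits from independently generated datasets, as the simulator's mean prediction does, whereas real bagging resamples a single dataset with the bootstrap and typically gains less because its fits are correlated. The true function and all settings are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)                  # assumed true function
x_train = np.linspace(-1.0, 1.0, 20)             # assumed fixed training inputs
x_test = np.linspace(-1.0, 1.0, 201)

# Fit a degree-9 polynomial to each of 50 independently generated datasets.
preds = np.array([
    np.polyval(
        np.polyfit(x_train, f(x_train) + rng.normal(0.0, 0.3, x_train.size), 9),
        x_test,
    )
    for _ in range(50)
])

single_mse = float(np.mean((preds - f(x_test)) ** 2))                 # typical individual fit
averaged_mse = float(np.mean((preds.mean(axis=0) - f(x_test)) ** 2))  # averaged ensemble
print(f"single fit: {single_mse:.4f}  averaged fits: {averaged_mse:.4f}")
```

The averaged curve should land much closer to the true function than a typical individual fit.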
Common Misconceptions and Pitfalls
The biggest misconception is the belief that "a small training error means a good model". A complex model can drive the error on the training data arbitrarily low: once the degree reaches one less than the number of points, the polynomial can pass through every training point and the training error becomes nearly zero. But that is overfitting, and the generalisation error (test error) goes up instead. What you should always evaluate is the "error on data you have not seen". A large gap between the training error and the test error is a textbook sign of overfitting. What this tool computes is always the generalisation error on the test grid, not the training error.
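The gap between training and test error is easy to reproduce on a toy problem. The sketch below fits polynomials to 10 noisy points and scores them on fresh observations; the true function, noise level and grid are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)                  # assumed true function
sigma = 0.3
x_train = np.linspace(-1.0, 1.0, 10)             # 10 training points
y_train = f(x_train) + rng.normal(0.0, sigma, x_train.size)
x_test = np.linspace(-1.0, 1.0, 201)             # fresh, unseen observations
y_test = f(x_test) + rng.normal(0.0, sigma, x_test.size)

train_mse, test_mse = {}, {}
for degree in (1, 3, 9):                         # degree 9 = number of points - 1
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = float(np.mean((y_train - np.polyval(coefs, x_train)) ** 2))
    test_mse[degree] = float(np.mean((y_test - np.polyval(coefs, x_test)) ** 2))
    print(f"degree={degree}: train={train_mse[degree]:.2e} test={test_mse[degree]:.2e}")
```

At degree 9 the polynomial interpolates all 10 training points, so the training error collapses to numerical zero while the error on unseen observations does not.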
Next is the overconfidence that "more data solves everything". Adding data reliably reduces variance, but it does almost nothing for bias. As long as you approximate a sine wave with a straight-line model, increasing the observed points a hundredfold will not bring the mean prediction closer to the true curve. Gathering data endlessly while in an underfitting state is a classic failure where only the cost rises and the error does not fall. You must first identify whether bias or variance dominates, and choose the countermeasure accordingly.
Finally, there is the misconception that "the bias-variance decomposition can be computed directly on real data". This simulator outputs bias and variance separately, but that is only possible because it has a "god's-eye view": it knows the true function f(x) and can generate any number of independent datasets. In a real task the true function is unknown and there is only one dataset. So in practice we do not handle the decomposition itself, but treat the tradeoff indirectly through test-error estimation by cross-validation. Treat this tool purely as a teaching model for understanding what is happening behind that estimate.
How to Use
Set Polynomial model degree to control model complexity; low degrees underfit, and higher degrees raise the risk of high variance
Adjust Points per dataset (10–500) to observe how variance grows when training data is limited, while bias stays roughly unchanged
Set Noise standard deviation to simulate measurement error in your data; setting it to 0 removes the irreducible error
Set Number of datasets (50–2000) to control how many independent training sets are averaged when estimating bias² and variance
Read the Results panel, which updates as the parameters change, to see the expected total error decomposed into bias² + variance + irreducible noise
Worked Example
For a cubic polynomial (degree=3) trained on n=100 points per dataset with noise σ=0.5 over 500 datasets: bias²=0.12, variance=0.28, irreducible error=σ²=0.25, expected total error=0.65 (MSE). Increasing the degree to 9 yields bias²=0.02 but variance=0.68, pushing the total error to 0.95, which is in the overfitting regime. Reducing the points to n=30 raises the variance to 1.41 even at degree=3, showing that variance can dominate in small-sample regimes even for a moderately complex model.
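The quoted totals can be checked by adding the three components, taking the irreducible error as the square of the noise standard deviation 0.5:

$$
\sigma^2 = 0.5^2 = 0.25, \qquad
0.12 + 0.28 + 0.25 = 0.65, \qquad
0.02 + 0.68 + 0.25 = 0.95
$$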
Practical Notes
Underfitting regime (low degree, high bias): typical when linear models are applied to nonlinear physics problems; add polynomial features or kernels
Overfitting regime (high degree, high variance): common when the parameter count is large relative to the sample size (for example, 5+ parameters on fewer than 100 samples); apply regularisation (such as an L2 penalty) and use cross-validation to choose its strength
Sweet spot identification: find the degree at which the total error reaches its minimum (the bottom of the U); this marks the optimal complexity for your noise level and sample size
Irreducible error floor: the noise component limits improvement; if noise=1.0 dominates, collect cleaner measurements before adding model capacity