An interactive view of how k-fold cross-validation estimates a model's ability to generalize to unseen data. Drag the sample size, fold count k, model complexity and noise sliders and watch training MSE, CV MSE, standard error and the generalization gap update in real time - the bias-variance tradeoff in one screen.
Parameters
Total samples N
Total data available for learning
Number of folds k
5 or 10 are standard. k=N gives LOOCV
Model complexity d
Proxy for polynomial degree, tree depth or layer count
True noise std-dev σ
The irreducible noise floor no model can beat
Model bias coefficient b
Mismatch between the hypothesis family and the truth
Each horizontal bar is the full dataset. In each iteration one slot becomes the validation fold (orange) and the rest is training (blue); the CV score is the mean over k iterations.
Bias² · Variance · Total error vs model complexity
L is the loss function (MSE for regression, 0–1 loss for classification) and $\hat f^{-i}$ is the model trained without the i-th fold. The fold mean estimates the generalization error and SE quantifies its uncertainty.
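The formula this caption describes does not appear in the text itself. A standard form consistent with the symbols above would be the following, where the fold index sets $F_i$, the fold sizes $n_i$, the per-fold scores $E_i$ and the spread $s_{CV}$ are notation assumed here rather than taken from the tool:

```latex
E_i = \frac{1}{n_i} \sum_{j \in F_i} L\bigl(y_j, \hat f^{-i}(x_j)\bigr), \qquad
\mathrm{CV}_k = \frac{1}{k} \sum_{i=1}^{k} E_i, \qquad
\mathrm{SE} = \frac{s_{CV}}{\sqrt{k}}, \qquad
s_{CV}^2 = \frac{1}{k-1} \sum_{i=1}^{k} \bigl(E_i - \mathrm{CV}_k\bigr)^2
```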
Test error decomposes into bias², variance and irreducible noise. Increasing complexity lowers bias but raises variance - the tradeoff that CV makes visible.
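Written out for squared-error loss, and assuming the usual data model $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$ (the expectation is over training sets and noise), the decomposition at a point $x$ reads:

```latex
\mathbb{E}\bigl[(y - \hat f(x))^2\bigr]
  = \underbrace{\bigl(\mathbb{E}[\hat f(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\bigl[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\bigr]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```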
k-Fold Cross-Validation
🙋
Cross-validation comes up all the time in ML textbooks. What does it actually do that a plain train/test split doesn't?
🎓
Good question. A single 80/20 hold-out depends on luck: if the hard examples land in the test set you look bad, and if they land in train you look great. k-fold averages that luck out. You partition the data into k equal parts and, for each iteration, train on k-1 parts and validate on the one you held out, then average the k scores. With k=5 on 1000 rows, that's five runs of "train on 800, validate on 200" - each row contributes to validation exactly once.
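The partitioning step can be sketched in a few lines of plain Python. This is a minimal illustration, not the tool's own code: the helper name `k_fold_indices` is hypothetical, and the folds are contiguous for readability (in practice you would usually shuffle the rows first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous, near-equal folds."""
    # The first n_samples % k folds get one extra row, so sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size


splits = list(k_fold_indices(1000, 5))
for i, (train_idx, valid_idx) in enumerate(splits, start=1):
    print(f"fold {i}: train={len(train_idx)} valid={len(valid_idx)}")
```

With 1000 rows and k=5 every iteration reports 800 training rows and 200 validation rows, and each row index lands in exactly one validation fold.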
🙋
So is bigger k always better? Like cranking it all the way up to LOOCV with k = N?
🎓
It's not that simple. Larger k brings each training set closer to the full N samples (N-1 at LOOCV) so bias drops, but the k models are almost identical, so the per-fold estimates become highly correlated and the variance of the CV estimate tends to go up. Plus you have to fit the model k times. Empirically k=10 hits a sweet spot - Hastie, Tibshirani & Friedman's "The Elements of Statistical Learning" specifically recommends k=5 or k=10. In real projects almost everyone uses one of those two values.
🙋
When I push the complexity slider d up, the "generalization gap" number on the left explodes. What's it really measuring?
🎓
That's the overfitting alarm. Training MSE shows how well the model fits the data it has seen, CV MSE estimates the error on unseen data. The difference is called optimism - the in-sample evaluation looks better than reality. This tool reports it as a percentage of the CV error. Above 25% the model is starting to memorize, above 50% it's clearly overfit. Fixes are to reduce complexity, add regularization (L1, L2, dropout, weight decay) or get more data.
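As a rough sketch of that calculation, assuming the gap is defined as (CV error - training error) / CV error × 100 and using the 25% and 50% bands mentioned above. The function names and the sample MSE values are made up for illustration:

```python
def generalization_gap_pct(train_mse, cv_mse):
    """Optimism as a percentage of the CV error: (CV - train) / CV * 100."""
    if cv_mse <= 0:
        raise ValueError("cv_mse must be positive")
    return (cv_mse - train_mse) / cv_mse * 100.0


def gap_label(gap_pct):
    """Map the gap to the rough bands described in the text."""
    if gap_pct > 50:
        return "clearly overfit"
    if gap_pct > 25:
        return "starting to memorize"
    return "healthy"


# Hypothetical (training MSE, CV MSE) pairs.
for train_mse, cv_mse in [(0.28, 0.31), (0.30, 0.45), (0.10, 0.40)]:
    gap = generalization_gap_pct(train_mse, cv_mse)
    print(f"train={train_mse:.2f} cv={cv_mse:.2f} gap={gap:.1f}% -> {gap_label(gap)}")
```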
🙋
The bias-variance graph is striking - bias² keeps dropping but variance rises so the total error draws a U-shape.
🎓
That's the classical bias-variance decomposition. Test error = Bias² + Variance + irreducible σ². Bias typically decreases with complexity while variance increases, so their sum usually forms a U: bias dominates on the left, variance on the right. The bottom is the "right" model size, and CV is the empirical way to find it. Modern deep nets show a "double descent" exception, but the U-curve is still the foundation you should internalize first.
🙋
Why does the tool also display a standard error? Isn't the mean enough?
🎓
Because per-fold scores fluctuate. With s_CV being the standard deviation of the scores across folds, SE = s_CV / sqrt(k) is the standard error of the mean. With that you can say "model A scored 0.263, model B scored 0.270" and check whether the ± 1 SE bands overlap - if they do, the difference may well be noise. Breiman's one-SE rule then picks the simplest model whose mean error falls within the best model's mean + 1 SE. It's the standard discipline for hyperparameter selection.
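Here is a minimal sketch of the SE calculation and the one-SE rule, assuming lower scores are better (MSE) and that each candidate is tagged with an integer complexity. The per-fold numbers are invented for illustration; they are chosen so that degrees 3 and 2 average 0.263 and 0.270.

```python
import math
import statistics


def mean_and_se(fold_scores):
    """Return (mean, standard error of the mean) for per-fold error scores."""
    k = len(fold_scores)
    s_cv = statistics.stdev(fold_scores)  # sample std-dev across folds (k - 1 denominator)
    return statistics.mean(fold_scores), s_cv / math.sqrt(k)


def one_se_choice(candidates):
    """candidates: list of (complexity, fold_errors); lower error is better.

    Returns (chosen_complexity, best_complexity, threshold), where the chosen
    model is the least complex one with mean error <= best mean + 1 SE.
    """
    stats = [(c, *mean_and_se(scores)) for c, scores in candidates]
    best_c, best_mean, best_se = min(stats, key=lambda t: t[1])
    threshold = best_mean + best_se
    eligible = [c for c, m, _ in stats if m <= threshold]
    return min(eligible), best_c, threshold


# Hypothetical per-fold CV MSE for polynomial degrees 1 to 4.
candidates = [
    (1, [0.52, 0.49, 0.55, 0.50, 0.54]),
    (2, [0.27, 0.25, 0.29, 0.26, 0.28]),
    (3, [0.26, 0.24, 0.29, 0.25, 0.275]),
    (4, [0.30, 0.27, 0.35, 0.28, 0.33]),
]
chosen, best, threshold = one_se_choice(candidates)
print(f"best mean: degree {best}, threshold={threshold:.4f}, one-SE choice: degree {chosen}")
```

Degree 3 has the lowest mean, but degree 2 sits inside its one-SE band, so the rule picks the simpler degree 2.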
FAQ
In practice k=5 or k=10 are the standards, with k=10 being the most widely used choice. Each training set contains N(k-1)/k samples, which approaches N as k grows and lowers bias, but the training sets overlap more, so the estimates become correlated and variance rises, and compute cost grows linearly with k. With N=1000 and k=5 you get 800 train / 200 valid per iteration and only five training runs - a good balance. Only consider LOOCV (k=N) when N is very small.
The gap is the optimism (or in-sample bias): it tells you how much the training error underestimates the real prediction error. This tool reports (CV error - training error) / CV error × 100 (%). Below 25% is healthy, 25-50% signals overfitting, and above 50% means the model has effectively memorized the training set and will degrade badly on new data. Fix it by reducing complexity, adding regularization, or collecting more data.
Choosing models by mean CV alone is fragile because per-fold scores fluctuate. The standard error SE = s_CV / sqrt(k) lets you put an approximate confidence band around the mean. Two models whose mean ± 1 SE intervals overlap cannot be reliably told apart on this evidence. Breiman's one-SE rule picks the simplest model whose mean error falls within the best model's mean + 1 SE. This is a standard way to control complexity and reduce overfitting risk.
No, plain random k-fold should not be used on time-series data. Random splits would let future points train a model that then predicts the past (information leakage), making the score wildly optimistic. Use walk-forward validation or scikit-learn's TimeSeriesSplit so the training set is always strictly earlier than the validation set. Similarly, for imbalanced classification use stratified k-fold so each fold keeps the population class ratio. The right splitting strategy is what makes CV trustworthy.
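The "training strictly earlier than validation" constraint can be sketched in plain Python with an expanding training window. The helper `walk_forward_splits` is hypothetical and only in the spirit of scikit-learn's TimeSeriesSplit; it does not reproduce that class's exact fold boundaries.

```python
def walk_forward_splits(n_samples, n_splits):
    """Yield (train_idx, valid_idx) with an expanding training window.

    Rows are assumed to be sorted by time. Every training index is strictly
    earlier than every validation index in the same split.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for i in range(1, n_splits + 1):
        train_end = i * fold_size
        # The last validation block absorbs any leftover rows.
        valid_end = train_end + fold_size if i < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, valid_end))


wf_splits = list(walk_forward_splits(12, 3))
for train_idx, valid_idx in wf_splits:
    print(f"train=0..{train_idx[-1]}  valid={valid_idx[0]}..{valid_idx[-1]}")
```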
Real-world applications
Model selection in scikit-learn, PyTorch and XGBoost: whenever you tune polynomial degree, tree depth, learning rate or regularization weight with GridSearchCV or RandomizedSearchCV, k-fold cross-validation is doing the work under the hood. Picking the simplest model within "best mean + 1 SE" instead of the raw best mean usually delivers more stable test performance. Playing with the complexity slider here builds the intuition for reading those U-shaped CV curves in production code.
Kaggle and competitive ML leaderboards: trusting the public leaderboard alone is the classic way to get shaken up in the final standings. Top competitors typically keep a local 5- or 10-fold CV and watch how well CV tracks the leaderboard (LB) score. Large divergence between them usually points at distribution shift between the training and test data (time or group based). Knowing the CV standard error lets you ignore the kind of ± 0.001 LB wiggle that is purely noise.
Medical and drug-discovery model evaluation: with only hundreds to a few thousand patients, a fixed hold-out test is too small to trust. Practitioners use Stratified k-fold to balance positive/negative ratios, Repeated k-fold (running CV with several random seeds) to stabilize the mean, and Group k-fold to keep all samples from the same patient in the same fold so the model cannot leak across subjects. Regulatory submissions, for example to the FDA or PMDA, are typically expected to document this kind of leakage-free validation protocol.
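The group constraint is easy to get wrong by hand, so here is a small plain-Python sketch of the idea. The helper `group_k_fold`, its greedy balancing, and the patient ids are illustrative assumptions; scikit-learn's GroupKFold is the library route for real projects.

```python
from collections import defaultdict


def group_k_fold(groups, k):
    """Yield (train_idx, valid_idx) so that no group id appears on both sides.

    groups[i] is the group (for example a patient id) of sample i.
    """
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    if len(by_group) < k:
        raise ValueError("need at least k distinct groups")
    # Greedy balancing: place the largest groups first into the currently smallest fold.
    folds = [[] for _ in range(k)]
    for g in sorted(by_group, key=lambda g: len(by_group[g]), reverse=True):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_group[g])
    for f in range(k):
        valid_idx = sorted(folds[f])
        train_idx = sorted(i for j in range(k) if j != f for i in folds[j])
        yield train_idx, valid_idx


# Hypothetical patient ids: several samples per patient.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5", "p6"]
group_splits = list(group_k_fold(patient_ids, 3))
for train_idx, valid_idx in group_splits:
    print("valid patients:", sorted({patient_ids[i] for i in valid_idx}))
```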
CAE surrogate-model construction: when building Kriging or neural-network surrogates on top of FEM results, each simulation can cost hours to days, so the sample size is tens to hundreds. LOOCV (k=N) is standard in this regime - the cost of N retraining runs is acceptable because each sample is precious. The same per-point LOOCV error map drives active-learning acquisition functions that decide where to run the next expensive simulation.
Common pitfalls
The first trap is doing preprocessing outside the CV loop. If you fit a StandardScaler, feature selector or PCA on the whole dataset before splitting, information from the validation fold leaks into training and the CV score becomes optimistic. The fix is to wrap preprocessing and the estimator in an sklearn Pipeline and pass the Pipeline itself to cross_val_score, or to fit transformers strictly inside each training fold. Simple standardization leaks little, but target-aware feature selection or PCA can easily inflate CV scores by several percent.
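A minimal sketch of the Pipeline approach with scikit-learn follows. The synthetic dataset, the choice of SelectKBest with 10 features, and the Ridge estimator are illustrative assumptions, not something the tool prescribes. Because the whole pipeline is passed to `cross_val_score`, the scaler and the feature selector are refit on the training part of every fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 50 features, only 5 of them informative (assumed setup).
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                             # fit on training folds only
    ("select", SelectKBest(score_func=f_regression, k=10)),  # target-aware step, also inside the loop
    ("model", Ridge(alpha=1.0)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="neg_mean_squared_error")
fold_mse = -scores  # scikit-learn reports negated MSE so that higher is better
print("per-fold MSE:", np.round(fold_mse, 1))
print("CV MSE:", fold_mse.mean())
```

Running the selector on the full `X, y` before calling `cross_val_score` would be the leaky variant the paragraph warns about.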
The second trap is comparing models by mean CV only. {0.85, 0.82, 0.87, 0.83, 0.86} and {0.88, 0.70, 0.92, 0.75, 0.88} have similar means (0.846 and 0.826), but the second set is far more spread out and therefore far less reliable. Always report per-fold standard deviation or standard error, and check whether the ± 1 SE bands of competing models overlap. Following the one-SE rule - pick the simplest model whose mean lies within one SE of the best model's mean - usually delivers more stable production performance.
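The two score sets above can be summarized with the standard library alone. This is just the arithmetic, with the sample standard deviation (k - 1 denominator) assumed for the spread:

```python
import math
import statistics

model_a = [0.85, 0.82, 0.87, 0.83, 0.86]
model_b = [0.88, 0.70, 0.92, 0.75, 0.88]


def summarize(scores):
    """Return (mean, sample std-dev, standard error of the mean)."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    se = sd / math.sqrt(len(scores))
    return mean, sd, se


summary_a = summarize(model_a)
summary_b = summarize(model_b)
for name, (mean, sd, se) in [("A", summary_a), ("B", summary_b)]:
    print(f"model {name}: mean={mean:.3f} sd={sd:.3f} SE={se:.3f}")
```

The means come out at 0.846 and 0.826, but the standard error of the second set (about 0.043) is several times that of the first (about 0.009), so its ± 1 SE band comfortably covers the first model's mean.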
The third trap is using CV all the way through to final evaluation. CV is for hyperparameter tuning, and any data you have looked at many times during that search is no longer truly unseen. Keep a separate hold-out test set that the model has never touched and evaluate it exactly once at the end. The ideal split is train (for CV) / validation (for early stopping and threshold tuning) / test (one-shot final report). Skipping this discipline is how model-selection bias sneaks into papers and production deployments.
How to Use
Set the number of samples (nNum): typical range 50–500 for small datasets, 1000–10000 for production models
Choose k-fold value (kNum): k=5 or k=10 for balanced bias-variance; k=N for leave-one-out CV on small datasets
Specify noise level (sigmaNum) in data units: σ=0.1 for clean synthetic data, σ=1.0 for real-world variability
Run the fold partitioning; the simulator trains k models, each on (k-1)/k of the data, and validates each one on its held-out fold
Read outputs: CV MSE estimates generalization error; compare Training MSE vs CV MSE to detect overfitting
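The steps above can be mimicked offline with NumPy. The sketch below is not the simulator's code: it assumes a quadratic ground truth y = 2x² + 3x + noise (borrowed from the worked example), inputs drawn uniformly from [-1, 1], and an ordinary least-squares polynomial fit via `np.polyfit`.

```python
import numpy as np


def simulate_cv(n=200, k=5, degree=2, sigma=0.5, seed=0):
    """Fit a polynomial in each fold; return (training MSE, CV MSE, SE of CV MSE)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x**2 + 3.0 * x + rng.normal(0.0, sigma, size=n)  # assumed ground truth

    folds = np.array_split(rng.permutation(n), k)  # shuffled, near-equal folds
    train_mse, valid_mse = [], []
    for i in range(k):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        train_pred = np.polyval(coeffs, x[train_idx])
        valid_pred = np.polyval(coeffs, x[valid_idx])
        train_mse.append(np.mean((y[train_idx] - train_pred) ** 2))
        valid_mse.append(np.mean((y[valid_idx] - valid_pred) ** 2))

    valid_mse = np.array(valid_mse)
    se = valid_mse.std(ddof=1) / np.sqrt(k)  # SE = s_CV / sqrt(k)
    return float(np.mean(train_mse)), float(valid_mse.mean()), float(se)


train_mse, cv_mse, se = simulate_cv()
print(f"training MSE={train_mse:.3f}  CV MSE={cv_mse:.3f}  SE={se:.3f}")
```

With σ=0.5 the noise floor is σ² = 0.25, so both errors should land near that value when the degree matches the true function. Raising `degree` or shrinking `n` is a quick way to watch the gap open up.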
Worked Example
Dataset: 200 samples, quadratic function y=2x²+3x+noise, k=5 folds, polynomial degree=2, σ=0.5. Each iteration trains on 160 samples and validates on 40. Training MSE converges to 0.28 after regularization. CV MSE averages 0.31 across five iterations (fold 1: 0.29, fold 2: 0.33, fold 3: 0.30, fold 4: 0.32, fold 5: 0.31). The fold-to-fold standard deviation is s_CV ≈ 0.016, so the CV standard error is 0.016/√5 ≈ 0.007. Generalization gap = [(0.31–0.28)/0.31]×100 ≈ 9.7%, indicating stable model fit without severe overfitting. Increasing degree to d=4 raises CV MSE to 0.48 and gap to 31%, signaling polynomial overfit.
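The arithmetic in this example can be re-derived from the five fold scores. The sketch below computes the mean, the fold-to-fold standard deviation, the standard error, and the gap under two normalizations: dividing by the CV error (the definition given in the FAQ) and dividing by the training error.

```python
import math
import statistics

fold_mse = [0.29, 0.33, 0.30, 0.32, 0.31]  # per-fold CV MSE from the example
train_mse = 0.28

cv_mse = statistics.mean(fold_mse)
s_cv = statistics.stdev(fold_mse)            # sample std-dev across folds
se = s_cv / math.sqrt(len(fold_mse))         # standard error of the mean
gap_vs_cv = (cv_mse - train_mse) / cv_mse * 100
gap_vs_train = (cv_mse - train_mse) / train_mse * 100

print(f"CV MSE={cv_mse:.2f}  s_CV={s_cv:.4f}  SE={se:.4f}")
print(f"gap relative to CV error:       {gap_vs_cv:.1f}%")
print(f"gap relative to training error: {gap_vs_train:.1f}%")
```

The spread across folds is about 0.016 and the standard error about 0.007. The gap is about 9.7% of the CV error, or about 10.7% if expressed relative to the training error.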
Practical Notes
For imbalanced regression targets (e.g., predicting rare equipment failures), stratify folds manually, for example by binning the target into quantiles and stratifying on the bin labels, to ensure each fold spans the full response range
With k=10 and N=100, each fold has only 10 validation samples; increase N to 500+ or reduce k to 5 for stable CV estimates in production CAE pipelines
Generalization gap > 25% typically signals overfitting (high variance); underfitting (high bias) shows up instead as high training and CV error with a small gap. Adjust model complexity and regularization strength accordingly
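CV standard error quantifies fold-to-fold variability; use it to compute approximate 95% confidence bands on CV MSE (roughly mean ± t·SE with k-1 degrees of freedom, keeping in mind that fold scores are correlated) for uncertainty quantification in FEA surrogate models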