
K-Fold Cross-Validation Simulator — Polynomial Degree Selection

Fit polynomials to 1D data and use K-fold CV to pick the best degree. Compare train MSE and CV MSE to see exactly where overfitting begins.

Parameters
Number of folds K
Sample size N
Score spread σ (%)
Animation speed (×)

Per-fold validation scores are synthetic (Math.random). More training data (larger K) lowers the bias, while smaller validation sets raise the variance (std) of the estimate.

While paused, move the sliders to update the result instantly.

Results
Number of folds K
Current fold
Mean CV score
CV score std

K-Fold Cross-Validation Animation

Each fold takes its turn as the validation set (orange) while the rest train (green). Per-fold scores accumulate into the CV mean ± std.
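The rotation described above can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's own code: the helper names `kfold_indices` and `simulate_cv_scores` are hypothetical, and the per-fold score is drawn from a Gaussian purely as a stand-in for the synthetic scores the page generates.

```python
import random
import statistics


def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def simulate_cv_scores(n_samples=20, k=5, mean_score=0.80, spread=0.05, seed=0):
    """Accumulate one synthetic score per fold and return (scores, mean, std)."""
    rng = random.Random(seed)
    scores = []
    for train_idx, val_idx in kfold_indices(n_samples, k):
        # Assumption: the score is a random draw, not the result of a real model fit.
        scores.append(rng.gauss(mean_score, spread))
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return scores, mean, std


scores, mean, std = simulate_cv_scores()
print(f"CV score: {mean:.3f} ± {std:.3f} over {len(scores)} folds")
```

Every sample lands in exactly one validation fold, so the K scores together cover the whole dataset once.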

Theory & Key Formulas

Polynomial regression model. A degree-$d$ polynomial is fitted by least squares:

$$\hat y(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d$$

Mean squared error on the training set:

$$\mathrm{MSE}_\text{train} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat y(x_i)\bigr)^2$$

K-fold cross-validation MSE. The data are split into $K$ (nearly) equal parts; each part serves as the test set in turn while the model is fitted on the remaining $K-1$ parts:

$$\mathrm{MSE}_\text{CV} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{MSE}_\text{test}^{(k)}$$

The optimum is $d^* = \arg\min_d \mathrm{MSE}_\text{CV}(d)$. Train MSE never increases with $d$ (each higher degree contains the lower ones as a special case), but CV MSE bottoms out and then rises; that bottom marks where overfitting begins.
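The formulas above translate fairly directly into NumPy. The sketch below is one possible implementation under stated assumptions, not the simulator's code: the function name `select_degree`, the input range $[-1, 1]$, the noise level 0.3 and the seeds are all illustrative choices, and the cubic $y = 2x^3 - x$ is borrowed from the worked example further down.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return float(np.mean((y_true - y_pred) ** 2))


def select_degree(x, y, max_degree=8, k=5, seed=0):
    """Return (best_degree, train_mse, cv_mse) for degrees 1..max_degree."""
    rng = np.random.default_rng(seed)
    # Shuffle once and reuse the same K folds for every degree.
    folds = np.array_split(rng.permutation(len(x)), k)
    train_mse, cv_mse = [], []
    for d in range(1, max_degree + 1):
        # Train MSE: fit on all data, evaluate on the same data.
        coeffs = np.polyfit(x, y, d)
        train_mse.append(mse(y, np.polyval(coeffs, x)))
        # CV MSE: average of the K held-out test MSEs.
        fold_mses = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            c = np.polyfit(x[train_idx], y[train_idx], d)
            fold_mses.append(mse(y[test_idx], np.polyval(c, x[test_idx])))
        cv_mse.append(float(np.mean(fold_mses)))
    best_degree = 1 + int(np.argmin(cv_mse))
    return best_degree, train_mse, cv_mse


# Assumed synthetic data: cubic signal plus Gaussian noise.
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2 * x**3 - x + rng.normal(0.0, 0.3, size=100)

best_degree, train_mse, cv_mse = select_degree(x, y)
print("d* =", best_degree)
for d, (tr, cv) in enumerate(zip(train_mse, cv_mse), start=1):
    print(f"degree {d}: train MSE {tr:.3f}, CV MSE {cv:.3f}")
```

Reusing the same folds for every degree keeps the comparison between degrees from being blurred by split-to-split noise.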

What is the K-Fold Cross-Validation Simulator?

🙋
You hear "cross-validation" everywhere in ML, but what is it really doing?
🎓
Roughly speaking, it is a procedure for estimating "how well the model fits unseen data" using only the data you have. You split the data into K chunks, hold one out as the test set, train on the rest, and repeat K times. Drop the "Number of folds K" slider above from 5 to 2 and the CV curve gets jagged: each fold now has more test data, but every model trains on only half the samples and only two fold scores are averaged.
🙋
In the chart the train MSE keeps falling, but the CV MSE bottoms out and then climbs. Is this overfitting?
🎓
Exactly. As you push up the polynomial degree, you can fit the training data as well as you like, but you start memorizing the noise and miss on new data. Look at the bottom plot: the green curve is the CV-optimal fit and the dashed purple is the maximum-degree fit. The purple wiggles to pass through nearly every point. That is what "memorizing the noise" looks like.
🙋
The true function is cubic. Why does d* often land on 3 or 4 instead of always 3?
🎓
Ideally we want 3 to win every time, but with noisy data, an extra coefficient often costs almost nothing. So 3 and 4 compete and the noise tips the balance. In real work you never know the "true degree", so practitioners often pick the simpler model from the flat region near the CV minimum — parsimony.
🙋
Does setting noise σ to zero make overfitting go away?
🎓
Try it. With σ=0 any polynomial of degree ≥ 3 reproduces the true function perfectly, and both train and CV MSE collapse to roughly zero. Real data always carry noise, so overfitting is a built-in risk. That is exactly why honest evaluation tools like CV are non-negotiable.
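The parsimony idea from the dialogue (prefer the simpler model from the flat region near the CV minimum) can be written as a small selection rule. This is only one way to formalize it: the relative tolerance of 5% and the function name `simplest_near_minimum` are assumptions for illustration, and the CV values are made up.

```python
def simplest_near_minimum(cv_mse_by_degree, tolerance=0.05):
    """Return the lowest degree whose CV MSE is within a relative tolerance of the minimum."""
    best = min(cv_mse_by_degree.values())
    threshold = best * (1.0 + tolerance)
    candidates = [d for d, m in cv_mse_by_degree.items() if m <= threshold]
    return min(candidates)


# Hypothetical CV curve: degree 4 is the strict minimum, degree 3 is nearly tied.
cv_curve = {1: 1.90, 2: 1.10, 3: 0.70, 4: 0.69, 5: 0.74, 6: 0.88}
print(min(cv_curve, key=cv_curve.get))   # strict argmin → 4
print(simplest_near_minimum(cv_curve))   # simplest degree within 5% of the minimum → 3
```

A common alternative uses the standard error of the fold scores instead of a fixed percentage; the fixed tolerance here just keeps the sketch short.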

Frequently Asked Questions

Leave-one-out (LOO) CV is the special case K = N, where each data point is held out one at a time. The bias is minimal because you train on almost all the data, but you have to fit N times, so the computation is heavy, and the variance is large because each test set is a single point. In practice K = 5 or 10 strikes a good balance, and you choose K based on the size of the dataset.
A holdout splits the data into train/test just once, so the score depends on the luck of the split. K-fold averages across K different splits, so the estimate is more stable. The smaller the dataset, the larger the advantage; with abundant data a simple holdout can be practical enough.
Use stratified K-fold for classification problems with imbalanced classes. Stratified sampling keeps the class proportions in each fold close to those of the full data, which avoids the instability of folds that happen to contain almost no minority-class samples. For regression you can also stratify on the binned target value.
If the same CV data both selects hyperparameters and reports the final score, you overfit to the CV score (meta-overfitting). Use nested CV (outer loop for evaluation, inner loop for tuning), or hold out a separate test set for the final report. This tool only visualizes degree selection; the true generalization of the chosen d* should be measured on yet another dataset.
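The stratified splitting mentioned in the answers above can be sketched as a round-robin deal of each class over the folds. This is a simplified illustration, not a library implementation: the helper name `stratified_fold_ids` and the label data are hypothetical, and no shuffling is done within a class.

```python
from collections import Counter, defaultdict


def stratified_fold_ids(labels, k):
    """Assign each sample a fold id so every class is spread evenly across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for members in by_class.values():
        # Deal the members of one class round-robin over the k folds.
        for position, idx in enumerate(members):
            fold_of[idx] = position % k
    return fold_of


# Hypothetical imbalanced labels: 20 majority samples, 5 minority samples.
labels = ["neg"] * 20 + ["pos"] * 5
fold_of = stratified_fold_ids(labels, k=5)
for fold in range(5):
    counts = Counter(l for l, f in zip(labels, fold_of) if f == fold)
    print(fold, dict(counts))
```

With these counts each of the five folds receives four majority samples and one minority sample, matching the 4:1 ratio of the full data.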

Real-World Applications

Hyperparameter tuning of ML models: scikit-learn's GridSearchCV and RandomizedSearchCV internally run K-fold CV to compare candidates and pick the best hyperparameters. Tree depth and learning rate of XGBoost or LightGBM, the regularization strength λ of ridge regression — almost every modern pipeline relies on this loop.
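As a rough illustration of that loop, the snippet below tunes the ridge penalty with `GridSearchCV` and 5-fold CV. The data, the candidate grid and the seeds are assumptions made up for this sketch; note that scikit-learn names the regularization strength `alpha` rather than λ.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Assumed synthetic regression data: 3 features, linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0.0, 0.3, size=100)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},  # candidate penalties
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",               # higher is better, so MSE is negated
)
search.fit(X, y)

print("best alpha:", search.best_params_["alpha"])
print("CV MSE at best alpha:", -search.best_score_)
```

The reported CV MSE is the score used for selection, so it carries the optimism discussed under "Common Misconceptions and Cautions".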

Accuracy assessment of CAE surrogate models: When you build a response-surface or Kriging surrogate for a heavy CAE solver, K-fold CV tells you how trustworthy the surrogate is at unseen design points. Without checking this accuracy before optimization, the search can chase artefacts in the surrogate's extrapolation.

Fitting constitutive laws to small experimental datasets: When you fit a yield curve or an S-N fatigue curve to a handful of tests, an over-flexible function will run through every data point and explode on extrapolation. CV helps pick a complexity level that does not memorize the data, which is the key to getting laws that actually work on new tests.

Time-series model evaluation (time-series CV): Plain K-fold cannot be used on time-ordered data like stock prices or demand: you would train on the future to predict the past. Instead, use an "expanding window CV" or a "sliding window CV" that always trains on the past and evaluates on the future.
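An expanding-window split can be sketched as follows. The function name `expanding_window_splits` and its parameters are hypothetical; the only property it is meant to show is that every training index comes before every test index.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) where training always precedes testing in time."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train_idx = list(range(0, test_start))                      # all of the past
        test_idx = list(range(test_start, test_start + test_size))  # the next block
        yield train_idx, test_idx


for train_idx, test_idx in expanding_window_splits(n_samples=10, n_splits=3, test_size=2):
    print(f"train 0..{train_idx[-1]}  test {test_idx}")
# train 0..3  test [4, 5]
# train 0..5  test [6, 7]
# train 0..7  test [8, 9]
```

A sliding-window variant would keep the training window at a fixed length, for example `range(test_start - window, test_start)`, instead of always starting at index 0.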

Common Misconceptions and Cautions

The most common misconception is to think that "CV completely prevents overfitting". CV is just a measure of how much you are overfitting, not a mechanism that stops it. If you only consider over-complex models, CV will report uniformly bad scores. CV is a tool for comparing candidates; if no good model is among them, no amount of CV will help. Raise σ in the simulator and watch how no degree manages to bring the CV MSE down.

Next, do not assume that "larger K is always better". Larger K reduces bias by training on more data per fold, but the test sets shrink and the variance of the estimate increases, plus the compute cost grows linearly. K = 5 or K = 10 are widely used precisely because they balance bias, variance and cost. Compare K = 2 and K = 10 in the simulator and you will see the latter gives a smoother CV curve.

Finally, do not equate "the CV score of the chosen model" with "the true generalization of that model". Selecting the best of many candidates injects a selection bias into the CV score, making it optimistic. This is fatal in competitions or papers. The correct workflow keeps a separate holdout test set for the final report, or uses Nested CV. In this tool, the CV MSE at d* is also not the true performance of d*.
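Nested CV can be sketched for the polynomial-degree setting as below. This is a minimal illustration under assumptions: the function names, the synthetic cubic data, the noise level and the seeds are all made up for the example and are not taken from the simulator.

```python
import numpy as np


def fold_mse(x_tr, y_tr, x_te, y_te, degree):
    """Fit a polynomial on the training part and return the test MSE."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    return float(np.mean((y_te - np.polyval(coeffs, x_te)) ** 2))


def cv_mse(x, y, degree, folds):
    """Plain K-fold CV MSE for one degree, using precomputed folds."""
    errs = []
    for i, te in enumerate(folds):
        tr = np.concatenate([f for j, f in enumerate(folds) if j != i])
        errs.append(fold_mse(x[tr], y[tr], x[te], y[te], degree))
    return float(np.mean(errs))


def nested_cv(x, y, degrees, outer_k=5, inner_k=5, seed=0):
    """Return (mean outer MSE, degree chosen in each outer fold)."""
    rng = np.random.default_rng(seed)
    outer = np.array_split(rng.permutation(len(x)), outer_k)
    outer_scores, chosen = [], []
    for i, te in enumerate(outer):
        tr = np.concatenate([f for j, f in enumerate(outer) if j != i])
        x_tr, y_tr = x[tr], y[tr]
        # Inner loop: choose the degree using only the outer-training data.
        inner_folds = np.array_split(rng.permutation(len(tr)), inner_k)
        inner = {d: cv_mse(x_tr, y_tr, d, inner_folds) for d in degrees}
        d_star = min(inner, key=inner.get)
        chosen.append(d_star)
        # Outer loop: score the chosen degree on data the selection never saw.
        outer_scores.append(fold_mse(x_tr, y_tr, x[te], y[te], d_star))
    return float(np.mean(outer_scores)), chosen


# Assumed synthetic data: cubic signal plus Gaussian noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2 * x**3 - x + rng.normal(0.0, 0.3, size=100)

score, chosen = nested_cv(x, y, degrees=range(1, 7))
print("nested-CV MSE:", round(score, 3), "chosen degrees:", chosen)
```

The outer score estimates the performance of the whole "select a degree, then fit" procedure, which is why it avoids the selection bias of quoting the inner CV minimum.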

How to Use

  1. Set polynomial degree range (max degree, slDmax): typical range 1–10 for 1D synthetic data.
  2. Configure K-fold parameter (slK): use K=5 or K=10; higher K reduces bias but increases computation.
  3. Specify noise level (slSigma): Gaussian noise standard deviation; 0.5 mimics measurement error on unit-scale data.
  4. Set sample size (slN): 50–200 points; smaller N amplifies overfitting risk, visible in CV MSE vs. train MSE divergence.
  5. Run simulator; observe CV-optimal degree d* and overfit ratio (CV MSE / train MSE).

Worked Example

Fit a polynomial to 100 noisy samples (slN=100, slSigma=0.8) drawn from the true cubic function y = 2x³ − x + noise. Set slDmax=8 and slK=5. The simulator returns d*=3 (CV-optimal), minimum CV MSE=0.68, train MSE at d*=0.52, and overfit ratio=1.31. Degree 4 shows train MSE=0.48 but CV MSE=0.95 (ratio 1.98), indicating overfitting. Degree 2 gives CV MSE=1.10 (underfitting). Choosing d*=3 recovers the true model complexity without excess variance.
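The ratios quoted in this example are consistent with dividing CV MSE by train MSE, which can be checked in a couple of lines. The helper name `overfit_ratio` is hypothetical; the numbers are the ones given above.

```python
def overfit_ratio(cv_mse, train_mse):
    """CV MSE divided by train MSE; values well above 1 suggest overfitting."""
    return cv_mse / train_mse


# Figures quoted in the worked example above.
print(round(overfit_ratio(0.68, 0.52), 2))  # degree 3 → 1.31
print(round(overfit_ratio(0.95, 0.48), 2))  # degree 4 → 1.98
```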

Practical Notes

  1. Overfit ratio > 1.5 signals excessive model flexibility; use lower degree or increase regularization (L2 penalty).
  2. For production control-chart data with N=200 samples and K=10, CV MSE stabilizing near degree 3–4 typically indicates the sweet spot; higher degrees yield diminishing returns.
  3. Noise level matters: slSigma=0.2 on calibration sensor data favors lower degrees; slSigma=1.5 on raw pressure transducers may require degree 5+.