K-Fold Cross-Validation Simulator — Polynomial Degree Selection

Fit polynomials to 1D data and use K-fold CV to pick the best degree. Compare train MSE and CV MSE to see exactly where overfitting begins.

Parameters
Max degree d_max
Number of folds K
Noise σ
Sample size N

The true function is $y = 2x - 0.5x^2 + 0.05x^3$ on $x \in [0,5]$. Data are generated deterministically (LCG seed = 42).
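A minimal sketch of how such a deterministic dataset could be produced. Only the true function, the interval, and the seed come from the text. The LCG constants (the Numerical Recipes values), the Box-Muller transform for Gaussian noise, the evenly spaced x grid, and the names `LCG` and `make_data` are assumptions for illustration; the simulator may use something different.

```python
import math

def true_function(x):
    """Cubic ground truth stated in the text."""
    return 2.0 * x - 0.5 * x**2 + 0.05 * x**3

class LCG:
    """Tiny linear congruential generator. The constants are an assumption."""
    def __init__(self, seed=42):
        self.state = seed % 2**32

    def uniform(self):
        # Numerical Recipes constants; the simulator's real constants are unknown.
        self.state = (1664525 * self.state + 1013904223) % 2**32
        return (self.state + 1) / (2**32 + 1)  # strictly inside (0, 1)

def make_data(n=30, sigma=0.5, seed=42):
    """Return evenly spaced x in [0, 5] and y = f(x) + Gaussian noise (n >= 2)."""
    rng = LCG(seed)
    xs, ys = [], []
    for i in range(n):
        x = 5.0 * i / (n - 1)
        # Box-Muller: turn two uniforms into one standard normal sample.
        u1, u2 = rng.uniform(), rng.uniform()
        z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
        xs.append(x)
        ys.append(true_function(x) + sigma * z)
    return xs, ys
```

Because the generator is seeded, calling `make_data` twice with the same arguments returns the same points, which is what makes a run reproducible.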

Results
CV-optimal degree d*
Minimum CV MSE
Train MSE at d*
Overfit ratio
Train MSE and CV MSE / Data with Fits

Top: train MSE (blue) and CV MSE (red) vs degree d, ★ marks the CV optimum. Bottom: data points + best-degree fit (green) + max-degree overfit (dashed purple).

Theory & Key Formulas

Polynomial regression model. A degree-$d$ polynomial is fitted by least squares:

$$\hat y(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d$$

Mean squared error on the training set:

$$\mathrm{MSE}_\text{train} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat y(x_i)\bigr)^2$$

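K-fold cross-validation MSE. The data are split into $K$ folds of (nearly) equal size; each fold serves as the test set once while the model is fitted on the remaining $K-1$ folds, giving a test error $\mathrm{MSE}_\text{test}^{(k)}$ for fold $k$: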

$$\mathrm{MSE}_\text{CV} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{MSE}_\text{test}^{(k)}$$

The optimum is $d^* = \arg\min_d \mathrm{MSE}_\text{CV}(d)$. Because every degree-$d$ polynomial is also a degree-$(d+1)$ polynomial, the train MSE can never increase with $d$. The CV MSE, in contrast, bottoms out and then rises, and that bottom marks where overfitting begins.
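The formulas above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the simulator's own code: the shuffled fold assignment, the NumPy random generator used for the example data, and the function names are assumptions.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k folds of nearly equal size."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((y - np.polyval(coeffs, x)) ** 2))

def train_and_cv_mse(x, y, degree, k=5, seed=0):
    """Return (train MSE on all data, K-fold CV MSE) for one degree."""
    train_mse = mse(np.polyfit(x, y, degree), x, y)
    folds = kfold_indices(len(x), k, seed)
    fold_mse = []
    for i, test_idx in enumerate(folds):
        # Fit on every fold except fold i, then test on fold i.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        fold_mse.append(mse(coeffs, x[test_idx], y[test_idx]))
    return train_mse, float(np.mean(fold_mse))

def select_degree(x, y, d_max=8, k=5, seed=0):
    """Evaluate degrees 1..d_max and return the CV-optimal degree and all scores."""
    results = {d: train_and_cv_mse(x, y, d, k, seed) for d in range(1, d_max + 1)}
    d_star = min(results, key=lambda d: results[d][1])
    return d_star, results

# Example data: the cubic from the text plus Gaussian noise (sigma = 0.3 assumed).
rng = np.random.default_rng(42)
x = np.linspace(0.0, 5.0, 40)
y = 2 * x - 0.5 * x**2 + 0.05 * x**3 + rng.normal(0.0, 0.3, x.size)
d_star, results = select_degree(x, y)
for d, (tr, cv) in results.items():
    print(f"d={d}  train MSE={tr:.4f}  CV MSE={cv:.4f}")
print("CV-optimal degree:", d_star)
```

The same folds are reused for every degree so that the candidates are compared on identical splits.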

What is the K-Fold Cross-Validation Simulator

🙋
You hear "cross-validation" everywhere in ML, but what is it really doing?
🎓
Roughly speaking, it is a procedure for estimating how well the model fits unseen data using only the data you have. You split the data into K chunks, hold one out as the test set, train on the rest, and repeat K times so that every chunk is tested once. Drop the "Number of folds K" slider above from 5 to 2 and the CV curve gets jagged. Each test fold is larger, but every model is now trained on only half of the data, so the fits themselves become less reliable.
🙋
In the chart the train MSE keeps falling, but the CV MSE bottoms out and then climbs. Is this overfitting?
🎓
Exactly. As you push up the polynomial degree, you can fit the training data as closely as you like, but you start memorizing the noise and miss on new data. Look at the bottom plot: the green curve is the CV-optimal fit and the dashed purple curve is the maximum-degree fit. The purple curve wiggles to pass through nearly every point. That is what "memorizing the noise" looks like.
🙋
The true function is cubic. Why does d* often land on 3 or 4 instead of always 3?
🎓
Ideally we want 3 to win every time, but with noisy data an extra coefficient often costs almost nothing. So 3 and 4 compete and the noise tips the balance. In real work you never know the "true degree", so practitioners often pick the simpler model from the flat region near the CV minimum. That preference for simplicity is called parsimony.
🙋
Does setting noise σ to zero make overfitting go away?
🎓
Try it. With σ = 0, any polynomial of degree ≥ 3 reproduces the true cubic exactly, so both train and CV MSE collapse to roughly zero for those degrees. Lower degrees still show an error, but that is underfitting, not overfitting. Real data always carry noise, so overfitting is a built-in risk. That is exactly why honest evaluation tools like CV are non-negotiable.
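The parsimony idea from the dialogue above (pick the simpler model from the flat region near the CV minimum) can be sketched as a small rule. The 10% relative tolerance and the function name are assumptions chosen for illustration; the one-standard-error rule is a common alternative way to define "near the minimum".

```python
def simplest_near_minimum(cv_mse_by_degree, tolerance=0.10):
    """Return the lowest degree whose CV MSE is within a relative tolerance of the minimum."""
    best = min(cv_mse_by_degree.values())
    threshold = best * (1.0 + tolerance)
    return min(d for d, m in cv_mse_by_degree.items() if m <= threshold)

# Hypothetical CV MSE values for illustration only (not simulator output).
cv_mse = {1: 0.90, 2: 0.35, 3: 0.101, 4: 0.100, 5: 0.12, 6: 0.20}
print(simplest_near_minimum(cv_mse))       # 3: within 10% of the minimum at degree 4
print(simplest_near_minimum(cv_mse, 0.0))  # 4: the plain arg-min
```

With these made-up numbers the strict minimum sits at degree 4, but degree 3 is practically tied, so the rule returns the simpler model.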

Frequently Asked Questions

Leave-one-out (LOO) CV is the special case K = N, where each data point is held out once. The bias is minimal because each model trains on almost all the data, but you have to fit N times, so the computation is heavy, and the estimate can have high variance because each test set is a single point. In practice K = 5 or 10 strikes a good balance, and you choose K based on the size of the dataset.
Compared with K-fold, a simple holdout splits the data into train/test just once, so the score depends on the luck of the split. K-fold averages across K different splits, so the estimate is more stable. The smaller the dataset, the larger the advantage; with abundant data a simple holdout can be practical enough.
Stratified K-fold is meant for classification problems with imbalanced classes. Stratified sampling keeps the class proportions in each fold close to those of the full data, which avoids the instability of folds that happen to contain almost no minority-class samples. For regression you can also stratify on the binned target value.
If the same CV data both selects hyperparameters and reports the final score, you overfit to the CV score (meta-overfitting). Use Nested CV (outer loop for evaluation, inner loop for tuning), or hold out a separate test set for the final report. This tool only visualizes degree selection; the true generalization of the chosen d* should be measured on yet another dataset.
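The stratified splitting mentioned in the FAQ above can be sketched as follows. This is a simplified illustration with a hypothetical function name: samples of each class are dealt round-robin to the folds without shuffling, whereas a practical implementation would usually shuffle within each class first.

```python
from collections import defaultdict

def stratified_fold_ids(labels, k):
    """Assign each sample a fold id in 0..k-1 so every class is spread evenly over the folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_ids = [None] * len(labels)
    offset = 0
    for label in sorted(by_class):
        for j, idx in enumerate(by_class[label]):
            fold_ids[idx] = (offset + j) % k
        # Continue the round-robin where the previous class stopped,
        # so the total fold sizes stay balanced as well.
        offset += len(by_class[label])
    return fold_ids

# Imbalanced toy labels: 90 samples of class 0 and 10 samples of class 1.
labels = [0] * 90 + [1] * 10
folds = stratified_fold_ids(labels, k=5)
for f in range(5):
    members = [labels[i] for i in range(len(labels)) if folds[i] == f]
    print(f, members.count(0), members.count(1))
```

In this toy case every fold ends up with 18 majority-class and 2 minority-class samples, so no fold is left without the minority class.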

Real-World Applications

Hyperparameter tuning of ML models: scikit-learn's GridSearchCV and RandomizedSearchCV internally run K-fold CV to compare candidates and pick the best hyperparameters. Whether it is the tree depth and learning rate of XGBoost or LightGBM or the regularization strength λ of ridge regression, many modern pipelines rely on this loop.

Accuracy assessment of CAE surrogate models: When you build a response-surface or Kriging surrogate for a heavy CAE solver, K-fold CV tells you how trustworthy the surrogate is at unseen design points. Without checking this accuracy before optimization, the search can chase artefacts in the surrogate's extrapolation.

Fitting constitutive laws to small experimental datasets: When you fit a yield curve or an S-N fatigue curve to a handful of tests, an over-flexible function will run through every data point and explode on extrapolation. CV helps pick a complexity level that does not memorize the data, which is the key to getting laws that actually work on new tests.

Time-series model evaluation (time-series CV): Plain K-fold cannot be used on time-ordered data like stock prices or demand: you would train on the future to predict the past. Instead, use an "expanding window CV" or a "sliding window CV" that always trains on the past and evaluates on the future.
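A minimal sketch of the two time-ordered schemes named above. The function name and its parameters are hypothetical; leaving `max_train` as `None` gives an expanding window, while setting it to a fixed length gives a sliding window.

```python
def time_series_splits(n, n_splits, test_size, max_train=None):
    """Yield (train_indices, test_indices); training data always precede the test block."""
    first_test_start = n - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for s in range(n_splits):
        test_start = first_test_start + s * test_size
        # Expanding window starts at 0; sliding window keeps only the last max_train points.
        train_start = 0 if max_train is None else max(0, test_start - max_train)
        yield (list(range(train_start, test_start)),
               list(range(test_start, test_start + test_size)))

# Expanding window on 10 time steps
for train, test in time_series_splits(10, n_splits=3, test_size=2):
    print(train, test)

# Sliding window with at most 3 training points
for train, test in time_series_splits(10, n_splits=3, test_size=2, max_train=3):
    print(train, test)
```

In both variants the largest training index is always smaller than the smallest test index, so the model never sees the future.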

Common Misconceptions and Cautions

The most common misconception is to think that "CV completely prevents overfitting". CV only measures how much you are overfitting; it is not a mechanism that stops it. If you only consider over-complex models, CV will report uniformly bad scores. CV is a tool for comparing candidates; if no good model is among them, no amount of CV will help. Raise σ in the simulator and watch how no degree manages to bring the CV MSE down: the noise sets a floor of roughly σ² that no candidate can beat.

Next, do not assume that "larger K is always better". Larger K reduces bias by training on more data per fold, but the test sets shrink and the variance of the estimate increases, plus the compute cost grows linearly. K = 5 or K = 10 are widely used precisely because they balance bias, variance and cost. Compare K = 2 and K = 10 in the simulator and you will see the latter gives a smoother CV curve.

Finally, do not equate "the CV score of the chosen model" with "the true generalization of that model". Selecting the best of many candidates injects a selection bias into the CV score, making it optimistic. This is fatal in competitions or papers. The correct workflow keeps a separate holdout test set for the final report, or uses Nested CV. In this tool, the CV MSE at d* is also not the true performance of d*.
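The nested CV workflow mentioned above can be sketched for the polynomial-degree problem like this. It is an illustrative sketch under assumptions (shuffled folds, NumPy example data with σ = 0.3, hypothetical function names), not the tool's implementation.

```python
import numpy as np

def fold_split(n, k, seed):
    """Shuffle 0..n-1 and split into k folds of nearly equal size."""
    return np.array_split(np.random.default_rng(seed).permutation(n), k)

def cv_mse(x, y, degree, k, seed=0):
    """Plain K-fold CV MSE of a degree-`degree` polynomial fit."""
    folds = fold_split(len(x), k, seed)
    errs = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        c = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((y[test] - np.polyval(c, x[test])) ** 2))
    return float(np.mean(errs))

def nested_cv_mse(x, y, degrees, k_outer=5, k_inner=5, seed=0):
    """Outer loop estimates generalization; inner loop picks the degree."""
    outer = fold_split(len(x), k_outer, seed)
    outer_errs, chosen = [], []
    for i, test in enumerate(outer):
        train = np.concatenate([f for j, f in enumerate(outer) if j != i])
        # Inner loop: select the degree using the outer-training data only.
        d_star = min(degrees,
                     key=lambda d: cv_mse(x[train], y[train], d, k_inner, seed + 1))
        # Refit on all outer-training data and score on the untouched outer test fold.
        c = np.polyfit(x[train], y[train], d_star)
        outer_errs.append(np.mean((y[test] - np.polyval(c, x[test])) ** 2))
        chosen.append(d_star)
    return float(np.mean(outer_errs)), chosen

rng = np.random.default_rng(42)
x = np.linspace(0.0, 5.0, 60)
y = 2 * x - 0.5 * x**2 + 0.05 * x**3 + rng.normal(0.0, 0.3, x.size)
score, chosen = nested_cv_mse(x, y, degrees=range(1, 7))
print("nested CV MSE:", round(score, 4), "degrees chosen per outer fold:", chosen)
```

The outer test fold never takes part in choosing the degree, so the outer score is not affected by the selection bias described above. The chosen degree may differ between outer folds, which is itself a useful indication of how stable the selection is.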