Gaussian Process Regression Simulator

Gaussian Process Regression — RBF Kernel & 95% Confidence Band

Visualize 1D GP regression in real time. Adjust data count, length scale, signal variance and noise to learn how Bayesian uncertainty estimates behave.

Parameters
Data count N (pts)
Length scale l
Signal variance σ_f
Noise σ_n

The data are N observations of the true function $y=\sin(x)+0.5\cos(2x)$ with deterministic pseudo-random noise (LCG, seed = 42), so the same settings always produce the same points.
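A reproducible data set of this kind could be generated roughly as follows. This is a minimal sketch, not the simulator's actual code: the LCG constants, the input range [0, 10], the Box-Muller step and the helper names `lcg` and `make_observations` are all assumptions chosen for illustration; only the true function and the seed value come from the description above.

```python
import math

def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Yield floats in [0, 1). The constants are an assumption, not the simulator's."""
    state = seed
    while True:
        state = (a * state + c) % m
        yield state / m

def true_function(x):
    return math.sin(x) + 0.5 * math.cos(2.0 * x)

def make_observations(n, sigma_n, seed=42, x_min=0.0, x_max=10.0):
    """Return (xs, ys): n noisy samples of the true function, reproducible per seed."""
    rng = lcg(seed)
    xs, ys = [], []
    for _ in range(n):
        x = x_min + (x_max - x_min) * next(rng)
        # Box-Muller transform: two uniform draws -> one standard normal draw.
        u1 = 1.0 - next(rng)   # lies in (0, 1], so log(u1) is finite
        u2 = next(rng)
        z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
        xs.append(x)
        ys.append(true_function(x) + sigma_n * z)
    return xs, ys

xs, ys = make_observations(n=10, sigma_n=0.1)
print(list(zip(xs, ys))[:3])
```

Because the generator is seeded, calling `make_observations` twice with the same arguments returns identical data, which is what makes slider experiments comparable.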

Results
Predictive mean μ at x*=5
Predictive std σ at x*=5
Mean uncertainty over grid
Log marginal likelihood
Predictive mean and 95% confidence band

Gray dashed = true function $\sin(x)+0.5\cos(2x)$ / black circles = observations / blue line = predictive mean μ(x) / blue band = μ ± 1.96σ

Theory & Key Formulas

GP regression represents the covariance between observation points by a kernel function, then conditions on the data to obtain the predictive distribution.

RBF kernel (signal variance $\sigma_f^2$, length scale $l$):

$$k(x, x') = \sigma_f^2 \exp\!\left(-\frac{(x - x')^2}{2 l^2}\right)$$
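The kernel formula translates almost directly into NumPy. The sketch below assumes 1D inputs; the function name `rbf_kernel` and its argument names are illustrative choices, not part of the simulator.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0, sigma_f=1.0):
    """k(x, x') = sigma_f^2 * exp(-(x - x')^2 / (2 l^2)) for 1D inputs.

    Returns a matrix of shape (len(xa), len(xb)).
    """
    xa = np.asarray(xa, dtype=float).reshape(-1, 1)
    xb = np.asarray(xb, dtype=float).reshape(1, -1)
    sq_dist = (xa - xb) ** 2            # pairwise squared distances by broadcasting
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length_scale**2)

x = [0.0, 1.0, 3.0]
K = rbf_kernel(x, x, length_scale=1.0, sigma_f=2.0)
print(np.round(K, 3))
```

The diagonal equals $\sigma_f^2$ (here 4), and the entries shrink as the two inputs move apart, which is how the kernel encodes "nearby points have similar values".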

Kernel matrix with added noise variance ($\delta_{ij}$ is the Kronecker delta):

$$K_{ij} = k(x_i, x_j) + \sigma_n^2 \delta_{ij}$$

Predictive mean and variance at a test point $x_\ast$:

$$\mu(x_\ast) = \mathbf{k}_\ast^\top K^{-1} \mathbf{y}, \qquad \sigma^2(x_\ast) = k(x_\ast, x_\ast) - \mathbf{k}_\ast^\top K^{-1} \mathbf{k}_\ast$$
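One common way to evaluate these two formulas is through a Cholesky factorization of $K$ instead of an explicit inverse. The following is a sketch under that assumption (zero prior mean, 1D inputs); `gp_predict` and the training data are illustrative and are not taken from the simulator.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale, sigma_f):
    xa = np.asarray(xa, dtype=float).reshape(-1, 1)
    xb = np.asarray(xb, dtype=float).reshape(1, -1)
    return sigma_f**2 * np.exp(-0.5 * (xa - xb) ** 2 / length_scale**2)

def gp_predict(x_train, y_train, x_test, length_scale=1.0, sigma_f=1.0, sigma_n=0.1):
    """Return predictive mean and variance of the latent function at x_test."""
    y = np.asarray(y_train, dtype=float)
    n = len(y)
    # K_ij = k(x_i, x_j) + sigma_n^2 * delta_ij
    K = rbf_kernel(x_train, x_train, length_scale, sigma_f) + sigma_n**2 * np.eye(n)
    K_s = rbf_kernel(x_train, x_test, length_scale, sigma_f)   # column j is k_* for test point j
    L = np.linalg.cholesky(K)                                  # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))        # alpha = K^{-1} y
    mean = K_s.T @ alpha                                       # k_*^T K^{-1} y
    v = np.linalg.solve(L, K_s)                                # v = L^{-1} k_*
    var = sigma_f**2 - np.sum(v**2, axis=0)                    # k(x*, x*) - k_*^T K^{-1} k_*
    return mean, np.maximum(var, 0.0)                          # clip tiny negative round-off

x_train = np.array([0.5, 2.0, 3.5, 6.0, 8.0])
y_train = np.sin(x_train) + 0.5 * np.cos(2.0 * x_train)
mean, var = gp_predict(x_train, y_train, [5.0])
print("mu(5) =", mean[0], " sigma(5) =", np.sqrt(var[0]))
```

The variance returned here is that of the latent function, matching the formula above; to predict a new noisy observation, $\sigma_n^2$ would be added to it.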

The 95% confidence band is $\mu \pm 1.96\,\sigma$. The variance naturally grows away from observed points, so uncertainty inflates automatically where there is no data.
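The simplest case makes this behavior explicit. As a worked example (an illustration added here, assuming a single observation at $x_1$), the kernel matrix is the scalar $\sigma_f^2 + \sigma_n^2$ and $\mathbf{k}_\ast$ is the scalar $k(x_\ast, x_1)$, so the variance formula reduces to

$$\sigma^2(x_\ast) = \sigma_f^2 - \frac{\sigma_f^4}{\sigma_f^2 + \sigma_n^2}\,\exp\!\left(-\frac{(x_\ast - x_1)^2}{l^2}\right)$$

At $x_\ast = x_1$ this gives $\sigma_f^2 \sigma_n^2 / (\sigma_f^2 + \sigma_n^2)$, which is below both the prior variance and the noise variance. As $|x_\ast - x_1|$ grows, the exponential goes to zero and $\sigma^2(x_\ast) \to \sigma_f^2$, so the band widens toward $\pm 1.96\,\sigma_f$.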

What is the Gaussian Process Regression Simulator?

🙋
I only know least-squares regression. What is so different about Gaussian process regression?
🎓
The big difference is that "every prediction comes with its own uncertainty". Least squares gives you one best line or curve; GP gives you, at every test point, a pair of (mean, spread). In the simulator above, the blue solid line is the predictive mean, and the pale blue band is the 95% confidence interval. Notice how the band is narrow near the observations (black circles) and wider away from them.
🙋
It really does! The band balloons in the regions with no data. Is that automatic?
🎓
Yes. No one programmed "make the uncertainty larger far from data"; the math of the GP does it on its own. The predictive variance is $\sigma^2(x_\ast) = k(x_\ast,x_\ast) - \mathbf{k}_\ast^\top K^{-1} \mathbf{k}_\ast$. Far from any observation, every entry of $\mathbf{k}_\ast$ is close to zero, so the subtracted second term shrinks and the variance stays close to the prior variance $k(x_\ast,x_\ast) = \sigma_f^2$. That is the secret of "uncertainty for free".
🙋
If I move the "length scale" slider to the left, the curve gets wiggly. If I push it to the right, it becomes nearly a straight line.
🎓
That is the effect of l in the RBF kernel $\exp\!\left(-(x-x')^2 / (2l^2)\right)$. l controls how far apart two points can be and still be considered similar. Small l means only very close points influence each other, so the function wiggles. Large l makes even distant points "similar", so everything turns smooth and the true variation is washed out. In practice, l is usually chosen automatically by maximizing the log marginal likelihood. Watch the logML card while moving l: the value of l that maximizes logML is roughly the right one.
🙋
Got it. And what does increasing the "noise σ_n" do?
🎓
Larger σ_n tells the GP "the observations are not very accurate", so the mean curve no longer passes exactly through every data point and becomes smoother. Very small σ_n forces the curve to interpolate every point and tends to overfit. In practice, the "true noise level" matters a lot, and it is also tuned automatically by maximizing the log marginal likelihood. The really powerful thing about GP is that l, σ_f and σ_n can all be learned consistently from the data within one framework.
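For a zero-mean GP the log marginal likelihood mentioned in the conversation has the standard closed form $\log p(\mathbf{y}|X,\theta) = -\tfrac{1}{2}\mathbf{y}^\top K^{-1}\mathbf{y} - \tfrac{1}{2}\log|K| - \tfrac{N}{2}\log 2\pi$. The sketch below evaluates it with a Cholesky factorization and scans a grid of length scales, mimicking what moving the l slider does. The data, the grid range and the function names are illustrative assumptions, and a grid scan stands in for a proper optimizer.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale, sigma_f):
    xa = np.asarray(xa, dtype=float).reshape(-1, 1)
    xb = np.asarray(xb, dtype=float).reshape(1, -1)
    return sigma_f**2 * np.exp(-0.5 * (xa - xb) ** 2 / length_scale**2)

def log_marginal_likelihood(x, y, length_scale, sigma_f, sigma_n):
    y = np.asarray(y, dtype=float)
    n = len(y)
    K = rbf_kernel(x, x, length_scale, sigma_f) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))              # equals -0.5 * log|K|
    return data_fit + complexity - 0.5 * n * np.log(2.0 * np.pi)

x = np.linspace(0.0, 10.0, 15)
y = np.sin(x) + 0.5 * np.cos(2.0 * x)      # noise-free targets, for illustration only
candidates = np.linspace(0.2, 5.0, 25)
scores = [log_marginal_likelihood(x, y, l, 1.0, 0.1) for l in candidates]
best_l = candidates[int(np.argmax(scores))]
print("length scale with highest logML on this grid:", best_l)
```

The first term rewards fitting the data and the second penalizes overly flexible kernels, which is why the maximum usually sits between the "too wiggly" and "too smooth" extremes.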

Frequently Asked Questions

Kernels other than the RBF: Many alternatives exist; Matérn (ν = 1/2, 3/2, 5/2), rational quadratic, periodic and linear kernels are common. The RBF is infinitely differentiable, which can be too smooth for real data; Matérn 3/2 or 5/2 captures more realistic roughness. Periodic kernels handle seasonal or oscillatory phenomena, and additive or multiplicative kernel composition lets you combine several behaviors.
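For reference, the three half-integer Matérn kernels have simple closed forms in the distance $r = |x - x'|$. The sketch below implements them with the same $\sigma_f$ and $l$ parameters as the RBF kernel; the function name `matern` and its interface are illustrative assumptions.

```python
import numpy as np

def matern(r, length_scale=1.0, sigma_f=1.0, nu=1.5):
    """Matern kernel as a function of distance r, for nu in {0.5, 1.5, 2.5}."""
    r = np.abs(np.asarray(r, dtype=float))
    if nu == 0.5:      # exponential kernel: continuous but not differentiable paths
        return sigma_f**2 * np.exp(-r / length_scale)
    if nu == 1.5:      # once-differentiable paths
        s = np.sqrt(3.0) * r / length_scale
        return sigma_f**2 * (1.0 + s) * np.exp(-s)
    if nu == 2.5:      # twice-differentiable paths
        s = np.sqrt(5.0) * r / length_scale
        return sigma_f**2 * (1.0 + s + s**2 / 3.0) * np.exp(-s)
    raise ValueError("this sketch only handles nu in {0.5, 1.5, 2.5}")

r = np.linspace(0.0, 3.0, 7)
for nu in (0.5, 1.5, 2.5):
    print(nu, np.round(matern(r, nu=nu), 3))
```

As ν increases the kernel becomes smoother near $r = 0$, and in the limit ν → ∞ it approaches the RBF kernel.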
Hyperparameter optimization: The log marginal likelihood log p(y|X, θ) is maximized over the hyperparameters θ = (l, σ_f, σ_n) with a gradient-based optimizer such as L-BFGS. This objective balances data fit against model complexity, so it acts as built-in regularization that guards against overfitting. Libraries such as sklearn.gaussian_process, GPy and GPflow handle the gradients and optimization for you.
Sparse GPs for large N: Exact GP inference costs O(N^3), which stops being feasible once N reaches thousands or tens of thousands, so the model is approximated using M inducing points with M much smaller than N. SVGP (stochastic variational GP) learns the inducing points via variational inference and supports mini-batch training. FITC, DTC and KISS-GP are other well-known approximations. Together these methods push the tractable problem size into the millions, trading some accuracy for lower cost.
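To make the inducing-point idea concrete, here is a sketch of the DTC (deterministic training conditional) predictive equations with fixed, evenly spaced inducing points. It is one of the simpler approximations named above, not SVGP, and the data, the inducing locations and the function name `dtc_predict` are assumptions for illustration. Only M × M systems are solved, so the cost is O(N M^2) rather than O(N^3).

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale, sigma_f):
    xa = np.asarray(xa, dtype=float).reshape(-1, 1)
    xb = np.asarray(xb, dtype=float).reshape(1, -1)
    return sigma_f**2 * np.exp(-0.5 * (xa - xb) ** 2 / length_scale**2)

def dtc_predict(x_train, y_train, x_inducing, x_test,
                length_scale=1.0, sigma_f=1.0, sigma_n=0.1, jitter=1e-6):
    """DTC approximation of the GP predictive mean and latent variance."""
    y = np.asarray(y_train, dtype=float)
    m = len(x_inducing)
    K_uu = rbf_kernel(x_inducing, x_inducing, length_scale, sigma_f) + jitter * np.eye(m)
    K_uf = rbf_kernel(x_inducing, x_train, length_scale, sigma_f)   # shape (M, N)
    K_us = rbf_kernel(x_inducing, x_test, length_scale, sigma_f)    # shape (M, T)
    # A = K_uu + sigma_n^-2 K_uf K_fu is only M x M.
    A = K_uu + (K_uf @ K_uf.T) / sigma_n**2
    mean = K_us.T @ np.linalg.solve(A, K_uf @ y) / sigma_n**2
    var = (sigma_f**2
           - np.sum(K_us * np.linalg.solve(K_uu, K_us), axis=0)
           + np.sum(K_us * np.linalg.solve(A, K_us), axis=0))
    return mean, np.maximum(var, 0.0)

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 10.0, 2000)
y_train = np.sin(x_train) + 0.5 * np.cos(2.0 * x_train) + 0.1 * rng.standard_normal(2000)
x_inducing = np.linspace(0.0, 10.0, 20)          # M = 20, far fewer than N = 2000
mean, var = dtc_predict(x_train, y_train, x_inducing, [5.0])
print("DTC mu(5) =", mean[0], " sigma(5) =", np.sqrt(var[0]))
```

When the inducing points coincide with the training points, these equations reduce to the exact GP prediction, which is a convenient sanity check for an implementation.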
Acquisition functions in Bayesian optimization: The usual choices are EI (Expected Improvement), UCB (Upper Confidence Bound) and PI (Probability of Improvement). EI maximizes the expected gain over the current best value. UCB scores μ + κσ, so larger κ favors exploration and smaller κ favors exploitation. For multi-objective problems, dedicated acquisition functions such as EHI (Expected Hypervolume Improvement) are used.
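The three acquisition functions can be written in a few lines given a predictive mean μ and standard deviation σ at a candidate point. This sketch assumes a maximization problem and uses the standard closed forms; the function names and the optional margin `xi` are illustrative choices.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB = mu + kappa * sigma; larger kappa favors exploration."""
    return mu + kappa * sigma

def probability_of_improvement(mu, sigma, f_best, xi=0.0):
    """P(f(x) > f_best + xi) under the Gaussian predictive distribution."""
    if sigma <= 0.0:
        return 1.0 if mu - f_best - xi > 0.0 else 0.0
    return normal_cdf((mu - f_best - xi) / sigma)

def expected_improvement(mu, sigma, f_best, xi=0.0):
    """E[max(f(x) - f_best - xi, 0)] under the Gaussian predictive distribution."""
    improvement = mu - f_best - xi
    if sigma <= 0.0:
        return max(improvement, 0.0)
    z = improvement / sigma
    return improvement * normal_cdf(z) + sigma * normal_pdf(z)

# A certain but mediocre point versus an uncertain one with the same mean.
print(expected_improvement(mu=0.9, sigma=0.01, f_best=1.0))
print(expected_improvement(mu=0.9, sigma=0.50, f_best=1.0))
```

The second call returns the larger value: with equal means, EI prefers the point with more uncertainty, which is the exploration behavior described above.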

Real-World Applications

Bayesian optimization: Hyperparameter search for machine learning (learning rate, layer count, regularization weight, etc.) often takes hours or days per evaluation. A GP gives a predictive mean and uncertainty, and an acquisition function decides where to try next; this finds good points in far fewer trials than random or grid search. AutoML systems, SigOpt and Optuna implement variants of this loop.
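The loop described above can be sketched in a few dozen lines. In this illustration the "expensive" objective is a cheap stand-in function, the candidates are a fixed 1D grid, the hyperparameters are held fixed and the acquisition function is UCB; all of these are simplifying assumptions, and real systems differ in each respect.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale, sigma_f):
    xa = np.asarray(xa, dtype=float).reshape(-1, 1)
    xb = np.asarray(xb, dtype=float).reshape(1, -1)
    return sigma_f**2 * np.exp(-0.5 * (xa - xb) ** 2 / length_scale**2)

def gp_predict(x_train, y_train, x_test, length_scale=1.0, sigma_f=1.0, sigma_n=0.05):
    K = rbf_kernel(x_train, x_train, length_scale, sigma_f) + sigma_n**2 * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, length_scale, sigma_f)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)
    var = np.maximum(sigma_f**2 - np.sum(v**2, axis=0), 0.0)
    return K_s.T @ alpha, var

def objective(x):
    # Hypothetical stand-in for an expensive evaluation (e.g. one training run).
    return np.sin(x) + 0.5 * np.cos(2.0 * x)

grid = np.linspace(0.0, 10.0, 201)             # candidate points
x_obs = np.array([0.0, 5.0])                   # two initial evaluations
y_obs = objective(x_obs)

for _ in range(10):
    mean, var = gp_predict(x_obs, y_obs, grid)
    ucb = mean + 2.0 * np.sqrt(var)            # acquisition: mu + kappa * sigma, kappa = 2
    x_next = grid[int(np.argmax(ucb))]         # most promising candidate
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best = int(np.argmax(y_obs))
print("best x:", x_obs[best], " best value:", y_obs[best])
```

Each iteration spends one evaluation where the model is either optimistic (high mean) or unsure (high σ), which is why such loops tend to need fewer trials than a blind grid.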

Materials design and experimental planning: When exploring synthesis conditions for new materials (temperature, composition, time), each experiment is often extremely expensive. GP-based Bayesian optimization proposes the "most promising next conditions" from past experiments and dramatically reduces the number of experiments. It is used in battery materials, catalysts, alloy development and drug discovery.

Surrogate models for simulation: Instead of running a heavy CFD or FEM simulation many times, a GP can be trained on a small number of runs and used as a surrogate. Design optimization, sensitivity analysis and uncertainty propagation become orders of magnitude cheaper. Applications include aerodynamic wing design and robust optimization of crash analyses.

Geostatistics (kriging): The geospatial version of GP, known as kriging, has been used since the 1960s for ore body evaluation, soil contamination maps and spatial interpolation of meteorological data. It estimates values and uncertainties at unmeasured locations from sparse measurements and also helps choose where to sample next. GP and kriging are essentially the same idea developed in different communities.

Common Misconceptions and Cautions

The most common misconception is to assume that "the hyperparameters (l, σ_f, σ_n) can be set casually". In reality, they decide the quality of the prediction. Push the l slider in the simulator to 0.1 and then to 5.0: the first case overfits the noise completely, and the second misses all real variation. In practice, always maximize the log marginal likelihood automatically, or at the very least use cross-validation. Eyeballing "a value that looks fine" is dangerous.

The next misconception is reading the GP confidence band as "the truth must lie inside the band 95% of the time". That probability is conditional on the current model (kernel choice and hyperparameters); if the model is wrong, the band can be badly miscalibrated. For instance, applying an RBF kernel to a truly periodic function gives a band that only reflects "uncertainty for a smooth function" and silently misses the structural mismatch. Always check separately how far the predictive mean is from the true function (or from held-out data), not just the band width.
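A simple check along these lines is to compare the prediction with reference values and measure how often the reference actually falls inside the band. The helper below is a sketch; `band_diagnostics` is a hypothetical name, and the mean and standard deviation in the usage example are made-up arrays standing in for a GP's output.

```python
import numpy as np

def band_diagnostics(mean, std, truth, z=1.96):
    """Compare a predictive band with reference values at the same points."""
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    truth = np.asarray(truth, dtype=float)
    err = truth - mean
    inside = np.abs(err) <= z * std            # True where truth lies in mean +/- z*std
    return {
        "rmse": float(np.sqrt(np.mean(err**2))),
        "max_abs_error": float(np.max(np.abs(err))),
        "coverage": float(np.mean(inside)),    # fraction of points inside the band
    }

grid = np.linspace(0.0, 10.0, 101)
truth = np.sin(grid) + 0.5 * np.cos(2.0 * grid)
mean = truth + 0.05                 # hypothetical predictive mean with a small bias
std = np.full_like(grid, 0.1)       # hypothetical predictive standard deviation
print(band_diagnostics(mean, std, truth))
```

A coverage far below 0.95 on held-out points suggests the band is too narrow for the chosen kernel or hyperparameters, while a large RMSE with a narrow band is the "confidently wrong" case described above.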

Finally, do not assume that "more observations always improve accuracy". As N grows, the kernel matrix becomes ill-conditioned and Cholesky factorization can become numerically unstable. In implementations, a small jitter (about 1e-6) is added to the diagonal for stability. Compute cost also rises as O(N^3), so once N reaches a few thousand a sparse GP or inducing-point method becomes necessary. This tool caps N at 30; in production, always weigh "adding more data" against "revisiting the kernel choice".
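The conditioning problem and the jitter remedy are easy to reproduce. The sketch below prints the condition number of a noise-free RBF kernel matrix as N grows on a fixed interval, then factorizes it with a diagonal jitter that is increased only if the factorization fails. The retry strategy and the name `stable_cholesky` are assumptions for illustration, not the simulator's implementation.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale, sigma_f):
    xa = np.asarray(xa, dtype=float).reshape(-1, 1)
    xb = np.asarray(xb, dtype=float).reshape(1, -1)
    return sigma_f**2 * np.exp(-0.5 * (xa - xb) ** 2 / length_scale**2)

def stable_cholesky(K, jitter=1e-6, max_tries=5):
    """Cholesky of K + jitter*I, multiplying the jitter by 10 after each failure.

    Returns (L, jitter_used).
    """
    eye = np.eye(K.shape[0])
    current = jitter
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(K + current * eye), current
        except np.linalg.LinAlgError:
            current *= 10.0
    raise np.linalg.LinAlgError("matrix not positive definite even with jitter")

for n in (5, 15, 30):
    x = np.linspace(0.0, 10.0, n)
    K = rbf_kernel(x, x, length_scale=2.0, sigma_f=1.0)   # no noise term on purpose
    L, used = stable_cholesky(K)
    print(f"N={n:2d}  cond(K)={np.linalg.cond(K):.2e}  jitter used={used:g}")
```

Packing more points into the same interval makes neighboring rows of K nearly identical, so the condition number climbs quickly; the jitter (or a nonzero σ_n) bounds the smallest eigenvalue away from zero.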