White: true function A·sin(2πx). Red dots: noisy training observations. Blue line: Kriging (Gaussian process) predictive mean. Blue band: 95% credible interval.
$$\hat\mu(x_*) = k_*^T K^{-1} y,\qquad \hat\sigma^2(x_*) = k(x_*,x_*) - k_*^T K^{-1} k_*$$
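K is the N×N kernel matrix with the observation noise on its diagonal, K_{ij}=k(x_i,x_j)+σ_n²δ_{ij}; y is the vector of the N training targets; k_* is the vector of covariances k(x_*,x_i) between the prediction point x_* and the training points. A zero prior mean is assumed. The GP returns the posterior mean μ̂ and variance σ̂² together. Here σ̂² is the variance of the noise-free function value; add σ_n² to obtain the predictive variance of a new noisy observation.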
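The two predictive equations can be evaluated without forming K⁻¹ explicitly. The following is a minimal NumPy sketch under stated assumptions: a squared exponential kernel, a zero prior mean, amplitude A = 1, and illustrative hyperparameter values (`length_scale=0.2`, `noise_var=0.01`). The names `se_kernel` and `gp_posterior` are hypothetical helpers, not part of any library.

```python
import numpy as np

def se_kernel(a, b, length_scale=0.2, signal_var=1.0):
    """Squared exponential kernel matrix between 1-D input arrays a and b."""
    r = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (r / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.01):
    """Posterior mean and noise-free variance at x_test (zero prior mean)."""
    # K_ij = k(x_i, x_j) + sigma_n^2 * delta_ij
    K = se_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_star = se_kernel(x_train, x_test)   # column j holds k_* for test point j
    L = np.linalg.cholesky(K)             # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y
    v = np.linalg.solve(L, K_star)        # L^{-1} k_*
    mean = K_star.T @ alpha               # k_*^T K^{-1} y
    # k(x_*, x_*) - k_*^T K^{-1} k_*, clipped to guard against round-off
    var = np.diag(se_kernel(x_test, x_test)) - np.sum(v ** 2, axis=0)
    return mean, np.maximum(var, 0.0)

# Illustrative data: noisy samples of sin(2*pi*x) (A = 1 is an assumption)
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(8)
x_test = np.linspace(0.0, 1.0, 50)

mean, var = gp_posterior(x_train, y_train, x_test)
lower = mean - 1.96 * np.sqrt(var)   # 95% band for the noise-free function
upper = mean + 1.96 * np.sqrt(var)
```

The Cholesky factorisation replaces the explicit inverse because it is cheaper and numerically more stable for a symmetric positive-definite K.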
$$k_{SE}(r) = \sigma_f^2 \exp\!\left(-\frac{r^2}{2\ell^2}\right)$$
Squared exponential (RBF) kernel. r=|x−x'| is the distance between two inputs, ℓ the length scale, and σ_f² the signal variance. Sample functions drawn from this prior are infinitely differentiable, so the kernel favours very smooth functions.
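As a quick illustration of how the covariance decays with distance, here is a small sketch of the formula. The function name `k_se` and the default values are assumptions chosen for the example.

```python
import numpy as np

def k_se(r, length_scale=1.0, signal_var=1.0):
    """Squared exponential kernel as a function of the distance r = |x - x'|."""
    return signal_var * np.exp(-r ** 2 / (2.0 * length_scale ** 2))

r = np.array([0.0, 0.5, 1.0, 2.0])
values = k_se(r)   # covariance at increasing distances, with l = 1 and sigma_f^2 = 1
```

At r = 0 the kernel equals σ_f², and at r = ℓ it has dropped to σ_f²·e^(−1/2), about 61% of the signal variance.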
$$k_{M5/2}(r) = \sigma_f^2\left(1+\frac{\sqrt{5}\,r}{\ell}+\frac{5r^2}{3\ell^2}\right)\exp\!\left(-\frac{\sqrt{5}\,r}{\ell}\right)$$
Matérn 5/2 kernel, a common default in Bayesian optimization. Sample functions are only twice differentiable, so it captures rougher, sharper variations than the SE kernel.
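The Matérn 5/2 formula can be sketched in the same way. The name `k_matern52` and the default values are illustrative assumptions.

```python
import numpy as np

def k_matern52(r, length_scale=1.0, signal_var=1.0):
    """Matern 5/2 kernel as a function of the distance r = |x - x'|."""
    s = np.sqrt(5.0) * np.abs(r) / length_scale   # scaled distance sqrt(5) r / l
    # s**2 / 3 equals 5 r^2 / (3 l^2), the quadratic term of the formula
    return signal_var * (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

r = np.array([0.0, 0.5, 1.0, 2.0])
values = k_matern52(r)   # covariance at increasing distances, with l = 1 and sigma_f^2 = 1
```

Writing the polynomial in terms of the scaled distance s = √5·r/ℓ keeps the code close to the formula and avoids repeating the length scale.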
$$\log p(y\mid X,\theta) = -\tfrac12 y^T K^{-1} y - \tfrac12 \log|K| - \tfrac{N}{2}\log(2\pi)$$
Log marginal likelihood. Term 1 = data fit, term 2 = model-complexity penalty, term 3 = a normalisation constant that does not depend on θ. K depends on the hyperparameters θ=(ℓ,σ_f²,σ_n²); maximise the expression over θ to estimate them.
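The three terms can be computed from a single Cholesky factor of K. The sketch below assumes a squared exponential kernel and synthetic data. To stay dependency-free it swaps in a plain grid search over ℓ, with σ_f² and σ_n² held fixed, where a gradient-based optimiser would normally be used. The helper names are hypothetical.

```python
import numpy as np

def se_kernel(a, b, length_scale, signal_var):
    """Squared exponential kernel matrix between 1-D input arrays a and b."""
    r = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (r / length_scale) ** 2)

def log_marginal_likelihood(x, y, length_scale, signal_var, noise_var):
    """log p(y | X, theta) for a zero-mean GP with an SE kernel."""
    n = len(x)
    K = se_kernel(x, x, length_scale, signal_var) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    data_fit = -0.5 * y @ alpha                     # -1/2 y^T K^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))        # equals -1/2 log|K|
    constant = -0.5 * n * np.log(2.0 * np.pi)       # independent of theta
    return data_fit + complexity + constant

# Synthetic data (an assumption for illustration): noisy sin(2*pi*x)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

# Coarse grid search over the length scale only
length_scales = np.logspace(-2, 1, 30)
scores = [log_marginal_likelihood(x, y, ls, 1.0, 0.01) for ls in length_scales]
best_length_scale = length_scales[int(np.argmax(scores))]
```

The determinant term uses the identity log|K| = 2·Σ log L_ii, which avoids overflow or underflow in |K| itself.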