Bayesian Calibration via MCMC (Metropolis–Hastings) Simulator

Q: Why does a small effective sample size (ESS) matter?

MCMC samples are autocorrelated, so 5000 raw samples can be worth only a few hundred — or even a few dozen — independent samples; that's the ESS. The standard error of the posterior mean is σ_post/√ESS, so when ESS drops below 100 the standard error becomes more than 10% of σ_post and the second decimal place of the estimate is no longer trustworthy. The three remedies are (1) optimise the proposal width, (2) increase the number of samples, or (3) switch to a more efficient sampler such as Hamiltonian Monte Carlo or NUTS.

Calibrate a CAE model parameter the Bayesian way: feed in a prior, a likelihood and observed data. The tool computes the closed-form Normal–Normal posterior in real time, plus the Metropolis–Hastings acceptance rate and effective sample size (ESS) so you can build intuition for UQ and model calibration.

Parameters

Prior mean μ_p

Centre of prior belief about θ

Prior std σ_p

Larger = weakly informative (data-dominated)

Observation noise σ_y

Known std of the observation model

Observed mean ȳ

Sample mean of the N observations

Number of observations N

Posterior std shrinks as 1/√N

MH proposal step

≈ 2.38·σ_post is optimal

MCMC samples

Burn-in

Initial transient samples to discard

Results

Posterior mean μ_post

Posterior std σ_post

95% credible interval width

MH acceptance (%)

Effective sample size ESS

Information gain (nats)

Prior / likelihood / posterior + MH walker

Blue: prior p(θ), red: likelihood p(y|θ), green: posterior p(θ|y). The white dot is the current state of an MH chain random-walking on the posterior.

Prior / likelihood / posterior PDFs

MCMC trace plot — θ vs iteration

Theory & Key Formulas

$$p(\theta|y) \propto p(y|\theta)\,p(\theta),\qquad \alpha = \min\left(1,\, \frac{\pi(\theta^{*})}{\pi(\theta_t)}\right)$$

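π is the unnormalised posterior. A proposal θ* is accepted with probability α; otherwise the chain stays at θ_t. This simple ratio is the Metropolis special case of Metropolis–Hastings, valid for a symmetric random-walk proposal; an asymmetric proposal q needs the extra factor q(θ_t|θ*)/q(θ*|θ_t) inside the ratio. Over a long enough chain the samples approach the target π.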
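The accept/reject rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the tool's actual implementation: the function names, the Gaussian proposal, and the parameter values (μ_p = 0, σ_p = 2, σ_y = 0.5, ȳ = 1, N = 50) are assumptions chosen for the example.

```python
import math
import random

def log_unnorm_posterior(theta, mu_p, sigma_p, sigma_y, y_bar, n):
    """log pi(theta): Normal log prior + Normal log likelihood, constants dropped."""
    log_prior = -0.5 * ((theta - mu_p) / sigma_p) ** 2
    log_lik = -0.5 * n * ((y_bar - theta) / sigma_y) ** 2
    return log_prior + log_lik

def random_walk_metropolis(n_samples, step, theta0, model, seed=0):
    """Return (chain, acceptance_rate) for a Gaussian random-walk proposal."""
    rng = random.Random(seed)
    theta = theta0
    log_pi = log_unnorm_posterior(theta, **model)
    chain, n_accepted = [], 0
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)      # symmetric proposal
        log_pi_new = log_unnorm_posterior(proposal, **model)
        # alpha = min(1, pi(proposal) / pi(theta)), evaluated in log space
        alpha = math.exp(min(0.0, log_pi_new - log_pi))
        if rng.random() < alpha:
            theta, log_pi = proposal, log_pi_new
            n_accepted += 1
        chain.append(theta)                          # a rejection repeats theta
    return chain, n_accepted / n_samples

# Illustrative values (assumptions, not read from the tool)
model = dict(mu_p=0.0, sigma_p=2.0, sigma_y=0.5, y_bar=1.0, n=50)
chain, acc_rate = random_walk_metropolis(5000, step=0.15, theta0=0.0, model=model)
kept = chain[500:]                                   # discard burn-in
mcmc_mean = sum(kept) / len(kept)
print(f"acceptance = {acc_rate:.1%}, posterior mean estimate = {mcmc_mean:.3f}")
```

Working with log π avoids numerical underflow when the chain starts far from the bulk, and only the difference of two log values is ever needed, so the normalising constant never appears.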

$$\frac{1}{\sigma_{\text{post}}^{2}} = \frac{1}{\sigma_p^{2}} + \frac{N}{\sigma_y^{2}},\qquad \mu_{\text{post}} = \sigma_{\text{post}}^{2}\!\left(\frac{\mu_p}{\sigma_p^{2}} + \frac{N\,\bar{y}}{\sigma_y^{2}}\right)$$

Closed-form posterior for the Normal–Normal conjugate model. Precisions add: 1/σ_post² grows linearly in N, so once N/σ_y² dominates 1/σ_p² the posterior std approaches σ_y/√N.
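As a quick check of the two formulas, here is a small Python sketch. The function name and the input values are assumptions for illustration; the values were picked so that σ_post comes out near 0.07.

```python
import math

def normal_normal_posterior(mu_p, sigma_p, sigma_y, y_bar, n):
    """Closed-form posterior for a Normal prior and a Normal likelihood with known sigma_y."""
    precision_post = 1.0 / sigma_p**2 + n / sigma_y**2    # prior and data precisions add
    var_post = 1.0 / precision_post
    mu_post = var_post * (mu_p / sigma_p**2 + n * y_bar / sigma_y**2)
    return mu_post, math.sqrt(var_post)

# Illustrative inputs (assumed, not taken from the tool)
mu_post, sigma_post = normal_normal_posterior(
    mu_p=0.0, sigma_p=2.0, sigma_y=0.5, y_bar=1.0, n=50
)
print(f"mu_post = {mu_post:.4f}, sigma_post = {sigma_post:.4f}")
```

The posterior mean is a precision-weighted average of μ_p and ȳ, which is why it sits very close to ȳ whenever the data precision N/σ_y² is much larger than the prior precision 1/σ_p².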

$$\text{CI}_{95} = \mu_{\text{post}} \pm 1.96\,\sigma_{\text{post}},\qquad I = \log\!\frac{\sigma_p}{\sigma_{\text{post}}}\ [\text{nats}]$$

95% credible interval and information gain I (uncertainty reduction from prior to posterior, in nats).
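Both quantities are one-liners once σ_post is known. The sketch below assumes σ_p = 2 and σ_post = 0.0707 (the values quoted in the dialogue further down) and an arbitrary μ_post = 1.0; the variable names are illustrative.

```python
import math

sigma_p = 2.0        # prior std (value quoted in the text)
sigma_post = 0.0707  # posterior std (value quoted in the text)
mu_post = 1.0        # assumed for illustration

half_width = 1.96 * sigma_post
ci_95 = (mu_post - half_width, mu_post + half_width)
ci_width = 2.0 * half_width

info_nats = math.log(sigma_p / sigma_post)   # natural log gives nats
info_bits = info_nats / math.log(2.0)        # divide by ln 2 to convert to bits

print(f"95% CI = [{ci_95[0]:.3f}, {ci_95[1]:.3f}], width = {ci_width:.3f}")
print(f"information gain = {info_nats:.2f} nats = {info_bits:.2f} bits")
```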

Bayesian Calibration via MCMC — Metropolis–Hastings

🙋

Bayesian calibration is just fitting a CAE model to experimental data, right? How is it different from ordinary least squares?

🎓

Good question. Least squares (maximum likelihood) returns "the single θ that best fits the data". Bayesian calibration treats θ as a probability distribution: you give it a prior p(θ) that encodes past knowledge or a plausible range, multiply by the likelihood p(y|θ), and out comes the posterior p(θ|y). The result is a distribution with spread, not a single point — so you can talk about design margin and credible intervals naturally. For example, "Young's modulus = 200 ± 5 GPa" can be carried forward into downstream stress analysis as a band rather than a number.

🙋

I see. Then why do we need MCMC at all? Can't we just write down the posterior in closed form?

🎓

In a "conjugate" case like this tool (Normal prior, Normal likelihood) the posterior has a clean closed form, which is why the analytic green curve and the MCMC estimate agree up to Monte Carlo error. But for real CAE models, say the k-ε turbulence constants or a non-linear hardening rule, evaluating p(y|θ) requires a full simulation run, and there is no closed form at all. Metropolis–Hastings only needs the ratio π(θ*)/π(θ_t), so the unknown normalising constant cancels and it works on any black-box likelihood. That's why it became the universal hammer.

🙋

When I move the MH proposal step the acceptance rate changes drastically. How do I pick the right one?

🎓

The classical Roberts–Gelman–Gilks (1997) result says about 44% is optimal in one dimension and about 23.4% in high dimensions. The rule of thumb in one dimension is σ_proposal ≈ 2.38·σ_post. Make the step too big and acceptance drops to near zero: the chain freezes. Make it too small and acceptance is high but consecutive samples are highly correlated, so ESS collapses. Aim for the 30-50% band. The default 0.15 here gives a ratio of ≈ 2.1 against σ_post ≈ 0.07, which is close to the sweet spot.
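To see how strongly the step size drives the acceptance rate, one can sweep it on a Gaussian target. The sketch below is an assumption-laden toy, not the tool's code: it uses a zero-mean Normal target with σ_post = 0.0707, a Gaussian random-walk proposal, and a hypothetical helper `acceptance_rate`.

```python
import math
import random

def acceptance_rate(step, sigma_post, n_samples=20000, seed=1):
    """Fraction of accepted random-walk proposals on a N(0, sigma_post^2) target."""
    rng = random.Random(seed)
    theta, accepted = 0.0, 0
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        # log of pi(proposal) / pi(theta) for a zero-mean Normal target
        log_ratio = (theta**2 - proposal**2) / (2.0 * sigma_post**2)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
            accepted += 1
    return accepted / n_samples

sigma_post = 0.0707                       # value quoted in the text
steps = [0.01, 0.05, 0.15, 2.38 * sigma_post, 0.5, 2.0]
rates = {s: acceptance_rate(s, sigma_post) for s in steps}
for s, r in rates.items():
    print(f"step = {s:6.3f}  step/sigma_post = {s / sigma_post:6.2f}  acceptance = {r:6.1%}")
```

Tiny steps are accepted almost always and huge steps almost never; the rule-of-thumb step of 2.38·σ_post should land in the 40-50% region.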

🙋

Last one — what is "information gain"? I've never seen units of nats before.

🎓

It measures how far the posterior has moved from the prior, in information units. Here we use log(σ_p/σ_post), which for a Gaussian prior and posterior is the reduction in differential entropy: how much the uncertainty shrank after seeing data. With base e the unit is nats (base 2 would give bits). With the defaults we get log(2/0.0707) ≈ 3.34 nats ≈ 4.8 bits, so the experiment gave you about five bits of new information about θ. It's a handy way to rank the "informational value" of competing calibration experiments.

Frequently Asked Questions

Least squares (maximum likelihood) returns a single point estimate of the parameter, while Bayesian calibration treats the parameter as a probability distribution. It combines prior knowledge p(θ) with the likelihood p(y|θ) to obtain a posterior p(θ|y), giving not only an optimum but also the spread and the probability of every alternative value. CAE calibration is often data-poor and noisy, so a point estimate alone tends to oversell the result. Move N from 5 to 500 in this tool and watch the posterior standard deviation σ_post shrink as 1/√N.
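The 1/√N behaviour follows directly from the closed-form precision formula. As a worked example, assuming σ_p = 2 and σ_y = 0.5 (illustrative values, not necessarily the tool's defaults):

$$\sigma_{\text{post}} = \left(\frac{1}{\sigma_p^{2}} + \frac{N}{\sigma_y^{2}}\right)^{-1/2} \approx \frac{\sigma_y}{\sqrt{N}} \quad \text{when } N\,\sigma_p^{2} \gg \sigma_y^{2}$$

$$N = 5:\ \sigma_{\text{post}} = (0.25 + 20)^{-1/2} \approx 0.222,\qquad N = 500:\ \sigma_{\text{post}} = (0.25 + 2000)^{-1/2} \approx 0.0224$$

A hundredfold increase in N shrinks σ_post by a factor of about 9.9, very close to √100 = 10; the small shortfall is the residual contribution of the prior precision at N = 5.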

The classical asymptotic result of Roberts–Gelman–Gilks (1997) gives an optimal acceptance of about 44% in one dimension and about 23.4% in high dimensions. A common rule of thumb in one dimension is to take the proposal standard deviation as ≈ 2.38·σ_post. This tool reproduces the underlying trade-off: too large a step drives the acceptance rate to near zero and the chain freezes, while too small a step gives high acceptance but strong autocorrelation and a low ESS. Tune the step until the acceptance rate sits in the 30-50% band.

MCMC chains usually start away from the posterior bulk, and the first few hundred to few thousand iterations are transient samples drifting toward the high-density region. Including them in the estimate biases the posterior mean and variance, so they are discarded as burn-in. The default of 500 steps here is conservative for a weakly informative prior and a chain of a few thousand samples. Inspect the trace plot to confirm the left-edge drift has disappeared. For rigorous convergence, run multiple chains and check Gelman-Rubin's R̂ < 1.01.
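The multi-chain check can be sketched as follows. This is a simplified illustration under stated assumptions: a Normal target with mean 1.0 and std 0.0707, four arbitrarily chosen dispersed start points, and the classic (non-split) Gelman–Rubin formula. Production libraries use more robust rank-normalised, split-chain variants.

```python
import math
import random

def run_chain(theta0, step, mu, sigma, n_samples, seed):
    """Random-walk Metropolis on a N(mu, sigma^2) target, started at theta0."""
    rng = random.Random(seed)
    theta, chain = theta0, []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        log_ratio = ((theta - mu) ** 2 - (proposal - mu) ** 2) / (2.0 * sigma**2)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        chain.append(theta)
    return chain

def gelman_rubin(chains):
    """Classic (non-split) potential scale reduction factor R-hat."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # W: average within-chain variance; B: n times the variance of chain means
    within = sum(sum((x - mu_c) ** 2 for x in c) / (n - 1)
                 for c, mu_c in zip(chains, means)) / m
    between = n * sum((mu_c - grand) ** 2 for mu_c in means) / (m - 1)
    var_hat = (n - 1) / n * within + between / n
    return math.sqrt(var_hat / within)

burn_in = 500
starts = [-3.0, 0.0, 2.0, 5.0]          # deliberately dispersed (assumed values)
raw = [run_chain(t0, 0.15, 1.0, 0.0707, 3000, seed=10 + i)
       for i, t0 in enumerate(starts)]
r_hat = gelman_rubin([c[burn_in:] for c in raw])
print(f"R-hat after discarding {burn_in} burn-in samples: {r_hat:.4f}")
```

If the chains have all reached the same region, the between-chain variance is small relative to the within-chain variance and R̂ sits very close to 1.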


Real-World Applications

Calibration of CFD / FEM codes: Turbulence-model constants (C_μ, Cε1, Cε2 of k-ε), hardening-law parameters of non-linear materials, limit strains of concrete damage models: quantities the analyst used to fix "by feel" can be turned into posterior distributions estimated from experiment. Bayesian calibration of this kind is a recognised part of VVUQ (Verification, Validation, Uncertainty Quantification) practice at organisations such as NASA and DOE. The Normal–Normal model in this tool is the simplest conjugate entry-point to that practice.

Climate and geoscience model calibration: Parameters such as climate sensitivity or cloud-feedback coefficients are not directly observable, and MCMC is a standard tool for estimating their posterior from historical temperature trends or satellite observations. Assessments from groups such as the Hadley Centre and NOAA frequently present such posterior distributions. When observations are scarce and the prior dominates, sliding σ_p in this tool makes it visually clear how strongly the prior pulls the posterior.

Bayesian machine learning: Bayesian neural networks (BNN), Gaussian-process hyperparameter inference, and effect-size estimation in A/B testing can all be tackled with MCMC. Modern practice favours HMC/NUTS (Stan, PyMC, NumPyro), but the underlying accept/reject idea is the same Metropolis–Hastings logic exposed in this tool. Watching acceptance rate and ESS, the "chain health check", carries over to HMC, although the target acceptance rate there is higher than for random-walk MH.

Pharmacokinetic (PK/PD) modelling: Compartment models calibrated to a handful of subjects, with subject-specific posteriors obtained by hierarchical Bayes + MCMC, are widely used in industry. Posterior credible intervals are routinely reported in Bayesian analyses submitted to regulators such as the FDA. The phenomenon "small N → large σ_post" shown here mirrors the very structure of small Phase-I trials.

Common Misconceptions and Pitfalls

The biggest pitfall is treating "high acceptance rate" as "good sampler". Drive the proposal step to a very small value and acceptance climbs above 90%, but each sample barely moves from the previous one, so 5000 raw samples might be worth only a few dozen independent ones. Try step = 0.05 here and watch the ESS collapse. Acceptance rate must be read together with ESS, never alone.
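The gap between acceptance rate and ESS is easy to demonstrate numerically. The sketch below is illustrative only: it assumes a zero-mean Normal target with σ_post = 0.0707, and the ESS estimator is a simple truncated-autocorrelation version, cruder than what production libraries use.

```python
import math
import random

def sample_chain(step, sigma_post, n_samples=5000, seed=2):
    """Random-walk Metropolis on N(0, sigma_post^2); returns (chain, acceptance)."""
    rng = random.Random(seed)
    theta, chain, accepted = 0.0, [], 0
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        log_ratio = (theta**2 - proposal**2) / (2.0 * sigma_post**2)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
            accepted += 1
        chain.append(theta)
    return chain, accepted / n_samples

def effective_sample_size(chain, max_lag=1000):
    """ESS = N / (1 + 2 * sum of autocorrelations), with the sum truncated
    at the first non-positive autocorrelation (a simple cut-off rule)."""
    n = len(chain)
    mean = sum(chain) / n
    centred = [x - mean for x in chain]
    var = sum(c * c for c in centred) / n
    rho_sum = 0.0
    for lag in range(1, min(max_lag, n - 1) + 1):
        cov = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / n
        rho = cov / var
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)

sigma_post = 0.0707
chain_small, acc_small = sample_chain(step=0.01, sigma_post=sigma_post)
chain_tuned, acc_tuned = sample_chain(step=0.15, sigma_post=sigma_post)
ess_small = effective_sample_size(chain_small)
ess_tuned = effective_sample_size(chain_tuned)
print(f"step 0.01: acceptance = {acc_small:.1%}, ESS = {ess_small:.0f}")
print(f"step 0.15: acceptance = {acc_tuned:.1%}, ESS = {ess_tuned:.0f}")
```

The tiny step wins on acceptance rate and loses badly on ESS, which is the whole point of reading the two numbers together.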

Next, the "a wide, weakly informative prior is always objective" fallacy. Make the prior too wide and a small dataset alone determines the posterior, so a few outlying observations can drag it to implausible values. A strongly informative prior (small σ_p) expresses confidence in the model but can also hide a model–data mismatch. Compare N=5, σ_p=0.5 against N=5, σ_p=10 in this tool and the influence of the prior becomes tangible. The professional habit is to try several priors and report the sensitivity.
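The N=5 comparison can also be done by hand with the closed-form posterior. In the sketch below, μ_p = 0, σ_y = 0.5 and ȳ = 1 are assumed values for illustration, and `prior_weight` is a hypothetical name for the share of posterior precision contributed by the prior.

```python
import math

def posterior(mu_p, sigma_p, sigma_y, y_bar, n):
    """Normal-Normal posterior mean, std, and the prior's share of the precision."""
    prior_prec = 1.0 / sigma_p**2
    data_prec = n / sigma_y**2
    post_prec = prior_prec + data_prec
    mu_post = (prior_prec * mu_p + data_prec * y_bar) / post_prec
    prior_weight = prior_prec / post_prec
    return mu_post, math.sqrt(1.0 / post_prec), prior_weight

# Assumed illustrative settings: prior centred at 0, data centred at 1
mu_tight, sd_tight, w_tight = posterior(0.0, 0.5, 0.5, 1.0, 5)
mu_wide, sd_wide, w_wide = posterior(0.0, 10.0, 0.5, 1.0, 5)

print(f"sigma_p = 0.5: mu_post = {mu_tight:.4f}, sigma_post = {sd_tight:.4f}, prior weight = {w_tight:.1%}")
print(f"sigma_p = 10 : mu_post = {mu_wide:.4f}, sigma_post = {sd_wide:.4f}, prior weight = {w_wide:.2%}")
```

With the tight prior, roughly a sixth of the posterior precision comes from the prior and the mean is pulled noticeably toward μ_p; with the wide prior the five observations decide almost everything.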

Finally, judging convergence from the look of a single chain. One chain that seems "stable" may simply be stuck in one mode; a second chain from a different start point can converge to a completely different region (especially for multi-modal posteriors). In production you run multiple chains and require Gelman–Rubin R̂ < 1.01 for every parameter. This tool is a one-dimensional, one-chain concept-trainer with only a proxy R̂; in real projects always inspect R̂, ESS and MCSE in your sampler's summary output (for example az.summary in ArviZ for PyMC, or Stan's summary).
