Bayesian Calibration

Category: V&V / Uncertainty Quantification | Updated 2026-04-12
[Figure] Bayesian calibration: conceptual diagram of the prior-to-posterior update via MCMC sampling.

Theory and Physics

What is Bayesian Calibration?

🧑‍🎓

Professor, is Bayesian calibration about estimating model parameters using experimental data? How is it different from regular optimization?

🎓

Good question. Roughly speaking, Bayesian calibration estimates parameters as a "probability distribution" rather than a "single value". Ordinary least squares or optimization gives a point estimate like "This is the value that best fits the experimental data!", but Bayes returns an answer as a distribution saying "The parameter is within this range with this probability."

🧑‍🎓

What's the benefit of getting an answer as a distribution?

🎓

For example, let's say we estimate the Young's modulus of an FEM model for automotive crash analysis. Optimization would only give a single value like $E = 210\,\text{GPa}$. But Bayes would output "a distribution of $E$ with mean $210\,\text{GPa}$ and standard deviation $5\,\text{GPa}$". This uncertainty information can be directly passed to downstream reliability analysis.

More importantly, it can formally incorporate engineers' experiential knowledge (prior distribution). It narrows down parameters by utilizing both prior knowledge like "Young's modulus for this material is probably around $200 \sim 220\,\text{GPa}$" and experimental data. This is the greatest strength of Bayes.

🧑‍🎓

Being able to incorporate experiential knowledge into equations sounds very practical! But what's the specific mathematics behind it?

Bayes' Theorem and Fundamental Equation

🎓

The mathematical backbone of Bayesian calibration is Bayes' theorem. For a parameter vector $\boldsymbol{\theta}$ and observed data $\mathbf{D}$:

$$ p(\boldsymbol{\theta}|\mathbf{D}) = \frac{p(\mathbf{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{p(\mathbf{D})} \propto p(\mathbf{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}) $$
Physical Meaning of Each Term
  • $p(\boldsymbol{\theta}|\mathbf{D})$ (Posterior Distribution): The probability distribution of the parameters after seeing the experimental data. This is the final output of the calibration.
  • $p(\mathbf{D}|\boldsymbol{\theta})$ (Likelihood Function): The probability of obtaining the experimental data given a certain parameter value. It measures the "degree of discrepancy" between model predictions and experimental data.
  • $p(\boldsymbol{\theta})$ (Prior Distribution): Knowledge about the parameters before seeing the data. Reflects literature values, experience, and physical constraints.
  • $p(\mathbf{D})$ (Evidence/Marginal Likelihood): A normalization constant. Used for model comparison but can be ignored as a constant in parameter estimation.
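The update rule can be made concrete with a minimal one-dimensional sketch: evaluating the unnormalized posterior on a grid for a single Young's modulus parameter. The prior matches the $E \sim \mathcal{N}(210, 5^2)$ example used later in this article, while the "measurements" and noise level are assumed numbers chosen purely for illustration.

```python
import numpy as np

# Hypothetical 1-D illustration: posterior for a Young's modulus E on a grid.
# Prior: E ~ N(210, 5^2) GPa. The "measurements" and noise level below are
# assumed values for the example, not real data.
E_grid = np.linspace(180.0, 240.0, 2001)
dE = E_grid[1] - E_grid[0]
prior = np.exp(-0.5 * ((E_grid - 210.0) / 5.0) ** 2)

data = np.array([206.0, 208.5, 207.2])  # assumed direct observations of E (GPa)
sigma = 3.0                             # assumed observation noise (GPa)
log_like = -0.5 * np.sum((data[:, None] - E_grid[None, :]) ** 2, axis=0) / sigma**2

# Posterior ∝ likelihood × prior (the evidence p(D) drops out as a constant);
# normalize numerically on the grid.
post = prior * np.exp(log_like - log_like.max())
post /= post.sum() * dE

post_mean = np.sum(E_grid * post) * dE
post_sd = np.sqrt(np.sum((E_grid - post_mean) ** 2 * post) * dE)
```

With these assumed data, the posterior mean lands between the prior mean (210) and the data mean (about 207.2), and the posterior standard deviation shrinks well below the prior's 5 GPa: exactly the "distribution, not point estimate" behavior described above.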
🧑‍🎓

How is the likelihood function written specifically?

🎓

Let's consider the simplest case. Assuming the difference between $N$ experimental data points $d_i$ and model predictions $M(\mathbf{x}_i, \boldsymbol{\theta})$ follows independent and identically distributed normal errors:

$$ p(\mathbf{D}|\boldsymbol{\theta}, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\bigl(d_i - M(\mathbf{x}_i, \boldsymbol{\theta})\bigr)^2}{2\sigma^2}\right) $$
🎓

Here, $\sigma$ is the standard deviation of the observation noise. It can be fixed as known or estimated together with $\boldsymbol{\theta}$ using Bayes. In practice, it's more robust to include $\sigma$ as an estimation target because the exact magnitude of error is often unknown.
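The i.i.d. Gaussian likelihood above is usually implemented in log form to avoid numerical underflow for large $N$. The sketch below uses a hypothetical linear toy model standing in for an FEM solver; the observation values are assumptions for illustration.

```python
import numpy as np

def gaussian_log_likelihood(theta, sigma, x, d, model):
    """Log of the i.i.d. Gaussian likelihood p(D | theta, sigma):
    working in logs avoids underflow for large N."""
    resid = d - model(x, theta)
    n = d.size
    return -0.5 * n * np.log(2.0 * np.pi * sigma**2) - 0.5 * np.sum(resid**2) / sigma**2

def toy_model(x, theta):
    """Hypothetical stand-in for an FEM solver: M(x, theta) = theta * x."""
    return theta * x

x_obs = np.array([1.0, 2.0, 3.0])   # assumed input conditions
d_obs = np.array([2.1, 3.9, 6.2])   # assumed measurements
ll_good = gaussian_log_likelihood(2.0, 0.2, x_obs, d_obs, toy_model)
ll_bad = gaussian_log_likelihood(3.0, 0.2, x_obs, d_obs, toy_model)
```

Parameter values whose predictions track the data (here $\theta \approx 2$) receive a much higher log-likelihood than poor fits, which is precisely what drives the posterior toward well-fitting regions.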

🧑‍🎓

But professor, FEM models approximate physics to some extent, right? Isn't the assumption that the difference between model and experiment can be explained by "normal error alone" too simplistic?

🎓

Sharp observation! That's exactly what the Kennedy-O'Hagan framework systematically addresses.

Kennedy-O'Hagan Framework

🎓

The framework proposed by Kennedy and O'Hagan in 2001 has become the de facto standard for Bayesian calibration in CAE. The core idea is to decompose the experimental value as follows:

$$ d(\mathbf{x}) = M(\mathbf{x}, \boldsymbol{\theta}) + \delta(\mathbf{x}) + \varepsilon $$
Meaning of Each Kennedy-O'Hagan Term
  • $M(\mathbf{x}, \boldsymbol{\theta})$ (Simulation Output): The result of FEM/CFD etc. calculations given input conditions $\mathbf{x}$ and calibration parameters $\boldsymbol{\theta}$.
  • $\delta(\mathbf{x})$ (Model Discrepancy Term): Structural error inherent to the model itself that cannot be eliminated by adjusting parameters. Caused by mesh discretization error, omitted physical phenomena, idealized boundary conditions, etc. Typically represented by a Gaussian Process (GP).
  • $\varepsilon$ (Observation Error): Random noise associated with experimental measurement. It's common to assume $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$.
🧑‍🎓

What changes when you include $\delta(\mathbf{x})$ versus when you don't?

🎓

If you don't include $\delta$, all structural model error gets forced onto the parameters $\boldsymbol{\theta}$. For example, if the discrepancy is actually due to inadequate boundary conditions, the Young's modulus might be pushed to an extreme value to make things fit. This is called the parameter compensation problem, leading to parameter estimates that are physically meaningless.

By explicitly including $\delta(\mathbf{x})$, the "part that the model cannot reproduce" is separated, keeping parameter estimation within a physically plausible range.
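The compensation effect is easy to demonstrate with a toy least-squares fit; all numbers here are assumptions. The true process contains a quadratic term that plays the role of $\delta(\mathbf{x})$, and the incomplete model absorbs it into its parameter:

```python
import numpy as np

# Toy illustration of parameter compensation (all numbers are assumptions).
# True process: d(x) = theta_true * x + 0.3 * x^2, where the quadratic term
# plays the role of a model discrepancy delta(x) that the model below omits.
theta_true = 2.0
x = np.linspace(0.0, 1.0, 50)
d = theta_true * x + 0.3 * x**2

# Calibrating the incomplete model M(x, theta) = theta * x by least squares
# forces the unmodeled discrepancy into the parameter estimate.
theta_hat = np.sum(x * d) / np.sum(x * x)
bias = theta_hat - theta_true
```

The estimate overshoots the physically true value because the only way the model can chase the quadratic trend is through $\theta$; an explicit $\delta(\mathbf{x})$ term would instead absorb that trend and keep $\theta$ near its physical value.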

🧑‍🎓

I see...! So it's about acknowledging the model's limitations before estimation. But how is $\delta$ itself determined?

🎓

$\delta(\mathbf{x})$ is learned from data as a Gaussian Process (GP). That is, the shape of $\delta$ is not predetermined but is automatically estimated via Bayesian inference from the residual pattern between experimental data and model predictions. The hyperparameters of the GP's kernel function (e.g., squared exponential kernel) are also estimated together.

How to Choose the Prior Distribution

🧑‍🎓

How do you decide the prior distribution? Doesn't it become subjective and arbitrary?

🎓

Setting the prior is indeed the most debated point in Bayes. In practice, the following distinctions are often used:

  • Informative Prior: Set based on material database values, past experimental results, handbook values. Example: For structural steel Young's modulus, $E \sim \mathcal{N}(210, 5^2)\,[\text{GPa}]$.
  • Weakly Informative Prior: Loosely constrains the physically plausible range. Example: Friction coefficient $\mu \sim \text{Uniform}(0.05, 0.8)$.
  • Non-informative Prior: Such as Jeffreys prior, used when you want to let the data speak. However, convergence of the posterior can be slow in high dimensions.

In the field, weakly informative priors are common. It's a balance of "excluding physically impossible regions while giving sufficient influence to the data."
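The prior types above translate directly into log-prior evaluations inside an MCMC log-posterior. This sketch uses the example values from the text with `scipy.stats` (note SciPy's `uniform(loc, scale)` parametrizes the support as $[\text{loc}, \text{loc}+\text{scale}]$):

```python
import numpy as np
from scipy import stats

# Prior choices using the example values from the text.
informative = stats.norm(loc=210.0, scale=5.0)            # E ~ N(210, 5^2) GPa
weakly_informative = stats.uniform(loc=0.05, scale=0.75)  # mu ~ Uniform(0.05, 0.8)

# Log-priors as evaluated inside an MCMC log-posterior.
lp_E = informative.logpdf(208.0)
lp_mu_in = weakly_informative.logpdf(0.30)   # inside the support: finite
lp_mu_out = weakly_informative.logpdf(1.00)  # outside: -inf, proposal rejected
```

A weakly informative prior acts exactly as described: any proposal outside the physically plausible range gets log-prior $-\infty$ and is rejected outright, while the data decide everything inside the range.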

🧑‍🎓

Can the results change significantly depending on the choice of prior?

🎓

When data is scarce, the influence of the prior is large. Conversely, with sufficient data, the posterior distribution converges to almost the same place regardless of the prior. This is called asymptotic consistency in Bayes. Therefore, in practice, you should always perform prior sensitivity analysis (checking how much results change when varying the prior).

Numerical Methods and Implementation

Basics of MCMC Methods

🧑‍🎓

Bayes' theorem itself is simple, but actually computing the posterior distribution seems tough, right? I imagine the integration becomes intractable with many parameters...

🎓

Exactly. The normalization constant $p(\mathbf{D})$ of the posterior is a high-dimensional integral that cannot be solved analytically. That's where Markov Chain Monte Carlo (MCMC) methods come in. They generate samples directly from the posterior distribution; increasing the number of samples allows approximating the posterior to any desired accuracy.

🧑‍🎓

How does MCMC work?

🎓

The flow of the most basic Metropolis-Hastings algorithm is as follows:

  1. Set initial value $\boldsymbol{\theta}^{(0)}$
  2. Generate a candidate point $\boldsymbol{\theta}^*$ from a proposal distribution $q(\boldsymbol{\theta}^*|\boldsymbol{\theta}^{(t)})$
  3. Calculate the acceptance probability:
$$ \alpha = \min\!\left(1, \;\frac{p(\mathbf{D}|\boldsymbol{\theta}^*)\,p(\boldsymbol{\theta}^*)}{p(\mathbf{D}|\boldsymbol{\theta}^{(t)})\,p(\boldsymbol{\theta}^{(t)})} \cdot \frac{q(\boldsymbol{\theta}^{(t)}|\boldsymbol{\theta}^*)}{q(\boldsymbol{\theta}^*|\boldsymbol{\theta}^{(t)})}\right) $$
🎓

Accept the candidate $\boldsymbol{\theta}^*$ with probability $\alpha$, or reject it (stay at the current value) with probability $(1-\alpha)$. Repeating this thousands to tens of thousands of times makes the sample sequence converge to the posterior distribution.

Imagine it as a mechanism of randomly walking through a mountain range while staying longer in places with higher elevation (larger likelihood × prior probability).
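The three steps can be sketched as a random-walk Metropolis sampler. Because the Gaussian proposal is symmetric, the $q$-ratio in the acceptance probability cancels; the toy target below is a standard normal log-density standing in for log(likelihood × prior).

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: the Gaussian proposal is symmetric, so the
    q-ratio cancels and alpha depends only on the unnormalized posterior ratio."""
    rng = np.random.default_rng(seed)
    theta = float(theta0)
    lp = log_post(theta)
    chain = np.empty(n_samples)
    n_accept = 0
    for t in range(n_samples):
        prop = theta + step * rng.normal()        # step 2: propose a candidate
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # step 3: accept w.p. alpha
            theta, lp = prop, lp_prop
            n_accept += 1
        chain[t] = theta                          # on rejection, stay put
    return chain, n_accept / n_samples

# Toy target: standard normal log-density (an assumed stand-in posterior).
chain, acc_rate = metropolis_hastings(lambda th: -0.5 * th**2, 0.0, 20000)
samples = chain[5000:]  # discard burn-in
```

Comparing in log space avoids overflow, and note that a rejected step still records the current value; that repetition is what weights high-probability regions correctly.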

🧑‍🎓

Wow, I see! But FEM calculations take minutes to hours per run, right? Is running it tens of thousands of times realistic?

🎓

Sharp. That's precisely the biggest bottleneck in using Bayesian calibration for CAE. The countermeasure is surrogate models, which we'll discuss in detail later, but first, let's organize the characteristics of each MCMC algorithm.

Comparison of MCMC Algorithms

  • Metropolis-Hastings: The most basic algorithm; simple to implement. Suited to low-to-medium dimension (up to ~10 parameters). Typical acceptance rate 20–50%.
  • Adaptive Metropolis (AM): Automatically adjusts the proposal covariance matrix. Medium dimension; no manual tuning needed. Typical acceptance rate 23% (optimal).
  • DRAM: Delayed Rejection + Adaptive Metropolis; re-proposes upon rejection. Medium dimension; handles some multimodality. Typical acceptance rate 30–50%.
  • Hamiltonian MC (HMC): Uses gradient information; efficient even in high dimensions. High dimension (100+ parameters); requires a differentiable model. Typical acceptance rate 65–90%.
  • NUTS: Auto-tuning version of HMC; the standard in Stan. High dimension; hands-free. Typical acceptance rate 80–95%.
  • Ensemble Sampler (emcee): Multiple walkers explore in parallel; easy to parallelize. Medium dimension. Typical acceptance rate 20–50%.
  • TMCMC / SMC: Sequential Monte Carlo; strong for multimodal posteriors and model comparison.
🧑‍🎓

There are so many! Which ones are commonly used in CAE?

🎓

DRAM is the standard implementation in Sandia National Labs' Dakota and is the most widely used in practice. HMC/NUTS is overwhelmingly efficient if gradient information is available. For quick implementation in Python, PyMC (NUTS-based) or emcee are good choices.

Convergence Diagnostics

🧑‍🎓

How do you judge that MCMC has "converged"?

🎓

This is a very important point. Using results from unconverged MCMC yields completely meaningless posterior distributions. Standard convergence diagnostics are as follows:

  • $\hat{R}$ (Gelman-Rubin statistic): Compares the variance between and within multiple independent chains. Convergence is judged if $\hat{R} < 1.01$ (a looser traditional threshold is $< 1.05$).
  • Trace Plot: Plot the time series of the sample sequence. A "caterpillar-like" pattern with good mixing is ideal. Trends or jumps indicate non-convergence.
  • Effective Sample Size (ESS): Substantial sample size considering autocorrelation. A guideline is ESS $> 400$ (for each parameter).
  • Burn-in Removal: Discard initial transient samples (typically 10–50% of chain length).
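The $\hat{R}$ diagnostic can be computed in a few lines. This is a minimal split-$\hat{R}$ sketch (each chain is halved before comparing between- and within-chain variance); the "good" and "bad" chains are synthetic, with one bad chain deliberately stuck away from the others.

```python
import numpy as np

def gelman_rubin(chains):
    """Split-R-hat for an (m, n) array of m chains of length n: split each
    chain in half, then compare between-chain variance B with the mean
    within-chain variance W."""
    m, n = chains.shape
    half = n // 2
    c = chains[:, : 2 * half].reshape(2 * m, half)
    means = c.mean(axis=1)
    B = half * means.var(ddof=1)           # between-chain variance
    W = c.var(axis=1, ddof=1).mean()       # within-chain variance
    var_hat = (half - 1) / half * W + B / half
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(1)
good = rng.normal(size=(4, 2000))                     # four well-mixed chains
bad = good + np.array([[0.0], [0.0], [0.0], [3.0]])   # one chain stuck elsewhere

rhat_good = gelman_rubin(good)
rhat_bad = gelman_rubin(bad)
```

Well-mixed chains give $\hat{R} \approx 1.00$, while a single stuck chain inflates the between-chain variance and pushes $\hat{R}$ far above the 1.01 threshold; in practice, libraries such as ArviZ provide this together with ESS.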
🧑‍🎓

So it's OK if we confirm $\hat{R} < 1.01$, the trace plot is stable, and ESS is sufficient.

Acceleration via Surrogate Models

🧑‍🎓

Earlier you mentioned "running FEM tens of thousands of times is tough." How is that actually solved?

🎓

The most common solution is surrogate models (proxy models). They replace the input-output relationship of FEM with a fast approximation model. Main methods are:

  • Gaussian Process Regression (GP/Kriging): High affinity with the Kennedy-O'Hagan framework. Also provides prediction uncertainty.
  • Polynomial Chaos Expansion (PCE): Expands the probability distribution of input parameters with orthogonal polynomials. Can be used simultaneously with uncertainty propagation.
  • Neural Networks: Effective for high-dimensional, strongly nonlinear cases. However, they require large amounts of training data.

For example, running FEM 100–500 times to train a GP allows each MCMC step to use only GP prediction (millisecond order). In principle, this can reduce FEM computations from $10^4 \to 10^2$.
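The train-once, predict-fast pattern can be sketched with a minimal noiseless GP regression in plain NumPy (squared-exponential kernel with fixed, assumed hyperparameters; a `sin` function stands in for the expensive FEM solver):

```python
import numpy as np

def rbf_kernel(A, B, length=0.3):
    """Squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 * length^2))."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / length**2)

def expensive_model(x):
    """Stand-in for an FEM run (assumption: the real solver is a slow black box)."""
    return np.sin(3.0 * x)

# "Design of experiments": a handful of expensive runs to train the surrogate.
X_train = np.linspace(0.0, 2.0, 15)
y_train = expensive_model(X_train)

# Noiseless GP regression: solve K alpha = y once; predictions are then cheap.
K = rbf_kernel(X_train, X_train) + 1e-8 * np.eye(X_train.size)  # jitter for stability
alpha = np.linalg.solve(K, y_train)

def surrogate(x_new):
    """Fast GP prediction used inside the MCMC loop instead of the solver."""
    return rbf_kernel(np.atleast_1d(x_new), X_train) @ alpha

X_test = np.linspace(0.1, 1.9, 50)
max_err = np.max(np.abs(surrogate(X_test) - expensive_model(X_test)))
```

After the one-off training solve, each MCMC step needs only a kernel evaluation and a dot product; a production setup would additionally estimate the kernel hyperparameters and return predictive variance, as GP/Kriging libraries do.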

🧑‍🎓

If the surrogate's accuracy is poor, won't the calibration results also be wrong?

🎓

Exactly. Therefore, surrogate validation is essential. Specifically, use leave-one-out cross-validation to confirm prediction accuracy, with $Q^2 > 0.99$ as a guideline. Another strategy is sequential surrogate updating: during MCMC, add FEM evaluation points in high-posterior-probability regions and refine the surrogate. This achieves both efficiency and accuracy.
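The leave-one-out $Q^2$ check can be sketched as follows for a simple noiseless GP surrogate (same assumed squared-exponential kernel as before; refitting per left-out point is the brute-force version, and `sin` again stands in for expensive FEM outputs):

```python
import numpy as np

def rbf_kernel(A, B, length=0.3):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / length**2)

def loo_q2(X, y, length=0.3, jitter=1e-8):
    """Leave-one-out Q^2 for a noiseless GP surrogate:
    Q^2 = 1 - sum_i (y_i - yhat_{-i})^2 / sum_i (y_i - ybar)^2,
    where yhat_{-i} is the prediction at x_i from a GP trained without point i."""
    n = X.size
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        K = rbf_kernel(X[mask], X[mask], length) + jitter * np.eye(n - 1)
        alpha = np.linalg.solve(K, y[mask])
        preds[i] = (rbf_kernel(X[i : i + 1], X[mask], length) @ alpha)[0]
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Assumed toy data standing in for expensive FEM outputs.
X = np.linspace(0.0, 2.0, 20)
y = np.sin(3.0 * X)
q2 = loo_q2(X, y)
```

$Q^2 = 1$ means perfect held-out prediction, and values near zero mean the surrogate does no better than the mean of the data; for smooth responses with an adequate design of experiments, the $> 0.99$ guideline is attainable.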

Practical Guide

Calibration Workflow

🧑‍🎓

Please teach me the steps for actually performing Bayesian calibration!

🎓
The overall flow connects everything we've discussed:

  1. Define the calibration parameters $\boldsymbol{\theta}$ and set priors (informative or weakly informative, based on literature values and physical constraints).
  2. Run the FEM/CFD model on a design of experiments (typically 100–500 runs) and train a surrogate model (GP, PCE, etc.).
  3. Validate the surrogate with leave-one-out cross-validation ($Q^2 > 0.99$ as a guideline).
  4. Formulate the likelihood, adding the Kennedy-O'Hagan discrepancy term $\delta(\mathbf{x})$ if structural model error is expected.
  5. Run MCMC (e.g., DRAM or NUTS) against the surrogate-based posterior.
  6. Check convergence ($\hat{R}$, trace plots, ESS) and run a prior sensitivity analysis.
  7. Pass the posterior distribution to downstream uncertainty propagation or reliability analysis.
Written by NovaSolver Contributors
Anonymous Engineers & AI