Sobol Sensitivity Indices — Global Sensitivity Analysis Based on Variance Decomposition

Category: V&V / Uncertainty Quantification | Updated 2026-04-12
[Figure: Bar chart of first-order and total-order Sobol indices for multiple input parameters — bar height represents each parameter's contribution to the output variance]

Theory and Physics

What are Sobol Indices?

🧑‍🎓

Professor, what do Sobol sensitivity indices tell us? There are various types of sensitivity analysis, right?

🎓

Simply put, it's a method to quantify how much each input parameter contributes to the variation in the output. The first-order Sobol index $S_i$ represents the main effect, and the total-order Sobol index $S_{Ti}$ represents the contribution rate including interactions.

🧑‍🎓

What does "contribution rate" mean specifically?

🎓

For example, if $S_i = 0.8$, that parameter alone can explain 80% of the output variance. Even if there are 10 design parameters, often only 2-3 are practically important. Sobol indices show you that numerically.

🧑‍🎓

Wow, even with 10 design variables, only 2-3 are important! If you know that, you can fix the remaining 7-8, right?

🎓

Exactly. In CAE practice, for example, in automotive crash simulation, there are tons of input parameters like sheet thickness, material properties, friction coefficient, welding conditions... but if Sobol indices show that "sheet thickness and material yield stress alone explain 90% of the output variation," then the other parameters can be fixed at their nominal values. This drastically reduces computational cost.

Sobol sensitivity indices are a global sensitivity analysis method based on variance decomposition (ANOVA decomposition) proposed by the Russian mathematician I.M. Sobol in 1993. Unlike sensitivity analysis based on local partial derivatives, it evaluates the variation of the model response over the entire parameter space.

ANOVA Decomposition (Variance Decomposition)

For a model output $Y = f(X_1, \ldots, X_k)$ with $k$ independent input variables $X_1, X_2, \ldots, X_k$, the ANOVA decomposition (High-Dimensional Model Representation: HDMR) can be written as:

$$ f(\mathbf{X}) = f_0 + \sum_{i=1}^{k} f_i(X_i) + \sum_{i<j} f_{ij}(X_i, X_j) + \cdots + f_{12\cdots k}(X_1, \ldots, X_k) $$

Here $f_0 = E[Y]$ is the overall mean, $f_i(X_i)$ is the main effect of variable $X_i$, and $f_{ij}$ is the second-order interaction term. For this decomposition to be unique, the integral of each non-constant term over any one of its own variables must vanish, which makes the terms mutually orthogonal; it is also assumed that the input variables are mutually independent.

The variance can be similarly decomposed:

$$ V[Y] = \sum_{i=1}^{k} V_i + \sum_{i<j} V_{ij} + \cdots + V_{12\cdots k} $$

Each $V_i, V_{ij}, \ldots$ is the variance of the corresponding term, representing its contribution to the total variance $V[Y]$.

First-Order Sobol Index $S_i$

🧑‍🎓

What is the specific mathematical formula?

🎓

The first-order Sobol index is the variance of the conditional expectation when $X_i$ is fixed, divided by the total output variance.

$$ \boxed{S_i = \frac{V_{X_i}\!\left[E_{X_{\sim i}}(Y \mid X_i)\right]}{V[Y]}} $$

Where,

  • $E_{X_{\sim i}}(Y \mid X_i)$: The conditional expectation obtained by fixing $X_i$ and averaging over all other variables $X_{\sim i}$
  • $V_{X_i}[\cdot]$: The variance of that conditional expectation with respect to $X_i$
  • $V[Y]$: The unconditional variance (total variance) of the output $Y$

$S_i$ takes a value in the range 0 to 1. $S_i = 0$ means $X_i$ has no main effect (it may still influence the output through interactions). $\sum_{i=1}^{k} S_i = 1$ holds exactly when there are no interactions (an additive model).

🧑‍🎓

So the key is "averaging over all variables except $X_i$." It's about how much the output fluctuates when you only move $X_i$ and average out everything else.

🎓

That understanding is perfect. It's derived from the law of total variance: $V[Y] = V_{X_i}[E_{X_{\sim i}}(Y|X_i)] + E_{X_i}[V_{X_{\sim i}}(Y|X_i)]$. The first term is "the variance explainable by the main effect of $X_i$," and the second term is "the variance that remains even when $X_i$ is fixed."
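
The law of total variance can be checked numerically by brute force. Below is a minimal NumPy sketch (not the efficient estimator discussed later) that computes $S_1$ directly from the definition for a made-up additive toy model $Y = X_1 + 2X_2$ with $X_i \sim U(0,1)$, where analytically $S_1 = 1/5$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical additive toy model: Y = X1 + 2*X2, X1, X2 ~ U(0, 1).
# Analytically V1 = 1/12, V[Y] = 5/12, so S1 = 0.2 and S2 = 0.8.
def model(x1, x2):
    return x1 + 2.0 * x2

n_outer, n_inner = 2000, 2000

# Brute-force the definition: fix X1, average over X2 -> E[Y | X1],
# then take the variance of that conditional mean over X1.
x1 = rng.uniform(0, 1, n_outer)
cond_mean = np.array([model(v, rng.uniform(0, 1, n_inner)).mean() for v in x1])

# Total variance from an independent crude Monte Carlo sample.
y = model(rng.uniform(0, 1, 100_000), rng.uniform(0, 1, 100_000))
S1 = cond_mean.var() / y.var()
print(f"S1 ≈ {S1:.3f}  (analytic 0.200)")
```

This nested-loop approach is only feasible for cheap toy models — it needs $O(N^2)$ evaluations, which is exactly why the Saltelli estimator in the next section exists.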

Total-Order Sobol Index $S_{Ti}$

$$ \boxed{S_{Ti} = 1 - \frac{V_{X_{\sim i}}\!\left[E_{X_i}(Y \mid X_{\sim i})\right]}{V[Y]}} $$

$S_{Ti}$ represents the total contribution rate of $X_i$, including all interaction terms it is involved in. A large difference $S_{Ti} - S_i$ indicates that $X_i$ has strong interactions with other variables.

🧑‍🎓

Can you give me a concrete example of the difference between $S_i$ and $S_{Ti}$?

🎓

For example, consider a heat exchanger design with three inputs (flow velocity $X_1$, pipe diameter $X_2$, material thermal conductivity $X_3$). Suppose the results are:

  • $S_1 = 0.45,\ S_{T1} = 0.52$ — Flow velocity has a large main effect and a small interaction
  • $S_2 = 0.10,\ S_{T2} = 0.35$ — Pipe diameter has a small main effect but large interactions with other variables
  • $S_3 = 0.30,\ S_{T3} = 0.33$ — Thermal conductivity is almost purely a main effect

Looking only at $S_2$, one might judge pipe diameter $X_2$ as "low impact," but $S_{T2}$ shows it cannot be ignored. The combination of flow velocity and pipe diameter (a Reynolds number-like effect) appears as an interaction.

🧑‍🎓

I see! So there are cases where it's dangerous to discard a variable as "unnecessary" based only on $S_i$.

🎓

That's precisely why $S_{Ti}$ must be used for decisions about fixing parameters. If $S_{Ti} \approx 0$, it can be safely fixed. You should not judge based on $S_i$ alone. This is a really crucial point in practice.

Higher-Order Interaction Indices

The second-order interaction index is defined as:

$$ S_{ij} = \frac{V_{ij}}{V[Y]} = S_{ij}^{\text{closed}} - S_i - S_j $$

Here $S_{ij}^{\text{closed}} = \frac{V_{X_i, X_j}[E_{X_{\sim ij}}(Y|X_i, X_j)]}{V[Y]}$ is the closed second-order index for $X_i$ and $X_j$. Theoretically, it's possible to calculate up to any order, but computational cost increases exponentially, so in practice, just $S_i$ and $S_{Ti}$ are often sufficient.
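
A tiny numerical check of this decomposition, using a made-up pure-interaction model $Y = X_1 X_2$ with $X_i \sim U(-1,1)$: both main effects vanish, so the entire variance sits in the second-order term ($S_1 = S_2 = 0$, $S_{12} = 1$):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pure-interaction model: Y = X1 * X2, X1, X2 ~ U(-1, 1).
# E[Y | Xi] = 0 for both inputs, so all variance is interaction:
# S1 = S2 = 0 and S12 = 1.
def model(x1, x2):
    return x1 * x2

n_outer, n_inner = 2000, 2000
x1 = rng.uniform(-1, 1, n_outer)
cond_mean = np.array([model(v, rng.uniform(-1, 1, n_inner)).mean() for v in x1])

y = model(rng.uniform(-1, 1, 100_000), rng.uniform(-1, 1, 100_000))
S1 = cond_mean.var() / y.var()   # ≈ 0: no main effect
S12 = 1.0 - 2 * S1               # by symmetry S2 = S1; with k = 2 the rest is S12
print(f"S1 ≈ {S1:.3f}, S12 ≈ {S12:.3f}")
```

This is also the extreme case of the warning above: judging by $S_1$ alone, both inputs look "unimportant," yet together they carry 100% of the variance.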

Summary of Sobol Index Properties
  • $0 \leq S_i \leq S_{Ti} \leq 1$: First-order index is less than or equal to total-order index
  • $\sum_{i=1}^k S_i \leq 1$: Equality holds when interactions are zero (additive model)
  • $\sum_{i=1}^k S_{Ti} \geq 1$: Equality holds when interactions are zero
  • Independence of inputs is assumed: For correlated inputs, use generalized Sobol indices (Kucherenko 2012) or Shapley effects
Difference from Local Sensitivity Analysis
  • Local Sensitivity Analysis (partial derivative $\partial Y/\partial X_i$): Captures only linear response around a reference point. Low computational cost but cannot see nonlinear effects or interactions
  • Global Sensitivity Analysis (Sobol): Evaluates variation over the entire parameter space. Captures all nonlinearities and interactions. However, requires Monte Carlo-like sampling and has high computational cost
  • Morris Method (Screening): Suitable for binary judgment of "important or not." Often used as a preliminary step before Sobol indices
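
The contrast can be made concrete with a small sketch. For the hypothetical model $Y = X_1^2 + X_2$ with $X_i \sim U(-1,1)$, the local derivative at the nominal point $(0,0)$ reports zero sensitivity to $X_1$, while the global (Sobol) view attributes roughly 21% of the variance to it:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical model where local sensitivity is misleading:
# Y = X1**2 + X2, X1, X2 ~ U(-1, 1). At the nominal point (0, 0)
# dY/dX1 = 0, yet X1 drives about 21% of the output variance.
def model(x):
    return x[..., 0] ** 2 + x[..., 1]

nominal = np.zeros(2)
h = 1e-6
local = np.array([(model(nominal + h * np.eye(2)[i]) - model(nominal)) / h
                  for i in range(2)])          # forward finite differences

# Global view: E[Y | X1] = X1**2 (since E[X2] = 0), so S1 = V[X1**2] / V[Y].
x1 = rng.uniform(-1, 1, 3000)
cond_mean = x1 ** 2
y = model(rng.uniform(-1, 1, (100_000, 2)))
S1 = cond_mean.var() / y.var()
print(f"local dY/dX1 = {local[0]:.2e},  Sobol S1 ≈ {S1:.2f}")
```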

Numerical Methods and Implementation

Saltelli Estimator

🧑‍🎓

I understand the definition formula, but how do you actually compute $S_i$ and $S_{Ti}$? You can't get that conditional expectation variance analytically, right?

🎓

Good question. In practice, we use the Monte Carlo estimator proposed by Saltelli (2002, 2010). First, generate two independent sample matrices $\mathbf{A}$ and $\mathbf{B}$ (each $N \times k$). Then, create a matrix $\mathbf{A}_B^{(i)}$ — this is $\mathbf{A}$ with only its $i$-th column replaced by the $i$-th column of $\mathbf{B}$.

Steps of the Saltelli estimator:

Step 1: Independently generate $N \times k$ sample matrices $\mathbf{A}$ and $\mathbf{B}$ (quasi-Monte Carlo sequences are recommended).

Step 2: For each $i = 1, \ldots, k$, construct the mixed matrix $\mathbf{A}_B^{(i)}$. The $j$-th column is:

$$ (\mathbf{A}_B^{(i)})_j = \begin{cases} \mathbf{B}_j & \text{if } j = i \\ \mathbf{A}_j & \text{if } j \neq i \end{cases} $$

Step 3: Evaluate the model $f$ for all rows of $\mathbf{A}$, $\mathbf{B}$, $\mathbf{A}_B^{(1)}, \ldots, \mathbf{A}_B^{(k)}$. A total of $N(k+2)$ model evaluations are required.

Step 4: Calculate the estimators.

$$ \boxed{S_i \approx \frac{\frac{1}{N}\sum_{j=1}^{N} f(\mathbf{B})_j \left(f(\mathbf{A}_B^{(i)})_j - f(\mathbf{A})_j\right)}{V[Y]}} $$
$$ \boxed{S_{Ti} \approx \frac{\frac{1}{2N}\sum_{j=1}^{N} \left(f(\mathbf{A})_j - f(\mathbf{A}_B^{(i)})_j\right)^2}{V[Y]}} $$

Here $V[Y]$ is estimated from all samples of $\mathbf{A}$ and $\mathbf{B}$.
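
The four steps above can be sketched in plain NumPy. The example below applies the two estimators exactly as written to the Ishigami function, a standard GSA benchmark with known analytic indices ($S_1 \approx 0.314$, $S_2 \approx 0.442$, $S_3 = 0$, $S_{T3} \approx 0.244$); plain pseudo-random sampling is used here for simplicity:

```python
import numpy as np

rng = np.random.default_rng(3)

# Ishigami test function (a standard GSA benchmark), Xi ~ U(-pi, pi).
def ishigami(x, a=7.0, b=0.1):
    return (np.sin(x[:, 0]) + a * np.sin(x[:, 1]) ** 2
            + b * x[:, 2] ** 4 * np.sin(x[:, 0]))

k, N = 3, 2 ** 14

# Step 1: two independent N x k sample matrices A and B.
A = rng.uniform(-np.pi, np.pi, (N, k))
B = rng.uniform(-np.pi, np.pi, (N, k))
fA, fB = ishigami(A), ishigami(B)
V = np.concatenate([fA, fB]).var()   # total variance from all samples

S, ST = np.empty(k), np.empty(k)
for i in range(k):
    # Step 2: A with only column i replaced by column i of B.
    ABi = A.copy()
    ABi[:, i] = B[:, i]
    fABi = ishigami(ABi)             # Step 3: N extra evaluations per input
    # Step 4: first-order and (Jansen) total-order estimators.
    S[i] = np.mean(fB * (fABi - fA)) / V
    ST[i] = np.mean((fA - fABi) ** 2) / (2 * V)

print("S  ≈", np.round(S, 3))   # analytic: [0.314, 0.442, 0.000]
print("ST ≈", np.round(ST, 3))  # analytic: [0.558, 0.442, 0.244]
```

Note the total cost is $N(k+2) = 81{,}920$ evaluations — trivial for an analytic function, but the motivation for surrogate models when each evaluation is a CAE run.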

🧑‍🎓

So, you create two sets $\mathbf{A}$ and $\mathbf{B}$, and by swapping one column at a time, you investigate "what happens if only this variable is changed."

🎓

Yes, exactly. Thanks to this "column swapping" trick, the multiple integrals in the definition can be estimated efficiently via Monte Carlo. By the way, there are several variants of the $S_{Ti}$ estimator, and the one by Jansen (1999) is reported to perform best (lowest estimator variance); the $S_{Ti}$ formula written above is in fact the Jansen estimator.

Quasi-Monte Carlo Sampling

🧑‍🎓

Regarding sample matrix generation, it says "quasi-Monte Carlo sequences are recommended." Is regular random no good?

🎓

It can be used, but convergence is slow. Pseudo-random Monte Carlo has a convergence rate of $O(1/\sqrt{N})$, but using quasi-Monte Carlo methods like Sobol sequences or Halton sequences improves it to $O((\log N)^k / N)$. In practice, the same accuracy can often be achieved with about 1/10 the sample count.

| Sampling Method | Convergence Rate | Characteristics | Recommended Scenario |
| --- | --- | --- | --- |
| Pseudo-random MC | $O(N^{-1/2})$ | Easy to implement; prone to clustering | When dimension is low and computation is cheap |
| Sobol sequence (QMC) | $O(N^{-1}(\log N)^k)$ | Low discrepancy, effective even in high dimensions | Standard recommendation; default in SALib |
| Halton sequence | Same as above | Quality degrades in high dimensions | Low dimensions ($k \leq 10$) |
| Latin Hypercube (LHS) | $\approx O(N^{-1/2})$ | Stratifies marginal distributions accurately | Alternative when Sobol sequences cannot be used |
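
As a sketch, `scipy.stats.qmc` provides a scrambled Sobol sequence generator; the input ranges used below ($X_i \sim U(-\pi, \pi)$, three inputs) are just an example:

```python
import numpy as np
from scipy.stats import qmc

# Sample-matrix generation with a scrambled Sobol low-discrepancy sequence.
# Using N = 2**m points preserves the balance properties of the sequence.
k, m = 3, 10                      # 3 inputs, N = 2**10 = 1024 points
sampler = qmc.Sobol(d=k, scramble=True, seed=0)
u = sampler.random_base2(m=m)     # points in the unit hypercube [0, 1)^k

# Map to the physical input ranges, e.g. each Xi ~ U(-pi, pi).
X = qmc.scale(u, [-np.pi] * k, [np.pi] * k)

# Low discrepancy = more uniform coverage than pseudo-random points.
u_mc = np.random.default_rng(0).uniform(size=(2 ** m, k))
print(f"discrepancy  QMC: {qmc.discrepancy(u):.2e}   MC: {qmc.discrepancy(u_mc):.2e}")
```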

Combination with Surrogate Models

🧑‍🎓

If one CAE simulation takes hours, isn't $N(k+2)$ model evaluations too harsh? With 10 parameters and $N=1000$, that's 12,000 times...

🎓

That's exactly the practical barrier. The solution is to use surrogate models (proxy models). Build a response surface with a small number of CAE evaluation points (tens to hundreds), and then feed a large number of samples into the surrogate model to estimate Sobol indices.

  • Polynomial Chaos Expansion (PCE): Sobol indices can be calculated directly from the expansion coefficients. Most efficient.
  • Kriging (Gaussian Process Regression): Strength is that it can also quantify prediction uncertainty.
  • Neural Networks: Suitable for high dimensions and strong nonlinearities, but require a lot of training data.
🧑‍🎓

With PCE, Sobol indices come directly from the expansion coefficients? That's convenient!

🎓

In the PCE expansion $Y \approx \sum_{\boldsymbol{\alpha}} c_{\boldsymbol{\alpha}} \Psi_{\boldsymbol{\alpha}}(\mathbf{X})$ with an orthonormal basis, the first-order index is $S_i = \sum_{\boldsymbol{\alpha} \in \mathcal{A}_i} c_{\boldsymbol{\alpha}}^2 / \sum_{\boldsymbol{\alpha} \neq \mathbf{0}} c_{\boldsymbol{\alpha}}^2$, where $\mathcal{A}_i$ is the set of multi-indices whose only nonzero component is the one for $X_i$ — a ratio of sums of squared coefficients. This eliminates the need for Monte Carlo sampling entirely, making it dramatically faster. UQLab and OpenTURNS support this method.
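
The coefficient bookkeeping can be illustrated with a toy sketch. The multi-indices and coefficient values below are entirely made up for illustration (and an orthonormal basis is assumed); only the summation logic matters:

```python
# Toy sketch: Sobol indices read directly off PCE coefficients.
# Multi-index alpha -> coefficient c_alpha of an (assumed orthonormal)
# polynomial basis. These coefficient values are invented for illustration.
pce = {
    (0, 0): 1.20,   # mean term f0 (excluded from the variance)
    (1, 0): 0.80,   # depends on X1 only
    (2, 0): 0.30,   # depends on X1 only
    (0, 1): 0.50,   # depends on X2 only
    (1, 1): 0.20,   # X1-X2 interaction
}

total_var = sum(c ** 2 for a, c in pce.items() if a != (0, 0))

def first_order(i):
    # Sum c^2 over multi-indices that involve Xi and nothing else.
    return sum(c ** 2 for a, c in pce.items()
               if a[i] > 0 and all(a[j] == 0 for j in range(len(a)) if j != i)) / total_var

def total_order(i):
    # Sum c^2 over every multi-index that involves Xi at all.
    return sum(c ** 2 for a, c in pce.items() if a[i] > 0) / total_var

S1, ST1 = first_order(0), total_order(0)
S2, ST2 = first_order(1), total_order(1)
print(f"S1={S1:.3f} ST1={ST1:.3f}  S2={S2:.3f} ST2={ST2:.3f}")
```

By construction $S_1 + S_2 + S_{12} = 1$, since every squared coefficient is counted exactly once in the variance decomposition.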

Bootstrap Confidence Intervals

Estimates of Sobol indices come with statistical uncertainty. Evaluating confidence intervals using the Bootstrap method is standard:

  1. Resample $N$ points with replacement from the original $N$ samples
  2. Recalculate Sobol indices with the resampled data
  3. Repeat this $B$ times (usually 1000 or more)
  4. Construct percentile confidence intervals from the $B$ obtained estimates

If the confidence interval is wide, the sample size $N$ needs to be increased. In SALib, sobol.analyze() computes Bootstrap confidence intervals automatically; the number of resamples is set with the num_resamples argument (calc_second_order instead controls whether second-order indices are also estimated).
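
The four steps can be sketched as follows, bootstrapping the first-order estimate for a made-up additive toy model $Y = X_1 + 2X_2$ with $X_i \sim U(0,1)$ (analytic $S_1 = 0.2$):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical additive toy model, analytic S1 = 0.2.
def model(x):
    return x[:, 0] + 2.0 * x[:, 1]

N, k, i = 4096, 2, 0
A = rng.uniform(0, 1, (N, k))
B = rng.uniform(0, 1, (N, k))
ABi = A.copy()
ABi[:, i] = B[:, i]
fA, fB, fABi = model(A), model(B), model(ABi)

def s1_hat(idx):
    # Saltelli first-order estimator restricted to the resampled rows idx.
    V = np.concatenate([fA[idx], fB[idx]]).var()
    return np.mean(fB[idx] * (fABi[idx] - fA[idx])) / V

S1 = s1_hat(np.arange(N))                                  # point estimate
# Steps 1-3: resample rows with replacement, B = 1000 replicates.
boot = np.array([s1_hat(rng.integers(0, N, N)) for _ in range(1000)])
# Step 4: percentile confidence interval.
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"S1 ≈ {S1:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Note that the rows of $\mathbf{A}$, $\mathbf{B}$, $\mathbf{A}_B^{(i)}$ are resampled jointly — the model outputs are reused, so the bootstrap adds no extra model evaluations.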

Practical Guide

Sobol Analysis Workflow

🧑‍🎓

When actually doing Sobol analysis, what should you do from start to finish?

🎓

Six steps.

  1. Problem Definition: Clarify the output QoI (Quantity of Interest) and input parameters $X_1, \ldots, X_k$. Determine the probability distribution for each input (uniform, normal, etc.)
  2. Screening (optional): If the number of parameters is large (20+), first use the Morris method to exclude unimportant parameters
  3. Sample Matrix Generation: Generate $\mathbf{A}, \mathbf{B}, \mathbf{A}_B^{(i)}$ using the Saltelli method. Using SALib's saltelli.sample() is easiest
  4. Model Evaluation: Execute the CAE model at all generated sample points. Parallel computing is recommended
  5. Sobol Index Calculation: Collect evaluation results and compute the estimators. Use SALib's sobol.analyze()
  6. Result Interpretation and Decision Making: Rank parameter importance from $S_i, S_{Ti}$ and reflect it in design optimization or uncertainty reduction

How to Determine Sample Size

🧑‍🎓

How large should $N$ be? I know bigger is better, but there's also a budget...

🎓

Let's summarize a guideline in a table.

| Number of Parameters $k$ | $N$ (Base Sample Count) | Total Model Evaluations $N(k+2)$ | Notes |
| --- | --- | --- | --- |
| 3–5 | 1,024–2,048 | 5,120–14,336 | Direct Sobol method is sufficient |
| 5–10 | 2,048–4,096 | 14,336–49,152 | Approaching the range where a surrogate model is recommended |
| 10–20 | 4,096–8,192 | 49,152–180,224 | Surrogate model essential; prior Morris screening recommended |
| 20+ | — | — | Direct Sobol method is unrealistic; use PCE or a two-stage Morris → Sobol strategy |
🧑‍🎓

Is it better for $N$ to be a power of two?

🎓

When using Sobol sequences (quasi-Monte Carlo), $N = 2^p$ is optimal. It maximizes the benefits of the low-discrepancy property. With pseudo-random numbers, any $N$ is fine.

Result Interpretation and Parameter Fixing

🧑‍🎓

After obtaining Sobol indices, what specific decisions are made?

🎓

Three decision criteria: