Stochastic Gradient Langevin Dynamics (SGLD) Simulator

SGLD (Welling & Teh 2011) turns a stochastic-gradient optimiser into a Bayesian posterior sampler by adding a sqrt(2η) Brownian kick at every step. Drive the six sliders — step size, iterations, mini-batch, dataset size, noise amplification and gradient variance — and watch the stochastic-bias / Langevin-noise trade-off, effective sample size, mixing time and Gelman-Rubin R-hat respond in real time.

Parameters
Step size η
Large η → shorter mixing time but larger bias
Iterations
Total sampler steps — drives ESS and coverage
Mini-batch size B
Smaller B → noisier gradient → higher bias, cheaper step
Dataset size N
Bias scales with N/B in the SGLD estimator
Noise amplification × sqrt(2/N·η)
Temperature-like control on the Brownian kick
Gradient variance σ²_g
Variance of the mini-batch gradient — main bias driver
Results
Stochastic bias
Langevin noise
Effective sample size
Mixing time (iters)
Posterior coverage (%)
R-hat convergence
2D loss surface — SGD vs SGLD walkers

Red dot: SGD (heads to the MAP estimate). Blue cloud: SGLD (Brownian kicks let it visit the posterior contour). The light concentric ellipses are the posterior p(θ|D); change η and noise to see the cloud expand or collapse.

SGD vs SGLD loss trajectories
Posterior coverage vs iterations
Theory & Key Formulas

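$$\theta_{t+1} = \theta_t - \eta_t\,\frac{N}{|B_t|}\sum_{i\in B_t} \nabla L_i(\theta_t) + \sqrt{2\eta_t}\,\xi_t,\quad \xi_t \sim \mathcal{N}(0,I)$$

SGLD update (Welling & Teh 2011, written with η_t equal to half their step size so that it matches the sqrt(2η) kick used throughout this page). η_t is the step size, B_t the mini-batch, and ξ_t the Brownian kick. The factor N/|B_t| rescales the mini-batch sum into an unbiased estimate of the full-data gradient, and L_i is the negative log-posterior contribution of data point i (its negative log-likelihood plus 1/N of the negative log-prior). As η→0 the chain samples the exact posterior; for finite η the samples carry a bias.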
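To make the update concrete, here is a minimal sketch on a toy problem where the exact posterior is known in closed form. The model (x_i ~ N(θ, 1) with prior θ ~ N(0, 100)), the function names and all default values are assumptions chosen for illustration; they are not taken from the simulator.

```python
import math
import random


def sgld_gaussian_mean(data, eta=1e-3, batch_size=10, n_steps=6000,
                       burn_in=1000, prior_var=100.0, seed=0):
    """SGLD samples of theta for x_i ~ N(theta, 1), prior theta ~ N(0, prior_var)."""
    rng = random.Random(seed)
    n = len(data)
    theta = 0.0
    samples = []
    for t in range(n_steps):
        batch = rng.sample(data, batch_size)
        # Unbiased estimate of the gradient of the negative log-posterior:
        # prior term plus the mini-batch likelihood term rescaled by N/B.
        grad = theta / prior_var + (n / batch_size) * sum(theta - x for x in batch)
        # Gradient step plus the sqrt(2*eta) Gaussian kick.
        theta = theta - eta * grad + math.sqrt(2.0 * eta) * rng.gauss(0.0, 1.0)
        if t >= burn_in:
            samples.append(theta)
    return samples


def exact_posterior(data, prior_var=100.0):
    """Closed-form posterior mean and variance for the same conjugate model."""
    precision = len(data) + 1.0 / prior_var
    return sum(data) / precision, 1.0 / precision


data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(100)]  # synthetic data (assumption)
samples = sgld_gaussian_mean(data)
post_mean, post_var = exact_posterior(data)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(f"posterior mean {post_mean:.3f}, SGLD mean {sample_mean:.3f}")
print(f"posterior var  {post_var:.4f}, SGLD var  {sample_var:.4f}")
```

With a finite step size and a small mini-batch, the sample variance usually comes out somewhat larger than the exact posterior variance, which is the finite-η bias described above.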

$$\mathrm{Bias} \;\propto\; \eta\,\sigma_g^2\,\frac{N}{B}, \qquad \mathrm{Var} \;\propto\; \eta, \qquad \tau_{\mathrm{mix}} \approx \frac{1}{\eta}$$

Bias grows with η, with the gradient variance σ²_g and with the rescaling factor N/B; mixing time τ_mix scales as 1/η. The Brownian kick has standard deviation sqrt(2η), so the injected variance grows linearly in η, and the bias and mean-squared error of posterior averages admit upper bounds that are functions of η (Chen-Ding-Carin 2015).
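The three scalings can be turned into a small calculator. The sketch below treats each proportionality as an equality with unit constant, which is an assumption, and uses σ²_g = 0.5 because that value reproduces the 0.156 bias quoted in the dialogue on this page. The function name `sgld_diagnostics` is hypothetical.

```python
import math


def sgld_diagnostics(eta, grad_var, n_data, batch_size):
    """Heuristic bias, kick size and mixing time from the scalings above."""
    bias = eta * grad_var * n_data / batch_size   # Bias ~ eta * sigma_g^2 * N / B
    noise = math.sqrt(2.0 * eta)                  # standard deviation of the kick
    mixing_time = 1.0 / eta                       # tau_mix ~ 1 / eta
    return bias, noise, mixing_time


bias, noise, tau = sgld_diagnostics(eta=1e-3, grad_var=0.5,
                                    n_data=10_000, batch_size=32)
print(f"bias={bias:.3f} noise={noise:.3f} mixing_time={tau:.0f}")
print(f"bias / noise = {bias / noise:.2f}")
```

A ratio well above 1 suggests the chain is dominated by mini-batch gradient noise rather than by the injected Langevin noise.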

Stochastic Gradient Langevin Dynamics (SGLD) — Bayesian Deep Learning

🙋
I keep seeing "SGLD" in recent ML papers. Is it just a cousin of SGD? The "Langevin" surname makes it sound suspicious…
🎓
More like its identical twin. The update rule is almost the same: SGD does θ ← θ − η·∇L, SGLD adds sqrt(2η)·ξ on top. One extra line, but the personality changes completely: an algorithm that hunts for one maximum-likelihood point becomes an MCMC sampler that draws from the Bayesian posterior p(θ|D). Welling & Teh's original ICML 2011 paper is only eight pages and helped kick-start scalable Bayesian deep learning. "Langevin" comes from the Langevin equation, the continuous-time stochastic differential equation that targets a given distribution by adding Brownian motion to a gradient flow.
🙋
OK but if I bump up the step size η, the R-hat on the right keeps climbing. The default is already 4.49 — flagged as "not converged". The sampler doesn't really work, does it?
🎓
That's the most famous gotcha. Stochastic bias scales like η·σ²_g·N/B, so with N=10000 and B=32 the N/B≈312 amplification pumps the bias to 0.156 even at η=0.001. Meanwhile the Brownian kick sqrt(2η)=0.045 is more than 3× smaller, so the chain is barely "sampling"; it is "SGD with extra bias". Try one of: drop N to 100, push B to 256, or lower η to 0.0001. Each of those shrinks the bias term and pulls R-hat down toward the 1.2 threshold.
🙋
I tried η=0.0001 and R-hat dropped to 1.36. But the mixing time jumped to 10 000 iters, so 5 000 is nowhere near enough. Picking a perfect setting feels impossible!
🎓
Welcome to the bias-variance trade-off of SGLD. A smaller η reduces bias but inflates the mixing time (≈ 1/η) and the sample autocorrelation, which starves the effective sample size. Production code often uses a polynomial decay η_t = a/(b+t)^γ with γ∈(0.5,1]: explore broadly at the start, then anneal to a tiny η to capture the posterior accurately. Pereyra-Vono and others have written extensively on choosing the schedule.
🙋
The 2D canvas is hypnotic — the blue SGLD walker drifts around while the red SGD point hugs the centre. What exactly am I looking at?
🎓
A cartoon of the posterior landscape. The dark centre is the MAP estimate, the pale concentric rings are the posterior p(θ|D). SGD takes the shortest path down that valley and stays there — a point estimate. SGLD never settles; the Brownian kicks make it tour the high-probability region. Average many such samples and you obtain predictive uncertainty: "I am 90% sure the tumour is malignant, 95% credible interval 80–97%". That ability to quote calibrated uncertainty is the killer feature of Bayesian deep learning.
🙋
Without that you can't say "I'm not sure", right? Is SGLD actually painful to implement? Can I use it in PyTorch?
🎓
Embarrassingly easy. Take a regular torch.optim.SGD step on the mini-batch loss rescaled by N/B (so that it estimates the full-data negative log-posterior), then append for p in model.parameters(): p.data += torch.randn_like(p) * math.sqrt(2*lr) and you are done. TensorFlow Probability exposes a ready-made SGLD optimiser; for other probabilistic-programming libraries, check the current documentation, since support varies by version. For deep nets practitioners reach for pSGLD (preconditioned, Li 2016) or SGHMC (with momentum, Chen 2014) to converge faster. Mandt-Hoffman even showed that plain SGD itself can be viewed as a sampler with a stationary distribution, which makes SGLD the principled extension. A one-line change really does shift your worldview.

Frequently Asked Questions

What is the difference between SGD and SGLD?
Plain SGD applies θ_{t+1} = θ_t − η·∇L and walks toward the maximum-likelihood solution. SGLD adds a Gaussian kick sqrt(2η)·ξ_t (with ξ_t ~ N(0,I)) to that same update at every step. The Brownian noise turns the iteration into a discretisation of the Langevin diffusion, so as η→0 the chain samples from the Bayesian posterior p(θ|D). A one-line code change converts an optimiser into a sampler, which is why SGLD is widely used for deep-learning uncertainty quantification and reinforcement-learning exploration.
How should I choose the step size η?
SGLD has an intrinsic bias-variance trade-off. A large η lets the chain travel quickly (mixing time ≈ 1/η is short) but inflates the stochastic bias ≈ η·σ²_g·N/B, so the chain drifts away from the true posterior. A small η reduces bias but yields a tiny effective sample size (ESS) and slow convergence. Welling & Teh's original paper recommends η_t = a/(b+t)^γ with γ∈(0.5,1]. In practice start at 1e-4 to 1e-3 and look for the largest η that still gives R-hat < 1.2 and posterior coverage > 80%.
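The decaying schedule is straightforward to code. The sketch below parameterises it by the initial step size `eta0` (so a = eta0·b^γ); the function name and the default values are assumptions for illustration. The range γ∈(0.5,1] is what makes the step sizes sum to infinity while their squares stay summable.

```python
def polynomial_decay(t, eta0=1e-3, b=100.0, gamma=0.55):
    """Step size eta_t = a / (b + t)**gamma, with a chosen so that eta_0 = eta0."""
    if not 0.5 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0.5, 1]")
    if b <= 0 or t < 0:
        raise ValueError("b must be positive and t non-negative")
    return eta0 * (b / (b + t)) ** gamma


# Step size at a few iteration counts.
schedule = [polynomial_decay(t) for t in (0, 100, 1_000, 10_000)]
print(schedule)
```

A larger b keeps the step size near eta0 for longer before the decay sets in; a larger γ anneals faster.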
What is R-hat and how should I read it?
R-hat (Gelman-Rubin) compares a pooled variance estimate, which combines between-chain and within-chain variance, with the within-chain variance alone: R-hat = sqrt(pooled variance / within-chain variance). It tends to 1 once all chains sample the same stationary distribution. Values above 1.1 to 1.2 mean the chains have not yet explored the full posterior and any moment estimate is unreliable. Gelman et al. (Bayesian Data Analysis, 2013) require R-hat < 1.1; finite-η samplers such as SGLD usually have a slightly higher value because of the bias, so this tool relaxes the threshold to 1.2. In real projects always run several chains from different initialisations and check R-hat per parameter.
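For reference, the classic (non-split) Gelman-Rubin statistic for a single scalar parameter can be computed as below. This is a minimal sketch; the function name is hypothetical and the demo chains are synthetic Gaussian draws, not SGLD output.

```python
import math
import random
import statistics


def gelman_rubin(chains):
    """Classic R-hat for m equal-length chains of one scalar parameter."""
    m, n = len(chains), len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])  # W
    between_over_n = statistics.variance(chain_means)                    # B / n
    pooled = (n - 1) / n * within + between_over_n                       # var-hat-plus
    return math.sqrt(pooled / within)


rng = random.Random(0)
# Four chains sampling the same distribution: R-hat should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]
# Four chains stuck around two different locations: R-hat should be large.
stuck = [[rng.gauss(mu, 1.0) for _ in range(500)] for mu in (0.0, 0.0, 3.0, 3.0)]
rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(f"mixed chains R-hat = {rhat_mixed:.3f}")
print(f"stuck chains R-hat = {rhat_stuck:.3f}")
```

Modern libraries typically report a split, rank-normalised variant of this statistic, which is stricter than the version sketched here.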
How does SGLD relate to SGHMC, pSGLD and SVGD?
SGHMC (Chen et al. 2014) adds Hamiltonian dynamics with friction, using momentum to escape local minima. pSGLD (Li et al. 2016) uses an RMSprop-like preconditioner to adapt to per-parameter curvature, which speeds up deep nets. SVGD (Liu & Wang 2016) is a deterministic kernelised variational method that updates many particles in parallel; it is more sample-efficient than SGLD but more expensive per step. A common Bayesian-deep-learning workflow is to start from SGLD/pSGLD as a baseline and switch to SGHMC or Stein methods only when needed.
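To show what the RMSprop-like preconditioner changes, here is a simplified pSGLD-style sketch for a single scalar parameter. It is an approximation under stated assumptions: the curvature-correction term of the original algorithm is dropped, the running average is taken over the rescaled stochastic gradient, and the toy model (x_i ~ N(θ, 1), prior θ ~ N(0, 100)), names and defaults are all illustrative.

```python
import math
import random


def psgld_gaussian_mean(data, eta=1e-2, batch_size=10, n_steps=12000,
                        burn_in=2000, alpha=0.99, eps=1e-5,
                        prior_var=100.0, seed=0):
    """Simplified pSGLD-style sampler (correction term omitted) for a Gaussian mean."""
    rng = random.Random(seed)
    n = len(data)
    theta, v = 0.0, 0.0
    samples = []
    for t in range(n_steps):
        batch = rng.sample(data, batch_size)
        grad = theta / prior_var + (n / batch_size) * sum(theta - x for x in batch)
        # RMSprop-style running average of the squared gradient.
        v = alpha * v + (1.0 - alpha) * grad * grad
        precond = 1.0 / (eps + math.sqrt(v))
        # Both the drift and the injected noise are scaled by the preconditioner.
        theta = (theta - eta * precond * grad
                 + math.sqrt(2.0 * eta * precond) * rng.gauss(0.0, 1.0))
        if t >= burn_in:
            samples.append(theta)
    return samples


data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(100)]  # synthetic data (assumption)
samples = psgld_gaussian_mean(data)
posterior_mean = sum(data) / (len(data) + 1.0 / 100.0)
sample_mean = sum(samples) / len(samples)
print(f"posterior mean {posterior_mean:.3f}, pSGLD mean {sample_mean:.3f}")
```

Because the preconditioner rescales the step automatically, the nominal η here is ten times larger than one would use for plain SGLD on the same problem without the effective step size blowing up.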

Real-World Applications

Medical-AI uncertainty quantification: CT classifiers and pathology segmentation models must return calibrated confidence ("90% malignant, 95% CI 80–97%"). Kendall & Gal (2017) used MC-Dropout approximate posteriors to separate aleatoric from epistemic uncertainty; SGLD samples can play the same role, letting a system flag low-confidence cases to a clinician.
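A statement such as "90% malignant, 95% CI 80–97%" comes from pushing every posterior sample through the model and summarising the resulting predictions. The sketch below does this for a one-feature logistic model; the posterior samples are synthetic Gaussian draws standing in for SGLD output, and all names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predictive_summary(x, param_samples, level=0.95):
    """Posterior-mean probability and an equal-tailed credible interval at input x."""
    probs = sorted(sigmoid(w * x + b) for w, b in param_samples)
    mean = sum(probs) / len(probs)
    lo_idx = round((1.0 - level) / 2.0 * (len(probs) - 1))
    hi_idx = round((1.0 + level) / 2.0 * (len(probs) - 1))
    return mean, probs[lo_idx], probs[hi_idx]


rng = random.Random(0)
# Stand-in for posterior samples of (weight, bias); real ones would come from SGLD.
param_samples = [(rng.gauss(2.0, 0.3), rng.gauss(-1.0, 0.3)) for _ in range(2000)]
mean, lower, upper = predictive_summary(2.0, param_samples)
print(f"P(y=1 | x=2.0) = {mean:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

A wide interval is the signal to route the case to a human reviewer; a point-estimate network would report only the single probability.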

Reinforcement-learning exploration: Bayesian DQN and Posterior Sampling for RL (PSRL) draw one fresh sample of the Q-function per episode and act greedily under that sampled estimate, the Thompson-sampling idea. SGLD is one practical way to draw that sample, and Osband, Russo and co-authors report that posterior-sampling exploration can need far fewer trials than ε-greedy.

Bayesian optimisation in materials and drug discovery: Single quantum-chemistry (DFT) or wet-lab experiments can take hours to weeks. Sampling a surrogate posterior (Gaussian process or Bayesian NN) with SGLD and selecting the next experiment by Expected Improvement or Thompson sampling has become the active-learning recipe. Hernández-Lobato and collaborators report finding pharmaceutical lead compounds in 1/10 the experiments.

Adversarial robustness: Point-estimate networks are easily fooled by adversarial perturbations invisible to a human. Bayesian ensembles built from SGLD posteriors dramatically reduce attack success rates (Gal & Smith 2018; Carmon et al.), making Bayesian deep learning a recognised baseline for adversarial defence.

Common Pitfalls and Misunderstandings

The biggest trap is treating SGLD like a Metropolis-Hastings sampler. SGLD draws from a biased approximation of the target because it omits the MH correction — a finite step size always leaves residual bias. Unless you take η to zero or switch to the MALA-corrected SGLD of Vollmer-Zygalakis, the samples are not draws from the exact posterior. Welling-Teh's "asymptotic exactness as η→0" is a theoretical guarantee; production code typically lives with the bias.
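The effect of the missing Metropolis-Hastings correction is easy to see on a toy target. The sketch below runs the same Langevin proposal with and without the accept/reject step (the corrected version is MALA) on a standard normal target, using the exact full gradient and a deliberately large step size. Target, names and settings are assumptions for illustration.

```python
import math
import random
import statistics


def potential(theta):
    return 0.5 * theta * theta      # negative log-density of N(0, 1), up to a constant


def grad_potential(theta):
    return theta


def langevin_chain(eta, n_steps, metropolis, seed=0):
    """Langevin proposals; with metropolis=True each proposal is MH-corrected (MALA)."""
    rng = random.Random(seed)
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        proposal = (theta - eta * grad_potential(theta)
                    + math.sqrt(2.0 * eta) * rng.gauss(0.0, 1.0))
        if metropolis:
            # Log-densities of the forward and reverse Gaussian proposals.
            log_q_fwd = -(proposal - theta + eta * grad_potential(theta)) ** 2 / (4.0 * eta)
            log_q_rev = -(theta - proposal + eta * grad_potential(proposal)) ** 2 / (4.0 * eta)
            log_alpha = potential(theta) - potential(proposal) + log_q_rev - log_q_fwd
            if rng.random() < math.exp(min(0.0, log_alpha)):
                theta = proposal
        else:
            theta = proposal        # unadjusted: always accept
        samples.append(theta)
    return samples


ula_var = statistics.pvariance(langevin_chain(0.5, 20000, metropolis=False))
mala_var = statistics.pvariance(langevin_chain(0.5, 20000, metropolis=True))
print(f"target variance 1.000, unadjusted {ula_var:.3f}, MH-corrected {mala_var:.3f}")
```

The accept/reject step needs the potential evaluated on the full dataset, which is exactly the cost that mini-batch SGLD avoids, so the correction is not free at scale.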

Another myth is that "posterior coverage above 90% means I can trust the chain." The coverage shown here is an empirical heuristic based on log iterations. If the loss landscape is multimodal (which deep nets always are) the chain may visit only one mode while ignoring others, and coverage will look fine while estimates are wildly wrong. Run multiple chains from very different initialisations, evaluate multivariate R-hat, and consider replica exchange or parallel tempering. Mode collapse is one of the deepest open problems in Bayesian deep learning.

Finally, "a bigger batch is always better because it lowers bias" is wrong in practice. The bias η·σ²_g·N/B does shrink as B grows, but per-step compute scales linearly with B, so ESS per second usually peaks well before B=N (full batch). Conversely, a B that is too small (B=1) blows up gradient variance and the bias dominates. A common rule of thumb is to use as large a B as memory allows and to retune η for that B rather than reuse the old value. Preconditioned variants like pSGLD widen the usable B range. The professional habit is to grid-search η and B together.

How to Use

  1. Set the step size (learning rate) η; 1e-4 to 1e-3 is a sensible starting range for the default 10,000-sample dataset. Smaller values reduce stochastic bias but increase mixing time (≈ 1/η).
  2. Configure the mini-batch size B (e.g., 32, 64, 128) relative to the dataset size N; the bias scales with N/B, so B = 64 with N = 10,000 halves the bias of B = 32 at twice the cost per step.
  3. Choose the iteration count (typically 5,000–50,000), ideally several times the mixing time; monitor the R-hat convergence statistic (this tool flags values above 1.2; 1.1 is the stricter textbook threshold) and the effective sample size to confirm chain mixing.
  4. Run the simulator and inspect output metrics: Langevin noise magnitude, stochastic bias, mixing time in iterations, and posterior coverage percentage.
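When checking effective sample size outside the simulator, a simple autocorrelation-based estimate is often enough. The sketch below sums positive-lag autocorrelations until the first non-positive one, which is a common truncation heuristic and not necessarily what this tool computes; the function name and the AR(1) demo chain are assumptions.

```python
import math
import random


def effective_sample_size(samples, max_lag=1000):
    """ESS = n / (1 + 2 * sum of autocorrelations), truncated at the first rho <= 0."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        raise ValueError("ESS is undefined for a constant chain")
    tau = 1.0
    for lag in range(1, min(max_lag, n - 1) + 1):
        acov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return n / tau


rng = random.Random(0)
n = 4000
iid_chain = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Strongly autocorrelated AR(1) chain standing in for a slowly mixing sampler.
rho_true, x, ar_chain = 0.9, 0.0, []
for _ in range(n):
    x = rho_true * x + math.sqrt(1.0 - rho_true ** 2) * rng.gauss(0.0, 1.0)
    ar_chain.append(x)
ess_iid = effective_sample_size(iid_chain)
ess_ar = effective_sample_size(ar_chain)
print(f"iid chain ESS ~ {ess_iid:.0f} of {n}")
print(f"AR(1) chain ESS ~ {ess_ar:.0f} of {n}")
```

The autocorrelated chain delivers far fewer effective samples than its length suggests, which is why a small η (slow mixing) needs many more iterations.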

Worked Example

For a logistic regression model on a dataset with 5,000 observations: set step size = 0.01, batch size = 128, iterations = 20,000. SGLD generates approximately 8,500 effective samples (ESS) with R-hat = 1.02 across parameters. Stochastic bias remains at 0.003 (negligible), Langevin noise ≈ 0.015 enables exploration, and mixing time is ≈ 450 iterations. Empirical coverage of the 95% credible intervals is 94%, close to the nominal level. Compare against full-batch HMC (mixing time ~200 iterations but O(N) cost per step) and standard SGD, which returns a point estimate and no posterior samples.

Practical Notes

  1. Step size trades bias against ESS: lowering η shrinks the bias η·σ²_g·N/B proportionally but lengthens the mixing time, so ESS drops for a fixed iteration budget; use hold-out predictive likelihood to select the step size.
  2. Batch size tuning: B ≈ sqrt(N) is a reasonable starting heuristic; for N = 10,000 that gives B ≈ 100, which balances gradient noise against computational cost per iteration.
  3. Warm-up phase: discard first 20% of iterations to eliminate initialization transient; reset R-hat diagnostic after warm-up to detect late-stage non-convergence.
  4. High-dimensional inference (p > 1,000): increase iterations 5× and reduce step size to 0.001–0.005 to maintain ESS per parameter above 400.