EM Algorithm (1D GMM) Simulator Back
Statistics Simulator

EM Algorithm (1D GMM) Simulator

Fit a two-component 1D Gaussian mixture using the EM algorithm. Watch the estimated mixture density converge onto the histogram and the log-likelihood rise monotonically as the E and M steps iterate.

Parameters
μ₁ initial value
μ₂ initial value
EM iterations
iter
Data seed

Fixed: σ₁=σ₂=1.5, π₁=π₂=0.5 (initial). True values: μ=(0, 4), σ=(1.0, 1.5), π=(0.4, 0.6), N=200.

Results
Estimated μ₁ (true 0.0)
Estimated μ₂ (true 4.0)
Estimated π₁ (true 0.40)
Final log-likelihood LL
Histogram, Mixture Density and LL Trajectory

Top: histogram (gray) / true mixture density (black dashed) / estimated mixture density (blue solid) / component 1 (green dashed) / component 2 (red dashed). Bottom: log-likelihood LL[t] per iteration (red line, monotonically non-decreasing).

Theory & Key Formulas

A two-component Gaussian mixture $p(x)=\pi_1\mathcal{N}(x|\mu_1,\sigma_1^2)+\pi_2\mathcal{N}(x|\mu_2,\sigma_2^2)$ is estimated by iterating the E and M steps of the EM algorithm.

E step (the responsibility $\gamma_{ik}$ is the posterior probability that point $x_i$ was generated by component $k$):

$$\gamma_{ik}=\frac{\pi_k\,\mathcal{N}(x_i|\mu_k,\sigma_k^2)}{\sum_j \pi_j\,\mathcal{N}(x_i|\mu_j,\sigma_j^2)}$$

M step (update $\mu_k, \sigma_k^2, \pi_k$ using responsibility-weighted estimates):

$$N_k=\sum_i \gamma_{ik},\quad \mu_k\leftarrow\frac{1}{N_k}\sum_i \gamma_{ik}\,x_i,\quad \sigma_k^2\leftarrow\frac{1}{N_k}\sum_i \gamma_{ik}(x_i-\mu_k)^2,\quad \pi_k\leftarrow\frac{N_k}{N}$$

Log-likelihood (guaranteed non-decreasing across EM iterations):

$$\mathrm{LL}=\sum_i \log\!\left(\sum_k \pi_k\,\mathcal{N}(x_i|\mu_k,\sigma_k^2)\right)$$

The monotonic rise of LL is a check on the implementation. If LL ever decreases between iterations, the update equations have a bug.

What is the EM Algorithm (1D GMM) Simulator

🙋
I hear about the "EM algorithm" all the time, but what is it actually doing? All I see in the graph is two bumps...
🎓
Roughly speaking, EM is an iterative method that recovers a hidden mixture structure behind observed data. The histogram above is 200 samples drawn from two Gaussians, one centered at 0 (sd 1) and another at 4 (sd 1.5). Looking at just the data, you cannot tell which sample came from which component. EM estimates that. Try moving the "EM iterations" slider from 0 to 30 and watch the blue solid curve (estimated mixture) sit right on top of the gray histogram.
🙋
What is the difference between the E step and the M step? Can't we just run one of them?
🎓
The E step computes, given the current parameters, the probability that each point came from each component. That number is called the "responsibility". The M step then uses those responsibilities as weights to update each component's mean, variance, and mixing weight by weighted maximum likelihood. We alternate "weight the data (E)" and "update by weighted average (M)". Look at the LL plot at the bottom: EM is guaranteed to make the log-likelihood non-decreasing at every iteration.
🙋
So if I change the "μ₁ initial value", the result also changes?
🎓
Yes, and this is a classic pitfall. EM only guarantees a local maximum. With mu1_init = 1.0 and mu2_init = 3.0 you converge cleanly toward the true values 0 and 4, but if you put both around 2.0 the algorithm sometimes fails to separate the components, or one mean is flung far away. In practice you run EM from many random starts (10 or so) and keep the solution with the largest log-likelihood, or you initialize from a K-means clustering.
🙋
You mentioned K-means — how is that related to EM?
🎓
K-means is actually a special case of EM on a GMM. If you share the variance across all components and take the limit sigma -> 0, the responsibility becomes 1 for the nearest center and 0 elsewhere. That is exactly the K-means hard assignment. EM does a "soft" probabilistic assignment, K-means does a "hard" deterministic one. In overlapping regions, EM avoids throwing away information.

Frequently Asked Questions

K is rarely known in advance. The standard approach is the Bayesian Information Criterion BIC = −2·LL + k·log(N), with k the number of free parameters and N the sample size. Try K = 1, 2, 3, ... and choose the K that minimizes BIC. AIC has a lighter penalty (tends to overfit) while BIC is more parsimonious, and many practitioners compare both. When interpretability matters, it is often safer to fix K from domain knowledge.
The standard remedy is multi-start EM: run from many (e.g., 10 to 20) random initializations and keep the solution with the largest final log-likelihood. A common alternative is to initialize the EM means and weights from a K-means clustering. In this tool, try setting mu1 and mu2 far from the true values and watch the algorithm get stuck in a local optimum.
K-means is fine when clusters are roughly spherical, of similar size, and well separated. GMM/EM is preferred when clusters are elliptical, of unequal size, or overlap. GMM also gives a probability (responsibility) per point, which is useful for points near a boundary. A practical workflow is to start with K-means to get a feel for the data, then refine with GMM if needed.
The standard rule is to stop once the increase |LL[t] − LL[t−1]| falls below a small threshold (1e−4 or 1e−6) and to cap the number of iterations (e.g., 100 to 1000). The default of 30 iterations in this simulator is enough for this dataset. If the LL ever decreases, the implementation has a bug — first review numerical stabilization (adding small ε in the right places).

Real-World Applications

Customer segmentation and marketing: When purchase amounts or visit frequencies arise as a mix of several latent customer groups, GMM/EM uncovers latent segments such as "high value", "average", and "at-risk", which can then be targeted with tailored campaigns. Compared to K-means, GMM handles elliptical clusters more flexibly.

Anomaly detection: Fit a GMM to "normal" data, then flag a new point whose log-likelihood is unusually low as an anomaly. This classic but powerful approach is used for sensor time series on manufacturing lines, network traffic, and financial transactions.

Speech and image processing: GMM-HMM was the dominant acoustic model in speech recognition for decades (now mostly replaced by deep learning, but still used in lightweight pipelines). In image segmentation, modeling pixel color distributions with a GMM and inferring foreground/background memberships is the basis of methods like GrabCut.

Maximum likelihood under missing data: EM is not limited to GMMs. It applies to maximum likelihood estimation in any probabilistic model with unobserved latent variables. Treating missing values as latent variables lets you run ML estimation on data with missing entries inside the same EM framework. Many other algorithms (HMM, factor analysis, PLSA, LDA) are built on EM.

Common Misconceptions and Cautions

The most common misconception is that EM returns the global optimum. EM is only guaranteed to make the log-likelihood non-decreasing and to converge to a local maximum (or saddle point), not the global one. In this simulator, set mu1_init around −2 and mu2_init around 7 and you can watch the algorithm settle into a different local solution with worse estimates. In real work, always run multi-start EM and keep the solution with the largest log-likelihood.

The next most common mistake is to think increasing K always improves the model. Larger K always raises the training log-likelihood, but this is just overfitting: the fit on the training set improves while generalization to new data deteriorates. Pick K using BIC, AIC, or cross-validation, and balance fit against interpretability. The rule of thumb is: never choose K from training LL alone.

Finally, beware of numerical instability. The denominator of the responsibility can vanish, a variance can collapse to zero, or the log argument can hit zero. In this tool we floor the Gaussian variance (1e−6), add a small ε (1e−12) inside the log-likelihood, and add ε to both numerator and denominator of the responsibility. When implementing EM yourself, place ε consistently and always check that LL is non-decreasing at every iteration — that single check catches most bugs early.