How do I choose the number of mixture components K?

The number of components K is usually unknown in advance. A standard approach is to use the Bayesian Information Criterion BIC = -2 * LL + k * log(N), where k is the number of free parameters and N is the sample size. Try K = 1, 2, 3, ... and pick the K that minimizes BIC. AIC has a lighter penalty (tends to overfit), while BIC is stricter (favors parsimonious models). In practice, comparing BIC across candidate Ks is the standard recipe.

How sensitive is EM to the initial values?

EM is only guaranteed to converge to a local maximum of the log-likelihood, so initial values matter a lot. Try setting mu1 and mu2 far apart (for example 1.0 and 3.0) versus close together (for example both around 2.0) and watch how the result changes. In practice, EM is run from many random initial values and the solution with the largest log-likelihood is kept (multi-start EM), or K-means clustering is used to provide a good starting point.

What is the relationship to K-means?

K-means can be seen as a special case of EM. If we share the variance across all components in a GMM and take the limit sigma -> 0, the responsibility gamma becomes 1 for the nearest center and 0 for the others, which is exactly the K-means hard assignment. EM/GMM does a probabilistic soft assignment, while K-means does a deterministic hard assignment. It is common to initialize EM from a K-means solution.

How do I judge whether EM has converged?

A standard test is to stop once the increase in log-likelihood |LL[t] - LL[t-1]| falls below a small threshold (for example 1e-4) and to also cap the number of iterations. EM is theoretically guaranteed to make LL monotonically non-decreasing, so if LL ever decreases there is a bug in the implementation. Check that the LL curve in the plot below is non-decreasing and flattens out.

EM Algorithm (1D GMM) Simulator — Free Online Calculator

Parameters

μ₁ initial value

—

μ₂ initial value

—

EM iterations

iter

Data seed

—

Fixed: σ₁=σ₂=1.5, π₁=π₂=0.5 (initial). True values: μ=(0, 4), σ=(1.0, 1.5), π=(0.4, 0.6), N=200.

While paused, move the sliders to update the result instantly.

Results

—

Iteration t

—

Estimated μ₁ (true 0.0)

—

Estimated μ₂ (true 4.0)

—

Log-likelihood LL

Histogram, Mixture Density and LL Trajectory

Top: histogram (gray) / true mixture density (black dashed) / estimated mixture density (blue solid) / component 1 (green dashed) / component 2 (red dashed). Bottom: log-likelihood LL[t] per iteration (red line, monotonically non-decreasing).

Theory & Key Formulas

A two-component Gaussian mixture $p(x)=\pi_1\mathcal{N}(x|\mu_1,\sigma_1^2)+\pi_2\mathcal{N}(x|\mu_2,\sigma_2^2)$ is estimated by iterating the E and M steps of the EM algorithm.

E step (the responsibility $\gamma_{ik}$ is the posterior probability that point $x_i$ was generated by component $k$):

$$\gamma_{ik}=\frac{\pi_k\,\mathcal{N}(x_i|\mu_k,\sigma_k^2)}{\sum_j \pi_j\,\mathcal{N}(x_i|\mu_j,\sigma_j^2)}$$

M step (update $\mu_k, \sigma_k^2, \pi_k$ using responsibility-weighted estimates):

$$N_k=\sum_i \gamma_{ik},\quad \mu_k\leftarrow\frac{1}{N_k}\sum_i \gamma_{ik}\,x_i,\quad \sigma_k^2\leftarrow\frac{1}{N_k}\sum_i \gamma_{ik}(x_i-\mu_k)^2,\quad \pi_k\leftarrow\frac{N_k}{N}$$

Log-likelihood (guaranteed non-decreasing across EM iterations):

$$\mathrm{LL}=\sum_i \log\!\left(\sum_k \pi_k\,\mathcal{N}(x_i|\mu_k,\sigma_k^2)\right)$$

The monotonic rise of LL is a check on the implementation. If LL ever decreases between iterations, the update equations have a bug.

What is the EM Algorithm (1D GMM) Simulator

🙋

I hear about the "EM algorithm" all the time, but what is it actually doing? All I see in the graph is two bumps...

🎓

Roughly speaking, EM is an iterative method that recovers a hidden mixture structure behind observed data. The histogram above is 200 samples drawn from two Gaussians, one centered at 0 (sd 1) and another at 4 (sd 1.5). Looking at just the data, you cannot tell which sample came from which component. EM estimates that. Try moving the "EM iterations" slider from 0 to 30 and watch the blue solid curve (estimated mixture) sit right on top of the gray histogram.

🙋

What is the difference between the E step and the M step? Can't we just run one of them?

🎓

The E step computes, given the current parameters, the probability that each point came from each component. That number is called the "responsibility". The M step then uses those responsibilities as weights to update each component's mean, variance, and mixing weight by weighted maximum likelihood. We alternate "weight the data (E)" and "update by weighted average (M)". Look at the LL plot at the bottom: EM is guaranteed to make the log-likelihood non-decreasing at every iteration.

🙋

So if I change the "μ₁ initial value", the result also changes?

🎓

Yes, and this is a classic pitfall. EM only guarantees a local maximum. With mu1_init = 1.0 and mu2_init = 3.0 you converge cleanly toward the true values 0 and 4, but if you put both around 2.0 the algorithm sometimes fails to separate the components, or one mean is flung far away. In practice you run EM from many random starts (10 or so) and keep the solution with the largest log-likelihood, or you initialize from a K-means clustering.

🙋

You mentioned K-means — how is that related to EM?

🎓

K-means is actually a special case of EM on a GMM. If you share the variance across all components and take the limit sigma -> 0, the responsibility becomes 1 for the nearest center and 0 elsewhere. That is exactly the K-means hard assignment. EM does a "soft" probabilistic assignment, K-means does a "hard" deterministic one. In overlapping regions, EM avoids throwing away information.

Frequently Asked Questions

K is rarely known in advance. The standard approach is the Bayesian Information Criterion BIC = −2·LL + k·log(N), with k the number of free parameters and N the sample size. Try K = 1, 2, 3, ... and choose the K that minimizes BIC. AIC has a lighter penalty (tends to overfit) while BIC is more parsimonious, and many practitioners compare both. When interpretability matters, it is often safer to fix K from domain knowledge.

The standard remedy is multi-start EM: run from many (e.g., 10 to 20) random initializations and keep the solution with the largest final log-likelihood. A common alternative is to initialize the EM means and weights from a K-means clustering. In this tool, try setting mu1 and mu2 far from the true values and watch the algorithm get stuck in a local optimum.

K-means is fine when clusters are roughly spherical, of similar size, and well separated. GMM/EM is preferred when clusters are elliptical, of unequal size, or overlap. GMM also gives a probability (responsibility) per point, which is useful for points near a boundary. A practical workflow is to start with K-means to get a feel for the data, then refine with GMM if needed.

The standard rule is to stop once the increase |LL[t] − LL[t−1]| falls below a small threshold (1e−4 or 1e−6) and to cap the number of iterations (e.g., 100 to 1000). The default of 30 iterations in this simulator is enough for this dataset. If the LL ever decreases, the implementation has a bug — first review numerical stabilization (adding small ε in the right places).

Real-World Applications

Customer segmentation and marketing: When purchase amounts or visit frequencies arise as a mix of several latent customer groups, GMM/EM uncovers latent segments such as "high value", "average", and "at-risk", which can then be targeted with tailored campaigns. Compared to K-means, GMM handles elliptical clusters more flexibly.

Anomaly detection: Fit a GMM to "normal" data, then flag a new point whose log-likelihood is unusually low as an anomaly. This classic but powerful approach is used for sensor time series on manufacturing lines, network traffic, and financial transactions.

Speech and image processing: GMM-HMM was the dominant acoustic model in speech recognition for decades (now mostly replaced by deep learning, but still used in lightweight pipelines). In image segmentation, modeling pixel color distributions with a GMM and inferring foreground/background memberships is the basis of methods like GrabCut.

Maximum likelihood under missing data: EM is not limited to GMMs. It applies to maximum likelihood estimation in any probabilistic model with unobserved latent variables. Treating missing values as latent variables lets you run ML estimation on data with missing entries inside the same EM framework. Many other algorithms (HMM, factor analysis, PLSA, LDA) are built on EM.

Common Misconceptions and Cautions

The most common misconception is that EM returns the global optimum. EM is only guaranteed to make the log-likelihood non-decreasing and to converge to a local maximum (or saddle point), not the global one. In this simulator, set mu1_init around −2 and mu2_init around 7 and you can watch the algorithm settle into a different local solution with worse estimates. In real work, always run multi-start EM and keep the solution with the largest log-likelihood.

The next most common mistake is to think increasing K always improves the model. Larger K always raises the training log-likelihood, but this is just overfitting: the fit on the training set improves while generalization to new data deteriorates. Pick K using BIC, AIC, or cross-validation, and balance fit against interpretability. The rule of thumb is: never choose K from training LL alone.

Finally, beware of numerical instability. The denominator of the responsibility can vanish, a variance can collapse to zero, or the log argument can hit zero. In this tool we floor the Gaussian variance (1e−6), add a small ε (1e−12) inside the log-likelihood, and add ε to both numerator and denominator of the responsibility. When implementing EM yourself, place ε consistently and always check that LL is non-decreasing at every iteration — that single check catches most bugs early.

EM Algorithm (1D GMM) Simulator

What is the EM Algorithm (1D GMM) Simulator

Frequently Asked Questions

Real-World Applications

Common Misconceptions and Cautions

How to Use

Worked Example

Practical Notes

EM Algorithm (1D GMM) Simulator

What is the EM Algorithm (1D GMM) Simulator

Frequently Asked Questions

Real-World Applications

Common Misconceptions and Cautions

Related Tools

How to Use

Worked Example

Practical Notes