Hierarchical Clustering Simulator Back
Machine Learning Simulator

Hierarchical Clustering Simulator — Agglomerative & Dendrogram

Apply agglomerative hierarchical clustering to 2D data and watch the dendrogram update live. Switch linkage methods and slide the distance threshold to see how the number of clusters is decided.

Parameters
Linkage method

0 = single / 1 = complete / 2 = average (UPGMA)

Distance threshold t
Points per cluster
pts
Data seed

Default settings (average, t=3.0, 10 pts × 3, seed=42) recover the true 3-cluster structure.

Results
Clusters at threshold t
Final merge distance d_max
Mean silhouette score
Linkage method
Scatter & Dendrogram

Top = scatter (colored by cluster, axes [-4,4] x [-4,4]) / Bottom = dendrogram (X = point id, Y = distance, orange dashed = threshold t)

Theory & Key Formulas

Agglomerative hierarchical clustering starts with each point as its own cluster and repeatedly merges the two nearest clusters. The linkage method changes the definition of inter-cluster distance $d(C_1,C_2)$.

Single linkage (nearest neighbor):

$$d(C_1,C_2)=\min_{x\in C_1,\;y\in C_2}\|x-y\|$$

Complete linkage (farthest neighbor):

$$d(C_1,C_2)=\max_{x\in C_1,\;y\in C_2}\|x-y\|$$

Average linkage (group average / UPGMA):

$$d(C_1,C_2)=\frac{1}{|C_1||C_2|}\sum_{x\in C_1}\sum_{y\in C_2}\|x-y\|$$

A horizontal cut of the dendrogram at threshold $t$ fixes the number of clusters that remain at that height.

What is the Hierarchical Clustering Simulator?

🙋
That upside-down tree thing I see in every ML textbook — what is a dendrogram actually showing?
🎓
Roughly, it is the merge history: which points joined which, and at what distance. Agglomerative hierarchical clustering starts with every point as its own cluster (30 of them here) and keeps merging the closest pair. Try the defaults above (average, t=3.0) and you'll see points fusing near the bottom and groups fusing near the top.
🙋
What about the orange dashed line at t=3.0? Cutting there gives 3 clusters, right?
🎓
Exactly — that is the "cut at a threshold" operation. Slide t down to 0.5 and you'll split the data into nearly 30 micro-clusters; slide it up to 10 and everything merges into one. Because the data is actually a mixture of 3 Gaussians, the default cut at t=3.0 recovers exactly k=3 — by design, not by luck.
🙋
Switching the linkage to "single" makes everything blob into one big cluster — what happened?
🎓
Nice catch. That is the famous "chaining effect" of single linkage. It defines inter-cluster distance as the closest pair, so a single bridging point can connect groups that should stay apart. Complete linkage measures the farthest pair instead and tends to produce compact clusters. Average sits in the middle and, under the name UPGMA, is the standard distance method in phylogenetics.
🙋
In practice, how do I choose between this and k-means?
🎓
k-means forces you to pick k up front; hierarchical lets you defer the choice and read it off the tree. So it shines when "how many clusters should there be?" is itself the question. The downside is O(n^3), so it struggles past tens of thousands of points. Small-to-medium data where you want to see the structure: that's hierarchical's sweet spot.

Frequently Asked Questions

Ward does not define inter-cluster distance directly: it minimizes the variance increase from a merger, so it sits in a different conceptual category from the other three methods. It does fit the Lance–Williams framework, but to first understand linkage as a concept, comparing single, complete and average is enough. Ward is heavily used in practice; this tool just keeps the first lesson focused.
This tool uses Euclidean distance on 2D points, but in general any precomputed distance matrix is fine: Manhattan, cosine, Jaccard, edit distance and so on. Phylogenetics uses sequence edit distance, text mining uses cosine, and the choice of metric can change the resulting clusters drastically. In practice, matching the metric to the domain matters as much as the linkage choice.
For each point we compute a = mean distance to other points in the same cluster and b = mean distance to the nearest other cluster, then (b - a)/max(a, b). Values near 1 mean clean separation, near 0 means overlapping clusters, and negative values mean the point is closer to a foreign cluster. The default settings here give 0.5–0.7, reflecting the clearly separated 3-Gaussian structure.
A naive implementation is O(n^3) time and O(n^2) memory, which limits it to a few thousand points. SLINK (single linkage) and CLINK (complete linkage) reach O(n^2), and BIRCH is a hierarchical method designed for large data. For more than 10k points, k-means or MiniBatchKMeans is far more practical, with DBSCAN as a density-based alternative.

Real-World Applications

Bioinformatics (phylogenetic trees): UPGMA — i.e. average linkage — is the classical algorithm for inferring evolutionary trees from DNA or protein-sequence similarity. You build a distance matrix from sequence alignments, run hierarchical clustering, and read off the species clusters from the branching pattern. Tools such as NCBI's distance methods, MEGA and PhyML implement this as a core feature.

Customer segmentation (marketing): Hierarchical clustering on purchase history and demographics lets the business read the dendrogram and decide "do we serve three customer tiers or five?" without prefixing k. R's hclust and scikit-learn's AgglomerativeClustering are the standard implementations, and the resulting tree is often a powerful artifact for stakeholder discussions.

Gene expression analysis (heatmaps): Microarray and RNA-Seq experiments produce gene-by-sample matrices that are hierarchically clustered along both axes, reordered, and rendered as heatmaps. Co-expressed gene groups and sample groups with similar expression patterns become visually obvious — making this one of the most reproduced figures in life-science papers.

Anomaly detection and failure analysis: In manufacturing or server logs, hierarchical clustering surfaces "lone" points that only merge at very high distance, marking them as anomaly candidates. Sweeping the threshold t to see at what granularity anomalies appear is a uniquely hierarchical workflow that flat methods like k-means cannot offer.

Common Misconceptions & Pitfalls

The most common mistake is treating the linkage choice as a matter of taste. It is not — the result changes dramatically with the data. Single linkage easily detects chain-like clusters but suffers from the "chaining effect", where a single noisy point can fuse otherwise separate groups. Complete linkage prefers tight, spherical clusters but is sensitive to outliers. In this tool, switching from average to single on the default data noticeably reshapes the dendrogram and changes the cluster count at t=3.0. Compare and pick — do not guess.

The second pitfall is treating the distance threshold t as a "true" value. It is a hyperparameter: in practice you find it by inspecting the dendrogram for large vertical gaps, by maximizing the silhouette score, or by domain knowledge. Sliding t in this tool moves the orange dashed line up and down, and the number of clusters changes in discrete steps. A good t typically lies inside a plateau where the cluster count is stable.

Finally, do not underestimate the O(n^3) cost. With 30 points the computation is instantaneous; at 10,000 points the work grows by a factor of about 370,000, and at 100,000 points by about 370 million. Production-scale clustering needs specialized fast algorithms (SLINK, CLINK), approximate methods (k-means, BIRCH) or GPU parallelism. A safe heuristic is: hierarchical clustering is for small-to-medium data where understanding the structure matters more than raw throughput.