How does DBSCAN differ from k-means?

K-means requires you to fix the number of clusters k in advance and assigns each point to the nearest centroid, so it works well only on roughly spherical blobs. DBSCAN defines clusters by density instead: you do not have to specify the number of clusters, it can discover arbitrary shapes such as crescents or spirals, and points in low-density regions are automatically labeled as 'noise'.

How do you choose ε and MinPts?

A common rule of thumb is MinPts of roughly 2d for d-dimensional data (about 4 in 2D). ε is typically read off the 'k-distance plot': compute each point's distance to its k-th nearest neighbor, with k tied to MinPts (k = MinPts − 1 when the point itself counts toward MinPts), sort those distances in ascending order, and pick the knee. Choosing these two parameters well is the main difficulty of DBSCAN; bad choices collapse everything into one giant cluster or label every point as noise.

What is the difference between core, border and noise points?

A core point has at least MinPts points within its ε-neighborhood — it sits in a dense region. A border point itself does not have MinPts neighbors but lies within the ε-neighborhood of some core point, so it is pulled into that cluster. A point that is not in any core point's ε-neighborhood is noise (cluster id −1), treated as an outlier.

How does DBSCAN compare with HDBSCAN and OPTICS?

DBSCAN uses a single global ε, so it struggles when cluster densities differ substantially. OPTICS plots reachability distances over a range of ε values, while HDBSCAN removes ε entirely and extracts clusters hierarchically from the density structure, handling clusters of differing density at the same time. In practice HDBSCAN is often preferred.

DBSCAN Simulator — Free Online Calculator

Parameters

ε (neighborhood radius)

MinPts (pts)

Highlighted cluster ID

Data seed

70 points (3 clusters + uniform noise) are generated deterministically by a fixed-seed LCG. Highlight ID 0 shows all clusters.
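
The simulator's actual generator constants and cluster layout are not given here, so the following is only an illustrative sketch of how a fixed-seed linear congruential generator (LCG) can produce a reproducible 70-point data set. The multiplier, increment and modulus (the widely published "Numerical Recipes" values), the cluster centers, the arc geometry, and the 20/20/20/10 split are all assumptions made for this example.

```python
import math


class LCG:
    """Minimal LCG: state <- (a * state + c) mod m.

    Constants are assumed for illustration, not taken from the simulator.
    """

    def __init__(self, seed):
        self.state = seed % 2**32

    def uniform(self):
        """Return a pseudo-random float in [0, 1)."""
        self.state = (1664525 * self.state + 1013904223) % 2**32
        return self.state / 2**32


def make_points(seed):
    """Build 70 points: two round blobs, one arc, and uniform noise."""
    rng = LCG(seed)
    points = []
    # Hypothetical clusters A and B: 20 points each inside a disc of radius 0.8.
    for cx, cy in [(2.0, 2.0), (7.0, 3.0)]:
        for _ in range(20):
            r = 0.8 * math.sqrt(rng.uniform())
            theta = 2 * math.pi * rng.uniform()
            points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    # Hypothetical cluster C: 20 points along a half-circle arc of radius ~2.
    for _ in range(20):
        theta = math.pi * rng.uniform()
        r = 2.0 + 0.2 * (rng.uniform() - 0.5)
        points.append((5.0 + r * math.cos(theta), 6.0 + r * math.sin(theta)))
    # 10 uniform noise points over the square [0, 10] x [0, 10].
    for _ in range(10):
        points.append((10 * rng.uniform(), 10 * rng.uniform()))
    return points


points = make_points(seed=1)
print(len(points))
```

Because the generator state depends only on the seed, the same seed always reproduces the same point set, which is what makes slider experiments comparable from run to run.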

Results

Clusters found

Noise points

Core points

Border points

2D Scatter and Cluster Classification

Filled = core / outlined = border / gray × = noise / color = cluster ID (legend top right)

Theory & Key Formulas

DBSCAN defines density via the distance d (here Euclidean) and the two parameters ε and MinPts, and extracts connected components of density as clusters.

ε-neighborhood of a point p:

$$N_\varepsilon(p) = \{\, q \in D \mid d(p,q) \le \varepsilon \,\}$$

Core-point condition (at least MinPts neighbors within ε):

$$|N_\varepsilon(p)| \ge \mathrm{MinPts}$$

Direct density reachability from core p to q:

$$q \in N_\varepsilon(p) \;\wedge\; p \text{ is a core point}$$

Clusters are the connected components of the graph linking core points within ε of each other. Non-core points lying in some core point's ε-neighborhood are added as border points. Points belonging to no core neighborhood are noise (cluster id = −1).
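
The definitions above can be turned into a short program. The following is a minimal sketch in plain Python, assuming Euclidean distance and a brute-force O(n²) neighbor search; the function names `region_query` and `dbscan` are hypothetical, and clusters are numbered from 0 here (the simulator's own numbering may differ).

```python
import math
from collections import deque

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (includes i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return (labels, is_core); labels are 0, 1, ... per cluster, -1 for noise."""
    n = len(points)
    neighbors = [region_query(points, i, eps) for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue
        # Breadth-first expansion starting from an unlabeled core point.
        labels[i] = cluster
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster      # q joins as core or border
                    if is_core[q]:
                        queue.append(q)      # only core points keep expanding
        cluster += 1
    return labels, is_core


# Toy data: a unit square, one border point, one outlier, and a small L-shape.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (5, 5), (8, 8), (8, 9), (9, 8)]
labels, is_core = dbscan(points, eps=1.1, min_pts=3)
print(labels)   # [0, 0, 0, 0, 0, -1, 1, 1, 1]
print(is_core)  # [True, True, True, True, False, False, True, False, False]
```

In this toy data, (2, 1) is a border point: it has only two points in its own ε-neighborhood but lies within ε of the core point (1, 1). A border point within reach of two clusters is assigned to whichever cluster reaches it first in this sketch.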

What is the DBSCAN Simulator

🙋

How is DBSCAN different from k-means? I learned that both "make clusters", but I do not really see what changes.

🎓

Roughly speaking, k-means asks you up front "how many clusters?" and assigns every point to the nearest centroid. So shapes like a crescent or a ring cannot be separated cleanly. DBSCAN flips the idea: "dense regions are clusters, sparse regions are noise". You do not pick the number of clusters and the shape can be anything. Look at cluster C (the arc) in the simulator: it stays as one curved cluster.

🙋

I see! When I move the "ε" and "MinPts" sliders the result changes a lot. If I drop ε to 0.3 almost everything becomes noise.

🎓

That is the heart of DBSCAN. ε is "how far is still a neighbor", and MinPts is "how many neighbors do I need to be a dense center (a core point)". Shrink ε and fewer points qualify as dense, so isolated points all fall to noise. Push ε to 2.0 and the three clusters merge into one giant cluster. Try it.

🙋

It really does! Everything becomes C1. So how do you pick ε in practice?

🎓

The standard trick is the "k-distance plot". For every point compute the distance to its MinPts-th nearest neighbor, sort those distances and plot the curve. The "knee" — where the curve sharply bends upward — gives a good ε. For something like customer locations you might also fix ε from a physical meaning, such as "a 200 m radius with 5 or more people is the center of a trade area".

🙋

What is the difference between the filled and outlined dots?

🎓

Filled = core, outlined = border. A core point has many friends around it. A border point does not have enough friends itself, but it sits inside a core point's ε, so it gets pulled in. Noise is the gray ×. Being able to separate the "skeleton", the "edge" and the "outliers" of the data visually is one of DBSCAN's biggest strengths.

Frequently Asked Questions

If you know the number of clusters in advance and the data are roughly spherical blobs, k-means is fast and easy. When the cluster count is unknown, the shapes are complex, or you need automatic outlier removal, DBSCAN is a better fit. DBSCAN struggles when cluster densities differ significantly; in that case consider HDBSCAN or OPTICS.

Start with MinPts around 2d for d-dimensional data, and increase it on noisy data (about 4–10 in 2D). For ε, read the knee off the k-distance plot, using each point's distance to its (MinPts − 1)-th nearest other point (the point itself counts toward MinPts). Setting ε from a domain meaning (trade-area radius, sensor noise, etc.) is also very effective.
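
As a rough illustration of the k-distance idea, the sketch below computes the sorted distances to each point's (MinPts − 1)-th nearest other point. The knee picker, "take the value just before the largest jump", is a crude heuristic assumed for this example; in practice the knee is usually read by eye from the plot. The function names are hypothetical.

```python
import math


def k_distances(points, min_pts):
    """Sorted distance from each point to its (min_pts - 1)-th nearest other point."""
    k = min_pts - 1
    result = []
    for i, p in enumerate(points):
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)


def knee_by_largest_jump(sorted_dists):
    """Assumed heuristic: return the value just before the largest jump."""
    jumps = [b - a for a, b in zip(sorted_dists, sorted_dists[1:])]
    i = max(range(len(jumps)), key=jumps.__getitem__)
    return sorted_dists[i]


# Toy data: a 3 x 3 grid with unit spacing plus one far-away outlier.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
kd = k_distances(points, min_pts=4)
eps_guess = knee_by_largest_jump(kd)
print([round(d, 2) for d in kd])
print(round(eps_guess, 2))
```

On this toy data the curve stays flat for the grid points and jumps sharply at the outlier, so the heuristic lands on the largest within-grid k-distance.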

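DBSCAN can in principle be applied to high-dimensional data, but the "curse of dimensionality" makes pairwise distances concentrate, so the useful range of ε shrinks dramatically. In practice you usually compress the data to 2–10 dimensions with PCA, UMAP or t-SNE before running DBSCAN; note that t-SNE does not preserve densities or global distances, so clusters found on its output deserve extra scrutiny. The naive cost is O(n²); kd-trees help in low dimensions but lose effectiveness as the dimension grows.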
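
Distance concentration is easy to observe numerically. The sketch below is an illustration under assumed settings (200 points drawn uniformly from the unit hypercube, one random query point, a fixed seed); it prints the relative contrast (max − min) / min of the distances, which shrinks as the dimension grows.

```python
import math
import random


def relative_contrast(dim, n=200, seed=0):
    """(max - min) / min of distances from one query point to n random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n)
    ]
    return (max(dists) - min(dists)) / min(dists)


contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrasts.items():
    print(f"dim={dim:4d}  relative contrast={c:.2f}")
```

When the contrast is small, nearly every point is about equally far from every other point, so no single ε separates "neighbor" from "non-neighbor" cleanly.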

OPTICS replaces the single ε with a reachability-distance plot across all data and lets you slice clusters out afterward. HDBSCAN goes further and removes ε entirely, extracting stable clusters from the density hierarchy. When cluster densities vary strongly, or ε is hard to pick, HDBSCAN has become a de facto standard.
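
For reference, the two quantities behind the OPTICS reachability plot can be written in the notation of the theory section. This is a sketch of the usual OPTICS definitions, not something shown by the simulator itself; the MinPts-distance of p means the distance from p to its MinPts-th nearest neighbor.

Core distance of a point p:

$$\text{core-dist}(p) = \begin{cases} \text{MinPts-distance of } p & \text{if } |N_\varepsilon(p)| \ge \mathrm{MinPts} \\ \text{undefined} & \text{otherwise} \end{cases}$$

Reachability distance of q with respect to a core point p:

$$\text{reach-dist}(q \mid p) = \max\bigl(\text{core-dist}(p),\; d(p,q)\bigr)$$

Plotting the reachability distances in the order OPTICS visits the points gives valleys for dense regions; cutting the plot at a chosen height corresponds to a DBSCAN-like clustering at that ε.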

Real-World Applications

Geospatial data analysis: DBSCAN is widely used to extract "stay points" from GPS traces, to discover trade areas from store locations, and to find crime hotspots from incident locations. The fact that ε can be chosen naturally from a physical distance (meters) is a major advantage, and low-density regions are automatically discarded as noise.

Anomaly detection: A simple and powerful pipeline is to compress sensor data or logs into a few dimensions, apply DBSCAN, and flag the points that belong to no cluster (noise) as anomaly candidates. It is valuable in early-stage monitoring where labeled data are scarce.

Image segmentation and point cloud processing: DBSCAN is used as a preprocessing step to separate ground, walls and objects in LiDAR or depth-camera point clouds by density. It suits this setting well because the number of objects is unknown in advance and their shapes vary widely.

Customer segmentation and outlier analysis: In marketing analysis DBSCAN can automatically extract groups of customers with similar behavior from density and isolate outliers for separate review. With no need to assume the number of clusters, it is well suited to exploratory analysis that surfaces previously unknown segments.

Common Misconceptions and Cautions

The most common mistake is to think that "a larger ε will make the clusters neater". Increasing ε lets the neighborhoods of originally separate clusters overlap, which chains them together into one huge cluster. Try ε of 1.5 or 2.0 in the simulator: clusters A, B and C merge and the cluster count drops to 1 or 2. Make ε too small and every point becomes noise. The "right" ε lies in the sweet spot "typical neighbor spacing inside a cluster < ε < gap between clusters"; always check the knee of the k-distance plot.
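
The effect of ε on the cluster count can be reproduced on a toy data set. The sketch below counts clusters as connected components among core points, using a small union-find; the three groups of points, their spacing, and the function name `count_clusters` are assumptions for illustration and do not correspond to the simulator's data.

```python
import math


def count_clusters(points, eps, min_pts):
    """Number of clusters = connected components among core points."""
    n = len(points)
    near = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [i for i in range(n) if len(near[i]) >= min_pts]
    parent = {i: i for i in core}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge every pair of core points that lie within eps of each other.
    for i in core:
        for j in near[i]:
            if j in parent:
                parent[find(i)] = find(j)
    return len({find(i) for i in core})


# Three groups on a line: spacing 0.5 inside a group, gap 1.5 between groups.
xs = [0.0, 0.5, 1.0, 1.5, 3.0, 3.5, 4.0, 4.5, 6.0, 6.5, 7.0, 7.5]
points = [(x, 0.0) for x in xs]
counts = {eps: count_clusters(points, eps, min_pts=3) for eps in (0.3, 0.6, 2.0)}
print(counts)  # {0.3: 0, 0.6: 3, 2.0: 1}
```

With ε below the within-group spacing there are no core points at all, with ε between the spacing and the gap the three groups are recovered, and with ε above the gap everything chains into a single cluster.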

Another frequent error is to believe DBSCAN will perfectly remove outliers. DBSCAN only labels as noise "points in low-density regions". Dense clumps of bad data or systematic sensor errors are happily absorbed as core points. Noise labeling is density-based and is not a substitute for data-quality checks or proper preprocessing.

Finally, do not forget to scale your features. DBSCAN applies a single ε to a distance computed across all axes, so every feature is implicitly treated as having the same units. If you feed in features of very different scales (e.g. age in years and income in dollars), the large-scale axis dominates the distance and the others contribute almost nothing. Always standardize (StandardScaler, MinMaxScaler, etc.) before running DBSCAN. Skipping this is the textbook source of "the result looks weird and I cannot explain why".
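
A small numeric example makes the scaling problem concrete. The sketch below uses four hypothetical customers and a hand-rolled z-score (subtract the column mean, divide by the population standard deviation), which is roughly what a standard scaler computes; the data and the function name `standardize` are made up for illustration.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column: (x - column mean) / column population std."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [
        tuple((x - m) / s for x, m, s in zip(row, mus, sigmas))
        for row in rows
    ]


# Hypothetical customers: (age in years, income in dollars).
customers = [(25, 50_000), (60, 50_500), (26, 90_000), (61, 90_500)]

# Raw units: customer 0 looks far closer to customer 1 (35 years older,
# similar income) than to customer 2 (same age, very different income).
raw_01 = math.dist(customers[0], customers[1])
raw_02 = math.dist(customers[0], customers[2])

scaled = standardize(customers)
scaled_01 = math.dist(scaled[0], scaled[1])
scaled_02 = math.dist(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

Before scaling, the income axis decides almost everything; after scaling, a large age gap and a large income gap count about equally, so a single ε becomes meaningful on both axes.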
