HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that finds clusters of varying density and identifies outlier (noise) nodes.
Unlike algorithms like k-Means that partition all nodes into clusters, HDBSCAN finds clusters based on density: tightly connected regions of the graph form clusters, while loosely connected nodes are treated as noise (outliers). This means HDBSCAN can discover clusters of different sizes and shapes, and it doesn't require specifying the number of clusters.
In the HDBSCAN algorithm, distance between two nodes is the shortest-path length (hop count) between them. The core distance of a node answers the question: "How far does a node need to reach to find at least k nearby neighbors?"
For example, to find at least 3 neighbors:
A needs 2 hops (1 hop reaches B and F, 2 hops reaches C or G) → distcore(A) = 2B only needs 1 hop (reaches A, C, and G) → distcore(B) = 1E needs 3 hops (1 hop reaches D, 2 hops reaches C, 3 hops reaches B or G) → distcore(E) = 3A node in a dense region has a small core distance, while a node in a sparse region has a large core distance.
The mutual reachability distance between two nodes A and B is:

This smooths out density differences — connections between dense and sparse regions are penalized, which helps prevent sparse nodes from being pulled into dense clusters.
Consider the graph above with minimum 3 neighbors to find:
mreach(A,B) = max(2,1,1) = 2mreach(A,E) = max(2,3,4) = 4mreach(B,E) = max(1,3,3) = 3The mutual reachability distance between A and B is smaller than each of them with E. This makes sparse-region connections weaker.
Build a minimum spanning tree (MST) using mutual reachability distances as edge weights.
Build the condensed cluster hierarchy. Process MST edges from lightest to heaviest, merging nodes at each step. Each merge is tracked at a density level λ = 1 / weight. A cluster is "born" when a merged component first reaches minClusterSize, and "dies" when it merges into a larger component at a lower density level.
Extract the most stable clusters. The stability of a cluster measures how long it persists across density levels. A cluster with higher stability represents a more persistent, meaningful grouping. When a parent cluster's stability exceeds the combined stability of its children, the parent is selected; otherwise the children are kept. Nodes not belonging to any selected cluster are labeled as noise (cluster = -1).
Each node receives an outlier score between 0 and 1:
1.0coreDist / (coreDist + 1)Nodes in denser regions (smaller core distance) get lower outlier scores. Nodes in sparser regions get higher scores, even if they belong to a cluster.
nodeProperty is set, the algorithm uses Euclidean distance between numeric vectors stored in that property instead.minClusterSize) and the number of neighbors (minSamples) used to compute core distance both significantly affect results — smaller values produce more, smaller clusters; larger values produce fewer, denser clusters.GQLINSERT (A:user {_id: "A"}), (B:user {_id: "B"}), (C:user {_id: "C"}), (D:user {_id: "D"}), (E:user {_id: "E"}), (F:user {_id: "F"}), (G:user {_id: "G"}), (H:user {_id: "H"}), (I:user {_id: "I"}), (J:user {_id: "J"}), (K:user {_id: "K"}), (L:user {_id: "L"}), (M:user {_id: "M"}), (N:user {_id: "N"}), (O:user {_id: "O"}), (A)-[:connect]->(B), (A)-[:connect]->(C), (A)-[:connect]->(F), (A)-[:connect]->(K), (B)-[:connect]->(C), (C)-[:connect]->(D), (D)-[:connect]->(A), (D)-[:connect]->(E), (E)-[:connect]->(A), (F)-[:connect]->(G), (F)-[:connect]->(J), (G)-[:connect]->(H), (H)-[:connect]->(F), (I)-[:connect]->(F), (I)-[:connect]->(H), (J)-[:connect]->(I), (K)-[:connect]->(F), (K)-[:connect]->(N), (L)-[:connect]->(M), (L)-[:connect]->(N), (M)-[:connect]->(K), (M)-[:connect]->(N), (O)-[:connect]->(N)
| Name | Type | Default | Description |
|---|---|---|---|
minClusterSize | INT | 5 | Minimum number of nodes to form a cluster. |
minSamples | INT | 5 | Minimum samples used to compute core distance. |
nodeProperty | STRING | / | Node property containing a numeric vector for attribute-based distance. If unset, uses graph structure distance (shortest-path hop count). |
limit | INT | -1 | Limits the number of results returned (-1 = all). |
order | STRING | / | Sorts the results by cluster: asc or desc. |
Returns:
| Column | Type | Description |
|---|---|---|
nodeId | STRING | Node identifier (_id) |
cluster | INT | Cluster assignment (-1 = noise) |
outlierScore | FLOAT | Outlier score (higher = more likely outlier) |
GQLCALL algo.hdbscan({ minClusterSize: 3, minSamples: 2 }) YIELD nodeId, cluster, outlierScore
Returns the same columns as run mode, streamed for memory efficiency.
GQLCALL algo.hdbscan.stream({ minClusterSize: 3, minSamples: 2 }) YIELD nodeId, cluster RETURN cluster, COLLECT(nodeId) AS members, COUNT(nodeId) AS size GROUP BY cluster
Returns:
| Column | Type | Description |
|---|---|---|
nodeCount | INT | Total number of nodes |
clusterCount | INT | Number of clusters (excluding noise) |
noiseCount | INT | Number of noise points |
avgOutlierScore | FLOAT | Average outlier score |
GQLCALL algo.hdbscan.stats({ minClusterSize: 3, minSamples: 2 }) YIELD nodeCount, clusterCount, noiseCount, avgOutlierScore
Computes results and writes them back to node properties. The write configuration is passed as a second argument map.
Write parameters:
| Name | Type | Description |
|---|---|---|
db.property | STRING or MAP | Node property to write results to. String: writes the cluster column in results to a property. Map: explicit column-to-property mapping (e.g., {cluster: 'hdb_cluster', outlierScore: 'hdb_outlier'}). |
Writable columns:
| Column | Type | Description |
|---|---|---|
cluster | INT | Cluster assignment |
outlierScore | FLOAT | Outlier score |
Returns:
| Column | Type | Description |
|---|---|---|
task_id | STRING | Task identifier for tracking via SHOW TASKS |
nodesWritten | INT | Number of nodes with properties written |
computeTimeMs | INT | Time spent computing the algorithm (milliseconds) |
writeTimeMs | INT | Time spent writing properties to storage (milliseconds) |
GQLCALL algo.hdbscan.write({minClusterSize: 3, minSamples: 2}, { db: { property: "hdb_cluster" } }) YIELD task_id, nodesWritten, computeTimeMs, writeTimeMs