UltipaDocs

HDBSCAN

Overview

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that finds clusters of varying density and identifies outlier (noise) nodes.

  • R.J.G.B. Campello, D. Moulavi, J. Sander, Density-Based Clustering Based on Hierarchical Density Estimates (2013)

Concepts

Density-Based Clustering

Unlike partitioning algorithms such as k-Means, which assign every node to a cluster, HDBSCAN finds clusters based on density: tightly connected regions of the graph form clusters, while loosely connected nodes are treated as noise (outliers). This means HDBSCAN can discover clusters of different sizes and shapes, and it doesn't require specifying the number of clusters in advance.

In the HDBSCAN algorithm, distance between two nodes is the shortest-path length (hop count) between them. The core distance of a node answers the question: "How far does a node need to reach to find at least k nearby neighbors?"

For example, to find at least 3 neighbors:

  • Node A needs 2 hops (1 hop reaches B and F, 2 hops reaches C or G) → distcore(A) = 2
  • Node B only needs 1 hop (reaches A, C, and G) → distcore(B) = 1
  • Node E needs 3 hops (1 hop reaches D, 2 hops reaches C, 3 hops reaches B or G) → distcore(E) = 3

A node in a dense region has a small core distance, while a node in a sparse region has a large core distance.
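
The core-distance computation above can be sketched as a breadth-first search that stops at the first hop level where enough neighbors have been found. The edge list below is reconstructed from the worked example (the original figure is not reproduced here), so treat it as an assumption:

```python
from collections import deque

# Edge list reconstructed from the worked example above (an assumption).
EDGES = [("A", "B"), ("A", "F"), ("B", "C"), ("B", "G"),
         ("C", "G"), ("C", "D"), ("D", "E")]

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def core_distance(adj, src, k):
    """Smallest number of hops within which src reaches at least k other nodes."""
    seen, frontier, hops, found = {src}, [src], 0, 0
    while frontier:
        hops += 1
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    nxt.append(v)
        found += len(nxt)
        if found >= k:
            return hops
        frontier = nxt
    return float("inf")  # fewer than k nodes reachable at all

adj = build_adjacency(EDGES)
print(core_distance(adj, "A", 3))  # 2
print(core_distance(adj, "B", 3))  # 1
print(core_distance(adj, "E", 3))  # 3
```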

The mutual reachability distance between two nodes A and B is:

mreach(A, B) = max(distcore(A), distcore(B), dist(A, B))

where dist(A, B) is the shortest-path distance between A and B.
This smooths out density differences — connections between dense and sparse regions are penalized, which helps prevent sparse nodes from being pulled into dense clusters.

Continuing the example above, with at least 3 neighbors to find:

  • mreach(A,B) = max(2,1,1) = 2
  • mreach(A,E) = max(2,3,4) = 4
  • mreach(B,E) = max(1,3,3) = 3

The mutual reachability distance between A and B is smaller than that between either of them and E, so connections through the sparse region around E are weakened.
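
The worked values can be recomputed directly from the definition; the core distances and pairwise hop counts below are taken from the example above:

```python
def mreach(core_a, core_b, dist_ab):
    # Mutual reachability distance: the max of both endpoints'
    # core distances and the shortest-path distance between them.
    return max(core_a, core_b, dist_ab)

core = {"A": 2, "B": 1, "E": 3}                       # distcore from the example
hops = {("A", "B"): 1, ("A", "E"): 4, ("B", "E"): 3}  # shortest-path hop counts

print(mreach(core["A"], core["B"], hops[("A", "B")]))  # 2
print(mreach(core["A"], core["E"], hops[("A", "E")]))  # 4
print(mreach(core["B"], core["E"], hops[("B", "E")]))  # 3
```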

Hierarchical Clustering and Cluster Extraction

  1. Build a minimum spanning tree (MST) using mutual reachability distances as edge weights.

  2. Build the condensed cluster hierarchy. Process MST edges from lightest to heaviest, merging nodes at each step. Each merge is tracked at a density level λ = 1 / weight. A cluster is "born" when a merged component first reaches minClusterSize, and "dies" when it merges into a larger component at a lower density level.

  3. Extract the most stable clusters. The stability of a cluster measures how long it persists across density levels. A cluster with higher stability represents a more persistent, meaningful grouping. When a parent cluster's stability exceeds the combined stability of its children, the parent is selected; otherwise the children are kept. Nodes not belonging to any selected cluster are labeled as noise (cluster = -1).
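
Steps 1 and 2 can be sketched as a Kruskal-style MST build over mutual-reachability weights, recording the density level λ = 1/weight at each merge (the edge weights here are illustrative, not taken from the example graph):

```python
class DSU:
    """Union-find for Kruskal's algorithm."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True

def mst_merge_levels(n, edges):
    """edges: (u, v, mutual_reachability_weight) triples.
    Returns the MST merges as (lambda, u, v), lightest edge first,
    so lambda values come out in descending (densest-first) order."""
    dsu, merges = DSU(n), []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if dsu.union(u, v):
            merges.append((1.0 / w, u, v))  # density level of this merge
    return merges

# Illustrative weighted graph on 4 nodes.
edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 3.0), (2, 3, 4.0)]
print(mst_merge_levels(4, edges))  # 3 merges at lambda = 1.0, 0.5, 0.25
```

In full HDBSCAN, a cluster is then tracked as "born" at the λ where its component first reaches minClusterSize and "dead" at the λ where it merges away; this sketch covers only the MST and λ bookkeeping.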

Outlier Detection

Each node receives an outlier score between 0 and 1:

  • Noise nodes: outlier score = 1.0
  • Clustered nodes: outlier score = coreDist / (coreDist + 1)

Nodes in denser regions (smaller core distance) get lower outlier scores. Nodes in sparser regions get higher scores, even if they belong to a cluster.
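
The scoring rule above is a one-liner; a quick sketch with a few sample core distances:

```python
def outlier_score(core_dist, is_noise):
    # Noise nodes are maximally outlying; clustered nodes scale
    # with core distance, approaching (but never reaching) 1.0.
    return 1.0 if is_noise else core_dist / (core_dist + 1)

print(outlier_score(1, False))  # 0.5  (dense region)
print(outlier_score(3, False))  # 0.75 (sparser, but still clustered)
print(outlier_score(3, True))   # 1.0  (noise)
```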

Considerations

  • The algorithm treats edges as undirected for distance computation.
  • By default, distance is measured by shortest-path hop count on the graph structure. When nodeProperty is set, the algorithm uses Euclidean distance between numeric vectors stored in that property instead.
  • The minimum cluster size (minClusterSize) and the number of neighbors (minSamples) used to compute core distance both significantly affect results — smaller values produce more, smaller clusters; larger values produce fewer, denser clusters.

Example Graph

GQL
INSERT (A:user {_id: "A"}), (B:user {_id: "B"}),
       (C:user {_id: "C"}), (D:user {_id: "D"}),
       (E:user {_id: "E"}), (F:user {_id: "F"}),
       (G:user {_id: "G"}), (H:user {_id: "H"}),
       (I:user {_id: "I"}), (J:user {_id: "J"}),
       (K:user {_id: "K"}), (L:user {_id: "L"}),
       (M:user {_id: "M"}), (N:user {_id: "N"}),
       (O:user {_id: "O"}),
       (A)-[:connect]->(B), (A)-[:connect]->(C),
       (A)-[:connect]->(F), (A)-[:connect]->(K),
       (B)-[:connect]->(C), (C)-[:connect]->(D),
       (D)-[:connect]->(A), (D)-[:connect]->(E),
       (E)-[:connect]->(A), (F)-[:connect]->(G),
       (F)-[:connect]->(J), (G)-[:connect]->(H),
       (H)-[:connect]->(F), (I)-[:connect]->(F),
       (I)-[:connect]->(H), (J)-[:connect]->(I),
       (K)-[:connect]->(F), (K)-[:connect]->(N),
       (L)-[:connect]->(M), (L)-[:connect]->(N),
       (M)-[:connect]->(K), (M)-[:connect]->(N),
       (O)-[:connect]->(N)

Parameters

| Name | Type | Default | Description |
|------|------|---------|-------------|
| minClusterSize | INT | 5 | Minimum number of nodes required to form a cluster. |
| minSamples | INT | 5 | Number of neighbors used to compute core distance. |
| nodeProperty | STRING | / | Node property containing a numeric vector for attribute-based distance. If unset, graph-structure distance (shortest-path hop count) is used. |
| limit | INT | -1 | Limits the number of results returned (-1 = all). |
| order | STRING | / | Sorts the results by cluster: asc or desc. |

Run Mode

Returns:

| Column | Type | Description |
|--------|------|-------------|
| nodeId | STRING | Node identifier (_id) |
| cluster | INT | Cluster assignment (-1 = noise) |
| outlierScore | FLOAT | Outlier score (higher = more likely an outlier) |
GQL
CALL algo.hdbscan({
  minClusterSize: 3,
  minSamples: 2
}) YIELD nodeId, cluster, outlierScore

Stream Mode

Returns the same columns as run mode, streamed for memory efficiency.

GQL
CALL algo.hdbscan.stream({
  minClusterSize: 3,
  minSamples: 2
}) YIELD nodeId, cluster
RETURN cluster, COLLECT(nodeId) AS members, COUNT(nodeId) AS size
GROUP BY cluster

Stats Mode

Returns:

| Column | Type | Description |
|--------|------|-------------|
| nodeCount | INT | Total number of nodes |
| clusterCount | INT | Number of clusters (excluding noise) |
| noiseCount | INT | Number of noise points |
| avgOutlierScore | FLOAT | Average outlier score |
GQL
CALL algo.hdbscan.stats({
  minClusterSize: 3,
  minSamples: 2
}) YIELD nodeCount, clusterCount, noiseCount, avgOutlierScore

Write Mode

Computes results and writes them back to node properties. The write configuration is passed as a second argument map.

Write parameters:

| Name | Type | Description |
|------|------|-------------|
| db.property | STRING or MAP | Node property to write results to. String: writes the cluster column to that property. Map: explicit column-to-property mapping (e.g., {cluster: 'hdb_cluster', outlierScore: 'hdb_outlier'}). |
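
A map-form write call might look like the following; this is a sketch based on the mapping example above and has not been verified against a live server:

GQL
CALL algo.hdbscan.write({minClusterSize: 3, minSamples: 2}, {
  db: {
    property: {cluster: "hdb_cluster", outlierScore: "hdb_outlier"}
  }
}) YIELD task_id, nodesWritten, computeTimeMs, writeTimeMs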

Writable columns:

| Column | Type | Description |
|--------|------|-------------|
| cluster | INT | Cluster assignment |
| outlierScore | FLOAT | Outlier score |

Returns:

| Column | Type | Description |
|--------|------|-------------|
| task_id | STRING | Task identifier for tracking via SHOW TASKS |
| nodesWritten | INT | Number of nodes with properties written |
| computeTimeMs | INT | Time spent computing the algorithm (milliseconds) |
| writeTimeMs | INT | Time spent writing properties to storage (milliseconds) |
GQL
CALL algo.hdbscan.write({minClusterSize: 3, minSamples: 2}, {
  db: {
    property: "hdb_cluster"
  }
}) YIELD task_id, nodesWritten, computeTimeMs, writeTimeMs