UltipaDocs
Products
Solutions
Resources
Company
Start Free Trial
UltipaDocs
Start Free Trial
  • Introduction
  • Running Algorithms
    • Degree Centrality
    • Closeness Centrality
    • Harmonic Centrality
    • Eccentricity Centrality
    • Betweenness Centrality
    • Bridges
    • Articulation Points
    • Eigenvector Centrality
    • Katz Centrality
    • CELF
    • PageRank
    • ArticleRank
    • TextRank
    • HITS
    • SybilRank
    • Jaccard Similarity
    • Overlap Similarity
    • Cosine Similarity
    • Pearson Correlation Coefficient
    • Euclidean Distance
    • KNN
    • Vector Similarity
    • Bipartite Graph
    • HyperANF
    • Weakly Connected Components (WCC)
    • Strongly Connected Components (SCC)
    • k-Edge Connected Components
    • Local Clustering Coefficient
    • Triangle Count
    • Clique Count
    • k-Core
    • k-Truss
    • p-Cohesion
    • Induced Subgraph
    • Topological Sort
    • Breadth-First Search (BFS)
    • Depth-First Search (DFS)
    • Dijkstra's Shortest Path
    • A* Shortest Path
    • Yen's K-Shortest Paths
    • Shortest Path (BFS)
    • Delta-Stepping SSSP
    • Shortest Path Faster Algorithm (SPFA)
    • All-Pairs Shortest Path (APSP)
    • Minimum Spanning Tree (MST)
    • K-Spanning Tree
    • Steiner Tree
    • Prize-Collecting Steiner Tree (PCST)
    • Minimum Cost Flow
    • Maximum Flow
    • K-Hop Fast
    • Longest Path (DAG)
    • Random Walk
    • Adamic-Adar Index
    • Common Neighbors
    • Preferential Attachment
    • Resource Allocation
    • Total Neighbors
    • Same Community
    • Louvain
    • Leiden
    • Modularity Optimization
    • Label Propagation
    • HANP
    • SLPA
    • k-Means
    • HDBSCAN
    • K-1 Coloring
    • Modularity
    • Conductance
    • Max k-Cut
      • Node2Vec
      • Struc2Vec
      • LINE
      • Fast Random Projection
      • Summary of Graph Embedding
      • Gradient Descent
      • Backpropagation
      • Skip-gram
      • Skip-gram Optimization
  1. Docs
  2. /
  3. Graph Algorithms
  4. /
  5. Similarity

Jaccard Similarity

Overview

Jaccard similarity, or Jaccard index, was proposed by Paul Jaccard in 1901. It’s a metric of similarity for two sets of data. In the graph, collecting the neighbors of a node into a set, two nodes are considered similar if their neighborhood sets are similar.

Jaccard similarity ranges from 0 to 1, where 1 indicates that two sets are identical, and 0 indicates that they share no common elements.

Concepts

Jaccard Similarity

Given two sets A and B, the Jaccard similarity between them is computed as:

In the following example, set A = {b,c,e,f,g}, set B = {a,d,b,g}, their intersection A⋂B = {b,g}, their union A⋃B = {a,b,c,d,e,f,g}, hence the Jaccard similarity between A and B is 2 / 7 = 0.285714.

When applying Jaccard Similarity to compare two nodes in a graph, we use the 1-hop neighborhood set to represent each target node. The 1-hop neighborhood set:

  • contains no repeated nodes;
  • excludes the two target nodes.

In this graph, the 1-hop neighborhood set of nodes u and v is:

  • Nu = {a,b,c,d,e}
  • Nv = {d,e,f}

Therefore, the Jaccard similarity between nodes u and v is 2 / 6 = 0.333333.

Weighted Jaccard Similarity

The Weighted Jaccard Similarity is an extension of the classic Jaccard Similarity that takes into account the weights associated with elements in the sets being compared.

The formula for Weighted Jaccard Similarity is given by:

In this weighted graph, the union of the 1-hop neighborhood sets Nu and Nv is {a,b,c,d,e,f}. For each element in the union set, assign a value equal to the sum of the edge weights between the target node and the corresponding node; assign 0 if no edge exists between them:

abcdef
N'u11110.50
N'v0000.51.5 + 0.1 =1.61

Therefore, the Weighted Jaccard Similarity between nodes u and v is (0+0+0+0.5+0.5+0) / (1+1+1+1+1.6+1) = 0.151515.

NOTE

Please ensure that the sum of the edge weights between the target node and the neighboring node is greater than or equal to 0.

Considerations

  • The algorithm treats all edges as undirected.
  • Self-loops are ignored when computing neighborhoods.

Example Graph

GQL
INSERT (userA:user {_id: "userA"}), (userB:user {_id: "userB"}),
       (userC:user {_id: "userC"}), (userD:user {_id: "userD"}),
       (running:sport {_id: "running"}), (tennis:sport {_id: "tennis"}),
       (baseball:sport {_id: "baseball"}), (swimming:sport {_id: "swimming"}),
       (badminton:sport {_id: "badminton"}), (iceball:sport {_id: "iceball"}),
       (userA)-[:likes {weight: 2}]->(tennis),
       (userA)-[:likes {weight: 1}]->(baseball),
       (userA)-[:likes {weight: 3}]->(swimming),
       (userA)-[:likes {weight: 2}]->(badminton),
       (userB)-[:likes {weight: 1}]->(running),
       (userB)-[:likes {weight: 3}]->(swimming),
       (userC)-[:likes {weight: 2}]->(swimming),
       (userD)-[:likes {weight: 1}]->(running),
       (userD)-[:likes {weight: 2}]->(badminton),
       (userD)-[:likes {weight: 2}]->(iceball)

Parameters

NameTypeDefaultDescription
typeSTRINGjaccardType of similarity to compute: jaccard.
idsLIST/First group of node _ids. If empty, all nodes are used.
ids2LIST/Second group of node _ids for pairing mode. If empty, selection mode is used.
weightLIST/Numeric edge properties used as weights for weighted Jaccard.
degreeCutoffINT0Minimum degree to include a node (0 = no cutoff).
orderSTRING/Sorts results by similarity: asc or desc.
limitINT-1Maximum total results returned (-1 = all).
top_limitINT-1Maximum results per source node in selection mode (-1 = all).

Supports three computation modes:

  • All-pairs: When both ids and ids2 are empty, computes similarity between all node pairs in the graph.
  • Pairing: When both ids and ids2 are specified, computes similarity between each node in ids and each node in ids2.
  • Selection: When only ids is specified (no ids2), computes similarity between each node in ids and all other nodes. Use top_limit to limit results per source node.

Run Mode

Returns:

ColumnTypeDescription
node1STRINGFirst node identifier (_id)
node2STRINGSecond node identifier (_id)
similarityFLOATComputed similarity score

Jaccard similarity in pairing mode:

GQL
CALL algo.similarity({
  type: "jaccard",
  ids: ["userC"],
  ids2: ["userA", "userB", "userD"]
}) YIELD node1, node2, similarity

Result:

node1node2similarity
userCuserA0.25
userCuserB0.5
userCuserD0

Jaccard similarity in selection mode:

GQL
CALL algo.similarity({
  type: "jaccard",
  ids: ["userA"],
  weight: ["weight"],
  top_limit: 2
}) YIELD node1, node2, similarity
RETURN node1, node2, similarity

Result:

node1node2similarity
userAuserB0.3333333333333333
userAuserC0.25

Stream Mode

Returns the same columns as run mode, streamed for memory efficiency.

GQL
CALL algo.similarity.stream({
  type: "jaccard",
  degreeCutoff: 3
}) YIELD node1, node2, similarity
RETURN node1, node2, similarity

Result:

node1node2similarity
swimminguserA0
swimminguserD0
userAswimming0
userAuserD0.16666666666666666
userDswimming0
userDuserA0.16666666666666666

Stats Mode

Returns:

ColumnTypeDescription
pairCountINTNumber of node pairs computed
minSimilarityFLOATMinimum similarity score
maxSimilarityFLOATMaximum similarity score
avgSimilarityFLOATAverage similarity score
GQL
CALL algo.similarity.stats({
  type: "jaccard"
}) YIELD pairCount, minSimilarity, maxSimilarity, avgSimilarity

Result:

pairCountminSimilaritymaxSimilarityavgSimilarity
90010.1303703703703704

Write Mode

Computes results and writes them back to node properties. The write configuration is passed as a second argument map.

Write parameters:

NameTypeDescription
db.propertySTRING or MAPNode property to write results to. String: writes the similarity column in results to a property. Map: explicit column-to-property mapping (e.g., {similarity: 'sim_score'}).

Writable columns:

ColumnTypeDescription
similarityFLOATComputed similarity score

Returns:

ColumnTypeDescription
task_idSTRINGTask identifier for tracking via SHOW TASKS
nodesWrittenINTNumber of nodes with properties written
computeTimeMsINTTime spent computing the algorithm (milliseconds)
writeTimeMsINTTime spent writing properties to storage (milliseconds)
GQL
CALL algo.similarity.write({type: "jaccard", ids: ["userA", "userB"]}, {
  db: {
    property: "sim_score"
  }
}) YIELD task_id, nodesWritten, computeTimeMs, writeTimeMs