Basic
Overview
Jaccard similarity, also known as Jaccard index, is an indicator of node similarity defined based on the semi-structured information of the internet. It divides the size of the intersection of two sets by the size of their union with the purpose to indicate how similar the two sets are. In the graph, Jaccard similarity uses node to represent set, and neighbors of node to represent elements in set, and to calculate the proportion of common neighbors in all neighbors.
In application, elements in set typically are a series of properties of an entity. For instance, when calculating the similarity between two credit applications, elements are the phone number, email, device IP, company name and so on in the application form. In general graph applications, these kinds of information are often stored as properties of a node; however, when executing this algorithm, these information is designed as nodes and incorporated into the graph.
The range of values of Jaccard similarity is [0,1]; the larger the value, the more similar the two sets are. Node (Set) has Jaccard similarity of 1 to its own.
Basic Concept
Set
A set consists of multiple elements; elements in a set are unordered and distinct; the number of elements in set A is the size of set A, denoted as |A|
.
Set that consists of common elements of set A and set B is called the intersection of A and B, denoted as A⋂B
; set consists of all elements of set A and set B is called the union of A and B, denoted as A⋃B
.

In the image above, set A is {b,c,e,f,g}
, set B is {a,d,b,g}
, intersection A⋂B is {b,g}
, union A⋃B is {a,b,c,d,e,f,g}
.
Jaccard Similarity
Known sets A and B, Jaccard similarity between them can be expressed as:

Jaccard similarity between sets A and B in the previous example can be calculated upon this definition: 2 / 7 = 0.2857
.
Neighbors
In the graph, Kx
is the set of neighbors of node x
to represent set A
, Ky
is the set of neighbors of node y
to represent set B
, Jaccard similarity is the number of common neighbors (see the chapter Common Neighbors) of x
and y
divided by the number of all neighbors. Note that neither Kx
nor Ky
contains repeated value, nor x
, nor y
, so the following interferences need to be eliminated when finding neighbors by edge in the graph:
- Multiple edges between
x
/y
and their neighbors - Self-loop edges of
x
andy
- Edges between
x
andy

In the graph above, there are 2 common neighbors of the red and green nodes, and 5 neighbors in total, Jaccard similarity between these two nodes is 2 / 5 = 0.4
.
Special Case
Lonely Node, Disconnected Graph
There is rarely computational valuable lonely node (empty set) in practice, intersection that involves lonely node is empty, and Jaccard similarity is 0.
For two nodes belong to different connected components, their Jaccard similarity must be 0.
Self-loop Edge
Self-loop edge of a node does not increase the number of neighbors of the node.
Directed Edge
For directed edges, the Jaccard Similarity algorithm ignores the direction of edges but calculates them as undirected edges.
Results and Statistics
The graph below shows which sports userA, userB, userC and userD (their UUID are 1, 2, 3 and 4 in order) likes, run the Jaccard Similarity algorithm in the graph:

Algorithm results 1: Calculate Jaccard similarity between each user and other 3 users, return node
and top_list
; for each node, node
is UUID, top_list
is all nodes which has Jaccard similarity > 0 with the node, those nodes' UUID and the similarity between then are displayed
node | top_list |
---|---|
1 | 3:0.250000,2:0.250000,4:0.200000, |
2 | 3:0.333333,4:0.250000,1:0.250000, |
3 | 2:0.333333,1:0.250000, |
4 | 2:0.250000,1:0.200000, |
Algorithm results 2: Calculate Jaccard similarity between userC and other 3 users, return node1
, node2
and similarity
; for each node pair, node1
and node2
are UUID, similarity
is Jaccard similarity of the two nodes
node1 | node2 | similarity |
---|---|---|
3 | 1 | 0.25 |
3 | 2 | 0.3333333333333333 |
3 | 4 | 0 |
Algorithm statistics: N/A
Command and Configuration
- Command:
algo(jaccard)
- Configurations for the parameter
params()
:
Name | Type | Default Value | Specification | Description |
---|---|---|---|---|
ids / uuids | []_id / []_uuid |
/ | Mandatory | IDs or UUIDs of the first set of nodes to be calculated |
ids2 / uuids2 | []_id / []_uuid |
/ | Optional | IDs or UUIDs of the second set of nodes to be calculated |
limit | int | -1 | >=-1 | Number of results to return; return all results if sets to -1 or not set |
top_limit | int | -1 | -1 or >=0 | Only available when ids2 and uuids2 are ignored, limit the length of selection results top_list of each node, return the full top_list if sets to -1 or not set |
The Jaccard Similarity algorithm has two calculation modes:
- Selection mode
- When ids2 and uuids2 are ignored or given empty value, for each node in ids/uuids, select all nodes that have similarity > 0 with it (sorted in descending similarity).
- The configuration item limit is to control the number of nodes in ids/uuids to return.
- Pairing mode
- When ids2/uuids2 is given solid value, pair node in ids/uuids with node in ids2/uuids2 (Cartesian product), similarities are calculated for each node pair.
- The configuration item limit is to control the number of nodes pairs to return.
Example: Calculate Jaccard similarity between any two nodes in set A (UUID = 1,2) and set B (UUID = 3,4)
algo(jaccard).params({
uuids: [1,2],
uuids2: [3,4]
}) as jacc
return jacc
Example: For each node in set (UUID = 1,2,3), calculate the most similar node to each node
algo(jaccard).params({
uuids: [1,2,3],
top_limit: 1
}) as jacc
return jacc
Algorithm Execution
Task Writeback
1. File Writeback
File Configuration Item | Data in Each Row |
---|---|
filename | node ,top_list or node1 ,node2 ,similarity |
Example: Calculate Jaccard similarity between node UUID=3 and other nodes, write the algorithm results back to file named s3
algo(jaccard).params({
uuids: 3,
uuids2: [1,2,4]
}).write({
file:{
filename: "s3"
}
})
2. Property Writeback
Not supported by this algorithm.
3. Statistics Writeback
This algorithm has no statistics.
Direct Return
Alias Ordinal | Type | Description | Column Name |
---|---|---|---|
0 | []perNode / []perNodePair | Node and its selection results / Node pair and its Jaccard similarity | node , top_list / node1 , node2 , similarity |
Example: Calculate Jaccard similarity between each node and other nodes, define algorithm results as alias named similarity, and return the results
algo(jaccard).params({
uuids: [1,2,3,4]
}) as similarity
return similarity
Streaming Return
Alias Ordinal | Type | Description | Column Name |
---|---|---|---|
0 | []perNode / []perNodePair | Node and its selection results / Node pair and its Jaccard similarity | node , top_list / node1 , node2 , similarity |
Example: Calculate Jaccard similarity between each node and other nodes, only keep the one with the highest similarity, define algorithm results as alias named similarity, and return the results
algo(jaccard).params({
uuids: [1,2,3,4],
top_limit: 1
}).stream() as similarity
return similarity
Real-time Statistics
This algorithm has no statistics.