Overview
Overlap similarity is derived from Jaccard similarity, which is also called the Szymkiewicz–Simpson coefficient. It divides the size of the intersection of two sets by the size of the smaller set with the purpose to indicate how similar the two sets are.
In application, elements in set typically are a series of properties of an entity. For instance, when calculating the similarity between two credit applications, elements are the phone number, email, device IP, company name and so on in the application form. In general graph applications, these kinds of information are often stored as properties of a node; however, when executing this algorithm, these information is designed as nodes and incorporated into the graph.
The range of values of overlap similarity is [0,1]; the larger the value, the more similar the two sets are.
Basic Concept
Set
A set consists of multiple elements; elements in a set are unordered and distinct; the number of elements in set A is the size of set A, denoted as |A|
.
Set that consists of common elements of set A and set B is called the intersection of A and B, denoted as A⋂B
.

In the image above, set A is {b,c,e,f,g}
, set B is {a,d,b,g}
, intersection A⋂B is {b,g}
.
Overlap Similarity
Known sets A and B, overlap similarity between them can be expressed as:

Overlap similarity between sets A and B in the previous example can be calculated upon this definition: 2 / 4 = 0.5
.
Neighbors
In the graph, Kx
is the set of neighbors of node x
to represent set A
, Ky
is the set of neighbors of node y
to represent set B
. Note that neither Kx
nor Ky
contains repeated value, nor x
, nor y
, so the following interferences need to be eliminated when finding neighbors by edge in the graph:
- Multiple edges between
x
/y
and their neighbors - Self-loop edges of
x
andy
- Edges between
x
andy

In the graph above, the red node has 3 neighbors (excludes the green node), the green node has 5 neighbors (excludes the red node), they have 2 common neighbors, their overlap similarity is 2 / 3 = 0.6667
.
Special Case
Isolated Node, Disconnected Graph
There is rarely computational valuable isolated node (empty set) in practice, intersection that involves isolated node is empty, and overlap similarity is 0.
For two nodes belong to different connected components, their overlap similarity must be 0.
Self-loop Edge
Self-loop edge of a node does not increase the number of neighbors of the node.
Directed Edge
For directed edges, the algorithm ignores the direction of edges but calculates them as undirected edges.
Command and Configuration
- Command:
algo(similarity)
- Configurations for the parameter
params()
:
Name |
Type |
Default |
Specification | Description |
---|---|---|---|---|
ids / uuids | []_id / []_uuid |
/ | Mandatory | IDs or UUIDs of the first set of nodes to be calculated |
ids2 / uuids2 | []_id / []_uuid |
/ | Optional | IDs or UUIDs of the second set of nodes to be calculated |
type | string | cosine | jaccard / overlap / cosine / pearson / euclideanDistance / euclidean | Measurement of the similarity: jaccard: Jaccard Similarity overlap: Overlap Similarity cosine: Cosine Similarity pearson: Pearson Correlation Coefficient euclideanDistance: Euclidean Distance euclidean: Normalized Euclidean Distance |
node_schema_property | []@<schema>?.<property> |
/ | Numeric node property; LTE needed; schema can be either carried or not | When type is cosine / pearson / euclideanDistance / euclidean, must specify two or more node properties to form the vector; when type is jaccard / overlap, this parameter is invalid |
limit | int | -1 | >=-1 | Number of results to return; return all results if sets to -1 |
top_limit | int | -1 | >=-1 | Only available in the selection mode, limit the length of selection results (top_list ) of each node, return the full top_list if sets to -1 |
Calculation Mode
This algorithm has two calculation modes:
- Pairing mode: when two sets of valid nodes are configured, pair each node in the first set with each node in the second set (Cartesian product), similarities are calculated for all node pairs.
- Selection mode: when only one set (the first) of valid nodes are configured, for each node in the set, calculate its similarities with all other nodes in the graph, return the results if the similarity > 0, order the results the descending similarity.
Examples
Example Graph
The example graph shows the sports liked by userA, userB, userC and userD (UUIDs are 1, 2, 3 and 4 in order):

Task Writeback
1. File Writeback
Calculation Mode | Configuration |
Data in Each Row |
---|---|---|
Pairing mode | filename | node1 ,node2 ,similarity |
Selection mode | filename | node ,top_list |
Example: Calculate overlap similarity between userC and the sets of userA, userB and userD, write the algorithm results back to file
algo(similarity).params({
ids: "userC",
ids2: ["userA", "userB", "userD"],
type: "overlap"
}).write({
file:{
filename: "sc"
}
})
Results: File sc
userC,userA,0.25
userC,userB,0.5
userC,userD,0
Example: For each user in the set of UUID = 1,2,3,4, select the nodes that have overlap similarity above 0 with the user, write the algorithm results back to file
algo(similarity).params({
uuids: [1,2,3,4],
type: "overlap"
}).write({
file:{
filename: "list"
}
})
Results: File list
userA,userC:1.000000;userB:0.500000;userD:0.333333;
userB,userC:1.000000;userA:0.500000;userD:0.500000;
userC,userA:1.000000;userB:1.000000;
userD,userB:0.500000;userA:0.333333;
2. Property Writeback
Not supported by this algorithm.
3. Statistics Writeback
This algorithm has no statistics.
Direct Return
Calculation Mode | Alias Ordinal |
Type | Description | Column Name |
---|---|---|---|---|
Pairing mode | 0 | []perNodePair | Node pair and its similarity | node1 , node2 , similarity |
Selection mode | 0 | []perNode | Node and its selection results | node , top_list |
Example: Calculate overlap similarity between user UUID = 1 and users UUID = 2,3,4, order results in the descending similarity
algo(similarity).params({
uuids: [1],
uuids2: [2,3,4],
type: "overlap"
}) as overlap
return overlap
order by overlap.similarity desc
Results:
node1 | node2 | similarity |
---|---|---|
1 | 3 | 1 |
1 | 2 | 0.5 |
1 | 4 | 0.333333333333333 |
Example: Select the node with the highest overlap similarity with nodes UUID = 1,2 respectively
algo(similarity).params({
uuids: [1,2],
type: "overlap",
top_limit: 1
}) as top
return top
Results:
node | top_list |
---|---|
1 | 3:1.000000, |
2 | 3:1.000000, |
Streaming Return
Calculation Mode | Alias Ordinal |
Type | Description | Column Name |
---|---|---|---|---|
Pairing mode | 0 | []perNodePair | Node pair and its similarity | node1 , node2 , similarity |
Selection mode | 0 | []perNode | Node and its selection results | node , top_list |
Example: Calculate overlap similariy between user UUID = 3 and users UUID = 1,2,4, only return results that have similariy above 0
algo(similarity).params({
uuids: [3],
uuids2: [1,2,4],
type: "overlap"
}).stream() as overlap
where overlap.similarity > 0
return overlap
Results:
node1 | node2 | similarity |
---|---|---|
3 | 1 | 1 |
3 | 2 | 1 |
Example: Select two nodes with the hightest overlap similarity with node UUID = 1
algo(similarity).params({
uuids: [1],
type: "overlap",
top_limit: 2
}).stream() as top
return top
Results:
node | top_list |
---|---|
1 | 3:1.000000,2:0.500000, |
Real-time Statistics
This algorithm has no statistics.