✓ File Writeback ✕ Property Writeback ✓ Direct Return ✓ Stream Return ✕ Stats
Overview
Jaccard similarity, or the Jaccard index, was proposed by Paul Jaccard in 1901. It measures the similarity between two sets. In a graph, the neighbors of each node are collected into a set, and two nodes are considered similar if their neighbor sets are similar.
Jaccard similarity ranges from 0 to 1: 1 means that the two sets are identical, and 0 means that the two sets have no element in common.
Concepts
Jaccard Similarity
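Given two sets A and B, the Jaccard similarity between them is computed as the size of their intersection divided by the size of their union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$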
For example, take set A = {b,c,e,f,g} and set B = {a,d,b,g}. Their intersection is A∩B = {b,g} and their union is A∪B = {a,b,c,d,e,f,g}, hence the Jaccard similarity between A and B is 2 / 7 ≈ 0.2857.
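The computation in this example can be reproduced with a minimal Python sketch. It assumes the similarity is the size of the intersection divided by the size of the union, as the 2 / 7 result indicates. The function name `jaccard` and the choice to return 0.0 for two empty sets are illustrative assumptions, not something this page specifies.

```python
def jaccard(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b| for two sets."""
    union = a | b
    if not union:
        # Assumption: two empty sets are given similarity 0.0 to avoid dividing by zero
        return 0.0
    return len(a & b) / len(union)


A = {"b", "c", "e", "f", "g"}
B = {"a", "d", "b", "g"}

# Intersection {b, g} has 2 elements, union has 7 elements
print(round(jaccard(A, B), 4))  # 0.2857
```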
Neighbor Set
In Ultipa's Jaccard Similarity algorithm, the following points have to be noted when collecting the neighbor sets of two target nodes to compute their similarity:
- There are no repeated nodes in the neighbor set;
- Self-loops are ignored;
- Any edge between the two target nodes is ignored;
- Edge direction is ignored.
In the graph above, when computing the similarity between node u and node v, the neighbor sets of the two nodes are Nu = {a,b,c,d,e} and Nv = {d,e,f}. Their intersection is {d,e} and their union is {a,b,c,d,e,f}, so their Jaccard similarity is 2 / 6 ≈ 0.3333.
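The four rules for collecting a neighbor set can be sketched in Python as follows. The edge list is hypothetical, chosen only so that the resulting sets match Nu and Nv in the text. The function name `neighbor_set` is illustrative and is not part of Ultipa.

```python
def neighbor_set(edges, node, other):
    """Collect the neighbors of `node` when it is compared with `other`."""
    neighbors = set()  # a set keeps each neighbor only once
    for src, dst in edges:
        if src == dst:
            continue  # self-loops are ignored
        if {src, dst} == {node, other}:
            continue  # edges between the two target nodes are ignored
        # Edge direction is ignored: either endpoint may be the target node
        if src == node:
            neighbors.add(dst)
        elif dst == node:
            neighbors.add(src)
    return neighbors


# Hypothetical edges (assumption): includes a repeated edge, a self-loop and a u-v edge
edges = [
    ("u", "a"), ("b", "u"), ("u", "c"), ("u", "d"), ("e", "u"),
    ("v", "d"), ("v", "e"), ("f", "v"),
    ("u", "v"), ("u", "u"), ("u", "a"),
]

Nu = neighbor_set(edges, "u", "v")
Nv = neighbor_set(edges, "v", "u")
similarity = len(Nu & Nv) / len(Nu | Nv)
print(sorted(Nu), sorted(Nv), round(similarity, 4))
# ['a', 'b', 'c', 'd', 'e'] ['d', 'e', 'f'] 0.3333
```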
In practice, you may need to convert some node properties into node schemas before calculating a similarity index that is based on common neighbors, such as Jaccard Similarity. For instance, when considering the similarity between two applications, information such as the phone number, email and device IP of each application might have been stored as properties of the @application node schema; these values need to be modeled as nodes and incorporated into the graph in order to be used for comparison.
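As a rough illustration of this remodeling step, the Python sketch below turns each property value into a node of its own and links it to the owning application, so that shared values become shared neighbors. The records, field names and the `property_edges` helper are all hypothetical; they only show the idea and do not describe how data is loaded into Ultipa.

```python
# Hypothetical application records (assumption): values are illustrative only
apps = {
    "app1": {"phone": "555-0100", "email": "x@example.com"},
    "app2": {"phone": "555-0100", "email": "y@example.com"},
}


def property_edges(records):
    """Create one (application, value-node) edge per property value."""
    edges = []
    for app_id, props in records.items():
        for key, value in props.items():
            # The value node is keyed by property name and value
            edges.append((app_id, f"{key}:{value}"))
    return edges


edges = property_edges(apps)

# Build the neighbor set of each application from the new edges
neighbors = {}
for app_id, value_node in edges:
    neighbors.setdefault(app_id, set()).add(value_node)

shared = neighbors["app1"] & neighbors["app2"]
union = neighbors["app1"] | neighbors["app2"]
# The two applications share the phone node: 1 common neighbor out of 3
print(sorted(shared), len(shared) / len(union))
```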
Syntax
- Command: algo(similarity)
- Parameters:

Name | Type | Spec | Default | Optional | Description |
---|---|---|---|---|---|
ids / uuids | []_id / []_uuid | / | / | No | ID/UUID of the first group of nodes to calculate |
ids2 / uuids2 | []_id / []_uuid | / | / | Yes | ID/UUID of the second group of nodes to calculate |
type | string | jaccard | cosine | No | Type of similarity; for Jaccard Similarity, keep it as jaccard |
limit | int | >=-1 | -1 | Yes | Number of results to return, -1 to return all results |
top_limit | int | >=-1 | -1 | Yes | Limit the length of top_list, -1 to return the full top_list |
This algorithm has two calculation modes:
- Pairing: when ids/uuids and ids2/uuids2 are both configured, each node in the first group is paired with each node in the second group (Cartesian product), and the similarity of every pair is computed.
- Selection: when only ids/uuids is configured, pair-wise similarities are computed between each node in the group and all other nodes in the graph in order to select the most similar nodes. The returned top_list includes all nodes that have similarity > 0 with that node, ordered by descending similarity.
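The difference between the two modes can be sketched in Python. This is a simplified illustration, not Ultipa's implementation: the neighbor sets are hypothetical and precomputed, the function names `pairing` and `selection` are made up for this sketch, and the handling of top_limit only mirrors the parameter description (-1 keeps the full list).

```python
from itertools import product


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairing(neighbors, group1, group2):
    """Pairing mode: one (node1, node2, similarity) row per Cartesian pair."""
    return [(n1, n2, jaccard(neighbors[n1], neighbors[n2]))
            for n1, n2 in product(group1, group2)]


def selection(neighbors, group, top_limit=-1):
    """Selection mode: rank every other node with similarity > 0, highest first."""
    result = {}
    for node in group:
        scored = [(other, jaccard(neighbors[node], neighbors[other]))
                  for other in neighbors if other != node]
        top_list = sorted((s for s in scored if s[1] > 0),
                          key=lambda s: s[1], reverse=True)
        result[node] = top_list if top_limit == -1 else top_list[:top_limit]
    return result


# Hypothetical neighbor sets (assumption): not the example graph of this page
neighbors = {
    "n1": {"x", "y", "z"},
    "n2": {"x", "y"},
    "n3": {"z", "w"},
    "n4": {"q"},
}

# Pairing keeps every pair, including n1-n4 with similarity 0
print(pairing(neighbors, ["n1"], ["n2", "n3", "n4"]))
# Selection drops n4 (similarity 0) and keeps only the best match
print(selection(neighbors, ["n1"], top_limit=1))
```

Pairing returns a row for every pair even when the similarity is 0, whereas selection filters out zero-similarity nodes before applying top_limit.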
Examples
The example graph is as follows:
File Writeback
Calculation Mode | Spec | Content |
---|---|---|
Pairing | filename | node1, node2, similarity |
Selection | filename | node, top_list |
algo(similarity).params({
ids: "userC",
ids2: ["userA", "userB", "userD"],
type: "jaccard"
}).write({
file:{
filename: "sc"
}
})
Results: File sc
userC,userA,0.25
userC,userB,0.5
userC,userD,0
algo(similarity).params({
uuids: [1,2,3,4],
type: "jaccard"
}).write({
file:{
filename: "list"
}
})
Results: File list
userA,userC:0.250000;userB:0.200000;userD:0.166667;
userB,userC:0.500000;userD:0.250000;userA:0.200000;
userC,userB:0.500000;userA:0.250000;
userD,userB:0.250000;userA:0.166667;
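When the selection-mode file is consumed by another program, each line has to be split into the node and its top_list. The Python sketch below parses the line layout shown in the results above (node, then `other:similarity;` items). It assumes that node IDs contain no comma or semicolon; the function name `parse_selection_line` is hypothetical.

```python
def parse_selection_line(line):
    """Split 'node,other:score;other:score;' into (node, [(other, score), ...])."""
    node, _, rest = line.strip().partition(",")
    top_list = []
    for item in rest.split(";"):
        if not item:
            continue  # skip the empty piece after the trailing ';'
        # Split on the last ':' so the score is always the final field
        other, _, score = item.rpartition(":")
        top_list.append((other, float(score)))
    return node, top_list


print(parse_selection_line("userA,userC:0.250000;userB:0.200000;userD:0.166667;"))
# ('userA', [('userC', 0.25), ('userB', 0.2), ('userD', 0.166667)])
```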
Direct Return
Calculation Mode | Alias Ordinal | Type | Description | Columns |
---|---|---|---|---|
Pairing | 0 | []perNodePair | Node pair and its similarity | node1 , node2 , similarity |
Selection | 0 | []perNode | Node and its selection results | node , top_list |
algo(similarity).params({
uuids: [1],
uuids2: [2,3,4],
type: "jaccard"
}) as jacc
return jacc
order by jacc.similarity desc
Results: jacc
node1 | node2 | similarity |
---|---|---|
1 | 3 | 0.25 |
1 | 2 | 0.2 |
1 | 4 | 0.166666666666667 |
algo(similarity).params({
uuids: [1,2],
type: "jaccard",
top_limit: 1
}) as top
return top
Results: top
node | top_list |
---|---|
1 | 3:0.250000, |
2 | 3:0.500000, |
Stream Return
Calculation Mode | Alias Ordinal | Type | Description | Columns |
---|---|---|---|---|
Pairing mode | 0 | []perNodePair | Node pair and its similarity | node1 , node2 , similarity |
Selection mode | 0 | []perNode | Node and its selection results | node , top_list |
algo(similarity).params({
uuids: [3],
uuids2: [1,2,4],
type: "jaccard"
}).stream() as jacc
where jacc.similarity > 0
return jacc
Results: jacc
node1 | node2 | similarity |
---|---|---|
3 | 1 | 0.25 |
3 | 2 | 0.5 |
algo(similarity).params({
uuids: [1],
type: "jaccard",
top_limit: 2
}).stream() as top
return top
Results: top
node | top_list |
---|---|
1 | 3:0.250000,2:0.200000, |