✓ File Writeback ✕ Property Writeback ✓ Direct Return ✓ Stream Return ✕ Stats
Overview
Overlap similarity, also called the Szymkiewicz–Simpson coefficient, is derived from Jaccard similarity. It divides the size of the intersection of two sets by the size of the smaller set to indicate how similar the two sets are.
Overlap similarity ranges from 0 to 1: 1 means that one set is a subset of the other or that the two sets are identical; 0 means that the two sets have no element in common.
Concepts
Overlap Similarity
Given two sets A and B, the overlap similarity between them is computed as:

$$\text{overlap}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$$
In the following example, set A = {b,c,e,f,g} and set B = {a,d,b,g}. Their intersection is A∩B = {b,g} and the smaller set, B, has 4 elements, hence the overlap similarity between A and B is 2 / 4 = 0.5.
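The computation above can be reproduced with a few lines of Python. This is a minimal sketch of the set formula only, not Ultipa's implementation; the function name `overlap_similarity` and the choice to return 0 when either set is empty are assumptions made for illustration.

```python
def overlap_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the smaller set."""
    if not a or not b:
        # Assumption: similarity involving an empty set is treated as 0 here
        return 0.0
    return len(a & b) / min(len(a), len(b))


A = {"b", "c", "e", "f", "g"}
B = {"a", "d", "b", "g"}

result = overlap_similarity(A, B)
print(result)  # 0.5
```

Because the denominator is the smaller set, a set compared against any of its supersets scores 1, for example `overlap_similarity({"b"}, B)`.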
Neighbor Set
In Ultipa's Overlap Similarity algorithm, the following points have to be noted when collecting the neighbor sets of two target nodes to compute their similarity:
- There are no repeated nodes in the neighbor set;
- Self-loops are ignored;
- Any edge between the two target nodes is ignored;
- Edge direction is ignored.
In the graph above, when computing the similarity between node u and node v, the neighbor sets for the two nodes are Nu = {a,b,c,d,e} and Nv = {d,e,f}, so their overlap similarity is 2 / 3 = 0.6667.
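The four rules for collecting neighbor sets can be sketched as follows. The edge list is hypothetical: it is one possible graph that yields the neighbor sets Nu and Nv stated above, not necessarily the one in the figure, and the helper `neighbor_set` is an illustrative name rather than part of any Ultipa API.

```python
def neighbor_set(edges, node, other):
    """Collect the neighbors of `node` when it is compared against `other`."""
    neighbors = set()  # a set stores each neighbor only once
    for src, dst in edges:
        if src == dst:
            continue  # self-loops are ignored
        if {src, dst} == {node, other}:
            continue  # edges between the two target nodes are ignored
        # Edge direction is ignored: check both endpoints
        if src == node:
            neighbors.add(dst)
        elif dst == node:
            neighbors.add(src)
    return neighbors


# Hypothetical directed edge list (source, target)
edges = [
    ("u", "a"), ("b", "u"), ("u", "c"), ("u", "d"), ("d", "u"), ("e", "u"),
    ("v", "d"), ("e", "v"), ("f", "v"),
    ("u", "u"),  # self-loop
    ("u", "v"),  # edge between the two target nodes
]

n_u = neighbor_set(edges, "u", "v")
n_v = neighbor_set(edges, "v", "u")
similarity = len(n_u & n_v) / min(len(n_u), len(n_v))
print(sorted(n_u), sorted(n_v), round(similarity, 4))
```

Note that d is connected to u by two edges in this sketch but appears in the neighbor set only once.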
In practice, you may need to convert some node properties into nodes of their own schemas before calculating a similarity index that is based on common neighbors, such as overlap similarity. For instance, when considering the similarity between two applications, information such as the phone number, email and device IP of each application might have been stored as properties of the @application node schema; these values need to be modeled as nodes and incorporated into the graph in order to be used for comparison.
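As a rough illustration of this remodeling step, the sketch below turns property values into nodes and links each application to them, so that shared values become common neighbors. The records, the property names and the `key:value` node naming are all hypothetical, and the code operates on plain Python data rather than on an Ultipa graph.

```python
# Hypothetical application records whose details are stored as properties
apps = {
    "app1": {"phone": "555-0100", "email": "a@example.com", "ip": "10.0.0.1"},
    "app2": {"phone": "555-0100", "email": "b@example.com", "ip": "10.0.0.1"},
}


def properties_to_edges(records):
    """Turn each (property, value) pair into a node linked to its owner."""
    edges = []
    for app_id, props in records.items():
        for key, value in props.items():
            edges.append((app_id, f"{key}:{value}"))
    return edges


edges = properties_to_edges(apps)

# Build neighbor sets from the new edges
neighbors = {}
for app_id, value_node in edges:
    neighbors.setdefault(app_id, set()).add(value_node)

shared = neighbors["app1"] & neighbors["app2"]
similarity = len(shared) / min(len(neighbors["app1"]), len(neighbors["app2"]))
print(sorted(shared), round(similarity, 4))
```

Here the two applications share a phone number and a device IP but not an email, so two of three neighbors are common.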
Syntax
- Command: algo(similarity)
- Parameters:

Name | Type | Spec | Default | Optional | Description |
---|---|---|---|---|---|
ids / uuids | []_id / []_uuid | / | / | No | ID/UUID of the first group of nodes to calculate |
ids2 / uuids2 | []_id / []_uuid | / | / | Yes | ID/UUID of the second group of nodes to calculate |
type | string | overlap | cosine | No | Type of similarity; for Overlap Similarity, keep it as overlap |
limit | int | >=-1 | -1 | Yes | Number of results to return, -1 to return all results |
top_limit | int | >=-1 | -1 | Yes | In the selection mode, limit the maximum number of results returned for each node specified in ids/uuids, -1 to return all results with similarity > 0; in the pairing mode, this parameter is invalid |
The algorithm has two calculation modes:
- Pairing: when both ids/uuids and ids2/uuids2 are configured, each node in ids/uuids is paired with each node in ids2/uuids2 (pairs made of the same node are ignored), and pair-wise similarities are computed.
- Selection: when only ids/uuids is configured, pair-wise similarities are computed between each target node in it and all other nodes in the graph. The returned results include all, or a limited number of, nodes that have similarity > 0 with the target node, ordered by descending similarity.
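The difference between the two modes can be sketched in Python on precomputed neighbor sets. This is only an illustration of the behavior described above, not Ultipa's implementation: the functions `pairing` and `selection`, the node names and the neighbor sets are hypothetical, and the global `limit` parameter is not modeled.

```python
from itertools import product


def overlap(a, b):
    """Overlap similarity; assumed to be 0 when either set is empty."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0


def pairing(neighbors, ids, ids2):
    """Pair every node in ids with every node in ids2, skipping identical nodes."""
    return [(x, y, overlap(neighbors[x], neighbors[y]))
            for x, y in product(ids, ids2) if x != y]


def selection(neighbors, ids, top_limit=-1):
    """Compare each node in ids with all other nodes; keep similarity > 0, descending."""
    results = []
    for x in ids:
        scored = [(x, y, overlap(neighbors[x], neighbors[y]))
                  for y in neighbors if y != x]
        scored = [row for row in scored if row[2] > 0]
        scored.sort(key=lambda row: row[2], reverse=True)
        # top_limit caps the results per target node; -1 keeps them all
        results.extend(scored if top_limit == -1 else scored[:top_limit])
    return results


# Hypothetical neighbor sets
neighbors = {
    "n1": {"a", "b", "c"},
    "n2": {"b", "c"},
    "n3": {"c", "d", "e", "f"},
    "n4": {"x"},
}

paired = pairing(neighbors, ["n1", "n2"], ["n2", "n3"])
selected = selection(neighbors, ["n1"], top_limit=1)
print(paired)
print(selected)
```

In this sketch the pair (n2, n2) is dropped in pairing mode, while selection mode drops n4 because its similarity with n1 is 0.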
Examples
The example graph is as follows:
File Writeback
Spec | Content |
---|---|
filename | node1, node2, similarity |
algo(similarity).params({
ids: 'userC',
ids2: ['userA', 'userB', 'userD'],
type: 'overlap'
}).write({
file:{
filename: 'sc'
}
})
Results: File sc
userC,userA,1
userC,userB,1
userC,userD,0
algo(similarity).params({
uuids: [1,2,3,4],
type: 'overlap'
}).write({
file:{
filename: 'list'
}
})
Results: File list
userA,userC,1
userA,userB,0.5
userA,userD,0.333333
userB,userC,1
userB,userA,0.5
userB,userD,0.5
userC,userA,1
userC,userB,1
userD,userB,0.5
userD,userA,0.333333
Direct Return
Alias Ordinal | Type | Description | Columns |
---|---|---|---|
0 | []perNodePair | Node pair and its similarity | node1, node2, similarity |
algo(similarity).params({
uuids: [1,2],
uuids2: [2,3,4],
type: 'overlap'
}) as overlap
return overlap
Results: overlap
node1 | node2 | similarity |
---|---|---|
1 | 2 | 0.5 |
1 | 3 | 1 |
1 | 4 | 0.333333333333333 |
2 | 3 | 1 |
2 | 4 | 0.5 |
algo(similarity).params({
uuids: [1,2],
type: 'overlap',
top_limit: 1
}) as top
return top
Results: top
node1 | node2 | similarity |
---|---|---|
1 | 3 | 1 |
2 | 3 | 1 |
Stream Return
Alias Ordinal | Type | Description | Columns |
---|---|---|---|
0 | []perNodePair | Node pair and its similarity | node1, node2, similarity |
algo(similarity).params({
uuids: [3],
uuids2: [1,2,4],
type: 'overlap'
}).stream() as overlap
where overlap.similarity > 0
return overlap
Results: overlap
node1 | node2 | similarity |
---|---|---|
3 | 1 | 1 |
3 | 2 | 1 |
algo(similarity).params({
uuids: [1],
type: 'overlap',
top_limit: 2
}).stream() as top
return top
Results: top
node1 | node2 | similarity |
---|---|---|
1 | 3 | 1 |
1 | 2 | 0.5 |