Overview
Overlap similarity is derived from Jaccard similarity, which is also called the Szymkiewicz–Simpson coefficient. It divides the size of the intersection of two sets by the size of the smaller set with the purpose to indicate how similar the two sets are.
Overlap similarity ranges from 0 to 1; 1 means that one set is the subset of the other or the two sets are exactly the same, 0 means that the two sets do not have any element in common.
Concepts
Overlap Similarity
Given two sets A and B, the overlap similarity between them is computed as:
In the following example, set A = {b,c,e,f,g}, set B = {a,d,b,g}, their intersection A⋂B = {b,g}, hence the overlap similarity between A and B is 2 / 4 = 0.5
.
When applying Overlap Similarity to compare two nodes in a graph, we use the 1-hop neighborhood set to represent each target node. The 1-hop neighborhood set:
- contains no repeated nodes;
- excludes the two target nodes.
In this graph, the 1-hop neighborhood set of nodes u and v is:
- Nu = {a,b,c,d,e}
- Nv = {d,e,f}
Therefore, the Jaccard similarity between nodes u and v is 2 / 3 = 0.666667
.
In practice, you may need to convert some node properties into node schemas in order to calculate the similarity index that is based on common neighbors, just as the overlap Similarity. For instance, when considering the similarity between two applications, information like phone number, email, device IP, etc. of the application might have been stored as properties of @application node schema; they need to be designed as nodes and incorporated into the graph in order to be used for comparison.
Weighted Overlap Similarity
The Weighted Overlap Similarity is an extension of the classic Overlap Similarity that takes into account the weights associated with elements in the sets being compared.
The formula for Weighted Overlap Similarity is given by:
In this weighted graph, the union of the 1-hop neighborhood sets Nu and Nv is {a,b,c,d,e,f}. Set each element in the union set to the sum of the edge weights between the target node and the corresponding node, or 0 if there are no edges between them:
a | b | c | d | e | f | sum | |
---|---|---|---|---|---|---|---|
N'u | 1 | 1 | 1 | 1 | 0.5 | 0 | 4.5 |
N'v | 0 | 0 | 0 | 0.5 | 1.5 + 0.1 =1.6 | 1 | 3.1 |
Therefore, the Weight Overlap Similarity between nodes u and v is (0+0+0+0.5+0.5+0) / 3.1 = 0.322581
.
Please ensure that the sum of the edge weights between the target node and the neighboring node is greater than or equal to 0.
Considerations
- The Overlap Similarity algorithm ignores the direction of edges but calculates them as undirected edges.
- The Overlap Similarity algorithm ignores any self-loop.
Example Graph
To create this graph:
// Runs each row separately in order in an empty graphset
create().node_schema("user").node_schema("sport").edge_schema("like")
insert().into(@user).nodes([{_id:"userA"}, {_id:"userB"}, {_id:"userC"}, {_id:"userD"}])
insert().into(@sport).nodes([{_id:"running"}, {_id:"tennis"}, {_id:"baseball"}, {_id:"swimming"}, {_id:"badminton"}, {_id:"iceball"}])
insert().into(@like).edges([{_from:"userA", _to:"tennis"}, {_from:"userA", _to:"baseball"}, {_from:"userA", _to:"swimming"}, {_from:"userA", _to:"badminton"}, {_from:"userB", _to:"running"}, {_from:"userB", _to:"swimming"}, {_from:"userC", _to:"swimming"}, {_from:"userD", _to:"running"}, {_from:"userD", _to:"badminton"}, {_from:"userD", _to:"iceball"}])
Creating HDC Projection
To project the entire graph to the HDC server hdc-server-1
as hdc_overlap_similarity
:
CALL hdc.graph.create("hdc-server-1", "hdc_overlap_similarity", {
nodes: {"*": ["*"]},
edges: {"*": ["*"]},
direction: "undirected",
load_id: true,
update: "static",
query: "query",
default: false
})
hdc.graph.create("hdc_overlap_similarity", {
nodes: {"*": ["*"]},
edges: {"*": ["*"]},
direction: "undirected",
load_id: true,
update: "static",
query: "query",
default: false
}).to("hdc-server-1")
Parameters
Algorithm name: similarity
Name |
Type |
Spec |
Default |
Optional |
Description |
---|---|---|---|---|---|
ids / uuids |
[]_id / []_uuid |
/ | / | No | Specifies the first group of nodes for computation by their _id or _uuid . |
ids2 / uuids2 |
[]_id / []_uuid |
/ | / | Yes | Specifies the second group of nodes for computation by their _id or _uuid . |
type |
String | overlap |
cosine |
No | Specifies the type of similarity; for Overlap Similarity, keep it as overlap . |
edge_weight_property |
@<schema>?.<property> |
Numeric type | / | Yes | Specifies the edge property to be used as the edge weight; for multiple edges between two nodes, their weights will be summed. |
return_id_uuid |
String | uuid , id , both |
uuid |
Yes | Includes _uuid , _id , or both to represent nodes in the results. |
order |
String | asc , desc |
/ | Yes | Sorts the results by similarity . |
limit |
Integer | ≥-1 | -1 |
Yes | Limits the number of results returned; -1 includes all results. |
top_limit |
Integer | ≥-1 | -1 |
Yes | In the selection mode, limit the maximum number of results returned for each node specified in ids /uuids , -1 to return all results with similarity > 0; in the pairing mode, this parameter is invalid. |
The algorithm has two calculation modes:
- Pairing: When both
ids
/uuids
andids2
/uuids2
are configured, each node inids
/uuids
is paired with each node inids2
/uuids2
(excluding self-pairing), and pairwise similarities are computed. - Selection: When only
ids
/uuids
is configured, pairwise similarities are computed between each target node and all other nodes in the graph. The results include all or a limited number of nodes with a similarity > 0 to the target node, ordered in descending similarity.
File Writeback
Computes similarities in pairing mode:
CALL algo.similarity.write("hdc_overlap_similarity", {
params: {
return_id_uuid: "id",
ids: "userC",
ids2: ["userA", "userB", "userD"],
type: "overlap"
},
return_params: {
file: {
filename: "overlap"
}
}
})
algo(similarity).params({
project: "hdc_overlap_similarity",
return_id_uuid: "id",
ids: "userC",
ids2: ["userA", "userB", "userD"],
type: "overlap"
}).write({
file: {
filename: "overlap"
}
})
Result:
_id1,_id2,similarity
userC,userA,1
userC,userB,1
userC,userD,0
Computes similarities in selection mode:
CALL algo.similarity.write("hdc_overlap_similarity", {
params: {
return_id_uuid: "id",
ids: ["userA", "userB", "userC", "userD"],
type: "overlap"
},
return_params: {
file: {
filename: "list"
}
}
})
algo(similarity).params({
project: "hdc_overlap_similarity",
return_id_uuid: "id",
ids: ["userA", "userB", "userC", "userD"],
type: "overlap"
}).write({
file: {
filename: "list"
}
})
Result:
_id1,_id2,similarity
userA,userC,1
userA,userB,0.5
userA,userD,0.333333
userB,userC,1
userB,userA,0.5
userB,userD,0.5
userC,userA,1
userC,userB,1
userD,userB,0.5
userD,userA,0.333333
Full Return
Computes similarities in pairing mode:
CALL algo.similarity("hdc_overlap_similarity", {
params: {
return_id_uuid: "id",
ids: ["userA","userB"],
ids2: ["userB","userC","userD"],
type: "overlap"
},
return_params: {}
}) YIELD overlap
RETURN overlap
exec{
algo(similarity).params({
return_id_uuid: "id",
ids: ["userA","userB"],
ids2: ["userB","userC","userD"],
type: "overlap"
}) as overlap
return overlap
} on hdc_overlap_similarity
Result:
_id1 | _id2 | similarity |
---|---|---|
userA | userB | 0.5 |
userA | userC | 1 |
userA | userD | 0.333333 |
userB | userC | 1 |
userB | userD | 0.5 |
Computes similarities in selection mode:
CALL algo.similarity("hdc_overlap_similarity", {
params: {
return_id_uuid: "id",
ids: ["userA","userB"],
type: "overlap",
top_limit: 1
},
return_params: {}
}) YIELD overlap
RETURN overlap
exec{
algo(similarity).params({
return_id_uuid: "id",
ids: ["userA","userB"],
type: "overlap",
top_limit: 1
}) as overlap
return overlap
} on hdc_overlap_similarity
Result:
_id1 | _id2 | similarity |
---|---|---|
userA | userC | 1 |
userB | userC | 1 |
Stream Return
Computes similarities in pairing mode:
CALL algo.similarity("hdc_overlap_similarity", {
params: {
return_id_uuid: "id",
ids: ["userC"],
ids2: ["userA","userB","userD"],
type: "overlap"
},
return_params: {type: "stream"}
}) YIELD overlap
FILTER overlap.similarity > 0
RETURN overlap
exec{
algo(similarity).params({
return_id_uuid: "id",
ids: ["userC"],
ids2: ["userA","userB","userD"],
type: "overlap"
}).stream() as overlap
where overlap.similarity > 0
return overlap
} on hdc_overlap_similarity
Result:
_id1 | _id2 | similarity |
---|---|---|
userC | userA | 1 |
userC | userB | 1 |
Computes similarities in selection mode:
CALL algo.similarity("hdc_overlap_similarity", {
params: {
return_id_uuid: "id",
ids: ["userA"],
type: "overlap",
top_limit: 2
},
return_params: {type: "stream"}
}) YIELD overlap
RETURN overlap
exec{
algo(similarity).params({
return_id_uuid: "id",
ids: ["userA"],
type: "overlap",
top_limit: 2
}).stream() as overlap
return overlap
} on hdc_overlap_similarity
Result:
_id1 | _id2 | similarity |
---|---|---|
userA | userC | 1 |
userA | userB | 0.5 |