Overlap Similarity

✓ File Writeback ✕ Property Writeback ✓ Direct Return ✓ Stream Return ✕ Stats

Overview

Overlap similarity, also called the Szymkiewicz–Simpson coefficient, is derived from Jaccard similarity. It divides the size of the intersection of two sets by the size of the smaller set to indicate how similar the two sets are.

Overlap similarity ranges from 0 to 1. A value of 1 means that one set is a subset of the other or that the two sets are identical; a value of 0 means that the two sets have no elements in common.

Concepts

Overlap Similarity

Given two sets A and B, the overlap similarity between them is computed as:

$$\mathrm{overlap}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$$

In the following example, set A = {b,c,e,f,g}, set B = {a,d,b,g}, their intersection A⋂B = {b,g}, hence the overlap similarity between A and B is 2 / 4 = 0.5.
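The set-based definition can be checked with a few lines of Python. This is only an illustrative sketch, not the product's implementation; the function name `overlap_similarity` is hypothetical, and returning 0 for an empty set is an assumption the text does not state.

```python
def overlap_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the smaller set."""
    if not a or not b:
        # Assumption: treat the similarity as 0 when either set is empty,
        # since the denominator would otherwise be 0.
        return 0.0
    return len(a & b) / min(len(a), len(b))


A = {"b", "c", "e", "f", "g"}
B = {"a", "d", "b", "g"}
print(overlap_similarity(A, B))  # 0.5
```

Because the denominator is the size of the smaller set, any set compared with one of its supersets scores 1, for example `overlap_similarity({1, 2}, {1, 2, 3})`.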

When applying Overlap Similarity to compare two nodes in a graph, we use the 1-hop neighborhood set to represent each target node. The 1-hop neighborhood set:

contains no repeated nodes;
excludes the two target nodes.

In this graph, the 1-hop neighborhood set of nodes u and v is:

N_u = {a,b,c,d,e}
N_v = {d,e,f}

Their intersection is {d,e} and the smaller set, N_v, has 3 elements. Therefore, the overlap similarity between nodes u and v is 2 / 3 = 0.666667.
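The neighborhood-based calculation can be sketched in Python as follows. The edge list is hypothetical: it is chosen only so that it yields the neighborhood sets N_u and N_v listed above, and it deliberately includes a repeated edge and a self-loop to show that neither affects the result. The helper names are illustrative.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Build undirected 1-hop neighbor sets, skipping self-loops."""
    nbrs = defaultdict(set)
    for src, dst in edges:
        if src == dst:
            continue  # self-loops are ignored
        nbrs[src].add(dst)  # sets drop repeated neighbors automatically
        nbrs[dst].add(src)  # edge direction is ignored
    return dict(nbrs)


def node_overlap(nbrs, u, v):
    """Overlap similarity of the 1-hop neighborhoods of u and v."""
    n_u = nbrs.get(u, set()) - {u, v}  # exclude the two target nodes
    n_v = nbrs.get(v, set()) - {u, v}
    if not n_u or not n_v:
        return 0.0  # assumption: 0 when a neighborhood is empty
    return len(n_u & n_v) / min(len(n_u), len(n_v))


# Hypothetical edges consistent with N_u = {a,b,c,d,e} and N_v = {d,e,f}.
edges = [
    ("u", "a"), ("u", "b"), ("b", "u"),  # u-b appears twice
    ("u", "c"), ("u", "d"), ("u", "e"),
    ("v", "d"), ("e", "v"), ("v", "f"),
    ("u", "u"),                          # self-loop
]
nbrs = build_neighbors(edges)
print(round(node_overlap(nbrs, "u", "v"), 6))  # 0.666667
```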

NOTE

In practice, you may need to convert some node properties into nodes in order to calculate a similarity index that is based on common neighbors, such as Overlap Similarity. For instance, when considering the similarity between two applications, information such as the phone number, email, and device IP of an application might have been stored as properties of the @application node schema; these need to be modeled as nodes and incorporated into the graph before they can be used for comparison.

Weighted Overlap Similarity

The Weighted Overlap Similarity is an extension of the classic Overlap Similarity that takes into account the weights associated with elements in the sets being compared.

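The formula for Weighted Overlap Similarity is given by:

$$\mathrm{overlap}_w(u, v) = \frac{\sum_{i} \min(N'_{u,i},\, N'_{v,i})}{\min\left(\sum_{i} N'_{u,i},\; \sum_{i} N'_{v,i}\right)}$$

where i runs over the union of the two 1-hop neighborhood sets, and N'_{u,i} and N'_{v,i} are the summed edge weights between the respective target node and element i.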

In this weighted graph, the union of the 1-hop neighborhood sets N_u and N_v is {a,b,c,d,e,f}. For each target node, assign every element of the union the sum of the weights of the edges between the target node and that element, or 0 if there is no edge between them:

	a	b	c	d	e	f	sum
N'_u	1	1	1	1	0.5	0	4.5
N'_v	0	0	0	0.5	1.5 + 0.1 = 1.6	1	3.1

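Therefore, the Weighted Overlap Similarity between nodes u and v is (0+0+0+0.5+0.5+0) / 3.1 = 0.322581, where each term in the numerator is the smaller of the two weights in that column and the denominator is the smaller of the two row sums.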
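The weighted calculation can be reproduced with the short Python sketch below. The weighted edge list is transcribed from the table above (including the two parallel edges between v and e); the function names are hypothetical, and returning 0 when the denominator is not positive is an assumption rather than documented behavior.

```python
from collections import defaultdict


def weighted_neighbors(edges, node):
    """Sum the weights of all edges between `node` and each neighbor."""
    totals = defaultdict(float)
    for src, dst, weight in edges:
        if src == dst:
            continue  # self-loops are ignored
        if src == node:
            totals[dst] += weight
        elif dst == node:
            totals[src] += weight
    return totals


def weighted_overlap(edges, u, v):
    """Sum of per-neighbor minimum weights over the smaller total weight."""
    w_u = weighted_neighbors(edges, u)
    w_v = weighted_neighbors(edges, v)
    denominator = min(sum(w_u.values()), sum(w_v.values()))
    if denominator <= 0:
        return 0.0  # assumption: avoid dividing by zero
    # Neighbors missing on one side have weight 0 there, so only
    # shared neighbors contribute to the numerator.
    shared = sum(min(w_u[k], w_v[k]) for k in w_u.keys() & w_v.keys())
    return shared / denominator


# Edge weights taken from the table above.
edges = [
    ("u", "a", 1), ("u", "b", 1), ("u", "c", 1), ("u", "d", 1),
    ("u", "e", 0.5),
    ("v", "d", 0.5), ("v", "e", 1.5), ("v", "e", 0.1), ("v", "f", 1),
]
print(round(weighted_overlap(edges, "u", "v"), 6))  # 0.322581
```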

NOTE

Please ensure that the sum of the edge weights between the target node and the neighboring node is greater than or equal to 0.

Considerations

The Overlap Similarity algorithm ignores the direction of edges and treats all edges as undirected.
The Overlap Similarity algorithm ignores any self-loop.

Syntax

Command: algo(similarity)
Parameters:

Name	Type	Spec	Default	Optional	Description
ids / uuids	[]`_id` / []`_uuid`	/	/	No	ID/UUID of the first group of nodes to calculate
ids2 / uuids2	[]`_id` / []`_uuid`	/	/	Yes	ID/UUID of the second group of nodes to calculate
type	string	`overlap`	`cosine`	No	Type of similarity; for Overlap Similarity, keep it as `overlap`
edge_weight_property	`@<schema>?.<property>`	Numeric type, must LTE	/	Yes	The edge property to use as edge weight, where the weights of multiple edges between two nodes are summed up
limit	int	≥-1	`-1`	Yes	Number of results to return, `-1` to return all results
top_limit	int	≥-1	`-1`	Yes	In the selection mode, the maximum number of results returned for each node specified in `ids`/`uuids`, `-1` to return all results with similarity > 0; in the pairing mode, this parameter has no effect

The algorithm has two calculation modes:

Pairing: when both ids/uuids and ids2/uuids2 are configured, each node in ids/uuids is paired with each node in ids2/uuids2 (a node is never paired with itself), and the pair-wise similarities are computed.
Selection: when only ids/uuids is configured, the pair-wise similarities between each target node in it and all other nodes in the graph are computed. The results include all, or a limited number of, the nodes that have similarity > 0 with the target node, ordered by descending similarity.
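The difference between the two modes can be sketched in Python. This is a rough illustration under stated assumptions, not the engine's implementation: the graph, the node names, and the functions `pairing` and `selection` are all hypothetical, and tie-breaking between equal similarities is not specified by the text.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Undirected 1-hop neighbor sets, ignoring self-loops."""
    nbrs = defaultdict(set)
    for a, b in edges:
        if a != b:
            nbrs[a].add(b)
            nbrs[b].add(a)
    return dict(nbrs)


def overlap(nbrs, u, v):
    n_u = nbrs.get(u, set()) - {u, v}
    n_v = nbrs.get(v, set()) - {u, v}
    if not n_u or not n_v:
        return 0.0
    return len(n_u & n_v) / min(len(n_u), len(n_v))


def pairing(nbrs, ids, ids2):
    """Every node in ids against every node in ids2, skipping identical nodes."""
    return [(a, b, overlap(nbrs, a, b)) for a in ids for b in ids2 if a != b]


def selection(nbrs, ids, top_limit=-1):
    """Each node in ids against all other nodes; keep similarity > 0, best first."""
    rows = []
    for a in ids:
        scored = [(a, b, overlap(nbrs, a, b)) for b in sorted(nbrs) if b != a]
        scored = [row for row in scored if row[2] > 0]
        scored.sort(key=lambda row: row[2], reverse=True)
        rows.extend(scored if top_limit == -1 else scored[:top_limit])
    return rows


# Hypothetical graph: p, q, r, s are compared through neighbors w, x, y, z.
nbrs = build_neighbors([
    ("p", "x"), ("p", "y"), ("q", "y"), ("q", "z"),
    ("r", "x"), ("r", "y"), ("r", "z"), ("s", "w"),
])
print(pairing(nbrs, ["p"], ["q", "r", "s", "p"]))
print(selection(nbrs, ["p"], top_limit=1))
```

In this sketch, pairing keeps pairs whose similarity is 0, as in the File Writeback example where a pair is written with similarity 0, whereas selection drops them.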

Examples

The example graph is as follows:

File Writeback

Spec	Content
filename	`node1`,`node2`,`similarity`

UQL
algo(similarity).params({
  ids: 'userC',
  ids2: ['userA', 'userB', 'userD'],
  type: 'overlap'
}).write({
  file:{ 
    filename: 'sc'
  }
})

Results: File sc

File
userC,userA,1
userC,userB,1
userC,userD,0

UQL
algo(similarity).params({
  uuids: [1,2,3,4],
  type: 'overlap'
}).write({
  file:{ 
    filename: 'list'
  }
})

Results: File list

File
userA,userC,1
userA,userB,0.5
userA,userD,0.333333
userB,userC,1
userB,userA,0.5
userB,userD,0.5
userC,userA,1
userC,userB,1
userD,userB,0.5
userD,userA,0.333333

Direct Return

Alias Ordinal	Type	Description	Columns
0	[]perNodePair	Node pair and its similarity	`node1`, `node2`, `similarity`

UQL
algo(similarity).params({ 
  uuids: [1,2], 
  uuids2: [2,3,4],
  type: 'overlap'
}) as overlap
return overlap

Results: overlap

node1	node2	similarity
1	2	0.5
1	3	1
1	4	0.333333333333333
2	3	1
2	4	0.5

UQL
algo(similarity).params({
  uuids: [1,2],
  type: 'overlap',
  top_limit: 1
}) as top
return top

Results: top

node1	node2	similarity
1	3	1
2	3	1

Stream Return

Alias Ordinal	Type	Description	Columns
0	[]perNodePair	Node pair and its similarity	`node1`, `node2`, `similarity`

UQL
algo(similarity).params({ 
  uuids: [3], 
  uuids2: [1,2,4],
  type: 'overlap'
}).stream() as overlap
where overlap.similarity > 0
return overlap

Results: overlap

node1	node2	similarity
3	1	1
3	2	1

UQL
algo(similarity).params({
  uuids: [1],
  type: 'overlap',
  top_limit: 2
}).stream() as top
return top

Results: top

node1	node2	similarity
1	3	1
1	2	0.5