Overlap Similarity

Change Password

Submit

Change Email

Submit

Change Nickname

Current Nickname:

Submit

Profile

Account ID:

Full Name:
Phone:
Company:
Company Email:

Change Password

Apply

You have no license application record.

Apply

Certificate	Issued at	Valid until	Serial No.	File

Serial No.	Valid until	File

Not having one? Apply now! >>>

Product	Created On	ID	Amount (USD)	Invoice

Product	Created On	ID	Amount (USD)	Invoice

No Invoice

Create Ultipa Account

I agree to the Privacy Policy and the

Data Processing Agreement .

Please agree to continue.

Already have an Ultipa account? Sign in now!

Forgot Password

Reset Password

Back to sign in

Overlap Similarity

HDC

Overview

Overlap similarity is derived from Jaccard similarity, which is also called the Szymkiewicz–Simpson coefficient. It divides the size of the intersection of two sets by the size of the smaller set with the purpose to indicate how similar the two sets are.

Overlap similarity ranges from 0 to 1; 1 means that one set is the subset of the other or the two sets are exactly the same, 0 means that the two sets do not have any element in common.

Concepts

Given two sets A and B, the overlap similarity between them is computed as:

In the following example, set A = {b,c,e,f,g}, set B = {a,d,b,g}, their intersection A⋂B = {b,g}, hence the overlap similarity between A and B is 2 / 4 = 0.5.

When applying Overlap Similarity to compare two nodes in a graph, we use the 1-hop neighborhood set to represent each target node. The 1-hop neighborhood set:

contains no repeated nodes;
excludes the two target nodes.

In this graph, the 1-hop neighborhood set of nodes u and v is:

N_u = {a,b,c,d,e}
N_v = {d,e,f}

Therefore, the Jaccard similarity between nodes u and v is 2 / 3 = 0.666667.

In practice, you may need to convert some node properties into node schemas in order to calculate the similarity index that is based on common neighbors, just as the overlap Similarity. For instance, when considering the similarity between two applications, information like phone number, email, device IP, etc. of the application might have been stored as properties of @application node schema; they need to be designed as nodes and incorporated into the graph in order to be used for comparison.

Weighted Overlap Similarity

The Weighted Overlap Similarity is an extension of the classic Overlap Similarity that takes into account the weights associated with elements in the sets being compared.

The formula for Weighted Overlap Similarity is given by:

In this weighted graph, the union of the 1-hop neighborhood sets N_u and N_v is {a,b,c,d,e,f}. Set each element in the union set to the sum of the edge weights between the target node and the corresponding node, or 0 if there are no edges between them:

	a	b	c	d	e	f	sum
N'_u	1	1	1	1	0.5	0	4.5
N'_v	0	0	0	0.5	1.5 + 0.1 =1.6	1	3.1

Therefore, the Weighted Overlap Similarity between nodes u and v is (0+0+0+0.5+0.5+0) / 3.1 = 0.322581.

Please ensure that the sum of the edge weights between the target node and the neighboring node is greater than or equal to 0.

Considerations

The Overlap Similarity algorithm ignores the direction of edges but calculates them as undirected edges.
The Overlap Similarity algorithm ignores any self-loop.

Example Graph

To create this graph:

// Runs each row separately in order in an empty graphset
create().node_schema("user").node_schema("sport").edge_schema("like")
insert().into(@user).nodes([{_id:"userA"}, {_id:"userB"}, {_id:"userC"}, {_id:"userD"}])
insert().into(@sport).nodes([{_id:"running"}, {_id:"tennis"}, {_id:"baseball"}, {_id:"swimming"}, {_id:"badminton"}, {_id:"iceball"}])
insert().into(@like).edges([{_from:"userA", _to:"tennis"}, {_from:"userA", _to:"baseball"}, {_from:"userA", _to:"swimming"}, {_from:"userA", _to:"badminton"}, {_from:"userB", _to:"running"}, {_from:"userB", _to:"swimming"}, {_from:"userC", _to:"swimming"}, {_from:"userD", _to:"running"}, {_from:"userD", _to:"badminton"}, {_from:"userD", _to:"iceball"}])

Creating HDC Graph

To load the entire graph to the HDC server hdc-server-1 as hdc_sim_nbr:

CALL hdc.graph.create("hdc-server-1", "hdc_sim_nbr", {
  nodes: {"*": ["*"]},
  edges: {"*": ["*"]},
  direction: "undirected",
  load_id: true,
  update: "static",
  query: "query",
  default: false
})

hdc.graph.create("hdc_sim_nbr", {
  nodes: {"*": ["*"]},
  edges: {"*": ["*"]},
  direction: "undirected",
  load_id: true,
  update: "static",
  query: "query",
  default: false
}).to("hdc-server-1")

Parameters

Algorithm name: similarity

Name	Type	Spec	Default	Optional	Description
`ids`	[]`_id`	/	/	No	Specifies the first group of nodes for computation by their `_id`; computes for all nodes if it is unset.
`uuids`	[]`_uuid`	/	/	No	Specifies the first group of nodes for computation by their `_uuid`; computes for all nodes if it is unset.
`ids2`	[]`_id`	/	/	Yes	Specifies the second group of nodes for computation by their `_id`; computes for all nodes if it is unset.
`uuids2`	[]`_uuid`	/	/	Yes	Specifies the second group of nodes for computation by their `_uuid`; computes for all nodes if it is unset.
`type`	String	`overlap`	`cosine`	No	Specifies the type of similarity to compute; for Overlap Similarity, keep it as `overlap`.
`edge_weight_property`	[]"`<@schema.?><property>`"	/	/	Yes	Numeric edge properties used as the edge weights, summing values across the specified properties; edges without the specified properties are ignored.
`return_id_uuid`	String	`uuid`, `id`, `both`	`uuid`	Yes	Includes `_uuid`, `_id`, or both to represent nodes in the results.
`order`	String	`asc`, `desc`	/	Yes	Sorts the results by `similarity`.
`limit`	Integer	≥-1	`-1`	Yes	Limits the number of results returned; `-1` includes all results.
`top_limit`	Integer	≥-1	`-1`	Yes	Limits the number of results returned for each node specified with `ids`/`uuids` in selection mode; `-1` includes all results with a similarity greater than 0. This parameter is invalid in pairing mode.

The algorithm has two calculation modes:

Pairing: When both ids/uuids and ids2/uuids2 are configured, each node in ids/uuids is paired with each node in ids2/uuids2 (excluding self-pairing), and pairwise similarities are computed.
Selection: When only ids/uuids is configured, pairwise similarities are computed between each target node and all other nodes in the graph. The results include all or a limited number of nodes with a similarity > 0 to the target node, ordered in descending similarity.

File Writeback

CALL algo.similarity.write("hdc_sim_nbr", {
  params: {
    return_id_uuid: "id",
    ids: "userC",
    ids2: ["userA", "userB", "userD"],
    type: "overlap"
  },
  return_params: {
    file: {
      filename: "overlap"
    }
  }
})

algo(similarity).params({
  projection: "hdc_sim_nbr",
  return_id_uuid: "id",
  ids: "userC",
  ids2: ["userA", "userB", "userD"],
  type: "overlap"  
}).write({
  file: {
    filename: "overlap"
  }
})

Result:

_id1,_id2,similarity
userC,userA,1
userC,userB,1
userC,userD,0

Full Return

Computes similarities in pairing mode:

CALL algo.similarity("hdc_sim_nbr", {
  params: {
    return_id_uuid: "id",
    ids: ["userA","userB"], 
    ids2: ["userB","userC","userD"],
    type: "overlap"
  },
  return_params: {}
}) YIELD overlap
RETURN overlap

exec{
  algo(similarity).params({
    return_id_uuid: "id",
    ids: ["userA","userB"], 
    ids2: ["userB","userC","userD"],
    type: "overlap"
  }) as overlap
  return overlap
} on hdc_sim_nbr

Result:

_id1	_id2	similarity
userA	userB	0.5
userA	userC	1
userA	userD	0.333333
userB	userC	1
userB	userD	0.5

Stream Return

CALL algo.similarity("hdc_sim_nbr", {
  params: {
    return_id_uuid: "id",
    ids: ["userA"],
    type: "overlap",
    edge_weight_property: "weight",
    top_limit: 2    
  },
  return_params: {
  	stream: {}
  }
}) YIELD overlap
RETURN overlap

exec{
  algo(similarity).params({
    return_id_uuid: "id",
    ids: ["userA"], 
    type: "overlap",
    edge_weight_property: "weight",
    top_limit: 2  
  }).stream() as overlap
  return overlap
} on hdc_sim_nbr

Result:

_id1	_id2	similarity
userA	userC	1
userA	userB	0.75

ID
Product
Status
Cores
Maximum Shard Services
Maximum Total Cores for Shard Service
Maximum HDC Services
Maximum Total Cores for HDC Service
Applied Validity Period(days)
Effective Date
Expired Date
Mac Address
Reason for Application
Review Comment