
      Jaccard Similarity

      Overview

      Jaccard similarity, also known as the Jaccard index, was proposed by Paul Jaccard in 1901. It is a measure of node similarity based on the semi-structured information of the internet. It divides the size of the intersection of two sets by the size of their union to indicate how similar the two sets are. In a graph, each node represents a set and the node's neighbors represent the elements of that set, so the Jaccard similarity of two nodes is the proportion of their common neighbors among all of their neighbors.

      In applications, the elements of a set are typically a series of properties of an entity. For instance, when calculating the similarity between two credit applications, the elements are the phone number, email, device IP, company name and so on in the application form. In general graph applications, this kind of information is often stored as node properties; to run this algorithm, however, the information must be modeled as nodes and incorporated into the graph.

      The range of values of Jaccard similarity is [0,1]; the larger the value, the more similar the two sets are.

      Basic Concept

      Set

      A set consists of multiple elements; elements in a set are unordered and distinct; the number of elements in set A is the size of set A, denoted as |A|.

      The set that consists of the common elements of set A and set B is called the intersection of A and B, denoted as A⋂B; the set that consists of all elements of set A and set B is called the union of A and B, denoted as A⋃B.

      In the image above, set A is {b,c,e,f,g}, set B is {a,d,b,g}, intersection A⋂B is {b,g}, union A⋃B is {a,b,c,d,e,f,g}.

      Jaccard Similarity

      Given sets A and B, the Jaccard similarity between them is defined as:

      J(A, B) = |A⋂B| / |A⋃B|

      Applying this definition to sets A and B from the previous example gives a Jaccard similarity of 2 / 7 = 0.2857.
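The set-based definition can be checked with a few lines of Python. This is a minimal sketch, not Ultipa code: the function name `jaccard` is chosen here for illustration, and returning 0.0 when both sets are empty is an assumption made to avoid dividing by zero.

```python
def jaccard(a: set, b: set) -> float:
    """Return |A ∩ B| / |A ∪ B| for two sets."""
    union = a | b
    if not union:
        # Assumption: two empty sets are given similarity 0 instead of 0/0.
        return 0.0
    return len(a & b) / len(union)


# Sets A and B from the example above.
A = {"b", "c", "e", "f", "g"}
B = {"a", "d", "b", "g"}

similarity = jaccard(A, B)
print(round(similarity, 4))  # 0.2857
```

The intersection {b, g} has 2 elements and the union has 7, which reproduces the 2 / 7 figure.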

      Neighbors

      In a graph, Kx, the set of neighbors of node x, plays the role of set A, and Ky, the set of neighbors of node y, plays the role of set B. Neither Kx nor Ky contains repeated values, and neither contains x or y, so the following must be excluded when finding neighbors by edge:

      • Multiple edges between x/y and their neighbors
      • Self-loop edges of x and y
      • Edges between x and y

      In the graph above, the red and green nodes have 2 common neighbors and 6 neighbors in total, so their Jaccard similarity is 2 / 6 = 0.3333.
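The three exclusions above can be sketched in Python as follows. The edge list is hypothetical (it is not the graph from the figure) and was picked so that x and y end up with 2 common neighbors out of 6; the helper names `neighbor_sets` and `jaccard_nodes` are illustrative only.

```python
from collections import defaultdict


def neighbor_sets(edges):
    """Map each node to its set of distinct neighbors.

    Storing neighbors in a set collapses multiple edges between the
    same two nodes, and self-loops are skipped outright.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # self-loop: adds no neighbor
        neighbors[u].add(v)
        neighbors[v].add(u)  # direction is ignored
    return neighbors


def jaccard_nodes(neighbors, x, y):
    """Jaccard similarity of x and y, with x and y removed from both sets."""
    kx = neighbors.get(x, set()) - {x, y}
    ky = neighbors.get(y, set()) - {x, y}
    union = kx | ky
    return len(kx & ky) / len(union) if union else 0.0


# Hypothetical edges, including every kind of interference listed above.
edges = [
    ("x", "a"), ("x", "a"),      # multiple edges between x and a
    ("x", "b"), ("x", "c"), ("x", "d"),
    ("x", "x"),                  # self-loop on x
    ("x", "y"),                  # edge between x and y
    ("y", "c"), ("d", "y"), ("y", "e"), ("y", "f"),
]

score = jaccard_nodes(neighbor_sets(edges), "x", "y")
print(round(score, 4))  # 0.3333
```

Here Kx is {a, b, c, d} and Ky is {c, d, e, f}, so the result is 2 / 6.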

      Special Case

      Isolated Node, Disconnected Graph

      An isolated node (an empty set) is rarely of computational value in practice. Any intersection that involves an isolated node is empty, so the Jaccard similarity is 0.

      Two nodes that belong to different connected components always have a Jaccard similarity of 0.

      Self-loop Edge

      A self-loop edge of a node does not increase the number of neighbors of that node.

      Directed Edge

      The algorithm ignores the direction of directed edges and treats them as undirected edges.

      Command and Configuration

      • Command: algo(similarity)
      • Configurations for the parameter params():
      | Name | Type | Default | Specification | Description |
      |---|---|---|---|---|
      | ids / uuids | []_id / []_uuid | / | Mandatory | IDs or UUIDs of the first set of nodes to be calculated |
      | ids2 / uuids2 | []_id / []_uuid | / | Optional | IDs or UUIDs of the second set of nodes to be calculated |
      | type | string | cosine | jaccard / overlap / cosine / pearson / euclideanDistance / euclidean | Measurement of the similarity: jaccard (Jaccard Similarity), overlap (Overlap Similarity), cosine (Cosine Similarity), pearson (Pearson Correlation Coefficient), euclideanDistance (Euclidean Distance), euclidean (Normalized Euclidean Distance) |
      | node_schema_property | []@<schema>?.<property> | / | Numeric node property; LTE needed; schema can be either carried or not | When type is cosine / pearson / euclideanDistance / euclidean, two or more node properties must be specified to form the vector; when type is jaccard / overlap, this parameter is invalid |
      | limit | int | -1 | >=-1 | Number of results to return; returns all results if set to -1 |
      | top_limit | int | -1 | >=-1 | Only available in selection mode; limits the length of the selection results (top_list) of each node; returns the full top_list if set to -1 |

      Calculation Mode

      This algorithm has two calculation modes:

      1. Pairing mode: when two sets of valid nodes are configured, each node in the first set is paired with each node in the second set (Cartesian product), and the similarity is calculated for every node pair.
      2. Selection mode: when only one set (the first) of valid nodes is configured, the similarity between each node in the set and every other node in the graph is calculated; results with similarity > 0 are returned in descending order of similarity.
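The difference between the two modes can be illustrated with a small Python sketch. It is an assumption-laden model, not the engine's implementation: the neighbor map is hypothetical, the function names are made up for this example, and the order of tied results in selection mode is simply whatever a stable sort produces.

```python
from itertools import product


def jaccard_nodes(neighbors, x, y):
    """Jaccard similarity of x and y, excluding x and y from both sets."""
    kx = neighbors[x] - {x, y}
    ky = neighbors[y] - {x, y}
    union = kx | ky
    return len(kx & ky) / len(union) if union else 0.0


def pairing_mode(neighbors, first, second):
    """Score every (x, y) pair in the Cartesian product first x second."""
    return [(x, y, jaccard_nodes(neighbors, x, y))
            for x, y in product(first, second)]


def selection_mode(neighbors, first, top_limit=-1):
    """For each x, list all other nodes with similarity > 0, best first."""
    results = {}
    for x in first:
        scored = [(y, jaccard_nodes(neighbors, x, y))
                  for y in neighbors if y != x]
        top = sorted((p for p in scored if p[1] > 0),
                     key=lambda p: p[1], reverse=True)
        results[x] = top if top_limit == -1 else top[:top_limit]
    return results


# Hypothetical graph: three users (u1..u3) linked to the items they like.
neighbors = {
    "u1": {"tennis", "golf"},
    "u2": {"golf", "chess"},
    "u3": {"swimming"},
    "tennis": {"u1"},
    "golf": {"u1", "u2"},
    "chess": {"u2"},
    "swimming": {"u3"},
}

print(pairing_mode(neighbors, ["u1"], ["u2", "u3"]))
print(selection_mode(neighbors, ["u1"]))
```

Pairing mode returns one row per pair, including pairs with similarity 0, while selection mode drops zero-similarity nodes and can be truncated with `top_limit`.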

      Examples

      Example Graph

      The example graph shows the sports liked by userA, userB, userC and userD (UUIDs are 1, 2, 3 and 4 in order):

      Task Writeback

      1. File Writeback

      | Calculation Mode | Configuration | Data in Each Row |
      |---|---|---|
      | Pairing mode | filename | node1,node2,similarity |
      | Selection mode | filename | node,top_list |

      Example: Calculate the Jaccard similarity between userC and each of userA, userB and userD, and write the algorithm results back to a file

      algo(similarity).params({
        ids: "userC",
        ids2: ["userA", "userB", "userD"],
        type: "jaccard"
      }).write({
        file:{ 
          filename: "sc"
        }
      })
      

      Results: File sc

      userC,userA,0.25
      userC,userB,0.5
      userC,userD,0
      

      Example: For each user in the set UUID = 1,2,3,4, select the nodes whose Jaccard similarity with that user is above 0, and write the algorithm results back to a file

      algo(similarity).params({
        uuids: [1,2,3,4],
        type: "jaccard"
      }).write({
        file:{ 
          filename: "list"
        }
      })
      

      Results: File list

      userA,userC:0.250000;userB:0.200000;userD:0.166667;
      userB,userC:0.500000;userD:0.250000;userA:0.200000;
      userC,userB:0.500000;userA:0.250000;
      userD,userB:0.250000;userA:0.166667;
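A row of the selection-mode file can be turned back into structured data with a short parser. This is a sketch that infers the format purely from the sample rows above (node, then a comma, then `neighbor:score` items each ending with a semicolon); it assumes node IDs contain no comma or semicolon, and the function name `parse_selection_row` is hypothetical.

```python
def parse_selection_row(row: str):
    """Split 'node,n1:s1;n2:s2;' into (node, [(n1, s1), (n2, s2)])."""
    node, _, top_list = row.strip().partition(",")
    pairs = []
    for item in top_list.split(";"):
        if not item:
            continue  # skip the empty piece after the trailing ';'
        neighbor, _, score = item.rpartition(":")
        pairs.append((neighbor, float(score)))
    return node, pairs


row = "userA,userC:0.250000;userB:0.200000;userD:0.166667;"
node, top_list = parse_selection_row(row)
print(node, top_list)
```

Nodes with similarity 0 do not appear in a row, which is why userC's row lists only userB and userA.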
      

      2. Property Writeback

      Not supported by this algorithm.

      3. Statistics Writeback

      This algorithm has no statistics.

      Direct Return

      | Calculation Mode | Alias Ordinal | Type | Description | Column Name |
      |---|---|---|---|---|
      | Pairing mode | 0 | []perNodePair | Node pair and its similarity | node1, node2, similarity |
      | Selection mode | 0 | []perNode | Node and its selection results | node, top_list |

      Example: Calculate the Jaccard similarity between user UUID = 1 and users UUID = 2,3,4, and order the results by descending similarity

      algo(similarity).params({ 
        uuids: [1], 
        uuids2: [2,3,4],
        type: "jaccard"
      }) as jacc
      return jacc 
      order by jacc.similarity desc
      

      Results:

      node1 node2 similarity
      1 3 0.25
      1 2 0.2
      1 4 0.166666666666667

      Example: For each of the nodes UUID = 1,2, select the node that has the highest Jaccard similarity with it

      algo(similarity).params({
        uuids: [1,2],
        type: "jaccard",
        top_limit: 1
      }) as top
      return top
      

      Results:

      node top_list
      1 3:0.250000,
      2 3:0.500000,

      Streaming Return

      | Calculation Mode | Alias Ordinal | Type | Description | Column Name |
      |---|---|---|---|---|
      | Pairing mode | 0 | []perNodePair | Node pair and its similarity | node1, node2, similarity |
      | Selection mode | 0 | []perNode | Node and its selection results | node, top_list |

      Example: Calculate the Jaccard similarity between user UUID = 3 and users UUID = 1,2,4, and return only the results with similarity above 0

      algo(similarity).params({ 
        uuids: [3], 
        uuids2: [1,2,4],
        type: "jaccard"
      }).stream() as jacc
      where jacc.similarity > 0
      return jacc
      

      Results:

      node1 node2 similarity
      3 1 0.25
      3 2 0.5

      Example: Select the two nodes with the highest Jaccard similarity with node UUID = 1

      algo(similarity).params({
        uuids: [1],
        type: "jaccard",
        top_limit: 2
      }).stream() as top
      return top
      

      Results:

      node top_list
      1 3:0.250000,2:0.200000,

      Real-time Statistics

      This algorithm has no statistics.
