Change Password

Please enter the password
Please enter the password Length between [8, 64] ASCII characters Not identical to your email address At least 3 character types from uppercase, lowercase, numbers, and single-byte character symbols
Please enter the password
Submit

Change Nickname

Current Nickname:
Submit
Search
v4.0
    v4.0

    Jaccard Similarity

    Overview

    Jaccard similarity, also known as Jaccard index, was proposed by Paul Jaccard in 1901. It is an indicator of node similarity defined based on the semi-structured information of the internet. It divides the size of the intersection of two sets by the size of their union with the purpose to indicate how similar the two sets are. In the graph, Jaccard similarity uses node to represent set, and neighbors of node to represent elements in set, and to calculate the proportion of common neighbors in all neighbors.

    In application, elements in set typically are a series of properties of an entity. For instance, when calculating the similarity between two credit applications, elements are the phone number, email, device IP, company name and so on in the application form. In general graph applications, these kinds of information are often stored as properties of a node; however, when executing this algorithm, these information is designed as nodes and incorporated into the graph.

    The range of values of Jaccard similarity is [0,1]; the larger the value, the more similar the two sets are.

    Basic Concept

    Set

    A set consists of multiple elements; elements in a set are unordered and distinct; the number of elements in set A is the size of set A, denoted as |A|.

    Set that consists of common elements of set A and set B is called the intersection of A and B, denoted as A⋂B; set consists of all elements of set A and set B is called the union of A and B, denoted as A⋃B.

    In the image above, set A is {b,c,e,f,g}, set B is {a,d,b,g}, intersection A⋂B is {b,g}, union A⋃B is {a,b,c,d,e,f,g}.

    Jaccard Similarity

    Known sets A and B, Jaccard similarity between them can be expressed as:

    Jaccard similarity between sets A and B in the previous example can be calculated upon this definition: 2 / 7 = 0.2857.

    Neighbors

    In the graph, Kx is the set of neighbors of node x to represent set A, Ky is the set of neighbors of node y to represent set B. Note that neither Kx nor Ky contains repeated value, nor x, nor y, so the following interferences need to be eliminated when finding neighbors by edge in the graph:

    • Multiple edges between x/y and their neighbors
    • Self-loop edges of x and y
    • Edges between x and y

    In the graph above, the red and green nodes have 2 common neighbors and 6 neighbors in total, their Jaccard similarity is 2 / 6 = 0.3333.

    Special Case

    Lonely Node, Disconnected Graph

    There is rarely computational valuable lonely node (empty set) in practice, intersection that involves lonely node is empty, and Jaccard similarity is 0.

    For two nodes belong to different connected components, their Jaccard similarity must be 0.

    Self-loop Edge

    Self-loop edge of a node does not increase the number of neighbors of the node.

    Directed Edge

    For directed edges, the algorithm ignores the direction of edges but calculates them as undirected edges.

    Results and Statistics

    The graph below shows which sports userA, userB, userC and userD (their UUID are 1, 2, 3 and 4 respectively) likes:

    Algorithm results 1: Calculate the similarity between each user and other 3 users, return node and top_list; top_list is all nodes which has the similarity > 0 with the node, those nodes' UUID and the similarity between then are displayed (ordered by the descending similarity)

    node top_list
    1 3:0.250000, 2:0.200000, 4:0.166667
    2 3:0.500000, 4:0.250000, 1:0.200000
    3 2:0.500000, 1:0.250000
    4 2:0.250000, 1:0.166667

    Algorithm results 2: Calculate the similarity between userC and other 3 users, return node1, node2 and similarity

    node1 node2 similarity
    3 1 0.25
    3 2 0.5
    3 4 0

    Algorithm statistics: N/A

    Command and Configuration

    • Command: algo(similarity)
    • Configurations for the parameter params():
    Name
    Type
    Default
    Specification
    Description
    ids / uuids []_id / []_uuid / Mandatory IDs or UUIDs of the first set of nodes to be calculated
    ids2 / uuids2 []_id / []_uuid / Optional IDs or UUIDs of the second set of nodes to be calculated
    type string cosine jaccard / overlap / cosine / pearson / euclideanDistance / euclidean Measurement of the similarity; jaccard means to calculate Jaccard similarity, overlap means to calcualte overlap similarity, cosine means to calcualte cosine similarity, pearson means to calculate Pearson correlation coefficient, euclideanDistance means to calculate Euclidean distance, euclidean means to calcualte normalzied Euclidean distance
    node_schema_property []@<schema>?.<property> / Numeric node property, LTE needed When the type is cosine/pearson/euclideanDistance/euclidean, must specify two or more node properties to form the vector, schema can be either carried or not; when the type is jaccard/overlap, this parameter is invalid
    limit int -1 >=-1 Number of results to return; return all results if sets to -1 or not set
    top_limit int -1 -1 or >=0 Only available when ids2 and uuids2 are ignored, limit the length of selection results top_list of each node, return the full top_list if sets to -1 or not set

    This algorithm has two calculation modes:

    1. Selection mode

    • When ids2 and uuids2 are ignored or given empty value, for each node in ids/uuids, select all nodes that have similarity > 0 with it (ordered by the descending similarity).
    • The configuration item limit is to control the number of nodes in ids/uuids to return.

    2. Pairing mode

    • When ids2/uuids2 is given solid value, pair node in ids/uuids with node in ids2/uuids2 (Cartesian product), similarities are calculated for each node pair.
    • The configuration item limit is to control the number of nodes pairs to return.

    Example: Calculate Jaccard similarity between any two nodes in set A (UUID = 1,2) and set B (UUID = 3,4)

    algo(similarity).params({ 
      uuids: [1,2], 
      uuids2: [3,4],
      type: "jaccard"
    }) as jacc
    return jacc
    

    Example: For each node in set (UUID = 1,2,3), calculate the most similar node to each node measured by Jaccard similarity

    algo(similarity).params({ 
      uuids: [1,2,3], 
      type: "jaccard",
      top_limit: 1
    }) as nodes
    return nodes
    

    Algorithm Execution

    Task Writeback

    1. File Writeback

    Configuration
    Data in Each Row
    filename node,top_list or node1,node2,similarity

    Example: Calculate Jaccard similarity between node UUID = 3 and other nodes, write the algorithm results back to file named s3

    algo(similarity).params({
      uuids: 3,
      uuids2: [1,2,4],
      type: "jaccard"
    }).write({
      file:{ 
        filename: "s3"
      }
    })
    

    2. Property Writeback

    Not supported by this algorithm.

    3. Statistics Writeback

    This algorithm has no statistics.

    Direct Return

    Alias Ordinal Type Description Column Name
    0 []perNode / []perNodePair Node and its selection results / Node pair and its similarity node, top_list / node1, node2, similarity

    Example: Calculate Jaccard similarity between each node and other nodes, define algorithm results as alias named similarity, and return the results

    algo(similarity).params({
      uuids: [1,2,3,4],
      type: "jaccard"
    }) as s
    return s
    

    Streaming Return

    Alias Ordinal Type Description Column Name
    0 []perNode / []perNodePair Node and its selection results / Node pair and its similarity node, top_list / node1, node2, similarity

    Example: Calculate Jaccard similarity between each node and other nodes, only keep the one with the highest similarity, define algorithm results as alias named similarity, and return the results

    algo(similarity).params({
      uuids: [1,2,3,4],
      type: "jaccard",
      top_limit: 1 
    }).stream() as similarity
    return similarity
    

    Real-time Statistics

    This algorithm has no statistics.

    Please complete the following information to download this book
    *
    公司名称不能为空
    *
    公司邮箱必须填写
    *
    你的名字必须填写
    *
    你的电话必须填写
    *
    你的电话必须填写