Change Password

Input error
Input error
Input error
Submit

Change Nickname

Current Nickname:
Submit
Search
v2.x
    v4.0

    HyperANF

      Advanced  

    Overview

    Graph average distance is defined as the average length of the shortest paths of all node pairs, it can be used to describe the compactness of the graph. In the early days, this concept was used to assess the design of building floors, to study the chemical molecular structure and so on. Gradually, it is applied to the analysis and design of computer system connectivity and communication networks.

    The theoretical value of the graph average distance can be solved by the Neighborhood Function. Since the neighborhood function is very resource intensive when computing on large graphs, ANF (Approximate Neighborhood Function) algorithm and HyperANF algorithm - a breakthrough improvement over ANF in terms of speed and linear scalability - were emerged. HyperANF was proposed in 2011 by Paolo Boldi, Marco Rosa and Sebastiano Vigna of the Department of Information Science at the University of Milan, Italy.

    Related materials of the algorithm are as below:

    Basic Concept

    Graph Average Distance

    According to the definition of graph average distance, given that i and j are two distinct nodes in the same connected component, d(i,j) is the distance (the length or the number of edges of the shortest path) between i and j, k is the number of all connectable node pairs in the whole graph, then the graph average distance can be expressed as:

    Another expression of the graph average distance is given by counting the values of d(i,j) in the above formula, using a positive integer t to represent the value of d(i,j) and p(t) to represent the number of node pairs whose distance is t:

    As the following figure shows, to get the graph average distance, the calculation process of the two methods are (1+2+1+1)/4=1.25, (1*3+2*1)/4=1.25:

    Neighborhood Function

    Neighborhood Function, also known as Global Neighborhood Function, is denoted as N(t) which means when the distance t is given, the number of node pairs that have distance of t or less; the p(t) mentioned earlier can be written by neighborhood function as p(t) = N(t) - N(t-1).

    Consider the number of nodes that are equal or less than t far from node x (those nodes are not x), define it as Independent Neighborhood Function and denote as Nx(t); the relationship between independent neighborhood function and global neighborhood function is: N(t) = 1/2 · ∑ Nx(t).

    Calculate N(2) = (2+3+4+3+2)/2 = 7 by Nx(2), or calculate N(2) = p(2) + N(1) = 3 + 4 = 7 by p(2), the results are the same.

    So far, the solution of the graph average distance can be converted into the calculation of Nx(t).

    Neighborhood Set

    If uses set B(x,t) to represent all the nodes that node x can reach within t steps (x included) and call it the t-step Neighborhood Set of x, then Nx(t) = |B(x,t)| - 1; B(x,t) can be obtained by the iterative computation of itself, i.e. B(x,t) = ∪ B(y,t-1), where y is all the 1-step neighbors of x.

    As above, to examine the neighborhood set B(x,3) of the blue node x in Graph A, 1-step neighbors of the blue node are the green, red and yellow three nodes (denoted as i, j, k respectively); the nodes in the green, red and yellow circles in graph B, C, D are B(i,2), B(j,2), B(k,2) respectively, the union of the three circles is the whole graph, i.e. B(x,3) = B(i,2) ∪ B(j,2) ∪ B(k,2).

    According to the definition of neighborhood set, the calculation of Nx(t) is essentially a process of finding the union set and getting the size of the set, that is, writing the elements in multiple sets into a sequence and then finding the cardinality of the sequence. Cardinality refers to the number of distinct elements in a sequence. For example, two sets {1, 3, 4} and {4, 5, 1 ,7}, writing their elements into a sequence as 1, 3, 4, 4, 5, 1, 7, the cardinality of this sequence is 5, meanwhile it is the number of elements in their union set {1, 3, 4, 5, 7}.

    HyperLogLog

    HyperLogLog is an approximate evaluation of cardinality that counts the cardinality of an elements sequence after reading it. The size of the memory it occupies at the time of computation is a near-linear double logarithm O(n·log(log n)), hence the name.

    HyperLogLog works by preparing an array M of length 2b (each element in array is initially -∞), and matching each element in the sequence to be calculated with a position in that array (there may occur that multiple elements correspond to the same position); reading each element in the sequence in turn and updating the value on the corresponding position in the array; after the reading of the sequence is completed, calculating the cardinality of the sequence according to the value of the array.

    1. How to make an element corresponded to a position in array

    Mapping each element value to a sufficiently long binary numeric sequence, use the integer value of the first b bits in the sequence as the array subscript corresponding to the element. For example, mapping an element into 001100111010101..., when the value of b is 4 and the first 4 bits are 0011 = 3, this element corresponds to M[3] in the array.

    1. How to update the array

    Starting from the b+1 bit in the sequence that the element is mapped to, observe where the first 1 appears and denote that position as ρ (ρ must be a positive integer), if ρ is greater than the value of position that the element corresponds, then replace the value of that position with ρ. In the previous example, after removing the first 4 bits of the binary numeric sequence of the element, the first 1 appears at the 3rd position, i.e. ρ = 3; if the value of M is [-∞,3,1,1,-∞,...], it becomes to [-∞,3,1,3,-∞,...] after reading the element.

    1. How to calculate the cardinality

    After reading all elements, cardinality of the elements sequence is calculated by the value of the array M:

    where m = 2b is the length of array M, α is the function of m:

    For instance, when giving b the value of 4, m is 16, the calcuated cardinality is 45.012745.

    Algorithm Process

    Formula

    Taking connected graph as an example, according to the concept introduced earlier, use the size of neighborhood set |B(x,t)| to represtent the graph average distance:

    where T is the maximum value of the shortest distance between nodes in the graph, that is, the diameter of the graph; n is the number of nodes in the graph.

    Iteration

    Preparing an array M for each node x in the graph to calculate |B(x,t)| the number of nodes in the neighborhood set of the node, n M arrays are needed for n nodes; define C as the n × m-dimensional matrix of these arrays:

    Starting from t = 1, calculate C(t) iteratively; each Mx in C(t) can be derived by taking the maximum value of the same subscript from multiple My in C(t-1) (y is all 1-step neighbors of x). Looping iteratively until C(t) no longer changes or the number of rounds of iteration reaches the limit.

    Example

    Known from the node mapping scheme given above, the length of array M is 4, α is 0.5324, calculate C(t) iteratively:

    C(7) is the same with C(6) after iterating 7 times, so the iteration stops and the diameter T of the graph is 6. Generally, when the times of iteration of the algorithm is less than the diameter of the graph, the more it iterates, the more accurate the result is; when the times of iteration exceeds the diameter of the graph, the accuracy does not vary with the times of iteraction.

    Calculate the corresponding |B(x,t)| according to the values of each M in the table above (when the exponent in the power operation is -∞, use 0 instead):

    Node x |B(x,1)| |B(x,2)| |B(x,3)| |B(x,4)| |B(x,5)| |B(x,6)|
    1 3.586929632 5.679305251 13.6303326 13.6303326 13.6303326 13.6303326
    2 3.40758315 4.5434442 5.679305251 13.6303326 13.6303326 13.6303326
    3 5.679305251 5.679305251 13.6303326 13.6303326 13.6303326 13.6303326
    4 5.242435616 13.6303326 13.6303326 13.6303326 13.6303326 13.6303326
    5 3.097802864 4.259478938 4.5434442 5.679305251 13.6303326 13.6303326
    6 3.7862035 4.259478938 4.5434442 5.679305251 13.6303326 13.6303326
    7 3.40758315 5.679305251 5.679305251 13.6303326 13.6303326 13.6303326
    8 3.245317286 5.679305251 5.679305251 13.6303326 13.6303326 13.6303326
    9 5.679305251 11.3586105 13.6303326 13.6303326 13.6303326 13.6303326
    10 2.839652625 3.7862035 4.259478938 4.5434442 5.679305251 13.6303326
    11 3.245317286 4.008921353 5.679305251 5.679305251 13.6303326 13.6303326
    12 3.586929632 5.679305251 11.3586105 13.6303326 13.6303326 13.6303326
    13 3.40758315 5.679305251 11.3586105 13.6303326 13.6303326 13.6303326

    Put these calcualted |B(x,t)| into the formula and the obtained graph average distance is 3.245076304; in comparison with the accurate average distance 3, it is a reasonable eastimation.

    Special Case

    Lonely Node, Disconnected Graph

    Lonely node is not connected with any other node, it does not participate in the calculation of graph average distance.

    Connected node pairs are found in each connected component in disconnected graph, graph average distance is calcualted by these node pairs jointly.

    Self-loop Edge

    Self-loop edge does not form the shortest paths between nodes, thus has no impact on the calculation of graph average distance.

    Directed Edge

    For directed edges, HyperANF algorithm ignores the direction of edges but calculates them as undirected edges.

    Command

    algo(hyperANF).params(<>)

    Configuration Item Type Default Value Specification Description
    loop_num int 5 >=1 The maximum rounds of iteration of the algorithm
    register_num int 10 4~30 The value of exponent b of the power operation that decides the length of array M when using HyperLogLog to estimate the cardinality

    File Writeback

    (Not supported)

    Property Writeback

    (Not supported)

    Statistics Writeback

    .write() and when executing other writeback operations:

    Name Type Description
    hyperANF_result double Graph average distance

    Direct Return

    (Not supported)

    Streaming Return

    (Not supported)

    Real-time Statistics

    .stats() as <alias> return <alias>

    Alias Number Type Description Column Name
    0 KV, double Graph average distance hyperANF_result
    Please complete the following information to download this book
    *
    公司名称不能为空
    *
    公司邮箱必须填写
    *
    你的名字必须填写
    *
    你的电话必须填写
    *
    你的电话必须填写