Advanced
Overview
Graph average distance is defined as the average length of the shortest paths of all node pairs, it can be used to describe the compactness of the graph. In the early days, this concept was used to assess the design of building floors, to study the chemical molecular structure and so on. Gradually, it is applied to the analysis and design of computer system connectivity and communication networks.
The theoretical value of the graph average distance can be solved by the Neighborhood Function. Since the neighborhood function is very resource intensive when computing on large graphs, ANF (Approximate Neighborhood Function) algorithm and HyperANF algorithm - a breakthrough improvement over ANF in terms of speed and linear scalability - were emerged. HyperANF was proposed in 2011 by Paolo Boldi, Marco Rosa and Sebastiano Vigna of the Department of Information Science at the University of Milan, Italy.
Related materials of the algorithm are as below:
- HyperANF: Approximating the Neighbourhood Function of Very Large Graphs on a Budget
- HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
Basic Concept
Graph Average Distance
According to the definition of graph average distance, given that i
and j
are two distinct nodes in the same connected component, d(i,j)
is the distance (the length or the number of edges of the shortest path) between i
and j
, k
is the number of all connectable node pairs in the whole graph, then the graph average distance can be expressed as:

Another expression of the graph average distance is given by counting the values of d(i,j)
in the above formula, using a positive integer t
to represent the value of d(i,j)
and p(t)
to represent the number of node pairs whose distance is t
:

As the following figure shows, to get the graph average distance, the calculation process of the two methods are (1+2+1+1)/4=1.25
, (1*3+2*1)/4=1.25
:

Neighborhood Function
Neighborhood Function, also known as Global Neighborhood Function, is denoted as N(t)
which means when the distance t
is given, the number of node pairs that have distance of t
or less; the p(t)
mentioned earlier can be written by neighborhood function as p(t) = N(t) - N(t-1)
.
Consider the number of nodes that are equal or less than t
far from node x
(those nodes are not x
), define it as Independent Neighborhood Function and denote as Nx(t)
; the relationship between independent neighborhood function and global neighborhood function is: N(t) = 1/2 · ∑ Nx(t)
.

Calculate N(2) = (2+3+4+3+2)/2 = 7
by Nx(2)
, or calculate N(2) = p(2) + N(1) = 3 + 4 = 7
by p(2)
, the results are the same.
So far, the solution of the graph average distance can be converted into the calculation of Nx(t)
.
Neighborhood Set
If uses set B(x,t)
to represent all the nodes that node x
can reach within t
steps (x
included) and call it the t
-step Neighborhood Set of x
, then Nx(t) = |B(x,t)| - 1
; B(x,t)
can be obtained by the iterative computation of itself, i.e. B(x,t) = ∪ B(y,t-1)
, where y
is all the 1-step neighbors of x
.

As above, to examine the neighborhood set B(x,3)
of the blue node x
in Graph A, 1-step neighbors of the blue node are the green, red and yellow three nodes (denoted as i
, j
, k
respectively); the nodes in the green, red and yellow circles in graph B, C, D are B(i,2)
, B(j,2)
, B(k,2)
respectively, the union of the three circles is the whole graph, i.e. B(x,3) = B(i,2) ∪ B(j,2) ∪ B(k,2)
.
According to the definition of neighborhood set, the calculation of Nx(t)
is essentially a process of finding the union set and getting the size of the set, that is, writing the elements in multiple sets into a sequence and then finding the cardinality of the sequence. Cardinality refers to the number of distinct elements in a sequence. For example, two sets {1, 3, 4}
and {4, 5, 1 ,7}
, writing their elements into a sequence as 1, 3, 4, 4, 5, 1, 7
, the cardinality of this sequence is 5, meanwhile it is the number of elements in their union set {1, 3, 4, 5, 7}
.
HyperLogLog
HyperLogLog is an approximate evaluation of cardinality that counts the cardinality of an elements sequence after reading it. The size of the memory it occupies at the time of computation is a near-linear double logarithm O(n·log(log n))
, hence the name.
HyperLogLog works by preparing an array M
of length 2b (each element in array is initially -∞
), and matching each element in the sequence to be calculated with a position in that array (there may occur that multiple elements correspond to the same position); reading each element in the sequence in turn and updating the value on the corresponding position in the array; after the reading of the sequence is completed, calculating the cardinality of the sequence according to the value of the array.

- How to make an element corresponded to a position in array
Mapping each element value to a sufficiently long binary numeric sequence, use the integer value of the first b
bits in the sequence as the array subscript corresponding to the element. For example, mapping an element into 001100111010101...
, when the value of b
is 4 and the first 4 bits are 0011 = 3
, this element corresponds to M[3]
in the array.
- How to update the array
Starting from the b+1
bit in the sequence that the element is mapped to, observe where the first 1
appears and denote that position as ρ
(ρ
must be a positive integer), if ρ
is greater than the value of position that the element corresponds, then replace the value of that position with ρ
. In the previous example, after removing the first 4 bits of the binary numeric sequence of the element, the first 1
appears at the 3rd position, i.e. ρ = 3
; if the value of M
is [-∞,3,1,1,-∞,...]
, it becomes to [-∞,3,1,3,-∞,...]
after reading the element.
- How to calculate the cardinality
After reading all elements, cardinality of the elements sequence is calculated by the value of the array M
:

where m
= 2b is the length of array M
, α
is the function of m
:

For instance, when giving b
the value of 4
, m
is 16
, the calcuated cardinality is 45.0127
≈45
.
Algorithm Process
Formula
Taking connected graph as an example, according to the concept introduced earlier, use the size of neighborhood set |B(x,t)|
to represtent the graph average distance:

where T
is the maximum value of the shortest distance between nodes in the graph, that is, the diameter of the graph; n
is the number of nodes in the graph.
Iteration
Preparing an array M
for each node x
in the graph to calculate |B(x,t)|
the number of nodes in the neighborhood set of the node, n
M
arrays are needed for n
nodes; define C
as the n × m
-dimensional matrix of these arrays:

Starting from t
= 1
, calculate C(t)
iteratively; each Mx in C(t)
can be derived by taking the maximum value of the same subscript from multiple My in C(t-1)
(y
is all 1-step neighbors of x
). Looping iteratively until C(t)
no longer changes or the number of rounds of iteration reaches the limit.
Example

Known from the node mapping scheme given above, the length of array M
is 4
, α
is 0.5324
, calculate C(t)
iteratively:

C(7)
is the same with C(6)
after iterating 7 times, so the iteration stops and the diameter T
of the graph is 6
. Generally, when the times of iteration of the algorithm is less than the diameter of the graph, the more it iterates, the more accurate the result is; when the times of iteration exceeds the diameter of the graph, the accuracy does not vary with the times of iteraction.
Calculate the corresponding |B(x,t)|
according to the values of each M
in the table above (when the exponent in the power operation is -∞
, use 0
instead):
Node x | |B(x,1)| | |B(x,2)| | |B(x,3)| | |B(x,4)| | |B(x,5)| | |B(x,6)| |
---|---|---|---|---|---|---|
1 | 3.586929632 | 5.679305251 | 13.6303326 | 13.6303326 | 13.6303326 | 13.6303326 |
2 | 3.40758315 | 4.5434442 | 5.679305251 | 13.6303326 | 13.6303326 | 13.6303326 |
3 | 5.679305251 | 5.679305251 | 13.6303326 | 13.6303326 | 13.6303326 | 13.6303326 |
4 | 5.242435616 | 13.6303326 | 13.6303326 | 13.6303326 | 13.6303326 | 13.6303326 |
5 | 3.097802864 | 4.259478938 | 4.5434442 | 5.679305251 | 13.6303326 | 13.6303326 |
6 | 3.7862035 | 4.259478938 | 4.5434442 | 5.679305251 | 13.6303326 | 13.6303326 |
7 | 3.40758315 | 5.679305251 | 5.679305251 | 13.6303326 | 13.6303326 | 13.6303326 |
8 | 3.245317286 | 5.679305251 | 5.679305251 | 13.6303326 | 13.6303326 | 13.6303326 |
9 | 5.679305251 | 11.3586105 | 13.6303326 | 13.6303326 | 13.6303326 | 13.6303326 |
10 | 2.839652625 | 3.7862035 | 4.259478938 | 4.5434442 | 5.679305251 | 13.6303326 |
11 | 3.245317286 | 4.008921353 | 5.679305251 | 5.679305251 | 13.6303326 | 13.6303326 |
12 | 3.586929632 | 5.679305251 | 11.3586105 | 13.6303326 | 13.6303326 | 13.6303326 |
13 | 3.40758315 | 5.679305251 | 11.3586105 | 13.6303326 | 13.6303326 | 13.6303326 |
Put these calcualted |B(x,t)|
into the formula and the obtained graph average distance is 3.245076304
; in comparison with the accurate average distance 3
, it is a reasonable eastimation.
Special Case
Lonely Node, Disconnected Graph
Lonely node is not connected with any other node, it does not participate in the calculation of graph average distance.
Connected node pairs are found in each connected component in disconnected graph, graph average distance is calcualted by these node pairs jointly.
Self-loop Edge
Self-loop edge does not form the shortest paths between nodes, thus has no impact on the calculation of graph average distance.
Directed Edge
For directed edges, HyperANF algorithm ignores the direction of edges but calculates them as undirected edges.
Command
algo(hyperANF).params(<>)
Configuration Item | Type | Default Value | Specification | Description |
---|---|---|---|---|
loop_num | int | 5 | >=1 | The maximum rounds of iteration of the algorithm |
register_num | int | 10 | 4~30 | The value of exponent b of the power operation that decides the length of array M when using HyperLogLog to estimate the cardinality |
File Writeback
(Not supported)
Property Writeback
(Not supported)
Statistics Writeback
.write()
and when executing other writeback operations:
Name | Type | Description |
---|---|---|
hyperANF_result |
double | Graph average distance |
Direct Return
(Not supported)
Streaming Return
(Not supported)
Real-time Statistics
.stats() as <alias> return <alias>
Alias Number | Type | Description | Column Name |
---|---|---|---|
0 | KV, double | Graph average distance | hyperANF_result |