UltipaDocs
Products
Solutions
Resources
Company
Start Free Trial
UltipaDocs
Start Free Trial
  • Introduction
  • Running Algorithms
    • Degree Centrality
    • Closeness Centrality
    • Harmonic Centrality
    • Eccentricity Centrality
    • Betweenness Centrality
    • Bridges
    • Articulation Points
    • Eigenvector Centrality
    • Katz Centrality
    • CELF
    • PageRank
    • ArticleRank
    • TextRank
    • HITS
    • SybilRank
    • Jaccard Similarity
    • Overlap Similarity
    • Cosine Similarity
    • Pearson Correlation Coefficient
    • Euclidean Distance
    • KNN
    • Vector Similarity
    • Bipartite Graph
    • HyperANF
    • Weakly Connected Components (WCC)
    • Strongly Connected Components (SCC)
    • k-Edge Connected Components
    • Local Clustering Coefficient
    • Triangle Count
    • Clique Count
    • k-Core
    • k-Truss
    • p-Cohesion
    • Induced Subgraph
    • Topological Sort
    • Breadth-First Search (BFS)
    • Depth-First Search (DFS)
    • Dijkstra's Shortest Path
    • A* Shortest Path
    • Yen's K-Shortest Paths
    • Shortest Path (BFS)
    • Delta-Stepping SSSP
    • Shortest Path Faster Algorithm (SPFA)
    • All-Pairs Shortest Path (APSP)
    • Minimum Spanning Tree (MST)
    • K-Spanning Tree
    • Steiner Tree
    • Prize-Collecting Steiner Tree (PCST)
    • Minimum Cost Flow
    • Maximum Flow
    • K-Hop Fast
    • Longest Path (DAG)
    • Random Walk
    • Adamic-Adar Index
    • Common Neighbors
    • Preferential Attachment
    • Resource Allocation
    • Total Neighbors
    • Same Community
    • Louvain
    • Leiden
    • Modularity Optimization
    • Label Propagation
    • HANP
    • SLPA
    • k-Means
    • HDBSCAN
    • K-1 Coloring
    • Modularity
    • Conductance
    • Max k-Cut
      • Node2Vec
      • Struc2Vec
      • LINE
      • Fast Random Projection
      • Summary of Graph Embedding
      • Gradient Descent
      • Backpropagation
      • Skip-gram
      • Skip-gram Optimization
  1. Docs
  2. /
  3. Graph Algorithms
  4. /
  5. Graph Structure

HyperANF

Overview

The HyperANF (Hyper-Approximate Neighborhood Function) algorithm estimates the average graph distance and the number of reachable nodes from each node using HyperLogLog counters. It provides a balance between accuracy and computational efficiency, making it well-suited for large-scale graphs where calculating exact distances is infeasible.

Related material of the algorithm:

  • P. Boldi, M. Rosa, S. Vigna, HyperANF: Approximating the Neighbourhood Function of Very Large Graphs on a Budget (2011)

Concepts

Average Graph Distance

The average graph distance is a metric used to measure the average number of steps or edges required to traverse between any two nodes in a graph. It quantifies the overall connectivity or closeness of the nodes in the graph.

As described above, the average graph distance is typically calculated by performing graph traversals to find the shortest path between every pair of nodes, summing these distances, and dividing by the total number of node pairs to get the average.

Approximate Neighborhood Function (ANF)

Graph traversals can be computationally expensive and memory-intensive, especially for large-scale graphs. In such cases, approximate neighborhood function (ANF) algorithms are commonly used to estimate the average graph distance more efficiently.

ANF algorithms aim to estimate the neighborhood function (NF):

  • The neighborhood function (NF) of a graph, denoted as N(t), returns the number of node pairs such that the two nodes can reach each other with t or fewer steps.
  • The individual neighborhood function (INF) of a node x in a graph, denoted as N(x,t), returns the number of nodes that can be reached from x with t or fewer steps.
  • In an undirected graph G = (V, E), the relationship between NF and INF is:

The NF can help to reveal some features of graphs, including the average graph distance:

The calculation of the above example graph is shown below:

However, it is very expensive to compute the NF exactly on large graphs. By approximating the neighborhood function, ANF algorithms can estimate the average graph distance without traversing the entire graph.

HyperLogLog Counter

HyperLogLog counter is used to count approximately the number of distinct elements (i.e., the cardinality) in a large set or stream of elements. Calculating the exact cardinality often requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. HyperLoglog uses significantly less memory, with the space complexity as O(log(log n)) (this is the reason why these counters are called HyperLogLog).

A HyperLogLog counter can be viewed as an array of m = 2b registers, and each register is initialized to -∞. For example, b = 3, then M[0] = M[1] = ... = M[7] = -∞.

NOTE

The number of registers depends on the desired precision of the estimation. More registers can provide a more accurate estimation, but also require more memory.

First, each element x in the set is mapped into a fixed-size binary sequence by a hash function h(). For example, h(x) = 0100001110101....

Then, update the registers. For each element x in the set:

  • Calculate the index i of the register by the integer value of the leftmost b bits of h(x), i.e., hb(x). In the example, i = hb(x) = 010 = 0*22 + 1*21 + 0*20 = 2.
  • Let hb(x) be the sequence of remaining bits of h(x), and ρ(hb(x)) be the position of the leftmost 1 of hb(x). In the example, ρ(hb(x)) = ρ(0001110101...) = 4.
  • Update register M[i] = max(M[i], ρ(hb(x))). In the example, M[2] = max(-∞, 4) = 4.

After reading all elements, the cardinality is calculated by the HyperLogLog counter as:

It is actually a normalized version of the harmonic mean of the 2M[i], where αm is a constant calculated by m as:

HyperANF

HyperANF is one popular ANF algorithm, it is a breakthrough improvement in terms of speed and scalability.

The algorithm is based on the observation that B(x,t), the set of reachable nodes from node x with distance t or less, satisfies

In the example graph below, node a has 3 adjacent edges (a,b), (a,c) and (a,d), so B(a,3) = B(b,2) ∪ B(c,2) ∪ B(d,2).

Instead of keeping track of B(x,t), the HyperANF algorithm employs HyperLogLog counters to simplify the computation process, as illustrated by the example graph above:

  • Each node x is mapped to a binary representation h(x), and is assigned a HyperLogLog counter Cx(t). Set b = 2, so each counter has m = 2b = 4 registers.
  • Cx(0) is then computed by the value of i and ρ. Note: we use 0 instead of -∞ for the calculation, the result is the same.
  • In the t-th iteration, for each node x, the union of B(y,t-1) ((x,y)∈E) is implemented by combining the counters of all neighbors of node x, that is, maximizing the values of the counter of node x register by register.
  • The values of all counters remain unchanged after 6 iterations because the diameter of the graph is 6.
  • |B(x,t)| is computed in each iteration by the cardinality equation with the constant αm = 0.53243.

Since B(x,0) = {x}, then |N(x,t)| = |B(x,t)| - 1. In this example, the average graph distance computed by the algorithm is 3.2041. The exact average graph distance of this example is 3.

Considerations

  • The HyperANF algorithm is typically best suited for connected graphs. For disconnected graphs, the algorithm may not provide accurate results.
  • The algorithm treats all edges as undirected.
  • Results are approximate due to the HyperLogLog estimation.

Example Graph

GQL
INSERT (A:default {_id: "A"}), (B:default {_id: "B"}),
       (C:default {_id: "C"}), (D:default {_id: "D"}),
       (E:default {_id: "E"}), (F:default {_id: "F"}),
       (G:default {_id: "G"}), (H:default {_id: "H"}),
       (I:default {_id: "I"}), (J:default {_id: "J"}),
       (G)-[:default]->(H), (H)-[:default]->(I),
       (I)-[:default]->(J), (J)-[:default]->(A),
       (A)-[:default]->(B), (A)-[:default]->(E),
       (E)-[:default]->(F), (B)-[:default]->(E),
       (B)-[:default]->(C), (C)-[:default]->(D)

Parameters

NameTypeDefaultDescription
maxIterationsINT10Maximum number of hops (iterations).
precisionINT10HyperLogLog precision b (number of registers m = 2b). Range: 4-16. Higher values give better accuracy but use more memory.

Run Mode

Returns:

ColumnTypeDescription
nodeIdSTRINGNode identifier (_id)
reachableINTEstimated number of reachable nodes at max hop
closenessFLOATApproximate closeness centrality (1/avg_distance)
GQL
CALL algo.hyperanf({
  maxIterations: 5,
  precision: 4
}) YIELD nodeId, reachable, closeness

Result:

nodeIdreachablecloseness
E80.5965532301951378
D80.34341452360360125
G60.27985924668795603
F80.3996822154660621
A80.5965532301951378
C80.47933092250796094
B80.6101912344377625
I80.39154003724083103
H60.3876337890055861
J80.4746971359742313

Stream Mode

Returns the same columns as run mode, streamed for memory efficiency.

GQL
CALL algo.hyperanf.stream({
  maxIterations: 10,
  precision: 10
}) YIELD nodeId, reachable, closeness
RETURN nodeId, reachable, closeness

Result:

nodeIdreachablecloseness
E100.4085657061461551
D100.25686127108769014
G100.230532216041899
F100.2996826682798575
A100.4731716299064865
C100.3329032372979068
B100.4280138237136014
I100.35959571365548715
H100.28998011925615325
J100.42817415133736225

Stats Mode

Returns:

ColumnTypeDescription
nodeCountINTTotal number of nodes
avgGraphDistanceFLOATAverage graph distance across all reachable pairs
avgReachableFLOATAverage estimated reachable nodes
estimatedDiameterINTEstimated diameter (hop where neighborhood function stabilizes)
GQL
CALL algo.hyperanf.stats({
  maxIterations: 10,
  precision: 10
}) YIELD nodeCount, avgGraphDistance, avgReachable, estimatedDiameter

Result:

nodeCountavgGraphDistanceavgReachableestimatedDiameter
103.0197961999815877107