UltipaDocs
Products
Solutions
Resources
Company
Start Free Trial
UltipaDocs
Start Free Trial
  • Introduction
  • Running Algorithms
    • Degree Centrality
    • Closeness Centrality
    • Harmonic Centrality
    • Eccentricity Centrality
    • Betweenness Centrality
    • Bridges
    • Articulation Points
    • Eigenvector Centrality
    • Katz Centrality
    • CELF
    • PageRank
    • ArticleRank
    • TextRank
    • HITS
    • SybilRank
    • Jaccard Similarity
    • Overlap Similarity
    • Cosine Similarity
    • Pearson Correlation Coefficient
    • Euclidean Distance
    • KNN
    • Vector Similarity
    • Bipartite Graph
    • HyperANF
    • Weakly Connected Components (WCC)
    • Strongly Connected Components (SCC)
    • k-Edge Connected Components
    • Local Clustering Coefficient
    • Triangle Count
    • Clique Count
    • k-Core
    • k-Truss
    • p-Cohesion
    • Induced Subgraph
    • Topological Sort
    • Breadth-First Search (BFS)
    • Depth-First Search (DFS)
    • Dijkstra's Shortest Path
    • A* Shortest Path
    • Yen's K-Shortest Paths
    • Shortest Path (BFS)
    • Delta-Stepping SSSP
    • Shortest Path Faster Algorithm (SPFA)
    • All-Pairs Shortest Path (APSP)
    • Minimum Spanning Tree (MST)
    • K-Spanning Tree
    • Steiner Tree
    • Prize-Collecting Steiner Tree (PCST)
    • Minimum Cost Flow
    • Maximum Flow
    • K-Hop Fast
    • Longest Path (DAG)
    • Random Walk
    • Adamic-Adar Index
    • Common Neighbors
    • Preferential Attachment
    • Resource Allocation
    • Total Neighbors
    • Same Community
    • Louvain
    • Leiden
    • Modularity Optimization
    • Label Propagation
    • HANP
    • SLPA
    • k-Means
    • HDBSCAN
    • K-1 Coloring
    • Modularity
    • Conductance
    • Max k-Cut
      • Node2Vec
      • Struc2Vec
      • LINE
      • Fast Random Projection
      • Summary of Graph Embedding
      • Gradient Descent
      • Backpropagation
      • Skip-gram
      • Skip-gram Optimization
  1. Docs
  2. /
  3. Graph Algorithms
  4. /
  5. Centrality

CELF

Overview

The CELF (Cost Effective Lazy Forward) algorithm selects seed nodes in a network to act as propagation sources and maximize the number of influenced nodes. This is known as Influence Maximization (IM), where 'influence' represents anything that can be spread across the network, such as contamination, information, disease, etc.

CELF was proposed by Jure Leskovec et al. in 2007, it improves the traditional Greedy algorithm based on the IC model by taking advantage of the submodularity. It only calculates the spread score for all nodes only at the initial stage and does not recalculate for all nodes afterwards, hence cost-effective.

Related materials of the algorithm:

  • J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, N. Glance, Cost-effective Outbreak Detection in Networks (2007)
  • D. Kempe, J. Kleinberg, E. Tardos, Maximizing the Spread of Influence through a Social Network (2003)

A typical application of the algorithm is to prevent epidemic outbreak by selecting a small group of people to monitor, so that any disease can be detected in an early stage.

Concepts

Spread Function - Independent Cascade

This algorithm uses the Independent Cascade (IC) model to simulate influence propagation in the network. IC is a probabilistic model, it starts with a set of active seed nodes, and in step k:

  • For each node that becomes active in step k-1, it has a single chance to activate each inactive outgoing neighbor with a success probability.
  • The process runs until no more activations are possible.

The spread of a given seed set is measured by the number of active nodes in the graph at the end of the process. This process is repeated many times (using Monte Carlo simulations), and the average spread is calculated.

Submodularity

The spread function IC() is called submodular as the marginal gain of a single node v is diminishing as the seed set S grows:

where the seed set |Sk+1| > |Sk|, S ∪ {v} means to add node v into the seed set.

Submodularity of the spread function is the key property exploited by CELF. CELF significantly improves the traditional Greedy algorithm that is used to solve the influence maximization problem, it runs a lot faster while achieving near optimal results.

Lazy Forward

At initialization, CELF, like the Greedy algorithm, computes the spread for each node and stores them in a list sorted by descending spread. As the seed set is empty now, the spread for each node can be viewed as its initial marginal gain.

In the first iteration, the top node is moved from the list to the seed set.

In the next iteration, only the marginal gain of the current top-ranked node is recalculated. After sorting, if that node remains at top, move it to the seed set; if not, repeat the process for the new top node.

Unlike Greedy, CELF avoids calculating marginal gain for all the rest nodes in each iteration, this is where the submodularity of the spread function is considered - the marginal gain of every node in this round is always lower than the previous round. So if the top node remains at top, we can put it into the seed set directly without calculating for other nodes.

The algorithm terminates when the seed set reaches the specified size.

Example Graph

GQL
INSERT (A:account {_id: "A"}), (B:account {_id: "B"}),
       (C:account {_id: "C"}), (D:account {_id: "D"}),
       (E:account {_id: "E"}), (F:account {_id: "F"}),
       (G:account {_id: "G"}), (H:account {_id: "H"}),
       (I:account {_id: "I"}), (J:account {_id: "J"}),
       (A)-[:follow]->(B), (A)-[:follow]->(G),
       (B)-[:follow]->(F), (C)-[:follow]->(B),
       (C)-[:follow]->(J), (D)-[:follow]->(J),
       (E)-[:follow]->(A), (F)-[:follow]->(C),
       (F)-[:follow]->(G), (G)-[:follow]->(H),
       (H)-[:follow]->(C), (H)-[:follow]->(E),
       (H)-[:follow]->(J), (I)-[:follow]->(B)

Parameters

NameTypeDefaultDescription
seedSetSizeINT10Number of seed nodes to select.
monteCarloRunsINT100Number of Monte Carlo simulations.
probabilityFLOAT0.1Edge activation probability.
limitINT-1Limits the number of results returned (-1 = all).
orderSTRING/Sorts the results by spread: asc or desc.

Run Mode

Returns:

ColumnTypeDescription
nodeIdSTRINGNode identifier (_id)
spreadFLOATExpected influence spread
rankINTSeed selection order (1 = first selected)

Select top 3 seed nodes:

GQL
CALL algo.celf({
  seedSetSize: 3,
  monteCarloRuns: 1000,
  probability: 0.5
}) YIELD nodeId, spread, rank

Result:

nodeIdspreadrank
H3.6331
I5.2792
A6.5893
NOTE

CELF uses Monte Carlo simulations, so results may vary slightly between runs.

Stream Mode

Returns the same columns as run mode, streamed for memory efficiency.

GQL
CALL algo.celf.stream({
  seedSetSize: 2,
  monteCarloRuns: 1000,
  probability: 0.6
}) YIELD nodeId, spread
RETURN nodeId, spread

Result:

nodeIdspread
H4.45
I6.206

Stats Mode

Returns:

ColumnTypeDescription
nodeCountINTTotal number of nodes
minSpreadFLOATMinimum influence spread
maxSpreadFLOATMaximum influence spread
avgSpreadFLOATAverage influence spread
GQL
CALL algo.celf.stats({
  seedSetSize: 3,
  probability: 0.5
}) YIELD nodeCount, minSpread, maxSpread, avgSpread

Result:

nodeCountminSpreadmaxSpreadavgSpread
103.436.55.033333333333333

Write Mode

Computes results and writes them back to node properties. The write configuration is passed as a second argument map.

Write parameters:

NameTypeDescription
db.propertySTRING or MAPNode property to write results to. String: writes the spread column in results to a property. Map: explicit column-to-property mapping (e.g., {spread: 'celf_spread', rank: 'celf_rank'}).

Writable columns:

ColumnTypeDescription
spreadFLOATExpected influence spread
rankINTSeed selection order

Returns:

ColumnTypeDescription
task_idSTRINGTask identifier for tracking via SHOW TASKS
nodesWrittenINTNumber of nodes with properties written
computeTimeMsINTTime spent computing the algorithm (milliseconds)
writeTimeMsINTTime spent writing properties to storage (milliseconds)
GQL
CALL algo.celf.write({seedSetSize: 3, probability: 0.5}, {
  db: {
    property: "celf_spread"
  }
}) YIELD task_id, nodesWritten, computeTimeMs, writeTimeMs