Overview
Struc2Vec graph embedding algorithm was first published in 2017 SIGMOD conference. Struc2vec literally means 'structure to vector'. This algorithm establishes a framework for generating node vectors from graph while preserving the graph structure (i.e., topological similarities). Although Node2Vec results also reflect structural similarity between nodes to some extent, they are still limited to the random walk depth. Nodes with similar structures may be far apart in the network, Struc2Vec ensures that those nodes are close in the embedding space.
In general, Node2Vec is adequate if the downstream tasks (e.g., classification) value the homophily of nodes more. However, if the downstream tasks are to perform on the similarity of the local topology of nodes, Struc2Vec is more applicable.
Related material of the algorithm:
- L. Ribeiro, P. Saverese, D. Figueiredo, struc2vec: Learning Node Representations from Structural Identity (2017)
Basic Concept
Structural Identity
In almost all networks, nodes tend to have different structural identities determined by their functions. Nodes perform similar functions are partitioned into the same class, that is, have structural similarity. For example, in the social network of a company, the roles of all interns are found similar.
Structural similarity means that the neighborhood topology of two nodes is homogeneous or symmetrical. As shown in the graph below, nodes v
and u
are structural similar (degrees 5 and 4, connected to 3 and 2 triangles, connected to the rest of the network by two nodes), but very far apart in the network, they neither have direct connection nor any common neighbor.
Key Ideas of Struc2Vec
When generating node sequences from random walks similar to Node2Vec, if two nodes have similar sequences, their embedding results are also similar. It is easy to think that nodes that are closer together in the graph are more likely to produce similar walk sequences, while nodes that are farther apart (beyond the walk depth) are almost impossible to produce similar sequences, even if they have similar structures.
Struc2Vec overcomes this limitation, and the key ideas of Struc2Vec are:
- Assess structural similarity between nodes independently of node and edge properties as well as their position in the network.
- Establish a hierarchy to measure structural similarity: at the bottom of the hierarchy, structural similarity between nodes depend only on their degrees; while at the top of the hierarchy, similarity depends on the entire network.
- Generate random walk sequences for nodes, nodes with similar sequences have similar structure.
Struc2Vec Framework
1. Measuring Structural Similarity
Intuitively, two nodes that have the same degrees are structurally similar, but if their neighbors also have the same degree, then they are even more structurally similar.
Let Rk(u) denote the set of nodes at exactly distance (hop) k from node u in graph (0 ≤ k ≤ graph diameter), s(S) denote the ordered degree sequence of a node set S, below is an example:
When neighbors at distance k both exist for node u and v, let fk(u,v) denote the structural distance between u and v when considering their k-hop neighborhoods (all nodes at distance less than or equal to k):
where g() ≥ 0 measures the distance between two sequences. For nodes u and v, sequences s(Rk(u)) and s(Rk(v)) can be of different sizes. To assess distance between two sequences of different sizes, Dynamic Time Wrapping (DTW), or any other appliable function can be adopted. Note that if the k-hop neighborhoods of node u and v are isomorphic, then fk-1(u,v) = 0.
2. Constructing Multilayer Weighted Graph
For graph G = (V,E), construct a multilayer weighted graph M as follows:
Each layer k of M is formed by a weighted undirected complete graph, the edge weight between every node pair u and v is inversely proportional to the structural distance between them, as defined below:
Note that edges are defined only if fk(u,v) is defined.
Every node u in layer k is connected to the corresponding node u in layer k+1 and k-1. The directed edge weight between layers are as follows:
where Γk(u) is the number of edges incident to u that have weight larger than the average edge weight of the complete graph in layer k. Γk(u) actually measures the similarity of node u to other nodes in layer k. Note that if node u has many similar nodes in layer k, then it should change to some higher layer to obtain a more refined context.
3. Generating Context For Nodes
Each node performs random walk in the multilayer weighted graph M to generate structural context, that is, the sequence of nodes. Each node starts random walks in its corresponding node in layer 0. When reaches uk:
- The random walk first decides if it will change
layers or walk in the current layer k, the probability of staying in the current layer k is given by the parameter
stay_probability
(i.e.p
in the graph below) - If stays in the current layer, the probability of uk jumping to each neighbor node is proportional to the edge weights between them; that is, the random walk will prefer to step onto nodes that are structurally more similar to the current node.
- With probability
1 − p
, the random walk decides to change layers, and moves to the corresponding node either in layer k+1 or layer k−1 with probability proportional to the edge weights.
4. Training the Language Model
Struc2Vec uses the Skip-gram model used by Word2Vec to learn the embeddings of nodes.
To learn more about Skip-gram Model, please read Skip-gram Model and Skip-gram Model Optimization.
Special Case
Lonely Node, Disconnected Graph
Lonely node does not have any neighbor, thus the structural distance cannot be defined from 1-hop neighborhood. Thus, from layer 1 to each higher layer in the multilayer weighted graph, the edge weight between lonely node and other nodes are extremely small.
Connectivity of the graph does not affect the structural similarity of nodes, nodes that are highly structural similar might locate in different connected component.
Self-loop Edge
The way Struc2Vec algorithm counts the degree of self-loop edge is the same with the Degree algorithm algo(degree)
, each self-loop edge is counted twice.
Directed Edge
For directed edges, Struc2Vec algorithm ignores the direction of edges but calculates them as undirected edges.
Command and Configuration
- Command:
algo(struc2vec)
- Configurations for the parameter
params()
:
Name | Type | Default |
Specification |
Description |
---|---|---|---|---|
ids / uuids | []_id / []_uuid |
/ | / | IDs or UUIDs of nodes to start the walk; all nodes to be selected if not set |
walk_length | int | 1 | >=1 | Depth of each walk, i.e. the number of nodes walking through |
walk_num | int | 1 | >=1 | Number of walks |
k | int | / | [1,10] | Number of layers of the multilayer graph, it should be equal or lower than the diameter of the graph |
stay_probability | float | / | [0,1] | The probability of staying at the current level |
window_size | int | / | >=1 | Size of the sliding window, to sample window_size nodes on left and window_size nodes on right |
dimension | int | / | >=1 | Dimension of node embedding vector |
learning_rate | float | / | (0,1) | Initial learning rate, it decreases until reaches min_learning_rate with increase in training iterations |
min_learning_rate | float | / | (0,learning_rate ) |
Minimum learning rate |
min_frequency | int | / | >=0 | Minimum number of occurrences of a node in the walk sequences to be included in the model, <=1 means to keep all nodes |
sub_sample_alpha | float | / | / | Threshold for sub sampling, <=0 means no sub sampling |
resolution | int | / | >=1 | Such as 10, 100 |
neg_num | int | / | >=0 | Number of negative samples, it is suggested 0~10 |
loop_num | int | / | >=1 | Number of training iterations (epochs) |
limit | int | -1 | >=-1 | Number of results to return; return all results if sets to -1 or not set |
Example: Run Struc2Vec in the whole graph
algo(struc2vec).params({
walk_length: 3,
walk_num: 2,
k: 2,
stay_probability: 0.5,
window_size: 5,
dimension: 5,
learning_rate: 0.025,
min_learning_rate: 0.00025,
min_frequency: -1,
sub_sample_alpha: -1,
resolution: 2,
neg_num: 0,
loop_num: 2,
limit: -1
}) as results
return results
Algorithm Execution
Task Writeback
1. File Writeback
Configuration | Data in Each Row |
---|---|
filename | _id ,embedding_result |
2. Property Writeback
Not supported by this algorithm.
3. Statistics Writeback
This algorithm has no statistics.
Direct Return
Alias Ordinal | Type | Description |
Column Name |
---|---|---|---|
0 | []perNode | Array of UUIDs and embeddings of nodes | _uuid , embedding_result |
Streaming Return
Alias Ordinal | Type | Description |
Column Name |
---|---|---|---|
0 | []perNode | Array of UUIDs and embeddings of nodes | _uuid , embedding_result |
Real-time Statistics
This algorithm has no statistics.