## Overview

Struc2Vec graph embedding algorithm was first published in 2017 SIGMOD conference. Struc2vec literally means 'structure to vector'. This algorithm establishes a framework for generating node vectors from graph while preserving the graph structure (i.e., topological similarities). Although Node2Vec results also reflect structural similarity between nodes to some extent, they are still limited to the random walk depth. Nodes with similar structures may be far apart in the network, Struc2Vec ensures that those nodes are close in the embedding space.

In general, Node2Vec is adequate if the downstream tasks (e.g., classification) value the homophily of nodes more. However, if the downstream tasks are to perform on the similarity of the local topology of nodes, Struc2Vec is more applicable.

Related material of the algorithm:

- L. Ribeiro, P. Saverese, D. Figueiredo, struc2vec: Learning Node Representations from Structural Identity (2017)

## Basic Concept

### Structural Identity

In almost all networks, nodes tend to have different structural identities determined by their functions. Nodes perform similar functions are partitioned into the same class, that is, have structural similarity. For example, in the social network of a company, the roles of all interns are found similar.

Structural similarity means that the neighborhood topology of two nodes is homogeneous or symmetrical. As shown in the graph below, nodes `v`

and `u`

are structural similar (degrees 5 and 4, connected to 3 and 2 triangles, connected to the rest of the network by two nodes), but very far apart in the network, they neither have direct connection nor any common neighbor.

### Key Ideas of Struc2Vec

When generating node sequences from random walks similar to Node2Vec, if two nodes have similar sequences, their embedding results are also similar. It is easy to think that nodes that are closer together in the graph are more likely to produce similar walk sequences, while nodes that are farther apart (beyond the walk depth) are almost impossible to produce similar sequences, even if they have similar structures.

Struc2Vec overcomes this limitation, and the key ideas of Struc2Vec are:

- Assess structural similarity between nodes independently of node and edge properties as well as their position in the network.
- Establish a hierarchy to measure structural similarity: at the bottom of the hierarchy, structural similarity between nodes depend only on their degrees; while at the top of the hierarchy, similarity depends on the entire network.
- Generate random walk sequences for nodes, nodes with similar sequences have similar structure.

### Struc2Vec Framework

#### 1. Measuring Structural Similarity

Intuitively, two nodes that have the same degrees are structurally similar, but if their neighbors also have the same degree, then they are even more structurally similar.

Let *R _{k}(u)* denote the set of nodes at exactly distance (hop)

*k*from node

*u*in graph (0 ≤

*k*≤ graph diameter),

*s(S)*denote the ordered degree sequence of a node set

*S*, below is an example:

When neighbors at distance *k* both exist for node *u* and *v*, let *f _{k}(u,v)* denote the

*structural distance*between

*u*and

*v*when considering their

*k*-hop neighborhoods (all nodes at distance less than or equal to

*k*):

where *g() ≥ 0* measures the distance between two sequences. For nodes *u* and *v*, sequences *s(R _{k}(u))* and

*s(R*can be of different sizes. To assess distance between two sequences of different sizes, Dynamic Time Wrapping (DTW), or any other appliable function can be adopted. Note that if the

_{k}(v))*k*-hop neighborhoods of node

*u*and

*v*are isomorphic, then

*f*.

_{k-1}(u,v) = 0#### 2. Constructing Multilayer Weighted Graph

For graph *G = (V,E)*, construct a multilayer weighted graph *M* as follows:

Each layer *k* of *M* is formed by a weighted undirected complete graph, the edge weight between every node pair *u* and *v* is inversely proportional to the structural distance between them, as defined below:

Note that edges are defined only if *f _{k}(u,v)* is defined.

Every node *u* in layer *k* is connected to the corresponding node *u* in layer *k+1* and *k-1*. The directed edge weight between layers are as follows:

where *Γ _{k}(u)* is the number of edges incident to

*u*that have weight larger than the average edge weight of the complete graph in layer

*k*.

*Γ*actually measures the similarity of node

_{k}(u)*u*to other nodes in layer k. Note that if node

*u*has many similar nodes in layer

*k*, then it should change to some higher layer to obtain a more refined context.

#### 3. Generating Context For Nodes

Each node performs random walk in the multilayer weighted graph *M* to generate structural context, that is, the sequence of nodes. Each node starts random walks in its corresponding node in layer 0. When reaches *u _{k}*:

- The random walk first decides if it will change
layers or walk in the current layer
*k*, the probability of staying in the current layer*k*is given by the parameter`stay_probability`

(i.e.`p`

in the graph below) - If stays in the current layer, the probability of
*u*jumping to each neighbor node is proportional to the edge weights between them; that is, the random walk will prefer to step onto nodes that are structurally more similar to the current node._{k} - With probability
`1 − p`

, the random walk decides to change layers, and moves to the corresponding node either in layer*k+1*or layer*k−1*with probability proportional to the edge weights.

#### 4. Training the Language Model

Struc2Vec uses the Skip-gram model used by Word2Vec to learn the embeddings of nodes.

To learn more about Skip-gram Model, please read Skip-gram Model and Skip-gram Model Optimization.

## Special Case

### Lonely Node, Disconnected Graph

Lonely node does not have any neighbor, thus the structural distance cannot be defined from 1-hop neighborhood. Thus, from layer 1 to each higher layer in the multilayer weighted graph, the edge weight between lonely node and other nodes are extremely small.

Connectivity of the graph does not affect the structural similarity of nodes, nodes that are highly structural similar might locate in different connected component.

### Self-loop Edge

The way Struc2Vec algorithm counts the degree of self-loop edge is the same with the Degree algorithm `algo(degree)`

, each self-loop edge is counted twice.

### Directed Edge

For directed edges, Struc2Vec algorithm ignores the direction of edges but calculates them as undirected edges.

## Command and Configuration

- Command：
`algo(struc2vec)`

- Configurations for the parameter
`params()`

:

Name | Type | Default |
Specification |
Description |
---|---|---|---|---|

ids / uuids | []`_id` / []`_uuid` |
/ | / | IDs or UUIDs of nodes to start the walk; all nodes to be selected if not set |

walk_length | int | 1 | >=1 | Depth of each walk, i.e. the number of nodes walking through |

walk_num | int | 1 | >=1 | Number of walks |

k | int | / | [1,10] | Number of layers of the multilayer graph, it should be equal or lower than the diameter of the graph |

stay_probability | float | / | [0,1] | The probability of staying at the current level |

window_size | int | / | >=1 | Size of the sliding window, to sample `window_size` nodes on left and `window_size` nodes on right |

dimension | int | / | >=1 | Dimension of node embedding vector |

learning_rate | float | / | (0,1) | Initial learning rate, it decreases until reaches `min_learning_rate` with increase in training iterations |

min_learning_rate | float | / | (0,`learning_rate` ) |
Minimum learning rate |

min_frequency | int | / | >=0 | Minimum number of occurrences of a node in the walk sequences to be included in the model, <=1 means to keep all nodes |

sub_sample_alpha | float | / | / | Threshold for sub sampling, <=0 means no sub sampling |

resolution | int | / | >=1 | Such as 10, 100 |

neg_num | int | / | >=0 | Number of negative samples, it is suggested 0~10 |

loop_num | int | / | >=1 | Number of training iterations (epochs) |

limit | int | -1 | >=-1 | Number of results to return; return all results if sets to -1 or not set |

Example: Run Struc2Vec in the whole graph

```
algo(struc2vec).params({
walk_length: 3,
walk_num: 2,
k: 2,
stay_probability: 0.5,
window_size: 5,
dimension: 5,
learning_rate: 0.025,
min_learning_rate: 0.00025,
min_frequency: -1,
sub_sample_alpha: -1,
resolution: 2,
neg_num: 0,
loop_num: 2,
limit: -1
}) as results
return results
```

## Algorithm Execution

### Task Writeback

#### 1. File Writeback

Configuration | Data in Each Row |
---|---|

filename | `_id` ,`embedding_result` |

#### 2. Property Writeback

Not supported by this algorithm.

#### 3. Statistics Writeback

This algorithm has no statistics.

### Direct Return

Alias Ordinal | Type | Description |
Column Name |
---|---|---|---|

0 | []perNode | Array of UUIDs and embeddings of nodes | `_uuid` , `embedding_result` |

### Streaming Return

Alias Ordinal | Type | Description |
Column Name |
---|---|---|---|

0 | []perNode | Array of UUIDs and embeddings of nodes | `_uuid` , `embedding_result` |

### Real-time Statistics

This algorithm has no statistics.