UltipaDocs
Try Playground
  • Introduction
    • Show Algorithms
    • Install and Uninstall
    • Run Algorithms
    • Algorithm Results and Statistics
    • Degree Centrality
    • Closeness Centrality
    • Harmonic Centrality
    • Graph Centrality
    • Betweenness Centrality
    • Eigenvector Centrality
    • CELF
    • PageRank
    • ArticleRank
    • HITS
    • SybilRank
    • Jaccard Similarity
    • Overlap Similarity
    • Cosine Similarity
    • Pearson Correlation Coefficient
    • Euclidean Distance
    • K-Hop All
    • Bipartite Graph
    • HyperANF
    • Connected Component
    • Triangle Counting
    • Induced Subgraph
    • k-Core
    • k-Truss
    • p-Cohesion
    • k-Edge Connected Components
    • Local Clustering Coefficient
    • Topological Sort
    • Schema Overview
    • Dijkstra's Single-Source Shortest Path
    • Delta-Stepping Single-Source Shortest Path
    • Shortest Path Faster Algorithm (SPFA)
    • Minimum Spanning Tree
    • Breadth-First Search (BFS)
    • Depth-First Search (DFS)
    • Adamic-Adar Index
    • Common Neighbors
    • Preferential Attachment
    • Resource Allocation
    • Total Neighbors
    • Louvain
    • Leiden
    • Label Propagation
    • HANP
    • k-Means
    • kNN (k-Nearest Neighbors)
    • K-1 Coloring
    • Conductance
      • Random Walk
      • Node2Vec Walk
      • Node2Vec
      • Struc2Vec Walk
      • Struc2Vec
      • GraphSAGE
      • GraphSAGE Train
      • LINE
      • Fast Random Projection
      • Summary of Graph Embedding
      • Gradient Descent
      • Backpropagation
      • Skip-gram
      • Skip-gram Optimization
  1. Docs
  2. /
  3. Graph Analytics & Algorithms
  4. /
  5. Background Knowledge

Skip-gram

The Skip-gram (SG) model is a prominent approach designed for creating word embeddings in the field of natural language processing (NLP). It has also been used in graph embedding algorithms such as Node2Vec and Struc2Vec to generate node embeddings.

Background

The origin of the Skip-gram model can be traced back to the Word2Vec algorithm. Word2Vec maps words to a vector space, where semantically similar words are represented by vectors that are close in proximity. Google's T. Mikolov et al. introduced Word2Vec in 2013.

  • T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space (2013)
  • X. Rong, word2vec Parameter Learning Explained (2016)

In the realm of graph embedding, the inception of DeepWalk in 2014 marked the application of the Skip-gram model to generate vector representations of nodes within graphs. The core concept is treating nodes as "words", the node sequences generated by random walks as "sentences" and "corpus".

  • B. Perozzi, R. Al-Rfou, S. Skiena, DeepWalk: Online Learning of Social Representations (2014)

Subsequent graph embedding methods such as Node2Vec and Struc2Vec have enhanced the DeepWalk approach while still leveraging the Skip-gram framework.

We will illustrate the Skip-gram model using the context of its original application in natural language processing.

Model Overview

The fundamental principle underlying Skip-gram is to predict a set of context words for a given target word. As depicted in the diagram below, the input word, denoted as w(t), is fed into the model; the model then generates four context words related to w(t): w(t-2), w(t-1), w(t+1), and w(t+2). Here, the symbols +/- signify the preceding and succeeding context in relation to the target word. The number of context words produced can be adjusted as needed.

However, it's important to recognize that the ultimate objective of the Skip-gram model is not prediction. Instead, its real task to derive the weight matrix found within the mapping relationship (indicated as PROJECTION in the diagram), which effectively represents the vectorized representations of words.

Corpus

A corpus is a collection of sentences or texts that a model utilizes to learn the semantic relationships between words.

Consider a vocabulary containing 10 distinct words drawn from a corpus: graph, is, a, good, way, to, visualize, data, very, at.

These words can be constructed into sentences such as:

Graph is a good way to visualize data.

Sliding Window Sampling

The Skip-gram model employs the sliding window sampling technique to generate training samples. This method uses a "window" that moves sequentially through each word in the sentence. The target word is combined with every context word before and after it, within a certain range window_size.

Below is an illustration of the sampling process when window_size=1.

It's important to note that when window_size>1, all context words falling within the specified window are treated equally, regardless of their distance from the target word.

One-hot Encoding

Since words are not directly interpretable by models, they must be converted into machine-understandable representations.

One common method for encoding words is the one-hot encoding. In this approach, each word is represented as a unique binary vector where only one element is "hot" (1) while all others are "cold" (0). The position of the 1 in the vector corresponds to the index of the word in the vocabulary.

Here is how one-hot encoding is applied to our vocabulary:

WordOne-hot Encoded Vector
graph1000000000
is0100000000
a0010000000
good0001000000
way0000100000
to0000010000
visualize0000001000
data0000000100
very0000000010

Skip-gram Architecture

The architecture of the Skip-gram model is depicted as above, where:

  • The input vector xV×1 is the one-hot encoding of the target word, and V is the number of words in the vocabulary.
  • WV×N is the input→hidden weight matrix, and N is the dimension of word embeddings.
  • hN×1 is the hidden layer vector.
  • W′N×V is the hidden→output weight matrix. W′ and W are different, W′ is not the transpose of W.
  • uV×1 is the vector before applying the activation function Softmax.
  • The output vectors yc (c=1, 2, ..., C) are also referred to as panels, C panels correspond to C context words of the target word.

Softmax: Softmax function as an activation function, serving to normalize a numerical vector into a probability distribution vector. In this transformed vector, the summation of all probabilities equals 1. The formula for Softmax is as follows:

Forward Propagation

In our example, V=10, and set N=2. Let's begin by randomly initializing the weight matrices W and W′ as shown below. We will then use the sample (is, graph), (is, a) for demonstration.

Input Layer → Hidden Layer

Get the hidden layer vector h by:

Given x is a one-hot encoded vector with only xk=1, h corresponds to the k-th row of matrix W. This operation is essentially a simple lookup process:

where vwI is the input vector of the target word.

In fact, each row of matrix W, denoted as vw, is viewed as the final embedding of each word in the vocabulary.

Hidden Layer → Output Layer

Get the vector u by:

The j-th component of vector u equals to the dot product of vector h and the transpose of the j-th column vector of matrix W′:

where vwj′ is the output vector of the j-th word in the vocabulary.

NOTE

In the design of the Skip-gram model, each word within the vocabulary has two distinct representations: the input vector vw and the output vector vw′. The input vector serves as the representation when the word is used as the target, while the output vector represents the word when it acts as the context.

During computation, uj is essentially the dot product of the input vector of the target word vwI and the output vector of the j-th word vwj′. The Skip-gram model is also designed with the principle that greater similarity between two vectors yields a larger dot product of those vectors.

Besides, it is important to highlight that only the input vectors are ultimately utilized as the word embeddings. This separation of input and output vectors simplifies the computation process, enhancing both efficiency and accuracy in the model's training and inference.

Get each output panel yc by:

where yc,j is the j-th component of yc, representing the probability of the j-th word within the vocabulary being predicted while considering the given target word. Apparently, the sum of all probabilities is 1.

The C words with the highest probabilities are considered as the predicted context words. In our example, C=2 and the predicted context words are good and visualize.

Backpropagation

To adjust the weights in W and W′, SGD is used to backpropagate the errors.

Loss Function

We want to maximize the probabilities of the C context words, i.e., to maximize the product of these probabilities:

where jc* is the index of the expected c-th output context word.

Since minimization is often considered more straightforward and practical than maximization, we perform some transformations to the above goal:

Therefore, the loss function E of Skip-gram is expressed as:

Take the partial derivative of E with respect to uj:

To simply the notaion going forward, define the following:

where tc is the one-hot encoding vector of the c-th expected output context word. In our example, t1 and t2 are the one-hot encoded vectors of words graph and a, thus e1 and e2 are calculated as:

Therefore, ∂E∂uj can be written as:

In our example, it is calculated as:

Output Layer → Hidden Layer

Adjustments are made to all weights in matrix W′, which means that all the output vectors of words are updated.

Calculate the partial derivative of E with respect to wij′:

Adjust wij′ according to the learning rate η:

Set η=0.4. For instance, w14′=0.86 and w24′=0.67 are updated to:

w14′ = w14′ - η ⋅ ( e1,4 + e2,4 ) ⋅ h1 = 0.86 - 0.4 × 0.314 × 0.65 = 0.78
w24′ = w24′ - η ⋅ ( e1,4 + e2,4 ) ⋅ h2 = 0.67 - 0.4 × 0.314 × 0.87 = 0.56

Hidden Layer → Input Layer

Adjustments are made only to the weights in matrix W that correspond to the input vector of the target word.

The vector h is obtained by only looking up the k-th row of matrix W (given that xk=1):

Calculate the partial derivative of E with respect to wki:

Adjust wki according to the learning rate η:

In our example, k=2, hence, w21=0.65 and w22=0.87 are updated:

∂E ∂w21 = ( e1,1 + e2,1 ) ⋅ w11′ + ( e1,2 + e2,2 ) ⋅ w12′ + ... + ( e1,10 + e2,10 ) ⋅ w1,10′ = 0.283
w21 = w21 - η ⋅ ∂E ∂w21 = 0.65 - 0.4 × 0.283 = 0.54

∂E ∂w22 = ( e1,1 + e2,1 ) ⋅ w21′ + ( e1,2 + e2,2 ) ⋅ w22′ + ... + ( e1,10 + e2,10 ) ⋅ w2,10′ = 0.081
w22 = w22 - η ⋅ ∂E ∂w22 = 0.87 - 0.4 × 0.081 = 0.84

Optimization

We have explored the fundamentals of the Skip-gram model. Nevertheless, incorporating optimizations is imperative to ensure the model's computational complexity remains viable and pragmatic for real-world applications. Click here to continue reading.