
Skip-gram Optimization

Overview

The basic Skip-gram model is nearly impractical for real-world use due to its high computational demands.

The sizes of the matrices W and W′ depend on the vocabulary size (e.g., V=10000) and the embedding dimension (e.g., N=300). As a result, each matrix can contain millions of weights (e.g., V⋅N=3 million), making the Skip-gram neural network substantially large. Training such a model effectively requires a massive number of samples to tune all the weights.

Additionally, during each backpropagation step, updates are applied to all output vectors (vw′) in matrix W′, even though most of these vectors are unrelated to the current target or context words. Given the large size of W′, this makes gradient descent highly inefficient and computationally slow.

Another significant computational cost comes from the Softmax function, whose normalization denominator requires summing over all words in the vocabulary.

T. Mikolov and colleagues introduced optimization techniques for the Skip-gram model, including subsampling and negative sampling. These approaches help accelerate training and improve the quality of the resulting embedding vectors.

  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed Representations of Words and Phrases and their Compositionality (2013)
  • X. Rong, word2vec Parameter Learning Explained (2016)

Subsampling

Common words in the corpus, such as "the", "and", and "is", raise certain concerns:

  • They have limited semantic value. For example, the model benefits more from the co-occurrence of "France" and "Paris" than from the frequent pairing of "France" and "the".
  • These words show up in more training samples than needed, making it inefficient to train their vectors.

Subsampling is used to address this issue by randomly discarding words during training. Frequent words are more likely to be discarded, while rare words are kept more often.

First, calculate the probability of keeping a word by:

$$P(w_i) = \left( \sqrt{\frac{f(w_i)}{\alpha}} + 1 \right) \cdot \frac{\alpha}{f(w_i)}$$

where f(wi) is the frequency of the i-th word, and α is a factor that influences the distribution; it defaults to 0.001.

Then, a random number between 0 and 1 is generated. If P(wi) is smaller than this number, the word is discarded.

For instance, when α=0.001, P(wi)≥1 for any f(wi)≤0.0026, so words with a frequency of 0.0026 or less are always kept. For a high word frequency such as f(wi)=0.03, P(wi)=0.22.

When α=0.002, words with a frequency of 0.0052 or less are always kept, while the same high frequency f(wi)=0.03 gives P(wi)=0.32.

Thus, a higher value of α makes frequent words more likely to be kept; conversely, a lower α strengthens the down-sampling of frequent words.

For example, if the word 'a' is discarded from the sentence 'Graph is a good way to visualize data,' then no training samples will include 'a' as either the target or a context word.
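
These keep probabilities are easy to check numerically. Below is a minimal Python sketch; keep_probability and subsample are illustrative helpers written for this page, assuming a frequency table freqs computed over the corpus:

```python
import math
import random

# Minimal sketch of subsampling. keep_probability implements
# P(w_i) = (sqrt(f(w_i)/alpha) + 1) * alpha / f(w_i);
# a value >= 1 means the word is always kept.
def keep_probability(f: float, alpha: float = 0.001) -> float:
    return (math.sqrt(f / alpha) + 1) * alpha / f

print(round(keep_probability(0.0026), 2))             # 1.0  -> always kept
print(round(keep_probability(0.03), 2))               # 0.22
print(round(keep_probability(0.03, alpha=0.002), 2))  # 0.32

def subsample(tokens, freqs, alpha=0.001):
    # Keep a word when a uniform draw in [0, 1) is below its keep probability.
    return [t for t in tokens if random.random() < keep_probability(freqs[t], alpha)]
```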

Negative Sampling

In the negative sampling approach, when a positive context word is sampled for a target word, a total of k words are simultaneously chosen as negative samples.

For instance, consider the simple corpus introduced with the basic Skip-gram model, whose vocabulary contains 10 words: graph, is, a, good, way, to, visualize, data, very, at. When the positive sample (target, context) = (is, a) is generated by the sliding window, we select k=3 negative words, graph, data, and at, to accompany it:

Target Word   Sample Type        Context Word   Expected Output
is            Positive sample    a              1
              Negative samples   graph          0
                                 data           0
                                 at             0

With negative sampling, the training objective of the model shifts from predicting context words for the target word to a binary classification task: the output for the positive word is expected to be 1, the outputs for the negative words are expected to be 0, and all other words fall into neither category and are disregarded.

Consequently, during the backpropagation process, the model only updates the output vectors vw′ associated with the positive and negative words to improve the model's classification performance.

Consider the scenario where V=10000 and N=300. When applying negative sampling with the parameter k=9, only 300×10=3000 individual weights in W′ will require updates, which is 0.1% of the 3 million weights to be updated without negative sampling!
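
A short sketch, using the toy words above and the figures V=10000, N=300, k=9 from the text, of how a negative-sampled example is structured and how few weights it touches:

```python
# Each sampled word becomes a binary classification case:
# expected output 1 for the positive word, 0 for each negative word.
target, positive = "is", "a"
negatives = ["graph", "data", "at"]               # k = 3 negative words
samples = [(target, positive, 1)] + [(target, w, 0) for w in negatives]

# With V = 10000 and N = 300, only the output vectors of the
# k + 1 sampled words in W' are updated.
V, N, k = 10000, 300, 9
updated = (k + 1) * N                             # 3000 weights
print(updated, V * N, updated / (V * N))          # 3000 3000000 0.001
```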

NOTE

Our experiments indicate that values of k in the range 5~20 are useful for small training datasets, while for large datasets the k can be as small as 2~5. (Mikolov et al.)

To select negative samples, a probability distribution Pn is required. The fundamental principle is to prioritize frequent words in the corpus. However, using raw frequency can result in an overrepresentation of very common words, while underrepresenting less frequent ones. To address this, an empirical distribution is often used that raises the word frequency to the power of 3/4:

$$P_n(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=1}^{V} f(w_j)^{3/4}}$$

where f(wi) is the frequency of the i-th word. The subscript n of P stands for noise; the distribution Pn is also called the noise distribution.

In extreme cases, consider a corpus containing only two words, with frequencies of 0.9 and 0.1 respectively. Applying the above formula results in adjusted probabilities of 0.84 and 0.16. This adjustment helps reduce the bias caused by large frequency disparities.
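
The following sketch reproduces these numbers; noise_distribution is an illustrative helper written for this page, not a library function:

```python
# Raise each frequency to the 3/4 power and renormalize.
def noise_distribution(freqs: dict, power: float = 0.75) -> dict:
    weights = {w: f ** power for w, f in freqs.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

print(noise_distribution({"w1": 0.9, "w2": 0.1}))
# {'w1': 0.838..., 'w2': 0.161...}  -- the 0.84 / 0.16 split above
```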

However, when working with large corpora, negative sampling can still be computationally intensive. To address this, a resolution is introduced to rescale the noise distribution. A higher resolution value enables the resampled distribution to better approximate the original noise distribution, striking a balance between efficiency and accuracy.
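
One plausible way to implement such rescaling, following the unigram table used in the original word2vec code, is sketched below. Whether Ultipa's resolution parameter works exactly this way is an assumption here, and build_table is a hypothetical helper:

```python
import random

# Hypothetical sketch: pre-fill a table of `resolution` slots so that each
# word occupies a share proportional to Pn(w); drawing a negative sample is
# then a uniform table lookup. A larger resolution approximates Pn more
# closely. (The original word2vec code uses the same idea with ~10^8 slots.)
def build_table(pn: dict, resolution: int = 10000) -> list:
    table = []
    for word, p in pn.items():
        table.extend([word] * round(p * resolution))
    return table

table = build_table({"w1": 0.84, "w2": 0.16}, resolution=100)
negative = random.choice(table)   # "w1" about 84% of the time
```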

Optimized Model Training

Forward Propagation

We will demonstrate with the target word is, the positive word a, and the negative words graph, data, and at.

With negative sampling, the Skip-gram model uses the following variation of the Softmax function, which is in fact the Sigmoid function (σ) applied to uj; it maps each component of u into the range (0, 1):

$$y_j = \sigma(u_j) = \frac{1}{1 + e^{-u_j}}$$
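
A minimal sketch of this forward pass with toy, randomly initialized vectors (N=2 as in the toy model; the vectors are illustrative, not trained values):

```python
import numpy as np

# h is the hidden-layer vector of the target word "is" (its row in W);
# each sampled word contributes one score u_j = v'_j . h, squashed by
# the sigmoid into (0, 1).
rng = np.random.default_rng(0)
N = 2
h = rng.normal(size=N)                                   # input vector of "is"
v_out = {w: rng.normal(size=N) for w in ["a", "graph", "data", "at"]}

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

y = {w: sigmoid(v.dot(h)) for w, v in v_out.items()}     # each output in (0, 1)
```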

Backpropagation

As explained, the output for the positive word, denoted as y0, is expected to be 1, while the k outputs corresponding to the negative words, denoted as yi (i = 1, …, k), are expected to be 0. The objective of training is therefore to maximize both y0 and every 1−yi, which can equivalently be interpreted as maximizing their product:

$$y_0 \prod_{i=1}^{k} (1 - y_i)$$

The loss function E is then obtained by recasting the above as a minimization problem:

$$E = -\log y_0 - \sum_{i=1}^{k} \log (1 - y_i)$$

Take the partial derivative of E with respect to u0 and ui:

$$\frac{\partial E}{\partial u_0} = y_0 - 1, \qquad \frac{\partial E}{\partial u_i} = y_i$$

∂E/∂u0 and ∂E/∂ui hold a similar meaning to ∂E/∂uj in the original Skip-gram model: each is the output minus the expected output, i.e., y0−1 for the positive word (expected 1) and yi−0 for each negative word (expected 0).

The process of updating the weights in matrices W′ and W is straightforward; you may refer to the original form of Skip-gram. However, only weights w11′, w21′, w13′, w23′, w18′, w28′, w1,10′ and w2,10′ in W′ and weights w21 and w22 in W are updated.
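
A minimal sketch of one such update step, assuming toy random vectors and a hypothetical learning rate eta; only the sampled words' output vectors and the target word's input vector are touched:

```python
import numpy as np

# Toy backpropagation step under negative sampling.
rng = np.random.default_rng(0)
N, eta = 2, 0.025
h = rng.normal(size=N)                           # input vector of target "is"
v_out = {w: rng.normal(size=N) for w in ["a", "graph", "data", "at"]}
expected = {"a": 1.0, "graph": 0.0, "data": 0.0, "at": 0.0}

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

grad_h = np.zeros(N)
for w, v in v_out.items():
    e = sigmoid(v.dot(h)) - expected[w]          # dE/du_j = y_j - expected
    grad_h += e * v                              # accumulate with the old v'_j
    v_out[w] = v - eta * e * h                   # update output vector v'_j
h -= eta * grad_h                                # update the target's input vector
```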