
Skip-gram Optimization

Overview

The basic Skip-gram model is nearly impractical for real-world use due to its high computational demands.

The sizes of the matrices W and W′ depend on the vocabulary size (e.g., V=10000) and the embedding dimension (e.g., N=300). As a result, each matrix can contain millions of weights (e.g., V⋅N=3 million), making the Skip-gram neural network substantially large. Training such a model effectively requires a massive number of samples to tune all the weights.

Additionally, during each backpropagation step, updates are applied to all output vectors (vw′) in matrix W′, even though most of these vectors are unrelated to the current target or context words. Given the large size of W′, this makes gradient descent highly inefficient and computationally slow.

Another significant computational cost comes from the Softmax function, whose normalization denominator requires summing over all words in the vocabulary.

T. Mikolov and colleagues introduced optimization techniques for the Skip-gram model, including subsampling and negative sampling. These approaches help accelerate training and improve the quality of the resulting embedding vectors.

  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed Representations of Words and Phrases and their Compositionality (2013)
  • X. Rong, word2vec Parameter Learning Explained (2016)

Subsampling

Common words in the corpus, such as "the", "and", and "is", raise certain concerns:

  • They have limited semantic value. For example, the model benefits more from the co-occurrence of "France" and "Paris" than from the frequent pairing of "France" and "the".
  • These words show up in more training samples than needed, making it inefficient to train their vectors.

Subsampling is used to address this issue by randomly discarding words during training. Frequent words are more likely to be discarded, while rare words are kept more often.

First, calculate the probability of keeping a word by:

$$P(w_i) = \left( \sqrt{\frac{f(w_i)}{\alpha}} + 1 \right) \cdot \frac{\alpha}{f(w_i)}$$

where f(wi) is the frequency of the i-th word, and α is a factor that influences the distribution; it defaults to 0.001.

Then, a random number between 0 and 1 is generated. If P(wi) is smaller than this number, the word is discarded.

For instance, when α=0.001, P(wi)≥1 for any f(wi)≤0.0026, so words with a frequency of 0.0026 or less are always kept. For a high word frequency such as f(wi)=0.03, P(wi)=0.22.

When α=0.002, words with a frequency of 0.0052 or less are always kept, while the same high frequency f(wi)=0.03 gives P(wi)=0.32.

Thus, a higher value of α makes frequent words more likely to be kept; conversely, a lower α strengthens the down-sampling of frequent words.

For example, if the word 'a' is discarded from the sentence 'Graph is a good way to visualize data,' then no training samples will include 'a' as either the target or a context word.
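
These keep probabilities are easy to check numerically. Below is a minimal Python sketch; keep_probability and subsample are illustrative helpers written for this page, assuming a frequency table freqs computed over the corpus:

```python
import math
import random

# Minimal sketch of subsampling. keep_probability implements
# P(w_i) = (sqrt(f(w_i)/alpha) + 1) * alpha / f(w_i);
# a value >= 1 means the word is always kept.
def keep_probability(f: float, alpha: float = 0.001) -> float:
    return (math.sqrt(f / alpha) + 1) * alpha / f

print(round(keep_probability(0.0026), 2))             # 1.0  -> always kept
print(round(keep_probability(0.03), 2))               # 0.22
print(round(keep_probability(0.03, alpha=0.002), 2))  # 0.32

def subsample(tokens, freqs, alpha=0.001):
    # Keep a word when a uniform draw in [0, 1) is below its keep probability.
    return [t for t in tokens if random.random() < keep_probability(freqs[t], alpha)]
```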

Negative Sampling

In the negative sampling approach, when a positive context word is sampled for a target word, a total of k words are simultaneously chosen as negative samples.

For instance, consider the simple corpus introduced with the basic Skip-gram model, whose vocabulary contains 10 words: graph, is, a, good, way, to, visualize, data, very, at. When the positive sample (target, context) = (is, a) is generated by the sliding window, we select k=3 negative words, graph, data, and at, to accompany it:

Target Word   Sample Type        Context Word   Expected Output
is            Positive sample    a              1
              Negative samples   graph          0
                                 data           0
                                 at             0

With negative sampling, the training objective of the model shifts from predicting context words for the target word to a binary classification task: the output for the positive word is expected to be 1, the outputs for the negative words are expected to be 0, and all other words fall into neither category and are disregarded.

Consequently, during the backpropagation process, the model only updates the output vectors vw′ associated with the positive and negative words to improve the model's classification performance.

Consider the scenario where V=10000 and N=300. When applying negative sampling with the parameter k=9, only 300×10=3000 individual weights in W′ will require updates, which is 0.1% of the 3 million weights to be updated without negative sampling!
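
A short sketch, using the toy words above and the figures V=10000, N=300, k=9 from the text, of how a negative-sampled example is structured and how few weights it touches:

```python
# Each sampled word becomes a binary classification case:
# expected output 1 for the positive word, 0 for each negative word.
target, positive = "is", "a"
negatives = ["graph", "data", "at"]               # k = 3 negative words
samples = [(target, positive, 1)] + [(target, w, 0) for w in negatives]

# With V = 10000 and N = 300, only the output vectors of the
# k + 1 sampled words in W' are updated.
V, N, k = 10000, 300, 9
updated = (k + 1) * N                             # 3000 weights
print(updated, V * N, updated / (V * N))          # 3000 3000000 0.001
```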

NOTE

Our experiments indicate that values of k in the range 5~20 are useful for small training datasets, while for large datasets the k can be as small as 2~5. (Mikolov et al.)

To select negative samples, a probability distribution Pn is required. The fundamental principle is to prioritize frequent words in the corpus. However, using raw frequency can result in an overrepresentation of very common words, while underrepresenting less frequent ones. To address this, an empirical distribution is often used that raises the word frequency to the power of 3/4:

$$P_n(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=1}^{V} f(w_j)^{3/4}}$$

where f(wi) is the frequency of the i-th word. The subscript n of P stands for noise; the distribution Pn is also called the noise distribution.

In extreme cases, consider a corpus containing only two words, with frequencies of 0.9 and 0.1 respectively. Applying the above formula results in adjusted probabilities of 0.84 and 0.16. This adjustment helps reduce the bias caused by large frequency disparities.
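
The following sketch reproduces these numbers; noise_distribution is an illustrative helper written for this page, not a library function:

```python
# Raise each frequency to the 3/4 power and renormalize.
def noise_distribution(freqs: dict, power: float = 0.75) -> dict:
    weights = {w: f ** power for w, f in freqs.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

print(noise_distribution({"w1": 0.9, "w2": 0.1}))
# {'w1': 0.838..., 'w2': 0.161...}  -- the 0.84 / 0.16 split above
```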

However, when working with large corpora, negative sampling can still be computationally intensive. To address this, a resolution is introduced to rescale the noise distribution. A higher resolution value enables the resampled distribution to better approximate the original noise distribution, striking a balance between efficiency and accuracy.
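
One plausible way to implement such rescaling, following the unigram table used in the original word2vec code, is sketched below. Whether Ultipa's resolution parameter works exactly this way is an assumption here, and build_table is a hypothetical helper:

```python
import random

# Hypothetical sketch: pre-fill a table of `resolution` slots so that each
# word occupies a share proportional to Pn(w); drawing a negative sample is
# then a uniform table lookup. A larger resolution approximates Pn more
# closely. (The original word2vec code uses the same idea with ~10^8 slots.)
def build_table(pn: dict, resolution: int = 10000) -> list:
    table = []
    for word, p in pn.items():
        table.extend([word] * round(p * resolution))
    return table

table = build_table({"w1": 0.84, "w2": 0.16}, resolution=100)
negative = random.choice(table)   # "w1" about 84% of the time
```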

Optimized Model Training

Forward Propagation

We will demonstrate with the target word is, the positive word a, and the negative words graph, data, and at.

With negative sampling, the Skip-gram model uses the following variation of the Softmax function, which is in fact the Sigmoid function (σ) applied to uj; it maps each component of u into the range (0, 1):

$$y_j = \sigma(u_j) = \frac{1}{1 + e^{-u_j}}$$
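
A minimal sketch of this forward pass with toy, randomly initialized vectors (N=2 as in the toy model; the vectors are illustrative, not trained values):

```python
import numpy as np

# h is the hidden-layer vector of the target word "is" (its row in W);
# each sampled word contributes one score u_j = v'_j . h, squashed by
# the sigmoid into (0, 1).
rng = np.random.default_rng(0)
N = 2
h = rng.normal(size=N)                                   # input vector of "is"
v_out = {w: rng.normal(size=N) for w in ["a", "graph", "data", "at"]}

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

y = {w: sigmoid(v.dot(h)) for w, v in v_out.items()}     # each output in (0, 1)
```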

Backpropagation

As explained, the output for the positive word, denoted as y0, is expected to be 1, while the k outputs corresponding to the negative words, denoted as yi (i = 1, …, k), are expected to be 0. The objective of training is therefore to maximize both y0 and every 1−yi, which can equivalently be interpreted as maximizing their product:

$$y_0 \prod_{i=1}^{k} (1 - y_i)$$

The loss function E is then obtained by recasting the above as a minimization problem:

$$E = -\log y_0 - \sum_{i=1}^{k} \log (1 - y_i)$$

Take the partial derivative of E with respect to u0 and ui:

$$\frac{\partial E}{\partial u_0} = y_0 - 1, \qquad \frac{\partial E}{\partial u_i} = y_i$$

∂E/∂u0 and ∂E/∂ui hold a similar meaning to ∂E/∂uj in the original Skip-gram model: each is the output minus the expected output, i.e., y0−1 for the positive word (expected 1) and yi−0 for each negative word (expected 0).

The process of updating the weights in matrices W′ and W is straightforward; you may refer to the original form of Skip-gram. However, only weights w11′, w21′, w13′, w23′, w18′, w28′, w1,10′ and w2,10′ in W′ and weights w21 and w22 in W are updated.
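
A minimal sketch of one such update step, assuming toy random vectors and a hypothetical learning rate eta; only the sampled words' output vectors and the target word's input vector are touched:

```python
import numpy as np

# Toy backpropagation step under negative sampling.
rng = np.random.default_rng(0)
N, eta = 2, 0.025
h = rng.normal(size=N)                           # input vector of target "is"
v_out = {w: rng.normal(size=N) for w in ["a", "graph", "data", "at"]}
expected = {"a": 1.0, "graph": 0.0, "data": 0.0, "at": 0.0}

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

grad_h = np.zeros(N)
for w, v in v_out.items():
    e = sigmoid(v.dot(h)) - expected[w]          # dE/du_j = y_j - expected
    grad_h += e * v                              # accumulate with the old v'_j
    v_out[w] = v - eta * e * h                   # update output vector v'_j
h -= eta * grad_h                                # update the target's input vector
```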