This chapter is based on the original form of Skip-gram training. It is recommended to read Skip-gram before continue reading this chapter.
The original training of Skip-gram model uses Softmax function to convert a series of scores into a probability distribution, where the calculation of the denorminator used for normalizaton engages all words in the corpus (refer to the formula of Softmax below), costing huge computing effort. When optimizing Skip-gram, we want to avoid calculating Softmax in this way.
In addition, during each backpropagation: for matrix W, we only update the input vector vw of the target word; for matrix W', all the output vectors v'w are updated, although most words are irrelavent to both the target word and context words. Matrices W and W' are often huge, the bigger the neural network is, the greater the amount of training samples are needed, hence the slower the gradient descent process. One of the directions of optimization is to limit the number of output vectors to be updated.
The original training form of Skip-gram can hardly be applied in practices. T. Mikoliv et al. proposed several optimization methods along with Skip-gram model: hierarchical softmax, negative sampling and subsampling of frequent words. These methods not only improve training speed but also the quality of embedding vectors. We will explain the negative sampling and subsampling of frequent words, which are adopted by graph embedding algorithms such as Node2Vec and Struc2Vec.
- T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed Representations of Words and Phrases and their Compositionality (2013)
- X. Rong, word2vec Parameter Learning Explained (2016)
Positive Sample & Negative Sample
In NLP field, if two words are a combination of one target word and one context word, it is a positive sample, and a negative sample otherwise.
Negative sampling is a sampling method: when a training sample composed of a target word and a context word is obtained as a positive sample and enters into the model for training, several other words are selected to combine with the target word as negative samples. When applying negative sampling, Skip-gram only updates the output vectors of the words in the positive sample and negative samples.
When describing the original form of Skip-gram, we used a corpus that contains 10 words:
graph, is, a, good, way, to, visualize, data, very, at
And a sentence:
Graph is a good way to visualize data.
When the size of the sliding window is 1 and the target word is 'is', one of the training samples is (is, a), which is a positive sample. Negative samples can be generated by randomly selecting and combining other words with 'is', assume that we get three negative samples as below:
For positive sample, the column of isContext is 1, representing the combination of a target word and a context word; and 0 for all negative samples, representing not the combination of the target word and context word. You may notice that although 'graph' is contained in the window centered by 'is', (is, graph) is still regarded as a negative sample with regard to the positive sample (is, a).
About the number of negative samples, T. Mikoliv et al. recommend to 5~20 for small datasets and 2~5 for big datasets.
How to Select Negative Samples
The basic principle for selecting words of negative samples is: frequent words in the corpus are more likely to be selected. However, if only considers frequency, high frequency words will be overly selected, while low frequency words are hardly chosen. To counter the imbalance, T. Mikoliv et al. used the following empirical distribution to raise the word frequency to the 3/4rd power:
where wi is the i-th word, f(wi) is the word frequency, V is the number of distinct words in the corpus, Pn(wi) is the probability of selecting word as a negative sample, the subscript n represents noise, Pn is also called noise distribution.
In extreme cases, suppose that there are only 2 words in the corpus, with word frequencies 0.9 and 0.1, after processed by the formula above, the probabilities turn to 0.84 and 0.16 respectively.
Model Training - Optimized with Negative Sampling
Take the training sample (is, a) as example：
From hidden layer to output layer, only the output vectors of positive sample w0 and three negative samples w1, w2, and w3 are involved. When applying negative sampling, Skip-gram model uses the following Softmax function:
where h is the input vector of target word, which is directly extracted from matrix W, v'wi is the output vector of context words from matrix W', ui = v'wih represents the dot product of the two vectors. Intutively, the more similar two words are, the dot product of their vectors (cosine distance) is bigger, hence yi → 1, otherwise yi → 0.
In negative sampling, the form of Softmax function looks like the Sigmoid function, but it is not Sigmoid since v'wi and h are both parameters that need to be trained.
The original Skip-gram aims at outputting the context words of the target word, after applying negative sampling technique, the goal changes - we expect the model to differentiate positive sample and negative samples. Specifically, the corresponding output of positive sample w0 is 1 (means 'true'), and negative samples are 0 (means 'false'):
Therefore, the goal of model training is to maximize y0 and all 1 - yi, assume that k negative samples are selected:
where i=1,2,...,k. The loss function E is:
Take the partial derivative of E with respect to u0 and ui, and define the result as e0 and ei:
e0 and ei are similar to the ej in the original Skip-gram, which can be regarded as deducting the expected results from the output results:
After getting the loss function E, e0, and ei, the backpropagation of Skip-gram is very straightforward. You can refer to the original form of Skip-gram, the only difference is - when updating the matrix W' between output layer and hidden layer, only update the output vectors of the positive sample and k negative samples.
Subsampling of Frequent Words
Some words in a corpus can appear with high frequency, such as in, the and a. These frequent words bring concerns like:
- Frequent words do not carry much semantic information, for instance, more information is given when man and handsome appear simultaneously than man and the.
- We can get plenty of training samples containing frequent words, more than what we need to train the vectors of these words, it would cost considerable unnecessary calculations.
Therefore, T. Mikolov et al. propose to discard words with a certain probability, and high frequency words are more likely to be discarded. For example, if discards 'a' from sentence 'Graph is a good way to visualize data', the sampling results of this sentence will include no training samples with 'a' neither as target word nor as context word. The probability is computed by the formula:
where wi is the 1-th word, f(wi) is the word frequency, t is the set threshold which is often around 10-5. When taking t = 0.006, if word 'a' has frequency of 0.03 (yes, this is a high frequency), then the probability of 'a' being discarded is 0.817; when the frequency of word is less than or equal to t, the word will not be discarded.
Please note: subsampling is conducted during the sliding window sampling, not to be confused with negative sampling.
As for graph, high frequency words are like supernodes with huge number of neighbors. The more neighbors a node has, the less information it can provide to each neighbor. For instance, in a social network, user A has 10,000 friends, user B only has 10 friends, so for their mutual friend C, the relationship with user B is viewed as more valuable.