
      Skip-gram Optimization


      The basic Skip-gram model is almost impractical due to various computational demands.

      The sizes of matrices W and W′ depend on the vocabulary size (e.g., V = 10000) and the embedding dimension (e.g., N = 300), so each matrix often contains millions of weights (e.g., V×N = 3 million). The Skip-gram neural network is thus very large, demanding a vast number of training samples to tune these weights.

      Additionally, during each backpropagation step, updates are applied to all output vectors (v′w) in matrix W′, even though most of these vectors are unrelated to both the target word and the context words. Given the significant size of W′, this gradient descent process is very slow.

      Another substantial cost arises from the Softmax function, which engages all words in the vocabulary to compute the denominator used for normalization.

      T. Mikolov and others introduced optimization techniques in conjunction with the Skip-gram model, including subsampling and negative sampling. These approaches not only accelerate the training process but also improve the quality of the embedding vectors.


      Subsampling

      Common words in the corpus like "the", "and", "is" pose some concerns:

      • They have limited semantic value. E.g., the model benefits more from the co-occurrence of "France" and "Paris" than the frequent co-occurrence of "France" and "the".
      • Far more training samples contain these words than are needed to train the corresponding vectors.

      The subsampling approach is used to address this. For each word in the training set, there is a chance to discard it, and less frequent words are discarded less often.

      First, calculate the probability of keeping a word:

      P(wi) = (√(f(wi)/α) + 1) · α/f(wi)

      where f(wi) is the frequency of the i-th word, and α is a factor that influences the distribution, defaulting to 0.001.

      Then, a random fraction between 0 and 1 is generated. If P(wi) is smaller than this number, the word is discarded.

      For instance, when α = 0.001, then for f(wi) ≤ 0.0026, P(wi) ≥ 1, so words with frequency 0.0026 or less are always kept. For a high word frequency like f(wi) = 0.03, P(wi) = 0.22.

      In the case when α = 0.002, words with frequency 0.0052 or less are always kept, while for the same high word frequency f(wi) = 0.03, P(wi) = 0.32.

      Thus, a higher value of α decreases the probability that frequent words are discarded.

      For example, if the word "a" is discarded from the training sentence "Graph is a good way to visualize data", the samples generated from this sentence will include none in which "a" serves as either the target word or the context word.
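      The subsampling procedure above can be sketched as follows (a minimal illustration assuming precomputed word frequencies; the function names are illustrative, not from any particular library):

```python
import math
import random

def keep_prob(freq, alpha=0.001):
    # P(wi) = (sqrt(f(wi)/alpha) + 1) * alpha / f(wi), capped at 1
    return min(1.0, (math.sqrt(freq / alpha) + 1) * alpha / freq)

def subsample(tokens, freqs, alpha=0.001, rng=random):
    # keep a word only when a uniform draw falls below its keep probability,
    # so frequent words are discarded more often than rare ones
    return [t for t in tokens if rng.random() < keep_prob(freqs[t], alpha)]
```

      Matching the figures in the text, keep_prob(0.0026) evaluates to 1 and keep_prob(0.03) to roughly 0.22 under the default α = 0.001.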

      Negative Sampling

      In the negative sampling approach, when a positive context word is sampled for a target word, a total of k words are simultaneously chosen as negative samples.

      For instance, let's revisit the simple corpus used when discussing the basic Skip-gram model. It comprises a vocabulary of 10 words: graph, is, a, good, way, to, visualize, data, very, at. When the positive sample (target, context): (is, a) is generated using a sliding window, we select k = 3 negative words graph, data and at to accompany it:

      Target Word    Sample Type        Context Word    Expected Output
      is             Positive Sample    a               1
      is             Negative Sample    graph           0
      is             Negative Sample    data            0
      is             Negative Sample    at              0

      With negative sampling, the training objective of the model shifts from predicting context words for the target word to a binary classification task. In this setup, the output for the positive word is expected as 1, while the outputs for the negative words are expected as 0; other words that do not fall into either category are disregarded.
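      A sketch of how such labeled samples could be assembled (the helper name is illustrative, and for simplicity negatives are drawn uniformly from words that are neither the target nor the positive context word; word2vec instead draws them from a noise distribution):

```python
import random

def make_training_rows(target, positive, vocab, k=3, rng=random):
    # one positive row (label 1) plus k negative rows (label 0);
    # candidates exclude the target word and the positive context word
    candidates = [w for w in vocab if w not in (target, positive)]
    negatives = rng.sample(candidates, k)
    return [(target, positive, 1)] + [(target, w, 0) for w in negatives]
```

      Each row is a (target, context, label) triple ready for the binary classifier.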

      Consequently, during the backpropagation process, the model only updates the output vectors vw associated with the positive and negative words to improve the model's classification performance.

      Consider the scenario where V = 10000 and N = 300. When applying negative sampling with the parameter k = 9, only 300×10 = 3000 individual weights in W′ require updates, which is 0.1% of the 3 million weights updated without negative sampling!

      "Our experiments indicate that values of k in the range 5–20 are useful for small training datasets, while for large datasets the k can be as small as 2–5." (Mikolov et al.)

      A probability distribution Pn is needed for selecting negative words. The fundamental principle is to prioritize frequent words in the corpus. However, if the selection is based solely on word frequency, it leads to an overrepresentation of high-frequency words and a neglect of low-frequency words. To address this imbalance, an empirical distribution is often used that raises each word frequency to the power of 3/4:

      Pn(wi) = f(wi)^(3/4) / Σj f(wj)^(3/4)

      where f(wi) is the frequency of the i-th word. The subscript n of P stands for noise; the distribution Pn is also called the noise distribution.

      In extreme cases where the corpus contains just two words, with frequencies of 0.9 and 0.1 respectively, utilizing the above formula would yield adjusted probabilities of 0.84 and 0.16. This adjustment goes some way in alleviating the inherent selection bias stemming from frequency differences.
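      The 3/4-power adjustment can be sketched as (the function name is illustrative):

```python
def noise_distribution(freqs, power=0.75):
    # raise each word frequency to the 3/4 power, then renormalize
    # so the adjusted values again sum to 1
    weighted = {w: f ** power for w, f in freqs.items()}
    total = sum(weighted.values())
    return {w: v / total for w, v in weighted.items()}
```

      For the two-word corpus above, this yields probabilities of roughly 0.84 and 0.16.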

      Dealing with a large corpus can pose challenges in terms of computational efficiency for negative sampling. Therefore, we further adopt a resolution parameter to rescale the noise distribution. A higher value of resolution provides a closer approximation to the original noise distribution.

      Optimized Model Training

      Forward Propagation

      We will demonstrate with target word is, positive word a, and negative words graph, data and at:

      With negative sampling, the Skip-gram model uses the following variation of the Softmax function, which is in fact the Sigmoid function (σ) applied to uj. It maps every component of u into the range (0, 1):

      yj = σ(uj) = 1 / (1 + e^(−uj))


      As explained, the output for the positive word, denoted as y0, is expected to be 1, while the k outputs corresponding to the negative words, denoted as yi (i = 1, ..., k), are expected to be 0. The objective of the model's training is therefore to maximize both y0 and every 1 − yi, which can be equivalently interpreted as maximizing their product:

      y0 · ∏i=1..k (1 − yi)

      The loss function E is then obtained by turning the above into a minimization problem:

      E = −log y0 − Σi=1..k log(1 − yi)

      Take the partial derivatives of E with respect to u0 and ui:

      ∂E/∂u0 = y0 − 1,  ∂E/∂ui = yi

      ∂E/∂u0 and ∂E/∂ui hold a similar meaning to ∂E/∂uj in the original Skip-gram model: each can be understood as subtracting the expected output (1 for the positive word, 0 for each negative word) from the actual output.

      The process of updating weights in matrices W and W′ is straightforward; you may refer to the original form of Skip-gram. However, only weights w′11, w′21, w′13, w′23, w′18, w′28, w′1,10 and w′2,10 in W′ and weights w21 and w22 in W are updated.
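      The forward and backward passes above can be sketched for one target word as follows (a minimal illustration using plain Python lists, where h is the hidden-layer vector of the target word, i.e., its row of W, and out_vecs are the output vectors v′w of the sampled words, i.e., columns of W′; names are illustrative):

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def negative_sampling_step(h, out_vecs, labels, lr=0.025):
    # labels: 1 for the positive word, 0 for each negative word
    grad_h = [0.0] * len(h)
    for v, t in zip(out_vecs, labels):
        u = sum(hi * vi for hi, vi in zip(h, v))  # uj = v'w . h
        e = sigmoid(u) - t                        # dE/duj = yj - tj
        for n in range(len(h)):
            grad_h[n] += e * v[n]                 # accumulate before updating v
            v[n] -= lr * e * h[n]                 # update output vector in W'
    for n in range(len(h)):
        h[n] -= lr * grad_h[n]                    # update target word's row of W
    return h, out_vecs
```

      One step nudges the positive word's score σ(v′w·h) upward and each negative word's score downward; only the sampled output vectors and the target word's input vector are touched.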
