TL;DR - word2vec is awesome, it's also really simple. Learn how it works, and implement your own version.
Since joining a tech startup back in 2016, my life has revolved around machine learning and natural language processing (NLP). Trying to extract faint signals from terabytes of streaming social media is the name of the game. Because of this, I’m constantly experimenting and implementing different NLP schemes; word2vec being among the simplest and coincidently yielding great predictive value. The underpinnings of word2vec are exceptionally simple and the math is borderline elegant. The whole system is deceptively simple, and provides exceptional results. This tutorial aims to teach the basics of word2vec while building a barebones implementation in Python using NumPy. Note that the final Python implementation will not be optimized for speed or memory usage, but instead for easy understanding.
The goal with word2vec and most NLP embedding schemes is to translate text into vectors so that they can then be processed using operations from linear algebra. Vectorizing text data allows us to then create predictive models that use these vectors as input to then perform something useful. If you understand both forward and back propagation for plain vanella neural networks, you already understand 90% of word2vec.
It's Actually Really Simple (I promise!)
Word2vec is actually a collection of two different methods: continuous bag-of-words (CBOW) and skip-gram1. Given a word in a sentence, lets call it w(t) (also called the center word or target word), CBOW uses the context or surrounding words as input. For instance, if the context window C is set to C=5, then the input would be words at positions w(t-2), w(t-1), w(t+1), and w(t+2). Basically the two words before and after the center word w(t). Given this information, CBOW then tries to predict the target word.
The second method, skip-gram is the exact opposite. Instead of inputting the context words and predicting the center word, we feed in the center word and predict the context words. This means that w(t) becomes the input while w(t-2), w(t-1), w(t+1), and w(t+2) are the ideal output. For for this post, we’re only going to consider the skip-gram model since it has been shown to produce better word-embeddings than CBOW.
The concept of a center word surrounded by context words can be likened to a sliding window that travels across the text corpus. As an example, lets encode the following sentence: “the quick brown fox jumps over the lazy dog” using a window size of C=5 (two before, and two after the center word). As the context window slides across the sentence from left to right, it gets populated with the corresponding words. When the context window reaches the edges of the sentences, it simply drops the furthest window positions. Below is what this process looks like. Note that instead of w(t), w(t+1), etc., the center word has become xk and the context words have become yc.
Because we can’t send text data directly through a matrix, we need to employ one-hot encoding. This means we have a vector of length v where v is the total number of unique words in the text corpus (or shorter if we want). Each word corresponds to a single position in this vector, so when embedding the word v_n, everywhere in vector v is zero except v_n which becomes a one. Below in Figure 3, a one-hot encoding of examples 1, 5, and 9 from Figure 2 above.
→ For future reference, the column vectors
So far, so good right? Now we need to feed this data into the network and train it. Most of the literature describing skip-gram uses the same graphic depicting the final layer of the model somehow having three or more matrices. I found this rather confusing while I was reading the white papers so I ended up digging through the original source code to get to the bottom of it. I figure there are other people like me, so I created another version of the architecture that I find a little easier to digest.
Now that the text data has been encoded, lets proceed through a single forward pass of the network. Step one is getting to the hidden layer h.
Preceding forward, we need to calculate the values as they exit the network.
For each one-hot encoded center word
In code, this will take the form of the following:
With the softmax calculation taking the form of:
The forward pass is fairly simple and differs little from that of a standard, fully connected neural network.
Backpropagation & Training
Now to improve the weights within
Now that the loss has been calculated, we’re going to employ the chain rule to distribute the error amongst the weight matrices
The column vector
Therefore, the gradient decent update equation for
This allows us to then calculate the loss with respect to
and finally, we can formulate the gradient decent weight update equation for
At this point, everything needed to train the network has been established and we just need to code it.
Where the back propagation function is defined as:
That’s it! Only slightly more complicated than a simple neural network. To avoid posting redundant sections of code, you can find the completed word2vec model along with some additional features at this GitHub repo (here).
As a simple sanity check, lets look at the network output given a few input words. This is the output after 5000 iterations.
These sample results show that the output probabilities for each center word are split fairly evenly between the correct context word. If we narrow in on the word lazy, we can see that the probabilities for the words dog, over, and the are split fairly evenly at roughly 33.33% each. This is exactly what we want.
Shortly after the initial release of word2vec, a second paper detailing several improvements was published.2 Amongst these proposed improvements are:
Phrase Generation — This is the process in which commonly co-occuring words such as “san” and “francisco” become “san_francisco”. The result of phrase generation is a cleaner, more useful, and user-friendly vocabulary list. Phrase generation is based on the following equation which utilizes the unigram and bigram vocabulary counts for the given corpus.
The numerator consists of the total number of times the bigram formed with words
For longer word combinations such as “new york city”, an iterative phrase generation approach can be used. As an example, on the initial pass “new” and “york” could be combined into “new_york”, with a second pass combining “new_york” with “city” yielding the desired phrase “new_york_city”. According to Mikolov et al. (2013), typically 2-4 passes with decreasing threshold values yielded the best results.
Subsampling — The purpose of subsampling is to counter the balance between frequent and rare words. No single word should represent a sizable portion of the corpus. To correct for this all words are discarded based on the following probability:
Negative Sampling — Often referred to as just NEG, this is a modification to the backpropagation procedure in which only a small percentage of the errors are actually considered. For instance, using example #9 from figure 3 above, dog is the target word, while the and lazy are the context words. This means that in an ideal situation, the network will return a “0” for everything except for the and lazy which will be “1”. In short, context words become “positive” while everything else becomes “negative”. With negative sampling, we only concern ourselves with the positive words and only a small percentage of the negative words. This means that instead of backpropagating the errors from all 8 words in the vocabulary, we backpropagate the errors from the positive words the and lazy plus
The probability that a negative word is chosen is determined using the following equation:
As we can see, the primary difference between this and the standard stochastic gradient decent version is that now we include K observations. As for how large k should be, Mikolov et al. had this to say:
Our experiments indicate that values of k in the range 5–20 are useful for small training datasets, while for large datasets the k can be as small as 2–5.
- Mikolov et al. (2013)
Something to keep in mind however is that the official C implementation uses a slightly different formulation as seen below3:
Note that even without negative sampling, only the weights for the target word in matrix W are updated.
Thanks for reading, I hope you found this useful! If anything still seems unclear, please let me know so I can improve this tutorial. For further reading, I highly suggest working through each of the references below.
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space (Submitted on 16 Jan 2013) ↩
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality (Submitted on 16 Oct 2013) ↩ ↩2