Skip-Gram Model
Natural Language Processing (NLP) is a popular field of Artificial Intelligence. Its goal is to process human language, as text or speech, so that computers can handle it the way humans do. Humans produce a large amount of text in a loosely structured format, which makes it hard for a machine to extract meaning from raw text.
To let a machine learn from raw text, we need to transform the data into a vector format that computers can process easily. This transformation of raw text into vectors is known as word representation.
We need unsupervised learning methods because the vocabulary of any language is large and cannot be labeled by hand. The model has to learn the context of each word on its own. Skip-gram is one such unsupervised learning method. It is used to find the words most related to a given word. In this article, we look at the Skip-gram model in detail.
Description
The Skip-gram model tries to predict the surrounding context words given a target word. It reverses the roles of target and context words compared with the continuous bag-of-words (CBOW) model. In this case:
- The target word is fed in at the input.
- The hidden layer remains the same.
- The output layer of the neural network is replicated multiple times to accommodate the chosen number of context words.
- Take the example of "cat" and "tree" as context words and "climbed" as the target word.
- The input vector in the skip-gram model will be [0 0 0 1 0 0 0 0]^T.
- The two output layers will have [0 1 0 0 0 0 0 0]^T and [0 0 0 0 0 0 0 1]^T as target vectors, respectively.
- Instead of producing one vector of probabilities, the model produces two such vectors for the current example.
- In the figure above, w(t) is the target word, given as the input.
- The hidden layer computes the dot product between the weight matrix and the input vector w(t).
- No activation function is used in the hidden layer.
- The result of the dot product at the hidden layer is passed to the output layer.
- The output layer calculates the dot product between the output vector of the hidden layer and the weight matrix of the output layer.
- We then apply the softmax activation function to compute the probability of each word appearing in the context of w(t) at a given context position.
- We use the neural network to compute these probabilities.
- However, what we really care about is not its output but its weights.
- In fact, the weights of the hidden layer can be read as the word vectors themselves.
- Each row denotes a word and each column represents a dimension.
- We have to feed the neural network with pairs of words.
- We cannot do that using the actual characters as inputs.
- We have to find a way to represent these words mathematically so that the network can process them.
- One way to handle this is to create a vocabulary of all the words in the text.
- Each word is then encoded as a vector with as many dimensions as the vocabulary has words.
- Each dimension can be thought of as one word in the vocabulary.
- As a result, each word becomes a vector of all zeros with a single 1 at the position of that word in the vocabulary.
- This encoding method is called one-hot encoding.
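The steps described above (one-hot encoding, a linear hidden layer, and a softmax output layer) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not a trained model. The eight-word vocabulary, the embedding size of 3, and the names `one_hot`, `softmax`, `forward`, `w_in`, and `w_out` are all hypothetical. The vocabulary was chosen only so that "cat", "climbed", and "tree" land at positions 1, 3, and 7, matching the vectors in the example. The weights are random and untrained.

```python
import math
import random

# Hypothetical 8-word vocabulary, chosen so that the indices match the
# vectors in the text: "cat" -> 1, "climbed" -> 3, "tree" -> 7.
VOCAB = ["a", "cat", "chased", "climbed", "dog", "jumped", "the", "tree"]
WORD_TO_INDEX = {word: i for i, word in enumerate(VOCAB)}
EMBEDDING_DIM = 3  # assumed size of the hidden layer


def one_hot(word):
    """Return a vocabulary-sized vector with a single 1 at the word's index."""
    vector = [0] * len(VOCAB)
    vector[WORD_TO_INDEX[word]] = 1
    return vector


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, w_in, w_out):
    """One forward pass: linear hidden layer, then softmax over the vocabulary."""
    # Hidden layer: dot product of the input with w_in, no activation function.
    hidden = [sum(x[i] * w_in[i][j] for i in range(len(x)))
              for j in range(EMBEDDING_DIM)]
    # Output layer: dot product of the hidden vector with w_out, then softmax.
    scores = [sum(hidden[j] * w_out[j][k] for j in range(EMBEDDING_DIM))
              for k in range(len(VOCAB))]
    return hidden, softmax(scores)


rng = random.Random(0)
# Untrained, randomly initialised weights; training would adjust these.
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(EMBEDDING_DIM)] for _ in VOCAB]
w_out = [[rng.uniform(-0.5, 0.5) for _ in VOCAB] for _ in range(EMBEDDING_DIM)]

x = one_hot("climbed")                       # target word (input)
targets = [one_hot("cat"), one_hot("tree")]  # context words (expected outputs)
hidden, probs = forward(x, w_in, w_out)

# The same output distribution is compared with each context target;
# the loss is the summed cross-entropy over the context words.
loss = -sum(math.log(probs[t.index(1)]) for t in targets)

print(x)        # [0, 0, 0, 1, 0, 0, 0, 0]
print(targets)  # [[0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1]]
print(hidden, w_in[WORD_TO_INDEX["climbed"]])
```

Because the input is one-hot, the hidden vector printed on the last line is simply the row of `w_in` that belongs to "climbed". This is why the rows of the hidden-layer weight matrix can be used as word vectors once training is finished.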