1. One-word context Model
In our setting, the vocabulary size is $V$, and the hidden layer size is $N$.
The input $x$ is a one-hot vector: for a given input context word, exactly one of the $V$ units $\{x_1,\cdots,x_V\}$ is 1 and all other units are 0. For example,
\[x=[0,\cdots,1,\cdots,0]\]
The weights between the input layer and the hidden layer can be represented by a $V \times N$ matrix $W$. Each row of $W$ is the $N$-dimensional vector representation $v_w$ of the associated word of the input layer.
Given a context (a single word), assume $x_k=1$ and $x_{k'}=0$ for $k' \neq k$. Then
\[h = x^T W = W_{(k,\cdot)} := v_{w_I}\]
which is just copying the $k$-th row of $W$ to $h$. $v_{w_I}$ is the vector representation of the input word $w_I$. This implies that the link (activation) function of the hidden layer units is simply linear (i.e., directly passing its weighted sum of inputs to the next layer).
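As a quick check of the row-copy claim, here is a minimal sketch in plain Python. The toy sizes and the values in `W` are made up for illustration and do not come from any trained model.

```python
# Toy sizes (assumed for illustration): V = 4 words, N = 3 hidden units.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
k = 2                 # index of the input word w_I
x = [0, 0, 1, 0]      # one-hot input with x_k = 1

# h = x^T W, written out as an explicit sum over the vocabulary.
h = [sum(x[r] * W[r][i] for r in range(len(W))) for i in range(len(W[0]))]

# The product only selects row k of W.
print(h == W[k])
```

In practice this means the matrix product never needs to be computed: indexing row $k$ of $W$ gives the same hidden vector.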
From the hidden layer to the output layer, there is a different weight matrix $W'=\{w'_{ij}\}$, which is an $N \times V$ matrix. Using these weights, we can compute a score $u_j$ for each word in the vocabulary,
\[ u_j = {v'_{w_j}}^T \cdot h \]
where $v'_{w_j}$ is the $j$-th column of the matrix $W'$. Then we can use the softmax classification model to obtain the posterior distribution of the words, which is a multinomial distribution.
\[p(w_j|w_I)=y_j=\frac{\exp(u_j)}{\sum_{j'=1}^V \exp(u_{j'})}\]
where $y_j$ is the output of the $j$-th unit in the output layer.
Substituting the expressions for $u_j$ and $h$, we obtain:
\[p(w_j | w_I) = y_j = \frac{\exp({v'_{w_j}}^T v_{w_I})}{\sum_{j'=1}^V \exp({v'_{w_{j'}}}^T v_{w_I})}\]
Note that $v_w$ and $v'_w$ are two representations of the word $w$: $v_w$ comes from the rows of $W$, the input $\to$ hidden weight matrix, and $v'_w$ comes from the columns of $W'$, the hidden $\to$ output matrix. In the subsequent analysis, we call $v_w$ the “input vector” and $v'_w$ the “output vector” of the word $w$.
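The forward pass described above can be sketched in a few lines of plain Python. The function names `softmax` and `forward` and the toy weights are assumptions made for this illustration; the max-subtraction inside `softmax` is a common numerical safeguard and not part of the derivation.

```python
import math

def softmax(u):
    """Turn scores u into probabilities; subtracting max(u) avoids overflow."""
    m = max(u)
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [x / total for x in exps]

def forward(W, W_out, k):
    """One-word-context forward pass for input word index k.

    W is V x N (rows are input vectors); W_out is N x V (columns are output vectors).
    """
    h = W[k]                                   # hidden layer = k-th row of W
    N, V = len(W_out), len(W_out[0])
    # u_j = v'_{w_j}^T h, one score per vocabulary word
    u = [sum(W_out[i][j] * h[i] for i in range(N)) for j in range(V)]
    return h, u, softmax(u)

# Toy weights (illustrative values only): V = 3, N = 2.
W = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
W_out = [[0.2, -0.1, 0.0], [0.3, 0.5, -0.4]]
h, u, y = forward(W, W_out, k=1)
print(y)   # three probabilities, one per vocabulary word
```

Note that computing `y` touches every column of `W_out`, which is why the cost of a single prediction grows with the vocabulary size $V$.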
1.2. Cost Function
Let’s derive the weight update equations for this model. Although the actual computation is impractical, we still derive the update equations to gain insight into the original model without tricks.
The training objective is to maximize, with respect to the weights, the conditional probability of observing the actual output word $w_o$ (denote its corresponding index in the output layer as $j^*$) given the input context word $w_I$:
\[ \max p(w_o|w_I)=\max y_{j^*}\]
which is equivalent to minimizing the negative log-probability:
\[ E=-\log p(w_o|w_I)=-u_{j^*}+\log \sum\limits_{j'=1}^V \exp(u_{j'})\]
where $j^*$ is the index of the actual output word in the output layer. Note that this loss function can be understood as a special case of the cross-entropy between two probability distributions, which was discussed in a previous post: Negative log-likelihood function.
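To make the loss concrete, the sketch below computes the negative log-probability of the target word directly from the scores and checks that it agrees with $-\log y_{j^*}$ from the softmax. The scores in `u` are arbitrary illustrative numbers, and the helper name `loss` is an assumption of this example.

```python
import math

def loss(u, j_star):
    """E = -u_{j*} + log(sum_j exp(u_j)), with the max subtracted for stability."""
    m = max(u)
    log_sum = m + math.log(sum(math.exp(v - m) for v in u))
    return -u[j_star] + log_sum

u = [1.0, 2.0, 0.5]       # arbitrary scores, for illustration
j_star = 1                # index of the actual output word

E = loss(u, j_star)

# Same quantity via the softmax probability of the target word.
y_target = math.exp(u[j_star]) / sum(math.exp(v) for v in u)
print(E, -math.log(y_target))   # the two values agree up to rounding
```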
1.3. Update weight from output $\to$ hidden layer
Let’s derive the update equation of the weights between the hidden and output layers, $W'_{N \times V}$.
Take the derivative of $E$ with regard to the $j$-th unit’s net input $u_j$:
\[ \frac{\partial E}{\partial u_j}= y_j-t_j := e_j \]
where
\[t_j = \left\{\begin{aligned} 1, \quad & j=j^* \\ 0, \quad & j \neq j^* \end{aligned}\right. \]
i.e., $t_j$ is 1 only when the $j$-th unit corresponds to the actual output word; otherwise $t_j=0$. Note that this derivative is simply the prediction error $e_j$ of the output layer.
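One way to convince yourself that the derivative really is $y_j - t_j$ is a finite-difference check. The sketch below compares the analytic error with a central-difference estimate of the gradient of $E$ with respect to each score; the scores, the step size `eps`, and the helper names are all assumptions of this example.

```python
import math

def softmax(u):
    m = max(u)
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [x / total for x in exps]

def loss(u, j_star):
    """E = -u_{j*} + log(sum_j exp(u_j)), computed stably."""
    m = max(u)
    return -u[j_star] + m + math.log(sum(math.exp(v - m) for v in u))

u = [0.3, -1.2, 2.0, 0.1]   # arbitrary scores, for illustration
j_star = 2
y = softmax(u)

# Analytic gradient: e_j = y_j - t_j, with t one-hot at j_star.
e = [y[j] - (1.0 if j == j_star else 0.0) for j in range(len(u))]

# Numerical gradient by central differences.
eps = 1e-6
numeric = []
for j in range(len(u)):
    up, down = list(u), list(u)
    up[j] += eps
    down[j] -= eps
    numeric.append((loss(up, j_star) - loss(down, j_star)) / (2 * eps))

max_gap = max(abs(a - b) for a, b in zip(e, numeric))
print(max_gap)   # should be tiny if the formula is right
```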
Next we take the derivative with respect to $w'_{ij}$ to obtain the gradient on the hidden $\to$ output weights $W'_{N \times V}$.
\[ \frac{\partial E}{\partial w'_{ij}}= \frac{\partial E}{\partial u_j} \cdot \frac{\partial u_j}{\partial w'_{ij}}= e_j \cdot h_i \]
Therefore, with SGD, we obtain the weight update equation for the hidden $\to$ output weights:
\[ w'_{ij} = w'_{ij} - \alpha \cdot e_j \cdot h_i\]
or, in vector form:
\[ v'_{w_j}=v'_{w_j} - \alpha \cdot e_j \cdot h \qquad j=1,2,\cdots,V\]
where $\alpha$ is the learning rate, $e_j = y_j - t_j$, and $h_i$ is the $i$-th unit in the hidden layer; $v'_{w_j}$ is the output vector of $w_j$.
Note that this update equation implies that we have to go through every word in the vocabulary, check its output probability $y_j$, and compare $y_j$ with its expected output $t_j$ (either 0 or 1).
If $y_j > t_j$ (“overestimating”), we subtract a proportion of the hidden vector $h$ (i.e., $v_{w_I}$) from $v'_{w_j}$, thus moving $v'_{w_j}$ farther away from $v_{w_I}$. If $y_j < t_j$ (“underestimating”, which can only happen when $t_j=1$, i.e., $w_j=w_o$), we add some $h$ to $v'_{w_o}$, thus moving $v'_{w_o}$ closer to $v_{w_I}$. If $y_j \approx t_j$, then according to the update equation, little change will be made to the weights. Note again that $v_w$ (input vector) and $v'_w$ (output vector) are two different vector representations of the word $w$.
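The update of the output vectors can be sketched as a single SGD step in plain Python. The function names, the toy weights, and the learning rate below are assumptions for illustration. The loop over `j` makes the cost explicit: every column of the hidden-to-output matrix is touched once per training pair.

```python
import math

def softmax(u):
    m = max(u)
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [x / total for x in exps]

def scores(W_out, h):
    """u_j = sum_i w'_ij * h_i for every word j; W_out is N x V."""
    N, V = len(W_out), len(W_out[0])
    return [sum(W_out[i][j] * h[i] for i in range(N)) for j in range(V)]

def update_output_vectors(W_out, h, j_star, alpha):
    """Apply w'_ij <- w'_ij - alpha * e_j * h_i in place; return the old y."""
    y = softmax(scores(W_out, h))
    for j in range(len(y)):                          # every word in the vocabulary
        e_j = y[j] - (1.0 if j == j_star else 0.0)   # prediction error
        for i in range(len(h)):
            W_out[i][j] -= alpha * e_j * h[i]
    return y

h = [0.4, 0.3]                                   # hidden vector, i.e. v_{w_I}
W_out = [[0.2, -0.1, 0.0], [0.3, 0.5, -0.4]]     # toy N x V weights
y_before = update_output_vectors(W_out, h, j_star=2, alpha=0.1)
y_after = softmax(scores(W_out, h))
print(y_before[2], y_after[2])   # probability of the target word before and after
```

After the step, the score of the target word goes up and the scores of all other words go down, so the target probability increases for the same hidden vector.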
1.4. Update weight from hidden $\to$ input layer
Having obtained the update equations for $W'$, we now move on to $W$. We take the derivative of $E$ with respect to the output of the hidden layer, obtaining:
\[\frac{\partial E}{\partial h_i}=\sum\limits_{j=1}^V\frac{\partial E}{\partial u_j}\cdot\frac{\partial u_j}{\partial h_i}=\sum\limits_{j=1}^V e_j \cdot w'_{ij}:=EH_i\]
where $h_i$ is the output of the $i$-th unit of the hidden layer; $u_j$ is the net input of the $j$-th unit in the output layer; and $e_j=y_j-t_j$ is the prediction error of the $j$-th word in the output layer. $EH$, an $N$-dimensional vector, is the sum of the output vectors of all words in the vocabulary, weighted by their prediction errors.
Next, we take the derivative of $E$ with respect to $W$. First, recall that the hidden layer is a linear function of the input layer:
\[h_i=\sum\limits_{k=1}^V x_k \cdot w_{ki}\]
Then, we obtain:
\[\frac{\partial E}{\partial w_{ki}}=\frac{\partial E}{\partial h_i} \cdot \frac{\partial h_i}{\partial w_{ki}}=EH_i \cdot x_k\]
\[\frac{\partial E}{\partial W}=x \cdot EH^T\]
which is a $V \times N$ matrix (the outer product of $x$ and $EH$). Since only one component of $x$ is non-zero, only one row of $\frac{\partial E}{\partial W}$ is non-zero, and the value of that row is $EH$, an $N$-dimensional vector.
We obtain the update equation of $W$ as:
\[v_{w_I}=v_{w_I}-\alpha \cdot EH\]
where $v_{w_I}$ is a row of $W$, the “input vector” of the only context word, and is the only row of $W$ whose gradient is non-zero. All the other rows of $W$ remain unchanged in this iteration.
Vector $EH$ is the sum of the output vectors of all words in the vocabulary weighted by their prediction errors $e_j=y_j-t_j$, so the update amounts to adding a portion of every output vector in the vocabulary to the input vector of the context word.
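A minimal sketch of this step, assuming the prediction errors have already been computed: it builds $EH$ as the error-weighted sum of output vectors and then updates only the row of $W$ that belongs to the input word. The function name, the toy weights, and the error values in `e` are illustrative assumptions.

```python
def input_vector_update(W, W_out, k, e, alpha):
    """Update row k of W in place using EH_i = sum_j e_j * w'_ij; return EH."""
    N, V = len(W_out), len(W_out[0])
    EH = [sum(e[j] * W_out[i][j] for j in range(V)) for i in range(N)]
    for i in range(N):
        W[k][i] -= alpha * EH[i]      # only the input word's row changes
    return EH

W = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]       # toy V x N input weights
W_out = [[0.2, -0.1, 0.0], [0.3, 0.5, -0.4]]     # toy N x V output weights
e = [0.3, 0.2, -0.5]    # assumed prediction errors y_j - t_j (they sum to 0)

W_before = [row[:] for row in W]
EH = input_vector_update(W, W_out, k=1, e=e, alpha=0.1)
print(EH)
print(W[0] == W_before[0], W[2] == W_before[2])   # the other rows are untouched
```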
As we iteratively update the model parameters by going through context-target word pairs generated from a training corpus, the effects on the vectors will accumulate. We can imagine that the output vector of a word $w$ is “dragged” back and forth by the input vectors of $w$'s co-occurring neighbors, as if there were physical strings between the vector of $w$ and the vectors of its neighbors. Similarly, an input vector can also be considered as being dragged by many output vectors. This interpretation is reminiscent of gravity, or of a force-directed graph layout. The equilibrium length of each imaginary string is related to the strength of co-occurrence between the associated pair of words, as well as to the learning rate. After many iterations, the relative positions of the input and output vectors will eventually stabilize.
2. Multi-word context Model
Now, we move on to the model with a multi-word context setting.
When computing the hidden layer output, the CBOW model takes the mean of the input vectors:
\[h=\frac{1}{C}W^T (x_1+x_2+\cdots+x_C)=\frac{1}{C}\cdot (v_{w_1}+v_{w_2}+\cdots+v_{w_C})\]
where $C$ is the number of words in the context, $w_1,\cdots,w_C$ are the words in the context, and $v_w$ is the input vector of a word $w$. The loss function is:
\[E = - \log p(w_o | w_{I,1},\cdots,w_{I,C}) \]
\[=-u_{j^*}+\log\sum\limits_{j'=1}^{V} \exp(u_{j'})\]
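The only change to the forward pass is how the hidden vector is formed. As a rough sketch, with a hypothetical helper `cbow_hidden` and made-up weights, the hidden layer is the element-wise mean of the context words' rows of $W$:

```python
def cbow_hidden(W, context):
    """Average the input vectors (rows of W) of the context word indices."""
    C, N = len(context), len(W[0])
    return [sum(W[k][i] for k in context) / C for i in range(N)]

W = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]   # toy V x N input weights

h = cbow_hidden(W, context=[0, 2])           # two context words
h_single = cbow_hidden(W, context=[1])       # C = 1 reduces to copying a row
print(h, h_single)
```

With a single context word the average is just that word's input vector, so the one-word model is the special case $C=1$.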
2.2. Update weight from output $\to$ hidden layer
The update equation for the hidden $\to$ output weights stays the same as that for the one-word-context model:
\[ v'_{w_j}=v'_{w_j} - \alpha \cdot e_j \cdot h \qquad j=1,2,\cdots,V\]
Note that we need to apply this to every element of the hidden $\to$ output weight matrix for each training instance.
2.3. Update weight from hidden $\to$ input layer
The update equation for the input $\to$ hidden weights is similar to that in 1.4, except that now we need to apply the following equation for every word $w_{I,c}$ in the context:
\[ v_{w_{I,c}}=v_{w_{I,c}} - \frac{1}{C}\cdot \alpha \cdot EH \]
where $v_{w_{I,c}}$ is the input vector of the $c$-th word in the input context; $\alpha$ is the learning rate; and $EH=\frac{\partial E}{\partial h}$, with components $EH_i=\frac{\partial E}{\partial h_i}$.
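A small sketch of this multi-word update, under the assumption that $EH$ has already been computed from the prediction errors. The helper name `cbow_input_update` and all numbers are illustrative. In this sketch, a word that appears more than once in the context simply receives the update once per occurrence.

```python
def cbow_input_update(W, context, EH, alpha):
    """Apply v_{w_{I,c}} <- v_{w_{I,c}} - (alpha / C) * EH for each context word."""
    C = len(context)
    for k in context:                 # one update per context position
        for i in range(len(EH)):
            W[k][i] -= alpha * EH[i] / C

W = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]   # toy V x N input weights
EH = [0.04, 0.39]                            # assumed error vector, for illustration only

cbow_input_update(W, context=[0, 2], EH=EH, alpha=0.1)
print(W)   # rows 0 and 2 moved by -(alpha / C) * EH; row 1 is unchanged
```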