Evaluating a Language Model: Perplexity
We have a set of \(m\) sentences:

\[s_1, s_2, \cdots, s_m\]
We could look at the probability of the data under our model, \(\prod_{i=1}^{m} p(s_i)\). Or, more conveniently, the log probability:

\[\log \prod_{i=1}^{m} p(s_i) = \sum_{i=1}^{m} \log p(s_i),\]

where \(p(s_i)\) is the probability of sentence \(s_i\).
In fact, the usual evaluation measure is perplexity:

\[\text{perplexity} = 2^{-l}, \quad \text{where} \quad l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i)\]

and \(M\) is the total number of words in the test data.
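As a sanity check, here is a minimal Python sketch of this computation; the sentence probabilities and lengths are made-up toy values, not real model outputs:

```python
import math

# Hypothetical per-sentence probabilities under the model, and sentence
# lengths (word counts); the numbers are illustrative only.
sentence_probs = [1e-4, 5e-6, 2e-5]   # p(s_i) for each test sentence
sentence_lengths = [5, 8, 6]          # number of words in each sentence

M = sum(sentence_lengths)                             # total words in the test data
l = sum(math.log2(p) for p in sentence_probs) / M     # average log2-probability per word
perplexity = 2 ** -l

print(f"l = {l:.4f}, perplexity = {perplexity:.2f}")
```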
Cross-Entropy
Given words \(x_1, \cdots, x_t\), a language model predicts the following word \(x_{t+1}\) by modeling:

\[P(x_{t+1} = v_j \mid x_t, \cdots, x_1),\]

where \(v_j\) is a word in the vocabulary.
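In practice this distribution typically comes from a softmax over per-word scores. The following sketch only illustrates that normalization step; the logits are arbitrary stand-ins for a real model's output:

```python
import numpy as np

def next_word_distribution(logits: np.ndarray) -> np.ndarray:
    """Softmax over vocabulary scores -> P(x_{t+1} = v_j | x_t, ..., x_1)."""
    z = logits - logits.max()   # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Toy vocabulary of 5 words with arbitrary scores standing in for a real model.
logits = np.array([2.0, 0.5, -1.0, 0.0, 1.5])
probs = next_word_distribution(logits)
print(probs, probs.sum())   # a valid distribution over the vocabulary: sums to 1
```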
The predicted output vector \(\hat{y}^t \in \mathbb{R}^{|V|}\) is a probability distribution over the vocabulary, and we optimize the cross-entropy loss:

\[CE(y^t, \hat{y}^t) = -\sum_{j=1}^{|V|} y_j^t \log \hat{y}_j^t,\]

where \(y^t\) is the one-hot vector corresponding to the target word. This is a point-wise loss, and we sum the cross-entropy loss across all examples in a sequence, and across all sequences in the dataset, in order to evaluate model performance.
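Because \(y^t\) is one-hot, the sum over the vocabulary collapses to a single term. A minimal sketch with toy numbers:

```python
import numpy as np

def cross_entropy(y: np.ndarray, y_hat: np.ndarray) -> float:
    """CE(y, y_hat) = -sum_j y_j * log(y_hat_j), with y one-hot."""
    return -float(np.sum(y * np.log(y_hat)))

y_hat = np.array([0.1, 0.2, 0.6, 0.1])   # model's predicted distribution, |V| = 4
y = np.zeros(4)
y[2] = 1.0                               # one-hot target: correct word at index 2

print(cross_entropy(y, y_hat))   # equals -log(0.6): only the target entry survives
```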
The Relationship Between Cross-Entropy and Perplexity
Perplexity at a single time step is:

\[PP(y^t, \hat{y}^t) = \frac{1}{P(x_{t+1} \mid x_t, \cdots, x_1)},\]

which is the inverse probability of the correct word \(x_{t+1}\), according to the model distribution \(P\).
Suppose \(y_i^t\) is the only nonzero element of \(y^t\), i.e., index \(i\) marks the correct word. Then, note that:

\[CE(y^t, \hat{y}^t) = -\log \hat{y}_i^t = \log \frac{1}{\hat{y}_i^t} = \log PP(y^t, \hat{y}^t).\]

Then, it follows that:

\[PP(y^t, \hat{y}^t) = \exp\left(CE(y^t, \hat{y}^t)\right).\]
In fact, minimizing the arithmetic mean of the cross-entropy is identical to minimizing the geometric mean of the perplexity. If the model's predictions are completely random, \(E[\hat{y}_i^t] = \frac{1}{|V|}\), and the expected cross-entropy is \(\log |V|\) (for \(|V| = 10000\), \(\log 10000 \approx 9.21\)).
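Both claims are easy to verify numerically. In this sketch the per-step probabilities of the correct word are made up; it checks that exponentiating the average cross-entropy reproduces the geometric mean of the per-step perplexities, and prints the uniform-model baseline:

```python
import numpy as np

# Hypothetical probabilities the model assigns to the correct word at each step.
p_correct = np.array([0.3, 0.05, 0.6, 0.1])

ce = -np.log(p_correct)   # per-step cross-entropies
pp = 1.0 / p_correct      # per-step perplexities (inverse probability of the correct word)

# exp(arithmetic mean of the cross-entropies) == geometric mean of the perplexities
print(np.exp(ce.mean()), pp.prod() ** (1.0 / len(pp)))   # the two values match

# Uniformly random predictions: each correct word gets probability 1/|V|,
# so the expected cross-entropy is log|V|.
V = 10000
print(np.log(V))   # ~9.21
```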