Perplexity of a probability distribution

Natural language generation evaluation

There is more than one effective way to say most things, which makes evaluating generated text harder than checking a single correct answer.

Given a sequence \(\mathbf{x} = [x_1,\dots,x_N]\) of length N and a probability distribution \(p\):

\[PP(p, \mathbf{x}) = \prod_{i=1}^{N} \Bigg( \frac{1}{p(x_i)} \Bigg)^{\frac{1}{N}}\]

i.e., the geometric mean of the inverse probabilities the model assigns to each token.
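As a minimal sketch (assuming we already have the per-token probabilities \(p(x_i)\) from some model), the product above can be computed in log space for numerical stability:

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence given the probability the model assigned
    to each token: the geometric mean of the inverse probabilities.

    Computed in log space to avoid underflow on long sequences.
    """
    n = len(token_probs)
    # sum of log(1 / p(x_i)) = -sum of log p(x_i)
    neg_log_prob = -sum(math.log(p) for p in token_probs)
    return math.exp(neg_log_prob / n)

# Example: a model that assigns probability 0.25 to every token has
# perplexity 4 (it is "as confused as" a uniform choice over 4 options).
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # -> 4.0
```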

\[perplexity(X) = 2^{H(X)}\]

We wish to minimize perplexity. It is equivalent to exponentiating the cross-entropy loss.
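A sketch of the equivalence, reusing the per-token probabilities from above: the average negative log-probability is the cross-entropy \(H\), and exponentiating it (base 2 if \(H\) is in bits, base \(e\) if the loss is in nats) gives the same perplexity:

```python
import math

def perplexity_from_cross_entropy(token_probs):
    """Perplexity as the exponentiation of the average negative
    log-probability (cross-entropy) of the sequence."""
    n = len(token_probs)
    # Cross-entropy in bits (base-2 logs) ...
    h_bits = -sum(math.log2(p) for p in token_probs) / n
    # ... and in nats (natural logs), as used by typical training losses.
    h_nats = -sum(math.log(p) for p in token_probs) / n
    # Both give the same number: 2**h_bits == e**h_nats.
    return 2 ** h_bits, math.exp(h_nats)

print(perplexity_from_cross_entropy([0.25, 0.25, 0.25, 0.25]))  # -> (4.0, 4.0)
```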

Does the model assign high probability to the input sequence?

Weaknesses: perplexity is heavily dependent on the underlying vocabulary; it can be reduced simply by changing the vocabulary size. (Comparisons are not meaningful across different vocabularies, or across datasets tokenized with different vocabularies.)

N-Gram Based Methods

Edit distance (a measure of the distance between two strings), BLEU, and ROUGE.
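As illustration (a hand-rolled sketch, not a reference implementation of BLEU or ROUGE), Levenshtein edit distance and the kind of clipped n-gram precision that BLEU aggregates might look like this:

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision: fraction of candidate n-grams that also
    appear in the reference (counts clipped to the reference counts).
    BLEU combines precisions like this over several values of n."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

print(edit_distance("kitten", "sitting"))  # -> 3
print(ngram_precision("the cat sat".split(), "the cat sat down".split()))  # -> 1.0
```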

MMLU