NLP Evaluation
Perplexity of a probability distribution
Natural language generation evaluation
There is more than one effective way to say most things, which makes generated text harder to evaluate than tasks with a single correct answer.
Given a sequence \(\mathbf{x} = [x_1,\dots,x_N]\) of length N and a probability distribution \(p\):
\[PP(p, \mathbf{x}) = \prod_{i=1}^{N} \left( \frac{1}{p(x_i)} \right)^{\frac{1}{N}}\]which is the geometric mean of the inverse probabilities the model assigns to each token.
\[PP(X) = 2^{H(X)}\]We wish to minimize perplexity. This is equivalent to exponentiating the cross-entropy loss.
Does the model assign high probability to the input sequence?
Weaknesses: heavily dependent on the underlying vocabulary. Perplexity can be reduced just by changing the size of the vocabulary, so comparisons across models with different vocabularies, or across datasets with different vocabularies, are not meaningful.
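As a concrete illustration, here is a minimal sketch, assuming we already have the per-token probabilities \(p(x_i)\) that the model assigned to the sequence (in an autoregressive model these would be conditional probabilities). The function and variable names are illustrative, not from any library; base-2 logs are used so the result matches the \(2^{H(X)}\) form above.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence given the probability the model assigned
    to each token: the geometric mean of the inverse probabilities,
    computed in log space as 2 ** (cross-entropy in bits per token)."""
    n = len(token_probs)
    cross_entropy = sum(-math.log2(p) for p in token_probs) / n  # H(X) in bits/token
    return 2 ** cross_entropy                                    # PP = 2^H(X)

# Toy example: probabilities assigned to a 4-token sequence.
probs = [0.25, 0.1, 0.5, 0.05]
print(perplexity(probs))  # ~6.3; lower is better, 1.0 would be a perfect model
```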
N-Gram Based Methods
Edit distance (a measure of the distance between strings), BLEU, ROUGE.
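To make the n-gram flavor concrete, here is a minimal sketch, not tied to any particular library: a Levenshtein edit distance, and a clipped n-gram precision, which is the core quantity BLEU builds on (BLEU additionally averages over several n and applies a brevity penalty; ROUGE is the recall-oriented counterpart). Function names are illustrative.

```python
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision: what fraction of the candidate's n-grams
    also appear in the reference (counts clipped by the reference counts)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

print(edit_distance("kitten", "sitting"))                                   # 3
print(ngram_precision("the cat sat".split(), "the cat sat down".split()))   # 1.0
```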
Modern Benchmarks
Image Reasoning
- MMMU
- MathVista
Image Understanding
- ChartQA
- DocVQA
- Vibe-Eval (Reka)
Coding
- LiveCodeBench v5
- Aider Polyglot (Paul Gauthier)
- SWE-bench Verified (from the OpenAI blog)
Reasoning & Knowledge
- MMLU Pro
- GPQA Diamond
- Math-500
- Humanity’s Last Exam
Long Context
- MTOB (half book)
- MTOB (full book)
- MRCR
Multilingual
- Multilingual MMLU
- Global MMLU (Lite)
Mathematics
- AIME 2024
- AIME 2025