NLP Evaluation
Perplexity of a probability distribution
Natural language generation evaluation
There is more than one effective way to say most things, which makes generated text harder to evaluate than tasks with a single correct answer.
Given a sequence \(\mathbf{x} = [x_1,\dots,x_N]\) of length N and a probability distribution \(p\):
\[PP(p, \mathbf{x}) = \prod_{i=1}^{N} \left( \frac{1}{p(x_i)} \right)^{\frac{1}{N}}\]which is the geometric mean of the inverse probabilities the model assigns to each token.
\[PP(X) = 2^{H(X)}\]We wish to minimize perplexity. This is equivalent to exponentiating the cross-entropy loss.
Does the model assign high probability to the input sequence?
Weaknesses: heavily dependent on the underlying vocabulary. Perplexity can be reduced just by changing the size of the vocabulary, so comparisons across models with different vocabularies, or across datasets with different vocabularies, are not meaningful.
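As a concrete illustration, here is a minimal sketch, assuming we already have the per-token probabilities \(p(x_i)\) that the model assigned to the sequence (in an autoregressive model these would be conditional probabilities). The function and variable names are illustrative, not from any library; base-2 logs are used so the result matches the \(2^{H(X)}\) form above.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence given the probability the model assigned
    to each token: the geometric mean of the inverse probabilities,
    computed in log space as 2 ** (cross-entropy in bits per token)."""
    n = len(token_probs)
    cross_entropy = sum(-math.log2(p) for p in token_probs) / n  # H(X) in bits/token
    return 2 ** cross_entropy                                    # PP = 2^H(X)

# Toy example: probabilities assigned to a 4-token sequence.
probs = [0.25, 0.1, 0.5, 0.05]
print(perplexity(probs))  # ~6.3; lower is better, 1.0 would be a perfect model
```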
N-Gram Based Methods
Edit distance (a measure of the distance between strings), BLEU, ROUGE.
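To make the n-gram flavor concrete, here is a minimal sketch, not tied to any particular library: a Levenshtein edit distance, and a clipped n-gram precision, which is the core quantity BLEU builds on (BLEU additionally averages over several n and applies a brevity penalty; ROUGE is the recall-oriented counterpart). Function names are illustrative.

```python
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision: what fraction of the candidate's n-grams
    also appear in the reference (counts clipped by the reference counts)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

print(edit_distance("kitten", "sitting"))                                   # 3
print(ngram_precision("the cat sat".split(), "the cat sat down".split()))   # 1.0
```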
Modern Benchmarks
Image Reasoning
- MMMU
- MathVista
Image Understanding
- ChartQA
- DocVQA
- Vibe-Eval (Reka)
Coding
- LiveCodeBench v5
- Aider Polyglot (Paul Gauthier)
- SWE-bench Verified (from the OpenAI blog)
Reasoning & Knowledge
- MMLU Pro
- GPQA Diamond
- Math-500
- Humanity’s Last Exam
Long Context
- MTOB (half book)
- MTOB (full book)
- MRCR
Multilingual
- Multilingual MMLU
- Global MMLU (Lite)
Mathematics
- AIME 2024
- AIME 2025