Perplexity of a probability distribution

Natural language generation evaluation

There is more than one effective way to say most things, which makes generated text hard to score against a single reference.

Given a sequence \(\mathbf{x} = [x_1,\dots,x_N]\) of length N and a probability distribution \(p\):

\[PP(p, \mathbf{x}) = \prod_{i=1}^{N} \left( \frac{1}{p(x_i)} \right)^{\frac{1}{N}}\]

i.e., the geometric mean of the inverse probabilities the model assigns to each token.
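A minimal sketch of computing this quantity from per-token probabilities (the function name and inputs are illustrative, not from the notes); it works in log space to avoid underflow on long sequences:

```python
import math

def perplexity(token_probs):
    """Perplexity as the geometric mean of inverse per-token probabilities."""
    n = len(token_probs)
    # Sum of -log p(x_i), averaged over the sequence, then exponentiated.
    log_pp = -sum(math.log(p) for p in token_probs) / n
    return math.exp(log_pp)

# A model that assigns each of 4 tokens probability 0.25 has perplexity 4:
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```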

\[perplexity(X) = 2^{H(X)}\]

We wish to minimize perplexity; it is equivalent to exponentiating the cross-entropy loss.
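A small check (names are illustrative) that exponentiating the average negative log-likelihood gives the same number as the product form above, whether you use \(e\) with entropy in nats or \(2\) with entropy in bits:

```python
import math

def cross_entropy(token_probs):
    """Average negative log-likelihood (in nats) under the model."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

probs = [0.25, 0.25, 0.25, 0.25]
H_nats = cross_entropy(probs)
print(math.exp(H_nats))             # 4.0 -- matches the product form
print(2 ** (H_nats / math.log(2)))  # 4.0 -- 2^H(X) with H in bits
```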

Does the model assign high probability to the input sequence?

Weaknesses: perplexity is heavily dependent on the underlying vocabulary. You can reduce perplexity just by changing the size of the vocabulary, so comparisons across models with different vocabularies, or across datasets tokenized with different vocabularies, are not meaningful.

N-Gram Based Methods

Edit distance (a measure of the distance between two strings), BLEU, ROUGE. A sketch of edit distance follows.
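Edit distance is the simplest of these to implement from scratch; a minimal sketch of the standard dynamic-programming (Levenshtein) version:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string `a` into string `b`."""
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (or match)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```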

Modern Benchmarks

Image Reasoning

  • MMMU
  • MathVista

Image Understanding

  • ChartQA
  • DocVQA
  • Vibe-Eval (Reka)

Coding

  • LiveCodeBench v5
  • Aider Polyglot (Paul Gauthier)
  • SWE-bench Verified (from the OpenAI blog)

Reasoning & Knowledge

  • MMLU Pro
  • GPQA Diamond
  • Math-500
  • Humanity’s Last Exam

Long Context

  • MTOB (half book)
  • MTOB (full book)
  • MRCR

Multilingual

  • Multilingual MMLU
  • Global MMLU (Lite)

Mathematics

  • AIME 2024
  • AIME 2025