Evaluating Language Models: An Introduction to Perplexity in NLP

A chore

Imagine you’re trying to build a chatbot that helps home cooks autocomplete their grocery shopping lists based on popular flavor combinations from social media. Your goal is to let users type in what they have in their fridge, like “chicken, carrots,” then list the five or six ingredients that go best with those flavors. You’ve already scraped thousands of recipe sites for ingredient lists, and now you just need to choose the best NLP model to predict which words appear together most often. Easy, right?

Calculating perplexity

To understand how perplexity is calculated, let’s start with a very simple version of the recipe training dataset that only has four short ingredient lists:

  1. chicken, butter, pears
  2. chicken, butter, chili
  3. lemon, pears, shrimp
  4. chili, shrimp, lemon
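As a minimal sketch of the calculation, here is the standard perplexity formula, P(w1..wN) ** (-1/N), applied to the toy dataset above using a simple unigram model (the choice of a unigram model is illustrative, not prescribed by the text):

```python
import math
from collections import Counter

# Toy training dataset of ingredient lists (from the text above).
corpus = [
    ["chicken", "butter", "pears"],
    ["chicken", "butter", "chili"],
    ["lemon", "pears", "shrimp"],
    ["chili", "shrimp", "lemon"],
]

# Fit a unigram model: P(w) = count(w) / total tokens.
counts = Counter(w for doc in corpus for w in doc)
total = sum(counts.values())  # 12 tokens, 6 distinct words

def perplexity(tokens):
    """Inverse probability of the sequence, normalized by length:
    P(w1..wN) ** (-1/N), computed in log space for stability."""
    log_prob = sum(math.log(counts[w] / total) for w in tokens)
    return math.exp(-log_prob / len(tokens))

print(perplexity(["chicken", "butter", "pears"]))  # 6.0
```

Each of the six distinct words appears exactly twice, so the unigram model assigns every word probability 1/6 and the perplexity of any ingredient list comes out to 6.0: the model is, on average, as "perplexed" as if it were choosing uniformly among six options.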

Interpreting perplexity

The word "likely" is important here: unlike a simple metric such as prediction accuracy, lower perplexity isn't guaranteed to translate into better model performance, for at least two reasons.
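One way to see how confidence and accuracy can diverge: a hypothetical sketch in which two models make the same top-1 prediction, and so score the same accuracy, but assign it different probabilities (the vocabulary and probability values below are invented for illustration):

```python
import math

# Hypothetical next-word distributions from two models over the same
# small vocabulary. Both rank "butter" first.
model_a = {"butter": 0.70, "chili": 0.10, "lemon": 0.10, "pears": 0.10}
model_b = {"butter": 0.40, "chili": 0.20, "lemon": 0.20, "pears": 0.20}

true_next = "butter"

# Same top-1 prediction, so identical prediction accuracy...
assert max(model_a, key=model_a.get) == max(model_b, key=model_b.get)

# ...but different perplexity on the true word (inverse probability).
ppl_a = math.exp(-math.log(model_a[true_next]))  # 1 / 0.70, about 1.43
ppl_b = math.exp(-math.log(model_b[true_next]))  # 1 / 0.40, i.e. 2.50
print(ppl_a < ppl_b)  # True: lower perplexity, same accuracy
```

Perplexity rewards the more confident model even when both would make exactly the same predictions.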

Perplexity in the real world

You can see similar, if more subtle, problems when you use perplexity to evaluate models trained on real-world datasets like the One Billion Word Benchmark. This corpus was put together from thousands of online news articles published in 2011, all broken down into their component sentences. It's designed as a standardized test dataset that allows researchers to directly compare different models trained on different data, and perplexity is a popular benchmark choice.

Perplexity in a nutshell


  • Fast to calculate, allowing researchers to weed out models that are unlikely to perform well in expensive/time-consuming real-world testing
  • Useful as an estimate of the model’s uncertainty/information density
  • Not good for final evaluation, since it just measures the model’s confidence, not its accuracy
  • Hard to make apples-to-apples comparisons across datasets with different context lengths, vocabulary sizes, word- vs. character-based models, etc.
  • Can end up rewarding models that mimic toxic or outdated datasets
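As a rough illustration of the apples-to-apples problem in the list above: for a model that is completely uncertain (uniform over its vocabulary), perplexity simply equals the vocabulary size, so raw perplexity numbers from datasets with very different vocabularies aren't directly comparable. A sketch, with made-up vocabulary sizes:

```python
import math

# Perplexity of a maximally uncertain model: uniform probability
# 1/V for every token, so perplexity works out to exactly V.
def uniform_perplexity(vocab_size, seq_len=10):
    log_prob = seq_len * math.log(1.0 / vocab_size)
    return math.exp(-log_prob / seq_len)

print(round(uniform_perplexity(6)))        # 6 (toy ingredient vocabulary)
print(round(uniform_perplexity(100_000)))  # 100000 (news-scale vocabulary)
```

The same "know-nothing" model scores 6 on the toy dataset and 100,000 on a news-scale vocabulary, which is why perplexities should only be compared between models evaluated on the same data.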



Surge AI

The world’s most powerful data labeling platform, designed from the ground up for stunning AI. https://www.surgehq.ai