A Cartoon Guide to Language Models in NLP (Part 1: Intuition)


Language models are a core component of NLP systems, from machine translation to speech recognition. Intuitively, you can think of language models as answering: “How English is this phrase?”

Wrecking a nice beach vs. recognizing speech… It all depends on context!

Intuition: Thinking Like an Alien

Imagine you’re an alien whose spaceship crashes into Earth. You’re far from home, and you need to blend in until the rescue team arrives. You want to pick up some food, maybe watch Squid Game to learn about human culture, and so you need to learn how to speak like an earthling first.

  • Assign a high probability to responses that lead to delicious food. (For example: “Two cheeseburgers and an order of fries”)
  • Assign a low probability to unintelligible responses that lead to fear, confusion, and a call to the Men in Black. (For example: “Fries Santa cheese dirt hello”)
A language model scoring two responses
  • Customer 1: “Two cheeseburgers”
  • Customer 2: “Two cheeseburgers”
  • Customer 3: “Two cheeseburgers”
  • Customer 4: “Fries”
  • Customer 5: “The daily special”
  • P(“two cheeseburgers”) = 3 / 5 (since “two cheeseburgers” was uttered in 3 out of the 5 interactions)
  • P(“fries”) = 1 / 5 (since “fries” was uttered in 1 out of the 5 interactions)
  • P(“the daily special”) = 1 / 5 (since “the daily special” was uttered in 1 out of the 5 interactions)
  • P(anything else) = 0
  • Customer 1: “two cheeseburgers”, “cheeseburgers two”
  • Customer 2: “two cheeseburgers”, “cheeseburgers two”
  • Customer 3: “two cheeseburgers”, “cheeseburgers two”
  • Customer 4: “fries”
  • Customer 5: “the daily special”, “the special daily”, “daily the special”, “daily special the”, “special the daily”, “special daily the”
  • P(“two cheeseburgers”) = P(“cheeseburgers two”) = 3 / 13
  • P(“fries”) = 1 / 13
  • P(“the daily special”) = P(“the special daily”) = P(“daily the special”) = P(“daily special the”) = P(“special the daily”) = P(“special daily the”) = 1 / 13
  • P(anything else) = 0
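
The counting behind both sets of probabilities above can be sketched in a few lines of Python. This is a minimal illustration rather than the post's own code: the names `train`, `with_word_orders`, and `score` are made up for this example, and calling the first model Robot A and the word-shuffling one Robot B is inferred from the evaluation examples later in the post.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

# The five utterances overheard at the counter (from the lists above).
observed = [
    "two cheeseburgers",
    "two cheeseburgers",
    "two cheeseburgers",
    "fries",
    "the daily special",
]

def train(utterances):
    """Turn raw counts into probabilities: P(u) = count(u) / total."""
    counts = Counter(utterances)
    total = sum(counts.values())
    return {u: Fraction(c, total) for u, c in counts.items()}

def with_word_orders(utterances):
    """Expand each utterance into every ordering of its words."""
    expanded = []
    for u in utterances:
        expanded.extend(" ".join(p) for p in permutations(u.split()))
    return expanded

robot_a = train(observed)                    # exact phrases only
robot_b = train(with_word_orders(observed))  # phrases plus all word orders

def score(model, utterance):
    """Unseen utterances get probability 0, as in the lists above."""
    return model.get(utterance, Fraction(0))

print(score(robot_a, "two cheeseburgers"))   # 3/5
print(score(robot_b, "cheeseburgers two"))   # 3/13
print(score(robot_b, "fries santa cheese"))  # 0
```

Using `Fraction` keeps the output in the same form as the hand-computed values, so the 3 / 5 and 3 / 13 above can be checked directly.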

Evaluating Language Models

One question, then, is: which of your robots performs better? Remember that “two cheeseburgers” and “cheeseburgers two” sound equally valid to an uninformed alien!

Human Evaluation

One approach is human evaluation. Because your robots are trying to imitate human language, why not ask humans how good their imitations are? So you stand outside Shake Shack, and every time a customer approaches, you ask each robot to generate an output and the customer to evaluate it. If the customer thinks it’s a good, human-like response, they’ll assign it a score of 1; otherwise, they’ll score it 0.

  • With the first customer: Robot A and Robot B both say “two cheeseburgers”. The customer thinks this is a very human response, so they score both as 1 (human-like). A: 1.0, B: 1.0.
  • With the second customer: Robot A says “fries” (score: 1), while Robot B says “cheeseburgers two” (score: 0). A: 1.0, B: 0.0.
  • With the third customer: Robot A says “fries” (score: 1), while Robot B says “daily the special” (score: 0). A: 1.0, B: 0.0.
One way to measure the quality of a language model is by asking human judges.
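
To turn the three judgments above into a single number per robot, one simple option is to average them. The post does not say how the scores are combined, so the averaging here is an assumption, and `mean_score` is a name invented for this sketch.

```python
# Judgments copied from the three customer interactions above:
# 1 = sounded human, 0 = did not.
judgments = {
    "Robot A": [1, 1, 1],
    "Robot B": [1, 0, 0],
}

def mean_score(scores):
    """Fraction of outputs the human judges accepted."""
    return sum(scores) / len(scores)

human_eval = {robot: mean_score(s) for robot, s in judgments.items()}
for robot, avg in human_eval.items():
    print(f"{robot}: {avg:.2f}")
# Robot A: 1.00
# Robot B: 0.33
```

With only three judgments each, the gap is suggestive rather than conclusive; a real evaluation would collect many more.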

Task-Specific Evaluation

Another approach would be to evaluate the outputs against a downstream, real-world task. In our alien situation, your goal is to get food from Shake Shack, so you could measure whether or not your robots help you achieve that goal.

  • Robot A goes up to the counter. When the cashier asks “What would you like to order today?”, it outputs “two cheeseburgers”. The cashier understands and gives it two cheeseburgers. Success! A: 1.0.
  • Robot A goes up to the counter again. This time, it says “fries”. The cashier understands and gives it a fresh bag of fries. Success again! A: 1.0.
  • Next, Robot B goes up to the counter and says “cheeseburgers two”. The cashier doesn’t understand, so gives it nothing. Failure! B: 0.0.
  • Robot B tries again with “the daily special”. The cashier understands this time, so gives it the Tuesday Taco. Success! B: 1.0.
In this task-based evaluation, the better language model leads to actual food.
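
The four trips to the counter above can be replayed with a toy stand-in for the cashier. Everything here is hypothetical scaffolding for illustration: the `MENU` dictionary, the `cashier` function, and the idea that the cashier understands an order only if it matches a menu phrase exactly.

```python
# Hypothetical stand-in for the cashier: the orders they understand and
# what they hand over. The menu contents are an assumption for illustration.
MENU = {
    "two cheeseburgers": "two cheeseburgers",
    "fries": "a bag of fries",
    "the daily special": "the Tuesday Taco",
}

def cashier(order):
    """Return the food for an understood order, or None otherwise."""
    return MENU.get(order.lower())

def task_success_rate(orders):
    """Fraction of orders that ended with food in hand."""
    outcomes = [cashier(o) is not None for o in orders]
    return sum(outcomes) / len(outcomes)

# The attempts described in the list above.
attempts = {
    "Robot A": ["two cheeseburgers", "fries"],
    "Robot B": ["cheeseburgers two", "the daily special"],
}
task_eval = {robot: task_success_rate(o) for robot, o in attempts.items()}
print(task_eval)  # {'Robot A': 1.0, 'Robot B': 0.5}
```

The point of the sketch is that the metric is defined by the task outcome (food or no food), not by how the sentence sounds.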

Intrinsic Evaluation and Perplexity

Human evaluations and task-based evaluations are often the most robust ways to measure your robots’ performance. But sometimes you want a quicker, rougher way of comparing language models; maybe you don’t have the means to get humans to score your robots’ output, and you can’t risk blowing your cover as an alien with a bad response at Shake Shack.

The better language model is less perplexed.
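
As a rough sketch of what “less perplexed” can mean in numbers, the code below uses one common definition of perplexity: the exponential of the average negative log-probability a model assigns to new human utterances. Two things are assumptions here, not part of the post: the averaging is done per utterance (in practice it is usually done per word or token), and the `held_out` utterances are invented for this example.

```python
import math

# Probabilities each robot learned, copied from the lists earlier in the post.
robot_a = {"two cheeseburgers": 3 / 5, "fries": 1 / 5, "the daily special": 1 / 5}
robot_b = {"two cheeseburgers": 3 / 13, "fries": 1 / 13, "the daily special": 1 / 13}
# (Robot B also spends probability on scrambled orders, omitted here
# because no human in this example says them.)

def perplexity(model, utterances):
    """exp of the average negative log-probability, per utterance."""
    total_log_prob = 0.0
    for u in utterances:
        p = model.get(u, 0.0)
        if p == 0.0:
            return math.inf  # the model thought this was impossible
        total_log_prob += math.log(p)
    return math.exp(-total_log_prob / len(utterances))

# Hypothetical new human utterances, invented for this example.
held_out = ["two cheeseburgers", "fries", "two cheeseburgers"]

print(round(perplexity(robot_a, held_out), 2))  # roughly 2.40
print(round(perplexity(robot_b, held_out), 2))  # roughly 6.25
```

Lower is better: Robot A is less surprised by what humans actually say, because it did not give away probability to word orders that humans never use. Note that a single unseen utterance sends this simple version to infinity, since both robots assign probability 0 to anything they have not observed.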


In summary, this post provided an overview of a few key concepts surrounding language models:

  • First, we defined a language model as an algorithm that scores how “human” a sentence is. (More formally, a language model maps pieces of text to probabilities.)
  • We described a way to train language models: by observing language and turning these observations into probabilities.
  • We discussed three approaches to evaluating the quality of language models: human evaluation (did the robot responses sound natural to a human?), downstream tasks (did the robot responses lead to actual food?), and intrinsic evaluation (how perplexed were the robots by the human utterances?).



Surge AI

The world’s most powerful data labeling platform, designed from the ground up for stunning AI. https://www.surgehq.ai