If you've ever used autocomplete on your phone, or wondered how Spotify knows what song you'll probably want to queue next, you've already bumped into n-grams; you just didn't know it. N-grams are one of those concepts that sound intimidating on paper but are honestly quite elegant once you see them. They sit right at the intersection of statistics and language, and they're one of the foundational building blocks of natural language processing (NLP). Let's get into it.
So What Even Is an N-gram?
An n-gram is a contiguous sequence of n items from a given sequence. That sequence could be characters, words, syllables — basically anything. The n is just a number. So:
- A unigram is a sequence of 1 item (n=1)
- A bigram is a sequence of 2 items (n=2)
- A trigram is a sequence of 3 items (n=3)
- And so on...
Let's say you have the sentence: "the cat sat on the mat"
The bigrams (word-level) would be:
("the", "cat")
("cat", "sat")
("sat", "on")
("on", "the")
("the", "mat")
See the pattern? You slide a window of size n across the text, one step at a time. That's literally it. Let's build this in Python.
Building N-grams From Scratch
No libraries, no magic. Just pure Python.
def ngrams(tokens, n):
return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
text = "the cat sat on the mat"
tokens = text.split()
print(ngrams(tokens, 1)) # unigrams
print(ngrams(tokens, 2)) # bigrams
print(ngrams(tokens, 3)) # trigrams
Output:
[('the',), ('cat',), ('sat',), ('on',), ('the',), ('mat',)]
[('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
[('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'), ('on', 'the', 'mat')]
Clean and dead simple. The list comprehension slides the window across, and tuple() makes each gram hashable (useful later when you want to count them).
Character N-grams
You're not limited to words. Character n-grams are super useful for things like spell checking, identifying languages, and even fuzzy string matching. Same idea, different granularity.
def char_ngrams(text, n):
return [text[i:i+n] for i in range(len(text) - n + 1)]
word = "banana"
print(char_ngrams(word, 2))
print(char_ngrams(word, 3))
Output:
['ba', 'an', 'na', 'an', 'na']
['ban', 'ana', 'nan', 'ana']
Notice 'ana' appears twice — that's the frequency information baked right in. Language has patterns, and n-grams let you measure them.
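To put a number on that, here's a quick preview of the counting we'll do in the next section:

from collections import Counter

print(Counter(char_ngrams("banana", 3)))
# Counter({'ana': 2, 'ban': 1, 'nan': 1}) on recent Pythons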
Counting Frequencies
Raw n-grams are fine, but things get interesting when you start counting how often each one appears. This gives you a frequency distribution, which is the backbone of a lot of statistical NLP.
from collections import Counter
text = "to be or not to be that is the question to be"
tokens = text.split()
bigram_counts = Counter(ngrams(tokens, 2))
for gram, count in bigram_counts.most_common(5):
print(f"{gram}: {count}")
Output:
('to', 'be'): 3
('be', 'or'): 1
('or', 'not'): 1
('not', 'to'): 1
('be', 'that'): 1
('to', 'be') shows up 3 times — makes sense given the text. Counter from Python's standard library does the heavy lifting here; super convenient.
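Those counts are one short step away from probabilities. Here's a minimal sketch (the helper name is mine, not a standard one) of the maximum-likelihood estimate P(next | current) = count(current, next) / count(current):

unigram_counts = Counter(tokens)

def bigram_probability(w1, w2):
    # MLE estimate; ignores the edge case where w1 is the final token
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_probability("to", "be"))  # 1.0: every "to" in this text is followed by "be"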
A Simple Language Model
Here's where it gets genuinely cool. You can use bigram frequencies to predict the next word given the current one. It's a super primitive language model, but it's the same idea behind GPT and friends, just scaled up by a few orders of magnitude (and a lot of GPU hours).
from collections import defaultdict
import random
def build_bigram_model(tokens):
model = defaultdict(list)
for w1, w2 in ngrams(tokens, 2):
model[w1].append(w2)
return model
def generate_text(model, start_word, length=10):
current = start_word
result = [current]
for _ in range(length - 1):
next_words = model.get(current)
if not next_words:
break
current = random.choice(next_words)
result.append(current)
return " ".join(result)
corpus = "the cat sat on the mat the cat ate the rat the rat ran off the mat".split()
model = build_bigram_model(corpus)
print(generate_text(model, "the"))
Each step just picks a random successor from the words that followed the current word in training (since duplicates stay in the list, random.choice naturally favors more frequent successors). It's probabilistic, it's wasteful, and it's a terrible language model for production — but it works, and it helps you understand what's actually going on under the hood of the real stuff.
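If you want slightly more coherent output, the same pattern extends to longer contexts. Here's a sketch of a trigram version (function names are mine) that keys on the previous two words instead of one:

def build_trigram_model(tokens):
    # map (w1, w2) -> every word observed right after that pair
    model = defaultdict(list)
    for w1, w2, w3 in ngrams(tokens, 3):  # reuses ngrams() from earlier
        model[(w1, w2)].append(w3)
    return model

def generate_from_trigrams(model, w1, w2, length=10):
    result = [w1, w2]
    for _ in range(length - 2):
        next_words = model.get((w1, w2))
        if not next_words:
            break
        w1, w2 = w2, random.choice(next_words)
        result.append(w2)
    return " ".join(result)

print(generate_from_trigrams(build_trigram_model(corpus), "the", "cat"))

More context means fewer, more faithful continuations, which previews exactly the trade-off discussed next: most (w1, w2) pairs will only ever have one observed successor.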
The Sparsity Problem
N-grams have a fundamental Achilles' heel: sparsity. As n gets larger, the number of possible sequences explodes combinatorially: with a 10,000-word vocabulary there are 10^8 possible bigrams and 10^12 possible trigrams. Most n-grams you'll encounter in real text appear exactly once, or not at all, which means your model has no idea what to do with sequences it hasn't seen before.
A classic workaround is smoothing — you add a small probability to unseen n-grams so they don't get a probability of zero. The simplest variant is Laplace (add-one) smoothing:
def laplace_probability(ngram, corpus_counts, vocab_size):
    # P(w_n | prefix) with add-one smoothing:
    # (count(ngram) + 1) / (count(prefix) + |V|)
    prefix = ngram[:-1]
    prefix_count = sum(v for k, v in corpus_counts.items() if k[:-1] == prefix)
    ngram_count = corpus_counts.get(ngram, 0)
    return (ngram_count + 1) / (prefix_count + vocab_size)
tokens = "the cat sat on the mat the cat".split()
counts = Counter(ngrams(tokens, 2))
vocab = set(tokens)
prob = laplace_probability(("the", "cat"), counts, len(vocab))
print(f"P(cat | the) ≈ {prob:.4f}")
It's a blunt instrument but it gets the job done for small-scale stuff.
Using NLTK
If you don't want to reinvent the wheel (sometimes you really should, just to understand it), NLTK has n-gram utilities built in.
from nltk.util import ngrams as nltk_ngrams
from nltk.tokenize import word_tokenize
# first run may need the tokenizer models: import nltk; nltk.download("punkt")
text = "natural language processing is pretty fascinating"
tokens = word_tokenize(text)
trigrams = list(nltk_ngrams(tokens, 3))
print(trigrams)
Output:
[('natural', 'language', 'processing'),
('language', 'processing', 'is'),
('processing', 'is', 'pretty'),
('is', 'pretty', 'fascinating')]
Same idea, cleaner tokenization (handles punctuation more gracefully), and integrates nicely with the rest of the NLTK ecosystem.
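One nicety worth knowing: nltk_ngrams can pad sequence boundaries for you, so a model can learn which words tend to start and end sentences. In the NLTK versions I've used, it looks like this:

padded = list(nltk_ngrams(tokens, 2,
                          pad_left=True, pad_right=True,
                          left_pad_symbol="<s>", right_pad_symbol="</s>"))
print(padded[0], padded[-1])
# ('<s>', 'natural') ('fascinating', '</s>')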
Where N-grams Actually Show Up
You'd be surprised how many things quietly use n-grams under the hood:
- Autocomplete & predictive keyboards — bigram/trigram models trained on massive text corpora
- Spam filters — character n-grams of suspicious words get flagged
- Plagiarism detection — compare n-gram overlap between two documents (see the sketch below)
- Search engines — query expansion and typo correction both lean on n-grams
- DNA sequence analysis — same sliding window trick, different alphabet (CGAT instead of words)
- Intrusion detection — system call n-grams to detect anomalous process behavior (this one is kinda wild tbh)
The DNA and system call applications are where it gets really interesting. The concept generalizes completely beyond text: anything that is a sequence is fair game.
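To ground one of those: the plagiarism-detection idea is just set overlap on n-grams. A minimal sketch (the function and threshold-free Jaccard formulation are my own illustration):

def jaccard_ngram_similarity(a, b, n=3):
    # fraction of distinct word trigrams the two texts share
    grams_a = set(ngrams(a.split(), n))
    grams_b = set(ngrams(b.split(), n))
    if not (grams_a or grams_b):
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

print(jaccard_ngram_similarity("the cat sat on the mat",
                               "the cat sat on the rug"))
# 0.6: the two sentences share 3 of 5 distinct trigrams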
The Takeaway
N-grams are one of those beautiful ideas that are simple enough to implement in ten lines of Python, yet powerful enough to underlie serious production systems. They're a great entry point into NLP because they don't require any fancy mathematics, just counting and a bit of probability intuition. Once you get n-grams, sliding window attention in transformers starts making a lot more sense too. Start simple, understand the primitives, build up from there. That's the move.
Now go tokenize something.