If you've ever used autocomplete on your phone, or wondered how Spotify knows what song you'll probably want to queue next, you've already bumped into n-grams; you just didn't know it. N-grams are one of those concepts that sound intimidating on paper but are honestly quite elegant once you see them. They sit right at the intersection of statistics and language, and they're one of the foundational building blocks of natural language processing (NLP). Let's get into it.
So What Even Is an N-gram?
An n-gram is a contiguous sequence of n items from a given sequence. That sequence could be characters, words, syllables — basically anything. The n is just a number. So:
- A unigram is a sequence of 1 item (n=1)
- A bigram is a sequence of 2 items (n=2)
- A trigram is a sequence of 3 items (n=3)
- And so on...
Let's say you have the sentence: "the cat sat on the mat"
The bigrams (word-level) would be:
("the", "cat")
("cat", "sat")
("sat", "on")
("on", "the")
("the", "mat")
See the pattern? You slide a window of size n across the text, one step at a time. That's literally it. Let's build this in Python.
Building N-grams From Scratch
No libraries, no magic. Just pure Python.
def ngrams(tokens, n):
return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
text = "the cat sat on the mat"
tokens = text.split()
print(ngrams(tokens, 1)) # unigrams
print(ngrams(tokens, 2)) # bigrams
print(ngrams(tokens, 3)) # trigrams
Output:
[('the',), ('cat',), ('sat',), ('on',), ('the',), ('mat',)]
[('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
[('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'), ('on', 'the', 'mat')]
Clean and dead simple. The list comprehension slides the window across, and tuple() makes each gram hashable (useful later when you want to count them).
Character N-grams
You're not limited to words. Character n-grams are super useful for things like spell checking, identifying languages, and even fuzzy string matching. Same idea, different granularity.
def char_ngrams(text, n):
return [text[i:i+n] for i in range(len(text) - n + 1)]
word = "banana"
print(char_ngrams(word, 2))
print(char_ngrams(word, 3))
Output:
['ba', 'an', 'na', 'an', 'na']
['ban', 'ana', 'nan', 'ana']
Notice 'ana' appears twice — that's the frequency information baked right in. Language has patterns, and n-grams let you measure them.
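To put a number on that, here's a quick preview of the counting we'll do in the next section:

from collections import Counter

print(Counter(char_ngrams("banana", 3)))
# Counter({'ana': 2, 'ban': 1, 'nan': 1}) on recent Pythons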
Counting Frequencies
Raw n-grams are fine, but things get interesting when you start counting how often each one appears. This gives you a frequency distribution, which is the backbone of a lot of statistical NLP.
from collections import Counter
text = "to be or not to be that is the question to be"
tokens = text.split()
bigram_counts = Counter(ngrams(tokens, 2))
for gram, count in bigram_counts.most_common(5):
print(f"{gram}: {count}")
Output:
('to', 'be'): 3
('be', 'or'): 1
('or', 'not'): 1
('not', 'to'): 1
('be', 'that'): 1
('to', 'be') shows up 3 times — makes sense given the text. Counter from Python's standard library does the heavy lifting here; super convenient.
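Those counts are one short step away from probabilities. Here's a minimal sketch (the helper name is mine, not a standard one) of the maximum-likelihood estimate P(next | current) = count(current, next) / count(current):

unigram_counts = Counter(tokens)

def bigram_probability(w1, w2):
    # MLE estimate; ignores the edge case where w1 is the final token
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_probability("to", "be"))  # 1.0: every "to" in this text is followed by "be"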
A Simple Language Model
Here's where it gets genuinely cool. You can use bigram frequencies to predict the next word given the current one. It's a super primitive language model, but it's the same idea behind GPT and friends, just scaled up by a few orders of magnitude (and a lot of GPU hours).
from collections import defaultdict
import random
def build_bigram_model(tokens):
model = defaultdict(list)
for w1, w2 in ngrams(tokens, 2):
model[w1].append(w2)
return model
def generate_text(model, start_word, length=10):
current = start_word
result = [current]
for _ in range(length - 1):
next_words = model.get(current)
if not next_words:
break
current = random.choice(next_words)
result.append(current)
return " ".join(result)
corpus = "the cat sat on the mat the cat ate the rat the rat ran off the mat".split()
model = build_bigram_model(corpus)
print(generate_text(model, "the"))
Each step just picks a random successor from the words that followed the current word in training (since duplicates stay in the list, random.choice naturally favors more frequent successors). It's probabilistic, it's wasteful, and it's a terrible language model for production — but it works, and it helps you understand what's actually going on under the hood of the real stuff.
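If you want slightly more coherent output, the same pattern extends to longer contexts. Here's a sketch of a trigram version (function names are mine) that keys on the previous two words instead of one:

def build_trigram_model(tokens):
    # map (w1, w2) -> every word observed right after that pair
    model = defaultdict(list)
    for w1, w2, w3 in ngrams(tokens, 3):  # reuses ngrams() from earlier
        model[(w1, w2)].append(w3)
    return model

def generate_from_trigrams(model, w1, w2, length=10):
    result = [w1, w2]
    for _ in range(length - 2):
        next_words = model.get((w1, w2))
        if not next_words:
            break
        w1, w2 = w2, random.choice(next_words)
        result.append(w2)
    return " ".join(result)

print(generate_from_trigrams(build_trigram_model(corpus), "the", "cat"))

More context means fewer, more faithful continuations, which previews exactly the trade-off discussed next: most (w1, w2) pairs will only ever have one observed successor.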
The Sparsity Problem
N-grams have a fundamental Achilles' heel: sparsity. As n gets larger, the number of possible sequences explodes combinatorially: with a 10,000-word vocabulary there are 10^8 possible bigrams and 10^12 possible trigrams. Most n-grams you'll encounter in real text appear exactly once, or not at all, which means your model has no idea what to do with sequences it hasn't seen before.
A classic workaround is smoothing — you add a small probability to unseen n-grams so they don't get a probability of zero. The simplest variant is Laplace (add-one) smoothing:
def laplace_probability(ngram, corpus_counts, vocab_size):
    # P(w_n | prefix) with add-one smoothing:
    # (count(ngram) + 1) / (count(prefix) + |V|)
    prefix = ngram[:-1]
    prefix_count = sum(v for k, v in corpus_counts.items() if k[:-1] == prefix)
    ngram_count = corpus_counts.get(ngram, 0)
    return (ngram_count + 1) / (prefix_count + vocab_size)
tokens = "the cat sat on the mat the cat".split()
counts = Counter(ngrams(tokens, 2))
vocab = set(tokens)
prob = laplace_probability(("the", "cat"), counts, len(vocab))
print(f"P(cat | the) ≈ {prob:.4f}")
It's a blunt instrument but it gets the job done for small-scale stuff.
Using NLTK
If you don't want to reinvent the wheel (sometimes you really should, just to understand it), NLTK has n-gram utilities built in.
from nltk.util import ngrams as nltk_ngrams
from nltk.tokenize import word_tokenize
# first run may need the tokenizer models: import nltk; nltk.download("punkt")
text = "natural language processing is pretty fascinating"
tokens = word_tokenize(text)
trigrams = list(nltk_ngrams(tokens, 3))
print(trigrams)
Output:
[('natural', 'language', 'processing'),
('language', 'processing', 'is'),
('processing', 'is', 'pretty'),
('is', 'pretty', 'fascinating')]
Same idea, cleaner tokenization (handles punctuation more gracefully), and integrates nicely with the rest of the NLTK ecosystem.
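One nicety worth knowing: nltk_ngrams can pad sequence boundaries for you, so a model can learn which words tend to start and end sentences. In the NLTK versions I've used, it looks like this:

padded = list(nltk_ngrams(tokens, 2,
                          pad_left=True, pad_right=True,
                          left_pad_symbol="<s>", right_pad_symbol="</s>"))
print(padded[0], padded[-1])
# ('<s>', 'natural') ('fascinating', '</s>')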
Where N-grams Actually Show Up
You'd be surprised how many things quietly use n-grams under the hood:
- Autocomplete & predictive keyboards — bigram/trigram models trained on massive text corpora
- Spam filters — character n-grams of suspicious words get flagged
- Plagiarism detection — compare n-gram overlap between two documents (see the sketch below)
- Search engines — query expansion and typo correction both lean on n-grams
- DNA sequence analysis — same sliding window trick, different alphabet (CGAT instead of words)
- Intrusion detection — system call n-grams to detect anomalous process behavior (this one is kinda wild tbh)
The DNA and system call applications are where it gets really interesting. The concept generalizes completely beyond text: anything that is a sequence is fair game.
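To ground one of those: the plagiarism-detection idea is just set overlap on n-grams. A minimal sketch (the function and threshold-free Jaccard formulation are my own illustration):

def jaccard_ngram_similarity(a, b, n=3):
    # fraction of distinct word trigrams the two texts share
    grams_a = set(ngrams(a.split(), n))
    grams_b = set(ngrams(b.split(), n))
    if not (grams_a or grams_b):
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

print(jaccard_ngram_similarity("the cat sat on the mat",
                               "the cat sat on the rug"))
# 0.6: the two sentences share 3 of 5 distinct trigrams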
The Takeaway
N-grams are one of those beautiful ideas that are simple enough to implement in ten lines of Python, yet powerful enough to underlie serious production systems. They're a great entry point into NLP because they don't require any fancy mathematics, just counting and a bit of probability intuition. Once you get n-grams, sliding window attention in transformers starts making a lot more sense too. Start simple, understand the primitives, build up from there. That's the move.
Now go tokenize something.