Learn AI Series (#57) - Language Modeling - Predicting the Next Word


What will I learn

  • You will learn what language modeling actually is and why it underpins every modern AI system;
  • statistical language models -- n-grams, how they work, and why they fall apart;
  • neural language models and why continuous embeddings change everything;
  • perplexity -- the standard metric for measuring language model quality;
  • the scaling hypothesis -- why bigger models are better, and what scaling laws tell us;
  • tokenization -- BPE, WordPiece, SentencePiece -- how models actually read text;
  • vocabulary size tradeoffs and their practical implications for model behavior.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#57) - Language Modeling - Predicting the Next Word

Welcome to Arc 4: Large Language Models. This is where everything we've built so far comes together -- where the transformer architecture from Arc 3 meets the scale and data that turned a next-token predictor into something that can write code, answer questions, and hold conversations. I've been looking forward to this arc for quite some time, because it bridges the gap between "we understand the mechanics" and "we understand what's actually happening in production systems today."

But before we get into specific architectures and training procedures, we need to nail down the fundamental task that underpins ALL of it: language modeling. What does it mean to predict the next word? Why is this deceptively simple objective so powerful? And how do we even measure whether a language model is any good?

Here we go!

What is a language model?

A language model assigns probabilities to sequences of words (or tokens, more precisely). Given some context, it tells you how likely each possible continuation is.

If I give you the fragment "The cat sat on the ___", you'd probably predict "mat" or "floor" or "couch" with high probability, and something like "blockchain" or "quarantine" with very low probability. You're doing language modeling in your head right now -- using your accumulated knowledge of language patterns to predict what comes next.

A computational language model does the same thing formally:

P(w_t | w_1, w_2, ..., w_{t-1})

That's the probability of word w_t given all previous words. The full probability of a sentence is the product of these conditional probabilities:

P("The cat sat") = P(The) * P(cat|The) * P(sat|The cat)

Every large language model you've heard of -- GPT-4, Claude, Gemini, LLaMA -- is fundamentally a language model. When it "generates" text, it's sampling from its learned probability distribution one token at a time. The entire magic (and it does feel like magic sometimes) emerges from this single objective: predict the next token, given everything that came before.

Let me build this up from the historical foundations, because understanding where we came from makes it much clearer why the modern approach works so well.

Statistical language models: n-grams

Before neural networks dominated NLP, language models counted word sequences in large text corpora. An n-gram model estimates the probability of a word based on the previous n-1 words:

  • Unigram (n=1): P(word) based on overall word frequency. No context whatsoever.
  • Bigram (n=2): P(word | previous_word). "What word usually follows 'the'?"
  • Trigram (n=3): P(word | prev_2_words). "What usually follows 'the cat'?"

Let's build one from scratch:

from collections import Counter, defaultdict

text = """the cat sat on the mat the cat ate the food
the dog sat on the rug the dog ate the bone
the cat sat on the mat again the dog ran away
the bird sat on the fence the bird ate the seed"""

def build_bigram_model(text):
    words = text.split()
    counts = defaultdict(Counter)
    for w1, w2 in zip(words, words[1:]):
        counts[w1][w2] += 1

    # Convert counts to probabilities
    probs = {}
    for w1, followers in counts.items():
        total = sum(followers.values())
        probs[w1] = {w2: count / total for w2, count in followers.items()}
    return probs

bigram = build_bigram_model(text)
print("After 'the':", bigram.get('the', {}))
print("After 'cat':", bigram.get('cat', {}))
print("After 'sat':", bigram.get('sat', {}))

Output (approximately):

After 'the': {'cat': 0.2, 'mat': 0.13, 'food': 0.07, 'dog': 0.2, ...}
After 'cat': {'sat': 0.67, 'ate': 0.33}
After 'sat': {'on': 1.0}

We can use this to generate text:

import random

def generate_from_bigram(model, start_word, n_words=15):
    current = start_word
    result = [current]
    for _ in range(n_words):
        if current not in model:
            break
        next_words = model[current]
        words = list(next_words.keys())
        probs = list(next_words.values())
        current = random.choices(words, weights=probs, k=1)[0]
        result.append(current)
    return ' '.join(result)

for _ in range(5):
    print(generate_from_bigram(bigram, 'the'))

N-gram models dominated NLP for decades. They're fast, they're interpretable, and they work reasonably well for short-range patterns. Speech recognition systems in the 2000s relied heavily on trigram models. But they have fundamental limitations that no amount of data can fix:

Sparsity: most n-grams never appear in the training data. Even a corpus of 1 billion words can't cover all reasonable 5-word sequences. There are simply too many possible combinations. Smoothing techniques (adding small counts to unseen n-grams) help, but they're band-aids on a fundamental problem.

No generalization: an n-gram model that has seen "the cat sat on the mat" and "the dog sat on the rug" has absolutely no way of knowing that "the hamster sat on the pillow" is also plausible. Each word is a discrete symbol with zero semantic connection to any other word. "Cat" and "dog" are as different as "cat" and "neutron" -- the model doesn't know they're both animals.

Fixed context window: a trigram uses exactly 2 previous words. But consider this sentence: "The president of the United States, who was elected in November after a contentious campaign that lasted over two years, finally signed the ___". You need context from 20+ words back to make a good prediction here. No practical n-gram model can capture this (a 20-gram model would need impossibly large training data).
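
A quick back-of-the-envelope calculation shows why no amount of data fixes the sparsity problem (the 50,000-word vocabulary is just a representative round number):

vocab_size = 50_000

for n in (2, 3, 5):
    possible = vocab_size ** n
    print(f"possible {n}-grams: {possible:.2e}")

# A 1-billion-word corpus contains fewer than 1e9 distinct 5-grams --
# a vanishing fraction of the ~3e23 that are possible.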

Neural language models: the paradigm shift

Neural language models solve all three problems by representing words as dense vectors (embeddings) rather than discrete symbols. We covered embeddings back in episode #31 -- now you'll see exactly why they matter so much.

No sparsity: instead of storing counts for every possible n-gram, a neural model learns a continuous function that maps any context to a probability distribution over next words. It can assign meaningful probabilities to sequences it has literally never seen, because it operates in continuous space. The function interpolates between training examples.

Generalization: because "cat" and "dog" develop similar embeddings (they appear in similar contexts during training -- "the ___ sat on", "the ___ ate the", "fed the ___"), a neural model that learns "the cat sat on the ___" automatically assigns reasonable probability to "the hamster sat on the ___". Similar embeddings produce similar outputs. This is a direct consequence of the distributional hypothesis we discussed in episode #31.

Variable context: a transformer language model attends to ALL previous tokens (up to its context window length). That president sentence from 20 words ago? Self-attention connects it directly to the current prediction position. No fixed window -- every position can attend to every earlier position.

The architecture is exactly what we built in episode #56: a decoder-only transformer. The input is a sequence of tokens, and the output at each position is a probability distribution over the entire vocabulary for what the next token should be. The causal mask ensures each position can only see earlier positions -- exactly the autoregressive constraint that language modeling requires.
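
As a quick reminder of what that causal mask looks like (just a sketch -- episode #56 has the full attention implementation):

import torch

# Each row is a query position; True means "allowed to attend to"
mask = torch.tril(torch.ones(5, 5, dtype=torch.bool))
print(mask)
# Position 0 sees only itself; position 4 sees positions 0..4 -- never the future.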

Let's build a simple neural language model to make this concrete:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNeuralLM(nn.Module):
    """A minimal feed-forward neural language model.
    Uses a fixed context window of n previous tokens."""

    def __init__(self, vocab_size, d_embed=64, context_size=5, d_hidden=128):
        super().__init__()
        self.context_size = context_size
        self.embedding = nn.Embedding(vocab_size, d_embed)
        self.fc1 = nn.Linear(context_size * d_embed, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_hidden)
        self.out = nn.Linear(d_hidden, vocab_size)

    def forward(self, x):
        # x shape: (batch, context_size) -- token indices
        emb = self.embedding(x)            # (batch, context_size, d_embed)
        emb = emb.view(x.size(0), -1)     # (batch, context_size * d_embed)
        h = F.relu(self.fc1(emb))
        h = F.relu(self.fc2(h))
        logits = self.out(h)               # (batch, vocab_size)
        return logits

# Demonstrate with a toy vocabulary
vocab = ['<pad>', 'the', 'cat', 'dog', 'sat', 'on', 'mat', 'rug', 'ate', 'food']
word_to_idx = {w: i for i, w in enumerate(vocab)}
vocab_size = len(vocab)

model = SimpleNeuralLM(vocab_size, d_embed=32, context_size=3, d_hidden=64)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# Example forward pass
context = torch.tensor([[word_to_idx['the'], word_to_idx['cat'], word_to_idx['sat']]])
logits = model(context)
probs = F.softmax(logits, dim=-1)
print(f"Next word probabilities: {probs.shape}")
for word, prob in zip(vocab, probs[0].tolist()):
    if prob > 0.05:
        print(f"  {word}: {prob:.3f}")

This is the simplest possible neural LM -- a feed-forward network with a fixed context window. It's better than n-grams already (because of the embedding generalization), but it still has a fixed context. The transformer architecture from episodes #52-53 removes that limitation entirely.

Perplexity: measuring language model quality

How do you measure whether a language model is good? You can't just check accuracy ("did it predict the exact next word?") because language is inherently stochastic -- many words could validly follow any given context. "The cat sat on the ___" has dozens of legitimate continuations.

Perplexity is the standard metric. It measures how "surprised" the model is by test data on average. Formally:

PPL = exp( -1/N * sum( log P(w_t | context) ) )

Intuition: if the model assigns probability 1.0 to every actual next token, the perplexity is 1 (perfect, zero surprise). If it's choosing uniformly among 100 options at each step, the perplexity is 100. Lower is better -- it means the model is less "confused" by the data.
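
A quick sanity check on that intuition:

import math

# If the model assigns probability 1/100 to every actual next token,
# the average log-probability is log(1/100) and the perplexity is exactly 100.
n_tokens = 1000
avg_log_prob = sum(math.log(1 / 100) for _ in range(n_tokens)) / n_tokens
print(math.exp(-avg_log_prob))   # 100.0 (up to floating point error)

The function below computes perplexity for a neural LM like the one we built above: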

import math

def compute_perplexity(model, data_tokens, context_size):
    """Compute perplexity of a neural LM on a token sequence."""
    model.eval()
    total_log_prob = 0
    n_predictions = 0

    with torch.no_grad():
        for i in range(context_size, len(data_tokens)):
            context = data_tokens[i - context_size:i].unsqueeze(0)
            logits = model(context)
            probs = F.softmax(logits, dim=-1)
            target = data_tokens[i].item()
            prob = probs[0, target].item()

            if prob > 0:
                total_log_prob += math.log(prob)
            else:
                total_log_prob += math.log(1e-10)  # avoid log(0)
            n_predictions += 1

    avg_log_prob = total_log_prob / n_predictions
    perplexity = math.exp(-avg_log_prob)
    return perplexity

# With a transformer model (from episode #56):
def compute_transformer_perplexity(model, data, block_size=64):
    """Perplexity for an autoregressive transformer."""
    model.eval()
    total_loss = 0
    total_tokens = 0

    with torch.no_grad():
        for i in range(0, len(data) - block_size - 1, block_size):
            x = data[i:i + block_size].unsqueeze(0)
            y = data[i + 1:i + block_size + 1].unsqueeze(0)
            logits = model(x)
            if isinstance(logits, tuple):
                logits = logits[0]
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                y.view(-1)
            )
            total_loss += loss.item() * block_size
            total_tokens += block_size

    avg_loss = total_loss / total_tokens
    return math.exp(avg_loss)

# Example interpretation:
print("Perplexity benchmarks:")
print("  Random guess (vocab=50,000): 50,000")
print("  Basic n-gram model:          200-500")
print("  LSTM language model:          60-100")
print("  GPT-2 (1.5B params):          ~30")
print("  GPT-3 (175B params):          ~20")
print("  Human baseline (estimates):    12-20")

For English text, human-level perplexity is estimated around 12-20 (depending on domain and how you measure it). GPT-2 achieved perplexity around 30-35 on WikiText-103. Modern LLMs approach or reach human-level on many standard benchmarks.
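
Since perplexity is just the exponentiated cross-entropy, you can convert directly between the loss values you see during training (in nats per token) and these perplexity numbers:

import math

for loss_nats in (3.5, 3.0, 2.5):   # typical per-token cross-entropy values
    ppl = math.exp(loss_nats)
    bits = loss_nats / math.log(2)
    print(f"loss {loss_nats:.1f} nats/token -> perplexity {ppl:5.1f} ({bits:.2f} bits/token)")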

But perplexity has important limitations. A model with low perplexity assigns high probability to the correct next tokens, but that doesn't mean it can follow instructions, reason about problems, or generate actually useful text. Perplexity measures statistical fit, not intelligence, not capability. A model could achieve excellent perplexity on Wikipedia articles and still be completely useless as a chatbot. This distinction between "predicts well" and "behaves usefully" is precisely why we need techniques like instruction tuning and RLHF on top of raw language modeling -- but that's a topic for later in this arc.

The scaling hypothesis

One of the most consequential empirical findings in AI: language model performance improves predictably with scale. More parameters, more data, more compute leads to lower loss. And not in some vague handwavy way -- it follows precise mathematical relationships called scaling laws.

In 2020, Kaplan et al. at OpenAI published a paper showing that language model loss follows power laws in three variables:

  • N (parameters): Loss ~ N^(-0.076)
  • D (dataset size): Loss ~ D^(-0.095)
  • C (compute budget): Loss ~ C^(-0.050)

In code, these power laws look roughly like this (heavily simplified):

def predicted_loss(n_params, data_tokens, alpha_n=0.076, alpha_d=0.095):
    """Approximate scaling law predictions (Kaplan et al., 2020).
    These are simplified -- real scaling laws have more terms."""
    loss_from_params = (n_params / 8.8e13) ** (-alpha_n)
    loss_from_data = (data_tokens / 5.4e13) ** (-alpha_d)
    # In practice, loss is dominated by the larger bottleneck
    return max(loss_from_params, loss_from_data)

# Compare model sizes
configs = [
    ("Small (125M)", 125e6, 10e9),
    ("Medium (1.3B)", 1.3e9, 100e9),
    ("Large (13B)", 13e9, 300e9),
    ("XL (175B)", 175e9, 300e9),
    ("XXL (540B)", 540e9, 1.4e12),
]

print(f"{'Model':<20} {'Params':>10} {'Tokens':>12} {'~Loss':>8}")
print("-" * 55)
for name, params, tokens in configs:
    loss = predicted_loss(params, tokens)
    print(f"{name:<20} {params:>10.0e} {tokens:>12.0e} {loss:>8.4f}")

What this means in practice: you can predict, before spending millions of dollars on training, approximately how well a model will perform based on its size and the amount of training data. It also means that training a 10x bigger model gives predictable (though diminishing) improvements. This is why companies keep making bigger models -- the returns are predictable and the math says "bigger will be better."

The Chinchilla scaling laws (Hoffmann et al., 2022) refined this further, showing that many early large models were actually undertrained. The optimal balance is roughly: scale parameters and training tokens at the same rate. A 70B parameter model trained on 1.4 trillion tokens (Chinchilla) outperformed a 280B parameter model trained on 300 billion tokens (Gopher). More data, properly matched to model size, beats raw parameter count.
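
A rough rule of thumb that falls out of the Chinchilla analysis is about 20 training tokens per parameter (a big simplification of the paper's fitted formula, but good enough for napkin math):

def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Very rough compute-optimal token count (Chinchilla heuristic, ~20:1)."""
    return n_params * tokens_per_param

for name, params in [("7B", 7e9), ("70B", 70e9), ("175B", 175e9)]:
    print(f"{name:>5}: ~{chinchilla_optimal_tokens(params):.1e} training tokens")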

The scaling hypothesis goes even further: as models scale, they develop emergent abilities -- capabilities that seem absent in smaller models and then suddenly start working once a certain scale is reached. Few-shot learning (learning from examples in the prompt without updating parameters) emerged around GPT-3's scale (~175B parameters). Whether these abilities are truly discontinuous or just cross a usefulness threshold on a smooth curve is debated among researchers. But the practical implication is clear: bigger models can do things smaller ones simply cannot.

Tokenization: how models actually read text

Models don't see characters or words -- they see tokens. Tokenization is the process of converting raw text into a sequence of integer IDs from a fixed vocabulary. This is one of those "boring infrastructure" topics that turns out to have massive implications for model behavior.

In episode #56, we used character-level tokenization (each character = one token). This works but is inefficient: the word "transformer" takes 11 tokens. The model must learn to compose characters into meaningful units, wasting quite some capacity on low-level spelling patterns.

Word-level tokenization (each word = one token) is more efficient for common words but can't handle unknown words. What token do you assign to "transformeresque" or "ChatGPTification"?

Subword tokenization splits the difference. It learns a vocabulary of common subword units from the training data. Frequent words get their own dedicated token ("the", "and", "transformer"). Rare words are split into known pieces ("un" + "forgett" + "able" or "transform" + "er" + "esque").

Byte Pair Encoding (BPE)

BPE is the most widely used subword tokenization algorithm. It starts with individual characters and iteratively merges the most frequent adjacent pair:

def train_bpe(text, n_merges):
    """Train a simple BPE tokenizer."""
    # Start with character-level tokens
    tokens = list(text)
    merges = []
    vocab = set(tokens)

    for step in range(n_merges):
        # Count all adjacent pairs
        pairs = {}
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            pairs[pair] = pairs.get(pair, 0) + 1

        if not pairs:
            break

        # Find and merge most frequent pair
        best_pair = max(pairs, key=pairs.get)
        merged_token = best_pair[0] + best_pair[1]
        merges.append((best_pair, merged_token))
        vocab.add(merged_token)

        # Apply merge throughout the sequence
        new_tokens = []
        i = 0
        while i < len(tokens):
            if (i < len(tokens) - 1 and
                tokens[i] == best_pair[0] and
                tokens[i + 1] == best_pair[1]):
                new_tokens.append(merged_token)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens

        if step < 10 or step % 50 == 0:
            print(f"Step {step:>3d}: merge '{best_pair[0]}' + '{best_pair[1]}' "
                  f"-> '{merged_token}' | seq_len={len(tokens)}, vocab={len(vocab)}")

    return tokens, merges, vocab

sample = "the cat sat on the mat the cat ate the food the dog sat on the rug"
tokens, merges, vocab = train_bpe(sample, n_merges=20)
print(f"\nFinal sequence ({len(tokens)} tokens): {tokens[:30]}")
print(f"Vocabulary size: {len(vocab)}")

After enough merges, common words become single tokens. The string "the" will be one token, "cat" will be one token, but a rare word like "quaternion" might be split into "qu" + "ater" + "nion" or similar. The key insight: the algorithm discovers meaningful subword units automatically from the statistics of the training corpus. No linguistic knowledge required ;-)

Here's a more complete tokenizer class you can actually use:

class BPETokenizer:
    """Minimal BPE tokenizer for demonstration."""

    def __init__(self, merges):
        self.merges = merges  # list of ((tok_a, tok_b), merged_tok)
        # Build vocab from merge rules
        self.vocab = set()
        for (a, b), merged in merges:
            self.vocab.add(a)
            self.vocab.add(b)
            self.vocab.add(merged)
        # Add single characters as base vocab
        for c in range(256):
            self.vocab.add(chr(c))

        self.token_to_id = {tok: i for i, tok in enumerate(sorted(self.vocab))}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def encode(self, text):
        """Tokenize text using learned merges."""
        tokens = list(text)
        for (a, b), merged in self.merges:
            new_tokens = []
            i = 0
            while i < len(tokens):
                if (i < len(tokens) - 1 and
                    tokens[i] == a and tokens[i + 1] == b):
                    new_tokens.append(merged)
                    i += 2
                else:
                    new_tokens.append(tokens[i])
                    i += 1
            tokens = new_tokens
        return [self.token_to_id[t] for t in tokens if t in self.token_to_id]

    def decode(self, ids):
        """Convert token IDs back to text."""
        return ''.join(self.id_to_token[i] for i in ids)

# Use the merges from training
tokenizer = BPETokenizer(merges)
encoded = tokenizer.encode("the cat sat on the mat")
print(f"Encoded: {encoded}")
print(f"Decoded: {tokenizer.decode(encoded)}")
print(f"Compression: {len('the cat sat on the mat')} chars -> {len(encoded)} tokens")

GPT-2 uses BPE with approximately 50,000 merges, producing a vocabulary of 50,257 tokens. GPT-4 uses around 100,000 tokens. The larger vocabulary means more words can be represented as single units, making sequences shorter (more computationally efficient) at the cost of a larger embedding table.
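
To put the embedding-table cost in numbers, here's a quick calculation (using GPT-2 small's hidden size of 768 as a reference point; the vocabulary sizes are the approximate figures mentioned above):

d_model = 768   # GPT-2 small's embedding dimension

for name, vocab_size in [("GPT-2 (~50K vocab)", 50_257), ("GPT-4-style (~100K vocab)", 100_000)]:
    emb_params = vocab_size * d_model
    print(f"{name:<28} embedding table: {emb_params / 1e6:.1f}M parameters")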

WordPiece and SentencePiece

WordPiece (used by BERT): similar to BPE but selects merges based on likelihood improvement rather than raw frequency. It uses ## as a prefix for continuation tokens: "playing" becomes "play" + "##ing". The ## tells you this piece continues a previous token (it's not the start of a new word).

SentencePiece (used by T5, LLaMA): treats the input as a raw character stream with no pre-tokenization on spaces, making it truly language-agnostic. Japanese, Chinese, Arabic -- no special handling needed. Spaces are replaced by a special meta symbol (▁), so the tokenizer can learn space-inclusive tokens like "▁the".

Byte-level BPE (GPT-2, GPT-4): operates on raw bytes, ensuring ANY text can be tokenized (no "unknown token" ever). UTF-8 multibyte characters may become multiple tokens, which is why non-Latin scripts tend to use more tokens per word than English.
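
To illustrate the ## convention, here's a toy sketch of WordPiece-style greedy longest-match-first encoding with a hand-picked vocabulary (real WordPiece learns its vocabulary from data and picks merges by likelihood -- this only shows how a word gets split once the vocabulary exists):

def wordpiece_encode(word, vocab):
    """Greedy longest-match-first split of a single word, WordPiece style."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]   # no known piece fits
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"play", "##ing", "##ed", "##er", "trans", "##form", "##ers"}
print(wordpiece_encode("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_encode("transformers", toy_vocab))  # ['trans', '##form', '##ers']

If you want to see how a real production tokenizer splits text, OpenAI's tiktoken library makes that easy: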

# Demonstrating how different texts tokenize differently
# (Using tiktoken -- OpenAI's tokenizer library)
# pip install tiktoken

# import tiktoken
# enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's tokenizer
#
# examples = [
#     "Hello world",
#     "Transformers are amazing",
#     "supercalifragilisticexpialidocious",
#     "Python 3.11 is fast",
#     "def fibonacci(n):",
# ]
#
# for text in examples:
#     tokens = enc.encode(text)
#     print(f"{text:>40s} -> {len(tokens)} tokens: {tokens}")

# Without tiktoken, here's what you'd roughly see:
print("Typical GPT-4 tokenization examples:")
print("  'Hello world'           -> 2 tokens  (each common word = 1 token)")
print("  'Transformers'          -> 1 token   (common word in training)")
print("  'supercalifragilistic'  -> 5 tokens  (rare word, split into pieces)")
print("  'def fibonacci(n):'    -> 5 tokens  (code patterns learned)")
print("  'x = 42'               -> 3 tokens  (variable, equals, number)")

Vocabulary size tradeoffs

The choice of vocabulary size creates a fundamental tension that every model designer must navigate:

Small vocabulary (1K-10K tokens): each token is a common character or short subword. Sequences become long -- the model needs more positions to represent the same text. The context window fills up faster. But the embedding table is small and every single token gets plenty of training data, so each embedding is well-learned.

Large vocabulary (50K-200K tokens): many full words get their own dedicated token. Sequences are shorter (more efficient use of context window). A 4000-word article might be 3000 tokens with a small vocab but only 1500 tokens with a large one. But the embedding table is massive and rare tokens (obscure technical terms, foreign words) may appear only a handful of times in training, so their embeddings are poorly learned.

Modern models typically land on 30K-100K tokens. GPT-4's ~100K vocabulary handles most English words as single tokens, plus common multi-word code patterns ("def ", "return "), multilingual character sequences, and frequent subword units.

One practical consequence that trips people up: tokenization affects model behavior in ways you might not expect. A model might struggle with tasks that require reasoning about individual characters (anagrams, spelling checks, counting letters) because it processes multi-character tokens. The word "strawberry" is likely 1-2 tokens -- the model never "sees" the individual letters unless you explicitly spell them out.

Token boundaries can also fall in unexpected places. "New York" might be a single token while "New Orleans" is two tokens. Numbers get tokenized inconsistently: "1000" might be one token, "1001" might be two ("100" + "1"). These tokenization artifacts affect everything from arithmetic ability to multilingual performance, and they're one reason why apparently simple tasks (like counting the letters in a word) can be surprisingly difficult for large language models.
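
A tiny illustration of that character blindness (the token split shown here is hypothetical -- actual splits vary by tokenizer):

# Hypothetical token view vs. character view of the same word
token_view = ["straw", "berry"]   # what the model "sees": two opaque token IDs
char_view = list("strawberry")    # the letters it would need for spelling tasks

print(token_view)
print(char_view)
# Counting letters needs the character view -- information the model only
# gets if you spell the word out for it, one character per token.
print("r's:", "strawberry".count("r"))   # 3

The effect of tokenization on sequence length is easier to quantify: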

# Demonstrating tokenization effects on sequence length
def estimate_token_ratio(text, avg_chars_per_token=4.0):
    """Rough estimate: English averages ~4 chars per token with BPE.
    Code averages ~3.5. Chinese/Japanese averages ~1.5-2."""
    n_chars = len(text)
    estimated_tokens = n_chars / avg_chars_per_token
    return estimated_tokens

texts = {
    "English prose": "The transformer architecture has revolutionized natural language processing.",
    "Python code": "def train_model(data, epochs=10, lr=0.001):\n    model = TransformerLM()",
    "Technical": "The BERT-base model uses 12 transformer encoder layers with 768-dim hidden states.",
}

print(f"{'Type':<20} {'Chars':>6} {'~Tokens':>8} {'Ratio':>6}")
print("-" * 45)
for label, text in texts.items():
    chars = len(text)
    tokens = estimate_token_ratio(text)
    print(f"{label:<20} {chars:>6} {tokens:>8.0f} {chars/tokens:>6.1f}")

Putting it together: from n-grams to transformers

Let me make the connection explicit. The evolution from statistical to neural language models mirrors the evolution we traced through this entire series:

  1. N-grams (1980s-2000s): count word sequences. Fast but sparse, no generalization, fixed context. Like a lookup table.

  2. Feed-forward neural LMs (Bengio, 2003): embed words as vectors, predict next word from fixed context window through a neural network. Generalizes via embeddings but still fixed context.

  3. RNN/LSTM language models (2010-2017): process tokens sequentially, maintain hidden state that theoretically captures unbounded context. In practice limited by vanishing gradients and sequential processing. We built these in episodes #48-49.

  4. Transformer language models (2017-present): attend to all previous tokens in parallel. Unbounded context (within window), no sequential bottleneck, scales well on modern hardware. This is what powers everything today.

Each step solved specific limitations of the previous approach. The transformer didn't appear from nowhere -- it solved the parallelization and long-range dependency problems that made RNNs impractical at scale. And the key realization that drove the LLM revolution: once you have a good enough architecture (transformers) and a good enough objective (next-token prediction), you can scale it up and emergent capabilities appear without being explicitly programmed.

What this means for the rest of Arc 4

Everything in this arc builds on what we covered today. Language modeling is the pre-training objective -- the task that teaches a model about language before it ever sees an instruction or a conversation. The model learns grammar, facts, reasoning patterns, code syntax, and world knowledge, all from the single task of predicting the next token on internet-scale text data.

The questions that drive the rest of this arc are:

  • How do you design a model architecture specifically optimized for language modeling? (decoder-only transformers, the GPT family)
  • What happens when you train an encoder model with a different objective? (masked language modeling, BERT)
  • How do you scale training to trillions of tokens across thousands of GPUs?
  • How do you turn a raw next-token predictor into something that follows instructions and behaves helpfully? (instruction tuning, alignment)
  • How do you use these models effectively through APIs and prompting?

The foundation is in place. The transformer predicts the next token. Everything that follows is about making that predictor bigger, training it on more data, and then teaching it to be useful.

Did it click? Let's check

  • A language model assigns probabilities to sequences: P(next_token | all_previous_tokens). That's the entire foundation.
  • N-gram models count word sequences but suffer from sparsity, zero generalization, and fixed context windows -- fundamental limitations that no amount of data fixes.
  • Neural language models solve all three problems through continuous embeddings and learned functions over those embeddings.
  • Perplexity measures how surprised the model is by test data. Lower = better. But low perplexity alone does not mean the model is useful -- it's a necessary but not sufficient condition.
  • Scaling laws show predictable power-law improvement with model size, data, and compute. The Chinchilla insight: balance parameters and training tokens, don't just scale one.
  • Subword tokenization (BPE, WordPiece, SentencePiece) balances vocabulary coverage with sequence efficiency. Most models use 30K-100K tokens.
  • Tokenization artifacts (token boundaries, multi-char tokens, inconsistent number splits) affect model capabilities in subtle but real ways.
  • The evolution from n-grams to transformers is the same trajectory we traced through this whole series: from discrete counts to continuous functions, from fixed to adaptive, from sequential to parallel.

Exercises

Exercise 1: Build a complete trigram language model with Laplace smoothing. Train it on a paragraph of text, compute its perplexity on a held-out sentence, and generate 5 random sentences from it. Compare the perplexity and output quality to an unsmoothed bigram model.

Exercise 2: Implement BPE from scratch as a class with train(text, n_merges), encode(text), and decode(ids) methods. Train it on a few paragraphs of English text with 200 merges. Then tokenize the sentence "the untokenizable transformerification" and show how many tokens it produces at different vocabulary sizes (50, 100, 200 merges).

Exercise 3: Take the transformer model from episode #56, train it on a text corpus, and compute perplexity on a held-out test set (use the last 10% of the data). Then vary the model size (number of layers: 2, 4, 6, 8) and plot perplexity vs parameter count. Does it follow the predicted power-law scaling?

Thanks for reading, see you in the next one!

@scipio


