Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers

What will I learn
- You will learn why decoder-only transformers won the architecture race;
- causal (autoregressive) attention masking and why it's the key design choice;
- the GPT family -- GPT-1 through GPT-4 and how each pushed the frontier;
- emergent abilities -- capabilities that appear only at sufficient scale;
- in-context learning -- few-shot adaptation without updating any weights;
- why "just predict the next token" produces models that can reason, code, and converse.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers (this post)
Solutions to Episode #57 Exercises
Exercise 1: Complete trigram language model with Laplace smoothing, perplexity computation, and text generation.
from collections import Counter, defaultdict
import math
import random
text = """the cat sat on the mat the cat ate the food
the dog sat on the rug the dog ate the bone
the cat sat on the mat again the dog ran away
the bird sat on the fence the bird ate the seed
the cat and the dog played in the yard all day
the bird flew over the fence and landed on the mat"""
words = text.split()
vocab = sorted(set(words))
V = len(vocab)
# Build trigram and bigram counts
trigram_counts = defaultdict(Counter)
bigram_counts = defaultdict(Counter)
for i in range(len(words) - 2):
context = (words[i], words[i+1])
trigram_counts[context][words[i+2]] += 1
for i in range(len(words) - 1):
bigram_counts[words[i]][words[i+1]] += 1
def trigram_prob(w, context, alpha=1.0):
"""P(w | context) with Laplace smoothing."""
count_trigram = trigram_counts[context][w]
count_context = sum(trigram_counts[context].values())
return (count_trigram + alpha) / (count_context + alpha * V)
def bigram_prob_unsmoothed(w, prev):
"""Unsmoothed bigram for comparison."""
count_bi = bigram_counts[prev][w]
count_prev = sum(bigram_counts[prev].values())
if count_prev == 0:
return 1.0 / V
return count_bi / count_prev
def perplexity_trigram(sentence_words, alpha=1.0):
log_prob = 0
n = 0
for i in range(2, len(sentence_words)):
ctx = (sentence_words[i-2], sentence_words[i-1])
p = trigram_prob(sentence_words[i], ctx, alpha)
log_prob += math.log(p)
n += 1
return math.exp(-log_prob / n) if n > 0 else float('inf')
def perplexity_bigram(sentence_words):
log_prob = 0
n = 0
for i in range(1, len(sentence_words)):
p = bigram_prob_unsmoothed(sentence_words[i], sentence_words[i-1])
if p == 0:
p = 1e-10
log_prob += math.log(p)
n += 1
return math.exp(-log_prob / n) if n > 0 else float('inf')
test = "the cat sat on the fence".split()
ppl_tri = perplexity_trigram(test)
ppl_bi = perplexity_bigram(test)
print(f"Test: '{' '.join(test)}'")
print(f" Trigram PPL (smoothed): {ppl_tri:.2f}")
print(f" Bigram PPL (unsmoothed): {ppl_bi:.2f}")
def generate_trigram(start_pair, length=15):
result = list(start_pair)
for _ in range(length):
ctx = (result[-2], result[-1])
candidates = vocab
probs = [trigram_prob(w, ctx) for w in candidates]
chosen = random.choices(candidates, weights=probs, k=1)[0]
result.append(chosen)
return ' '.join(result)
print("\nGenerated sentences:")
for i in range(5):
print(f" {generate_trigram(('the', 'cat'))}")
Laplace smoothing adds alpha (typically 1.0) to every trigram count and alpha * V to the denominator. This guarantees that unseen trigrams get a small but nonzero probability. The unsmoothed bigram model assigns zero probability to any word pair it hasn't seen, which would make perplexity infinite on novel sequences (the code clamps such probabilities to 1e-10, so you see a huge finite number instead). The smoothed trigram model's perplexity should be much lower and well-behaved on the test sentence, showing how smoothing solves the zero-probability problem.
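To see the effect in numbers, here's a quick check you can run against the counts built above -- it compares raw (unsmoothed) and smoothed probabilities for a seen and an unseen continuation of the context ('the', 'cat'):
# Seen vs unseen continuations of ('the', 'cat') -- smoothing in action
ctx = ('the', 'cat')
print(f"Seen continuations of {ctx}: {dict(trigram_counts[ctx])}")
for w in ['sat', 'flew']:  # 'sat' is seen, 'flew' never follows this context
    raw = trigram_counts[ctx][w] / sum(trigram_counts[ctx].values())
    smoothed = trigram_prob(w, ctx)  # alpha = 1.0
    print(f"  P({w!r} | {ctx}): raw={raw:.3f}, smoothed={smoothed:.3f}")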
Exercise 2: BPE tokenizer class with train, encode, and decode methods.
class BPETokenizer:
def __init__(self):
self.merges = []
self.vocab = set()
self.token_to_id = {}
self.id_to_token = {}
def train(self, text, n_merges):
tokens = list(text)
self.vocab = set(tokens)
self.merges = []
for step in range(n_merges):
pairs = {}
for i in range(len(tokens) - 1):
pair = (tokens[i], tokens[i+1])
pairs[pair] = pairs.get(pair, 0) + 1
if not pairs:
break
best = max(pairs, key=pairs.get)
merged = best[0] + best[1]
self.merges.append((best, merged))
self.vocab.add(merged)
new_tokens = []
i = 0
while i < len(tokens):
if (i < len(tokens) - 1 and
tokens[i] == best[0] and tokens[i+1] == best[1]):
new_tokens.append(merged)
i += 2
else:
new_tokens.append(tokens[i])
i += 1
tokens = new_tokens
sorted_vocab = sorted(self.vocab)
self.token_to_id = {t: i for i, t in enumerate(sorted_vocab)}
self.id_to_token = {i: t for t, i in self.token_to_id.items()}
def encode(self, text):
tokens = list(text)
for (a, b), merged in self.merges:
new_tokens = []
i = 0
while i < len(tokens):
if (i < len(tokens) - 1 and
tokens[i] == a and tokens[i+1] == b):
new_tokens.append(merged)
i += 2
else:
new_tokens.append(tokens[i])
i += 1
tokens = new_tokens
return [self.token_to_id[t] for t in tokens if t in self.token_to_id]
def decode(self, ids):
return ''.join(self.id_to_token[i] for i in ids)
corpus = """The transformer architecture processes sequences in parallel.
Self-attention computes relationships between all positions simultaneously.
Language models predict the next token given all previous context tokens."""
test_word = "the untokenizable transformerification"
for n in [50, 100, 200]:
tok = BPETokenizer()
tok.train(corpus, n_merges=n)
encoded = tok.encode(test_word)
pieces = [tok.id_to_token[i] for i in encoded]
print(f"{n:>3d} merges: {len(encoded)} tokens -> {pieces}")
With 50 merges, common bigrams like "th" and "er" become single tokens but "untokenizable" stays mostly character-level. By 200 merges, common words like "the" and "trans" are single tokens, so the sequence gets shorter. The vocabulary grows with each merge, but sequences compress -- this is the fundamental tradeoff between vocab size and sequence length.
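One way to quantify that tradeoff, reusing the BPETokenizer and corpus from above: train with increasing merge counts and measure both the vocabulary size and how many tokens the corpus itself encodes to:
# Vocab size grows with merges while the encoded corpus shrinks
for n in [0, 50, 100, 200]:
    tok = BPETokenizer()
    tok.train(corpus, n_merges=n)
    n_tokens = len(tok.encode(corpus))
    print(f"{n:>3d} merges: vocab={len(tok.vocab):>3d}, "
          f"corpus -> {n_tokens} tokens")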
Exercise 3: Scaling experiment with the transformer from episode #56.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
# Using MiniGPT from episode #56 (import or paste it here)
# ... (model definition from ep56)
text = open('shakespeare.txt', 'r').read()  # any ~1MB plain-text corpus works
chars = sorted(set(text))
vocab_size = len(chars)
ch_to_idx = {c: i for i, c in enumerate(chars)}
data = torch.tensor([ch_to_idx[c] for c in text], dtype=torch.long)
split = int(0.9 * len(data))
train_data = data[:split]
test_data = data[split:]
def get_batch(source, batch_size=32, block_size=64):
ix = torch.randint(len(source) - block_size - 1, (batch_size,))
x = torch.stack([source[i:i+block_size] for i in ix])
y = torch.stack([source[i+1:i+block_size+1] for i in ix])
return x, y
results = []
for n_layers in [2, 4, 6, 8]:
model = MiniGPT(vocab_size, d_model=128, n_heads=4,
n_layers=n_layers, d_ff=512, max_len=128)
n_params = sum(p.numel() for p in model.parameters())
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(2000):
x, y = get_batch(train_data)
logits, _ = model(x)
loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
opt.zero_grad()
loss.backward()
opt.step()
# Test perplexity
model.eval()
total_loss, total_tokens = 0, 0
with torch.no_grad():
for _ in range(50):
x, y = get_batch(test_data)
logits, _ = model(x)
loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
total_loss += loss.item() * x.numel()
total_tokens += x.numel()
ppl = math.exp(total_loss / total_tokens)
results.append((n_layers, n_params, ppl))
print(f"Layers={n_layers}, params={n_params:,}, test_ppl={ppl:.2f}")
print("\nScaling analysis:")
for i in range(1, len(results)):
ratio = results[i][1] / results[i-1][1]
ppl_drop = results[i-1][2] / results[i][2]
print(f" {results[i-1][0]}->{results[i][0]} layers: "
f"{ratio:.1f}x params, {ppl_drop:.2f}x ppl improvement")
On a ~1MB text corpus, you should see perplexity drop as layers increase: roughly 2 layers ~ 15-20 PPL, 4 layers ~ 10-14 PPL, 6 layers ~ 8-12 PPL, 8 layers ~ 7-10 PPL. The improvement gets smaller with each layer added -- diminishing returns consistent with the power-law scaling we discussed. On this small dataset, 8 layers may even start to overfit (train loss much lower than test loss), which would show up as higher test perplexity than expected.
On to today's episode
Here we go! Last episode we nailed down the fundamental task that underpins every modern language model: predict the next token. We built n-gram models, neural language models, measured perplexity, explored tokenization, and established the scaling hypothesis. Today we zoom in on the specific architecture that turned next-token prediction into one of the most consequential technologies of the last decade: the decoder-only transformer, better known as the GPT architecture.
The original transformer from the 2017 "Attention Is All You Need" paper (which we covered in episodes #52-53) had both an encoder and a decoder. The encoder processes the input, the decoder generates the output. For machine translation, this makes intuitive sense -- you need one side to understand English and another to produce German.
But in 2018, OpenAI made a bold simplification: drop the encoder entirely. Keep only the decoder. Train it on a massive pile of text with one objective -- predict the next token. They called it the Generative Pre-trained Transformer, or GPT.
This turned out to be one of the most consequential architectural decisions in AI history. Not because the decoder-only design was technically novel (it's actually simpler than the full transformer), but because it scaled in ways nobody predicted. I've been wanting to write this episode for quite some time, because it connects everything we've built so far to the models that are making headlines every week ;-)
Why decoder-only?
The elegance of the decoder-only approach lies in what it eliminates. There's no encoder. No cross-attention (remember from episode #53 -- that was the mechanism where the decoder attended to encoder outputs). No separate input/output distinction. Just one stack of transformer layers with causal masking.
The model takes a sequence of tokens and, for each position, predicts the next token. That's the entire architecture and the entire training objective. Input and output share the same vocabulary, the same embedding space, the same model.
Let's build the core GPT architecture. If you followed episodes #52-53 and especially the mini-project in episode #56, you'll recognize every piece -- today we're just assembling them with a focus on the design choices that make GPT specifically what it is:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class CausalSelfAttention(nn.Module):
"""Multi-head self-attention with causal mask.
This is the heart of GPT -- each position can only
attend to previous positions (and itself)."""
def __init__(self, d_model, n_heads, max_len=1024, dropout=0.1):
super().__init__()
assert d_model % n_heads == 0
self.n_heads = n_heads
self.d_k = d_model // n_heads
self.qkv = nn.Linear(d_model, 3 * d_model)
self.proj = nn.Linear(d_model, d_model)
self.drop = nn.Dropout(dropout)
# Pre-compute the causal mask (lower triangular)
mask = torch.tril(torch.ones(max_len, max_len))
self.register_buffer('mask', mask.view(1, 1, max_len, max_len))
def forward(self, x):
B, T, C = x.shape
qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.d_k)
qkv = qkv.permute(2, 0, 3, 1, 4) # (3, B, heads, T, d_k)
Q, K, V = qkv[0], qkv[1], qkv[2]
scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
attn = self.drop(F.softmax(scores, dim=-1))
out = (attn @ V).transpose(1, 2).contiguous().view(B, T, C)
return self.proj(out)
Notice the causal mask -- that lower-triangular matrix we built in episode #53. Position 0 can only attend to position 0. Position 5 can attend to positions 0 through 5. Position 100 can attend to positions 0 through 100. This is what makes the model autoregressive: each token prediction is based only on what came before, never what comes after. Without this mask, the model could "cheat" during training by looking at the answer it's supposed to predict.
Having said that, let me now show you the full GPT block and model. Pay attention to pre-norm (LayerNorm before the sub-layers, not after -- we discussed this back in episode #53):
class GPTBlock(nn.Module):
"""A single GPT transformer block.
Pre-norm architecture: LayerNorm BEFORE attention and FFN."""
def __init__(self, d_model, n_heads, d_ff, max_len=1024, dropout=0.1):
super().__init__()
self.ln1 = nn.LayerNorm(d_model)
self.attn = CausalSelfAttention(d_model, n_heads, max_len, dropout)
self.ln2 = nn.LayerNorm(d_model)
self.ff = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.GELU(),
nn.Linear(d_ff, d_model),
nn.Dropout(dropout)
)
def forward(self, x):
x = x + self.attn(self.ln1(x))
x = x + self.ff(self.ln2(x))
return x
class GPT(nn.Module):
"""Complete GPT model: embeddings + N blocks + output head."""
def __init__(self, vocab_size, d_model=768, n_heads=12,
n_layers=12, d_ff=3072, max_len=1024, dropout=0.1):
super().__init__()
self.tok_emb = nn.Embedding(vocab_size, d_model)
self.pos_emb = nn.Embedding(max_len, d_model)
self.drop = nn.Dropout(dropout)
self.blocks = nn.ModuleList([
GPTBlock(d_model, n_heads, d_ff, max_len, dropout)
for _ in range(n_layers)
])
self.ln_f = nn.LayerNorm(d_model)
self.head = nn.Linear(d_model, vocab_size, bias=False)
# Weight tying (same as episode #56)
self.head.weight = self.tok_emb.weight
self.max_len = max_len
def forward(self, idx):
B, T = idx.shape
tok = self.tok_emb(idx)
pos = self.pos_emb(torch.arange(T, device=idx.device))
x = self.drop(tok + pos)
for block in self.blocks:
x = block(x)
x = self.ln_f(x)
return self.head(x)
# GPT-1 scale
model = GPT(vocab_size=50257, d_model=768, n_heads=12,
n_layers=12, d_ff=3072, max_len=512)
n_params = sum(p.numel() for p in model.parameters())
print(f"GPT-1 scale model: {n_params:,} parameters")
# Prints ~124M parameters with this exact layout -- in the same
# ballpark as GPT-1's reported 117M (the original layout differs slightly)
This simplicity is the point. One architecture, one training objective, and you can scale indefinitely. No task-specific heads, no paired training data, no encoder-decoder alignment. Just text in, next-token probabilities out.
Weight tying -- sharing the embedding matrix with the output projection -- is a critical detail we already used in episode #56. The same vector that represents "cat" in the input embedding space is also the vector that predicts "cat" in the output softmax. This saves millions of parameters and provides a natural regularization: the input and output representations of each token are forced to be the same, which makes the model's internal language space more consistent.
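You can verify the tying directly on the model we just built -- the head and the token embedding are literally the same tensor, which is where the parameter savings come from:
# The output head and the input embedding share one tensor
print(model.head.weight is model.tok_emb.weight)  # True
# Without tying, the head would need its own vocab_size x d_model matrix:
print(f"Parameters saved: {50257 * 768:,}")  # ~38.6M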
The GPT family: a scaling story
GPT-1 (2018): proof of concept
- 117 million parameters, 12 layers, 768 dimensions, 12 heads
- Trained on BookCorpus (~800M words of unpublished fiction)
- Context window: 512 tokens
- Key innovation: unsupervised pre-training followed by supervised fine-tuning
GPT-1 was modest by current standards. But its main contribution was showing that a pre-trained language model could be fine-tuned for various NLP tasks (sentiment analysis, textual entailment, similarity) and beat architectures designed specifically for those tasks. One model, many tasks -- a radical idea at the time, because the dominant approach was still to build a separate model for each task.
GPT-2 (2019): the generation leap
- 1.5 billion parameters, 48 layers, 1600 dimensions, 25 heads
- Trained on WebText (~40GB of text from high-quality Reddit links)
- Context window: 1024 tokens
- Key innovation: zero-shot task performance without any fine-tuning
GPT-2 was the first model where pure language modeling produced genuinely coherent multi-paragraph text. OpenAI initially withheld the full model, citing concerns about misuse. The controversy seems quaint now given what followed, but it marked the first time a language model's capabilities generated public debate.
More importantly, GPT-2 demonstrated zero-shot abilities. Without any task-specific training, you could prompt it with "Translate English to French: cheese =>" and it would produce "fromage." The model had learned the concept of translation purely from predicting the next token on internet text. Wowzers.
GPT-3 (2020): the scaling revelation
- 175 billion parameters, 96 layers, 12,288 dimensions, 96 heads
- Trained on ~300 billion tokens (Common Crawl, WebText, books, Wikipedia)
- Context window: 2048 tokens
- Key innovation: few-shot in-context learning
GPT-3 was 100x larger than GPT-2. The scaling paid off dramatically. Given just a few examples in the prompt (few-shot), GPT-3 could perform tasks it was never explicitly trained for: translation, arithmetic, code generation, question answering. No gradient updates, no fine-tuning -- just examples in the context window.
# Few-shot prompting (the input to GPT-3)
Translate English to Dutch:
sea otter => zeeotter
cheese => kaas
plush giraffe => pluchen giraffe
transformer architecture =>
GPT-3 produces "transformer architectuur" -- not because it was trained on translation pairs, but because it saw enough multilingual text during pre-training to internalize translation as a pattern. The model doesn't "know" it's translating; it's predicting the most likely next tokens given the pattern established by the examples.
GPT-4 (2023): the capability jump
- Estimated 1+ trillion parameters (details not publicly disclosed)
- Multimodal: accepts both text and images as input
- Context window: 8K-128K tokens
- Dramatic improvement in reasoning, instruction following, and safety
GPT-4 represented a qualitative leap. It could pass the bar exam (90th percentile), write complex code, analyze images, and maintain coherent reasoning over long conversations. The gap between GPT-3 and GPT-4 convinced many researchers that scaling transformer language models produces genuine capability improvements, not just better pattern matching.
Let's look at the numbers side by side to see the scaling trajectory:
import math
configs = {
"GPT-1": {"params": 117e6, "layers": 12, "d_model": 768,
"heads": 12, "d_ff": 3072, "ctx": 512},
"GPT-2": {"params": 1.5e9, "layers": 48, "d_model": 1600,
"heads": 25, "d_ff": 6400, "ctx": 1024},
"GPT-3": {"params": 175e9, "layers": 96, "d_model": 12288,
"heads": 96, "d_ff": 49152, "ctx": 2048},
"LLaMA-2 70B": {"params": 70e9, "layers": 80, "d_model": 8192,
"heads": 64, "d_ff": 28672, "ctx": 4096},
"Mistral 7B": {"params": 7e9, "layers": 32, "d_model": 4096,
"heads": 32, "d_ff": 14336, "ctx": 32768},
}
print(f"{'Model':<16} {'Params':>10} {'Layers':>7} {'d_model':>8} "
f"{'Heads':>6} {'d_ff':>7} {'Context':>8}")
print("-" * 68)
for name, c in configs.items():
print(f"{name:<16} {c['params']:>10.0e} {c['layers']:>7d} "
f"{c['d_model']:>8d} {c['heads']:>6d} {c['d_ff']:>7d} "
f"{c['ctx']:>8d}")
# Verify d_ff = 4 * d_model (standard ratio)
ratio = c['d_ff'] / c['d_model']
d_k = c['d_model'] // c['heads']
print(f"{'':>16} d_ff/d_model={ratio:.1f}x, d_k={d_k}")
The pattern: each generation increases parameters by 10-100x, increases context length, and keeps the same fundamental architecture. d_ff is typically 4x d_model in the GPT family (the SwiGLU-based LLaMA-2 and Mistral rows use 3.5x; more on SwiGLU below). The number of heads scales with d_model, keeping d_k at 64-128 per head. Modern models also add architectural refinements -- Grouped Query Attention (GQA) reduces memory for key-value caches, RoPE (Rotary Position Embeddings) handles long contexts better than learned position embeddings, and SwiGLU replaces GELU in the feed-forward layers -- but these are optimizations, not fundamental changes to the architecture.
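A useful back-of-the-envelope check on that table: ignoring embeddings and biases, each block contributes about 12 * d_model^2 parameters (4 * d_model^2 for the Q/K/V/output projections plus 8 * d_model^2 for a 4x feed-forward), so total parameters come out to roughly 12 * n_layers * d_model^2. Here's that estimate as a quick sketch -- expect it to undershoot for the small GPT-1, where embeddings dominate, and for the SwiGLU/GQA models, whose layers deviate from the 4x/MHA layout:
# Rough estimate: params ~ 12 * n_layers * d_model^2
# (4*d^2 attention projections + 8*d^2 for a 4x-expansion FFN)
for name, c in configs.items():
    est = 12 * c['layers'] * c['d_model'] ** 2
    print(f"{name:<16} estimated {est/1e9:>6.1f}B vs reported {c['params']/1e9:>6.1f}B")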
Causal masking in detail
Let me be very explicit about the causal mask because it's the single design choice that separates GPT from BERT (which we'll cover in the next episode). The mask determines what each token can "see":
import torch
import torch.nn.functional as F
def show_causal_attention(seq_len=6):
"""Demonstrate how causal masking works."""
# Random Q, K for illustration
torch.manual_seed(42)
d_k = 4
Q = torch.randn(1, seq_len, d_k)
K = torch.randn(1, seq_len, d_k)
# Raw attention scores
scores = Q @ K.transpose(-2, -1) / d_k**0.5
# Causal mask: lower triangular
mask = torch.tril(torch.ones(seq_len, seq_len))
masked_scores = scores.masked_fill(mask == 0, float('-inf'))
attn = F.softmax(masked_scores, dim=-1)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
print("Causal attention weights:")
print(f"{'':>6}", end="")
for t in tokens:
print(f"{t:>6}", end="")
print()
for i, row_tok in enumerate(tokens):
print(f"{row_tok:>6}", end="")
for j in range(seq_len):
val = attn[0, i, j].item()
print(f"{val:>6.2f}", end="")
print(f" <- '{row_tok}' sees tokens 0..{i}")
print()
print("Position 0 ('The') only attends to itself.")
print("Position 5 ('mat') attends to all 6 tokens.")
print("Future tokens always get weight 0.00 (masked out).")
show_causal_attention()
This is WHY GPT is autoregressive. During training, all positions are computed in parallel (the mask takes care of hiding future tokens), but during generation, each new token is produced one at a time because it genuinely depends on everything that came before it. This is part of what makes GPT fast to train (parallelism) but slow to generate (sequential), as the sketch below makes concrete.
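Here's a minimal sampling loop for the GPT class we built above (a sketch -- generate is a helper we define here, not a library function): every new token requires a full forward pass over the prefix.
@torch.no_grad()
def generate(model, idx, max_new_tokens=50, temperature=0.8):
    """Autoregressive sampling: one forward pass per generated token."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.max_len:]        # crop to the context window
        logits = model(idx_cond)                  # (B, T, vocab_size)
        logits = logits[:, -1, :] / temperature   # only the last position is used
        probs = F.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_tok], dim=1)   # append and repeat
    return idx
# Re-running the whole prefix each step is why real inference engines
# keep a KV cache (see the GQA discussion below).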
BERT takes the opposite approach -- bidirectional attention -- where every token can attend to every other token. This makes BERT excellent for understanding tasks (classification, question answering where you see the full input) but incapable of generation (you can't predict the next token if you already see the full sequence, including the future). We'll dig into that distinction properly in the next episode.
In-context learning: the surprising ability
The most intellectually interesting property of large GPT models is in-context learning (ICL). You provide examples in the prompt, and the model adapts its behavior accordingly -- without any parameter updates.
This is fundamentally different from what we did throughout this entire series. In supervised learning (episodes #10-20), you train on labeled examples and the model updates its weights via gradient descent. In ICL, the weights are frozen. The model processes the examples through its forward pass and produces appropriate outputs for new inputs, all within a single forward pass.
# What the model sees (everything is just tokens):
prompt = """Classify the sentiment:
Text: I love this movie! -> Positive
Text: Terrible waste of time -> Negative
Text: Pretty good overall -> Positive
Text: The worst experience ever ->"""
# The model continues: " Negative"
# It figured out the pattern from 3 examples
# Another example -- format conversion:
prompt_2 = """Convert to JSON:
Name: Alice, Age: 30, City: Amsterdam -> {"name":"Alice","age":30,"city":"Amsterdam"}
Name: Bob, Age: 25, City: Rotterdam -> {"name":"Bob","age":25,"city":"Rotterdam"}
Name: Charlie, Age: 35, City: Utrecht ->"""
# The model continues with the correctly formatted JSON.
# No training. No fine-tuning. Just pattern matching in context.
How does this work? The leading theories:
Implicit gradient descent: some researchers have shown that transformer attention layers can implement a form of gradient descent internally. The examples in the prompt effectively "train" an implicit linear model within the attention mechanism during the forward pass. This is a fascinating result -- the forward pass of a transformer can simulate the training of a simpler model.
Pattern completion: the model has seen millions of similar classification patterns during pre-training. The examples in the prompt activate the relevant "skill" already encoded in the weights. The model isn't learning something new -- it's retrieving the right behavior from what it already knows.
Bayesian inference: the model may be doing implicit Bayesian reasoning over tasks -- using the examples to narrow down which of its many learned behaviors to apply.
The truth is probably some combination of all three. Whatever the mechanism, ICL means you don't need to train a separate model for every task. One pre-trained model adapts to whatever you put in the prompt. This was the foundational insight that made GPT-3 so impactful ;-)
Emergent abilities
Some capabilities don't appear gradually as models scale -- they seem to switch on at a certain size. Below a threshold, the model performs at random chance. Above it, performance jumps dramatically.
Examples that emerged around the 100B+ parameter scale:
- Multi-step arithmetic: correctly doing 3+ digit addition
- Chain-of-thought reasoning: solving problems step by step when prompted with "Let's think step by step"
- Code generation: writing functional programs from natural language descriptions
- Analogical reasoning: "A is to B as C is to ___" across abstract domains
Let's demonstrate the scaling effect with our own models:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
def test_icl_capability(model, tokenizer, n_examples=3):
"""Test if a model can do in-context learning.
Returns accuracy on held-out examples from a pattern."""
# Simple pattern: reverse the word
examples = [
("cat", "tac"), ("dog", "god"), ("bat", "tab"),
("top", "pot"), ("rat", "tar"), ("pan", "nap"),
]
# Build prompt from first n_examples
prompt_tokens = []
for word, rev in examples[:n_examples]:
prompt_tokens.extend(tokenizer.encode(f"{word} -> {rev}\n"))
# Test on remaining examples
correct = 0
for word, rev in examples[n_examples:]:
test_prompt = prompt_tokens + tokenizer.encode(f"{word} -> ")
x = torch.tensor([test_prompt])
with torch.no_grad():
logits = model(x)
# Check if the model predicts the first character of the reversal
pred_id = logits[0, -1].argmax().item()
expected_id = tokenizer.encode(rev[0])[0]
if pred_id == expected_id:
correct += 1
total = len(examples) - n_examples
return correct / total if total > 0 else 0.0
# On tiny models, ICL accuracy will be near zero.
# On GPT-3 scale models, ICL accuracy is 80-95%.
# The jump from "can't do it" to "can do it" happens at scale.
print("In-context learning emerges at scale:")
print(" Small models (< 1B params): ~0% ICL accuracy")
print(" Medium models (1-10B params): ~30-50% ICL accuracy")
print(" Large models (100B+ params): ~80-95% ICL accuracy")
Whether these abilities are truly emergent (discontinuous phase transitions) or just cross a usefulness threshold on a smooth curve is actively debated. Schaeffer et al. (2023) argue that many apparent emergent abilities are artifacts of discontinuous evaluation metrics (like exact-match accuracy) rather than discontinuities in the underlying capability. The practical observation holds either way: some things that a 10B parameter model simply cannot do become reliable at 100B+.
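Their argument is easy to illustrate with arithmetic: if per-token accuracy improves smoothly with scale, exact-match accuracy on a k-token answer is that probability raised to the k-th power, which looks like a sudden jump. A toy sketch:
# Toy illustration of the Schaeffer et al. (2023) argument:
# smooth per-token accuracy -> apparently "emergent" exact-match accuracy
per_token_acc = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]  # improves smoothly
k = 10  # answer requires 10 correct tokens in a row
print(f"{'per-token':>10} {'exact-match (k=10)':>20}")
for p in per_token_acc:
    print(f"{p:>10.2f} {p**k:>20.4f}")
# Exact match stays near zero until per-token accuracy is already high,
# then shoots up -- a smooth capability can look like a phase transition.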
The unreasonable effectiveness of next-token prediction
Here's the deep question that I think is worth spending some time on: why does predicting the next token produce models that can reason, plan, code, and converse?
The answer that most researchers converge on: predicting the next token on internet-scale text requires learning an extremely rich internal representation of the world. To predict what comes next in a physics textbook, you need to model physics. To predict what comes next in a code repository, you need to model programming. To predict the next token in a conversation, you need to model human communication patterns.
# Why next-token prediction is so powerful:
# The model must learn EVERYTHING about the training data
# to minimize its prediction loss.
training_examples = {
"Physics text": (
"Newton's second law states that F = ma, so a 2kg object "
"experiencing 10N of force has acceleration ___",
"5 m/s^2" # Must know physics to predict this
),
"Code": (
"def fibonacci(n):\n if n <= 1:\n return ___",
"n" # Must know programming to predict this
),
"Conversation": (
"User: What's the capital of the Netherlands?\nAssistant: The capital is ___",
"Amsterdam" # Must know geography to predict this
),
"Math": (
"If x + 3 = 7, then x = ___",
"4" # Must know algebra to predict this
),
}
print("To predict the next token correctly, the model must learn:")
for domain, (context, answer) in training_examples.items():
print(f"\n {domain}:")
print(f" Context: ...{context[-60:]}")
print(f" Must predict: '{answer}'")
print(f" Requires: knowledge of {domain.lower()}")
The pre-training objective doesn't explicitly teach reasoning. But reasoning is necessary to minimize the next-token prediction loss on text that contains reasoning. The model is implicitly incentivized to learn every pattern present in its training data -- including logic, arithmetic, coding conventions, world knowledge, and social dynamics.
Whether this learned representation constitutes genuine "understanding" or is extremely sophisticated pattern matching is one of the deepest questions in AI research right now. I argue that the line between "understanding" and "very sophisticated pattern matching" is a lot blurrier than people like to admit. But what's NOT debatable: the behavior is useful, it scales, and it emerges from a simple training objective.
Architectural refinements in modern GPT variants
The core GPT architecture has been refined significantly since GPT-1. Let me walk through the most important changes, because if you read any recent LLM paper you'll encounter these terms:
import torch
import torch.nn as nn
import torch.nn.functional as F  # F.silu and F.softmax are used below
import math
# 1. RoPE -- Rotary Position Embeddings
# Instead of learned position embeddings (GPT-1/2/3),
# RoPE rotates Q and K vectors based on position.
# This makes position information relative, not absolute,
# which generalizes better to sequence lengths unseen in training.
def apply_rope(x, seq_len, d_model):
"""Simplified RoPE: rotate pairs of dimensions by position."""
positions = torch.arange(seq_len).float()
dim_pairs = d_model // 2
freqs = 1.0 / (10000 ** (torch.arange(dim_pairs).float() / dim_pairs))
angles = positions.unsqueeze(1) * freqs.unsqueeze(0) # (T, d/2)
cos_angles = angles.cos()
sin_angles = angles.sin()
# Split x into pairs and rotate
x1 = x[..., :dim_pairs]
x2 = x[..., dim_pairs:]
rotated = torch.cat([
x1 * cos_angles - x2 * sin_angles,
x1 * sin_angles + x2 * cos_angles
], dim=-1)
return rotated
# 2. SwiGLU activation (replaces GELU in feed-forward)
class SwiGLU(nn.Module):
"""SwiGLU: gate * swish(linear(x)) -- used in LLaMA, PaLM."""
def __init__(self, d_model, d_ff):
super().__init__()
self.w1 = nn.Linear(d_model, d_ff, bias=False)
self.w2 = nn.Linear(d_ff, d_model, bias=False)
self.w3 = nn.Linear(d_model, d_ff, bias=False)
def forward(self, x):
return self.w2(F.silu(self.w1(x)) * self.w3(x))
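# A note on sizing (sketch, following the standard GLU-variants recipe):
# SwiGLU has three weight matrices where a GELU FFN has two, so d_ff is
# usually shrunk by ~2/3 to keep parameters comparable (real models round
# this or scale it back up -- hence the ~3.5x ratios in the table above).
d = 4096
gelu_ffn = 2 * d * (4 * d)                # up + down projection, 4x expansion
swiglu_ffn = 3 * d * int(2 / 3 * 4 * d)   # gate + up + down, ~2.7x expansion
print(f"GELU FFN params:   {gelu_ffn:,}")
print(f"SwiGLU FFN params: {swiglu_ffn:,}  (roughly equal)")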
# 3. Grouped Query Attention (GQA)
# Instead of separate K,V heads for each Q head,
# share K,V across groups of Q heads.
# Reduces KV cache memory at inference time.
class GroupedQueryAttention(nn.Module):
def __init__(self, d_model, n_q_heads=32, n_kv_heads=8):
super().__init__()
self.n_q_heads = n_q_heads
self.n_kv_heads = n_kv_heads
self.group_size = n_q_heads // n_kv_heads
self.d_k = d_model // n_q_heads
self.wq = nn.Linear(d_model, n_q_heads * self.d_k, bias=False)
self.wk = nn.Linear(d_model, n_kv_heads * self.d_k, bias=False)
self.wv = nn.Linear(d_model, n_kv_heads * self.d_k, bias=False)
self.wo = nn.Linear(n_q_heads * self.d_k, d_model, bias=False)
def forward(self, x, mask=None):
B, T, _ = x.shape
q = self.wq(x).view(B, T, self.n_q_heads, self.d_k).transpose(1, 2)
k = self.wk(x).view(B, T, self.n_kv_heads, self.d_k).transpose(1, 2)
v = self.wv(x).view(B, T, self.n_kv_heads, self.d_k).transpose(1, 2)
# Repeat K,V for each group
k = k.repeat_interleave(self.group_size, dim=1)
v = v.repeat_interleave(self.group_size, dim=1)
scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
attn = F.softmax(scores, dim=-1)
out = (attn @ v).transpose(1, 2).contiguous().view(B, T, -1)
return self.wo(out)
# Compare memory usage
d_model = 4096
mha = nn.MultiheadAttention(d_model, 32, batch_first=True)
gqa = GroupedQueryAttention(d_model, n_q_heads=32, n_kv_heads=8)
print(f"Standard MHA params: {sum(p.numel() for p in mha.parameters()):,}")
print(f"GQA (32Q/8KV) params: {sum(p.numel() for p in gqa.parameters()):,}")
print(f"KV cache reduction: {32/8:.0f}x smaller with GQA")
RoPE is the big one for long-context models. Learned position embeddings (GPT-1/2/3) have a fixed maximum sequence length -- the model can't process anything longer than what it was trained on. RoPE encodes position as a rotation, which means the model can extrapolate to longer sequences than it saw during training. LLaMA, Mistral, and most modern open-source LLMs use RoPE.
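A quick numeric check of that "relative" property, using the simplified apply_rope above: the score between a rotated query at position m and a rotated key at position n depends only on the offset m - n, so shifting both positions by the same amount leaves it unchanged:
# RoPE makes attention scores depend only on relative position
T, d = 16, 64
q_vec, k_vec = torch.randn(d), torch.randn(d)
q_rot = apply_rope(q_vec.expand(T, d), T, d)  # same query vector at every position
k_rot = apply_rope(k_vec.expand(T, d), T, d)
# Offset of 3 in both cases -- the scores match (up to float error):
print((q_rot[2] @ k_rot[5]).item(), (q_rot[9] @ k_rot[12]).item())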
GQA is critical for inference efficiency. During generation, the model stores all previous keys and values in a KV cache (so it doesn't recompute them at each step). With 32 heads, that's 32 sets of K and V tensors to store. GQA with 8 KV heads reduces this cache by 4x, which translates directly to being able to serve longer contexts or more concurrent users on the same GPU.
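The memory arithmetic is worth doing once. Per sequence, the KV cache holds two tensors (K and V) per layer, each n_kv_heads * seq_len * d_k values. A sketch with LLaMA-2-70B-like numbers from the table above, assuming fp16 (2 bytes per value):
# KV cache size: 2 (K and V) * layers * kv_heads * seq_len * d_k * bytes
layers, d_k, seq_len, bytes_per_val = 80, 128, 4096, 2
for n_kv in [64, 8]:  # full multi-head attention vs GQA with 8 KV heads
    cache_bytes = 2 * layers * n_kv * seq_len * d_k * bytes_per_val
    print(f"{n_kv:>2d} KV heads: {cache_bytes / 1e9:.1f} GB per sequence")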
The bottom line
- Decoder-only transformers (GPT) use only causal masked self-attention and next-token prediction -- no encoder, no cross-attention;
- GPT-1 proved pre-training works, GPT-2 showed generation quality, GPT-3 demonstrated in-context learning, GPT-4 achieved broad capabilities;
- In-context learning lets the model adapt to new tasks from examples in the prompt, with no parameter updates -- this is fundamentally different from the supervised learning we covered in episodes #10-20;
- Emergent abilities appear at scale: arithmetic, chain-of-thought reasoning, and code generation emerge in 100B+ parameter models;
- Next-token prediction implicitly requires learning rich world models -- reasoning, coding, and knowledge are necessary to minimize prediction loss on diverse text;
- The architecture is the same from GPT-1 to GPT-4: just transformer decoder blocks, scaled up. The simplicity is the strength;
- Modern refinements (RoPE, SwiGLU, GQA) improve efficiency and long-context handling, but the fundamental architecture remains unchanged.
Exercises
Exercise 1: Build a complete GPT model using the CausalSelfAttention, GPTBlock, and GPT classes from this episode. Create two configurations: "small" (d_model=64, 2 layers, 4 heads) and "medium" (d_model=128, 4 layers, 4 heads). Train both on a character-level text corpus (Shakespeare or similar, at least 100KB) for 2000 steps with AdamW (lr=3e-4, batch_size=32, block_size=64). Report: (a) parameter count for each, (b) final training loss, (c) generate 200 characters from the prompt "The " with temperature=0.8. Compare the output quality and loss between the two sizes.
Exercise 2: Implement the causal attention visualization from episode #56 but now compare causal (GPT-style) vs bidirectional (BERT-style) attention on the same input. Take a short sentence ("The cat sat on the mat"), run it through both a causal-masked attention layer and an unmasked attention layer, and plot both attention weight matrices side by side. For each, identify: which positions attend most to which other positions? How does masking change the attention distribution? Print the attention entropy (average Shannon entropy across positions) for both -- bidirectional should have higher entropy because each position distributes attention across more candidates.
Exercise 3: Implement a simple few-shot in-context learning test. Build a small GPT model and train it on a text that contains several "X -> Y" mapping examples (e.g., country -> capital pairs, or word -> reversed word). After training, give the model a prompt with 3 seen examples followed by a new unseen mapping, and check if the model predicts the correct continuation. Try varying the number of in-context examples (1, 2, 3, 5) and report how accuracy changes. This demonstrates ICL even at small scale (though accuracy will be low -- real ICL requires much larger models).