Beautiful — this is where everything comes together, Rajeev 💥
Welcome to Day 5: GPT Architecture & Autoregressive Text Generation —
today we go from “understanding Transformers” to how GPT actually generates human-like language —
one token at a time, predicting the next word, over and over again,
until you get a poem, a paragraph, or even a book 📖✨
By the end of this lesson, you’ll deeply understand how ChatGPT itself works under the hood.
🌎 DAY 5 — GPT Architecture & Autoregressive Text Generation
🧠 1. The Big Idea
GPT = Generative Pre-trained Transformer
It’s called that because it:
- Generates text
- Is Pre-trained on huge datasets
- Is a Transformer (decoder-only) architecture
In simple words:
GPT learns how to predict the next token, given all previous ones.
That’s it.
That single mechanism — scaled up — creates the illusion of reasoning, creativity, and understanding.
🧩 2. What Makes GPT Different from BERT?
| Feature | BERT (Encoder) | GPT (Decoder) |
|---|---|---|
| Type | Bi-directional encoder | Uni-directional decoder |
| Task | Masked token prediction | Next token prediction |
| Attention | Bidirectional self-attention (sees left and right context) | Causal (masked) self-attention (sees left context only) |
| Output | Context understanding | Text generation |
| Example | “Fill in the blank” | “Continue this text” |
💡 BERT understands language,
while GPT generates it.
⚙️ 3. GPT’s Architecture at a Glance
GPT uses only the decoder part of the Transformer, repeated many times.
Each block contains:
Masked Multi-Head Self-Attention
↓
Add & LayerNorm
↓
Feedforward (MLP)
↓
Add & LayerNorm
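The input embeddings (plus positional encodings) flow through these blocks,
and at the end a linear layer followed by softmax turns the final hidden state into next-token probabilities.
(The layout above, with LayerNorm after each residual add, follows the original Transformer and GPT-1; GPT-2 and later models apply LayerNorm before each sub-layer instead.)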
🔍 Visual Overview
Text Input: "I love"
↓
Tokenizer → Tokens [I, love]
↓
Embedding + Positional Info
↓
Transformer Blocks (×N)
↓
Logits → Softmax → Probabilities
↓
Pick (or sample) the next token → "AI"
↓
Repeat autoregressively...
🔄 4. The Core Principle: Autoregression
Autoregression means:
Predict the next token based on all previous tokens.
Example
Prompt: "The capital of France is"
| Step | Input | Output |
|---|---|---|
| 1 | The | capital |
| 2 | The capital | of |
| 3 | The capital of | France |
| 4 | The capital of France | is |
| 5 | The capital of France is | Paris |
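Steps 1 to 4 are what the model practices during training, where every prefix of a real sentence is used to predict its next token.
At generation time the prompt is already given, so the model starts at step 5 and then feeds each token it produces back in as part of the next input.
Generation continues until the model emits an end-of-sequence token or reaches a length limit.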
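The feedback loop can be sketched in a few lines of plain Python. The `NEXT_WORD` table is a made-up stand-in for a trained model (it only looks at the last word, whereas GPT conditions on the whole prefix), and `generate` is a hypothetical helper name, so treat this as an illustration of the loop only:

```python
# Toy stand-in for a trained model: maps the last word to a "most likely" next word.
# (Hypothetical data; a real GPT conditions on the full prefix, not just the last word.)
NEXT_WORD = {
    "The": "capital",
    "capital": "of",
    "of": "France",
    "France": "is",
    "is": "Paris",
    "Paris": "<eos>",
}

def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive loop: append one token at a time until <eos>."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        next_token = NEXT_WORD.get(tokens[-1], "<eos>")
        if next_token == "<eos>":      # stop signal
            break
        tokens.append(next_token)      # the output becomes part of the next input
    return " ".join(tokens)

print(generate("The capital of France is"))
```

The two exits from the loop (the `<eos>` token and `max_new_tokens`) mirror the two usual ways generation stops.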
🧮 5. The Math Behind GPT’s Text Generation
For a sequence of tokens \(x_1, x_2, \dots, x_T\):
\[
P(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})
\]
The model is trained to maximize the probability of the correct next token at each position, which is the same as minimizing the negative log-likelihood:
\[
\text{Loss} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t})
\]
This is called the causal language modeling (CLM) objective.
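To make the two formulas concrete, here is a small worked example in plain Python. The per-step probabilities are made-up numbers chosen for illustration, not the output of any real model:

```python
import math

# Assumed model probabilities for the correct token at each position
# (made-up numbers, purely for illustration).
step_probs = [0.20, 0.50, 0.90, 0.80, 0.95]

# Chain rule: the sequence probability is the product of the per-step probabilities.
sequence_prob = math.prod(step_probs)

# CLM loss: the negative sum of the per-step log-probabilities.
loss = -sum(math.log(p) for p in step_probs)

print(f"P(sequence) = {sequence_prob:.4f}")
print(f"Loss (sum)  = {loss:.4f}")
print(f"Loss (mean) = {loss / len(step_probs):.4f}")
```

The loss equals the negative log of the sequence probability, so pushing the loss down is the same as pushing the probability of the real text up. In practice the mean over positions is usually what gets reported.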
🧩 6. Masked Self-Attention in GPT
Unlike the encoder’s “see all” attention,
GPT’s decoder uses a causal mask so each token can only attend to previous ones.
Visual:
| | Word1 | Word2 | Word3 | Word4 |
|---|---|---|---|---|
| Word1 | ✓ | | | |
| Word2 | ✓ | ✓ | | |
| Word3 | ✓ | ✓ | ✓ | |
| Word4 | ✓ | ✓ | ✓ | ✓ |
This ensures no future information leakage.
⚙️ PyTorch-style Implementation
import torch

# Causal mask: True marks the positions a token is NOT allowed to attend to
# (everything above the diagonal, i.e. future tokens).
mask = torch.triu(torch.ones(4, 4), diagonal=1).bool()
print(mask)
Output:
tensor([[False,  True,  True,  True],
        [False, False,  True,  True],
        [False, False, False,  True],
        [False, False, False, False]])
This mask is applied to attention scores — GPT literally cannot “see ahead.”
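What "applied to attention scores" means can be sketched as follows: blocked positions are set to negative infinity before the softmax, so they end up with weight exactly zero. The sketch uses plain Python lists and made-up scores so the arithmetic is visible; `causal_softmax` is a hypothetical helper name. In PyTorch the same step is typically written with `scores.masked_fill(mask, float("-inf"))` followed by a softmax.

```python
import math

NEG_INF = float("-inf")

def causal_softmax(scores):
    """Mask future positions (column j > row i) with -inf, then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else NEG_INF for j, s in enumerate(row)]
        row_max = max(masked)                           # the diagonal is never masked
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up raw attention scores for a 3-token sequence.
scores = [
    [1.0, 2.0, 3.0],
    [1.0, 1.0, 5.0],
    [0.5, 0.5, 0.5],
]
for row in causal_softmax(scores):
    print([round(w, 3) for w in row])
```

Note that the large score 5.0 in the second row has no effect, because it sits in a future position.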
🧱 7. The GPT Block Structure
Each GPT block =
Self-Attention (masked) → Add & Norm → Feedforward → Add & Norm
import torch.nn as nn

class GPTBlock(nn.Module):
    """One decoder block: masked attention and an MLP, each followed by Add & LayerNorm."""
    def __init__(self, embed_dim=768, num_heads=12, ff_hidden=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_hidden),
            nn.GELU(),
            nn.Linear(ff_hidden, embed_dim),
        )
        self.ln2 = nn.LayerNorm(embed_dim)

    def forward(self, x, attn_mask=None):
        # x: (batch, seq_len, embed_dim); attn_mask: True = position is blocked
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.ln1(x + attn_out)       # residual connection, then LayerNorm
        ff_out = self.ff(x)
        return self.ln2(x + ff_out)      # residual connection, then LayerNorm
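As a rough sanity check on the sizes involved, the parameter count of one such block can be worked out by hand. The sketch below assumes the PyTorch defaults used above (biases enabled on the attention projections and on both linear layers); `gpt_block_params` is a hypothetical helper, not part of any library:

```python
def gpt_block_params(embed_dim=768, ff_hidden=3072):
    """Count learnable parameters in one block shaped like the GPTBlock above."""
    d = embed_dim
    attention = 4 * d * d + 4 * d                  # Q, K, V and output projections, with biases
    feedforward = (d * ff_hidden + ff_hidden) + (ff_hidden * d + d)  # two linear layers
    layernorms = 2 * (2 * d)                       # two LayerNorms, each with a scale and a shift
    return attention + feedforward + layernorms

per_block = gpt_block_params()
print(f"{per_block:,} parameters per block")
print(f"{12 * per_block:,} parameters in 12 blocks")
```

The number of heads does not appear in the count: splitting into heads changes how the projections are used, not how many weights they contain.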
💡 8. The Output Head
After passing through all transformer blocks,
GPT uses a linear projection layer to map hidden vectors → vocabulary logits.
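\[
\text{logits}_t = h_t W^T
\]
\[
P(x_{t+1} \mid x_{\le t}) = \text{softmax}(\text{logits}_t)
\]
Here \(h_t\) is the final hidden vector at position \(t\), and \(W\) has one row per vocabulary token.
The next token is then either the highest-probability one (greedy decoding) or a sample drawn from this distribution, which adds variety.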
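Here is the output head worked through on a tiny example in plain Python. The hidden size of 2, the three-word vocabulary, and all the numbers are hypothetical, chosen only so the dot products are easy to follow:

```python
import math

# Hypothetical sizes: hidden dimension 2, vocabulary of 3 tokens.
vocab = ["blue", "dark", "banana"]
h_t = [1.0, -0.5]                 # final hidden vector at the last position
W = [                             # one row of weights per vocabulary token
    [2.0, 0.0],
    [0.5, 1.0],
    [-1.0, 1.0],
]

# logits = h_t W^T: one dot product per vocabulary token.
logits = [sum(h * w for h, w in zip(h_t, row)) for row in W]

# Softmax turns the logits into a probability distribution.
m = max(logits)
exps = [math.exp(x - m) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

Subtracting the maximum logit before exponentiating does not change the result; it is a common trick to avoid overflow.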
🔀 Greedy vs Sampling vs Temperature
| Strategy | Description | Effect |
|---|---|---|
| Greedy | Pick the highest-probability token | Deterministic; can become repetitive |
| Top-k | Sample from the k most likely tokens | Adds controlled variety |
| Temperature | Divide the logits by T before softmax (e.g. T = 0.7 is more focused, T = 1.2 is more varied) | Adjusts randomness |
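A minimal sketch of these strategies in plain Python follows. The vocabulary and logits are made up, and the function names (`softmax`, `greedy`, `sample_top_k`) are hypothetical helpers rather than a library API:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature divides the logits first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_top_k(logits, k=2, temperature=1.0, rng=random):
    """Keep the k highest logits, renormalize, and sample one index from them."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top], temperature)
    return rng.choices(top, weights=probs, k=1)[0]

# Made-up vocabulary and logits for illustration.
vocab = ["blue", "dark", "falling", "banana"]
logits = [4.0, 2.0, 1.0, -1.0]

print("greedy:", vocab[greedy(logits)])
print("top-k :", vocab[sample_top_k(logits, k=2, temperature=0.7)])
```

Greedy always prints "blue" here, while the top-k line can print either "blue" or "dark" from run to run; the other tokens are excluded before sampling.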
⚡ 9. Example of Next Token Prediction (Visualization)
Prompt: "The sky is"
GPT internally computes:
| Possible next token | Probability |
|---|---|
| “blue” | 0.79 |
| “dark” | 0.12 |
| “falling” | 0.04 |
| “beautiful” | 0.03 |
| “banana” | 0.01 |
With greedy decoding it picks “blue”; with sampling, “blue” is simply the most likely draw.
Now input becomes "The sky is blue" → predict next token again.
This repeats autoregressively.
💬 Visual Loop:
"The sky is"
↓
Predict next token: "blue"
↓
"The sky is blue"
↓
Predict next token: "and"
↓
"The sky is blue and"
↓
Predict next token: "clear"
↓
...
This is how GPT writes essays, poems, or entire books — one token at a time.
🧮 10. Small Code Demo (Character-Level GPT)
Here’s a minimal version. It reuses the GPTBlock class from section 7 and runs a single forward pass to compute the loss of the untrained model (there is no training loop):
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGPT(nn.Module):
    """Single-block GPT; depends on GPTBlock defined in section 7."""
    def __init__(self, vocab_size, embed_dim=32, max_len=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, embed_dim))  # learned positions
        self.block = GPTBlock(embed_dim, num_heads=4, ff_hidden=64)
        self.fc_out = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        seq_len = x.size(1)
        # True above the diagonal = future positions are blocked
        mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1).bool()
        x = self.embed(x) + self.pos[:, :seq_len, :]
        x = self.block(x, attn_mask=mask)
        return self.fc_out(x)  # logits of shape (batch, seq_len, vocab_size)

# Build a character vocabulary from a toy string
text = "hello world"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}

# Inputs are all characters but the last; targets are the same text shifted by one
data = torch.tensor([[stoi[c] for c in text[:-1]]])
target = torch.tensor([[stoi[c] for c in text[1:]]])

model = TinyGPT(vocab_size=len(chars))
logits = model(data)
loss = F.cross_entropy(logits.view(-1, len(chars)), target.view(-1))
print("Loss:", loss.item())
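The shift between `data` and `target` is the part that tends to confuse people, so here it is spelled out in plain Python (no PyTorch needed). The last line prints the loss a uniform guess over the vocabulary would get, which is only a rough reference point for what an untrained model tends to produce, not an exact prediction:

```python
import math

text = "hello world"
chars = sorted(set(text))                 # unique characters form the vocabulary
stoi = {c: i for i, c in enumerate(chars)}

inputs = text[:-1]                        # every character except the last
targets = text[1:]                        # the same text shifted left by one

# Each input character is paired with the character that follows it.
for x, y in zip(inputs, targets):
    print(f"{x!r} ({stoi[x]}) -> {y!r} ({stoi[y]})")

print("vocab size:", len(chars))
print("loss of a uniform guess:", round(math.log(len(chars)), 3))
```

Each row is one training example: at position t the model sees the characters up to and including the input character and is scored on how much probability it gives the target character.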
🔮 11. Visualizing Generation Probabilities
If we plot the probabilities from the softmax layer, we’d see something like this:
[Plot: P(next token | context) on the vertical axis, candidate tokens (blue, dark, beautiful, banana) on the horizontal axis]
Temperature reshapes this distribution during generation: a temperature below 1 sharpens the peak, and a temperature above 1 flattens it.
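The effect of temperature can be seen numerically by reusing the probabilities from the "The sky is" example in section 9. This is a sketch in plain Python; `apply_temperature` is a hypothetical helper, and it works on probabilities directly, which gives the same result as dividing the logits by T before the softmax:

```python
# Probabilities from the "The sky is" example (they sum to 0.99; the remainder
# is assumed to belong to tokens that are not listed).
probs = {"blue": 0.79, "dark": 0.12, "falling": 0.04, "beautiful": 0.03, "banana": 0.01}

def apply_temperature(probs, temperature):
    """Raise each probability to 1/T and renormalize (same as dividing logits by T)."""
    powered = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(powered.values())
    return {tok: v / total for tok, v in powered.items()}

for t in (0.7, 1.0, 1.2):
    reshaped = apply_temperature(probs, t)
    summary = "  ".join(f"{tok}={p:.2f}" for tok, p in reshaped.items())
    print(f"T={t}: {summary}")
```

At T = 0.7 more of the probability mass moves onto "blue", while at T = 1.2 the unlikely tokens gain a little, which is where the extra variety comes from.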
🧠 12. GPT’s Training Phases
| Phase | Description |
|---|---|
| Pretraining | Learn general language patterns from internet text |
| Fine-tuning | Adapt to specific tasks or styles |
| RLHF (Reinforcement Learning from Human Feedback) | Align outputs with human preference (used in ChatGPT) |
🧩 13. What GPT Learns Internally
GPT doesn’t store sentences — it learns probabilistic relationships between tokens.
For example, it learns that:
“The capital of France is” → “Paris”
“Photosynthesis occurs in” → “plants”
These associations are stored in the model's learned parameters (embeddings, attention weights, and feed-forward weights), not as a lookup table of sentences.
⚙️ 14. Scaling Up GPT
| Model | Year | Parameters | Key Improvement |
|---|---|---|---|
| GPT-1 | 2018 | 117M | Proof-of-concept |
| GPT-2 | 2019 | 1.5B | Long, coherent text |
| GPT-3 | 2020 | 175B | Few-shot in-context learning |
| GPT-4 | 2023 | Not publicly disclosed | Multimodal reasoning & alignment |
As size grows, emergent properties appear (reasoning, coding, abstraction).
💡 15. Emergent Behavior: Why GPT Feels “Intelligent”
Because of scale and context learning, GPT learns:
- Grammar & semantics (syntax correctness)
- Long-term dependencies (story continuity)
- Analogies & reasoning (pattern recognition)
- Creativity (sampling-based diversity)
It’s not “thinking” — but the learned statistical relationships simulate intelligence.
✅ 16. Summary — What You Learned Today
| Concept | Key Takeaway |
|---|---|
| GPT = Decoder-only Transformer | Uses masked self-attention |
| Autoregression | Predicts next token step by step |
| Causal mask | Prevents future leakage |
| Output layer | Softmax → next token probabilities |
| Text generation | Sampling loop from predicted probabilities |
🔮 Coming Next
👉 Day 6 — Pretraining & Next Token Prediction Objective
We’ll dive deeper into how GPT learns language patterns,
including dataset structure, optimization process,
and the fascinating “next-token-prediction” that teaches everything —
grammar, logic, reasoning, and memory.
Would you like me to now:
1️⃣ Create a Colab-ready notebook for Day 5 (with visual next-token simulation + generation demo),
or
2️⃣ Go straight into Day 6 — Pretraining & Next Token Prediction Objective (How GPT actually learns during training)?