Beautiful — this is where everything comes together, Rajeev 💥

Welcome to Day 5: GPT Architecture & Autoregressive Text Generation —
today we go from “understanding Transformers” to how GPT actually generates human-like language —
one token at a time, predicting the next word, over and over again,
until you get a poem, a paragraph, or even a book 📖✨

By the end of this lesson, you’ll deeply understand how ChatGPT itself works under the hood.

🌎 DAY 5 — GPT Architecture & Autoregressive Text Generation

🧠 1. The Big Idea

GPT = Generative Pre-trained Transformer

It’s called that because it:

Generates text
Is Pre-trained on huge datasets
Is a Transformer (decoder-only) architecture

In simple words:

GPT learns how to predict the next token, given all previous ones.

That’s it.
That single mechanism — scaled up — creates the illusion of reasoning, creativity, and understanding.

🧩 2. What Makes GPT Different from BERT?

Feature	BERT (Encoder)	GPT (Decoder)
Type	Bi-directional encoder	Uni-directional decoder
Task	Masked token prediction	Next token prediction
Attention	Full self-attention	Masked self-attention
Output	Context understanding	Text generation
Example	“Fill in the blank”	“Continue this text”

💡 BERT understands language,
while GPT generates it.

⚙️ 3. GPT’s Architecture at a Glance

GPT uses only the decoder side of the Transformer (without the encoder-decoder cross-attention), stacked many times.

Each block contains:

Masked Multi-Head Self-Attention
  ↓
Add & LayerNorm
  ↓
Feedforward (MLP)
  ↓
Add & LayerNorm

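The token embeddings (plus learned positional embeddings) flow through these blocks,
and at the end → a linear layer + softmax turns the final hidden state into a probability distribution over the next token.
(This is the post-norm layout used by the original Transformer and GPT-1; GPT-2 and later move LayerNorm to the start of each sub-layer.)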

🔍 Visual Overview

Text Input: "I love"
     ↓
Tokenizer → Tokens [I, love]
     ↓
Embedding + Positional Info
     ↓
Transformer Blocks (×N)
     ↓
Logits → Softmax → Probabilities
     ↓
Pick (or sample) the next token → "AI"
     ↓
Repeat autoregressively...

🔄 4. The Core Principle: Autoregression

Autoregression means:

Predict the next token based on all previous tokens.

Example

Prompt: "The capital of France is"

Step	Input	Output
1	The	capital
2	The capital	of
3	The capital of	France
4	The capital of France	is
5	The capital of France is	Paris

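The table shows the prediction made at every prefix, which is what the model is scored on during training.
At generation time the whole prompt is supplied at once, and only the tokens after it (here "Paris") come from the model.
Each generated token is appended to the input for the next step, so generation can continue until an end-of-sequence token appears or the context window is full.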
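
The loop itself fits in a few lines of plain Python. This is only a sketch: `TOY_MODEL` is a hypothetical lookup table with made-up probabilities standing in for a trained network, and the names `next_token_distribution`, `generate`, and the `<eos>` marker are assumptions for illustration.

```python
# Hypothetical stand-in for a trained model: maps a full context (tuple of
# tokens) to a made-up probability distribution over the next token.
TOY_MODEL = {
    ("The", "capital", "of", "France", "is"): {"Paris": 0.92, "a": 0.05, "located": 0.03},
    ("The", "capital", "of", "France", "is", "Paris"): {".": 0.90, ",": 0.10},
    ("The", "capital", "of", "France", "is", "Paris", "."): {"<eos>": 1.0},
}

def next_token_distribution(context):
    """Return P(next token | context); unknown contexts end the sequence."""
    return TOY_MODEL.get(tuple(context), {"<eos>": 1.0})

def generate(prompt_tokens, max_new_tokens=10):
    """Greedy autoregressive decoding: append the most likely token, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        best = max(dist, key=dist.get)
        if best == "<eos>":
            break
        tokens.append(best)  # the output becomes part of the next input
    return tokens

result = generate(["The", "capital", "of", "France", "is"])
print(" ".join(result))
```

A real GPT replaces the lookup table with a forward pass through the network, but the outer loop has the same shape.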

🧮 5. The Math Behind GPT’s Text Generation

For a sequence of tokens \(x_1, x_2, \dots, x_T\):
\[
P(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})
\]

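The model is trained to maximize the probability of the correct next token at each position, which is the same as minimizing the negative log-likelihood:
\[
\text{Loss} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t})
\]
where \(x_{<t}\) is shorthand for \(x_1, \dots, x_{t-1}\).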

This is called the causal language modeling (CLM) objective.
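
To make the two formulas concrete, here is a small worked example. The per-step probabilities are made up for illustration; in a real model they would come from the softmax output at each position.

```python
import math

# Made-up conditional probabilities P(x_t | x_<t) for a 5-token sequence.
step_probs = [0.20, 0.50, 0.90, 0.60, 0.95]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(step_probs)

# Causal LM loss: negative sum of the log-probabilities.
loss = -sum(math.log(p) for p in step_probs)

# The two views agree: loss equals -log(joint).
mean_loss = loss / len(step_probs)
print(f"joint = {joint:.4f}, loss = {loss:.4f}, mean loss per token = {mean_loss:.4f}")
```

Summing logs instead of multiplying probabilities keeps the numbers in a comfortable range for long sequences, which is one reason the loss is written in log form.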

🧩 6. Masked Self-Attention in GPT

Unlike the encoder’s “see all” attention,
GPT’s decoder uses a causal mask so each token can only attend to previous ones.

Visual:

           Word1  Word2  Word3  Word4
Word1        ✓
Word2        ✓      ✓
Word3        ✓      ✓      ✓
Word4        ✓      ✓      ✓      ✓

This ensures no future information leakage.

⚙️ PyTorch-style Implementation

import torch
import torch.nn.functional as F

# Create mask (upper triangular)
mask = torch.triu(torch.ones(4, 4), diagonal=1).bool()
print(mask)

Output:

tensor([[False,  True,  True,  True],
        [False, False,  True,  True],
        [False, False, False,  True],
        [False, False, False, False]])

Positions marked True are blocked: their attention scores are set to -inf before the softmax, so they end up with zero weight. GPT literally cannot “see ahead.”
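
Here is a plain-Python sketch of what "applying the mask" does to one set of attention scores. The scores are made up, and the nested-list mask simply mirrors the `torch.triu(..., diagonal=1)` pattern above; a real implementation would do this on tensors.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw attention scores for 4 tokens (row = query, column = key).
scores = [
    [0.5, 1.2, -0.3, 0.8],
    [0.1, 0.4, 2.0, -1.0],
    [1.5, -0.2, 0.3, 0.9],
    [0.0, 0.7, -0.5, 1.1],
]

n = len(scores)
# Same pattern as the triu mask: True where the key comes after the query.
mask = [[j > i for j in range(n)] for i in range(n)]

# Masked positions get -inf, so they contribute exp(-inf) = 0 after softmax.
masked = [[-math.inf if mask[i][j] else scores[i][j] for j in range(n)]
          for i in range(n)]
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 3) for w in row])
```

Every row still sums to 1, but all of the weight sits on the current and earlier positions.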

🧱 7. The GPT Block Structure

Each GPT block =
Self-Attention (masked) → Add & Norm → Feedforward → Add & Norm

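import torch.nn as nn

class GPTBlock(nn.Module):
    """One post-norm decoder block: masked self-attention, then a feedforward MLP."""

    def __init__(self, embed_dim=768, num_heads=12, ff_hidden=3072):
        super().__init__()
        # batch_first=True: inputs are shaped (batch, seq_len, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(embed_dim)
        # Position-wise MLP: expand to ff_hidden, apply GELU, project back
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_hidden),
            nn.GELU(),
            nn.Linear(ff_hidden, embed_dim)
        )
        self.ln2 = nn.LayerNorm(embed_dim)

    def forward(self, x, attn_mask=None):
        # attn_mask: boolean (seq_len, seq_len); True marks positions that may NOT be attended to
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.ln1(x + attn_out)      # residual connection, then LayerNorm
        ff_out = self.ff(x)
        return self.ln2(x + ff_out)     # second residual connection, then LayerNorm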
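
To get a feel for the size of one block, we can count its learnable parameters by hand. This is back-of-the-envelope arithmetic for the `GPTBlock` defaults above; the helper name `gpt_block_params` and the depth of 12 blocks are assumptions for illustration, not something the lesson specifies.

```python
def gpt_block_params(embed_dim=768, ff_hidden=3072):
    """Count learnable parameters in one GPTBlock with the given sizes."""
    # nn.MultiheadAttention: fused Q/K/V projection plus output projection,
    # each with a bias. The number of heads does not change the count.
    attn = (3 * embed_dim * embed_dim + 3 * embed_dim) + (embed_dim * embed_dim + embed_dim)
    # Two Linear layers with biases.
    ff = (embed_dim * ff_hidden + ff_hidden) + (ff_hidden * embed_dim + embed_dim)
    # Two LayerNorms, each with a scale vector and a shift vector.
    norms = 2 * (2 * embed_dim)
    return attn + ff + norms

per_block = gpt_block_params()
print(f"{per_block:,} parameters per block")
print(f"{12 * per_block:,} parameters in 12 blocks")  # 12 is an assumed depth
```

Roughly two thirds of each block's parameters live in the feedforward MLP, not in attention.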

💡 8. The Output Head

After passing through all transformer blocks,
GPT uses a linear projection layer to map hidden vectors → vocabulary logits.

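\[
\text{logits}_t = h_t W^{\top}
\]
\[
P(x_{t+1} \mid x_1, \dots, x_t) = \text{softmax}(\text{logits}_t)
\]
Here \(h_t\) is the final hidden state at position \(t\), and \(W\) has one row per vocabulary token.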

The next token is then either the highest-probability one (greedy decoding) or drawn at random from this distribution for more varied output.
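
A tiny numeric version of the output head, in plain Python. The 3-token vocabulary, the hidden vector `h_t`, and the weight matrix `W` are all made up for illustration; a real model uses hundreds of dimensions and tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["blue", "dark", "banana"]   # toy 3-token vocabulary
h_t = [1.0, -0.5]                    # final hidden state (dimension 2)
W = [                                # one row of W per vocabulary token
    [2.0, 0.0],    # blue
    [0.5, 1.0],    # dark
    [-1.0, 1.0],   # banana
]

# logits = h_t W^T: dot product of h_t with each token's row of W.
logits = [sum(h * w for h, w in zip(h_t, row)) for row in W]
probs = softmax(logits)

for token, z, p in zip(vocab, logits, probs):
    print(f"{token:>7}: logit={z:+.2f}  prob={p:.3f}")
```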

🔀 Greedy vs Sampling vs Temperature

Strategy	Description	Effect
Greedy	Always pick the highest-probability token	Deterministic, but can get repetitive
Top-k	Sample only from the k most likely tokens	Adds controlled variety
Temperature	Divide the logits by T before softmax (e.g., 0.7 = more focused, 1.2 = more varied)	Adjusts randomness
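
The three strategies can be sketched in plain Python as follows. The logits are made up, and the helper names (`greedy`, `sample_top_k`) are assumptions for illustration, not any library's API.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by temperature (lower = sharper)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Keep the k largest logits, renormalize, and sample one index."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top], temperature)
    return rng.choices(top, weights=probs, k=1)[0]

vocab = ["blue", "dark", "falling", "beautiful", "banana"]
logits = [3.0, 1.1, 0.0, -0.3, -1.4]   # made-up scores

rng = random.Random(0)                  # seeded so runs are repeatable
print("greedy:", vocab[greedy(logits)])
print("top-2 samples:", [vocab[sample_top_k(logits, 2, 0.7, rng)] for _ in range(5)])
```

In practice these are combined: temperature is applied first, then top-k filtering, then a random draw.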

⚡ 9. Example of Next Token Prediction (Visualization)

Prompt: "The sky is"

GPT internally computes:

Possible next token	Probability
“blue”	0.79
“dark”	0.12
“falling”	0.04
“beautiful”	0.03
“banana”	0.01

With greedy decoding, it picks “blue”, the most likely token.

Now input becomes "The sky is blue" → predict next token again.

This repeats autoregressively.

💬 Visual Loop:

"The sky is"
   ↓
Predict next token: "blue"
   ↓
"The sky is blue"
   ↓
Predict next token: "and"
   ↓
"The sky is blue and"
   ↓
Predict next token: "clear"
   ↓
...

This is how GPT writes essays, poems, or entire books — one token at a time.

🧮 10. Small Code Demo (Character-Level GPT)

Here’s a minimal version (it reuses the GPTBlock class from section 7):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, embed_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.pos = nn.Parameter(torch.zeros(1, 100, embed_dim))
        self.block = GPTBlock(embed_dim, num_heads=4, ff_hidden=64)
        self.fc_out = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool().to(x.device)
        x = self.embed(x) + self.pos[:, :seq_len, :]
        x = self.block(x, attn_mask=mask)
        return self.fc_out(x)

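# Build a character-level vocabulary from a toy string
text = "hello world"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}   # char -> id
itos = {i: c for c, i in stoi.items()}       # id -> char
# Inputs are every character but the last; targets are the same text shifted by one
data = torch.tensor([[stoi[c] for c in text[:-1]]])
target = torch.tensor([[stoi[c] for c in text[1:]]])

model = TinyGPT(vocab_size=len(chars))
logits = model(data)                         # shape: (1, seq_len, vocab_size)
# Flatten batch and time so each position counts as one classification example
loss = F.cross_entropy(logits.view(-1, len(chars)), target.view(-1))
print("Loss:", loss.item())                  # model is untrained, so expect a high value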
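
A quick sanity check on the number this prints: a model that guesses every character with equal probability has a cross-entropy of log(vocab_size). The sketch below computes that baseline for the same toy string. An untrained TinyGPT is randomly initialized, so its loss should land in this neighborhood, not exactly on it.

```python
import math

text = "hello world"
vocab_size = len(set(text))  # distinct characters, including the space

# A uniform guess gives each character probability 1 / vocab_size, so the
# cross-entropy is -log(1 / vocab_size) = log(vocab_size).
uniform_loss = math.log(vocab_size)
print(f"vocab_size = {vocab_size}, uniform-guess loss = {uniform_loss:.3f}")
```

If training is working, the loss should fall well below this baseline.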

🔮 11. Visualizing Generation Probabilities

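If we plot the probabilities from the softmax layer for the "The sky is" example, we’d see something like this (bars approximately to scale):

P(next token | context)
blue       |████████████████████████████████████████ 0.79
dark       |██████ 0.12
falling    |██ 0.04
beautiful  |██ 0.03
banana     |█ 0.01

Temperature reshapes this distribution during generation: values below 1 sharpen the peak, values above 1 flatten it.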
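
To see the reshaping in numbers, here is a small sketch that applies temperature to the probabilities from the "The sky is" example. The helper name `apply_temperature` is an assumption for illustration; raising each probability to the power 1/T and renormalizing is equivalent to dividing the logits by T before the softmax.

```python
# Probabilities from the "The sky is" example. They sum to 0.99 (the rest
# belongs to tokens not listed), so the helper renormalizes.
probs = {"blue": 0.79, "dark": 0.12, "falling": 0.04, "beautiful": 0.03, "banana": 0.01}

def apply_temperature(probs, temperature):
    """Rescale a distribution: p ** (1 / T), then renormalize."""
    powered = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(powered.values())
    return {tok: v / total for tok, v in powered.items()}

for t in (0.7, 1.0, 1.2):
    reshaped = apply_temperature(probs, t)
    print(t, {tok: round(p, 3) for tok, p in reshaped.items()})
```

At T = 0.7 "blue" takes an even larger share; at T = 1.2 the unlikely tokens gain ground.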

🧠 12. GPT’s Training Phases

Phase	Description
Pretraining	Learn general language patterns from internet text
Fine-tuning	Adapt to specific tasks or styles
RLHF (Reinforcement Learning from Human Feedback)	Align outputs with human preference (used in ChatGPT)

🧩 13. What GPT Learns Internally

GPT doesn’t store sentences — it learns probabilistic relationships between tokens.
For example, it learns that:

“The capital of France is” → “Paris”
“Photosynthesis occurs in” → “plants”

These associations are encoded in the model’s learned weights (embeddings, attention, and feedforward layers), not stored as a lookup table.

⚙️ 14. Scaling Up GPT

Model	Year	Parameters	Key Improvement
GPT-1	2018	117M	Proof-of-concept
GPT-2	2019	1.5B	Long, coherent text
GPT-3	2020	175B	Few-shot in-context learning
GPT-4	2023	Not disclosed	Multimodal input & alignment

As size grows, new capabilities tend to show up (reasoning, coding, abstraction) that smaller models handle poorly.

💡 15. Emergent Behavior: Why GPT Feels “Intelligent”

Because of scale and in-context learning, GPT learns:

Grammar & semantics (syntax correctness)
Long-term dependencies (story continuity)
Analogies & reasoning (pattern recognition)
Creativity (sampling-based diversity)

It’s not “thinking” — but the learned statistical relationships simulate intelligence.

✅ 16. Summary — What You Learned Today

Concept	Key Takeaway
GPT = Decoder-only Transformer	Uses masked self-attention
Autoregression	Predicts next token step by step
Causal mask	Prevents future leakage
Output layer	Softmax → next token probabilities
Text generation	Sampling loop from predicted probabilities

🔮 Coming Next

👉 Day 6 — Pretraining & Next Token Prediction Objective
We’ll dive deeper into how GPT learns language patterns,
including dataset structure, optimization process,
and the fascinating “next-token-prediction” that teaches everything —
grammar, logic, reasoning, and memory.

Would you like me to now:
1️⃣ Create a Colab-ready notebook for Day 5 (with visual next-token simulation + generation demo),
or
2️⃣ Go straight into Day 6 — Pretraining & Next Token Prediction Objective (How GPT actually learns during training)?


A full course on Generative AI + LLM Engineering