A full course on Generative AI + LLM Engineering

Beautiful — this is where everything comes together, Rajeev 💥

Welcome to Day 5: GPT Architecture & Autoregressive Text Generation.
Today we go from “understanding Transformers” to how GPT actually generates human-like language:
one token at a time, predicting the next token over and over again,
until you get a poem, a paragraph, or even a book 📖✨

By the end of this lesson, you’ll deeply understand how ChatGPT itself works under the hood.


🌎 DAY 5 — GPT Architecture & Autoregressive Text Generation


🧠 1. The Big Idea

GPT = Generative Pre-trained Transformer

It’s called that because it:

  • Generates text
  • Is Pre-trained on huge datasets
  • Is a Transformer (decoder-only) architecture

In simple words:

GPT learns how to predict the next token, given all previous ones.

That’s it.
That single mechanism — scaled up — creates the illusion of reasoning, creativity, and understanding.


🧩 2. What Makes GPT Different from BERT?

Feature      BERT (Encoder)            GPT (Decoder)
Type         Bi-directional encoder    Uni-directional decoder
Task         Masked token prediction   Next token prediction
Attention    Full self-attention       Masked self-attention
Output       Context understanding     Text generation
Example      “Fill in the blank”       “Continue this text”

💡 BERT understands language,
while GPT generates it.
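
To make the two training tasks concrete, here is a small sketch of how one sentence turns into training examples under each objective. The word-level tokens, the hand-picked mask position, and the helper names are illustrative assumptions, not the actual preprocessing used by BERT or GPT.

```python
# Illustrative only: word-level tokens and a hand-picked mask position.
tokens = ["the", "sky", "is", "blue", "today"]

def masked_lm_example(tokens, mask_index):
    """BERT-style: hide one token; the model sees both sides of the gap."""
    inputs = list(tokens)
    target = inputs[mask_index]
    inputs[mask_index] = "[MASK]"
    return inputs, target

def causal_lm_examples(tokens):
    """GPT-style: every prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

mlm_inputs, mlm_target = masked_lm_example(tokens, mask_index=3)
print(mlm_inputs, "->", mlm_target)

for context, next_token in causal_lm_examples(tokens):
    print(context, "->", next_token)
```

Notice that the BERT-style example keeps "today" visible to the right of the gap, while every GPT-style example only ever sees what came before.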


⚙️ 3. GPT’s Architecture at a Glance

GPT uses only the decoder part of the Transformer, repeated many times.

Each block contains:

Masked Multi-Head Self-Attention
  ↓
Add & LayerNorm
  ↓
Feedforward (MLP)
  ↓
Add & LayerNorm

The input embeddings (plus positional encodings) flow through these blocks,
and at the end → a linear layer + softmax predicts the next token.


🔍 Visual Overview

Text Input: "I love"
     ↓
Tokenizer → Tokens [I, love]
     ↓
Embedding + Positional Info
     ↓
Transformer Blocks (×N)
     ↓
Logits → Softmax → Probabilities
     ↓
Pick (or sample) the next token → "AI"
     ↓
Repeat autoregressively...
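
The same pipeline can be traced as tensor shapes. This is a rough sketch with randomly initialised layers, so the probabilities it prints are meaningless. The five-word vocabulary, the layer sizes, and the names `tok_emb`, `pos_emb`, and `lm_head` are assumptions made for illustration, and the Transformer blocks themselves are left out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical 5-word vocabulary; real GPT models use subword tokenizers.
vocab = ["I", "love", "AI", "cats", "pizza"]
stoi = {word: i for i, word in enumerate(vocab)}

embed_dim, max_len = 8, 16
tok_emb = nn.Embedding(len(vocab), embed_dim)   # token id -> vector
pos_emb = nn.Embedding(max_len, embed_dim)      # position -> vector
lm_head = nn.Linear(embed_dim, len(vocab))      # vector -> vocabulary logits

ids = torch.tensor([[stoi[w] for w in "I love".split()]])   # shape (1, 2)
positions = torch.arange(ids.size(1)).unsqueeze(0)          # shape (1, 2)

x = tok_emb(ids) + pos_emb(positions)            # (1, 2, embed_dim)
# The N Transformer blocks would transform x here; they keep its shape.
logits = lm_head(x)                              # (1, 2, vocab_size)
probs = torch.softmax(logits[:, -1, :], dim=-1)  # distribution after "love"

print(ids.shape, x.shape, logits.shape)
print({w: round(p, 3) for w, p in zip(vocab, probs[0].tolist())})
```

Only the logits at the last position are used to choose the next token; the earlier positions matter during training, where every position predicts its own successor.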

🔄 4. The Core Principle: Autoregression

Autoregression means:

Predict the next token based on all previous tokens.

Example

Prompt: "The capital of France is"

Step   Input                       Output
1      The                         capital
2      The capital                 of
3      The capital of              France
4      The capital of France       is
5      The capital of France is    Paris

Each time, GPT feeds its own previous output back as the next input.
This is why GPT can keep generating until it emits an end-of-sequence token or hits a length limit.
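
The feedback loop itself is tiny. Here is a minimal sketch that swaps the neural network for a hand-written lookup table. The `NEXT_TOKEN` table and the `<eos>` stop token are made up for illustration, and unlike this toy, a real GPT conditions on the whole context rather than just the last token.

```python
# Hand-written "model": maps the last token to its most likely successor.
# A real GPT conditions on the whole context, not just the last token.
NEXT_TOKEN = {
    "The": "capital",
    "capital": "of",
    "of": "France",
    "France": "is",
    "is": "Paris",
    "Paris": "<eos>",
}

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":      # stop token ends generation
            break
        tokens.append(next_token)      # the output is fed back as input
    return tokens

print(generate(["The"]))
print(generate(["The", "capital", "of", "France", "is"]))
```

Both calls end with "Paris": the loop does not care whether a token came from the prompt or from an earlier prediction.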


🧮 5. The Math Behind GPT’s Text Generation

For a sequence of tokens (x_1, x_2, …, x_T):
\[
P(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})
\]

The model is trained to maximize the probability of the correct next token at each position, which is the same as minimizing the negative log-likelihood:
\[
\text{Loss} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t})
\]

This is called the causal language modeling (CLM) objective.
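
A quick numeric check ties the two formulas together. The per-step probabilities below are made-up numbers standing in for what a model might assign to the correct token at each position.

```python
import math

# Assumed per-step probabilities P(x_t | x_<t) that a model assigned to the
# correct token at each position (made-up numbers for illustration).
step_probs = [0.20, 0.50, 0.90, 0.80]

# Chain rule: the sequence probability is the product of the step probabilities.
sequence_prob = math.prod(step_probs)

# Causal LM loss: negative sum of the log-probabilities.
loss = -sum(math.log(p) for p in step_probs)

print(f"P(sequence) = {sequence_prob:.4f}")
print(f"loss        = {loss:.4f}")
print(f"exp(-loss)  = {math.exp(-loss):.4f}")
```

The first and last printed values match, because the loss is just the negative log of the sequence probability. Lowering the loss and raising the probability of the training text are the same thing.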


🧩 6. Masked Self-Attention in GPT

Unlike the encoder’s “see all” attention,
GPT’s decoder uses a causal mask so each token can only attend to itself and earlier tokens.

Visual:

           Word1  Word2  Word3  Word4
Word1        ✓
Word2        ✓      ✓
Word3        ✓      ✓      ✓
Word4        ✓      ✓      ✓      ✓

This ensures no future information leakage.


⚙️ PyTorch-style Implementation

import torch

# Causal mask: True marks the future positions a token must NOT attend to
# (everything strictly above the diagonal).
mask = torch.triu(torch.ones(4, 4), diagonal=1).bool()
print(mask)

Output:

tensor([[False,  True,  True,  True],
        [False, False,  True,  True],
        [False, False, False,  True],
        [False, False, False, False]])

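This mask is applied to the attention scores before the softmax: positions marked True are set to -∞, so they end up with zero attention weight. GPT literally cannot “see ahead.”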
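
Here is a small sketch of that step. The random `scores` tensor is a stand-in for the real scaled query-key products, so only the pattern of zeros in the result is meaningful.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len = 4

scores = torch.randn(seq_len, seq_len)   # stand-in for Q @ K^T / sqrt(d_k)
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# Masked (future) positions get -inf, so softmax assigns them weight 0.
masked_scores = scores.masked_fill(mask, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights)
```

Each row still sums to 1, but all the weight sits on or below the diagonal. The first token can only attend to itself, so its row is a single 1.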


🧱 7. The GPT Block Structure

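Each GPT block =
Self-Attention (masked) → Add & Norm → Feedforward → Add & Norm

(This is the post-norm ordering of the original Transformer and GPT-1. GPT-2 and later models apply LayerNorm before each sub-layer instead.)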

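import torch.nn as nn

class GPTBlock(nn.Module):
    """One decoder block: masked self-attention and an MLP, each followed by Add & LayerNorm."""

    def __init__(self, embed_dim=768, num_heads=12, ff_hidden=3072):
        super().__init__()
        # batch_first=True means inputs are shaped (batch, seq_len, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(embed_dim)
        # Position-wise feedforward network: expand, apply GELU, project back
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_hidden),
            nn.GELU(),
            nn.Linear(ff_hidden, embed_dim)
        )
        self.ln2 = nn.LayerNorm(embed_dim)

    def forward(self, x, attn_mask=None):
        # attn_mask: boolean (seq_len, seq_len); True = this position is hidden
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.ln1(x + attn_out)      # residual connection + LayerNorm
        ff_out = self.ff(x)
        return self.ln2(x + ff_out)     # residual connection + LayerNorm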
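
One way to convince yourself that the mask really blocks future information is to change the last token and check that the earlier outputs do not move. The sketch below runs that check on a bare `nn.MultiheadAttention` layer with random weights. The sizes are arbitrary assumptions, and the same test could be run on a full block.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_heads, seq_len = 16, 4, 5

attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
attn.eval()
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

x = torch.randn(1, seq_len, embed_dim)
x_changed = x.clone()
x_changed[:, -1, :] = torch.randn(embed_dim)   # alter only the last token

with torch.no_grad():
    out, _ = attn(x, x, x, attn_mask=mask)
    out_changed, _ = attn(x_changed, x_changed, x_changed, attn_mask=mask)

# Positions 0..3 never attend to position 4, so their outputs should match.
print(torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5))
# The last position does see its own (changed) input, so it should differ.
print(torch.allclose(out[:, -1], out_changed[:, -1], atol=1e-5))
```

Without the mask, changing the last token would change every position's output.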

💡 8. The Output Head

After passing through all transformer blocks,
GPT uses a linear projection layer to map hidden vectors → vocabulary logits.

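\[
\text{logits}_t = h_t W^{\top}
\]
\[
P(x_{t+1} \mid x_{\le t}) = \text{softmax}(\text{logits}_t)
\]

Here h_t is the final hidden vector at position t and W is the projection matrix with one row per vocabulary token.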

The highest probability token is chosen — or sampled probabilistically for creativity.


🔀 Greedy vs Sampling vs Temperature

Strategy      Description                                                         Effect
Greedy        Pick the highest-probability token                                  Deterministic; can get repetitive
Top-k         Sample from the k most likely tokens                                Adds controlled variety
Temperature   Divide logits by T before softmax (0.7 = focused, 1.2 = creative)   Adjusts randomness
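
The three strategies fit in a few lines. This is a sketch rather than any library's decoding code: the function names `greedy` and `sample` and the example logits are assumptions for illustration.

```python
import torch

def greedy(logits):
    """Always pick the highest-scoring token."""
    return int(torch.argmax(logits))

def sample(logits, temperature=1.0, top_k=None, generator=None):
    """Sample a token id after temperature scaling and optional top-k filtering."""
    logits = logits / temperature            # T < 1 sharpens, T > 1 flattens
    if top_k is not None:
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1, generator=generator))

# Made-up logits for a 5-token vocabulary.
logits = torch.tensor([4.0, 2.0, 1.0, 0.5, -1.0])
gen = torch.Generator().manual_seed(0)

print("greedy:", greedy(logits))
print("top-k :", [sample(logits, temperature=0.7, top_k=2, generator=gen) for _ in range(10)])
```

With `top_k=2`, only token ids 0 and 1 can ever be drawn, however many times you sample.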

⚡ 9. Example of Next Token Prediction (Visualization)

Prompt: "The sky is"

GPT internally computes:

Possible next token   Probability
“blue”                0.79
“dark”                0.12
“falling”             0.04
“beautiful”           0.03
“banana”              0.01

With greedy decoding, it picks “blue”, the most likely token.

Now input becomes "The sky is blue" → predict next token again.

This repeats autoregressively.


💬 Visual Loop:

"The sky is"
   ↓
Predict next token: "blue"
   ↓
"The sky is blue"
   ↓
Predict next token: "and"
   ↓
"The sky is blue and"
   ↓
Predict next token: "clear"
   ↓
...

This is how GPT writes essays, poems, or entire books — one token at a time.


🧮 10. Small Code Demo (Character-Level GPT)

Here’s a minimal version:

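import torch
import torch.nn as nn
import torch.nn.functional as F

# Requires the GPTBlock class defined in section 7.

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, embed_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)              # token embeddings
        self.pos = nn.Parameter(torch.zeros(1, 100, embed_dim))       # learned positions (max 100 tokens)
        self.block = GPTBlock(embed_dim, num_heads=4, ff_hidden=64)   # a single decoder block
        self.fc_out = nn.Linear(embed_dim, vocab_size)                # hidden vector -> vocabulary logits

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: True above the diagonal hides future positions
        mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1).bool()
        x = self.embed(x) + self.pos[:, :seq_len, :]
        x = self.block(x, attn_mask=mask)
        return self.fc_out(x)   # (batch, seq_len, vocab_size)

# Build a character-level vocabulary from a toy string
text = "hello world"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}   # char -> id
itos = {i: c for c, i in stoi.items()}       # id -> char

# Inputs are all characters but the last; targets are the same text shifted by one
data = torch.tensor([[stoi[c] for c in text[:-1]]])
target = torch.tensor([[stoi[c] for c in text[1:]]])

model = TinyGPT(vocab_size=len(chars))
logits = model(data)
# Cross-entropy over every position; an untrained model should score roughly ln(vocab_size)
loss = F.cross_entropy(logits.view(-1, len(chars)), target.view(-1))
print("Loss:", loss.item())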
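
The demo above computes a loss but never generates anything. Below is a hedged sketch of a generation loop that works with any model mapping token ids of shape (batch, seq_len) to logits of shape (batch, seq_len, vocab_size), which is what TinyGPT returns. The `generate` function, its parameters, and the `toy_model` stand-in are assumptions added here so the snippet runs on its own.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, greedy=False, max_context=100):
    """Extend idx (shape (B, T)) by max_new_tokens tokens, one at a time."""
    for _ in range(max_new_tokens):
        context = idx[:, -max_context:]            # stay within the positional table
        logits = model(context)[:, -1, :]          # logits for the next token only
        if greedy:
            next_id = logits.argmax(dim=-1, keepdim=True)
        else:
            probs = F.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)     # feed the output back in
    return idx

# Stand-in model so the sketch runs alone: it always favours (last id + 1).
def toy_model(idx, vocab_size=5):
    preferred = (idx + 1) % vocab_size
    return F.one_hot(preferred, vocab_size).float() * 10.0

out = generate(toy_model, torch.tensor([[0]]), max_new_tokens=6, greedy=True)
print(out.tolist())
```

With the TinyGPT instance from the demo, a call such as `generate(model, data[:, :1], max_new_tokens=10)` followed by mapping the ids through `itos` should produce characters, though an untrained model will emit gibberish. The default `max_context=100` matches the size of TinyGPT's positional table.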

🔮 11. Visualizing Generation Probabilities

If we plot the probabilities from the softmax layer, we’d see something like this:

P(next token | "The sky is")

blue       ********************************  0.79
dark       *****                             0.12
falling    **                                0.04
beautiful  *                                 0.03
banana     *                                 0.01

This distribution is reshaped by temperature during generation.
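
To see the reshaping in numbers, here is a sketch that applies three temperatures to the "The sky is" probabilities from section 9. The helper `apply_temperature` is a made-up name. Treating the log-probabilities as logits is a simplification, and the small leftover probability mass for all other tokens is ignored.

```python
import math

# Probabilities from the "The sky is" example; the remaining 0.01 of
# probability mass (all other tokens) is ignored here for simplicity.
probs = {"blue": 0.79, "dark": 0.12, "falling": 0.04, "beautiful": 0.03, "banana": 0.01}

def apply_temperature(probs, temperature):
    """Rescale a distribution: p_i^(1/T), renormalised (same as softmax(logits / T))."""
    scaled = {tok: math.exp(math.log(p) / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: value / total for tok, value in scaled.items()}

for t in (0.7, 1.0, 1.2):
    reshaped = apply_temperature(probs, t)
    print(t, {tok: round(p, 3) for tok, p in reshaped.items()})
```

At T = 0.7 “blue” takes an even larger share, while at T = 1.2 some of its probability shifts to the unlikely tokens, which is where the extra “creativity” comes from.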


🧠 12. GPT’s Training Phases

Phase                                               Description
Pretraining                                         Learn general language patterns from internet text
Fine-tuning                                         Adapt to specific tasks or styles
RLHF (Reinforcement Learning from Human Feedback)   Align outputs with human preferences (used in ChatGPT)

🧩 13. What GPT Learns Internally

GPT doesn’t store sentences — it learns probabilistic relationships between tokens.
For example, it learns that:

“The capital of France is” → “Paris”
“Photosynthesis occurs in” → “plants”

These associations are not stored in any single place; they are spread across the model’s learned weights (embeddings, attention layers, and feedforward layers).


⚙️ 14. Scaling Up GPT

Model   Year   Parameters      Key Improvement
GPT-1   2018   117M            Proof-of-concept
GPT-2   2019   1.5B            Long, coherent text
GPT-3   2020   175B            Few-shot in-context learning
GPT-4   2023   Not disclosed   Multimodal reasoning & alignment

As size grows, emergent properties appear (reasoning, coding, abstraction).
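
The parameter counts in the table can be roughly reproduced from the model shapes. The sketch below uses the common rule of thumb of about 12 × d_model² weights per block. It ignores embeddings, biases, and LayerNorm parameters, so treat the results as estimates. The layer counts and hidden sizes are the published figures for the largest GPT-2 and GPT-3 models, and the function name is made up.

```python
def approx_block_params(n_layers, d_model):
    """Rough count of Transformer-block weights: about 12 * d_model^2 per layer
    (4 * d^2 for the attention projections + 8 * d^2 for a 4x-wide feedforward)."""
    return 12 * n_layers * d_model ** 2

# Published layer counts and hidden sizes for the largest GPT-2 and GPT-3 models.
configs = {
    "GPT-2 (1.5B)": (48, 1600),
    "GPT-3 (175B)": (96, 12288),
}

for name, (n_layers, d_model) in configs.items():
    estimate = approx_block_params(n_layers, d_model)
    print(f"{name}: ~{estimate / 1e9:.2f}B block parameters")
```

The estimates land close to the headline numbers, with most of the remaining gap coming from the token embedding matrix.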


💡 15. Emergent Behavior: Why GPT Feels “Intelligent”

Because of scale and in-context learning, GPT learns:

  • Grammar & semantics (syntax correctness)
  • Long-term dependencies (story continuity)
  • Analogies & reasoning (pattern recognition)
  • Creativity (sampling-based diversity)

It’s not “thinking” — but the learned statistical relationships simulate intelligence.


✅ 16. Summary — What You Learned Today

Concept                          Key Takeaway
GPT = Decoder-only Transformer   Uses masked self-attention
Autoregression                   Predicts next token step by step
Causal mask                      Prevents future leakage
Output layer                     Softmax → next token probabilities
Text generation                  Sampling loop from predicted probabilities

🔮 Coming Next

👉 Day 6 — Pretraining & Next Token Prediction Objective
We’ll dive deeper into how GPT learns language patterns,
including dataset structure, optimization process,
and the fascinating “next-token-prediction” that teaches everything —
grammar, logic, reasoning, and memory.


