A full course on Generative AI + LLM Engineering

Perfect, Rajeev 🌟 — this is where you’ll truly understand how GPT learns
not just how it generates text, but how it’s trained to become intelligent.

If Day 5 taught you how GPT predicts the next token,
then Day 6 teaches you how GPT learns to predict the next token accurately — through Pretraining and the Next-Token-Prediction Objective.

We’ll break this into a full intuitive explanation → with visual reasoning → practical training loop → and deep insights into why this simple learning rule creates language understanding.


🌎 DAY 6 — Pretraining & Next-Token Prediction Objective

(How GPT Learns to Understand, Think, and Generate Language)


🧠 1️⃣ The Big Picture: How Does GPT Learn?

GPT is trained on massive text corpora — internet articles, Wikipedia, books, code, forums, etc.
During pretraining, it repeatedly sees snippets of text and learns to predict the next token at each position.

That’s it.
No explicit grammar rules, logic rules, or labeled datasets.
Just:

“Given the past text, guess the next word.”

This single, simple rule — applied billions of times — teaches GPT everything:
grammar, knowledge, reasoning, humor, even coding.


🧩 2️⃣ The Training Objective — Next Token Prediction

🔹 Formal Definition

For every training example (a sequence of tokens):
[
x = [x_1, x_2, x_3, …, x_T]
]

GPT learns to model:
[
P(x) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) … P(x_T | x_{<T})
]

The model’s loss is the negative log-likelihood over the sequence:
[
\mathcal{L} = -\sum_{t=1}^T \log P(x_t | x_{<t})
]

So every token tries to predict its successor correctly.
That’s why it’s called causal (autoregressive) language modeling.


🧮 Example

Text: "The sky is blue"

Training samples:

Input (context)Target (next token)
Thesky
The skyis
The sky isblue

Loss = how wrong the model’s predicted probabilities are compared to the actual next word.


⚙️ 3️⃣ The Training Loop (Simplified)

for batch in dataset:
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)             # Predict next token for each position
    loss = cross_entropy(logits, targets)
    loss.backward()
    optimizer.step()

💡 Key idea:

  • Inputs = tokens 1 → n-1
  • Targets = tokens 2 → n
  • The model learns to minimize how far its predictions are from real next tokens.

🧩 4️⃣ Where Does the Data Come From?

GPT is trained on a mixture of diverse, cleaned datasets, including:

  • Common Crawl (large internet snapshot)
  • Wikipedia (detailed factual text)
  • Books Corpus (story / narrative language)
  • WebText (OpenAI’s filtered web data)
  • GitHub code (for code models like Codex)

Each text is split into sequences (e.g., 2 048 tokens long)
and fed into the model during training.


🧩 Data Size vs Model Scale

ModelParametersTokens Trained On
GPT-21.5 B40 GB text (~8 B tokens)
GPT-3175 B570 GB text (~300 B tokens)
GPT-4~1 T>1 Trillion tokens

The more tokens → the better generalization (up to a point).


🔥 5️⃣ Why Next-Token Prediction Works So Well

Even though GPT only predicts the next word, this objective forces it to learn:

SkillWhy It Emerges
Grammar & syntaxMust predict valid sequences
Semantics & meaningMust choose words that make sense
World knowledgeMust understand context to predict accurately
ReasoningMust track long logical dependencies
MemoryMust attend to past context via self-attention

It’s emergent learning — patterns arise naturally from prediction pressure.


⚙️ 6️⃣ Autoregressive Training Visualization

Let’s visualize a small example:

"The cat sat on the mat."

During training:

Input:  [The]
Target: [cat]

Input:  [The, cat]
Target: [sat]

Input:  [The, cat, sat]
Target: [on]

Loss is computed for every token prediction in parallel (shifted by one).

So GPT simultaneously learns from all possible next-token pairs.


🧮 7️⃣ Loss Function: Cross Entropy

Each step, GPT predicts a probability distribution over the vocabulary:

P("blue") = 0.79
P("dark") = 0.12
P("banana") = 0.01

If the correct token is “blue,” loss = −log(0.79).
Minimizing this loss pushes “blue” probability higher next time.


⚙️ Implementation in PyTorch

import torch
import torch.nn.functional as F

# logits: [batch, seq_len, vocab_size]
# targets: [batch, seq_len]
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

This trains all positions in the sequence simultaneously —
each predicting its next token.


⚙️ 8️⃣ Training Dynamics — How the Model Improves

1️⃣ At first, predictions are random → loss ≈ log(vocab size)
2️⃣ As weights update, probabilities for correct tokens rise
3️⃣ Model starts learning syntax (“subject → verb → object”)
4️⃣ Then semantics (“The capital of France is” → “Paris”)
5️⃣ Finally patterns of reasoning, coding, storytelling emerge


🧩 9️⃣ Learning Representations Inside the Model

As GPT trains:

  • Early layers learn local features (spelling, word order)
  • Middle layers learn syntactic relations (grammar, phrase structure)
  • Deeper layers learn semantic & world knowledge

Each token’s vector (embedding) becomes a compressed summary of its context
that’s why GPT embeddings are so powerful for downstream tasks.


🧠 🔟 How Attention Aids Learning

During training, attention heads discover useful relationships:

  • “it” → nearest noun
  • “because” → clause relationship
  • “dog” → adjective “furry”
  • “if…then” → logical pairing

These relationships are reinforced when the next-token loss rewards correct dependencies.


⚙️ 11️⃣ Optimization & Hardware

  • Optimizer: AdamW
  • Batch size: often 2 K–8 K sequences
  • Learning rate scheduling: Warmup + Cosine decay
  • Hardware: tens of thousands of GPUs/TPUs
  • Duration: weeks to months

GPT-3 reportedly took ~355 GPU years of compute to pretrain.


🧩 12️⃣ Checkpointing & Gradient Accumulation

Since full batches can’t fit in memory, training uses:

  • Gradient accumulation (over mini-batches)
  • Checkpointing (recomputing layers on the fly to save RAM)
  • Mixed precision (float16/8 quantization)

This enables training of trillion-parameter models efficiently.


🔍 13️⃣ From Pretraining → Fine-tuning

After pretraining (on general text), GPT is fine-tuned for specific behavior:

PhaseGoal
Supervised Fine-tuning (SFT)Teach to follow instructions
RLHF (Reinforcement Learning from Human Feedback)Align with human preferences
Continual TrainingDomain adaptation (e.g., coding, legal, medical)

We’ll cover SFT + RLHF in Day 8–9.


🧩 14️⃣ Visual Summary: How GPT Learns

[Raw Text Dataset]
      ↓
Tokenization
      ↓
Token IDs
      ↓
Model Input
      ↓
Predicted Next Token Probabilities
      ↓
Cross-Entropy Loss
      ↓
Backpropagation
      ↓
Weight Updates
      ↓
Repeat for Trillions of Tokens

Over time, the model becomes a master of predicting language — and therefore understanding it.


⚡ 15️⃣ Emergent Understanding — Why It Works So Well

Next-token prediction indirectly teaches:

  • Language structure (syntax & grammar)
  • World knowledge (co-occurrence patterns)
  • Logic & causality (“if X then Y”)
  • Common sense reasoning
  • Abstract representation learning

When you scale model + data + compute → you cross the emergence threshold where reasoning appears.


✅ 16️⃣ Summary — What You Learned Today

ConceptDescription
PretrainingMassive unsupervised training on text data
ObjectivePredict next token given previous context
LossCross-entropy (minimize prediction error)
OutcomeLearns language patterns and world knowledge
Why it worksPredictive pressure forces compositional understanding

# %% [markdown]
# Day 6 — Mini-GPT Next-Token Prediction (Colab Notebook)

This notebook trains a tiny autoregressive model (Mini-GPT) on a toy corpus so you can see **next-token prediction** in action end-to-end: tokenization, training loop, loss plot, and generation sampling.

**Notes:**
- Designed to run quickly in Colab (CPU or GPU).
- Uses PyTorch and matplotlib for plotting.
- Character-level tokenizer is used for simplicity.

---

# %% [markdown]
## 0. Install / Import Dependencies

Run this cell first. If running on Colab, enable GPU via `Runtime → Change runtime type → GPU`.

---

# %%
# Install (uncomment if running in a fresh Colab env)
# !pip install -q torch torchvision

import math
import random
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from IPython.display import clear_output

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device:', device)

# %% [markdown]
## 1. Toy Corpus & Character Tokenizer

We'll create a small dataset of sentences. We'll build char-level tokenization (stoi, itos) for simplicity.

---

# %%
corpus = [
    "hello world",
    "hello there",
    "how are you",
    "i love ai",
    "gpt predicts next",
    "the cat sat on the mat",
    "the dog lay on the rug",
    "a quick brown fox",
    "pack my box with five dozen liquor jugs",
    "sphinx of black quartz judge my vow"
]

text = "\n".join(corpus)
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for ch,i in stoi.items()}

print('Vocab size:', vocab_size)
print('Chars:', chars)
print('\nSample text:\n', text[:200])

# Encode entire text as sequence of ints
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

# Create train sequences (sliding window)
seq_len = 32
inputs = []
targets = []
for i in range(0, len(data) - seq_len):
    inputs.append(data[i:i+seq_len])
    targets.append(data[i+1:i+seq_len+1])

inputs = torch.stack(inputs)
targets = torch.stack(targets)

print('Dataset shape (num_seq, seq_len):', inputs.shape)

# %% [markdown]
## 2. Mini GPT Model (tiny) — masked autoregressive with causal mask

This is a minimal decoder-only transformer with one or few blocks. It's intentionally small so it trains fast on this tiny corpus.

---

# %%
class GPTBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_hidden):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_hidden),
            nn.GELU(),
            nn.Linear(ff_hidden, embed_dim)
        )
        self.ln2 = nn.LayerNorm(embed_dim)
    def forward(self, x, attn_mask=None):
        attn_out, attn_weights = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.ln1(x + attn_out)
        ff_out = self.ff(x)
        x = self.ln2(x + ff_out)
        return x, attn_weights

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, n_layers=2, num_heads=4, ff_hidden=256, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, embed_dim))
        self.blocks = nn.ModuleList([GPTBlock(embed_dim, num_heads, ff_hidden) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size, bias=False)
        self.max_len = max_len
    def forward(self, idx):
        b, t = idx.shape
        assert t <= self.max_len, 'sequence length exceeds model max_len'
        x = self.embed(idx) + self.pos[:, :t, :]
        attn_weights_all = []
        causal_mask = torch.triu(torch.ones(t, t, device=idx.device), diagonal=1).bool()
        for block in self.blocks:
            x, attn_weights = block(x, attn_mask=causal_mask)
            attn_weights_all.append(attn_weights)
        x = self.ln_f(x)
        logits = self.head(x)
        return logits, attn_weights_all

model = MiniGPT(vocab_size).to(device)
print(model)

# %% [markdown]
## 3. Training Setup

We'll use AdamW, cross-entropy loss, and a simple training loop with periodic sampling and loss plotting.

---

# %%
batch_size = 16
dataset_size = inputs.shape[0]
print('Num sequences:', dataset_size)

def get_batch(i, batch_size=batch_size):
    start = i * batch_size
    end = start + batch_size
    if end > dataset_size:
        end = dataset_size
        start = max(0, end - batch_size)
    return inputs[start:end].to(device), targets[start:end].to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3)
epochs = 30

all_losses = []

# %% [markdown]
## 4. Training Loop — Train the MiniGPT

This loop prints loss, plots loss curve, and generates sample text every few steps.

---

# %%
model.train()
for epoch in range(epochs):
    perm = torch.randperm(dataset_size)
    epoch_loss = 0.0
    steps = 0
    for i in range(0, dataset_size, batch_size):
        idx = perm[i:i+batch_size]
        xb = inputs[idx].to(device)
        yb = targets[idx].to(device)
        optimizer.zero_grad()
        logits, _ = model(xb)
        loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        steps += 1
    avg_loss = epoch_loss / steps
    all_losses.append(avg_loss)

    # display progress
    clear_output(wait=True)
    print(f'Epoch {epoch+1}/{epochs} — Loss: {avg_loss:.4f}')

    # plot loss
    plt.figure(figsize=(6,3))
    plt.plot(all_losses, label='train loss')
    plt.xlabel('epoch')
    plt.ylabel('loss')
    plt.grid(True)
    plt.legend()
    plt.show()

    # sample generation
    model.eval()
    with torch.no_grad():
        # start from random seed string
        seed = random.choice(corpus)[:10]
        context = torch.tensor([[stoi[c] for c in seed]], dtype=torch.long).to(device)
        generated = seed
        for _ in range(80):
            logits, _ = model(context)
            # take last token logits
            last = logits[0, -1, :]
            probs = F.softmax(last, dim=-1)
            idx = torch.multinomial(probs, num_samples=1).item()
            ch = itos[idx]
            generated += ch
            # append to context (keep last seq_len tokens)
            new_ctx = torch.cat([context, torch.tensor([[idx]], device=device)], dim=1)
            if new_ctx.shape[1] > seq_len:
                new_ctx = new_ctx[:, -seq_len:]
            context = new_ctx
    print('\nSample generation:')
    print(generated)
    model.train()

# %% [markdown]
## 5. Inspecting Attention Weights

We can run a single forward pass and visualize attention matrices (one head averaged) for a sample input. This helps see "who attends to whom".

---

# %%
model.eval()
with torch.no_grad():
    sample_text = 'the cat sat on the mat'
    idx = torch.tensor([[stoi[c] for c in sample_text]], dtype=torch.long).to(device)
    logits, attn_weights_all = model(idx)

# attn_weights_all: list of [batch, num_heads, seq_len, seq_len]
# We'll average heads to get one matrix per layer

for layer, attn in enumerate(attn_weights_all):
    # attn shape: (batch, num_heads, seq_len, seq_len)
    attn_avg = attn[0].mean(dim=0).cpu().numpy()
    plt.figure(figsize=(5,4))
    plt.imshow(attn_avg, cmap='viridis')
    plt.title(f'Layer {layer} — Avg attention (heads averaged)')
    plt.colorbar()
    plt.xlabel('Key position')
    plt.ylabel('Query position')
    plt.show()

# %% [markdown]
## 6. Save & Export Model (Optional)

You can save the trained model weights to drive if running on Colab.

---

# %%
# torch.save(model.state_dict(), 'minigpt.pth')
# print('Saved minigpt.pth')

# %% [markdown]
## 7. Exercises

1. Increase `epochs` to 100 and observe the loss curve — does the model overfit?\
2. Replace char-level tokenizer with a simple word-level tokenizer and retrain.\
3. Try different sampling strategies: greedy, top-k, temperature scaling.\
4. Print logits and top-k tokens at each generation step to understand probability distribution.

---

# %% [markdown]
## End of Notebook

This mini notebook demonstrates next-token prediction training on a toy corpus end-to-end. Use it to experiment and observe how the model learns to produce coherent text from a tiny dataset.

Pages: 1 2 3 4 5 6 7 8 9