Excellent, Rajeev 🎯 — you’ve now mastered how GPT learns language from scratch (pretraining via next-token prediction).
Now it’s time to turn our pretrained LLM into something helpful, aligned, and conversational — like ChatGPT.
Welcome to Day 7: Fine-Tuning LLMs (SFT + RLHF)
You’ll learn how OpenAI took a pretrained GPT and transformed it into ChatGPT through two crucial stages:
Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).
This is one of the most important and misunderstood parts of modern AI engineering — and today, you’ll understand it clearly, step-by-step.
🌎 DAY 7 — Fine-Tuning LLMs (SFT + RLHF)
Making GPTs Helpful, Safe, and Aligned with Human Intent
🧠 1️⃣ Recap: What Pretraining Achieves
After Day 6, you know GPT’s pretraining goal:
[
P(x_t | x_{<t})
]
It learns general language patterns, world knowledge, and semantic reasoning —
but it’s not yet aligned with human expectations.
Example
After pretraining, GPT might respond:
User: “How do I make a bomb?”
GPT (raw): “Here’s a chemical process you can use…” ❌
It’s technically correct from a language perspective — but not safe or ethical.
So, we need alignment training.
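The pretraining objective above can be made concrete with a toy sketch: the model's final layer produces a logit per vocabulary token, and a softmax turns those into ( P(x_t | x_{<t}) ). This is illustrative plain Python with made-up logits, not a real model:

```python
import math

def next_token_probs(logits):
    """Softmax over vocabulary logits -> P(x_t | x_{<t})."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-token vocabulary; the model's logits favor token 2.
probs = next_token_probs([1.0, 0.5, 3.0, -1.0])

# If token 2 is the true next token, the per-token cross-entropy is:
loss = -math.log(probs[2])
```

Pretraining minimizes this negative log-likelihood summed over every token in the corpus, which is exactly why the raw model continues *any* text plausibly, safe or not.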
🎯 2️⃣ Goal of Fine-Tuning
Fine-tuning takes the pretrained LLM and teaches it to:
✅ Follow instructions
✅ Be safe and factual
✅ Align with human values
✅ Provide step-by-step reasoning (CoT)
✅ Refuse harmful requests
This transforms the model from predicting text → helpfully completing tasks.
🧩 3️⃣ The Two-Stage Fine-Tuning Process
| Stage | Name | Goal | Data Source |
|---|---|---|---|
| 1️⃣ | Supervised Fine-Tuning (SFT) | Teach model to follow human-written examples | Instruction datasets |
| 2️⃣ | RLHF (Reinforcement Learning from Human Feedback) | Optimize for helpfulness & alignment | Human preference comparisons |
🧩 Stage 1 — Supervised Fine-Tuning (SFT)
⚙️ 4️⃣ What Is Supervised Fine-Tuning?
SFT = Train the model on human-written (prompt, response) pairs.
You show the model examples of good behavior:
Prompt: "Explain gravity to a 10-year-old."
Response: "Gravity pulls objects toward each other, like how an apple falls to the ground."
The model learns to imitate this pattern.
This helps it follow instructions and write clear answers.
💡 Analogy:
Pretraining = Learning English
Fine-tuning = Learning how to be a good tutor who follows directions politely.
🧮 5️⃣ The SFT Objective
Exactly like pretraining, but now supervised:
[
\mathcal{L}_{SFT} = -\sum_t \log P(y_t \mid x, y_{<t})
]
Where x = user instruction, y = desired completion.
So GPT learns to generate the human-written output token by token.
⚙️ Example (Python Pseudocode)
```python
for batch in instruction_dataset:
    input_ids, labels = batch
    outputs = model(input_ids, labels=labels)  # standard causal-LM forward pass
    loss = outputs.loss
    optimizer.zero_grad()  # clear gradients from the previous step
    loss.backward()
    optimizer.step()
```
✅ Exactly the same mechanism as before — only the dataset differs.
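One practical detail the loop above glosses over: the loss should be computed only on the response tokens, not on the prompt. A common convention (used by PyTorch's `CrossEntropyLoss` via `ignore_index=-100`) is to mask prompt positions in the labels. A minimal sketch with toy token ids and a hypothetical helper name:

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt + response; mask prompt positions in the labels
    so the model is only trained to produce the response."""
    input_ids = prompt_ids + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Toy ids; a real pipeline would get these from a tokenizer.
input_ids, labels = build_sft_example([101, 7592, 102], [2023, 2003, 103])
```

Without this masking, the model wastes capacity learning to reproduce user prompts instead of learning to answer them.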
📚 6️⃣ Example Datasets for SFT
| Dataset | Description |
|---|---|
| OpenAI internal instruction data | Human-annotated prompts & responses |
| Alpaca / Dolly / OpenAssistant | Open instruction datasets for fine-tuning |
| Self-Instruct | GPT-generated + human-filtered examples |
🧠 SFT makes the model:
- Better at answering instructions
- Capable of multi-turn conversations
- Polite, structured, and context-aware
After SFT, the model can produce good answers — but they are not yet consistently safe or truthful.
So we move to Stage 2: RLHF.
🤝 Stage 2 — RLHF (Reinforcement Learning from Human Feedback)
🧩 7️⃣ The Motivation
SFT teaches imitation, but doesn’t teach preference.
Sometimes multiple valid answers exist — we want GPT to choose the one humans prefer.
That’s where Reinforcement Learning from Human Feedback (RLHF) comes in.
⚙️ 8️⃣ The Three-Model Setup in RLHF
RLHF uses three models:
| Model | Role | Description |
|---|---|---|
| 🧩 Policy Model (GPT-SFT) | Learner | Generates responses |
| 🧠 Reward Model (RM) | Teacher | Scores responses based on human preference |
| 🦾 PPO Optimizer | Trainer | Updates the policy to maximize reward |
💡 Visual Overview
User Prompt
↓
Policy Model (GPT-SFT) → multiple responses
↓
Human ranks best responses (A > B > C)
↓
Train Reward Model to predict human ranking
↓
Use Reward Model as “teacher”
↓
Train Policy Model with PPO to maximize reward
🧮 9️⃣ Step 1 — Train the Reward Model (RM)
Humans label pairs of model responses:
Prompt: "Explain quantum mechanics simply."
Response A: "It's about how tiny particles behave unpredictably." ✅
Response B: "Quantum mechanics is a hard subject in physics." ❌
We then train a model to predict the human preference.
Loss function (binary logistic loss):
[
\mathcal{L}_{RM} = -\log \sigma\big(r_\theta(x, A) - r_\theta(x, B)\big)
]
Where ( r_\theta(x,y) ) = reward score predicted by the RM.
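This pairwise loss is easy to verify numerically. A minimal sketch in plain Python (scalar rewards stand in for the RM's outputs; `A` is the human-preferred response):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rm_pairwise_loss(reward_a, reward_b):
    """Bradley–Terry style loss: -log σ(r(x,A) - r(x,B)),
    where A is the response humans preferred."""
    return -math.log(sigmoid(reward_a - reward_b))

# RM already scores the preferred answer higher -> small loss.
low = rm_pairwise_loss(2.0, -1.0)

# RM prefers the wrong answer -> large loss, strong gradient signal.
high = rm_pairwise_loss(-1.0, 2.0)
```

Note the loss depends only on the *difference* of the two rewards, so the RM learns a relative ranking rather than an absolute score scale.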
🧩 10️⃣ Step 2 — Optimize the Policy (with PPO)
Once the reward model is ready,
we use Reinforcement Learning (RL) to fine-tune GPT to maximize reward.
Objective:
[
\mathcal{L}_{PPO} = -\mathbb{E}_{y \sim \pi_\theta}\big[r_\phi(x, y)\big] + \beta\, D_{KL}\big(\pi_\theta \,\|\, \pi_{SFT}\big)
]
Meaning:
- First term → maximize reward
- Second term → keep model close to the original SFT (avoid going rogue)
✅ Balances creativity vs stability
⚙️ Simplified PPO Pseudocode
```python
for prompt in dataset:
    responses = policy_model.generate(prompt)
    rewards = reward_model.score(prompt, responses)
    # maximize reward while penalizing drift from the frozen SFT policy
    loss = -mean(rewards) + beta * KL_divergence(policy_model, sft_model)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
🧭 11️⃣ RLHF Training Pipeline Summary
1️⃣ Pretrained GPT → learns general language
2️⃣ Supervised Fine-Tuning → learns to follow instructions
3️⃣ Reward Model → learns human preferences
4️⃣ PPO Training → policy optimized to maximize reward
5️⃣ Result: ChatGPT-style model (helpful, safe, aligned)
🧩 12️⃣ The KL Divergence Term — Why It’s Crucial
The KL term ensures that the new model doesn’t deviate too far from the SFT model.
[
D_{KL}\big(\pi_{RLHF} \,\|\, \pi_{SFT}\big)
]
Without it, RLHF can cause:
- Over-optimization (weird or repetitive outputs)
- Mode collapse (always safe but useless)
- Loss of general capabilities
It’s like keeping a leash on the model’s creativity.
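To make the leash concrete, here is the KL divergence computed for toy next-token distributions (the numbers are illustrative, not from a real model):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

sft_policy = [0.70, 0.20, 0.10]   # π_SFT over a toy 3-token vocabulary
close      = [0.65, 0.25, 0.10]   # RLHF policy that stayed near the SFT model
drifted    = [0.05, 0.05, 0.90]   # RLHF policy that drifted far away

close_kl   = kl_divergence(close, sft_policy)    # small penalty
drifted_kl = kl_divergence(drifted, sft_policy)  # large penalty
```

The drifted policy pays a much larger KL penalty, so PPO only moves the model far from the SFT distribution when the reward gain outweighs it.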
⚖️ 13️⃣ Trade-Offs in RLHF
| Goal | Risk |
|---|---|
| Too much optimization | Over-sanitized, dull answers |
| Too little | Unsafe or biased outputs |
| Balanced | Helpful, harmless, honest behavior ✅ |
Fine-tuning is always about finding that balance.
💬 14️⃣ Example — Before vs After RLHF
Prompt: “Write a poem about sadness.”
GPT-SFT Output:
“Sadness is a state of emotion, often accompanied by low mood.”
GPT-RLHF Output (ChatGPT):
“Sadness drips like rain from heavy skies, yet nourishes growth below.”
👉 Same model architecture, but trained to produce human-preferred style.
🔍 15️⃣ Real-World RLHF Systems
| System | Components | Notes |
|---|---|---|
| ChatGPT (OpenAI) | GPT + SFT + RLHF + Safety Layers | RLHF over large human preference datasets |
| Anthropic Claude | Constitutional AI (AI feedback guided by a written set of principles, reducing reliance on human labels) | Alternative to classic RLHF |
| Google Gemini | Multi-modal + RLHF | Text + image alignment |
| LLaMA 3 | Open SFT + alignment + safety classifiers | Open-weight approach |
🧠 16️⃣ Beyond RLHF — Modern Alignment Techniques
Recent improvements to alignment go beyond human feedback:
- DPO (Direct Preference Optimization) — RLHF without PPO (simpler, faster)
- Constitutional AI — AI self-evaluates responses using ethical principles
- RLAIF (Reinforcement Learning from AI Feedback) — model-as-critic
- Toolformer / Function Calling — aligning models to use external tools safely
We’ll explore these in Day 9: Alignment and Safety Mechanisms.
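As a preview, DPO collapses the reward model + PPO loop into a single supervised loss over preference pairs. A toy sketch of the per-pair DPO loss — the log-probabilities here are plain floats standing in for sums of token log-probs from the policy and a frozen reference (SFT) model:

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss for one (chosen y_w, rejected y_l) pair:
    -log σ( β * [(logπ_θ(y_w|x) - logπ_ref(y_w|x))
               - (logπ_θ(y_l|x) - logπ_ref(y_l|x))] )."""
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy identical to the reference -> no preference learned yet.
equal = dpo_loss(0.0, 0.0, 0.0, 0.0)

# Policy now assigns the chosen answer more (and the rejected less)
# probability than the reference -> lower loss.
better = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

Because the "reward" is implicit in the policy/reference log-ratio, no separate reward model or RL loop is needed — one reason DPO has become a popular RLHF alternative.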
✅ 17️⃣ Summary — SFT + RLHF in One View
| Phase | What It Teaches | Example |
|---|---|---|
| Pretraining | Grammar, logic, world knowledge | “Knows facts” |
| SFT | Follows instructions | “Writes answers” |
| RLHF | Prefers helpful & safe outputs | “Behaves like ChatGPT” |
🧩 18️⃣ Visual Summary Flow
[Internet Text Corpus]
↓
Pretraining (Next-token prediction)
↓
[Instruction Dataset]
↓
Supervised Fine-Tuning (SFT)
↓
[Human Ranking Data]
↓
Train Reward Model
↓
PPO Fine-Tuning with Reward Model (RLHF)
↓
✅ Aligned Chat Model
🧠 19️⃣ Why RLHF Works (The Core Insight)
Because humans are the loss function.
Instead of just maximizing likelihood, GPT learns to maximize human preference.
It aligns statistical learning with human intention —
that’s the heart of alignment engineering.