Excellent, Rajeev 🎯 — you’ve now mastered how GPT learns language from scratch (pretraining via next-token prediction).
Now it’s time to turn our pretrained LLM into something helpful, aligned, and conversational — like ChatGPT.
Welcome to Day 7: Fine-Tuning LLMs (SFT + RLHF)
You’ll learn how OpenAI took a pretrained GPT and transformed it into ChatGPT through two crucial stages:
Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).
This is one of the most important and misunderstood parts of modern AI engineering — and today, you’ll understand it clearly, step-by-step.
🌎 DAY 7 — Fine-Tuning LLMs (SFT + RLHF)
Making GPTs Helpful, Safe, and Aligned with Human Intent
🧠 1️⃣ Recap: What Pretraining Achieves
After Day 6, you know GPT’s pretraining goal:
[
P(x_t | x_{<t})
]
It learns general language patterns, world knowledge, and semantic reasoning —
but it’s not yet aligned with human expectations.
Example
After pretraining, GPT might respond:
User: “How do I make a bomb?”
GPT (raw): “Here’s a chemical process you can use…” ❌
It’s technically correct from a language perspective — but not safe or ethical.
So, we need alignment training.
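The pretraining objective above can be made concrete with a toy sketch: the model's final layer produces a logit per vocabulary token, and a softmax turns those into ( P(x_t | x_{<t}) ). This is illustrative plain Python with made-up logits, not a real model:

```python
import math

def next_token_probs(logits):
    """Softmax over vocabulary logits -> P(x_t | x_{<t})."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-token vocabulary; the model's logits favor token 2.
probs = next_token_probs([1.0, 0.5, 3.0, -1.0])

# If token 2 is the true next token, the per-token cross-entropy is:
loss = -math.log(probs[2])
```

Pretraining minimizes this negative log-likelihood summed over every token in the corpus, which is exactly why the raw model continues *any* text plausibly, safe or not.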
🎯 2️⃣ Goal of Fine-Tuning
Fine-tuning takes the pretrained LLM and teaches it to:
✅ Follow instructions
✅ Be safe and factual
✅ Align with human values
✅ Provide step-by-step reasoning (CoT)
✅ Refuse harmful requests
This transforms the model from predicting text → helpfully completing tasks.
🧩 3️⃣ The Two-Stage Fine-Tuning Process
| Stage | Name | Goal | Data Source |
|---|---|---|---|
| 1️⃣ | Supervised Fine-Tuning (SFT) | Teach model to follow human-written examples | Instruction datasets |
| 2️⃣ | RLHF (Reinforcement Learning from Human Feedback) | Optimize for helpfulness & alignment | Human preference comparisons |
🧩 Stage 1 — Supervised Fine-Tuning (SFT)
⚙️ 4️⃣ What Is Supervised Fine-Tuning?
SFT = Train the model on human-written (prompt, response) pairs.
You show the model examples of good behavior:
Prompt: "Explain gravity to a 10-year-old."
Response: "Gravity pulls objects toward each other, like how an apple falls to the ground."
The model learns to imitate this pattern.
This helps it follow instructions and write clear answers.
💡 Analogy:
Pretraining = Learning English
Fine-tuning = Learning how to be a good tutor who follows directions politely.
🧮 5️⃣ The SFT Objective
Exactly like pretraining, but now supervised:
[
\mathcal{L}_{SFT} = -\sum_t \log P(y_t \mid x, y_{<t})
]
Where x = user instruction, y = desired completion.
So GPT learns to generate the human-written output token by token.
⚙️ Example (Python Pseudocode)
```python
for batch in instruction_dataset:
    input_ids, labels = batch
    outputs = model(input_ids, labels=labels)  # standard causal-LM forward pass
    loss = outputs.loss
    optimizer.zero_grad()  # clear gradients from the previous step
    loss.backward()
    optimizer.step()
```
✅ Exactly the same mechanism as before — only the dataset differs.
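One practical detail the loop above glosses over: the loss should be computed only on the response tokens, not on the prompt. A common convention (used by PyTorch's `CrossEntropyLoss` via `ignore_index=-100`) is to mask prompt positions in the labels. A minimal sketch with toy token ids and a hypothetical helper name:

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt + response; mask prompt positions in the labels
    so the model is only trained to produce the response."""
    input_ids = prompt_ids + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Toy ids; a real pipeline would get these from a tokenizer.
input_ids, labels = build_sft_example([101, 7592, 102], [2023, 2003, 103])
```

Without this masking, the model wastes capacity learning to reproduce user prompts instead of learning to answer them.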
📚 6️⃣ Example Datasets for SFT
| Dataset | Description |
|---|---|
| OpenAI internal instruction data | Human-annotated prompts & responses |
| Alpaca / Dolly / OpenAssistant | Open instruction datasets for fine-tuning |
| Self-Instruct | GPT-generated + human-filtered examples |
🧠 SFT makes the model:
- Better at answering instructions
- Capable of multi-turn conversations
- Polite, structured, and context-aware
After SFT, the model can produce good answers — but they are not yet consistently safe or truthful.
So we move to Stage 2: RLHF.
🤝 Stage 2 — RLHF (Reinforcement Learning from Human Feedback)
🧩 7️⃣ The Motivation
SFT teaches imitation, but doesn’t teach preference.
Sometimes multiple valid answers exist — we want GPT to choose the one humans prefer.
That’s where Reinforcement Learning from Human Feedback (RLHF) comes in.
⚙️ 8️⃣ The Three-Model Setup in RLHF
RLHF uses three models:
| Model | Role | Description |
|---|---|---|
| 🧩 Policy Model (GPT-SFT) | Learner | Generates responses |
| 🧠 Reward Model (RM) | Teacher | Scores responses based on human preference |
| 🦾 PPO Optimizer | Trainer | Updates the policy to maximize reward |
💡 Visual Overview
User Prompt
↓
Policy Model (GPT-SFT) → multiple responses
↓
Human ranks best responses (A > B > C)
↓
Train Reward Model to predict human ranking
↓
Use Reward Model as “teacher”
↓
Train Policy Model with PPO to maximize reward
🧮 9️⃣ Step 1 — Train the Reward Model (RM)
Humans label pairs of model responses:
Prompt: "Explain quantum mechanics simply."
Response A: "It's about how tiny particles behave unpredictably." ✅
Response B: "Quantum mechanics is a hard subject in physics." ❌
We then train a model to predict the human preference.
Loss function (binary logistic loss):
[
\mathcal{L}_{RM} = -\log \sigma\big(r_\theta(x, A) - r_\theta(x, B)\big)
]
Where ( r_\theta(x,y) ) = reward score predicted by the RM.
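This pairwise loss is easy to verify numerically. A minimal sketch in plain Python (scalar rewards stand in for the RM's outputs; `A` is the human-preferred response):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rm_pairwise_loss(reward_a, reward_b):
    """Bradley–Terry style loss: -log σ(r(x,A) - r(x,B)),
    where A is the response humans preferred."""
    return -math.log(sigmoid(reward_a - reward_b))

# RM already scores the preferred answer higher -> small loss.
low = rm_pairwise_loss(2.0, -1.0)

# RM prefers the wrong answer -> large loss, strong gradient signal.
high = rm_pairwise_loss(-1.0, 2.0)
```

Note the loss depends only on the *difference* of the two rewards, so the RM learns a relative ranking rather than an absolute score scale.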
🧩 10️⃣ Step 2 — Optimize the Policy (with PPO)
Once the reward model is ready,
we use Reinforcement Learning (RL) to fine-tune GPT to maximize reward.
Objective:
[
\mathcal{L}_{PPO} = -\mathbb{E}_{y \sim \pi_\theta}\big[r_\phi(x, y)\big] + \beta\, D_{KL}\big(\pi_\theta \,\|\, \pi_{SFT}\big)
]
Meaning:
- First term → maximize reward
- Second term → keep model close to the original SFT (avoid going rogue)
✅ Balances creativity vs stability
⚙️ Simplified PPO Pseudocode
```python
for prompt in dataset:
    responses = policy_model.generate(prompt)
    rewards = reward_model.score(prompt, responses)
    # maximize reward while penalizing drift from the frozen SFT policy
    loss = -mean(rewards) + beta * KL_divergence(policy_model, sft_model)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
🧭 11️⃣ RLHF Training Pipeline Summary
1️⃣ Pretrained GPT → learns general language
2️⃣ Supervised Fine-Tuning → learns to follow instructions
3️⃣ Reward Model → learns human preferences
4️⃣ PPO Training → policy optimized to maximize reward
5️⃣ Result: ChatGPT-style model (helpful, safe, aligned)
🧩 12️⃣ The KL Divergence Term — Why It’s Crucial
The KL term ensures that the new model doesn’t deviate too far from the SFT model.
[
D_{KL}\big(\pi_{RLHF} \,\|\, \pi_{SFT}\big)
]
Without it, RLHF can cause:
- Over-optimization (weird or repetitive outputs)
- Mode collapse (always safe but useless)
- Loss of general capabilities
It’s like keeping a leash on the model’s creativity.
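To make the leash concrete, here is the KL divergence computed for toy next-token distributions (the numbers are illustrative, not from a real model):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

sft_policy = [0.70, 0.20, 0.10]   # π_SFT over a toy 3-token vocabulary
close      = [0.65, 0.25, 0.10]   # RLHF policy that stayed near the SFT model
drifted    = [0.05, 0.05, 0.90]   # RLHF policy that drifted far away

close_kl   = kl_divergence(close, sft_policy)    # small penalty
drifted_kl = kl_divergence(drifted, sft_policy)  # large penalty
```

The drifted policy pays a much larger KL penalty, so PPO only moves the model far from the SFT distribution when the reward gain outweighs it.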
⚖️ 13️⃣ Trade-Offs in RLHF
| Goal | Risk |
|---|---|
| Too much optimization | Over-sanitized, dull answers |
| Too little | Unsafe or biased outputs |
| Balanced | Helpful, harmless, honest behavior ✅ |
Fine-tuning is always about finding that balance.
💬 14️⃣ Example — Before vs After RLHF
Prompt: “Write a poem about sadness.”
GPT-SFT Output:
“Sadness is a state of emotion, often accompanied by low mood.”
GPT-RLHF Output (ChatGPT):
“Sadness drips like rain from heavy skies, yet nourishes growth below.”
👉 Same model architecture, but trained to produce human-preferred style.
🔍 15️⃣ Real-World RLHF Systems
| System | Components | Notes |
|---|---|---|
| ChatGPT (OpenAI) | GPT + SFT + RLHF + Safety Layers | RLHF over large human preference datasets |
| Anthropic Claude | Constitutional AI (AI feedback guided by a written set of principles, reducing reliance on human labels) | Alternative to classic RLHF |
| Google Gemini | Multi-modal + RLHF | Text + image alignment |
| LLaMA 3 | Open SFT + alignment + safety classifiers | Open-weight approach |
🧠 16️⃣ Beyond RLHF — Modern Alignment Techniques
Recent improvements to alignment go beyond human feedback:
- DPO (Direct Preference Optimization) — RLHF without PPO (simpler, faster)
- Constitutional AI — AI self-evaluates responses using ethical principles
- RLAIF (Reinforcement Learning from AI Feedback) — model-as-critic
- Toolformer / Function Calling — aligning models to use external tools safely
We’ll explore these in Day 9: Alignment and Safety Mechanisms.
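As a preview, DPO collapses the reward model + PPO loop into a single supervised loss over preference pairs. A toy sketch of the per-pair DPO loss — the log-probabilities here are plain floats standing in for sums of token log-probs from the policy and a frozen reference (SFT) model:

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss for one (chosen y_w, rejected y_l) pair:
    -log σ( β * [(logπ_θ(y_w|x) - logπ_ref(y_w|x))
               - (logπ_θ(y_l|x) - logπ_ref(y_l|x))] )."""
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy identical to the reference -> no preference learned yet.
equal = dpo_loss(0.0, 0.0, 0.0, 0.0)

# Policy now assigns the chosen answer more (and the rejected less)
# probability than the reference -> lower loss.
better = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

Because the "reward" is implicit in the policy/reference log-ratio, no separate reward model or RL loop is needed — one reason DPO has become a popular RLHF alternative.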
✅ 17️⃣ Summary — SFT + RLHF in One View
| Phase | What It Teaches | Example |
|---|---|---|
| Pretraining | Grammar, logic, world knowledge | “Knows facts” |
| SFT | Follows instructions | “Writes answers” |
| RLHF | Prefers helpful & safe outputs | “Behaves like ChatGPT” |
🧩 18️⃣ Visual Summary Flow
[Internet Text Corpus]
↓
Pretraining (Next-token prediction)
↓
[Instruction Dataset]
↓
Supervised Fine-Tuning (SFT)
↓
[Human Ranking Data]
↓
Train Reward Model
↓
PPO Fine-Tuning with Reward Model (RLHF)
↓
✅ Aligned Chat Model
🧠 19️⃣ Why RLHF Works (The Core Insight)
Because humans are the loss function.
Instead of just maximizing likelihood, GPT learns to maximize human preference.
It aligns statistical learning with human intention —
that’s the heart of alignment engineering.