A full course on Generative AI + LLM Engineering

Excellent, Rajeev 🎯 — you’ve now mastered how GPT learns language from scratch (pretraining via next-token prediction).
Now it’s time to turn our pretrained LLM into something helpful, aligned, and conversational — like ChatGPT.

Welcome to Day 7: Fine-Tuning LLMs (SFT + RLHF)

You’ll learn how OpenAI took a pretrained GPT and transformed it into ChatGPT through two crucial stages:
Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).

This is one of the most important and misunderstood parts of modern AI engineering — and today, you’ll understand it clearly, step-by-step.


🌎 DAY 7 — Fine-Tuning LLMs (SFT + RLHF)

Making GPTs Helpful, Safe, and Aligned with Human Intent


🧠 1️⃣ Recap: What Pretraining Achieves

After Day 6, you know GPT’s pretraining goal:

[
P(x_t | x_{<t})
]

It learns general language patterns, world knowledge, and semantic reasoning
but it’s not yet aligned with human expectations.

Example

After pretraining, GPT might respond:

User: “How do I make a bomb?”
GPT (raw): “Here’s a chemical process you can use…” ❌

It’s technically correct from a language perspective — but not safe or ethical.
So, we need alignment training.


🎯 2️⃣ Goal of Fine-Tuning

Fine-tuning takes the pretrained LLM and teaches it to:
✅ Follow instructions
✅ Be safe and factual
✅ Align with human values
✅ Provide step-by-step reasoning (CoT)
✅ Refuse harmful requests

This transforms the model from predicting text to helpfully completing tasks.


🧩 3️⃣ The Two-Stage Fine-Tuning Process

| Stage | Name | Goal | Data Source |
|---|---|---|---|
| 1️⃣ | Supervised Fine-Tuning (SFT) | Teach model to follow human-written examples | Instruction datasets |
| 2️⃣ | RLHF (Reinforcement Learning from Human Feedback) | Optimize for helpfulness & alignment | Human preference comparisons |

🧩 Stage 1 — Supervised Fine-Tuning (SFT)


⚙️ 4️⃣ What Is Supervised Fine-Tuning?

SFT = Train the model on human-written (prompt, response) pairs.

You show the model examples of good behavior:

Prompt: "Explain gravity to a 10-year-old."
Response: "Gravity pulls objects toward each other, like how an apple falls to the ground."

The model learns to imitate this pattern.
This helps it follow instructions and write clear answers.
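In practice, each (prompt, response) pair is packed into a single token sequence, and the loss is usually masked so it is computed only on the response tokens. Here is a minimal sketch of that packing, assuming a toy whitespace "tokenizer" and the common convention of `-100` as the ignore index (all names here are illustrative, not from a specific library):

```python
# Sketch: packing a (prompt, response) pair for SFT with loss masking.
IGNORE_INDEX = -100  # label value conventionally skipped by cross-entropy losses

def build_sft_example(prompt, response, vocab):
    prompt_ids = [vocab[w] for w in prompt.split()]
    response_ids = [vocab[w] for w in response.split()]
    input_ids = prompt_ids + response_ids          # one concatenated sequence
    # Mask the prompt so the loss is computed only on the response tokens
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

# Toy vocabulary built from the words we need (a real system uses BPE)
vocab = {w: i for i, w in enumerate(
    "explain gravity simply pulls objects together".split())}

input_ids, labels = build_sft_example(
    "explain gravity simply", "gravity pulls objects together", vocab)
```

The model still sees the full sequence (so it can condition on the prompt), but gradients only flow from the response positions.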


💡 Analogy:

Pretraining = Learning English
Fine-tuning = Learning how to be a good tutor who follows directions politely.


🧮 5️⃣ The SFT Objective

Exactly like pretraining, but now supervised:
[
\mathcal{L}_{SFT} = -\sum_t \log P(y_t \mid x, y_{<t})
]

Where x = user instruction, y = desired completion.
So GPT learns to generate the human-written output token by token.


⚙️ Example (Python Pseudocode)

for batch in instruction_dataset:
    input_ids, labels = batch
    outputs = model(input_ids, labels=labels)  # forward pass with teacher forcing
    loss = outputs.loss                        # cross-entropy on the labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

✅ Exactly the same mechanism as before — only the dataset differs.


📚 6️⃣ Example Datasets for SFT

| Dataset | Description |
|---|---|
| OpenAI internal instruction data | Human-annotated prompts & responses |
| Alpaca / Dolly / OpenAssistant | Open instruction datasets for fine-tuning |
| Self-Instruct | GPT-generated + human-filtered examples |

🧠 SFT makes the model:

  • Better at answering instructions
  • Capable of multi-turn conversations
  • Polite, structured, and context-aware

After SFT, the model can produce good answers — but they are not yet consistently safe or truthful.
So we move to Stage 2: RLHF.


🤝 Stage 2 — RLHF (Reinforcement Learning from Human Feedback)


🧩 7️⃣ The Motivation

SFT teaches imitation, but doesn’t teach preference.
Sometimes multiple valid answers exist — we want GPT to choose the one humans prefer.

That’s where Reinforcement Learning from Human Feedback (RLHF) comes in.


⚙️ 8️⃣ The Three-Part Setup in RLHF

RLHF combines three main components:

| Component | Role | Description |
|---|---|---|
| 🧩 Policy Model (GPT-SFT) | Learner | Generates responses |
| 🧠 Reward Model (RM) | Teacher | Scores responses based on human preference |
| 🦾 PPO Optimizer | Trainer | Updates the policy to maximize reward |

💡 Visual Overview

User Prompt
   ↓
Policy Model (GPT-SFT) → multiple responses
   ↓
Humans rank the responses (A > B > C)
   ↓
Train Reward Model to predict human ranking
   ↓
Use Reward Model as “teacher”
   ↓
Train Policy Model with PPO to maximize reward

🧮 9️⃣ Step 1 — Train the Reward Model (RM)

Humans label pairs of model responses:

Prompt: "Explain quantum mechanics simply."
Response A: "It's about how tiny particles behave unpredictably." ✅
Response B: "Quantum mechanics is a hard subject in physics." ❌

We then train a model to predict the human preference.

Loss function (binary logistic loss):
[
\mathcal{L}_{RM} = -\log \sigma\big(r_\theta(x, A) - r_\theta(x, B)\big)
]

Where ( r_\theta(x,y) ) = the reward score predicted by the RM, and A is the response humans preferred.
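This pairwise loss can be sketched in a few lines of plain Python. Small loss when the RM already ranks the preferred answer higher, large loss when it ranks it lower (function names here are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rm_pairwise_loss(r_preferred, r_rejected):
    # -log sigma(r_A - r_B): shrinks as the RM's margin for A over B grows
    return -math.log(sigmoid(r_preferred - r_rejected))
```

At a zero margin the loss is log 2 (the RM is indifferent), and it decays toward zero as the margin increases.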


🧩 10️⃣ Step 2 — Optimize the Policy (with PPO)

Once the reward model is ready,
we use Reinforcement Learning (RL) to fine-tune GPT to maximize reward.

Objective:
[
\mathcal{L}_{PPO} = -\mathbb{E}_{y \sim \pi_\theta}\big[ r_\phi(x, y) \big] + \beta\, D_{KL}\big(\pi_\theta \,\|\, \pi_{SFT}\big)
]

Meaning:

  • First term → maximize reward
  • Second term → keep model close to the original SFT (avoid going rogue)

✅ Balances creativity vs stability


⚙️ Simplified PPO Pseudocode

for prompt in dataset:
    responses = policy_model.generate(prompt)
    rewards = reward_model.score(prompt, responses)
    # Maximize reward while penalizing drift away from the SFT model
    loss = -mean(rewards) + beta * KL_divergence(policy_model, sft_model)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

🧭 11️⃣ RLHF Training Pipeline Summary

1️⃣ Pretrained GPT → learns general language
2️⃣ Supervised Fine-Tuning → learns to follow instructions
3️⃣ Reward Model → learns human preferences
4️⃣ PPO Training → policy optimized to maximize reward
5️⃣ Result: ChatGPT-style model (helpful, safe, aligned)

🧩 12️⃣ The KL Divergence Term — Why It’s Crucial

The KL term ensures that the new model doesn’t deviate too far from the SFT model.

[
D_{KL}\big(\pi_{RLHF} \,\|\, \pi_{SFT}\big)
]

Without it, RLHF can cause:

  • Over-optimization (weird or repetitive outputs)
  • Mode collapse (always safe but useless)
  • Loss of general capabilities

It’s like keeping a leash on the model’s creativity.
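Many RLHF implementations fold this leash directly into the reward: the KL term is estimated per token from the sampled log-probabilities and subtracted from the RM score. A minimal sketch under that assumption (names are illustrative):

```python
def shaped_reward(rm_score, policy_logps, sft_logps, beta=0.1):
    """Reward-model score minus a sampled KL penalty.

    policy_logps / sft_logps: log-probabilities that the policy and the frozen
    SFT model assign to the tokens the policy actually generated.
    """
    # Sampled KL estimate: sum over generated tokens of log pi - log pi_sft
    kl_estimate = sum(p - q for p, q in zip(policy_logps, sft_logps))
    return rm_score - beta * kl_estimate
```

If the policy has not drifted (identical log-probs), there is no penalty; the more confidently it diverges from the SFT model, the more reward it gives back.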


⚖️ 13️⃣ Trade-Offs in RLHF

| Optimization Pressure | Outcome |
|---|---|
| Too much | Over-sanitized, dull answers |
| Too little | Unsafe or biased outputs |
| Balanced | Helpful, harmless, honest behavior ✅ |

Fine-tuning is always about finding that balance.


💬 14️⃣ Example — Before vs After RLHF

Prompt: “Write a poem about sadness.”

GPT-SFT Output:

“Sadness is a state of emotion, often accompanied by low mood.”

GPT-RLHF Output (ChatGPT):

“Sadness drips like rain from heavy skies, yet nourishes growth below.”

👉 Same model architecture, but trained to produce human-preferred style.


🔍 15️⃣ Real-World RLHF Systems

| System | Components | Notes |
|---|---|---|
| ChatGPT (OpenAI) | GPT + SFT + RLHF + Safety Layers | RLHF over large human preference datasets |
| Anthropic Claude | Constitutional AI (AI feedback guided by written principles) | Alternative/complement to RLHF |
| Google Gemini | Multi-modal + RLHF | Text + image alignment |
| LLaMA 3 | Open SFT + alignment + safety classifiers | Open-weight approach |

🧠 16️⃣ Beyond RLHF — Modern Alignment Techniques

Recent improvements to alignment go beyond human feedback:

  • DPO (Direct Preference Optimization) — RLHF without PPO (simpler, faster)
  • Constitutional AI — AI self-evaluates responses using ethical principles
  • RLAIF (Reinforcement Learning from AI Feedback) — model-as-critic
  • Toolformer / Function Calling — aligning models to use external tools safely
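As a preview of DPO from the list above: it replaces the reward model and the PPO loop with a single classification-style loss over preference pairs, using a frozen reference (SFT) model. A minimal sketch, assuming sequence-level log-probabilities are available (names are illustrative):

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (winner, loser) preference pair.

    *_w: log-prob of the human-preferred answer, *_l: of the rejected one,
    under the trainable policy and the frozen reference model respectively.
    """
    # Implicit reward of each answer: beta * (log pi_theta - log pi_ref)
    margin = beta * ((policy_logp_w - ref_logp_w)
                     - (policy_logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

The loss drops as the policy raises the winner's likelihood (relative to the reference) above the loser's — no separate reward model or RL loop needed.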

We’ll explore these in Day 9: Alignment and Safety Mechanisms.


✅ 17️⃣ Summary — SFT + RLHF in One View

| Phase | What It Teaches | Example |
|---|---|---|
| Pretraining | Grammar, logic, world knowledge | "Knows facts" |
| SFT | Follows instructions | "Writes answers" |
| RLHF | Prefers helpful & safe outputs | "Behaves like ChatGPT" |

🧩 18️⃣ Visual Summary Flow

[Internet Text Corpus]
   ↓
Pretraining (Next-token prediction)
   ↓
[Instruction Dataset]
   ↓
Supervised Fine-Tuning (SFT)
   ↓
[Human Ranking Data]
   ↓
Train Reward Model
   ↓
PPO Fine-Tuning with Reward Model (RLHF)
   ↓
✅ Aligned Chat Model

🧠 19️⃣ Why RLHF Works (The Core Insight)

Because humans are the loss function.
Instead of just maximizing likelihood, GPT learns to maximize human preference.
It aligns statistical learning with human intention —
that’s the heart of alignment engineering.
