Excellent — Day 6 is where you move from “Spark runs” to “Spark runs the way I want.”
This day gives you control.
🧠 DAY 6 — SparkSession, Configs & Runtime Control
(Why Defaults Hurt You on AWS)
Goal today:
Understand where Spark configuration actually lives, which configs matter on AWS, and how runtime behavior changes between EMR, Glue, and Serverless.
🎯 What You’ll Master Today
By the end of Day 6, you will:
- Fully understand SparkSession internals
- Know the Spark config hierarchy
- Control parallelism, shuffle, memory
- Understand Glue vs EMR defaults
- Avoid silent performance killers
- Answer “Which config fixes this?” in interviews
1️⃣ SparkSession — What It REALLY Is
❌ Wrong idea
“SparkSession is just how we start Spark”
✅ Correct idea
SparkSession is the gateway to the Spark engine
SparkSession:
- Initializes SparkContext
- Loads configs
- Manages SQL, DataFrames, catalogs
- Connects driver ↔ executors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
📌 Every config must exist before executors start
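To make this concrete, here is a minimal sketch of building a session with configs attached before getOrCreate(). The app name and values are illustrative only, and it assumes pyspark plus a local JVM are available:

```python
# Minimal sketch: attach configs to the builder BEFORE getOrCreate().
# App name and the partition value are illustrative, not recommendations.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("day6-demo")                           # hypothetical app name
    .config("spark.sql.shuffle.partitions", "80")   # set before the session exists
    .getOrCreate()
)

# Session-level SQL configs can be read back at runtime:
print(spark.conf.get("spark.sql.shuffle.partitions"))
```

Anything passed to .config() here lands in the session's configuration at startup, which is exactly when executor-level settings still matter.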
2️⃣ Spark Context vs Spark Session (CLARITY)
| Component | Role |
|---|---|
| SparkContext | Low-level engine control |
| SparkSession | Unified entry point |
📌 You rarely touch SparkContext directly — but it still exists.
3️⃣ Spark Config Hierarchy (THIS IS CRITICAL)

Config precedence (highest → lowest)
- spark-submit flags
- SparkSession.builder.config()
- spark-defaults.conf (cluster defaults on EMR / Glue)
- Spark internal defaults
🧠 Interview gold:
“Configs defined at submission time override cluster defaults.”
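The precedence order can be modeled as layered dictionaries where the first layer that defines a key wins. This is a toy illustration in plain Python (the layer contents are made up, not real cluster output):

```python
# Toy model of Spark's config precedence (highest wins), using ChainMap.
# All values below are illustrative.
from collections import ChainMap

spark_internal  = {"spark.sql.shuffle.partitions": "200"}  # Spark internal default
cluster_default = {"spark.sql.shuffle.partitions": "100"}  # e.g. spark-defaults.conf
builder_config  = {}                                       # SparkSession.builder.config()
submit_flags    = {"spark.sql.shuffle.partitions": "64"}   # spark-submit --conf

# ChainMap searches left to right, mirroring Spark's precedence order.
effective = ChainMap(submit_flags, builder_config, cluster_default, spark_internal)
print(effective["spark.sql.shuffle.partitions"])  # "64" — submission-time value wins
```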
4️⃣ Types of Spark Configs (MEMORIZE CATEGORIES)
🔹 Execution & Parallelism
- spark.sql.shuffle.partitions
- spark.default.parallelism
🔹 Memory
- spark.executor.memory
- spark.executor.memoryOverhead
- spark.driver.memory
🔹 Execution Behavior
- spark.sql.autoBroadcastJoinThreshold
- spark.serializer
🔹 AWS / S3
- S3 committer configs
- Retry configs
📌 Don’t memorize all — know what category to look in
5️⃣ The Most Dangerous Default on AWS
⚠️ spark.sql.shuffle.partitions = 200
Why dangerous?
- Small data → too many tasks
- Big data → too few tasks
- Glue jobs → wasted DPUs
- EMR jobs → slow shuffles
Rule of thumb:
Shuffle partitions ≈ total executor cores × 2–3
📌 This single config fixes many slow jobs.
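Applying the rule of thumb is simple arithmetic. The cluster shape below is hypothetical, just to show the calculation:

```python
# Rule of thumb: shuffle partitions ≈ total executor cores × 2–3.
# Cluster shape is a hypothetical example, not a universal recommendation.
executors = 10
cores_per_executor = 4
total_cores = executors * cores_per_executor   # 40 cores available

low, high = total_cores * 2, total_cores * 3   # 80–120 partitions
print(f"Set spark.sql.shuffle.partitions somewhere between {low} and {high}")
```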
6️⃣ Parallelism: Hardware ≠ Speed
Example (EMR)
Executors: 10
Cores/executor: 4
Total cores = 40
If:
Partitions = 10
Then:
- Only 10 tasks run
- 30 cores idle
- 75% waste
📌 Parallelism comes from partitions, not machines
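The waste in the example above can be quantified directly:

```python
# Quantifying the idle capacity from the EMR example: 40 cores, 10 partitions.
total_cores = 40
partitions = 10

running = min(partitions, total_cores)     # only 10 tasks can run concurrently
idle = total_cores - running               # 30 cores sit idle
waste_pct = idle / total_cores * 100       # 75.0% of paid-for capacity unused
print(running, idle, waste_pct)
```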
7️⃣ Glue vs EMR — Runtime Differences (VERY IMPORTANT)


Spark on AWS Glue
- DPUs abstract memory + cores
- Limited low-level control
- Good defaults for ETL
- Less tuning flexibility
Spark on EMR
- Full JVM control
- Explicit executor sizing
- Best for heavy tuning
- More DevOps
🧠 Senior insight:
Glue optimizes for convenience; EMR optimizes for control.
8️⃣ When Configs DON’T Apply (TRICKY)
Some configs:
- Must be set before job starts
- Cannot be changed at runtime
❌ This won’t work:
spark.conf.set("spark.executor.memory", "8g")
Why?
- Executors already started
✔ Correct place:
- spark-submit
- Glue job parameters
- EMR step configs
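In PySpark terms, the distinction looks like this. A sketch only: the "8g" value is illustrative, and on Glue you would pass this through job parameters rather than the builder:

```python
# Sketch: executor memory must exist BEFORE executors are allocated.
# The "8g" value is illustrative; note getOrCreate() returns any existing
# session unchanged, so this only works when creating a fresh session.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "8g")   # applied at startup — effective
    .getOrCreate()
)

# spark.conf.set("spark.executor.memory", "8g")  # too late: executors already running
```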
9️⃣ AWS Cost Implication (REAL LIFE)
Bad configs:
- Too many shuffles
- Too much memory
- Idle executors
Good configs:
- Right-size executors
- Balanced partitions
- Faster job → lower bill
📌 Spark tuning = AWS cost tuning
🔥 Interview-Level Q&A (YOU CAN ANSWER NOW)
Q: Where do Spark configs live?
✔ Multiple layers with clear precedence
Q: Why doesn’t changing executor memory at runtime work?
✔ Executors already allocated
Q: Why does Glue feel “less tunable” than EMR?
✔ DPUs abstract JVM internals
🧪 DAY 6 THINKING EXERCISE
Scenario:
Spark job is slow
Executors underutilized
CPU usage ~30%
Questions:
- What config category do you check first?
- Which 2 configs are your first suspects?
- Why might adding more nodes NOT help?
(We’ll formalize this on Day 13 tuning.)
🧠 DAY 6 MEMORY ANCHORS
SparkSession = gateway
Configs are hierarchical
Partitions = parallelism
Defaults hurt on AWS
Runtime changes are limited
🎤 Resume-Grade Line (You Earned This)
Experienced in configuring and tuning the Spark runtime via SparkSession and the configuration hierarchy to optimize performance and cost on AWS Glue and EMR.
⏭️ DAY 7 — WEEK 1 CONSOLIDATION & EXECUTION THINKING
Tomorrow you’ll:
- Rebuild Spark execution from memory
- Debug a slow job mentally
- Connect Days 1–6 into one system
- Lock foundations permanently
Reply with:
DAY 7
and we’ll consolidate everything before moving into real data engineering work 🚀