30-DAY PYSPARK MASTERY ON AWS course

Excellent — Day 6 is where you move from “Spark runs” to “Spark runs the way I want.”
This day gives you control.


🧠 DAY 6 — SparkSession, Configs & Runtime Control

(Why Defaults Hurt You on AWS)

Goal today:
Understand where Spark configuration actually lives, which configs matter on AWS, and how runtime behavior changes between EMR, Glue, and Serverless.


🎯 What You’ll Master Today

By the end of Day 6, you will:

  • Fully understand SparkSession internals
  • Know the Spark config hierarchy
  • Control parallelism, shuffle, memory
  • Understand Glue vs EMR defaults
  • Avoid silent performance killers
  • Answer “Which config fixes this?” in interviews

1️⃣ SparkSession — What It REALLY Is

❌ Wrong idea

“SparkSession is just how we start Spark”

✅ Correct idea

SparkSession is the gateway to the Spark engine

SparkSession:

  • Initializes SparkContext
  • Loads configs
  • Manages SQL, DataFrames, catalogs
  • Connects driver ↔ executors
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
```

📌 Configs that shape the cluster (executor memory, cores) must exist before executors start


2️⃣ Spark Context vs Spark Session (CLARITY)

| Component | Role |
| --- | --- |
| SparkContext | Low-level engine control |
| SparkSession | Unified entry point |

📌 You rarely touch SparkContext directly — but it still exists.


3️⃣ Spark Config Hierarchy (THIS IS CRITICAL)


Config precedence (highest → lowest)

  1. spark-submit flags
  2. SparkSession.builder.config()
  3. spark-defaults.conf
  4. Cluster defaults (EMR / Glue)
  5. Spark internal defaults

🧠 Interview gold:

“Configs defined at submission time override cluster defaults.”
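One way to internalize the precedence order is as a chained dictionary lookup: the highest layer that defines a key wins. This is a toy model of the resolution rule, not Spark's actual implementation, and the layer contents below are illustrative values:

```python
from collections import ChainMap

# Toy model of config precedence: the first map that defines a key wins.
# Values here are illustrative, not real cluster settings.
submit_flags    = {"spark.executor.memory": "8g"}            # highest precedence
builder_configs = {"spark.sql.shuffle.partitions": "64"}
spark_defaults  = {"spark.executor.memory": "4g",            # lowest precedence
                   "spark.sql.shuffle.partitions": "200"}

resolved = ChainMap(submit_flags, builder_configs, spark_defaults)

print(resolved["spark.executor.memory"])         # spark-submit flag wins: 8g
print(resolved["spark.sql.shuffle.partitions"])  # builder config wins: 64
```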


4️⃣ Types of Spark Configs (MEMORIZE CATEGORIES)

🔹 Execution & Parallelism

  • spark.sql.shuffle.partitions
  • spark.default.parallelism

🔹 Memory

  • spark.executor.memory
  • spark.executor.memoryOverhead
  • spark.driver.memory

🔹 Execution Behavior

  • spark.sql.autoBroadcastJoinThreshold
  • spark.serializer

🔹 AWS / S3

  • S3 committer configs
  • Retry configs

📌 Don’t memorize all — know what category to look in
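A quick way to keep the categories straight is to group the keys by what they control. A minimal sketch, where the values are Spark's documented defaults or illustrative examples, not recommendations for your workload:

```python
# Spark config keys grouped by category.
# Values are documented defaults or illustrative examples only.
CONFIG_CATEGORIES = {
    "parallelism": {
        "spark.sql.shuffle.partitions": "200",  # Spark's default
        "spark.default.parallelism": None,      # usually derived from total cores
    },
    "memory": {
        "spark.executor.memory": "4g",           # illustrative (default is 1g)
        "spark.executor.memoryOverhead": "384m", # minimum; ~10% of executor memory
        "spark.driver.memory": "1g",
    },
    "behavior": {
        "spark.sql.autoBroadcastJoinThreshold": "10485760",  # 10 MB default
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    },
}

def category_of(key: str) -> str:
    """Return which category a config key belongs to (or 'unknown')."""
    for cat, keys in CONFIG_CATEGORIES.items():
        if key in keys:
            return cat
    return "unknown"

print(category_of("spark.executor.memory"))  # memory
```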


5️⃣ The Most Dangerous Default on AWS

⚠️ spark.sql.shuffle.partitions = 200

Why dangerous?

  • Small data → too many tasks
  • Big data → too few tasks
  • Glue jobs → wasted DPUs
  • EMR jobs → slow shuffles

Rule of thumb:

Shuffle partitions ≈ total executor cores × 2–3

📌 This single config fixes many slow jobs.
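The rule of thumb above is easy to turn into a helper. The multiplier of 2–3 is this lesson's heuristic, not a Spark API:

```python
def recommended_shuffle_partitions(executors: int, cores_per_executor: int,
                                   multiplier: int = 2) -> int:
    """Shuffle partitions ~= total executor cores x 2-3 (lesson heuristic)."""
    return executors * cores_per_executor * multiplier

# 10 executors x 4 cores -> 80 partitions at multiplier 2 (vs the default 200)
print(recommended_shuffle_partitions(10, 4))  # 80
```

Because `spark.sql.shuffle.partitions` is a SQL-layer config, you can apply the result at runtime with `spark.conf.set("spark.sql.shuffle.partitions", str(n))`, unlike the cluster-level configs covered later.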


6️⃣ Parallelism: Hardware ≠ Speed

Example (EMR)

Executors: 10
Cores/executor: 4
Total cores = 40

If:

Partitions = 10

Then:

  • Only 10 tasks run
  • 30 cores idle
  • 75% waste

📌 Parallelism comes from partitions, not machines
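The waste in the example above is simple arithmetic, sketched here:

```python
def core_utilization(total_cores: int, partitions: int) -> float:
    """Fraction of cores doing work in one wave of tasks."""
    return min(partitions, total_cores) / total_cores

# 40 cores but only 10 partitions: 25% utilized, 75% idle
print(core_utilization(40, 10))  # 0.25

# Matching partitions to a small multiple of cores fixes it
print(core_utilization(40, 80))  # 1.0
```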


7️⃣ Glue vs EMR — Runtime Differences (VERY IMPORTANT)


Spark on AWS Glue

  • DPUs abstract memory + cores
  • Limited low-level control
  • Good defaults for ETL
  • Less tuning flexibility

Spark on EMR

  • Full JVM control
  • Explicit executor sizing
  • Best for heavy tuning
  • More DevOps

🧠 Senior insight:

Glue optimizes for convenience; EMR optimizes for control.


8️⃣ When Configs DON’T Apply (TRICKY)

Some configs:

  • Must be set before job starts
  • Cannot be changed at runtime

❌ This won’t work:

```python
# Raises an error (or is silently ignored) on a running session:
# spark.executor.memory is a static config, fixed when executors launch.
spark.conf.set("spark.executor.memory", "8g")
```

Why?

  • Executors already started

✔ Correct place:

  • spark-submit
  • Glue job parameters
  • EMR step configs
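On Glue, launch-time configs go in as job parameters. The sketch below builds the `DefaultArguments` you could pass to boto3's `create_job`; the job name, role ARN, and script path are placeholders, and `--conf` support varies by Glue version, so treat this as a hedged sketch, not a definitive recipe:

```python
# Sketch: passing Spark configs at launch time to a Glue job.
# Glue applies "--conf" before executors start, unlike spark.conf.set() at runtime.
# Multiple configs are chained inside one "--conf" value (a documented Glue pattern).
default_args = {
    "--conf": "spark.executor.memory=8g --conf spark.sql.shuffle.partitions=80",
}

# Hypothetical usage (names/ARNs are placeholders):
# import boto3
# glue = boto3.client("glue")
# glue.create_job(
#     Name="day6-demo-job",
#     Role="arn:aws:iam::123456789012:role/GlueRole",
#     Command={"Name": "glueetl", "ScriptLocation": "s3://bucket/script.py"},
#     DefaultArguments=default_args,
# )
```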

9️⃣ AWS Cost Implication (REAL LIFE)

Bad configs:

  • Too many shuffles
  • Too much memory
  • Idle executors

Good configs:

  • Right-size executors
  • Balanced partitions
  • Faster job → lower bill

📌 Spark tuning = AWS cost tuning


🔥 Interview-Level Q&A (YOU CAN ANSWER NOW)

Q: Where do Spark configs live?
✔ Multiple layers with clear precedence

Q: Why doesn't changing executor memory at runtime work?
✔ Executors are already allocated

Q: Why does Glue feel "less tunable" than EMR?
✔ DPUs abstract JVM internals


🧪 DAY 6 THINKING EXERCISE

Scenario:

Spark job is slow
Executors underutilized
CPU usage ~30%

Questions:

  1. What config category do you check first?
  2. Which 2 configs are your first suspects?
  3. Why might adding more nodes NOT help?

(We’ll formalize this on Day 13 tuning.)


🧠 DAY 6 MEMORY ANCHORS

SparkSession = gateway
Configs are hierarchical
Partitions = parallelism
Defaults hurt on AWS
Runtime changes are limited

🎤 Resume-Grade Line (You Earned This)

Experienced in configuring and tuning the Spark runtime via SparkSession and the configuration hierarchy to optimize performance and cost on AWS Glue and EMR


⏭️ DAY 7 — WEEK 1 CONSOLIDATION & EXECUTION THINKING

Tomorrow you’ll:

  • Rebuild Spark execution from memory
  • Debug a slow job mentally
  • Connect Days 1–6 into one system
  • Lock foundations permanently

Reply with:

DAY 7

and we’ll consolidate everything before moving into real data engineering work 🚀