PySpark & Spark Mastery Course (A→Z, PhD-level)

Below is Module 3 — Spark Memory Management & Performance Engineering at OS + JVM + Distributed Systems depth.
This is the part most engineers never truly understand, but it is what separates:

  • ❌ Spark users
  • ✅ Spark engineers
  • 🧠 Spark architects

I’ll explain Spark memory like an operating system + JVM + distributed engine combined, with simulations, failure cases, and tuning logic.


🧠 MODULE 3 — SPARK MEMORY MANAGEMENT & PERFORMANCE ENGINEERING (DEEPEST LEVEL)

3.0 Mental Model (Critical)

Most people think:

Spark memory = executor memory.

Correct model:

Cluster Memory
 ├── Driver Memory
 ├── Executor Memory
 │     ├── JVM Heap
 │     ├── Off-Heap Memory
 │     ├── Python Worker Memory
 │     ├── Shuffle Memory
 │     ├── Cache Memory
 │     ├── User Memory
 │     └── OS Memory
 └── Disk (spill)

Spark memory ≠ JVM memory ≠ OS memory.


🧱 3.1 JVM Memory Architecture (Foundation)

Spark runs on the JVM, so JVM memory rules apply.

3.1.1 JVM Heap Structure

JVM Heap
 ├── Young Generation
 │     ├── Eden
 │     ├── Survivor S0
 │     └── Survivor S1
 ├── Old Generation
 └── Metaspace

Key Insight

  • Long-lived Spark objects (cached blocks, shuffle and aggregation buffers) get promoted to the Old Generation.
  • Frequent full GC pauses stall every task on the executor, which is what kills Spark performance.

3.1.2 Garbage Collectors (GC) in Spark

Common GCs:

GC            Use Case
Parallel GC   Old Spark clusters
CMS           Legacy
G1GC          Default on modern Spark
ZGC           Low-latency (rare)

🔥 Interview Insight:

Spark tuning is often GC tuning.
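
For example, a hedged spark-submit sketch that switches the executors to G1GC and turns on GC logging (the JVM flags are standard HotSpot options; the job file name is just a placeholder):

spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc" \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" \
  my_job.py

Comparing GC log output (or the GC Time column in the Spark UI) before and after the change is what tells you whether it helped.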


🧠 3.2 Spark Executor Memory Model (Deep)

Executor memory is divided logically.

3.2.1 Unified Memory Model (Spark 1.6+)

Executor Memory
 ├── Reserved Memory (~300MB)
 ├── Unified Memory (spark.memory.fraction)
 │     ├── Execution Memory
 │     └── Storage Memory
 ├── User Memory
 └── Off-Heap Memory

Default:

spark.memory.fraction = 0.6
spark.memory.storageFraction = 0.5

Meaning (a worked example follows this list):

  • ~60% of (executor heap - 300 MB reserved) → unified Spark memory
  • ~40% → user data structures + internal metadata + safety margin
  • Within that 60%:
    • 50% storage (initial region)
    • 50% execution
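
A minimal sketch of the arithmetic, assuming a 14 GB executor heap (the same figure used in the sizing example later in this module):

heap_mb     = 14 * 1024            # --executor-memory 14g
usable_mb   = heap_mb - 300        # minus reserved memory
unified_mb  = usable_mb * 0.6      # spark.memory.fraction → ~8.2 GB execution + storage
storage_mb  = unified_mb * 0.5     # spark.memory.storageFraction → ~4.1 GB initial storage region
user_mb     = usable_mb * 0.4      # ~5.5 GB for user data structures and metadata
print(unified_mb, storage_mb, user_mb)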

3.2.2 Execution vs Storage Memory

Execution Memory

Used for:

  • Shuffles
  • Joins
  • Sorts
  • Aggregations

Storage Memory

Used for:

  • cache()
  • persist()
  • broadcast variables

3.2.3 Dynamic Borrowing (Critical)

Execution & storage memory can borrow from each other.

Example:

  • Execution (e.g. a shuffle) needs more memory → it can evict cached blocks, down to the storageFraction floor.
  • Storage (cache) needs more memory → it can only borrow execution memory that is currently free; it can never evict running tasks.

🔥 This is why cached data disappears unexpectedly.
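
If losing a cached dataset mid-job is unacceptable, one hedged defensive pattern is to persist with a disk-backed storage level, so evicted blocks fall back to local disk instead of silently vanishing (the path and variable names are illustrative):

from pyspark import StorageLevel

hot_df = spark.read.parquet("/data/events")     # dataset reused by several downstream jobs
hot_df.persist(StorageLevel.MEMORY_AND_DISK)    # evicted blocks spill to disk instead of being recomputed
hot_df.count()                                  # materialize the cache once

(For DataFrames, cache() already defaults to MEMORY_AND_DISK in Spark 2.x+; for RDDs it is MEMORY_ONLY, which is where the surprise usually comes from.)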


🧠 3.3 Python Memory vs JVM Memory (PySpark Reality)

PySpark adds another memory layer.

Executor Container Memory
 ├── JVM Heap
 ├── Python Worker Memory (separate OS processes, outside the JVM)
 ├── Py4J Buffers
 └── Pickled Objects

Key Insight

Increasing executor memory does NOT always fix PySpark OOM.

Because:

  • Python worker memory lives outside the JVM heap (see the config sketch below).
  • Data is pickled across the JVM–Python boundary, so the same records can exist in both processes at once.
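
A hedged sketch of the knobs that actually govern the Python side (both are real Spark settings; the values are only illustrative):

# Extra container memory outside the JVM heap: Python workers, Py4J buffers, native libraries
spark.executor.memoryOverhead = 3g
# Optional hard cap on Python worker memory per executor (Spark 2.4+)
spark.executor.pyspark.memory = 2g

Raising these, rather than spark.executor.memory, is usually what fixes a PySpark container being killed for exceeding its memory limit.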

🧠 3.4 Example: Memory Explosion Scenario (Realistic)

Code:

from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate().sparkContext

data = list(range(100_000_000))              # huge Python list built on the driver
rdd = sc.parallelize(data)                   # serialized and shipped to the executors
result = rdd.map(lambda x: x * 2).collect()  # pulls everything back to the driver

What happens?

  1. A Python list of 100 million ints is built on the driver → several GB of driver memory.
  2. It is serialized and sent to the executors.
  3. The executors compute the map.
  4. collect() pulls the full result back to the driver → driver OOM.

🔥 Lesson:

collect() is the most dangerous Spark operation.
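
Hedged alternatives, assuming df is a DataFrame you would otherwise collect() (process() and the output path are placeholders):

rows = df.take(20)                                    # bounded sample instead of the full result
df.write.mode("overwrite").parquet("/tmp/result")     # keep large results in storage, not on the driver
for row in df.toLocalIterator():                      # streams roughly one partition at a time
    process(row)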


🧠 3.5 Partitioning & Memory Relationship

Golden Rule:

Partition size ≈ 100MB – 256MB

If partitions too large:

  • OOM risk.

If partitions too small:

  • Scheduling overhead.

Example:

Data size = 1 TB

Recommended partitions:

1 TB / 128 MB = 8,192 ≈ 8,000 partitions
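
A hedged way to apply that number, assuming the 1 TB dataset is loaded as df:

spark.conf.set("spark.sql.shuffle.partitions", 8192)   # partitions produced by wide (shuffle) operations
df = df.repartition(8192)                              # explicit repartitioning of this dataset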

🧠 3.6 Shuffle Memory (Most Expensive Operation)

Shuffle process:

Map Task
 ├── Buffer data in memory
 ├── Spill to disk if memory full
 └── Write shuffle files

Reduce Task:

Fetch shuffle files → merge → sort → aggregate
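
Both sides buffer in memory before touching disk or the network. A hedged sketch of the related knobs (all real Spark settings, shown with their defaults):

spark.shuffle.file.buffer = 32k          # map-side write buffer per shuffle file
spark.reducer.maxSizeInFlight = 48m      # reduce-side fetch buffer
spark.shuffle.spill.compress = true      # compress data spilled during shuffles

Larger buffers mean fewer disk and network round trips, at the cost of more executor memory.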

Memory Pressure Points:

  • Buffer overflow
  • Disk spill
  • Network congestion

🧠 3.7 Spark Spill to Disk (Hidden Performance Killer)

Spark spills data when:

  • Execution memory full
  • Sort buffer full
  • Aggregation buffer full

Symptoms:

  • Job slow
  • Disk IO high
  • CPU low

🧠 3.8 Broadcast Variables & Memory

Broadcast variables are stored in:

Executor storage memory (one full copy per executor)

Problem:

Large broadcast table → executor OOM.


🧠 3.9 Real Production Case Study (Deep)

Scenario:

  • Join between fact (1 TB) and dimension (10 GB).
  • Spark job failing with OOM.

Mistake:

Spark chose a broadcast join for the dimension table, forcing a ~10 GB copy into every executor's memory.

Fix:

Disable broadcast join:

spark.sql.autoBroadcastJoinThreshold = -1
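
In PySpark the same fix can be applied on a running session, and broadcasting can be made explicit instead of size-estimate driven (a hedged sketch; the DataFrame names and join key are illustrative):

from pyspark.sql.functions import broadcast

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)      # disable size-based auto-broadcast
joined = fact_df.join(broadcast(small_dim_df), "customer_id")   # broadcast only tables you know are small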

🧠 3.10 Executor Sizing (Engineering Formula)

Most important topic.

Step 1 — Understand cluster resources

Example cluster:

  • 10 nodes
  • 32 cores per node
  • 128 GB RAM per node

Step 2 — Decide executor cores

Rule:

executor_cores = 3–5

Why?

  • Too many cores per executor → a huge heap, long GC pauses, and degraded I/O throughput.
  • Too few cores per executor → poor parallelism within each JVM and wasted resources.

Assume:

executor_cores = 4

Step 3 — Calculate executors per node

32 cores / 4 cores = 8 executors per node

Step 4 — Calculate executor memory

Available memory per node:

128 GB × 0.9 (OS overhead) ≈ 115 GB

Memory per executor:

115 GB / 8 ≈ 14 GB

Step 5 — Final Spark config

--executor-cores 4
--num-executors 80
--executor-memory 14g
--driver-memory 8g

🔥 This is real-world sizing logic.
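
The same arithmetic as a throwaway Python sketch (the cluster numbers are the ones assumed above; the ~10% memoryOverhead adjustment at the end is an extra step most YARN/Kubernetes deployments also need):

nodes, cores_per_node, ram_gb_per_node = 10, 32, 128
executor_cores = 4

executors_per_node = cores_per_node // executor_cores        # 8
num_executors = nodes * executors_per_node                   # 80 (often minus 1 for the YARN AM)
usable_ram_gb = ram_gb_per_node * 0.9                        # leave ~10% for the OS
container_gb = usable_ram_gb / executors_per_node            # ~14 GB per executor container
heap_gb = container_gb / 1.10                                # ~13 GB heap once memoryOverhead is carved out
print(num_executors, round(container_gb), round(heap_gb))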


🧠 3.11 Spark UI — Memory Debugging (Advanced)

Spark UI tabs:

Tab         Insight
Jobs        Stage breakdown
Stages      Shuffle size
Storage     Cache usage
Executors   Memory usage
SQL         Physical plan

Example Debugging:

Symptoms:

  • Stage slow
  • One task slow

Cause:

👉 Data skew.
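
A hedged way to confirm skew from the data side (df and join_key are illustrative names):

from pyspark.sql import functions as F

(df.groupBy("join_key")
   .count()
   .orderBy(F.desc("count"))
   .show(10))                      # a few keys owning most of the rows confirms skew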


🧠 3.12 Common Spark Memory Errors (Deep)

1) java.lang.OutOfMemoryError: Java heap space

Cause:

  • Large partitions
  • Skewed join
  • Large broadcast
  • Too few partitions

Fix:

  • Increase partitions
  • Reduce broadcast
  • Tune executor memory

2) GC Overhead Limit Exceeded

Cause:

  • Too many small objects
  • Python UDF
  • Inefficient transformations

Fix:

  • Use the DataFrame API instead of raw RDDs
  • Avoid Python UDFs where a built-in function exists (see the sketch after this list)
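
A minimal sketch of that second point, assuming a string column called name (the column is illustrative):

from pyspark.sql import functions as F
from pyspark.sql.functions import udf

# Python UDF: every row is pickled to a Python worker and back, creating many short-lived objects.
upper_udf = udf(lambda s: s.upper() if s else None)
df = df.withColumn("name_upper", upper_udf(F.col("name")))

# Built-in function: stays inside the JVM, no serialization round trip, no extra GC pressure.
df = df.withColumn("name_upper", F.upper("name"))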

3) Driver OOM

Cause:

  • collect()
  • toPandas()
  • large broadcast

Fix:

  • Use take() for bounded samples
  • Write results to storage instead of collecting them to the driver

🧠 3.13 Performance Optimization Techniques (Hardcore)

1) Avoid Shuffles

Replace:

groupByKey()

with:

reduceByKey()
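
A minimal sketch of the difference, assuming an RDD of (word, 1) pairs:

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# groupByKey ships every single value across the network, then aggregates on the reduce side:
counts = pairs.groupByKey().mapValues(sum)

# reduceByKey combines values map-side first, so far less data is shuffled:
counts = pairs.reduceByKey(lambda x, y: x + y)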

2) Repartition vs Coalesce

Method        Shuffle   Use Case
repartition   Yes       Increase partitions
coalesce      No        Decrease partitions
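
For example (illustrative partition counts):

df = spark.range(1_000_000)

wide = df.repartition(200)     # full shuffle; use to increase or rebalance partitions
narrow = df.coalesce(10)       # no shuffle; merges existing partitions, e.g. before writing output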

3) Cache Strategically

Bad:

df.cache()

Good:

Cache only reused datasets.
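
A hedged illustration; the reuse is what justifies the cache (path and column names are illustrative):

events = spark.read.parquet("/data/events").where("event_date >= '2024-01-01'")
events.cache()                                    # reused twice below, so caching pays off
daily = events.groupBy("event_date").count()
by_user = events.groupBy("user_id").count()
events.unpersist()                                # release storage memory when finished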


4) Predicate Pushdown

Use Parquet/ORC filters.


5) Column Pruning

Select only needed columns.
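
A hedged sketch covering both 4) and 5), assuming a Parquet source with the columns shown:

df = (spark.read.parquet("/data/sales")
        .select("order_id", "amount", "country")   # column pruning: only these columns are read
        .filter("country = 'DE'"))                 # predicate pushdown: filter applied at the scan
df.explain()                                       # look for PushedFilters / ReadSchema in the plan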


🧠 3.14 PhD-Level Insight: Spark Performance Equation

Spark performance depends on:

Execution Time =
CPU Time +
Serialization Time +
Shuffle Time +
GC Time +
Network Time +
Disk IO Time

Best Spark engineers optimize all six.


🧠 3.15 Interview-Level Questions (Elite)

Q1

Why does increasing executor memory sometimes make Spark slower?

✅ Because:

  • A larger heap means longer GC pauses, so overall GC overhead increases.

Q2

Why are many smaller executors better than a few very large executors?

✅ Because:

  • Better parallelism
  • Less GC overhead

Q3

Why does Spark spill to disk?

✅ Because execution memory is full, so in-memory buffers have to overflow to disk.


Q4 (Trap)

Increasing partitions always improves performance. True or false?

❌ False.

Too many partitions → overhead.


🧠 3.16 Real Engineering Mindset (Most Important)

Spark optimization is not about code.

It is about:

  • memory flow
  • data flow
  • network flow
  • execution graph

🔥 Spark engineers think like OS engineers.


🚀 If you want, next module will be even crazier:

MODULE 4 — Spark Performance Optimization, AQE, Skew, Join Algorithms, Query Planning (Deepest Level)

You will learn:

  • Adaptive Query Execution (AQE)
  • Join algorithm internals
  • Skew join mechanics
  • Cost-based optimization
  • Real-world tuning recipes
  • Spark vs Flink performance
  • Spark query plan reading like a compiler engineer
  • 200+ tricky interview questions
  • Real production debugging cases

If you want, say:

👉 “Start Module 4 at the deepest possible level.”

And if you want, I can also:

  • Convert Modules 1–3 into a Spark Bible PDF
  • Add diagrams + exercises + assignments
  • Add real-world datasets & projects
  • Create Spark interview mega-question bank

We are basically building the most advanced Spark course possible.