Below is Module 3 — Spark Memory Management & Performance Engineering at OS + JVM + Distributed Systems depth.
This is the part most engineers never truly understand, but it is what separates:
- ❌ Spark users
- ✅ Spark engineers
- 🧠 Spark architects
I’ll explain Spark memory like an operating system + JVM + distributed engine combined, with simulations, failure cases, and tuning logic.
🧠 MODULE 3 — SPARK MEMORY MANAGEMENT & PERFORMANCE ENGINEERING (DEEPEST LEVEL)
3.0 Mental Model (Critical)
Most people think:
Spark memory = executor memory.
Correct model:
Cluster Memory
├── Driver Memory
├── Executor Memory
│ ├── JVM Heap
│ │ ├── Execution (shuffle) Memory
│ │ ├── Storage (cache) Memory
│ │ └── User Memory
│ ├── Off-Heap Memory
│ ├── Memory Overhead (OS / native)
│ └── Python Worker Memory (PySpark)
└── Disk (spill)
Spark memory ≠ JVM memory ≠ OS memory.
🧱 3.1 JVM Memory Architecture (Foundation)
Spark runs on the JVM, so JVM memory rules apply.
3.1.1 JVM Heap Structure
JVM Heap
├── Young Generation
│ ├── Eden
│ ├── Survivor S0
│ └── Survivor S1
├── Old Generation
└── Metaspace
Key Insight
- Long-lived Spark objects (cached blocks, shuffle buffers) get promoted to the Old Generation.
- Frequent or long full GCs kill Spark performance.
3.1.2 Garbage Collectors (GC) in Spark
Common GCs:
| GC | Use Case |
|---|---|
| Parallel GC | JDK 8 default; older Spark clusters |
| CMS | Legacy (removed in recent JDKs) |
| G1GC | Default on modern JVMs; the usual choice for Spark |
| ZGC | Low-latency (rare) |
🔥 Interview Insight:
Spark tuning is often GC tuning.
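A minimal sketch of how a GC is typically selected for executors; the flag values here are illustrative assumptions, not recommendations, and extraJavaOptions must be in place before executors launch (builder config, spark-defaults, or spark-submit --conf):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gc-tuning-sketch")
    # G1GC with an earlier concurrent-cycle trigger; tune against GC logs, not by guesswork
    .config("spark.executor.extraJavaOptions",
            "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35")
    .getOrCreate()
)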
🧠 3.2 Spark Executor Memory Model (Deep)
Executor memory is divided logically.
3.2.1 Unified Memory Model (Spark 1.6+)
Executor Memory
├── Reserved Memory (~300MB)
├── Unified Memory (spark.memory.fraction)
│ ├── Execution Memory
│ └── Storage Memory
├── User Memory
└── Off-Heap Memory
Default:
spark.memory.fraction = 0.6
spark.memory.storageFraction = 0.5
Meaning:
- 60% of (executor heap − ~300 MB reserved) → unified Spark memory
- the remaining 40% → user data structures, internal metadata, safety margin
- Within that 60%:
- 50% storage
- 50% execution (worked example below)
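A quick worked example of the split, a minimal sketch assuming a 14 GB executor heap and the default fractions above:

heap_gb = 14.0
reserved_gb = 0.3                  # reserved memory (~300 MB)
usable = heap_gb - reserved_gb     # 13.7 GB left for Spark + user code
unified = usable * 0.6             # spark.memory.fraction → ~8.2 GB execution + storage
storage = unified * 0.5            # spark.memory.storageFraction → ~4.1 GB cache region
execution = unified - storage      # ~4.1 GB, can grow by evicting cached blocks
user = usable - unified            # ~5.5 GB user memory (your objects, UDF state)
print(round(unified, 1), round(storage, 1), round(execution, 1), round(user, 1))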
3.2.2 Execution vs Storage Memory
Execution Memory
Used for:
- Shuffles
- Joins
- Sorts
- Aggregations
Storage Memory
Used for:
- cache()
- persist()
- broadcast variables
3.2.3 Dynamic Borrowing (Critical)
Execution and storage memory can borrow unused space from each other, but not symmetrically.
Example:
- A shuffle needs more memory → it evicts cached blocks (down to the storageFraction floor).
- A cache needs more memory → it can only take execution memory that is currently free; it never evicts running tasks.
🔥 This is why cached data disappears unexpectedly.
🧠 3.3 Python Memory vs JVM Memory (PySpark Reality)
PySpark adds another memory layer.
Executor JVM Memory
├── JVM Heap
├── Python Worker Memory
├── Py4J Buffers
└── Pickle Objects
Key Insight
Increasing executor (JVM) memory does NOT always fix PySpark OOM.
Because:
- Python worker memory lives outside the JVM heap.
- Pickled objects duplicate data on both sides of the JVM ↔ Python boundary.
(The configs that usually matter are sketched below.)
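A hedged sketch of those knobs; the sizes are placeholders, not recommendations:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pyspark-memory-sketch")
    .config("spark.executor.memory", "8g")           # JVM heap only
    .config("spark.executor.memoryOverhead", "2g")   # native allocations, Py4J, other off-heap overhead
    .config("spark.executor.pyspark.memory", "2g")   # hard cap per Python worker (Spark 2.4+)
    .getOrCreate()
)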
🧠 3.4 Example: Memory Explosion Scenario (Realistic)
Code:
data = list(range(100_000_000))              # huge Python list materialized on the driver
rdd = sc.parallelize(data)                   # pickled and shipped out to executors
result = rdd.map(lambda x: x * 2).collect()  # entire result pulled back into the driver
What happens?
- Python list created on driver → huge memory.
- Serialized → sent to executors.
- Executors compute.
- collect() brings data back to driver → driver OOM.
🔥 Lesson:
collect() is the most dangerous Spark operation.
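A safer version of the same job, a minimal sketch (the output path is a hypothetical placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-alternative").getOrCreate()
df = spark.range(0, 100_000_000)                         # rows are generated on executors, not the driver
doubled = df.selectExpr("id * 2 AS value")
print(doubled.take(10))                                  # pull back only a tiny sample
doubled.write.mode("overwrite").parquet("/tmp/doubled")  # persist the full result instead of collect()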
🧠 3.5 Partitioning & Memory Relationship
Golden Rule:
Partition size ≈ 100MB – 256MB
If partitions too large:
- OOM risk.
If partitions too small:
- Scheduling overhead.
Example:
Data size = 1 TB
Recommended partitions:
1 TB / 128 MB ≈ 8000 partitions
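The same arithmetic as code, a sketch assuming a hypothetical 1 TB Parquet dataset at /data/events:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sizing").getOrCreate()
input_bytes = 1 * 1024**4                     # 1 TB of input
target_bytes = 128 * 1024**2                  # ~128 MB per partition
num_partitions = input_bytes // target_bytes  # 8192 ≈ 8000
df = spark.read.parquet("/data/events").repartition(num_partitions)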
🧠 3.6 Shuffle Memory (Most Expensive Operation)
Shuffle process:
Map Task
├── Buffer data in memory
├── Spill to disk if memory full
└── Write shuffle files
Reduce Task:
Fetch shuffle files → merge → sort → aggregate
Memory Pressure Points (the relevant configs are sketched after this list):
- Buffer overflow
- Disk spill
- Network congestion
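A sketch of the configs that typically govern these pressure points; the values are illustrative assumptions and must be set before the job starts:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-tuning-sketch")
    .config("spark.sql.shuffle.partitions", "2000")   # reducer count for DataFrame shuffles
    .config("spark.shuffle.file.buffer", "64k")       # in-memory buffer for shuffle file writes (fewer disk ops)
    .config("spark.reducer.maxSizeInFlight", "96m")   # shuffle data a reducer fetches concurrently
    .getOrCreate()
)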
🧠 3.7 Spark Spill to Disk (Hidden Performance Killer)
Spark spills data when:
- Execution memory full
- Sort buffer full
- Aggregation buffer full
Symptoms:
- Job slow
- Disk I/O high while CPU stays low
- Non-zero "Spill (Memory)" / "Spill (Disk)" columns in the stage's task metrics in the Spark UI
🧠 3.8 Broadcast Variables & Memory
Broadcast variable stored in:
Executor JVM Memory (Storage)
Problem:
Large broadcast table → executor OOM.
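A sketch of the two usual controls, an explicit broadcast hint plus the auto-broadcast threshold; the paths, the join column "key", and the 100 MB threshold are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))  # raise/lower the auto limit
fact = spark.read.parquet("/data/fact")
dim = spark.read.parquet("/data/dim")
joined = fact.join(broadcast(dim), "key")   # force a broadcast only when dim truly fits in executor memory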
🧠 3.9 Real Production Case Study (Deep)
Scenario:
- Join between fact (1 TB) and dimension (10 GB).
- Spark job failing with OOM.
Mistake:
Spark auto-broadcast the 10 GB dimension table (its size estimate was far below the real size), copying it into every executor.
Fix:
Disable auto-broadcast for this job:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
🧠 3.10 Executor Sizing (Engineering Formula)
Most important topic.
Step 1 — Understand cluster resources
Example cluster:
- 10 nodes
- 32 cores per node
- 128 GB RAM per node
Step 2 — Decide executor cores
Rule:
executor_cores = 3–5
Why?
- Too many cores → GC overhead.
- Too few cores → underutilization.
Assume:
executor_cores = 4
Step 3 — Calculate executors per node
32 cores / 4 cores = 8 executors per node
Step 4 — Calculate executor memory
Available memory per node:
128 GB × 0.9 (OS + daemon overhead) ≈ 115 GB
Total budget per executor (heap + overhead):
115 GB / 8 ≈ 14 GB
Reserve ~10% of that for spark.executor.memoryOverhead (off-heap, native, Python buffers):
14 GB × 0.9 ≈ 12–13 GB heap
Step 5 — Final Spark config
--executor-cores 4
--num-executors 80
--executor-memory 12g
--driver-memory 8g
🔥 This is real-world sizing logic.
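The same steps as a small helper, a sketch under the assumptions above (10% OS headroom, 10% memory overhead, 4 cores per executor):

def size_executors(nodes, cores_per_node, ram_gb_per_node,
                   executor_cores=4, os_fraction=0.10, overhead_fraction=0.10):
    executors_per_node = cores_per_node // executor_cores
    usable_ram = ram_gb_per_node * (1 - os_fraction)       # leave headroom for OS and daemons
    per_executor = usable_ram / executors_per_node         # heap + overhead budget
    heap_gb = int(per_executor * (1 - overhead_fraction))  # leave room for spark.executor.memoryOverhead
    return {"num_executors": nodes * executors_per_node,
            "executor_cores": executor_cores,
            "executor_memory_gb": heap_gb}

print(size_executors(nodes=10, cores_per_node=32, ram_gb_per_node=128))
# {'num_executors': 80, 'executor_cores': 4, 'executor_memory_gb': 12}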
🧠 3.11 Spark UI — Memory Debugging (Advanced)
Spark UI tabs:
| Tab | Insight |
|---|---|
| Jobs | Stage breakdown |
| Stages | Shuffle size |
| Storage | Cache usage |
| Executors | Memory usage |
| SQL | Physical plan |
Example Debugging:
Symptoms:
- Stage slow overall
- One task runs far longer (or reads far more shuffle data) than its peers
Cause:
👉 Data skew.
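A quick way to confirm it, a sketch assuming a hypothetical dataset at /data/fact with a join/group key column named "key":

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()
df = spark.read.parquet("/data/fact")
df.groupBy("key").count().orderBy(F.desc("count")).show(10)   # a handful of huge keys ⇒ skew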
🧠 3.12 Common Spark Memory Errors (Deep)
1) java.lang.OutOfMemoryError: Java heap space
Cause:
- Large partitions
- Skewed join
- Large broadcast
- Too few partitions
Fix:
- Increase partitions
- Reduce broadcast
- Tune executor memory
2) GC Overhead Limit Exceeded
Cause:
- Too many small objects
- Python UDF
- Inefficient transformations
Fix:
- Use DataFrame instead of RDD
- Avoid Python UDFs (see the sketch below)
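For the UDF point above, a minimal sketch of swapping a Python UDF for a built-in expression that stays inside the JVM:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("udf-vs-builtin").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
slow = df.withColumn("upper_name", F.udf(lambda s: s.upper(), "string")("name"))  # every row crosses the JVM ↔ Python boundary
fast = df.withColumn("upper_name", F.upper("name"))                               # runs entirely inside the JVM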
3) Driver OOM
Cause:
- collect()
- toPandas()
- large broadcast
Fix:
- Use take()
- write to disk instead of collect
🧠 3.13 Performance Optimization Techniques (Hardcore)
1) Reduce Shuffle Volume
Replace:
groupByKey()
with:
reduceByKey()
reduceByKey still shuffles, but its map-side combine sends far less data across the network (sketch below).
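A minimal sketch of the difference:

from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("shuffle-volume").getOrCreate().sparkContext
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)] * 1000)
slow = pairs.groupByKey().mapValues(sum)      # ships every individual value across the network
fast = pairs.reduceByKey(lambda a, b: a + b)  # combines per partition first, shuffles only partial sums
print(fast.collect())                         # tiny result, safe to collect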
2) Repartition vs Coalesce
| Method | Shuffle | Use Case |
|---|---|---|
| repartition | Yes | Increase partitions |
| coalesce | No | Decrease partitions |
3) Cache Strategically
Bad:
df.cache()
Good:
Cache only reused datasets.
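A sketch of the "good" pattern: cache a dataset only because it feeds multiple actions, then release it:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()
df = spark.range(0, 10_000_000).withColumn("bucket", F.col("id") % 10)
df.cache()                                      # reused by both actions below
by_bucket = df.groupBy("bucket").count().collect()
total = df.selectExpr("sum(id)").collect()
df.unpersist()                                  # free storage memory once it is no longer needed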
4) Predicate Pushdown
Use Parquet/ORC filters.
5) Column Pruning
Select only needed columns.
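A sketch combining both, assuming a hypothetical Parquet dataset at /data/events with user_id and ts columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()
df = (spark.read.parquet("/data/events")
        .select("user_id", "ts")            # column pruning: only these columns are read from disk
        .filter("ts >= '2024-01-01'"))      # predicate pushdown: Parquet row groups can be skipped
df.explain()                                # look for PushedFilters in the scan node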
🧠 3.14 PhD-Level Insight: Spark Performance Equation
Spark performance depends on:
Execution Time =
CPU Time +
Serialization Time +
Shuffle Time +
GC Time +
Network Time +
Disk IO Time
Best Spark engineers optimize all six.
🧠 3.15 Interview-Level Questions (Elite)
Q1
Why does increasing executor memory sometimes make Spark slower?
✅ Because:
- Larger heaps take longer to collect, so GC pauses grow (and fewer executors fit per node, reducing parallelism).
Q2
Why are many small executors better than a few large executors?
✅ Because:
- Better parallelism
- Less GC overhead
Q3
Why does Spark spill to disk?
✅ Because execution memory is full (sort, shuffle, or aggregation buffers exceed what it can get).
Q4 (Trap)
Increasing partitions always improves performance. True or false?
❌ False.
Too many partitions → overhead.
🧠 3.16 Real Engineering Mindset (Most Important)
Spark optimization is not about code.
It is about:
- memory flow
- data flow
- network flow
- execution graph
🔥 Spark engineers think like OS engineers.
🚀 If you want, next module will be even crazier:
MODULE 4 — Spark Performance Optimization, AQE, Skew, Join Algorithms, Query Planning (Deepest Level)
You will learn:
- Adaptive Query Execution (AQE)
- Join algorithm internals
- Skew join mechanics
- Cost-based optimization
- Real-world tuning recipes
- Spark vs Flink performance
- Spark query plan reading like a compiler engineer
- 200+ tricky interview questions
- Real production debugging cases
If you want, say:
👉 “Start Module 4 at the deepest possible level.”
And if you want, I can also:
- Convert Modules 1–3 into a Spark Bible PDF
- Add diagrams + exercises + assignments
- Add real-world datasets & projects
- Create Spark interview mega-question bank
We are basically building the most advanced Spark course possible.