Below is Module 5 — Spark like a System Designer.
This is not “how to use Spark.”
This is how Spark is built, scheduled, serialized, transported, executed, recovered, and optimized — the way a distributed systems architect thinks.
If you master this module, you won’t just answer interview questions; you’ll design Spark systems.
🧠 MODULE 5 — SPARK INTERNALS (SYSTEM DESIGN + SOURCE-LEVEL THINKING)
We will cover Spark like a distributed OS + database + network engine.
What you will understand after this module
- How Spark schedules tasks internally
- How Spark RPC works
- How tasks are serialized and shipped
- How shuffle files are stored & fetched
- How Spark recovers from failures
- How Spark manages memory internally
- How Tungsten generates bytecode
- How Spark streaming maintains state
- How Spark avoids stragglers
- How Spark differs from Flink/Dask/Ray
- How to design Spark architecture yourself
🧱 5.1 Spark System Architecture (Real Engine View)
Spark is not one system. It is a stack of subsystems.
User Code (Python/Scala/SQL)
↓
Spark Compiler (Catalyst + Planner)
↓
Scheduler (DAG + Task Scheduler)
↓
Runtime (Executors + Memory + Shuffle)
↓
Network (RPC + Shuffle Fetch)
↓
Storage (Disk + HDFS + S3)
↓
OS + JVM + Hardware
Spark is a distributed OS for data.
5.1.1 Spark Core Subsystems
Spark Core
├── Scheduler
├── Memory Manager
├── Shuffle Manager
├── Block Manager
├── RPC Framework
├── Storage System
├── Execution Engine
└── Fault Tolerance Engine
Each subsystem is independent but coordinated.
🧠 5.2 Spark Scheduler Internals (Deep)
Spark has multiple schedulers.
5.2.1 Scheduler Stack
User Action
↓
DAGScheduler
↓
TaskScheduler
↓
Cluster Manager
↓
Executors
5.2.2 DAGScheduler (High-Level Planner)
Responsibilities:
- Build DAG from transformations
- Split DAG into stages
- Handle shuffle boundaries
- Retry failed stages
Example DAG
rdd.map(f).filter(p).groupByKey().collect()
DAG:
Stage 0: map → filter
Stage 1: groupByKey (shuffle)
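A runnable PySpark version of that pipeline (the lambdas are illustrative): map and filter are narrow transformations and get pipelined into Stage 0, while groupByKey forces a shuffle and starts Stage 1.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), numSlices=4)

# Stage 0: narrow transformations, pipelined in a single stage
pairs = rdd.map(lambda x: (x % 10, x)).filter(lambda kv: kv[1] > 5)

# Stage 1: groupByKey is wide -> shuffle boundary -> new stage
grouped = pairs.groupByKey().mapValues(list)

print(grouped.toDebugString())  # lineage; the ShuffledRDD marks the stage split
grouped.collect()
```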
5.2.3 TaskScheduler (Low-Level Executor Planner)
Responsibilities:
- Assign tasks to executors
- Handle locality
- Retry tasks
- Speculative execution
Locality Levels
| Level | Meaning |
|---|---|
| PROCESS_LOCAL | Same JVM |
| NODE_LOCAL | Same node |
| RACK_LOCAL | Same rack |
| ANY | Anywhere |
Spark prefers locality to reduce network cost.
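Locality is tunable: the TaskScheduler waits briefly for a slot at the preferred level before degrading to the next one. A minimal sketch (values shown are the defaults, not recommendations):
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("locality-demo")
    # How long to wait for a better-locality slot before degrading a level (default 3s)
    .config("spark.locality.wait", "3s")
    # Can also be tuned per level
    .config("spark.locality.wait.node", "3s")
    .config("spark.locality.wait.rack", "3s")
    .getOrCreate()
)
```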
5.2.4 Cluster Manager (Resource Allocator)
Spark supports:
- Standalone
- YARN
- Kubernetes
- Mesos (deprecated)
Cluster Manager decides:
- How many executors?
- Where to run them?
- How much memory?
Spark does NOT manage hardware — cluster manager does.
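A hedged sizing sketch: the same resource questions expressed as configuration (numbers are illustrative; spark.executor.instances applies when dynamic allocation is off):
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sizing-demo")
    .config("spark.executor.instances", "10")        # how many executors
    .config("spark.executor.cores", "4")             # cores per executor
    .config("spark.executor.memory", "8g")           # heap per executor
    .config("spark.executor.memoryOverhead", "1g")   # native/off-heap overhead (YARN/K8s)
    .getOrCreate()
)
```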
🧠 5.3 Spark RPC System (Distributed Communication)
Spark nodes talk using RPC (Remote Procedure Call).
5.3.1 Spark RPC Architecture
Spark uses:
- Netty as the network transport layer
- Akka (only in versions before Spark 1.6)
- a custom Spark RPC framework built on top of Netty (current)
Communication paths:
Driver ↔ Executors
Executor ↔ Executor
Driver ↔ Cluster Manager
Executor ↔ Shuffle Service
5.3.2 What is Sent Over RPC?
Spark sends:
- Tasks
- Serialized functions
- Broadcast variables
- Shuffle metadata
- Heartbeats
- Metrics
5.3.3 Heartbeat Mechanism
Executors send heartbeats to the driver.
If heartbeat stops:
- Executor marked dead
- Tasks rescheduled
Distributed systems principle:
Failure is expected, not exceptional.
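The relevant knobs, as a minimal sketch (values shown are the defaults): the heartbeat interval must stay well below the network timeout the driver uses to declare an executor lost.
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("heartbeat-demo")
    # How often each executor heartbeats to the driver (default 10s)
    .config("spark.executor.heartbeatInterval", "10s")
    # No heartbeat within this window -> executor marked lost, tasks rescheduled (default 120s)
    .config("spark.network.timeout", "120s")
    .getOrCreate()
)
```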
🧠 5.4 Task Binary & Serialization (Deepest Level)
When Spark sends a task, it sends a Task Binary.
5.4.1 Task Binary Contains
Task Binary
├── Serialized DAG node
├── Serialized closures (Python/Scala functions)
├── Broadcast variables
├── Partition metadata
└── Execution context
5.4.2 Serialization Layers
PySpark adds extra serialization layers on top of the JVM ones:
Python objects (closures) → pickle → bytes
Bytes → Py4J → driver JVM
JVM task binary → Kryo/Java → bytes → network → executor JVM
Executor JVM ↔ Python worker: rows serialized out, unpickled, processed, pickled back
Every crossing of the JVM ↔ Python boundary is why plain Python UDFs are slow.
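A small illustration of that boundary (column and function names are made up): a plain Python UDF ships rows one by one across the JVM ↔ Python worker boundary via pickle, while an Arrow-backed pandas UDF moves columnar batches and is usually much faster.
```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.range(1_000_000).withColumn("x", col("id") * 0.5)

# Plain Python UDF: every row is pickled JVM -> Python worker -> JVM
@udf(returnType=DoubleType())
def slow_double(x):
    return x * 2.0

# Arrow-backed pandas UDF: data crosses the boundary in columnar batches
@pandas_udf(DoubleType())
def fast_double(x: pd.Series) -> pd.Series:
    return x * 2.0

df.select(slow_double("x")).count()
df.select(fast_double("x")).count()
```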
5.4.3 Kryo vs Java Serialization
| Feature | Java Serialization | Kryo |
|---|---|---|
| Speed | Slow | Fast |
| Size | Large | Compact |
| Custom types | Hard | Easy |
Spark's default for RDD serialization is Java; Kryo is recommended.
(DataFrames/Datasets bypass both and use Tungsten's own binary encoders.)
Config:
spark.serializer=org.apache.spark.serializer.KryoSerializer
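The same setting applied programmatically, with optional class registration to keep Kryo output compact (the class name is illustrative):
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kryo-demo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Optionally pre-register classes so Kryo writes small IDs instead of full names
    .config("spark.kryo.classesToRegister", "com.example.MyCaseClass")  # illustrative class name
    .getOrCreate()
)
```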
🧠 5.5 Shuffle Internals (File-System Level)
Shuffle is the heart of distributed Spark.
5.5.1 Shuffle File Layout
Each executor writes:
/spark-shuffle/
├── shuffle_0_0_0.data
├── shuffle_0_0_0.index
├── shuffle_0_1_0.data
└── ...
Meaning:
shuffle_<shuffleId>_<mapId>_<reduceId>
5.5.2 Shuffle Phases
Map Phase
Executor:
- buffers records in memory
- spills to disk if needed
- writes shuffle files
Reduce Phase
Executor:
- fetches shuffle blocks via network
- merges
- sorts
- aggregates
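Two knobs that shape these phases directly, shown as a minimal sketch (values illustrative): the number of reduce-side partitions for DataFrame shuffles, and compression of the map-side shuffle files.
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-demo")
    .config("spark.sql.shuffle.partitions", "400")  # reduce-side partitions (default 200)
    .config("spark.shuffle.compress", "true")       # compress map-side shuffle files (default true)
    .getOrCreate()
)

df = spark.range(10_000_000)
# groupBy triggers a shuffle: map tasks write shuffle files,
# reduce tasks fetch, merge, and aggregate them
df.groupBy((df.id % 100).alias("bucket")).count().show()
```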
5.5.3 Shuffle Service (External)
Spark can use an external shuffle service.
Purpose:
- preserve shuffle files even if executor dies
Config:
spark.shuffle.service.enabled=true
🧠 5.6 Block Manager (Spark Storage System)
Spark stores data using BlockManager.
5.6.1 Block Types
Blocks
├── RDD blocks
├── Shuffle blocks
├── Broadcast blocks
└── Cached DataFrame blocks
5.6.2 Block Locations
Memory
Disk
Off-heap
Remote executors
Spark tracks block locations using BlockManagerMaster.
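Caching is the easiest way to see the BlockManager at work: persist() registers blocks, and the StorageLevel decides whether they live in memory, spill to disk, or go off-heap (visible in the Storage tab of the Spark UI).
```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("blockmanager-demo").getOrCreate()

df = spark.range(1_000_000)

# Blocks kept in executor memory, spilling to local disk when memory is tight
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()      # materializes the blocks; check the Spark UI "Storage" tab

df.unpersist()  # releases the blocks from the BlockManager
```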
🧠 5.7 Spark Memory Manager (Internal Design)
Spark has two memory managers:
- Static Memory Manager (old)
- Unified Memory Manager (modern)
5.7.1 Unified Memory Manager
Heap Memory
├── Execution Memory
├── Storage Memory
└── User Memory
Execution & Storage can borrow memory dynamically.
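The split is controlled by two fractions; a minimal sketch with the default values:
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-demo")
    # Share of (heap - 300 MB reserved) pooled for execution + storage (default 0.6)
    .config("spark.memory.fraction", "0.6")
    # Part of that pool protected for storage before execution may evict cached blocks (default 0.5)
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)
```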
5.7.2 Off-Heap Memory
Spark Tungsten uses off-heap memory.
Benefits:
- less GC
- faster binary processing
Config:
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=4g
🧠 5.8 Tungsten Engine (Compiler-Level Execution)
Spark Tungsten turns queries into JVM bytecode.
5.8.1 Tungsten Pipeline
Logical Plan
↓
Physical Plan
↓
Whole-Stage Code Generation
↓
JVM Bytecode
↓
CPU Execution
Spark literally generates Java code at runtime.
5.8.2 Whole-Stage Code Generation
Instead of executing operators one by one:
Spark merges operators into one function.
Example:
filter + projection + aggregation
→ single generated function
Benefits:
- fewer virtual function calls
- intermediate values stay in CPU registers and cache
- tight loops the JIT and CPU can optimize (unrolling, SIMD)
Distributed systems insight:
Spark behaves like a JIT compiler.
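You can inspect the generated code yourself: explain with the codegen mode prints the Java source produced by whole-stage code generation, with fused operators grouped under a single WholeStageCodegen node (the query below is just an illustration).
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("codegen-demo").getOrCreate()

df = spark.range(1_000_000).filter(col("id") % 2 == 0).selectExpr("id * 10 AS x")

# Prints the generated Java code; operators fused into one function
# appear inside the same WholeStageCodegen subtree (marked with '*')
df.explain(mode="codegen")
```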
🧠 5.9 Fault Tolerance Mechanisms
Spark assumes failures.
5.9.1 Types of Failures
| Failure | Spark Response |
|---|---|
| Task failure | Retry task |
| Executor failure | Reschedule tasks |
| Stage failure | Recompute stage |
| Driver failure | Application fails (unless driver restart/supervision is configured) |
| Node failure | Recompute partitions |
5.9.2 Lineage-Based Recovery
Spark does NOT replicate RDDs.
Instead:
RDD lineage → recompute lost partitions
5.9.3 Checkpointing
For a long lineage, truncate it by checkpointing:
rdd.checkpoint()
Spark saves the RDD to stable storage and cuts the lineage before that point.
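A runnable sketch (the checkpoint directory path is illustrative): a checkpoint directory must be set before checkpoint() is called, and the write happens when the RDD is next materialized.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
sc = spark.sparkContext

# Stable storage location for checkpoints (HDFS/S3 in production)
sc.setCheckpointDir("/tmp/spark-checkpoints")

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
rdd.checkpoint()   # marks the RDD; lineage is truncated once it is materialized
rdd.count()        # triggers computation and writes the checkpoint
```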
🧠 5.10 Speculative Execution (Straggler Mitigation)
Problem:
- Some tasks slow (stragglers)
Solution:
Spark launches duplicate tasks.
Config:
spark.speculation=true
Distributed systems principle:
Don’t wait for slow nodes.
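Related knobs, as a minimal sketch (values shown are the defaults): how much slower than the median a task must be, and how much of the stage must already be finished, before a speculative copy is launched.
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("speculation-demo")
    .config("spark.speculation", "true")
    # Launch a duplicate if a task runs this many times slower than the median (default 1.5)...
    .config("spark.speculation.multiplier", "1.5")
    # ...and only after this fraction of the stage's tasks have finished (default 0.75)
    .config("spark.speculation.quantile", "0.75")
    .getOrCreate()
)
```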
🧠 5.11 Spark Streaming Internals (System-Level)
Spark Structured Streaming is a micro-batch engine (an experimental continuous-processing mode also exists).
5.11.1 Execution Model
Input Stream → Micro-batches → DAG → Execution
5.11.2 State Store
Spark keeps streaming state in a per-partition state store:
- in-memory hash map, checkpointed to HDFS/S3 (default)
- RocksDB-backed store (Spark 3.2+)
Used for:
- window operations
- aggregations
- joins
5.11.3 Watermarking
A watermark tells Spark how long to wait for late events before finalizing a window and dropping its state.
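A minimal runnable sketch (the rate source and column names are illustrative): the watermark keeps state for events up to 10 minutes late, aggregated into 5-minute windows.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

# Illustrative streaming source with an event-time column
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .withColumnRenamed("timestamp", "event_time")
)

counts = (
    events
    .withWatermark("event_time", "10 minutes")         # wait up to 10 min for late events
    .groupBy(window(col("event_time"), "5 minutes"))   # tumbling 5-minute windows
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
```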
🧠 5.12 Spark vs Flink vs Ray vs Dask (Architectural View)
| System | Design Philosophy |
|---|---|
| Spark | Batch-first, micro-batch streaming |
| Flink | True streaming |
| Ray | General distributed computing |
| Dask | Python-native parallelism |
Spark strength:
- SQL engine
- ecosystem
- fault tolerance
Flink strength:
- low-latency streaming
Ray strength:
- ML & AI workloads
🧠 5.13 Designing Spark Architecture (System Designer Mindset)
Example: Design Spark for 10 TB daily pipeline
Questions to ask:
- Data distribution?
- Partition strategy?
- Join patterns?
- Memory sizing?
- Shuffle volume?
- Skew risk?
- Failure tolerance?
- Cost optimization?
Example Architecture
S3 → Spark → Delta Lake → BI
Key decisions (sketched in the config example after this list):
- broadcast joins for dimensions
- partition by date
- AQE enabled
- executor sizing tuned
- caching hot tables
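A hedged sketch of how those decisions translate into configuration and layout (paths, thresholds, and column names are assumptions, not recommendations):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder
    .appName("daily-pipeline")
    .config("spark.sql.adaptive.enabled", "true")              # AQE: coalesce partitions, handle skew
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .config("spark.sql.autoBroadcastJoinThreshold", str(128 * 1024 * 1024))  # illustrative threshold
    .getOrCreate()
)

facts = spark.read.parquet("s3://bucket/raw/events/")       # illustrative paths
dims = spark.read.parquet("s3://bucket/raw/dimensions/")

enriched = facts.join(broadcast(dims), "dim_id")            # explicit broadcast join for the dimension

(
    enriched.write
    .partitionBy("date")                                    # partition output by date
    .mode("overwrite")
    .format("parquet")                                      # or "delta" when writing to Delta Lake
    .save("s3://bucket/curated/events/")
)
```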
🧠 5.14 Real Production Debugging Framework
When a Spark job is slow, ask:
Layer 1 — Algorithm
- wrong join?
- unnecessary shuffle?
Layer 2 — Data
- skew?
- partition size?
Layer 3 — Memory
- spill?
- GC?
Layer 4 — Network
- shuffle heavy?
Layer 5 — Cluster
- underutilized cores?
This is how Spark architects debug.
🧠 5.15 Interview Questions (System Designer Level)
Q1
Why does Spark use lineage instead of replication?
✅ Lineage is metadata-only: Spark recomputes lost partitions on demand, avoiding the storage and network cost of replicating every partition up front.
Q2
Why does Spark generate bytecode?
✅ To reduce interpretation overhead and improve CPU efficiency.
Q3
Why is shuffle expensive?
✅ Because it involves memory + disk + network.
Q4
Why does Spark prefer locality?
✅ To minimize network transfer.
Q5 (Elite)
What is the difference between Spark and a distributed database?
✅ Spark is compute-first; DB is storage-first.
🧠 5.16 The Biggest Truth About Spark Mastery
Most people learn Spark like a library.
Real experts understand Spark like:
- distributed OS
- distributed database
- compiler
- network system
If you understand Module 5, you are not a Spark user.
🔥 You are a Spark system designer.
🚀 If you want, I will take you even further:
MODULE 6 — Spark + PySpark Hardcore Practice & Interview Domination
We will build:
- 200+ real Spark coding problems
- LeetCode-style Spark questions
- real production debugging scenarios
- Spark system design interview questions
- trick questions & traps
- end-to-end projects (Data Lake, ETL, ML pipelines)
- Spark architecture case studies (Netflix, Uber, Airbnb style)
This will turn you into:
“Someone who can design, debug, optimize, and explain Spark like a principal engineer.”
If you want, say:
👉 “Start Module 6 — I want hardcore Spark practice + interview mastery.”
And if you want, I can also convert Modules 1–5 into a structured Spark Bible (book-level) with diagrams, exercises, and projects.