PySpark & Spark Mastery Course (A→Z, PhD-level)

Below is Module 5 — Spark like a System Designer.
This is not “how to use Spark.”
This is how Spark is built, scheduled, serialized, transported, executed, recovered, and optimized — the way a distributed systems architect thinks.

If you master this module, you won’t just answer interviews — you’ll design Spark systems.


🧠 MODULE 5 — SPARK INTERNALS (SYSTEM DESIGN + SOURCE-LEVEL THINKING)

We will cover Spark like a distributed OS + database + network engine.

What you will understand after this module

  • How Spark schedules tasks internally
  • How Spark RPC works
  • How tasks are serialized and shipped
  • How shuffle files are stored & fetched
  • How Spark recovers from failures
  • How Spark manages memory internally
  • How Tungsten generates bytecode
  • How Spark streaming maintains state
  • How Spark avoids stragglers
  • How Spark differs from Flink/Dask/Ray
  • How to design Spark architecture yourself

🧱 5.1 Spark System Architecture (Real Engine View)

Spark is not one system. It is a stack of subsystems.

User Code (Python/Scala/SQL)
 ↓
Spark Compiler (Catalyst + Planner)
 ↓
Scheduler (DAG + Task Scheduler)
 ↓
Runtime (Executors + Memory + Shuffle)
 ↓
Network (RPC + Shuffle Fetch)
 ↓
Storage (Disk + HDFS + S3)
 ↓
OS + JVM + Hardware

Spark is a distributed OS for data.


5.1.1 Spark Core Subsystems

Spark Core
 ├── Scheduler
 ├── Memory Manager
 ├── Shuffle Manager
 ├── Block Manager
 ├── RPC Framework
 ├── Storage System
 ├── Execution Engine
 └── Fault Tolerance Engine

Each subsystem is independent but coordinated.


🧠 5.2 Spark Scheduler Internals (Deep)

Spark has multiple schedulers.

5.2.1 Scheduler Stack

User Action
 ↓
DAGScheduler
 ↓
TaskScheduler
 ↓
Cluster Manager
 ↓
Executors

5.2.2 DAGScheduler (High-Level Planner)

Responsibilities:

  • Build DAG from transformations
  • Split DAG into stages
  • Handle shuffle boundaries
  • Retry failed stages

Example DAG

rdd.map().filter().groupByKey().collect()

DAG:

Stage 0: map → filter
Stage 1: groupByKey (shuffle)
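
To see the shuffle boundary yourself, here is a minimal runnable PySpark sketch (data and functions are illustrative); the ShuffledRDD in the printed lineage marks where Stage 0 ends and Stage 1 begins:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)

grouped = (pairs
           .map(lambda kv: (kv[0], kv[1] * 2))   # narrow: stays in Stage 0
           .filter(lambda kv: kv[1] > 2)         # narrow: stays in Stage 0
           .groupByKey())                        # wide: shuffle boundary, Stage 1

print(grouped.toDebugString().decode("utf-8"))   # lineage shows the ShuffledRDD
print(grouped.mapValues(list).collect())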

5.2.3 TaskScheduler (Low-Level Executor Planner)

Responsibilities:

  • Assign tasks to executors
  • Handle locality
  • Retry tasks
  • Speculative execution

Locality Levels

  • PROCESS_LOCAL: same JVM
  • NODE_LOCAL: same node
  • RACK_LOCAL: same rack
  • ANY: anywhere

Spark prefers locality to reduce network cost.
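
The scheduler does not wait forever for a preferred slot. The settings below control how long it holds out before downgrading to the next locality level (values shown are, to my knowledge, the defaults; the per-level entries override the base one):

spark.locality.wait=3s
spark.locality.wait.process=3s
spark.locality.wait.node=3s
spark.locality.wait.rack=3s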


5.2.4 Cluster Manager (Resource Allocator)

Spark supports:

  • Standalone
  • YARN
  • Kubernetes
  • Mesos

Cluster Manager decides:

  • How many executors?
  • Where to run them?
  • How much memory?

Spark does NOT manage hardware; the cluster manager does.


🧠 5.3 Spark RPC System (Distributed Communication)

Spark nodes talk using RPC (Remote Procedure Call).

5.3.1 Spark RPC Architecture


Spark uses:

  • Netty (network layer)
  • Akka (older versions)
  • Spark RPC (custom framework)

Communication paths:

Driver ↔ Executors
Executor ↔ Executor
Driver ↔ Cluster Manager
Executor ↔ Shuffle Service

5.3.2 What is Sent Over RPC?

Spark sends:

  • Tasks
  • Serialized functions
  • Broadcast variables
  • Shuffle metadata
  • Heartbeats
  • Metrics

5.3.3 Heartbeat Mechanism

Executors send periodic heartbeats to the driver.

If heartbeat stops:

  • Executor marked dead
  • Tasks rescheduled

Distributed systems principle:

Failure is expected, not exceptional.
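
Two settings govern this exchange (values shown are the usual defaults; spark.executor.heartbeatInterval must stay well below spark.network.timeout or healthy executors will be flagged as dead):

spark.executor.heartbeatInterval=10s
spark.network.timeout=120s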


🧠 5.4 Task Binary & Serialization (Deepest Level)

When Spark sends a task, it sends a Task Binary.

5.4.1 Task Binary Contains

Task Binary
 ├── Serialized DAG node
 ├── Serialized closures (Python/Scala functions)
 ├── Broadcast variables
 ├── Partition metadata
 └── Execution context

5.4.2 Serialization Layers

PySpark has multiple serialization layers:

Python Objects → Pickle → Bytes
Bytes → Py4J → JVM
JVM Objects → Kryo/Java → Bytes
Bytes → Network → Executor JVM
Executor JVM → Python Worker → Unpickle

This is why Python UDFs are slow.
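
A small illustration of what avoiding that round-trip buys: a row-at-a-time Python UDF pickles every value across the JVM/Python boundary, while an Arrow-based pandas UDF moves whole column batches (requires pyarrow; the column name is made up):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import pandas_udf
import pandas as pd

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).toDF("x")

# Row-at-a-time UDF: each value is pickled to a Python worker and back.
slow_double = F.udf(lambda v: v * 2, "long")

# pandas UDF: data crosses the boundary in Arrow batches instead.
@pandas_udf("long")
def fast_double(v: pd.Series) -> pd.Series:
    return v * 2

df.select(slow_double("x").alias("slow"), fast_double("x").alias("fast")).show(3)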


5.4.3 Kryo vs Java Serialization

  • Speed: Java serialization is slow, Kryo is fast
  • Output size: Java is large, Kryo is compact
  • Custom types: awkward with Java, easy with Kryo

Spark default = Java
Recommended = Kryo

spark.serializer=org.apache.spark.serializer.KryoSerializer
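
In PySpark this is usually set when the session is built. A minimal sketch (the buffer size is just an example; note that Kryo affects JVM-side serialization such as RDD caching and shuffle, not Python pickling):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.kryoserializer.buffer.max", "256m")
         .getOrCreate())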

🧠 5.5 Shuffle Internals (File-System Level)

Shuffle is the heart of distributed Spark.

5.5.1 Shuffle File Layout

Each executor writes:

/spark-shuffle/
 ├── shuffle_0_0_0.data
 ├── shuffle_0_0_0.index
 ├── shuffle_0_1_0.data
 └── ...

Meaning:

shuffle_<shuffleId>_<mapId>_<reduceId>

5.5.2 Shuffle Phases

Map Phase

Executor:

  • buffers records in memory
  • spills to disk if needed
  • writes shuffle files

Reduce Phase

Executor:

  • fetches shuffle blocks via network
  • merges
  • sorts
  • aggregates
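
You can spot the shuffle in a DataFrame plan: every Exchange node is a shuffle, and spark.sql.shuffle.partitions (200 by default) controls how many reduce partitions it creates. A minimal sketch:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000).withColumn("k", F.col("id") % 3)
df.groupBy("k").count().explain()   # look for 'Exchange hashpartitioning(k, ...)'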

5.5.3 Shuffle Service (External)

Spark can use an external shuffle service.

Purpose:

  • preserve shuffle files even if the executor dies

Config:

spark.shuffle.service.enabled=true

🧠 5.6 Block Manager (Spark Storage System)

Spark stores data using BlockManager.

5.6.1 Block Types

Blocks
 ├── RDD blocks
 ├── Shuffle blocks
 ├── Broadcast blocks
 └── Cached DataFrame blocks

5.6.2 Block Locations

  • Memory
  • Disk
  • Off-heap
  • Remote executors

Spark tracks block locations using BlockManagerMaster.
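
Caching goes through the same machinery. A minimal sketch; the resulting blocks appear in the Storage tab of the Spark UI:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)
df.persist(StorageLevel.MEMORY_AND_DISK)   # partitions stored as blocks by the BlockManager
df.count()                                 # first action materializes the cache
df.unpersist()                             # BlockManagerMaster tells executors to drop the blocks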


🧠 5.7 Spark Memory Manager (Internal Design)

Spark has two memory managers:

  • Static Memory Manager (old)
  • Unified Memory Manager (modern)

5.7.1 Unified Memory Manager

Heap Memory
 ├── Execution Memory
 ├── Storage Memory
 └── User Memory

Execution & Storage can borrow memory dynamically.


5.7.2 Off-Heap Memory

Spark Tungsten can manage memory off-heap, outside the JVM heap, when enabled.

Benefits:

  • less GC
  • faster binary processing

Config:

spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=4g

🧠 5.8 Tungsten Engine (Compiler-Level Execution)

Spark Tungsten turns queries into JVM bytecode.

5.8.1 Tungsten Pipeline

Logical Plan
 ↓
Physical Plan
 ↓
Whole-Stage Code Generation
 ↓
JVM Bytecode
 ↓
CPU Execution

Spark literally generates Java code at runtime.


5.8.2 Whole-Stage Code Generation

Instead of executing operators one by one:

Spark merges operators into one function.

Example:

filter + projection + aggregation
→ single generated function

Benefits:

  • fewer virtual function calls between operators
  • better CPU cache and register locality
  • tight loops the JIT compiler can optimize aggressively

Distributed systems insight:

Spark behaves like a JIT compiler.
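
You can inspect the generated code from PySpark (the string mode argument needs Spark 3.x):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

q = spark.range(100).filter("id > 50").selectExpr("id * 2 AS doubled")
q.explain()            # '*(1)' markers show operators fused by WholeStageCodegen
q.explain("codegen")   # prints the generated Java source for the fused stage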


🧠 5.9 Fault Tolerance Mechanisms

Spark assumes failures.

5.9.1 Types of Failures

Spark's response depends on what failed:

  • Task failure: retry the task
  • Executor failure: reschedule its tasks on other executors
  • Stage failure: recompute the stage
  • Driver failure: the application fails
  • Node failure: recompute the lost partitions

5.9.2 Lineage-Based Recovery

Spark does NOT replicate RDDs.

Instead:

RDD lineage → recompute lost partitions

5.9.3 Checkpointing

For long lineage:

rdd.checkpoint()

Spark saves the RDD to stable storage and truncates its lineage.
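
A minimal sketch (the checkpoint directory is illustrative; in production use a reliable store such as HDFS or S3):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
sc.setCheckpointDir("/tmp/spark-checkpoints")   # hypothetical path

long_lineage = sc.parallelize(range(1_000_000)).map(lambda x: x + 1).filter(lambda x: x % 2 == 0)
long_lineage.checkpoint()   # marks the RDD; nothing is written yet
long_lineage.count()        # first action writes the checkpoint and truncates the lineage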


🧠 5.10 Speculative Execution (Straggler Mitigation)

Problem:

  • Some tasks slow (stragglers)

Solution:

Spark launches duplicate copies of slow tasks and takes whichever copy finishes first.

Config:

spark.speculation=true
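
Related knobs (values shown are, to my knowledge, the defaults): the interval is how often Spark checks for stragglers, the multiplier is how much slower than the median a task must be, and the quantile is the fraction of tasks that must finish before speculation kicks in.

spark.speculation.interval=100ms
spark.speculation.multiplier=1.5
spark.speculation.quantile=0.75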

Distributed systems principle:

Don’t wait for slow nodes.


🧠 5.11 Spark Streaming Internals (System-Level)

Spark Structured Streaming is a micro-batch engine by default.

5.11.1 Execution Model

Input Stream → Micro-batches → DAG → Execution

5.11.2 State Store

Spark stores streaming state in:

  • RocksDB / HDFS / memory

Used for:

  • window operations
  • aggregations
  • joins

5.11.3 Watermarking

Watermarking tells Spark how late data is allowed to arrive; state older than the watermark can be dropped, so the state store does not grow forever.
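
A minimal Structured Streaming sketch using the built-in rate source (window sizes and the key expression are illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = (spark.readStream.format("rate")
          .option("rowsPerSecond", 10)
          .load()                                    # columns: timestamp, value
          .withColumn("key", F.col("value") % 5))

counts = (events
          .withWatermark("timestamp", "10 minutes")              # state older than this can be dropped
          .groupBy(F.window("timestamp", "5 minutes"), "key")
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
# query.awaitTermination()   # uncomment to keep the stream running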


🧠 5.12 Spark vs Flink vs Ray vs Dask (Architectural View)

  • Spark: batch-first, micro-batch streaming
  • Flink: true streaming
  • Ray: general distributed computing
  • Dask: Python-native parallelism

Spark strength:

  • SQL engine
  • ecosystem
  • fault tolerance

Flink strength:

  • low-latency streaming

Ray strength:

  • ML & AI workloads

🧠 5.13 Designing Spark Architecture (System Designer Mindset)

Example: design a Spark architecture for a 10 TB daily pipeline

Questions to ask:

  1. Data distribution?
  2. Partition strategy?
  3. Join patterns?
  4. Memory sizing?
  5. Shuffle volume?
  6. Skew risk?
  7. Failure tolerance?
  8. Cost optimization?

Example Architecture

S3 → Spark → Delta Lake → BI

Key decisions:

  • broadcast joins for dimensions
  • partition by date
  • AQE enabled
  • executor sizing tuned
  • caching hot tables
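
A hedged sketch of those decisions in code (bucket paths, column names, and the Delta Lake dependency are assumptions, not part of the original design):

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")            # AQE
         .getOrCreate())

fact = spark.read.parquet("s3://my-bucket/raw/events/")           # hypothetical path
dim = spark.read.parquet("s3://my-bucket/raw/dim_customers/")     # hypothetical path

enriched = fact.join(F.broadcast(dim), "customer_id")             # broadcast the small dimension

(enriched.write
 .partitionBy("event_date")                                       # partition by date
 .format("delta")                                                 # needs the delta-spark package
 .mode("overwrite")
 .save("s3://my-bucket/gold/events_enriched/"))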

🧠 5.14 Real Production Debugging Framework

When a Spark job is slow, ask:

Layer 1 — Algorithm

  • wrong join?
  • unnecessary shuffle?

Layer 2 — Data

  • skew?
  • partition size?

Layer 3 — Memory

  • spill?
  • GC?

Layer 4 — Network

  • shuffle heavy?

Layer 5 — Cluster

  • underutilized cores?

This is how Spark architects debug.


🧠 5.15 Interview Questions (System Designer Level)

Q1

Why does Spark use lineage instead of replication?

✅ Because lineage metadata is tiny compared to replicated data, and the recomputation cost is only paid when a failure actually happens.


Q2

Why does Spark generate bytecode?

✅ To reduce interpretation overhead and improve CPU efficiency.


Q3

Why is shuffle expensive?

✅ Because it involves memory + disk + network.


Q4

Why does Spark prefer locality?

✅ To minimize network transfer.


Q5 (Elite)

What is the difference between Spark and a distributed database?

✅ Spark is compute-first; DB is storage-first.


🧠 5.16 The Biggest Truth About Spark Mastery

Most people learn Spark like a library.

Real experts understand Spark like:

  • distributed OS
  • distributed database
  • compiler
  • network system

If you understand Module 5, you are not a Spark user.

🔥 You are a Spark system designer.


🚀 If you want, I will take you even further:

MODULE 6 — Spark + PySpark Hardcore Practice & Interview Domination

We will build:

  • 200+ real Spark coding problems
  • LeetCode-style Spark questions
  • real production debugging scenarios
  • Spark system design interview questions
  • trick questions & traps
  • end-to-end projects (Data Lake, ETL, ML pipelines)
  • Spark architecture case studies (Netflix, Uber, Airbnb style)

This will turn you into:

“Someone who can design, debug, optimize, and explain Spark like a principal engineer.”

If you want, say:

👉 “Start Module 6 — I want hardcore Spark practice + interview mastery.”

And if you want, I can also convert Modules 1–5 into a structured Spark Bible (book-level) with diagrams, exercises, and projects.