Now we enter the core engine room of Spark + AWS.
This is where most engineers stop… and architects begin.
You will understand Spark not as “PySpark code”, but as a distributed execution engine running on AWS infrastructure.
By the end of this module, you will be able to answer questions like:
- Why did my Spark job fail only on EMR but not locally?
- Why did increasing executors make my job slower?
- Why does Glue behave differently from EMR?
- Where exactly does my PySpark code run?
- How does Spark use memory, CPU, disk, and network?
- How does AWS infrastructure shape Spark behavior?
🧠 MODULE 1.4 — SPARK EXECUTION ENGINE (HARDCORE MODE)
We will break Spark into:
- Spark Architecture (deep)
- Cluster Managers (YARN, K8s, Standalone)
- Spark on AWS (EMR, Glue, Databricks)
- Execution Pipeline (DAG → Stages → Tasks)
- Memory Model (JVM, heap, off-heap)
- Shuffle Engine (the real bottleneck)
- Serialization (hidden killer)
- Skew, partitions, tuning
- Real-world failure scenarios
- Interview traps (PhD-level)
1️⃣ Spark Is NOT PySpark
Most people think:
Spark = PySpark code.
❌ WRONG.
Reality:
Spark is a distributed JVM-based engine.
PySpark is just a Python API wrapper.
🧠 Spark has 4 main components:
- Driver Program
- Cluster Manager
- Executors
- Storage Layer (S3/HDFS)
2️⃣ Spark Architecture (Deep)
Let’s draw the real architecture:
PySpark Code (Python)
↓
Driver (JVM + Py4J bridge)
↓
Cluster Manager (YARN / K8s / Standalone)
↓
Executors (JVM processes on worker nodes)
↓
Storage (S3 / HDFS / EBS)
2.1 Driver — The Brain
Driver is responsible for:
- Parsing your PySpark code
- Building logical plan
- Optimizing query (Catalyst)
- Building DAG
- Scheduling tasks
- Communicating with executors
- Collecting results
Key Insight:
👉 Driver is a single point of failure.
If driver dies → job dies.
🔥 Interview Trap #1
❓ Where does PySpark code run?
Correct answer:
- Driver-side Python code runs in the driver's Python process.
- The actual DataFrame/SQL execution runs on Executors as JVM code.
- Python UDFs are the exception: they run in Python worker processes launched next to each executor.
- The Python driver talks to the JVM via Py4J.
Most candidates say “executors run PySpark code” → ❌ WRONG (unless they mean Python UDFs).
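A tiny sketch of that exception (assumes an existing SparkSession named `spark` and a `region` column in the illustrative sales dataset):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Driver-side Python: this line runs in the driver's Python process.
df = spark.read.parquet("s3://sales/")  # illustrative path

# A Python UDF is pickled and shipped to Python worker processes that run
# next to each executor; this is the one case where Python runs on workers.
@F.udf(returnType=StringType())
def normalize_region(r):
    return r.strip().upper() if r else None

df.withColumn("region_norm", normalize_region("region")).show()
```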
3️⃣ Executors — The Workers
Executors are JVM processes running on worker nodes.
Executors do:
- Run tasks
- Cache data
- Perform shuffle
- Execute transformations
- Return results
Each executor has:
- CPU cores
- Memory
- Disk (EBS)
- Network
Executor Model
Example:
```
spark.executor.instances = 10
spark.executor.cores = 4
spark.executor.memory = 8G
```
Meaning:
- 10 executors
- Each executor has 4 cores
- Each executor has 8GB RAM
Total cores = 40
Total memory = 80GB
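A minimal sketch of setting these from PySpark (the app name and sizes are illustrative, not a recommendation):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-sizing-sketch")               # hypothetical app name
    .config("spark.executor.instances", "10")        # 10 executors
    .config("spark.executor.cores", "4")             # 4 cores per executor
    .config("spark.executor.memory", "8g")           # 8 GB heap per executor
    .config("spark.executor.memoryOverhead", "1g")   # extra off-heap memory per executor
    .getOrCreate()
)
```

On EMR/YARN these are normally passed via spark-submit or cluster configuration; builder settings only take effect when the session is first created.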
🧠 Insight
If you misconfigure executors:
👉 Spark performance collapses.
4️⃣ Cluster Managers — Spark’s Resource Controller
Spark does NOT manage machines itself.
It relies on Cluster Managers:
Options:
| Cluster Manager | Used in |
|---|---|
| YARN | EMR, Hadoop clusters |
| Kubernetes | Modern Spark |
| Standalone | Small clusters |
| Mesos | Rare (deprecated since Spark 3.2) |
4.1 YARN (EMR)
YARN = Yet Another Resource Negotiator.
Components:
- ResourceManager (RM)
- NodeManager (NM)
- ApplicationMaster (AM)
Spark on YARN Flow:
- The client submits the application to the ResourceManager.
- The ResourceManager launches an ApplicationMaster for the job.
- The ApplicationMaster requests executor containers; YARN allocates them on NodeManagers.
- Executors launch inside those containers, register with the driver, and receive tasks.
🔥 Interview Trap #2
❓ Who allocates executors in EMR?
Correct answer:
YARN allocates the containers (the ApplicationMaster requests them from the ResourceManager); Spark only launches executors inside those containers, it never grabs machines directly.
5️⃣ Spark on AWS — EMR vs Glue vs Databricks
5.1 EMR (Full Control)
- You manage cluster
- You tune Spark
- HDFS available
- Strong performance (when tuned)
5.2 Glue (Serverless Spark)
- AWS manages cluster
- Limited tuning
- No HDFS
- S3 only
- Slower for heavy workloads
5.3 Databricks
- Optimized Spark runtime
- Delta Lake
- Better performance
- Expensive
🧠 Hardcore Truth
| Platform | Performance | Control | Cost (more ⭐ = more expensive) |
|---|---|---|---|
| EMR | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ |
| Glue | ⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
| Databricks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
6️⃣ Spark Execution Pipeline (Deep)
When you run PySpark code:
```python
df = spark.read.parquet("s3://sales/")
result = df.groupBy("region").sum("amount")
result.show()
```
Spark does NOT execute this line by line. Transformations are lazy; nothing actually runs until the action (show()) is called.
6.1 Logical Plan
Spark builds a logical plan:
Read → GroupBy → Aggregate → Show
6.2 Catalyst Optimizer
Spark optimizes the plan:
- Predicate pushdown
- Column pruning
- Join reordering
- Constant folding
6.3 Physical Plan
Spark converts logical plan to physical plan:
- HashAggregate
- SortMergeJoin
- BroadcastJoin
- ShuffleExchange
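You can inspect these plans yourself. A sketch using the example from above (assumes Spark 3.x and an existing `spark` session):

```python
df = spark.read.parquet("s3://sales/")               # path from the example above
result = df.groupBy("region").sum("amount")

# Prints the parsed, analyzed and optimized logical plans plus the physical plan,
# where operators like HashAggregate and Exchange (the shuffle) show up.
result.explain(mode="extended")
```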
6.4 DAG (Directed Acyclic Graph)
Spark builds DAG:
Stage 1 → Stage 2 → Stage 3
6.5 Stages and Tasks
Stage
A group of tasks that run the same work on different partitions; stages are separated by shuffle boundaries.
Task
The smallest unit of work; one task processes one partition.
Example:
If a stage's input has 100 partitions:
- 100 tasks are created for that stage.
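A quick sketch to check this (assumes `df` is the DataFrame read in the earlier example):

```python
# Partitions of the underlying RDD = tasks Spark launches for the stage
# that materializes this DataFrame.
print(df.rdd.getNumPartitions())

# Repartitioning changes the task count of downstream stages.
print(df.repartition(200).rdd.getNumPartitions())  # 200
```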
🔥 Interview Trap #3
❓ What causes a new stage in Spark?
Correct answer:
Shuffle operation (e.g., groupBy, join, reduceByKey).
7️⃣ Spark Memory Model (PhD Level)
Spark memory is NOT simple.
Executors use JVM memory.
7.1 Memory Types
Spark executor memory is divided into:
(A) Heap Memory
- Storage memory (cache)
- Execution memory (shuffle, join)
- User memory
(B) Off-Heap Memory
- Tungsten engine
- Unsafe memory
- Serialization buffers
7.2 Unified Memory Management
Spark dynamically shares memory between:
- Storage (cache)
- Execution (shuffle)
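A minimal sketch of the settings that control this split (the fractions shown are roughly the defaults; the off-heap size is illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.memory.fraction", "0.6")           # heap share for execution + storage
    .config("spark.memory.storageFraction", "0.5")    # part of that protected for cached data
    .config("spark.memory.offHeap.enabled", "true")   # let Tungsten allocate off-heap memory
    .config("spark.memory.offHeap.size", "2g")        # illustrative off-heap size
    .getOrCreate()
)
```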
🔥 Interview Trap #4
❓ Why does Spark spill to disk?
Answer:
When execution memory is insufficient, Spark writes intermediate data to local disk (typically EBS volumes on EMR).
8️⃣ Shuffle — The Real Monster
Shuffle is the most expensive operation in Spark.
8.1 What is Shuffle?
Shuffle is the redistribution of data across executors (over the network and through local disk) so that rows with the same key land in the same partition.
Example:
- groupBy
- join
- orderBy
- distinct
8.2 Shuffle Flow
Executor A → Network → Executor B
Executor C → Network → Executor D
Data written to disk → transferred → read again.
🧠 Insight
Shuffle cost = Disk I/O + Network I/O + Serialization.
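One knob that directly shapes shuffle cost is the shuffle partition count. A sketch (assumes an existing `spark` session and the earlier `df`; 400 is an illustrative value):

```python
# Default is 200 shuffle partitions. Too few: huge partitions and disk spills.
# Too many: thousands of tiny tasks and scheduler overhead.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Every wide operation from here on (groupBy, join, ...) produces 400 shuffle partitions.
result = df.groupBy("region").sum("amount")
```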
🔥 Interview Trap #5
❓ Why is groupBy slower than map?
Answer:
Because groupBy triggers a shuffle (it is a wide transformation), while map works within each partition and moves no data across the cluster.
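A sketch that makes the difference visible in the physical plan (assumes the earlier `df`):

```python
from pyspark.sql import functions as F

# Narrow transformation: no Exchange operator in the plan, so no shuffle.
df.select(F.upper(F.col("region")).alias("region_uc")).explain()

# Wide transformation: the plan contains an Exchange (shuffle) before the aggregate.
df.groupBy("region").sum("amount").explain()
```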
9️⃣ Serialization — Hidden Performance Killer
Spark serializes data between:
- Driver ↔ Executors
- Executor ↔ Executor
- JVM ↔ Python
9.1 Serialization Types
| Type | Speed |
|---|---|
| Java Serialization | Slow |
| Kryo Serialization | Fast |
🔥 Interview Trap #6
❓ Why is Kryo faster than Java serialization?
Answer:
Because Kryo writes a compact binary format and carries far less per-object metadata than Java serialization.
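Enabling Kryo is a one-line config change. A minimal sketch:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```

Note: DataFrames use Tungsten's own binary encoders internally, so Kryo matters most for RDD-based code and cached JVM objects.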
10️⃣ Spark on S3 vs HDFS (Execution Perspective)
HDFS Execution
Executor → Local Disk → Memory
S3 Execution
Executor → Network → S3 → Network → Memory
🧠 Insight
Spark on S3:
- More network overhead
- Per-request latency on every object read
- No data locality
Spark on HDFS:
- Data locality (compute runs next to the data)
- Lower read latency
- Cheap output commits (rename is just a metadata operation)
11️⃣ Data Skew — The Silent Killer
What is Data Skew?
When some partitions are much larger than others.
Example:
- 90% of the data belongs to one key.
Impact:
- One executor overloaded
- Others idle
- Job slow or fails
🔥 Interview Trap #7
❓ How to handle skew in Spark?
Answers:
- Salting keys
- Broadcast joins
- Repartitioning
- Adaptive Query Execution (AQE)
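A sketch of two of these together, salting plus AQE's skew handling (the column names, salt factor, and Spark 3.x assumption are illustrative):

```python
from pyspark.sql import functions as F

# AQE can split skewed shuffle partitions automatically (Spark 3.x).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Manual salting: spread a hot key across N sub-keys before the wide operation.
N = 10  # salt factor, illustrative
salted = (
    df.withColumn("salt", (F.rand() * N).cast("int"))
      .withColumn("salted_key", F.concat_ws("_", F.col("region"), F.col("salt").cast("string")))
)

# Aggregate on the salted key first, then combine the partial results per real key.
partial = salted.groupBy("salted_key", "region").agg(F.sum("amount").alias("partial_sum"))
final = partial.groupBy("region").agg(F.sum("partial_sum").alias("total_amount"))
```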
12️⃣ Real AWS Failure Scenario (Hardcore)
Scenario:
Spark job on EMR:
- Works fine with 10GB data.
- Fails with 1TB data.
Root Causes:
- Too many small files in S3
- Executor memory insufficient
- Shuffle spill to EBS
- NAT Gateway bottleneck
- Driver OOM
- Skewed partitions
Solution Strategy:
- Compact files
- Tune executors
- Use S3 VPC endpoint
- Increase driver memory
- Optimize partitions
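For the small-files cause specifically, the usual fix is compaction: read the fragmented data and rewrite it as fewer, larger files. A sketch (paths and target partition count are illustrative):

```python
# Read the fragmented dataset (many small objects under the raw/ prefix).
df = spark.read.parquet("s3://sales/raw/")

# Rewrite with fewer, larger files (aim for roughly 128-512 MB each).
df.repartition(200).write.mode("overwrite").parquet("s3://sales/compacted/")
```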
13️⃣ Spark + AWS Tuning Cheat Sheet (Architect Level)
Executor Tuning
- Too many executors → overhead
- Too few executors → underutilization
Partition Size Rule
Ideal partition size:
👉 128MB – 512MB
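On the read side, the split size is governed by one config. A sketch (128 MB is the default; 256 MB here is illustrative):

```python
# Maximum bytes Spark packs into a single input partition when reading files.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))
```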
AWS-specific tuning
- Use memory-optimized instances (e.g., r5) for Spark's memory-heavy workloads
- Avoid cross-AZ shuffle
- Use S3 VPC endpoint
- Avoid NAT bottleneck
- Tune EBS IOPS
🧠 Ultra-Deep Insight (Most Important)
Most people think:
Spark problems = Spark problems.
But reality:
👉 Spark performance = Storage + Network + Memory + CPU + Serialization + Data distribution.
Spark is just the orchestrator.
🎯 PhD-Level Interview Questions (Try answering)
Q1
Why does increasing executors sometimes slow down Spark?
Q2
Why does Glue behave differently from EMR?
Q3
Why does Spark driver crash but executors are fine?
Q4
Why does broadcast join fail in Glue?
Q5
Why does Spark job slow down only on AWS but not locally?
(If you want, I’ll answer these in brutal detail.)
🚀 Next Module (Module 1.5)
Now we move from Spark engine to:
⚔️ MODULE 1.5 — AWS DATA ENGINEERING SYSTEM DESIGN (ARCHITECT MODE)
We will design:
- Netflix-scale data platform on AWS
- Uber real-time analytics architecture
- Flipkart batch + streaming pipelines
- Petabyte-scale data lake
- Low-latency analytics system
- Cost-optimized AWS architecture
And we will connect:
S3 + EMR + Glue + Kafka + Kinesis + Redshift + Athena + Airflow + IAM + VPC
With:
- real diagrams (conceptual)
- failure scenarios
- scaling strategies
- interview-ready answers
If you say:
👉 “Continue Module 1.5 — AWS Data Engineering System Design Hardcore Mode”
I will teach you how top architects think and design systems.