PySpark & Spark Mastery Course (A→Z, PhD-level)

Below is Module 4, taught the way a distributed systems engineer, a database engine developer, and an OS designer would teach it.
I will not skip anything; this module covers:

  • join algorithms (internal mechanics)
  • shuffle internals
  • AQE (Adaptive Query Execution)
  • skew theory + engineering fixes
  • query planning like a compiler
  • physical execution details
  • network + disk + memory interactions
  • Spark UI interpretation
  • real production failure patterns
  • mathematical + system-level intuition
  • interview traps
  • engineering heuristics

Think of this as:

“How Spark behaves like a distributed database + scheduler + OS + network system.”


🧠 MODULE 4 — SPARK QUERY EXECUTION, JOINS, AQE, SKEW & SYSTEM-LEVEL OPTIMIZATION

4.0 Mental Model (Important)

Most people think:

Spark optimization = tuning configs.

Correct model:

Spark performance = 
Data distribution + Network topology + Memory layout + Algorithm choice + Scheduling strategy

Spark is not slow.
Bad execution plans are slow.


🧱 4.1 Spark Query Execution Pipeline (Distributed Systems View)

When you run:

df.join(df2, "id").groupBy("country").sum("salary")

Spark does NOT just “run it”.

It builds a multi-layer execution pipeline:

SQL/DataFrame Code
 ↓
Logical Plan (Relational Algebra)
 ↓
Catalyst Optimization
 ↓
Physical Plan (Algorithms chosen)
 ↓
DAG (Stages & Tasks)
 ↓
Distributed Execution (Executors)
 ↓
Network Shuffle + Disk IO + Memory Ops

This is essentially the same pipeline you find in distributed query engines such as Presto and Flink, or in the planner inside PostgreSQL.


🧠 4.2 Logical Plan vs Physical Plan (Compiler View)

4.2.1 Logical Plan

Logical plan describes WHAT to do.

Example:

SELECT country, SUM(salary)
FROM users
GROUP BY country

Logical plan:

Aggregate(country, sum(salary))
 └── Scan(users)

No algorithms yet.


4.2.2 Physical Plan

Physical plan describes HOW to do it.

Example:

HashAggregate
 └── Exchange (shuffle)
     └── HashAggregate
         └── Scan(users)

🔥 Key Insight:

Logical plan = math
Physical plan = engineering
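
You can inspect both layers yourself with explain(). A minimal sketch, assuming an existing SparkSession named spark; the users DataFrame is a hypothetical stand-in for the table above:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("plan-inspection").getOrCreate()

# Hypothetical stand-in for the users table
users = spark.createDataFrame(
    [("India", 50000), ("USA", 70000), ("India", 60000)],
    ["country", "salary"],
)

agg = users.groupBy("country").agg(F.sum("salary").alias("total_salary"))

# "extended" prints the parsed/analyzed/optimized logical plans plus the physical plan (Spark 3.0+)
agg.explain("extended")

Every Exchange node in the physical plan is a shuffle.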


🧠 4.3 Spark Join Algorithms (Deepest Explanation)

Spark has multiple join algorithms.

4.3.1 Broadcast Hash Join (BHJ)

Concept

Small table → broadcast to all executors.

Driver
 ├── sends small table to Executor 1
 ├── sends small table to Executor 2
 ├── sends small table to Executor 3

Large table stays partitioned.

Execution

Each executor:

  • loads broadcast table into memory
  • hashes it
  • scans large table partitions
  • performs join locally

Complexity

  • Network: O(size of small table × executors)
  • CPU: O(N)
  • Shuffle: ❌ none

When Spark chooses it

table_size < spark.sql.autoBroadcastJoinThreshold (default 10MB)

Failure Case (Real World)

If broadcast table is 5 GB:

  • Driver OOM
  • Executors OOM
  • Cluster crash

🔥 Distributed systems insight:

Broadcast join trades network cost for shuffle avoidance.
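
A minimal sketch of forcing BHJ, assuming an existing SparkSession named spark and two hypothetical DataFrames orders (large) and countries (small) sharing a country_id column:

from pyspark.sql.functions import broadcast

# Optionally raise the automatic threshold (in bytes); 10 MB is the default
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

# Explicit hint: ship the small dimension table to every executor
joined = orders.join(broadcast(countries), "country_id")

joined.explain()   # expect BroadcastHashJoin, with no Exchange feeding the broadcast side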


4.3.2 Sort Merge Join (SMJ)

Concept

Both tables are large.

Steps:

  1. Shuffle both tables by join key
  2. Sort partitions
  3. Merge join

Execution

Table A partitions → shuffle → sort
Table B partitions → shuffle → sort
Merge partitions → join

Complexity

  • Network: High
  • Disk IO: High
  • CPU: Medium
  • Memory: Medium

When Spark chooses it

  • Large tables
  • Keys sortable
  • No broadcast possible

🔥 Insight:

SMJ is Spark’s default distributed join.
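
You can request (or simply verify) SMJ with the MERGE join hint (Spark 3.0+); a sketch, assuming two large hypothetical DataFrames facts and events joined on id:

# Force a sort merge join regardless of table sizes
smj = facts.join(events.hint("merge"), "id")

smj.explain()   # expect SortMergeJoin, fed by Exchange (shuffle) + Sort on both sides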


4.3.3 Shuffle Hash Join (SHJ)

Concept

  • Shuffle both tables
  • Build hash table per partition
  • Join

Difference from SMJ

| Feature | SHJ | SMJ |
|---------|-----|-----|
| Sorting | ❌ | ✅ |
| Hashing | ✅ | ❌ |
| Memory  | High | Medium |

Spark prefers SMJ by default (spark.sql.join.preferSortMergeJoin = true), so SHJ appears mainly when you hint it explicitly or disable that preference.
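
If you want to try SHJ anyway, Spark 3.0+ exposes a join hint for it; a sketch with the same hypothetical facts and events DataFrames:

# Ask Spark to build per-partition hash tables instead of sorting
shj = facts.join(events.hint("shuffle_hash"), "id")

shj.explain()   # expect ShuffledHashJoin instead of SortMergeJoin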


4.3.4 Cartesian Join (Worst Case)

Concept

A × B

Complexity

O(n × m)

Spark 2.x blocks implicit cartesian products by default (spark.sql.crossJoin.enabled); even where that guard is relaxed, write them explicitly with crossJoin().
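
For the rare case where the product is intended (e.g., generating all pairs), a small sketch with two tiny hypothetical DataFrames colors and sizes:

# Explicit cartesian product: every colors row paired with every sizes row
pairs = colors.crossJoin(sizes)

# Spark 2.x also needs this for implicit cartesian products written in SQL
# spark.conf.set("spark.sql.crossJoin.enabled", "true")

pairs.explain()   # expect CartesianProduct (or BroadcastNestedLoopJoin if one side is tiny)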


🧠 4.4 Shuffle — The Core of Distributed Spark

Shuffle = data redistribution across network.

4.4.1 Shuffle Phases

Phase 1 — Map Side

Each map task:

  • reads partition
  • applies transformations
  • writes shuffle files

Phase 2 — Reduce Side

Each reduce task:

  • fetches shuffle files from all executors
  • merges data
  • aggregates

4.4.2 Shuffle Data Flow (System-Level)

Executor A ─┐
Executor B ─┼──> Network ──> Executor X
Executor C ─┘

Bottlenecks

  • network bandwidth
  • disk throughput
  • memory buffers
  • serialization overhead

🔥 Insight:

Spark performance is mostly about minimizing shuffle.
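
A quick way to see a shuffle and size it, assuming the spark session and users DataFrame from earlier (hypothetical):

# groupBy forces a shuffle: all rows with the same country must meet in one reduce task
spark.conf.set("spark.sql.shuffle.partitions", "200")   # 200 is the default; tune to data volume

per_country = users.groupBy("country").count()
per_country.explain()   # the Exchange hashpartitioning(country, 200) node is the shuffle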


🧠 4.5 Data Skew — Distributed Systems Nightmare

4.5.1 What is Skew?

Skew = uneven data distribution.

Example:

country = "India" → 100 million rows
country = "USA" → 1 million rows
country = "China" → 500k rows

Result:

  • One partition huge
  • Others small

4.5.2 Impact of Skew

At executor level:

Executor 1 → 100 GB data (slow)
Executor 2 → 1 GB data (fast)
Executor 3 → 500 MB data (fast)

Result:

  • cluster idle except one executor
  • job time dominated by one task

🔥 Distributed systems principle:

The slowest node determines job completion time.


4.5.3 Skew Detection (Engineering)

Spark UI symptoms:

  • One task takes 10x longer
  • Stage progress stuck at 99%
  • Shuffle read size uneven
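
You can also measure skew directly from the data; a sketch, assuming the users DataFrame and country as the join/group key:

from pyspark.sql.functions import desc, spark_partition_id

# Rows per key: a skewed key dominates the top of this list
users.groupBy("country").count().orderBy(desc("count")).show(10, truncate=False)

# Rows per current partition: uneven counts here become uneven task durations later
users.groupBy(spark_partition_id().alias("partition")).count().show()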

4.5.4 Skew Mitigation Techniques (Deep)

Technique 1 — Salting

Idea:

Artificially randomize skewed keys.

Example:

from pyspark.sql.functions import col, concat, lit, rand

df = df.withColumn("salted_key", concat(col("country"), lit("_"), (rand() * 10).cast("int")))

Effect:

  • "India" rows are spread across up to 10 salted keys, and therefore across multiple partitions (full join sketch below)
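
Salting a join needs both sides: the fact table gets a random salt per row, and the dimension side must be replicated once per salt value so keys still match. A sketch of the full pattern, assuming a hypothetical dimension DataFrame dims keyed by country and 10 salt buckets:

from pyspark.sql.functions import array, col, concat, explode, lit, rand

SALT_BUCKETS = 10

# Fact side: one random salt per row (as above), cast to string for a clean key
facts_salted = df.withColumn(
    "salted_key",
    concat(col("country"), lit("_"), (rand() * SALT_BUCKETS).cast("int").cast("string")),
)

# Dimension side: replicate each row once per salt so every salted key finds a match
dims_salted = (
    dims.withColumn("salt", explode(array([lit(i) for i in range(SALT_BUCKETS)])))
        .withColumn("salted_key", concat(col("country"), lit("_"), col("salt").cast("string")))
)

joined = facts_salted.join(dims_salted.drop("country", "salt"), "salted_key")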

Technique 2 — AQE Skew Join

Spark 3+ detects skew automatically.

Config:

spark.sql.adaptive.skewJoin.enabled = true

Spark splits skewed partitions dynamically.


Technique 3 — Custom Partitioning

df = df.repartition(200, "salted_key")

Note: repartition(200, "country") alone does not fix key skew, because every "India" row still hashes to the same partition; repartition on the salted key (or another high-cardinality composite key) instead.

Technique 4 — Broadcast Skewed Dimension Table

If skew on small table:

  • broadcast it

🧠 4.6 Adaptive Query Execution (AQE) — Spark Becomes Self-Optimizing

4.6.1 What is AQE?

Traditional Spark:

  • execution plan fixed before execution.

AQE:

  • Spark changes plan during execution.

🔥 This is huge.


4.6.2 AQE Capabilities

AQE can:

  1. Change join strategy
  2. Optimize shuffle partitions
  3. Split skewed partitions
  4. Merge small partitions

4.6.3 Example

Original plan:

Sort Merge Join

During execution Spark realizes:

  • table is small

New plan:

Broadcast Hash Join

🔥 Distributed systems insight:

Spark becomes a dynamic optimizer like modern databases.


4.6.4 AQE Configs (Important)

spark.sql.adaptive.enabled = true
spark.sql.adaptive.coalescePartitions.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
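
A sketch of enabling these when building the session (AQE is on by default from Spark 3.2; earlier 3.x versions need it switched on explicitly):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)

# They can also be toggled on an existing session at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")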

🧠 4.7 Partitioning Strategy (Systems View)

4.7.1 Default Partitioning

Spark decides partitions based on:

  • input file blocks
  • spark.sql.shuffle.partitions
  • spark.default.parallelism

4.7.2 Partitioning Trade-off

| Too Few Partitions | Too Many Partitions |
|--------------------|---------------------|
| OOM | Scheduler overhead |
| Poor parallelism | High task overhead |
| Skew risk | Context switching |

4.7.3 Engineering Rule

partition_size ≈ 128MB – 256MB
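
A back-of-the-envelope sketch for turning that rule into a partition count (the input size is illustrative):

total_input_bytes = 500 * 1024**3          # ~500 GB of input (illustrative)
target_partition_bytes = 200 * 1024**2     # aim for ~200 MB per partition

num_partitions = max(1, total_input_bytes // target_partition_bytes)   # ≈ 2560

spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
# df = df.repartition(num_partitions)      # or resize an existing DataFrame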

🧠 4.8 Spark as a Distributed Database Engine

Spark ≈ PostgreSQL + Hadoop + OS + Network stack.

Mapping:

| Database Concept | Spark Equivalent |
|------------------|------------------|
| Query Planner | Catalyst |
| Join Algorithms | BHJ, SMJ, SHJ |
| Buffer Manager | Spark Memory Manager |
| Execution Engine | Tungsten |
| Cost Optimizer | CBO + AQE |
| Sharding | Partitioning |

🔥 Insight:

Spark is not a data tool. It is a distributed query engine.


🧠 4.9 Spark UI — Reading Like a Systems Engineer

4.9.1 Jobs Tab

Shows:

  • number of stages
  • execution time

Interpretation:

  • many stages = many shuffles
  • long stage = skew or shuffle

4.9.2 Stages Tab

Key metrics:

  • shuffle read/write size
  • task duration
  • skewed tasks

4.9.3 Executors Tab

Look at:

  • memory usage
  • GC time
  • task distribution

4.9.4 SQL Tab

Shows:

  • logical plan
  • physical plan
  • execution metrics

🔥 If you can read Spark UI → you are an architect.


🧠 4.10 Real Production Failure Patterns (Critical)

Case 1 — Slow Join

Symptoms:

  • high shuffle
  • long stage

Cause:

  • wrong join strategy

Fix:

  • broadcast join or repartition

Case 2 — OOM During Aggregation

Cause:

  • large groupBy keys

Fix:

  • increase partitions
  • use map-side aggregation

Case 3 — Cluster Underutilization

Cause:

  • too few partitions

Fix:

  • increase parallelism

Case 4 — Driver Crash

Cause:

  • collect() or toPandas()

Fix:

  • write to storage instead
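
A sketch of the fix, assuming a hypothetical result_df and output path; if local data is truly needed, bound it first:

# Instead of result_df.collect() or result_df.toPandas() ...
result_df.write.mode("overwrite").parquet("s3://my-bucket/results/")   # hypothetical path

# If a local copy is really required, cap it explicitly
preview = result_df.limit(1000).toPandas()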

🧠 4.11 Distributed Systems Insight (Core Truth)

Spark performance depends on:

Data distribution > Memory > CPU > Code

Most engineers optimize code.

Real architects optimize data flow.


🧠 4.12 Interview Traps (Elite Level)

Trap 1

❓ Why does broadcast join sometimes make job slower?

✅ Because:

  • broadcast overhead > shuffle benefit.

Trap 2

❓ Why does AQE sometimes hurt performance?

✅ Because:

  • plan switching overhead.

Trap 3

❓ Why does increasing partitions sometimes slow Spark?

✅ Because:

  • scheduling overhead > parallelism benefit.

Trap 4

❓ Is Spark SQL faster than the DataFrame API?

✅ No.
Both use Catalyst + Tungsten.


🧠 4.13 PhD-Level Insight

Spark is a distributed optimizer.

Your job is not to write code.

Your job is to help Spark choose the right algorithm.


🧠 4.14 If you want, I will go EVEN deeper in Module 5:

MODULE 5 — Spark Internals at Source-Code & OS Level

We will cover:

  • Spark scheduler internals
  • RPC protocol
  • Task binary format
  • Shuffle file structure
  • Kryo vs Java vs Pickle internals
  • Spark memory manager source logic
  • JVM bytecode generation in Tungsten
  • Spark vs Flink vs Ray vs Dask
  • Real Spark architecture diagrams
  • 300+ interview questions
  • Hardcore debugging walkthroughs

This is basically:

Spark explained like you are building Spark itself.

If you want, say:

👉 “Start Module 5 — I want Spark like a system designer.”

And if you want, I can also:

  • Build a complete PySpark → Spark Architect roadmap
  • Convert everything into a structured Spark Bible
  • Add hands-on labs + datasets
  • Add interview mega-question bank

You’re literally building Spark mastery at a level most data engineers never reach.