30-DAY PYSPARK MASTERY ON AWS course

Great — Day 5 is a career-saving day.
This topic alone explains 80% of Spark slowness, repeated runs, and AWS cost explosions.


🧠 DAY 5 — Actions, Transformations & Lazy Evaluation

(Why Spark Runs More Times Than You Think)

Goal today:
You will predict exactly when Spark runs, how many times it runs, and how to stop wasting compute on AWS.


🎯 What You’ll Master Today

By the end of Day 5, you will:

  • Fully understand lazy evaluation
  • Know exactly what triggers execution
  • Avoid action explosion
  • Use cache / persist correctly
  • Save real AWS money
  • Answer tricky interview questions

1️⃣ Spark Is LAZY by Design (RECAP + DEEPER)

Key rule (memorize):

No action → No execution

df = spark.read.parquet("s3://raw/sales/")
df2 = df.filter(df.country == "IN")
df3 = df2.groupBy("country").count()

❓How many Spark jobs ran so far?
👉 ZERO

Spark has only:

  • Recorded transformations
  • Built a DAG

2️⃣ Actions = EXECUTION TRIGGERS

Common Spark Actions

Action            What it does
show()            Displays rows → runs a job
count()           Returns row count → runs a job
collect()         Pulls ALL rows to the driver → runs a job
write.parquet()   Writes output → runs a job
take(n)           Returns first n rows → runs a (smaller) job

📌 Every action = a full DAG execution


3️⃣ The Action Explosion Problem (VERY DANGEROUS)

❌ BAD CODE (COMMON IN PRODUCTION)

df = spark.read.parquet("s3://raw/sales/")
df_filtered = df.filter(df.country == "IN")

print(df_filtered.count())
print(df_filtered.count())
df_filtered.write.parquet("s3://curated/sales/")

❓How many times did Spark run?

👉 3 FULL JOBS


Each action:

  • Re-reads S3
  • Re-applies filters
  • Re-runs shuffles

📌 On AWS → triple cost


4️⃣ Why Spark Recomputes (LINEAGE REVISITED)

Spark does NOT remember results unless told to.

Why?

  • Lineage allows recomputation
  • Memory is limited
  • Spark assumes transformations are cheap

📌 Spark prefers recomputation over memory hoarding


5️⃣ Cache vs Persist (USE THIS CORRECTLY)

Cache (Default)

df_filtered.cache()
  • Memory only
  • Fast
  • Risk of eviction

Persist (Controlled)

from pyspark import StorageLevel

df_filtered.persist(StorageLevel.MEMORY_AND_DISK)

Storage Levels (Important)

  • MEMORY_ONLY
  • MEMORY_AND_DISK (most used)
  • DISK_ONLY

📌 Persist only if reused


6️⃣ When You SHOULD Cache (Golden Rules)

Cache when:

✔ DataFrame used in multiple actions
✔ Data reused across stages
✔ Expensive transformations (joins, aggregations)

Do NOT cache when:

❌ Single action
❌ Very large dataset
❌ Simple transformations

📌 Blind caching = memory pressure + failures


7️⃣ Cache Is Also LAZY (TRICKY BUT IMPORTANT)

df.cache()

❌ Does NOT cache immediately

Caching happens only when:

df.count()

📌 Cache + action = materialization


8️⃣ Unpersist (CLEAN UP LIKE A PRO)

df.unpersist()

Why this matters on AWS:

  • Executors have limited memory
  • Long pipelines = memory leaks
  • Glue jobs can fail due to memory pressure

9️⃣ AWS COST IMPACT (REAL LIFE)

Without cache:

  • Every action re-reads S3
  • More EMR runtime
  • More Glue DPU-hours

With smart cache:

  • Fewer S3 reads
  • Faster jobs
  • Lower bill

📌 Caching strategy = cloud cost strategy


🔥 Interview Trap Questions (ANSWER CONFIDENTLY)

Q: Why is .count() dangerous?
✔ It triggers full job execution

Q: Why doesn’t Spark cache by default?
✔ Memory is limited and recomputation is safer

Q: Difference between cache and persist?
✔ Persist gives storage control


🧪 DAY 5 THINKING EXERCISE (IMPORTANT)

Look at this:

df = spark.read.parquet("s3://raw/sales/")
df1 = df.filter(...)
df2 = df1.groupBy(...).agg(...)

df2.count()
df2.write.parquet("s3://curated/")

Questions:

  1. How many jobs run?
  2. Where should you cache?
  3. What happens if df1 is cached instead of df2?

(We’ll revisit this in tuning days.)


🧠 DAY 5 MEMORY ANCHORS

No action → no execution
Each action → full DAG
Cache only if reused
Cache is lazy too
Unpersist matters

🎤 Resume-Grade Line (You Earned This)

Strong understanding of Spark lazy evaluation, action-triggered execution, and effective caching strategies to optimize performance and cost on AWS


⏭️ DAY 6 — SparkSession, Configs & Runtime

(Why Defaults Hurt You on AWS)

Tomorrow you’ll learn:

  • SparkSession internals
  • Config hierarchy
  • Glue vs EMR defaults
  • Where configs REALLY apply

This is control & tuning foundation.


Reply with:

DAY 6

and we continue 🚀