Great — Day 5 is a career-saving day.
This topic alone explains 80% of Spark slowness, repeated runs, and AWS cost explosions.
🧠 DAY 5 — Actions, Transformations & Lazy Evaluation
(Why Spark Runs More Times Than You Think)
Goal today:
You will predict exactly when Spark runs, how many times it runs, and how to stop wasting compute on AWS.
🎯 What You’ll Master Today
By the end of Day 5, you will:
- Fully understand lazy evaluation
- Know exactly what triggers execution
- Avoid action explosion
- Use cache / persist correctly
- Save real AWS money
- Answer tricky interview questions
1️⃣ Spark Is LAZY by Design (RECAP + DEEPER)
Key rule (memorize):
No action → No execution
```python
df = spark.read.parquet("s3://raw/sales/")
df2 = df.filter(df.country == "IN")
df3 = df2.groupBy("country").count()
```
❓How many Spark jobs ran so far?
👉 ZERO
Spark has only:
- Recorded transformations
- Built a DAG
2️⃣ Actions = EXECUTION TRIGGERS
Common Spark Actions
| Action | What it does |
|---|---|
| show() | Executes a job; prints the first rows |
| count() | Executes a job; returns the row count |
| collect() | Executes a job; pulls all rows to the driver |
| write() | Executes a job; writes output files |
| take(n) | Executes a job; returns the first n rows |
📌 Every action = a full DAG execution
3️⃣ The Action Explosion Problem (VERY DANGEROUS)
❌ BAD CODE (COMMON IN PRODUCTION)
```python
df = spark.read.parquet("s3://raw/sales/")
df_filtered = df.filter(df.country == "IN")
print(df_filtered.count())
print(df_filtered.count())
df_filtered.write.parquet("s3://curated/sales/")
```
❓How many times did Spark run?
👉 3 FULL JOBS


Each action:
- Re-reads S3
- Re-applies filters
- Re-runs shuffles
📌 On AWS → triple cost
4️⃣ Why Spark Recomputes (LINEAGE REVISITED)
Spark does NOT remember results unless told to.
Why?
- Lineage allows recomputation
- Memory is limited
- Spark assumes transformations are cheap
📌 Spark prefers recompute > memory hoarding
5️⃣ Cache vs Persist (USE THIS CORRECTLY)
Cache (Default)
```python
df_filtered.cache()
```
- Memory only
- Fast
- Risk of eviction
Persist (Controlled)
```python
from pyspark import StorageLevel

df_filtered.persist(StorageLevel.MEMORY_AND_DISK)
```


Storage Levels (Important)
- MEMORY_ONLY
- MEMORY_AND_DISK (most used)
- DISK_ONLY
📌 Persist only if reused
6️⃣ When You SHOULD Cache (Golden Rules)
Cache when:
✔ DataFrame used in multiple actions
✔ Data reused across stages
✔ Expensive transformations (joins, aggregations)
Do NOT cache when:
❌ Single action
❌ Very large dataset
❌ Simple transformations
📌 Blind caching = memory pressure + failures
7️⃣ Cache Is Also LAZY (TRICKY BUT IMPORTANT)
```python
df.cache()
```
❌ Does NOT cache immediately
Caching happens only when an action runs:
```python
df.count()
```
📌 Cache + action = materialization
8️⃣ Unpersist (CLEAN UP LIKE A PRO)
```python
df.unpersist()
```
Why this matters on AWS:
- Executors have limited memory
- Long pipelines = memory leaks
- Glue jobs can fail due to memory pressure
9️⃣ AWS COST IMPACT (REAL LIFE)
Without cache:
- Every action re-reads S3
- More EMR runtime
- More Glue DPU-hours
With smart cache:
- Fewer S3 reads
- Faster jobs
- Lower bill
📌 Caching strategy = cloud cost strategy
🔥 Interview Trap Questions (ANSWER CONFIDENTLY)
Q: Why is .count() dangerous?
✔ It triggers full job execution
Q: Why doesn't Spark cache by default?
✔ Memory is limited and recomputation is safer
Q: Difference between cache and persist?
✔ Persist gives storage control
🧪 DAY 5 THINKING EXERCISE (IMPORTANT)
Look at this:
```python
df = spark.read.parquet("s3://raw/sales/")
df1 = df.filter(...)
df2 = df1.groupBy(...).count()
df2.count()
df2.write.parquet("s3://curated/")
```
Questions:
- How many jobs run?
- Where should you cache?
- What happens if df1 is cached instead of df2?
(We’ll revisit this in tuning days.)
🧠 DAY 5 MEMORY ANCHORS
No action → no execution
Each action → full DAG
Cache only if reused
Cache is lazy too
Unpersist matters
🎤 Resume-Grade Line (You Earned This)
Strong understanding of Spark lazy evaluation, action-triggered execution, and effective caching strategies to optimize performance and cost on AWS
⏭️ DAY 6 — SparkSession, Configs & Runtime
(Why Defaults Hurt You on AWS)
Tomorrow you’ll learn:
- SparkSession internals
- Config hierarchy
- Glue vs EMR defaults
- Where configs REALLY apply
This is control & tuning foundation.
Reply with:
DAY 6
and we continue 🚀