30-DAY PYSPARK MASTERY ON AWS course

Great — Day 5 is a career-saving day.
This topic alone explains 80% of Spark slowness, repeated runs, and AWS cost explosions.


🧠 DAY 5 — Actions, Transformations & Lazy Evaluation

(Why Spark Runs More Times Than You Think)

Goal today:
You will predict exactly when Spark runs, how many times it runs, and how to stop wasting compute on AWS.


🎯 What You’ll Master Today

By the end of Day 5, you will:

  • Fully understand lazy evaluation
  • Know exactly what triggers execution
  • Avoid action explosion
  • Use cache / persist correctly
  • Save real AWS money
  • Answer tricky interview questions

1️⃣ Spark Is LAZY by Design (RECAP + DEEPER)

Key rule (memorize):

No action → No execution

df = spark.read.parquet("s3://raw/sales/")
df2 = df.filter(df.country == "IN")
df3 = df2.groupBy("country").count()

❓How many Spark jobs ran so far?
👉 ZERO

Spark has only:

  • Recorded transformations
  • Built a DAG

2️⃣ Actions = EXECUTION TRIGGERS

Common Spark Actions

Action            What it does
show()            Displays rows → runs a job
count()           Returns row count → runs a job
collect()         Pulls ALL rows to the driver → runs a job
write.parquet()   Writes output → runs a job
take(n)           Returns first n rows → runs a (smaller) job

📌 Every action = a full DAG execution


3️⃣ The Action Explosion Problem (VERY DANGEROUS)

❌ BAD CODE (COMMON IN PRODUCTION)

df = spark.read.parquet("s3://raw/sales/")
df_filtered = df.filter(df.country == "IN")

print(df_filtered.count())
print(df_filtered.count())
df_filtered.write.parquet("s3://curated/sales/")

❓How many times did Spark run?

👉 3 FULL JOBS


Each action:

  • Re-reads S3
  • Re-applies filters
  • Re-runs shuffles

📌 On AWS → triple cost


4️⃣ Why Spark Recomputes (LINEAGE REVISITED)

Spark does NOT remember results unless told to.

Why?

  • Lineage allows recomputation
  • Memory is limited
  • Spark assumes transformations are cheap

📌 Spark prefers recomputation over memory hoarding


5️⃣ Cache vs Persist (USE THIS CORRECTLY)

Cache (Default)

df_filtered.cache()
  • Memory only
  • Fast
  • Risk of eviction

Persist (Controlled)

from pyspark import StorageLevel

df_filtered.persist(StorageLevel.MEMORY_AND_DISK)

Storage Levels (Important)

  • MEMORY_ONLY
  • MEMORY_AND_DISK (most used)
  • DISK_ONLY

📌 Persist only if reused


6️⃣ When You SHOULD Cache (Golden Rules)

Cache when:

✔ DataFrame used in multiple actions
✔ Data reused across stages
✔ Expensive transformations (joins, aggregations)

Do NOT cache when:

❌ Single action
❌ Very large dataset
❌ Simple transformations

📌 Blind caching = memory pressure + failures


7️⃣ Cache Is Also LAZY (TRICKY BUT IMPORTANT)

df.cache()

❌ Does NOT cache immediately

Caching happens only when:

df.count()

📌 Cache + action = materialization


8️⃣ Unpersist (CLEAN UP LIKE A PRO)

df.unpersist()

Why this matters on AWS:

  • Executors have limited memory
  • Long pipelines = memory leaks
  • Glue jobs can fail due to memory pressure

9️⃣ AWS COST IMPACT (REAL LIFE)

Without cache:

  • Every action re-reads S3
  • More EMR runtime
  • More Glue DPU-hours

With smart cache:

  • Fewer S3 reads
  • Faster jobs
  • Lower bill

📌 Caching strategy = cloud cost strategy


🔥 Interview Trap Questions (ANSWER CONFIDENTLY)

Q: Why is .count() dangerous?
✔ It triggers full job execution

Q: Why doesn’t Spark cache by default?
✔ Memory is limited and recomputation is safer

Q: Difference between cache and persist?
✔ Persist gives storage control


🧪 DAY 5 THINKING EXERCISE (IMPORTANT)

Look at this:

df = spark.read.parquet("s3://raw/sales/")
df1 = df.filter(...)
df2 = df1.groupBy(...).agg(...)

df2.count()
df2.write.parquet("s3://curated/")

Questions:

  1. How many jobs run?
  2. Where should you cache?
  3. What happens if df1 is cached instead of df2?

(We’ll revisit this in tuning days.)


🧠 DAY 5 MEMORY ANCHORS

No action → no execution
Each action → full DAG
Cache only if reused
Cache is lazy too
Unpersist matters

🎤 Resume-Grade Line (You Earned This)

Strong understanding of Spark lazy evaluation, action-triggered execution, and effective caching strategies to optimize performance and cost on AWS


⏭️ DAY 6 — SparkSession, Configs & Runtime

(Why Defaults Hurt You on AWS)

Tomorrow you’ll learn:

  • SparkSession internals
  • Config hierarchy
  • Glue vs EMR defaults
  • Where configs REALLY apply

This is control & tuning foundation.


Reply with:

DAY 6

and we continue 🚀