Fantastic 🔥
Now you’re officially doing real cloud Spark, exactly how it’s done in industry.


⚡ LESSON 5 — PySpark on AWS EMR

(Spark on YARN → Spark on EMR | Hands-On + Resume-Grade)



🎯 Lesson Outcome

By the end of this lesson, you will:

  • Launch an EMR Spark cluster
  • Attach IAM roles correctly
  • Run PySpark jobs
  • Read data from S3
  • Write transformed data back to S3
  • Understand EMR vs Databricks (interview-ready)

🧠 Mental Model First (IMPORTANT)

EMR Cluster
 ├── EC2 primary (master) node → Spark driver
 ├── EC2 core node(s)          → Spark executors
 ├── Spark (running on YARN)
 ├── Glue Data Catalog (metastore)
 └── S3 (data lake storage)

📌 EMR = Managed EC2 + Spark + Hadoop ecosystem
📌 S3 is the storage, NOT HDFS
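
To make the "S3, not HDFS" point concrete: the code is the same as on-prem Spark, only the path scheme changes. A minimal sketch (the bucket name matches the one used later in this lesson, and it assumes a pyspark shell or notebook where spark already exists):

# On-prem, the same read would point at HDFS:
#   spark.read.option("header", "true").csv("hdfs:///data/raw/sales/")

# On EMR you point straight at S3 — the EMRFS connector plus the
# cluster's IAM role handle authentication, no keys in code.
df = spark.read.option("header", "true").csv("s3://rajeev-data-lake-2026/raw/sales/")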


🧩 EMR Architecture (Visual)

(Architecture diagram: EC2 primary + core nodes running Spark on YARN, reading and writing data in S3, schemas served by the Glue Data Catalog)

1️⃣ Create EMR Cluster (Hands-On)

🔹 Go to:

AWS Console → EMR → Create cluster

Choose:
👉 Create cluster – Advanced options


🔹 Software Configuration

  • Release:
emr-6.x
  • Applications:
    • ✅ Spark
    • ❌ Hive (optional, not required now)

🔹 Hardware (COST-SAFE)

  • Instance type:
m5.xlarge  (or m5.large if shown)
  • Nodes:
1 Master
1 Core

📌 Note: EMR is not covered by the AWS Free Tier, but a small two-node cluster that you terminate within the hour costs very little


🔹 Storage

  • Root volume: default (no change)

2️⃣ Security & IAM (CRITICAL STEP)

🔹 EC2 Instance Profile

Choose:

EMR-S3-Access-Role

(This is the role you created earlier)

🔹 EMR Service Role

Use default:

EMR_DefaultRole

📌 If the instance profile is wrong, Spark reads/writes to S3 fail with Access Denied errors (a classic mistake)


3️⃣ Create Cluster

Click Create cluster

⏳ Wait 5–10 minutes
Status should become:

Waiting
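
If you'd rather script the cluster than click through the console, here's a minimal boto3 sketch of the same setup. The role names match the ones above; the cluster name, region, and exact emr-6.x release label are placeholders — adjust them to what your console shows:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder — use your region

response = emr.run_job_flow(
    Name="pyspark-lesson5",                 # placeholder cluster name
    ReleaseLabel="emr-6.15.0",              # any recent emr-6.x release
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",  # primary node
        "SlaveInstanceType": "m5.xlarge",   # core node
        "InstanceCount": 2,                 # 1 primary + 1 core
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR-S3-Access-Role",       # EC2 instance profile (S3 access)
    ServiceRole="EMR_DefaultRole",          # EMR service role
    VisibleToAllUsers=True,
)

print(response["JobFlowId"])                # cluster ID — you'll need it to terminate later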

🧪 TASK 1 (Reply Required)

EMR cluster status:
Cluster name:

4️⃣ Connect to EMR (Web UI – Easy Mode)

🔹 Go to:

EMR → Clusters → Your cluster

Open:
👉 Spark History Server
👉 YARN Resource Manager

📌 This is how production Spark jobs are monitored


5️⃣ Run PySpark on EMR (CORE SKILL)

🔹 Open:

EMR → Notebooks (or SSH into the primary node and run pyspark)

(If notebook option isn’t available, we’ll use step execution next)


🔥 PySpark Code (READ FROM S3)

# Read the raw CSV files straight from S3 — the cluster's IAM role supplies the credentials
spark.read \
    .format("csv") \
    .option("header", "true") \
    .load("s3://rajeev-data-lake-2026/raw/sales/") \
    .show()

📌 Spark uses:

  • IAM Role → access S3
  • Glue Catalog → schema (if using tables — see the sketch below)
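
If you registered the raw data as a table in an earlier lesson (and enabled "Use AWS Glue Data Catalog for table metadata" when creating the cluster), Spark can read it through the catalog instead of a raw path. A sketch with hypothetical database/table names — substitute whatever your crawler actually created:

df = spark.sql("SELECT * FROM raw_db.sales")   # raw_db.sales is a hypothetical Glue table
df.printSchema()                               # schema comes from the Glue Data Catalog
df.show(5)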

🔥 Transform & Write Back

# Read raw CSVs from S3
df = spark.read.option("header", "true") \
    .csv("s3://rajeev-data-lake-2026/raw/sales/")

# Simple transformation: sales count per country
df_clean = df.groupBy("country") \
    .count()

# Write the result back to the curated zone as Parquet
df_clean.write.mode("overwrite") \
    .parquet("s3://rajeev-data-lake-2026/curated/sales/")

🎉 You just built a cloud Spark ETL
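
The same job as a standalone script (the kind you'd hand to spark-submit, or later to an EMR step) — a minimal sketch; the app name is just a suggestion:

from pyspark.sql import SparkSession

# In the shell/notebook `spark` already exists; a script has to build its own session
spark = SparkSession.builder.appName("sales-etl").getOrCreate()

df = spark.read.option("header", "true") \
    .csv("s3://rajeev-data-lake-2026/raw/sales/")

df_clean = df.groupBy("country").count()

df_clean.write.mode("overwrite") \
    .parquet("s3://rajeev-data-lake-2026/curated/sales/")

spark.stop()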


6️⃣ Verify Output (S3)

Go to:

s3://rajeev-data-lake-2026/curated/sales/

You should see:

  • part-*.snappy.parquet files (one per output partition)
  • a _SUCCESS marker file

(You can also verify from code — see the sketch below.)
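
To check from code instead of the console, a small boto3 sketch (assumes your usual AWS credentials and the same bucket):

import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="rajeev-data-lake-2026", Prefix="curated/sales/")

for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])   # expect part-*.parquet files plus _SUCCESS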

🧠 On-Prem → EMR Mapping (MEMORIZE)

On-Prem          →  AWS EMR
YARN             →  YARN (managed by EMR)
HDFS             →  S3
Hive Metastore   →  Glue Data Catalog
spark-submit     →  EMR Step

💡 Interview GOLD (Very Important)

✔ EMR runs on EC2 instances under the hood
✔ IAM roles (instance profiles) give Spark secure S3 access — no keys in code
✔ Data lives in S3, not on the cluster, so nothing is lost when the cluster goes away
✔ That's why EMR clusters can be ephemeral: spin up, run the job, terminate


🧾 Resume Bullets You Earn

Developed PySpark ETL pipelines on AWS EMR integrating S3 and Glue Catalog
Optimized Spark jobs using partitioned Parquet outputs on S3


⚠️ COST SAFETY (DO THIS)

When finished:
👉 Terminate EMR cluster

📌 Never leave EMR running idle
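
If you created the cluster from code, you can tear it down from code too — a sketch; the cluster ID placeholder is whatever run_job_flow (or the console) gave you:

import boto3

emr = boto3.client("emr", region_name="us-east-1")       # placeholder — use your region
emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])  # cluster ID from run_job_flow / console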


⏭ NEXT LESSON (PRODUCTION-GRADE)

🔄 LESSON 6 — Orchestration & Automation

You will:

  • Use EMR Steps
  • Introduce Airflow (MWAA)
  • Add retry, logging
  • Build resume-level pipelines



🔔 Quick Confirmation (Reply)

EMR created:
Spark job ran:
Data written to curated:

Once confirmed, we move into Lesson 6: Automation & Orchestration 🚀