Fantastic 🔥
Now you’re officially doing real cloud Spark, exactly how it’s done in industry.
⚡ LESSON 5 — PySpark on AWS EMR
(Spark on YARN → Spark on EMR | Hands-On + Resume-Grade)
🎯 Lesson Outcome
By the end of this lesson, you will:
- Launch an EMR Spark cluster
- Attach IAM roles correctly
- Run PySpark jobs
- Read data from S3
- Write transformed data back to S3
- Understand EMR vs Databricks (interview-ready)
🧠 Mental Model First (IMPORTANT)
EMR Cluster
├── EC2 Master (Driver)
├── EC2 Core (Executors)
├── Spark
├── Glue Catalog (Metastore)
└── S3 (Data)
📌 EMR = Managed EC2 + Spark + Hadoop ecosystem
📌 S3 is the storage, NOT HDFS
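A quick way to internalize the storage difference: the Spark read call is identical, only the path scheme changes. A minimal sketch (the bucket name is the one used in this lesson; the HDFS path is purely illustrative):

```python
# On-prem: data lives in HDFS, addressed via hdfs:// paths (illustrative path)
df_onprem = spark.read.csv("hdfs:///data/raw/sales/", header=True)

# EMR: data lives in S3, addressed via s3:// paths; nothing else in the code changes
df_emr = spark.read.csv("s3://rajeev-data-lake-2026/raw/sales/", header=True)
```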
🧩 EMR Architecture (Visual)
(Diagram: Master and Core EC2 nodes run Spark on YARN; the Glue Catalog acts as the metastore and S3 holds the data.)
1️⃣ Create EMR Cluster (Hands-On)
🔹 Go to:
AWS Console → EMR → Create cluster
Choose:
👉 Create cluster – Advanced options
🔹 Software Configuration
- Release:
emr-6.x
- Applications:
- ✅ Spark
- ❌ Hive (optional, not required now)
🔹 Hardware (COST-SAFE)
- Instance type:
m5.xlarge (or m5.large if shown)
- Nodes:
1 Master
1 Core
📌 EMR is not part of the AWS Free Tier, so keep the cluster short-lived to keep costs minimal
🔹 Storage
- Root volume: default (no change)
2️⃣ Security & IAM (CRITICAL STEP)
🔹 EC2 Instance Profile
Choose:
EMR-S3-Access-Role
(This is the role you created earlier)
🔹 EMR Service Role
Use default:
EMR_DefaultRole
📌 If IAM is wrong → Spark fails (classic mistake)
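If Spark later fails with S3 "Access Denied" errors, these two roles are almost always the culprit. A minimal sanity check from any machine with AWS credentials, assuming boto3 and the role names above (adjust the names if yours differ):

```python
import boto3

iam = boto3.client("iam")

# The EC2 instance profile the cluster nodes assume (this one needs S3 permissions)
profile = iam.get_instance_profile(InstanceProfileName="EMR-S3-Access-Role")
print(profile["InstanceProfile"]["Arn"])

# The service role EMR itself uses to provision and manage EC2
role = iam.get_role(RoleName="EMR_DefaultRole")
print(role["Role"]["Arn"])
```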
3️⃣ Create Cluster
Click Create cluster
⏳ Wait 5–10 minutes
Status should become:
Waiting
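The same cluster can also be created programmatically, which becomes useful once we automate things in Lesson 6. A hedged boto3 sketch mirroring the console settings above (region, cluster name, release label, and log prefix are assumptions; adjust to your account):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

response = emr.run_job_flow(
    Name="pyspark-lesson-5",                        # assumed cluster name
    ReleaseLabel="emr-6.15.0",                      # any emr-6.x release works
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 1},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,        # keeps the cluster in "Waiting" for interactive use
        "TerminationProtected": False,
    },
    JobFlowRole="EMR-S3-Access-Role",               # EC2 instance profile from step 2
    ServiceRole="EMR_DefaultRole",                  # EMR service role
    LogUri="s3://rajeev-data-lake-2026/emr-logs/",  # assumed log prefix
)

cluster_id = response["JobFlowId"]
print("Cluster:", cluster_id)

# Block until the cluster is up (RUNNING or WAITING)
emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
```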
🧪 TASK 1 (Reply Required)
EMR cluster status:
Cluster name:
4️⃣ Connect to EMR (Web UI – Easy Mode)
🔹 Go to:
EMR → Clusters → Your cluster
Open:
👉 Spark History Server
👉 YARN Resource Manager
📌 This is how production Spark jobs are monitored
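The links open straight from the console, but the master node's address is also available programmatically, which is handy for scripting checks. A sketch assuming boto3 and your cluster ID (the ports are the EMR defaults; they are reachable only if your security group or an SSH tunnel allows it):

```python
import boto3

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXXX"  # your cluster ID from the console or the creation step

cluster = emr.describe_cluster(ClusterId=cluster_id)
dns = cluster["Cluster"]["MasterPublicDnsName"]

print(f"Spark History Server: http://{dns}:18080")
print(f"YARN ResourceManager: http://{dns}:8088")
```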
5️⃣ Run PySpark on EMR (CORE SKILL)
🔹 Open a PySpark session on the cluster:
- EMR → Notebooks (attach a notebook to your cluster), or
- SSH into the master node and run pyspark
(If the notebook option isn't available, we'll use step execution next)
🔥 PySpark Code (READ FROM S3)
```python
spark.read \
    .format("csv") \
    .option("header", "true") \
    .load("s3://rajeev-data-lake-2026/raw/sales/") \
    .show()
```
📌 Spark uses:
- IAM Role → access S3
- Glue Catalog → schema (if using tables)
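If the sales data is registered as a Glue table (for example via a crawler from an earlier lesson), Spark on EMR can read it through the catalog instead of by path. A sketch with assumed names (`sales_db.raw_sales` is hypothetical; it requires the cluster to use the Glue Data Catalog as its metastore):

```python
# Schema comes from the Glue Catalog, not from the files themselves
df_tbl = spark.table("sales_db.raw_sales")   # hypothetical database.table
df_tbl.printSchema()

# Equivalent SQL form
spark.sql("SELECT country, COUNT(*) AS cnt FROM sales_db.raw_sales GROUP BY country").show()
```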
🔥 Transform & Write Back
```python
df = spark.read.option("header", "true") \
    .csv("s3://rajeev-data-lake-2026/raw/sales/")

df_clean = df.groupBy("country") \
    .count()

df_clean.write.mode("overwrite") \
    .parquet("s3://rajeev-data-lake-2026/curated/sales/")
```
🎉 You just built a cloud Spark ETL
6️⃣ Verify Output (S3)
Go to:
s3://rajeev-data-lake-2026/curated/sales/
You should see:
- part-0000*.parquet
- _SUCCESS
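You can also sanity-check the output from Spark itself by reading the Parquet back (same bucket and prefix as above):

```python
curated = spark.read.parquet("s3://rajeev-data-lake-2026/curated/sales/")
curated.show()         # expect one row per country with its count
curated.printSchema()  # the count column should be a long
```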
🧠 On-Prem → EMR Mapping (MEMORIZE)
| On-Prem | AWS EMR |
|---|---|
| YARN | YARN (managed) |
| HDFS | S3 |
| Hive Metastore | Glue Catalog |
| Spark-submit | EMR Step |
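The last row of the table is worth seeing concretely: what spark-submit does on-prem, an EMR Step does via command-runner.jar. A hedged boto3 sketch (the script location on S3 is an assumption; Lesson 6 goes deeper into Steps):

```python
import boto3

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXXX"  # your cluster ID from the console or the creation step

emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[{
        "Name": "sales-etl",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://rajeev-data-lake-2026/scripts/sales_etl.py",  # assumed script location
            ],
        },
    }],
)
```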
💡 Interview GOLD (Very Important)
✔ EMR uses EC2 under the hood
✔ IAM roles provide secure S3 access
✔ Spark does not store data on EMR
✔ EMR clusters can be ephemeral
🧾 Resume Bullets You Earn
- Developed PySpark ETL pipelines on AWS EMR integrating S3 and Glue Catalog
- Optimized Spark jobs using partitioned Parquet outputs on S3
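The second bullet mentions partitioned Parquet, while the job in this lesson writes unpartitioned output. If you want the job to actually match the bullet, a minimal variant (assuming the raw data has a country column, as in the groupBy above; the output prefix is an assumption):

```python
df = spark.read.option("header", "true") \
    .csv("s3://rajeev-data-lake-2026/raw/sales/")

# Partition the curated output by country so downstream reads can prune partitions
df.write.mode("overwrite") \
    .partitionBy("country") \
    .parquet("s3://rajeev-data-lake-2026/curated/sales_partitioned/")  # assumed new prefix
```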
⚠️ COST SAFETY (DO THIS)
When finished:
👉 Terminate EMR cluster
📌 Never leave EMR running idle
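Termination can also be scripted, which matters once clusters are spun up by automation. A sketch using boto3 with the same cluster ID as before:

```python
import boto3

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXXX"  # your cluster ID from the console or the creation step

emr.terminate_job_flows(JobFlowIds=[cluster_id])

# Block until the cluster is fully terminated
emr.get_waiter("cluster_terminated").wait(ClusterId=cluster_id)
```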
⏭ NEXT LESSON (PRODUCTION-GRADE)
🔄 LESSON 6 — Orchestration & Automation
You will:
- Use EMR Steps
- Introduce Airflow (MWAA)
- Add retry, logging
- Build resume-level pipelines
🔔 Quick Confirmation (Reply)
EMR created:
Spark job ran:
Data written to curated:
Once confirmed, we move into Lesson 6: Automation & Orchestration 🚀