Perfect 👍
Below is a PRODUCTION-GRADE AWS Step Functions definition (JSON) that orchestrates a Spark job on EMR Serverless, with synchronous waiting, retries, failure handling, and SNS alerts.
This is the pattern senior data engineers typically use in production.
🧠 CONTEXT (VERY IMPORTANT – READ FIRST)
This Step Function will:
Start
→ Run EMR Serverless Spark job (SYNC)
→ If SUCCESS → Publish success SNS
→ If FAILURE → Retry → Publish failure SNS
→ End
📌 Key concept
We use .sync so Step Functions waits for the Spark job to finish.
Without .sync, the state completes as soon as the job is submitted (fire-and-forget), so there is nothing to retry or catch.
🧩 STEP FUNCTIONS — EMR SERVERLESS SPARK (JSON)
🔹 You can paste this directly into AWS Step Functions → Create State Machine → Code view
```json
{
  "Comment": "Serverless Spark ETL using EMR Serverless with retries and notifications",
  "StartAt": "RunSparkJob",
  "States": {
    "RunSparkJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::emr-serverless:startJobRun.sync",
      "Parameters": {
        "ApplicationId": "00f2abcd1234example",
        "ExecutionRoleArn": "arn:aws:iam::123456789012:role/EMRServerlessExecutionRole",
        "JobDriver": {
          "SparkSubmit": {
            "EntryPoint": "s3://rajeev-data-lake-2026/scripts/sales_etl.py",
            "EntryPointArguments": [
              "--input",
              "s3://rajeev-data-lake-2026/raw/sales/",
              "--output",
              "s3://rajeev-data-lake-2026/curated/sales/"
            ],
            "SparkSubmitParameters": "--conf spark.executor.memory=4g --conf spark.executor.cores=2"
          }
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 60,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "NotifyFailure"
        }
      ],
      "Next": "NotifySuccess"
    },
    "NotifySuccess": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:ap-south-1:123456789012:etl-success",
        "Message": "Spark ETL completed successfully using EMR Serverless"
      },
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:ap-south-1:123456789012:etl-failure",
        "Message": "Spark ETL failed after retries. Please check CloudWatch logs."
      },
      "End": true
    }
  }
}
```
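Once the state machine exists, you can start an execution from code. Here is a minimal boto3 sketch, assuming a hypothetical state machine ARN (substitute the one the console shows after creation):

```python
import json
import uuid
import boto3

# Placeholder ARN: replace with the ARN shown after you create the state machine.
STATE_MACHINE_ARN = "arn:aws:states:ap-south-1:123456789012:stateMachine:sales-etl"

sfn = boto3.client("stepfunctions", region_name="ap-south-1")

# Start one execution. Execution names must be unique, so we suffix a UUID.
# The input payload is optional here because the job parameters are
# hard-coded in the state machine definition.
response = sfn.start_execution(
    stateMachineArn=STATE_MACHINE_ARN,
    name=f"sales-etl-{uuid.uuid4()}",
    input=json.dumps({"trigger": "manual"}),
)
print(response["executionArn"])
```

In production this call typically lives in a Lambda triggered by an S3 event or an EventBridge schedule (see option 2️⃣ in the next steps below).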
🔍 DEEP EXPLANATION (SECTION BY SECTION)
🔹 .sync (MOST IMPORTANT)
"Resource": "arn:aws:states:::emr-serverless:startJobRun.sync"
✔ Step Functions waits until Spark finishes
✔ Gets SUCCESS / FAILURE
✔ Enables retry & catch
📌 Without .sync the state succeeds as soon as the job is submitted: no waiting, no retry/catch on job failure, no real orchestration
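Conceptually, .sync replaces the polling loop you would otherwise write yourself. A rough boto3 sketch of that manual loop, reusing the application ID, role, and script from the definition above (the 30-second poll interval is an arbitrary choice):

```python
import time
import boto3

emr = boto3.client("emr-serverless", region_name="ap-south-1")

# Submit the job (same parameters as the state machine, in the native API's casing).
job = emr.start_job_run(
    applicationId="00f2abcd1234example",
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessExecutionRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://rajeev-data-lake-2026/scripts/sales_etl.py",
            "entryPointArguments": [
                "--input", "s3://rajeev-data-lake-2026/raw/sales/",
                "--output", "s3://rajeev-data-lake-2026/curated/sales/",
            ],
        }
    },
)

# Poll until the job reaches a terminal state: this is the loop .sync removes.
while True:
    state = emr.get_job_run(
        applicationId="00f2abcd1234example", jobRunId=job["jobRunId"]
    )["jobRun"]["state"]
    if state in ("SUCCESS", "FAILED", "CANCELLED"):
        break
    time.sleep(30)

print("Job finished with state:", state)
```

With .sync, Step Functions runs this loop for you and surfaces the terminal state to Retry and Catch.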
🔹 Spark Job Definition
"EntryPoint": "s3://.../sales_etl.py"
✔ Spark code lives in S3
✔ Same script usable by:
- EMR
- EMR Serverless
- CI/CD
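Writing the full sales_etl.py is option 1️⃣ below, but a minimal sketch of the contract implied by EntryPoint and EntryPointArguments looks like this (the CSV input format and the dedup transformation are placeholder assumptions):

```python
import argparse
from pyspark.sql import SparkSession

# Parse the arguments passed via EntryPointArguments in the state machine.
parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)
parser.add_argument("--output", required=True)
args = parser.parse_args()

spark = SparkSession.builder.appName("sales_etl").getOrCreate()

# Placeholder transformation: read raw sales, drop exact duplicates,
# write curated data back to S3 as Parquet.
df = spark.read.option("header", "true").csv(args.input)
df.dropDuplicates().write.mode("overwrite").parquet(args.output)

spark.stop()
```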
🔹 Retry Strategy (Senior-Level)
"Retry": [{
"IntervalSeconds": 60,
"MaxAttempts": 3,
"BackoffRate": 2.0
}]
Meaning:
- Wait 60 s before the first retry, then 120 s, then 240 s (BackoffRate 2.0 doubles the interval each time)
- Give up after 3 retry attempts
- Exponential backoff, built in
📌 This replaces custom retry logic you would otherwise hand-roll
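The schedule follows from wait(n) = IntervalSeconds × BackoffRate^(n−1). A quick sanity check:

```python
interval, backoff, max_attempts = 60, 2.0, 3

# Wait before retry n is interval * backoff**(n - 1): 60s, 120s, 240s.
print([interval * backoff ** (n - 1) for n in range(1, max_attempts + 1)])
# [60.0, 120.0, 240.0]
```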
🔹 Failure Handling
"Catch": [{
"Next": "NotifyFailure"
}]
✔ Centralized failure handling
✔ No silent job failures
✔ Mandatory in real pipelines
🔹 SNS Alerts
"arn:aws:states:::sns:publish"
✔ Fans out to email directly, or to Slack / PagerDuty via HTTPS or Lambda subscriptions
✔ SLA-friendly
✔ Audit-ready
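Subscribing an endpoint to the topic is a one-time setup step. A minimal boto3 sketch, assuming a placeholder on-call address:

```python
import boto3

sns = boto3.client("sns", region_name="ap-south-1")

# One-time setup: subscribe an on-call inbox to the failure topic.
# The recipient must click the confirmation link SNS emails them.
sns.subscribe(
    TopicArn="arn:aws:sns:ap-south-1:123456789012:etl-failure",
    Protocol="email",
    Endpoint="oncall-data@example.com",  # placeholder address
)
```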
🧠 WHY THIS DESIGN HOLDS UP IN PRODUCTION
| Feature | Why It Matters |
|---|---|
| Serverless Spark | No idle cluster cost; you pay per job run |
| Step Functions | Workflow visibility and execution history |
| Retries | Fault tolerance against transient failures |
| SNS | Operational awareness when jobs succeed or fail |
| S3-based code | One versioned script, reproducible deployments |
💬 INTERVIEW-PERFECT ANSWER (USE THIS)
“We orchestrated EMR Serverless Spark jobs using AWS Step Functions with synchronous execution, built-in retries, and SNS-based alerting. This allowed us to run fault-tolerant, cost-optimized Spark pipelines without managing clusters.”
🔥 This answer = senior data engineer signal
🧾 RESUME BULLET (COPY-PASTE READY)
Orchestrated serverless PySpark pipelines using AWS Step Functions with EMR Serverless, implementing retries, failure handling, and SNS alerts
🚀 WHAT NEXT (VERY STRONG OPTIONS)
I can now:
1️⃣ Expand the sketch above into the full PySpark sales_etl.py
2️⃣ Write Lambda trigger code for Step Functions
3️⃣ Create GitHub repo structure (industry-style)
4️⃣ Convert this into diagram + interview Q&A
Tell me 👉 what you want next