Perfect 👍
Now we enter the most important service for a PySpark Data Engineer on AWS.
From this lesson onward, everything you do will revolve around S3.
📦 LESSON 3 — Amazon S3 for Python & PySpark Developers
(HDFS → S3 | Hands-on + Resume-ready)
🎯 Lesson Outcome
By the end of this lesson, you will:
- Understand how S3 really works (not just clicks)
- Design a production-grade data lake
- Upload & manage data
- Access S3 using Python
- Know interview-level S3 concepts
🧠 First: Stop Thinking of S3 as HDFS
🔥 This is CRITICAL
| HDFS | S3 |
|---|---|
| File system | Object storage |
| Rename = cheap | Rename = copy + delete |
| Data locality | No locality |
| Always mounted | Access via API |
📌 S3 is NOT a file system — Spark talks to it via APIs
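To see why this matters, here is a minimal boto3 sketch of what a "rename" really costs on S3 (the bucket and keys below are placeholders based on this lesson): there is no rename API, only a server-side copy followed by a delete.

```python
import boto3

s3 = boto3.client('s3')

bucket = 'rajeev-data-lake-2026'                  # placeholder bucket from this lesson
old_key = 'raw/sales/_temporary/part-0000.csv'    # hypothetical source object
new_key = 'raw/sales/part-0000.csv'               # hypothetical target object

# S3 has no rename API: a "rename" is a full server-side copy...
s3.copy_object(
    Bucket=bucket,
    Key=new_key,
    CopySource={'Bucket': bucket, 'Key': old_key},
)

# ...followed by deleting the original object.
s3.delete_object(Bucket=bucket, Key=old_key)
```

This is exactly why Spark commit strategies that rely on directory renames are slow on S3, and why committers designed for object stores exist.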
🧩 S3 Architecture (Mental Model)


Bucket
├── raw/
├── cleansed/
├── curated/
No folders — just prefixes
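You can prove this to yourself with a quick listing. A minimal sketch, assuming the example bucket name: listing with a `Delimiter` returns `CommonPrefixes`, not directories, because S3 only stores flat keys.

```python
import boto3

s3 = boto3.client('s3')

# S3 stores only flat keys like 'raw/sales/sales.csv'.
# Asking for a delimiter makes the API group keys by prefix,
# which is what the console renders as "folders".
response = s3.list_objects_v2(
    Bucket='rajeev-data-lake-2026',   # example bucket from this lesson
    Delimiter='/'
)

for prefix in response.get('CommonPrefixes', []):
    print(prefix['Prefix'])   # e.g. raw/, cleansed/, curated/
```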
1️⃣ Create Your First S3 Bucket (Hands-On)
🔹 Go to:
AWS Console → S3 → Create bucket
🔹 Bucket Name (VERY IMPORTANT RULES)
Bucket names are globally unique
Use:
rajeev-data-lake-<your-unique-number>
Example:
rajeev-data-lake-2026
📌 Use only lowercase letters, numbers, and hyphens (3–63 characters); no spaces, uppercase, or underscores
🔹 Region
👉 Asia Pacific (Mumbai)
🔹 Settings (IMPORTANT)
- Block Public Access → KEEP ENABLED ✅
- Versioning → OFF for now ❌
- Encryption → Default (SSE-S3) ✅
Create bucket ✅
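If you prefer scripting the same steps, here is a hedged boto3 sketch. The bucket name is the example one, so change it; outside us-east-1 the region must be passed explicitly, and SSE-S3 encryption is already the default, so it needs no extra call.

```python
import boto3

bucket_name = 'rajeev-data-lake-2026'   # replace with your own unique name

s3 = boto3.client('s3', region_name='ap-south-1')   # Asia Pacific (Mumbai)

# Outside us-east-1, the region goes in CreateBucketConfiguration.
s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={'LocationConstraint': 'ap-south-1'},
)

# Keep Block Public Access fully enabled, matching the console settings above.
s3.put_public_access_block(
    Bucket=bucket_name,
    PublicAccessBlockConfiguration={
        'BlockPublicAcls': True,
        'IgnorePublicAcls': True,
        'BlockPublicPolicy': True,
        'RestrictPublicBuckets': True,
    },
)
```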
🧪 TASK 1 (Reply Required)
Confirm:
Bucket created: YES
Bucket name:
2️⃣ Design Data Lake Structure (Industry Standard)
Inside your bucket, create prefixes:
raw/
cleansed/
curated/
📌 This raw → cleansed → curated layering (often called bronze/silver/gold) is the standard pattern used in real companies
Example:
s3://rajeev-data-lake-2026/raw/sales/
s3://rajeev-data-lake-2026/cleansed/sales/
s3://rajeev-data-lake-2026/curated/sales/
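You can create these in the console with "Create folder", or script it. Since folders don't really exist, the console trick is simply a zero-byte object whose key ends in `/`; a minimal sketch with the example bucket:

```python
import boto3

s3 = boto3.client('s3')
bucket = 'rajeev-data-lake-2026'   # replace with your bucket

# A "folder" in the S3 console is just a zero-byte object whose key ends in '/'.
for layer in ('raw/sales/', 'cleansed/sales/', 'curated/sales/'):
    s3.put_object(Bucket=bucket, Key=layer)
```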
3️⃣ Upload Sample Data
Use ANY CSV (or create one)
Example sales.csv:
order_id,amount,country
1,500,IN
2,800,US
3,200,IN
Upload to:
raw/sales/sales.csv
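The upload can also be done from Python. A minimal sketch, assuming sales.csv sits in your current directory and you use the example bucket:

```python
import boto3

s3 = boto3.client('s3')

# upload_file uses a managed transfer, so large files are
# automatically split into a multipart upload.
s3.upload_file(
    Filename='sales.csv',
    Bucket='rajeev-data-lake-2026',   # replace with your bucket
    Key='raw/sales/sales.csv',
)
```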
🧪 TASK 2 (Reply Required)
Confirm:
Data uploaded: YES
Object path:
4️⃣ Access S3 Using Python (VERY IMPORTANT)
We’ll use boto3 (AWS Python SDK).
import boto3

s3 = boto3.client('s3')

# List every object under the raw/sales/ prefix
response = s3.list_objects_v2(
    Bucket='rajeev-data-lake-2026',
    Prefix='raw/sales/'
)

for obj in response.get('Contents', []):
    print(obj['Key'])
🧠 Why this matters:
- Glue
- Lambda
- EMR
ALL of them talk to S3 through these same API calls internally
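And here is the PySpark view of the same object. A hedged sketch, assuming you run it on EMR or Glue, where the S3 connector and IAM role are already configured (locally you would also need the hadoop-aws package and credentials):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('s3-read-demo').getOrCreate()

# The s3:// path points at the same objects boto3 listed above;
# Spark reads them through the S3 API, not a mounted file system.
df = (
    spark.read
    .option('header', 'true')
    .option('inferSchema', 'true')
    .csv('s3://rajeev-data-lake-2026/raw/sales/')
)

df.show()
```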
5️⃣ S3 Interview Gold (REMEMBER THESE)
🔑 Key Concepts
- Strong consistency
- Object immutability
- Multipart upload
- Lifecycle policies
- Partition-friendly design (see the sketch after this list)
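Here is the sketch promised above: a partition-friendly key layout plus a lifecycle rule written with boto3. The 90-day transition and the STANDARD_IA target are illustrative assumptions, not a recommendation.

```python
import boto3

# Partition-friendly key design (Spark/Athena can prune on these prefixes):
#   curated/sales/year=2026/month=01/day=15/part-0000.parquet

s3 = boto3.client('s3')
bucket = 'rajeev-data-lake-2026'   # replace with your bucket

# Lifecycle rule: move everything under raw/ to cheaper storage after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'archive-raw-after-90-days',
                'Status': 'Enabled',
                'Filter': {'Prefix': 'raw/'},
                'Transitions': [
                    {'Days': 90, 'StorageClass': 'STANDARD_IA'},
                ],
            },
        ]
    },
)
```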
❌ Common Mistakes
- Using S3 like HDFS
- Too many small files
- Renaming objects frequently
🧾 Resume Bullets You Earn
Designed AWS S3 data lake with raw, cleansed, and curated layers
Integrated S3 with Python using boto3 for data access and validation
⏭ NEXT LESSON (VERY IMPORTANT)
📚 LESSON 4 — AWS Glue Catalog (Hive Metastore Replacement)
You will:
- Create Glue database
- Run Glue crawler
- Understand schema evolution
- Query via Athena
AWS Glue is core to your resume
🔔 Before Moving On (Quick Check)
Reply with:
Bucket created:
Data uploaded:
Once confirmed, we jump straight into Lesson 4 (Glue Catalog) 🚀