Great — DAY 6 is where many data engineers get confused, so we’ll lock this down at architect level.
📅 DAY 6 — DATABASES ON AWS (MASTER LEVEL FOR DATA ENGINEERS)
RDS vs DynamoDB vs Glue Catalog — zero confusion forever
🎯 DAY 6 GOAL
By the end of today, you will:
- Clearly know WHEN to use which database
- Understand why most data engineers don’t manage DB servers
- Stop confusing Glue Catalog with databases
- Design real production metadata & application data patterns
- Answer database interview questions confidently
🧠 PART 1 — FIRST PRINCIPLE: WHY DATABASES EXIST
Core problem databases solve:
Fast, consistent access to structured data with transactional and durability guarantees
Not all data needs:
- Transactions
- Indexes
- ACID
- Millisecond latency
📌 Data engineers mostly process data, not serve transactions
🧠 PART 2 — AMAZON RDS (RELATIONAL DATABASES)


🧩 What RDS Is
Amazon RDS = managed:
- PostgreSQL
- MySQL
- MariaDB
- Oracle
- SQL Server
- Db2
- Aurora (MySQL- and PostgreSQL-compatible)
You get:
- Backups
- Patching
- High availability
- Scaling (to a point)
🧠 WHEN RDS IS USED (REAL LIFE)
✔ Application databases
✔ Metadata tables
✔ Configuration stores
✔ Small control tables
❌ NOT for big data analytics
❌ NOT for data lakes
🧠 RDS REAL-WORLD DATA ENGINEER USE CASES
- Airflow metadata DB
- Job status tables
- ETL control tables
- Audit tables
📌 RDS is NOT where raw data lives
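To make the "job status / ETL control table" idea concrete, here is a minimal sketch. It assumes Python's built-in sqlite3 as a local stand-in for an RDS PostgreSQL instance, so it runs without any AWS setup. The table name `etl_job_status`, its columns, and the helper functions are hypothetical, not a standard schema.

```python
import sqlite3

# Assumption: sqlite3 stands in for RDS PostgreSQL so the sketch runs locally.
# Table, column, and function names are hypothetical.


def create_control_table(conn):
    # One row per (job, run date): small, relational, and transactional.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS etl_job_status (
            job_name  TEXT NOT NULL,
            run_date  TEXT NOT NULL,
            status    TEXT NOT NULL,
            rows_read INTEGER,
            PRIMARY KEY (job_name, run_date)
        )
        """
    )


def mark_job(conn, job_name, run_date, status, rows_read=None):
    # "with conn" wraps the statement in a transaction: commit on success,
    # rollback if an exception is raised.
    with conn:
        conn.execute(
            """
            INSERT INTO etl_job_status (job_name, run_date, status, rows_read)
            VALUES (?, ?, ?, ?)
            ON CONFLICT (job_name, run_date)
            DO UPDATE SET status = excluded.status,
                          rows_read = excluded.rows_read
            """,
            (job_name, run_date, status, rows_read),
        )


def job_status(conn, job_name, run_date):
    row = conn.execute(
        "SELECT status FROM etl_job_status WHERE job_name = ? AND run_date = ?",
        (job_name, run_date),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
create_control_table(conn)
mark_job(conn, "daily_orders", "2024-01-01", "RUNNING")
mark_job(conn, "daily_orders", "2024-01-01", "SUCCEEDED", rows_read=1200)
print(job_status(conn, "daily_orders", "2024-01-01"))  # prints SUCCEEDED
```

Note what this table holds: a few rows of control data per job run. The raw order records themselves would live in S3, not here.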
🔐 RDS + VPC (IMPORTANT)
- RDS always runs inside a VPC
- Usually in a private subnet
- Accessed via:
  - EC2
  - Lambda (attached to the VPC)
🧠 PART 3 — AMAZON DYNAMODB (NO-SQL, SERVERLESS)

🧩 What DynamoDB Is
Amazon DynamoDB = serverless key-value / document DB
You get:
- Massive scale
- Single-digit ms latency
- No servers
- Auto scaling
🧠 WHEN DYNAMODB IS USED
✔ Event-driven systems
✔ High-scale lookups
✔ Session stores
✔ Job state tracking
❌ Complex joins
❌ Analytics queries
🧠 DynamoDB Mental Model
Primary Key
├── Partition Key (required)
└── Sort Key (optional)
📌 Bad key design = hot partitions, throttling, and expensive scans
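The key structure above can be sketched as the arguments you would pass when creating a table. The function below only builds a dictionary shaped like the keyword arguments of boto3's DynamoDB `create_table` call; it does not contact AWS. The table name `job_runs` and the attribute names `job_name` and `run_ts` are hypothetical, and the sketch assumes string-typed keys and on-demand billing.

```python
def table_definition(table_name, partition_key, sort_key=None):
    """Build keyword arguments shaped like boto3's DynamoDB create_table call.

    Assumption: both key attributes are strings ("S").
    """
    # The partition key is always required and uses KeyType "HASH".
    key_schema = [{"AttributeName": partition_key, "KeyType": "HASH"}]
    attributes = [{"AttributeName": partition_key, "AttributeType": "S"}]

    # The sort key is optional and uses KeyType "RANGE".
    if sort_key is not None:
        key_schema.append({"AttributeName": sort_key, "KeyType": "RANGE"})
        attributes.append({"AttributeName": sort_key, "AttributeType": "S"})

    return {
        "TableName": table_name,
        "KeySchema": key_schema,
        "AttributeDefinitions": attributes,
        "BillingMode": "PAY_PER_REQUEST",  # on-demand capacity, no servers to size
    }


# Hypothetical table: one partition per job, runs ordered by timestamp.
params = table_definition("job_runs", "job_name", "run_ts")
print(params["KeySchema"])
```

With boto3 installed and credentials configured, these arguments could be passed as `boto3.client("dynamodb").create_table(**params)`. Choosing `job_name` as the partition key only works well if requests are spread across many job names; a single very busy job would concentrate traffic on one partition.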
🧠 DynamoDB IN DATA PLATFORMS
Lambda → DynamoDB
Step Functions → DynamoDB
Job status → DynamoDB
📌 Often used instead of RDS for simple state
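Here is a rough sketch of what "simple state" can look like from a Lambda function: a conditional update that lets exactly one worker claim a job. The function only builds a dictionary shaped like the arguments of the low-level boto3 `update_item` call, so it runs without AWS access. The table layout (`job_name`, `run_ts`, `status`, `worker_id`) and the status values are assumptions for illustration.

```python
def claim_job_request(table_name, job_name, run_ts, worker_id):
    """Build arguments shaped like boto3's DynamoDB client update_item call.

    Assumption: items are keyed by job_name (partition) and run_ts (sort),
    and carry hypothetical "status" and "worker_id" attributes.
    """
    return {
        "TableName": table_name,
        "Key": {"job_name": {"S": job_name}, "run_ts": {"S": run_ts}},
        # "status" is a DynamoDB reserved word, so it is aliased as #s.
        "UpdateExpression": "SET #s = :running, worker_id = :w",
        # The write succeeds only if nobody has claimed the job yet.
        "ConditionExpression": "attribute_not_exists(#s) OR #s = :pending",
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {
            ":running": {"S": "RUNNING"},
            ":pending": {"S": "PENDING"},
            ":w": {"S": worker_id},
        },
    }


request = claim_job_request("job_runs", "daily_orders", "2024-01-01T00:00:00Z", "worker-1")
print(request["ConditionExpression"])
```

With boto3, this could be sent as `boto3.client("dynamodb").update_item(**request)`. If a second worker sends the same request after the first has succeeded, the condition fails and DynamoDB rejects the write, which is the behavior you want for job state without running a relational database.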
🧠 PART 4 — GLUE CATALOG IS NOT A DATABASE (CRITICAL)
❌ Common confusion
Glue stores data
✅ Reality
Glue stores metadata only
Glue Catalog Stores:
- Table name
- Column schema
- S3 location
- Partitions
Glue DOES NOT store:
- Rows
- Records
- Values
📌 Glue Data Catalog = managed, Hive Metastore-compatible replacement
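To see what "metadata only" means in practice, here is a small sketch that reads a table definition shaped like the `Table` object returned by the Glue `get_table` API. The sample values (database `sales`, table `orders`, the bucket path, the columns) are made up for illustration. Notice that nothing in the structure contains a row of data.

```python
# Hypothetical table definition, shaped like the "Table" object in a
# Glue get_table response. All values are invented for illustration.
sample_table = {
    "Name": "orders",
    "DatabaseName": "sales",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "order_id", "Type": "bigint"},
            {"Name": "amount", "Type": "double"},
        ],
        "Location": "s3://example-bucket/sales/orders/",
    },
    "PartitionKeys": [{"Name": "order_date", "Type": "string"}],
}


def describe(table):
    """Summarize what the catalog knows: names, schema, partitions, location."""
    storage = table["StorageDescriptor"]
    return {
        "table": f'{table["DatabaseName"]}.{table["Name"]}',
        "columns": [(col["Name"], col["Type"]) for col in storage["Columns"]],
        "partition_keys": [key["Name"] for key in table.get("PartitionKeys", [])],
        "data_location": storage["Location"],  # the rows live here, in S3
    }


summary = describe(sample_table)
print(summary["data_location"])  # prints s3://example-bucket/sales/orders/
```

The catalog answers "what does this table look like and where is it?". Reading the actual records is the job of an engine such as Athena or Spark pointed at the S3 location.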
🧠 PART 5 — DATABASE COMPARISON (SAVE THIS)
| Feature | RDS | DynamoDB | Glue Catalog |
|---|---|---|---|
| Data type | Relational | Key-value / document | Metadata |
| Stores data | ✅ | ✅ | ❌ |
| Serverless | ❌ | ✅ | ✅ |
| Joins | ✅ | ❌ | ❌ |
| Analytics | ❌ | ❌ | ❌ |
| Data Lake | ❌ | ❌ | ❌ |
📌 S3 stores data for analytics
🧠 PART 6 — REAL-LIFE ARCHITECTURE PATTERNS
🔹 Pattern 1 — ETL Metadata
RDS (Postgres)
→ Job config
→ Schedules
→ Status
🔹 Pattern 2 — Serverless State
Lambda / Step Functions
→ DynamoDB
→ Job state
🔹 Pattern 3 — Analytics
S3 (data)
+ Glue Catalog (schema)
+ Athena / Spark
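Pattern 3 can be sketched as an Athena query request. The function below builds a dictionary shaped like the arguments of boto3's Athena `start_query_execution` call and does not contact AWS. The database `sales`, the table `orders`, and the results bucket are hypothetical names.

```python
def athena_request(database, sql, output_location):
    """Build arguments shaped like boto3's Athena start_query_execution call."""
    return {
        "QueryString": sql,
        # The database name is resolved through the Glue Catalog (schema).
        "QueryExecutionContext": {"Database": database},
        # Query results are written back to S3, not to a database server.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical names throughout: database, table, and results bucket.
query = athena_request(
    "sales",
    "SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date",
    "s3://example-athena-results/queries/",
)
print(query["QueryExecutionContext"])
```

With boto3, this could be submitted as `boto3.client("athena").start_query_execution(**query)`. All three pieces of the pattern show up: S3 holds the data and the results, the catalog supplies the schema, and Athena is the compute.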
🧠 PART 7 — INTERVIEW TRAPS (IMPORTANT)
❌ “We store analytics data in RDS”
❌ “Glue is a database”
❌ “DynamoDB replaces all databases”
✔ Correct answers show trade-off thinking
🎤 INTERVIEW STATEMENTS (MASTER LEVEL)
✔ “RDS is used for transactional and metadata workloads, not analytics.”
✔ “DynamoDB provides serverless, low-latency access for stateful workflows.”
✔ “Glue Catalog stores schema metadata; actual data resides in S3.”
✔ “Analytics workloads should decouple storage (S3) from compute.”
🧪 DAY 6 THINKING EXERCISE
Answer mentally:
- Why is RDS a bad choice for large analytical queries?
- Why do serverless pipelines prefer DynamoDB over RDS?
- Why does Glue Catalog scale effortlessly?
🧠 DAY 6 MEMORY MAP (SAVE THIS)
RDS → Transactions / Metadata
DynamoDB → High-scale state
Glue → Metadata only
S3 → Actual data
⏭️ DAY 7 PREVIEW — WEEK 1 ARCHITECT REVIEW
Tomorrow we will:
- Connect Days 1–6 into ONE architecture
- Do a whiteboard-style explanation
- Identify gaps
- Lock fundamentals forever
Reply with:
DAY 7
You’re progressing exactly like a senior AWS data engineer 🚀