Perfect. DAY 8 is where you officially step into real AWS Data Engineering.
Today we master metadata, which is the backbone of every scalable data platform.
📅 DAY 8 — AWS Glue Data Catalog (MASTER LEVEL)
Hive Metastore → Serverless, Shared, Governed Metadata
🎯 DAY 8 GOAL
By the end of today, you will:
- Understand what metadata really is (and why it matters)
- Master Glue Databases, Tables, Partitions
- Know how Glue replaces Hive Metastore
- Design multi-team, production-grade data catalogs
- Avoid the most common Glue interview & design mistakes
🧠 PART 1 — WHAT METADATA ACTUALLY MEANS (FIRST PRINCIPLES)
❌ Wrong understanding
Metadata = schema only
✅ Correct understanding
Metadata = everything needed to understand and query data
Metadata answers:
- What is the table called?
- Where is the data stored?
- What are column names & types?
- How is data partitioned?
- Who can access it?
📌 Data lives in S3. Glue only describes it.
🧩 GLUE CATALOG ARCHITECTURE (MENTAL MODEL)


S3 (Parquet / CSV / JSON)
↓
Glue Crawler / DDL
↓
Glue Data Catalog
↓
Spark (EMR / Glue)
Athena
Redshift Spectrum
🧠 One catalog → many engines
🧠 PART 2 — GLUE CATALOG CORE OBJECTS
1️⃣ Glue Database
Logical namespace (like Hive DB)
Example:
analytics_db
raw_db
finance_db
📌 No data stored here — just organization
2️⃣ Glue Table
Maps to S3 location + schema
A table contains:
- Table name
- S3 path
- Columns & data types
- Partition keys
- SerDe info
📌 Behaves like a Hive external table: dropping the table removes the metadata, not the files in S3
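To make the list above concrete, here is a minimal sketch of how those fields could be laid out in the `TableInput` structure that boto3's `glue.create_table` accepts. The helper name `build_table_input` is hypothetical, and the Parquet format classes are an assumption about the file format; only the dict is built here, so nothing touches AWS.

```python
from typing import Dict, List, Tuple

# Hive class names commonly used for Parquet-backed tables (assumed format).
PARQUET_INPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
PARQUET_OUTPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"
PARQUET_SERDE = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"

Columns = List[Tuple[str, str]]  # (column name, type) pairs


def build_table_input(name: str, location: str,
                      columns: Columns, partition_keys: Columns) -> Dict:
    """Hypothetical helper: assemble a Glue TableInput dict for an external table."""
    def as_cols(pairs: Columns) -> List[Dict[str, str]]:
        return [{"Name": n, "Type": t} for n, t in pairs]

    return {
        "Name": name,                              # table name
        "TableType": "EXTERNAL_TABLE",             # data stays in S3
        "PartitionKeys": as_cols(partition_keys),  # partition keys
        "StorageDescriptor": {
            "Columns": as_cols(columns),           # columns & data types
            "Location": location,                  # S3 path
            "InputFormat": PARQUET_INPUT,
            "OutputFormat": PARQUET_OUTPUT,
            "SerdeInfo": {"SerializationLibrary": PARQUET_SERDE},  # SerDe info
        },
    }


sales = build_table_input(
    "sales",
    "s3://company-data/curated/sales/",
    columns=[("order_id", "int"), ("amount", "double"), ("country", "string")],
    partition_keys=[("date", "string")],
)
# With credentials configured, the dict could be passed along as:
#   boto3.client("glue").create_table(DatabaseName="curated_db", TableInput=sales)
```

Notice that nothing in the structure holds rows. It is only a description of where the files live and how to read them.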
3️⃣ Partitions (VERY IMPORTANT)
Partitions are catalog entries that point to S3 prefixes. A folder in S3 is not a partition until it is registered in the catalog.
Example:
country=IN/date=2025-01-01
🧠 When a query filters on partition columns, engines skip non-matching partitions → less data scanned, lower cost & faster queries
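A toy illustration of pruning, assuming Hive-style `key=value` prefixes like the example above. The function names `parse_partition` and `prune` are made up for this sketch; real engines do this inside the query planner using the catalog's partition list.

```python
from typing import Dict, List


def parse_partition(prefix: str) -> Dict[str, str]:
    """Split a Hive-style prefix such as 'country=IN/date=2025-01-01' into key/value pairs."""
    parts = [p for p in prefix.strip("/").split("/") if "=" in p]
    return dict(p.split("=", 1) for p in parts)


def prune(partitions: List[str], **filters: str) -> List[str]:
    """Keep only the partitions whose values match every filter."""
    kept = []
    for prefix in partitions:
        values = parse_partition(prefix)
        if all(values.get(key) == wanted for key, wanted in filters.items()):
            kept.append(prefix)
    return kept


# Partitions registered in the catalog (illustrative values).
registered = [
    "country=IN/date=2025-01-01",
    "country=IN/date=2025-01-02",
    "country=US/date=2025-01-01",
]

# Equivalent of: WHERE country = 'IN' AND date = '2025-01-01'
to_scan = prune(registered, country="IN", date="2025-01-01")
```

Only one of the three prefixes survives the filter, so only that prefix would be read from S3.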
🧠 PART 3 — HOW TABLES ARE CREATED (2 WAYS)
🔹 Option 1: Glue Crawler (MOST COMMON)
Crawler:
- Scans S3
- Infers schema
- Creates/updates tables
Used when:
- Data arrives continuously
- Schema may evolve
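A rough sketch of what a crawler definition might look like as keyword arguments for boto3's `glue.create_crawler`. The crawler name, role ARN, and raw S3 path are placeholders, and `build_crawler_config` is a hypothetical helper; the code only builds the dict and does not call AWS.

```python
from typing import Dict


def build_crawler_config(name: str, role_arn: str,
                         database: str, s3_path: str) -> Dict:
    """Hypothetical helper: keyword arguments for glue.create_crawler."""
    return {
        "Name": name,
        "Role": role_arn,                            # IAM role the crawler assumes
        "DatabaseName": database,                    # where tables get created
        "Targets": {"S3Targets": [{"Path": s3_path}]},  # what to scan
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",  # apply inferred schema changes
            "DeleteBehavior": "LOG",                 # only log objects that disappeared
        },
    }


config = build_crawler_config(
    "raw-sales-crawler",                                # placeholder name
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder role ARN
    "raw_db",
    "s3://company-data/raw/sales/",                     # assumed raw path
)
# With credentials configured:
#   boto3.client("glue").create_crawler(**config)
```

`UPDATE_IN_DATABASE` is what lets the crawler follow an evolving schema. That is convenient for raw data and risky where the schema should stay fixed.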
🔹 Option 2: DDL (ADVANCED / CONTROLLED)
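CREATE EXTERNAL TABLE sales (
  order_id int,
  amount double,
  country string
)
PARTITIONED BY (`date` string)  -- `date` is a reserved word in Athena DDL, so it needs backticks
STORED AS PARQUET               -- assumption: curated files are Parquet
LOCATION 's3://company-data/curated/sales/';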
Used when:
- Schema is fixed
- Strict governance required
📌 Senior teams prefer DDL for curated layers
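With DDL-managed tables, new partitions are not discovered automatically; they have to be registered, for example with `ALTER TABLE ... ADD PARTITION`. Below is a small sketch that generates such a statement. `add_partition_ddl` is a hypothetical helper, and it does no escaping, so it assumes trusted partition values.

```python
def add_partition_ddl(table: str, base_location: str, **partition: str) -> str:
    """Hypothetical helper: build an ALTER TABLE ... ADD PARTITION statement.

    Keys are backticked because names such as `date` are reserved in Athena DDL.
    Values are inserted as-is (no escaping), so only pass trusted input.
    """
    spec = ", ".join(f"`{key}`='{value}'" for key, value in partition.items())
    suffix = "/".join(f"{key}={value}" for key, value in partition.items())
    location = f"{base_location.rstrip('/')}/{suffix}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


ddl = add_partition_ddl(
    "sales",
    "s3://company-data/curated/sales/",
    date="2025-01-01",
)
```

The resulting string can be run in Athena by the pipeline that writes the new files, which keeps catalog changes explicit and reviewable.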
🧠 PART 4 — SCHEMA EVOLUTION (REAL-LIFE PROBLEM)
What happens when schema changes?
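| Change | Result |
|---|---|
| New column | ✅ Usually safe when appended at the end |
| Column reorder | ⚠️ Depends on format (name-based like Parquet vs positional like CSV) |
| Type change | ❌ Dangerous (existing files may fail to read or be misread) |
| Column deletion | ❌ Breaking for any query that references the column |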
🧠 Production strategy:
- Raw: flexible
- Curated: strict
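One way a "strict curated" rule could be enforced is a schema check before publishing. The sketch below compares two schemas and flags the change types from the table; `diff_schema` and `is_breaking` are illustrative names, and treating removals and type changes as breaking is this lesson's rule of thumb, not a Glue feature.

```python
from typing import Dict, List, Tuple

Schema = List[Tuple[str, str]]  # ordered (column name, type) pairs


def diff_schema(old: Schema, new: Schema) -> Dict[str, object]:
    """Classify the differences between two schemas."""
    old_types, new_types = dict(old), dict(new)
    added = [col for col, _ in new if col not in old_types]
    removed = [col for col, _ in old if col not in new_types]
    retyped = [col for col, typ in old
               if col in new_types and new_types[col] != typ]
    # Reorder: columns present in both schemas appear in a different order.
    old_order = [col for col, _ in old if col in new_types]
    new_order = [col for col, _ in new if col in old_types]
    return {
        "added": added,
        "removed": removed,
        "retyped": retyped,
        "reordered": old_order != new_order,
    }


def is_breaking(diff: Dict[str, object]) -> bool:
    """Rule of thumb: deletions and type changes break consumers."""
    return bool(diff["removed"]) or bool(diff["retyped"])


old = [("order_id", "int"), ("amount", "double"), ("country", "string")]
new = [("order_id", "bigint"), ("amount", "double"),
       ("country", "string"), ("channel", "string")]

changes = diff_schema(old, new)
```

Here `changes` reports `channel` as added and `order_id` as retyped, so `is_breaking(changes)` is true and a curated pipeline could refuse the update.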
🧠 PART 5 — GLUE CATALOG VS DATABASES (NO CONFUSION)
| Feature | Glue Catalog | RDS |
|---|---|---|
| Stores data | ❌ | ✅ |
| Stores schema | ✅ | ✅ |
| Query engine | ❌ | ✅ |
| Serverless | ✅ | ❌ |
📌 Glue is metadata only
🧠 PART 6 — REAL-WORLD CATALOG DESIGN (VERY IMPORTANT)
🏗️ Enterprise Pattern
raw_db → source-owned
cleansed_db → engineering-owned
curated_db → analytics-owned
Benefits:
- Ownership clarity
- Access control
- Fewer breaking changes
🧠 PART 7 — ACCESS CONTROL (HIGH-LEVEL)
Glue permissions are enforced via:
- IAM policies
- Lake Formation (advanced, later)
Example:
- Analysts → read curated
- Engineers → read/write cleansed
- Raw → written by ingestion pipelines only, no direct consumer access
📌 Data governance starts at the catalog
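As a sketch of the "analysts → read curated" rule, here is an IAM policy document built as a Python dict. The region and account ID are placeholders and `read_only_catalog_policy` is a hypothetical helper. It covers catalog metadata only; reading the underlying files would still need separate S3 permissions.

```python
import json
from typing import Dict


def read_only_catalog_policy(region: str, account_id: str, database: str) -> Dict:
    """Hypothetical helper: IAM policy allowing read-only catalog access to one database."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetPartition",
                    "glue:GetPartitions",
                ],
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:database/{database}",
                    f"{prefix}:table/{database}/*",
                ],
            }
        ],
    }


# Placeholder region and account ID.
policy = read_only_catalog_policy("us-east-1", "123456789012", "curated_db")
policy_json = json.dumps(policy, indent=2)
```

Because the statement lists only `Get*` actions on `curated_db`, an analyst with this policy can look up table definitions but cannot create, alter, or drop them.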
🧠 PART 8 — GLUE CATALOG IN ACTION (REAL FLOWS)
🔹 Spark on EMR (cluster configured to use the Glue Data Catalog as its metastore)
spark.table("curated_db.sales")
🔹 Athena
SELECT * FROM curated_db.sales;
🔹 Glue Job
Reads the same catalog tables, e.g. glueContext.create_dynamic_frame.from_catalog(database="curated_db", table_name="sales")
🧠 One schema, many engines
🎤 INTERVIEW STATEMENTS (MASTER LEVEL)
✔ “Glue Data Catalog is a centralized, serverless, Hive-compatible metastore.”
✔ “Glue tables are pointers to S3 locations, not data containers.”
✔ “We manage schema evolution carefully between raw and curated layers.”
✔ “Partitions are metadata constructs that improve query performance.”
🧪 DAY 8 THINKING EXERCISE
Answer mentally:
- Why is Glue Catalog shared across Athena and Spark?
- Why do senior teams avoid crawlers on curated data?
- What happens if S3 data changes but Glue metadata doesn’t?
🧠 DAY 8 MEMORY MAP (SAVE THIS)
S3 = Data
Glue = Metadata
Database → Namespace
Table → Schema + S3 path
Partition → Metadata pointer
One catalog → Many engines
⏭️ DAY 9 PREVIEW — ATHENA (SERVERLESS ANALYTICS)
Tomorrow we go deep into:
- Athena internals
- Cost model (VERY IMPORTANT)
- Partition pruning
- When Athena beats Spark
- Real analyst workflows
Reply with:
DAY 9
You’re now firmly in core AWS data engineering territory 🔥