Excellent 👍
Now you’re entering real AWS Data Engineering territory.
📚 LESSON 4 — AWS Glue Catalog
(Hive Metastore → AWS Glue | Hands-On + Interview-Ready)
🎯 Lesson Outcome
By the end of this lesson, you will:
- Replace Hive Metastore with Glue Catalog
- Create Glue Database & Tables
- Use Glue Crawler
- Query data using Athena
- Add strong resume bullets
🧠 First: Understand the Role of Glue Catalog
Think like this:
The Glue Data Catalog stores ONLY metadata (databases, tables, columns, S3 locations), not the data itself. The data stays in S3.
| On-Prem Hadoop | AWS |
|---|---|
| Hive Metastore | Glue Data Catalog |
| Hive DB | Glue Database |
| Hive Table | Glue Table |
| HDFS Path | S3 Path |
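📌 Athena reads table schemas from the Glue Data Catalog by default; Spark and Hive on EMR do the same once the cluster is configured to use Glue as its metastore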
🧩 Glue Architecture (Visual Mental Model)
S3 (raw data)
↓
Glue Crawler
↓
Glue Catalog (schema)
↓
Athena / EMR / Spark
1️⃣ Create Glue Database (Hands-On)
🔹 Go to:
AWS Console → Glue → Data Catalog → Databases → Add database
🔹 Database Details
- Name:
rajeev_data_lake_db
- Description:
Glue catalog for S3 data lake
Create database ✅
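If you prefer to script this step, the same database can probably be created with boto3. This is a minimal sketch, not the only way to do it: the helper names `build_database_request` and `create_database` and the `us-east-1` region are assumptions for illustration, and the real call needs boto3 installed plus working AWS credentials.

```python
def build_database_request(name, description):
    """Return keyword arguments for Glue's CreateDatabase call."""
    return {"DatabaseInput": {"Name": name, "Description": description}}


def create_database(name, description, region_name):
    """Create the database in AWS. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above runs without boto3

    glue = boto3.client("glue", region_name=region_name)
    return glue.create_database(**build_database_request(name, description))


request = build_database_request(
    "rajeev_data_lake_db", "Glue catalog for S3 data lake"
)
print(request)

# Uncomment to create the database for real (the region is an assumption):
# create_database("rajeev_data_lake_db", "Glue catalog for S3 data lake", "us-east-1")
```

Printing the request first lets you check the name and description before anything is created in your account.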
🧪 TASK 1 (Reply Required)
Glue database created: YES
Database name:
2️⃣ Create Glue Crawler (MOST IMPORTANT STEP)
🧠 What is a Crawler?
A crawler:
- Scans S3
- Infers schema
- Creates Glue tables automatically
🔹 Create Crawler
Glue → Crawlers → Create crawler
Step-by-Step:
- Name:
sales-raw-crawler
- Data source:
- S3
- Path:
s3://rajeev-data-lake-2026/raw/sales/
- IAM Role:
- Create new role
AWSGlueServiceRole-sales
- Target database:
rajeev_data_lake_db
- Table name prefix:
raw_
Create crawler ✅
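The console fields above map fairly directly onto the Glue `create_crawler` API. The sketch below shows that mapping under a few assumptions: the helper names are hypothetical, the region is up to you, and unlike the console the API does not create the IAM role for you, so `AWSGlueServiceRole-sales` must already exist.

```python
def build_crawler_request(name, role, database, s3_path, table_prefix):
    """Return keyword arguments for Glue's CreateCrawler call."""
    return {
        "Name": name,
        "Role": role,  # IAM role name or ARN; it must already exist
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "TablePrefix": table_prefix,
    }


def create_crawler(request, region_name):
    """Create the crawler in AWS. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above runs without boto3

    glue = boto3.client("glue", region_name=region_name)
    return glue.create_crawler(**request)


crawler_request = build_crawler_request(
    name="sales-raw-crawler",
    role="AWSGlueServiceRole-sales",
    database="rajeev_data_lake_db",
    s3_path="s3://rajeev-data-lake-2026/raw/sales/",
    table_prefix="raw_",
)
print(crawler_request)

# Uncomment to create the crawler for real (the region is an assumption):
# create_crawler(crawler_request, "us-east-1")
```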
▶️ Run the Crawler
Click Run crawler
Wait for the run to finish, about a minute for a small dataset ⏳
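Running and waiting can also be scripted. The sketch below starts the crawler and polls until its state returns to `READY`, then reports the status of the last crawl. The function name `run_crawler_and_wait`, the polling interval, and the timeout are assumptions; `glue` is expected to be a boto3 Glue client.

```python
import time


def run_crawler_and_wait(glue, name, poll_seconds=15, timeout_seconds=900,
                         sleep=time.sleep):
    """Start a crawler, then poll until it is READY again.

    `glue` is a boto3 Glue client. Returns the status of the last crawl,
    for example "SUCCEEDED" or "FAILED".
    """
    glue.start_crawler(Name=name)
    waited = 0
    while waited < timeout_seconds:
        sleep(poll_seconds)  # give the crawler time to leave the READY state
        waited += poll_seconds
        crawler = glue.get_crawler(Name=name)["Crawler"]
        if crawler["State"] == "READY":
            return crawler.get("LastCrawl", {}).get("Status", "UNKNOWN")
    raise TimeoutError(f"Crawler {name} still running after {timeout_seconds}s")


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption
#   print(run_crawler_and_wait(glue, "sales-raw-crawler"))
```

Sleeping before the first check is a deliberate choice: it avoids reading the state before the crawler has actually started and mistaking the previous run's result for this one.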
🧪 TASK 2 (Reply Required)
Crawler run: SUCCESS
Table created:
(Expected: raw_sales)
3️⃣ Verify Glue Table (Schema Check)
Go to:
Glue → Databases → rajeev_data_lake_db → Tables
Open table:
raw_sales
You should see:
- Columns
- Data types
- S3 location
📌 The same information a Hive Metastore keeps for a Hive table
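The same schema check can be done from code by reading the table definition with `get_table`. In this sketch the helper names are hypothetical, and the sample response is hand-written so the example runs offline: the two columns in it are made up, because your crawler infers the real ones from the files in S3.

```python
def summarize_table(response):
    """Pull name, S3 location, and columns out of a Glue GetTable response."""
    table = response["Table"]
    storage = table["StorageDescriptor"]
    return {
        "name": table["Name"],
        "location": storage["Location"],
        "columns": [(col["Name"], col["Type"]) for col in storage["Columns"]],
    }


def fetch_table_summary(database, table_name, region_name):
    """Read the table from AWS. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so summarize_table runs without boto3

    glue = boto3.client("glue", region_name=region_name)
    return summarize_table(glue.get_table(DatabaseName=database, Name=table_name))


# Hand-written stand-in for a GetTable response. The two columns are
# hypothetical; the real schema comes from your own data.
sample_response = {
    "Table": {
        "Name": "raw_sales",
        "StorageDescriptor": {
            "Location": "s3://rajeev-data-lake-2026/raw/sales/",
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "amount", "Type": "double"},
            ],
        },
    }
}
summary = summarize_table(sample_response)
print(summary["name"], summary["location"])
for column_name, column_type in summary["columns"]:
    print(f"  {column_name}: {column_type}")

# Against the real catalog (the region is an assumption):
# print(fetch_table_summary("rajeev_data_lake_db", "raw_sales", "us-east-1"))
```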
4️⃣ Query Using Athena (BIG INTERVIEW POINT)
🔹 Setup Athena (First Time Only)
- Go to Athena
- Set query result location:
s3://rajeev-data-lake-2026/athena-results/
(Create folder if needed)
🔹 Run SQL
SELECT * FROM rajeev_data_lake_db.raw_sales LIMIT 10;
🎉 You just ran serverless SQL on S3
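Athena can be driven from code as well, which is how queries usually run in a pipeline. The sketch below submits a query, waits for it to finish, and returns the first page of rows. The helper names and the region are assumptions, and the `LIMIT 10` is added here only to keep the preview small.

```python
def build_athena_request(sql, database, output_location):
    """Return keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_athena_query(request, region_name, poll_seconds=2):
    """Submit the query, wait for it, and return the first page of rows.

    Needs boto3 and valid credentials. The first row holds the column names.
    """
    import time
    import boto3  # imported lazily so the builder above runs without boto3

    athena = boto3.client("athena", region_name=region_name)
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


athena_request = build_athena_request(
    sql="SELECT * FROM raw_sales LIMIT 10",
    database="rajeev_data_lake_db",
    output_location="s3://rajeev-data-lake-2026/athena-results/",
)
print(athena_request)

# Uncomment to run the query for real (the region is an assumption):
# for row in run_athena_query(athena_request, "us-east-1"):
#     print(row)
```

Athena queries are asynchronous, so the polling loop is needed: `start_query_execution` returns an ID right away and the results only exist once the state is `SUCCEEDED`.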
🧠 On-Prem vs AWS Querying
| On-Prem | AWS |
|---|---|
| Hive CLI | Athena |
| Spark SQL | Spark on EMR |
| HDFS | S3 |
💡 Interview GOLD (Remember This)
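✔ The Glue Data Catalog stores metadata only (AWS Glue as a service also runs ETL jobs)
✔ Athena reads data directly from S3
✔ Re-running a Glue Crawler picks up schema changes (for example, new columns) and, by default, updates the table definition
✔ Spark on EMR can use the Glue Data Catalog as its Hive-compatible metastore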
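To make the last point concrete: on EMR, Spark is pointed at the Glue Data Catalog through a cluster configuration. The sketch below builds that configuration as a Python structure, in the shape the EMR `run_job_flow` API accepts for its `Configurations` argument. The function name is hypothetical, and the rest of the cluster setup is left out on purpose.

```python
# Client factory class that makes Spark's Hive support talk to Glue
# instead of a self-managed Hive Metastore.
GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_metastore_configurations():
    """Return EMR configurations that set Glue as the Spark metastore."""
    return [
        {
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": GLUE_CLIENT_FACTORY
            },
        }
    ]


configurations = glue_metastore_configurations()
print(configurations)

# On a cluster launched with this configuration, a PySpark session created
# with Hive support enabled should see the Glue tables, for example:
#   spark.sql("SELECT * FROM rajeev_data_lake_db.raw_sales LIMIT 10").show()
```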
🧾 Resume Bullets You Earn
Implemented AWS Glue Data Catalog as a Hive Metastore replacement
Automated schema discovery with Glue Crawlers and queried the data via Athena
⏭ NEXT LESSON (VERY EXCITING)
⚡ LESSON 5 — PySpark on AWS (EMR)
You will:
- Launch EMR cluster
- Attach IAM role
- Read S3 data in PySpark
- Write transformed data back to S3
🔔 Quick Confirmation (Reply)
Glue DB created:
Crawler success:
Athena query ran:
Once confirmed, we jump into Lesson 5: Spark on EMR 🚀