Excellent ✅ — let’s begin Stage 1: Foundations of DataFrames (Pandas + PySpark + Polars)
This stage builds your conceptual foundation so that you can think in “DataFrame logic” — and then apply it instantly across all three libraries.
🧭 Stage 1 — Foundations of DataFrames
🎯 Learning Goals
- Understand what a DataFrame is
- Learn how each library represents and processes data
- Create basic DataFrames manually and from files
- Compare schema, column types, and basic info
- Know how memory & execution differ
🧩 1. What is a DataFrame?
A DataFrame is a 2D labeled data structure (rows & columns) — think of it like a table in a database or an Excel sheet.
| Library | Execution Type | Scale | Key Idea |
|---|---|---|---|
| 🐼 Pandas | In-memory, eager | Small–medium data | Simple, fast on single machine |
| 🔥 PySpark | Distributed, lazy | Big Data | Parallel processing across cluster |
| ⚡ Polars | Rust-based, lazy or eager | Medium–large | Ultra-fast, multicore execution |
⚙️ 2. Creating DataFrames
Let’s create a small dataset in all three:
💡 Example data
data = [
{"id": 1, "name": "Alice", "age": 25, "city": "Delhi"},
{"id": 2, "name": "Bob", "age": 30, "city": "Mumbai"},
{"id": 3, "name": "Charlie", "age": 35, "city": "Chennai"}
]
🐼 Pandas
import pandas as pd
df_pd = pd.DataFrame(data)
print(df_pd)
📤 Output:
   id     name  age     city
0   1    Alice   25    Delhi
1   2      Bob   30   Mumbai
2   3  Charlie   35  Chennai
🔥 PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Stage1").getOrCreate()
df_spark = spark.createDataFrame(data)
df_spark.show()
📤 Output:
+---+-------+---+-------+
| id| name|age| city|
+---+-------+---+-------+
| 1| Alice| 25| Delhi|
| 2| Bob| 30| Mumbai|
| 3|Charlie| 35|Chennai|
+---+-------+---+-------+
⚡ Polars
import polars as pl
df_pl = pl.DataFrame(data)
print(df_pl)
📤 Output:
shape: (3, 4)
┌─────┬─────────┬─────┬─────────┐
│ id │ name │ age │ city │
│ --- │ --- │ --- │ --- │
│ i64 │ str │ i64 │ str │
├─────┼─────────┼─────┼─────────┤
│ 1 │ Alice │ 25 │ Delhi │
│ 2 │ Bob │ 30 │ Mumbai │
│ 3 │ Charlie │ 35 │ Chennai │
└─────┴─────────┴─────┴─────────┘
🧠 3. Inspecting DataFrames (Schema & Info)
| Task | Pandas | PySpark | Polars |
|---|---|---|---|
| Column Names | df_pd.columns | df_spark.columns | df_pl.columns |
| Data Types | df_pd.dtypes | df_spark.printSchema() | df_pl.dtypes |
| Shape | df_pd.shape | df_spark.count(), len(df_spark.columns) | df_pl.shape |
| Summary | df_pd.info() | df_spark.describe().show() | df_pl.describe() |
📘 Example: Print Schema / Info
Pandas
df_pd.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      3 non-null      int64
 1   name    3 non-null      object
 2   age     3 non-null      int64
 3   city    3 non-null      object
PySpark
df_spark.printSchema()
Output:
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- city: string (nullable = true)
Polars
df_pl.dtypes
Output:
[Int64, Utf8, Int64, Utf8]
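For completeness, here is one way to check row and column counts across the three libraries, shown as a minimal sketch using the DataFrames created earlier (df_pd, df_spark, df_pl):
# Pandas: shape is a (rows, columns) tuple
print(df_pd.shape)                                  # (3, 4)
# PySpark: count() is an action returning the row count; columns is a Python list
print(df_spark.count(), len(df_spark.columns))      # 3 4
# Polars: shape mirrors the Pandas attribute
print(df_pl.shape)                                  # (3, 4)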
💡 4. Lazy vs Eager Execution
| Concept | Pandas | PySpark | Polars |
|---|---|---|---|
| Execution Type | Eager (runs immediately) | Lazy (builds a plan, then executes) | Eager by default; lazy via LazyFrame |
| Trigger Execution | Always | On an action, e.g. .show(), .collect(), .count() | On .collect() (for a LazyFrame) |
| Optimization | Manual | Catalyst Optimizer | Query Planner |
Example (Polars LazyFrame)
lazy_df = df_pl.lazy().select(pl.col("age") + 10)
result = lazy_df.collect()
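The same idea in PySpark, as a minimal sketch using the df_spark frame from above: transformations only build a logical plan, and nothing runs until an action such as show() or collect() is called.
# Transformations are lazy: this line only builds a logical plan
plan = df_spark.select("name", "age").filter(df_spark.age > 25)
# An action triggers optimization (Catalyst) and actual execution
plan.show()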
Interview Tip:
👉 Be able to explain lazy evaluation:
In PySpark and Polars, transformations are not executed immediately. They’re added to a logical plan and optimized before execution, improving performance.
🧮 5. Basic DataFrame Operations Preview
| Operation | Pandas | PySpark | Polars |
|---|---|---|---|
| Select columns | df_pd[['name','age']] | df_spark.select("name","age") | df_pl.select(["name","age"]) |
| Filter rows | df_pd[df_pd.age>25] | df_spark.filter(df_spark.age>25) | df_pl.filter(pl.col("age")>25) |
| Add column | df_pd['age2']=df_pd.age+5 | df_spark.withColumn("age2",df_spark.age+5) | df_pl.with_columns((pl.col("age")+5).alias("age2")) |
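Putting the table together, here is a minimal side-by-side sketch of the same three operations, using the df_pd, df_spark, and df_pl frames created earlier:
# Pandas: eager; each expression returns a result immediately
adults_pd = df_pd[df_pd.age > 25][["name", "age"]]
df_pd["age2"] = df_pd["age"] + 5
# PySpark: builds a plan; an action like show() triggers execution
adults_spark = df_spark.filter(df_spark.age > 25).select("name", "age")
df_spark2 = df_spark.withColumn("age2", df_spark.age + 5)
# Polars: expression-based API; eager here, lazy if you start from .lazy()
adults_pl = df_pl.filter(pl.col("age") > 25).select(["name", "age"])
df_pl2 = df_pl.with_columns((pl.col("age") + 5).alias("age2"))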
🧠 Concept Recap
| Concept | Pandas | PySpark | Polars |
|---|---|---|---|
| Memory usage | In-memory | Distributed cluster | In-memory + parallel |
| Speed | Fast for small data | Scales massively | Extremely fast on single machine |
| Execution | Immediate | Lazy | Lazy or eager |
| Schema strictness | Flexible | Strict | Strict |
| Typical use case | Local analysis | Big Data pipelines | High-performance local analytics |
🧩 6. Exercise — Practice Time
Task
Create the following DataFrame:
| emp_id | name | dept | salary | age |
|---|---|---|---|---|
| 101 | Raj | IT | 70000 | 28 |
| 102 | Neha | HR | 65000 | 32 |
| 103 | Arjun | IT | 72000 | 26 |
Then do the following:
- Show first 2 rows
- Print schema and dtypes
- Select only the name and salary columns
- Filter employees with salary > 68000
- Add a new column bonus = salary * 0.10
Try to implement all of this in Pandas, PySpark, and Polars side by side.
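If you want a starting point, here is one possible way to build the employee dataset in all three libraries (a sketch only; the variable names employees, emp_pd, emp_spark, and emp_pl are my own, and the remaining steps are left for you to try):
employees = [
    {"emp_id": 101, "name": "Raj", "dept": "IT", "salary": 70000, "age": 28},
    {"emp_id": 102, "name": "Neha", "dept": "HR", "salary": 65000, "age": 32},
    {"emp_id": 103, "name": "Arjun", "dept": "IT", "salary": 72000, "age": 26},
]
emp_pd = pd.DataFrame(employees)              # Pandas
emp_spark = spark.createDataFrame(employees)  # PySpark (reuses the existing SparkSession)
emp_pl = pl.DataFrame(employees)              # Polars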
Excellent 🔥 — welcome to Stage 2: Data Reading & Writing (I/O Operations + Schema Inference)
This is where you learn how real-world data actually enters and leaves your system — a core interview topic for data engineers and analysts.