Excellent ✅ — let’s begin Stage 1: Foundations of DataFrames (Pandas + PySpark + Polars)

This stage builds your conceptual foundation so that you can think in “DataFrame logic” — and then apply it instantly across all three libraries.


🧭 Stage 1 — Foundations of DataFrames

🎯 Learning Goals

  • Understand what a DataFrame is
  • Learn how each library represents and processes data
  • Create basic DataFrames manually and from files
  • Compare schema, column types, and basic info
  • Know how memory & execution differ

🧩 1. What is a DataFrame?

A DataFrame is a 2D labeled data structure (rows & columns) — think of it like a table in a database or an Excel sheet.

| Library | Execution Type | Scale | Key Idea |
|---|---|---|---|
| 🐼 Pandas | In-memory, eager | Small–medium data | Simple, fast on a single machine |
| 🔥 PySpark | Distributed, lazy | Big Data | Parallel processing across a cluster |
| Polars | Rust-based, lazy or eager | Medium–large | Ultra-fast, multicore execution |

⚙️ 2. Creating DataFrames

Let’s create a small dataset in all three:

💡 Example data

data = [
    {"id": 1, "name": "Alice", "age": 25, "city": "Delhi"},
    {"id": 2, "name": "Bob", "age": 30, "city": "Mumbai"},
    {"id": 3, "name": "Charlie", "age": 35, "city": "Chennai"}
]

🐼 Pandas

import pandas as pd
df_pd = pd.DataFrame(data)
print(df_pd)

📤 Output:

   id     name  age    city
0   1    Alice   25   Delhi
1   2      Bob   30  Mumbai
2   3  Charlie   35  Chennai

🔥 PySpark

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Stage1").getOrCreate()

df_spark = spark.createDataFrame(data)
df_spark.show()

📤 Output:

+---+-------+---+-------+
| id|   name|age|   city|
+---+-------+---+-------+
|  1|  Alice| 25|  Delhi|
|  2|    Bob| 30| Mumbai|
|  3|Charlie| 35|Chennai|
+---+-------+---+-------+

Polars

import polars as pl
df_pl = pl.DataFrame(data)
print(df_pl)

📤 Output:

shape: (3, 4)
┌─────┬─────────┬─────┬─────────┐
│ id  │ name    │ age │ city    │
│ --- │ ---     │ --- │ ---     │
│ i64 │ str     │ i64 │ str     │
├─────┼─────────┼─────┼─────────┤
│ 1   │ Alice   │ 25  │ Delhi   │
│ 2   │ Bob     │ 30  │ Mumbai  │
│ 3   │ Charlie │ 35  │ Chennai │
└─────┴─────────┴─────┴─────────┘

🧠 3. Inspecting DataFrames (Schema & Info)

| Task | Pandas | PySpark | Polars |
|---|---|---|---|
| Column names | df_pd.columns | df_spark.columns | df_pl.columns |
| Data types | df_pd.dtypes | df_spark.printSchema() | df_pl.dtypes |
| Shape | df_pd.shape | df_spark.count(), len(df_spark.columns) | df_pl.shape |
| Summary | df_pd.info() | df_spark.describe().show() | df_pl.describe() |

📘 Example: Print Schema / Info

Pandas

df_pd.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      3 non-null      int64
 1   name    3 non-null      object
 2   age     3 non-null      int64
 3   city    3 non-null      object

PySpark

df_spark.printSchema()

Output:

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- city: string (nullable = true)

Polars

df_pl.dtypes

Output:

[Int64, Utf8, Int64, Utf8]
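
The inspection table above also lists how to get the shape. Here is a small sketch, assuming the df_pd, df_spark, and df_pl DataFrames created in section 2, that prints row and column counts in each library:

# Pandas: .shape returns (rows, columns) as a tuple
print(df_pd.shape)                              # (3, 4)

# PySpark: rows come from an action, columns from the schema's column list
print(df_spark.count(), len(df_spark.columns))  # 3 4

# Polars: .shape mirrors the Pandas API
print(df_pl.shape)                              # (3, 4)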

💡 4. Lazy vs Eager Execution

| Concept | Pandas | PySpark | Polars |
|---|---|---|---|
| Execution type | Eager (runs immediately) | Lazy (builds a plan, executes on an action) | Eager by default, lazy via LazyFrame |
| Trigger execution | Every statement runs at once | Actions such as .show() / .collect() | .collect() on a LazyFrame |
| Optimization | Manual | Catalyst Optimizer | Query planner |

Example (Polars LazyFrame)

lazy_df = df_pl.lazy().select(pl.col("age") + 10)  # builds a query plan; nothing runs yet
result = lazy_df.collect()                          # executes the optimized plan

Interview Tip:
👉 Be able to explain lazy evaluation:

In PySpark and Polars, transformations are not executed immediately. They’re added to a logical plan and optimized before execution, improving performance.
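
For a concrete picture, here is a minimal PySpark sketch, reusing the df_spark DataFrame created above (the older variable name is just illustrative):

# Transformations only add steps to the logical plan; no Spark job runs here
older = df_spark.filter(df_spark.age > 25).select("name", "age")

# The action below triggers optimization (Catalyst) and actual execution
older.show()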


🧮 5. Basic DataFrame Operations Preview

| Operation | Pandas | PySpark | Polars |
|---|---|---|---|
| Select columns | df_pd[['name','age']] | df_spark.select("name","age") | df_pl.select(["name","age"]) |
| Filter rows | df_pd[df_pd.age>25] | df_spark.filter(df_spark.age>25) | df_pl.filter(pl.col("age")>25) |
| Add column | df_pd['age2']=df_pd.age+5 | df_spark.withColumn("age2",df_spark.age+5) | df_pl.with_columns((pl.col("age")+5).alias("age2")) |
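
To see these operations together, here is a rough sketch that chains all three in Polars, assuming the df_pl frame from section 2 (the result and age2 names are just suggestions):

import polars as pl

result = (
    df_pl
    .filter(pl.col("age") > 25)                        # keep rows with age > 25
    .with_columns((pl.col("age") + 5).alias("age2"))   # add a derived column
    .select(["name", "age", "age2"])                   # keep only these columns
)
print(result)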

🧠 Concept Recap

| Concept | Pandas | PySpark | Polars |
|---|---|---|---|
| Memory usage | In-memory | Distributed cluster | In-memory + parallel |
| Speed | Fast for small data | Scales massively | Extremely fast on a single machine |
| Execution | Immediate | Lazy | Lazy or eager |
| Schema strictness | Flexible | Strict | Strict |
| Typical use case | Local analysis | Big Data pipelines | High-performance local analytics |
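
One way to see the "schema strictness" row in practice: PySpark and Polars can pin column types when a DataFrame is created, while Pandas infers them. A rough sketch, reusing the spark session, the pl import, and the data list from earlier (the Int32 choices are only illustrative):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# PySpark: declare the schema explicitly instead of letting Spark infer long/string
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
])
rows = [(1, "Alice", 25, "Delhi"), (2, "Bob", 30, "Mumbai"), (3, "Charlie", 35, "Chennai")]
df_spark_typed = spark.createDataFrame(rows, schema=schema)

# Polars: pass a schema mapping of column name to dtype
df_pl_typed = pl.DataFrame(
    data,
    schema={"id": pl.Int32, "name": pl.Utf8, "age": pl.Int32, "city": pl.Utf8},
)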

🧩 6. Exercise — Practice Time

Task

Create the following DataFrame:

| emp_id | name | dept | salary | age |
|---|---|---|---|---|
| 101 | Raj | IT | 70000 | 28 |
| 102 | Neha | HR | 65000 | 32 |
| 103 | Arjun | IT | 72000 | 26 |

Then do the following:

  1. Show first 2 rows
  2. Print schema and dtypes
  3. Select only name and salary
  4. Filter employees with salary > 68000
  5. Add a new column bonus = salary * 0.10

Try to implement all of this in Pandas, PySpark, and Polars side by side.
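
If you want a starting point, here is a Pandas-only sketch of steps 1 and 2 (the employees variable name is just a suggestion); the PySpark and Polars versions are yours to write:

import pandas as pd

employees = pd.DataFrame([
    {"emp_id": 101, "name": "Raj",   "dept": "IT", "salary": 70000, "age": 28},
    {"emp_id": 102, "name": "Neha",  "dept": "HR", "salary": 65000, "age": 32},
    {"emp_id": 103, "name": "Arjun", "dept": "IT", "salary": 72000, "age": 26},
])

print(employees.head(2))   # 1. first 2 rows
print(employees.dtypes)    # 2. column dtypes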


Excellent 🔥 — welcome to Stage 2: Data Reading & Writing (I/O Operations + Schema Inference)
This is where you learn how real-world data actually enters and leaves your system — a core interview topic for data engineers and analysts.