🧭 Stage 2 — Data Reading & Writing (Pandas + PySpark + Polars)


🎯 Learning Goals

  • Read and write data in multiple formats: CSV, Parquet, JSON, SQL
  • Understand schema inference and how each library handles types
  • Learn key read/write options and their performance trade-offs
  • Cover partitioning, compression, and overwrite vs append modes
  • Get interview-ready theory for I/O systems in Spark and Polars

🧩 1. Reading Data — The Basics

🐼 Pandas

import pandas as pd

# Read CSV
df_pd = pd.read_csv("employees.csv")

# Read JSON
df_json = pd.read_json("employees.json")

# Read SQL
import sqlite3
conn = sqlite3.connect("company.db")
df_sql = pd.read_sql("SELECT * FROM employees", conn)

🧠 Notes

  • Automatically infers column types (int64, float64, object)
  • Reads the entire file into memory (eager); see the chunked-reading sketch below for larger files
  • Best for small/medium datasets
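
Because everything lands in RAM at once, larger-than-memory CSVs are usually processed in chunks. A minimal sketch, assuming a 100k-row chunk size chosen purely for illustration:

total_rows = 0
for chunk in pd.read_csv("employees.csv", chunksize=100_000):
    # each chunk is an ordinary DataFrame holding up to 100k rows
    total_rows += len(chunk)
print(total_rows)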

🔥 PySpark

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("IOStage").getOrCreate()

# Read CSV
df_spark = spark.read.csv("employees.csv", header=True, inferSchema=True)

# Read Parquet
df_parquet = spark.read.parquet("employees.parquet")

# Read JSON
df_json = spark.read.json("employees.json")

# Read from JDBC/SQL
df_sql = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/company") \
    .option("dbtable", "employees") \
    .option("user", "root") \
    .option("password", "password") \
    .load()

🧠 Notes

  • Uses lazy evaluation: reading builds a logical plan rather than loading the data immediately
  • inferSchema=True triggers an extra pass over the data to detect column types (the fraction scanned is controlled by the samplingRatio option)
  • JDBC is common for production data ingestion from databases
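
For large tables, the JDBC source can also read in parallel by splitting on a numeric column. A hedged sketch; the partition column, bounds, and partition count below are assumptions about the table, not values from this guide:

df_parallel = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/company") \
    .option("dbtable", "employees") \
    .option("user", "root") \
    .option("password", "password") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "100000") \
    .option("numPartitions", "8") \
    .load()

Each of the 8 partitions issues its own range query, so the load is spread across executors instead of running as one big fetch.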

⚡ Polars

import polars as pl

# Read CSV
df_pl = pl.read_csv("employees.csv")

# Read JSON
df_json = pl.read_json("employees.json")

# Read Parquet
df_parquet = pl.read_parquet("employees.parquet")

# From SQL
import sqlite3
conn = sqlite3.connect("company.db")
df_sql = pl.read_database("SELECT * FROM employees", connection=conn)

🧠 Notes

  • Polars auto-detects data types quickly thanks to its Rust backend
  • Typically the fastest CSV reader of the three (multi-threaded)
  • Supports lazy reading: pl.scan_csv("employees.csv") builds a query plan and doesn’t read the file until .collect()
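
Lazy scans also accept glob patterns, so many files can be combined into a single optimized query. A sketch, assuming a hypothetical data/*.csv layout with name and age columns:

lazy_all = pl.scan_csv("data/*.csv")       # no file is read yet
result = (
    lazy_all
    .filter(pl.col("age") > 30)            # pushed down into the scan
    .select(["name", "age"])
    .collect()                             # execution happens here
)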

🧱 2. Schema Inference

What is Schema Inference?

When reading data (CSV/JSON), libraries automatically detect each column’s data type — e.g., int, float, string, date.

| Library | Schema Handling       | Control Option            |
|---------|-----------------------|---------------------------|
| Pandas  | Auto via sample data  | dtype={"col": "int64"}    |
| PySpark | Optional auto         | inferSchema=True          |
| Polars  | Auto, or specify      | dtypes={"col": pl.Int64}  |

Example Comparison

| Library | Code                                                 | Output Schema                          |
|---------|------------------------------------------------------|----------------------------------------|
| Pandas  | pd.read_csv("data.csv").dtypes                       | id int64, name object, salary float64  |
| PySpark | spark.read.csv(..., inferSchema=True).printSchema()  | id: int, name: string, salary: double  |
| Polars  | pl.read_csv("data.csv").dtypes                       | [Int64, Utf8, Float64]                 |

🧠 Interview Insight:

Q: “How does Spark infer schema?”
A: With inferSchema=True, Spark runs an extra pass over the data (the fraction scanned is set by the samplingRatio option, 1.0 by default) and infers each column’s type. For accuracy and performance, it’s better to define the schema explicitly using StructType.

Example:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df_spark = spark.read.csv("employees.csv", schema=schema, header=True)

Tip: Explicit schema → faster + safer in production.
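
Pandas and Polars offer the same kind of control. A hedged sketch; note that recent Polars releases rename the dtypes parameter to schema_overrides:

# Pandas: force column types at parse time instead of relying on inference
df_pd = pd.read_csv("employees.csv", dtype={"id": "int64", "salary": "float64"})

# Polars: override inferred types per column (newer versions: schema_overrides=...)
df_pl = pl.read_csv("employees.csv", dtypes={"id": pl.Int64, "salary": pl.Float64})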


💾 3. Writing Data

| Format  | Pandas                               | PySpark                                  | Polars                                      |
|---------|--------------------------------------|------------------------------------------|---------------------------------------------|
| CSV     | df.to_csv("output.csv", index=False) | df.write.mode("overwrite").csv("output") | df.write_csv("output.csv")                  |
| Parquet | df.to_parquet("output.parquet")      | df.write.parquet("output")               | df.write_parquet("output.parquet")          |
| JSON    | df.to_json("output.json")            | df.write.json("output_json")             | df.write_json("output.json")                |
| SQL     | df.to_sql("table", conn)             | df.write.jdbc(url, "table")              | df.write_database("table", connection=conn) |

🧠 PySpark Write Modes

| Mode      | Description                                              |
|-----------|----------------------------------------------------------|
| overwrite | Replace existing data                                    |
| append    | Add to existing data                                     |
| ignore    | Skip the write if the path exists                        |
| error     | Fail if the path exists (the default, aka errorifexists) |

Example

df_spark.write.mode("overwrite").parquet("output/employees")
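
In practice, overwrite and append pair up for full versus incremental loads. A sketch; new_rows is a hypothetical DataFrame of freshly arrived records:

# Full refresh: replaces everything under the target path
df_spark.write.mode("overwrite").parquet("output/employees")

# Incremental load: new_rows (hypothetical) is added alongside the existing files
new_rows.write.mode("append").parquet("output/employees")

# With the default mode ("error"/"errorifexists"), a second write to the same path fails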

⚙️ 4. Partitioning & Compression (Performance)

PySpark Example:

df_spark.write \
    .partitionBy("department") \
    .parquet("output/partitioned", mode="overwrite", compression="snappy")

Explanation:

  • Partitioning creates one subfolder per value of the partition column, so queries that filter on that column read only the matching folders (partition pruning; see the read-back sketch below)
  • Compression (e.g., Snappy, Gzip) reduces storage size and I/O, which usually improves overall speed
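
Reading the partitioned output back with a filter on the partition column scans only the matching subfolders. A sketch; the "Sales" department value is illustrative:

from pyspark.sql import functions as F

# Only the department=Sales subfolder is scanned (partition pruning)
sales_df = spark.read.parquet("output/partitioned") \
    .filter(F.col("department") == "Sales")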

Polars:

df_pl.write_parquet("output/snappy.parquet", compression="snappy")

Pandas:
Compression is set per call via the compression parameter:

df_pd.to_csv("output.csv.gz", compression="gzip")
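
Parquet compression works the same way in Pandas. A small sketch; gzip is chosen purely for illustration, and a Parquet engine (pyarrow or fastparquet) must be installed:

df_pd.to_parquet("output_gzip.parquet", compression="gzip")  # default codec is snappy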

🧠 5. Lazy I/O in Spark and Polars

  • Spark: reading doesn’t load data immediately; execution waits until .show(), .count(), .collect()
  • Polars LazyFrame: read lazily and execute later
lazy_df = pl.scan_csv("employees.csv")
result = lazy_df.filter(pl.col("age") > 30).collect()

⚡ Benefit → the engine sees the whole query up front and can optimize it (e.g., predicate and projection pushdown) before touching the data
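
The Spark side of the same idea, as a minimal sketch (it reuses the output/employees path written above; the salary threshold is illustrative):

# Nothing is read yet: Spark only records the plan
df_lazy = spark.read.parquet("output/employees")
filtered = df_lazy.filter(df_lazy["salary"] > 50000)   # still lazy

# The action below triggers the actual file scan and computation
print(filtered.count())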


🔍 6. Comparing Read Performance (Conceptually)

| Feature     | Pandas             | PySpark                       | Polars                                    |
|-------------|--------------------|-------------------------------|-------------------------------------------|
| Speed       | Fast (small data)  | Medium (distributed overhead) | ⚡ Very fast (Rust backend)               |
| Scale       | Limited to RAM     | Scales to a cluster           | Medium–large (multi-core, single machine) |
| Schema      | Automatic          | Infer or define               | Automatic                                 |
| Lazy read   | ❌                 | ✅                            | ✅ (scan_*)                               |
| Compression | Supported          | Strong                        | Strong                                    |

🧠 7. Common Interview Questions

| # | Question                                           | Key Point                                                          |
|---|-----------------------------------------------------|--------------------------------------------------------------------|
| 1 | How does PySpark read large CSV efficiently?        | Lazy + parallel read + partitioned I/O                             |
| 2 | What’s the difference between CSV and Parquet?      | CSV = row-based (slower), Parquet = columnar (faster, compressed)  |
| 3 | What is schema inference in Spark?                  | Auto-detecting column data types from a scan of the data           |
| 4 | When to use an explicit schema?                     | When performance or consistency matters                            |
| 5 | What is the default compression in Spark Parquet?   | Snappy                                                             |
| 6 | What is the difference between eager and lazy I/O?  | Pandas = eager, Spark/Polars = lazy until an action is triggered   |
| 7 | Why is Parquet preferred for big data pipelines?    | Columnar format → less I/O, better compression, predicate pushdown |
| 8 | Can Polars read lazily from multiple files?         | Yes, using glob patterns: pl.scan_csv("data/*.csv")                |
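
For question 5, the codec can be changed through a Spark SQL config or overridden per write. A hedged sketch; the output paths are illustrative:

# Parquet writes default to snappy; this config switches the codec globally
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
df_spark.write.mode("overwrite").parquet("output/gzip_employees")

# Per-write override via the data source option
df_spark.write.option("compression", "zstd").parquet("output/zstd_employees")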

🧩 8. Hands-on Challenge (Mini Project)

Dataset: sales.csv

| sale_id | region | sales | profit | month |
|---------|--------|-------|--------|-------|
| 1       | East   | 5000  | 800    | Jan   |
| 2       | West   | 7000  | 1200   | Jan   |
| 3       | South  | 6500  | 1000   | Feb   |

Tasks

  1. Read CSV in Pandas, PySpark, and Polars
  2. Print schema & data types
  3. Write the file to:
    • Parquet (compressed)
    • JSON
  4. In Spark: write partitioned by month
  5. In Polars: perform lazy read using scan_csv
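
A partial solution sketch for tasks 4 and 5 (file paths are illustrative; adjust them to your layout):

# Task 4 (Spark): write the sales data partitioned by month
df_sales = spark.read.csv("sales.csv", header=True, inferSchema=True)
df_sales.write.mode("overwrite").partitionBy("month").parquet("output/sales_by_month")

# Task 5 (Polars): lazy read, executed only at collect()
lazy_sales = pl.scan_csv("sales.csv")
print(lazy_sales.collect())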

After Stage 2, you now understand:

  • Data I/O pipelines
  • Schema inference and optimization
  • Read/write modes and performance trade-offs
  • Practical syntax differences between Pandas, PySpark, and Polars

Would you like to move to Stage 3 → Data Exploration & Summary Statistics
(covering .head(), .describe(), .info(), column introspection, summary aggregation, profiling, etc.) next?