🧭 Stage 2 — Data Reading & Writing (Pandas + PySpark + Polars)
🎯 Learning Goals
- Read and write data in multiple formats: CSV, Parquet, JSON, SQL
- Understand schema inference and how each library handles types
- Learn read/write performance & options
- Cover partitioning, compression, and overwrite vs append modes
- Get interview-ready theory for I/O systems in Spark and Polars
🧩 1. Reading Data — The Basics
🐼 Pandas
import pandas as pd
# Read CSV
df_pd = pd.read_csv("employees.csv")
# Read JSON
df_json = pd.read_json("employees.json")
# Read SQL
import sqlite3
conn = sqlite3.connect("company.db")
df_sql = pd.read_sql("SELECT * FROM employees", conn)
🧠 Notes
- Automatically infers types (int, float, object)
- Reads entire file into memory
- Best for small/medium datasets
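Because Pandas loads the whole file into RAM, very large CSVs are often processed in chunks instead. A minimal sketch, assuming a salary column exists (the chunk size is arbitrary):

```python
import pandas as pd

# Stream the CSV in 100k-row chunks instead of loading everything at once
total_rows = 0
total_salary = 0.0
for chunk in pd.read_csv("employees.csv", chunksize=100_000):
    total_rows += len(chunk)
    total_salary += chunk["salary"].sum()  # "salary" is an assumed column name

print("rows:", total_rows, "avg salary:", total_salary / total_rows)
```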
🔥 PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("IOStage").getOrCreate()
# Read CSV
df_spark = spark.read.csv("employees.csv", header=True, inferSchema=True)
# Read Parquet
df_parquet = spark.read.parquet("employees.parquet")
# Read JSON
df_json = spark.read.json("employees.json")
# Read from JDBC/SQL
df_sql = spark.read.format("jdbc") \
.option("url", "jdbc:mysql://localhost:3306/company") \
.option("dbtable", "employees") \
.option("user", "root") \
.option("password", "password") \
.load()
🧠 Notes
- Uses lazy evaluation: reading triggers schema planning, not immediate load
- inferSchema=True triggers an extra pass over the data to detect column types
- JDBC is common for production data ingestion from databases (see the sketch below)
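For large tables, a production JDBC read is usually parallelized by telling Spark how to split the table. A hedged sketch reusing the spark session above (the partition column, bounds, and partition count are illustrative):

```python
# Parallel JDBC read: Spark issues one query per partition of the id range
df_sql_parallel = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/company")
    .option("dbtable", "employees")
    .option("user", "root")
    .option("password", "password")
    .option("partitionColumn", "id")   # must be a numeric, date, or timestamp column
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .load()
)
```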
⚡ Polars
import polars as pl
# Read CSV
df_pl = pl.read_csv("employees.csv")
# Read JSON
df_json = pl.read_json("employees.json")
# Read Parquet
df_parquet = pl.read_parquet("employees.parquet")
# From SQL
import sqlite3
conn = sqlite3.connect("company.db")
df_sql = pl.read_database("SELECT * FROM employees", connection=conn)
🧠 Notes
- Polars autodetects data types quickly, thanks to its Rust backend
- Fastest CSV reader among all three (multi-threaded)
- Supports lazy reading:
lazy_df = pl.scan_csv("employees.csv") # Doesn’t load until collect()
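Polars also exposes read options that cut work up front. A small sketch (the column names are assumptions and the row cap is arbitrary):

```python
import polars as pl

# Read only the needed columns and cap the row count while prototyping
df_subset = pl.read_csv(
    "employees.csv",
    columns=["id", "name", "salary"],  # assumed column names
    n_rows=10_000,                     # stop after the first 10,000 rows
)
```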
🧱 2. Schema Inference
What is Schema Inference?
When reading data (CSV/JSON), libraries automatically detect each column’s data type — e.g., int, float, string, date.
| Library | Schema Handling | Control Option |
|---|---|---|
| Pandas | Auto via sample data | dtype={"col": "int64"} |
| PySpark | Optional auto | inferSchema=True |
| Polars | Auto, or specify | dtypes={"col": pl.Int64} |
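To make the control options concrete, a sketch forcing types at read time (id and salary are example columns; recent Polars releases name the parameter schema_overrides, while older ones use dtypes as shown in the table):

```python
import pandas as pd
import polars as pl

# Pandas: pin column types instead of relying on inference
df_pd_typed = pd.read_csv("data.csv", dtype={"id": "int64", "salary": "float64"})

# Polars: same idea with an explicit dtype mapping
df_pl_typed = pl.read_csv(
    "data.csv",
    schema_overrides={"id": pl.Int64, "salary": pl.Float64},
)
```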
Example Comparison
| Library | Code | Output Schema |
|---|---|---|
| Pandas | pd.read_csv("data.csv").dtypes | id int64, name object, salary float64 |
| PySpark | spark.read.csv(..., inferSchema=True).printSchema() | id: int, name: string, salary: double |
| Polars | pl.read_csv("data.csv").dtypes | [Int64, Utf8, Float64] |
🧠 Interview Insight:
Q: “How does Spark infer schema?”
A: With inferSchema=True, Spark makes an extra pass over the data to infer column types (the fraction of rows sampled is controlled by the samplingRatio option, 1.0 by default). For accuracy and performance, it’s better to define the schema explicitly using StructType.
Example:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("salary", IntegerType(), True)
])
df_spark = spark.read.csv("employees.csv", schema=schema, header=True)
✅ Tip: Explicit schema → faster + safer in production.
💾 3. Writing Data
| Format | Pandas | PySpark | Polars |
|---|---|---|---|
| CSV | df.to_csv("output.csv", index=False) | df.write.mode("overwrite").csv("output") | df.write_csv("output.csv") |
| Parquet | df.to_parquet("output.parquet") | df.write.parquet("output") | df.write_parquet("output.parquet") |
| JSON | df.to_json("output.json") | df.write.json("output_json") | df.write_json("output.json") |
| SQL | df.to_sql("table", conn) | df.write.jdbc(url, "table") | df.write_database("table", connection=conn) |
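As a concrete example for the SQL row, a small Pandas round trip against SQLite (the table name is illustrative, and df_pd is the DataFrame read earlier):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("company.db")

# Write the DataFrame to a table, replacing it if it already exists
df_pd.to_sql("employees_backup", conn, if_exists="replace", index=False)

# Read it back to verify the round trip
print(pd.read_sql("SELECT COUNT(*) AS n FROM employees_backup", conn))
```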
🧠 PySpark Write Modes
| Mode | Description |
|---|---|
| overwrite | Replace existing data |
| append | Add to existing data |
| ignore | Skip the write if the path already exists |
| error | Fail if the path already exists |
Example
df_spark.write.mode("overwrite").parquet("output/employees")
⚙️ 4. Partitioning & Compression (Performance)
PySpark Example:
df_spark.write \
.partitionBy("department") \
.parquet("output/partitioned", mode="overwrite", compression="snappy")
Explanation:
- Partitioning creates subfolders for faster query filtering
- Compression (e.g., Snappy, Gzip) reduces storage, improves speed
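Reading the partitioned output back, a filter on the partition column lets Spark skip entire subdirectories (partition pruning). The department value below is illustrative:

```python
# Only the department=Engineering/ subfolder is scanned
df_part = spark.read.parquet("output/partitioned")
df_part.filter(df_part["department"] == "Engineering").show()
```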
Polars:
df_pl.write_parquet("output/snappy.parquet", compression="snappy")
Pandas:
Compression in Pandas is set via a parameter:
df_pd.to_csv("output.csv.gz", compression="gzip")
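Pandas can also write compressed Parquet through its Parquet engine (pyarrow or fastparquet must be installed); a minimal sketch:

```python
# Columnar, compressed output from Pandas
df_pd.to_parquet("output_gzip.parquet", compression="gzip")
```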
🧠 5. Lazy I/O in Spark and Polars
- Spark: reading doesn’t load data immediately; execution waits for an action such as .show(), .count(), or .collect()
- Polars LazyFrame: read lazily and execute later
lazy_df = pl.scan_csv("employees.csv")
result = lazy_df.filter(pl.col("age") > 30).collect()
⚡ Benefit → the whole pipeline can be optimized (e.g., predicate pushdown) before any data is read, which makes chained operations faster
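For comparison, the same laziness in Spark, reusing the explicit schema defined in section 2 (the salary threshold is arbitrary):

```python
# Transformations only build a logical plan; no data is read yet
df_lazy = spark.read.csv("employees.csv", schema=schema, header=True)
high_paid = df_lazy.filter(df_lazy["salary"] > 50000).select("name", "salary")

# The action triggers the actual file scan and computation
print(high_paid.count())
```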
🔍 6. Comparing Read Performance (Conceptually)
| Feature | Pandas | PySpark | Polars |
|---|---|---|---|
| Speed | Fast (small data) | Medium (distributed overhead) | ⚡ Very fast (Rust backend) |
| Scale | Limited to RAM | Scales to cluster | Medium–large (multi-core) |
| Schema | Automatic | Infer or define | Automatic |
| Lazy read | ❌ | ✅ | ✅ |
| Compression | Supported | Strong | Strong |
🧠 7. Common Interview Questions
| # | Question | Key Point |
|---|---|---|
| 1 | How does PySpark read large CSV efficiently? | Lazy + parallel read + partitioned IO |
| 2 | What’s the difference between CSV and Parquet? | CSV = row-based (slower), Parquet = columnar (faster, compressed) |
| 3 | What is schema inference in Spark? | Auto-detecting column data types from sample |
| 4 | When to use explicit schema? | When performance or consistency matters |
| 5 | What is the default compression in Spark Parquet? | Snappy |
| 6 | What is the difference between eager and lazy I/O? | Pandas = eager, Spark/Polars = lazy until action triggered |
| 7 | Why is Parquet preferred for big data pipelines? | Columnar format → less I/O, better compression, predicate pushdown |
| 8 | Can Polars read lazy from multiple files? | Yes, using glob patterns: pl.scan_csv("data/*.csv") |
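Expanding on question 8, a lazy multi-file scan might look like this (the directory, column names, and aggregation are assumptions):

```python
import polars as pl

# All matching CSVs are combined into a single lazy query
lazy_all = pl.scan_csv("data/*.csv")
totals = lazy_all.group_by("month").agg(pl.col("sales").sum()).collect()
print(totals)
```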
🧩 8. Hands-on Challenge (Mini Project)
Dataset: sales.csv
| sale_id | region | sales | profit | month |
|---|---|---|---|---|
| 1 | East | 5000 | 800 | Jan |
| 2 | West | 7000 | 1200 | Jan |
| 3 | South | 6500 | 1000 | Feb |
Tasks
- Read CSV in Pandas, PySpark, and Polars
- Print schema & data types
- Write the file to Parquet (compressed) and JSON
- In Spark: write partitioned by month
- In Polars: perform a lazy read using scan_csv
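A possible solution sketch, assuming the file lives at sales.csv (output paths are illustrative):

```python
import pandas as pd
import polars as pl
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Stage2Challenge").getOrCreate()

# 1. Read the CSV in all three libraries
sales_pd = pd.read_csv("sales.csv")
sales_spark = spark.read.csv("sales.csv", header=True, inferSchema=True)
sales_pl = pl.read_csv("sales.csv")

# 2. Print schema / data types
print(sales_pd.dtypes)
sales_spark.printSchema()
print(sales_pl.dtypes)

# 3. Write to compressed Parquet and JSON (Pandas shown; the other APIs mirror section 3)
sales_pd.to_parquet("sales.parquet", compression="snappy")
sales_pd.to_json("sales.json", orient="records")

# 4. Spark: write partitioned by month
sales_spark.write.mode("overwrite").partitionBy("month").parquet("sales_partitioned")

# 5. Polars: lazy read with a small aggregation
lazy_sales = pl.scan_csv("sales.csv")
print(lazy_sales.group_by("region").agg(pl.col("profit").sum()).collect())
```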
✅ After Stage 2, you now understand:
- Data I/O pipelines
- Schema inference and optimization
- Read/write modes and performance trade-offs
- Practical syntax differences between Pandas, PySpark, and Polars
Would you like to move to Stage 3 → Data Exploration & Summary Statistics
(covering .head(), .describe(), .info(), column introspection, summary aggregation, profiling, etc.) next?