🧭 Stage 2 — Data Reading & Writing (Pandas + PySpark + Polars)


🎯 Learning Goals

  • Read and write data in multiple formats: CSV, Parquet, JSON, SQL
  • Understand schema inference and how each library handles types
  • Learn key read/write options and their performance trade-offs
  • Cover partitioning, compression, and overwrite vs append modes
  • Get interview-ready theory for I/O systems in Spark and Polars

🧩 1. Reading Data — The Basics

🐼 Pandas

import pandas as pd

# Read CSV
df_pd = pd.read_csv("employees.csv")

# Read JSON
df_json = pd.read_json("employees.json")

# Read SQL
import sqlite3
conn = sqlite3.connect("company.db")
df_sql = pd.read_sql("SELECT * FROM employees", conn)

🧠 Notes

  • Automatically infers column types (int64, float64, object)
  • Reads the entire file into memory (eager); see the chunked-reading sketch below for larger files
  • Best for small/medium datasets
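
Because everything lands in RAM at once, larger-than-memory CSVs are usually processed in chunks. A minimal sketch, assuming a 100k-row chunk size chosen purely for illustration:

total_rows = 0
for chunk in pd.read_csv("employees.csv", chunksize=100_000):
    # each chunk is an ordinary DataFrame holding up to 100k rows
    total_rows += len(chunk)
print(total_rows)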

🔥 PySpark

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("IOStage").getOrCreate()

# Read CSV
df_spark = spark.read.csv("employees.csv", header=True, inferSchema=True)

# Read Parquet
df_parquet = spark.read.parquet("employees.parquet")

# Read JSON
df_json = spark.read.json("employees.json")

# Read from JDBC/SQL
df_sql = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/company") \
    .option("dbtable", "employees") \
    .option("user", "root") \
    .option("password", "password") \
    .load()

🧠 Notes

  • Uses lazy evaluation: reading builds a logical plan rather than loading the data immediately
  • inferSchema=True triggers an extra pass over the data to detect column types (the fraction scanned is controlled by the samplingRatio option)
  • JDBC is common for production data ingestion from databases
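
For large tables, the JDBC source can also read in parallel by splitting on a numeric column. A hedged sketch; the partition column, bounds, and partition count below are assumptions about the table, not values from this guide:

df_parallel = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/company") \
    .option("dbtable", "employees") \
    .option("user", "root") \
    .option("password", "password") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "100000") \
    .option("numPartitions", "8") \
    .load()

Each of the 8 partitions issues its own range query, so the load is spread across executors instead of running as one big fetch.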

⚡ Polars

import polars as pl

# Read CSV
df_pl = pl.read_csv("employees.csv")

# Read JSON
df_json = pl.read_json("employees.json")

# Read Parquet
df_parquet = pl.read_parquet("employees.parquet")

# From SQL
import sqlite3
conn = sqlite3.connect("company.db")
df_sql = pl.read_database("SELECT * FROM employees", connection=conn)

🧠 Notes

  • Polars auto-detects data types quickly thanks to its Rust backend
  • Typically the fastest CSV reader of the three (multi-threaded)
  • Supports lazy reading: pl.scan_csv("employees.csv") builds a query plan and doesn’t read the file until .collect()
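
Lazy scans also accept glob patterns, so many files can be combined into a single optimized query. A sketch, assuming a hypothetical data/*.csv layout with name and age columns:

lazy_all = pl.scan_csv("data/*.csv")       # no file is read yet
result = (
    lazy_all
    .filter(pl.col("age") > 30)            # pushed down into the scan
    .select(["name", "age"])
    .collect()                             # execution happens here
)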

🧱 2. Schema Inference

What is Schema Inference?

When reading data (CSV/JSON), libraries automatically detect each column’s data type — e.g., int, float, string, date.

| Library | Schema Handling       | Control Option            |
|---------|-----------------------|---------------------------|
| Pandas  | Auto via sample data  | dtype={"col": "int64"}    |
| PySpark | Optional auto         | inferSchema=True          |
| Polars  | Auto, or specify      | dtypes={"col": pl.Int64}  |

Example Comparison

| Library | Code                                                 | Output Schema                          |
|---------|------------------------------------------------------|----------------------------------------|
| Pandas  | pd.read_csv("data.csv").dtypes                       | id int64, name object, salary float64  |
| PySpark | spark.read.csv(..., inferSchema=True).printSchema()  | id: int, name: string, salary: double  |
| Polars  | pl.read_csv("data.csv").dtypes                       | [Int64, Utf8, Float64]                 |

🧠 Interview Insight:

Q: “How does Spark infer schema?”
A: With inferSchema=True, Spark runs an extra pass over the data (the fraction scanned is set by the samplingRatio option, 1.0 by default) and infers each column’s type. For accuracy and performance, it’s better to define the schema explicitly using StructType.

Example:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df_spark = spark.read.csv("employees.csv", schema=schema, header=True)

Tip: Explicit schema → faster + safer in production.
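
Pandas and Polars offer the same kind of control. A hedged sketch; note that recent Polars releases rename the dtypes parameter to schema_overrides:

# Pandas: force column types at parse time instead of relying on inference
df_pd = pd.read_csv("employees.csv", dtype={"id": "int64", "salary": "float64"})

# Polars: override inferred types per column (newer versions: schema_overrides=...)
df_pl = pl.read_csv("employees.csv", dtypes={"id": pl.Int64, "salary": pl.Float64})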


💾 3. Writing Data

| Format  | Pandas                               | PySpark                                  | Polars                                      |
|---------|--------------------------------------|------------------------------------------|---------------------------------------------|
| CSV     | df.to_csv("output.csv", index=False) | df.write.mode("overwrite").csv("output") | df.write_csv("output.csv")                  |
| Parquet | df.to_parquet("output.parquet")      | df.write.parquet("output")               | df.write_parquet("output.parquet")          |
| JSON    | df.to_json("output.json")            | df.write.json("output_json")             | df.write_json("output.json")                |
| SQL     | df.to_sql("table", conn)             | df.write.jdbc(url, "table")              | df.write_database("table", connection=conn) |

🧠 PySpark Write Modes

| Mode      | Description                                              |
|-----------|----------------------------------------------------------|
| overwrite | Replace existing data                                    |
| append    | Add to existing data                                     |
| ignore    | Skip the write if the path exists                        |
| error     | Fail if the path exists (the default, aka errorifexists) |

Example

df_spark.write.mode("overwrite").parquet("output/employees")
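
In practice, overwrite and append pair up for full versus incremental loads. A sketch; new_rows is a hypothetical DataFrame of freshly arrived records:

# Full refresh: replaces everything under the target path
df_spark.write.mode("overwrite").parquet("output/employees")

# Incremental load: new_rows (hypothetical) is added alongside the existing files
new_rows.write.mode("append").parquet("output/employees")

# With the default mode ("error"/"errorifexists"), a second write to the same path fails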

⚙️ 4. Partitioning & Compression (Performance)

PySpark Example:

df_spark.write \
    .partitionBy("department") \
    .parquet("output/partitioned", mode="overwrite", compression="snappy")

Explanation:

  • Partitioning creates one subfolder per value of the partition column, so queries that filter on that column read only the matching folders (partition pruning; see the read-back sketch below)
  • Compression (e.g., Snappy, Gzip) reduces storage size and I/O, which usually improves overall speed
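
Reading the partitioned output back with a filter on the partition column scans only the matching subfolders. A sketch; the "Sales" department value is illustrative:

from pyspark.sql import functions as F

# Only the department=Sales subfolder is scanned (partition pruning)
sales_df = spark.read.parquet("output/partitioned") \
    .filter(F.col("department") == "Sales")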

Polars:

df_pl.write_parquet("output/snappy.parquet", compression="snappy")

Pandas:
Compression is set per call via the compression parameter:

df_pd.to_csv("output.csv.gz", compression="gzip")
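
Parquet compression works the same way in Pandas. A small sketch; gzip is chosen purely for illustration, and a Parquet engine (pyarrow or fastparquet) must be installed:

df_pd.to_parquet("output_gzip.parquet", compression="gzip")  # default codec is snappy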

🧠 5. Lazy I/O in Spark and Polars

  • Spark: reading doesn’t load data immediately; execution waits until .show(), .count(), .collect()
  • Polars LazyFrame: read lazily and execute later
lazy_df = pl.scan_csv("employees.csv")
result = lazy_df.filter(pl.col("age") > 30).collect()

⚡ Benefit → the engine sees the whole query up front and can optimize it (e.g., predicate and projection pushdown) before touching the data
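
The Spark side of the same idea, as a minimal sketch (it reuses the output/employees path written above; the salary threshold is illustrative):

# Nothing is read yet: Spark only records the plan
df_lazy = spark.read.parquet("output/employees")
filtered = df_lazy.filter(df_lazy["salary"] > 50000)   # still lazy

# The action below triggers the actual file scan and computation
print(filtered.count())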


🔍 6. Comparing Read Performance (Conceptually)

| Feature     | Pandas             | PySpark                       | Polars                                    |
|-------------|--------------------|-------------------------------|-------------------------------------------|
| Speed       | Fast (small data)  | Medium (distributed overhead) | ⚡ Very fast (Rust backend)               |
| Scale       | Limited to RAM     | Scales to a cluster           | Medium–large (multi-core, single machine) |
| Schema      | Automatic          | Infer or define               | Automatic                                 |
| Lazy read   | ❌                 | ✅                            | ✅ (scan_*)                               |
| Compression | Supported          | Strong                        | Strong                                    |

🧠 7. Common Interview Questions

| # | Question                                           | Key Point                                                          |
|---|-----------------------------------------------------|--------------------------------------------------------------------|
| 1 | How does PySpark read large CSV efficiently?        | Lazy + parallel read + partitioned I/O                             |
| 2 | What’s the difference between CSV and Parquet?      | CSV = row-based (slower), Parquet = columnar (faster, compressed)  |
| 3 | What is schema inference in Spark?                  | Auto-detecting column data types from a scan of the data           |
| 4 | When to use an explicit schema?                     | When performance or consistency matters                            |
| 5 | What is the default compression in Spark Parquet?   | Snappy                                                             |
| 6 | What is the difference between eager and lazy I/O?  | Pandas = eager, Spark/Polars = lazy until an action is triggered   |
| 7 | Why is Parquet preferred for big data pipelines?    | Columnar format → less I/O, better compression, predicate pushdown |
| 8 | Can Polars read lazily from multiple files?         | Yes, using glob patterns: pl.scan_csv("data/*.csv")                |
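
For question 5, the codec can be changed through a Spark SQL config or overridden per write. A hedged sketch; the output paths are illustrative:

# Parquet writes default to snappy; this config switches the codec globally
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
df_spark.write.mode("overwrite").parquet("output/gzip_employees")

# Per-write override via the data source option
df_spark.write.option("compression", "zstd").parquet("output/zstd_employees")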

🧩 8. Hands-on Challenge (Mini Project)

Dataset: sales.csv

| sale_id | region | sales | profit | month |
|---------|--------|-------|--------|-------|
| 1       | East   | 5000  | 800    | Jan   |
| 2       | West   | 7000  | 1200   | Jan   |
| 3       | South  | 6500  | 1000   | Feb   |

Tasks

  1. Read CSV in Pandas, PySpark, and Polars
  2. Print schema & data types
  3. Write the file to:
    • Parquet (compressed)
    • JSON
  4. In Spark: write partitioned by month
  5. In Polars: perform lazy read using scan_csv
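
A partial solution sketch for tasks 4 and 5 (file paths are illustrative; adjust them to your layout):

# Task 4 (Spark): write the sales data partitioned by month
df_sales = spark.read.csv("sales.csv", header=True, inferSchema=True)
df_sales.write.mode("overwrite").partitionBy("month").parquet("output/sales_by_month")

# Task 5 (Polars): lazy read, executed only at collect()
lazy_sales = pl.scan_csv("sales.csv")
print(lazy_sales.collect())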

After Stage 2, you now understand:

  • Data I/O pipelines
  • Schema inference and optimization
  • Read/write modes and performance trade-offs
  • Practical syntax differences between Pandas, PySpark, and Polars

Would you like to move to Stage 3 → Data Exploration & Summary Statistics
(covering .head(), .describe(), .info(), column introspection, summary aggregation, profiling, etc.) next?