🚀 Lesson 7 — FastAPI for AI/ML Model Serving (Transformers, HuggingFace, PyTorch, TensorFlow)

This lesson shows how to use FastAPI, a backend framework, as the serving layer for AI applications.

By the end of this lesson, you will be able to:

✔ Serve ML/DL models with FastAPI
✔ Build a sentiment analysis API
✔ Build a text-generation API (mini ChatGPT endpoint)
✔ Load PyTorch/TensorFlow models
✔ Use async inference to keep the API responsive
✔ Build a production-ready AI inference microservice

Let’s begin. 🔥

🎯 What You Will Learn Today

A. Classical ML Model Serving

✔ Scikit-learn model
✔ Pickle loading
✔ Prediction API

B. HuggingFace Transformers (Most Used Today)

✔ Sentiment analysis API
✔ Text generation API
✔ Embeddings endpoint

C. PyTorch & TensorFlow Models

✔ Load models safely
✔ Async inference
✔ CPU/GPU optimization

D. Production patterns

✔ Batch inference
✔ Rate limiting
✔ Warm start
✔ Background workers

🧠 A. Serve a Traditional ML Model (Scikit-learn)

This is useful for:

Fraud detection
Recommendation
Classification models
Tabular ML

Step 1: Train & save model (example)

import pickle
from sklearn.linear_model import LogisticRegression

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

model = LogisticRegression().fit(X, y)

# A context manager makes sure the file is flushed and closed
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

Step 2: Load model in FastAPI

from fastapi import FastAPI
import pickle

app = FastAPI()
# Load once at startup; only unpickle files you trust
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

Step 3: Create Prediction API

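@app.post("/predict")
def predict(value: int):
    # `value` is a scalar parameter, so FastAPI reads it from the query string:
    # POST /predict?value=2
    result = model.predict([[value]])[0]
    return {"prediction": int(result)}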

Done.
This is a common way for ML teams to expose models to internal systems.
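
Because pickle can execute arbitrary code while loading, a service may want to confirm that model.pkl is the exact file it expects before unpickling it. The sketch below shows one way to do that with a SHA-256 digest. The helper names sha256_of and load_verified, and the dictionary standing in for a trained scikit-learn model, are assumptions made for illustration.

```python
import hashlib
import os
import pickle
import tempfile


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_verified(path: str, expected_sha256: str):
    """Unpickle `path` only if its digest matches the recorded one."""
    if sha256_of(path) != expected_sha256:
        raise ValueError(f"checksum mismatch for {path}")
    with open(path, "rb") as f:
        return pickle.load(f)


# Demo with a stand-in object; a real service would pickle the trained model
# and record the digest at training time (hypothetical workflow).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "model.pkl")
with open(demo_path, "wb") as f:
    pickle.dump({"coef": [1.5], "intercept": -2.0}, f)

recorded_digest = sha256_of(demo_path)
loaded = load_verified(demo_path, recorded_digest)

# Tampering with the file changes the digest, so loading is refused
with open(demo_path, "ab") as f:
    f.write(b"tampered")
try:
    load_verified(demo_path, recorded_digest)
    tamper_detected = False
except ValueError:
    tamper_detected = True
```

A checksum only proves the file is unchanged since the digest was recorded; it does not make an untrusted pickle safe.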

🤖 B. HuggingFace Transformers — Industry Standard

We use pipelines (simplest way to deploy an NLP model).

Install:

pip install transformers torch

1. Sentiment Analysis API

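from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Loaded once at import time; the default model is downloaded on first run
sentiment = pipeline("sentiment-analysis")

class TextIn(BaseModel):
    text: str

@app.post("/sentiment")
def analyze(payload: TextIn):
    # A Pydantic model tells FastAPI to read `text` from the JSON body
    return sentiment(payload.text)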

Try:

POST → /sentiment

{
  "text": "FastAPI is amazing!"
}

2. Text Generation API (like mini ChatGPT)

generator = pipeline("text-generation", model="gpt2")

@app.post("/generate")
def generate(prompt: str):
    # max_new_tokens caps only the generated text; max_length would also count the prompt
    output = generator(prompt, max_new_tokens=50)
    return {"result": output[0]["generated_text"]}

The same basic pattern, at a much larger scale, is how these products serve text-generation models:

Chatbots
Creative writing assistants
Copilot-like tools

3. Embeddings API (Used in RAG, search engines)

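import numpy as np

embedder = pipeline("feature-extraction", model="sentence-transformers/all-MiniLM-L6-v2")

@app.post("/embed")
def embed(text: str):
    # The pipeline returns one vector per token: shape [1, num_tokens, hidden_size]
    token_vectors = np.array(embedder(text)[0])
    # Mean-pool over tokens to get a single sentence embedding
    embedding = token_vectors.mean(axis=0)
    return {"embedding": embedding.tolist()}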

Used in:

✔ Vector databases
✔ Semantic search
✔ RAG pipelines
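
To make the semantic search use case concrete, here is a minimal sketch of how embeddings are compared once they exist: documents are ranked by cosine similarity to a query vector. The toy 3-dimensional vectors and the function names cosine_similarity and rank_documents are assumptions for illustration; real vectors would come from an embedding endpoint, and a vector database would replace the in-memory dictionary.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embedding outputs
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.6, 0.8, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]
ranking = rank_documents(query, docs)
```

Here doc_a ranks first because it points in nearly the same direction as the query, while doc_c is orthogonal to it and scores zero.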

⚡ C. PyTorch & TensorFlow Model Serving

1. PyTorch Example

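import torch

# torch.load unpickles the file, so only load checkpoints you trust.
# weights_only=False is needed to load a full pickled model on PyTorch 2.6+.
model = torch.load("model.pt", weights_only=False)
model.eval()

@app.post("/predict")
def predict(input_data: list[float]):
    tensor = torch.tensor([input_data])
    # Disable gradient tracking during inference
    with torch.inference_mode():
        output = model(tensor)
    return {"output": output.tolist()}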

2. TensorFlow Example

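import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("model.h5")

@app.post("/predict")
def predict(input_data: list[float]):
    # Keras expects a batch: shape [1, num_features]
    batch = np.array([input_data], dtype="float32")
    result = model.predict(batch, verbose=0)
    return {"result": result.tolist()}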

⚡ D. Async Model Inference (Production High-Speed Technique)

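Model inference is CPU/GPU heavy and blocking. Calling it directly inside an async def endpoint stalls the event loop for every other request, so hand it to a thread with run_in_executor. (Plain def endpoints, like the ones above, already run in FastAPI's thread pool.)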

import asyncio

@app.post("/predict_async")
async def predict_async(text: str):
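    # Inside a coroutine, get_running_loop() is the preferred call
    loop = asyncio.get_running_loop()
    # None selects the default thread pool; the blocking call runs there
    result = await loop.run_in_executor(None, sentiment, text)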
    return {"result": result}

The event loop stays free to accept other requests while inference runs, although total throughput is still bounded by the available CPU/GPU.
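
The effect can be observed without FastAPI or a real model. In the sketch below, slow_model is a stand-in that blocks for 0.3 seconds, and heartbeat is a hypothetical coroutine that records a tick whenever the event loop is free. Both names are assumptions for illustration.

```python
import asyncio
import time


def slow_model(text: str) -> dict:
    """Stand-in for a blocking model call (not a real model)."""
    time.sleep(0.3)  # simulate inference time
    return {"label": "POSITIVE", "text": text}


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    """Record a tick every 20 ms while the event loop is free."""
    while not stop.is_set():
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.02)


async def main():
    ticks: list = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))

    loop = asyncio.get_running_loop()
    # The blocking call runs in the default thread pool, not on the loop
    result = await loop.run_in_executor(None, slow_model, "FastAPI is amazing!")

    stop.set()
    await beat
    return result, len(ticks)


result, tick_count = asyncio.run(main())
```

If slow_model were called directly inside main, the heartbeat could not tick until the call returned. With run_in_executor it keeps ticking throughout.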

🏭 E. Production Architecture for AI Model Serving

A common structure looks like this:

FastAPI
     ↓
Async inference
     ↓
Background worker (Celery / Redis)
     ↓
Batching / GPU optimization
     ↓
Response

Add:

✔ Auto-scaling
✔ Caching (Redis)
✔ Load balancing
✔ Metrics and dashboards (Prometheus + Grafana)
✔ Logging and monitoring
✔ Docker + Kubernetes deployment
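
The batching stage in the diagram above can be sketched with plain asyncio: requests arriving within a short window are grouped and passed to the model in a single call. This is a simplified illustration, not a production batcher. The MicroBatcher class, its max_batch and max_wait parameters, and the fake_batch_model stand-in are all assumptions.

```python
import asyncio


def fake_batch_model(texts: list[str]) -> list[int]:
    """Stand-in for a model that scores a whole batch in one call."""
    return [len(t) for t in texts]


class MicroBatcher:
    def __init__(self, batch_fn, max_batch: int = 8, max_wait: float = 0.01):
        self.batch_fn = batch_fn
        self.max_batch = max_batch
        self.max_wait = max_wait
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: list[int] = []  # recorded for inspection only

    async def submit(self, item):
        """Called once per request; resolves when its batch has run."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        """Collect requests into batches and run the model on each batch."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # wait for the first request
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            items = [item for item, _ in batch]
            # Run the blocking model call off the event loop
            outputs = await loop.run_in_executor(None, self.batch_fn, items)
            self.batch_sizes.append(len(items))
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def main():
    batcher = MicroBatcher(fake_batch_model, max_batch=4, max_wait=0.05)
    worker = asyncio.create_task(batcher.run())
    texts = ["a", "bb", "ccc", "dddd", "eeeee"]
    results = await asyncio.gather(*(batcher.submit(t) for t in texts))
    worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(main())
```

Five concurrent requests are served with fewer than five model calls. The trade-off is up to max_wait of added latency per request in exchange for better hardware utilization.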

🌐 F. Real Industry Use Case Example

❇️ Example: Summarization Microservice

POST /summarize
Body: {"text": "Big article..."}

FastAPI:

Validates input
Sends task to GPU worker
Returns summary

Used in:

News summarizers
Email assistants
Document intelligence
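
The three steps above (validate, send to a GPU worker, return the summary) can be sketched with a single-thread executor standing in for the GPU worker, so that only one inference job runs at a time. Here fake_summarize is a stand-in for a real model, and the job counters exist only to show that jobs never overlap. A Celery or Redis worker would replace the executor in a larger deployment.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# One worker thread: jobs sent here run strictly one at a time
gpu_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="gpu-worker")

active_jobs = 0
max_active_jobs = 0
counter_lock = threading.Lock()


def fake_summarize(text: str) -> str:
    """Stand-in for a GPU summarization call (not a real model)."""
    global active_jobs, max_active_jobs
    with counter_lock:
        active_jobs += 1
        max_active_jobs = max(max_active_jobs, active_jobs)
    time.sleep(0.05)  # simulate inference
    with counter_lock:
        active_jobs -= 1
    return text[:20]  # "summary" = first 20 characters


async def summarize(text: str) -> dict:
    # Step 1: validate input before it reaches the worker
    if not text.strip():
        raise ValueError("text must not be empty")
    # Step 2: send the task to the dedicated worker thread
    loop = asyncio.get_running_loop()
    summary = await loop.run_in_executor(gpu_executor, fake_summarize, text)
    # Step 3: return the summary
    return {"summary": summary}


async def main():
    articles = [f"Big article number {i} with a lot of text" for i in range(4)]
    return await asyncio.gather(*(summarize(a) for a in articles))


responses = asyncio.run(main())
gpu_executor.shutdown()
```

Four requests are accepted concurrently, but the worker processes them one after another, which avoids several requests competing for the same GPU memory.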

🧪 G. Full Working AI FastAPI Microservice (Plug & Play)

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline
import asyncio

class Input(BaseModel):
    text: str

app = FastAPI()
summarizer = pipeline("summarization")

@app.post("/summarize")
async def summarize(data: Input):
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(None, summarizer, data.text)
    return {"summary": result[0]["summary_text"]}

This is a reasonable starting point for moderate traffic; heavier loads call for the batching, worker, and scaling patterns from section E.
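
The microservice above has no rate limiting, which the outline lists among the production patterns. A token bucket is one common approach. The sketch below is framework-independent, and the TokenBucket class, its parameters, and the injected fake clock are assumptions for illustration rather than part of FastAPI.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        """Return True and spend a token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill according to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# A fake clock makes the demo deterministic
fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=3, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(5)]       # 3 allowed, then 2 rejected
fake_now[0] += 1.0                               # one second passes: 2 tokens refill
after_wait = [bucket.allow() for _ in range(3)]  # 2 allowed, then 1 rejected
```

In an endpoint, a False result would typically map to an HTTP 429 response. Keeping one bucket per client key is left out here for brevity.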

📌 Lesson 7 Summary

You learned:

✔ Serve ML models (sklearn, PyTorch, TF)
✔ HuggingFace pipelines
✔ Sentiment, generation, embedding APIs
✔ Async inference
✔ High-performance GPU/CPU-serving pattern
✔ Real microservice architectures

These are the same building blocks AI engineers use in:

LLM-based apps
RAG pipelines
Chatbot backends
Document intelligence
Recommendation engines

🚀 Ready for Lesson 8 — Middleware, Dependency Injection, Background Tasks, Caching (Redis)?

That lesson covers the pieces that move a FastAPI service toward enterprise-level deployments.

FastAPI A to Z Course with ChatGPT