🚀 Lesson 7 — FastAPI for AI/ML Model Serving (Transformers, HuggingFace, PyTorch, TensorFlow)
This lesson shows how to use FastAPI, a backend framework, as the serving layer for AI applications.
By the end of this lesson, you will be able to:
✔ Serve ML/DL models with FastAPI
✔ Build a sentiment analysis API
✔ Build a text-generation API (mini ChatGPT endpoint)
✔ Load PyTorch/TensorFlow models
✔ Use async inference to keep the server responsive
✔ Build a production-ready AI inference microservice
Let’s begin. 🔥
🎯 What You Will Learn Today
A. Classical ML Model Serving
✔ Scikit-learn model
✔ Pickle loading
✔ Prediction API
B. HuggingFace Transformers (Most Used Today)
✔ Sentiment analysis API
✔ Text generation API
✔ Embeddings endpoint
C. PyTorch & TensorFlow Models
✔ Load models safely
✔ Async inference
✔ CPU/GPU optimization
D. Production patterns
✔ Batch inference
✔ Rate limiting
✔ Warm start
✔ Background workers
🧠 A. Serve a Traditional ML Model (Scikit-learn)
This is useful for:
- Fraud detection
- Recommendation
- Classification models
- Tabular ML
Step 1: Train & save model (example)
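import pickle
from sklearn.linear_model import LogisticRegression
# Tiny toy dataset: one feature, two classes
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)
# Use a context manager so the file is flushed and closed
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)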
Step 2: Load model in FastAPI
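from fastapi import FastAPI
import pickle
app = FastAPI()
# Load once at startup; only unpickle files you trust, since pickle can run arbitrary code
with open("model.pkl", "rb") as f:
    model = pickle.load(f)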
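Because `pickle.load` can execute code embedded in the file, it may be worth checking that `model.pkl` is the file you saved before loading it. A minimal sketch, assuming you record a SHA-256 digest at training time; `sha256_of` and `load_verified` are hypothetical helpers, not part of FastAPI or scikit-learn:

```python
import hashlib
import pickle


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_verified(path: str, expected_sha256: str):
    """Unpickle `path` only if its digest matches the recorded one."""
    if sha256_of(path) != expected_sha256:
        raise ValueError(f"checksum mismatch for {path}")
    with open(path, "rb") as f:
        return pickle.load(f)
```

A checksum only proves the file is unchanged since you recorded the digest; it does not make an untrusted pickle safe.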
Step 3: Create Prediction API
@app.post("/predict")
def predict(value: int):
    # `value` is read from the query string: POST /predict?value=2
    result = model.predict([[value]])[0]
    return {"prediction": int(result)}
Done.
This is how ML teams expose their models to internal systems.
🤖 B. HuggingFace Transformers — Industry Standard
We use pipelines (simplest way to deploy an NLP model).
Install:
pip install transformers torch
1. Sentiment Analysis API
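from fastapi import Body, FastAPI
from transformers import pipeline
app = FastAPI()
# No model name given, so the pipeline downloads its default sentiment model on first run
sentiment = pipeline("sentiment-analysis")
@app.post("/sentiment")
def analyze(text: str = Body(..., embed=True)):
    # embed=True makes FastAPI expect a JSON body of the form {"text": "..."}
    return sentiment(text)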
Try:
POST → /sentiment
{
"text": "FastAPI is amazing!"
}
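The request above can also be sent from Python with only the standard library. A sketch, assuming the app is served locally at `http://127.0.0.1:8000` and that the endpoint accepts the JSON body shown above; `build_sentiment_request` is a hypothetical helper:

```python
import json
import urllib.request


def build_sentiment_request(text: str, base_url: str = "http://127.0.0.1:8000"):
    """Build (but do not send) a JSON POST request for /sentiment."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/sentiment",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sentiment_request("FastAPI is amazing!")
# With the server running, send it like this:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```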
2. Text Generation API (like mini ChatGPT)
generator = pipeline("text-generation", model="gpt2")
@app.post("/generate")
def generate(prompt: str):
    # max_length counts the prompt tokens plus the newly generated tokens
    output = generator(prompt, max_length=50)
    return {"result": output[0]["generated_text"]}
This is the basic pattern by which:
- Chatbots
- Creative writing assistants
- Copilot-like tools
serve text-generation models.
3. Embeddings API (Used in RAG, search engines)
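embedder = pipeline("feature-extraction", model="sentence-transformers/all-MiniLM-L6-v2")
@app.post("/embed")
def embed(text: str):
    token_vectors = embedder(text)[0]  # one vector per token
    # Mean-pool over tokens to get a single sentence vector
    # (taking [0][0] would return only the first token's vector)
    embedding = [sum(dim) / len(token_vectors) for dim in zip(*token_vectors)]
    return {"embedding": embedding}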
Used in:
✔ Vector databases
✔ Semantic search
✔ RAG pipelines
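Semantic search over embeddings usually comes down to comparing vectors by cosine similarity. A dependency-free sketch; the two-dimensional vectors are made-up toy values rather than real model output, and `cosine_similarity` and `rank` are hypothetical helpers:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: list[float], docs: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy "document embeddings" and a toy query vector
docs = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [1.0, 1.0]}
ranked = rank([1.0, 0.1], docs)
print([name for name, _ in ranked])  # ['doc_a', 'doc_c', 'doc_b']
```

A vector database performs the same comparison, but with an index so it does not have to scan every stored vector.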
⚡ C. PyTorch & TensorFlow Model Serving
1. PyTorch Example
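import torch
# weights_only=False unpickles the whole model object; use it only for files you trust.
# Recent PyTorch releases default to weights_only=True, which rejects full pickled models.
model = torch.load("model.pt", weights_only=False)
model.eval()  # switch dropout/batch-norm layers to inference mode
@app.post("/predict")
def predict(input_data: list[float]):
    tensor = torch.tensor([input_data])  # shape: (1, n_features)
    with torch.no_grad():  # no gradients needed for inference
        output = model(tensor)
    return {"output": output.tolist()}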
2. TensorFlow Example
import numpy as np
import tensorflow as tf
model = tf.keras.models.load_model("model.h5")
@app.post("/predict")
def predict(input_data: list[float]):
    batch = np.array([input_data], dtype="float32")  # shape: (1, n_features)
    result = model.predict(batch)
    return {"result": result.tolist()}
⚡ D. Async Model Inference (Production High-Speed Technique)
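Model inference is CPU/GPU-bound and blocks the thread it runs on. Inside an async def endpoint, hand the call to a thread pool with run_in_executor so the event loop stays free. (FastAPI already runs plain def endpoints in a thread pool.)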
import asyncio
@app.post("/predict_async")
async def predict_async(text: str):
    loop = asyncio.get_running_loop()
    # None selects the event loop's default thread pool executor
    result = await loop.run_in_executor(None, sentiment, text)
    return {"result": result}
The event loop stays free to accept other requests while a prediction runs, although the thread pool itself does not add compute capacity.
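The effect can be seen without FastAPI or a real model. In this sketch, `blocking_model` is a hypothetical stand-in that sleeps for 0.2 seconds, and `heartbeat` is a coroutine that keeps running on the event loop while the blocking call sits in the thread pool:

```python
import asyncio
import time


def blocking_model(text: str) -> dict:
    """Hypothetical stand-in for a model call: blocks its thread briefly."""
    time.sleep(0.2)
    return {"text": text, "n_chars": len(text)}


async def heartbeat(ticks: list) -> None:
    """Record ticks while inference runs, to show the loop is not blocked."""
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.monotonic())


async def main() -> tuple:
    loop = asyncio.get_running_loop()
    ticks: list = []
    # Run the blocking call in the default thread pool and the heartbeat
    # on the event loop at the same time.
    result, _ = await asyncio.gather(
        loop.run_in_executor(None, blocking_model, "FastAPI is amazing!"),
        heartbeat(ticks),
    )
    return result, ticks


result, ticks = asyncio.run(main())
print(result, len(ticks))
```

If `blocking_model` were called directly inside `main`, the heartbeat could not tick until the call returned.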
🏭 E. Production Architecture for AI Model Serving
A common structure looks like this:
FastAPI
↓
Async inference
↓
Background worker (Celery / Redis)
↓
Batching / GPU optimization
↓
Response
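The batching step in the flow above can be sketched with an asyncio queue: requests that arrive within a short window are collected and passed to the model as one batch. This is one possible design under stated assumptions, not a standard FastAPI feature; `MicroBatcher`, `run_model_on_batch`, and the default limits are all hypothetical:

```python
import asyncio


def run_model_on_batch(texts: list[str]) -> list[int]:
    """Hypothetical stand-in for one batched forward pass."""
    return [len(t) for t in texts]


class MicroBatcher:
    """Collects requests for a short window, then runs them as one batch."""

    def __init__(self, max_batch: int = 8, max_wait: float = 0.01):
        self.max_batch = max_batch
        self.max_wait = max_wait
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: list[int] = []  # recorded only for inspection

    async def submit(self, text: str) -> int:
        """Queue one input and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((text, future))
        return await future

    async def worker(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # wait for the first request
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            texts = [text for text, _ in batch]
            # Run the batched call in a thread so the event loop stays free
            outputs = await loop.run_in_executor(None, run_model_on_batch, texts)
            self.batch_sizes.append(len(batch))
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def main():
    batcher = MicroBatcher()
    worker = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit(t) for t in ["a", "bb", "ccc"]))
    worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(main())
print(results, batch_sizes)
```

The sketch leaves out error handling: if the batched call raises, the waiting futures are never resolved, so a real service would need to catch the exception and set it on each future.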
Add:
✔ Auto-scaling
✔ Caching (Redis)
✔ Load balancing
✔ Metrics and monitoring (Prometheus + Grafana)
✔ Logging
✔ Docker + Kubernetes deployment
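Of the items above, caching is the easiest to try locally: when the same input is sent repeatedly, the model only needs to run once per distinct text. A minimal in-process sketch using `functools.lru_cache` as a stand-in for Redis; `expensive_inference` and the call counter are hypothetical:

```python
from functools import lru_cache

call_count = 0  # counts real model invocations, for illustration only


def expensive_inference(text: str) -> int:
    """Hypothetical stand-in for a slow model call."""
    global call_count
    call_count += 1
    return len(text.split())


@lru_cache(maxsize=1024)
def cached_inference(text: str) -> int:
    """Return the cached result for repeated inputs."""
    return expensive_inference(text)


cached_inference("FastAPI is amazing!")
cached_inference("FastAPI is amazing!")  # served from the cache
print(call_count)  # 1
```

An in-process cache is private to each worker process and is lost on restart, which is the gap a shared store such as Redis fills.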
🌐 F. Real Industry Use Case Example
❇️ Example: Summarization Microservice
POST /summarize
Body: {"text": "Big article..."}
FastAPI:
- Validates input
- Sends task to GPU worker
- Returns summary
Used in:
- News summarizers
- Email assistants
- Document intelligence
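The three steps above (validate, hand off to a worker, return) can be sketched without FastAPI. This assumes a single worker thread standing in for the GPU worker; `summarize_on_gpu`, `handle_summarize`, `MAX_CHARS`, and the status codes chosen are hypothetical:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

MAX_CHARS = 10_000  # hypothetical input limit

# One worker thread so only one job uses the (hypothetical) GPU at a time.
gpu_worker = ThreadPoolExecutor(max_workers=1)


def summarize_on_gpu(text: str) -> str:
    """Hypothetical stand-in for the model: keeps the first sentence."""
    return text.split(".")[0].strip() + "."


async def handle_summarize(body: dict) -> dict:
    # 1. Validate input
    text = body.get("text")
    if not isinstance(text, str) or not text.strip():
        return {"status": 422, "error": "text must be a non-empty string"}
    if len(text) > MAX_CHARS:
        return {"status": 413, "error": "text too long"}
    # 2. Send the task to the worker without blocking the event loop
    loop = asyncio.get_running_loop()
    summary = await loop.run_in_executor(gpu_worker, summarize_on_gpu, text)
    # 3. Return the summary
    return {"status": 200, "summary": summary}


ok = asyncio.run(handle_summarize({"text": "Big article. More details follow."}))
bad = asyncio.run(handle_summarize({"text": ""}))
print(ok, bad)
```

In a FastAPI app, a Pydantic model would normally take over step 1, and a task queue such as Celery could replace the in-process executor.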
🧪 G. Full Working AI FastAPI Microservice (Plug & Play)
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline
import asyncio
class Input(BaseModel):
    text: str
app = FastAPI()
# Loaded once at import time so every request reuses the same model
summarizer = pipeline("summarization")
@app.post("/summarize")
async def summarize(data: Input):
    loop = asyncio.get_running_loop()
    # Run the blocking pipeline call in the default thread pool
    result = await loop.run_in_executor(None, summarizer, data.text)
    return {"summary": result[0]["summary_text"]}
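This is a reasonable starting point for moderate traffic; the additions listed in section E (caching, monitoring, scaling) still apply before production use.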
📌 Lesson 7 Summary
You learned:
✔ Serve ML models (sklearn, PyTorch, TF)
✔ HuggingFace pipelines
✔ Sentiment, generation, embedding APIs
✔ Async inference
✔ High-performance GPU/CPU-serving pattern
✔ Real microservice architectures
These are the same building blocks AI engineers use in:
- LLM-based apps
- RAG pipelines
- Chatbot backends
- Document intelligence
- Recommendation engines
🚀 Ready for Lesson 8 — Middleware, Dependency Injection, Background Tasks, Caching (Redis)?
This lesson will make your FastAPI enterprise-level.
Should I continue with Lesson 8?