Excellent point, Rajeev 👏 — you’re absolutely right.
We did Day 1 with a full teaching-style explanation (conceptual + visual + intuitive + practical),
then created a Colab notebook as a companion.
So let’s now do the same properly for Day 2: Tokenization & Embeddings — Full Theory + Visual + Intuitive Flow
(then the Colab you already have acts as your lab manual 🧑‍💻).
🌎 DAY 2 — Tokenization & Embeddings
(How LLMs convert text into numbers and learn meaning)
🧠 1. Why Tokenization Exists
Machines don’t “read” words — they understand numbers.
Tokenization is the first step of every LLM pipeline, converting text → tokens → integer IDs → embeddings.
💬 Example
Sentence:
“ChatGPT creates amazing responses!”
We break it down as:
["Chat", "G", "PT", "creates", "amazing", "responses", "!"]
and then map each token to a number (the split and the IDs here are illustrative, not the output of a real tokenizer):
[1023, 81, 95, 3948, 2191, 1295, 33]
These numbers are called token IDs.
Each represents a unit of text that the model understands.
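Here is a minimal sketch of the token → ID step, reusing the token list and IDs from the example above. The `toy_vocab` dictionary is hypothetical and built only for illustration; a real tokenizer learns a vocabulary with tens of thousands of entries.

```python
# Hypothetical toy vocabulary: tokens and IDs are the illustrative ones from the example.
tokens = ["Chat", "G", "PT", "creates", "amazing", "responses", "!"]
ids = [1023, 81, 95, 3948, 2191, 1295, 33]

toy_vocab = dict(zip(tokens, ids))                    # token -> ID
id_to_token = {i: t for t, i in toy_vocab.items()}    # ID -> token

encoded = [toy_vocab[t] for t in tokens]      # text pieces become integers
decoded = [id_to_token[i] for i in encoded]   # and the mapping is reversible

print(encoded)
print(decoded)
```

The point to notice is that the mapping is just a lookup table in both directions; all the "intelligence" comes later, in the embeddings and transformer layers.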
🧩 2. What Is a Token?
A token is a small chunk of text.
Depending on tokenizer design, a token can be:
| Type | Example | Pros | Cons |
|---|---|---|---|
| Character | H, e, l, l, o | Simple, language-agnostic | Long sequences |
| Word | hello, world | Human-intuitive | Fails for rare/misspelled words |
| Subword | play, ##ing, foot, ##ball | Flexible, balanced | Slightly complex |
| Byte-level | é → 0xC3, 0xA9 (UTF-8 bytes, later merged into larger units) | Universal (handles emojis, code) | Longer token lists |
⚙️ Analogy
Think of tokenization like cutting a paragraph into LEGO blocks 🧱:
- Small blocks → more flexibility, but slower to build
- Big blocks → faster, but can’t represent everything
🧬 3. Evolution of Tokenizers
| Era | Approach | Example Model |
|---|---|---|
| Pre-2015 | Word-level (dictionary-based) | Word2Vec, GloVe |
| 2015–2018 | Subword (BPE, WordPiece) | BERT, GPT |
| 2019–Now | SentencePiece, Unigram, Byte-level BPE | T5, Llama, GPT-2, GPT-4 |
⚡ 4. Tokenization Techniques in Detail
Let’s break down the major types intuitively.
🔹 (a) Word-level Tokenization
Simplest: split text by spaces and punctuation.
Input: "I love AI."
Tokens: ["I", "love", "AI", "."]
Problems:
- Misspelled or unseen words cause an “unknown token” ([UNK])
- Not efficient across many languages
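The unknown-token problem is easy to reproduce. This is a minimal sketch, assuming a regex-based splitter and a hypothetical vocabulary built from a single training sentence; real word-level tokenizers use larger vocabularies but fail in the same way.

```python
import re

def word_tokenize(text):
    # Split into runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical vocabulary: [UNK] plus every token seen in the "training" sentence.
vocab = {"[UNK]": 0}
for tok in word_tokenize("I love AI."):
    vocab.setdefault(tok, len(vocab))

def encode(text):
    # Any token missing from the vocabulary collapses to the [UNK] ID.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in word_tokenize(text)]

print(word_tokenize("I love AI."))   # ['I', 'love', 'AI', '.']
print(encode("I love AI."))          # [1, 2, 3, 4]
print(encode("I lvoe AI."))          # [1, 0, 3, 4]  (the typo becomes [UNK])
```

A single typo erases all information about that word, which is the main reason modern LLMs moved away from pure word-level vocabularies.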
🔹 (b) Character-level Tokenization
Treat each character as a token:
"I love AI" → ['I', ' ', 'l', 'o', 'v', 'e', ' ', 'A', 'I']
✅ Handles every language and emoji
❌ Sequences become very long, and self-attention cost grows roughly with the square of sequence length (impractical for large texts)
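To make the length trade-off concrete, here is a small sketch that counts the same text as words, characters, and UTF-8 bytes. The sample sentence is an assumption chosen to include an emoji; the `n * n` figure is only a rough stand-in for the number of token pairs self-attention has to compare.

```python
text = "I love AI 😂"

words = text.split()                      # whitespace split
chars = list(text)                        # one token per character
byte_ids = list(text.encode("utf-8"))     # one token per UTF-8 byte

for name, seq in [("words", words), ("chars", chars), ("bytes", byte_ids)]:
    n = len(seq)
    print(f"{name:5s} length={n:2d}  pairwise comparisons={n * n}")
```

The emoji is one character but four bytes, so finer-grained units never fail on unseen symbols, yet they make every sequence longer.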
🔹 (c) Subword-level Tokenization
The middle ground — splits into frequently seen pieces.
Two major algorithms dominate modern LLMs:
🧩 Byte-Pair Encoding (BPE) — used in GPT-2
- Start with single characters
- Merge the most frequent pairs iteratively
- Stop when vocab size reached
Example:
playing → play + ing
football → foot + ball
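The three steps above can be sketched in a few lines. This is a toy trainer under simplifying assumptions (no end-of-word marker, no byte-level handling, a hypothetical seven-word corpus), not the GPT-2 implementation.

```python
from collections import Counter

def train_bpe(words, num_merges):
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical toy corpus.
words = ["play"] * 3 + ["playing"] * 2 + ["sing", "ring"]
merges, corpus = train_bpe(words, num_merges=5)
print(merges)
print(dict(corpus))
```

On this toy corpus, five merges are enough for "playing" to end up segmented as play + ing, because the pairs inside "play" and "ing" are the most frequent ones.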
🧩 WordPiece — used in BERT
Similar idea, but each merge is chosen to most increase the likelihood of the training data rather than by raw pair frequency.
Subword continuations are marked with “##”:
playing → play + ##ing
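Once a WordPiece vocabulary exists, splitting a word is a greedy longest-match-first search. The sketch below shows only that inference step, assuming a tiny hypothetical vocabulary; vocabulary training is not shown.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                       # no piece fits: whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary for illustration.
toy_vocab = {"play", "foot", "##ing", "##ball", "##s"}
print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("footballs", toy_vocab))   # ['foot', '##ball', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```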
🔹 (d) SentencePiece / Unigram Model
Used in multilingual models (T5, Llama, etc.)
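SentencePiece is a tokenizer library; Unigram is one of the algorithms it implements (it also implements BPE).
It treats whitespace as an ordinary symbol (written ▁), so no language-specific word splitting is needed.
It trains on raw text, and the Unigram model keeps the subwords that best explain the data under a probabilistic model.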
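To give a feel for the probabilistic selection, here is a sketch of how a Unigram model picks the most likely segmentation with dynamic programming. The probabilities in `toy_probs` are hypothetical (and need not sum to 1 here); training those probabilities is out of scope for this sketch.

```python
import math

def unigram_segment(text, probs):
    # best[i] = (best log-probability of text[:i], start index of its last piece)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None                     # no segmentation covers the text
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], len(text)
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

# Hypothetical subword probabilities.
toy_probs = {"play": 0.2, "ing": 0.2, "pl": 0.1, "ay": 0.1}
toy_probs.update({c: 0.04 for c in "playing"})

print(unigram_segment("playing", toy_probs))   # ['play', 'ing']
```

Fewer, more probable pieces win: play + ing scores 0.2 × 0.2, which beats pl + ay + ing and any character-by-character split.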
🧮 5. Token IDs and Vocabularies
After training a tokenizer, each token gets an ID in the vocabulary.
Example (BERT-style; the IDs shown for play and ##ing are illustrative):
[CLS] = 101
[SEP] = 102
play = 2402
##ing = 176
When you input text, it becomes a sequence of IDs:
[101, 2402, 176, 102]
💥 6. Why Tokenization Matters So Much
Tokenization determines:
- 🔹 How efficiently the model processes text
- 🔹 How well it handles rare words
- 🔹 How large the vocabulary (and embedding table) must be
A tokenizer that splits text into too many or poorly chosen pieces wastes context length and makes rare words harder to learn.
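The link between vocabulary size and embedding table size is simple arithmetic: one row per token, one column per embedding dimension. The sketch below uses the published sizes of bert-base-uncased (30,522 tokens, 768 dimensions) and GPT-2 small (50,257 tokens, 768 dimensions); the helper name is made up for this example.

```python
def embedding_table_params(vocab_size, embed_dim):
    """Number of weights in a vocab_size x embed_dim embedding matrix."""
    return vocab_size * embed_dim

bert_params = embedding_table_params(30_522, 768)   # bert-base-uncased
gpt2_params = embedding_table_params(50_257, 768)   # GPT-2 small

print(f"BERT : {bert_params:,}")   # 23,440,896
print(f"GPT-2: {gpt2_params:,}")   # 38,597,376
```

So tens of millions of parameters sit in the lookup table alone, which is why vocabulary size is a real design decision and not a free choice.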
🧠 7. From Tokens → Embeddings
Once text becomes token IDs, they must become dense vectors that models can process.
This step is called embedding.
💬 Analogy
Think of embedding as giving numerical meaning to each token:
“king”, “queen”, “man”, “woman” → live near each other in vector space.
Each token ID → a learned vector of, say, 768 numbers (for BERT).
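"Near each other" is usually measured with cosine similarity. Here is a minimal sketch with hand-made 3-dimensional vectors; the numbers are invented for illustration and are not real learned embeddings.

```python
import math

def cosine(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors.
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_royal = cosine(toy["king"], toy["queen"])
sim_fruit = cosine(toy["king"], toy["apple"])
print(f"king vs queen: {sim_royal:.3f}")
print(f"king vs apple: {sim_fruit:.3f}")
```

With real embeddings the vectors have hundreds of dimensions, but the comparison works the same way.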
⚙️ Formula
Embedding layer is just a matrix lookup:
Embedding Matrix E ∈ R^(vocab_size × embed_dim)
Output = E[token_id]
If token_id = 2402 (“play”),
the model picks the 2402-th row from this matrix.
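The lookup can be sketched with NumPy. The matrix here is filled with random numbers as a stand-in for learned weights, and the sizes are deliberately small assumptions; rows are zero-indexed, so ID 2402 selects `E[2402]`.

```python
import numpy as np

vocab_size, embed_dim = 5000, 8                  # hypothetical small sizes
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))     # random stand-in for learned weights

token_ids = [2402, 176]                          # illustrative IDs for "play", "##ing"
vectors = E[token_ids]                           # one row per token
print(vectors.shape)                             # (2, 8)

# Equivalent view: multiplying a one-hot vector by E selects the same row.
one_hot = np.zeros(vocab_size)
one_hot[2402] = 1.0
print(np.allclose(one_hot @ E, E[2402]))         # True
```

Frameworks implement the embedding layer as an index operation rather than a matrix product because indexing avoids touching the other rows.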
🔢 Example
| Token | ID | Embedding Vector (shortened) |
|---|---|---|
| play | 2402 | [0.12, –0.33, 0.48, … ] |
| ##ing | 176 | [–0.21, 0.09, 0.45, … ] |
🧭 8. Static vs Contextual Embeddings
📘 Static (Word2Vec, GloVe)
- One vector per word
- Same regardless of context
- “bank” → same vector in both “river bank” and “money bank”
🤖 Contextual (BERT, GPT)
- Word’s vector changes with sentence meaning
- Derived from multiple transformer layers
- Captures syntax and semantics dynamically
⚙️ Visual Comparison
| Sentence | Static Embedding | Contextual Embedding |
|---|---|---|
| “I deposited cash in the bank.” | same | closer to “finance”, “account” |
| “The fisherman sat by the bank.” | same | closer to “river”, “shore” |
That’s why contextual embeddings revolutionized NLP.
🎨 9. Visualizing Embeddings
Imagine plotting each word’s vector in 2D.
Words with similar meaning cluster together:
queen ●
↘
woman ●
king ●
man ●
This is what we do with PCA or t-SNE visualizations (you’ll see this in your Colab notebook).
🧩 10. Embeddings in Transformers
Each token embedding is added to a positional embedding, so the model knows order:
Token Embedding + Position Embedding → Input to Transformer
Otherwise, the model would treat:
“dog bites man”
and
“man bites dog”
as identical (same tokens, different order!).
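A small sketch of why the position term matters. Both embedding tables are random stand-ins for learned weights and the three-word vocabulary is hypothetical; learned position embeddings are assumed here, although some models use fixed sinusoidal ones.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"dog": 0, "bites": 1, "man": 2}
dim, max_len = 4, 3
tok_emb = rng.normal(size=(len(vocab), dim))   # one row per token
pos_emb = rng.normal(size=(max_len, dim))      # one row per position

def embed(words, use_positions):
    ids = [vocab[w] for w in words]
    x = tok_emb[ids]
    if use_positions:
        x = x + pos_emb[: len(ids)]            # token embedding + position embedding
    return x

a = "dog bites man".split()
b = "man bites dog".split()

# Without positions, sentence b is just sentence a with its rows reversed.
print(np.allclose(embed(a, False), embed(b, False)[::-1]))   # True
# With positions, "dog" at position 0 differs from "dog" at position 2.
print(np.allclose(embed(a, True), embed(b, True)[::-1]))     # False
```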
💡 11. What Happens Next (High-Level Flow)
Raw text
↓
Tokenizer → Tokens (subwords)
↓
Token IDs (integers)
↓
Embedding layer → Dense vectors
↓
Transformer layers → Contextual understanding
↓
Output layer + decoding → Generated text
🧪 12. Real Example: BERT vs GPT-2 Tokenization
Text:
“Tokenization is fun!”
| Model | Tokens | Notes |
|---|---|---|
| BERT | [‘token’, ‘##ization’, ‘is’, ‘fun’, ‘!’] | lowercases (uncased model), uses ## continuation |
| GPT-2 | [‘Token’, ‘ization’, ‘ is’, ‘ fun’, ‘!’] | keeps case; the leading space is part of the token (printed as Ġ) |
👉 You’ll explore this interactively in your Day 2 Colab.
🧠 13. Why Embeddings Are the Foundation of “Understanding”
Embeddings encode:
- Semantic meaning
- Grammar & syntax
- Contextual relationships
- Domain adaptation (through fine-tuning)
Every transformer layer refines these embeddings — that’s how models “understand” and “generate” language.
✅ Summary — What You Learned Today
| Concept | Key Takeaway |
|---|---|
| Tokenization | Converts text → tokens → numbers |
| Subword models | Balance between vocab size & flexibility |
| Embeddings | Give numerical meaning to tokens |
| Static vs Contextual | Contextual = smarter, dynamic meanings |
| Visualization | Shows how meaning clusters emerge in vector space |
# %% [markdown]
# Day 2 — Tokenization & Embeddings
This Colab-ready notebook teaches tokenization (char, word, subword/BPE, WordPiece, SentencePiece) and embeddings (static & contextual).
**Goals:**
- Understand different tokenization strategies and their trade-offs.
- Use Hugging Face tokenizers to tokenize text and inspect tokens/ids.
- Extract token embeddings from a pretrained Transformer and visualize them in 2D (PCA / t-SNE).
- Hands-on exercises to build a small custom tokenizer and compare tokenizations.
---
# %% [markdown]
## 0. Setup: Install required packages
Run this cell in Colab (or locally) to install the libraries used in this notebook.
---
# %%
!pip install -q transformers tokenizers sentencepiece scikit-learn matplotlib seaborn
# %% [markdown]
## 1. Quick refresher: Why tokenization matters
- Models operate on **numbers**, not raw text. Tokenizers convert text → integer ids.\
- Tokenization strategy affects vocabulary size, model efficiency, out-of-vocabulary handling, and downstream performance.\
Main types:
- **Character-level**: every char is a token. Small vocab, long sequences.
- **Word-level / Whitespace**: splits on spaces. Simple but poor OOV handling.
- **Subword (BPE / WordPiece / Unigram)**: breaks words into common sub-units. Balances vocab size and expressivity.
---
# %% [markdown]
## 2. Char-level and Word-level tokenization (toy examples)
Let's implement simple char-level and whitespace tokenizers to see token ids.
---
# %%
# Char-level tokenizer
text = "I love playing football with friends."
# char tokenizer
chars = sorted(list(set(text)))
stoi_char = {c:i for i,c in enumerate(chars)}
itos_char = {i:c for c,i in stoi_char.items()}
char_tokens = [stoi_char[c] for c in text]
print('Chars:', chars)
print('Char token ids:', char_tokens[:30])
# word / whitespace tokenizer
words = text.split()
vocab_word = sorted(list(set(words)))
stoi_word = {w:i for i,w in enumerate(vocab_word)}
word_tokens = [stoi_word[w] for w in words]
print('\nWords:', words)
print('Word vocab:', vocab_word)
print('Word token ids:', word_tokens)
# Show reversed mapping example
print('\nReconstructed (words):', ' '.join(vocab_word[idx] for idx in word_tokens))
# %% [markdown]
## 3. Subword tokenization with Hugging Face tokenizers
We'll use `AutoTokenizer` from `transformers` to load common tokenizers quickly. We'll demonstrate WordPiece and BPE behavior using DistilBERT (WordPiece) and GPT-2 (byte-level BPE).
---
# %%
from transformers import AutoTokenizer
# Load DistilBERT tokenizer (WordPiece-like)
bert_tok = AutoTokenizer.from_pretrained('distilbert-base-uncased')
# Load GPT-2 tokenizer (BPE)
gpt2_tok = AutoTokenizer.from_pretrained('gpt2')
sample = "transformers tokenizer: tokenization of compounds like playing-football and emojis 😂"
print('Original text:\n', sample, '\n')
print('DistilBERT tokens and ids:')
bert_enc = bert_tok(sample)
print(bert_tok.convert_ids_to_tokens(bert_enc['input_ids']))
print(bert_enc['input_ids'])
print('\nGPT-2 tokens and ids:')
gpt2_enc = gpt2_tok(sample)
print(gpt2_tok.convert_ids_to_tokens(gpt2_enc['input_ids']))
print(gpt2_enc['input_ids'])
# %% [markdown]
## 4. Subword tokenizers: WordPiece vs BPE vs Unigram
Short notes:
- **WordPiece** (BERT family): greedily builds subwords, uses "##" prefix for continuation tokens.
- **BPE** (GPT-2): Byte-Pair Encoding — merges frequent pairs of bytes/subwords.
- **Unigram** (SentencePiece): probabilistic subword selection (common in T5, ALBERT variations).
---
# %% [markdown]
## 5. Handling unknown words & multilingual text
Tokenization matters for OOV handling and languages without whitespace.
Try tokenizing a made-up or rare word and see how models break it into subwords.
---
# %%
rare = "antidisestablishmentarianismzzz"
print('BERT tokens:', bert_tok.tokenize(rare))
print('GPT-2 tokens:', gpt2_tok.tokenize(rare))
# Non-english example
hindi = 'भारत में एआई तेजी से बढ़ रहा है'
print('\nHindi tokenization (DistilBERT tokenizer may not be ideal):')
print(bert_tok.tokenize(hindi))
# %% [markdown]
## 6. Embeddings: Static vs Contextual
- **Static embeddings**: word → single vector regardless of context (Word2Vec, GloVe).
- **Contextual embeddings**: token vectors depend on surrounding context (BERT, GPT). These are produced by the model's internal layers.
We'll extract embeddings from a pretrained model to visualize how contextual embeddings place similar tokens nearby.
---
# %%
import torch
from transformers import AutoModel
# Load a small transformer model for embeddings
model_name = 'distilbert-base-uncased'
model = AutoModel.from_pretrained(model_name)
model.eval()
def get_token_embeddings(text, tokenizer, model):
    # returns tokens, token_ids, embeddings (tokens x dim)
    enc = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    # take the last hidden layer (batch x seq_len x dim) and drop the batch axis
    last_hidden = out.last_hidden_state.squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(enc['input_ids'].squeeze().tolist())
    return tokens, enc['input_ids'].squeeze().tolist(), last_hidden.numpy()
sample_sent = "The cat sat on the mat. The dog lay on the rug. A feline and a canine."
tokens, ids, embeds = get_token_embeddings(sample_sent, bert_tok, model)
print('Tokens:', tokens)
print('Embeddings shape:', embeds.shape)
# %% [markdown]
## 7. Visualizing embeddings (PCA + t-SNE)
We'll pick a subset of tokens (nouns & adjectives) and project embeddings to 2D using PCA and t-SNE.
---
# %%
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
# select indices of interest (filter out special tokens)
keep = []
keep_tokens = []
for i, t in enumerate(tokens):
    # skip special tokens and punctuation-like tokens
    if t in ('[CLS]', '[SEP]', '.', ',', ''):
        continue
    keep.append(i)
    keep_tokens.append(t)
vecs = embeds[keep]
print('Selected tokens:', keep_tokens)
# PCA to 2D
pca = PCA(n_components=2)
vecs_pca = pca.fit_transform(vecs)
plt.figure(figsize=(7,5))
plt.scatter(vecs_pca[:,0], vecs_pca[:,1])
for i, txt in enumerate(keep_tokens):
plt.annotate(txt, (vecs_pca[i,0], vecs_pca[i,1]))
plt.title('PCA projection of token embeddings (DistilBERT last hidden)')
plt.grid(True)
plt.show()
# t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
vecs_tsne = tsne.fit_transform(vecs)
plt.figure(figsize=(7,5))
plt.scatter(vecs_tsne[:,0], vecs_tsne[:,1])
for i, txt in enumerate(keep_tokens):
plt.annotate(txt, (vecs_tsne[i,0], vecs_tsne[i,1]))
plt.title('t-SNE projection of token embeddings')
plt.grid(True)
plt.show()
# %% [markdown]
## 8. Static embeddings: Train a tiny Word2Vec (gensim) or use a simple co-occurrence embedding
For small corpora you can train embeddings quickly. Here we use a tiny co-occurrence + SVD approach (toy example) to demonstrate the idea.
---
# %%
# Toy corpus and simple co-occurrence matrix
corpus = ["the cat sat on the mat".split(),
"the dog lay on the rug".split(),
"a cat is a feline".split(),
"a dog is a canine".split()]
vocab = sorted({w for sent in corpus for w in sent})
v2i = {w:i for i,w in enumerate(vocab)}
# build co-occurrence within window=1
V = len(vocab)
cooc = np.zeros((V, V), dtype=np.float32)
for sent in corpus:
    for i, w in enumerate(sent):
        wi = v2i[w]
        # count the left and right neighbours (window = 1)
        if i > 0:
            cooc[wi, v2i[sent[i-1]]] += 1
        if i < len(sent) - 1:
            cooc[wi, v2i[sent[i+1]]] += 1
print('Vocab:', vocab)
print('Co-occurrence matrix:\n', cooc)
# SVD to get embeddings
U, S, VT = np.linalg.svd(cooc + 1e-6)
embeddings_svd = U[:, :2] # 2D embeddings for visualization
plt.figure(figsize=(6,5))
plt.scatter(embeddings_svd[:,0], embeddings_svd[:,1])
for i,w in enumerate(vocab):
plt.annotate(w, (embeddings_svd[i,0], embeddings_svd[i,1]))
plt.title('Toy static embeddings via SVD on co-occurrence')
plt.grid(True)
plt.show()
# %% [markdown]
## 9. Exercises
1. Tokenize the word "unbelievability" with BERT and GPT-2. Compare how many tokens are produced and why.\
2. Build a custom BPE tokenizer using the `tokenizers` library from Hugging Face (see docs) on a small corpus and inspect merges.\
3. Use another sentence and visualize contextual embeddings from DistilBERT—compare the embedding of the word "bank" in two contexts: (a) "I went to the bank to deposit money." (b) "The river bank was steep." Check how the embeddings differ.\
4. Train a small Word2Vec (gensim) on the toy corpus and compare nearest neighbors for 'cat' vs the SVD-based embedding.
---
# %% [markdown]
## 10. Summary & Next Steps
You learned:
- Tokenization strategies, trade-offs, and how to inspect token ids.\
- How to extract contextual embeddings from a pretrained transformer and visualize them.\
- Difference between static and contextual embeddings.
**Next:** Day 3 — Attention Mechanism (in-depth math + visualizations + coding your own attention layer from scratch).
# End of notebook
# Tips:
# - Running the model-based embedding extraction (`AutoModel`) downloads model weights; ensure you have internet and optionally GPU.
# - t-SNE can be slow on larger sets—use PCA for quick overviews.
🔮 Coming Next
👉 Day 3 — The Attention Mechanism (The Heart of Transformers)
We’ll explore how embeddings interact —
how the model “pays attention” to relevant tokens
and how this mechanism replaced RNNs forever.
Includes diagrams, matrix math, and a full-coded attention demo.
Would you like me to continue with
Day 3 — Attention Mechanism (Full Theory + Visuals + Demo)
next?