A full course on Generative AI + LLM Engineering

Excellent point, Rajeev 👏 — you’re absolutely right.

We did Day 1 with a full teaching-style explanation (conceptual + visual + intuitive + practical),
then created a Colab notebook as a companion.

So let’s now do the same properly for Day 2: Tokenization & Embeddings — Full Theory + Visual + Intuitive Flow
(then the Colab you already have acts as your lab manual 🧑‍💻).


🌎 DAY 2 — Tokenization & Embeddings

(How LLMs convert text into numbers and learn meaning)


🧠 1. Why Tokenization Exists

Machines don’t “read” words — they understand numbers.
Tokenization is the first step of every LLM pipeline, converting text → tokens → integer IDs → embeddings.


💬 Example

Sentence:

“ChatGPT creates amazing responses!”

We break it down as:

["Chat", "G", "PT", "creates", "amazing", "responses", "!"]

and then map each token to a number (the IDs here are illustrative, not taken from a real vocabulary):

[1023, 81, 95, 3948, 2191, 1295, 33]

These numbers are called token IDs.
Each represents a unit of text that the model understands.


🧩 2. What Is a Token?

A token is a small chunk of text.
Depending on tokenizer design, a token can be:

| Type | Example | Pros | Cons |
| --- | --- | --- | --- |
| Character | H, e, l, l, o | Simple, language-agnostic | Long sequences |
| Word | hello, world | Human-intuitive | Fails for rare/misspelled words |
| Subword | play, ##ing, foot, ##ball | Flexible, balanced | Slightly complex |
| Byte-level | é → bytes C3 A9 | Universal (handles emojis, code) | Longer token lists |

⚙️ Analogy

Think of tokenization like cutting a paragraph into LEGO blocks 🧱:

  • Small blocks → more flexibility, but slower to build
  • Big blocks → faster, but can’t represent everything

🧬 3. Evolution of Tokenizers

| Era | Approach | Example Model |
| --- | --- | --- |
| Pre-2015 | Word-level (dictionary-based) | Word2Vec, GloVe |
| 2015–2018 | Subword (BPE, WordPiece) | GPT (original), BERT |
| 2019–Now | SentencePiece, Unigram, byte-level BPE | GPT-2, T5, Llama, GPT-4 |

⚡ 4. Tokenization Techniques in Detail

Let’s break down the major types intuitively.


🔹 (a) Word-level Tokenization

Simplest: split text by spaces and punctuation.

Input: "I love AI."
Tokens: ["I", "love", "AI", "."]

Problems:

  • Misspelled or unseen words cause “unknown token” ([UNK])
  • Not efficient across many languages
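
To make the [UNK] problem concrete, here is a minimal sketch of a word-level tokenizer. The regex, the tiny vocabulary, and the helper names (`word_tokenize`, `encode`) are assumptions for illustration, not how any particular library does it.

```python
import re

def word_tokenize(text):
    # Split into runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical tiny vocabulary built from a single "training" sentence.
vocab = {"[UNK]": 0}
for tok in word_tokenize("I love AI."):
    vocab.setdefault(tok, len(vocab))

def encode(text):
    # Any word that was never seen in training falls back to the [UNK] id.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in word_tokenize(text)]

print(word_tokenize("I love AI."))  # ['I', 'love', 'AI', '.']
print(encode("I love AI."))         # [1, 2, 3, 4]
print(encode("I luv AI."))          # [1, 0, 3, 4]  ("luv" is unknown)
```

The misspelled "luv" collapses to the same ID as every other unseen word, so the model loses all information about it.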

🔹 (b) Character-level Tokenization

Treat each character as a token:

"I love AI" → ['I', ' ', 'l', 'o', 'v', 'e', ' ', 'A', 'I']

✅ Handles every language and emoji
❌ Sequences become very long, and self-attention cost grows quadratically with sequence length (impractical for large texts)
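
A quick back-of-the-envelope check of the length problem, assuming standard self-attention where every token is compared with every other token (so the number of pairs is n × n):

```python
text = "I love AI"

char_tokens = list(text)     # one token per character, spaces included
word_tokens = text.split()   # one token per whitespace-separated word

n_char, n_word = len(char_tokens), len(word_tokens)

# Pairwise comparisons grow with the square of the sequence length.
print(n_char, n_char ** 2)   # 9 81
print(n_word, n_word ** 2)   # 3 9
```

Three times as many tokens means nine times as many token pairs, and the gap widens quickly on real documents.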


🔹 (c) Subword-level Tokenization

The middle ground — splits into frequently seen pieces.
Two major algorithms dominate modern LLMs:

🧩 Byte-Pair Encoding (BPE) — used in GPT-2

  1. Start with single characters
  2. Merge the most frequent pairs iteratively
  3. Stop when vocab size reached

Example:

playing → play + ing
football → foot + ball
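
The three steps above can be sketched in plain Python. This is a toy trainer, not GPT-2's actual implementation: the word counts are made up, it works on characters rather than bytes, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter

def get_pair_counts(words):
    # words maps a tuple of symbols to how often that word occurs.
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus_counts, num_merges):
    # Step 1: start with single characters.
    words = {tuple(w): f for w, f in corpus_counts.items()}
    merges = []
    # Steps 2 and 3: merge the most frequent pair until the budget is used.
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Hypothetical word frequencies for illustration.
corpus_counts = {"play": 5, "playing": 3, "played": 2, "sing": 2}
merges, words = train_bpe(corpus_counts, num_merges=5)
print(merges)
# [('p', 'l'), ('pl', 'a'), ('pla', 'y'), ('i', 'n'), ('in', 'g')]
print(list(words))
# [('play',), ('play', 'ing'), ('play', 'e', 'd'), ('s', 'ing')]
```

After five merges, "playing" has become play + ing purely from frequency statistics; nobody told the algorithm what a suffix is.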

🧩 WordPiece — used in BERT

Similar idea, but instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data.
Subword continuations are marked with “##”:

playing → play + ##ing
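
Once a WordPiece vocabulary exists, splitting a word is commonly done with greedy longest-match-first lookup. The sketch below shows only that lookup step (not vocabulary training), and the toy vocabulary is hypothetical, not BERT's real one.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first: at each position, take the longest
    # vocabulary entry that matches, then continue after it.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # nothing fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary.
vocab = {"play", "foot", "##ing", "##ball", "##s"}
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
print(wordpiece_tokenize("footballs", vocab))  # ['foot', '##ball', '##s']
print(wordpiece_tokenize("xyz", vocab))        # ['[UNK]']
```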

🔹 (d) SentencePiece / Unigram Model

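SentencePiece is a tokenizer library used by many multilingual models (T5, Llama, etc.).
It trains on raw text and treats whitespace as an ordinary symbol (shown as “▁”), so it needs no language-specific pre-splitting into words.
It supports both BPE and the Unigram model; Unigram starts from a large candidate vocabulary and prunes it, keeping the subwords that best explain the training text probabilistically.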


🧮 5. Token IDs and Vocabularies

After training a tokenizer, each token gets an ID in the vocabulary.

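Example (BERT-style; 101 and 102 are the [CLS] and [SEP] IDs in bert-base-uncased, the other two IDs are illustrative):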

[CLS] = 101  
[SEP] = 102  
play = 2402  
##ing = 176  

When you input text, it becomes a sequence of IDs:

[101, 2402, 176, 102]
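
Here is a minimal sketch of that round trip, using a four-entry vocabulary copied from the example above (treat the IDs as illustrative). The `encode` and `decode` helpers are hypothetical names, not a library API.

```python
# Tiny vocabulary taken from the example; a real one has tens of thousands of entries.
vocab = {"[CLS]": 101, "[SEP]": 102, "play": 2402, "##ing": 176}
id_to_token = {i: t for t, i in vocab.items()}

def encode(pieces):
    # Wrap the subword pieces in the special start/end tokens.
    return [vocab["[CLS]"]] + [vocab[p] for p in pieces] + [vocab["[SEP]"]]

def decode(ids):
    text = ""
    for i in ids:
        tok = id_to_token[i]
        if tok in ("[CLS]", "[SEP]"):
            continue                      # drop special tokens
        if tok.startswith("##"):
            text += tok[2:]               # glue continuation onto the previous piece
        else:
            text += (" " if text else "") + tok
    return text

ids = encode(["play", "##ing"])
print(ids)          # [101, 2402, 176, 102]
print(decode(ids))  # playing
```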

💥 6. Why Tokenization Matters So Much

Tokenization determines:

  • 🔹 How efficiently the model processes text
  • 🔹 How well it handles rare words
  • 🔹 How large the vocabulary (and embedding table) must be

A poorly matched tokenizer wastes context on long token sequences and fragments rare words, which hurts the model's understanding.


🧠 7. From Tokens → Embeddings

Once text becomes token IDs, they must become dense vectors that models can process.
This step is called embedding.


💬 Analogy

Think of embedding as giving numerical meaning to each token:

“king”, “queen”, “man”, “woman” → live near each other in vector space.

Each token ID → a learned vector of, say, 768 numbers (for BERT-base).
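
"Near each other" is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors purely for illustration (real embeddings are learned and much longer), and adds a hypothetical unrelated word, "apple", for contrast.

```python
import numpy as np

def cosine(a, b):
    # 1.0 means same direction, 0.0 means unrelated (orthogonal).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors, not taken from any trained model.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.7, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

sim_royal = cosine(vectors["king"], vectors["queen"])
sim_fruit = cosine(vectors["king"], vectors["apple"])
print(sim_royal > sim_fruit)  # True: king is closer to queen than to apple
```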


⚙️ Formula

Embedding layer is just a matrix lookup:

Embedding Matrix E ∈ R^(vocab_size × embed_dim)
Output = E[token_id]

If token_id = 2402 (“play”),
the model picks the 2402-th row from this matrix.
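
In code, the lookup is plain row indexing. This sketch uses NumPy with a random matrix standing in for learned weights, and toy sizes chosen only so that ID 2402 fits.

```python
import numpy as np

vocab_size, embed_dim = 3000, 8            # toy sizes for illustration
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))  # random stand-in for learned weights

token_ids = [2402, 176]                    # "play", "##ing" from the example
vectors = E[token_ids]                     # row lookup, one row per token
print(vectors.shape)                       # (2, 8)

# Equivalent view: multiplying a one-hot vector by E selects the same row.
one_hot = np.zeros(vocab_size)
one_hot[2402] = 1.0
print(np.allclose(one_hot @ E, E[2402]))   # True
```

The one-hot view explains why the embedding matrix is trained like any other weight matrix, while the indexing view is what makes it cheap in practice.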


🔢 Example

| Token | ID | Embedding Vector (shortened) |
| --- | --- | --- |
| play | 2402 | [0.12, –0.33, 0.48, … ] |
| ##ing | 176 | [–0.21, 0.09, 0.45, … ] |

🧭 8. Static vs Contextual Embeddings

📘 Static (Word2Vec, GloVe)

  • One vector per word
  • Same regardless of context
  • “bank” → same vector in both
    • “river bank”
    • “money bank”

🤖 Contextual (BERT, GPT)

  • Word’s vector changes with sentence meaning
  • Derived from multiple transformer layers
  • Captures syntax and semantics dynamically

⚙️ Visual Comparison

| Sentence | Static Embedding | Contextual Embedding |
| --- | --- | --- |
| “I deposited cash in the bank.” | same | closer to “finance”, “account” |
| “The fisherman sat by the bank.” | same | closer to “river”, “shore” |

That’s why contextual embeddings revolutionized NLP.


🎨 9. Visualizing Embeddings

Imagine plotting each word’s vector in 2D.
Words with similar meaning cluster together:

        queen ●
              ↘
          woman ●
       king ●
    man ●

This is what we do with PCA or t-SNE visualizations (you’ll see this in your Colab notebook).


🧩 10. Embeddings in Transformers

Each token embedding is added to a positional embedding, so the model knows order:

Token Embedding + Position Embedding → Input to Transformer

Otherwise, the model would treat:

“dog bites man”
and
“man bites dog”
as identical (same tokens, different order!).
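
A small sketch of what the position embedding changes. The token and position vectors here are random stand-ins (real ones are learned), and `input_vectors` is a hypothetical helper that builds the per-word input to the transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
tok_emb = {w: rng.normal(size=dim) for w in ["dog", "bites", "man"]}
pos_emb = rng.normal(size=(3, dim))   # one vector per position 0, 1, 2

def input_vectors(words, use_positions):
    # Token embedding, optionally plus the embedding of its position.
    return {
        w: tok_emb[w] + (pos_emb[i] if use_positions else 0.0)
        for i, w in enumerate(words)
    }

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Without positions, "dog" looks the same whether it comes first or last.
same_without = np.allclose(input_vectors(a, False)["dog"], input_vectors(b, False)["dog"])
# With positions, "dog" at position 0 differs from "dog" at position 2.
same_with = np.allclose(input_vectors(a, True)["dog"], input_vectors(b, True)["dog"])

print(same_without)  # True
print(same_with)     # False
```

"bites" sits at position 1 in both sentences, so its input vector is unchanged; only the words that moved get different inputs.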


💡 11. What Happens Next (High-Level Flow)

Raw text
   ↓
Tokenizer → Tokens (subwords)
   ↓
Token IDs (integers)
   ↓
Embedding layer → Dense vectors
   ↓
Transformer layers → Contextual understanding
   ↓
Decoder → Generated text

🧪 12. Real Example: BERT vs GPT-2 Tokenization

Text:

“Tokenization is fun!”

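| Model | Tokens | Notes |
| --- | --- | --- |
| BERT | ['token', '##ization', 'is', 'fun', '!'] | lowercases (uncased model) and uses ## for continuations |
| GPT-2 | ['Token', 'ization', ' is', ' fun', '!'] | merges frequent pairs; a leading space is part of the token (displayed as Ġ by the Hugging Face tokenizer) |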

👉 You’ll explore this interactively in your Day 2 Colab.


🧠 13. Why Embeddings Are the Foundation of “Understanding”

Embeddings encode:

  • Semantic meaning
  • Grammar & syntax
  • Contextual relationships
  • Domain adaptation (through fine-tuning)

Every transformer layer refines these embeddings — that’s how models “understand” and “generate” language.


✅ Summary — What You Learned Today

| Concept | Key Takeaway |
| --- | --- |
| Tokenization | Converts text → tokens → numbers |
| Subword models | Balance between vocab size & flexibility |
| Embeddings | Give numerical meaning to tokens |
| Static vs Contextual | Contextual = smarter, dynamic meanings |
| Visualization | Shows how meaning clusters emerge in vector space |

# %% [markdown]
# Day 2 — Tokenization & Embeddings

This Colab-ready notebook teaches tokenization (char, word, subword/BPE, WordPiece, SentencePiece) and embeddings (static & contextual).

**Goals:**
- Understand different tokenization strategies and their trade-offs.
- Use Hugging Face tokenizers to tokenize text and inspect tokens/ids.
- Extract token embeddings from a pretrained Transformer and visualize them in 2D (PCA / t-SNE).
- Hands-on exercises to build a small custom tokenizer and compare tokenizations.

---

# %% [markdown]
## 0. Setup: Install required packages

Run this cell in Colab (or locally) to install the libraries used in this notebook.

---

# %%
!pip install -q transformers tokenizers sentencepiece scikit-learn matplotlib seaborn

# %% [markdown]
## 1. Quick refresher: Why tokenization matters

- Models operate on **numbers**, not raw text. Tokenizers convert text → integer ids.
- Tokenization strategy affects vocabulary size, model efficiency, out-of-vocabulary handling, and downstream performance.

Main types:
- **Character-level**: every char is a token. Small vocab, long sequences.
- **Word-level / Whitespace**: splits on spaces. Simple but poor OOV handling.
- **Subword (BPE / WordPiece / Unigram)**: breaks words into common sub-units. Balances vocab size and expressivity.

---

# %% [markdown]
## 2. Char-level and Word-level tokenization (toy examples)

Let's implement simple char-level and whitespace tokenizers to see token ids.

---

# %%
# Char-level tokenizer
text = "I love playing football with friends."

# char tokenizer
chars = sorted(list(set(text)))
stoi_char = {c:i for i,c in enumerate(chars)}
itos_char = {i:c for c,i in stoi_char.items()}
char_tokens = [stoi_char[c] for c in text]

print('Chars:', chars)
print('Char token ids:', char_tokens[:30])

# word / whitespace tokenizer
words = text.split()
vocab_word = sorted(list(set(words)))
stoi_word = {w:i for i,w in enumerate(vocab_word)}
word_tokens = [stoi_word[w] for w in words]

print('\nWords:', words)
print('Word vocab:', vocab_word)
print('Word token ids:', word_tokens)

# Map ids back to words to reconstruct the sentence (avoid shadowing the builtin `id`)
print('\nReconstructed (words):', ' '.join(vocab_word[i] for i in word_tokens))

# %% [markdown]
## 3. Subword tokenization with Hugging Face tokenizers

We'll use `AutoTokenizer` from `transformers` to load common tokenizers quickly. We'll compare WordPiece and BPE behavior using DistilBERT (WordPiece) and GPT-2 (byte-level BPE).

---

# %%
from transformers import AutoTokenizer

# Load DistilBERT tokenizer (WordPiece-like)
bert_tok = AutoTokenizer.from_pretrained('distilbert-base-uncased')
# Load GPT-2 tokenizer (BPE)
gpt2_tok = AutoTokenizer.from_pretrained('gpt2')

sample = "transformers tokenizer: tokenization of compounds like playing-football and emojis 😂"

print('Original text:\n', sample, '\n')

print('DistilBERT tokens and ids:')
bert_enc = bert_tok(sample)
print(bert_tok.convert_ids_to_tokens(bert_enc['input_ids']))
print(bert_enc['input_ids'])

print('\nGPT-2 tokens and ids:')
gpt2_enc = gpt2_tok(sample)
print(gpt2_tok.convert_ids_to_tokens(gpt2_enc['input_ids']))
print(gpt2_enc['input_ids'])

# %% [markdown]
## 4. Subword tokenizers: WordPiece vs BPE vs Unigram

Short notes:
- **WordPiece** (BERT family): greedily builds subwords, uses "##" prefix for continuation tokens. 
- **BPE** (GPT-2): Byte-Pair Encoding — merges frequent pairs of bytes/subwords.
- **Unigram** (SentencePiece): probabilistic subword selection (common in T5, ALBERT variations).

---

# %% [markdown]
## 5. Handling unknown words & multilingual text

Tokenization matters for OOV handling and languages without whitespace.
Try tokenizing a made-up or rare word and see how models break it into subwords.

---

# %%
rare = "antidisestablishmentarianismzzz"
print('BERT tokens:', bert_tok.tokenize(rare))
print('GPT-2 tokens:', gpt2_tok.tokenize(rare))

# Non-english example
hindi = 'भारत में एआई तेजी से बढ़ रहा है'
print('\nHindi tokenization (DistilBERT tokenizer may not be ideal):')
print(bert_tok.tokenize(hindi))

# %% [markdown]
## 6. Embeddings: Static vs Contextual

- **Static embeddings**: word → single vector regardless of context (Word2Vec, GloVe).
- **Contextual embeddings**: token vectors depend on surrounding context (BERT, GPT). These are produced by the model's internal layers.

We'll extract embeddings from a pretrained model to visualize how contextual embeddings place similar tokens nearby.

---

# %%
import torch
from transformers import AutoModel

# Load a small transformer model for embeddings
model_name = 'distilbert-base-uncased'
model = AutoModel.from_pretrained(model_name)
model.eval()

def get_token_embeddings(text, tokenizer, model):
    # returns tokens, token_ids, embeddings (tokens x dim)
    enc = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    # take the last hidden layer (batch x seq_len x dim)
    last_hidden = out.last_hidden_state.squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(enc['input_ids'].squeeze().tolist())
    return tokens, enc['input_ids'].squeeze().tolist(), last_hidden.numpy()

sample_sent = "The cat sat on the mat. The dog lay on the rug. A feline and a canine."

tokens, ids, embeds = get_token_embeddings(sample_sent, bert_tok, model)
print('Tokens:', tokens)
print('Embeddings shape:', embeds.shape)

# %% [markdown]
## 7. Visualizing embeddings (PCA + t-SNE)

We'll drop the special tokens and punctuation, then project the remaining token embeddings to 2D using PCA and t-SNE.

---

# %%
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

# select indices of interest: skip special tokens and punctuation
skip = {'[CLS]', '[SEP]', '.', ',', ''}
keep = []
keep_tokens = []
for i, t in enumerate(tokens):
    if t in skip:
        continue
    keep.append(i)
    keep_tokens.append(t)

vecs = embeds[keep]
print('Selected tokens:', keep_tokens)

# PCA to 2D
pca = PCA(n_components=2)
vecs_pca = pca.fit_transform(vecs)

plt.figure(figsize=(7,5))
plt.scatter(vecs_pca[:,0], vecs_pca[:,1])
for i, txt in enumerate(keep_tokens):
    plt.annotate(txt, (vecs_pca[i,0], vecs_pca[i,1]))
plt.title('PCA projection of token embeddings (DistilBERT last hidden)')
plt.grid(True)
plt.show()

# t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
vecs_tsne = tsne.fit_transform(vecs)

plt.figure(figsize=(7,5))
plt.scatter(vecs_tsne[:,0], vecs_tsne[:,1])
for i, txt in enumerate(keep_tokens):
    plt.annotate(txt, (vecs_tsne[i,0], vecs_tsne[i,1]))
plt.title('t-SNE projection of token embeddings')
plt.grid(True)
plt.show()

# %% [markdown]
## 8. Static embeddings: Train a tiny Word2Vec (gensim) or use a simple co-occurrence embedding

For small corpora you can train embeddings quickly. Here we use a tiny co-occurrence + SVD approach (toy example) to demonstrate the idea.

---

# %%
# Toy corpus and simple co-occurrence matrix
corpus = ["the cat sat on the mat".split(),
          "the dog lay on the rug".split(),
          "a cat is a feline".split(),
          "a dog is a canine".split()]

vocab = sorted({w for sent in corpus for w in sent})
v2i = {w:i for i,w in enumerate(vocab)}

# build co-occurrence within window=1
V = len(vocab)
cooc = np.zeros((V, V), dtype=np.float32)
for sent in corpus:
    for i,w in enumerate(sent):
        wi = v2i[w]
        if i > 0:
            cooc[wi, v2i[sent[i-1]]] += 1
        if i < len(sent)-1:
            cooc[wi, v2i[sent[i+1]]] += 1

print('Vocab:', vocab)
print('Co-occurrence matrix:\n', cooc)

# SVD to get embeddings
U, S, VT = np.linalg.svd(cooc + 1e-6)
embeddings_svd = U[:, :2]  # 2D embeddings for visualization

plt.figure(figsize=(6,5))
plt.scatter(embeddings_svd[:,0], embeddings_svd[:,1])
for i,w in enumerate(vocab):
    plt.annotate(w, (embeddings_svd[i,0], embeddings_svd[i,1]))
plt.title('Toy static embeddings via SVD on co-occurrence')
plt.grid(True)
plt.show()

# %% [markdown]
## 9. Exercises

1. Tokenize the word "unbelievability" with BERT and GPT-2. Compare how many tokens each produces and explain why.
2. Build a custom BPE tokenizer using the `tokenizers` library from Hugging Face (see docs) on a small corpus and inspect the merges.
3. Use another sentence and visualize contextual embeddings from DistilBERT. Compare the embedding of the word "bank" in two contexts: (a) "I went to the bank to deposit money." (b) "The river bank was steep." Check how the embeddings differ.
4. Train a small Word2Vec (gensim) on the toy corpus and compare nearest neighbors for 'cat' vs the SVD-based embedding.

---

# %% [markdown]
## 10. Summary & Next Steps

You learned:
- Tokenization strategies, trade-offs, and how to inspect token ids.
- How to extract contextual embeddings from a pretrained transformer and visualize them.
- Difference between static and contextual embeddings.

**Next:** Day 3 — Attention Mechanism (in-depth math + visualizations + coding your own attention layer from scratch).

# End of notebook

# Tips:
# - Running the model-based embedding extraction (`AutoModel`) downloads model weights; ensure you have internet and optionally GPU.
# - t-SNE can be slow on larger sets—use PCA for quick overviews.

🔮 Coming Next

👉 Day 3 — The Attention Mechanism (The Heart of Transformers)
We’ll explore how embeddings interact,
how the model “pays attention” to relevant tokens,
and how this mechanism largely replaced RNNs in language modeling.
Includes diagrams, matrix math, and a fully coded attention demo.


Would you like me to continue with
Day 3 — Attention Mechanism (Full Theory + Visuals + Demo)
next?
