Monday, 20 October 2025

Tokenizer

🧩 What is a Tokenizer?

A tokenizer is a text preprocessing tool used in NLP (Natural Language Processing) that converts human-readable text into numbers so that models like BERT or DistilBERT can understand it.

💬 Why do we need it?

Machine learning models cannot understand raw text like:

"I love this movie!"

So, we must convert text → tokens → numbers (IDs).

🔤 Step-by-step example

Let’s see what a tokenizer does with DistilBERT.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

text = "I love this movie!"
tokens = tokenizer.tokenize(text)
print(tokens)

📊 Output:

['i', 'love', 'this', 'movie', '!']

So, the tokenizer breaks the sentence into small pieces called tokens. Notice that the uncased model also lowercases the text ("I" became "i").
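Words that are not in the vocabulary are split into smaller subword pieces, and a piece that continues a previous word starts with "##". For example (the exact split depends on the vocabulary, so this is a typical result rather than a guarantee):

print(tokenizer.tokenize("I love tokenization!"))

📊 Output (typically):

['i', 'love', 'token', '##ization', '!']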

🔢 Convert tokens to IDs

Next, we convert these tokens into numeric IDs — each ID is the token's position in the model's vocabulary.

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

📊 Output:

[1045, 2293, 2023, 3185, 999]

Each number corresponds to a token in the model's vocabulary (DistilBERT uses the same uncased vocabulary as BERT).
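In practice, you usually call the tokenizer directly instead of doing these two steps by hand. It then also adds the special [CLS] and [SEP] tokens that BERT-style models expect, plus an attention mask. A minimal sketch of that call:

encoded = tokenizer("I love this movie!")
print(encoded["input_ids"])
print(encoded["attention_mask"])
print(tokenizer.decode(encoded["input_ids"]))

📊 Output:

[101, 1045, 2293, 2023, 3185, 999, 102]
[1, 1, 1, 1, 1, 1, 1]
[CLS] i love this movie! [SEP]

Here 101 and 102 are the IDs of the special [CLS] and [SEP] tokens.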
