🧩 What is a Tokenizer?
A tokenizer is a text preprocessing tool used in NLP (Natural Language Processing) that converts human-readable text into numbers so that models like BERT or DistilBERT can process it.
💬 Why do we need it?
Machine learning models cannot understand raw text like:
"I love this movie!"
So, we must convert text → tokens → numbers (IDs).
🔤 Step-by-step example
Let’s see what a tokenizer does with DistilBERT.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
text = "I love this movie!"
tokens = tokenizer.tokenize(text)
print(tokens)
📊 Output:
['i', 'love', 'this', 'movie', '!']
So, the tokenizer breaks the sentence into small pieces called tokens. Here every token happens to be a whole word, but that is not always the case.
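Words that are not in the vocabulary get split into smaller subword pieces, marked with "##". A small sketch (using the same distilbert-base-uncased tokenizer; the sentence below is just an illustrative input, taken from Hugging Face's tokenizer documentation):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# "GPU" is not in the vocabulary, so WordPiece splits it into
# the known pieces "gp" and "##u" ("##" marks a continuation).
tokens = tokenizer.tokenize("I have a new GPU!")
print(tokens)  # ['i', 'have', 'a', 'new', 'gp', '##u', '!']
```

This subword splitting is why the tokenizer never fails on unknown words: anything can be built out of smaller known pieces.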
🔢 Convert tokens to IDs
Next, we convert these tokens into numeric IDs — each ID is that token's position in the model's vocabulary.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
Output:
[1045, 2293, 2023, 3185, 999]
Each number corresponds to a token in the vocabulary (DistilBERT shares the uncased BERT vocabulary).
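In practice, you rarely call tokenize and convert_tokens_to_ids separately. Calling the tokenizer directly does both steps and also adds the special [CLS] and [SEP] tokens the model expects. A minimal sketch with the same sentence:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# One call: tokenize, map to IDs, and add special tokens.
encoding = tokenizer("I love this movie!")
print(encoding["input_ids"])
# [101, 1045, 2293, 2023, 3185, 999, 102]
# 101 = [CLS] (start marker), 102 = [SEP] (end marker);
# the IDs in between are the same ones we computed by hand above.

# decode() maps the IDs back to text, which is handy for debugging
print(tokenizer.decode(encoding["input_ids"]))
```

The encoding also includes an attention mask, which the model uses to ignore padding when sentences of different lengths are batched together.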