
The Vocabulary Problem
In the early days of NLP, computers often failed when they saw a word they didn’t know (an “Out of Vocabulary” or OOV word). If your vocabulary contains “play” but not “played,” the model simply has nothing to map “played” onto.
Enter: WordPiece
The WordPiece algorithm, popularized by BERT, solves this by breaking words into smaller sub-units.
Instead of treating “unlikable” as one word, it might see:
un + ##lik + ##able.
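To see this in practice, here is a minimal sketch using the Hugging Face transformers library (my choice of library, not something prescribed by the WordPiece paper). The exact sub-units depend on the learned vocabulary, so real output may differ from the illustration above.

```python
# Minimal sketch: tokenizing with a pretrained WordPiece vocabulary.
# Requires the `transformers` package; splits depend on the learned vocabulary,
# so they may differ from the un + ##lik + ##able illustration above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("unlikable"))  # e.g. ['un', '##lik', '##able'] or similar
print(tokenizer.tokenize("played"))     # e.g. ['played'] if it is in the vocabulary
```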
How It Works
WordPiece is a greedy algorithm. It starts with individual characters and iteratively merges them based on which merge increases the likelihood of the training data the most.
- Iterative Merging: Rather than simply merging the most frequent pair (as Byte-Pair Encoding does), WordPiece scores each adjacent pair by how much more often its parts appear together than apart, and merges the highest-scoring pair; a toy sketch of this step follows this list.
- The Double Hash: In BERT’s WordPiece vocabulary, the ## prefix indicates that a sub-unit continues a word rather than starting a new one.
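Here is a toy sketch of one training step. It is not the production WordPiece trainer; it just illustrates the score count(pair) / (count(first) × count(second)), which approximates the likelihood gain of a merge. The corpus and sub-units are invented for illustration.

```python
from collections import Counter

# Toy corpus: each word is a tuple of current sub-units, with a count.
# Non-initial pieces carry the ## continuation prefix.
corpus = {
    ("p", "##l", "##a", "##y"): 10,
    ("p", "##l", "##a", "##y", "##e", "##d"): 5,
    ("u", "##n", "##l", "##i", "##k", "##a", "##b", "##l", "##e"): 2,
}

def best_merge(corpus):
    """Pick the adjacent pair with the highest WordPiece-style score:
    count(pair) / (count(first) * count(second)).
    This favors pairs whose parts rarely appear apart, which is what
    'increases the likelihood of the training data' means in practice."""
    unit_counts = Counter()
    pair_counts = Counter()
    for pieces, freq in corpus.items():
        for piece in pieces:
            unit_counts[piece] += freq
        for a, b in zip(pieces, pieces[1:]):
            pair_counts[(a, b)] += freq
    return max(
        pair_counts,
        key=lambda p: pair_counts[p] / (unit_counts[p[0]] * unit_counts[p[1]]),
    )

def apply_merge(corpus, pair):
    """Replace every occurrence of the pair with its concatenation
    (dropping the inner ## so '##a' + '##y' becomes '##ay')."""
    a, b = pair
    merged = a + b.removeprefix("##")
    new_corpus = {}
    for pieces, freq in corpus.items():
        out, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(pieces[i])
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

# One training step: find the best merge and apply it. Real training repeats
# this until the vocabulary reaches the target size (~30,000 for BERT).
pair = best_merge(corpus)
corpus = apply_merge(corpus, pair)
print("merged:", pair)
```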
Why It Changed AI
- Efficiency: You can represent a massive vocabulary with just ~30,000 sub-units.
- Universal Knowledge: Even if the model has never seen a specific medical term, it can “guess” its meaning from the sub-units (e.g., neuro + ##logy); a sketch of this lookup follows this list.
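At inference time this “guessing” is a greedy longest-match-first lookup: the tokenizer repeatedly takes the longest prefix of the remaining word that exists in its vocabulary. A toy sketch, with a made-up vocabulary:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first lookup over a WordPiece vocabulary.
    Non-initial pieces are looked up with the ## continuation prefix.
    The vocabulary below is invented for illustration."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                cur = candidate
                break
            end -= 1
        if cur is None:
            return [unk]  # no known sub-unit fits: fall back to the unknown token
        pieces.append(cur)
        start = end
    return pieces

vocab = {"neuro", "##logy", "##log", "un", "##lik", "##able", "play", "##ed"}
print(wordpiece_tokenize("neurology", vocab))  # ['neuro', '##logy']
print(wordpiece_tokenize("unlikable", vocab))  # ['un', '##lik', '##able']
```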
Conclusion
Tokenization is the “front door” of any language model: every input passes through it before the network sees a single number. Understanding WordPiece goes a long way toward understanding why BERT and its successors are so good at grasping the nuances of human language.
References & Further Reading
- Google Research: BERT: Pre-training of Deep Bidirectional Transformers
- Hugging Face: Summary of Tokenizers
- Towards AI: Deep Dive into WordPiece