
The Vocabulary Problem
In the early days of NLP, computers often failed when they saw a word they didn’t know (an “Out of Vocabulary” or OOV word). If your vocabulary contains “play” but not “played,” the model simply has nothing to map “played” onto.
Enter: WordPiece
The WordPiece algorithm, popularized by BERT, solves this by breaking words into smaller sub-units.
Instead of treating “unlikable” as one word, it might see:
un + ##lik + ##able.
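To see this in practice, here is a minimal sketch using the Hugging Face transformers library (my choice of library, not something prescribed by the WordPiece paper). The exact sub-units depend on the learned vocabulary, so real output may differ from the illustration above.

```python
# Minimal sketch: tokenizing with a pretrained WordPiece vocabulary.
# Requires the `transformers` package; splits depend on the learned vocabulary,
# so they may differ from the un + ##lik + ##able illustration above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("unlikable"))  # e.g. ['un', '##lik', '##able'] or similar
print(tokenizer.tokenize("played"))     # e.g. ['played'] if it is in the vocabulary
```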
How It Works
WordPiece is a greedy algorithm. It starts with individual characters and iteratively merges them based on which merge increases the likelihood of the training data the most.
- Iterative Merging: Rather than simply merging the most frequent pair (as Byte-Pair Encoding does), WordPiece scores each adjacent pair by how much more often its parts appear together than apart, and merges the highest-scoring pair; a toy sketch of this step follows this list.
- The Double Hash: In BERT’s WordPiece vocabulary, the ## prefix indicates that a sub-unit continues a word rather than starting a new one.
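Here is a toy sketch of one training step. It is not the production WordPiece trainer; it just illustrates the score count(pair) / (count(first) × count(second)), which approximates the likelihood gain of a merge. The corpus and sub-units are invented for illustration.

```python
from collections import Counter

# Toy corpus: each word is a tuple of current sub-units, with a count.
# Non-initial pieces carry the ## continuation prefix.
corpus = {
    ("p", "##l", "##a", "##y"): 10,
    ("p", "##l", "##a", "##y", "##e", "##d"): 5,
    ("u", "##n", "##l", "##i", "##k", "##a", "##b", "##l", "##e"): 2,
}

def best_merge(corpus):
    """Pick the adjacent pair with the highest WordPiece-style score:
    count(pair) / (count(first) * count(second)).
    This favors pairs whose parts rarely appear apart, which is what
    'increases the likelihood of the training data' means in practice."""
    unit_counts = Counter()
    pair_counts = Counter()
    for pieces, freq in corpus.items():
        for piece in pieces:
            unit_counts[piece] += freq
        for a, b in zip(pieces, pieces[1:]):
            pair_counts[(a, b)] += freq
    return max(
        pair_counts,
        key=lambda p: pair_counts[p] / (unit_counts[p[0]] * unit_counts[p[1]]),
    )

def apply_merge(corpus, pair):
    """Replace every occurrence of the pair with its concatenation
    (dropping the inner ## so '##a' + '##y' becomes '##ay')."""
    a, b = pair
    merged = a + b.removeprefix("##")
    new_corpus = {}
    for pieces, freq in corpus.items():
        out, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(pieces[i])
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

# One training step: find the best merge and apply it. Real training repeats
# this until the vocabulary reaches the target size (~30,000 for BERT).
pair = best_merge(corpus)
corpus = apply_merge(corpus, pair)
print("merged:", pair)
```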
Why It Changed AI
- Efficiency: You can represent a massive vocabulary with just ~30,000 sub-units.
- Universal Knowledge: Even if the model has never seen a specific medical term, it can “guess” its meaning from the sub-units (e.g., neuro + ##logy); a sketch of this lookup follows this list.
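At inference time this “guessing” is a greedy longest-match-first lookup: the tokenizer repeatedly takes the longest prefix of the remaining word that exists in its vocabulary. A toy sketch, with a made-up vocabulary:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first lookup over a WordPiece vocabulary.
    Non-initial pieces are looked up with the ## continuation prefix.
    The vocabulary below is invented for illustration."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                cur = candidate
                break
            end -= 1
        if cur is None:
            return [unk]  # no known sub-unit fits: fall back to the unknown token
        pieces.append(cur)
        start = end
    return pieces

vocab = {"neuro", "##logy", "##log", "un", "##lik", "##able", "play", "##ed"}
print(wordpiece_tokenize("neurology", vocab))  # ['neuro', '##logy']
print(wordpiece_tokenize("unlikable", vocab))  # ['un', '##lik', '##able']
```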
Conclusion
Tokenization is the “front door” of any language model: every input passes through it before the network sees a single number. Understanding WordPiece goes a long way toward understanding why BERT and its successors are so good at grasping the nuances of human language.
References & Further Reading
- Google Research: BERT: Pre-training of Deep Bidirectional Transformers
- Hugging Face: Summary of Tokenizers
- Towards AI: Deep Dive into WordPiece