Here's a comparison of BERT's WordPiece tokenizer with other popular tokenization techniques, in terms of vocabulary size, the use of subwords, and how each tokenizer works.
1. WordPiece Tokenizer (used by BERT)
- Vocabulary Size: 30,522 tokens for BERT-base.
- Subword Usage: Yes.
- How it Works:
- WordPiece breaks words into subwords if they are not present in the vocabulary.
- At each position it tries the longest possible substring first and progressively shortens it until it finds a match in the vocabulary (greedy longest-match-first).
- Common words are stored as full tokens (e.g., "dog"), while rare or complex words are split into subword tokens (e.g., "unhappiness" might become ["un", "##happi", "##ness"], where "##" marks a continuation piece); a minimal sketch of this matching loop follows this section.
- Strengths:
- Handles out-of-vocabulary (OOV) words well through subword splitting.
- Reduces the vocabulary size while still maintaining flexibility.
- Efficient for handling different languages or mixed languages.
- Weaknesses:
- Splitting words into subwords may result in longer input sequences, which can lead to increased memory and compute requirements.
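To make the greedy matching concrete, here is a minimal, illustrative sketch of WordPiece-style longest-match-first inference. The toy vocabulary and the resulting split are assumptions for illustration, not BERT's actual vocabulary:

```python
# Minimal sketch of WordPiece's greedy longest-match-first inference.
# The toy vocabulary and the word "unhappiness" are illustrative only;
# BERT's real vocabulary has 30,522 entries.
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        cur_token = None
        # Try the longest possible substring first, then shorten it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are prefixed with "##"
            if piece in vocab:
                cur_token = piece
                break
            end -= 1
        if cur_token is None:
            return [unk_token]  # no subword matched: fall back to [UNK]
        tokens.append(cur_token)
        start = end
    return tokens

toy_vocab = {"un", "##happi", "##ness", "dog"}
print(wordpiece_tokenize("unhappiness", toy_vocab))  # ['un', '##happi', '##ness']
print(wordpiece_tokenize("dog", toy_vocab))          # ['dog']
```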
2. Byte-Pair Encoding (BPE) (used by GPT-2, RoBERTa)
- Vocabulary Size: 50,265 tokens for RoBERTa; 50,257 tokens for GPT-2.
- Subword Usage: Yes.
- How it Works:
- BPE starts with a character-based vocabulary and merges the most frequent pairs of characters or subwords iteratively.
- It builds larger subwords over time as frequent character combinations are merged.
- Eventually, it learns both full words and subword combinations, which allows it to tokenize both common and rare words (a toy version of this merge loop is sketched after this section).
- Strengths:
- Efficient at encoding both frequent and rare words.
- Allows for relatively small vocabularies while maintaining good coverage of words.
- Handles out-of-vocabulary (OOV) words through subword splitting.
- Weaknesses:
- Like WordPiece, BPE can result in longer tokenized sequences for rare or complex words, leading to more computational overhead.
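Below is a toy sketch of the BPE training loop over a tiny, made-up word-frequency table. Real implementations (e.g., GPT-2's byte-level BPE) operate on bytes and learn tens of thousands of merges, but the core pair-counting and merging idea is the same:

```python
from collections import Counter

# Toy sketch of BPE training: repeatedly merge the most frequent adjacent
# symbol pair. The corpus and number of merges are illustrative only.
def learn_bpe(word_freqs, num_merges):
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        # Apply the merge everywhere in the corpus.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return merges

print(learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=3))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```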
3. Unigram Language Model (used via SentencePiece in T5 and XLNet)
- Vocabulary Size: Typically, T5 uses a vocabulary size of 32,000.
- Subword Usage: Yes.
- How it Works:
- Unigram LM assigns each subword in its vocabulary a probability and segments a word into the subword sequence with the highest total probability.
- It starts with a large candidate vocabulary of possible subwords and iteratively prunes it, keeping the subwords that contribute most to the likelihood of the training corpus.
- Rather than merging frequent pairs (like in BPE), it works top-down by dropping low-probability subwords (a toy segmentation step is sketched after this section).
- Strengths:
- Unigram LM is flexible: frequent strings can be kept as single long subwords, while rarer strings fall back to shorter pieces.
- It is more probabilistically principled, potentially leading to better subword choices compared to BPE and WordPiece.
- Popular in multilingual models (such as mT5 and XLM-R) due to its effectiveness in balancing multiple languages.
- Weaknesses:
- Similar to WordPiece and BPE, it can create longer tokenized sequences for rare words, increasing sequence length.
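The sketch below illustrates only the segmentation step of an already-trained unigram model: given made-up subword probabilities, it picks the split of a word with the highest total log-probability via dynamic programming. The training-time pruning of low-probability subwords is not shown:

```python
import math

# Toy sketch of Unigram LM segmentation: given per-subword probabilities
# (made up here for illustration), pick the split of a word that maximizes
# the total log-probability using dynamic programming.
def unigram_segment(word, subword_probs):
    n = len(word)
    best_score = [-math.inf] * (n + 1)
    best_split = [0] * (n + 1)
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in subword_probs:
                score = best_score[start] + math.log(subword_probs[piece])
                if score > best_score[end]:
                    best_score[end] = score
                    best_split[end] = start
    # Backtrack to recover the best segmentation (assumes the word is
    # coverable by the toy vocabulary).
    pieces, end = [], n
    while end > 0:
        start = best_split[end]
        pieces.append(word[start:end])
        end = start
    return list(reversed(pieces))

toy_probs = {"un": 0.05, "happi": 0.02, "ness": 0.04, "unhappi": 0.001, "happiness": 0.01}
print(unigram_segment("unhappiness", toy_probs))  # ['un', 'happiness']
```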
4. SentencePiece (used by T5 and XLM-R)
- Vocabulary Size: Typically 32,000.
- Subword Usage: Yes.
- How it Works:
- SentencePiece is a tokenization framework rather than a distinct algorithm: it implements both BPE and Unigram LM segmentation, but treats input text as a raw stream of characters and does not require pre-tokenization (e.g., whitespace splitting) before subword splitting, unlike the usual WordPiece and BPE pipelines.
- Whitespace is encoded as part of the tokens themselves (marked with "▁"), so the original text can be reconstructed exactly from the tokens; a usage sketch with the sentencepiece package follows this section.
- Strengths:
- Works well with multiple languages and multilingual corpora.
- More language-agnostic than WordPiece or BPE since it doesn’t assume whitespace tokenization, making it useful for non-whitespace-separated languages (like Chinese or Japanese).
- The ability to process raw text streams simplifies handling of languages without clear word boundaries.
- Weaknesses:
- Like other subword-based tokenizers, it can lead to longer input sequences.
- May require additional compute for preprocessing raw text (due to lack of initial tokenization).
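A minimal usage sketch with the sentencepiece Python package is shown below; the corpus file name, vocabulary size, and model type are illustrative assumptions:

```python
import sentencepiece as spm

# Train a small SentencePiece model on a local plain-text file.
# "corpus.txt", the vocab size, and model_type are illustrative choices;
# T5-style models use a unigram model with a ~32,000-token vocabulary.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="toy_sp",
    vocab_size=8000,
    model_type="unigram",   # "bpe" is also supported
)

# Load the trained model and tokenize raw text; no whitespace
# pre-tokenization is required, and "▁" marks a word boundary.
sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
print(sp.encode("Tokenization without spaces works too.", out_type=str))
```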
5. Character-Level Tokenization (used by CharRNN and some LSTM models)
- Vocabulary Size: Small (usually around 100 tokens for common characters like letters, punctuation, and digits).
- Subword Usage: No.
- How it Works:
- Every character in the input text is treated as a token.
- No learned word or subword vocabulary is needed; the token set is simply the character inventory (see the short sketch after this section).
- Strengths:
- Completely eliminates the OOV problem because every word is decomposed into individual characters.
- Very simple and can work with any language without retraining.
- Weaknesses:
- Significantly increases the length of input sequences, since each character is a separate token.
- Requires deeper models to learn meaningful context from individual characters, making it slower and less efficient for many tasks.
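A short, self-contained sketch of character-level tokenization (toy text, vocabulary built on the fly):

```python
# Minimal character-level tokenization: every character is its own token,
# so the "vocabulary" is just the set of characters seen (illustrative).
text = "char-level!"
vocab = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(vocab)}

tokens = list(text)                      # one token per character
ids = [char_to_id[ch] for ch in tokens]  # sequence length equals len(text)
print(tokens)  # ['c', 'h', 'a', 'r', '-', 'l', 'e', 'v', 'e', 'l', '!']
print(ids)
```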
6. Word-Level Tokenization (used in older models)
- Vocabulary Size: Large (could be 100,000 or more).
- Subword Usage: No.
- How it Works:
- Each word in the input text is treated as a token.
- Words that are not in the vocabulary (OOV words) are usually replaced with a special [UNK] token (see the toy example after this section).
- Strengths:
- Simpler and faster since each word is treated as a single token, leading to shorter input sequences.
- More interpretable because each token corresponds directly to a word.
- Weaknesses:
- Suffers from the OOV problem: any word not in the vocabulary is replaced by [UNK], making it hard to handle rare words or new terms.
- Requires a very large vocabulary to cover most common words in a language, leading to inefficient memory use.
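A toy sketch of word-level tokenization with an [UNK] fallback; the vocabulary and the whitespace splitting are simplifying assumptions:

```python
# Minimal word-level tokenization with an [UNK] fallback (toy vocabulary).
vocab = {"the": 0, "cat": 1, "sat": 2, "[UNK]": 3}

def word_tokenize(text):
    # Whitespace splitting stands in for a real word tokenizer here.
    return [word if word in vocab else "[UNK]" for word in text.lower().split()]

print(word_tokenize("The cat yawned"))  # ['the', 'cat', '[UNK]']
```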
Summary Comparison Table
| Tokenizer | Vocabulary Size | Uses Subwords | Mechanism | Strengths | Weaknesses |
|---|---|---|---|---|---|
| WordPiece (BERT) | 30,522 | Yes | Greedy longest-match subword splitting | Efficient, small vocab, handles OOV with subwords | Longer tokenized sequences |
| Byte-Pair Encoding (GPT-2, RoBERTa) | ~50,000 | Yes | Iteratively merges the most frequent symbol pairs | Works well for common and rare words alike | Can lead to longer tokenized sequences |
| Unigram LM (T5) | ~32,000 | Yes | Probabilistically selects the most likely subwords and drops unlikely candidates | Flexible and efficient at encoding diverse languages | Can still create longer sequences |
| SentencePiece (T5, XLM-R) | ~32,000 | Yes | Works on raw text; implements BPE and Unigram segmentation | Great for multilingual and non-whitespace-separated languages | Adds preprocessing complexity |
| Character-Level | Small (~100) | No | Each character is a token | No OOV problem, works with any language | Very long tokenized sequences, requires deeper models |
| Word-Level | Large (100,000+) | No | Each word is a token | Simple, interpretable, short sequences | Suffers from OOV, requires a huge vocabulary |
Popular Tokenizers for NLP Tasks Today:
- Hugging Face Transformers provides built-in support for most of these tokenizers (BPE, WordPiece, Unigram, SentencePiece), making them the go-to choices for a wide range of tasks (a loading example follows below).
- GPT and BERT models still heavily rely on BPE and WordPiece, respectively, while multilingual and cross-lingual models are shifting towards SentencePiece and Unigram models for greater flexibility.
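For example, tokenizers for several of these schemes can be loaded through the same AutoTokenizer interface and compared on one sentence; the checkpoints below are common public models and are downloaded from the Hugging Face Hub on first use:

```python
from transformers import AutoTokenizer

# Load tokenizers for models that use different subword schemes.
checkpoints = {
    "WordPiece (BERT)": "bert-base-uncased",
    "Byte-level BPE (GPT-2)": "gpt2",
    "SentencePiece Unigram (T5)": "t5-small",
}

text = "Tokenizers handle unhappiness differently."
for name, ckpt in checkpoints.items():
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    print(f"{name}: {tokenizer.tokenize(text)}")
```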
Conclusion:
- Subword tokenization methods like WordPiece, BPE, and Unigram LM are the most popular in modern models because they balance vocabulary size, flexibility, and handling of rare words.
- Character-level and word-level tokenization are less commonly used now due to their limitations with rare words (OOV) or sequence length inefficiencies.