Here's a comparison of BERT's WordPiece tokenizer with other popular tokenization techniques, in terms of vocabulary size, the use of subwords, and how each tokenizer works.
1. WordPiece Tokenizer (used by BERT)
- Vocabulary Size: 30,522 tokens for BERT-base.
- Subword Usage: Yes.
- How it Works:
- WordPiece breaks words into subwords if they are not present in the vocabulary.
- At each position it tries the longest possible substring first and progressively shortens it until it finds a match in the vocabulary (greedy longest-match-first).
- Common words are stored as full tokens (e.g., "dog"), while rare or complex words are split into subword tokens (e.g., "unhappiness" might become ["un", "##happi", "##ness"], where "##" marks a continuation piece); a minimal sketch of this matching loop follows this section.
- Strengths:
- Handles out-of-vocabulary (OOV) words well through subword splitting.
- Reduces the vocabulary size while still maintaining flexibility.
- Efficient for handling different languages or mixed languages.
- Weaknesses:
- Splitting words into subwords may result in longer input sequences, which can lead to increased memory and compute requirements.
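To make the greedy matching concrete, here is a minimal, illustrative sketch of WordPiece-style longest-match-first inference. The toy vocabulary and the resulting split are assumptions for illustration, not BERT's actual vocabulary:

```python
# Minimal sketch of WordPiece's greedy longest-match-first inference.
# The toy vocabulary and the word "unhappiness" are illustrative only;
# BERT's real vocabulary has 30,522 entries.
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        cur_token = None
        # Try the longest possible substring first, then shorten it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are prefixed with "##"
            if piece in vocab:
                cur_token = piece
                break
            end -= 1
        if cur_token is None:
            return [unk_token]  # no subword matched: fall back to [UNK]
        tokens.append(cur_token)
        start = end
    return tokens

toy_vocab = {"un", "##happi", "##ness", "dog"}
print(wordpiece_tokenize("unhappiness", toy_vocab))  # ['un', '##happi', '##ness']
print(wordpiece_tokenize("dog", toy_vocab))          # ['dog']
```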
2. Byte-Pair Encoding (BPE) (used by GPT-2, RoBERTa)
- Vocabulary Size: 50,265 tokens for RoBERTa; 50,257 tokens for GPT-2.
- Subword Usage: Yes.
- How it Works:
- BPE starts with a character-based vocabulary and merges the most frequent pairs of characters or subwords iteratively.
- It builds larger subwords over time as frequent character combinations are merged.
- Eventually, it learns both full words and subword combinations, which allows it to tokenize both common and rare words (a toy version of this merge loop is sketched after this section).
- Strengths:
- Efficient at encoding both frequent and rare words.
- Allows for relatively small vocabularies while maintaining good coverage of words.
- Handles out-of-vocabulary (OOV) words through subword splitting.
- Weaknesses:
- Like WordPiece, BPE can result in longer tokenized sequences for rare or complex words, leading to more computational overhead.
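Below is a toy sketch of the BPE training loop over a tiny, made-up word-frequency table. Real implementations (e.g., GPT-2's byte-level BPE) operate on bytes and learn tens of thousands of merges, but the core pair-counting and merging idea is the same:

```python
from collections import Counter

# Toy sketch of BPE training: repeatedly merge the most frequent adjacent
# symbol pair. The corpus and number of merges are illustrative only.
def learn_bpe(word_freqs, num_merges):
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        # Apply the merge everywhere in the corpus.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return merges

print(learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=3))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```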
3. Unigram Language Model (used via SentencePiece in T5 and XLNet)
- Vocabulary Size: Typically, T5 uses a vocabulary size of 32,000.
- Subword Usage: Yes.
- How it Works:
- Unigram LM assigns each subword in its vocabulary a probability and segments a word into the subword sequence with the highest total probability.
- It starts with a large candidate vocabulary of possible subwords and iteratively prunes it, keeping the subwords that contribute most to the likelihood of the training corpus.
- Rather than merging frequent pairs (like in BPE), it works top-down by dropping low-probability subwords (a toy segmentation step is sketched after this section).
- Strengths:
- Unigram LM is flexible: frequent strings can be kept as single long subwords, while rarer strings fall back to shorter pieces.
- It is more probabilistically principled, potentially leading to better subword choices compared to BPE and WordPiece.
- Popular in multilingual models (such as mT5 and XLM-R) due to its effectiveness in balancing multiple languages.
- Weaknesses:
- Similar to WordPiece and BPE, it can create longer tokenized sequences for rare words, increasing sequence length.
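The sketch below illustrates only the segmentation step of an already-trained unigram model: given made-up subword probabilities, it picks the split of a word with the highest total log-probability via dynamic programming. The training-time pruning of low-probability subwords is not shown:

```python
import math

# Toy sketch of Unigram LM segmentation: given per-subword probabilities
# (made up here for illustration), pick the split of a word that maximizes
# the total log-probability using dynamic programming.
def unigram_segment(word, subword_probs):
    n = len(word)
    best_score = [-math.inf] * (n + 1)
    best_split = [0] * (n + 1)
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in subword_probs:
                score = best_score[start] + math.log(subword_probs[piece])
                if score > best_score[end]:
                    best_score[end] = score
                    best_split[end] = start
    # Backtrack to recover the best segmentation (assumes the word is
    # coverable by the toy vocabulary).
    pieces, end = [], n
    while end > 0:
        start = best_split[end]
        pieces.append(word[start:end])
        end = start
    return list(reversed(pieces))

toy_probs = {"un": 0.05, "happi": 0.02, "ness": 0.04, "unhappi": 0.001, "happiness": 0.01}
print(unigram_segment("unhappiness", toy_probs))  # ['un', 'happiness']
```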
4. SentencePiece (used by T5 and XLM-R)
- Vocabulary Size: Typically 32,000.
- Subword Usage: Yes.
- How it Works:
- SentencePiece is a tokenization framework rather than a distinct algorithm: it implements both BPE and Unigram LM segmentation, but treats input text as a raw stream of characters and does not require pre-tokenization (e.g., whitespace splitting) before subword splitting, unlike the usual WordPiece and BPE pipelines.
- Whitespace is encoded as part of the tokens themselves (marked with "▁"), so the original text can be reconstructed exactly from the tokens; a usage sketch with the sentencepiece package follows this section.
- Strengths:
- Works well with multiple languages and multilingual corpora.
- More language-agnostic than WordPiece or BPE since it doesn’t assume whitespace tokenization, making it useful for non-whitespace-separated languages (like Chinese or Japanese).
- The ability to process raw text streams simplifies handling of languages without clear word boundaries.
- Weaknesses:
- Like other subword-based tokenizers, it can lead to longer input sequences.
- May require additional compute for preprocessing raw text (due to lack of initial tokenization).
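A minimal usage sketch with the sentencepiece Python package is shown below; the corpus file name, vocabulary size, and model type are illustrative assumptions:

```python
import sentencepiece as spm

# Train a small SentencePiece model on a local plain-text file.
# "corpus.txt", the vocab size, and model_type are illustrative choices;
# T5-style models use a unigram model with a ~32,000-token vocabulary.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="toy_sp",
    vocab_size=8000,
    model_type="unigram",   # "bpe" is also supported
)

# Load the trained model and tokenize raw text; no whitespace
# pre-tokenization is required, and "▁" marks a word boundary.
sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
print(sp.encode("Tokenization without spaces works too.", out_type=str))
```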
5. Character-Level Tokenization (used by CharRNN and some LSTM models)
- Vocabulary Size: Small (usually around 100 tokens for common characters like letters, punctuation, and digits).
- Subword Usage: No.
- How it Works:
- Every character in the input text is treated as a token.
- No learned word or subword vocabulary is needed; the token set is simply the character inventory (see the short sketch after this section).
- Strengths:
- Completely eliminates the OOV problem because every word is decomposed into individual characters.
- Very simple and can work with any language without retraining.
- Weaknesses:
- Significantly increases the length of input sequences, since each character is a separate token.
- Requires deeper models to learn meaningful context from individual characters, making it slower and less efficient for many tasks.
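A short, self-contained sketch of character-level tokenization (toy text, vocabulary built on the fly):

```python
# Minimal character-level tokenization: every character is its own token,
# so the "vocabulary" is just the set of characters seen (illustrative).
text = "char-level!"
vocab = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(vocab)}

tokens = list(text)                      # one token per character
ids = [char_to_id[ch] for ch in tokens]  # sequence length equals len(text)
print(tokens)  # ['c', 'h', 'a', 'r', '-', 'l', 'e', 'v', 'e', 'l', '!']
print(ids)
```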
6. Word-Level Tokenization (used in older models)
- Vocabulary Size: Large (could be 100,000 or more).
- Subword Usage: No.
- How it Works:
- Each word in the input text is treated as a token.
- Words that are not in the vocabulary (OOV words) are usually replaced with a special [UNK] token (see the toy example after this section).
- Strengths:
- Simpler and faster since each word is treated as a single token, leading to shorter input sequences.
- More interpretable because each token corresponds directly to a word.
- Weaknesses:
- Suffers from the OOV problem: any word not in the vocabulary is replaced by [UNK], making it hard to handle rare words or new terms.
- Requires a very large vocabulary to cover most common words in a language, leading to inefficient memory use.
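A toy sketch of word-level tokenization with an [UNK] fallback; the vocabulary and the whitespace splitting are simplifying assumptions:

```python
# Minimal word-level tokenization with an [UNK] fallback (toy vocabulary).
vocab = {"the": 0, "cat": 1, "sat": 2, "[UNK]": 3}

def word_tokenize(text):
    # Whitespace splitting stands in for a real word tokenizer here.
    return [word if word in vocab else "[UNK]" for word in text.lower().split()]

print(word_tokenize("The cat yawned"))  # ['the', 'cat', '[UNK]']
```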
Summary Comparison Table
| Tokenizer | Vocabulary Size | Uses Subwords | Mechanism | Strengths | Weaknesses |
|---|---|---|---|---|---|
| WordPiece (BERT) | 30,522 | Yes | Greedy longest-match subword splitting | Efficient, small vocab, handles OOV with subwords | Longer tokenized sequences |
| Byte-Pair Encoding (GPT-2, RoBERTa) | ~50,000 | Yes | Iteratively merges the most frequent symbol pairs | Works well for common and rare words alike | Can lead to longer tokenized sequences |
| Unigram LM (T5) | ~32,000 | Yes | Probabilistically selects the most likely subwords and drops unlikely candidates | Flexible and efficient at encoding diverse languages | Can still create longer sequences |
| SentencePiece (T5, XLM-R) | ~32,000 | Yes | Works on raw text; implements BPE and Unigram segmentation | Great for multilingual and non-whitespace-separated languages | Adds preprocessing complexity |
| Character-Level | Small (~100) | No | Each character is a token | No OOV problem, works with any language | Very long tokenized sequences, requires deeper models |
| Word-Level | Large (100,000+) | No | Each word is a token | Simple, interpretable, short sequences | Suffers from OOV, requires a huge vocabulary |
Popular Tokenizers for NLP Tasks Today:
- Hugging Face Transformers provides built-in support for most of these tokenizers (BPE, WordPiece, Unigram, SentencePiece), making them the go-to choices for a wide range of tasks (a loading example follows below).
- GPT and BERT models still heavily rely on BPE and WordPiece, respectively, while multilingual and cross-lingual models are shifting towards SentencePiece and Unigram models for greater flexibility.
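For example, tokenizers for several of these schemes can be loaded through the same AutoTokenizer interface and compared on one sentence; the checkpoints below are common public models and are downloaded from the Hugging Face Hub on first use:

```python
from transformers import AutoTokenizer

# Load tokenizers for models that use different subword schemes.
checkpoints = {
    "WordPiece (BERT)": "bert-base-uncased",
    "Byte-level BPE (GPT-2)": "gpt2",
    "SentencePiece Unigram (T5)": "t5-small",
}

text = "Tokenizers handle unhappiness differently."
for name, ckpt in checkpoints.items():
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    print(f"{name}: {tokenizer.tokenize(text)}")
```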
Conclusion:
- Subword tokenization methods like WordPiece, BPE, and Unigram LM are the most popular in modern models because they balance vocabulary size, flexibility, and handling of rare words.
- Character-level and word-level tokenization are less commonly used now due to their limitations with rare words (OOV) or sequence length inefficiencies.