Tokenizer

heekyung · September 28, 2024

Here's a comparison of BERT's WordPiece tokenizer with other popular tokenization techniques, in terms of vocabulary size, the use of subwords, and how each tokenizer works.

1. WordPiece Tokenizer (used by BERT)

  • Vocabulary Size: 30,522 tokens for BERT-base.
  • Subword Usage: Yes.
  • How it Works:
    • WordPiece breaks a word into subwords when the full word is not in the vocabulary.
    • Starting from the beginning of the word, it greedily matches the longest subword present in the vocabulary, shortening the candidate until a match is found.
    • Common words are stored as full tokens (e.g., "dog"), while rare or complex words are split into subword tokens, with non-initial pieces marked by a "##" prefix (e.g., "unhappiness" might become ["un", "##happi", "##ness"]); see the sketch after this list.
  • Strengths:
    • Handles out-of-vocabulary (OOV) words well through subword splitting.
    • Reduces the vocabulary size while still maintaining flexibility.
    • Efficient for handling different languages or mixed languages.
  • Weaknesses:
    • Splitting words into subwords may result in longer input sequences, which can lead to increased memory and compute requirements.
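
To make the greedy longest-match procedure concrete, here is a minimal sketch in Python. The toy vocabulary and the wordpiece_tokenize helper are invented for illustration; BERT's real vocabulary has 30,522 entries and the production implementation lives in libraries such as Hugging Face tokenizers.

```python
# Minimal sketch of WordPiece-style greedy longest-match tokenization.
# TOY_VOCAB is illustrative only; non-initial pieces carry the "##" marker.
TOY_VOCAB = {"un", "##happi", "##ness", "dog", "[UNK]"}

def wordpiece_tokenize(word, vocab=TOY_VOCAB):
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Greedy: try the longest remaining substring first, shrink until a vocab hit.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are prefixed with ##
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no subword in the vocab covers this span
        tokens.append(match)
        start = end
    return tokens

print(wordpiece_tokenize("unhappiness"))  # ['un', '##happi', '##ness']
print(wordpiece_tokenize("dog"))          # ['dog']
```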

2. Byte-Pair Encoding (BPE) (used by GPT-2, RoBERTa)

  • Vocabulary Size: RoBERTa has a vocabulary of 50,265 tokens; GPT-2 has 50,257.
  • Subword Usage: Yes.
  • How it Works:
    • BPE starts with a character-based vocabulary and merges the most frequent pairs of characters or subwords iteratively.
    • It builds larger subwords over time as frequent character combinations are merged.
    • Eventually, it learns both full words and subword combinations, which allows it to tokenize both common and rare words (see the toy merge loop after this list).
  • Strengths:
    • Efficient at encoding both frequent and rare words.
    • Allows for relatively small vocabularies while maintaining good coverage of words.
    • Handles out-of-vocabulary (OOV) words through subword splitting.
  • Weaknesses:
    • Like WordPiece, BPE can result in longer tokenized sequences for rare or complex words, leading to more computational overhead.
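
As a sketch of how BPE learns its merges, the toy loop below follows the classic recipe: start from characters (plus an end-of-word marker) and repeatedly merge the most frequent adjacent pair. The corpus and the number of merge steps are invented for illustration.

```python
# Toy BPE "training": repeatedly merge the most frequent adjacent symbol pair.
# The corpus (word -> frequency) is pre-split into characters plus an
# end-of-word marker </w>; the data and the 5 merges are illustrative only.
from collections import Counter

def get_pair_counts(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    merged = {}
    for word, freq in corpus.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # merge the chosen pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(5):
    pair_counts = get_pair_counts(corpus)
    best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    print(f"merge {step + 1}: {best}")
```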

3. Unigram Language Model (implemented in SentencePiece; used by T5 and XLNet)

  • Vocabulary Size: Typically, T5 uses a vocabulary size of 32,000.
  • Subword Usage: Yes.
  • How it Works:
    • Unigram LM treats tokenization probabilistically: each subword in the vocabulary has a probability, and a word's tokenization is the segmentation with the highest total probability (see the sketch after this list).
    • Training starts with a large vocabulary of candidate subwords and repeatedly prunes it, keeping only the subwords the language model scores as most useful.
    • Rather than merging frequent pairs (as in BPE), it directly drops low-probability subwords.
  • Strengths:
    • Unigram LM is flexible, allowing for a balance between long subwords for frequent tokens and short subwords for rare tokens.
    • It is more probabilistically principled, potentially leading to better subword choices compared to BPE and WordPiece.
    • Popular in multilingual models (such as mT5 and XLM-R) due to its effectiveness in balancing multiple languages.
  • Weaknesses:
    • Similar to WordPiece and BPE, it can create longer tokenized sequences for rare words, increasing sequence length.
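
The segmentation step can be sketched as a simple dynamic program: given a log-probability for each subword, pick the split whose total score is highest. The scores below are invented for illustration; real Unigram models estimate them with an EM procedure over a large corpus.

```python
# Sketch of Unigram-LM segmentation: choose the split of a word that maximizes
# the sum of subword log-probabilities (Viterbi over character positions).
# The scores in LOGP are made up; real models learn them with EM.
import math

LOGP = {"un": math.log(0.05), "happi": math.log(0.02),
        "ness": math.log(0.04), "happiness": math.log(0.001)}

def segment(word):
    n = len(word)
    best = [(-math.inf, -1)] * (n + 1)   # (best score, back-pointer) per position
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in LOGP:
                score = best[start][0] + LOGP[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None                      # no segmentation covers the whole word
    pieces, pos = [], n
    while pos > 0:                       # backtrack through the best split points
        start = best[pos][1]
        pieces.append(word[start:pos])
        pos = start
    return pieces[::-1]

print(segment("unhappiness"))  # ['un', 'happiness'] under these toy scores
```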

4. SentencePiece (used by T5 and XLM-R)

  • Vocabulary Size: Typically 32,000.
  • Subword Usage: Yes.
  • How it Works:
    • SentencePiece is a tokenization library (it implements both BPE and Unigram) that treats input text as a raw stream of characters, so no pre-tokenization is required before subword splitting. This differs from typical WordPiece and BPE setups, which assume the text has already been split on whitespace.
    • Like BPE and Unigram, it breaks text into subwords based on statistical patterns; whitespace itself is kept as a visible meta symbol (▁) so the original text can be reconstructed exactly (see the usage sketch after this list).
  • Strengths:
    • Works well with multiple languages and multilingual corpora.
    • More language-agnostic than WordPiece or BPE since it doesn’t assume whitespace tokenization, making it useful for non-whitespace-separated languages (like Chinese or Japanese).
    • The ability to process raw text streams simplifies handling of languages without clear word boundaries.
  • Weaknesses:
    • Like other subword-based tokenizers, it can lead to longer input sequences.
    • May require additional compute for preprocessing raw text (due to lack of initial tokenization).
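
A minimal usage sketch with the sentencepiece Python package is shown below. The corpus path, model prefix, and vocabulary size are placeholders; the key point is that both training and encoding operate directly on raw text.

```python
# Minimal SentencePiece sketch: train on raw text, then encode raw text.
# "corpus.txt", the model prefix, and vocab_size are placeholder values.
import sentencepiece as spm

# Train a Unigram model directly on an untokenized text file.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # raw text, one sentence per line
    model_prefix="toy_sp",     # writes toy_sp.model and toy_sp.vocab
    vocab_size=8000,
    model_type="unigram",      # "bpe" is also supported
)

# Load the trained model and tokenize a raw string (no pre-tokenization needed).
sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
pieces = sp.encode("SentencePiece works on raw text.", out_type=str)
print(pieces)  # whitespace shows up as the ▁ meta symbol in the pieces
```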

5. Character-Level Tokenization (used by CharRNN and some LSTM models)

  • Vocabulary Size: Small (usually around 100 tokens for common characters like letters, punctuation, and digits).
  • Subword Usage: No.
  • How it Works:
    • Every character in the input text is treated as a token.
    • There is no need for word or subword segmentation, since each character is handled independently (a minimal example follows this list).
  • Strengths:
    • Completely eliminates the OOV problem because every word is decomposed into individual characters.
    • Very simple and can work with any language without retraining.
  • Weaknesses:
    • Significantly increases the length of input sequences, since each character is a separate token.
    • Requires deeper models to learn meaningful context from individual characters, making it slower and less efficient for many tasks.
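
Character-level tokenization needs almost no machinery, as the short sketch below shows; the sample string and vocabulary are arbitrary.

```python
# Character-level tokenization: every character is its own token.
text = "Tokenizers are fun!"
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}  # tiny char vocab

tokens = list(text)                 # one token per character
ids = [vocab[ch] for ch in tokens]  # map characters to integer ids
print(tokens)
print(ids)
```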

6. Word-Level Tokenization (used in older models)

  • Vocabulary Size: Large (could be 100,000 or more).
  • Subword Usage: No.
  • How it Works:
    • Each word in the input text is treated as a token.
    • OOV words that are not in the vocabulary are usually replaced with a special [UNK] token (see the toy sketch after this list).
  • Strengths:
    • Simpler and faster since each word is treated as a single token, leading to shorter input sequences.
    • More interpretable because each token corresponds directly to a word.
  • Weaknesses:
    • Suffers from the OOV problem: any word not in the vocabulary is replaced by [UNK], making it hard to handle rare words or new terms.
    • Requires a very large vocabulary to cover most common words in a language, leading to inefficient memory use.
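
A word-level tokenizer with an [UNK] fallback is essentially a dictionary lookup, as in the toy sketch below; the vocabulary and sentence are invented.

```python
# Word-level tokenization: one token per word, unknown words map to [UNK].
vocab = {"[UNK]": 0, "the": 1, "dog": 2, "barks": 3}

def word_tokenize(sentence):
    # Naive whitespace split; real word-level pipelines also handle punctuation.
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.lower().split()]

print(word_tokenize("The dog barks loudly"))  # [1, 2, 3, 0] -- "loudly" is OOV
```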

Summary Comparison Table

| Tokenizer | Vocabulary Size | Uses Subwords | Mechanism | Strengths | Weaknesses |
| --- | --- | --- | --- | --- | --- |
| WordPiece (BERT) | 30,522 | Yes | Greedy subword splitting (finds the longest matching subword) | Efficient, small vocab, handles OOV with subwords | Longer tokenized sequences |
| Byte-Pair Encoding | ~50,000 (GPT-2, RoBERTa) | Yes | Merges frequent character pairs iteratively | Works well for common and rare words alike | Can lead to longer tokenized sequences |
| Unigram LM (T5) | ~32,000 | Yes | Probabilistically selects most likely subwords and drops unlikely candidates | Flexible and efficient at encoding diverse languages | Can still create longer sequences |
| SentencePiece (T5) | ~32,000 | Yes | Treats raw text as input, breaks words into subwords | Great for multilingual and non-whitespace-separated languages | Adds preprocessing complexity |
| Character-Level | Small (~100) | No | Each character is treated as a token | No OOV problem, works with any language | Very long tokenized sequences, requires deeper models |
| Word-Level | Large (~100,000+) | No | Each word is a token | Simple, interpretable, fast | Suffers from OOV problems, requires huge vocab size |

Hugging Face Transformers provides built-in support for most of these tokenizers (BPE, WordPiece, Unigram, SentencePiece), making them the go-to choices for a wide range of tasks.
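
For example, the same sentence can be run through several of these tokenizers via AutoTokenizer; the model names below are standard Hugging Face checkpoints, and the exact splits depend on each model's learned vocabulary.

```python
# Compare tokenizations from models that use WordPiece, BPE, and SentencePiece/Unigram.
from transformers import AutoTokenizer

text = "Tokenization of unhappiness"
for name in ["bert-base-uncased", "gpt2", "t5-small"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize(text))
```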

GPT and BERT models still heavily rely on BPE and WordPiece, respectively, while multilingual and cross-lingual models are shifting towards SentencePiece and Unigram models for greater flexibility.

Conclusion:

  • Subword tokenization methods like WordPiece, BPE, and Unigram LM are the most popular in modern models because they balance vocabulary size, flexibility, and handling of rare words.
  • Character-level and word-level tokenization are less commonly used now due to their limitations with rare words (OOV) or sequence length inefficiencies.