LLM: English vs Non-English in LLMs

calico · 6 days ago


English vs Non-English in LLMs: Tokenization, Embeddings, and Representation

English and non-English languages differ in tokenization structure and efficiency, but multilingual LLMs align them within a shared embedding space, while position, casing, whitespace, and punctuation further influence interpretation.

To understand how LLMs handle different languages, we must distinguish three key levels:

  • byte: raw encoding unit (e.g., UTF-8)
  • token: the unit the model processes (subword or character)
  • vector (embedding): numerical representation of tokens

The overall pipeline is:

text → tokens → vectors

LLMs do not directly process text as humans see it. They operate on tokens and vectors.
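The pipeline above can be sketched in a few lines of pure Python. This is a minimal sketch assuming a naive whitespace tokenizer and a randomly initialized embedding table; real models use learned subword vocabularies and trained embedding matrices.

```python
import random

def tokenize(text):
    # Naive whitespace tokenizer (real models use subword schemes like BPE).
    return text.split()

# Hypothetical embedding table: each known token maps to a small vector.
random.seed(0)
vocab = ["I", "love", "you"]
embeddings = {tok: [random.random() for _ in range(4)] for tok in vocab}

tokens = tokenize("I love you")
vectors = [embeddings[tok] for tok in tokens]

print(tokens)   # ['I', 'love', 'you']
print(len(vectors), len(vectors[0]))  # 3 tokens, each a 4-dim vector
```

The model never sees the string "I love you"; it sees only the list of vectors.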

1. English vs Non-English: Not a single category

“Non-English” is not a single uniform group.
Different languages (Korean, Japanese, Chinese, French, etc.) have fundamentally different structures, which leads to different tokenization behaviors.

2. Tokenization differences across languages

English (space-based language)

English uses whitespace between words, making tokenization relatively straightforward.

"I love you"
→ ["I", " love", " you"]

Korean (agglutinative language)

Korean attaches grammatical markers to words, so tokenization often splits into smaller units.

"나는 너를 사랑해"
→ ["나", "는", " 너", "를", " 사랑", "해"]

Characteristics:

  • morphological splitting
  • more tokens per sentence
  • suffixes and particles separated

Japanese (mixed script language)

Japanese uses multiple writing systems (Kanji, Hiragana, Katakana) and no spaces.

"私は学生です"
→ ["私", "は", "学生", "です"]

Characteristics:

  • no whitespace
  • mixed character systems
  • complex segmentation

Chinese (character-based language)

Chinese text is often tokenized at or near the character level.

"我爱你"
→ ["我", "爱", "你"]

Characteristics:

  • each character often carries meaning
  • very short tokens
  • no spaces

French (space-based but morphologically richer)

French is similar to English but includes contractions and more inflection.

"Je t'aime"
→ ["Je", " t'", "aime"]

Characteristics:

  • whitespace-based
  • contractions (t’, l’, etc.)
  • slightly more complex morphology than English
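The surface splits shown above can be mimicked with plain Python. This is only a sketch of pre-tokenization, not real subword (BPE) tokenization; Korean and Japanese segmentation needs morphological analysis that is omitted here.

```python
import re

# English: GPT-style pre-tokenization often keeps the leading space
# attached to a word; a simple regex reproduces the split shown above.
english = re.findall(r" ?\S+", "I love you")
print(english)  # ['I', ' love', ' you']

# Chinese: character-level splitting, one token per character.
chinese = list("我爱你")
print(chinese)  # ['我', '爱', '你']
```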

3. Token count and efficiency differences

Different languages produce different numbers of tokens for the same meaning.

Example:

English: "I love you" → 3 tokens
Korean: "나는 너를 사랑해" → 5–6 tokens
Chinese: "我爱你" → 3 tokens

Implications:

  • different input lengths
  • different computational cost
  • possible performance differences
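The count differences are easy to see by comparing the splits from the examples above (actual counts depend on the specific tokenizer used).

```python
# Token counts for the same meaning, using the splits shown earlier.
splits = {
    "English": ["I", " love", " you"],
    "Korean": ["나", "는", " 너", "를", " 사랑", "해"],
    "Chinese": ["我", "爱", "你"],
}

counts = {lang: len(toks) for lang, toks in splits.items()}
print(counts)  # {'English': 3, 'Korean': 6, 'Chinese': 3}
```

The Korean sentence costs roughly twice as many tokens as the English one, which directly translates into longer inputs and higher compute for the same content.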

4. Byte-level differences

Languages also differ at the encoding level.

English

"love"

Typically 1 byte per character (ASCII-compatible)

Korean

"사"

Typically 3 bytes per character (UTF-8)

However:

LLMs usually operate on tokens, not raw bytes, so byte differences are abstracted away after tokenization.
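The byte difference is directly measurable with Python's built-in UTF-8 encoding:

```python
# UTF-8 byte lengths differ per script: ASCII letters take 1 byte each,
# while Hangul syllables take 3 bytes each.
print(len("love".encode("utf-8")))  # 4 bytes (1 per ASCII character)
print(len("사".encode("utf-8")))    # 3 bytes for one Hangul syllable
print(len("사랑".encode("utf-8")))  # 6 bytes for two syllables
```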

5. Embedding vectors: different but shared space

Each token has its own embedding vector.

Examples:

  • "love" → one vector
  • "사랑" → another vector
  • "爱" → another vector

These vectors are:

  • numerically different
  • but can be close in semantic space

6. Shared embedding space (core concept)

Multilingual LLMs typically use a single shared embedding space.

This means:

  • all languages are mapped into one vector space
  • semantically similar words across languages are placed near each other

Examples:

  • "dog" ↔ "개"
  • "love" ↔ "사랑" ↔ "爱"

This enables:

  • translation
  • cross-lingual understanding
  • multilingual reasoning
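This geometry can be illustrated with cosine similarity over toy vectors. The 2-D embeddings below are hypothetical, chosen so that translations of "love" cluster together while "dog" sits elsewhere; real multilingual models learn such geometry in hundreds of dimensions.

```python
import math

# Hypothetical 2-D embeddings for illustration only.
emb = {
    "love": [0.9, 0.1],
    "사랑": [0.85, 0.15],
    "爱":   [0.88, 0.12],
    "dog":  [0.1, 0.9],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine(emb["love"], emb["사랑"]))  # close to 1.0: near-synonyms
print(cosine(emb["love"], emb["dog"]))   # much lower: unrelated meanings
```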

7. Position matters (order sensitivity)

Meaning depends on token order.

Example:

"dog bites man"
"man bites dog"

Same tokens, different meaning.

To handle this, models use:

input = token embedding + position embedding

  • token = what
  • position = where
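The sum above can be sketched with toy 2-D vectors. The embeddings are made-up values for illustration; the point is that the same tokens in a different order yield a different input matrix, so the model can tell the two sentences apart.

```python
# Toy token and position embeddings (hypothetical values).
tok_emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}
pos_emb = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]  # one vector per position

def model_input(tokens):
    # input = token embedding + position embedding, element-wise.
    return [
        [t + p for t, p in zip(tok_emb[tok], pos_emb[i])]
        for i, tok in enumerate(tokens)
    ]

a = model_input(["dog", "bites", "man"])
b = model_input(["man", "bites", "dog"])
print(a != b)  # True: same tokens, different order, different inputs
```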

8. Uppercase vs lowercase

Text form also matters.

"apple" ≠ "Apple"
  • "apple" → common noun
  • "Apple" → proper noun (company)

Implications:

  • different tokens
  • different embeddings
  • different meanings

Most modern LLMs are case-sensitive.
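A case-sensitive vocabulary makes the distinction concrete. The token IDs below are made up for illustration; real IDs depend on the tokenizer.

```python
# Toy case-sensitive vocabulary: "apple" and "Apple" are separate entries,
# so they receive separate embeddings downstream.
vocab = {"apple": 101, "Apple": 102}
print(vocab["apple"] == vocab["Apple"])  # False: different tokens

# A case-insensitive pipeline would lowercase first, collapsing the two:
print(vocab["Apple".lower()])  # 101, the same entry as "apple"
```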

9. Whitespace and punctuation

These are also part of the input.

Whitespace

"hello"
" hello"

May produce different tokens.

Punctuation

"word"
"word."

Changes:

  • token sequence
  • sentence boundary
  • meaning or tone
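Both effects can be demonstrated with the same kind of regex pre-tokenization used earlier; this is a simplification of real subword tokenization, but the surface behavior matches.

```python
import re

def pre_tokenize(text):
    # Keep a leading space attached to words; punctuation is its own token.
    return re.findall(r" ?[\w']+|[.,!?]", text)

print(pre_tokenize("hello"))   # ['hello']
print(pre_tokenize(" hello"))  # [' hello']: the leading space stays on the token
print(pre_tokenize("word."))   # ['word', '.']: punctuation adds a token
```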

10. Why English often performs better

Even though LLMs are multilingual, English often performs better due to:

1) More training data

Large datasets are heavily English-dominated.

2) Efficient tokenization

English aligns well with tokenization schemes.

3) Simpler structure (in many cases)

Compared to morphologically rich or space-less languages, English structure aligns more easily with standard tokenization schemes.

Final Integrated Summary

LLMs process all languages through a shared token–vector pipeline, but different languages such as English, Korean, Japanese, Chinese, and French exhibit fundamentally different tokenization patterns due to their linguistic structures. These differences affect token count, segmentation, and efficiency. Despite this, multilingual models map all tokens into a shared embedding space, allowing semantically similar words across languages to be represented closely. Additionally, meaning is influenced not only by token identity but also by position, casing, whitespace, and punctuation.

All views expressed here are solely my own and do not represent those of any affiliated organization.