English and non-English languages differ in tokenization structure and efficiency, yet multilingual LLMs align them within a shared embedding space. Beyond the tokens themselves, position, casing, whitespace, and punctuation further influence interpretation.
LLMs do not directly process text as humans see it; they operate on tokens and vectors. To understand how LLMs handle different languages, we must distinguish three key levels:

1. Raw text (characters and bytes)
2. Tokens (the units the model actually reads)
3. Embedding vectors (the numeric representation of each token)

The overall pipeline is:

text → tokens → token IDs → embedding vectors → model
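A minimal sketch of this pipeline, using a three-token vocabulary and two-dimensional vectors invented purely for illustration (real models use vocabularies of tens of thousands of tokens and vectors with hundreds of dimensions):

```python
# Toy sketch of the text → tokens → IDs → vectors pipeline.
# Vocabulary and vectors are made up for illustration.
vocab = {"I": 0, " love": 1, " you": 2}
embeddings = {
    0: [0.1, 0.3],   # vector for "I"
    1: [0.7, -0.2],  # vector for " love"
    2: [0.4, 0.9],   # vector for " you"
}

def tokenize(text):
    # Greedy longest-match over the toy vocabulary.
    tokens, i = [], 0
    while i < len(text):
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens

tokens = tokenize("I love you")          # ["I", " love", " you"]
ids = [vocab[t] for t in tokens]         # [0, 1, 2]
vectors = [embeddings[i] for i in ids]   # what the model actually sees
```

Everything the model does afterward happens on `vectors`; the original string never reaches it.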
“Non-English” is not a single uniform group.
Different languages (Korean, Japanese, Chinese, French, etc.) have fundamentally different structures, which leads to different tokenization behaviors.
English uses whitespace between words, making tokenization relatively straightforward.
"I love you"
→ ["I", " love", " you"]
Korean attaches grammatical markers to words, so tokenization often splits into smaller units.
"나는 너를 사랑해"
→ ["나", "는", " 너", "를", " 사랑", "해"]
Characteristics:
- Agglutinative: particles such as 는 and 를 attach to stems and are often split off as separate tokens
- The same meaning tends to require more tokens than in English
Japanese uses multiple writing systems (Kanji, Hiragana, Katakana) and no spaces.
"私は学生です"
→ ["私", "は", "学生", "です"]
Characteristics:
- Three scripts (Kanji, Hiragana, Katakana) mix within a single sentence
- No whitespace, so segmentation depends entirely on the tokenizer's learned vocabulary
Chinese typically uses character-level tokenization.
"我爱你"
→ ["我", "爱", "你"]
Characteristics:
- Each character is typically its own token and often carries standalone meaning
- Compact: short character sequences express complete sentences
French is similar to English but includes contractions and more inflection.
"Je t'aime"
→ ["Je", " t'", "aime"]
Characteristics:
- Contractions such as t' (te) create extra subword splits
- Richer inflection than English, but whitespace keeps segmentation mostly straightforward
Different languages produce different numbers of tokens for the same meaning.
Example:
English: "I love you" → 3 tokens
Korean: "나는 너를 사랑해" → 5–6 tokens
Chinese: "我爱你" → 3 tokens
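The counts above can be checked with a short script; the segmentations are the ones shown earlier, hard-coded for illustration rather than produced by a real tokenizer (a real BPE vocabulary may split differently):

```python
# Hand-specified segmentations from the examples above (illustrative only).
segmentations = {
    "I love you": ["I", " love", " you"],
    "나는 너를 사랑해": ["나", "는", " 너", "를", " 사랑", "해"],
    "我爱你": ["我", "爱", "你"],
}

for text, tokens in segmentations.items():
    print(f"{text!r}: {len(tokens)} tokens")
```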
Implications:
- Non-English text often costs more tokens, and therefore more compute and API cost, for the same meaning
- Longer token sequences consume the context window faster
- Token efficiency is a property of the language and tokenizer, not of the meaning expressed
Languages also differ at the encoding level.
"love"
Typically 1 byte per character (ASCII-compatible)
"사"
Typically 3 bytes per character (UTF-8)
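This is easy to verify directly, since UTF-8 byte lengths are fixed by the encoding itself:

```python
# UTF-8 byte lengths: ASCII characters take 1 byte, Hangul syllables take 3.
assert len("love".encode("utf-8")) == 4   # 4 chars × 1 byte
assert len("사".encode("utf-8")) == 3     # 1 char × 3 bytes
assert len("사랑".encode("utf-8")) == 6   # 2 chars × 3 bytes
```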
However:
LLMs usually operate on tokens, not raw bytes, so byte differences are abstracted away after tokenization.
Each token has its own embedding vector.
Examples:
- "love" → one embedding vector
- "사랑" → a different embedding vector

These vectors are:
- High-dimensional (often hundreds to thousands of dimensions)
- Learned during training rather than hand-designed
- The model's actual internal representation of each token
Multilingual LLMs typically use a single shared embedding space.
This means:
- Tokens from every language live in the same vector space
- Semantically similar words across languages end up with nearby vectors

Examples:
- "love", "사랑", and "爱" are mapped to vectors close to one another

This enables:
- Cross-lingual transfer: knowledge learned in one language benefits others
- Consistent behavior, such as answering in one language about content given in another
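A shared space means similarity can be measured directly between vectors, regardless of language. The three-dimensional vectors below are invented for the sketch; real embeddings are learned and much larger:

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, near 0 or negative
    # for unrelated ones.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented vectors: "love", "사랑", and "爱" placed close together,
# "stone" placed far away.
vec = {
    "love":  [0.9, 0.1, 0.2],
    "사랑":  [0.88, 0.15, 0.18],
    "爱":    [0.85, 0.12, 0.25],
    "stone": [-0.3, 0.9, -0.5],
}

print(cosine(vec["love"], vec["사랑"]))   # close to 1.0
print(cosine(vec["love"], vec["stone"]))  # much lower
```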
Meaning depends on token order.
Example:
"dog bites man"
"man bites dog"
Same tokens, different meaning.
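The two sentences contain the same tokens as a multiset but differ as sequences; a model that ignored order could not tell them apart:

```python
a = "dog bites man".split()
b = "man bites dog".split()

assert sorted(a) == sorted(b)  # same tokens, ignoring order
assert a != b                  # different sequence → different meaning
```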
To handle this, models use positional encodings, which combine:
token = what
position = where
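One classic scheme is the sinusoidal positional encoding from the original Transformer (many modern models use learned or rotary position embeddings instead); a minimal sketch:

```python
import math

def positional_encoding(pos, d_model=8):
    # Sinusoidal encoding: each dimension pair uses a different frequency,
    # so every position gets a unique pattern the model can learn to read.
    pe = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        pe.append(math.sin(pos * freq))
        pe.append(math.cos(pos * freq))
    return pe

# The same token at different positions yields different inputs:
# input_vector = token_embedding + positional_encoding(pos)
p0 = positional_encoding(0)
p1 = positional_encoding(1)
assert p0 != p1
```

Because the position signal is added to the token embedding, "dog" at position 0 and "dog" at position 2 reach the model as different vectors.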
Text form also matters.
"apple" ≠ "Apple"
Implications:
- "apple" and "Apple" are different tokens with different embeddings
- Casing can signal a named entity (Apple the company) versus a common noun (apple the fruit)
Most modern LLMs are case-sensitive.
Whitespace and punctuation are also part of the input.
"hello"
" hello"
May produce different tokens.
"word"
"word."
Changes:
- The trailing period alters the token sequence, so "word" and "word." produce different inputs
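Under the toy vocabulary below (invented for illustration, though real BPE vocabularies likewise treat "hello" and " hello" as distinct entries), both the leading space and the trailing period change the token IDs:

```python
# Toy vocabulary: note that "hello" and " hello" are separate entries.
vocab = {"hello": 0, " hello": 1, "word": 2, ".": 3}

def encode(text):
    # Greedy longest-match tokenizer over the toy vocabulary.
    ids, i = [], 0
    while i < len(text):
        match = max(
            (p for p in vocab if text.startswith(p, i)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"untokenizable input at position {i}")
        ids.append(vocab[match])
        i += len(match)
    return ids

print(encode("hello"))    # [0]
print(encode(" hello"))   # [1]
print(encode("word"))     # [2]
print(encode("word."))    # [2, 3]
```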
Even though LLMs are multilingual, English often performs better because:
- Large training datasets are heavily English-dominated
- English aligns well with common tokenization schemes, compared to morphologically rich or space-less languages
LLMs process all languages through a shared token–vector pipeline, but different languages such as English, Korean, Japanese, Chinese, and French exhibit fundamentally different tokenization patterns due to their linguistic structures. These differences affect token count, segmentation, and efficiency. Despite this, multilingual models map all tokens into a shared embedding space, allowing semantically similar words across languages to be represented closely. Additionally, meaning is influenced not only by token identity but also by position, casing, whitespace, and punctuation.