LLM: English vs Non-English in LLMs

calico · 6 days ago


English vs Non-English in LLMs: Tokenization, Embeddings, and Representation

English and non-English languages differ in tokenization structure and efficiency, but multilingual LLMs align them within a shared embedding space, while position, casing, whitespace, and punctuation further influence interpretation.

To understand how LLMs handle different languages, we must distinguish three key levels:

  • byte: raw encoding unit (e.g., UTF-8)
  • token: the unit the model processes (subword or character)
  • vector (embedding): numerical representation of tokens

The overall pipeline is:

text → tokens → vectors

LLMs do not directly process text as humans see it. They operate on tokens and vectors.
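The pipeline above can be sketched in a few lines of pure Python. This is a minimal sketch assuming a naive whitespace tokenizer and a randomly initialized embedding table; real models use learned subword vocabularies and trained embedding matrices.

```python
import random

def tokenize(text):
    # Naive whitespace tokenizer (real models use subword schemes like BPE).
    return text.split()

# Hypothetical embedding table: each known token maps to a small vector.
random.seed(0)
vocab = ["I", "love", "you"]
embeddings = {tok: [random.random() for _ in range(4)] for tok in vocab}

tokens = tokenize("I love you")
vectors = [embeddings[tok] for tok in tokens]

print(tokens)   # ['I', 'love', 'you']
print(len(vectors), len(vectors[0]))  # 3 tokens, each a 4-dim vector
```

The model never sees the string "I love you"; it sees only the list of vectors.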

1. English vs Non-English: Not a single category

“Non-English” is not a single uniform group.
Different languages (Korean, Japanese, Chinese, French, etc.) have fundamentally different structures, which leads to different tokenization behaviors.

2. Tokenization differences across languages

English (space-based language)

English uses whitespace between words, making tokenization relatively straightforward.

"I love you"
→ ["I", " love", " you"]

Korean (agglutinative language)

Korean attaches grammatical markers to words, so tokenization often splits into smaller units.

"나는 너를 사랑해"
→ ["나", "는", " 너", "를", " 사랑", "해"]

Characteristics:

  • morphological splitting
  • more tokens per sentence
  • suffixes and particles separated

Japanese (mixed script language)

Japanese uses multiple writing systems (Kanji, Hiragana, Katakana) and no spaces.

"私は学生です"
→ ["私", "は", "学生", "です"]

Characteristics:

  • no whitespace
  • mixed character systems
  • complex segmentation

Chinese (character-based language)

Chinese text is often tokenized at or near the character level.

"我爱你"
→ ["我", "爱", "你"]

Characteristics:

  • each character often carries meaning
  • very short tokens
  • no spaces

French (space-based but morphologically richer)

French is similar to English but includes contractions and more inflection.

"Je t'aime"
→ ["Je", " t'", "aime"]

Characteristics:

  • whitespace-based
  • contractions (t’, l’, etc.)
  • slightly more complex morphology than English
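The surface splits shown above can be mimicked with plain Python. This is only a sketch of pre-tokenization, not real subword (BPE) tokenization; Korean and Japanese segmentation needs morphological analysis that is omitted here.

```python
import re

# English: GPT-style pre-tokenization often keeps the leading space
# attached to a word; a simple regex reproduces the split shown above.
english = re.findall(r" ?\S+", "I love you")
print(english)  # ['I', ' love', ' you']

# Chinese: character-level splitting, one token per character.
chinese = list("我爱你")
print(chinese)  # ['我', '爱', '你']
```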

3. Token count and efficiency differences

Different languages produce different numbers of tokens for the same meaning.

Example:

English: "I love you" → 3 tokens
Korean: "나는 너를 사랑해" → 5–6 tokens
Chinese: "我爱你" → 3 tokens

Implications:

  • different input lengths
  • different computational cost
  • possible performance differences
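The count differences are easy to see by comparing the splits from the examples above (actual counts depend on the specific tokenizer used).

```python
# Token counts for the same meaning, using the splits shown earlier.
splits = {
    "English": ["I", " love", " you"],
    "Korean": ["나", "는", " 너", "를", " 사랑", "해"],
    "Chinese": ["我", "爱", "你"],
}

counts = {lang: len(toks) for lang, toks in splits.items()}
print(counts)  # {'English': 3, 'Korean': 6, 'Chinese': 3}
```

The Korean sentence costs roughly twice as many tokens as the English one, which directly translates into longer inputs and higher compute for the same content.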

4. Byte-level differences

Languages also differ at the encoding level.

English

"love"

Typically 1 byte per character (ASCII-compatible)

Korean

"사"

Typically 3 bytes per character (UTF-8)

However:

LLMs usually operate on tokens, not raw bytes, so byte differences are abstracted away after tokenization.
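The byte difference is directly measurable with Python's built-in UTF-8 encoding:

```python
# UTF-8 byte lengths differ per script: ASCII letters take 1 byte each,
# while Hangul syllables take 3 bytes each.
print(len("love".encode("utf-8")))  # 4 bytes (1 per ASCII character)
print(len("사".encode("utf-8")))    # 3 bytes for one Hangul syllable
print(len("사랑".encode("utf-8")))  # 6 bytes for two syllables
```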

5. Embedding vectors: different but shared space

Each token has its own embedding vector.

Examples:

  • "love" → one vector
  • "사랑" → another vector
  • "爱" → another vector

These vectors are:

  • numerically different
  • but can be close in semantic space

6. Shared embedding space (core concept)

Multilingual LLMs typically use a single shared embedding space.

This means:

  • all languages are mapped into one vector space
  • semantically similar words across languages are placed near each other

Examples:

  • "dog" ↔ "개"
  • "love" ↔ "사랑" ↔ "爱"

This enables:

  • translation
  • cross-lingual understanding
  • multilingual reasoning
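This geometry can be illustrated with cosine similarity over toy vectors. The 2-D embeddings below are hypothetical, chosen so that translations of "love" cluster together while "dog" sits elsewhere; real multilingual models learn such geometry in hundreds of dimensions.

```python
import math

# Hypothetical 2-D embeddings for illustration only.
emb = {
    "love": [0.9, 0.1],
    "사랑": [0.85, 0.15],
    "爱":   [0.88, 0.12],
    "dog":  [0.1, 0.9],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine(emb["love"], emb["사랑"]))  # close to 1.0: near-synonyms
print(cosine(emb["love"], emb["dog"]))   # much lower: unrelated meanings
```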

7. Position matters (order sensitivity)

Meaning depends on token order.

Example:

"dog bites man"
"man bites dog"

Same tokens, different meaning.

To handle this, models use:

input = token embedding + position embedding

  • token = what
  • position = where
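The sum above can be sketched with toy 2-D vectors. The embeddings are made-up values for illustration; the point is that the same tokens in a different order yield a different input matrix, so the model can tell the two sentences apart.

```python
# Toy token and position embeddings (hypothetical values).
tok_emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}
pos_emb = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]  # one vector per position

def model_input(tokens):
    # input = token embedding + position embedding, element-wise.
    return [
        [t + p for t, p in zip(tok_emb[tok], pos_emb[i])]
        for i, tok in enumerate(tokens)
    ]

a = model_input(["dog", "bites", "man"])
b = model_input(["man", "bites", "dog"])
print(a != b)  # True: same tokens, different order, different inputs
```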

8. Uppercase vs lowercase

Text form also matters.

"apple" ≠ "Apple"
  • "apple" → common noun
  • "Apple" → proper noun (company)

Implications:

  • different tokens
  • different embeddings
  • different meanings

Most modern LLMs are case-sensitive.
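A case-sensitive vocabulary makes the distinction concrete. The token IDs below are made up for illustration; real IDs depend on the tokenizer.

```python
# Toy case-sensitive vocabulary: "apple" and "Apple" are separate entries,
# so they receive separate embeddings downstream.
vocab = {"apple": 101, "Apple": 102}
print(vocab["apple"] == vocab["Apple"])  # False: different tokens

# A case-insensitive pipeline would lowercase first, collapsing the two:
print(vocab["Apple".lower()])  # 101, the same entry as "apple"
```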

9. Whitespace and punctuation

These are also part of the input.

Whitespace

"hello"
" hello"

May produce different tokens.

Punctuation

"word"
"word."

Changes:

  • token sequence
  • sentence boundary
  • meaning or tone
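Both effects can be demonstrated with the same kind of regex pre-tokenization used earlier; this is a simplification of real subword tokenization, but the surface behavior matches.

```python
import re

def pre_tokenize(text):
    # Keep a leading space attached to words; punctuation is its own token.
    return re.findall(r" ?[\w']+|[.,!?]", text)

print(pre_tokenize("hello"))   # ['hello']
print(pre_tokenize(" hello"))  # [' hello']: the leading space stays on the token
print(pre_tokenize("word."))   # ['word', '.']: punctuation adds a token
```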

10. Why English often performs better

Even though LLMs are multilingual, English often performs better due to:

1) More training data

Large datasets are heavily English-dominated.

2) Efficient tokenization

English aligns well with tokenization schemes.

3) Simpler structure (in many cases)

Compared to morphologically rich or space-less languages, English structure aligns more easily with standard tokenization schemes.

Final Integrated Summary

LLMs process all languages through a shared token–vector pipeline, but different languages such as English, Korean, Japanese, Chinese, and French exhibit fundamentally different tokenization patterns due to their linguistic structures. These differences affect token count, segmentation, and efficiency. Despite this, multilingual models map all tokens into a shared embedding space, allowing semantically similar words across languages to be represented closely. Additionally, meaning is influenced not only by token identity but also by position, casing, whitespace, and punctuation.

All views expressed here are solely my own and do not represent those of any affiliated organization.