
I recently picked up Build a Large Language Model (From Scratch) by Sebastian Raschka. The book walks you through building a GPT-style language model step by step — from raw text all the way to a trained model. It's Python-first, but I'm a Go developer, so I'm implementing everything in Go as I follow along.
The book is structured around understanding, not shortcuts. Instead of importing a library and calling it a day, you build each component from scratch and understand why it works. Chapter 1 covers the big picture of LLMs: what they are, how they differ from earlier models, and the overall architecture. Chapter 2 gets into the first real implementation: text tokenization.
By the end of chapter 2, you have a working BPE (byte pair encoding) tokenizer, the same kind used in GPT-2 and GPT-3. That's what this post covers.
Before a language model can process text, the text needs to be converted into numbers. A tokenizer does this job. It takes a string like "Hello, world!" and converts it into a list of integers:
"Hello, world!" → [15496, 11, 995, 0]
Each integer is a token ID — an index into the model's vocabulary. The model never sees the raw text, only these numbers. Decoding is the reverse: given a list of token IDs, reconstruct the original string.
Instead of training a vocabulary from scratch, I'm using OpenAI's pre-built r50k_base vocabulary. It contains ~50,000 tokens and comes in the tiktoken format, where each line is:
<base64_encoded_token_bytes> <token_id>
For example:
SGVsbG8= 15496
SGVsbG8= is the base64 encoding of the bytes for Hello, and 15496 is its token ID.
The first 256 entries cover every possible single byte (0x00–0xFF), which guarantees that any input can always be encoded. The remaining ~49,700 entries are multi-byte merge tokens built up during BPE training.
I use Go's embed package to bundle the file directly into the binary, then parse it line by line:
//go:embed resources/r50k_base.tiktoken
var resource embed.FS

var (
	Vocabulary    = make(map[string]int)
	VacabularyInv = make(map[int]string)
)
func LoadVocabulary() error {
	fd, err := resource.Open("resources/r50k_base.tiktoken")
	if err != nil {
		return fmt.Errorf("failed to open vocabulary file: %w", err)
	}
	defer fd.Close()

	reader := bufio.NewReader(fd)
	for {
		line, readErr := reader.ReadString('\n')
		// Parse the line even on io.EOF: the last line may lack a trailing newline.
		if fields := strings.Fields(line); len(fields) == 2 {
			tokenBytes, err := base64.StdEncoding.DecodeString(fields[0])
			if err != nil {
				return fmt.Errorf("invalid base64 %q: %w", fields[0], err)
			}
			tokenID, err := strconv.Atoi(fields[1])
			if err != nil {
				return fmt.Errorf("invalid token ID %q: %w", fields[1], err)
			}
			key := string(tokenBytes)
			Vocabulary[key] = tokenID
			VacabularyInv[tokenID] = key
		}
		if readErr == io.EOF {
			return nil
		}
		if readErr != nil {
			return readErr
		}
	}
}
Two maps: one for encoding (token bytes → ID) and one for decoding (ID → token bytes).
BPE doesn't run on the entire input at once. First the text is split into chunks using a regex. This prevents merges from crossing word boundaries (you don't want the "s" at the end of "dogs" merging with the start of the next word).
GPT-2 uses a specific regex for this. Go's standard regexp package does support Unicode categories like \p{L} (letters) and \p{N} (numbers), but not the negative lookahead (?!\S) the pattern relies on, so I reached for github.com/dlclark/regexp2:
var expr = regexp2.MustCompile(
	`'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`,
	0,
)
For input "Hello, world!" this produces: ["Hello", ",", " world", "!"]
Each chunk is then encoded independently by the BPE algorithm.
BPE is the core algorithm. The key insight is that you don't train it here; the training already happened when OpenAI built the r50k_base vocabulary. Your job is just the encoding side: given a chunk of text, greedily split it into tokens from the vocabulary using the merges learned during training.
Take the chunk " world" (note the leading space — GPT-2 treats " world" and "world" as different tokens).
1. Initialize as individual bytes
parts = [ [0x20], [0x77], [0x6F], [0x72], [0x6C], [0x64] ]
          " "     "w"     "o"     "r"     "l"     "d"
Every single byte is guaranteed to be in the vocabulary, so this starting state is always valid.
2. Find the best merge candidate
Look at every adjacent pair and check if their concatenation exists in the vocabulary. Among all valid pairs, pick the one with the lowest token ID — lower ID means it was merged earlier during training, giving it higher priority.
3. Merge that pair
Replace the two adjacent tokens with their concatenation. The slice shrinks by one element.
4. Repeat until no valid merges remain
5. Return token IDs
Look up each remaining token in the vocabulary and return the IDs.
func (b *BPE) Encode(chunk []byte) []int {
	// Step 1: one element per byte
	parts := make([][]byte, len(chunk))
	for i, b := range chunk {
		parts[i] = []byte{b}
	}

	for {
		bestRank := math.MaxInt
		bestIdx := -1

		// Step 2: scan all adjacent pairs
		for i := 0; i < len(parts)-1; i++ {
			merged := append(append([]byte{}, parts[i]...), parts[i+1]...)
			rank, ok := Vocabulary[string(merged)]
			if ok && rank < bestRank {
				bestRank = rank
				bestIdx = i
			}
		}

		// Step 4: no valid merge found — done
		if bestIdx == -1 {
			break
		}

		// Step 3: merge the pair at bestIdx; the slice shrinks by one
		merged := append(append([]byte{}, parts[bestIdx]...), parts[bestIdx+1]...)
		parts = append(parts[:bestIdx+1], parts[bestIdx+2:]...)
		parts[bestIdx] = merged
	}

	// Step 5: convert to token IDs
	result := make([]int, len(parts))
	for i, part := range parts {
		result[i] = Vocabulary[string(part)]
	}
	return result
}
You'll notice the merging always uses:
append(append([]byte{}, parts[i]...), parts[i+1]...)
instead of the simpler:
append(parts[i], parts[i+1]...)
The simpler version is a bug. If parts[i] has spare capacity in its underlying array (which happens after a merge), the append will write parts[i+1] directly into that memory, silently corrupting another token's data. The double-append forces a fresh allocation every time, so nothing is shared.
The Tokenizer owns a BPE instance and orchestrates the full pipeline:
type Tokenizer struct {
	bpe *BPE
}

func NewTokenizer() *Tokenizer {
	if err := LoadVocabulary(); err != nil {
		panic(err)
	}
	return &Tokenizer{bpe: new(BPE)}
}

func (t *Tokenizer) Encode(input string) []int {
	match, err := expr.FindStringMatch(input)
	if err != nil {
		panic(err)
	}

	var result []int
	for match != nil {
		ids := t.bpe.Encode([]byte(match.String()))
		result = append(result, ids...)
		match, _ = expr.FindNextMatch(match)
	}
	return result
}

func (t *Tokenizer) Decode(input []int) string {
	var result []byte
	for _, id := range input {
		result = append(result, []byte(VacabularyInv[id])...)
	}
	return string(result)
}
The flow is:
1. Regex splits input into chunks
2. Each chunk goes through BPE → token IDs
3. All IDs are concatenated into one slice
Decoding is simpler: just look up each ID in the inverse vocabulary and concatenate the bytes.
t := NewTokenizer()
ids := t.Encode("Hello, world!")
fmt.Println(ids)           // [15496 11 995 0]
fmt.Println(t.Decode(ids)) // Hello, world!
The round-trip works. Decode(Encode(s)) == s for any input.
Part 2 will continue into the next chapter: building the data loader that feeds tokenized text into the model in fixed-size chunks with a sliding window. That's where the training pipeline starts to take shape.
The full source is on GitHub: github.com/MarinX/llm-from-scratch