Building an LLM from Scratch in Go Part 1: BPE Tokenizer

Marin Basic · February 27, 2026

I recently picked up Build a Large Language Model (From Scratch) by Sebastian Raschka. The book walks you through building a GPT-style language model step by step — from raw text all the way to a trained model. It's Python-first, but I'm a Go developer, so I'm implementing everything in Go as I follow along.

The Book

The book is structured around understanding, not shortcuts. Instead of importing a library and calling it a day, you build each component from scratch and understand why it works. Chapter 1 covers the big picture of LLMs, what they are, how they differ from earlier models, and the overall architecture. Chapter 2 gets into the first real implementation: text tokenization.

By the end of chapter 2, you have a working BPE tokenizer (the same kind used in GPT-2 and GPT-3). That's what this post covers.


What is a Tokenizer?

Before a language model can process text, the text needs to be converted into numbers. A tokenizer does this job. It takes a string like "Hello, world!" and converts it into a list of integers:

"Hello, world!" → [15496, 11, 995, 0]

Each integer is a token ID — an index into the model's vocabulary. The model never sees the raw text, only these numbers. Decoding is the reverse: given a list of token IDs, reconstruct the original string.
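To make the encode/decode contract concrete, here's a minimal sketch with a hypothetical four-token vocabulary (the IDs are made up for illustration; they are not GPT-2's):

```go
package main

import "fmt"

// Hypothetical toy vocabulary, just to show the contract.
var ids = map[string]int{"Hello": 1, ",": 2, " world": 3, "!": 4}
var tokens = map[int]string{1: "Hello", 2: ",", 3: " world", 4: "!"}

func main() {
	// Encoding: text chunks become token IDs.
	encoded := []int{ids["Hello"], ids[","], ids[" world"], ids["!"]}
	fmt.Println(encoded) // [1 2 3 4]

	// Decoding: IDs become text again.
	var decoded string
	for _, id := range encoded {
		decoded += tokens[id]
	}
	fmt.Println(decoded) // Hello, world!
}
```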


The Vocabulary: r50k_base.tiktoken

Instead of training a vocabulary from scratch, I'm using OpenAI's pre-built r50k_base vocabulary. It contains ~50,000 tokens and comes in the tiktoken format, where each line is:

<base64_encoded_token_bytes> <token_id>

For example:

SGVsbG8= 15496

SGVsbG8= is the base64 encoding of the bytes for Hello, and 15496 is its token ID.

The first 256 entries cover every possible single byte (0x00–0xFF), which guarantees that any input can always be encoded. The remaining ~49,700 entries are multi-byte merge tokens built up during BPE training.
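Parsing one line of this format needs nothing beyond the standard library. A quick sketch (parseLine is my own helper, not from the book, and it ignores errors for brevity):

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strconv"
	"strings"
)

// parseLine decodes one "<base64_token> <id>" vocabulary line.
func parseLine(line string) (string, int) {
	split := strings.Split(line, " ")
	tokenBytes, _ := base64.StdEncoding.DecodeString(split[0])
	tokenID, _ := strconv.Atoi(split[1])
	return string(tokenBytes), tokenID
}

func main() {
	token, id := parseLine("SGVsbG8= 15496")
	fmt.Printf("%q -> %d\n", token, id) // "Hello" -> 15496
}
```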

Loading the vocabulary in Go

I use Go's embed package to bundle the file directly into the binary, then parse it line by line:

//go:embed resources/r50k_base.tiktoken
var resource embed.FS

var (
    Vocabulary    = make(map[string]int)
    VacabularyInv = make(map[int]string)
)

func LoadVocabulary() error {
    fd, err := resource.Open("resources/r50k_base.tiktoken")
    if err != nil {
        return fmt.Errorf("failed to open file: %w", err)
    }
    defer fd.Close()

    scanner := bufio.NewScanner(fd)
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        if line == "" {
            continue
        }

        split := strings.Split(line, " ")
        if len(split) != 2 {
            return fmt.Errorf("malformed vocabulary line: %q", line)
        }
        tokenBytes, err := base64.StdEncoding.DecodeString(split[0])
        if err != nil {
            return fmt.Errorf("failed to decode token: %w", err)
        }
        tokenID, err := strconv.Atoi(split[1])
        if err != nil {
            return fmt.Errorf("failed to parse token ID: %w", err)
        }

        key := string(tokenBytes)
        Vocabulary[key] = tokenID
        VacabularyInv[tokenID] = key
    }
    return scanner.Err()
}

Two maps: one for encoding (token bytes → ID) and one for decoding (ID → token bytes).


Step 1: Pre-tokenization with Regex

BPE doesn't run on the entire input at once. First the text is split into chunks using a regex. This prevents merges from crossing word boundaries (you don't want the trailing "s" of "dogs" merging with the start of the next word).

GPT-2 uses a specific regex for this. Go's standard regexp package actually supports the Unicode categories \p{L} (letters) and \p{N} (numbers) just fine, but its RE2 engine has no negative lookahead, which the pattern's \s+(?!\S) branch needs, so I reached for github.com/dlclark/regexp2:

var expr = regexp2.MustCompile(
    `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`,
    0,
)

For input "Hello, world!" this produces: ["Hello", ",", " world", "!"]

Each chunk is then encoded independently by the BPE algorithm.


Step 2: Byte Pair Encoding (BPE)

BPE is the core algorithm. The key insight: you don't train it here; the training already happened when OpenAI built the r50k_base vocabulary. Your job is just the encoding side: given a chunk of text, split it into bytes and repeatedly apply the merges recorded in the vocabulary until no more apply.

How it works

Take the chunk " world" (note the leading space — GPT-2 treats " world" and "world" as different tokens).

1. Initialize as individual bytes

parts = [ [0x20], [0x77], [0x6F], [0x72], [0x6C], [0x64] ]
          " "      "w"     "o"     "r"     "l"     "d"

Every single byte is guaranteed to be in the vocabulary, so this starting state is always valid.

2. Find the best merge candidate

Look at every adjacent pair and check if their concatenation exists in the vocabulary. Among all valid pairs, pick the one with the lowest token ID — lower ID means it was merged earlier during training, giving it higher priority.

3. Merge that pair

Replace the two adjacent tokens with their concatenation. The slice shrinks by one element.

4. Repeat until no valid merges remain

5. Return token IDs

Look up each remaining token in the vocabulary and return the IDs.

The Go implementation

func (b *BPE) Encode(chunk []byte) []int {
    // Step 1: one element per byte
    parts := make([][]byte, len(chunk))
    for i, b := range chunk {
        parts[i] = []byte{b}
    }

    for {
        bestRank := math.MaxInt
        bestIdx  := -1

        // Step 2: scan all adjacent pairs
        for i := 0; i < len(parts)-1; i++ {
            merged := append(append([]byte{}, parts[i]...), parts[i+1]...)
            rank, ok := Vocabulary[string(merged)]
            if ok && rank < bestRank {
                bestRank = rank
                bestIdx  = i
            }
        }

        // No valid merge found — done
        if bestIdx == -1 {
            break
        }

        // Step 3: merge at bestIdx
        merged := append(append([]byte{}, parts[bestIdx]...), parts[bestIdx+1]...)
        parts = append(parts[:bestIdx+1], parts[bestIdx+2:]...)
        parts[bestIdx] = merged
    }

    // Step 5: convert to token IDs
    result := make([]int, len(parts))
    for i, part := range parts {
        result[i] = Vocabulary[string(part)]
    }
    return result
}

One subtle Go gotcha

You'll notice the merging always uses:

append(append([]byte{}, parts[i]...), parts[i+1]...)

instead of the simpler:

append(parts[i], parts[i+1]...)

The simpler version is a bug. If parts[i] has spare capacity in its underlying array (which happens after a merge), the append will write parts[i+1] directly into that memory, silently corrupting any other slice that shares the array. The double-append forces a fresh allocation every time, so nothing is shared.
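The mechanism is easy to reproduce outside the tokenizer. A minimal standalone demonstration with toy data (not the tokenizer's slices), showing an append into shared spare capacity clobbering a neighboring view:

```go
package main

import "fmt"

func main() {
	// Two slices sharing one backing array, like a word split into tokens.
	backing := []byte("dogs")
	dog := backing[:3:4] // "dog": len 3, cap 4, so one byte of spare capacity
	s := backing[3:]     // "s": views backing[3]

	// Appending into dog's spare capacity writes straight over backing[3]:
	_ = append(dog, '!')
	fmt.Println(string(s)) // "!" (s was silently corrupted)

	// The double-append copies into a fresh array, so backing is untouched:
	fresh := append(append([]byte{}, dog...), '?')
	fmt.Println(string(fresh), string(backing)) // dog? dog!
}
```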


Step 3: Wiring it Together

The Tokenizer owns a BPE instance and orchestrates the full pipeline:

type Tokenizer struct {
    bpe *BPE
}

func NewTokenizer() *Tokenizer {
    if err := LoadVocabulary(); err != nil {
        panic(err)
    }
    return &Tokenizer{bpe: new(BPE)}
}

func (t *Tokenizer) Encode(input string) []int {
    match, err := expr.FindStringMatch(input)
    if err != nil {
        panic(err)
    }
    var result []int
    for match != nil {
        ids := t.bpe.Encode([]byte(match.String()))
        result = append(result, ids...)
        match, _ = expr.FindNextMatch(match)
    }
    return result
}

func (t *Tokenizer) Decode(input []int) string {
    var result []byte
    for _, id := range input {
        result = append(result, []byte(VacabularyInv[id])...)
    }
    return string(result)
}

The flow is:
1. Regex splits input into chunks
2. Each chunk goes through BPE → token IDs
3. All IDs are concatenated into one slice

Decoding is simpler: just look up each ID in the inverse vocabulary and concatenate the bytes.


Does it Work?

t := NewTokenizer()

ids := t.Encode("Hello, world!")
fmt.Println(ids)          // [15496 11 995 0]

fmt.Println(t.Decode(ids)) // Hello, world!

The round-trip works. Decode(Encode(s)) == s for any input.


What's Next

Part 2 will continue into the next chapter: building the data loader that feeds tokenized text into the model in fixed-size chunks with a sliding window. That's where the training pipeline starts to take shape.

The full source is on GitHub: github.com/MarinX/llm-from-scratch

I'm a Go developer from Croatia. I'm interested in Korean culture, home automation, and self-hosting. My Korean is still a work in progress, but I'm always happy to do a coffee chat or a language exchange!
