Beyond Language Models: Byte Models are Digital World Simulators

임재석 · March 9, 2024

paper-study


1. Introduction

  • Deep learning has focused on human-interpretable digital media files - text, images, audio
    • Text has played a central role in conveying human intelligence and led to the emergence of LMs
    • LMs tokenize text and predict the next token, which lets them comprehend human language and intelligence
    • Recent advancements extend tokenization beyond text
  • These deep learning models overlook the omnipresent native binary data in the digital world
    • Next-Byte Prediction will allow the models to truly understand and simulate all activities in the digital world
    • It has practical benefits in cybersecurity, computer diagnostics, data compression, and even reverse-engineering software source code from its binary representation
  • bGPT : model for binary data processing and digital world modelling by next byte prediction
    • directly interpreting and manipulating binary data

    • two-fold advantages

      • Interpreting Digital System
      • Unified Modelling
  • Experiment in two areas
    • well-studied tasks (generative modelling, classification)
    • relatively underexplored tasks intrinsic to binary-native operations (data conversion, CPU state modelling)

2. Background

2.1 Language Models

  • Text Models

    • LSTM-based to Transformer-based
    • Tokenization plays a fundamental role (breaking down into words or subwords)
    • GPT models pretrained with self-supervised learning via next token prediction
    • next token prediction enables GPT models to capture the structure and semantics behind language
  • Audio Models

    • AudioPaLM : merged text and speech

      • enables speech-to-speech translation and speech recognition
    • MusicGen : generates music from multiple parallel streams of acoustic tokens produced by EnCodec

  • Image Models

    • iGPT : transformer to predict next pixel
    • vision-language models : connect text and visual data
  • Biochemical sequence Models

    • Tranception : transformers to predict protein fitness
    • ProtGPT2 : generates protein sequences
    • HyenaDNA : extends context lengths in genomic modelling

2.2 Byte Models

  • Binary data lacks the inherent structure and semantics of human-interpretable data

  • MalConv, DeepVSA : malware detection and program analysis

    • MalConv uses CNN to analyze byte sequences
    • DeepVSA : value set analysis for post-mortem program analysis
  • Byte-level Byte Pair Encoding (BBPE) : used for multilingual pretraining, machine translation

  • ByT5 : transformers for byte sequences

    • token-free encoding that improves robustness to noise and handling of spelling-sensitive tasks in multilingual settings
  • ByteFormer : raw byte sequences from images and audio

  • MegaByte : modelling long byte sequences across various modalities

  • MambaByte : used Mamba to excel in byte-level language modelling and outperformed LMs based on subword tokenization

  • Current research often neglects native binary data, focusing on narrow tasks and overlooking broader potential in digital world simulation

3. Methodology

3.1 Model Architecture

  • the high granularity of bytes results in long sequences → computational cost

  • quadratic self-attention scaling → computational cost

  • hierarchical Transformer architecture

    • sequence of bytes $B = \{ b_1, b_2, ..., b_T \}$ of length $T$

    • sequence of patches $\mathcal{P} = [P_1, P_2, ..., P_N]$

    • each patch contains $S$ bytes

    • the number of patches $N = \lceil T / S \rceil$

    • $P_i = [b_{(i-1)S + 1}, ..., b_{iS}]$ for $1 \le i \le N$

    • if $T \bmod S \ne 0$, the last patch is padded with $e$ (the eop, end-of-patch token) to size $S$
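
A minimal sketch of this byte-to-patch segmentation (illustrative only; the `<eop>` id of 256 and the function name are assumptions — the paper just specifies padding the last patch with an end-of-patch token):

```python
# Minimal sketch of byte-to-patch segmentation (not the authors' code).
# Bytes take values 0..255; 256 is assumed here as the <eop> (end-of-patch) id.
from math import ceil

EOP = 256  # assumed integer id for the end-of-patch token

def patchify(byte_seq: list[int], patch_size: int) -> list[list[int]]:
    """Split a byte sequence into N = ceil(T / S) patches of S bytes,
    padding the last patch with <eop> when T mod S != 0."""
    n_patches = ceil(len(byte_seq) / patch_size)
    patches = []
    for i in range(n_patches):
        patch = byte_seq[i * patch_size:(i + 1) * patch_size]
        patch += [EOP] * (patch_size - len(patch))  # only the final patch can be short
        patches.append(patch)
    return patches

# Example: 10 bytes with S = 4 -> 3 patches, the last padded with two <eop> ids.
print(patchify(list(range(10)), patch_size=4))
```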

Linear Projection Layer

  • Each patch $P_i$ from $\mathcal{P}$ is viewed as a matrix of size $S \times 257$

    • each byte is one-hot encoded (256 values + eop token)
  • Flatten those patches into one-dimensional vectors

    • rows in the matrix are concatenated
  • the projection layer maps each flattened vector into a dense vector $E_i$ of a hidden size $H$

    • $E_i = \text{Flatten}(P_i) \cdot W_{\text{linear}}, \quad 1 \le i \le N$
  • $W_{\text{linear}}$ has the shape of $(257 \times S, H)$

  • Dense embedding enables more efficient processing of the byte sequence by reducing the dimension while preserving the essential information
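
A hedged sketch of this projection step, assuming PyTorch (the module name `PatchEmbedding` is illustrative; dimensions follow the notation above):

```python
# Sketch of the linear projection: one-hot each byte (257 classes incl. <eop>),
# flatten the S x 257 patch matrix row-wise, and project to a hidden vector of size H.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbedding(nn.Module):  # illustrative name, not from the paper
    def __init__(self, patch_size: int, hidden_size: int, vocab_size: int = 257):
        super().__init__()
        self.vocab_size = vocab_size
        self.linear = nn.Linear(patch_size * vocab_size, hidden_size, bias=False)

    def forward(self, patches: torch.LongTensor) -> torch.Tensor:
        # patches: (batch, N, S) integer byte ids in [0, 256]
        one_hot = F.one_hot(patches, num_classes=self.vocab_size).float()  # (batch, N, S, 257)
        flat = one_hot.flatten(start_dim=2)   # (batch, N, S*257): rows concatenated
        return self.linear(flat)              # (batch, N, H): dense patch embeddings E_i

emb = PatchEmbedding(patch_size=16, hidden_size=768)
dummy = torch.randint(0, 257, (2, 8, 16))     # 2 sequences, 8 patches of 16 bytes each
print(emb(dummy).shape)                       # torch.Size([2, 8, 768])
```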

Patch-Level Decoder

  • Takes the sequence of embedded patches $\mathcal{E} = \{ E_1, E_2, ..., E_N \}$ and processes it to autoregressively predict the features of the subsequent patch, effectively learning the structure of the data

  • $\hat{E}_i = \text{Decoder}_{\text{patch}}(\mathcal{E}_{<i} \oplus \mathcal{X}_{<i})$

  • $\mathcal{E}_{<i}$ : the sequence of patch embeddings before the $i$-th patch

  • $\mathcal{X}_{<i}$ : the corresponding positional embeddings

  • $\oplus$ : element-wise addition
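
A rough sketch of such a patch-level decoder in PyTorch; the class name, layer count, and head count are assumptions — only the causal patch-over-patch attention with added positional embeddings follows the description above:

```python
# Sketch of the patch-level decoder: a causal Transformer over patch embeddings
# plus positional embeddings, predicting the feature of the next patch.
import torch
import torch.nn as nn

def causal_mask(n: int) -> torch.Tensor:
    return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

class PatchLevelDecoder(nn.Module):  # illustrative module, not the official implementation
    def __init__(self, hidden: int = 768, max_patches: int = 512):
        super().__init__()
        self.pos = nn.Embedding(max_patches, hidden)  # positional embeddings X
        layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=12)

    def forward(self, patch_emb: torch.Tensor) -> torch.Tensor:
        # patch_emb: (batch, N, H) patch embeddings E_i from the linear projection
        n = patch_emb.size(1)
        x = patch_emb + self.pos(torch.arange(n, device=patch_emb.device))  # E (+) X
        # causal mask: position i only attends to earlier patches
        return self.blocks(x, mask=causal_mask(n).to(patch_emb.device))
```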

Byte-Level Decoder

  • Takes the predicted feature $\hat{E}_i$ of each patch and autoregressively reconstructs the sequence of bytes within that patch

  • independent for each patch, operating by conditioning on the feature representation $\hat{E}_i$ of the current patch

  • $\hat{b}_{i, j} = \text{Decoder}_{\text{byte}}(\hat{E}_i, b_{i, <j}), \quad 1 \le j \le S$
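
A matching sketch of the byte-level decoder; note that adding the predicted patch feature $\hat{E}_i$ to every byte embedding is one plausible way to condition on it, a simplification rather than the paper's exact mechanism:

```python
# Sketch of the byte-level decoder: autoregressively reconstructs the S bytes of
# one patch, conditioned on the predicted patch feature \hat{E}_i.
import torch
import torch.nn as nn

def causal_mask(n: int) -> torch.Tensor:
    return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

class ByteLevelDecoder(nn.Module):  # illustrative module, not the official implementation
    def __init__(self, hidden: int = 768, vocab_size: int = 257):
        super().__init__()
        self.byte_emb = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, patch_feature: torch.Tensor, bytes_in: torch.Tensor) -> torch.Tensor:
        # patch_feature: (batch, H) predicted feature of the current patch
        # bytes_in: (batch, S) bytes b_{i,<j} of the current patch (teacher forcing)
        x = self.byte_emb(bytes_in) + patch_feature.unsqueeze(1)  # condition on \hat{E}_i
        s = x.size(1)
        hidden_states = self.blocks(x, mask=causal_mask(s).to(x.device))
        return self.head(hidden_states)  # (batch, S, 257) next-byte logits within the patch
```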

3.2 Training Objectives

Generative Modelling

  • aims to predict the next byte $b_{i+1}$ based on the preceding bytes $\{ b_1, b_2, ..., b_i \}$ without explicit guidance

  • the objective is minimizing the negative log-likelihood of the next-byte prediction across the sequence

  • $\mathcal{L}_{\text{GEN}}(\theta) = - \displaystyle\sum_{i=1}^{T-1} \log p(b_{i+1} \mid b_1, b_2, ..., b_i; \theta)$

  • this loss encourages the model to understand the sequential dependencies in data at the byte level
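
Concretely, this objective is just cross-entropy over shifted next-byte targets; a small sketch assuming per-position logits from a byte-level model:

```python
# Sketch of the next-byte prediction objective L_GEN (illustrative).
import torch
import torch.nn.functional as F

def next_byte_nll(logits: torch.Tensor, byte_seq: torch.Tensor) -> torch.Tensor:
    """logits: (batch, T, 257) per-position predictions; byte_seq: (batch, T) ground-truth
    bytes. The prediction at position i is scored against byte i+1."""
    pred = logits[:, :-1].reshape(-1, logits.size(-1))  # predictions for b_2 ... b_T
    target = byte_seq[:, 1:].reshape(-1)                # targets b_2 ... b_T
    return F.cross_entropy(pred, target)                # mean negative log-likelihood
```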

Classification

  • After pretraining with next-byte prediction, the model is further trained on labelled datasets for classification

  • predicts categories from byte sequences

  • involves extracting a global feature from the byte sequence, which is then processed by a classification head

  • $\mathcal{L}_{\text{CLF}}(\theta) = -\displaystyle\sum_{k=1}^K y_k \log p(y_k \mid B; \theta)$

  • $y_k$ is the boolean label for the $k$-th category, indicating whether the byte sequence belongs to that category

  • $K$ : total number of categories

  • $p(y_k \mid B; \theta)$ is the predicted probability of category $k$ given the byte sequence $B$
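
A minimal sketch of the classification head, assuming the average-pooled patch-level features mentioned in §4.1 feed a single linear layer (the module name is illustrative):

```python
# Sketch of the classification fine-tuning objective L_CLF (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):  # hypothetical name
    def __init__(self, hidden: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, N, H) from the patch-level decoder
        global_feature = patch_features.mean(dim=1)  # average pooling over patches
        return self.fc(global_feature)               # (batch, K) class logits

# Cross-entropy over the K categories implements L_CLF:
head = ClassificationHead(hidden=768, num_classes=10)
logits = head(torch.randn(4, 8, 768))
loss = F.cross_entropy(logits, torch.tensor([0, 3, 1, 7]))
```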

4. Applications

4.1 Digital Media Processing

  • The field of deep learning is steadily advancing its proficiency in both generation and classification of text, audio, and images

  • These media are typically stored and transmitted as byte sequences → bGPT can process them for generative modelling and classification

  • bGPT is trained with next-byte prediction; for classification it takes features from the patch-level decoder and applies average pooling to derive a global feature

  • Data

    • Audio : convert to WAV with an 8000 Hz sampling rate, mono channel, 8-bit depth, trimmed to 1 sec
    • Image : convert to BMP, 32 × 32, RGB, 24-bit depth
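
A hedged sketch of this preprocessing; the paper does not name the tools, so ffmpeg and Pillow here are assumptions:

```python
# Sketch of the media preprocessing described above (tool choice is an assumption).
import subprocess
from PIL import Image

def to_wav(src: str, dst: str) -> None:
    """Convert audio to a 1-second, 8000 Hz, mono, 8-bit PCM WAV via ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "8000", "-ac", "1",
         "-acodec", "pcm_u8", "-t", "1", dst],
        check=True,
    )

def to_bmp(src: str, dst: str) -> None:
    """Convert an image to a 32x32, 24-bit RGB BMP."""
    Image.open(src).convert("RGB").resize((32, 32)).save(dst, format="BMP")
```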

4.2 Algorithm and Hardware Simulation

Data Conversion

  • converting data between formats, using symbolic music: ABC notation and MIDI files

  • employs the generative modelling approach on concatenated byte sequences of paired ABC and MIDI files, separated by a special patch

  • bGPT learns to convert text-based ABC notation into binary MIDI performance signals and vice versa

  • demonstrates the ability to simulate and reverse-engineer the conversion algorithm
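
A small sketch of how one such training example might be assembled; the separator value is an assumption, since the paper only states that a special patch separates the paired files:

```python
# Sketch of building one ABC<->MIDI training sequence (illustrative).
from pathlib import Path

PATCH_SIZE = 16
SEPARATOR_PATCH = [256] * PATCH_SIZE  # assumed: a full patch of special separator ids

def build_conversion_example(abc_path: str, midi_path: str) -> list[int]:
    """Concatenate the ABC bytes, a special separator patch, and the MIDI bytes.
    Next-byte prediction on this sequence teaches the model to emit the MIDI bytes
    conditioned on the full ABC byte sequence; swapping the two files trains the
    reverse direction."""
    abc_bytes = list(Path(abc_path).read_bytes())
    midi_bytes = list(Path(midi_path).read_bytes())
    return abc_bytes + SEPARATOR_PATCH + midi_bytes
```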

CPU State Modeling

  • the model is given concatenated sequences of low-level machine instructions followed by a series of CPU register states

  • the goal is to accurately predict how the state updates with each instruction until the program halts

  • interpreting operational data and replicating digital activities within hardware

  • CPU States dataset (2.1M instances)

    • offering a simplified representation of CPU behavior

    • each instance contains a 1KB memory block with varying numbers of machine instructions, followed by a sequence of 16-byte CPU register states

    • the instruction set covers 21 types with 43 variants (data movement, logical operations, arithmetic operations)

    • within each state

      • 1 byte each for the Program Counter and the Accumulator
      • 4 bytes for the Instruction Register
      • 10 bytes for general-purpose registers
    • instances are randomly generated with 1 to 256 instructions, and the resulting register states are captured
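
To make the 16-byte register state concrete, a small parsing sketch; only the field sizes follow the summary above — the byte order of the fields is an assumption for illustration:

```python
# Sketch of parsing one 16-byte CPU register state (illustrative helper).
from dataclasses import dataclass

@dataclass
class CPUState:
    program_counter: int          # 1 byte
    accumulator: int              # 1 byte
    instruction_register: bytes   # 4 bytes
    general_purpose: bytes        # 10 bytes

def parse_state(raw: bytes) -> CPUState:
    assert len(raw) == 16, "each register state is exactly 16 bytes"
    return CPUState(
        program_counter=raw[0],
        accumulator=raw[1],
        instruction_register=raw[2:6],
        general_purpose=raw[6:16],
    )
```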

5. Experiments

5.1 Settings

  • used open-source datasets

  • the 110M-parameter bGPT matches the scale of standard Transformer-based models

  • avoided hyperparameter tuning and data augmentation for all evaluations

  • Accuracy (Acc) for classification

  • Bits-Per-Byte (BPB) for the generative modelling tasks
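
For reference, Bits-Per-Byte is simply the average next-byte negative log-likelihood expressed in base 2; a quick sketch:

```python
# Sketch of the Bits-Per-Byte (BPB) metric: average next-byte NLL in bits.
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """total_nll_nats: summed negative log-likelihood (natural log) over all predicted bytes."""
    return total_nll_nats / (num_bytes * math.log(2))

# Example: a total NLL of 693.1 nats over 1000 bytes is about 1.0 bit per byte.
print(bits_per_byte(693.1, 1000))
```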

5.2 Digital Media Processing

  • used standard pre-training and fine-tuning approach

  • $\text{bGPT}_{\text{image}}$ : using ImageNet

  • $\text{bGPT}_{\text{wiki}}$ : Wikipedia

  • $\text{bGPT}_{\text{libri}}$ : LibriSpeech

  • $\text{bGPT}_{\text{signal}}$ : LibriSpeech + ImageNet

  • $\text{bGPT}_{\text{mix}}$ : LibriSpeech + ImageNet + Wikipedia

  • $\text{bGPT}_{\text{random}}$ : randomly initialized, baseline

  • first fine-tuned with next byte prediction on AGNews, CIFAR-10, Speech Commands v2

  • then fine-tuned for classification

5.2.1 Baselines

  • GPT2-small for text

    • pretrained on English Wikipedia with the same settings as bGPT
  • ViT-B/16 for image

    • pretrained on ImageNet
    • results are taken from original studies
  • AST for audio

5.2.2 Results

  • When the pretraining data and fine-tuning data match in modality, the model shows the strongest performance in downstream tasks

  • Despite not having modality-specific prior knowledge, bGPT still manages to achieve performance similar to the baselines

  • but $\text{bGPT}_{\text{image}}$ is much lower than ViT, as the sequential processing nature of byte models is not well suited to 2D data

    • simply scaling up while retaining this sequential processing may not be enough to close the gap
  • $\text{bGPT}_{\text{signal}}$ and $\text{bGPT}_{\text{mix}}$ show comparable accuracy to the unimodal models, but with a small loss

    • Trade-off in byte models : mixed-modality pretraining dilutes the depth of domain-specific understanding but fosters versatility
  • positive transfer (pretraining on audio/image and fine-tuning on image/audio) shows improvements over random initialization

    • audio and image share some byte-level patterns
  • negative transfer (from text to other modalities) shows that the structured patterns learned in text pretraining do not carry over

    • text has byte-level organizational patterns distinct from audio and image

  • To investigate cross-modal knowledge transfer

    • converted Speech Commands v2 into 32 × 32 BMP spectrograms
    • 8KB of audio becomes a 3KB image
    • there is some information loss (see the conversion sketch after this list)
  • the image model was chosen for its data-format consistency with spectrograms

  • the libri model was chosen for its informational similarity (speech content)

  • the disparity between the image and libri models seen on CIFAR-10 does not extend to this spectrogram task, judging from their BPB

    • spectrograms share more byte-level patterns with raw audio than with CIFAR-10 images
  • the libri model achieves higher accuracy than the image model, since the spectrograms carry speech content

  • byte models have an inherent capability to discern and translate abstract data features and patterns regardless of modality and format
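
A hedged sketch of the audio-to-spectrogram conversion used in this experiment; the exact STFT settings and normalization are not given in the paper, so those below are assumptions:

```python
# Hedged sketch: convert a 1-second Speech Commands clip into a 32x32 BMP spectrogram.
# STFT parameters, log scaling, and normalization are assumptions, not the paper's recipe.
import numpy as np
from PIL import Image
from scipy.io import wavfile
from scipy.signal import spectrogram

def wav_to_bmp_spectrogram(wav_path: str, bmp_path: str) -> None:
    rate, samples = wavfile.read(wav_path)
    _, _, spec = spectrogram(samples.astype(np.float32), fs=rate)
    spec = np.log1p(spec)                              # compress dynamic range
    spec = (255 * spec / spec.max()).astype(np.uint8)  # scale to 0..255
    img = Image.fromarray(spec).convert("RGB").resize((32, 32))
    img.save(bmp_path, format="BMP")
```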

5.3 Algorithm and Hardware Simulation

  • To evaluate bGPT's ability in simulating algorithms and hardware

  • Lack of baseline models and widely used datasets → focus on evaluating the scalability of bGPT on binary data

  • data conversion and CPU state modelling

  • data scale from $10^3$ to $10^6$ ($\text{bGPT}^3$ to $\text{bGPT}^6$)

  • all models are randomly initialized

  • for data conversion, used the IrishMAN dataset (ABC notation and MIDI files)

5.3.1 Data Conversion

  • for ABC to MIDI, $\text{BPB}_{\text{abc}}$ assesses generative modelling, as the ABC content is generated from scratch, while $\text{BPB}_{\text{midi}}$ evaluates data conversion, as the full ABC byte sequence is given

  • increased data volume directly enhances model performance in simulating data conversion

  • from Table 5, the BPB decreases as the data scale grows

  • the BPB for ABC is high in both directions

    • ABC to MIDI focuses on simulating an existing algorithm with the necessary information available, while the reverse process requires inferring and reconstructing information missing from MIDI (score structure, musical ornaments, expression)
    • as MIDI is binary and ABC is text, the model may find it easier to learn patterns within MIDI files

5.3.2 CPU State Modelling

  • goal : replicate CPU functionality

  • decoding selects the highest-probability byte at each step

  • accuracy → byte-wise comparison with the actual states

  • data volume significantly influences modelling performance

  • capability beyond simple memorization (each test case consists of an average of 128 instructions)

  • After epoch 11, $\text{bGPT}^5$ showed a significant improvement in performance → a deeper understanding of CPU states may stem from a qualitative enhancement in capability

  • Aligns with emergent abilities in LLMs

  • Is this learning genuine?

    • one might suspect the performance boost comes from non-linear metrics or overfitting
    • but BPB is a linear and smooth metric
    • so the improvement stems from a real comprehension of CPU operations
  • bGPT shows strong scalability on native binary data, with emergent abilities in data conversion and CPU state modelling

6. Conclusions

  • bGPT : a versatile simulator for the digital world

  • extending deep learning to binary data processing

  • effective in modeling digital media data + modality-agnostic knowledge transfer

  • strong scalability in modelling native binary data and signs of emergent abilities

  • without modality-specific designs, it shows comparable performance

  • opportunities for improvement

    • currently tested only on short audio and low-resolution images
    • data conversion limited to ABC and MIDI
    • only simplified CPUs
  • Future research

    • reducing computational cost
    • scaling models and datasets to cover broader data
    • improving model performance on underexplored tasks

7. Impact Statements

  • it necessitates a careful examination of its ethical implications

  • its ability to simulate or reverse-engineer algorithms

    • can significantly boost technological innovation in cybersecurity, software, and hardware
    • poses a risk to intellectual property, as training bGPT on paired source code and executable software might enable reverse-engineering of proprietary software
  • it offers opportunities for advancing our understanding of the digital world, but ethical, societal, and legal implications must be handled carefully

8. Comment

  • Since all computer data ultimately comes down to 0s and 1s, the idea is to achieve multimodality by approaching everything at the byte level. The reverse-engineering-style tasks via CPU states were also quite interesting. As always, scale is the issue, but one open question: representing everything as bytes should require a much longer context length than current models, and the paper does not seem to address this much.

