directly interpreting and manipulating binary data
two-fold advantages
Text Models
Audio Models
AudioPaLM : merged text and speech
MusicGen : generates music from multiple parallel streams of acoustic tokens produced by EnCodec
Image Models
Biochemical sequence Models
Binary data lacks the inherent structure and semantics of human-interpretable data
MalConv, DeepVSA : malware detection and program analysis
Byte-level Byte Pair Encoding (BBPE) : used for multilingual pretraining, machine translation
ByT5 : transformers for byte sequences
ByteFormer : raw byte sequences from images and audio
MegaByte : modelling long byte sequences across various modalities
MambaByte : used Mamba to excel in byte-level language modelling and outperformed LMs based on subword tokenization
Current research often neglects native binary data, focusing on narrow tasks and overlooking broader potential in digital world simulation
the high granularity of bytes results in very long sequences
the quadratic scaling of self-attention makes processing such sequences computationally expensive
hierarchical Transformer architecture
a byte sequence $B = \{b_1, b_2, \ldots, b_T\}$ of length $T$
segmented into a sequence of patches $\mathcal{P}$
each patch contains $S$ bytes
the number of patches $N = \lceil T / S \rceil$
if $T \bmod S \neq 0$, the last patch is padded with $\langle eop \rangle$ (end-of-patch) tokens to size $S$
Each patch $P_i$ from $\mathcal{P}$ is viewed as a matrix of size $S \times 257$, with each byte one-hot encoded over the 256 byte values plus the $\langle eop \rangle$ token
Flatten those patches into one-dimensional vectors of length $S \times 257$
the projection layer maps each flattened vector into a dense embedding $E_i$ of a hidden size $H$
the projection weight has the shape of $(S \times 257) \times H$
Dense embedding enables more efficient processing of the byte sequence by reducing the dimension while preserving the essential information
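A minimal sketch of the patching and linear projection steps above, assuming a PyTorch-style implementation; `patchify`, `PatchEmbedding`, and the `EOP` id are illustrative names, not the paper's code:

```python
# Minimal sketch of patching + linear projection, assuming PyTorch.
# S = patch size, H = hidden size, 257 = 256 byte values + <eop>.
import torch
import torch.nn as nn
import torch.nn.functional as F

EOP = 256  # end-of-patch / padding token id (assumption)

def patchify(byte_seq: torch.Tensor, S: int) -> torch.Tensor:
    """Split a 1D tensor of byte ids (0..255) into N = ceil(T/S) patches of S bytes,
    padding the last patch with <eop> so every patch has exactly S bytes."""
    T = byte_seq.size(0)
    N = (T + S - 1) // S
    padded = F.pad(byte_seq, (0, N * S - T), value=EOP)
    return padded.view(N, S)  # (N, S)

class PatchEmbedding(nn.Module):
    """One-hot encode each patch (S x 257), flatten, and project to hidden size H."""
    def __init__(self, S: int, H: int):
        super().__init__()
        self.proj = nn.Linear(S * 257, H)  # projection weight of shape (S*257) x H

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        one_hot = F.one_hot(patches, num_classes=257).float()  # (N, S, 257)
        return self.proj(one_hot.flatten(1))                   # (N, H)
```

For example, `PatchEmbedding(S=16, H=768)(patchify(byte_ids, 16))` would yield the $(N, H)$ sequence of patch embeddings fed to the patch-level decoder.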
Takes the sequence of embedded patches and processes it to autoregressively predict the features of the subsequent patch, effectively learning the structure of data
$\hat{E}_i = f_{\text{patch-decoder}}(E_{<i} \oplus X_{<i})$
$E_{<i}$ for the sequence of patch embeddings before the $i$-th patch
$X_{<i}$ for the corresponding positional embeddings
$\oplus$ for element-wise addition
Takes the predicted feature of each patch and autoregressively reconstructs the sequence of bytes within that patch
independent for each patch and operates by conditioning on the feature representation of the current patch
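A rough sketch of this hierarchical decoding path (patch-level decoder predicting patch features, byte-level decoder reconstructing bytes conditioned on them), assuming PyTorch; layer counts, head counts, and module names are assumptions, not the paper's configuration:

```python
# Hedged sketch of the hierarchical byte model (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalByteDecoder(nn.Module):
    def __init__(self, S: int, H: int, max_patches: int = 1024):
        super().__init__()
        self.S, self.H = S, H
        self.patch_proj = nn.Linear(S * 257, H)              # linear projection layer
        self.pos = nn.Embedding(max_patches, H)              # patch positional embeddings
        self.patch_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(H, nhead=8, batch_first=True), num_layers=6)
        self.byte_embed = nn.Embedding(257, H)
        self.byte_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(H, nhead=8, batch_first=True), num_layers=3)
        self.head = nn.Linear(H, 257)                        # next-byte logits

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        N, S = patches.shape
        one_hot = F.one_hot(patches, num_classes=257).float()            # (N, S, 257)
        E = self.patch_proj(one_hot.flatten(1))                          # patch embeddings (N, H)
        X = self.pos(torch.arange(N, device=patches.device))             # positional embeddings
        causal_N = nn.Transformer.generate_square_subsequent_mask(N)
        feats = self.patch_decoder((E + X).unsqueeze(0), mask=causal_N)[0]  # predicted patch features
        # Patch i is reconstructed from the feature predicted after patch i-1.
        prev = torch.cat([feats.new_zeros(1, self.H), feats[:-1]], dim=0)
        byte_in = self.byte_embed(patches) + prev.unsqueeze(1)           # (N, S, H)
        causal_S = nn.Transformer.generate_square_subsequent_mask(S)
        logits = self.head(self.byte_decoder(byte_in, mask=causal_S))    # (N, S, 257)
        return logits  # training would shift targets by one byte for next-byte prediction
```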
aims to predict the next byte based on preceding bytes without explicit guidance
the objective is minimizing the negative log-likelihood of the next byte prediction across the sequence
this loss encourages the model to understand the sequential dependencies in data at the byte level
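Written out, this is the standard next-byte negative log-likelihood, using the notation above:

$$\mathcal{L}_{\text{GEN}}(\theta) = -\sum_{i=1}^{T-1} \log p\left(b_{i+1} \mid b_1, \ldots, b_i; \theta\right)$$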
After being pretrained with next-byte prediction, the model is further trained on labelled datasets for classification
predicts categories from byte sequences
involves extracting a global feature from the byte sequence which is then processed by a classification head
$\mathcal{L}_{\text{CLF}}(\theta) = -\sum_{k=1}^{K} y_k \log p(y_k \mid B; \theta)$
$y_k$ is the boolean label for the $k$-th category, indicating whether the byte sequence belongs to that category
$K$ for the total number of categories
$p(y_k \mid B; \theta)$ is the predicted probability of category $k$ given the byte sequence $B$
The field of deep learning is steadily advancing its proficiency in both generation and classification of text, audio, and images
These media are typically stored and transmitted as byte sequences, so bGPT can process them directly for generative modelling and classification
bGPT is trained with next byte prediction, uses features from the patch-level decoder, and employs average pooling to derive global features for classification
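A hedged sketch of this classification setup: average-pool the patch-level decoder features into a global vector and feed it to a linear head; `patch_features` is a hypothetical accessor for the pretrained backbone's patch-level outputs:

```python
import torch
import torch.nn as nn

class ByteSequenceClassifier(nn.Module):
    """Global average pooling over patch-level features, followed by a linear head."""
    def __init__(self, backbone: nn.Module, H: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                    # pretrained hierarchical byte model
        self.classifier = nn.Linear(H, num_classes)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        feats = self.backbone.patch_features(patches)  # (N, H); hypothetical accessor
        global_feat = feats.mean(dim=0)                # average pooling over patches
        return self.classifier(global_feat)            # (num_classes,) category logits
```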
Data
converting data from one format to another, between a symbolic music format (ABC notation) and MIDI files
employs the generative modelling approach on concatenated byte sequences of paired ABC and MIDI files separated by a special patch
bGPT learns to convert text-based ABC notation into binary MIDI performance signals and its reverse
ability to simulate and reverse-engineer the conversion algorithm
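One plausible way to build such a training example; file paths and the separator convention are assumptions for illustration, and the actual preprocessing may differ:

```python
# Illustrative construction of a conversion training example: the byte sequences of a
# paired ABC file and MIDI file are concatenated, separated by a special separator patch.
PATCH_SIZE = 16
SEP_PATCH = bytes([0xFF] * PATCH_SIZE)   # stand-in for the special separator patch

def make_conversion_example(abc_path: str, midi_path: str, reverse: bool = False) -> bytes:
    with open(abc_path, "rb") as f:
        abc_bytes = f.read()
    with open(midi_path, "rb") as f:
        midi_bytes = f.read()
    src, tgt = (midi_bytes, abc_bytes) if reverse else (abc_bytes, midi_bytes)
    return src + SEP_PATCH + tgt   # the model learns p(tgt | src) via next-byte prediction
```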
given concatenated sequences of low-level machine instructions followed by a series of CPU register states
to accurately predict how the state updates with each instruction until the program halts
this tests interpreting operational data and replicating digital activities within hardware
CPU States dataset (2.1M instances)
offering a simplified representation of CPU behavior
each instance contains a 1KB memory block with varying numbers of machine instructions followed by a sequence of 16-byte CPU register states
the instruction set covers 21 instruction types with 43 variants (data movement, logical operations, arithmetic operations)
within each 16-byte state, the registers include the program counter, accumulator, instruction register, and general-purpose registers
instances are randomly generated with 1 to 256 instructions each, and the resulting register states are captured
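A hypothetical layout of one instance, consistent with the description above; field names, and any sizes beyond the stated 1KB block and 16-byte states, are illustrative:

```python
from dataclasses import dataclass
from typing import List

MEMORY_BLOCK_SIZE = 1024   # 1KB memory block holding the program
STATE_SIZE = 16            # bytes per CPU register state

@dataclass
class CPUStateInstance:
    memory_block: bytes            # 1 to 256 machine instructions, padded to 1KB
    register_states: List[bytes]   # one 16-byte state per executed instruction

    def to_byte_sequence(self) -> bytes:
        """Concatenate into the single byte sequence used for next-byte training."""
        assert len(self.memory_block) == MEMORY_BLOCK_SIZE
        assert all(len(s) == STATE_SIZE for s in self.register_states)
        return self.memory_block + b"".join(self.register_states)
```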
the 110M-parameter bGPT matches the scale of standard Transformer-based models
hyperparameter tuning and data augmentation were avoided for all evaluations
Accuracy (Acc) for classification
Bits-Per-Byte (BPB) for generative modelling
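For reference, Bits-Per-Byte is the average next-byte negative log-likelihood measured in bits (standard definition; lower is better):

$$\text{BPB} = -\frac{1}{T} \sum_{i=1}^{T} \log_2 p\left(b_i \mid b_{<i}\right)$$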
used standard pre-training and fine-tuning approach
bGPT_image : pretrained on ImageNet
bGPT_wiki : pretrained on Wikipedia
bGPT_libri : pretrained on LibriSpeech
bGPT_signal : pretrained on LibriSpeech + ImageNet
bGPT_mix : pretrained on LibriSpeech + ImageNet + Wikipedia
bGPT_random : randomly initialized, baseline
first fine-tuned with next byte prediction on AGNews, CIFAR-10, Speech Commands v2
then fine-tuned for classification
GPT2-small for text
ViT-B/16 for image
AST for audio
When the pretraining data and the fine-tuning data match in modality, the model shows the strongest performance in downstream tasks
Despite not having modality-specific prior knowledge, bGPT still manages to achieve performance close to the baselines
but image accuracy is much lower than ViT, as the sequential processing nature of byte models is not well suited to 2D data
on text and audio it shows comparable accuracy to the unimodal models, with only a small gap
positive transfer (pretraining on audio/image and fine-tuning on image/audio) shows improvements over random initialization
negative transfer (from text to other modalities) suggests the structured patterns learned in text pretraining do not carry over
To investigate cross-modal knowledge transfer, audio from Speech Commands v2 is converted into spectrogram images and two pretrained models are compared
bGPT_image, for its data-format consistency with spectrograms (both are images)
bGPT_libri, for its information similarity (both encode speech content)
the performance disparity seen on CIFAR-10 does not extend to this spectrogram task, judging by the BPB of the bGPT_image and bGPT_libri models
bGPT_libri achieves higher classification accuracy than bGPT_image, since the spectrograms carry speech content
byte models have an inherent capability to discern and translate abstract data features and patterns regardless of modality and format
To evaluate bGPT's ability in simulating algorithms and hardware
due to the lack of baseline models and widely used datasets for native binary data, the evaluation focuses on the scalability of bGPT
data conversion and CPU state modelling
bGPT^3 to bGPT^6 (training data scaled from 10^3 to 10^6)
all models are randomly initialized
for data conversion, used the IrishMAN dataset (ABC notation and MIDI files)
increased data volume directly enhances model performance in simulating data conversion
from Table 5, the BPB is decreasing as the model size grows
high BPB value for ABC in both directions
to replicate CPU functionality
selecting the highest probability byte at each step
accuracy is computed by byte-wise comparison of the predicted states with the actual states (see the sketch below)
data volume significantly influences modelling performance
this reflects capability beyond simple memorization (each test case contains an average of 128 instructions)
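A hedged sketch of this evaluation loop; `predict_next_byte_logits` is a hypothetical model interface, and feeding predictions back into the context (rather than the true bytes) is an assumption about the simulation setup:

```python
from typing import List
import torch

STATE_SIZE = 16  # bytes per CPU register state

def rollout_accuracy(model, memory_block: bytes, true_states: List[bytes]) -> float:
    """Greedily generate each 16-byte state and compare byte-wise with the ground truth."""
    context = list(memory_block)          # program bytes form the initial context
    correct, total = 0, 0
    for true_state in true_states:
        for j in range(STATE_SIZE):
            logits = model.predict_next_byte_logits(torch.tensor(context))  # (257,), hypothetical
            pred = int(torch.argmax(logits[:256]))   # highest-probability byte (greedy)
            correct += int(pred == true_state[j])
            total += 1
            context.append(pred)  # free-running simulation: feed the prediction back in (assumption)
    return correct / total
```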
After epoch 11, performance improved significantly, suggesting that a deeper understanding of CPU states may stem from a qualitative enhancement in capability
Aligns with emergent abilities in LLMs
Is this learning genuine?
bGPT shows strong scalability on native binary data with emergent abilities in data conversion and CPU state modelling
bGPT : as a versatile simulator for the digital world
extending deep learning to binary data processing
effective in modeling digital media data + modality-agnostic knowledge transfer
strong scalability in modelling native binary data and signs of emergent abilities
without modality-specific designs, it shows comparable performance
opportunities for improvement
Future research
it necessitates a careful examination of its ethical implications
its ability to simulate or reverse-engineer algorithms and hardware raises potential risks of misuse
it creates opportunities for advancing understanding of the digital world, but ethical, societal, and legal implications must be considered carefully