Abstract
![](https://velog.velcdn.com/images/0404_not_found/post/0370c0cd-c12d-47d7-a4dc-99e287598225/image.png)
- BitNet paved the way for a new era of 1-bit LLMs
- BitNet b1.58 has every parameter as a ternary value in {-1, 0, 1}
  - matches a full-precision Transformer with the same model size
  - significantly more cost-effective
- defines a new scaling law and recipe for training
1. The era of 1-bit LLMs
- Recent LLMs are rapidly growing in size
- Post-training quantization is widely used to create low-bit models for inference
  - reduces the precision of weights and activations
  - from 16 bits to lower bits (e.g., 4 bits)
  - sub-optimal
- BitNet presents a direction for reducing the cost of LLMs while maintaining their performance
- the major computation cost comes from floating-point addition and multiplication
  - BitNet needs only integer addition
- transferring model parameters from DRAM to the memory of an on-chip accelerator (SRAM) can be expensive during inference
  - enlarging SRAM to improve throughput → significantly higher cost than DRAM
  - 1-bit LLMs have a much lower memory footprint from both a capacity and a bandwidth standpoint
- BitNet b1.58
  - adds 0 to the original BitNet's {-1, +1}
  - retains all the benefits of the original BitNet
  - keeps the new computation paradigm (no multiplication in the matmul; see the sketch after this list)
  - same energy consumption as the original BitNet
  - stronger modeling capability → explicit support for feature filtering through the inclusion of 0
  - matches full-precision baselines in terms of PPL and end-task performance starting from 3B
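To make the "ternary weights, no multiplication" point concrete, here is a rough numpy sketch. The absmean-style scaling and all function names are my own illustration (the blog does not describe the exact quantization function), not the paper's kernel: weights are mapped to {-1, 0, +1}, and the matmul then needs only additions, subtractions, and skips.

```python
import numpy as np

def ternary_quantize(W, eps=1e-5):
    """Map full-precision weights to {-1, 0, +1} using an absmean-style scale (illustrative)."""
    gamma = np.abs(W).mean()                           # average absolute value of the tensor
    W_q = np.clip(np.round(W / (gamma + eps)), -1, 1)  # round, then clip to the ternary range
    return W_q.astype(np.int8), gamma

def ternary_matmul(W_q, x):
    """With ternary weights, the 'matmul' needs only additions, subtractions, and skips."""
    out = np.zeros(W_q.shape[0], dtype=x.dtype)
    for i in range(W_q.shape[0]):
        for j in range(W_q.shape[1]):
            if W_q[i, j] == 1:
                out[i] += x[j]        # +1 -> add
            elif W_q[i, j] == -1:
                out[i] -= x[j]        # -1 -> subtract
            # 0 -> skip entirely (explicit feature filtering)
    return out

W = np.random.randn(4, 8).astype(np.float32)
x = np.random.randn(8).astype(np.float32)
W_q, gamma = ternary_quantize(W)
y = gamma * ternary_matmul(W_q, x)    # rescale; roughly approximates W @ x
print(y)
print(W @ x)
```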
2. BitNet b1.58
Recap: BitLinear
- Binarize weights to +1 or -1 with the signum function
- Centralize the weights to zero mean to increase capacity within the limited numerical range
- Use a scaling factor β after binarization to reduce the $\ell_2$ error between the real-valued and the binarized weights
$$\tilde{W} = \mathrm{Sign}(W - \alpha)$$
$$\mathrm{Sign}(W_{ij}) = \begin{cases} +1, & \text{if } W_{ij} > 0 \\ -1, & \text{if } W_{ij} \le 0 \end{cases}$$
$$\alpha = \frac{1}{nm}\sum_{ij} W_{ij}, \qquad \beta = \frac{1}{nm}\|W\|_1$$
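A minimal numpy sketch of the three formulas above, assuming a 2-D weight matrix; the helper name is mine.

```python
import numpy as np

def binarize_weights(W):
    """BitLinear-style weight binarization: zero-center, take the sign, keep beta for rescaling."""
    alpha = W.mean()               # alpha = (1/nm) * sum_ij W_ij
    beta = np.abs(W).mean()        # beta  = (1/nm) * ||W||_1
    W_tilde = np.where(W - alpha > 0, 1.0, -1.0)   # Sign(W - alpha)
    return W_tilde, beta

W = np.random.randn(256, 512)
W_tilde, beta = binarize_weights(W)
# beta * W_tilde approximates W; beta itself is applied later, at dequantization.
```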
- Quantize activations to b-bit precision with absmax
- $Q_b = 2^{b-1}$
- $\epsilon$ is a small floating-point number that prevents overflow when clipping
$$\tilde{x} = \mathrm{Quant}(x) = \mathrm{Clip}\left(x \times \frac{Q_b}{\gamma},\; -Q_b + \epsilon,\; Q_b - \epsilon\right)$$
$$\gamma = \|x\|_\infty$$
- For activations before non-linear functions (e.g., ReLU) → scale into $[0, Q_b]$ by subtracting the minimum of the inputs
$$\tilde{x} = \mathrm{Quant}(x) = \mathrm{Clip}\left((x - \eta) \times \frac{Q_b}{\gamma},\; \epsilon,\; Q_b - \epsilon\right)$$
$$\eta = \min_{ij} x_{ij}$$
- activations are quantized to 8 bits
- Training → quantize per tensor / Inference → quantize per token (see the sketch below)
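A small numpy sketch of the 8-bit absmax quantization above, with a switch between the per-tensor and per-token scales; the function name and the tiny division guard are illustrative assumptions.

```python
import numpy as np

def absmax_quantize(x, b=8, eps=1e-5, per_token=True):
    """Absmax activation quantization: scale by Q_b / gamma, clip to [-Q_b + eps, Q_b - eps]."""
    Qb = 2 ** (b - 1)
    if per_token:
        gamma = np.abs(x).max(axis=-1, keepdims=True)   # gamma = ||x||_inf, one scale per token
    else:
        gamma = np.abs(x).max()                         # one scale for the whole tensor
    x_q = np.clip(x * Qb / (gamma + 1e-8), -Qb + eps, Qb - eps)  # 1e-8 only guards gamma == 0
    return x_q, gamma

x = np.random.randn(4, 16)                            # (tokens, hidden)
x_q, gamma = absmax_quantize(x)                       # per-token, as at inference
x_q_tensor, _ = absmax_quantize(x, per_token=False)   # per-tensor, as during training
x_deq = x_q * gamma / (2 ** 7)                        # dequantize back to roughly the original scale
```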
- Matrix Multiplication
$$y = \tilde{W}\tilde{x}$$
The variance of the output y, under the following assumptions:
- the elements in W and x are mutually independent and share the same distribution
- W and x are independent of each other
$$\mathrm{Var}(y) = n\,\mathrm{Var}(\tilde{w}\tilde{x}) = n\,E[\tilde{w}^2]\,E[\tilde{x}^2] = n\beta^2 E[\tilde{x}^2] \approx E[\tilde{x}^2]$$
In full precision, $\mathrm{Var}(y) = 1$ with standard initialization methods, which benefits training stability. To preserve this stability after quantization, a LayerNorm (SubLN) is applied before the activation quantization:
$$\mathrm{Var}(y) \approx E[\mathrm{LN}(\tilde{x})^2] = 1 \qquad \text{(SubLN)}$$
Then, the final representation of BitLinear is:
$$y = \tilde{W}\tilde{x} = \tilde{W}\,\mathrm{Quant}(\mathrm{LN}(x)) \times \frac{\beta\gamma}{Q_b}, \qquad \mathrm{LN}(x) = \frac{x - E(x)}{\sqrt{\mathrm{Var}(x) + \epsilon}}$$
- $\frac{\beta\gamma}{Q_b}$ is the dequantization factor that restores the output to the original precision
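Putting the pieces together, a hedged numpy sketch of the whole BitLinear forward pass (a plain per-token LayerNorm stands in for SubLN; helper names are mine):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LN(x) = (x - E[x]) / sqrt(Var(x) + eps), computed per token."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def bitlinear_forward(x, W, b=8, eps=1e-5):
    """y = Sign(W - alpha) Quant(LN(x)) * (beta * gamma / Q_b)."""
    Qb = 2 ** (b - 1)

    # 1) binarize weights
    alpha = W.mean()
    beta = np.abs(W).mean()
    W_tilde = np.where(W - alpha > 0, 1.0, -1.0)

    # 2) normalize, then quantize activations (absmax, per token)
    x_norm = layer_norm(x)
    gamma = np.abs(x_norm).max(axis=-1, keepdims=True)
    x_tilde = np.clip(x_norm * Qb / (gamma + 1e-8), -Qb + eps, Qb - eps)

    # 3) low-precision matmul, then dequantize with beta * gamma / Q_b
    y = x_tilde @ W_tilde.T
    return y * (beta * gamma / Qb)

x = np.random.randn(4, 64)    # (tokens, in_features)
W = np.random.randn(32, 64)   # (out_features, in_features)
print(bitlinear_forward(x, W).shape)   # (4, 32)
```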
- Model Parallelism with Group Quantization and Normalization
  - calculate the parameters α, β, γ, η independently within each group (device)
  - if the number of groups is G, the parameters become
$$\alpha_g = \frac{G}{nm}\sum_{ij} W_{ij}^{(g)}, \quad \beta_g = \frac{G}{nm}\|W^{(g)}\|_1, \quad \gamma_g = \|x^{(g)}\|_\infty, \quad \eta_g = \min_{ij} x_{ij}^{(g)}$$
  - LayerNorm should also be applied in the same grouped way (see the sketch below)
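A short sketch of the group-quantization idea: each of the G groups (e.g., one per device) computes its own parameters from its local slice only, so no cross-device communication is needed. The row/column split and function names below are assumptions for illustration.

```python
import numpy as np

def group_weight_params(W, G):
    """Compute alpha_g and beta_g independently for G row-groups of W."""
    groups = np.array_split(W, G, axis=0)                  # one slice per group/device
    alphas = np.array([g.mean() for g in groups])          # alpha_g = (G/nm) * sum_ij W_ij^(g)
    betas = np.array([np.abs(g).mean() for g in groups])   # beta_g  = (G/nm) * ||W^(g)||_1
    return alphas, betas

def group_activation_params(x, G):
    """Compute gamma_g and eta_g independently for G column-groups of x."""
    groups = np.array_split(x, G, axis=-1)
    gammas = np.array([np.abs(g).max() for g in groups])   # gamma_g = ||x^(g)||_inf
    etas = np.array([g.min() for g in groups])             # eta_g = min_ij x_ij^(g)
    return gammas, etas

W = np.random.randn(512, 256)
x = np.random.randn(8, 256)
alphas, betas = group_weight_params(W, G=4)        # each group uses only its own statistics
gammas, etas = group_activation_params(x, G=4)
```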
BitNet b1.58
LLaMA-alike components
- uses LLaMA-alike components
  - RMSNorm
  - SwiGLU
  - rotary embedding
  - removed all biases
- it can be integrated into popular open-source software
3. Results
- BitNet b1.58 vs FP16 LLaMA
- pretrained on RedPajama for 100B tokens
- zero-shot performance on
  - ARC-Easy
  - ARC-Challenge
  - Hellaswag
  - Winogrande
  - PIQA
  - OpenbookQA
  - BoolQ
- validation PPL
- runtime GPU memory and latency
  - FasterTransformer codebase
  - 2-bit kernel from Ladder integrated for BitNet b1.58
  - latency measured as the time per output token
![](https://velog.velcdn.com/images/0404_not_found/post/433c531f-d56d-405b-a4cf-e1a66e5cc3ae/image.png)
![](https://velog.velcdn.com/images/0404_not_found/post/506ee24e-0f80-4162-815e-a9c6bc87e631/image.png)
- the performance gap between BitNet b1.58 and LLaMA narrows as the model size increases
- in terms of zero-shot performance, BitNet b1.58 starts to match LLaMA at the 3B size
- BitNet b1.58 3.9B outperforms LLaMA 3B → BitNet b1.58 is a Pareto improvement over the SOTA LLMs
Memory and Latency
![](https://velog.velcdn.com/images/0404_not_found/post/8e91fac3-e056-40bc-a24f-2e10ecb88965/image.png)
- the speed-up increases as the model size scales
  - because the proportion of nn.Linear grows with the model size
- for memory, the trend follows that of latency
  - the embedding remains in full precision, and its proportion gets smaller as the model grows
- both were measured with a 2-bit kernel, so there is still room for optimization
Energy
![](https://velog.velcdn.com/images/0404_not_found/post/ee87a15d-a147-44e7-9599-dabeb032e909/image.png)
- for the LLaMA model, the majority of the matmul cost is FP16 multiplication, while for BitNet b1.58 it is INT8 addition
- BitNet b1.58 becomes more efficient as the model gets larger
  - the percentage of nn.Linear grows with the model size
Throughput
- compared on two 80GB A100 cards
- BitNet b1.58 70B vs LLaMA 70B
- batch size increased to the maximum that fits in GPU memory
![](https://velog.velcdn.com/images/0404_not_found/post/26a061f7-2ec8-47f0-8e44-a23249725c09/image.png)
- in terms of latency, memory usage, and energy consumption:
  - BitNet b1.58 13B is more efficient than FP16 3B
  - BitNet b1.58 30B is more efficient than FP16 7B
  - BitNet b1.58 70B is more efficient than FP16 13B
Training with 2T tokens
![](https://velog.velcdn.com/images/0404_not_found/post/9f572fb5-5c2a-49f6-ba3b-a997ee9dcca9/image.png)
- the 2T-token BitNet b1.58 model shows strong generalization capabilities
4. Discussion and Future Work
1-bit MoE LLMs
Native Support of Long Sequence in LLMs
LLMs on Edge and Mobile
- for edge and mobile devices, BitNet b1.58 can ease the limits on memory and computational power
- 1-bit LLMs are more friendly to CPU devices
New Hardware for 1-bit LLMs
- Groq demonstrated promising results and great potential for building hardware specific to LLMs (LPUs)
- the authors expect new hardware designed for 1-bit LLMs
A comment I am rewriting after losing it twice. If the original BitNet showed the potential of 1-bit models, this paper feels like a more polished follow-up. It brought on-device inference and ternary semiconductors to mind. It would have been nice if they had explained why 0 was not included from the start, and what effect the choice of quantization ranges has.