Can neural networks be trained from scratch using Low-Rank Adapters?
each worker stores a different copy of the LoRA parameters
trained on different data shards
their method enables distributed training with infrequent synchronization while still allowing single-device inference
Adapters : trainable functions that modify existing layers in a neural network
LoRA : subclass of linear adapters
Given an input $x \in \mathbb{R}^{n}$ and a linear layer parameterized by the weight $W \in \mathbb{R}^{m \times n}$
LoRA re-parameterizes the function as $f(x) = (W + sBA)\,x$, with $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$, rank $r \ll \min(m, n)$, and scaling factor $s$
the forward pass incurs a small extra computational overhead from the low-rank branch
the significance of LoRA pertains to the optimizer memory footprint: gradients and optimizer states are kept only for the low-rank factors $A$ and $B$
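A minimal PyTorch-style sketch of this parameterization (my own illustration; the initialization and scaling choices are placeholders, not the paper's):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """f(x) = (W + s * B @ A) x, with W frozen and only A, B trainable."""

    def __init__(self, in_features: int, out_features: int, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        # frozen main weight (requires_grad=False): receives no gradient
        self.W = nn.Parameter(0.02 * torch.randn(out_features, in_features), requires_grad=False)
        # low-rank factors: B starts at zero so the initial model equals W alone
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.A = nn.Parameter(0.02 * torch.randn(rank, in_features))
        self.s = alpha / rank   # common s = alpha / r convention

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # the low-rank branch adds O(r(m+n)) work per token; optimizer state is
        # O(r(m+n)) instead of O(mn) because only A and B are trained
        return x @ self.W.t() + self.s * (x @ self.A.t()) @ self.B.t()
```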
low-rank LoRA pre-training shows inferior performance to models trained with standard optimization
LoRA is incapable of recovering weight updates whose rank exceeds $r$
even when a solution exists within a low-rank proximity of the initialization, reaching it may still require high-rank updates
this motivates why LoRA heads in parallel can achieve the performance of standard pre-training
elevating the rank to that of the full weight matrix is sufficient to replicate standard pre-training performance
leveraging multiple low-rank adapters in parallel
given a matrix of the form $BA$ with $B \in \mathbb{R}^{m \times 2r}$ and $A \in \mathbb{R}^{2r \times n}$,
it is possible to represent the product as the sum of two lower-rank matrices: $BA = B_1 A_1 + B_2 A_2$, where $B = [B_1 \; B_2]$ and $A = [A_1 ; A_2]$
given a matrix $W \in \mathbb{R}^{m \times n}$ and constant $s$, multi-head LoRA (MHLoRA) with $N$ heads
reparameterizes the full-rank weights into a linear combination of low-rank weights: $W + \frac{s}{N}\sum_{n=1}^{N} B_n A_n$
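A quick numpy check of the splitting argument (my own toy example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 32, 48, 4

# a rank-2r matrix written as one product B @ A ...
B = rng.normal(size=(m, 2 * r))
A = rng.normal(size=(2 * r, n))
W = B @ A

# ... equals the sum of two rank-r products obtained by splitting
# the columns of B and the rows of A (block matrix multiplication)
W_split = B[:, :r] @ A[:r, :] + B[:, r:] @ A[r:, :]

print(np.allclose(W, W_split))  # True
```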
a single parallel LoRA head can approximate a single step of the multi-head LoRA, provided that the parallel LoRA heads are periodically merged into the full weights
the first scenario is rank-deficient and unable to recover the original model performance
the latter case requires that $W$ accumulate all the information of the LoRA parameters at every iteration; if the merge operator is applied every iteration, recovering the exact update is possible
one can then recover the exact gradient updates of the MHLoRA
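A schematic of the merge-every-iteration argument (my own shorthand; the paper's formal statement and constants may differ):

```latex
\[
W^{\mathrm{MHLoRA}}_{\mathrm{eff}} = W + \frac{s}{N}\sum_{k=1}^{N} B_k A_k ,
\qquad
W^{\mathrm{head}\,n}_{\mathrm{eff}} = W_t + \frac{s}{N} B_n A_n .
\]
\[
\text{If the merge keeps } W_t = W + \frac{s}{N}\sum_{k \neq n} B_k A_k \text{ up to date at every iteration,}
\]
\[
\text{then } W^{\mathrm{head}\,n}_{\mathrm{eff}} = W^{\mathrm{MHLoRA}}_{\mathrm{eff}},
\text{ so head } n \text{ sees the same effective weights and its update matches MHLoRA's.}
\]
```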
in a distributed setting, only the LoRA parameters/gradients have to be communicated across devices, which is good when the interconnect speed is limited
To reduce the communication cost of LTE
allow the LoRA parameters to train independently for a longer period before applying the merge operator
Merging every iteration ensures the representation will not diverge
using stale estimates relaxes this equivalence, but it can still match standard training performance
as the estimate becomes inaccurate, the optimization trajectory diverges from the optimization path of MHLoRA
this doesn't imply that the model won't optimize
it just follows a different path from MHLoRA
used simple averaging (left more sophisticated merging as future work)
achieving an informative update without materializing full-parameter-sized gradients and optimizer states during training
parameterizing $W$ such that it can be stored in low precision and communicated efficiently (using quantized weights while keeping a high-precision master copy)
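A tiny illustration of the low-precision storage idea (symmetric per-tensor quantization; my own example, not necessarily the scheme used in the paper):

```python
import torch

def quantize_sym(w: torch.Tensor, bits: int = 4):
    """Return integer codes and a scale for symmetric per-tensor quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    codes = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return codes, scale  # codes held in int8 here; real 4-bit storage would pack two per byte

def dequantize(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return codes.float() * scale

W_master = torch.randn(256, 256)       # high-precision copy kept on the host
codes, scale = quantize_sym(W_master)  # low-precision copy communicated to workers
print((W_master - dequantize(codes, scale)).abs().mean().item())  # quantization error
```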
LoRA-the-Explorer (LTE) : optimization algorithm that approximates full-rank updates with parallel low-rank updates
creates $N$ different LoRA heads for each linear layer at initialization
each worker $n$ is assigned its LoRA parameters $(B_n, A_n)$ and creates a local optimizer
independently sample data from the same distribution
for each LoRA head $n$, optimize the parameters on its own data partition for $T$ iterations to get the updated $(B_n, A_n)$
don't synchronize the optimizer state across workers
after the optimization, synchronize the LoRA parameters to compute the final update for the main weight: $W \leftarrow W + \frac{s}{N}\sum_{n=1}^{N} B_n A_n$
then continue training the LoRA parameters against the updated main weights
since training never touches the main parameter $W$ directly, a quantized $W$ can be used
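A single-process toy sketch of this loop on one linear map (everything below, including sizes, learning rate, and the decision to reset heads after a merge, is my own simplification rather than the paper's recipe):

```python
import torch

torch.manual_seed(0)

# learn a random linear map with N parallel rank-r LoRA heads on one frozen weight W
m, n, r, N = 32, 32, 4, 4
s, lr, merge_every, outer_steps = 1.0, 1e-2, 10, 200

W_true = torch.randn(m, n)   # target weights
W = torch.zeros(m, n)        # frozen main weights, touched only at merges

heads, opts = [], []
for _ in range(N):
    B = torch.zeros(m, r, requires_grad=True)        # B = 0 so s*B@A starts at zero
    A = (0.01 * torch.randn(r, n)).requires_grad_()  # small random A (illustrative init)
    heads.append((B, A))
    opts.append(torch.optim.AdamW([B, A], lr=lr))    # local optimizer, never synchronized

for step in range(outer_steps):
    # each head trains independently on its own mini-batches (run sequentially here)
    for (B, A), opt in zip(heads, opts):
        for _ in range(merge_every):
            x = torch.randn(64, n)                                       # head-local data
            loss = ((x @ (W + s * B @ A).T - x @ W_true.T) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()

    # merge: simple average of the low-rank products goes into the main weights,
    # then each head is re-initialized (the paper's post-merge handling may differ)
    with torch.no_grad():
        W += s * sum(B @ A for B, A in heads) / N
        for B, A in heads:
            B.zero_(); A.normal_(std=0.01)

print("relative error:", (torch.norm(W - W_true) / torch.norm(W_true)).item())
```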
a common misconception is that scaling $s$ has the same effect as tuning the learning rate
in their experiments, setting the scaling to the conventional range of roughly 1–4 did not yield comparable performance
using a large $\alpha$ and slightly lowering the learning rate worked best
standard practice: set $\alpha$ proportional to the rank $r$, i.e. $s = \alpha / r$
the paper instead uses a large fixed $\alpha$ together with a lowered learning rate
the learning rate does not scale linearly with $s$
in the forward pass, $s$ only scales the low-rank branch $sBA$ linearly
the resulting weight update, however, scales quadratically with $s$ and depends on the alignment of the gradient with $A$ and $B$
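To make the quadratic claim concrete, a first-order expansion of one plain-SGD step on the factors (my own derivation, with $G := \nabla_{W'} \mathcal{L}$ at $W' = W + sBA$ and learning rate $\eta$):

```latex
\[
\nabla_B \mathcal{L} = s\, G A^{\top}, \qquad \nabla_A \mathcal{L} = s\, B^{\top} G
\]
\[
s B^{+} A^{+} - s B A
  = -\eta s^{2} \big( G A^{\top} A + B B^{\top} G \big)
    + \eta^{2} s^{3}\, G A^{\top} B^{\top} G
\]
% to first order the update scales with \eta s^2, and its magnitude depends on
% A^T A and B B^T, i.e. on how G aligns with the factors A and B
```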
iteratively merging LoRA is a key component in recovering the full-rank representation
they assess the effectiveness of merging a single LoRA head in the context of linear networks trained on synthetic least-squares regression datasets
without merging, model performance plateaus
iterative merging recovers the ground-truth solution, with the convergence rate increasing with higher merge frequency
monotonic improvement in performance with an increased number of heads and ranks
extending the number of iterations between merges negatively impacts performance
in LS regression, excessive merging hurts model accuracy
with a large enough rank and number of heads, the model converges to better accuracy even when the test loss is similar
averaging of the LoRA heads has a regularization effect similar to model ensembling
ViT-S as the primary architecture
in the ablations, they fixed a cumulative batch size of 4096 and trained for 1200 epochs
each LoRA head received a reduced batch size of 4096 / (number of heads)
scaling the rank exerts a greater impact than increasing the number of heads
gradient noise scales up proportionally as the per-head mini-batches shrink
this gradient noise contributes to slower convergence, in addition to the use of stale parameter estimates
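A one-line reminder of the standard mini-batch variance argument behind this (not from the paper):

```latex
\[
\operatorname{Var}\big[\hat g_b\big] \propto \tfrac{1}{b},
\qquad
b_{\text{head}} = \tfrac{4096}{N_{\text{heads}}}
\;\Rightarrow\;
\operatorname{Var}\big[\hat g_{\text{head}}\big] \approx N_{\text{heads}} \cdot \operatorname{Var}\big[\hat g_{4096}\big]
\]
```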
increasing the number of heads necessitates more total FLOPs, but it offers efficient parallelization
using a larger batch size for gradient estimation may prove beneficial in distributed training
this study is focused on training deep networks with parallel low-rank adapters (not efficiency!)
hypothetical computation analysis for future scaling efforts
let $|\theta|$ denote the model size and $|\theta_{\text{lte}}|$ the per-device LoRA parameter size for LTE
let $N_{\text{ddp}}$ and $N_{\text{lte}}$ denote the number of devices for each method
with quantization, each LTE device requires a much smaller memory footprint for the frozen main weights
since the base model is 16-bit, 4-bit quantization shrinks its footprint to $|\theta|/4$
with AdamW, DDP necessitates an additional $3|\theta|$ parameters for gradients and the two optimizer states (total $4|\theta|$)
for LTE, roughly $|\theta|/4 + 4|\theta_{\text{lte}}|$ is needed per device (quantized main weights plus LoRA weights, gradients, and optimizer states)
assuming training is parameter-bound by the main weights $|\theta|$, LTE can leverage GPUs with roughly 1/3 the memory of those required for DDP
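A back-of-the-envelope version of this bookkeeping (the model size and the per-head LoRA fraction below are placeholders of my own, not the paper's numbers):

```python
# per-device memory in multiples of a 16-bit parameter copy
params = 22_000_000   # e.g. a ViT-S-sized model (placeholder)
lora_frac = 1 / 3     # per-head LoRA params as a fraction of |theta| (depends on rank/widths)

ddp_per_device = 4 * params                           # weights + grads + 2 AdamW states
lte_per_device = params / 4 + 4 * lora_frac * params  # 4-bit frozen base + 4 copies of LoRA params

print(f"DDP: {ddp_per_device / params:.2f} x |theta|")
print(f"LTE: {lte_per_device / params:.2f} x |theta|")
print(f"LTE / DDP: {lte_per_device / ddp_per_device:.2f}")
```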
LTE requires roughly 40% more data, and quantization (as in QLoRA) adds roughly a 20% slowdown per iteration
on average, each LTE device observes 1/3 less data than a device in DDP
Communication bottleneck
Training with adapters
Distributed Training and Federated Learning
Linear mode connectivity and model averaging
Low-rank adapters for model pre-training
LTE : bi-level optimization method that capitalizes on the memory-efficient properties of LoRA
how to accelerate convergence during the final 10% of the training?
how to dynamically determine the number of ranks or heads?
is heterogeneous parameterization of LoRA feasible where each LoRA has a different rank?
what strategies for merging can achieve higher performance?
This study shows viability
tests on larger models are needed
this will pave the way for pre-training models in computationally constrained or low-bandwidth environments
The idea: approximate and recover the full model using adapters, without touching the main parameters. It is not about decomposing a rank-r LoRA back into rank-1 pieces and merging them right away, but about merging periodically. Why did I never think of decomposing LoRA again?