Our custom ViT model architecture closely mimics that of the ViT paper; however, our training recipe is missing a few things:
- ImageNet-22k pretraining (more data) - Train a model on a large corpus of images (about 14 million in the case of ImageNet-22k) spanning roughly 22,000 classes so it can learn a good underlying representation of images that can be applied to other problems.
- Learning rate warmup - Start with a small learning rate (almost 0) and warm it up to a desired value (e.g. 1e-3) to prevent a model's loss from exploding at the start of training.
- Learning rate decay - Slowly lower the learning rate over time so a model doesn't overshoot when it's close to convergence (like reaching for a coin at the back of a couch, the closer you get to the coin, the smaller the steps you take).
- Gradient clipping - Cap a model's gradients at a chosen threshold to prevent them from getting too large and causing the loss to explode.
Learning rate warmup, learning rate decay and gradient clipping mainly keep training stable, which tends to help a model generalize better (less overfitting). ImageNet-22k pretraining also helps to prevent underfitting, since patterns learned from another dataset can be applied to your own for better performance.
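To make the learning rate and gradient techniques more concrete, here is a minimal sketch in plain Python. It is not the ViT paper's exact recipe: the function names `lr_at_step` and `clip_by_global_norm`, the linear shape of the warmup and decay, and all the numbers (1,000 total steps, 100 warmup steps, a max norm of 1.0) are assumptions chosen for illustration.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr=1e-3):
    """Illustrative schedule: linear warmup to base_lr, then linear decay to 0."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must be larger than warmup_steps")
    if step < warmup_steps:
        # Warmup: grow from a small value up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Decay: shrink from base_lr down to 0 by the final step.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / (total_steps - warmup_steps))


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # Already small enough, leave unchanged.
    scale = max_norm / total_norm
    return [g * scale for g in grads]


if __name__ == "__main__":
    # Hypothetical settings, not values from the paper.
    for step in (0, 50, 99, 100, 500, 1000):
        print(step, lr_at_step(step, total_steps=1000, warmup_steps=100))

    # A large gradient gets scaled down, a small one passes through.
    print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
    print(clip_by_global_norm([0.3, 0.4], max_norm=1.0))
```

In a PyTorch training loop you likely wouldn't write these by hand: a schedule like this can be attached to an optimizer with `torch.optim.lr_scheduler.LambdaLR`, and gradients can be clipped between `loss.backward()` and `optimizer.step()` with `torch.nn.utils.clip_grad_norm_`.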