ViTAR: Vision Transformer with Any Resolution (Fan et al., 2024)
ViT -> constrained scalability across different image resolutions
ViTAR
Element 1: Adaptive Token Merger (ATM)
Element 2: Fuzzy Positional Encoding (FPE)
Results
ATM receives tokens that have been processed through patch embedding as its input
Preset G_h × G_w -> number of tokens we ultimately aim to obtain
ATM partitions tokens with the shape of H × W into a grid of size G_th × G_tw
Assume H is divisible by G_th, W is divisible by G_tw
=> Number of tokens contained in each grid -> H/G_th × W/G_tw (each typically set to 1 or 2)
Case H ≥ 2G_h -> H/G_th = 2 -> G_th = H/2
Case G_h < H < 2G_h -> pad H to 2G_h, set G_th = G_h -> H/G_th = 2
Case H = G_h -> tokens on the edge of H are no longer fused -> H/G_th = 1
Same with W (see the per-axis sketch below)
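A minimal sketch of this per-axis case analysis, assuming one merging iteration at a time; the name plan_merge_axis and the pad-odd-sizes-to-even handling are my assumptions, not from the ViTAR release:

```python
def plan_merge_axis(size: int, target: int):
    """One-axis grid planning for a single ATM iteration.

    Returns (padded_size, num_grids, tokens_per_grid) with
    padded_size == num_grids * tokens_per_grid. Assumes size >= target.
    """
    if size >= 2 * target:            # case H >= 2*G_h: fuse 2 tokens per grid
        padded = size + (size % 2)    # assumption: pad odd sizes to even
        return padded, padded // 2, 2
    elif size > target:               # case G_h < H < 2*G_h: pad to 2*G_h
        return 2 * target, target, 2
    else:                             # case H == G_h: edge tokens no longer fused
        return size, size, 1

# e.g. with target G_h = 14: 56 -> (56, 28, 2); 20 -> (28, 14, 2); 14 -> (14, 14, 1)
```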
Specific grid -> has tokens {x_ij}, with 0 ≤ i < H/G_th and 0 ≤ j < W/G_tw
Average pooling on all {x_ij} -> mean token
Cross attention (mean token as query; all {x_ij} as key and value) to merge all tokens within a grid into a single token
Residual connections
Fused token -> FFN to complete channel fusion => one iteration of token merging
All iterations share the same weights
Gradually decreasing the values of G_th and G_tw until G_th = G_h & G_tw = G_w (see the merging sketch below)
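A PyTorch sketch of GridAttention and the shared-weight merging loop, reusing the plan_merge_axis helper sketched above; class and argument names are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridAttention(nn.Module):
    """Merges each grid of tokens into a single token, as described above."""
    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, grids: torch.Tensor) -> torch.Tensor:
        # grids: (B * num_grids, tokens_per_grid, dim)
        q = grids.mean(dim=1, keepdim=True)        # average pooling -> mean token (query)
        fused, _ = self.attn(q, grids, grids)      # cross attention: K, V = grid tokens
        fused = q + fused                          # residual connection
        fused = fused + self.ffn(fused)            # FFN for channel fusion
        return fused.squeeze(1)                    # (B * num_grids, dim)

def atm_merge(x: torch.Tensor, gh: int, gw: int, block: GridAttention) -> torch.Tensor:
    """Iteratively merge (B, H, W, C) tokens down to (B, gh, gw, C).
    Every iteration reuses the same `block`, i.e. all iterations share weights.
    Assumes H >= gh and W >= gw."""
    B, H, W, C = x.shape
    while (H, W) != (gh, gw):
        Hp, nh, th = plan_merge_axis(H, gh)
        Wp, nw, tw = plan_merge_axis(W, gw)
        x = F.pad(x, (0, 0, 0, Wp - W, 0, Hp - H))           # pad right/bottom if needed
        x = x.view(B, nh, th, nw, tw, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B * nh * nw, th * tw, C)               # one row per grid cell
        x = block(x).view(B, nh, nw, C)                      # each grid -> single token
        H, W = nh, nw
    return x
```

E.g. atm_merge(torch.randn(2, 56, 56, 192), 14, 14, GridAttention(192)) takes 56 × 56 tokens to 14 × 14 in two passes through the same weights.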
Common choices: learnable positional encoding + sin-cos positional encoding
=> highly sensitive to changes in input resolution (fail to provide effective resolution adaptability)
Conv-based PE -> better resolution robustness
FPE: randomly initialize a set of learnable positional embeddings.
During training, FPE provides only fuzzy positional information -> the position shifts within a certain range
(i, j) = exact coordinates of the target token; (i + s1, j + s2) as positional coordinates generated with FPE, where s1, s2 ~ U(-0.5, 0.5)
Add randomly generated coordinate offsets to the reference coordinates in the training stage
Perform grid sample on the learnable positional embeddings -> resulting in the FPE (sketched below)
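A minimal PyTorch sketch of FPE at training time, assuming the learnable embeddings live in a (1, dim, grid_h, grid_w) map sampled with F.grid_sample; the class name and border padding mode are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuzzyPositionalEncoding(nn.Module):
    def __init__(self, dim: int, grid_h: int = 14, grid_w: int = 14):
        super().__init__()
        # randomly initialized set of learnable positional embeddings
        self.pe = nn.Parameter(torch.randn(1, dim, grid_h, grid_w) * 0.02)

    def forward(self, h: int, w: int) -> torch.Tensor:
        # reference coordinates (i, j) of each of the h x w tokens
        ii, jj = torch.meshgrid(
            torch.arange(h, dtype=torch.float32, device=self.pe.device),
            torch.arange(w, dtype=torch.float32, device=self.pe.device),
            indexing="ij",
        )
        if self.training:
            # fuzzy offsets s1, s2 ~ U(-0.5, 0.5), drawn per token
            ii = ii + torch.rand_like(ii) - 0.5
            jj = jj + torch.rand_like(jj) - 0.5
        # map (i + s1, j + s2) into grid_sample's normalized [-1, 1] range
        grid = torch.stack(
            [jj / max(w - 1, 1) * 2 - 1,    # x comes first in grid_sample's last dim
             ii / max(h - 1, 1) * 2 - 1],   # then y
            dim=-1,
        ).unsqueeze(0)                      # (1, h, w, 2)
        # bilinear grid sample of the learnable embeddings -> the FPE
        return F.grid_sample(self.pe, grid, mode="bilinear",
                             padding_mode="border", align_corners=True)  # (1, dim, h, w)
```

In eval mode the offsets vanish, so the same grid sample reduces to precise positional encoding: an exact lookup when the token grid matches the stored map, bilinear interpolation when the resolution differs.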
Case 1: inference at the training resolution -> use precise (non-fuzzy) positional encoding directly
Case 2: inference at a new resolution -> interpolate the learnable positional embeddings