KeyPoints:
Background
- Prior works concentrate on the encoder only (PVT, Swin Transformer, etc.).
- Their decoders still require heavy computation.
Overall Method
- The input image (H×W×3) is divided into patches of size 4×4.
- A hierarchical Transformer encoder extracts multi-level features from these patches at {1/4, 1/8, 1/16, 1/32} of the original image resolution.
- The features are passed to the All-MLP decoder, which predicts a segmentation mask of resolution H/4 × W/4 × Ncls. (Ncls: number of categories; see the shape sketch below.)
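A minimal shape walkthrough of this pipeline (the image size, number of categories, and per-stage channel widths below are assumed for illustration, not the official SegFormer configuration):

```python
# Hypothetical sizes for illustration only.
H, W, N_cls = 512, 512, 19           # e.g. a 512x512 image, 19 categories
channels = [64, 128, 320, 512]       # assumed per-stage channel widths C1..C4

# Hierarchical encoder: four stages at 1/4, 1/8, 1/16, 1/32 resolution.
feature_shapes = [(H // s, W // s, c) for s, c in zip([4, 8, 16, 32], channels)]
print(feature_shapes)
# [(128, 128, 64), (64, 64, 128), (32, 32, 320), (16, 16, 512)]

# All-MLP decoder fuses the four levels and predicts a mask at 1/4 resolution.
mask_shape = (H // 4, W // 4, N_cls)
print(mask_shape)   # (128, 128, 19)
```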
Overlapped Patch Merging.
- Like ViT, which unifies each N×N×3 patch into a 1×1×C vector, hierarchical features of size H/4 × W/4 × Ci are shrunk into H/8 × W/8 × Ci+1, and the process repeats for the remaining hierarchy levels.
(Note: what the ViT paper actually does is flatten each p×p×c patch into a 1×(p²·c) vector; an H×W×c image is divided into N patches of size p×p, which are flattened into an N×(p²·c) matrix.)
- To preserve local continuity among the patches, K = 7, S = 4, P = 3 (and K = 3, S = 2, P = 1) are used, where K is the patch (kernel) size, S the stride, and P the padding; see the sketch below.
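A minimal sketch of overlapped patch merging as a strided convolution whose output is flattened into a token sequence (the channel widths and image size are assumed for illustration, not taken from the paper's configs):

```python
import torch
import torch.nn as nn

class OverlapPatchMerging(nn.Module):
    """Overlapping patch merging via a strided convolution (a sketch)."""
    def __init__(self, in_ch, out_ch, K, S, P):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=K, stride=S, padding=P)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x):                    # x: (B, in_ch, H, W)
        x = self.proj(x)                     # (B, out_ch, H/S, W/S)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)     # (B, H*W, C) token sequence
        x = self.norm(x)
        return x, H, W

# Stage 1: K=7, S=4, P=3 maps an H×W×3 image to an (H/4)·(W/4)-token sequence.
stage1 = OverlapPatchMerging(3, 64, K=7, S=4, P=3)
tokens, h, w = stage1(torch.randn(1, 3, 512, 512))
print(tokens.shape, h, w)   # torch.Size([1, 16384, 64]) 128 128
```

Because neighboring K×K windows overlap when S < K, adjacent patches share pixels, which is what preserves local continuity across patch boundaries.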
Efficient Self-Attention.
By reducing the length of the sequence K (size N×C) from N to N/R with
K̂ = Reshape(N/R, C·R)(K) [reshape N×C to (N/R)×(C·R)]
K = Linear(C·R, C)(K̂) [linear projection from (N/R)×(C·R) to (N/R)×C]
the self-attention complexity decreases from O(N²) to O(N²/R).
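A minimal sketch of this sequence reduction, following the Reshape + Linear formulas above (the batch size, channel width C, and reduction ratio R are assumed for illustration; N must be divisible by R):

```python
import torch
import torch.nn as nn

class SequenceReduction(nn.Module):
    """Reduce a key/value sequence from N to N/R tokens (a sketch)."""
    def __init__(self, C, R):
        super().__init__()
        self.R = R
        self.linear = nn.Linear(C * R, C)            # Linear(C·R, C)

    def forward(self, K):                            # K: (B, N, C)
        B, N, C = K.shape
        K = K.reshape(B, N // self.R, C * self.R)    # Reshape(N/R, C·R)
        return self.linear(K)                        # (B, N/R, C)

# Keys/values shrink from N to N/R tokens, so attention against the
# full-length queries costs O(N^2 / R) instead of O(N^2).
reduce = SequenceReduction(C=64, R=4)
K = torch.randn(1, 16384, 64)
print(reduce(K).shape)   # torch.Size([1, 4096, 64])
```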
Mix-FFN.