SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

shinunjune · May 25, 2023


- Positional-encoding-free, hierarchically structured Transformer encoder
(performance robust to a mismatch between training and test resolution; multiscale features)
- Simply structured MLP decoder


  • Earlier works (PVT, Swin Transformer, etc.) concentrate on the encoder only.
  • Their decoders still require heavy computation.

Overall Method

  1. The input image (H×W×3) is divided into patches of size 4×4.
  2. The hierarchical Transformer encoder takes these patches and produces multi-level features at {1/4, 1/8, 1/16, 1/32} of the original image resolution.
  3. The features are passed to the All-MLP decoder, which predicts a segmentation mask at $\frac{H}{4}\times\frac{W}{4}\times N_{cls}$ resolution ($N_{cls}$: number of categories).
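
The resolutions in the three steps above can be checked with a little arithmetic. The following is only a minimal sketch: the helper names `stage_shapes` and `mask_shape` are made up for illustration, the 512×512 input and the 150 categories are example values, and the input size is assumed to be divisible by 32.

```python
def stage_shapes(height, width, strides=(4, 8, 16, 32)):
    """Spatial size (H/s, W/s) of the feature map at each encoder stage."""
    return [(height // s, width // s) for s in strides]


def mask_shape(height, width, num_classes):
    """Shape of the decoder output: H/4 x W/4 x N_cls."""
    return (height // 4, width // 4, num_classes)


# Example values only: a 512x512 image and 150 categories.
print(stage_shapes(512, 512))     # [(128, 128), (64, 64), (32, 32), (16, 16)]
print(mask_shape(512, 512, 150))  # (128, 128, 150)
```

Note that the predicted mask stays at 1/4 of the input resolution, the same as the first encoder stage.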

Overlapped patch merging.

  1. Just as ViT unifies an N×N×3 patch into a 1×1×C vector, the hierarchical features are shrunk from $\frac{H}{4}\times\frac{W}{4}\times C_i$ to $\frac{H}{8}\times\frac{W}{8}\times C_{i+1}$. The same step is repeated for the remaining levels of the hierarchy.
    (Note: what the ViT paper actually does is divide an H×W×3 image into N patches of size $P\times P$, flatten each $P\times P\times 3$ patch into a $1\times P^2C$ vector, and stack them into an $N\times P^2C$ sequence.
    Here $P$ is the ViT patch side length and $C = 3$ is the number of image channels.)
  2. To preserve local continuity among the patches, K = 7, S = 4, P = 3 (and K = 3, S = 2, P = 1) are set, where K is the patch size, S is the stride, and P is the padding size. Because K > S, neighboring patches overlap.
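
To see why these particular K, S, P values give exactly the 1/4 and 1/2 downsampling described above, the sizes can be worked through by hand. This is a sketch under an assumption: it treats the merging as a strided convolution and uses the standard convolution output-size formula; the function names are hypothetical and the 512-pixel input is an example value.

```python
def conv_out_size(size, kernel, stride, padding):
    """Standard output-size formula for a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def overlap(kernel, stride):
    """Number of pixels shared by two neighboring patches along one axis."""
    return kernel - stride


h = 512                          # example input height
h1 = conv_out_size(h, 7, 4, 3)   # first stage:  K=7, S=4, P=3
h2 = conv_out_size(h1, 3, 2, 1)  # later stages: K=3, S=2, P=1
h3 = conv_out_size(h2, 3, 2, 1)
h4 = conv_out_size(h3, 3, 2, 1)

print(h1, h2, h3, h4)                  # 128 64 32 16
print(overlap(7, 4), overlap(3, 2))    # 3 1
```

With K = S and no padding the patches would tile the image without sharing any pixels; here each patch shares 3 pixels (first stage) or 1 pixel (later stages) with its neighbor along each axis.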

Efficient Self-Attention.

The key sequence $K$ (of size $N\times C$) is shortened from length $N$ to $\frac{N}{R}$, where $R$ is the reduction ratio:
$\hat{K} = \mathrm{Reshape}(\frac{N}{R}, C\cdot R)(K)$ [reshape $N\times C$ into $\frac{N}{R}\times(C\cdot R)$]
$K = \mathrm{Linear}(C\cdot R, C)(\hat{K})$ [linear projection from $\frac{N}{R}\times(C\cdot R)$ to $\frac{N}{R}\times C$]
Since the $N$ queries now attend to only $\frac{N}{R}$ keys, the complexity of the self-attention mechanism decreases from $O(N^2)$ to $O(\frac{N^2}{R})$.
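
The reshape-then-project step can be illustrated on a toy sequence. This is a pure-Python sketch, not the paper's implementation: the function names, the random weights, the omitted bias, and the values N = 16, C = 4, R = 4 are all assumptions chosen for illustration.

```python
import random


def reduce_sequence(keys, ratio, weight):
    """Reshape an N x C sequence to (N/R) x (C*R), then project back to C channels."""
    n, c = len(keys), len(keys[0])
    assert n % ratio == 0, "N must be divisible by R"
    # Reshape: concatenate every R consecutive C-dim rows into one (C*R)-dim row.
    reshaped = [
        [value for row in keys[i:i + ratio] for value in row]
        for i in range(0, n, ratio)
    ]
    # Linear(C*R, C): multiply each row by a (C*R) x C weight matrix (bias omitted).
    return [
        [sum(row[k] * weight[k][j] for k in range(c * ratio)) for j in range(c)]
        for row in reshaped
    ]


def attention_score_count(n, ratio=1):
    """Number of query-key pairs: N queries times N/R keys."""
    return n * (n // ratio)


random.seed(0)
N, C, R = 16, 4, 4  # toy sizes, illustration only
keys = [[random.random() for _ in range(C)] for _ in range(N)]
weight = [[random.random() for _ in range(C)] for _ in range(C * R)]

reduced = reduce_sequence(keys, R, weight)
print(len(reduced), len(reduced[0]))                          # 4 4
print(attention_score_count(N), attention_score_count(N, R))  # 256 64
```

The number of query-key scores drops from N·N = 256 to N·(N/R) = 64, a factor of R, because only the keys are shortened while the query length stays at N.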


