16. Segmentation
16.1. Semantic Segmentation
- Unpooling
- Pooling discards spatial information within a receptive field, and that information may be critical for the precise localization that semantic segmentation requires.
- To resolve this issue, unpooling layers restore the original size of the activations and place each activation back at its original pooled location.
- Deconvolution
- The output of an unpooling layer is an enlarged, yet sparse activation map.
- The deconv layers densify the sparse activations obtained by unpooling through conv-like operations with multiple learned filters.
- Analysis of Deconv Net
- Since unpooling captures example-specific structures by tracing the original locations, it effectively reconstructs the detailed structure of an object in finer resolutions.
- Learned filters in deconv layers capture class-specific shapes. Through deconvolution, the activations closely related to the target classes are amplified while noisy activations from other regions are suppressed effectively.
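The unpooling step described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation from the paper: the function names `max_pool_with_switches` and `unpool` are hypothetical, the input is a single-channel 2D map, and the height and width are assumed to be divisible by the pooling size.

```python
def max_pool_with_switches(x, k=2):
    """k x k max pooling with stride k that also records where each maximum came from."""
    h, w = len(x), len(x[0])
    pooled, switches = [], []
    for i in range(0, h, k):
        pooled_row, switch_row = [], []
        for j in range(0, w, k):
            # Collect (value, position) pairs for the current pooling window.
            window = [(x[a][b], (a, b))
                      for a in range(i, i + k)
                      for b in range(j, j + k)]
            value, position = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            switch_row.append(position)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, out_h, out_w):
    """Place each pooled activation back at its recorded location; all other entries stay 0."""
    out = [[0.0] * out_w for _ in range(out_h)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (a, b) in zip(pooled_row, switch_row):
            out[a][b] = value
    return out


# Made-up 4 x 4 activation map for illustration.
x = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 7.0, 2.0],
    [6.0, 2.0, 3.0, 1.0],
]
pooled, switches = max_pool_with_switches(x)
restored = unpool(pooled, switches, 4, 4)
for row in restored:
    print(row)
```

The restored map has the original 4×4 size but only four non-zero entries, which is the "enlarged, yet sparse" activation map that the deconv layers are then expected to densify.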
Network Architecture
- The network consists of a contracting path (left side) and an expansive path (right side).
- Every step in the expansive path consists of
- an upsampling of the feature map followed by a 2×2 convolution ("up-convolution")
- a concatenation with the correspondingly cropped feature map from the contracting path
- two 3×3 convolutions, each followed by a ReLU
- Softmax
- Weight map
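$w(x) = w_c(x) + w_0 \cdot \exp\left(-\frac{(d_1(x) + d_2(x))^2}{2\sigma^2}\right)$, where $w_c$ is the weight map that balances class frequencies, $d_1$ is the distance to the border of the nearest cell, and $d_2$ is the distance to the border of the second nearest cell. This pixel-wise loss weight forces the network to learn the border pixels between touching cells.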
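To see how the weight map behaves, the sketch below evaluates it for a single pixel: a class-balancing term plus a Gaussian-shaped bonus that depends on the summed distances to the two nearest cells. The function name `border_weight` is hypothetical, the defaults `w0 = 10` and `sigma = 5` are the values reported in the U-Net paper, and the distances are made-up illustrative numbers.

```python
import math


def border_weight(w_c, d1, d2, w0=10.0, sigma=5.0):
    """Pixel-wise loss weight: class-balancing term plus a border-emphasis term.

    w_c: class-balancing weight at this pixel
    d1:  distance to the border of the nearest cell (pixels)
    d2:  distance to the border of the second nearest cell (pixels)
    """
    return w_c + w0 * math.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))


# A pixel squeezed between two touching cells gets a large weight.
near = border_weight(w_c=1.0, d1=1.0, d2=1.0)
# A pixel far from any second cell falls back to roughly the class-balancing weight.
far = border_weight(w_c=1.0, d1=20.0, d2=25.0)
print(near, far)
```

Because the exponential decays quickly with the summed distance, only the thin background gaps between adjacent objects receive a noticeably larger loss weight.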
16.2. Instance Segmentation
Network Architecture
- Faster R-CNN with FCN on RoIs
- RoIPool quantizes the RoI coordinates, which introduces misalignments between the RoI and the extracted features. While this may not impact classification, it has a large negative effect on predicting pixel-accurate masks.
- RoIAlign removes the harsh quantization of RoIPool and instead uses bilinear interpolation to compute feature values at exact sampling locations, so the extracted features stay properly aligned with the input RoI.
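The difference between quantized lookup and bilinear sampling can be illustrated on a toy feature map. This is a minimal sketch under stated assumptions: the helper `bilinear` is hypothetical, the feature stride of 16 is assumed, quantization is done here by truncation, and only one sampling point is shown, whereas RoIAlign samples several points per bin and aggregates them.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x) location."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    # Clamp the neighbouring indices so they stay inside the map.
    y1 = min(y0 + 1, len(feature) - 1)
    x1 = min(x0 + 1, len(feature[0]) - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature[y0][x0] + dx * feature[y0][x1]
    bottom = (1 - dx) * feature[y1][x0] + dx * feature[y1][x1]
    return (1 - dy) * top + dy * bottom


# Toy 4 x 4 feature map whose value is 10 * row + column.
feature = [[float(10 * r + c) for c in range(4)] for r in range(4)]

# Image coordinates (20, 40) mapped onto a feature map with an assumed stride of 16.
y, x = 20 / 16, 40 / 16  # (1.25, 2.5) in feature-map units

quantized = feature[int(y)][int(x)]  # RoIPool-style lookup at the truncated location (1, 2)
aligned = bilinear(feature, y, x)    # RoIAlign-style sample at the exact location
print(quantized, aligned)
```

On this linear toy map the interpolated value matches the value implied by the fractional coordinate, while the quantized lookup returns the value of a neighbouring cell, which is the kind of misalignment that hurts pixel-accurate masks.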
Segmenter is based on a fully transformer-based encoder-decoder architecture that maps a sequence of patch embeddings to pixel-level class annotations. Its approach relies on a ViT backbone and introduces a mask decoder inspired by DETR.
In DPT, contextualized tokens from specific transformer layers are reassembled into image-like feature maps using convolutions. The reassembled feature maps are then fed into fusion blocks, which combine them with further convolutions. Finally, a task-specific output head is attached to produce the final prediction.
Receptive field
- CNNs progressively increase their receptive field as features pass through consecutive layers.
- A Transformer has a global receptive field at every stage after the initial embedding.
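The gradual growth of a CNN's receptive field can be made concrete with the standard recurrence for stacked layers: each layer adds (kernel size − 1) times the product of all earlier strides. The sketch below is illustrative only; the function name `receptive_field` and the layer stack are hypothetical and not taken from any specific network.

```python
def receptive_field(layers):
    """Return the receptive field size (in input pixels) after each layer.

    layers: list of (kernel_size, stride) pairs, applied in order.
    """
    size, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    sizes = []
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
        sizes.append(size)
    return sizes


# Hypothetical stack: two 3x3 convs, a 2x2 pooling with stride 2, two more 3x3 convs.
stack = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]
sizes = receptive_field(stack)
print(sizes)
```

Even after five layers this stack only sees a small window of the input, whereas self-attention lets every token attend to every other token from the first transformer layer onward.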
Transformer Encoder
- DPT uses ViT as a backbone architecture.
- image
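- # tokens
$N_p = \frac{H}{p} \cdot \frac{W}{p}$, where $H \times W$ is the image size and $p$ is the side length of a square image patch in pixels ($p = 16$ in the paper).
- sequence of flattened 2D patches
- trainable linear projection
- patch embedding $t^0$
- D=768 (for ViT-Base)
The feature dimension $D$ is larger than the number of pixels in an input patch ($768 > 16 \times 16 = 256$), so the embedding procedure can learn to retain information if it is beneficial for the task.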
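The patch-splitting step and the token count can be checked with a small sketch. It covers only the splitting and flattening; the trainable linear projection is omitted because its weights are learned. The helper `split_into_patches` and the tiny 4×4 single-channel image are hypothetical choices for illustration.

```python
def split_into_patches(image, p):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping p x p patches."""
    h, w = len(image), len(image[0])
    if h % p or w % p:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for i in range(0, h, p):
        for j in range(0, w, p):
            patch = []
            for a in range(i, i + p):
                for b in range(j, j + p):
                    patch.extend(image[a][b])  # append the channel values of this pixel
            patches.append(patch)
    return patches


# Tiny 4 x 4 image with one channel; pixel value = 4 * row + column.
image = [[[float(4 * r + c)] for c in range(4)] for r in range(4)]
patches = split_into_patches(image, p=2)
print(len(patches), patches[0])

# Token count and pixels per patch for a 224 x 224 input with p = 16.
num_tokens_224 = (224 // 16) * (224 // 16)
pixels_per_patch = 16 * 16
print(num_tokens_224, pixels_per_patch)
```

For a 224×224 input this gives 196 patch tokens, and each patch covers 256 pixels, which is smaller than the ViT-Base feature dimension of 768.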
Convolutional Decoder
- Reassemble operation
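$\text{Reassemble}_s^{\hat{D}}(t) = (\text{Resample}_s \circ \text{Concatenate} \circ \text{Read})(t)$, where $s$ is the output size ratio of the recovered representation with respect to the input image and $\hat{D}$ is the output feature dimension.
Features from deeper layers of the transformer are assembled at lower resolution, whereas features from early layers are assembled at higher resolution.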
- Stage 1: Read
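$\mathbb{R}^{(N_p+1) \times D} \to \mathbb{R}^{N_p \times D}$ Maps the $N_p + 1$ tokens (patch tokens plus the readout token) to $N_p$ tokens. There are three variants of this mapping: ignore, add, and proj.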
- Stage 2: Concatenate
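$\mathbb{R}^{N_p \times D} \to \mathbb{R}^{\frac{H}{p} \times \frac{W}{p} \times D}$ Reshape the $N_p$ tokens into an image-like representation (recall $N_p = \frac{H}{p} \cdot \frac{W}{p}$).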
- Stage 3: Resample
- Use a 1×1 convolution to project the representation to $\hat{D}$ channels ($\hat{D} = 256$).
- Use a (strided) 3×3 convolution to implement downsampling, or a strided 3×3 transposed convolution to implement upsampling.
- Fusion block
- Combine the extracted feature maps from consecutive stages using a RefineNet-based feature fusion block.
- Progressively upsample the representation by a factor of two in each fusion stage. (e.g. 7×7 → 14×14 → 28×28 → 56×56 → 112×112)
- The final representation size has half the resolution of the input image. (e.g. 112×112 representation for 224×224 image)
- Attach a task-specific output head to produce the final prediction.
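The shape bookkeeping of the decoder can be traced with a small sketch. It is an illustration under assumptions rather than the DPT implementation: `read_ignore` and `concatenate` are hypothetical helpers, the readout token is assumed to sit at index 0, only the "ignore" variant of Read is shown, and the output ratios 32, 16, 8, 4 are chosen to match the 7×7 to 56×56 example sizes in these notes. The learned convolutions of the Resample stage and the fusion blocks are not modeled.

```python
def read_ignore(tokens):
    """'ignore' variant of Read: drop the readout token (assumed to be at index 0)."""
    return tokens[1:]


def concatenate(tokens, grid_h, grid_w):
    """Reshape Np tokens into a grid_h x grid_w image-like grid of D-dimensional features."""
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    return [tokens[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy setting: 64 x 64 image, p = 16, so a 4 x 4 grid of patch tokens with D = 2.
H = W = 64
p = 16
tokens = [[float(i), float(i)] for i in range((H // p) * (W // p) + 1)]  # Np + 1 tokens

patch_tokens = read_ignore(tokens)                  # (Np + 1) x D -> Np x D
grid = concatenate(patch_tokens, H // p, W // p)    # Np x D -> (H/p) x (W/p) x D
print(len(patch_tokens), len(grid), len(grid[0]))

# Spatial sizes after Resample for a 224 x 224 input at the assumed ratios s.
resampled_sizes = [224 // s for s in (32, 16, 8, 4)]
# One more factor-of-two upsampling in the last fusion stage gives the final size.
final_size = resampled_sizes[-1] * 2
print(resampled_sizes, final_size)
```

The last line reproduces the progression 7 → 14 → 28 → 56 → 112, which ends at half the resolution of the 224×224 input.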
📙 Lecture