16. Segmentation
16.1. Semantic Segmentation
![](https://velog.velcdn.com/images/zzwon1212/post/4044fccf-7c97-4507-9773-88549f910968/image.png)
![](https://velog.velcdn.com/images/zzwon1212/post/4bd280ae-240d-401c-aa70-166fc2188c04/image.png)
- Unpooling
- Spatial information within a receptive field is lost during pooling, and this information can be critical for the precise localization required by semantic segmentation.
- To resolve this issue, unpooling layers reconstruct the original size of the activations and place each activation back at its original pooled location.
- Deconvolution
- The output of an unpooling layer is an enlarged, yet sparse activation map.
- The deconv layers densify the sparse activations obtained by unpooling through conv-like operations with multiple learned filters.
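As a concrete illustration, below is a minimal PyTorch sketch (channel sizes are placeholders, not DeconvNet's actual configuration) of the unpooling-then-deconvolution pattern: the pooling indices saved on the way down are reused to put activations back at their pooled locations, and a transposed convolution densifies the sparse result.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
deconv = nn.ConvTranspose2d(64, 64, kernel_size=3, padding=1)  # learned filters that densify

x = torch.randn(1, 64, 32, 32)
pooled, indices = pool(x)         # 32x32 -> 16x16, remember the argmax locations
sparse = unpool(pooled, indices)  # 16x16 -> 32x32, non-zero only at the remembered locations
dense = deconv(sparse)            # conv-like operation fills in the enlarged, sparse map
print(pooled.shape, sparse.shape, dense.shape)
```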
![](https://velog.velcdn.com/images/zzwon1212/post/8a21322d-a2ca-4bd8-910d-83349e6ff80a/image.png)
- Analysis of DeconvNet
- Since unpooling captures example-specific structures by tracing the original locations, it effectively reconstructs the detailed structure of an object at finer resolutions.
- The learned filters in deconv layers capture class-specific shapes. Through deconvolution, the activations closely related to the target classes are amplified while noisy activations from other regions are suppressed effectively.
![](https://velog.velcdn.com/images/zzwon1212/post/e39bd94f-fb9b-497b-ba1a-2c8c12aab730/image.png)
- Network Architecture
- The network consists of a contracting path (left side) and an expansive path (right side).
- Every step in the expansive path consists of
- an upsampling of the feature map followed by a 2×2 convolution ("up-convolution")
- a concatenation with the correspondingly cropped feature map from the contracting path
- two 3×3 convolutions
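Below is a minimal sketch of one expansive-path step, assuming the valid (unpadded) 3×3 convolutions and channel sizes of the original U-Net figure; the module name `UpStep` and the center-crop logic are my own.

```python
import torch
import torch.nn as nn

class UpStep(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # "up-convolution"
        self.conv = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, kernel_size=3), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        # center-crop the contracting-path feature map to the upsampled size
        dh = (skip.shape[2] - x.shape[2]) // 2
        dw = (skip.shape[3] - x.shape[3]) // 2
        skip = skip[:, :, dh:dh + x.shape[2], dw:dw + x.shape[3]]
        x = torch.cat([skip, x], dim=1)  # concatenation along the channel dimension
        return self.conv(x)

step = UpStep(in_ch=1024, out_ch=512)
out = step(torch.randn(1, 1024, 28, 28), torch.randn(1, 512, 64, 64))
print(out.shape)  # (1, 512, 52, 52): the two valid 3x3 convs shrink 56 -> 52
```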
- Loss
  $E = \sum_{x \in \Omega} w(x) \log(p_{\ell(x)}(x))$
- Softmax
  $p_k(x) = \frac{\exp(a_k(x))}{\sum_{k'=1}^{K} \exp(a_{k'}(x))}$
- Weight map
  $w(x) = w_c(x) + w_0 \cdot \exp\!\left(-\frac{(d_1(x) + d_2(x))^2}{2\sigma^2}\right)$
  The pixel-wise loss weight forces the network to learn the border pixels ($d_1$ and $d_2$ denote the distances to the border of the nearest and second nearest cell, respectively).
![](https://velog.velcdn.com/images/zzwon1212/post/e11624b1-9e24-40ed-9f16-034eb2376b21/image.png)
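A minimal sketch of how this weighted pixel-wise loss could be computed in PyTorch; the shapes (K = 21 classes, 388×388 output) and the random weight map are placeholders.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 21, 388, 388)          # a_k(x): K = 21 class scores per pixel
labels = torch.randint(0, 21, (1, 388, 388))   # l(x): true label of each pixel
weight_map = torch.rand(1, 388, 388)           # w(x): precomputed, emphasizes border pixels

log_p = F.log_softmax(logits, dim=1)                          # log p_k(x)
log_p_true = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)  # log p_{l(x)}(x)
loss = -(weight_map * log_p_true).mean()                      # negated and averaged for minimization
print(loss)
```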
16.2. Instance Segmentation
![](https://velog.velcdn.com/images/zzwon1212/post/822f8b9c-f3b5-4c70-a352-aac7fdae8055/image.png)
![](https://velog.velcdn.com/images/zzwon1212/post/97af3d03-8c94-4a62-9921-ebf601326fea/image.png)
- Network Architecture
- Faster R-CNN with an FCN on RoIs
- RoIAlign
![](https://velog.velcdn.com/images/zzwon1212/post/9e75d6a4-2100-46cf-b88e-506b643430bf/image.png)
- RoIPool performs quantization, which introduces misalignments between the RoI and the extracted features. While this may not impact classification, it has a large negative effect on predicting pixel-accurate masks.
- RoIAlign removes the harsh quantization of RoIPool by using bilinear interpolation to sample the feature map at exactly computed locations, so no information is lost to quantization.
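The difference is easy to try with torchvision's built-in ops; the feature-map size, stride, and box coordinates below are made-up values.

```python
import torch
from torchvision.ops import roi_pool, roi_align

features = torch.randn(1, 256, 50, 50)                   # e.g. a stride-16 feature map of an 800x800 image
boxes = torch.tensor([[0, 103.7, 215.2, 361.9, 489.5]])  # (batch_idx, x1, y1, x2, y2) in image coordinates

pooled  = roi_pool(features, boxes, output_size=(7, 7), spatial_scale=1 / 16)
aligned = roi_align(features, boxes, output_size=(7, 7), spatial_scale=1 / 16,
                    sampling_ratio=2, aligned=True)
print(pooled.shape, aligned.shape)  # both (1, 256, 7, 7); roi_align never snaps coordinates to the integer grid
```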
![](https://velog.velcdn.com/images/zzwon1212/post/fc239b3c-76f1-473a-8a44-ebec963e6882/image.png)
Segmenter is based on a fully transformer-based encoder-decoder architecture that maps a sequence of patch embeddings to pixel-level class annotations. Its approach relies on a ViT backbone and introduces a mask decoder inspired by DETR.
![](https://velog.velcdn.com/images/zzwon1212/post/fc460691-3215-4989-bec5-bd1b874db1e0/image.png)
- Overview
  Contextualized tokens from specific transformer layers are reassembled into image-like feature maps by convolution. The reassembled feature maps are then fed into fusion blocks, which also perform convolution. Finally, a task-specific output head is attached to produce the final prediction.
- Receptive field
- CNNs progressively increase their receptive field as features pass through consecutive layers.
- Transformers have a global receptive field at every stage after the initial embedding.
- Transformer Encoder
- DPT uses ViT as its backbone architecture.
- image
  $x \in \mathbb{R}^{H \times W \times C}$
- # tokens
  $N_p = \frac{H}{p} \cdot \frac{W}{p}$, where $p$ is the resolution of an image patch ($p = 16$ in the paper)
- sequence of flattened 2D patches
  $x_p \in \mathbb{R}^{N_p \times (p^2 \cdot C)}$
- trainable linear projection
  $E \in \mathbb{R}^{(p^2 \cdot C) \times D}$
- patch embedding $t^0$
  $t^0 = \{x_{\text{class}}, x_p^1 E, x_p^2 E, \dots, x_p^{N_p} E\}$, where $t_n^0 = x_p^n E,\ t_n^0 \in \mathbb{R}^D$, so that $t^0 = \{t_0^0, \dots, t_{N_p}^0\},\ t^0 \in \mathbb{R}^{(N_p + 1) \times D}$
- $D = 768$ (for ViT-Base)
  Since the feature dimension $D$ is greater than the number of pixels in an input patch, the embedding procedure can learn to retain information if it is beneficial for the task.
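A minimal sketch of the embedding step described above (patchify, flatten, project with $E$, prepend a class token); the variable names are mine and positional embeddings are omitted.

```python
import torch
import torch.nn as nn

H = W = 224; C = 3; p = 16; D = 768
Np = (H // p) * (W // p)                        # number of patch tokens

x = torch.randn(1, C, H, W)                     # image x
patches = x.unfold(2, p, p).unfold(3, p, p)     # (1, C, H/p, W/p, p, p)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, Np, p * p * C)  # x_p: (1, Np, p^2 * C)

E = nn.Linear(p * p * C, D, bias=False)         # trainable linear projection E
cls_token = nn.Parameter(torch.zeros(1, 1, D))  # x_class
t0 = torch.cat([cls_token, E(patches)], dim=1)  # t^0
print(t0.shape)                                 # (1, Np + 1, D) = (1, 197, 768)
```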
- Convolutional Decoder
- Reassemble operation
![](https://velog.velcdn.com/images/zzwon1212/post/90edcedc-f833-4bf7-8a61-0e0076761e3e/image.png)
  $\text{Reassemble}_s^{\hat{D}}(t) = (\text{Resample}_s \circ \text{Concatenate} \circ \text{Read})(t)$
  Features from deeper layers of the transformer are assembled at lower resolution, while features from early layers are assembled at higher resolution.
- Stage 1: Read
  $\mathbb{R}^{(N_p + 1) \times D} \to \mathbb{R}^{N_p \times D}$
  There are three different variants of this mapping: ignore, add, and proj.
- Stage 2: Concatenate
  $\mathbb{R}^{N_p \times D} \to \mathbb{R}^{\frac{H}{p} \times \frac{W}{p} \times D}$
  Reshape the $N_p$ tokens into an image-like representation (recall $N_p = \frac{H}{p} \cdot \frac{W}{p}$).
- Stage 3: Resample
  $\mathbb{R}^{\frac{H}{p} \times \frac{W}{p} \times D} \to \mathbb{R}^{\frac{H}{s} \times \frac{W}{s} \times \hat{D}}$
- Use a 1×1 conv to project the representation to $\hat{D}$ ($\hat{D} = 256$).
- Use a (strided) 3×3 conv or a transposed conv to implement downsampling or upsampling.
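Putting the three stages together, a minimal sketch of Reassemble for the "ignore" Read variant (module choices and names are mine; layers are created inline purely for illustration, and only a single factor-of-2 resampling is shown):

```python
import torch
import torch.nn as nn

def reassemble(t, H, W, p=16, D=768, D_hat=256, s=8):
    # Stage 1: Read ("ignore" variant: simply drop the readout/class token)
    tokens = t[:, 1:, :]                                # (B, Np, D)
    # Stage 2: Concatenate - reshape tokens into an image-like representation
    h, w = H // p, W // p
    fmap = tokens.transpose(1, 2).reshape(-1, D, h, w)  # (B, D, H/p, W/p)
    # Stage 3: Resample - project to D_hat, then resample to H/s x W/s
    project = nn.Conv2d(D, D_hat, kernel_size=1)
    if s < p:   # e.g. s = 8: upsample by 2 with a transposed conv
        resample = nn.ConvTranspose2d(D_hat, D_hat, kernel_size=2, stride=2)
    else:       # e.g. s = 32: downsample by 2 with a strided conv
        resample = nn.Conv2d(D_hat, D_hat, kernel_size=3, stride=2, padding=1)
    return resample(project(fmap))

t = torch.randn(1, 197, 768)               # tokens of a 224x224 image with p = 16
print(reassemble(t, 224, 224, s=8).shape)  # (1, 256, 28, 28) = D_hat x H/8 x W/8
```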
- Fusion block
![](https://velog.velcdn.com/images/zzwon1212/post/6dde10ec-d2d9-4e14-85c7-fe555f638e4a/image.png)
- Combine the extracted feature maps from consecutive stages using a RefineNet-based feature fusion block.
- Progressively upsample the representation by a factor of two in each fusion stage. (e.g. 7×7 → 14×14 → 28×28 → 56×56 → 112×112)
- The final representation size has half the resolution of the input image. (e.g. 112×112 representation for 224×224 image)
- Attach a task-specific output head to produce the final prediction.
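A minimal sketch of a simplified RefineNet-style fusion step, assuming a single residual convolutional unit per block (the actual block in the paper is more elaborate): fuse the current reassembled map with the previous fusion output, refine it, and upsample by a factor of two.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.rcu = nn.Sequential(  # residual convolutional unit (simplified)
            nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
            nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x, prev=None):
        if prev is not None:
            x = x + prev          # combine with the coarser stage's output
        x = x + self.rcu(x)       # residual refinement
        return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=True)

fuse = FusionBlock()
deep  = torch.randn(1, 256, 7, 7)    # deepest reassembled feature map
finer = torch.randn(1, 256, 14, 14)  # next-stage reassembled feature map
out = fuse(finer, prev=fuse(deep))   # 7x7 -> 14x14, fused, then -> 28x28
print(out.shape)
```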
📙 Lecture