Segmentation - Machine Learning for Visual Understanding 7

zzwon1212 · July 22, 2024

16. Segmentation

16.1. Semantic Segmentation

16.1.1. Deconvolution Network

  • Unpooling
    • Spatial information within a receptive field is lost during pooling, which may be critical for precise localization that is required for semantic segmentation.
    • To resolve this issue, unpooling layers reconstruct the original size of the activations and place each activation back at its original pooled location.
  • Deconvolution
    • The output of an unpooling layer is an enlarged, yet sparse activation map.
    • The deconv layers densify the sparse activations obtained by unpooling through conv-like operations with multiple learned filters (see the sketch after this list).

  • Analysis of Deconv Net
    • Since unpooling captures example-specific structures by tracing the original locations, it effectively reconstructs the detailed structure of an object in finer resolutions.
    • Learned filters in deconv layers capture class-specific shapes. Through deconv, the activations closely related to the target classes are amplified, while noisy activations from other regions are effectively suppressed.
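
A minimal PyTorch sketch of the unpooling + deconvolution pair described above. The input shape, channel count, and kernel sizes are illustrative assumptions, not the exact DeconvNet configuration.

```python
import torch
import torch.nn as nn

# Pooling that remembers where each maximum came from ("switch" variables).
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
# Unpooling places each activation back at its original pooled location.
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
# A deconv (transposed conv) layer densifies the sparse unpooled map
# with learned filters.
deconv = nn.ConvTranspose2d(64, 64, kernel_size=3, padding=1)

x = torch.randn(1, 64, 32, 32)
pooled, indices = pool(x)          # (1, 64, 16, 16), argmax locations kept
sparse = unpool(pooled, indices)   # (1, 64, 32, 32), nonzero only at max locations
dense = deconv(sparse)             # (1, 64, 32, 32), densified activation map
print(sparse.shape, dense.shape)
```

The saved pooling indices are what make unpooling example-specific, which is why it can recover fine object structure that a fixed upsampling could not.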

16.1.2. U-Net

  • Network Architecture

    • The network consists of a contracting path (left side) and an expansive path (right side).
    • Every step in the expansive path consists of
      • an upsampling of the feature map followed by a $2 \times 2$ convolution ("up-convolution")
      • a concatenation with the correspondingly cropped feature map from the contracting path
      • two $3 \times 3$ convolutions
  • Loss

    $E = \sum_{x \in \Omega} w(\text{x}) \log (p_{l(\text{x})} (\text{x}))$
    • Softmax
      $p_k(\text{x}) = \exp (a_k(\text{x})) / \left( \sum_{k'=1}^K \exp (a_{k'}(\text{x})) \right)$
    • Weight map
      $w(\text{x}) = w_c(\text{x}) + w_0 \cdot \exp \left( - \frac{(d_1(\text{x}) + d_2(\text{x}))^2}{2 \sigma^2} \right)$
      The pixel-wise loss weight forces the network to learn the border pixels (see the sketch below).
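
A minimal sketch of the weighted pixel-wise loss above, assuming the class-balancing map $w_c$ and the distances $d_1$, $d_2$ to the two nearest borders are precomputed per pixel. In practice the objective minimized is the weighted negative log-likelihood (softmax + cross-entropy), i.e. the energy $E$ up to sign; $w_0 = 10$ and $\sigma \approx 5$ follow the paper.

```python
import torch
import torch.nn.functional as F

def unet_weighted_loss(logits, labels, w_c, d1, d2, w_0=10.0, sigma=5.0):
    # logits: (N, K, H, W) raw scores a_k(x); labels: (N, H, W) true class l(x)
    # w_c, d1, d2: (N, H, W) precomputed per-pixel maps
    w = w_c + w_0 * torch.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))  # weight map w(x)
    nll = F.cross_entropy(logits, labels, reduction="none")          # -log p_{l(x)}(x)
    return (w * nll).sum()

# Toy usage with random tensors.
logits = torch.randn(2, 3, 64, 64)
labels = torch.randint(0, 3, (2, 64, 64))
ones = torch.ones(2, 64, 64)
print(unet_weighted_loss(logits, labels, w_c=ones, d1=5 * ones, d2=8 * ones))
```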

16.2. Instance Segmentation

16.2.1. Mask R-CNN

  • Network Architecture

  • RoIAlign

    • RoIPool performs quantization, which introduces misalignments between the RoI and the extracted features. While this may not impact classification, it has a large negative effect on predicting pixel-accurate masks.
    • RoIAlign removes the harsh quantization of RoIPool by using bilinear interpolation to compute feature values at exact sampled locations, so the extracted features are properly aligned with the RoI (see the example below).
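
A quick illustration of the difference using torchvision's built-in roi_pool and roi_align ops (not the full Mask R-CNN head); the feature map and RoI are toy values chosen so the box does not fall on the integer grid.

```python
import torch
from torchvision.ops import roi_align, roi_pool

features = torch.arange(16 * 16, dtype=torch.float32).reshape(1, 1, 16, 16)
# One RoI in (batch_idx, x1, y1, x2, y2) format, deliberately off-grid.
rois = torch.tensor([[0.0, 2.3, 2.3, 9.7, 9.7]])

pooled = roi_pool(features, rois, output_size=(2, 2), spatial_scale=1.0)
aligned = roi_align(features, rois, output_size=(2, 2), spatial_scale=1.0,
                    sampling_ratio=2, aligned=True)
print(pooled)   # values taken after rounding the RoI to integer bins
print(aligned)  # values bilinearly interpolated at exact sub-pixel locations
```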

16.3. Segmentation with Transformers

16.3.1. Segmenter

Segmenter is based on a fully transformer-based encoder-decoder architecture mapping a sequence of patch embeddings to pixel-level class annotations. Its approach relies on a ViT backbone and introduces a mask decoder inspired by DETR.

  • Architecture

    • Encoder

    • Decoder

      • The sequence of patch encodings $\text{z}_\text{L} \in \mathbb{R}^{N \times D}$ is decoded to a segmentation map $\text{S} \in \mathbb{R}^{H \times W \times K}$, where $K$ is the number of classes.

      • Flow

        • patch-level encodings coming from the encoder
          ↓ (are mapped by a learned decoder)
        • patch-level class scores
          ↓ (are upsampled by bilinear interpolation)
        • pixel-level scores
      • Mask Transformer

        • $L_2$-normalized patch embeddings output by the decoder
          $\text{z}'_\text{M} \in \mathbb{R}^{N \times D}$
        • Class embeddings output by the decoder
          $\text{c} \in \mathbb{R}^{K \times D}$
          The class embeddings are initialized randomly and learned by the decoder.
        • The set of class masks
          $\text{Masks}(\text{z}'_{\text{M}}, \text{c}) = \text{z}'_{\text{M}} \text{c}^T \in \mathbb{R}^{N \times K}$
        • Reshaped 2D mask
          $\text{s}_{\text{mask}} \in \mathbb{R}^{H/P \times W/P \times K}$
        • Bilinearly upsampled feature map
          $\text{s} \in \mathbb{R}^{H \times W \times K}$
        • The final segmentation map is obtained by applying a softmax over the class dimension (see the sketch after this list).
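
A minimal sketch of the mask computation above, with random tensors standing in for the decoder outputs $\text{z}'_\text{M}$ and $\text{c}$ (the mask transformer's joint attention layers are omitted); the image size, patch size, and class count are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

H = W = 224                    # input image size (assumed)
P = 16                         # patch size (assumed)
K, D = 21, 768                 # number of classes, embedding dim (assumed)
N = (H // P) * (W // P)        # number of patch tokens

z_M = F.normalize(torch.randn(N, D), dim=-1)  # L2-normalized patch embeddings
c = torch.randn(K, D)                         # learned class embeddings

masks = z_M @ c.T                             # Masks(z'_M, c), shape (N, K)
s_mask = masks.reshape(H // P, W // P, K)     # reshaped 2D mask at patch resolution
s = F.interpolate(s_mask.permute(2, 0, 1).unsqueeze(0),   # bilinear upsampling
                  size=(H, W), mode="bilinear", align_corners=False)
seg_map = s.softmax(dim=1)                    # (1, K, H, W) per-pixel class scores
print(seg_map.shape)
```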

16.3.2. DPT (Dense Prediction Transformer)

  • Overview
    Contextualized tokens from specific transformer layers are reassembled into image-like feature maps by convolution. The reassembled feature maps are then fed into a fusion block, which also performs convolution. Finally, a task-specific output head is attached to produce the final prediction.

  • Receptive field

    • CNNs
      progressively increase their receptive field as features pass through consecutive layers.
    • Transformers
      have a global receptive field at every stage after the initial embedding.
  • Transformer Encoder

    • DPT uses ViT as its backbone architecture.
    • image
      $\mathrm{x} \in \mathbb{R}^{H \times W \times C}$
    • # tokens
      $N_p = \frac{H}{p} \cdot \frac{W}{p}$
      $p$ is the resolution of each image patch ($p = 16$ in the paper).
    • sequence of flattened 2D patches
      $\mathrm{x}_p \in \mathbb{R}^{N_p \times (p^2 \cdot C)}$
    • trainable linear projection
      $\mathbf{E} \in \mathbb{R}^{(p^2 \cdot C) \times D}$
    • patch embedding $t^0$
      $\{\mathrm{x_{class}}, \mathrm{x}_p^1 \mathbf{E}, \mathrm{x}_p^2 \mathbf{E}, ..., \mathrm{x}_p^{N_p} \mathbf{E}\}$
      $t_n^0 = \mathrm{x}_p^n \mathbf{E}, \quad t_n^0 \in \mathbb{R}^D$
      $t^0 = \{t_0^0, ..., t_{N_p}^0\}, \quad t^0 \in \mathbb{R}^{(N_p + 1) \times D}$
    • $D = 768$ (for ViT-Base)
      The feature dimension is at least as large as the number of pixels in an input patch ($p^2 \cdot C = 16 \times 16 \times 3 = 768$), so the embedding procedure can learn to retain information if it is beneficial for the task.
  • Convolutional Decoder

    • Reassemble operation
      $\text{Reassemble}_s^{\hat{D}}(t) = (\text{Resample}_s \circ \text{Concatenate} \circ \text{Read})(t)$
      Features from deeper layers of the transformer are assembled at lower resolution.
      Features from early layers are assembled at higher resolution.
      • Stage 1: Read
        $\mathbb{R}^{(N_p + 1) \times D} \rightarrow \mathbb{R}^{N_p \times D}$
        There are three different variants of this mapping: ignore, add, and proj.
      • Stage 2: Concatenate
        $\mathbb{R}^{N_p \times D} \rightarrow \mathbb{R}^{\frac{H}{p} \times \frac{W}{p} \times D}$
        Reshape the $N_p$ tokens into an image-like representation (recall $N_p = \frac{H}{p} \cdot \frac{W}{p}$).
      • Stage 3: Resample
        $\mathbb{R}^{\frac{H}{p} \times \frac{W}{p} \times D} \rightarrow \mathbb{R}^{\frac{H}{s} \times \frac{W}{s} \times \hat{D}}$
        • Use a $1 \times 1$ conv to project the representation to $\hat{D}$ ($\hat{D} = 256$).
        • Use a (strided) $3 \times 3$ conv or a transpose conv to implement downsampling or upsampling, respectively (see the sketch below).
    • Fusion block
      • Combine the extracted feature maps from consecutive stages using a RefineNet-based feature fusion block.
      • Progressively upsample the representation by a factor of two in each fusion stage (e.g. $7 \times 7$ → $14 \times 14$ → $28 \times 28$ → $56 \times 56$ → $112 \times 112$).
      • The final representation has half the resolution of the input image (e.g. a $112 \times 112$ representation for a $224 \times 224$ image).
      • Attach a task-specific output head to produce the final prediction.
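
A minimal sketch of the Reassemble operation for a single transformer stage, using the ignore readout variant (the class token is simply dropped). The stage resolution $s = 4$ and the transpose-conv kernel are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

H = W = 224
p = 16
N_p = (H // p) * (W // p)       # number of patch tokens
D, D_hat = 768, 256

t = torch.randn(1, N_p + 1, D)  # tokens (incl. class token) from one layer

# Stage 1: Read ("ignore" variant) -> (N_p, D)
tokens = t[:, 1:, :]

# Stage 2: Concatenate -> image-like representation (D, H/p, W/p)
feat = tokens.transpose(1, 2).reshape(1, D, H // p, W // p)

# Stage 3: Resample -> (D_hat, H/s, W/s), here with s = 4 via a transpose conv
project = nn.Conv2d(D, D_hat, kernel_size=1)
upsample = nn.ConvTranspose2d(D_hat, D_hat, kernel_size=4, stride=4)
out = upsample(project(feat))
print(out.shape)                # (1, 256, 56, 56), i.e. H/4 x W/4
```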

📙 Lecture
