[WIP] Masked-attention Mask Transformer for Universal Image Segmentation

Estelle Yoon · March 18, 2025


Date: 2022
Venue: CVPR

1. Introduction

Background

Universal architectures show SOTA performance on semantic/panoptic segmentation and are flexible, but recent research still focuses on advancing specialized architectures.

Problem

Why don't universal architectures replace specialized ones?

→ Mask2Former: backbone feature extractor - pixel decoder - transformer decoder

2. Related Work

Specialized semantic segmentation architectures

Typically formulated as per-pixel classification

FCN-based: predict independently per pixel

Follow-ups find context per pixel, focusing on context modules / self-attention variants

Specialized instance segmentation architectures

Typically predict a set of binary masks, each associated with a class

Mask R-CNN generates masks from bounding boxes

Follow-ups focus on more precise bounding boxes / new ways to generate a dynamic number of masks

Lack the flexibility to generalize

Panoptic segmentation

Proposed to unify semantic/instance segmentation

Universal architectures

Emerged w/ DETR

Showed that mask classification architectures w/ E2E set prediction are general enough for any image segmentation task

3. Masked-attention Mask Transformer

3.1 Mask classification preliminaries

Mask classification architectures group pixels into N segments by predicting N binary masks (with corresponding category labels), and this formulation is general

Difficult to find good representations for each segment

→ each segment can be represented as a C-dimensional feature vector ("object query"), which can be processed by a Transformer decoder

Architecture components
1. backbone - extracts low-resolution features
2. pixel decoder - gradually upsamples to generate high-resolution per-pixel embeddings
3. transformer decoder - processes object queries, from which the binary mask predictions are decoded
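
A minimal PyTorch sketch of how the three components fit together (module interfaces and head names here are hypothetical stand-ins, not the paper's implementation):

```python
import torch
import torch.nn as nn

class MaskClassificationModel(nn.Module):
    """Skeleton of the three-component design; backbone, pixel_decoder and
    transformer_decoder are assumed to be provided (hypothetical interfaces)."""
    def __init__(self, backbone, pixel_decoder, transformer_decoder,
                 num_queries=100, hidden_dim=256, num_classes=133):
        super().__init__()
        self.backbone = backbone                        # 1. low-resolution features
        self.pixel_decoder = pixel_decoder              # 2. per-pixel embeddings
        self.transformer_decoder = transformer_decoder  # 3. processes object queries
        self.query_feat = nn.Embedding(num_queries, hidden_dim)   # N object queries
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no object"
        self.mask_embed = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, images):
        feats = self.backbone(images)
        # per_pixel_emb: (B, C, H, W); ms_feats: multi-scale maps for the decoder
        per_pixel_emb, ms_feats = self.pixel_decoder(feats)
        queries = self.query_feat.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        queries = self.transformer_decoder(queries, ms_feats)     # (B, N, C)
        class_logits = self.class_head(queries)                   # (B, N, K+1)
        # Each query's binary mask = dot product with every per-pixel embedding.
        mask_logits = torch.einsum("bnc,bchw->bnhw",
                                   self.mask_embed(queries), per_pixel_emb)
        return class_logits, mask_logits
```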

3.2 Transformer decoder w/ masked attention

Key components of the proposed Transformer decoder

Extract localized features by constraining cross-attention to the foreground region of the predicted mask for each query

For small objects, propose an efficient multi-scale strategy that makes use of high-resolution features

3.2.1 Masked attention

Context features are important for image segmentation, but global context causes slow convergence, since cross-attention needs many epochs to learn to attend to local object regions

Hypotheses
1. local features are enough to update query features
2. context information can be gathered through self-attention

Solution
cross-attention attends only within the foreground region of the predicted mask for each query

Masked attention matrix
$$X_l = \text{softmax}(\mathcal{M}_{l-1} + Q_l K_l^T)\,V_l + X_{l-1}$$

$$\mathcal{M}_{l-1}(x,\ y) = \begin{cases} 0 & \text{if } M_{l-1}(x,\ y) = 1 \\ -\infty & \text{otherwise} \end{cases}$$

$M_{l-1}$ is the binarized mask prediction of the previous Transformer decoder layer, obtained from $X_{l-1}$ and resized to the same resolution as $K_l$
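
A minimal single-head sketch of one masked-attention step following the equation above (the $f_Q$ projection is omitted, and the empty-mask fallback to full attention is an implementation detail not in the equation):

```python
import torch

def masked_attention_step(X_prev, K, V, mask_logits):
    """One masked cross-attention step (single head, batch dim omitted).
    X_prev: (N, C) query features X_{l-1} (the f_Q projection is omitted);
    K, V: (HW, C) image features K_l, V_l;
    mask_logits: (N, HW) mask prediction from layer l-1, already resized
    to K_l's resolution."""
    keep = mask_logits.sigmoid() >= 0.5              # binarize M_{l-1} at 0.5
    # Fall back to full attention for queries whose mask is empty; otherwise
    # softmax over a row that is entirely -inf would produce NaNs.
    keep |= ~keep.any(dim=-1, keepdim=True)
    M = torch.zeros_like(mask_logits).masked_fill(~keep, float("-inf"))
    attn = (X_prev @ K.T + M).softmax(dim=-1)        # softmax(M_{l-1} + Q_l K_l^T)
    return attn @ V + X_prev                         # ... V_l + X_{l-1}
```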

3.2.2 High resolution features

Problem
High-resolution features are good for small objects, but come at a high computation cost

Solution
Instead of always using the high-resolution feature map, use multi-scale features to control the increase in computation:
feed one feature scale at a time, from low to high resolution, to successive Transformer decoder layers in a round-robin fashion
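
A sketch of this round-robin schedule, assuming a 3-level feature pyramid (the res5/res4/res3 names and the layer interface are hypothetical):

```python
from typing import Dict, List, Tuple
import torch

def run_decoder_round_robin(layers: List, queries: torch.Tensor,
                            ms_feats: Dict[str, torch.Tensor],
                            mask_logits: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    """Feed one feature scale per Transformer decoder layer, cycling from low
    to high resolution, so the expensive high-resolution map is attended to
    in only a third of the layers instead of all of them."""
    scales = ["res5", "res4", "res3"]   # 1/32, 1/16, 1/8 of input resolution
    for i, layer in enumerate(layers):  # e.g. 9 layers = 3 rounds over 3 scales
        feats = ms_feats[scales[i % len(scales)]]
        queries, mask_logits = layer(queries, feats, mask_logits)
    return queries, mask_logits
```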

3.2.3 Optimization improvements

1. switch the order of self- and cross-attention

the query features fed to the first self-attention layer are image-independent and carry no signal from the image, so self-attention there cannot enrich them

2. make the query features learnable, and supervise them before they are used in the Transformer decoder

these learnable query features function like a region proposal network and are able to generate mask proposals

3. remove dropout

dropout is not necessary and decreases performance
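
A sketch of one decoder layer with all three changes applied (masked cross-attention approximated here with nn.MultiheadAttention plus a boolean attention mask; the interface is hypothetical):

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One Transformer decoder layer with the three tweaks applied."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # dropout=0.0 everywhere: tweak 3 (remove dropout).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, dropout=0.0,
                                                batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, dropout=0.0,
                                               batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, queries, feats, attn_mask):
        # Tweak 1: masked CROSS-attention comes first, so the image-independent
        # learnable queries (tweak 2, created via nn.Embedding outside this
        # layer) pick up image signal before self-attention runs.
        # attn_mask: (B*num_heads, N, HW) boolean, True = position not attended.
        q = self.norm1(queries + self.cross_attn(queries, feats, feats,
                                                 attn_mask=attn_mask)[0])
        # Self-attention then gathers context between queries.
        q = self.norm2(q + self.self_attn(q, q, q)[0])
        return self.norm3(q + self.ffn(q))
```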

3.3 Improving training efficiency

Problem
Large memory consumption during training

Solution
Motivated by PointRend/Implicit PointRend, which show that a segmentation model can be trained with a mask loss calculated on $K$ randomly sampled points

Use sampled points to calculate the mask loss in both the matching loss and the final loss

For the matching loss, uniformly sample the same set of $K$ points for all prediction/ground-truth pairs

For the final loss, importance-sample a different set of $K$ points for each pair of prediction and ground truth
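
A minimal sketch of the point-sampled loss, with uniform sampling standing in for the paper's importance sampling (K = 12544 matches the paper's setting):

```python
import torch
import torch.nn.functional as F

def point_sampled_mask_loss(pred_masks, gt_masks, num_points=12544):
    """pred_masks: (N, H, W) logits; gt_masks: (N, H, W) binary masks.
    Computes BCE on K sampled points per pair instead of on the full masks,
    cutting the memory used by the loss roughly by a factor of (H*W)/K."""
    N = pred_masks.shape[0]
    # Random normalized coordinates in [-1, 1] for grid_sample; a different
    # set per pair, as in the final loss (uniform here, not importance).
    coords = torch.rand(N, num_points, 1, 2, device=pred_masks.device) * 2 - 1
    pred_pts = F.grid_sample(pred_masks.unsqueeze(1), coords,
                             align_corners=False).flatten(1)   # (N, K)
    gt_pts = F.grid_sample(gt_masks.unsqueeze(1).float(), coords,
                           align_corners=False).flatten(1)
    return F.binary_cross_entropy_with_logits(pred_pts, gt_pts)
```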

4. Experiments

Datasets

  • COCO (80 things, 53 stuff)
  • ADE20K (100 things, 50 stuff)
  • Cityscapes (8 things, 11 stuff)
  • Mapillary Vistas (37 things, 28 stuff)

Limitations
A model trained only on panoptic annotations performs slightly worse than the exact same model trained with the corresponding annotations for instance and semantic segmentation, which means Mask2Former still needs to be trained for each specific task
