Date: 2022
Venue: CVPR (conference)
Universal architectures show SOTA performance on semantic/panoptic segmentation and are flexible, yet recent research still focuses on advancing specialized architectures.
Why don't universal architectures replace specialized ones?
→ Mask2Former: backbone feature extractor - pixel decoder - Transformer decoder
Specialized semantic segmentation: typically per-pixel classification
FCN-based, classifying each pixel independently
Follow-ups find context for each pixel, focusing on context modules / self-attention variants
Specialized instance segmentation: typically predicts a set of binary masks, each w/ a class label
Mask R-CNN generates masks from bounding boxes
Follow-ups focus on more precise bounding boxes / new ways to generate a dynamic # of masks
These specialized architectures lack the flexibility to generalize across tasks
Mask classification was proposed to unify semantic/panoptic segmentation
Emerged w/ DETR
DETR showed mask classification architectures w/ E2E set prediction are general enough for any image segmentation task
Mask classification architectures group pixels into N segments via N binary masks (w/ corresponding category labels), hence their generality
Difficult to find good representations for each segment
→ each segment can be represented as a C-dimensional feature vector ("object query") and processed by a Transformer decoder (see the sketch below)
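A minimal sketch of the mask-classification formulation, assuming illustrative shapes (N queries, K classes; names like `class_logits` are mine, not the paper's): each query yields a class distribution plus a binary mask, and a semantic map can be recovered by marginalizing over queries.

```python
import torch

N, K, H, W = 100, 133, 64, 64                   # illustrative sizes
class_logits = torch.randn(N, K + 1)            # per-query class logits (+1 "no object")
mask_logits = torch.randn(N, H, W)              # per-query binary mask logits

# Semantic inference: marginalize the N query masks over their class probabilities.
class_probs = class_logits.softmax(-1)[:, :-1]  # drop the "no object" class
mask_probs = mask_logits.sigmoid()
semantic = torch.einsum("nk,nhw->khw", class_probs, mask_probs)
pred = semantic.argmax(0)                       # (H, W) per-pixel class labels
```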
Architecture components
1. backbone - extracts low-resolution image features
2. pixel decoder - gradually upsamples features to generate high-resolution per-pixel embeddings
3. transformer decoder - processes object queries, from which binary mask predictions are decoded (see the sketch below)
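A skeleton of the three components above, with placeholder modules standing in for the real backbone/pixel decoder (a shape-level sketch, not Mask2Former's actual layers):

```python
import torch
import torch.nn as nn

class MetaArchSketch(nn.Module):
    def __init__(self, num_queries=100, dim=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, 16, stride=16)         # 1. stand-in feature extractor
        self.pixel_decoder = nn.ConvTranspose2d(dim, dim, 4, 4)  # 2. stand-in upsampler
        layer = nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.transformer_decoder = nn.TransformerDecoder(layer, num_layers=3)  # 3.
        self.queries = nn.Embedding(num_queries, dim)            # learnable object queries

    def forward(self, img):
        b = img.shape[0]
        feat = self.backbone(img)                    # low-resolution features
        pix = self.pixel_decoder(feat)               # high-resolution per-pixel embeddings
        mem = feat.flatten(2).transpose(1, 2)        # (B, HW, C) for attention
        q = self.transformer_decoder(self.queries.weight.expand(b, -1, -1), mem)
        return torch.einsum("bqc,bchw->bqhw", q, pix)  # decode binary masks per query

masks = MetaArchSketch()(torch.randn(1, 3, 256, 256))  # -> (1, 100, 64, 64)
```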
Key components of the proposed Transformer decoder
Masked attention: extract localized features by constraining cross-attention to the foreground region of each query's predicted mask
For small objects: an efficient multi-scale strategy to exploit high-resolution features
Context features are important for image segmentation, but global context causes slow convergence: cross-attention needs many epochs to learn to attend to localized object regions
Hypotheses
1. local features are enough to update query features
2. context information can be gathered through self-attention
Solution
cross-attention attends only within the foreground region of the predicted mask for each query
Masked attention: $\mathbf{X}_l = \operatorname{softmax}(\mathcal{M}_{l-1} + \mathbf{Q}_l\mathbf{K}_l^{\top})\mathbf{V}_l + \mathbf{X}_{l-1}$
Attention mask: $\mathcal{M}_{l-1}(x,y) = 0$ if $\mathbf{M}_{l-1}(x,y) = 1$, else $-\infty$
$\mathbf{M}_{l-1}$ is the binarized (threshold 0.5) mask prediction of the previous Transformer decoder layer, resized to the same resolution as $\mathbf{K}_l$
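A single-head PyTorch sketch of this masked attention (projections, scaling, and multi-head details omitted; the empty-mask fallback, where a query with an all-background mask attends everywhere, is an implementation assumption to avoid NaNs):

```python
import torch

def masked_attention(Q, K, V, mask_prev):
    # Q: (N, C) queries; K, V: (HW, C) image features; mask_prev: (N, HW)
    # mask logits from the previous decoder layer, resized to K's resolution.
    fg = mask_prev.sigmoid() > 0.5                      # binarize at 0.5
    fg = fg | ~fg.any(dim=-1, keepdim=True)             # empty mask -> attend everywhere
    attn_mask = torch.where(fg, torch.zeros_like(mask_prev),
                            torch.full_like(mask_prev, float("-inf")))
    attn = torch.softmax(Q @ K.T + attn_mask, dim=-1)   # softmax(M_{l-1} + Q K^T)
    return attn @ V                                     # X_l (residual path omitted)
```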
Problem
High-resolution features are good for small objects, but come at high computation cost
Solution
Do not always use the high-resolution feature map; use multi-scale features to control the increase in computation
Feed one resolution of the multi-scale features to one Transformer decoder layer at a time, cycling from low to high resolution (see the sketch below)
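A sketch of this round-robin feeding under assumed inputs (`features` and `decoder_layers` are hypothetical names):

```python
def decode(queries, features, decoder_layers):
    # features: 3 maps at strides 32, 16, 8 (low -> high resolution);
    # decoder_layers: 3*L layers; layer i consumes scale i % 3, so each
    # round visits every resolution once instead of always paying for
    # the highest-resolution map.
    for i, layer in enumerate(decoder_layers):
        queries = layer(queries, features[i % len(features)])
    return queries
```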
Query features fed to the first self-attention layer are image-independent and carry no signal from the image, so applying self-attention first does not enrich them → swap the order: masked cross-attention first, then self-attention (see the sketch below)
Learnable query features (supervised before being used in the decoder) function like a region proposal network and can generate mask proposals
Dropout is not necessary and decreases performance → remove it
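A sketch of the resulting decoder layer ordering (masked cross-attention first, then self-attention, then FFN, with dropout removed); the module wiring is illustrative, not the paper's exact implementation:

```python
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, dropout=0.0, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, dropout=0.0, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, q, feat, attn_mask=None):
        # 1) masked cross-attention first: queries pick up image signal
        q = self.n1(q + self.cross_attn(q, feat, feat, attn_mask=attn_mask)[0])
        # 2) self-attention among queries gathers context
        q = self.n2(q + self.self_attn(q, q, q)[0])
        # 3) feed-forward network; no dropout anywhere
        return self.n3(q + self.ffn(q))
```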
Problem
High memory consumption during training
Solution
Motivated by PointRend / Implicit PointRend, which show a segmentation model can be trained w/ a mask loss calculated on randomly sampled points instead of the whole mask
Use sampled points to calculate the mask loss in both the matching loss and the final loss
For the matching loss: uniformly sample the same set of points for all prediction/ground-truth pairs
For the final loss: importance-sample different points for each pair of prediction and ground truth (see the sketch below)
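A sketch of the uniform variant used for the matching loss; `sample_points` is a hypothetical helper built on `F.grid_sample`, mirroring PointRend's point sampling, and the shapes are illustrative (the paper samples K = 12544 points):

```python
import torch
import torch.nn.functional as F

def sample_points(mask, points):
    # mask: (B, 1, H, W); points: (B, P, 2) as (x, y) in [0, 1] -> (B, P) values
    grid = 2.0 * points - 1.0                      # grid_sample expects [-1, 1]
    out = F.grid_sample(mask, grid.unsqueeze(2), align_corners=False)
    return out.squeeze(3).squeeze(1)

B, P, H, W = 2, 12544, 128, 128
pred = torch.randn(B, 1, H, W)                     # predicted mask logits
gt = torch.randint(0, 2, (B, 1, H, W)).float()     # binary ground-truth masks
pts = torch.rand(1, P, 2).expand(B, -1, -1)        # the SAME uniform points for all pairs
loss = F.binary_cross_entropy_with_logits(sample_points(pred, pts),
                                           sample_points(gt, pts))
```

For the final loss, the points would instead be importance-sampled per prediction/ground-truth pair, biased toward uncertain regions.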
Datasets: COCO, ADE20K, Cityscapes, Mapillary Vistas
Limitations
Trained on panoptic annotations, the model is slightly worse at instance/semantic segmentation than the exact same model trained w/ the corresponding task-specific annotations, which means it still needs to be trained for each specific task