Date: 2022
Venue: CVPR (conference)
Universal architectures show SOTA performance on semantic/panoptic segmentation and are flexible, yet recent research still focuses on advancing specialized architectures.
Why don't universal architectures replace specialized ones?
→ Mask2Former: backbone feature extractor - pixel decoder - Transformer decoder
Specialized semantic segmentation: typically per-pixel classification
FCN-based, classifying each pixel independently
Follow-ups find context for each pixel, focusing on context modules / self-attention variants
Specialized instance segmentation: typically predicts a set of binary masks, each w/ a class label
Mask R-CNN generates masks from bounding boxes
Follow-ups focus on more precise bounding boxes / new ways to generate a dynamic # of masks
These specialized architectures lack the flexibility to generalize across tasks
Mask classification was proposed to unify semantic/panoptic segmentation
Emerged w/ DETR
DETR showed mask classification architectures w/ E2E set prediction are general enough for any image segmentation task
Mask classification architectures group pixels into N segments via N binary masks (w/ corresponding category labels), hence their generality
Difficult to find good representations for each segment
→ each segment can be represented as a C-dimensional feature vector ("object query") and processed by a Transformer decoder (see the sketch below)
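A minimal sketch of the mask-classification formulation, assuming illustrative shapes (N queries, K classes; names like `class_logits` are mine, not the paper's): each query yields a class distribution plus a binary mask, and a semantic map can be recovered by marginalizing over queries.

```python
import torch

N, K, H, W = 100, 133, 64, 64                   # illustrative sizes
class_logits = torch.randn(N, K + 1)            # per-query class logits (+1 "no object")
mask_logits = torch.randn(N, H, W)              # per-query binary mask logits

# Semantic inference: marginalize the N query masks over their class probabilities.
class_probs = class_logits.softmax(-1)[:, :-1]  # drop the "no object" class
mask_probs = mask_logits.sigmoid()
semantic = torch.einsum("nk,nhw->khw", class_probs, mask_probs)
pred = semantic.argmax(0)                       # (H, W) per-pixel class labels
```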
Architecture components
1. backbone - extracts low-resolution image features
2. pixel decoder - gradually upsamples features to generate high-resolution per-pixel embeddings
3. transformer decoder - processes object queries, from which binary mask predictions are decoded (see the sketch below)
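A skeleton of the three components above, with placeholder modules standing in for the real backbone/pixel decoder (a shape-level sketch, not Mask2Former's actual layers):

```python
import torch
import torch.nn as nn

class MetaArchSketch(nn.Module):
    def __init__(self, num_queries=100, dim=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, 16, stride=16)         # 1. stand-in feature extractor
        self.pixel_decoder = nn.ConvTranspose2d(dim, dim, 4, 4)  # 2. stand-in upsampler
        layer = nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.transformer_decoder = nn.TransformerDecoder(layer, num_layers=3)  # 3.
        self.queries = nn.Embedding(num_queries, dim)            # learnable object queries

    def forward(self, img):
        b = img.shape[0]
        feat = self.backbone(img)                    # low-resolution features
        pix = self.pixel_decoder(feat)               # high-resolution per-pixel embeddings
        mem = feat.flatten(2).transpose(1, 2)        # (B, HW, C) for attention
        q = self.transformer_decoder(self.queries.weight.expand(b, -1, -1), mem)
        return torch.einsum("bqc,bchw->bqhw", q, pix)  # decode binary masks per query

masks = MetaArchSketch()(torch.randn(1, 3, 256, 256))  # -> (1, 100, 64, 64)
```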
Key components of the proposed Transformer decoder
Masked attention: extract localized features by constraining cross-attention to the foreground region of each query's predicted mask
For small objects: an efficient multi-scale strategy to exploit high-resolution features
Context features are important for image segmentation, but global context causes slow convergence: cross-attention needs many epochs to learn to attend to localized object regions
Hypotheses
1. local features are enough to update query features
2. context information can be gathered through self-attention
Solution
cross-attention attends only within the foreground region of the predicted mask for each query
Masked attention: $\mathbf{X}_l = \operatorname{softmax}(\mathcal{M}_{l-1} + \mathbf{Q}_l\mathbf{K}_l^{\top})\mathbf{V}_l + \mathbf{X}_{l-1}$
Attention mask: $\mathcal{M}_{l-1}(x,y) = 0$ if $\mathbf{M}_{l-1}(x,y) = 1$, else $-\infty$
$\mathbf{M}_{l-1}$ is the binarized (threshold 0.5) mask prediction of the previous Transformer decoder layer, resized to the same resolution as $\mathbf{K}_l$
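A single-head PyTorch sketch of this masked attention (projections, scaling, and multi-head details omitted; the empty-mask fallback, where a query with an all-background mask attends everywhere, is an implementation assumption to avoid NaNs):

```python
import torch

def masked_attention(Q, K, V, mask_prev):
    # Q: (N, C) queries; K, V: (HW, C) image features; mask_prev: (N, HW)
    # mask logits from the previous decoder layer, resized to K's resolution.
    fg = mask_prev.sigmoid() > 0.5                      # binarize at 0.5
    fg = fg | ~fg.any(dim=-1, keepdim=True)             # empty mask -> attend everywhere
    attn_mask = torch.where(fg, torch.zeros_like(mask_prev),
                            torch.full_like(mask_prev, float("-inf")))
    attn = torch.softmax(Q @ K.T + attn_mask, dim=-1)   # softmax(M_{l-1} + Q K^T)
    return attn @ V                                     # X_l (residual path omitted)
```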
Problem
High-resolution features are good for small objects, but come at high computation cost
Solution
Do not always use the high-resolution feature map; use multi-scale features to control the increase in computation
Feed one resolution of the multi-scale features to one Transformer decoder layer at a time, cycling from low to high resolution (see the sketch below)
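A sketch of this round-robin feeding under assumed inputs (`features` and `decoder_layers` are hypothetical names):

```python
def decode(queries, features, decoder_layers):
    # features: 3 maps at strides 32, 16, 8 (low -> high resolution);
    # decoder_layers: 3*L layers; layer i consumes scale i % 3, so each
    # round visits every resolution once instead of always paying for
    # the highest-resolution map.
    for i, layer in enumerate(decoder_layers):
        queries = layer(queries, features[i % len(features)])
    return queries
```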
Query features fed to the first self-attention layer are image-independent and carry no signal from the image, so applying self-attention first does not enrich them → swap the order: masked cross-attention first, then self-attention (see the sketch below)
Learnable query features (supervised before being used in the decoder) function like a region proposal network and can generate mask proposals
Dropout is not necessary and decreases performance → remove it
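A sketch of the resulting decoder layer ordering (masked cross-attention first, then self-attention, then FFN, with dropout removed); the module wiring is illustrative, not the paper's exact implementation:

```python
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, dropout=0.0, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, dropout=0.0, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, q, feat, attn_mask=None):
        # 1) masked cross-attention first: queries pick up image signal
        q = self.n1(q + self.cross_attn(q, feat, feat, attn_mask=attn_mask)[0])
        # 2) self-attention among queries gathers context
        q = self.n2(q + self.self_attn(q, q, q)[0])
        # 3) feed-forward network; no dropout anywhere
        return self.n3(q + self.ffn(q))
```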
Problem
High memory consumption during training
Solution
Motivated by PointRend / Implicit PointRend, which show a segmentation model can be trained w/ a mask loss calculated on randomly sampled points instead of the whole mask
Use sampled points to calculate the mask loss in both the matching loss and the final loss
For the matching loss: uniformly sample the same set of points for all prediction/ground-truth pairs
For the final loss: importance-sample different points for each pair of prediction and ground truth (see the sketch below)
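A sketch of the uniform variant used for the matching loss; `sample_points` is a hypothetical helper built on `F.grid_sample`, mirroring PointRend's point sampling, and the shapes are illustrative (the paper samples K = 12544 points):

```python
import torch
import torch.nn.functional as F

def sample_points(mask, points):
    # mask: (B, 1, H, W); points: (B, P, 2) as (x, y) in [0, 1] -> (B, P) values
    grid = 2.0 * points - 1.0                      # grid_sample expects [-1, 1]
    out = F.grid_sample(mask, grid.unsqueeze(2), align_corners=False)
    return out.squeeze(3).squeeze(1)

B, P, H, W = 2, 12544, 128, 128
pred = torch.randn(B, 1, H, W)                     # predicted mask logits
gt = torch.randint(0, 2, (B, 1, H, W)).float()     # binary ground-truth masks
pts = torch.rand(1, P, 2).expand(B, -1, -1)        # the SAME uniform points for all pairs
loss = F.binary_cross_entropy_with_logits(sample_points(pred, pts),
                                           sample_points(gt, pts))
```

For the final loss, the points would instead be importance-sampled per prediction/ground-truth pair, biased toward uncertain regions.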
Datasets: COCO, ADE20K, Cityscapes, Mapillary Vistas
Limitations
Trained on panoptic annotations, the model is slightly worse at instance/semantic segmentation than the exact same model trained w/ the corresponding task-specific annotations, which means it still needs to be trained for each specific task