[Paper Review] FreeSOLO: Learning to Segment Objects without Annotations

한의진 · September 22, 2024


Goal

• Provide a fully unsupervised learning method that learns class-agnostic instance segmentation

• Motivation

• Instance segmentation requires costly annotations such as bounding boxes and segmentation masks

• Contribution

• FreeSOLO outperforms several segmentation proposal methods that use manual annotations

Background

• SOLO (supervised instance segmentation) directly maps an input image to the desired object
categories and instance masks using an FCN (fully convolutional network)

• Eliminates the need for bounding box detection or grouping via post-processing

• Conceptually divides the input image into an S × S grid

• Each grid cell is responsible for predicting the semantic category as well as the segmentation mask

• Consists of a category branch and a mask branch

• Category branch predicts the semantic categories

• Mask branch generates S² masks, one for each grid cell

S = G * F

• G denotes the convolution kernels, F the mask features, and S the score maps (here S is the score maps, not the grid size)

• The dynamic SOLO variant employs dynamic convolutions to predict the mask kernels and
mask features

• S is normalized via a sigmoid function and then passed to mask NMS

• SOLO adopts a one-stage design, with a category branch and a mask branch
that encode the object category information and segmentation proposals

• Top-down meets bottom-up design
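
As a rough illustration of S = G * F in the dynamic variant: if the predicted kernels are 1×1, the convolution reduces to a per-pixel dot product between each kernel and the mask features. The sketch below assumes 1×1 kernels and toy shapes; the names `dynamic_masks`, `grid`, and `E` are hypothetical and this is not the SOLO reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dynamic_masks(kernels, feats):
    """kernels: (N, E), one 1x1 kernel per grid cell (kernel size is an assumption).
    feats: (E, H, W) mask features F.
    Returns sigmoid-normalized score maps of shape (N, H, W)."""
    scores = np.einsum("ne,ehw->nhw", kernels, feats)  # S = G * F for 1x1 kernels
    return sigmoid(scores)

rng = np.random.default_rng(0)
grid = 4                                 # grid size: the image is split into grid x grid cells
E, H, W = 8, 16, 16                      # toy channel count and spatial size
G = rng.normal(size=(grid * grid, E))    # one kernel per grid cell
F = rng.normal(size=(E, H, W))
soft_masks = dynamic_masks(G, F)
print(soft_masks.shape)  # (16, 16, 16)
```

Each of the `grid * grid` output maps is the soft mask proposed by one grid cell; in the full pipeline these maps would then go through mask NMS.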

FreeSOLO

• FreeSOLO does not require any type of annotations

• Promoted objectness in network attention

• Free Mask approach

• Generates coarse object masks with simple operations (21 FPS on a V100 GPU with a ResNet
backbone)

• Obtains an instance segmentation model given only unlabeled images

• Self-Supervised SOLO

Free Mask

• Free Mask generates coarse object masks from unlabeled images

• Dense feature maps I are extracted by a backbone model trained via self-supervision (e.g.,
ResNet)

• Construct queries Q and keys K from the features I to generate the coarse segmentation masks

• Downsample I to form the queries Q, where H’ and W’ denote the downsampled spatial size;
I itself is used as the keys K

• The cosine similarity between Q and K is calculated as the dot product between the L2-normalized Q and K

• S = Q’ * K’, where Q’ and K’ denote the L2-normalized Q and K

• Score maps are then normalized as soft masks

• Compute the ‘maskness’ score of each soft mask

• Soft masks are converted to binary masks using a threshold

• Sort the binary masks by their maskness scores and remove the redundant masks via NMS (non-maximum suppression)

• Further remove redundant masks
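
The query-key similarity and the soft/binary mask steps above can be sketched roughly as follows. Average pooling for the downsampling, min-max scaling for the normalization, and the 0.5 threshold are assumptions made for this example; the function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def downsample(feats, factor):
    """Average-pool an (H, W, E) feature map by an integer factor (assumed method)."""
    H, W, E = feats.shape
    h, w = H // factor, W // factor
    cropped = feats[:h * factor, :w * factor]
    return cropped.reshape(h, factor, w, factor, E).mean(axis=(1, 3))

def free_mask_scores(feats, factor=2):
    """Cosine similarity between every query and every key; returns (N, H, W)."""
    H, W, E = feats.shape
    keys = l2_normalize(feats.reshape(-1, E))                          # K: I itself
    queries = l2_normalize(downsample(feats, factor).reshape(-1, E))   # Q: downsampled I
    return (queries @ keys.T).reshape(-1, H, W)

def to_soft_masks(scores, eps=1e-8):
    """Min-max scale each score map to [0, 1] (an assumed normalization)."""
    lo = scores.min(axis=(1, 2), keepdims=True)
    hi = scores.max(axis=(1, 2), keepdims=True)
    return (scores - lo) / (hi - lo + eps)

rng = np.random.default_rng(0)
I = rng.normal(size=(8, 8, 16))          # toy dense feature map (H, W, E)
scores = free_mask_scores(I, factor=2)   # one score map per query
soft = to_soft_masks(scores)
binary = soft > 0.5                      # threshold chosen for illustration
```

Each query yields one candidate mask, so an 8×8 feature map downsampled by 2 gives 16 coarse masks before sorting and NMS.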

Self-supervised pre-training

• Free Mask uses a backbone pre-trained via self-supervision as the starting point

• Dense contrastive learning (DenseCL) pre-training achieved considerably better results with the Free Mask approach

• DenseCL's training objective is close to what Free Mask relies on: dense, pixel-level feature similarity

Pyramid Queries

• Used to generate masks for instances at different scales

• Set a list of scale factors [1.0, 0.5, 0.25]

• All pyramid queries are flattened and concatenated together as the final Q
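
A minimal sketch of building pyramid queries with the scale factors listed above. For simplicity it applies the scale factors directly to the feature map I and resizes with nearest-neighbor sampling; both choices, and the helper names, are assumptions of this example.

```python
import numpy as np

def resize_nearest(feats, scale):
    """Nearest-neighbor resize of an (H, W, E) map (interpolation mode is an assumption)."""
    H, W, _ = feats.shape
    h = max(1, int(round(H * scale)))
    w = max(1, int(round(W * scale)))
    rows = np.minimum((np.arange(h) / scale).astype(int), H - 1)
    cols = np.minimum((np.arange(w) / scale).astype(int), W - 1)
    return feats[rows][:, cols]

def pyramid_queries(feats, scales=(1.0, 0.5, 0.25)):
    """Flatten the queries of every scale and concatenate them into the final Q."""
    E = feats.shape[-1]
    levels = [resize_nearest(feats, s).reshape(-1, E) for s in scales]
    return np.concatenate(levels, axis=0)

I = np.random.default_rng(0).normal(size=(8, 8, 16))
Q = pyramid_queries(I)
print(Q.shape)  # (84, 16): 64 + 16 + 4 queries
```

Coarser levels contribute fewer queries, each summarizing a larger image region, which is what lets the similarity step pick up larger instances.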

Maskness Score

• The maskness score gives the confidence score of an extracted mask

• N_f denotes the number of foreground pixels of the soft mask p, i.e., the pixels whose values are greater than the threshold

• The score weights masks with high confidence on foreground pixels more heavily and down-weights masks with uncertain foreground pixels
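
Read from the definitions above, the maskness score is the mean soft-mask value over the foreground pixels, i.e. maskness = (1 / N_f) Σ p_i over the N_f pixels above the threshold. The sketch below computes it and uses it to order a simple greedy mask NMS; the greedy IoU-based NMS is a stand-in chosen for illustration, and the thresholds are assumed values.

```python
import numpy as np

def maskness(soft_mask, thr=0.5):
    """Mean soft-mask value over foreground pixels (values above thr)."""
    fg = soft_mask > thr
    n_f = fg.sum()
    return float(soft_mask[fg].sum() / n_f) if n_f > 0 else 0.0

def mask_nms(binary_masks, scores, iou_thr=0.5):
    """Greedy mask NMS: visit masks by descending score, drop heavy overlaps."""
    keep = []
    for i in np.argsort(scores)[::-1]:
        duplicate = False
        for j in keep:
            inter = np.logical_and(binary_masks[i], binary_masks[j]).sum()
            union = np.logical_or(binary_masks[i], binary_masks[j]).sum()
            if union > 0 and inter / union > iou_thr:
                duplicate = True
                break
        if not duplicate:
            keep.append(int(i))
    return keep

uncertain = np.zeros((4, 4)); uncertain[:2, :2] = 0.6   # same region, low confidence
confident = np.zeros((4, 4)); confident[:2, :2] = 0.9   # same region, high confidence
other = np.zeros((4, 4)); other[2:, 2:] = 0.8           # a different region
soft = np.stack([uncertain, confident, other])

scores = np.array([maskness(m) for m in soft])
keep = mask_nms(soft > 0.5, scores)
print(keep)  # [1, 2]
```

The uncertain mask covers the same pixels as the confident one, so it is suppressed, while the non-overlapping mask survives.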

Self-supervised SOLO
Learning with coarse masks

• In SOLO, the Dice loss is used to supervise the masks

• In self-supervised SOLO, the masks are coarse

• Coarse masks are treated as a type of weak annotation, and weakly supervised instance segmentation is performed

• Project the predicted masks and the coarse masks onto the x- and y-axes (max operation, average operation)

• Max: emphasizes outlier segmentations; Avg: de-emphasizes the outliers
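
To make the max-versus-average contrast concrete, here is a small sketch that compares a predicted mask and a coarse mask after projecting both onto each axis, using the Dice loss on the projections. The function names and the toy masks are hypothetical, and the exact loss formulation in the paper may differ.

```python
import numpy as np

def dice_loss(a, b, eps=1e-6):
    """Dice loss between two non-negative arrays of the same shape."""
    inter = (a * b).sum()
    return 1.0 - 2.0 * inter / ((a ** 2).sum() + (b ** 2).sum() + eps)

def projection_loss(pred, coarse, reduce):
    """Compare two masks after projecting them onto the x- and y-axes."""
    loss_x = dice_loss(reduce(pred, axis=0), reduce(coarse, axis=0))
    loss_y = dice_loss(reduce(pred, axis=1), reduce(coarse, axis=1))
    return loss_x + loss_y

pred = np.zeros((6, 6)); pred[1:4, 1:4] = 1.0
coarse = pred.copy(); coarse[5, 5] = 1.0   # coarse mask with one outlier pixel

max_loss = projection_loss(pred, coarse, np.max)
avg_loss = projection_loss(pred, coarse, np.mean)
print(max_loss > avg_loss)  # True
```

A single stray pixel changes a whole entry of the max projection but only slightly shifts the average projection, which is why the average variant is less sensitive to noise in the coarse masks.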

Pairwise Loss

• Leverages the prior that proximal pixels are likely to be in the same class
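
One simple way to encode this prior is to penalize neighboring pixels whose predicted labels disagree. The sketch below is a simplified stand-in, not the paper's exact loss: it treats every horizontally and vertically adjacent pair equally, and the name `pairwise_loss` is hypothetical.

```python
import numpy as np

def pairwise_loss(prob, eps=1e-6):
    """prob: (H, W) predicted foreground probabilities in [0, 1].
    Returns the mean negative log-probability that adjacent pixels share a label."""
    def same_label(p, q):
        # Probability that two pixels are both foreground or both background
        return p * q + (1.0 - p) * (1.0 - q)

    right = same_label(prob[:, :-1], prob[:, 1:])   # horizontal neighbors
    down = same_label(prob[:-1, :], prob[1:, :])    # vertical neighbors
    pairs = np.concatenate([right.ravel(), down.ravel()])
    return float(-np.log(np.clip(pairs, eps, 1.0)).mean())

smooth = np.zeros((4, 4)); smooth[:, :2] = 1.0   # clean left/right split
noisy = smooth.copy(); noisy[1, 3] = 1.0         # one isolated foreground pixel

print(pairwise_loss(noisy) > pairwise_loss(smooth))  # True
```

The isolated pixel adds disagreeing neighbor pairs, so the noisy mask is penalized more than the clean one.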

Self-training

• Train a SOLO-based instance segmenter with the free and noisy coarse masks

• Low-confidence predictions are removed and the remaining ones are treated as a new set of coarse masks

Semantic Representation Learning

• Goal: recognize the semantic categories

• Decouple the category branch to perform foreground/background binary classification and semantic embedding learning

• The foreground/background classification is trained with focal loss

• Object-level semantic representation learning: add a branch in parallel to the last layer of the original category branch
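
The two decoupled objectives can be sketched as follows. The focal loss uses the commonly cited defaults (alpha = 0.25, gamma = 2); the negative cosine similarity used here for the embedding branch is an assumption of this example, as are the function names.

```python
import numpy as np

def binary_focal_loss(prob, target, alpha=0.25, gamma=2.0, eps=1e-8):
    """Focal loss for foreground/background classification.
    prob: predicted foreground probabilities, target: 0/1 labels."""
    p_t = np.where(target == 1, prob, 1.0 - prob)
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    return float((-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)).mean())

def negative_cosine(pred_emb, target_emb, eps=1e-8):
    """Negative cosine similarity between embeddings (assumed embedding loss)."""
    a = pred_emb / (np.linalg.norm(pred_emb, axis=-1, keepdims=True) + eps)
    b = target_emb / (np.linalg.norm(target_emb, axis=-1, keepdims=True) + eps)
    return float(-(a * b).sum(axis=-1).mean())

target = np.array([1, 0])
easy = binary_focal_loss(np.array([0.9, 0.1]), target)   # confident and correct
hard = binary_focal_loss(np.array([0.1, 0.9]), target)   # confident and wrong
print(easy < hard)  # True

v = np.array([[1.0, 2.0, 3.0]])
aligned = negative_cosine(v, v)   # close to -1 for identical embeddings
```

The focal term shrinks the contribution of well-classified cells, while the embedding term only rewards direction agreement, not magnitude.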

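Thoughts

Instance segmentation has mostly been tackled with supervised learning, and I had the fixed idea that it must rely on a dataset built for 'supervised learning'. Given how much computational cost the existing SOLO method involves, it was interesting to see an approach aimed at this problem.

Generating coarse masks from Q and K and then refining them into accurate masks was also very interesting; I would like to try applying it to other datasets through fine-tuning.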
