https://arxiv.org/pdf/2101.08461.pdf
A new dataset, Trans10K-v2, is introduced; it is larger in scale and has finer-grained categories than v1. A transformer-based pipeline, Trans2Seg, is also proposed, whose self-attention gives the model a global receptive field. The authors design learnable class prototypes that serve as the queries of Trans2Seg's transformer decoder.
In robotics, sonar or LiDAR is typically used to perceive the environment, but such sensors struggle with transparent objects. Previous datasets and research have three problems: (1) limited scale, (2) poor diversity, and (3) few classes (e.g., not distinguishing "things" to grasp from "stuff" relevant to navigation).
Self-attention gives the encoder a global receptive field. The decoder stacks successive layers in which the query embeddings interact with the encoded features.
Evaluation metrics
In the Transformer encoder, the spatial dimensions of the input feature map (H/16, W/16, C) are flattened into one dimension (H/16·W/16, C) so it can be processed as a sequence. A positional embedding of the same dimension is added to the flattened feature. The encoder is composed of stacked encoder layers, each consisting of a multi-head self-attention module and a feed-forward network.
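The flatten-then-encode step can be sketched in PyTorch as below; the concrete sizes (C=256, a 32×32 feature map, 8 heads, 4 layers) are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Assumed shapes: batch N=2, channels C=256, feature map of size H/16 x W/16 = 32 x 32
N, C, H16, W16 = 2, 256, 32, 32
feat = torch.randn(N, C, H16, W16)

# Flatten spatial dims into a sequence: (N, C, H/16, W/16) -> (N, H/16*W/16, C)
seq = feat.flatten(2).transpose(1, 2)

# Learnable positional embedding with the same (sequence, channel) shape
pos_embed = nn.Parameter(torch.zeros(1, H16 * W16, C))
seq = seq + pos_embed

# Stacked encoder layers: multi-head self-attention + feed-forward network
encoder_layer = nn.TransformerEncoderLayer(
    d_model=C, nhead=8, dim_feedforward=1024, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
memory = encoder(seq)  # (N, H/16*W/16, C); every token attends to every other token
print(memory.shape)    # torch.Size([2, 1024, 256])
```

Because each token attends to the full sequence, a single encoder layer already has a global receptive field over the feature map.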
In the decoder, learnable class prototype embeddings are used as queries, with the encoded features as keys and values, producing attention maps that are fed into a small convolutional head to obtain the final segmentation result. The class prototype embeddings are refined iteratively by the multi-head attention in each decoder layer, and the attention maps are taken from the final decoder layer. They are then up-sampled to (N, M, H/4, W/4) and fused with the high-resolution feature map Res2.
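A minimal sketch of the prototype-as-query idea, using PyTorch's generic transformer decoder rather than the paper's exact implementation; the class count K=12, head count M=8, and other sizes are assumptions:

```python
import torch
import torch.nn as nn

# Assumed sizes: batch B, channels C, 32x32 encoded feature map, K classes, M heads
B, C, H16, W16, K, M = 2, 256, 32, 32, 12, 8

memory = torch.randn(B, H16 * W16, C)             # encoder output sequence
class_proto = nn.Parameter(torch.randn(1, K, C))  # learnable class prototype embeddings

# Stacked decoder layers refine the prototypes by attending to the encoded features
decoder_layer = nn.TransformerDecoderLayer(d_model=C, nhead=M, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)
queries = decoder(class_proto.expand(B, -1, -1), memory)  # (B, K, C)

# Per-head attention between each refined class query and each pixel token
attn = nn.MultiheadAttention(C, M, batch_first=True)
_, attn_map = attn(queries, memory, memory, average_attn_weights=False)

# (B, M, K, H/16*W/16) -> one spatial map per class and head, ready for upsampling
attn_map = attn_map.reshape(B, M, K, H16, W16).transpose(1, 2)
print(attn_map.shape)  # torch.Size([2, 12, 8, 32, 32])
```

Each class query effectively performs a "dictionary lookup" over the pixel tokens, so the attention map for class k already resembles a coarse segmentation mask for that class.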
FCN was used as a baseline, and the authors demonstrate that adding a Transformer encoder improves the mean intersection over union (mIoU). They also show that using learnable class prototypes as queries in the Transformer decoder further improves accuracy. Additionally, they investigate the impact of Transformer size on performance and find that larger models do not always perform better without massive data for pretraining.
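For reference, mIoU can be computed as below; this is an illustrative sketch, not the paper's evaluation code:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x2 label maps: class 0 IoU = 1/2, class 1 IoU = 2/3
pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [0, 1]])
print(round(mean_iou(pred, gt, num_classes=2), 4))  # 0.5833
```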
This paper introduces a new dataset and proposes a transformer-based pipeline called Trans2Seg. The transformer encoder provides a global receptive field, and the transformer decoder models segmentation as a dictionary lookup with a set of learnable queries, where each query represents one category. In the future, the authors plan to extend the transformer design to general segmentation tasks, including transparent objects.