https://arxiv.org/pdf/2101.08461.pdf
A new dataset, Trans10K-v2, is introduced; it is larger in scale and has finer-grained categories than v1. A transformer-based pipeline, Trans2Seg, is also proposed, whose self-attention gives the model a global receptive field. The authors design learnable class prototypes that serve as the queries of Trans2Seg's transformer decoder.
In robotics, sonar or LiDAR is typically used to perceive the environment, but such sensors struggle with transparent objects. Previous datasets and research have three problems: (1) limited scale, (2) poor diversity, and (3) few classes (e.g., not distinguishing "things" to grasp from "stuff" relevant to navigation).
Self-attention gives the encoder a global receptive field. The decoder stacks successive layers in which the query embeddings interact with the encoded features.
Evaluation metrics
In the Transformer encoder, the spatial dimensions of the input feature map (H/16, W/16, C) are flattened into one dimension (H/16·W/16, C) so it can be processed as a sequence. A positional embedding of the same dimension is added to the flattened feature. The encoder is composed of stacked encoder layers, each consisting of a multi-head self-attention module and a feed-forward network.
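The flatten-then-encode step can be sketched in PyTorch as below; the concrete sizes (C=256, a 32×32 feature map, 8 heads, 4 layers) are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Assumed shapes: batch N=2, channels C=256, feature map of size H/16 x W/16 = 32 x 32
N, C, H16, W16 = 2, 256, 32, 32
feat = torch.randn(N, C, H16, W16)

# Flatten spatial dims into a sequence: (N, C, H/16, W/16) -> (N, H/16*W/16, C)
seq = feat.flatten(2).transpose(1, 2)

# Learnable positional embedding with the same (sequence, channel) shape
pos_embed = nn.Parameter(torch.zeros(1, H16 * W16, C))
seq = seq + pos_embed

# Stacked encoder layers: multi-head self-attention + feed-forward network
encoder_layer = nn.TransformerEncoderLayer(
    d_model=C, nhead=8, dim_feedforward=1024, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
memory = encoder(seq)  # (N, H/16*W/16, C); every token attends to every other token
print(memory.shape)    # torch.Size([2, 1024, 256])
```

Because each token attends to the full sequence, a single encoder layer already has a global receptive field over the feature map.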
In the decoder, learnable class prototype embeddings are used as queries, with the encoded features as keys and values, producing attention maps that are fed into a small convolutional head to obtain the final segmentation result. The class prototype embeddings are refined iteratively by the multi-head attention in each decoder layer, and the attention maps are taken from the final decoder layer. They are then up-sampled to (N, M, H/4, W/4) and fused with the high-resolution feature map Res2.
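A minimal sketch of the prototype-as-query idea, using PyTorch's generic transformer decoder rather than the paper's exact implementation; the class count K=12, head count M=8, and other sizes are assumptions:

```python
import torch
import torch.nn as nn

# Assumed sizes: batch B, channels C, 32x32 encoded feature map, K classes, M heads
B, C, H16, W16, K, M = 2, 256, 32, 32, 12, 8

memory = torch.randn(B, H16 * W16, C)             # encoder output sequence
class_proto = nn.Parameter(torch.randn(1, K, C))  # learnable class prototype embeddings

# Stacked decoder layers refine the prototypes by attending to the encoded features
decoder_layer = nn.TransformerDecoderLayer(d_model=C, nhead=M, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)
queries = decoder(class_proto.expand(B, -1, -1), memory)  # (B, K, C)

# Per-head attention between each refined class query and each pixel token
attn = nn.MultiheadAttention(C, M, batch_first=True)
_, attn_map = attn(queries, memory, memory, average_attn_weights=False)

# (B, M, K, H/16*W/16) -> one spatial map per class and head, ready for upsampling
attn_map = attn_map.reshape(B, M, K, H16, W16).transpose(1, 2)
print(attn_map.shape)  # torch.Size([2, 12, 8, 32, 32])
```

Each class query effectively performs a "dictionary lookup" over the pixel tokens, so the attention map for class k already resembles a coarse segmentation mask for that class.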
FCN was used as a baseline, and the authors demonstrate that adding a Transformer encoder improves the mean intersection over union (mIoU). They also show that using learnable class prototypes as queries in the Transformer decoder further improves accuracy. Additionally, they investigate the impact of Transformer size on performance and find that larger models do not always perform better without massive data for pretraining.
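For reference, mIoU can be computed as below; this is an illustrative sketch, not the paper's evaluation code:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x2 label maps: class 0 IoU = 1/2, class 1 IoU = 2/3
pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [0, 1]])
print(round(mean_iou(pred, gt, num_classes=2), 4))  # 0.5833
```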
This paper introduces a new dataset and proposes a transformer-based pipeline called Trans2Seg. The transformer encoder provides a global receptive field, and the transformer decoder models segmentation as a dictionary lookup with a set of learnable queries, where each query represents one category. In the future, the authors plan to extend the transformer design to general segmentation tasks, including transparent objects.