SSD Paper Review

김상현 · May 12, 2021

Paper title: SSD: Single Shot MultiBox Detector

SSD Overview

SSD achieved higher detection accuracy than Faster R-CNN, the state of the art at the time, while also running faster.
The changes that improve speed are:

Removing bounding box proposals
Removing the pixel or feature resampling stage

With these changes, SSD is a 1-stage detector that runs as a single network. Fast 1-stage detectors such as YOLO already existed, but they were less accurate than 2-stage detectors.
The changes that improve accuracy are:

Using small convolutional filters to predict object categories and box offsets
Using separate predictors (filters) for detections of different aspect ratios
Applying these filters to multiple feature maps

In short, SSD combines the accuracy of a 2-stage detector (Faster R-CNN) with the speed of a 1-stage detector (YOLO).

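The authors summarize their contributions as follows.

They introduce SSD, a single-shot detector that is faster than YOLO and significantly more accurate, in fact as accurate as slower methods that use explicit region proposals, such as Faster R-CNN.
The core of SSD is applying small convolutional filters to feature maps to predict category scores and box offsets for a fixed set of default bounding boxes.
To improve detection accuracy, predictions at different scales are produced from feature maps of different scales, and predictions are separated by aspect ratio.
This design allows end-to-end training and high accuracy.
SSD is compared against the state-of-the-art methods of the time on benchmark datasets such as PASCAL VOC, COCO, and ILSVRC.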

The Single Shot Detector (SSD)

Figure 1. Model architecture

Model

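SSD is based on a feed-forward convolutional network that outputs bounding boxes together with scores for the objects present in those boxes.
The early part of the network takes an architecture that performs well at image classification, with its classification layers removed; this is called the base network. Auxiliary structure is then added to the base network.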

Multi-scale feature maps for detection

Convolutional feature layers are added to the end of the base network. These layers progressively decrease in size and make detection predictions at multiple scales. The convolutional model used for detection prediction is different for each feature layer.

Convolutional predictors for detection

For a feature layer (map) of size m x n with p channels, the predictor is a convolutional detector with a 3 x 3 x p kernel. It outputs either a category score or an offset relative to the default box coordinates.
cf) YOLO uses a fully connected layer for this step rather than a convolutional predictor.

Default boxes and aspect ratios

Figure 2

For each cell of a feature map, default boxes with different scales and aspect ratios are generated. A default box is conceptually similar to the anchor box used in Faster R-CNN.
The predictor makes (c+4)k predictions per cell, where c is the number of classes in the dataset and k is the number of default boxes. Each default box in a cell yields c class scores and 4 offsets, and this is repeated for every default box, so one cell produces (c+4)k predictions. Applied to an m x n feature map, this gives (c+4)kmn outputs. Figure 2 above shows an example output; the specific numbers (c=21, k=6) reflect a model trained on VOC data.
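The (c+4)kmn count above can be checked with a few lines of arithmetic. This is only a sketch: the feature map sizes and boxes-per-cell values are the ones listed in the SSD300 code of this post, and `outputs_per_map` is a hypothetical helper name.

```python
# Feature-map side lengths and default boxes per cell, as listed in the SSD300 code in this post.
feature_map_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_cell = [4, 6, 6, 6, 4, 4]
num_classes = 21  # c for a model trained on VOC, as in Figure 2


def outputs_per_map(size, k, c):
    """Number of predicted values for a size x size feature map: (c + 4) * k * m * n."""
    return (c + 4) * k * size * size


# Total number of default boxes over all six feature maps.
total_boxes = sum(k * s * s for s, k in zip(feature_map_sizes, boxes_per_cell))
# Total number of predicted values (class scores plus 4 offsets per box).
total_outputs = sum(
    outputs_per_map(s, k, num_classes)
    for s, k in zip(feature_map_sizes, boxes_per_cell)
)

print(total_boxes)    # 8732
print(total_outputs)  # 218300
```

The 8732 here is the same number that appears in the output shape of the network.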

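Code

The SSD network structure is as follows.
The paper uses VGG as the base network, but this code uses a ResNet-based base network.

input: nbatch x 3 x 300 x 300
output: nbatch x 8732 x {nlabels, nlocs(4)}, where nlabels is 21 for VOC; the code below sets label_num = 81 for COCO. As written, bbox_view returns the two tensors laid out as nbatch x 4 x 8732 and nbatch x nlabels x 8732.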

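import torch
import torch.nn as nn

# Note: ResNet is the backbone wrapper from the referenced Nvidia repository; it is not defined in this excerpt.
class SSD300(nn.Module):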
    def __init__(self, backbone=ResNet('resnet50')):
        super().__init__()

        self.feature_extractor = backbone

        self.label_num = 81  # number of COCO classes
        self._build_additional_features(self.feature_extractor.out_channels)
        self.num_defaults = [4, 6, 6, 6, 4, 4]
        self.loc = []
        self.conf = []

        # One convolutional predictor per feature map
        for nd, oc in zip(self.num_defaults, self.feature_extractor.out_channels):
            self.loc.append(nn.Conv2d(oc, nd * 4, kernel_size=3, padding=1))  # for offset
            self.conf.append(nn.Conv2d(oc, nd * self.label_num, kernel_size=3, padding=1))  # for score

        self.loc = nn.ModuleList(self.loc)
        self.conf = nn.ModuleList(self.conf)
        self._init_weights()

    def _build_additional_features(self, input_size):
        # Build the extra convolutional blocks that run after the base network.
        self.additional_blocks = []
        for i, (input_size, output_size, channels) in enumerate(zip(input_size[:-1], input_size[1:], [256, 256, 128, 128, 128])):
            if i < 3:
                layer = nn.Sequential(
                    nn.Conv2d(input_size, channels, kernel_size=1, bias=False),
                    nn.BatchNorm2d(channels),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(channels, output_size, kernel_size=3, padding=1, stride=2, bias=False),
                    nn.BatchNorm2d(output_size),
                    nn.ReLU(inplace=True),
                )
            else:
                layer = nn.Sequential(
                    nn.Conv2d(input_size, channels, kernel_size=1, bias=False),
                    nn.BatchNorm2d(channels),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(channels, output_size, kernel_size=3, bias=False),
                    nn.BatchNorm2d(output_size),
                    nn.ReLU(inplace=True),
                )

            self.additional_blocks.append(layer)

        self.additional_blocks = nn.ModuleList(self.additional_blocks)

    def _init_weights(self):
        layers = [*self.additional_blocks, *self.loc, *self.conf]
        for layer in layers:
            for param in layer.parameters():
                if param.dim() > 1: nn.init.xavier_uniform_(param)

    # Shape the classifier to the view of bboxes
    def bbox_view(self, src, loc, conf):
        ret = []
        for s, l, c in zip(src, loc, conf):
            # Apply the convolutional predictors to each feature map in turn,
            # flatten the spatial dimensions, and append the results to the list
            ret.append((l(s).view(s.size(0), 4, -1), c(s).view(s.size(0), self.label_num, -1)))

        locs, confs = list(zip(*ret))
        locs, confs = torch.cat(locs, 2).contiguous(), torch.cat(confs, 2).contiguous()  # concatenate along the box dimension; the batch dimension stays first
        return locs, confs

    def forward(self, x):
        x = self.feature_extractor(x)

        detection_feed = [x]  # start the list with the feature extractor output
        for l in self.additional_blocks:
            # apply each additional block in turn and collect every resulting feature map
            x = l(x)
            detection_feed.append(x)

        # This yields the following feature maps (size x size x default boxes per cell):
        # Feature Map 38x38x4, 19x19x6, 10x10x6, 5x5x6, 3x3x4, 1x1x4
        locs, confs = self.bbox_view(detection_feed, self.loc, self.conf)

        # For SSD 300, shall return nbatch x 8732 x {nlabels, nlocs} results
        return locs, confs

Inference

Figure 3

SSD model inference is shown in Figure 3 above.

Training

Matching strategy

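To train the network, the default boxes must be matched to the ground truth boxes. Each ground truth box is first matched to the default box with the highest jaccard overlap (= IoU); then every default box whose IoU with a ground truth box exceeds 0.5 is also set as a positive sample. Compared with using only the single highest-IoU box as the positive sample, using all boxes above 0.5 simplifies the learning problem and leads to better predictions.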
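The matching rule can be sketched in plain Python as below. This is an illustration under assumptions, not the code of the referenced repository: boxes are `(xmin, ymin, xmax, ymax)` tuples, `iou` and `match` are hypothetical helper names, and the extra step in which each ground truth claims its best-overlapping default box follows the paper's description of the strategy.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(defaults, truths, threshold=0.5):
    """Return, for each default box, the index of its matched ground truth, or -1 (negative)."""
    matches = [-1] * len(defaults)
    if not truths:
        return matches
    # Threshold step: a default box is positive if its best IoU exceeds the threshold.
    for i, d in enumerate(defaults):
        overlaps = [iou(d, g) for g in truths]
        j = max(range(len(truths)), key=overlaps.__getitem__)
        if overlaps[j] > threshold:
            matches[i] = j
    # Best-match step: every ground truth keeps its highest-IoU default box,
    # even when that IoU is below the threshold.
    for j, g in enumerate(truths):
        best = max(range(len(defaults)), key=lambda i: iou(defaults[i], g))
        matches[best] = j
    return matches


defaults = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]
print(match(defaults, [(0, 0, 10, 12)]))  # [0, -1, -1]
```

Every default box left at -1 is a negative sample for the confidence loss.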

Training objective

$$L(x,c,l,g) = \frac{1}{N}\left(L_{conf}(x,c) + \alpha L_{loc}(x,l,g)\right)$$

$N$ is the number of default boxes matched to a ground truth; when $N = 0$, the loss is set to 0. The paper uses $\alpha = 1$.
$x^p_{ij} \in \{1,0\}$ is an indicator: it is 1 when the $i$-th default box is matched to the $j$-th ground truth of category $p$, and 0 otherwise.

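$L_{loc}$ is similar to the bounding box loss of the R-CNN family of models: a Smooth L1 loss between the predicted box $l$ and the ground truth box $g$, encoded relative to the default box $d$. Here $x^k_{ij}$ is the same matching indicator as above, with $k$ denoting the category. The formulas are as follows.

$$L_{loc}(x,l,g) = \sum^{N}_{i \in Pos} \sum_{m \in \{cx,cy,w,h\}} x^k_{ij}\, \text{smooth}_{L1}(l^m_i - \hat{g}^m_j)$$

$$\hat{g}^{cx}_{j} = (g^{cx}_j - d^{cx}_i) / d^w_i \qquad \hat{g}^{cy}_{j} = (g^{cy}_j - d^{cy}_i) / d^h_i$$

$$\hat{g}^{w}_{j} = \log\left(\frac{g^{w}_j}{d^{w}_i}\right) \qquad \hat{g}^{h}_{j} = \log\left(\frac{g^{h}_j}{d^{h}_i}\right)$$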

$l$ : predicted box
$g$ : ground truth box
$d$ : default box
$cx,cy$ : center x,y
$w$ : width
$h$ : height
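
The offset encoding and the localization term for a single matched pair can be sketched as follows. Boxes are assumed to be `(cx, cy, w, h)` tuples, the function names are hypothetical, and the piecewise form of Smooth L1 used here (0.5x² for |x| < 1, |x| - 0.5 otherwise) is the common definition from Fast R-CNN rather than something stated in this post.

```python
import math


def encode(g, d):
    """Encode ground truth box g relative to default box d; both are (cx, cy, w, h)."""
    return (
        (g[0] - d[0]) / d[2],    # g-hat^cx
        (g[1] - d[1]) / d[3],    # g-hat^cy
        math.log(g[2] / d[2]),   # g-hat^w
        math.log(g[3] / d[3]),   # g-hat^h
    )


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere (assumed Fast R-CNN form)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def loc_loss(pred, g, d):
    """Localization loss for one matched (default box, ground truth) pair."""
    target = encode(g, d)
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


d = (0.5, 0.5, 0.2, 0.2)    # default box
g = (0.55, 0.5, 0.4, 0.2)   # ground truth: shifted right and twice as wide
print(encode(g, d))
print(loc_loss((0.0, 0.0, 0.0, 0.0), g, d))
```

A prediction equal to `encode(g, d)` gives zero loss, which is why the network regresses offsets rather than absolute coordinates.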

$L_{conf}$ is a cross-entropy loss over the softmax of the class confidences. The formula is as follows.

$$L_{conf}(x,c) = -\sum^{N}_{i \in Pos} x^p_{ij} \log(\hat{c}^p_i) - \sum_{i \in Neg} \log(\hat{c}^0_i) \quad \text{where} \quad \hat{c}^p_i = \frac{\exp(c^p_i)}{\sum_p \exp(c^p_i)}$$

$c$ : confidences for each class (class 0 is background)
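
As a rough illustration of the confidence loss, the sketch below computes softmax cross entropy over a list of boxes in plain Python. It assumes each box carries one target label, with label 0 for negatives (background) and the matched category for positives; `log_softmax` and `conf_loss` are hypothetical names.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw class scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def conf_loss(scores, labels):
    """Sum of -log(c-hat) over boxes; labels[i] is 0 (background) for negatives."""
    return -sum(log_softmax(s)[y] for s, y in zip(scores, labels))


scores = [
    [0.0, 0.0, 0.0],   # uninformative prediction for a positive box of class 1
    [5.0, 0.0, 0.0],   # confident background prediction for a negative box
]
labels = [1, 0]
print(conf_loss(scores, labels))
```

The first box contributes log(3) because its softmax is uniform over three classes, while the confident and correct background prediction contributes almost nothing.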

Choosing scales and aspect ratios for default boxes

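Figure 4

SSD makes predictions from feature maps of several scales, each with a different receptive field. As Figure 4 shows, the size of the objects a feature map detects therefore depends on its receptive field: feature maps early in the network detect small objects, and feature maps later in the network detect large objects.

For each feature map used for detection, SSD defines the scale of its default boxes with the following formula.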

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m-1}(k-1), \quad k \in [1,m]$$

$s_{min}$ = 0.2
$s_{max}$ = 0.9
$m$ : number of feature maps used for prediction (6 for SSD)

The resulting $s_k$ is the box scale as a fraction of the input image size. For example, for a 300x300 input image, a scale of $s = 0.1$ with aspect ratio 1:1 gives a 30x30 default box.
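
The scale formula is easy to evaluate directly. The sketch below uses the values given above ($s_{min}$ = 0.2, $s_{max}$ = 0.9, $m$ = 6); `default_box_scales` is a hypothetical helper name.

```python
def default_box_scales(m=6, s_min=0.2, s_max=0.9):
    """Scale s_k for k = 1..m, linearly spaced between s_min and s_max (assumes m > 1)."""
    return [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]


scales = default_box_scales()
print([round(s, 2) for s in scales])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]

# A scale is a fraction of the input size: 0.1 of a 300x300 image is a 30-pixel side.
image_size = 300
print(round(0.1 * image_size))  # 30
```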

The center of each default box is the center of its feature map cell. Around that center, the aspect ratios $a_r \in \{1,2,3,\frac{1}{2}, \frac{1}{3} \}$ define the width $w^a_k = s_k\sqrt{a_r}$ and the height $h^a_k = s_k/ \sqrt{a_r}$. For aspect ratio 1, an extra box with scale $s'_k = \sqrt{s_k s_{k+1}}$ is added, which gives 6 default boxes per feature map cell.
cf) On the feature maps that use only 4 default boxes per cell, the aspect ratios 3 and $\frac{1}{3}$ are omitted.

Using the many default boxes generated from these scales and aspect ratios, SSD makes predictions that cover the varied sizes and shapes of the objects in the input image.
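
A minimal sketch of how the box shapes for one cell could be generated from these definitions is shown below. The function names are hypothetical, sizes are fractions of the image size, and the cell center convention ((index + 0.5) / feature map size) is the one given in the paper.

```python
import math


def default_box_shapes(s_k, s_k_next, aspect_ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """(width, height) pairs for one cell, as fractions of the image size."""
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    # Extra square box for aspect ratio 1 with scale s'_k = sqrt(s_k * s_{k+1}).
    s_prime = math.sqrt(s_k * s_k_next)
    shapes.append((s_prime, s_prime))
    return shapes


def cell_center(row, col, fmap_size):
    """Normalized (cx, cy) of a cell in an fmap_size x fmap_size feature map."""
    return ((col + 0.5) / fmap_size, (row + 0.5) / fmap_size)


six = default_box_shapes(0.2, 0.34)
four = default_box_shapes(0.2, 0.34, aspect_ratios=(1, 2, 1 / 2))
print(len(six), len(four))  # 6 4
print(cell_center(0, 0, 38))
```

Every box generated from an aspect ratio has the same area $s_k^2$; only the square extra box differs in area.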

Hard negative mining

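After matching, most default boxes are negatives, so negative samples greatly outnumber positive samples. To address this imbalance, SSD performs hard negative mining: the negatives are sorted by confidence loss and only the highest-loss ones are kept, so that the ratio of negatives to positives is at most 3:1.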
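
The selection step can be sketched as below. This is an assumption-laden illustration: `hard_negative_mining` is a hypothetical name, the per-box confidence losses are taken as given, and ranking negatives by that loss follows the paper's description.

```python
def hard_negative_mining(losses, is_positive, ratio=3):
    """Return the indices of the negatives kept for the loss, at most ratio x the positives."""
    num_pos = sum(1 for pos in is_positive if pos)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: the ones the model currently gets most wrong.
    negatives.sort(key=lambda i: losses[i], reverse=True)
    return sorted(negatives[: ratio * num_pos])


losses = [0.1, 2.0, 0.3, 0.05, 1.5, 0.7, 0.2, 0.9]
is_positive = [False, True, False, False, False, False, False, False]
print(hard_negative_mining(losses, is_positive))  # [4, 5, 7]
```

With one positive box, only the three negatives with the largest loss contribute to the confidence loss; the rest are ignored.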

Data augmentation

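Data augmentation is used to make the model more robust.
In addition to the original image, randomly sampled patches are used. Each patch is between 0.1 and 1 times the size of the original image, and its aspect ratio is between 1/2 and 2. Each sample is then flipped horizontally with probability 0.5.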
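
A rough sketch of the patch sampling is given below. It makes several assumptions that the text does not fix: coordinates are normalized to [0, 1], the [0.1, 1] size range is applied to the patch width and height separately, the aspect ratio is constrained to lie between 1/2 and 2, and the whole image is used when no valid patch is found. The function names are hypothetical.

```python
import random


def sample_patch(rng, min_scale=0.1, max_scale=1.0, min_ar=0.5, max_ar=2.0, max_tries=50):
    """Return a random patch (xmin, ymin, xmax, ymax) in normalized coordinates."""
    for _ in range(max_tries):
        w = rng.uniform(min_scale, max_scale)
        h = rng.uniform(min_scale, max_scale)
        if not (min_ar <= w / h <= max_ar):
            continue  # reject patches that are too elongated
        x = rng.uniform(0.0, 1.0 - w)
        y = rng.uniform(0.0, 1.0 - h)
        return (x, y, x + w, y + h)
    return (0.0, 0.0, 1.0, 1.0)  # fall back to the whole image


def flip_box_horizontally(box):
    """Mirror a normalized box around the vertical center line of the image."""
    xmin, ymin, xmax, ymax = box
    return (1.0 - xmax, ymin, 1.0 - xmin, ymax)


rng = random.Random(0)
patch = sample_patch(rng)
if rng.random() < 0.5:  # horizontal flip with probability 0.5
    patch = flip_box_horizontally(patch)
print(patch)
```

In a full pipeline the ground truth boxes would also have to be clipped to the patch and flipped together with the image; that part is omitted here.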

Experimental Results

Figure 5

Figure 5 above shows that SSD outperforms the R-CNN family of models, which previously had the best results.

Figure 6

Figure 6 above shows that SSD reaches a high mAP while keeping a fast detection speed.

Model analysis

Through their experiments, the authors found the following.

Data augmentation has a large effect on performance.
SSD does not detect small objects well.

cf) According to follow-up work, SSD struggles with small objects because the layer that produces the first feature map, the one used to detect small objects, is not deep enough to capture fine details well.

Conclusions

SSD is a 1-stage detector that delivers detection accuracy on par with 2-stage detectors at high speed. It later became the base model for 1-stage detectors other than YOLO.

Reference

SSD paper: SSD: Single Shot MultiBox Detector
Code reference: Nvidia DeepLearningExamples
https://www.youtube.com/watch?v=ej1ISEoAK5g
https://herbwood.tistory.com/15
https://hohodu.tistory.com/8
