[Paper Review] SSD MobileNetV2 PyTorch Implementation and iPhone App Deployment (2)

Park YunSu · September 29, 2024


1. MobileNetV2 Backbone

Original MobileNetV2 model architecture (input: 300 × 300)

| Layer | Output Channels | Feature Map Size |
| --- | --- | --- |
| Input Image | 3 | 300 × 300 |
| Conv2d (stride=2, kernel=3×3) | 32 | 150 × 150 |
| Bottleneck 1 (stride=1) | 16 | 150 × 150 |
| Bottleneck 2 (stride=2) | 24 | 75 × 75 |
| Bottleneck 3 (stride=2) | 32 | 38 × 38 |
| Bottleneck 4 (stride=2) | 64 | 19 × 19 |
| Bottleneck 5 (stride=1) | 96 | 19 × 19 |
| Bottleneck 6 (stride=2) | 160 | 10 × 10 |
| Bottleneck 7 (stride=1) | 320 | 10 × 10 |
| Conv2d 1×1 (Linear) | 1280 | 10 × 10 |
| Average Pooling (adaptive) | 1280 | 1 × 1 |
| Fully Connected Layer (FC) | 1000 | 1 × 1 |

1) Model Architecture Improvements

To use MobileNetV2 as the backbone of an SSD model, I modified its architecture in four different ways and ran an experiment for each.
Goal: adjust MobileNetV2's FLOP-efficient bottleneck structure so that the model keeps its speed advantage without starving the SSD head of features.

Experiment 1.

Use the output of Bottleneck 7 and the final linear (1×1 conv) block as the base network's feature maps (conv4_3 and conv7).

class MobileNetV2Base(nn.Module):
    def __init__(self):
        super(MobileNetV2Base, self).__init__()
        # First convolution layer
        self.first_conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU6(inplace=True)
        )

        # Bottleneck layers (inverted residual blocks)
        self.bottlenecks = nn.Sequential(
            self._make_stage(32, 16, t=1, n=1),
            self._make_stage(16, 24, t=6, n=2, stride=2),
            self._make_stage(24, 32, t=6, n=3, stride=2),
            self._make_stage(32, 64, t=6, n=4, stride=2),
            self._make_stage(64, 96, t=6, n=3),
            self._make_stage(96, 160, t=6, n=3, stride=2),
            self._make_stage(160, 320, t=6, n=1)
        )

        # Last convolution layer in MobileNetV2
        self.last_conv = nn.Sequential(
            nn.Conv2d(320, 1280, 1, bias=False),
            nn.BatchNorm2d(1280),
            nn.ReLU6(inplace=True)
        )

        # Additional layers for SSD (conv6, conv7)
        self.conv6 = nn.Conv2d(1280, 1024, kernel_size=3, padding=6, dilation=6)  # atrous convolution
        self.conv7 = nn.Conv2d(1024, 1024, kernel_size=1)

        # Adjust conv4_3 feature map to have 512 channels
        self.conv4_3_adjust = nn.Conv2d(320, 512, kernel_size=1)
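
The snippets in this post call self._make_stage, which isn't shown here. A minimal sketch of that helper, assuming the standard MobileNetV2 inverted-residual design (1×1 expansion → 3×3 depthwise → linear 1×1 projection) rather than the repo's exact code, could look like this:

import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2 block: 1x1 expand -> 3x3 depthwise -> 1x1 linear projection."""
    def __init__(self, in_ch, out_ch, stride, expand_ratio):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = (stride == 1 and in_ch == out_ch)
        layers = []
        if expand_ratio != 1:
            layers += [nn.Conv2d(in_ch, hidden, 1, bias=False),
                       nn.BatchNorm2d(hidden),
                       nn.ReLU6(inplace=True)]
        layers += [
            # Depthwise 3x3 convolution (groups == channels)
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # Linear (no activation) 1x1 projection
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.block(x) if self.use_residual else self.block(x)

# Intended as a method of MobileNetV2Base: builds n blocks where only the
# first block may downsample; t is the expansion ratio.
def _make_stage(self, in_ch, out_ch, t, n, stride=1):
    layers = [InvertedResidual(in_ch, out_ch, stride, t)]
    layers += [InvertedResidual(out_ch, out_ch, 1, t) for _ in range(n - 1)]
    return nn.Sequential(*layers)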

Result
Total prior boxes: 1212
Localization output shape: torch.Size([1, 1212, 4])
Class scores output shape: torch.Size([1, 1212, 21])

Experiment 2.

Use the output of Bottleneck 3 and the final linear (1×1 conv) block as the base network's feature maps (conv4_3 and conv7).

class MobileNetV2Base(nn.Module):
    def __init__(self):
        super(MobileNetV2Base, self).__init__()

        # First convolution layer
        self.first_conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU6(inplace=True)
        )

        # Bottleneck layers (inverted residual blocks)
        self.bottlenecks = nn.Sequential(
            self._make_stage(32, 16, t=1, n=1),
            self._make_stage(16, 24, t=6, n=2, stride=2),
            self._make_stage(24, 32, t=6, n=3, stride=2),
            self._make_stage(32, 64, t=6, n=4, stride=2),
            self._make_stage(64, 96, t=6, n=3),
            self._make_stage(96, 160, t=6, n=3, stride=2),
            self._make_stage(160, 320, t=6, n=1)
        )

        # Last convolution layer in MobileNetV2
        self.last_conv = nn.Sequential(
            nn.Conv2d(320, 1280, 1, bias=False),
            nn.BatchNorm2d(1280),
            nn.ReLU6(inplace=True)
        )

        # Additional layers for SSD (conv6, conv7)
        self.conv6 = nn.Conv2d(1280, 1024, kernel_size=3, padding=6, dilation=6)  # atrous convolution
        self.conv7 = nn.Conv2d(1024, 1024, kernel_size=1)

        # Adjust conv4_3 feature map (Bottleneck 3 output, 32 channels) to 512 channels
        self.conv4_3_adjust = nn.Conv2d(32, 512, kernel_size=1)

Result
Total prior boxes: 6600
Localization output shape: torch.Size([1, 6600, 4])
Class scores output shape: torch.Size([1, 6600, 21])

Experiment 3.

Modify the original MobileNetV2 architecture itself so that the base network directly produces the conv4_3 and conv7 feature maps.

class MobileNetV2Base(nn.Module):
    def __init__(self):
        super(MobileNetV2Base, self).__init__()

        # First convolution layer
        self.first_conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU6(inplace=True)
        )

        # Bottleneck layers
        self.bottlenecks = nn.Sequential(
            self._make_stage(32, 24, t=1, n=1),   # Bottleneck 1: 32 -> 24
            self._make_stage(24, 32, t=6, n=2, stride=2),  # Bottleneck 2: 24 -> 32
            self._make_stage(32, 64, t=6, n=3, stride=2),  # Bottleneck 3: 32 -> 64
            self._make_stage(64, 128, t=6, n=4, stride=2),  # Bottleneck 4: 64 -> 128
            self._make_stage(128, 256, t=6, n=3),  # Bottleneck 5: 128 -> 256
            self._make_stage(256, 512, t=6, n=3, stride=1),  # Bottleneck 6: 256 -> 512
            self._make_stage(512, 1024, t=6, n=1)  # Bottleneck 7: 512 -> 1024
        )

    def forward(self, image):
        out = self.first_conv(image)      # [1, 32, 150, 150]
        out = self.bottlenecks[:6](out)   # [1, 512, 19, 19]
        conv4_3_feats = out               # [1, 512, 19, 19]

        out = self.bottlenecks[6:](out)   # rest of the bottlenecks -> [1, 1024, 19, 19]
        conv7_feats = out                 # [1, 1024, 19, 19]

        return conv4_3_feats, conv7_feats


Result
Total prior boxes: 4400
Localization output shape: torch.Size([1, 4400, 4])
Class scores output shape: torch.Size([1, 4400, 21])

Experiment 4.

Same architecture as Experiment 3, but the bottlenecks are split into two groups. If the whole bottleneck stack is kept as one block, the conv4_3 and conv7 feature maps cannot be extracted separately when the model is later exported for the iPhone iOS (real-time object detection) app, so Bottlenecks 1-6 and Bottleneck 7 are wrapped in separate nn.Sequential() modules.

class MobileNetV2Base(nn.Module):
    def __init__(self):
        super(MobileNetV2Base, self).__init__()

        # First convolution layer
        self.first_conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU6(inplace=True)
        )

        # Bottleneck layers with gradual channel increase
        self.bottleneck1_6 = nn.Sequential(
            self._make_stage(32, 24, t=1, n=1),   # Bottleneck 1: 32 -> 24
            self._make_stage(24, 32, t=6, n=2, stride=2),  # Bottleneck 2: 24 -> 32
            self._make_stage(32, 64, t=6, n=3, stride=2),  # Bottleneck 3: 32 -> 64
            self._make_stage(64, 128, t=6, n=4, stride=2),  # Bottleneck 4: 64 -> 128
            self._make_stage(128, 256, t=6, n=3),  # Bottleneck 5: 128 -> 256
            self._make_stage(256, 512, t=6, n=3, stride=1)  # Bottleneck 6: 256 -> 512
        )
        
        self.bottleneck7_rest = nn.Sequential(
            self._make_stage(512, 1024, t=6, n=1)  # Bottleneck 7: 512 -> 1024
        )

    def forward(self, image):
        out = self.first_conv(image)       # [1, 32, 150, 150]
        out = self.bottleneck1_6(out)      # [1, 512, 19, 19]
        conv4_3_feats = out                # [1, 512, 19, 19]

        out = self.bottleneck7_rest(out)   # rest of the bottlenecks -> [1, 1024, 19, 19]
        conv7_feats = out                  # [1, 1024, 19, 19]

        return conv4_3_feats, conv7_feats

Result
Total prior boxes: 4400
Localization output shape: torch.Size([1, 4400, 4])
Class scores output shape: torch.Size([1, 4400, 21])
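
As a rough sanity check, the 4,400 priors reported for Experiments 3 and 4 are consistent with an SSD300-style head in which conv4_3 (19 × 19) gets 4 priors per location, conv7 (19 × 19) gets 6, and the auxiliary maps at 10 × 10, 5 × 5, 3 × 3, and 1 × 1 get 6, 6, 4, and 4 priors. The post doesn't spell out the prior configuration, so treat this breakdown as an assumption:

# Assumed prior-box configuration; only the 4,400 total comes from the results above.
feature_maps = {
    "conv4_3":  (19, 4),
    "conv7":    (19, 6),
    "conv8_2":  (10, 6),
    "conv9_2":  (5, 6),
    "conv10_2": (3, 4),
    "conv11_2": (1, 4),
}
total = sum(size * size * priors for size, priors in feature_maps.values())
print(total)  # 4400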

2) Mean Average Precision (mAP)

Per-class AP and overall mAP for each experiment:

| Class | Exp. 1 | Exp. 2 | Exp. 3 | Exp. 4 |
| --- | --- | --- | --- | --- |
| aeroplane | 63.4 | 66.3 | 71.3 | 70.4 |
| bicycle | 69.6 | 70.2 | 79.9 | 80.6 |
| bird | 49.8 | 47.2 | 63.1 | 65.9 |
| boat | 42.4 | 48.1 | 57.7 | 60.1 |
| bottle | 15.2 | 16.2 | 34.9 | 34.8 |
| bus | 69.7 | 71.1 | 81.4 | 80.2 |
| car | 68.3 | 74.3 | 77.7 | 79.4 |
| cat | 77.2 | 76.0 | 82.2 | 83.2 |
| chair | 33.7 | 34.9 | 54.1 | 53.9 |
| cow | 52.2 | 59.1 | 71.3 | 71.2 |
| diningtable | 59.9 | 63.3 | 73.3 | 73.4 |
| dog | 68.1 | 65.5 | 79.2 | 78.9 |
| horse | 77.2 | 78.3 | 84.4 | 84.4 |
| motorbike | 74.7 | 72.1 | 80.9 | 82.4 |
| person | 59.8 | 63.0 | 73.6 | 73.3 |
| pottedplant | 27.6 | 30.6 | 45.6 | 42.5 |
| sheep | 51.7 | 56.8 | 68.4 | 70.6 |
| sofa | 63.3 | 66.6 | 77.7 | 78.2 |
| train | 76.3 | 74.1 | 82.7 | 83.2 |
| tvmonitor | 50.2 | 56.5 | 65.7 | 65.5 |
| Mean Average Precision (mAP) | 57.5 | 59.5 | 70.3 | 70.6 |
| Total Prior Boxes | 1212 | 6600 | 4400 | 4400 |
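
The per-class numbers above are Pascal VOC-style average precision. The post doesn't state which AP variant was used in evaluation; a sketch of the classic 11-point interpolated AP, given only as an assumed reference, looks like this:

import numpy as np

def voc_ap_11pt(recall, precision):
    """11-point interpolated AP over recall thresholds 0.0, 0.1, ..., 1.0."""
    ap = 0.0
    for t in np.arange(0.0, 1.1, 0.1):
        # Highest precision achieved at any recall >= t (0 if that recall is never reached)
        p = precision[recall >= t].max() if np.any(recall >= t) else 0.0
        ap += p / 11.0
    return ap

# mAP is then the mean of the 20 per-class AP values.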

mAP comparison with the SSD paper

Summary

| Model | Backbone | Dataset | mAP |
| --- | --- | --- | --- |
| SSD300 | VGG16 | Pascal 2007+2012 | 72.4 |
| SSD (MobileNetV2) | MobileNetV2 | Pascal 2007+2012 | 70.6 |

3) FLOPs

FLOPs comparison:
VGG16 기반 SSD: 31,373,537,792 FLOPs (약 31.37 GFLOPs)
MobileNetV2 기반 SSD: 6,945,977,920 FLOPs (약 6.95 GFLOPs)

The MobileNetV2-based SSD requires roughly 4.5× fewer FLOPs, i.e. substantially less computation.
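
The post doesn't say which profiler produced these counts. One common way to obtain them in PyTorch is the thop package (shown here only as an illustration, with the SSD model assumed to be loaded as `model`):

# Illustrative FLOPs/MACs measurement with thop; not necessarily the tool used above.
import torch
from thop import profile

dummy = torch.randn(1, 3, 300, 300)              # SSD300 input
macs, params = profile(model, inputs=(dummy,))   # `model` = the SSD network
print(f"MACs: {int(macs):,}, Params: {int(params):,}")
# Note: thop reports multiply-accumulates; some sources count 1 MAC as 2 FLOPs.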

4) Object Detection in Video Frames

  1. 24 fps video

  2. 36 fps video

  3. 60 fps video
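
These results come from running the detector on each video frame. A minimal sketch of such a per-frame loop with OpenCV (file and variable names are illustrative, and prior decoding/NMS is elided) could look like:

import cv2
import torch

cap = cv2.VideoCapture("input_24fps.mp4")   # hypothetical input file
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Resize to the 300x300 SSD input, convert BGR -> RGB and HWC -> CHW
    img = cv2.cvtColor(cv2.resize(frame, (300, 300)), cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        locs, scores = model(tensor)         # assumes the trained SSD model object
    # ... decode the priors, apply NMS, and draw boxes on `frame` here ...
cap.release()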

5) CoreML iOS Application

I built and ran an iPhone app using Apple's Core ML.
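
The post doesn't show the conversion step itself. A minimal sketch of exporting the traced PyTorch model with coremltools (the model object, preprocessing, and file names here are illustrative assumptions) could look like:

import torch
import coremltools as ct

model.eval()
example_input = torch.rand(1, 3, 300, 300)       # SSD300 input size
traced = torch.jit.trace(model, example_input)   # TorchScript trace for conversion

mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(name="image", shape=example_input.shape, scale=1 / 255.0)],
)
mlmodel.save("SSD_MobileNetV2.mlmodel")          # load this file from the iOS app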

GitHub: https://github.com/PARKYUNSU/SSD_MobilenetV2
