[Object Tracking] Towards Real-Time Multi-Object Tracking: Paper Review

tobigs1415 Image Seminar · June 23, 2021

Tobigs 14th cohort, Sanghyun Kim (김상현)

Paper title: Towards Real-Time Multi-Object Tracking

Introduction

Modern Multiple Object Tracking (MOT) mostly follows the 'tracking-by-detection' paradigm. This approach consists of two separate models, 1) a detection model and 2) an appearance embedding model, which makes it inefficient. For convenience, the authors call this approach Separate Detection and Embedding (SDE).

One way to reduce computation adopts the Faster R-CNN structure. This is a two-stage model: the RPN detects bounding boxes, and the classification part of the Fast R-CNN stage is replaced with an embedding model. The two-stage model requires less computation than SDE, but at about 10 fps it is still too slow for real-time use. In addition, the cost of the second stage (the embedding model) grows with the number of targets.

To make MOT more efficient, the paper introduces JDE, a model that Jointly learns the Detector and Embedding. JDE outputs detection results and the corresponding appearance embeddings simultaneously from a single network. It reaches accuracy comparable to SDE methods while running close to real time: on the MOT-16 test set it achieves MOTA = 64.4% at 20.2 fps. For comparison, the Faster R-CNN + QAN embedding model achieves MOTA = 66.1% at 6 fps.

cf) What is MOTA?
MOTA stands for Multi-Object Tracking Accuracy and is defined as follows.

$$MOTA = 1 - \frac{\sum_t (m_t + fp_t + mme_t)}{\sum_t g_t}$$

$m_t$ : the number of misses at time $t$
$fp_t$ : the number of false positives at time $t$
$mme_t$ : the number of mismatches at time $t$
$g_t$ : the number of all objects at time $t$
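
As a small illustration of the MOTA formula, the sketch below sums the per-frame error counts and divides by the total number of ground-truth objects. The function name `mota` and the example counts are made up for illustration; they do not come from the paper or from any benchmark toolkit.

```python
def mota(misses, false_positives, mismatches, ground_truths):
    """Compute MOTA from per-frame counts (one list entry per frame t)."""
    total_gt = sum(ground_truths)
    if total_gt == 0:
        raise ValueError("MOTA is undefined when there are no ground-truth objects")
    errors = sum(misses) + sum(false_positives) + sum(mismatches)
    return 1.0 - errors / total_gt


# Hypothetical counts for a 3-frame clip (illustrative values only).
score = mota(
    misses=[1, 0, 2],
    false_positives=[0, 1, 0],
    mismatches=[0, 0, 1],
    ground_truths=[10, 10, 10],
)
print(round(score, 3))
```

Because the numerator adds three kinds of errors, MOTA can become negative when the tracker makes more errors than there are ground-truth objects.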

Figure 1 below compares the structures of SDE, the two-stage model, and JDE.

Figure 1. Model Comparison

The contributions of the paper are as follows.

It introduces JDE, a single-shot framework for joint detection and embedding learning. JDE can run in real time with accuracy comparable to state-of-the-art SDE methods.
The JDE design is the result of analysis and experiments on several aspects: training data, network architecture, learning objectives, and optimization strategy.
In experiments with the same training data, JDE is the fastest while performing comparably to combinations of SDE models.
Experiments on MOT-16 show the advantages of JDE in terms of the amount of training data, accuracy, and speed.

Joint Learning of Detection and Embedding

1. Problem Settings

The goal of JDE is to output the locations and appearance embeddings of targets simultaneously in a single forward pass. Assume a training dataset $\{I, B, y\}_{i=1}^N$. Here $I \in \mathbb{R}^{c \times h \times w}$ is an image frame, and $B \in \mathbb{R}^{k \times 4}$ holds the bounding box annotations for the $k$ targets in that frame. $y \in \mathbb{Z}^k$ holds the identity labels, where -1 marks a target that has no identity label. JDE aims to output the predicted bounding boxes $\hat{B} \in \mathbb{R}^{\hat{k} \times 4}$ and the appearance embeddings $\hat{F} \in \mathbb{R}^{\hat{k} \times D}$, where $D$ is the embedding dimension.

The following objectives should therefore be satisfied.

$\hat{B}$ is as close to $B$ as possible.
Given a distance metric $d(\cdot)$, $\forall (k_t, k_{t+\Delta t}, k'_{t + \Delta t})$ that satisfy $y_{k_{t + \Delta t}} = y_{k_t}$ and $y_{k'_{t+\Delta t}} \neq y_{k_t}$, we have $d(f_{k_t}, f_{k_{t+\Delta t}}) < d(f_{k_t}, f_{k'_{t+\Delta t}})$, where $f_{k_t}$ is a row vector from $\hat{F}_t$ and $f_{k_{t+ \Delta t}}$, $f_{k'_{t+ \Delta t}}$ are row vectors from $\hat{F}_{t+\Delta t}$, i.e., embeddings of targets in frames $t$ and $t+\Delta t$, respectively.

The first objective requires the model to detect targets accurately. The second requires that, across consecutive frames, the distance between appearance embeddings of the same identity be smaller than the distance between embeddings of different identities. The distance metric $d(\cdot)$ can be the Euclidean distance or the cosine distance.

2. Architecture Overview

Figure 2. Architecture

The overall architecture uses a Feature Pyramid Network (FPN), as shown in Figure 2 above.

The backbone network produces feature maps at three scales.
The feature maps are up-sampled. Starting from the smallest feature map, each is up-sampled in turn and concatenated with the backbone feature map of the matching scale through a skip connection, giving a U-Net-like structure.
A prediction head is applied to each of the three feature maps obtained during up-sampling.
Each prediction head consists of convolutional layers and outputs a dense prediction map of size (6A + D) x H x W, where A is the number of anchors and D is the embedding dimension. The dense prediction map is divided into three parts:
1) the box classification results of size 2A x H x W
2) the box regression coefficients of size 4A x H x W
3) the dense embedding map of size D x H x W
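
To make the (6A + D) channel layout concrete, here is a minimal sketch that computes which channel ranges would belong to each of the three parts. The helper name `split_head_channels`, the channel ordering (classification, then regression, then embedding), and the example values A = 4 and D = 512 are assumptions for illustration, not details confirmed by this review.

```python
def split_head_channels(num_anchors, embed_dim):
    """Return (start, end) channel ranges for the three parts of the map.

    Assumed ordering: classification (2A), regression (4A), embedding (D).
    """
    cls_end = 2 * num_anchors
    reg_end = cls_end + 4 * num_anchors
    emb_end = reg_end + embed_dim
    return {
        "classification": (0, cls_end),
        "regression": (cls_end, reg_end),
        "embedding": (reg_end, emb_end),
    }


# Illustrative values only: A = 4 anchors, D = 512 embedding channels.
slices = split_head_channels(num_anchors=4, embed_dim=512)
total_channels = slices["embedding"][1]
print(slices, total_channels)
```

Note that the embedding map has D channels regardless of A, so all anchors at a given location share one embedding vector.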

3. Learning to Detect

The detection branch is similar to the RPN in Faster R-CNN, with two modifications.

1) Anchors

Because the goal is to detect pedestrians, the aspect ratio of every anchor is fixed to 1:3. The anchor scales range from $11 \approx 8 \times 2^{\frac{1}{2}}$ to $512 = 8 \times 2^{\frac{12}{2}}$, which gives 12 anchor templates in total.
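
The anchor scales described above follow the pattern $8 \times 2^{k/2}$ for $k = 1, \dots, 12$. The sketch below enumerates them; the function names are hypothetical, and reading the 1:3 aspect ratio as width:height (tall boxes for pedestrians) is an assumption.

```python
def anchor_widths(base=8.0, num_scales=12):
    """Scales 8 * 2^(k/2) for k = 1..num_scales."""
    return [base * 2 ** (k / 2) for k in range(1, num_scales + 1)]


def anchor_boxes(widths, height_per_width=3.0):
    """Pair each width with a height, assuming a 1:3 width:height ratio."""
    return [(w, height_per_width * w) for w in widths]


widths = anchor_widths()
boxes = anchor_boxes(widths)
print([round(w, 1) for w in widths])
```

The first entry rounds to 11 and the last is exactly 512, matching the two endpoints quoted in the text.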

2) Select foreground/background

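Thresholds different from the usual ones are used to decide which samples are foreground (objects) and which are background. An anchor whose IoU with a ground-truth box is 0.5 or higher is treated as foreground, and one whose IoU is 0.4 or lower is treated as background. These thresholds were chosen through visualization.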
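
The thresholding rule can be sketched as follows. The box format `(x1, y1, x2, y2)`, the function names, and the choice to mark anchors that fall between the two thresholds as "ignored" are assumptions for illustration; the text above only specifies the 0.5 and 0.4 cut-offs.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, fg_thresh=0.5, bg_thresh=0.4):
    """Label an anchor by its best IoU with any ground-truth box."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= fg_thresh:
        return "foreground"
    if best <= bg_thresh:
        return "background"
    return "ignored"  # assumption: in-between anchors do not contribute


anchor = (0.0, 0.0, 10.0, 30.0)
print(label_anchor(anchor, [(0.0, 0.0, 10.0, 30.0)]))
print(label_anchor(anchor, [(100.0, 100.0, 110.0, 130.0)]))
print(label_anchor(anchor, [(0.0, 0.0, 10.0, 13.5)]))
```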

The objective function of the detection branch is the same as that of the RPN. It has two terms: $L_{\alpha}$ for foreground/background classification and $L_{\beta}$ for bounding box regression.
$L_{\alpha}$ : cross-entropy loss
$L_{\beta}$ : smooth-L1 loss
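
For reference, the two detection losses can be written per element as in the sketch below. This is a generic textbook form, not the authors' implementation; the `beta = 1.0` transition point for smooth-L1 is a commonly used default and is assumed here.

```python
import math


def cross_entropy(prob_true_class):
    """Cross-entropy for one sample, given the predicted probability
    of its true class (foreground or background)."""
    return -math.log(prob_true_class)


def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1: quadratic for small errors, linear for large ones."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


print(cross_entropy(0.9))     # small loss for a confident correct prediction
print(smooth_l1(0.5, 0.0))    # quadratic region
print(smooth_l1(3.0, 0.0))    # linear region
```

The linear region keeps large regression errors from dominating the gradient, which is the usual motivation for preferring smooth-L1 over a plain squared error.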

4. Learning Appearance Embeddings

A triplet loss can be used to learn an embedding space in which instances of the same identity lie close together and instances of different identities lie far apart. The triplet loss is defined as follows.

$$L_{triplet} = \max(0, f^T f^{-} - f^T f^{+})$$

$f$ : instance in a mini-batch selected as the anchor
$f^{+}$ : positive sample
$f^{-}$ : negative sample

Using this naive triplet loss raises a few problems.

1) Huge sampling space

The first problem is the huge sampling space: considering every combination of positive and negative samples makes the sampling space far too large. To address this, only the hardest positive sample is used during training. The hardest positive sample is the positive that is most difficult to recognize as a positive, that is, the one farthest from the anchor. The loss then becomes:

$$L_{triplet} = \sum_{i} \max(0, f^T f_i^{-} - f^T f^{+})$$

where $f^{+}$ is the hardest positive sample in a mini-batch
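
The hardest-positive variant of the triplet loss can be sketched in plain Python as follows. With dot-product similarity, "farthest" is interpreted here as "lowest similarity to the anchor"; that interpretation, the function names, and the toy vectors are assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def triplet_loss_hardest_positive(anchor, positives, negatives):
    """Sum over negatives of max(0, f.f_i^- - f.f^+), where f^+ is the
    positive with the lowest similarity to the anchor."""
    hardest_pos_sim = min(dot(anchor, p) for p in positives)
    return sum(max(0.0, dot(anchor, n) - hardest_pos_sim) for n in negatives)


# Toy 2-D embeddings (illustrative values only).
anchor = [1.0, 0.0]
positives = [[0.9, 0.1], [0.6, 0.8]]   # similarities 0.9 and 0.6
negatives = [[0.8, 0.6], [0.0, 1.0]]   # similarities 0.8 and 0.0
loss = triplet_loss_hardest_positive(anchor, positives, negatives)
print(loss)
```

Only the first negative contributes here, because it is more similar to the anchor than the hardest positive is.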

2) Unstable training and slow convergence

Training with the triplet loss can be unstable and slow to converge. To stabilize training and speed up convergence, a smooth upper bound of the triplet loss is optimized instead.

$$L_{upper} = \log\left(1 + \sum_{i} \exp(f^T f_i^{-} - f^T f^{+})\right)$$

This can be rewritten as:

$$L_{upper} = -\log\frac{\exp(f^T f^{+})}{\exp(f^T f^{+}) + \sum_{i} \exp(f^T f_i^{-})}$$
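
The step between the two forms of $L_{upper}$ is short; spelled out with the same symbols, it only uses $\exp(a - b) = \exp(a) / \exp(b)$ and $\log(x / y) = -\log(y / x)$:

$$
\begin{aligned}
L_{upper} &= \log\left(1 + \sum_{i} \exp(f^T f_i^{-} - f^T f^{+})\right) \\
&= \log\left(1 + \frac{\sum_{i} \exp(f^T f_i^{-})}{\exp(f^T f^{+})}\right) \\
&= \log\frac{\exp(f^T f^{+}) + \sum_{i} \exp(f^T f_i^{-})}{\exp(f^T f^{+})} \\
&= -\log\frac{\exp(f^T f^{+})}{\exp(f^T f^{+}) + \sum_{i} \exp(f^T f_i^{-})}
\end{aligned}
$$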

This closely resembles the cross-entropy loss.

$$L_{CE} = -\log\frac{\exp(f^T g^{+})}{\exp(f^T g^{+}) + \sum_{i} \exp(f^T g_i^{-})}$$

$g^{+}$ : class-wise weight of the positive class
$g_i^{-}$ : class-wise weights of the negative classes

$L_{upper}$ and $L_{CE}$ differ in two ways.

The cross-entropy loss uses learnable class-wise weights as proxies for class instances instead of using the instance embeddings directly.
The cross-entropy loss involves all negative classes when computing the loss, whereas $L_{upper}$ uses only the sampled negative instances.

After considering the three losses above and comparing them experimentally, the authors chose the cross-entropy loss.

During training, the embedding vectors extracted by the network pass through a shared fully-connected layer to become class-wise logits, and the cross-entropy loss is applied to those logits. At inference time, the extracted embedding vectors themselves are used to track objects.

5. Automatic Loss Balancing

$$L_{total} = \sum_{i}^{M} \sum_{j = \alpha, \beta, \gamma} \frac{1}{2} \left(\frac{1}{e^{s_j^i}} L_j^i + s_j^i\right)$$

$s_j^i$ : task-dependent uncertainty for each individual loss
$L_j^i$ : loss $j$ at the $i$-th scale
$M$ : the number of scales

$L_{\alpha}, L_{\beta}, L_{\gamma}$ : classification loss, bounding box regression loss, embedding loss

$s_j^i$ is a learnable parameter that is optimized together with the network during training, so the losses are balanced automatically.
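
The weighted sum can be sketched numerically as below. In real training each uncertainty term would be a learnable parameter updated by the optimizer; here they are plain floats so that the arithmetic is easy to follow. The function name `total_loss`, the dictionary layout, and the example values are hypothetical.

```python
import math


def total_loss(losses, uncertainties):
    """Sum over scales i and tasks j of 0.5 * (exp(-s) * L + s).

    losses[i][j] and uncertainties[i][j] are indexed by scale i and
    task name j ("alpha", "beta", "gamma").
    """
    total = 0.0
    for scale_losses, scale_s in zip(losses, uncertainties):
        for task, loss in scale_losses.items():
            s = scale_s[task]
            total += 0.5 * (math.exp(-s) * loss + s)
    return total


# Three scales, all task losses 1.0, all uncertainties 0.0 (illustrative).
losses = [{"alpha": 1.0, "beta": 1.0, "gamma": 1.0} for _ in range(3)]
uncertainties = [{"alpha": 0.0, "beta": 0.0, "gamma": 0.0} for _ in range(3)]
balanced = total_loss(losses, uncertainties)
print(balanced)
```

A larger uncertainty shrinks the weight `exp(-s)` on its loss, while the additive `s` term penalizes making the uncertainty arbitrarily large.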

6. Online Association

Although it is not the main focus of the paper, the authors also introduce a simple and fast online association method that works with JDE.

A tracklet is described by an appearance state $e_i$ and a motion state $m_i = (x, y, \gamma, h, \dot{x}, \dot{y}, \dot{\gamma}, \dot{h})$. Here $x, y$ is the bounding box center position, $h$ is the bounding box height, $\gamma$ is the aspect ratio, and $\dot{x}$ is the velocity in the $x$ direction.

The tracklet appearance $e_i$ is initialized with $f^0_i$, the appearance embedding of the first observation.
A tracklet pool is maintained that contains all reference tracklets an observation might be associated with.
For each incoming frame, a pair-wise motion affinity matrix $A_m$ and an appearance affinity matrix $A_e$ are computed between all observations and the tracklets in the pool. The appearance affinity uses cosine similarity, and the motion affinity uses the Mahalanobis distance.
The linear assignment problem is then solved with the Hungarian algorithm using the cost matrix $C = \lambda A_e + (1-\lambda)A_m$.
The motion state is updated by a Kalman filter, and the appearance state is updated as follows.
$e_i^t = \alpha e^{t-1}_i + (1-\alpha)f^t_i$
where $f^t_i$ is the appearance embedding of the current matched observation, and $\alpha = 0.9$ is a momentum term.
Finally, an observation that is not assigned to any tracklet starts a new tracklet, and a tracklet that has not been updated in the most recent 30 frames is terminated.
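
The association steps above can be sketched for a toy case as follows. Several simplifications are assumed: the matrix entries are treated as costs (cosine distance instead of similarity), the motion costs are hypothetical pre-computed numbers rather than Mahalanobis distances from a Kalman filter, $\lambda = 0.5$ is an arbitrary choice, and a brute-force search over permutations stands in for the Hungarian algorithm (it gives the same optimal assignment but only scales to tiny inputs).

```python
import math
from itertools import permutations


def cosine_distance(u, v):
    """1 - cosine similarity, so that smaller means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def fuse_costs(appearance_cost, motion_cost, lam):
    """C = lam * A_e + (1 - lam) * A_m, element-wise."""
    return [
        [lam * a + (1.0 - lam) * m for a, m in zip(row_a, row_m)]
        for row_a, row_m in zip(appearance_cost, motion_cost)
    ]


def assign(cost):
    """Optimal one-to-one assignment for a small square cost matrix.

    Brute force over permutations (stand-in for the Hungarian algorithm).
    Entry i of the result is the observation index given to tracklet i.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best)


def update_appearance(e_prev, f_obs, alpha=0.9):
    """e^t = alpha * e^(t-1) + (1 - alpha) * f^t."""
    return [alpha * e + (1.0 - alpha) * f for e, f in zip(e_prev, f_obs)]


# Two tracklets and two observations with toy 2-D embeddings.
tracklets = [[1.0, 0.0], [0.0, 1.0]]
observations = [[0.1, 0.9], [0.9, 0.1]]
appearance_cost = [[cosine_distance(e, f) for f in observations] for e in tracklets]
motion_cost = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical, already normalized
cost = fuse_costs(appearance_cost, motion_cost, lam=0.5)
matches = assign(cost)
tracklets = [update_appearance(e, observations[j]) for e, j in zip(tracklets, matches)]
print(matches, tracklets)
```

With $\alpha = 0.9$ the appearance state moves only slightly toward each new observation, which smooths out noisy embeddings from individual frames.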

The online association method proposed in the paper performs better than SORT. The experimental results are shown in Figure 3 below.

Figure 3. Comparison of online association methods

Experiments

Figure 4. Comparison of embedding losses and weighting strategies

As Figure 4 shows, and as described above, cross-entropy gives the best performance as the embedding loss, and among the weighting strategies the uncertainty-based automatic balancing performs best.

Figure 5. Comparison with SOTA methods

Figure 5 shows that JDE is much faster than the SOTA object tracking methods of the time while achieving competitive accuracy.

Conclusion

The paper proposes JDE, which performs detection and embedding jointly. Compared with the SOTA methods of the time, the model achieves competitive accuracy at close to real-time speed.

Follow-up work

FairMOT

Paper title: FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking

This paper was proposed to address the following problems.

Unfairness Caused by Anchors: multiple anchor boxes can correspond to the same object, which creates ambiguity for re-ID.
Unfairness Caused by Features: low-level and high-level features have to be learned at the same time.
Unfairness Caused by Feature Dimension: learning high-dimensional re-ID features can hurt object detection performance, and there is also a risk of overfitting.

The figure below shows the difference between existing anchor-based models and FairMOT, which is anchor-free.

Figure 6. Anchor-based vs. anchor-free

Figure 7. FairMOT Architecture

Figure 7 above shows the network structure of FairMOT.

The backbone network is Deep Layer Aggregation (DLA) built on ResNet-34, so it captures multi-layer features.

The detection branch produces a heatmap through pixel-wise logistic regression rather than classification.

Figure 8. FairMOT performance

As Figure 8 shows, FairMOT achieves state-of-the-art performance on most benchmarks.

References

Towards Real-Time Multi-Object Tracking paper: https://arxiv.org/abs/1909.12605

FairMOT paper: FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking

https://gazelle-and-cs.tistory.com/29

https://eehoeskrap.tistory.com/447

4 comments

JOY

June 29, 2021

Tobigs 14th cohort, Hyerim Jang (장혜림)

This session was a review of the paper Towards Real-Time Multi-Object Tracking, presented by Sanghyun Kim.

JDE

MOT models generally consist of a detection model and an appearance embedding model. Such SDE (Separate Detection and Embedding) models suffer from inefficiency, and JDE (Jointly learns the Detector and Embedding) is proposed to address this.
JDE performs detection and appearance embedding with a single network. It is faster than existing models while showing similar performance.
The architecture uses a Feature Pyramid Network (FPN) to obtain a feature map at each scale and applies a prediction head to each. The resulting dense prediction map is used for box classification, box regression, and embedding.
As for the loss functions, box classification uses cross-entropy, box regression uses the smooth L1 loss, and embedding uses a triplet loss. The triplet loss uses hard positive sampling, and a smooth upper bound of the triplet loss is used to stabilize training and speed up convergence.
Online association is then used as the association method: a cost matrix that combines motion affinity and appearance affinity is solved with the Hungarian algorithm.

FairMOT

JDE performs detection with anchor boxes. However, having multiple anchor boxes causes ambiguity in the re-ID process. To solve this, FairMOT uses an anchor-free, heatmap-based approach.
Its architecture applies Deep Layer Aggregation to ResNet and is similar to UNet++.

The talk introduced JDE, whose fps is much higher than that of other MOT models, and the detailed explanation helped me a lot in understanding it! FairMOT, introduced as follow-up work, was also very interesting. Thank you for the great lecture :)

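tobigs1415 Image Seminar

June 30, 2021

Tobigs 14th cohort, Minkyung Kim (김민경)

JDE (Jointly learns the Detector and Embedding) makes MOT more efficient by outputting detection results and the corresponding appearance embeddings simultaneously from a single network.
JDE has to satisfy two objectives: 1) detect targets accurately, and 2) across consecutive frames, keep the distance between appearance embeddings of the same label smaller than the distance to embeddings of other labels.
The overall architecture uses a Feature Pyramid Network (FPN), so predictions are made at multiple scales. For detection, the detection branch is built by slightly modifying the RPN of Faster R-CNN.
A triplet loss was considered for learning the embedding space. However, the naive triplet loss suffers from 1) a huge sampling space and 2) unstable training and slow convergence, so the cross-entropy loss is used instead.

Unlike existing MOT methods, the JDE proposed in this paper performs detection and embedding at the same time. It seems simple and fast to apply to real-time MOT, which makes it a very useful method. Thank you for the informative lecture :)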

Ara Seo (서아라) (ELTEC College of Engineering, Division of Mechanical and Biomedical Engineering)

June 30, 2021

Tobigs 14th cohort, Ara Seo (서아라)

This week we reviewed object tracking papers, and Sanghyun Kim presented Towards Real-Time Multi-Object Tracking.

Modern Multiple Object Tracking (MOT) mostly follows the 'tracking-by-detection' paradigm. This approach consists of two models, a detection model and an appearance embedding model, which causes efficiency problems.
To make MOT more efficient, the paper introduces the JDE (Jointly learns the Detector and Embedding) method.
The first objective of JDE is to detect targets accurately. The second is that, across consecutive frames, the distance between appearance embeddings of the same label is smaller than the distance to embeddings of other labels.
The overall architecture uses a Feature Pyramid Network (FPN): the backbone network produces feature maps at three scales, the feature maps are up-sampled, and a prediction head is applied to each of the three feature maps obtained during up-sampling.
The resulting dense prediction map is divided into three parts: the box classification results, the box regression coefficients, and the dense embedding map.
triplet loss: a loss for learning an embedding space in which instances of the same identity lie close together and instances of different identities lie far apart
- However, the triplet loss suffers from a huge sampling space and from unstable training and slow convergence.
- The cross-entropy loss is therefore used to resolve these problems.

This was my first encounter with an MOT paper, and the detailed review of Towards Real-Time Multi-Object Tracking taught me a lot! The additional review of FairMOT also helped me understand the direction of MOT research. Thank you very much for the great review :)

tbogan

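March 11, 2024

One strength of the content is its thorough study and comparison of multi-object tracking systems. The work presents a distinctive methodology, JDE (Jointly learns the Detector and Embedding), and evaluates it comprehensively against recent methods. This rigorous evaluation helps researchers and practitioners understand the strengths and weaknesses of different approaches and improve MOT technology.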