DEtection TRansformer

gilson · December 20, 2024

Object Detection

Goal : predict a set of category labels and bounding boxes for the objects of interest. If each label and its bounding box are treated as one element of a set, the object detection problem turns into a set prediction problem.

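Issues to resolve :

How should the set prediction loss be defined?
- Set prediction is permutation-invariant: the predictions carry no inherent order. So how should cardinality prediction be handled?
- The object loss is a class loss plus a bounding-box (position, size) loss.
How should a set of objects be predicted, and how should the relations among them be modeled?
- Use a Transformer.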

Contribution

DETR treats object detection as a direct set prediction problem.
End-to-end prediction becomes possible.

Set Prediction Problem

Cardinality Prediction + Element Prediction

DETR outputs a set of N objects (class, bbox), one per object query. Because this set has no fixed order, the loss can only be computed once we know which ground-truth (GT) element each prediction corresponds to, so an optimal matching between the predicted set and the GT set has to be found first. In other words, cardinality prediction must be carried out first. This step is bipartite matching.

This paradigm involves matching elements between the set of predictions and the set of labels, such that the sum of the errors between the matches is minimized rather than the sum of individual box errors. The optimal set of matches is computed by a bipartite matching algorithm called the Hungarian algorithm.

The Hungarian algorithm operates on a cost matrix that stores the dissimilarity between all elements of both sets. It’s important to note that not every element matches its best correspondence, as the bipartite algorithm optimizes the sum of the matches.
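As a small illustration of that last point, the sketch below compares a greedy row-by-row match with the assignment that minimizes the total cost. The 3x3 cost matrix is made up for illustration, and the exhaustive search over permutations only stands in for the Hungarian algorithm, which reaches the same optimum in polynomial time.

```python
from itertools import permutations

# Hypothetical cost matrix: cost[i][j] = dissimilarity between label i and prediction j.
cost = [
    [1.0, 2.0, 9.0],
    [1.5, 8.0, 9.0],
    [9.0, 9.0, 3.0],
]

def greedy_match(cost):
    """Each label, in order, takes its cheapest prediction that is still free."""
    free = set(range(len(cost[0])))
    match = []
    for row in cost:
        j = min(free, key=lambda k: row[k])
        free.remove(j)
        match.append(j)
    return match

def optimal_match(cost):
    """Exhaustive search for the permutation with the minimum total cost."""
    n = len(cost)
    return list(min(permutations(range(n)),
                    key=lambda p: sum(cost[i][p[i]] for i in range(n))))

def total(cost, match):
    return sum(cost[i][j] for i, j in enumerate(match))

greedy = greedy_match(cost)
best = optimal_match(cost)
print(greedy, total(cost, greedy))  # [0, 1, 2] 12.0
print(best, total(cost, best))      # [1, 0, 2] 6.5
```

Label 0 ends up with prediction 1 (cost 2.0) although prediction 0 (cost 1.0) is its individual best, because giving prediction 0 to label 1 lowers the sum.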

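Stage-1 : Build the optimal bipartite matching between the predicted set and the GT set

The optimal matching is the one whose matching cost, summed over all matched pairs, is smallest. Since each element consists of a class and a bbox (position + size), the optimal matching is the one that minimizes the sum of these errors. However, with 100 predictions there are 100! possible ways to match the predicted set to the GT set. Enumerating that many cases is infeasible, so the task is formulated as an assignment problem and solved with the Hungarian algorithm.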
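To get a feel for the numbers, the following back-of-the-envelope sketch compares the size of the brute-force search space with a rough operation count for the Hungarian algorithm. The cubic count is the usual O(N^3) bound, used here only as an order-of-magnitude estimate.

```python
import math

N = 100  # number of predictions, as in the text

n_permutations = math.factorial(N)   # candidate assignments for a brute-force search
digits = len(str(n_permutations))    # how many decimal digits 100! has
hungarian_ops = N ** 3               # rough O(N^3) cost of the Hungarian algorithm

print(digits)         # 158
print(hungarian_ops)  # 1000000
```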

Optimal bipartite matching between the ground-truth set (N) and the predicted set (N)

N : number of object queries (chosen larger than the number of objects in an image), 100

$y$ : the ground-truth set of objects, padded with $\phi$ (no object) to size N

$\hat y$ : the set of N predictions

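The matching cost gets smaller as the predicted probability of the GT class gets higher and as the loss between the predicted bbox and the GT bbox gets smaller. In other words,

the goal is to find the permutation of elements (class + bbox) that minimizes $\mathcal{L}_{match}$.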

$$
\begin{aligned}
\hat \sigma &= \underset{\sigma \in \mathfrak{S}_N}{\mathrm{argmin}} \sum_{i=1}^{N} \mathcal{L}_{match}(y_i,\hat y_{\sigma(i)}) \\
\mathcal{L}_{match}(y_i,\hat y_{\sigma(i)}) &= -\mathbb{1}_{\{c_i \neq\phi\}} \hat p_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq\phi\}}\mathcal{L}_{box}(b_i,\hat b_{\sigma(i)}) \\
\mathcal{L}_{box}(b_i,\hat b_{\sigma(i)}) &= \lambda_{iou}\mathcal{L}_{iou}(b_i,\hat b_{\sigma(i)})+\lambda_{L1}\|b_i-\hat b_{\sigma(i)}\|_1 \\
\mathcal{L}_{iou}(b_i,\hat b_{\sigma(i)}) &= 1-\left(\frac{|b_i\cap\hat b_{\sigma(i)}|}{|b_i\cup\hat b_{\sigma(i)}|}-\frac{|B(b_i,\hat b_{\sigma(i)})\setminus (b_i \cup \hat b_{\sigma(i)})|}{|B(b_i,\hat b_{\sigma(i)})|}\right)
\end{aligned}
$$

$\mathcal{L}_{match}(y_i,\hat y_{\sigma(i)})$ : pair-wise matching cost between ground truth $y_i$ and the prediction with index $\sigma(i)$, i.e. one entry of the cost matrix

$\sigma(i)$ : index of the prediction assigned to the i-th GT element under the permutation $\sigma$

$\hat \sigma$ : the optimal permutation, i.e. the assignment with the lowest total matching cost

$\hat p_{\sigma(i)}(c_i)$ : probability that prediction $\sigma(i)$ assigns to class $c_i$

$\hat b_{\sigma(i)}$ : predicted box matched to $b_i$

$\lambda_{iou},\lambda_{L1} \in \mathbb{R}$ are hyperparameters

$\mathcal{L}_{iou}(b_i,\hat b_{\sigma(i)})$ is the generalized IoU loss, which is scale-invariant

In short, the goal is to produce the optimal pairing with the Hungarian algorithm.

$y_i=(c_i,b_i)$ : the i-th GT element (target class, bbox [center, width, height])

$\hat y_i=(\hat c_i, \hat b_i)$ : the i-th predicted element (predicted class, bbox [center, width, height])

$|\cdot|$ : area; the areas of unions and intersections are computed by min / max of linear functions of the coordinates of $b_i$ and $\hat b_{\sigma(i)}$

$B(b_i, \hat b_{\sigma(i)})$ : smallest box enclosing both $b_i$ and $\hat b_{\sigma(i)}$

The L1 loss grows with the size of the bounding box, so the IoU loss between the boxes is added, as in $\mathcal{L}_{box}$ above, to compensate. In short, the Hungarian algorithm is used to find $\hat{\sigma}$, the pairing of GT and predictions that minimizes the total $\mathcal{L}_{match}$ (the cost matrix). scipy.optimize.linear_sum_assignment can be used for this.
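Stage-1 can be sketched end to end in plain Python as follows. This is a minimal sketch under stated assumptions, not the reference implementation: the boxes and probabilities are invented, the weights `LAMBDA_IOU` and `LAMBDA_L1` are illustrative values, the helper names are hypothetical, and an exhaustive search over permutations replaces the Hungarian solver so that the block needs no third-party package. On a realistic N, scipy.optimize.linear_sum_assignment applied to the same cost matrix would take the place of `best_permutation`.

```python
from itertools import permutations

NO_OBJECT = None                   # stands for the "no object" class (phi)
LAMBDA_IOU, LAMBDA_L1 = 2.0, 5.0   # illustrative weights (assumption)

def to_corners(box):
    """(cx, cy, w, h) -> (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def area(c):
    return max(0.0, c[2] - c[0]) * max(0.0, c[3] - c[1])

def giou_loss(b, b_hat):
    """1 - GIoU for two boxes with non-zero area."""
    a, p = to_corners(b), to_corners(b_hat)
    inter = area((max(a[0], p[0]), max(a[1], p[1]), min(a[2], p[2]), min(a[3], p[3])))
    union = area(a) + area(p) - inter
    hull = area((min(a[0], p[0]), min(a[1], p[1]), max(a[2], p[2]), max(a[3], p[3])))
    return 1.0 - (inter / union - (hull - union) / hull)

def box_loss(b, b_hat):
    l1 = sum(abs(u - v) for u, v in zip(b, b_hat))
    return LAMBDA_IOU * giou_loss(b, b_hat) + LAMBDA_L1 * l1

def match_cost(target, pred):
    c, b = target
    if c is NO_OBJECT:             # the indicator 1{c_i != phi} switches both terms off
        return 0.0
    probs, b_hat = pred
    return -probs[c] + box_loss(b, b_hat)

def best_permutation(targets, preds):
    """Brute-force stand-in for the Hungarian algorithm (small N only)."""
    n = len(targets)
    cost = [[match_cost(t, p) for p in preds] for t in targets]
    return min(permutations(range(n)),
               key=lambda s: sum(cost[i][s[i]] for i in range(n)))

# Two GT objects (class 0 and class 1), padded with one "no object" slot to N = 3.
targets = [(0, (0.25, 0.25, 0.2, 0.2)), (1, (0.7, 0.7, 0.3, 0.3)), (NO_OBJECT, None)]
preds = [
    ([0.1, 0.8, 0.1], (0.72, 0.69, 0.3, 0.28)),   # resembles the class-1 object
    ([0.05, 0.05, 0.9], (0.5, 0.5, 0.1, 0.1)),    # resembles "no object"
    ([0.7, 0.2, 0.1], (0.26, 0.24, 0.2, 0.22)),   # resembles the class-0 object
]
sigma_hat = best_permutation(targets, preds)
print(sigma_hat)  # (2, 0, 1)
```

Here `sigma_hat[i]` is the index of the prediction assigned to the i-th GT element, so the padded slot absorbs the prediction that matches no real object.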

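Stage-2 : Once the bipartite matching (the optimal permutation $\hat {\sigma}$) is fixed, compute the loss function over the elements (label + bbox)

Based on the optimal permutation $\hat \sigma$ found by the Hungarian algorithm, the network is updated in the direction that decreases the total loss.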

Hungarian Loss

$$
\mathcal{L}_{Hungarian}(y,\hat y)=\sum_{i=1}^N \left[-\log \hat p_{\hat \sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq\phi\}}\mathcal{L}_{box}(b_i,\hat b_{\hat \sigma(i)})\right]
$$
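The Hungarian loss can be evaluated for a given matching roughly as below. This is a sketch under assumptions: the inputs are invented, the names `hungarian_loss` and `l1_box_loss` are hypothetical, and a plain L1 distance stands in for the full $\mathcal{L}_{box}$ to keep the block short.

```python
import math

NO_OBJECT = None  # stands for the "no object" class (phi)

def l1_box_loss(b, b_hat):
    """Simplified stand-in for L_box (assumption: L1 term only)."""
    return sum(abs(u - v) for u, v in zip(b, b_hat))

def hungarian_loss(targets, preds, sigma_hat, box_loss=l1_box_loss):
    """targets[i] = (class, box); preds[j] = (class -> probability, box)."""
    total = 0.0
    for i, (c, b) in enumerate(targets):
        probs, b_hat = preds[sigma_hat[i]]
        total += -math.log(probs[c])      # class term, also applied to phi slots
        if c is not NO_OBJECT:            # box term only for real objects
            total += box_loss(b, b_hat)
    return total

targets = [(0, (0.25, 0.25, 0.2, 0.2)), (NO_OBJECT, None)]
preds = [
    ({0: 0.1, NO_OBJECT: 0.9}, (0.5, 0.5, 0.1, 0.1)),
    ({0: 0.8, NO_OBJECT: 0.2}, (0.25, 0.25, 0.2, 0.2)),
]
good = hungarian_loss(targets, preds, sigma_hat=(1, 0))
bad = hungarian_loss(targets, preds, sigma_hat=(0, 1))
print(good < bad)  # True
```

With the matching (1, 0) the loss reduces to -log(0.8) - log(0.9), whereas the swapped matching pays for both a low class probability and a distant box.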

Architecture

Backbone : ResNet-50

$x_{img}$ -> Backbone => $f$ (Feature Map)

$$
\begin{gathered}
x_{img} \in \mathbb{R}^{3 \times H_0 \times W_0} \\
f \in \mathbb{R}^{C \times H \times W}, \quad C=2048, \quad H=\frac{H_0}{32}, \quad W=\frac{W_0}{32}
\end{gathered}
$$

Transformer Encoder

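The encoder can be seen as the part that understands the image. In MHSA (multi-head self-attention), every image feature vector is analyzed against every other one regardless of the distance between them, and as a result the encoder learns which grid cells in the image belong to objects and which belong to the background.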

$$
z_0 \in \mathbb{R}^{d \times H \times W}
$$
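The shapes above can be traced with a few lines of arithmetic. In this sketch the function name `detr_shapes` is hypothetical, the input size is arbitrary, `d=256` is an assumed model width that the text does not state, and the image sides are assumed to be multiples of 32. The sequence length reflects the assumption that the spatial grid is flattened into H*W tokens before entering the encoder.

```python
def detr_shapes(h0, w0, c=2048, d=256, stride=32):
    """Tensor shapes from input image to encoder input (d is an assumed value)."""
    h, w = h0 // stride, w0 // stride
    return {
        "image": (3, h0, w0),
        "backbone_feature": (c, h, w),   # f
        "encoder_input": (d, h, w),      # z_0
        "sequence_length": h * w,        # tokens seen by self-attention
    }

shapes = detr_shapes(800, 1216)
for name, shape in shapes.items():
    print(name, shape)
```

For an 800 x 1216 image this gives a 25 x 38 feature grid, i.e. 950 tokens attending to one another in the encoder.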

Transformer Decoder

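The decoder plays two main roles.

Understanding the object queries through Multi-Head Self-Attention

In the decoder's MHSA, the object queries examine their relations to one another and thereby build up an understanding of the objects. It may seem pointless to feed only object queries into MHSA, but comparing objects with one another exposes what they share and where they differ, which deepens the understanding of each object. For example, comparing an elephant object with a giraffe object reveals that the elephant, like the giraffe, has four legs, but also has big ears, a long trunk, and a heavy build.

Understanding the relation between the object queries and the image feature vectors through Multi-Head Attention (Cross-Attention)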

Query : the entity that is influenced

Key : the entity that exerts the influence

Value : the entity to which the attention scores, computed by the matmul of Q and K, are applied.

Put in words, cross-attention "examines how each object query is influenced by the image features and how strongly it is related to them."
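The Query / Key / Value roles can be made concrete with a single-head scaled dot-product attention in plain Python. This is a simplified sketch: it leaves out the learned projections, multiple heads, and positional encodings of the real model, and the vectors are invented for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d)) V, one query row at a time."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)               # how much each key influences this query
        weights.append(w)
        outputs.append([sum(wj * v[t] for wj, v in zip(w, values))
                        for t in range(len(values[0]))])
    return outputs, weights

object_queries = [[4.0, 0.0], [0.0, 4.0]]               # Query side
image_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # Key and Value side
outputs, weights = attention(object_queries, image_features, image_features)
for w in weights:
    print([round(x, 3) for x in w])
```

Each row of `weights` sums to 1 and says how strongly one object query draws on each image feature. Calling `attention(x, x, x)` with the object queries in all three positions gives the self-attention case described earlier.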
