[paper review] Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

JOY · June 1, 2021

paper-review pose estimation

Introduction

Multi-person pose estimation

Top-down approach
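First detect each person, then estimate the pose inside each bounding box
- Pro: higher accuracy than the bottom-up approach
- Con: pose is estimated separately for every person -> high computational cost, slow
Bottom-up approach
First predict the keypoints of everyone in the image, then analyze the relationships between keypoints to assemble the poses
- Pro: no person-detection step, so the computational cost is lower -> applicable in real time
- Con: lower accuracy than the top-down approach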

OpenPose (focusing on the CVPR 2017 paper)

Terminology
- part (also joint, keypoint): a body joint
- limb (also part pair, part connection): the connection between two parts
(Note that some pairs are not actual anatomical connections,
e.g. the nose - left eye connection)

Network

An input image is passed through a convolutional network such as VGG-19 to obtain a feature map ( $F$ )
$F$ is used as the input to two branches
One branch predicts the part confidence maps, the other predicts the part affinity fields
The outputs of both branches are concatenated with the feature map $F$ that was fed to stage 1, and the result is used as the input to the next stage
-> when estimating the confidence maps, the current stage uses the affinity fields from the previous stage; likewise, the affinity field estimate uses the previous confidence maps
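The stage wiring described above can be sketched as follows. This is a minimal, framework-free sketch, not the authors' network: feature maps are represented as plain lists of channels, and `run_stages`, `toy_conf_branch`, and `toy_paf_branch` are hypothetical names standing in for the convolutional branches. The channel counts are arbitrary toy values.

```python
def run_stages(features, stages):
    """Run a multi-stage, two-branch predictor.

    features: list of channels from the backbone (F); a hypothetical stand-in.
    stages:   list of (conf_branch, paf_branch) callables, one pair per stage.
    Returns the (confidence_maps, pafs) predicted at every stage.
    """
    stage_input = features                  # stage 1 sees only F
    outputs = []
    for conf_branch, paf_branch in stages:
        conf = conf_branch(stage_input)     # confidence maps of this stage
        paf = paf_branch(stage_input)       # affinity fields of this stage
        outputs.append((conf, paf))
        # The next stage sees F concatenated with both predictions.
        stage_input = features + conf + paf
    return outputs


# Toy usage: channels are just labels; the branch records its input size.
seen_input_sizes = []

def toy_conf_branch(x):
    seen_input_sizes.append(len(x))
    return ["S"] * 3      # pretend there are 3 part channels

def toy_paf_branch(x):
    return ["L"] * 4      # pretend there are 2 limbs (x and y channel each)

features = ["F"] * 5
results = run_stages(features, [(toy_conf_branch, toy_paf_branch)] * 3)
```

In this toy run the first stage receives 5 channels and every later stage receives 5 + 3 + 4 = 12, which is the concatenation the text describes.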

part confidence map

Detects the locations of the joints (parts)
Each pixel encodes how likely it is that a part is located there (confidence)
There is one confidence map per part
Each confidence map has the same size as the input
ground-truth

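$S^*_{j,k}(p) = \exp\left(-\frac{\|p - x_{j,k}\|_2^2}{\sigma^2}\right)$

$x_{j,k} \in \mathbb{R}^2$ : the ground-truth position of body part $j$ for person $k$ in the image
$p$ : a pixel location in the image
$\sigma$ : controls the spread of the peak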

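When the part confidences of several people overlap at the same pixel, the largest score is assigned to that pixel

$S^*_j(p) = \max_k S^*_{j,k}(p)$

Taking the max instead of the average preserves the Gaussian peaks: nearby peaks stay distinct instead of being flattened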
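A minimal sketch of how such a ground-truth confidence map could be generated for one part type: a Gaussian is centered on each person's annotated position and the per-person maps are combined with a per-pixel max. The function name `confidence_map`, the grid representation (list of rows), and the default `sigma=1.0` are assumptions made for illustration.

```python
import math

def confidence_map(width, height, people_keypoints, sigma=1.0):
    """Ground-truth confidence map for ONE part type.

    people_keypoints: list of (x, y) positions of this part, one per person.
    Returns a height x width grid (list of rows).
    """
    grid = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for kx, ky in people_keypoints:
                dist_sq = (x - kx) ** 2 + (y - ky) ** 2
                value = math.exp(-dist_sq / sigma ** 2)   # per-person Gaussian
                grid[y][x] = max(grid[y][x], value)       # max over people
    return grid

# Two people whose keypoints lie on the same row, 4 pixels apart.
heatmap = confidence_map(7, 1, [(1, 0), (5, 0)], sigma=1.0)
```

Both annotated pixels keep a peak value of exactly 1.0 here, whereas averaging the two per-person maps would roughly halve each peak.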

part affinity fields

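Estimates the relationship between two connected joints (parts)
A vector field is assigned to the points (pixels) on the line connecting the two parts
The assigned vectors encode both the location of the limb and the direction from one part to the other
There is one part affinity field per limb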

Improves pose estimation accuracy even with a bottom-up approach

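- When an image contains several people, a part may get connected to a part of a different person (gray lines in (a))
- To prevent this, an additional midpoint part can be inserted (yellow parts in (b)) to reduce the number of candidate combinations, but a part can still be connected to a part that belongs to another person (green lines in (b))
- Part affinity fields resolve this problem because they encode the orientation of the limb as well as its position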

ground-truth

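If pixel $p$ lies on limb $c$ of person $k$, $L^*_{c,k}(p)$ is the unit vector pointing from $j_1$ to $j_2$; otherwise it is the zero vector

$L^*_{c,k}(p) = \begin{cases} v & \text{if } p \text{ on limb } c, k \\ 0 & \text{otherwise} \end{cases}$ , where $v = (x_{j_2,k} - x_{j_1,k}) / \|x_{j_2,k} - x_{j_1,k}\|_2$

Note that a limb is modeled not as a line but as a region with a distance threshold

$0 \le v \cdot (p - x_{j_1,k}) \le l_{c,k}$ and $|v_\perp \cdot (p - x_{j_1,k})| \le \sigma_l$

$v_\perp$ : unit vector perpendicular to $v$
$l_{c,k}$ : length of the limb, $l_{c,k} = \|x_{j_2,k} - x_{j_1,k}\|_2$
$\sigma_l$ : width of the limb (distance in pixels)

When the part affinity fields of several people overlap at the same pixel, the average of all their affinity vectors is assigned to that pixel

$L^*_c(p) = \frac{1}{n_c(p)} \sum_k L^*_{c,k}(p)$

$n_c(p)$ : number of people with a (non-zero) affinity vector assigned at pixel $p$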
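The ground-truth construction described above can be sketched in plain Python: a pixel gets the limb's unit vector when its projection falls between the two joints and its perpendicular distance is within the limb width, and overlapping limbs are averaged over the people that actually cover the pixel. The names `limb_paf` and `average_pafs`, the grid layout `grid[y][x] = (vx, vy)`, and the width value are assumptions for illustration; the sketch also assumes the two joints do not coincide.

```python
import math

def limb_paf(width, height, x1, x2, limb_width):
    """PAF of one person's limb running from joint x1 to joint x2."""
    dx, dy = x2[0] - x1[0], x2[1] - x1[1]
    length = math.hypot(dx, dy)              # limb length (assumed > 0)
    vx, vy = dx / length, dy / length        # unit vector v
    grid = [[(0.0, 0.0)] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            px, py = x - x1[0], y - x1[1]
            along = vx * px + vy * py            # v . (p - x1)
            across = abs(-vy * px + vx * py)     # |v_perp . (p - x1)|
            if 0 <= along <= length and across <= limb_width:
                grid[y][x] = (vx, vy)
    return grid

def average_pafs(per_person_grids):
    """Average the non-zero vectors of all people at every pixel."""
    height, width = len(per_person_grids[0]), len(per_person_grids[0][0])
    out = [[(0.0, 0.0)] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            vectors = [g[y][x] for g in per_person_grids if g[y][x] != (0.0, 0.0)]
            if vectors:
                n = len(vectors)             # people covering this pixel
                out[y][x] = (sum(v[0] for v in vectors) / n,
                             sum(v[1] for v in vectors) / n)
    return out

# A horizontal limb and a vertical limb that cross at pixel (2, 2).
person_a = limb_paf(5, 5, (0, 2), (4, 2), limb_width=0.5)
person_b = limb_paf(5, 5, (2, 0), (2, 4), limb_width=0.5)
paf = average_pafs([person_a, person_b])
```

At the crossing pixel the two unit vectors (1, 0) and (0, 1) average to (0.5, 0.5); everywhere else on a limb the original unit vector is kept.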

multi-stage

As the stages progress, the confidence maps and part affinity fields are predicted more accurately

Loss function

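An L2 loss is computed at every stage and the losses are summed -> this intermediate supervision mitigates the vanishing gradient problem
The overall loss is defined as the combination of the confidence map loss ( $f_S$ ) and the affinity field loss ( $f_L$ )

$f^t_S = \sum_{j=1}^{J} \sum_p W(p) \cdot \|S^t_j(p) - S^*_j(p)\|_2^2$
$f^t_L = \sum_{c=1}^{C} \sum_p W(p) \cdot \|L^t_c(p) - L^*_c(p)\|_2^2$
$f = \sum_{t=1}^{T} (f^t_S + f^t_L)$

(superscript $t$ : prediction at stage $t$; superscript $*$ : ground truth)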

$t$ : stage index
$T$ : total number of stages (up to 6 stages in the paper)
$J$ : number of part types
$C$ : number of affinity field types (limbs)
$p$ : coordinates of a location in the image
$S_j$ : confidence map score
$L_c$ : affinity field
$W(p)$ : binary mask (1 if location $p$ has confidence and affinity field annotations, 0 otherwise)

The authors weight the loss functions spatially to address a practical issue: some datasets do not completely label all people.

W(p) = 0 when the annotation is missing at an image location p. The mask is used to avoid penalizing the true positive predictions during training.
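A minimal sketch of this masked, stage-summed loss in plain Python. The names `masked_l2` and `total_loss` and the data layout (each map is a list of 2D grids, with the x and y components of an affinity field stored as two separate channels) are assumptions for illustration; because a squared L2 norm is the sum of squared components, splitting the vector into channels gives the same value.

```python
def masked_l2(pred, target, mask):
    """Sum over pixels of W(p) * (pred(p) - target(p))^2 for one channel."""
    total = 0.0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for value, truth, w in zip(pred_row, target_row, mask_row):
            total += w * (value - truth) ** 2
    return total

def total_loss(stage_outputs, conf_targets, paf_targets, mask):
    """Sum the confidence map loss and affinity field loss over all stages.

    stage_outputs: list of (conf_maps, paf_maps), one entry per stage.
    """
    loss = 0.0
    for conf_maps, paf_maps in stage_outputs:
        loss += sum(masked_l2(s, s_star, mask)
                    for s, s_star in zip(conf_maps, conf_targets))
        loss += sum(masked_l2(l, l_star, mask)
                    for l, l_star in zip(paf_maps, paf_targets))
    return loss

# One row of two pixels; the second pixel has no annotation (mask = 0).
mask = [[1, 0]]
conf_targets = [[[1.0, 0.0]]]
paf_targets = [[[0.0, 0.0]]]
stage = ([[[0.5, 0.9]]], [[[0.0, 0.3]]])
loss = total_loss([stage, stage], conf_targets, paf_targets, mask)
```

The confident prediction of 0.9 at the unannotated pixel contributes nothing, so an unlabeled but real person is not penalized; only the annotated pixel adds 0.25 per stage.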

So far we have covered the network architecture. Next comes the multi-person parsing step:
1) how candidate parts are found from the confidence maps,
2) how PAFs are used to connect them into part pairs,
3) how the connected part pairs are assembled into people (skeletons).

Part candidates

Part candidates are obtained by applying non-maximum suppression to the confidence maps
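A minimal sketch of non-maximum suppression on a single confidence map: a pixel is kept as a part candidate if it exceeds a threshold and is strictly greater than its four neighbours. The function name `find_peaks`, the 4-neighbourhood, and the threshold of 0.1 are assumptions for illustration, not values taken from the paper.

```python
def find_peaks(heatmap, threshold=0.1):
    """Return (x, y, score) for every local maximum above the threshold."""
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            score = heatmap[y][x]
            if score < threshold:
                continue
            neighbours = [
                heatmap[ny][nx]
                for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
                if 0 <= nx < width and 0 <= ny < height
            ]
            # Strict comparison: flat plateaus produce no candidate.
            if all(score > n for n in neighbours):
                peaks.append((x, y, score))
    return peaks

heatmap = [
    [0.0, 0.2, 0.0, 0.0],
    [0.2, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.0, 0.7],
]
peaks = find_peaks(heatmap)
```

Each surviving peak is one candidate for that part type; with several people in the image, one map yields several candidates.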

Bipartite matching

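The process of connecting part candidates into part pairs

If every possible part connection is considered, as in (b), this becomes a K-dimensional matching problem, which is known to be NP-hard

NP-hard: a problem at least as hard as every problem in NP, where NP is the class of decision problems whose YES answers can be verified in polynomial time when a hint (certificate) is given. No polynomial-time algorithm is known for NP-hard problems, so the paper relaxes the problem instead of solving it exactly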

Relaxations proposed in the paper

(1) Predefine which parts are adjacent to each part (Fig 6.(c))
(e.g. the right shoulder connects only to the neck and the right elbow)
(2) Further decompose graph (c) into several bipartite graphs (Fig 6.(d))

bipartite graph / matching

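- Bipartite graph: a graph whose vertices can be split into two groups such that the two endpoints of every edge lie in different groups
- Bipartite matching: a selection of edges in a bipartite graph in which every vertex is used by at most one edge, so matched vertices of the two groups are paired one-to-one
(e.g. group 1: left elbow candidates, group 2: left wrist candidates; candidates in the same group cannot be connected, and each candidate can be connected to the other group at most once)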

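The association score ( $E$ ), computed from the part affinity field, is used as the criterion for selecting edges

$E = \int_{u=0}^{u=1} L_c(p(u)) \cdot \frac{d_{j_2} - d_{j_1}}{\|d_{j_2} - d_{j_1}\|_2} \, du$ , where $p(u) = (1 - u) d_{j_1} + u d_{j_2}$

The field holds an affinity vector only where $p(u)$ lies on limb $c$, so the larger $E$ is, the better the two parts are connected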
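A minimal sketch of the association score: the line integral along the segment between two candidates is approximated by sampling uniformly spaced points, reading the affinity vector at the nearest pixel, and averaging its dot product with the unit direction of the segment. The name `association_score`, the nearest-pixel lookup, and `num_samples=10` are assumptions for illustration.

```python
import math

def association_score(paf, d1, d2, num_samples=10):
    """Approximate E for candidates d1 -> d2 on a PAF grid[y][x] = (vx, vy).

    num_samples must be at least 2.
    """
    dx, dy = d2[0] - d1[0], d2[1] - d1[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length        # unit direction d1 -> d2
    total = 0.0
    for i in range(num_samples):
        u = i / (num_samples - 1)
        # Interpolated position between d1 and d2, snapped to a pixel.
        x = int(round((1 - u) * d1[0] + u * d2[0]))
        y = int(round((1 - u) * d1[1] + u * d2[1]))
        vx, vy = paf[y][x]
        total += vx * ux + vy * uy           # alignment at this sample
    return total / num_samples

# Toy field: a horizontal limb along row 2 pointing in +x.
paf = [[(1.0, 0.0) if y == 2 else (0.0, 0.0) for x in range(5)]
       for y in range(5)]
on_limb = association_score(paf, (0, 2), (4, 2))
off_limb = association_score(paf, (0, 0), (4, 0))
reversed_limb = association_score(paf, (4, 2), (0, 2))
```

A pair aligned with the field scores 1.0, a pair away from the limb scores 0.0, and the same pair in the wrong direction scores -1.0, which is how the direction information rules out wrong connections.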

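$z^{mn}_{j_1j_2} \in \{0, 1\}$ : indicates whether two detection candidates $d_{j_1}^m$ and $d_{j_2}^n$ are connected
$Z = \{z^{mn}_{j_1j_2} : j_1, j_2 \in \{1 \ldots J\},\ m \in \{1 \ldots N_{j_1}\},\ n \in \{1 \ldots N_{j_2}\}\}$

For each limb $c$, the part pairs that maximize the summed association are selected, with every candidate used at most once:

$\max_{Z_c} E_c = \max_{Z_c} \sum_{m=1}^{N_{j_1}} \sum_{n=1}^{N_{j_2}} E_{mn} \cdot z^{mn}_{j_1j_2}$ , subject to $\sum_{n} z^{mn}_{j_1j_2} \le 1$ for every $m$ and $\sum_{m} z^{mn}_{j_1j_2} \le 1$ for every $n$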
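A minimal sketch of this selection for one limb type. The paper notes that the Hungarian algorithm can be used to obtain the optimal matching; the sketch below swaps in a brute-force enumeration over all assignments instead, which gives the same optimum but is only practical for a handful of candidates. The name `best_matching` and the score values are assumptions for illustration.

```python
from itertools import permutations

def best_matching(scores):
    """Maximum-weight bipartite matching by exhaustive search.

    scores[m][n] is the association score between candidate m of the first
    part type and candidate n of the second part type.
    Returns (pairs, total); each candidate appears in at most one pair.
    """
    rows = len(scores)
    cols = len(scores[0]) if rows else 0
    # None lets a row stay unmatched; one None per row is enough.
    options = list(range(cols)) + [None] * rows
    best_total, best_pairs = 0.0, []
    for assignment in permutations(options, rows):
        pairs = [(m, n) for m, n in enumerate(assignment)
                 if n is not None and scores[m][n] > 0]
        total = sum(scores[m][n] for m, n in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_pairs, best_total

# Rows: two elbow candidates, columns: two wrist candidates (toy scores).
scores = [[0.9, 0.8],
          [0.85, 0.1]]
pairs, total = best_matching(scores)
```

Picking the single best edge first (0.9) would force the second elbow onto the 0.1 edge, for a total of 1.0; the optimal matching instead pairs elbow 0 with wrist 1 and elbow 1 with wrist 0.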

Merging

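The process of connecting part pairs into skeletons
Naive initial assumption: each part pair is its own person (each person has exactly one part pair)

If two part pairs share a part with the same part index (and the same coordinates), they are judged to belong to the same person

This is repeated until no shared parts remain, which turns the part pairs into skeletons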
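The merging step can be sketched as repeated set union: start with one person per part pair and merge any two people that share a part until nothing changes. The name `merge_pairs`, the representation of a part as a `(part_type, x, y)` tuple, and the part names are assumptions for illustration.

```python
def merge_pairs(part_pairs):
    """Merge part pairs that share a part into people.

    part_pairs: list of (part_a, part_b); each part is (part_type, x, y).
    Returns a list of sets of parts, one set per person.
    """
    people = [set(pair) for pair in part_pairs]   # one person per pair
    merged = True
    while merged:
        merged = False
        for i in range(len(people)):
            for j in range(i + 1, len(people)):
                if people[i] & people[j]:         # a shared part
                    people[i] |= people[j]
                    del people[j]
                    merged = True
                    break
            if merged:
                break                             # restart the scan
    return people

part_pairs = [
    (("neck", 5, 1), ("r_shoulder", 4, 2)),
    (("r_shoulder", 4, 2), ("r_elbow", 3, 4)),
    (("neck", 9, 1), ("r_shoulder", 8, 2)),
]
people = merge_pairs(part_pairs)
```

The first two pairs share the right shoulder at (4, 2) and collapse into one person, while the third pair stays a separate person.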

summary

Result

Evaluated on the MPII and COCO 2016 keypoints challenge datasets

DeepCut: a bottom-up pose estimation model (uses an SVM)
DeeperCut: improves the speed of DeepCut

The runtime of a top-down approach grows in proportion to the number of people, whereas the runtime of this bottom-up method barely increases
(cf. the speed of 8.8 fps for a video with 19 people)

Follow-up paper (TPAMI 2019)

OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields

The PAF and confidence map stages are connected in series, which improves both speed and accuracy

This approach differs from [3], where both the PAF and confidence map branches were refined at each stage. Hence, the amount of computation per stage is reduced by half.

We empirically observe in Section 5.2 that refined affinity field predictions improve the confidence map results, while the opposite does not hold. Intuitively, if we look at the PAF channel output, the body part locations can be guessed.

Redundant PAF connections are added to improve accuracy

Our current model also incorporates redundant PAF connections (e.g., between ears and shoulders, wrists and shoulders, etc.). This redundancy particularly improves the accuracy in crowded images