Structure-from-Motion (COLMAP)

김민솔 · May 12, 2024

Vision


This post walks through COLMAP's SfM pipeline. It has two main parts: Correspondence Search and Incremental Reconstruction. Correspondence Search detects 2D features and matches them across images. Incremental Reconstruction uses the matched features to estimate the 3D structure and the camera parameters. With the overall structure in view, let's go through the concepts it relies on. (There are a lot of them.)

Camera Calibration

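Camera calibration is the process of finding the intrinsic and extrinsic parameters of a camera setup, and it is essential when inferring 3D structure from a set of 2D images. (Computer Vision Concept Summary (velog.io): a post that summarizes the related CV concepts.)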
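To make the roles of the two parameter groups concrete, here is a minimal sketch of a pinhole projection. The numbers in `K`, `R`, and `t` are made up for illustration and the helper `project` is a hypothetical name, not part of OpenCV or COLMAP; lens distortion is ignored.

```python
import numpy as np

def project(K, R, t, x_world):
    """Project a 3D world point to pixel coordinates with a pinhole model."""
    x_cam = R @ x_world + t          # extrinsics: world -> camera coordinates
    x_hom = K @ x_cam                # intrinsics: camera -> homogeneous pixel
    return x_hom[:2] / x_hom[2]      # perspective division

# Hypothetical intrinsics: focal length 800 px, principal point (320, 240)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                        # camera axes aligned with the world axes
t = np.array([0.0, 0.0, 5.0])        # world origin lies 5 units in front of the camera

pixel = project(K, R, t, np.array([1.0, 0.5, 0.0]))
print(pixel)                         # [480. 320.]
```

Calibration is the inverse problem: given many such pixel observations of known 3D points, recover `K` (plus distortion) and one `R`, `t` per image.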

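In most cases, calibration uses a known target such as a checkerboard. The first step is to capture images of the checkerboard from a variety of angles.

Next, features such as corners are detected in the images.

Finally, the intrinsic and extrinsic parameters are optimized jointly. All parameters are first initialized with a closed-form solution (the distortion parameters are excluded at this stage), and then all parameters are refined by non-linear optimization.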

Calibration Code (OpenCV)

import numpy as np
import cv2 as cv
import glob

# termination criteria for the sub-pixel corner refinement
criteria = (cv.TERM_CRITERIA_EPS + cv.TERM_CRITERIA_MAX_ITER, 30, 0.001)

# prepare object points, like (0,0,0), (1,0,0), (2,0,0) ....,(6,5,0)
objp = np.zeros((6*7,3), np.float32)
objp[:,:2] = np.mgrid[0:7,0:6].T.reshape(-1,2)

# Arrays to store object points and image points from all the images.
objpoints = [] # 3d point in real world space
imgpoints = [] # 2d points in image plane.

images = glob.glob('*.jpg')

for fname in images:
    img = cv.imread(fname)
    gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)

    # Find the chess board corners
    ret, corners = cv.findChessboardCorners(gray, (7,6), None)

    # If found, add object points, image points (after refining them)
    if ret:
        objpoints.append(objp)

        corners2 = cv.cornerSubPix(gray, corners, (11,11), (-1,-1), criteria)
        imgpoints.append(corners2)

        # Draw and display the corners
        cv.drawChessboardCorners(img, (7,6), corners2, ret)
        cv.imshow('img', img)
        cv.waitKey(500)

cv.destroyAllWindows()

# Calibration: gray.shape[::-1] is the (width, height) of the last image read
ret, mtx, dist, rvecs, tvecs = cv.calibrateCamera(objpoints, imgpoints, gray.shape[::-1], None, None)

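The code above is the calibration example provided by OpenCV.

cv.findChessboardCorners finds the checkerboard corners, and cv.cornerSubPix refines them to sub-pixel accuracy. The detected image corners and the corresponding 3D points are then used to estimate the camera parameters.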

Understanding camera models and camera calibration, with Python practice

OpenCV: Camera Calibration

Paper on the calibration implementation

www.microsoft.com

Feature Detection

Point features describe local, salient regions of an image. They make it possible to match images taken from different viewpoints.

Features should be invariant to perspective effects and illumination, and the same point should yield a similar descriptor vector regardless of pose or viewpoint.

Scale Invariant Feature Transform (SIFT)

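SIFT builds a scale space by repeatedly filtering the image with a Gaussian, and downsamples the image at regular intervals. Subtracting adjacent scales yields Difference of Gaussian (DoG) images. Blobs are detected in the DoG images, where they appear as extrema in scale space.

Once the blobs are extracted, the descriptor is rotated to align with the dominant gradient orientation. Gradient histograms are then computed and used to form the 128-D feature vector (keypoint descriptor).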
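The DoG step can be sketched with NumPy alone. This is a toy illustration under stated assumptions, not SIFT itself: it uses a single pair of scales instead of a full scale-space pyramid, a synthetic image with one blob, and hypothetical helper names (`gaussian_kernel`, `gaussian_blur`); the values `sigma = 1.6` and the scale ratio `sqrt(2)` are chosen for illustration.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian kernel truncated at 3 sigma and normalised to sum to 1."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur (zero padding at the borders)."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(np.convolve, 1, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")

# Synthetic image: one bright Gaussian blob centred at (row 20, col 20)
yy, xx = np.mgrid[0:41, 0:41]
image = np.exp(-((yy - 20)**2 + (xx - 20)**2) / (2 * 3.0**2))

sigma = 1.6                      # base scale (illustrative value)
ratio = np.sqrt(2)               # scale ratio between the two adjacent levels
dog = gaussian_blur(image, ratio * sigma) - gaussian_blur(image, sigma)

# The blob shows up as the strongest extremum of the DoG response
peak = np.unravel_index(np.argmax(np.abs(dog)), dog.shape)
print(peak)
```

A full detector would repeat this over many scales and octaves and search for extrema across both position and scale.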

Although it is more than 20 years old, SIFT is still used in COLMAP. The image above shows SIFT applied to two images. (Green: correct match / red: incorrect match)

Epipolar Geometry

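The goal of epipolar geometry is to recover the camera pose and the 3D structure from image correspondences. Before going into detail, let's define the notation.

$\mathbf{x}$ : a 3D point
$\bar{\mathbf{x}}$ : the point projected onto the image plane
Baseline: the line through the two camera centers
Epipole: the point where the baseline intersects the image plane
Epipolar line: the line where the epipolar plane intersects the image plane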

The setup for two-view epipolar geometry is as follows. (Personally, I found it easiest to understand from a diagram.)

The relative pose $\mathbf{R}$ and $\mathbf{t}$ between the two cameras is defined.
A 3D point $\mathbf{x}$ projects to pixel $\bar{\mathbf{x}}_1$ in image 1 and to pixel $\bar{\mathbf{x}}_2$ in image 2.
The 3D point $\mathbf{x}$ and the two camera centers span the epipolar plane.
The pixel $\bar{\mathbf{x}}_2$ corresponding to pixel $\bar{\mathbf{x}}_1$ must lie on the epipolar line $\tilde{\mathbf{l}}_2$.
All epipolar lines pass through the epipole.

Estimate epipolar geometry

Let's derive how to estimate the epipolar geometry from matched features between two images.

Method (1)

$\mathbf{K}_i \in \mathbb{R}^{3 \times 3}$ : camera matrix of camera i ← calibration!
$\tilde{\mathbf{x}}_i = \mathbf{K}_i^{-1}\bar{\mathbf{x}}_i$ : local ray direction of pixel $\bar{\mathbf{x}}_i$

Using the two definitions above, we obtain the following proportionality, where $\mathbf{x}_1$ and $\mathbf{x}_2$ are the 3D point expressed in the coordinates of camera 1 and camera 2, and $s$ is a scalar.

$$ \tilde{\mathbf{x}}_2 \propto \mathbf{x}_2 = \mathbf{R}\mathbf{x}_1 + \mathbf{t} \propto \mathbf{R}\tilde{\mathbf{x}}_1 + s\mathbf{t} $$

Taking the cross product with $\mathbf{t}$ on both sides removes the $s\mathbf{t}$ term, since $[\mathbf{t}]_{\times}\mathbf{t} = 0$.

$$ [\mathbf{t}]_{\times}\tilde{\mathbf{x}}_2 \propto [\mathbf{t}]_{\times}\mathbf{R}\tilde{\mathbf{x}}_1 $$

Taking the inner product with $\tilde{\mathbf{x}}_2^T$ on both sides then gives the following; the left side vanishes because $[\mathbf{t}]_{\times}\tilde{\mathbf{x}}_2$ is orthogonal to $\tilde{\mathbf{x}}_2$.

$$ \tilde{\mathbf{x}}_2^T[\mathbf{t}]_{\times}\tilde{\mathbf{x}}_2 = 0 \propto \tilde{\mathbf{x}}_2^T[\mathbf{t}]_{\times}\mathbf{R}\tilde{\mathbf{x}}_1 \rightarrow \tilde{\mathbf{x}}_2^T[\mathbf{t}]_{\times}\mathbf{R}\tilde{\mathbf{x}}_1 = 0 $$

Writing this in terms of the essential matrix gives the epipolar constraint.

$$ \tilde{\mathbf{x}}_2^T\tilde{\mathbf{E}}\tilde{\mathbf{x}}_1 = 0 $$

Essential matrix: $\tilde{\mathbf{E}} = [\mathbf{t}]_{\times}\mathbf{R}$
$\tilde{\mathbf{E}}$ maps a point $\tilde{\mathbf{x}}_1$ to its epipolar line in image 2 (the reverse direction works as well, via $\tilde{\mathbf{E}}^T$): $\tilde{\mathbf{l}}_2 = \tilde{\mathbf{E}} \tilde{\mathbf{x}}_1$

For any point $\tilde{\mathbf{x}}_1$, the corresponding epipolar line $\tilde{\mathbf{l}}_2 = \tilde{\mathbf{E}} \tilde{\mathbf{x}}_1$ passes through the epipole $\tilde{\mathbf{e}}_2$, so the following holds.

$$ \tilde{\mathbf{e}}_2^T\tilde{\mathbf{l}}_2 = \tilde{\mathbf{e}}_2^T\tilde{\mathbf{E}} \tilde{\mathbf{x}}_1 = 0 \rightarrow \tilde{\mathbf{e}}_2^T\tilde{\mathbf{E}} = 0 $$

This shows that $\tilde{\mathbf{e}}_2$ spans the left null-space of $\tilde{\mathbf{E}}$. (It is the left singular vector whose singular value is 0, or the smallest one, since noise can make it nonzero.)
Conversely, $\tilde{\mathbf{e}}_1$ spans the right null-space of $\tilde{\mathbf{E}}$. (It is the right singular vector with singular value 0.)
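The constraint and the null-space property can be checked numerically. The sketch below uses a made-up relative pose and a single synthetic 3D point; `skew` is a hypothetical helper that builds $[\mathbf{t}]_{\times}$.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Hypothetical relative pose: 10 degree rotation about the y axis plus a translation
angle = np.deg2rad(10.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.2, 0.1])
E = skew(t) @ R                      # essential matrix

x1 = np.array([0.3, -0.2, 4.0])      # 3D point in camera-1 coordinates
x2 = R @ x1 + t                      # same point in camera-2 coordinates
ray1 = x1 / x1[2]                    # local ray directions (K = I)
ray2 = x2 / x2[2]
residual = ray2 @ E @ ray1           # epipolar constraint, should be ~0

U, S, Vt = np.linalg.svd(E)
e2 = U[:, 2]                         # left null vector:  e2^T E = 0
e1 = Vt[2]                           # right null vector: E e1 = 0
print(residual, S)
```

The singular values come out as two equal values and one zero, which is the characteristic structure of an essential matrix.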

Method (2)

$$ \tilde{\mathbf{x}}_2 \propto \mathbf{x}_2 = \mathbf{R}\mathbf{x}_1 + \mathbf{t} $$

Here $\mathbf{R}\mathbf{x}_1 + \mathbf{t}$ and $\mathbf{t}$ both lie in the epipolar plane, so their cross product gives a vector normal to that plane, $[\mathbf{t}]_{\times}\mathbf{R}\mathbf{x}_1$. Its inner product with $\mathbf{x}_2$ is therefore 0, which yields the epipolar constraint. (This explanation assumes $\mathbf{K} = \mathbf{I}$, but the argument applies in the general case as well.)

8-point algorithm

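Using the epipolar constraint obtained above, $N$ point correspondences give a homogeneous linear system in the nine entries of $\tilde{\mathbf{E}}$, with one equation per correspondence.

This system can be solved with SVD. Because $\tilde{\mathbf{E}}$ is only defined up to scale, eight correspondences are enough, which is why the method is called the 8-point algorithm.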
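As a rough sketch of how that linear system is built and solved: each correspondence contributes the row $\tilde{\mathbf{x}}_2 \otimes \tilde{\mathbf{x}}_1$ of a matrix $\mathbf{A}$, and the solution is the right singular vector of $\mathbf{A}$ with the smallest singular value. The function name `eight_point`, the synthetic pose, and the noise-free random points are assumptions for illustration; a practical implementation would add coordinate normalization and RANSAC.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def eight_point(rays1, rays2):
    """Estimate E (up to scale and sign) from N >= 8 pairs of ray directions (N x 3 each)."""
    # x2^T E x1 = 0  <=>  kron(x2, x1) . vec(E) = 0, with vec(E) in row-major order
    A = np.stack([np.kron(x2, x1) for x1, x2 in zip(rays1, rays2)])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)         # singular vector of the smallest singular value
    # Enforce the essential-matrix structure: two equal singular values, one zero
    U, S, Vt_e = np.linalg.svd(E)
    s = (S[0] + S[1]) / 2
    return U @ np.diag([s, s, 0.0]) @ Vt_e

# Synthetic ground truth (hypothetical values)
angle = np.deg2rad(10.0)
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_true = np.array([1.0, 0.2, 0.1])
E_true = skew(t_true) @ R_true

rng = np.random.default_rng(0)
X1 = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(20, 3))  # points in camera 1
X2 = X1 @ R_true.T + t_true                                         # same points in camera 2
rays1 = X1 / X1[:, 2:3]
rays2 = X2 / X2[:, 2:3]

E_est = eight_point(rays1, rays2)

# Compare up to scale and sign
E_est_n = E_est / np.linalg.norm(E_est)
E_true_n = E_true / np.linalg.norm(E_true)
error = min(np.linalg.norm(E_est_n - E_true_n), np.linalg.norm(E_est_n + E_true_n))
print(error)
```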

Fundamental matrix

The process above yields $\tilde{\mathbf{E}}$, from which $\hat{\mathbf{t}}$ and $\mathbf{R}$ can be recovered in turn. (The proof of that step is left below.)

Substituting the calibration matrices $\mathbf{K}_i \in \mathbb{R}^{3 \times 3}$ into the constraint expresses it through the fundamental matrix, which works directly on pixel coordinates, so the epipoles can be found without knowing the intrinsic matrices.

$$ \tilde{\mathbf{x}}_2^T\tilde{\mathbf{E}}\tilde{\mathbf{x}}_1 = \bar{\mathbf{x}}_2^T\mathbf{K}^{-T}_2\tilde{\mathbf{E}}\mathbf{K}_1^{-1}\bar{\mathbf{x}}_1 = \bar{\mathbf{x}}_2^T\tilde{\mathbf{F}}\bar{\mathbf{x}}_1 = 0 $$

$\tilde{\mathbf{F}} = \mathbf{K}^{-T}_2\tilde{\mathbf{E}}\mathbf{K}_1^{-1}$ : fundamental matrix
Both the fundamental matrix and the essential matrix have rank 2.
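A small numerical check of the relation between the two matrices, assuming $\tilde{\mathbf{x}}_i = \mathbf{K}_i^{-1}\bar{\mathbf{x}}_i$ as defined earlier. The intrinsics and pose are made-up values, and both cameras share the same `K` only to keep the sketch short.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Hypothetical relative pose and intrinsics
angle = np.deg2rad(10.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.2, 0.1])
E = skew(t) @ R
K1 = K2 = np.array([[800.0, 0.0, 320.0],
                    [0.0, 800.0, 240.0],
                    [0.0, 0.0, 1.0]])

# F = K2^{-T} E K1^{-1}
F = np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

X1 = np.array([0.3, -0.2, 4.0])          # 3D point in camera-1 coordinates
X2 = R @ X1 + t
pixel1 = K1 @ (X1 / X1[2])               # homogeneous pixel in image 1
pixel2 = K2 @ (X2 / X2[2])               # homogeneous pixel in image 2
residual = pixel2 @ F @ pixel1           # should be ~0 in pixel coordinates

singular_values = np.linalg.svd(F, compute_uv=False)
print(residual, singular_values)         # the smallest singular value is numerically zero
```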

Demo of epipolar geometry

Epipolar lines: the black lines
Each corresponding point lies on its corresponding epipolar line.

Triangulation

Given 2D image observations, consider the case where the two rays do not intersect, which is typical when the measurements are noisy.

$\tilde{\mathbf{x}}_i^s = \tilde{\mathbf{P}}_i\tilde{\mathbf{x}}_w$ : projection of the 3D world point $\tilde{\mathbf{x}}_w$ onto the image of the $i$-th camera, $\tilde{\mathbf{x}}_i^s$
Both sides are homogeneous vectors, so they point in the same direction, which gives $\tilde{\mathbf{x}}_i^s \times \tilde{\mathbf{P}}_i\tilde{\mathbf{x}}_w = 0$ and hence the equation below.

$$ \begin{bmatrix} x^s_i\tilde{\mathbf{p}}_{i3}^T -\tilde{\mathbf{p}}_{i1}^T \\ y^s_i\tilde{\mathbf{p}}_{i3}^T -\tilde{\mathbf{p}}_{i2}^T \end{bmatrix} \tilde{\mathbf{x}}_w = 0 $$

$\mathbf{A}_i = \begin{bmatrix} x^s_i\tilde{\mathbf{p}}_{i3}^T -\tilde{\mathbf{p}}_{i1}^T \\ y^s_i\tilde{\mathbf{p}}_{i3}^T -\tilde{\mathbf{p}}_{i2}^T \end{bmatrix}$
$\tilde{\mathbf{p}}_{ik}^T$ : the $k$-th row of $\tilde{\mathbf{P}}_i$

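Stacking the $\mathbf{A}_i$ of all views into a single matrix $\mathbf{A}$, the Direct Linear Transform (DLT) solves this as a least-squares problem. As before, the right singular vector of $\mathbf{A}$ with the smallest singular value is the solution. Note that DLT minimizes an algebraic error; the geometrically meaningful objective is the reprojection error below, which has to be minimized numerically.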

$$ \bar{\mathbf{x}}_w^* = \argmin_{\bar{\mathbf{x}}_w}\sum^N_{i=1}||\bar{\mathbf{x}}_i^s(\bar{\mathbf{x}}_w)-\bar{\mathbf{x}}_i^o||^2_2 $$

$\bar{\mathbf{x}}_i^o$ : observation

The more parallel the rays are, the larger the uncertainty (shaded region).
Tradeoff: the closer the views, the easier feature matching becomes, but the harder triangulation becomes.
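The linear triangulation described above can be sketched as follows. The two projection matrices, the intrinsics, and the 3D point are synthetic assumptions, and `triangulate_dlt` and `project` are hypothetical helper names; the observations are noise-free so the result can be compared against the ground truth.

```python
import numpy as np

def triangulate_dlt(projections, observations):
    """Linear triangulation of one 3D point from two or more views."""
    rows = []
    for P, (x, y) in zip(projections, observations):
        rows.append(x * P[2] - P[0])     # x * p3^T - p1^T
        rows.append(y * P[2] - P[1])     # y * p3^T - p2^T
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                           # right singular vector of the smallest singular value
    return X[:3] / X[3]                  # homogeneous -> Euclidean

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical two-view setup
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = np.deg2rad(10.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([-1.0, 0.0, 0.0])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t[:, None]])

X_true = np.array([0.3, -0.2, 4.0])
observations = [project(P1, X_true), project(P2, X_true)]

X_est = triangulate_dlt([P1, P2], observations)
print(X_est)
```

With noisy observations the same function still returns a point, but it then minimizes the algebraic residual of the stacked system rather than the reprojection error.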

Factorization

This chapter looks at how 3D geometry is recovered from two or more views.

$\mathcal{W} = \{(x_{ip},y_{ip})|i=1,...,N, p=1,...,P\}$ : $P$ feature points tracked over $N$ frames
The goal is to recover the camera motion (rotation) and the structure (3D points $\mathbf{x}_p$ ← $(x_{ip},y_{ip})$ ) from $\mathcal{W}$ under orthographic projection.

Orthographic Factorization

Orthographic projection: a 3D point $\mathbf{x}_p$ → a pixel $(x_{ip},y_{ip})$ in frame $i$

$$ x_{ip} = \mathbf{u}^T_i(\mathbf{x}_p-\mathbf{t}_i), \qquad y_{ip} = \mathbf{v}^T_i(\mathbf{x}_p-\mathbf{t}_i) $$

Both the 3D coordinate system (red) and the image coordinate system (blue) are centered (zero-mean). Centering the image features in every frame then gives the centered measurement matrix $\tilde{\mathbf{W}}$.

$$ \tilde{\mathbf{W}} = \begin{bmatrix} \tilde{x}_{11} & ... & \tilde{x}_{1P} \\ \vdots & & \vdots \\ \tilde{x}_{N1} & ... & \tilde{x}_{NP} \\ \tilde{y}_{11} & ... & \tilde{y}_{1P} \\ \vdots & & \vdots \\ \tilde{y}_{N1} & ... & \tilde{y}_{NP} \\ \end{bmatrix} $$

$\tilde{x}_{ip} = {x}_{ip} - \frac 1 P \sum^P_{q=1}x_{iq}$
$\tilde{y}_{ip} = {y}_{ip} - \frac 1 P \sum^P_{q=1}y_{iq}$
Here the tilde does not denote homogeneous coordinates; it denotes centered quantities.

Using these properties, the centered image x coordinate can be written as follows.

$$ \begin{aligned} \tilde{x}_{ip} &={x}_{ip} - \frac 1 P \sum^P_{q=1}x_{iq}\\ &=\mathbf{u}^T_i(\mathbf{x}_p-\mathbf{t}_i) - \frac 1 P \sum^P_{q=1}\mathbf{u}^T_i(\mathbf{x}_q-\mathbf{t}_i)\\ &=\mathbf{u}^T_i(\mathbf{x}_p-\mathbf{t}_i) - \mathbf{u}^T_i\frac 1 P \sum^P_{q=1}\mathbf{x}_q + \mathbf{u}^T_i\mathbf{t}_i \\ &=\mathbf{u}^T_i\Big(\mathbf{x}_p-\frac 1 P \sum^P_{q=1}\mathbf{x}_q \Big) = \mathbf{u}^T_i\mathbf{x}_p \end{aligned} $$

The last equality holds because the 3D points are centered, i.e. $\frac 1 P \sum^P_{q=1}\mathbf{x}_q = 0$.

Centered image y coordinate: $\tilde{y}_{ip} = \mathbf{v}^T_i\mathbf{x}_p$

The centered measurement matrix $\tilde{\mathbf{W}}$ can therefore be factorized as follows.

$$ \tilde{\mathbf{W}} = \begin{bmatrix} \tilde{x}_{11} & ... & \tilde{x}_{1P} \\ \vdots & & \vdots \\ \tilde{x}_{N1} & ... & \tilde{x}_{NP} \\ \tilde{y}_{11} & ... & \tilde{y}_{1P} \\ \vdots & & \vdots \\ \tilde{y}_{N1} & ... & \tilde{y}_{NP} \\ \end{bmatrix} = \mathbf{R}\mathbf{X} $$

$\mathbf{R} = \begin{bmatrix} \mathbf{u}^T_1 \\ \vdots \\ \mathbf{u}^T_N \\ \mathbf{v}^T_1 \\ \vdots \\ \mathbf{v}^T_N \\ \end{bmatrix} \in \mathbb{R}^{2N\times 3}$ : camera motion (rotation)
$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & ... & \mathbf{x}_P \end{bmatrix} \in \mathbb{R}^{3\times P}$ : structure of the 3D scene

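Without noise, $\tilde{\mathbf{W}}$ has rank at most 3, since it is the product of a $2N\times 3$ matrix and a $3\times P$ matrix. With noise it is generally full rank, so a rank-3 approximation $\hat{\mathbf{W}}$ is used instead.

(The approximation minimizes $||\hat{\mathbf{W}} - \tilde{\mathbf{W}}||_F$ and can be computed with SVD by keeping the three largest singular values.)

$$ \hat{\mathbf{W}} = {\mathbf{U}}\Sigma{\mathbf{V}}^T = {\mathbf{R}}{\mathbf{X}} = (\hat{\mathbf{R}}\mathbf{Q})(\mathbf{Q}^{-1}\hat{\mathbf{X}}) $$

$\hat{\mathbf{R}} = {\mathbf{U}}\Sigma^{\frac 1 2}$
$\hat{\mathbf{X}} = \Sigma^{\frac 1 2}{\mathbf{V}}^T$

This factorization is not unique: any invertible $3\times 3$ matrix $\mathbf{Q}$ gives the same product, so additional constraints are needed. They come from the following two properties of $\mathbf{R}$.

The rows of $\mathbf{R}$ are unit vectors.
Each row in the first half of $\mathbf{R}$ (the $\mathbf{u}_i$) is orthogonal to the matching row in the second half (the $\mathbf{v}_i$).

These constraints determine $\mathbf{Q}$, which makes the solution unique up to a global rotation of the coordinate frame.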
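Written out, the two properties become the constraints below. The symbols $\hat{\mathbf{u}}_i^T$ and $\hat{\mathbf{v}}_i^T$ are notation assumed here for rows $i$ and $N+i$ of $\hat{\mathbf{R}}$; the text above does not name them. The constraints are linear in the entries of the symmetric matrix $\mathbf{Q}\mathbf{Q}^T$, which is why $\mathbf{Q}\mathbf{Q}^T$ is solved for first.

```latex
\hat{\mathbf{u}}_i^T \mathbf{Q}\mathbf{Q}^T \hat{\mathbf{u}}_i = 1, \qquad
\hat{\mathbf{v}}_i^T \mathbf{Q}\mathbf{Q}^T \hat{\mathbf{v}}_i = 1, \qquad
\hat{\mathbf{u}}_i^T \mathbf{Q}\mathbf{Q}^T \hat{\mathbf{v}}_i = 0, \qquad i = 1,\dots,N
```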

Algorithm

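Orthographic factorization can be summarized as follows.

Build the centered measurement matrix $\tilde{\mathbf{W}}$ from the tracked features.
Compute the SVD of $\tilde{\mathbf{W}}$ and keep the top 3 singular values and vectors, which gives $\hat{\mathbf{W}}$.
Define $\hat{\mathbf{R}} = {\mathbf{U}}\Sigma^{\frac 1 2}$ and $\hat{\mathbf{X}} = \Sigma^{\frac 1 2}{\mathbf{V}}^T$.
Solve the constraints for $\mathbf{Q}\mathbf{Q}^T$, then recover $\mathbf{Q}$ from it (for example with a Cholesky decomposition).
Compute $\mathbf{R} = \hat{\mathbf{R}}\mathbf{Q}$ and $\mathbf{X} = \mathbf{Q}^{-1}\hat{\mathbf{X}}$.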
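The steps above can be sketched in NumPy as follows. This is an illustration on noise-free synthetic data, not a reference implementation: the names `factorize` and `sym_row` are hypothetical, the random cameras and points are made up, and recovering $\mathbf{Q}$ with a Cholesky decomposition is one possible choice that assumes the estimated $\mathbf{Q}\mathbf{Q}^T$ is positive definite.

```python
import numpy as np

def sym_row(a, b):
    """Coefficients of a^T L b in the 6 unique entries of a symmetric 3x3 matrix L."""
    return np.array([a[0] * b[0],
                     a[0] * b[1] + a[1] * b[0],
                     a[0] * b[2] + a[2] * b[0],
                     a[1] * b[1],
                     a[1] * b[2] + a[2] * b[1],
                     a[2] * b[2]])

def factorize(W):
    """Orthographic factorization of a 2N x P measurement matrix (x rows first, then y rows)."""
    N = W.shape[0] // 2
    W_c = W - W.mean(axis=1, keepdims=True)           # step 1: centre the features per frame
    U, S, Vt = np.linalg.svd(W_c, full_matrices=False)
    U3, S3, Vt3 = U[:, :3], S[:3], Vt[:3]             # step 2: keep the top 3 components
    R_hat = U3 * np.sqrt(S3)                          # step 3: U Sigma^(1/2)
    X_hat = np.sqrt(S3)[:, None] * Vt3                #         Sigma^(1/2) V^T
    rows, rhs = [], []                                # step 4: linear system for L = Q Q^T
    for i in range(N):
        u, v = R_hat[i], R_hat[N + i]
        rows += [sym_row(u, u), sym_row(v, v), sym_row(u, v)]
        rhs += [1.0, 1.0, 0.0]
    l = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
    L = np.array([[l[0], l[1], l[2]],
                  [l[1], l[3], l[4]],
                  [l[2], l[4], l[5]]])
    Q = np.linalg.cholesky(L)                         # L = Q Q^T (assumes L is positive definite)
    return R_hat @ Q, np.linalg.inv(Q) @ X_hat        # step 5

# Synthetic data: N orthographic cameras observing P points
rng = np.random.default_rng(0)
N, P = 6, 12
X_true = rng.normal(size=(3, P))
frames = [np.linalg.qr(rng.normal(size=(3, 3)))[0] for _ in range(N)]  # orthogonal matrices
M_true = np.vstack([f[0] for f in frames] + [f[1] for f in frames])    # u rows, then v rows
offsets = rng.normal(size=(2 * N, 1))                                  # per-frame image translation
W = M_true @ X_true + offsets

R, X = factorize(W)
W_centered = W - W.mean(axis=1, keepdims=True)
print(np.abs(R @ X - W_centered).max())
```

The recovered `R` and `X` match the ground truth only up to a global rotation (or reflection), so the check compares the product `R @ X` with the centered measurements rather than the factors themselves.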

Bundle Adjustment

Incremental bundle adjustment starts from a selected two-view reconstruction and repeatedly adds new images/cameras to the reconstruction. (COLMAP uses BA.)

$\Pi = \{\pi_i\}$ : $N$ cameras with intrinsic and extrinsic parameters
$\mathcal{X}_w = \{\mathbf{x}^w_p\} \in \mathbb{R}^3$ : set of $P$ 3D points (world coord!)
$\mathcal{X}_s = \{\mathbf{x}^s_{ip}\} \in \mathbb{R}^2$ : image(screen) observations

BA optimizes the cameras and points by minimizing the reprojection error below. Because the projection is non-linear, this is a non-linear optimization problem.

$$ \Pi^*,\mathcal{X}^*_w = \argmin_{\Pi,\mathcal{X}_w} \sum^N_{i=1} \sum^P_{p=1} w_{ip} ||\mathbf{x}^s_{ip} - \pi_i(\mathbf{x}^w_p)||^2_2 $$

$w_{ip}$ : indicates whether point $p$ is observed in image $i$
$\pi_i(\mathbf{x}^w_p)$ : 3D-to-2D projection of the 3D world point $\mathbf{x}^w_p$
- $\tilde{\mathbf{x}}^s_p = \mathbf{K}_i(\mathbf{R}_i\mathbf{x}_p^w+\mathbf{t}_i)$ → $\pi_i(\mathbf{x}^w_p) = \begin{pmatrix} \tilde{x}^s_p / \tilde{w}^s_p \\ \tilde{y}^s_p / \tilde{w}^s_p \end{pmatrix}$
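To show what the objective measures, here is a minimal sketch that only evaluates the weighted reprojection cost; it does not perform the optimization itself, which in practice is done with a non-linear least-squares solver. The function names, the two synthetic cameras, and the visibility mask are assumptions for illustration.

```python
import numpy as np

def project(K, R, t, X):
    """pi_i: project 3D world points X (P x 3) to pixel coordinates (P x 2)."""
    x_hom = (X @ R.T + t) @ K.T              # K (R x + t) for every point
    return x_hom[:, :2] / x_hom[:, 2:3]      # divide by the third coordinate

def reprojection_cost(cameras, points, observations, visibility):
    """Sum over cameras i and points p of w_ip * ||x_ip - pi_i(x_p)||^2."""
    cost = 0.0
    for i, (K, R, t) in enumerate(cameras):
        residual = observations[i] - project(K, R, t, points)
        cost += np.sum(visibility[i][:, None] * residual**2)
    return cost

# Hypothetical scene: two cameras and five points
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = np.deg2rad(10.0)
R2 = np.array([[np.cos(angle), 0.0, np.sin(angle)],
               [0.0, 1.0, 0.0],
               [-np.sin(angle), 0.0, np.cos(angle)]])
cameras = [(K, np.eye(3), np.zeros(3)), (K, R2, np.array([-1.0, 0.0, 0.0]))]

rng = np.random.default_rng(0)
points = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(5, 3))
observations = [project(K_i, R_i, t_i, points) for K_i, R_i, t_i in cameras]

visibility = np.ones((2, 5))
visibility[1, 0] = 0.0                       # point 0 is not observed in image 1

cost_true = reprojection_cost(cameras, points, observations, visibility)
cost_perturbed = reprojection_cost(cameras, points + 0.01, observations, visibility)
print(cost_true, cost_perturbed)
```

The cost is zero at the ground-truth configuration and grows as soon as the points (or cameras) are perturbed, which is what the optimizer exploits.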