기초 컴퓨터 비전: (7) Stereo

고영민·2022년 2월 8일

SLAM: 기초 컴퓨터 비전부터 visual&LiDAR SLAM까지

목록 보기

13/17

앞에서 calibration이나, P3P, epipolar 등의 내용에서 주어진 control points를 통해 카메라의 포즈를 구하기 위해 활용될 수 있는 방법들을 살펴보았다. 이 글에서는 반대로 카메라의 포즈나 어떤 특성을 알고있을 때, 영상 촬영된 포인트의 3차원 위치를 복원하는 방법을 알아본다. 기본적으로 포인트의 3차원 위치를 복원하기 위해서는 하나의 영상으로는 정보가 부족하며, 보통 다양한 위치에서 촬영된 영상들이 함께 사용된다.

1. Triangulation

Figure 1. Triangulation

우리가 만약 두 카메라의 포즈 $\mathbf{R}, \mathbf{C}$ 와 카메라 내부 파라미터를 알고 있다면, 각 image 평면 상의 투영된 점 $\mathbf{x}$ 에서 투영선(위 그림에서 카메라 중심 $\mathbf{C}$ 에 연결된 직선으로, 이 직선 상의 모든 점은 image 평면 상의 $\mathbf{x}$ 로 투영됨)을 복원하는 backprojection이 가능하다. 다만 이 투영선 위의 모든 점이 $\mathbf{x}$ 로 투영되기 때문에 하나의 영상 만으로는 객체의 정확한 위치를 찾을 수 없다. 하지만 만약 다른 위치에서 같은 객체를 촬영한 영상이 있다면, 해당 영상에서도 같은 방법으로 backprojection할 수 있고, 두 투영선의 교점을 찾는 다면 객체의 정확한 위치를 찾을 수 있다. 실제 환경에서는 데이터 및 연산 상에 노이즈가 존재하기 때문에 두 투영선이 정확히 한 점에서 교차하지 않는 경우가 많으며, 보통 두 직선과의 거리가 가장 짧은 점을 선택한다(오른쪽 그림에서 점 $\mathbf{H}$ ).

이를 수식으로 나타내기 위하여 각 카메라 중심을 $\mathbf{P}, \mathbf{Q}$ , 각 카메라에서의 투영선의 방향 벡터를 $\mathbf{r}, \mathbf{s}$ 라고 하면, 각 카메라에서의 투영선은 다음과 같다.

\mathbf{f} = \mathbf{P} + \lambda \mathbf{r}\\ \mathbf{g} = \mathbf{Q} + \mu \mathbf{s}

두 카메라의 포즈 $\mathbf{R}, \mathbf{C}$ 와 카메라 내부 파라미터를 알고 있다고 가정하였으므로 두 방향 벡터는 $\mathbf{r}=\mathbf{R}\mathbf{x},\;\mathbf{s}=\mathbf{R'}\mathbf{x'}$ 로 쉽게 구할 수 있다( $\mathbf{x}, \mathbf{x'}$ 는 투영된 점에 대하여 카메라 좌표계에서 표현됨(3D)).

이때 우리가 찾고자 하는 점 $\mathbf{H}$ 는 두 직선에 모두 수직은 선분 $\overline{FG}$ 의 중점이며, 두 직선에 수직이라는 점을 사용하여 다음과 같은 식을 만들 수 있다.

\begin{cases} (\mathbf{f}-\mathbf{g})\cdot\mathbf{r}=0\\ (\mathbf{f}-\mathbf{g})\cdot\mathbf{s}=0 \end{cases} \;\;\; \rightarrow \;\;\; \begin{cases} (\mathbf{P} + \lambda \mathbf{r} - \mathbf{Q} + \mu \mathbf{s})^T\mathbf{r}=0\\ (\mathbf{P} + \lambda \mathbf{r} - \mathbf{Q} + \mu \mathbf{s})^T\mathbf{s}=0 \end{cases}

이제 우리는 위의 조건을 만족하는 알맞는 $\lambda, \mu$ 를 찾아야 한다. 위의 식을 전개하여 $\lambda, \mu$ 에 대한 식으로 정리하면 다음과 같다.

\begin{pmatrix} \mathbf{r}^T\mathbf{r}&-\mathbf{s}^T\mathbf{r}\\ \mathbf{r}^T\mathbf{s}&-\mathbf{s}^T\mathbf{s} \end{pmatrix} \begin{pmatrix} \lambda\\ \mu \end{pmatrix}= \begin{pmatrix} (\mathbf{Q}-\mathbf{P})^T\mathbf{r}\\ (\mathbf{Q}-\mathbf{P})^T\mathbf{s} \end{pmatrix} \;\;\; \rightarrow \;\;\; \mathbf{A}\mathbf{x}=\mathbf{b}

최종적으로 $\mathbf{A}\mathbf{x}=\mathbf{b}$ 의 형태가 되므로, least square 방법 등으로 해결이 가능하며, 이렇게 구한 $\lambda, \mu$ 를 통해 $\mathbf{f}, \mathbf{g}$ 를 찾고, 점 $\mathbf{H}$ 는 둘 사이의 교점 $(\mathbf{f}+\mathbf{g})/2$ 이 된다.

2. Normal Stereo Case

위에서 triangulation을 통해 어떤 점의 3D 좌표를 복원하였지만, 만약 일반적인 stereo camera를 사용한다면, 더욱 쉽게 점의 깊이 정보를 복원 할 수 있다. 일반적인 stereo camera의 경우 두 카메라 사이의 상대적인 회전이 없으며, 수평 방향의 translation만 존재하고, 두 카메라 사이의 거리가 얼마나 되는지 알 수 있다.

Figure 2. (좌)일반적인 stereo 카메라, (우) stereo 카메라에서 촬영 모델

일반적인 stereo camera에서는 간단한 도형의 닮음을 통해 촬영된 점 $\mathbf{P}$ 의 3D 좌표를 알아 낼 수 있다.

Figure 3. Stereo camera에서 z좌표를 구하기 위한 닮음 관계

위의 그림에서 $\mathbf{x'}, \mathbf{x''}$ 는 투영된 점에 대한 image 좌표이다. 두 카메라가 약간 다른 위치에 설치되어 있기 때문에, 하나의 같은 객체를 촬영하였더라도 영상 내의 위치는 다르게 나타난다. 이를 시차(parallax)라고 하며, normal stereo case에서는 두 카메라 간 수평 방향( $x$ 축)의 위치 차이만 있기 때문에 시차 또한 $x$ 축으로만 나타나며, 위 그림에서 $p_x=\mathbf{x''}-\mathbf{x'}$ (윗쪽 초록색 선)에 해당한다. 닮음의 성질에 적용하면 다음과 같다.

\frac{Z}{c} = \frac{B}{\mathbf{x'}-\mathbf{x''}} \;\;\; \rightarrow \;\;\; Z=c\frac{B}{-(\mathbf{x''}-\mathbf{x'})}=c\frac{B}{-p_x}

Figure 4. Stereo camera에서 x좌표를 구하기 위한 닮음 관계

X좌표는 위의 그림에서 다음과 같은 닮음 관계를 통해 구할 수 있다.

\frac{X}{x'} = \frac{Z}{c} \;\;\; \rightarrow \;\;\; X=x'\frac{B}{-(\mathbf{x''}-\mathbf{x'})}=x'\frac{B}{-p_x}

Figure 5. Stereo camera에서 y좌표를 구하기 위한 닮음 관계

Triangulation에서 보았듯이 투영선이 서로 완벽히 만나지는 않으며, 이를 해결하기 위하여 중점을 선택했었다. Y좌표는 이러한 점을 고려하여 다음과 같이 구할 수 있다.

\frac{Y}{X}=\frac{\frac{y'+y''}{2}}{x'} \;\;\; \rightarrow \;\;\; Y=\frac{y'+y''}{2}\frac{B}{-(\mathbf{x''}-\mathbf{x'})}=\frac{y'+y''}{2}\frac{B}{-p_x}

만약 y방향의 시차가 없다면, $Y=y'B/(-p_x)$ 로 쓸 수 있으며, 위 결과를 homogeneous coordinate로 나타내면 다음과 같다.

\begin{pmatrix} U\\ V\\ W\\ T \end{pmatrix}= \begin{pmatrix} B & 0 & 0 & 0\\ 0 & B & 0 & 0\\ 0 & 0 & Bc & 0\\ 0 & 0 & 0 & -1 \end{pmatrix} \begin{pmatrix} x'\\ y'\\ 1\\ p_x \end{pmatrix}

3. Absolute Orientation

위에서는 카메라 간 변환이 주어졌을 때, 영상에 촬영된 포인트들의 3D 좌표를 복원하는 방법을 살펴보았다. 만약, 두 개의 3D 점군(point cloud)가 각각의 로컬 좌표값과 함께 주어졌을 때, 두 점군을 하나의 프레임으로 정합(registration)하기 위하여 absolute orientation은 어떻게 찾을 수 있을까?

Figure 6. 점군(point cloud) 데이터의 정합(registration)

먼저, $\mathbf{x}, \mathbf{y}$ 와 같은 두 개의 point cloud에서 $\mathbf{x_n}, \mathbf{y_n}$ 와 같은 corresponding points가 각각의 weight, $p_n$ 과 함께 주어졌다고 가정하자. 정합의 목표는 두 point cloud 사이의 알맞는 변환 $\mathbf{R}, \mathbf{t}$ 를 찾아 서로 꼭 맞게 합치는 것으로 다음과 같이 error 함수를 설정하여 minimize한다.

\overline{\mathbf{x}}_n = \lambda\mathbf{R}\mathbf{x}_n + \mathbf{t}\\ \;\\ min(ERROR) \;\;\; \rightarrow min(\sum{|\mathbf{y_n}-\overline{\mathbf{x}}_n|^2p_n})

Figure 7. 정합과정

$\mathbf{y_n}$ 들의 weighted mean은 다음과 같다.

\mathbf{y_0}=\frac{\sum{\mathbf{y_n}p_n}}{\sum{p_n}}

$\mathbf{y_0}$ 를 이용하면 다음과 같이 point cloud들을 $\mathbf{y_0}$ 를 중심으로 옮겨 정리할 수 있다.

\begin{aligned} \overline{\mathbf{x}}_n-\mathbf{y_0} & = \lambda\mathbf{R}\mathbf{x}_n + \mathbf{t}-\mathbf{y_0}\\ & = \lambda\mathbf{R}(\mathbf{x}_n + \mathbf{R}^T\mathbf{t}-\mathbf{R}^T\mathbf{y_0})\\ & = \lambda\mathbf{R}(\mathbf{x}_n - \mathbf{x}_0)\\ \;\\ \lambda^{-\frac{1}{2}}(\overline{\mathbf{x}}_n-\mathbf{y_0}) & = \lambda^{-\frac{1}{2}}\mathbf{R}(\mathbf{x}_n - \mathbf{x}_0) \end{aligned}

이때, $\mathbf{x}_0=\mathbf{R}^T\mathbf{y_0}-\mathbf{R}^T\mathbf{t}$ 이며, 위의 식을 이용하면 error 함수를 다음과 같이 정리할 수 있다.

\begin{aligned} min(\sum{|\mathbf{y_n}-\overline{\mathbf{x}}_n|^2p_n}) & = min(\sum{|\mathbf{y_n}-\mathbf{y_0}-(\overline{\mathbf{x}}_n-\mathbf{y_0})|^2p_n})\\ & =min(\sum{[\lambda^{-\frac{1}{2}}(\mathbf{y_n}-\mathbf{y_0})-\lambda^{-\frac{1}{2}}\mathbf{R}(\mathbf{x}_n - \mathbf{x}_0)]^T[\lambda^{-\frac{1}{2}}(\mathbf{y_n}-\mathbf{y_0})-\lambda^{-\frac{1}{2}}\mathbf{R}(\mathbf{x}_n - \mathbf{x}_0)]p_n})\\ & =\sum{\lambda^{-1}(\mathbf{y_n}-\mathbf{y_0})^T(\mathbf{y_n}-\mathbf{y_0})p_n}\\ & \;\;\;\; + \sum{\lambda(\mathbf{x}_n - \mathbf{x}_0)^T(\mathbf{x}_n - \mathbf{x}_0)p_n}\\ & \;\;\;\; -2\sum{(\mathbf{y_n}-\mathbf{y_0})^T\mathbf{R}(\mathbf{x}_n - \mathbf{x}_0)p_n} \end{aligned}

위와 같은 error 함수를 최소화하기 위하여 각 변수 $\mathbf{x}_0, \mathbf{R}, \lambda$ 별로 error 함수를 미분하여, 그 값이 0이 되도록 한다.

먼저, $\mathbf{x}_0$ 에 대해서 미분하면 다음과 같다.

\frac{\partial \Phi(\mathbf{x}_0, \mathbf{R}, \lambda)}{\partial \mathbf{x}_0} = -2\sum{\lambda(\mathbf{x}_n - \mathbf{x}_0)p_n} + 2\sum{\mathbf{R}^T(\mathbf{y_n}-\mathbf{y_0})p_n}=0

\begin{aligned} \sum{\lambda(\mathbf{x}_n - \mathbf{x}_0)p_n} & = \sum{\mathbf{R}^T(\mathbf{y_n}-\mathbf{y_0})p_n}\\ & = 0 \end{aligned}

\begin{aligned} \sum{\lambda(\mathbf{x}_np_n)}-\sum{\lambda(\mathbf{x}_0p_n)}=0 \end{aligned}

\mathbf{x}_0=\frac{\sum{\mathbf{x}_np_n}}{\sum{p_n}}

이때, 두번째 식에서 $\sum{\mathbf{R}^T(\mathbf{y_n}-\mathbf{y_0})p_n}=0$ 이 되는 이유는 우리가 $\mathbf{y_0}$ 를 $\mathbf{y_n}$ 의 weighted mean이라고 가정했기 때문이며, 결과적으로 error 함수를 최적화 하는 $\mathbf{x}_0$ 또한 $\mathbf{x}_n$ 의 weighted mean이라는 것을 알 수 있다.

$\lambda$ 의 경우에도 비슷한 방법으로 미분을 하고 그 값을 0으로 두면 되며, 그 결과는 다음과 같다.

\lambda=\sqrt{\frac{\sum{(\mathbf{y_n}-\mathbf{y_0})^T(\mathbf{y_n}-\mathbf{y_0})p_n}}{\sum{(\mathbf{x}_n - \mathbf{x}_0)^T(\mathbf{x}_n - \mathbf{x}_0)p_n}}}

$\mathbf{R}$ 의 경우, error 식에서 3번째 term에만 관계가 있으며, 마이너스 부호가 있으므로 최적의 회전행렬 $\mathbf{R}^*$ 은 다음과 같이 주어진 식을 최대화한다.

\mathbf{R}^* = \argmax_{\mathbf{R}}{\sum{(\mathbf{y_n}-\mathbf{y_0})\mathbf{R}(\mathbf{x}_n - \mathbf{x}_0)p_n}}, \;\;\;\;\;\;\;\; \mathbf{R}^T\mathbf{R}=\mathbf{I}

여기서 $\mathbf{a_n}=(\mathbf{x}_n - \mathbf{x}_0), \; \mathbf{b_n}=(\mathbf{y_n}-\mathbf{y_0}), \; \mathbf{H} = \sum{(\mathbf{a_n}\mathbf{b_n}^T)p_n}$ 라고 하면, 위의 식은 다음과 같이 쓸 수 있다.

\begin{aligned} \mathbf{R}^* & = \argmax_{\mathbf{R}}{\sum{\mathbf{b_n}^T\mathbf{R}\mathbf{a_n}p_n}}\\ & = \argmax_{\mathbf{R}}{[tr(\mathbf{R}\mathbf{H})]} \end{aligned}

이제 목표는 $tr(\mathbf{R}\mathbf{H})$ 를 최대화하는 $\mathbf{R}$ 을 찾는 것이다. $svd(\mathbf{H})=\mathbf{U}\mathbf{D}\mathbf{V}^T$ 일 때, $\mathbf{R}=\mathbf{V}\mathbf{U}^T$ 라고 가정하면 다음과 같다.

\begin{aligned} tr(\mathbf{R}\mathbf{H}) & = tr(\mathbf{V}\mathbf{U}^T\mathbf{U}\mathbf{D}\mathbf{V}^T) = tr(\mathbf{V}\mathbf{D}\mathbf{V}^T)\\ & = tr(\mathbf{V}\mathbf{D}^{\frac{1}{2}}\mathbf{D}^{\frac{1}{2}}\mathbf{V}^T) = tr(\mathbf{V}\mathbf{D}^{\frac{1}{2}}(\mathbf{V}\mathbf{D}^{\frac{1}{2}})^T)\\ & = tr(\mathbf{A}\mathbf{A}^T) \end{aligned}

여기서 임의의 영이 아닌 벡터 $\mathbf{w}$ 에 대해서 $\mathbf{w}^T\mathbf{R}\mathbf{H}\mathbf{w}=\mathbf{w}^T\mathbf{V}\mathbf{D}\mathbf{V}^T\mathbf{w}=\mathbf{\bar{w}}^T\mathbf{D}\mathbf{\bar{w}}$ 이며, 이때 $\mathbf{\bar{w}}=\mathbf{V}^T\mathbf{w}$ 이다. $\mathbf{D}=diag(\sigma_1, \cdots, \sigma_n)$ 이므로, $\mathbf{\bar{w}}^T\mathbf{D}\mathbf{\bar{w}}=\sigma_1 w_1^2 + \cdots + \sigma_n w_n^2$ 이고, SVD의 결과로 발생한 특이값은 0이상(영벡터를 SVD한 것이 아니라면, 적어도 하나는 0보다 큼), $w^2_i$ 들 또한 양수이므로, $\mathbf{w}^T\mathbf{R}\mathbf{H}\mathbf{w} = \mathbf{\bar{w}}^T\mathbf{D}\mathbf{\bar{w}} \gt 0$ 이 되어 $\mathbf{R}\mathbf{H}=\mathbf{A}\mathbf{A}^T$ 는 양의 정부호(positive definite) 행렬이 된다 (여기서 $\mathbf{A}\mathbf{A}^T$ 는 대칭행렬로써 직교대각화가 가능하고 앞의 과정에 따라 양의 정부호가 된다).

이때, 양의 정부호 행렬 $\mathbf{A}\mathbf{A}^T$ 과 임의의 orthonormal matrix(회전 행렬 또한 orthonormal임) $\mathbf{B}$ 에 대해서 $tr(\mathbf{B}\mathbf{A}\mathbf{A}^T)=tr(\mathbf{A}^T\mathbf{B}\mathbf{A})=\sum{\mathbf{a_i^T}(\mathbf{B}\mathbf{a_i})}$ 인데( $\mathbf{a_i}$ 는 $\mathbf{A}$ 의 i번째 행벡터), Schwarz inequality에 의해 $\mathbf{a_i^T}(\mathbf{B}\mathbf{a_i}) \le \sqrt{(\mathbf{a_i^T}\mathbf{a_i})(\mathbf{a_i^T}\mathbf{B}^T\mathbf{B}\mathbf{a_i})}=\mathbf{a_i^T}\mathbf{a_i}$ 가 성립한다. 따라서 $tr(\mathbf{B}\mathbf{A}\mathbf{A}^T) \le \sum{\mathbf{a_i^T}\mathbf{a_i}}=tr(\mathbf{A}\mathbf{A}^T)$ 이다.

이를 이용하면 $\mathbf{R}'$ 가 임의의 회전행렬이라고 했을 때, 다음과 같다.

tr(\mathbf{R}\mathbf{H})=tr(\mathbf{A}\mathbf{A}^T) \ge tr(\mathbf{R}'\mathbf{A}\mathbf{A}^T) = tr(\mathbf{R}'\mathbf{R}\mathbf{H})

즉, 임의의 $\mathbf{R}'\mathbf{R}$ 보다 $\mathbf{R}$ 가 언제나 크거나 같은 trace값을 갖으므로 $\mathbf{R}=\mathbf{V}\mathbf{U}^T$ 가 주어진 식을 최적화하는 optimal한 해이다.

이제 최적의 $\mathbf{x_0}, \mathbf{R}, \lambda$ 를 모두 찾았므며, $\mathbf{t} = \mathbf{y_0}-\mathbf{R}\mathbf{x_0}$ 를 통해 translation을 계산할 수 있다.

이러한 방법은 ICP(Iterative Closest Point)방법 등에 응용 될 수 있으며, corresponding points를 찾는 과정에서 ICP방법은 target point cloud의 점과 가장 가까운 기준 point cloud의 점을 corresponding point로 취급한다. 물론 이는 정확하지 않으므로 같은 과정을 여러번 반복하여 품질을 개선해 나간다.

4. Basic SLAM Framework

앞선 내용에서 주어진 point들을 이용하여 두 카메라 사이의 relative orientation을 찾는 방법이나 반대로 orientation이 주어졌을 때 traiangulation 등을 이용하여 촬영된 point의 3D 좌표를 복원하는 방법 등을 살펴보았다. 이를 조합하면 다음과 같이 간단한 SLAM framework를 구성할 수도 있다.

Figure 8. Basic SLAM framework

고영민

이전 포스트

기초 컴퓨터 비전: (6) Epipolar

다음 포스트