Why Do Radiance Fields Use Spherical Harmonics for View-Direction Encoding?

Hwan Heo · May 23, 2024

Instant-ngp NTK NeRF Radiance Fields Ref-NeRF Spherical Harmonics neural tangent kernel positional encoding


This post assumes familiarity with neural rendering techniques such as NeRF, Plenoctrees, and Instant-NGP.
To follow the argument, I also recommend first reading 'Fourier Features Let Networks Learn High-Frequency Functions in Low Dimensional Domains'. (reference review)

1. Introduction

Instant-NGP, presented at SIGGRAPH 2022, is no longer a new paper, but it remains an essential technique for accelerating NeRF, and recent work (e.g., Zip-NeRF) still makes heavy use of NGP's Hash-Grid Representation. Because the Hash-Grid Representation depends only on the xyz position, the later MLP that predicts color takes the view direction as an input, as in the original NeRF, in order to model Non-Lambertian color.

On closer inspection, however, the original NeRF applies the same positional encoding to the view direction as it does to the spatial location, whereas NGP uses Spherical Harmonics (SH) as its view-direction encoding.

cf. encoder_dir code snippet

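# Assumes `import tinycudann as tcnn`; this line sits inside the model's __init__.
self.encoder_dir = tcnn.Encoding(
    n_input_dims=3,  # view direction given as a 3D vector
    encoding_config={
        "otype": "SphericalHarmonics",
        "degree": 4,
    },
)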

The paper explains this choice only very briefly:

We use the spherical harmonics basis in NeRF, similar to concurrent work [Verbin et al. 2021; Yu et al. 2021a].

citing Ref-NeRF and Plenoctrees. Of the two, the one that actually uses an SH encoding for the view direction, as NGP does, is Ref-NeRF. (For why Plenoctrees uses SH, see my earlier review.)

Ref-NeRF, in turn, justifies the use of the SH encoding as follows:

This encoding benefits from being stationary on the sphere, a property that is crucial to the effectiveness of positional encoding in Euclidean space.

In other words, the stated reason for using the SH encoding is that it is stationary on the sphere.

So why does stationarity of the view direction on the sphere matter for training a NeRF?

2. View-Direction Encoding in terms of a kernel regression

cf. Recap of 'Fourier Features Let Networks Learn High-Frequency Functions in Low Dimensional Domains'

Neural networks exhibit a spectral bias during training: they learn low frequencies quickly and high frequencies slowly. This is what makes fine details hard to learn in NeRF.
Positional encoding in NeRF makes the network focus only on the relative distance between spatial locations, i.e., learn features isotropically, so that high-frequency content is also learned well.

That paper uses NTK theory to give a clear mathematical account of what positional encoding does when it is applied to spatial locations. NeRF, however, takes as input not only the spatial location $(x, y, z)$ but also the view direction $(\theta, \phi)$, in order to model Non-Lambertian color.

NeRF implementations, however, do not handle the view direction as $(\theta, \phi)$ in a spherical coordinate system. They represent it as a 3D unit vector in the Cartesian coordinate system and apply the same positional encoding to this 3D vector-valued view direction as to the spatial location.
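The change of representation described above can be sketched in a few lines. This is a minimal illustration, assuming $\theta$ is the polar angle measured from the z-axis and $\phi$ is the azimuth; the function name `spherical_to_unit_vector` is hypothetical and not taken from any NeRF codebase.

```python
import numpy as np

def spherical_to_unit_vector(theta, phi):
    """Map (polar angle theta, azimuth phi) to a 3D Cartesian unit vector."""
    return np.array([
        np.sin(theta) * np.cos(phi),  # x
        np.sin(theta) * np.sin(phi),  # y
        np.cos(theta),                # z
    ])

# Example direction: 60 degrees from the z-axis, 45 degrees of azimuth.
d = spherical_to_unit_vector(np.pi / 3, np.pi / 4)
print("view direction:", d, "norm:", np.linalg.norm(d))
```

The network then sees three Cartesian components rather than two angles, which is why a per-component sinusoidal encoding can be applied to the direction at all.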

That is, for the view-direction encoding $\textbf{d} = \gamma(d)$, the NTK of the NeRF network takes the following form.

\begin{aligned} h_\text{NTK} (\textbf{d}_i ^{\rm T} \textbf{d}_j) &= h_\text{NTK} (\gamma(d_i) ^{\rm T} \gamma(d_j)) \\ &= h_\text{NTK} (h_\gamma(d_i - d_j)). \end{aligned}
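To see where the second equality comes from, the inner product can be expanded component by component. The sketch below assumes the NeRF-style encoding with $K$ frequencies $2^k \pi$ applied to each Cartesian component $d_{i,c}$; the symbols $K$ and $c$ are notation introduced here for illustration.

```latex
\begin{aligned}
\gamma(d_i)^{\rm T} \gamma(d_j)
&= \sum_{c \in \{x, y, z\}} \sum_{k=0}^{K-1} \Big[ \sin(2^k \pi d_{i,c}) \sin(2^k \pi d_{j,c}) + \cos(2^k \pi d_{i,c}) \cos(2^k \pi d_{j,c}) \Big] \\
&= \sum_{c \in \{x, y, z\}} \sum_{k=0}^{K-1} \cos\big( 2^k \pi (d_{i,c} - d_{j,c}) \big) =: h_\gamma(d_i - d_j).
\end{aligned}
```

The identity $\sin a \sin b + \cos a \cos b = \cos(a - b)$ is all that is used, so the inner product depends on the two directions only through their difference $d_i - d_j$.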

From the kernel regression point of view, the view-direction encoding is therefore shift-invariant, just like the encoding of the spatial location.

But isn't that odd? An encoding for directions should be rotation-invariant, not shift-invariant, for learning to work well.

This is where the importance of stationarity on the sphere, the question raised earlier, comes in. Rotation invariance, i.e., stationarity on the sphere, is a property that any kernel regression over view directions ought to have. If we simply apply the same sinusoidal positional encoding as for spatial locations, the MLP cannot learn the 'relative angular difference' between directions isotropically.
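A small numerical check makes the mismatch concrete. The sketch below assumes a NeRF-style encoding with frequencies $2^k \pi$ and `num_freqs=4`; the helper names (`positional_encoding`, `pe_kernel`, `rotation_z`) and the test vectors are illustrative choices, not code from NeRF or NGP. It compares the encoded inner product before and after a common shift, and before and after a common rotation.

```python
import numpy as np

def positional_encoding(d, num_freqs=4):
    """Sinusoidal encoding applied to each Cartesian component of d."""
    d = np.asarray(d, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi  # 2^k * pi for k = 0..K-1
    angles = d[:, None] * freqs[None, :]           # shape (3, K)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).ravel()

def pe_kernel(a, b, num_freqs=4):
    """Inner product gamma(a)^T gamma(b) of two encoded vectors."""
    return float(positional_encoding(a, num_freqs) @ positional_encoding(b, num_freqs))

def rotation_z(angle):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])

# Shifting both inputs by the same offset keeps a - b fixed.
t = np.array([0.3, -0.2, 0.5])
k_original = pe_kernel(a, b)
k_shifted = pe_kernel(a + t, b + t)

# Rotating both inputs keeps the angle between them fixed, but not a - b.
R = rotation_z(np.pi / 4)
k_rotated = pe_kernel(R @ a, R @ b)

print("original:", k_original)
print("shifted :", k_shifted)
print("rotated :", k_rotated)
```

The shifted value matches the original, while the rotated value does not, even though the angle between the two directions is unchanged.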

3. Spherical Harmonics as a view-direction encoding

Now consider the case where the view-direction encoding is Spherical Harmonics. With the spherical harmonics $Y^l_m (\theta, \phi) = \alpha^l_m P^l _m (\cos \theta) e^{im\phi}$, the SH encoding used in Ref-NeRF, NGP, and others can be defined as follows ($\Re$ and $\Im$ denote the real and imaginary parts, respectively).

\begin{aligned} \gamma_{\text{SH}}(v) = & \left\{ Y^{l}_{0}(v), \sqrt{2} \Re \big(Y^{l}_{1}(v) \big), \sqrt{2} \Im \big(Y^{l}_{1}(v) \big), \cdots, \sqrt{2} \Re \big( Y^{l}_{l}(v) \big), \sqrt{2} \Im \big( Y^{l}_{l}(v) \big) \right\} \\ &\qquad \qquad \qquad \text{where } l \in \{ 1,\dots,L \}. \end{aligned}

Deriving the kernel induced by the SH view-direction encoding gives

\begin{aligned} k_{\gamma_{\text{SH}}} (v_1, \ v_2 ) &= \langle \gamma_{\text{SH}} (v_1), \ \gamma_{\text{SH}} (v_2) \rangle \\ &= \sum_{i=1}^{L} \sum_{j=-i}^{i} \Re \big( Y^{i}_{j}(v_1) \big) \cdot \Re \big( Y^{i}_{j}(v_2) \big) + \Im \big( Y^{i}_{j}(v_1) \big) \cdot \Im \big( Y^{i}_{j}(v_2) \big) \\ &= \big \langle Y^l_m (\theta_{v_1}, \phi_{v_1}), \ Y^l_m (\theta_{v_2}, \phi_{v_2}) \big \rangle , \end{aligned}

which is exactly the rotation invariance we wanted from the view-direction encoding.
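The reason this inner product is rotation-invariant can be made explicit with the addition theorem for spherical harmonics. The step below assumes orthonormal $Y^l_m$ (a standard choice for the normalization $\alpha^l_m$, which is not spelled out above) and writes $P_l$ for the Legendre polynomial of degree $l$.

```latex
\begin{aligned}
\sum_{m=-l}^{l} Y^{l}_{m}(v_1) \, \overline{Y^{l}_{m}(v_2)} &= \frac{2l+1}{4\pi} P_l (v_1^{\rm T} v_2) \\
\Rightarrow \quad k_{\gamma_{\text{SH}}} (v_1, \ v_2) &= \sum_{l=1}^{L} \frac{2l+1}{4\pi} P_l (v_1^{\rm T} v_2).
\end{aligned}
```

The kernel is a function of $v_1^{\rm T} v_2$ alone, and $(R v_1)^{\rm T} (R v_2) = v_1^{\rm T} v_2$ for any rotation $R$, so rotating both directions together leaves it unchanged.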

Therefore, the Spherical Harmonics encoding of the view direction is rotation-invariant in kernel space, i.e., stationary on the sphere, which lets the network focus only on the relative angular difference between directions.
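The same property can be checked numerically. The sketch below is an illustration rather than the encoding code of Ref-NeRF or NGP: it assumes $L = 2$, uses the standard Cartesian form of the orthonormal real spherical harmonics (sign conventions may differ from a given library, which does not affect the kernel), and all function names are hypothetical.

```python
import numpy as np

def sh_encoding(v):
    """Real SH features of degree l = 1 and l = 2 for a unit vector v (8 values)."""
    x, y, z = v
    c1 = np.sqrt(3.0 / (4.0 * np.pi))    # l = 1 normalization
    c2 = np.sqrt(15.0 / (4.0 * np.pi))   # l = 2 normalization for xy, yz, xz
    c20 = np.sqrt(5.0 / (16.0 * np.pi))  # l = 2, m = 0 normalization
    return np.array([
        c1 * y, c1 * z, c1 * x,
        c2 * x * y, c2 * y * z, c20 * (3.0 * z * z - 1.0),
        c2 * x * z, 0.5 * c2 * (x * x - y * y),
    ])

def sh_kernel(a, b):
    """Inner product of two SH-encoded directions."""
    return float(sh_encoding(a) @ sh_encoding(b))

def legendre_kernel(cos_angle):
    """sum_{l=1}^{2} (2l + 1) / (4 pi) * P_l(cos_angle)."""
    p1 = cos_angle
    p2 = 0.5 * (3.0 * cos_angle ** 2 - 1.0)
    return float((3.0 * p1 + 5.0 * p2) / (4.0 * np.pi))

def rotation_about_axis(axis, angle):
    """Rotation matrix from Rodrigues' formula."""
    axis = np.asarray(axis, dtype=np.float64)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

a = np.array([1.0, 2.0, 3.0])
a /= np.linalg.norm(a)
b = np.array([-2.0, 0.5, 1.0])
b /= np.linalg.norm(b)
R = rotation_about_axis([1.0, 1.0, 1.0], 0.7)

k_original = sh_kernel(a, b)
k_rotated = sh_kernel(R @ a, R @ b)
print("original:", k_original)
print("rotated :", k_rotated)
print("Legendre:", legendre_kernel(a @ b))
```

All three printed values agree up to floating-point error: the SH kernel is unchanged by a common rotation and depends only on the cosine of the angle between the two directions.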

4. Closing Remarks

As an aside, believe it or not, this idea is the same as the one in the draft research proposal I submitted when applying for a research internship at Naver AI Lab in 2022. (Below is the tex file I was writing at the time.)

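At the time I tried to develop this into a research project, but I could not find a good direction and moved on to a different topic (Robust Camera Pose Refinement for Multi-Resolution Hash Encoding). Looking at Ref-NeRF again now, I am impressed that it did not stop at this single observation but went on to exploit and develop it in shading, where the property is even more useful.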
