Summary
Input : spatial location $(x, y, z)$ and viewing direction $(\theta, \phi)$ (a 5D coordinate)
- Spatial Position
    - $\mathbf{x}$ : xyz coordinate
- Viewing Direction
    - $\mathbf{d}$ : 3D unit vector expressing the viewing direction (not the camera position)
Output : volume density, view-dependent emitted radiance
- Volume Density map
    - $\sigma(\mathbf{x}) : \mathbb{R}^3 \to \mathbb{R}_{\ge 0}$
    - maps an xyz coordinate to opacity
- Color Radiance map
    - $\mathbf{c}(\mathbf{x}, \mathbf{d}) : \mathbb{R}^5 \to \mathbb{R}^3$
    - maps an xyz coordinate and viewing direction to an RGB color
$$F_\Theta : (\mathbf{x}, \mathbf{d}) \to (\mathbf{c}, \sigma)$$
After getting the radiance and density outputs, volume rendering is applied to compose them into rendered views of the scene.
What is a Neural Radiance Field?
Approximate the continuous 5D scene representation with an MLP network $F_\Theta : (\mathbf{x}, \mathbf{d}) \to (\mathbf{c}, \sigma)$ and optimize its weights $\Theta$ to map each input 5D coordinate to its corresponding volume density and directional emitted radiance.
Positional Encoding
Neural networks are universal function approximators, but deep networks are biased towards learning lower-frequency functions. It is also known that mapping inputs into a higher-dimensional space using high-frequency functions makes it easier for a network to fit high-frequency variation. So the paper reformulates $F_\Theta$ as a composition $F_\Theta = F'_\Theta \circ \gamma$, where $\gamma$ is a mapping from $\mathbb{R}$ into a higher-dimensional space $\mathbb{R}^{2L}$ and $F'_\Theta$ is still simply a regular MLP.
$$\gamma(p) = \left( \sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p) \right)$$
This function is applied separately to each of the three coordinate values in $\mathbf{x}$. In this paper, $L = 10$ for $\gamma(\mathbf{x})$ and $L = 4$ for $\gamma(\mathbf{d})$. This process helps the MLP approximate higher-frequency functions more easily, which means more detailed representations can be learned. A sketch of the encoding is shown below.
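A minimal NumPy sketch of this encoding, assuming a last-axis layout of `(..., D)` for the input; the exact sin/cos interleaving is my own choice and differs between implementations:

```python
# Positional encoding gamma(p): each scalar component is mapped to
# (sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^{L-1} pi p), cos(2^{L-1} pi p)).
import numpy as np

def positional_encoding(p: np.ndarray, L: int) -> np.ndarray:
    """p: array of shape (..., D), e.g. D=3 for positions x.
    Returns an array of shape (..., D * 2L)."""
    freqs = 2.0 ** np.arange(L) * np.pi          # (L,) frequencies 2^l * pi
    scaled = p[..., None] * freqs                # (..., D, L)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)  # (..., D, 2L)
    return enc.reshape(*p.shape[:-1], -1)        # flatten to (..., D * 2L)

# Example: encoding a 3D position with L=10 gives a 60-dimensional vector.
x = np.array([0.1, -0.4, 0.7])
print(positional_encoding(x, L=10).shape)        # (60,)
```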
Hierarchical Volume Sampling
A lot of free space and occluded regions along each ray don't contribute to the rendered image, so a method is needed to sample radiance points more efficiently. The paper proposes a hierarchical representation that increases rendering efficiency by allocating samples proportionally to their expected effect on the final rendering.
The paper optimizes two networks: one coarse, the other fine. First, sample $N_c$ locations using stratified sampling and evaluate the coarse network at these locations (a sketch of this step follows). Given the coarse network's output, produce a more informed sampling in which samples are biased towards the relevant parts of the volume.
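A minimal sketch of the stratified sampling step, assuming one uniform draw per evenly spaced bin between the near and far bounds (function name and signature are my own):

```python
# Stratified sampling along a ray: partition [t_near, t_far] into bins
# and draw one uniform sample from each bin.
import numpy as np

def stratified_sample(t_near: float, t_far: float, n_samples: int,
                      rng: np.random.Generator) -> np.ndarray:
    """Returns sorted sample depths of shape (n_samples,)."""
    edges = np.linspace(t_near, t_far, n_samples + 1)   # bin boundaries
    lower, upper = edges[:-1], edges[1:]
    u = rng.random(n_samples)                           # uniform in [0, 1)
    return lower + u * (upper - lower)                  # one jittered sample per bin

t_coarse = stratified_sample(2.0, 6.0, 64, np.random.default_rng(0))
```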
$$\hat{C}_c(\mathbf{r}) = \sum_{i=1}^{N_c} w_i c_i, \qquad w_i = T_i \left( 1 - \exp(-\sigma_i \delta_i) \right)$$
Rewrite the alpha-composited color from the coarse network, $\hat{C}_c(\mathbf{r})$, as a weighted sum of all sampled colors $c_i$. Normalizing the weights $w_i$ yields a piecewise-constant PDF along the ray; sample a second set of $N_f$ locations from this distribution using inverse transform sampling (sketched below), then evaluate the "fine" network at the union of the first and second sets of samples. This is similar to importance sampling.
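A minimal sketch of the inverse transform sampling step, assuming the coarse weights define a piecewise-constant PDF over the depth bins (names and the epsilon guards are my own choices):

```python
# Draw fine samples proportionally to the coarse weights w_i by inverting
# the piecewise-linear CDF built from the normalized weights.
import numpy as np

def sample_pdf(bin_edges: np.ndarray, weights: np.ndarray, n_fine: int,
               rng: np.random.Generator) -> np.ndarray:
    """bin_edges: (N+1,) depth boundaries; weights: (N,) coarse weights w_i.
    Returns (n_fine,) depths concentrated where the weights are large."""
    pdf = weights / (weights.sum() + 1e-8)              # normalize to a PDF
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])       # (N+1,) CDF at bin edges
    u = rng.random(n_fine)                              # uniform samples
    idx = np.searchsorted(cdf, u, side="right") - 1     # which bin each u lands in
    idx = np.clip(idx, 0, len(weights) - 1)
    denom = np.where(pdf[idx] > 0, pdf[idx], 1.0)       # avoid division by zero
    frac = (u - cdf[idx]) / denom                       # position inside the bin
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])
```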
So the input is $\gamma(\mathbf{x})$, which has $3$ (coordinates) $\times\ 2$ (sin/cos) $\times\ 10$ ($L$) $= 60$ dimensions, projected to 256 channels. Each black arrow in the paper's architecture figure is a fully-connected layer followed by ReLU. After the 4th layer, $\gamma(\mathbf{x})$ is concatenated back in via a skip connection. An additional layer outputs the volume density $\sigma$ and a 256-dimensional feature vector. This feature vector is concatenated with the positional encoding of the input viewing direction $\gamma(\mathbf{d})$ and is processed by an additional ReLU layer with 128 channels. A PyTorch sketch of this architecture follows.
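A rough PyTorch sketch of this architecture, assuming the layer sizes described above; the activation placement on the density and color heads follows common NeRF reimplementations and may differ in detail from the official code:

```python
# NeRF MLP: 8 x 256 ReLU layers with a skip connection after layer 4,
# a density head, and a 128-channel color branch conditioned on gamma(d).
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    def __init__(self, x_dim: int = 60, d_dim: int = 24, width: int = 256):
        super().__init__()
        self.skip_at = 4  # concatenate gamma(x) again after the 4th layer
        layers, in_dim = [], x_dim
        for i in range(8):
            layers.append(nn.Linear(in_dim, width))
            in_dim = width + x_dim if i + 1 == self.skip_at else width
        self.trunk = nn.ModuleList(layers)
        self.sigma_head = nn.Linear(width, 1)            # volume density
        self.feature = nn.Linear(width, width)           # 256-d feature vector
        self.color_hidden = nn.Linear(width + d_dim, 128)
        self.color_head = nn.Linear(128, 3)              # RGB

    def forward(self, gamma_x: torch.Tensor, gamma_d: torch.Tensor):
        h = gamma_x
        for i, layer in enumerate(self.trunk):
            h = torch.relu(layer(h))
            if i + 1 == self.skip_at:
                h = torch.cat([h, gamma_x], dim=-1)      # skip connection
        sigma = torch.relu(self.sigma_head(h))           # density >= 0
        feat = self.feature(h)
        h = torch.relu(self.color_hidden(torch.cat([feat, gamma_d], dim=-1)))
        rgb = torch.sigmoid(self.color_head(h))          # color in [0, 1]
        return rgb, sigma
```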
How to Render a Volume from Radiance?
- how do we build an image from this output?
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt$$
- $T(t)$ expresses the accumulated transmittance from $t_n$ to $t$, i.e., the probability that the ray travels from $t_n$ to $t$ without hitting any particle
$$T(t) = \exp\left( -\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds \right)$$
- $\sigma(\mathbf{r}(t))$ expresses the opacity at the spatial point $\mathbf{r}(t)$. It is interpreted as the differential probability of a ray terminating at an infinitesimal particle at location $\mathbf{r}(t)$
- $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, where $\mathbf{d}$ is the viewing direction and $\mathbf{o}$ is the ray origin (the camera position)
- $\mathbf{c}(\mathbf{r}(t), \mathbf{d})$ expresses the color at a specific point, taking the viewing direction into account
- Approximate this continuous integral with a discrete quadrature rule (see the sketch after this list):
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left( 1 - \exp(-\sigma_i \delta_i) \right) c_i$$
- $T_i$ expresses the accumulated transmittance over samples $1$ to $i-1$, where $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples
$$T_i = \exp\left( -\sum_{j=1}^{i-1} \sigma_j \delta_j \right)$$
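A minimal NumPy sketch of this discrete compositing for a single ray, assuming sorted sample depths; the large final delta mirrors common NeRF implementations:

```python
# Discrete volume rendering (alpha compositing) of the quadrature rule above.
import numpy as np

def composite(rgb: np.ndarray, sigma: np.ndarray, t: np.ndarray) -> np.ndarray:
    """rgb: (N, 3) colors c_i; sigma: (N,) densities; t: (N,) sorted depths.
    Returns the composited pixel color C_hat(r) of shape (3,)."""
    delta = np.diff(t, append=1e10)                 # delta_i = t_{i+1} - t_i
    alpha = 1.0 - np.exp(-sigma * delta)            # per-sample opacity
    # T_i = exp(-sum_{j<i} sigma_j delta_j) = prod_{j<i} (1 - alpha_j),
    # so shift the cumulative product by one: T_1 = 1.
    trans = np.cumprod(1.0 - alpha + 1e-10)
    trans = np.concatenate([[1.0], trans[:-1]])
    weights = trans * alpha                         # the w_i used for fine sampling
    return (weights[:, None] * rgb).sum(axis=0)
```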
Loss
The paper uses the COLMAP package to estimate camera parameters for the real data. At each optimization iteration, the authors randomly sample a batch of camera rays from the set of all pixels in the dataset, and then follow the hierarchical sampling to query $N_c$ samples from the coarse network and $N_c + N_f$ samples from the fine network.
The loss is the total squared error between the rendered and true pixel colors for both networks:
$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left[ \left\lVert \hat{C}_c(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2 + \left\lVert \hat{C}_f(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2 \right]$$
where $\mathcal{R}$ is the set of rays in each batch. In this experiment, $N_c = 64$ and $N_f = 128$, so $64 + (64 + 128) = 256$ samples are queried per ray in total. A sketch of the loss computation is shown below.
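A minimal PyTorch sketch of this loss, assuming `(batch, 3)` tensors of rendered and ground-truth ray colors (names are my own):

```python
# Combined coarse + fine photometric loss over a batch of rays.
import torch

def nerf_loss(c_coarse: torch.Tensor, c_fine: torch.Tensor,
              c_true: torch.Tensor) -> torch.Tensor:
    """Total squared error of both renderings against the true pixel colors.
    The coarse term is kept so the coarse network also receives gradients."""
    return ((c_coarse - c_true) ** 2).sum() + ((c_fine - c_true) ** 2).sum()
```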