Representing Scenes as Neural Radiance Fields for View Synthesis - ECCV 2020


Summary


Input : spatial location (x, y, z), viewing direction (θ, φ) (5D coordinate)

  • Spatial Position
    • $\textbf{x}$ : xyz coordinate
  • Viewing Direction
    • $\textbf{d}$ : 3D unit vector of the viewing direction (θ, φ)

Output : volume density, view-dependent emitted radiance

  • Volume Density map
    • $\sigma(\textbf{x}): \mathbb{R}^3 \rightarrow \mathbb{R}_{\ge 0}$
    • maps an xyz coordinate to opacity (density)
  • Color Radiance map
    • $\textbf{c}(\textbf{x}, \textbf{d}): \mathbb{R}^5 \rightarrow \mathbb{R}^3$
    • maps an xyz coordinate and viewing direction to an RGB color

$$F_\theta: (\textbf{x}, \textbf{d}) \rightarrow (\textbf{c}, \sigma)$$

After obtaining the density and radiance output, volume rendering is applied to composite these values into an image, which enables synthesizing novel views.

What is a Neural Radiance Field?


Approximate the continuous 5D scene representation with an MLP network $F_\theta: (\textbf{x}, \textbf{d}) \rightarrow (\textbf{c}, \sigma)$ and optimize its weights $\theta$ to map each input 5D coordinate to its corresponding volume density and directional emitted color.

Positional Encoding


Neural networks are universal function approximators.

However, deep networks are biased towards learning lower-frequency functions. Mapping the inputs into a higher-dimensional space with high-frequency functions before passing them to the network makes it easier to fit high-frequency variation. So $F_\theta$ is reformulated as a composition $F_\theta = F'_\theta \circ \gamma$, where $\gamma$ is a mapping from $\mathbb{R}$ into the higher-dimensional space $\mathbb{R}^{2L}$, and $F'_\theta$ is still simply a regular MLP.

$$\gamma(p) = \left(\sin(2^0 \pi p),\ \cos(2^0 \pi p),\ \ldots,\ \sin(2^{L-1} \pi p),\ \cos(2^{L-1} \pi p)\right)$$

This function is applied separately to each of the three coordinate values in $\textbf{x}$ (and in $\textbf{d}$). In this paper, $L = 10$ for $\gamma(\textbf{x})$ and $L = 4$ for $\gamma(\textbf{d})$. This process helps the MLP approximate higher-frequency functions more easily, which means more detailed representations can be learned.
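A minimal sketch of this encoding in PyTorch; the function name `positional_encoding` and the tensor shapes are my own illustration, not taken from the authors' code:

```python
import math
import torch

def positional_encoding(p: torch.Tensor, L: int) -> torch.Tensor:
    """gamma(p): map each coordinate to sin/cos pairs at frequencies 2^0 ... 2^{L-1} (times pi).

    p : (..., D) tensor, e.g. D = 3 for x and for d.
    returns : (..., D * 2L) tensor.
    """
    freqs = 2.0 ** torch.arange(L, dtype=p.dtype)                      # 2^0, 2^1, ..., 2^{L-1}
    angles = p.unsqueeze(-1) * freqs * math.pi                         # (..., D, L)
    enc = torch.stack((torch.sin(angles), torch.cos(angles)), dim=-1)  # (..., D, L, 2)
    return enc.flatten(start_dim=-3)                                   # (..., D * 2L)

# gamma(x) with L = 10: 3 coords * 2 (sin/cos) * 10 = 60 dims; gamma(d) with L = 4: 24 dims
x = torch.rand(1024, 3)
print(positional_encoding(x, L=10).shape)   # torch.Size([1024, 60])
```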

Hierarchical Volume Sampling


There is a lot of free space and many occluded regions that don't contribute to the rendered image, so a method is needed to sample radiance points more efficiently. The paper proposes a hierarchical representation that increases rendering efficiency by allocating samples proportionally to their expected effect on the final rendering.

The paper optimizes two networks: one coarse and one fine. First, $N_c$ locations are sampled along each ray using stratified sampling, and the coarse network is evaluated at these locations. The result of this evaluation is then used to produce a more informed sampling, where samples are biased towards the relevant parts of the volume.
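A minimal sketch of the stratified sampling step, assuming near/far integration bounds `t_n` and `t_f` along each ray (the names here are mine, not from the paper's code):

```python
import torch

def stratified_sample(t_n: float, t_f: float, n_rays: int, N_c: int) -> torch.Tensor:
    """Split [t_n, t_f] into N_c evenly spaced bins and draw one uniform sample per bin."""
    edges = torch.linspace(t_n, t_f, N_c + 1)            # (N_c + 1,) bin edges
    lower, upper = edges[:-1], edges[1:]                 # (N_c,), (N_c,)
    u = torch.rand(n_rays, N_c)                          # independent jitter for every ray and bin
    return lower + (upper - lower) * u                   # (n_rays, N_c) sample depths t_i

t_vals = stratified_sample(2.0, 6.0, n_rays=4, N_c=64)   # e.g. 64 coarse samples per ray
```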

$$\hat{C}_c(\textbf{r}) = \sum_{i=1}^{N_c} w_i c_i, \quad w_i = T_i\big(1 - \exp(-\sigma_i \delta_i)\big)$$

The alpha-composited color from the coarse network $\hat{C}_c(\textbf{r})$ is rewritten as a weighted sum of all sampled colors $c_i$. Normalizing these weights, $\hat{w}_i = w_i / \sum_{j=1}^{N_c} w_j$, produces a piecewise-constant PDF along the ray. A second set of $N_f$ locations is sampled from this distribution using inverse transform sampling, and the "fine" network is evaluated at the union of the first and second sets of samples. This is similar to importance sampling.
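A sketch of that inverse transform sampling step in PyTorch; the bin handling and the `1e-8` stabilizers are my own choices, so treat this as a plausible reconstruction rather than the official implementation:

```python
import torch

def sample_fine(bins: torch.Tensor, weights: torch.Tensor, N_f: int) -> torch.Tensor:
    """Draw N_f depths per ray from the piecewise-constant PDF defined by the coarse weights w_i.

    bins    : (n_rays, N_c + 1) edges of the coarse sample intervals along each ray.
    weights : (n_rays, N_c) unnormalized weights w_i from the coarse pass.
    """
    pdf = weights / (weights.sum(dim=-1, keepdim=True) + 1e-8)      # normalize to a PDF
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)  # prepend 0 -> (n_rays, N_c + 1)

    u = torch.rand(cdf.shape[0], N_f)                               # uniform samples in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)

    # invert the CDF by linear interpolation inside the selected bin
    cdf_lo, cdf_hi = torch.gather(cdf, -1, idx - 1), torch.gather(cdf, -1, idx)
    bin_lo, bin_hi = torch.gather(bins, -1, idx - 1), torch.gather(bins, -1, idx)
    frac = (u - cdf_lo) / (cdf_hi - cdf_lo + 1e-8)
    return bin_lo + frac * (bin_hi - bin_lo)                        # (n_rays, N_f) fine sample depths
```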

So the input is $\gamma(\textbf{x})$, which is 3 (coordinates) × 2 (sin/cos) × 10 ($L$) = 60 dimensions, mapped to 256 channels. Each black arrow in the paper's network architecture figure is a fully-connected layer followed by ReLU. After the 4th layer, the $\gamma(\textbf{x})$ term is concatenated back in as a skip connection. An additional layer outputs the volume density $\sigma$ and a 256-dimensional feature vector. This feature vector is concatenated with the positional encoding of the input viewing direction $\gamma(\textbf{d})$ and is processed by an additional ReLU layer with 128 channels, which finally produces the RGB color.
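A PyTorch sketch of that architecture as I read it from the description above; the class and attribute names (`NeRFMLP`, `sigma_head`, etc.) are my own, and the final sigmoid RGB layer follows the paper's figure:

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Sketch of the NeRF MLP: gamma(x) (60-d) and gamma(d) (24-d) in, (rgb, sigma) out."""

    def __init__(self, x_dim: int = 60, d_dim: int = 24, width: int = 256):
        super().__init__()
        # first stack of 256-channel fully-connected ReLU layers on gamma(x)
        self.stage1 = nn.Sequential(
            nn.Linear(x_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # second stack after the skip connection that re-concatenates gamma(x)
        self.stage2 = nn.Sequential(
            nn.Linear(width + x_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)        # volume density
        self.feature = nn.Linear(width, width)       # 256-d feature vector
        self.dir_layer = nn.Sequential(              # feature ++ gamma(d) -> 128 channels
            nn.Linear(width + d_dim, 128), nn.ReLU(),
        )
        self.rgb_head = nn.Linear(128, 3)            # view-dependent color

    def forward(self, gamma_x: torch.Tensor, gamma_d: torch.Tensor):
        h = self.stage1(gamma_x)
        h = self.stage2(torch.cat([h, gamma_x], dim=-1))
        sigma = torch.relu(self.sigma_head(h))       # density constrained to be non-negative
        h = self.feature(h)
        h = self.dir_layer(torch.cat([h, gamma_d], dim=-1))
        rgb = torch.sigmoid(self.rgb_head(h))        # color in [0, 1]
        return rgb, sigma.squeeze(-1)
```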

How to Render a Volume from Radiance?


  • How do we render a 2D image from this output?
    $C(\textbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\textbf{r}(t))\,\textbf{c}(\textbf{r}(t),\textbf{d})\,dt$
    • $T(t)$ expresses the accumulated transmittance from $t_n$ to $t$, i.e., how much of the ray survives the opacity encountered so far
      $T(t) = \exp\left(-\int_{t_n}^{t}\sigma(\textbf{r}(s))\,ds\right)$
    • $\sigma(\textbf{r}(t))$ expresses the opacity at the spatial point $\textbf{r}(t)$. It is interpreted as the differential probability of a ray terminating at an infinitesimal particle at location $\textbf{r}(t)$
      • $\textbf{r}(t) = \textbf{o} + t\textbf{d}$, where $\textbf{d}$ is the viewing direction and $\textbf{o}$ is the ray origin (camera center)
    • $\textbf{c}(\textbf{r}(t), \textbf{d})$ expresses the color at a specific point, taking the viewing direction into account
  • The continuous integral is approximated with a discrete quadrature rule over the sampled points, where $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples (see the sketch after this list)
    $\hat{C}(\textbf{r}) = \sum_{i=1}^{N} T_i\big(1 - \exp(-\sigma_i\delta_i)\big)\,\textbf{c}_i$
    • $T_i$ expresses the accumulated transmittance from sample $1$ to $i - 1$
      $T_i = \exp\left(-\sum_{j=1}^{i-1}\sigma_j\delta_j\right)$
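A PyTorch sketch of this discrete compositing; the `1e10` padding for the last interval is a common convention I am assuming here, not something stated in this summary:

```python
import torch

def composite(rgb: torch.Tensor, sigma: torch.Tensor, t_vals: torch.Tensor) -> torch.Tensor:
    """Discrete volume rendering: C_hat(r) = sum_i T_i (1 - exp(-sigma_i delta_i)) c_i.

    rgb    : (n_rays, N, 3) colors c_i at the sampled points.
    sigma  : (n_rays, N) densities sigma_i.
    t_vals : (n_rays, N) sample depths t_i along each ray.
    """
    delta = t_vals[..., 1:] - t_vals[..., :-1]                                 # delta_i = t_{i+1} - t_i
    delta = torch.cat([delta, torch.full_like(delta[..., :1], 1e10)], dim=-1)  # pad the last interval

    alpha = 1.0 - torch.exp(-sigma * delta)                          # opacity contributed by segment i
    # T_i = exp(-sum_{j<i} sigma_j delta_j): transmittance accumulated before sample i
    trans = torch.exp(-torch.cumsum(sigma * delta, dim=-1))
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)

    weights = trans * alpha                                          # the w_i reused for fine sampling
    return (weights.unsqueeze(-1) * rgb).sum(dim=-2)                 # (n_rays, 3) rendered pixel color
```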

Loss


The paper uses the COLMAP package to estimate camera parameters for real data. At each optimization iteration, the authors randomly sample a batch of camera rays from the set of all pixels in the dataset, and then follow the hierarchical sampling procedure to query $N_c$ samples from the coarse network and $N_c + N_f$ samples from the fine network.

The loss is the total squared error between the rendered and true pixel colors for both networks.

$$L = \sum_{r\in R}\left[\,\big\|\hat{C}_c(r) - C(r)\big\|^2_2 \;+\; \big\|\hat{C}_f(r) - C(r)\big\|^2_2\,\right]$$

where $R$ is the set of rays in each batch. In this experiment, $N_c = 64$ and $N_f = 128$, so the total number of network queries per ray is $64 + (64 + 128) = 256$.
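A minimal PyTorch sketch of this loss over a batch of rays (the function name is mine); note that both the coarse and fine renderings are supervised, so the coarse network also receives gradients:

```python
import torch

def nerf_loss(c_coarse: torch.Tensor, c_fine: torch.Tensor, c_true: torch.Tensor) -> torch.Tensor:
    """Total squared error of the coarse and fine renderings against ground-truth pixel colors.

    c_coarse, c_fine, c_true : (n_rays, 3) colors for the batch of rays R.
    """
    loss_c = ((c_coarse - c_true) ** 2).sum(dim=-1)    # ||C_hat_c(r) - C(r)||_2^2 per ray
    loss_f = ((c_fine - c_true) ** 2).sum(dim=-1)      # ||C_hat_f(r) - C(r)||_2^2 per ray
    return (loss_c + loss_f).sum()                     # sum over all rays r in R
```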
