Summary
Input : spatial location $(x, y, z)$ and viewing direction $(\theta, \phi)$ (a 5D coordinate)
- Spatial Position
    - $\mathbf{x}$ : xyz coordinate
- Viewing Direction
    - $\mathbf{d}$ : 3D unit vector expressing the viewing direction (not the camera position)
Output : volume density, view-dependent emitted radiance
- Volume Density map
    - $\sigma(\mathbf{x}) : \mathbb{R}^3 \to \mathbb{R}_{\ge 0}$
    - maps an xyz coordinate to opacity
- Color Radiance map
    - $\mathbf{c}(\mathbf{x}, \mathbf{d}) : \mathbb{R}^5 \to \mathbb{R}^3$
    - maps an xyz coordinate and viewing direction to an RGB color
$$F_\Theta : (\mathbf{x}, \mathbf{d}) \to (\mathbf{c}, \sigma)$$
After getting the radiance and density outputs, volume rendering is applied to compose them into rendered views of the scene.
What is a Neural Radiance Field?
Approximate the continuous 5D scene representation with an MLP network $F_\Theta : (\mathbf{x}, \mathbf{d}) \to (\mathbf{c}, \sigma)$ and optimize its weights $\Theta$ to map each input 5D coordinate to its corresponding volume density and directional emitted radiance.
Positional Encoding
Neural networks are universal function approximators, but deep networks are biased towards learning lower-frequency functions. It is also known that mapping inputs into a higher-dimensional space using high-frequency functions makes it easier for a network to fit high-frequency variation. So the paper reformulates $F_\Theta$ as a composition $F_\Theta = F'_\Theta \circ \gamma$, where $\gamma$ is a mapping from $\mathbb{R}$ into a higher-dimensional space $\mathbb{R}^{2L}$ and $F'_\Theta$ is still simply a regular MLP.
$$\gamma(p) = \left( \sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p) \right)$$
This function is applied separately to each of the three coordinate values in $\mathbf{x}$. In this paper, $L = 10$ for $\gamma(\mathbf{x})$ and $L = 4$ for $\gamma(\mathbf{d})$. This process helps the MLP approximate higher-frequency functions more easily, which means more detailed representations can be learned. A sketch of the encoding is shown below.
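A minimal NumPy sketch of this encoding, assuming a last-axis layout of `(..., D)` for the input; the exact sin/cos interleaving is my own choice and differs between implementations:

```python
# Positional encoding gamma(p): each scalar component is mapped to
# (sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^{L-1} pi p), cos(2^{L-1} pi p)).
import numpy as np

def positional_encoding(p: np.ndarray, L: int) -> np.ndarray:
    """p: array of shape (..., D), e.g. D=3 for positions x.
    Returns an array of shape (..., D * 2L)."""
    freqs = 2.0 ** np.arange(L) * np.pi          # (L,) frequencies 2^l * pi
    scaled = p[..., None] * freqs                # (..., D, L)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)  # (..., D, 2L)
    return enc.reshape(*p.shape[:-1], -1)        # flatten to (..., D * 2L)

# Example: encoding a 3D position with L=10 gives a 60-dimensional vector.
x = np.array([0.1, -0.4, 0.7])
print(positional_encoding(x, L=10).shape)        # (60,)
```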
Hierarchical Volume Sampling
A lot of free space and occluded regions along each ray don't contribute to the rendered image, so a method is needed to sample radiance points more efficiently. The paper proposes a hierarchical representation that increases rendering efficiency by allocating samples proportionally to their expected effect on the final rendering.
The paper optimizes two networks: one coarse, the other fine. First, sample $N_c$ locations using stratified sampling and evaluate the coarse network at these locations (a sketch of this step follows). Given the coarse network's output, produce a more informed sampling in which samples are biased towards the relevant parts of the volume.
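A minimal sketch of the stratified sampling step, assuming one uniform draw per evenly spaced bin between the near and far bounds (function name and signature are my own):

```python
# Stratified sampling along a ray: partition [t_near, t_far] into bins
# and draw one uniform sample from each bin.
import numpy as np

def stratified_sample(t_near: float, t_far: float, n_samples: int,
                      rng: np.random.Generator) -> np.ndarray:
    """Returns sorted sample depths of shape (n_samples,)."""
    edges = np.linspace(t_near, t_far, n_samples + 1)   # bin boundaries
    lower, upper = edges[:-1], edges[1:]
    u = rng.random(n_samples)                           # uniform in [0, 1)
    return lower + u * (upper - lower)                  # one jittered sample per bin

t_coarse = stratified_sample(2.0, 6.0, 64, np.random.default_rng(0))
```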
$$\hat{C}_c(\mathbf{r}) = \sum_{i=1}^{N_c} w_i c_i, \qquad w_i = T_i \left( 1 - \exp(-\sigma_i \delta_i) \right)$$
Rewrite the alpha-composited color from the coarse network, $\hat{C}_c(\mathbf{r})$, as a weighted sum of all sampled colors $c_i$. Normalizing the weights $w_i$ yields a piecewise-constant PDF along the ray; sample a second set of $N_f$ locations from this distribution using inverse transform sampling (sketched below), then evaluate the "fine" network at the union of the first and second sets of samples. This is similar to importance sampling.
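A minimal sketch of the inverse transform sampling step, assuming the coarse weights define a piecewise-constant PDF over the depth bins (names and the epsilon guards are my own choices):

```python
# Draw fine samples proportionally to the coarse weights w_i by inverting
# the piecewise-linear CDF built from the normalized weights.
import numpy as np

def sample_pdf(bin_edges: np.ndarray, weights: np.ndarray, n_fine: int,
               rng: np.random.Generator) -> np.ndarray:
    """bin_edges: (N+1,) depth boundaries; weights: (N,) coarse weights w_i.
    Returns (n_fine,) depths concentrated where the weights are large."""
    pdf = weights / (weights.sum() + 1e-8)              # normalize to a PDF
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])       # (N+1,) CDF at bin edges
    u = rng.random(n_fine)                              # uniform samples
    idx = np.searchsorted(cdf, u, side="right") - 1     # which bin each u lands in
    idx = np.clip(idx, 0, len(weights) - 1)
    denom = np.where(pdf[idx] > 0, pdf[idx], 1.0)       # avoid division by zero
    frac = (u - cdf[idx]) / denom                       # position inside the bin
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])
```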
So the input is $\gamma(\mathbf{x})$, which has $3$ (coordinates) $\times\ 2$ (sin/cos) $\times\ 10$ ($L$) $= 60$ dimensions, projected to 256 channels. Each black arrow in the paper's architecture figure is a fully-connected layer followed by ReLU. After the 4th layer, $\gamma(\mathbf{x})$ is concatenated back in via a skip connection. An additional layer outputs the volume density $\sigma$ and a 256-dimensional feature vector. This feature vector is concatenated with the positional encoding of the input viewing direction $\gamma(\mathbf{d})$ and is processed by an additional ReLU layer with 128 channels. A PyTorch sketch of this architecture follows.
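A rough PyTorch sketch of this architecture, assuming the layer sizes described above; the activation placement on the density and color heads follows common NeRF reimplementations and may differ in detail from the official code:

```python
# NeRF MLP: 8 x 256 ReLU layers with a skip connection after layer 4,
# a density head, and a 128-channel color branch conditioned on gamma(d).
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    def __init__(self, x_dim: int = 60, d_dim: int = 24, width: int = 256):
        super().__init__()
        self.skip_at = 4  # concatenate gamma(x) again after the 4th layer
        layers, in_dim = [], x_dim
        for i in range(8):
            layers.append(nn.Linear(in_dim, width))
            in_dim = width + x_dim if i + 1 == self.skip_at else width
        self.trunk = nn.ModuleList(layers)
        self.sigma_head = nn.Linear(width, 1)            # volume density
        self.feature = nn.Linear(width, width)           # 256-d feature vector
        self.color_hidden = nn.Linear(width + d_dim, 128)
        self.color_head = nn.Linear(128, 3)              # RGB

    def forward(self, gamma_x: torch.Tensor, gamma_d: torch.Tensor):
        h = gamma_x
        for i, layer in enumerate(self.trunk):
            h = torch.relu(layer(h))
            if i + 1 == self.skip_at:
                h = torch.cat([h, gamma_x], dim=-1)      # skip connection
        sigma = torch.relu(self.sigma_head(h))           # density >= 0
        feat = self.feature(h)
        h = torch.relu(self.color_hidden(torch.cat([feat, gamma_d], dim=-1)))
        rgb = torch.sigmoid(self.color_head(h))          # color in [0, 1]
        return rgb, sigma
```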
How to Render a Volume from Radiance?
- how do we build an image from this output?
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt$$
- $T(t)$ expresses the accumulated transmittance from $t_n$ to $t$, i.e., the probability that the ray travels from $t_n$ to $t$ without hitting any particle
$$T(t) = \exp\left( -\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds \right)$$
- $\sigma(\mathbf{r}(t))$ expresses the opacity at the spatial point $\mathbf{r}(t)$. It is interpreted as the differential probability of a ray terminating at an infinitesimal particle at location $\mathbf{r}(t)$
- $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, where $\mathbf{d}$ is the viewing direction and $\mathbf{o}$ is the ray origin (the camera position)
- $\mathbf{c}(\mathbf{r}(t), \mathbf{d})$ expresses the color at a specific point, taking the viewing direction into account
- Approximate this continuous integral with a discrete quadrature rule (see the sketch after this list):
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left( 1 - \exp(-\sigma_i \delta_i) \right) c_i$$
- $T_i$ expresses the accumulated transmittance over samples $1$ to $i-1$, where $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples
$$T_i = \exp\left( -\sum_{j=1}^{i-1} \sigma_j \delta_j \right)$$
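A minimal NumPy sketch of this discrete compositing for a single ray, assuming sorted sample depths; the large final delta mirrors common NeRF implementations:

```python
# Discrete volume rendering (alpha compositing) of the quadrature rule above.
import numpy as np

def composite(rgb: np.ndarray, sigma: np.ndarray, t: np.ndarray) -> np.ndarray:
    """rgb: (N, 3) colors c_i; sigma: (N,) densities; t: (N,) sorted depths.
    Returns the composited pixel color C_hat(r) of shape (3,)."""
    delta = np.diff(t, append=1e10)                 # delta_i = t_{i+1} - t_i
    alpha = 1.0 - np.exp(-sigma * delta)            # per-sample opacity
    # T_i = exp(-sum_{j<i} sigma_j delta_j) = prod_{j<i} (1 - alpha_j),
    # so shift the cumulative product by one: T_1 = 1.
    trans = np.cumprod(1.0 - alpha + 1e-10)
    trans = np.concatenate([[1.0], trans[:-1]])
    weights = trans * alpha                         # the w_i used for fine sampling
    return (weights[:, None] * rgb).sum(axis=0)
```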
Loss
The paper uses the COLMAP package to estimate camera parameters for the real data. At each optimization iteration, the authors randomly sample a batch of camera rays from the set of all pixels in the dataset, and then follow the hierarchical sampling to query $N_c$ samples from the coarse network and $N_c + N_f$ samples from the fine network.
The loss is the total squared error between the rendered and true pixel colors for both networks:
$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left[ \left\lVert \hat{C}_c(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2 + \left\lVert \hat{C}_f(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2 \right]$$
where $\mathcal{R}$ is the set of rays in each batch. In this experiment, $N_c = 64$ and $N_f = 128$, so $64 + (64 + 128) = 256$ samples are queried per ray in total. A sketch of the loss computation is shown below.
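A minimal PyTorch sketch of this loss, assuming `(batch, 3)` tensors of rendered and ground-truth ray colors (names are my own):

```python
# Combined coarse + fine photometric loss over a batch of rays.
import torch

def nerf_loss(c_coarse: torch.Tensor, c_fine: torch.Tensor,
              c_true: torch.Tensor) -> torch.Tensor:
    """Total squared error of both renderings against the true pixel colors.
    The coarse term is kept so the coarse network also receives gradients."""
    return ((c_coarse - c_true) ** 2).sum() + ((c_fine - c_true) ** 2).sum()
```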