SpotLessSplats

김민솔 · November 3, 2024

Gaussian Splatting SpotLessSplats in-the-wild


This is a review video of SpotLessSplats presented in the NeRF with Real-World group at 가짜연구소 (Pseudo Lab).

https://youtu.be/xKylfpxRSEA?si=Wj2WBgnd_ZFDowF7
In the review video I mentioned that the SOTA claim still needed to be checked. In a direct comparison with WildGaussians, SpotLessSplats showed higher metrics(!)

Abstract

In 3D reconstruction, 3D Gaussian Splatting (3DGS) has drawn attention for its efficient training and fast rendering. However, 3DGS comes with many constraints (no distractors, consistent lighting, etc.), which makes it hard to use in the wild.
SpotLessSplats (SLS) applies robust optimization to 3DGS and effectively ignores transient distractors.

1. Introduction

Prior NeRF-based work handled outliers by either 1️⃣ down-weighting or 2️⃣ discarding distractors. SLS is an attempt to bring RobustNeRF (link) to 3DGS, so it falls into direction 2️⃣.
Here, outliers are detected in a learned feature space rather than in RGB space. Sparsification is applied to feature maps obtained from a T2I model to get the robust mask. In addition, the robust kernel is not applied to 3DGS as-is; several extra techniques are added to make it work.

Robustness in NeRF

The original NeRF relies on the following assumptions, which limit its use.
1. The scene is completely static.
2. The illumination does not change.
Many papers have since been proposed to make NeRF usable in the wild.

NeRF-W

Link: https://velog.io/@rlaalsthf02/NeRF-in-the-Wild
Uses photometric error (L2 residual) -> removes transient distractors
Models a 3D uncertainty field -> down-weights high-error pixels
Models global appearance

Ha-NeRF, Cross-Ray

Model a 2D outlier mask (dropping the 3D field) to estimate uncertainty
Ha-NeRF -> CNN / Cross-Ray -> transformer

RobustNeRF

Link: https://velog.io/@rlaalsthf02/RobustNeRF
Robust outlier estimator
Binary weights + box blur kernel -> robust loss design
- Applies the inductive bias that distractor pixels and their surrounding edges are smooth
Misclassifies pixels when the background and an outlier have similar colors (same as NeRF-W)

Precomputed features

NeRF On-the-go

Link: https://velog.io/@rlaalsthf02/NeRF-On-the-go
Uses DINOv2 -> extracts semantic features
MLP on DINO features -> estimates an uncertainty mask
Relies on a structural rendering error (does not generalize well) -> may over- or under-mask outliers

Robustness in 3DGS

GS-W

Learns global and local per-primitive appearance embeddings through appearance modeling.

Wild-GS

Implements appearance modeling by introducing a spatial triplane field.

3. Background

Gaussian Splatting

Links:
- https://velog.io/@rlaalsthf02/3D-Gaussian-Splatting
- https://velog.io/@rlaalsthf02/Gaussian-Rasterization-%EB%B6%84%EC%84%9D%ED%95%98%EA%B8%B0

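Gaussian Splatting represents a 3D scene with anisotropic 3D Gaussians. Given a set of posed images, the GS model reconstructs the scene with these Gaussians.
Each splat $g_i$ is optimized over the following parameters.

$\mu_{i}$ : mean of the Gaussian
$\Sigma_{i}$ : covariance matrix (positive semi-definite)
$\alpha_{i}$ : opacity
$\mathbf{c}_{i}$ : SH color (view-dependent)

The 3D scene is rendered by rasterizing the 3D Gaussians.

$$\tilde{\Sigma}=\mathbf{J}\mathbf{W}\Sigma\mathbf{W}^{\top}\mathbf{J}^\top$$

The 3D covariance is projected to a 2D covariance, which maps each 3D Gaussian onto the 2D screen space. Here $\mathbf{W}$ is the viewing transformation and $\mathbf{J}$ is the Jacobian of the affine approximation of the projective transformation. Please refer to the posts above for the detailed derivation.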
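
As a rough illustration of the projection formula, the NumPy sketch below computes the 2D covariance for a single Gaussian. The helper names, the pinhole Jacobian, and the example values (focal lengths, camera-space point) are assumptions made for this sketch and are not taken from the paper or the 3DGS codebase.

```python
import numpy as np

def perspective_jacobian(point_cam, fx, fy):
    """Jacobian of a pinhole projection at a camera-space point (x, y, z)."""
    x, y, z = point_cam
    return np.array([
        [fx / z, 0.0, -fx * x / z**2],
        [0.0, fy / z, -fy * y / z**2],
    ])

def project_covariance(cov3d, view_rot, jacobian):
    """Screen-space covariance J W Sigma W^T J^T (2x2)."""
    return jacobian @ view_rot @ cov3d @ view_rot.T @ jacobian.T

# Hypothetical example values
cov3d = np.diag([0.04, 0.01, 0.09])   # anisotropic 3D Gaussian
view_rot = np.eye(3)                  # camera axes aligned with world axes
J = perspective_jacobian(np.array([0.5, -0.2, 4.0]), fx=500.0, fy=500.0)
cov2d = project_covariance(cov3d, view_rot, J)
print(cov2d)
```

The result stays symmetric and positive semi-definite, which is what lets the rasterizer treat it as a 2D Gaussian footprint.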

Robust Optimization

SLS applies the distractor-removal part of RobustNeRF to 3DGS. The masked L1 loss of RobustNeRF is as follows.

$$\arg\min_{\mathcal{G}} \sum\limits^{N}_{n=1}\mathbf{M}^{(t)}_{n} \odot \|\mathbf{I}_{n}- \hat{\mathbf{I}}_{n}^{(t)}\|_{1}$$

$\mathbf{M}_{n}$ : inlier/outlier mask
$\mathbf{I}_{n}$ : $n$-th training image
$\hat{\mathbf{I}}_{n}$ : rendering of the Gaussians

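The mask is computed as follows.

$$\mathbf{M}^{(t)}_{n} = \mathbb{1}\Big\{\Big( \mathbb{1}\{\mathbf{R}_{n}^{(t)}>\rho\} \circledast\mathbf{B} \Big) > 0.5 \Big\}, \quad P(\mathbf{R}_{n}^{(t)}>\rho)=\tau$$

$\mathbb{1}$ : indicator function
- Returns 1 if the predicate is true and 0 otherwise.
- This is the operation that produces the mask.
$\mathbf{R}_{n}^{(t)}$ : residual image $\|\mathbf{I}_{n}- \hat{\mathbf{I}}_{n}^{(t)}\|_{1}$, computed per pixel
$\rho$ : residual threshold used to find outliers, set by the cut-off percentile $\tau$.
- default: median ($\tau=0.5$)
$\mathbf{B}$ : box filter
- Applies a morphological dilation to the pixels through convolution.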
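
The mask computation can be sketched in NumPy as below. This is a minimal sketch under stated assumptions: it follows the RobustNeRF convention in which 1 marks an inlier (residual at or below the $\tau$-quantile) so the mask can multiply the loss directly, it uses a normalized 3x3 box filter with edge padding, and the function names are made up for illustration.

```python
import numpy as np

def box_filter_3x3(img):
    """Normalized 3x3 box filter with edge padding (assumed kernel size)."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def robust_inlier_mask(residual, tau=0.5):
    """Binary mask: 1 for inliers, 0 for outliers (assumed convention)."""
    rho = np.quantile(residual, tau)                  # residual threshold
    below = (residual <= rho).astype(float)           # per-pixel decision
    return (box_filter_3x3(below) > 0.5).astype(float)  # spatial smoothing

# Toy residual image: zero everywhere except a 3x3 high-error block
residual = np.zeros((8, 8))
residual[3:6, 3:6] = 1.0
mask = robust_inlier_mask(residual)
print(mask)
```

The box filter is what turns isolated per-pixel decisions into spatially coherent regions, which is the smoothness bias mentioned for RobustNeRF.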

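However, applying the robust loss directly to 3DGS is not only difficult (see 4.2.2), but using the mask as-is also degrades performance, as the figure above shows. SpotLessSplats adds a new way of generating the mask along with a few other techniques to make robust optimization work for 3DGS.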

4. Method

The robust loss is built on photometric error, but this makes it depend heavily on local color residuals and prone to overfitting to color error. SpotLessSplats instead relies on the similarity between semantic features obtained from a Stable Diffusion model. Removing distractors therefore becomes the problem of finding the sub-space of the feature space that produces large photometric errors.

4.1 Recognizing distractors

Before training, feature maps are extracted from the input images with Stable Diffusion. These semantic feature maps $\{\mathbf{F}_{n}\}^{N}_{n=1}$ are then used to compute the inlier/outlier masks $\mathbf{M}^{(t)}$. (Each batch contains a single image, so the index $n$ is dropped from the notation below.)
The next two subsections introduce the two proposed ways of computing the mask from diffusion features.

4.1.1 Spatial clustering

$$\begin{aligned} P(c\in\mathbf{M}^{(t)}) &= \Big(\sum\limits_{p}\mathbf{C}[c,p]\cdot \mathbf{M}^{(t)}[p] \Big) / \sum\limits_{p}\mathbf{C}[c,p] \\ \mathbf{M}^{(t)}_{agg}(p) &= \sum\limits_c\mathbb{1}\{P(c\in\mathbf{M}^{(t)}) >0.5 \} \cdot\mathbf{C}[c,p] \end{aligned}$$

Agglomerative clustering is applied to the feature map $\mathbf{F}$ to produce $C$ clusters ($C=100$). $\mathbf{C}[c,p] \in \{0,1\}$ indicates whether pixel $p$ belongs to cluster $c$. After computing the probability that cluster $c$ is an inlier, the cluster label is propagated back to its pixels, which yields the final mask $\mathbf{M}^{(t)}_{agg}$.
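
The aggregation step can be sketched as follows, assuming the cluster assignment has already been computed (the clustering itself is not shown) and is stored as an integer label per pixel instead of the one-hot matrix $\mathbf{C}$. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def aggregate_mask_by_cluster(labels, mask):
    """Vote per cluster, then broadcast the cluster decision back to pixels.

    labels: (H, W) integer cluster id per pixel
    mask:   (H, W) per-pixel inlier mask in {0, 1}
    """
    flat_labels = labels.ravel()
    flat_mask = mask.ravel().astype(float)
    n_clusters = int(flat_labels.max()) + 1
    inlier_sum = np.bincount(flat_labels, weights=flat_mask, minlength=n_clusters)
    counts = np.bincount(flat_labels, minlength=n_clusters)
    prob = inlier_sum / np.maximum(counts, 1)      # P(c in M)
    cluster_is_inlier = (prob > 0.5).astype(float)
    return cluster_is_inlier[flat_labels].reshape(labels.shape)

labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pixel_mask = np.array([[1, 1, 0, 1],
                       [1, 0, 0, 0]])
agg_mask = aggregate_mask_by_cluster(labels, pixel_mask)
print(agg_mask)
```

In this toy case cluster 0 has 3 of 4 inlier pixels and cluster 1 has 1 of 4, so every pixel of cluster 0 becomes an inlier and every pixel of cluster 1 an outlier.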

4.1.2 Spatio-temporal clustering

The second approach trains an MLP classifier that separates inliers from outliers. The mask predicted by the MLP is used in the loss.

$$\mathbf{M}^{(t)}_{mlp}=\mathcal{H}(\mathbf{F};\theta^{(t)})$$

The MLP classifier is optimized with the loss below.

$$\mathcal{L}(\theta^{(t)})=\mathcal{L}_{sup}(\theta^{(t)}) + \lambda\mathcal{L}_{reg}(\theta^{(t)})$$

supervision term $\mathcal{L}_{sup}$ :
$\mathcal{L}_{sup}(\theta^{(t)}) = \max(\mathbf{U}^{(t)}-\mathcal{H}(\mathbf{F};\theta^{(t)}), 0) + \max(\mathcal{H}(\mathbf{F};\theta^{(t)})- \mathbf{L}^{(t)}, 0)$
$\mathbf{U}$ is the self-supervision upper label computed with a threshold of 0.5, and $\mathbf{L}$ is the lower label computed with a threshold of 0.9.
regularization term $\mathcal{L}_{reg}$ :
$\alpha\prod^{l}_{i=1}\|W_{i}\|_\infty$, where $W_{i}$ is the weight matrix of the $i$-th of the $l$ MLP layers
Lipschitz regularization is used so that the MLP learns to assign similar probabilities to similar features. It has the effect of making the latent space smooth.
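
To make the two terms concrete, here is a small NumPy sketch. It evaluates the hinge-style supervision term and the product of per-layer infinity norms on toy inputs. Averaging over pixels is an assumption of this sketch (the formula above does not state a reduction), and the function names are hypothetical.

```python
import numpy as np

def supervision_loss(pred, upper_label, lower_label):
    """max(U - H, 0) + max(H - L, 0), averaged over pixels (assumed reduction)."""
    below_u = np.maximum(upper_label - pred, 0.0)   # penalize H < U
    above_l = np.maximum(pred - lower_label, 0.0)   # penalize H > L
    return float(np.mean(below_u + above_l))

def lipschitz_bound(weights):
    """Product of per-layer infinity norms (max absolute row sum)."""
    bound = 1.0
    for w in weights:
        bound *= np.abs(w).sum(axis=1).max()
    return float(bound)

pred = np.array([0.2, 0.7, 1.0])   # MLP inlier probabilities
U = np.array([1.0, 0.0, 0.0])      # label with threshold 0.5
L = np.array([1.0, 1.0, 0.0])      # label with threshold 0.9
loss = supervision_loss(pred, U, L)

layers = [np.array([[1.0, -2.0], [0.5, 0.5]]), np.array([[3.0, 0.0]])]
bound = lipschitz_bound(layers)
print(loss, bound)
```

Pixels where $\mathbf{U}=0$ and $\mathbf{L}=1$ (the second one here) receive no supervision at all, so their prediction is driven only by feature similarity through the regularized MLP.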

4.2 Adapting 3DGS to robust optimization

Applying robust masking directly to 3DGS causes the model to overfit in the early stage of training. To address this, the authors propose several additional techniques.

4.2.1 Warm up with scheduled sampling

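$$\mathbf{M}^{(t)} \sim \mathcal{B}(\alpha +(1-\alpha)\cdot \mathbf{M}^{(t)}_{*})$$

If the robust mask is applied to 3DGS from the very start of training, the MLP has not been trained enough yet and ends up predicting a random mask. The mask is therefore sampled from a Bernoulli distribution whose parameter is controlled by $\alpha$, which is scheduled over the iterations: with $\alpha=1$ every pixel is kept as an inlier, and as $\alpha$ approaches 0 the predicted mask $\mathbf{M}^{(t)}_{*}$ (with values in [0, 1]) takes over.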
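
A minimal sketch of this warm-up sampling is shown below. The staircase schedule and its constants (`decay_every`, `rate`) are hypothetical placeholders chosen only so that $\alpha$ starts at 1 and decays toward 0; they are not the paper's values.

```python
import numpy as np

def staircase_alpha(step, decay_every=1000, rate=0.5):
    """Hypothetical staircase schedule: 1 at step 0, decaying toward 0."""
    return rate ** (step // decay_every)

def sample_warmup_mask(pred_mask, alpha, rng):
    """Sample M ~ Bernoulli(alpha + (1 - alpha) * pred_mask) per pixel."""
    keep_prob = alpha + (1.0 - alpha) * pred_mask
    return (rng.random(pred_mask.shape) < keep_prob).astype(float)

rng = np.random.default_rng(0)
pred_mask = np.array([[1.0, 0.0], [0.0, 1.0]])

early = sample_warmup_mask(pred_mask, staircase_alpha(0), rng)   # alpha = 1
late = sample_warmup_mask(pred_mask, 0.0, rng)                   # alpha = 0
print(early, late)
```

With $\alpha=1$ the sampled mask is all ones regardless of the prediction, and with $\alpha=0$ a binary prediction is reproduced exactly.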

4.2.2 Trimmed estimators in image-based training

RobustNeRF builds each minibatch from patches, so every batch contains roughly the same proportion of outliers. In 3DGS, a batch is a whole image, so the outlier ratio varies from batch to batch and the robust mask cannot be applied directly.

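# Histogram of the mean absolute residual between the render and the target image,
# binned over [0, 1]; `colors`, `pixels`, `cfg`, and `info` come from the training loop.
info["err"] = torch.histogram(
    torch.mean(torch.abs(colors - pixels), dim=-3).clone().detach().cpu(),
    bins=cfg.bin_size,
    range=(0.0, 1.0),
)[0]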

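This is solved by tracking residuals across multiple training batches. The residuals are discretized into a histogram with B buckets, and the likelihood of each bucket is updated at every step to build up the residual distribution. Quantiles are then read off the histogram and used as the threshold for robust masking.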
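
The idea can be sketched with a small running histogram. The exponential moving average update and its `momentum` value are assumptions of this sketch (the post only says the bucket likelihoods are updated), and `ResidualHistogram` is a hypothetical helper, not a class from the SLS code.

```python
import numpy as np

class ResidualHistogram:
    """Running histogram of residuals in [0, 1] with quantile lookup."""

    def __init__(self, n_bins=100, momentum=0.95):
        self.edges = np.linspace(0.0, 1.0, n_bins + 1)
        self.probs = np.zeros(n_bins)
        self.momentum = momentum          # assumed EMA factor
        self.initialized = False

    def update(self, residual):
        counts, _ = np.histogram(residual, bins=self.edges)
        batch_probs = counts / max(counts.sum(), 1)
        if self.initialized:
            self.probs = self.momentum * self.probs + (1 - self.momentum) * batch_probs
        else:
            self.probs = batch_probs
            self.initialized = True

    def quantile(self, tau):
        """Upper edge of the first bucket whose cumulative mass reaches tau."""
        cdf = np.cumsum(self.probs)
        idx = int(np.searchsorted(cdf, tau * cdf[-1]))
        idx = min(idx, len(self.probs) - 1)
        return float(self.edges[idx + 1])

hist = ResidualHistogram(n_bins=10)
hist.update(np.full(100, 0.25))            # one image's residuals
threshold = hist.quantile(0.5)
print(threshold)
```

Because the histogram persists across iterations, the threshold reflects the residual distribution of many images instead of the single image in the current batch.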

4.2.3 A friendly alternative to "opacity reset"

In 3DGS, resetting the opacity every M iterations is essential. However, opacity reset brings the following problems. 1️⃣ Gaussians accumulate close to the camera. Gaussians far from the camera are not represented accurately in the scene, and transmittance is absorbed too quickly in front of the camera, which produces floaters. 2️⃣ Once a Gaussian's opacity has been reset, it can no longer regain a high opacity, so some Gaussians end up being pruned in addition.

$$u_{g} = \sum\limits_{t\in\mathcal{N}_{T}(t)}\mathbb{E}_{w,h}\Big\|\mathbf{M}^{(t)}_{h,w}\cdot \frac{\partial{\hat{\mathbf{I}}^{(t)}_{h,w}}}{\partial{x^{(t)}_{g}}} \Big\|^{2}_{2}$$

The authors propose utilization-based pruning (UBP), which achieves the effect of opacity reset while also making inference cheaper. The gradient is computed with respect to the 2D projected position rather than the 3D position, and is then used to decide which Gaussians to prune.
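
The utilization score can be illustrated on toy data as below. This sketch assumes the per-pixel derivatives are available as a dense array, which is only practical for illustration (a real implementation would obtain them through autograd without materializing them), and the threshold value is hypothetical.

```python
import numpy as np

def utilization(masks, jac):
    """Per-Gaussian utilization u_g.

    masks: (T, H, W) inlier masks for the T most recent steps
    jac:   (T, H, W, G, D) derivative of each rendered pixel w.r.t. the
           projected position of each of G Gaussians (D flattened components)
    """
    masked = masks[..., None, None] * jac
    sq_norm = np.sum(masked ** 2, axis=-1)          # (T, H, W, G)
    return sq_norm.mean(axis=(1, 2)).sum(axis=0)    # mean over pixels, sum over steps

def keep_mask(u, threshold):
    """True for Gaussians to keep, False for those to prune."""
    return u >= threshold

masks = np.array([[[1.0, 0.0], [1.0, 1.0]]])        # T=1, H=W=2
jac = np.zeros((1, 2, 2, 2, 1))
jac[..., 0, :] = 1.0                                # Gaussian 0 affects every pixel
u = utilization(masks, jac)                         # Gaussian 1 affects none
keep = keep_mask(u, threshold=0.1)
print(u, keep)
```

Gaussians that only contribute to masked-out (distractor) pixels get a utilization of zero, which is why this style of pruning also helps remove distractor geometry.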

(+) Gaussian pruning

This is the logic that removes a Gaussian when its mean/cov values are too large or its opacity falls below a given threshold.

4.2.4 Appearance modeling

$$\hat{\mathbf{c}}_{i}= \mathbf{a} \odot \mathbf{c}_{i} + \mathbf{b}, \quad \mathbf{a},\mathbf{b}=\mathcal{Q}(\mathbf{z}_{n};\theta_{\mathcal{Q}})$$

$\mathbf{a}$ : white balance (adjusts chromaticity so that colors look as they originally do / lighting)
$\mathbf{b}$ : brightness
$\mathbf{c}_{i}$ : SH coefs
$\mathbf{z}_n$ : camera-view embedding
$\theta_\mathcal{Q}$ : parameters of the color MLP $\mathcal{Q}$
3DGS assumes photometrically consistent captures (which rarely holds in practice), so auto-exposure and white-balance terms are added on top. The SH-based color is thus adjusted per view, so that colors can be rendered differently as the camera changes.
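
The affine color adjustment can be sketched as follows. For simplicity this sketch treats $\mathbf{c}_{i}$ as plain RGB values instead of SH coefficients and replaces the MLP $\mathcal{Q}$ with a single linear layer initialized near the identity transform; both are assumptions, and `AppearanceHead` is a hypothetical name.

```python
import numpy as np

class AppearanceHead:
    """Maps a per-view embedding z_n to a scale a and an offset b (3 values each)."""

    def __init__(self, embed_dim, rng):
        self.weight = 0.01 * rng.standard_normal((6, embed_dim))
        # Bias chosen so that z = 0 gives a = 1 and b = 0 (identity adjustment)
        self.bias = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

    def __call__(self, z):
        out = self.weight @ z + self.bias
        return out[:3], out[3:]

def adjust_colors(colors, a, b):
    """c_hat = a * c + b, applied per channel to every Gaussian."""
    return a * colors + b

rng = np.random.default_rng(0)
head = AppearanceHead(embed_dim=8, rng=rng)
colors = np.array([[0.2, 0.4, 0.6]])

a0, b0 = head(np.zeros(8))
unchanged = adjust_colors(colors, a0, b0)
warmer = adjust_colors(colors, np.array([2.0, 1.0, 1.0]), np.array([0.1, 0.0, 0.0]))
print(unchanged, warmer)
```

Starting from the identity adjustment means the appearance module does nothing until the per-view embeddings are trained to explain exposure or white-balance differences.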

5. Results

SpotLessSplats is evaluated on both the On-the-go dataset and the RobustNeRF dataset.

On-the-go dataset

Achieves SOTA among robust models regardless of the distractor ratio.
Also shows consistent performance in the distractor-free case.

RobustNeRF dataset

Also achieves SOTA on the RobustNeRF dataset.
NeRF-HuGS is included in this evaluation.
It is a pity that models bringing in-the-wild handling to GS, such as WildGaussians, were left out of the evaluation.

Ablations

with UBP

SLS-mlp performs better than SLS-agg.
Higher metrics are obtained without UBP.

On the Crab data as well, removing UBP gives higher metrics.
However, the authors argue (qualitatively) that distractors are not removed without UBP.
With UBP, the number of Gaussians is reported to drop by 4-6x, training time by 2x, and inference time by 3x.

with Masking

Predicting the mask with an MLP performs better than predicting it with a CNN.
Performance improves as the number of agglomerative clusters increases.

Contribution

Uses image features from a text-to-image diffusion model in the robust loss to find transient distractors.
- semantic outlier modeling
Combines sparsification with the robust loss to reduce the number of Gaussians (saving compute and memory).
- Reduces the number of splats by 2-4x.
- Shows the same effect on datasets without distractors.
Achieves SOTA in robust reconstruction (on the On-the-go dataset).

Reference

[1] SpotLessSplats, Sara Sabour, Lily Goli, https://arxiv.org/pdf/2406.20055
[2] NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild, ETH Zürich/Microsoft, https://rwn17.github.io/nerf-on-the-go/
[3] RobustNeRF, Google Research, https://robustnerf.github.io/
[4] NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections, Google Research, https://arxiv.org/pdf/2008.02268
[5] Explanation of anisotropy, https://xoft.tistory.com/63
[6] Learning Smooth Neural Functions via Lipschitz Regularization, Hsueh-Ti Derek Liu, https://arxiv.org/pdf/2202.08345
