Scaffold-GS

김민솔 · February 14, 2025

1. Introduction

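Limitations of 3D GS

Scene structure is ignored, which limits representational power on complex, large-scale scenes.
View-dependent effects are baked into each Gaussian's parameters rather than modeled explicitly, so the representation is sensitive to view changes and lighting effects.
The number of Gaussians grows unnecessarily while fitting the 3D scene.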

Contribution

1) Generate anchor points from a voxel grid to guide the distribution of local 3D Gaussians -> hierarchical and region-aware scene representation
2) Predict neural Gaussians from each anchor inside the view frustum -> robust novel view synthesis for diverse viewing directions and distances
3) Growing and pruning strategies tailored to anchors

2. Related Work

MLP-based Neural Fields and Rendering

This is the standard NeRF-based approach. Rendering evaluates a large number of sample points along every camera ray, which makes it slow.

Grid-based Neural Fields and Rendering

These methods represent the scene with a grid to speed up rendering.

Plenoxel

spatial data structure -> faster training + inference
Interpolates a continuous density field from a sparse voxel grid.
Uses spherical harmonics (SH) for view-dependent visual effects.

Point-based Neural Fields and Rendering

These methods render with geometry primitives. Typically, a point cloud extracted with Structure-from-Motion (SFM) is used for rendering. As in 3D GS, the point primitives can also be treated as ellipsoids and used for volume rendering.

3D GS

https://velog.io/@rlaalsthf02/3D-Gaussian-Splatting
initialized from SFM
rasterization: 3D Gaussians -> 2D projected

3. Method

The method consists of three stages: 1️⃣ Initialization, 2️⃣ Neural Gaussian Derivation, and 3️⃣ Anchor Points Refinement. We go through them one by one.

Anchor Point Initialization

\mathbf{V} = \Big\{\Big\lfloor \frac{\mathbf{P}}{\epsilon} \Big\rceil\Big\} \cdot \epsilon

$\mathbf{P}$ : point cloud from COLMAP, from which the scene is voxelized
$\mathbf{V}$ : voxel centers
$\epsilon$ : voxel size
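
The voxelization formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `init_anchors` and the toy point cloud are made up for the example, and deduplicating with `np.unique` is an assumed way of realizing the set notation $\{\cdot\}$.

```python
import numpy as np

def init_anchors(points: np.ndarray, voxel_size: float) -> np.ndarray:
    """Snap points to a voxel grid and keep one anchor per occupied voxel."""
    # Round each coordinate to the nearest integer voxel index: round(P / eps).
    idx = np.round(points / voxel_size).astype(np.int64)
    # Deduplicate so that each occupied voxel yields exactly one anchor.
    unique_idx = np.unique(idx, axis=0)
    # Scale back to world coordinates to obtain the voxel centers V.
    return unique_idx.astype(np.float64) * voxel_size

# Toy point cloud (hypothetical values): the first two points share a voxel.
points = np.array([[0.11, 0.02, 0.49],
                   [0.13, 0.04, 0.52],
                   [0.91, 0.88, 0.10]])
anchors = init_anchors(points, voxel_size=0.5)
print(anchors)
```

Three input points collapse to two anchors here, which is the sense in which voxelization thins out a dense SFM point cloud.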

The center of each voxel becomes an anchor point, which carries the following parameters.

$f_{v} \in \mathbb{R}^{32}$ : a local context feature
$l_{v} \in \mathbb{R}^{3}$ : a scaling factor
$\mathbf{O}_{v} \in \mathbb{R}^{k\times 3}$ : $k$ learnable offsets

Feature weighted sum

\delta_{vc}=||\mathbf{x}_{v}-\mathbf{x}_{c}||_{2}, \quad \overrightarrow{d}_{vc}= \frac{\mathbf{x}_{v}-\mathbf{x}_{c}}{||\mathbf{x}_{v}-\mathbf{x}_{c}||_{2}}

To give the local context feature $f_{v}$ multi-resolution and view-dependent behavior, the following steps are applied.
1) Downsample $f_{v}$ by successive factors of 2 to build a feature bank $\{f_{v},f_{v_{1}}, f_{v_{2}}\}$.
2) From the difference between the anchor position $\mathbf{x}_{v}$ and the camera position $\mathbf{x}_{c}$, compute the viewing distance $\delta_{vc}$ and viewing direction $\overrightarrow{d}_{vc}$, then feed them to an MLP to obtain weights.
3) Take the weighted sum of the feature bank with these view-dependent weights to form the integrated anchor feature $\hat{f}_{v}$.
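
The three steps above can be sketched as follows. Several details are assumptions made for illustration only: downsampling is done by slicing every 2nd or 4th channel and repeating it to restore the length, the weight MLP is replaced by a single linear layer (`W`, `b`), and the weights are normalized with a softmax. All function names are hypothetical.

```python
import numpy as np

def feature_bank(f_v: np.ndarray) -> np.ndarray:
    """Step 1: build {f_v, f_v1, f_v2} (assumed slice-and-repeat downsampling)."""
    f_v1 = np.repeat(f_v[::2], 2)   # keep every 2nd channel, repeat to full length
    f_v2 = np.repeat(f_v[::4], 4)   # keep every 4th channel, repeat to full length
    return np.stack([f_v, f_v1, f_v2])

def view_inputs(x_v: np.ndarray, x_c: np.ndarray):
    """Step 2a: viewing distance and unit viewing direction."""
    diff = x_v - x_c
    delta = np.linalg.norm(diff)
    return delta, diff / delta

def integrated_feature(f_v, x_v, x_c, W, b):
    """Steps 2b and 3: predict weights, then blend the feature bank."""
    delta, d = view_inputs(x_v, x_c)
    # Stand-in for the weight MLP: one linear layer on (delta, d).
    logits = W @ np.concatenate(([delta], d)) + b
    # Softmax so the three weights are positive and sum to 1 (assumption).
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ feature_bank(f_v)

f_v = np.arange(32, dtype=np.float64)        # toy 32-dim anchor feature
x_v = np.array([3.0, 4.0, 0.0])              # anchor position
x_c = np.zeros(3)                            # camera position
W, b = np.zeros((3, 4)), np.zeros(3)         # untrained layer -> uniform weights
delta, d = view_inputs(x_v, x_c)
f_hat = integrated_feature(f_v, x_v, x_c, W, b)
```

Because the weights depend only on $\delta_{vc}$ and $\overrightarrow{d}_{vc}$, the same anchor can emphasize coarse or fine feature variants depending on where the camera is.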

Neural Gaussian Derivation

A neural Gaussian is parameterized by the same attributes as a Gaussian in 3D GS.

Position
Opacity
Covariance
- quaternion
- scaling
Color

For each anchor point inside the viewing frustum, $k$ neural Gaussians are spawned and their attributes are predicted. In particular, the positions of the neural Gaussians are computed as follows.

\{\mu_{0}, ..., \mu_{k-1}\}=\mathbf{x}_{v}+\{\mathcal{O}_{0},...,\mathcal{O}_{k-1}\}\cdot l_{v}

$\{\mathcal{O}_{0},...,\mathcal{O}_{k-1}\} \in \mathbb{R}^{k\times 3}$ : learnable offsets
$l_{v}$ : scaling factor associated with the anchor
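
As a quick numeric check of the position formula, the sketch below applies it to one anchor with $k=2$ offsets. The values and the function name `neural_gaussian_positions` are illustrative only. The per-axis product relies on NumPy broadcasting, under the assumption that $l_{v} \in \mathbb{R}^{3}$ scales each axis of the offsets independently.

```python
import numpy as np

def neural_gaussian_positions(x_v: np.ndarray, offsets: np.ndarray,
                              l_v: np.ndarray) -> np.ndarray:
    """mu_i = x_v + O_i * l_v, for all k offsets at once."""
    # offsets has shape (k, 3); x_v and l_v have shape (3,) and broadcast over k.
    return x_v + offsets * l_v

x_v = np.array([1.0, 2.0, 3.0])                 # anchor position
offsets = np.array([[1.0, 0.0, 0.0],
                    [0.0, -1.0, 0.0]])          # k = 2 learnable offsets
l_v = np.array([0.5, 0.5, 2.0])                 # per-axis scaling factor
mu = neural_gaussian_positions(x_v, offsets, l_v)
print(mu)
```

The offsets are expressed in units of $l_{v}$, so the same offset vector moves a Gaussian further from an anchor that has a larger scaling factor.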

The attributes of the neural Gaussians are decoded directly from the anchor feature $\hat{f}_{v}$, the viewing distance, and the viewing direction. A separate MLP is used for each attribute.

Neural Gaussians are predicted on the fly: only visible anchors inside the frustum are activated to spawn Gaussians. As in 3D GS, only neural Gaussians whose opacity exceeds a predefined threshold $\tau_{\alpha}$ are kept.
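
The filtering described above amounts to combining two masks. In this hedged sketch the frustum test and the opacity MLP are assumed to have run already, so `visible` and `opacity` are given as inputs. The function name and the numbers are hypothetical.

```python
import numpy as np

def select_neural_gaussians(visible: np.ndarray, opacity: np.ndarray,
                            tau_alpha: float) -> np.ndarray:
    """Return (anchor, slot) index pairs of the neural Gaussians to rasterize."""
    # visible: (N,) bool, True if the anchor lies inside the view frustum.
    # opacity: (N, k) predicted opacity of each anchor's k neural Gaussians.
    keep = visible[:, None] & (opacity > tau_alpha)
    return np.argwhere(keep)

visible = np.array([True, False])
opacity = np.array([[0.9, 0.001],
                    [0.8, 0.7]])
kept = select_neural_gaussians(visible, opacity, tau_alpha=0.005)
print(kept)
```

Only the first Gaussian of the first anchor survives: its sibling falls below $\tau_{\alpha}$, and the second anchor is outside the frustum, so its Gaussians would never be decoded in the first place.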

Growing

Neural Gaussians spawned from SFM-initialized anchor points can end up confined to modeling local regions, so scene quality depends on the SFM result. To address this, the paper proposes error-based anchor growing, which places new anchors where neural Gaussians find a 'significant' region.

Quantize the neural Gaussians into voxels of size $\epsilon_g$.
For each voxel, compute the average gradient $\nabla_{g}$ of its neural Gaussians.
Create a new anchor where $\nabla_{g}$ exceeds the threshold $\tau_{g}$.
- $\tau_{g}$ is also a predefined threshold.
Repeat the process over multi-resolution voxels.
- Each time the voxel size is scaled down by 4, the threshold is scaled up by 2.
- $\epsilon^{(m)}_{g}=\epsilon_g/4^{m-1}$
- $\tau_{g}^{(m)}=\tau_{g}*2^{m-1}$
Randomly eliminate some of the new anchors.
- This keeps the number of anchors from expanding too quickly.
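
The growing procedure can be sketched as a loop over resolution levels. This is a simplified illustration under stated assumptions, not the reference implementation: `grads` is taken to be one scalar gradient magnitude per neural Gaussian, random elimination is modeled with a hypothetical `keep_prob`, and the sketch does not check whether a voxel already holds an anchor.

```python
import numpy as np

def grow_anchors(positions, grads, eps_g, tau_g, levels=3, keep_prob=0.5, seed=0):
    """Propose new anchor positions from per-voxel mean gradients."""
    rng = np.random.default_rng(seed)
    new_anchors = []
    for m in range(1, levels + 1):
        eps_m = eps_g / 4 ** (m - 1)   # voxel size shrinks by 4 per level
        tau_m = tau_g * 2 ** (m - 1)   # threshold doubles per level
        # Quantize neural Gaussian positions to integer voxel indices.
        idx = np.round(positions / eps_m).astype(np.int64)
        voxels, inverse = np.unique(idx, axis=0, return_inverse=True)
        inverse = inverse.reshape(-1)
        # Mean gradient of the neural Gaussians that fall in each voxel.
        sums = np.bincount(inverse, weights=grads, minlength=len(voxels))
        counts = np.bincount(inverse, minlength=len(voxels))
        mean_grad = sums / counts
        # Voxels above the threshold become candidate anchors.
        candidates = voxels[mean_grad > tau_m] * eps_m
        # Random elimination keeps the anchor count from growing too fast.
        keep = rng.random(len(candidates)) < keep_prob
        new_anchors.append(candidates[keep])
    return np.concatenate(new_anchors, axis=0)

positions = np.array([[0.1, 0.0, 0.0], [0.2, 0.0, 0.0], [4.0, 0.0, 0.0]])
grads = np.array([1.0, 1.0, 0.0])
grown = grow_anchors(positions, grads, eps_g=1.0, tau_g=0.5, levels=1, keep_prob=1.0)
dropped = grow_anchors(positions, grads, eps_g=1.0, tau_g=0.5, levels=1, keep_prob=0.0)
```

With a single level, the two high-gradient Gaussians near the origin share a voxel and produce one candidate, while the zero-gradient Gaussian produces none. Note that with several levels the same region can be proposed at more than one resolution.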

Pruning

To remove trivial anchors, opacity values are accumulated. If an anchor fails to produce neural Gaussians with sufficient opacity, it is removed from the scene.
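
One possible reading of this pruning rule is sketched below. The bookkeeping is an assumption made for the example: each anchor accumulates the opacity of its neural Gaussians together with a count of how often it was visible, and anchors whose mean accumulated opacity falls below a hypothetical `min_opacity` are flagged. Anchors that were never visible are left alone.

```python
import numpy as np

def prune_mask(accum_opacity: np.ndarray, visit_count: np.ndarray,
               min_opacity: float) -> np.ndarray:
    """Return True for anchors that should be removed."""
    # Mean opacity per visit; the maximum avoids division by zero.
    mean_opacity = accum_opacity / np.maximum(visit_count, 1)
    # Only judge anchors that were actually seen during the window.
    return (visit_count > 0) & (mean_opacity < min_opacity)

accum_opacity = np.array([5.0, 0.1, 0.0])   # summed opacity over the window
visit_count = np.array([10, 10, 0])         # iterations each anchor was visible
remove = prune_mask(accum_opacity, visit_count, min_opacity=0.05)
print(remove)
```

The second anchor is removed because it was seen ten times but contributed almost no opacity, whereas the third is kept because there is no evidence about it yet.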

Loss

\mathcal{L}=\mathcal{L}_{1}+\lambda_{SSIM}\mathcal{L}_{SSIM}+\lambda_{vol}\mathcal{L}_{vol}

As in 3D GS, the learnable parameters and MLPs are trained with L1 and SSIM losses. The volume regularization term is computed from the product of each neural Gaussian's scale vector entries, and it encourages neural Gaussians to overlap as little as possible.
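
To make the volume term concrete, the sketch below computes it as the sum over neural Gaussians of the product of their three scale values, then combines it with precomputed L1 and SSIM terms. The function names, the toy numbers, and the $\lambda$ values are illustrative assumptions, and the SSIM term is taken as an input rather than implemented.

```python
import numpy as np

def volume_regularization(scales: np.ndarray) -> float:
    """Sum over Gaussians of the product of their (s_x, s_y, s_z) scales."""
    return float(np.prod(scales, axis=1).sum())

def total_loss(l1: float, ssim_term: float, scales: np.ndarray,
               lambda_ssim: float, lambda_vol: float) -> float:
    """L = L1 + lambda_ssim * L_ssim + lambda_vol * L_vol."""
    return l1 + lambda_ssim * ssim_term + lambda_vol * volume_regularization(scales)

scales = np.array([[1.0, 2.0, 3.0],
                   [0.5, 0.5, 4.0]])        # toy scale vectors of two Gaussians
vol = volume_regularization(scales)
loss = total_loss(l1=0.1, ssim_term=0.3, scales=scales,
                  lambda_ssim=0.2, lambda_vol=0.01)
```

Since the product of the scales is proportional to an ellipsoid's volume, minimizing this term pushes the Gaussians toward being small, which is what limits their overlap.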

4. Results

Qualitative Results

The method produces high-quality results across varied conditions: lighting, fine-scale detail, texture-less regions, and insufficient observations.

Quantitative Results

Rendering quality is comparable to 3D GS.
Even on standard datasets, memory usage is roughly 4x lower.
On large-scale, complex-lighting, and synthetic scenes, Scaffold-GS achieves higher PSNR than 3D GS.

Ablations

In the ablations on anchor refinement and filtering, applying each technique improves performance.

Anchor feature clustering

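This is a visualization of anchor features clustered with K-means. Scene contents such as the railing and the stroller are separated into distinct clusters, which indicates that the anchor feature space captures visual attributes and geometric structure.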

View-adaptive neural Gaussian attributes

Each point in the graph corresponds to a viewpoint in space. The scale and opacity values change continuously as the viewpoint changes, which shows that Scaffold-GS captures view-dependent effects well.

Limitations

Many thresholds are set heuristically.
Anchor optimization also leaves room for improvement.
- e.g., the random elimination of anchors

Reference

[1] Tao Lu, et al. "Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering." CVPR. 2024.
[2] Jonathan T. Barron, et al. "Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields." CVPR. 2022.
[3] Alex Yu, et al. "Plenoxels: Radiance Fields without Neural Networks." CVPR. 2022.
[4] Bernhard Kerbl, et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." SIGGRAPH. 2023.
