'Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors' Paper Summary

구명규 · July 4, 2023

'23 Internship Study


Abstract

'Two-stage coarse-to-fine approach for high quality, textured 3D meshes generation from a single unposed image in the wild using both 2D and 3D priors'

Coarse Stage. Generates coarse geometry by fitting a NeRF.
Fine Stage. Produces a high-resolution, well-textured mesh using a memory-efficient differentiable mesh representation.

2D and 3D diffusion priors supervise novel views while the 3D content is learned. A trade-off parameter between the two priors controls exploration vs. exploitation of the 3D shape.
Textual inversion and monocular depth regularization keep the result consistent.

1. Introduction

Single-image 3D reconstruction in computer vision faces two challenging factors:
- The lack of large-scale 3D datasets from which to learn a prior
- The trade-off between the level of detail of 3D data and computational resources
Prior work uses the rich priors of 2D models for 3D content (e.g. RealFusion, Neural Lift). However, relying heavily on a 2D prior degrades 3D fidelity and consistency, producing multiple faces, mismatched sizes, and inconsistent textures. A 3D prior, on the other hand, is trained on limited 3D data and so generates oversimplified or flat geometry for uncommon objects.
Magic123 learns an implicit volume representation with NeRF (coarse stage), then uses DMTet (Deep Marching Tetrahedra) to scale up to 1K resolution with little memory (fine stage, refining both geometry and texture).

2. Methodology

2.1 Magic123 pipeline

Preprocessing: The object is segmented with a Dense Prediction Transformer model, and a depth map is extracted with a MiDaS model for use as a regularization prior.

2.1.1 Coarse stage

Uses the Instant-NGP model (fast inference, expressive enough for complex geometry).
Reference view reconstruction loss $\mathcal{L}_{rec}$
$\mathcal{L}_{rec}=\lambda_{rgb}||\bold{M}\odot(\bold{I}^r-G_\theta(\bold{v}^r))||_2^2+\lambda_{mask}||\bold{M}-M(G_\theta(\bold{v}^r))||_2^2$
: MSE losses on the foreground RGB and on the mask. The background is not modeled and defaults to white.
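As a rough illustration of the two terms in $\mathcal{L}_{rec}$, the sketch below evaluates them on flattened pixel lists in plain Python. The function name `reconstruction_loss`, the default weights, and the mean reduction over pixels are assumptions made for this example, not settings taken from the paper.

```python
def reconstruction_loss(ref_rgb, pred_rgb, ref_mask, pred_mask,
                        lambda_rgb=1.0, lambda_mask=1.0):
    """Masked RGB error plus mask error, averaged over pixels.

    ref_rgb / pred_rgb: lists of (r, g, b) tuples, one per pixel.
    ref_mask / pred_mask: lists of floats in [0, 1], one per pixel.
    The weights are placeholders, not the paper's settings.
    """
    n = len(ref_mask)
    # RGB term: only foreground pixels (mask = 1) contribute.
    rgb_err = sum(
        (m * (a - b)) ** 2
        for m, p_ref, p_pred in zip(ref_mask, ref_rgb, pred_rgb)
        for a, b in zip(p_ref, p_pred)
    ) / (3 * n)
    # Mask term: the rendered opacity should match the segmentation mask.
    mask_err = sum((m - q) ** 2 for m, q in zip(ref_mask, pred_mask)) / n
    return lambda_rgb * rgb_err + lambda_mask * mask_err


ref_rgb = [(1.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
pred_rgb = [(0.5, 0.0, 0.0), (0.0, 0.0, 0.0)]  # second pixel is background
ref_mask = [1.0, 0.0]
pred_mask = [1.0, 0.0]
loss = reconstruction_loss(ref_rgb, pred_rgb, ref_mask, pred_mask)
```

In this toy input the second pixel differs strongly in color, but it lies outside the mask and therefore adds nothing to the RGB term.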
Novel view guidance $\mathcal{L}_g$
: Uses both the 2D prior and the 3D prior to guide novel view generation. (Explained in detail in Section 2.2.)
$\mathcal{L}_g=\lambda_{2D/3D}\mathcal{L}_{2D}+40\mathcal{L}_{3D}$
Depth prior $\mathcal{L}_d$
$\mathcal{L}_d=\frac{1}{2}\left[1-\frac{\text{cov}(\bold{M}\odot d^r, \bold{M}\odot d)}{\sigma(\bold{M}\odot d^r)\sigma(\bold{M}\odot d)}\right]$
: Encourages the NeRF's depth output $d$ at the reference viewpoint to agree with the pseudo depth $d^r$ produced by a pre-trained monocular depth estimator. The negative Pearson correlation is used instead of an MSE loss, since the two depth maps need not share the same scale and offset.
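The depth term can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: `pearson_depth_loss` is a hypothetical name, and selecting the foreground pixels stands in for the element-wise product with $\bold{M}$ in the formula.

```python
import math


def pearson_depth_loss(ref_depth, pred_depth, mask):
    """0.5 * (1 - Pearson correlation) over foreground pixels.

    Assumes at least two foreground pixels with non-constant depth,
    otherwise the standard deviation in the denominator is zero.
    """
    xs = [d for d, m in zip(ref_depth, mask) if m > 0.5]
    ys = [d for d, m in zip(pred_depth, mask) if m > 0.5]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return 0.5 * (1.0 - cov / (std_x * std_y))


pseudo = [1.0, 2.0, 3.0, 9.0]
mask = [1.0, 1.0, 1.0, 0.0]                 # last pixel is background
rescaled = [2.0 * d + 5.0 for d in pseudo]  # same shape, other scale/offset
loss_same = pearson_depth_loss(pseudo, rescaled, mask)
loss_flipped = pearson_depth_loss(pseudo, [-d for d in pseudo], mask)
```

A rendered depth that differs from the pseudo depth only by a positive scale and an offset gives a loss close to 0, while an inverted depth ordering gives a loss close to 1. An MSE loss would penalize the rescaled case even though the relative geometry is identical.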
Normal smoothness $\mathcal{L}_n$
$\mathcal{L}_n=\frac{1}{2}||\bold{n}-\tau(g(\bold{n}, k))||$
: Regularizes the smoothness of the normal map to reduce high-frequency artifacts on the surface.

$\mathcal{L}_c=\mathcal{L}_{rec}+\mathcal{L}_g+\lambda_d\mathcal{L}_d+\lambda_n\mathcal{L}_n$

2.1.2 Fine stage

Uses DMTet (a hybrid SDF-Mesh representation) to reduce the high-frequency artifacts of NeRF.
$\rarr$ Its memory efficiency makes high-resolution 3D shapes feasible. (Memory usage is similar to the coarse stage.)
Texture is generated with a neural color field, as in Magic3D.

2.2 Joint 2D and 3D priors for image-to-3D generation

2D priors.
: Applies the SDS loss from DreamFusion, defined through its gradient.
$\nabla_\theta\mathcal{L}_{2D}=\mathbb{E}_{t, \epsilon}\left[w(t)(\epsilon_\phi(\bold{z}_t;\bold{e}, t)-\epsilon)\frac{\partial\bold{z}}{\partial\bold{I}}\frac{\partial\bold{I}}{\partial\theta}\right]$
: $\theta$ denotes the NeRF MLP parameters in the coarse stage, and the SDF, triangular deformation, and color field in the fine stage.
: The input image is interpolated to 512x512, and Stable Diffusion v1.5 is used.
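To make the SDS term more concrete, the noising step can be written out under the standard latent-diffusion setup that DreamFusion-style methods rely on. The noise-schedule symbol $\bar{\alpha}_t$ does not appear in the summary above and is introduced here only for illustration.

$\bold{z}_t=\sqrt{\bar{\alpha}_t}\,\bold{z}+\sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad \epsilon\sim\mathcal{N}(0,1)\ \text{per element}$

The rendered image $\bold{I}$ is encoded to the latent $\bold{z}$ and noised to $\bold{z}_t$, and the denoiser's prediction is compared with the injected noise. The residual $\epsilon_\phi(\bold{z}_t;\bold{e}, t)-\epsilon$ is then propagated back through the encoder ($\partial\bold{z}/\partial\bold{I}$) and the renderer ($\partial\bold{I}/\partial\theta$), without differentiating through the denoiser itself.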
Textual inversion
: Uses the textual inversion technique from RealFusion to learn a token that represents the object in the reference image.
: The prompt is fixed to "A high-resolution DSLR image of $<{e}>$ ".

2.2.1 3D prior

Uses the novel-view synthesis ability of Zero-1-to-3 as the 3D prior.
$\nabla_\theta\mathcal{L}_{3D}=\mathbb{E}_{t, \epsilon}\left[w(t)(\epsilon_\phi(\bold{z}_t;\bold{I}^r, t, R, T)-\epsilon)\frac{\partial\bold{I}}{\partial\theta}\right]$
The 2D prior is conditioned on a text prompt containing the textual-inversion token, whereas the 3D prior is conditioned on the reference image $\bold{I}^r$ and the relative camera pose $(R, T)$ of the novel view.
Its generalization capability is lower than that of the 2D prior.

2.2.2 Joint 2D and 3D priors

The two priors complement each other: the 2D prior drives geometry exploration, where its imagination can produce inaccurate shapes, while the 3D prior drives geometry exploitation, where it produces over-simplified geometry for uncommon objects.
$\mathcal{L}_g=\lambda_{2D}\mathcal{L}_{2D}+\lambda_{3D}\mathcal{L}_{3D}$
Zero-1-to-3 (the 3D prior) was found to be more tolerant than Stable Diffusion (the 2D prior), i.e. the result is insensitive to changes in $\lambda_{3D}$. The loss is therefore rewritten with $\lambda_{3D}$ fixed at 40 and a single trade-off parameter:
$\mathcal{L}_g=\lambda_{2D/3D}\mathcal{L}_{2D}+40\mathcal{L}_{3D}$
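The effect of the single trade-off parameter can be illustrated with a small sketch. The function name `guidance_loss` and the scalar loss values are made up for this example; in practice both terms enter the optimization through their SDS gradients rather than as plain numbers.

```python
def guidance_loss(loss_2d, loss_3d, lambda_2d3d, lambda_3d=40.0):
    """L_g = lambda_2D/3D * L_2D + 40 * L_3D, with one tunable weight."""
    return lambda_2d3d * loss_2d + lambda_3d * loss_3d


# Same placeholder per-prior losses, three settings of the trade-off weight.
for lam in (0.1, 1.0, 10.0):
    total = guidance_loss(loss_2d=0.5, loss_3d=0.02, lambda_2d3d=lam)
    share_2d = lam * 0.5 / total
    print(f"lambda_2D/3D={lam}: L_g={total:.3f}, 2D share={share_2d:.2f}")
```

A small weight lets the 3D prior dominate (more exploitation), and a large weight shifts the balance toward the 2D prior (more exploration).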

3. Experiments

3.1 Datasets

Uses the NeRF4 and RealFusion15 datasets.

3.2 Implementation details

The coarse and fine stages are each trained for 5,000 iterations. For the first 3,000 iterations, normals' shading is used to focus on learning geometry; afterwards, diffuse shading is used with probability 0.75 and textureless shading with probability 0.25.
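The shading schedule described above can be sketched as a small selector function. The name `choose_shading` and the mode strings are hypothetical, and each unit of the schedule is treated here as one optimization step.

```python
import random


def choose_shading(step, rng, warmup_steps=3000, p_diffuse=0.75):
    """Pick the shading mode for one training step."""
    if step < warmup_steps:
        return "normal"  # geometry-focused warm-up phase
    # After warm-up: diffuse with probability 0.75, else textureless.
    return "diffuse" if rng.random() < p_diffuse else "textureless"


rng = random.Random(0)
early = {choose_shading(s, rng) for s in range(3000)}
late = [choose_shading(s, rng) for s in range(3000, 5000)]
diffuse_share = late.count("diffuse") / len(late)
```

Here `early` contains only the normals' shading mode, and `diffuse_share` should land near 0.75 over the remaining 2,000 steps.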
The rendering resolution is 128x128 in the coarse stage and 1024x1024 in the fine stage.
The reference image is assumed to be a front view with a radial distance of 1.8 and an FOV of 40° (3D reconstruction is not sensitive to these values).
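To show what these camera assumptions imply numerically, the sketch below converts the FOV into a pinhole focal length and places the reference camera on a sphere. The axis convention (y up, front view on the +z axis) and both function names are assumptions for illustration, and the FOV is treated as spanning the full image side.

```python
import math


def focal_length_px(fov_deg, image_size):
    """Pinhole focal length in pixels for a given field of view."""
    return 0.5 * image_size / math.tan(math.radians(fov_deg) / 2.0)


def camera_position(radius, elevation_deg=0.0, azimuth_deg=0.0):
    """Camera centre on a sphere around the origin (y up, front on +z)."""
    elev = math.radians(elevation_deg)
    azim = math.radians(azimuth_deg)
    x = radius * math.cos(elev) * math.sin(azim)
    y = radius * math.sin(elev)
    z = radius * math.cos(elev) * math.cos(azim)
    return (x, y, z)


focal_coarse = focal_length_px(40.0, 128)   # coarse-stage renders
focal_fine = focal_length_px(40.0, 1024)    # fine-stage renders
front_cam = camera_position(1.8)            # reference (front) view
```

With a fixed FOV the focal length scales linearly with resolution, so the same camera model carries over from the 128x128 coarse renders to the 1024x1024 fine renders.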

3.3 Results

Magic123 outperforms previous models on all metrics: PSNR and LPIPS, which measure reconstruction quality and perceptual similarity at the reference view, and CLIP-similarity, which measures 3D consistency.
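Of the three metrics, PSNR is simple enough to write out directly; LPIPS and CLIP-similarity depend on pretrained networks and are not sketched here. The helper below is a generic PSNR on flat pixel lists, not the paper's evaluation code, and the pixel values are made up.

```python
import math


def psnr(ref, pred, max_val=1.0):
    """Peak signal-to-noise ratio in dB for flat pixel lists in [0, max_val]."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, pred)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


reference = [0.0, 0.5, 1.0, 0.25]
close = [0.0, 0.5, 0.9, 0.25]   # small error in one pixel
far = [0.5, 0.0, 0.5, 0.75]     # large error everywhere
psnr_close = psnr(reference, close)
psnr_far = psnr(reference, far)
```

Higher PSNR means the rendered reference view is closer to the input image, so `psnr_close` comes out larger than `psnr_far`.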

(omitted)

5. Conclusion and Discussion

Assumes that the reference image is a front view $\rarr$ performance drops sharply for photos taken from above looking down.
Depends on the results of the preprocessed segmentation and monocular depth estimation.
Over-saturation issue (the object's texture comes out exaggerated).

Implementation Details

Known view loss = loss_rgb + loss_mask + loss_normal + loss_depth
Novel view loss = loss_sds + loss_if + loss_zero123 + loss_clip

Paper Summary

Uses the 2D prior from DreamFusion and the 3D prior from Zero-1-to-3 together, splits training into coarse and fine stages with DMTet (textured 3D meshes) in the fine stage, and regularizes the depth and normal maps, improving both the quality and the computational cost of single unposed image-to-3D reconstruction.
