Gaussian Rasterization 분석하기!

김민솔·2024년 10월 31일

Gaussian-Splatting

목록 보기

2/6

3D Gaussian Splatting의 코드에서 Rasterization 부분이 cuda 언어로 작성되어 있습니다.... 미루고 미룬 숙제지만, rasterization 부분을 최대한 해부하고 분석하는 포스트를 작성하였습니다.

gsplat의 rasterization 코드를 분석하였습니다.
shape을 가지는 data processing 코드를 사용하였습니다. (packed X)
Distributed code(multi-GPU)는 생략하였습니다.
backward code는 살펴보지 않고, forward code를 중점적으로 분석하였습니다.

Introduction

Rendering.py 내 위치한 rasterization의 주요한 기능은 총 세 가지의 함수로 구성되어 있습니다. 1️⃣ fully_fused_projection, 2️⃣ isect_tiles (intersecting tiles), 3️⃣ rasterize_to_pixels입니다. 각 함수들의 로직을 파헤치기 전에, rasterization의 input arguments를 먼저 살펴보겠습니다.

Input arguments

means: 3D 가우시안의 중심점입니다. [N, 3]
quats: 가우시안의 quaternions입니다. (wxyz) [N, 4]
scales: 가우시안의 scale입니다. [N, 3]
opacities: 가우시안의 opacity [N]
colors: 가우시안의 color 값입니다.
- rgb: [(C,) N, D]
- SH: [(C,) N, J, 3], K는 SH 개수
viewmats: w2c transformation matrix입니다. [C, 4, 4]
Ks: 카메라 intrinsics. [C, 3, 3]

(1) fully_fused_projection

3D 가우시안들을 2D projection하는 함수입니다. quaternions과 scales 값을 통해 projection 과정을 구현합니다. 이때 rotation quats와 scales를 통해 3D covariances를 계산합니다. (quats와 scales는 3D 가우시안의 파라미터들로, 최적화로 구해집니다.)

\Sigma = RSS^{\top}R^{\top}

Quaternions to Rotation matrix

quaternion(사원수)는 회전을 표현하는 4개의 값으로 이루어진 복소수 벡터입니다. (wxyz) quats를 사용하여 최적화할 경우, 연산 수가 줄고 최적화 할 파라미터 수가 줄게 되는 장점을 갖습니다. unit quaternion을 활용하여 rotation matrix로 표현하는 식은 다음과 같습니다.

R = \begin{bmatrix} w^{2}+x^2-y^2-z^2 & 2(xy-wz) & 2(xz+wy) \\ 2(xy + wz) & w^{2}-x^2+y^2-z^2 & 2(yz-wx) \\ 2(xz-wy) & 2(yz+wx) & w^{2}-x^2-y^2+z^2 \\ \end{bmatrix}

코드에서도 quats에 대해 normalization 적용 후, 위의 tranformation 행렬을 구합니다.

template <typename T>
inline __device__ mat3<T> quat_to_rotmat(const vec4<T> quat) {
    T w = quat[0], x = quat[1], y = quat[2], z = quat[3];
    // normalize
    T inv_norm = rsqrt(x * x + y * y + z * z + w * w);
    x *= inv_norm;
    y *= inv_norm;
    z *= inv_norm;
    w *= inv_norm;
    T x2 = x * x, y2 = y * y, z2 = z * z;
    T xy = x * y, xz = x * z, yz = y * z;
    T wx = w * x, wy = w * y, wz = w * z;
    return mat3<T>(
        (1.f - 2.f * (y2 + z2)),
        (2.f * (xy + wz)),
        (2.f * (xz - wy)), // 1st col
        (2.f * (xy - wz)),
        (1.f - 2.f * (x2 + z2)),
        (2.f * (yz + wx)), // 2nd col
        (2.f * (xz + wy)),
        (2.f * (yz - wx)),
        (1.f - 2.f * (x2 + y2)) // 3rd col
    );
}

이후 $\Sigma = RSS^{\top}R^{\top}$ 계산으로 3D covariance를 구합니다.

World to camera

https://velog.io/@rlaalsthf02/Computer-Vision-%EA%B0%9C%EB%85%90-%EC%A0%95%EB%A6%AC#transformations-chain

world 좌표계에서 camera 좌표계로 옮기는 과정입니다. 카메라 extrinsic 파라미터를 곱하여 연산됩니다.

mean (in Camera coords) $\mu_{camera} = R\mu_{world} + t$

template <typename T>
inline __device__ void pos_world_to_cam(
    // [R, t] is the world-to-camera transformation
    const mat3<T> R,
    const vec3<T> t,
    const vec3<T> p,
    vec3<T> &p_c
) {
    p_c = R * p + t;
}

covariance (in Camera coords) $\Sigma_{camera} = R\Sigma R^{\top}$

template <typename T>
inline __device__ void covar_world_to_cam(
    // [R, t] is the world-to-camera transformation
    const mat3<T> R,
    const mat3<T> covar,
    mat3<T> &covar_c
) {
    covar_c = R * covar * glm::transpose(R);
}

Perspective Projection

Camera 좌표계까지 옮긴 가우시안을 2D image space에 정사영하는 부분입니다. 카메라는 Default 값인 Pinhole로 설정하였습니다.

\Sigma' = JW\Sigma W^{\top}J^{\top}

$W$ : viewing transformation (w2c)
$J$ : Projective transformation의 affine 근사 Jacobian
$\Sigma'$ : 2D covariance

이때 Projective transformation은 카메라의 intrinsic 파라미터들을 통해 표현됩니다.

\mathbf{x}'_{image} = \begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = K\mathbf{x}_{camera} = \begin{bmatrix} f_x & 1 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix}

$K$ : camera intrinsics

위의 식을 통해 2D projection이 이루어지는 x', y'를 표현할 수 있습니다.

\begin{aligned} x' = \frac {f_{x}\cdot x}{z} + c_{x} \\ y' = \frac {f_{y}\cdot y}{z} + c_{y} \end{aligned}

$f$ : focal length (카메라의 초점 거리)
$c$ : 이미지 평면 중심까지의 거리 (아래 설명에서는 center로 표기하였습니다.)

각 변수에 대한 편미분을 적용하여 Jacobian을 구한 결과입니다.

J = \begin{bmatrix} \frac{\partial{x}'}{\partial{x}} & \frac{\partial{x}'}{\partial{y}} & \frac{\partial{x}'}{\partial{z}}\\ \frac{\partial{y}'}{\partial{x}} & \frac{\partial{y}'}{\partial{y}} & \frac{\partial{y}'}{\partial{z}} \end{bmatrix} = \begin{bmatrix} \frac{f_{x}}{z} & 0 & - \frac{f_{x}\cdot x}{z^{2}} \\ 0 & \frac{f_{y}}{z} & - \frac{f_{y}\cdot y}{z^{2}} \end{bmatrix}

Perspective Projection code

코드에서 Perspective projection이 구현된 부분입니다.

template <typename T>
inline __device__ void persp_proj(
    // inputs
    const vec3<T> mean3d,
    const mat3<T> cov3d,
    const T fx,
    const T fy,
    const T cx,
    const T cy,
    const uint32_t width,
    const uint32_t height,
    // outputs
    mat2<T> &cov2d,
    vec2<T> &mean2d
) {
    T x = mean3d[0], y = mean3d[1], z = mean3d[2];

	//Pov limitation
    T tan_fovx = 0.5f * width / fx; 
    T tan_fovy = 0.5f * height / fy;
    T lim_x_pos = (width - cx) / fx + 0.3f * tan_fovx;
    T lim_x_neg = cx / fx + 0.3f * tan_fovx;
    T lim_y_pos = (height - cy) / fy + 0.3f * tan_fovy;
    T lim_y_neg = cy / fy + 0.3f * tan_fovy;

    T rz = 1.f / z;
    T rz2 = rz * rz;
    T tx = z * min(lim_x_pos, max(-lim_x_neg, x * rz));
    T ty = z * min(lim_y_pos, max(-lim_y_neg, y * rz));

    // mat3x2 is 3 columns x 2 rows.
    mat3x2<T> J = mat3x2<T>(
        fx * rz,
        0.f, // 1st column
        0.f,
        fy * rz, // 2nd column
        -fx * tx * rz2,
        -fy * ty * rz2 // 3rd column
    );
    cov2d = J * cov3d * glm::transpose(J);
    mean2d = vec2<T>({fx * x * rz + cx, fy * y * rz + cy});
}

이때 x, y에 대한 center 값을 그대로 사용하지 않고 tx, ty로 표현합니다. 해당 과정에 대해 아래에서 살펴보겠습니다.

x, y축에 대해 $0.5 * \{h \ or \ w \} / f$ 로 FOV(카메라의 시야각)를 tan로 구합니다.
최대 시야각의 0.3 정도까지로 position을 제한한 center 값을 사용합니다.

Gaussian Culling

view frustrum을 구한 후에, 해당 범위 안에 포함되지 않는 가우시안을 제거하는 로직입니다.

T b = 0.5f * (covar2d[0][0] + covar2d[1][1]);
T v1 = b + sqrt(max(0.01f, b * b - det));
T radius = ceil(3.f * sqrt(v1));

v1: projected Gaussian에 대한 분산 값입니다.
- Gaussian의 표준편차 값은 cov 행렬의 고유값과 같습니다. $(\Sigma u =\lambda u)$
- 고유벡터는 가우시안 타원의 방향을 결정하게 되고, 고유값은 각 방향에 대한 크기를 의미합니다.
- quadratic form을 적용하여 variance 값을 구합니다.
- v1, v2에 대한 값이 나오지만, 최대값만 필요하기 때문에 (최대 반경을 구하는 과정이므로) v1만 취하게 됩니다.
radius: 2D covariance 표준편차의 세 배를 가우시안의 반경으로 설정합니다.

if (radius <= radius_clip) {
	radii[idx] = 0;
	return;
}

// mask out gaussians outside the image region
if (mean2d.x + radius <= 0 || mean2d.x - radius >= image_width ||
	mean2d.y + radius <= 0 || mean2d.y - radius >= image_height) {
	radii[idx] = 0;
	return;
}

radius 반경을 벗어난 가우시안의 radii를 0으로 설정합니다. (radii가 0인 가우시안은 제거되는 것과 같은 의미입니다.)

Quadratic form (for variance) 유도 과정

\begin{aligned} det(\Sigma−\lambda I)=0 \\ λ^{2}−(\sigma_x^2​+\sigma_y^2​)\lambda+(\sigma_x^2​\sigma_y^2​−σ_{xy}^2​)=0 \\ b = \frac{\sigma_{x}^{2}​+\sigma_{y}^{2}​}{2} \\ \lambda_{1}, \lambda_{2} = b + \sqrt{b^{2} \pm (\sigma_x^2​\sigma_y^2​−σ_{xy}^2​)} \end{aligned}

$\sigma_{x}^{2}, \sigma_{xy}^2, \sigma_{y}^{2}$ : Covariance의 upper triangle elements

Outputs for fully_fused_projection

1️⃣ Quaternions to Rotation matrix, 2️⃣ World to camera, 3️⃣ Perspective Projection, 4️⃣ Calculating radius 과정을 거치면, 최종적으로 2D projected Gaussian을 얻게 됩니다.

radii: radius of Gaussian [C, N]
means2d: [C, N, 2]
depth: mean_camera에서의 z 값으로 구하기 [C, N]
conics: [C, N, 3]
- 2D cov의 역행렬로 2D 가우시안의 모양인 conic을 표현합니다.
- [C, N, 3]인 이유: 2D cov도 대칭 행렬이므로, upper triangle만 고려.

(2) isect_tiles

projected Gaussian들을 tile안에 교차시키는 과정입니다.

그림과 같이, 각 가우시안에 대해 tile 크기(16 x 16)만큼 resizing을 적용합니다.

// gaussian resizing for tile size
OpT tile_radius = radius / static_cast<OpT>(tile_size);
OpT tile_x = mean2d.x / static_cast<OpT>(tile_size);
OpT tile_y = mean2d.y / static_cast<OpT>(tile_size);

// tile_min is inclusive, tile_max is exclusive
// tile_id == y value * tile width + x value
uint2 tile_min, tile_max;
tile_min.x = min(max(0, (uint32_t)floor(tile_x - tile_radius)), tile_width);
tile_min.y =
	min(max(0, (uint32_t)floor(tile_y - tile_radius)), tile_height);
tile_max.x = min(max(0, (uint32_t)ceil(tile_x + tile_radius)), tile_width);
tile_max.y = min(max(0, (uint32_t)ceil(tile_y + tile_radius)), tile_height);

if (first_pass) {
	// first pass only writes out tiles_per_gauss
	tiles_per_gauss[idx] = static_cast<int32_t>(
		(tile_max.y - tile_min.y) * (tile_max.x - tile_min.x)
	);
	return;
}

각 가우시안에 대한 tile 값을 구한 후 cumsum을 적용하여, 전체 이미지에 대한 tile id를 생성합니다.
cum_tiles_per_gauss = torch::cumsum(tiles_per_gauss.view({-1}), 0);

Assigning Keys for tiles of Gaussian

int64_t depth_id_enc = (int64_t) * (int32_t *)&(depths[idx]);
int64_t cur_idx = (idx == 0) ? 0 : cum_tiles_per_gauss[idx - 1];
for (int32_t i = tile_min.y; i < tile_max.y; ++i) {
	for (int32_t j = tile_min.x; j < tile_max.x; ++j) {
		int64_t tile_id = i * tile_width + j;
		// e.g. tile_n_bits = 22:
		// camera id (10 bits) | tile id (22 bits) | depth (32 bits)
		isect_ids[cur_idx] = cid_enc | (tile_id << 32) | depth_id_enc;
		// the flatten index in [C * N] or [nnz]
		flatten_ids[cur_idx] = static_cast<int32_t>(idx);
		++cur_idx;
	}
}

camera id + tild id + depth 값에 대한 id를 생성합니다.
- depth 값은 $\alpha$ -blending을 위한 depth sort에 활용됩니다.
isect_id: tile id
flatten_id: Gaussian id

Radix Sort

렌더링을 위해 depth 값을 기준으로, 가우시안을 정렬하는 과정입니다. 이때, 같은 tile에 존재하는 가우시안들에 대해서만 정렬이 적용됩니다. CUDA 내 구현된 RadixSort api를 사용하였습니다.

// tile ids
cub::DoubleBuffer<int64_t> d_keys(
	isect_ids.data_ptr<int64_t>(),
	isect_ids_sorted.data_ptr<int64_t>()
);
// gaussian ids
cub::DoubleBuffer<int32_t> d_values(
	flatten_ids.data_ptr<int32_t>(),
	flatten_ids_sorted.data_ptr<int32_t>()
);
// RadixSort
GSPLAT_CUB_WRAPPER(
	cub::DeviceRadixSort::SortPairs,
	d_keys,
	d_values,
	n_isects,
	0,
	32 + tile_n_bits + cam_n_bits,
	stream
);

Gaussian Splatting의 rasterization 코드에서는, depth 값을 오름차순으로 정렬합니다.

즉, (depth 값이 작은) 카메라와 가까운 가우시안부터 렌더링이 이루어집니다.
cub::DeviceRadixSort::SortPairs(d_temp_storage,
								temp_storage_bytes,
								d_keys_in, d_keys_out,
								d_values_in, d_values_out,
								num_items);
// d_keys_out            <-- [0, 3, 5, 6, 7, 8, 9]
// d_values_out          <-- [5, 4, 3, 1, 2, 0, 6]

(3) rasterize_to_pixels

드디어,, rasterization 코드 부분으로 오게 되었습니다.
rasterization kernel을 적용하기 전, 각 thread block을 하나의 tile에 대응시킵니다. (각 thread는 하나의 pixel에 대응됩니다.) 따라서 block의 총 개수는 $C*tile\_h*tile\_w$ 가 됩니다.

dim3 threads = {tile_size, tile_size, 1};
dim3 blocks = {C, tile_height, tile_width};

thread와 block 구성이 끝나면, 앞에서 저장한 camera 및 tile 등의 id를 불러옵니다.

auto block = cg::this_thread_block();
    int32_t camera_id = block.group_index().x;
    int32_t tile_id =
        block.group_index().y * tile_width + block.group_index().z;
    uint32_t i = block.group_index().y * tile_size + block.thread_index().y;
    uint32_t j = block.group_index().z * tile_size + block.thread_index().x;

Output에 대한 인덱싱도 아래 코드로 적용합니다.

tile_offsets += camera_id * tile_height * tile_width;
    render_colors += camera_id * image_height * image_width * COLOR_DIM;
    render_alphas += camera_id * image_height * image_width;

같은 tile 내에 존재하는 모든 thread를 모아 batch들을 생성합니다. 이후, 각 batch에 존재하는 모든 가우시안들을 정렬된 순서대로 (from low depth gaussians) 모아줍니다.

// each thread fetch 1 gaussian from front to back
// index of gaussian to load
uint32_t batch_start = range_start + block_size * b;
uint32_t idx = batch_start + tr;
if (idx < range_end) {
	int32_t g = flatten_ids[idx]; // flatten index in [C * N] or [nnz]
	id_batch[tr] = g;
	const vec2<S> xy = means2d[g];
	const S opac = opacities[g];
	xy_opacity_batch[tr] = {xy.x, xy.y, opac};
	conic_batch[tr] = conics[g];
}

// wait for other threads to collect the gaussians in batch
block.sync();

Blending

가우시안의 projection 형태는 conic입니다. conic으로부터 sigma(density) 값을 얻어내는 코드는 다음과 같습니다.

const vec3<S> conic = conic_batch[t];
const vec3<S> xy_opac = xy_opacity_batch[t];
const S opac = xy_opac.z;
const vec2<S> delta = {xy_opac.x - px, xy_opac.y - py};
const S sigma = 0.5f * (conic.x * delta.x * delta.x +
						conic.z * delta.y * delta.y) +
				conic.y * delta.x * delta.y;
S alpha = min(0.999f, opac * __expf(-sigma));

sigma: 가우시안 중점으로부터의 거리 (!= depth)
alpha: $\alpha_{i}=(1-\exp{(-\sigma_{i}\delta_{i})})$ / 픽셀에서의 투명도

이때, sigma를 conic으로부터 유도하는 과정은 Appendix에서 소개하겠습니다.
카메라 밖에 있거나(depth가 negative), alpha가 임계값(1/255) 이하인 경우, 해당 가우시안을 무시합니다. (pruning)

 if (sigma < 0.f || alpha < 1.f / 255.f) {
	continue;
}

또한, alpha를 축적하여 transimittance(이전 픽셀까지의 축적 투과도)를 구합니다.

const S next_T = T * (1.0f - alpha);
if (next_T <= 1e-4) { // this pixel is done: exclusive
	done = true;
	break;
}

마지막으로, image formation process에 따라, weights 및 color 값을 렌더링하는 부분입니다.

C = \sum\limits_{i\in N} c_{i}\alpha_{i}\prod^{i-1}_{j=1}(1-\alpha_{j})

int32_t g = id_batch[t];
const S vis = alpha * T;
const S *c_ptr = colors + g * COLOR_DIM;
GSPLAT_PRAGMA_UNROLL
for (uint32_t k = 0; k < COLOR_DIM; ++k) {
	pix_out[k] += c_ptr[k] * vis;
}
cur_idx = batch_start + t;

T = next_T; // transimittance update for next batch

vis: $\sum\limits T_{i}\alpha_{i}$ / =weights for color
c_ptr: color of each point
pix_out: rendered color

이전 단계에서 구한 alpha와, iteration에 따라 update되는 Transimittance, 가우시안의 color를 통해, 최종적으로 rendered 컬러 값을 얻습니다. 위의 과정이 반복되면서, 렌더링 과정이 이루어진다고 보시면 될 것 같습니다.

이것으로 1️⃣ Rasterization과 2️⃣ Tile-insecting, 3️⃣ Rendering까지의 CUDA 코드를 살펴보았습니다. 한번 뜯어가며 분석하여, 더 확실하게 개념을 익힐 수 있었던 뿌듯한 포스트로 남을 것 같습니다!

Appendix

sigma from conic

2D Gaussian

f(x, y) = \exp\Big(\frac{−1}{2}\begin{bmatrix} x−μ_{x}​ \\ y−μ_{y}​​\end{bmatrix}^\top Σ^{−1}\begin{bmatrix} x−μ_{x}​ \\ y−μ_{y}​​\end{bmatrix} \Big)

$\mu_{x}, \mu_{y}$ : Gaussian 분포의 중심 좌표
$\Sigma^{-1}$ : conic (=inverse covariance)
(x, y)에 따른 2D Gaussian의 확률 밀도 함수를 나타낸 식입니다. covariance의 역행렬을 통해, conic을 표현할 수 있습니다.

conic 방정식

\begin{aligned}\begin{bmatrix} x−μ_{x}​ \\ y−μ_{y}​​\end{bmatrix}^\top Σ^{−1}\begin{bmatrix} x−μ_{x}​ \\ y−μ_{y}​​\end{bmatrix} = a(x-\mu_{x})^{2}+2b(x-\mu_{x})(y-\mu_{y})+c(y-\mu_{y})^{2} \\ \sigma = 0.5 \times (a \cdot \Delta x^{2}+ 2b \cdot \Delta x \cdot \Delta y + c \cdot \Delta y^{2}) \end{aligned}

$\Delta x = x-\mu_{x}$ (y도 동일)
sigma: projected position과 가우시안 중심 사이의 거리
$a, b, c$ : conic matrix의 upper triangle
이때, exp 부분을 식으로 정리한 후에, delta로 치환하면 다음과 같이sigma를 표현할 수 있습니다.