Recent 3D deep learning methods infer general shapes from very few images, even a single input. However, the resulting resolution and accuracy are limited due to ineffective model representations.
➡️ This paper proposes Pixel-aligned Implicit Function (PIFu), a new 3D deep learning representation for the challenging problem of textured surface inference of clothed 3D humans from single or multiple input images.
PIFu is an implicit representation that locally aligns pixels of 2D images with the global context of their corresponding 3D object.
This paper proposes an end-to-end deep learning method for digitizing highly detailed clothed humans that can infer both 3D surface and texture from a single image, and optionally, multiple input images.
Highly intricate shapes, such as hairstyles and clothing, as well as their variations and deformations, can be digitized in a unified way.
PIFu produces high-resolution surfaces including largely unseen regions such as the back of a person.
PIFu is memory efficient, can handle arbitrary topology, and produces surfaces that are spatially aligned with the input image.
This paper shows that the combination of local features and a 3D-aware implicit surface representation makes a significant difference, enabling highly detailed reconstruction even from a single view.
The algorithm can handle a wide range of complex clothing, such as skirts, scarves, and even high heels, while capturing high-frequency details such as wrinkles that match the input image at the pixel level.
Goal: Given single- or multi-view images, reconstruct the underlying 3D geometry and texture of a clothed human while preserving the detail present in the image.
An implicit function defines a surface as a level set of a function f, e.g. f(X) = 0
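For example, a unit sphere is the zero level set of f(X) = ‖X‖ − 1: points with f(X) < 0 lie inside the sphere and points with f(X) > 0 lie outside.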
The proposed pixel-aligned implicit function consists of a fully convolutional image encoder g
and a continuous implicit function f represented by an MLP,
where the surface is defined as a level set of f(F(x), z(X)) = s, s ∈ ℝ (Eq. 1),
with x = π(X) the 2D projection of the 3D point X, z(X) its depth in camera coordinates, and F(x) = g(I(x)) the image feature at x.
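A minimal sketch of this pixel-aligned query in PyTorch (the encoder output, MLP, and tensor shapes here are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn.functional as F


def query_pifu(feat_map, points_xy, points_z, mlp):
    """Evaluate a pixel-aligned implicit function at 3D query points.

    feat_map:  (B, C, H, W) image features from a fully convolutional encoder g
    points_xy: (B, N, 2) projected 2D locations of the query points, in [-1, 1]
    points_z:  (B, N, 1) depth of each query point in camera coordinates
    mlp:       network f mapping (feature, depth) -> implicit field value s
    """
    # Bilinearly sample the feature map at each projected pixel (pixel alignment).
    grid = points_xy.unsqueeze(2)                                 # (B, N, 1, 2)
    sampled = F.grid_sample(feat_map, grid, align_corners=True)   # (B, C, N, 1)
    sampled = sampled.squeeze(-1).transpose(1, 2)                 # (B, N, C)

    # Concatenate the per-point image feature with the point's depth and decode.
    query = torch.cat([sampled, points_z], dim=-1)                # (B, N, C + 1)
    return mlp(query)                                             # (B, N, 1) field value s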
The key observation is that the method learns an implicit function over 3D space with pixel-aligned image features rather than global features, which allows the learned function to preserve the local detail present in the image.
Unlike existing methods, which synthesize back regions from frontal views in image space, this approach can predict colors in unseen, concave, and side regions directly on the surface.
Given an input image, PIFu for surface reconstruction predicts the continuous inside/outside probability field of a clothed human, from which an iso-surface can be easily extracted.
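For instance, the field can be evaluated on a regular 3D grid and the 0.5 iso-surface extracted with Marching Cubes; a sketch assuming a callable occupancy_fn (e.g. the query above bound to one image) and scikit-image:

```python
import numpy as np
import torch
from skimage import measure


def extract_mesh(occupancy_fn, resolution=256, threshold=0.5, bbox=(-1.0, 1.0)):
    """Evaluate the inside/outside probability on a grid and extract the 0.5 iso-surface."""
    lin = np.linspace(bbox[0], bbox[1], resolution)
    xs, ys, zs = np.meshgrid(lin, lin, lin, indexing="ij")
    points = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)           # (R^3, 3)

    with torch.no_grad():
        probs = occupancy_fn(torch.from_numpy(points).float())        # (R^3,) in [0, 1]
    field = probs.numpy().reshape(resolution, resolution, resolution)

    # Marching Cubes on the probability field gives the surface mesh.
    verts, faces, normals, _ = measure.marching_cubes(field, level=threshold)
    return verts, faces, normals
```

This dense evaluation is the naive version; in practice a coarse-to-fine (e.g. octree-style) evaluation keeps high resolutions tractable.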
PIFu for texture inference outputs an RGB value at 3D positions of the surface geometry,
enabling texture inference in self-occluded surface regions and shapes of arbitrary topology.
The proposed approach handles single-view and multi-view input naturally, producing even higher-fidelity results when more views are available.
PIFu can directly predict RGB colors on the surface geometry by defining s in Eq. 1 as an RGB vector field instead of a scalar field.
However, extending PIFu to color prediction is non-trivial, as RGB colors are defined only on the surface while the 3D occupancy field is defined over the entire 3D space.
This paper highlights the modifications of PIFu in terms of training procedure and network architecture.
Condition the image encoder for texture inference on the image features F_V learned for surface reconstruction.
➡️ This way, the image encoder can focus on color inference of a given geometry even if unseen objects have different shape, pose, or topology.
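One way to realize this conditioning (a sketch only; the exact wiring in the paper may differ) is to stack the surface-reconstruction features F_V with the input image before the texture encoder:

```python
import torch


def texture_features(image, F_V, tex_encoder):
    """Condition the texture image encoder on the surface features F_V.

    image: (B, 3, H, W) input image
    F_V:   (B, C, H, W) image features learned for surface reconstruction,
           assumed here to be upsampled to the input resolution
    """
    # Stacking along the channel axis lets the texture encoder "see" the
    # geometry features, so it can focus on inferring color for that geometry.
    return tex_encoder(torch.cat([image, F_V], dim=1))   # (B, C_tex, H', W')
```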
Introduce an offset ε ~ N(0, d) to the surface points along the surface normal N
➡️ so that the color is defined not only on the exact surface but also in the 3D space around it.
With the modifications above, the training objective becomes a reconstruction loss between the predicted and ground-truth RGB values at the perturbed surface samples.
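A minimal sketch of this objective, assuming an L1 loss on surface points jittered along their normals (function names, shapes, and the choice of L1 are illustrative assumptions):

```python
import torch


def texture_loss(tex_pifu, surface_points, surface_normals, gt_colors, d=0.05):
    """Color loss on surface samples perturbed along their normals.

    surface_points:  (B, N, 3) points sampled on the ground-truth surface
    surface_normals: (B, N, 3) unit normals at those points
    gt_colors:       (B, N, 3) ground-truth RGB at the sampled points
    d:               std of the normal offset epsilon ~ N(0, d)
    """
    # Offset each sample along its normal so color is supervised not only on
    # the exact surface but also in a thin shell around it.
    eps = torch.randn(surface_points.shape[:-1] + (1,),
                      device=surface_points.device) * d           # (B, N, 1)
    perturbed = surface_points + eps * surface_normals            # (B, N, 3)

    pred_rgb = tex_pifu(perturbed)                                # (B, N, 3)
    return torch.abs(pred_rgb - gt_colors).mean()
```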
Additional views provide more coverage of the person and improve the digitization accuracy.
The PIFu formulation naturally allows incorporating information from more views for both surface reconstruction and texture inference.
This is achieved by using PIFu to learn a feature embedding for every 3D point in space.
The output domain of Eq. 1 is an n-dimensional vector space s ∈ ℝⁿ that represents the latent feature embedding associated with the specified 3D coordinate and the image feature from each view.
Since this embedding is defined in the 3D world coordinate space, the embeddings from all available views that share the same 3D point can be aggregated.
The aggregated feature vector can be used to make a more confident prediction of the surface and the texture.
This paper decomposes the pixel-aligned function f into
a feature embedding network f1 and
a multi-view reasoning network f2,
as f := f2 ∘ f1.
f1: encodes the image feature and depth value from each viewpoint i into a latent feature embedding Φi;
the per-view embeddings are then fused by average pooling to obtain the aggregated embedding Φ̄.
f2: maps the aggregated embedding Φ̄ to the target implicit field s
(i.e., inside/outside probability for surface reconstruction, RGB value for texture inference).
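A sketch of this decomposition, under the same illustrative naming as the earlier snippets, where per-view embeddings from f1 are average-pooled and decoded once by f2:

```python
import torch


def multiview_pifu(f1, f2, per_view_feats, per_view_xy, per_view_z):
    """Multi-view PIFu as f = f2 ∘ f1 with average pooling over views.

    per_view_feats: list of (B, C, H, W) feature maps, one per view
    per_view_xy:    list of (B, N, 2) projections of the query points per view
    per_view_z:     list of (B, N, 1) per-view depths of the query points
    """
    # f1: embed the pixel-aligned feature and depth from each view independently.
    embeddings = [f1(feat, xy, z)                        # each (B, N, D)
                  for feat, xy, z in zip(per_view_feats, per_view_xy, per_view_z)]

    # Average pooling fuses the per-view embeddings Φi into one embedding Φ̄;
    # because it is symmetric, any number of views can be used.
    fused = torch.stack(embeddings, dim=0).mean(dim=0)   # (B, N, D)

    # f2: map the fused embedding to the target implicit field
    # (inside/outside probability, or RGB for texture inference).
    return f2(fused)
```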
Comparison with other deep learning-based multi-view methods.