Given the strong belief that object recognition by humans is part-based,
the simplest explanation for a collection of shapes
belonging to the same category would be a
combination of universal parts for that category (e.g., chair backs or airplane wings).
✅ Hence, an unsupervised shape co-segmentation would amount to
finding the simplest part representations for a shape collection.
✅ Paper's choice for the representation learning module is a variant of the autoencoder.
In principle, autoencoders learn "compact representations" of a set of data
via dimensionality reduction
while minimizing a "self-reconstruction loss"
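In symbols, the generic autoencoder objective (standard formulation, not the paper's exact loss) with encoder E and decoder D is:
min_{E, D}  E_{x ~ data} [ || x - D(E(x)) ||^2 ]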
✅ To learn shape parts,
the paper introduces a branched version of autoencoders,
where each branch is tasked to learn a simple representation
for a universal part of the input shape collection.
✅ Paper's branched autoencoder, BAE-NET for short,
is trained to minimize a shape reconstruction loss,
where shapes are represented using implicit fields.
BAE-NET is trained with a "collection of un-segmented shapes",
using a shape reconstruction loss, without any GT labels.
Input: shape (represented using implicit fields)
Encoder: produces the feature code for the given shape (encodes the shape using a CNN)
Decoder: 3-layer fully connected NN
Decoder's input: joint vector of point coordinates and the feature code of the input shape
(the point coordinates add spatial awareness to the reconstruction process, which is often lost in the convolutional features from the CNN encoder)
Decoder's output: a value in each output branch that indicates
whether the input point is inside a part (1) or not (0).
Max pooling operator: merges the parts together to obtain the entire shape.
Decoder is branched: each branch learns a "compact representation"
for one commonly recurring part of the shape collection (e.g., airplane wings)
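A minimal sketch of such a branched implicit decoder, in hypothetical PyTorch code (layer widths, names, and the number of branches are assumptions, not the paper's exact configuration):

import torch
import torch.nn as nn

class BranchedImplicitDecoder(nn.Module):
    # 3 fully connected layers; the last layer (L3) has one output per branch/part.
    def __init__(self, code_dim=128, num_branches=8):
        super().__init__()
        self.fc1 = nn.Linear(3 + code_dim, 1024)  # L1: point (x, y, z) + shape feature code
        self.fc2 = nn.Linear(1024, 256)           # L2
        self.fc3 = nn.Linear(256, num_branches)   # L3: one implicit value per branch
        self.act = nn.LeakyReLU(0.02)

    def forward(self, points, code):
        # points: (B, N, 3), code: (B, code_dim)
        code = code.unsqueeze(1).expand(-1, points.shape[1], -1)
        h = self.act(self.fc1(torch.cat([points, code], dim=-1)))
        h = self.act(self.fc2(h))
        branches = torch.sigmoid(self.fc3(h))     # per-branch inside/outside values in [0, 1]
        whole = branches.max(dim=-1).values       # max pooling merges the parts into the shape
        return whole, branches

The max pooling over branches is what lets each branch specialize on one part while the merged output still reconstructs the whole shape.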
By complementing the shape reconstruction loss with a label loss,
BAE-NET is easily tuned for one-shot learning.
P(S): distribution of training shapes,
P(p|S): distribution of sampled points given shape S,
f(p): output value of decoder for input point p
F(p): GT inside-outside status for point p.
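Reconstructed from these definitions, the unsupervised reconstruction loss is plausibly of the form (the exact formulation in the paper may differ):
L_rec = E_{S ~ P(S)} E_{p ~ P(p|S)} [ (f(p) - F(p))^2 ]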
In datasets such as the ShapeNet part dataset,
shapes are represented by point clouds
sampled from their surfaces.
In such datasets, the inside-outside status of a point can be ambiguous.
However, since the paper's sampled points are from voxel grids
and the reconstructed shapes are thicker than the originals,
we can assume all points in the ShapeNet part dataset
are inside our reconstructed shapes.
The paper uses both its sampled points from voxel grids
and the point clouds in the ShapeNet part dataset,
by modifying the loss function:
P(p|S): distribution of the paper's sampled points from voxel grids,
P(q|S): distribution of points in the ShapeNet part dataset.
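With both point sources, the modified loss is then plausibly (F(q) = 1 by the inside assumption above; α balances the two terms):
L_rec = E_{S ~ P(S)} [ E_{p ~ P(p|S)} (f(p) - F(p))^2 + α · E_{q ~ P(q|S)} (f(q) - 1)^2 ]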
(The paper sets α to 1 in experiments.)
The paper's network also supports "one-shot training",
where the network is fed 1, 2, 3, ..., shapes
with GT part labels,
and the other shapes without part labels.
To enable one-shot training, a joint loss is defined as:
P(S): distribution of all shapes,
P(S'): distribution of the few given shapes with part labels.
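A plausible form of the joint loss, combining the unsupervised reconstruction term with a supervised per-branch term on the labeled shapes (f_j and F_j denote the j-th branch output and its GT part indicator; the exact formulation is an assumption):
L_joint = E_{S ~ P(S)} E_{p ~ P(p|S)} (f(p) - F(p))^2 + E_{S' ~ P(S')} E_{p ~ P(p|S')} Σ_j (f_j(p) - F_j(p))^2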
Additionally, the paper adds a very small L1 regularization term
for the parameters of L3 to prevent unnecessary overlap,
e.g., when the part output by one branch contains
the part output by another branch.
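In this notation (mine, not the paper's), the total objective would be roughly L_total = L_joint + λ ||W_{L3}||_1, with W_{L3} the L3 weights and λ a very small constant.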
After training, we get an approximate implicit field for the input shape.
To label a given point of an input shape,
simply feed the point into the network together
with the feature code of the input shape,
and label the point by looking
at which branch in L3 gives the highest value.
If the training has exemplar shapes as guidance,
each branch will be assigned a label automatically
w.r.t. the exemplars.
If training is unsupervised,
we need to look at the branch outputs
and give a label to each branch by hand.
For example, in Fig 2, we can label branch #3 as "jet engine",
and each point having the highest value in this branch
will be labeled as "jet engine".
To label a mesh model,
subdivide the surfaces
to obtain fine-grained triangles,
and assign a label to each triangle.
To label a triangle,
feed its three vertices into the network
and sum their output values in each branch,
and assign the label
whose associated branch gives the highest value.
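A small sketch of both labeling procedures, assuming the hypothetical BranchedImplicitDecoder above:

def label_points(decoder, points, code):
    # points: (1, N, 3), code: (1, code_dim)
    _, branches = decoder(points, code)              # (1, N, num_branches)
    return branches.argmax(dim=-1)                   # branch with the highest value = part label

def label_triangle(decoder, verts, code):
    # verts: (3, 3) -- the triangle's vertices; sum per-branch values over the three vertices
    _, branches = decoder(verts.unsqueeze(0), code)  # (1, 3, num_branches)
    return branches.sum(dim=1).argmax(dim=-1)        # label of the branch with the highest sum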
BAE-NET was trained with 4 branches on the two datasets.
It successfully separated the shape patterns,
even when two patterns overlap.
Each of the output branches
only outputs
one specific shape pattern,
thus we also obtain a shape correspondence from the co-segmentation process.
The segmentation results of the current 3-layer model are compared with a 2-layer model, a 4-layer model, and a CNN model in Fig 4.
2-layer model: hard to reconstruct the rings,
since L2 is better at representing convex shapes
4-layer model: can separate parts,
but since most shapes can already be represented in L3,
the extra layer does not necessarily output separated parts.
CNN model: not sensitive to parts
and outputs basically everything or nothing in each branch,
since there is no bias towards sparsity or segmentation.
Overall, the 3-layer network is the best choice
for independent shape extraction,
making it a suitable candidate for unsupervised and weakly supervised shape segmentation.
In L1: point coordinates and shape feature code
have gone through a "linear transform" and a "leaky ReLU activation",
therefore the activation maps in L1 are linear "space dividers" with gradients.
In L2: each neuron linearly combines the fields in L1
to form basic shapes.
L2 represents higher level shapes than L1.
L3 neurons combine the shapes in L2
to form the output parts of the network,
and the final output combines all L3 outputs via max pooling.
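Putting the three layers together (my notation, with z the shape feature code and p the point):
h1 = LeakyReLU(W1 [p; z] + b1)        (L1: linear space dividers)
h2 = LeakyReLU(W2 h1 + b2)            (L2: basic shapes)
f_j(p) = sigmoid(w_j · h2 + c_j)      (L3: one part per branch j)
f(p) = max_j f_j(p)                   (max pooling merges the parts)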
For unsupervised segmentation,
there is no easy way to predict
which branch will output which part,
since we initialize the network parameters randomly and
optimize a reconstruction loss
while treating each branch equally.
The paper's network groups similar and close-by parts in different shapes for correspondence.
This is reasonable in most cases,
but for some categories, e.g., lamps or tables,
where the similar and close-by parts may be assigned different labels,
the paper's network can be confused.
BAE-NET is much shallower and thinner compared to IM-NET,
since the paper cares more about segmentation (not reconstruction) quality.
However, the limited depth and width of the network
make it difficult
to train on high-resolution models,
which hinders us from obtaining fine-grained segmentations.