CNNs have boosted the performance of 2D segmentation.
However, given unordered and unstructured 3D point clouds,
2D methods cannot be directly extended to 3D points.
➡️ Paper designs a bottom-up, end-to-end framework, PointGroup,
for 3D instance segmentation,
with the key target of better grouping of points.
Paper presents PointGroup,
a new end-to-end bottom-up architecture,
specifically focused on better grouping the points
by exploring the void space between objects,
to deal with the challenging 3D instance segmentation task.
Proposes a point clustering method
based on dual coordinate sets,
i.e., the original and shifted sets.
Along with the new ScoreNet,
object instances can be better segmented out.
PointGroup's key target is better grouping of points.
Two main problems to deal with:
1) separate the contents in 3D space into individual objects
2) determine the semantic label of each object.
Backbone Network
Designed a two-branch network on top of a shared backbone
that extracts point features:
a semantic branch predicts semantic labels
and an offset branch predicts offsets
for shifting each point
toward its respective instance centroid.
Clustering
Adopt an effective algorithm to group points into clusters.
For each point, take its coordinates as a reference,
group it with nearby points of the same label,
and expand the group progressively.
Consider two coordinate sets in two separate passes (called "Dual-Set Point Grouping")
1) original point positions
2) those shifted by the predicted offsets.
ScoreNet
Formulate the ScoreNet
to evaluate and pick candidate groups,
followed by NMS (Non-Maximum Suppression)
to remove duplicate predictions.
Input: Point set P of N points.
Each point has a color fi = (ri, gi, bi)
and 3D coordinates pi = (xi, yi, zi)
(where i ∈ {1, ..., N}).
Backbone extracts point-wise feature Fi for each point.
Denote the output features of the backbone as F = {Fi} ∈ ℝ^(N x K)
(where K: # of channels)
Paper's Implementation
1) Voxelize the points and construct a U-Net
with SSC (Submanifold Sparse Convolution) and SC (Sparse Convolution)
2) Then recover points from voxels
to obtain the point-wise features.
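A minimal NumPy sketch of the voxelize / recover round trip, leaving out the sparse U-Net that would sit in between. The function names, the mean pooling of features per voxel, and the voxel size of 0.05 are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def voxelize_mean(coords, feats, voxel_size):
    """Average point features per occupied voxel and return the point-to-voxel map."""
    # Integer voxel index of every point.
    grid = np.floor(coords / voxel_size).astype(np.int64)
    # One row per occupied voxel; inverse[i] is the voxel row that point i falls into.
    voxels, inverse = np.unique(grid, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # some NumPy versions return an extra axis here
    sums = np.zeros((len(voxels), feats.shape[1]), dtype=np.float64)
    np.add.at(sums, inverse, feats)  # unbuffered scatter-add of point features
    counts = np.bincount(inverse, minlength=len(voxels))
    return voxels, sums / counts[:, None], inverse

def devoxelize(voxel_feats, inverse):
    """Recover point-wise features by copying each voxel's feature to its points."""
    return voxel_feats[inverse]

coords = np.array([[0.01, 0.01, 0.01], [0.03, 0.02, 0.04], [0.26, 0.01, 0.01]])
feats = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]])
voxels, voxel_feats, inverse = voxelize_mean(coords, feats, voxel_size=0.05)
point_feats = devoxelize(voxel_feats, inverse)  # shape (N, K) again
```

In a real pipeline the voxel features would pass through the U-Net before `devoxelize` is called; here the first two points simply end up sharing the averaged feature of their common voxel.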
Contextual and geometric information
is well extracted by the U-Net,
which provides discriminative point-wise features F.
Apply an MLP to F
to produce semantic scores SC = {sc1, ..., scN} ∈ ℝ^(N x Nclass)
for the N points over the Nclass classes.
Regularize the results
by a cross entropy loss Lsem.
Predicted semantic label si for point i
is the class with the maximum score,
i.e., si = argmax(sci)
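A small sketch of the semantic branch's outputs in NumPy. It assumes the scores SC are raw logits and that Lsem is a softmax cross entropy averaged over points; the function name is hypothetical.

```python
import numpy as np

def semantic_loss_and_labels(scores, gt_labels):
    """scores: (N, Nclass) raw scores, gt_labels: (N,) integer class ids."""
    # Numerically stable log-softmax over the class axis.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Cross entropy: negative log-probability of the true class, averaged over points.
    loss = -log_probs[np.arange(len(gt_labels)), gt_labels].mean()
    # Predicted label si = argmax(sci).
    pred = scores.argmax(axis=1)
    return loss, pred

scores = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
gt = np.array([0, 1])
loss, pred = semantic_loss_and_labels(scores, gt)
```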
Offset branch encodes F
to produce N offset vectors O = {o1, ..., oN} ∈ ℝ^(N x 3)
for the N points.
For points belonging to the same instance,
constrain their learned offsets by an L1 regression loss as:
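A form of this loss consistent with the description, to be checked against the paper: an L1 distance between each predicted offset and the vector from the point to its instance centroid, averaged over instance points. The mask m_i (1 if point i lies on an instance, 0 otherwise), the centroid ĉ_i, and the map g(i) from point i to its ground-truth instance are notation assumed for this sketch.

```latex
L_{o\_reg} = \frac{1}{\sum_i m_i} \sum_i \left\lVert o_i - (\hat{c}_i - p_i) \right\rVert_1 \cdot m_i,
\qquad
\hat{c}_i = \frac{1}{|I_{g(i)}|} \sum_{j \in I_{g(i)}} p_j
```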
Paper finds it hard to regress precise offsets,
particularly for boundary points of large-size objects,
since these points are relatively far from the instance centroids.
To address this issue,
formulate a direction loss
to constrain the direction of predicted offset vectors.
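The two offset losses can be sketched in NumPy as follows. This is an illustration under assumptions: the direction loss is taken to be the negative mean cosine similarity between the predicted offset and the point-to-centroid vector, points with a negative instance id are treated as not belonging to any instance, and the function name and `eps` are hypothetical.

```python
import numpy as np

def offset_losses(points, offsets, instance_ids, eps=1e-8):
    """Return (L1 regression loss, direction loss) over points that lie on an instance."""
    mask = instance_ids >= 0  # assumption: negative id means "no instance"
    centroids = np.zeros_like(points)
    for inst in np.unique(instance_ids[mask]):
        members = instance_ids == inst
        centroids[members] = points[members].mean(axis=0)
    target = centroids - points  # vector from each point to its instance centroid
    n = max(int(mask.sum()), 1)
    # L1 distance between predicted and target offsets, averaged over instance points.
    reg = np.abs(offsets - target).sum(axis=1)[mask].sum() / n
    # Negative cosine similarity: -1 when the directions agree, +1 when opposite.
    o_unit = offsets / (np.linalg.norm(offsets, axis=1, keepdims=True) + eps)
    t_unit = target / (np.linalg.norm(target, axis=1, keepdims=True) + eps)
    direction = -(o_unit * t_unit).sum(axis=1)[mask].sum() / n
    return reg, direction

points = np.array([[0.0, 0, 0], [2.0, 0, 0], [10.0, 0, 0]])
instance_ids = np.array([0, 0, -1])
ideal = np.array([[1.0, 0, 0], [-1.0, 0, 0], [0.0, 0, 0]])
reg, direction = offset_losses(points, ideal, instance_ids)
```

The direction term depends only on where an offset points, not on its length, which is why it still gives a useful signal for boundary points whose exact offset magnitude is hard to regress.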
After obtaining the semantic labels,
begin to group points
into instance clusters
based on the void space between objects.
The clustering method groups points
that are close to each other
into the same cluster
if they have the same semantic label.
However, clustering directly based on
the point coordinate set P = {pi}
may fail to separate same-category objects
that are close to each other in 3D space
and may mis-group them.
Thus, use learned offset oi
1) to shift point i
towards its respective instance centroid
2) and obtain shifted coordinates qi = pi + oi ∈ ℝ^3
For points belonging to the same object instance,
unlike pi,
the shifted coordinates qi cluster around the same centroid.
So clustering based on the shifted coordinate set Q = {qi}
separates nearby objects better,
even if they have the same semantic label.
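A toy illustration of why shifting helps, assuming ideal offsets that point exactly at each instance centroid (a trained network would only approximate this). The scene, the variable names, and the `min_gap` helper are made up for the example.

```python
import numpy as np

# Two same-class objects along x, each 1.0 wide, separated by a gap of only 0.1.
p = np.array([[0.0, 0, 0], [0.5, 0, 0], [1.0, 0, 0],
              [1.1, 0, 0], [1.6, 0, 0], [2.1, 0, 0]])
instance = np.array([0, 0, 0, 1, 1, 1])

# Ideal offsets: vector from each point to the centroid of its own instance.
centroids = np.stack([p[instance == k].mean(axis=0) for k in (0, 1)])
o = centroids[instance] - p
q = p + o  # shifted coordinates

def min_gap(coords):
    """Smallest distance between any point of object 0 and any point of object 1."""
    a, b = coords[instance == 0], coords[instance == 1]
    return np.linalg.norm(a[:, None] - b[None], axis=-1).min()

gap_original = min_gap(p)  # the 0.1 gap between the touching ends
gap_shifted = min_gap(q)   # the distance between the two centroids
```

With a grouping radius between the two gaps, the objects merge when clustering on p but stay apart when clustering on q.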
However, for points near an object boundary,
the predicted offset may not be accurate.
So the clustering algorithm employs
"dual" point coordinate sets
(original coordinates P and shifted coordinates Q).
Denote the clustering result C as C^p U C^q
(the clusters discovered based on P and Q, respectively).
Core step of the algorithm:
1) for point i,
get points within the ball of radius r
centered at xi (the coordinate of point i in the set being clustered, P or Q)
2) and group points with the same semantic label
as point i into the same cluster.
(r serves as spatial constraint in the clustering,
so that two intra-category objects
at a distance larger than r are not grouped.
)
Use BFS to group points of the same instance into a cluster.
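The grouping procedure described here can be sketched with a brute-force ball query and a BFS queue. This is a minimal illustration, not the paper's implementation: function names are hypothetical, the neighbor search is O(N) per point, and no filtering of background classes or tiny clusters is done.

```python
from collections import deque
import numpy as np

def bfs_cluster(coords, labels, radius):
    """Group points of the same semantic label that are chained by neighbors within radius."""
    n = len(coords)
    visited = np.zeros(n, dtype=bool)
    clusters = []
    for seed in range(n):
        if visited[seed]:
            continue
        visited[seed] = True
        queue, members = deque([seed]), [seed]
        while queue:
            i = queue.popleft()
            # Ball query around point i, restricted to unvisited points of the seed's label.
            dist = np.linalg.norm(coords - coords[i], axis=1)
            near = np.flatnonzero((dist <= radius) & (labels == labels[seed]) & ~visited)
            visited[near] = True
            queue.extend(near.tolist())
            members.extend(near.tolist())
        clusters.append(sorted(members))
    return clusters

def dual_set_grouping(p, offsets, labels, radius):
    """C = C^p U C^q: cluster once on original and once on shifted coordinates."""
    return bfs_cluster(p, labels, radius) + bfs_cluster(p + offsets, labels, radius)

p = np.array([[0.0, 0, 0], [0.5, 0, 0], [1.0, 0, 0],
              [1.1, 0, 0], [1.6, 0, 0], [2.1, 0, 0]])
labels = np.zeros(6, dtype=int)
offsets = np.array([[0.5, 0, 0], [0.0, 0, 0], [-0.5, 0, 0],
                    [0.5, 0, 0], [0.0, 0, 0], [-0.5, 0, 0]])
proposals = dual_set_grouping(p, offsets, labels, radius=0.6)
```

In this toy scene the pass over P merges both objects into one proposal, while the pass over Q yields one proposal per object; both kinds of candidates are kept so that a later scoring step can choose among them.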
ScoreNet
1) to process the proposed point clusters C = C^p U C^q
2) produce a score per cluster proposal.
NMS is applied to these proposals
with the scores
to generate final instance prediction.
(G: instance predictions = {G1, ..., GMpred} ⊆ C
I: GT instances = {I1, ..., IMgt})
Input: set of candidate clusters C = {C1, ..., CM}
(M: # of candidate clusters,
Ci: i-th cluster,
Ni: # of points in Ci
)
Goal of ScoreNet
: to predict a score for each cluster
to indicate the quality of the associated cluster proposal,
so that the better clusters are precisely preserved in NMS
and the strengths of C^p and C^q are combined.
For each cluster,
1) gather the point features from F ∈ ℝ^(N x K)
(features extracted by the backbone)
2) and form Fci = {F_h(1), ..., F_h(Ni)},
where h maps the point index in Ci
to the corresponding point index in P.
Similarly, express the coordinates for points in Ci as Pci = {p_h(1), ..., p_h(Ni)}.
To better aggregate the cluster info,
take Fci and Pci as the initial features and coordinates,
and voxelize the clusters.
Feature for each voxel is average-pooled
from the initial features of points
in that voxel.
Then feed them into a small U-Net with SSC and SC
to further encode the features.
Cluster-aware max-pooling then follows
to produce a single cluster feature vector.
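Cluster-aware max-pooling can be read as a per-cluster maximum over feature rows. A minimal NumPy sketch, assuming `feats` holds the encoded features (one row per voxel or point) and `cluster_ids` maps each row to its cluster; both names are hypothetical, and every cluster is assumed to own at least one row.

```python
import numpy as np

def cluster_max_pool(feats, cluster_ids, num_clusters):
    """Max over all rows of feats that share a cluster id -> (num_clusters, K)."""
    pooled = np.full((num_clusters, feats.shape[1]), -np.inf)
    # Unbuffered in-place maximum, so repeated cluster ids are all taken into account.
    np.maximum.at(pooled, cluster_ids, feats)
    return pooled

feats = np.array([[1.0, 5.0], [3.0, 2.0], [0.0, 7.0]])
cluster_ids = np.array([0, 0, 1])
cluster_feats = cluster_max_pool(feats, cluster_ids, num_clusters=2)
```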
Final cluster scores
Inference
perform NMS on clusters C
with predicted scores Sc
to obtain the final instance predictions G ⊆ C.
IoU threshold is empirically set to 0.3.
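A greedy NMS over cluster proposals might look like the sketch below. It assumes IoU is measured between the point-index sets of two clusters, which these notes do not spell out; the function name is hypothetical.

```python
import numpy as np

def cluster_nms(clusters, scores, iou_threshold=0.3):
    """Greedy NMS over point-index clusters; returns indices of the kept clusters."""
    sets = [set(c) for c in clusters]
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    for idx in order:
        candidate = sets[idx]
        suppressed = False
        for k in keep:
            inter = len(candidate & sets[k])
            union = len(candidate | sets[k])
            if union and inter / union > iou_threshold:
                suppressed = True  # overlaps too much with a better-scored cluster
                break
        if not suppressed:
            keep.append(int(idx))
    return keep

clusters = [[0, 1, 2, 3, 4, 5], [0, 1, 2], [3, 4, 5]]
scores = np.array([0.4, 0.9, 0.8])
kept = cluster_nms(clusters, scores)
```

Here the merged proposal scores lowest and overlaps the first kept cluster with IoU 0.5, so it is dropped and the two per-object proposals survive.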
Datasets
ScanNet v2: 1613 scans with 3D object instance annotations.
S3DIS: 3D scans across six areas with 271 scenes in total.
Each point is assigned one label out of 13 semantic classes.
Evaluation Metrics
mAP (mean average precision)
Proposed PointGroup for 3D instance segmentation,
with a focus on better grouping points
by exploring
1) the in-between space
2) and the point semantic labels
among the object instances.
Considering the situation where
two intra-category objects
may be very close to each other,
the paper designs a two-branch network
to respectively learn a per-point semantic label
and per-point offset vector
for moving each point towards its instance centroid.
Clusters points based on both
1) original point coordinates
2) offset-shifted point coordinates
and combines the strengths of the two coordinate sets
to optimize point grouping precision.
Introduced ScoreNet
to learn to evaluate the generated candidate clusters,
followed by NMS
to avoid duplicates
before outputting the final predicted instances.
Future work:
introduce a progressive refinement module
to relieve the semantic inaccuracy problem
that affects the instance grouping,
and explore the possibility of
incorporating weakly- or self-supervised techniques
to boost the performance.