https://arxiv.org/pdf/1911.11763.pdf
SuperGlue matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem whose costs are predicted by a graph neural network. SuperGlue also has a flexible context aggregation mechanism based on attention, allowing it to reason about the underlying 3D scene and feature assignments jointly. Through end-to-end training, SuperGlue learns priors over geometric transformations and regularities of the 3D world, outperforming other learned approaches. The method runs in real time on a GPU and can be integrated into SfM or SLAM systems.
SuperGlue is a neural architecture that learns to match local features for 2D-to-2D data association. SuperGlue acts as a middle-end between the feature extraction front-end and the pose estimation back-end. The network uses a graph neural network and attention to solve an assignment optimization problem and handle partial point visibility and occlusion. The method is trained end-to-end from image pairs and learns priors for pose estimation from a large annotated dataset. SuperGlue outperforms handcrafted and learned matchers and advances the state-of-the-art in indoor and outdoor pose estimation.
Local feature matching in computer vision typically involves detecting interest points, computing visual descriptors, matching them with a nearest-neighbor search, filtering incorrect matches, and estimating a geometric transformation. Classical methods filter matches with handcrafted heuristics, while recent deep learning work focuses either on learning better sparse detectors and local descriptors or on filtering matches by classifying them as inliers or outliers. Graph matching problems are usually formulated as quadratic assignment problems, which are NP-hard, but optimal transport provides an efficient approximate solution. Deep learning for sets such as point clouds uses attention to perform flexible global and data-dependent local aggregation. SuperGlue is a learnable middle-end that performs context aggregation, matching, and filtering in a single end-to-end architecture, solving an assignment optimization problem using a graph neural network with attention.
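As a point of reference for the classical pipeline described above, here is a minimal sketch of nearest-neighbor matching with a mutual check, one common handcrafted filtering heuristic. It is an illustration rather than code from the paper: the function name `mutual_nearest_neighbors` and the toy descriptors are assumptions made for this example.

```python
import numpy as np

def mutual_nearest_neighbors(desc_a, desc_b):
    """Return index pairs (i, j) where i and j are each other's nearest neighbor."""
    # Pairwise squared Euclidean distances, shape (M, N).
    dists = ((desc_a[:, None, :] - desc_b[None, :, :]) ** 2).sum(axis=-1)
    nn_ab = dists.argmin(axis=1)  # best match in B for each descriptor in A
    nn_ba = dists.argmin(axis=0)  # best match in A for each descriptor in B
    idx_a = np.arange(desc_a.shape[0])
    keep = nn_ba[nn_ab] == idx_a  # mutual check: A -> B -> A returns to the start
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)

# Toy descriptors (hypothetical values): the third descriptor in A has no counterpart in B.
desc_a = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
desc_b = np.array([[0.0, 0.9], [1.1, 0.0]])
matches = mutual_nearest_neighbors(desc_a, desc_b)
print(matches)  # rows are (index in A, index in B)
```

Each descriptor is matched independently here, with no context from the rest of the image. That is the limitation a learned middle-end is meant to address.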
The motivation behind SuperGlue is to leverage regularities of the 3D world and physical constraints of keypoint projections to improve feature matching. Correspondences across images should adhere to certain physical constraints such as a keypoint having at most a single correspondence in the other image, and some keypoints being unmatched due to occlusion and detector failure. SuperGlue formulates feature matching as an optimization problem with a cost predicted by a deep neural network, allowing relevant priors to be learned directly from data and eliminating the need for domain expertise and heuristics.
SuperGlue considers two images A and B, each with a set of local features consisting of keypoint positions p and associated visual descriptors d. The visual descriptors can be extracted by a CNN like SuperPoint or traditional descriptors like SIFT.
The model is formulated as an optimization problem, with the goal of finding correspondences between keypoints in two images. The model uses a keypoint encoder and a multiplex graph neural network to propagate information across both intra-image and inter-image edges. The model also incorporates an attention mechanism to perform aggregation and compute the messages passed between elements in the graph. The final matching descriptors are obtained through linear projections. Overall, the SuperGlue model aims to learn relevant priors directly from data, without the need for domain expertise and heuristics.
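The attention-based aggregation can be sketched as follows. This is a simplified, single-head illustration under stated assumptions, not the paper's implementation: the function `attention_message`, the random weights `w_q`, `w_k`, `w_v`, and the feature dimension are all hypothetical, and the keypoint encoder, the MLP update, and multiple heads are left out. The same function yields a self-attention message when the source is the same image (intra-image edges) and a cross-attention message when the source is the other image (inter-image edges).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_message(x_query, x_source, w_q, w_k, w_v):
    """Aggregate source features into one message per query keypoint."""
    q = x_query @ w_q    # queries come from the keypoints being updated
    k = x_source @ w_k   # keys and values come from the source keypoints
    v = x_source @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v, weights  # message = attention-weighted average of values

rng = np.random.default_rng(0)
dim = 8                            # hypothetical feature dimension
x_a = rng.normal(size=(5, dim))    # 5 keypoints in image A
x_b = rng.normal(size=(7, dim))    # 7 keypoints in image B
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

self_msg, self_w = attention_message(x_a, x_a, w_q, w_k, w_v)    # intra-image edges
cross_msg, cross_w = attention_message(x_a, x_b, w_q, w_k, w_v)  # inter-image edges
```

Because the attention weights are computed from the features themselves, each keypoint can attend to any subset of keypoints in either image rather than to a fixed neighborhood.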
The optimal matching layer produces a partial assignment matrix: it computes a score matrix over all possible matches and maximizes the total score subject to the constraints of a partial soft assignment. The pairwise score is the similarity of the matching descriptors, and dustbins are added so that unmatched keypoints can be assigned explicitly. The optimization problem is solved with the Sinkhorn algorithm, a differentiable version of the Hungarian algorithm that runs efficiently on a GPU. The final assignment matrix is obtained by dropping the dustbins.
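A minimal log-domain sketch of the Sinkhorn iterations with dustbins follows. It is an illustration under assumptions, not the authors' code: the function name `sinkhorn_with_dustbins`, the iteration count, and the toy scores are hypothetical, and the dustbin score is a fixed constant here, whereas the paper learns it as a single parameter. Each keypoint carries unit mass, and each dustbin can absorb every keypoint of the other image.

```python
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn_with_dustbins(scores, dustbin_score, n_iters=100):
    """Return the (M, N) soft assignment and the augmented (M+1, N+1) matrix."""
    m, n = scores.shape
    # Augment the score matrix with one dustbin row and one dustbin column.
    aug = np.full((m + 1, n + 1), dustbin_score, dtype=float)
    aug[:m, :n] = scores
    # Target marginals in log space: mass 1 per keypoint, N and M for the dustbins.
    log_a = np.log(np.concatenate([np.ones(m), [n]]))
    log_b = np.log(np.concatenate([np.ones(n), [m]]))
    u = np.zeros(m + 1)
    v = np.zeros(n + 1)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = log_a - logsumexp(aug + v[None, :], axis=1)
        v = log_b - logsumexp(aug + u[:, None], axis=0)
    p_aug = np.exp(aug + u[:, None] + v[None, :])
    return p_aug[:m, :n], p_aug  # dropping the dustbins gives the final assignment

# Toy scores (hypothetical): 2 keypoints in A, 3 in B; the third in B has no good match.
scores = np.array([[5.0, 0.0, 0.0],
                   [0.0, 5.0, 0.0]])
p, p_aug = sinkhorn_with_dustbins(scores, dustbin_score=1.0)
matches = p.argmax(axis=1)  # best match in B for each keypoint in A
```

Every step is a differentiable array operation, which is what lets such a layer be trained end-to-end. In practice the scores would come from the similarity of the matching descriptors rather than from hand-set values.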
SuperGlue is equivariant to permutation of keypoints and images, and its attention mechanism is more flexible than instance normalization used by previous approaches. It outperforms existing matchers and can be a simple drop-in replacement for them. SuperGlue also borrows self-attention from the Transformer but embeds it into a graph neural network and introduces symmetric cross-attention for better feature reuse.
Backpropagating through the SuperPoint descriptor network while training SuperGlue improves the AUC@20°. Visualizations of self- and cross-attention patterns are shown in Figure 7, reflecting the complexity of the learned behavior. A detailed analysis of the trends and inner workings of attention is provided in Appendix D.
The paper presents SuperGlue, an attention-based graph neural network for local feature matching. The network uses self-attention and cross-attention to enlarge the receptive field of local descriptors and enable cross-image communication. The approach handles partial assignments and occluded points by solving an optimal transport problem. The experiments show that SuperGlue achieves significant improvement over existing approaches and enables highly accurate relative pose estimation on wide-baseline indoor and outdoor image pairs. SuperGlue runs in real time and works well with both classical and learned features. The paper concludes that SuperGlue is a major milestone towards end-to-end deep SLAM.