SG-NN: Sparse Generative Neural Networks for Self-Supervised Scene Completion of RGB-D Scans (CVPR'20)

Technical University of Munich

Abstract

We present a novel approach that converts partial and noisy RGB-D scans into high-quality 3D scene reconstructions by inferring unobserved scene geometry.

Our approach is fully self-supervised and can hence be trained solely on real-world, incomplete scans.

To achieve self-supervision, we remove frames from a given (incomplete) 3D scan in order to make it even more incomplete; self-supervision is then formulated by correlating the two levels of partialness of the same scan while masking out regions that have never been observed.

Through generalization across a large training set, we can then predict 3D scene completion without ever seeing any 3D scan of entirely complete geometry.
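To make this self-supervision concrete, below is a minimal sketch of how the completion objective can be restricted to observed target regions. The function and tensor names are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the masked self-supervised completion loss.
import torch

def masked_completion_loss(pred_tsdf, target_tsdf, target_known_mask):
    """L1 loss between predicted and target TSDF, restricted to regions
    that were actually observed in the (incomplete) target scan.

    pred_tsdf:         (B, 1, D, H, W) predicted truncated signed distances
    target_tsdf:       (B, 1, D, H, W) TSDF fused from the more complete scan
    target_known_mask: (B, 1, D, H, W) bool, True where the target was observed
    """
    mask = target_known_mask.float()
    diff = torch.abs(pred_tsdf - target_tsdf)
    # Unobserved target regions contribute no gradient, so the network is
    # never penalized for completing geometry the target scan also misses.
    return (diff * mask).sum() / mask.sum().clamp(min=1.0)
```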

Combined with a new 3D sparse generative neural network architecture, our method is able to predict highly detailed surfaces in a coarse-to-fine hierarchical fashion, generating 3D scenes at 2cm resolution, more than twice the resolution of existing state-of-the-art methods, while also outperforming them by a significant margin in reconstruction quality.
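As a rough illustration of the coarse-to-fine sparsification idea, the following hypothetical sketch shows how near-surface voxels predicted at one level can be selected and upsampled to define the sparse locations processed at the next, finer level. The actual network operates with sparse convolutions; this only demonstrates the sparsification logic.

```python
# Hypothetical sketch of one coarse-to-fine refinement step.
import torch

def sparsify_and_upsample(coarse_tsdf, truncation=3.0):
    """coarse_tsdf: (D, H, W) predicted distances in voxel units.
    Returns (N*8, 3) integer voxel coordinates at twice the resolution."""
    # Keep only voxels predicted to lie near the surface at the coarse level.
    near_surface = torch.nonzero(coarse_tsdf.abs() < truncation)  # (N, 3)
    # Each retained coarse voxel spawns its 2x2x2 children at the finer level.
    offsets = torch.stack(torch.meshgrid(
        torch.arange(2), torch.arange(2), torch.arange(2), indexing="ij"
    ), dim=-1).reshape(-1, 3)                                      # (8, 3)
    fine_coords = (near_surface[:, None, :] * 2 + offsets[None]).reshape(-1, 3)
    return fine_coords
```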


Video

Figures

Figures from the paper, with their captions reproduced below.

Our method takes as input a partial RGB-D scan and predicts a high-resolution 3D reconstruction while predicting unseen, missing geometry. Key to our approach is its self-supervised formulation, enabling training solely on real-world, incomplete scans. This not only obviates the need for synthetic ground truth, but is also capable of generating more complete scenes than any single target scene seen during training. To achieve high-quality surfaces, we further propose a new sparse generative neural network, capable of generating large-scale scenes at much higher resolution than existing techniques.

Our self-supervision approach for scan completion learns through deltas in partialness of RGB-D scans. From a given (incomplete) RGB-D scan (left), we produce a more incomplete version of the scan by removing some of its depth frames (middle). We can then train to complete the more incomplete scan (middle) using the original scan as a target (left), while masking out unobserved regions in the target scene (in orange). This enables our prediction to produce scenes that are more complete than the target scenes seen during training, as the training process effectively masks out incompleteness.

Our Sparse Generative Neural Network architecture for the task of scan completion. An input scan is encoded using a series of sparse convolutions, each set reducing the spatial dimensions by a factor of two. To generate high-resolution scene geometry, the coarse encoding is converted to a dense representation for a coarse prediction of the complete geometry. The predicted coarse geometry is converted to a sparse representation and input to our sparse, coarse-to-fine hierarchy, where each level of the hierarchy predicts the geometry of the next resolution (losses indicated in orange). The final output is a TSDF represented by a sparse set of voxel locations and their corresponding distance values.

Progressive generation of a 3D scene using our SG-NN, which formulates a generative model to predict a sparse TSDF as output.

Comparison to state-of-the-art scan completion approaches on Matterport3D [3] data (5cm resolution), with input scans generated from a subset of frames. In contrast to the fully-supervised 3D-EPN [11] and ScanComplete [12], our self-supervised approach produces more accurate, complete scene geometry.

Evaluating varying target data completeness available for training. We generate various incomplete versions of the Matterport3D [3] scans using ≈30%, 40%, 50%, 60%, and 100% (all) of the frames associated with each room scene, and evaluate on the 50% incomplete scans. Our self-supervised approach remains robust to the level of completeness of the target training data.

Scan completion results on Matterport3D [3] data (2cm resolution), with input scans generated from a subset of frames. Our self-supervision approach using loss masking enables more complete scene prediction than direct supervision using the target RGB-D scan, particularly in regions where occlusions commonly occur.

SG-NN architecture in detail. The final TSDF values are highlighted in orange, and intermediate outputs in yellow. Convolution parameters are given as (nf_in, nf_out, kernel size, stride, padding), with stride and padding defaulting to 1 and 0. Arrows denote concatenation, and ⊕ denotes addition.
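For readers who want to map the figure's convolution-parameter convention to code, the following is a small, hypothetical PyTorch sketch of an (nf_in, nf_out, kernel size, stride, padding) block. It uses dense 3D convolutions for simplicity, whereas the paper's network is sparse, and the channel sizes in the example are placeholders rather than the paper's exact values.

```python
# Hypothetical dense-3D-convolution analogue of the figure's block notation.
import torch.nn as nn

def conv_block(nf_in, nf_out, kernel_size, stride=1, padding=0):
    """A (nf_in, nf_out, kernel size, stride, padding) block as in the figure,
    with stride and padding defaulting to 1 and 0."""
    return nn.Sequential(
        nn.Conv3d(nf_in, nf_out, kernel_size, stride=stride, padding=padding),
        nn.BatchNorm3d(nf_out),
        nn.ReLU(inplace=True),
    )

# Example: one encoder stage that halves the spatial dimensions.
encoder_stage = conv_block(nf_in=8, nf_out=16, kernel_size=4, stride=2, padding=1)
```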

Paper