Single-view 3D object reconstruction with convolutional networks has demonstrated remarkable capabilities. Single-view 3D reconstruction models generate a 3D model of an object from a single reference image, which makes the task one of the most active research topics in computer vision.
For example, let’s consider the motorbike in the above image. Generating its 3D structure requires a complex pipeline that combines low-level image cues with high-level semantic information and knowledge about the structural arrangement of parts.
Owing to this complexity, single-view 3D reconstruction has been a major challenge in computer vision. In an attempt to make single-view 3D reconstruction more efficient, researchers have developed Splatter Image, a method that aims to achieve ultra-fast single-view reconstruction of an object’s 3D shape and appearance. At its core, the Splatter Image framework uses the Gaussian Splatting method as its 3D representation, taking advantage of the speed and quality it offers.
Recently, the Gaussian Splatting method has been implemented by numerous multi-view reconstruction models for real-time rendering, enhanced scaling, and fast training. With that being said, Splatter Image is the first framework that implements the Gaussian Splatting method for single-view reconstruction tasks.
In this article, we will be exploring how the Splatter Image framework employs Gaussian Splatting to achieve ultra-fast single-view 3D reconstruction. So let’s get started.
As mentioned earlier, Splatter Image is an ultra-fast approach for single-view 3D object reconstruction based on the Gaussian Splatting method. It is the first computer vision framework to implement Gaussian Splatting for monocular 3D object generation; traditionally, Gaussian Splatting has powered multi-view 3D object reconstruction frameworks. What further separates the Splatter Image framework from prior methods is that it is a learning-based approach, so reconstruction at test time requires only a feed-forward evaluation of the neural network.
Splatter Image relies fundamentally on Gaussian Splatting’s rendering quality and high processing speed to generate 3D reconstructions. The framework features a straightforward design: it uses a 2D image-to-image neural network to map the input image to one 3D Gaussian per pixel. The resulting 3D Gaussians have the form of an image, known as the Splatter Image, and the Gaussians also provide a 360-degree representation of the object. The process is demonstrated in the following image.
Although the process is simple and straightforward, the Splatter Image framework faces some key challenges when using Gaussian Splatting to generate 3D Gaussians for single-view 3D representations. The first major hurdle is to design a neural network that accepts the image of an object as input and generates a corresponding Gaussian mixture representing all sides of the object as output. To tackle this, Splatter Image takes advantage of the fact that even though the generated Gaussian mixture is a set, that is, an unordered collection of items, it can still be stored in an ordered data structure. Accordingly, the framework uses a 2D image as a container for the 3D Gaussians: each pixel in the container holds the parameters of one Gaussian, including its shape, opacity, and color.
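To make the "image as a container" idea concrete, here is a minimal sketch in NumPy. The channel layout (`CHANNELS`) and the function name are hypothetical and chosen only for illustration; the article does not specify how the channels are ordered or how many there are.

```python
import numpy as np

# Hypothetical channel layout, for illustration only.
CHANNELS = {"position": 3, "scale": 3, "rotation": 4, "opacity": 1, "color": 3}

def splatter_image_to_set(splatter):
    """Flatten an (H, W, C) Splatter Image into per-Gaussian parameter arrays."""
    h, w, c = splatter.shape
    if c != sum(CHANNELS.values()):
        raise ValueError("unexpected channel count")
    flat = splatter.reshape(h * w, c)  # one row per pixel, i.e. one Gaussian
    out, start = {}, 0
    for name, width in CHANNELS.items():
        out[name] = flat[:, start:start + width]
        start += width
    return out

rng = np.random.default_rng(0)
splatter = rng.normal(size=(4, 5, 14))      # a tiny 4x5 "image" of Gaussians
gaussians = splatter_image_to_set(splatter)  # 20 Gaussians, one per pixel
```

The pixel grid only provides an ordering for storage; once flattened, the rows can be treated as an unordered set of Gaussians.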
By storing 3D Gaussian sets in an image, the Splatter Image framework reduces reconstruction to learning an image-to-image neural network. With this approach, the reconstruction process can be implemented using only efficient 2D operators instead of relying on 3D operators. Furthermore, because the 3D representation is a mixture of 3D Gaussians, the framework can exploit the rendering speed and memory efficiency of Gaussian Splatting, which improves efficiency in both training and inference. The framework is also remarkably efficient to train: it can be trained on a single GPU on standard 3D object benchmarks.
The Splatter Image framework can also be extended to take several images as input. It does so by registering the individual Gaussian mixtures to a common reference and then taking the combination of the Gaussian mixtures predicted from the individual views. The framework also injects lightweight cross-attention layers into its architecture that allow different views to communicate with one another during prediction.
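The registration-and-combination step for several input views can be sketched as follows. This is an illustrative sketch, not the framework's code: the function names are hypothetical, and it assumes each view comes with a known rigid transform (rotation `R`, translation `t`) into the common reference frame. Means transform as points, and covariances transform as R Σ Rᵀ.

```python
import numpy as np

def to_common_frame(means, covs, R, t):
    """Map Gaussians from one view's frame into the common reference frame.

    means: (N, 3), covs: (N, 3, 3), R: (3, 3) rotation, t: (3,) translation.
    """
    new_means = means @ R.T + t
    new_covs = R @ covs @ R.T  # broadcasts over the N Gaussians
    return new_means, new_covs

def merge_views(views):
    """Register every view, then take the union of all Gaussians."""
    registered = [to_common_frame(*view) for view in views]
    means, covs = zip(*registered)
    return np.concatenate(means), np.concatenate(covs)

# Two toy views: the first is already in the common frame, the second is
# rotated by 90 degrees about the z axis and shifted along z.
rot_z90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
view_a = (np.zeros((2, 3)), np.tile(np.eye(3), (2, 1, 1)), np.eye(3), np.zeros(3))
view_b = (np.array([[1.0, 0.0, 0.0]]), np.diag([4.0, 1.0, 1.0])[None],
          rot_z90, np.array([0.0, 0.0, 2.0]))
merged_means, merged_covs = merge_views([view_a, view_b])
```

Opacities and colors would be concatenated in the same way; view-dependent color coefficients additionally need to be rotated into the common frame.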
From an empirical point of view, it is worth noting that the Splatter Image framework can produce a 360-degree reconstruction of the object even though it sees only one side of it. To do so, the framework allocates different Gaussians in a 2D neighborhood to different parts of the 3D object, encoding the 360-degree information in the 2D image. Furthermore, the framework sets the opacity of several Gaussians to zero, which deactivates them and allows them to be culled during post-processing.
To summarize, the Splatter Image framework:
- Is a novel approach that generates single-view 3D object reconstructions by porting the Gaussian Splatting approach.
- Extends the method to multi-view 3D object reconstruction.
- Achieves state-of-the-art 3D object reconstruction performance on standard benchmarks with exceptional speed and quality.
Splatter Image: Methodology and Architecture
Gaussian Splatting
As mentioned earlier, Gaussian Splatting is the primary method implemented by the Splatter Image framework to generate single-view 3D object reconstructions. In simple terms, Gaussian Splatting is a rasterization method for reconstructing 3D scenes and rendering them in real time from multiple points of view. The scene is represented as a collection of 3D Gaussians, and machine learning techniques are used to learn the parameters of each Gaussian. Rendering itself requires no further training, which keeps rendering times low. The following image summarizes the architecture of 3D Gaussian Splatting.
3D Gaussian Splatting first uses the set of input images to generate a point cloud. To do so, it estimates the extrinsic parameters of the camera, such as tilt and position, by matching pixels between the images, and these parameters are then used to compute the point cloud. Using different machine learning methods, Gaussian Splatting then optimizes four parameters for each Gaussian: Position (where it is located), Covariance (how it is stretched or scaled, expressed as a 3×3 matrix), Color (its RGB color), and Alpha (its transparency). The optimization process renders the image for each camera position and uses it to bring the parameters closer to those that reproduce the original image. As a result, the rendered 3D Gaussian Splatting output is an image that closely resembles the original image at the camera position from which it was captured.
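The 3×3 covariance has to stay symmetric and positive semi-definite while it is being optimized. A common way to guarantee this, used in the original 3D Gaussian Splatting work, is to build it from a per-axis scale and a rotation, Σ = R S Sᵀ Rᵀ. The sketch below assumes that factorization and a (w, x, y, z) quaternion convention; the function names are illustrative.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance(scale, quat):
    """Build Sigma = R S S^T R^T from per-axis scales and a rotation."""
    R = quat_to_rotmat(np.asarray(quat, dtype=float))
    S = np.diag(np.asarray(scale, dtype=float))
    M = R @ S
    return M @ M.T  # symmetric and positive semi-definite by construction

cov_axis_aligned = covariance([1.0, 2.0, 3.0], [1.0, 0.0, 0.0, 0.0])
cov_rotated = covariance([1.0, 2.0, 3.0], [0.9, 0.1, -0.3, 0.2])
```

Rotating the Gaussian changes the entries of the matrix but not its eigenvalues, which remain the squared scales.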
Furthermore, the opacity function and the color function in Gaussian Splatting together give a radiance field, where the color depends on both the 3D point and the viewing direction. The framework then renders the radiance field onto an image by integrating the colors observed along the ray that passes through each pixel. Gaussian Splatting represents these functions as a combination of colored Gaussians, where the Gaussian mean, or center, gives its location and the Gaussian covariance determines its shape and size. Each Gaussian also has an opacity property and a view-dependent color property that together define the radiance field.
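In practice, integrating colors along a ray is usually approximated by front-to-back alpha compositing over the Gaussians that overlap a pixel. The sketch below assumes the contributions are already sorted from near to far and that `alphas` holds each Gaussian's effective opacity at that pixel; both the function name and that input format are assumptions made for illustration.

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing for one pixel.

    colors: (N, 3) RGB contributions sorted near to far.
    alphas: (N,) effective opacity of each contribution at this pixel.
    """
    # Transmittance before sample i: product of (1 - alpha_j) for all j < i.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    return weights @ colors

red, blue = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
half_red_over_blue = composite(np.stack([red, blue]), np.array([0.5, 1.0]))
opaque_red_over_blue = composite(np.stack([red, blue]), np.array([1.0, 1.0]))
```

A fully opaque Gaussian in front hides everything behind it, while a half-transparent one lets half of the color behind it through.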
Splatter Image
The renderer maps the set of 3D Gaussians to an image. To perform single-view 3D reconstruction, the framework seeks the inverse function, one that reconstructs the mixture of 3D Gaussians from an image. The key contribution here is a simple yet effective design for this inverse function. Specifically, for an input image, the framework predicts a Gaussian for each individual pixel using an image-to-image neural network architecture whose output is an image, the Splatter Image. For each Gaussian, the network also predicts the shape, the opacity, and the color.
One might ask how the Splatter Image framework can reconstruct the 3D representation of an object even though it has access to only one of its views. In practice, the framework learns to use some of the available Gaussians to reconstruct the visible view, and uses the remaining Gaussians to automatically reconstruct unseen parts of the object. To maximize its efficiency, the framework can automatically switch off any Gaussian by predicting an opacity of zero. Such Gaussians are not rendered and are instead culled in post-processing.
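The culling step amounts to filtering the set of Gaussians by opacity. A minimal sketch follows; the `min_opacity` threshold and the dictionary layout are hypothetical choices for illustration, not values from the framework.

```python
import numpy as np

def cull_transparent(gaussians, min_opacity=1e-3):
    """Drop Gaussians whose opacity is (near) zero; min_opacity is a made-up threshold."""
    keep = gaussians["opacity"] > min_opacity
    return {name: values[keep] for name, values in gaussians.items()}

gaussians = {
    "position": np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.2], [0.0, 0.2, 0.9]]),
    "opacity": np.array([0.9, 0.0, 0.4]),
}
visible = cull_transparent(gaussians)  # the second Gaussian is switched off
```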
Image Level Loss
A major advantage of exploiting the speed and efficiency of the Gaussian Splatting method is that it allows the framework to render all of the images at each iteration, even for relatively large batch sizes. This means that the framework is not limited to decomposable losses; it can also use image-level losses that do not decompose into per-pixel losses.
Scale Normalization
It is challenging to estimate the size of an object by looking at a single view, and this ambiguity is difficult to resolve when the model is trained only with a loss. The issue does not arise in synthetic datasets, as all the objects are rendered with identical camera intrinsics and at a fixed distance from the camera, which ultimately helps in resolving the ambiguity. However, in datasets with real-life images, the ambiguity is quite evident, and the Splatter Image framework employs several pre-processing methods to approximately fix the scale of all objects.
View Dependent Color
To represent view-dependent colors, the Splatter Image framework uses spherical harmonics to generalize the colors beyond the Lambertian color model. For any specific Gaussian, the color is defined by the spherical harmonics together with coefficients that are predicted by the network. The viewpoint change transforms a viewing direction in the source camera frame to its corresponding viewing direction in the reference frame. The model then finds the corresponding coefficients for the transformed color function. The model is able to do so because spherical harmonics are closed under rotation, and so is each individual order.
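The closure property can be illustrated with the two lowest orders. Order 0 is a constant, so it is unaffected by rotation. Order-1 real spherical harmonics are, up to constant factors and ordering, the components of the unit viewing direction, so their coefficients rotate like an ordinary 3D vector. The sketch below uses that simplified basis (no normalization constants, x, y, z ordering), which is an assumption for readability rather than the convention of any particular implementation.

```python
import numpy as np

def view_dependent_color(dc, lin, direction):
    """Order-0 term `dc` (3,) plus order-1 term `lin` (3 colors x 3 axes)."""
    return dc + lin @ direction

def rotate_coefficients(dc, lin, R):
    """Coefficients of the same color function after directions change as d -> R d."""
    return dc, lin @ R.T  # order 0 is invariant, order 1 mixes only with itself

def rot_z(a):
    return np.array([[np.cos(a), -np.sin(a), 0.0], [np.sin(a), np.cos(a), 0.0], [0.0, 0.0, 1.0]])

def rot_x(a):
    return np.array([[1.0, 0.0, 0.0], [0.0, np.cos(a), -np.sin(a)], [0.0, np.sin(a), np.cos(a)]])

rng = np.random.default_rng(0)
dc, lin = rng.normal(size=3), rng.normal(size=(3, 3))
R = rot_z(0.7) @ rot_x(-0.3)
d = np.array([0.2, -0.5, 0.8])
d = d / np.linalg.norm(d)

color_source = view_dependent_color(dc, lin, d)
dc_rot, lin_rot = rotate_coefficients(dc, lin, R)
color_reference = view_dependent_color(dc_rot, lin_rot, R @ d)
```

The color seen along a direction in the source frame matches the color seen along the rotated direction in the reference frame, using only rotated coefficients of the same order.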
Neural Network Architecture
Most of the architecture of the predictor that maps the input image to the mixture of Gaussians is identical to the SongUNet architecture. The last layer is replaced by a 1×1 convolutional layer, with the color model determining the number of output channels. Given the input image, the network produces a multi-channel tensor as output; for each pixel, the channels encode the parameters that are then transformed into offset, opacity, rotation, depth, and color. The framework then applies nonlinear activation functions to these values to obtain the Gaussian parameters.
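The step from raw output channels to Gaussian parameters can be sketched as below. Everything specific here is an assumption made for illustration: the 15-channel layout, the inclusion of a scale channel, the particular activations (sigmoid for opacity and depth range, exponential for scale, normalization for the rotation quaternion), and the `z_near` and `z_far` values. The position is formed by pushing each pixel's ray out to the predicted depth and adding the predicted offset.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def activate(raw, rays, z_near=0.8, z_far=1.8):
    """Turn raw (H, W, 15) network outputs into Gaussian parameters.

    rays: (H, W, 3) per-pixel ray directions with z = 1 (camera frame).
    Channel order (hypothetical): depth 1, offset 3, opacity 1, scale 3,
    rotation 4, color 3.
    """
    depth_raw, offset, opacity_raw, scale_raw, rot_raw, color = np.split(
        raw, [1, 4, 5, 8, 12], axis=-1)
    depth = z_near + (z_far - z_near) * sigmoid(depth_raw)  # keep depth in range
    rot_norm = np.maximum(np.linalg.norm(rot_raw, axis=-1, keepdims=True), 1e-8)
    return {
        "position": rays * depth + offset,
        "opacity": sigmoid(opacity_raw),   # in (0, 1)
        "scale": np.exp(scale_raw),        # strictly positive
        "rotation": rot_raw / rot_norm,    # unit quaternion
        "color": color,
    }

H, W, focal = 4, 4, 4.0  # toy resolution and focal length
xs, ys = np.meshgrid(np.arange(W), np.arange(H))
rays = np.stack([(xs - (W - 1) / 2) / focal,
                 (ys - (H - 1) / 2) / focal,
                 np.ones((H, W))], axis=-1)
raw = np.random.default_rng(0).normal(size=(H, W, 15))
params = activate(raw, rays)
```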
For multi-view 3D reconstruction, the Splatter Image framework applies the same network to each input view and then uses the viewpoint change to combine the individual reconstructions in a common reference frame. Furthermore, to facilitate efficient coordination and exchange of information between the views, the framework makes two modifications to the network. First, the framework conditions the model on the respective camera pose of each view, encoding each entry of the pose with a sinusoidal positional embedding, which yields a multi-dimensional vector. Second, the framework adds cross-attention layers to facilitate communication between the features of different views.
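A sinusoidal embedding of camera pose entries might look like the sketch below. The number of frequencies, the power-of-two frequency schedule, and the 3×4 pose layout are illustrative assumptions; the article only states that each entry is encoded sinusoidally into multiple dimensions.

```python
import numpy as np

def sinusoidal_embedding(values, num_freqs=4):
    """Encode each scalar as [sin(f * v) for each f] + [cos(f * v) for each f]."""
    values = np.asarray(values, dtype=np.float64).reshape(-1, 1)
    freqs = 2.0 ** np.arange(num_freqs)  # 1, 2, 4, 8 (hypothetical schedule)
    angles = values * freqs              # shape (n_entries, num_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).reshape(-1)

pose = np.eye(4)[:3]                     # toy 3x4 [R | t] camera pose
embedding = sinusoidal_embedding(pose)   # 12 entries x 8 dimensions each
```

Each of the 12 pose entries expands to `2 * num_freqs` values, so the conditioning vector here has 96 dimensions.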
Splatter Image: Experiments and Results
The Splatter Image framework measures the quality of its reconstructions by evaluating Novel View Synthesis quality: the framework takes the source view, reconstructs the 3D shape, and renders it to unseen target views. Performance is measured with the Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) scores, the last of which measures perceptual quality.
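Of the three metrics, PSNR is the simplest to compute by hand: it is 10·log10(MAX² / MSE), where MAX is the largest possible pixel value. The sketch below assumes images stored as floats in [0, 1]; SSIM and LPIPS are more involved and are normally taken from existing libraries.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

target = np.zeros((8, 8, 3))
pred = np.full((8, 8, 3), 0.1)   # uniform error of 0.1, so MSE = 0.01
score = psnr(pred, target)       # about 20 dB
```

Higher PSNR means the rendered novel view is closer to the ground-truth view on a per-pixel basis.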
Single-View 3D Reconstruction Performance
The following table demonstrates the performance of the Splatter Image model on the single-view 3D reconstruction task on the ShapeNet benchmark.
As can be observed, the Splatter Image framework outperforms all deterministic reconstruction methods on the LPIPS and SSIM scores, which indicates that the model generates sharper reconstructions. The Splatter Image model also outperforms all deterministic baselines in terms of the PSNR score, which indicates that the generated reconstructions are also more accurate. In addition to outperforming all the deterministic methods, the Splatter Image framework requires only the relative camera poses, and it is efficient in both the training and testing phases.
The following image demonstrates the qualitative results of the Splatter Image framework. As can be seen, the model generates reconstructions with thin and interesting geometries and captures the details of the conditioning views.
The following image shows that the reconstructions generated by the Splatter Image framework are not only sharper but also more accurate than those of previous models, especially in unconventional conditions with thin structures and limited visibility.
Multi-View 3D Reconstruction
To evaluate its multi-view 3D reconstruction capabilities, the Splatter Image framework is trained on the ShapeNet-SRN Cars dataset for two-view prediction. Existing methods use absolute camera pose conditioning for multi-view 3D reconstruction tasks, which means the model learns to rely primarily on the canonical orientation of the object in the dataset. Although this works, it limits the applicability of the models, as the absolute camera pose is often unknown for a new image of an object.
Final Thoughts
In this article, we have talked about Splatter Image, a method that aims to achieve ultra-fast single-view reconstruction of an object’s 3D shape and appearance. At its core, the Splatter Image framework uses the Gaussian Splatting method as its 3D representation, taking advantage of the speed and quality it offers. The framework processes images using an off-the-shelf 2D CNN architecture to predict a pseudo-image that contains one colored Gaussian per pixel. By using the Gaussian Splatting method, the Splatter Image framework is able to combine fast rendering with fast inference, which results in quick training and quicker evaluation on real and synthetic benchmarks.