¹ University of Alberta, Edmonton AB T6G 2R3, Canada
{ymu3, lcheng5}@ualberta.ca
² Noah’s Ark Lab, Huawei Canada, Markham ON L3R 5A4, Canada

GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction

Yuxuan Mu¹ (ORCID 0000-0001-7132-3155), Xinxin Zuo² (ORCID 0000-0002-7116-9634), Chuan Guo¹ (ORCID 0000-0002-4539-0634), Yilin Wang¹, Juwei Lu², Xiaofeng Wu², Songcen Xu², Peng Dai², Youliang Yan², Li Cheng¹ (ORCID 0000-0003-3261-3533)

Abstract

We present GSD, a diffusion model approach based on Gaussian Splatting (GS) representation for 3D object reconstruction from a single view. Prior works suffer from inconsistent 3D geometry or mediocre rendering quality due to improper representations. We take a step towards resolving these shortcomings by utilizing the recent state-of-the-art 3D explicit representation, Gaussian Splatting, and an unconditional diffusion model. This model learns to generate 3D objects represented by sets of GS ellipsoids. With these strong generative 3D priors, though learning unconditionally, the diffusion model is ready for view-guided reconstruction without further model fine-tuning. This is achieved by propagating fine-grained 2D features through the efficient yet flexible splatting function and the guided denoising sampling process. In addition, a 2D diffusion model is further employed to enhance rendering fidelity, and improve reconstructed GS quality by polishing and re-using the rendered images. The final reconstructed objects explicitly come with high-quality 3D structure and texture, and can be efficiently rendered in arbitrary views. Experiments on the challenging real-world CO3D dataset demonstrate the superiority of our approach.

Keywords:

Object Reconstruction, Gaussian Splatting, Guided Diffusion Model

1 Introduction

Given the abundance of image data in the real world, the problem of 3D reconstruction from single-view images has garnered notable attention [40, 43, 10, 48]. While humans can effortlessly deduce the general object shape and even imagine its texture from unseen views, the problem is highly non-trivial for computational models. As described next, three key aspects underpin this problem. First, a proper 3D representation must be capable of encoding high-fidelity 3D information while being compatible with various levels of quantization. Second, akin to the human perception system, it is crucial to have a generative model that can produce an object with diverse appearances of its backside while remaining faithful to the input views. Finally, the 3D object must be efficiently and precisely renderable into an arbitrary view.

Existing efforts often struggle to properly address one or more of these three aspects. For instance, the recent 2D novel view synthesis methods [51, 3, 20, 17, 30, 41] usually fall short in maintaining 3D consistency. Another line of research [43, 44, 46, 10, 19, 18, 11, 23, 40], based on explicit 3D representations such as voxels, point clouds and meshes, is limited to coarse geometry and suboptimal rendering quality, though it offers consistent 3D rendering. This may be attributed to the low resolution or sparse nature of 3D features (e.g., points, voxels) and the challenges of their discretization (e.g., mesh) in deep learning models. Implicit 3D representations, on the other hand, formulate 3D space as a query-based implicit function [6, 24, 1, 25, 32], achieving remarkable quality in single-view 3D reconstruction in a well-defined canonical space [48, 4, 14]. Unfortunately, they typically rely on cumbersome procedures such as marching cubes for 3D geometry extraction and view rendering.

Motivated by these observations, we introduce a novel framework, GSD, for high-quality single-view 3D reconstruction by building a generative Diffusion Transformer (DiT) [27] upon the emerging Gaussian Splatting (GS) representation [16]. Specifically, GS encodes a scene by a set of GS ellipsoids, with each ellipsoid parameterized by its center position, covariance, regional color, and opacity. Unlike existing 3D representations, GS explicitly encodes 3D geometry and texture in high resolution and density. Furthermore, due to its spatial explicitness, we can easily deploy a point-space DiT even without positional encoding [26]. Unlike other diffusion models that use classifier-free conditioning with image encoders [51, 26, 22], our GS DiT enables efficient fine-grained image conditioning through its unique splatting-based rendering and loss-guided sampling [34], which also ensures fidelity to the given-view images.

Figure 1: An illustration of our View-Guided Gaussian Splatting Diffusion framework for single-view 3D reconstruction. It works by progressively denoising a randomly initialized set of Gaussian Splatting (GS) ellipsoids with continuous guidance from the discrepancies between the input and rendered images. The gray arrow represents the splatting-based GS rendering, while the orange arrow depicts the backpropagation of guidance gradients. The diffusion model built directly upon the GS representation provides explicit geometry information. The view-guided sampling takes advantage of the splatting function to faithfully yet efficiently obtain fine-grained features from the given view.

Upon the GS representation, a category-specific 3D DiT is trained to capture the space of plausible 3D objects in terms of their diverse geometries and textures. As illustrated in Fig. 2, without an input image, our diffusion model learns to generate high-fidelity 3D objects with distinct geometries and textures. When an input image is provided, the same diffusion model is used to reconstruct the specific 3D object so that it is faithful to the input when rendered to the same view. This process is presented in Fig. 1. At test time, the GS object at each denoising step is projected to the given view through the differentiable splatting-based rendering. The gradients of the discrepancies between the rendered and the reference images are then backpropagated to the corresponding GS samples to refine the 3D object at the current step, similar to classifier guidance [34]. This approach is simple yet effective, and can be easily adapted to multi-view reconstruction. In addition, an auxiliary 2D diffusion model is employed to further improve the quality of rendered images, which in turn facilitates better 3D reconstruction.

Our main contributions can be summarized as follows:

• Our proposed GSD is, to our knowledge, the first diffusion model that directly models the raw GS representation, capturing its 3D generative prior for single-view reconstruction.
• The GS DiT naturally comes with an effective yet flexible view-guided sampling strategy that can extract fine-grained features from the given views using the efficient splatting function. Given an input image at test time, the guided iterative denoising of our GS-based diffusion model allows a progressive refinement of the reconstructed 3D object that stays consistent with the input view.
• Experiments on the real-world CO3D dataset demonstrate the superiority of our approach over state-of-the-art methods. Our approach is also flexible enough to work with multi-view images.

2 Related Work

View-Conditioned 3D Reconstruction and Generation. Many related works reconstruct the 3D shape by jointly modeling the unconditional 3D priors and the conditional distribution with a generative model [15, 26, 49]. Those that work on explicit 3D representations [49, 26, 23, 43] can recover explicit geometry but fail to synthesize photorealistic views. In the other stream of studies [15, 4, 25, 32, 1, 3], advanced implicit representations enable photo-quality view synthesis but struggle to extract accurate geometry in unconstrained space. We find that the emerging 3D representation Gaussian Splatting [16] has great potential to be a generally suitable representation for this task, as it enjoys the benefits of both explicit geometry and realistic view synthesis. Concurrent works using the GS representation either recast the task as a deterministic prediction problem [36] or combine GS with a latent representation for additional feature decoration [52, 45]. Moreover, most previous works perceive the given view in camera space through an image encoder [26, 15, 37], which cannot ensure faithful reconstruction because of the compression and also requires canonical coordinates to constrain the modeling space. Such methods are better described as image-conditioned generation than as reconstruction from views. Inspired by the fine-grained projection method from PC² [23], we take advantage of GS splatting-based rendering [16] to access the image through pixel-level gradients, which reliably preserve the view information and flexibly accommodate arbitrary views using relative camera parameters in world space. A similar gradient conditioning approach is also used in SSD NeRF [4], although it is restricted by the robustness of its dataset-specific neural rendering. Combining the universal rendering of GS with view-guided sampling on the GS diffusion model could fully explore the potential of these methods.

Novel View Synthesis. The novel view synthesis (NVS) task is becoming progressively closer to the 3D reconstruction task [51, 31, 17, 30]. One of its most significant pain points is 3D inconsistency, which stems from the lack of 3D geometry priors. To address this problem, some works incorporate multi-view geometry [51, 48], depth [2, 22], and cues from large multi-view 2D datasets [22, 20, 31, 41]. However, NVS primarily focuses on imaging geometry awareness rather than modeling the prior distribution of 3D shapes. We argue that this 2D objective deviates from 3D reconstruction, potentially resulting in unsatisfactory 3D geometry.

SDS-based 3D Asset Creation. 3D asset creation has emerged thanks to the boom of large models. Most methods distill the 3D representation from a pre-trained image generation model by Score Distillation Sampling (SDS) [28, 38, 5]. One of their primary weaknesses lies in 3D geometry, despite the use of multi-view geometry to regress the 3D representation. These approaches rely heavily on the consistent performance of large pre-trained image models, where views are treated as independent during pre-training. This could lead to the Janus problem. Another issue arises when applying these methods to our object reconstruction scenario: the 3D space should ideally be well-constrained, but achieving alignment in a real-world setting is challenging. Consequently, the 3D asset creation approaches may not easily adapt to practical applications in real-world object reconstruction.

3 Background: Gaussian Splatting

Gaussian Splatting [16] is an emerging method for novel view synthesis and 3D reconstruction from multi-view images. In contrast to NeRF-style implicit representations [24], GS characterizes the scene using a set of anisotropic GS ellipsoids defined by their center positions $\mu\in\mathbb{R}^{3}$, covariance $\Sigma\in\mathbb{R}^{6}$, color $c\in\mathbb{R}^{3}$, and opacity $\alpha\in\mathbb{R}^{1}$. During rendering, the GS ellipsoids are projected onto the imaging plane and then allocated to individual tiles [53]. The color of pixel $\textbf{p}$ on the image is given by typical point-based blending [47] as follows:

	$\displaystyle C(\textbf{p})=\sum_{i\in\mathcal{N}}c_{i}\sigma_{i}\prod_{j=1}^{i-1}(1-\sigma_{j}),$		(1)
	$\displaystyle\text{where}\ \sigma_{i}=\alpha_{i}e^{-\frac{1}{2}(\textbf{p}-\mu_{i})^{T}\Sigma_{i}^{-1}(\textbf{p}-\mu_{i})}.$

Compared with NeRF’s importance sampling process, these GS points have the potential to cluster towards critical regions, thereby improving overall efficiency for expression and rendering.
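The blending rule in Eq. 1 can be sketched in a few lines of plain Python. This is only an illustration, not the tile-based rasterizer of [16]: it assumes the ellipsoids have already been projected to 2D and sorted front to back, and the dictionary keys (`mu`, `cov_inv`, `alpha`, `color`) are hypothetical names chosen for the example.

```python
import math


def gaussian_weight(p, mu, cov_inv, alpha):
    """sigma_i = alpha_i * exp(-0.5 * (p - mu)^T Sigma^{-1} (p - mu)) for a 2D pixel p."""
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    (a, b), (c, d) = cov_inv
    mahalanobis = dx * (a * dx + b * dy) + dy * (c * dx + d * dy)
    return alpha * math.exp(-0.5 * mahalanobis)


def blend_pixel(p, splats):
    """Front-to-back blending of depth-sorted splats at pixel p.

    Each splat is a dict with hypothetical keys: 'mu' (projected 2D centre),
    'cov_inv' (inverse of the projected 2x2 covariance), 'alpha' and 'color' (RGB).
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # running product of (1 - sigma_j) over splats in front
    for s in splats:
        sigma = gaussian_weight(p, s["mu"], s["cov_inv"], s["alpha"])
        for k in range(3):
            color[k] += s["color"][k] * sigma * transmittance
        transmittance *= 1.0 - sigma
    return color
```

Every operation here is differentiable with respect to the splat parameters, which is the property the view-guided sampling relies on.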

4 Our Approach

Section 4.1 explains how we formulate the GS diffusion model. In Section 4.2, we describe how the GS splatting function is used to guide the denoising sampling process of the GS DiT for view-based 3D reconstruction. Finally, Section 4.3 details the polishing and re-using process that couples the auxiliary 2D diffusion model with the GS DiT. See Figure 3 for an overview of our pipeline at inference.

4.1 Modeling GS Generative Prior

Building upon the recent advancements in denoising diffusion probabilistic models (DDPM) [12], we formulate our GS dataset distribution modeling using a diffusion-based generative model.

Representing 3D Objects with GS. To prepare the dataset for training the GS diffusion model, we convert our dense-view image dataset into a dataset of GS using the 3D GS scene reconstruction method [16]. However, this dense-view regression-based approach regularizes the optimization process only through its densify-and-prune function on points. In contrast, to make feed-forward network modeling applicable, we aim to further regularize the feature distribution of GS and obtain a constant number of GS ellipsoids per scene. Therefore, we restrict the number of GS ellipsoids by densifying only the ellipsoids with the Top-K gradient values, where K is the difference between the pre-set maximum ellipsoid count and the current ellipsoid count. We observe that this constrained densification allows for effective reconstruction of object data with only a 2% PSNR reduction compared to the full model, which has two orders of magnitude more GS ellipsoids.
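The Top-K selection described above can be sketched as follows. The function name and arguments are hypothetical, and the sketch assumes that densifying one ellipsoid increases the total count by exactly one, so that capping the selection at K keeps the total within the pre-set maximum.

```python
def select_for_densification(grad_norms, max_count):
    """Return indices of the ellipsoids to densify.

    grad_norms: one accumulated gradient magnitude per current ellipsoid.
    max_count:  pre-set maximum number of ellipsoids per scene.
    K is the remaining budget, i.e. max_count minus the current count.
    """
    current = len(grad_norms)
    k = max(0, max_count - current)
    ranked = sorted(range(current), key=lambda i: grad_norms[i], reverse=True)
    return ranked[:k]
```

Once the budget is exhausted the function returns an empty list, so the ellipsoid count stays constant for the rest of the optimization.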

Training a Diffusion Model on GS. In general, a diffusion model takes Gaussian noise as input and progressively denoises it in $T$ steps. It learns strong data priors from the denoising-diffusion process [49]. In our framework, the diffusion model operates on GS ellipsoids $\mathbf{x}\in\mathbb{R}^{16}$, with features including position $\mathbb{R}^{3}$, scale of covariance $\mathbb{R}^{3}$, rotation of covariance $\mathbb{R}^{6}$ [50], opacity $\mathbb{R}^{1}$ and color $\mathbb{R}^{3}$. In our GS DDPM, $\mathbf{x}_{T}\sim\mathcal{N}(\mathbf{x}_{T};0,\mathbf{I})$ denotes the pure-noise GS ellipsoids, and $\mathbf{x}_{0}\sim q(\mathbf{x}_{0})$ is a data point sampled from the data distribution, which in practice is our GS dataset.
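To make the 16-dimensional feature vector concrete, the sketch below slices one ellipsoid into its five groups and converts the 6D rotation feature into a rotation matrix. Two things are assumptions made for illustration: the ordering of the groups inside the vector, which the text does not specify, and the reading of the 6D feature as two 3-vectors that are orthonormalized by a Gram-Schmidt step.

```python
import math

# Assumed ordering of the 16 features; the text lists the groups but not their order.
LAYOUT = {
    "position": (0, 3),
    "scale": (3, 6),
    "rotation6d": (6, 12),
    "opacity": (12, 13),
    "color": (13, 16),
}


def unpack_ellipsoid(x):
    """Split a flat 16-D feature vector into named groups."""
    if len(x) != 16:
        raise ValueError("expected 16 features per ellipsoid")
    return {name: list(x[a:b]) for name, (a, b) in LAYOUT.items()}


def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def rotation_from_6d(r6):
    """Gram-Schmidt on two 3-vectors; returns the three orthonormal axes b1, b2, b3."""
    a1, a2 = r6[:3], r6[3:]
    b1 = _normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])
    b3 = [
        b1[1] * b2[2] - b1[2] * b2[1],
        b1[2] * b2[0] - b1[0] * b2[2],
        b1[0] * b2[1] - b1[1] * b2[0],
    ]
    return [b1, b2, b3]
```

Because any pair of non-parallel 3-vectors maps to a valid rotation, a denoiser can regress these six numbers freely without an explicit orthogonality constraint.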

During training, the diffusion process is described as

	$\displaystyle q(\mathbf{x}_{t}|\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_{t};\sqrt{1-\beta_{t}}\mathbf{x}_{t-1},\beta_{t}\mathbf{I}),$		(2)

which formulates a Markov process with a variance schedule $\{\beta_{t}\}^{T}_{t=0}$ that gradually adds Gaussian noise to the GS. The training objective is to learn the reverse denoising process with a neural approximator $p_{\theta}(\mathbf{x}_{0};\mathbf{x}_{t},t)$ following [29], given by

	$\displaystyle\mathcal{L}_{\text{DDPM}}=\mathbb{E}_{\mathbf{x}_{0}\sim q(\mathbf{x}_{0}),t\sim[1,T]}\left[\left\|\mathbf{x}_{0}-p_{\theta}(\mathbf{x}_{0};\mathbf{x}_{t},t)\right\|^{2}\right].$		(3)
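A minimal sketch of the forward noising in Eq. 2 and the clean-sample objective in Eq. 3 is given below. It relies on the standard closed form obtained by composing Eq. 2 over $t$ steps, and it assumes that $\bar{\beta}_{t}$ denotes the cumulative noise variance $1-\prod_{s\le t}(1-\beta_{s})$. The linear schedule and its endpoints are assumptions made for illustration, not the settings used in the paper.

```python
import math
import random


def make_schedule(T, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumption) and bar_beta_t = 1 - prod_{s<=t}(1 - beta_s)."""
    betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]
    bar_betas, keep = [], 1.0
    for b in betas:
        keep *= 1.0 - b
        bar_betas.append(1.0 - keep)
    return betas, bar_betas


def q_sample(x0, bar_beta_t, rng):
    """Draw x_t ~ q(x_t | x_0): sqrt(1 - bar_beta_t) * x0 + sqrt(bar_beta_t) * eps."""
    keep = math.sqrt(1.0 - bar_beta_t)
    noise = math.sqrt(bar_beta_t)
    return [keep * v + noise * rng.gauss(0.0, 1.0) for v in x0]


def x0_loss(x0, x0_pred):
    """Squared error between the clean sample and the predicted clean sample (Eq. 3)."""
    return sum((a - b) ** 2 for a, b in zip(x0, x0_pred))
```

In a training loop, one would draw $t$ uniformly, call `q_sample` on a GS sample from the dataset with a `random.Random` instance, feed the result to the denoiser, and minimize `x0_loss`.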

Backbone Choice for $p_{\theta}$. When our neural approximator $p_{\theta}$ operates on GS, it treats GS as point clouds with rich features. To keep it simple, inspired by [26, 27], we employ a vanilla transformer for unconditional GS modeling. The transformer acts as a densely connected Graph Neural Network with implicit edges realized by multi-head attention. It is possibly more effective at handling this informative unstructured representation than other point cloud learning architectures, such as PVCNN [21], as discussed in Sec. 5.3. We choose not to include positional encoding, as our explicit GS representation already incorporates positional information. This design makes our architecture versatile, ideally accommodating an arbitrary number of GS points. For training efficiency, we keep the number of points fixed at 1024 for category-specific experiments on CO3D [30]. We also explore the scaling-up performance of our GS diffusion transformer by training a single model on a relatively general object set, OmniObject3D [42], with results shown in Fig. 11.
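The effect of dropping positional encoding can be illustrated with a single attention head written in plain Python. This is a toy sketch rather than the paper's network: the weight matrices are hypothetical, and the point is only that the output for each token depends on the set of tokens, not on their order or count.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over N tokens, with no positional encoding."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Q[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in K] for qi in Q]
    weights = [softmax(r) for r in scores]  # implicit, dense "edges" between tokens
    return matmul(weights, V)
```

Permuting the input rows permutes the output rows in the same way, and the same weights apply to any number of rows, which matches the treatment of GS ellipsoids as an unordered set.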

4.2 View-Guided Sampling

Taking inspiration from the projection conditioning in [23], we regard fine-grained conditioning as a more faithful approach than global conditioning with an image encoder. However, empirical experiments reveal that the naive point projection method fails to effectively convey features from photographs to the informative GS, as discussed in Sec. 5.3.

Since GS features in the denoising process can be seamlessly projected to the image space through the splatting function, the reverse operation can backpropagate fine-grained view information to the GS space through gradients. This observation motivates us to adopt loss-guided sampling for the diffusion model [13, 34].

In conditional generation, we may want to draw samples $x_{0}$ from the prior distribution subject to certain conditions $y$ . For diffusion models, the conditional score at time $t$ can be obtained via Bayes’ rule:

	$\displaystyle\nabla_{x_{t}}\log p_{t}(x_{t}|y)=\nabla_{x_{t}}\log p_{t}(x_{t})+\nabla_{x_{t}}\log p_{t}(y|x_{t}),$		(4)

where the first term is the unconditional score function $\nabla_{x_{t}}\log p_{\theta}(x_{t})$ learned via the denoising-diffusion objective (Eq. 3). For the second term, the naive solution is to train a classifier $p_{\phi}(y|x_{t})$ on paired data $(y,x_{t})$ that models this conditional distribution, i.e. classifier guidance [8]. However, a labeled dataset of noisy samples is not always available, nor is this approach flexible. Diffusion Posterior Sampling (DPS) [7] instead approximates $p_{t}(y|x_{t})$ by $p_{t}(y|\hat{x}_{0})$, assuming $p(y|x_{0})$ is given, where $\hat{x}_{0}$ is essentially a point estimate from the denoiser $p_{\theta}$ in our case. Reconstruction guidance [13] simplifies this approximation by assuming $p(y|x_{0})$ is Gaussian, so that $p_{t}(y|x_{t})$ becomes $\mathcal{N}\left[p_{\theta}(x_{t}),\left(\bar{\beta}/(1-\bar{\beta})\right)\mathbf{I}\right]$ (Eq. 6). Loss guidance [34] generalizes this method to the more common setting where a differentiable loss function $\ell_{y}$ replaces the MSE estimate, as follows:

$\displaystyle\text{DPS}(x_{t},y):$	$\displaystyle=\nabla_{x_{t}}\log p_{t}(y|\hat{x}_{0}),$	(5)
	$\displaystyle=\nabla_{x_{t}}-\frac{1-\bar{\beta}_{t}}{2\bar{\beta}_{t}}\|x_{0}-\hat{x}_{0}\|^{2},$	(6)
	$\displaystyle=\nabla_{x_{t}}-\frac{1-\bar{\beta}_{t}}{2\bar{\beta}_{t}}\ell_{y}(\hat{x}_{0}).$	(7)

Since we only have access to the noiseless input 2D image $y_{0}$, we use the approximator $p_{\theta}(\mathbf{x}_{0};\mathbf{x}_{t},t)$ and the splatting function $f_{splat}$ (Eq. 1) to compute $\hat{x}_{0}$ and $\hat{y}_{0}$, forming a differentiable loss function for Eq. 7. The gradients w.r.t. $\mathbf{x}_{t}$ are then approximated as:

	grad	$\displaystyle\leftarrow\nabla_{\mathbf{x}_{t}}-\frac{1-\bar{\beta}_{t}}{2\bar{\beta}_{t}}(\mathcal{L}_{\text{img}}\circ f_{splat})\left(x_{0},\hat{x}_{0}\right),$		(8)
		$\displaystyle\leftarrow\nabla_{\mathbf{x}_{t}}-\frac{1-\bar{\beta}_{t}}{2\bar{\beta}_{t}}\mathcal{L}_{\text{img}}\left[y_{0},f_{splat}\left(p_{\theta}(\mathbf{x}_{0};\mathbf{x}_{t},t)\right)\right].$		(9)

where the view-point camera P is omitted in the splatting function $f_{splat}(y;x,\textbf{P})$ for simplicity. The guidance gradients then bias the unconditional score prediction by

	$\displaystyle\tilde{\mathbf{x}}_{t}\leftarrow\hat{\mathbf{x}}_{t}+\lambda_{\text{gd}}\frac{\bar{\beta}}{\sqrt{1-\bar{\beta}}}\textit{grad},$		(10)

where $\lambda_{\text{gd}}$ is empirically a large weighting factor [13]. We further perform predictor-corrector sampling [35] during the denoising approximation by inserting extra Langevin correction steps between DDIM steps [33].
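One guided step following Eqs. 9 and 10 can be sketched as below. The sketch is deliberately toy-sized: `denoise`, `render` and `image_loss` are hypothetical callables standing in for $p_{\theta}$, $f_{splat}$ and $\mathcal{L}_{\text{img}}$, and the gradient is taken by central finite differences instead of backpropagation through the renderer, which a real implementation would use.

```python
import math


def numerical_grad(fn, x, eps=1e-5):
    """Central finite differences; a stand-in for autodiff through denoiser and renderer."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((fn(hi) - fn(lo)) / (2.0 * eps))
    return grad


def guidance_grad(x_t, y0, denoise, render, image_loss, bar_beta_t):
    """Eq. 9: gradient w.r.t. x_t of -(1 - bar_beta_t) / (2 bar_beta_t) * L_img(y0, f_splat(x0_hat))."""
    weight = (1.0 - bar_beta_t) / (2.0 * bar_beta_t)

    def objective(x):
        return -weight * image_loss(y0, render(denoise(x)))

    return numerical_grad(objective, x_t)


def guided_update(x_hat, grad, bar_beta_t, lambda_gd):
    """Eq. 10: shift the unguided estimate along the guidance gradient."""
    step = lambda_gd * bar_beta_t / math.sqrt(1.0 - bar_beta_t)
    return [v + step * g for v, g in zip(x_hat, grad)]
```

Because the objective is the negated loss, adding the scaled gradient moves the estimate toward GS parameters whose rendering is closer to the input view.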

4.3 Polishing and Re-using

Experiments from concurrent research on GS [16, 38, 5, 36] indicate that needle-shaped artifacts may occur in NVS when the available views are insufficient. We also observe that a mild perturbation in the GS space leads to pronounced needle-shaped artifacts in 2D views, even while the 3D geometry remains relatively satisfactory. We therefore hypothesize that a domain gap exists between GS-rendered views and real images. The 3D modeling process may struggle to assign adequate importance to features that are crucial for 2D appearance, such as covariance.

Building on this hypothesis, we construct an auxiliary 2D diffusion model that takes imperfect GS-rendered images as the condition and generates clean, photorealistic images. To further enhance the view rendering quality of the GS diffusion reconstruction, we polish and re-use the rendered and refined images by iteratively performing GS diffusion and 2D diffusion, as depicted in Fig. 3 (d).
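The alternation between the two diffusion models can be written as a short control loop. This is one plausible reading of the procedure, not a confirmed implementation: the three callables (`reconstruct`, `render_views`, `refine_2d`) are hypothetical placeholders, and feeding the polished renderings back as additional guidance views is an assumption about what "re-using" means here.

```python
def polish_and_reuse(input_views, reconstruct, render_views, refine_2d, num_iters=3):
    """Alternate 3D reconstruction with 2D refinement of the rendered views.

    reconstruct(views)  -> GS object guided by the given views (stands in for GS diffusion)
    render_views(gs)    -> list of rendered images at chosen cameras
    refine_2d(image)    -> polished image (stands in for the auxiliary 2D diffusion model)
    """
    gs = reconstruct(list(input_views))
    for _ in range(num_iters):
        rendered = render_views(gs)
        polished = [refine_2d(img) for img in rendered]
        # Re-use: the polished renderings join the real input views as guidance.
        gs = reconstruct(list(input_views) + polished)
    return gs
```

The original input views are kept in every round so that the polished images supplement, rather than replace, the real observation.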

5 Experiments

5.1 Experimental Setup

Dataset. We conduct experiments on CO3Dv2 [30], an unconstrained multi-view dataset of real-world objects with point cloud annotation. The dataset is extraordinarily challenging [51, 23, 41] since it is captured in the wild without any coordinate calibration, which is closer to everyday application conditions. We use the dataset-split annotation from fewview-dev for training and evaluation. We show results for the core-ten categories: hydrant, bench, donut, teddy bear, apple, vase, plant, suitcase, ball and cake. For scaling-up performance, we additionally illustrate qualitative results on general objects with a model trained on OmniObject3D [42], which comprises 6,000 scanned objects in 190 daily categories.

Baselines. We compare our approach against the current state-of-the-art methods: NerFormer [30], ViewFormer [17] and SparseFusion [51]. We re-train NerFormer on each category using its official implementation. For ViewFormer, which is trained across categories, we use the authors' checkpoint for all categories of CO3Dv2. We compare against SparseFusion only for reconstruction from two views, since its design does not support the single-view setting. We use the category-specific model provided by the authors. For comparison in 3D geometry, we use the officially released hydrant checkpoint from PC² [23].

Metrics. Following prior works, we report standard image metrics, PSNR, SSIM, and LPIPS, which cover different aspects of image quality for evaluation in 2D views. For 3D geometry, we measure F-score@0.01 [39] and Chamfer Distance. F-score@0.01 evaluates precision and recall with a threshold of 0.01. A reconstructed point whose nearest distance to the ground-truth point cloud is below the threshold is counted as a correct prediction.
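The two geometry metrics can be sketched with brute-force nearest-neighbour search. The F-score follows the thresholded precision and recall described above. For Chamfer Distance several conventions exist (squared or unsquared distances, sum or mean), and the symmetric mean of unsquared distances used here is an assumption, not necessarily the variant behind the reported numbers.

```python
import math


def nearest_dist(p, cloud):
    """Distance from point p to its nearest neighbour in cloud (brute force)."""
    return min(math.dist(p, q) for q in cloud)


def f_score(pred, gt, threshold=0.01):
    """Harmonic mean of precision (pred -> gt) and recall (gt -> pred) at the threshold."""
    precision = sum(nearest_dist(p, gt) < threshold for p in pred) / len(pred)
    recall = sum(nearest_dist(q, pred) < threshold for q in gt) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def chamfer_distance(pred, gt):
    """Symmetric mean nearest-neighbour distance (one common convention)."""
    forward = sum(nearest_dist(p, gt) for p in pred) / len(pred)
    backward = sum(nearest_dist(q, pred) for q in gt) / len(gt)
    return forward + backward
```

Both functions assume the point clouds have already been brought to a common scale, since a fixed threshold such as 0.01 is only meaningful for normalized coordinates.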

Implementation Details. For the diffusion models, we schedule 1000 steps for GS and 500 steps for 2D, both set to predict the clean sample at each step. We build category-specific transformer encoders for the GS denoiser $p_{\theta}$, each with 19.6M parameters. The models are trained with the number of GS points fixed at 1024, for 200k iterations. We mask out the points with outlier scaling features to stabilize training. We also find that L1 loss performs better than MSE in Eq. 3 for GS. For the 2D diffusion model, we build a plain UNet denoiser following common settings with FP16 and an input size of 256x256, trained for 200k iterations. Unless otherwise specified, we use AdamW with default parameters and a learning rate of 0.0001 for optimization.

We take 100 and 25 DDIM steps in the generative sampling process for GS diffusion and 2D diffusion, respectively. We use three polishing and re-using iterations for single-view reconstruction and two for two-view reconstruction. We adopt multi-view GS refinement initialized from our reconstructed GS, with 2D refined views. We jointly perform this regression and 2D view refinement in an iterative improvement manner for two iterations in the single-view setting. Reconstructing a single instance takes around 3 minutes on an A100 GPU, depending on the number of iterations. The object represented by GS is obtained as the final output.
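Sampling with 100 (or 25) DDIM steps from a model trained with 1000 (or 500) steps amounts to visiting a strided subset of the training timesteps. The sketch below shows one common way to pick that subset, together with the deterministic DDIM update for a denoiser that predicts the clean sample. The uniform stride and the zero-noise setting are assumptions, and $\bar{\beta}_{t}$ is again read as the cumulative noise variance.

```python
import math


def ddim_timesteps(num_train_steps, num_sample_steps):
    """Evenly strided subset of the training timesteps, in descending order."""
    stride = num_train_steps // num_sample_steps
    return list(range(num_train_steps - 1, -1, -stride))[:num_sample_steps]


def ddim_step(x_t, x0_hat, bar_beta_t, bar_beta_prev):
    """Deterministic DDIM update for a clean-sample predictor.

    The noise implied by (x_t, x0_hat) is recovered first, then re-applied
    at the lower noise level bar_beta_prev.
    """
    out = []
    for xt, x0 in zip(x_t, x0_hat):
        eps_hat = (xt - math.sqrt(1.0 - bar_beta_t) * x0) / math.sqrt(bar_beta_t)
        out.append(math.sqrt(1.0 - bar_beta_prev) * x0 + math.sqrt(bar_beta_prev) * eps_hat)
    return out
```

With `bar_beta_prev` set to zero the update returns the predicted clean sample itself, which is the final step of the sampler.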

5.2 Reconstruction on Real-World Images

We present category-specific reconstruction results from a single viewpoint for objects such as hydrant, donut, teddy bear, and bench. These categories vary in structural complexity, scale, and capture environment. For each instance, we adopt an experimental setting similar to [51]: we load 32 linearly spaced views, randomly select 1 as the input view, and assess the performance on the remaining 31 unseen views.
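The view protocol above is simple enough to sketch directly. The helper names are hypothetical, and treating "linearly spaced" as evenly spaced frame indices over the captured sequence is an assumption about how the views are chosen.

```python
import random


def linearly_spaced_indices(num_frames, num_views=32):
    """Pick num_views frame indices spread evenly over a sequence of num_frames frames."""
    if num_views == 1:
        return [0]
    return [round(i * (num_frames - 1) / (num_views - 1)) for i in range(num_views)]


def split_views(view_ids, num_inputs=1, seed=0):
    """Randomly choose the input view(s); all remaining views are held out for evaluation."""
    rng = random.Random(seed)
    inputs = rng.sample(view_ids, num_inputs)
    targets = [v for v in view_ids if v not in inputs]
    return inputs, targets
```

Setting `num_inputs=2` gives the corresponding split for the two-view experiments, with 30 held-out views.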

Methods	Hydrant		Bench		Donut		Teddy bear		Apple		Vase		Plant		Suitcase		Ball		Cake		All
Methods	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	SSIM
NerFormer [30]	16.9	0.33	15.4	0.46	18.0	0.37	13.1	0.46	18.2	0.36	16.9	0.35	16.4	0.46	19.5	0.41	16.1	0.37	16.2	0.47	16.67	0.404	0.562
ViewFormer [17]	16.6	0.24	15.7	0.34	17.1	0.37	12.9	0.34	19.1	0.30	17.9	0.26	15.9	0.37	20.2	0.30	16.5	0.33	16.8	0.36	16.87	0.321	0.625
Ours w/o iter	18.7	0.25	15.7	0.32	18.5	0.33	16.3	0.35	18.6	0.30	19.4	0.23	17.4	0.33	21.0	0.29	17.1	0.31	18.0	0.32	18.07	0.303	0.679
Ours	19.7	0.20	16.2	0.31	18.9	0.32	16.8	0.31	19.2	0.27	20.1	0.22	17.4	0.30	20.4	0.31	17.7	0.30	18.5	0.34	18.49	0.288	0.696

Table 1: Quantitative comparison of view quality for single-view reconstruction. We report PSNR $\uparrow$ and LPIPS $\downarrow$ averaged across the testing set. Bold face indicates the best result, while underscore refers to the second best. “iter” refers to our iterative polishing and re-using strategy.

Methods	Hydrant		Bench		Donut		Teddy bear		Apple †		Vase †		Plant †		Suitcase †		Ball †		Cake †
Methods	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS
NerFormer [30]	18.2	0.30	15.9	0.43	20.2	0.34	15.8	0.44	19.5	0.33	17.7	0.34	17.8	0.45	20.0	0.39	16.8	0.35	16.9	0.44
ViewFormer [17]	17.5	0.16	16.4	0.30	18.6	0.24	15.6	0.33	20.1	0.26	20.4	0.21	17.8	0.31	21.0	0.26	18.3	0.31	17.3	0.33
SparseFusion [51]	22.3	0.16	16.7	0.29	22.8	0.22	20.6	0.24	22.8	0.20	22.8	0.18	20.0	0.25	22.2	0.22	22.4	0.22	20.8	0.28
Ours	22.6	0.15	18.4	0.28	22.7	0.22	20.7	0.22	23.0	0.18	22.8	0.16	19.0	0.24	22.8	0.21	22.2	0.20	20.9	0.28

Table 2: Quantitative comparison of view quality for reconstruction from two views on the core-10 categories. We follow the experiment setting from [51], and report PSNR $\uparrow$ and LPIPS $\downarrow$ averaged across the first ten scenes from the testing set. Baseline results marked with ‘†’ are reported by [51].

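Methods	Hydrant
Methods	F-score@0.01 $\uparrow$	Chamfer Distance $\downarrow$
PC² [23]	0.185	0.073
Ours	0.191	0.068
Ours w/o iter	0.180	0.067

Table 3: Quantitative comparison of 3D geometry accuracy on the hydrant category.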

Table 1 shows that our approach outperforms the other methods on all averaged view synthesis metrics (PSNR, LPIPS and SSIM). The performance is better illustrated in the qualitative comparison presented in Figure 4. The baselines either produce blurry views or struggle to maintain view consistency. Our reconstruction faithfully keeps the content from the given view and generalizes to the whole 3D object to generate geometrically consistent views.

To further demonstrate the flexibility and efficacy of our view-guided sampling strategy, we additionally report results for reconstruction from two views, quantitatively in Table 2 and qualitatively in Figure 6. By perceiving more views of the object, our approach achieves improved results that are comparable to the current state-of-the-art, SparseFusion, whereas SparseFusion takes much longer because it relies on SDS to extract the 3D scene.

For 3D geometry accuracy, we show a quantitative comparison in Table 3. Since the scenes are highly unconstrained, we normalize each object with its ground truth bounding box to evaluate it at unit scale. Although our model takes realistic view synthesis into account in addition to 3D geometry accuracy, it outperforms the state-of-the-art approach PC² [23], which concentrates on point cloud reconstruction. The qualitative results in Figure 7 further reveal the strength of our approach. PC² tends to reconstruct the object in a category-mean shape, so it struggles when the target deviates further from the center of the dataset distribution. We argue that this mainly results from how it adds conditions to the diffusion model, which we discuss in Sec. 5.3. Although our results may contain sparse outlier points around the object due to the nature of GS, we achieve a better fit to the 3D shape of various instances.

5.3 Ablation Studies

Effect of View-Guided Sampling. Fig. 8 compares the different approaches to add conditions to the diffusion model in this task. Both results are rendered from purely GS reconstruction without iterative polishing and re-using process. For the forward projection method, we adopt a fine-grained feature extraction module similar to the point projection condition in PC² [23] to train a conditional diffusion model. We utilize classifier-free guidance at inference time to reconstruct the object from the given view. However, empirical results from Fig. 7 and Fig. 8 both suggest that this adding-conditions strategy is not as effective as our proposal. Concurrent studies [9] also argue that this point projection-based strategy is unstable for relatively special cases.

Methods	Hydrant
Methods	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	Time
ProjCond	15.10	0.519	0.407	0.12s
Ours	18.61	0.783	0.265	0.17s

Choice of backbone for GS modeling. We compare the transformer we use with PVCNN, a currently popular backbone for unstructured data learning, used by PC² [23] and LION [49]. To examine how the model learns the distribution from the dataset, Fig. 10 shows the unconditional generation results with and without our iterative polishing and re-using strategy. Both results support that the transformer learns the GS dataset distribution better. This is possibly because the transformer organizes edges implicitly while PVCNN uses the explicit point position, which is not suitable for GS features since the covariance vectors also contain spatial information in addition to positions.

Effect of Iterative polishing and re-using process. Quantitative results in Tab. 1 suggest the efficacy of our iterative polishing and re-using strategy in improving the view rendering quality of the reconstructed GS. This is also supported by Fig. 10 and Fig. 10. The green square highlights the geometry consistency, which is inherently the strength of modeling in 3D space. The comparable view quality in these paired results supports that our iterative polishing and re-using process improves the view quality of the reconstructed GS with the assistance of 2D diffusion. On the other hand, modeling in 3D inherently enhances the view consistency for novel view synthesis.

5.4 Limitations

The primary limitation of our model is the need for GS ground truth for training. This limits us from scaling up our model to a common image dataset or working on a generic object reconstruction scenario. We adopted a constrained densification to obtain GS ground truth, which we have empirically found to be relatively efficient while still leaving considerable room for exploration.

6 Conclusion

We proposed GSD, a generative approach to real-world object reconstruction from a single image that builds a Diffusion Transformer upon Gaussian Splatting. We make use of the splatting function for efficient fine-grained 2D feature perception with view-guided sampling. The proposed method has shown superior performance in category-specific reconstruction tasks. Thanks to the DiT and the fine-grained conditioning mechanism, GSD exhibits the potential to scale up (Fig. 11), which could pave the way toward achieving photo-realistic performance in generic object reconstruction tasks.

References

[1] Cao, A., Johnson, J.: Hexplane: A fast representation for dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 130–141 (2023)
[2] Cao, A., Rockwell, C., Johnson, J.: Fwd: Real-time novel view synthesis with forward warping and depth. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15713–15724 (2022)
[3] Chan, E.R., Nagano, K., Chan, M.A., Bergman, A.W., Park, J.J., Levy, A., Aittala, M., De Mello, S., Karras, T., Wetzstein, G.: Generative novel view synthesis with 3d-aware diffusion models. arXiv preprint arXiv:2304.02602 (2023)
[4] Chen, H., Gu, J., Chen, A., Tian, W., Tu, Z., Liu, L., Su, H.: Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. arXiv preprint arXiv:2304.06714 (2023)
[5] Chen, Z., Wang, F., Liu, H.: Text-to-3d using gaussian splatting. arXiv preprint arXiv:2309.16585 (2023)
[6] Chibane, J., Alldieck, T., Pons-Moll, G.: Implicit functions in feature space for 3d shape reconstruction and completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6970–6981 (2020), https://openaccess.thecvf.com/content_CVPR_2020/html/Chibane_Implicit_Functions_in_Feature_Space_for_3D_Shape_Reconstruction_and_CVPR_2020_paper.html
[7] Chung, H., Kim, J., Mccann, M.T., Klasky, M.L., Ye, J.C.: Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687 (2022)
[8] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794 (2021)
[9] Di, Y., Zhang, C., Wang, P., Zhai, G., Zhang, R., Manhardt, F., Busam, B., Ji, X., Tombari, F.: Ccd-3dr: Consistent conditioning in diffusion for single-image 3d reconstruction. arXiv preprint arXiv:2308.07837 (2023)
[10] Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3d object reconstruction from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 605–613 (2017), https://openaccess.thecvf.com/content_cvpr_2017/html/Fan_A_Point_Set_CVPR_2017_paper.html
[11] Gao, J., Chen, W., Xiang, T., Jacobson, A., McGuire, M., Fidler, S.: Learning deformable tetrahedral meshes for 3d reconstruction. In: Advances in Neural Information Processing Systems. vol. 33, pp. 9936–9947. Curran Associates, Inc. (2020), https://proceedings.neurips.cc//paper/2020/hash/7137debd45ae4d0ab9aa953017286b20-Abstract.html
[12] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
[13] Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models (2022)
[14] Jang, W., Agapito, L.: Codenerf: Disentangled neural radiance fields for object categories. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12949–12958 (2021)
[15] Jun, H., Nichol, A.: Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463 (2023)
[16] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG) 42(4), 1–14 (2023)
[17] Kulhánek, J., Derner, E., Sattler, T., Babuška, R.: Viewformer: Nerf-free neural rendering from few images using transformers. In: European Conference on Computer Vision. pp. 198–216. Springer (2022)
[18] Li, K., Pham, T., Zhan, H., Reid, I.: Efficient dense point cloud object reconstruction using deformation vector fields. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 497–513 (2018), https://openaccess.thecvf.com/content_ECCV_2018/html/Kejie_Li_Efficient_Dense_Point_ECCV_2018_paper.html
[19] Lin, C.H., Kong, C., Lucey, S.: Learning efficient point cloud generation for dense 3d object reconstruction. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32 (2018)
[20] Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot one image to 3d object. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9298–9309 (2023)
[21] Liu, Z., Tang, H., Lin, Y., Han, S.: Point-voxel cnn for efficient 3d deep learning. Advances in Neural Information Processing Systems 32 (2019)
[22] Long, X., Guo, Y.C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.H., Habermann, M., Theobalt, C., et al.: Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008 (2023)
[23] Melas-Kyriazi, L., Rupprecht, C., Vedaldi, A.: $PC^2$: Projection-conditioned point cloud diffusion for single-image 3d reconstruction (2023), http://arxiv.org/abs/2302.10668
[24] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)
[25] Müller, N., Siddiqui, Y., Porzi, L., Bulo, S.R., Kontschieder, P., Nießner, M.: Diffrf: Rendering-guided 3d radiance field diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4328–4338 (2023)
[26] Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., Chen, M.: Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751 (2022)
[27] Peebles, W., Xie, S.: Scalable diffusion models with transformers (2023)
[28] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. In: The Eleventh International Conference on Learning Representations (2022)
[29] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1(2), 3 (2022)
[30] Reizenstein, J., Shapovalov, R., Henzler, P., Sbordone, L., Labatut, P., Novotny, D.: Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10901–10911 (2021)
[31] Rombach, R., Esser, P., Ommer, B.: Geometry-free view synthesis: Transformers and no 3d priors. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14356–14366 (2021)
[32] Shue, J.R., Chan, E.R., Po, R., Ankner, Z., Wu, J., Wetzstein, G.: 3d neural field generation using triplane diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20875–20886 (2023)
[33] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
[34] Song, J., Zhang, Q., Yin, H., Mardani, M., Liu, M.Y., Kautz, J., Chen, Y., Vahdat, A.: Loss-guided diffusion models for plug-and-play controllable generation (2023)
[35] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
[36] Szymanowicz, S., Rupprecht, C., Vedaldi, A.: Splatter image: Ultra-fast single-view 3d reconstruction. arXiv preprint arXiv:2312.13150 (2023)
[37] Tang, J., Han, X., Tan, M., Tong, X., Jia, K.: Skeletonnet: A topology-preserving solution for learning mesh reconstruction of object surfaces from rgb images. IEEE transactions on pattern analysis and machine intelligence 44(10), 6454–6471 (2021)
[38] Tang, J., Ren, J., Zhou, H., Liu, Z., Zeng, G.: Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653 (2023)
[39] Tatarchenko, M., Richter, S.R., Ranftl, R., Li, Z., Koltun, V., Brox, T.: What do single-view 3d reconstruction networks learn? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3405–3414 (2019)
[40] Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., Jiang, Y.G.: Pixel2mesh: Generating 3d mesh models from single RGB images. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 52–67 (2018), https://openaccess.thecvf.com/content_ECCV_2018/html/Nanyang_Wang_Pixel2Mesh_Generating_3D_ECCV_2018_paper.html
[41] Watson, D., Chan, W., Martin-Brualla, R., Ho, J., Tagliasacchi, A., Norouzi, M.: Novel view synthesis with diffusion models. arXiv preprint arXiv:2210.04628 (2022)
[42] Wu, T., Zhang, J., Fu, X., Wang, Y., Ren, J., Pan, L., Wu, W., Yang, L., Wang, J., Qian, C., et al.: Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 803–814 (2023)
[43] Xie, H., Yao, H., Zhang, S., Zhou, S., Sun, W.: Pix2vox++: Multi-scale context-aware 3d object reconstruction from single and multiple images. International Journal of Computer Vision 128(12), 2919–2935 (2020)
[44] Xing, Z., Chen, Y., Ling, Z., Zhou, X., Xiang, Y.: Few-shot single-view 3d reconstruction with memory prior contrastive network. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol. 13661, pp. 55–70. Springer Nature Switzerland (2022). https://doi.org/10.1007/978-3-031-19769-7_4
[45] Xu, D., Yuan, Y., Mardani, M., Liu, S., Song, J., Wang, Z., Vahdat, A.: Agg: Amortized generative 3d gaussians for single image to 3d (2024)
[46] Yang, S., Xu, M., Xie, H., Perry, S., Xia, J.: Single-view 3d object reconstruction from shape priors in memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3152–3161 (2021), https://openaccess.thecvf.com/content/CVPR2021/html/Yang_Single-View_3D_Object_Reconstruction_From_Shape_Priors_in_Memory_CVPR_2021_paper.html
[47] Yifan, W., Serena, F., Wu, S., Öztireli, C., Sorkine-Hornung, O.: Differentiable surface splatting for point-based geometry processing. ACM Transactions on Graphics (TOG) 38(6), 1–14 (2019)
[48] Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelnerf: Neural radiance fields from one or few images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4578–4587 (2021)
[49] Zeng, X., Vahdat, A., Williams, F., Gojcic, Z., Litany, O., Fidler, S., Kreis, K.: Lion: Latent point diffusion models for 3d shape generation. arXiv preprint arXiv:2210.06978 (2022)
[50] Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5745–5753 (2019)
[51] Zhou, Z., Tulsiani, S.: Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12588–12597 (2023)
[52] Zou, Z.X., Yu, Z., Guo, Y.C., Li, Y., Liang, D., Cao, Y.P., Zhang, S.H.: Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers (2023)
[53] Zwicker, M., Pfister, H., Van Baar, J., Gross, M.: Ewa splatting. IEEE Transactions on Visualization and Computer Graphics 8(3), 223–238 (2002)

Methods	F-score (Hydrant)	ChamferDist (Hydrant)
PC² [23]	0.185	0.073
Ours	0.191	0.068
Ours w/o iter	0.180	0.067
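
For readers less familiar with the two metrics in the table above, the following is a minimal sketch of how the Chamfer distance and F-score between a predicted and a ground-truth point set are commonly computed. It is not the evaluation code used for these results: the function names, the use of unsquared Euclidean distances, the symmetric sum of the two directional means, and the default `threshold` value are all assumptions made for illustration.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between every point in a (N, 3) and b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=-1)  # shape (N, M)


def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance (assumed convention: sum of the two
    mean nearest-neighbor distances, using unsquared distances)."""
    d = pairwise_distances(pred, gt)
    pred_to_gt = d.min(axis=1).mean()  # each predicted point to nearest GT point
    gt_to_pred = d.min(axis=0).mean()  # each GT point to nearest predicted point
    return pred_to_gt + gt_to_pred


def f_score(pred, gt, threshold=0.01):
    """F-score at a distance threshold (the threshold is a hypothetical default)."""
    d = pairwise_distances(pred, gt)
    precision = (d.min(axis=1) < threshold).mean()  # predicted points near GT
    recall = (d.min(axis=0) < threshold).mean()     # GT points near prediction
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under these conventions a higher F-score and a lower Chamfer distance both indicate a closer match. Absolute values depend on the threshold, the point sampling density, and the scene normalization, so numbers from this sketch are not directly comparable to the table.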