¹ University of Alberta, Edmonton AB T6G 2R3, Canada
Email: {ymu3, lcheng5}@ualberta.ca
² Noah’s Ark Lab, Huawei Canada, Markham ON L3R 5A4, Canada

GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction

Yuxuan Mu¹ (ORCID 0000-0001-7132-3155), Xinxin Zuo² (ORCID 0000-0002-7116-9634), Chuan Guo¹ (ORCID 0000-0002-4539-0634), Yilin Wang¹, Juwei Lu², Xiaofeng Wu², Songcen Xu², Peng Dai², Youliang Yan², Li Cheng¹ (ORCID 0000-0003-3261-3533)

Abstract

We present GSD, a diffusion model approach based on Gaussian Splatting (GS) representation for 3D object reconstruction from a single view. Prior works suffer from inconsistent 3D geometry or mediocre rendering quality due to improper representations. We take a step towards resolving these shortcomings by utilizing the recent state-of-the-art 3D explicit representation, Gaussian Splatting, and an unconditional diffusion model. This model learns to generate 3D objects represented by sets of GS ellipsoids. With these strong generative 3D priors, though learning unconditionally, the diffusion model is ready for view-guided reconstruction without further model fine-tuning. This is achieved by propagating fine-grained 2D features through the efficient yet flexible splatting function and the guided denoising sampling process. In addition, a 2D diffusion model is further employed to enhance rendering fidelity, and improve reconstructed GS quality by polishing and re-using the rendered images. The final reconstructed objects explicitly come with high-quality 3D structure and texture, and can be efficiently rendered in arbitrary views. Experiments on the challenging real-world CO3D dataset demonstrate the superiority of our approach.

Keywords:
Object Reconstruction, Gaussian Splatting, Guided Diffusion Model

1 Introduction

Given the abundance of image data in the real world, the problem of 3D reconstruction from single-view images has garnered notable attention [40, 43, 10, 48]. While humans can effortlessly deduce the general shape of an object and even imagine its texture from unseen views, the problem is highly non-trivial for computational models. As described next, three key aspects underpin this problem. First, a proper 3D representation is needed that can encode high-fidelity 3D information while remaining compatible with various levels of quantization. Second, akin to the human perception system, it is crucial to have a generative model that can produce an object with diverse appearances for its unseen back side while remaining faithful to the input views. Finally, the 3D object must be rendered efficiently and precisely into arbitrary views.

Existing efforts often struggle to properly address one or more of these three aspects. For instance, recent 2D novel view synthesis methods [51, 3, 20, 17, 30, 41] usually fall short in maintaining 3D consistency. Another line of research [43, 44, 46, 10, 19, 18, 11, 23, 40], based on explicit 3D representations such as voxels, point clouds, and meshes, offers consistent 3D rendering but is limited to coarse geometry and suboptimal rendering quality. This may be attributed to the low resolution or sparse nature of the 3D features (e.g., points, voxels) and the challenges of their discretization (e.g., meshes) in deep learning models. Implicit 3D representations, on the other hand, formulate 3D space as a query-based implicit function [6, 24, 1, 25, 32], achieving remarkable quality in single-view 3D reconstruction in a well-defined canonical space [48, 4, 14]. Unfortunately, they typically rely on cumbersome procedures such as marching cubes for 3D geometry extraction and view rendering.

Motivated by these observations, we introduce a novel framework, GSD, for high-quality single-view 3D reconstruction by building a generative Diffusion Transformer (DiT) [27] upon the emerging Gaussian Splatting (GS) representation [16]. Specifically, GS encodes a scene as a set of GS ellipsoids, each parameterized by its center position, covariance, regional color, and opacity. Unlike existing 3D representations, GS explicitly encodes 3D geometry and texture in high resolution and density. Furthermore, because the representation is spatially explicit, we can easily deploy a point-space DiT even without positional encoding [26]. Unlike other diffusion models that use classifier-free conditioning with image encoders [51, 26, 22], our GS DiT enables efficient fine-grained image conditioning through its splatting-based rendering and loss-guided sampling [34], which also ensures fidelity to the given-view images.

Figure 1: An illustration of our View-Guided Gaussian Splatting Diffusion framework for single-view 3D reconstruction. It works by progressively denoising a randomly initialized set of Gaussian Splatting (GS) ellipsoids with continuous guidance from the discrepancies between the input and rendered images. The gray arrow represents the splatting-based GS rendering, while the orange arrow depicts the backpropagation of guidance gradients. The diffusion model, built directly upon the GS representation, provides explicit geometry information. The view-guided sampling takes advantage of the splatting function to faithfully yet efficiently obtain fine-grained features from the given view.

Upon the GS representation, a category-specific 3D DiT is trained to capture the space of plausible 3D objects in terms of their diverse geometries and textures. As illustrated in Fig. 2, without an input image, our diffusion model learns to generate high-fidelity 3D objects with distinct geometries and textures. When an input image is provided, the same diffusion model is used to reconstruct the specific 3D object so that it remains faithful when rendered to the same view. This process is presented in Fig. 1. At test time, the GS object at each denoising step is projected to the given view through differentiable splatting-based rendering. The gradients of the discrepancies between the rendered and reference images are then backpropagated to the corresponding GS samples to refine the 3D object at the current step, similar to classifier guidance [34]. This approach is simple yet effective, and can be easily adapted to multi-view reconstruction. In addition, an auxiliary 2D diffusion model is employed to further improve the quality of the rendered images, which in turn facilitates better 3D reconstruction.

Figure 2: Unconditional generation of GS DiT on the Hydrant dataset. Ten distinct samples are generated from our unconditional diffusion model, which is trained on more than 500 hydrant scenes using the GS representation. Our diffusion model shows an appealing ability to model the generative priors of 3D objects.

Our main contributions can be summarized as follows:

  • Our proposed GSD is, to our knowledge, the first diffusion model that directly models the raw GS representation and captures its 3D generative prior for single-view reconstruction.

  • The GS DiT naturally comes with an effective yet flexible view-guided sampling strategy that extracts fine-grained features from the given views using the efficient splatting function. Given an input image at test time, the guided iterative denoising of our GS-based diffusion model progressively refines the reconstructed 3D object so that it is consistent with the input view.

  • Experiments on the real-world CO3D dataset demonstrate the superiority of our approach over state-of-the-art methods. Our approach is also flexible enough to work with multi-view images.

2 Related Work

View-Conditioned 3D Reconstruction and Generation. Many related works reconstruct the 3D shape by jointly modeling the unconditional 3D priors and the conditional distribution with a generative model [15, 26, 49]. Those working on explicit 3D representations [49, 26, 23, 43] can recover explicit geometry but fail to synthesize photorealistic views. In the other stream of studies [15, 4, 25, 32, 1, 3], advanced implicit representations enable photo-quality view synthesis but struggle to extract accurate geometry in unconstrained space. We find that the emerging 3D representation Gaussian Splatting [16] has great potential as a generally suitable representation for this task, since it offers both explicit geometry and realistic view synthesis. Concurrent works using the GS representation, however, either recast the task as a deterministic prediction problem [36] or combine GS with a latent representation for additional feature decoration [52, 45]. Moreover, most previous works perceive the given view in camera space through an image encoder [26, 15, 37], which cannot ensure faithful reconstruction because of the compression and also requires canonical coordinates to constrain the modeling space. Such methods are better described as image-conditioned generation than as reconstruction from views. Inspired by the fine-grained projection method of PC2 [23], we take advantage of GS splatting-based rendering [16] to access the image through pixel-level gradients, which reliably preserve the view information and flexibly accommodate arbitrary views using relative camera parameters in world space. A similar gradient conditioning approach is also used in SSD NeRF [4], but it is restricted by the robustness of its dataset-specific neural rendering. Combining the universal rendering of GS with view-guided sampling on the GS diffusion model could more fully exploit the potential of these methods.

Novel View Synthesis. The novel view synthesis (NVS) task is becoming progressively closer to the 3D reconstruction task [51, 31, 17, 30]. One of its most significant pain points is 3D inconsistency, which stems from a lack of 3D geometry priors. To address this problem, some works incorporate multi-view geometry [51, 48], depth [2, 22], and cues from large multi-view 2D datasets [22, 20, 31, 41]. However, NVS primarily focuses on imaging geometry awareness rather than on modeling the prior distribution of 3D shapes. We argue that this 2D objective deviates from 3D reconstruction, potentially resulting in unsatisfactory 3D geometry.

SDS-based 3D Asset Creation. 3D asset creation has emerged thanks to the boom of large models. Most of these methods distill the 3D representation from a pre-trained image generation model by Score Distillation Sampling (SDS) [28, 38, 5]. One of their primary weaknesses lies in 3D geometry, even though they involve multi-view geometry to regress the 3D representation. These approaches rely heavily on the consistent performance of large pre-trained image models, in which views are treated as independent during pre-training. This can lead to the Janus problem. Another issue arises when applying these methods to our object reconstruction scenario: the 3D space should ideally be well constrained, but achieving such alignment in a real-world setting is challenging. Consequently, 3D asset creation approaches may not adapt easily to practical real-world object reconstruction.

3 Background: Gaussian Splatting

Gaussian Splatting [16] is an emerging method in the field of novel view synthesis and 3D reconstruction from multi-view images. In contrast to NeRF-style implicit representations [24], GS characterizes the scene using a set of anisotropic GS ellipsoids defined by their center positions $\mu \in \mathbb{R}^{3}$, covariance $\Sigma \in \mathbb{R}^{6}$, color $c \in \mathbb{R}^{3}$, and opacity $\alpha \in \mathbb{R}^{1}$. During rendering, the GS ellipsoids are projected onto the imaging plane and then allocated to individual tiles [53]. The color of pixel $\mathbf{p}$ on the image is given by typical point-based blending [47] as follows:

$$
\begin{aligned}
C(\mathbf{p}) &= \sum_{i \in \mathcal{N}} c_i \sigma_i \prod_{j=1}^{i-1} (1 - \sigma_j), \\
\text{where}\ \sigma_i &= \alpha_i e^{-\frac{1}{2}(\mathbf{p}-\mu_i)^{T} \Sigma_i^{-1} (\mathbf{p}-\mu_i)}.
\end{aligned}
\tag{1}
$$

Compared with NeRF’s importance sampling process, these GS points have the potential to cluster towards critical regions, thereby improving overall efficiency for expression and rendering.
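The blending rule in Eq. 1 can be illustrated with a small NumPy sketch. It is not the tile-based rasterizer of [16]: it assumes the ellipsoids have already been projected to 2D (so each covariance is a 2×2 matrix) and sorted from near to far, and the names `gaussian_weight` and `blend_pixel` are hypothetical.

```python
import numpy as np

def gaussian_weight(p, mu, cov, alpha):
    """sigma_i = alpha_i * exp(-0.5 (p - mu_i)^T Sigma_i^{-1} (p - mu_i)) for a 2D pixel p."""
    d = p - mu
    return alpha * np.exp(-0.5 * d @ np.linalg.solve(cov, d))

def blend_pixel(p, mus, covs, alphas, colors):
    """Front-to-back point-based blending; inputs are assumed sorted near to far."""
    color = np.zeros(3)
    transmittance = 1.0  # running product of (1 - sigma_j) over the Gaussians in front
    for mu, cov, alpha, c in zip(mus, covs, alphas, colors):
        sigma = gaussian_weight(p, mu, cov, alpha)
        color += c * sigma * transmittance
        transmittance *= 1.0 - sigma
    return color

# Two projected Gaussians centred on the pixel: a half-opaque red one in front of a blue one.
p = np.array([0.0, 0.0])
mus = np.array([[0.0, 0.0], [0.0, 0.0]])
covs = np.array([np.eye(2), np.eye(2)])
alphas = np.array([0.5, 1.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
pixel = blend_pixel(p, mus, covs, alphas, colors)  # red 0.5 + blue 1.0 * (1 - 0.5)
```

In this toy case the front Gaussian contributes half of the pixel color and lets half of the light through to the one behind it, which is why every operation stays differentiable with respect to position, covariance, opacity, and color.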

4 Our Approach

Section 4.1 explains how we formulate the GS diffusion model. In Section 4.2, we elaborate on how the GS splatting function is used to guide the denoising sampling process of GS DiT for view-based 3D reconstruction. Finally, Section 4.3 details the polishing and re-using process that runs the auxiliary 2D diffusion model jointly with GS DiT. See Figure 3 for an overview of our pipeline at inference.

Figure 3: Approach Overview. (a) An unconditional diffusion model is trained on objects represented by N GS ellipsoids (N=1024). After training, the GS ellipsoids of an object can be generated through $T$ denoising steps (Sec. 4.1). (b) For single-view reconstruction, we apply view-space loss guidance at each denoising step. The gray arrow represents the splatting-based GS rendering, while the orange arrow depicts the backpropagation of guidance gradients. The 3D GS object, rendered from the input view through the splatting function $f_{\text{splat}}$, is compared with the given image using $\mathcal{L}_{\text{img}}$, and the gradients are backpropagated to the diffusion model to adjust the sampling process (Sec. 4.2). (c) A 2D diffusion model is employed to enhance the fidelity of views rendered from the reconstructed GS $x_0$. (d) The refined synthetic view images are then re-used to improve GS reconstruction quality in an alternating, iterative manner (Sec. 4.3). We obtain the final reconstructed GS object $x_0$ from the last run of GS diffusion.

4.1 Modeling GS Generative Prior

Building upon the recent advancements in denoising diffusion probabilistic models (DDPM) [12], we formulate our GS dataset distribution modeling using a diffusion-based generative model.

Representing 3D Objects with GS. To prepare the dataset for training the GS diffusion model, we convert our dense-view image dataset into a dataset of GS using the 3D GS scene reconstruction method [16]. However, this dense-view regression-based approach regularizes the optimization process only through its densify-and-prune function on points. For feed-forward network modeling, in contrast, we aim to further regularize the feature distribution of GS and obtain a constant number of GS ellipsoids per scene. We therefore restrict the number of GS ellipsoids by densifying only those with the Top-K gradient values, where K is the difference between the pre-set maximum ellipsoid count and the current ellipsoid count. We observe that this constrained densification allows effective reconstruction of object data with only a 2% PSNR reduction compared to the full model, which has two orders of magnitude more GS ellipsoids.
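The budgeted selection described above can be sketched as follows. This is only an illustration of the Top-K rule, not the authors' implementation: the function name `select_for_densification` and the use of a per-ellipsoid gradient norm as the ranking score are assumptions.

```python
import numpy as np

def select_for_densification(grad_norms, current_count, max_count):
    """Return indices of the ellipsoids to densify under a fixed ellipsoid budget."""
    k = max(max_count - current_count, 0)   # remaining budget K
    k = min(k, grad_norms.shape[0])
    if k == 0:
        return np.empty(0, dtype=np.int64)
    # argpartition finds the K largest scores without sorting the whole array
    top = np.argpartition(-grad_norms, k - 1)[:k]
    # order the selected indices from largest to smallest score
    return top[np.argsort(-grad_norms[top])]

grad_norms = np.array([0.1, 0.9, 0.3, 0.7, 0.05])
chosen = select_for_densification(grad_norms, current_count=5, max_count=7)
```

Because K shrinks to zero as the count approaches the maximum, the number of ellipsoids can never exceed the preset budget, which is what gives every scene the same tensor shape for feed-forward training.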

Training a Diffusion Model on GS. In general, a diffusion model takes Gaussian noise as input and progressively denoises it in $T$ steps. It learns strong data priors from the denoising-diffusion process [49]. In our framework, the diffusion model operates on GS ellipsoids $\mathbf{x} \in \mathbb{R}^{16}$, with features including position $\mathbb{R}^{3}$, scale of covariance $\mathbb{R}^{3}$, rotation of covariance $\mathbb{R}^{6}$ [50], opacity $\mathbb{R}^{1}$, and color $\mathbb{R}^{3}$. In our GS DDPM, $\mathbf{x}_T \sim \mathcal{N}(\mathbf{x}_T; 0, \mathbf{I})$ denotes the purely noisy GS ellipsoids, and $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ is a data point sampled from the data distribution, which in practice is our GS dataset.

During training, the diffusion process is described as

$$
q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}\right),
\tag{2}
$$

which formulates a Markov process with a variance schedule $\{\beta_t\}_{t=0}^{T}$ that gradually adds Gaussian noise to the GS. The training objective is to learn the reverse denoising process with a neural approximator $p_\theta(\mathbf{x}_0; \mathbf{x}_t, t)$ following [29], given by

$$
\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{\mathbf{x}_0 \sim q(\mathbf{x}_0),\, t \sim [1, T]} \left[ \left\| \mathbf{x}_0 - p_\theta(\mathbf{x}_0; \mathbf{x}_t, t) \right\|^2 \right].
\tag{3}
$$
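A minimal NumPy sketch of the noising step and the $\mathbf{x}_0$-prediction loss follows. It relies on the standard DDPM closed form $\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$, which follows from applying Eq. 2 repeatedly. The linear schedule, the value of T, the reduction over ellipsoids, and the stand-in denoisers are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 2e-2, T)     # assumed linear variance schedule
alpha_bar = np.cumprod(1.0 - betas)    # cumulative product of (1 - beta_s)

def q_sample(x0, t, noise):
    """Draw x_t from x_0 in one shot (closed form of the Markov chain in Eq. 2)."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def x0_prediction_loss(denoiser, x0, t, noise):
    """One-sample estimate of Eq. 3: squared error between x_0 and the prediction."""
    xt = q_sample(x0, t, noise)
    # sum over the 16 features, average over ellipsoids (an assumed reduction)
    return np.mean(np.sum((x0 - denoiser(xt, t)) ** 2, axis=-1))

x0 = rng.normal(size=(1024, 16))       # stand-in object: N = 1024 ellipsoids, 16 features
noise = rng.normal(size=x0.shape)
t = 500

def oracle(xt, t):
    # a cheating denoiser that inverts q_sample using the true noise
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * noise) / np.sqrt(alpha_bar[t])

def identity(xt, t):
    # a denoiser that does nothing
    return xt

loss_oracle = x0_prediction_loss(oracle, x0, t, noise)
loss_identity = x0_prediction_loss(identity, x0, t, noise)
```

The oracle drives the loss to numerical zero while the identity map does not, which is the gap a trained $p_\theta$ is meant to close without access to the true noise.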

Backbone Choice for $p_\theta$. When our neural approximator $p_\theta$ operates on GS, it treats GS as point clouds with rich features. To keep it simple, inspired by [26, 27], we employ a vanilla transformer for unconditional GS modeling. The transformer acts as a densely connected Graph Neural Network with implicit edges realized by multi-head attention. It is possibly more effective at handling this informative unstructured representation than other point cloud learning architectures such as PVCNN [21], as discussed in Sec. 5.3. We choose not to include positional encoding, as our explicit GS representation already incorporates positional information. This design makes our architecture versatile, in principle accommodating an arbitrary number of GS points. For training efficiency, we keep the number of points fixed at 1024 for category-specific experiments on CO3D [30]. We also explore the scaling-up performance of our GS diffusion transformer by training a single model on a relatively general object set, OmniObject3D [42], with results shown in Fig. 11.
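The consequence of dropping positional encoding can be checked on a toy attention layer. The sketch below uses a single attention head with random weights (an assumption; the actual backbone is a multi-layer, multi-head transformer) and shows two properties: permuting the input ellipsoids permutes the output in the same way, and the same weights accept a different number of ellipsoids.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over N tokens of shape (N, d), no positional encoding."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 16                                              # one token per ellipsoid, 16 features each
wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

x = rng.normal(size=(1024, d))
perm = rng.permutation(x.shape[0])
out = self_attention(x, wq, wk, wv)
out_perm = self_attention(x[perm], wq, wk, wv)
equivariant = np.allclose(out[perm], out_perm)      # output order follows input order

out_small = self_attention(x[:256], wq, wk, wv)     # same weights, different N
```

Since the ordering of an ellipsoid set carries no meaning, this equivariance is a desirable property here; the spatial information the model needs is already present in the position features of each token.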

4.2 View-Guided Sampling

Taking inspiration from the projection conditioning in [23], we regard fine-grained conditioning as a more faithful approach than global conditioning with an image encoder. However, our experiments reveal that the naive point projection method fails to effectively convey features from photographs to the informative GS, as discussed in Sec. 5.3.

Since GS features in the denoising process can be seamlessly projected to image space through the splatting function, the reverse operation can backpropagate fine-grained view information to the GS space through gradients. This observation leads us to loss-guided sampling for the diffusion model [13, 34].

In conditional generation, we may want to draw samples $x_0$ from the prior distribution subject to certain conditions $y$. For diffusion models, the conditional score at time $t$ can be obtained via Bayes’ rule:

$$
\nabla_{x_t} \log p_t(x_t \mid y) = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t),
\tag{4}
$$

where the first term is the unconditional score function $\nabla_{x_t} \log p_\theta(x_t)$ learned via the denoising-diffusion objective in Eq. 3. For the second term, the naive solution is to train a classifier $p_\phi(y \mid x_t)$ on paired data $(y, x_t)$ that operates as this posterior distribution, i.e., classifier guidance [8]. However, a labeled dataset of noisy samples is not always available or flexible. Diffusion Posterior Sampling (DPS) [7] instead approximates $p_t(y \mid x_t)$ by $p_t(y \mid \hat{x}_0)$, assuming $p(y \mid x_0)$ is given, where $\hat{x}_0$ is essentially a point estimate from the denoiser $p_\theta$ in our case. Reconstruction guidance [13] simplifies this approximation by assuming $p(y \mid x_0)$ is Gaussian. Thus $p_t(y \mid x_t)$ becomes $\mathcal{N}\left[p_\theta(x_t), \left(\bar{\beta}/(1-\bar{\beta})\right)\mathbf{I}\right]$, which gives Eq. 6. Loss Guidance [34] extends this method to a more general setting in which a differentiable loss function $\ell_y$ replaces the MSE estimate, as follows:

\begin{align}
\text{DPS}(x_t, y) &:= \nabla_{x_t} \log p_t(y \mid \hat{x}_0), \tag{5} \\
&= \nabla_{x_t} -\frac{1-\bar{\beta}_t}{2\bar{\beta}_t} \left\| x_0 - \hat{x}_0 \right\|^2, \tag{6} \\
&= \nabla_{x_t} -\frac{1-\bar{\beta}_t}{2\bar{\beta}_t} \ell_y(\hat{x}_0). \tag{7}
\end{align}
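The step from Eq. 5 to Eq. 6 can be spelled out as a short worked derivation. It is a sketch under the Gaussian assumption stated above, with the observation written as $y$ (Eq. 6 writes it as $x_0$) and the shorthand $\sigma_t^2$ introduced here for the variance.

```latex
\begin{align*}
p_t(y \mid \hat{x}_0) &= \mathcal{N}\!\left(y;\, \hat{x}_0,\, \sigma_t^2 \mathbf{I}\right),
  \qquad \sigma_t^2 = \frac{\bar{\beta}_t}{1-\bar{\beta}_t}, \\
\log p_t(y \mid \hat{x}_0)
  &= -\frac{1}{2\sigma_t^2}\,\lVert y - \hat{x}_0 \rVert^2 + \text{const}
   = -\frac{1-\bar{\beta}_t}{2\bar{\beta}_t}\,\lVert y - \hat{x}_0 \rVert^2 + \text{const}.
\end{align*}
```

Taking the gradient with respect to $x_t$ removes the constant and leaves the weighted squared error. Replacing that squared error with a general differentiable loss $\ell_y$ then yields Eq. 7.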

Since we only have access to the noiseless input 2D image $y_0$, we use the approximator $p_\theta(\mathbf{x}_0; \mathbf{x}_t, t)$ and the splatting function $f_{\text{splat}}$ (Eq. 1) to compute $\hat{x}_0$ and $\hat{y}_0$, which form the differentiable loss function in Eq. 7. The guidance gradients w.r.t. $\mathbf{x}_t$ are then approximated as:

\begin{align}
\textit{grad} &\leftarrow \nabla_{\mathbf{x}_t} -\frac{1-\bar{\beta}_t}{2\bar{\beta}_t} \left(\mathcal{L}_{\text{img}} \circ f_{\text{splat}}\right)\left(x_0, \hat{x}_0\right), \tag{8} \\
&\leftarrow \nabla_{\mathbf{x}_t} -\frac{1-\bar{\beta}_t}{2\bar{\beta}_t} \mathcal{L}_{\text{img}}\left[y_0, f_{\text{splat}}\left(p_\theta(\mathbf{x}_0; \mathbf{x}_t, t)\right)\right]. \tag{9}
\end{align}

where the viewpoint camera $\mathbf{P}$ is omitted from the splatting function $f_{\text{splat}}(y; x, \mathbf{P})$ for simplicity. The guidance gradients then bias the unconditional score prediction by

$$
\tilde{\mathbf{x}}_t \leftarrow \hat{\mathbf{x}}_t + \lambda_{\text{gd}} \frac{\bar{\beta}}{\sqrt{1-\bar{\beta}}}\, \textit{grad},
\tag{10}
$$

where $\lambda_{\text{gd}}$ is empirically a large weighting factor [13]. We further perform predictor-corrector sampling [35] during the denoising approximation by inserting extra Langevin correction steps between DDIM steps [33].
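As a rough illustration of Eqs. 8 to 10, the following Python sketch applies a single guidance update with toy stand-ins. Everything in it is an assumption made for illustration only: `denoiser` plays the role of $p_{\theta}$, `splat` is a linear orthographic projection rather than a real GS rasterizer, the image loss is taken to be a squared error, and the gradient is computed by finite differences instead of backpropagation.

```python
import numpy as np


def numerical_grad(fn, x, eps=1e-5):
    """Central-difference gradient of a scalar function fn at x."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step.flat[i] = eps
        grad.flat[i] = (fn(x + step) - fn(x - step)) / (2.0 * eps)
    return grad


def guided_update(x_t, x_hat_t, y0, denoiser, splat, beta_bar, lam_gd):
    """One view-guided correction following Eqs. 8-10 (toy version)."""

    def objective(x):
        x0_pred = denoiser(x)                     # predicted clean sample
        rendered = splat(x0_pred)                 # render it into the guidance view
        loss_img = np.sum((y0 - rendered) ** 2)   # assumed squared-error image loss
        return -(1.0 - beta_bar) / (2.0 * beta_bar) * loss_img

    grad = numerical_grad(objective, x_t)                 # Eq. 9
    scale = lam_gd * beta_bar / np.sqrt(1.0 - beta_bar)   # weighting in Eq. 10
    return x_hat_t + scale * grad


# Toy setup (hypothetical): identity denoiser, projection that drops z.
PROJ = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
x_t = np.array([0.0, 0.0, 1.0])
y0 = np.array([1.0, -1.0])
x_guided = guided_update(
    x_t, x_t, y0,
    denoiser=lambda x: x,
    splat=lambda x: PROJ @ x,
    beta_bar=0.5, lam_gd=1.0,
)
loss_before = np.sum((y0 - PROJ @ x_t) ** 2)
loss_after = np.sum((y0 - PROJ @ x_guided) ** 2)
```

In this toy case the update moves only the coordinates that are visible in the guidance view and leaves the unobserved z coordinate untouched, which is the intuition behind letting the learned prior fill in the unseen geometry.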

4.3 Polishing and Re-using

Experiments from concurrent research on GS [16, 38, 5, 36] indicate that needle-shaped artifacts may occur in NVS when the available views are insufficient. We also observe that even a mild perturbation in the GS space leads to pronounced needle-shaped artifacts in 2D views, while the 3D geometry remains relatively satisfactory. We therefore hypothesize that a domain gap exists between GS-rendered views and real images: the 3D modeling process may fail to assign adequate importance to features that are crucial for 2D appearance, such as covariance.

Building on the aforementioned hypothesis, we propose to construct an auxiliary 2D diffusion model that takes imperfect GS-rendered images as the condition and generates clean, photorealistic images. To further enhance the view rendering quality of the GS diffusion reconstruction, we polish and re-use the rendered and refined images by iteratively performing GS diffusion and 2D diffusion, as depicted in Fig. 3 (d).
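The alternation between GS diffusion and 2D diffusion can be sketched as a simple loop. This is only one plausible reading of the procedure, not the authors' implementation: the callables `reconstruct_gs`, `render`, and `refine_2d` are hypothetical placeholders for view-guided GS sampling, GS rendering, and the 2D diffusion polisher, and the choice to feed the polished views back as additional guidance views is an assumption.

```python
def polish_and_reuse(input_views, cameras, reconstruct_gs, render, refine_2d,
                     num_iters=3):
    """Alternate 3D reconstruction and 2D polishing (hypothetical sketch).

    input_views    : the given view(s) of the object
    cameras        : novel camera poses to render and polish
    reconstruct_gs : callable(list of guidance views) -> GS object
    render         : callable(GS object, camera) -> rendered image
    refine_2d      : callable(rendered image) -> polished image
    """
    guidance = list(input_views)
    gs = reconstruct_gs(guidance)                  # initial view-guided sampling
    for _ in range(num_iters):
        rendered = [render(gs, cam) for cam in cameras]
        polished = [refine_2d(img) for img in rendered]
        # Re-use the polished renderings alongside the original input views.
        guidance = list(input_views) + polished
        gs = reconstruct_gs(guidance)
    return gs
```

The default of three iterations mirrors the single-view setting reported in the implementation details; the original input views are always kept in the guidance set so that polishing cannot drift away from the observed content.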

5 Experiments

5.1 Experimental Setup

Dataset. We conduct experiments on CO3Dv2 [30], an unconstrained multi-view dataset of real-world objects with point cloud annotation. The dataset is extraordinarily challenging [51, 23, 41] since it is captured in the wild without any coordinate calibration, which is closer to daily application conditions. We use the dataset-split annotation from fewview-dev for training and evaluation. We show results for the core-ten categories: hydrant, bench, donut, teddy bear, apple, vase, plant, suitcase, ball and cake. To illustrate scaling-up performance, we additionally show qualitative results on general objects with a model trained on OmniObject3D [42], which comprises 6,000 scanned objects in 190 daily categories.

Baselines. We compare our approach against the current state-of-the-art methods: NerFormer [30], ViewFormer [17] and SparseFusion [51]. We re-train NerFormer on each category using its official implementation. For ViewFormer, which is trained across categories, we use the authors' checkpoint for all categories of CO3Dv2. We compare against SparseFusion only for reconstruction from two views, since its design does not support the single-view setting; we use the category-specific model provided by the authors. For comparison in 3D geometry, we use the officially released checkpoint on hydrant from PC2 [23].

Metrics. Following prior works, we report standard image metrics, PSNR, SSIM, and LPIPS, which cover different aspects of image quality, for evaluation in 2D views. For 3D geometry, we measure F-score@0.01 [39] and Chamfer Distance. F-score@0.01 evaluates precision and recall with a threshold of 0.01: a reconstructed point is considered a correct prediction if its distance to the nearest point of the ground truth point cloud is below the threshold.
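For concreteness, PSNR and the two geometry metrics can be sketched in NumPy as below. This is a minimal sketch under stated assumptions: images are taken to lie in [0, 1], the F-score is the harmonic mean of precision and recall at the threshold, and Chamfer Distance is taken as the sum of the two mean (unsquared) nearest-neighbor distances. Chamfer conventions vary between papers, so the exact variant used in the evaluation may differ.

```python
import numpy as np


def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    diff = np.asarray(img_a, dtype=float) - np.asarray(img_b, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


def nearest_dists(src, dst):
    """Distance from every point in src (N, 3) to its nearest point in dst (M, 3)."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def f_score(pred, gt, tau=0.01):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = float((nearest_dists(pred, gt) < tau).mean())
    recall = float((nearest_dists(gt, pred) < tau).mean())
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance (assumed variant: sum of mean distances)."""
    return float(nearest_dists(pred, gt).mean() + nearest_dists(gt, pred).mean())


# Small worked example: one predicted point is close to the ground truth,
# the other is far off, so precision and recall are both 0.5.
gt_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pred_pts = np.array([[0.0, 0.0, 0.005], [1.0, 0.0, 0.5]])
example_f = f_score(pred_pts, gt_pts, tau=0.01)
example_cd = chamfer_distance(pred_pts, gt_pts)
```

The brute-force pairwise distance matrix is adequate for point sets of about a thousand points; larger clouds would normally use a spatial index instead.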

Implementation Details. For the diffusion models, we schedule 1000 steps for GS and 500 steps for 2D, both set to predict the clean sample at each step. We build category-specific transformer encoders for the GS denoiser $p_{\theta}$, each with 19.6M parameters. The models are trained with the number of GS points fixed at 1024, for 200k iterations. We mask out the points with outlier scaling features to stabilize the training. We also find that L1 loss performs better than MSE in Eq. 3 for GS. For the 2D diffusion model, we build a naive UNet denoiser following the common setting with FP16 and an input size of 256x256, trained for 200k iterations. If not specified otherwise, we use AdamW with default parameters and a learning rate of 0.0001 for optimization.
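The combination of outlier masking and an L1 objective could look roughly like the following sketch. The masking criterion is not specified in the text, so the rule used here (drop any point whose largest scale exceeds a fixed `max_scale`) and the threshold value itself are purely hypothetical.

```python
import numpy as np


def masked_l1_loss(pred, target, scales, max_scale=1.0):
    """L1 loss over GS points, ignoring points with outlier scales.

    pred, target : (N, D) predicted and ground-truth GS features
    scales       : (N, 3) ground-truth per-axis scales of each Gaussian
    max_scale    : hypothetical outlier threshold (not from the paper)
    """
    keep = scales.max(axis=1) <= max_scale   # True for well-behaved points
    if not keep.any():
        return 0.0                           # nothing left to supervise
    return float(np.abs(pred[keep] - target[keep]).mean())


# Example: the third point has an outlier scale and a huge error,
# but it does not contribute to the loss.
pred = np.zeros((3, 4))
target = np.array([[1.0] * 4, [1.0] * 4, [100.0] * 4])
scales = np.array([[0.1, 0.1, 0.1], [0.2, 0.1, 0.3], [5.0, 0.1, 0.1]])
example_loss = masked_l1_loss(pred, target, scales)
```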

We take 100 and 25 DDIM steps in the generative sampling process for GS diffusion and 2D diffusion, respectively. We use three polishing and re-using iterations for single-view reconstruction and two for 2-view reconstruction. We adopt multi-view GS refinement initiated from our reconstructed GS, with 2D refined views. We jointly perform this regression and 2D view refinement in an iterative improvement manner for two iterations in the single-view setting. Reconstructing a single instance takes around 3 minutes on an A100 GPU, depending on the number of iterations. The object represented by GS is obtained as the final output.

5.2 Reconstruction on Real-World Images

We present category-specific reconstruction results from a single viewpoint for objects such as hydrant, donut, teddy bear, and bench. These categories vary in structural complexity, scale, and capture environment. For each instance, we adopt an experimental setting similar to [51]: we load 32 linearly spaced views, from which we randomly select 1 input view, and assess the performance on the remaining 31 unseen views.
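The view selection protocol described above can be sketched as follows. The function name, the rounding of the linearly spaced indices, and the seeding are illustrative assumptions rather than details taken from [51].

```python
import numpy as np


def split_views(num_frames, num_loaded=32, num_input=1, seed=0):
    """Pick linearly spaced frames, then split them into input and held-out views.

    Assumes num_frames >= num_loaded so that the rounded indices stay distinct.
    """
    # Frame indices spread evenly over the whole capture sequence.
    frame_ids = np.linspace(0, num_frames - 1, num_loaded).round().astype(int)
    rng = np.random.default_rng(seed)
    chosen = rng.choice(num_loaded, size=num_input, replace=False)
    is_input = np.zeros(num_loaded, dtype=bool)
    is_input[chosen] = True
    return frame_ids[is_input], frame_ids[~is_input]


input_ids, heldout_ids = split_views(num_frames=100)
```

With the defaults this yields 1 input frame and 31 held-out frames per instance; passing `num_input=2` gives the two-view split.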

| Methods (cells: PSNR / LPIPS) | Hydrant | Bench | Donut | Teddy bear | Apple | Vase | Plant | Suitcase | Ball | Cake | All (PSNR / LPIPS / SSIM) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NerFormer [30] | 16.9 / 0.33 | 15.4 / 0.46 | 18.0 / 0.37 | 13.1 / 0.46 | 18.2 / 0.36 | 16.9 / 0.35 | 16.4 / 0.46 | 19.5 / 0.41 | 16.1 / 0.37 | 16.2 / 0.47 | 16.67 / 0.404 / 0.562 |
| ViewFormer [17] | 16.6 / 0.24 | 15.7 / 0.34 | 17.1 / 0.37 | 12.9 / 0.34 | 19.1 / 0.30 | 17.9 / 0.26 | 15.9 / 0.37 | 20.2 / 0.30 | 16.5 / 0.33 | 16.8 / 0.36 | 16.87 / 0.321 / 0.625 |
| Ours w/o iter | 18.7 / 0.25 | 15.7 / 0.32 | 18.5 / 0.33 | 16.3 / 0.35 | 18.6 / 0.30 | 19.4 / 0.23 | 17.4 / 0.33 | 21.0 / 0.29 | 17.1 / 0.31 | 18.0 / 0.32 | 18.07 / 0.303 / 0.679 |
| Ours | 19.7 / 0.20 | 16.2 / 0.31 | 18.9 / 0.32 | 16.8 / 0.31 | 19.2 / 0.27 | 20.1 / 0.22 | 17.4 / 0.30 | 20.4 / 0.31 | 17.7 / 0.30 | 18.5 / 0.34 | 18.49 / 0.288 / 0.696 |
Table 1: Quantitative comparison in view quality of single-view reconstruction. We report PSNR ↑ and LPIPS ↓ averaged across the testing set. Bold face indicates the best result, while underscore refers to the second best. “iter” refers to our iterative polishing and re-using strategy.
Figure 4: View synthesis qualitative results from single-view reconstruction. We show novel view synthesis results given the object reconstructed from single-view input on hydrant, bench, donut, and teddy bear. Our method takes the raw input view with an object mask with various resolutions as input. Notably, our novel views are rendered from the GS in real-time once we obtain this reconstructed 3D representation.
Figure 5: Additional view synthesis qualitative results from single-view reconstruction. We show three novel views rendered from the object reconstructed from the single-view input shown on the left.
| Methods (cells: PSNR / LPIPS) | Hydrant | Bench | Donut | Teddy bear | Apple † | Vase † | Plant † | Suitcase † | Ball † | Cake † |
|---|---|---|---|---|---|---|---|---|---|---|
| NerFormer [30] | 18.2 / 0.30 | 15.9 / 0.43 | 20.2 / 0.34 | 15.8 / 0.44 | 19.5 / 0.33 | 17.7 / 0.34 | 17.8 / 0.45 | 20.0 / 0.39 | 16.8 / 0.35 | 16.9 / 0.44 |
| ViewFormer [17] | 17.5 / 0.16 | 16.4 / 0.30 | 18.6 / 0.24 | 15.6 / 0.33 | 20.1 / 0.26 | 20.4 / 0.21 | 17.8 / 0.31 | 21.0 / 0.26 | 18.3 / 0.31 | 17.3 / 0.33 |
| SparseFusion [51] | 22.3 / 0.16 | 16.7 / 0.29 | 22.8 / 0.22 | 20.6 / 0.24 | 22.8 / 0.20 | 22.8 / 0.18 | 20.0 / 0.25 | 22.2 / 0.22 | 22.4 / 0.22 | 20.8 / 0.28 |
| Ours | 22.6 / 0.15 | 18.4 / 0.28 | 22.7 / 0.22 | 20.7 / 0.22 | 23.0 / 0.18 | 22.8 / 0.16 | 19.0 / 0.24 | 22.8 / 0.21 | 22.2 / 0.20 | 20.9 / 0.28 |
Table 2: Quantitative comparison in view quality of reconstruction from two views on the core-10 categories. We follow the experiment setting from [51], and report PSNR ↑ and LPIPS ↓ averaged across the first ten scenes from the testing set. Baseline results marked with ‘†’ are reported by [51].
Figure 6: View synthesis qualitative results from 2-view reconstruction. We provide visual results consistent with the demo setting of [51].
Figure 7: Point cloud visualization of single-view reconstruction. We visualize the object reconstructed from single-view input on hydrant. For our method, the point cloud is extracted from the position feature of GS. In order to present the 3D shape better, we provide 2-3 different views for each object.
| Methods | Hydrant F-score | Hydrant ChamferDist |
|---|---|---|
| PC2 [23] | 0.185 | 0.073 |
| Ours | 0.191 | 0.068 |
| Ours w/o iter | 0.180 | 0.067 |
Table 3: Quantitative comparison in 3D geometry of reconstruction from single view. We report F-score@0.01 ↑ and Chamfer Distance ↓ averaged across 50 scenes in the testing set of hydrant.

Table 1 shows that our approach outperforms other methods on all view synthesis metrics. The performance is better illustrated in the qualitative comparison presented in Figure 4. The baselines either produce blurry views or struggle to maintain view consistency. Our reconstruction faithfully keeps the content from the given view and generalizes to the whole 3D object to generate views with geometrical consistency.

To further demonstrate the flexibility and efficacy of our view-guided sampling strategy, we additionally report results for reconstruction from two views, quantitatively in Table 2 and qualitatively in Figure 6. By perceiving more views of the object, our approach achieves improved results that are comparable to the current state-of-the-art, SparseFusion, while SparseFusion takes much longer to extract the 3D scene with SDS.

For 3D geometry accuracy, we show a quantitative comparison in Table 3. Since the scenes are highly unconstrained, we normalize each object with its ground truth bounding box to evaluate it in the unit scale. Although our model focuses not only on 3D geometry accuracy but also on realistic view synthesis, it outperforms the state-of-the-art approach that concentrates on point cloud reconstruction, i.e. PC2 [23]. The qualitative results from Figure 7 further reveal the strength of our approach. PC2 tends to reconstruct the object in a category-mean shape, so it struggles when the target deviates further from the center of the dataset distribution. We argue that this mainly results from how conditions are added to the diffusion model, which we discuss in Sec. 5.3. Although our results may contain sparse outlier points around the object due to the nature of GS, we achieve a better fit to the 3D shape of various instances.
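The bounding-box normalization mentioned above can be sketched as below. The exact convention is not given in the text, so this sketch assumes that both point clouds are centered on the ground truth box and divided by its longest side.

```python
import numpy as np


def normalize_by_gt_bbox(pred, gt):
    """Map pred and gt (N, 3) point clouds into the unit scale of the gt bounding box.

    Assumed convention: translate by the box center, divide by the longest side.
    """
    lo, hi = gt.min(axis=0), gt.max(axis=0)
    center = (lo + hi) / 2.0
    scale = (hi - lo).max()
    return (pred - center) / scale, (gt - center) / scale


gt_cloud = np.array([[0.0, 0.0, 0.0], [2.0, 4.0, 1.0]])
pred_cloud = np.array([[1.0, 2.0, 0.5]])
pred_n, gt_n = normalize_by_gt_bbox(pred_cloud, gt_cloud)
```

Because the same transform is applied to prediction and ground truth, a distance threshold such as 0.01 then has a comparable meaning across objects of very different physical size.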

5.3 Ablation Studies

Effect of View-Guided Sampling. Fig. 8 compares different approaches to adding conditions to the diffusion model in this task. Both results are rendered purely from the GS reconstruction, without the iterative polishing and re-using process. For the forward projection method, we adopt a fine-grained feature extraction module similar to the point projection condition in PC2 [23] to train a conditional diffusion model, and we utilize classifier-free guidance at inference time to reconstruct the object from the given view. However, empirical results from Fig. 7 and Fig. 8 both suggest that this conditioning strategy is not as effective as our proposal. Concurrent studies [9] also argue that this point projection-based strategy is unstable for relatively special cases.

Figure 8: Qualitative results on different strategies to add conditions. For clarity, we present these results without improvement from our iterative polishing and re-using process.
| Methods (Hydrant) | PSNR | SSIM | LPIPS | Time |
|---|---|---|---|---|
| ProjCond | 15.10 | 0.519 | 0.407 | 0.12s |
| Ours | 18.61 | 0.783 | 0.265 | 0.17s |
Table 4: Quantitative comparison of different adding-conditions strategies on single-view reconstruction. We report PSNR ↑, SSIM ↑, LPIPS ↓, and inference time per denoising step ↓ averaged across all scenes in the hydrant testing set.

Choice of backbone for GS modeling. We compare the transformer we use with PVCNN, a currently popular backbone for learning on unstructured data, used by PC2 [23] and LION [49]. To examine how the model learns the distribution from the dataset, Fig. 9 shows the unconditional generation results with and without our iterative polishing and re-using strategy. Both results support that the transformer learns the GS dataset distribution better. This is possibly because the transformer organizes edges implicitly while PVCNN uses the explicit point position, which is not suitable for GS features since the covariance vectors also contain spatial information in addition to positions.

Figure 9: Qualitative results on different backbones for the GS denoiser $p_{\theta}$. We show unconditional generation results directly rendered from GS by models built on Transformer (ours) and PVCNN [21], w/ and w/o our iterative polishing and re-using process.
Figure 10: Qualitative results of continuous frames from the initial 2D view diffusion and the final reconstructed GS rendering. The inconsistency and artifacts are emphasized by the red circle.

Effect of Iterative polishing and re-using process. Quantitative results from Tab. 1 suggest the efficacy of our iterative polishing and re-using strategy in improving the view rendering quality of the reconstructed GS. This is also supported by Fig. 9 and Fig. 10. The green square highlights the geometry consistency, which is inherently a strength of modeling in 3D space. The comparable view quality in these paired results supports that our iterative polishing and re-using process improves the view quality of the reconstructed GS with the assistance of 2D diffusion. On the other hand, modeling in 3D inherently enhances view consistency for novel view synthesis.

5.4 Limitations

The primary limitation of our model is the need for GS ground truth for training. This limits our ability to scale the model up to common image datasets or to apply it in a generic object reconstruction scenario. We adopt a constrained densification to obtain GS ground truth, which we have empirically found to be relatively efficient, while still leaving a wide area for exploration.

Figure 11: Qualitative results of general object reconstruction with a single model trained on OmniObject3D.

6 Conclusion

We proposed GSD, a generative approach to real-world object reconstruction from a single image that uses a Diffusion Transformer on top of Gaussian Splatting. We make use of the splatting function for efficient fine-grained 2D feature perception with view-guided sampling. The proposed method has shown superior performance in category-specific reconstruction tasks. Thanks to the DiT and the fine-grained conditioning mechanism, GSD exhibits the potential to scale up (Fig. 11), which could pave the way toward achieving photo-realistic performance in generic object reconstruction tasks.

References

  • [1] Cao, A., Johnson, J.: Hexplane: A fast representation for dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 130–141 (2023)
  • [2] Cao, A., Rockwell, C., Johnson, J.: Fwd: Real-time novel view synthesis with forward warping and depth. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15713–15724 (2022)
  • [3] Chan, E.R., Nagano, K., Chan, M.A., Bergman, A.W., Park, J.J., Levy, A., Aittala, M., De Mello, S., Karras, T., Wetzstein, G.: Generative novel view synthesis with 3d-aware diffusion models. arXiv preprint arXiv:2304.02602 (2023)
  • [4] Chen, H., Gu, J., Chen, A., Tian, W., Tu, Z., Liu, L., Su, H.: Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. arXiv preprint arXiv:2304.06714 (2023)
  • [5] Chen, Z., Wang, F., Liu, H.: Text-to-3d using gaussian splatting. arXiv preprint arXiv:2309.16585 (2023)
  • [6] Chibane, J., Alldieck, T., Pons-Moll, G.: Implicit functions in feature space for 3d shape reconstruction and completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6970–6981 (2020), https://openaccess.thecvf.com/content_CVPR_2020/html/Chibane_Implicit_Functions_in_Feature_Space_for_3D_Shape_Reconstruction_and_CVPR_2020_paper.html
  • [7] Chung, H., Kim, J., Mccann, M.T., Klasky, M.L., Ye, J.C.: Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687 (2022)
  • [8] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794 (2021)
  • [9] Di, Y., Zhang, C., Wang, P., Zhai, G., Zhang, R., Manhardt, F., Busam, B., Ji, X., Tombari, F.: Ccd-3dr: Consistent conditioning in diffusion for single-image 3d reconstruction. arXiv preprint arXiv:2308.07837 (2023)
  • [10] Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3d object reconstruction from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 605–613 (2017), https://openaccess.thecvf.com/content_cvpr_2017/html/Fan_A_Point_Set_CVPR_2017_paper.html
  • [11] Gao, J., Chen, W., Xiang, T., Jacobson, A., McGuire, M., Fidler, S.: Learning deformable tetrahedral meshes for 3d reconstruction. In: Advances in Neural Information Processing Systems. vol. 33, pp. 9936–9947. Curran Associates, Inc. (2020), https://proceedings.neurips.cc//paper/2020/hash/7137debd45ae4d0ab9aa953017286b20-Abstract.html
  • [12] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
  • [13] Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models (2022)
  • [14] Jang, W., Agapito, L.: Codenerf: Disentangled neural radiance fields for object categories. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12949–12958 (2021)
  • [15] Jun, H., Nichol, A.: Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463 (2023)
  • [16] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG) 42(4), 1–14 (2023)
  • [17] Kulhánek, J., Derner, E., Sattler, T., Babuška, R.: Viewformer: Nerf-free neural rendering from few images using transformers. In: European Conference on Computer Vision. pp. 198–216. Springer (2022)
  • [18] Li, K., Pham, T., Zhan, H., Reid, I.: Efficient dense point cloud object reconstruction using deformation vector fields. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 497–513 (2018), https://openaccess.thecvf.com/content_ECCV_2018/html/Kejie_Li_Efficient_Dense_Point_ECCV_2018_paper.html
  • [19] Lin, C.H., Kong, C., Lucey, S.: Learning efficient point cloud generation for dense 3d object reconstruction. In: proceedings of the AAAI Conference on Artificial Intelligence. vol. 32 (2018)
  • [20] Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot one image to 3d object. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9298–9309 (2023)
  • [21] Liu, Z., Tang, H., Lin, Y., Han, S.: Point-voxel cnn for efficient 3d deep learning. Advances in Neural Information Processing Systems 32 (2019)
  • [22] Long, X., Guo, Y.C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.H., Habermann, M., Theobalt, C., et al.: Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008 (2023)
  • [23] Melas-Kyriazi, L., Rupprecht, C., Vedaldi, A.: PC$^2$: Projection-conditioned point cloud diffusion for single-image 3d reconstruction. arXiv preprint arXiv:2302.10668 (2023)
  • [24] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)
  • [25] Müller, N., Siddiqui, Y., Porzi, L., Bulo, S.R., Kontschieder, P., Nießner, M.: Diffrf: Rendering-guided 3d radiance field diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4328–4338 (2023)
  • [26] Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., Chen, M.: Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751 (2022)
  • [27] Peebles, W., Xie, S.: Scalable diffusion models with transformers (2023)
  • [28] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. In: The Eleventh International Conference on Learning Representations (2022)
  • [29] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)
  • [30] Reizenstein, J., Shapovalov, R., Henzler, P., Sbordone, L., Labatut, P., Novotny, D.: Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10901–10911 (2021)
  • [31] Rombach, R., Esser, P., Ommer, B.: Geometry-free view synthesis: Transformers and no 3d priors. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14356–14366 (2021)
  • [32] Shue, J.R., Chan, E.R., Po, R., Ankner, Z., Wu, J., Wetzstein, G.: 3d neural field generation using triplane diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20875–20886 (2023)
  • [33] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
  • [34] Song, J., Zhang, Q., Yin, H., Mardani, M., Liu, M.Y., Kautz, J., Chen, Y., Vahdat, A.: Loss-guided diffusion models for plug-and-play controllable generation (2023)
  • [35] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
  • [36] Szymanowicz, S., Rupprecht, C., Vedaldi, A.: Splatter image: Ultra-fast single-view 3d reconstruction. arXiv preprint arXiv:2312.13150 (2023)
  • [37] Tang, J., Han, X., Tan, M., Tong, X., Jia, K.: Skeletonnet: A topology-preserving solution for learning mesh reconstruction of object surfaces from rgb images. IEEE transactions on pattern analysis and machine intelligence 44(10), 6454–6471 (2021)
  • [38] Tang, J., Ren, J., Zhou, H., Liu, Z., Zeng, G.: Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653 (2023)
  • [39] Tatarchenko, M., Richter, S.R., Ranftl, R., Li, Z., Koltun, V., Brox, T.: What do single-view 3d reconstruction networks learn? In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3405–3414 (2019)
  • [40] Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., Jiang, Y.G.: Pixel2mesh: Generating 3d mesh models from single RGB images. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 52–67 (2018), https://openaccess.thecvf.com/content_ECCV_2018/html/Nanyang_Wang_Pixel2Mesh_Generating_3D_ECCV_2018_paper.html
  • [41] Watson, D., Chan, W., Martin-Brualla, R., Ho, J., Tagliasacchi, A., Norouzi, M.: Novel view synthesis with diffusion models. arXiv preprint arXiv:2210.04628 (2022)
  • [42] Wu, T., Zhang, J., Fu, X., Wang, Y., Ren, J., Pan, L., Wu, W., Yang, L., Wang, J., Qian, C., et al.: Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 803–814 (2023)
  • [43] Xie, H., Yao, H., Zhang, S., Zhou, S., Sun, W.: Pix2vox++: Multi-scale context-aware 3d object reconstruction from single and multiple images. International Journal of Computer Vision 128(12), 2919–2935 (2020)
  • [44] Xing, Z., Chen, Y., Ling, Z., Zhou, X., Xiang, Y.: Few-shot single-view 3d reconstruction with memory prior contrastive network. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022, Lecture Notes in Computer Science, vol. 13661, pp. 55–70. Springer Nature Switzerland (2022). https://doi.org/10.1007/978-3-031-19769-7_4
  • [45] Xu, D., Yuan, Y., Mardani, M., Liu, S., Song, J., Wang, Z., Vahdat, A.: Agg: Amortized generative 3d gaussians for single image to 3d (2024)
  • [46] Yang, S., Xu, M., Xie, H., Perry, S., Xia, J.: Single-view 3d object reconstruction from shape priors in memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3152–3161 (2021), https://openaccess.thecvf.com/content/CVPR2021/html/Yang_Single-View_3D_Object_Reconstruction_From_Shape_Priors_in_Memory_CVPR_2021_paper.html
  • [47] Yifan, W., Serena, F., Wu, S., Öztireli, C., Sorkine-Hornung, O.: Differentiable surface splatting for point-based geometry processing. ACM Transactions on Graphics (TOG) 38(6), 1–14 (2019)
  • [48] Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelnerf: Neural radiance fields from one or few images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4578–4587 (2021)
  • [49] Zeng, X., Vahdat, A., Williams, F., Gojcic, Z., Litany, O., Fidler, S., Kreis, K.: Lion: Latent point diffusion models for 3d shape generation. arXiv preprint arXiv:2210.06978 (2022)
  • [50] Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5745–5753 (2019)
  • [51] Zhou, Z., Tulsiani, S.: Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12588–12597 (2023)
  • [52] Zou, Z.X., Yu, Z., Guo, Y.C., Li, Y., Liang, D., Cao, Y.P., Zhang, S.H.: Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers (2023)
  • [53] Zwicker, M., Pfister, H., Van Baar, J., Gross, M.: Ewa splatting. IEEE Transactions on Visualization and Computer Graphics 8(3), 223–238 (2002)