-
Meta 3D Gen
Authors:
Raphael Bensadoun,
Tom Monnier,
Yanir Kleiman,
Filippos Kokkinos,
Yawar Siddiqui,
Mahendra Kariya,
Omri Harosh,
Roman Shapovalov,
Benjamin Graham,
Emilien Garreau,
Animesh Karnewar,
Ang Cao,
Idan Azuri,
Iurii Makarov,
Eric-Tuan Le,
Antoine Toisoul,
David Novotny,
Oran Gafni,
Natalia Neverova,
Andrea Vedaldi
Abstract:
We introduce Meta 3D Gen (3DGen), a new state-of-the-art, fast pipeline for text-to-3D asset generation. 3DGen offers 3D asset creation with high prompt fidelity and high-quality 3D shapes and textures in under a minute. It supports physically-based rendering (PBR), necessary for 3D asset relighting in real-world applications. Additionally, 3DGen supports generative retexturing of previously gener…
▽ More
We introduce Meta 3D Gen (3DGen), a new state-of-the-art, fast pipeline for text-to-3D asset generation. 3DGen offers 3D asset creation with high prompt fidelity and high-quality 3D shapes and textures in under a minute. It supports physically-based rendering (PBR), necessary for 3D asset relighting in real-world applications. Additionally, 3DGen supports generative retexturing of previously generated (or artist-created) 3D shapes using additional textual inputs provided by the user. 3DGen integrates key technical components, Meta 3D AssetGen and Meta 3D TextureGen, that we developed for text-to-3D and text-to-texture generation, respectively. By combining their strengths, 3DGen represents 3D objects simultaneously in three ways: in view space, in volumetric space, and in UV (or texture) space. The integration of these two techniques achieves a win rate of 68% with respect to the single-stage model. We compare 3DGen to numerous industry baselines, and show that it outperforms them in terms of prompt fidelity and visual quality for complex textual prompts, while being significantly faster.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
Meta 3D AssetGen: Text-to-Mesh Generation with High-Quality Geometry, Texture, and PBR Materials
Authors:
Yawar Siddiqui,
Tom Monnier,
Filippos Kokkinos,
Mahendra Kariya,
Yanir Kleiman,
Emilien Garreau,
Oran Gafni,
Natalia Neverova,
Andrea Vedaldi,
Roman Shapovalov,
David Novotny
Abstract:
We present Meta 3D AssetGen (AssetGen), a significant advancement in text-to-3D generation which produces faithful, high-quality meshes with texture and material control. Compared to works that bake shading in the 3D object's appearance, AssetGen outputs physically-based rendering (PBR) materials, supporting realistic relighting. AssetGen generates first several views of the object with factored s…
▽ More
We present Meta 3D AssetGen (AssetGen), a significant advancement in text-to-3D generation which produces faithful, high-quality meshes with texture and material control. Compared to works that bake shading in the 3D object's appearance, AssetGen outputs physically-based rendering (PBR) materials, supporting realistic relighting. AssetGen generates first several views of the object with factored shaded and albedo appearance channels, and then reconstructs colours, metalness and roughness in 3D, using a deferred shading loss for efficient supervision. It also uses a sign-distance function to represent 3D shape more reliably and introduces a corresponding loss for direct shape supervision. This is implemented using fused kernels for high memory efficiency. After mesh extraction, a texture refinement transformer operating in UV space significantly improves sharpness and details. AssetGen achieves 17% improvement in Chamfer Distance and 40% in LPIPS over the best concurrent work for few-view reconstruction, and a human preference of 72% over the best industry competitors of comparable speed, including those that support PBR. Project page with generated assets: https://assetgen.github.io
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
GOEmbed: Gradient Origin Embeddings for Representation Agnostic 3D Feature Learning
Authors:
Animesh Karnewar,
Roman Shapovalov,
Tom Monnier,
Andrea Vedaldi,
Niloy J. Mitra,
David Novotny
Abstract:
Encoding information from 2D views of an object into a 3D representation is crucial for generalized 3D feature extraction. Such features can then enable 3D reconstruction, 3D generation, and other applications. We propose GOEmbed (Gradient Origin Embeddings) that encodes input 2D images into any 3D representation, without requiring a pre-trained image feature extractor; unlike typical prior approa…
▽ More
Encoding information from 2D views of an object into a 3D representation is crucial for generalized 3D feature extraction. Such features can then enable 3D reconstruction, 3D generation, and other applications. We propose GOEmbed (Gradient Origin Embeddings) that encodes input 2D images into any 3D representation, without requiring a pre-trained image feature extractor; unlike typical prior approaches in which input images are either encoded using 2D features extracted from large pre-trained models, or customized features are designed to handle different 3D representations; or worse, encoders may not yet be available for specialized 3D neural representations such as MLPs and hash-grids. We extensively evaluate our proposed GOEmbed under different experimental settings on the OmniObject3D benchmark. First, we evaluate how well the mechanism compares against prior encoding mechanisms on multiple 3D representations using an illustrative experiment called Plenoptic-Encoding. Second, the efficacy of the GOEmbed mechanism is further demonstrated by achieving a new SOTA FID of 22.12 on the OmniObject3D generation task using a combination of GOEmbed and DFM (Diffusion with Forward Models), which we call GOEmbedFusion. Finally, we evaluate how the GOEmbed mechanism bolsters sparse-view 3D reconstruction pipelines.
△ Less
Submitted 15 July, 2024; v1 submitted 14 December, 2023;
originally announced December 2023.
-
Replay: Multi-modal Multi-view Acted Videos for Casual Holography
Authors:
Roman Shapovalov,
Yanir Kleiman,
Ignacio Rocco,
David Novotny,
Andrea Vedaldi,
Changan Chen,
Filippos Kokkinos,
Ben Graham,
Natalia Neverova
Abstract:
We introduce Replay, a collection of multi-view, multi-modal videos of humans interacting socially. Each scene is filmed in high production quality, from different viewpoints with several static cameras, as well as wearable action cameras, and recorded with a large array of microphones at different positions in the room. Overall, the dataset contains over 4000 minutes of footage and over 7 million…
▽ More
We introduce Replay, a collection of multi-view, multi-modal videos of humans interacting socially. Each scene is filmed in high production quality, from different viewpoints with several static cameras, as well as wearable action cameras, and recorded with a large array of microphones at different positions in the room. Overall, the dataset contains over 4000 minutes of footage and over 7 million timestamped high-resolution frames annotated with camera poses and partially with foreground masks. The Replay dataset has many potential applications, such as novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, human body and face analysis, and training generative models. We provide a benchmark for training and evaluating novel-view synthesis, with two scenarios of different difficulty. Finally, we evaluate several baseline state-of-the-art methods on the new benchmark.
△ Less
Submitted 22 July, 2023;
originally announced July 2023.
-
Novel-View Acoustic Synthesis
Authors:
Changan Chen,
Alexander Richard,
Roman Shapovalov,
Vamsi Krishna Ithapu,
Natalia Neverova,
Kristen Grauman,
Andrea Vedaldi
Abstract:
We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benc…
▽ More
We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benchmark this task, we collect two first-of-their-kind large-scale multi-view audio-visual datasets, one synthetic and one real. We show that our model successfully reasons about the spatial cues and synthesizes faithful audio on both datasets. To our knowledge, this work represents the very first formulation, dataset, and approach to solve the novel-view acoustic synthesis task, which has exciting potential applications ranging from AR/VR to art and design. Unlocked by this work, we believe that the future of novel-view synthesis is in multi-modal learning from videos.
△ Less
Submitted 24 October, 2023; v1 submitted 20 January, 2023;
originally announced January 2023.
-
Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories
Authors:
Samarth Sinha,
Roman Shapovalov,
Jeremy Reizenstein,
Ignacio Rocco,
Natalia Neverova,
Andrea Vedaldi,
David Novotny
Abstract:
Obtaining photorealistic reconstructions of objects from sparse views is inherently ambiguous and can only be achieved by learning suitable reconstruction priors. Earlier works on sparse rigid object reconstruction successfully learned such priors from large datasets such as CO3D. In this paper, we extend this approach to dynamic objects. We use cats and dogs as a representative example and introd…
▽ More
Obtaining photorealistic reconstructions of objects from sparse views is inherently ambiguous and can only be achieved by learning suitable reconstruction priors. Earlier works on sparse rigid object reconstruction successfully learned such priors from large datasets such as CO3D. In this paper, we extend this approach to dynamic objects. We use cats and dogs as a representative example and introduce Common Pets in 3D (CoP3D), a collection of crowd-sourced videos showing around 4,200 distinct pets. CoP3D is one of the first large-scale datasets for benchmarking non-rigid 3D reconstruction "in the wild". We also propose Tracker-NeRF, a method for learning 4D reconstruction from our dataset. At test time, given a small number of video frames of an unseen object, Tracker-NeRF predicts the trajectories of its 3D points and generates new views, interpolating viewpoint and time. Results on CoP3D reveal significantly better non-rigid new-view synthesis performance than existing baselines.
△ Less
Submitted 7 November, 2022;
originally announced November 2022.
-
Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction
Authors:
Jeremy Reizenstein,
Roman Shapovalov,
Philipp Henzler,
Luca Sbordone,
Patrick Labatut,
David Novotny
Abstract:
Traditional approaches for learning 3D object categories have been predominantly trained and evaluated on synthetic datasets due to the unavailability of real 3D-annotated category-centric data. Our main goal is to facilitate advances in this field by collecting real-world data in a magnitude similar to the existing synthetic counterparts. The principal contribution of this work is thus a large-sc…
▽ More
Traditional approaches for learning 3D object categories have been predominantly trained and evaluated on synthetic datasets due to the unavailability of real 3D-annotated category-centric data. Our main goal is to facilitate advances in this field by collecting real-world data in a magnitude similar to the existing synthetic counterparts. The principal contribution of this work is thus a large-scale dataset, called Common Objects in 3D, with real multi-view images of object categories annotated with camera poses and ground truth 3D point clouds. The dataset contains a total of 1.5 million frames from nearly 19,000 videos capturing objects from 50 MS-COCO categories and, as such, it is significantly larger than alternatives both in terms of the number of categories and objects. We exploit this new dataset to conduct one of the first large-scale "in-the-wild" evaluations of several new-view-synthesis and category-centric 3D reconstruction methods. Finally, we contribute NerFormer - a novel neural rendering method that leverages the powerful Transformer to reconstruct an object given a small number of its views. The CO3D dataset is available at https://github.com/facebookresearch/co3d .
△ Less
Submitted 1 September, 2021;
originally announced September 2021.
-
DensePose 3D: Lifting Canonical Surface Maps of Articulated Objects to the Third Dimension
Authors:
Roman Shapovalov,
David Novotny,
Benjamin Graham,
Patrick Labatut,
Andrea Vedaldi
Abstract:
We tackle the problem of monocular 3D reconstruction of articulated objects like humans and animals. We contribute DensePose 3D, a method that can learn such reconstructions in a weakly supervised fashion from 2D image annotations only. This is in stark contrast with previous deformable reconstruction methods that use parametric models such as SMPL pre-trained on a large dataset of 3D object scans…
▽ More
We tackle the problem of monocular 3D reconstruction of articulated objects like humans and animals. We contribute DensePose 3D, a method that can learn such reconstructions in a weakly supervised fashion from 2D image annotations only. This is in stark contrast with previous deformable reconstruction methods that use parametric models such as SMPL pre-trained on a large dataset of 3D object scans. Because it does not require 3D scans, DensePose 3D can be used for learning a wide range of articulated categories such as different animal species. The method learns, in an end-to-end fashion, a soft partition of a given category-specific 3D template mesh into rigid parts together with a monocular reconstruction network that predicts the part motions such that they reproject correctly onto 2D DensePose-like surface annotations of the object. The decomposition of the object into parts is regularized by expressing part assignments as a combination of the smooth eigenfunctions of the Laplace-Beltrami operator. We show significant improvements compared to state-of-the-art non-rigid structure-from-motion baselines on both synthetic and real data on categories of humans and animals.
△ Less
Submitted 31 August, 2021;
originally announced September 2021.
-
Unsupervised Learning of 3D Object Categories from Videos in the Wild
Authors:
Philipp Henzler,
Jeremy Reizenstein,
Patrick Labatut,
Roman Shapovalov,
Tobias Ritschel,
Andrea Vedaldi,
David Novotny
Abstract:
Our goal is to learn a deep network that, given a small number of images of an object of a given category, reconstructs it in 3D. While several recent works have obtained analogous results using synthetic data or assuming the availability of 2D primitives such as keypoints, we are interested in working with challenging real data and with no manual annotations. We thus focus on learning a model fro…
▽ More
Our goal is to learn a deep network that, given a small number of images of an object of a given category, reconstructs it in 3D. While several recent works have obtained analogous results using synthetic data or assuming the availability of 2D primitives such as keypoints, we are interested in working with challenging real data and with no manual annotations. We thus focus on learning a model from multiple views of a large collection of object instances. We contribute with a new large dataset of object centric videos suitable for training and benchmarking this class of models. We show that existing techniques leveraging meshes, voxels, or implicit surfaces, which work well for reconstructing isolated objects, fail on this challenging data. Finally, we propose a new neural network design, called warp-conditioned ray embedding (WCR), which significantly improves reconstruction while obtaining a detailed implicit representation of the object surface and texture, also compensating for the noise in the initial SfM reconstruction that bootstrapped the learning process. Our evaluation demonstrates performance improvements over several deep monocular reconstruction baselines on existing benchmarks and on our novel dataset.
△ Less
Submitted 30 March, 2021;
originally announced March 2021.
-
Canonical 3D Deformer Maps: Unifying parametric and non-parametric methods for dense weakly-supervised category reconstruction
Authors:
David Novotny,
Roman Shapovalov,
Andrea Vedaldi
Abstract:
We propose the Canonical 3D Deformer Map, a new representation of the 3D shape of common object categories that can be learned from a collection of 2D images of independent objects. Our method builds in a novel way on concepts from parametric deformation models, non-parametric 3D reconstruction, and canonical embeddings, combining their individual advantages. In particular, it learns to associate…
▽ More
We propose the Canonical 3D Deformer Map, a new representation of the 3D shape of common object categories that can be learned from a collection of 2D images of independent objects. Our method builds in a novel way on concepts from parametric deformation models, non-parametric 3D reconstruction, and canonical embeddings, combining their individual advantages. In particular, it learns to associate each image pixel with a deformation model of the corresponding 3D object point which is canonical, i.e. intrinsic to the identity of the point and shared across objects of the category. The result is a method that, given only sparse 2D supervision at training time, can, at test time, reconstruct the 3D shape and texture of objects from single views, while establishing meaningful dense correspondences between object instances. It also achieves state-of-the-art results in dense 3D reconstruction on public in-the-wild datasets of faces, cars, and birds.
△ Less
Submitted 6 December, 2020; v1 submitted 28 August, 2020;
originally announced August 2020.
-
Multi-utility Learning: Structured-output Learning with Multiple Annotation-specific Loss Functions
Authors:
Roman Shapovalov,
Dmitry Vetrov,
Anton Osokin,
Pushmeet Kohli
Abstract:
Structured-output learning is a challenging problem; particularly so because of the difficulty in obtaining large datasets of fully labelled instances for training. In this paper we try to overcome this difficulty by presenting a multi-utility learning framework for structured prediction that can learn from training instances with different forms of supervision. We propose a unified technique for…
▽ More
Structured-output learning is a challenging problem; particularly so because of the difficulty in obtaining large datasets of fully labelled instances for training. In this paper we try to overcome this difficulty by presenting a multi-utility learning framework for structured prediction that can learn from training instances with different forms of supervision. We propose a unified technique for inferring the loss functions most suitable for quantifying the consistency of solutions with the given weak annotation. We demonstrate the effectiveness of our framework on the challenging semantic image segmentation problem for which a wide variety of annotations can be used. For instance, the popular training datasets for semantic segmentation are composed of images with hard-to-generate full pixel labellings, as well as images with easy-to-obtain weak annotations, such as bounding boxes around objects, or image-level labels that specify which object categories are present in an image. Experimental evaluation shows that the use of annotation-specific loss functions dramatically improves segmentation accuracy compared to the baseline system where only one type of weak annotation is used.
△ Less
Submitted 23 June, 2014;
originally announced June 2014.