Multimodal Conditional 3D Face Geometry Generation

Christopher Otto (ORCID 0000-0002-5625-593X), ETH Zürich, Switzerland, and DisneyResearch|Studios, Switzerland, christopher.otto@inf.ethz.ch
Prashanth Chandran (ORCID 0000-0001-6821-5815), DisneyResearch|Studios, Switzerland, prashanth.chandran@disneyresearch.com
Sebastian Weiss (ORCID 0000-0003-4399-3180), DisneyResearch|Studios, Switzerland, sebastian.weiss@disneyresearch.com
Markus Gross (ORCID 0009-0003-9324-779X), ETH Zürich, Switzerland, and DisneyResearch|Studios, Switzerland, gross@disneyresearch.com
Gaspard Zoss (ORCID 0000-0002-0022-8203), DisneyResearch|Studios, Switzerland, gaspard.zoss@disneyresearch.com
Derek Bradley (ORCID 0000-0002-2055-9325), DisneyResearch|Studios, Switzerland, derek.bradley@disneyresearch.com
Abstract.

We present a new method for multimodal conditional 3D face geometry generation that allows user-friendly control over the output identity and expression via a number of different conditioning signals. Within a single model, we demonstrate 3D faces generated from artistic sketches, 2D face landmarks, Canny edges, FLAME face model parameters, portrait photos, or text prompts. Our approach is based on a diffusion process that generates 3D geometry in a 2D parameterized UV domain. Geometry generation passes each conditioning signal through a set of cross-attention layers (IP-Adapter), one set for each user-defined conditioning signal. The result is an easy-to-use 3D face generation tool that produces high resolution geometry with fine-grain user control.

Multimodal Generation, 3D Face Geometry, Deep Learning
ccs: Computing methodologies Mesh geometry models
Figure 1. We propose a novel method for diffusion-based controllable 3D face geometry generation that allows for controlling the results via several conditioning modes: artistic sketches, 2D facial landmarks, Canny edges, FLAME model parameters, portrait photos and text.

1. Introduction

The creation of 3D facial geometry for digital human characters is a modeling task that usually requires tremendous artistic skill. Digital sculpting with 3D modeling tools is a time-consuming and demanding process, especially when the target is as recognizable as a human face. This complexity has prompted research into data-driven sculpting methods (Gruber et al., 2020) and other, more user-friendly, interactive interfaces (Kim et al., 2015).

Several common morphable 3D face models (e.g. FLAME (Li et al., 2017)) simplify the facial modeling task by providing a shape subspace to operate in, as well as simple parameters to control the identity and expression geometry without the need for 3D modeling skills, but they are limited in expressiveness and offer only basic control knobs.

Recent methods can create high quality 3D geometry and textures from text prompts (Zhang et al., 2023a; Wu et al., 2023) via optimization, leveraging large pre-trained text-to-image diffusion models (Rombach et al., 2022). These methods allow layman users to create 3D faces through natural text descriptions. While this is a powerful approach, it can still be difficult to achieve a particular output through text description (Koley et al., 2024). Some concepts like the specific curvature of a face or a unique facial expression are much easier to convey via sketches, edge contours or portrait photos than through text.

In the image domain, approaches like ControlNet (Zhang et al., 2023b) or T2I-Adapter (Mou et al., 2023) have demonstrated controllable image generation beyond text using sketches, images, or edge maps as conditioning signals. These methods provide users with much more fine-grained control over the generation process than text-based methods alone. Ye et al. (2023) propose IP-Adapters to control Stable Diffusion (Rombach et al., 2022) with image prompts by learning new cross-attention layers. However, image-based methods are not easy to extend to 3D facial geometry generation.

We present a flexible new method for 3D facial geometry generation that creates high quality faces from any one of various inputs, including sketches, 2D landmarks, Canny edges, portrait photos, FLAME face model parameters and text. Our approach is to train a conditional diffusion model on a high quality 3D facial dataset constructed from high resolution scans (Chandran et al., 2020) represented in the 2D UV domain. Our model is trained from scratch, without the need for a pre-trained foundation model like Stable Diffusion. To condition our model, we train one set of cross-attention layers for each type of conditioning input, following IP-Adapter (Ye et al., 2023). First, the diffusion model learns to inject FLAME parameters via the original UNet cross-attention layers. We then freeze the diffusion model while training additional sets of cross-attention layers (e.g. one each for artist sketches, 2D landmarks, portrait photos, etc.). An interesting side effect of the FLAME-conditioned model is that it allows us to re-interpret FLAME-parameterized faces in a generative sense, providing a space of high resolution stochastic variations on top of the traditional low resolution FLAME model.

Our method allows for fast and user-friendly creation of 3D digital character faces, controlled by the input mode preferred by the user, all within a single model. We demonstrate several applications of our method including sketch-based 3D face modeling, geometry from 2D facial landmarks, Canny edges, or portrait photos, text-to-3D facial geometry, and finally, extending the FLAME model space by allowing stochastic diffusion sampling conditioned on the same semantic parameters.

2. Related Work

In the following, we present relevant related work on 3D face geometry generation with diffusion models, as well as on injecting additional control modes into diffusion models.

2.1. 3D Face Geometry Generation

Recent work uses diffusion models to control the generation of novel 3D face geometry. ShapeFusion (Potamias et al., 2024) generates face geometry by running the diffusion process directly on the mesh input vertices. It allows unconditional and conditional face geometry generation and supports various editing operations on a given mesh, based on selected vertices (anchor points). However, it does not support conditioning signals beyond vertices. 4DFM (Zou et al., 2024) trains an unconditional diffusion model on a set of sparse 3D landmarks for facial expression generation. It can generate dynamic facial expression sequences based on 3D landmarks by retargeting the landmarks to a mesh after the diffusion process. While they support conditioning with different signals such as expression labels and text, they achieve control via classifier-guidance (Dhariwal and Nichol, 2021) which requires training additional classifiers on noisy data.

Other methods focus on 3D face or head avatars, which can generate 3D geometry and texture. Rodin (Wang et al., 2023) can generate triplane-based head geometry with text or image conditioning, but the resulting geometry is extracted with Marching Cubes (Lorensen and Cline, 1987) and thus does not share the same topology across generations. DreamFace (Zhang et al., 2023a), Wu et al. (2023) and Bergman et al. (2023) propose pipelines that generate 3D face geometry and textures from text. The geometry is created by optimizing 3D Morphable Model (3DMM) (Blanz and Vetter, 1999) parameters using a Score Distillation Sampling (SDS) loss (Poole et al., 2023). During the optimization, the SDS loss uses the feedback from a pre-trained text-to-image latent diffusion model to update the 3DMM parameters given a geometry render. This setup requires a differentiable renderer and often generates over-smoothed results with little detail when compared to diffusion sampling (Zhou et al., 2024). Additionally, these methods require Stable Diffusion (SD) (Rombach et al., 2022), which was pre-trained on billions of images (Schuhmann et al., 2022), to guide their generations.

In our work, we generate controllable 3D face geometry in a single common topology from several different conditioning modes without reliance on SDS optimization, classifier-guidance or a SD prior.

2.2. Multimodal Conditional Image Generation

Image generation with diffusion models can be controlled with conditioning modes that are different from text such as sketches (Voynov et al., 2023), Canny edges (Canny, 1986), RGB images (Ye et al., 2023), expression parameters (Kirschstein et al., 2024) or face shape (Ding et al., 2023; Gu et al., 2024). To control pre-trained diffusion models with new modes, ControlNet (Zhang et al., 2023b) introduces a trainable copy of the diffusion model's UNet encoding blocks, which take the new conditioning as input. The output of the copied model is added to the skip-connections of the frozen pre-trained diffusion model. A separate trainable copy of the UNet encoding blocks (361M parameters) is created per conditioning mode. T2I-Adapter (Mou et al., 2023) aligns the internal knowledge of a pre-trained diffusion model with new control modes by proposing a small adapter network that achieves control similar to ControlNet, while requiring fewer parameters (77M). IP-Adapter (Ye et al., 2023) injects each conditioning via separate cross-attention layers (Vaswani et al., 2017) while requiring even fewer parameters (22M). It introduces new cross-attention layers whose outputs are added to those of the original UNet. Ye et al. (2023) show that the diffusion model follows the added conditioning signal closely when it comes through the newly trained cross-attention layers. In general, adapters can add control with new conditioning modes even long after the training of the underlying base diffusion model has concluded. They can also add new conditioning modes for which only limited paired training data is available, because the underlying diffusion model is frozen, avoiding issues such as catastrophic forgetting (Kirkpatrick et al., 2017; Zhang et al., 2023b).
While pre-trained text-to-image diffusion models understand the RGB image domain and can generate images conditioned on a large variety of input modalities, they are not trivially extendable to generating 3D face geometry instead. We represent 3D face geometry in the 2D UV domain, which enables us to train a 2D diffusion model that can incorporate new conditioning modes using IP-Adapters. However, even in the 2D UV domain, 3D geometry is still far from what pre-trained text-to-image diffusion models have seen before. Thus, we train our diffusion model from scratch.

3. Multimodal 3D Face Geometry Generation

Figure 2. Our pipeline for diffusion-based controllable 3D face geometry generation, which uses a delta UV position map representation $\Delta\mathbf{T}$ to generate results. We can control the results with several conditioning modes (i.e. FLAME parameters $\mathbf{c}_f$, sketches $\mathbf{c}_1$, Canny edges $\mathbf{c}_2$, 2D landmarks $\mathbf{c}_3$, portrait photos $\mathbf{c}_4$ and text $\mathbf{c}_5$).

We propose a novel method for diffusion-based controllable 3D face geometry generation that allows for controlling the results with several conditioning modes (i.e. sketches, 2D landmarks, Canny edges, FLAME parameters, portrait photos and text).

Our method consists of four components. First, we create a dataset of 3D faces, where each face is represented as a UV position map describing the vertex positions (Section 3.1). This data representation can be easily processed with 2D convolutional neural networks. Second, we train a variational autoencoder (VAE), which compresses our UV position map face data into a latent space representation (Section 3.2). Third, we train a latent diffusion model (LDM), which learns a non-linear, deep controllable face model in latent space (Section 3.3). Fourth, we learn mode-specific cross-attention layers (IP-Adapters) with the ability to transform and inject conditioning modes into the LDM for controllable 3D face geometry generation (Section 3.4). Each of the components is explained in the following and visualized in Fig. 2.

3.1. Dataset and Geometry Representation

To represent our face geometries, we generate a novel dataset based on the 3D scan data acquired by Chandran et al. (2020), where all faces are stabilized and in topological correspondence. In total, we use 323 identities in our training dataset, where each identity shows 24 different facial expressions (7752 examples). We subtract a template face shape $\mathbf{T}$ from all faces in the dataset and thus represent each individual face as a delta from the template face. The computed delta representation $\Delta\mathbf{T}$ reduces artifacts in the generated 3D face geometry when compared to the full face representation (Fig. 3). We transform each delta face into a vertex delta position map in UV space, which is suitable for being processed by neural networks (Feng et al., 2018; Otto et al., 2022). This representation records the x, y, z coordinates of the face geometry within a 3-channel image, similar to traditional color texture maps, but instead of an RGB value at each pixel we store the x, y, z delta values. To improve generalization, we augment our existing geometry training data with synthetic identities, which we generate via identity interpolation (50k examples) and by mixing face parts of different identities together (Funkhouser et al., 2004) (150k examples). Adding the augmentations during LDM training improves the generalization to novel identities when conditioning on the FLAME parameters (Table 1). We use the parameter space of a common 3D morphable face model (FLAME (Li et al., 2017)) as a base conditioning because we can fit FLAME to the scan data and to the augmented data and thus generate a large dataset of paired geometry-FLAME parameter data. Additionally, we create paired training data for several conditioning modes that only have limited paired data available. For example, portrait photos are only available for the scans from the dataset of Chandran et al. (2020), but not for the augmentation data. However, it is possible to inject new modalities with limited paired data by training new cross-attention layers while keeping the LDM frozen, as shown by Ye et al. (2023) (and described in Section 3.4).
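The delta position map and the identity-interpolation augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the nearest-pixel scatter (a production pipeline would more likely rasterize triangles), and the UV orientation (v indexing image rows) are all assumptions.

```python
import numpy as np

def delta_position_map(vertices, template, uv, resolution=256):
    """Scatter per-vertex deltas (vertices - template) into a 3-channel UV image.

    vertices, template: (N, 3) arrays in topological correspondence.
    uv: (N, 2) array of UV coordinates in [0, 1] (assumed: u -> column, v -> row).
    """
    delta = vertices - template  # (N, 3) offsets from the template face
    # Nearest-pixel lookup; an assumption made to keep the sketch short.
    cols = np.clip(np.rint(uv[:, 0] * (resolution - 1)).astype(int), 0, resolution - 1)
    rows = np.clip(np.rint(uv[:, 1] * (resolution - 1)).astype(int), 0, resolution - 1)
    image = np.zeros((resolution, resolution, 3), dtype=np.float32)
    image[rows, cols] = delta  # channels store (dx, dy, dz) instead of RGB
    return image

def interpolate_identities(delta_a, delta_b, weight):
    """Hypothetical augmentation: linear blend of two identities' delta maps."""
    return (1.0 - weight) * delta_a + weight * delta_b
```

Because all faces share one topology, blending two delta maps pixel by pixel is equivalent to blending the corresponding vertex offsets, which is what makes this form of augmentation cheap.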

Figure 3. On the left is the generated geometry when training on the full vertex map representation, which shows visible artifacts (e.g., on the eyelid). Our proposed training with the delta vertex maps (right) removes such artifacts.

3.2. Variational Autoencoder

To reduce the computational requirements for the diffusion model, we downsample our $256^2$ UV position map data by a factor of four into a $64^2$ latent space using a variational autoencoder (VAE) (Kingma and Welling, 2014) consisting of an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$. We train the VAE from scratch following the autoencoder loss function and architecture as it is presented in related work (Rombach et al., 2022; Esser et al., 2021). Specifically, we use the VQ-GAN (Esser et al., 2021) autoencoder loss:

$$\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{rec}} + \mathcal{L}_{\text{GAN}} + \mathcal{L}_{\text{reg}}. \tag{1}$$

$\mathcal{L}_{\text{rec}}$ consists of a pixel-wise L1 loss and an LPIPS (Zhang et al., 2018) perceptual loss. It compares the input UV position maps to the reconstructions through the VAE. $\mathcal{L}_{\text{GAN}}$ evaluates inputs $x$ and reconstructions $\mathcal{D}(\mathcal{E}(x))$ with a patch-based discriminator (Isola et al., 2017) and $\mathcal{L}_{\text{reg}}$ employs a codebook loss which serves as a latent space regularizer. Please refer to Esser et al. (2021) for more details.

3.3. Latent Diffusion Model

Next, we train a latent diffusion model (LDM) (Rombach et al., 2022), that learns to generate latent UV position maps $\mathbf{z} = \mathcal{E}(\mathbf{x})$. To train the LDM, a forward diffusion process is defined as a Markov chain, which noises the latents $\mathbf{z}$ following a fixed noise schedule of $T$ uniformly sampled timesteps. At the last time step $T$, the distribution is Gaussian. We can directly sample $\mathbf{z}_t$ at an arbitrary timestep $t$ by:

$$\mathbf{z}_t(\mathbf{z}_0, \boldsymbol{\epsilon}) = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \tag{2}$$

where $1-\bar{\alpha}_t$ describes the variance of the noise and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$ according to a fixed noise schedule.
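As a minimal sketch of the closed-form forward process in Eq. 2, the snippet below builds $\bar{\alpha}_t$ from a schedule and draws $\mathbf{z}_t$ directly from $\mathbf{z}_0$. The linear schedule with $\beta \in [10^{-4}, 0.02]$ is the default from Ho et al. (2020) and is an assumption here, since the text only states that the schedule is fixed; the function names are illustrative.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Assumed schedule (DDPM default); the paper only says the schedule is fixed."""
    return np.linspace(beta_start, beta_end, num_steps)

def alpha_bar(betas):
    """abar_t = prod_{s=1..t} alpha_s with alpha_s = 1 - beta_s."""
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alpha_bar_values, rng):
    """Draw z_t from Eq. 2 for a 1-indexed timestep t; also returns the noise used."""
    eps = rng.standard_normal(z0.shape)
    abar_t = alpha_bar_values[t - 1]
    z_t = np.sqrt(abar_t) * z0 + np.sqrt(1.0 - abar_t) * eps
    return z_t, eps
```

Under this assumed schedule $\bar{\alpha}_T$ is close to zero for $T = 1000$, so $\mathbf{z}_T$ is dominated by the noise term, consistent with the statement that the distribution at the last timestep is Gaussian.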

We learn to predict the noise $\boldsymbol{\epsilon}$ that was added to a noisy latent image $\mathbf{z}_t$ following Ho et al. (2020):

$$\mathcal{L}_{\text{LDM}} = \mathbb{E}_{\mathbf{z}_0, \mathbf{c}_f, t, \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})}\left[ \lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, \mathbf{c}_f, t) \rVert_2^2 \right], \tag{3}$$

where $t$ is the timestep, $\mathbf{c}_f = \boldsymbol{\rho}_\phi(\mathbf{y})$ is a FLAME parameter conditioning and $\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, \mathbf{c}_f, t)$ is the UNet (Ronneberger et al., 2015) neural network with parameters $\theta$. The FLAME conditioning $\mathbf{c}_f$ is obtained by fitting FLAME to the face geometry encoded in $\mathbf{z}_t$ and mapping the fitted parameters $\mathbf{y}$ through an MLP $\boldsymbol{\rho}_\phi$.
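A single-sample Monte Carlo estimate of the objective in Eq. 3 can be sketched as below. The callable `eps_model` is a hypothetical stand-in for the UNet $\boldsymbol{\epsilon}_\theta$, and the mean over elements differs from the squared L2 norm in Eq. 3 only by a constant factor; both choices are assumptions made for illustration.

```python
import numpy as np

def ldm_loss(eps_model, z0, c_f, alpha_bar_values, rng):
    """One-sample estimate of the noise-prediction loss for a clean latent z0.

    eps_model(z_t, c_f, t) is a placeholder for the conditional UNet.
    """
    num_steps = len(alpha_bar_values)
    t = int(rng.integers(1, num_steps + 1))  # uniform timestep in {1, ..., T}
    eps = rng.standard_normal(z0.shape)      # target noise
    abar_t = alpha_bar_values[t - 1]
    z_t = np.sqrt(abar_t) * z0 + np.sqrt(1.0 - abar_t) * eps  # forward process
    residual = eps - eps_model(z_t, c_f, t)
    # Mean squared error; proportional to the squared L2 norm in the objective.
    return float(np.mean(residual ** 2))
```

In an actual training loop this value would be averaged over a batch and differentiated with respect to $\theta$ and $\phi$ in an autodiff framework; NumPy is used here only to make the computation explicit.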

During inference (reverse diffusion process) we generate latent 2D UV position maps from the model distribution. We start from $\mathbf{z}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iteratively compute less noisy latents until we reach a clean latent sample $\mathbf{z}_0$. Sampling following DDPM (Ho et al., 2020) or DDIM (Song et al., 2021) computes $\mathbf{z}_{t-1}$ from $\mathbf{z}_t$ based on the UNet output.
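To make the reverse process concrete, the following sketch implements the deterministic DDIM update (Song et al., 2021, with zero added noise) over a strided subset of timesteps. It is an illustrative variant rather than the exact sampler used for the reported results, and `eps_model` is a hypothetical placeholder that omits the conditioning arguments for brevity.

```python
import numpy as np

def ddim_step(z_t, eps_pred, abar_t, abar_prev):
    """Deterministic DDIM update from noise level abar_t to abar_prev."""
    # Estimate the clean latent implied by the current noise prediction.
    z0_pred = (z_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Re-noise the estimate to the (lower) target noise level.
    return np.sqrt(abar_prev) * z0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

def sample(eps_model, shape, alpha_bar_values, num_steps, rng):
    """Run num_steps strided updates from z_T ~ N(0, I) down to z_0."""
    total = len(alpha_bar_values)
    # 1-indexed timesteps in descending order, e.g. 1000, ..., 1.
    timesteps = np.linspace(total, 1, num_steps).round().astype(int)
    z = rng.standard_normal(shape)
    for i, t in enumerate(timesteps):
        abar_t = alpha_bar_values[t - 1]
        # After the last step no noise remains, i.e. abar_prev = 1.
        abar_prev = alpha_bar_values[timesteps[i + 1] - 1] if i + 1 < num_steps else 1.0
        z = ddim_step(z, eps_model(z, int(t)), abar_t, abar_prev)
    return z
```

The resulting latent would then be passed through the VAE decoder $\mathcal{D}$ to obtain the UV position map.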

3.4. Multimodal Conditional Generation

To control the generations with additional conditioning modes (beyond the FLAME parameters $\mathbf{c}_f$), we train different sets of cross-attention layers, following IP-Adapter (Ye et al., 2023). The LDM itself is kept frozen. In this way, we can integrate novel conditioning modes post-LDM training even with limited paired mode-geometry data. We train one set for each of the following conditioning modes: sketches $\mathbf{c}_1$, Canny edges $\mathbf{c}_2$, 2D landmarks $\mathbf{c}_3$, and portrait photos $\mathbf{c}_4$. The output of the new cross-attention layers is added to the outputs of the existing LDM cross-attention layers, thereby injecting the new conditioning signal into the generation process:

$$\mathbf{Z} = \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) + \text{Attention}'(\mathbf{Q}, \mathbf{K}'_m, \mathbf{V}'_m), \tag{4}$$

where $\mathbf{Q}$ are the intermediate UNet query features, $\mathbf{K}$ and $\mathbf{V}$ are keys and values for our FLAME conditioning $\mathbf{c}_f$, and $\mathbf{K}'_m$, $\mathbf{V}'_m$ are keys and values for the newly injected modality $\mathbf{c}_m$.

$$\begin{aligned}
\mathbf{K} &= \mathbf{c}_f \cdot \mathbf{W}_k, & \mathbf{V} &= \mathbf{c}_f \cdot \mathbf{W}_v, \\
\mathbf{K}'_m &= \mathbf{c}_m \cdot \mathbf{W}'_{k,m}, & \mathbf{V}'_m &= \mathbf{c}_m \cdot \mathbf{W}'_{v,m}
\end{aligned} \tag{5}$$

Here, $\mathbf{W}'_{k,m}$ and $\mathbf{W}'_{v,m}$ represent the newly added weights that are updated during training.
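Eqs. 4 and 5 can be sketched in a few lines. The version below assumes single-head scaled dot-product attention (Vaswani et al., 2017) and uses illustrative argument names; a real UNet block would be multi-head and would include output projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention (assumed form)."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def decoupled_cross_attention(Q, c_f, c_m, W_k, W_v, W_k_m, W_v_m):
    """Z = Attention(Q, K, V) + Attention'(Q, K'_m, V'_m).

    W_k, W_v belong to the frozen LDM; W_k_m, W_v_m are the trainable
    per-mode weights. Passing c_m=None drops the second term.
    """
    K, V = c_f @ W_k, c_f @ W_v            # FLAME branch
    out = attention(Q, K, V)
    if c_m is not None:
        K_m, V_m = c_m @ W_k_m, c_m @ W_v_m  # new-modality branch
        out = out + attention(Q, K_m, V_m)
    return out
```

Both branches share the same queries $\mathbf{Q}$, so adding a mode only introduces the two per-mode projection matrices per cross-attention layer.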

Prior to passing each of the above-mentioned conditioning modes to the cross-attention layers, we pass each of them through CLIP (Radford et al., 2021) and extract a 768-dimensional global CLIP feature vector, which serves as our conditioning representation. Following IP-Adapter, we train a small projection network consisting of one linear layer and layer normalization, designed to project the CLIP feature vector into several extra context tokens before injecting it into the cross-attention layers. We use 16 tokens for each of our conditionings to allow for meaningful attention computation.
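One plausible reading of the projection network is sketched below: a single linear layer expands the 768-dimensional CLIP vector into 16 tokens, and each token is then layer-normalized. The token width, the order of reshape and normalization, and the omission of the learned scale and shift of layer normalization are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its feature dimension (no learned affine here)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def project_clip_feature(clip_feature, W, b, num_tokens=16):
    """Map a (768,) CLIP vector to (num_tokens, d) context tokens.

    W has shape (768, num_tokens * d) and b has shape (num_tokens * d,);
    the token width d is a free choice in this sketch.
    """
    tokens = (clip_feature @ W + b).reshape(num_tokens, -1)
    return layer_norm(tokens)
```

The resulting tokens play the role of $\mathbf{c}_m$ in Eq. 5, giving the attention softmax 16 keys to distribute weight over rather than a single one.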

At inference time, we can control the 3D face geometry generation with any of the modes using the respective set of cross-attention layers and classifier-free guidance (Ho and Salimans, 2021). The strength of the conditioning signal can be increased by increasing the hyperparameter $w$:

$$\hat{\boldsymbol{\epsilon}}_\theta(\mathbf{z}_t, \mathbf{c}_f, \mathbf{c}_m, t) = w\,\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, \mathbf{c}_f, \mathbf{c}_m, t) + (1-w)\,\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t) \tag{6}$$

To condition only on a newly added mode or to generate geometry unconditionally, the FLAME conditioning $\mathbf{c}_f$ is set to its null embedding. Additionally, for unconditional generation, the new cross-attention in Eq. 4, which feeds $\mathbf{c}_m$ to the diffusion model, is simply not added to the original attention. Our full method pipeline is visualized in Fig. 2.
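The guidance rule in Eq. 6 amounts to two forward passes per step, as in the sketch below. Here `eps_model` is a hypothetical callable, and passing `None` stands in for the null embeddings (for $\mathbf{c}_m$, equivalently, for skipping the added attention term).

```python
import numpy as np

def guided_noise(eps_model, z_t, c_f, c_m, t, w=1.0):
    """Classifier-free guidance: blend conditional and unconditional predictions.

    eps_model(z_t, c_f, c_m, t) is a placeholder for the UNet; None marks a
    conditioning that is replaced by its null embedding.
    """
    eps_cond = eps_model(z_t, c_f, c_m, t)
    eps_uncond = eps_model(z_t, None, None, t)
    return w * eps_cond + (1.0 - w) * eps_uncond
```

Note that $w = 1$ reduces Eq. 6 to the purely conditional prediction, while $w > 1$ extrapolates away from the unconditional one and thereby strengthens the conditioning.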

3.5. Implementation Details

For our dataset, we crop the full head face geometry to allocate more vertices to the face region, representing 50520 face vertices within each $256^2$ UV position map. At $256^2$ resolution we can represent reasonably high-resolution face geometry while being able to limit our VAE training time to 8 days using our training dataset of 7752 samples ($\sim$1.4 seconds/iteration; batch size 8). We use a learning rate of 4.5e-6 and a codebook size of 8192. Note that for training our VAE, we did not use the data augmentations described in Section 3.1 as the autoencoder was already able to reconstruct test geometries with high accuracy when trained only on the studio dataset. Next, we train our LDM for 4 days ($\sim$1.6 seconds/iteration; batch size 12) with a learning rate of 1e-4 and diffusion timesteps $T=1000$. We utilize geometry data augmentations with corresponding FLAME fits during training (+200k samples) to allow for better generalization across identities during generation. Afterwards, we train each set of cross-attention layers with a learning rate of 1e-4 for 6 days ($\sim$3.3 seconds/iteration; batch size 24) while keeping the LDM frozen. With a probability of 0.05, we randomly set either $\mathbf{c}_f$ or $\mathbf{c}_m$ or both to their null embeddings during training. This step enables classifier-free guidance at inference. Note that we do not add augmented geometry data to train the new cross-attention layers because we do not have access to paired mode-geometry data for modes such as portrait photos. All experiments were run on a single RTX A6000 GPU. Also note that the VAE and the LDM have to be trained only once and novel conditioning modes can be added by training only the new cross-attention layers. In the next section, we show and evaluate our results either on a separate held-out validation set which contains 18 different identities displaying 24 expressions each (432 samples) or on unconstrained in-the-wild test data. Unless mentioned otherwise, we generate every result by running $S=50$ DDPM sampling steps with conditioning strength $w=1$. In total, generating a geometry sample takes $\sim$11 seconds.

4. Results

We now show several results and applications of our new multimodal conditional 3D face geometry generation method. We begin by demonstrating control over the generated facial geometry using FLAME's identity and expression parameters (Section 4.1). We then demonstrate multimodal conditioned geometry generation by guiding the denoising process using sketches, sparse 2D landmarks, Canny edges and text. We evaluate the effectiveness of these different modalities in guiding the generated geometry both qualitatively and by using quantitative metrics in Section 4.2. In addition to using different conditioning modes, we show how one can also spatially restrict the guidance to a particular face region to perform precise geometry edits in Section 4.3. Finally, we generate dynamic facial performances by guiding our model from video inputs and demonstrate that our method can produce facial animations that are stable across time (Section 4.4).

4.1. Identity and Expression Parameter Conditioning

The base conditioning used to train our geometry generator consists of the identity and expression parameters from the FLAME model (Li et al., 2017). We use 300 identity parameters ($\bm{\beta}$), 100 expression parameters ($\bm{\psi}$) and 3 jaw pose parameters ($\bm{\theta}$). We combine these FLAME parameters into a 403-dimensional conditioning vector, which we pass through a 3-layer MLP with Leaky ReLU activation functions prior to injecting it into the diffusion model via cross-attention.
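As a rough illustration of how such a conditioning vector could be assembled and embedded, the sketch below concatenates the three FLAME parameter groups and passes them through a 3-layer MLP. Only the input sizes (300, 100, 3) and the use of Leaky ReLU come from the text. The hidden width, the output width, the random weights, the negative slope and the choice to leave the last layer without an activation are assumptions made for the example.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Leaky ReLU; the slope value is an assumption.
    return np.where(x > 0, x, slope * x)

def init_mlp(rng, sizes):
    # One (weight, bias) pair per layer, with simple scaled Gaussian weights.
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def flame_condition(beta, psi, theta, params):
    # Concatenate identity, expression and jaw pose: 300 + 100 + 3 = 403.
    h = np.concatenate([beta, psi, theta])
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:   # assumption: no activation after the last layer
            h = leaky_relu(h)
    return h

rng = np.random.default_rng(0)
beta = rng.normal(size=300)   # identity parameters
psi = rng.normal(size=100)    # expression parameters
theta = np.zeros(3)           # jaw pose parameters

# Hypothetical layer widths: 403 -> 512 -> 512 -> 768 (three linear layers).
params = init_mlp(rng, [403, 512, 512, 768])
cond = flame_condition(beta, psi, theta, params)
print(cond.shape)
```

The resulting embedding plays the role of the key and value input to the cross-attention layers of the denoising network.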

Recall that we do not use the geometry from the FLAME model itself to train our diffusion model. Instead, we fit the FLAME model to the high quality facial geometry from Chandran et al. (2020) only to obtain identity and expression parameters, and train the diffusion model directly on the geometry captured by Chandran et al. (2020). As there is a loss in geometric detail during this FLAME fitting step, our diffusion model is expected to compensate for the information missing in the FLAME parameters but present in the ground truth facial geometry. In other words, our geometry generator has to re-interpret the FLAME parameters such that it can reproduce the training data and therefore does not memorize the FLAME parameter space exactly.

We visualize geometries generated by our model for unseen FLAME parameters in Fig. 4. As our underlying mesh topology is different from FLAME and represents $\sim$10 times more vertices, it can express a greater level of detail that is not present in the lower resolution FLAME mesh. This high resolution detail is captured and reproduced by the denoising process. Therefore, by simply varying the noise seed, one can obtain variations of the FLAME-conditioned geometry, each of which contains different mid/high frequency details.

[Figure 4 columns: Input, seed 1, seed 2, seed 3]
Figure 4. Changing the noise input while conditioning on the FLAME parameters does not affect the identity and expression of the generated geometry, but only the stochastic details that are added on top. Our model can capture richer geometric detail that is not present in the original FLAME mesh, while still respecting FLAME’s identity and expression parameters. Each row shows a different set of FLAME parameters, with the corresponding FLAME mesh visualized in the first column. We generate samples from three different seeds which are shown in the remaining columns.

Disentangling Identity and Expression. Our diffusion model also preserves the disentanglement between facial identity and expression that is present in FLAME. In Fig. 5, we show how a smooth interpolation of FLAME’s identity and expression coefficients results in a smooth, yet nonlinear, interpolation of our generated geometry. We can observe that the facial expression remains fixed when interpolating between identities, and vice versa (please also refer to the supplemental video). To eliminate the randomness in the generation, we used the same initial noise to generate the interpolated geometries along with DDIM sampling (Song et al., 2021).
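The interpolation experiment can be sketched as below: the identity coefficients are blended linearly between two subjects while the expression and jaw pose are held fixed, and one initial noise tensor is reused for every interpolation step. The helper name `interpolate_identity` and the latent shape are hypothetical, and the deterministic sampler that would consume these inputs is omitted.

```python
import numpy as np

def interpolate_identity(beta_a, beta_b, psi, theta, steps):
    """Linearly blend identity coefficients; expression and jaw pose stay fixed."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([
        np.concatenate([(1.0 - t) * beta_a + t * beta_b, psi, theta])
        for t in ts
    ])

rng = np.random.default_rng(0)
beta_a, beta_b = rng.normal(size=300), rng.normal(size=300)  # two identities
psi, theta = rng.normal(size=100), np.zeros(3)               # shared expression

conds = interpolate_identity(beta_a, beta_b, psi, theta, steps=5)

# A single initial noise sample shared by all interpolation steps
# (hypothetical latent shape), so that only the conditioning changes.
x_T = np.random.default_rng(42).normal(size=(4, 32, 32))

print(conds.shape, x_T.shape)
```

Interpolating expressions for a fixed identity works the same way with the roles of the identity and expression coefficients swapped.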

Extreme Expressions. Due to the limited number of extreme expressions in our training datasets, our model can produce suboptimal results when conditioned on FLAME parameters corresponding to extreme expressions (a wide-open mouth, for example). However, this is purely a data limitation and could be resolved by using a larger and more balanced dataset, by oversampling such expressions during training, or by weighting the loss to focus more on extreme expressions.

Figure 5. Identity and expression disentanglement. Traversing the first dimension in FLAME’s identity space leads to smooth changes in our generated face geometry (rows 1 and 2). Similarly a linear interpolation of FLAME’s expression parameters results in a smooth, yet nonlinear, interpolation of facial expression as seen in rows 3 and 4.

4.2. Multimodal Conditioning

Beyond the underlying FLAME-based control, we introduce additional conditioning modes to control our diffusion model following Section 3.4. We now discuss the results of facial geometry generation by conditioning our diffusion model on sketches, sparse 2D landmarks, Canny edges, portrait RGB photos and text. To create facial geometry from text input, we leverage the fact that CLIP (Radford et al., 2021) embeds both RGB images and text into a shared latent space. Therefore, instead of training separate cross-attention layers for text, we re-use those learned from our portrait images and directly feed them with the CLIP vectors obtained from a text prompt.

Geometries generated for different conditionings from each of these additional modes are shown in Fig. 6. While our generated geometry follows the identity and expression seen in the input conditioning signal, the degree to which the geometry can match the conditioning varies from mode to mode. For example, as portrait photos contain more information about a face than sketches, Canny edges, landmarks or text, conditioning our model on portrait images tends to constrain the generated identity and expression more tightly (also see Table 2).

[Figure 6 rows, each input row followed by its output row: Portrait, Sketch, FLAME, Edges, Landmarks, Text]
Figure 6. Multimodal conditional generation results on our validation dataset, including conditioning on portrait images, sketches, FLAME parameters, Canny edges, 2D landmarks and text. The FLAME parameter inputs are visualized as meshes. Across all conditioning modes, our model captures the facial identity and expression reasonably well.
| V2V error | Mean ↓ | Median ↓ | Std ↓ |
|---|---|---|---|
| No augmentations | 4.093 | 3.692 | 2.170 |
| With augmentations | 3.757 | 3.352 | 2.085 |
Table 1. We compare the 3D face geometry generated by diffusion models that were trained with and without 3D data augmentations. We measure the vertex-to-vertex error (V2V) in mm between FLAME parameter conditioned generations and ground truth geometry on neutral shapes from our validation set. The model trained using data augmentation is able to capture unseen identities better. Results are averaged over three different seeds.
| V2V error | Mean ↓ | Median ↓ | Std ↓ |
|---|---|---|---|
| 2D landmarks | 6.489 | 5.765 | 2.314 |
| Canny edges | 5.938 | 5.488 | 1.897 |
| Sketch | 5.630 | 5.034 | 2.099 |
| Portrait photo | 5.394 | 4.874 | 2.025 |
| FLAME parameters | 4.772 | 4.462 | 1.317 |
Table 2. We report the vertex-to-vertex error (V2V) in mm on 178 shapes from our validation set for different types of conditioning signals. Conditions that are more descriptive of the final facial geometry, such as FLAME parameters and portrait images, achieve a lower error than the others.

To control the strength $w$ of the guiding condition, we make use of classifier-free guidance (Ho and Salimans, 2021) following Eq. 6. Fig. 7 demonstrates the effect of different guidance strengths for each condition. Increasing the guidance strength increases the effect that the conditioning has on the resulting geometry. For example, in Fig. 7, the expression of the generated geometry of the subject in the first row displays stronger wrinkles, and a closer match to the portrait image, when setting $w=3$ compared to setting it to $w=1$. Finally, despite training on purely studio data, we show how our model responds to conditionings derived from in-the-wild data in Fig. 8.
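The role of the guidance strength can be illustrated with the common single-condition form of classifier-free guidance, in which the predicted noise is extrapolated from the unconditional prediction towards the conditional one. This is a sketch under that assumption and is not a transcription of Eq. 6, which may differ, for instance in how several conditions are combined. The form is chosen so that a strength of 0 gives unconditional generation and a strength of 1 gives plain conditional generation, in line with Fig. 7. The function name `guided_noise` and the toy arrays are hypothetical.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, w):
    """Blend unconditional and conditional noise predictions.

    w = 0 returns the unconditional prediction, w = 1 the conditional
    prediction, and w > 1 extrapolates beyond the conditional prediction.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy noise predictions standing in for two passes of the denoising UNet.
eps_uncond = np.zeros((4, 32, 32))
eps_cond = np.ones((4, 32, 32))

for w in (0.0, 1.0, 3.0):
    eps = guided_noise(eps_uncond, eps_cond, w)
    print(w, float(eps.mean()))
```

In practice the two predictions come from evaluating the denoiser once with the conditioning embedding and once with the null embedding at each sampling step.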

[Figure 7 columns: Mode, Input, $w=0$, $w=1$, $w=3$; rows: Portrait, Sketch, FLAME, Edges, Landmarks]
Figure 7. By varying the guidance strength $w$, we can control the extent to which our conditioning signals affect the generated geometry. Setting $w=0$ results in unconditional generation, while $w \geq 1$ results in conditional generation.
Figure 8. Generation using conditioning signals obtained from in-the-wild test data (Portrait images top row, Canny edge maps bottom row). Our model produces reasonable facial geometry from in-the-wild conditions despite being trained only on studio data.

4.2.1. Quantitative Evaluation

To evaluate the effectiveness of each of our conditioning modes in guiding the generated geometry towards ground truth scans, we compute the Euclidean error between the generated geometry and the ground truth geometry for each conditioning mode. In Table 2, we report the vertex-to-vertex (V2V) error of 178 shapes from our validation set for each type of conditioning. We observe that conditioning signals that are more descriptive, such as FLAME parameters or portrait photos, obtain a lower error when compared to signals that are less descriptive of the final geometry (2D landmarks, Canny edges and sketches). For the base FLAME parameter conditioning, we also visualize the error maps on the geometry in Fig. 9. We also highlight the importance of training with 3D geometric augmentations in Table 1 where our diffusion model trained with augmentations outperforms one trained without data augmentations, when evaluated on only neutral expressions from our validation set.
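For meshes in dense correspondence, the V2V statistics reported in the tables can be computed along the following lines. This is a minimal sketch: it assumes that per-vertex Euclidean distances are pooled over all vertices before taking the mean, median and standard deviation, which may differ from the exact aggregation used for the reported numbers. The function name `v2v_error` and the toy vertices are hypothetical.

```python
import numpy as np

def v2v_error(pred, gt):
    """Vertex-to-vertex error between corresponding vertices.

    pred, gt: arrays of shape (..., num_vertices, 3) in the same units (mm).
    Returns mean, median and standard deviation of the per-vertex distances.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)
    return dist.mean(), np.median(dist), dist.std()

# Toy example with two vertices: one is displaced by 5 mm, one matches exactly.
gt = np.zeros((2, 3))
pred = np.array([[3.0, 4.0, 0.0],
                 [0.0, 0.0, 0.0]])

mean_err, median_err, std_err = v2v_error(pred, gt)
print(mean_err, median_err, std_err)
```

Because the generated and ground truth geometry share one topology, no nearest-neighbour search is needed and the distance is measured between vertices with the same index.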

Figure 9. Error maps on our validation set. The first row shows the FLAME mesh as generated by the FLAME face model from the input FLAME parameters. The second row shows the generated geometry from our model conditioned on the respective FLAME parameters. The third row visualizes the error from our conditional generations to the original scanned geometry in our validation set. The first four columns are various identities, while the last five columns are different expressions of the same subject.

Finally, as our method involves learning additional modes of conditioning on top of an underlying FLAME conditioned diffusion model, we can also use more than one conditioning signal at inference time to guide the generation according to Eq. 6. In Fig. 10, we show how combining both FLAME parameter and portrait image conditioning lowers the vertex error on a validation sample, as the denoising UNet now has access to more information about the desired identity and expression.

Figure 10. Multimodal conditioning using portrait photo and FLAME parameter conditioning separately and simultaneously. The first row shows the conditioning inputs. The second row shows the generated face geometry. The third row shows the error map when compared with the original geometry from our validation set.

4.3. Geometry Editing

The latent space of our autoencoder preserves the spatial layout of the original UV position map, much like how the latent space of the image autoencoder in text-to-image models (Rombach et al., 2022) preserves the spatial layout of the encoded image. As a consequence, by masking regions in the latent UV position map corresponding to regions we wish to modify, and by denoising the masked regions, one can apply intuitive edits to particular regions of the facial geometry. Please refer to RePaint (Lugmayr et al., 2022) for more details on the masking process. Even when using masks with sharp boundaries, the denoising process interpolates smoothly across the mask boundaries. This mask-based editing of facial geometries using our model is shown in Fig. 11. In the top row of Fig. 11, we mask the nose region of the latent position map such that it remains fixed throughout the multiple steps of denoising. We then generate multiple geometry samples by varying the initial noise input to the diffusion model. The noise predicted at each denoising step is multiplied with the nose mask before being fed as input to the denoising UNet for the next time step. This denoising procedure leads to generations where the generated samples all share the same nose shape, but vastly differing facial identities. In the bottom row of Fig. 11, we show the result of inverse masking, where the face shape is held fixed while allowing the nose shape to change. Our model produces meaningful results in both cases.
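A RePaint-style blending step on the latent position map can be sketched as follows. This is an illustration of the general masking idea from Lugmayr et al. (2022), not the exact procedure used in this work: at each reverse step, the region to keep is taken from the known latent noised to the current level, while the remaining region is taken from the denoiser output. The function names, the latent shape and the mask location are hypothetical.

```python
import numpy as np

def noise_to_level(z0, alpha_bar, rng):
    # Forward diffusion of a clean latent to the noise level given by alpha_bar.
    noise = rng.normal(size=z0.shape)
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * noise

def blend_step(z_gen, z0_known, keep_mask, alpha_bar_prev, rng):
    """Combine known and generated content after one reverse diffusion step.

    keep_mask is 1 where the known latent is held fixed (e.g. the nose)
    and 0 where the model is free to generate new content.
    """
    z_known = noise_to_level(z0_known, alpha_bar_prev, rng)
    return keep_mask * z_known + (1.0 - keep_mask) * z_gen

rng = np.random.default_rng(0)
z0_known = rng.normal(size=(4, 32, 32))   # latent of the geometry to preserve
z_gen = rng.normal(size=(4, 32, 32))      # stand-in for the denoiser output
keep_mask = np.zeros((1, 32, 32))
keep_mask[:, 12:20, 12:20] = 1.0          # hypothetical nose region in UV space

# At the final step (alpha_bar = 1) the kept region equals the clean latent.
z_prev = blend_step(z_gen, z0_known, keep_mask, alpha_bar_prev=1.0, rng=rng)
print(z_prev.shape)
```

Inverting the mask (`1.0 - keep_mask`) gives the complementary edit, in which the face shape is held fixed and only the nose region is resampled.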

Figure 11. Unconditional face shape editing (inpainting). In the top row, the nose is kept fixed, while we sample the remaining regions unconditionally. In the bottom row, we sample the nose region unconditionally while keeping the other regions fixed.

The editing of facial geometry by masking the latent position map can also be guided further with user conditions. In Fig. 12, we show an example of an interactive sketching workflow, where an artist can progressively edit a generated geometry by modifying one region at a time. For this application, we start from an initial 3D geometry obtained from our model by conditioning it on a hand-drawn sketch (a). An artist then edits the sketch, for instance to open the mouth (b). We automatically mask the latent position map such that only the regions corresponding to the edited sketch lines are allowed to change. The masked regions of the latent position map are populated with noise and denoised through our model while using the edited sketch as the conditioning signal. When used in this fashion, our model can perform local edits on the initial geometry to reflect the changes made by the artist. The remaining columns of Fig. 12 show how subsequent modifications of the mouth (c), face shape (d) and nose (e) translate to changes in the generated 3D face geometry. Please see the supplemental video for another illustration of this workflow.

[Figure 12 panels: input sketches with mask insets (top row); rendered geometry a) to e) (bottom row)]
Figure 12. Conditional generation from an input sketch (a), followed by local edits of the mouth (b,c), the face shape (d) and the nose (e). The masks used to constrain the region of modification are shown in the insets.

4.4. Dynamic Generation

Even though our model is only trained with static face shapes, we find that it can generate temporally stable 3D facial geometries when conditioned on per-frame FLAME parameters derived from animation sequences or on CLIP embeddings obtained from individual frames of in-the-wild videos. In Fig. 13, we show the generated 3D face geometry produced by our method when conditioned on various signals derived from videos. To demonstrate the use of sketches as dynamic conditioning, we use a recent face reconstruction technique (Chandran et al., 2023) to track the facial geometry in 3D from an in-the-wild video and then render out 2D sketches using a hand-painted texture map and the tracked geometry. We find that the only pre-processing required to obtain temporally stable generations from CLIP embeddings is to smooth them with a box filter before using them as the conditioning signal. To further ensure stable generations across time, we use the same noise seed across the video as well as DDIM sampling. Please refer to our accompanying video for more dynamic results.
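The temporal smoothing of per-frame embeddings can be sketched as a box filter applied independently to every embedding dimension. The window size, the edge padding at the sequence boundaries, the embedding width and the synthetic input are assumptions made for this example, since the text only states that a box filter is used.

```python
import numpy as np

def box_filter(embeddings, window=5):
    """Smooth a (frames, dim) sequence along time with a box filter.

    window must be odd; the sequence is edge-padded so that the output
    has the same number of frames as the input.
    """
    pad = window // 2
    padded = np.pad(embeddings, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    return np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid")
         for d in range(embeddings.shape[1])],
        axis=1,
    )

rng = np.random.default_rng(0)
frames, dim = 120, 16                      # toy sizes; real embeddings are wider
base = np.linspace(0.0, 1.0, frames)[:, None] * np.ones((1, dim))
noisy = base + 0.1 * rng.normal(size=(frames, dim))   # jittery per-frame signal

smooth = box_filter(noisy, window=5)

# Frame-to-frame changes shrink after smoothing.
print(np.abs(np.diff(noisy, axis=0)).mean(), np.abs(np.diff(smooth, axis=0)).mean())
```

A wider window gives more stable conditioning at the cost of damping fast expression changes, so the window size is a trade-off to tune per sequence.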

Figure 13. Dynamic geometry generation results given sketches, landmarks, FLAME parameters or portrait photos from four different input videos as conditionings. Our results change smoothly across time while largely maintaining a consistent identity. We use the same noise seed across all frames and DDIM sampling to reduce the stochasticity of the results.

5. Conclusion

We propose a new framework for 3D facial geometry generation based on latent diffusion models that can be guided using multiple types of conditionings. Our conditional geometry generator operates in a latent geometry space and can produce high resolution UV position maps. It can be seamlessly conditioned on hand-drawn sketches, 2D landmarks, Canny edges, FLAME parameters, RGB portrait photos and even text, resulting in a comprehensive facial geometry generator that supports many applications such as geometry super-resolution, geometry editing and face reconstruction. We train our model from scratch on only static face shapes captured in a studio setting and yet demonstrate that our model can generalize reasonably to in-the-wild conditioning signals, and can also generate facial performances when conditioned on frames from video data. As limitations, we identify that our model can produce geometric artifacts and suboptimal geometry for extreme expressions, especially when controlled using FLAME’s jaw pose parameters. We are hopeful that our work will encourage future research towards building comprehensive facial foundation models that also consider facial geometry.

References

  • Bergman et al. (2023) Alexander W. Bergman, Wang Yifan, and Gordon Wetzstein. 2023. Articulated 3D Head Avatar Generation using Text-to-Image Diffusion Models. arXiv. arXiv:2307.04859
  • Blanz and Vetter (1999) Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’99). ACM Press/Addison-Wesley Publishing Co., USA, 187–194. https://doi.org/10.1145/311535.311556
  • Canny (1986) John Canny. 1986. A Computational Approach To Edge Detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on PAMI-8 (12 1986), 679 – 698. https://doi.org/10.1109/TPAMI.1986.4767851
  • Chandran et al. (2020) Prashanth Chandran, Derek Bradley, Markus Gross, and Thabo Beeler. 2020. Semantic Deep Face Models. In 2020 International Conference on 3D Vision (3DV). IEEE Computer Society, Los Alamitos, CA, USA, 345–354. https://doi.org/10.1109/3DV50981.2020.00044
  • Chandran et al. (2023) P. Chandran, G. Zoss, P. Gotardo, and D. Bradley. 2023. Continuous Landmark Detection with 3D Queries. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 16858–16867.
  • Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems (NeurIPS), M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 8780–8794.
  • Ding et al. (2023) Zheng Ding, Cecilia Zhang, Zhihao Xia, Lars Jebe, Zhuowen Tu, and Xiuming Zhang. 2023. DiffusionRig: Learning Personalized Priors for Facial Appearance Editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Esser et al. (2021) Patrick Esser, Robin Rombach, and Björn Ommer. 2021. Taming Transformers for High-Resolution Image Synthesis. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, New York, 12868–12878.
  • Feng et al. (2018) Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. 2018. Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network. In Proceedings of the European Conference on Computer Vision (ECCV).
  • Funkhouser et al. (2004) Thomas Funkhouser, Michael Kazhdan, Philip Shilane, Patrick Min, William Kiefer, Ayellet Tal, Szymon Rusinkiewicz, and David Dobkin. 2004. Modeling by example. In ACM SIGGRAPH 2004 Papers (Los Angeles, California) (SIGGRAPH ’04). Association for Computing Machinery, New York, NY, USA, 652–663. https://doi.org/10.1145/1186562.1015775
  • Gruber et al. (2020) Aurel Gruber, Marco Fratarcangeli, Gaspard Zoss, Roman Cattaneo, Thabo Beeler, Markus Gross, and Derek Bradley. 2020. Interactive Sculpting of Digital Faces Using an Anatomical Modeling Paradigm. Computer Graphics Forum (2020), 93–102. https://doi.org/10.1111/cgf.14071
  • Gu et al. (2024) Jiatao Gu, Qingzhe Gao, Shuangfei Zhai, Baoquan Chen, Lingjie Liu, and Josh Susskind. 2024. Control3Diff: Learning Controllable 3D Diffusion Models from Single-view Images. International Conference on 3D Vision (3DV) (2024).
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems (NeurIPS), H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 6840–6851.
  • Ho and Salimans (2021) Jonathan Ho and Tim Salimans. 2021. Classifier-Free Diffusion Guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.
  • Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5967–5976. https://doi.org/10.1109/CVPR.2017.632
  • Kim et al. (2015) Hyeon-Joong Kim, A. Cengiz Öztireli, Il-Kyu Shin, Markus Gross, and Soo-Mi Choi. 2015. Interactive Generation of Realistic Facial Wrinkles from Sketchy Drawings. Computer Graphics Forum 34, 2 (2015), 179–191. https://doi.org/10.1111/cgf.12551
  • Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations (ICLR).
  • Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 13 (2017), 3521–3526. https://doi.org/10.1073/pnas.1611835114
  • Kirschstein et al. (2024) Tobias Kirschstein, Simon Giebenhain, and Matthias Nießner. 2024. DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Koley et al. (2024) Subhadeep Koley, Ayan Kumar Bhunia, Deeptanshu Sekhri, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, and Yi-Zhe Song. 2024. It’s All About Your Sketch: Democratising Sketch Control in Diffusion Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Li et al. (2017) Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. 2017. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics (ToG), (Proc. SIGGRAPH Asia) 36, 6 (2017), 194:1–194:17. https://doi.org/10.1145/3130800.3130813
  • Lorensen and Cline (1987) William E. Lorensen and Harvey E. Cline. 1987. Marching cubes: A high resolution 3D surface construction algorithm. In SIGGRAPH. ACM, 163–169.
  • Lugmayr et al. (2022) Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. 2022. RePaint: Inpainting Using Denoising Diffusion Probabilistic Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 11461–11471.
  • Mou et al. (2023) Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. 2023. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. arXiv (2023). arXiv:2302.08453
  • Otto et al. (2022) Christopher Otto, Jacek Naruniec, Leonhard Helminger, Thomas Etterlin, Graziana Mignone, Prashanth Chandran, Gaspard Zoss, Christopher Schroers, Markus Gross, Paulo Gotardo, Derek Bradley, and Romann Weber. 2022. Learning Dynamic 3D Geometry and Texture for Video Face Swapping. Computer Graphics Forum 41, 7 (Oct 2022), 611–622. https://doi.org/10.1111/cgf.14705
  • Poole et al. (2023) Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2023. DreamFusion: Text-to-3D using 2D Diffusion. In The Eleventh International Conference on Learning Representations (ICLR).
  • Potamias et al. (2024) Rolandos Alexandros Potamias, Michail Tarasiou, Stylianos Ploumpis, and Stefanos Zafeiriou. 2024. ShapeFusion: A 3D diffusion model for localized shape editing. arXiv (2024). arXiv:2403.19773
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (PMLR) (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 8748–8763.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10684–10695.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi (Eds.). Springer International Publishing, Cham, 234–241.
  • Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • Song et al. (2021) Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021. Denoising Diffusion Implicit Models. In International Conference on Learning Representations (ICLR).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems (NIPS), I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc.
  • Voynov et al. (2023) Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. 2023. Sketch-Guided Text-to-Image Diffusion Models. In ACM SIGGRAPH 2023 Conference Proceedings (Los Angeles, CA, USA) (SIGGRAPH ’23). Association for Computing Machinery, New York, NY, USA, Article 55, 11 pages. https://doi.org/10.1145/3588432.3591560
  • Wang et al. (2023) Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, and Baining Guo. 2023. RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4563–4573.
  • Wu et al. (2023) Yunjie Wu, Yapeng Meng, Zhipeng Hu, Lincheng Li, Haoqian Wu, Kun Zhou, Weiwei Xu, and Xin Yu. 2023. Text-Guided 3D Face Synthesis – From Generation to Editing. arXiv (2023). arXiv:2312.00375
  • Ye et al. (2023) Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. 2023. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv (2023). arXiv:2308.06721
  • Zhang et al. (2023a) Longwen Zhang, Qiwei Qiu, Hongyang Lin, Qixuan Zhang, Cheng Shi, Wei Yang, Ye Shi, Sibei Yang, Lan Xu, and Jingyi Yu. 2023a. DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance. ACM Transactions on Graphics (ToG) 42, 4, Article 138 (jul 2023), 16 pages. https://doi.org/10.1145/3592094
  • Zhang et al. (2023b) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023b. Adding Conditional Control to Text-to-Image Diffusion Models. In IEEE International Conference on Computer Vision (ICCV).
  • Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zhou et al. (2024) Mingyuan Zhou, Rakib Hyder, Ziwei Xuan, and Guojun Qi. 2024. UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures. arXiv (2024). arXiv:2401.11078
  • Zou et al. (2024) Kaifeng Zou, Sylvain Faisan, Boyang Yu, Sébastien Valette, and Hyewon Seo. 2024. 4D Facial Expression Diffusion Model. ACM Trans. Multimedia Comput. Commun. Appl. (mar 2024). https://doi.org/10.1145/3653455