Latent Diffusion Model for Generating Ensembles of Climate Simulations

Johannes Meuer    Maximilian Witte    Tobias Sebastian Finn    Claudia Timmreck    Thomas Ludwig    Christopher Kadow
Abstract

Obtaining accurate estimates of uncertainty in climate scenarios often requires generating large ensembles of high-resolution climate simulations, a computationally expensive and memory intensive process. To address this challenge, we train a novel generative deep learning approach on extensive sets of climate simulations. The model consists of two components: a variational autoencoder for dimensionality reduction and a denoising diffusion probabilistic model that generates multiple ensemble members. We validate our model on the Max Planck Institute Grand Ensemble and show that it achieves good agreement with the original ensemble in terms of variability. By leveraging the latent space representation, our model can rapidly generate large ensembles on-the-fly with minimal memory requirements, which can significantly improve the efficiency of uncertainty quantification in climate simulations.

Machine Learning, ICML

1 Introduction

Climate simulations are essential tools for understanding Earth system processes and supporting diverse applications. However, these simulations are uncertain because of chaotic internal variability and unknown forcings. Ensemble-based approaches, such as the Max Planck Institute (MPI) Grand Ensemble (Maher et al., 2019), address these uncertainties by providing a collection of simulations with varied initial conditions and model parameters. Nevertheless, these ensembles are computationally expensive and often limited in scope.

Machine learning has emerged as a promising complementary tool, capable of uncovering patterns and correlations. Reichstein et al. (2019) discuss the growing role of deep learning in improving climate science, highlighting its potential to identify non-linear relationships between climate variables. Their work illustrates how deep learning can reveal previously hidden patterns that improve climate simulations. Ensemble-based learning, as explored by Lorenz et al. (2018), further improves predictive accuracy by weighting climate models based on their historical performance. This approach allows better integration of the strengths of different models, leading to more robust predictions.

Recent work using generative adversarial networks (GANs) shows promise in climate modelling by generating realistic weather simulations that match high-resolution numerical models. Brochet et al. (2023) demonstrate how GANs can provide multivariate emulation of numerical weather predictions. However, while GANs are powerful in generating plausible simulations, they often suffer from mode collapse, where the generator produces limited types of output and fails to cover the diversity of the training data. This deficiency is particularly problematic in climate modelling, where robust sampling from the distribution of climate simulations is crucial for uncertainty quantification.



Figure 1: Monthly anomalies of the strongest El Niño events. Top row shows an original simulation of the MPI-GE, bottom row a selected generated simulation using our latent diffusion model.

Denoising Diffusion Models (Ho et al., 2020) solve this problem by sampling from a Gaussian noise distribution and minimising the KL divergence between the distribution of the predictions and the training data. This allows stable training and effective uncertainty quantification. For example, Rasul et al. (2021) use an autoregressive diffusion model to perform multivariate probabilistic forecasting, significantly improving the ability to simultaneously predict different climate-related variables. Another notable application is the use of generative diffusion models to capture the inherent uncertainties in weather forecasting with GenCast by Price et al. (2023). By creating an ensemble of forecasts, GenCast provides a range of probable weather scenarios, which is crucial for medium-range forecasting. Their approach helps in better capturing the variability and uncertainties associated with weather patterns.

We address the challenges of ensemble climate modelling with a machine learning technique based on generative diffusion models. We generate temporally coherent simulations conditioned on a single climate simulation. With this objective, we can efficiently sample an implicit representation of the distribution that specifies the uncertainty conditioned on one climate model simulation. A drawback of diffusion models is their computational time, as the denoising process has to be run iteratively, making it difficult to parallelize. Furthermore, when used in an autoregressive prediction model, the computational time scales linearly with the predicted time domain. We present a diffusion model that addresses the efficiency drawbacks of the original diffusion approach by sampling from a latent space. We also introduce two different techniques for generating long sequences: an autoregressive prediction technique that generates long sequences iteratively and a transformer-based technique that can generate long sequences in a single step. Our model successfully reconstructs realistic climate patterns (see Figure 1) and its simulations provide a similar range of possible outcomes compared to the numerical simulations. Our model exploits the strengths of deep learning, in particular diffusion models, to generate diverse simulations that complement existing ensemble approaches to provide improved uncertainty quantification in climate modelling.

2 Methodology

[Figure 2 schematic. Labels: encoder $E$ and decoder $D$ of the VAE (training on single climate states); DDM (training on sequences); conditioning input $x_c = [x_0, \ldots, x_t]$ with latent $z_c$; generated residual $\hat{z}_y$; output $\hat{x} = [\hat{x}_0, \ldots, \hat{x}_t]$.]
Figure 2: Our latent diffusion approach is split into two models, a variational autoencoder (VAE) pre-trained on independent climate states and a denoising diffusion model (DDM) trained on sequences of latent representations. During inference, the DDM generates new simulations in latent space, which are remapped to the original resolution by the decoder (D).

2.1 Latent Diffusion Model

Our model (see Figure 2) uses a diffusion process in latent space (Rombach et al., 2022) generated by a pre-trained variational autoencoder (VAE) (Kingma & Welling, 2013). This approach significantly reduces spatial complexity while preserving essential features of the climate simulations. The pre-trained VAE, described in detail in Appendix A, compresses each simulation $x$ into a lower-dimensional latent space $z$ using the encoder $E$:

$$z = E(x)$$

The VAE is unaware of the time dimension and treats each timestep independently. The diffusion model, detailed in Appendix B, is trained on the latent representations of the climate simulations. It is conditioned on a single simulation $x_c$ mapped into latent space, $z_c = E(x_c)$, which steers the model towards long-term trends in climate evolution. The prediction task is defined as the difference between a target latent $z$ and the conditioned latent simulation $z_c$, where $z_y$ represents the residual that the diffusion model has to learn:

$$z_y = z - z_c$$

During training, the diffusion model is optimised to predict this residual in latent space. The diffusion model is a U-Net (Ronneberger et al., 2015) using BigGAN (Brock et al., 2018) residual blocks followed by down-sampling convolutions in the encoder and up-sampling convolutions in the decoder. After training, during inference, we use the diffusion model (DDM) to generate a large number of residuals $\hat{z}_y$ in the latent space:

$$\hat{z}_y = \text{DDM}(z_c)$$

To speed up the inference time of the diffusion model, we apply a denoising diffusion implicit sampler (Song et al., 2020), which provides a more efficient generation process by using deterministic steps. The final simulations are reconstructed by adding the generated residuals back to the conditioned latent simulation $z_c$ and applying the VAE decoder $D$:

$$\hat{x} = D(z_c + \hat{z}_y)$$
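The inference path described in this section can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the implementation used in this work: `encode`, `decode` and `sample_residual` are hypothetical stand-ins (block averaging, nearest-neighbour upsampling and scaled Gaussian noise) for the trained encoder $E$, decoder $D$ and the DDM sampler, and the toy latent has a single channel rather than the four used in Appendix A.

```python
import numpy as np

F = 8  # spatial compression factor (Appendix A states f = 8)

def encode(x):
    """Stand-in for E: average over F x F blocks. x has shape (H, W)."""
    h, w = x.shape[0] // F, x.shape[1] // F
    return x.reshape(h, F, w, F).mean(axis=(1, 3))

def decode(z):
    """Stand-in for D: nearest-neighbour upsampling back to (H, W)."""
    return np.repeat(np.repeat(z, F, axis=0), F, axis=1)

def sample_residual(z_c, rng):
    """Stand-in for DDM(z_c); a trained diffusion sampler would go here."""
    return 0.1 * rng.standard_normal(z_c.shape)

def generate_ensemble(x_c, n_members, seed=0):
    """Encode the conditioning field once, then draw one residual per member."""
    rng = np.random.default_rng(seed)
    z_c = encode(x_c)                            # z_c = E(x_c)
    members = []
    for _ in range(n_members):
        z_y_hat = sample_residual(z_c, rng)      # z_y_hat = DDM(z_c)
        members.append(decode(z_c + z_y_hat))    # x_hat = D(z_c + z_y_hat)
    return np.stack(members)

# Toy conditioning field on a 96 x 192 grid (random numbers, not climate data).
x_c = np.random.default_rng(1).standard_normal((96, 192))
ensemble = generate_ensemble(x_c, n_members=5)
```

Only the residual sampling is repeated per member, so the encoder runs once per conditioning simulation and every diffusion step operates on the small latent grid.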

2.2 Sequence Generation

We explore two approaches to generating long sequences in the latent space:

2.2.1 Autoregressive Prediction

This approach iteratively generates long sequences by predicting the next latent state based on a window of previous states and the conditioned simulation $z_c$. Given an input sequence $z_{t-n+1:t} = [z_{t-n+1}, z_{t-n+2}, \ldots, z_t]$, the model predicts:

$$\hat{z}_{t+1} = \text{DDM}(z_{t-n+1:t}, z_c)$$
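A minimal sketch of the rollout loop implied by this update is given below. The function `sample_next` is a hypothetical placeholder for one full diffusion sampling pass, and the choice to condition step $t+1$ on the matching time slice of $z_c$ is an assumption made for illustration; the text does not specify how $z_c$ is indexed.

```python
import numpy as np

def sample_next(window, z_c_next, rng):
    """Hypothetical stand-in for DDM(z_{t-n+1:t}, z_c): blends the window
    mean with the conditioning latent and adds noise."""
    noise = 0.05 * rng.standard_normal(z_c_next.shape)
    return 0.5 * window.mean(axis=0) + 0.5 * z_c_next + noise

def rollout(z_init, z_c, n_steps, seed=0):
    """z_init: (n, ...) initial window; z_c: (T, ...) conditioning latents
    with T >= n + n_steps. Returns an array of shape (n + n_steps, ...)."""
    rng = np.random.default_rng(seed)
    n = z_init.shape[0]
    states = list(z_init)
    for t in range(n_steps):
        window = np.stack(states[-n:])              # the last n latent states
        states.append(sample_next(window, z_c[n + t], rng))
    return np.stack(states)

# Toy latents: 20 time steps, 4 channels, 12 x 24 grid.
z_c = np.random.default_rng(1).standard_normal((20, 4, 12, 24))
z_init = z_c[:3]
trajectory = rollout(z_init, z_c, n_steps=17)
```

Each appended state requires one sampling pass, which is why the cost of this technique grows linearly with the length of the generated sequence.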

2.2.2 Transformer-Based Attention Mechanism

Inspired by natural language processing (Vaswani et al., 2017), this approach uses a transformer to process the entire time domain at once. This allows parallel processing and accelerates sequence generation. Each transformer block sequentially applies spatial attention, focusing on spatial patterns, and temporal attention, focusing on temporal correlations, following a residual block. To manage memory costs, we implement a cascaded transformer mechanism. The higher levels of the diffusion network focus on small time scales, capturing detailed short-term patterns. The lower levels deal with overall time scales, ensuring a comprehensive understanding of long-term trends. A detailed description of the model can be found in Appendix B. The transformer processes the entire sequence in a single step and is not additionally conditioned on an initial state:

$$\hat{z} = \text{DDM}(z_c)$$

3 Results

Our model is trained and evaluated on the 200 ensemble members from the historical simulations of the MPI Grand Ensemble (Maher et al., 2019). This ensemble covers the period from 1850 to 2005 with a monthly temporal frequency and a spatial resolution of 1.8° (192x96 grid). The focus of this analysis is on surface temperatures.

During training, we used one member as input for conditional simulation. For model evaluation, we used another member for conditioning during inference. We trained on the remaining 198 members over the entire time range. After training, we generated 100 new artificial ensemble members. These generated members are then analysed to compare their annual mean surface temperatures over the entire historical period with the first 100 members of the original MPI Grand Ensemble. The comparison focuses on two key statistical measures: the ensemble mean and the spread in the temperatures. Figure 3 shows the results of our transformer-based model. While the ensemble mean contains signatures of the forced response to climate change, the ensemble standard deviation represents the internal variability of the climate system.
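The two statistics used in this comparison can be computed as in the sketch below. It assumes cosine-of-latitude area weights on a regular grid, unweighted calendar months and the standard deviation across members as the measure of spread; none of these details are specified in the text, and the function names are illustrative.

```python
import numpy as np

def global_annual_mean(tas, lat):
    """tas: (members, months, lat, lon) monthly fields; lat in degrees.
    Returns area-weighted annual means of shape (members, years)."""
    weights = np.cos(np.deg2rad(lat))
    zonal = tas.mean(axis=-1)                                  # (members, months, lat)
    spatial = (zonal * weights).sum(axis=-1) / weights.sum()   # (members, months)
    n_members, n_months = spatial.shape
    return spatial.reshape(n_members, n_months // 12, 12).mean(axis=-1)

def ensemble_stats(tas, lat):
    """Ensemble mean and spread (standard deviation across members) per year."""
    annual = global_annual_mean(tas, lat)
    return annual.mean(axis=0), annual.std(axis=0)

# Toy example: 10 members, 2 years, 96 x 192 grid of random values.
rng = np.random.default_rng(0)
lat = np.linspace(-88.6, 88.6, 96)
tas = rng.standard_normal((10, 24, 96, 192))
mean, spread = ensemble_stats(tas, lat)
```

Applying the same function to the original and the generated members gives the two pairs of curves that are compared.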



Figure 3: Ensemble spread and ensemble mean of annual spatially averaged 98 original members from the MPI-GE (blue) compared to the generated members (red) from 1850 to 2005.

The ensemble mean and variability of the generated members closely mirror those of the original ensemble members with respect to annual spatially averaged temperatures. This highlights the ability of the model to reproduce central tendencies such as the global warming trend and global cooling following major volcanic eruptions (1883, 1963, 1982 and 1991). This validates the ability of our machine learning approach to reproduce complex climate dynamics. Appendix C also provides an analysis of uncertainty quantification for an unseen time range, where the autoregressive model was trained on data from 1850 to 1975 only and iteratively generated simulations for 1975 to 2000.

In a further analysis, we looked specifically at the El Niño-Southern Oscillation (ENSO) (Trenberth, 1997) timelines of a selected member to assess the model’s ability to capture more localised and medium-term climate phenomena. Figure 1 shows the anomaly maps of the strongest El Niño event from an original simulation and a selected member, which was generated by our autoregressive model. Although the El Niño appears less pronounced in our generated simulation, the model is able to generate temporally and spatially coherent El Niño patterns. This can also be seen in Figure 4, which shows the ENSO timeline from 1950 to 2005. The generated data show a realistic pattern of recurring ENSO events, similar to those observed in real climate data. This similarity confirms that our model not only maintains general climate trends over time, but also effectively reproduces specific, influential climate phenomena such as ENSO.
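An ENSO timeline of this kind can be derived from monthly surface temperature fields as sketched below. The sketch assumes the Niño 3.4 box (5°S to 5°N, 170°W to 120°W), a 5-month running mean and thresholds of ±0.4 °C, following the definition of Trenberth (1997). The exact region, base period and thresholds used for Figure 4 are not stated in the text, so these values, like the climatology computed over the full series, are assumptions.

```python
import numpy as np

def nino34_index(ts, lat, lon, window=5):
    """ts: (months, lat, lon) starting in January; lon in degrees east [0, 360).
    Returns the running-mean anomaly series for the assumed Nino 3.4 box."""
    lat_mask = (lat >= -5) & (lat <= 5)
    lon_mask = (lon >= 190) & (lon <= 240)        # 170W to 120W
    box = ts[:, lat_mask][:, :, lon_mask]
    weights = np.cos(np.deg2rad(lat[lat_mask]))
    series = (box.mean(axis=2) * weights).sum(axis=1) / weights.sum()
    # Anomalies relative to the monthly climatology of the whole series.
    n_years = series.size // 12
    monthly = series[: n_years * 12].reshape(n_years, 12)
    anomalies = (monthly - monthly.mean(axis=0)).ravel()
    kernel = np.ones(window) / window
    return np.convolve(anomalies, kernel, mode="valid")

# Toy example: 10 years of random fields on a 96 x 192 grid.
lat = np.linspace(-88.6, 88.6, 96)
lon = np.arange(192) * 1.875
ts = np.random.default_rng(0).standard_normal((120, 96, 192))
index = nino34_index(ts, lat, lon)
el_nino = index > 0.4       # assumed El Nino threshold
la_nina = index < -0.4      # assumed La Nina threshold
```

Because the monthly climatology is removed first, a field that contains only a repeating seasonal cycle yields an index of zero.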



Figure 4: ENSO timeline (Trenberth, 1997) of an original simulation from the MPI-GE (blue) in comparison to a generated member (red) ranging from 1950 to 2005. The red dashed line marks the threshold of an El Niño event, the blue dashed line the threshold of a La Niña event.

We found that the autoregressive model was better at preserving the evolution over time, while the transformer model was better at preserving spatial patterns and long-term trends. A comparison of absolute temperature maps can be seen in Figure 7 in the appendix.

Our results demonstrate that our machine learning framework, integrating a variational autoencoder and a diffusion model, can effectively generate plausible climate scenarios that are statistically consistent with simulated historical data. This capability marks a significant advance in the field of climate modelling, particularly for sampling that accounts for the internal variability and projection of long-term climate phenomena.

4 Conclusion and Outlook

We present a latent diffusion model for numerical climate model emulation, highlighting its ability to reproduce both global and local climate phenomena. Our results show that the model effectively captures central tendencies such as the global warming trend and significant cooling events following major volcanic eruptions. Furthermore, the model successfully generates coherent temporal and spatial patterns, such as the El Niño-Southern Oscillation, even when trained on limited historical data.

We carried out a detailed analysis of two different diffusion models: an autoregressive prediction approach and a transformer-based approach. When evaluated on the MPI Grand Ensemble, the generated ensemble members of the transformer-based approach show remarkable agreement with the original ensemble in terms of mean and variability, confirming the robustness and reliability of the model. The autoregressive approach showed the ability to generate realistic climate patterns over extended periods, including unseen time spans, and good performance in temporally coherent simulations. Future work will investigate the combination of the two techniques to leverage the strengths of both.

The promising results of our approach open up several avenues for future research and development. Future work can extend the training data to include more recent years, higher spatial resolutions and multiple climate models. This would allow the model to capture more detailed and recent climate phenomena, improving its applicability to contemporary climate studies. In addition, the inclusion of more climate variables such as precipitation, sea level pressure and ocean currents could provide a more comprehensive understanding of climate dynamics and improve the predictive capabilities of the model. The techniques and insights from this work could be applied to other fields requiring time-series prediction and uncertainty quantification, such as economics, epidemiology and energy systems. A deeper exploration of uncertainty quantification would provide more insight into the confidence and reliability of predictions, which is crucial for policy making and scientific research.

Our latent diffusion model not only quantifies uncertainty, but also provides real scenarios that support the uncertainty quantification. Unlike traditional methods, which often provide abstract uncertainty metrics, our approach generates diverse and plausible climate simulations, providing concrete scenarios for better understanding and decision making. Compared to numerical models for ensemble generation, our method could provide a much less computationally expensive alternative.

References

  • Brochet et al. (2023) Brochet, C., Raynaud, L., Thome, N., Plu, M., and Rambour, C. Multivariate emulation of kilometer-scale numerical weather predictions with generative adversarial networks: A proof of concept. Artificial Intelligence for the Earth Systems, 2(4):230006, 2023.
  • Brock et al. (2018) Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  • Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Lorenz et al. (2018) Lorenz, R., Herger, N., Sedláček, J., Eyring, V., Fischer, E. M., and Knutti, R. Prospects and caveats of weighting climate models for summer maximum temperature projections over north america. Journal of Geophysical Research: Atmospheres, 123(9):4509–4526, 2018.
  • Maher et al. (2019) Maher, N., Milinski, S., Suarez-Gutierrez, L., Botzet, M., Dobrynin, M., Kornblueh, L., Kröger, J., Takano, Y., Ghosh, R., Hedemann, C., et al. The max planck institute grand ensemble: enabling the exploration of climate system variability. Journal of Advances in Modeling Earth Systems, 11(7):2050–2069, 2019. doi: 10.1029/2019MS001639.
  • Price et al. (2023) Price, I., Sanchez-Gonzalez, A., Alet, F., Ewalds, T., El-Kadi, A., Stott, J., Mohamed, S., Battaglia, P., Lam, R., and Willson, M. Gencast: Diffusion-based ensemble forecasting for medium-range weather. arXiv preprint arXiv:2312.15796, 2023.
  • Rasul et al. (2021) Rasul, K., Seward, C., Schuster, I., and Vollgraf, R. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. In International Conference on Machine Learning, pp.  8857–8868. PMLR, 2021.
  • Reichstein et al. (2019) Reichstein, M., Camps-Valls, G., Stevens, B., Jung, M., Denzler, J., Carvalhais, N., and Prabhat. Deep learning and process understanding for data-driven earth system science. Nature, 566(7743):195–204, 2019.
  • Rombach et al. (2022) Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10684–10695, 2022.
  • Ronneberger et al. (2015) Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pp.  234–241. Springer, 2015.
  • Sohl-Dickstein et al. (2015) Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp.  2256–2265. PMLR, 2015.
  • Song et al. (2020) Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • Trenberth (1997) Trenberth, K. E. The definition of el nino. Bulletin of the American Meteorological Society, 78(12):2771–2778, 1997.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Appendix A Variational Autoencoder

The Variational Autoencoder (VAE) is based on Rombach et al. (2022) and focuses on perceptual image compression. The encoder $E$ encodes an image $x \in \mathbb{R}^{H \times W \times 3}$ into a latent representation $z = E(x)$, where $z \in \mathbb{R}^{h \times w \times c}$. The decoder $D$ reconstructs the image from the latent representation, giving $\tilde{x} = D(z) = D(E(x))$. The compression factor is given by $f = \frac{H}{h} = \frac{W}{w}$. In our setup, we use a compression factor of $f = 8$ and a latent dimension of $c = 4$, giving us a total compression of 16.
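The stated total compression can be checked with simple arithmetic. The sketch below assumes the 192x96 grid from Section 3 and a single input channel (surface temperature); under that reading, which is our interpretation rather than something stated explicitly, the ratio is $f^2 / c = 64 / 4 = 16$. The helper names are illustrative.

```python
def latent_shape(height, width, f=8, c=4):
    """Latent grid (h, w, c) for spatial compression factor f and c latent channels."""
    return height // f, width // f, c

def compression_ratio(height, width, in_channels, f=8, c=4):
    """Number of input values divided by number of latent values."""
    h, w, c_lat = latent_shape(height, width, f, c)
    return (height * width * in_channels) / (h * w * c_lat)

print(latent_shape(96, 192))           # (12, 24, 4)
print(compression_ratio(96, 192, 1))   # 16.0
```

With three input channels, as in the image setting of Rombach et al. (2022), the same function would give 48 instead.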

The given objective $\mathcal{L}_{\text{VAE}}$ combines the reconstruction loss, adversarial loss, and Kullback-Leibler divergence (KL divergence) regularization for training the VAE with adversarial training. The reconstruction loss $\mathcal{L}_{\text{rec}}$ measures the L2 distance between the original image $x$ and the reconstructed image $\tilde{x}$ to produce images that are close to the input in pixel space:

$$\mathcal{L}_{\text{rec}} = \|x - \tilde{x}\|_2^2 \tag{1}$$

The adversarial loss ensures that the reconstructed images are perceptually similar to the real images by adding a discriminator $D$:

$$\mathcal{L}_{\text{adv}} = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim p_{\text{model}}(\tilde{x})}[\log(1 - D(\tilde{x}))] \tag{2}$$

The KL divergence regularizes the latent space by making the distribution of the encoded latent variables $q(z|x)$ close to a prior distribution $p(z)$, in our case a standard normal distribution. The overall objective for the VAE with adversarial training is given by:

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\lambda_{\text{rec}} \|x - \tilde{x}\|_2^2 + \lambda_{\text{adv}} \left(\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim p_{\text{model}}(\tilde{x})}[\log(1 - D(\tilde{x}))]\right) + \lambda_{\text{KL}} \, \text{KL}(q(z|x) \,\|\, p(z))\right] \tag{3}$$
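For a diagonal Gaussian encoder, the KL term in Eq. (3) has a closed form. The sketch below shows how the three terms could be combined for one sample. The diagonal Gaussian parameterisation via `mu` and `log_var`, the weights `lam_*` and the pre-computed scalar `adv_term` are assumptions for illustration; the weights used in this work are not stated.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

def vae_objective(x, x_rec, mu, log_var, adv_term,
                  lam_rec=1.0, lam_adv=0.5, lam_kl=1e-6):
    """Weighted sum of reconstruction, adversarial and KL terms (cf. Eq. 3).
    The lam_* values are placeholders, not the weights used in the paper."""
    rec = np.sum((x - x_rec) ** 2)           # squared L2 distance, Eq. (1)
    kl = kl_to_standard_normal(mu, log_var)
    return lam_rec * rec + lam_adv * adv_term + lam_kl * kl
```

The KL term vanishes when the encoder output equals the prior (`mu = 0`, `log_var = 0`) and grows as the encoded distribution moves away from it.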

We evaluated the performance of our VAE independently of the overall setup to ensure that the diffusion model was provided with high quality latent data. To do this, we mapped our complete data set of simulations into latent space using the encoder $E$ and reconstructed all the data using the decoder $D$. Our VAE achieved a pixel-wise RMSE of 0.25 °C. Figure 5 also shows the comparison of the spatial annual mean and spread from the original and reconstructed datasets over the full time range.


Figure 5: Ensemble spread and ensemble mean of annual spatially averaged 98 original members from the MPI-GE (blue) compared to the reconstructed members of the VAE (red) ranging from 1850 to 2005.

Appendix B Denoising Diffusion Model

Diffusion models generate images by reversing a gradual noise process (Sohl-Dickstein et al., 2015). The process starts with pure noise and iteratively denoises it to produce a high-quality image. The main steps in this process are a forward and a backward diffusion process. The forward process gradually adds Gaussian noise to the data over a series of diffusion steps. Starting from the original image $x_0$, noise is added at each diffusion step $t$ to produce a noisy image $x_t$. This process can be described by the following equation, where $\beta_t$ is a variance schedule that controls the amount of noise added at each step:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I) \tag{4}$$

The reverse diffusion process is learned by training a neural network to predict the noise component added at each step. The model $\epsilon_\theta$ predicts the noise given the noisy image $x_t$ and the diffusion step $t$. The objective is to minimise the difference between the predicted noise and the true noise, in our case using a mean square error loss:

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right] \tag{5}$$
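Equations (4) and (5) can be illustrated with a short sketch. It relies on the standard closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$, which follows from applying Eq. (4) repeatedly. The linear schedule with 1000 steps is the one from Ho et al. (2020) and is an assumption here, as is the callable `eps_model` standing in for the trained network. Array index `t` is zero-based.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear variance schedule
alpha_bar = np.cumprod(1.0 - betas)    # cumulative product of (1 - beta_s)

def q_sample(x0, t, eps):
    """Draw x_t directly from x_0 using the closed form of the forward process."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def noise_prediction_loss(eps_model, x0, rng):
    """One Monte Carlo sample of Eq. (5) with a mean squared error."""
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, eps)
    return np.mean((eps - eps_model(x_t, t)) ** 2)

# A model that always predicts zero noise has a loss close to 1,
# the variance of the standard normal noise it fails to explain.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 12, 24))
loss = noise_prediction_loss(lambda x_t, t: np.zeros_like(x_t), x0, rng)
```

In the latent setting of this work, `x0` would be the latent residual $z_y$ and the network would additionally receive $z_c$ as conditioning.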

The reverse process can then generate $x_{t-1}$ from $x_t$ using the learned model, where $\mu_\theta$ and $\Sigma_\theta$ are functions of the predicted noise:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) \tag{6}$$
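Section 2.1 replaces stochastic sampling from Eq. (6) with the deterministic implicit sampler of Song et al. (2020). A single step of that sampler can be sketched as follows, assuming the fully deterministic variant ($\eta = 0$) and writing $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$ for the cumulative noise level, a symbol not used in the equations above. The function name and argument layout are illustrative.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update from noise level abar_t to abar_prev."""
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Re-noise the estimate to the previous (less noisy) level.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

# Sanity check: with the true noise and abar_prev = 1, the step recovers x0.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 12, 24))
eps = rng.standard_normal(x0.shape)
abar_t = 0.5
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x_prev = ddim_step(x_t, eps, abar_t, 1.0)
```

Because no fresh noise is injected, the step can jump across many diffusion steps at once, which is what shortens inference.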

Our model is based on the work of Dhariwal & Nichol (2021) and implements a U-Net architecture with residual blocks as described by Brock et al. (2018). Similar to the original implementation used for image synthesis, we apply spatial attention mechanisms in the upper layers of the model. To address the time cost associated with iterative inference over large data sequences, our model operates in the latent space. For autoregressive sequence prediction, the temporal dimension of the latent simulations is encoded within the channel dimension of the model.

The transformer-based component of our model processes temporal information in an additional dimension. Each transformer block consists of three parts: a spatial attention mechanism, a temporal attention mechanism, and a multilayer perceptron (MLP). The core concept of the transformer (Vaswani et al., 2017) is the self-attention mechanism, which allows the model to evaluate the importance of different regions in the spatial dimension or time steps in the temporal dimension. The data is divided into patches (either spatial or temporal) and transformed into three vectors: Query (Q), Key (K) and Value (V). The size of the temporal patches varies with the depth of the network; higher layers consider smaller timescales, while the bottleneck layer includes all timescales. Attention scores are computed by taking the dot product of the query vector with all key vectors, followed by a softmax function to obtain weights. These weights are then used to compute a weighted sum of the value vectors:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \tag{7}$$
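Equation (7) and the alternation between spatial and temporal attention can be illustrated as follows. This is a simplified sketch: the queries, keys and values are the tokens themselves (no learned projections), a single head is used, and the token layout `(time, space, features)` is an assumption about how the latents could be arranged.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention over the second-to-last axis (Eq. 7)."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 10, 16))     # (time, space, features)

# Spatial attention: each time step attends over its 10 spatial tokens.
spatial_out, w_s = attention(tokens, tokens, tokens)

# Temporal attention: each spatial location attends over its 6 time steps.
tt = np.swapaxes(tokens, 0, 1)                # (space, time, features)
temporal_out, w_t = attention(tt, tt, tt)
```

Factorising attention in this way keeps each attention matrix small (10 x 10 and 6 x 6 here) compared with joint attention over all 60 tokens.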

Instead of performing a single attention function, the transformer uses multiple attention heads to capture different aspects of the relationships between elements. Each head has its own set of $Q$, $K$ and $V$ matrices, and their outputs are concatenated and linearly transformed. In our model, the number of attention heads increases with layer depth, starting with fewer heads in the early layers and reaching a maximum number in the bottleneck layer. Following the multi-head attention mechanisms, an MLP with a residual connection is applied. This consists of a layer normalisation, a linear layer and a Gaussian Error Linear Unit (GELU) activation function.

Appendix C Extended Analysis

We investigated the generalisability of our model to unseen time periods. The channel-based autoregressive diffusion model was trained on all historical MPI-GE members from 1850 to 1975. We then conditioned the trained model on a single simulation from 1975 to 2000, generating 100 members. Figure 6 shows the results. While the mean and spread of the generated simulations do not perfectly match the original ones, the simulations successfully capture the ongoing global warming trend despite not being trained on this period. In addition, the generated simulations strongly reflect the major volcanic eruptions of 1982 and 1991.


Figure 6: Ensemble spread and ensemble mean of annual spatially averaged 100 original members from the MPI-GE (blue) compared to the reconstructed members of the autoregressive model (red) ranging from 1975 to 2000.


Figure 7: Absolute temperature maps of some randomly chosen samples. Top row shows the samples from an original MPI-GE simulation, center row from the autoregressive technique and bottom row from the transformer-based technique.