Sampling Hybrid Climate Simulation at Scale to Reliably Improve Machine Learning Parameterization

Abstract

Machine-learning (ML) parameterizations of subgrid processes (here of turbulence, convection, and radiation) may one day replace conventional parameterizations by emulating high-resolution physics without the cost of explicit simulation. However, their development has been stymied by uncertainty surrounding whether or not improved offline performance translates to improved online performance (i.e., when coupled to a large-scale general circulation model (GCM)). A key barrier has been the limited sampling of the online effects of the ML design decisions and tuning due to the complexity of performing large ensembles of hybrid physics-ML climate simulations. Our work examines the coupled behavior of full-physics ML parameterizations using large ensembles of hybrid simulations, totalling 2,970 in our case. With extensive sampling, we statistically confirm that lowering offline error lowers online error (given certain constraints). However, we also reveal that decisions decreasing online error, like removing dropout, can trade off against hybrid model stability and vice versa. Nevertheless, we are able to identify design decisions that yield unambiguous improvements to offline and online performance, namely incorporating memory and training on multiple climates. We also find that converting moisture input from specific to relative humidity enhances online stability and that using a Mean Absolute Error (MAE) loss breaks the aforementioned offline/online error relationship. By enabling rapid online experimentation at scale, we empirically answer previously unresolved questions regarding subgrid ML parameterization design.

Journal of Advances in Modeling Earth Systems (JAMES)

Department of Earth System Sciences, University of California at Irvine, Irvine, CA, USA
Multimodal Cognitive AI, Intel Labs, Santa Clara, CA 95054, USA
Faculty of Geosciences and Environment, University of Lausanne, Lausanne, Switzerland
Expertise Center for Climate Extremes, University of Lausanne, Lausanne, Switzerland
Department of Statistics, University of California at Irvine, Irvine, CA, USA
Department of Earth and Planetary Sciences, Harvard University
NVIDIA Research
LEAP Science and Technology Center, School of Engineering and Applied Sciences, Climate School, Columbia University
Berkeley AI Research (BAIR), Department of Electrical Engineering and Computer Science, University of California at Berkeley, Berkeley, CA, USA
Department of Biomedical Data Science, Stanford University School of Medicine, Palo Alto, CA, USA

Corresponding author: Jerry Lin, jerryL9@uci.edu

Key Points

Lower offline error lowers online error, but stability and online error do not necessarily improve in tandem.

Using memory and training on multiple climates reduce prognostic error, improve stability, and are advised for future ML parameterizations.

Using ensemble sizes of $\mathcal{O}(100)$ may be necessary to detect causally relevant differences in the online performance of ML parameterizations.

Plain Language Summary

Running high-resolution simulations of deep convection for climate forecasts takes a lot of computing power, which is why we use simpler models that can introduce some uncertainty in the results. Machine learning, especially neural networks, could mimic these simulations more efficiently, but it’s tricky to make them work well within the actual climate models (i.e., in an “online” setting). It’s hard to predict how they’ll perform because testing them inside climate models is complex and results can vary a lot, making it tough to draw solid conclusions from a few tests. Our research extensively tests how these neural networks interact with climate models so that we can confirm some theories with actual data. We would like to reduce the online error of these models and improve their stability (i.e., preventing situations in which they crash and stop running). We found that using data from different climates and a previous time step helps with both goals, converting from specific humidity to relative humidity improves stability, and removing dropout reduces error at the cost of stability. With all this testing, we’re helping figure out how to develop machine learning for climate models, moving us closer to using them in real-world forecasts.

1 Introduction

Despite the fact that they occur at scales smaller than the grid cell resolution of a climate model, subgrid processes (e.g. turbulence, deep convection, microphysics) can have major effects on model accuracy and fidelity. Unfortunately, such subgrid processes are too computationally costly to resolve explicitly for long-term climate projections. These processes therefore necessitate the use of sub-grid parameterizations, i.e., empirical representation of the impact of subgrid processes on the coarse grid, which are inevitably error-prone due to their phenomenological nature. For over two decades, neural networks (NNs) have held the promise of circumventing the large amounts of compute required to resolve fine scale processes in climate simulations by learning the coarse representation of subgrid processes from data [Chevallier \BOthers. (\APACyear1998), Krasnopolsky \BOthers. (\APACyear2013), Gentine \BOthers. (\APACyear2018), Rasp \BOthers. (\APACyear2018), Yu \BOthers. (\APACyear2023)]. Of particular interest is the potential to replace parameterizations that crudely approximate highly nonlinear subgrid scale processes like ocean momentum transport [Guillaumin \BBA Zanna (\APACyear2021)], radiative transfer [Cachay \BOthers. (\APACyear2021)], and moist convection [Gentine \BOthers. (\APACyear2018), Rasp \BOthers. (\APACyear2018), Brenowitz, Beucler\BCBL \BOthers. (\APACyear2020), Yuval \BOthers. (\APACyear2021), Wang \BOthers. (\APACyear2022), Iglesias-Suarez \BOthers. (\APACyear2024)]. If NNs could be trained to reliably emulate the behavior of more explicit simulation, climate models could be improved compared to traditional parameterizations without incurring the associated computational cost of high-resolution simulations. 
In the context of convection, this could mean breaking the parameterization “deadlock” coined by \citeARandall2003-db, a situation in which progress on traditional convective parameterizations has slowed and has not kept pace with the growing societal need to accurately represent the underlying subgrid physics in climate models [D\BPBIA. Randall (\APACyear2013), Shepherd (\APACyear2014), Gentine \BOthers. (\APACyear2018), IPCC (\APACyear2021)].

The vision of breaking this deadlock with NN parameterizations has been challenged by the difficulty of ensuring their reliability when they are dynamically coupled to a host climate model (i.e., used online). NNs are first fitted offline on training data (usually high-resolution simulations) and then coupled online, which often results in reduced performance or even a complete loss of stability. NNs have been shown to outperform conventional convective parameterizations when evaluated offline, i.e., on test data from the cloud-resolving models (CRMs) they are trained to emulate [Gentine \BOthers. (\APACyear2018), Han \BOthers. (\APACyear2020), Mooers \BOthers. (\APACyear2021)]. When coupled online, however, small offline imperfections can compound over timesteps, resulting in dramatic error growth and simulation crashes [Brenowitz, Beucler\BCBL \BOthers. (\APACyear2020), Wang \BOthers. (\APACyear2022)]. Previous work has argued that skillful offline fits, for which ample ML optimization tools exist, do not guarantee stable online performance [Ott \BOthers. (\APACyear2020), Wang \BOthers. (\APACyear2022)].

An overarching problem is that online error is rarely sampled with large ensemble sizes. We hypothesize that the main obstacle is the associated technical work: training thousands of ML parameterizations, adapting each to run within a climate model, and running multi-hundred member ensembles of hybrid (i.e., physical model with embedded ML parameterization) climate simulations can be technically challenging and computationally intensive. As a result, basic questions such as “what differences in online error can be statistically shown to originate from offline ML architecture and optimization choices?” are not empirically addressed with statistical confidence.

To answer these questions, we designed a modular, end-to-end pipeline for rapidly sampling the coupled behavior of NN emulators of moist convection using multi-hundred member ensembles. We use this capability to evaluate the effects of different design decisions for full-physics emulators of subgrid atmospheric processes (see Section 2.3):

1. A standard configuration that serves as a baseline. To provide for a strong baseline, it consolidates several design decisions previously advocated for stabilizing online simulations. More details regarding these heuristics can be found in Section 2.3.1.
2. A specific humidity configuration that uses specific humidity instead of relative humidity for the moisture input.
3. A no memory configuration that omits the temporal history of heating and moistening tendencies (predictands) from the input vector.
4. A no wind configuration that omits meridional wind from the input vector.
5. A no ozone configuration that omits ozone from the input vector.
6. A no zenith angle configuration that omits zenith angle from the input vector.
7. An MAE configuration that uses Mean Absolute Error (MAE) instead of Mean Squared Error (MSE) for the (offline) loss function.
8. A no dropout configuration that sets dropout to zero.
9. A multiclimate configuration that is trained on -4K, +0K, and +4K adjusted sea surface temperatures (SSTs) compared to the reference simulation.

Our level of online sampling is extensive and at least two orders of magnitude higher than what is traditionally shown in the literature [Wang \BOthers. (\APACyear2022), Han \BOthers. (\APACyear2023), Iglesias-Suarez \BOthers. (\APACyear2024)]. We will show in the following sections that sampling online at a scale of hundreds of architecture trials leads to reproducible improvements to hybrid model performance that become statistically identifiable.

2 Methods

2.1 Reference Climate Simulation

The data used to train the NNs comes from the Super-Parameterized Community Atmosphere Model v3 (SPCAM 3) in an aquaplanet setting of intermediate complexity that has proved popular for exploring trade-offs in hybrid-ML simulation [Gentine \BOthers. (\APACyear2018), Rasp \BOthers. (\APACyear2018), Ott \BOthers. (\APACyear2020), Behrens \BOthers. (\APACyear2022), Beucler \BOthers. (\APACyear2024)]. In superparameterization, each grid cell of a coarse-resolution general circulation model (GCM) contains an idealized two-dimensional CRM that explicitly resolves convection while making use of parameterizations for small-scale turbulence and cloud microphysics [Khairoutdinov \BOthers. (\APACyear2005), M\BPBIS. Pritchard \BBA Bretherton (\APACyear2014), Jones \BOthers. (\APACyear2019)]. The NNs are trained to emulate subgrid heating ( $\Delta T_{\text{phy}}$ ) in K/s and moistening ( $\Delta Q_{\text{phy}}$ ) in kg/kg/s, where $\Delta T_{\text{phy}}$ represents the combined effect of convective processes, radiation, and subgrid turbulence on heating. This is analogous to a traditional albeit multi-process parameterization. The GCM is a simplified, zonally-symmetric aquaplanet with a full diurnal cycle, fixed season (perpetual austral summer), prescribed sea surface temperatures, and a nearly uniformly spaced grid with 64 points along the latitude and 128 points along the longitude dimensions. There are 30 vertical levels within each grid cell, and consecutive GCM timesteps are 30 simulation minutes apart while the CRM has a 20 second timestep. The embedded CRMs have a 4-km horizontal resolution and are each made up of 32 columns. Cloud condensate coupling is omitted from the framework for simplicity. Additional details can be found in the Supplementary Information (SI).

2.2 End-to-End Pipeline and Analysis

Our end-to-end pipeline, called ClimScale, standardizes preprocessing, training, coupling, and analysis across all configurations and is available at https://github.com/SciPritchardLab/ClimScale [Lin \BBA Yu (\APACyear2024)]. Training is parallelized across multiple GPUs using KerasTuner [O’Malley \BOthers. (\APACyear2019)], allowing for the efficient training of hundreds of models in the span of days using a random search strategy. Coupling to SPCAM 3 is enabled by the Fortran-Keras Bridge (FKB) developed by \citeAOtt2020-qe. All hybrid runs are initialized with the same initial condition file used for the reference SPCAM 3 simulation. Subsampling simulation output for analysis is automated with GNU parallel [Tange (\APACyear2018)] and NCO [Zender (\APACyear2008)]. To detect differences resulting from various design choices compared to a baseline “standard” configuration for both offline and online performance, we conduct an array of statistical tests. For comparing average ensemble offline heating and moistening root mean squared error (RMSE) between configurations, we make use of two-tailed, two-sample Welch’s t-tests (using the scipy.stats library) [Virtanen \BOthers. (\APACyear2020)]. For online ensemble survival rates (i.e., the percentage of hybrid model runs from a configuration that do not prematurely crash) and ensemble-median online temperature and moisture RMSE, we use two-tailed proportion tests and two-tailed permutation tests (using 10,000 permutations), respectively. Online error is only computed for runs that integrated without “crashing,” or experiencing numerical instability that halts the hybrid simulation. Our decision to compare ensemble-median RMSE (and not ensemble-mean RMSE) for average ensemble online error is informed by our lack of meaningful online error statistics for crashed runs. 
To control for the false discovery rate, we apply a Benjamini-Hochberg correction to the unadjusted p-values before declaring statistical significance at a significance level of 5% [Benjamini \BBA Hochberg (\APACyear1995)]. We report significant findings in Table 3. Additional details can be found in the SI while further details regarding ClimScale can be found in Section 6.2.
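
As an illustration of the testing procedure described above, the following is a minimal sketch, not the ClimScale implementation. It assumes per-network RMSE values are available as plain arrays; the arrays below are synthetic stand-ins rather than results from this study, and the helper names `permutation_test_median` and `benjamini_hochberg` are hypothetical. The sketch combines a Welch's t-test from scipy.stats, a hand-written two-tailed permutation test on ensemble medians, and a Benjamini-Hochberg step-up correction.

```python
import numpy as np
from scipy import stats


def permutation_test_median(a, b, n_perm=10_000, seed=0):
    """Two-tailed permutation test on the difference of medians."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = np.median(a) - np.median(b)
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = np.median(perm[: a.size]) - np.median(perm[a.size:])
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array marking which hypotheses are rejected."""
    p = np.asarray(p_values, float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject


# Synthetic stand-ins for per-network RMSE values (NOT the paper's data).
rng = np.random.default_rng(42)
baseline_offline = rng.normal(1.93, 0.05, size=330)
ablated_offline = rng.normal(2.50, 0.08, size=330)
baseline_online = rng.lognormal(np.log(3.5), 0.3, size=140)  # surviving runs only
ablated_online = rng.lognormal(np.log(5.5), 0.3, size=35)

# Welch's t-test: two-sample, two-tailed, unequal variances.
welch = stats.ttest_ind(baseline_offline, ablated_offline, equal_var=False)
p_perm = permutation_test_median(baseline_online, ablated_online)

# The third p-value is an arbitrary placeholder for a non-significant test.
rejected = benjamini_hochberg([welch.pvalue, p_perm, 0.60], alpha=0.05)
print(welch.pvalue, p_perm, rejected)
```

Because the Benjamini-Hochberg procedure is a step-up method, every hypothesis ranked at or below the largest rank that meets its threshold is rejected, even if an intermediate p-value misses its own threshold.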

2.3 Neural Network Configurations

To empirically measure the online effects of various design decisions hypothesized to be critical in the literature [Han \BOthers. (\APACyear2020), Han \BOthers. (\APACyear2023), Clark \BOthers. (\APACyear2022), Bhouri \BOthers. (\APACyear2023), Beucler \BOthers. (\APACyear2024), Behrens \BOthers. (\APACyear2024)], we conduct a modified ablation study for a “standard” configuration that bundles several claims regarding what would be necessary to improve the online behavior of NN convective parameterizations (detailed in Section 2.3.1). In our ablation study, we test two different kinds of modifications, which we will call “forward” and “backward” ablations. As with a traditional ablation study in ML, we test the downstream effect of removing a component of the standard (aka baseline) configuration to reveal the importance of the removed component. We call these “backward ablations”. Our “forward ablations” correspond to configurations that add a feature we suspect could improve online performance.

Our naive hypothesis is that offline error, online error, and online stability improve together. Correspondingly, we expect the backward ablations to worsen offline and online performance and vice versa.

Since small sample sizes and high variability in online error stemming from stochastic hyperparameter selection can mask differences between configurations, we train 330 NNs for each of the nine configurations for a total of 2,970 NNs, which is $>27.5\times$ that of \citeAOtt2020-qe and at least two orders of magnitude larger than what is traditionally shown in the literature [Rasp \BOthers. (\APACyear2018), Wang \BOthers. (\APACyear2022), Iglesias-Suarez \BOthers. (\APACyear2023)]. Our search space was initially identical to that used by \citeAOtt2020-qe, but we made several modifications intended to reduce variability resulting from an excessively large, sparse, or suboptimal hyperparameter search space. We found that using a learning rate range that spanned multiple orders of magnitude was a dominant source of variation in validation error, so we narrowed the learning rate range to just one order of magnitude (1e-4 to 1e-3) and sampled it logarithmically. For similar reasons, we decided to sample all possible layer widths within tighter bounds instead of sampling a few possibilities separated by factors of two. Finally, since the Adam optimizer was shown to be dominant in \citeAOtt2020-qe, we replaced the stochastic gradient descent (SGD) and RMSprop optimizers with Rectified Adam and Quasi-Hyperbolic Adam, which performed similarly to Adam in preliminary experiments and have been shown to be performant in other contexts [Ma \BBA Yarats (\APACyear2018), Liu \BOthers. (\APACyear2019), Geleta \BOthers. (\APACyear2023)]. Put together, these interventions result in distributions of online temperature and moisture error that are concentrated within one order of magnitude (instead of two as in \citeAOtt2020-qe).

Hyperparameter	Range
Hidden layers	$\llbracket$ 4, 11 $\rrbracket$
Nodes per layer	$\llbracket$ 200, 480 $\rrbracket$
Batch normalization	{On, Off}
Dropout	[0.0, 0.25]
Optimizer	{Adam, RAdam, QHAdam}
Leaky ReLU slope	[0.0, 0.4]
Learning rate	[1e-4, 1e-3]

Table 1: Common neural network hyperparameter search space shared by all tested parameterization design configurations (excluding the no dropout configuration). All NNs tested here are dense, feedforward NNs. RAdam and QHAdam stand for “Rectified Adam” and “Quasi-hyperbolic Adam”, respectively. Each hyperparameter is sampled uniformly at random, logarithmically in the case of learning rates.
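
The search space in Table 1 can be sketched as a simple random sampler. This is only an illustration using the Python standard library and is not the KerasTuner search used in this study; the function name `sample_hyperparameters`, the dictionary keys, and the `allow_dropout` flag (standing in for the no dropout configuration) are assumptions made for the example.

```python
import math
import random


def sample_hyperparameters(rng, allow_dropout=True):
    """Draw one configuration from the Table 1 search space."""
    log_lo, log_hi = math.log10(1e-4), math.log10(1e-3)
    return {
        "hidden_layers": rng.randint(4, 11),       # inclusive integer range
        "nodes_per_layer": rng.randint(200, 480),  # inclusive integer range
        "batch_normalization": rng.choice([True, False]),
        "dropout": rng.uniform(0.0, 0.25) if allow_dropout else 0.0,
        # Optimizer names are labels only; no optimizer is constructed here.
        "optimizer": rng.choice(["Adam", "RAdam", "QHAdam"]),
        "leaky_relu_slope": rng.uniform(0.0, 0.4),
        # Log-uniform: sample the exponent uniformly, then exponentiate.
        "learning_rate": 10 ** rng.uniform(log_lo, log_hi),
    }


rng = random.Random(0)
trials = [sample_hyperparameters(rng) for _ in range(330)]
print(trials[0])
```

Sampling the learning-rate exponent uniformly spreads trials evenly across the decade between 1e-4 and 1e-3, whereas uniform sampling of the rate itself would concentrate most trials near the upper bound.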

All nine configurations share a set of inputs used in other studies for the emulation task (e.g., [Rasp \BOthers. (\APACyear2018), Ott \BOthers. (\APACyear2020)]): vertically resolved coarse-scale temperature and humidity variables as well as scalars for surface pressure, top-of-atmosphere insolation, surface sensible heat flux, and surface latent heat flux. Output variables across all configurations are identical, consisting of vertically resolved heating and moistening tendencies. In all cases, the top five vertical levels (corresponding to anything approximately above 55 hPa) of moisture and moistening are not included during training and set to zero during coupling, in line with findings from \citeAClark2022-sr and \citeABrenowitz2020-bv that negligible magnitudes and causatively irrelevant variability at these altitudes can inhibit the machine learning. Radiative fluxes and precipitation are excluded from the output as they are diagnostic (uncoupled) in an aquaplanet setting (i.e., without land). The complete list of inputs and outputs for the standard configuration is shown in Table 2.

Input variables			Output variables
Variable	Unit	$N_{z}$	Variable	Unit	$N_{z}$
Temperature	K	30	Heating rate $\Delta T_{\text{phy}}$	K s^-1	30
Relative Humidity*	%	25	Moistening rate $\Delta Q_{\text{phy}}$	kg kg^-1 s^-1	25
$(t-1)$ Heating rate $\Delta T_{\text{phy}}$ *	K s^-1	30
$(t-1)$ Moistening rate $\Delta Q_{\text{phy}}$ *	kg kg^-1 s^-1	25
Surface pressure	Pa	1
Incoming solar radiation	W m^-2	1
Sensible heat flux	W m^-2	1
Latent heat flux	W m^-2	1
Meridional wind*	m s^-1	30
Ozone volume mixing ratio*	m³ m^-3	30
Cosine of zenith angle*		1
Size of stacked vectors		175			55

Table 2: Input and output variables, their units, and their number of vertical levels $N_{z}$ for the standard configuration. $(t-1)$ refers to the value corresponding to the previous timestep. Variables indicated with an asterisk are removed or transformed in configurations corresponding to backward ablations.
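
As a quick consistency check on Table 2, the stacked vector sizes can be recomputed from the per-variable level counts. The dictionary keys below are hypothetical labels, and the input sizes for the backward ablations are implied by the table rather than stated in the text.

```python
# Number of vertical levels per variable, copied from Table 2.
INPUTS = {
    "temperature": 30,
    "relative_humidity": 25,
    "prev_heating_rate": 30,
    "prev_moistening_rate": 25,
    "surface_pressure": 1,
    "incoming_solar_radiation": 1,
    "sensible_heat_flux": 1,
    "latent_heat_flux": 1,
    "meridional_wind": 30,
    "ozone": 30,
    "cos_zenith_angle": 1,
}
OUTPUTS = {"heating_rate": 30, "moistening_rate": 25}

# Inputs dropped by each backward ablation that removes variables.
ABLATIONS = {
    "no memory": ["prev_heating_rate", "prev_moistening_rate"],
    "no wind": ["meridional_wind"],
    "no ozone": ["ozone"],
    "no zenith angle": ["cos_zenith_angle"],
}


def input_size(removed=()):
    """Length of the stacked input vector after dropping `removed` variables."""
    return sum(n for name, n in INPUTS.items() if name not in removed)


print("standard:", input_size(), "inputs,", sum(OUTPUTS.values()), "outputs")
for config, removed in ABLATIONS.items():
    print(f"{config}: {input_size(removed)} inputs")
```

The specific humidity configuration transforms the moisture input rather than removing it, so its input vector keeps the same length as the standard configuration.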

2.3.1 Standard Configuration [Baseline]

The standard configuration is our baseline against which other configurations are compared. While there is some design similarity to early work of \citeARasp2018-fk, additional heuristics like using a relative humidity transformation for moisture input [Beucler \BOthers. (\APACyear2024)], removing stratospheric moisture and moistening [Brenowitz, Beucler\BCBL \BOthers. (\APACyear2020), Clark \BOthers. (\APACyear2022)], including convective memory [Han \BOthers. (\APACyear2020), Han \BOthers. (\APACyear2023)], expanding the input vector to include ozone volume mixing ratio and cosine of zenith angle, and normalizing the outputs by their standard deviation (per vertical level) are also included so as to provide for a skillful preliminary baseline. Several of these heuristics are empirically tested in the other configurations listed below. A full list of inputs and outputs for this configuration is shown in Table 2.

2.3.2 Specific Humidity Configuration [Backward ablation]

The specific humidity configuration reverses the relative humidity transformation for moisture input in the standard configuration in order to see if the offline benefits of a “climate-invariant” transformation discovered in \citeABeucler2024-vb also manifest in improved performance online. \citeABeucler2024-vb identified multiple feature transformations that improve generalization across climates in an offline setting by decreasing the number of situations in which neural networks extrapolate out-of-distribution. In the context of moisture, the marginal distribution of relative humidity changes very little in warmer climates (when measuring change using Hellinger distance) and is bounded by design, except in cases of supersaturation [Beucler \BOthers. (\APACyear2024)]. While the NNs are tested online in the same climate, the implicit hypothesis is that online errors may include pathologies in which the ML-coupled fluid dynamics lead the input state vector out-of-distribution from the training set, such that climate-invariant feature transformations may be beneficial. Consequently, we expect a configuration that removes such a transformation to worsen offline and online performance relative to our baseline.
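
To make the transformation concrete, the sketch below converts specific humidity to relative humidity with respect to liquid water. It is a simplified illustration under stated assumptions: the saturation vapor pressure uses the Bolton (1980) approximation, and the formulation used in \citeABeucler2024-vb and in our pipeline may differ (for example, in how ice saturation is handled). The function names are hypothetical.

```python
import math


def saturation_vapor_pressure(temp_k):
    """Saturation vapor pressure over liquid water in Pa (Bolton, 1980)."""
    temp_c = temp_k - 273.15
    return 611.2 * math.exp(17.67 * temp_c / (temp_c + 243.5))


def relative_humidity(q, temp_k, pressure_pa):
    """Relative humidity in percent from specific humidity q (kg/kg)."""
    eps = 0.622  # ratio of the gas constants of dry air and water vapor
    e_sat = saturation_vapor_pressure(temp_k)
    q_sat = eps * e_sat / (pressure_pa - (1.0 - eps) * e_sat)
    return 100.0 * q / q_sat


# The same specific humidity maps to a lower relative humidity in a warmer column.
q = 0.011  # kg/kg, illustrative value
print(relative_humidity(q, 300.0, 100_000.0))
print(relative_humidity(q, 304.0, 100_000.0))
```

The example shows why the two moisture variables behave differently under warming: specific humidity values in a warmer climate can exceed the range seen in training, whereas relative humidity remains largely confined to values between 0% and 100%.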

2.3.3 No Memory Configuration [Backward ablation]

The no memory configuration omits convective memory (i.e., heating and moistening tendencies from the previous timestep) from the input. Including such memory was argued to be beneficial online in \citeAHan2023-nm and offline in \citeAHan2020-xy while similar benefits were shown using precipitation from a previous timestep in \citeABehrens2024-vj. In other work, \citeAColin2019-ks and \citeAShamekh2023-ed formally reveal the existence of a persistent “microstate memory” due to convective organization in the CRMs that our NNs with memory are trained to emulate. Thus we expect the configuration without memory to worsen offline and online performance relative to our baseline.
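
The construction of inputs with convective memory can be sketched as follows. This is an illustrative NumPy example rather than the preprocessing code in ClimScale; it assumes a time-ordered series for a single column, and the function name `add_memory` and the array layout are hypothetical.

```python
import numpy as np


def add_memory(state, tendencies):
    """Pair each timestep's state with the previous timestep's tendencies.

    state:      array of shape (n_time, n_state), inputs without memory
    tendencies: array of shape (n_time, n_out), heating and moistening
    Returns inputs of shape (n_time - 1, n_state + n_out) and
    targets of shape (n_time - 1, n_out).
    """
    # The first timestep has no predecessor, so it is dropped.
    inputs = np.concatenate([state[1:], tendencies[:-1]], axis=1)
    targets = tendencies[1:]
    return inputs, targets


# Toy series: 5 timesteps, 120 memory-free inputs, 55 outputs (sizes from Table 2).
rng = np.random.default_rng(0)
state = rng.normal(size=(5, 120))
tendencies = rng.normal(size=(5, 55))

inputs, targets = add_memory(state, tendencies)
print(inputs.shape, targets.shape)
```

During online coupling, the previous-timestep tendencies are the NN's own earlier predictions rather than reference values, which is one way in which errors can feed back across timesteps.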

2.3.4 No Wind Configuration [Backward ablation]

The no wind configuration excludes meridional wind from the input, a variable that was included in previous work by \citeARasp2018-fk and \citeAOtt2020-qe but argued to be of secondary importance by \citeAHan2020-xy and \citeAMooers2021-sh. Since this dimension of the horizontal wind field theoretically has the capacity to constrain the convection in SPCAM due to organizing effects of wind shear, we expect this configuration to worsen offline and online performance relative to our baseline.

2.3.5 No Ozone Configuration [Backward ablation]

The no ozone configuration omits ozone mixing ratio from the input vector similar to previous work [Rasp \BOthers. (\APACyear2018), Ott \BOthers. (\APACyear2020)]. Because ozone is causally relevant to radiative heating in the stratosphere and is prescribed as a function of latitude in the climate simulator, its omission in \citeAOtt2020-qe and \citeARasp2018-fk was a mistake. Accordingly, we expect this configuration to worsen offline and online performance relative to our baseline.

2.3.6 No Zenith Angle Configuration [Backward ablation]

The no zenith angle configuration omits the cosine of zenith angle from the input vector similar to previous work [Rasp \BOthers. (\APACyear2018), Ott \BOthers. (\APACyear2020)]. Calculating optical depth is a crucial part of radiative transfer models. Given that the zenith angle is a key factor in optical depth calculations, we expect omitting it to worsen online and offline performance.

2.3.7 Mean Absolute Error (MAE) Configuration [Forward ablation]

The MAE configuration swaps out the MSE loss used for training in all other configurations with an MAE loss. Compared to the MSE loss, the MAE loss is designed to learn the conditional median rather than the conditional mean, making it potentially less sensitive to outliers in the data as well as asymmetric distributions. In our context, our experimentation with the MAE loss is primarily motivated by the fact that the marginal distributions of different tendency variables appear more Laplacian than Gaussian. For Laplace-distributed errors, an MAE loss is more appropriate [Hodson (\APACyear2022)]. While we do not have access to the conditional distributions of the outputs since we do not have multiple realizations of the CRM given a single input, we suspect they are similar to the marginal distributions. If this is the case, we expect an MAE loss to improve offline and online performance. The marginal distributions of the heating and moistening tendencies are shown in Figures S23 and S24 in the SI.
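
The distinction between the two losses can be illustrated with a toy example in which the model is reduced to a single constant prediction. The sample below is invented for illustration and the helper `best_constant` is hypothetical; the point is only that the MSE-optimal constant coincides with the sample mean while the MAE-optimal constant coincides with the sample median.

```python
import statistics


def best_constant(values, loss):
    """Grid-search the constant prediction that minimizes a summed loss."""
    lo, hi = min(values), max(values)
    grid = [lo + (hi - lo) * i / 10_000 for i in range(10_001)]
    return min(grid, key=lambda c: sum(loss(v - c) for v in values))


# Skewed toy sample with one large outlier (illustrative only).
targets = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 10.0]

mse_fit = best_constant(targets, lambda r: r * r)
mae_fit = best_constant(targets, abs)

print(mse_fit, statistics.mean(targets))    # MSE minimizer tracks the mean
print(mae_fit, statistics.median(targets))  # MAE minimizer tracks the median
```

The single outlier pulls the MSE-optimal prediction well away from the bulk of the sample, while the MAE-optimal prediction stays at the median.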

2.3.8 No Dropout Configuration [Forward ablation]

The no dropout configuration removes dropout from the search space shown in Table 1. In many ML contexts, dropout is typically used as a regularization tool to prevent overfitting [Hinton \BOthers. (\APACyear2012), Srivastava \BOthers. (\APACyear2014), Molina \BOthers. (\APACyear2021)] but also has seen application as a computationally-efficient tool for representing model uncertainty [Gal \BBA Ghahramani (\APACyear2015), Behrens \BOthers. (\APACyear2024)]. While \citeAOtt2020-qe did not find a strong relationship between dropout and online performance, \citeABehrens2024-vj found that using even a modest amount of dropout (as small as .05) in the last hidden layer of a neural network resulted in noticeable deterioration in offline skill. Given recent experience from \citeABehrens2024-vj, we expect eliminating dropout completely to improve both offline and online performance.

2.3.9 Multiclimate Configuration [Forward ablation]

The multiclimate configuration is trained on three different climates, one of which is identical to that used for other configurations. The other two are created by prescribing sea surface temperatures that are 4 K colder and 4 K warmer and waiting for the atmosphere to equilibrate. The number of training data samples used per climate is roughly 1/3 of the total number of training data samples used in the other configurations. Training on multiple climates has been argued to help reduce extrapolation error in prior work [Clark \BOthers. (\APACyear2022), Bhouri \BOthers. (\APACyear2023), Lin, Bhouri\BCBL \BOthers. (\APACyear2024)], and, in general, diversifying training data is a well-known and intuitive heuristic for improving the generalization performance of deep learning models. Intuitively, we expect training on multiple climates to improve offline and online performance.

3 Results

3.1 Offline Results

Figure 1 tests our expectations for offline error. The comprehensive sampling (330 NNs per configuration) successfully yields a useful range of ranked offline skill values for each configuration, which are already discernible from one another. More formally, we compare the ensemble-average offline heating and moistening errors of each backward and forward ablation to those of the standard configuration, using two-tailed Welch’s t-tests to detect significant differences in means.

Several of our expectations are confirmed. Among configurations corresponding to backward ablations, removing memory stands out as having the largest impact as it increases the ensemble-average offline RMSE substantially by 0.567 K/day (0.256 g/kg/day) for mean heating (moistening) (Figure 1 a,b). The offline impact of reformulating the moisture input from relative to specific humidity is less extreme yet nonetheless detectable, increasing ensemble-average offline RMSE by .0288 K/day (.00789 g/kg/day). In terms of forward ablations, the MAE and no dropout configurations both show offline improvement, decreasing ensemble-average offline RMSE by .0226 K/day (.00952 g/kg/day) and .0765 K/day (.0545 g/kg/day), respectively. These differences are all statistically significant.

Other expectations are invalidated. Our backward ablations of removing wind, ozone, and zenith angle from the input vector did not statistically affect offline error (Figure 1 c,d). Meanwhile, our forward ablation of training on multiple climates also had no statistically detectable effect. Since \citeAOtt2020-qe and \citeAWang2022-po both showed a relationship between offline error and online performance (albeit solely in the form of online simulation stability), we might expect inter-configuration differences in online error (and online stability) to mirror differences identified offline. However, these relationships, especially with respect to offline and online error specifically, have not been statistically evaluated in previous works (e.g., \citeAOtt2020-qe, \citeAWang2022-po).

If offline error truly is predictive of online performance in general, Figure 1 implies we should expect both online stability and online error to be worse for the no memory and specific humidity configurations, statistically the same for the no wind, no ozone, no zenith angle, and multiclimate configurations, and better for the MAE and no dropout configurations. We will investigate this issue in the following section.

Figure 1: Offline test RMSE for subgrid heating (1a,1c,1e) and moistening (1b,1d,1f) tendencies across configurations are plotted against validation error rank. 1a and 1b show RMSE for average, multiple linear regression (MLR), and martingale baselines. The MLR baseline makes use of inputs and outputs from the standard configuration. 1e and 1f show configurations with statistically distinct average RMSE, and 1c and 1d show the others.

Configuration	Survival	Online Error (K)	Offline Error (K/day)	Spearman’s $\rho$
standard	42.7%	3.48 K	1.93 K/day	.855
specific humidity	-26.1%	+.130 K	+.0288 K/day	.834
no memory	-32.7%	+.800 K	+.567 K/day	.311
no wind	+1.82%	-.00430 K	+.00526 K/day	.789
no ozone	+13.3%	+.212 K	+.00443 K/day	.803
no zenith angle	+11.2%	-.142 K	+.00226 K/day	.829
MAE	+32.4%	+1.16 K	-.0226 K/day	.706
no dropout	-14.2%	-.967 K	-.0765 K/day	.353
multiclimate	+27.3%	-.237 K	-.000903 K/day	.858

Table 3: The columns titled Survival, Online Error (K), Offline Error (K/day), and Spearman’s $\rho$ refer, respectively, to online survival rate (i.e., percentage of runs that integrated the full simulation year without crashing), ensemble-median online temperature RMSE in K, ensemble-average offline heating RMSE in K/day, and Spearman correlation between offline heating and online temperature RMSE. Online error statistics and offline-online relationships exclude runs that crash. All rows following the first denote anomalies relative to the standard configuration’s statistics. Significant anomaly values are bolded, and color indicates direction of difference (with red indicating worse and teal indicating better performance). Statistics are deemed significant at $\alpha=0.05$ after applying a Benjamini-Hochberg correction, and Figure S1 in the SI shows this correction being applied. The analogous table for online moisture and online moistening error is similar and can be found in Table S3 in the SI.
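
For the survival rates in Table 3, a two-tailed proportion test can be sketched as below. Two assumptions are made here and neither is taken from the analysis code: the test is written as a pooled two-proportion z-test, and the survivor counts (141 and 55 out of 330) are back-calculated from the reported percentages under the assumption that all 330 members of each ensemble were run online. They are illustrative only.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed normal p-value
    return z, p_value


# Back-calculated, illustrative counts: standard vs. specific humidity.
z, p_value = two_proportion_z_test(141, 330, 55, 330)
print(f"z = {z:.2f}, p = {p_value:.2e}")
```

The resulting unadjusted p-value would then enter the Benjamini-Hochberg correction together with the p-values from the other tests.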

3.2 Online Results

We report statistically significant relationships between offline and online error, which have not been examined in previous studies. Table 3 presents statistically significant Spearman rank correlations between offline heating (moistening) and online temperature (moisture) error, even though these relationships have been contested or questioned in previous work [Brenowitz, Henn, et al. (2020)]. This holds in every configuration that includes convective memory, which may explain why such relationships were not reported previously when convective memory was not included as an input. That said, for the no dropout configuration the relationship is weaker, and it is not detectable when specifically comparing offline moistening with online moisture error. A possible explanation is that the no dropout configuration has the least variance in offline error (with $>70\%$ less variance in offline heating and moistening error than the standard configuration) and the lowest offline error across most models, potentially hinting at saturating improvements given architecture and data constraints.
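To make the rank correlation in Table 3 concrete, the following minimal sketch computes Spearman's $\rho$ between offline and online error for the surviving runs of a single configuration. The function names and the toy error values are hypothetical and do not come from the paper's pipeline; tied values receive their average rank.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy values (hypothetical): offline heating RMSE in K/day and online
# temperature RMSE in K for five NNs that completed the simulation.
offline = [1.80, 1.85, 1.93, 2.10, 2.40]
online = [2.9, 3.6, 3.3, 4.1, 6.0]
rho = spearman_rho(offline, online)
```

Because only ranks enter the calculation, the statistic is insensitive to the heavy tails that online RMSE can exhibit, which is one reason a rank correlation is a natural choice here.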

Also for the first time, we show that using stability alone as a proxy for general online performance can lead to incorrect conclusions. This departs from previous findings that improving performance on one online metric is accompanied by (or implies) simultaneous improvement on all others [Gagne et al. (2020), Ott et al. (2020), Wang et al. (2022)]. This is illustrated by the MAE and no dropout configurations, which have exactly opposite effects on stability and online error despite both decreasing offline error relative to the standard configuration. The MAE configuration simultaneously has best-in-class stability and the highest ensemble-median online temperature and moisture error. By contrast, the no dropout configuration has best-in-class online error and a survival rate 14.2 percentage points lower (in absolute terms) than the standard configuration. From the perspective of the survival rate metric used in Ott et al. (2020) and Wang et al. (2022), the MAE configuration might appear superior. However, from an operational climate modeling perspective, where only the most skillful trained emulators are considered for operational use, the no dropout configuration is the clear winner.

Among configurations corresponding to both backward and forward ablations, only one configuration comes with no surprises online. Excluding memory unambiguously worsens performance across all metrics both offline and online: offline error, stability, and online error. By contrast, although the multiclimate configuration had no detectable offline impact, it is the only configuration to statistically improve upon the standard configuration on every online metric, as seen in Figures 2h and 3. The no wind, no ozone, and no zenith angle configurations also had no detectable offline impact, but the no ozone configuration does have statistically higher ensemble-median online temperature RMSE, a reassuring and expected finding. Despite the importance of zenith angle in calculating optical depth, we suspect it had less of an impact on the ensemble-median temperature RMSE because of a high R-squared between cosine of zenith angle and solar insolation ( $>.99$ when values corresponding to zero solar insolation are masked out). Surprisingly, the no ozone and no zenith angle configurations both have improved stability. We speculate this is because certain input features are redundant and causal pruning may be necessary to avoid learning spurious correlations that fail to generalize outside the training data [Iglesias-Suarez et al. (2024)]. As for the remaining backward ablation, the specific humidity configuration worsens stability but does not have a statistically significant impact on online error. We should note that detection of an arbitrary effect size is partially a function of sample size. For example, it is possible that, had we tried more than 330 NNs per configuration, we might have detected differences in ensemble-median online RMSE for the specific humidity configuration. In the next section, we discuss what sample sizes are necessary for robust detection.

4 Estimating necessary sample size for robust detection

Because the dispersion of online error seen in previous work like [Ott et al. (2020)] can easily span two orders of magnitude, we chose a sample size per configuration of 330 NNs. However, we readily admit this exact number was chosen arbitrarily and may not reflect best practices going forward. It is important to note that selecting a sample size for non-parametric tests is not firmly established, especially for statistics like medians, which do not obey the usual Central Limit Theorem. Multiple methods exist, but they are imperfect and rely on simplifying assumptions [Noether (1987), Hamilton & Collings (1991)]. The situation is further complicated by the fact that many hybrid runs may crash, and only those completing the entire simulation are used for online error statistics.

To make headway on sample size estimation, we use our least detectable (i.e., the significant result with the highest p-value) yet scientifically justified finding: that excluding ozone from the input increases online temperature error. In line with common practice, we choose a significance level, $\alpha$ , of $5\%$ and a power, $1-\beta$ , of $80\%$ . We also make several simplifying assumptions:

• An equal proportion, $\gamma$ , of NNs integrate without crashing in both configurations. More specifically, this proportion is equal to the survival rate of the standard configuration.

• If the distributions of online error for both configurations vary, they only vary in location, not shape, skew, or dispersion (i.e., the distribution of online error for one configuration can be approximated by shifting the other).

Under these assumptions, the Mann-Whitney U-test (also known as the Wilcoxon rank-sum test) is equivalent to a test of a difference in medians. We initially used permutation tests to detect differences in ensemble-median online error, as they can detect differences in percentiles between samples without making assumptions regarding shape, skew, or dispersion. The Mann-Whitney U-test makes stronger assumptions than a permutation test, but provides a closed form for power calculations.
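For readers unfamiliar with permutation tests, here is a minimal sketch of one for a difference in medians. It is an illustration under stated assumptions, not the paper's implementation: the function name, the two-sided alternative, the number of permutations, and the add-one correction are all choices made for this example.

```python
import random
from statistics import median


def permutation_test_median(x, y, n_permutations=10_000, seed=0):
    """Estimate a two-sided permutation p-value for a difference in medians."""
    rng = random.Random(seed)
    observed = abs(median(x) - median(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    count = 0
    for _ in range(n_permutations):
        # Randomly reassign group labels by shuffling the pooled sample.
        rng.shuffle(pooled)
        diff = abs(median(pooled[:n_x]) - median(pooled[n_x:]))
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (count + 1) / (n_permutations + 1)


# Hypothetical online RMSEs (K) for two small, well-separated ensembles.
config_a = list(range(1, 11))
config_b = list(range(101, 111))
p_value = permutation_test_median(config_a, config_b, n_permutations=2000)
```

In practice, `x` and `y` would hold the online RMSEs of the runs that completed the simulation in each configuration, so the two samples generally differ in size.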

Using the sample size estimation for the Mann-Whitney U-test derived in Noether (1987) and tested in Hamilton and Collings (1991), we have:

$$n^{*}=\left\lceil\left\lceil\frac{\left(\Phi^{-1}(\kappa_{0})+\Phi^{-1}(1-\alpha)\right)^{2}}{6(P-0.5)^{2}}\right\rceil\cdot\frac{1}{\gamma}\right\rceil$$

where $n^{*}$ is our sample size estimate, $\kappa_{0}=.8$ is our desired power, $\alpha=.05$ is our significance level, $\gamma=141/330$ is our assumed survival rate (i.e., the survival rate of the standard configuration), and $P$ is $P(Y>X)$ , where $X$ and $Y$ are realizations from the populations that our configurations are each sampling from. $\Phi^{-1}$ is the inverse of the cumulative distribution function of a normal distribution, and the ceiling notation indicates rounding up to the next integer. To calculate $P$ , we apply the Mann-Whitney U-test to the online RMSEs for the standard configuration and an offset version of those same online RMSEs. This offset is set to the difference in ensemble-median RMSE when compared to the no ozone configuration. From this test, we get a U-statistic of 8,363. This means $P\approx 42.1\%$ and $n^{*}\approx 384$ . We hope this sample size estimate serves as a useful reference point, and not a canonical requirement, for future work. Larger effect sizes require far fewer samples for detection. As an example, detecting the shifts in ensemble-median online temperature error for the multiclimate and no dropout configurations with 80% power using the same assumptions would require sample sizes of 312 and 29, respectively. Additionally, the value of $\gamma$ may change in future work, as the configuration used as the new baseline will likely have improved stability. This will make smaller sample sizes viable for detecting the same effect size.
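The arithmetic behind the no ozone estimate can be checked with a short script. This is a sketch under one stated assumption: that $P$ is estimated as the U-statistic divided by the number of pairs, with 141 surviving runs in each of the two samples (the standard configuration and its offset copy). The function and variable names are hypothetical.

```python
import math
from statistics import NormalDist


def noether_sample_size(p, power=0.8, alpha=0.05, survival_rate=1.0):
    """Per-configuration sample size for a one-sided Mann-Whitney U-test.

    p is P(Y > X); survival_rate inflates the count to allow for crashed runs.
    """
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    # Inner ceiling: number of completed runs needed per configuration.
    n_completed = math.ceil((z(power) + z(1 - alpha)) ** 2 / (6 * (p - 0.5) ** 2))
    # Outer ceiling: number of NNs to train so that enough runs survive.
    return math.ceil(n_completed / survival_rate)


# Assumption: P is estimated as U / (n_x * n_y) with n_x = n_y = 141.
n_survivors = 141
p_hat = 8363 / n_survivors**2
n_star = noether_sample_size(p_hat, survival_rate=n_survivors / 330)
print(round(p_hat, 3), n_star)  # 0.421 384
```

Under this assumption the script reproduces both reported values. Setting `survival_rate=1.0` gives the number of completed runs required (164 here), which shows how strongly the final estimate depends on the baseline survival rate.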

5 Persistent zonal mean biases

Across configurations, we observe common patterns in zonal mean biases relative to the reference SPCAM simulation, with a cold bias in the stratosphere and a warm bias at the poles (as seen in Figure 4). Systematic biases in the stratosphere and near the poles have also been reported by others [Wang et al. (2022), Iglesias-Suarez et al. (2024)], and our results confirm that the biases experienced in our work are reproducible. However, we also show that certain interventions can change the structure of this bias. For example, training on multiple climates results in an improved, asymmetric zonal mean bias, indicating that while persistent, this bias structure can be systematically reduced with fairly rudimentary design interventions. Nevertheless, our bias structure is very sensitive to hyperparameter sampling, as the hybrid simulation with the smallest zonal biases comes from the specific humidity configuration, which is less stable than the standard configuration and consistently has higher offline temperature and moisture RMSE than the no dropout configuration. Sensitivity to hyperparameter tuning may imply that reliably mitigating this bias structure in future work will also require testing additional hypotheses with large ensembles.

6 Conclusion

6.1 Summary and Implications

Our work brings much-needed clarity to the relationship between offline and online performance and yields empirically robust recommendations for the development of future neural network parameterizations of moist convection. This was enabled by the development of a modular, end-to-end pipeline for rapidly training and coupling multi-hundred member ensembles of NN convective parameterizations. Our results reveal that:

• Hyperparameter tuning to reduce offline error can indeed be a viable means for reducing online error.

• However, changing the loss function from MSE to MAE can reduce offline error while increasing online error.

• Lower online error does not inherently come with improved online stability and vice versa.

• A “climate-invariant” (i.e., with limited distribution shift under climate change) feature transformation from specific to relative humidity described in Beucler et al. (2024) improves stability.

• Using memory (to implicitly represent convective memory) and training on multiple climates unambiguously reduce offline and online error while improving online stability.

While we confirm that a strong relationship between offline and online error exists within most configurations, our results also demonstrate the value of robustly sampling online behavior, which is highly variable even among models with similar offline skill. For example, dispersion in online RMSE or survival rates from hyperparameter tuning can be large enough to obscure the impact of different design choices, and it is possible to train a superior model (i.e., when evaluated online) using seemingly suboptimal design choices offline, as evidenced by a single neural network from the specific humidity configuration having the lowest zonal mean temperature biases across all 2,970 models evaluated. Had we drawn conclusions from this singular, performant NN parameterization, we would have drawn a conclusion contrary to that implied by the ensemble at large. Although hundreds of ensemble members may be necessary to detect small yet statistically significant and causally relevant differences (e.g. as shown by the no ozone and multiclimate configurations), our work suggests that, at a minimum, ensemble sizes of a few dozen should be used to detect larger effect sizes with sufficient power.

A key benefit of confirming strong offline-online error relationships is that hybrid physics-ML modelers can have renewed confidence in leveraging modern hyperparameter tuning algorithms like Bayesian Optimization Hyperband (BOHB) and Faster Improvement Rate Population Based Training (FIRE PBT) on offline data to improve subgrid ML parameterizations [Falkner et al. (2018), Dalibard & Jaderberg (2021)]. Additionally, open competitions that crowd-source offline error optimization, like the recent 2024 LEAP Atmospheric Physics using AI (ClimSim) Kaggle competition, may become more relevant to the online hybrid climate modeling evaluation task than they may have previously appeared [Lin, Hu, et al. (2024)].

Previously, skepticism regarding the relevance of optimizing for lower offline error in improving online performance has motivated sophisticated strategies for directly training these emulators on their online behavior. This includes online coupled learning as described in Rasp (2020), gradient-free ensemble Kalman methods as in Lopez-Gomez et al. (2022), and creating a fully differentiable hybrid physics-ML atmospheric model, as demonstrated by Google’s NeuralGCM [Kochkov et al. (2023)]. These efforts are important for future development of hybrid physics-ML climate models; however, our results show that reproducible progress is possible when optimizing for offline error alone.

We also show that lower online error and improved stability do not necessarily come in tandem, as evidenced by the opposite effects of using an MAE loss and removing dropout (despite both improving offline error relative to the standard configuration). Finally, we reveal the existence of persistent online stratospheric biases that can be partially mitigated via extensive hyperparameter tuning and training on multiple climates but deserve a more targeted approach in future research.

Beyond climate simulation, the lessons learned in this paper are relevant to hybrid (ML-physics) simulations across the geosciences. We have demonstrated a standardized process for distinguishing signal from noise in the ML model design decisions that contribute to stable and accurate online performance. Our view is that without infrastructure for such rapid, large-scale downstream online tests, development of hybrid ML-physics models can be muddled. This complexity arises from the difficulty of comparing configurations of ML-parameterizations whose distributions of online error fail to converge due to insufficient sampling. Evidently, the climate model use case is one in which high dispersion owing to stochastic aspects of ML optimization, such as architecture searches, can mask modest but real gains in downstream performance. It can also give a false impression of the benefits of a given design decision. This can ultimately manifest in premature conclusions and missed opportunities. We hope our work motivates a transition from extrapolating from anecdotal experiences or small ensemble sizes to adopting a more systematic and reproducible approach to ML parameterization development.

Open Research

6.2 Software Availability Statement

Version v2 of spcam3.0-neural-net-spreadtesting used for creating the reference SPCAM3 simulation and dynamical core for online runs is preserved at https://zenodo.org/records/11392025, available via Apache License 2.0 and developed openly at https://github.com/SciPritchardLab/spcam3.0-neural-net [M. Pritchard et al. (2024)]. v4 of ClimScale used for conducting hybrid physics-ML climate runs is preserved at https://zenodo.org/records/11402897, available via Apache License 2.0 and developed openly at https://github.com/SciPritchardLab/ClimScale [Lin & Yu (2024)]. All experiments were run on NVIDIA V100 GPUs. Approximately 2,290 GPU-hours were used to train 2,970 NNs across all nine configurations.

Acknowledgements.

This work is primarily funded by National Science Foundation (NSF) Science and Technology Center (STC) Learning the Earth with Artificial Intelligence and Physics (LEAP), Award # 2019625-STC. High-performance computing was facilitated by Bridges2 at the Pittsburgh Supercomputing Center (PSC) through allocation ATM190002 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by NSF grants #2138259, #2138286, #2138307, #2137603, and #2138296. MP acknowledges co-funding from the US Department of Energy (DE-SC0023368). Tom Beucler acknowledges partial funding from the Swiss State Secretariat for Education, Research and Innovation (SERI) for the Horizon Europe project AI4PEX (Grant agreement ID: 101137682). Eliot Wong-Toi is funded by the Hasso Plattner Research School at UC Irvine. We are thankful to the system administrators at PSC, in particular Tom Maiden and TJ Olesky. We thank David Walling for his assistance with HPC support through the XSEDE Extended Collaborative Support Service program. Finally, we would like to acknowledge Dave Lawrence from NCAR and Laure Zanna from NYU for helpful conversations during the 2024 LEAP-STC NSF site visit.

References

Behrens, G., Beucler, T., Gentine, P., Iglesias-Suarez, F., Pritchard, M., & Eyring, V. (2022, August). Non-Linear Dimensionality Reduction With a Variational Encoder Decoder to Understand Convective Processes in Climate Models. J Adv Model Earth Syst, 14(8), e2022MS003130.
Behrens, G., Beucler, T., Iglesias-Suarez, F., Yu, S., Gentine, P., Pritchard, M., ... Eyring, V. (2024, February). Improving atmospheric processes in Earth system models with deep learning ensembles and stochastic parameterizations. arXiv [physics.ao-ph].
Benjamini, Y., & Hochberg, Y. (1995, January). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Stat. Methodol., 57(1), 289–300.
Beucler, T., Gentine, P., Yuval, J., Gupta, A., Peng, L., Lin, J., ... Pritchard, M. (2024, February). Climate-invariant machine learning. Sci Adv, 10(6), eadj7250.
Bhouri, M. A., Peng, L., Pritchard, M. S., & Gentine, P. (2023). Multi-fidelity climate model parameterization for better generalization and extrapolation. arXiv.org.
Brenowitz, N. D., Beucler, T., Pritchard, M., & Bretherton, C. S. (2020, December). Interpreting and Stabilizing Machine-Learning Parametrizations of Convection. J. Atmos. Sci., 77(12), 4357–4375.
Brenowitz, N. D., Henn, B., McGibbon, J., Clark, S. K., Kwa, A., Andre Perkins, W., ... Bretherton, C. S. (2020, November). Machine Learning Climate Model Dynamics: Offline versus Online Performance.
Cachay, S. R., Ramesh, V., Cole, J. N. S., Barker, H., & Rolnick, D. (2021, November). ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models.
Chevallier, F., Chéruy, F., Scott, N. A., & Chédin, A. (1998, November). A Neural Network Approach for a Fast and Accurate Computation of a Longwave Radiative Budget. J. Appl. Meteorol. Climatol., 37(11), 1385–1397.
Clark, S. K., Brenowitz, N. D., Henn, B., Kwa, A., McGibbon, J., Perkins, W. A., ... Harris, L. M. (2022, September). Correcting a 200 km resolution climate model in multiple climates by machine learning from 25 km resolution simulations. J. Adv. Model. Earth Syst., 14(9).
Colin, M., Sherwood, S., Geoffroy, O., Bony, S., & Fuchs, D. (2019, March). Identifying the Sources of Convective Memory in Cloud-Resolving Simulations. J. Atmos. Sci., 76(3), 947–962.
Dalibard, V., & Jaderberg, M. (2021, September). Faster Improvement Rate Population Based Training. arXiv [cs.NE].
Falkner, S., Klein, A., & Hutter, F. (2018, July). BOHB: Robust and Efficient Hyperparameter Optimization at Scale.
Gagne, D. J., II, Christensen, H. M., Subramanian, A. C., & Monahan, A. H. (2020, March). Machine learning for stochastic parameterization: Generative adversarial networks in the Lorenz ’96 model. J. Adv. Model. Earth Syst., 12(3), e2019MS001896.
Gal, Y., & Ghahramani, Z. (2015, June). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.
Geleta, M., Montserrat, D. M., Giro-i Nieto, X., & Ioannidis, A. G. (2023, September). Deep Variational Autoencoders for Population Genetics.
Gentine, P., Pritchard, M., Rasp, S., Reinaudi, G., & Yacalis, G. (2018, June). Could machine learning break the convection parameterization deadlock? Geophys. Res. Lett., 45(11), 5742–5751.
Guillaumin, A. P., & Zanna, L. (2021, September). Stochastic‐deep learning parameterization of ocean momentum forcing. J. Adv. Model. Earth Syst., 13(9).
Hamilton, M. A., & Collings, B. J. (1991, August). Determining the Appropriate Sample Size for Nonparametric Tests for Location Shift. Technometrics.
Han, Y., Zhang, G. J., Huang, X., & Wang, Y. (2020, September). A moist physics parameterization based on deep learning. J. Adv. Model. Earth Syst., 12(9).
Han, Y., Zhang, G. J., & Wang, Y. (2023, October). An ensemble of neural networks for moist physics processes, its generalizability and stable integration. J. Adv. Model. Earth Syst., 15(10).
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012, July). Improving neural networks by preventing co-adaptation of feature detectors.
Hodson, T. O. (2022, July). Root-mean-square error (RMSE) or mean absolute error (MAE): when to use them or not. Geosci. Model Dev., 15(14), 5481–5487.
Iglesias-Suarez, F., Gentine, P., Solino-Fernandez, B., Beucler, T., Pritchard, M., Runge, J., & Eyring, V. (2023, April). Causally-informed deep learning to improve climate models and projections.
Iglesias-Suarez, F., Gentine, P., Solino-Fernandez, B., Beucler, T., Pritchard, M., Runge, J., & Eyring, V. (2024). Causally-informed deep learning to improve climate models and projections. Journal of Geophysical Research: Atmospheres, 129(4), e2023JD039202.
IPCC. (2021). Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change.
Jones, T. R., Randall, D. A., & Branson, M. D. (2019, November). Multiple‐instance superparameterization: 1. Concept, and predictability of precipitation. J. Adv. Model. Earth Syst., 11(11), 3497–3520.
Khairoutdinov, M., Randall, D., & DeMott, C. (2005, July). Simulations of the Atmospheric General Circulation Using a Cloud-Resolving Model as a Superparameterization of Physical Processes. J. Atmos. Sci., 62(7), 2136–2154.
Kochkov, D., Yuval, J., Langmore, I., Norgaard, P., Smith, J., Mooers, G., ... Hoyer, S. (2023, November). Neural General Circulation Models. arXiv [physics.ao-ph].
Krasnopolsky, V. M., Fox-Rabinovitz, M. S., & Belochitski, A. A. (2013, May). Using Ensemble of Neural Networks to Learn Stochastic Convection Parameterizations for Climate and Numerical Weather Prediction Models from Data Simulated by a Cloud Resolving Model. Advances in Artificial Neural Systems, 2013.
Lin, J., Bhouri, M. A., Beucler, T., Yu, S., & Pritchard, M. (2024, January). Stress-testing the coupled behavior of hybrid physics-machine learning climate simulations on an unseen, warmer climate. arXiv [physics.ao-ph].
Lin, J., Hu, Z., Yu, S., Pritchard, M., Gupta, R., Zheng, T., ... Reade, W. (2024). LEAP - Atmospheric Physics using AI (ClimSim). Kaggle. https://kaggle.com/competitions/leap-atmospheric-physics-ai-climsim
Lin, J., & Yu, S. (2024, May). ClimScale. Zenodo. https://zenodo.org/records/11402897 doi:10.5281/zenodo.11402897
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., & Han, J. (2019, September). On the Variance of the Adaptive Learning Rate and Beyond.
Lopez-Gomez, I., Christopoulos, C., Langeland Ervik, H. L., Dunbar, O. R. A., Cohen, Y., & Schneider, T. (2022, August). Training physics‐based machine‐learning parameterizations with gradient‐free ensemble Kalman methods. J. Adv. Model. Earth Syst., 14(8), e2022MS003105.
Ma, J., & Yarats, D. (2018, October). Quasi-hyperbolic momentum and Adam for deep learning.
Molina, M. J., Gagne, D. J., & Prein, A. F. (2021, September). A benchmark to test generalization capabilities of deep learning methods to classify severe convective storms in a changing climate. Earth Space Sci., 8(9).
Mooers, G., Pritchard, M., Beucler, T., Ott, J., Yacalis, G., Baldi, P., & Gentine, P. (2021, May). Assessing the potential of deep learning for emulating cloud superparameterization in climate models with real‐geography boundary conditions. J. Adv. Model. Earth Syst., 13(5).
Noether (\APACyear1987) \APACinsertmetastarNoether1987-po{APACrefauthors}Noether, G\BPBIE. \APACrefYearMonthDay1987\APACmonth06. \BBOQ\APACrefatitleSample size determination for some common nonparametric tests Sample size determination for some common nonparametric tests.\BBCQ \APACjournalVolNumPagesJ. Am. Stat. Assoc.82398645–647. \PrintBackRefs\CurrentBib
O’Malley \BOthers. (\APACyear2019) \APACinsertmetastaromalley2019kerastuner{APACrefauthors}O’Malley, T., Bursztein, E., Long, J., Chollet, F., Jin, H., Invernizzi, L.\BCBL \BOthersPeriod. \APACrefYearMonthDay2019. \APACrefbtitleKerasTuner. Kerastuner. \APAChowpublishedhttps://github.com/keras-team/keras-tuner. \PrintBackRefs\CurrentBib
Ott \BOthers. (\APACyear2020) \APACinsertmetastarOtt2020-qe{APACrefauthors}Ott, J., Pritchard, M., Best, N., Linstead, E., Curcic, M., Baldi, P.\BCBL \BBA Acacio Sanchez, M\BPBIE. \APACrefYearMonthDay2020\APACmonth01. \BBOQ\APACrefatitleA Fortran-Keras Deep Learning Bridge for Scientific Computing A Fortran-Keras deep learning bridge for scientific computing.\BBCQ \APACjournalVolNumPagesSci. Program.2020. \PrintBackRefs\CurrentBib
M. Pritchard \BOthers. (\APACyear2024) \APACinsertmetastarPritchard2024-ac{APACrefauthors}Pritchard, M., Yu, S.\BCBL \BBA Lin, J. \APACrefYearMonthDay2024\APACmonth05. \APACrefbtitlespcam3.0-neural-net-spreadtesting. spcam3.0-neural-net-spreadtesting. \APACaddressPublisherZenodo. \PrintBackRefs\CurrentBib
M\BPBIS. Pritchard \BBA Bretherton (\APACyear2014) \APACinsertmetastarPritchard2014-cv2{APACrefauthors}Pritchard, M\BPBIS.\BCBT \BBA Bretherton, C\BPBIS. \APACrefYearMonthDay2014\APACmonth02. \BBOQ\APACrefatitleCausal Evidence that Rotational Moisture Advection is Critical to the Superparameterized Madden–Julian Oscillation Causal evidence that rotational moisture advection is critical to the superparameterized Madden–Julian oscillation.\BBCQ \APACjournalVolNumPagesJ. Atmos. Sci.712800–815. \PrintBackRefs\CurrentBib
D. Randall \BOthers. (\APACyear2003) \APACinsertmetastarRandall2003-db{APACrefauthors}Randall, D., Khairoutdinov, M., Arakawa, A.\BCBL \BBA Grabowski, W. \APACrefYearMonthDay2003\APACmonth11. \BBOQ\APACrefatitleBreaking the Cloud Parameterization Deadlock Breaking the cloud parameterization deadlock.\BBCQ \APACjournalVolNumPagesBull. Am. Meteorol. Soc.84111547–1564. \PrintBackRefs\CurrentBib
D\BPBIA. Randall (\APACyear2013) \APACinsertmetastarRandall2013-hl{APACrefauthors}Randall, D\BPBIA. \APACrefYearMonthDay2013\APACmonth11. \BBOQ\APACrefatitleBeyond deadlock Beyond deadlock.\BBCQ \APACjournalVolNumPagesGeophys. Res. Lett.40225970–5976. \PrintBackRefs\CurrentBib
Rasp (\APACyear2020) \APACinsertmetastarRasp2020-ns{APACrefauthors}Rasp, S. \APACrefYearMonthDay2020\APACmonth05. \BBOQ\APACrefatitleCoupled online learning as a way to tackle instabilities and biases in neural network parameterizations: general algorithms and Lorenz 96 case study (v1.0) Coupled online learning as a way to tackle instabilities and biases in neural network parameterizations: general algorithms and lorenz 96 case study (v1.0).\BBCQ \APACjournalVolNumPagesGeosci. Model Dev.1352185–2196. \PrintBackRefs\CurrentBib
Rasp \BOthers. (\APACyear2018) \APACinsertmetastarRasp2018-fk{APACrefauthors}Rasp, S., Pritchard, M.\BCBL \BBA Gentine, P. \APACrefYearMonthDay2018\APACmonth08. \BBOQ\APACrefatitleDeep learning to represent subgrid processes in climate models Deep learning to represent subgrid processes in climate models.\BBCQ \PrintBackRefs\CurrentBib
Shamekh \BOthers. (\APACyear2023) \APACinsertmetastarShamekh2023-ed{APACrefauthors}Shamekh, S., Lamb, K\BPBID., Huang, Y.\BCBL \BBA Gentine, P. \APACrefYearMonthDay2023. \BBOQ\APACrefatitleImplicit learning of convective organization explains precipitation stochasticity Implicit learning of convective organization explains precipitation stochasticity.\BBCQ \APACjournalVolNumPagesProceedings of the National Academy of Sciences12020e2216158120. \PrintBackRefs\CurrentBib
Shepherd (\APACyear2014) \APACinsertmetastarShepherd2014-cm{APACrefauthors}Shepherd, T\BPBIG. \APACrefYearMonthDay2014\APACmonth09. \BBOQ\APACrefatitleAtmospheric circulation as a source of uncertainty in climate change projections Atmospheric circulation as a source of uncertainty in climate change projections.\BBCQ \APACjournalVolNumPagesNat. Geosci.710703–708. \PrintBackRefs\CurrentBib
Srivastava \BOthers. (\APACyear2014) \APACinsertmetastarSrivastava2014-cy{APACrefauthors}Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.\BCBL \BBA Salakhutdinov, R. \APACrefYearMonthDay2014. \BBOQ\APACrefatitleDropout: A Simple Way to Prevent Neural Networks from Overfitting Dropout: A simple way to prevent neural networks from overfitting.\BBCQ \APACjournalVolNumPagesJ. Mach. Learn. Res.15561929–1958. \PrintBackRefs\CurrentBib
Tange (\APACyear2018) \APACinsertmetastartange_ole_2018_1146014{APACrefauthors}Tange, O. \APACrefYear2018. \APACrefbtitleGNU Parallel 2018 Gnu parallel 2018. \APACaddressPublisherOle Tange. {APACrefURL} https://doi.org/10.5281/zenodo.1146014 {APACrefDOI} 10.5281/zenodo.1146014 \PrintBackRefs\CurrentBib
Virtanen \BOthers. (\APACyear2020) \APACinsertmetastar2020SciPy-NMeth{APACrefauthors}Virtanen, P., Gommers, R., Oliphant, T\BPBIE., Haberland, M., Reddy, T., Cournapeau, D.\BDBLSciPy 1.0 Contributors \APACrefYearMonthDay2020. \BBOQ\APACrefatitleSciPy 1.0: Fundamental Algorithms for Scientific Computing in Python SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python.\BBCQ \APACjournalVolNumPagesNature Methods17261–272. {APACrefDOI} 10.1038/s41592-019-0686-2 \PrintBackRefs\CurrentBib
Wang \BOthers. (\APACyear2022) \APACinsertmetastarWang2022-po{APACrefauthors}Wang, X., Han, Y., Xue, W., Yang, G.\BCBL \BBA Zhang, G\BPBIJ. \APACrefYearMonthDay2022\APACmonth05. \BBOQ\APACrefatitleStable climate simulations using a realistic general circulation model with neural network parameterizations for atmospheric moist physics and radiation processes Stable climate simulations using a realistic general circulation model with neural network parameterizations for atmospheric moist physics and radiation processes.\BBCQ \APACjournalVolNumPagesGeosci. Model Dev.1593923–3940. \PrintBackRefs\CurrentBib
Yu \BOthers. (\APACyear2023) \APACinsertmetastarYu2023-on{APACrefauthors}Yu, S., Hannah, W., Peng, L., Lin, J., Bhouri, M\BPBIA., Gupta, R.\BDBLPritchard, M. \APACrefYearMonthDay2023\APACmonth06. \BBOQ\APACrefatitleClimSim: A large multi-scale dataset for hybrid physics-ML climate emulation ClimSim: A large multi-scale dataset for hybrid physics-ML climate emulation.\BBCQ \PrintBackRefs\CurrentBib
Yuval \BOthers. (\APACyear2021) \APACinsertmetastarYuval2021-bh{APACrefauthors}Yuval, J., O’Gorman, P\BPBIA.\BCBL \BBA Hill, C\BPBIN. \APACrefYearMonthDay2021\APACmonth03. \BBOQ\APACrefatitleUse of neural networks for stable, accurate and physically consistent parameterization of subgrid atmospheric processes with good performance at reduced precision Use of neural networks for stable, accurate and physically consistent parameterization of subgrid atmospheric processes with good performance at reduced precision.\BBCQ \APACjournalVolNumPagesGeophys. Res. Lett.486. \PrintBackRefs\CurrentBib
Zender (\APACyear2008) \APACinsertmetastarZender2008-rd{APACrefauthors}Zender, C\BPBIS. \APACrefYearMonthDay2008\APACmonth10. \BBOQ\APACrefatitleAnalysis of self-describing gridded geoscience data with netCDF Operators (NCO) Analysis of self-describing gridded geoscience data with netCDF operators (NCO).\BBCQ \APACjournalVolNumPagesEnvironmental Modelling & Software2310-111338–1342. \PrintBackRefs\CurrentBib