\jyear

2021 \externaldocumentSI

{CJK*}

UTF8gbsn

\equalcont

These authors contributed equally to this work. \equalcontThese authors contributed equally to this work.

[1]\fnmHao \surLi \equalcontThese authors contributed equally to this work.

\equalcont

These authors contributed equally to this work.

[3,4]\fnmBo \surLu

[2,1]\fnmYuan \surQi

1]\orgdivArtificial Intelligence Innovation and Incubation Institute, \orgnameFudan University, \orgaddress\cityShanghai, \postcode200433, \countryChina

2]\orgnameShanghai Academy of Artificial Intelligence for Science, \orgaddress\cityShanghai, \postcode200232, \countryChina

3]\orgdivChina Meteorological Administration Key Laboratory for Climate Prediction Studies, \orgnameNational Climate Center, \orgaddress\cityBeijing, \postcode100081, \countryChina

4]\orgnameXiong’an Institute of Meteorological Artificial Intelligence, \orgaddress\cityXiong’an, \countryChina

5]\orgnameUniversity of Gothenburg, \orgaddress\countrySweden

6]\orgdivScripps Institution of Oceanography, \orgnameUniversity of California San Diego, \orgaddress\countryUSA

7]\orgdivSchool of Data Science, \orgnameFudan University, \orgaddress\cityShanghai, \postcode200433, \countryChina

8]\orgdivInstitute for Big Data, \orgnameFudan University, \orgaddress\cityShanghai, \postcode200433, \countryChina

9]\orgdivMOE Laboratory for National Development and Intelligent Governance, \orgnameFudan University, \orgaddress\cityShanghai, \postcode200433, \countryChina

A machine learning model that outperforms conventional global subseasonal forecast models

\fnmLei \surChen cltpys@163.com \fnmXiaohui \surZhong x7zhong@gmail.com lihao

\_

lh@fudan.edu.cn \fnmJie \surWu wujie@cma.gov.com bolu@cma.gov.cn \fnmDeliang \surChen deliang@gvc.gu.se \fnmShang-Ping \surXie sxie@ucsd.edu \fnmLibo \surWu wulibo@fudan.edu.cn \fnmQingchen \surChao chaoqc@cma.gov.cn \fnmChensen \surLin linchensen@fudan.edu.cn \fnmZixin \surHu huzixin@fudan.edu.cn qiyuan@fudan.edu.cn [ [ [ [ [ [ [ [ [

Abstract

Skillful subseasonal forecasts are crucial for various sectors of society but pose a grand scientific challenge. Recently, machine learning based weather forecasting models outperform the most successful numerical weather predictions generated by the European Centre for Medium-Range Weather Forecasts (ECMWF), but have not yet surpassed conventional models at subseasonal timescales. This paper introduces FuXi Subseasonal-to-Seasonal (FuXi-S2S), a machine learning model that provides global daily mean forecasts up to 42 days, encompassing five upper-air atmospheric variables at 13 pressure levels and 11 surface variables. FuXi-S2S, trained on 72 years of daily statistics from ECMWF ERA5 reanalysis data, outperforms the ECMWF’s state-of-the-art Subseasonal-to-Seasonal model in ensemble mean and ensemble forecasts for total precipitation and outgoing longwave radiation, notably enhancing global precipitation forecast. The improved performance of FuXi-S2S can be primarily attributed to its superior capability to capture forecast uncertainty and accurately predict the Madden–Julian Oscillation (MJO), extending the skillful MJO prediction from 30 days to 36 days. Moreover, FuXi-S2S not only captures realistic teleconnections associated with the MJO, but also emerges as a valuable tool for discovering precursor signals, offering researchers insights and potentially establishing a new paradigm in Earth system science research.

keywords:

subseasonal forecast, machine learning, FuXi, MJO, explainable machine learning

1 Introduction

Subseasonal forecasting, which predicts weather patterns from 2 to 6 weeks in advance, bridges a critical gap between short-term weather forecasts, typically up to 15 days, and longer-term climate forecasts that extend to seasonal and longer timescales board2016next . Forecasting at this intermediate subseasonal timescale is indispensable for a variety of applications, including agricultural planning, disaster preparedness, mitigating impacts of extreme events such as heatwaves, droughts, floods, and cold spells, and water resource management White2017 ; pegion2019subseasonal ; White2022 ; domeisen2022 . Despite its significant socioeconomic benefits, subseasonal forecasting has historically not received sufficient attention compared to medium-range weather and climate predictions. This gap existed because accurate subseasonal forecasts were once considered nearly impossible. Subseasonal forecasts are particularly challenging as they rely on both atmospheric initial conditions, essential in short-term weather forecasts, and boundary conditions at the Earth’s surface, key to seasonal and climate forecasts Lorenz1979 ; mariotti2018progress . However, neither of these condition provides sufficient predictability, leaving subseasonal forecasts in a so-called predictability desert. Despite these challenges, recent advances in both physical and statistical modeling have enabled the regular production of subseasonal forecasts globally. Nonetheless, there remains a ongoing, strong demand for their further development to support informed decision-making across various sectors.

Developing an ensemble prediction system (EPS) based on traditional physics-based numerical weather prediction (NWP) models is a widely acknowledged and effective method for enhancing subseasonal forecast accuracy Weyn2021sub ; han2023ensemble . Major forecasting centers have implemented such EPS for subseasonal forecasts Vitart2014Sub ; saha2014ncep ; vitart2017subseasonal ; pegion2019subseasonal . However, these systems often exhibit considerable biases nowak2017sub ; Monhart2018 ; hwang2019improving ; vitart2022 ; Mouatadid2023adaptive , particularly in predicting extreme events domeisen2022advances . The two primary challenges in this field are ensuring an adequate ensemble size within computational constraints and designing ensemble perturbations that accurately reflect uncertainty in key atmospheric and oceanic variability Demaeyer2022 . Enlarging the ensemble size is beneficial for forecast performance buizza1999stochastic ; Buizza2019 ; leutbecher2019ensemble , but the substantial computational costs typically limit ensemble sizes to between 4 and 51 members across 11 international forecasting centers vitart2017subseasonal . Given these computational limitations, machine learning model emerges as a promising alternative for direct subseasonal forecasting cohen2019s2s . Machine learning models have the advantages of significantly higher computational efficiency, facilitating the generation of a large number of ensemble members which are crucial for prediction skill and reliability richardson2001measures . Recent advancements in machine learning for medium-range weather forecasting hu2022swinvrnn ; pathak2022fourcastnet ; lam2022graphcast ; bi2022panguweather ; chen2023fuxi ; fuxi_extreme2023 ; nguyen2023scaling have demonstrated that machine learning models can outperform the high-resolution forecasts (HRES) generated by the European Centre for Medium-Range Weather Forecasts (ECMWF), widely considered as the most accurate global weather forecasts (ECMWF2021, ).

Machine learning models have achieved made significant strides in medium-range weather forecasting and seasonal forecasting wang2024coupled , but their success in subseasonal forecasting has been less pronounced Weyn2021sub ; He2021 ; kiefer2023can . This shortfall primarily stems from the limited range of variables incorporated into the models, and more importantly, from the inadequate methods employed for ensemble generation. Conventional machine learning techniques for ensemble forecasting, such as introducing random perturbations into initial conditions and altering model structures, overlook the background flow and consequently leads to rapid reduction in ensemble spread. The inadequate representation of the complexities limits the performance of these prior machine learning based subseasonal forecasting models, which does not yet rival that of traditional EPS based on NWP models. To overcome these challenges, we introduce the FuXi Subseasonal-to-Seasonal (FuXi-S2S) model, representing a significant advancement in machine learning for subseasonal forecasting. This model is designed to generate global daily mean forecasts for 42 days from initialization. Unlike previous models that incorporated a limited set of variables, it incorporates a comprehensive suite of variables, instead of a couple of variables in previous models: 5 upper-air atmospheric variables at 13 pressure levels and 11 surface variables. Furthermore, it features a innovative perturbation module specifically designed to generate flow-dependent perturbations for ensemble forecasting. This module leverages vast amounts of historical data to learn probability distributions, thereby introducing flow-dependent perturbations directly into the model’s hidden features. Compared to conventional NWP ensemble forecasting methods, which often struggle with constructing initial condition perturbations due to the complexities of multivariate interactions and the need to maintain dynamic balance and ensemble spread in simulations Molteni1996 , our approach of introducing perturbations directly into the model’s latent space, presenting a novel and effective alternative. This perturbation module significantly enhances the performance of the FuXi-S2S forecasts, as demonstrated in Supplementary Figure LABEL:flow_dependent. More details about the FuXi-S2S model architecture are available in Section 4.

Remarkably, FuXi-S2S outperforms the ECMWF Subseasonal to Seasonal (S2S) ensemble, which is recognized as the most skillful S2S modeling system, in producing both the ensemble mean and probabilistic forecasts de2019global ; domeisen2022 . Its efficacy is particularly evident in extreme total precipitation (TP) forecasting, as exemplified by its accurate forecasts for the 2022 Pakistan floods. Such capability is closely related to FuXi-S2S’s improved prediction of the Madden–Julian Oscillation (MJO) madden1971detection ; madden1972description , a key driver of global climate patterns, extending the skillful MJO prediction from 30 days to 36 days. These results further confirm that the notable improvement in FuXi-S2S’s performance can be primarily attributed to the innovative perturbation module for ensemble generation. Another promising result is the ability of the FuXi-S2S model to identify potential precursor signals to physical processes. Beyond mere accuracy, in many applications involving machine learning forecasts, it is imperative to understand and validate the decision-making mechanisms of these models. Such understanding not only leads to enhanced trust in the models’ predictions but also increases the likelihood of implementing effective actions, particularly in mitigating the risks associated with extreme events. Therefore, interpreting machine learning models to align their reasoning with established knowledge becomes crucial. Recent developments in explainable machine learning (XML) LPR2015 ; McGovern2019 ; molnar2020 ; mamalakis2020 ; toms2021testing ; rasp2021data methods have facilitated this interpretation. This study delves into the 2022 Pakistan floods, investigating the FuXi-S2S model’s predictions to identify key geographic regions that significantly impact its predictive accuracy. This is achieved through the generation and analysis of saliency maps simonyan2013deep , wherein the identified regions in close alignment with insights from previous studies dunstone2023windows . Therefore, we argue that FuXi-S2S transcends traditional NWP models in terms of accuracy and speed, potentially unveiling previously unrecognized processes within Earth’s system in subseasonal forecasting faghmous2014big ; karpatne2017theory .

2 Results

This study conducts a thorough evaluation of the 51-member FuXi-S2S forecasts by analyzing testing data spanning from 2017 to 2021. It compares the performance of FuXi-S2S with that of the 11-member ECMWF S2S reforecasts from the model cycle C47r3 over the same period. The analysis primarily focuses on average forecasts for week 3 (days 15-21), week 4 (days 22-28), week 5 (days 29-35), and week 6 (days 36-42), weeks 3-4, and weeks 5-6. The evaluation employs a comprehensive set of metrics, including deterministic metrics for the ensemble mean, probabilistic metrics for all ensemble members, prediction skills specific for MJO forecasts, and tailored assessments for extreme events, notably the 2022 Pakistan floods. Furthermore, the study explores the underlying processes driving the FuXi-S2S model’s predictions for the 2022 Pakistan floods. This is accomplished by generating and analyzing the saliency maps, which provides profound insights into the model’s predictive processes.

Additional evaluations, including an analysis of energy spectra chattopadhyay2023longterm , are available in the supplementary material.

2.1 Deterministic metrics

This subsection compares the performance of ensemble mean forecasts from FuXi-S2S and ECMWF S2S based on deterministic metrics. Figure 1 presents the globally-averaged and latitude-weighted temporal anomaly correlation coefficient (TCC) for both FuXi-S2S and ECMWF S2S, considering four variables: TP, 2-meter temperature (T2M), geopotential at 500 hPa (Z500), and outgoing longwave radiation (OLR), across forecast lead times of 3, 4, 5, 6, 3-4, and 5-6 weeks. Significance testing is conducted as described in Section 4.4. When the FuXi-S2S forecasts do not show a statistically significant improvement over the ECMWF S2S reforecasts, these are indicated with a pale color scheme. It is evident that the ensemble mean forecasts from FuXi-S2S significantly outperform ECMWF S2S for TP and OLR, but not for T2M and Z500. The analysis is based on the averaged TCC computed from all testing data spanning the period from 2017 to 2021. The FuXi-S2S forecasts generally demonstrate higher TCC values than the ECMWF S2S reforecasts for TP and OLR at all lead times, while comparable TCC values for Z500 and T2M. Specifically, regarding Z500, the FuXi-S2S forecasts are superior to the ECMWF S2S reforecasts at lead times of 3, 4, 5, and 3-4 weeks, and have inferior performance at lead times of 6 and 5-6 weeks.

Supplementary Figure LABEL:TCC_spatial provides the spatial distributions of temporally-averaged TCC for both ECMWF S2S and FuXi-S2S, along with the differences in TCC between FuXi-S2S and ECMWF S2S for TP, T2M, Z500, and OLR forecasts at lead times of 3-4 and 5-6 weeks, respectively. The spatial distributions of TCC reveal considerably higher values over tropics, and greater values over oceans than over land. The TCC differences are described in red (positive values), blue (negative values), and white (zero values) patterns, suggesting whether FuXi-S2S’s performance is superior, inferior, or equivalent to ECMWF S2S, respectively. Overall, FuXi-S2S demonstrates positive TCC differences for TP and OLR in most regions worldwide, consistent with the findings presented in Figure 1. Moreover, FuXi-S2S also outperforms ECMWF in a majority of extra-tropical regions for both T2M and Z500, although its performance is generally less skilful in the tropical areas.

2.2 Probabilistic metrics

Deterministic metrics, evaluated using the ensemble mean, exhibit limited predictive skill, with TCC values below 0.5 for all subseasonal forecast lead times. Therefore, ensemble forecasts are essential for detecting predictable signals at subseasonal timescales.

The first two rows of Figure 2 present the spatial distribution of the temporally-averaged ranked probability skill score (RPSS) epstein1969scoring ; Wilks2011 for ECMWF S2S and FuXi-S2S, as well as the RPSS differences between FuXi-S2S and ECMWF S2S for TP forecasts over 3-4 and 5-6 week lead times. This analysis utilizes RPSS data which are temporally averaged from 2017 to 2021. The red contour lines in the first and second columns highlight areas with positive RPSS values, which indicate more skillful prediction than climatology forecast can be obtained over these areas. Notably, FuXi-S2S predicts more areas with positive RPSS values than ECMWF S2S. The color coding in the right panels of Figure 2 (red, blue, and white) indicates regions where FuXi-S2S performs better, worse, or equivalently compared to ECMWF S2S, respectively. The global distribution of RPSS suggests that both ECMWF S2S and FuXi-S2S primarily exhibit skill in tropical regions, whereas they lack skill in the extra-tropics compared to climatology. In contrast, RPSS demonstrates positive values (depicted in red color) in tropical regions, indicating enhanced predictive skills relative to climatology. Moreover, the RPSS values are notably higher over oceans compared to land areas. Predominantly, FuXi-S2S demonstrates nearly global positive RPSS differences for TP, except in some tropical regions where both models have quite high RPSS values. Compared to ECMWF S2S, whose skillful predictions are primarily confined to tropical ocean areas, FuXi-S2S demonstrates the capability of skillful predictions over more extra-tropical regions, such as East Asia, the North Pacific and the Arctic.

The latitude-weighted RPSS for the same 4 variables as in Figure 1 over forecast lead times of 3, 4, 5, 6, 3-4, and 5-6 weeks are given in Supplementary Figure LABEL:RPSS_area_bar. FuXi-S2S shows higher RPSS values than ECMWF S2S across most regions for all the examined variables: TP, T2M, Z500, and OLR. This superiority is especially noticeable in extratropical averages. However, in the tropics, ECMWF S2S outperforms FuXi-S2S at lead times of 3 to 6 weeks for one-week averages, whereas FuXi-S2S surpasses ECMWF S2S for two-week averages. This discrepancy in performance likely arises from the fact that one-week averages filter out variability with periods shorter than two weeks, while two-week averages attenuate variability with periods shorter than four weeks. Thus, the skill differences between the one-week and two-week averages may reflect FuXi-S2S’s enhanced ability in capturing lower-frequency variability. Furthermore, a previous study de2019global suggests that dynamical S2S models, particularly ECMWF S2S, demonstrate improved performance in the central-eastern Pacific, potentially due to their effective simulation of the realistic air-sea interactions in these regions.

2.3 Extreme forecast

A primary target of subseasonal forecasts is extreme weather events, to better prepare for disasters like droughts and floods. This subsection focuses on the prediction skills for extreme precipitation events. Such events are identified when TP exceeds the 90th climatological percentile, a threshold that varies based on grid location, forecast initialization time, and forecast lead time.

The last two rows of Figure 2 show the spatial distributions of the temporally-averaged Brier Skill Score (BSS) Wilks2011 for the extreme precipitation events, for ECMWF S2S and FuXi-S2S, and their differences over 3-4 and 5-6 week lead times. Similar to spatial pattern of RPSS, FuXi-S2S generally exhibts more regions with positive values of BSS than ECMWF S2S, suggesting more areas with skill relative to climatological forecasts. Similar to spatial pattern of RPSS, the BSS values are considerably higher over oceans than over land and decrease from lower latitudes to higher latitudes. Predominantly, the BSS differences favor FuXi-S2S in TP over land and in extra-tropical regions, marked by widespread red patterns. This suggests FuXi-S2S’s dominance over ECMWF S2S in predicting extreme TP across land and extra-tropics, which is of great importance for disaster preparedness and early warning.

Supplementary Figure LABEL:BSS_area_bar compares the latitude-weighted BSS between FuXi-S2S and ECMWF S2S, focusing on TP, T2M, Z500, and OLR in five geographical regions: global, in the extra-tropics (90°S - 30°S and 30°N - 90°N), in the tropics (30°S - 30°N), over land, and over the ocean. Globally, FuXi-S2S outperforms ECMWF S2S in terms of BSS for TP, T2M, and OLR. Notably, in contrast to ECMWF S2S, which exhibits consistently negative globally-averaged BSS values for TP across all lead times, FuXi-S2S demonstrates positive values for forecast lead times of 3, 3-4 and 5-6 week. In the extra-tropical regions, though the BSS scores are relatively lower in comparison to the global average, FuXi-S2S consistently exhibits superior performance compared to ECMWF S2S across all four variables. A similar pattern emerges in tropical regions, where FuXi-S2S demonstrates superior performance over ECMWF S2S for TP and OLR, while achieving comparable accuracy in T2M and Z500. Over land areas, FuXi-S2S demonstrates consistently higher BSS values for TP and T2M, suggesting its superior ability to provide more accurate forecasts of extreme rainfall and high temperatures compared to ECMWF S2S.

2.4 MJO forecast

Recent studies have demonstrated the importance of accurately modeling various sources of subseasonal predictability, particularly the MJO vitart2017subseasonal ; vitart2018sub ; merryfield2020 , for improving subseasonal prediction skills. The MJO has a significant impact on global weather and climate, serving as a primary source of predictability at subseasonal timescales due to its quasi-periodic nature Chidong2005 ; zhang2013madden ; zhang2013cracking ; neena2014predictability . Accurate MJO prediction is essential for reliable subseasonal predictions. Although current state-of-the-art dynamical forecasts can predict the MJO up to 3-4 weeks in advance, this falls short of the theoretical potential predictability of approximately 6-7 weeks neena2014predictability ; kim2018 ; jiang2020fifty . In recent years, increasing efforts have focused on applying machine learning models to improve MJO forecasts, either by post-processing dynamical forecasts Jie2021 ; silini2022improving ; Kim2023 or through direct forecasting silini2021machine ; toms2021testing ; Delaunay2022 . However, only improving MJO predictions with machine learning models does not inherently ensure improved forecasts of related weather phenomena, such as tropical cyclones and monsoons, which also depend on accurate predictions of various weather parameters by the model. Therefore, continuous improvement in forecasting models is essential for advancing subseasonal prediction capabilities. This section specifically examines the performance of our FuXi-S2S model in MJO forecasts, although it is not explicitly optimized for this purpose.

In this study, we employed the real-time multivariate MJO (RMM) index wheeler2004all , along with the commonly used metrics of bivariate correlation coefficient (COR), to evaluate the forecasting skill of the MJO. The RMM index used for verification was calculated using the Climate Prediction Center (CPC) OLR (CBO) data, in conjunction with the ERA5 zonal-wind component at 850 hPa and 200h Pa. Figure 3 presents the bivariate correlation (COR) skills of the RMM index for the ensemble mean of ECMWF S2S reforecasts and FuXi-S2S forecasts, averaged over the testing data spanning from 2017 to 2021. The results show a decrease in COR values as forecast lead times increase. Particularly, FuXi-S2S outperforms ECMWF S2S in MJO prediction, maintaining higher COR values for up to 42 days. When applying a COR threshold of 0.5 to determine skillful MJO forecast, FuXi-S2S extends the skillful forecast lead time from 30 days to 36 days, surpassing the performance of ECMWF S2S. Furthermore, the MJO prediction skills also depend on the seasonal cycle, as illustrated in Figure 3. Both FuXi-S2S and ECMWF S2S demonstrate higher MJO prediction skills in September and October. Additionally, FuXi-S2S exhibit superior skills compared to ECMWF S2S during the boreal spring and winter, with skillful predictions extending beyond 42 days in April and May, which is the longest forecast lead time achievable by the FuXi-S2S model. Moreover, Supplementary Figure LABEL:mjo_amp_phase presents the COR and error for the amplitude and phase of the MJO. These are calculated using the ensemble mean of ECMWF S2S reforecasts and FuXi-S2S forecasts, averaged across over the 2017-2021 testing dataset. The results suggest that the FuXi-S2S model outperforms the ECMWF S2S model in predicting the MJO, primarily due to its superior capability in forecasting the MJO phase. Additionally, FuXi-S2S demonstrates smaller amplitude errors, suggesting it more accurately maintains the amplitude of MJO events.

A two-dimensional phase-space diagram is commonly used to characterize the phase and amplitude of the MJO, using the x-axis and y-axis to represent the first and second principal components of Empirical Orthogonal Functions (EOFs) (RMM1 and RMM2), respectively. Supplementary Figure LABEL:RMM_composite illustrates the forecast performance of four distinct MJO events with initialization dates of 27 June 2018, 3 November 2018, 18 April 2019, and 21 March 2021, as predicted by ECMWF S2S and FuXi-S2S. Data points on this two-dimensional phase-space diagram are plotted at 5-day intervals. The phase of the MJO is determined by the azimuth of the combined RMM indices 1 and 2 (RMM1 and RMM2), while its amplitude is represented by the radial distance from the origin. As visually shown in Supplementary Figure LABEL:RMM_composite, the counterclockwise movement of data points signifies the eastward propagation of MJO-associated convection, with the distance between successive points reflecting the propagation speed. In comparison to the observed MJO derived from CBO and ERA5 reanalysis data, both ECMWF S2S and FuXi-S2S exhibit slower propagation speeds and reduced amplitudes as the forecast lead time increases, particularly noticeable for MJO forecasts initialized on 21 March 2021. However, FuXi-S2S shows a more consistent alignment with observations across all MJO phases, especially in mitigating the negative amplitude biases in MJO forecasts when compared to ECMWF S2S.

The MJO originates from interactions of tropical convection and circulation but its effect is of global reach. Indeed, large TCC for Z500 over the extra-tropical Pacific is found along the path of the Pacific North/South American (PNA/PSA) wallace1981teleconnections ; mo1987statistics teleconnection pattern (Supplementary Figure LABEL:TCC_spatial, rows 6 and 7). Compared to ECMWF S2S, improved MJO forecast in FuXi-S2S elevates TCC for these teleconnection patterns, especially along the PSA wave train in the Southern Hemisphere. Furthermore, the MJO is critical for stimulating these important teleconnection patterns, significantly affecting extra-tropical anomalies. Therefore, the accurate representation of MJO-related teleconnections is imperative for effective subseasonal forecasts. Supplementary Figure LABEL:z500 demonstrates that the FuXi-S2S model showcases enhanced skills in MJO prediction and realistic simulations of MJO teleconnections, which substantially contribute to its superior performance in subseasonal forecasts, particularly over extra-tropical regions.

This study highlights FuXi-S2S proficiency in predicting the MJO. We envision that FuXi-S2S could serve as a pivotal tool in investigating other primary modes of subseasonal variability, such as the Boreal Summer Intraseasonal Oscillation (BSISO) Zhu1993The3C , North Atlantic Oscillation (NAO) walker1924correlations , and East Asia-Pacific (EAP) pattern huang1987influence . Additionally, it would be worthwhile to explore how the prescribed fixed sea surface temperature (SST) or its absence impacts the forecast performance of the MJO. Savarin and Chen savarin2022pathways demonstrated that either using a coupled atmosphere-ocean model or updating SST with observed values is essential for accurately modeling the eastward propagation of the MJO. However, this analysis is beyond the scope of the current study and will be addressed in future research.

2.5 Prediction of the 2022 Pakistan floods

In 2022, Pakistan experienced a series of exceptionally intense monsoon rainfall surges from early July to late August, resulting in total rainfall that reached a level approximately four standard deviations above the climatological mean hong2023causes . This extreme rainfall event led to a significant humanitarian disaster, leaving over 2.1 million people homeless and resulting in 1,730 fatalities. According to the World Bank, the economic damages and losses exceeded USD 30 billion dunstone2023windows . Consequently, it is important to assess the ability of subseasonal forecasts to predict such extreme rainfall events.

Figure 4 illustrates the observed standardized TP anomaly alongside predictions that were initialized on different dates, generated by both the FuXi-S2S and ECMWF S2S models. These observations, taken from the Global Precipitation Climatology Project (GPCP), are spatially averaged over the Pakistan region (60 to 70°E in longitude and 25 to 35°N in latitude), and temporally over a two-week period from August 16th to August 31st, 2022, corresponding to the period of most intense rainfall. The standardized anomaly for observed rainfall is approximately 6 standard deviations above the climatological mean. It is evident that the ECMWF S2S model considerably underestimates rainfall intensity for forecasts initialized on July 21st, achieving only about one-third of the observed values. The ECMWF S2S forecasts gradually converge toward observations as the initialization dates approach the actual event. In contrast, FuXi-S2S exhibits superior forecast performance in predicting the intensity of extreme rainfall events earlier compared to ECMWF S2S. Specifically, FuXi-S2S predicts rainfall levels of at least 4 standard deviation above the climatological mean for forecasts initialized on July 21st, which is approximately 4 weeks in advance. Moreover, the spatial distributions of the standardized TP anomaly reveal that the FuXi-S2S predicted TP pattern more closely matches the observations.

Forecast skill typically improves with decreasing lead time, as in the ECMWF S2S model. The rainfall anomaly grows in FuXi-S2S forecasts initialized on July 28 (lead time of 18 days), albeit with a large forecast spread, possible due to SST influence. Indeed, the saliency maps show that the FuXi-S2S forecasts initialized on July 28 and July 21 successfully captured predictabable signals from SST anomalies in the tropical central Pacific and western Indian Ocean (Figures 4c). At shorter lead times, the SST influence decreases while the effect of atmospheric initial conditions increases. The varying importance of SST and initial conditions may cause variability in the FuXi-S2S forecasts with lead time.

2.6 Discovery of precursor signals for the 2022 Pakistan floods prediction

Data-driven machine learning forecasting models, such as FuXi-S2S, often lack explicit integration of prior knowledge about the physical system they aim to predict. As a result, they are often referred to as ’black boxes’. Although FuXi-S2S has shown accuracy in previous subsections, the opacity of its predictive processes can diminish confidence in its reliability. Therefore, it is imperative to interpret FuXi-S2S, ensuring that their underlying reasoning is consistent with established understanding of weather systems. Here, we generated saliency maps to disentangle the key driving processes behind the FuXi-S2S model’s prediction of the 2022 floods in Pakistan.

In this study, we utilized the negative absolute values of the TP anomaly, averaged across the Pakistan region (outlined by the green box in Figure 4c), as a loss function. By implementing backward propagation of this loss function to calculate gradients, we obtained the saliency maps. These maps use red and blue colors to signify positive and negative correlations, respectively between the negative of standardized TP anomaly and SST. Specifically, blue (red) areas indicate that a decrease (increase) in SST is associated with an increase (decrease) in the negative of standardized TP anomaly, thereby leading to an increase (decrease) in TP anomaly. Analysis of these saliency maps facilitated the identification of potential precursor signals and sources of predictability that contributed to the occurrence of the extreme TP event. As illustrated in Figure 4c, SST precursor signals, identified in forecasts initialized on different dates (July 28th and July 21st in 2022), show remarkable consistency. These signals indicate a consistent cooling of SST in the equatorial central Pacific and the tropical western Indian Ocean, along with warming in the tropical eastern Pacific. This spatial pattern aligns closely with findings from previous studies dunstone2023windows , which pinpointed the rapid development of a La Niña in the tropical Pacific and a negative phase of the Indian Ocean Dipole (IOD) in the summer of 2022 as key precursor signals and driving forces of Pakistan’s intense TP event. Our results confirm that the high predictive skill of the FuXi-S2S model can be attributed to its effective capture of the primary predictable sources of this event. Furthermore, these findings demonstrate the model’s potential as a valuable tool for rapidly exploring the mechanisms behind extreme events and uncovering teleconnections within Earth’s systems, thereby enhancing our physical understanding. Here, we focus on the gradient with respect to SST. Nevertheless, it is important to acknowledge the existence of other significant precursor signals that may be associated with this extreme event, including U, V, and Z anomalies as noted in hong2023causes . A more comprehensive examination of these factors is intended for future research.

3 Discussion

In this paper, we introduced FuXi-S2S, a machine learning based subseasonal forecasting model. This model provides global forecasts of daily mean values for up to 42 days, with a daily temporal resolution and 1.5^∘ spatial resolution encompassing five upper-air atmospheric variables across 13 pressure levels and 11 surface variables. The performance of FuXi-S2S was rigorously evaluated against ERA5 reanalysis data and compared with ECMWF S2S reforecasts. A comprehensive suite of metrics was employed for this evaluation, including the deterministic metrics of the ensemble mean, the probabilistic metrics of the ensemble forecast, and the capability to predict extreme events. Our results demonstrated that FuXi-S2S surpasses ECMWF S2S in forecast accuracy for the evaluated variables. Furthermore, FuXi-S2S significantly improves accuracy in predicting the MJO, extending the skillful MJO prediction from 30 days to 36 days. This improvement is particularly important given the MJO’s influence on global climate patterns, and consequently, it improves the model’s TP) forecast accuracy globally. Moreover, FuXi-S2S has shown utility in practical scenarios, such as its superior performance in predicting the extreme rainfall during the 2022 Pakistan floods earlier than the ECMWF S2S model. This early prediction capability is vital for improving disaster preparedness and response.

A key contributor to the superiority of FuXi-S2S is its innovative method of generating perturbations, which is essential for its successful ensemble forecasting. Unlike conventional models that employ random or meticulously calculated perturbations in initial conditions, FuXi-S2S incorporates background flow-dependent perturbations into its hidden features. These flow-dependent perturbations have shown to significantly enhance model’s subseasonal forecast performance, as illustrated in Supplementary Figure LABEL:flow_dependent. FuXi-S2S, as a machine learning model, also distinguishes itself by its ability to generate large ensembles forecasts rapidly and efficiently, requiring significantly less time and computational resources than traditional models. Specifically, it can complete a comprehensive 42-day forecast with daily time steps in approximately 7 seconds using an Nvidia A100 GPU for a single member. Ensemble size is a critical determinant of the ensemble forecast skill. Research suggests that the optimal number of members for subseasonal forecasts potentially falls within the range of 100 to 200 members Buizza2019 . To ensure a fair comparison with the ECMWF S2S model, we have currently limited the FuXi-S2S model to a 51-member ensemble. However, it’s important to note that FuXi-S2S is capable of generating larger ensembles with only a moderate increase in computational demands. Our supplementary Figure LABEL:larger_member illustrates that increasing the ensemble size to 101 members further enhances the forecast performance of FuXi-S2S compared to the 51-member ensemble.

Beyond its computational efficiency and superior accuracy, FuXi-S2S notably excels in identifying precursor signals and disentangling the complex processes underlying climate extremes, as demonstrated by its accurate prediction of the 2022 floods in Pakistan. Many subseasonal forecasting challenges stem from the limited understanding of these complex processes. Traditional physics-based models often rely on oversimplified representations of physical processes, which diminishes their forecast performance and analytical depth. In contrast, FuXi-S2S demonstrates proficiency in learning complex patterns and identifying subtle teleconnections from vast amounts of data. This approach resonates with Albert Einstein’s insight, ’You can’t solve a problem with the ways of thinking that created it.’. In our study of the 2022 extreme rainfall event in Pakistan, we demonstrate that backward propagation and the resulting saliency maps successfully reveal that FuXi-S2S makes accurate forecasts by effectively capturing the key predictable sources associated with this event. Moreover, such gradient-based interpretation methods aid in explaining weather and climate forecasts made by machine learning models, such as the FuXi-S2S model yang2024interpretable . Therefore, we advocate for a paradigm shift in the application of machine learning models like FuXi-S2S. The focus should not extend beyond enhancing forecast accuracy to include the development of a comprehensive framework for discovering previously unknown processes within the Earth’s system faghmous2014big ; karpatne2017theory . We foresee a growing reliance on machine learning models like FuXi-S2S within the scientific community, acknowledging their essential role in advancing scientific discovery in Earth system science.

While FuXi-S2S offers a computationally efficient and accurate alternative to conventional NWP models for subseasonal forecasting, it also presents significant opportunities for improvement. For instance, the ECWMF S2S model runs at a spatial resolution of 36 km haiden2018evaluation , which is considerably finer than the 1.5°resolution of FuXi-S2S. Currently, FuXi S2S predicts daily mean values up to 50 hPa and lacks critical weather parameters such as daily maximum and minimum temperatures, which are essential for some applications. Furthermore, given the known discrepancies between the ERA5 TP data and actual observations, as noted in nogueira2020inter ; Lavers2022 , GPCP observations have been utilized to evaluate the TP forecast performance for both ECMWF S2S and FuXi-S2S (refer to Supplementary Figure LABEL:gpcp_tp). Anticipated future enhancements to the FuXi-S2S model include increasing the spatial resolution from 1.5°to 0.25°, incorporating additional weather parameters, extending the forecast beyond the current upper limit of 50 hPa, and employing more accurate TP data sources to enhance forecast accuracy.

4 Methods

4.1 Data

ERA5 stands as the fifth iteration of the ECMWF reanalysis dataset, offering a rich array of surface and upper-air variables. It operates at a remarkable temporal resolution of 1 hour and a horizontal resolution of approximately 31 km, covering data from January 1950 to the present day hersbach2020era5 . Recognized for its expansive temporal and spatial coverage coupled with exceptional accuracy, ERA5 stands as the most comprehensive and precise reanalysis archive globally. In our study, we utilize daily statistics derived from the 1-hourly ERA5 dataset, which has a spatial resolution of $1.5^{\circ}$ (comprising $121\times 240$ latitude-longitude grid points) and a temporal resolution of 1 day. It serves as the sole data source for training the FuXi-S2S model.

Evaluating MJO predictions against MJO indices derived from satellite observed OLR data is a common practice. Therefore, alongside the ERA5 reanalysis data, a newly developed OLR dataset called the Climate Prediction Center (CPC) OLR (CBO) has emerged. Spanning from 1991 to the present day, this dataset undergoes near real-time updates. While showing slight differences in magnitude compared to the U.S. National Oceanic and Atmospheric Administration (NOAA) Advanced Very High-Resolution Radiometer (AVHRR) OLR, the CBO dataset notably exhibits a high level of similarity in both pattern and magnitude of anomalies. In our research, we utilize the CBO data, which has a spatial resolution of 1°and a temporal resolution of 1 day. This data serves as the ground truth for OLR in the identification and verification of MJO events. Furthermore, for the assessment of rainfall in the Pakistan region, observed rainfall data are sourced from the GPCP dataset adler2018global . It is noteworthy that the MJO indices derived from ERA5 OLR data closely align with those derived from CBO OLR data.

The FuXi-S2S model forecasts a total of 76 variables, encompassing 5 upper-air atmospheric variables across 13 pressure levels (50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, and 1000 hPa), and 11 surface variables. Among the upper-air atmospheric variables are geopotential (Z), temperature (T), u component of wind (U), v component of wind (V), and specific humidity (Q). The surface variables include 2-meter temperature (T2M), 2-meter dewpoint temperature (D2M), sea surface temperature (SST), outgoing longwave radiation (OLR), 10-meter u wind component (U10), 10-meter v wind component (V10), 100-meter u wind component (U100), 100-meter v wind component (V100), mean sea-level pressure (MSL), total column water vapor (TCWV), and TP. OLR is known as the negative of top net thermal radiation (TTR) in ECMWF convention. Table 1 provides a comprehensive list of these variables along with their abbreviations. Variables such as U100 and V100 were selected for their potential utility in wind energy forecasting. The selection of the SST is based on prior research, which suggests that slowly evolving variables like SST are crucial for identifying predictable signals albers2021subseasonal ; yan2021subseasonal ; richter2024quantifying . OLR was selected due to its significance in representing MJO events through OLR anomalies.

The model’s training relies on 67 years of data spanning from 1950 to 2016, while evaluation involves a 5-year dataset from 2017 to 2021. The z-score normalization technique is employed to normalize all input and output variables, thereby ensuring uniformity in their mean and variance. For upper-air variables, the mean and standard deviation are calculated separately for different vertical levels, using only the training dataset. Additionally, the dataset for the year 2022 undergoes evaluation and comparison against the ECMWF real-time S2S forecasts, specifically concerning the catastrophic flooding in Pakistan. More detailed evaluations of TP and MJO predictions for the year 2022 can be found in the supplementary material.

In certain cases, subseasonal forecasts receives regular updates through the implementation of the latest model, incorporating research discoveries tailored for operational use stan2022advances . For instance, the ECMWF S2S reforecasts, often termed hindcasts, which are generated on-the-fly by employing the most recent model version available at the time of forecast generation. In our research, we utilize the ECMWF S2S reforecasts generated from model cycle C47r3. These reforecasts encompass initialization dates over 20 years, ranging from January 3, 2002, to December 29, 2021. The ECMWF S2S reforecasts are initialized twice weekly, aligning with the real-time forecasts. Additionally, our comparative analysis involves employing the 51-member ECMWF real-time S2S forecast for the year 2022. For the analysis using testing data from 2017 to 2021, anomalies for all variables are defined as deviations from the climatological mean calculated over the 15-year period from 2002 to 2016. Meanwhile, for the analysis based on testing data in the year 2022, the climatological mean is calculated over the period from 2002 to 2021. Furthermore, a set of hindcasts from 2002 to 2016 is generated for FuXi-S2S, which are used to to establish a climatology. This climatology is then subtracted from the FuXi-S2S forecasts for the testing data spanning from 2017 to 2021. This process facilitates the calculation of FuXi-S2S anomalies for evaluations.

To ensure equitable comparisons, we evaluate FuXi-S2S forecasts specifically on identical initialization dates corresponding to those utilized for both the ECMWF S2S reforecasts and forecasts. This approach facilitates a fair and direct assessment between FuXi-S2S and ECMWF S2S.

4.2 FuXi-S2S model

Most state-of-the-art machine learning models utilized in medium-range weather forecasting are built upon encoder-decoder cho2014properties architectures bi2022panguweather ; lam2022graphcast ; chen2023fuxi ; olivetti2023advances . These structures are favored due to their proficiency in processing and generating sequential and spatial data. Within these architectures, the encoder processes key features from the input data, and transforms them into a compressed and abstract representation in the latent space. The decoder then utilizes this representation to generate weather forecasts. The primary objective of training these models is to minimize differences between the model’s output and the target data. However, the standard encoder-decoder structures are inherently deterministic, producing identical forecasts for the same inputs, which limits their applicability in generating ensemble forecasts. To overcome this limitation, we introduce the FuXi-S2S model, drawing inspiration from Variational Autoencoders (VAEs) doersch2016tutorial ; zhao2017learning ; kingma2019introduction . VAEs are inherently probabilistic, making them well-suited for tasks that require uncertainty quantification. Like VAEs, the FuXi-S2S model’s encoder does not merely generate a static hidden feature from input data. Instead, it transforms input data into a Gaussian distribution in the latent space, which captures the probabilistic characteristics of the data, along with a static hidden feature. Then, the decoder combines samples from the Gaussian distribution with the static hidden feature to generate forecasts. This methodology effectively captures the inherent uncertainty in the data, thereby enabling the generation of ensemble predictions under identical input conditions by repeatedly sampling from the Gaussian distribution. For better understanding, we draw analogies between these machine learning techniques and the conventional terminology in ensemble weather/subseasonal forecasting. In our model, the static hidden feature forms the basis for deterministic forecasts, while sampling from the Gaussian distribution serves as a perturbation module. This module introduces flow-dependent perturbations into the model’s hidden feature, facilitating the generation of ensemble forecasts.

The FuXi-S2S model, illustrated in Figure 5a, consists of three primary components: an encoder P, a perturbation module, and a decoder. The encoder, processing predicted weather parameters from two preceding time steps, with each time step representing one day, as FuXi-S2S is designed to forecast daily mean values. Specifically, it takes $\hat{\textrm{{X}}}^{t-1}$ and $\hat{\textrm{{X}}}^{t}$ as inputs into a two-dimensional (2D) convolution layer with a kernel size of 2, which reduces the dimensions of the input data by half. Following this, the hidden feature $\textrm{h}^{t}$ (with dimensions of $1536\times 60\times 120$ ) is derived from 12 repeated transformer blocks. The input to the encoder is a data cube that combines both upper-air and surface variables, with dimensions of $2\times 76\times 121\times 240$ . These dimensions represent two preceding time steps ( ${t-1}$ and ${t}$ ), the number of input variables, and the latitude (H) and longitude (W) grid points, respectively. To account for the accumulation of forecast error over time, the forecast lead time ( ${t}$ ) is also included in the encoder’s input. Besides $\textrm{h}^{t}$ , the encoder also generates a low-rank multivariate Gaussian distribution, ${\textrm{N($\Theta$}^{t}}_{p})$ , characterized by a mean vector $\mu^{t}$ ( $128\times 60\times 120$ ), a covariance matrix $\sigma^{t}$ ( $1536\times 60\times 120$ ), and a diagonal covariance matrix ${diag}^{t}$ ( $128\times 60\times 120$ ). Intermediate perturbation vectors ( $\textrm{z}^{t}_{p}$ , dimension: $128\times 60\times 120$ ) are sampled from this Gaussian distribution ( ${\textrm{N($\Theta$}^{t}}_{p})$ ). These vectors, after being weighted by a learned weight vector, yield the final perturbation vectors $\textrm{z}^{t}$ (dimension: $1536\times 60\times 120$ ). The decoder then processes the perturbed hidden features ( $\tilde{\textrm{h}}^{t}=\textrm{h}^{t}+\textrm{z}^{t}$ ) through 24 transformer blocks and a fully connected layer, resulting in the final ensemble output $\hat{\textrm{{X}}}^{t+1}$ . The number of ensemble members generated equals the number of samples drawn from the Gaussian distribution ${\textrm{N($\Theta$}^{t}}_{p})$ .

The FuXi-S2S model’s training primarily focuses on constructing a Gaussian distribution that accurately represents the uncertainty in the model’s predictions. A significant challenge in this process is the deviation of the Gaussian distribution derived from the model’s predictions from the Gaussian distribution based on the target data, largely attributable to prediction errors. This challenge is addressed through knowledge distillation, which enables the transfer of information from real-world distributions to those predicted by the model. Within this framework, the encoder Q plays a crucial role, converting the target data into a Gaussian distribution. This distribution serves as a supervisor for the distribution generated by the encoder P, aiming to align both distributions closely by minimizing the Kullback–Leibler (KL) divergence loss ( $\textrm{L}_{\textrm{KL}}$ ). This KL loss measures the discrepancy between the distributions predicted by both encoders. As illustrated in the Figure 5b, during the training phase of the FuXi-S2S model, the encoder Q, which shares the network structure with the encoder P, processes a data cube containing target weather parameters from a preceding and the current time steps: $\textrm{{X}}^{t}$ and $\textrm{{X}}^{t+1}$ . It predicts a low-rank multivariate Gaussian distribution ( ${\textrm{N($\Theta$}^{t}}_{q})$ ) similar to the encoder P. Intermediate perturbation vectors are sampled from the encoder Q’s distribution ( ${\textrm{N($\Theta$}^{t}}_{q})$ ) during training (see Figure 5b), and from the encoder P’s distribution ( ${\textrm{N($\Theta$}^{t}}_{p})$ ) during testing (see Figure 5a). These vectors have dimensions of $128\times 60\times 120$ . Additionally, a L1 loss is computed between the model’s output ( $\hat{\textrm{{X}}}^{t+1}$ ) and the target ${\textrm{{X}}}^{t+1}$ . Therefore, the overall loss function at each autoregressive step is thus determined by the following equation:

\textrm{L}=\lambda\textrm{L}_{\textrm{KL}}(\textrm{P}^{t},\textrm{Q}^{t})+|% \hat{\textrm{{X}}}^{t+1}-\textrm{{X}}^{t+1}|

(1)

where $\lambda$ , a tune-able coefficient balancing $\textrm{L}_{KL}$ and L1, is set to $1\times 10^{-4}$ in this study. The design of this loss function serves two purposes: the first term ensures the perturbation vector closely approximates the true data distribution, while the second term ensures the prediction unaffected by any perturbation vectors $\textrm{z}^{t}$ .

In this study, we employ 51 ensemble members for subseasonal ensemble forecasting. As illustrated in Supplementary Figure LABEL:larger_member, the FuXi-S2S model, when enhanced with flow-dependent perturbations incorporated into its hidden features, demonstrates considerably improved forecast performance compared to the FuXi-S2S model that combines Perlin noise in the initial conditions with fixed perturbations added to the hidden features. Notably, the addition of Perlin noise results in only marginal improvements in forecast accuracy when the ensemble size is small. However, with larger ensemble sizes, such as the 51 members in this study, the addition of Perlin noise does not enhance forecast accuracy.

Similar to FuXi, we utilize an autoregressive, multi-step loss function to mitigate cumulative errors over long lead times, as outlined in Lam et al. lam2022graphcast . The training process follows an autoregressive training regime and a curriculum training schedule, incrementally increasing the number of autoregressive steps from 1 to 17. Each autoregressive step undergoes 1000 gradient descent updates, resulting in a total number of 17,000 training steps. The training process utilizes 8 Nvidia A100 graphics processing units (GPUs), each employing a batch size of 1. Optimization is performed using the AdamW kingma2017adam ; loshchilov2017decoupled optimizer with the following parameters: ${\beta_{1}}$ =0.9 and ${\beta_{2}}$ =0.95, an initial learning rate of 2.5 $\times$ 10^-4, and a weight decay coefficient of 0.1. The optimisation hyperparameters used for training are summarised in Supplementary Table LABEL:hyper.

4.3 Saliency map

Recent developments in the field of XML have led to the emergence of various techniques samek2017explainable , including saliency mapping. Saliency mapping quantifies the influence of a model’s input on its output simonyan2013deep . This method is characterized by the gradient intensities within the saliency maps; areas with higher gradients are considered critical by the model for making accurate predictions.

The generation of saliency maps primarily depends on backward propagation. This differs from standard model training as the propagation target can be adjusted depending on the specific goal of the analysis. Here, the saliency of the predicted anomaly relative to the input data is given by:

\textrm{J}(\textbf{X}(c_{o}))=-\sum_{i,j\in\textrm{D}}\frac{|f^{n}(\textbf{X})% (c_{o},i,j)-\mu(c_{o},i,j)|}{\sigma(c_{o},i,j)}

(2)

\textrm{S}(c_{i}|c_{o})=\pdv{\textrm{J}(\textbf{X}(c_{o}))}{\textbf{X}(c_{i})}

(3)

where $f$ denotes the FuXi-S2S model and $n$ is the number of forward steps, while $\mu$ and $\sigma$ are the climatological mean and standard deviation, respectively. D specify the geographical area of interest. $c_{i}$ and $c_{o}$ represent the input and output variables. A well-trained model is expected to yield a saliency map that aligns well with the established physical understanding of weather systems. In our study, we construct a aggregated saliency map by averaging the individual maps generated from each of the 51 ensemble members.

4.4 Evaluation method

Prior to evaluation, each variable in the 42-day forecasts undergoes a detrending process to eliminate the linear trend. This step is essential for removing the linear long-term trends potentially affected by global warming weirich2023subseasonal . For detrending, a linear regression model is fitted to estimate the weekly mean linear trend from both forecasts and observations over the hindcast period (2002-2016). For the testing period (2017-2021), this model takes the week of the year as input data to calculate the trend, which is then subtracted from both the forecasts and observations to obtain the detrended fields. Subsequently, the deterministic metrics of the ensemble mean is evaluated using the latitude-weighted TCC, which is calculated as follows:

\textrm{TCC}(c,\tau,i,j)=\frac{\sum_{t_{0}\in D}\hat{\textbf{A}}^{t_{0}+\tau}_% {c,i,j}\textbf{A}^{t_{0}+\tau}_{c,i,j}}{\sqrt{\sum_{t_{0}\in D}(\hat{\textbf{A% }}^{t_{0}+\tau}_{c,i,j})^{2}\sum_{t_{0}\in D}(\textbf{A}^{t_{0}+\tau}_{c,i,j})% ^{2}}}

(4)

where ${t_{0}}$ represents the forecast initialization time in the testing dataset ${D}$ . $H$ , and $W$ denote the number of grid points in the latitude and longitude directions. The indices ${c}$ , ${i}$ , and ${j}$ correspond to variables, latitude and longitude coordinates, respectively. ${\tau}$ refers to the forecast lead time steps added to ${t_{0}}$ . $\hat{\textbf{A}}^{t_{0}+\tau}_{c,i,j}$ and $\textbf{A}^{t_{0}+\tau}_{c,i,j}$ are the differences between the forecast or observation and the climatological mean, with the climatological mean derived from data spanning the years from 2002 and 2016.

To evaluate the ensemble forecast performance, we use the RPSS epstein1969scoring ; Wilks2011 which quantifies the comparison between the cumulative squared probability errors of a given forecast and a climatological forecast. The calculation of the RPSS metric necessitates prior determination of the ranked probability scores (RPS) for both the forecast ( $\textrm{RPS}_{\textrm{forecast}}$ ) and the climatological forecast ( $\textrm{RPS}_{\textrm{clim}}$ ) should be calculated first. The RPS aggregates the squared probability errors across $K$ ( $K=3$ in this work) categories, such as tercile, arranged in ascending order. The tercile bounds are determined based on the average values over either one-week or two-week periods for each corresponding verification period. These calculations of tercile bounds are performed separately for each forecast model and observation (ERA5 data). The metric assesses the accuracy with which the probability forecast predicts the actual observation category. The RPS score is derived from the sum of the squared differences between the cumulative categorical forecast probability and its observed counterpart, where $p_{\textrm{O}(i)}=1$ denotes the observed category and $p_{\textrm{O}(i)}=0$ represents other categories:

\textrm{RPS}_{\textrm{forecast}}=\sum_{k=1}^{K}(\textrm{F}_{\textrm{forecast}(% k)}-\textrm{F}_{\textrm{O}(k)})

(5)

\textrm{RPS}_{\textrm{clim}}=\sum_{k=1}^{K}(\textrm{F}_{\textrm{clim}(k)}-% \textrm{F}_{\textrm{O}(k)})

(6)

where $\textrm{F}_{\textrm{forecast}(k)}=\sum_{i=1}^{k}p_{\textrm{forecast}(i)}$ , $\textrm{F}_{\textrm{clim}(k)}=\sum_{i=1}^{k}p_{\textrm{clim}(i)}$ , $\textrm{F}_{\textrm{O}(k)}=\sum_{i=1}^{k}p_{\textrm{O}(i)}$ represent the $k$ th components of the cumulative forecast, climatological, and observational distributions, respectively. And $p_{\textrm{forecast}(i)}$ , $p_{\textrm{clim}(i)}$ , $p_{\textrm{O}(i)}$ correspond to the forecasted, climatological, and observed probability of the event’s occurrence in category $i$ ( $i\leq k$ ). Crucially, the RPS is affected by both the forecast probabilities attributed to the observed category and the probabilities assigned to other categories. The RPS value varies between 0 and 1, where a lower value denotes a smaller forecast probability error, and thus a more accurate forecast. Specifically, a RPS value of 0 indicates a perfectly accurate categorical forecast. With the RPS values of both the forecast and the climatological forecast, the RPSS can be determined as:

\textrm{RPSS}=1-\frac{<\textrm{RPS}_{\textrm{forecast}}>}{<\textrm{RPS}_{% \textrm{clim}}>}

(7)

where, the brackets $<...>$ denote the average of the $\textrm{RPS}_{\textrm{forecast}}$ and $\textrm{RPS}_{\textrm{clim}}$ values across all forecast–observation pairs. Since each forecast category is equally probable by design, the climatological forecast assumes a 33 $\%$ probability of occurrence for each category. The RPSS metric serves a comparative measure against the climatological forecast. Its value range from $-\infty$ to 1, where 1 corresponds to a perfect forecast and higher values suggest better forecast performance. A positive RPSS value indicates superior accuracy over the climatological forecast, while a negative value suggests inferior accuracy. A value of zero suggests that the forecast has no added skill compared to the climatological forecast.

Additionally, we use the BSSWilks2011 to evaluate the performance of extreme forecasts. The BSS, a widely used metric for assessing the quality of categorical probabilistic forecasts, can be considered as a special case of the RPSS with two forecast categories weigel2007discrete . The BSS is computed using the following equation:

\textrm{BSS}=1-\frac{<\textrm{BS}_{\textrm{forecast}}>}{<\textrm{BS}_{\textrm{% clim}}>}

(8)

where $\textrm{BS}_{\textrm{forecast}}$ and $\textrm{BS}_{\textrm{clim}}$ represent the Brier Scores (BS) Brier1950 for the model’s forecast and the climatological forecast, respectively. Similar to the RPS, the BS quantifies the mean squared difference between the predicted probabilities and observations (either 0 or 1) in binary probabilistic forecasts. In this study, the BSS is calculated for the ensemble mean of both FuXi-S2S and ECMWF S2S, using the 90th climatological percentiles as the threshold for extreme events. The BS ranges from 0 to 1, with lower values indicating a better agreement between ensemble forecasts and observations with 0 suggesting the best possible BS score. On the contrary, a higher BSS, up to a maximum of 1, indicates better performance. The BSS measures the improvement of a forecast’s BS ( $\textrm{BS}_{\textrm{forecast}}$ ) relative to that of a climatological forecast ( $\textrm{BS}_{\textrm{clim}}$ ) as reference. A BSS of one indicates a perfect forecast, zero denotes no improvement over climatology, and negative values suggest inferior performance compared to climatology.

The evolution of MJO is typically characterized using the Real-time Multivariate MJO (RMM) index, as originally developed by Wheeler and Hendon wheeler2004all . The RMM1 and RMM2 indices represent the first and second principal components of the combined Empirical Orthogonal Function (EOF). This EOF is derived based on the daily mean values of OLR, zonal wind at 850 hPa (U850), and zonal wind at 200 hPa (U200), all averaged within the latitude range of 15°N and 15°S rashid2011prediction . In this study, we use the EOFs derived by Wheeler and Hendon (2004) wheeler2004all . To obtain the predicted MJO indices, data from both the FuXi-S2S and ECMWF S2S models are firstly interpolated from a spatial resolution of 1.5°to a 2.5°, and projected onto the observed EOFs. After calculating the ensemble mean anomalies, the RMM for the ensemble mean of both modes was derived. The amplitude and phase of the MJO are respectively defined by the formulas: $\textrm{RMMA}=\sqrt{\textrm{RMM1}^{2}(t)+\textrm{RMM2}^{2}(t)}$ and $\textrm{$\theta$}=tan^{-1}\frac{\textrm{RMM2}^{2}(t)}{\textrm{RMM1}^{2}(t)}$ . To assess the quality of the MJO forecasts, we calculate the bivariate COR using the following equation:

\textrm{COR}(\tau)=\frac{\sum_{t=1}^{N}[a_{1}(t)b_{1}(t,\tau)+a_{2}(t)b_{2}(t,% \tau)]}{\sqrt{\sum_{t=1}^{N}[a_{1}^{2}(t)+a_{2}^{2}(t)]}\sqrt{\sum_{t=1}^{N}[b% _{1}^{2}(t,\tau)+b_{2}^{2}(t,\tau)]}}

(9)

where $a_{1}(t)$ and $a_{2}(t)$ are the observed RMM1 and RMM2 at time $t$ derived from the ERA5 reanalysis dataset. Correspondingly, $b_{1}(t,\tau)$ and $b_{2}(t,\tau)$ represent the forecasts for time $t$ with a lead time of $\tau$ days, respectively. $N$ denotes the number of total predictions. We apply the threshold of $\textrm{COR}=0.5$ for skillful prediction rashid2011prediction .

Additionally, we assessed the respective contributions of amplitude and phase to the prediction skills of the MJO by examining the COR and error metrics of ensemble mean forecasts for each component. The COR for amplitude ( $\textrm{COR}_{\textrm{amplitude}}$ ) and phase ( $\textrm{COR}_{\textrm{phase}}$ ) were calculated using the methods outlined by Wang et al. wang2019prediction as follows:

\textrm{COR}_{amplitude}(\tau)=\frac{\sum_{t=1}^{N}\textrm{RMMA}_{\textrm{obs}% }(t)\times\textrm{RMMA}_{\textrm{forecast}}(t,\tau)}{\sqrt{\sum_{t=1}^{N}% \textrm{RMMA}_{\textrm{obs}}^{2}(t)}\sqrt{\sum_{t=1}^{N}\textrm{RMMA}_{\textrm% {forecast}}^{2}(t,\tau)}}

(10)

\textrm{COR}_{phase}(\tau)=\frac{\sum_{t=1}^{N}\textrm{RMMA}_{\textrm{obs}}(t)% \times cos(\textrm{$\theta$}_{\textrm{forecast}}(t,\tau)-\textrm{$\theta$}_{% \textrm{obs}}(t))}{\sum_{t=1}^{N}\textrm{RMMA}_{\textrm{obs}}^{2}(t)}

(11)

where $\textrm{RMMA}_{\textrm{obs}}$ and $\textrm{RMMA}_{\textrm{forecast}}$ represent the observed and predicted amplitudes of the MJO, respectively, while $\textrm{$\theta$}_{\textrm{obs}}$ and $\textrm{$\theta$}_{\textrm{forecast}}$ denote the observed and predicted phases. Additionally, we computed the average amplitude and phase errors ( $\textrm{ERROR}_{\textrm{amplitude}}$ and $\textrm{ERROR}_{\textrm{phase}}$ ) as follows, based on the method described by Rashid et al. rashid2011prediction :

\textrm{ERROR}_{\textrm{amplitude}}(\tau)=\frac{1}{N}\sum_{t=1}^{N}(\textrm{% RMMA}_{\textrm{forecast}}(t,\tau)-\textrm{RMMA}_{\textrm{obs}}(t))

(12)

\textrm{ERROR}_{\textrm{phase}}(\tau)=\frac{1}{N}\sum_{t=1}^{N}tan^{-1}(\frac{% a_{1}(t)b_{2}(t,\tau)-a_{2}(t)b_{1}(t,\tau)}{a_{1}(t)b_{1}(t,\tau)+a_{2}(t)b_{% 2}(t,\tau)})

(13)

Further details about the COR and ERROR for the amplitude and phase are presented in the Supplementary Figure LABEL:mjo_amp_phase.

Atmospheric predictability exhibits significant day-to-day variability, which in turn affects the potential accuracy of weather forecasts. To determine whether FuXi-S2S consistently outperform ECMWF S2S despite this variability, we adopted a bootstrapping approach for significance testing. This method involves generating a large number of synthetic datasets, for example 1000 in this work. For each day within these datasets, a forecast is randomly selected from either model A or model B. The forecast skill of each synthetic dataset is then evaluated by comparing it with actual observation. If the performance of model A surpasses the 97.5th percentile of the skill distribution derived from the synthetic datasets, it can be considered “significantly better” than model B. In contrast, if its performance falls below the 2.5th percentile, it is regarded as “significantly worse”. We also analyzed where the FuXi-S2S and ECMWF S2S models are significantly better or worse than the climatological forecasts, with model B representing these forecasts. Throughout the paper, significance testing has been applied to all bar plots and spatial map of statistical metrics. For all the bar plots in the paper, a pale color is used when the FuXi-S2S model do not show a statistically significant improvement over the ECMWF S2S model. Additionally, we have marked areas on all spatial maps where the skill score is statistically significant with stippling.

Data Availability Statement

We downloaded a subset of the daily statistics from the ERA5 hourly data from the official website of Copernicus Climate Data (CDS) at https://cds.climate.copernicus.eu/cdsapp#!/software/app-c3s-daily-era5-statistics. The ECMWF S2S data were obtained from https://apps.ecmwf.int/datasets/data/s2s/. The 1°CPC OLR data are provided by the NOAA Physical Sciences Laboratory (PSL) from their website of https://psl.noaa.gov. Rainfall data from the Global Precipitation Climatology Project (GPCP) was obtained from the National Oceanic and Atmospheric Administration (NOAA), specifically the National Centers for Environmental Information (NCEI), which is accessible at https://www.ncei.noaa.gov/products/global-precipitation-climatology-project.

The relevant data from each figure in the main manuscript and in the Supplementary Information are provided in https://zenodo.org/records/12662702 data2024 .

Code Availability Statement

The source code employed for training and running FuXi-S2S models in this research is accessible within a specific Google Drive folder (https://drive.google.com/drive/folders/1z47CRQdKFZaOjtKQWSNZobC1_RePUVIK?usp=sharing) code2023 . As the FuXi-S2S model and code are essential resources for this study. Currently, access to these resources is limited.

Calculation of MJO index is based on the EOFs derived by Wheeler and Hendon (2004) wheeler2004all .

The implementation of Perlin noise is based on publicly available from the GitHub repository: https://github.com/pvigier/perlin-numpy.

References

\bibcommenthead
(1) National Academies of Sciences, E.: Next Generation Earth System Prediction: Strategies for Subseasonal to Seasonal Forecasts. National Academies Press, Washington, DC (2016)
(2) White, C.J., et al.: Potential applications of subseasonal-to-seasonal (s2s) predictions. Meteorological Applications 24(3), 315–325 (2017)
(3) Pegion, K., et al.: The subseasonal experiment (subx): A multimodel subseasonal prediction experiment. Bulletin of the American Meteorological Society 100(10), 2043–2060 (2019)
(4) White, C.J., et al.: Advances in the application and utility of subseasonal-to-seasonal predictions. Bulletin of the American Meteorological Society 103(6), 1448–1472 (2022)
(5) Domeisen, D.I., et al.: Advances in the subseasonal prediction of extreme events: relevant case studies across the globe. Bulletin of the American Meteorological Society 103(6), 1473–1501 (2022)
(6) Lorenz, E.N.: Forced and free variations of weather and climate. Journal of Atmospheric Sciences 36(8), 1367–1376 (1979)
(7) Mariotti, A., Ruti, P.M., Rixen, M.: Progress in subseasonal to seasonal prediction through a joint weather and climate community effort. Npj Climate and Atmospheric Science 1(1), 4 (2018)
(8) Weyn, J.A., Durran, D.R., Caruana, R., Cresswell-Clay, N.: Sub-seasonal forecasting with a large ensemble of deep-learning weather prediction models. Journal of Advances in Modeling Earth Systems 13(7), 2021–002502 (2021)
(9) Han, J.-Y., Kim, S.-W., Park, C.-H., Son, S.-W.: Ensemble size versus bias correction effects in subseasonal-to-seasonal (s2s) forecasts. Geoscience Letters 10(1), 37 (2023)
(10) Vitart, F.: Evolution of ecmwf sub-seasonal forecast skill scores. Quarterly Journal of the Royal Meteorological Society 140(683), 1889–1899 (2014)
(11) Saha, S., et al.: The ncep climate forecast system version 2. Journal of climate 27(6), 2185–2208 (2014)
(12) Vitart, F., et al.: The subseasonal to seasonal (s2s) prediction project database. Bulletin of the American Meteorological Society 98(1), 163–173 (2017)
(13) Nowak, K., Webb, R., Cifelli, R., Brekke, L.: Sub-seasonal climate forecast rodeo. In: 2017 AGU Fall Meeting, New Orleans, LA, pp. 11–15 (2017)
(14) Monhart, S., et al.: Skill of subseasonal forecasts in europe: Effect of bias correction and downscaling using surface observations. Journal of Geophysical Research: Atmospheres 123(15), 7999–8016 (2018)
(15) Hwang, J., Orenstein, P., Cohen, J., Pfeiffer, K., Mackey, L.: Improving subseasonal forecasting in the western us with machine learning. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2325–2335 (2019)
(16) Vitart, F., et al.: Outcomes of the wmo prize challenge to improve subseasonal to seasonal predictions using artificial intelligence. Bulletin of the American Meteorological Society 103(12), 2878–2886 (2022)
(17) Mouatadid, S., et al.: Adaptive bias correction for improved subseasonal forecasting. Nature Communications 14(1), 3482 (2023)
(18) Domeisen, D.I., et al.: Advances in the subseasonal prediction of extreme events: relevant case studies across the globe. Bulletin of the American Meteorological Society 103(6), 1473–1501 (2022)
(19) Demaeyer, J., Penny, S.G., Vannitsem, S.: Identifying efficient ensemble perturbations for initializing subseasonal-to-seasonal prediction. Journal of Advances in Modeling Earth Systems 14(5), 1–30 (2022)
(20) Buizza, R., Milleer, M., Palmer, T.N.: Stochastic representation of model uncertainties in the ECMWF ensemble prediction system. Q. J. R. Meteorol. Soc. 125(560), 2887–2908 (1999)
(21) Buizza, R.: Introduction to the special issue on “25 years of ensemble forecasting”. Quarterly Journal of the Royal Meteorological Society 145(S1), 1–11 (2019)
(22) Leutbecher, M.: Ensemble size: How suboptimal is less than infinity? Quarterly Journal of the Royal Meteorological Society 145, 107–128 (2019)
(23) Cohen, J., et al.: S2s reboot: An argument for greater inclusion of machine learning in subseasonal to seasonal forecasts. Wiley Interdisciplinary Reviews: Climate Change 10(2), 00567 (2019)
(24) Richardson, D.S.: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. Quarterly Journal of the Royal Meteorological Society 127(577), 2473–2489 (2001)
(25) Hu, Y., Chen, L., Wang, Z., Li, H.: SwinVRNN: A data-driven ensemble forecasting model via learned distribution perturbation. J. Adv. Model. Earth Syst. 15(2), 2022–003211 (2023)
(26) Pathak, J., et al.: Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators. Preprint at https://arxiv.org/abs/2202.11214 (2022)
(27) Lam, R., et al.: Learning skillful medium-range global weather forecasting. Science (2023)
(28) Bi, K., et al.: Accurate medium-range global weather forecasting with 3d neural networks. Nature (2023)
(29) Chen, L., et al.: Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast. npj Climate and Atmospheric Science, 1–11 (2023)
(30) Zhong, X., et al.: FuXi-Extreme: Improving extreme rainfall and wind forecasts with diffusion model (2023)
(31) Nguyen, T., et al.: Scaling transformer neural networks for skillful and reliable medium-range weather forecasting. Preprint at https://arxiv.org/abs/2312.03876 (2023)
(32) Haiden, T., et al.: Evaluation of ECMWF forecasts, including the 2021 upgrade (2021)
(33) Wang, C., Pritchard, M.S., Brenowitz, N., Cohen, Y., Bonev, B., Kurth, T., Durran, D., Pathak, J.: Coupled Ocean-Atmosphere Dynamics in a Machine Learning Earth System Model. Preprint at https://arxiv.org/abs/2406.08632 (2024)
(34) He, S., Li, X., DelSole, T., Ravikumar, P., Banerjee, A.: Sub-seasonal climate forecasting via machine learning: Challenges, analysis, and advances. Proceedings of the AAAI Conference on Artificial Intelligence 35(1), 169–177 (2021)
(35) Kiefer, S.M., Lerch, S., Ludwig, P., Pinto, J.G.: Can machine learning models be a suitable tool for predicting central european cold winter weather on subseasonal to seasonal time scales? Artificial Intelligence for the Earth Systems 2(4), 1–16 (2023)
(36) Molteni, F., Buizza, R., Palmer, T.N., Petroliagis, T.: The ecmwf ensemble prediction system: Methodology and validation. Quarterly Journal of the Royal Meteorological Society 122(529), 73–119 (1996)
(37) de Andrade, F., Coelho, C.A., Cavalcanti, I.F.: Global precipitation hindcast quality assessment of the subseasonal to seasonal (s2s) prediction project models. Climate Dynamics 52(9), 5451–5475 (2019)
(38) Madden, R.A., Julian, P.R.: Detection of a 40–50 day oscillation in the zonal wind in the tropical pacific. Journal of Atmospheric Sciences 28(5), 702–708 (1971)
(39) Madden, R.A., Julian, P.R.: Description of global-scale circulation cells in the tropics with a 40–50 day period. Journal of Atmospheric Sciences 29(6), 1109–1123 (1972)
(40) Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10 (2015)
(41) McGovern, A., et al.: Making the black box more transparent: Understanding the physical implications of machine learning. Bulletin of the American Meteorological Society 100(11), 2175–2199 (2019)
(42) Molnar, C., Casalicchio, G., Bischl, B.: Interpretable machine learning–a brief history, state-of-the-art and challenges. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 417–431 (2020). Springer
(43) Mamalakis, A., Ebert-Uphoff, I., Barnes, E.A.: Explainable artificial intelligence in meteorology and climate science: Model fine-tuning, calibrating trust and learning new science. In: International Workshop on Extending Explainable AI Beyond Deep Models and Classifiers, pp. 315–339 (2020). Springer
(44) Toms, B.A., Kashinath, K., Yang, D., et al.: Testing the reliability of interpretable neural networks in geoscience using the madden–julian oscillation. Geoscientific Model Development 14(7), 4495–4508 (2021)
(45) Rasp, S., Thuerey, N.: Data-driven medium-range weather prediction with a resnet pretrained on climate simulations: A new model for weatherbench. J. Adv. Model. Earth Syst. 13(2), 2020–002405 (2021)
(46) Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. Preprint at https://arxiv.org/abs/1312.6034 (2013)
(47) Dunstone, N., et al.: Windows of opportunity for predicting seasonal climate extremes highlighted by the pakistan floods of 2022. Nature Communications 14(1), 6544 (2023)
(48) Faghmous, J.H., Kumar, V.: A big data guide to understanding climate change: The case for theory-guided data science. Big data 2(3), 155–163 (2014)
(49) Karpatne, A., et al.: Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Transactions on knowledge and data engineering 29(10), 2318–2331 (2017)
(50) Chattopadhyay, A., Hassanzadeh, P.: Long-term instabilities of deep learning-based digital twins of the climate system: The cause and a solution. Preprint at https://arxiv.org/abs/2304.07029 (2023)
(51) Epstein, E.S.: A scoring system for probability forecasts of ranked categories. Journal of Applied Meteorology (1962-1982) 8(6), 985–987 (1969)
(52) Wilks, D.S.: Statistical Methods in the Atmospheric Sciences vol. 100, 3rd edn. (2011)
(53) Vitart, F., Robertson, A.W.: The sub-seasonal to seasonal prediction project (s2s) and the prediction of extreme events. npj Climate and Atmospheric Science 1(1), 3 (2018)
(54) Merryfield, W.J., et al.: Current and emerging developments in subseasonal to decadal prediction. Bulletin of the American Meteorological Society 101(6), 869–896 (2020)
(55) Zhang, C.: Madden-julian oscillation. Reviews of Geophysics 43(2) (2005)
(56) Zhang, C.: Madden-julian oscillation: Bridging weather and climate. Bulletin of the American Meteorological Society 94(12), 1849–1870 (2013)
(57) Zhang, C., et al.: Cracking the mjo nut. Geophysical Research Letters 40(6), 1223–1230 (2013)
(58) Neena, J., et al.: Predictability of the madden–julian oscillation in the intraseasonal variability hindcast experiment (isvhe). Journal of Climate 27(12), 4531–4543 (2014)
(59) Kim, H., Vitart, F., Waliser, D.E.: Prediction of the madden–julian oscillation: A review. Journal of Climate 31(23), 9425–9443 (2018)
(60) Jiang, X., et al.: Fifty years of research on the madden-julian oscillation: Recent progress, challenges, and perspectives. Journal of Geophysical Research: Atmospheres 125(17), 2019–030911 (2020)
(61) Wu, J., Jin, F.-F.: Improving the mjo forecast of s2s operation models by correcting their biases in linear dynamics. Geophysical Research Letters 48(6), 1–10 (2021)
(62) Silini, R., et al.: Improving the prediction of the madden–julian oscillation of the ecmwf model by post-processing. Earth System Dynamics 13(3), 1157–1165 (2022)
(63) Kim, H., Ham, Y.G., Joo, Y.S., Son, S.W.: Deep learning for bias correction of mjo prediction. Nature Communications 12(1) (2021)
(64) Silini, R., Barreiro, M., Masoller, C.: Machine learning prediction of the madden-julian oscillation. npj Climate and Atmospheric Science 4(1), 57 (2021)
(65) Delaunay, A., Christensen, H.M.: Interpretable deep learning for probabilistic mjo prediction. Geophysical Research Letters 49(16), 2022–098566 (2022)
(66) Wheeler, M.C., Hendon, H.H.: An all-season real-time multivariate mjo index: Development of an index for monitoring and prediction. Monthly weather review 132(8), 1917–1932 (2004)
(67) Wallace, J.M., Gutzler, D.S.: Teleconnections in the geopotential height field during the northern hemisphere winter. Monthly weather review 109(4), 784–812 (1981)
(68) Mo, K.C., Ghil, M.: Statistics and dynamics of persistent anomalies. Journal of Atmospheric Sciences 44(5), 877–902 (1987)
(69) Zhu, B., Wang, B.: The 30-60-day convection seesaw between the tropical indian and western pacific oceans. Journal of the Atmospheric Sciences 50, 184–199 (1993)
(70) Walker, G.T.: Correlations in seasonal variations of weather. viii, a further study of world weather. Men. Indian Meteor. Dept. 24, 275–332 (1924)
(71) Huang, R.H.: Influence of the heat source anomaly over the western tropical pacific for the subtropical high over east asia. In: International Conference on the General Circulation of East Asia. Chendu, China, April 10-15, 1987, pp. 40–50 (1987)
(72) Savarin, A., Chen, S.S.: Pathways to better prediction of the mjo: 2. impacts of atmosphere-ocean coupling on the upper ocean and mjo propagation. Journal of Advances in Modeling Earth Systems 14(6), 2021–002929 (2022)
(73) Hong, C.-C., et al.: Causes of 2022 pakistan flooding and its linkage with china and europe heatwaves. npj Climate and Atmospheric Science 6(1), 163 (2023)
(74) Yang, R., et al.: Interpretable Machine Learning for Weather and Climate Prediction: A Survey. Preprint at https://arxiv.org/abs/2403.18864 (2024)
(75) Haiden, T., et al.: Evaluation of ECMWF forecasts, including the 2018 upgrade (2018)
(76) Nogueira, M.: Inter-comparison of era-5, era-interim and gpcp rainfall over the last 40 years: Process-based analysis of systematic and random differences. Journal of Hydrology 583, 1–17 (2020)
(77) Lavers, D.A., Simmons, A., Vamborg, F., Rodwell, M.J.: An evaluation of era5 precipitation for climate monitoring. Q. J. R. Meteorol. Soc. 148(748), 3152–3165 (2022). https://doi.org/10.1002/qj.4351
(78) Hersbach, H., et al.: The era5 global reanalysis. Q. J. R. Meteorol. Soc. 146(730), 1999–2049 (2020)
(79) Adler, R.F., et al.: The global precipitation climatology project (gpcp) monthly analysis (new version 2.3) and a review of 2017 global precipitation. Atmosphere 9(4), 138 (2018)
(80) Albers, J.R., Newman, M.: Subseasonal predictability of the north atlantic oscillation. Environmental Research Letters 16(4), 1–10 (2021)
(81) Yan, Y., Liu, B., Zhu, C.: Subseasonal predictability of south china sea summer monsoon onset with the ecmwf s2s forecasting system. Geophysical Research Letters 48(24), 2021–095943 (2021)
(82) Richter, J.H., et al.: Quantifying sources of subseasonal prediction skill in cesm2. npj Climate and Atmospheric Science 7(1), 59 (2024)
(83) Stan, C., et al.: Advances in the prediction of mjo teleconnections in the s2s forecast systems. Bulletin of the American Meteorological Society 103(6), 1426–1447 (2022)
(84) Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder-decoder approaches. Preprint at https://arxiv.org/abs/1409.1259 (2014)
(85) Olivetti, L., Messori, G.: Advances and prospects of deep learning for medium-range extreme weather forecasting. EGUsphere 2023, 1–20 (2023)
(86) Doersch, C.: Tutorial on variational autoencoders. Preprint at https://arxiv.org/abs/1606.05908 (2016)
(87) Zhao, T., Zhao, R., Eskenazi, M.: Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. Preprint at https://arxiv.org/abs/1703.10960 (2017)
(88) Kingma, D.P., Welling, M.: An introduction to variational autoencoders. Foundations and Trends® in Machine Learning 12(4), 307–392 (2019)
(89) Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. Preprint at https://arxiv.org/abs/1412.6980 (2017)
(90) Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2017)
(91) Samek, W., Wiegand, T., Müller, K.-R.: Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. Preprint at https://arxiv.org/abs/1708.08296 (2017)
(92) Weirich-Benet, E., Pyrina, M., Jiménez-Esteve, B., Fraenkel, E., Cohen, J., Domeisen, D.I.: Subseasonal prediction of central european summer heatwaves with linear and random forest machine learning models. Artificial Intelligence for the Earth Systems 2(2), 220038 (2023)
(93) Weigel, A.P., Liniger, M.A., Appenzeller, C.: The discrete brier and ranked probability skill scores. Monthly Weather Review 135(1), 118–124 (2007)
(94) Brier, G.W.: Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review 78(1), 1 (1950)
(95) Rashid, H.A., Hendon, H.H., Wheeler, M.C., Alves, O.: Prediction of the madden–julian oscillation with the poama dynamical prediction system. Climate Dynamics 36, 649–661 (2011)
(96) Wang, S., Sobel, A.H., Tippett, M.K., Vitart, F.: Prediction and predictability of tropical intraseasonal convection: Seasonal dependence and the maritime continent prediction barrier. Climate Dynamics 52, 6015–6031 (2019)
(97) Chen, L., et al.: A machine learning model that outperforms conventional global subseasonal forecast models (Version 1.0) [Figure Dataset]. Zenodo. https://zenodo.org/records/12662702 (2024)
(98) Chen, L., et al.: A machine learning model that outperforms conventional global subseasonal forecast models (Version 1.0) [Dataset] [Software]. Zenodo. https://zenodo.org/records/10402083 (2023)

Acknowledgements

This work was supported by the National Key R&D Program of China under Grant 2021YFA0718000, National Natural Science Foundation of China under Grant 42175052, and China Meteorological Administration (CMA) Youth Innovation Team (CMA2024QN06. We extend our sincere gratitude to the researchers at ECMWF for their invaluable contributions to the collection, archival, dissemination, and maintenance of the ERA5 reanalysis dataset and ECMWF S2S reforecast and real-time forecast data.

Author Contributions

H.L., X.Z, L.C., and B.L. designed the project. L.C. designed and performed the model training. X.Z. and L.C. performed the analysis under supervision of H.L., B.L., W.J., Q.C., L.W., C.L., Z.H., and Y.Q.. X.Z. and L.C. wrote and revised the manuscript. J.W., D.C., and S.X. contributed to interpreting results and discussions of associated dynamics.

Competing interests

The authors declare no competing interests.

Tables

Table 1: A summary of all the upper-air and surface variable names and their abbreviations in this paper.

Type	Full name	Abbreviation
upper-air variables	geopotential	Z
	temperature	T
	u component of wind	U
	v component of wind	V
	specific humidity	Q
surface variables	2-meter temperature	T2M
	2-meter dewpoint temperature	D2M
	sea surface temperature	SST
	outgoing longwave radiation	OLR
	10-meter u wind component	U10
	10-meter v wind component	V10
	100-meter u wind component	U100
	100-meter v wind component	V100
	mean sea-level pressure	MSL
	total column water vapor	TCWV
	total precipitation	TP

Figure Legends

Refer to caption — Figure 1: Comparison of globally-averaged and latitude-weighted temporal anomaly correlation coefficient (TCC) of the ensemble mean between ECMWF subseasonal-to-seasonal (S2S) reforecasts (in blue) and FuXi-S2S forecasts (in red) for total precipitation (TP), 2-meter temperature (T2M), geopotential at 500 hPa (Z500), and outgoing longwave radiation (OLR). Rows 1 and 2 represent the performance across these variables, utilizing all testing data from the period spanning from 2017 to 2021. A bootstrapping approach, repeated 1000 times, is used for significance testing. When the FuXi-S2S forecasts fail to show a statistically significant improvement over the ECMWF S2S reforecasts at the 97.5% confidence level, a pale color scheme is used to denote these results.