\jyear

2021 \externaldocumentSI

{CJK*}

UTF8gbsn

\equalcont

These authors contributed equally to this work. \equalcontThese authors contributed equally to this work.

[1,2]\fnmHao \surLi

[6]\fnmBo \surLu

1]\orgdivArtificial Intelligence Innovation and Incubation Institute, \orgnameFudan University, \orgaddress\cityShanghai, \postcode200433, \countryChina

2]\orgnameShanghai Academy of Artificial Intelligence for Science, \orgaddress\cityShanghai, \postcode200232, \countryChina

3]\orgdivDepartment of Atmospheric and Oceanic Sciences and Institute of Atmospheric Sciences, \orgnameFudan University, \orgaddress\cityShanghai, \postcode200433, \countryChina

4]\orgdivNational Meteorological Information Center, \orgnameChina Meteorological Administration, \orgaddress\cityBeijing, \postcode100081, \countryChina

5]\orgdivInstitute for Climate and Application Research (ICAR)/CIC-FEMD/KLME/ILCEC, \orgnameNanjing University of Information Science and Technology, \orgaddress\cityNanjing, \postcode210044, \countryChina

6]\orgdivChina Meteorological Administration Key Laboratory for Climate Prediction Studies, \orgnameNational Climate Center, \orgaddress\cityBeijing, \postcode100081, \countryChina

FuXi-ENS: A machine learning model for medium-range ensemble weather forecasting

\fnmXiaohui \surZhong x7zhong@gmail.com    \fnmLei \surChen cltpys@163.com    lihao__\__lh@fudan.edu.cn    \fnmJun \surLiu liujun__\__090003@163.com    \fnmXu \surFan fanxu@sais.com.cn    \fnmJie \surFeng fengjiefj@fudan.edu.cn    \fnmKan \surDai daikan@cma.gov.cn    \fnmJing-Jia \surLuo jjluo@nuist.edu.cn    \fnmJie \surWu wujie@cma.gov.cn    \fnmYuan \surQi qiyuan@fudan.edu.cn    bolu@cma.gov.cn [ [ [ [ [ [
Abstract

Ensemble forecasting is crucial for improving weather predictions, especially for forecasts of extreme events. Constructing an ensemble prediction system (EPS) based on conventional numerical weather prediction (NWP) models is highly computationally expensive. Machine learning (ML) models have emerged as valuable tools for deterministic weather forecasts, providing forecasts with significantly reduced computational requirements and even surpassing the forecast performance of traditional NWP models. However, challenges arise when applying ML models to ensemble forecasting. Recent ML models, such as GenCast and SEEDS model, rely on the ERA5 Ensemble of Data Assimilations (EDA) or operational NWP ensemble members for forecast generation. Their spatial resolution is also considered too coarse for many applications. To overcome these limitations, we introduce FuXi-ENS, an advanced ML model designed to deliver 6-hourly global ensemble weather forecasts up to 15 days. This model runs at a significantly increased spatial resolution of 0.25°, incorporating 5 atmospheric variables at 13 pressure levels, along with 13 surface variables. By leveraging the inherent probabilistic nature of Variational AutoEncoder (VAE), FuXi-ENS optimizes a loss function that combines the continuous ranked probability score (CRPS) and the Kullback–Leibler (KL) divergence between the predicted and target distribution, facilitating the incorporation of flow-dependent perturbations in both initial conditions and forecast. This innovative approach makes FuXi-ENS an advancement over the traditional ones that use L1 loss combined with the KL loss in standard VAE models for ensemble weather forecasting. Evaluation results demonstrate that FuXi-ENS outperforms ensemble forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF), a world leading NWP model, in the CRPS of 98.1% of 360 variable and forecast lead time combinations. This achievement underscores the potential of the FuXi-ENS model to enhance ensemble weather forecasts, offering a promising direction for further development in this field.

keywords:
machine learning, FuXi, ensemble forecast, medium-range forecast

1 Introduction

Weather forecasts are inherently uncertain due to the chaotic nature of the highly nonlinear atmosphere Lorenz1963 , necessitating the incorporation of uncertainty assessments for an improved forecast national2006completing . For various applications, forecasts that include uncertainty estimates are usually more valuable scher2018predicting ; calvo2024real . This is particularly true in sectors sensitive to weather and climate conditions for risk assessment Palmer2002 ; GOODARZI2019 , in renewable energy forecasting to ensure reliable and cost-effective operations of power system SPERATI2016 ; wang2017deep ; wu2018probabilistic , and in the aviation industry for enhanced safety and efficiency zhang2017probabilistic . The primary sources of uncertainty arise from two main factors: imperfections in forecast models that fail to accurately simulate all atmospheric processes, and inaccuracies in initial conditions due to observational data capturing only a limited subset of atmospheric information and deficiencies in data assimilation systems.

To date, ensemble forecasting has proven to be the most promising method for estimating forecast uncertainty, primarily through multiple runs of numerical weather prediction (NWP) models. Each run introduces slight variations in initial conditions and model physics gneiting2005weather ; leutbecher2008ensemble , effectively addressing the major sources of forecast uncertainty. For instance, the European Centre for Medium-Range Weather Forecasts (ECMWF) ensemble prediction system (EPS), a globally recognized leader in operational medium-range EPS, employs the singular vector (SV) approach Molteni1996 ; Buizza1995 ; barkmeijer19993d and the stochastically perturbed parametrization tendency (SPPT) scheme buizza1999stochastic ; palmer2009stochastic to introduce perturbations to initial conditions and physical tendencies. The SV method identifies the most sensitive directions within the initial conditions, where even minor modifications can result in significantly different forecasts, proving particularly invaluable for medium-range weather forecasting.

Despite the efficacy of convectional EPS, ensemble forecasts are computationally expensive, often necessitating a compromise in model spatial resolution and/or the number of ensemble members. Historically, the ECMWF has performed ensemble forecasts at a lower resolution than its deterministic counterpart. Recent advancements in high-performance computing have enabled high-resolution ensemble forecasting at a spatial resolution of 9 km, matching that of the high-resolution forecast (HRES) 111On June 27th, 2023, high-performance computing (HPC) upgrades facilitated high-resolution ensemble at a spatial resolution of 9 km, equivalent to the HRES.. Moreover, the ensemble approach is crucial for data assimilation, effectively enhancing forecast reliability by capturing uncertainties in background through the spread of ensemble members. However, the substantial computational costs limit the number of ensemble members that can be operationally conducted, constraining the ensembles’ ability to fully represent the probability distribution of possible weather scenarios. Despite these challenges, increasing the ensemble size has been shown to improve forecast performance buizza1999stochastic ; Buizza2019 . Nevertheless, computational constraints typically limit global ensemble forecasts to between 14 and 51 members across major forecasting centers leutbecher2019ensemble ; lang2021more ; GEFS2022 . Therefore, developing more computationally efficient methods for producing ensemble weather forecasts remains a vital area of research.

Recent advancements in machine learning (ML) have significantly enhanced the computational efficiency and accuracy of global weather forecasting. ML-based weather forecasting models now provide promising alternatives to traditional NWP models pathak2022fourcastnet ; bi2022panguweather ; lam2022graphcast ; chen2023fuxi ; chen2023fengwu ; bouallegue2024aifs , often matching or even surpassing the forecast skill of the high-resolution forecast (HRES) produced by ECMWF de2023machine . Initially, ML applications in this field focused on deterministic forecasts with an emphasis on minimizing errors such as the mean absolute error (MAE) or root mean square error (RMSE), which limited the assessment of forecast uncertainties. Subsequent studies have ventured into the more complex and challenging domain of ensemble forecasting. Chen et al. chen2023fuxi employed random Perlin noise to perturb initial conditions, which are independent of the background flow. This approach led to a reduction in ensemble spread from the 9th day of forecast onward. Price et al. price2023gencast developed GenCast, a diffusion model sohl2015deep ; karras2022elucidating capable of generating ensemble forecasts by sampling from a joint probability distribution of potential weather scenarios across space and time. GenCast can produce ensemble forecasts for up to 15 days at a spatial resolution of 0.25°and a temporal resolution of 12 hours, showing superior performance compared to ECMWF ensemble forecasts. Unlike previous ML models that relied only on the ECMWF ERA5 reanalysis dataset hersbach2020era5 , GenCast also incorporates the ERA5 Ensemble of Data Assimilations (EDA), which includes 9 perturbed members and 1 control member at a spatial resolution of 0.5625°to account for uncertainties in initial conditions. However, GenCast’s dependence on the ERA5 EDA poses limitations for operational forecasting. Additionally, Li et al. SEEDS2024 proposed Scalable Ensemble Envelope Diffusion Sampler (SEEDS), which generates large ensembles using two members from the Global Ensemble Forecast System (GEFS) version 12 GEFS2022 , the operational ensemble NWP system of the United Sates. SEEDS consists of two models: the generative ensemble emulation (SEEDS-GEE) model, which emulates the distribution of GEFS, and the generative postprocessing (SEEDS-GPP) model, which corrects biases in the GEFS by integrating distributions from GEFS and ERA5 EDA. However, SEEDS requires two operational NWP ensemble member for forecast generation and includes only 8 variables at a 2°resolution. Brenowitz et al. brenowitz2024practical constructed ensembles using lagged ensemble forecasts (LEF) hoffman1983lagged . The LEF method, while simple to implement and free from model training, is limited by the number of their ensemble members that can be generated. Inspired by probabilistic nature of Variational AutoEncoder (VAE) doersch2016tutorial ; zhao2017learning , SwinVRNN hu2022swinvrnn and FuXi-S2S chen2023fuxis2s focus on medium-range and sub-seasonal ensemble forecasting, respectively. These models combine reconstruction loss and the Kullback–Leibler (KL) divergence in their loss functions, introducing perturbations into the latent space to enhance forecast reliability. Despite its pioneering performance in medium-range ensemble weather forecasting, SwinVRNN produces forecasts at a 5.625°resolution, which is too coarse for many applications. While ML models have made significant efforts in ensemble weather forecasting, the optimal approach for ML-based ensemble forecasting remains an open question. Additionally, it is unclear when ML-based ensemble weather forecasting models will consistently outperform traditional ECMWF ensembles, particularly in accurately predicting the uncertainty and the likelihood of high-impact weather events at spatial resolutions.

In this paper, we introduce FuXi-ENS, a ML-based medium-range ensemble weather forecasting model that outperforms the ECMWF ensemble at a fine spatial resolution of 0.25°. FuXi-ENS generates 6-hourly forecasts up to 15 days, encompassing 5 atmospheric variables at 13 pressure levels and 13 surface variables. This model uniquely combines the continuous ranked probability score (CRPS) and KL loss, a method superior to the traditional usage of L1 loss paired with the KL loss in classical VAE model. Specifically, CRPS proves more effective than L1 loss for ensemble weather forecasting. Furthermore, FuXi-ENS introduces perturbations at both the initial conditions and each forecast step. These perturbations mirror the ensemble generation process in NWP models by accounting for errors in both initial conditions and physical tendencies. FuXi-ENS is developed using 17 years of 6-hourly ECMWF ERA5 reanalysis data at a spatial resolution of 0.25superscript0.250.25^{\circ}0.25 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT. It employs analysis data and a deterministic model for initialization, complemented by a perturbation module that introduces perturbations at initialization step and every subsequent forecast steps. This distinguishes FuXi-ENS from models like SEEDS and GenCast, which rely on either EDA or operational ensemble forecast members.

FuXi-ENS demonstrates remarkable performance, outperforming the ECMWF ensemble across multiple metrics, including the accuracy of the ensemble mean forecast, ensemble verification metrics, and the prediction of extreme weather events. Extreme weather events have significantly impacted, and will continue to impact society and the economy mirza2003climate ; jahn2015economics ; frame2020climate . Recent decades have witnessed a notable increase in the frequency of such events globally, including heat wave and extreme rainfall, due to climate change national2016attribution ; newman2023global ; Aigner2023 . Accurately predicting these events remains a significant and long-standing challenge in weather forecasting. Among these, tropical cyclones (TCs), as one of the most destructive extreme events, pose severe threats to human lives and property mousavi2011global ; lang2012impact . Enhanced accuracy in forecasting TC track can reduce unnecessary warnings and evacuations, thereby improving emergency preparedness and management hamill2011global ; conroy2023track . Additionally, research suggests that anthropogenic climate warming and rising sea surface temperatures (SSTs) are likely to intensify TCs and increase their associated rainfall pielke2005hurricanes ; kossin2013trend ; patricola2018anthropogenic ; Knutson2019 . Therefore, this studies conduct a comprehensive evaluation of probabilistic TC track forecasts using the FuXi-ENS and ECMWF ensemble models, finding FuXi-ENS’s performance comparable to that of ECMWF ensemble. Moreover, the computational efficiency of the FuXi-ENS model is remarkable. It completes a 15-day forecast with a 6-hourly temporal resolution in approximately 10 seconds per member on an Nvidia A100 graphics processing unit (GPU), which is negligible compared to conventional NWP models. The model’s high spatial resolution and rapid computation speed enable a broad range of practical applications, capitalizing on the fast and accurate ensemble forecasts provided by FuXi-ENS.

2 Results

This study presents a comprehensive evaluation of the 48-member FuXi-ENS forecasts in forecasting weather during the testing period of one year in 2018. Each of the 8 A100 GPUs contributes to the ensemble by generating 6 members, collectively producing a total of 48 members. We compare the performance of FuXi-ENS with that of the 51-member ECMWF ensemble forecasts, utilizing a variety of metrics. These metrics encompass deterministic skill scores for ensemble mean forecasts, probabilistic forecast metrics derived from all ensemble members, and particular evaluations for extreme weather events, with a focus on probabilistic forecasts of TC tracks and the record-breaking 2018 heat wave in Northeast Asia.

2.1 Deterministic metrics

Deterministic metrics such as RMSE and anomaly correlation coefficient (ACC) are crucial for evaluating ensemble mean weather forecasts murphy1989skill ; hamill2013noaa . RMSE quantifies the average magnitude of forecast errors, providing a thorough assessment of accuracy, whereas ACC evaluates the correlation between predicted and observed anomalies, reflecting the model’s ability to capture large-scale synoptic patterns.

This subsection presents a comparative analysis of ensemble mean forecasts from the FuXi-ENS and ECMWF ensemble using these metrics. Figure 1 shows a time series of the globally-averaged, latitude-weighted RMSE for both models in RMSE at lead times of 0-15 days for 6 variables: geopotential at 500 hPa (Z500), temperature at 850 hPa (T850), wind speed at 850 hPa (WS850), mean sea level pressure (MSL), 2-meter temperature (T2M), and wind speed at 10 meter (WS10M). Due to subtle differences between the FuXi-ENS and ECMWF ensemble, normalized differences in RMSE are also provided for enhanced clarity. Additionally, Supplementary Figure LABEL:ACC_plot illustrates the absolute ACC values and their normalized differences as a function of forecast lead times for the same 6 variables in Figure 1. Many other atmospheric variables at different pressure levels are not included in the skill comparisons due to the unavailability of data in the ECMWF archive. Furthermore, acquiring all 51 members of the ECMWF ensemble from the ECMWF archive is challenging due to the substantial data volumes involved, which can not be easily downloaded. Normalized differences in RMSE and ACC are calculated using the ECMWF ensemble mean as the reference. FuXi-ENS forecasts demonstrate persistently lower RMSE and higher ACC skill than the ECMWF ensemble mean for nearly all the evaluated variables and forecast lead times, except for Z500 and T850, where the ensemble mean of FuXi-ENS is less accurate than that of ECMWF ensemble at 2 out of 60 forecast lead times.

Supplementary Figures LABEL:RMSE_spatial_upper and LABEL:RMSE_spatial_surface provide visual comparisons of average RMSE without latitude weighting for ensemble mean forecasts from the ECMWF ensemble and FuXi-ENS. These figures compare predictions across 3 upper-air variables (Z500, T850, and WS850) and 3 surface variables (MSL, T2M, and WS10M) at lead times of 5 days, 10 days and 15 days, using all testing data from 2018. Each figure’s first and second columns depict the average RMSE for each model, while the third column illustrates the differences between them. These RMSE differences are represented by areas in blue, red, and white to indicate regions where FuXi-ENS performs better (lower RMSE), worse (higher RMSE), or equivalently to the ECMWF ensemble, respectively. The RMSE spatial patterns generally show lower values in tropical versus extra-tropical regions. At a 5-day forecast lead time, FuXi-ENS typically demonstrates comparative or superior performance, as evidenced by the prevalence of blue in the difference maps. Notably, FuXi-ENS particularly outperforms ECMWF ensemble in WS850, WS10M, and MSL over oceanic regions, and in T850 and T2M over land areas. As lead times increase, the RMSE contrasts diminish, with FuXi-ENS showing varied performance relative to ECMWF ensemble across different regions. This pattern suggests fluctuations in the model’s efficacy over different geographic regions with increasing forecast lead times. Overall, the spatial distribution of RMSE differences demonstrates that FuXi-ENS generally surpasses the ECMWF ensemble in accuracy at most global grid points, as indicated by the predominance of blue colors.

Refer to caption
Figure 1: Comparison of the globally-averaged latitude-weighted RMSE (first and second rows) as well as normalized RMSE (third and fourth rows) differences of ensemble mean forecasts from the ECMWF ensemble (blue lines) and FuXi-ENS (red lines) for 3 upper-air variables, including Z500, T850, and WS850, and 3 surface variables, such as MSL, T2M, and WS10M, in 15-day forecasts using testing data from 2018. The normalized differences are calculated using the ECMWF ensemble mean as the reference.

2.2 Probabilistic metrics

In the evaluation of ensemble forecasts, statistical metrics of probability forecasts such as the CRPS and spread-skill ratio (SSR) are indispensable. The CRPS effectively compares the predicted probability distribution to observed values, focusing on the forecast’s accuracy across its entire distribution Hersbach2000 . The SSR) assesses forecast reliability by comparing the ensemble’s forecast spread with the ensemble mean error, revealing whether the ensemble’s variability matches its predictive accuracy. The ensemble spread quantifies the inherent forecast uncertainty among ensemble members, with a greater spread typically suggesting higher uncertainty, which should ideally correlate with the ensemble’s forecast error. This subsection compares these probabilistic metrics between the FuXi-ENS and ECMWF ensembles.

Figure 2 illustrates the time series of normalized differences in globally-averaged, latitude-weighted CRPS, ensemble spread, and SSR for the same 6 variables shown in Figure 1, over all 15-day forecast lead times. The CRPS values for the FuXi-ENS and ECMWF ensembles are close (refer to Supplementary Figure LABEL:CRPS_app), thus only the normalized CRPS differences are presented here. Overall, FuXi-ENS demonstrates superior CRPS values compared to the ECMWF ensemble in 360 variable and lead time combinations (360=6×60360660360=6\times 60360 = 6 × 60, 60 is the number of forecast steps), representing 98.1% of these comparisons. Specifically, FuXi-ENS outperforms the ECMWF ensemble in WS850, WS10M, MSL, and T2M throughout the entire 15-day forecast period. For Z500, the ECMWF ensemble initially leads in performance during the first day of forecasting but is surpassed by FuXi-ENS at about day 2. The FuXi-ENS ensemble then regains its superior performance from days 3 to 15. For T850, FuXi-ENS consistently surpasses the ECMWF ensemble over most of the forecast period except for the first 12 hours. Additionally, the ensemble spread for FuXi-ENS and ECMWF ensemble is comparable, especially for longer forecast lead times. This contrasts with previous methods such as random perturbations (e.g., Perlin noise), which do not account for the background flow and tend to decrease after 9 days chen2023fuxi . The SSR values for ECMWF ensemble are generally close to 1 across all 6 variables over all forecast periods, indicating that the spread serves as a reliable predictor of forecast uncertainties. However, for forecasts of ECMWF within the first day, SSR values exceed 1 for T850, WS850, MSL, and WS10M, suggesting overdispersion. In comparison, the SSR of FuXi-ENS is closer to 1 within the first day, possibly suggesting a properly generated initial ensemble in FuXi-ENS.

Similar to Supplementary Figures LABEL:RMSE_spatial_upper and LABEL:RMSE_spatial_surface, Supplementary Figures LABEL:CRPS_spatial_upper and LABEL:CRPS_spatial_surface illustrate the spatial distribution of CRPS for ensemble forecasts by the ECMWF ensemble and FuXi-ENS across 6 variables. For WS850, T850, WS10M, and T2M, FuXi-ENS either matches or surpasses the ECMWF ensemble in CRPS scores, as indicated by the prevalent blue and white areas. Notably, Z500 demonstrates the most significant variability in performance. These patterns suggest that FuXi-ENS generally provides more accurate probability forecasts than the ECMWF ensemble, particularly at shorter lead times and for the T2M and WS10 variables. The spatial maps of CRPS difference maps further highlight these enhancements, reflecting FuXi-ENS’s potential for improved medium-range ensemble weather predictions.

Refer to caption
Figure 2: Comparison of normalized differences in globally-averaged, latitude-weighted CRPS (first and second rows), ensemble spread (third and fourth rows), and SSR (fifth and sixth rows) of ensemble forecasts from ECMWF ensemble (blue lines) and FuXi-ENS (red lines) for 6 variables: Z500, T850, WS850, MSL, T2M, and WS10M, in 15-day forecasts using testing data from 2018. The normalized differences are calculated using the ECMWF ensemble mean as the reference.

2.3 Extreme forecast

Ensemble forecasts are crucial for estimating the likelihood of extreme events. In this study, we evaluate the performance of FuXi-ENS and ECMWF ensemble in predicting such events by computing the Brier score (BS) brier1950verification for values exceeding the high and low climatological percentiles. These percentiles are determined for each variable based on the 24-year ERA5 data from 1993 to 2016 for each grid point, month of the year, and time of day.

Figure 3 shows a comparison of the ECMWF and FuXi-ENS ensemble’s performance in terms of the normalized difference in the globally-averaged, latitude-weighted BS for percentiles above the 90th, 95th, 98th, and below the 10th, 5th, and 2nd, for the variables Z500, T850, MSL, and T2M across all lead times in 15-day forecasts. These thresholds represent extreme high and low events, respectively. A comprehensive analysis of 1440 cases, calculated as the product of 6 percentiles, 4 variables, and 60 forecast lead times (1440=6×4×60144064601440=6\times 4\times 601440 = 6 × 4 × 60), reveals that FuXi-ENS demonstrates superior performance to the ECMWF ensemble in 96.3% of these cases. Detailed evaluations of the probabilistic TC track forecast and predictions of the record-breaking Northeast Asia heatwave are provided in the following sections.

Refer to caption
Figure 3: Comparison of the ECMWF ensemble (blue lines) and the FuXi-ENS (red lines) on the normalized differences in globally-averaged, latitude-weighted BS for \gt90th (first row), \gt95th (second row), \gt98th (third row), \lt10th (fourth row), \lt5th (fifth row), and \lt2nd (sixth row) percentile events, (second row), and (third row) of for 4 variables: Z500 (first column), T850 (second column), MSL (third column), and T2M (fourth column), in 15-day forecasts using testing data from 2018.

2.4 Probabilistic Tropical Cyclone track forecast

Accurate prediction of TC tracks is crucial for public safety and economic stability due to the associated high winds, heavy rainfall, and storm surges along coastlines tsai2013detection . Significant advancements in TC track forecasting have been made, reducing errors by approximately two-thirds over the past generation, representing a significant meteorological achievement. However, evidence suggests that TC track prediction be approaching its predictability limits. This is indicated by the analyses of linear trends in annual TC track forecast errors averaged over 5 years for the North Atlantic and eastern North Pacific Landsea2018 . Recently, ML-based deterministic weather forecasting models bi2022panguweather ; lam2022graphcast ; fuxi_extreme2023 have demonstrated more accurate TC track forecasts compared to the ECMWF HRES, suggesting potentials for further improvements. Despite these advancements, TC track forecasts still involve significant uncertainties, with both conventional NWP model and ML models often failing to correctly predict landfall locations. Consequently, end users of TC forecasts require reliable track forecasts and estimates of forecast uncertainty. The traditional use of a deterministic model with static climatological errors demaria2013improvements proves insufficient for measuring the uncertainty of TC track forecasts. In contrast, ensemble forecasts offer substantial values by providing scenario-dependent uncertainty estimates zhang2015verification .

This section compares probabilistic TC track forecasts between the FuXi-ENS and the ECMWF ensemble, with the latter serving as a baseline. The ECMWF ensemble, recognized as one of the best global ensemble systems, incorporates specific enhancements for TC forecasting, such as the application of SVs to generate initial perturbations targeted at TCs puri2001ensemble . Figure 4 presents scatter plots displaying the accumulated mean position error (AccERRORTCsubscriptAccERROR𝑇𝐶\textrm{AccERROR}_{TC}AccERROR start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT) against the accumulated spread (AccSpreadTCsubscriptAccSpread𝑇𝐶\textrm{AccSpread}_{T}CAccSpread start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT italic_C) of the ensemble, as well as the along-track (ATTCsubscriptAT𝑇𝐶\textrm{AT}_{TC}AT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT) component against the cross-track (CTTCsubscriptCT𝑇𝐶\textrm{CT}_{TC}CT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT) component of the position errors at a 72-hour lead time. The analysis includes 20% of all 2018 testing data, encompassing 167 predictions for FuXi-ENS and 165 for the ECMWF ensemble. Only forecasts with more than two-thirds of ensemble members detecting TCs were included in the evaluation zhang2015verification . In the scatter plots of the AccERRORTCsubscriptAccERROR𝑇𝐶\textrm{AccERROR}_{TC}AccERROR start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT against the AccSpreadTCsubscriptAccSpread𝑇𝐶\textrm{AccSpread}_{T}CAccSpread start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT italic_C, FuXi-ENS shows 118 predictions above and 49 below the diagonal, indicating a predominance of forecasts where AccSpreadTCAccSpread𝑇𝐶\textrm{AccSpread}TCAccSpread italic_T italic_C falls below AccERRORTCAccERROR𝑇𝐶\textrm{AccERROR}{TC}AccERROR italic_T italic_C, thus underestimating forecast uncertainties. In contrast, the ECMWF ensemble, with 56 predictions above and 109 below the diagonal, appears to overestimate forecast uncertainties. Overall, FuXi-ENS demonstrates superior performance, showing smaller values of both AccSpreadTCsubscriptAccSpread𝑇𝐶\textrm{AccSpread}_{T}CAccSpread start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT italic_C and AccERRORTCsubscriptAccERROR𝑇𝐶\textrm{AccERROR}_{TC}AccERROR start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT compared to the ECMWF ensemble, indicating more accurate track forecasting.

The analysis of scatter plots comparing ATTCsubscriptAT𝑇𝐶\textrm{AT}_{TC}AT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT and CTTCsubscriptCT𝑇𝐶\textrm{CT}_{TC}CT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT provides insights into the TC speed (slow/fast) and directional biases (left/right) relative to the best track. In these plots, mean values are depicted by blue stars, while the blue dots represent the mean values within each quadrant. For the ECMWF ensemble, the forecast distribution across the four quadrants is as follows: 27 in the upper right, 48 in the lower right, 41 in the upper left, and 49 in the lower left. This distribution suggests that the mean track generally lags behind the best track (41 + 27 \gt\gt\gt 49 + 48), as evidenced by the higher numbers in the lower quadrants (97) versus the upper quadrants (68). Furthermore, the ECMWF ensemble’s predictions predominantly occur to the left of the best track, as indicated by the predominance of cases in the left quadrants compared to the right quadrants (41 + 49 \gt\gt\gt 27 + 48). In contrast, the FuXi-ENS shows a more balanced distribution with 42 forecasts each in the upper right and lower right quadrants, 40 and 45 in the upper left and lower left quadrants, respectively. This pattern shows only a slight lag behind the best track (40 + 42 \lt\lt\lt 45 + 42), and no significant deviation in the perpendicular direction (40 + 45 \approx 42 + 42). In summary, the the FuXi-ENS shows relatively smaller mean position errors compared to the ECMWF ensemble, thereby demonstrating its superior performance in TC track forecasting. Additionally, Supplementary Figure LABEL:TC_ensemble_boxplot presents box plots of AccERRORTCsubscriptAccERROR𝑇𝐶\textrm{AccERROR}_{TC}AccERROR start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT, AccSpreadTCsubscriptAccSpread𝑇𝐶\textrm{AccSpread}_{T}CAccSpread start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT italic_C, ATTCsubscriptAT𝑇𝐶\textrm{AT}_{TC}AT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT, and CTTCsubscriptCT𝑇𝐶\textrm{CT}_{TC}CT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT in 5-day forecasts for the FuXi-ENS and ECMWF ensemble. This figure clearly shows that, throughout the 5-day forecast period, the ECMWF ensemble consistently has a higher accumulated ensemble spread and positional error compared to the FuXi-ENS. Both ATTCsubscriptAT𝑇𝐶\textrm{AT}_{TC}AT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT and CTTCsubscriptCT𝑇𝐶\textrm{CT}_{TC}CT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT exhibit comparable magnitudes, though ATTCsubscriptAT𝑇𝐶\textrm{AT}_{TC}AT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT in the ECMWF ensemble shows significant deviations. Moreover, a case study on predicting the 2018 Hurricane Florence is presented in Supplementary material.

Refer to caption
Figure 4: Scatter plots comparing AccERRORTCsubscriptAccERROR𝑇𝐶\textrm{AccERROR}_{TC}AccERROR start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT against AccSpreadTCsubscriptAccSpread𝑇𝐶\textrm{AccSpread}_{T}CAccSpread start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT italic_C, and ATTCsubscriptAT𝑇𝐶\textrm{AT}_{TC}AT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT against CTTCsubscriptCT𝑇𝐶\textrm{CT}_{TC}CT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT at a 72-hour forecast lead time for the FuXi-ENS and ECMWF ensemble. In the scatter plots of AccERRORTCsubscriptAccERROR𝑇𝐶\textrm{AccERROR}_{TC}AccERROR start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT against AccSpreadTCsubscriptAccSpread𝑇𝐶\textrm{AccSpread}_{T}CAccSpread start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT italic_C, each red dot represents an individual forecast, summarized in the format S(M/N), where S is the total number of dots, M is the number above the diagonal, and N is the number below. In the scatter plots of ATTCsubscriptAT𝑇𝐶\textrm{AT}_{TC}AT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT against CTTCsubscriptCT𝑇𝐶\textrm{CT}_{TC}CT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT, red dots represent individual forecasts, with the blue dots denoting the mean values of red dots within each quadrant. The blue stars indicate the mean values of all forecasts.

2.5 Prediction of the Record-Breaking Northeast Asia heatwave in 2018

Heat waves are typically defined as prolonged periods of extreme heat resulting from consecutive hot days, which have negative impacts on human health, ecosystems, agriculture, and infrastructure. During the summer of 2018, Northeast Asia experienced a record-breaking heatwave, affecting various regions in Japan, the Korean Peninsula, and northeastern China WMO2018 ; Pang2020 , resulting in considerable economic damage and a significant number of fatalities. The Japan Meteorological Agency (JMA) reported a recording-setting maximum temperature of 41.1 °C in Kumagaya on July 23, the highest ever observed in the country. Similarly, on August 1, the Korean Meteorological Administration (KMA) recorded a peak temperature of 39.6°C in Seoul, the hottest day in 111 years. In northeastern China, Liaoning and Jilin provinces experienced temperatures approaching 39°C during July and August 2018, as reported by the China Meteorological Administration (CMA). The heatwave resulted in at least 138 in Japan and 42 in South Korea, with over 7,000 people hospitalized in Japan and 3,000 in South Korea due to heat-related illnesses. Given these severe consequences, improving the accuracy of forecasts for such high-impact weather events is critical to provide timely and effective warnings and mitigate their devastating effects.

Figure 5 presents a comparative analysis of T2M forecasts generated by the FuXi-ENS and ECMWF ensembles during the 2018 Northeast Asia heatwave. The ERA5 reanalysis data, shown in the first column, serves as a benchmark, depicting the spatial distribution of T2M at 6 UTC on July 23, 2018. Subsequent columns illustrate forecasts from both ensembles, including the best-performing member, the worst-performing member, the ensemble mean, and the ensemble spread. The rankings of different ensemble members are based on the T2M RMSE averaged over the depicted land area. Both ensemble model demonstrate comparable performances across the best and worst members, as well as the ensemble mean and spread. The forecasts include two lead times: 3 days (top two rows) and 11 days (bottom two rows) prior to the event. As the initialization date approaches the event, both models exhibit a reduced ensemble spread, indicating reduced forecast certainty. The spatial patterns of ensemble spread within the FuXi-ENS are similar to those of the ECMWF ensemble, reflecting consistency in estimating forecast uncertainties.

Several studies attribute the extreme heatwave to the positive anomalies in geopotential height (persistent high pressure) over Northeast Asia, which induced anomalous downward motion, amplified solar radiation, and consequently raised local temperatures xu2019large ; lu2020unusual ; Pang2020 ; ren2020attribution . The Z500 contours, superimposed on the forecasts in Figure 5, demonstrate that both the FuXi-ENS and ECMWF ensembles accurately predict the large-scale circulation patterns for forecasts initialized 3 days before the event, closely aligning with the ERA5 dataset.

Refer to caption
Figure 5: Spatial distributions of T2M generated by FuXi-ENS and ECMWF ensemble during the 2018 Northeast Asia heatwave. The first column displays the ERA5 reanalysis data for 6 UTC on July 23, 2018. Columns 2 through 5 are predictions from FuXi-ENS (first and third rows) and ECMWF ensemble (second and fourth rows), showing the best member (second column), worst member (third column), ensemble mean (fourth column), and ensemble spread (fifth column). The top two rows are forecasts made from 3 days earlier, and the bottom two rows are forecasts made from 11 days earlier. The black contours indicate the Z500.

3 Discussion

In recent years, ML models have demonstrated significant potential in weather forecasting, surpassing the performance of world’s top NWP models in medium-range forecasting. However, ML models face a greater challenge in outperforming NWP models in ensemble forecasting, which is critical for estimating the likelihood of extreme weather events. The accuracy of weather forecasts decreases with longer lead times due to the inherently chaotic nature of weather systems, which amplifies uncertainty as lead times increase. Unlike deterministic forecasts, which fail to capture these uncertainties, ensemble forecasts allow for assessment of the uncertainty in predictions by analyzing the spread of ensemble members. Moreover, the increasing intensity and frequency of extreme weather events, such as tropical cyclones (TCs), due to climate change, makes accurate uncertainty estimates increasingly important.

Recent ML models, such as Google’s GenCast and SEEDS model, rely on either EDA or two operational NWP ensemble members for their forecasts. This reliance complicates their implementation in operational settings and requires additional input data. Despite their superior forecasting performance compared to ECMWF ensemble forecasts, the 2°spatial resolution of SEEDS limits its utility for a wide range of applications. Furthermore, as a diffusion model, GenCast requires 20 solver steps to produce a single-step forecast, making it relatively time-consuming compared to other ML models. This paper introduces FuXi-ENS, a novel ML-based medium-range ensemble weather forecasting model that demonstrates superior performance compared to the ECMWF ensemble at a spatial resolution of 0.25°. FuXi-ENS is capable of generating 6-hourly forecasts up to 15 days, including 5 upper-air atmospheric variables at 13 pressure levels and 13 surface variables. A unique feature of the FuXi-ENS model is its incorporation of a innovative combination of CRPS and KL loss, which outperforms the conventional use of L1 loss combined with KL loss in classical VAE models. The flow-dependent perturbations of the FuXi-ENS significantly enhance its ensemble forecast performance. These perturbations are introduced at both the initial conditions and each forecast step, similar to traditional ensemble models. If we should have seen further, it is by standing upon the shoulders of giants. Results demonstrate that FuXi-ENS outperforms ECMWF ensemble in 98.1%percent\%% of 360 variable and forecast lead time combinations in terms of CRPS. Furthermore, case studies of high-impact weather events to further evaluate FuXi-ENS’s performance, such as probabilistic TC track forecasts and 2018 Northeast Asia heatwave. Compared to conventional NWP-based ensemble forecasts, the FuXi-ENS model distinguishes itself for its ability to rapidly and efficiently generate extensive ensemble forecasts, requiring significantly less time and computational resources. It can generate a single member of 15-day forecasts with a temporal resolution of 6 hours in approximately 10 seconds using an Nvidia A100 GPU.

Beyond medium-range weather forecasting, the framework used in developing FuXi-ENS shows promise for applications in other critical forecast scenarios that require ensemble predictions, such as subseasonal-to-seasonal forecasts, seasonal forecasts, and wave forecasts. The role of ML models in advancing weather forecasting is expected to grow, particularly for extreme weather events where uncertainty quantification is essential. Furthermore, ensemble forecasting plays a crucial role in data assimilation, as ensemble-based data assimilation methods can generate more accurate and flow-dependent estimates of background-error covariances bonavita2016evolution , thereby improving the quality of initial conditions. However, computational constraints often limit ensemble sizes significantly below the model’s degrees of freedom houtekamer2016review , leading to sampling errors in the ensemble-based data assimilation approach. FuXi-ENS mitigates these issues by reducing the cost of ensemble forecasting, facilitating the generation of larger ensembles for improved data assimilation.

4 Methods

4.1 Data

ERA5 stands as the fifth iteration of the ECMWF reanalysis dataset, offering a rich array of surface and upper-air variables. It operates at a remarkable temporal resolution of 1 hour and a horizontal resolution of approximately 31 km, covering data from January 1950 to the present day hersbach2020era5 . Recognized for its expansive temporal and spatial coverage coupled with exceptional accuracy, ERA5 stands as the most comprehensive and precise reanalysis archive globally. In our study, we utilize 6-hourly ERA5 dataset a spatial resolution of 0.25superscript0.250.25^{\circ}0.25 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT (721×14407211440721\times 1440721 × 1440 latitude-longitude grid points).

The FuXi-ENS model forecasts a total of 78 variables, encompassing 5 upper-air atmospheric variables across 13 pressure levels (50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, and 1000 hPa), and 11 surface variables. Among the upper-air atmospheric variables are geopotential (Z), temperature (T), u component of wind (U), v component of wind (V), and specific humidity (Q). The surface variables include 2-meter temperature (T2M), 2-meter dewpoint temperature (D2M), sea surface temperature (SST), 10-meter u wind component (U10), 10-meter v wind component (V10), 100-meter u wind component (U100), 100-meter v wind component (V100), mean sea-level pressure (MSL), surface net solar radiation (SNSR), surface net solar radiation downwards (SSRD), total sky direct solar radiation at surface (FDIR), top net thermal radiation (TTR) 222TTR, known as the negative of outgoing longwave radiation (OLR), and TP. Table 1 presents a detailed list of variables and their corresponding abbreviations. Variables such as U100, V100, SSR, SSRD, and FDIR have been selected for their relevance and utility in the wind and solar energy forecasting industries.

The model’s training relies on 15 years of data spanning from 2002 to 2016, 1 year (2017) data for validation and 1 year (2018) data for testing. More detailed evaluations of TP and predictions for the year 2022 can be found in the supplementary material.

Since the ECMWF EPS became fully operational in 1992, it has established as a global leader in medium-range and subseasonal-to-seasonal ensemble forecasting. The ECMWF ensemble consists of 51 forecasts, which include one control forecast generated from the optimal estimate of initial conditions, alongside 50 perturbed forecasts, each generated with slight variations in both initial conditions and model physics. On June 27, 2023, the spatial resolution of the ECMWF ensemble was upgraded to 9 km. For our analysis, we leveraged archived ECMWF ensemble forecasts from 2018, which has a spatial resolution of 0.25superscript0.250.25^{\circ}0.25 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT. Furthermore, we validated that the continuous ranked probability score (CRPS) based on all 51 members. To compute the BS, all 51 members of the ECMWF ensemble forecasts are required. Due to the substantial data volume, we retrieved the full ensemble every five days throughout 2018. Consequently, 20% samples in 2018 are available for BS calculation compared to other metrics.

Table 1: A summary of all the input and output variables. The ”Type” indicates whether the variable is a time-varying variable including upper-air, surface, and geographical variables, or a temporal variable. The “Full name” and “Abbreviation” columns refer to the complete name of each variables and their corresponding abbreviations in this paper. The ”Role” column clarifies whether each variable serves as both an input and a output, or is solely utilized as an input by our model.
Type Full name Abbreviation Role
upper-air variables geopotential Z Input and Output
temperature T Input and Output
u component of wind U Input and Output
v component of wind V Input and Output
specific humidity Q Input and Output
surface variables 2-meter temperature T2M Input and Output
2-meter dewpoint temperature D2M Input and Output
sea surface temperature SST Input and Output
10-meter u wind component U10 Input and Output
10-meter v wind component V10 Input and Output
100-meter u wind component U100 Input and Output
100-meter v wind component V100 Input and Output
mean sea-level pressure MSL Input and Output
surface net solar radiation SNSR Input and Output
surface net solar radiation downwards SSRD Input and Output
total sky direct solar radiation at surface FDIR Input and Output
top net thermal radiation TTR Input and Output
total precipitation TP Input and Output
geographical orography OR Input
land-sea mask LSM Input
latitude LAT Input
longitude LON Input
temporal hour of day HOUR Input
day of year DOY Input
step STEP Input

We assessed tropical cyclone (TC) forecasts using the International Best Track Archive for Climate Stewardship (IBTrACS) knapp2010 ; knapp2018 dataset from the National Oceanic and Atmospheric Administration (NOAA), as the reference. The IBTrACS aggregates global best track data into a comprehensive archive, in which each track represents a TC’s position at six-hourly intervals using latitude and longitude coordinates, among other critical features. Additionally, we employed a modified version of the ECMWF TC tracker algorithm (refer to Subsection 4.3), as detailed by van der Grijn der2002tropical to both the ECMWF and FuXi-ENS ensemble members, as well as to the ERA5 dataset, facilitating the extraction and evaluation of TC track forecasts. Statistically robust characteristics only emerge from sufficiently large sample sizes (ensemble numbers), thus we excluded forecasts with fewer than two-thirds of ensemble members detecting TCs for evaluating probabilistic TC track forecasts zhang2015verification .

4.2 FuXi-ENS model

Previous ML-based medium-range weather forecasting models have predominantly employed deterministic encoder-decoder architectures bi2022panguweather ; lam2022graphcast ; chen2023fuxi ; olivetti2023advances . These models, aimed at minimizing the root mean square error (RMSE) between their outputs and ERA5 reanalysis data, fail to provide the uncertainty inherent in their forecasts. This limitation is particularly critical when predicting high-impact extreme weather events. Moreover, unlike deterministic models that use ERA5 data as a benchmark, ensemble forecasts lack a definitive ”ground truth”. Consequently, direct comparisons between ensemble model outputs and actual conditions are impractical, posing a significant challenge for the development of supervised ML models in ensemble weather forecasting.

Building on previous research into ensemble models using conventional NWP models, our approach introduces perturbations at both initial conditions and each forecast step to account for both initial condition errors and model uncertainties. We develop an auto-regressive ML model, FuXi-ENS, for ensemble weather forecasting that can address the limitations of traditional deterministic encoder-decoder models. This model incorporates a Variational Autoencoder (VAE) doersch2016tutorial ; zhao2017learning ; kingma2019introduction due to its probabilistic nature and capability to quantify uncertainty, crucial for ensemble forecasting. VAEs are based on a probabilistic generative model. In a VAE, the encoder transforms input data into a probability distribution, typically Gaussian, from which the perturbations are sampled. This sampling process introduces randomness and variability, providing a direct measure of uncertainty. The distribution’s spread reflects the model’s uncertainty about the latent representation of input data.

The FuXi-ENS model consists of two main components: a perturbation model and a forecasting model (see Figure 6). The perturbation model, based on a VAE, transforms input data into a Gaussian distribution with the same dimensions as the input, thereby reflecting the probabilistic characteristics of the input data. In contrast, the forecasting model, which uses an encoder-decoder framework, generates predictions from the perturbed initial conditions sampled from this Gaussian distribution. This method captures the inherent uncertainty in the data and facilitates the generation of ensemble forecasts through repeated sampling. The number of samples equals the number of ensemble members produced. This process is repeated after each forecasting step, enhancing forecast uncertainty estimation and the model’s utility in ensemble forecasting. To enhance comprehension, we can draw an analogy between this ML methodology and established techniques in traditional ensemble forecasting. Within our framework, the deterministic forecasts rely on the forecasting model and initial conditions, while the perturbation model introduces flow-dependent variations to account for uncertainties in the initial conditions and model predictions at each forecasting step.

The two main components of the FuXi-ENS model are the perturbation model and the forecasting model. The model takes a input data cube with dimensions of 2×78×721×144027872114402\times 78\times 721\times 14402 × 78 × 721 × 1440, representing meteorological variables from two previous time steps (t1𝑡1{t-1}italic_t - 1 and t𝑡{t}italic_t), the count of upper-air and surface variables (C𝐶{C}italic_C), and the number of grid points along latitude (H) and longitude (W), respectively. This data cube is initially reshaped to dimensions of (2C)×H×W2𝐶𝐻𝑊(2C)\times H\times W( 2 italic_C ) × italic_H × italic_W (156×721×14401567211440156\times 721\times 1440156 × 721 × 1440). The perturbation model starts by reducing the dimensions of the meteorological data cube to 2C×90×1802𝐶901802C\times 90\times 1802 italic_C × 90 × 180 through a 2-dimensional (2D) convolution layer with a kernel size of 8 and a stride of 8. Simultaneously, geographical and temporal variables are processed by an identical 2D convolution layer. The outputs are then concatenated with the meteorological data, and further processed through 14 consecutive Swin Transformer liu2021swin blocks, each employing a 18×18181818\times 1818 × 18 window. The data is subsequently restored to its original dimensions via a 2D transposed convolution layer, also with a kernel size and stride of 8 Zeiler2010 . The perturbation model finally generates a Gaussian distribution (N(Θtp){\textrm{N($\Theta$}^{t}}_{p})N( roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT )), defined by a mean matrix μtsuperscript𝜇𝑡\mu^{t}italic_μ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and a covariance matrix σtsuperscript𝜎𝑡\sigma^{t}italic_σ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT with dimensions 156×721×14401567211440156\times 721\times 1440156 × 721 × 1440. The forecasting model processes the perturbed initial conditions (X~t=Xt+ztsuperscript~X𝑡superscriptX𝑡superscriptz𝑡\tilde{\textrm{{X}}}^{t}=\textrm{{X}}^{t}+\textrm{{z}}^{t}over~ start_ARG X end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = X start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT + z start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT) beginning with a 8×8888\times 88 × 8 2D convolution layer, followed by 40 transformer blocks, and a 8×8888\times 88 × 8 2D transposed convolution layer, producing the final ensemble forecasts X^t+1superscript^X𝑡1\hat{\textrm{{X}}}^{t+1}over^ start_ARG X end_ARG start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT. The ensemble size equals the number of samples drawn from the Gaussian distribution N(Θtp){\textrm{N($\Theta$}^{t}}_{p})N( roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ).

Training FuXi-ENS focuses minimizing the CRPS loss, a key metric for assessing the quality of ensemble forecasts. A regularization term is used to align the model-derived Gaussian distribution (N(Θtp){\textrm{N($\Theta$}^{t}}_{p})N( roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT )) with the target distribution, essential for accurately representing prediction uncertainty. A significant challenge is reconciling discrepancies between these distributions, mainly due to prediction errors. Knowledge distillation is employed in transferring information from target to predicted distributions. Key to this approach is the perturbation model (Q), which converts target data into a Gaussian distribution that supervises the perturbation model P to minimize the Kullback–Leibler (KL) divergence loss (LKLsubscriptLKL\textrm{L}_{\textrm{KL}}L start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT). This loss quantifies discrepancies between the two models’ distributions. During training, perturbation model Q processes target data from two previous time steps: XtsuperscriptX𝑡\textrm{{X}}^{t}X start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and Xt+1superscriptX𝑡1\textrm{{X}}^{t+1}X start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT. It predicts a Gaussian distribution (N(Θtq){\textrm{N($\Theta$}^{t}}_{q})N( roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT )) similar to that of model P. Sampling of intermediate perturbation vectors occurs from model Q’s distribution during training, and from model P’s distribution (N(Θtp){\textrm{N($\Theta$}^{t}}_{p})N( roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT )) during testing. Moreover, the CRPS loss is calculated between the ensemble forecast ( X^t+1superscript^X𝑡1\hat{\textrm{{X}}}^{t+1}over^ start_ARG X end_ARG start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT) and the target data Xt+1superscriptX𝑡1{\textrm{{X}}}^{t+1}X start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT. The overall loss function is defined as:

L=LCRPS(X^it+1,Xit+1)+λLKL(N(Θtp,N(Θtq))\textrm{L}=\textrm{L}_{\textrm{CRPS}}(\hat{\textrm{{X}}}^{t+1}_{i},\textrm{{X}% }^{t+1}_{i})+\lambda\textrm{L}_{\textrm{KL}}({\textrm{N($\Theta$}^{t}}_{p},{% \textrm{N($\Theta$}^{t}}_{q}))L = L start_POSTSUBSCRIPT CRPS end_POSTSUBSCRIPT ( over^ start_ARG X end_ARG start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , X start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + italic_λ L start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( N( roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , N( roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) ) (1)

where λ𝜆\lambdaitalic_λ is a tunable coefficient, set to 1×1041superscript1041\times 10^{-4}1 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT, balancing LKLsubscriptL𝐾𝐿\textrm{L}_{KL}L start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT and LCRPSsubscriptLCRPS\textrm{L}_{\textrm{CRPS}}L start_POSTSUBSCRIPT CRPS end_POSTSUBSCRIPT loss terms. This design ensures that perturbation vectors closely approximate the true data distribution and improve the CRPS of the ensemble forecasts. During training, each GPU generates one ensemble member, resulting in a total of eight members.

In this study, we utilize 48 ensemble members for medium-range ensemble forecasting. As illustrated in Supplementary Figure LABEL:l1_crps_comparison, the FuXi-ENS model incorporating CRPS in the loss function outperforms the same model utilizing L1 loss. The loss function for the FuXi-ENS model, which is trained with a combination of L1 and KL loss, is defined as follows:

L=L1(X^t+1,Xt+1)+λLKL(Pt,Qt)LL1superscript^X𝑡1superscriptX𝑡1𝜆subscriptLKLsuperscriptP𝑡superscriptQ𝑡\textrm{L}=\textrm{L1}(\hat{\textrm{{X}}}^{t+1},\textrm{{X}}^{t+1})+\lambda% \textrm{L}_{\textrm{KL}}(\textrm{P}^{t},\textrm{Q}^{t})L = L1 ( over^ start_ARG X end_ARG start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT , X start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT ) + italic_λ L start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( P start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , Q start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) (2)

where L1 replaces the CRPS loss.

Refer to caption
Figure 6: Schematic diagram of the structures of the FuXi-ENS model.

The FuXi-ENS model is developed using the Pytorch framework (Paszke2017, ). It adopts an autoregressive training regime and a curriculum training schedule lam2022graphcast , wherein the number of the autoregressive steps from 1 to 3. Each of these steps consists of 3,000 training iterations. A batch size of 1 was used per GPU during training, which was conducted on a cluster of 8 Nvidia A100 GPUs. For optimization purposes, the AdamW optimizer kingma2017adam ; loshchilov2017decoupled was employed, configured with parameters β1subscript𝛽1{\beta_{1}}italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT=0.9 and β2subscript𝛽2{\beta_{2}}italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT=0.95, an initial learning rate of 2.5×\times×10-4, and a weight decay coefficient of 0.1.

4.3 Tropical cyclone tracking method

In our study, we implemented a modified version of the ECMWF TC tracking algorithm, as detailed by Zhong et al. fuxi_extreme2023 , applied to both ECMWF ensemble and FuXi-ENS forecasts. As noted by Zhong et al. fuxi_extreme2023 , multiple crucial adjustments to the algorithm aimed at reducing false TC tracks in the ML model predictions were made.

The ECMWF TC tracker begins by using TC observation data to estimate the initial position of the TC. The tracker’s primary function is to identify the cyclone within the initial conditions (analysis) by locating the minimum MSL within a 445 km radius of the observed TC location. We enhanced detection by incorporating local maxima of vorticity at 10 meters to increase the number of TC center candidates. The tracker subsequently monitor these candidates in forecasts until the cyclone becomes undetectable. It updates the TC position by calculating a displacement based on a linear extrapolation of locations from previous two steps, and an advection vector averaged from zonal (U) and meridional (V) wind components across multiple pressure levels, scaled by the forecast time step of 6 hours.

In the first forecast step, where the lead time is 6 hours and only one location estimate is available, linear extrapolation is not possible, necessitating reliance solely on the advection vector. Subsequent estimates involves assessing all local MSL minima within a 445 km radius. The inherent smoothness of machine learning (ML)-based forecasts and the minor differences among adjacent points within ML forecasts can lead to in false local minima, promoting us to decrease the spatial resolution of FuXi-ENS and ECMWF ensemble forecasts from 0.25°to 1.25°by average-pooling. TC Center candidates are then selected from the average-pooled values positioned at grid’s center. The search for the nearest candidate minima that meet three specified conditions (see Supplementary Table LABEL:summary) continues unless the elevation exceeds 1000 meters, indicating no cyclone presence. Upon the selection of center candidates, the data is restored to its original resolution, with final candidate needing to fulfill an additional criterion involving the local minima of Z850. To enhance tracking accuracy and prevent erroneous large displacements, we considered limiting the displacement to no more than 3 times the distance moved from the previous time step. However, we chose not to restrict the angle changes between directions.

4.4 Evaluation method

In this work, we assess the performance of both FuXi-ENS and ECMWF ensemble using their respective initial conditions for the year 2018, focusing on 00 and 12 UTC initialization times. Specifically, we utilize the ensemble control forecast (ENS-fc0) to evaluate ECMWF ensemble, and the ERA5 dataset to evaluate FuXi-ENS. Following the methodology outlined by Rasp et al. rasp2020weatherbench ; rasp2023weatherbench , our evaluation metrics include RMSE, ACC, CRPS Hersbach2000 ; Sloughter2010 , ensemble spread, and spread-skill ratio (SSR), and Brier Score (BS) brier1950verification . Specifically, we analyze the deterministic forecast performance of the ensemble mean through the latitude-weighted RMSE and ACC, calculated as follows:

RMSE(c,τ)=1Dt0D1H×Wi=1Hj=1Wai(X^¯c,i,jt0+τXc,i,jt0+τ)2RMSE𝑐𝜏1delimited-∣∣𝐷subscriptsubscript𝑡0𝐷1𝐻𝑊superscriptsubscript𝑖1𝐻superscriptsubscript𝑗1𝑊subscript𝑎𝑖superscriptsubscriptsuperscript¯^𝑋subscript𝑡0𝜏𝑐𝑖𝑗subscriptsuperscript𝑋subscript𝑡0𝜏𝑐𝑖𝑗2\textrm{RMSE}(c,\tau)=\frac{1}{\mid D\mid}\sum_{t_{0}\in D}\sqrt{\frac{1}{H% \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}a_{i}{(\bar{\hat{X}}^{t_{0}+\tau}_{c,i,j}% -X^{t_{0}+\tau}_{c,i,j})}^{2}}RMSE ( italic_c , italic_τ ) = divide start_ARG 1 end_ARG start_ARG ∣ italic_D ∣ end_ARG ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ italic_D end_POSTSUBSCRIPT square-root start_ARG divide start_ARG 1 end_ARG start_ARG italic_H × italic_W end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( over¯ start_ARG over^ start_ARG italic_X end_ARG end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT - italic_X start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG (3)
ACC(c,τ)=1Dt0Di,jai(X^¯c,i,jt0+τMc,i,jt0+τ)(X¯c,i,jt0+τMc,i,jt0+τ)i,jai(X^¯c,i,jt0+τMc,i,jt0+τ)2i,jai(X¯c,i,jt0+τMc,i,jt0+τ)2ACC𝑐𝜏1delimited-∣∣𝐷subscriptsubscript𝑡0𝐷subscript𝑖𝑗subscript𝑎𝑖subscriptsuperscript¯^𝑋subscript𝑡0𝜏𝑐𝑖𝑗subscriptsuperscript𝑀subscript𝑡0𝜏𝑐𝑖𝑗subscriptsuperscript¯𝑋subscript𝑡0𝜏𝑐𝑖𝑗subscriptsuperscript𝑀subscript𝑡0𝜏𝑐𝑖𝑗subscript𝑖𝑗subscript𝑎𝑖superscriptsubscriptsuperscript¯^𝑋subscript𝑡0𝜏𝑐𝑖𝑗subscriptsuperscript𝑀subscript𝑡0𝜏𝑐𝑖𝑗2subscript𝑖𝑗subscript𝑎𝑖superscriptsubscriptsuperscript¯𝑋subscript𝑡0𝜏𝑐𝑖𝑗subscriptsuperscript𝑀subscript𝑡0𝜏𝑐𝑖𝑗2\textrm{ACC}(c,\tau)=\frac{1}{\mid D\mid}\sum_{t_{0}\in D}\frac{\sum_{i,j}a_{i% }(\bar{\hat{X}}^{t_{0}+\tau}_{c,i,j}-M^{t_{0}+\tau}_{c,i,j})(\bar{X}^{t_{0}+% \tau}_{c,i,j}-M^{t_{0}+\tau}_{c,i,j})}{\sqrt{\sum_{i,j}a_{i}(\bar{\hat{X}}^{t_% {0}+\tau}_{c,i,j}-M^{t_{0}+\tau}_{c,i,j})^{2}\sum_{i,j}a_{i}(\bar{X}^{t_{0}+% \tau}_{c,i,j}-M^{t_{0}+\tau}_{c,i,j})^{2}}}ACC ( italic_c , italic_τ ) = divide start_ARG 1 end_ARG start_ARG ∣ italic_D ∣ end_ARG ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ italic_D end_POSTSUBSCRIPT divide start_ARG ∑ start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( over¯ start_ARG over^ start_ARG italic_X end_ARG end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT - italic_M start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT ) ( over¯ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT - italic_M start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG square-root start_ARG ∑ start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( over¯ start_ARG over^ start_ARG italic_X end_ARG end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT - italic_M start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( over¯ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT - italic_M start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_ARG (4)

where t0subscript𝑡0{t_{0}}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT denotes the forecast initialization time in the testing set D𝐷{D}italic_D, while τ𝜏{\tau}italic_τ represents the forecast lead time steps after t0subscript𝑡0{t_{0}}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. The climatological mean, denoted as M, is derived from ERA5 reanalysis data spanning from 1993 to 2016. To enhance the differentiation of forecast performance among models with minor differences, we use the normalized RMSE difference between model A and baseline B, computed as (RMSEARMSEB)/RMSEBsubscriptRMSE𝐴subscriptRMSE𝐵subscriptRMSE𝐵(\textrm{RMSE}_{A}-\textrm{RMSE}_{B})/\textrm{RMSE}_{B}( RMSE start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT - RMSE start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ) / RMSE start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT. Similarly, the normalized ACC difference is computed as (ACCAACCB)/(1ACCB)subscriptACC𝐴subscriptACC𝐵1subscriptACC𝐵(\textrm{ACC}_{A}-\textrm{ACC}_{B})/(1-\textrm{ACC}_{B})( ACC start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT - ACC start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ) / ( 1 - ACC start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ). Negative values in normalized RMSE difference and positive values in normalized ACC difference suggest superior performance of model A compared to model B.

In addition, we evaluate ensemble forecasts using various metrics, including the globally-averaged, latitude-weighted CRPS, ensemble spread, SSR. The CRPS is determined using the following equation:

CRPS=ai[F(X^c,i,jt0+τ)(Xc,i,jt0+τz)]𝑑zCRPSsubscript𝑎𝑖superscriptsubscriptdelimited-[]𝐹subscriptsuperscript^Xsubscript𝑡0𝜏𝑐𝑖𝑗subscriptsuperscriptXsubscript𝑡0𝜏𝑐𝑖𝑗𝑧differential-d𝑧\textrm{CRPS}=a_{i}\int_{-\infty}^{\infty}[F(\hat{\textbf{X}}^{t_{0}+\tau}_{c,% i,j})-\mathcal{H}(\textbf{X}^{t_{0}+\tau}_{c,i,j}\leq z)]\,dz\ CRPS = italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∫ start_POSTSUBSCRIPT - ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT [ italic_F ( over^ start_ARG X end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT ) - caligraphic_H ( X start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT ≤ italic_z ) ] italic_d italic_z (5)

where F𝐹Fitalic_F denotes the cumulative distribution function (CDF) of the variable (X^c,i,jt0+τsubscriptsuperscript^Xsubscript𝑡0𝜏𝑐𝑖𝑗\hat{\textbf{X}}^{t_{0}+\tau}_{c,i,j}over^ start_ARG X end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT), and \mathcal{H}caligraphic_H is an indicator function that equals 1 when Xc,i,jt0+τzsubscriptsuperscriptXsubscript𝑡0𝜏𝑐𝑖𝑗𝑧\textbf{X}^{t_{0}+\tau}_{c,i,j}\leq zX start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT ≤ italic_z is true, and 0 otherwise Wilks2011 . In deterministic forecasts, the CRPS is equal to the mean absolute error (MAE) Hersbach2000 . The calculation of normalized difference in CRPS follows the same procedure to that of the normalized difference in RMSE.

On the other hand, ensemble forecasts are designed to capture the full range of uncertainty. The SSR is a pivotal metric defined as the ratio of the ensemble spread (standard deviation) to the forecast skill (RMSE). This ratio measures the alignment between the ensemble spread and RMSE of the ensemble mean, thus serving as a essential indicator of the forecast’s reliability. Palmer et al. palmer2006ensemble demonstrated that in a perfect ensemble, the mean spread match the RMSE over the same period, suggesting that the ensemble spread can reliably predict the error in the ensemble mean. The formula for calculating the globally-averaged, latitude-weighted ensemble spread is:

Spread(c,τ)=1Dt0D1H×Wi=1Hj=1Waivar(X^c,i,jt0+τ)Spread𝑐𝜏1delimited-∣∣𝐷subscriptsubscript𝑡0𝐷1𝐻𝑊superscriptsubscript𝑖1𝐻superscriptsubscript𝑗1𝑊subscript𝑎𝑖𝑣𝑎𝑟subscriptsuperscript^Xsubscript𝑡0𝜏𝑐𝑖𝑗\textrm{Spread}(c,\tau)=\frac{1}{\mid D\mid}\sum_{t_{0}\in D}\sqrt{\frac{1}{H% \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}a_{i}var(\hat{\textbf{X}}^{t_{0}+\tau}_{c% ,i,j})}Spread ( italic_c , italic_τ ) = divide start_ARG 1 end_ARG start_ARG ∣ italic_D ∣ end_ARG ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ italic_D end_POSTSUBSCRIPT square-root start_ARG divide start_ARG 1 end_ARG start_ARG italic_H × italic_W end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_v italic_a italic_r ( over^ start_ARG X end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT ) end_ARG (6)

where var(X^c,i,jt0+τ)𝑣𝑎𝑟subscriptsuperscript^Xsubscript𝑡0𝜏𝑐𝑖𝑗var(\hat{\textbf{X}}^{t_{0}+\tau}_{c,i,j})italic_v italic_a italic_r ( over^ start_ARG X end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_τ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c , italic_i , italic_j end_POSTSUBSCRIPT ) denotes the variance within the ensemble for each grid point (i,j)𝑖𝑗(i,j)( italic_i , italic_j ) at lead time τ𝜏\tauitalic_τ from the initial time t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. A SSR value of 1 suggests a reliable ensemble according to Fortin et al. Fortin2014 . SSR values below one indicate that the ensemble is underdispersive, suggesting insufficient variability, whereas values exceeding one indicate overdispersion, pointing to excessive variability.

To evaluate the performance of extreme weather forecasts, we use the globally-averaged, latitude-weighted BS, computed as follows:

BS=ai(pforecasto)2¯BSsubscript𝑎𝑖¯superscriptsubscript𝑝𝑓𝑜𝑟𝑒𝑐𝑎𝑠𝑡𝑜2\textrm{BS}=a_{i}\overline{(p_{forecast}-o)^{2}}BS = italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT over¯ start_ARG ( italic_p start_POSTSUBSCRIPT italic_f italic_o italic_r italic_e italic_c italic_a italic_s italic_t end_POSTSUBSCRIPT - italic_o ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG (7)

where pforecastsubscript𝑝𝑓𝑜𝑟𝑒𝑐𝑎𝑠𝑡p_{forecast}italic_p start_POSTSUBSCRIPT italic_f italic_o italic_r italic_e italic_c italic_a italic_s italic_t end_POSTSUBSCRIPT represents the model’s forecast probability of an event, defined by whether the ensemble members exceed or fall below a specified threshold T𝑇Titalic_T within an N𝑁Nitalic_N-member ensemble. Specifically, pforecast=1Nn[Xn>T]subscript𝑝𝑓𝑜𝑟𝑒𝑐𝑎𝑠𝑡1𝑁subscript𝑛delimited-[]superscript𝑋𝑛𝑇p_{forecast}=\frac{1}{N}\sum_{n}[X^{n}>T]italic_p start_POSTSUBSCRIPT italic_f italic_o italic_r italic_e italic_c italic_a italic_s italic_t end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT [ italic_X start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT > italic_T ] for events above the threshold, and pforecast=1Nn[Xn<T]subscript𝑝𝑓𝑜𝑟𝑒𝑐𝑎𝑠𝑡1𝑁subscript𝑛delimited-[]superscript𝑋𝑛𝑇p_{forecast}=\frac{1}{N}\sum_{n}[X^{n}<T]italic_p start_POSTSUBSCRIPT italic_f italic_o italic_r italic_e italic_c italic_a italic_s italic_t end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT [ italic_X start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT < italic_T ] for events below it. The BS quantifies the mean squared difference between these predicted probabilities (pforecastsubscript𝑝𝑓𝑜𝑟𝑒𝑐𝑎𝑠𝑡p_{forecast}italic_p start_POSTSUBSCRIPT italic_f italic_o italic_r italic_e italic_c italic_a italic_s italic_t end_POSTSUBSCRIPT) and the observed outcomes (o𝑜oitalic_o), which are either 0 or 1. Similar to RMSE, the BS is negatively orientated, with lower values indicating more accurate forecasts, ranging from 0 for perfect forecasts to 1 for consistently incorrect forecasts. In this study, the BS is calculated using a range of climatological percentiles as thresholds to define extreme events: \gt90th, \gt95th, and \gt98th for extreme high events, while \lt10th, \lt5th, and \lt2nd for extreme low events.

To evaluate the accuracy of ensemble TC track forecasts, ensemble spread serves as an effective indicator of forecast uncertainty. Ideally, this spread should align with the position error of the ensemble mean Yamaguchi2009 . Firstly, the accuracy of ensemble mean track predictions is evaluated by calculating the RMSE of distances between the ensemble mean and the best TC track values, denoted as ERRORTCsubscriptERROR𝑇𝐶\textrm{ERROR}_{TC}ERROR start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT. Subsequently, the effectiveness of forecast uncertainty on track predictions, using the ensemble spread, is determined by the RMSE of the distances between individual ensemble members and the ensemble mean whitaker1998relationship , represented as SPRDTCsubscriptSPRD𝑇𝐶\textrm{SPRD}_{TC}SPRD start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT. The equations are presented below:

ERRORTC=p¯forecastposubscriptERROR𝑇𝐶subscript¯𝑝𝑓𝑜𝑟𝑒𝑐𝑎𝑠𝑡subscript𝑝𝑜\textrm{ERROR}_{TC}=\sqrt{\bar{p}_{forecast}-p_{o}}ERROR start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT = square-root start_ARG over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_f italic_o italic_r italic_e italic_c italic_a italic_s italic_t end_POSTSUBSCRIPT - italic_p start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT end_ARG (8)
SpreadTC=1Nn=1N(p¯forecastpforecast(n))subscriptSpread𝑇𝐶1𝑁superscriptsubscript𝑛1𝑁subscript¯𝑝𝑓𝑜𝑟𝑒𝑐𝑎𝑠𝑡subscript𝑝𝑓𝑜𝑟𝑒𝑐𝑎𝑠𝑡𝑛\textrm{Spread}_{TC}=\sqrt{\frac{1}{N}\sum_{n=1}^{N}(\bar{p}_{forecast}-p_{% forecast}(n))}Spread start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT = square-root start_ARG divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_f italic_o italic_r italic_e italic_c italic_a italic_s italic_t end_POSTSUBSCRIPT - italic_p start_POSTSUBSCRIPT italic_f italic_o italic_r italic_e italic_c italic_a italic_s italic_t end_POSTSUBSCRIPT ( italic_n ) ) end_ARG (9)

where N𝑁Nitalic_N is the number of ensemble members, p¯forecastsubscript¯𝑝𝑓𝑜𝑟𝑒𝑐𝑎𝑠𝑡\bar{p}_{forecast}over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_f italic_o italic_r italic_e italic_c italic_a italic_s italic_t end_POSTSUBSCRIPT is the ensemble mean, and pforecast(n)subscript𝑝𝑓𝑜𝑟𝑒𝑐𝑎𝑠𝑡𝑛p_{forecast}(n)italic_p start_POSTSUBSCRIPT italic_f italic_o italic_r italic_e italic_c italic_a italic_s italic_t end_POSTSUBSCRIPT ( italic_n ) is the track forecast of member n𝑛nitalic_n. posubscript𝑝𝑜p_{o}italic_p start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT is the best track value from IBTrACS. The sum of ERRORTCsubscriptERROR𝑇𝐶\textrm{ERROR}_{TC}ERROR start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT and SpreadTCsubscriptSpread𝑇𝐶\textrm{Spread}_{TC}Spread start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT from the forecast initialization time to the prediction time, with time intervals of 6 hours, is defined as the accumulated ensemble mean position error and ensemble spread.

AccERRORTC=tTERRORTC(t)subscriptAccERROR𝑇𝐶subscript𝑡𝑇subscriptERROR𝑇𝐶𝑡\textrm{AccERROR}_{TC}=\sum_{t\in T}\textrm{ERROR}_{TC}(t)AccERROR start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_t ∈ italic_T end_POSTSUBSCRIPT ERROR start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT ( italic_t ) (10)
AccSpreadTC=tTSpreadTC(t)subscriptAccSpread𝑇𝐶subscript𝑡𝑇subscriptSpread𝑇𝐶𝑡\textrm{AccSpread}_{TC}=\sum_{t\in T}\textrm{Spread}_{TC}(t)AccSpread start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_t ∈ italic_T end_POSTSUBSCRIPT Spread start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT ( italic_t ) (11)

where t ranges from 6 to 120 hours. In an ideal EPS, the AccSpreadTCsubscriptAccSpread𝑇𝐶\textrm{AccSpread}_{TC}AccSpread start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT and the AccERRORTCsubscriptAccERROR𝑇𝐶\textrm{AccERROR}_{TC}AccERROR start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT should be equal. If AccSpreadTCsubscriptAccSpread𝑇𝐶\textrm{AccSpread}_{TC}AccSpread start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT exceeds AccERRORTCsubscriptAccERROR𝑇𝐶\textrm{AccERROR}_{TC}AccERROR start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT, it indicates that the ensemble spread surpasses the mean error, suggesting an overestimation of forecast uncertainties. Conversely, if AccSpreadTCsubscriptAccSpread𝑇𝐶\textrm{AccSpread}_{TC}AccSpread start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT is less than AccERRORTCsubscriptAccERROR𝑇𝐶\textrm{AccERROR}_{TC}AccERROR start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT, it signifies that the ensemble spread is insufficient, leading to an underestimation of forecast uncertainties.

Moreover, TC position error can be decomposed into along-track (ATTCsubscriptAT𝑇𝐶\textrm{AT}_{TC}AT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT) and cross-track (CTTCsubscriptCT𝑇𝐶\textrm{CT}_{TC}CT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT) components, each reflecting different aspects of error orientation neumann1981models ; goerss2000tropical ; aberson2001ensemble . The ATTCsubscriptAT𝑇𝐶\textrm{AT}_{TC}AT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT error, often referred to as timing error, quantifies discrepancies in the timing or speed of the TC’s movement. The CTTCsubscriptCT𝑇𝐶\textrm{CT}_{TC}CT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT error, also known as directional error, measures biases in the direction perpendicular to the TC’s trajectory. Consequently, significant ATTCsubscriptAT𝑇𝐶\textrm{AT}_{TC}AT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT errors can affect the timing of TC landfall, while substantial CTTCsubscriptCT𝑇𝐶\textrm{CT}_{TC}CT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT errors may influence the exact location or even occurrence of landfall Nicholas2021 . Specifically, an ATTCsubscriptAT𝑇𝐶\textrm{AT}_{TC}AT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT error is negative (positive) if the forecast TC position is behind (ahead) of the observed TC position along the TC track. In terms of CTTCsubscriptCT𝑇𝐶\textrm{CT}_{TC}CT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT error, a negative (positive) value indicates that the predicted position lies to the left (right) side of the observed track in the Northern Hemisphere, and this orientation reverses in the Southern Hemisphere. Therefore, ATTCsubscriptAT𝑇𝐶\textrm{AT}_{TC}AT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT errors provide insight into whether the forecasted TC motion is slower or faster than observed, while CTTCsubscriptCT𝑇𝐶\textrm{CT}_{TC}CT start_POSTSUBSCRIPT italic_T italic_C end_POSTSUBSCRIPT errors help determine whether the forecast inaccurately predicts the TC’s recurvature timing.

Data Availability Statement

We downloaded a subset of the ERA5 dataset from the official website of Copernicus Climate Data (CDS) at https://cds.climate.copernicus.eu/. The ECMWF ensemble mean and ensemble std are available at https://apps.ecmwf.int/archive-catalogue/?type=em&class=od&stream=enfo&expver=1. We downloaded the ECMWF ensemble data from the public cloud storage bucket for WeatherBench 2 at the link https://console.cloud.google.com/storage/browser/weatherbench2/datasets/ifs_ens.

Code Availability Statement

The source code employed for training and running FuXi-ENS models in this research is accessible within a specific Google Drive folder (https://drive.google.com/drive/folders/1z47CRQdKFZaOjtKQWSNZobC1_RePUVIK?usp=sharing) code2024 .

The xskillscore Python package can be accessed at https://github.com/xarray-contrib/xskillscore/.

The implementation of Perlin noise utilized in this study is sourced from the publicly accessible GitHub repository, available at https://github.com/pvigier/perlin-numpy.

Acknowledgements

This work was supported by the Postdoctoral Fellowship Program of CPSF under Grant Number GZB20240154 and National Key R&D Program of China under Grant 2021YFA0718000, National Natural Science Foundation of China under Grant 42175052. We express our sincere appreciation to the researchers at ECMWF and Google for their invaluable efforts in collecting, archiving, disseminating, and maintaining the ERA5 reanalysis dataset and ECMWF ensemble.

The computations in this research were performed using the CFFF platform of Fudan University.

Competing interests

The authors declare no competing interests.

References

  • \bibcommenthead
  • (1) Lorenz, E.N.: Deterministic Nonperiodic Flow. J. Atmos. Sci. 20(2), 130–148 (1963)
  • (2) National Research Council and Division on Earth and Life Studies and Board on Atmospheric Sciences and Committee on Estimating and Communicating Uncertainty in Weather and Climate Forecasts: Completing the forecast: Characterizing and communicating uncertainty for better decisions using weather and climate forecasts. National Academies Press (2006)
  • (3) Scher, S., Messori, G.: Predicting weather forecast uncertainty with machine learning. Quarterly Journal of the Royal Meteorological Society 144(717), 2830–2841 (2018)
  • (4) Calvo-Olivera, C., Guerrero-Higueras, Á.M., Lorenzana, J., García-Ortega, E.: Real-time evaluation of the uncertainty in weather forecasts through machine learning-based models. Water Resources Management, 1–16 (2024)
  • (5) Palmer, T.N.: The economic value of ensemble forecasts as a tool for risk assessment: From days to decades. Quarterly Journal of the Royal Meteorological Society 128(581), 747–774 (2002)
  • (6) Goodarzi, L., Banihabib, M.E., Roozbahani, A.: A decision-making model for flood warning system based on ensemble forecasts. Journal of Hydrology 573, 207–219 (2019)
  • (7) Sperati, S., Alessandrini, S., Delle Monache, L.: An application of the ecmwf ensemble prediction system for short-term solar power forecasting. Solar Energy 133, 437–450 (2016)
  • (8) Wang, H.-z., Li, G.-q., Wang, G.-b., Peng, J.-c., Jiang, H., Liu, Y.-t.: Deep learning based ensemble approach for probabilistic wind power forecasting. Applied energy 188, 56–70 (2017)
  • (9) Wu, Y.-K., Su, P.-E., Wu, T.-Y., Hong, J.-S., Hassan, M.Y.: Probabilistic wind-power forecasting using weather ensemble models. IEEE Transactions on Industry Applications 54(6), 5609–5620 (2018)
  • (10) Zhang, B., Tang, L., Roemer, M.: Probabilistic planning and risk evaluation based on ensemble weather forecasting. IEEE Transactions on Automation Science and Engineering 15(2), 556–566 (2017)
  • (11) Gneiting, T., Raftery, A.E.: Weather forecasting with ensemble methods. Science 310(5746), 248–249 (2005)
  • (12) Leutbecher, M., Palmer, T.N.: Ensemble forecasting. J. Comput. Phys. 227(7), 3515–3539 (2008)
  • (13) Molteni, F., Buizza, R., Palmer, T.N., Petroliagis, T.: The ecmwf ensemble prediction system: Methodology and validation. Quarterly Journal of the Royal Meteorological Society 122(529), 73–119 (1996)
  • (14) Buizza, R., Palmer, T.N.: The singular-vector structure of the atmospheric global circulation. Journal of Atmospheric Sciences 52(9), 1434–1456 (1995). https://doi.org/10.1175/1520-0469(1995)052<1434:TSVSOT>2.0.CO;2
  • (15) Barkmeijer, J., Buizza, R., Palmer, T.: 3d-var hessian singular vectors and their potential use in the ecmwf ensemble prediction system. Quarterly Journal of the Royal Meteorological Society 125(558), 2333–2351 (1999)
  • (16) Buizza, R., Milleer, M., Palmer, T.N.: Stochastic representation of model uncertainties in the ECMWF ensemble prediction system. Q. J. R. Meteorol. Soc. 125(560), 2887–2908 (1999)
  • (17) Palmer, T.N., Buizza, R., Doblas-Reyes, F., Jung, T., Leutbecher, M., Shutts, G.J., Steinheimer, M., Weisheimer, A.: Stochastic parametrization and model uncertainty (2009)
  • (18) Buizza, R.: Introduction to the special issue on “25 years of ensemble forecasting”. Quarterly Journal of the Royal Meteorological Society 145(S1), 1–11 (2019)
  • (19) Leutbecher, M.: Ensemble size: How suboptimal is less than infinity? Quarterly Journal of the Royal Meteorological Society 145, 107–128 (2019)
  • (20) Lang, S.T., Dawson, A., Diamantakis, M., Dueben, P., Hatfield, S., Leutbecher, M., Palmer, T., Prates, F., Roberts, C.D., Sandu, I., et al.: More accuracy with less precision. Quarterly Journal of the Royal Meteorological Society 147(741), 4358–4370 (2021)
  • (21) Zhou, X., et al.: The development of the ncep global ensemble forecast system version 12. Weather and Forecasting 37(6), 1069–1084 (2022). https://doi.org/10.1175/WAF-D-21-0112.1
  • (22) Pathak, J., et al.: Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators. Preprint at https://arxiv.org/abs/2202.11214 (2022)
  • (23) Bi, K., et al.: Accurate medium-range global weather forecasting with 3d neural networks. Nature (2023)
  • (24) Lam, R., et al.: Learning skillful medium-range global weather forecasting. Science (2023)
  • (25) Chen, L., et al.: Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast. npj Climate and Atmospheric Science, 1–11 (2023)
  • (26) Chen, K., et al.: FengWu: Pushing the Skillful Global Medium-range Weather Forecast beyond 10 Days Lead. Preprint at https://arxiv.org/abs/2304.02948 (2023)
  • (27) Bouallegue, Z.B., et al.: Aifs–ecmwf’s data-driven probabilistic forecasting. Technical report, Copernicus Meetings (2024)
  • (28) de Burgh-Day, C.O., Leeuwenburg, T.: Machine learning for numerical weather and climate modelling: a review. Geoscientific Model Development 16(22), 6433–6477 (2023)
  • (29) Price, I., et al.: GenCast: Diffusion-based ensemble forecasting for medium-range weather. Preprint at https://arxiv.org/abs/2312.15796 (2023)
  • (30) Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning, pp. 2256–2265 (2015). PMLR
  • (31) Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems 35, 26565–26577 (2022)
  • (32) Hersbach, H., et al.: The era5 global reanalysis. Q. J. R. Meteorol. Soc. 146(730), 1999–2049 (2020)
  • (33) Li, L., Carver, R., Lopez-Gomez, I., Sha, F., Anderson, J.: Generative emulation of weather forecast ensembles with diffusion models. Science Advances 10(13), 4489 (2024). https://doi.org/10.1126/sciadv.adk4489
  • (34) Brenowitz, N.D., et al.: A practical probabilistic benchmark for ai weather models (2024). Preprint at https://arxiv.org/pdf/2401.15305
  • (35) Hoffman, R.N., Kalnay, E.: Lagged average forecasting, an alternative to monte carlo forecasting. Tellus A: Dynamic Meteorology and Oceanography 35(2), 100–118 (1983)
  • (36) Doersch, C.: Tutorial on variational autoencoders. Preprint at https://arxiv.org/abs/1606.05908 (2016)
  • (37) Zhao, T., Zhao, R., Eskenazi, M.: Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. Preprint at https://arxiv.org/abs/1703.10960 (2017)
  • (38) Hu, Y., Chen, L., Wang, Z., Li, H.: SwinVRNN: A data-driven ensemble forecasting model via learned distribution perturbation. J. Adv. Model. Earth Syst. 15(2), 2022–003211 (2023)
  • (39) Chen, L., Zhong, X., Wu, J., Chen, D., Xie, S., Chao, Q., Lin, C., Hu, Z., Lu, B., Li, H., Qi, Y.: FuXi-S2S: An accurate machine learning model for global subseasonal forecasts (2023)
  • (40) Mirza, M.M.Q.: Climate change and extreme weather events: can developing countries adapt? Climate policy 3(3), 233–248 (2003)
  • (41) Jahn, M.: Economics of extreme weather events: Terminology and regional impact models. Weather and Climate Extremes 10, 29–39 (2015)
  • (42) Frame, D.J., Rosier, S.M., Noy, I., Harrington, L.J., Carey-Smith, T., Sparrow, S.N., Stone, D.A., Dean, S.M.: Climate change attribution and the economic costs of extreme weather events: a study on damages from extreme rainfall and drought. Climatic Change 162, 781–797 (2020)
  • (43) National Academies of Sciences and Division on Earth and Life Studies and Board on Atmospheric Sciences and Committee on Extreme Weather Events and Climate Change Attribution: Attribution of extreme weather events in the context of climate change. National Academies Press (2016)
  • (44) Newman, R., Noy, I.: The global costs of extreme weather that are attributable to climate change. Nature Communications 14(1), 6103 (2023)
  • (45) Aigner, E., et al.: Summary for Policymakers, pp. 19–33. Springer, Berlin, Heidelberg (2023). https://doi.org/10.1007/978-3-662-66497-1_2
  • (46) Mousavi, M.E., Irish, J.L., Frey, A.E., Olivera, F., Edge, B.L.: Global warming and hurricanes: the potential impact of hurricane intensification and sea level rise on coastal flooding. Climatic Change 104, 575–597 (2011)
  • (47) Lang, S., Leutbecher, M., Jones, S.: Impact of perturbation methods in the ecmwf ensemble prediction system on tropical cyclone forecasts. Quarterly Journal of the Royal Meteorological Society 138(669), 2030–2046 (2012)
  • (48) Hamill, T.M., Whitaker, J.S., Fiorino, M., Benjamin, S.G.: Global ensemble predictions of 2009’s tropical cyclones initialized with an ensemble kalman filter. Monthly Weather Review 139(2), 668–688 (2011)
  • (49) Conroy, A., et al.: Track forecast: Operational capability and new techniques-summary from the tenth international workshop on tropical cyclones (iwtc-10). Tropical Cyclone Research and Review 12(1), 64–80 (2023)
  • (50) Pielke Jr, R.A., Landsea, C., Mayfield, M., Layer, J., Pasch, R.: Hurricanes and global warming. Bulletin of the American Meteorological Society 86(11), 1571–1576 (2005)
  • (51) Kossin, J.P., Olander, T.L., Knapp, K.R.: Trend analysis with a new global record of tropical cyclone intensity. Journal of Climate 26(24), 9960–9976 (2013)
  • (52) Patricola, C.M., Wehner, M.F.: Anthropogenic influences on major tropical cyclone events. Nature 563(7731), 339–346 (2018)
  • (53) Knutson, T., et al.: Tropical cyclones and climate change assessment: Part i: Detection and attribution. Bulletin of the American Meteorological Society 100(10), 1987–2007 (2019). https://doi.org/%****␣main.bbl␣Line␣750␣****10.1175/BAMS-D-18-0189.1
  • (54) Murphy, A.H., Epstein, E.S.: Skill scores and correlation coefficients in model verification. Monthly weather review 117(3), 572–582 (1989)
  • (55) Hamill, T.M., et al.: Noaa’s second-generation global medium-range ensemble reforecast dataset. Bulletin of the American Meteorological Society 94(10), 1553–1565 (2013)
  • (56) Hersbach, H.: Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather Forecast 15(5), 559–570 (2000)
  • (57) Brier, G.W.: Verification of forecasts expressed in terms of probability. Monthly weather review 78(1), 1–3 (1950)
  • (58) Tsai, H.-C., Elsberry, R.L.: Detection of tropical cyclone track changes from the ecmwf ensemble prediction system. Geophysical research letters 40(4), 797–801 (2013)
  • (59) Landsea, C.W., Cangialosi, J.P.: Have we reached the limits of predictability for tropical cyclone track forecasting? Bulletin of the American Meteorological Society 99(11), 2237–2243 (2018). https://doi.org/10.1175/BAMS-D-17-0136.1
  • (60) Zhong, X., et al.: FuXi-Extreme: Improving extreme rainfall and wind forecasts with diffusion model. Preprint at https://arxiv.org/abs/2310.19822 (2023)
  • (61) DeMaria, M., et al.: Improvements to the operational tropical cyclone wind speed probability model. Weather and forecasting 28(3), 586–602 (2013)
  • (62) Zhang, X., Chen, G., Yu, H., Zeng, Z.: Verification of ensemble track forecasts of tropical cyclones during 2014. Tropical Cyclone Research and Review 4(2), 79–87 (2015)
  • (63) Puri, K., Barkmeijer, J., Palmer, T.N.: Ensemble prediction of tropical cyclones using targeted diabatic singular vectors. Quarterly Journal of the Royal Meteorological Society 127(572), 709–731 (2001)
  • (64) World Meteorological Organization: July sees extreme weather with high impacts (2018). https://wmo.int/media/july-sees-extreme-weather-high-impacts
  • (65) Hsu, P.-C., Qian, Y., Liu, Y., Murakami, H., Gao, Y.: Role of abnormally enhanced mjo over the western pacific in the formation and subseasonal predictability of the record-breaking northeast asian heatwave in the summer of 2018. Journal of Climate 33(8), 3333–3349 (2020). https://doi.org/10.1175/JCLI-D-19-0337.1
  • (66) Xu, K., et al.: Large-scale circulation anomalies associated with extreme heat in south korea and southern–central japan. Journal of Climate 32(10), 2747–2759 (2019)
  • (67) Lu, C., Ye, J., Wang, S., Yang, M., Li, Q., He, W., Qin, Y., Cai, J., Mao, J.: An unusual heat wave in north china during midsummer, 2018. Frontiers in Earth Science 8, 238 (2020)
  • (68) Ren, L., Zhou, T., Zhang, W.: Attribution of the record-breaking heat event over northeast asia in summer 2018: the role of circulation. Environmental Research Letters 15(5), 054018 (2020)
  • (69) Bonavita, M., Hólm, E., Isaksen, L., Fisher, M.: The evolution of the ecmwf hybrid data assimilation system. Quarterly Journal of the Royal Meteorological Society 142(694), 287–303 (2016)
  • (70) Houtekamer, P.L., Zhang, F.: Review of the ensemble kalman filter for atmospheric data assimilation. Monthly Weather Review 144(12), 4489–4532 (2016)
  • (71) Knapp, K.R., Kruk, M.C., Levinson, D.H., Diamond, H.J., Neumann, C.J.: The international best track archive for climate stewardship (ibtracs) unifying tropical cyclone data. Bulletin of the American Meteorological Society 91(3), 363–376 (2010)
  • (72) Knapp, K.R., Diamond, H.J., Kossin, J.P., Kruk, M.C., Schreck, C.J., et al.: International best track archive for climate stewardship (ibtracs) project, version 4. NOAA National Centers for Environmental Information 10 (2018)
  • (73) der Grijn, V.: Tropical cyclone forecasting at ecmwf: New products and validation. ECMWF Tech. Memo. 386, 1 (2002)
  • (74) Olivetti, L., Messori, G.: Advances and prospects of deep learning for medium-range extreme weather forecasting. EGUsphere 2023, 1–20 (2023)
  • (75) Kingma, D.P., Welling, M.: An introduction to variational autoencoders. Foundations and Trends® in Machine Learning 12(4), 307–392 (2019)
  • (76) Liu, Z., et al.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
  • (77) Zeiler, M.D., Krishnan, D., Taylor, G.W., Fergus, R.: Deconvolutional networks. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2528–2535 (2010)
  • (78) Paszke, A., et al.: Automatic differentiation in pytorch. In: NIPS 2017 Workshop on Autodiff (2017)
  • (79) Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. Preprint at https://arxiv.org/abs/1412.6980 (2017)
  • (80) Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2017)
  • (81) Rasp, S., et al.: Weatherbench: a benchmark data set for data-driven weather forecasting. J. Adv. Model. Earth Syst. 12(11), 2020–002203 (2020)
  • (82) Rasp, S., et al.: WeatherBench 2: A benchmark for the next generation of data-driven global weather models. Preprint at https://arxiv.org/abs/2308.15560 (2023)
  • (83) Sloughter, J.M., Gneiting, T., Raftery, A.E.: Probabilistic wind speed forecasting using ensembles and bayesian model averaging. J. Am. Stat. Assoc. 105(489), 25–35 (2010)
  • (84) Wilks, D.S.: Statistical Methods in the Atmospheric Sciences vol. 100, 3rd edn. (2011)
  • (85) Palmer, T., et al.: Ensemble prediction: A pedagogical perspective. ECMWF newsletter 106(106), 10–17 (2006)
  • (86) Fortin, V., Abaza, M., Anctil, F., Turcotte, R.: Why should ensemble spread match the rmse of the ensemble mean? J. Hydrometeorol. 15(4), 1708–1713 (2014)
  • (87) Yamaguchi, M., Sakai, R., Kyoda, M., Komori, T., Kadowaki, T.: Typhoon ensemble prediction system developed at the japan meteorological agency. Monthly Weather Review 137(8), 2592–2604 (2009). https://doi.org/10.1175/2009MWR2697.1
  • (88) Whitaker, J.S., Loughe, A.F.: The relationship between ensemble spread and ensemble mean skill. Monthly weather review 126(12), 3292–3302 (1998)
  • (89) Neumann, C.J., Pelissier, J.M.: Models for the prediction of tropical cyclone motion over the north atlantic: An operational evaluation. Monthly Weather Review 109(3), 522–538 (1981)
  • (90) Goerss, J.S.: Tropical cyclone track forecasts using an ensemble of dynamical models. Monthly Weather Review 128(4), 1187–1193 (2000)
  • (91) Aberson, S.D.: The ensemble of tropical cyclone track forecasting models in the north atlantic basin (1976–2000). Bulletin of the American Meteorological Society 82(9), 1895–1904 (2001)
  • (92) Leonardo, N.M., Colle, B.A.: An investigation of large cross-track errors in north atlantic tropical cyclones in the gefs and ecmwf ensembles. Monthly Weather Review 149(2), 395–417 (2021). https://doi.org/10.1175/MWR-D-20-0035.1
  • (93) Zhong, X., et al.: Fuxi-ens (Version 1.0) [Dataset] [Software]. Zenodo. https://zenodo.org/records/12176478 (2024)