¹¹institutetext: David J. Gardner and Carol S. Woodward ²²institutetext: Lawrence Livermore National Laboratory, Center for Applied Scientific Computing, Livermore, CA, USA, ²²email: gardner48, woodward6@llnl.gov ³³institutetext: Lynda L. LoDestro ⁴⁴institutetext: Lawrence Livermore National Laboratory, Fusion Energy Sciences Program, Livermore, CA, USA, ⁴⁴email: lodestro1@llnl.gov

Towards the Use of Anderson Acceleration in Coupled Transport-Gyrokinetic Turbulence Simulations

David J. Gardner\orcidID0000-0002-7993-8282 Lynda L. LoDestro\orcidID0000-0001-6503-4189 Carol S. Woodward\orcidID0000-0002-6502-8659

Abstract

Predicting the behavior of a magnetically confined fusion plasma over long time periods requires methods that can bridge the difference between transport and turbulent time scales. The nonlinear transport solver, Tango, enables simulations of very long times, in particular to steady state, by advancing each process independently with different time step sizes and couples them through a relaxed iteration scheme. We examine the use of Anderson Acceleration (AA) to reduce the total number of coupling iterations required by interfacing Tango with the AA implementation, including several extensions to AA, provided by the KINSOL nonlinear solver package in SUNDIALS. The ability to easily enable and adjust algorithmic options through KINSOL allows for rapid experimentation to evaluate different approaches with minimal effort. Additionally, we leverage the GPTune library to automate the optimization of algorithmic parameters within KINSOL. We show that AA can enable faster convergence in stiff and very stiff tests cases without noise present and in all cases, including with noisy fluxes, increases robustness and reduces sensitivity to the choice of relaxation strength.

1 Introduction

Accurately simulating turbulent transport is a critical aspect of modeling magnetically confined fusion plasmas. Turbulent fluxes have a strong nonlinear dependence on the gradients of averaged fields and are the primary mechanism for transporting energy from the plasma interior. A major challenge in simulations of very long times, in particular steady state, is the significant difference in characteristic time scales for transport (seconds) and turbulence (microseconds) in magnetic-fusion-energy burning plasmas post2004report . This challenge is further complicated by the presence of noise in the averaged transport fluxes arising from turbulent simulations.

Evolving both processes at the same time step is prohibitively expensive; however, as the dynamics on each time scale are well separated, each process can be advanced independently at its own rate and coupled through an iteration scheme to bridge the scale gap. Turbulence model evaluations are the dominant computational expense, and the speed at which the coupling iteration converges determines the number of turbulence model evaluations. In this work, we aim to reduce the total number of coupling iterations necessary and thus lower the overall simulation cost by accelerating the convergence of the iterative method. To this end, we examine the use of Anderson Acceleration (AA) and extensions to the method through the KINSOL nonlinear solver package from the SUNDIALS library.

The remainder of this paper is organized as follows. In Section 2, we overview turbulent transport simulations for magnetically confined fusion plasmas and the coupling method applied. We introduce the AA algorithm and detail the various extensions we consider to improve performance in Section 3. In Section 4 we overview the Tango code for coupled simulations and its interfacing with the AA implementation from KINSOL as well as the use of GPTune to optimize parameter selection. In Section 5 we discuss results from numerical simulations with Tango in both stiff and very stiff settings along with the impact of noisy fluxes on performance. Finally, in Section 6, we provide concluding remarks and directions for future investigation.

2 Coupling Transport & Turbulence

Simulating turbulence-driven transport in toroidal magnetic-fusion-energy plasmas with near-first-principles accuracy requires solving a set of gyrokinetic equations describing the distribution of different species in a 5D phase space. 5D turbulence due to fast time-scale instabilities drives transport of averaged fields (i.e., density, momentum, and energy, which are constant on magnetic flux surfaces and thus spatially 1D in the magnetic flux coordinate) across the magnetic surfaces confining the plasma. Because the transport time scales are much slower than the turbulence time scales, a formal separation-of-scales approach is possible, permitting a multiscale computational solution which proceeds by coupling separate equation sets for the fast and slow scales shestakov2003self . For this work, we utilize a simplified model for a single species to evaluate algorithms for turbulent transport without the expense of 5D simulation or the complexities of the toroidal geometry. Consider the transport equation,

\frac{\partial p}{\partial t}+\frac{\partial q}{\partial x}=S,

(1)

where $p(t,x)$ is the plasma profile, $q(p(t,x))$ is the average turbulent flux, and $S(t,x)$ includes all local sources. Because the turbulence-driven transport is stiff, an implicit time-advance is required. Discretizing (1) in space with finite differences and in time with backward Euler, we obtain the implicit system,

p^{(n+1)}+H\mathcal{D}_{x}q^{(n+1)}=p^{(n)}+HS^{(n+1)},

(2)

for a time step from $t^{(n)}$ to $t^{(n+1)}$ with step size $H$ where superscripts denote the time at which a quantity is approximated, and $\mathcal{D}_{x}$ is the difference operator for the finite difference method. Using the LoDestro method shestakov2003self we solve (2) with the iteration,

p_{k+1}-H\mathcal{D}_{x}(D_{k}\mathcal{D}_{x}p_{k+1}-c_{k}p_{k+1})=p^{(n)}+HS^% {(n+1)},

(3)

where $p_{k+1}$ is the new iterate for $p^{(n+1)}$ and $q^{(n+1)}$ is split into diffusive and convective terms, with coefficients $D_{k}=-\theta_{k}q(p_{k})/\mathcal{D}_{x}p_{k}$ and $c_{k}=(1-\theta_{k})q(p_{k})/p_{k}$ respectively. The parameter $\theta_{k}$ controls the diffusive-convective split at each mesh point. The purpose of the split is to keep $D$ positive (assuring a parabolic linear equation (2)) and within bounds. Without some such scheme, profile maxima in the course of iterating would cause violent spatial swings in $D$ , and converged solutions with the flux, $q$ , running uphill (anti-diffusive) would be unobtainable. The value of $\theta$ may be chosen in different ways shestakov2003self ; parker2018bringing . Here we use the default strategy in Tango,

\theta_{k}=\begin{cases}0&\text{where}\quad\hat{D}_{k}<D_{\text{min}}\\ \frac{1}{2}\left(1+\frac{D_{\text{max}}-\hat{D}_{k}}{D_{\text{max}}-D_{\text{% min}}}\right)&\text{where}\quad D_{\text{min}}\leq\hat{D}_{k}\leq D_{\text{max% }}\;\text{and}\;|\mathcal{D}_{x}p_{k}|<10\\ \frac{1}{2}&\text{where}\quad\hat{D}_{k}>D_{\text{max}}\;\text{and}\;|\mathcal% {D}_{x}p_{k}|<10\end{cases},

(4)

with $\hat{D}_{k}=-q(p_{k})/\mathcal{D}_{x}p_{k}$ , $D_{\text{min}}=10^{-5}$ , and $D_{\text{max}}=10^{13}$ . Observe that if the iteration converges, the fluxes exactly sum to $q$ . Obtaining the new profile in (3) requires solving the linear system,

M_{k}p_{k+1}=b,

(5)

within each iteration, where $M_{k}=I-H\mathcal{D}_{x}(D_{k}\mathcal{D}_{x}-c_{k})$ and $b=p^{(n)}+HS^{(n+1)}$ . In practice, relaxation of the iteration is required to ensure convergence; i.e., $p_{k+1}=\beta p_{k+1}+(1-\beta)p_{k-1}$ . The overall method is outlined in Algorithm 1.

Algorithm 1 LoDestro Method

Input: $p_{0}$ , $\beta$ , and tol Output: $p^{*}$

1:for

k=0,1,\ldots,k_{\rm{max}}

2: Compute the flux

q_{k}(p_{k})

and split it into

D_{k}

and

c_{k}

3: Construct

M_{k}

and solve

M_{k}p_{k+1}=b

4: Relax the profile

p_{k+1}=\beta p_{k+1}+(1-\beta)p_{k}

5: if

\|p_{k+1}-p_{k}\|<\text{tol}

return

p_{k+1}

p^{*}

6:end for

Refer to caption — Figure 1: A diagram illustrating the coupling between the transport and turbulence models in Algorithm 1. The transport code (1D) uses a large implicit time step that requires iterating to convergence. During this iteration, the turbulence model (5D for physics runs; 1D in this paper) advances with a small explicit time step. The two models periodically exchange fluxes and profiles until the iteration has converged.

As illustrated in Fig. 1, the transport iteration and the turbulent simulation are run concurrently and periodically exchange information with the transport code providing profiles to the turbulent code and receiving fluxes. Concurrent running minimizes the number of time steps taken by the 5D code per iteration. The turbulent fluxes are obtained from a 5D global gyrokinetic code running with an explicit step size $h\ll H$ . Evolving the turbulence model requires far more computational resources and expense than solving the linear system (5) in the transport iteration. As such, minimizing the number of iterations for convergence is critical to overall performance.

3 Anderson Acceleration

The LoDestro method can be viewed as a damped fixed-point iteration given by

p_{k+1}=\beta G(p_{k})+(1-\beta)p_{k},

(6)

where $G(p)$ comprises steps 2 and 3 in Algorithm 1. This iteration may converge quite slowly ( $\beta\ll 1$ may be required for stability of the iterations), and, thus, we are interested in methods for accelerating convergence. One such method is Anderson Acceleration (AA) anderson1965iterative ; kelley2018numerical which computes an iteration update that approximately minimizes the linearized fixed-point residual over a subspace constructed from prior iterations. In a linear setting and under certain conditions, AA is essentially equivalent to the generalized minimal residual (GMRES) method walker2011anderson , and, with a sufficiently good initial iterate, AA performs no worse than a fixed-point iteration toth2015convergence .

The AA algorithm with fixed damping ( $\beta$ ) and subspace size, or depth ( $m$ ), is given in Algorithm 2. Following walker2011anderson , the optimization problem in step 6 for the weights, $\gamma_{k}\in\mathbb{R}^{m_{k}}$ , to compute the new iterate is presented as an unconstrained linear least-squares problem. Utilizing this formulation, rather than the constrained form of the problem, allows for an efficient solution via a QR factorization of $\mathcal{F}_{k}$ i.e., $R_{k}\gamma_{k}=Q^{T}_{k}f_{k}$ where $Q_{k}\in\mathbb{R}^{N\times m_{k}}$ is an orthogonal matrix, $R_{k}\in\mathbb{R}^{m_{k}\times N}$ is an upper triangular matrix, and $f_{k}\in\mathbb{R}^{N}$ is the nonlinear residual, $G(p_{k})-p_{k}$ . Updating the factorization and solving the linear system as the residual history changes is typically a minimal cost compared to the function evaluation loffeld2016considerations ; lockhart2022performance and thus not a major contributor to the overall iteration run time.

Algorithm 2 Anderson Acceleration

Input: $p_{0}$ , $m_{\rm{max}}$ , $k_{\rm{max}}$ , tol, and $\beta$ Output: $p^{*}$

1:Set

p_{1}=\beta G(p_{0})+(1-\beta)p_{0}

2:for

k=1,\ldots,k_{\rm{max}}

3: Set

m_{k}=\min\{k,m_{\rm{max}}\}

4: Set

\mathcal{G}_{k}=[\Delta G_{k-m_{k}},\ldots,\Delta G_{k-1}]

where

\Delta G_{i}=G(p_{i+1})-G(p_{i})

5: Set

\mathcal{F}_{k}=[\Delta f_{k-m_{k}},\ldots,\Delta f_{k-1}]

where

\Delta f_{i}=f_{i+1}-f_{i}

and

f_{i}=G(p_{i})-p_{i}

6: Determine

\gamma_{k}

that minimizes

\|f_{k}-\mathcal{F}_{k}\gamma_{k}\|_{2}

7: Set

p_{k+1}=G(p_{k})-\mathcal{G}_{k}\gamma_{k}-(1-\beta)(f_{k}-Q_{k}R_{k}\gamma_{k})

8: if

\|p_{k+1}-p_{i}\|<\text{tol}

return

p_{k+1}

p^{*}

9:end for

In this work, we consider several extensions to AA, including delaying the start of acceleration as well as non-stationary variants with variable damping and/or depth (subspace size), leading to Algorithm 3. When the initial iterate is far from the fixed-point, delaying acceleration can be advantageous in order to obtain a better initial iterate before enabling AA. Different approaches to adaptive damping in evans2020proof and chen2022non demonstrated significant improvements in convergence when compared to fixed damping values. The method in chen2022non computes an “optimal” damping value by solving a minimization problem at the cost of two additional function evaluations. To avoid extra evaluations, we use the heuristic approach from evans2020proof with the damping value

\beta_{k}=0.9-\omega_{\beta}\Gamma_{k}\quad\mathrm{with}\quad\Gamma_{k}=\sqrt{% 1-(\|Q^{T}f_{k}\|/\|f_{k}\|)^{2}},

(7)

where $\Gamma_{k}$ is the AA convergence gain in the $k$ -th iteration, $Q$ is readily available from the QR solve in the AA optimization step, and $\omega_{\beta}$ is a tuning parameter. In chen2022composite the adaptive damping algorithm from chen2022non is combined with a composite version of AA as a means of adapting the depth and further improves the convergence rate. A different adaptivity approach is taken in pollock2023filtering where a filtering method is applied to remove columns from the history matrix, $\mathcal{F}_{k}$ , in order to control the condition number of the least-squares problem. This method also leads to improved convergence. In this work, we utilize a strategy similar to that employed in pollock2021anderson where the depth is determined by the magnitude of the current residual with

m_{k}=\min\{\tilde{m}_{k},m_{k-1}+1,m_{max}\},

(8)

where $m_{0}=0$ , $\tilde{m}_{k}=\max\{0,\lfloor-\log_{10}(\omega_{m}\|f_{k}\|)\rfloor\}$ , and $\omega_{m}$ is a turning parameter.

Algorithm 3 Anderson Acceleration with delay and variable damping and depth

Input: $p_{0}$ , $m_{\rm{max}}$ , $k_{\rm{max}}$ , tol, $d$ , and $\beta$ Output: $p^{*}$

1:Set

p_{1}=\beta G(p_{0})+(1-\beta)p_{0}

2:for

k=1,\ldots,k_{\rm{max}}

3: if

k\leq d

then

4: Set

p_{k+1}=\beta G(p_{k})+(1-\beta)p_{k}

5: else

6: Select

m_{k}

using (8)

7: Set

\mathcal{G}_{k}=[\Delta G_{k-m_{k}},\ldots,\Delta G_{k-1}]

where

\Delta G_{i}=G(p_{i+1})-G(p_{i})

8: Set

\mathcal{F}_{k}=[\Delta f_{k-m_{k}},\ldots,\Delta f_{k-1}]

where

\Delta f_{i}=f_{i+1}-f_{i}

and

f_{i}=G(p_{i})-p_{i}

9: Determine

\gamma_{k}

that minimizes

\|f_{k}-\mathcal{F}_{k}\gamma_{k}\|_{2}

10: Select

\beta_{k}

using (7)

11: Set

p_{k+1}=G(p_{k})-\mathcal{G}_{k}\gamma_{k}-(1-\beta_{k})(f_{k}-Q_{k}R_{k}% \gamma_{k})

12: end if

13: if

\|p_{k+1}-p_{i}\|<\text{tol}

return

p_{k+1}

p^{*}

14:end for

4 Implementation

To evaluate whether the application of AA improves the convergence rate of the LoDestro method, we utilize the Tango 1D transport solver parker2018bringing interfaced with the KINSOL nonlinear system solver from the SUNDIALS library hindmarsh2005sundials ; gardner2022enabling . Tango is an open-source Python code that both solves the 1D transport equation (3) by way of the linear system (5), and implements the coupling Algorithm 1. The turbulent fluxes can be computed either by coupling with the global gyrokinetic code, GENE goerler2011global ; genecode , to simulate unstable drift-wave (such as the ion-temperature-gradient mode)-driven turbulence or by using analytic models for rapid algorithmic development. For this work we utilize the latter option.

KINSOL provides a number of algorithms for solving nonlinear algebraic systems including an Anderson-accelerated fixed-point iteration. AA in KINSOL leverages several implementation optimizations for greater performance such as efficient QR factorization updates and solves for computing the acceleration weights fang2009two ; kresse1996efficiency ; walker2011anderson or the use of low synchronization orthogonalization methods within the QR solve to minimize the number of global reductions lockhart2022performance . For this work, we have extended KINSOL’s existing support for delay and constant damping with a fixed depth to allow for adapting the depth and/or damping. These new capabilities will be included in an upcoming SUNDIALS release. Through the interface between Tango and KINSOL, we can readily access an efficient implementation of AA and easily experiment with different variants of AA with minimal modification to existing Tango code.

The interfacing between Tango and KINSOL primarily consists of creating a class within Tango with a method to evaluate $G(p)$ using Tango’s native functions for computing and splitting fluxes and setting up and solving the linear system. Additional class methods call KINSOL functions to configure the sovler based on command line options and solve the nonlinear system. As SUNDIALS packages are written in C, we leverage SWIG beazley1996swig to generate Python interfaces that align with the C API while allowing Tango to access data stored in SUNDIALS vectors as NumPy arrays without having to copy data between KINSOL and Tango data structures.

Algorithm 3 includes a number of parameters that can have a significant impact on performance. The optimal combination of settings is unclear a priori and depends on the application and problem considered. To automate parameter selection we utilize the GPTune package liu2021gptune , a Python-based auto-tuning framework using Bayesian optimization methodologies with support for multi-task learning, multi-objective tuning, sensitivity analysis, and many other features. For this work we use the single-task learning autotuning (SLA) capability in GPTune for a single-objective function. That is, we consider one configuration of the problem (1) when optimizing the algorithm parameters to minimize the number of iterations to reach a given solution tolerance. The problem class created for interfacing with KINSOL also streamlines using GPUTune, as the objective function only needs to construct the Tango problem class given a dictionary of input options, call the solve method, and return the number of solver iterations. Additionally, the driver program defines the input, output, and parameter spaces, as well as any constraints, to create the tuning problem. Once the problem is defined, the driver creates the GPTune object and runs the SLA method which calls Tango as needed through the objective function.

5 Numerical Results

In this section we present results applying AA to the LoDestro method for a nonlinear diffusion problem from shestakov2003self that mimics the challenges observed in turbulence-driven transport simulations. The problem is given by (1) on the interval $x\in[0,1]$ with

q=-D\frac{\partial p}{\partial x}+\epsilon\quad\text{where}\quad D=\left(\frac% {1}{p}\frac{\partial p}{\partial x}\right)^{r},

(9)

and $\epsilon$ is spatially correlated, temporally white Gaussian noise that mimics the noise present in GENE simulations parker2018investigation . Below, we first consider cases without noise. The system is stiff with $r=2$ , and the stiffness increases with the value of $r$ . The use of an analytic flux model allows for rapid testing and corresponds to running the turbulence code to saturation each coupling exchange. The source term in (1) is defined as

S(x)=\begin{cases}1&\text{for}\quad 0\leq x\leq 0.1\\ 0&\text{for}\quad 0.1<x\leq 1\end{cases},

(10)

and the boundary conditions are $p^{\prime}(0)=0$ and $p(1)=0.01$ . For the LoDestro method, the relaxation strength depends on the degree of stiffness, and the default value is $\beta=0.6/r$ . AA setups with zero depth ( $m=0$ ) and constant damping correspond to applying the unaccelerated LoDestro method as implemented in Tango and serves as the baseline comparison point.

5.1 A Stiff Test Case

We first consider the stiff case ( $r=2$ ) and the impact of different AA configurations on the convergence of the iteration.

Fixed Damping, Depth, and Delay

To begin, we consider AA with static damping, $\beta$ , and depth, $m$ , sweeping over the parameter space to better understand how each impacts performance. Fig. 2 shows a heat map for the number of iterations to reach a residual of $10^{-11}$ for different $\beta$ and $m$ values with and without a delay, $d$ , before starting AA. From the first row ( $m=0$ , no acceleration), we see the default, $\beta=0.3$ , is near optimal in this case, but $\beta=0.4$ can reduce the number of iterations by $16\%$ . However, increasing $\beta$ too much leads to unphysical values or divergence (both denoted by empty cells).

Without delay, adding acceleration, $m=1$ , gives some robustness in the choice of $\beta$ , enabling convergence for larger values than possible without AA. However, the convergence rate is improved only for $\beta$ far from the optimal value. Increasing the depth, $m>1$ , yields more consistent gains with AA. Comparing the best results, AA takes approximately $14\%$ fewer iterations. For a non-optimal $\beta$ , larger reductions in iteration count are realized due to the greater robustness and reduced sensitivity to the choice of $\beta$ . Improved robustness in $\beta$ with AA has also been observed in other multiphysics coupling applications hamilton2016assessment .

As the first iterate diverges from final solution before steadily approaching the correct value at each subsequent iteration, introducing a small delay before starting AA significantly strengthens the performance in all aspects. Generally, the results continue to improve with increased depth but with diminishing returns, especially near the optimal $\beta$ . Delaying AA further can improve the results slightly but also reduces robustness with larger $\beta$ and may increase iteration counts with smaller $\beta$ . The size of the delay necessary will depend on the quality of the initial iterate, and the best depth value is often problem dependent.

Optimized Fixed Damping, Depth, and Delay

In the prior section we manually explored the parameter space to understand the impact of different options on algorithmic performance. In this section we investigate optimizing these selections with GPTune and examine the convergence history. We consider three sets of parameters to optimize: $\beta$ with $m=0$ ; $\beta$ and $m$ ; and $\beta$ , $m$ , and $d$ . Each set is optimized for three residual tolerances ( $10^{-6}$ , $10^{-8}$ , and $10^{-11}$ ). We set the initial number of GPTune parameter-space samples to 10 and the total number of samples to 50. The convergence histories for the best parameter values from each set are shown in Fig. 3.

[width=7.5cm]optimize_fixed_params_p2.pdf

Figure 3: Residual history for autotuned configurations. The default (faded circles) and one setup from Fig. 2(b) (faded stars) are included for reference. The autotuned results align well with the parameter sweep in Fig. 2.

In the first few iterations there is little difference between the various options with smaller $\beta$ values performing slightly better. At more moderate residuals, optimizing $\beta$ alone is one or two iterations better than the AA results. However, for values below $10^{-7}$ the AA setups give the best results with the optimized parameter set edging out the best result from Fig. 2(b) by one or two iterations. The benefit of AA increases for values less than $10^{-9}$ where the convergence of the unaccelerated version slows. This result indicates that AA delivers accelerated convergence to a tight tolerance. Overall the optimized parameter sets from GPTune closely align with the best setups found with the manual sweep.

Adaptive Damping and Depth

In the previous section we saw that the best parameter configurations depend on the desired residual norm. Performance also depends significantly on whether the initial iterate is in the asymptotic convergence regime of the iterative scheme. In this section we evaluate AA with adaptive damping and/or depth to find setups that perform well across residual regimes. We again consider three sets of parameters to optimize: $\beta$ with adaptive $m$ ; $m$ and $d$ with adaptive $\beta$ , and adapting $\beta$ and $m$ together. We only optimize $d$ in the fixed- $m$ case as the strategy for adapting $m$ automatically finds an optimum $d$ by initially selecting $m=0$ . When adapting $\beta$ and/or $m$ , GPTune optimizes the factors $\omega_{\beta}$ and $\omega_{m}$ in (7) and (8), respectively. As before, each set is optimized for the same three residual tolerances and we use the same numbers of initial and total parameter-space samples. The convergence history for the best parameter values from each set are shown in Fig. 4, and history of $\beta$ or $m$ from the adaptive cases are given in Fig. 5.

[width=7.5cm]optimize_adaptive_params_p2.pdf

Figure 4: Residual history for autotuned configurations with adaptive

\beta

and/or

m

. The adaptive results closely follow the autotuned fixed parameter results in Fig. 3 (faded triangles and squares included for reference).

Optimizing a fixed $\beta$ and an adaptive $m$ gives similar results to optimizing fixed $\beta$ , $m$ , and $d$ . Fig. 5(a) shows that adaptive $m$ delays AA six iterations and has a max $m$ of three. Given the low sensitivity observed in Fig. 2(b), it is not surprising that, although the $m$ and $d$ are slightly different, the convergence history only differs by one iteration. Optimizing an adaptive $\beta$ with fixed $m$ and $d$ has improved convergence at larger residuals, and, after five iterations, the setup is within two iterations of the best results. Fig. 5(b) shows that the performance gain at larger residuals is due to using a small $\beta$ value initially before immediately growing to almost the maximum $\beta$ once AA starts. This behavior occurs because an optimum fixed $\beta$ is computed by GPTune in the delay region before AA starts, as $\Gamma_{k}$ is not available to estimate $\beta$ . Once AA starts, the iteration switches to adaptive values and keeps $\beta$ close to the max allowed.

Optimizing both $\beta$ and $m$ , is the least performant of the three configurations but, due to the low sensitivity to $\beta$ and $m$ , the results are still relatively close (within five iterations) of the other cases. Again adapting $m$ automatically delays AA (five iterations) before settling on a max depth. Unlike when adapting $\beta$ alone, there are large oscillations in the $\beta$ after the initial delay. This variation in $\beta$ appears to be the cause of lower overall performance and suggests that considering the gain history to reduce large, frequent changes in $\beta$ may improve results.

Overall, in this stiff test case there are modest gains in numbers of iterations with AA compared to optimized damping values at large to moderate residual values. For residuals below $10^{-8}$ AA has a larger impact as convergence of the iteration without AA slows. However, AA greatly increases robustness and decreases sensitivity to the choice of $\beta$ with a small delay in AA increasing these effects.

5.2 A Very Stiff Test Case

We now consider a very stiff case ( $r=10$ ) relevant to magnetic-fusion-energy problems and examine the impact of AA on convergence with optimized fixed and adaptive parameter configurations as in the prior two sections. Fig. 6 shows the convergence history for the default iteration ( $\beta=0.06$ ), an optimized damping value ( $\beta=0.14$ ), and three of the best performing AA configurations.

[width=7.5cm]optimize_p10.pdf

Figure 6: Residual history for autotuned setups with the default iteration (

\beta=0.06

), optimized damping (

\beta=0.14

), and various AA configurations. Markers are placed every 5 iterations. For a wide range of residuals, AA leads to reduced numbers of iterations, especially for smaller residual values.

As in the stiff case, optimizing $\beta$ gives better results than the default $\beta$ and, in this case, is generally the best option for larger residuals. Enabling AA with $\beta=0.01$ and $m=3$ consistently outperforms optimizing $\beta$ alone for residuals below $10^{-4}$ . Again, delaying AA generally leads to better results overall and, given the larger initial residual and stronger damping applied in this case, the best AA configurations tend to have a longer delay. For residuals below $10^{-7}$ , configurations with fixed $\beta$ , $m$ , and $d$ or fixed $\beta$ and adaptive $m$ are the most efficient with results that differ by less than six iterations. With an adaptive $m$ , AA is automatically delayed by twelve iterations, the same as in the fixed-delay case, but allows more depth going up to $m=7$ . Unlike in the stiff test, the depth does not immediately ramp up after the delay but instead keeps $m$ constant for 2 to 7 iterations before increasing $m$ . The stair-step behavior of $m$ reflects the overall slower convergence in this case, which requires additional iterations before a residual reduction triggers an increase in $m$ using the heuristic (8). Adapting $\beta$ can also improve results but, like in the stiff case, does not generally perform as well as other setups due to large oscillations in $\beta$ . Overall the results with this very stiff test are improved when comparing against the stiff case, with AA producing more significant gains over a larger range of residuals.

5.3 Adding Noise in a Stiff Test Case

Finally, we revisit the stiff test ( $r=2$ ) when accounting for fluctuations in $q$ . The 5D (3 physical- plus 2 velocity-space dimensions) turbulence code indicated in Fig. 1 reaches a saturated state with short spatial- and temporal-scale fluctuations. (It is the electric fields involved in these fluctuations that kick particles across the magnetic surfaces, i.e., cause the transport.) Even when a heat or particle flux is integrated over four dimensions, the 1D flux, $q$ , still has short-scale structure. We include the effects of these fluctuations on our coupling problem with the “noise” term, $\epsilon$ , in Eq. (9). We use the same spatially correlated, temporally white Gaussian noise model studied in parker2018investigation (see Figs. 4a and 5b in parker2018investigation for a comparison of the model with $q$ from GENE).

Fig. 7(a) shows the residual history for the default setup ( $\beta=0.3$ ) and a more optimal configuration ( $\beta=0.4$ ), with and without AA, along with a heat map for other combinations of $\beta$ and $m$ . Unlike in the cases without noise, the performance with AA is similar to that without AA when using the best $\beta$ value. This lack of improvement is due to the higher, noise-dependent residual floor at which the iteration stagnates as noted in parker2018investigation for the LoDestro method and similarly observed in toth2017local when applying AA to problems with inaccurate function evaluations. While the residual floor limits the benefits in lowering iteration counts, AA still improves overall robustness and reduces sensitivity to $\beta$ , which leads to faster convergence for non-optimal $\beta$ values. This reduced sensitivity is significant as it shows that with AA, less manual tuning of $\beta$ is necessary for an efficient solution approach.

6 Conclusions

In this work we used the KINSOL nonlinear solver library in SUNDIALS to evaluate the potential of several variations of Anderson acceleration to improve the convergence of the LoDestro method for coupled turbulence and transport simulations of a magnetically confined fusion plasma. Combining the flexibility of KINSOL with the GPTune autotuning library enabled rapid evaluation of the parameter space to compare the impact of different options on algorithmic performance.

For all test cases, AA allows for convergence with larger relaxation parameter values and decreases sensitivity to the choice of the relaxation parameter. Comparing to the optimal damping value, AA can provide a small decrease in the number of iterations for moderate residual values with the benefit increasing for smaller target residuals leading to a reduction of 20% or more in the number of iterations. Moreover, these benefits improve as the problem difficulty increases, going from stiff to very stiff tests; and for non-optimal damping values, the improvements are larger. Utilizing adaptive depth with AA provides a means for automatically selecting the delay and maximum allowed depth with results generally matching the best setups with optimized fixed parameter values. We note that use of GPTune removes the need for time-consuming hand optimizations of the parameters. When noisy fluxes are considered, AA does not provide an improvement in iterations with optimal damping due to the higher residual floor parker2018investigation . However, AA still increases robustness and reduces sensitivity to the damping parameter, which lowers iteration counts for non-optimal damping.

In future efforts we plan to assess the performance of AA with a modified analytic flux model that reflects the time dependence of the turbulence calculation as it is coupled to the transport code (see Fig. 1). We will also apply recently developed approaches for analyzing AA convergence de2022linear to better understand how the presence of noise and the addition of time dependence in the coupling impact convergence of the algorithm. Additionally, we will conduct studies to quantify parameter sensitivity in AA using GPUTune. Finally, the interfaces to KINSOL will allow for testing new variants of AA as they are integrated into the library, such as alternating and composite AA chen2022composite , without modifying Tango.

Acknowledgements.

The authors wish to acknowledge the initial explorations applying Anderson acceleration in Tango by Jeff Parker, Jack G. Paulson, and H. Hunter Schwartz and thank them for their help in using Tango and interfacing with KINSOL. We would also like to thank Cody J. Balos for his work in creating Python interfaces to KINSOL. Support for this work was provided in part by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) Program through the Frameworks, Algorithms, and Scalable Technologies for Mathematics (FASTMath) Institute, under Lawrence Livermore National Laboratory subcontract B626484 and DOE award DE-SC0021354. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research and Office of Fusion Energy Sciences, Scientific Discovery through Advanced Computing (SciDAC) program. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-JRNL-864392.

\ethics

Competing Interests The authors have no conflicts of interest to declare that are relevant to the content of this chapter.

References

(1) Anderson, D.G.: Iterative procedures for nonlinear integral equations. Journal of the ACM (JACM) 12(4), 547–560 (1965). DOI 10.1145/321296.321305
(2) Beazley, D.M., et al.: SWIG: An easy to use tool for integrating scripting languages with C and C++. In: Tcl/Tk Workshop, vol. 43, p. 74 (1996). DOI 10.5555/1267498.1267513
(3) Chen, K., Vuik, C.: Composite Anderson acceleration method with two window sizes and optimized damping. International Journal for Numerical Methods in Engineering 123(23), 5964–5985 (2022)
(4) Chen, K., Vuik, C.: Non-stationary Anderson acceleration with optimized damping. arXiv preprint arXiv:2202.05295 (2022)
(5) De Sterck, H., He, Y.: Linear asymptotic convergence of Anderson acceleration: fixed-point analysis. SIAM Journal on Matrix Analysis and Applications 43(4), 1755–1783 (2022). DOI 10.1137/21M1449579
(6) Evans, C., Pollock, S., Rebholz, L.G., Xiao, M.: A proof that Anderson acceleration improves the convergence rate in linearly converging fixed-point methods (but not in those converging quadratically). SIAM Journal on Numerical Analysis 58(1), 788–810 (2020). DOI 10.1137/19M1245384
(7) Fang, H.r., Saad, Y.: Two classes of multisecant methods for nonlinear acceleration. Numerical linear algebra with applications 16(3), 197–221 (2009). DOI 10.1002/nla.617
(8) Gardner, D.J., Reynolds, D.R., Woodward, C.S., Balos, C.J.: Enabling new flexibility in the SUNDIALS suite of nonlinear and differential/algebraic equation solvers. ACM Transactions on Mathematical Software (TOMS) 48(3), 1–24 (2022). DOI 10.1145/3539801
(9) GENE Developers: Gyrokinetic Electromagnetic Numerical Experiment (GENE). URL http://genecode.org/. Accessed 2024-05-06
(10) Goerler, T., Lapillonne, X., Brunner, S., Dannert, T., Jenko, F., Merz, F., Told, D.: The global version of the gyrokinetic turbulence code GENE. Journal of Computational Physics 230(18), 7053–7071 (2011). DOI 10.1016/j.jcp.2011.05.034
(11) Hamilton, S., Berrill, M., Clarno, K., Pawlowski, R., Toth, A., Kelley, C., Evans, T., Philip, B.: An assessment of coupling algorithms for nuclear reactor core physics simulations. Journal of Computational Physics 311, 241–257 (2016). DOI 10.1016/j.jcp.2016.02.012
(12) Hindmarsh, A.C., Brown, P.N., Grant, K.E., Lee, S.L., Serban, R., Shumaker, D.E., Woodward, C.S.: SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers. ACM Transactions on Mathematical Software (TOMS) 31(3), 363–396 (2005). DOI 10.1145/1089014.1089020
(13) Kelley, C.T.: Numerical methods for nonlinear equations. Acta Numerica 27, 207–287 (2018)
(14) Kresse, G., Furthmüller, J.: Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Computational materials science 6(1), 15–50 (1996). DOI 10.1016/0927-0256(96)00008-0
(15) Liu, Y., Sid-Lakhdar, W.M., Marques, O., Zhu, X., Meng, C., Demmel, J.W., Li, X.S.: GPTune: Multitask learning for autotuning exascale applications. In: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 234–246 (2021). DOI 10.1145/3437801.3441621
(16) Lockhart, S., Gardner, D.J., Woodward, C.S., Thomas, S., Olson, L.N.: Performance of low synchronization orthogonalization methods in Anderson accelerated fixed point solvers. In: Proceedings of the 2022 SIAM Conference on Parallel Processing for Scientific Computing, pp. 49–59. SIAM (2022). DOI 10.1137/1.9781611977141.5
(17) Loffeld, J., Woodward, C.S.: Considerations on the implementation and use of anderson acceleration on distributed memory and gpu-based parallel computers. In: Advances in the Mathematical Sciences: Research from the 2015 Association for Women in Mathematics Symposium, pp. 417–436. Springer (2016). DOI 10.1007/978-3-319-34139-2˙21
(18) Parker, J.B., LoDestro, L.L., Campos, A.: Investigation of a multiple-timescale turbulence-transport coupling method in the presence of random fluctuations. Plasma 1(1), 126–143 (2018). DOI 10.3390/plasma1010012
(19) Parker, J.B., LoDestro, L.L., Told, D., Merlo, G., Ricketson, L.F., Campos, A., Jenko, F., Hittinger, J.A.: Bringing global gyrokinetic turbulence simulations to the transport timescale using a multiscale approach. Nuclear Fusion 58(5), 054004 (2018). DOI 10.1088/1741-4326/aab5c8
(20) Pollock, S., Rebholz, L.G.: Anderson acceleration for contractive and noncontractive operators. IMA Journal of Numerical Analysis 41(4), 2841–2872 (2021). DOI 10.1093/imanum/draa095
(21) Pollock, S., Rebholz, L.G.: Filtering for Anderson acceleration. SIAM Journal on Scientific Computing 45(4), A1571–A1590 (2023). DOI 10.1137/22M1536741
(22) Post, D.E., Batchelor, D.B., Bramley, R.B., Cary, J.R., Cohen, R.H., Colella, P., Jardin, S.C.: Report of the fusion simulation project steering committee. Journal of fusion energy 23, 1–26 (2004). DOI 10.1007/s10894-004-1868-0
(23) Shestakov, A., Cohen, R., Crotinger, J., LoDestro, L., Tarditi, A., Xu, X.: Self-consistent modeling of turbulence and transport. Journal of Computational Physics 185(2), 399–426 (2003). DOI 10.1016/S0021-9991(02)00063-3
(24) Toth, A., Ellis, J.A., Evans, T., Hamilton, S., Kelley, C.T., Pawlowski, R., Slattery, S.: Local improvement results for Anderson acceleration with inaccurate function evaluations. SIAM Journal on Scientific Computing 39(5), S47–S65 (2017). DOI 10.1137/16M1080677
(25) Toth, A., Kelley, C.T.: Convergence analysis for Anderson acceleration. SIAM Journal on Numerical Analysis 53(2), 805–819 (2015). DOI 10.1137/130919398
(26) Walker, H.F., Ni, P.: Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis 49(4), 1715–1735 (2011). DOI 10.1137/10078356X