Abstract
Applying parallel-in-time algorithms to multiscale Hamiltonian systems to obtain stable long-time simulations is very challenging. In this paper, we present novel data-driven methods aimed at improving the standard parareal algorithm developed by Lions et al. in 2001, for multiscale Hamiltonian systems. The first method involves constructing a correction operator to improve a given inaccurate coarse solver through solving a Procrustes problem using data collected online along parareal trajectories. The second method involves constructing an efficient, high-fidelity solver by a neural network trained with offline generated data. For the second method, we address the issues of effective data generation and proper loss function design based on the Hamiltonian function. We show proof-of-concept by applying the proposed methods to a Fermi-Pasta-Ulam (FPU) problem. The numerical results demonstrate that the Procrustes parareal method is able to produce solutions that are more stable in energy compared to the standard parareal. The neural network solver can achieve comparable or better runtime performance compared to numerical solvers of similar accuracy. When combined with the standard parareal algorithm, the improved neural network solutions are slightly more stable in energy than the improved numerical coarse solutions.
1 Introduction
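Hamiltonian systems are ubiquitous in astronomy, molecular dynamics, classical mechanics, and theoretical physics. We concentrate on the separable Hamiltonian case
$$\begin{aligned} H(\textbf{p}, \textbf{q}) = \frac{1}{2} \textbf{p}^T M^{-1} \textbf{p} + U(\textbf{q}), \end{aligned}$$(1)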
where \(\textbf{p}\) and \(\textbf{q}\) are respectively the generalized momentum and position in a d-dimensional space, M a diagonal matrix denoting the masses, and U a smooth scalar function depending on the position \(\textbf{q}\). Physically, the Hamiltonian is interpreted as the total energy of a system, consisting of the kinetic energy \(K(\textbf{p}):= \frac{1}{2} \textbf{p}^T M^{-1} \textbf{p}\) and the potential energy \(U(\textbf{q})\). The dynamics of the system is given by Hamilton’s equations
$$\begin{aligned} \dot{\textbf{p}} = -\nabla U(\textbf{q}), \qquad \dot{\textbf{q}} = M^{-1} \textbf{p}. \end{aligned}$$(2)
Geometric integrators (methods that preserve geometric properties of the exact flow) such as the velocity Verlet method are frequently used to simulate Hamiltonian flows [1]. It has been shown that preserving geometric structures can greatly improve long-time numerical integration compared with general-purpose methods.
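As a minimal illustration of the velocity Verlet method mentioned above, the following sketch integrates a 1D harmonic oscillator with \(H = \frac{1}{2}p^2 + \frac{1}{2}q^2\). The test problem, the stepsize, and the function names are assumptions chosen for illustration and are not taken from the experiments in this paper.

```python
def velocity_verlet(p, q, grad_U, h, n_steps, mass=1.0):
    """Advance (p, q) by n_steps of size h for H = p^2 / (2 m) + U(q)."""
    for _ in range(n_steps):
        p = p - 0.5 * h * grad_U(q)   # half kick
        q = q + h * p / mass          # drift
        p = p - 0.5 * h * grad_U(q)   # half kick
    return p, q


def energy(p, q):
    # Harmonic oscillator energy (illustrative test problem).
    return 0.5 * p * p + 0.5 * q * q


p0, q0 = 0.0, 1.0
p1, q1 = velocity_verlet(p0, q0, grad_U=lambda q: q, h=0.01, n_steps=100_000)
energy_drift = abs(energy(p1, q1) - energy(p0, q0))
```

For this toy problem the energy error stays small over 100,000 steps, which is the kind of long-time behavior that motivates symplectic integrators.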
Even with the improved long-time stability and accuracy of geometric integrators, the computational complexity remains high for many physical applications. In particular, for systems with multiple time scales, accurate long-time integration is very difficult because the small stepsize required for stable integration of the fast motions leads to a large number of time steps. Computational multiscale algorithms aim at reducing computational complexity by exploiting the underlying multiscale structures. For example, for systems with sufficiently wide separation of scales and certain homogeneity and ergodic properties, the heterogeneous multiscale methods (HMM) can compute the effective systems with significantly reduced complexity [2].
Recently, due to the increased availability of parallel processors in modern supercomputers, there has been rising interest in developing time-domain parallelization algorithms to reduce the wall-clock computation time for time-dependent problems. As the algorithm of choice, the work in this paper will rely on the parareal method, a parallel-in-time algorithm introduced by Lions et al. [3]. To set up, the time domain [0, T] is divided into N subintervals of length \(\Delta t = T/N\). The parareal method involves two numerical solvers that advance a solution by \(\Delta t\): an efficient but low-fidelity coarse solver, denoted by \(C_{\Delta t}\), and an accurate but expensive fine solver, denoted by \(F_{\Delta t}\). The coarse solver solutions are iteratively corrected by fine solver solutions computed on smaller time intervals in parallel. Formally, letting \(\textbf{u}_n^{(k)}\) denote the solution computed at iteration k and time \(t_n = n\Delta t\), the parareal iterations are given by
$$\begin{aligned} \textbf{u}^{(k+1)}_{n+1} = C_{\Delta t} \textbf{u}^{(k+1)}_{n} + \left( F_{\Delta t} \textbf{u}^{(k)}_{n} - C_{\Delta t} \textbf{u}^{(k)}_{n}\right) . \end{aligned}$$(3)
Ideally, with a sufficiently accurate coarse solver, the iterations will quickly converge to the sequential solution computed by the fine solver \(F_{\Delta t}^n \textbf{u}_0\). However, a closer analysis reveals that the convergence relies on the stability of the parareal iterations, and the standard parareal is only stable for dissipative problems [4, 5]. For oscillatory and hyperbolic problems (such as Hamiltonian systems), the standard parareal scheme is known to perform badly because the convergence restricts the length of integration time [6]. To be more specific, let \(\textbf{e}^{(k)}_{n}:=\textbf{u}^{(k)}_{n} - F_{\Delta t}^n \textbf{u}_0 \). The bound for the amplification \(\textbf{e}^{(k+1)}_{n}/\textbf{e}^{(k)}_{n}\) depends on the sum \(\sum _{j=0}^{n-k-2}{\left\Vert C_{\Delta t}\right\Vert ^j}\). For dissipative problems, \(\left\Vert C_{\Delta t}\right\Vert < 1\), and so the sum can be bounded by some constant independent of n. In contrast, for purely hyperbolic problems, \(\left\Vert C_{\Delta t}\right\Vert \) is close to 1 and the sum will be proportional to n, which causes the iterations to become unstable.
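The structure of the plain parareal iteration can be sketched in a few lines. The example below is an illustration only: it uses a scalar dissipative test equation \(u' = \lambda u\) with explicit Euler solvers, all of which are assumptions made for brevity and not the Hamiltonian setting or the solvers used in this paper. The fine and coarse propagations of the previous iterate are written as list comprehensions; in practice this is the part that runs in parallel.

```python
def parareal(u0, coarse, fine, n_intervals, n_iters):
    """Plain parareal; returns [u_0, ..., u_N] after n_iters corrections."""
    u = [u0]
    for n in range(n_intervals):            # iteration 0: sequential coarse sweep
        u.append(coarse(u[n]))
    for _ in range(n_iters):
        # Propagate the previous iterate; independent across n (parallel part).
        f_old = [fine(u[n]) for n in range(n_intervals)]
        c_old = [coarse(u[n]) for n in range(n_intervals)]
        new = [u0]
        for n in range(n_intervals):        # sequential correction sweep
            new.append(coarse(new[n]) + f_old[n] - c_old[n])
        u = new
    return u


lam, dt, N = -1.0, 0.1, 10   # illustrative test problem u' = lam * u


def coarse(u):
    # One explicit Euler step of size dt.
    return u * (1.0 + lam * dt)


def fine(u, substeps=100):
    # 100 explicit Euler steps of size dt / 100.
    for _ in range(substeps):
        u = u * (1.0 + lam * dt / substeps)
    return u


u_seq = [1.0]                # sequential fine solution F^n u_0
for n in range(N):
    u_seq.append(fine(u_seq[n]))


def max_error(u):
    return max(abs(a - b) for a, b in zip(u, u_seq))


err_k0 = max_error(parareal(1.0, coarse, fine, N, 0))
err_k2 = max_error(parareal(1.0, coarse, fine, N, 2))
err_kN = max_error(parareal(1.0, coarse, fine, N, N))
```

On this dissipative toy problem the error decreases quickly with k, and after N iterations the parareal solution reproduces the sequential fine solution up to round-off. The difficulty discussed above arises when \(\left\Vert C_{\Delta t}\right\Vert \) is close to 1, which this example deliberately avoids.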
Over the past years, many efforts have been devoted to developing stable parareal schemes for oscillatory and hyperbolic problems. For example, Dai et al. [7] proposed a symmetric variant of the parareal scheme and coupled it with projections to the constant energy manifold. Another class of methods is based on the idea of correcting parareal solutions using data. We review more of these works in Section 2.
In this paper, we will focus on multiscale Hamiltonian systems where \(H = H_\epsilon \). Here, \(\epsilon \) is a small parameter indicating the small time or length scale. Our aim is to improve the computational efficiency for long-time simulations of multiscale Hamiltonian systems by leveraging recent advances in machine learning (ML) and parallel-in-time algorithms. Specifically, we would like to develop data-driven methods to stabilize the standard parareal scheme. Our idea is to use an improved solver \(\Phi ^{\theta , k}_{\Delta t}\) in place of \(C_{\Delta t}\) in (3). The superscript \(\theta \) in \(\Phi ^{\theta , k}_{\Delta t}\) denotes the unknown parameters defining the solver. The superscript k indicates this solver may depend on the iteration. We propose two approaches to construct \(\Phi ^{\theta , k}_{\Delta t}\):
1. Enhance an existing coarse solver \(C_{\Delta t}\) through a correction operator that aligns the “phase” in the coarse solutions \(C_{\Delta t} \textbf{u}^{(k)}_n\) and fine solutions \(F_{\Delta t} \textbf{u}^{(k)}_n\) for each iteration. In other words, \(\Phi ^{\theta , k}_{\Delta t}:= \Psi ^{(k)}_{\Delta t} \circ C_{\Delta t} \approx F_{\Delta t}\), where \(\Psi ^{(k)}_{\Delta t}\) is the correction operator constructed from data collected online during parareal iterations. Because \(\Psi ^{(k)}_{\Delta t}\) is obtained by solving an orthogonal Procrustes problem, this approach is named the Procrustes parareal method. The Procrustes parareal iteration is given by
$$\begin{aligned} \textbf{u}^{(k+1)}_{n+1} = \Psi ^{(k)}_{\Delta t} \circ C_{\Delta t} \textbf{u}^{(k+1)}_{n} + \left( F_{\Delta t} \textbf{u}^{(k)}_{n} - \Psi ^{(k)}_{\Delta t} \circ C_{\Delta t} \textbf{u}^{(k)}_{n}\right) . \end{aligned}$$(4)
2. Approximate the fine solver (namely the solution map with a fixed stepsize \(\Delta t\)) using a neural network (NN), i.e., \(\Phi ^{\theta , k}_{\Delta t}:= \Phi ^{\text {NN}}_{\Delta t} \approx F_{\Delta t}\) where \(\Phi ^{\text {NN}}_{\Delta t}\) stands for an NN solver. Unlike in the Procrustes parareal approach, the NN solver is constructed using offline training data. For this approach, we will address the issue of suitable training data generation and loss function design. The parareal iteration with an NN solver is as follows:
$$\begin{aligned} \textbf{u}^{(k+1)}_{n+1} = \Phi ^{\text {NN}}_{\Delta t} \textbf{u}^{(k+1)}_{n} + \left( F_{\Delta t} \textbf{u}^{(k)}_{n} - \Phi ^{\text {NN}}_{\Delta t} \textbf{u}^{(k)}_{n}\right) . \end{aligned}$$(5)
The paper is organized as follows. Section 2 presents the Procrustes parareal approach. Section 3 presents the neural network approach for approximating the solution map. Section 4 presents a case study for the Fermi-Pasta-Ulam (FPU) problem. We then conclude our findings in Section 5.
2 Enhancing the coarse solvers by data
In order to address the instability issue in the standard parareal iterations, several methods involving a correction to bridge the gap between fine and coarse solutions were proposed in the past. For example, Farhat and Chandesris [8] used a Newton-type iteration to reduce the jumps between the fine and coarse solutions. In [5], Ariel et al. proposed the \(\theta \)-parareal scheme, which uses an interpolation-based linear operator to enhance the coarse solver for oscillatory systems. In [9], Nguyen and Tsai focused on the second-order wave equations and developed a correction operator based on minimizing the wave energy residual of the fine and coarse solutions. The resulting correction successfully stabilizes the parareal iterations by aligning the “phase” in the wave fields computed by the fine and coarse solver respectively. Later, the same authors proposed in [10] a deep learning approach to enhance the coarse solver to reduce the “phase” errors in wave propagation.
The success for the wave equations in [9] directly inspires the development of a similar approach for a class of Hamiltonian systems where the notion of “phase” can be suitably defined.
In this section, we first provide our definition of “phase” for Hamiltonian systems, and then introduce the procedure to obtain a correction operator \(\Psi ^{(k)}_{\Delta t}\) from data computed in previous iterations.
2.1 A practical notion of “phase” for Hamiltonian systems
For integrable Hamiltonian systems such as harmonic oscillators or Kepler systems, the “phase” can be naturally defined as the angle variables from the action-angle coordinates. For non-integrable systems, such as the FPU problem, where action-angle coordinates are not available, we need alternative definitions for “phase.”
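Because phase is an angle-like object, for a separable Hamiltonian (1), it is natural to consider a transform function \(\Lambda \), which maps the energy level set to a hypersphere, whose radius satisfies
$$\begin{aligned} \left\Vert \Lambda \left( [\textbf{p}, \textbf{q}]\right) \right\Vert _2^2 = H(\textbf{p}, \textbf{q}). \end{aligned}$$(6)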
This way, we can define the “phase” difference between \([\textbf{p}_1, \textbf{q}_1]\) and \([\textbf{p}_2, \textbf{q}_2]\) as the angle between the transformed vectors \(\Lambda \left( [\textbf{p}_1, \textbf{q}_1]\right) \) and \(\Lambda \left( [\textbf{p}_2, \textbf{q}_2]\right) \). We call \(\Lambda \) the energy transform because the \(l_2\) norm of the transformed vector is related to the energy.
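The specific forms of \(\Lambda \) and the pseudo-inverse \(\Lambda ^\dagger \) depend on the Hamiltonian. Notably, not all Hamiltonian functions allow a valid definition of \(\Lambda \). For example, for a 1D harmonic oscillator whose Hamiltonian is \(H(p,q) = \frac{1}{2}p^2 + \frac{1}{2}q^2\), one can take
$$\begin{aligned} \Lambda \left( [p, q]\right) = \frac{1}{\sqrt{2}} [p, q], \qquad \Lambda ^\dagger (\textbf{x}) = \sqrt{2}\, \textbf{x}, \end{aligned}$$(7)
so that the squared \(l_2\) norm of \(\Lambda \left( [p, q]\right) \) equals \(H(p,q)\).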
For a 2D Kepler problem where \(H(\textbf{p}, \textbf{q}) = \frac{1}{2} \textbf{p}^T \textbf{p}- \frac{1}{\left\Vert \textbf{q}\right\Vert }\), \(\Lambda \) does not exist because of the negative potential term.
In this paper, we will work with the FPU problem and will show the corresponding definition of \(\Lambda \) in Section 4.
2.2 The Procrustes parareal
In the following, we use \(\textbf{u}\) to represent the concatenated vector \([\textbf{p}, \textbf{q}]\). Assuming \(\Lambda \) and \(\Lambda ^{\dagger }\) are both known, we define the correction operator as
$$\begin{aligned} \Psi ^{(k)}_{\Delta t}:= \Lambda ^{\dagger } \Omega ^{(k)}_{\Delta t} \Lambda , \end{aligned}$$(8)
where \(\Omega ^{(k)}_{\Delta t}\) is an orthogonal transformation to be determined. This shows the advantage of defining \(\Lambda (\textbf{u})\): any orthogonal transformation on \(\Lambda (\textbf{u})\) preserves \(H(\textbf{u})\). Hence, the correction operator preserves the energy.
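The Procrustes parareal method is given as follows: (function composition symbols are left out for brevity)
$$\begin{aligned} \textbf{u}^{(k+1)}_{n+1} = \Lambda ^{\dagger } \Omega ^{(k)}_{\Delta t} \Lambda C_{\Delta t} \textbf{u}^{(k+1)}_{n} + \left( F_{\Delta t} \textbf{u}^{(k)}_{n} - \Lambda ^{\dagger } \Omega ^{(k)}_{\Delta t} \Lambda C_{\Delta t} \textbf{u}^{(k)}_{n}\right) . \end{aligned}$$(9)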
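Here, \(\Omega ^{(k)}_{\Delta t}\) is obtained by solving the orthogonal Procrustes problem
$$\begin{aligned} \Omega ^{(k)}_{\Delta t}:= \mathop {\textrm{argmin}}\limits _{\Omega ^T \Omega = I} \sum _{n=0}^{N-1} \left\Vert f_n - \Omega g_n \right\Vert _2^2, \qquad f_n:= \Lambda F_{\Delta t} \textbf{u}^{(k)}_{n}, \quad g_n:= \Lambda C_{\Delta t} \textbf{u}^{(k)}_{n}. \end{aligned}$$(10)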
The geometric interpretation is to minimize the sum of the phase errors between fine and coarse solutions computed along the trajectory from the last iteration. Hence, \(\Omega ^{(k)}_{\Delta t}\) is referred to as the phase corrector.
We follow the standard way to solve the orthogonal Procrustes problem, which uses the singular value decomposition (SVD) of the correlation matrix \(M:=F G^T\). Here, \(F:= [f_0 \; f_1 \; \cdots \; f_{N-1}]\), \(G:= [g_0 \; g_1 \; \cdots \; g_{N-1}]\). Let \(U\Sigma V^T\) be the SVD of M. If M has full rank, then the minimizer is uniquely \(\Omega _* = UV^T\). We refer readers to [11] for more details of the Procrustes problem.
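The SVD-based solution described above can be sketched with NumPy. In this illustration the columns of `G` and `F` play the role of the transformed coarse and fine states, and the data are synthetic: `F` is produced from `G` by a known 2D rotation so that the recovered \(\Omega _*\) can be checked. The function name and the test data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def procrustes_rotation(F, G):
    """Return the orthogonal Omega minimizing ||Omega @ G - F||_F."""
    U, _, Vt = np.linalg.svd(F @ G.T)   # correlation matrix M = F G^T
    return U @ Vt


rng = np.random.default_rng(0)
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

G = rng.standard_normal((2, 50))   # columns g_n (synthetic "coarse" data)
F = R @ G                          # columns f_n (synthetic "fine" data)

Omega = procrustes_rotation(F, G)
residual = np.linalg.norm(Omega @ G - F)
```

Because `Omega` is orthogonal, applying it to a column of `G` leaves the \(l_2\) norm of that column unchanged, which is the property that lets the correction operator preserve the energy.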
The pseudo-code for the Procrustes parareal method is provided in Algorithm 1.
3 Neural network approximation of the solution map
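In this section, we present our second data-driven approach, which is to use a neural network (NN) to approximate the solution map \(F_{\Delta t}\). To construct the NN solver \(\Phi ^{\text {NN}}_{\Delta t}\), the main task is to solve an optimization problem
$$\begin{aligned} \min _{\Phi ^{\text {NN}}_{\Delta t} \in X} \frac{1}{\left| \mathcal {D}_0\right| } \sum _{\textbf{u}_0 \in \mathcal {D}_0} l\left( \textbf{u}_0, \Phi ^{\text {NN}}_{\Delta t}, F_{\Delta t}\right) , \end{aligned}$$(11)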
where X is a function space determined by the network architecture, \(\mathcal {D}_0\) a set of input data points, and \(l\left( \textbf{u}_0, \Phi ^{\text {NN}}_{\Delta t}, F_{\Delta t}\right) \) the misfit term for each data point \(\textbf{u}_0\) given the reference solution map \(F_{\Delta t}\) and the approximated map \(\Phi ^{\text {NN}}_{\Delta t}\). We describe our setup for each of these components as follows.
3.1 Choice of neural network architecture
For simplicity, we choose fully connected residual networks (ResNets). Compared to a regular multilayer perceptron, the residual network adds skip connections between pairs of hidden layers.
Let L denote the number of hidden layers and n the number of nodes per hidden layer. The layer outputs are defined as follows:
Here, \(W^{(l)}\) and \(b^{(l)}, l=1,...,L+1\) are weights and biases to be determined through the training procedure. We use the Exponential Linear Unit (ELU) [12] for the nonlinear activation function \(\sigma \). Note that we adopt a scaling factor 1/L for the hidden layers with skip connections. This technique was proposed in [13] to make the network performance more robust against hyperparameter change.
3.2 Design of misfit term
The function to be approximated is a solution map that maps a phase space state to another state. This allows us to use a sequence of successive time steps to construct the misfit term. Suppose that for an input \(\textbf{u}_0\) and a sequence length S, we generate \(\{ \textbf{u}_i \}_{1\le i \le S}, \textbf{u}_i = \left( F_{\Delta t}\right) ^{i} \textbf{u}_0 \) as the target sequence and \(\{ \tilde{\textbf{u}}_i \}_{1\le i \le S}, \tilde{\textbf{u}}_i = \left( \Phi ^{\text {NN}}_{\Delta t}\right) ^{i} \textbf{u}_0 \) as the approximated sequence. The misfit is then computed between the approximated sequence and the target sequence
We remark that because the network is applied recursively for obtaining \(\tilde{\textbf{u}}_i\), this multi-step loss essentially makes training the network like training a recurrent neural network.
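The recursive structure of the multi-step misfit can be sketched as follows. This is a hedged illustration: the averaging over the S steps, the `mse` metric, and the rotation maps standing in for \(F_{\Delta t}\) and \(\Phi ^{\text {NN}}_{\Delta t}\) are all assumptions, and in practice the target sequence would be precomputed rather than regenerated inside the loss.

```python
import math


def mse(u, v):
    # Mean squared error between two states given as tuples.
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def multistep_misfit(u0, approx_step, fine_step, seq_len, metric=mse):
    """Average metric between F^i u0 and Phi^i u0 for i = 1..seq_len."""
    u_ref, u_apx, total = u0, u0, 0.0
    for _ in range(seq_len):
        u_ref = fine_step(u_ref)      # target: one more application of F
        u_apx = approx_step(u_apx)    # approximation applied recursively
        total += metric(u_ref, u_apx)
    return total / seq_len


def rotation(angle):
    # Toy solution map: rotation of the (p, q) plane by a fixed angle.
    c, s = math.cos(angle), math.sin(angle)
    return lambda u: (c * u[0] - s * u[1], s * u[0] + c * u[1])


fine_step = rotation(0.10)
exact_step = rotation(0.10)    # approximation with no phase error
biased_step = rotation(0.12)   # approximation with a small phase error

u0 = (0.0, 1.0)
loss_exact = multistep_misfit(u0, exact_step, fine_step, seq_len=5)
loss_biased_1 = multistep_misfit(u0, biased_step, fine_step, seq_len=1)
loss_biased_5 = multistep_misfit(u0, biased_step, fine_step, seq_len=5)
```

In this toy setting the phase error of `biased_step` accumulates under recursion, so the five-step misfit penalizes it more heavily than the one-step misfit.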
There are several ways to measure the difference between \((\textbf{u}_i, \tilde{\textbf{u}}_i)\). One common approach is to use the Euclidean metric of \(\mathbb {R}^{2d}\). This is known as the mean squared error
The Euclidean metric puts equal weights on \(\textbf{p}\) and \(\textbf{q}\) components of \(\textbf{u}\). While minimizing the mean squared error aligns with the goal of reducing the trajectory error, it often leads to imbalanced energy error because the Hamiltonian function does not always weight \(\textbf{p}\) and \(\textbf{q}\) similarly. Therefore, to naturally balance the components based on the Hamiltonian, we adopt the energy transform \(\Lambda \) as defined in Section 2.1. We define the energy balanced error
Putting these together, if we use the energy balanced error, the misfit term for an initial state \(\textbf{u}_0\) and a sequence length S is given by
3.3 Generation of input data set
The next problem is to generate a proper set of initial conditions \(\textbf{u}_0\) for training. Unlike in the Procrustes parareal approach, where the data are collected online during parareal iterations, here, we have to generate training data offline to construct an effective NN solver.
In order to understand “what is a reasonable distribution to sample in the phase space,” we regard the misfit term in the optimization problem (11) as the mean error over a continuous distribution of \(\textbf{u}\) in the phase space:
It is thus natural to consider using a relevant invariant measure of the Hamiltonian flow for \(\mu \).
Suppose we are interested in simulations of the Hamiltonian flow with a fixed total energy. Then, we should consider sampling an invariant measure on an energy level set
However, the numerical approximations may not preserve the total energy. To start, the accurate fine solver used to approximate the true solution map is a symplectic integrator for which exact energy preservation is not possible. In addition, there is no guarantee for energy preservation by the general NN solver considered in this paper.
Notice that by Liouville’s theorem, the Hamiltonian flow preserves phase space volume. Then, we can construct an invariant measure in \(\mathbb {R}^{2d}\) that concentrates on the chosen energy level \(H_0\) as follows, using the coarea formula:
In other words, one can separately sample the invariant densities on the energy level sets and the Gaussian density in the normal directions on the energy level sets.
Motivated by this observation, we propose a novel sampling algorithm called HMC-\(H_0\). The name comes from its resemblance to the Hamiltonian Monte-Carlo (HMC) algorithm [14]. Starting with \(\textbf{q}=\textbf{q}_0\), we generate a chain of points by repeating the following two steps:
1. Momentum refreshment: randomly sample \(\textbf{p}\) from the hypersphere defined by
$$\begin{aligned} \left\{ \textbf{p}\in \mathbb {R}^d \mid \textbf{p}^T M^{-1} \textbf{p}= 2 \left( H_0 - U(\textbf{q}) \right) \right\} \end{aligned}$$(20)
2. Time integration: \(\left( \textbf{p}, \textbf{q}\right) \leftarrow F_{\delta t}\left( \textbf{p}, \textbf{q}\right) \)
Note that our approach is different from the original HMC algorithm in the momentum refreshment step. In the original HMC, \(\textbf{p}\) is randomly sampled from a Gaussian distribution independent of current \(\textbf{q}\), whereas in our approach, the distribution depends on \(\textbf{q}\) and a fixed \(H_0\).
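The two steps above can be sketched as follows. Several choices here are assumptions made for illustration: the momentum direction is drawn uniformly in the mass-scaled coordinates \(M^{-1/2}\textbf{p}\), velocity Verlet stands in for \(F_{\delta t}\), and the quadratic potential is a toy example rather than the FPU potential. All function names are hypothetical.

```python
import math
import random


def sample_momentum(q, H0, U, masses, rng):
    """Draw p with p^T M^{-1} p = 2 (H0 - U(q)) for a diagonal mass matrix."""
    z = [rng.gauss(0.0, 1.0) for _ in masses]
    norm = math.sqrt(sum(x * x for x in z))
    radius = math.sqrt(max(0.0, 2.0 * (H0 - U(q))))  # guard against round-off
    return [radius * math.sqrt(m) * x / norm for m, x in zip(masses, z)]


def verlet_flow(p, q, grad_U, masses, h, n_steps):
    """Stand-in for F_{delta t}: velocity Verlet with stepsize h."""
    p, q = list(p), list(q)
    for _ in range(n_steps):
        g = grad_U(q)
        p = [pi - 0.5 * h * gi for pi, gi in zip(p, g)]
        q = [qi + h * pi / m for qi, pi, m in zip(q, p, masses)]
        g = grad_U(q)
        p = [pi - 0.5 * h * gi for pi, gi in zip(p, g)]
    return p, q


def hmc_h0_chain(q0, H0, U, grad_U, masses, n_samples, h=0.01, n_steps=50, seed=0):
    rng = random.Random(seed)
    q, samples = list(q0), []
    for _ in range(n_samples):
        p = sample_momentum(q, H0, U, masses, rng)             # step 1
        p, q = verlet_flow(p, q, grad_U, masses, h, n_steps)   # step 2
        samples.append((p, q))
    return samples


# Toy potential (hypothetical): U(q) = |q|^2 / 2.
def U(q):
    return 0.5 * sum(x * x for x in q)


def grad_U(q):
    return list(q)


masses = [1.0, 2.0]
H0 = 1.0
samples = hmc_h0_chain([0.5, 0.0], H0, U, grad_U, masses, n_samples=200)


def hamiltonian(p, q):
    return 0.5 * sum(pi * pi / m for pi, m in zip(p, masses)) + U(q)


max_energy_gap = max(abs(hamiltonian(p, q) - H0) for p, q in samples)
```

Every momentum refreshment resets the state to the level set \(H = H_0\), so the collected samples stay close to that level set up to the integration error of the flow.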
We shall compare the HMC-\(H_0\) algorithm to the following naive approach, which combines random sampling in the momentum space with trajectory generation in the phase space using the flow. We first sample a set of momenta; then, using the sampled momenta and \(\textbf{q}_0\), we generate an ensemble of trajectories by flowing the points for a duration of time. The points along the trajectories are collected. This is a naive attempt to sample the Liouville measures on the energy level sets, assuming ergodicity of the flows. We call this algorithm TrajEnsemble-\(H_0\).
Full descriptions of the algorithms are given in Algorithms 2 and 3. For both algorithms, we can leverage parallel computation to obtain a large number of data samples.
4 Case study: the Fermi-Pasta-Ulam problem
We consider the Fermi-Pasta-Ulam (FPU) problem as a model problem to demonstrate properties of our proposed methods. All code and data accompanying the experiments are publicly available at https://github.com/tsai-lab-ut/multiscale-hamiltonian.
First studied in 1955, the FPU problem [15] describes a simple yet important model for nonlinear physics, which exhibits unexpected dynamical behaviors after long enough integration time. The model involves a chain of particles connected by springs that obey Hooke’s law but with a weak nonlinear perturbation. Here, we adopt a version of the problem presented in [1]. Suppose there are 2m mass points connected by alternating soft nonlinear springs and stiff linear springs. The variables \(q_1, \cdots , q_{2m}\) (\(q_0 = q_{2m+1} = 0\)) represent the displacements of the mass points from equilibrium, and \(p_i = d q_i/dt\) represent velocities. The Hamiltonian of this system is given by
$$\begin{aligned} H(\textbf{p}, \textbf{q}) = \frac{1}{2} \sum _{i=1}^{2m} p_i^2 + \frac{\omega ^2}{4} \sum _{i=1}^{m} \left( q_{2i} - q_{2i-1}\right) ^2 + \sum _{i=0}^{m} \left( q_{2i+1} - q_{2i}\right) ^4, \end{aligned}$$(21)
where \(\omega \gg 1\) is the frequency of the stiff linear springs.
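For concreteness, the FPU energy can be evaluated with a few lines of code. The sketch below assumes the form of the Hamiltonian given in [1] (unit masses, quadratic stiff springs with coefficient \(\omega ^2/4\), quartic soft springs, fixed end points); the function name is hypothetical.

```python
def fpu_hamiltonian(p, q, omega):
    """FPU energy, assuming the form in [1]:

    H = 1/2 sum_i p_i^2
        + omega^2/4 sum_{i=1..m} (q_{2i} - q_{2i-1})^2
        + sum_{i=0..m} (q_{2i+1} - q_{2i})^4,   with q_0 = q_{2m+1} = 0.
    """
    m = len(q) // 2
    qq = [0.0] + list(q) + [0.0]   # pad with the fixed end points
    kinetic = 0.5 * sum(pi * pi for pi in p)
    stiff = 0.25 * omega ** 2 * sum(
        (qq[2 * i] - qq[2 * i - 1]) ** 2 for i in range(1, m + 1)
    )
    soft = sum((qq[2 * i + 1] - qq[2 * i]) ** 4 for i in range(m + 1))
    return kinetic + stiff + soft


# Smallest chain (m = 1): kinetic = 1, stiff = 1, soft = 0 + 1, so H = 3.
h_small = fpu_hamiltonian([1.0, 1.0], [0.0, 1.0], omega=2.0)
```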
The dynamics of such a system has different behaviors on several different time scales. On the smallest time scale \(\mathcal {O}(\omega ^{-1})\), the linear springs show almost-harmonic oscillations with period close to \(\pi /\omega \). On the time scale \(\mathcal {O}(\omega ^{0})\), the motion of the nonlinear springs becomes apparent. On the time scale \(\mathcal {O}(\omega )\), there is slow energy exchange among the stiff springs. An illustration of motion on different time scales can be found in Section XIII.2 in [1].
In our experiments, we aim to obtain stable simulation of the system on a time scale of \(\mathcal {O}(\omega )\) using solvers with \(\Delta t\) on the time scale of \(\mathcal {O}(\omega ^{0})\), i.e., \(\Delta t=1.0\). We will use \(m=3\) (hence the number of degrees of freedom is \(d=6\)) and \(\omega =300\). This is a challenging regime because the separation of characteristic time scales is large, i.e., from 1/300 to 300. We will run the algorithms from the initial condition
The corresponding energy is
Given a reference solution \(\textbf{u}^{\text {ref}}=\left( \textbf{p}^{\text {ref}}, \textbf{q}^{\text {ref}}\right) \) and a computed solution \(\textbf{u}=\left( \textbf{p},\textbf{q}\right) \), we shall report the trajectory error
and the energy error
4.1 Definition of the energy transform
We first present our definition of the energy transform \(\Lambda \), since both the Procrustes parareal and the NN solver will rely on this function. Based on the Hamiltonian (21), we define
where
Then, \(\Lambda \) can be written as
One can check the \(l_2\) norm squared of \(\Lambda \left( [\textbf{p}, \textbf{q}]\right) \) recovers (21).
To define the pseudo-inverse \(\Lambda ^{\dagger }\), the main task is to recover \(\textbf{q}\) from a given vector \(\Lambda _2(\textbf{q})\). This can be done by solving a nonlinear least squares problem: given \(\tilde{\textbf{dq}}_{\text {stiff}}, \tilde{\textbf{dq}}_{\text {soft}}\), find
We use an adaptive nonlinear least squares algorithm, called NL2SOL [16], to solve the problem. In the Procrustes parareal setup, \(\Lambda ^{\dagger }\) is only evaluated when applying the correction operator \(\Psi ^{(k)}_{\Delta t}:= \Lambda ^{\dagger } \Omega ^{(k)}_{\Delta t} \Lambda \) to some solution \(\textbf{u}\). Since the correction by the unitary matrix \( \Omega ^{(k)}_{\Delta t}\) is expected to be small, we use the \(\textbf{q}\) component of \(\textbf{u}\) as an initial guess in the least squares algorithm.
4.2 The Procrustes parareal method
In this section, we discuss our choice of numerical integrators and present results for the Procrustes parareal method.
4.2.1 Choice of numerical solvers
For Hamiltonian systems, we use symplectic integrators for the fine and coarse solvers. The stepsize h of the integrator should be order \(\mathcal {O}(\omega ^{-1})\) or smaller for accurate integration. For the coarse solver \(C_{\Delta t}\), we use a 4th-order symplectic algorithm developed by Calvo and Sanz-Serna [17]. We consider two stepsizes \(h=2^{-9}\) and \(h=2^{-8}\) for comparison. For the fine solver \(F_{\Delta t}\), we use an 8th-order symplectic algorithm developed by Kahan and Li [18] with a stepsize \(h=2^{-18}\). The numerical integrators are denoted by \(\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\), \(\Phi ^{\text {CSS4}, h=2^{-8}}_{\Delta t}\) and \(\Phi ^{\text {KL8}, h=2^{-18}}_{\Delta t}\).
Because we aim to simulate for a long time interval \(\mathcal {O}(\omega )\), generating the reference trajectory requires \(\omega /h\approx 8\times 10^7\) fine steps. Hence, we perform fine solver computations in octuple precision to reduce the accumulated rounding errors. The computations of the coarse solver are done in double precision. We perform all numerical computations in Julia. We use the MultiFloats library to obtain octuple precision numbers.
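As a quick check of the step count quoted above, taking the simulated time to be \(\omega = 300\) and the fine stepsize \(h = 2^{-18}\) gives

$$\begin{aligned} \frac{\omega }{h} = 300 \times 2^{18} = 78{,}643{,}200 \approx 7.9 \times 10^7. \end{aligned}$$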
To demonstrate the significance of rounding errors and to assess the quality of the fine solutions, in Fig. 1, we plot the global errors of the fine solutions computed up to \(T=1000\) in double precision versus in octuple precision. Here, we use a method of even higher order, the 12th-order explicit Runge–Kutta-Nyström method, with stepsize \(h=2^{-18}\) implemented in octuple precision to serve as the reference solution map. We can see the trajectory errors grow over time while the energy errors are stable (expected since it is a symplectic integrator) for both precisions. The trajectory errors grow linearly in time at first, and then grow exponentially. Based on the trajectory errors, we conclude the fine solutions computed in octuple precision lose digits much later than fine solutions computed in double precision. Even with octuple precision, the fine solutions are not reliable after \(n=500\). In the rest of the paper, unless otherwise mentioned, we compare the trajectory errors against the reference only up to \(n=500\). For energy errors, we may compare for \(n>500\) since the reference energy is almost a constant.
We also observe an effect of floating point precision on parareal iterations. Figure 2 shows the errors in parareal solutions computed with a coarse solver implemented in double precision versus octuple precision. The fine solver is fixed using octuple precision. We found the error plots differ significantly after \(n=200\) steps. Using double precision for the coarse solver prevents the parareal solutions from improving after \(n=200\). Unfortunately, we have to use double precision for the coarse solver in the remaining comparisons, because (1) the library for the least squares algorithm involved in inverting \(\Lambda \) only supports double precision, and (2) the NN solver is double precision.
4.2.2 Plain versus Procrustes
Using the introduced numerical solvers, we run the plain parareal method and the Procrustes parareal method for \(N=1000\) steps and \(k=10\) iterations, from the same initial condition.
Figure 3 shows errors in the computed trajectories for \(C_{\Delta t} =\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\). Overall, both methods improve the initial coarse solution, but the improvement becomes small after several iterations. Comparing the trajectory error plots, we observe the Procrustes parareal solutions improve faster than the plain parareal solutions over iterations. Comparing the energy error plots, we see the Procrustes parareal solutions not only improve faster, but they are also more stable in energy than the plain parareal solutions (in particular, see how the energy errors grow from iteration 0 to iteration 1 in plain parareal versus in Procrustes parareal). The stability issue can be seen more clearly in Fig. 4, where we plot the energy of the three stiff springs and their total energy from the computed solutions. As shown in the reference energy profile in Fig. 5, during the simulated time range, there is energy exchange among the stiff springs while the total energy of the stiff springs remains almost a constant. For the plain parareal solution at iteration 3, the total energy blows up by the end of the simulation. In contrast, the total energy for the Procrustes parareal solution is almost conserved.
The same can be observed for a less accurate coarse solver \(C_{\Delta t} =\Phi ^{\text {CSS4}, h=2^{-8}}_{\Delta t}\) (see Figs. 11 and 12). Compared with using \(C_{\Delta t} =\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\), the improvement of Procrustes parareal over plain parareal is more substantial.
Lastly, we compare runtime of the parareal methods. We used 40 computing cores to perform parallel computation. For \(C_{\Delta t} =\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\) and 10 iterations, the total runtime of the plain parareal method is \(3.2\times 10^3\) s, and the total runtime of the Procrustes parareal method is \(3.6\times 10^3\) s. As a baseline, the runtime of the sequential fine computation on [0, T] is \(1.2 \times 10^5\) s.
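For reference, the speedups implied by the reported runtimes relative to the sequential fine computation are

$$\begin{aligned} \frac{1.2 \times 10^5}{3.2\times 10^3} = 37.5 \quad \text {(plain parareal)}, \qquad \frac{1.2 \times 10^5}{3.6\times 10^3} \approx 33.3 \quad \text {(Procrustes parareal)}, \end{aligned}$$

and the ratio \(3.6/3.2 = 1.125\) indicates that the Procrustes correction adds about 12.5% to the total parareal runtime in this setting.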
4.3 NN solution map
In this section, we present the setups for learning the solution map \(\Phi ^{\text {NN}}_{\Delta t}\) and study how its quality is affected by different options of training data, loss function, and network architecture. To evaluate performance of a learned \(\Phi ^{\text {NN}}_{\Delta t}\), we generate a 1000-step trajectory from the initial condition (22) by sequential applications of \(\Phi ^{\text {NN}}_{\Delta t}\).
Using the proposed data generation algorithms, we generated two sets of input data, denoted by \(\mathcal {D}_0^{\text {HMC-}{H_0}}\) and \(\mathcal {D}_0^{\text {TrajEnsemble-}{H_0}}\) where \(H_0 = H\left( \textbf{p}_{\text {init}},\textbf{q}_{\text {init}}\right) \). The parameters for data generation algorithms are given in Table 1. Each dataset has 200k examples of inputs \(\textbf{u}_0\). We emphasize that \(\left( \textbf{p}_{\text {init}},\textbf{q}_{\text {init}}\right) \) is not included in either dataset even though \(\textbf{q}_{\text {init}}\) was used in the parameters. This is because both algorithms randomly sample \(\textbf{p}\) given \(\textbf{q}_{\text {init}}\).
Given a \(\mathcal {D}_0\), we generate the full training dataset \(\mathcal {D}\) by propagating each \(\textbf{u}_0\) for 5 steps using the fine solver \(F_{\Delta t}\), and then collecting the input and target sequence pairs:
This leads to two training datasets \(\mathcal {D}^{\text {TrajEnsemble-}{H_0}}\) and \(\mathcal {D}^{\text {HMC-}{H_0}}\).
Let ResNet(L, n) denote a network with L hidden layers and n nodes per hidden layer. We considered several ResNet architectures, including a shallow network ResNet(4, 1000) and a deep network ResNet(75, 200). The two networks have a similar number of trainable parameters, which is around \(3\times 10^6\). Performances of the two networks are similar. Hence, we will just report results for ResNet(4, 1000).
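The quoted parameter count can be checked with a short calculation. The sketch below assumes plain fully connected layers with weights \(W^{(l)}\) and biases \(b^{(l)}\) for \(l=1,...,L+1\): an input layer from dimension \(2d=12\) to n, \(L-1\) hidden-to-hidden layers of size n by n, and an output layer from n back to 12. This layout is an assumption inferred from the description, and the function name is hypothetical.

```python
def resnet_param_count(L, n, dim=12):
    """Trainable parameters of a fully connected net with L hidden layers of width n."""
    first = dim * n + n                  # input layer: weights + biases
    hidden = (L - 1) * (n * n + n)       # hidden-to-hidden layers
    last = n * dim + dim                 # output layer
    return first + hidden + last


shallow = resnet_param_count(4, 1000)    # 3028012
deep = resnet_param_count(75, 200)       # 2979812
```

Under this assumed layout both counts are close to \(3\times 10^6\), in line with the figure stated above.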
The neural networks are implemented using PyTorch. We trained the networks using the mini-batch Adam algorithm [19] with weight decay [20]. To accelerate the training procedure, we used a one-cycle learning rate scheduler [21] that anneals the learning rate from an initial value to some maximum value and then to some minimum value within a fixed number of epochs. We used 10,000 epochs to train a ResNet(4, 1000) and 5000 epochs to train a ResNet(75, 200).
4.3.1 Effects of training data
We trained the shallow network ResNet(4, 1000) with different datasets and with the one-step (\(S=1\)) MSE loss. Figure 6 shows the network trained with \(\mathcal {D}^{\text {HMC-}{H_0}}\) is better than the network trained with \(\mathcal {D}^{\text {TrajEnsemble-}{H_0}}\), especially in terms of the energy stability. We attribute this to the fact that the set of inputs \(\mathcal {D}_0^{\text {HMC-}{H_0}}\) better represents the target distribution. As shown in Fig. 7, the minimum distance from each point along the reference trajectory to the set \(\mathcal {D}_0^{\text {HMC-}{H_0}}\) is on average much smaller than the minimum distance to the set \(\mathcal {D}_0^{\text {TrajEnsemble-}{H_0}}\). Moreover, the minimum distance to \(\mathcal {D}_0^{\text {TrajEnsemble-}{H_0}}\) increases along the trajectory, while the minimum distance to \(\mathcal {D}_0^{\text {HMC-}{H_0}}\) stays stable over time.
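The coverage diagnostic described above, the minimum distance from each trajectory point to a set of training inputs, can be computed as in the following sketch. The Euclidean distance and the tiny 2D data are assumptions for illustration; the metric used for Fig. 7 is not specified here.

```python
import math


def min_distance_to_set(point, dataset):
    """Smallest Euclidean distance from point to any element of dataset."""
    return min(math.dist(point, x) for x in dataset)


def coverage_profile(trajectory, dataset):
    """Minimum distance to the dataset for each point along a trajectory."""
    return [min_distance_to_set(u, dataset) for u in trajectory]


# Toy 2D data (hypothetical).
dataset = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
trajectory = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
profile = coverage_profile(trajectory, dataset)
```

A profile that grows along the trajectory signals that later states fall outside the region covered by the training inputs. For datasets with 200k points, a spatial index or a vectorized distance computation would be preferable to this brute-force loop.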
4.3.2 Effects of sequence length in loss function
We trained the shallow network ResNet(4, 1000) using \(\mathcal {D}^{\text {HMC-}{H_0}}\) and the multi-step MSE loss for different sequence lengths S. Figure 8 shows that a longer sequence length yields slightly better accuracy for the first ten steps. After ten steps, there is no significant difference between the results for different sequence lengths. In Fig. 9, we repeated the comparison for a different initial condition \((\sqrt{2} \textbf{p}_{\text {init}}, \textbf{q}_{\text {init}})\). Note that the corresponding energy level is higher than the level \(H_0\) used to generate the training data. In other words, we are testing the generalization ability of the NN solvers on out-of-distribution examples. Based on the results, the NN solvers achieve accuracy on par with that for in-distribution examples for at least the first few steps. Moreover, we found that longer training sequences result in significantly better generalization ability.
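The size of the energy shift follows directly from the separable form \(H(\textbf{p}, \textbf{q}) = K(\textbf{p}) + U(\textbf{q})\) with \(K(\textbf{p}) = \frac{1}{2} \textbf{p}^T M^{-1} \textbf{p}\). Since K is quadratic in \(\textbf{p}\) and the position is unchanged,

\[
H(\sqrt{2}\, \textbf{p}_{\text {init}}, \textbf{q}_{\text {init}}) = \frac{1}{2} (\sqrt{2}\, \textbf{p}_{\text {init}})^T M^{-1} (\sqrt{2}\, \textbf{p}_{\text {init}}) + U(\textbf{q}_{\text {init}}) = 2 K(\textbf{p}_{\text {init}}) + U(\textbf{q}_{\text {init}}) = H(\textbf{p}_{\text {init}}, \textbf{q}_{\text {init}}) + K(\textbf{p}_{\text {init}}).
\]

The rescaled initial condition therefore exceeds the energy of \((\textbf{p}_{\text {init}}, \textbf{q}_{\text {init}})\) by exactly the initial kinetic energy, which is positive whenever \(\textbf{p}_{\text {init}} \ne 0\) and the masses are positive. Assuming, as the notation suggests, that \((\textbf{p}_{\text {init}}, \textbf{q}_{\text {init}})\) lies on the level set \(H_0\), the test energy is \(H_0 + K(\textbf{p}_{\text {init}})\).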
4.3.3 Effects of loss function metric
We trained the shallow network ResNet(4, 1000) using \(\mathcal {D}^{\text {HMC-}{H_0}}\) and different loss metrics with sequence length \(S=5\). As displayed in Fig. 10, compared to using MSE, using EBE leads to smaller trajectory error and energy error for over 100 steps. In particular, the energy error is not only smaller but also more stable over a long time period.
4.3.4 Comparison with numerical solvers
We present in Table 2 the one-step accuracy and runtime performance of various solvers. The fine solver \(F_{\Delta t}\) is \(\Phi ^{\text {KL8}, h=2^{-18}}_{\Delta t}\) implemented in octuple precision. The NN solver \(\Phi ^{\text {NN}}_{\Delta t}\) is ResNet(4, 1000), trained using \(\mathcal {D}^{\text {HMC-}{H_0}}\) and the multi-step (\(S=5\)) EBE loss. For comparison, we include several numerical solvers: \(\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\) and \(\Phi ^{\text {VV}, h=2^{-14}}_{\Delta t}\), whose one-step trajectory error is comparable to that of \(\Phi ^{\text {NN}}_{\Delta t}\), as well as \(\Phi ^{\text {CSS4}, h=2^{-8}}_{\Delta t}\) and \(\Phi ^{\text {VV}, h=2^{-11}}_{\Delta t}\), whose one-step energy error is comparable to that of \(\Phi ^{\text {NN}}_{\Delta t}\). Here, VV stands for the 2\(^\text {nd}\)-order velocity Verlet scheme.
It can be seen that, with the same level of trajectory error, the numerical solvers achieve lower energy error than \(\Phi ^{\text {NN}}_{\Delta t}\). However, in terms of runtime, \(\Phi ^{\text {NN}}_{\Delta t}\) is as good as \(\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\) and is about 17 times faster than \(\Phi ^{\text {VV}, h=2^{-14}}_{\Delta t}\). We emphasize that the runtime measurements took place in different environments: the NN solver is implemented in Python, and numerical solvers are implemented in Julia. We have fully optimized the Julia code for runtime and memory efficiency. We expect to further optimize the NN implementation for better runtime performance in the future.
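For reference, the velocity Verlet scheme that appears in this comparison can be written in a few lines. The sketch below treats a scalar state with a scalar mass and builds a map over a coarse step \(\Delta t\) from \(\Delta t / h\) substeps of size h. The test problem (a unit harmonic oscillator), the value \(\Delta t = 1\), and the function names are assumptions made for illustration; they are not the FPU setup or the authors' Julia implementation.

```python
def velocity_verlet(p, q, grad_U, h, n_substeps, mass=1.0):
    """Advance (p, q) by n_substeps velocity Verlet steps of size h.

    Scalar state and scalar mass for brevity; grad_U(q) returns dU/dq.
    """
    force = -grad_U(q)
    for _ in range(n_substeps):
        p_half = p + 0.5 * h * force      # half kick
        q = q + h * p_half / mass         # full drift
        force = -grad_U(q)                # one force evaluation per substep
        p = p_half + 0.5 * h * force      # half kick
    return p, q

def energy(p, q):
    # Hypothetical test problem: unit harmonic oscillator, U(q) = q^2 / 2.
    return 0.5 * p * p + 0.5 * q * q

dt, h = 1.0, 2.0 ** -11                   # dt = 1 is an assumed coarse step
p1, q1 = velocity_verlet(1.0, 0.0, lambda q: q, h, round(dt / h))
```

Halving h doubles the number of force evaluations per coarse step, which is why a second-order scheme needs far more work than a higher-order composition method to reach the same trajectory error.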
4.4 NN solution map in parareal iterations
In this section, we present results of using \(\Phi ^{\text {NN}}_{\Delta t}\) as the coarse solver in parareal methods.
We first study the plain parareal method. Figure 11 compares the plain parareal solutions computed with different coarse solvers, namely \(\Phi ^{\text {NN}}_{\Delta t}\), \(\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\), and \(\Phi ^{\text {CSS4}, h=2^{-8}}_{\Delta t}\). Clearly, \(\Phi ^{\text {CSS4}, h=2^{-8}}_{\Delta t}\) performs the worst, as expected, because it is the least accurate of the three solvers. Based on the trajectory errors, we see that using \(\Phi ^{\text {NN}}_{\Delta t}\) as the coarse solver improves accuracy more slowly over the iterations than using \(\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\). Comparing the energy errors, we observe that the stability in energy is degraded less when using \(\Phi ^{\text {NN}}_{\Delta t}\) than when using \(\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\) as the coarse solver (see also the energy profiles at iteration 3 in Fig. 13).
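The plain parareal iteration referred to here is the standard update \(\textbf{u}^{k+1}_{n+1} = C(\textbf{u}^{k+1}_{n}) + F(\textbf{u}^{k}_{n}) - C(\textbf{u}^{k}_{n})\), with coarse solver C and fine solver F. The sketch below applies it to a toy problem; the two solvers are hypothetical stand-ins (exact rotation for the fine solver, the same rotation with a phase error for the coarse one), not the solvers compared in Fig. 11.

```python
import math

def rotate(u, angle):
    """Rotate the phase-space point u = (p, q) by `angle`."""
    p, q = u
    c, s = math.cos(angle), math.sin(angle)
    return (c * p - s * q, s * p + c * q)

def fine(u):
    # Hypothetical fine solver: exact flow of a unit oscillator over dt = 0.5.
    return rotate(u, 0.5)

def coarse(u):
    # Hypothetical coarse solver: same flow with a 4% phase error.
    return rotate(u, 0.48)

def parareal(u0, n_intervals, n_iters):
    """Plain parareal: u[n+1] <- C(u_new[n]) + F(u_old[n]) - C(u_old[n])."""
    u = [u0]
    for _ in range(n_intervals):              # iteration 0: sequential coarse sweep
        u.append(coarse(u[-1]))
    for _ in range(n_iters):
        f_old = [fine(x) for x in u[:-1]]     # parallel across intervals in practice
        c_old = [coarse(x) for x in u[:-1]]
        new = [u0]
        for n in range(n_intervals):          # cheap sequential correction sweep
            c_new = coarse(new[n])
            new.append(tuple(cn + f - co
                             for cn, f, co in zip(c_new, f_old[n], c_old[n])))
        u = new
    return u
```

After k iterations the first k time points coincide with the sequential fine solution up to rounding. The additive correction does not preserve the symplectic structure of the individual solvers, which is consistent with the loss of energy stability discussed above.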
We will now compare different coarse solvers used in the Procrustes parareal method. In Section 4.2, we found the Procrustes parareal method improves accuracy by stabilizing the energy of the solutions. As shown in Figs. 12 and 14, the best improvement is again yielded by \(\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\). There is no significant gain from combining Procrustes parareal with the NN solution map. In fact, the improvement over the Procrustes parareal iterations even deteriorates compared to the improvement over the plain parareal iterations.
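As a reminder of the underlying idea, an orthogonal Procrustes problem seeks the orthogonal map \(\Omega\) that best aligns one set of vectors with another in the least-squares sense; in general dimension its solution is obtained from a singular value decomposition. The sketch below is a deliberately simplified two-dimensional illustration restricted to rotations, where the optimal angle has a closed form. The solvers, the sample states, and all function names are hypothetical, and the construction actually used in Procrustes parareal may differ in what data are collected and how the correction is applied.

```python
import math

def rotate(u, angle):
    """Rotate the phase-space point u = (p, q) by `angle`."""
    p, q = u
    c, s = math.cos(angle), math.sin(angle)
    return (c * p - s * q, s * p + c * q)

def procrustes_rotation_2d(sources, targets):
    """Angle theta minimizing sum_i |R(theta) a_i - b_i|^2 over plane rotations."""
    dot = sum(a[0] * b[0] + a[1] * b[1] for a, b in zip(sources, targets))
    cross = sum(a[0] * b[1] - a[1] * b[0] for a, b in zip(sources, targets))
    return math.atan2(cross, dot)

def fine(u):
    # Hypothetical fine solver for a unit oscillator.
    return rotate(u, 0.50)

def coarse(u):
    # Hypothetical coarse solver that lags in phase.
    return rotate(u, 0.48)

# Stand-in for states collected along parareal iterations (assumption).
states = [(1.0, 0.0), (0.3, -0.8), (-0.5, 0.5)]
theta = procrustes_rotation_2d([coarse(u) for u in states],
                               [fine(u) for u in states])

def corrected_coarse(u):
    # Coarse solver followed by the fitted norm-preserving correction.
    return rotate(coarse(u), theta)
```

In this toy case the fitted rotation recovers the phase lag of 0.02 exactly, so the corrected coarse solver matches the fine solver. Because the correction is a rotation, it changes the phase without changing the norm of the state.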
We speculate that \(\Phi ^{\text {NN}}_{\Delta t}\) does not perform well in parareal iterations because it was not trained using suitable data. As described in Section 3.3, we sampled training data points from the Liouville density, mainly because it is a natural distribution for learning a solution map to be used in a sequential algorithm. In parareal schemes, since the coarse solver is applied differently than in a sequential algorithm, we would need a different data distribution. A similar issue has been investigated in [10] for approximating the correction operator using the NN approach for wave equations. There, the authors demonstrated the importance of using training data closer to the ones encountered in the simulations.
5 Conclusion
In this paper, we presented two data-driven approaches for stabilization of the standard parareal algorithm for long-time computation of highly oscillatory Hamiltonian systems. The Procrustes parareal approach uses solutions computed along the parareal iterations to construct a correction operator to align the “phase” of the fine and coarse solvers. Numerical results for the FPU problem demonstrated that the constructed correction can successfully stabilize the parareal iterations, which helps improve the accuracy of the computed solutions. The second approach we proposed is to use a neural network (NN) to approximate the reference solution map that advances the given state forward in time by a large fixed time step. We developed a sampling algorithm called HMC-\(H_0\) to sample phase space points from the neighborhood of an energy level set. We also designed a loss function which considers the energy-balanced errors between approximated trajectories and reference trajectories. The resulting NN solver for the FPU problem is able to achieve comparable or better runtime performance compared to numerical solvers of similar accuracy. When combined with parareal iterations, solutions computed by the NN solver are not as accurate as solutions computed by a comparable numerical solver, although the NN energy errors are slightly smaller. We think that this may be improved if we train the network using data suitably sampled for the discrete trajectories computed by the parareal schemes.
The FPU problem is too small to reveal the potential benefit of using NNs. It is small enough that optimized high-order symplectic integrators are extremely efficient and accurate. For more complicated problems, where the phase space is large and lower-order accurate methods are the only feasible choice, we think that the investigated NN approach may become viable.
Data availability
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
References
Hairer, E., Lubich, C., Wanner, G.: Geometric numerical integration, 2nd edn. Springer Series in Computational Mathematics, vol. 31, p. 644. Springer, Berlin (2006)
Engquist, B., Tsai, Y.-H.: Heterogeneous multiscale methods for stiff ordinary differential equations. Math. Comput. 74(252), 1707–1742 (2005)
Lions, J., Maday, Y., Turinici, G.: A “parareal” in time discretization of PDE’s. Comptes Rendus de l’Académie des Sciences - Series I - Mathematics 332, 661–668 (2001)
Gander, M.J., Vandewalle, S.: Analysis of the parareal time-parallel time-integration method. SIAM J. Sci. Comput. 29(2), 556–578 (2007)
Ariel, G., Kim, S.J., Tsai, R.: Parareal multiscale methods for highly oscillatory dynamical systems. SIAM J. Sci. Comput. 38(6), 3540–3564 (2016)
Gander, M.J., Hairer, E.: Analysis for parareal algorithms applied to Hamiltonian differential equations. J. Comput. Appl. Math. 259, 2–13 (2014)
Dai, X., Le Bris, C., Legoll, F., Maday, Y.: Symmetric parareal algorithms for Hamiltonian systems. ESAIM: Mathematical Modelling and Numerical Analysis-Modélisation Mathématique et Analyse Numérique 47(3), 717–742 (2013)
Farhat, C., Chandesris, M.: Time-decomposed parallel time-integrators: theory and feasibility studies for fluid, structure, and fluid-structure applications. Int. J. Numer. Meth. Eng. 58(9), 1397–1434 (2003)
Nguyen, H., Tsai, R.: A stable parareal-like method for the second order wave equation. J. Comput. Phys. 405, 109156 (2020)
Nguyen, H., Tsai, R.: Numerical wave propagation aided by deep learning. J. Comput. Phys. 475, 111828 (2023)
Gower, J.C., Dijksterhuis, G.B.: Procrustes problems, vol. 30. Oxford University Press, Oxford (2004)
Clevert, D.-A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv:1511.07289 (2015)
E, W.: Machine learning and computational mathematics. arXiv:2009.14596 (2020)
Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D.: Hybrid Monte Carlo. Phys. Lett. B 195(2), 216–222 (1987). https://doi.org/10.1016/0370-2693(87)91197-X
Fermi, E., Pasta, J., Ulam, S., Tsingou, M.: Studies of the nonlinear problems. Technical report, Los Alamos National Lab. (LANL), Los Alamos, NM (United States) (1955)
Dennis, J.E., Gay, D.M., Welsch, R.E.: Algorithm 573: NL2SOL—an adaptive nonlinear least-squares algorithm [E4]. ACM Transactions on Mathematical Software (TOMS) 7(3), 369–383 (1981). https://doi.org/10.1145/355958.355966
Calvo, M.P., Sanz-Serna, J.M.: The development of variable-step symplectic integrators, with application to the two-body problem. SIAM J. Sci. Comput. 14(4), 936–952 (1993)
Kahan, W., Li, R.-C.: Composition constants for raising the orders of unconventional schemes for ordinary differential equations. Math. Comput. 66(219), 1089–1099 (1997)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv:1412.6980 (2014)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv:1711.05101 (2017)
Smith, L.N., Topin, N.: Super-convergence: very fast training of neural networks using large learning rates. In: Artificial Intelligence and Machine Learning for Multi-domain Operations Applications, vol. 11006, pp. 369–386. SPIE (2019)
Acknowledgements
The authors thank the Texas Advanced Computing Center (TACC) for providing computing resources.
Funding
The authors are partially supported by the National Science Foundation grant DMS-2208504.
Author information
Contributions
R.F.: conceptualization, methodology, software, visualization, writing. R.T.: conceptualization, funding acquisition, methodology, project administration, supervision, writing.
Ethics declarations
Ethical approval
Not applicable
Conflict of interest
The authors declare no competing interests.
Appendix A: Energy profiles of parareal solutions by various coarse solvers
About this article
Cite this article
Fang, R., Tsai, R. Stabilization of parareal algorithms for long-time computation of a class of highly oscillatory Hamiltonian flows using data. Numer Algor 96, 1163–1187 (2024). https://doi.org/10.1007/s11075-024-01826-8
DOI: https://doi.org/10.1007/s11075-024-01826-8