1 Introduction

Hamiltonian systems are ubiquitous in astronomy, molecular dynamics, classical mechanics, and theoretical physics. We concentrate on the separable Hamiltonian case

$$\begin{aligned} H\left( \textbf{p},\textbf{q}\right) = \frac{1}{2} \textbf{p}^T M^{-1} \textbf{p}+ U(\textbf{q}), \quad \textbf{p},\textbf{q}\in \mathbb {R}^{d} \end{aligned}$$
(1)

where \(\textbf{p}\) and \(\textbf{q}\) are, respectively, the generalized momentum and position in a d-dimensional space, M is a diagonal matrix of the masses, and U is a smooth scalar function of the position \(\textbf{q}\). Physically, the Hamiltonian is interpreted as the total energy of a system, consisting of the kinetic energy \(K(\textbf{p}):= \frac{1}{2} \textbf{p}^T M^{-1} \textbf{p}\) and the potential energy \(U(\textbf{q})\). The dynamics of the system is governed by Hamilton’s equations

$$\begin{aligned} \dot{\textbf{q}}= H_\textbf{p}= M^{-1} \textbf{p}, \quad \dot{\textbf{p}} = - H_\textbf{q}= - \nabla U(\textbf{q}). \end{aligned}$$
(2)

Geometric integrators (methods that preserve geometric properties of the exact flow), such as the velocity Verlet method, are frequently used to simulate Hamiltonian flows [1]. It has been shown that preserving geometric structures can greatly improve long-time numerical integration compared with general-purpose methods.

Even with the improved long-time stability and accuracy of geometric integrators, the computational cost remains high for many physical applications. In particular, for systems with multiple time scales, accurate long-time integration is very difficult because the small stepsize required for stable integration of the fast motions leads to a large number of time steps. Computational multiscale algorithms aim at reducing this cost by exploiting the underlying multiscale structures. For example, for systems with a sufficiently wide separation of scales and certain homogeneity and ergodicity properties, the heterogeneous multiscale methods (HMM) can compute the effective systems with significantly reduced complexity [2].

Recently, owing to the increased availability of parallel processors in modern supercomputers, there has been rising interest in developing time-domain parallelization algorithms that reduce the wall-clock computation time for time-dependent problems. The work in this paper relies on the parareal method, a parallel-in-time algorithm introduced by Lions et al. [3]. To set up the method, the time domain [0, T] is divided into N subintervals of length \(\Delta t = T/N\). The parareal method involves two numerical solvers that advance a solution by \(\Delta t\): an efficient but low-fidelity coarse solver, denoted by \(C_{\Delta t}\), and an accurate but expensive fine solver, denoted by \(F_{\Delta t}\). The coarse solutions are iteratively corrected by fine solutions computed in parallel over the subintervals. Formally, letting \(\textbf{u}_n^{(k)}\) denote the solution computed at iteration k and time \(t_n = n\Delta t\), the parareal iterations are given by

$$\begin{aligned} \textbf{u}^{(k+1)}_{n+1}&= C_{\Delta t} \textbf{u}^{(k+1)}_{n} + \left( F_{\Delta t} \textbf{u}^{(k)}_{n} - C_{\Delta t} \textbf{u}^{(k)}_{n}\right) , \\ \textbf{u}^{(k)}_{0}&=\textbf{u}_0, \quad \textbf{u}^{(0)}_{n+1} = C_{\Delta t} \textbf{u}^{(0)}_{n}, \quad k = 0, 1, 2, ...; n = 0, 1, ..., N-1. \nonumber \end{aligned}$$
(3)
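To make the update concrete, the following is a minimal Python sketch of the iteration (3). The callables `coarse` and `fine` are placeholders for \(C_{\Delta t}\) and \(F_{\Delta t}\); in an actual implementation, the fine solves within each iteration are the part distributed across processors.

```python
import numpy as np

def parareal(u0, coarse, fine, N, K):
    """Sketch of the plain parareal iteration (3).

    coarse(u) and fine(u) advance a state vector u by one macro step
    Delta t; u0 is the initial state.
    """
    u = np.empty((K + 1, N + 1, len(u0)))
    u[0, 0] = u0
    for n in range(N):                      # iteration 0: sequential coarse sweep
        u[0, n + 1] = coarse(u[0, n])
    for k in range(K):
        # These solves are independent across n and can run in parallel.
        F = [fine(u[k, n]) for n in range(N)]
        C_old = [coarse(u[k, n]) for n in range(N)]
        u[k + 1, 0] = u0
        for n in range(N):                  # sequential correction sweep
            u[k + 1, n + 1] = coarse(u[k + 1, n]) + F[n] - C_old[n]
    return u
```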

Ideally, with a sufficiently accurate coarse solver, the iterations quickly converge to the sequential solution computed by the fine solver, \(F_{\Delta t}^n \textbf{u}_0\). However, a closer analysis reveals that convergence relies on the stability of the parareal iterations, and the standard parareal method is only stable for dissipative problems [4, 5]. For oscillatory and hyperbolic problems (such as Hamiltonian systems), the standard parareal scheme is known to perform poorly because convergence restricts the admissible integration time [6]. More specifically, let \(\textbf{e}^{(k)}_{n}:=\textbf{u}^{(k)}_{n} - F_{\Delta t}^n \textbf{u}_0 \). The bound on the amplification \(\Vert \textbf{e}^{(k+1)}_{n}\Vert /\Vert \textbf{e}^{(k)}_{n}\Vert \) depends on the sum \(\sum _{j=0}^{n-k-2}{\left\Vert C_{\Delta t}\right\Vert ^j}\). For dissipative problems, \(\left\Vert C_{\Delta t}\right\Vert < 1\), so the sum can be bounded by a constant independent of n. In contrast, for purely hyperbolic problems, \(\left\Vert C_{\Delta t}\right\Vert \) is close to 1 and the sum grows proportionally to n, which causes the iterations to become unstable.

In recent years, many efforts have been devoted to developing stable parareal schemes for oscillatory and hyperbolic problems. For example, Dai et al. [7] proposed a symmetric variant of the parareal scheme and coupled it with projections onto the constant-energy manifold. Another class of methods makes corrections to the parareal solutions using data. We review more of these works in Section 2.

In this paper, we focus on multiscale Hamiltonian systems where \(H = H_\epsilon \). Here, \(\epsilon \) is a small parameter indicating the small time or length scale. Our aim is to improve the computational efficiency of long-time simulations of multiscale Hamiltonian systems by leveraging recent advances in machine learning (ML) and parallel-in-time algorithms. Specifically, we develop data-driven methods to stabilize the standard parareal scheme. Our idea is to use an improved solver \(\Phi ^{\theta , k}_{\Delta t}\) in place of \(C_{\Delta t}\) in (3). The superscript \(\theta \) in \(\Phi ^{\theta , k}_{\Delta t}\) denotes the unknown parameters defining the solver. The superscript k indicates that this solver may depend on the iteration. We propose two approaches to construct \(\Phi ^{\theta , k}_{\Delta t}\):

  1.

    Enhance an existing coarse solver \(C_{\Delta t}\) through a correction operator that aligns the “phase” in the coarse solutions \(C_{\Delta t} \textbf{u}^{(k)}_n\) and fine solutions \(F_{\Delta t} \textbf{u}^{(k)}_n\) for each iteration. In other words, \(\Phi ^{\theta , k}_{\Delta t}:= \Psi ^{(k)}_{\Delta t} \circ C_{\Delta t} \approx F_{\Delta t}\), where \(\Psi ^{(k)}_{\Delta t}\) is the correction operator constructed from data collected online during parareal iterations. Because \(\Psi ^{(k)}_{\Delta t}\) is obtained by solving an orthogonal Procrustes problem, this approach is named the Procrustes parareal method. The Procrustes parareal iteration is given by

    $$\begin{aligned} \textbf{u}^{(k+1)}_{n+1} = \Psi ^{(k)}_{\Delta t} \circ C_{\Delta t} \textbf{u}^{(k+1)}_{n} + \left( F_{\Delta t} \textbf{u}^{(k)}_{n} - \Psi ^{(k)}_{\Delta t} \circ C_{\Delta t} \textbf{u}^{(k)}_{n}\right) . \end{aligned}$$
    (4)
  2.

    Approximate the fine solver (namely, the solution map with a fixed stepsize \(\Delta t\)) using a neural network (NN), i.e., \(\Phi ^{\theta , k}_{\Delta t}:= \Phi ^{\text {NN}}_{\Delta t} \approx F_{\Delta t}\), where \(\Phi ^{\text {NN}}_{\Delta t}\) stands for an NN solver. Unlike in the Procrustes parareal approach, the NN solver is constructed using offline training data. For this approach, we address the issues of suitable training data generation and loss function design. The parareal iteration with an NN solver is as follows:

    $$\begin{aligned} \textbf{u}^{(k+1)}_{n+1} = \Phi ^{\text {NN}}_{\Delta t} \textbf{u}^{(k+1)}_{n} + \left( F_{\Delta t} \textbf{u}^{(k)}_{n} - \Phi ^{\text {NN}}_{\Delta t} \textbf{u}^{(k)}_{n}\right) . \end{aligned}$$
    (5)

The paper is organized as follows. Section 2 presents the Procrustes parareal approach. Section 3 presents the neural network approach for approximating the solution map. Section 4 presents a case study for the Fermi-Pasta-Ulam (FPU) problem. We then conclude our findings in Section 5.

2 Enhancing the coarse solvers by data

To address the instability of the standard parareal iterations, several methods have been proposed that introduce a correction to bridge the gap between the fine and coarse solutions. For example, Farhat and Chandesris [8] used a Newton-type iteration to reduce the jumps between the fine and coarse solutions. In [5], Ariel et al. proposed the \(\theta \)-parareal scheme, which uses an interpolation-based linear operator to enhance the coarse solver for oscillatory systems. In [9], Nguyen and Tsai focused on second-order wave equations and developed a correction operator based on minimizing the wave energy residual between the fine and coarse solutions. The resulting correction successfully stabilizes the parareal iterations by aligning the “phase” of the wave fields computed by the fine and coarse solvers. Later, the same authors proposed in [10] a deep learning approach to enhance the coarse solver and reduce the “phase” errors in wave propagation.

The success of this approach for wave equations in [9] directly inspired the development of a similar approach for a class of Hamiltonian systems in which the notion of “phase” can be suitably defined.

In this section, we first provide our definition of “phase” for Hamiltonian systems, and then introduce the procedure for obtaining a correction operator \(\Psi ^{(k)}_{\Delta t}\) from the data computed in previous iterations:

$$\begin{aligned} \{ C_{\Delta t} \textbf{u}^{(k^\prime )}_n, F_{\Delta t} \textbf{u}^{(k^\prime )}_n\} \quad n=0, 1, \cdots , N-1; \; k^\prime =0, 1, \cdots , k-1. \end{aligned}$$

2.1 A practical notion of “phase” for Hamiltonian systems

For integrable Hamiltonian systems such as harmonic oscillators or Kepler systems, the “phase” can be naturally defined as the angle variables from the action-angle coordinates. For non-integrable systems, such as the FPU problem, where action-angle coordinates are not available, we need alternative definitions for “phase.”

Because phase is an angle-like object, for a separable Hamiltonian (1) it is natural to consider a transform function \(\Lambda \) that maps each energy level set to a hypersphere whose squared radius satisfies

$$\begin{aligned} \left\Vert \Lambda \left( [\textbf{p}, \textbf{q}]\right) \right\Vert _2^2=H\left( \textbf{p}, \textbf{q}\right) + \text {constant}. \end{aligned}$$
(6)

This way, we can define the “phase” difference between \([\textbf{p}_1, \textbf{q}_1]\) and \([\textbf{p}_2, \textbf{q}_2]\) as the angle between the transformed vectors \(\Lambda \left( [\textbf{p}_1, \textbf{q}_1]\right) \) and \(\Lambda \left( [\textbf{p}_2, \textbf{q}_2]\right) \). We call \(\Lambda \) the energy transform because the \(l_2\) norm of the transformed vector is related to the energy.

The specific forms of \(\Lambda \) and its pseudo-inverse \(\Lambda ^\dagger \) depend on the Hamiltonian. Notably, not all Hamiltonian functions admit a valid definition of \(\Lambda \). For example, for a 1D harmonic oscillator whose Hamiltonian is \(H(p,q) = \frac{1}{2}p^2 + \frac{1}{2}q^2\),

$$\begin{aligned} \Lambda \left( \begin{bmatrix}p \\ q\end{bmatrix}\right) := \begin{bmatrix}p / \sqrt{2} \\ q /\sqrt{2} \end{bmatrix}, \quad \Lambda ^{\dagger }\left( \begin{bmatrix}\tilde{p} \\ \tilde{q}\end{bmatrix}\right) := \begin{bmatrix}\sqrt{2} \tilde{p}\\ \sqrt{2} \tilde{q}\end{bmatrix}. \end{aligned}$$
(7)
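As an illustration, here is a short Python sketch of this transform and of the induced “phase” difference; the function names are ours.

```python
import numpy as np

def Lam(u):                    # energy transform (7); u = [p, q]
    return u / np.sqrt(2.0)

def Lam_dagger(v):             # its (pseudo-)inverse
    return v * np.sqrt(2.0)

def phase_difference(u1, u2):
    """Angle between the transformed vectors Lam(u1) and Lam(u2)."""
    v1, v2 = Lam(u1), Lam(u2)
    c = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.arccos(np.clip(c, -1.0, 1.0))

u = np.array([1.0, 0.0])       # H(p, q) = 1/2 p^2 + 1/2 q^2 = 1/2
assert np.isclose(Lam(u) @ Lam(u), 0.5)   # squared norm recovers H (constant = 0)
```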

For a 2D Kepler problem where \(H(\textbf{p}, \textbf{q}) = \frac{1}{2} \textbf{p}^T \textbf{p}- \frac{1}{\left\Vert \textbf{q}\right\Vert }\), \(\Lambda \) does not exist because of the negative potential term.

In this paper, we will work with the FPU problem and will show the corresponding definition of \(\Lambda \) in Section 4.

2.2 The Procrustes parareal

In the following, we use \(\textbf{u}\) to represent the concatenated vector \([\textbf{p}, \textbf{q}]\). Assuming \(\Lambda \) and \(\Lambda ^{\dagger }\) are both known, we define the correction operator as

$$\begin{aligned} \Psi ^{(k)}_{\Delta t} := \Lambda ^{\dagger } \circ \Omega ^{(k)}_{\Delta t} \circ \Lambda , \end{aligned}$$
(8)

where \(\Omega ^{(k)}_{\Delta t}\) is an orthogonal transformation to be determined. Here we see the advantage of defining \(\Lambda (\textbf{u})\): any orthogonal transformation of \(\Lambda (\textbf{u})\) preserves \(H(\textbf{u})\), and hence the correction operator preserves the energy.

The Procrustes parareal method is given as follows (function composition symbols are omitted for brevity):

$$\begin{aligned} \textbf{u}^{(k+1)}_{n+1}&= \Lambda ^{\dagger } \Omega ^{(k)}_{\Delta t} \Lambda C_{\Delta t} \textbf{u}^{(k+1)}_{n} + \left( F_{\Delta t} \textbf{u}^{(k)}_{n} - \Lambda ^{\dagger } \Omega ^{(k)}_{\Delta t} \Lambda C_{\Delta t} \textbf{u}^{(k)}_{n}\right) , \\ \textbf{u}^{(k)}_{0}&=\textbf{u}_0, \quad \textbf{u}^{(0)}_{n+1} = C_{\Delta t} \textbf{u}^{(0)}_{n}, \quad k = 0, 1, 2, ...; n = 0, 1, ..., N-1. \nonumber \end{aligned}$$
(9)

Here, \(\Omega ^{(k)}_{\Delta t}\) is obtained by solving the orthogonal Procrustes problem

$$\begin{aligned} \Omega ^{(k)}_{\Delta t}&:= \underset{\Omega }{{{\,\mathrm{arg\,min}\,}}}\ \sum _{n=0}^{N-1} \left\Vert f_n - \Omega g_n\right\Vert _2^2 \quad \text {s.t.} \quad \Omega \Omega ^T = \Omega ^T\Omega = I, \\ f_n&:= \Lambda F_{\Delta t} \textbf{u}^{(k)}_n, \nonumber \\ g_n&:= \Lambda C_{\Delta t} \textbf{u}^{(k)}_n. \nonumber \end{aligned}$$
(10)

The geometric interpretation is to minimize the sum of the phase errors between fine and coarse solutions computed along the trajectory from the last iteration. Hence, \(\Omega ^{(k)}_{\Delta t}\) is referred to as the phase corrector.

We follow the standard way to solve the orthogonal Procrustes problem, which uses the singular value decomposition (SVD) of the correlation matrix \(M:=F G^T\). Here, \(F:= [f_0 \; f_1 \; \cdots \; f_{N-1}]\), \(G:= [g_0 \; g_1 \; \cdots \; g_{N-1}]\). Let \(U\Sigma V^T\) be the SVD of M. If M has full rank, then the minimizer is uniquely \(\Omega _* = UV^T\). We refer readers to [11] for more details of the Procrustes problem.
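A minimal NumPy sketch of this solve (the computations reported later are done in Julia):

```python
import numpy as np

def phase_corrector(F, G):
    """Solve the orthogonal Procrustes problem (10).

    Columns of F are f_n = Lambda(F_dt u_n); columns of G are
    g_n = Lambda(C_dt u_n). Returns the orthogonal Omega minimizing
    sum_n ||f_n - Omega g_n||_2^2.
    """
    M = F @ G.T                    # correlation matrix M = F G^T
    U, _, Vt = np.linalg.svd(M)    # M = U Sigma V^T
    return U @ Vt                  # minimizer Omega_* = U V^T
```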

The pseudo-code for the Procrustes parareal method is provided in Algorithm 1.

Algorithm 1 Procrustes parareal method

3 Neural network approximation of the solution map

In this section, we present our second data-driven approach, which is to use a neural network (NN) to approximate the solution map \(F_{\Delta t}\). To construct the NN solver \(\Phi ^{\text {NN}}_{\Delta t}\), the main task is to solve an optimization problem

$$\begin{aligned} \Phi ^{\text {NN}}_{\Delta t} = \underset{\Phi ^{\text {NN}}_{\Delta t} \in X}{{{\,\mathrm{arg\,min}\,}}} \frac{1}{|\mathcal {D}_0 |} \sum _{\textbf{u}_0 \in \mathcal {D}_0} l\left( \textbf{u}_0, \Phi ^{\text {NN}}_{\Delta t}, F_{\Delta t}\right) , \end{aligned}$$
(11)

where X is a function space determined by the network architecture, \(\mathcal {D}_0\) a set of input data points, and \(l\left( \textbf{u}_0, \Phi ^{\text {NN}}_{\Delta t}, F_{\Delta t}\right) \) the misfit term for each data point \(\textbf{u}_0\) given the reference solution map \(F_{\Delta t}\) and the approximated map \(\Phi ^{\text {NN}}_{\Delta t}\). We describe our setup for each of these components as follows.

3.1 Choice of neural network architecture

For simplicity, we choose fully connected residual networks (ResNets). Compared to a regular multilayer perceptron, the residual network adds skip connections between pairs of hidden layers.

Let L denote the number of hidden layers and n the number of nodes per hidden layer. The layer outputs are defined as follows:

$$\begin{aligned} \text {input layer:} \quad y^{(0)}&:= x \in \mathbb {R}^{2d}, \nonumber \\ \text {1st hidden layer:} \quad y^{(1)}&:= \sigma (W^{(1)}y^{(0)} + b^{(1)}) \in \mathbb {R}^{n}, \nonumber \\ \text {\(l\)-th hidden layer:} \quad y^{(l)}&:= y^{(l-1)} + \frac{1}{L} \sigma (W^{(l)}y^{(l-1)} + b^{(l)}) \in \mathbb {R}^{n}, \quad l=2,...,L, \nonumber \\ \text {output layer:} \quad y^{(L+1)}&:= W^{(L+1)}y^{(L)} + b^{(L+1)} \in \mathbb {R}^{2d}. \end{aligned}$$
(12)

Here, \(W^{(l)}\) and \(b^{(l)}\), \(l=1,...,L+1\), are the weights and biases to be determined through the training procedure. We use the Exponential Linear Unit (ELU) [12] as the nonlinear activation function \(\sigma \). Note that we adopt a scaling factor of 1/L for the hidden layers with skip connections. This technique was proposed in [13] to make the network performance more robust to hyperparameter changes.
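The following PyTorch sketch implements (12); the class and argument names are ours, and in the notation ResNet(L, n) used later, ResNet(4, 1000) corresponds to `ResNet(dim_in, width=1000, L=4)`.

```python
import torch
import torch.nn as nn

class ResNet(nn.Module):
    """Fully connected ResNet (12) with 1/L-scaled skip connections."""
    def __init__(self, dim_in, width, L):
        super().__init__()
        self.L = L
        self.first = nn.Linear(dim_in, width)                 # 1st hidden layer
        self.hidden = nn.ModuleList(
            [nn.Linear(width, width) for _ in range(L - 1)])  # l = 2, ..., L
        self.last = nn.Linear(width, dim_in)                  # linear output layer
        self.act = nn.ELU()

    def forward(self, x):
        y = self.act(self.first(x))
        for layer in self.hidden:
            y = y + self.act(layer(y)) / self.L               # skip connection
        return self.last(y)
```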

3.2 Design of misfit term

The function to be approximated is a solution map that takes one phase-space state to another. This allows us to use a sequence of successive time steps to construct the misfit term. Suppose that for an input \(\textbf{u}_0\) and a sequence length S, we generate \(\{ \textbf{u}_i \}_{1\le i \le S}\), \(\textbf{u}_i = \left( F_{\Delta t}\right) ^{i} \textbf{u}_0 \), as the target sequence and \(\{ \tilde{\textbf{u}}_i \}_{1\le i \le S}\), \(\tilde{\textbf{u}}_i = \left( \Phi ^{\text {NN}}_{\Delta t}\right) ^{i} \textbf{u}_0 \), as the approximated sequence. The misfit is then computed between the approximated sequence and the target sequence:

$$\begin{aligned} l\left( \textbf{u}_0, \Phi ^{\text {NN}}_{\Delta t}, F_{\Delta t}\right) = \frac{1}{S} \sum _{i=1}^{S} \text {diff} \left( \textbf{u}_i, \tilde{\textbf{u}}_i\right) . \end{aligned}$$
(13)

We remark that because the network is applied recursively to obtain \(\tilde{\textbf{u}}_i\), this multi-step loss essentially makes training the network akin to training a recurrent neural network.

There are several ways to measure the difference between \(\textbf{u}_i\) and \(\tilde{\textbf{u}}_i\). One common approach is to use the Euclidean metric of \(\mathbb {R}^{2d}\), known as the mean squared error:

$$\begin{aligned} \text {MSE}\left( \textbf{u}_i, \tilde{\textbf{u}}_i\right) = \left\Vert \textbf{u}_i -\tilde{\textbf{u}}_i\right\Vert _2^2. \end{aligned}$$
(14)

The Euclidean metric puts equal weights on the \(\textbf{p}\) and \(\textbf{q}\) components of \(\textbf{u}\). While minimizing the mean squared error aligns with the goal of reducing the trajectory error, it often leads to imbalanced energy errors because the Hamiltonian function does not always weight \(\textbf{p}\) and \(\textbf{q}\) similarly. Therefore, to balance the components naturally according to the Hamiltonian, we adopt the energy transform \(\Lambda \) defined in Section 2.1 and define the energy balanced error

$$\begin{aligned} \text {EBE}\left( \textbf{u}_i, \tilde{\textbf{u}}_i\right) = \left\Vert \Lambda \textbf{u}_i - \Lambda \tilde{\textbf{u}}_i\right\Vert _2^2. \end{aligned}$$
(15)

Putting these together, if we use the energy balanced error, the misfit term for an initial state \(\textbf{u}_0\) and a sequence length S is given by

$$\begin{aligned} l\left( \textbf{u}_0, \Phi ^{\text {NN}}_{\Delta t}, F_{\Delta t}\right) = \frac{1}{S} \sum _{i=1}^{S} \left\Vert \Lambda \left( \left( F_{\Delta t}\right) ^{i} \textbf{u}_0\right) - \Lambda \left( \left( \Phi ^{\text {NN}}_{\Delta t}\right) ^i \textbf{u}_0\right) \right\Vert _2^2. \end{aligned}$$
(16)
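A PyTorch sketch of this misfit is given below; `target_seq[i]` is assumed to hold the precomputed fine solution \((F_{\Delta t})^{i+1}\textbf{u}_0\) for a batch of inputs, and `Lam` is a differentiable implementation of the energy transform.

```python
def multistep_ebe_loss(phi_nn, u0, target_seq, Lam, S=5):
    """Multi-step energy balanced loss (16), averaged over a batch."""
    loss, u = 0.0, u0
    for i in range(S):
        u = phi_nn(u)                                  # recursive NN solver
        diff = Lam(target_seq[i]) - Lam(u)
        loss = loss + (diff ** 2).sum(dim=-1).mean()   # EBE at step i + 1
    return loss / S
```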

3.3 Generation of input data set

The next problem is to generate a proper set of initial conditions \(\textbf{u}_0\) for training. Unlike in the Procrustes parareal approach, where the data are collected online during parareal iterations, here, we have to generate training data offline to construct an effective NN solver.

To understand what a reasonable sampling distribution in the phase space is, we regard the misfit term in the optimization problem (11) as the mean error over a continuous distribution of \(\textbf{u}\) in the phase space:

$$\begin{aligned} \frac{1}{|\mathcal {D}_0 |} \sum _{\textbf{u}\in \mathcal {D}_0} l\left( \textbf{u}, \Phi ^{\text {NN}}_{\Delta t}, F_{\Delta t}\right) \approx \int _{\mathbb {R}^{2d}} l\left( \textbf{u}, \Phi ^{\text {NN}}_{\Delta t}, F_{\Delta t}\right) d\mu (\textbf{u}). \end{aligned}$$
(17)

It is thus natural to consider using a relevant invariant measure of the Hamiltonian flow for \(\mu \).

Suppose we are interested in simulations of the Hamiltonian flow with a fixed total energy. Then, we should consider sampling an invariant measure on an energy level set

$$\begin{aligned} \mathcal {M}_{H_0} := \left\{ (\textbf{p},\textbf{q}) \in \mathbb {R}^{2d} \mid H(\textbf{p},\textbf{q}) = H_0\right\} . \end{aligned}$$
(18)

However, the numerical approximations may not preserve the total energy. First, the accurate fine solver used to approximate the true solution map is a symplectic integrator, for which exact energy preservation is not possible. In addition, there is no guarantee of energy preservation by the general NN solver considered in this paper.

Notice that by Liouville’s theorem, the Hamiltonian flow preserves phase space volume. Then, we can construct an invariant measure in \(\mathbb {R}^{2d}\) that concentrates on the chosen energy level \(H_0\) as follows, using the coarea formula:

$$\begin{aligned} \begin{aligned}&\int _{\mathbb {R}^{2d}} l\left( \textbf{u}, \Phi ^{\text {NN}}_{\Delta t}, F_{\Delta t}\right) e^{-(H(\textbf{u})-H_0)^2/2\sigma ^2}\,d\textbf{u}\\&= \int _{\mathbb {R}} e^{-(E-H_0)^2/2\sigma ^2} \int _{\{H(\textbf{u})=E\}} l\left( \textbf{u}, \Phi ^{\text {NN}}_{\Delta t}, F_{\Delta t}\right) \frac{dS}{|\nabla H(\textbf{u})|} \,dE. \end{aligned} \end{aligned}$$
(19)

In other words, one can separately sample the invariant densities on the energy level sets and the Gaussian density in the directions normal to the level sets.

Motivated by this observation, we propose a novel sampling algorithm called HMC-\(H_0\). The name comes from its resemblance to the Hamiltonian Monte Carlo (HMC) algorithm [14]. Starting with \(\textbf{q}=\textbf{q}_0\), we generate a chain of points by repeating the following two steps:

  1.

    Momentum refreshment: randomly sample \(\textbf{p}\) from the hypersphere defined by

    $$\begin{aligned} \left\{ \textbf{p}\in \mathbb {R}^d \mid \textbf{p}^T M^{-1} \textbf{p}= 2 \left( H_0 - U(\textbf{q}) \right) \right\} \end{aligned}$$
    (20)
  2.

    Time integration: \(\left( \textbf{p}, \textbf{q}\right) \leftarrow F_{\delta t}\left( \textbf{p}, \textbf{q}\right) \)

Note that our approach is different from the original HMC algorithm in the momentum refreshment step. In the original HMC, \(\textbf{p}\) is randomly sampled from a Gaussian distribution independent of current \(\textbf{q}\), whereas in our approach, the distribution depends on \(\textbf{q}\) and a fixed \(H_0\).
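A minimal NumPy sketch of the chain is given below. Here, the uniform draw on the hypersphere (20) is realized by rescaling an isotropic Gaussian sample, and we assume \(H_0 \ge U(\textbf{q})\) holds along the chain; `fine_step` stands for \(F_{\delta t}\).

```python
import numpy as np

def hmc_h0(q0, U, M_diag, H0, fine_step, num_samples, rng=None):
    """Sketch of HMC-H0: momentum refreshment on (20) + time integration."""
    if rng is None:
        rng = np.random.default_rng()
    samples, q = [], np.array(q0, dtype=float)
    for _ in range(num_samples):
        # Momentum refreshment: p on { p : p^T M^{-1} p = 2 (H0 - U(q)) }.
        xi = rng.standard_normal(q.shape)
        xi /= np.linalg.norm(xi)
        p = np.sqrt(M_diag) * xi * np.sqrt(2.0 * (H0 - U(q)))
        # Time integration with the fine solver.
        p, q = fine_step(p, q)
        samples.append(np.concatenate([p, q]))
    return samples
```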

We shall compare the HMC-\(H_0\) algorithm to the following naive approach, which combines random sampling in the momentum space with generating trajectories in the phase space using the flow. We first sample a set of momenta; then, using the sampled momenta and \(\textbf{q}_0\), we generate an ensemble of trajectories by flowing the points for a duration of time. The points along the trajectories are collected. This is a naive attempt to sample the Liouville measures on the energy level sets, assuming ergodicity of the flows. We call this algorithm TrajEnsemble-\(H_0\).

Full descriptions of the algorithms are given in Algorithms 2 and 3. For both algorithms, we can leverage parallel computation to obtain a large number of data samples.

Algorithm 2 HMC-\(H_0\)

Algorithm 3 TrajEnsemble-\(H_0\)

4 Case study: the Fermi-Pasta-Ulam problem

We consider the Fermi-Pasta-Ulam (FPU) problem as a model problem to demonstrate properties of our proposed methods. All code and data accompanying the experiments are publicly available at https://github.com/tsai-lab-ut/multiscale-hamiltonian.

First studied in 1955, the FPU problem [15] is a simple yet important model in nonlinear physics that exhibits unexpected dynamical behaviors after long enough integration times. The model involves a chain of particles connected by springs that obey Hooke’s law with a weak nonlinear perturbation. Here, we adopt a version of the problem presented in [1]. Suppose there are 2m mass points connected by alternating soft nonlinear springs and stiff linear springs. The variables \(q_1, \cdots , q_{2m}\) (with \(q_0 = q_{2m+1} = 0\)) represent the displacements of the mass points from equilibrium, and \(p_i = d q_i/dt\) represent the velocities. The Hamiltonian of this system is given by

$$\begin{aligned} H(\textbf{p},\textbf{q}) = \frac{1}{2} \sum _{i=1}^{m} \left( p_{2i-1}^2 + p_{2i}^{2}\right) + \frac{\omega ^2}{4} \sum _{i=1}^{m} \left( q_{2i}-q_{2i-1}\right) ^2 +\sum _{i=0}^{m} \left( q_{2i+1}-q_{2i}\right) ^4, \end{aligned}$$
(21)

where \(\omega \gg 1\) is the frequency of the stiff linear springs.
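For concreteness, here is a NumPy sketch of (21) with the fixed-end convention \(q_0 = q_{2m+1} = 0\); our experiments use Julia implementations, so this version is for illustration only.

```python
import numpy as np

def fpu_hamiltonian(p, q, omega=300.0):
    """FPU Hamiltonian (21); p, q have length 2m, ends clamped to zero."""
    qq = np.concatenate(([0.0], q, [0.0]))   # prepend q_0, append q_{2m+1}
    stiff = qq[2::2] - qq[1:-1:2]            # q_{2i} - q_{2i-1}, i = 1, ..., m
    soft = qq[1::2] - qq[0::2]               # q_{2i+1} - q_{2i}, i = 0, ..., m
    return 0.5 * np.sum(p**2) + 0.25 * omega**2 * np.sum(stiff**2) + np.sum(soft**4)
```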

The dynamics of such a system exhibits distinct behaviors on several time scales. On the smallest time scale \(\mathcal {O}(\omega ^{-1})\), the stiff linear springs show almost-harmonic oscillations with period close to \(\pi /\omega \). On the time scale \(\mathcal {O}(\omega ^{0})\), the motion of the nonlinear springs becomes apparent. On the time scale \(\mathcal {O}(\omega )\), there is a slow energy exchange among the stiff springs. An illustration of the motion on different time scales can be found in Section XIII.2 of [1].

In our experiments, we aim to obtain stable simulations of the system on a time scale of \(\mathcal {O}(\omega )\) using solvers with \(\Delta t\) on the time scale of \(\mathcal {O}(\omega ^{0})\), i.e., \(\Delta t=1.0\). We use \(m=3\) (hence \(d=6\) degrees of freedom) and \(\omega =300\). This is a challenging regime because the separation of characteristic time scales is large, i.e., from 1/300 to 300. We run the algorithms from the initial condition

$$\begin{aligned} \textbf{p}_{\text {init}} = \begin{bmatrix} 0&\sqrt{2}&0&0&0&0 \end{bmatrix}^T, \quad \textbf{q}_{\text {init}} = \begin{bmatrix} \frac{1-\omega ^{-1}}{\sqrt{2}}&\frac{1+\omega ^{-1}}{\sqrt{2}}&0&0&0&0 \end{bmatrix}^T. \end{aligned}$$
(22)

The corresponding energy is

$$\begin{aligned} H(\textbf{p}_{\text {init}},\textbf{q}_{\text {init}}) = 2 + 3\omega ^{-2} + \frac{1}{2}\omega ^{-4}. \end{aligned}$$
(23)

Given a reference solution \(\textbf{u}^{\text {ref}}=\left( \textbf{p}^{\text {ref}}, \textbf{q}^{\text {ref}}\right) \) and a computed solution \(\textbf{u}=\left( \textbf{p},\textbf{q}\right) \), we shall report the trajectory error

$$\begin{aligned} \text {traj err} = \left\Vert \textbf{u}-\textbf{u}^{\text {ref}}\right\Vert = \sqrt{\left\Vert \textbf{p}-\textbf{p}^{\text {ref}}\right\Vert ^2 + \left\Vert \textbf{q}-\textbf{q}^{\text {ref}}\right\Vert ^2}, \end{aligned}$$
(24)

and the energy error

$$\begin{aligned} \text {energy err} = \frac{\left|H(\textbf{p},\textbf{q}) - H(\textbf{p}^{\text {ref}}, \textbf{q}^{\text {ref}})\right|}{\left|H\left( \textbf{p}^{\text {ref}}, \textbf{q}^{\text {ref}}\right) \right|}. \end{aligned}$$
(25)

4.1 Definition of the energy transform

We first present our definition of the energy transform \(\Lambda \), since both the Procrustes parareal and the NN solver will rely on this function. Based on the Hamiltonian (21), we define

$$\begin{aligned} \Lambda _1: \;&\textbf{p}\in \mathbb {R}^{2m} \mapsto \frac{\textbf{p}}{\sqrt{2}} \in \mathbb {R}^{2m}, \end{aligned}$$
(26)
$$\begin{aligned} \Lambda _2: \;&\textbf{q}\in \mathbb {R}^{2m} \mapsto \begin{bmatrix} \textbf{dq}_{\text {stiff}} \\ \textbf{dq}_{\text {soft}} \end{bmatrix} \in \mathbb {R}^{2m+1}, \end{aligned}$$
(27)

where

$$\begin{aligned} \textbf{dq}_{\text {stiff}} := \begin{bmatrix} \frac{\omega }{2} \left( q_2 - q_1\right) \\ \vdots \\ \frac{\omega }{2} \left( q_{2i} - q_{2i-1}\right) \\ \vdots \\ \frac{\omega }{2} \left( q_{2m} - q_{2m-1}\right) \end{bmatrix} \in \mathbb {R}^{m}, \quad \textbf{dq}_{\text {soft}} := \begin{bmatrix} \left( q_1 - q_0\right) ^2 \\ \vdots \\ \left( q_{2i+1} - q_{2i}\right) ^2 \\ \vdots \\ \left( q_{2m+1} - q_{2m}\right) ^2 \\ \end{bmatrix} \in \mathbb {R}^{m+1}. \end{aligned}$$
(28)

Then, \(\Lambda \) can be written as

$$\begin{aligned} \Lambda : \begin{bmatrix} \textbf{p}\\ \textbf{q}\end{bmatrix} \in \mathbb {R}^{4m} \mapsto \begin{bmatrix} \Lambda _1(\textbf{p}) \\ \Lambda _2(\textbf{q}) \end{bmatrix} \in \mathbb {R}^{4m+1}. \end{aligned}$$
(29)

One can check that the squared \(l_2\) norm of \(\Lambda \left( [\textbf{p}, \textbf{q}]\right) \) recovers (21).
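A NumPy sketch of (26)-(29), reusing the index convention of the `fpu_hamiltonian` sketch above:

```python
import numpy as np

def Lam_fpu(u, omega=300.0):
    """Energy transform (29) for the FPU problem; u = [p, q], len(u) = 4m."""
    d = len(u) // 2
    p, q = u[:d], u[d:]
    qq = np.concatenate(([0.0], q, [0.0]))
    dq_stiff = 0.5 * omega * (qq[2::2] - qq[1:-1:2])   # (28), stiff springs
    dq_soft = (qq[1::2] - qq[0::2]) ** 2               # (28), soft springs
    return np.concatenate((p / np.sqrt(2.0), dq_stiff, dq_soft))

# Sanity check: np.dot(Lam_fpu(u), Lam_fpu(u)) recovers H(p, q) in (21).
```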

To define the pseudo-inverse \(\Lambda ^{\dagger }\), the main task is to recover \(\textbf{q}\) from a given vector \(\Lambda _2(\textbf{q})\). This can be done by solving a nonlinear least squares problem: given \(\tilde{\textbf{dq}}_{\text {stiff}}, \tilde{\textbf{dq}}_{\text {soft}}\), find

$$\begin{aligned} \textbf{q}_* = \underset{\textbf{q}}{{{\,\mathrm{arg\,min}\,}}}\ \left\Vert \Lambda _2\left( \textbf{q}\right) - \begin{bmatrix} \tilde{\textbf{dq}}_{\text {stiff}} \\ \tilde{\textbf{dq}}_{\text {soft}} \end{bmatrix}\right\Vert _2^2. \end{aligned}$$
(30)

We use an adaptive nonlinear least squares algorithm, NL2SOL [16], to solve this problem. In the Procrustes parareal setup, \(\Lambda ^{\dagger }\) is only evaluated when applying the correction operator \(\Psi ^{(k)}_{\Delta t}:= \Lambda ^{\dagger } \Omega ^{(k)}_{\Delta t} \Lambda \) to some solution \(\textbf{u}\). Since the correction by the orthogonal matrix \( \Omega ^{(k)}_{\Delta t}\) is expected to be small, we use the \(\textbf{q}\) component of \(\textbf{u}\) as the initial guess in the least squares algorithm.
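The sketch below illustrates the inversion (30), using SciPy's `least_squares` as a stand-in for NL2SOL; `dq_target` concatenates the given \(\tilde{\textbf{dq}}_{\text {stiff}}\) and \(\tilde{\textbf{dq}}_{\text {soft}}\), and `q_guess` is the initial guess discussed above.

```python
import numpy as np
from scipy.optimize import least_squares

def Lam2_dagger(dq_target, q_guess, omega=300.0):
    """Recover q from Lambda_2(q) by nonlinear least squares, cf. (30)."""
    def residual(q):
        qq = np.concatenate(([0.0], q, [0.0]))
        dq_stiff = 0.5 * omega * (qq[2::2] - qq[1:-1:2])
        dq_soft = (qq[1::2] - qq[0::2]) ** 2
        return np.concatenate((dq_stiff, dq_soft)) - dq_target
    return least_squares(residual, q_guess).x
```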

4.2 The Procrustes parareal method

In this section, we discuss our choice of numerical integrators and present results for the Procrustes parareal method.

4.2.1 Choice of numerical solvers

For Hamiltonian systems, we use symplectic integrators for both the fine and coarse solvers. The stepsize h of the integrator should be of order \(\mathcal {O}(\omega ^{-1})\) or smaller for accurate integration. For the coarse solver \(C_{\Delta t}\), we use a 4th-order symplectic algorithm developed by Calvo and Sanz-Serna [17]. We consider two stepsizes, \(h=2^{-9}\) and \(h=2^{-8}\), for comparison. For the fine solver \(F_{\Delta t}\), we use an 8th-order symplectic algorithm developed by Kahan and Li [18] with a stepsize \(h=2^{-18}\). The numerical integrators are denoted by \(\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\), \(\Phi ^{\text {CSS4}, h=2^{-8}}_{\Delta t}\), and \(\Phi ^{\text {KL8}, h=2^{-18}}_{\Delta t}\).

Fig. 1 Errors in trajectories generated by applying the fine solver \(F_{\Delta t}=\Phi ^{\text {KL8}, h=2^{-18}}_{\Delta t}\), implemented in double precision and octuple precision, sequentially for \(N=1000\) steps of \(\Delta t=1.0\). Float64 refers to double precision and float256 refers to octuple precision. The numerical integrator used for generating the reference trajectory here is \(\Phi ^{\text {DPRKN12}, h=2^{-18}}_{\Delta t}\), implemented in octuple precision

Fig. 2 Log (base 10) errors in plain parareal solutions computed with \(C_{\Delta t} =\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\) implemented in double precision and octuple precision, and \(F_{\Delta t}=\Phi ^{\text {KL8}, h=2^{-18}}_{\Delta t}\) implemented in octuple precision only (\(\Delta t=1.0\), \(T=1000\))

Because we aim to simulate over a long time interval of \(\mathcal {O}(\omega )\), generating the reference trajectory requires \(\omega /h\approx 8\times 10^7\) fine steps. Hence, we perform the fine solver computations in octuple precision to reduce the accumulated rounding errors. The coarse solver computations are done in double precision. All numerical computations are performed in Julia; we use the MultiFloats library to obtain octuple-precision numbers.

To demonstrate the significance of rounding errors and to assess the quality of the fine solutions, in Fig. 1 we plot the global errors of the fine solutions computed up to \(T=1000\) in double precision versus octuple precision. Here, a method of even higher order, the 12th-order explicit Runge–Kutta–Nyström method with stepsize \(h=2^{-18}\) implemented in octuple precision, serves as the reference solution map. The trajectory errors grow over time while the energy errors remain stable for both precisions (expected, since the integrator is symplectic). The trajectory errors grow linearly in time at first and then exponentially. Based on the trajectory errors, we conclude that the fine solutions computed in octuple precision lose digits much later than those computed in double precision. Even with octuple precision, the fine solutions are not reliable after \(n=500\). In the rest of the paper, unless otherwise mentioned, we compare the trajectory errors against the reference only up to \(n=500\). For energy errors, we may compare for \(n>500\) since the reference energy is almost constant.

Fig. 3 Log (base 10) errors in plain parareal solutions and Procrustes parareal solutions (\(\Delta t=1.0\), \(T=1000\), \(C_{\Delta t} =\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\), \(F_{\Delta t}=\Phi ^{\text {KL8}, h=2^{-18}}_{\Delta t}\))

We also observe an effect of floating-point precision on the parareal iterations. Figure 2 shows the errors in parareal solutions computed with a coarse solver implemented in double precision versus octuple precision; the fine solver is fixed in octuple precision. The error plots differ significantly after \(n=200\) steps: using double precision for the coarse solver prevents the parareal solutions from improving beyond \(n=200\). Unfortunately, we have to use double precision for the coarse solver in the remaining comparisons because (1) the library for the least squares algorithm involved in inverting \(\Lambda \) only supports double precision, and (2) the NN solver runs in double precision.

Fig. 4 Energy profiles of the stiff springs computed from plain parareal solutions and Procrustes parareal solutions at iteration 0 and iteration 3 (\(\Delta t=1.0\), \(T=1000\), \(C_{\Delta t} =\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\), \(F_{\Delta t}=\Phi ^{\text {KL8}, h=2^{-18}}_{\Delta t}\))

4.2.2 Plain versus Procrustes

Using the numerical solvers introduced above, we run the plain parareal method and the Procrustes parareal method for \(N=1000\) steps and \(k=10\) iterations from the same initial condition.

Figure 3 shows the errors in the computed trajectories for \(C_{\Delta t} =\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\). Overall, both methods improve the initial coarse solution, but the improvement becomes small after several iterations. Comparing the trajectory error plots, we observe that the Procrustes parareal solutions improve faster over the iterations than the plain parareal solutions. Comparing the energy error plots, we see that the Procrustes parareal solutions not only improve faster but are also more stable in energy than the plain parareal solutions (in particular, see how the energy errors grow from iteration 0 to iteration 1 in the plain parareal versus the Procrustes parareal). The stability issue can be seen more clearly in Fig. 4, where we plot the energies of the three stiff springs and their total energy from the computed solutions. As shown in the reference energy profile in Fig. 5, during the simulated time range there is energy exchange among the stiff springs while their total energy remains almost constant. For the plain parareal solution at iteration 3, the total energy blows up by the end of the simulation. In contrast, the total energy of the Procrustes parareal solution is almost conserved.

The same behavior can be observed for a less accurate coarse solver, \(C_{\Delta t} =\Phi ^{\text {CSS4}, h=2^{-8}}_{\Delta t}\) (see Figs. 11 and 12). Compared with \(C_{\Delta t} =\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\), the improvement of the Procrustes parareal over the plain parareal is more substantial.

Lastly, we compare the runtimes of the parareal methods. We used 40 computing cores for the parallel computation. For \(C_{\Delta t} =\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\) and 10 iterations, the total runtime of the plain parareal method is \(3.2\times 10^3\) s, and that of the Procrustes parareal method is \(3.6\times 10^3\) s. As a baseline, the sequential fine computation on [0, T] takes \(1.2 \times 10^5\) s.

Fig. 5 Reference energy profile of the stiff springs (\(\Delta t=1.0\), \(T=1000\))

Table 1 Descriptions of input data sets

4.3 NN solution map

In this section, we present the setup for learning the solution map \(\Phi ^{\text {NN}}_{\Delta t}\) and study how its quality is affected by different choices of training data, loss function, and network architecture. To evaluate the performance of a learned \(\Phi ^{\text {NN}}_{\Delta t}\), we generate a 1000-step trajectory from the initial condition (22) by sequential application of \(\Phi ^{\text {NN}}_{\Delta t}\).

Using the proposed data generation algorithms, we generated two sets of input data, denoted by \(\mathcal {D}_0^{\text {HMC-}{H_0}}\) and \(\mathcal {D}_0^{\text {TrajEnsemble-}{H_0}}\), where \(H_0 = H\left( \textbf{p}_{\text {init}},\textbf{q}_{\text {init}}\right) \). The parameters of the data generation algorithms are given in Table 1. Each dataset has 200k input examples \(\textbf{u}_0\). We emphasize that \(\left( \textbf{p}_{\text {init}},\textbf{q}_{\text {init}}\right) \) is not included in either dataset even though \(\textbf{q}_{\text {init}}\) appears among the parameters, because both algorithms randomly sample \(\textbf{p}\) given \(\textbf{q}_{\text {init}}\).

Given a \(\mathcal {D}_0\), we generate the full training dataset \(\mathcal {D}\) by propagating each \(\textbf{u}_0\) for 5 steps using the fine solver \(F_{\Delta t}\), and then collecting the input and target sequence pairs:

$$\begin{aligned} \mathcal {D} = \left\{ \left( \textbf{u}_0, \{ \left( F_{\Delta t}\right) ^{i} \textbf{u}_0\}_{i=1,\cdots ,5}\right) : \textbf{u}_0 \in \mathcal {D}_0 \right\} . \end{aligned}$$
(31)

This leads to two training datasets \(\mathcal {D}^{\text {TrajEnsemble-}{H_0}}\) and \(\mathcal {D}^{\text {HMC-}{H_0}}\).

Let ResNet(L, n) denote a network with L hidden layers and n nodes per hidden layer. We considered several ResNet architectures, including a shallow network, ResNet(4, 1000), and a deep network, ResNet(75, 200). The two networks have a similar number of trainable parameters, around \(3\times 10^6\), and their performances are similar. Hence, we report results only for ResNet(4, 1000).

Fig. 6 Errors in trajectories generated by NN solvers learned with different data

The neural networks are implemented in PyTorch. We trained the networks using the mini-batch Adam algorithm [19] with weight decay [20]. To accelerate the training procedure, we used a one-cycle learning rate scheduler [21] that anneals the learning rate from an initial value up to some maximum value and then down to some minimum value within a fixed number of epochs. We used 10,000 epochs to train ResNet(4, 1000) and 5000 epochs to train ResNet(75, 200). A condensed sketch of this training loop follows.
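In the sketch below, the hyperparameter values are illustrative rather than the ones used in our experiments, and `multistep_ebe_loss` and the torch-compatible transform `Lam` are assumed from the sketch in Section 3.2.

```python
import torch

def train(phi_nn, loader, epochs, max_lr=1e-3, wd=1e-4):
    # Adam with decoupled weight decay, cf. [19, 20].
    opt = torch.optim.AdamW(phi_nn.parameters(), lr=max_lr / 25, weight_decay=wd)
    # One-cycle schedule [21]: anneal the learning rate up to max_lr, then down.
    sched = torch.optim.lr_scheduler.OneCycleLR(
        opt, max_lr=max_lr, epochs=epochs, steps_per_epoch=len(loader))
    for _ in range(epochs):
        for u0, target_seq in loader:   # mini-batches built from (31);
            opt.zero_grad()             # target_seq[i] = (F_dt)^{i+1} u0
            loss = multistep_ebe_loss(phi_nn, u0, target_seq, Lam, S=5)
            loss.backward()
            opt.step()
            sched.step()                # per-step LR update
    return phi_nn
```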

4.3.1 Effects of training data

We trained the shallow network ResNet(4, 1000) with the different datasets and the one-step (\(S=1\)) MSE loss. Figure 6 shows that the network trained with \(\mathcal {D}^{\text {HMC-}{H_0}}\) is better than the network trained with \(\mathcal {D}^{\text {TrajEnsemble-}{H_0}}\), especially in terms of energy stability. We attribute this to the fact that the input set \(\mathcal {D}_0^{\text {HMC-}{H_0}}\) better represents the target distribution. As shown in Fig. 7, the minimum distance from each point along the reference trajectory to the set \(\mathcal {D}_0^{\text {HMC-}{H_0}}\) is on average much smaller than the minimum distance to the set \(\mathcal {D}_0^{\text {TrajEnsemble-}{H_0}}\). Moreover, the minimum distance to \(\mathcal {D}_0^{\text {TrajEnsemble-}{H_0}}\) increases along the trajectory, while the minimum distance to \(\mathcal {D}_0^{\text {HMC-}{H_0}}\) stays stable over time.

Fig. 7 Minimum distance from each point on the reference trajectory to different training data in the phase space

Fig. 8 Errors in trajectories generated by NN solvers learned with different sequence lengths S (initial condition = \((\textbf{p}_{\text {init}}, \textbf{q}_{\text {init}})\))

4.3.2 Effects of sequence length in loss function

We trained the shallow network ResNet(4, 1000) using \(\mathcal {D}^{\text {HMC-}{H_0}}\) and the multi-step MSE loss for different sequence lengths S. Figure 8 shows that longer sequence lengths yield slightly better accuracy for the first ten steps; after ten steps, there is no significant difference between the sequence lengths. In Fig. 9, we repeat the comparison for a different initial condition, \((\sqrt{2} \textbf{p}_{\text {init}}, \textbf{q}_{\text {init}})\). Note that the corresponding energy level is higher than the \(H_0\) used for generating the training data; in other words, we are testing the generalization ability of the NN solvers on out-of-distribution examples. Based on the results, the NN solvers achieve accuracy on par with the accuracy for in-distribution examples for at least the first few steps. Moreover, we found that longer training sequences result in significantly better generalization ability.

Fig. 9 Errors in trajectories generated by NN solvers learned with different sequence lengths S (initial condition = \(( \sqrt{2} \textbf{p}_{\text {init}}, \textbf{q}_{\text {init}})\))

Fig. 10 Errors in trajectories generated by NN solvers learned with different metrics

4.3.3 Effects of loss function metric

We trained the shallow network ResNet(4, 1000) using \(\mathcal {D}^{\text {HMC-}{H_0}}\) and different loss metrics with sequence length \(S=5\). As displayed in Fig. 10, compared to MSE, using EBE leads to smaller trajectory and energy errors for over 100 steps. In particular, the energy error is not only smaller but also more stable over a long time period.

4.3.4 Comparison with numerical solvers

We present in Table 2 the one-step accuracy and runtime performance of various solvers. The fine solver \(F_{\Delta t}\) is \(\Phi ^{\text {KL8}, h=2^{-18}}_{\Delta t}\) implemented in octuple precision. The NN solver \(\Phi ^{\text {NN}}_{\Delta t}\) is ResNet(4, 1000), trained using \(\mathcal {D}^{\text {HMC-}{H_0}}\) and the multi-step (\(S=5\)) EBE loss. For comparison, we include several numerical solvers: \(\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\) and \(\Phi ^{\text {VV}, h=2^{-14}}_{\Delta t}\), whose one-step trajectory errors are comparable to that of \(\Phi ^{\text {NN}}_{\Delta t}\), as well as \(\Phi ^{\text {CSS4}, h=2^{-8}}_{\Delta t}\) and \(\Phi ^{\text {VV}, h=2^{-11}}_{\Delta t}\), whose one-step energy errors are comparable to that of \(\Phi ^{\text {NN}}_{\Delta t}\). Here, VV stands for the second-order velocity Verlet scheme.

It can be seen that, at the same level of trajectory error, the numerical solvers achieve lower energy errors than \(\Phi ^{\text {NN}}_{\Delta t}\). However, in terms of runtime, \(\Phi ^{\text {NN}}_{\Delta t}\) is as fast as \(\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\) and about 17 times faster than \(\Phi ^{\text {VV}, h=2^{-14}}_{\Delta t}\). We emphasize that the runtime measurements took place in different environments: the NN solver is implemented in Python, while the numerical solvers are implemented in Julia. We have fully optimized the Julia code for runtime and memory efficiency; we expect that the NN implementation can be further optimized for better runtime performance in the future.

Table 2 Accuracy and runtime performance comparison for various solvers
Fig. 11 Log (base 10) errors in plain parareal solutions by various coarse solvers (\(\Delta t=1.0\), \(T=1000\))

4.4 NN solution map in parareal iterations

In this section, we present results of using \(\Phi ^{\text {NN}}_{\Delta t}\) as the coarse solver in parareal methods.

We first study the plain parareal method. Figure 11 compares the plain parareal solutions computed with different coarse solvers: \(\Phi ^{\text {NN}}_{\Delta t}\), \(\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\), and \(\Phi ^{\text {CSS4}, h=2^{-8}}_{\Delta t}\). Clearly, \(\Phi ^{\text {CSS4}, h=2^{-8}}_{\Delta t}\) performs the worst, as expected, since it is the least accurate of the three solvers. Based on the trajectory errors, using \(\Phi ^{\text {NN}}_{\Delta t}\) as the coarse solver provides slower accuracy improvement over the iterations than using \(\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\). Comparing the energy errors, we observe that with \(\Phi ^{\text {NN}}_{\Delta t}\), the stability in energy is not destroyed as much as with \(\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\) as the coarse solver (see also the energy profiles at iteration 3 in Fig. 13).

We now compare the coarse solvers in the Procrustes parareal method. In Section 4.2, we found that the Procrustes parareal method improves accuracy by stabilizing the energy of the solutions. As shown in Figs. 12 and 14, the best improvement is again obtained with \(\Phi ^{\text {CSS4}, h=2^{-9}}_{\Delta t}\). There is no significant gain from combining the Procrustes parareal with the NN solution map; in fact, the improvement over the Procrustes parareal iterations even deteriorates compared with the improvement over the plain parareal iterations.

We speculate that \(\Phi ^{\text {NN}}_{\Delta t}\) does not perform well in parareal iterations because it was not trained using suitable data. As described in Section 3.3, we sampled training data points from the Liouville density, mainly because it is a natural distribution for learning a solution map to be used in a sequential algorithm. In parareal schemes, since the coarse solver is applied differently than in a sequential algorithm, we would need a different data distribution. A similar issue has been investigated in [10] for approximating the correction operator using the NN approach for wave equations. There, the authors demonstrated the importance of using training data closer to the ones encountered in the simulations.

Fig. 12 Log (base 10) errors in Procrustes parareal solutions by various coarse solvers (\(\Delta t=1.0\), \(T=1000\))

5 Conclusion

In this paper, we presented two data-driven approaches for stabilizing the standard parareal algorithm for long-time computation of highly oscillatory Hamiltonian systems. The Procrustes parareal approach uses solutions computed along the parareal iterations to construct a correction operator that aligns the “phase” of the fine and coarse solvers. Numerical results for the FPU problem demonstrated that the constructed correction can successfully stabilize the parareal iterations, which helps improve the accuracy of the computed solutions. The second approach we proposed is to use a neural network (NN) to approximate the reference solution map that advances a given state forward in time by a large fixed time step. We developed a sampling algorithm called HMC-\(H_0\) to sample phase space points from the neighborhood of an energy level set. We also designed a loss function that considers the energy-balanced errors between approximated trajectories and reference trajectories. The resulting NN solver for the FPU problem achieves runtime performance comparable to or better than that of numerical solvers of similar accuracy. When combined with parareal iterations, solutions computed by the NN solver are not as accurate as solutions computed by a comparable numerical solver, although the NN energy errors are slightly smaller. We believe this may be improved by training the network on data suitably sampled from the discrete trajectories computed by the parareal schemes.

The FPU problem is too small to reveal the potential benefit of using NNs: it is small enough that optimized high-order symplectic integrators are extremely efficient and accurate. For more complicated problems, where the phase space is large and lower-order methods are the only feasible choice, we think the investigated NN approach may become viable.