1 Introduction

Generally speaking, time is sequential in nature. Traditional parallel algorithms for solving time-dependent problems restrict the parallelization to the spatial dimension and are purely sequential in the temporal dimension. In such approaches, the solution at the current time step cannot be obtained without knowing the solution at previous time steps. For applications with a sufficient number of degrees of freedom in the spatial dimension, parallelization in space is often enough for the efficient use of the parallel machine. However, for applications with a limited number of degrees of freedom in the spatial dimension that require many time steps to resolve the dynamics of the time-dependent solution, parallelization in both space and time is a better option. The oldest time parallel algorithm was presented by Nievergelt (1964). In recent years, parallel algorithms with space-time concurrency have received increasing attention, and we briefly review some of the approaches below.

Waveform relaxation (WR) is an important time parallel algorithm that was initially presented for solving problems in large-scale circuit simulations (Lelarasmee et al. 1982). The word waveform refers to signals traveling along the circuit, and relaxation refers to algebraic relaxation schemes, such as Gauss–Seidel relaxation and Jacobi relaxation. The basic idea of WR is to decompose the system into decoupled subsystems by applying relaxation directly to the system of linear or nonlinear algebraic-differential equations; the decoupled subsystems can then be solved in parallel over the whole time interval. As a special iterative method, WR has been successfully applied to many problems in circuit simulations (Beyene 2008; Jiang et al. 2000; Ruehli and Johnson 1999) and to solving ordinary differential equations (ODEs) (Burrage 1995). Further, the method has been extended to parabolic PDEs (Bjorhus and Stuart 1997; Jeltsch and Pohl 1995), in which the PDE is first transformed into a system of ODEs by a semi-discretization in space, and the system of ODEs is then solved by WR. In fact, WR is closely related to the well-known Picard–Lindelöf iteration. We refer to Miekkala and Nevanlinna (1987), Nevanlinna (1990) for the convergence analysis and Bellen et al. (1993) for the stability analysis. One drawback of WR is its slow convergence for some problems. By combining WR with the overlapping domain decomposition (DD) method (Smith et al. 1996; Toselli and Widlund 2005), the Schwarz WR (SWR) method was developed for linear and semilinear parabolic equations in Gander and Stuart (1998), Giladi and Keller (2002), Simoens and Vandewalle (2000), Tran (2014). It has been shown that SWR converges faster than the classical WR (Gander 1999; Gander and Stuart 1998; Giladi and Keller 2002). By introducing optimal transmission operators, such as the Robin or Ventcell type operators, one can obtain optimized SWR (oSWR) methods (Bennequin et al. 2009; Gander 2006; Wu and Xu 2017) that are faster than SWR; however, it is sometimes expensive to find the optimization parameters in the optimal transmission operators. Techniques for choosing the optimization parameters are presented in Al-Khaleel et al. (2014), Halpern et al. (2012).
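To make the relaxation idea concrete, the following is a minimal sketch of Jacobi-type WR for a linear ODE system \(u'=Bu\), with the diagonal part treated implicitly and the coupling lagged to the previous waveform. The matrix, step sizes and iteration counts are illustrative choices, not taken from any of the cited works:

```python
import numpy as np

# Jacobi waveform relaxation for u' = B u on [0, T], u(0) = u0.
# Splitting B = D + R: the diagonal D is treated implicitly, the coupling
# R is lagged to the previous waveform, so each component decouples.
B = np.array([[-2.0, 1.0], [1.0, -2.0]])
D = np.diag(B)                     # diagonal entries of B
R = B - np.diag(D)                 # off-diagonal coupling
T, nt = 1.0, 200
dt = T / nt
u0 = np.array([1.0, 0.0])

U = np.tile(u0, (nt + 1, 1))       # initial guess: constant-in-time waveform
for k in range(20):                # WR sweeps
    V = np.empty_like(U)
    V[0] = u0
    for j in range(nt):            # backward Euler; the components are independent
        V[j + 1] = (V[j] + dt * R @ U[j + 1]) / (1.0 - dt * D)
    U = V
```

Each component of the new waveform V can be computed over the whole time interval independently of the others, which is the source of the parallelism.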

The parareal algorithm was first presented in Lions et al. (2001) for solving evolution problems. The name parareal is derived from parallel real time computations of evolution problems. The basic idea of the algorithm is to decompose the global problem in the temporal dimension into a series of independent problems on smaller time intervals, which are then solved in parallel. The parareal algorithm can be described in terms of two computation processes: the propagator \(\mathscr {G}\) on the coarse mesh with time step size \(\delta T\), and the propagator \(\mathscr {F}\) on the fine mesh with time step size \(\delta t\). These two time propagators are combined by a predictor-corrector scheme. More precisely, the fine time propagator \(\mathscr {F}\) solves the independent problems on the time subdomains in parallel using some sequential-in-time method, such as backward Euler, Crank-Nicolson, Runge–Kutta or other high-order methods. Since \(\delta t \ll \delta T\), the solution computed by the propagator \(\mathscr {F}\) is more accurate, but its cost is high. In contrast, the coarse time propagator \(\mathscr {G}\) solves the problem by a simple, often less accurate, fast sequential time integrator. The inexpensive solution computed by \(\mathscr {G}\) propagates the corrections produced by \(\mathscr {F}\) across the time interval, and therefore the overall iteration converges. The parareal algorithm has received a lot of attention over the past few years (Bal 2005; Dai and Maday 2013; Gander and Vandewalle 2007; Wu and Zhou 2015), and it has been successfully applied to a variety of problems, such as parabolic problems (Gander and Vandewalle 2007; Ruprecht et al. 2016), hyperbolic problems (Dai and Maday 2013), optimal control problems (Du et al. 2013; Maday et al. 2007; Mathew et al. 2010), Navier–Stokes equations (Trindade and Pereira 2006) and non-differential equations (Bal and Maday 2002). The parallel efficiency of the parareal algorithm is far from ideal: it is proportional to \(1/k\), where k denotes the number of iterations (Bal 2005; Farhat and Chandesris 2003). To improve the parallel efficiency, a hybrid algorithm has recently been proposed based on the parareal algorithm and the spectral deferred corrections algorithm, and it has been shown that the parallel efficiency of this hybrid can be much closer to 1 (Minion 2010). Several variants of the parareal algorithm are also available, for example, PITA (parallel implicit time integrator) (Farhat and Chandesris 2003) and PFASST (parallel full approximation scheme in space and time) (Bolten et al. 2017; Minion et al. 2015).
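The predictor-corrector iteration takes the well-known form \(U_{n+1}^{k+1}=\mathscr {G}(U_n^{k+1})+\mathscr {F}(U_n^k)-\mathscr {G}(U_n^k)\). Below is a minimal sketch for the scalar test equation \(u'=\lambda u\), with backward Euler used for both propagators; all parameter values are illustrative, and the \(\mathscr {F}\) evaluations are the ones executed in parallel in practice:

```python
import numpy as np

lam = -1.0                 # scalar test ODE u' = lam * u, u(0) = 1
T, N, m = 1.0, 10, 20      # horizon, coarse intervals, fine substeps per interval
dT = T / N

def G(u):                  # coarse propagator: one backward-Euler step of size dT
    return u / (1.0 - lam * dT)

def F(u):                  # fine propagator: m backward-Euler substeps of size dT/m
    for _ in range(m):
        u = u / (1.0 - lam * dT / m)
    return u

U = np.empty(N + 1)
U[0] = 1.0
for n in range(N):         # initial prediction: one sequential coarse sweep
    U[n + 1] = G(U[n])

for k in range(5):         # parareal iterations
    Fu = np.array([F(U[n]) for n in range(N)])   # independent: parallel in practice
    Gu = np.array([G(U[n]) for n in range(N)])
    for n in range(N):     # sequential predictor-corrector sweep
        U[n + 1] = G(U[n]) + Fu[n] - Gu[n]

print(U[-1], np.exp(lam * T))   # iterates approach the fine sequential solution
```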

The space-time multigrid method (Hackbusch 1984) is an interesting method that uses the elliptic multigrid method with a standard spatial smoother to solve parabolic PDEs on many time levels simultaneously. This parabolic multigrid method was further studied in Horton (1992). A Fourier mode analysis shows that the success of this method requires meshes with a large aspect ratio \(\tau /h^2\) (\(\tau \) and h denote the time step size and the mesh size, respectively) (Vandewalle and Horton 1995). Instead of the spatial smoother used in the parabolic multigrid method, the multigrid WR method with a zebra Gauss–Seidel smoother was proposed in Lubich and Ostermann (1987). This method was carefully studied for solving incompressible Navier–Stokes equations (Oosterlee and Wesseling 1993) and a chemical reaction-diffusion problem (Vandewalle and Piessens 1991). A variant of the multigrid WR algorithm with a line relaxation smoother was introduced in Horton et al. (1995), which has a fast convergence rate, but its computational complexity is increased by a logarithmic factor. By employing a parallel pointwise smoother instead of the line-wise smoother, Horton and Vandewalle (1995) presented a space-time multigrid method that solves the entire space-time problem simultaneously. For some applications, the fully discrete parabolic problem may become strongly anisotropic. To improve the robustness of the method one can change the coarsening strategy, i.e., coarsen only in the directions where point smoothing is effective. Another approach, presented in Gander and Neumüller (2016), is to use a block Jacobi smoother instead of the point Jacobi smoother. A local Fourier mode analysis shows that this new space-time multigrid method with a block Jacobi smoother converges well, and numerical experiments demonstrate that it has excellent strong and weak scalability properties on large scale parallel computers.

Recently, a nonintrusive, optimal-scaling time parallel method based on multigrid reduction (MGR) was proposed in Dobrev et al. (2017), Falgout et al. (2014, 2017). The main advantage of this method is that the software is built on top of existing sequential time-stepping routines, and numerical comparisons with the traditional time-stepping algorithm show that this method is able to achieve a significant speedup. More details on numerical comparisons between the space-time multigrid with point-wise relaxation (Horton and Vandewalle 1995), the space-time multigrid with block relaxation (Gander and Neumüller 2016), the space-time concurrent multigrid waveform relaxation with cyclic reduction (Horton et al. 1995) and multigrid-reduction-in-time (Falgout et al. 2014) can be found in Falgout et al. (2017). Some other time parallel methods, such as the direct time parallel methods, can be found in a recent review paper (Gander 2015).

Note that all the methods mentioned above are developed by modifying some existing, standard time integration schemes (parallel in space, but sequential in time). The modifications introduce the desired extra parallelism, but also alter the discretization error and/or the stability conditions. Recently, several implicit space-time DD preconditioners were presented for parabolic PDEs (Cong et al. 2014), including some one- and two-level overlapping Schwarz preconditioned GMRES methods. Different from other space-time methods, these methods do not alter the accuracy or the stability condition of the original, non-time-parallel schemes since they are preconditioners. The optimal convergence analysis of two-level and multilevel space-time additive Schwarz methods for solving parabolic PDEs was established in Li and Cai (2015), Li et al. (2018). Numerical experiments carried out on a parallel supercomputer show that these methods exhibit excellent strong and weak scalability when the number of processors is large. Some of the techniques have also been used to solve unsteady inverse source problems (Deng et al. 2015, 2016) and flow control and heat transfer problems (Yang and Cai 2017). In Badia and Olm (2017), a space-time BDDC (balancing DD by constraints) preconditioned GMRES method was introduced for solving a set of linear and nonlinear parabolic problems, and an excellent weak scalability result was achieved on a supercomputer with thousands of processors.

In this paper, we focus mainly on the space-time Schwarz methods, including both the additive and the restricted additive Schwarz variants. We investigate the parallel performance of these methods and show their superiority over a traditional time-stepping method. The rest of the paper is organized as follows. In Sect. 2, we introduce a model parabolic boundary value problem and describe the space-time Schwarz algorithms. In Sect. 3, some numerical experiments are reported to illustrate the performance of the algorithms. Finally, some conclusions are given in the last section.

2 A model problem and the space-time additive Schwarz algorithms

We consider a model parabolic equation:

$$\begin{aligned} \left\{ \begin{array}{rll} u_t-\Delta u &{}=f, \ &{}\text {in} \ \varOmega \times (0,T], \\ u(x,t)&{}=0, \ &{}\text {on} \ \partial \varOmega \times (0,T], \\ u(x,0)&{}=u_0(x), \ &{}\text {in} \ \varOmega , \end{array}\right. \end{aligned}$$
(1)

where \(\varOmega \subset R^d \ (d=2 \ \text {or} \ 3)\) is a bounded, open polygonal (or polyhedral) domain and \(f(x,t)\in L^2(\varOmega \times [0,T])\). Let \(0=t^0<t^1<\cdots <t^n=T\) and \(\tau \equiv \Delta t^k=t^k-t^{k-1}\). Let \(u^k\) denote the solution at time \(t^k\). Traditional time-stepping methods solve (1) one time step after another. Using the backward Euler scheme for the time discretization, the variational form of (1) at time \(t^k\) is: find \(u^k\in H^1_0(\varOmega )\), \(k\ge 1\), such that

$$\begin{aligned} (u^k,v) +\tau A(u^k,v)-(u^{k-1},v)=\tau (f,v) \quad \forall v\in H^1_0(\varOmega ), \end{aligned}$$
(2)

where the bilinear forms are \(A(u,v)=\int _{\varOmega }\nabla u\cdot \nabla v \, dx\) and \((u,v)=\int _{\varOmega } u v \, dx\).
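At the algebraic level, each step of (2) amounts to solving a linear system of the form \((\text {M}+\tau \text {A})u^k=\text {M}u^{k-1}+\tau f^k\), where \(\text {M}\) and \(\text {A}\) are the mass and stiffness matrices introduced below in (4). A minimal sequential sketch, using the standard 1D piecewise-linear finite element matrices on a uniform mesh as an illustrative stand-in for the 2D/3D assembly:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, tau = 99, 0.01                  # interior mesh points and time step size
h = 1.0 / (n + 1)
# 1D piecewise-linear FEM matrices, homogeneous Dirichlet boundary conditions
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) / h    # stiffness
M = sp.diags([1.0, 4.0, 1.0], [-1, 0, 1], shape=(n, n)) * h / 6  # mass

x = np.linspace(h, 1.0 - h, n)
u = np.sin(np.pi * x)              # discrete initial condition u^0
step = spla.factorized((M + tau * A).tocsc())   # factor once, reuse every step
for k in range(100):               # strictly sequential in time
    u = step(M @ u)                # f = 0 for simplicity
```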

Different from the traditional time-stepping methods, the idea of the space-time method is to group the equations for \(s \ (s\le n)\) time steps into a single system, which is then solved in parallel by a Schwarz preconditioned Krylov subspace method:

$$\begin{aligned} \left\{ \begin{array}{rll} (u^1,v^1) + \tau A(u^1,v^1)-(u^0,v^1)&{}=\tau (f^1,v^1) \\ (u^2,v^2) + \tau A(u^2,v^2)-(u^1,v^2)&{}=\tau (f^2,v^2) \\ &{}\vdots \\ (u^s,v^s) + \tau A(u^s,v^s)-(u^{s-1},v^s)&{}=\tau (f^s,v^s), \\ \end{array}\right. \end{aligned}$$
(3)

where s is often called the window size. By using the Galerkin finite element method for the spatial discretization, we obtain a fully discretized linear system of equations:

$$\begin{aligned} \mathbf A \mathbf u ^*=\mathbf f , \end{aligned}$$
(4)

where \(\mathbf f \) is the vector associated with the right-hand side of (3) and

$$\begin{aligned} \mathbf A =\left( \begin{array}{ccccc} \tau \text {A}+\text {M} &{}&{}&{}&{}\\ -\text {M} &{} \tau \text {A}+\text {M} &{}&{}&{}\\ &{}\ddots &{} \ddots &{}&{} \\ &{}&{}\ddots &{} \quad \ddots &{} \\ &{}&{}&{}-\text {M} &{} \tau \text {A}+\text {M} \\ \end{array}\right) , \end{aligned}$$

where \(\text {M}\) and \(\text {A}\) denote the mass matrix associated with the bilinear form \((\cdot ,\cdot )\) and the stiffness matrix associated with the bilinear form \(A(\cdot ,\cdot )\), respectively.
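The matrix \(\mathbf A \) has a compact Kronecker-product form, \(\mathbf A = I_s\otimes (\tau \text {A}+\text {M})-E_s\otimes \text {M}\), where \(E_s\) is the \(s\times s\) matrix with ones on the first subdiagonal. The following sketch assembles the all-at-once system and solves it directly, reusing the illustrative 1D matrices from the previous snippet; a real computation would use the preconditioned GMRES method described below rather than a sparse direct solver:

```python
s = 8                                     # window size
I_s = sp.identity(s)
E_s = sp.diags([np.ones(s - 1)], [-1])    # ones on the first subdiagonal

A_st = (sp.kron(I_s, M + tau * A) - sp.kron(E_s, M)).tocsr()  # block bidiagonal

f = np.zeros(s * n)                       # right-hand side of (4)
u0 = np.sin(np.pi * x)
f[:n] = M @ u0                            # u^0 enters only the first block row
u_star = spla.spsolve(A_st.tocsc(), f)    # all s time steps solved at once
```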

To present the space-time Schwarz algorithms, we first introduce a space-time partition (Li et al. 2018). Let \(\mathscr {T}_l=\{K_i^l\}_{i=1}^{N^l} \ (l=0,1,\ldots ,L)\) be a shape regular family of nested conforming meshes covering \(\varOmega \), where L is the number of levels and \(N^l\) denotes the number of elements on level l. Let \(V_l=V_{h_l}\) be the space of continuous piecewise linear functions associated with the partition \(\mathscr {T}_l\), and set \(h_l=\max _i \text {diam}(K^l_i)\). For simplicity, we set \(V_H=V_0\) with mesh size \(H=h_0\) and \(V_h=V_L\) with mesh size \(h=h_L\). Let \(\{\varOmega _{li}^0 \}_{i=1}^{N_l}\) be a set of non-overlapping spatial subdomains such that \(\varOmega =\bigcup _{i=1}^{N_l}\varOmega _{li}^0\), and let \(\{I_{lj}\}_{j=1}^{M_l}\) be a non-overlapping partition of \(I=[0,T]=[0,t_l^{s_l}]\) such that \(I=\bigcup _{j=1}^{M_l}I_{lj}\), where \(s_l\) denotes the window size of time level l. Denote the overlapping subdomains corresponding to \(\{\varOmega _{li}^0\}\) and \(\{I_{lj}\}\) by \(\{\varOmega _{li}^{\delta }\}\) (with size \(H_l=O(h_{l-1})\)) and \(\{I_{lj}^{\prime }\}\) (with size \(H_l^{\tau }=O(\tau _{l-1})\)), respectively. The corresponding overlaps are denoted by \(\delta _l>0\) and \(\delta _l^{\tau }\ge 0\). In this paper, we only consider the case \(\delta _l^{\tau }= 0\), i.e., there is no overlap in the temporal direction. Here \(s_{lj}\) denotes the number of time steps in \(I_{lj}\). For illustration, a three-level space-time mesh is presented in Fig. 1.

Fig. 1

A multilevel space-time mesh of \(\varOmega \times [0,T]\) consisting of three meshes. Space-time interpolations and restrictions are used to transfer functions from one mesh to another

For each \(\varOmega _{li}^{\delta }\), the finite element space is defined as \(V_{l,\delta }^i=V_l\cap H_0^1(\varOmega _{li}^{\delta })\). Then, we can define the space-time finite element subspace on the subdomain \( \varOmega _{li}^{\delta }\times I_{lj}\) as

$$\begin{aligned} (V_{l,\delta }^i)^{s_{lj}}=\bigcup _{t_l^k\in I_{lj}} V_{l,\delta }^{ik}, \end{aligned}$$

where \(V_{l,\delta }^{ik}=\underbrace{ 0 \times \cdots \times 0 \times \overset{k}{V_{l,\delta }^i} \times 0 \times \cdots \times 0}_{s_l},\) the 0’s denote the zero spaces at the other time steps, and \(k=1,2,\ldots ,s_l.\) Let \(\mathbf R _0^T: (V_H)^{s_0} \rightarrow (V_h)^s\) and \((\mathbf R _{ij}^{l,\delta })^T: \ (V_{l,\delta }^i)^{s_{lj}}\rightarrow (V_l)^s\) be the interpolation operators; then the whole space-time finite element space \((V_h)^s\) can be decomposed into the sum of the coarse space \((V_H)^{s_0}\) and the subspaces as

$$\begin{aligned} (V_h)^s=\mathbf R _0^T(V_H)^{s_0}+\bigcup _{l=1}^L\bigcup _{j=1}^{M_l}\bigcup _{i=1}^{N_l}(\mathbf R _{ij}^{l,\delta })^T (V_{l,\delta }^i)^{s_{lj}}, \end{aligned}$$

where \(s_0\) denotes the window size of the coarsest time level.

By restricting and extending the global matrix \(\mathbf A \), we obtain the coarse matrix \(\mathbf A _0 =\mathbf R _0\mathbf A {} \mathbf R ^T_0\) and all the subspace matrices \(\mathbf A _{ij}^{l,\delta }=\mathbf R _{ij}^{l,\delta } \mathbf A (\mathbf R _{ij}^{l,\delta })^T\). The l-level space-time additive Schwarz preconditioners for the matrix \(\mathbf A \) are defined by using the coarse matrix and the submatrices

$$\begin{aligned} \mathbf M _0^{-1}=\mathbf R _0^T \mathbf A _0^{-1} \mathbf R _0 \quad \quad \text {and}\quad \quad \mathbf M _{l,\delta }^{-1}=\sum _{j=1}^{M_l}\sum _{i=1}^{N_l} (\mathbf R _{ij}^{l,\delta })^T (\mathbf A _{ij}^{l,\delta })^{-1} \mathbf R _{ij}^{l,\delta }, \quad l\ge 1 \end{aligned}$$
(5)

as \(\mathbf M ^{-1}_{MAS}=\mathbf M _0^{-1}+\sum _{l=1}^{L}{} \mathbf M _{l,\delta }^{-1}\). More precisely, the multilevel space-time additive Schwarz algorithm for solving (4) can be written as

$$\begin{aligned} \mathbf P _{MAS} \mathbf u =\mathbf b _{MAS}, \end{aligned}$$
(6)

where \(\mathbf P _{MAS}=\mathbf M _0^{-1}{} \mathbf A +\sum _{l=1}^{L}{} \mathbf M _{l,\delta }^{-1}{} \mathbf A = \mathbf M ^{-1}_{MAS}{} \mathbf A, \) and \(\mathbf b _{MAS}=\mathbf M _0^{-1}{} \mathbf A {} \mathbf u ^*+\sum _{l=1}^{L}{} \mathbf M _{l,\delta }^{-1}{} \mathbf A {} \mathbf u ^* =\mathbf M ^{-1}_{MAS}{} \mathbf f \).

Now we present the multilevel space-time additive Schwarz algorithm.

Algorithm 1

(Multilevel space-time additive Schwarz algorithm) Find the solution of (4) by solving (6) with GMRES, where the action of the preconditioner \(\mathbf M ^{-1}_{MAS}\) on the global residual \(\mathbf r \) is carried out as follows:

   \(\mathbf z \leftarrow \mathbf R _0^T \mathbf A _0^{-1} \mathbf R _0 \mathbf r \)

   For \(l=1,2,\ldots ,L\)

      \(\mathbf z \leftarrow \mathbf z + \sum _{j=1}^{M_l}\sum _{i=1}^{N_l} (\mathbf R _{ij}^{l,\delta })^T (\mathbf A _{ij}^{l,\delta })^{-1} \mathbf R _{ij}^{l,\delta }{} \mathbf r \)

   End
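To make the subdomain part of the preconditioner concrete, the following one-level, purely algebraic sketch continues the space-time snippet from above (A_st, f, s, n as defined there). The restriction operators are realized as index selections; the coarse solve \(\mathbf R _0^T \mathbf A _0^{-1} \mathbf R _0\) and the level loop are omitted, and the contiguous overlapping index blocks are an illustrative choice:

```python
def as_apply(A_st, r, index_sets):
    """One-level additive Schwarz: z = sum_i R_i^T A_i^{-1} R_i r."""
    z = np.zeros_like(r)
    for idx in index_sets:                    # one overlapping subdomain per entry
        A_i = A_st[idx][:, idx].tocsc()       # A_i = R_i A R_i^T
        z[idx] += spla.spsolve(A_i, r[idx])   # prolongate and add (R_i^T)
    return z

# split the s*n space-time unknowns into 4 overlapping contiguous blocks
N_tot, nsub, ovl = s * n, 4, n                # overlap of one time level
blk = N_tot // nsub                           # assumes nsub divides N_tot
index_sets = [np.arange(max(0, b * blk - ovl), min(N_tot, (b + 1) * blk + ovl))
              for b in range(nsub)]
z = as_apply(A_st, f, index_sets)             # preconditioned residual (x_0 = 0)
```

In practice each subdomain solve is assigned to one processor and is typically inexact (e.g., ILU, as in Sect. 3).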

The restricted additive Schwarz (RAS) preconditioning technique was first introduced for solving nonsingular linear systems (Cai and Sarkis 1999). It has been shown that RAS outperforms the classical AS since it requires fewer iterations, as well as less communication and CPU time (Smith et al. 1996). RAS has received considerable attention and is the default Schwarz preconditioner for solving linear systems in the popular PETSc software package (Balay et al. 2018). Next, we present a multilevel space-time restricted additive Schwarz algorithm.

Define the space-time finite element subspaces on the subdomains \( \varOmega _{li}^{0}\times I_{lj}\) as

$$\begin{aligned} (V_{l,0}^i)^{s_{lj}}=\bigcup _{t_l^k\in I_{lj}} V_{l,0}^{ik}, \end{aligned}$$

where \(V_{l,0}^{ik}=\underbrace{ 0 \times \cdots \times 0 \times \overset{k}{V_{l,0}^i} \times 0 \times \cdots \times 0}_{s_l}\) and \(V_{l,0}^i=V_l\cap H_0^1(\varOmega _{li}^{0})\). Let \((\mathbf R _{ij}^{l,0})^T: \ (V_{l,0}^i)^{s_{lj}}\rightarrow (V_l)^s\) be the interpolation operators. Define the l-level space-time restricted additive Schwarz preconditioners by

$$\begin{aligned} \mathbf M _{l,0}^{-1}=\sum _{j=1}^{M_l}\sum _{i=1}^{N_l} (\mathbf R _{ij}^{l,0})^T (\mathbf A _{ij}^{l,\delta })^{-1} \mathbf R _{ij}^{l,\delta }, \quad l\ge 1. \end{aligned}$$
(7)

The multilevel space-time restricted additive Schwarz algorithm for solving (4) can be written as

$$\begin{aligned} \mathbf P _{MRAS} \mathbf u =\mathbf b _{MRAS}. \end{aligned}$$
(8)

Here \(\mathbf P _{MRAS}=\mathbf M _0^{-1}{} \mathbf A +\sum _{l=1}^{L}{} \mathbf M _{l,0}^{-1}{} \mathbf A =\mathbf M ^{-1}_{MRAS}{} \mathbf A \) and \(\mathbf b _{MRAS}=\mathbf M ^{-1}_{MRAS}{} \mathbf f \).

Algorithm 2

(Multilevel space-time restricted additive Schwarz algorithm) Find the solution of (4) by solving (8) with the GMRES method, where the action of the preconditioner \(\mathbf M ^{-1}_{MRAS}\) on the global residual \(\mathbf r \) is computed as follows:

   \(\mathbf z \leftarrow \mathbf R _0^T \mathbf A _0^{-1} \mathbf R _0 \mathbf r \)

   For \(l=1,2,\ldots ,L\)

      \(\mathbf z \leftarrow \mathbf z + \sum _{j=1}^{M_l}\sum _{i=1}^{N_l} (\mathbf R _{ij}^{l,0})^T (\mathbf A _{ij}^{l,\delta })^{-1} \mathbf R _{ij}^{l,\delta }{} \mathbf r \)

   End
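Algebraically, the only change relative to Algorithm 1 is in the prolongation: the subdomain solve still uses the overlapping matrix \(\mathbf A _{ij}^{l,\delta }\), but the correction is scattered back only to the non-overlapping indices. Continuing the sketch above (the owned_sets partition is an illustrative construction):

```python
def ras_apply(A_st, r, index_sets, owned_sets):
    """One-level restricted additive Schwarz: z = sum_i (R_i^0)^T A_i^{-1} R_i r."""
    z = np.zeros_like(r)
    for idx, own in zip(index_sets, owned_sets):
        zi = spla.spsolve(A_st[idx][:, idx].tocsc(), r[idx])  # overlapping solve
        z[own] = zi[np.isin(idx, own)]   # keep only the owned part of the correction
    return z

# non-overlapping ownership: the same contiguous blocks without the overlap
owned_sets = [np.arange(b * blk, (b + 1) * blk) for b in range(nsub)]
z = ras_apply(A_st, f, index_sets, owned_sets)
```

Because the corrections are not summed in the overlap region, the scatter step avoids the duplicated contributions and the associated communication, which is consistent with the savings reported in Cai and Sarkis (1999).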

For elliptic problems, an optimal convergence theory was developed for the additive Schwarz methods (Toselli and Widlund 2005), but the theory does not apply to the restricted Schwarz methods. A convergence analysis at the algebraic level is presented in Frommer and Szyld (2001), Nabben and Szyld (2003). Recently, an optimal space-time additive Schwarz theory was also developed (Li et al. 2018); under some reasonable conditions, the convergence rate of the two- and multilevel space-time additive Schwarz methods is bounded independently of the mesh sizes, the number of subdomains and the window size. Let \(\widehat{\mathbf{A }}\) be the symmetric and positive part of \(\mathbf A \); we define the \(\widehat{\mathbf{A }}\)-inner product by

$$\begin{aligned} (\mathbf u ,\mathbf v )_{\widehat{\mathbf{A }}}=\mathbf u ^T\widehat{\mathbf{A }}{} \mathbf v . \end{aligned}$$

The convergence rate of Algorithm 1 can be characterized by the two quantities

$$\begin{aligned} c_p=\inf _\mathbf{u \ne 0}\frac{(\mathbf P _{MAS}{} \mathbf u , \mathbf u )_{\widehat{\mathbf{A }}}}{(\mathbf u , \mathbf u )_{\widehat{\mathbf{A }}}} \quad \quad \text {and} \quad \quad C_p=\sup _\mathbf{u \ne 0}\frac{\Vert \mathbf P _{MAS}{} \mathbf u \Vert _{\widehat{\mathbf{A }}}}{\Vert \mathbf u \Vert _{\widehat{\mathbf{A }}}}, \end{aligned}$$

and the residual at the kth iteration is bounded as

$$\begin{aligned} \Vert \mathbf r _k\Vert _{\widehat{\mathbf{A }}}\le \left( 1-\frac{c_p^2}{C_p^2} \right) ^{\frac{k}{2}}\Vert \mathbf r _0\Vert _{\widehat{\mathbf{A }}}, \end{aligned}$$

where \(\mathbf r _k=\mathbf b -\mathbf P _{MAS}{} \mathbf u ^k\). In Li et al. (2018) it is shown that \(c_p=c\left( \max _{1\le l \le L}(1+h_{l-1}/\delta _l)\right) ^{-1}\) and \(C_p=C\left( 1+\Vert \varTheta \Vert _2\right) \), where C and c are constants independent of the mesh sizes, the number of subdomains, the window size and the number of levels. The matrix \(\varTheta =\{\theta ^{lk}_{jq,ip}\}_{l,k\le L,j\le M_l,q\le M_k,i\le N_l,p\le N_k}\), where \(\theta ^{lk}_{jq,ip}\) is the cosine of the angle between the subspaces \((V_l^i)^{s_{lj}}\) and \((V_k^p)^{s_{kq}}\), i.e., \(\theta ^{lk}_{jq,ip}=\cos \left( (V_l^i)^{s_{lj}},(V_k^p)^{s_{kq}}\right) \).

3 Numerical results

In this section, we present some numerical results to illustrate the parallel performance of the space-time Schwarz algorithms. We implement these algorithms on top of the open-source package PETSc (Balay et al. 2018). All experiments are carried out on the Tianhe-2 supercomputer located at the National Supercomputer Center in Guangzhou. Only CPUs are used in the tests.

In the numerical experiments, we divide the space-time domain \([0,1]^3\times [0,32]\) into \(N_p=P_x\times P_y\times P_z\times P_t\) subdomains and assign each subdomain to one processor. All the subdomain problems are solved using ILU, and the overlap is set to 1. The problem is solved by a preconditioned GMRES(30) method with the stopping condition

$$\begin{aligned} \frac{\Vert \mathbf r _k\Vert _2}{\Vert \mathbf r _0\Vert _2}\le 10^{-6}. \end{aligned}$$
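For reference, the one-level version of this solver configuration can be expressed with standard PETSc options; a petsc4py sketch is given below. The option names are standard PETSc ones, but the matrix and vectors (A_petsc, b_petsc, x_petsc) are assumed to have been assembled elsewhere, and the multilevel space-time preconditioner itself requires custom code beyond the built-in ASM:

```python
from petsc4py import PETSc

opts = PETSc.Options()
opts['ksp_type'] = 'gmres'         # GMRES
opts['ksp_gmres_restart'] = 30     # GMRES(30)
opts['ksp_rtol'] = 1.0e-6          # ||r_k||_2 / ||r_0||_2 <= 1e-6
opts['pc_type'] = 'asm'            # additive Schwarz
opts['pc_asm_overlap'] = 1         # overlap set to 1
opts['pc_asm_type'] = 'restrict'   # RAS (Algorithm 2); 'basic' gives classical AS
opts['sub_pc_type'] = 'ilu'        # ILU on each subdomain problem

ksp = PETSc.KSP().create()
ksp.setOperators(A_petsc)          # A_petsc: assembled space-time matrix (assumed)
ksp.setFromOptions()
ksp.solve(b_petsc, x_petsc)        # b_petsc, x_petsc: PETSc Vecs (assumed)
```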
Table 1 The average number of iterations and the total compute time for solving Example 1 on a \(65\times 65\times 65\times 2048\) space-time mesh using Algorithm 1
Table 2 The average number of iterations and the total compute time for solving Example 1 on a \(65\times 65\times 65\times 2048\) space-time mesh using Algorithm 2

Example 1

Consider a three dimensional convection diffusion equation

$$\begin{aligned} \left\{ \begin{array}{rll} u_t-\Delta u+\mathbf b \cdot \nabla u &{}=\ f, &{} \text {in} \ \varOmega \times (0,T], \\ u(x,y,z,0)&{}=\ \frac{1}{0.8} e^{-(x^2+y^2+z^2)/(0.8)}, &{} \text {in} \ \varOmega , \end{array}\right. \end{aligned}$$

where \(\mathbf b =(1,1,1)\), \(f=\frac{1-x-y-z}{8(t+0.2)^2}e^{-(x^2+y^2+z^2)/(4 (t+0.2))}\) and the analytical solution is

$$\begin{aligned} u(x,y,z,t)=\frac{1}{4 (t+0.2)}e^{-(x^2+y^2+z^2)/(4 (t+0.2))} \end{aligned}$$

with consistent Dirichlet boundary conditions.

We test this example to illustrate the strong scalability of the space-time methods with different window sizes s on a \(65\times 65\times 65 \times 2048\) space-time mesh. The mesh size and the time step size on level \(l \ (l=0,1)\) are set as \(h_0=\tau _0=1/32\) and \(h_1=\tau _1=1/64\), respectively. To obtain better parallel performance, the spatial mesh is decomposed evenly so that each processor holds approximately the same number of mesh points. Some cases, marked by “-” in Tables 1 and 2, are omitted. Moreover, “*” indicates that the number of processors is too small to solve the problem.

The numerical results of Algorithm 1 are presented in Table 1 and Fig. 2. Note that the sequential time stepping algorithm is the special case of Algorithm 1 with \(s=1\). Table 1 shows that the average number of iterations is bounded independently of the number of processors for each window size s. For the case \(s=1\), the total compute time decreases quickly at the beginning and then increases slightly with the number of processors. This means that the computation/communication ratio decreases as we increase the number of processors. The compute time curve on the left of Fig. 2 shows that the strong scalability of the classical time stepping algorithm is not good when the number of processors is large. For the cases \(s=2,4,32,256,512,1024\), the compute time curves behave similarly to the case \(s=1\), but they perform much better than the classical time stepping algorithm as we increase the number of processors. It is clear that the compute time curve is linear for the case \(s=2048\), i.e., we obtain linear speedup when solving all time steps of the problem at once. The crossover points (intersections with the curve corresponding to \(s=1\)) in Fig. 2 demonstrate the benefit of the space-time algorithm compared to the time stepping algorithm. For example, the crossover point is at about 64 processors for the cases \(s=2,4\), and it moves to a larger number of processors as s increases. When the problem size is fixed, we observe that the time stepping algorithm is faster when the number of processors is small, but the space-time Schwarz algorithm performs much better once the number of processors goes beyond a certain threshold. We next study Algorithm 2. The numerical results in Table 2 and Fig. 3 show that the average number of iterations is almost the same as that of Algorithm 1, and the compute time is slightly less than that required by Algorithm 1 for each s.

Fig. 2

The total compute time for solving Example 1 using sequential time-stepping (left) and Algorithm 1 with different window size s (right)

Fig. 3

The total compute time for solving Example 1 using sequential time-stepping (left) and Algorithm 2 with different window size s (right)

4 Concluding remarks

To exploit the full power of supercomputers, a new generation of methods with space-time concurrency is being developed for solving time-dependent problems. In this paper, we gave an overview of several important time parallel methods, including waveform relaxation, parareal and space-time multigrid. We also introduced a class of implicit space-time domain decomposition methods for solving parabolic PDEs. More precisely, we extended the classical Schwarz methods for elliptic problems to space-time preconditioners for parabolic problems, and described the optimal convergence theory for the space-time additive Schwarz algorithms using a matrix-based formulation. The corresponding theory for the restricted version of the method is not yet available. The numerical experiments for three dimensional problems show that both the additive and the restricted additive Schwarz methods have excellent parallel scalability, and both outperform the classical time stepping method when the number of processors is large, although they require more memory, as expected. Future work will focus on extending the methods to more realistic applications.