1 Introduction

A typical fluid flow is random or chaotic in the turbulent and instability regimes. Therefore, accurate numerical schemes are needed to simulate such flows. A pseudospectral algorithm [1, 2] is one of the most accurate methods for solving fluid flows, and it is employed for performing direct numerical simulations of turbulent flows, as well as for critical applications like weather prediction and climate modelling. Yokokawa et al [3, 4], Donzis et al [5], and Pouquet et al [6] have performed spectral simulations on some of the largest grids (e.g., 4096³).

We have developed a general-purpose flow solver named Tarang (waves in Sanskrit) for turbulence and instability studies. Tarang is a parallel and modular code written in the object-oriented language C++. Using Tarang, we can solve incompressible flows involving pure fluids, Rayleigh–Bénard convection, passive and active scalars, magnetohydrodynamics, liquid metals, etc. Tarang is an open-source code, and it can be downloaded from http://turbulence.phy.iitk.ac.in. In this paper we describe some details of the code, its scaling results, and the code validation performed on Tarang.

2 Salient features of Tarang

The basic steps of Tarang follow the standard pseudospectral procedure [1, 2]. However, we took several design decisions that help us scale the code, as well as solve a large range of problems. Some of the design issues considered by us are:

  (a) We chose an object-oriented structure for Tarang, with C++ as the programming language. The modularity of the code helps us introduce new solvers very easily. Also, the basic functions, e.g., transforms and input–output, can be changed without affecting the other parts of the code. As a result, more than a dozen modules have been implemented in Tarang.

  (b) Grids as large as 4096³ require a huge amount of memory. Hence, researchers often use single-precision rather than double-precision calculations to save memory. Tarang allows the user to choose single or double precision through a simple switch. In the present paper, we present our results for single precision. Note, however, that the results of single-precision and double-precision calculations are very similar (within a percent), and are consistent with the results of Yokokawa et al [3, 4].

The basic numerical structure of Tarang follows the standard pseudospectral algorithm. The Navier–Stokes and related equations are solved numerically given an initial condition of the fields. The fields are time-stepped using one of the time integrators. The nonlinear terms, e.g., \({\bf u\cdot \nabla u}\), transform to convolutions in the spectral space, which are very expensive to compute directly. Orszag devised a clever scheme to compute the convolution efficiently using fast Fourier transforms (FFTs) [1, 2]. In this scheme, the fields are transformed from the Fourier space to the real space, multiplied with each other, and then transformed back to the Fourier space. Note that the spectral transforms can involve Fourier functions, sines and cosines, Chebyshev polynomials, spherical harmonics, or a combination of these functions, depending on the boundary conditions. For details, the reader is referred to Boyd [1] and Canuto et al [2]. Some of the specific algorithmic choices made in Tarang are as follows:

  (a) In the turbulent regime, the two relevant time-scales, the large-eddy turnover time and the small-scale viscous time, differ by orders of magnitude. To handle this stiffness, we use the ‘exponential trick’, which absorbs the viscous term through a change of variable [2] (see the sketch after this list).

  (b) We use the fourth-order Runge–Kutta (RK4) scheme for time stepping. The code, however, also has options for the Euler and second-order Runge–Kutta schemes.

  (c) The code provides an option for dealiasing the fields using the 3/2 rule [2].

  (d) The wavenumber components \(k_i\) are

    $$ k_i = \frac{2\pi}{L_i} n_i, $$
    (1)

    where \(L_i\) is the box dimension in the \(i\)th direction and \(n_i\) is an integer. We use parameters

    $$ \mathrm{kfactor}_i = \frac{2\pi}{L_i} $$
    (2)

    to control the box size, especially for Rayleigh–Bénard convection. Note that typical spectral codes take \(\mathrm{kfactor}_i = 1\), i.e., \(k_i = n_i\).
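As an illustration of item (a), the ‘exponential trick’ is the standard integrating-factor step (written here in our notation, which need not match Tarang’s internal variables): for each Fourier mode, the equation \(\partial_t \hat{u}_i({\bf k},t) = -\nu k^2 \hat{u}_i({\bf k},t) + N_i({\bf k},t)\), where \(N_i\) collects the nonlinear and pressure contributions, is rewritten using \(\hat{u}_i = \mathrm{e}^{-\nu k^2 t}\, w_i\), so that

$$ \partial_t w_i({\bf k},t) = \mathrm{e}^{\nu k^2 t}\, N_i({\bf k},t). $$

The stiff viscous term is thus integrated exactly, and only \(w_i\) needs to be advanced with the RK4 scheme.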

The parallel implementation of Tarang involved parallelization of the spectral transforms and the input–output operations, as described below.

3 Parallelization strategy

A pseudospectral code involves forward and inverse transforms between the spectral and real spaces. In a typical pseudospectral code, these operations take ~80% of the total time. Therefore, we use one of the most efficient parallel FFT routines, FFTW (Fastest Fourier Transform in the West) [7], in Tarang. We adopt FFTW’s strategy for dividing the arrays. If p is the number of available processors, we divide each of the arrays into p ‘slabs’. For example, a complex array \(A(N_1, N_2, N_3/2+1)\) is split into \(A(N_1/p, N_2, N_3/2+1)\) segments, each of which is handled by a single processor. This division is called ‘slab decomposition’. The other time-consuming tasks in Tarang are the input and output (I/O) operations on large datasets and the element-by-element multiplication of arrays. The datasets in Tarang are massive; for example, the data size of a 4096³ fluid simulation is of the order of 1.5 terabytes. For I/O operations, we use an efficient parallel library, the Hierarchical Data Format 5 (HDF5). The third operation, element-by-element multiplication of arrays, is handled by the individual processors in a straightforward manner.
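To make the slab decomposition concrete, the sketch below shows how a slab-decomposed real-to-complex transform can be set up with FFTW’s MPI interface (double precision for brevity; the array names and grid size are illustrative and do not reflect Tarang’s own wrappers):

```cpp
#include <mpi.h>
#include <fftw3-mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    const ptrdiff_t N1 = 1024, N2 = 1024, N3 = 1024;

    // Each process owns local_n1 contiguous planes (a 'slab') starting at local_start.
    ptrdiff_t local_n1, local_start;
    ptrdiff_t alloc_local = fftw_mpi_local_size_3d(N1, N2, N3/2 + 1, MPI_COMM_WORLD,
                                                   &local_n1, &local_start);

    // The real array is padded along the last dimension to 2*(N3/2+1) elements.
    double       *u_real = fftw_alloc_real(2 * alloc_local);
    fftw_complex *u_hat  = fftw_alloc_complex(alloc_local);

    fftw_plan forward = fftw_mpi_plan_dft_r2c_3d(N1, N2, N3, u_real, u_hat,
                                                 MPI_COMM_WORLD, FFTW_MEASURE);
    fftw_plan inverse = fftw_mpi_plan_dft_c2r_3d(N1, N2, N3, u_hat, u_real,
                                                 MPI_COMM_WORLD, FFTW_MEASURE);

    // ... fill the local slab of u_real with the field values ...
    fftw_execute(forward);   // real space -> Fourier space
    fftw_execute(inverse);   // Fourier space -> real space (unnormalized by N1*N2*N3)

    fftw_destroy_plan(forward);
    fftw_destroy_plan(inverse);
    fftw_free(u_real);
    fftw_free(u_hat);
    MPI_Finalize();
    return 0;
}
```

The slab held by each process corresponds to the \(A(N_1/p, N_2, N_3/2+1)\) segment described above; the global transpose needed by the distributed transform is handled internally by FFTW.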

Tarang has been organized in a modular fashion, and so the spectral transforms and I/O operations were easily parallelized. For a periodic box, we use the parallel FFTW library itself. However, for the mixed transforms (e.g., sine transform along x and Fourier transform along yz planes), we parallelize the transforms ourselves using one- and two-dimensional FFTW transforms.

An important aspect of any parallel simulation code is its scalability. We tested the scaling of FFTW and Tarang by performing simulations on 1024³, 2048³, and 4096³ grids with varying numbers of processors. The simulations were performed on the HPC system of IIT Kanpur and on the Shaheen supercomputer of King Abdullah University of Science and Technology (KAUST). The HPC system has 368 compute nodes connected via a 40 Gbps QLogic InfiniBand switch, with each node containing dual quad-core Intel Xeon C5570 processors and 48 GB of RAM. Its peak performance (\(R_\mathrm{peak}\)) is ~34 teraflops (tera floating-point operations per second). Shaheen, on the other hand, is a 16-rack IBM Blue Gene/P system with 65,536 cores and 65,536 GB of RAM. Shaheen’s peak performance is ~222 teraflops.

For parallel FFT with slab decomposition, we compute the time taken per step (forward + inverse transform) on Shaheen for several large \(N^3\) grids. The results displayed in figure 1 demonstrate approximate linear scaling (called ‘strong scaling’). Using the fact that each forward-plus-inverse FFT involves \(5N^3\log_2 N^3\) operations for single-precision computations [7], the average FFT performance per core on Shaheen is ~0.3 gigaflops, which is only about 8% of its peak performance. Similar efficiency is observed for the HPC system as well, whose cores have a rating of ~12 gigaflops. This loss of efficiency is consistent with other FFT libraries, e.g., P3DFFT [8]. Also note that increasing the data size and the number of processors (resources) by the same factor takes approximately the same time (see figure 1). For example, the FFT of a 1024³ array on 128 processors, as well as that of a 2048³ array on 1024 processors, takes ~4 s, as shown in table 1. Thus, our implementation of FFT shows good ‘weak scaling’ as well.
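As a consistency check (our arithmetic, using the numbers quoted above), the 2048³ transform on 1024 cores takes about 4 s per forward-plus-inverse pair, so the per-core rate is

$$ \frac{5N^3\log_2 N^3}{p\,t} = \frac{5\times 2048^3\times 33}{1024\times 4\ \mathrm{s}} \approx 3.5\times10^{8}\ \mathrm{flops/s} \approx 0.35\ \mathrm{gigaflops}, $$

in agreement with the ~0.3 gigaflops quoted above.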

Figure 1. Scaling of parallel FFT on Shaheen for 1024³, 2048³, and 4096³ grids with single-precision computation. The straight lines represent the ideal linear scaling.

Table 1 Weak scaling of FFT using slab decomposition performed on Shaheen.

We also tested the scaling of Tarang on Shaheen and on the HPC system. Figures 2 and 3 exhibit the scaling results of fluid simulations performed on these systems. Figure 4 shows the scaling results for a magnetohydrodynamics (MHD) simulation on Shaheen. These plots demonstrate strong scaling of Tarang, consistent with the aforementioned FFT scaling. Sometimes we observe a small loss of efficiency when N = p. Approximate weak scaling of the fluid turbulence solver of Tarang on Shaheen and on the HPC system is shown in tables 2 and 3 respectively. Approximate weak scaling of the MHD solver of Tarang on Shaheen is shown in table 4.

Figure 2. Scaling of Tarang’s fluid solver on Shaheen for 1024³ and 2048³ grids with single-precision computation. The straight lines represent the ideal linear scaling.

Figure 3. Scaling of Tarang’s fluid solver on the HPC system of IIT Kanpur for 1024³, 2048³, and 4096³ grids with single-precision computation. The straight lines represent the ideal linear scaling.

Figure 4. Scaling of Tarang’s magnetohydrodynamic (MHD) solver on Shaheen for 1024³ and 2048³ grids with single-precision computation. The straight lines represent the ideal linear scaling.

Table 2 Weak scaling of the fluid solver of Tarang performed on Shaheen.
Table 3 Weak scaling of the fluid solver of Tarang performed on HPC system.
Table 4 Weak scaling of the MHD solver of Tarang performed on Shaheen.

A critical limitation of the ‘slab decomposition’ is that the number of processors cannot exceed \(N_1\). This limitation can be overcome in a scheme called ‘pencil decomposition’, in which the array \(A(N_1, N_2, N_3/2+1)\) is split into \(A(N_1/p_1, N_2/p_2, N_3/2+1)\) pencils, where the total number of processors is \(p = p_1\times p_2\) [8]. We are in the process of implementing the ‘pencil decomposition’ in Tarang. In this paper we focus only on the ‘slab decomposition’.

Having discussed the parallelization of the code, we now turn to code validation and to the time and space complexities of simulations of fluid turbulence, Rayleigh–Bénard convection, and magnetohydrodynamic turbulence.

4 Fluid turbulence

The governing equations for incompressible fluid turbulence are

$$ \partial_{t}{\bf u} + ({\bf u}\cdot\nabla){\bf u} = -\nabla{p}+ \nu\nabla^2 {\bf u} + {\bf F}^u, $$
(3)
$$ \nabla \cdot {\bf u} = 0, $$
(4)

where u is the velocity field, p is the pressure field, ν is the kinematic viscosity, and \({\bf F}^u\) is the external forcing. For studies on homogeneous and isotropic turbulence, simulations are performed on high-resolution grids (e.g., 2048³, 4096³) with periodic boundary conditions. The resolution requirement is stringent owing to the \(N \sim \mathrm{Re}^{3/4}\) relation; for \(\mathrm{Re} = 10^5\), the required grid resolution is ~5600³, which is quite challenging even for modern supercomputers.
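For concreteness (our arithmetic), the estimate quoted above follows from

$$ N \sim \mathrm{Re}^{3/4} = (10^5)^{3/4} \approx 5.6\times10^{3}, $$

so that a fully resolved simulation at \(\mathrm{Re}=10^5\) requires roughly \(5600^3\) grid points.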

Regarding the space complexity of a forced fluid turbulence simulation, Tarang requires 15 arrays (for \({\bf u}({\bf k})\), \({\bf u}({\bf r})\), \({\bf F}^u({\bf k})\), \({\bf nlin}({\bf k})\), and three temporary arrays), which translates to ~120 gigabytes (8 terabytes) of memory for 1024³ (4096³) double-precision computations. Here k and r represent the wavenumbers and the real-space coordinates respectively. The requirement is halved for a single-precision simulation. Regarding the time requirement, each numerical step of the fourth-order Runge–Kutta (RK4) scheme requires 9×4 FFT operations; the factor 9 arises from the three inverse and six forward transforms performed for each of the four RK4 iterates. Therefore, for every time-step, the FFT operations require \(36 \times 2.5\times N^3 \log_2(N^3)\) multiplications for a single-precision simulation [7], which translates to ~2.9 (185) tera floating-point operations for 1024³ (4096³) grids. The number of operations for a double-precision computation is twice the above estimate. On 128 processors of the HPC system, a single-precision fluid simulation takes ~36 s (see figure 3), which corresponds to a per-core performance of ~0.68 gigaflops. This is only 6% of the peak performance of the cores, which is consistent with the efficiency of the FFT operations discussed in §3. Note that the solver also involves other operations, e.g., element-by-element array multiplication, but these take only a small fraction of the total time.
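As a check on these figures (our arithmetic): 15 double-precision arrays of \(N^3\) elements occupy

$$ 15\times1024^3\times 8\ \mathrm{bytes} \approx 1.3\times10^{11}\ \mathrm{bytes}\approx 120\ \mathrm{gigabytes}, \qquad 15\times4096^3\times 8\ \mathrm{bytes}\approx 8\ \mathrm{terabytes}, $$

and the per-step FFT work for a 1024³ grid is \(36\times2.5\times1024^3\times\log_2(1024^3)\approx 2.9\times10^{12}\) floating-point operations, consistent with the values quoted above.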

We can also estimate the total time required to perform a 4096³ fluid simulation. A typical fluid turbulence run requires about five eddy turnover times with \(\mathrm{d}t \approx 5\times10^{-4}\), which corresponds to \(10^4\) time-steps. The FFTs alone therefore require \(185\times10^{4}\) tera floating-point operations for this single-precision simulation. Assuming 5% efficiency for the FFT, and the FFTs’ share being 80% of the total time, the aforementioned fluid simulation will take ~128 h on a 100 teraflop cluster.
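Reproducing this estimate from the numbers above (our arithmetic),

$$ T \approx \frac{185\times10^{4}\ \mathrm{teraflop}}{0.05\times100\ \mathrm{teraflop/s}}\times\frac{1}{0.8} \approx 4.6\times10^{5}\ \mathrm{s}\approx 128\ \mathrm{h}. $$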

We perform code validation of the fluid solver using Kolmogorov’s theory for the third-order structure function [9], according to which

$$ S^{\parallel}_3(r) = \langle \{ u_{\parallel}({\bf x+r}) - u_{\parallel}({\bf x}) \}^3 \rangle = -\frac{4}{5} \epsilon r, $$
(5)

where ϵ is the energy flux in the inertial range and \(\langle \cdots \rangle\) represents ensemble averaging (here spatial averaging). We compute the structure function \(S^{\parallel}_3(r)\), as well as \(S^{\parallel}_5(r)\), \(S^{\parallel}_7(r)\), and \(S^{\parallel}_9(r)\), for the steady-state dataset of a fluid simulation on a 1024³ grid. The computed values of \(S^{\parallel}_q(r)\), illustrated in figure 5, show good agreement with Kolmogorov’s theory.
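A minimal serial sketch of how a longitudinal structure function can be evaluated along one direction of a periodic box is given below (the averaging over a single direction and the variable names are our simplifications, not Tarang’s implementation):

```cpp
#include <cmath>
#include <vector>

// S_q^parallel(r) along x for an N^3 periodic box stored in row-major order:
// average of { u_x(x+r, y, z) - u_x(x, y, z) }^q over all grid points.
double structure_function_x(const std::vector<double>& ux, int N, int r, int q)
{
    double sum = 0.0;
    for (int i = 0; i < N; ++i) {
        const int ip = (i + r) % N;                    // periodic shift along x
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k) {
                const std::size_t a = (std::size_t(ip) * N + j) * N + k;
                const std::size_t b = (std::size_t(i)  * N + j) * N + k;
                sum += std::pow(ux[a] - ux[b], q);
            }
    }
    return sum / (double(N) * N * N);
}
```

In practice one also averages over the other two directions and over several snapshots of the steady-state data.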

Figure 5. Plots of the normalized odd-order structure functions \(-S^{\parallel}_n(r)/(\epsilon r)^{n/3}\) vs. r/η for a fluid simulation using Tarang. Here ϵ is the energy flux and η is the Kolmogorov scale.

After the discussion on the fluid solver, we now move on to the module for solving Rayleigh–Bénard convection.

5 Rayleigh–Bénard convection

Rayleigh–Bénard convection (RBC) is an idealized model of convection in which a fluid is confined between two plates that are separated by a distance d and maintained at temperatures \(T_0\) and \(T_0 - \Delta\). The equations for the fluid under the Boussinesq approximation are

$$ \partial_{t}\mathbf{u} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{\nabla \sigma}{{\rho}_0} + \alpha g \theta \hat{z} + \nu{\nabla}^2 \mathbf{u}, $$
(6)
$$ \partial_{t}\theta + (\mathbf{u} \cdot \nabla)\theta = \frac{\Delta}{d} u_{z} + \kappa{\nabla}^{2}\theta, $$
(7)
$$ \nabla \cdot {\bf u} = 0, $$
(8)

where θ is the temperature fluctuation (\(T = T_c + \theta\), with \(T_c\) the conduction temperature profile), σ is the pressure fluctuation from the steady conduction state, \(\hat{z}\) is the buoyancy direction, Δ is the temperature difference between the two plates, ν is the kinematic viscosity, and κ is the thermal diffusivity. We solve the nondimensionalized equations, which are obtained using d as the length scale, κ/d as the velocity scale, and Δ as the temperature scale:

$$ \frac{\partial{\textbf{u}}}{\partial{t}}+ (\textbf{u}\cdot \nabla)\textbf{u} = -\nabla\sigma + \mathrm{Ra\thinspace Pr}\thinspace\theta \hat{z} + \mathrm{Pr}\thinspace\nabla^{2}\textbf{u}, $$
(9)
$$ \frac{\partial{\theta}}{\partial{t}}+(\textbf{u}\cdot\nabla)\theta = u_{3} + \nabla^{2}\theta. $$
(10)

Here the two important nondimensional parameters are the Rayleigh number \(\mathrm{Ra} = \alpha g \Delta d^3/(\nu\kappa)\) and the Prandtl number \(\mathrm{Pr} = \nu/\kappa\). At present, we can apply the free-slip boundary condition for the velocity field at the horizontal plates, i.e.,

$$ u_3= \partial_z {u_1}=\partial_z {u_2}=0, \quad \mathrm{for\ } z=0,1, $$
(11)

and isothermal boundary condition on the horizontal plates

$$ \theta=0, \quad \mathrm{for\ } z=0,1. $$
(12)

Periodic boundary conditions are applied at the vertical boundaries. Note that the free-slip boundary condition is less common than the no-slip boundary condition, but it is quite useful in many situations, e.g., the upper layer of the atmosphere and flows involving two immiscible liquids.

The number of arrays required for an RBC simulation is 18 (15 for the fluid plus three for \(\theta({\bf k})\), \(\theta({\bf r})\), and \(\mathrm{nlin}_\theta({\bf k})\)). Thus, the memory requirement for RBC is (18/15) times that of the fluid simulation. Regarding the time complexity, the number of FFT operations required per time-step is 13×4 (four inverse and nine forward transforms per RK4 iterate). As a result, the total time requirement for an RBC simulation is (13/9) times that of the corresponding fluid simulation.

For code validation of Tarang’s RBC solver, we compare the Nusselt number \(\mathrm{Nu} = 1 + \langle u_{3}\theta\rangle\) computed using Tarang with that computed by Thual [10] for a two-dimensional simulation with the free-slip boundary condition. The analysis is performed on the steady-state dataset. The comparative results shown in table 5 illustrate excellent agreement between the two runs. We also compute the Nusselt number for a three-dimensional flow with Pr = 6.8 and observe that \(\mathrm{Nu} = (0.27\pm0.04)(\mathrm{Pr\,Ra})^{0.27\pm0.01}\) [11], which is in good agreement with earlier experimental and numerical results.
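A minimal sketch of this Nusselt-number diagnostic, evaluated as a volume average of the real-space fields in the nondimensional units of eqs (9) and (10), is given below (the names are illustrative, not Tarang’s API):

```cpp
#include <cstddef>
#include <vector>

// Nu = 1 + <u3 * theta>, where <.> denotes the volume average over all grid
// points of the nondimensionalized vertical velocity and temperature fields.
double nusselt_number(const std::vector<double>& u3, const std::vector<double>& theta)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < u3.size(); ++i)
        sum += u3[i] * theta[i];
    return 1.0 + sum / static_cast<double>(u3.size());
}
```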

Table 5 Verification of Tarang against Thual’s [10] 2D RBC simulations. We compare Nusselt numbers (Nu) computed in our simulations on a 64² grid against Thual’s simulations on 16² (THU1), 32² (THU2), and 64² (THU3) grids. All Nu values tabulated here are for Pr = 6.8.

Using the RBC module of Tarang, we also studied the energy spectra and fluxes of the velocity and temperature fields [12], the Nusselt number scaling [11], and chaos and bifurcations near the onset of convection [13, 14].

In the next section we shall discuss the results of the MHD module of Tarang.

6 Magnetohydrodynamic turbulence and dynamo

The equations for the incompressible MHD turbulence [15] are

$$ \partial_{t}{\bf u} + ({\bf u}\cdot\nabla){\bf u} = -\nabla{p}+ ({\bf B}\cdot\nabla){\bf B} + \nu\nabla^2 {\bf u} + {\bf F}^u, $$
(13)
$$ \partial_{t}{\bf B} + ({\bf u}\cdot\nabla){\bf B} = ({\bf B}\cdot\nabla){\bf u} + \eta \nabla^2 {\bf B}+ {\bf F}^B, $$
(14)
$$ \nabla \cdot {\bf u} = \nabla \cdot {\bf B} = 0, $$
(15)

where u, B, and p are the velocity, magnetic, and pressure (thermal + magnetic) fields respectively, ν is the kinematic viscosity, and η is the magnetic diffusivity. Here \({\bf F}^u\) and \({\bf F}^B\) are the external forcing terms for the velocity and magnetic fields respectively. Typically \({\bf F}^B = 0\), but Tarang implements \({\bf F}^B\) for generality. The magnetic field B can be separated into its mean \({\bf B}_0\) and fluctuations b: \({\bf B} = {\bf B}_0 + {\bf b}\). The above equations contain four nonlinear terms, whose computation requires 27 FFTs. However, the number of FFT computations in terms of the Elsasser variables \({\bf z}^{\pm} = {\bf u} \pm {\bf b}\) is only 15, thus saving significant computing time. We use the relations

$$ ({\bf u}\cdot\nabla){\bf u} - ({\bf B}\cdot\nabla){\bf B} = \frac{1}{2}\left[({\bf z^-}\cdot\nabla){\bf z^+} + ({\bf z^+}\cdot\nabla){\bf z^-}\right], $$
(16)
$$ ({\bf u}\cdot\nabla){\bf B} - ({\bf B}\cdot\nabla){\bf u} = \frac{1}{2}\left[({\bf z^-}\cdot\nabla){\bf z^+} - ({\bf z^+}\cdot\nabla){\bf z^-}\right] $$
(17)

to compute the nonlinear terms. Thus, the time requirement for an MHD simulation is around 15/9 times that of the fluid simulation. In figure 4 we plot the time taken per step for different sets of processors on Shaheen; the results are consistent with the above estimate. Regarding the space complexity, an MHD simulation requires 27 arrays for storing \({\bf u}({\bf k})\), \({\bf u}({\bf r})\), \({\bf B}({\bf k})\), \({\bf B}({\bf r})\), \({\bf F}^u({\bf k})\), \({\bf F}^B({\bf k})\), \({\bf nlin}^u({\bf k})\), \({\bf nlin}^B({\bf k})\), and three temporary fields. Hence, the memory requirement for an MHD simulation is 27/15 times that of a fluid simulation.
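For completeness, the first identity follows directly from \({\bf z}^{\pm} = {\bf u} \pm {\bf b}\) (with \({\bf B}_0 = 0\), so that \({\bf B} = {\bf b}\)): expanding the right-hand side,

$$ ({\bf z^-}\cdot\nabla){\bf z^+} + ({\bf z^+}\cdot\nabla){\bf z^-} = [({\bf u}-{\bf B})\cdot\nabla]({\bf u}+{\bf B}) + [({\bf u}+{\bf B})\cdot\nabla]({\bf u}-{\bf B}) = 2\left[({\bf u}\cdot\nabla){\bf u} - ({\bf B}\cdot\nabla){\bf B}\right], $$

and the second identity follows in the same way from the difference of the two terms.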

We perform code validation of Tarang’s MHD module using the results of Breyiannis and Valougeorgis’ [16] lattice kinetic simulations of three-dimensional decaying MHD. Following Breyiannis and Valougeorgis, we solve the MHD equations inside a cube with periodic boundary conditions in all directions, and with a Taylor–Green vortex (given below) as the initial condition,

$$ {\bf u} = [\sin(x)\cos(y)\cos(z), -\cos(x)\sin(y)\cos(z), 0], $$
(18)
$$ {\bf B} = [\sin(x)\sin(y)\cos(z), \cos(x)\cos(y)\cos(z), 0]. $$
(19)

This Taylor–Green vortex is then allowed to evolve freely. The simulation box is discretized using 32³ grid points.
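A minimal sketch of this initial condition on a uniform grid of the \(2\pi\)-periodic box is given below (serial and illustrative; this is not Tarang’s initialization routine):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Fill the velocity and magnetic fields of eqs (18) and (19) on an N^3 grid
// covering the 2*pi-periodic box; index (i,j,k) maps to (i*N + j)*N + k.
void taylor_green_init(int N,
                       std::vector<double>& ux, std::vector<double>& uy,
                       std::vector<double>& Bx, std::vector<double>& By)
{
    const double pi = std::acos(-1.0);
    const double h  = 2.0 * pi / N;            // grid spacing
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k) {
                const double x = i * h, y = j * h, z = k * h;
                const std::size_t idx = (std::size_t(i) * N + j) * N + k;
                ux[idx] =  std::sin(x) * std::cos(y) * std::cos(z);
                uy[idx] = -std::cos(x) * std::sin(y) * std::cos(z);
                Bx[idx] =  std::sin(x) * std::sin(y) * std::cos(z);
                By[idx] =  std::cos(x) * std::cos(y) * std::cos(z);
            }
    // u_z = B_z = 0 everywhere for this initial condition (N = 32 in the test above).
}
```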

The results of this test case for different parameter values (ν = η = 0.01, 0.05, 0.1) are presented in figure 6. The top and bottom panels exhibit the time evolution of the total kinetic and magnetic energies respectively. Tarang’s data points, illustrated using blue dots, are in excellent agreement with Breyiannis and Valougeorgis’ results [16], which are represented by solid lines. We thus verify the MHD module of Tarang.

Figure 6. Time evolution of the total kinetic energy (a) and the total magnetic energy (b) for a decaying MHD simulation with the Taylor–Green vortex as the initial condition. Blue dots are Tarang’s data points, while the solid lines are the lattice simulation results of Breyiannis and Valougeorgis [16]. The three curves correspond to ν = η = 0.01, 0.05, and 0.1, from top to bottom.

We have used Tarang to perform extensive simulations of the dynamo transition under Taylor–Green forcing [17, 18]. Using Tarang, we have also computed the magnetic and kinetic energy spectra, various energy fluxes [15], and the shell-to-shell energy transfers for MHD turbulence; these results will be presented in a subsequent paper.

In addition to the fluid, MHD, and Rayleigh–Bénard convection solvers, Tarang has modules for simulating rotating turbulence, passive and active scalars, liquid metal flows, rotating convection [19], and Kolmogorov flow.

7 Conclusions

In this paper we have described the salient features and the code validation of Tarang. Tarang passes several validation tests performed for the fluid, Rayleigh–Bénard convection, and magnetohydrodynamic solvers. We also report a scaling analysis of Tarang and show that it exhibits excellent strong and weak scaling up to several thousand processors. Tarang has been used for studying Rayleigh–Bénard convection, dynamo action, and magnetohydrodynamic turbulence. It has been ported to various computing platforms, including the HPC system of IIT Kanpur, Shaheen of KAUST, PARAM YUVA of the Centre for Development of Advanced Computing (C-DAC), Pune, and EKA of the Computational Research Laboratories, Pune.