1 Introduction

Conjugate heat transfer is a phenomenon in which the conduction mode of heat transfer in a solid is combined with the convection mode of heat transfer in a fluid. Conduction usually dominates in solids and convection in fluids (Ate et al. 2010). Conjugate heat transfer is observed in a wide range of applications. For example, the optimal design of a heat exchanger involves the combination of heat transfer by conduction in the exchanger walls and by convection in the flowing fluid. Conjugate analysis is also important for regulating the temperature of heat sinks in most electronic devices, in the design and development of heating furnaces used in metallurgical processes, in the flow of hot gases in turbojet engines, and so on. For the safe, reliable, and efficient performance of electrochemical energy conversion systems and many other systems, the conjugate formulation is the most appropriate (Cukurel et al. 2012; He et al. 2011; Guan et al. 2017). Conjugate studies can be performed experimentally or by numerical methods. Numerical methods are nowadays used to simulate fluid flow problems ranging from the molecular to the global scale and have advanced from the analysis of two-dimensional configurations to three-dimensional ones; they are extensively used as design and optimization tools in industry. With computational analysis, multi-phase flow modeling has become feasible, adopting very fine mesh resolutions. Generally, numerical simulation is performed by running codes/software on modern-day computing machines (Bhatti et al. 2020; Khan et al. 2019, 2020; Ullah et al. 2019).

The thermal and fluid flow behavior of parallel plate channels has been reported in abundant literature, covering different applications under various operating conditions. The application areas mainly include electronic components, equipment and devices, nuclear fuel elements, electric vehicle battery systems, heat exchangers, solar flat plate collectors, mini- and micro-channels, etc. (Meng et al. 2016; Lindstedt and Karvinen 2017; Poddar et al. 2015; Adelaja et al. 2014). The channel formed by the set of plates in these studies is arranged horizontally, vertically, or inclined. Among the research available on parallel plate channels, some investigations are based purely on the uncoupled mode of heat transfer, while others are based on the coupled mode (Arici and Aydin 2009; Bilir 2002; Harman and Cole 2001). For numerical analysis of conjugate heat transfer and fluid flow, the finite-difference method (FDM) or finite-volume method (FVM) is most commonly adopted. The SIMPLE algorithm for solving the Navier–Stokes (NS) equations is a very popular method among researchers in these areas. However, this algorithm is compute intensive and slow to converge and, hence, demands parallelization to reduce the computational cost and achieve faster turnaround. To the best of our knowledge, parallelization of this algorithm in this field has not been reported in the literature. The following literature review is pertinent to the parallelization of various codes developed for thermal and fluid flow analysis.

Gropp et al. (2001) demonstrated the parallelization of the FUN3D code developed at NASA. This code uses an FVM-based discretization in which the convective fluxes are approximated with a variable-order Roe scheme and the viscous terms with a Galerkin discretization. Subdomain parallelization of FUN3D was successfully demonstrated using the MPI (message-passing interface) tool. The flux calculation phase needed 60% of the computational time and was parallelized using the OpenMP (open multi-processing) tool. Passoni et al. (2001) developed a code solving the transient NS equations for incompressible, three-dimensional shear flows. The parallelization of the code was achieved using MPI with their own schemes (schemes A/B/C). The parallel performance was analyzed on two computers with different processors and for various grid sizes; the results show that the performance is better for problems of medium grid size than for larger ones. A parallel efficiency of 91% was obtained on doubling the processors, and 60% on an eightfold increase in processors. Schulz et al. (2002) used the lattice Boltzmann method (LBM) for fluid flow analysis and established data structures to reduce memory requirements; MPI parallelization for a grid size of 2 × 10⁷ achieved about 90% parallel efficiency. MPI was also used for domain-decomposition-based parallelization of the reproducing kernel particle method to analyze three-dimensional flow over a cylinder and flow past a building, where 70 processors were used to obtain a speedup of 35 (Zhang et al. 2002). Peigin and Epstein (2004) used the NES code for aerodynamic design optimization, which is computationally very expensive; hence, MPI was used for multilevel parallelization of NES. A cluster of 144 processors was used for parallel execution, which provided around 95% parallel efficiency.

Eyheramendy (2003) used FEM analysis of a lid-driven cavity problem with 50,182 degrees of freedom (dofs) on a four-processor Compaq machine; on increasing the number of threads for different dofs, the parallel efficiency reduced. Jia and Sunden (2004) used the in-house code CALC-MP, which employs FVM with a collocated mesh arrangement; the code uses the Rhie and Chow method to interpolate face velocities and SIMPLEC (SIMPLE Consistent) for pressure–velocity coupling. Of the three schemes proposed, scheme 1 provided the best parallel performance owing to the overlap of the computation and communication phases. Lehmkuhl et al. (2007) used the CFD code ThermoFluids to obtain accurate and reliable results for industrial turbulent flow problems; its parallelization was carried out using the METIS software on a cluster of ten processors. Oktay et al. (2011) used the CFD code FAPeda, which applies an unstructured, cell-centered, tetrahedral FVM formulation. As the problems are complex and require heavy computation, parallelization becomes necessary; it was achieved using the MUMPS library, which is centered on a multi-frontal approach. The CFD code GenIDLEST, used for simulation of real-world problems, was parallelized using the OpenMP tool by Amritkar et al. (2012). GenIDLEST solves the transient NS equations in a body-fitted, multi-block coordinate system using a cell-centered FVM formulation; with fine-grained tuning of the OpenMP parallelism on 256 cores, the performance was shown to match that of MPI. In another study, Amritkar et al. (2014) provided OpenMP parallelization strategies for the simulation of dense particulate systems by the discrete element method (DEM) with the same code GenIDLEST. Rotary kiln heat transfer and fluidized bed problems were selected to demonstrate the coupled DEM–CFD problem; the OpenMP speedup was twice that of MPI on 25 cores. Steijl and Barakos (2018) applied the quantum Fourier transform to solve the Poisson equation in a vortex-in-cell method; MPI was used to parallelize the simulation of the quantum circuits of the Poisson solver. Gorobets et al. (2018) described the parallelization of the compressible NS equations for viscous turbulent flow using OpenMP + MPI + OpenCL (open computing language); a grid of 29 million cells was chosen, and about 250 GPUs (graphics processing units) and 2744 cores were employed for scalability and parallel performance analysis. Compute unified device architecture (CUDA) was used by Lai et al. (2018) to parallelize the compressible NS equations on NVIDIA's GTX 1070 GPU; when the block size was varied from 50 to 750, the parallel efficiency varied significantly, showing a maximum for a block size of 256.

The CFD code ultraFluidX, based on LBM, was parallelized using CUDA by Niedermeier et al. (2018); 95% efficiency for an empty wind tunnel problem and 80% efficiency for a wind tunnel with a car were obtained on eight GPUs. OpenMP was used to parallelize a Green-function-based code that requires immense computational time, applying 16 threads; for different discretization methods, speedups of up to 12 were achieved (Shan et al. 2018). FENICS software, based on FEM, was used for a two-model simulation analysis of a bacterial biofilm model and solute simulation; as reported by Sheraton and Sloot (2018), modeling the bacteria growth dynamics consumes huge computational time. The computations were performed on a 3.20 GHz Intel® Core™ i7-6900K CPU running Ubuntu Linux 14.04 using the METIS library. The parallelization analysis revealed that during the initial growth of the bacteria, the number of grid cells required is small, so parallel execution at this stage is not necessary (Sheraton and Sloot 2018). Wang et al. (2018) developed an in-house CFD code to solve the NS equations for compressible viscous flow in three dimensions; CPUs (central processing units) and Intel Xeon Phi co-processors were used for heterogeneous parallel computing employing MPI, OpenMP, and the offload programming model. CUDA, OpenMP, and hybrid OpenMP + CUDA parallelization of in-house CFD codes is also reported in the literature (Simmendinger and Kügeler 2010; Kafui et al. 2011; Xu et al. 2014; Jacobsen and Senocak 2011). Review articles by Afzal et al. (2017) and Pinto et al. (2016, 2017) provide detailed insight into parallel computing strategies for different CFD applications.

The past parallelization studies presented above report the parallelization of in-house developed codes using OpenMP, CUDA, and MPI for several applications. These codes mostly belong to fluid flow applications such as the analysis of aerofoils, fluidized beds, heat exchangers, etc.: Amritkar et al. (2012) parallelized the GenIDLEST code, Niedermeier et al. (2018) parallelized ultraFluidX, Oktay et al. (2011) used the CFD code FAPeda, Sheraton and Sloot (2018) parallelized FENICS software, and many more. However, none of these works addresses CUDA-based parallelization of CFD codes meant for conjugate heat transfer analysis, in which conduction dominates in the solid and convection in the fluid. The parallelization of a numerical method for the coupled heat and fluid flow problem solved using the SIMPLE algorithm with the CUDA and OpenMP tools is still a gap. The motivation of this work is to parallelize an in-house built generic CFD code for the analysis of various applications using a combined OpenMP and CUDA programming model. The prime objectives of this work are listed below:

  1.

    To find the effect of using different GPUs, various thread blocks, and a wide range of fluid flow conditions on the parallel performance of the FVM-based CFD code parallelized using CUDA.

  2.

    To provide an in-depth understanding of the speedups and parallel efficiency of the in-house code for different boundary conditions applied to heat-generating battery cells.

  3.

    To compare the speedup achieved using OpenMP parallelization on different computing machines with that of CUDA parallelization on GPUs.

In this work, the incompressible two-dimensional NS equations are solved using a staggered grid and the SIMPLE algorithm; the resulting code can be used for the analysis of plates, parallel plates, heat-generating plates, etc., such as battery cells, nuclear fuel plates, and fins. As a demonstration, a heat-generating Li-ion battery system is considered, in which the maximum temperature has to be kept within safe limits for effective thermal performance of the battery system. To prevent the maximum temperature from reaching its limit, air is forced to flow over the battery cells as a coolant. Hence, the simulation provided by the in-house developed code helps in understanding the thermal behavior of the coupled heat and fluid flow scenario, but at considerable computational expense. Parallelization of the FVM code using CUDA helps in completing an exhaustive numerical analysis within a practicable time. The rest of the paper is organized as follows. Section 2 provides the details of the numerical methodology, Sect. 3 describes the parallelization strategy, the results obtained are discussed in Sect. 4, and conclusions are drawn in Sect. 5.

Nomenclature

Ar  Aspect ratio of battery cell

k  Thermal conductivity

L  Length of battery cell

li  Length of extra inlet fluid domain

lo  Length of extra outlet fluid domain

Li  Dimensionless length of extra inlet fluid domain

Lo  Dimensionless length of extra outlet fluid domain

Pr  Prandtl number

q′′′  Volumetric heat generation

\( \bar{q} \)  Non-dimensional heat flux

Qr  Heat removed from surface (non-dimensional)

Re  Reynolds number

\( \bar{S}_{q} \)  Dimensionless volumetric heat generation

T  Temperature

To  Maximum allowable temperature of battery cell

\( \bar{T} \)  Non-dimensional temperature

u  Velocity along the axial direction

u∞  Free stream velocity

U  Non-dimensional velocity along the axial direction

v  Velocity along the transverse direction

V  Non-dimensional velocity along the transverse direction

w  Half width

\( \bar{W} \)  Non-dimensional width

x  Axial direction

X  Non-dimensional axial direction

y  Transverse direction

Y  Non-dimensional transverse direction

Greek symbols

α  Thermal diffusivity of fluid

μ  Dynamic viscosity

ν  Kinematic viscosity of fluid

ρ  Density of fluid

ζcc  Conduction–convection parameter

Subscripts

c  Center

f  Fluid domain

m  Mean

s  Solid domain (battery cell)

∞  Free stream

2 Numerical methodology

A battery module usually consists of battery cells that are densely packed to obtain higher power densities. For ease of operation and better thermal uniformity, the number of battery cells in each module is kept small. In this paper, a computationally efficient thermal model is used to simulate, at steady state, the thermal behavior of modern electric vehicle battery cells generating uniform heat during charging and discharging. A parallel channel with coolant flow is employed to cool the battery cells during operation. The developed thermal model is then used to analyze the thermal behavior of the battery cell for various parameters, as shown in Fig. 1. The computational domain is symmetric about the vertical axis; therefore, to reduce the computational cost, only half of the domain through a flow passage is modeled. Figure 2 shows the simulated domain, which consists of two sub-domains: the Li-ion battery cell (solid) and the coolant in the vertical parallel flow channel (fluid). The fluid flow inside the channel is commonly in the laminar regime owing to the low flow velocity inside the channel (Karimi and Li 2012; Xu and He 2013).

Fig. 1
figure 1

Schematic view of arrangement of battery cells with air circulation fan and heat generation in batteries

Fig. 2
figure 2

The symmetric battery (prismatic cell) and coolant flow domain considered for computational analysis

The governing equation describing the heat transfer process during discharging/charging of the Li-ion battery cell is given by:

$$ k_{\text{s}} \nabla^{2} T_{\text{s}} + q^{\prime\prime\prime} = 0, $$
(1)

where q′′′ is the volumetric heat generation term.

The governing equations for two-dimensional, steady, incompressible, laminar, forced-convection flow in the fluid domain are the continuity equation, the x and y momentum equations, and the energy equation, as follows:

$$ \nabla u = 0 $$
(2)
$$ (u\nabla u) = - \frac{1}{\rho }\nabla p + \mu \nabla^{2} u $$
(3)
$$ u\nabla T = \alpha \nabla^{2} T_{\text{f}} . $$
(4)

In Eqs. (1)–(4), Ts represents the temperature of the solid battery, Tf the temperature of the fluid, u the velocity, p the pressure, and α and µ the thermal diffusivity and dynamic viscosity of the fluid, respectively. Equation (2) is the mass conservation equation, Eq. (3) is the momentum equation, and Eq. (4) is the energy equation representing the temperature in the fluid domain.

The above equations are non-dimensionalized using the following set of normalizing parameters mentioned in Eq. (5):

$$ \begin{aligned} & \bar{S}_{q} = \frac{{q^{\prime\prime\prime}w_{\text{s}}^{2} }}{{k_{\text{s}} (T_{\text{o}} - T_{\infty } )}},\quad C = 4{\text{Ar}}^{2} \quad \bar{T} = \frac{{T - T_{\infty } }}{{T_{0} - T_{\infty } }},\quad L_{\text{i}} = \frac{{l_{\text{i}} }}{L},\quad L_{\text{o}} = \frac{{l_{\text{o}} }}{L} \\ & X = \frac{x}{L},\quad \, U = \frac{u}{{u_{\infty } }},\quad \, V = \frac{v}{{u_{\infty } }},\quad \, P = \frac{p}{{\rho u_{\infty }^{2} }},A_{\text{r}} = \frac{L}{{2w_{\text{s}} }} \\ & Y_{\text{s}} = \frac{{y_{\text{s}} }}{{w_{\text{s}} }},\quad Y_{\text{f}} = 1 + \frac{{y_{\text{f}} }}{L},\quad \bar{W}_{\text{f}} = \frac{{w_{\text{f}} }}{L},\quad \zeta_{\text{cc}} = \frac{{k_{\text{f}} }}{{k_{\text{s}} }}\left[ {\frac{{w_{\text{s}} }}{L}} \right] \\ & {\text{Re}} = \frac{{u_{\infty } L}}{\nu },\quad \Pr = \frac{\nu }{\alpha }. \\ \end{aligned} $$
(5)

The above non-dimensional parameters are selected based on previous research works that are close to the present conjugate study of a parallel plate channel (Kaladgi et al. 2019; Abdul Razak et al. 2019; Mohamme Samee et al. 2019; Samee et al. 2018).

The final set of non-dimensionalized governing equations are provided in Eqs. (6)–(9):

$$ \frac{{\partial^{2} \bar{T}_{\text{s}} }}{{\partial \,X^{2} }} + C\frac{{\partial^{2} \bar{T}_{\text{s}} }}{{\partial Y_{\text{s}}^{2} }} + C\bar{S}_{q} = 0 $$
(6)
$$ \nabla U = 0 $$
(7)
$$ U\nabla U = - \nabla P + \frac{1}{\text{Re}}\nabla^{2} U $$
(8)
$$ U\nabla \overline{{T_{\text{f}} }} = \frac{1}{\text{RePr}}\nabla^{2} \overline{{T_{\text{f}} }} . $$
(9)

The boundary conditions applied to the above conduction, momentum, and energy equations for the conjugate numerical computation of the upward flow of air collecting heat from the lateral surface of the battery cell are given in detail in Eq. (10):

$$ \begin{aligned} & X = 0;\quad 0 \le Y_{\text{s}} \le 1,\quad U = 0,\quad V = 0,\quad \bar{T}_{\text{s}} = 0 \\ & Y_{\text{s}} = 0;\quad 0 \le X \le 1,\quad \frac{{\partial \bar{T}_{\text{s}} }}{\partial X} = 0 \\ & Y_{\text{s}} = 1;\quad 0 \le X \le 1,\quad \frac{{\partial \bar{T}_{\text{s}} }}{\partial \,X} = 0 \\ & Y_{\text{s}} = 1;\quad 0 \le X \le 1,\quad \bar{T}_{\text{s}} = \bar{T}_{\text{f}} \\ & X = 1;\quad 0 \le Y_{\text{s}} \le 1,\quad \frac{{\partial \bar{T}_{\text{s}} }}{{\partial Y_{\text{s}} }} = 0, \\ & Y_{\text{f}} = 1;\quad - L_{\text{i}} \le X \le 0\quad {\text{and}}\quad L \le X \le L_{\text{o}} \quad \frac{{\partial \bar{T}_{\text{f}} }}{{\partial Y_{\text{f}} }} = 0,\quad \frac{\partial U}{{\partial Y_{\text{f}} }} = 0,\quad V = 0 \\ & Y_{\text{f}} = 1;\quad 0 \le X \le L,\quad \frac{{\partial T_{\text{f}} }}{{\partial Y_{\text{f}} }} = \frac{1}{{\zeta_{\text{cc}} }}\frac{{\partial T_{\text{s}} }}{{\partial Y_{\text{s}} }},\quad U = 0,\quad V = 0 \\ & Y_{\text{f}} = 1 + \bar{W}_{\text{f}} ;\quad - L_{\text{i}} \le X \le (L + L_{\text{o}} ),\quad \frac{{\partial \bar{T}_{\text{f}} }}{{\partial Y_{\text{f}} }} = 0,\quad V = 0,\quad \frac{\partial U}{{\partial Y_{\text{f}} }} = 0 \\ & X = - L_{\text{i}} ;\quad 1 \le Y_{\text{f}} \le (1 + \bar{W}_{\text{f}} ),\quad \bar{T}_{\text{s}} = 0,\quad U = 1,\quad V = 0 \\ & X = L + L_{\text{o}} ;\quad 1 \le Y_{\text{f}} \le (1 + \bar{W}_{\text{f}} ),\quad \frac{{\partial \bar{T}_{\text{s}} }}{\partial X} = 0,\quad \frac{\partial U}{{\partial Y_{\text{f}} }} = 0,\quad V = 0. \\ \end{aligned} $$
(10)

2.1 Solution strategy

The numerical solution of the conjugate problem consisting of the energy and momentum equations is obtained by employing the staggered-grid finite-volume method (FVM). The SIMPLE algorithm is used to solve the coupled momentum and continuity equations to obtain the velocity and pressure fields. The SIMPLE algorithm steps involved in solving the continuity and momentum equations are as follows:

Step 1: guess and initialize the variables U*, V*, and P*.

Step 2: solve the discretized equations of U* and V* using the guessed pressure field:

$$ \left( {a_{\text{p}} } \right)U^{*}_{i, j} = - \left( {P^{*}_{i + 1, j} - P^{*}_{i, j} } \right)A_{i,j} + \sum a_{nb} U^{*}_{nb} + b_{i,j} $$
$$ \left( {a_{\text{p}} } \right)V^{*}_{I, J} = - \left( {P^{*}_{I, J + 1} - P^{*}_{I, J} } \right)A_{I,J} + \sum a_{nb} V^{*}_{nb} + b_{I,J} . $$

Step 3: solve the pressure correction equation for P′ using the previously calculated U* and V*:

$$ \left( {a_{\text{p}} } \right)P^{\prime}_{i, j} = \sum a_{nb} P^{\prime}_{nb} + F_{u}^{*} + F_{v}^{*} $$

Step 4: correct the pressure and velocity equations:

$$ P = P^{*} + P^{\prime}\quad U = U^{*} + U^{\prime}\quad {\text{and}}\quad V = V^{*} + V^{\prime}. $$

Step 5: treat the corrected U, V, and P as the new guessed values and repeat from Step 2 until the error is within the desired limit.

Here, \( a_{\text{p}} \) is the coefficient of the grid point under consideration, \( a_{nb} \) are the neighboring coefficients, \( b_{i,j} \) is the source term, and \( F_{u}^{*} \) and \( F_{v}^{*} \) are the mass fluxes computed from the starred velocities. Discretization of the continuity and momentum equations is done with the central differencing scheme. The diffusion equation of the cell domain and the energy equation of the fluid domain are coupled, as the conjugate condition at the cell–fluid interface must be satisfied. The U* and V* velocity components required in the pressure correction procedure are solved by the line-by-line Gauss–Seidel iteration method with the Thomas algorithm, considering the boundary conditions mentioned earlier and the guessed pressure P*. The pressure correction equation for P′, obtained from the continuity equation, is solved by the successive over-relaxation (SOR) method using the U* and V* velocities calculated in the previous step. The pressure correction P′, calculated so that the continuity equation is satisfied, is then used to correct the guessed U*, V*, and P*. At this stage, the temperature \( \bar{T}_{\text{s}} \) from the 2-D conduction equation with source term \( \bar{S}_{q} \) and the energy equation for \( \bar{T}_{\text{f}} \) are solved simultaneously using the corrected U, V, and P values obtained from the previous computations. The line-by-line Gauss–Seidel iteration method and the Thomas algorithm are used for solving \( \bar{T} \) of the solid and fluid domains. The U, V, and P values are used as guessed values for the next iteration, with some under-relaxation, and so on until the error in the continuity equation and \( \bar{T} \) is < 10e−6.
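For orientation, a minimal C skeleton of this outer SIMPLE loop is shown below. The solver routines are stubbed out with placeholders; in the in-house code they correspond to the line-by-line TDMA sweeps, the SOR pressure correction, and the coupled conjugate temperature solve, and the function and variable names here are illustrative assumptions only.

/* Skeleton of the SIMPLE outer loop described above (illustrative only). */
#include <stdio.h>

#define TOL 1.0e-6

static double residual = 1.0;            /* stand-in for the continuity error          */

static void solve_u_momentum(void) {}    /* Step 2: U* from guessed P* (line-by-line TDMA) */
static void solve_v_momentum(void) {}    /*         V* from guessed P*                 */
static double solve_p_correction(void)   /* Step 3: P' by SOR; returns mass residual   */
{
    residual *= 0.9;                     /* placeholder convergence behaviour          */
    return residual;
}
static void correct_fields(void) {}      /* Step 4: P = P*+P', U = U*+U', V = V*+V'    */
static void solve_temperature(void) {}   /* conjugate Ts/Tf sweep (TDMA)               */

int main(void)
{
    int it = 0;
    double mass_res;
    do {                                 /* Step 5: repeat until converged             */
        solve_u_momentum();
        solve_v_momentum();
        mass_res = solve_p_correction();
        correct_fields();
        solve_temperature();
        ++it;
    } while (mass_res > TOL);
    printf("Converged after %d outer iterations\n", it);
    return 0;
}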

To evaluate the accuracy of the numerical results, grid-independence tests were conducted to check the dependence of the results on the grid resolution. The mesh systems considered have grid sizes of 42 × 82, 62 × 122, and 82 × 162 in the fluid domain and 82 × 82, 82 × 122, and 122 × 122 in the solid domain, respectively. For typical cases calculated in the present work, the differences in the non-dimensional temperature obtained with the different mesh systems were less than 5%. Hence, to reduce the computational time without affecting the accuracy, a grid size of 62 × 122 for the solid and fluid domains is used. The detailed numerical method, boundary conditions, solution strategy, and validation are discussed in Afzal et al. (2019, 2020a, b) and are not repeated here for brevity.

3 Parallelization strategy

Parallel architectures have attracted significant attention as they offer immense computational power by utilizing multiple processing units. The growth of parallel processing is driven by the stagnation of central processing unit (CPU) clock speeds. To benefit from present-day multicore processors and GPUs, programs have to be developed for parallel execution (Mudigere et al. 2015; Couder-Castaneda et al. 2015; Xu et al. 2015). In this research work, an effort is made to parallelize the developed FVM code for the present conjugate heat transfer problem. Parallelization of the in-house code, written in the C language, is carried out on NVIDIA's GTX 980 GPU. The CUDA parallel computing paradigm is employed for the parallelization of the FVM code, which is achieved using the red–black successive over-relaxation (RBSOR) scheme.

For fluid flow conditions such as internal flow, external flow, internal flow with the outlet domain extended, and internal flow with both inlet and outlet domains extended, the computational speedup obtained is investigated in detail. Grid sizes of 42 × 82, 52 × 102, 62 × 122, and 72 × 142 for internal flow and a grid size of 62 × 24 for each extended inlet and outlet domain are adopted for the parallel performance analysis. For external flow, the grid sizes chosen are 122 × 122, 162 × 162, 202 × 202, and 242 × 242. In the case of internal flow, the spacing between the parallel battery cells is kept constant at \( \bar{W}_{\text{f}} \) = 0.1. For both internal and external flow, Re = 250, 750, 1250, and 1750 are considered. The other parameters are fixed at their base values for the complete parallelization analysis. The parallel efficiency of the parallelized code is also investigated to understand the fraction of time for which the processors are usefully utilized.

3.1 The RBSOR method

The computational time taken by the developed FVM code varies from approximately 30 min to 24 h depending upon the parameters. Profiling of the different functions used in the code shows that up to 91% of the computational time is spent in the pressure correction function; the remaining time is used by the U and V velocity, temperature, and output-printing functions. Hence, the major focus is on parallelizing the pressure correction function using the RBSOR scheme on computing machines (CMs) with different configurations. The SOR method is employed for solving the pressure correction equation obtained in the SIMPLE algorithm. For the remaining functions, the tri-diagonal matrix algorithm (TDMA) is used for solving the corresponding discretized equations. One of the commonly known schemes for parallelization of SOR is the red–black scheme; the use of the wavefront scheme or combinations of the two is not reported in the literature. In the following section, a detailed description of the working of the RBSOR and wavefront schemes is provided.

SOR is an important iterative method for solving systems of linear equations. It is an extension of the Gauss–Seidel method that speeds up convergence by over-relaxing, i.e., combining the old value and the newly computed value with a relaxation factor greater than unity (Niemeyer and Sung 2014). In this work, SOR is used to solve the pressure correction equation with an over-relaxation factor of 1.8. As mentioned earlier, this pressure correction function consumes the maximum computational time owing to the inner iterations required for correcting the pressure.
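Written out for the pressure correction equation of Step 3, the SOR update used here takes the standard form (shown for clarity; the superscript k denotes the inner-iteration level):

$$ P^{\prime \,k + 1}_{i,j} = (1 - \omega )P^{\prime \,k}_{i,j} + \frac{\omega }{{a_{\text{p}} }}\left( {\sum a_{nb} P^{\prime}_{nb} + F_{u}^{*} + F_{v}^{*} } \right),\quad \omega = 1.8, $$

where the neighboring values \( P^{\prime}_{nb} \) take their most recently updated values, as in the Gauss–Seidel method.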

The parallel implementation of the SOR technique is not straightforward, as it uses the values of neighboring cells/grid points from the current iteration, as shown in Fig. 3. The grid point shown in yellow (number 15) requires the values of the upper, lower, left, and right cells (numbers 5, 25, 14, and 16, respectively) shown in blue. Each grid point is executed serially, one after the other, taking the newly calculated values of the neighboring points; this serial order is indicated by the numbering from 1 to 100 in Fig. 3. This brings in a sequential dependency and may lead to different results in a parallelly executed SOR. To overcome this sequential dependency and parallelize the SOR algorithm, graph coloring methods are used (Abdi and Bitsuamlak 2015). With a coloring method, the single sweep of SOR can be broken into multiple sweeps that are suitable for parallel processing. The RBSOR scheme can be thought of as a compromise between Gauss–Seidel and Jacobi iteration. As shown in Fig. 4, RBSOR colors the grid as a checkerboard of alternating red and black cells. First, all the red cells are computed simultaneously using the neighboring black points; then, the black cells are computed in parallel using the updated red cells. The RBSOR implementation is summarized in Algorithm 1, where imin refers to the starting grid index along the x-direction and jmin along the y-direction, m and n are the maximum numbers of grid points along the x- and y-directions given by the user, and w is the over-relaxation factor, set to 1.8 in this algorithm.
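For illustration, a minimal serial C sketch of the two-color sweep follows. The variable names imin, jmin, m, n, and w follow Algorithm 1, but the coefficient arrays (aP, aE, aW, aN, aS, b) and the indexing convention (imin, jmin ≥ 1 and arrays dimensioned at least (m + 1) × (n + 1)) are assumptions for this sketch, not the exact data structures of the in-house code.

/* One red sweep followed by one black sweep of the pressure correction. */
void rbsor_sweep(double **p, double **aP, double **aE, double **aW,
                 double **aN, double **aS, double **b,
                 int imin, int jmin, int m, int n, double w)
{
    for (int colour = 0; colour < 2; ++colour) {        /* 0 = red, 1 = black        */
        for (int i = imin; i < m; ++i) {
            for (int j = jmin; j < n; ++j) {
                if (((i + j) & 1) != colour) continue;  /* skip the other colour     */
                double gs = (aE[i][j] * p[i + 1][j] + aW[i][j] * p[i - 1][j]
                           + aN[i][j] * p[i][j + 1] + aS[i][j] * p[i][j - 1]
                           + b[i][j]) / aP[i][j];       /* Gauss-Seidel estimate     */
                p[i][j] = (1.0 - w) * p[i][j] + w * gs; /* over-relaxed update       */
            }
        }
        /* all cells of one colour depend only on the other colour, so each
           colour loop can be executed fully in parallel                              */
    }
}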

Fig. 3
figure 3

Serial execution of grid points depending upon the four neighboring cells

Fig. 4
figure 4

RBSOR and wavefront schemes for parallel implementation of SOR

figure a

3.2 Parallelization using CUDA

Present-day graphics processing units (GPUs) offer enormous computational power in the form of graphics hardware. A GPU is a massively parallel coprocessor to the CPU, organized as a set of SIMD (single instruction, multiple data) units. GPUs and present CPUs have a similar memory hierarchy (Poddar et al. 2015). GPUs are nowadays commonly used as programmable engines through parallel programming tools such as CUDA, OpenACC (open accelerator), and OpenCL (open computing language). CUDA is a well-known parallel computing tool provided by NVIDIA to create massively parallel applications. Porting of applications to the GPU is made easy by CUDA, as it avoids the earlier graphics-pipeline concepts. CUDA establishes a high level of abstraction by providing a new programming model for high-performance computing architectures; it consists of a set of language extensions to enable heterogeneous programming and straightforward APIs (application program interfaces) to manage devices and memory (Adelaja et al. 2014; Arici and Aydin 2009; Bilir 2002). In the CUDA paradigm, the computational core is known as the kernel, the CPU with its memory is known as the host, and the GPU with its memory is known as the device. The CUDA paradigm, with its data/result copying and kernel launching, is illustrated in Fig. 5, which shows the launching of a kernel on the GPU that executes in parallel on a scalable array of multithreaded streaming multiprocessors (SMs).

Fig. 5
figure 5

Overview of memory copy and kernel launching in CUDA paradigm

The kernel launched from the host CPU is mapped onto the GPU's thread grid, which consists of several blocks of threads arranged along different dimensions. Threads within a block share memory and can synchronize. The SMs work together on massive computations managed by the single instruction, multiple thread (SIMT) architecture, which describes the characteristics of the SMs and is the same across devices. A detailed description of the SIMT and multithreading architecture can be found in NVIDIA's GPU programming guidelines (Harman and Cole 2001; Gropp et al. 2001; Passoni et al. 2001).

In this work, FVM code parallelization using CUDA is adopted only for the pressure correction function, as it requires around 91% of the total runtime of the code. The remaining U and V velocity and temperature functions are parallelized using OpenMP for the inner for() loop of the TDMA. The steps involved in the CUDA parallelization of the present code are given in Algorithm 2. The specifier __global__ marks a function to be compiled by the CUDA C++ compiler for execution on the GPU and called from the host program. The other declarations, cudaMalloc, cudaMemcpy, and cudaFree, are used for the management of device memory. The kernel for RBSOR has grid and block dimensions declared by the dim3 command: the grid specifier defines the number of blocks per grid (declared two-dimensional here), and the block specifier declares the number of threads per block, which can be one-, two-, or three-dimensional. __syncthreads() is used to synchronize threads of the same block inside a kernel. In the CUDA model, global synchronization within a kernel is difficult; hence, to enforce it, the kernel is exited before a new kernel is launched. The details of the GPUs installed on two different CMs (specifications in Table 1) used for parallel execution are given in Table 2. Four different CMs are employed for comparison of the relative increase in speedup with GPU1; their specifications are given in Table 1.
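A minimal CUDA sketch of this pattern is given below: device memory management, a red–black kernel, and the two color-wise launches per inner iteration. The kernel name, argument list, flattened array layout, and the 20 × 20 block shape (400 threads per block, matching the default runs) are illustrative assumptions, not the exact in-house implementation.

#include <cuda_runtime.h>

/* One over-relaxed update of all interior cells of one colour. */
__global__ void rbsor_kernel(double *p, const double *aP, const double *aE,
                             const double *aW, const double *aN, const double *aS,
                             const double *b, int m, int n, double w, int colour)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* index along x               */
    int j = blockIdx.y * blockDim.y + threadIdx.y;   /* index along y               */
    if (i < 1 || i >= m - 1 || j < 1 || j >= n - 1) return;  /* interior cells only */
    if (((i + j) & 1) != colour) return;             /* cells of this colour only   */

    int id = i * n + j;                              /* flattened 1-D index         */
    double gs = (aE[id] * p[(i + 1) * n + j] + aW[id] * p[(i - 1) * n + j]
               + aN[id] * p[i * n + (j + 1)] + aS[id] * p[i * n + (j - 1)]
               + b[id]) / aP[id];                    /* Gauss-Seidel estimate       */
    p[id] = (1.0 - w) * p[id] + w * gs;              /* over-relaxed update         */
}

/* Host side: copy fields to the device, run the inner iterations, copy back. */
void pressure_correction_gpu(double *h_p, const double *h_aP, const double *h_aE,
                             const double *h_aW, const double *h_aN, const double *h_aS,
                             const double *h_b, int m, int n, int inner_iters, double w)
{
    size_t bytes = (size_t)m * n * sizeof(double);
    double *d_p, *d_aP, *d_aE, *d_aW, *d_aN, *d_aS, *d_b;
    cudaMalloc(&d_p, bytes);  cudaMalloc(&d_aP, bytes); cudaMalloc(&d_aE, bytes);
    cudaMalloc(&d_aW, bytes); cudaMalloc(&d_aN, bytes); cudaMalloc(&d_aS, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_p,  h_p,  bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_aP, h_aP, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_aE, h_aE, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_aW, h_aW, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_aN, h_aN, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_aS, h_aS, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b,  h_b,  bytes, cudaMemcpyHostToDevice);

    dim3 block(20, 20);                              /* 400 threads per block       */
    dim3 grid((m + block.x - 1) / block.x, (n + block.y - 1) / block.y);

    for (int it = 0; it < inner_iters; ++it) {
        /* exiting one kernel and launching the next enforces the global
           synchronization between the red and black phases                          */
        rbsor_kernel<<<grid, block>>>(d_p, d_aP, d_aE, d_aW, d_aN, d_aS, d_b, m, n, w, 0);
        rbsor_kernel<<<grid, block>>>(d_p, d_aP, d_aE, d_aW, d_aN, d_aS, d_b, m, n, w, 1);
    }
    cudaMemcpy(h_p, d_p, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_p);  cudaFree(d_aP); cudaFree(d_aE); cudaFree(d_aW);
    cudaFree(d_aN); cudaFree(d_aS); cudaFree(d_b);
}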

Table 1 Specifications of CMs on which GPUs are installed
Table 2 Specifications of NVIDIAs’ GPU used for parallelization
figure b

3.3 Speedup and parallel efficiency

The use of multiple processors working together simultaneously on a common task is commonly known as parallel computing. The performance of a parallel algorithm implemented on a parallel architecture is measured by speedup and parallel efficiency. Speedup is the ratio of the time taken to execute the sequential algorithm on a single processor to the time taken by the parallel algorithm on multiple processors. Parallel efficiency is defined as the ratio of the parallel speedup achieved to the number of processors, and gives a measure of the fraction of the time for which a processor is used efficiently (Darmana et al. 2006; Walther and Sbalzarini 2009; Wang et al. 2005; Mininni et al. 2011).

Following these definitions, the speedup and parallel efficiency are calculated as given by Eq. (11) (Walther and Sbalzarini 2009; Darmana et al. 2006):

$$ S_{{({\text{par}})}} = \frac{{T_{{({\text{seq}})}} }}{{T_{{({\text{par}})}} }}\quad E_{{({\text{par}})}} = \frac{{S_{{({\text{par}})}} }}{{N_{{({\text{par}})}} }}, $$
(11)

where S(par) is the parallel speedup achieved, T(seq) is the elapsed (wall) time taken by the sequential program, T(par) is the wall time taken by the parallel program, E(par) is the parallel efficiency, and N(par) is the number of processors employed for parallel execution. The efficiency loss due to data communication, partitioning of the computational task, processor scheduling, data management, etc. during parallel implementation is accounted for by the parallel efficiency. The elapsed time for computations in parallel on N processors can be written as (Shang 2009):

$$ T_{{({\text{par}})}} = T_{{({\text{seq}})N}} + T_{(N)} + T_{{({\text{comm}})N}} + T_{{({\text{misc}})N}} . $$
(12)

where T(seq)N is the time taken by the CPU for the sequential part of the program, T(N) is the time taken for the parallel part of the program on N processors, T(comm)N is the time taken for communication with the N processors, and T(misc)N is the idle or extra time induced by the parallelization of the program.

Equation (12) expresses the total time taken by a code that is capable of running in parallel on multiple processors. In other words, the parallelized code (or the parallelized part of the program) takes a certain amount of time for parallel execution on multiple cores, and this execution can be faster or slower depending upon the technique adopted for parallelization. The time taken by the master processor to schedule the parallel tasks, divide the work, divide the data, execute in parallel, synchronize, etc. is all accounted for under T(par).
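As a purely illustrative application of Eq. (11) with hypothetical timings (not measured values from this study), a serial run of 3600 s reduced to 60 s of parallel wall time on 100 processors gives:

$$ S_{{({\text{par}})}} = \frac{3600\,{\text{s}}}{{60\,{\text{s}}}} = 60,\quad E_{{({\text{par}})}} = \frac{60}{100} = 0.6\;(60\% ). $$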

4 Results and discussion of speedup and parallel efficiency using CUDA

Using the CUDA parallel computing paradigm, the parallelization of the in-house developed FVM code is analyzed in terms of parallel speedup and efficiency. Two GPUs, namely GPU1 and GPU2, are adopted for parallel execution; 400, 625, and 900 threads are launched from the host to the device, and their effect on speedup is also checked. In this section, the speedup and parallel efficiency of the parallelized FVM code are provided in detail. Using the RBSOR scheme, the computational time analysis is carried out for various grid sizes and Re for internal and external flow. In this entire parallel performance analysis, the operating heat and fluid flow parameters are kept fixed at \( \bar{S}_{q} \) = 0.5, ζcc = 0.06, and Ar = 10 for both flow conditions. For the computational time analysis, the grid sizes chosen are 42 × 82, 52 × 102, 62 × 122, and 72 × 142 for internal flow. The grid size considered for the extended outlet and inlet domains is 24 × 122. In the case of external flow, grid sizes of 122 × 122, 162 × 162, 202 × 202, and 242 × 242 are chosen, while Re is varied from 250 to 1750.

4.1 Elapsed time

The elapsed time of the FVM code for internal and external flow on all four machines is noted first, to get an idea of the computational cost for different conditions. Figure 6 shows the elapsed time on CM1, CM2, CM3, and CM4 for internal and external flow. The elapsed time is noted for Re = 250 to 1750 with fixed grid sizes of 62 × 122 and 202 × 202 for internal and external flow, respectively. Owing to the clock speed and RAM of the machines, the elapsed time is minimum for CM1 and maximum for CM4 at Re = 250 for both flow conditions. The times for CM1 and CM2 are almost the same at all Re. These elapsed times are for the internal flow analysis without extended outlet and inlet domains; if larger grid sizes and extra flow domains are considered, the computational time increases to 24 h and above.

Fig. 6
figure 6

Elapsed time on different machines for different flow conditions

4.2 Speedup achieved

The speedup of the parallel FVM code achieved for internal and external flow on GPU1 using the CUDA paradigm and the RBSOR scheme is shown in Fig. 7. The speedup using CUDA is calculated with respect to the serial performance of the FVM code on CM1. GPU1 is selected by default, and 400 threads were launched on the device during each iteration from the host CPU for these parallel computations. For different Re and grid sizes, the speedup achieved on GPU1 shown in Fig. 7 indicates the massive parallel performance of the CUDA-parallelized FVM code. For internal flow, the speedups are higher than those achieved for external flow. The SMs of the GPU perform very fast computations, as GPU threads are very lightweight and do not require any forking/joining mechanism. In internal flow, more inner iterations are required for the pressure correction function than in external flow. On the other hand, the total computational time for serial execution of the FVM code at higher Re is smaller for both internal and external flow, as shown previously in Fig. 6. Thus, due to the reduction in serial computational time at higher Re and the massive parallel capability of the GPU, the speedup reduces, whereas at lower Re, the greater utilization of the GPU for parallel computation increases the speedup. Similarly, due to the reduced number of inner iterations in external flow, the speedup is lower than for internal flow.

Fig. 7
figure 7

Speedup obtained for different Re and grid sizes

To analyze the computational performance of the GPU when serial computations are made, only one thread is launched on an SM during each iteration. To get an idea of the elapsed time for serial computation on the GPU, only internal flow with a 62 × 122 grid is selected. The elapsed time calculated is for serial execution of the FVM code in which only the pressure correction function is computed on the GPU and the remaining computations are performed on the CPU. The pressure correction function is chosen as it consumes 91% of the total time, as mentioned earlier. During each iteration, extra computational effort is required for copying memory back and forth between the GPU and CPU and for launching the kernel. Hence, the computational time required for serial computation on the GPU is huge, as shown in Fig. 8a.

Fig. 8
figure 8

a Time consumption of the FVM code for serial execution on GPU1 and b speedup obtained on GPU1 and GPU2

When the two GPUs with the different configurations mentioned in Table 2 are used for parallel computation, the respective speedups obtained, shown in Fig. 8b, are very close to each other. The default analysis of speedup using the CUDA paradigm is performed by launching 400 threads for the different conditions explained previously. In Fig. 9, the speedup obtained by launching 400, 625, and 900 threads on GPU1 is presented for different Re. The flow chosen is internal with a 62 × 122 grid and no extended domain. As explained earlier, for higher Re, the speedup using CUDA is comparatively low. It is also seen that launching a higher number of threads reduces the speedup further. The reason can be attributed to the synchronization time required after the parallel computations: as more threads are launched, more synchronization time is required, which increases the parallel execution time marginally.

Fig. 9
figure 9

Speedup obtained by launching kernel with different number of threads on GPU1

In Fig. 10, the speedup obtained with only the outlet domain extended and with both inlet and outlet domains extended is presented for internal flow. The grid size for each extended inlet/outlet domain is 62 × 24. The grid size is fixed at 62 × 122 for the internal flow, while Re is varied from 250 to 1750. It is observed that the speedup for both cases remains nearly the same at all Re. It can also be seen, by comparison with Fig. 7a, that the speedup in internal flow is reduced when the domains are extended. With the extension of the flow domains, the parallel execution time also increases, which shows that the number of pressure correction iterations required has drastically increased, reducing the speedup. This analysis provides insight into the parallel performance of an FVM code using the SIMPLE algorithm to solve the NS equations. One conclusion is that, without any extended domain, the speedup for internal flow is best, and extension of the fluid domain before the leading edge and after the trailing edge incurs an immense computational cost.

Fig. 10
figure 10

Speedup in internal flow considering only outlet domain and both the domains extended

To examine how the FVM code behaves computationally when the domain extensions are applied to external flow, a parallel performance analysis was also carried out. In Fig. 11, the speedup obtained with only the outlet domain and with both domains extended is depicted for external flow. A grid size of 202 × 202 is fixed for the flow over the plate, whereas 40 × 202 is fixed for each extended outlet/inlet domain. Compared to the results presented in Fig. 7b, the speedup increases significantly when extended domains are chosen. However, the speedup for either only the outlet domain or both domains extended remains more or less the same at all Re. One interesting aspect is that the fluctuation in speedup with the domains extended is in line with the speedup fluctuations observed in Fig. 7b. The increase in speedup for external flow with an extended domain may be attributed to the physical behavior of the fluid over/between the plates. According to boundary layer theory, in internal flow the boundary layers of the parallelly placed plates merge and the flow becomes fully developed near the leading edge if the spacing between the plates is small. Due to this merging of boundary layers, the pressure correction required to satisfy the continuity equation for the initially guessed velocity and pressure fields is very large, which increases the computational time of the FVM code. If the spacing between the plates is reduced further, the computational time increases correspondingly, and if the spacing is increased, less pressure correction is required and the related computational time is also less. This fluid behavior is in agreement with the reduced speedup obtained for internal flow with extended domains. In contrast, in external flow the boundary layers never merge and the flow remains parallel to the plate. Therefore, very little pressure correction is required, which leads to reduced computational time, and hence the addition of extended domains or grids results in more efficient computation on the SMs of the GPU. The parallel efficiency may, however, be higher for internal flow with extended domains than for external flow with extended domains due to better utilization of the parallel processors.

Fig. 11
figure 11

Speedup in external flow considering only outlet domain and both the domains extended

The speedup obtained on GPU1 relative to the sequential time of the different CMs for external and internal flow is presented in Fig. 12. The previous speedup results on GPU1 considered the sequential time of CM1. The flow domain considered is without any extension of the fluid domain in both cases, and the grid size is fixed at 62 × 122 and 202 × 202 for internal and external flow, respectively. As per the preceding discussion, the speedup on GPU1 considering the serial time of CM1 is almost the same as that considering CM2 for different Re, owing to the very close configurations of the two CMs. A similar trend of speedup on GPU1 with increasing Re in internal and external flow is obtained when the serial times of CM3 and CM4 are considered. It can be pointed out that the speedup on GPU1 considering the serial time of CM1/CM2 is less than that obtained considering the serial time of CM3/CM4; the speedup on GPU1 is highest when the serial time of CM4 is considered, for both internal and external flow. These speedups correspond to the computational capability of the CMs having different configurations and give an idea of the massive parallel computational capability of the SMs of the GPU. The speedups are, however, significantly higher for internal flow than for external flow.

Fig. 12
figure 12

Speedup obtained using CUDA considering serial time on different computing machines

As mentioned earlier, the three functions (U, V, and T) are parallelized using the OpenMP tool and the P function is parallelized using CUDA. In another attempt, the P function is also parallelized using OpenMP, making the entire FVM code parallelized under the OpenMP paradigm. The four CMs, namely CM1, CM2, CM3, and CM4, were selected for the analysis of speedup using the OpenMP-parallelized code. The %improvement in speedup obtained using GPU1 compared to the speedup on these four CMs is depicted in Fig. 13. In this analysis, internal flow without any extended domain and with a 62 × 122 grid was selected for demonstration. It can be seen that the %improvement in speedup obtained using GPU1 compared to the OpenMP speedup on the different CMs is very close for all machines. The prime reason for this nearly identical %improvement is the very low, almost unnoticeable, speedup obtained on the CMs using OpenMP when compared with the speedup using CUDA. Hence, from this analysis, it is clear that the compute capability of the GPUs far exceeds that of the CPUs, and that the OpenMP parallelization is 100 times slower than the CUDA-based parallelization.
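For reference, a minimal sketch of how one color sweep of the pressure correction can be distributed over CPU threads with OpenMP is shown below; the function name, array layout, and scheduling clause are illustrative assumptions and not the exact OpenMP version of the in-house code.

#include <omp.h>

/* One over-relaxed sweep over all cells of a single colour, split across CPU threads. */
void rbsor_colour_omp(double **p, double **aP, double **aE, double **aW,
                      double **aN, double **aS, double **b,
                      int m, int n, double w, int colour)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 1; i < m - 1; ++i) {
        for (int j = 1; j < n - 1; ++j) {
            if (((i + j) & 1) != colour) continue;   /* cells of this colour only */
            double gs = (aE[i][j] * p[i + 1][j] + aW[i][j] * p[i - 1][j]
                       + aN[i][j] * p[i][j + 1] + aS[i][j] * p[i][j - 1]
                       + b[i][j]) / aP[i][j];        /* Gauss-Seidel estimate     */
            p[i][j] = (1.0 - w) * p[i][j] + w * gs;  /* over-relaxed update       */
        }
    }
}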

Fig. 13
figure 13

%Improvement in speedup using GPU1 with respect to speedup obtained on four CMs using OpenMP

In Fig. 14, the speedup achieved during each iteration on GPU1 for internal flow at Re = 750 and different grid sizes is depicted. During the initial iterations, the speedup is very high; as the iterations continue, the speedup reduces and finally approaches a value of around 75 for the last iteration. These speedups look unrealistic, as such values are normally not achievable. In this work, the inner iterations of the pressure correction equation, parallelized using the red–black scheme, are computed entirely on the GPU; 91% of the time is used by the pressure function, as the inner iterations are high in number. By using an outer for loop to launch the kernels a fixed number of times on the GPU, this speedup becomes possible. For larger grid sizes, the speedup during the initial iterations is much higher than for smaller grid sizes. The speedup obtained during the initial iterations is very high because an immense number of inner pressure corrections is required while the velocity and pressure fields are still in their guessed state. As the iterations proceed, the velocity and pressure fields are computed and, hence, the pressure correction required also reduces. The speedup obtained, and its variation, are directly related to the number of inner pressure corrections computed: if many pressure corrections are required, the serial execution takes an immense time for that iteration, and performing the pressure corrections in parallel on the GPU during that iteration gives an enormous speedup.

Fig. 14
figure 14

Speedup achieved during each iteration on GPU1 for different grid size

4.3 Parallel efficiency of the FVM code

The parallel efficiency of the FVM code using CUDA, applying the RBSOR scheme and 400 threads on GPU1, is shown in Fig. 15 for internal and external flow. It can be seen from Fig. 15a, b that, for both internal and external flow, the parallel efficiency is lowest for the smallest grid size and highest for the largest grid size, irrespective of Re. In the internal flow problem, the parallel efficiency is higher than in external flow at all Re due to the increased speedup obtained on GPU1, as shown in Fig. 7. As the speedups are highest for the 72 × 142 grid, the parallel efficiency can be seen to improve continuously. The parallel efficiency indicates how efficiently the multiple processors are utilized for parallel execution. Essentially, if the serial execution time of the code is large, the associated parallel execution time is comparatively small, indicating that the code runs in parallel for most of the time. If the number of cores is large and the amount of computation is small, the parallel efficiency reduces; therefore, for efficient utilization of multiple cores, the code must be parallelized with a proper strategy. For lower Re, the serial time is high due to the larger number of inner iterations involved in the pressure correction of the large boundary layer, whereas for high Re, the boundary layer is closer to the solid part and, hence, fewer inner iterations are needed. As the iterations are more numerous for low Re, the serial time is also longer, which, in turn, improves the ability of the code to utilize the multiple cores more efficiently; therefore, for lower Re, the parallel efficiency is higher. An increase in grid size also causes a further improvement in parallel efficiency due to better utilization of the SMs of the GPU for parallel computation. The parallel efficiency for all Re in external flow is found to be very close, but increases with grid size. A very large grid could be used to improve the parallel efficiency further, but such a grid is not required for the present problem. Nevertheless, to increase the parallel efficiency, a scalable problem size in external flow is necessary.

Fig. 15
figure 15

Parallel efficiency obtained for different grid size and Re

5 Conclusions

The parallel performance of the in-house developed FVM code is analyzed in the form of parallel speedup and parallel efficiency. The CUDA parallel computing paradigm is used for parallelization, applying the RBSOR scheme, and four computing machines, CM1, CM2, CM3, and CM4, are employed for OpenMP parallelization. For the computational time analysis, the grid sizes chosen are 42 × 82, 52 × 102, 62 × 122, and 72 × 142 for internal flow, with a grid size of 24 × 122 for the extended outlet and inlet domains. In the case of external flow, grid sizes of 122 × 122, 162 × 162, 202 × 202, and 242 × 242 are chosen, while Re is varied from 250 to 1750.

From the complete speedup and parallel efficiency analysis of the parallelized FVM code using different methods, the following important conclusions are drawn.

  1.

    Parallelization using the CUDA paradigm gives massive speedup for the present FVM code. In the case of internal flow, the speedup is much higher than that achieved for external flow.

  2.

    The use of two different GPUs, namely GPU1 and GPU2, provided very similar speedups. When different block sizes of 400, 625, and 900 threads are launched, the speedup is again nearly the same, but is highest for 400 threads.

  3.

    For internal flow, extending the inlet and outlet domains reduces the speedup significantly compared to internal flow without any extended domain, whereas the speedup increases for external flow when extended domains are considered.

  4.

    The speedup on the GPU considering the serial time of CM1 or CM2 is the same. If the serial time of CM3 or CM4 is considered, the speedup on the GPU improves significantly due to the lower clock frequency of CM3 and CM4.

  5.

    The %improvement in speedup using CUDA compared to the speedup obtained using OpenMP parallelization is immense.

  6.

    The parallel efficiency in the case of internal flow improves with increasing grid size for all Re considered. In external flow, at all Re, the parallel efficiency remains nearly the same, with a slight improvement as the grid size increases.