1 Introduction

Due to the massive number of calculations involved, climate models and earth system models need support from high-performance computing (HPC) [1, 2]. Radiative transfer models, which are employed to calculate atmospheric radiative fluxes and heating rates [3], also demand HPC. Some of the most well-known radiative transfer models are the line-by-line radiative transfer model (LBLRTM) [4, 5], the rapid radiative transfer model (RRTM) [6], and the rapid radiative transfer model for general circulation models (RRTMG). As an accelerated version of the RRTM, the RRTMG performs computations more efficiently [7, 8]; however, it still demands enormous computing resources for long-term climate simulations [9,10,11]. The Chinese Academy of Sciences Earth System Model (CAS-ESM) [12,13,14] uses the Institute of Atmospheric Physics (IAP) Atmospheric General Circulation Model Version 4.0 (IAP AGCM4.0) [15, 16] as its atmospheric component model, and the IAP AGCM4.0 uses the RRTMG as its radiative parameterization scheme.

Large-scale numerical simulations are typically performed on CPU clusters. However, because of the low power consumption, high memory bandwidth, and highly parallel many-core architecture of graphics processing units (GPUs), HPC has undergone a paradigm shift from CPU-based to GPU-based computing [17,18,19,20,21]. It has become increasingly common to run climate models on GPUs to perform highly efficient computations with low energy consumption [22, 23]. For instance, the RRTM longwave radiation scheme (RRTM_LW) in the Weather Research and Forecasting (WRF) model was accelerated with CUDA (NVIDIA's Compute Unified Device Architecture) Fortran on C1060 GPUs and attained a \(10 \times\) performance improvement [24]. A nearly \(10 \times\) speedup for a computationally intensive portion of the WRF was obtained on an NVIDIA 8800 GTX [25]. In our previous study, the RRTMG longwave radiation scheme (RRTMG_LW) was accelerated on only one K20 GPU [26, 27].

In the aforementioned studies, radiation transfer models were accelerated on a single GPU. Currently, supercomputers and CPU/GPU heterogeneous HPC systems usually have thousands of CPU and GPU nodes. Radiation transfer models should be run on dozens of GPUs, at a minimum, to make full use of these GPU nodes. Moreover, running the current RRTMG on one GPU is still time-consuming for long-term simulations, so its acceleration algorithm should be studied to achieve more efficient computing on multiple GPUs. Thus, the present paper focuses on accelerating the RRTMG_LW on multiple GPUs. A CUDA-based acceleration algorithm for the RRTMG_LW on multiple GPUs is proposed, which enables massively parallel calculations of the RRTMG_LW on multiple GPUs of a supercomputer. A multi-GPU accelerated version of the RRTMG_LW, namely the GPUs-RRTMG_LW, is then built. The experimental results demonstrate that running the GPUs-RRTMG_LW on 16 K20 GPUs achieves a \(77.78 \times\) speedup.

The main contributions of this study are as follows:

  (1) To further accelerate the RRTMG_LW, a multi-GPU acceleration algorithm based on CUDA Fortran is proposed. The proposed algorithm adapts well to multiple GPUs and nodes. Moreover, it can also be generalized to accelerate the RRTMG shortwave radiation scheme (RRTMG_SW).

  (2) The GPUs-RRTMG_LW can run on a GPU cluster and shows excellent computational capability. To some extent, its more efficient computation supports large-scale and real-time computing of the CAS-ESM. In addition, after implementing the GPUs-RRTMG_LW on multiple GPU nodes, the highly efficient parallel processing allows the CAS-ESM to run on a CPU/GPU heterogeneous supercomputer with thousands of nodes.

The remainder of this paper is organized as follows. Section 2 presents representative approaches that aim to accelerate physical parameterization schemes on multiple GPUs. Section 3 introduces the RRTMG_LW model and GPU environment. Section 4 describes the CUDA-based 3D acceleration algorithm for the RRTMG_LW on a single GPU. Section 5 details the MPI+CUDA acceleration algorithm for the RRTMG_LW on multiple GPUs. Section 6 evaluates the performance of the GPUs-RRTMG_LW in terms of runtime efficiency and speedup and discusses some of the problems arising during the experiments. The last section concludes the paper with a summary and proposals for future work.

2 Related work

In recent years, a fair amount of work has been devoted to accelerating physical parameterization schemes and climate system models using GPUs. Many GPU-based acceleration techniques have been proposed; they can be divided into several categories: acceleration on a single GPU, on multiple GPUs, and on CPU/multi-GPU clusters. We are committed to speeding up the process on CPU/multi-GPU clusters. Here, we provide a brief summary of prior work in these categories.

Mielikainen et al. refactored the RRTMG_LW code. Without I/O transfer, their GPU version achieved a speedup of \(127 \times\) on a single Tesla K40 GPU compared to its CPU version on an Intel Xeon E5-2603 [28]. The RRTMG_SW was written in CUDA C [29] instead of the previous Fortran code. Compared to its single-threaded Fortran counterpart running on an Intel Xeon E5-2603, the RRTMG_SW based on CUDA C had a \(202 \times\) speedup on a single Tesla K40 GPU [30].

Running the RRTM_LW on a GTX480 obtained a \(27.6 \times\) speedup compared with the baseline wall-clock time [3]. The WRF Single Moment 5-class (WSM5) microphysics achieved a \(9.4 \times\) performance increase even without systematically optimizing the GPU code [31]. The WRF Single Moment 6-class (WSM6) microphysics scheme was accelerated with CUDA C. Here, the CUDA programming model is used to convert the original WSM6 module into GPU programs. Its GPU version obtained a greater than \(216 \times\) speedup when compared to its CPU serial version [32]. The GRAPES’ WSM6 scheme, using the NVIDIA CUDA programming model, exploited its fine-grained data parallelism. The implementation achieved a greater than \(140 \times\) performance improvement over a single CPU version [33].

The WRF Goddard shortwave radiance scheme was accelerated on two NVIDIA GTX 590s. Without taking I/O transfer times into account, the GPU implementation achieved a \(141 \times\) speedup [34]. The RRTM_LW in the GRAPES_Meso model was rewritten in CUDA Fortran, and a \(14.3 \times\) speedup was obtained. The experiments were carried out on a multi-GPU platform and could be extended to GPU clusters [9]. The double-precision version of the ODAS (a transmittance algorithm, which is available in the community radiative transfer model) obtained a \(201 \times\) speedup on two NVIDIA GTX 590s compared to its single-threaded Fortran code [35]. The WRF Kessler cloud microphysics scheme obtained a \(132 \times\) speedup on 4 GPUs compared to its single-threaded CPU version [36]. The WRF WSM5 microphysics scheme was accelerated by \(357 \times\) on four GPUs [37]. The horizontal diffusion method in the WRF was accelerated approximately 3.5 times using two Tesla K40m GPUs compared with the single-GPU version [38]. Lu et al. utilized the MPI+OpenMP/CUDA programming pattern to simulate radiation physics on a large GPU cluster and investigated the computational efficiency of the RRTM_LW CPU/GPU implementation [39, 40].

Despite the excellent results of the aforementioned studies, few of them have exploited both the CPU and GPU computational resources within large GPU clusters. In this paper, a parallel programming model, MPI+CUDA, is presented when simulating the RRTMG_LW in a CPU/multi-GPU computing environment.

3 Model description and GPU overview

3.1 RRTMG_LW model

With the objective of higher efficiency with minimal loss of accuracy, the RRTM model was modified to create the RRTMG [41, 42], which is a correlated k-distribution band model for the calculation of longwave and shortwave atmospheric radiative fluxes and heating rates [43]. The correlated k-distribution method and g points in the RRTMG are described in [26, 44]. The radiation flux and heating/cooling rate for calculating radiative transfer through a planetary atmosphere are described in detail in [28].

3.2 RRTMG_LW code structure

According to the code test, the subroutine rrtmg_lw in the RRTMG_LW module consumes the largest share of the runtime among the module's subroutines and is likely to become the bottleneck of system performance [26, 27]. The rrtmg_lw calls the following five subroutines.

  (a) The subroutine inatm reads the atmospheric profile from the GCM for use in the RRTMG_LW and defines other input parameters;

  (b) The subroutine cldprmc sets the cloud optical depth for the Monte Carlo-independent column approximation (McICA) based on the input cloud properties;

  (c) The subroutine setcoef calculates information needed by the radiative transfer routine that is specific to this atmosphere, especially the coefficients and indices needed to compute optical depths, by interpolating data from stored reference atmospheres;

  (d) The subroutine taumol calculates gaseous optical depths and Planck fractions for each of the 16 spectral bands;

  (e) The subroutine rtrnmc (for both clear and cloudy profiles) performs the radiative transfer calculation, using the McICA to represent sub-grid-scale cloud variability.

Algorithm 1 shows the computing procedure of rrtmg_lw. Because rrtmg_lw takes most of the computing time of the RRTMG_LW, the study targeted using GPUs to accelerate the inatm, cldprmc, setcoef, taumol, and rtrnmc subroutines.

Algorithm 1 Computing procedure of rrtmg_lw
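The five-subroutine sequence described above can be sketched as follows. This is an illustrative Python mock-up (the actual code is CUDA Fortran), with placeholder bodies that only record the call order; the function names match the subroutines listed above, but the signatures are our own.

```python
calls = []  # records the invocation order for illustration

def inatm(profile):
    # Read the atmospheric profile and define other input parameters
    calls.append("inatm")
    return profile

def cldprmc(state):
    # Set cloud optical depth for McICA
    calls.append("cldprmc")
    return state

def setcoef(state):
    # Interpolation coefficients and indices for optical depths
    calls.append("setcoef")
    return state

def taumol(state):
    # Gaseous optical depths and Planck fractions for the 16 bands
    calls.append("taumol")
    return state

def rtrnmc(state):
    # Radiative transfer calculation with McICA
    calls.append("rtrnmc")
    return state

def rrtmg_lw(profile):
    # The five-subroutine pipeline of rrtmg_lw
    return rtrnmc(taumol(setcoef(cldprmc(inatm(profile)))))
```

Each stage consumes the previous stage's output, which is why, as discussed later, some kernels cannot be merged or reordered without explicit synchronization.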

3.3 Overview of GPU and CUDA

Figure 1 illustrates the hardware architecture of a GPU. A GPU is organized as an array of highly threaded streaming multiprocessors (SMs), each containing a number of streaming processors (SPs) that share control logic and an instruction cache. As a general-purpose parallel computing architecture, CUDA provides a software environment that exploits the many cores of a GPU in a massively parallel fashion. CUDA defines functions or subroutines executed on the GPU (the 'device') as 'kernels'. Normally, the CPU (the 'host') invokes the kernels of an application. Each kernel is executed by CUDA threads, which are organized into a three-level hierarchy [45], as shown in Fig. 2. The top level is a grid consisting of thread blocks. Each thread block contains a group of threads that share data efficiently through a fast shared memory.
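The thread hierarchy determines how a kernel maps threads to data. A small Python sketch of the standard CUDA index arithmetic follows (the helper names are ours; in CUDA Fortran the built-ins are blockIdx, blockDim, and threadIdx, which are 1-based):

```python
import math

def global_thread_index_1d(block_idx, block_dim, thread_idx):
    # Unique global index of a thread from its (block, thread)
    # coordinates; 0-based here for simplicity. CUDA Fortran's
    # 1-based equivalent is (blockIdx%x - 1)*blockDim%x + threadIdx%x.
    return block_idx * block_dim + thread_idx

def blocks_needed(n_elements, block_dim):
    # Enough blocks so that every element is covered by one thread
    return math.ceil(n_elements / block_dim)
```

For example, thread 5 of block 2 with 256-thread blocks handles global element 517, and 1000 elements need 4 blocks of 256 threads.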

Fig. 1

Hardware architecture of a modern GPU

Fig. 2

Hierarchy of threads and memory in CUDA

4 CUDA-based 3D acceleration of RRTMG_LW on a single GPU

The RRTMG_LW uses a collection of 3D cells to represent the atmosphere. Its 1D acceleration algorithm performs domain decomposition in the horizontal direction; its 2D acceleration algorithm performs domain decomposition in the horizontal and vertical directions. In the RRTMG_LW, the total number of g points is 140, so inatm, taumol, and rtrnmc contain iterative computations over the g points. For example, in the 1D GPU-based implementation of rtrnmc_d, the computation over the 140 g points is executed by a do-loop. To achieve more fine-grained parallelism, 140 CUDA threads can be assigned to run the kernels inatm_d, taumol_d, and rtrnmc_d. Thus, on the basis of the 2D algorithm, the 3D parallel strategy further accelerates inatm_d, taumol_d, and rtrnmc_d along the g-point dimension. Figure 3 illustrates the domain decomposition in the g-point dimension for the RRTMG_LW accelerated on a GPU. The 3D acceleration algorithm is illustrated in Algorithm 2 and described as follows:

  (1) In the 3D acceleration algorithm, inatm consists of five kernels (inatm_d1, inatm_d2, inatm_d3, inatm_d4, and inatm_d5). Due to data dependency, part of the code in inatm can be parallelized only in the horizontal or vertical direction, so the kernel inatm_d4 uses 1D decomposition. The kernels inatm_d1, inatm_d2, and inatm_d5 use 2D decomposition, while inatm_d3 uses 3D decomposition. Due to the requirement of data synchronization, inatm_d1 and inatm_d2 cannot be merged into one kernel.

  (2) The kernel cldprmc_d still uses 1D decomposition.

  (3) Similarly, the kernel setcoef_d1 uses 2D decomposition, and the kernel setcoef_d2 uses 1D decomposition.

  (4) The kernel taumol_d uses 3D decomposition. In taumol_d, 16 subroutines with the device attribute are invoked.

  (5) Similarly, rtrnmc consists of 11 kernels (rtrnmc_d1 to rtrnmc_d11). Here, rtrnmc_d1, rtrnmc_d4, rtrnmc_d8, rtrnmc_d10, and rtrnmc_d11 use 1D decomposition. Furthermore, rtrnmc_d2 and rtrnmc_d9 use 2D decomposition in the horizontal and vertical directions, while rtrnmc_d5 and rtrnmc_d6 use 2D decomposition in the horizontal direction and the g-point dimension. Finally, rtrnmc_d3 and rtrnmc_d7 use 3D decomposition.

Fig. 3

Schematic diagram of the decomposition in the g-point dimension for the RRTMG_LW in the GPU acceleration

In Algorithm 2, for 1D acceleration, n is the number of threads in each thread block, while \(m=\lceil ncol/n\rceil\) is the number of blocks in each kernel grid. For 2D and 3D acceleration, tBlock defines the number of threads in each thread block in the x, y, and z dimensions using the derived type dim3, and grid defines the number of blocks in the x, y, and z dimensions, also using dim3.
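The launch-configuration arithmetic just described can be sketched in Python (the helper names and the example 8x8x8 thread block are our own illustrative choices; the paper's code uses CUDA Fortran's dim3):

```python
import math

def launch_config_1d(ncol, n):
    # m = ceil(ncol / n) blocks of n threads each, as in Algorithm 2
    m = math.ceil(ncol / n)
    return m, n

def launch_config_3d(ncol, nlay, ngpt, tblock=(8, 8, 8)):
    # dim3-style (x, y, z) grid sized so the blocks cover the full
    # columns x layers x g-points domain
    return tuple(math.ceil(dim / t)
                 for dim, t in zip((ncol, nlay, ngpt), tblock))
```

For the chunk size used later in the paper (2048 columns, 51 layers, 140 g points), an 8x8x8 block gives a 256x7x18 grid, and 256-thread 1D blocks give 8 blocks per kernel grid.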

The 1D, 2D, and 3D acceleration algorithms of the RRTMG_LW on one GPU were proposed in our previous study [26, 27]. After implementing the algorithms in CUDA Fortran, the GPU-RRTMG_LW has been developed and can run on a GPU. In the CAS-ESM, the IAP AGCM4.0 has a \(1.4^{\circ }\times 1.4^{\circ }\) horizontal resolution and 51 levels in the vertical direction, so the RRTMG_LW has \(128\times 256\) horizontal grid points. If one GPU is applied, in theory, \(128\times 256\times 51\times 140\) CUDA threads will be required for each 3D kernel.

Algorithm 2 3D acceleration algorithm of the RRTMG_LW

5 MPI+CUDA acceleration algorithm of RRTMG_LW on multiple GPUs

5.1 Parallel architecture

The current CAS-ESM, which is implemented with MPI, typically runs on dozens of compute nodes. Once the GPU-RRTMG_LW is integrated into the CAS-ESM, it also has to run on multiple compute nodes and GPUs. Generally, supercomputers and large-scale clusters have hundreds of compute nodes, each with two or more GPUs. To make full use of multi-core, multi-GPU supercomputers and further improve the computational performance of the GPU-RRTMG_LW, this study adopted a parallel architecture with a hybrid MPI+CUDA paradigm, as shown in Fig. 4. Hence, the GPU-RRTMG_LW can run on multiple GPUs, whereas the rest of the CAS-ESM code runs on multiple CPUs.

Fig. 4

Parallel architecture of the GPU-RRTMG_LW on multiple GPUs

5.2 GPUs-RRTMG_LW algorithm

In the IAP AGCM4.0 of the CAS-ESM, the computation of the physical parameterizations has the character of a vertical single-column model. Therefore, when the CAS-ESM runs on multiple CPU cores, the computation of the physical parameterizations is decomposed in the horizontal direction. As one of these parameterizations, the 3D global domain of the GPU-RRTMG_LW is decomposed by latitude and longitude. More specifically, the decomposition of the 3D atmospheric global grid is implemented with MPI, and each MPI rank finishes the computation on its own sub-grid points. For example, if the CAS-ESM runs on 4 CPU cores, each sub-grid contains \(128 \times 256/4=8192\) horizontal grid points. Each CPU core drives a GPU, so 4 GPUs are required, as shown in Fig. 5; each GPU thus computes 8192 horizontal grid points at each time step. Due to the limited global memory on a GPU, a K20 GPU can compute only 2048 horizontal grid points at once. Thus, the 8192 points are divided into 4 chunks of 2048 points each; in other words, a K20 GPU performs the computation of the 8192 points in 4 iterations at each time step.
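The workload arithmetic above can be sketched in Python (the function and parameter names are ours; the 2048-column limit reflects the K20 global-memory constraint stated above):

```python
def gpu_workload(nlon, nlat, n_gpus, max_cols=2048):
    # Horizontal columns assigned to each GPU, and how many chunks
    # (kernel iterations) are needed per time step given the
    # per-launch column limit imposed by GPU global memory.
    cols_per_gpu = (nlon * nlat) // n_gpus
    n_chunks = -(-cols_per_gpu // max_cols)  # ceiling division
    return cols_per_gpu, n_chunks
```

With the 128x256 grid on 4 GPUs, each GPU handles 8192 columns in 4 chunks; on 32 GPUs, each handles 1024 columns in a single chunk, matching the discussion in Sect. 6.3.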

Fig. 5

Decomposition of the global grid in the horizontal direction when running the GPUs-RRTMG_LW on four CPU cores and four GPUs

After decomposing the 3D atmospheric global grid, this study uses MPI to implement collaboration and communication among multiple GPUs. Figure 6 illustrates the flow of the GPUs-RRTMG_LW acceleration algorithm on multiple GPUs and nodes. The detailed acceleration algorithm is as follows.

  (1) The CAS-ESM runs concurrently on multiple CPU cores using MPI. The 3D atmospheric global grid is decomposed into sub-grids, and each CPU core is responsible for the computing task on its own sub-grid points. Each CPU core then starts a GPU and sends the input data of the GPUs-RRTMG_LW to its corresponding GPU.

  (2) Each GPU initializes the run environment and allocates space for its variables and arrays. After receiving the input data from its corresponding CPU core, the ncol CUDA threads in each GPU execute the radiative transfer computation concurrently for each grid point in their own chunk. Each GPU then sends the computing results back to its corresponding CPU core.

  (3) Each CPU core receives the computing results. If not all computing tasks at a time step are finished, the algorithm continues from the first step.

Fig. 6

Flow of the GPUs-RRTMG_LW acceleration algorithm

5.3 GPUs-RRTMG_LW implementation

The implementation of the RRTMG_LW acceleration algorithm on multiple GPUs is illustrated in Table 1 and described as follows.

  (1) If each GPU node invokes two or more GPUs, arrays with the device attribute for input or output data must be allocated dynamically. Thus, dynamic memory allocation is adopted in the algorithm implementation.

  (2) The CUDA runtime function cudaSetDevice selects which device (GPU) executes subsequent GPU code. Calling it enables multi-GPU computation on each GPU node.

  (3) The CUDA runtime function cudaDeviceSynchronize blocks until the device has finished all preceding tasks. Calling it enables synchronous multi-GPU computation.

Table 1 Implementation of the GPUs-RRTMG_LW

6 Experimental results and discussion

To evaluate the performance of the proposed algorithm, experimental studies were conducted. The results are described below.

6.1 Experimental setup

To fully investigate the proposed algorithm, this paper conducted an idealized global climate simulation of one model day. In this experiment, the time step of the GPUs-RRTMG_LW was one hour. The experiment ran on a K20 cluster in the Computer Network Information Center of CAS, which has 30 GPU nodes. Each GPU node has two Intel Xeon E5-2680 v2 processors and two NVIDIA Tesla K20 GPUs. The twenty CPU cores in each GPU node share 64 GB of DDR3 system memory through QuickPath Interconnect. The basic compiler is the PGI Fortran compiler Version 14.10, which supports CUDA Fortran. Table 2 lists the detailed configuration. The serial RRTMG_LW was executed on an Intel Xeon E5-2680 v2 processor of the K20 cluster.

Table 2 Configurations of the K20 GPU cluster

6.2 Performance comparison of 1D and 3D GPUs-RRTMG_LW

Table 3 shows the runtime of the serial RRTMG_LW on one core of an Intel Xeon E5-2680 v2 processor. The computing time of the RRTMG_LW on the CPU or GPU, \(T_{rrtmg\_lw}\), is calculated with the following formula:

$$\begin{aligned} T_{rrtmg\_lw}=T_{inatm}+T_{cldprmc}+T_{setcoef}+T_{taumol}+T_{rtrnmc}+T_{I/O}, \end{aligned}$$

where \(T_{inatm}\) is the computing time of the subroutine inatm or the kernel inatm_d; \(T_{cldprmc}\), \(T_{setcoef}\), \(T_{taumol}\), and \(T_{rtrnmc}\) are the corresponding computing times of the other subroutines or kernels; and \(T_{I/O}\) is the I/O transfer time between the CPU and GPU.
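The formula above can be evaluated directly; the following Python sketch uses our own function names and made-up timing values purely to illustrate the bookkeeping (the measured times are in Tables 3 and 4):

```python
def rrtmg_lw_runtime(times):
    # Sum of the component times in the formula above. 'times' maps
    # each component name -> seconds, including the CPU-GPU I/O time.
    parts = ("inatm", "cldprmc", "setcoef", "taumol", "rtrnmc", "io")
    return sum(times[p] for p in parts)

def speedup(t_cpu, t_gpu):
    # Speedup of a GPU run relative to the serial CPU run
    return t_cpu / t_gpu
```

For instance, with hypothetical component times summing to 8 s against an 80 s serial run, the speedup would be 10x.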

To evaluate the acceleration performance of the GPUs-RRTMG_LW on multiple GPUs, the performance of the 1D GPUs-RRTMG_LW on multiple GPUs was also measured. Table 3 also portrays the runtime and speedup of the 1D GPUs-RRTMG_LW on multiple K20 GPUs when each GPU node of the cluster invokes one K20 GPU. Table 4 presents the runtime and speedup of the 3D GPUs-RRTMG_LW under the same configuration. Some conclusions and analysis are as follows.

  (1) Increasing the number of GPUs reduces the runtime and improves the speedup. When the 1D GPUs-RRTMG_LW ran on 16 K20 GPUs, it achieved a speedup of \(51.28 \times\) compared to its counterpart running on one CPU core of an Intel Xeon E5-2680 v2.

  (2) As the number of GPUs increased, the 3D GPUs-RRTMG_LW exhibited a similar trend. When it ran on 16 K20 GPUs, it achieved a speedup of \(77.78 \times\). The 3D GPUs-RRTMG_LW has a better acceleration algorithm than the 1D GPUs-RRTMG_LW, so it obtains a higher speedup.

Table 3 Runtime and speedup of the CAS-ESM 1D GPUs-RRTMG_LW on multiple GPUs when each GPU node of the cluster invokes one K20 GPU
Table 4 Runtime and speedup of the CAS-ESM 3D GPUs-RRTMG_LW on multiple GPUs when each GPU node of the cluster invokes one K20 GPU

6.3 Performance evaluation with different GPU configurations

In the K20 cluster, each GPU node has two Intel Xeon E5-2680 v2 processors (20 CPU cores) and two K20 GPUs. In the experiment of Sect. 6.2, each GPU node invokes one K20 GPU. To make full use of the cluster, each GPU node invokes two K20 GPUs in the following experiment. Table 5 presents the runtime and speedup of the 3D GPUs-RRTMG_LW on multiple K20 GPUs when each GPU node invokes two K20 GPUs. Some conclusions and analysis are as follows.

  (1) Increasing the number of GPUs reduces the runtime and improves the speedup. When the 3D GPUs-RRTMG_LW ran on 16 and 32 K20 GPUs, it achieved speedups of \(60.88 \times\) and \(76.13 \times\), respectively.

  (2) As shown in Tables 4 and 5, the 3D GPUs-RRTMG_LW running on the same number of GPUs obtains a higher speedup when each GPU node invokes only one K20 GPU. This is because the data transfer between the CPU and GPU is slower and additional communication overhead is produced when two GPUs are invoked in a GPU node.

  (3) Although the 3D GPUs-RRTMG_LW does not show a perfect performance improvement when each GPU node invokes two K20 GPUs, it can utilize more GPUs and has stronger scalability.

  (4) When 16 nodes and 32 GPUs are utilized, as in Table 5, the maximum value of ncol is 1024 (\(128 \times 256/32\)) because of the low resolution of the IAP AGCM4.0 in the CAS-ESM. In theory, if the IAP AGCM4.0 were developed with a higher resolution, ncol could reach 2048 and the 3D GPUs-RRTMG_LW would achieve a speedup of about \(120 \times\). Therefore, the proposed algorithm can fully support the CAS-ESM at a higher resolution.

Table 5 Runtime and speedup of the CAS-ESM 3D GPUs-RRTMG_LW on multiple GPUs when each GPU node of the cluster invokes two K20 GPUs

6.4 Error analysis

When accelerating the computational performance of a climate system model, it is of vital importance to ensure that running the model on multiple GPUs generates the same results within a small tolerance. For a simulation of two model days, Fig. 7 illustrates the impact on the longwave flux at the top of the atmosphere in a clear sky. The outgoing longwave flux obtained by running the CAS-ESM entirely on CPUs is shown in Fig. 7a. The longwave flux differences between running the CAS-ESM only on CPUs and running the CAS-ESM RRTMG on 16 GPUs are shown in Fig. 7b. The results show that the differences are minor and negligible; they result both from running the 3D GPUs-RRTMG_LW on GPUs and from the slight physics changes introduced when porting the code to GPUs.

Fig. 7

Impact on the longwave flux at the top of the atmosphere in a clear sky

6.5 Discussion

  (1) Zheng et al. proposed an acceleration algorithm for the RRTM_LW in the GRAPES_Meso model on multiple GPUs; their CUDA Fortran version obtained a \(14.3 \times\) speedup on four NVIDIA Tesla C1060 cards [9]. Compared to their algorithm, our proposed algorithm for the RRTMG_LW in the CAS-ESM achieves a better speedup. Moreover, our algorithm can run on multiple GPU nodes.

  (2) In fact, our algorithm does not attain an ideal speedup when running on multiple nodes and GPUs, for two main reasons. First, the current IAP AGCM4.0 in the CAS-ESM has a low resolution, so the amount of RRTMG_LW computation assigned to each GPU is small and the GPU hardware is inefficiently utilized. Second, the inevitable I/O transfer cost between the CPU and GPU reduces the performance improvement. Thus, the proposed acceleration algorithm will be optimized further to achieve better performance.

7 Conclusions and future work

Large-scale numerical simulation places an ever-growing demand on the computational performance of HPC infrastructure. Consequently, it is critical to make full use of the computational resources of CPU/GPU clusters. In this paper, a multi-GPU acceleration algorithm for the RRTMG_LW is proposed, and its hybrid programming paradigm (MPI+CUDA) is presented. After implementing the algorithm, the GPUs-RRTMG_LW was developed and integrated into the CAS-ESM as its longwave radiation transfer module, which realized the CPU/GPU heterogeneous parallel computing of the CAS-ESM. Moreover, we performed a simulation by exploiting the computational capacities of both CPU and GPU clusters. The experimental results demonstrate that the multi-GPU acceleration algorithm is valid and highly efficient. During a climate simulation of one model day, the GPUs-RRTMG_LW obtained a speedup of \(77.78 \times\) on a K20 GPU cluster.

The future work mainly includes the following two aspects: (1) The acceleration algorithm will be optimized to further harness the GPU performance; for example, using pinned memory can reduce the I/O transfer time between the CPU and GPU. (2) To fully utilize CPU cores and GPUs, we will adopt the MPI+OpenMP+CUDA hybrid paradigm to improve the acceleration algorithm.