1 Introduction

Weather and climate change have a significant impact on people's productivity and daily lives. Using climate models to predict future climate conditions plays an important role in planning social production and formulating disaster prevention measures [1]. In climate models, how reasonably and accurately the distribution and variation of radiation at the Earth's surface are simulated strongly affects the prediction of future weather and climate [2]. Radiative transfer is one of the key processes in atmospheric physics, and it consumes a large share of the computing resources devoted to atmospheric physical processes [3]. Therefore, developing a high-accuracy and high-speed radiative transfer model is very meaningful for global atmospheric modeling. The rapid radiative transfer model for general circulation models (RRTMG) is a correlated k-distribution model that calculates the longwave and shortwave radiative flux and heating rate in the atmosphere [4,5,6]. Due to its high accuracy, RRTMG has been applied in weather forecasting and climate models [7, 8]. For example, the Institute of Atmospheric Physics of CAS Atmospheric General Circulation Model Version 4.0 (IAP AGCM4.0) [9] uses RRTMG as its radiation parameterization scheme. Although RRTMG is already quite fast, it still accounts for 15% to 25% of the computing time of the entire atmospheric physics in IAP AGCM4.0. With the development of high-resolution atmospheric circulation models, the computational cost of RRTMG will increase dramatically, which will seriously limit the performance of large-scale simulations. Thus, it is necessary to further accelerate RRTMG.

GPUs (Graphics Processing Units) have powerful floating-point computing capabilities, high memory bandwidth, and many computing cores. In today's era of green high-performance computing, GPU-accelerated numerical computing is becoming increasingly widespread [10]. The original RRTMG is a serial code. Given the advantages of GPU-based parallel computing, using GPUs to accelerate radiative transfer schemes has become a feasible and valuable research direction [11, 12].

At present, several acceleration algorithms for radiative transfer schemes have been developed. In 2019, Wang et al. developed a GPU version of the RRTMG longwave radiation scheme (RRTMG_LW) using CUDA Fortran, called G-RRTMG_LW, which achieved a speedup of 30.98\(\times\) on a Tesla K40 GPU [13]. Later, CUDA Fortran and CUDA C versions of the RRTMG shortwave radiation scheme (RRTMG_SW) were developed: CF-RRTMG_SW and CC-RRTMG_SW. Compared with RRTMG_SW running on a single Intel Xeon E5-2680 CPU core, CC-RRTMG_SW reached a speedup of 38.88\(\times\) on a single NVIDIA GeForce Titan V GPU [14].

Although CC-RRTMG_SW implements GPU-based computing of RRTMG_SW, its speedup is fairly low and the cost of data transfer between CPU and GPU is relatively large. To fully utilize the computing performance of the GPU, it is necessary to further optimize CC-RRTMG_SW. The main challenges are how to achieve finer-grained parallelism and how to apply CUDA [15] programming optimization techniques to CC-RRTMG_SW. To solve these problems, this paper first proposes two-dimensional (2D) and three-dimensional (3D) acceleration algorithms for RRTMG_SW based on CUDA C and then optimizes them by decoupling data dependencies, reducing global memory accesses, and optimizing I/O transfer. The optimized CC-RRTMG_SW with 3D acceleration is named CC-RRTMG_SW++. Experimental results show that, compared to RRTMG_SW running on a single Intel Xeon E5-2680 v2 CPU core, CC-RRTMG_SW++ without I/O transfer achieves a speedup of 99.09\(\times\) on a single NVIDIA Tesla V100 GPU, and it still achieves a speedup of 24.07\(\times\) with I/O transfer. Compared with CC-RRTMG_SW, the computing efficiency of CC-RRTMG_SW++ is increased by 174.46%.

The main contributions of this paper are as follows.

  • A novel 2D acceleration algorithm is proposed for the spcvmc subroutine of RRTMG_SW, implementing its GPU-based computation in the horizontal and jp-band dimensions. Then, a 3D acceleration algorithm is proposed and implemented for RRTMG_SW in the horizontal, vertical, and g-point dimensions. The two algorithms further accelerate RRTMG_SW.

  • Some performance optimization methods are also proposed, including effectively decoupling data dependencies, using GPU registers to optimize global memory access, and using CUDA streams to reduce the I/O transfer time between host and device. These methods make fuller use of the GPU's computing power and improve the performance of CC-RRTMG_SW++.

The rest of this paper is organized as follows. Section 2 reviews related work on accelerating climate models and radiative transfer schemes. Section 3 introduces the RRTMG_SW model and its parallel dimensions. Section 4 describes the 2D and 3D GPU-based acceleration algorithms for RRTMG_SW. Section 5 introduces several optimization methods for RRTMG_SW on the GPU. Section 6 analyzes the results of numerical experiments. The last section summarizes this paper and outlines future work.

2 Related work

In recent years, with the wide application of GPUs, there has been considerable research on using GPUs to accelerate the parameterization schemes of climate models. In this section, we first introduce some notable recent work on accelerating climate models and then focus on research that uses GPUs to accelerate the radiation physics process.

Huang et al. implemented the WRF five-layer thermal diffusion scheme on the GPU's large-scale parallel architecture. Without considering I/O transfer, the accelerated scheme achieved a speedup of 311\(\times\) on a Tesla K40 GPU [16]. Leutwyler et al. implemented a GPU version of the convection-resolving COSMO model and fully ported it to a multi-core, heterogeneous atmospheric model; they demonstrated the applicability of this approach to longer simulations by conducting a 3-month-long simulation [17]. Mielikainen et al. implemented GPU acceleration for the WRF double-moment 6-class microphysics scheme, achieving a 150\(\times\) speedup on a single GPU and a 206\(\times\) speedup when I/O overhead is not considered [18]. Cao et al. implemented AGCM-3DLF, a highly scalable 3D atmospheric circulation model based on the leapfrog scheme. Experiments on different platforms showed that the model has good efficiency and scalability; on the CAS-Xiandao1 supercomputer, a speed of 11.1 simulated years per day (SYPD) was achieved at a high resolution of 25 km [19].

In terms of using GPUs to accelerate radiation physics, Lu et al. accelerated RRTMG_LW on GTX470, GTX480, and C2050 GPUs, obtaining speedups of 23.2\(\times\), 27.6\(\times\), and 18.2\(\times\), respectively. They also analyzed performance in terms of GPU clock frequency, execution configuration, register usage, and the characteristics of RRTMG_LW [20]. Mielikainen et al. used GPUs to accelerate the Goddard shortwave radiation parameterization scheme and achieved a 116\(\times\) speedup on two NVIDIA GTX 590 GPUs, and a 259\(\times\) speedup with single-precision calculation [21]. Price et al. rewrote the Fortran code of RRTMG_LW in C and then implemented GPU acceleration of RRTMG_LW using CUDA C, applying 19 optimization schemes during performance tuning. Ignoring I/O overhead, a 127\(\times\) speedup was achieved on a Tesla K40 GPU compared to a single Intel Xeon E5-2603 core [22]. Mielikainen et al. used CUDA C to implement GPU computation of RRTMG_SW in WRF, achieving a speedup of 202\(\times\) on a single Tesla K40 GPU [2].

Diverging significantly from previous research, including Wang et al. [14], this paper applies parallel computing in the g-point dimension to RRTMG_SW for the first time and then proposes a parallel scheme in the jp-band dimension. Moreover, this paper further accelerates RRTMG_SW with several optimization methods.

3 Model description and analysis

3.1 RRTMG_SW

The RRTMG model uses the two-stream approximation method to solve the radiative transfer equation and uses the correlated k-distribution method to calculate the radiative flux during molecular absorption and scattering. It divides the radiation spectrum into multiple bands, and within each band the absorption intensity is represented by a cumulative distribution function. To obtain a band's radiant flux, the distribution function is discretized in each band and integrated over a set of g points. For more details, please refer to [23].

Figure 1 shows the structure of RRTMG_SW. RRTMG_SW consists of two subroutines: mcica_subcol_sw and rrtmg_sw. mcica_subcol_sw creates the Monte Carlo independent column approximation (McICA) stochastic arrays, enabling McICA to provide fractional cloudiness and cloud overlap capabilities to RRTMG_SW. rrtmg_sw is the driver of RRTMG_SW and also the core computing part of the model. rrtmg_sw includes four main subroutines. inatm reads plane-parallel atmospheric profile data ranging from the Earth's surface to the top of the atmosphere. cldprmc selects the optical depth parameters of cloud ice and liquid water, and it uses the CAM shortwave cloud optical properties to set the cloud optical depth for McICA. setcoef calculates the pressure-dependent indices and temperature-dependent fractions for the given atmospheric data. spcvmc computes the spectral loop of the shortwave radiative flux and realizes the overall solution of the two-stream approximation model. In addition, spcvmc calls taumol, reftra, and vrtqdr: taumol computes the optical depth and Planck fraction for each spectral band, reftra computes the reflectance and transmittance of the atmosphere for the two-stream approximation model, and vrtqdr computes the vertical quadrature integral in the two-stream approximation model. Figure 2 shows the proportion of computing time for the four subroutines in RRTMG_SW and CC-RRTMG_SW. As can be seen from Fig. 2, spcvmc accounts for a high proportion of the computing time, so further improving its computing efficiency is key to optimizing the model.

Fig. 1
figure 1

The structure of RRTMG_SW

Fig. 2
figure 2

The proportion of computing time for the four subroutines in RRTMG_SW and CC-RRTMG_SW

3.2 Parallelism analysis

RRTMG_SW uses 3D data to represent atmospheric shortwave radiation. The first dimension is the horizontal dimension represented by latitude and longitude, the second is the vertical layer in 3D space, and the third is the spectral g-point dimension [24]. There are fourteen spectral bands in shortwave radiation, and 112 g-point intervals are used to discretize the distribution function when calculating the radiant flux of each spectral band. By analyzing the model, this paper proposes a new parallel scheme in the jp-band dimension, where jp-band distinguishes the fourteen spectral bands. The radiative flux calculation for each spectral band can be performed independently, so parallel computing in the jp-band dimension is feasible.

Currently, CC-RRTMG_SW has achieved acceleration only in the horizontal dimension. To improve parallel scalability, this paper taps the parallel potential of CC-RRTMG_SW in the vertical, g-point, and jp-band dimensions by designing appropriate methods to decouple data dependencies.

4 Multi-dimensional acceleration algorithms

This section introduces the GPU-based 2D and 3D acceleration algorithms of RRTMG_SW, and their implementations.

4.1 Parallel strategy

The GPU-based acceleration algorithm of RRTMG_SW uses the CUDA programming model, and its main computation is performed in parallel within kernels. Figure 3 portrays its execution flow.

Fig. 3
figure 3

Execution flow of a parallel algorithm based on CUDA

Assuming that the global shortwave radiation is divided into 3D grids for calculation, Fig. 4 shows the computational tasks of each call to rrtmg_sw during serial computation and the tasks per thread when parallel computing is applied in different dimensions. In Fig. 4a, rrtmg_sw performs the computation over the horizontal, vertical or jp-band, and g-point dimensions serially. In Fig. 4b, rrtmg_sw uses ncol threads to calculate the horizontal columns of the grids in a 1D parallel manner. In Fig. 4c, rrtmg_sw uses ncol*nlay threads to calculate the horizontal and vertical layers in a 2D parallel manner, or ncol*nbndsw threads to calculate the horizontal and jp-band dimensions. In Fig. 4d, rrtmg_sw uses ncol*nlay*ngptsw threads to calculate the horizontal, vertical, and g-point dimensions in a 3D parallel manner. After dividing the computing dimensions, the kernels can be launched for calculation.
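To make the thread decompositions in Fig. 4 concrete, the following CUDA C sketch shows how 1D and 3D kernels might map threads onto the horizontal (ncol), vertical (nlay), and g-point (ngptsw) dimensions, together with possible launch configurations using the block sizes of 128 and 512 adopted in Sect. 6.1. The kernel names, the array layout, and the placeholder bodies are illustrative assumptions, not the actual CC-RRTMG_SW++ code.

#include <cuda_runtime.h>

/* Illustrative grid sizes; the real values depend on the model configuration. */
#define NCOL   1024   /* horizontal columns per call        */
#define NLAY   51     /* vertical layers (assumed value)    */
#define NGPTSW 112    /* shortwave g-points                 */

/* 1D decomposition (Fig. 4b): one thread per horizontal column. */
__global__ void kernel_1d(double *buf) {
    int iplon = blockIdx.x * blockDim.x + threadIdx.x;
    if (iplon >= NCOL) return;
    for (int lay = 0; lay < NLAY; ++lay)            /* serial loop over layers   */
        for (int ig = 0; ig < NGPTSW; ++ig)         /* serial loop over g-points */
            buf[(ig * NLAY + lay) * NCOL + iplon] += 1.0;
}

/* 3D decomposition (Fig. 4d): one thread per (column, layer, g-point) triple. */
__global__ void kernel_3d(double *buf) {
    int iplon = blockIdx.x * blockDim.x + threadIdx.x;
    int lay   = blockIdx.y * blockDim.y + threadIdx.y;
    int ig    = blockIdx.z * blockDim.z + threadIdx.z;
    if (iplon >= NCOL || lay >= NLAY || ig >= NGPTSW) return;
    buf[(ig * NLAY + lay) * NCOL + iplon] += 1.0;   /* independent per-cell work */
}

int main(void) {
    double *d_buf;
    size_t bytes = (size_t)NCOL * NLAY * NGPTSW * sizeof(double);
    cudaMalloc(&d_buf, bytes);
    cudaMemset(d_buf, 0, bytes);

    kernel_1d<<<(NCOL + 127) / 128, 128>>>(d_buf);  /* 1D: block size 128 */

    dim3 block(128, 2, 2);                          /* 3D: 512 threads per block */
    dim3 grid((NCOL + 127) / 128, (NLAY + 1) / 2, (NGPTSW + 1) / 2);
    kernel_3d<<<grid, block>>>(d_buf);

    cudaDeviceSynchronize();
    cudaFree(d_buf);
    return 0;
}

The 2D cases of Fig. 4c follow the same pattern with only two index components (horizontal plus vertical, or horizontal plus jp-band).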

Fig. 4
figure 4

The division of computing tasks for rrtmg_sw in serial and parallel computing modes (a. serial computing, b. 1D parallel computing, c. 2D parallel computing, d. 3D parallel computing)

4.2 Algorithm implementation

This section mainly describes the 2D and 3D acceleration implementations of inatm, cldprmc, setcoef, and spcvmc. Their kernels are inatm_d, cldprmc_d, setcoef_d, and spcvmc_d, respectively.

4.2.1 inatm_d

Considering data dependencies and data synchronization requirements, inatm is divided into five kernels so that the parallel computing is as fine-grained as possible: inatm_d1, inatm_d2, inatm_d3, inatm_d4, and inatm_d5. The computation of inatm_d1 has no data dependency in the horizontal and vertical dimensions, so it uses 2D parallel computing; inatm_d2 can perform 2D parallel computing in the same way. inatm_d3 has no data dependency in the horizontal, vertical, and g-point dimensions, so it uses 3D parallel computing. In the branch statements of inatm, the part that has no data dependency in the horizontal and vertical dimensions uses 2D parallel computing; this kernel is named inatm_d4. Finally, inatm_d5 uses 1D parallel computing for the remaining part of inatm.

4.2.2 cldprmc_d

cldprmc is used to compute 3D arrays of cloud attribute parameters. It has no data dependency in the horizontal, vertical, and g-point dimensions, so it can use 3D parallel computing.

4.2.3 setcoef_d

To achieve multi-dimensional acceleration, setcoef is divided into two kernels: setcoef_d1 and setcoef_d2. setcoef_d1 has no data dependency in the horizontal and vertical dimensions, so it uses 2D acceleration. setcoef_d2 contains accumulation operations along the vertical dimension, so it can only be parallelized in 1D along the horizontal dimension.
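The following kernel-only sketch (with hypothetical array names and placeholder bodies) illustrates why setcoef is split this way: setcoef_d1 can use one thread per (column, layer) pair, whereas the running accumulation in setcoef_d2 forces each thread to sweep its own column serially.

#define NCOL 1024
#define NLAY 51

/* setcoef_d1 (sketch): every (column, layer) entry is independent, so a 2D
   thread grid is possible. The body is a placeholder. */
__global__ void setcoef_d1(const double *tavel, double *fac) {
    int iplon = blockIdx.x * blockDim.x + threadIdx.x;
    int lay   = blockIdx.y * blockDim.y + threadIdx.y;
    if (iplon >= NCOL || lay >= NLAY) return;
    fac[lay * NCOL + iplon] = 0.5 * tavel[lay * NCOL + iplon];
}

/* setcoef_d2 (sketch): the value at layer k depends on layer k-1, so only the
   horizontal dimension is parallel and each thread accumulates its own column. */
__global__ void setcoef_d2(const double *src, double *accum) {
    int iplon = blockIdx.x * blockDim.x + threadIdx.x;
    if (iplon >= NCOL) return;
    double sum = 0.0;
    for (int lay = 0; lay < NLAY; ++lay) {          /* serial vertical sweep */
        sum += src[lay * NCOL + iplon];
        accum[lay * NCOL + iplon] = sum;
    }
}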

4.2.4 spcvmc_d

In RRTMG_SW, spcvmc is the most complicated subroutine and accounts for 71.4% of the computing time of rrtmg_sw. Thus, maximizing the acceleration of this subroutine is the key to improving the computing efficiency of RRTMG_SW.

spcvmc calls three sub-functions, taumol, reftra, and vrtqdr, which are used as three device functions in spcvmc_d. taumol calls fourteen sub-functions to compute the data of the fourteen bands. The calculation of the fourteen bands is completely independent in the horizontal and vertical dimensions, so this paper implements taumol as a single kernel, taumol_d, which uses 2D parallel computing in the horizontal and vertical dimensions.

After separating out taumol_d, the remaining part of spcvmc_d contains many accumulation operations, so 2D acceleration in the horizontal and vertical dimensions is difficult. According to the analysis in Sect. 3.2, the front part of spcvmc_d calculates the shortwave radiative flux for each band, and the computation for each band has no data dependency; therefore, the computation over the 14 bands is parallelized in 2D across the horizontal and jp-band dimensions in the kernel spcvmc_d1. The remaining part of spcvmc_d is placed in spcvmc_d2, which uses 1D parallel computing. Algorithm 1 shows the 2D acceleration of spcvmc_d1 in the horizontal and jp-band dimensions.
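A kernel-only sketch of the spcvmc_d1 idea is shown below: one thread per (column, band) pair, with the vertical sweep kept inside the thread. The array names, the indexing, and the placeholder flux update are illustrative assumptions rather than the actual Algorithm 1 code.

#define NCOL   1024
#define NLAY   51
#define NBNDSW 14    /* shortwave spectral bands (jp-band dimension) */

/* spcvmc_d1 (sketch): the per-band flux calculation is independent across
   bands, so the jp-band index is used as the second parallel dimension. */
__global__ void spcvmc_d1(const double *band_tau, double *band_flux) {
    int iplon = blockIdx.x * blockDim.x + threadIdx.x;   /* horizontal column  */
    int jb    = blockIdx.y * blockDim.y + threadIdx.y;   /* spectral band (jp) */
    if (iplon >= NCOL || jb >= NBNDSW) return;

    for (int lay = 0; lay < NLAY; ++lay) {               /* vertical sweep stays serial */
        int idx = (jb * NLAY + lay) * NCOL + iplon;
        band_flux[idx] = 2.0 * band_tau[idx];            /* placeholder for the real
                                                            two-stream flux update   */
    }
}

/* Possible launch: 512-thread blocks spanning columns and bands.
   dim3 block(128, 4);
   dim3 grid((NCOL + 127) / 128, (NBNDSW + 3) / 4);
   spcvmc_d1<<<grid, block>>>(d_band_tau, d_band_flux);                        */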

The 3D acceleration algorithm of rrtmg_sw is shown in Algorithm 2.

figure a
figure b

5 Optimization methods

CC-RRTMG_SW++ has 11 kernels. Each kernel needs to transfer I/O data, and when kernels execute, the data must be accessed through global memory. The time spent on these operations has become a bottleneck for improving computing efficiency, so further optimizing CC-RRTMG_SW++ is urgent. The optimization methods include decoupling data dependencies to improve parallelism, using temporary registers to improve memory access performance, avoiding unnecessary data transfers, and using CUDA streams [25] to implement asynchronous data transfer.

5.1 Decoupling data dependency

Data dependencies often prevent algorithms from being parallelized. In most cases, data dependencies are unavoidable, but appropriate methods can be used to decouple them. Figure 5 shows the methods of decoupling data dependencies. The first decoupling method is to separate the computing processes by increasing the number of kernels. However, increasing the number of kernels increases the kernel launch time; grid-level synchronization can be used to eliminate this launch cost. Another decoupling method is to introduce temporary variables and additional computation so that the dependent data are stored in temporary variables.
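As a concrete illustration of the first decoupling method, the sketch below fuses two dependent stages into one cooperative kernel and uses grid-level synchronization instead of a second kernel launch. It relies on the standard cooperative groups API (grid_group::sync() and cudaLaunchCooperativeKernel), which requires a GPU and driver that support cooperative launch; the stage bodies and names are hypothetical.

#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

#define NCOL 1024

/* Two dependent stages fused into one kernel: stage 2 reads data produced by
   other threads in stage 1, so a grid-wide barrier is needed between them. */
__global__ void fused_stages(double *a, double *b) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < NCOL) a[i] = 0.5 * i;              /* stage 1: produce intermediate data   */

    grid.sync();                               /* grid-level synchronization           */

    if (i < NCOL) b[i] = a[(i + 1) % NCOL];    /* stage 2: consume other threads' data */
}

/* A cooperative launch is required for grid.sync():
   void *args[] = { &d_a, &d_b };
   cudaLaunchCooperativeKernel((void *)fused_stages,
                               dim3((NCOL + 127) / 128), dim3(128), args, 0, 0);       */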

In CC-RRTMG_SW++, the calculation dependency in the horizontal dimension is decoupled by increasing the dimensionality of the arrays, which enables 1D parallel computing. Then, as shown in Sect. 4.2, by dividing inatm_d, setcoef_d, and spcvmc_d into smaller kernels, the calculation dependencies in the vertical, jp-band, and g-point dimensions are decoupled. Therefore, 2D and 3D parallel computing can be performed on these kernels.

Fig. 5
figure 5

Decoupling data dependency

5.2 Optimizing memory access

GPUs have several kinds of memory that differ in size, access method, and access speed. In CUDA programming, if no specific memory is specified, each thread accesses its data from global memory. However, the speed of accessing global memory is much lower than that of other types of memory, such as texture memory, shared memory, and registers. In modern many-core accelerators, memory access latency has become the main bottleneck for improving program performance [26]. Therefore, it is necessary to minimize global memory access latency.

In CC-RRTMG_SW++, spcvmc_d still accounts for more than 90% of the running time of rrtmg_sw, so it is worthwhile to further optimize this kernel. After analyzing its code structure and the characteristics of the different GPU memories, a data depot is used to eliminate repeated global memory accesses: data in global memory are temporarily stored in the data depot before being used in the computation. The data depot must have much higher access efficiency than global memory, and both shared memory and registers meet this requirement. When shared memory is used as the data depot, since shared memory is shared by the threads of a block, extra computation is needed in CC-RRTMG_SW++ to eliminate write conflicts among different threads, which weakens the performance gain of the data depot. Therefore, this paper chooses registers as the data depot.

The specific method is to define more temporary variables in the kernel; these temporary variables are stored in the registers of each streaming multiprocessor of the GPU. Accessing registers is much faster than accessing global memory, but registers are a finite resource. If the thread blocks use too many registers, kernel occupancy is reduced, thereby lowering the utilization of the multiprocessors. Therefore, using more registers does not necessarily lead to a more efficient program [27]. In spcvmc_d2, through extensive experiments and performance analysis, and considering the best balance between register usage and occupancy, six arrays are temporarily stored in registers. To make better use of the registers, the original 2D arrays are reduced to 1D, that is, the unnecessary horizontal dimension of the arrays is removed for the calculation inside the kernel. After the computation is finished, their dimensionality is restored: the 1D arrays in the registers are written back to the 2D arrays in global memory. Figure 6 shows the memory access patterns of spcvmc_d2 before and after this optimization.
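The sketch below shows the register "data depot" pattern in a simplified form (variable names and the number of update steps are hypothetical): a flux value is read from global memory once, updated repeatedly in a per-thread local variable that the compiler keeps in registers, and written back only once.

#define NCOL  1024
#define NLAY  51
#define NSTEP 8      /* number of repeated updates; illustrative only */

__global__ void accumulate_flux(const double *contrib, double *flux) {
    int iplon = blockIdx.x * blockDim.x + threadIdx.x;
    if (iplon >= NCOL) return;

    for (int lay = 0; lay < NLAY; ++lay) {
        double f = flux[lay * NCOL + iplon];                  /* one global read  */
        for (int step = 0; step < NSTEP; ++step)              /* updates stay in  */
            f += contrib[(step * NLAY + lay) * NCOL + iplon]; /* a register       */
        flux[lay * NCOL + iplon] = f;                         /* one global write */
    }
}

Without the temporary variable, each of the NSTEP updates would read and write flux in global memory, multiplying the memory traffic by roughly the number of updates.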

Fig. 6
figure 6

The optimization strategy of global memory access

5.3 I/O optimization

CC-RRTMG_SW++ uses heterogeneous hybrid computing, a "CPU+GPU" collaborative computing approach. The logic control is completed by the CPU and the core computation by the GPU, so data transfer between host and device is inevitable [28]. Due to the limitation of the PCIe bus bandwidth, data transfer takes up a large amount of the runtime [29]. In CC-RRTMG_SW, pinned memory has already been used to optimize data transfer. However, data transfer still accounts for nearly 50% of the running time of rrtmg_sw, so further optimization is required.

5.3.1 Avoiding unnecessary data transfer

In CC-RRTMG_SW, data are transferred among different kernels through a device-host-device approach: the intermediate data computed by a kernel are transmitted back to the CPU, and after some simple calculations on the CPU, the data are transmitted to the next kernel. This approach involves a large amount of unnecessary intermediate data transfer. In CC-RRTMG_SW++, the data are therefore kept on the GPU and only the necessary data are transferred back to the CPU; the calculations formerly done on the CPU are also ported into kernels. In this way, the cost of data transfer is effectively reduced.
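A minimal sketch of the device-resident pattern is given below (kernel names and sizes are illustrative): the intermediate buffer produced by the first kernel is consumed directly by the second kernel on the GPU, and only the final result is copied back to the host.

#include <cuda_runtime.h>

__global__ void kernel_a(double *work, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) work[i] = 0.001 * i;                   /* produce intermediate data */
}
__global__ void kernel_b(const double *work, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0 * work[i];                /* consume it on the GPU     */
}

void run_device_resident(double *h_out, int n) {
    double *d_work, *d_out;
    cudaMalloc(&d_work, n * sizeof(double));
    cudaMalloc(&d_out,  n * sizeof(double));

    kernel_a<<<(n + 127) / 128, 128>>>(d_work, n);
    kernel_b<<<(n + 127) / 128, 128>>>(d_work, d_out, n);   /* no host round trip */

    cudaMemcpy(h_out, d_out, n * sizeof(double),
               cudaMemcpyDeviceToHost);                      /* only the final result */
    cudaFree(d_work);
    cudaFree(d_out);
}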

5.3.2 CUDA stream

A CUDA stream is a series of operations issued by the host and executed sequentially on the device. After creating multiple CUDA streams, different tasks can be assigned to different streams. Each CUDA stream completes three steps in turn: copying data from host to device, executing the kernel, and copying the results back to the host. As long as the computation and data transfers of different streams do not depend on each other, they can be performed concurrently, so computation and data transfer can overlap. When using CUDA streams, pinned memory and the asynchronous copy function cudaMemcpyAsync must be used.

This paper uses CUDA streams to optimize the I/O transfer of spcvmc_d1 and spcvmc_d2, whose computation and data transfer times account for a large proportion of the entire model. Four CUDA streams are used to partition the calculations of spcvmc_d1 and spcvmc_d2 over the horizontal grid; each stream transfers one quarter of the horizontal column data (ncol/4) to GPU memory. A pointer offset is set so that each stream computes only the horizontal columns that have already been transferred to the GPU. The data transfer time can thus be hidden by transferring data in one stream while computing in another. Figure 7 illustrates the difference between using the default stream and using multiple streams in spcvmc_d1 and spcvmc_d2.
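The sketch below outlines this stream scheme under stated assumptions: the kernel, its arguments, and the use of a single flat array per chunk are hypothetical simplifications; h_in and h_out must be allocated as pinned memory (e.g., with cudaMallocHost) for cudaMemcpyAsync to overlap with computation; and ncol is assumed to be divisible by four.

#include <cuda_runtime.h>

#define NSTREAM 4

__global__ void spcvmc_chunk(const double *in, double *out, int ncols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < ncols) out[i] = 0.5 * in[i];              /* placeholder for the real work */
}

void spcvmc_streams(const double *h_in, double *h_out,   /* pinned host buffers */
                    double *d_in, double *d_out, int ncol) {
    cudaStream_t streams[NSTREAM];
    int chunk = ncol / NSTREAM;                        /* ncol/4 columns per stream */

    for (int s = 0; s < NSTREAM; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < NSTREAM; ++s) {
        int off = s * chunk;                           /* pointer offset for this chunk */
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(double),
                        cudaMemcpyHostToDevice, streams[s]);
        spcvmc_chunk<<<(chunk + 127) / 128, 128, 0, streams[s]>>>(
            d_in + off, d_out + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(double),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < NSTREAM; ++s) {
        cudaStreamSynchronize(streams[s]);             /* wait for copy/compute/copy */
        cudaStreamDestroy(streams[s]);
    }
}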

Fig. 7
figure 7

Execution flow chart of spcvmc_d using multiple CUDA streams

6 Results and discussion

This paper conducts numerical experiments to evaluate the performance of the proposed acceleration algorithms and optimization methods.

6.1 Experimental setup

The experiments use three different GPU clusters: a K20 cluster, a K40 cluster, and a V100 cluster. The K20 cluster is located in the Computer Network Information Center of CAS, the V100 cluster is located in the Institute of Atmospheric Physics of CAS, and the K40 cluster is located in China University of Geosciences (Beijing). Their hardware configurations are listed in Table 1. The RRTMG_SW runs on a single Intel Xeon E5-2680 v2 CPU core of the K20 cluster. The CC-RRTMG_SW++ heterogeneous code runs on a single node of each cluster.

Table 1 Configurations of GPU Clusters

This paper evaluates the performance of CC-RRTMG_SW++ from three perspectives: parallel acceleration, memory access, and I/O transfer. The evaluation criterion is the speedup of CC-RRTMG_SW++ relative to RRTMG_SW and CC-RRTMG_SW. To achieve the best performance on the three kinds of GPU, the block size for the 1D acceleration kernels is 128, and the block size for the 2D and 3D acceleration kernels is 512. RRTMG_SW has 128\(\times\)256 horizontal grid points when simulating the global shortwave radiative transfer process, with a resolution of \(1.4^{\circ }\times 1.4^{\circ }\). When ncol is set to 1024, the model must be called (128\(\times\)256/1024)\(\times\)24=768 times to simulate one model day (24 hours); when ncol increases, the number of calls decreases accordingly. The following simulation experiments are all carried out for one model day.

6.2 Evaluation of optimization methods

6.2.1 3D acceleration

First, taking the running time of the serial RRTMG_SW as the baseline, a comparative experiment is conducted for CC-RRTMG_SW and CC-RRTMG_SW++. The speedup is the ratio of the serial computing time to the parallel computing time. The time of rrtmg_sw (\(T_{rrtmg\_sw}\)) is calculated with the following formula:

$$\begin{aligned} T_{rrtmg\_sw}=T_\mathrm{inatm}+T_\mathrm{cldprmc}+T_\mathrm{setcoef}+T_\mathrm{taumol}+T_\mathrm{spcvmc}, \end{aligned}$$

where \(T_\mathrm{inatm}\) is the computing time of the subroutine inatm or, for the GPU version, the sum of the times of the kernels \(inatm\_d1\), \(inatm\_d2\), \(inatm\_d3\), \(inatm\_d4\), and \(inatm\_d5\). \(T_\mathrm{cldprmc}\), \(T_\mathrm{setcoef}\), \(T_\mathrm{taumol}\), and \(T_\mathrm{spcvmc}\) are the corresponding computing times of the subroutines or kernels.

The experiment uses one K20 GPU, and ncol is set to 1024. Table 2 shows the speedup of CC-RRTMG_SW++ optimized with 3D parallelism and decoupled data dependencies. As shown in Table 2, compared with RRTMG_SW, the kernels with a higher degree of parallelism, such as inatm, cldprmc, and setcoef, show a significantly larger speedup than with 1D parallelism. The most time-consuming kernel, spcvmc, also achieves a speedup of 3.80\(\times\), so the overall speedup of CC-RRTMG_SW++ with 3D acceleration is 5.05\(\times\), which is 1.84 times faster than CC-RRTMG_SW. Therefore, the 3D parallel method is very efficient.

Table 2 Runtime (s) and speedup of CC-RRTMG_SW and CC-RRTMG_SW++ on a single K20 GPU when ncol=1024

6.2.2 Memory access optimization

Table 3 compares the running time of spcvmc before and after the memory access optimization. Using the method described in Sect. 5.2, the overall computing efficiency of spcvmc on one Tesla V100 GPU improves by 30.11%. Given the large share of computing time that spcvmc takes in the model, this performance improvement is very meaningful.

Table 3 The performance of optimizing memory access for spcvmc

6.2.3 I/O optimization

Table 4 shows the time and speedup of I/O transfer before and after optimization in CC-RRTMG_SW++. CUDA Memory HtoD denotes the data transfer from host to device, and CUDA Memory DtoH denotes the data transfer from device to host. With the I/O optimization methods described in Sect. 5.3, the data transfer time is greatly reduced. By avoiding unnecessary data transfer, the transfer time from device to host becomes almost negligible, and the transfer from host to device is 1.62 times faster with CUDA streams. In addition, the data transfer speed is determined by the PCIe bus bandwidth between host and device, so the speedup of the data transfer time on the K20 GPU and the V100 GPU is almost the same when the PCIe bus bandwidth is unchanged.

Table 4 The time and speedup of I/O transfer before and after optimizing for CC-RRTMG_SW++

6.3 Overall performance evaluation

6.3.1 Evaluation on different GPUs

This paper conducts experiments on the overall performance of CC-RRTMG_SW++ on three different GPUs and compares its speedup relative to RRTMG_SW. The experimental conditions are specified in Sect. 6.1. The results are given in Tables 5, 6, and 7. For the calculation method of the parameters in the three tables, please refer to Sect. 6.2.1. The experimental results show that CC-RRTMG_SW++ has reached a speedup of 99.09\(\times\) on a single Tesla V100 GPU. For the kernels inatm and cldprmc with high parallelism, the speedup is more than 150\(\times\). The results demonstrate the effectiveness and efficiency of 3D parallel computing and the performance optimization methods for RRTMG_SW.

Due to differences in GPU memory size, the maximum value of ncol differs across GPU types. As explained in Sect. 6.1, increasing ncol reduces the number of kernel calls and indirectly increases the parallelism of the kernels in the horizontal dimension. Therefore, the experimental results show that, as ncol increases, the speedup of setcoef and spcvmc improves greatly. For inatm and cldprmc, the calculation is simple and involves many memory accesses; as ncol increases, the higher memory access cost limits the parallel efficiency of these kernels, so their speedup improves only slightly or even decreases slightly. In addition, increasing ncol also increases the horizontal dimension of the arrays, so more GPU memory is required for each call to rrtmg_sw. Figure 8 shows the impact of ncol on the speedup and memory usage of rrtmg_sw on a V100 GPU. When ncol is set to 16384, the memory usage reaches 54.80%, but the speedup of rrtmg_sw is 4.71 times higher than that with ncol=1024.

Because of the differences in memory size, bandwidth, and double-precision floating-point capability among the GPUs, the computing performance of CC-RRTMG_SW++ varies considerably. Figure 9 compares the performance of CC-RRTMG_SW++ on the K20, K40, and V100 GPUs. The highest speedup of CC-RRTMG_SW++ on a single Tesla V100 GPU is 7.59 times that on the K20 and 2.48 times that on the K40.

Fig. 8
figure 8

The influence of ncol on the speedup and memory usage of CC-RRTMG_SW++ on a V100 GPU

Table 5 CC-RRTMG_SW++ runtime (s) and speedup on a single K20 GPU
Table 6 CC-RRTMG_SW++ runtime (s) and speedup on a single K40 GPU
Table 7 CC-RRTMG_SW++ runtime (s) and speedup on a single V100 GPU
Fig. 9
figure 9

The influence of different GPUs on the speedup of CC-RRTMG_SW++ when ncol=2048

6.3.2 With I/O transfer

With I/O transfer, the overall performance of CC-RRTMG_SW++ is given in Table 8. The I/O transfer time (\(T_\mathrm{I/O}\)) and the time of rrtmg_sw (\(T_{rrtmg\_sw}\)) are calculated with the following formulas:

$$\begin{aligned} T_\mathrm{I/O}&=T_\mathrm{HtoD}+T_\mathrm{DtoH}, \\ T_{rrtmg\_sw}&=T_\mathrm{computing}+T_\mathrm{I/O}, \end{aligned}$$

where \(T_\mathrm{HtoD}\) is the data transfer time from host to device, \(T_\mathrm{DtoH}\) is the data transfer time from device to host, and \(T_\mathrm{computing}\) is the sum of the computing times of all kernels. Due to the limited PCIe bus bandwidth and frequent communication, the data transfer between host and device still takes considerable time after the I/O optimization. Nevertheless, with the optimization methods applied, the speedup of CC-RRTMG_SW++ still reaches 24.07\(\times\), which is better than CC-RRTMG_SW.

Table 8 Running time (s) and speedup of CC-RRTMG_SW++ with I/O transfer on different GPUs

6.4 Accuracy verification

To verify the accuracy of CC-RRTMG_SW++, this paper carries out an error analysis between CC-RRTMG_SW++ and RRTMG_SW. The method is to call rrtmg_sw in the serial Fortran version and the CUDA C version, respectively, and to subtract the results to obtain the error. The average error of CC-RRTMG_SW++ relative to RRTMG_SW obtained by this method is -0.000455583 \(W/m^{2}\), which is consistent with CC-RRTMG_SW. The error arises from the difference in floating-point behavior between the CPU and GPU and is magnified by the many cumulative calculations in the code. Since the numerical model itself has accuracy errors and cannot simulate the atmospheric radiative transfer process with absolute accuracy, the error between CC-RRTMG_SW++ and RRTMG_SW is acceptable.
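The verification step amounts to an element-wise comparison of the two outputs; a minimal sketch (with hypothetical array names) is:

/* Mean signed difference between the CUDA C and Fortran results. */
double mean_error(const double *flux_cuda, const double *flux_fortran, long n) {
    double sum = 0.0;
    for (long i = 0; i < n; ++i)
        sum += flux_cuda[i] - flux_fortran[i];   /* signed, so systematic biases show up */
    return sum / (double)n;                      /* about -4.6e-4 W/m^2 in this paper    */
}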

6.5 Discussion

The overall performance of CC-RRTMG_SW++ and CC-RRTMG_SW is compared in Fig. 10. In CC-RRTMG_SW++, both the kernel computation time and the I/O transfer time are much shorter than in CC-RRTMG_SW. Taking the calculation time as the measure of computing efficiency, the average computing efficiency of CC-RRTMG_SW++ on the three types of GPU is 174.46% higher than that of CC-RRTMG_SW. For comparison, the CUDA C-based 1D parallel RRTMG_SW of Mielikainen et al. [2] achieved a speedup of 202\(\times\) on a single Tesla K40 GPU, but it was compared against a serial RRTMG_SW with a grid size of 425\(\times\)308, which is much larger than the 128\(\times\)256 grid used in this paper and therefore allows higher parallelism in the horizontal dimension. If a higher-resolution RRTMG_SW were used as the baseline here, the speedup would be further improved. Thus, the CC-RRTMG_SW++ proposed in this paper is very effective, and it provides a more efficient scheme for accelerating atmospheric physics on GPUs.

In the future, if a GPU version of the entire atmospheric general circulation model or all atmospheric physics processes is developed, the initialization data in CC-RRTMG_SW++ only need to be transferred once from the host to the device. This means that the cost of data transfer is almost negligible, so CC-RRTMG_SW++ will achieve a high computing efficiency.

Fig. 10
figure 10

The runtime (s) of CC-RRTMG_SW (left) and CC-RRTMG_SW++ (right) on three different GPUs when ncol=1024

7 Conclusions and future work

High-efficiency computation on GPU is always challenging. This paper proposes a multi-dimensional acceleration algorithm for RRTMG_SW and some GPU-based performance optimization methods. Then, CC-RRTMG_SW++, an optimized version of CC-RRTMG_SW, is developed. CC-RRTMG_SW++ further exploits the advantages of GPU multi-core computing capabilities. The experimental results prove the effectiveness and high efficiency of CC-RRTMG_SW++. Thus, CC-RRTMG_SW++ can support the higher-resolution computation of shortwave radiative transfer. This is of great significance to the development of the atmospheric general circulation model.

Future work will focus on multi-GPU acceleration and mixed-precision computing. (1) Supercomputers usually have hundreds of thousands of CPU and GPU nodes. To make full use of these nodes, "MPI+CUDA" hybrid programming will be utilized to further accelerate CC-RRTMG_SW++ [30, 31]. (2) Mixed-precision computing of CC-RRTMG_SW++ will also be considered. In recent years, to reduce computing cost and improve computing efficiency, mixed-precision computing has become a hot research topic in high-performance computing. Since accuracy errors objectively exist in many applications, most variables do not actually require double-precision computation [32, 33]. Therefore, half-, single-, and double-precision mixed computing [34, 35] will be quite promising for CC-RRTMG_SW++.