
1 Introduction

Convolutional Neural Networks (CNNs) are currently an effective machine learning approach for solving many kinds of real-world problems. CNNs require a large amount of computation for both the training and inference procedures. CNNs consist of several layers, including convolution, activation, fully-connected, and other types of layers. In particular, the convolution layers are the most compute-intensive part of advanced CNN models. The required multi-channel 2D convolution can be implemented in different ways: by converting it to a General Matrix-Matrix Multiplication (GEMM) [2, 7], by the Winograd algorithm [3], by the Fast Fourier Transform (FFT) [6], or by direct computation [8].

In [1], the authors have demonstrated that Software Systolic Array Model (SSAM)-based direct computation of single-input/single-output channel (SISO) convolution can achieve much higher performance than deep learning libraries such as cuDNN, ArrayFire, and Halide on Graphics Processing Units (GPUs). However, to accelerate deep learning further, we need an extension of the SSAM-based approach to the basic multiple-input/multiple-output (MIMO) channel convolution, because the convolutional layers of CNNs typically deal with tensor data with MIMO channels.

The SSAM-based input-stationary approach proposed in [1] realizes systolic data movement for SISO convolution using CUDA shuffle primitives. The approach reuses the input feature map from the caches and the intermediate results in the register files more effectively, and reduces the time- and energy-expensive DRAM read/write accesses, which helps to extract higher performance while reducing energy consumption. However, for MIMO convolution, much more data is needed to compute each output pixel. Therefore, it is essential to use the L1/L2 caches as well as the register files effectively to extract the full performance of the SSAM-based approach.

This paper presents a high-performance implementation of multi-channel convolution obtained by extending the SSAM-based convolution kernel. We propose two methods for multi-channel convolution on the GeForce RTX 3090 of the NVIDIA Ampere GPU architecture. The first method iterates over multi-input/single-output convolutions. The second method iterates over single-input/multi-output convolutions while keeping the partial sums during the MIMO computation. This paper presents performance evaluation results obtained with a profiling tool and describes performance comparisons with the original SSAM/SISO-based convolution kernel and the NVIDIA cuDNN library.

2 Convolution Implementations with Software Systolic Array for Multi-input and Multi-output Channels

The terminology used in this paper is described in Table 1. The 2D convolution is expressed as

$$\begin{aligned} y[i, j] = (x * w)[i, j] = \sum _{s=a_l}^{a_u}\sum _{t=b_l}^{b_u}x[i-s, j-t]\cdot w[s, t], \end{aligned}$$

where \(*\) is the convolution operation, \(\cdot \) is the multiplication operation, w is a filter of the size \(M\times N\) (\(M=a_u-a_l+1\) and \(N=b_u-b_l+1\)), x is a 2-dimensional input matrix (or image) of the size \(H\times W\), and y is a 2-dimensional output matrix of the size \((H+2pad-M+1)\times (W+2pad-N+1)\) (pad is a padding value). The multi-channel convolution with C-input channels and K-output channels is expressed as

$$\begin{aligned}&y_m[k, i, j] = (x_m * w_m)[k, i, j] \nonumber \\&\qquad \qquad \, = \sum _{c=1}^{C}\sum _{s=a_l}^{a_u}\sum _{t=b_l}^{b_u}x_m[c, i-s, j-t]\cdot w_m[k, c, s, t]. \end{aligned}$$
(1)

\(w_m\) is a filter of the size \(K\times C\times M\times N\). \(x_m\) is an input tensor of the size \(C\times H\times W\). \(y_m\) is an output tensor of the size \(K\times (H+2pad-M+1)\times (W+2pad-N+1)\).

Table 1. Terminology

In this study, the given tensor data are stored in KCHW, CHW, and KHW orders for the filter, input, and output tensors, respectively. For example, in the KCHW order, the W data elements of the same row are stored contiguously, the \(H\times W\) matrix data for each of the C input channels are allocated contiguously, and the \(C\times H\times W\) tensor data for each of the K output channels are stored contiguously.
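For illustration, the linear offsets implied by these data orders can be written as small indexing helpers; the sketch below uses hypothetical function names, and HO/WO denote the output height and width.

// Indexing sketch (hypothetical helper names) for the KCHW, CHW, and KHW orders.
__host__ __device__ inline int filterIndex(int k, int c, int s, int t,
                                           int C, int M, int N)    // filter, KCHW order
{
    return ((k * C + c) * M + s) * N + t;
}

__host__ __device__ inline int inputIndex(int c, int i, int j,
                                          int H, int W)            // input, CHW order
{
    return (c * H + i) * W + j;
}

__host__ __device__ inline int outputIndex(int k, int i, int j,
                                           int HO, int WO)         // output, KHW order
{
    return (k * HO + i) * WO + j;
}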

The methods proposed in this paper basically follow the 2D convolution algorithm based on the Software Systolic Array Model (SSAM) described in [1]. The software systolic array simulates the mechanism of hardware systolic arrays, and the SSAM is suitable for memory-bound computations with regular memory access patterns. The features of the SSAM-based 2D convolution algorithm are as follows:

  • All filter data are stored in shared memory at the beginning of the CUDA kernel and are reused during the convolution computation.

  • Input image data are cached in registers. A sliding-window scheme is used to carefully manage the limited number of registers.

  • Each thread in a CUDA warp also caches \(D (=N+P-1)\) elements of the input image at the beginning, where P is the number of sliding-window steps. Hence, each thread computes P rows of elements of the image.

  • The input data are multiplied by the filter data and the products are accumulated. The computation is conducted with fused multiply-add (MAD) operations, and the partial sums are transferred to neighboring threads by the shuffle primitive (a minimal sketch of this data movement is given after this list).
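To make the systolic data movement concrete, the following is a minimal sketch, not the authors' kernel, of a warp-level 1D convolution in which partial sums travel between neighboring threads via the shuffle primitive; the filter length FN and all names are assumptions of this sketch.

#include <cstdio>
#include <cuda_runtime.h>

constexpr int FN   = 3;    // filter length (assumption for this sketch)
constexpr int WARP = 32;   // CUDA warp size

__global__ void warpSystolicConv1D(const float* x, const float* w, float* y)
{
    const int lane = threadIdx.x;   // exactly one warp is launched
    const float x_reg = x[lane];    // each thread caches one input element in a register
    float sum = 0.0f;

    for (int n = 0; n < FN; ++n) {
        // the partial sum moves to the neighboring (higher) lane ...
        sum = __shfl_up_sync(0xffffffffu, sum, 1);
        // ... and is updated with this thread's cached input by a multiply-add
        sum = fmaf(x_reg, w[n], sum);
    }
    // lanes >= FN-1 hold valid results: y[i] = sum_n x[i+n] * w[n]
    if (lane >= FN - 1) y[lane - (FN - 1)] = sum;
}

int main()
{
    float hx[WARP], hw[FN] = {0.25f, 0.5f, 0.25f}, hy[WARP - FN + 1];
    for (int i = 0; i < WARP; ++i) hx[i] = (float)i;

    float *dx, *dw, *dy;
    cudaMalloc(&dx, sizeof(hx)); cudaMalloc(&dw, sizeof(hw)); cudaMalloc(&dy, sizeof(hy));
    cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);
    cudaMemcpy(dw, hw, sizeof(hw), cudaMemcpyHostToDevice);

    warpSystolicConv1D<<<1, WARP>>>(dx, dw, dy);
    cudaMemcpy(hy, dy, sizeof(hy), cudaMemcpyDeviceToHost);

    for (int i = 0; i < WARP - FN + 1; ++i) printf("%.2f ", hy[i]);
    printf("\n");
    cudaFree(dx); cudaFree(dw); cudaFree(dy);
    return 0;
}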

CUDA kernels are launched with a block dimension (blockDim) and a grid dimension (gridDim), each given as a 3D array. The blockDim denotes the number of threads in a block. The gridDim denotes the number of blocks in a grid. In this study, the dimensions of the CUDA kernels are given as

$$\begin{aligned} \text{ blockDim }&= (BS, 1, 1) \text{ and } \\ \text{ gridDim }&= (\lceil \frac{W+2pad}{WarpCount\cdot (WarpSize-N+1)}\rceil , \lceil \frac{H+2pad}{P}\rceil , 1), \end{aligned}$$

where BS is the CUDA thread block size, WarpSize is the CUDA warp size, and \(WarpCount=BS/WarpSize\). Threads in a block of the CUDA kernels are in charge of computing a \(K\times P\times BS\) sub-tensor of the \(K\times (H+2pad-M+1)\times (W+2pad-N+1)\) output tensor.
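As an illustration, a host-side computation of these launch dimensions could look as follows (a sketch with hypothetical helper names; WarpSize is 32 on current NVIDIA GPUs).

#include <cuda_runtime.h>

dim3 makeBlockDim(int BS) { return dim3(BS, 1, 1); }

dim3 makeGridDim(int H, int W, int pad, int N, int P, int BS, int WarpSize = 32)
{
    const int WarpCount    = BS / WarpSize;                          // warps per thread block
    const int colsPerBlock = WarpCount * (WarpSize - N + 1);         // output columns per block
    const int gx = (W + 2 * pad + colsPerBlock - 1) / colsPerBlock;  // ceil((W+2*pad)/colsPerBlock)
    const int gy = (H + 2 * pad + P - 1) / P;                        // ceil((H+2*pad)/P)
    return dim3(gx, gy, 1);
}

// Usage sketch: kernel<<<makeGridDim(H, W, pad, N, P, BS), makeBlockDim(BS)>>>(...);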

In the present study, we have implemented two convolution methods for multi-input and multi-output channels. Listing 1.1 iterates across output channels in the outermost loop, while Listing 1.2 iterates across input channels in the outermost loop. In the following, Listings 1.1 and 1.2 are called the OUT_IN method and the IN_OUT method, respectively. In both implementations, the filters are accessed in decreasing index order and the input images in increasing index order, i.e., each is accessed in the opposite order to that of Eq. (1).

In the OUT_IN method, the data for the accumulation are initialized to zero at the beginning of each outermost-loop iteration (Lines 8–9 in Listing 1.1) and the computed results are stored to the destination array at the end of each outermost-loop iteration (Lines 32–33). On the other hand, in the IN_OUT method, the array for the accumulation is initialized before the outermost loop (Lines 7–9 in Listing 1.2) and the results are written out to the destination array after the outermost loop (Lines 34–37).
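The following simplified host-side sketch (hypothetical names; padding and the filter orientation of Eq. (1) are omitted for brevity) illustrates only this difference in loop order and accumulator placement; it is not the authors' CUDA listing.

#include <vector>
#include <algorithm>

// OUT_IN: outermost loop over output channels; the accumulator is initialized
// and the result is stored once per output channel.
void conv_out_in(const float* x, const float* w, float* y,
                 int C, int K, int H, int W, int M, int N)
{
    const int HO = H - M + 1, WO = W - N + 1;
    for (int k = 0; k < K; ++k)
        for (int i = 0; i < HO; ++i)
            for (int j = 0; j < WO; ++j) {
                float sum = 0.0f;                                   // init per output channel
                for (int c = 0; c < C; ++c)
                    for (int s = 0; s < M; ++s)
                        for (int t = 0; t < N; ++t)
                            sum += x[(c * H + i + s) * W + j + t]
                                 * w[((k * C + c) * M + s) * N + t];
                y[(k * HO + i) * WO + j] = sum;                     // store per output channel
            }
}

// IN_OUT: outermost loop over input channels; K partial sums per output pixel
// are kept alive across the whole loop and written out at the end, while each
// input channel is read only once.
void conv_in_out(const float* x, const float* w, float* y,
                 int C, int K, int H, int W, int M, int N)
{
    const int HO = H - M + 1, WO = W - N + 1;
    std::vector<float> sum_final((size_t)K * HO * WO, 0.0f);        // init before the loop
    for (int c = 0; c < C; ++c)                                     // input channel read once
        for (int k = 0; k < K; ++k)
            for (int i = 0; i < HO; ++i)
                for (int j = 0; j < WO; ++j)
                    for (int s = 0; s < M; ++s)
                        for (int t = 0; t < N; ++t)
                            sum_final[(k * HO + i) * WO + j]
                                += x[(c * H + i + s) * W + j + t]
                                 * w[((k * C + c) * M + s) * N + t];
    std::copy(sum_final.begin(), sum_final.end(), y);               // store after the loop
}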

For both methods, the number of required floating-point operations per CUDA thread is \(FLOP_{thread} = 2\cdot K\cdot C\cdot M\cdot N\cdot P\). The amount of output data written to the GPU off-chip memory is \(K\cdot P\cdot sizeof(dataType)\) for both methods. The amount of input data read from the GPU's off-chip memory is \(K\cdot C\cdot (P+M-1+\lceil M\cdot N/BS\rceil )\cdot sizeof(dataType)\) for the OUT_IN method. The IN_OUT method reuses the input data in each outermost iteration after reading the data (Lines 11–12 in Listing 1.2), so its input amount from the GPU's off-chip memory is \(C\cdot (P+M-1+\lceil M\cdot N/BS\rceil )\cdot sizeof(dataType)\). This means that the IN_OUT method requires K times fewer memory accesses for the input tensor than the OUT_IN method. However, the IN_OUT method has to keep K times more temporary data for the sum_final summations in a 2D array. Note that the temporary data might be forced out to the off-chip memory if the number of available registers is not enough for keeping the data.
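As a concrete example, with the parameter setting used later in Sect. 3 (\(P=4\), \(BS=128\)), single-precision data, and assuming \(C=K=8\) and \(M=N=3\), these formulas give

$$\begin{aligned} FLOP_{thread} &= 2\cdot 8\cdot 8\cdot 3\cdot 3\cdot 4 = 4608 \text{ flop}, \\ \text{OUT\_IN input} &= 8\cdot 8\cdot (4+2+1)\cdot 4 = 1792 \text{ bytes}, \\ \text{IN\_OUT input} &= 8\cdot (4+2+1)\cdot 4 = 224 \text{ bytes}, \end{aligned}$$

i.e., the IN_OUT method reads \(K=8\) times less input data per thread, while both methods write \(8\cdot 4\cdot 4 = 128\) bytes of output per thread.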

Listing 1.1. The OUT_IN method: iterations across output channels in the outermost loop.
Listing 1.2. The IN_OUT method: iterations across input channels in the outermost loop.

3 Performance Evaluation

The specification of the NVIDIA RTX 3090 GPU [5] is shown in Table 2. The CUDA toolkit version used is 11.4 and the GPU driver version is 470.57.02. The operating system of the environment is Ubuntu 20.04.2 LTS with the Linux 5.13.0-35-generic kernel. The program codes are compiled with the nvcc command using the -gencode arch=compute_86,code=sm_86 -fmad true options and the parameter setting of \(P=4\) and \(BS=128\). The performance measurements are conducted with the padding value \(pad=1\). Matrix and tensor data in the program are allocated as single-precision floating-point arrays, and single-precision operations are used for computing the multi-channel convolution. The CUDA cores of the GPU are utilized and its Tensor Cores are not used in the evaluation.

Table 2. Specification of the NVIDIA RTX 3090 GPU. The L1 cache and shared memory use the same 128 KB hardware resources per SM, and the preferred cache configuration can be selected.

Figure 1 shows the performance of the convolution kernel implementation for the \(H\times W\) image with C-input and K-output channels. The Gflop/s performance is calculated by the following formula:

$$\begin{aligned} \text{ Performance } \text{[flop/s] } = \frac{2\cdot H\cdot W\cdot M\cdot N \cdot C\cdot K \text{[flop] }}{ \text{ time } \text{[second] }}. \end{aligned}$$
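For example, for \(H=W=512\), \(C=K=8\), and \(M=N=3\), the numerator amounts to

$$2\cdot 512\cdot 512\cdot 3\cdot 3\cdot 8\cdot 8 \approx 3.02\times 10^{8} \text{ flop},$$

so a measured kernel time of, say, 1 ms (an illustrative value only) would correspond to about 302 Gflop/s.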

In the evaluation, each of the convolution CUDA kernels is called 20 times and its average time is used as the computation time. “Original” in Fig. 1 is the performance of the 2D convolution kernel by Chen et al. [1]. Note that the performance of the Original method is almost the same as the performance of the implemented SISO case (the \(C=K=1\) case).
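A measurement of this kind can be sketched with CUDA events as follows; conv_kernel and its launch arguments are placeholders rather than the actual kernel interface.

#include <cuda_runtime.h>

// Returns the average time (ms) of one kernel call over 20 launches.
float averageKernelTimeMs(dim3 grid, dim3 block)
{
    const int trials = 20;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int t = 0; t < trials; ++t) {
        // conv_kernel<<<grid, block>>>(/* arguments */);   // measured convolution kernel
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed time of all 20 launches
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / trials;                        // average time of one launch
}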

Fig. 1. Performance of computing convolution with C-input and K-output channels for the \(H\times W\) image.

As shown in Fig. 1, the performance of the OUT_IN method is higher than that of the IN_OUT and Original methods in most cases. For example, in the case of \(H=W=256, K=1\) (Fig. 1a), the performance of the OUT_IN method with \(C=8\) for the \(M\times N\) filter is 435 Gflop/s, which is 1.14 and 5.95 times higher than that of the IN_OUT method with \(C=8\) (383 Gflop/s) and the Original method (73 Gflop/s), respectively. The performance advantage of the OUT_IN method over the Original method becomes weaker when the image size and the filter size are larger; the Original method shows higher performance than the OUT_IN method in the case of \(H=W=8192, M=N=10\) (Figs. 1e, 1f).

Table 3 shows the profiling results of the multi-channel convolution implementations obtained with the ncu command of the NVIDIA Nsight Compute profiler for \(512\times 512\) images and \(3\times 3\) filters. The OUT_IN method shows higher cache hit rates and higher instruction issue slot utilization than the other methods. In the case of \(K=8\), the profiling results of the IN_OUT method indicate that its memory utilization is not efficient: the DRAM (off-chip memory) read amount is much larger and the cache hit rates are lower than in the other cases. The low memory utilization is probably caused by register spills [4], which lead to repeated read-write accesses between the registers and the off-chip memory. The IN_OUT method needs to keep the sum array data while computing all K output channels; hence, a lack of available registers is highly likely when the number of output channels is large.

Table 3. Profiling results of multi-channel convolution computing by Nsight Compute for the case of \(H=W=512, C=1, M=N=3\).

Figure 2 shows the performance for different numbers of input and output channels, different image sizes, and \(3\times 3\) filters. In Fig. 2, the performance of the GEMM-based multi-channel convolution in NVIDIA cuDNN v8.3.1 is also depicted. cuDNN is a GPU-accelerated library for deep neural networks. cuDNN supports several algorithms for the convolution forward (cudnnConvolutionForward) function. Figure 2 shows the cuDNN performance of the IMPLICIT_PRECOMP_GEMM algorithm, which demonstrates higher performance than the other supported algorithms in most cases for single-precision data. The performance of the cuDNN library increases modestly as the image size and the number of channels increase.
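For reference, a minimal sketch of invoking this cuDNN algorithm is shown below (error checking is omitted and the tensor sizes and variable names are only illustrative).

#include <cudnn.h>
#include <cuda_runtime.h>

void cudnnConvForward(const float* d_x, const float* d_w, float* d_y,
                      int C, int K, int H, int W, int M, int N, int pad)
{
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    // NCHW single-precision tensors with batch size 1 (CHW input, KHW output)
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, C, H, W);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, K, C, M, N);
    cudnnSetConvolution2dDescriptor(convDesc, pad, pad, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    int n, k, ho, wo;
    cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc, &n, &k, &ho, &wo);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, k, ho, wo);

    // The algorithm compared against in Fig. 2
    cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
    size_t wsSize = 0;
    cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc, yDesc,
                                            algo, &wsSize);
    void* ws = nullptr;
    cudaMalloc(&ws, wsSize);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnConvolutionForward(handle, &alpha, xDesc, d_x, wDesc, d_w, convDesc,
                            algo, ws, wsSize, &beta, yDesc, d_y);

    cudaFree(ws);
    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroy(handle);
}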

The OUT_IN implementation shows much higher performance than the cuDNN library. As shown in Fig. 2, increasing the number of channels does not always improve the performance of the OUT_IN implementation. In the case of relatively small image sizes such as \(H=W=256\) or \(H=W=512\) (Fig. 2a, b), the performance tends to improve as the number of channels increases. However, the performance deteriorates when \(H=W=1024, K=8 \text{ or } K=32\) (Fig. 2d). Figure 3 shows the L1 and L2 cache hit rates of the OUT_IN implementation (the rates are also measured by the Nsight Compute profiler). The L1 hit rates for \(128\times 128\) images do not drop even when the number of channels increases. In contrast, the L1/L2 hit rates for \(512\times 512\) or \(1024\times 1024\) images first decrease and then level off as the number of channels increases. These cache hit rate results are consistent with the performance tendencies shown in Fig. 2.

Fig. 2. Performance of convolution computing by the OUT_IN method and cuDNN with C-input and K-output channels for the \(H\times W\) image and \(3\times 3\) filters.

Fig. 3. L1 and L2 cache hit rates of the OUT_IN convolution computing with C-input and K-output channels for the \(H\times W\) image and \(3\times 3\) filters.

4 Conclusion

We have developed a high-performance kernel for multi-channel convolution by extending an existing convolution kernel based on a software systolic array model [1]. The implemented multi-channel kernel demonstrates higher performance than the original single-channel kernel and the GEMM-based computation of the cuDNN library. Profiling results of the implemented kernel show that the efficient usage of registers and caches is more important for multi-channel computing than for single-channel computing.

Our future work includes applying the implemented multi-channel convolution kernel to different CNN models such as VGG, GoogLeNet, and ResNet. Another direction is a more detailed performance comparison with other deep learning libraries using the Tensor Cores and CUDA cores of current and future GPUs.