
1 Introduction

Synthetic Aperture Radar (SAR) is a high-resolution imaging radar whose superior performance has led to wide use in military and civilian applications. As SAR technology develops, the volume of data to be processed is growing so quickly that the demands on the signal processor, in computational capacity, imaging accuracy and many other respects, keep rising. For early single-core processors, the way to improve performance was to raise the clock frequency, but under the constraints of the chip manufacturing process, the benefits of a higher frequency are offset by power consumption and yield problems. Multi-core and multi-processor parallel processing emerged as effective ways to increase system processing capacity further.

Currently, the mainstream digital signal processors include DSPs, FPGAs and GPUs. Among the most widely used devices, the peak processing performance of the DSPC6678 is 160 GFLOPS and that of the Intel Arria 10 SoC FPGA is 1.5 TFLOPS, while the Tesla M6 GPU can reach 3.2 TFLOPS. The GPU is a multi-core, general-purpose processor with strong floating-point capability that achieves high-performance parallel operation on a platform with a very large number of stream processors. Compared with the other processors, it has clear advantages in processing capability and memory bandwidth. In addition, NVIDIA, the mainstream GPU manufacturer, launched the CUDA computing architecture in 2006. CUDA is combined with a comprehensive software platform that removes earlier constraints on hardware programmability and development: it uses an easily understood C-like language and does not require a graphics API. This allows high-performance applications to make full use of the GPU’s computing capability. The CUDA computing model works in a CPU + GPU heterogeneous mode in which the CPU is the Host and the GPU is the coprocessor, or Device. As the language has been standardized, CUDA has become widely used for GPU software development in high-performance computing.

This paper first evaluates the FFT performance of the Tesla M6 GPU. It then introduces the basic principle of the SAR algorithm, studies the flow of the classical Range Doppler (RD) algorithm and analyzes its characteristics in detail. Finally, the RD_SAR algorithm is implemented on the CPU + GPU platform using CUDA, and the multi-core GPU platform is compared with the DSPC6678 parallel processing platform in RD algorithm processing.

2 FFT Performance Evaluation of GPU

The fast Fourier transform (FFT) is often used in digital signal processing to obtain the frequency-domain characteristics of a signal. Because it is computationally intensive and time-consuming, the FFT is also often used to evaluate the performance of a processor.

In this paper, we implement a radix-4 FFT on the Tesla M6 GPU based on CUDA and use FFTs of different sizes to evaluate the processor’s performance. The CUFFT math library in the official CUDA release can process one-dimensional, two-dimensional or three-dimensional FFTs, and it can process multiple batches of transforms in parallel. In this way the multi-batch FFT in the complex domain is realized. For each FFT size, the final time is the average over multiple runs.
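The measurement procedure described above (several FFT sizes, a batch of transforms per run, and the mean over repeated runs) can be sketched as follows. This is a minimal CPU-only illustration in Python, not the CUDA/CUFFT code used in this paper: the recursive radix-2 `fft` stands in for the radix-4 GPU transform, and the function names, test signal, batch sizes and repeat counts are assumptions chosen for illustration.

```python
import cmath
import time


def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def average_batch_time(n_points, batch, repeats):
    """Mean wall-clock seconds to transform `batch` signals of `n_points` samples."""
    # Arbitrary deterministic test signal (an assumption, not radar data).
    signals = [[complex(i % 7, -(i % 3)) for i in range(n_points)]
               for _ in range(batch)]
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        for s in signals:  # a GPU library would process the batch in parallel
            fft(s)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        print(n, average_batch_time(n, batch=4, repeats=3))
```

Averaging over repeated runs reduces the influence of one-off effects such as cache warm-up on the reported time.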

The final test results are listed in Table 1, and the comparison with the processing results of the DSPC6678 is shown in Fig. 1. The comparison contains three sets of data: Group A, Group B and Group C.

Table 1. The FFT results of DSP and GPU
Fig. 1. The FFT results of DSP and GPU

Group A: The result is based on TI’s official library dsplib.ae66, using eight cores of the DSPC6678 to implement the FFT in parallel;

Group B: The result is based on a new algorithm, VLFFT, which is designed for FFTs with a large number of points and is more efficient than the method of Group A, again using eight cores of the DSPC6678 to implement the FFT in parallel;

Group C: The result is based on the Tesla M6 GPU implementation of the FFT.

The data above show that the GPU (Group C) is 56x–578x faster than the DSPC6678 with the official library (Group A) and 15x–48x faster than the DSPC6678 with VLFFT (Group B). If the faster of the two DSP results is taken, the GPU achieves a speed-up of about 15 to 67 times over the DSPC6678, that is, about 41 times on average.

3 Research on SAR Processing Performance of Multi-core GPU

This research is based on a high-performance processing board (Fig. 2) that has a CPU + GPU architecture and is used in the OpenVPX platform. The board mainly consists of an NVIDIA Tesla M6 GPU module and an Express-SL7 i68- E22 ComE CPU module. The CPU acts as the master and is responsible for tasks such as bus management and data distribution, while the GPU, the key data processing module, handles the relatively large volumes of data in parallel and shares the processing load with the CPU. The CPU and GPU communicate over PCIe, based on the VITA specification.

Fig. 2. The CPU + GPU board

3.1 Synthetic Aperture Radar (SAR) Algorithm

SAR processing extracts the two-dimensional distribution of the scattering coefficient in the target area from the received echo data. It is essentially two-dimensional data processing. The usual method is to decompose the two-dimensional data into the range (Y-axis) and azimuth (X-axis) directions, so the imaging process is essentially a two-dimensional matched filtering process. This research studies the RD (Range Doppler) algorithm, one of the classical SAR imaging algorithms. The idea of the RD algorithm is to convert the two-dimensional imaging processing of the SAR echo data into two one-dimensional processing steps: matched filtering in the range direction followed by matched filtering in the azimuth direction. The typical data processing flow of the RD algorithm is shown in Fig. 3.

Fig. 3. RD algorithm diagram
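The decomposition into two one-dimensional matched filtering passes can be sketched as below. This is a simplified Python illustration under stated assumptions, not the implementation evaluated in this paper: the names `filter_rows` and `range_doppler` are hypothetical, the reference spectra are supplied by the caller, and steps such as range cell migration correction and quantization are omitted.

```python
import cmath


def fft(x, inverse=False):
    """Unscaled recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(x):
    """Inverse FFT with the 1/N scaling applied once at the top level."""
    return [v / len(x) for v in fft(x, inverse=True)]


def transpose(m):
    """Corner turn: swap the azimuth and range axes."""
    return [list(col) for col in zip(*m)]


def filter_rows(rows, ref_spectrum):
    """Matched-filter every row in the frequency domain: IFFT(FFT(row) * H)."""
    return [ifft([a * h for a, h in zip(fft(r), ref_spectrum)]) for r in rows]


def range_doppler(echo, range_ref_spectrum, azimuth_ref_spectrum):
    """echo[azimuth][range] -> image with the same layout.

    Each stage consumes the output of the previous one, so the stages
    run strictly in sequence.
    """
    compressed = filter_rows(echo, range_ref_spectrum)          # range pass
    by_range_bin = transpose(compressed)                        # corner turn
    focused = filter_rows(by_range_bin, azimuth_ref_spectrum)   # azimuth pass
    return transpose(focused)
```

With all-ones reference spectra both passes reduce to the identity, which gives a quick sanity check of the data flow before real reference functions are plugged in.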

Firstly, as the algorithm diagram above shows, the RD algorithm involves a large amount of data and a large amount of computation, so it is well suited to a parallel, pipelined processing structure. As a processor with powerful parallel computing ability, the GPU devotes more of its transistors to data processing, which makes it very suitable for implementing the RD algorithm.

Secondly, there are strict dependencies between the key steps of the RD algorithm: the input data of each processing module is the output data of the previous module. Therefore, the processing modules cannot be separated and distributed across different processing cores of the GPU. Instead, all computing resources have to be put into the current module, and the next operation starts only after that module has finished.

Finally, the FFT and IFFT are the main steps of the RD algorithm and are used throughout it. Therefore, the efficiency of the FFT on the CUDA platform is an important factor in the performance of the whole program.

3.2 The Performance Analysis of SAR Algorithm Based on GPU

Following the RD algorithm above, this research is based on the 8 K-point FFT, which is the most widely used in practice, and uses 4 K × 8 K complex points of echo data to realize the RD algorithm in the GPU parallel processing system; that is, the total calculation is 4 K (about 4000) groups of 8 K-point one-dimensional FFTs. After the RD processing is completed, we obtain a complete image (Fig. 4) from the two sets of echo data, which shows a clear river.

Fig. 4. RD SAR image
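To give a sense of scale for the 4 K × 8 K data set, the following short calculation estimates its memory footprint and the arithmetic cost of one pass of 8 K-point FFTs. It rests on assumptions that the text does not state: "K" is taken as 1024, each complex sample is single precision (8 bytes), and the FFT cost uses the conventional 5·N·log2(N) operation-count estimate.

```python
K = 1024                # assumption: "K" denotes 1024
BYTES_PER_SAMPLE = 8    # assumption: single-precision complex (2 x 4 bytes)


def dataset_bytes(rows, cols):
    """Memory needed to hold rows x cols complex samples."""
    return rows * cols * BYTES_PER_SAMPLE


def fft_flops(n):
    """Conventional 5*N*log2(N) estimate for one N-point complex FFT (N a power of two)."""
    log2_n = n.bit_length() - 1
    return 5 * n * log2_n


if __name__ == "__main__":
    rows, cols = 4 * K, 8 * K
    print(dataset_bytes(rows, cols) / 2**20, "MiB")       # 256.0 MiB
    print(rows * fft_flops(cols) / 1e9, "GFLOP per pass")  # about 2.18
```

Under these assumptions the whole echo matrix occupies 256 MiB, and one pass of 4 K FFTs of 8 K points costs roughly 2.18 GFLOP.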

As described above, the main parts of the radar signal processing are pulse compression of the radar echo data in the range direction, the FFT in the azimuth direction, matched filtering in the azimuth direction and quantization. In the implementation of the SAR algorithm, the optimization mainly addresses the following two aspects.

On the one hand, the transmission bandwidth between an Intel CPU and DDR3 memory is approximately 25 GB/s, whereas the bandwidth between an NVIDIA GPU and GDDR5 memory can reach 200 GB/s. The CPU and GPU communicate over PCIe, whose bandwidth is 16 GB/s in theory and reached 9.6 GB/s in our program tests. This shows that the PCIe rate falls far short of what the GPU processing requires, and that the transfer is the slowest part of the GPU program. So, in the implementation of the program, the optimization principle is to minimize data transfers between host and device, and to allocate, operate on and release intermediate data directly on the GPU.
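The effect of these bandwidth figures can be made concrete with a rough estimate of how long one copy of the echo data takes over each link. The bandwidths are those quoted above; the data size (4096 × 8192 single-precision complex samples, 8 bytes each) and the use of decimal gigabytes are assumptions made for this sketch, and the estimate ignores latency and protocol overhead.

```python
GB = 1e9  # assumption: decimal gigabytes

# Bandwidths quoted in the text, in GB/s.
LINKS_GB_PER_S = {
    "CPU <-> DDR3": 25.0,
    "GPU <-> GDDR5": 200.0,
    "PCIe (theoretical)": 16.0,
    "PCIe (measured)": 9.6,
}

# Assumed data size: 4096 x 8192 complex samples, 8 bytes each.
DATA_BYTES = 4096 * 8192 * 8


def transfer_ms(n_bytes, gb_per_s):
    """Idealized time in milliseconds to move n_bytes at the given bandwidth."""
    return n_bytes / (gb_per_s * GB) * 1e3


if __name__ == "__main__":
    for name, bandwidth in LINKS_GB_PER_S.items():
        print(f"{name}: {transfer_ms(DATA_BYTES, bandwidth):.2f} ms")
```

Under these assumptions one pass over the data takes about 1.3 ms in GPU memory but about 28 ms across the measured PCIe link, a factor of roughly 21, which is why intermediate results are best kept on the device.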

On the other hand, optimization of the kernel functions is the key to a high-performance GPU program. It generally starts from two aspects: memory access optimization and instruction optimization. In this program the matrix transpose kernel has been carefully optimized, because it has a large impact on overall performance. We use shared memory to optimize memory access in the matrix transpose. Because the shared memory of an NVIDIA GPU is generally small, the data must be divided into blocks according to the shared memory size; this is commonly called tiling. To avoid bank conflicts when using shared memory, the two-dimensional tile is generally declared with size [TILE_DIM][TILE_DIM+1], where TILE_DIM is the one-dimensional size of the tile. Through the thread index settings, the transpose kernel using shared memory performs a double transposition: the position of each tile within the whole input is transposed, and at the same time the elements inside the tile are transposed. The transpose module is shown in Fig. 5.

Fig. 5. Transpose kernel module
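The two ideas behind the transpose kernel, the double transposition of tiles and the [TILE_DIM][TILE_DIM+1] padding, can be illustrated with a small Python emulation. This is a sequential sketch, not the CUDA kernel used in this paper: `tiled_transpose` and `banks_for_column_read` are hypothetical names, TILE_DIM = 4 is chosen only to keep the example small, and the model of 32 shared-memory banks of one 4-byte word each is an assumption about the hardware.

```python
TILE_DIM = 4     # illustrative tile size; CUDA kernels commonly use 32
NUM_BANKS = 32   # assumption: 32 shared-memory banks, one 4-byte word wide


def tiled_transpose(matrix, tile_dim=TILE_DIM):
    """Transpose tile by tile, staging each tile in a padded buffer."""
    rows, cols = len(matrix), len(matrix[0])
    out = [[None] * rows for _ in range(cols)]
    for r0 in range(0, rows, tile_dim):
        for c0 in range(0, cols, tile_dim):
            h = min(tile_dim, rows - r0)
            w = min(tile_dim, cols - c0)
            # Stand-in for the shared-memory tile [TILE_DIM][TILE_DIM + 1].
            tile = [[None] * (tile_dim + 1) for _ in range(tile_dim)]
            for i in range(h):
                for j in range(w):
                    tile[i][j] = matrix[r0 + i][c0 + j]
            # Double transposition: the tile origin moves from (r0, c0) to
            # (c0, r0), and the tile is read column-wise on the way out.
            for j in range(w):
                for i in range(h):
                    out[c0 + j][r0 + i] = tile[i][j]
    return out


def banks_for_column_read(row_stride, column=0, threads=32, num_banks=NUM_BANKS):
    """Bank touched by each thread when thread t reads tile[t][column]."""
    return [(t * row_stride + column) % num_banks for t in range(threads)]


if __name__ == "__main__":
    m = [[r * 7 + c for c in range(7)] for r in range(5)]
    print(tiled_transpose(m) == [list(col) for col in zip(*m)])
    print(len(set(banks_for_column_read(32))))  # unpadded: every thread hits one bank
    print(len(set(banks_for_column_read(33))))  # padded: 32 distinct banks
```

In this bank model, a row stride of 32 words sends every thread of a column read to the same bank, while the padded stride of 33 spreads the threads over all 32 banks, which is the reason for the extra column.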

After careful testing, the results of the two processing systems are compared in Table 2 and Fig. 6. The comparison contains two sets of data: Group A and Group B.

Table 2. RD_SAR Processing Time of DSP and GPU
Fig. 6. RD_SAR processing time of DSP and GPU

Group A: The result is based on using eight cores of DSPC6678 to implement RD_SAR in parallel;

Group B: The result is based on using Tesla M6 GPU to implement RD_SAR.

The results show that the GPU accelerates the RD algorithm markedly compared with the DSPC6678. The total processing time of the SAR algorithm for 4 K × 8 K points is shortened by 1.18 s compared with the existing DSPC6678 processor, so the algorithm runs approximately 1.9x faster on the GPU than on the DSPC6678.
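The two summary figures, a saving of 1.18 s and a speed-up of about 1.9x, together imply approximate total processing times for both platforms. The short calculation below solves T_DSP − T_GPU = 1.18 s and T_DSP / T_GPU = 1.9; the function name is hypothetical, and because both inputs are rounded, the outputs are only indicative and do not replace the measured values in Table 2.

```python
def implied_times(saved_seconds, speedup):
    """Solve t_dsp - t_gpu = saved_seconds and t_dsp / t_gpu = speedup."""
    t_gpu = saved_seconds / (speedup - 1.0)
    t_dsp = speedup * t_gpu
    return t_dsp, t_gpu


if __name__ == "__main__":
    t_dsp, t_gpu = implied_times(1.18, 1.9)
    print(f"DSP about {t_dsp:.2f} s, GPU about {t_gpu:.2f} s")
```

With the rounded inputs this gives roughly 2.49 s for the DSP and 1.31 s for the GPU.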

At the same time, the GPU is 1x–5x faster than the DSP in each main part of the RD algorithm. The gain is largest in range pulse compression, where the GPU achieves a speed-up of 4.98 times over the DSP. Table 3 gives detailed results for the main steps of pulse compression.

Table 3. Times of the range pulse compression

Table 3 shows that the FFT and IFFT are the main processes of pulse compression in the range direction. As analyzed in Section 2, the GPU offers a substantial performance improvement over the DSP in FFT calculation, and the results of the RD algorithm strongly support this conclusion.
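Why the FFT and IFFT dominate pulse compression becomes clear when the operation is written out: the echo is transformed, multiplied by the conjugate spectrum of the transmitted pulse, and transformed back. The sketch below is a minimal Python illustration and not the code measured in Table 3; the chirp parameters, the signal length and the function names are assumptions chosen for the example.

```python
import cmath


def fft(x, inverse=False):
    """Unscaled recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(x):
    """Inverse FFT with the 1/N scaling applied once at the top level."""
    return [v / len(x) for v in fft(x, inverse=True)]


def chirp(n, rate):
    """Unit-amplitude baseband linear FM pulse centred on sample n / 2."""
    return [cmath.exp(1j * cmath.pi * rate * (t - n / 2) ** 2) for t in range(n)]


def pulse_compress(echo, reference):
    """Circular matched filter: IFFT(FFT(echo) * conj(FFT(reference)))."""
    ref = list(reference) + [0j] * (len(echo) - len(reference))
    spectrum = [e * h.conjugate() for e, h in zip(fft(echo), fft(ref))]
    return ifft(spectrum)


if __name__ == "__main__":
    pulse = chirp(32, rate=0.02)              # hypothetical transmitted pulse
    delay = 40                                # hypothetical target delay in samples
    echo = [0j] * delay + pulse + [0j] * (128 - delay - len(pulse))
    compressed = pulse_compress(echo, pulse)
    peak = max(range(len(compressed)), key=lambda i: abs(compressed[i]))
    print(peak)  # 40
```

The compressed output peaks at the sample index equal to the target delay, and each range line costs two forward FFTs (one if the reference spectrum is precomputed), one element-wise multiply and one IFFT.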

4 Conclusion

As SAR real-time imaging systems develop towards higher precision, stricter real-time operation and higher data throughput, they place ever higher requirements on the hardware structure as well as on algorithm optimization. Choosing the right processor to handle large amounts of data is therefore becoming more and more important. In this paper, the multi-core GPU is the main object of research, and the RD_SAR algorithm is implemented on the CPU + GPU platform using the CUDA computing model. The results demonstrate that the GPU has powerful computing ability in high-performance applications and accelerates the RD algorithm significantly. The GPU can therefore better satisfy the real-time requirements of radar signal processing.