The Research of SAR Processing Performance Based on Multi-core GPU

Wang, Yuwei; Li, Xingming; Hu, Shanqing; Yu, Jiacheng

doi:10.1007/978-981-10-7521-6_19

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 473)

Included in the following conference series:

International Conference On Signal And Information Processing, Networking And Computers

Abstract

Synthetic Aperture Radar (SAR) processing involves large data volumes and computationally complex algorithms, so SAR signal processing systems must keep improving in real-time performance, storage capacity, data throughput and computing capability. As a multi-core architecture, the Graphics Processing Unit (GPU) offers powerful computing capability and high memory bandwidth, which meets the urgent needs of large-scale data-parallel applications for scalability, computing capability and storage bandwidth. This paper first evaluates the FFT performance of the NVIDIA Tesla M6 GPU, which achieves an average 41x speedup over TI’s TMS320C6678 DSP. The RD (Range Doppler) algorithm, the most classical SAR imaging algorithm, is then implemented on a CPU + GPU platform using CUDA. The execution time of the SAR algorithm for 4 K × 8 K points is shortened by 1.18 s, and the GPU runs the RD-SAR algorithm about 1.9x faster than the DSP C6678.

1 Introduction

As a high-resolution imaging radar with superior performance, Synthetic Aperture Radar (SAR) is widely used in military and civilian applications. As SAR technology develops, the scale of the data to be processed grows so quickly that the demands on the signal processor's computing capability, imaging accuracy and other factors keep rising. For early single-core processors, performance was improved by raising the clock frequency, but, constrained by the chip manufacturing process, the gains from higher frequency are offset by power consumption and yield problems. Multi-core and multi-processor parallel processing therefore emerged as effective ways to increase system processing capacity further.

Currently, the mainstream digital signal processors include DSPs, FPGAs and GPUs. Among the most widely used devices, the peak processing performance of the DSPC6678 is 160 GFLOPS and that of the Intel Arria 10 SoC FPGA is 1.5 TFLOPS, while the Tesla M6 GPU can reach 3.2 TFLOPS. The GPU is thus a multi-core processor with strong floating-point capability, and it can serve as a general-purpose processor that performs high-performance parallel computation across a very large number of stream processors. Compared with the other processors, the GPU has clear advantages in processing capability and memory bandwidth. In addition, the mainstream GPU manufacturer NVIDIA launched the CUDA computing architecture in 2006. Combined with a comprehensive software platform, CUDA removes earlier constraints on hardware programmability and development: it uses an easy-to-understand C-like language and does not require a graphics API, so high-performance applications can make full use of the GPU’s computing capability. The CUDA computing model works in a CPU + GPU heterogeneous mode, in which the CPU is the Host and the GPU is the coprocessor, or Device. With the standardization of the language, CUDA has become widely used for GPU software development in high-performance computing.

This paper first evaluates the FFT performance of the Tesla M6 GPU. It then introduces the basic principle of SAR processing, studies the flow of the RD algorithm, a classical SAR imaging algorithm, and analyzes its characteristics in detail. Finally, the RD_SAR algorithm is implemented on the CPU + GPU platform using CUDA, and the GPU platform is compared with the DSPC6678 parallel processing platform on RD algorithm processing.

2 FFT Performance Evaluation of GPU

The fast Fourier transform (FFT) is widely used in digital signal processing to obtain the frequency-domain characteristics of a signal. Because it is computationally intensive and time-consuming, the FFT is also often used to evaluate processor performance.

In this paper, we implement a radix-4 FFT on the Tesla M6 GPU with CUDA and use FFTs of different sizes to evaluate the processor’s performance. The CUFFT math library in the official CUDA release can process multiple batches of one-, two- or three-dimensional FFTs in parallel, which we use to realize a multi-batch FFT in the complex domain. For each FFT size, the reported time is the average over multiple runs.
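The idea of a multi-batch complex FFT can be illustrated on the CPU. The following is a minimal sketch in plain Python, assuming a recursive radix-2 Cooley-Tukey transform rather than the radix-4 or CUFFT kernels used in the paper; the names `fft`, `batched_fft` and `naive_dft` are hypothetical. Each row is transformed independently, which is the property that lets a GPU library process many batches at once.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # transform of even-indexed samples
    odd = fft(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def batched_fft(rows):
    """Transform every row independently (the batches a GPU runs in parallel)."""
    return [fft(row) for row in rows]


def naive_dft(x):
    """Direct O(n^2) DFT, used only as a reference to check the FFT."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


# Two batches of 8 complex points each (illustrative values only).
batch = [[complex(i + r, r - i) for i in range(8)] for r in range(2)]
spectra = batched_fft(batch)
```

In a timing experiment like the one described above, the batch call would be repeated several times and the elapsed times averaged.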

The test results are listed in Table 1, and a comparison with the DSPC6678 results is shown in Fig. 1. The comparison contains three sets of data: Group A, Group B and Group C.

Table 1. The FFT results of DSP and GPU

Group A: Results obtained with TI’s official library dsplib.ae66, using eight cores of the DSPC6678 to compute the FFT in parallel;

Group B: Results obtained with VLFFT, a newer algorithm designed for large FFT sizes that is more efficient than the Group A method, again using eight cores of the DSPC6678 in parallel;

Group C: Results obtained on the Tesla M6 GPU.

The data show that the GPU (Group C) is 56x–578x faster than the DSPC6678 in Group A and 15x–48x faster than the DSPC6678 in Group B. Taking the faster of the two DSP results for each FFT size, the GPU achieves a speedup of about 15x to 67x over the DSPC6678, or 41x on average.

3 Research on SAR Processing Performance of Multi-core GPU

This research is based on a high-performance processing board (Fig. 2) that uses a CPU + GPU architecture and is deployed in an OpenVPX platform. The board mainly consists of an NVIDIA Tesla M6 GPU module and an Express-SL7 i68- E22 ComE CPU module. The CPU acts as the master and is responsible for bus management and data distribution. The GPU is the key data processing module: it handles the relatively large volumes of data in parallel and shares the processing load with the CPU. The CPU and GPU communicate over PCIE, based on the VITA specification.

3.1 Synthetic Aperture Radar (SAR) Algorithm

SAR processing extracts the two-dimensional distribution of the scattering coefficient of the target area from the received echo data. It is essentially two-dimensional data processing. The usual approach decomposes the two-dimensional data into the range (Y-axis) and azimuth (X-axis) directions, so imaging is essentially a two-dimensional matched filtering process. This research studies the RD (Range Doppler) algorithm, one of the classical SAR imaging algorithms. The idea of the RD algorithm is to convert the two-dimensional processing of the SAR echo data into two one-dimensional matched filtering operations, one in the range direction and one in the azimuth direction. The typical data processing flow of the RD algorithm is shown in Fig. 3.
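The decomposition into two one-dimensional matched filters can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes an idealized, separable point target, omits range cell migration correction and quantization, and uses a direct O(n²) DFT to stay short. All function names, the chirp parameters and the 16 × 16 size are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Direct O(n^2) DFT or inverse DFT, kept simple for illustration."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def chirp(length, total):
    """Unit-amplitude linear FM pulse of `length` samples, zero-padded to `total`."""
    return [cmath.exp(1j * cmath.pi * n * n / length) if n < length else 0j
            for n in range(total)]


def matched_filter(line, ref_spectrum_conj):
    """Compress one line: FFT, multiply by the conjugate reference spectrum, IFFT."""
    spectrum = dft(line)
    return dft([s * h for s, h in zip(spectrum, ref_spectrum_conj)], inverse=True)


def transpose(m):
    return [list(col) for col in zip(*m)]


def rd_image(echo, range_ref, azimuth_ref):
    """Range compression on rows, then azimuth compression on columns."""
    h_r = [v.conjugate() for v in dft(range_ref)]
    h_a = [v.conjugate() for v in dft(azimuth_ref)]
    rows = [matched_filter(row, h_r) for row in echo]
    cols = [matched_filter(col, h_a) for col in transpose(rows)]
    return transpose(cols)


def point_target(az_ref, rg_ref, a0, r0):
    """Echo of one ideal scatterer located at azimuth a0, range r0."""
    n_az, n_rg = len(az_ref), len(rg_ref)
    return [[az_ref[(a - a0) % n_az] * rg_ref[(r - r0) % n_rg] for r in range(n_rg)]
            for a in range(n_az)]


rg = chirp(8, 16)
az = chirp(8, 16)
image = rd_image(point_target(az, rg, a0=5, r0=9), rg, az)
peak = max((abs(image[a][r]), a, r) for a in range(16) for r in range(16))
```

After both compressions the energy of the simulated scatterer concentrates in a single cell, so `peak` reports the position at which the target was placed.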

First, as the algorithm diagram shows, the RD algorithm involves a large amount of data and computation, so it suits a parallel, pipelined processing structure. The GPU, a processor with powerful parallel computing capability, devotes more of its transistors to data processing, which makes it well suited to implementing the RD algorithm.

Second, there are strict dependencies between the key steps of the RD algorithm: the input of each processing module is the output of the previous one. The processing modules therefore cannot be separated and distributed across different processing cores of the GPU. Instead, all computing resources are devoted to the current module, and the next operation starts only after that module has finished.

Finally, the FFT and IFFT are the main steps of the RD algorithm and are used throughout it. The efficiency of the FFT on the CUDA platform is therefore an important factor in the performance of the whole program.

3.2 The Performance Analysis of SAR Algorithm Based on GPU

Following the RD algorithm above, this research uses the 8 K-point FFT, the size most widely used in practice, and processes 4 K × 8 K complex points of echo data with the RD algorithm on the GPU parallel processing system. The total workload is thus 4000 groups of 8 K-point one-dimensional FFTs. After RD processing, a complete image of the two sets of echo data is obtained (Fig. 4); it shows a clear river.

As described above, the main parts of the radar signal processing are pulse compression of the echo data in the range direction, the FFT in the azimuth direction, matched filtering in the azimuth direction, and quantization. The implementation of the SAR algorithm is optimized mainly in the following two respects.

First, the transmission bandwidth between an Intel CPU and DDR3 is approximately 25 GB/s, whereas the bandwidth between an NVIDIA GPU and GDDR5 can reach 200 GB/s. The CPU and GPU communicate over PCIE, whose bandwidth is 16 GB/s in theory and reached 9.6 GB/s in our program tests. The PCIE rate therefore falls far short of what the GPU can process, and it is the slowest part of the GPU program. The optimization principle in the implementation is accordingly to minimize data transfers between host and device, and to allocate, operate on and release intermediate data directly on the GPU.
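A rough estimate makes the gap between these links concrete. The sketch below assumes that "4 K × 8 K" means 4096 × 8192 samples, that each sample is a single-precision complex value (8 bytes), and that 1 GB/s means 10⁹ bytes per second; none of these are stated in the text, and the helper name `transfer_seconds` is hypothetical.

```python
def transfer_seconds(num_bytes, bandwidth_gb_per_s):
    """Ideal time to move num_bytes over a link, ignoring latency and overheads."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


ROWS, COLS = 4096, 8192      # assumption: 4 K x 8 K = 4096 x 8192 points
BYTES_PER_SAMPLE = 8         # assumption: complex value made of two 32-bit floats
data_bytes = ROWS * COLS * BYTES_PER_SAMPLE   # 268,435,456 bytes (256 MiB)

links = [
    ("PCIE (measured)", 9.6),
    ("PCIE (theoretical)", 16.0),
    ("CPU to DDR3", 25.0),
    ("GPU to GDDR5", 200.0),
]
for name, gb_per_s in links:
    ms = transfer_seconds(data_bytes, gb_per_s) * 1e3
    print(f"{name:>20}: {ms:.1f} ms per pass over the data")
```

Under these assumptions a single one-way copy of the data set over PCIE takes about 28 ms at the measured rate, against roughly 1.3 ms for one pass through GPU memory, which is why intermediate results are better kept on the device.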

Second, kernel optimization is the key to a high-performance GPU program. It generally covers two aspects: memory access optimization and instruction optimization. In this program the matrix transpose kernel has been carefully optimized, and it has a large impact on overall performance. We use shared memory to optimize memory access during the matrix transpose. Because the shared memory of an NVIDIA GPU is generally small, the data must be divided into blocks that fit in shared memory; this is commonly called tiling. To avoid bank conflicts when using shared memory, the two-dimensional tile is generally declared as [TILE_DIM][TILE_DIM+1], where TILE_DIM is the one-dimensional size of the tile. Through the thread index settings, the transpose kernel using shared memory performs a double transposition: the position of each tile within the input data is transposed, and the elements inside the tile are transposed as well. The transpose module is shown in Fig. 5.
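The double transposition and the effect of the extra padding column can be emulated on the CPU. The following is a sketch in plain Python rather than the CUDA kernel itself; it assumes 32 shared-memory banks of four bytes each and one four-byte element per thread (wider elements such as complex values map to banks differently). The names `tiled_transpose` and `banks_hit_by_column_read` are hypothetical.

```python
TILE_DIM = 32    # assumption: tile edge equal to the number of banks
NUM_BANKS = 32   # assumption: 32 four-byte shared-memory banks


def tiled_transpose(matrix, tile_dim=TILE_DIM):
    """Transpose tile by tile through a padded [tile_dim][tile_dim + 1] buffer."""
    rows, cols = len(matrix), len(matrix[0])
    out = [[None] * rows for _ in range(cols)]
    for tile_r in range(0, rows, tile_dim):
        for tile_c in range(0, cols, tile_dim):
            height = min(tile_dim, rows - tile_r)
            width = min(tile_dim, cols - tile_c)
            # Staging buffer standing in for shared memory; the extra
            # column is padding and is never read.
            tile = [[None] * (tile_dim + 1) for _ in range(tile_dim)]
            for i in range(height):          # load the tile row by row
                for j in range(width):
                    tile[i][j] = matrix[tile_r + i][tile_c + j]
            # The tile moves to the mirrored position (tile_c, tile_r) and
            # its elements are read column-wise, so both levels transpose.
            for j in range(width):
                for i in range(height):
                    out[tile_c + j][tile_r + i] = tile[i][j]
    return out


def banks_hit_by_column_read(row_pitch, column=0, threads=TILE_DIM):
    """Number of distinct banks touched when `threads` threads read one column."""
    return len({(t * row_pitch + column) % NUM_BANKS for t in range(threads)})


unpadded = banks_hit_by_column_read(TILE_DIM)      # row pitch of 32 words
padded = banks_hit_by_column_read(TILE_DIM + 1)    # row pitch of 33 words
```

With a row pitch of TILE_DIM every thread of a column read lands in the same bank, while a pitch of TILE_DIM+1 spreads the same read over all banks, which is the reason for the padded declaration.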

After careful testing, the results of the two processing systems are compared in Table 2 and Fig. 6. The comparison contains two sets of data: Group A and Group B.

Table 2. RD_SAR Processing Time of DSP and GPU

Group A: Results obtained using eight cores of the DSPC6678 to run RD_SAR in parallel;

Group B: Results obtained using the Tesla M6 GPU to run RD_SAR.

The results show that the GPU clearly accelerates the RD algorithm compared with the DSPC6678. The total processing time of the SAR algorithm for 4 K × 8 K points is shortened by 1.18 s relative to the existing DSPC6678 processor, which means the algorithm runs approximately 1.9x faster on the GPU than on the DSPC6678.
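The two reported figures, a saving of 1.18 s and a speedup of about 1.9x, together imply approximate total run times. The sketch below only solves those two relations; the resulting values are rough estimates derived from rounded numbers, not measurements taken from Table 2, and the function name `implied_times` is hypothetical.

```python
def implied_times(saved_seconds, speedup):
    """Solve t_dsp - t_gpu = saved_seconds and t_dsp / t_gpu = speedup."""
    if speedup <= 1.0:
        raise ValueError("speedup must exceed 1 for a positive time saving")
    t_gpu = saved_seconds / (speedup - 1.0)
    t_dsp = t_gpu + saved_seconds
    return t_dsp, t_gpu


# Inputs are the rounded figures quoted in the text.
t_dsp, t_gpu = implied_times(saved_seconds=1.18, speedup=1.9)
print(f"implied DSP time ~ {t_dsp:.2f} s, implied GPU time ~ {t_gpu:.2f} s")
```

Because the 1.9x ratio is itself approximate, a small change in it shifts both estimates noticeably, so they indicate scale only.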

In addition, the GPU is 1x–5x faster than the DSP in each main part of the RD algorithm. The gain is largest in range (distance) pulse compression, where the GPU achieves a 4.98x speedup over the DSP. Table 3 breaks down the main steps of pulse compression in detail.

Table 3. Times of the distance pulse compression

As Table 3 shows, the FFT and IFFT are the main processes of pulse compression in the range direction. As analyzed in the FFT evaluation above, the GPU substantially outperforms the DSP in FFT computation, and the RD algorithm results strongly support this conclusion.

4 Conclusion

As SAR real-time imaging systems develop toward higher precision, stronger real-time performance and higher data throughput, they place ever higher requirements on the hardware architecture as well as on algorithm optimization. Choosing the right processor to handle large amounts of data is therefore increasingly important. This paper takes the multi-core GPU as its main research object and implements the RD_SAR algorithm on a CPU + GPU platform using the CUDA computing model. The results demonstrate the GPU's powerful computing capability in high-performance applications and show that it accelerates the RD algorithm significantly. The GPU can thus better satisfy the real-time requirements of radar signal processing.

Acknowledgments

This work was supported in part by the Chang Jiang Scholars Program under Grant T2012122 and in part by the Hundred Leading Talent Project of Beijing Science and Technology under Grant Z141101001514005.

Author information

Authors and Affiliations

Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology, Beijing, 100081, China
Yuwei Wang, Shanqing Hu & Jiacheng Yu
Tsinghua University, Beijing, 100081, China
Xingming Li

Corresponding author

Correspondence to Xingming Li.

Editor information

Editors and Affiliations

Beijing University of Posts and Telecommunications, Beijing, China
Songlin Sun
Beijing University of Posts and Telecommunications, Beijing, China
Na Chen
Beijing University of Posts and Telecommunications, Beijing, China
Tao Tian

Cite this paper

Wang, Y., Li, X., Hu, S., Yu, J. (2018). The Research of SAR Processing Performance Based on Multi-core GPU. In: Sun, S., Chen, N., Tian, T. (eds) Signal and Information Processing, Networking and Computers. ICSINC 2017. Lecture Notes in Electrical Engineering, vol 473. Springer, Singapore. https://doi.org/10.1007/978-981-10-7521-6_19

DOI: https://doi.org/10.1007/978-981-10-7521-6_19
Published: 19 December 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7520-9
Online ISBN: 978-981-10-7521-6
eBook Packages: Engineering, Engineering (R0)
