1 Introduction

Radar echoes returned from targets are usually submerged in noise, clutter or jamming signals. Radar target detection (RTD) is a fundamental yet critical process for differentiating and measuring targets against increasingly complex backgrounds. In traditional radar signal processing, effective methods such as matched filtering, Doppler processing and clutter suppression are widely adopted to improve the signal-to-noise ratio (SNR) and signal-to-clutter ratio (SCR) [1], and constant false alarm rate (CFAR) detection is a well-studied and widely used hypothesis-testing approach for detection in demanding noisy environments [2]. However, given the growing diversity of target types and the complexity of detection environments, selecting an optimal parameter set for traditional methods is extremely challenging; thus, developing reliable and robust methods for RTD appears inevitable [3].

As a subset of machine learning, deep learning methods attempt to build models and extract features automatically from large, complex datasets. Deep learning has brought dramatic breakthroughs in various domains, including image processing, speech recognition and natural language processing [4,5,6,7]. As one of the most important deep neural network (DNN) technologies, the convolutional neural network (CNN) is currently widely used in computer vision tasks [8], such as semantic segmentation [9], image classification [10] and object detection [11]. In the field of remote sensing image segmentation, Wang et al. [12] created an 11-layer CNN for image segmentation of polarimetric synthetic aperture radar. Chen et al. [13] proposed an improved semantic segmentation network that adopts dilated convolution, a fully connected fusion path and a pre-trained encoder for semantic segmentation of high-resolution remote sensing images. Zhang et al. [14] introduced a neural architecture search method for semantic segmentation of high-resolution remote sensing images. In the medical field, an effective method for skin lesion segmentation is presented in [15]. For image classification, Öztürk et al. [16] presented an effective CNN-based classification method for gastrointestinal tract images, and Zhang et al. [17] designed a 13-layer CNN for fruit category classification. For object detection, Zhang et al. [18] proposed a two-stream contextual CNN that adaptively captures body-part information for face detection, and Zhu et al. [19] proposed a contextual multi-scale region-based CNN for face detection.

1.1 Related work

Thanks to the outstanding success achieved by AlexNet [20], VGGNet [21], GoogLeNet [22], ResNet [23], etc., DNN architectures have spread to almost all fields, including RTD. Over the past decades, many artificial neural network (ANN)-based and DNN-based approaches have been proposed to solve the problem of RTD in various complex scenarios. To differentiate targets from noise, Gandhi et al. [24] first employed ANNs to detect signals in non-Gaussian noise. Amores et al. [25] deduced that the ANN approach could improve radar detector robustness. Rohman et al. [26] presented an adaptive ANN-CFAR detector to improve RTD performance in non-homogeneous noise. Akhtar et al. [27] presented an ANN-CFAR detector for fluctuating target detection in a noisy background. Compared with a noise background, detecting targets within clutter is a more common but challenging task. Cheikh et al. [28] assessed the problem of RTD using different ANN architectures in K-distributed clutter. Akhtar et al. [29] proposed a more general training strategy to extract fluctuating targets in K-distributed clutter. Pan et al. [30] proposed a deep CNN approach for marine small target detection against a strong sea clutter background. The above references provided solutions to a binary classification problem, namely deciding whether a target is present or absent.

In addition, researchers have applied multiple types of inputs to deep learning models for RTD, such as pulse–range maps, range–Doppler spectra and time–frequency images. Pan et al. [30] adopted pulse–range images as inputs of a deep CNN model for RTD. Wang et al. [31] designed a CNN-based target detector on the range–Doppler spectrum and compared the proposed method against a traditional CFAR detector. Brodeski et al. [32] introduced a CNN-based architecture for automotive radar detection that uses range–Doppler data to detect and localize targets. Gustavo et al. [33] implemented a time–frequency block by the Wigner–Ville distribution (WVD) and designed a WVD-CNN detector for RTD. Su et al. [34] adopted the short-time Fourier transform (STFT) to transform IPIX measured data and designed a CNN-based method for maritime target detection.

1.2 Motivation and contributions

At present, the main difficulties in RTD lie in high-resolution target feature extraction, environmental clutter suppression, anti-jamming measures (especially against strong active jamming) and “low-slow-small” target detection [2]. Traditional CFAR-based detection methods are grounded in statistical theory, which models the target or environment as a stochastic process. However, given the growing diversity of target and environment models, selecting an optimal parameter set and declaring potential objects in complex environments with a high probability of detection (\(P_{D}\)) but a low false alarm rate (\(P_{fa}\)) remains an extremely challenging task.

The above studies on RTD used various forms of input that can be regarded as “two-dimensional images,” all of which were preprocessed by radar echo signal processing methods. Some researchers have shown that conventional radar signal processing, serving as preprocessing of the training data, can help to sharpen features and thus improve detection performance. In fact, however, these preprocessing methods, such as pulse compression and Doppler processing, are intrinsically convolution operations, and deep learning models can, in principle, extract such features automatically from radar complex data. In addition, the primary objective of RTD is not only to distinguish whether the received signal under test contains a target echo or corresponds only to noise and clutter, but also to obtain multi-dimensional location and velocity information. Therefore, realizing a complete “end-to-end” learning scheme for multitask RTD is both reasonable and feasible.

In this paper, we propose a novel CNN-based detector for RTD that applies radar echo data directly to locate the target in the multi-dimensional space of range, velocity, azimuth and elevation. The proposed approach eliminates the need for time-consuming radar signal processing and achieves better detection performance than classical radar signal processing methods. However, due to the specificity of RTD tasks, actual radar datasets are currently not widely published or accessible. To overcome the lack of labeled radar complex data, we construct a radar echo dataset with multiple SNRs. A limitation of our method is that the model is evaluated only on this simulated dataset and hence may not perform as well on real data. In the future, we intend to collect realistic radar data to further evaluate and improve our model.

The main contributions of our work are summarized as follows:

  • We propose a Feature Extraction Net which exploits both time and frequency information and extracts range–Doppler information as feature maps from raw radar echo data.

  • A multitask target detection method is designed, in which an RD Detection Net measures the range and radial velocity of the target and an Angle Detection Net predicts the azimuth and elevation.

  • In order to perform velocity measurement and angle estimation, we construct a three-channel sum–difference pattern radar echo dataset with multiple SNRs for training and testing.

The rest of this paper is organized as follows: Sect. 2 introduces the traditional signal processing chain for RTD, which is used for performance comparison. In Sect. 3, the proposed CNN-based approach for RTD is described. In Sect. 4, the performance of the proposed model is evaluated on simulated data. Section 5 discusses the challenges and research opportunities. Finally, the conclusion is presented in Sect. 6.

2 Traditional signal processing method for RTD

In a typical pulsed coherent radar system, the echoes received from multiple coherent pulses are processed by a general radar signal processing chain that includes matched filtering, clutter suppression, Doppler processing and CFAR detection. Figure 1 shows a traditional radar signal processing procedure for radar detection.

Fig. 1
figure 1

Typical signal processing diagram of a radar system

After the emission of each pulse, the incoming echoes are sampled at a given rate and pulse compression is performed through a matched filter operation over the fast-time domain, which yields a narrow pulse width and a high-resolution range profile. Moving target indication (MTI) and moving target detection (MTD) are effective approaches for clutter suppression. Doppler processing is applied over multiple pulses by taking the fast Fourier transform (FFT) over slow time at each range cell, from which the range–Doppler spectrum is obtained. Next, the cells in the range–Doppler spectrum whose energy exceeds the detection threshold are declared by the CFAR detector. Then, the position and velocity information can be estimated.

In target detection, the Neyman–Pearson criterion is widely used for decision making; it maximizes \(P_{D}\) for a given \(P_{fa}\) [35]. In a general CFAR detector, the square-law detected range samples are sent serially (cell by cell) into a shift register of length \(2n + 1\). The statistic \(z\), which is proportional to an estimate of the total noise power, is formed from the contents of the \(2n\) reference cells surrounding the cell under test (CUT), whose content is \(y\). A target is declared present if the CUT energy exceeds the detection threshold \(\alpha z\):

$$ \begin{cases} y > \alpha z, & \text{target present} \\ y \le \alpha z, & \text{target absent} \end{cases} $$
(1)

Here, a constant scale factor \(\alpha\) is used to achieve the desired constant \(P_{fa}\) for a given window of size \(2n\) when the total background noise is homogeneous. The processor configuration varies with the CFAR scheme, and Fig. 2 presents the block diagrams of several typical CFAR processors. Take the cell-averaging (CA-)CFAR detector as an example: as shown in Fig. 2, its reference window consists of \(2n\) cells surrounding the CUT, and the background level is estimated from the reference cells as

$$ z_{CA} = \frac{1}{2}\left( \sum\limits_{i = 1}^{n} x_{i} + \sum\limits_{i = n + 1}^{2n} x_{i} \right) $$
(2)
Fig. 2
figure 2

Block diagram of several typical CFAR processors

This value \(z_{CA}\) is scaled by the factor \(\alpha\) to yield an adaptive threshold \(\alpha z_{CA}\), which is then compared with the CUT value \(y\) to decide whether a target is present or not.
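
For concreteness, the following is a minimal one-dimensional CA-CFAR sketch of Eqs. (1)–(2) in Python; the window length, guard-cell count and scale factor are illustrative placeholders rather than the settings used later in the paper. In practice, \(\alpha\) is chosen analytically or numerically so that the desired \(P_{fa}\) is achieved for the given reference-window size.

```python
import numpy as np

def ca_cfar(x, n=8, guard=2, alpha=5.0):
    """x: square-law detected range samples; returns a boolean detection map."""
    N = len(x)
    det = np.zeros(N, dtype=bool)
    for cut in range(n + guard, N - n - guard):
        lead = x[cut - guard - n: cut - guard]          # n leading reference cells
        lag = x[cut + guard + 1: cut + guard + 1 + n]   # n lagging reference cells
        z_ca = 0.5 * (lead.sum() + lag.sum())           # Eq. (2)
        det[cut] = x[cut] > alpha * z_ca                # Eq. (1)
    return det

# Example: exponentially distributed (noise-like) samples with one strong cell injected
x = np.random.exponential(size=200)
x[100] += 50.0
print(np.flatnonzero(ca_cfar(x)))   # typically prints [100]
```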

Radar is required to operate in a wide variety of scenes. Selecting a single optimal parameter set is extremely important yet difficult, because the predefined parameters, such as the threshold scale factor, the detection margin and the sizes of the reference and guard windows, determine the detection performance. Therefore, developing a data-driven deep learning approach for target detection appears both inevitable and reasonable.

3 The proposed CNN-based method for RTD

In this section, we present a CNN-based RTD scheme that works directly on raw radar data to detect targets. The radar echo signal is a one-dimensional discrete complex sequence. First, we construct the radar data cube used as the network input by preprocessing this one-dimensional sequence, which better reflects the characteristics of the samples. Then, a novel CNN-based model for RTD is designed to detect and locate the target in the multi-dimensional space of range, radial velocity, azimuth and elevation. The block diagram of the procedure is depicted in Fig. 3, and the stages are described in the following subsections.

Fig. 3
figure 3

Block diagram of the CNN-based target detector

3.1 Processing of input data for RTD

Radar emits multiple pulses in a coherent processing interval (CPI), and the received echoes are integrated coherently or incoherently and can be processed as a one-dimensional discrete sequence. Many processing methods are commonly applied to such discrete time sequences, and they determine the diverse input forms for the detection network. However, compared with other input forms, using the radar complex data (real and imaginary parts) directly makes better use of the original information contained in the echo signal.

We use the three-channel sum–difference pattern to construct the radar echo dataset, which enables angle measurement in both the azimuth and elevation planes. The values of the sampling points are usually stored in matrix form. Figure 4 shows the raw radar echo cube, which is collected during the simulation process together with the ground-truth label of the detection: range, velocity, azimuth angle and elevation angle. The dimensions of the radar echo cube are \(Ns \times Np \times Nc\), where \(Ns\) denotes the number of sampling points within a range gate, \(Np\) the number of pulses in a CPI and \(Nc\) the number of received channels. Since the echo data are complex-valued, each sample is split into a real part and an imaginary part before being fed to the network, and thus \(Nc = 6\).

Fig. 4
figure 4

Raw radar echo cube as the input of network
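
As an illustration, a cube with the layout of Fig. 4 could be assembled as in the following minimal NumPy sketch; the array names and the random placeholder data are ours, not part of the dataset.

```python
import numpy as np

Ns, Np = 256, 32   # fast-time samples per range gate, pulses per CPI

# Placeholder complex returns of the three sum-difference channels
sum_ch = np.random.randn(Ns, Np) + 1j * np.random.randn(Ns, Np)  # sum beam
daz_ch = np.random.randn(Ns, Np) + 1j * np.random.randn(Ns, Np)  # azimuth difference beam
del_ch = np.random.randn(Ns, Np) + 1j * np.random.randn(Ns, Np)  # elevation difference beam

# Split each complex channel into real and imaginary parts, giving Nc = 6
cube = np.stack([sum_ch.real, sum_ch.imag,
                 daz_ch.real, daz_ch.imag,
                 del_ch.real, del_ch.imag], axis=-1)
print(cube.shape)   # (256, 32, 6) = Ns x Np x Nc
```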

We obtain suitable two-dimensional samples from the input cube of radar echo data for RTD. In the transverse direction (fast-time domain), the sampling points are determined by the pulse repetition frequency and the sampling frequency, while in the longitudinal direction (slow-time domain), the pulses in a CPI are listed in order. In classical radar signal processing, echo data in the fast-time domain are used to calculate range, and those in the slow-time domain, which represent Doppler bins, are used to calculate velocity. Similarly, we slide convolution kernels of preset size to extract information in the time and frequency domains from rectangular samples, as shown in Fig. 5. The rectangular samples need not be square, and the sliding step and sample size must be chosen according to the actual data.

Fig. 5
figure 5

Direct input in two-dimensional form

Processing raw radar echo data as one-dimensional input for classification or regression greatly reduces the computational complexity of the model, but it also breaks the spatial–temporal correlation and may even destroy texture information of targets. Compared with one-dimensional data, two-dimensional data can exploit temporal and spatial correlation simultaneously, thereby improving discrimination ability and detection accuracy. CNNs are good at automatically extracting features from large-scale raw data and at classifying two-dimensional grid-structured data. It is also necessary to ensure that a sufficient amount of data and sufficient features are available for detection.

3.2 Design of the CNN-based target detector

The CNN-based detector consists of three parts: the Feature Extraction Net, the RD Detection Net and the Angle Detection Net. Figure 6 shows the proposed structure of the CNN-based detector, which aims to detect targets in multi-dimensional space and obtain range, velocity, azimuth angle and elevation angle from raw radar data. The detection steps are as follows:

Fig. 6
figure 6

Architecture of the proposed CNN detector for RTD

Feature Extraction Net: Extracts range–Doppler information as feature maps from the raw radar echo data; the detection of range, velocity and angles in the subsequent networks relies on these feature maps. The Feature Extraction Net exploits both time and frequency information from the raw radar signals with a specially designed CNN. The dimensions of the input are \(Ns \times Np \times Nc\), while those of the output are \(R \times D \times Ch\). This part consists of multiple convolutional layers. A convolutional layer is computed as:

$$ y_{i}^{{k + 1}} = f\left( {\sum\limits_{j} {x_{j}^{k} \otimes w_{{ij}}^{k} + b_{i}^{k} } } \right) $$
(3)

where \(x\) represents the input data of the convolutional layer, \(w\) represents the weights of convolution operator, \(b\) denotes the bias and \(f\) denotes the activation function.
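
In PyTorch terms, one such layer is simply a two-dimensional convolution over the input channels followed by the activation; the channel counts below are placeholders.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 6, 256, 32)   # one radar cube in NCHW layout: (batch, Nc, Ns, Np)
conv = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=3, padding=1)
y = torch.relu(conv(x))          # y_i^{k+1} = f(sum_j x_j^k * w_ij^k + b_i^k), cf. Eq. (3)
print(y.shape)                   # torch.Size([1, 16, 256, 32])
```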

The pooling layer reduces the size of the input data by aggregating values over a selected neighborhood. When the dimensions of the input data \(x\) are \(m \times n\), the kernel size is \(p \times q\) and the stride is \(t\), the max pooling operation is computed as:

$$ P_{ij} = \max_{\substack{r = 0,1,\ldots,p - 1 \\ s = 0,1,\ldots,q - 1}} \left( x_{i \cdot t + r,\; j \cdot t + s} \right), \qquad i \le (m - p)/t, \quad j \le (n - q)/t. $$
(4)

The ReLU layer is calculated as:

$$ \mathrm{R}(x) = \max (0,x) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0 \end{cases} $$
(5)

RD Detection Net: Detects targets and predicts the range and velocity of all targets in the range–Doppler domain. The input of the RD Detection Net is the range–Doppler–channel data (\(R \times D \times Ch\)), while the outputs are the range and radial velocity obtained by regression, together with a list of detections in the range–Doppler feature map and their associated classes. A global feature vector from the middle layer of the RD Detection Net is extracted for the Angle Detection Net. In addition to convolutional layers, upsampling layers are used in this part in order to regain the same size as the original input.

Angle Detection Net: Predicts the azimuth angle and elevation angle of each target detected by the RD Detection Net. The input of the Angle Detection Net is a region cropped from the range–Doppler feature maps, centered at the location provided by the RD Detection Net. The global feature vector is concatenated with the convolutional output vector, which helps to obtain the azimuth and elevation of each detection. The outputs of the Angle Detection Net are the azimuth and elevation angles obtained by regression.
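
The crop-and-concatenate step can be sketched as follows; this is a simplified illustration in which a single linear layer stands in for the convolutional filtering of the crop, and all sizes other than the 3 × 3 crop and the 4096-dimensional vectors used later are placeholders.

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 64, 128, 32)    # (batch, Ch, R, D) range-Doppler feature map (placeholder sizes)
global_vec = torch.randn(1, 4096)     # global feature vector from the RD Detection Net

r, d = 40, 10                         # detection location reported by the RD Detection Net
crop = feat[:, :, r - 1:r + 2, d - 1:d + 2]   # 3 x 3 x Ch crop centered on the detection

# Simplified local encoder (the paper filters the crop with convolutional layers instead)
local_encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 3 * 3, 4096), nn.ReLU())
local_vec = local_encoder(crop)               # 1 x 4096 local descriptor

shared = nn.Sequential(nn.Linear(8192, 1024), nn.ReLU(), nn.Linear(1024, 256), nn.ReLU())
az_head, el_head = nn.Linear(256, 1), nn.Linear(256, 1)   # two independent output layers

h = shared(torch.cat([local_vec, global_vec], dim=1))     # concatenate local and global vectors
azimuth, elevation = az_head(h), el_head(h)
print(azimuth.shape, elevation.shape)                     # torch.Size([1, 1]) torch.Size([1, 1])
```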

Configurations of the CNN-based radar target detector are as follows:

  1. The Feature Extraction Net consists of multiple computational layers. Six convolutional layers, six ReLU layers, three max pooling layers and batch normalization after each convolutional layer form a feature extractor that automatically extracts feature maps from the input radar data (a minimal sketch follows Table 1). The size of the echo data sample is determined by the size of the sliding window. The size of each convolutional filter is 3 × 3, and each max pooling layer has a 2 × 2 spatial pooling area with a stride of 2.

  2. The range–Doppler feature maps are then filtered by convolutional layers with 3 × 3 filters, and a global feature vector of size 4096 is extracted from the middle layer of the RD Detection Net. The global feature vector is then passed through two fully connected layers to obtain range and velocity. The latter part of the RD Detection Net consists of convolutional layers, ReLU layers and upsampling layers in order to regain an \(R \times D \times {\text{cls}}\) output, where cls is the number of classes. Note that a 2-class network with target and non-target outputs is considered in this task; therefore, the output layer is equipped with a soft-max activation function.

  3. For each detection, the Angle Detection Net crops an \(N \times M \times Ch\) patch from the input feature map, centered at the location provided by the RD Detection Net. The crop is then filtered by convolutional layers, yielding a \(1 \times 1 \times 4096\) output. This output vector is concatenated with the \(1 \times 1 \times 4096\) global feature vector. The concatenated vector is then passed through two fully connected layers, after which the azimuth angle and elevation angle are predicted by two independent fully connected layers.

Specifications of the layers, including kernel size, number of kernels, pooling size, etc., are summarized in Table 1.

Table 1 Specifications of layers in the proposed CNN
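
To make the configuration in item 1 concrete, the following is a minimal PyTorch sketch of the Feature Extraction Net; the channel widths are illustrative placeholders, and the actual kernel counts are those listed in Table 1.

```python
import torch
import torch.nn as nn

def conv_bn_relu(c_in, c_out):
    """3 x 3 convolution followed by batch normalization and ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True))

# Six conv/BN/ReLU blocks with three 2 x 2 max pooling layers (stride 2) interleaved
feature_extraction_net = nn.Sequential(
    conv_bn_relu(6, 16),  conv_bn_relu(16, 16), nn.MaxPool2d(2, 2),
    conv_bn_relu(16, 32), conv_bn_relu(32, 32), nn.MaxPool2d(2, 2),
    conv_bn_relu(32, 64), conv_bn_relu(64, 64), nn.MaxPool2d(2, 2))

x = torch.randn(1, 6, 256, 32)        # Ns x Np x Nc cube in NCHW layout
feature_maps = feature_extraction_net(x)
print(feature_maps.shape)             # torch.Size([1, 64, 32, 4])
```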

3.3 Loss functions

The task of RTD addressed in this work is a multitask regression problem, and the L1-loss and the L2-loss are widely used for regression. The L1-loss is the mean absolute error between the predicted and true values; its gradient is stable, but it is non-differentiable at zero. The L1-loss is defined as:

$$ L_{1} (x,y) = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {y_{i} - f(x_{i} )} \right|}. $$
(6)

The L2-loss is the mean square error between the predictions and the true values; it is continuous and smooth and converges quickly with an analytical solution. However, it is more sensitive to outliers than the L1-loss, which can lead to exploding gradients. The L2-loss function is given as follows:

$$ L_{2} (x,y) = \frac{1}{n}\sum\limits_{i = 1}^{n} {(y_{i} - f(x_{i} ))^{2} }. $$
(7)

The smooth L1-loss combines the advantages of the L1-loss and the L2-loss and is more robust to outliers; it is defined as:

$$ \mathrm{Smooth}\;L_{1}(x,y) = \frac{1}{n}\sum\limits_{i = 1}^{n} \begin{cases} 0.5\,(y_{i} - f(x_{i}))^{2}, & \left| {y_{i} - f(x_{i})} \right| < 1 \\ \left| {y_{i} - f(x_{i})} \right| - 0.5, & \text{otherwise} \end{cases} $$
(8)

To address the multitask RTD problem, we train the proposed network using a weighted smooth L1-loss. Summing the individual terms, the cost function becomes

$$ Loss = \lambda_{1} Loss_{RD} + \lambda_{2} Loss_{Az} + \lambda_{3} Loss_{El} $$
(9)

where \(Loss_{RD}\) denotes the loss of the RD Detection Net, \(Loss_{Az}\) denotes the azimuth angle predicting loss, \(Loss_{El}\) denotes the elevation angle predicting loss and \(\lambda_{1}\), \(\lambda_{2}\), \(\lambda_{3}\) are the weights of different losses. During supervised training, the network tries to minimize the reconstruction error.
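
A minimal PyTorch sketch of Eq. (9) with smooth L1 terms (Eq. (8)) is given below; the classification component of the RD Detection Net's loss is omitted for brevity, the default weights are the values selected later (\(\lambda_{1} = 0.5\), \(\lambda_{2} = \lambda_{3} = 1\)) and the tensor shapes are placeholders.

```python
import torch
import torch.nn as nn

smooth_l1 = nn.SmoothL1Loss()   # with its default beta = 1 this matches Eq. (8)

def multitask_loss(rd_pred, rd_true, az_pred, az_true, el_pred, el_true,
                   lam1=0.5, lam2=1.0, lam3=1.0):
    loss_rd = smooth_l1(rd_pred, rd_true)   # range/velocity regression term
    loss_az = smooth_l1(az_pred, az_true)   # azimuth regression term
    loss_el = smooth_l1(el_pred, el_true)   # elevation regression term
    return lam1 * loss_rd + lam2 * loss_az + lam3 * loss_el   # Eq. (9)

# Dummy batch of 4 detections
rd_pred = torch.randn(4, 2, requires_grad=True)
az_pred = torch.randn(4, 1, requires_grad=True)
el_pred = torch.randn(4, 1, requires_grad=True)
loss = multitask_loss(rd_pred, torch.randn(4, 2),
                      az_pred, torch.randn(4, 1),
                      el_pred, torch.randn(4, 1))
loss.backward()
print(float(loss))
```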

4 Experimental results and performance analysis

In this section, we demonstrate the effectiveness of the CNN-based target detector. A labeled radar echo dataset is constructed for training and testing, and the model is then evaluated under different SNRs and compared with traditional radar signal processing approaches. The experimental procedures are described below.

4.1 Dataset preparation and implementation details

The effectiveness of deep learning models depends heavily on the training process and the availability of large quantities of data. However, due to the specificity of RTD tasks, real-world radar data for RTD are currently not widely accessible. In order to evaluate the proposed target detector, a large set of labeled radar complex data is generated, containing the radar sensor responses to a known point target located at a variety of positions with different velocities. The simulation of the dataset is summarized as follows.

  1. We consider a typical pulsed radar system in which the transmitted waveform is emitted at a regular interval. The radar carrier frequency is 10 GHz. The system generates received signals of 32 coherent pulses in a CPI, consisting of target echoes and noise. The transmitted pulses are chirp signals, the pulse repetition frequency is 10 kHz and the bandwidth of the chirp signal is 5 MHz. The sampling frequency of the received signals is 10 MHz, and the beam width is 3°. The target is assumed to fluctuate slowly according to the Swerling I model. The noise is zero-mean complex white Gaussian noise with independent samples.

  2. In total, the simulated dataset contains 140,000 data frames. The dimensions of each raw radar echo frame are 256 × 32 × 6. Radar echoes with different SNRs are generated by setting different noise power levels; we use SNRs of − 6 dB, − 2 dB, 0 dB, 2 dB, 6 dB, 10 dB and 13 dB. A total of 99,225 examples are used to train the CNN detector, 30,775 for validation and 10,000 for testing (a simplified generation sketch is given after this list).
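
The sketch below illustrates, for a single (sum) channel, how one labeled echo frame could be generated under the parameters above; the pulse width, the Swerling I amplitude draw and the SNR scaling are our simplifying assumptions, and the two difference channels are omitted.

```python
import numpy as np

C = 3e8
FC, PRF, B, FS = 10e9, 10e3, 5e6, 10e6     # carrier, PRF, chirp bandwidth, sampling rate
NP, NS = 32, 256                           # pulses per CPI, fast-time samples per pulse
TAU = 10e-6                                # assumed pulse width (not stated in the text)
WAVELENGTH = C / FC

def lfm_pulse():
    """Baseband linear-FM (chirp) pulse sampled at FS."""
    t = np.arange(int(TAU * FS)) / FS
    return np.exp(1j * np.pi * (B / TAU) * t ** 2)

def echo_frame(rng_m, vel_mps, snr_db):
    """One NS x NP complex sum-channel frame: a single point target plus noise."""
    pulse = lfm_pulse()
    frame = np.zeros((NS, NP), dtype=complex)
    start = int(round(2 * rng_m / C * FS))        # fast-time bin of the target
    fd = 2 * vel_mps / WAVELENGTH                 # Doppler frequency (Hz)
    amp = np.sqrt(np.random.exponential())        # Swerling I: exponential power, fixed over the CPI
    for p in range(NP):
        doppler = np.exp(1j * 2 * np.pi * fd * p / PRF)
        seg = pulse[:max(0, min(len(pulse), NS - start))]
        frame[start:start + len(seg), p] += amp * doppler * seg
    noise_pow = amp ** 2 / 10 ** (snr_db / 10)    # per-sample SNR before pulse compression
    noise = np.sqrt(noise_pow / 2) * (np.random.randn(NS, NP) + 1j * np.random.randn(NS, NP))
    return frame + noise

frame = echo_frame(rng_m=1500.0, vel_mps=30.0, snr_db=6.0)
print(frame.shape)   # (256, 32)
```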

Since the parameters of the CNN-based detector must be trained before it is applied to detect targets, we optimized the following hyperparameters: kernel size, number of kernels per layer, network depth, learning rate and dropout rate. We also tried different combinations of filter size and filter number in each layer to obtain the best accuracy. The network is trained with back-propagation and the Adam optimizer with an initial learning rate of 0.0001. In our experiments, the dropout rate is 0.4 and the batch size is 100 frames. For angle measurement, the ground truth of range and Doppler is used to obtain the 3 × 3 center crop region. Various values of \(\lambda_{1}\), \(\lambda_{2}\), \(\lambda_{3}\) in Eq. (9) were explored, and \(\lambda_{1} = 0.5\), \(\lambda_{2} = 1\), \(\lambda_{3} = 1\) were finally selected. PyTorch is used for the model implementation.
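
A short training step consistent with these settings is sketched below; `DummyDetector` is a stand-in for the full three-part network, and only the optimizer, learning rate, dropout rate, loss weights and batch size mirror the reported configuration.

```python
import torch
import torch.nn as nn

class DummyDetector(nn.Module):
    """Toy stand-in mapping a (B, 6, 256, 32) cube to range/velocity, azimuth and elevation."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(6, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Dropout(0.4))
        self.rd_head = nn.Linear(8, 2)   # range, velocity
        self.az_head = nn.Linear(8, 1)   # azimuth
        self.el_head = nn.Linear(8, 1)   # elevation

    def forward(self, x):
        h = self.backbone(x)
        return self.rd_head(h), self.az_head(h), self.el_head(h)

detector = DummyDetector()
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-4)
crit = nn.SmoothL1Loss()

# One illustrative optimization step on a random batch of 100 frames
cube = torch.randn(100, 6, 256, 32)
rd_t, az_t, el_t = torch.randn(100, 2), torch.randn(100, 1), torch.randn(100, 1)
rd_p, az_p, el_p = detector(cube)
loss = 0.5 * crit(rd_p, rd_t) + crit(az_p, az_t) + crit(el_p, el_t)   # Eq. (9) weights
optimizer.zero_grad()
loss.backward()
optimizer.step()
```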

4.2 Experimental results

Measuring error and detection accuracy are used to evaluate our proposed detection model; the measuring error of each item is defined as follows:

$$ \begin{aligned} R_{\text{error}} &= |R_{de} - R_{gt}| \\ V_{\text{error}} &= |V_{de} - V_{gt}| \\ Az_{\text{error}} &= |Az_{de} - Az_{gt}| \\ El_{\text{error}} &= |El_{de} - El_{gt}| \end{aligned} $$
(10)

where \(R_{de}\), \(V_{de}\), \(Az_{de}\) and \(El_{de}\) denote the range, velocity, azimuth and elevation angle of the obtained detection and \(R_{gt}\), \(V_{gt}\), \(Az_{gt}\), \(El_{gt}\) are the corresponding ground truth. The accuracy threshold of each detected item is related to the parametric resolution and the SNR and can be expressed as \(\Delta/\sqrt{SNR}\), where \(\Delta\) represents the parametric resolution of the item. In our setting, for example, when SNR = 10 dB, \(\Delta_{R}/\sqrt{SNR} = 4.743\,\mathrm{m}\), \(\Delta_{V}/\sqrt{SNR} = 1.482\,\mathrm{m/s}\), \(\Delta_{Az}/\sqrt{SNR} = 0.9^\circ\) and \(\Delta_{El}/\sqrt{SNR} = 0.9^\circ\).

During the training process, the total loss of the CNN-based detector and the detection performance of each item change with the number of iterations, as shown in Fig. 7. The number of training epochs is 100.

Fig. 7
figure 7

The loss and detection accuracy of CNN-based detector in each epoch

Regarding the measuring error of each item, Table 2 shows the average detection errors of range, velocity, azimuth and elevation of the proposed approach under different SNRs. According to Table 2, the detection error drops as the SNR increases. The average errors of range and velocity decrease slightly with increasing SNR, and the overall trend levels off once the SNR exceeds 6 dB. The average errors of azimuth and elevation also drop as the SNR increases, but only on a small scale, and the average errors of range and velocity are larger than those of azimuth and elevation.

Table 2 Average measuring error of CNN-based method

In the conventional radar processing method, the received echoes are processed by matched filtering, Doppler processing, CA-CFAR detection, sum–difference angle measurement, etc. For the CA-CFAR detector, there are six reference cells and four guard cells on each side in the range domain, and five reference cells and three guard cells on each side in the Doppler domain. The threshold of the CFAR comparator changes with the SNR. According to Table 3, the average errors of range and velocity vary only slightly across SNRs, while the average errors of azimuth and elevation drop as the SNR increases.

Table 3 Average measuring error of conventional radar processing method

Figure 8 compares the measuring errors of the CNN-based detector with those of the conventional method. The errors in range and velocity are greatly reduced. Although the angle errors in Fig. 8(c), (d) are higher than those of the classical method at SNR = − 6 dB, both angle errors of the proposed method fall below those of the classical method as the SNR increases.

Fig. 8
figure 8

Comparison of average error under different SNRs

4.3 Performance analysis

A detection is considered valid only when the measuring errors of all four items are below their respective thresholds, which is defined as

$$ T_{d} = \left( {R_{{error}} \le \frac{{\Delta _{R} }}{{\sqrt {SNR} }}} \right)\;\& \;\left( {V_{{error}} \le \frac{{\Delta _{V} }}{{\sqrt {SNR} }}} \right)\;\& \;\left( {Az_{{error}} \le \frac{{\Delta _{{Az}} }}{{\sqrt {SNR} }}} \right)\;\&\;\left( {El_{{error}} \le \frac{{\Delta _{{El}} }}{{\sqrt {SNR} }}} \right) $$
(11)

Accuracy of detection \(D_{a}\) is defined as

$$ D_{a} = \frac{{N_{td} }}{{N_{total} }} $$
(12)

where \(N_{td}\) denotes the number of correctly detected targets and \(N_{total}\) denotes the total number of simulated targets.
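
A short sketch of this evaluation (Eqs. (10)–(12)) is given below; the resolution values and error tuples are illustrative placeholders, and the SNR is assumed to be supplied as a linear ratio.

```python
import numpy as np

def is_valid(errors, resolutions, snr_linear):
    """Eq. (11): all four errors must lie below Delta / sqrt(SNR)."""
    thr = np.asarray(resolutions) / np.sqrt(snr_linear)
    return bool(np.all(np.asarray(errors) <= thr))

def detection_accuracy(error_list, resolutions, snr_linear):
    """Eq. (12): fraction of simulated targets whose detection satisfies Eq. (11)."""
    valid = [is_valid(e, resolutions, snr_linear) for e in error_list]
    return sum(valid) / len(error_list)

# Illustrative resolutions: range [m], velocity [m/s], azimuth [deg], elevation [deg]
deltas = (15.0, 4.69, 3.0, 3.0)
errors = [(3.0, 1.0, 0.5, 0.4), (8.0, 0.5, 0.3, 0.2)]   # two dummy detections
print(detection_accuracy(errors, deltas, snr_linear=10.0))   # 0.5
```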

According to the evaluation criterion presented in Eq. (12), the detection accuracy is obtained and listed in Table 4. We then compare the detection performance of the CNN-based method with that of the classical method under different SNRs, as shown in Fig. 9.

Table 4 Accuracy of our CNN-based and conventional detection methods under different SNRs
Fig. 9
figure 9

Detection accuracy of CNN-based and classical detector under different SNRs

It is evident that our CNN-based detector achieves better detection accuracy than the conventional radar signal processing method. As the SNR rises, the detection accuracy of both methods increases, and the accuracy of the CNN-based detector is higher than that of the conventional one when the SNR is lower than 10 dB. Evidently, the proposed CNN-based detection model is able to extract rich target information from raw radar echo data; it is more robust to white noise and continues to outperform the conventional radar processing method under various SNR conditions.

Detection accuracy is also used to compare the proposed RTD model fairly with other state-of-the-art methods in the literature. Table 5 compares the performance of the proposed CNN-based approach with the conventional CA-CFAR method and with state-of-the-art ANN-based and CNN-based methods on our dataset at SNR = 10 dB. The detection and training procedures of GO-ANN (Akhtar et al.) and DRD (Brodeski et al.) operate on range–Doppler maps generated from multiple raw radar echoes; they achieve 95.2% and 96.7% detection accuracy on our dataset, respectively. Note that the proposed method outperforms both the conventional method and the state-of-the-art ANN-based and CNN-based methods, with a detection accuracy of 98.5%.

Table 5 Comparison of performance of the proposed method with other state-of-the-art methods

4.4 Ablation experiment

To study the need for passing the global feature vector and the class vector to the Angle Detection Net, we perform an ablation experiment. The Angle Detection Net and the RD Detection Net are trained separately with the same input feature maps. The convolutional layers, the fully connected layers and the loss functions remain the same, except that the cropping and concatenation steps are skipped. The network is trained as before using back-propagation and the Adam optimizer with an initial learning rate of 0.0001. The dropout rate is 0.4 and the batch size is 100 frames.

The azimuth and elevation measuring errors of the proposed model and of the separately trained Angle Detection Net are presented in Fig. 10. A clear improvement in azimuth–elevation estimation accuracy can be observed for the proposed method compared with the separate Angle Detection Net. This improvement may be attributed to the additional information carried by the global feature vector from the RD Detection Net.

Fig. 10
figure 10

Accuracy comparison of the proposed net and the separate Angle Detection net

5 Discussion

Many research works in the literature show an evident trend toward a deep integration of traditional radar signal processing methods and deep learning schemes in RTD applications. On the one hand, conventional radar signal processing, such as pulse compression, Doppler processing and the STFT, serves as preprocessing of the training data and can help to improve detection performance. On the other hand, neural networks integrated with radar signal processing methods such as CFAR, used as training strategies, have contributed to performance improvements. Furthermore, based on the results obtained by our proposed method, we argue that deep learning models can provide an end-to-end framework that integrates sensing, processing and decision making.

Although there are some successful application paradigms of deep learning-based methods in the field of RTD, huge challenges remain. One of the bottlenecks of applying deep learning models to RTD is the lack of labeled data; unlike in other application fields, obtaining radar data involves high cost. Although radar modeling is one way to overcome this issue, generating surrogate data is also extremely challenging and computationally demanding, since multiple effects must be considered, such as interference, multipath reflections, reflective surfaces, discrete cells and multiple attenuation. Even if all of the above problems are avoided, the simulation inevitably relies on a mathematical model, which may introduce inaccuracies. To further promote the study of RTD with insufficient real-world radar data, the following ideas could be considered: (1) data augmentation; (2) newer deep learning algorithms, such as generative adversarial networks (GANs) [36], which can mitigate the problem of insufficient training data; (3) other learning-based methods, such as transfer learning [37, 38] and meta learning [39], which could break through the limitation of data insufficiency.

In addition, despite remarkable progress in applying deep learning to RTD, the available literature on ANNs and DNNs for RTD remains relatively sparse and lacks maturity. For example, a common aspect of some of the literature is that the ANNs are of moderate size; the networks proposed in [26, 27] contain only one hidden layer. Moreover, only CNNs are widely used in RTD, even though various other deep learning architectures have been proposed in other fields. Since radar processing systems face increasingly complex and challenging tasks, more varied yet practical and powerful learning-based training strategies are urgently required.

6 Conclusion

In this paper, we proposed a novel CNN-based model for multitask RTD that works directly on raw radar echo data, so the rich original information contained in the echo signal can be fully exploited. The proposed detection method locates the target in the multi-dimensional space of range, radial velocity, azimuth and elevation. The CNN-based detector shows better measuring accuracy for each item than the conventional radar signal processing method when evaluated under different SNRs, and its detection performance also exceeds that of other state-of-the-art methods in the literature. Although this work adopts simulated data for training, we believe that applying deep learning to RTD on radar complex data has promising potential. We also expect that obtaining sufficient training data from real radar systems will make the detection model more convincing. In the future, we will extend and evaluate the CNN-based detector with real-world radar data.