1 Introduction

Object detection, which aims to identify specific objects and their positions in an image, is one of the most widely applied fields of machine vision [1]. In recent years, highly capable object detection models built on deep neural network (DNN) backbones have been introduced; some of the most important are the different versions of YOLO [2] and SSD [3]. These models are called single-shot object detectors because they detect the target objects and determine their positions in a single step. For detecting salient objects, however, such models still need to be robustified; a representative work in this area is that of Liu et al. [4], who presented a robust technique for salient object detection.

Despite all the known advantages of DNNs, it was discovered in 2013 that these networks are vulnerable to adversarial attacks and that images corrupted by such attacks can fool and mislead them [5]. Since then, the robustification of deep neural networks has become one of the most important concerns of researchers in this field [6]. Adversarial attacks are combined with images in the form of small perturbations; although these perturbations and contaminations cannot be detected by the human eye [7], they can mislead DNNs and severely reduce the accuracy of deep-learning-based models. Unfortunately, DNNs report the images they misrecognize with high confidence, as if they were the least erroneous [8].

Since the discovery of this drawback, many efforts have been made to improve the adversarial accuracy of models and to enhance their robustness against adversarial attacks in machine vision tasks. In spite of all these efforts, the adversarial accuracy of models, especially those in the field of object detection, has still not reached an acceptable level [9]. Adversarial accuracy is the accuracy a model attains when it is tested on images perturbed by adversarial attacks [11]. Considering these facts, applying deep learning in various fields, and especially in real-world applications such as autonomous vehicles, remains a challenging task [10]. Another problem reported in the literature is the reduction of the models' clean accuracy: the defenses proposed against adversarial attacks have led to a significant decline in the clean accuracy of object detection models, i.e., their accuracy on input images that are not perturbed by adversarial attacks [12].

Adversarial attacks can be divided into targeted and untargeted classes. Targeted attacks perturb the input images so that, for a defined class of objects, a specific label is produced at the output; in some cases the images are perturbed so that the object detector assigns no label at all. The behavior of a targeted attack is therefore controlled by the attack designer. Untargeted attacks, by contrast, are simply combined with the input images and exert no particular control over the output of the object detectors that examine such contaminated images [13].

Adversarial attacks of different forms can be devised with different techniques and algorithms. To make the results of different studies comparable, a standardized benchmark of targeted and untargeted attacks on object detection models was presented for the first time in [14]. Although relatively few attacks and defenses have been proposed in the field of object detection so far, such a benchmark is necessary for a fair comparison of research results.

Gabor filters are among the most frequently used filters in conventional machine vision. They are based on a sinusoidal plane wave of a specific frequency and orientation, which enables them to extract spatial structures from images [15]. Combining these filters with DNNs for different purposes has become an active research interest: in [16], Gabor filters were combined with deep learning models in classification tasks and improved the robustness of these networks against adversarial attacks, and in [17] they were combined with DNNs to reduce the complexity and increase the learning speed of these networks.
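As a quick illustration of such a filter (the classical, fixed-parameter form only, not the learnable formulation developed in Sect. 3), a Gabor kernel with a chosen orientation and wavelength can be generated and applied with OpenCV; the image file name below is a placeholder:

```python
import cv2
import numpy as np

# A classical Gabor kernel: 31x31 window, Gaussian width 4, orientation pi/4,
# wavelength 10 pixels, aspect ratio 0.5, zero phase offset.
kernel = cv2.getGaborKernel(ksize=(31, 31), sigma=4.0, theta=np.pi / 4,
                            lambd=10.0, gamma=0.5, psi=0.0)

image = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder input image
response = cv2.filter2D(image, cv2.CV_32F, kernel)       # oriented edge response
```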

In this paper, we introduce five object detectors that are robust to adversarial attacks. These robust models are obtained by combining Gabor filters with the backbones of different versions of well-known models (YOLOv3, SSD, and Faster R-CNN) [18]. Each of these robust object detectors then receives adversarial training with perturbed images from the MSCOCO (2017) and PASCAL VOC (2012) datasets; adversarial training means training a network on images that have been perturbed by adversarial attacks [19].

In recent years, newer adversarial attacks based on more novel techniques have also been introduced, such as the Evaporate attack [20]. Attacks of this type, which are categorized as black-box attacks, can successfully mislead detection models without any knowledge of the architecture of the target system. For example, Wang et al. [20] introduced an effective attack that misleads object detection models such as YOLO and FRCNN by combining the Evaporate, Boundary, and Gaussian Noise attacks into a single black-box attack. In this paper, in addition to the 6 adversarial attacks considered, we abbreviate this combined attack as EBG and reexamine all the object detection models on images perturbed by it. Also, Lee and Kolter [21] have presented an adversarial patch for deceiving object detectors; the authors claim that this type of attack is quite effective on object detection models and completely disrupts their detection ability. Some attacks are developed exclusively for a specific image [22], and some are designed to mislead a system on a particular class of images; such attacks have recently attracted the attention of attack developers. Wang et al. [23] have presented a patch aimed at deceiving object detection systems on specific classes of images; at its best, this patch has been able to reduce the accuracy of the detection systems by 81%.

For perturbing the images in this paper, we use 6 of the best-known adversarial attacks in the field of object detection (TOG-vanishing, TOG-fabrication, TOG-mislabeling [14], DAG [24], RAP [25], and UEA [26]). Finally, the results of the considered models on the different datasets and adversarial attacks are obtained and their performances are compared with one another. Some of the attacks used in this paper are new, and there are no reports in the literature on how other defensive techniques perform against them. Therefore, to compare the models presented here with those of other papers, we have also evaluated the performances of other defensive techniques against these attacks; the results of these comparisons are presented at the end of the manuscript.

The rest of this paper is organized as follows: Sect. 2 briefly reviews some of the work carried out on robustifying DNNs against adversarial attacks. Section 3 describes the proposed technique and explains its application to several well-known object detectors. The results of our models are given in Sect. 4 and compared with those of other models. Finally, the conclusion and discussion are presented in Sect. 5. The main contributions of our paper are as follows:

  • In this paper, a novel method based on Gabor filters is presented for robustifying object detectors against adversarial attacks. This approach improves the adversarial accuracy of visual object detection (VOD) models considerably more than the previously considered methods.

  • The proposed method has been implemented on the most famous object detection models (YOLOv3-m, YOLOv3-d, SSD300, SSD512 and FRCNN) and the results have been evaluated and compared extensively on different models.

  • To verify the proper performance of the defense technique presented in this paper, the newest and the most common adversarial attacks (both targeted and untargeted) have been used in this work, and the proposed model has been evaluated by considering 7 different types of attacks.

  • Finally, the proposed method has been compared with the most recent techniques in this field, and it has been demonstrated that its performance is better than that of the other state-of-the-art methods introduced in the literature.

2 Related works

Numerous research works have been conducted on the robustification of DNNs against adversarial attacks in different tasks, most of them in the classification field. To make DNNs robust in classification tasks, Gabor filters have been combined with several well-known architectures, including AlexNet and VGG16. Adversarial training has been employed in [26,27,28,29,30] to robustify DNNs in classification tasks, but this technique is not sufficiently effective against strong attacks [9]. Denoising autoencoders have been used in [29] to deal with adversarial attacks; the authors of that paper claim that a denoising autoencoder can boost network robustness against such attacks, but combining a denoising autoencoder with the main network can create further problems for that network [9]. A combination of gradient regularization and DNNs has been used in [30] to robustify deep learning models against adversarial attacks; although the adversarial accuracy increases with this approach, the clean accuracy of the model diminishes, which is not desirable.

Some researchers have also devised new attacks in other tasks and robustified DNNs against them. For example, a spatial-aware online incremental attack technique has been employed in [31] to create online attacks, which can pose a serious challenge to object tracking. Various defense strategies in the field of semantic segmentation have been explored in [32]; that work revealed that the methods used in classification tasks cannot be applied effectively to network robustification in semantic segmentation.

Most of the efforts to robustify object detection models are based on adversarial training, and less attention has been paid to modifying the models themselves and their backbones [34]. Given the similarity between the backbones of object detection models and well-known classification architectures, techniques for robustifying object detectors could plausibly be devised by drawing on the classification literature and by modifying network architectures to improve their robustness.

A multitask method for model training as well as various techniques for the adversarial training of models have been used in [33]. The resulting model has been tested on images perturbed by the DAG and RAP attacks.

3 The proposed method

The method proposed in this paper exploits Gabor filter banks in the first layer of well-known object detectors. The backbones of these detectors (e.g., YOLO and SSD) are based on famous classification architectures, and convolutional Gabor layers can be generated by combining the initial filters of such detectors with Gabor filter banks [16]. Gabor filters used to be very common in traditional machine vision applications, where they were placed at the front of machine vision pipelines to detect edges and curves [35]. They were used for the first time in classification tasks in [16] and yielded very promising results. In our approach, we match the Gabor filter banks to the backbone structure of the object detection models and then replace the ordinary convolutional layers in these backbones with convolutional Gabor layers.

It is well known that Gabor filters can extract the spatial features of images quite successfully [17]. Hence, we hypothesize that the extraction of these spatial features could make object detection systems more robust; this hypothesis is supported by the experimental results in the next section. In what follows, we present the Gabor filter equations and the matching of these filters to the backbone structure of object detectors. Our method of generating the filters and using them in the first layer of the DNNs is illustrated in Fig. 1.

Fig. 1 The algorithm proposed in this paper

The Gabor filter is a sinusoidal plane wave modulated by a Gaussian envelope; the real-valued form used here is given in Eq. 1:

$$ G_{\theta }(x', y'; \alpha, \beta, \delta, \eta) = e^{-\alpha^{2}\left(x'^{2} + \beta y'^{2}\right)} \cos\left(\delta x' + \eta\right) $$
(1)

In the above formula, \(x'\) and \(y'\) are the rotated coordinates, defined as

$$ x' = x\cos \theta - y\sin \theta $$
(2)
$$ y' = x\sin \theta + y\cos \theta $$
(3)

In order to build a discrete Gabor filter (Fig. 1), the x and y coordinates are sampled uniformly, and the filter size is determined by the number of samples. To construct a filter on the grid \(\{(x_{i}, y_{i})\}_{i=1}^{k^{2}}\) of dimensions \(k \times k\), the set \(\{\alpha, \beta, \delta, \eta\}\) is inserted into the network as trainable parameters and trained with a conventional learning method. According to [17], the learnable parameters of the Gabor layer are trained exactly like the weights of the network: like any other learnable parameter, they are updated, within a specific range, during the learning process, and their values are adjusted in each successive epoch so as to raise the final adversarial accuracy. These parameters are trained at different rotation angles (\(\theta\)) and, as shown in Fig. 1, they form a set of Gabor filters that is eventually applied to the input images. Equation 4 shows the procedure for constructing \(F_{p}\). The value of \(\theta\) indicates the filter rotation angle; since a Gabor filter detects features such as image edges in the direction of its theta angle, a Gabor filter bank must include different values of theta (from 0 to \(2\pi\)) in order to cover various rotation angles. Therefore, the filter bank used in this paper contains a large number of Gabor filters with different \(\theta\) values, so that the edges and the low-level features of images can be detected at different rotation angles.

$$ F_{p} = \{ G_{\theta_{1}}, G_{\theta_{2}}, G_{\theta_{3}}, \ldots, G_{\theta_{n}} \} $$
(4)

Using this equation, different filters can be made with various \(\theta\) angles in the range \([0, 2\pi]\). The complete filter set \(K\) is eventually constructed by producing several \(F_{p}\) sets for different values of \(p\).
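As an illustration of Eqs. 1-4, the following PyTorch sketch builds a bank of \(k \times k\) Gabor kernels at n fixed orientations with trainable \(\{\alpha, \beta, \delta, \eta\}\). This is only one possible reading of the construction: the class name, the grid range, and the choice of one trainable parameter set per orientation are our own assumptions rather than the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn

class GaborFilterBank(nn.Module):
    """Sketch of Eqs. 1-4: k x k Gabor kernels with trainable alpha, beta,
    delta, eta, evaluated at n fixed rotation angles theta in [0, 2*pi)."""

    def __init__(self, k: int = 7, n_orientations: int = 8):
        super().__init__()
        # one trainable parameter set per orientation (an assumption; the paper
        # does not state whether the parameters are shared across orientations)
        self.alpha = nn.Parameter(torch.ones(n_orientations))
        self.beta = nn.Parameter(torch.ones(n_orientations))
        self.delta = nn.Parameter(torch.full((n_orientations,), math.pi / 2))
        self.eta = nn.Parameter(torch.zeros(n_orientations))
        self.register_buffer(
            "thetas", torch.arange(n_orientations) * (2 * math.pi / n_orientations)
        )
        # uniform k x k sampling grid {(x_i, y_i)}_{i=1..k^2}
        coords = torch.linspace(-1.0, 1.0, k)
        yy, xx = torch.meshgrid(coords, coords, indexing="ij")
        self.register_buffer("xx", xx)
        self.register_buffer("yy", yy)

    def forward(self) -> torch.Tensor:
        """Return the filter bank F_p as an (n_orientations, k, k) tensor."""
        kernels = []
        for i, theta in enumerate(self.thetas):
            x_r = self.xx * torch.cos(theta) - self.yy * torch.sin(theta)  # Eq. 2
            y_r = self.xx * torch.sin(theta) + self.yy * torch.cos(theta)  # Eq. 3
            g = torch.exp(-self.alpha[i] ** 2 * (x_r ** 2 + self.beta[i] * y_r ** 2)) \
                * torch.cos(self.delta[i] * x_r + self.eta[i])             # Eq. 1
            kernels.append(g)
        return torch.stack(kernels)
```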

In classical machine vision, the frequency and the rotation angle of the Gabor filters are obtained from Eqs. 5 and 6, respectively.

$$ \omega_{n} = \frac{\pi}{2}\left(\sqrt{2}\right)^{-(n-1)} $$
(5)
$$ \theta_{m} = \frac{\pi}{8}(m - 1) $$
(6)

These equations can also be used here to obtain the F set.
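For reference, Eqs. 5 and 6 can be evaluated directly; the small sketch below (the function name is ours) computes the classical frequency of the n-th scale and the angle of the m-th orientation.

```python
import math

def classical_gabor_params(n: int, m: int):
    """Eq. 5 and Eq. 6: frequency of the n-th scale and angle of the m-th orientation."""
    omega_n = (math.pi / 2) * (math.sqrt(2) ** (-(n - 1)))
    theta_m = (math.pi / 8) * (m - 1)
    return omega_n, theta_m

print(classical_gabor_params(1, 3))  # (pi/2, pi/4) ~ (1.5708, 0.7854)
```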

After completing the steps shown in Fig. 1 and producing the set of filters, the convolutional Gabor layer is obtained. This layer is then added as the first layer of the backbone of the object detectors. A ReLU activation function is applied to the convolutional Gabor layer so that its output can be connected to the next layer.

As shown in Fig. 2, an input image is first split into its constituent RGB channels, which are then fed as a tensor to a Gabor filter bank. Acting as the input layer of the detection system, the Gabor filter bank extracts the image's low-level features. Following the technique explained above and illustrated in Fig. 1, each filter in the bank is constructed with a specific theta angle (\(0 \le \theta \le 2\pi\)) and extracts the edges and other low-level features of the image corresponding to that angle.

Fig. 2 The block diagram of the proposed method

In order to cover different theta angles, our filter bank includes various filters with different rotation angles, and every channel of the input image is convolved with every filter in the bank. After applying the filter bank, the output tensor must be prepared for the main section of the detector; the object detector is attached after matching its input channels to the output of the Gabor layer. In this work we have used the YOLOv3-m object detector with the MobileNet backbone, the YOLOv3-d with the Darknet backbone, and the FRCNN; these models were chosen so that the presented method is evaluated on architectures with different backbones. The output of the convolutional Gabor filters is passed through an activation function and then fed to a 1 × 1 convolution block, which produces the number of channels required by the main part of the detector model.
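The pipeline of Fig. 2 can be sketched as follows, building on the GaborFilterBank sketch given earlier (that class must be in scope). The grouped convolution, the ReLU placement, and the output channel count are our assumptions about one reasonable realization, not the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaborStem(nn.Module):
    """Sketch of the front end in Fig. 2: every RGB channel is convolved with
    every Gabor kernel, a ReLU is applied, and a 1 x 1 convolution maps the
    result to the channel count expected by the detector backbone."""

    def __init__(self, filter_bank: "GaborFilterBank", out_channels: int):
        super().__init__()
        self.filter_bank = filter_bank
        n = filter_bank.thetas.numel()
        # 3 colour channels x n Gabor responses -> backbone input channels
        self.project = nn.Conv2d(3 * n, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 3, H, W)
        kernels = self.filter_bank()                      # (n, k, k)
        n, k, _ = kernels.shape
        # group the convolution by colour channel so each channel sees all kernels
        weight = kernels.unsqueeze(1).repeat(3, 1, 1, 1)  # (3*n, 1, k, k)
        feats = F.conv2d(x, weight, padding=k // 2, groups=3)
        return self.project(F.relu(feats))

# Hypothetical wiring: the stem output feeds a backbone stage that expects,
# say, 32 input channels (the actual number depends on the chosen detector).
stem = GaborStem(GaborFilterBank(k=7, n_orientations=8), out_channels=32)
features = stem(torch.randn(1, 3, 416, 416))
```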

4 Experimental results

In this research, all the networks were trained under similar conditions for 70 epochs. After evaluating the number of training epochs, we found that the adversarial accuracy does not increase significantly and fluctuates only slightly after epoch 70, so 70 epochs were deemed sufficient for the models considered in this work. As an example, the average adversarial accuracy of the YOLOv3-d model under the TOG-mislabeling attack on the PASCAL VOC dataset is reported in Fig. 3; no significant improvement occurs after epoch 70, so the accuracy obtained at this epoch can be trusted.

Fig. 3 The average adversarial accuracy of the YOLOv3-d model for the TOG-mislabeling attack on the PASCAL VOC dataset

Also, the batch size was set to 64. Different batch sizes were evaluated; given the algorithm used and the available hardware, the best accuracy and the most suitable computation speed were achieved with a batch size of 64. For example, with a batch size of 128, the average adversarial accuracies drop by about 2% at the same number of epochs.

The networks and the learnable parameters of the Gabor layer have been trained with stochastic gradient descent. To evaluate the performance of the method presented in the preceding section, it has been applied to 5 well-known object detectors: YOLOv3-m, YOLOv3-d, SSD300, SSD512, and Faster R-CNN (henceforth called FRCNN). The specifications of these detectors are listed in Table 1.

Table 1 The models used in this paper and their specifications

The reason for choosing these object detector models is to evaluate the efficacy of the presented method in various models with different inputs and backbones. The datasets of MSCOCO (v. 2017) and PASCAL VOC (v. 2012) have been used to test the robust object detectors obtained. The MSCOCO dataset is one of the most famous datasets in the field of object detection. This dataset includes 4000 images for model training, 5000 images for validation, and 5000 images for testing. All the images of this dataset have been used in this paper. The images in this dataset cover 80 classes of objects.

The PASCAL VOC dataset includes 20 object classes. This dataset consists of 1464 images for training as well as 1464 images for validation and testing. In evaluating our proposed method, we have used all the images of this dataset.

Different techniques have been proposed in recent years for perturbing the dataset images. In this research, we employ six of the best-known attacks in the field of object detection. Adversarial attacks must perturb the images in such a way that the perturbations are not recognizable by the human eye. The targeted attacks used in this paper comprise TOG-fabrication, TOG-vanishing, and TOG-mislabeling, and the untargeted attacks are DAG, RAP, and UEA. A sample image perturbed by the targeted TOG-mislabeling attack is shown in Fig. 4.

Fig. 4 A sample image perturbed by the TOG-mislabeling attack

As observed in Fig. 4, the perturbation is not recognizable by the human eye, yet it misleads the object detector and causes it to miss the considered object in the output image. In Fig. 5, the same image has been perturbed by the untargeted UEA attack; the perturbation has been magnified so that it is visible, since at a much lower, imperceptible level it can still completely fool the detection system. Each of the targeted attacks is designed to mislead the detection network in a particular way. Figures 6 and 7 show the effects of these targeted and untargeted attacks, respectively, on the recognition ability of the object detectors.

Fig. 5 A sample image perturbed by the UEA attack in magnified form

Fig. 6 The effects of targeted attacks on the recognition performance of the object detector

Fig. 7 The effects of untargeted attacks on the recognition performance of the object detector (the UEA attack has been magnified to make it visible)

The efficacy of the various attacks can also be evaluated by means of two parameters: the false negative increase (FNI) and the mean square error (MSE). The FNI indicates the increase in false-negative object detections (\(\Delta N\)) caused by an attack, normalized by the total number of positive detections (\(N\)); it is defined as follows [21]:

$$ \mathrm{FNI} = \frac{\Delta N}{N + 1} $$
(7)
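For concreteness, Eq. 7 and the MSE can be computed as in the short sketch below; the function names are ours, and \(\Delta N\) and \(N\) would be obtained by running the detector on the clean and the perturbed test sets.

```python
import numpy as np

def fni(delta_n: int, n_positive: int) -> float:
    """Eq. 7: increase in false negatives, normalized by the positive detections."""
    return delta_n / (n_positive + 1)

def mse(clean: np.ndarray, perturbed: np.ndarray) -> float:
    """Mean squared error between a clean image and its perturbed version."""
    diff = clean.astype(np.float64) - perturbed.astype(np.float64)
    return float(np.mean(diff ** 2))
```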

These parameters are not usually reported by the attack developers for their devised attacks. However, to shed more light on the performance and effectiveness of these attacks, the FNI and MSE parameters have been calculated for the attacks analyzed in this research and the results have been tabulated in Table 2.

Table 2 The FNI and MSE values for the attacks used in this paper

As observed, the TOG series of attacks mislead the object detectors more effectively, and it is harder to formulate a defense strategy against them. To test the presented algorithm, the existing models are first robustified according to the procedure given in Sect. 3. Next, the input images are perturbed independently by each of the introduced attacks. Then, using two GPUs (an NVIDIA GeForce GTX 1080 Ti and an NVIDIA GeForce RTX 2060 SUPER) and the perturbed images obtained, each of the networks is subjected to adversarial training, i.e., training a model with perturbed images. It should be pointed out that every network in this paper has been trained and tested with each of the attacks considered. The training data of each dataset have been used to train the networks, the validation data to evaluate the networks during training, and the test data for the final evaluation of the detection models.
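Schematically, the adversarial training used here can be sketched as below. The names `attack.perturb`, `detector`, and `loss_fn` are placeholders for the attack implementation, the robustified detector, and its detection loss, and the learning rate is an assumed value; only the optimizer (SGD) and the epoch count echo the settings reported above.

```python
import torch

def adversarial_train(detector, attack, loader, loss_fn, epochs=70, lr=1e-3):
    """Train a detector on images perturbed by a given attack (schematic)."""
    opt = torch.optim.SGD(detector.parameters(), lr=lr, momentum=0.9)
    detector.train()
    for _ in range(epochs):
        for images, targets in loader:                     # a standard detection DataLoader
            adv_images = attack.perturb(images, targets)   # placeholder attack API
            opt.zero_grad()
            loss = loss_fn(detector(adv_images), targets)
            loss.backward()
            opt.step()
```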

Now, to examine the performance of the introduced method more closely, the accuracy obtained for each class has been computed. The graphs in Figs. 8, 9, 10, 11, and 12 illustrate some examples of this evaluation: the clean accuracy of the undefended models, the adversarial accuracy of the undefended models under an arbitrary attack, and the accuracies of the robustified models on clean data and on data perturbed by adversarial attacks. The results are shown for the TOG-vanishing and DAG attacks, as examples of targeted and untargeted attacks, respectively. The per-class accuracies provide valuable information for evaluating the effectiveness of the presented method on each class of the dataset. Figures 8, 9, 10, 11, and 12 show the results of this analysis on the MSCOCO dataset; due to the large number of classes in this dataset, sample results are presented for just 5 of these classes.

Fig. 8 Comparison of clean and adversarial accuracy for 5 classes of the MSCOCO dataset in the presence of a targeted attack and an untargeted attack for the YOLOv3-m model

Fig. 9 Comparison of clean and adversarial accuracy for 5 classes of the MSCOCO dataset in the presence of a targeted attack and an untargeted attack for the YOLOv3-d model

Fig. 10 Comparison of clean and adversarial accuracy for 5 classes of the MSCOCO dataset in the presence of a targeted attack and an untargeted attack for the SSD300 model

Fig. 11 Comparison of clean and adversarial accuracy for 5 classes of the MSCOCO dataset in the presence of a targeted attack and an untargeted attack for the SSD512 model

Fig. 12 Comparison of clean and adversarial accuracy for 5 classes of the MSCOCO dataset in the presence of a targeted attack and an untargeted attack for the FRCNN model

The results obtained by applying the algorithms in this paper to the PASCAL VOC and MSCOCO datasets are analyzed and compared with the results of other works in Tables 3 and 4, respectively.

Table 3 The clean and the adversarial accuracies obtained by different models (with/without a defense) by considering all the attacks in the PASCAL VOC dataset
Table 4 The clean and the adversarial accuracies obtained by different models (with/without a defense) by considering all the attacks in the MSCOCO dataset

An important point to consider when robustifying DNNs against adversarial attacks is to make sure that the clean accuracy of these networks does not drop significantly relative to their undefended state. Examining Tables 3 and 4 shows that, in all states and models, the clean accuracy drop of our method relative to the undefended state is negligible and much lower than that of former works. The accuracy results for the defended networks under the different attacks on each of the mentioned datasets are plotted in Figs. 13, 14, and 15. These figures show that the adversarial accuracies of the models presented in this paper are better than those of previous works and are substantially improved against all the considered attacks.

Fig. 13 Comparing the performances of different models against the A) TOG-vanishing, B) TOG-fabrication, C) TOG-mislabeling, D) DAG, E) RAP, and F) UEA attacks on the PASCAL VOC dataset

Fig. 14 Comparing the performances of different models against the A) TOG-vanishing, B) TOG-fabrication, C) TOG-mislabeling, D) DAG, E) RAP, and F) UEA attacks on the MSCOCO dataset

Fig. 15 Comparing the average performances of different models against adversarial attacks in the A) PASCAL VOC dataset and B) MSCOCO dataset

Also, a comparison of the graphs in Fig. 15 confirms that the combined “YOLOv3-d + Gabor” model has the best performance on both the PASCAL VOC and MSCOCO datasets. A closer examination of Tables 3 and 4 shows that the considered models perform much better on the PASCAL VOC dataset than on MSCOCO. This superiority stems from the higher clean accuracy achieved by the undefended models on PASCAL VOC, which can be attributed to its simpler data and smaller number of classes. Also, the adversarial accuracies of all the models in their undefended state are very low, less than 2% on average, which shows the serious vulnerability of all existing models to adversarial attacks.

Of course, by analyzing the results, one can see that the model accuracies are reduced much more by the newer targeted attacks, owing to their more complex and precise design compared to the older untargeted attacks.

Moreover, the information obtained from Tables 3 and 4 clearly shows that the former defense strategies perform much better against the untargeted attacks than the targeted ones.

The closest work to our research in terms of the applied conditions and the datasets used is that of Zheng et al. [36]; the two works have already been compared in the preceding graphs and tables. To compare the proposed algorithm with further related works, we additionally adapted some existing algorithms to conditions close to the test conditions of this paper. A method based on pre-training has been introduced in [37] for face recognition applications, in which the image features are first extracted and the dimensions are reduced. We applied the method of [37], which we call “SDF” for short, to the models used in this paper and compared it with our results in Table 5. Another robustification technique, based on the detection and mitigation of adversarial attacks, has been proposed by Goswami et al. [38] for boosting system robustness in face recognition tasks. We also evaluated this method on our datasets and tabulated the results of these comparisons in Table 5.

Table 5 Comparing the presented methods with the similar adapted approaches

This table clearly shows that our proposed method improves the adversarial accuracy more than the other approaches.

For a better assessment of the proposed method, the introduced models are evaluated once again using a new attack strategy that combines the Evaporate, Boundary, and Gaussian Noise attacks (the hybrid attack abbreviated as EBG and described in the Introduction). The exact results of this evaluation on the PASCAL VOC and MSCOCO datasets are listed in Tables 6 and 7, respectively. This attack has a high FNI value, and it also allows us to measure the defense performance against hybrid black-box attacks. Tables 6 and 7 show that the presented defense strategy not only deals adequately with the older attacks and the targeted TOG attacks, but also performs effectively against this new type of hybrid black-box attack. This demonstrates the credibility of the proposed defense in dealing with various types of adversarial attacks under different conditions.

Table 6 The performances of the presented models against the EBG attack on the PASCAL VOC dataset
Table 7 The performances of the presented models against the EBG attack on the MSCOCO dataset

5 Conclusion

Using Gabor filter banks in different model backbones, a new method was introduced in this paper for robustifying visual object detectors. This approach was applied to five models, and the resulting detectors were tested on the PASCAL VOC and MSCOCO datasets. The input images were perturbed by three types of targeted and three types of untargeted attacks, and the results were reported for all the considered states. Five models that are robust to adversarial attacks and suitable for object detection applications have thus been proposed, and their results for different states have been compared. Based on the findings of this research, the introduced models perform well against adversarial attacks, and the best performance belongs to the robust YOLOv3 model with the Darknet backbone and the convolutional Gabor layers.