1 Introduction

With the development of artificial intelligence and deep learning, computer vision has matured rapidly, and pedestrian detection technology has entered everyday life, with broad prospects in security monitoring, unmanned driving, human behavior analysis, and related fields [1,2,3,4,5,6,7,8,9,10,11,12]. Pedestrian detection determines whether pedestrians are present in a given image or in each frame of a video sequence, and locates each target accurately and quickly.

Traditional visible-light imaging cannot be applied at night or in some unmanned-driving scenarios. Infrared thermal imaging is based on the relative temperature of objects; compared with visible light, it is less affected by illumination and other external factors and can be applied in many settings. However, images collected by infrared equipment carry no color information, so pedestrian detection accuracy on them is low. Pedestrian detection algorithms can be divided into traditional algorithms and deep learning algorithms. Traditional algorithms mainly extract hand-crafted image features and combine them with machine learning classifiers to recognize the target. However, such algorithms are complex, it is often difficult to design reasonable features for complex scenes, and the weight parameters are hard to tune accurately, so their generalization ability is poor.

In recent years, the Convolutional Neural Network (CNN) [7] has made significant breakthroughs in pedestrian detection. A CNN can learn target features automatically from large amounts of data; compared with hand-designed features, CNN features are more discriminative and generalize better [8]. At the same time, deep learning improves both detection accuracy and detection speed over traditional methods. Before YOLO appeared, deep learning detectors were not fast enough to guarantee real-time performance, which is essential for applications such as unmanned driving. Redmon et al. [5] proposed the YOLO (You Only Look Once: Unified, Real-Time Object Detection) algorithm, opening the field of one-stage object detection. The one-stage idea solves the speed problem in object detection and greatly improves real-time performance while retaining reasonable accuracy, although its accuracy is lower than that of some other algorithms. It was followed by YOLOv2/YOLO9000, YOLOv3, and YOLOv4. The network structure of YOLOv4 is simple and efficient, so it is easy to deploy and widely used; it is one of the preferred algorithms in many commercial fields. Combined with practical application scenarios, it can be applied to large-scale outdoor monitoring to detect areas where pedestrians are forbidden. Moreover, YOLOv4 has great application prospects for small-object and pedestrian detection in infrared images: its backbone can use a strong classifier such as Darknet or ResNet, detection is fast, deployment is simple, background false detections are reduced, and generalization is strong.
Although the YOLOv4 network has clear advantages in multi-scale prediction and a stronger classifier, its object recognition accuracy and recall are lower than those of the R-CNN series of detection algorithms.

To solve the above problems, the YOLOv4 algorithm is improved in this paper. The pedestrian detection accuracy (mAP) on infrared images is improved by 0.04% compared with the original YOLOv4 algorithm.

2 Algorithm Structure of YOLOv4 Network

The YOLO algorithm was the first one-stage detector. It is an object detection system based on a single neural network, proposed by Redmon and Farhadi in 2015. At CVPR 2017 they presented YOLOv2, followed in 2018 by YOLOv3, which further improved accuracy and speed. After further improvement, the YOLOv4 algorithm appeared. YOLOv4 is introduced here in three aspects: network input, structure, and output; the network structure is shown in Fig. 1.

Fig. 1.
figure 1

YOLOv4 network structure.

3 Improved Infrared Pedestrian Detection Algorithm Based on YOLOv4

Two factors matter in the night-vision infrared pedestrian detection and localization task. First, the camera is mounted at a high vantage point, so pedestrian targets appear small; second, pedestrians are moving, so each target lies at a different distance from the camera, producing differences among similar targets in the images. These two factors cause a certain deviation between the final detection results and the actual situation. The structure of the feature extraction network is therefore optimized to enhance the network's ability to capture target locations.

The feature extraction network CSPDarknet adopted by YOLOv4 mainly uses standard 3 × 3 convolutions for feature extraction. Because the shape and size of the receptive field of a standard convolution are fixed, it also extracts features of non-target areas when detecting small targets, which introduces interference into the features produced by the final convolutions and degrades the detector's predictions. Therefore, in this work deformable convolution is used as the core component: a deformable feature extraction module is built on top of the standard convolutions of YOLOv4 to improve the effectiveness of target feature extraction. Compared with standard convolution, deformable convolution has the following advantages. First, the receptive field is used more efficiently, i.e., the feature map represents target information more accurately. Second, the sampling of the convolution kernel is more effective: the kernel adapts its sampling positions to the target, so the extracted features match the target better. In addition, deformable convolution extracts features specifically from the target region, and the stability of the feature map (the weight parameters do not change during deformation) is better than with standard convolution. The deformation learned during model training can be expressed by the following formula:

$$\mathrm{y}(\mathrm{p})=\sum_{\mathrm{k}=1}^{\mathrm{K}}{\mathrm{w}}_{\mathrm{k}}\cdot \mathrm{x}(\mathrm{p}+{\mathrm{p}}_{\mathrm{k}}+\Delta {\mathrm{p}}_{\mathrm{k}})\cdot \Delta {\mathrm{m}}_{\mathrm{k}}$$
(1)

\(\mathrm{p}\)—coordinates of the output location on the feature map;

\(\mathrm{k}\)—index of the kernel sampling location, with \(\mathrm{K}\) locations in total (e.g. a 3 × 3 kernel has \(\mathrm{K}=9\));

\({\mathrm{w}}_{\mathrm{k}}\)—weight of the kth sampling location;

\({\mathrm{p}}_{\mathrm{k}}\)—fixed offset of the kth sampling location;

\(\Delta {\mathrm{p}}_{\mathrm{k}}\)—offset of the kth sampling location that the model needs to learn;

\(\Delta {\mathrm{m}}_{\mathrm{k}}\)—offset-control (modulation) parameter of the kth sampling location that the model needs to learn.
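To make Eq. (1) concrete, the following minimal NumPy sketch (an illustration, not the paper's implementation) evaluates a modulated deformable 3 × 3 convolution at a single output location \(\mathrm{p}\); since the learned offsets are fractional, the sampling positions are read with bilinear interpolation:

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Bilinearly sample feature map x of shape (H, W) at fractional (py, px)."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    wy1, wx1 = py - y0, px - x0          # interpolation weights
    wy0, wx0 = 1.0 - wy1, 1.0 - wx1
    val = 0.0
    for yy, wy in ((y0, wy0), (y0 + 1, wy1)):
        for xx, wx in ((x0, wx0), (x0 + 1, wx1)):
            if 0 <= yy < H and 0 <= xx < W:   # zero padding outside the map
                val += wy * wx * x[yy, xx]
    return val

def deformable_conv_at(x, w, offsets, masks, p):
    """Eq. (1): y(p) = sum_k w_k * x(p + p_k + dp_k) * dm_k for one 3x3 kernel."""
    grid = [(-1, -1), (-1, 0), (-1, 1),       # fixed offsets p_k of the
            (0, -1),  (0, 0),  (0, 1),        # 3x3 sampling grid (K = 9)
            (1, -1),  (1, 0),  (1, 1)]
    y = 0.0
    for k in range(len(grid)):
        dy, dx = offsets[k]                    # learned offset dp_k
        py = p[0] + grid[k][0] + dy
        px = p[1] + grid[k][1] + dx
        y += w[k] * bilinear_sample(x, py, px) * masks[k]  # modulation dm_k
    return y
```

With zero offsets and unit masks, the operation reduces exactly to a standard 3 × 3 convolution; in the network, `offsets` and `masks` would be produced by an additional convolution branch and learned end-to-end.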

The feature extraction module is optimized based on deformable convolution. The composition of the optimized deformable feature extraction module is shown in Fig. 2.

Fig. 2.
figure 2

Deformable feature extraction module.

Compared with the module before optimization, the optimized module mainly uses 3 × 3 deformable convolution in place of standard convolution, and uses 1 × 1 convolution layers to reduce and then restore the channel dimension. First, a 1 × 1 standard convolution layer reduces the dimension of the input feature map and removes redundant features; then features are extracted by the 3 × 3 deformable convolution; finally, a 1 × 1 standard convolution raises the dimension of the extracted features to increase the amount of information. To enhance the reuse of target location information, a coordinate attention module is added to the attention mechanism of YOLOv4 to strengthen coordinate information. The coordinate attention module improves on the SE channel attention mechanism: it extracts feature-weight information along the horizontal and vertical directions of the feature map and, by aggregating the two, marks accurate target coordinates saliently.
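As an illustration of the coordinate attention idea (not the paper's exact implementation), the following NumPy sketch pools a feature map along each spatial axis and re-weights it per coordinate; the matrices `w_h` and `w_w` are hypothetical stand-ins for the module's 1 × 1 convolutions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x, w_h, w_w):
    """Re-weight a (C, H, W) feature map by attention computed per row and column."""
    pool_h = x.mean(axis=2)        # (C, H): aggregate along the width
    pool_w = x.mean(axis=1)        # (C, W): aggregate along the height
    a_h = sigmoid(w_h @ pool_h)    # (C, H): attention weight per height coordinate
    a_w = sigmoid(w_w @ pool_w)    # (C, W): attention weight per width coordinate
    return x * a_h[:, :, None] * a_w[:, None, :]
```

Unlike SE attention, which produces one weight per channel, this keeps a separate weight for every row and every column, so the position of a salient target is preserved in the attention map.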

To optimize anchor localization accuracy, a Guided Anchoring mechanism is added to the detection layer of YOLOv4; it improves the quality of anchors and candidate regions by generating anchors adaptively from the network. The core principle of Guided Anchoring is to decompose the location attributes of a target, which are usually represented by the four parameters \((x,y,w,h)\) for position and size. That is, if the position information of a target in feature map \(F\) is expressed as \(p(x,y,w,h|F)\), it can be decomposed into \(p(x,y|F)p(w,h|x,y,F)\). This decomposition shows that localizing a target means first determining the region where it exists; the shape and size are closely related to those regional coordinates. Guided Anchoring has two branches: one predicts the anchor center coordinates, and the other predicts the anchor shape. The structure of the anchor generator is shown in Fig. 3.

Fig. 3.
figure 3

Anchor network structure.

The main function of the center-coordinate prediction branch is to determine which regions of the feature map may contain anchor center points, which is a binary classification problem. Concretely, the input feature map is converted into a score map through a 1 × 1 convolution layer, and the final probability map is obtained by applying the sigmoid function element-wise to the map. A threshold \(\varepsilon_{L}\) is then used to select the areas that may contain anchor center points. Taking a point \(\left( {i,j} \right)\) on feature map \(F\) as an anchor center with probability \(p(i,j|F)\), its coordinates in the input image are \(((i + 1/2)s,(j + 1/2)s)\), where \(s\) is the down-sampling stride between the feature map and the original image; this maps points on the feature map back to the original image for detection and output.
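The center branch described above can be sketched as follows (a simplified illustration; `score_map`, `stride`, and `eps` stand in for the 1 × 1 convolution output, the down-sampling stride \(s\), and the threshold \(\varepsilon_L\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def anchor_centers(score_map, stride, eps):
    """Select likely anchor centers and map them back to input-image coordinates."""
    prob = sigmoid(score_map)             # element-wise probability map
    rows, cols = np.nonzero(prob > eps)   # positions exceeding the threshold
    # (i, j) on the feature map maps to ((i + 1/2)s, (j + 1/2)s) in the image
    return [((i + 0.5) * stride, (j + 0.5) * stride) for i, j in zip(rows, cols)]
```

Raising `eps` keeps only the most confident center candidates, trading recall for fewer anchors to score in the shape branch.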

4 Experimental Results and Analysis

4.1 Evaluating Indicator

Performance indicators of the infrared pedestrian detection algorithm, such as the intersection-over-union (IoU), precision, and recall, are used to evaluate the strengths and weaknesses of infrared pedestrian detection; the quality of a model is judged by the magnitude of the corresponding values.

$$\mathrm{IOU}=\mathrm{area}\left({\mathrm{S}}_{1}\cap {\mathrm{S}}_{2}\right)/\mathrm{area}\left({\mathrm{S}}_{1}\cup {\mathrm{S}}_{2}\right)$$
(2)
$${\text{Recall}}\, = \,{\text{TP}}/\left( {{\text{TP}}\, + \,{\text{FN}}} \right)$$
(3)
$${\text{P}}\, = \,{\text{TP}}/\left( {{\text{TP}}\, + \,{\text{FP}}} \right)$$
(4)

S1—pedestrian area predicted from the infrared image;

S2—manually annotated pedestrian area;

TP—pedestrians in the infrared image that are predicted correctly;

FN—pedestrians that are missed, i.e., wrongly predicted as non-pedestrian;

FP—areas that are not pedestrians but are predicted as pedestrians.
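These indicators can be computed directly; the following sketch (with hypothetical helper names) evaluates IoU for axis-aligned boxes given as (x1, y1, x2, y2), together with precision and recall from the TP/FP/FN counts:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)             # union = A + B - overlap

def precision_recall(tp, fp, fn):
    """Eqs. (3) and (4): precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)
```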

4.2 Experimental Steps and Innovation Analysis

The data set used in the experiment is from the OSU Thermal Pedestrian Database. Before use, the data set is cleaned; 1500 ordinary samples, 400 difficult samples, and 200 negative samples are selected to form a training set of 2100 images, and 300 ordinary samples and 200 difficult samples form a test set of 500 images.

Based on the trained models, the improved YOLOv4, YOLOv4, and YOLOv3 are compared and tested. The test results are shown in Table 1.

Table 1. Model checking performance comparison

The meaning of each index in Table 1 is as follows. Precision is the proportion of samples classified as positive that are indeed positive. Recall is the proportion of all positive samples that the classifier identifies as positive. The F1-score is computed as 2 × precision × recall/(precision + recall). IoU (intersection over union) is the overlap ratio between the candidate bound and the ground-truth bound, i.e., the ratio of their intersection to their union; the ideal case is complete overlap, i.e., a ratio of 1. Mean average precision (mAP) is the average of the AP values over all categories. The F1-score column of Table 1 shows that the overall robustness and recall of the improved YOLOv4 algorithm are better than those of YOLOv3, which reflects that optimizing the backbone and detection networks greatly helps improve network performance. The improved YOLOv4 algorithm is used for pedestrian detection, and the test results are shown in Fig. 4.
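The F1-score and AP computations can be sketched as follows (an illustrative implementation using all-point interpolation; the AP variant actually used in the experiments is not specified in the text):

```python
import numpy as np

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    for i in range(len(p) - 2, -1, -1):    # make precision non-increasing
        p[i] = max(p[i], p[i + 1])
    idx = np.nonzero(r[1:] != r[:-1])[0]   # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP is then the mean of `average_precision` over all categories; with only the pedestrian class, mAP equals the pedestrian AP.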

Fig. 4.
figure 4

Test results with improved YOLOv4 algorithm.

To verify the comprehensive performance, the three algorithms are compared on the same training and test sets to obtain ROC curves. The comparison of the improved YOLOv4 (Im-YOLOv4), YOLOv4, and YOLOv3 is shown in Fig. 5.

Fig. 5.
figure 5

The comparison of comprehensive performances.

5 Conclusion

In this paper, an improved infrared pedestrian detection algorithm based on YOLOv4 is proposed. The optimized algorithm improves detection on gray-scale images and small targets and increases the practicality of infrared detection. Deformable convolution is used as the core component, and a deformable feature extraction module is constructed to enhance the effectiveness of target feature extraction, strengthen the transmission of feature information, and effectively improve detection accuracy.