1 Introduction

Pedestrian detection is an attractive topic in computer vision because it identifies pedestrians and marks their positions in an image. It is already applied in security monitoring, autonomous driving [1, 2], smart homes, etc., and the demand for the technology is still increasing. In modern society in particular, there are many crowded pedestrian scenes, such as bus stations, shopping malls, and gatherings. Applying pedestrian detection in such scenes not only gives people more convenience but also helps ensure their safety. Consequently, it is appealing to investigate effective pedestrian detection approaches specifically suited to crowded scenes.

The performance of pedestrian detection in crowded scenes is affected by several factors, for example, the pixel size of individual pedestrians, the variation in pedestrian posture, and the degree of occlusion. Among these, occlusion seriously affects detection performance, so it is necessary to further explore solutions to it. Pedestrian occlusions are mainly classified into intra-class occlusion and inter-class occlusion [3, 4]. Intra-class occlusion is the mutual occlusion between individual pedestrians, which often introduces a large amount of interference information and leads to false detections. Inter-class occlusion is the occlusion of pedestrians by other objects, which often causes information loss for the detected pedestrians and thus leads to missed detections. In crowded scenes, these two issues make it difficult to detect pedestrians efficiently and locate them accurately. Therefore, to address the above problem, we propose a novel model to improve the performance of pedestrian detection.

To obtain high detection performance, anchor-based two-stage detection models have been proposed, which display excellent performance on COCO [5], PASCAL VOC [6], etc. Furthermore, researchers have been devoted to enhancing the modules of the two-stage pedestrian detection algorithm. For instance, Gao et al. [7] propose a feature fusion model to improve pedestrian detection performance by capturing high-quality features. Zhou et al. [8] address pedestrian detection in crowded scenes in terms of part detection. Bodla et al. [9] and Liu et al. [10] address the problem of low detection performance by improving the Non-Maximum Suppression (NMS) algorithm. Liu et al. [11] address the weaknesses of FPN with a novel tripartite feature-enhanced pyramidal network (TFPN), which speeds up encoding and generates more robust representations. The existing works contribute some effective detection methods, but they rarely use the position information of the overlapping parts of prediction boxes, which may help eliminate the effect of occlusion and determine the complete pedestrian instance precisely.

In this paper, a novel anchor-based two-stage pedestrian detection model is employed to handle severe occlusion in crowded scenes. Firstly, to address false detections caused by occlusion, we introduce Distance-Intersection over Union (DIoU) loss [12] to train the network so as to improve the accuracy of the generated proposal boxes. During training, the presence of occluded pedestrian instances in the image means that one instance can be contained in more than one proposal box. An efficient method is needed to determine which proposal box best matches the Ground-truth box (Gt box) containing the instance. DIoU loss takes the center point distance between the proposal box and the Gt box as a basis for calculating the loss, and directly regresses the Euclidean distance between the center points of the two boxes to accelerate convergence. Secondly, a refinement module is added to the Region Convolutional Neural Network (RCNN). Due to the occlusion of pedestrian instances, some proposal boxes generated by the preceding module contain several instances. Earth Mover's Distance (EMD) loss [13] is introduced as the metric to determine which instance is preserved in the proposal box. Finally, we utilize Relocation Non-Maximum Suppression (RNMS) [14] as the post-processing operation. Compared to other NMS algorithms, RNMS not only selects the bounding box from a series of proposal boxes but also relocates the box, so as to obtain the optimal bounding box. Our main contributions can be summarized as follows:

  • We propose a DIoU-RPN module to retrain the feature extraction network. The core of the module is to use the new loss to calculate the center point distance and the overlapping area between proposal boxes in order to distinguish occluded pedestrians.

  • We introduce a refinement module to exclude false positives from the proposal boxes. The refinement module mainly uses EMD loss to minimize the loss values generated during training, thus optimizing the detection performance.

  • RNMS is introduced as a post-processing operation. For a pedestrian instance, the position information of all proposal boxes containing its parts is obtained. Then, the optimal bounding box is relocated based on this position information, so that it contains the complete instance.

With the combination of these modules, the proposed model can mitigate the effect of occlusion and achieve better detection performance.

This paper is organized as follows. Related work is reviewed in Section 2. The details of our pedestrian detection model are described in Section 3. Section 4 presents the experimental results of our model on the relevant datasets. Finally, conclusions are discussed in Section 5.

2 Related work

2.1 Pedestrian detection

Some methods have been proposed to detect pedestrians in various situations, such as size variation, occlusion, etc. Proposal boxes and prior boxes are commonly used in existing algorithms.

Based on the use of proposal boxes in detection, pedestrian detection algorithms can be classified into one-stage object detection algorithms [15,16,17,18,19] and two-stage object detection algorithms [3, 13, 20,21,22,23,24,25,26]. Instead of extracting features of candidate regions, a one-stage algorithm directly uses the detection network to classify and regress objects in an image. One-stage algorithms are characterized by low computational cost and high real-time performance, but lower accuracy when detecting dense objects. Such detection networks include YOLO [15, 16], RetinaNet [17], and SSD [18]. The primary difference between two-stage and one-stage algorithms is that, in a two-stage algorithm, the first-stage network is used exclusively to extract proposal boxes, and the second-stage network classifies and regresses the proposal boxes. Compared to one-stage algorithms, two-stage algorithms have higher accuracy but consume more resources and time, which results in poor real-time performance. Detection networks of two-stage algorithms include RCNN [20, 25], SPPNet [26], etc.

According to whether prior boxes are used for detection, detection algorithms can be divided into anchor-based object detection algorithms [3, 13, 25] and anchor-free object detection algorithms [22, 23]. In anchor-based algorithms, a set of anchor boxes at different scales is generated, and the anchor boxes that contain pedestrians are then selected as candidates. Most of the two-stage object detection algorithms mentioned above are also anchor-based. Anchor-free algorithms are mainly implemented through central regions and key points, which eliminates the anchor box generation mechanism and speeds up detection. Nonetheless, the accuracy of anchor-free methods is lower than that of anchor-based methods. Common networks used by this type of algorithm are YOLO, CenterNet [27], FCOS [28], etc.

Furthermore, part detectors [29] and novel detection models have been specifically designed to deal with occluded scenes. Recently, convolutional neural networks have dominated crowded pedestrian detection and shown excellent performance. Shang et al. [24] propose supervising the visibility of each part, which encourages the network to extract features with essential part information. Chu et al. [13] propose that a proposal can predict multiple instances, thereby improving the detection performance. Wang et al. [3] use a dual-region feature generation model to generate high-quality proposal features. Liu et al. [30] propose a feature blender that generates stronger features by fusing the initially obtained rough features.

Despite these advances, the challenges posed by environmental changes in real-world scenes continue to persist, necessitating further research. Our model differs from existing models, as it not only detects the positions of pedestrians but also refines the generated proposal box information to address the occlusion issue in crowded scenes.

2.2 IoU loss

IoU can reflect the accuracy of the prediction results in object detection tasks. It shows the detection performance mainly by calculating the similarity between two boxes. IoU and IoU loss [31] equations are as follows:

$$\begin{array}{c}IoU=\frac{\left(A\cap B\right)}{\left(A\cup B\right)}\\ {L}_{IoU}=1-IoU\end{array}$$
(1)

where \(A\cap B\) represents the area of the overlapping part of two boxes and \(A\cup B\) represents the area of their union. The smaller the IoU loss, the larger the overlap area, the closer the positions of the two boxes, and the better the detection performance of the model.
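Eq. 1 can be illustrated with a minimal sketch. The box layout `(x1, y1, x2, y2)` and the function names `iou` and `iou_loss` are assumptions made for this illustration and are not taken from the model's implementation.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) (assumed layout)."""
    # Width and height of the intersection; clamped to 0 for disjoint boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_loss(a, b):
    """L_IoU = 1 - IoU, as in Eq. 1."""
    return 1.0 - iou(a, b)
```

For two 2 × 2 boxes shifted by one unit in each direction, the intersection area is 1 and the union area is 7, so the IoU is 1/7 and the loss is 6/7.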

In the object detection model, IoU loss is utilized in the RPN module to calculate the training loss. Specifically, the proposal box is selected and adjusted based on the IoU loss calculated from the anchor box and the Gt box. However, IoU only measures the size of the overlapping area between two boxes and provides no information about their relative position. Two boxes can overlap in various ways, and different overlapping positions may yield overlapping areas of the same size.

To solve this problem, we use DIoU. DIoU improves on IoU by adding a penalty term that minimizes the normalized distance between the center points of the proposal box and the Gt box [12], and thereby adjusts and determines the position of the proposal box. DIoU not only speeds up convergence in regression but also takes into account several factors that are important for detection, including overlapping area and center point distance. During training, DIoU loss makes proposal box regression more stable.

The calculation results of IoU and DIoU for the overlapping parts of two boxes are displayed in Fig. 1. It can be observed that DIoU is more sensitive to the overlapping position of the boxes, which is beneficial for achieving higher accuracy in object detection.

Fig. 1 Comparisons between IoU and DIoU

2.3 NMS

The NMS algorithm is commonly applied as a post-processing operation in object detection, aiming to select the optimal bounding box from a set of proposal boxes.

The main steps of the traditional NMS algorithm are as follows: 1) select the proposal box with the highest category confidence as the optimal bounding box and remove it from the proposal box set; 2) calculate the IoU between each remaining proposal box and the currently selected optimal bounding box; 3) compare the calculated IoU values with the NMS threshold and suppress the proposal boxes whose IoU exceeds the threshold; 4) repeat the above operations until the proposal box set is empty.
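The four steps above can be sketched in a few lines. This is an illustrative version under assumptions: boxes are `(x1, y1, x2, y2)` tuples, and the names `iou` and `nms` are chosen here rather than taken from any particular library.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) (assumed layout)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.5):
    """Return the indices of the kept boxes, highest confidence first."""
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while remaining:                      # step 4: loop until the set is empty
        best = remaining.pop(0)           # step 1: highest-confidence box
        keep.append(best)
        # steps 2-3: drop boxes whose IoU with the selected box exceeds the threshold
        remaining = [i for i in remaining
                     if iou(boxes[best], boxes[i]) <= threshold]
    return keep
```

With boxes `(0, 0, 10, 10)`, `(1, 1, 11, 11)`, and `(20, 20, 30, 30)` scored 0.9, 0.8, and 0.7, the second box overlaps the first with an IoU of 81/119 (about 0.68) and is suppressed at a threshold of 0.5, while the third box is kept.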

In recent years, various improved NMS algorithms have been proposed. Soft-NMS [9] reduces the detection scores of highly overlapping proposal boxes instead of directly removing them. Adaptive-NMS [10] automatically sets the suppression threshold based on the pedestrian density. Set-NMS [13] adds an additional check of whether two proposal boxes come from the same proposal before removing a box.
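To make the contrast with hard suppression concrete, the following is a minimal sketch of the score-decay idea behind Soft-NMS. It assumes the linear decay rule (a score is multiplied by 1 − IoU when the IoU exceeds the threshold); the function names, the `(x1, y1, x2, y2)` box layout, and the `score_floor` cut-off are illustrative assumptions, not details given in this paper.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) (assumed layout)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_linear(boxes, scores, iou_threshold=0.5, score_floor=0.001):
    """Return (index, final_score) pairs in selection order."""
    scores = list(scores)                  # work on a copy
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        if scores[best] < score_floor:     # everything left is negligible
            break
        remaining.remove(best)
        kept.append((best, scores[best]))
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            if overlap > iou_threshold:
                scores[i] *= 1.0 - overlap  # decay the score instead of deleting the box
    return kept
```

On the same three boxes used for traditional NMS, the heavily overlapping second box is retained with its score reduced from 0.8 to about 0.26 rather than being removed.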

These algorithms can improve the recall rate to a certain extent, but selecting the optimal bounding box based on the category confidence alone is insufficient. The reason is that a proposal box generated by deep learning-based algorithms carries coordinate information and a category confidence. The coordinate information alone does not provide any useful information about the quality of the proposal box. The category confidence indicates the probability that an instance exists: the higher the category confidence, the higher the probability that a pedestrian instance exists in the proposal box. The box selected based on the category confidence is the one that contains the largest part of the pedestrian instance among all candidate boxes, but it does not necessarily contain the complete pedestrian instance. Therefore, in this paper, RNMS [14] is adopted as the post-processing operation in the pedestrian detection algorithm to improve the reliability and accuracy of pedestrian detection in crowded scenes.

3 Method

The overall architecture of the proposed model is depicted in Fig. 2. The foundational network is built on a Feature Pyramid Network (FPN) [32] with ResNet-50 [33] employed as the backbone for feature extraction. In Fig. 2, the feature maps generated by FPN are marked blue and the Gt boxes are marked yellow. The model comprises four primary processing steps. 1) The input image is pre-processed and rough features are generated by FPN. 2) The DIoU-RPN module is proposed to train the network weights to generate proposal boxes. Compared with the original IoU loss, DIoU loss converges faster. Furthermore, by considering not only the overlapping area between the proposal box and the Gt box but also the position of the overlapping part, DIoU-RPN effectively obtains the ideal proposal box set, which is marked purple in Fig. 2. 3) In RM-RCNN, the refinement module is incorporated into RCNN to verify the legitimacy of each instance, thereby enhancing the accuracy and reliability of the model. 4) RNMS is introduced as a post-processing operation, which is more comprehensive than other NMS algorithms. RNMS relocates the optimal bounding box so that the final bounding box is the one containing the most information about the object instance.

Fig. 2 The architecture of the model

Fig. 3 The parameters for calculating DIoU

3.1 Distance-IoU region proposal networks

Existing works aim to enhance the quality of image features by improving the RPN. The RPN module is utilized to generate proposal boxes in the Faster Region Convolutional Neural Network (Faster-RCNN) model [25]. Its main purpose is to preliminarily adjust the anchor boxes to obtain proposal boxes, laying the foundation for the subsequent fine adjustment. The RPN process has the following steps: 1) A series of convolutions is applied in the FPN module to obtain the common feature map, and predefined anchor boxes are placed on the common feature map to generate proposal boxes. These anchors have different sizes and shapes so as to frame objects of different sizes and shapes. 2) The rough proposal boxes are generated by performing a 3 × 3 convolution and a 1 × 1 convolution on the common feature map, respectively. The 3 × 3 convolution determines whether the anchor box contains pedestrians, and the 1 × 1 convolution adjusts the position of each anchor box. 3) Since there is a large number of rough proposal boxes, they must be filtered. First, the proposal boxes with higher pedestrian probability scores are retained; then NMS is used to remove heavily overlapping boxes to get the final proposal boxes. 4) During training, the IoU between the Gt box and each anchor is calculated, and anchor boxes are selected according to the IoU value. 5) The loss between the selected box and the Gt box is computed, and the network weights are updated by gradient descent on this loss.

Proposal box regression is generally used in pedestrian detection to identify and locate the target object, so the results of this stage play a crucial role in overall detection performance. Our improvement focuses on the accuracy of proposal box prediction during the training phase of the RPN module. The RPN module calculates IoU loss mainly from the anchor and the Gt box, and the final proposal boxes are selected and adjusted according to the loss value. As discussed in SubSect. 2.2, IoU loss only reflects the overlapping area between the two boxes and provides no information on their relative positions. Moreover, two boxes can have the same overlapping area at different overlapping positions. Therefore, selecting a suitable proposal box by relying only on the overlapping area is unreliable. Based on the above, we employ DIoU loss to achieve higher-precision detection as follows:

$$\begin{array}{c}DIoU=IoU-\frac{{\rho }^{2}\left(b,{b}^{Gt}\right)}{{d}^{2}}=IoU-\frac{{c}^{2}}{{d}^{2}}\\ {L}_{DIoU}=1-DIoU\end{array}$$
(2)

where \(b\) represents the center of the proposal box, \({b}^{Gt}\) represents the center of the Gt box, \(\rho\) is the Euclidean distance between the two center points, which is also denoted as \(c\), and \(d\) represents the diagonal length of the minimum enclosing rectangle of the two boxes. Figure 3 provides an example that details these parameters for calculating DIoU. The proposal box and its center point are marked blue, the Gt box and its center point are marked green, and the minimum enclosing rectangle of the two boxes is marked red.

As shown in Fig. 1 and Fig. 3, compared to IoU, DIoU not only focuses on the overlapping area between boxes but also considers the distance between the center points of two boxes. An IoU-based algorithm is likely to regard the occluded part and the unoccluded part of an occluded instance as two instances, which increases the number of false positive samples. By calculating the loss based on the center distance between overlapping boxes, DIoU makes it possible to better predict the presence of multiple instances and thus improves the performance of pedestrian detection. If the two boxes overlap perfectly, then \(c=0\), \(IoU=1\), and \(DIoU=1\). Conversely, if the two boxes do not overlap and are far apart, \(\frac{{c}^{2}}{{d}^{2}}\) tends to 1, \(IoU=0\), and \(DIoU\) tends to \(-1\). Therefore, the range of DIoU is \([-1,1]\).
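A small numerical sketch of Eq. 2 shows how DIoU separates two proposals that IoU cannot distinguish. The `(x1, y1, x2, y2)` box layout and the function names are assumptions for illustration; `c2` and `d2` stand for \({c}^{2}\) and \({d}^{2}\) in Eq. 2.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) (assumed layout)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def diou(a, b):
    """DIoU = IoU - c^2 / d^2, following Eq. 2."""
    # c^2: squared Euclidean distance between the two box centers
    c2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
       + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # d^2: squared diagonal of the minimum rectangle enclosing both boxes
    d2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou(a, b) - (c2 / d2 if d2 > 0 else 0.0)


def diou_loss(a, b):
    return 1.0 - diou(a, b)


gt = (0, 0, 4, 4)
centered = (1, 1, 3, 3)   # same center as the Gt box
corner = (0, 0, 2, 2)     # same size, pushed into a corner of the Gt box
```

Both proposals have an IoU of 0.25 with the Gt box. The centered proposal keeps a DIoU of 0.25, whereas the corner proposal drops to 0.25 − 2/32 = 0.1875, so the DIoU loss prefers the proposal whose center is closer to that of the Gt box.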

Ultimately, DIoU-RPN generates proposal boxes that are better suited, with lower category confidence loss and position loss relative to the original model.

3.2 Refinement region-CNN

The Faster-RCNN model comprises two key components: the RPN and the Fast Region Convolutional Neural Network [34] detection module. The RPN has been elaborated in SubSection 3.1. The RCNN module performs the classification and localization tasks of pedestrian detection.

RCNN is a simple and scalable object detection algorithm and has the following characteristics: 1) The regions of interest of varying sizes from RPN and FPN are mapped into candidate boxes of fixed size \(w\times h\) by pooling. 2) RCNN assumes that there are multiple instances in each proposal box and records the class confidence and corresponding location information for each pedestrian instance. 3) The loss of category confidence and location information with respect to the Gt box is calculated for the multiple prediction pairs in the proposal box.

We assume that there are two instances in each proposal box, but in fact, there is only one instance in some proposals. So, these proposals need to be verified and confirmed. Therefore, our model deals with the problem by adding a refinement module. In this module, the prediction result of the RCNN is taken as input and combined with the proposal features to perform a second round of prediction in order to correct possible mispredictions.

In the RCNN module, there is a matching problem between Gt boxes and proposal boxes. As shown in Fig. 4, the Ground-truth box \({Gt}_{0}\) is displayed in red, and the proposal boxes \({P}_{0}\) and \({P}_{1}\) are displayed in green. Both \({P}_{0}\) and \({P}_{1}\) intersect \({Gt}_{0}\). We introduce EMD loss to match the optimal proposal box to a Gt box. EMD loss measures the distance between two multidimensional matrices in the feature space and is utilized to minimize the loss incurred during training. EMD loss can be calculated as follows:

$${\mathcal{L}}_{loss}=\underset{\pi }{{\text{min}}}\sum_{k=1}^{K}\left[{\mathcal{L}}_{cls}\left({c}_{i}^{\left(k\right)},{g}_{{\pi }_{k}}\right)+{\mathcal{L}}_{reg}\left({l}_{i}^{\left(k\right)},{g}_{{\pi }_{k}}\right)\right]$$
(3)

where \(\pi\) represents a certain permutation of (1, 2,..., K) whose \(k\)-th item is \({\pi }_{k}\); \({g}_{{\pi }_{k}}\in G({b}_{i})\) is the \({\pi }_{k}\)-th Gt box; \({c}_{i}^{(k)}\) and \({l}_{i}^{(k)}\) are the class confidence and location information of the \(k\)-th prediction in proposal \({b}_{i}\); and \({\mathcal{L}}_{cls}\) and \({\mathcal{L}}_{reg}\) are the classification loss and box regression loss, respectively.
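The minimization over permutations in Eq. 3 can be sketched as a brute-force search, which is practical for the small K used here (two instances per proposal). The input `pair_loss` is a hypothetical K × K table in which `pair_loss[k][j]` already holds \({\mathcal{L}}_{cls}+{\mathcal{L}}_{reg}\) between prediction \(k\) and Gt box \(j\); how those per-pair losses are computed is outside this sketch.

```python
from itertools import permutations


def emd_loss(pair_loss):
    """Return (minimum total loss, best permutation) for a K x K loss table.

    pair_loss[k][j] is assumed to be L_cls + L_reg between the k-th
    prediction of a proposal and its j-th Gt box.
    """
    K = len(pair_loss)
    best = None
    for pi in permutations(range(K)):      # every assignment of predictions to Gt boxes
        total = sum(pair_loss[k][pi[k]] for k in range(K))
        if best is None or total < best[0]:
            best = (total, pi)
    return best
```

For the table `[[1.0, 0.1], [0.2, 1.0]]`, the identity matching costs 2.0 and the swapped matching costs 0.3, so the loss is taken from the swapped matching `(1, 0)`.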

Fig. 4 Matching problem between Ground-truth box and prediction bounding box

3.3 Relocation non-maximum suppression

In object detection, the NMS algorithm is frequently employed as a post-processing operation. The performance of NMS depends not only on the algorithm itself but also closely on the threshold it uses, especially in crowded scenes. If the NMS threshold is set too small, the algorithm fails to distinguish all of the pedestrians. If the NMS threshold is set too large, the model detects other objects as pedestrians, which increases the number of false positive samples. Therefore, when selecting a post-processing operation, not only the adaptability of the algorithm should be considered, but its threshold should also be tuned. In this paper, RNMS is adopted as the post-processing operation in the pedestrian detection algorithm in order to improve the reliability and accuracy of pedestrian detection in crowded scenes.

RNMS not only takes the proposal box with the highest category confidence score as the optimal bounding box but also relocates the optimal bounding box using the positional relationship between it and the surrounding proposal boxes [14]. Furthermore, RNMS employs a distance measure instead of IoU to measure the positional relationship between proposal boxes. In this way, RNMS improves the localization accuracy of the optimal bounding box.

The RNMS methodology comprises two primary components: determining the optimal bounding box among the proposal boxes and relocating the optimal bounding box. The steps are as follows: 1) Select the proposal box with the highest category confidence score as the bounding box \(bi\), and calculate the Proximity (P) between \(bi\) and each of the other proposal boxes. Compare P with the proximity threshold. Proposal boxes above the proximity threshold are added to the set of localization references and deleted from the set of proposal boxes. Then the offset O between the bounding box \(bi\) and the proposal boxes in the set of localization references is computed. 2) Relocate the bounding box \(bi\) using the offset O to obtain a higher-quality optimal bounding box. 3) Repeat the above steps for the proposal boxes below the proximity threshold until the proposal box set is empty.

RNMS introduces a new variable, P, which is expressed by the Manhattan distance between the bounding box \(bi\) and a proposal box. Computing P involves a coordinate transformation followed by the distance computation. Figure 5 illustrates the parameters used to calculate the Manhattan distance. The proposal box and the bounding box are marked red and green, respectively. X and Y represent the sets of horizontal and vertical coordinates of the two boxes, respectively. \({H}_{m}\) represents the Manhattan distance.

Fig. 5 Manhattan distance

P can effectively represent the distance relationship between boxes when the boxes are of similar size. In the post-processing operation, however, there is a large number of proposal boxes of different sizes. When the sizes of two boxes are obviously different, P cannot accurately measure the degree of their overlap [14]. To solve this problem, we normalize the proposal box coordinates. This normalization maps the coordinates into the range between 0 and 1 while maintaining the original positional relationship between the boxes.

$$\begin{array}{c}norm\left({x}_{i},{y}_{i}\right)=\left({x}_{i}{\prime},{y}_{i}{\prime}\right)\\ =\left(\frac{{x}_{i}-\mathit{min}\left(X\right)}{\mathit{max}\left(X\right)-\mathit{min}\left(X\right)},\frac{{y}_{i}-\mathit{min}\left(Y\right)}{\mathit{max}\left(Y\right)-\mathit{min}\left(Y\right)}\right)\end{array}$$
(4)

\(X\) and \(Y\) are the sets of horizontal and vertical coordinates shown in Fig. 5. \({\text{max}}\left(\cdot \right)\) and \({\text{min}}\left(\cdot \right)\) represent the maximum and minimum values of a set, respectively. The P of two boxes is calculated on the normalized coordinates as follows:

$$\begin{array}{c}P={H}_{m}\left({U}_{1},{V}_{1}\right)+{H}_{m}({U}_{2},{V}_{2})=\left|{y}_{1}{\prime}-{q}_{1}{\prime}\right|+\left|{y}_{2}{\prime}-{q}_{2}{\prime}\right|+\\ \left|{x}_{1}{\prime}-{p}_{1}{\prime}\right|+\left|{x}_{2}{\prime}-{p}_{2}{\prime}\right|\end{array}$$
(5)
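Eqs. 4 and 5 can be sketched as follows. The sketch assumes that one box is given by corners \(({x}_{1},{y}_{1},{x}_{2},{y}_{2})\), the other by \(({p}_{1},{q}_{1},{p}_{2},{q}_{2})\), and that \(X\) and \(Y\) collect the four horizontal and four vertical corner coordinates of the pair; the function name `proximity` is chosen for this illustration.

```python
def proximity(box, ref):
    """Proximity P (Eq. 5) of box=(x1, y1, x2, y2) and ref=(p1, q1, p2, q2)."""
    xs = (box[0], box[2], ref[0], ref[2])   # set X: horizontal coordinates of both boxes
    ys = (box[1], box[3], ref[1], ref[3])   # set Y: vertical coordinates of both boxes

    def norm(v, values):
        # Eq. 4: min-max normalization into [0, 1]; guard against a zero range.
        lo, hi = min(values), max(values)
        return (v - lo) / (hi - lo) if hi > lo else 0.0

    x1, x2, p1, p2 = (norm(v, xs) for v in xs)
    y1, y2, q1, q2 = (norm(v, ys) for v in ys)
    # Eq. 5: sum of the Manhattan distances between corresponding corners.
    return abs(y1 - q1) + abs(y2 - q2) + abs(x1 - p1) + abs(x2 - p2)
```

Because of the normalization, P does not depend on the absolute scale of the boxes: `(0, 0, 10, 10)` against `(5, 5, 15, 15)` and the same pair scaled by ten both give P = 4/3, and two identical boxes give P = 0.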

To implement the relocation of the bounding box, the offset O is utilized. The offset O is obtained by calculating the distance between the bounding box \(bi\) and the proposal boxes whose P is larger than the proximity threshold, as follows:

$$O=\frac{\sum_{i=1}^{n}\left|{B}_{i}-M\right|}{n}$$
(6)
$${M}_{R}=M+O$$
(7)

where \({B}_{i}\) is the \(i\)-th proposal box in the set of localization references, \(n\) is the number of such boxes, M is the bounding box, and O represents the offset between the bounding box and these proposal boxes. Finally, the relocated optimal bounding box \({M}_{R}\) is obtained by adding the offset O to the bounding box M.
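A literal, coordinate-wise reading of Eqs. 6 and 7 can be sketched as below. Treating \(\left|{B}_{i}-M\right|\) as a per-coordinate absolute difference on `(x1, y1, x2, y2)` tuples is an assumption of this sketch, as is the function name `relocate`; the equations themselves do not state how the difference between two boxes is taken.

```python
def relocate(m, references):
    """Return M_R = M + O, with O the mean of |B_i - M| per coordinate (Eqs. 6-7).

    m          -- bounding box M as (x1, y1, x2, y2)
    references -- boxes B_i in the set of localization references
    """
    if not references:
        return tuple(float(v) for v in m)   # nothing to relocate against
    n = len(references)
    # Eq. 6: average absolute difference, computed separately for each coordinate
    offset = [sum(abs(b[j] - m[j]) for b in references) / n for j in range(4)]
    # Eq. 7: shift the bounding box by the offset
    return tuple(m[j] + offset[j] for j in range(4))
```

For M = `(10, 10, 20, 20)` and references `(8, 9, 21, 22)` and `(12, 11, 19, 20)`, the offset is `(2, 1, 1, 1)` and the relocated box is `(12, 11, 21, 21)`.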

Finally, the execution steps of RNMS are shown in Algorithm 1:

Algorithm 1 RNMS

4 Experiments and discussions

In this section, we perform experiments on CrowdHuman dataset [35] and CityPersons dataset [36] to evaluate the proposed model. We introduce the two datasets, assessment metrics and experimental setup in the experiments. Then we report the experimental results and discuss the performances of the proposed model.

4.1 Datasets

The datasets employed in this paper are CrowdHuman and CityPersons, which are commonly used to evaluate the performance of pedestrian detection algorithms. Solving the complex occlusion problem is essential for improving pedestrian detection accuracy. If the annotated examples reflect occlusion to a significant extent, the pedestrian detection performance is expected to improve substantially. The CrowdHuman dataset provides three annotation labels for each pedestrian: Head Bounding-Box, Visible Bounding-Box, and Full Bounding-Box. A detailed illustration can be found in [35].

We further investigate the robustness of the proposed model on the CityPersons dataset. CityPersons, a subset of Cityscapes, is a lightly occluded pedestrian dataset with varying levels of occlusion. The dataset contains annotations for the visible-region bounding box and the full-body bounding box of each pedestrian. There are 2,975 images for training, 500 images for validation, and 1,575 images for testing.

Table 1 displays the crowding levels and the average number of pedestrians in each image. The overlap value counts the pairs of pedestrian instances in an image whose IoU is greater than 0.5. The average number of overlaps per image is 2.4 on the CrowdHuman dataset and 0.32 on the CityPersons dataset. Together, the two datasets allow us to thoroughly evaluate the robustness of the proposed model across scenes with diverse crowding levels.

Table 1 Instance density of CrowdHuman and CityPersons datasets. The threshold for overlap statistics is IoU > 0.5 [13]

4.2 Evaluation metric

This paper mainly uses the following three indicators to evaluate the performance of the model:

AP: Average Precision is a measure jointly determined by Recall and Precision. With the value of log-average Miss Rate (MR−2) as the threshold, the maximum Precision value is established for each MR−2 value, and the average of all these Precision values is the AP value. In object detection algorithms, AP serves as a reliable indicator of the model’s Precision and Recall. Eqs. 8 to 10, which are used to determine AP, incorporate two critical quantities. Precision expresses the ratio of correctly identified targets in the detection result to all targets detected by the detector. Recall denotes the ratio of correctly identified targets to the total number of labeled Ground-truth targets. A higher AP value signifies better detector performance.

$$Precision=\frac{N\left(Positive\;samples\;by\;detector\right)}{N\left(All\;samples\;by\;detector\right)}$$
(8)
$$Recall=\frac{N\left(Positive\;samples\;by\;detector\right)}{N\left(Positive\;samples\;in\;the\;label\right)}$$
(9)
$$AP=\frac{\sum_{1}^{N}{\text{Precision}}}{N}$$
(10)
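Eqs. 8 to 10 can be illustrated with a short sketch. It assumes detections are sorted by descending confidence and already flagged as true or false positives, and that the N Precision values averaged in Eq. 10 are those obtained at each confidence rank; both the sampling choice and the function names are assumptions of this illustration, and the flags in the usage note are made-up toy data.

```python
def precision(tp, fp):
    """Eq. 8: correct detections over all detections."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, n_gt):
    """Eq. 9: correct detections over all labeled positives."""
    return tp / n_gt if n_gt else 0.0


def pr_curve(is_tp, n_gt):
    """(recall, precision) after each detection, in descending-confidence order."""
    tp = fp = 0
    curve = []
    for flag in is_tp:
        if flag:
            tp += 1
        else:
            fp += 1
        curve.append((recall(tp, n_gt), precision(tp, fp)))
    return curve


def average_precision(precisions):
    """Eq. 10: mean of the sampled Precision values."""
    return sum(precisions) / len(precisions) if precisions else 0.0
```

For toy flags `[True, True, False, True]` with four labeled pedestrians, the Precision values are 1, 1, 2/3, and 3/4, the final Recall is 0.75, and the mean Precision is about 0.854.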

MR−2: log-average Miss Rate [38]. MR−2 is the log-average miss rate over false positives per image and is commonly used to evaluate the performance of object detection algorithms. It mainly accounts for the false positive samples among the proposal boxes, and a lower value indicates better detection performance.

$$\text{MR}^{-2}=\frac{N\left(\mathrm{False}\;\mathrm{positive}\right)}{N\left(\mathrm{True}\;\mathrm{positive}\right)+N\left(\mathrm{False}\;\mathrm{positive}\right)}$$
(11)

JI: Jaccard Index [18]. JI mainly evaluates the degree of overlap between the predicted set (DT) and the Ground-truth label set (GT). The larger the value of JI, the closer the predicted result is to the Ground-truth.

$$JI=\frac{\left|DT\cap GT\right|}{\left|DT\cup GT\right|}$$
(12)
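As a small worked form of Eq. 12, the sketch below computes JI from counts. It assumes that \(\left|DT\cap GT\right|\) is the number of predicted boxes matched to a Ground-truth box and that each box is matched at most once, so the union is the sum of both set sizes minus the matches; the matching procedure itself is not shown, and the function name is illustrative.

```python
def jaccard_index(n_matched, n_dt, n_gt):
    """JI = |DT ∩ GT| / |DT ∪ GT| computed from counts (assumed one-to-one matching)."""
    union = n_dt + n_gt - n_matched
    return n_matched / union if union else 0.0
```

For example, two predictions and three Ground-truth boxes with one match give JI = 1 / (2 + 3 − 1) = 0.25.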

4.3 Implementation details

The backbone network is a ResNet-50 model pretrained on the ImageNet dataset [39]. Faster RCNN with FPN is used as the baseline model, and the original RoI Pooling [25] is replaced with RoI Align [40]. On the CrowdHuman dataset, anchor aspect ratios of H:W = {1:1, 2:1, 3:1} are employed, while on the CityPersons dataset, anchor aspect ratios of H:W = {0.5:1, 1:1, 2:1} are employed. Since the images in CrowdHuman have a wide variety of sizes, they are preprocessed to a unified size. In contrast, all images in CityPersons are the same size, so this step can be omitted. We train on CrowdHuman for a total of 30 epochs, where the learning rate is set to 10% of the original from the 24th to the 27th epoch and to 100% of the original from the 28th to the 30th epoch. On CityPersons, we train the proposed model for 25 epochs, where the learning rate is set to 10% of the original from the 18th to the 21st epoch and to 100% of the original from the 22nd to the 25th epoch. For each proposal, we assume that there are two instances.
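To show what the H:W anchor ratios mean in practice, the sketch below derives anchor widths and heights for a given ratio set. It assumes the common convention that all anchors at one scale share the same area; the base size of 64 pixels and the function name `anchor_shapes` are hypothetical values chosen for illustration, not settings reported here.

```python
import math


def anchor_shapes(base_size, hw_ratios):
    """Return (width, height) pairs with area base_size**2 and height/width = ratio."""
    shapes = []
    for r in hw_ratios:
        w = base_size / math.sqrt(r)   # wider for small H:W ratios
        h = base_size * math.sqrt(r)   # taller for large H:W ratios
        shapes.append((w, h))
    return shapes


# Hypothetical base size of 64 pixels, with the ratio sets stated above.
crowdhuman_anchors = anchor_shapes(64, [1, 2, 3])
citypersons_anchors = anchor_shapes(64, [0.5, 1, 2])
```

The taller 2:1 and 3:1 shapes used on CrowdHuman fit standing pedestrians, while the 0.5:1 ratio used on CityPersons also provides a wider-than-tall anchor.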

4.4 Detection results on CrowdHuman dataset

Ablation experiments

To comprehensively evaluate the performance of the methods described in Section 3, a substantial number of experiments are carried out on the CrowdHuman dataset. The effectiveness of the model is evaluated by the three metrics introduced in SubSection 4.2, with AP as the primary metric. Table 2 compares the methods described in Section 3 with the baseline model. Faster RCNN is used as the baseline, with IoU loss in the RPN and NMS with a threshold of 0.5 as the post-processing operation. To analyze the contribution of each proposed module separately, the components of the baseline are gradually replaced with our modules. The results show that the proposed modules clearly enhance detection performance. In particular, compared to the baseline, our model improves AP by 5.6% and JI by 5.2%. More importantly, MR−2 is reduced by 3.8%, indicating that the model generates fewer false predictions. Although the refinement module has little effect on AP and JI, its introduction results in a 1% reduction in MR−2, demonstrating that the module mainly reduces false positives.

Figure 6 shows the detection results of both our model and the baseline model on the CrowdHuman dataset. For comparison purposes, the detection results of the baseline model and our model are presented on the left and right, respectively. The number of Ground-truth boxes (GT) and the number of predicted boxes generated by the model (DT) are given under each result. Each predicted box is labeled with the confidence value of the instance it contains and drawn in a different color so that it can be distinguished in a crowded scene. The first row compares the results in a lightly occluded scene. Both methods detect all instances, but the predicted boxes generated by our model contain the full pedestrian instances. The second row compares the results in a densely occluded scene. Our model still detects all instances, whereas the baseline model produces false detections. The third row compares the results in a highly crowded and heavily occluded scene. Our model detects 79 out of 84 instances, while the baseline model detects only 59. These results indicate that our model is effective in detecting pedestrian instances under various crowding and occlusion levels.

Table 2 The results of ablation experiments on CrowdHuman dataset
Fig. 6

Visualization of detection results. The detection results of the baseline model are on the left and the detection results of our model are on the right. The GT represents the number of Ground-truth boxes and the DT represents the number of prediction boxes

Introduction of DIoU loss. Occlusion is the most challenging problem in pedestrian detection. In occluded scenes, pedestrians either occlude each other or are obscured by objects in the environment, which increases the number of false positive samples or causes information about the pedestrians to be lost. To address the low performance caused by occlusion, we introduce the DIoU loss. Specifically, DIoU uses both the overlapping area and the center distance between proposal boxes to judge whether they belong to different instances, which in turn suppresses false positive samples and mitigates the occlusion problem. In Table 2, our baseline is FPN with ResNet-50, and DIoU is the loss used for training the RPN. The AP value increases by 3.1% over the baseline after DIoU is adopted, which indicates that the DIoU loss improves detection accuracy.
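As an illustration of how the overlapping area and the center distance are combined, the following minimal Python sketch computes DIoU under its commonly used definition: IoU minus the squared center distance, normalized by the squared diagonal of the smallest box enclosing both boxes. The function names, the (x1, y1, x2, y2) box format and the `1 - DIoU` loss form are assumptions made for illustration and are not necessarily those of the implementation evaluated in Table 2.

```python
def diou(box_a, box_b):
    """DIoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlapping area (zero when the boxes do not intersect).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Squared distance between the two box centers.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
       + (max(ay2, by2) - min(ay1, by1)) ** 2

    return iou - rho2 / c2


def diou_loss(pred_box, target_box):
    """Assumed loss form: 0 for identical boxes, larger as boxes move apart."""
    return 1.0 - diou(pred_box, target_box)
```

For the two disjoint unit boxes (0, 0, 1, 1) and (2, 0, 3, 1), the IoU is 0 regardless of how far apart they are, whereas the DIoU loss is 1.4 and decreases as the centers move closer. This dependence on center distance is what makes the term informative for heavily overlapping or nearby proposals.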

Impact of different hyperparameter settings in RNMS

In pedestrian detection algorithms, the NMS threshold plays an important role in the performance of the NMS step. If the threshold is too small, the algorithm cannot distinguish all of the pedestrians, because boxes of neighboring pedestrians that overlap are suppressed. If the threshold is too large, boxes that do not correspond to distinct pedestrians survive suppression, which increases the number of false positive samples. Experiments are therefore needed to determine the optimal threshold. According to existing work [14], RNMS performs better with a threshold in [0.3, 0.5], so we also take values from this interval. Figure 7 shows how AP, MR−2 and JI change over the interval [0.3, 0.5]. Considering the three metrics together, we find that a threshold of 0.4 gives the best overall detection performance.
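The role of the threshold can be illustrated with a minimal sketch of conventional greedy NMS in plain Python. This is the standard algorithm, not RNMS; the function names and the (x1, y1, x2, y2) box format are assumptions made for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def greedy_nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, visiting boxes in descending score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already kept box
        # by more than the threshold.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two overlapping boxes (IoU = 3/7, about 0.43) and one isolated box.
boxes = [(0, 0, 10, 10), (4, 0, 14, 10), (20, 0, 30, 10)]
scores = [0.9, 0.8, 0.7]
kept_low = greedy_nms(boxes, scores, 0.3)   # second box is suppressed
kept_high = greedy_nms(boxes, scores, 0.5)  # all three boxes are kept
```

If the two overlapping boxes are two different pedestrians, the threshold of 0.3 loses one of them; if they are duplicate detections of one pedestrian, the threshold of 0.5 keeps a false positive. The threshold therefore trades missed detections against false positives.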

Fig. 7

Setting of the threshold parameter. (a) AP, (b) MR−2 and (c) JI for thresholds between 0.3 and 0.5

Comparison with various NMS algorithms

NMS algorithms are frequently used as a post-processing operation in object detection, and researchers have proposed several NMS variants to improve detection accuracy. Selecting a suitable NMS algorithm is important for pedestrian detection performance, and we propose RNMS as the post-processing operation for this purpose. Conventional NMS algorithms filter the proposal boxes based on the category confidence, while RNMS determines the optimal bounding boxes based on both the category confidence and the location information of the proposal boxes. In Table 3, RNMS is compared with NMS, Soft-NMS, Adaption-NMS and Set-NMS, with the IoU threshold of each algorithm set to its best-performing value. RNMS shows the best performance in all of AP, MR−2 and JI, which demonstrates that, as a post-processing operation, it improves detection accuracy while reducing the number of false positive samples introduced.
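To make the confidence-based filtering of the conventional variants concrete, the sketch below shows the Gaussian form of Soft-NMS, which decays the scores of overlapping boxes instead of discarding them. It is a simplified plain-Python illustration and not the implementation used for Table 3; the function names, the default `sigma` and `score_threshold` values and the (x1, y1, x2, y2) box format are assumptions.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def soft_nms_gaussian(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (kept indices, their decayed scores) in selection order."""
    scores = list(scores)  # work on a copy; scores are decayed in place
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_threshold:
            break  # every remaining box scores even lower
        keep.append(best)
        # Decay the score of each remaining box by its overlap with the
        # selected box; non-overlapping boxes keep their score (factor 1).
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
    return keep, [scores[i] for i in keep]


boxes = [(0, 0, 10, 10), (4, 0, 14, 10), (20, 0, 30, 10)]
kept, kept_scores = soft_nms_gaussian(boxes, [0.9, 0.8, 0.7])
```

In this example the overlapping second box is retained but its score drops below that of the isolated third box, so the ranking changes while no box is removed outright. Like hard NMS, this rule depends only on scores and pairwise overlap.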

Table 3 The various NMS algorithms are compared on CrowdHuman dataset. The baseline model is Faster RCNN

Comparison with existing work

For a comprehensive evaluation of our model, we compare it with three types of detection models, which are listed in Table 4. The first type consists of baseline models, such as Faster RCNN and Soft-NMS, which are widely used in detection performance evaluations. The second type consists of detection models proposed in recent years, such as R2NMS (2020), V2F-Net (2021), OAF-Net (2022) and OPLA (2023). The third type is the state-of-the-art model, Dual-Region Feature Extraction Network. Comparing against these models allows the advantage of our model in detection accuracy to be verified. As shown in Table 4, our model achieves the best MR−2 among all compared models, at 41.4%. In AP and JI, our model is only slightly inferior to Dual-Region Feature Extraction Network and superior to all other models. These results confirm that our model improves pedestrian detection in crowded scenes.

Table 4 The results of different models on CrowdHuman dataset

4.5 Detection results on CityPersons dataset

In order to further evaluate the performance of our model, we perform experiments on the CityPersons dataset as well. CityPersons is a dataset containing moderately crowded scenes.

Comparison with existing methods on CityPersons

In Table 5, our model is compared with three types of models: baseline models such as Faster RCNN and Soft-NMS; the recent models V2F-Net, CrowdDet and Repulsion Loss; and the state-of-the-art model Dual-Region Feature Extraction Network. As shown in Table 5, our model achieves the best AP, 96.8%, which is 1.6% higher than the baseline model and 0.4% higher than Dual-Region Feature Extraction Network. This suggests that our model is robust across pedestrian scenes with different crowding levels.

Table 5 The results of different models on CityPersons dataset

5 Conclusion

In crowded scenes, the degree of occlusion is an important factor affecting pedestrian detection performance. To improve detection accuracy, we propose a novel model that relocates the optimal bounding box according to the location information of the proposal boxes. The model consists of a DIoU-RPN module, a refinement module and RNMS. The DIoU-RPN module and the refinement module address the false detection problem and improve detection accuracy. RNMS addresses the missed detection problem and relocates the optimal bounding box so that it contains the complete instance. Our model is evaluated on two datasets with different crowding levels and shows clear improvements in AP, MR−2 and JI compared with the baseline and most existing models. However, the model can still be improved. As a two-stage detector, it favors detection accuracy and has no significant advantage in detection speed. In addition, high-quality pedestrian features are difficult to obtain in low-light environments, which decreases the detection accuracy of our model. These issues will be considered in our future work.