1 Introduction

Computer vision has seen major growth in fields such as robotics, medical imaging, microscopy, image retrieval, face recognition and modern industrial applications. In recent years, pedestrian detection has also become a prominent problem in computer vision applications such as semi-autonomous vehicles and self-driving cars. It is likewise an indispensable task in intelligent video surveillance systems and has a clear extension to automotive safety systems; many car manufacturers (e.g. Ford, Nissan, GM, Volvo) offered pedestrian detection as an ADAS option in 2017.

In daily life, we often drive through busy environments, heavy traffic and challenging weather conditions. Road accidents are one of the major causes of death, so safer automobiles can be designed by providing tools that inform and warn the driver about pedestrians and other relevant information, potentially saving many lives; this motivation underlies the concept of self-driving cars. For pedestrian detection, the system must differentiate multi-scale pedestrians from the other complicated objects present in image backgrounds, which requires extracting pedestrian features such as shape, colour and behaviour. The main challenges are intra-class variations of pedestrians such as clothing, backgrounds, lighting, articulation and occlusion. Detection is more precise when the expected and plausible features are extracted well from the images.

A feature is an interesting part of an image, and the goal of feature extraction is to derive information from the image and decide at every image point whether the features of a given object are present. Convolutional neural networks (CNNs) are well suited to this task, as they currently offer strong high-level feature extraction and are also useful for low-level feature detection. A deep convolutional neural network (DCNN) detector is typically divided into two parts: a base network and a detection network. AlexNet [1], VGGNet [2], Xception [3], MobileNet [4] and DenseNet [5] are widely used base networks; the base network provides the high-level features used for classification or detection. Like other base networks, MobileNet uses convolution to produce high-level features, while also reducing the number of network parameters. For image classification, a fully connected (FC) layer is the final layer of a CNN; these classification layers can be removed and replaced by a detection network. Examples of detection networks include feature-fused detectors [6], the feature pyramid network (FPN) [7], the RCNN (“Region-based Convolutional Neural Network”) series [8,9,10], the YOLO (“You Only Look Once”) series [11, 12], SSD (“Single Shot Multi-Box Detector”) [13] and Mask-RCNN (“Mask Region-based Convolutional Neural Network”) [14]. Applying SSD detection layers to the final convolutional layers yields the detection task.

Studies on pedestrian detection [15, 16] use different types of targets as well as detection networks implemented with either traditional methods or deep learning-based methods. Widely used traditional machine learning methods include the “Histogram of Oriented Gradients” (HOG) [17], Haar-like features using patterns of motion and appearance [18] and deformable models [19]. Deep learning-based methods include RCNN [8], Fast RCNN [9], Faster RCNN [10], YOLO [11], YOLOv2 [12], SSD300/512 [13] and Mask-RCNN [14]. However, these networks still need further optimization for real-time multi-scale pedestrian detection on low-end edge devices.

To improve the performance of real-time pedestrian detection, the proposed model makes full use of convolutional layers. The proposed model uses Keras [20] as the main deep learning framework with TensorFlow [21] as the backend, and various data augmentation techniques are applied to the dataset during training. Keras enables fast experimentation with deep neural networks, is easy to use and extensible, and supports distributed training on clusters of GPUs and tensor processing units (TPUs). The proposed network is trained on the Pascal Voc-2007 [22] dataset with the Adam optimizer, replacing the SGD optimizer used in the original SSD + VGG network. Finally, the proposed method is tested in real time on a low-end edge device.
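As a minimal sketch (not the authors' exact preprocessing pipeline), the snippet below illustrates the kind of augmentation mentioned above, random horizontal flips and random 300 × 300 crops, using Keras preprocessing layers; the specific layer choices and the 330 × 330 dummy input size are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Assumed augmentation pipeline: mirror images and take random 300x300 crops
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),   # pedestrians are left/right symmetric
    layers.RandomCrop(300, 300),       # random 300x300 crop for training
])

images = tf.random.uniform((4, 330, 330, 3))   # dummy batch, slightly larger than 300x300
print(augment(images, training=True).shape)    # (4, 300, 300, 3)
```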

The key contributions of the proposed work are summarized as follows:

  1. The SSD + VGG [13] model fails to detect dense pedestrians; to offset this, the shallow convolution layers conv4_3 and conv5_3 are concatenated in the proposed Optimized MobileNet + SSD network, which enriches the feature map information and in turn improves detection performance.

  2. This paper is the first to propose a concatenation feature fusion module for adding contextual information to the Optimized MobileNet + SSD network to improve pedestrian detection accuracy.

  3. The proposed method seeks out the best hyper-parameters, because fine-tuning hyper-parameters such as depth, stride, filter shape and optimizer plays a vital role in optimizing the network and also helps reduce the computational power needed during execution.

  4. Experimental results show that the proposed model outperforms existing models on the Pascal Voc-2007 test dataset, gives a better detection effect on dense and multi-scale pedestrians in low-light and darker images, and runs at 34.01 fps on the Jetson Nano board.

The rest of this paper is organized as follows: Section 2 covers related work on pedestrian detection. Sections 3 and 4 present the methods and materials and the proposed methodology for real-time pedestrian detection. Section 5 reports the findings of our model, including experiments on the low-end Jetson Nano device; finally, the results, limitations and scope for future research in the area are discussed.

2 Related work

Recently, pedestrian detection methods based on deep learning techniques have exhibited state-of-the-art (SOTA) performance. Most existing pedestrian detectors employ either a single-stage or a two-stage strategy as their backbone architecture. Liu W et al. [13] proposed SSD, which uses a single deep CNN to detect objects of various scales; it discretizes the output space of bounding boxes into a set of anchor boxes over different scales and aspect ratios at every feature map location. Szarvas M et al. [24] implemented pedestrian detection using a CNN; their method was demonstrated on a difficult pedestrian database collected in a city with no restrictions on background, pose, action, lighting or environmental conditions. Fukui H et al. [26] proposed pedestrian detection based on a deep CNN with an ensemble inference network to achieve high accuracy; to obtain such generalization, they introduced the “ensemble inference network” (EIN) and “random dropout” to perform the classification and training processes separately.

Joint deep learning for pedestrian detection was proposed by Ouyang [27]: a new deep CNN architecture in which feature extraction, deformation handling, occlusion handling and classification are learned jointly, so that their strengths are maximized within a single joint deep learning framework. To detect occluded pedestrians, Zhang [28] proposed guided attention in CNNs; this method adds an attention network with self- or external guidance to a baseline Faster RCNN detector to detect heavily occluded pedestrians.

Kuang et al. [29] implemented real-time pedestrian detection in low-illumination environments, together with distance estimation from the camera, using a smartphone-based thermal camera. The “high-level semantic feature detection” approach [30] offered a new perspective by recasting pedestrian detection as a high-level semantic feature detection task: the entire image is scanned for feature points such as corners, edges and blobs, a task to which convolution is naturally suited. Cheng et al. [31] proposed an enhanced SSD for pedestrian detection that targets small-scale objects and the dense pedestrian targets that are most often missed.

Yang F et al. [32] proposed an SSD-based online pedestrian detector for video input using a Kalman filter; by combining a fusion module with Kalman-filter post-processing, the method effectively reduces the miss rate and provides better performance at a faster speed. Afifi et al. [33] implemented robust real-time pedestrian detection with YOLOv3 on aerial images collected for the Embedded Real-Time Inference (ERTI) Challenge, achieving more than 5 frames per second (fps) on a Jetson TX2. Real-time pedestrian detection using a robust Enhanced Tiny-YOLOv3 network was proposed by C.B. Murthy [36]; this method introduces an anti-residual module to improve the network’s feature extraction and minimizes the bounding box loss error, but it fails to detect severely occluded and dense pedestrians in real time. Real-time pedestrian detection using an Improved Tiny Yolov3 network was proposed by Yi [35]; this method applies the K-means clustering algorithm to the training dataset to find the best prior bounding boxes and achieve better detection accuracy, but it fails to detect dense pedestrians in real time in low-light and darker images. To overcome this problem, the proposed model adopts a feature fusion concatenation module to improve detection accuracy when detecting dense pedestrians in real time.

3 Methods and materials

3.1 Single shot detector (SSD)

Figure 1 shows the SSD + VGG backbone network for an input image size of 300 × 300. SSD is referred to as a regression-based object detector. The model resolves the conflict between translational invariance and variability and achieves good detection accuracy as well as speed.

Fig. 1 SSD300 with VGG-16 backbone network through Conv5_3 layer

Each selected feature map location is associated with K frames that vary in size and width-to-height ratio; each frame is termed an anchor box. Figure 1 shows bounding boxes on the feature maps of different convolution layers. B class scores and four position parameters are predicted for every bounding box, so B × K × w × h class scores and 4 × K × w × h position parameters must be predicted for a w × h feature map. This requires (B + 4) × K convolution kernels of size 3 × 3 applied to the w × h feature map. The convolution result is taken as the final feature for bounding box regression and classification. The scale of the bounding boxes for every feature map is expressed mathematically as:

$$ S_{k} = S_{\mathrm{Min}} + \frac{S_{\mathrm{Max}} - S_{\mathrm{Min}}}{M - 1}\,(k - 1), \qquad k \in [1, M] $$
(1)

where M is the number of feature maps, and \(S_{\mathrm{Min}}\) and \(S_{\mathrm{Max}}\) are settable parameters. The same five aspect ratios \(a_{r} \in \{1, 2, 3, 0.5, 0.33\}\) are used to generate anchor boxes in both the training and testing experiments, so the dimensions of each bounding box can be expressed as:

$$ h_{k}^{a} = \frac{S_{k}}{\sqrt{a_{r}}} $$
(2)
$$ w_{k}^{a} = S_{k}\sqrt{a_{r}} $$
(3)

Here, \(h_{k}^{a}\) and \(w_{k}^{a}\) represent the height and width of the corresponding bounding box, respectively.

When the aspect ratio is one, an additional bounding box with scale \(s_{k}^{\prime} = \sqrt{s_{k}\,s_{k+1}}\) is added. The centre of every bounding box is located at \(\left( \frac{i + 0.5}{|f_{k}|}, \frac{j + 0.5}{|f_{k}|} \right)\), where \(|f_{k}|\) denotes the size of the k-th feature map and \(i, j \in [0, |f_{k}|)\). Figure 2 shows that the dog is matched to a bounding box in the 4 × 4 feature map but not to any bounding box in the 8 × 8 feature map; since bounding boxes at those other scales do not match the dog box, they are treated as negatives during training.
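As a worked illustration of Eqs. (1)–(3), the short sketch below computes the relative anchor scales and the corresponding box heights and widths; the values S_Min = 0.2, S_Max = 0.9 and M = 6 are illustrative assumptions (the paper treats them as settable parameters), not the exact settings of the proposed network.

```python
import math

def anchor_scales(s_min=0.2, s_max=0.9, m=6):
    # Eq. (1): linearly spaced scales S_k for k = 1..M
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def anchor_sizes(s_k, aspect_ratios=(1, 2, 3, 0.5, 1 / 3)):
    # Eqs. (2)-(3): height and width of each anchor box for scale S_k
    return [(s_k / math.sqrt(a), s_k * math.sqrt(a)) for a in aspect_ratios]

for s in anchor_scales():
    print([f"{h:.2f}x{w:.2f}" for h, w in anchor_sizes(s)])
```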

Fig. 2 SSD multiple bounding boxes for localization and confidence

The IoU, i.e. the intersection over union, can be mathematically expressed as:

$$ \mathrm{IoU} = \frac{\mathrm{Area}(C \cap D)}{\mathrm{Area}(C \cup D)} $$
(4)

If the IoU between a bounding box and a calibration (ground-truth) box exceeds 0.5, the bounding box is considered to match the calibration box for the respective category.
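A minimal sketch of Eq. (4) for axis-aligned boxes follows; the (x1, y1, x2, y2) box convention and the example coordinates are assumptions for illustration.

```python
def iou(c, d):
    # Intersection rectangle of boxes c and d, each given as (x1, y1, x2, y2)
    ix1, iy1 = max(c[0], d[0]), max(c[1], d[1])
    ix2, iy2 = min(c[2], d[2]), min(c[3], d[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_c = (c[2] - c[0]) * (c[3] - c[1])
    area_d = (d[2] - d[0]) * (d[3] - d[1])
    union = area_c + area_d - inter
    return inter / union if union > 0 else 0.0

# A default box "matches" a calibration box when IoU exceeds 0.5
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.1428...
```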

The total loss function is the sum of the bounding box regression's position loss, Lloc(r, l, g), and the classification confidence loss, and is expressed as:

$$ L(s, r, c, l, g) = \frac{1}{N}\left( L_{\mathrm{conf}}(s, c) + \alpha L_{\mathrm{loc}}(r, l, g) \right) $$
(5)

where “r” and “s” are the eigenvectors of the position loss and confidence loss, respectively; \(\alpha\) is a parameter that balances the position and confidence losses; ‘l’ is the offset, including the scaling offset of the height and width and the translational offset of the centre of the predicted boxes; ‘N’ is the number of bounding boxes matching the calibration box for the given category; and ‘g’ is the calibration box of the target’s actual position.
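The sketch below shows how Eq. (5) combines the two loss terms; the per-box loss values and the matching mask are dummy placeholders (SSD itself uses softmax cross-entropy for the confidence loss and Smooth-L1 for the localization loss [13]), so this is only an arithmetic illustration.

```python
import numpy as np

def total_loss(conf_losses, loc_losses, matched_mask, alpha=1.0):
    # Eq. (5): (L_conf + alpha * L_loc) / N over the N matched default boxes
    n = max(int(matched_mask.sum()), 1)
    l_conf = float(conf_losses.sum())                   # confidence (classification) loss
    l_loc = float((loc_losses * matched_mask).sum())    # position loss on matched boxes only
    return (l_conf + alpha * l_loc) / n

conf = np.array([0.7, 0.2, 1.1])    # dummy per-box confidence losses
loc = np.array([0.5, 0.0, 0.3])     # dummy per-box localization losses
mask = np.array([1.0, 0.0, 1.0])    # boxes matched to a calibration (ground-truth) box
print(total_loss(conf, loc, mask))  # (2.0 + 0.8) / 2 = 1.4
```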

3.2 Jetson Nano evaluation board

The Nvidia Jetson Nano board is an embedded developer kit meant for embedded systems that need high processing power for computer vision, deep learning, machine learning and image/video processing applications. The Jetson Nano has a 128-core Maxwell GPU and a quad-core ARM A57 CPU running at 1.43 GHz with 4 GB of LPDDR4 memory, giving a processing power of 472 GFLOPS. Because the board consumes less than 5 W, has built-in GPU cores and is low cost compared to other embedded boards, it is a popular platform for implementing real-time pedestrian detection algorithms.

4 Proposed methodology

4.1 Optimized MobileNet + SSD network

The MobileNet model, proposed by Google, is a base architecture highly suitable for embedded vision applications with limited computing power. The MobileNet architecture uses depthwise separable convolutions instead of standard convolutions, which significantly reduces the number of parameters compared to a network of the same depth with normal convolutions and results in a lightweight deep neural network. The activation function “ReLU” is replaced by “ReLU6”, and a “Batch Normalization” layer is included in each layer of the newly appended structure to prevent vanishing gradients. MobileNet is easy to train and takes relatively little training time, which is highly desirable for real-time implementation and makes the network more practical than VGG-16 and other available architectures.

Figure 3 shows that the Optimized MobileNet + SSD network is composed of 21 convolutional layers. The target feature layers used for detection are Conv 4_3, Conv 13, Conv 14_2, Conv 15_2, Conv 16_2 and Conv 17_2. The network enhances the information of the newly added shallow convolutional layer Conv 5_3 to detect smaller and denser objects. Since the Conv 4_3 and Conv 5_3 feature maps differ in size, the Conv 5_3 layer is followed by a deconvolution layer (2× up-sampling) to match them. A 3 × 3 × 256 convolution layer is applied to both branches, followed by normalization with scales of 10 and 20, respectively, so that the better features are fused. Finally, the two branches are concatenated and a 1 × 1 × 256 convolutional layer is applied for dimension reduction and feature recombination to generate the final fused feature map, as shown in Fig. 4.

Fig. 3 Optimized MobileNet + SSD backbone network through Conv4_3 layer (proposed method)

Fig. 4 Feature fusion concatenation module

We introduce a feature fusion concatenation module to inject contextual information into the shallower layer Conv 4_3, since this layer lacks semantic information, which is an important supplement for detecting small-scale and dense pedestrians. The detection performance on small-scale pedestrians is therefore improved by passing the semantic information captured during the convolutional forward computation back towards the shallower layers. While designing the most effective feature fusion concatenation module, we explored fusing features from different layers and finally selected the shallow layers Conv 4_3 and Conv 5_3, which introduce less background noise when detecting small-scale pedestrians; layers beyond Conv 5_3 have large receptive fields and would introduce more background noise. Figure 5 shows the framework of the proposed Optimized MobileNet + SSD network.

Fig. 5 Proposed Optimized MobileNet + SSD network framework

Applying a depthwise convolution followed by a pointwise convolution is termed a depthwise separable convolution.

Figure 6 shows MobileNet’s standard convolution layer (left) and its depthwise and pointwise separable convolution layers (right). Conv_Dw_Pw is a depthwise separable convolution structure consisting of two layers: a depthwise (Dw) layer and a pointwise (Pw) layer. The Dw layer uses 3 × 3 kernels and the Pw layer uses 1 × 1 kernels; these are also called the deep convolutional layer and the common convolutional layer, respectively. The result of each convolution is then processed by a batch normalization (BN) layer and a ReLU6 activation.

Fig. 6 Standard convolution (left) and depthwise convolution modules (right) with Batch Norm and ReLU6

The activation function ReLU6 (Rectified Linear Unit 6) is non-linear and outperforms the sigmoid function. It supports automatic adjustment of the data distribution. The activation is capped at a maximum value of 6, which increases robustness when the network is used with low-precision computation. MobileNet thereby substantially reduces complexity and the amount of computation and also speeds up the training process. The ReLU6 activation function is expressed as

$$ r(z) = \min\left( \max(0, z), 6 \right) $$
(6)
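A minimal Keras sketch of the Conv_Dw_Pw block described above and in Fig. 6, with ReLU6 realized as layers.ReLU(max_value=6.0) per Eq. (6); the filter count, stride and input shape below are illustrative assumptions, not the exact values of every MobileNet layer.

```python
from tensorflow.keras import layers, Input, Model

def conv_dw_pw(x, pointwise_filters, stride=1):
    # Depthwise 3x3 convolution -> Batch Normalization -> ReLU6
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)
    # Pointwise 1x1 convolution -> Batch Normalization -> ReLU6
    x = layers.Conv2D(pointwise_filters, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU(max_value=6.0)(x)

inp = Input(shape=(150, 150, 32))                 # assumed input feature map
out = conv_dw_pw(inp, pointwise_filters=64)
print(Model(inp, out).output_shape)               # (None, 150, 150, 64)
```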

Algorithm

The flow of the proposed algorithm is shown in Fig. 7.

Fig. 7 Flowchart of the proposed algorithm

4.2 Standard convolution

A standard (regular) convolutional layer works similarly to convolution in signal analysis. A kernel (filter) is applied to all the channels of the image; the kernel is slid across the whole image, and at each position a weighted sum of the pixels under the filter is computed across all input channels.

An important property of standard convolution is that it combines the values of all input channels. For example, if there are three input channels and a single convolution kernel is applied, the output image still has only one channel at every pixel. No matter how many channels the input has, for each input pixel the convolution writes a new output pixel with only a single channel. In standard convolution, applying a convolutional kernel K to an input feature map F produces an output feature map G.

The standard convolution output feature map is computed as:

$$ G_{R} = \sum_{Q} K_{Q,R} \times F_{Q} $$
(7)

where K_{Q,R} is the filter, Q is the number of input channels and R is the number of output channels. The standard convolution takes the input feature maps F_Q and uses zero padding (“same” fill style).
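The snippet below is a minimal check of the point above: a single standard-convolution filter combines all Q input channels into one output channel (here Q = 3 and R = 1); the toy 8 × 8 input is an assumption for illustration.

```python
import numpy as np
from tensorflow.keras import layers

x = np.random.rand(1, 8, 8, 3).astype("float32")          # Q = 3 input channels
y = layers.Conv2D(filters=1, kernel_size=3, padding="same")(x)
print(y.shape)  # (1, 8, 8, 1): one output channel per filter, regardless of Q
```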

When the input images are of size D_F × D_F with Q channels, R filters of size D_K × D_K with Q channels are needed to output R feature maps G of size D_F × D_F. The total computational cost of standard convolution is D_K × D_K × Q × R × D_F × D_F.

The depthwise convolution, which applies one filter per input channel, is expressed as:

$$ G^{\prime}_{Q} = \sum K^{\prime}_{1,Q} \times F_{Q} $$
(8)

where K′_{1,Q} is the filter.

4.3 Depthwise separable convolution (DSC)

The MobileNet architecture uses depthwise separable convolution (DSC). Standard convolution is used only once, on the very first layer; all subsequent layers use DSC.

The depthwise separable convolution (DSC) module is simply a combination of depthwise and pointwise convolution. The main difference from standard convolution is that DSC convolves each image channel separately: for an image with three channels, the depthwise step convolves each channel on its own and produces an output that also has three channels, each with its own set of weights. Because the depthwise operation is applied to a single channel at a time, it offers more precision in tasks such as edge detection and colour filtering. In depthwise separable convolution, an input feature map of dimensions D_F × D_F × Q is convolved with depthwise filters of kernel size D_K × D_K × 1 (one per channel), producing an intermediate feature map of D_F × D_F × Q; this is then processed by a pointwise (1 × 1) convolution over the Q channels to give the output feature map of D_F × D_F × R. The computational cost of the depthwise step is D_K × D_K × Q × D_F × D_F.

The purpose of the pointwise convolution is to combine the separate channels in the output of the depthwise convolution to create new features. Using the two operations gives separate filtering and combining steps, unlike standard convolution where the two are performed in a single step. Another advantage of DSC over standard convolution is that, although both ultimately filter the data and create new features, standard convolution needs much more computation and has to learn more weights, whereas DSC implements the convolution operation much more efficiently with fewer parameters. Table 1 shows the computational cost and the number of parameters required for standard convolution and depthwise separable convolution.

Table 1 Computation cost and no. of parameters required for standard convolution and DSC

Therefore, the reduction of computation cost obtained is:

$$ = \frac{D_{K} \times D_{K} \times Q \times D_{F} \times D_{F} + Q \times R \times D_{F} \times D_{F}}{D_{K} \times D_{K} \times Q \times R \times D_{F} \times D_{F}} = \frac{1}{R} + \frac{1}{D_{K}^{2}} $$
(9)

Therefore, the parameters reduce to:

$$ = \frac{D_{K} \times D_{K} \times Q + 1 \times 1 \times Q \times R}{D_{K} \times D_{K} \times Q \times R} = \frac{1}{R} + \frac{1}{D_{K}^{2}} $$
(10)

When width multiplier α is applied then the computation cost of DSC is given by:

$$ D_{K} \times D_{K} \times \alpha Q \times D_{F} \times D_{F} + \alpha Q \times \alpha R \times D_{F} \times D_{F} $$
(11)

Experiments have shown that with a 3 × 3 kernel this approach saves roughly 9 times the computation of standard convolution with little difference in detection quality. MobileNet uses up to 13 depthwise separable convolutions in a row.
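The arithmetic below reproduces the reduction factor of Eqs. (9)–(11) for an illustrative layer; Q = R = 128, D_F = 38, D_K = 3 and α = 1 are assumed values, and the ratio comes out to 1/R + 1/D_K², i.e. roughly a 9× saving for 3 × 3 kernels.

```python
def standard_cost(dk, q, r, df):
    # Standard convolution cost: D_K * D_K * Q * R * D_F * D_F
    return dk * dk * q * r * df * df

def dsc_cost(dk, q, r, df, alpha=1.0):
    # Eq. (11) with width multiplier alpha (alpha = 1 gives the plain DSC cost)
    q, r = int(alpha * q), int(alpha * r)
    return dk * dk * q * df * df + q * r * df * df

dk, q, r, df = 3, 128, 128, 38
ratio = dsc_cost(dk, q, r, df) / standard_cost(dk, q, r, df)
print(ratio, 1 / r + 1 / dk**2)   # both ~0.119, i.e. roughly a 9x reduction
```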

5 Experiments and results

5.1 Datasets

The proposed model was trained on the Pascal Voc-2007 [22] trainVal dataset and tested on both the Pascal Voc-2007 and Caltech pedestrian [23] test datasets.

5.1.1 Pascal VOC dataset

The “Pascal Visual Object Classes” (Pascal Voc-2007) dataset is a collection of images covering around 20 classes. During training, the model used the Pascal Voc-2007 trainVal set (5011 images); the test set contains 4952 images. Because the backgrounds in the Pascal dataset are complicated, the degrees of occlusion and the human postures differ, and the sizes of the humans are not the same, these images help improve the generalization of the trained network to complex real-time traffic scenes.

5.1.2 Caltech pedestrian dataset

This dataset contains a set of video sequences of 640 × 480 resolution, divided into training subsets (set 00 to set 05) and test subsets (set 06 to set 10). About 350,000 bounding boxes and 2300 unique pedestrians are annotated in 250,000 frames; only the “person” and “people” labels were used in our experiments. The dataset is challenging because of the small size of the pedestrians and the variety of occlusion cases.

5.2 Experimental setup

The experiments were carried out on a workstation during the training phase, and the testing phase was performed on both the workstation and the Jetson Nano evaluation board. Figure 8 shows the experimental setup and real-time pedestrian detection captured on the Jetson Nano evaluation board.

Fig. 8 Experimental setup and captured results on Jetson Nano evaluation board

Names                      Experimental configuration
OS                         Windows 10 Pro
CPU                        Intel Xeon 64-bit @ 3.60 GHz
RAM                        64 GB
GPU                        NVIDIA Quadro P4000, 8 GB, 1792 CUDA cores
GPU acceleration library   CUDA 10.0, cuDNN 7.4
TensorFlow                 2.x
Keras                      2.2.x

5.3 Training and evaluation metrics

The model was trained on the trainVal (5011) images of the Pascal Voc-2007 dataset and tested on the Pascal Voc-2007 test (4952) images. The input image size was set to 300 × 300, and various data augmentation techniques such as flipping, cropping and random sampling were applied to enhance training. We followed the standard evaluation methods used in [13]. The proposed method uses the ADAM (Adaptive Moment Estimation) optimizer instead of the SGD optimizer [13] while training on the Pascal Voc-2007 dataset. The hyper-parameter values used while training the proposed model are listed below.

Parameters            Values
Input size            300 × 300
Optimization method   Adam optimizer
Batch size            32
Weight decay          0.0005
Epsilon               1e-9
Beta1, Beta2          0.9, 0.999
Iteration steps       150
Learning rate         0.001
Epochs                110
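A hedged sketch of how these settings map onto a Keras Adam optimizer is given below; `build_optimized_mobilenet_ssd`, `ssd_loss` and `train_data` are hypothetical placeholders (they are not defined in the paper), the 0.0005 weight decay is typically applied as an L2 kernel regularizer on the convolution layers rather than inside the optimizer, and "Iteration steps" is interpreted here as steps per epoch.

```python
import tensorflow as tf

# Adam with the tabulated learning rate, betas and epsilon
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-9)

# Hypothetical training call using the remaining tabulated settings:
# model = build_optimized_mobilenet_ssd(input_shape=(300, 300, 3))
# model.compile(optimizer=optimizer, loss=ssd_loss)
# model.fit(train_data, batch_size=32, epochs=110, steps_per_epoch=150)
```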

6 Recall vs. precision

To evaluate the robustness of the proposed model, the commonly used evaluation metrics for real-time pedestrian detection such as recall, precision, average precision (AP), speed (fps) and memory footprint were calculated.

Recall: Recall is defined as the percentage of the total relevant results that are correctly classified by the algorithm.

$$ \mathrm{Recall} = \frac{\mathrm{True\ Positive}}{\mathrm{Actual\ Positives}} = \frac{\mathrm{True\ Positive}}{\mathrm{True\ Positive} + \mathrm{False\ Negative}} $$
(12)

Precision: Precision is defined as the percentage of predicted results that are relevant, i.e. it represents the accuracy of the predictions.

$$ \mathrm{Precision} = \frac{\mathrm{True\ Positive}}{\mathrm{Predicted\ Positives}} = \frac{\mathrm{True\ Positive}}{\mathrm{True\ Positive} + \mathrm{False\ Positive}} $$
(13)

Average Precision (AP): AP is the area under the precision–recall curve, and it shows the relationship between precision and recall at different confidence-score thresholds.

$$ \mathrm{Accuracy} = \frac{\mathrm{True\ Positive} + \mathrm{True\ Negative}}{\mathrm{Total}} $$
(14)

Figure 9 shows the precision versus recall curve obtained on the Pascal Voc-2007 test dataset. The graph shows that as the recall increases past the convergence point, the precision gradually decreases. The evaluator built on Keras runs predictions over the entire dataset, matches the predictions to ground-truth boxes, computes the precision–recall curve for the pedestrian class and samples 11 equidistant points from the curve to compute the average precision (AP) for the pedestrian class.
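As a sketch of the 11-point evaluation just described (assumed to follow the standard Pascal VOC protocol), the function below samples precision at recall levels 0.0, 0.1, ..., 1.0, taking the maximum precision achieved at or beyond each level, and averages the 11 values; the toy precision–recall curve is illustrative only.

```python
import numpy as np

def eleven_point_ap(recalls, precisions):
    # Average the best precision at each of the 11 equidistant recall levels
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recalls >= t
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 11.0

# Toy PR curve; a real curve comes from matching predictions to ground truth
r = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
p = [1.0, 0.9, 0.8, 0.7, 0.5, 0.3]
print(eleven_point_ap(r, p))  # ~0.67
```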

Fig. 9 PR curve of Optimized MobileNet + SSD network

Table 2 shows the comparison of average precision (AP) with state-of-the-art (SOTA) models obtained on the Pascal Voc-2007 test dataset.

Table 2 Comparison of average precision (AP) with the existing models. (IoU@0.5)

Comparing the results in Table 2 with existing SOTA models, the proposed model achieves better detection performance: the AP reaches 80.04% on the Pascal Voc-2007 test dataset, which is +3.24%, +1.92%, +11.5% and +6.06% higher than SSD 500 [13], Yolov3 [34], Tiny Yolov3 [34] and Improved Tiny Yolov3 [35], respectively. The evaluation also shows that the speed of Tiny-Yolov3 [34] is 220 fps, while that of the proposed model is 155 fps; however, the proposed model file is only 23.6 MB, much smaller than the Tiny Yolov3 [34] model file. The proposed model surpasses the Improved Tiny Yolov3 [35] model in both accuracy and weight file size. To test its robustness, the proposed model was also evaluated on the Caltech pedestrian test dataset and achieved competitive results.

For real-time implementation, the proposed model was tested on the low-cost Jetson Nano edge device. After training on the Quadro P4000 GPU, the whole model was tested on the Nvidia Jetson Nano evaluation board with the same software environment. In general, more CUDA cores mean higher computational power under the same memory and frequency conditions; the Jetson Nano has 128 CUDA cores, only about 1/14th of the Quadro P4000 (1792 cores), so it consumes far less energy but also has much lower computational power. To test the validity of the proposed algorithm more intuitively, we captured real-time road video under low-light conditions and fed it to the detector for verification. Figure 10 compares the detection speed (fps) of the proposed model with existing SOTA models on the Pascal Voc-2007 test dataset.

Fig. 10 Comparison of detection speed of the proposed model with the existing SOTA methods (tested on Jetson Nano board)

Using the same video for verification, the detection speed of all SOTA detectors on the Jetson Nano was far lower than on the Quadro P4000 GPU. Figure 10 shows that the proposed model runs at 34.01 fps on the Jetson Nano, which is considerably higher than SSD 512 [13], Yolov3 [34], Tiny Yolov3 [34] and Improved Tiny Yolov3 [35].

6.1 Detection results

The model is trained on low-resolution images of size 300 × 300, which leads to some loss of accuracy on low-resolution images. Nevertheless, with Optimized MobileNet as the backbone, the proposed model can detect and identify the pedestrian class with an appreciable level of accuracy. Figure 11 shows sample detections on images from both the Pascal Voc-2007 and Caltech pedestrian test datasets. The model accurately detects samples ranging from a few people to many and separates different persons without intermixing them, giving precise detection results.

Fig. 11 Detection examples on sample images from Pascal Voc-2007 and Caltech datasets

The proposed model was also tested on low-resolution images, and the detected results are shown in Fig. 12. The model has a better detection effect on dense and small pedestrians in low-light and darker images, whereas Tiny-Yolov3 [34] fails to detect pedestrians in darker images.

Fig. 12 Detection examples on low-resolution sample images from Pascal Voc-2007 and Caltech datasets

The Pascal Voc-2007 dataset contains many small objects, but since our concern is only the pedestrian class, we manually gathered around 300 images containing mainly small pedestrians to test the performance of the proposed model. Detection results of the original SSD, YOLOv3 and Optimized MobileNet + SSD models are shown in Fig. 13.

Fig. 13 Detection results of a Original SSD b YOLOv3 c Optimized MobileNet + SSD

Figure 13 clearly shows that the proposed model performs better than the original SSD [13] and YOLOv3 [34] when detecting small-scale and dense pedestrians in real time.

To test the validity of the proposed algorithm more intuitively, we captured real-time road video under low-light conditions; the test results for randomly selected frames 498, 520 and 798 on the Jetson Nano board are shown in Fig. 14. The proposed algorithm therefore adapts well to detecting pedestrians in complex environments in real-time video.

Fig. 14 Detection effect of the Optimized MobileNet + SSD network on Jetson Nano board

Figure 14 shows that the proposed model, when tested on the Jetson Nano, works well for small-scale and dense pedestrians in real time but fails to detect occluded and distant small pedestrians.

7 Conclusion and future work

Reliably detecting multi-scale pedestrians on a low-end edge device is a challenging task because of the limited resolution and information in the images, and existing SOTA models struggle to improve real-time pedestrian detection accuracy. By exploiting contextual information, the proposed model achieves a considerable improvement: a feature fusion concatenation module is introduced to add contextual information to the Optimized MobileNet + SSD network and improve pedestrian detection accuracy in real time. With the proposed model, the number of network parameters is decreased while the detection accuracy is improved compared to state-of-the-art pedestrian detectors.

The proposed model for real-time pedestrian detection was implemented effectively on the Jetson Nano evaluation board. Experimental results show that the proposed model achieves 80.04% average precision on the Pascal Voc-2007 test dataset, which is +1.92%, +11.5% and +6.06% higher than Yolov3, Tiny Yolov3 and Improved Tiny Yolov3, respectively. Since the model weight file is 23.6 MB and the model runs at 155 fps on a Quadro P4000 GPU, it is well suited to low-end edge devices; it runs about 5× faster and has a roughly 10.5× smaller weight file than the current SOTA real-time detector YoloV3. The proposed model runs in real time at 34.01 fps on the Jetson Nano board. However, it is still flawed, as it fails to detect occluded and distant small pedestrians in some frames; even so, it achieves competitive results on the Caltech pedestrian test dataset.