1 Introduction

For crowd safety and security, it is important to automatically understand the dynamics of high density crowds in a timely manner. However, automated understanding of crowd dynamics is a challenging task, and several efforts have been made in recent years to overcome its challenges. Within this area, crowd counting has gained much attention from the research community: counting the number of people and estimating their distribution in the environment provide valuable information for crowd managers.

A considerable amount of work on crowd counting in high density crowds is reported in the literature. Most existing methods treat crowd counting as a regression problem that only estimates the crowd count and avoids localizing individuals in the scene.

Pedestrian detection provides the exact location of individuals in the scene (in terms of bounding boxes), which, on the one hand, provides crucial information for crowd management and, on the other, serves as useful input for other crowd applications such as tracking, behavior understanding and anomaly detection. Despite its significance, limited work on detecting pedestrians in high density crowds is reported in the literature. The task is extremely challenging due to severe clutter and occlusion: in high density crowds, human bodies occlude one another, making it hard for a detector to learn consistent human-like features. In crowd surveillance, the camera is generally mounted overhead to provide better coverage of the scene, and in such cases the human head is the only consistently visible part. In this paper, we propose a head detection framework that learns head-like features and provides a crowd count for an input image, as shown in Fig. 1.

Some strides [3, 12, 14, 25] have been made in recent years towards learning consistent head features; however, head detection in high density crowds remains an unsolved problem due to the following challenges:

Fig. 1

Head detection results of the proposed method. The sample input frame (left) is taken from the UCF-QNRF [9] dataset and exhibits significant variation in the scales, poses and sizes of heads. Our framework detects 230/236 heads and precisely estimates their bounding boxes (sizes), handling these variations well

  1. Significant variations in the appearance of person heads.

  2. Diversity in the scales of human heads: due to perspective distortion, heads near the camera appear large, while heads far away appear small.

  3. Detecting small heads (composed of only a few pixels) in high density crowds is a challenging task for existing generic detectors.

  4. Most state-of-the-art detectors perform prediction at a down-sampled resolution, which is not suitable for fine-grained detection of human heads in high density crowds.

The first challenge can be addressed by deep neural networks, since they are translation invariant and can effectively handle pose and appearance variations. The remaining challenges, however, are inter-related and generally caused by perspective distortion. From empirical evidence, we conclude that the camera viewpoint causes perspective distortion, due to which the scale of objects changes drastically from one location in the scene to another.

Existing methods currently treat head detection as a special case of object detection. Faster-RCNN [20] and the Single Shot MultiBox Detector [15] are the most popular and frequently adopted detectors. To handle scale variation, Faster-RCNN uses anchor boxes of different sizes; however, in its current settings it cannot handle significant scale variations and therefore cannot be applied to head detection in high density crowds. The single-shot detector (SSD) [15] estimates class probabilities and bounding boxes by employing multi-scale deep features, and its multi-scale configuration detects objects at different scales. We observed that SSD works best for large objects but performs poorly in high density crowds, where object sizes are extremely small.

In this paper, our goal is to precisely predict the bounding boxes of human heads in high density crowds, irrespective of the above challenges. For this purpose, we present a novel framework that captures a broad range of scale variations in an image by splitting that range into small sets of sub-scales. Each sub-scale is modeled by a separate scale-specific network that deals with heads falling within that particular sub-scale; concretely, we use three detectors with three separate region proposal networks (RPNs). After designing a network for each sub-scale, we combine all the networks into a single network and optimize its parameters in end-to-end fashion.

Our proposed framework makes the following contributions:

  1. A novel crowd counting framework that counts the number of people in the scene by providing fine-grained detection of human heads at both high and low resolutions.

  2. Based on the superior performance achieved on benchmark datasets, the proposed framework provides an alternative to the prevalent regression-based crowd counting methods.

  3. The network efficiently integrates multiple scale-specific networks into a single end-to-end network.

  4. The framework performs inference at a single scale and avoids the computational cost of an image pyramid.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the details of the proposed method. Experimental results are discussed in Sect. 4 and computational complexity in Sect. 5. Section 6 concludes the paper and discusses future work.

2 Related work

Different methods and approaches for crowd counting and density estimation in high density crowds are reported in the literature. We broadly group them into two categories: (1) detection-based methods and (2) regression-based methods.

Currently, regression-based methods dominate the counting problem; they estimate the count via regression from crowd features to crowd count. With the tremendous success of CNNs, different models employ a CNN trained by back-propagating a count loss. However, these models do not integrate spatial information in the loss and cannot precisely localize pedestrians in crowded scenes. Zhang et al. [30] proposed a multi-column convolutional neural network (MCNN) composed of three columns with different receptive fields to handle perspective distortion. Zhang et al. [29] estimated the count from a single image with a two-configuration regression model. Sam et al. [21] proposed a switching network that chooses one among multiple CNNs based on performance. Similarly, Zhu et al. [31] proposed a patch-scale discriminant regression network (PSDR) to estimate the crowd count. Sindagi et al. [26] proposed a contextual pyramid network that generates high-quality density maps and estimates the crowd count by integrating local and global contextual information from the image. Kang et al. [11] provided a comprehensive analysis and comparison of different crowd density estimation methods.

Generally, regression-based methods capture texture information and achieve notable performance in high density crowded scenes; however, they have the following limitations. (1) They do not incorporate spatial information and therefore cannot predict the precise location and size (bounding box) of pedestrians in the scene. (2) They usually overestimate the count in low density crowded scenes.

In order to address the above problems, detection-based methods [22, 23, 25] train object detectors to predict the locations of all pedestrians in the scene; the total number of detections then represents the total pedestrian count. Besides regression- and detection-based methods, hybrid methods combining the two have also been proposed. For example, Liu et al. [14] proposed a hybrid method for crowd counting, where the framework operates in two modes, i.e., regression and detection, and dynamically decides the appropriate mode depending on scene complexity.

Our proposed framework is a detection-based method in which we train a head detector, on the premise that the head is the only consistently visible and reliable part of the human body in high density crowds. Unlike other detection methods that tackle the scale problem by generating an image pyramid, we present a novel scale-adaptive framework that splits the target scales into a set of disjoint sub-scales. Each sub-scale set is modeled by a separate scale-specific specialized network, and these networks are combined into a single backbone network whose parameters are jointly optimized in end-to-end fashion.

Fig. 2

Pipeline of the proposed framework. The framework consists of three scale-specific sub-networks with different RPNs, combined into a single backbone network. The first sub-network detects small heads, the second medium-sized heads, and the third large heads. The proposals from each sub-network are accumulated and fed to a detection network that makes the final prediction

Comparison and Difference. Our proposed framework differs in many aspects from existing detection-based methods. (1) In contrast to scale-invariant detection-based methods, our framework addresses the detection problem by training a set of specialized sub-networks with different RPNs, which enables it to capture different scale ranges in the input image. The single-shot detector (SSD) [15] is a cascaded framework in which predictions are generated at every stage to capture certain scales; samples missed in the first stage cannot be recovered in later stages, so each stage of SSD must be generalized to capture large scale variance. Unlike SSD, each scale-specific detector of the proposed framework detects human heads that fall within a certain scale range. (2) Different from [2], we integrate the set of scale-specific sub-networks into a single backbone network that is optimized end-to-end. We argue that this configuration reduces computational complexity by sharing parameters and also enhances detection accuracy by learning a discriminating representation of human heads.

3 Proposed methodology

In this section, we discuss the architecture of our proposed framework, shown in Fig. 2. The input to our framework is an image of arbitrary size, and the output is a set of bounding boxes corresponding to heads.

The backbone of the proposed framework is based on DenseNet [8] and consists of 174 layers; its dense connectivity allows us to train such a deep network while avoiding the vanishing gradient problem. The network is divided into four stages. The first stage consists of one convolutional layer with filter size \(7 \times 7\) and stride 2, followed by a pooling layer with filter size \(3\times 3\) and stride 2. It is followed by three dense stages, i.e., denseblock1, denseblock2 and denseblock3. Each dense block repeats a set of two convolutional layers: a first convolutional layer with filter size \(1\times 1\) followed by a second with filter size \(3\times 3\). As illustrated in Fig. 2, denseblock1 consists of 12 such sets, for a total of 24 convolutional layers; denseblock2 consists of \(24\times 2=48\) layers, and denseblock3 contains \(48\times 2=96\) layers. The output of each dense block passes through a transition block, which consists of one convolutional layer of filter size \(1\times 1\) followed by a pooling layer of size \(2\times 2\) with stride 2.
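
To make the stage structure concrete, the following is a minimal PyTorch sketch of such a DenseNet-style backbone. The growth rate, channel widths and the use of average pooling in the transition blocks are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One 'set' of a dense block: a 1x1 convolution followed by a 3x3 convolution."""
    def __init__(self, in_ch, growth):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 4 * growth, kernel_size=1, bias=False),
            nn.BatchNorm2d(4 * growth), nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth, growth, kernel_size=3, padding=1, bias=False))

    def forward(self, x):
        # Dense connectivity: concatenate the input with the new feature maps.
        return torch.cat([x, self.body(x)], dim=1)

def dense_block(in_ch, growth, n_sets):
    layers, ch = [], in_ch
    for _ in range(n_sets):
        layers.append(DenseLayer(ch, growth))
        ch += growth
    return nn.Sequential(*layers), ch

def transition(in_ch, out_ch):
    # 1x1 convolution followed by 2x2 pooling with stride 2 (average pooling assumed).
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
                         nn.AvgPool2d(kernel_size=2, stride=2))

class Backbone(nn.Module):
    def __init__(self, growth=32):
        super().__init__()
        # Stage 1: 7x7 convolution with stride 2, then 3x3 pooling with stride 2.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
        self.block1, ch = dense_block(64, growth, 12)       # 12 sets = 24 conv layers
        self.trans1 = transition(ch, ch // 2)
        self.block2, ch = dense_block(ch // 2, growth, 24)  # 24 sets = 48 conv layers
        self.trans2 = transition(ch, ch // 2)
        self.block3, _ = dense_block(ch // 2, growth, 48)   # 48 sets = 96 conv layers

    def forward(self, x):
        x = self.stem(x)
        f1 = self.block1(x)                # shallow: feeds RPN-1 (small heads)
        f2 = self.block2(self.trans1(f1))  # intermediate: feeds RPN-2 (medium heads)
        f3 = self.block3(self.trans2(f2))  # deep: feeds RPN-3 (large heads)
        return f1, f2, f3
```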

Deep architectures have achieved significant success in object classification and detection tasks [10, 13, 16, 20]. These detectors perform well on large objects; however, their performance degrades on small objects. Generally, they use the feature map of the last convolutional layer and employ a region proposal network (RPN) to generate multi-scale object proposals. The feature map of the last convolutional layer is small, since it is reduced step by step by a series of convolutional and pooling layers. Due to its low resolution and large receptive field, this feature map loses information about small objects, so such detectors are not suitable for objects smaller than \(32\times 32\) pixels. Note that, following the definition in [27], objects smaller than \(32\times 32\) pixels are considered small. Since heads in high density crowds are typically smaller than \(20\times 20\) pixels (approx.), these detectors are not applicable.

To detect small human heads, we observe that shallow layers have high resolution and small receptive fields and are therefore suitable for detecting small objects. Similarly, intermediate layers carry information about medium-sized objects, while the last convolutional layer (denseblock3) has a large receptive field and is helpful for detecting large objects. Unlike existing methods that use only the feature map of the last dense block, we utilize the feature maps of all three dense blocks and build three region proposal networks: the first RPN uses the feature map of denseblock1, the second that of denseblock2, and the third that of denseblock3. Through these three branches, the RPNs generate multi-scale object proposals. Each dense block combined with its RPN implements a detector with a specific receptive field size, enabling each detector to capture specific scales in an image. We set the anchor scale set of the first RPN to \(\lbrace 10, 16, 32, 56 \rbrace \), of the second to \(\lbrace 64, 96, 128, 160 \rbrace \) and of the third to \(\lbrace 165, 212, 256, 512 \rbrace \).
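
The following sketch illustrates how square (1:1) anchors for the three branches can be generated. The scale sets are those listed above, while the feature strides (4, 8 and 16 for the three dense blocks) are assumptions implied by the backbone's pooling stages.

```python
import numpy as np

SCALE_SETS = [(10, 16, 32, 56),      # RPN-1 on denseblock1 (small heads)
              (64, 96, 128, 160),    # RPN-2 on denseblock2 (medium heads)
              (165, 212, 256, 512)]  # RPN-3 on denseblock3 (large heads)

def make_anchors(feat_h, feat_w, stride, scales):
    """Return (feat_h * feat_w * len(scales), 4) square anchors as (x1, y1, x2, y2)."""
    cy, cx = np.meshgrid(np.arange(feat_h), np.arange(feat_w), indexing="ij")
    # Anchor centers in image coordinates, one per feature-map cell.
    centers = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4) * stride + stride / 2
    half = np.asarray(scales, dtype=np.float64).reshape(1, -1, 1) / 2
    offsets = np.concatenate([-half, -half, half, half], axis=-1)  # (1, S, 4)
    return (centers + offsets).reshape(-1, 4)

# e.g., anchors for the first RPN on a 200x200 feature map at stride 4:
anchors = make_anchors(200, 200, stride=4, scales=SCALE_SETS[0])
```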

For training, each RPN has its own disjoint set of training samples. Each RPN samples regions from the input image according to its pre-defined anchor scale set. For example, the first RPN samples positive and negative regions with sizes ranging from 10 px to 56 px. We assign a positive label to an anchor if its intersection-over-union (IoU) with a ground truth region is greater than 0.7; since each RPN has its own disjoint set of samples, we ignore ground truth regions whose size falls outside the anchor scale set of that particular RPN. Note that a single ground truth region may assign positive labels to multiple anchors. Negative labels are assigned to anchors with IoU less than 0.3. We also ignore anchors that do not contribute to the training loss; usually, these lie outside the boundary of the given image. Each RPN generates two types of outputs, i.e., bounding boxes and classification scores. Therefore, we use a multi-task loss function and minimize the following objective.

$$\begin{aligned} L(l_j,m_j) = \frac{1}{M_\mathrm{class}} \sum _{j=1}^N L_l(l_j,\hat{l_j}) + \varOmega \, \frac{1}{M_\mathrm{regress}} \sum _{j=1}^N L_m(m_j,\hat{m_j}) \end{aligned}$$
(1)

where N is the number of samples per mini-batch and j is the index of an anchor. \(l_j\) and \(m_j\) denote the predicted class probability and bounding box, respectively, while \(\hat{l_j}\) and \(\hat{m_j}\) denote the ground truth class label and bounding box. During RPN training, \(\hat{l_j}\) takes the value 1 for the positive class and 0 for the negative (background) class. \(L_l\) is the log loss for classification, and \(L_m\) is the regression loss. In Eq. 1, the multi-task terms are normalized by \(M_\mathrm{class}\) and \(M_\mathrm{regress}\), and \(\varOmega \) is a balancing parameter.
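
A minimal sketch of the anchor-labeling rule described above (positive above 0.7 IoU, negative below 0.3, out-of-range ground truth ignored). The rule that each ground truth additionally labels its best-matching anchor positive follows standard RPN practice and is an assumption here.

```python
import numpy as np

def label_anchors(iou, gt_sizes, scale_range):
    """iou: (num_anchors, num_gt) pairwise IoUs; gt_sizes: (num_gt,) head sizes in px."""
    lo, hi = scale_range
    keep = (gt_sizes >= lo) & (gt_sizes <= hi)          # drop GT outside this RPN's range
    labels = np.full(iou.shape[0], -1, dtype=np.int64)  # -1 = ignored
    if not keep.any():
        labels[:] = 0                                   # no in-range GT: all background
        return labels
    iou = iou[:, keep]
    best = iou.max(axis=1)
    labels[best < 0.3] = 0                              # negative (background)
    labels[best > 0.7] = 1                              # positive; one GT may label many anchors
    labels[iou.argmax(axis=0)] = 1                      # each GT labels its best anchor positive
    return labels
```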

During training, we generate a mini-batch of positive and negative samples from a single image. From empirical studies, we observed that training an RPN with all samples generated from a single image biases the network towards negative samples, since negative (background) samples greatly outnumber positive ones. To address this problem, we generate a mini-batch of 256 samples by randomly selecting positive and negative samples at a ratio of 1:1.
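
A sketch of this balanced sampling, assuming that when an image yields fewer than 128 positives the batch is padded with negatives:

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, pos_fraction=0.5, rng=None):
    """Keep at most `batch_size` anchors, positives and negatives at a 1:1 ratio."""
    rng = rng or np.random.default_rng()
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), int(batch_size * pos_fraction))
    n_neg = min(len(neg), batch_size - n_pos)  # pad with negatives if positives are scarce
    keep = np.concatenate([rng.choice(pos, n_pos, replace=False),
                           rng.choice(neg, n_neg, replace=False)])
    sampled = np.full_like(labels, -1)         # everything not kept is ignored
    sampled[keep] = labels[keep]
    return sampled
```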

We employ Xavier initialization [6] to initialize all layers. We use a learning rate of 0.001 and decrease it by a factor of 10 after every 10k iterations.
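
A hypothetical configuration matching this schedule; the choice of SGD with momentum is an assumption, as we specify only the initialization, learning rate and decay:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())  # stand-in for the full network

def xavier_init(m):
    # Xavier initialization [6] for every convolutional layer.
    if isinstance(m, nn.Conv2d):
        nn.init.xavier_uniform_(m.weight)

model.apply(xavier_init)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# Divide the learning rate by 10 every 10k iterations (call scheduler.step() per iteration).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10_000, gamma=0.1)
```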

3.1 End-to-end training

In the preceding section, we discussed how to train multiple RPNs with different scale sets. The output of these RPNs is a set of bounding boxes of different sizes with class labels. We now discuss how to utilize these scale-specific region proposals for the head detection task. More precisely, we describe the algorithm that learns, end-to-end, a network composed of multiple RPNs and a detection network. If trained independently, the RPNs and the detection network would modify and update the convolutional layers in their own ways; therefore, we need a method that allows the network to be trained end-to-end with shared convolutional layers.

The proposals obtained from the multiple RPNs have different sizes. As required by the fully connected layers, each proposal must be converted to a fixed size before being fed forward. Region-of-interest (RoI) pooling does this job: it takes the feature map of denseblock3 and the proposals obtained from the multiple RPNs as inputs, as shown in Fig. 2. For every region proposal, the RoI pooling layer extracts the corresponding patch from the feature map and converts it to a feature map of fixed size.
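
A usage sketch of RoI pooling with torchvision's operator; the 7×7 output size and the 1/16 spatial scale (mapping image coordinates onto the denseblock3 feature map) are illustrative assumptions:

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 512, 50, 50)   # denseblock3 feature map (toy channel count and size)
# Proposals in image coordinates, one row per box: (batch_index, x1, y1, x2, y2).
proposals = torch.tensor([[0., 10., 10., 90., 90.],
                          [0., 200., 120., 400., 320.]])
pooled = roi_pool(feat, proposals, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)                  # torch.Size([2, 512, 7, 7])
```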

The final prediction of bounding boxes and class labels is made by the detection network. The detection network has two sibling layers: one outputs class probabilities, and the other outputs a tuple representing the offsets of the predicted bounding boxes. Since we are optimizing two tasks, i.e., class label prediction and bounding box offset prediction, we define two loss functions. Let \({\hat{c}}\) denote the predicted class probability and c the ground truth class. We define the class loss \(L_\mathrm{class}\) as the negative log-likelihood, formulated in Eq. 2.

$$\begin{aligned} L_\mathrm{class} ({\hat{c}},c) = -\log {\hat{c}} \end{aligned}$$
(2)

The class loss \(L_\mathrm{class}\) encourages the model to predict the positive class with high probability: the logarithm penalizes confident mistakes heavily, while the negative sign makes the loss positive, since class probabilities lie in the range [0, 1] and their logarithms are negative.
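
For example, predicting the true class with probability 0.9 incurs a loss of \(-\log 0.9 \approx 0.105\), whereas predicting it with probability 0.1 incurs \(-\log 0.1 \approx 2.303\).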

The second loss \(L_\mathrm{bbox} ({\hat{b}},b)\) for predicting the offsets of bounding boxes is defined over ground truth tuple \(b = (b_x,b_y,b_w,b_h)\) and predicted tuple \({\hat{b}} = ({\hat{b}}_x,{\hat{b}}_y,{\hat{b}}_w,{\hat{b}}_h)\), where \(b_x\) and \(b_y\) represent the location, and \(b_w\) and \(b_h\) represent the width and height of the ground truth bounding box. The loss \(L_\mathrm{bbox}\) is formulated as in Eq. 3

$$\begin{aligned} L_\mathrm{bbox} ({\hat{b}},b) = \sum _{j\in \lbrace x,y,w,h \rbrace } L_1(\hat{b_j},b_j) \end{aligned}$$
(3)

where \(L_1(\hat{b_j},b_j)\) is the Huber loss, formulated in Eq. 4

$$\begin{aligned} L_1({\hat{b}},b) = \begin{cases} 0.5\,({\hat{b}} - b)^2 &\quad \text {if } \left| {\hat{b}} - b\right| < 1 \\ \left| {\hat{b}} - b \right| - 0.5 &\quad \text {otherwise} \end{cases} \end{aligned}$$
(4)

We combine the class loss \(L_\mathrm{class}\) in Eq. 2 and the bounding box loss \(L_\mathrm{bbox}\) in Eq. 3 and train the detector using the following joint loss L in Eq. 5

$$\begin{aligned} L({\hat{c}},c,{\hat{b}},b) = L_\mathrm{class} ({\hat{c}},c) + L_\mathrm{bbox} ({\hat{b}},b) \end{aligned}$$
(5)
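
A minimal sketch of this joint loss; restricting the box term to positive samples and averaging over them are standard conventions assumed here rather than stated in the text:

```python
import torch
import torch.nn.functional as F

def joint_loss(class_logits, gt_classes, pred_offsets, gt_offsets):
    """class_logits: (N, 2); gt_classes: (N,) in {0, 1}; *_offsets: (N, 4)."""
    # Eq. 2: negative log-likelihood of the true class.
    l_class = F.cross_entropy(class_logits, gt_classes)
    # Eqs. 3-4: smooth_l1_loss with its default beta=1.0 is exactly the Huber form.
    pos = gt_classes == 1
    l_bbox = F.smooth_l1_loss(pred_offsets[pos], gt_offsets[pos], reduction="sum")
    # Eq. 5: unweighted sum of the two terms.
    return l_class + l_bbox / max(pos.sum().item(), 1)
```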

3.2 Significance of joint optimization

We observe that existing similar detectors [17, 18] produce redundant detections because they train multiple detectors independently. Our framework reduces this redundancy by sharing the representation of heads among the scale-specific detectors.

Fig. 3

Evaluation of the “straightforward” method against the proposed framework on both datasets

As discussed above, our framework combines all scale-specific detectors into a single backbone network and jointly optimizes the network parameters in end-to-end fashion. To assess the significance of joint optimization, we compare against a variant of our framework that has the same network structure but does not jointly optimize the parameters end-to-end; instead, the detections obtained from each stage are simply accumulated to generate the final detections. We call this the “straightforward” method. From experiments, we observe that the “straightforward” method accumulates redundant predictions, resulting in low average precision. The comparison with the proposed framework is shown in Fig. 3.

We also reduce the number of network parameters to improve run-time efficiency: we keep a minimal number of filters in each layer of the backbone and initialize the parameters by pre-training on the ImageNet [5] dataset.

4 Experimental results

In this section, we evaluate and compare the performance of the proposed framework with existing methods both qualitatively and quantitatively. To evaluate its effectiveness, we use two benchmark datasets, i.e., the UCSD dataset [4] and UCF-QNRF [9]. These datasets include images collected from different scenes with varying camera viewpoints, illumination and densities. The UCSD dataset covers low density situations, with an average count of 25 people per frame, while UCF-QNRF covers high density crowded scenes, with an average count of 815 per frame. These datasets were carefully selected among existing benchmarks to evaluate the performance of the proposed framework in both high and low density crowds. Most existing regression-based crowd counting methods use these datasets; however, they contain only dot annotations, making them suitable only for evaluating regression-based models, and they have never been used for detection-based crowd counting. To use these datasets for the head detection problem, we annotated human heads with bounding boxes with an aspect ratio of 1:1.

To comprehensively evaluate the proposed method, we divide the experimental setup into two parts: the first discusses the detection performance of the proposed framework, and the second discusses the counting performance.

4.1 Detection performance

Detection provides crucial information by precisely localizing human heads in the scene, so a good detector must predict bounding boxes precisely. Detection quality is usually measured by intersection-over-union (IoU), which quantifies the overlap between predicted and ground truth bounding boxes. Commonly, a fixed IoU threshold of 0.5 is used; however, a single fixed threshold cannot characterize a detector's behavior across thresholds. We therefore use mean average precision (mAP), the metric predominantly used to assess detector performance. The detection performance of different methods is summarized in Table 1; note that we directly use the pre-trained models of the reference methods for comparison. The table shows that our framework achieves better results than existing methods.
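
For reference, a sketch of the IoU computation that underlies the mAP evaluation, for boxes in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```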

Table 1 Detection performance of different methods in terms of mean average precision (mAP) on the UCSD and UCF-QNRF datasets
Table 2 Performance of different detectors on the small, medium and large groups of the UCF-QNRF dataset
Fig. 4

Results of the proposed framework at different stages on sample frames from the UCF-QNRF dataset

To evaluate the effectiveness of each scale-specific detector, we categorize human heads into three groups based on their pixel height in the image: small (8–60 pixels), medium (60–160 pixels) and large (160–256 pixels). We evaluate the performance of existing methods on each group in terms of mean average precision; the results are summarized in Table 2. All detectors achieve impressive performance on the medium and large groups; however, performance degrades on the small group, leaving a considerable gap between small and medium/large head sizes. This is attributed to the small size of the heads, which occupy only a few pixels and lack appearance information.

From Table 2, it is evident that Faster-RCNN achieves lower performance than the other reference methods. This is because Faster-RCNN [20] fails to detect small objects: it uses the feature map of a high-level layer, whose large receptive field carries little information about small objects, so small heads are missed at inference. SSD [15], on the other hand, uses feature maps of both top and shallow layers to tackle the scale variance problem; feature maps from the top layers have low resolution and lack details of small objects, while the shallow layers, despite their high resolution, have less discriminating power, which ultimately leads to a significant number of false positives. FCHD [28] employs a fully convolutional network (FCN) that takes an image of arbitrary size as input and uses the feature map of the last convolutional layer to predict class labels and bounding boxes; since it relies on the last convolutional layer, it detects heads near the camera (which appear large) and misses heads far from the camera. DecideNet [8] employs two sub-networks, RegNet and DetNet, where RegNet is based on an FCN and DetNet follows the typical Faster-RCNN pipeline; for this reason, DecideNet has difficulty detecting small heads. HR [7] tackles the multi-scale problem with an image pyramid, re-scaling the image to different sizes before feeding it to the network. This method achieves comparable results; however, it suffers from the following limitations: (1) processing each pyramid level is computationally expensive, and in some cases the up-sampled image reaches 5000 pixels per side, which significantly increases inference time; (2) down-sampling the image loses information about small objects, which is why HR performs somewhat worse than the proposed method. The proposed method, in contrast, achieves state-of-the-art performance on both benchmark datasets: it solves the multi-scale problem with scale-specific detectors that cover different scale ranges and does not require an image pyramid.

Fig. 5

Performance of different methods on sample frames. The first row shows results on the UCSD dataset; the second and third rows show results on the UCF-QNRF dataset

To visualize the performance of the framework at different stages on the small, medium and large groups, we report qualitative results in Fig. 4. The figure shows that each scale-specific detector can precisely localize human heads and estimate their respective bounding boxes.

We also show the detection performance of different methods in Fig. 5. The localization performance of SD-CNN [3] is comparable to the proposed method; however, it fails to estimate the exact bounding boxes corresponding to human heads. TinyFace [2], on the other hand, accumulates a large number of redundant bounding boxes around human heads and, as is evident from the figure, also produces many false positives. Our proposed framework, in contrast, not only precisely localizes human heads but also estimates the exact sizes of their bounding boxes.

4.2 Counting performance

We next evaluate the counting performance of the proposed method and compare it with other state-of-the-art methods. Following the convention of state-of-the-art crowd counting methods, we report mean absolute error (MAE) and mean squared error (MSE). The results are given in Table 3. Regression-based methods achieve comparable performance on the UCF-QNRF dataset, since its high density images contain the regular, repetitive texture that these models capture. However, their performance degrades on the UCSD dataset, as they overestimate the count in low density crowds.
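
For clarity, a sketch of these counting metrics; note that, by the convention of the crowd counting literature, the value reported as “MSE” is the root of the mean squared count error:

```python
import numpy as np

def counting_errors(pred_counts, gt_counts):
    """MAE and (root-)MSE over per-image crowd counts."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    mae = np.abs(pred - gt).mean()
    mse = np.sqrt(((pred - gt) ** 2).mean())
    return mae, mse

print(counting_errors([230, 30], [236, 25]))  # MAE = 5.5, MSE ≈ 5.52
```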

Table 3 Counting performance of different methods in terms of MAE and MSE on the UCSD and UCF-QNRF datasets

On the other hand, detection-based methods are unable to produce the desired results on the UCF-QNRF dataset, which is attributed to small head sizes and occlusions in high density scenes. On the UCSD dataset, in contrast, detection-based methods achieve relatively high performance, since most parts of the human body are visible. Our proposed method overcomes the limitations of regression-based models by precisely detecting human heads in both low and high density crowds. As the table shows, our framework achieves better results than existing related methods.

Table 4 Computational complexity of different methods on the UCF-QNRF dataset
Table 5 Computational complexity of different methods on the UCSD dataset

5 Computational complexity

In this section, we evaluate and compare the inference speed of the proposed framework. All models are trained and tested on an NVIDIA Titan Xp GPU. We report the average inference time over randomly selected images from the UCF-QNRF dataset, whose varying (high) resolutions and densities we believe can affect inference time. We compare against related methods in terms of average precision and inference time; the results are reported in Tables 4 and 5. Table 4 shows that on the UCF-QNRF dataset the proposed framework achieves 79.56% average precision at 1.72 frames per second. YOLO achieves a comparatively high frame rate, but its average precision drops to 35.47%, significantly lower than the proposed framework. SD-CNN achieves comparable precision, but generating a large number of scale-aware proposals makes its inference time much higher than that of the other methods.

Table 5 shows that YOLO is faster than the proposed method but achieves lower average precision. Furthermore, SD-CNN again achieves comparable average precision at a high computational cost. From Tables 4 and 5, we further observe that all methods run faster but perform worse on the UCSD dataset than on UCF-QNRF. From the empirical evidence, we observe that image resolution affects both inference time and accuracy: UCSD images are of low resolution, with an average head size of around \(8\times 8\) pixels, which is why most reference methods could not precisely localize the heads.

6 Conclusion

In this paper, we proposed a unified framework to detect human heads across a wide range of scales. Our framework achieved better performance with minimal computational cost. We demonstrated through experiments that the best performance is achieved by integrating different scale-specific detectors, and that the proposed framework outperforms its counterpart, regression-based models. We also evaluated the performance of the individual scale-specific detectors in detecting human heads falling within their respective scale ranges. We hope that these encouraging results will motivate researchers to adopt detection-based models instead of regression models. These results can also provide useful input to many other crowd applications, such as tracking, crowd behavior understanding and anomaly detection.