
1 Introduction

Crowd counting based on computer vision aims to generate high-quality density maps of crowd scenes and thereby estimate the total number of people. It is widely used in public safety and video surveillance. Moreover, methods developed for crowd counting can be extended to other fields with similar tasks, including traffic control, agricultural monitoring, and cell counting.

With the rapid growth of deep learning, many CNN-based methods have achieved remarkable improvements in crowd counting. However, crowd counting remains a difficult task due to the complexity of crowd scenes, especially the large scale variation (Fig. 1).

Fig. 1. Scale variation in crowd scenes.

In recent years, numerous methods have been proposed to tackle the problem of scale variation. MCNN [31] uses filters of different sizes to handle variations in head size. CSRNet [12] adopts dilated convolutions in the back-end to extract deeper features by enlarging the receptive field. Kang et al. [11] propose an adaptive fusion feature pyramid to handle multiple scales. CAN [14] combines multiple receptive fields of different sizes and learns the correct context for each image location.

Although the above methods have achieved good performance, some deficiencies remain. On the one hand, crowd scenes exhibit large variations in size, shape, and location, and a simple multi-column structure cannot effectively extract multi-scale contextual information. On the other hand, features captured by earlier layers of a deep network contain less semantic information, so naively cascading multi-level features does not effectively address large scale variation.

To this end, we introduce an innovative deep learning framework named Gated Cascade Multi-scale Network (GCMNet) to take full advantage of multi-scale feature representations. The architecture of GCMNet is shown in Fig. 2. To obtain more comprehensive multi-scale representations and overcome the drawbacks of the multi-branch structure, we design a multi-scale contextual information enhancement module to capture the global context. It employs four parallel convolutional layers with different filter sizes and combines the features generated by these convolutions, which greatly improves the representation capability of the network. In addition, a large amount of detail information is lost during the successive feature extraction process, so we integrate pixel-level detail through a hopping cascade module, thus completing multi-level feature fusion. Furthermore, the hopping cascade module alone does not weigh the importance of the information contained in the multi-level features. By adopting a gated information selection delivery module, we can switch information in the multi-level features on and off, so that useful information is delivered adaptively and effectively.

In summary, the main contributions of our work are as follows:

  • We design a multi-scale contextual information enhancement module with multiple different sizes of convolutional filters to extract multi-scale contextual information.

  • We put forward a hopping cascade module that cascades multi-level features to reconstruct pixel-level image detail.

  • We propose a gated information selection delivery module to adaptively control information delivery between multi-level features.

Fig. 2. The overall framework of our GCMNet.

2 Related Works

In recent years, significant improvements have been achieved in crowd counting from traditional methods [3, 7] to CNN-based methods [9, 28]. In this paper, we mainly focus on three categories of CNN-based methods: multi-scale feature extraction methods, multi-level feature fusion methods, and feature-wise gated convolution methods.

2.1 Multi-scale Feature Extraction Methods

This kind of method aims to address the scale variation in crowd counting with multi-scale contextual information. Zhang et al. [31] propose a multi-column convolutional neural network to extract multi-scale features. Similarly, Sam et al. [20] put forward Switching-CNN, which uses the density variation to improve the accuracy and localization of crowd counting. Cao et al. propose SANet [1], which extracts multi-scale features with an Inception-style encoder. ADCrowdNet [13] combines multi-scale deformable convolution with an attention mechanism to construct a cascade framework. Jiang et al. [10] design a grid coding network that captures multi-scale features by integrating multiple decoding paths. In addition, spatial pyramid pooling (SPP) [5] uses pooling layers of different sizes to extract multi-scale feature maps and aggregates them into a fixed-length vector, improving robustness and accuracy; it is therefore widely used in SCNet [26], PaDNet [25], and CAN [14] for extracting multi-scale features.

In this paper, we utilize four parallel convolutional layers to extract multi-scale features and fuse them, which reduces the redundancy arising from the multi-branch structure.

2.2 Multi-level Feature Fusion Methods

Several recent works on complex and dense prediction tasks have demonstrated that features from multiple layers help produce better results. Deeply encoded features contain semantic information about objects, while shallowly encoded features preserve more spatially detailed information. Several studies on crowd counting [15, 23, 31] have attempted to use features from multiple levels of convolutional neural networks for more accurate information extraction. Many studies [15, 31] predict an independent result at each stage and finally fuse them to obtain multi-scale information. Sindagi et al. [23] introduce a multi-level bottom-top and top-bottom fusion method to combine shallow and deep information.

Different from the above methods, we propose a hopping cascade module that performs multi-level feature fusion through hopping cascades, so that the pixel-level image details lost during feature extraction can be recovered.

2.3 Feature-Wise Gated Convolution Methods

The introduction of gating mechanisms into convolutions has also been extensively studied in language, vision, and speech. Dauphin et al. [2] effectively alleviate gradient vanishing by using gated linear units while retaining nonlinearity. Oord et al. [18] employ a selective-pass mechanism to improve performance and convergence speed. Yu et al. [29] propose an end-to-end generative image inpainting system based on gated convolution to improve restoration with free-form masks and user-guided inputs. WaveNet [17] applies gated activation units to audio sequences to model audio signals and obtains better results.

In this study, we propose a gated information selection delivery module to adaptively control the information delivery between multi-level features during the hopping cascade.

3 Proposed Algorithm

In this section, we outline the overall framework of our GCMNet and describe the design of each module in detail.

3.1 Overview of Network Architecture

The overall framework is shown in Fig. 2. Following the practice of most previous work, we adopt VGG-16 [22] as the backbone network and choose the first five stages (\(Layer_1-Layer_5\)) of the pre-trained VGG-16 to generate the hopping features at five levels, which are represented as \(F^e=\{f_i^e,i=1,\ldots ,5\}\). After \(Layer_5\), we add the Multi-scale Contextual Information Enhancement Module (MCIEM) consisting of multiple convolutional layers with different sizes of filters to capture global context information. Afterwards, to reconstruct the pixel-level image detail information that is lost in the successive feature extraction, we propose the hopping cascade module to cascade the hopping features \(F^e\) with the upsampling features \(F^d=\{f_i^d,i=1,\ldots ,5\}\) generated by upsampling operations. Moreover, we design the Gated Information Selection Delivery Module (GISDM) to control the delivery of the pixel-level image detail information in \(F^e\) with the aim of effectively integrating the multi-level features in the cascade process.
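For concreteness, the following PyTorch-style sketch shows how the five-stage hopping features \(F^e\) could be extracted from a pre-trained VGG-16. It assumes the torchvision implementation of VGG-16; the exact stage boundaries are illustrative assumptions rather than the paper's specification.

```python
import torch.nn as nn
from torchvision import models

class VGGBackbone(nn.Module):
    """Extract five stages of hopping features f1^e..f5^e from pre-trained VGG-16.

    The split points below are assumptions for illustration; the paper only
    states that the first five stages of VGG-16 are used.
    """
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        feats = list(vgg.features.children())
        # Assumed stage boundaries: each later stage starts with a max-pooling layer.
        self.stage1 = nn.Sequential(*feats[:4])     # conv1_1..conv1_2
        self.stage2 = nn.Sequential(*feats[4:9])    # pool1, conv2_1..conv2_2
        self.stage3 = nn.Sequential(*feats[9:16])   # pool2, conv3_1..conv3_3
        self.stage4 = nn.Sequential(*feats[16:23])  # pool3, conv4_1..conv4_3
        self.stage5 = nn.Sequential(*feats[23:30])  # pool4, conv5_1..conv5_3

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)
        return [f1, f2, f3, f4, f5]  # hopping features F^e
```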

3.2 Multi-scale Contextual Information Enhancement Module

It is observed that output features fused from parallel convolutions contain more image details than features generated by successive convolution operations. Therefore, we propose the MCIEM to capture global context information. The module consists of four parallel convolutional layers with filters of different sizes \(k \in \{3,7,11,15\}\) and four max-pooling layers. The details of the MCIEM are given in Fig. 3.

Fig. 3. Details of MCIEM.

First, the feature \(f_5^e\) extracted by the backbone network is taken as the input to the MCIEM. Then four parallel convolutions with receptive fields of 3 \(\times \) 3, 7 \(\times \) 7, 11 \(\times \) 11, and 15 \(\times \) 15 are used to extract multi-scale features. Finally, these features are each fed into a 2 \(\times \) 2 max-pooling layer and then fused to obtain more comprehensive contextual features. With the MCIEM, the multi-scale features can encode richer contextual information.
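A minimal sketch of the MCIEM is given below, assuming a PyTorch implementation. The branch width and the fusion by concatenation followed by a 1 \(\times \) 1 convolution are assumptions; the paper only specifies the four kernel sizes and the 2 \(\times \) 2 max pooling.

```python
import torch
import torch.nn as nn

class MCIEM(nn.Module):
    """Sketch of the Multi-scale Contextual Information Enhancement Module.

    Four parallel convolutions with kernel sizes 3/7/11/15, each followed by
    2x2 max pooling; their outputs are fused by concatenation and a 1x1
    convolution (the fusion choice and branch width are assumptions).
    """
    def __init__(self, in_channels=512, branch_channels=128):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=k, padding=k // 2),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2, stride=2),
            )
            for k in (3, 7, 11, 15)
        ])
        # Fuse the four branches back to the input channel width.
        self.fuse = nn.Conv2d(4 * branch_channels, in_channels, kernel_size=1)

    def forward(self, f5_e):
        feats = [branch(f5_e) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))
```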

3.3 Hopping Cascade

Though MCIEM can extract effective contextual information through multi-scale features, some pixel-level image detail information is lost in this extraction process. Therefore, we introduce the hopping cascade module to reconstruct the lost pixel-level image detail information.

Specifically, after the MCIEM, we apply 32-fold bilinear upsampling operations in \(H_1-H_{5}\) to generate the upsampling features \(F^d=\{f_i^d,i=1,\ldots ,5\}\). The lost pixel-level image detail information is then reconstructed by cascading \(F^e\) with \(F^d\). Our cascade module takes the hopping features \(f_3^e\), \(f_4^e\), \(f_5^e\) and the upsampling features \(f_3^d\), \(f_4^d\), \(f_{5}^d\) as input. The cascade process is given by the following equation.

$$\begin{aligned} H_i=ReLU(Conv(f_i^e;\theta ))+ReLU(Conv(f_{i}^d;\theta )) \end{aligned}$$
(1)

where \(Conv (*;\theta )\) is a convolutional layer with parameters \(\theta = \{W, b\}\) and ReLU(\(\cdot \)) is the ReLU activation function. \(f_i^e\) and \(f_{i}^d\) are parallel multi-level features of the same size.
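Eq. (1) can be sketched as follows, assuming PyTorch; the 3 \(\times \) 3 kernel size and channel widths are assumptions.

```python
import torch.nn as nn

class HoppingCascade(nn.Module):
    """Sketch of Eq. (1): H_i = ReLU(Conv(f_i^e)) + ReLU(Conv(f_i^d)).

    The two inputs are assumed to share the same spatial size, as stated in
    the text; channel widths are illustrative assumptions.
    """
    def __init__(self, enc_channels, dec_channels, out_channels):
        super().__init__()
        self.conv_e = nn.Conv2d(enc_channels, out_channels, kernel_size=3, padding=1)
        self.conv_d = nn.Conv2d(dec_channels, out_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f_e, f_d):
        return self.relu(self.conv_e(f_e)) + self.relu(self.conv_d(f_d))
```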

3.4 Gated Information Selection Delivery Module

The pixel-level image detail information is reconstructed by the hopping cascade module, but not all of it contributes to accurate crowd counting. Therefore, we propose the GISDM to deliver this information through adaptive selection; it consists of a residual block and a gated function, as shown in Fig. 4.

Fig. 4. Details of GISDM.

In our implementation, we feed the hopping features into a residual block to improve their representation ability, and the result is expressed as \(G_i\):

$$\begin{aligned} G_i=Res(ReLU(Conv(f_i^e;\theta ))) \end{aligned}$$
(2)

where \(Res(*)\) represents the residual block.

Additionally, we introduce a gated function to further calibrate this information and achieve adaptive delivery of pixel-level detail information, instead of indiscriminately delivering all information among the multi-level features. The gated function is essentially a convolutional layer with a sigmoid activation whose output lies in the range [0, 1]. Let \(GF (x; \theta )\) denote the gated function:

$$\begin{aligned} GF(x;\theta ) =Sig (Conv(x;\theta )) \end{aligned}$$
(3)

where Sig(\(\cdot \)) represents the sigmoid function and \(Conv (x; \theta )\) is a 1\(\times \)1 convolutional layer with the same number of channels as x.

With the gated function, \(G_i\) can be rewritten as:

$$\begin{aligned} G_i=GF(G_i;\theta )\otimes Res(ReLU(Conv(f_i^e;\theta ))) \end{aligned}$$
(4)

where \(\otimes \) represents an element-wise product.

Therefore, \(H_i\) is reformulated as:

$$\begin{aligned} H_i=Conv(G_i;\theta )+ReLU(Conv(f_{i}^d;\theta )) \end{aligned}$$
(5)

where \(G_i\) denotes the updated features after the GISDM is applied.
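Putting Eqs. (2)-(5) together, the GISDM could look like the following sketch, assuming PyTorch; the internal layout of the residual block \(Res(*)\) and the channel widths are assumptions.

```python
import torch.nn as nn

class GISDM(nn.Module):
    """Sketch of the Gated Information Selection Delivery Module (Eqs. (2)-(5)).

    A residual block refines the hopping feature, a 1x1 convolution with a
    sigmoid produces the gate, and the gated feature is added to the
    upsampling branch. The two-layer residual block is an assumption.
    """
    def __init__(self, channels):
        super().__init__()
        self.entry = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.res = nn.Sequential(                       # assumed body of Res(*)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.gate = nn.Sequential(                      # GF(x) = Sig(Conv(x)), Eq. (3)
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_d = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f_e, f_d):
        x = self.entry(f_e)                              # ReLU(Conv(f_i^e))
        g = x + self.res(x)                              # Eq. (2): residual block with identity shortcut
        g = self.gate(g) * g                             # Eq. (4): element-wise gated selection
        return self.out_conv(g) + self.relu(self.conv_d(f_d))  # Eq. (5)
```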

4 Experiments

In this section, we first describe the four widely used datasets and the implementation settings. We then compare our method with state-of-the-art methods in terms of counting performance and density map quality. Finally, we perform an extensive ablation study to demonstrate the effectiveness of each component of our method.

4.1 Datasets

ShanghaiTech Dataset [31]. The ShanghaiTech dataset is composed of Part A and Part B. Part A includes 482 images randomly crawled from the Internet, representing highly crowded scenes; it is divided into training and test sets. Part B is acquired from surveillance cameras on commercial streets and represents relatively sparse scenes, with 400 images in the training set and 316 images in the test set.

UCF_CC_50 Dataset [7]. The UCF_CC_50 dataset is highly challenging. The training samples are limited: it contains only 50 annotated images of complex scenes collected from the Internet. The number of people per image varies widely, ranging from 94 to 4,543. There are a total of 63,974 head annotations, with an average of 1,280 per image.

UCF-QNRF Dataset [8]. The dataset contains 1,535 high-resolution images with 1,251,642 head annotations, more than any of the previous datasets. The number of people in each image varies from 49 to 12,865. The training and test sets contain 1,201 and 334 images, respectively.

WorldExpo’10 Dataset [30]. This dataset includes 1,132 annotated video sequences from 103 different scenes, captured by 108 surveillance cameras at the 2010 Shanghai World Expo. There are 3,980 annotated frames with a total of 199,923 annotated pedestrians, of which 3,380 frames are used for training and the remaining 600 for testing.

4.2 Settings

Ground Truth Generation. We generate ground truth density maps following the same strategy as MCNN [31]: each human head annotation is blurred with a normalized Gaussian kernel to produce the ground truth density map F(x).

$$\begin{aligned} F(x)= \sum _{i=1}^{N} \delta (x-x_i )\times G_{\sigma _i} (x), with\ \sigma _i=\beta \overline{d^i} \end{aligned}$$
(6)

where N represents the number of people in the image, x is the position of the pixel in the image, \(x_i\) represents the labeled position of the \(i^{th}\) individual, \(\delta (x - x_i)\) denotes a head annotation at pixel \(x_i\), \(G_{\sigma _i}\) represents a Gaussian kernel with standard deviation \(\sigma _i\), and \(\overline{d^i}\) represents the average distance between \(x_i\) and its nearest k heads. In our implementation, we set \(\beta \) = 0.3 and \(\sigma _i\) = 3.
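A minimal sketch of the geometry-adaptive density map generation of Eq. (6) is shown below, using NumPy/SciPy. The neighbour count k = 3 and the handling of images with very few heads are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import cKDTree

def density_map(points, height, width, beta=0.3, k=3):
    """Sketch of Eq. (6): place a unit impulse at each head annotation and blur
    it with a Gaussian whose sigma is beta times the mean distance to the k
    nearest neighbouring heads. Assumes more than k annotated heads per image;
    boundary handling is simplified.
    """
    density = np.zeros((height, width), dtype=np.float32)
    pts = np.asarray(points, dtype=np.float64)
    if pts.size == 0:
        return density
    tree = cKDTree(pts)
    dists, _ = tree.query(pts, k=k + 1)   # column 0 is the point itself
    for (x, y), d in zip(pts, dists):
        sigma = beta * float(np.mean(d[1:]))
        delta = np.zeros_like(density)
        delta[min(int(y), height - 1), min(int(x), width - 1)] = 1.0
        density += gaussian_filter(delta, sigma)  # normalized Gaussian kernel
    return density
```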

Evaluation Metrics. To evaluate the performance of our method, we adopt the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), defined in Eq. (7) and Eq. (8), respectively.

$$\begin{aligned} MAE= \frac{1}{N} \sum _{i=1}^{N} |C_i^{ES} - C_i^{GT}| \end{aligned}$$
(7)
$$\begin{aligned} RMSE= \sqrt{\frac{1}{N} \sum _{i=1}^{N}(C_i^{ES}- C_i^{GT})^2} \end{aligned}$$
(8)

where N is the total number of the test images, \(C_i^{ES}\) and \(C_i^{GT}\) are the estimated and ground-truth counts of the \(i^{th}\) image, respectively.

MAE and RMSE reflect the accuracy and the robustness of crowd counting, respectively; lower values indicate better counting performance.
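Both metrics can be computed directly from per-image counts, for example:

```python
import numpy as np

def mae_rmse(estimated, ground_truth):
    """Compute MAE (Eq. (7)) and RMSE (Eq. (8)) from per-image counts."""
    est = np.asarray(estimated, dtype=np.float64)
    gt = np.asarray(ground_truth, dtype=np.float64)
    mae = np.mean(np.abs(est - gt))
    rmse = np.sqrt(np.mean((est - gt) ** 2))
    return mae, rmse

# Illustrative usage with made-up counts for three test images:
# mae_rmse([105.2, 48.7, 230.1], [100, 50, 225])
```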

In addition, the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) in images are exploited to evaluate the quality of the output density maps.

The PSNR is defined as:

$$\begin{aligned} P S N R=10 \times \log _{10}(\frac{M A X_{I}^{2}}{M S E}) \end{aligned}$$
(9)

where \(MAX_I\) is the maximum possible pixel value of the images. The SSIM is defined as:

$$\begin{aligned} SSIM(x,y)= \frac{(2\mu _x\mu _y + C_1)(2\sigma _{xy}+C_2)}{(\mu _x^2+\mu _y^2+C_1)(\sigma _x^2+\sigma _y^2+C_2)} \end{aligned}$$
(10)

where \(\mu _x\) and \(\mu _y\) denote the mean values of images x and y, respectively, \(\sigma _x^2\) and \(\sigma _y^2\) denote their variances, and \(\sigma _{xy}\) is the covariance of x and y. \(C_1\) and \(C_2\) are two constants defined as:

$$\begin{aligned} \left\{ \begin{array}{l} C_{1}=\left( K_{1} \times L\right) ^{2} \\ C_{2}=\left( K_{2} \times L\right) ^{2} \end{array}\right. \end{aligned}$$
(11)

where \(K_1\) = 0.01, \(K_2\) = 0.03, L = 255.

PSNR measures the pixel-wise error between two images; the higher its value, the better the quality of the density map. SSIM measures the similarity between the predicted density map and the ground truth in terms of brightness, contrast, and structure; the higher its value, the smaller the image distortion.
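For illustration, the following sketch computes PSNR from Eq. (9) and a global-statistics version of SSIM from Eq. (10); standard SSIM is evaluated over local windows, so the SSIM function here is a simplification.

```python
import numpy as np

def psnr(pred, gt, max_val=255.0):
    """PSNR from Eq. (9); inputs are maps scaled to the same value range."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, k1=0.01, k2=0.03, L=255.0):
    """Global-statistics version of Eq. (10), a simplification for illustration."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```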

Implementation Details. We use the pre-trained VGG-16 to initialize the parameters of the first five stages of our model; the parameters of the other convolutional layers are initialized randomly from a Gaussian distribution with a standard deviation of 0.01. Both upsampling and downsampling operations are implemented with bilinear interpolation. We use the Adam optimizer to train the network for 200 epochs, with an initial learning rate of 1e−5. The network is trained by minimizing the Euclidean distance between the estimated density map and the ground truth. The loss function is defined as:

$$\begin{aligned} L(\varTheta )= \frac{1}{2N} \sum _{i=1}^{N}||F(X_i;\varTheta )-D_i ||_2^2 \end{aligned}$$
(12)

where N is the number of training images, \(X_i\) is the \(i^{th}\) input image, \(F(X_i;\varTheta )\) denotes the estimated density map, \(D_i\) represents the ground truth density map.
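The loss of Eq. (12) and the training setup described above can be sketched in PyTorch as follows.

```python
import torch

def density_loss(pred, gt):
    """Euclidean loss of Eq. (12): mean over the batch of half the squared
    L2 distance between predicted and ground-truth density maps."""
    n = pred.size(0)
    diff = (pred - gt).view(n, -1)
    return 0.5 * torch.mean(torch.sum(diff ** 2, dim=1))

# Training setup as described in the text: Adam with an initial learning rate of 1e-5.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```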

4.3 Comparisons with the State-of-the-Art

ShanghaiTech. We compare our method with several state-of-the-art methods; the results are listed in Table 1. On Part A, our method improves MAE by 4.28% and RMSE by 4.46% over the second-best result. On Part B, it improves MAE and RMSE by 4.31% and 4.61%, respectively, over the second-best result.

Table 1. Comparisons of GCMNet and state-of-the-art methods on three datasets.

UCF_CC_50. The UCF_CC_50 dataset is extremely challenging, and we evaluate our method with 5-fold cross-validation [12]. As shown in Table 1, compared with the current state-of-the-art methods, ours achieves a very significant improvement, reducing MAE and RMSE by 18.57% and 18.69%, respectively, relative to the recent CFANet. Despite the limited training samples, our method converges well on this dataset.

UCF-QNRF. Table 1 shows the MAE and RMSE of our method as well as the state-of-the-art methods on the UCF-QNRF dataset. The proposed method is compared with eight methods and yields the best performance on this dataset, improving MAE over the second-best method by 2.19% and RMSE by 2.69%.

WorldExpo’10. Our method is compared with six state-of-the-art methods. Table 2 gives the comparison results of MAE for each scene. Our proposed method obtains the best performance in scene 1 (sparse crowd, S1) and scene 4 (dense crowd, S4), and also achieves the best average MAE.

Table 2. Comparisons of GCMNet and state-of-the-art methods on WorldExpo’10.
Fig. 5. Sample results of the GCMNet on the ShanghaiTech dataset. The first row shows sample input images, the second row shows the ground truth for each sample, and the third row presents the density maps generated by GCMNet. The number in each density map denotes the estimated count.

In this section, we first conduct experiments on four datasets and then compare our model quantitatively with several state-of-the-art methods. The results show that our method achieves the best performance on the ShanghaiTech, UCF_CC_50, and UCF-QNRF datasets, and outperforms several current state-of-the-art methods on the WorldExpo’10 dataset. The predicted density maps on the ShanghaiTech dataset are also compared with the ground truth, as shown in Fig. 5. The figures show that our method performs well in different scenes: whether the crowd is highly dense or sparse, it effectively addresses the scale variation in crowd counting and uses multi-scale features for accurate counting.

4.4 Comparison of Density Map Quality

In this section, we compare our method with other representative methods (MCNN, CP-CNN, CSRNet, CFF, and SCAR) in terms of PSNR and SSIM.

Table 3. Comparisons of PSNR and SSIM of GCMNet and representative methods on ShanghaiTech Part A.

As shown in Table 3, our GCMNet achieves the highest PSNR and SSIM. In particular, we obtain a PSNR of 28.66 and an SSIM of 0.84 on the ShanghaiTech Part A dataset. Compared with SCAR, the PSNR and SSIM are improved by 19.77% and 3.70%, respectively. The results show that our method has a significant advantage in generating high-quality density maps.

4.5 Ablation Study

In this section, we conduct an ablation study on the ShanghaiTech dataset to verify the effectiveness of each module in our network and to analyze the impact of different component combinations on counting performance.

Table 4. Results of ablation study on ShanghaiTech Part A and Part B datasets.

We use four different combinations to test our model:

  (1) VGG-16: the first 13 layers of VGG-16 with 32-fold upsampling operations at the end.

  (2) VGG-16+MCIEM: the first 13 layers of VGG-16 with the MCIEM for extracting multi-scale contextual information and 32-fold upsampling operations at the end.

  (3) VGG-16+MCIEM+Hopping Cascade: the first 13 layers of VGG-16 with the MCIEM for extracting multi-scale contextual information and the hopping cascade module for cascading the hopping features \(f_3^e\), \(f_4^e\), \(f_5^e\) with the upsampling features \(f_3^d\), \(f_4^d\), \(f_{5}^d\).

  (4) VGG-16+MCIEM+Hopping Cascade+GISDM: our proposed method.

The results of the ablation study are given in Table 4. Directly using VGG-16 for feature extraction does not yield satisfactory performance. After injecting the MCIEM into the network for multi-scale feature extraction, the counting error is greatly reduced. Adding the hopping cascade module brings a further substantial improvement. Finally, the embedded GISDM performs adaptive information delivery, which further improves the counting results. In conclusion, our final model achieves the best performance, each added component is effective and complementary to the others, and the counting results improve markedly in both high-density and low-density scenes. Figure 6 gives the stage-wise density maps on the ShanghaiTech Part B dataset during the ablation study; our final model corrects the earlier missing (yellow circles) and redundant (red circles) counts, effectively addressing the problem of scale variation, and produces accurate density estimates and high-quality density maps.

Fig. 6. Stage results of the ablation study on the ShanghaiTech Part B dataset. (a) Input image, (b) Ground truth, (c) Baseline (VGG-16), (d) VGG-16+MCIEM, (e) VGG-16+MCIEM+Hopping Cascade, (f) Ours. The number in each density map denotes the count. The yellow and red circles mark the missing and redundant counts of the baseline method, respectively.

5 Conclusion

This paper proposes a novel end-to-end Gated Cascade Multi-scale Network (GCMNet), which effectively addresses the problem of large scale variation in crowd counting. With the MCIEM, GCMNet captures global context at multiple scales. We then introduce a hopping cascade module to make full use of pixel-level image detail information, and design a GISDM to selectively integrate multi-level features by adaptively delivering valid information. Finally, the multi-level features are used to generate the final density maps. Extensive experiments on four datasets show that GCMNet is superior under different evaluation metrics. In the future, we will explore better methods for multi-scale feature extraction and effective integration of multi-level features.