1 Introduction

As a challenging computer vision task, crowd counting aims to obtain rich crowd features and generate an estimated density map for a crowd image. With the development of deep learning, a large number of counting models have been proposed. Although substantial progress has been made, many factors still hinder the acquisition of crowd features. Besides complex backgrounds, scale variation within and across images, and occlusion in dense crowds, some scenes suffer from illumination so poor that pedestrians are difficult to detect. Most cameras cannot work properly in dark environments, which leads to a sharp decline in the performance of many methods [1]; other types of visual sensors (such as infrared cameras) have therefore been widely adopted as a supplement to RGB cameras to overcome this difficulty. As a result, research combining multi-modal representation learning with crowd counting has developed further.

Early researchers mainly exploited the complementarity of RGB and depth images to complete the counting task. For example, Lian et al. collected the ShanghaiTech RGB-D dataset and used a depth-adaptive kernel that accounts for head-size variation to improve the quality of density maps [2]; depth-aware anchors were also used in their detection framework to better initialize anchor sizes. Zhao et al. adopted a bifurcated backbone strategy to regroup multi-level features into teacher and student features, mined informative depth cues from channel and spatial views, and finally fused the RGB and depth modalities in a complementary way [3]. These methods mainly generate pixel-level crowd density maps from rich RGB and depth information. However, in some unconstrained scenes, such features alone may not identify semantic objects accurately. Under poor lighting conditions (such as backlight and night), it is difficult to detect pedestrians directly from RGB images, and objects that closely resemble pedestrians are easily mistaken for them when only optical features are used [4]. References [5,6,7] use multi-modal data by directly fusing features or feeding them jointly into a deep neural network, concatenating the representations of all modalities for supervised training, but such schemes cannot make good use of the complementary information between modalities. Recently, Liu et al. released the RGBT-CC benchmark and proposed a cross-modal collaborative representation learning framework for crowd counting [8]. The framework includes two modality-specific branches, a modality-shared branch, and multiple information aggregation and distribution modules, and can capture the complementarity between modalities well. However, it does not fully consider the different characteristics of RGB and thermal data, so the information needs to be further enriched and the expression of modality-specific features strengthened.

As seen from the literature, RGB images may fail to identify semantic objects accurately in unconstrained scenes. Thermal images are not affected by illumination, but hard negative objects in them are difficult to eliminate. Meanwhile, depth maps of outdoor scenes are coarse, so they have limited applicability. In general, RGB and thermal information can complement each other, but the complementarity between multi-modal data is not easy to capture. To capture the complementary features of different modalities, we follow the idea of treating RGB and thermal images as modality-specific branches proposed in [8] and add a crowd-region recognition design to the RGB modality, realizing separated yet coordinated representation learning of the multi-modal streams under a region constraint. Considering the influence of background and the identification of crowd regions, we combine multi-scale features with region recognition so that different regions receive adaptive attention, and we adopt residual learning and skip connections. To reduce overfitting and facilitate gradient back-propagation, we add multiple output layers in the multi-modal branches and back-propagate the aggregated loss for multi-level supervision. In addition, we adopt the idea of [9] in the loss function to cope with occlusion, perspective effects, shape changes, and other difficulties in the scene: a density contribution probability model is constructed from the point annotations, so that the count expectation of each annotation point is supervised more reliably. In summary, we propose a crowd counting method based on cross-modal coordinated representation and multi-level supervision.

The rest of the article is organized as follows. Section 2 briefly reviews crowd counting research based on deep neural networks. Section 3 introduces the proposed method. Section 4 reports the performance of the method in a series of experiments on the RGBT-CC benchmark [8]. Finally, Sect. 5 concludes the paper.

2 Related work

At present, most crowd counting methods extract features from RGB images of crowd scenes and generate estimated density maps. They can be divided into regression-based models [9,10,11,12,13,14] and models with additional detection [15,16,17,18]. For instance, Liu et al. proposed a pooling pyramid to extract features at complementary scales and adaptively assign different weights to different scales and regions [10]. The lightweight hierarchical network of Jiang et al. effectively combines high-level and low-level features [11]. The conditional random fields developed by Liu et al. enable the features at each scale to obtain information from the other scales [12]. Liu et al. addressed the reduced counting accuracy in highly congested and noisy scenes by adding an attention mechanism and multi-scale deformable convolutions to the network [13]. Dong et al. used an improved encoder-decoder structure and an advanced loss function to mine features between adjacent scales and cope with scale changes [14], while Ma et al. proposed the Bayesian loss, which builds a density contribution probability model from point annotations [9]. When crowd density is low, detection-based methods are more effective than regression-based methods; when crowd density is high, the opposite holds. To exploit the advantages of both, Liu et al. designed two sub-networks for detection and regression, respectively, and fused them with an attention mechanism [15]. However, this method is only suitable for low-density datasets, because bounding-box labeling is very expensive in crowded scenes. Liu et al. therefore suggested initializing pseudo bounding boxes from the point annotations and updating them online during training [16]. Moreover, Liu et al. employed an additional localization branch to detect head positions and adapt scales in hard-to-identify regions [17]. Furthermore, Rong et al. proposed a three-branch network for feature extraction, region recognition, and crowd-density judgment [18]. The method simulates the steps a person takes when observing a crowd scene: first locating the crowd regions, then attending to the density of each region, and finally estimating the number of people.

While the above methods cannot effectively identify invisible pedestrians under poor illumination, multi-modal representation learning based on deep learning has attracted extensive attention because of its powerful multi-level abstract representation ability. Some researchers therefore use depth maps or thermal images for object recognition and crowd counting. Fu et al. proposed a joint learning and densely cooperative fusion architecture for RGB-D salient object detection [19], which obtains fused features mainly through element-wise multiplication or addition and concatenation. To obtain fully represented shared features, Li et al. designed a cross-modal crowd counting method combining cross-modal cyclic attention fusion and fine-coarse supervision [20]. Piao et al. used the estimated map and attention map as a bridge to transfer depth knowledge [21], and Zhao et al. proposed an effective multi-scale cross-modal feature fusion method for RGB-D salient object detection [22]. In addition, Lu et al. used a cross-modality shared-specific feature transfer algorithm to explore the potential of modality-shared information and modality-specific features for improving re-identification [1]. These methods chiefly use depth data as auxiliary information to assist the representation learning of RGB data. Recently, Liu et al. proposed to exploit the complementarity of RGB and thermal images for crowd counting [8].

To achieve accurate crowd counting, especially in scenes with poor illumination, we capture modality-shared information and modality-specific features from RGB-thermal image pairs, reduce the influence of irrelevant background through region recognition, and strengthen supervision at the point, region, and multi-scale feature levels.

3 Proposed method

3.1 Network architecture

Aiming at counting in unconstrained crowd scenes, we propose a density estimation method based on multi-modal representation learning and multi-level supervision. The overall structure is shown in Fig. 1. For deep multi-modal representation learning, we draw on the ideas of [1] and [8]: starting from modality-specific features, we learn the complementarity between the RGB and thermal modalities and make effective use of the shared and specific information of each sample. In addition, we consider how humans visually inspect a scene image: one first notices whether a part or region of the image may contain people and then follows a series of steps to estimate the density. Therefore, following [18], a region recognition design is added to the RGB stream. The overall cross-modal collaborative representation learning network is composed of five modules. Each module contains two modality-specific streams, one modality-shared stream, and a shared-specific transformation module (SSTM). The RGB modality-specific stream is drawn in blue and green, the modality-shared stream in orange, and the thermal modality-specific stream in purple. In terms of network design, [1] uses the networks of [23] and [24] for preliminary feature extraction and applies graph convolution to measure intra- and inter-modality similarity, while [8] provides two backbones based on CSRNet [25] and BL [9] and verifies that the method works with different classical network models. Our network is based on VGG-19 [26]; VGGNet's small convolution kernels and pooling windows bring an implicit regularization effect and provide more control over image features. Taking the first block as an example, its components are as follows. First, two streams take the RGB and thermal images as inputs and extract modality-specific features separately, retaining the specific information of each single modality. The shared stream takes a zero tensor as input and aggregates the information of the modality-specific features layer by layer. The region recognition design resembles an attention mechanism: it divides each pixel of the feature map into crowd or background regions through a generated coarse-grained attention map. The structure of the region recognition branch is C(512,3)–U–C(256,3)–U–C(128,3)–U–C(64,3)–C(1,3), where C denotes a convolution layer and U denotes upsampling. The subsequent SSTM connected to the specific streams measures the similarity within the RGB and thermal modalities as well as between them, and propagates shared and specific features back to the two modalities, simultaneously compensating each modality's specific information and enhancing the shared information. In addition to the above design, the five modules are connected in sequence, so the multi-scale problem is considered both in the overall structure and inside each module.
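As a concrete illustration, the region recognition branch C(512,3)–U–C(256,3)–U–C(128,3)–U–C(64,3)–C(1,3) described above could be sketched in PyTorch as follows. The layer widths follow the text; the module name, the ReLU activations, the bilinear 2x upsampling, and the final sigmoid are our assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class RegionRecognitionHead(nn.Module):
    """Sketch of the region recognition branch
    C(512,3)-U-C(256,3)-U-C(128,3)-U-C(64,3)-C(1,3),
    where C(k,3) is a 3x3 convolution with k output channels
    and U is assumed to be 2x bilinear upsampling."""

    def __init__(self, in_channels: int = 512):
        super().__init__()

        def conv(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.ReLU(inplace=True))

        self.body = nn.Sequential(
            conv(in_channels, 512),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            conv(512, 256),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            conv(256, 128),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            conv(128, 64),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Coarse-grained attention map: each pixel is scored as crowd vs. background.
        return torch.sigmoid(self.body(feat))
```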

Fig. 1 a Overall structure of the method. b Region recognition of the RGB modality. c Shared-specific transformation module (SSTM)

3.2 Supervised fusion feature extraction

Region-based two-stream feature extractor For the input RGB image \(X^{\mathrm{RGB}}\) and thermal image \(X^{I}\), the features obtained after multi-layer convolution are denoted by F. F is then transformed into multi-scale context information I through pyramid pooling. The thermal modality stream and the modality-shared stream use the same calculation:

$$\begin{aligned} I=\hbox {Conv}_{1\times 1}(P^{1}(F) \oplus P^{2}(F) \oplus P^{3}(F)) \end{aligned}$$
(1)

where \(\hbox {Conv}_{1 \times 1}\) denotes a \({1 \times 1}\) convolutional layer and \(P^{k}\) denotes max pooling layers of different sizes. The output sizes of \(P^{1}(F)\), \(P^{2}(F)\), and \(P^{3}(F)\) are 1, 1/2, and 1/4 of the original input size, respectively.
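A minimal sketch of Eq. (1) is given below, assuming adaptive max pooling to 1, 1/2, and 1/4 of the input resolution. Because channel-wise concatenation (the ⊕ in Eq. (1)) requires a common spatial size, the pooled maps are upsampled back to the input resolution before the \(1 \times 1\) convolution; that resizing step, the module name, and the output channel count are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidContext(nn.Module):
    """Sketch of Eq. (1): multi-scale context from max pooling at 1, 1/2,
    and 1/4 of the input resolution, concatenated and fused by a 1x1
    convolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        h, w = feat.shape[-2:]
        scales = [1.0, 0.5, 0.25]  # output sizes of P^1, P^2, P^3
        pooled = []
        for s in scales:
            p = F.adaptive_max_pool2d(feat, (max(1, int(h * s)), max(1, int(w * s))))
            # Resize back so the three maps can be concatenated channel-wise.
            pooled.append(F.interpolate(p, size=(h, w), mode="bilinear",
                                        align_corners=False))
        return self.fuse(torch.cat(pooled, dim=1))  # I = Conv_1x1(P1 ⊕ P2 ⊕ P3)
```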

For the RGB modality stream, regional information must be computed while the crowd features are extracted. The region map generated by region recognition assigns different weights to pixels in different regions of the image. Similar to an attention mechanism, it is mainly realized by multiple full connections and sigmoid functions, so that the transformation of the features of interest is highlighted. The updated feature \(F^{RA}\) is obtained by multiplying the crowd feature with the regional information \(F^{A}\) and then adding the original feature. The calculation is as follows:

$$\begin{aligned} I^{R}=F^{R'} \oplus F^{R''} + F^{R'} \end{aligned}$$
(2)
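Following the textual description around Eq. (2), the region-weighted update can be sketched as an element-wise product of the crowd feature with the region map plus a residual addition of the original feature. The function name is ours, and the broadcasting of a single-channel region map over the feature channels is an assumption.

```python
import torch

def region_weighted_update(feat: torch.Tensor, region_map: torch.Tensor) -> torch.Tensor:
    """Sketch of the update described around Eq. (2): the crowd feature is
    re-weighted by the region map (attention-like element-wise product) and
    the original feature is added back as a residual. `region_map` is assumed
    to be a single-channel map in [0, 1] broadcast over the channels of `feat`."""
    return feat * region_map + feat
```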

Shared-specific feature transfer (SSTM) As shown in Fig. 1, residual calculation and a gate function are applied to the \(I^{RA}\), \(I^{S}\), and \(I^{I}\) obtained above to aggregate and distribute information. First, the residuals among the three are computed, and complementary information is then propagated adaptively through \(1 \times 1\) convolutions to refine the modality-shared feature \(F^{S}\). The enhanced feature \(\hat{F}^{S}\) is calculated as:

$$\begin{aligned} \hbox {FRS}= & {} \hbox {Conv}_{1 \times 1} (I^{RA} - I^{S}) \end{aligned}$$
(3)
$$\begin{aligned} \hbox {FIS}= & {} \hbox {Conv}_{1 \times 1} (I^{I} - I^{S}) \end{aligned}$$
(4)
$$\begin{aligned} \hat{F}^{S}= & {} F^{S} + (I^{RA} - I^{S}) \otimes \hbox {FRS} + (I^{I} - I^{S}) \otimes \hbox {FIS} \end{aligned}$$
(5)

Next, the new modality-shared feature information is distributed to refine the specific features of each modality. The context information \(\hat{I}^{S}\) corresponding to the enhanced feature \(\hat{F}^{S}\) is dynamically propagated into \(\hat{F}^{R}\) and \(\hat{F}^{I}\), which are calculated as follows:

$$\begin{aligned}&\hat{F}^{R}=F^{R} + (\hat{I}^{S} - I^{RA}) \otimes \hbox {Conv}_{1\times 1}(\hat{I}^{S} - I^{RA}) \end{aligned}$$
(6)
$$\begin{aligned}&\hat{F}^{I}=F^{I} + (\hat{I}^{S} - I^{I}) \otimes \hbox {Conv}_{1\times 1}(\hat{I}^{S} - I^{I}) \end{aligned}$$
(7)
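Putting Eqs. (3)-(7) together, the shared-specific transfer step could be sketched as below. The residuals between modality-specific context (\(I^{RA}\), \(I^{I}\)) and shared context (\(I^{S}\)) are gated by \(1 \times 1\) convolutions as in the formulas; the module name, the channel count, the sigmoid applied to the gate outputs (our reading of the "gate function" mentioned above), and taking \(\hat{I}^{S}\) as an externally computed input are assumptions.

```python
import torch
import torch.nn as nn

class SSTM(nn.Module):
    """Sketch of the shared-specific transformation module, Eqs. (3)-(7)."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate_rgb = nn.Conv2d(channels, channels, 1)
        self.gate_thermal = nn.Conv2d(channels, channels, 1)
        self.gate_back_rgb = nn.Conv2d(channels, channels, 1)
        self.gate_back_thermal = nn.Conv2d(channels, channels, 1)

    def forward(self, F_R, F_I, F_S, I_RA, I_I, I_S, I_S_hat):
        # Eqs. (3)-(5): aggregate complementary information into the shared feature.
        res_r = I_RA - I_S
        res_i = I_I - I_S
        F_S_hat = (F_S
                   + res_r * torch.sigmoid(self.gate_rgb(res_r))
                   + res_i * torch.sigmoid(self.gate_thermal(res_i)))
        # Eqs. (6)-(7): distribute the enhanced shared context back to each
        # modality. I_S_hat is assumed to be the pyramid-pooled context of
        # F_S_hat (Eq. (1)), computed outside this module.
        back_r = I_S_hat - I_RA
        back_i = I_S_hat - I_I
        F_R_hat = F_R + back_r * torch.sigmoid(self.gate_back_rgb(back_r))
        F_I_hat = F_I + back_i * torch.sigmoid(self.gate_back_thermal(back_i))
        return F_R_hat, F_I_hat, F_S_hat
```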

\(\hat{F}^{R}\), \(\hat{F}^{S}\), and \(\hat{F}^{I}\) complete representation learning through the five blocks in turn. Finally, the estimated density map is obtained through several convolution layers, and the density map is summed pixel by pixel to obtain the estimated count.

4 Experiments and results

In this section, we describe the evaluation metrics and experimental details on the newly proposed RGBT-CC benchmark. Training and evaluation are performed on an Intel Core i7-7800 @ 3.50 GHz processor with 31.1 GB of memory. The experiments use the PyTorch [27] framework and the Adam [28] optimizer. Each model is trained for 400 epochs with a learning rate of 1e-5.
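For concreteness, the optimization settings above correspond to a training loop of roughly the following shape; this is only a minimal, runnable sketch with tiny stand-ins, not the actual training code, and the placeholder model, random tensors, and L1 loss merely stand in for the full network, the RGBT-CC data loader, and the Bayesian loss of Sect. 4.2.

```python
import torch
import torch.nn as nn

# Stand-ins so the loop runs: the real components are described in Sects. 3 and 4.2.
model = nn.Conv2d(3, 1, 3, padding=1)                 # placeholder for the full network
criterion = nn.L1Loss()                               # placeholder for the Bayesian loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # settings from the text

for epoch in range(400):                              # 400 epochs, as reported
    rgb = torch.rand(1, 3, 64, 64)                    # stand-in input batch
    target = torch.rand(1, 1, 64, 64)                 # stand-in density target
    loss = criterion(model(rgb), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```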

4.1 Evaluation metrics

Following previous work, we adopt the Grid Average Mean Absolute Error (GAME [29]) and the Mean Squared Error (MSE) as performance metrics. Compared with the Mean Absolute Error (MAE), GAME evaluates not only the whole image but also regions of different sizes. GAME is defined as:

$$\begin{aligned} \hbox {GAME}(l)=\frac{1}{N}\sum ^{N}_{i=1}\sum ^{4^{l}}_{j=1}|\hbox {CE}^{j}_{i}-\hbox {CG}^{j}_{i}| \end{aligned}$$
(8)

where N is the number of images and l is a specific level. The image is divided into \(4^{l}\) non-overlapping regions according to l, and the error is measured in each region. The value of l is 0, 1, 2, or 3; when l is 0, GAME equals MAE. \(\hbox {CE}^{j}_{i}\) and \(\hbox {CG}^{j}_{i}\) denote the estimated count and the ground-truth count of the jth region of the ith image, respectively. MSE is defined as follows:

$$\begin{aligned} \hbox {MSE}=\sqrt{\frac{1}{N}\sum ^{N}_{i=1}(\hbox {CE}_{i}-\hbox {CG}_{i})^2} \end{aligned}$$
(9)

where \(\hbox {CG}_{i}\) is the ground-truth count of the ith testing image and \(\hbox {CE}_{i}\) is the corresponding estimate.
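Both metrics can be computed directly from the estimated and ground-truth density maps, with counts obtained by summing density values over each region. A minimal NumPy sketch (the function names are ours) of Eqs. (8) and (9) is shown below.

```python
import numpy as np

def game(estimates, ground_truths, level: int) -> float:
    """GAME(l), Eq. (8): split each density map into 4^l non-overlapping
    regions (a 2^l x 2^l grid) and sum the absolute regional count errors.
    GAME(0) equals the MAE."""
    errors = []
    splits = 2 ** level
    for est, gt in zip(estimates, ground_truths):
        err = 0.0
        for rows_e, rows_g in zip(np.array_split(est, splits, axis=0),
                                  np.array_split(gt, splits, axis=0)):
            for cell_e, cell_g in zip(np.array_split(rows_e, splits, axis=1),
                                      np.array_split(rows_g, splits, axis=1)):
                err += abs(cell_e.sum() - cell_g.sum())  # |CE_i^j - CG_i^j|
        errors.append(err)
    return float(np.mean(errors))

def mse(estimates, ground_truths) -> float:
    """MSE, Eq. (9): root of the mean squared error between image-level counts."""
    diffs = [est.sum() - gt.sum() for est, gt in zip(estimates, ground_truths)]
    return float(np.sqrt(np.mean(np.square(diffs))))
```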

4.2 Loss function

The published datasets used for training generally provide point annotations for each training image. Many counting methods first use Gaussian kernels to convert the point annotations of each training image into a ground-truth density map and then train a deep neural network by regressing the value of every pixel in the density map. In contrast, the Bayesian loss [9] adopted in this paper constructs a density contribution probability model from the point annotations and computes the expected count of each annotation point as the sum over pixels of the product of the contribution probability and the estimated density. The loss function is:

$$\begin{aligned} L=\sum ^{N}_{n=1}F(1-E[C_{n}]) \end{aligned}$$
(10)

where \(F(\cdot )\) is a distance function, the ground truth \(C_{n}\) of each annotation point is 1, and \(E[C_{n}]\) is the expectation of \(C_{n}\). Compared with loss functions that constrain the density value of every pixel, the Bayesian loss supervises the count expectation of each annotation point.
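A simplified sketch of this loss (following the formulation of [9], but omitting its background term) is given below: each pixel's estimated density is attributed to the annotation points through a Gaussian posterior, and the L1 distance is used as \(F(\cdot)\). The function name, the σ value, and the tensor layout are illustrative assumptions.

```python
import torch

def bayesian_loss(density: torch.Tensor, pixel_coords: torch.Tensor,
                  points: torch.Tensor, sigma: float = 8.0) -> torch.Tensor:
    """Simplified sketch of Eq. (10).
    density:      (M,) estimated density values, one per pixel
    pixel_coords: (M, 2) pixel coordinates
    points:       (N, 2) annotated head positions"""
    # Squared distances between every annotation point and every pixel: (N, M).
    d2 = ((points[:, None, :] - pixel_coords[None, :, :]) ** 2).sum(dim=-1)
    likelihood = torch.exp(-d2 / (2.0 * sigma ** 2))
    # Posterior probability that a pixel's density contributes to each point.
    posterior = likelihood / (likelihood.sum(dim=0, keepdim=True) + 1e-12)
    expected_count = (posterior * density[None, :]).sum(dim=1)  # E[C_n]
    return torch.abs(1.0 - expected_count).sum()                # Σ_n F(1 - E[C_n])
```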

4.3 Performance on comparison

The RGBT-CC benchmark [8] contains 2030 pairs of RGB and thermal images from different scenes, of which 1013 pairs were captured in bright scenes and 1017 pairs in dark scenes, with 138,389 annotated pedestrians in total. All images used for training and testing are resized to \(640 \times 480\). In total, 1030 pairs are used for training, 200 pairs for validation, and the remaining 800 pairs for testing.

Experimental results Table 1 compares our method with other methods on the RGBT-CC benchmark [8]. Each method in the table captures scene details in its own way to complete recognition and counting. HDFNet [30] and BBSNet [7] fully integrate and exploit cross-modal information (RGB images with depth information) to facilitate salient object detection. The multi-view crowd counting of MVMS [31] uses information from multiple camera views to predict a scene-level density map on the 3D world ground plane. Compared with CSRNet [25] and BL [9], CSRNet \(+\) IADM [8] and BL \(+\) IADM [8] combine RGB and thermal information for density map estimation; both use the cross-modal collaborative representation learning framework to fully capture the complementary information of different modalities. In addition to cross-modal collaborative representation learning, our method adds a multi-level supervision mechanism that not only integrates multi-modal information but also considers the extraction of coarse- and fine-grained features. The results show that our method outperforms the other methods in GAME and MSE.

Table 1 Results comparison of different methods on the RGBT-CC benchmark

Comparison under different illumination conditions Table 2 compares our method and BL \(+\) IADM [8] on the bright and dark test images given in [8]. The table also compares results obtained with RGB information, thermal information, and RGB-T information, respectively. It can be seen that optical and thermal information complement each other: multi-modal information yields more detailed features than single-modal information, and thermal images help to extract crowd features. Both our method and BL \(+\) IADM [8] improve the quality of the density map by using the combined information and obtain more accurate estimates under both bright and dark conditions. Overall, our method outperforms BL \(+\) IADM [8] in MAE and MSE under both illumination conditions.

Table 2 Performance of different methods on the RGBT-CC benchmark under different illumination conditions

4.4 Ablation experiments

In this part, we will conduct further experimental comparison and discuss the details of model design. At the same time, the effect of cross-scene is also preliminarily considered.

Architecture learning Besides the comparison with the references, we also consider different schemes in the design of the network structure; their results are compared in Fig. 2. BL [9] mainly adopts the VGG-19 network, and BL \(+\) IADM [8] adds cross-modal cooperative representation learning on top of BL [9]. BL \(+\) IADM \(+\) TSF treats crowd region recognition as a separate specific stream participating in the subsequent calculations, BL \(+\) IADM \(+\) RS1 adds region recognition to both the RGB and thermal feature extraction, and BL \(+\) IADM \(+\) RS2 applies region recognition only to the RGB features. From the curves in the figure, BL \(+\) IADM \(+\) RS2 achieves lower GAME and MSE values than the other schemes. The results show that more detailed crowd features can be obtained through the comprehensive extraction of cross-modal features, which helps improve counting accuracy; representations of more than two modal streams are difficult to learn, whereas coarse- and fine-grained extraction of multi-modal features helps to obtain high-quality density maps.

Fig. 2 Results comparison of different structural schemes

Effect of cross-scene In addition to training and testing under different illumination, we also preliminarily verify the cross-scene effect. The model is first trained on the whole dataset and then tested on the bright and dark subsets divided by BL \(+\) IADM [8]. As seen from the data distributions in Figs. 3 and 4, using RGB-T multi-modal information performs better than using RGB or T information alone. When only RGB information is used to extract features, our method is not as effective as BL \(+\) IADM [8]; but when T or RGB-T information is used, our method is better than BL \(+\) IADM [8], with the MAE and MSE values improved by 38%, 31% and 6%, 11%, respectively.

Fig. 3 Comparison of cross-scene testing under bright illumination

Fig. 4 Comparison of cross-scene testing under dark illumination

5 Conclusion

We propose a method based on coordinated representation and multi-level supervision for density map estimation and crowd counting in unconstrained crowd scenes. The whole network consists of five blocks and a density map generator. Pairs of RGB-thermal images are first fed into the two-stream feature extractor to obtain shared and specific features, in which the RGB stream carries both crowd features and regional information. The multi-modal features then pass through the SSTM module, which measures the similarity within and between modalities and transfers shared and specific features between them. In addition, the five blocks and the subsequent density map generator extract multi-scale global and local features and form a three-level supervision mechanism of point, region, and multi-scale. Meanwhile, the multi-scale feature maps between modules are combined with region recognition and residual calculation so that different regions receive adaptive attention and different detail features are extracted. Finally, the estimated density map is generated by the density map generator. Experiments on the RGBT-CC benchmark verify the effectiveness of the method. In future work, we will further consider applying unsupervised methods to crowd counting.