1 Introduction

In recent years, surveillance videos, especially those recording crowds, have increased dramatically owing to wide-ranging applications such as crowd surveillance in squares, railway stations, shopping malls, and schools. These videos contain a huge number of frames (about 3000 frames per minute), which is a barrier to many practical uses. Video summarization shortens an input video into key shots or frames while preserving the important information it contains, providing an efficient way to browse large amounts of video data. In previous work on surveillance video summarization, [9, 29] selected frames with moving targets according to a frame-level dissimilarity measure. However, such measures are sensitive to minor changes in the video stream, so they are not suitable for crowd surveillance videos, which contain a large number of moving targets. [26, 43] proposed event-based surveillance video summarization, selecting key frames that depend heavily on the results of complicated abnormal-event detection. As discussed in [25], the performance may decline significantly when there is no predefined abnormal event in the video stream.

In this paper, a novel unsupervised learning-based video summarization approach is proposed. Our goal is to select key shots to summarize crowd surveillance videos. Our approach is motivated by two facts. First, crowd location and density are the two main contents of surveillance video and have received wide attention in the field of video analysis [15, 30, 31, 39]. Second, a high-quality video summary should preserve the main contents of the input video [6, 23, 50]. Accordingly, we learn a crowd surveillance video summarization model that selects shots according to crowd location and density while meeting high-quality summary requirements.

Recently, sequence-to-sequence learning techniques, which can be categorized as supervised [11, 12, 40, 46] and unsupervised [14, 18, 19, 21, 32, 33, 36, 38], have introduced several promising models. Supervised approaches learn from human-created summary ground truths, but few public crowd surveillance video datasets come with labels; the time-consuming and labor-intensive annotation procedures have been a limiting factor for existing datasets [48]. Unsupervised techniques are therefore more applicable to our task, where annotated data are scarce.

More specifically, we develop a long short-term memory (LSTM) cell [13] based network that models the sequential patterns in video shots to summarize crowd surveillance videos. Reinforcement learning (RL) is used to train the model for two reasons. First, our work focuses on the unsupervised setting; as mentioned in [53], RL can provide supervision through a reward fed back to the LSTM. Second, the crowd location and density rewards are computed over the whole video sequence and can therefore only be evaluated at the end of the video stream. RL iteratively teaches the model to select better shots via these rewards. The reward function, consisting of crowd location and density terms, measures how well the generated summary represents the main contents of the original video (which can be regarded as a set of distinct crowd behaviors) in terms of how many people there are and where they are. It follows the high-quality summary requirement [6, 23, 50] that a summary should consist of key shots whose contents are similar to those of the original video while differing from the shots already selected. A novel reward function measuring similarity and difference in crowd location and density is therefore designed to encourage the summarization network to produce high-quality summaries.

Although LSTM-based summarization networks trained by RL have achieved significant results on different video summarization tasks [52, 53], the ideal video length for LSTM modeling is less than 100 frames. Unfortunately, most surveillance videos contain thousands of frames, and applying an LSTM to surveillance video summarization directly may limit the quality of the summary. For this reason, a hierarchical LSTM, instead of the single-layer bidirectional LSTM/GRU of [52, 53], forms part of our summarization model to capture dependencies across longer time spans. As shown in Fig. 1, three layers of LSTM units model frames and shots. The first-layer LSTM produces a representation for shots generated by cutting the original video evenly, and the final hidden state of each shot is fed to the next layer. The output of the last layer is treated as the embedding of all shots and determines whether a shot is a key shot. Experiments show that a two-layer LSTM yields higher-quality summaries than a single-layer LSTM, but surveillance videos are longer than ordinary videos (such as those in the standard datasets SumMe [11] and TVSum [42]). To this end, a network with three LSTM layers is designed to maintain the performance of our surveillance video summarization model.

Fig. 1 Training summarization network with reinforcement learning

To conclude the introduction, the main contributions of this paper are as follows: (1) an RL-based unsupervised framework for crowd surveillance video summarization is proposed, with a novel crowd location and density reward function designed to encourage the summarization network to produce high-quality summaries; (2) a hierarchical LSTM is introduced as the summarization network in our framework to maintain performance on long crowd surveillance videos; (3) an extensive study on three public crowd surveillance video datasets demonstrates that our method outperforms state-of-the-art methods.

2 Related work

2.1 Supervised video summarization

Although learning from manually labeled summary ground truth may overfit, it can achieve remarkable results and has received wide attention. Gong et al. [3] proposed a two-pronged approach for learning a determinantal point process (DPP) from labeled data to model diversity. Intuitively, a DPP defines a probability distribution under which subsets of higher diversity are more likely to be selected. Inspired by this, Zhang et al. [48] first proposed an LSTM-based model for video summarization; bidirectional LSTM layers modeled long-range dependencies in both the past and future directions, and a DPP was added to increase the diversity of the selected frames. Zhang et al. [49] leveraged non-parametric learning from exemplar videos to transfer summary structures to novel input videos. The two points of interest are how similar the new video is to the annotated ones and how the training videos are summarized; the similarity was inferred by comparing visual features at each frame, and key frames were selected from the human-created training summaries. Vasudevan et al. [46] proposed a method that generates video summaries adapted to a text query. Its technical core is a relevance model that ranks the frames of a video given the query: a learned visual-semantic embedding space and a query-independent term compute the relevance, and summary frames are selected in terms of relevance, representativeness, and diversity using a submodular mixture of objectives. Feng et al. [6] considered that better summaries come from an understanding of the whole video; an external memory recorded the whole video, which was then interpreted by a global attention mechanism.
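For intuition, the following minimal sketch (illustrative only, not the implementation of [3] or [48]) shows how a DPP favors diverse subsets: the unnormalized probability of a subset is the determinant of the corresponding kernel submatrix, which shrinks as the selected items become more similar to one another.

```python
import numpy as np

def dpp_subset_prob(L, subset):
    """Unnormalized DPP probability of a subset of items.

    L: (n, n) similarity kernel; subset: list of item indices.
    The determinant of the submatrix is larger for more diverse subsets."""
    idx = np.array(subset)
    return np.linalg.det(L[np.ix_(idx, idx)])
```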

Although various supervised approaches have achieved remarkable results on benchmark datasets such as SumMe [11] and TVSum [42], supervised techniques have limited applicability when annotated data are scarce. Tagging surveillance videos recorded by a camera is time-consuming, and surveillance videos recorded by different cameras differ from one another. Hence, learning from manually labeled summary ground truth is not suitable for our application.

2.2 Unsupervised video summarization

Unsupervised approaches select key frames/shots without the guidance of human-created ground truths, relying instead on manually designed criteria, web images, or video categories. Song et al. [42] observed that the title (and title-based images) serves as a prior on the expected summary; to select frames from input videos, they learned a joint factorial representation of images and video datasets. To summarize user-generated videos, which consist of long, poorly filmed, and unedited content, Lei et al. [22] developed a graph-based method to rank frame segments (clusters of frames of the original video). Kang et al. [18] observed that space-time cues are informative in some videos and determined salient portions by spatio-temporal contrast. Lee et al. [40] proposed methods that learn category-independent importance cues to target key objects and people, in order to summarize egocentric videos captured from a wearable camera. Lu et al. [28] aimed to create story-driven summaries for long, unedited videos. The basic idea of [23] was that a summary should consist of key frames whose visual content is similar to that of the original video while differing from the frames already selected in the summary; subsequent works were influenced by this criterion. Zhou et al. [53] proposed a deep reinforcement learning framework to train a deep summarization network whose reward function is inspired by this general criterion: a dissimilarity function measures the difference between selected frames, and the distance to a set of medoids measures the similarity between selected frames and the contents of the original video. Another criterion is that the machine-generated summary should be similar to the original video in an abstract semantic space. Following it, Zhang et al. [50] used regression losses between matching summaries, between the summary and the original, and between mismatched summaries and originals to measure the amount of information conveyed by the original sequence and the summary. With the same criterion, Mahasseni et al. [33] built a Generative Adversarial Network (GAN) from a selector LSTM, an encoder LSTM, a decoder LSTM, and a discriminator LSTM. The summarization performance of these unsupervised methods [33, 50, 53] is superior to contemporaneous supervised methods on benchmark datasets [11, 42].

Clearly, the criterion for training a summarization model is designed manually according to the application, and changes in the crowd reflect the main contents of a crowd surveillance video. Therefore, crowd-aware rewards are used as the criterion for evaluating whether the summary captures the main content of the crowd surveillance video. In addition, a hierarchical network, originally proposed in a supervised method [51], is adopted to capture long-span dependencies, because surveillance videos always span longer time intervals.

2.3 Reinforcement learning

Reinforcement learning (RL) has been popular and successful in many areas. Seijen et al. [45] decomposed the reward function into a number of different reward functions to construct an easy-to-learn value function. Dong et al. [5] trained an attention agent for action recognition because the attention model cannot be trained end-to-end with the whole network. Janisch et al. [16] formalized classification with costly features as a Markov decision process, for which RL is a natural choice. In our previous work [24], we modeled the dynamic selection of nodes in a camera network as a Markov decision process to obtain the most informative camera node while reducing camera switching. Jay et al. [17] used RL to tackle the crucial and timely challenge of Internet congestion control. Zhou et al. [53] first used RL in the domain of video summarization. The two main technical differences between their approach and ours are the hierarchical network and the crowd-aware rewards, which make our approach more suitable for summarizing long crowd surveillance videos.

3 Problem formulation and background

3.1 Problem formulation

An input video can be represented as a series of consecutive frames:

$$ F=\left\{{f}_1,{f}_2,\cdots, {f}_t,\cdots, {f}_T\right\}, $$
(1)

where ft is the frame at time t. There are two forms of output summaries. The first selects key frames [10, 27, 34] as the output:

$$ {F}^{\prime }=\left\{{f}_{r1},{f}_{r2},\cdots, {f}_{rn},\cdots, {f}_{rN}\right\}, $$
(2)

where F′ ⊆ F is the set of N selected frames (N < T), rn ∈ {1, 2, ⋯, T} and rn < rn+1. The second selects interval-based key shots [11, 12, 37] as the output:

$$ {F}^{\prime \prime }=\left\{{\mathcal{F}}_1,\cdots, {\mathcal{F}}_k\right\}, $$
(3)

where F′′ ⊆ F is the set of selected shots and \( {\mathcal{F}}_i\bigcap {\mathcal{F}}_j=\varnothing \) for i ≠ j.

Essentially, our approach falls into the second category: we select a smaller set of interval-based key shots for video summarization. Two problems need to be solved. First, most surveillance videos contain thousands of frames: how can dependencies across long time spans be captured? Second, how should shots be selected so that they summarize the main contents of crowd surveillance videos? For the former, as shown in Fig. 1, a hierarchical LSTM [51] is used as our summarization model. The input video is divided evenly into subsequences, and the first-layer LSTM exploits the sequential information in each subsequence, which typically contains up to 80 consecutive frames given the performance characteristics of RNNs. The output of each subsequence is an LSTM hidden state that captures short-range temporal dependencies; these hidden states are treated as the input of the next layer, so the second-layer LSTM can capture long-range temporal dependencies. This hierarchical RNN reduces the information loss in long-sequence modeling and thereby improves summary performance. For the second problem, a popular criterion for a high-quality summary is that it should consist of key frames/shots whose content is similar to that of the original video while differing from the frames already selected [23]. Hence, we measure similarity by the distance, in terms of crowd location and density, between the selected shots and the cluster centers of the original video frames. The intuition is that the cluster centers represent the video contents [53]: the closer the selected shots are to the cluster centers, the more similar they are to the video content. The dissimilarity of crowd location and density between selected shots measures the difference. This novel crowd location-density measurement is used as the reward signal in our RL framework to teach the summarization model to select better shots. The main content of a crowd surveillance video can be regarded as a set of distinct crowd behaviors in our experiments.

3.2 Background: Long short-term memory (LSTM)

LSTM is a popular variant of the standard Recurrent Neural Network (RNN), constructed from a feedforward network plus an extra feedback connection. LSTM is designed to address the vanishing-gradient problem that makes standard RNNs hard to train [1] and is therefore suitable for modeling long-range dependencies. The most significant difference between an LSTM and a standard RNN is the external memory cell, which encodes the knowledge of the inputs observed up to that step. Three gates control the calculation of the hidden state ht and memory cell ct. Specifically, this process can be described as follows:

$$ {i}_t=\sigma \left({W}_{ix}{x}_t+{U}_{ih}{h}_{t-1}+{b}_i\right), $$
(4)
$$ {f}_t=\sigma \left({W}_{fx}{x}_t+{U}_{fh}{h}_{t-1}+{b}_f\right), $$
(5)
$$ {o}_t=\sigma \left({W}_{ox}{x}_t+{U}_{oh}{h}_{t-1}+{b}_o\right), $$
(6)
$$ {g}_t=\phi \left({W}_{gx}{x}_t+{U}_{gh}{h}_{t-1}+{b}_g\right), $$
(7)
$$ {c}_t={f}_t\odot {c}_{t-1}+{i}_t\odot {g}_t, $$
(8)
$$ {h}_t={o}_t\odot \phi \left({c}_t\right), $$
(9)

where the input gate it controls whether to consider the current input xt, the forget gate ft allows the previous memory ct−1 to be forgotten, and the output gate ot decides how much of the memory to transfer to the hidden state ht. σ denotes the sigmoid function, ϕ denotes the hyperbolic tangent, and the Ws, Us, and bs are trainable weights and biases. ⊙ denotes the element-wise product.
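For concreteness, the following minimal NumPy sketch implements one step of Eqs. (4)-(9); the weight and gate names are illustrative rather than taken from any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step following Eqs. (4)-(9).
    W, U, b are dicts keyed by 'i', 'f', 'o', 'g' (illustrative names)."""
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate, Eq. (4)
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate, Eq. (5)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate, Eq. (6)
    g_t = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate memory, Eq. (7)
    c_t = f_t * c_prev + i_t * g_t                            # memory cell, Eq. (8)
    h_t = o_t * np.tanh(c_t)                                  # hidden state, Eq. (9)
    return h_t, c_t
```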

4 Approach

In this section, we describe our method for summarizing crowd surveillance videos. Since we want to learn a model that predicts probabilities for video shots in terms of crowd location and density via RL, thereby solving the two problems described in Section 3.1 (capturing dependencies across long time spans and selecting shots that summarize the main content), we first discuss the location- and density-aware frame representation in Section 4.1. Hierarchical modeling of long time spans is described in Section 4.2. Finally, we introduce the new reward functions related to crowd location, density, and the summary criterion [23] in Section 4.3.

4.1 Frame feature representation

The input of the neural network model is a set of features corresponding to the original video frames F. It has been confirmed that deep convolutional features consistently outperform hand-crafted features in video summarization [48], and they have been used in many works [6, 50, 52, 53]. Specifically, in our work, the visual feature vector of each frame is extracted from the penultimate layer of GoogLeNet [44].

However, the background of crowd surveillance videos contains much noise, such as buildings, vehicles, and the natural environment. This noise is embedded in the feature space if we extract features from each frame directly. The background information is useless for the surveillance video summarization task, and differences in background may interfere with the practical application of learning-based summarization methods, because surveillance videos are obtained from different cameras whose backgrounds differ from one another. In the extreme, the summarization model would have to be retrained for every surveillance camera, which is obviously unacceptable.

For the reasons above, we first calculate a set of crowd density maps [41] \( \mathbbm{M}=\left\{{\omega}_1,\dots, {\omega}_T\right\} \) for the original video frames F (Eq. 1), where ωt is the crowd density map of frame ft in F. Then the set of deep convolutional feature vectors

$$ X=\left\{{x}_1,{x}_2,\cdots, {x}_t,\cdots, {x}_T\right\} $$
(10)

is extracted from the density map set \( \mathbbm{M} \) instead of the frames F themselves and used as the input of our model, where xt is the deep feature vector of the density map ωt at time t.

As shown in Fig. 2, the crowd density map records the location and relative density of crowds and filters out the background (such as buildings and lawns) in the form of a heat map, so the feature vector set X highlights visual information about crowd location and relative density. As discussed in the experiments, this brings an additional benefit for cross-scene surveillance video summarization. We use the vector set as the input of our summarization model in both training and testing, and the output of the model is a set of probability values used to evaluate each subshot.
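The following sketch illustrates how such a feature could be computed with an off-the-shelf GoogLeNet. It is a minimal illustration, not the authors' pipeline: `density_model` is a placeholder for the crowd density estimator of [41], and the channel replication, resizing, and torchvision weight name are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# GoogLeNet with its classifier removed, so the 1024-d penultimate-layer
# activation is returned directly (assumes a recent torchvision version).
googlenet = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
googlenet.fc = nn.Identity()
googlenet.eval()

@torch.no_grad()
def frame_to_feature(frame, density_model):
    """frame: (3, H, W) tensor. Returns a 1024-d feature of its density map."""
    density_map = density_model(frame.unsqueeze(0))     # (1, 1, H', W') heat map
    rgb_like = density_map.repeat(1, 3, 1, 1)           # replicate to 3 channels
    rgb_like = nn.functional.interpolate(rgb_like, size=(224, 224),
                                         mode='bilinear', align_corners=False)
    return googlenet(rgb_like).squeeze(0)               # x_t in Eq. (10)
```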

Fig. 2 An example of a crowd density map. (a) A frame from the crowd surveillance video dataset PETS [7]; (b) the corresponding crowd density map calculated by [41]

4.2 Hierarchical deep summarization network

In this section, we describe the hierarchical summarization model in detail. As discussed in [36], the ideal video length for LSTM modeling is less than 100 frames, so it is challenging to model surveillance videos, which usually have long durations. Zhao et al. [51] trained a two-layer LSTM model in a supervised framework to capture dependencies across long time spans, but surveillance videos are longer than those in standard datasets. We therefore extend the summarization model to three layers of LSTM units for modeling frames and shots and train it within an unsupervised RL framework. As shown in Fig. 1, the first and second layers are LSTMs responsible for modeling at the frame level and shot level, respectively. The third layer is a bidirectional LSTM that predicts the confidence with which a given shot should be selected into the summary.

Specifically, the input of the first layer is

$$ {X}^{\prime }=\left\{{x}_1,\cdots, {x}_n\right\}\cup \left\{{x}_{n+1},\cdots, {x}_{2n}\right\}\cup \cdots \cup \left\{{x}_{mn+1},\cdots {x}_{\left(m+1\right)n}\right\}, $$
(11)

which means the feature vector sequence X (Eq. 10) is separated into m consecutive, disjoint subsequences of length n. If T < (m + 1)n in Eq. 11, the final subsequence is padded with zeros. Each subsequence of X is processed by LSTM(∙), where LSTM(∙) is shorthand for Eqs. (4)-(9). The output of the first layer is

$$ {\tau}_f=\left\{{\tau}_{f\_1},\cdots, {\tau}_{f\_m}\right\}, $$
(12)

where τf_i denotes the final hidden state of the i-th subsequence of X, which is treated as the representation of that subsequence. High-quality summaries can be obtained with a two-layer LSTM [51]; however, surveillance videos are longer than standard videos (such as those in SumMe [11]). Thus, a three-layer LSTM is used in our work, and τf is further divided into shots

$$ {\tau}_f^{\prime }=\left\{{\tau}_{f\_1},\cdots, {\tau}_{f\_n}\right\}\cup \left\{{\tau}_{f\_\left(n+1\right)},\cdots, {\tau}_{f\_2n}\right\}\cup \cdots \cup \left\{{\tau}_{f\_\left( kn+1\right)},\cdots, {\tau}_{f\_\left(k+1\right)n}\right\}, $$
(13)

as the input of the second layer (k < m). That is, the subsequence representations τf are separated into k consecutive, disjoint shot representations (the final shot is padded with zeros if m < (k + 1)n). Similarly to Eq. (12), the output of the second layer is

$$ {\tau}_s=\left\{{\tau}_{s\_1},\cdots, {\tau}_{s\_k}\right\}, $$
(14)

where τs_i denotes the final hidden state of the i-th group in \( {\tau}_f^{\prime } \), which is treated as the representation of the i-th shot. Then, analogously to \( {\tau}_f^{\prime } \), τs is divided into subshots

$$ {\tau}_s^{\prime }=\left\{{\tau}_{s\_1},\cdots, {\tau}_{s\_n}\right\}\cup \left\{{\tau}_{s\_\left(n+1\right)},\cdots, {\tau}_{s\_2n}\right\}\cup \cdots \cup \left\{{\tau}_{s\_\left( qn+1\right)},\cdots, {\tau}_{s\_\left(q+1\right)n}\right\}, $$
(15)

as the input of the third layer, which is a bidirectional LSTM. The outputs of this last layer are \( {h}^f=\left\{{h}_1^f,\cdots, {h}_q^f\right\} \) and \( {h}^b=\left\{{h}_1^b,\cdots, {h}_q^b\right\} \), where hf and hb are the hidden states of the forward and backward LSTMs, respectively. A softmax layer then predicts a probability

$$ {p}_t= softmax\left(\tanh \left({W}_p\left[{h}_t^f,{h}_t^b,{\tau}_{s\_t}\right]+{b}_p\right)\right) $$
(16)

to indicate whether the t-th shot is selected, where Wp and bp are parameters to be learned. The softmax function constrains the elements of pt to sum to 1; pt is a two-dimensional vector whose elements indicate the probability that the t-th subshot is a key or non-key shot.
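A minimal PyTorch sketch of this three-layer design is given below. It is illustrative rather than the authors' implementation: for brevity the shot representations τs are fed directly to the bidirectional layer, and the dimensions, padding scheme, and names are assumptions.

```python
import torch
import torch.nn as nn

class HSN3(nn.Module):
    """Sketch of the three-layer hierarchical summarization network (Fig. 1)."""
    def __init__(self, feat_dim=1024, hidden=256, seg_len=30):
        super().__init__()
        self.seg_len = seg_len                      # n in Eqs. (11), (13), (15)
        self.frame_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.shot_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.top_lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden * 3, 2)          # [h_f, h_b, tau_s] -> key / non-key

    def _segment(self, x):
        # split (1, L, D) into (num_segments, seg_len, D), zero-padding the tail
        L, D = x.shape[1], x.shape[2]
        pad = (-L) % self.seg_len
        x = nn.functional.pad(x, (0, 0, 0, pad))
        return x.view(-1, self.seg_len, D)

    def forward(self, features):                    # features: (1, T, feat_dim)
        segs = self._segment(features)              # Eq. (11)
        _, (h, _) = self.frame_lstm(segs)
        tau_f = h[-1].unsqueeze(0)                  # Eq. (12): (1, m, hidden)
        segs = self._segment(tau_f)                 # Eq. (13)
        _, (h, _) = self.shot_lstm(segs)
        tau_s = h[-1].unsqueeze(0)                  # Eq. (14): (1, k, hidden)
        out, _ = self.top_lstm(tau_s)               # forward/backward hidden states
        logits = self.fc(torch.cat([out, tau_s], dim=-1))
        return torch.softmax(torch.tanh(logits), dim=-1)   # p_t, Eq. (16)
```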

4.3 Reward function

During training, the reward function sends a signal to the summarization model in each iteration to evaluate the generated summary. RL drives the summarization model toward high-quality summaries by maximizing the expected reward. Following the criterion of [23], a high-quality summary should preserve the contents of the original video while differing from the frames/shots already selected, and the main contents of crowd surveillance videos are crowd density and location. We therefore design a reward function that evaluates the quality of summaries according to crowd density and location.

Crowd location reward. As discussed in Section 4.1, we extract visual features from the crowd density map [41] as the input of our summarization model. Although the density map encodes crowd density, it is also sensitive to crowd location and emphasizes where the crowd density is highest on the map. Therefore, visual features extracted from the crowd density map are used to measure the quality of the selected shots, and this term is named the crowd location reward. Inspired by [53], an unsupervised diversity-representativeness reward is employed,

$$ {R}_{loc}=\frac{1}{\left|\mathcal{Y}\right|\left(\left|\mathcal{Y}\right|-1\right)}{\sum}_{t\in \mathcal{Y}}{\sum}_{\begin{array}{c}{t}^{\prime}\in \mathcal{Y}\\ {}{t}^{\prime}\ne t\end{array}}d\left({s}_t,{s}_{t^{\prime }}\right)+\mathit{\exp}\left(-\frac{1}{T}{\sum}_{t=1}^T\underset{t^{\prime}\in \mathcal{Y}}{\mathit{\min}}{\left\Vert {s}_t-{s}_{t^{\prime }}\right\Vert}_2\right), $$
(17)

where d(∙,∙) is the cosine dissimilarity, \( \mathcal{Y}=\left\{{y}_i|{a}_{y_i}=1,i=1,\cdots, \left|\mathcal{Y}\right|\right\} \) contains the indices of the selected shots, and s denotes a shot feature. Similarly to [6], s is calculated as the average of the deep features of all frames within the shot. Following the criterion of [23], the first term computes the dissimilarity among selected shots, and the second term measures how much information of the original video the selected shots retain.
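The following NumPy sketch mirrors Eq. (17): the first term is the mean pairwise cosine dissimilarity among the selected shots, and the second term rewards selections whose shots lie close to every shot of the original video. The function interface is illustrative.

```python
import numpy as np

def r_loc(shot_feats, selected):
    """Diversity-representativeness reward of Eq. (17), a minimal sketch.
    shot_feats: (T, D) array of shot features; selected: indices of chosen shots."""
    S = shot_feats[selected]                                   # features of Y
    # diversity: mean pairwise cosine dissimilarity over selected shots
    normed = S / np.linalg.norm(S, axis=1, keepdims=True)
    cos_sim = normed @ normed.T
    n = len(selected)
    diversity = (1.0 - cos_sim)[~np.eye(n, dtype=bool)].mean() if n > 1 else 0.0
    # representativeness: how close every shot is to its nearest selected shot
    dists = np.linalg.norm(shot_feats[:, None, :] - S[None, :, :], axis=2)
    representativeness = np.exp(-dists.min(axis=1).mean())
    return diversity + representativeness
```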

Crowd density reward. Users have two preferences about crowd density when browsing surveillance videos. First, shots that contain no people are useless and should be filtered out as redundant information. Second, the rate of change in the number of people (i.e., the rapid appearance or disappearance of crowds) is a point of interest. For the first preference, we penalize the summary with −5 for each shot that shows no people and reward it with +1 otherwise. The reward Rden_1 is accumulated as:

$$ {R}_{den\_1}=\left\{\begin{array}{c}{R}_{den\_1}+1,\kern0.5em if\left({\rho}_t>0\right)\\ {}{R}_{den\_1}-5,\kern0.5em otherwise\end{array}\right., $$
(18)

where ρt is the estimated people count of each selected video shot, calculated by [41], and \( t\in \mathcal{Y} \). Eq. 18 gives the summary a high Rden_1 score when people appear in every shot; in other words, Rden_1 discourages shots that contain no people from entering the summary. For the second preference, we introduce a reward Rden_2 based on the classical definition of the rate of change. The intuition is that, over the same time span, the larger the difference between ρt−1 and ρt, the higher the reward the model receives. We compute Rden_2 as the mean pairwise rate of change between adjacent selected shots:

$$ {R}_{den\_2}=\frac{1}{\left|\mathcal{Y}\right|-1}{\sum}_{t\in \mathcal{Y}}r\left({s}_{t-1},{s}_t\right), $$
(19)

where r(∙, ∙) is the rate of change function calculated by:

$$ r\left({s}_{t-1},{s}_t\right)=\frac{\left|{\rho}_t-{\rho}_{t-1}\right|}{\left({\rho}_t+{\rho}_{t-1}\right)}. $$
(20)

Hence, crowd density reward can be calculated as:

$$ {R}_{den}={R}_{den\_1}+{R}_{den\_2}, $$
(21)

Finally, Rloc and Rden complement each other and work jointly to guide the learning of our summarization model.
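A minimal sketch of the crowd density reward in Eqs. (18)-(21) is given below; the only input is the sequence of estimated people counts ρt of the selected shots, and the divide-by-zero guard is our addition.

```python
import numpy as np

def r_den(counts):
    """Crowd density reward of Eqs. (18)-(21), a minimal sketch.
    counts: estimated people counts (rho_t) of the selected shots, in temporal order."""
    counts = np.asarray(counts, dtype=float)
    # R_den_1 (Eq. 18): +1 for each shot containing people, -5 for an empty shot
    r_den_1 = np.where(counts > 0, 1.0, -5.0).sum()
    # R_den_2 (Eqs. 19-20): mean rate of change between adjacent selected shots
    prev, curr = counts[:-1], counts[1:]
    denom = prev + curr
    rates = np.abs(curr - prev) / np.where(denom > 0, denom, 1.0)  # avoid divide-by-zero
    r_den_2 = rates.mean() if len(rates) > 0 else 0.0
    return r_den_1 + r_den_2                                       # Eq. (21)
```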

The feature vector set X defined in Eq. 10 is used as the input of our summarization network in both training and testing, and the output of the network is a set of probability values, one per subshot. These probability values are used to evaluate each subshot, and we select the top 15% of subshots, in descending order of probability, as the summary. During training, Eq. 21 is used to evaluate the summary result (i.e., a sequence of shots), and the resulting signal trains the summarization model with policy gradient; the feature vector of a subshot is the mean of the feature vectors of its frames, which are recorded in the set X of Eq. 10. Models trained with different rewards are discussed in Section 5.
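The following sketch outlines one possible REINFORCE-style training step in the spirit of [53]; the Bernoulli sampling of key/non-key decisions and the `compute_reward` interface are our assumptions, not necessarily the exact procedure used in the paper.

```python
import torch

def select_and_update(model, features, optimizer, compute_reward, ratio=0.15):
    """One policy-gradient training step plus test-time shot selection (a sketch).
    `compute_reward` is assumed to combine the crowd location/density rewards."""
    probs = model(features)[0, :, 1]                  # probability of each shot being key
    dist = torch.distributions.Bernoulli(probs)
    actions = dist.sample()                           # sampled key / non-key decisions
    reward = compute_reward(features, actions)        # scalar reward for this episode
    loss = -(dist.log_prob(actions).sum() * reward)   # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # at test time, simply keep the top 15% of shots by predicted probability
    k = max(1, int(ratio * probs.numel()))
    summary_idx = torch.topk(probs, k).indices.sort().values
    return summary_idx
```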

4.4 Implementation details

We use GoogLeNet [44] trained on ImageNet [4] to extract frame features and train our summarization model with policy gradient. Adam [20] with a mini-batch size of 10 and an initial learning rate of 1e−4 is used as the optimizer. The dimension of the embedding space and of the hidden units of all LSTMs is 256, and the LSTMs are trained for 40 epochs. A sequence length \( \mathcal{L} \) between 25 and 60 per layer yields stable performance [51]; the lengths \( \mathcal{L} \) of the three LSTM layers are set to 30 in our experiments, so our model can handle frame sequences of up to 27,000 frames (30 × 30 × 30). There are two ways to deal with videos containing more than 27,000 frames: \( \mathcal{L} \) can be set to a larger value, or the video can be subsampled to meet the constraint, since consecutive frames share much redundant semantic information. Videos with fewer than 27,000 frames are padded with zeros.
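A minimal sketch of the second option, uniform subsampling (plus zero-padding of short videos) to fit the 27,000-frame capacity, is shown below; the function name and interface are illustrative.

```python
import numpy as np

def fit_to_capacity(features, capacity=27000):
    """Uniformly subsample or zero-pad a frame feature sequence so that its
    length matches the model capacity (30 x 30 x 30 frames)."""
    T, D = features.shape
    if T > capacity:                                   # subsample long videos
        idx = np.linspace(0, T - 1, capacity).astype(int)
        return features[idx]
    if T < capacity:                                   # pad short videos with zeros
        return np.vstack([features, np.zeros((capacity - T, D), dtype=features.dtype)])
    return features
```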

5 Experiments

To verify the effectiveness of the proposed approach, we test it on three public surveillance video datasets [8, 9, 52] and on web videos. We first compare our method with several baselines to show how the different rewards and the hierarchical summarization model contribute to the final performance (Section 5.4). We then compare our method with several state-of-the-art methods on short and long crowd surveillance videos, respectively (Section 5.5). The advantage of using density maps is also discussed in the experiments.

5.1 Datasets

We test our summarization model on three public crowd surveillance video datasets: UMN [8], PETS [7], and the WorldExpo'10 dataset [47]. The UMN dataset [8] consists of 11 videos whose content comprises several distinct crowd activities, such as wandering and scattering in all directions. The PETS dataset [7] comprises multi-sensor sequences containing crowd scenarios of increasing scene complexity; the main crowd behaviors include the crowd moving slowly or rapidly through the scene, standing still, and gathering or thinning. These two datasets are well suited to illustrating the performance of our method, because the crowd behaviors that form the main content of the videos differ greatly from one another, and a good summary preserves these differences while filtering out redundant information. We therefore use them to illustrate the differences among our method, the baseline methods, and the state-of-the-art methods. However, the videos in UMN and PETS are short (about 1-2 min), so the WorldExpo'10 dataset [47] is used to verify the effectiveness of our method on long videos. It contains 1127 one-minute video sequences from 103 scenes and 5 one-hour video sequences from 5 different scenes, all captured by 108 surveillance cameras at the Shanghai 2010 WorldExpo. The 5 one-hour sequences, separated into 15 consecutive, disjoint subsequences of about 20 min each, are used in our experiments. A notable benefit is that, as with the other two datasets [7, 8], the video content can be divided into several distinct crowd behaviors, which is useful for assessing how well a method retains the main content of crowd surveillance videos.

Cross-scene surveillance video summarization is important in practice because training a summarization model for each scene is time-consuming. To examine the advantages of using density maps in our model for cross-scene summarization, we downloaded several surveillance videos from YouTube as a supplementary training dataset. These downloaded videos are long and noisy, since a proportion of their frames is irrelevant to crowd scenes; we therefore segment the web videos using KTS [37] and filter out the noisy parts. In the end, 20 downloaded surveillance videos (each less than 4 min) serve as a training dataset in our experiments.

5.2 Evaluation setup

For a fair comparison among our method, the baseline methods, and the state-of-the-art methods, the keyshot-based metric proposed in [48] is used for evaluation. Let A be the generated keyshots, constrained to less than 15% of the duration of the original video, and B the user-annotated keyshots. The precision P and recall R are calculated as:

$$ P=\frac{\mathrm{duration}\ \mathrm{of}\ \mathrm{overlap}\ \mathrm{between}\ A\ \mathrm{and}\ B}{\mathrm{duration}\ \mathrm{of}\ A}, $$
(22)
$$ R=\frac{\mathrm{duration}\ \mathrm{of}\ \mathrm{overlap}\ \mathrm{between}\ A\ \mathrm{and}\ B}{\mathrm{duration}\ \mathrm{of}\ B}, $$
(23)

then, the harmonic mean F-score:

$$ F=2P\times \frac{R}{P+R}\times 100\%, $$
(24)

is used as the evaluation metric. The output of our method is the importance score pt of each keyshot in \( {\tau}_s^{\prime } \), but several methods [26, 43, 53] only provide key frame scores. To generate keyshots for a fair comparison, the videos are first temporally segmented into disjoint intervals of equal length (in frames) matching the keyshots in \( {\tau}_s^{\prime } \). The importance score of an interval is then calculated as the average score of the frames within it, the intervals are ranked by importance score, and the keyshots are selected from the ranked intervals, up to 15% of the duration of the original video.
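The keyshot-based metric of Eqs. (22)-(24) can be sketched as follows, where predicted and ground-truth keyshots are represented as frame intervals; this is a minimal illustration rather than the evaluation code of [48].

```python
def keyshot_f_score(pred_shots, gt_shots):
    """Keyshot-based F-score of Eqs. (22)-(24), a minimal sketch.
    Shots are (start_frame, end_frame) intervals; overlap is measured in frames."""
    def total(shots):
        return sum(e - s for s, e in shots)
    overlap = 0
    for ps, pe in pred_shots:
        for gs, ge in gt_shots:
            overlap += max(0, min(pe, ge) - max(ps, gs))
    p = overlap / total(pred_shots) if total(pred_shots) else 0.0   # Eq. (22)
    r = overlap / total(gt_shots) if total(gt_shots) else 0.0       # Eq. (23)
    return 2 * p * r / (p + r) * 100 if (p + r) else 0.0            # Eq. (24)
```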

Although evaluation against ground-truth labels relies on human judgments, a standard annotation protocol is described in [12, 42]. We create the ground-truth set for the surveillance video datasets following this protocol. Before the task, each video is segmented into uniform-length shots to capture local context with good visual coherence; the shot length is empirically set to two seconds. The shots are then clustered using k-means (k = video length in seconds / 10), and the shots within each cluster are presented in random order to prevent chronological bias [2], i.e., the tendency of humans to assign higher scores to shots that appear earlier in a video. During the task, participants were asked to assign each shot an importance score from 1 to 5, where 5 indicates that the shot represents the crowd activity very well and 1 indicates no crowd activity. In addition, the frequencies of scores 5, 4, 3, and 2 in the ground truth were constrained to lie between 1% and 5%, 5% and 10%, 10% and 20%, and 20% and 40%, respectively, with score 1 receiving the remainder, to ensure a score distribution appropriate for generating summaries.
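A minimal sketch of the shot-clustering step of this annotation protocol is given below; the use of scikit-learn's KMeans, the averaging of frame features per shot, and the function interface are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_shots_for_annotation(frame_feats, fps, shot_seconds=2, rng_seed=0):
    """Segment a video into 2-second shots, cluster them with k-means
    (k = video length in seconds / 10), and shuffle shots within each cluster."""
    shot_len = int(shot_seconds * fps)
    n_shots = len(frame_feats) // shot_len
    shot_feats = np.array([frame_feats[i * shot_len:(i + 1) * shot_len].mean(axis=0)
                           for i in range(n_shots)])
    k = max(1, int(len(frame_feats) / fps / 10))
    labels = KMeans(n_clusters=k, n_init=10, random_state=rng_seed).fit_predict(shot_feats)
    rng = np.random.default_rng(rng_seed)
    clusters = {c: rng.permutation(np.where(labels == c)[0]).tolist() for c in range(k)}
    return clusters        # shot indices per cluster, in random order for annotators
```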

5.3 Baselines and comparison

To clarify the performance of our method, we set up several baseline models. To investigate how much the different rewards contribute to the hierarchical summarization network, we train baselines with Rloc only and Rden only, denoted L-HSN and D-HSN, respectively; the model trained with both rewards is denoted LD-HSN. Furthermore, our hierarchical summarization model with three LSTM layers is suited to input videos of about 20-30 min, but the final subsequences of X must be padded with many zeros when the input video is shorter than 5 min (such as the videos in [7, 8]). An efficient alternative for short videos is a hierarchical model with two LSTM layers, i.e., the term τs_t in Eq. (16) is replaced by τf_t. The three-layer model is denoted HSN3 and the two-layer model HSN2. The baselines L-HSN2, D-HSN2, LD-HSN2, and LD-HSN3 are discussed in the experiments. To verify our hierarchical summarization network, we also replace it with DSN [53], which consists of a bidirectional recurrent neural network and a fully connected layer; this baseline is denoted LD-DSN.

For comparison with other approaches, we report results of surveillance video summarization methods [26, 43], DPP-LSTM [42], DSN [53], and GAN-based methods [33].

5.4 Comparison with baselines

Qualitative evaluation. We compare our method LD-HSN with the two baselines L-HSN and D-HSN on the UMN [8] and PETS [7] datasets to investigate how much the different rewards contribute to the model. Qualitative evaluation on these two datasets makes it easier to understand the differences between the baselines and our method. Each of the three models (denoted L-HSN2, D-HSN2, LD-HSN2) consists of two LSTM layers, because the videos in UMN and PETS are short. Fig. 3 provides qualitative results for two example videos, from UMN (a1-a3) and PETS (b1-b3).

Fig. 3 Qualitative evaluation of our model LD-HSN2 and the two baselines L-HSN2 and D-HSN2 on the UMN (a1-a3) and PETS (b1-b3) datasets, respectively. The light-gray, light-blue, and light-yellow bars in (a1) to (b3) show ground-truth importance scores over different temporal intervals, while the colored areas mark the parts selected by the different models

The main content of the example video from UMN (a1-a3 in Fig. 3) consists of two parts: the crowd wanders around in temporal interval Part I and swarms in from all directions in temporal interval Part II. As shown in (a1), the shots selected using only the reward Rloc are consistently closer to the ground truth in Part I than in Part II. Moreover, shots that contain no information about people in Part II may still be selected, because the reward Rloc is mainly sensitive to changes in crowd position.

For a more comprehensive and accurate summary of the crowd behaviors in the video, the crowd density reward Rden is used as a supplement. As shown in (a2), most of the selected frames fall into temporal interval Part II, because Rden is designed to capture changes in the number of people. In addition, Rden_1 effectively prevents the model from selecting shots without any people as part of the summary. The summary produced by LD-HSN2 (a3 in Fig. 3) is much closer to the ground truth in both parts, because LD-HSN2 captures changes in crowd position and density simultaneously.

As discussed above, the purpose of our method is to summarize the crowd behaviors that arise from changes in crowd position and density in video sequences, and shots containing no such changes are filtered out as redundant information. As shown in Fig. 3 (b1)-(b3), the main content of the example video from PETS consists of three parts: the crowd gathers or swarms in from all directions in temporal intervals Part I and Part III, respectively, while remaining still in temporal interval Part II. From the results in (b1)-(b3), we observe that almost all selected key shots fall into Part I and Part III, the peak regions of the ground truth are almost all captured by LD-HSN2, and shots without significant crowd changes in Part II are filtered out as redundant.

Quantitative evaluation. We compare our method with several baselines to investigate the different hierarchical summarization networks and rewards on the PETS [7], UMN [8], and WorldExpo'10 [47] datasets. Videos in UMN and PETS last 1-2 min, and the one-hour videos in WorldExpo'10 are separated into 15 sub-videos of about 20 min each. According to their hierarchical structure, the summarization models in Table 1 fall into three types: single-layer (LD-DSN), two-layer (L-HSN2, D-HSN2, and LD-HSN2), and three-layer (L-HSN3, D-HSN3, and LD-HSN3).

Table 1 F-scores of different variants of our method on PETS, UMN and WorldExpo’10

From Table 1, the two-layer model LD-HSN2 and the three-layer model LD-HSN3 perform significantly better than the single-layer model LD-DSN on the long-video dataset WorldExpo'10, which indicates that the hierarchical summarization network captures more crowd changes in long surveillance video sequences. LD-HSN2 and LD-HSN3 are also slightly better than LD-DSN on the short-video datasets UMN and PETS, showing that multi-layer LSTM models have advantages for short-video summarization as well. Comparing LD-HSN3 with LD-HSN2, the two models perform similarly on short surveillance sequences (48.4 vs. 48.3 on UMN and 47.6 vs. 47.6 on PETS) but differ on long sequences (39.6 vs. 37.1 on WorldExpo'10). This indicates that the three-layer hierarchical network improves summary performance on long crowd surveillance videos (about 20 min in our experiments).

Table 1 also shows that the summarization model LD-HSNn, trained jointly with the crowd location and crowd density rewards, outperforms the models trained with the crowd location reward alone (L-HSNn) or the density reward alone (D-HSNn). This demonstrates that using Rloc and Rden jointly trains our summarization model HSN to produce higher-quality summaries.

5.5 Comparison with state-of-the-art

We compare our method with the current state-of-the-art, including surveillance video summarization methods [9, 29] and unsupervised deep learning based methods [33, 48, 53].

Comparison with surveillance video summarization methods. The aim of [26, 43] is to summarize predefined abnormal events, so we compare our method with the more general surveillance video summarization approaches instead. Gao et al. [9] and their previous work [29] proposed dissimilarity-based surveillance video summarization methods. The intuition behind their work is to detect changes in a surveillance video sequence; accordingly, it consists of two key processes (Fig. 4 a2 and b2): measuring the dissimilarity between frames and localizing the local dissimilarity peaks.

Fig. 4 Comparison between our method and the dissimilarity-based method [9]. The blue curves in (a2) and (b2) show temporal dissimilarity, while the red straight lines are used to localize the local dissimilarity peaks

As shown in Fig. 4, we compare our method with the dissimilarity-based method [9] on the same example videos as in Fig. 3. Although their approach filters out frames that contain no changes well (Fig. 4 b2, where the frames in temporal interval Part II are almost all filtered out), it uses low-level features, such as color histograms, to measure the dissimilarity between frames. This is suitable for summarizing moving targets in sparse-target environments but cannot capture changes in crowd behavior accurately: for example, the background without moving targets is preserved in Fig. 4 (a2), while almost all frames in Part I are discarded. Table 2 shows that our method clearly outperforms [9, 29] on the three datasets, demonstrating that our method better captures the shots that outline the main crowd behaviors within a limited frame budget.

Table 2 F-scores of unsupervised deep learning based approaches and dissimilarity-based approach on PETS, UMN and WorldExpo’10

Comparison with deep learning based methods. Table 2 gives the quantitative comparison among our methods (LD-DSN, LD-HSN2, and LD-HSN3), the state-of-the-art unsupervised deep learning based methods [33, 48, 53], and the dissimilarity-based surveillance video summarization methods [9, 29] on the three datasets. Eighty percent of the data is used for training and the rest for testing. Benefiting from deep features, the unsupervised deep learning based methods (including ours) clearly outperform the dissimilarity-based method [9] on UMN and WorldExpo'10. The dissimilarity-based methods [9, 29] score well on PETS, however, because color differences between frames can reflect changes in crowd behavior to some extent.

Comparing the most competitive deep summarization model, DR-DSN, trained with the diversity-representativeness (DR) reward, with our baseline LD-DSN (44.6 vs. 47.1 on PETS, 43.3 vs. 46.7 on UMN, and 32.1 vs. 33.8 on WorldExpo'10) demonstrates that the crowd location and crowd density rewards help train the summarization model to better capture the representative behaviors of the crowd.

Cross-scene surveillance video summarization. The most straightforward way to obtain good summaries across different monitoring scenes is to retrain the summarization model for each scene, but this is time-consuming. An ideal summarization model should generate good results without repeated training. In this section, a quantitative comparison illustrates that our method does not require retraining for different monitoring scenes. Table 3 compares our method with the state-of-the-art unsupervised deep learning based methods when the training data come from YouTube and the test datasets are PETS, UMN, and WorldExpo'10, respectively. Our method maintains stable performance, while the performance of the other deep learning based methods drops markedly, because the density map used in our method filters out the background information.

Table 3 F-scores of unsupervised deep learning based approaches on PETS, UMN and WorldExpo’10 under the condition of cross-scene

6 Conclusion

In this paper, we present an RL-based unsupervised method for summarizing crowd surveillance videos. Our goal is to preserve distinct crowd behaviors while filtering out redundant shots in the summary. To this end, a crowd location-density reward is used to teach our model to produce high-quality summaries. Compared with dissimilarity-based surveillance video summarization methods and deep learning based methods, our method better captures the shots that outline the main crowd behaviors within a limited frame budget. In addition, our hierarchical network maintains performance on long (20-min) crowd surveillance videos.

In the future, we will explore more crowd behavior patterns that could be used in surveillance video summarization, which will expand the application scope of our method.