1 Introduction

With the development of multimedia and information technology, huge amounts of videos are uploaded to and downloaded from the Internet every day, which has spawned a great deal of research on videos and images, such as video/image classification [57], video/image retrieval [38, 41, 56, 58], video segmentation [13, 40], video annotation [12, 39] and video captioning [11, 37]. As a bridge between computer vision and natural language, video captioning has become a hot research topic in recent years. Moreover, describing video content with natural language is becoming a key component for improving human-robot interaction and artificial intelligence. To date, the main trend is to extract features from videos and then translate them into natural language sentences; researchers [18, 19, 21, 43] therefore focus on two sub-problems: 1) how to efficiently extract video features; and 2) how to accurately translate video features into sentences with Recurrent Neural Networks (RNNs).

Specifically, [18, 43] first identified video semantic content and then generated sentences based on templates. Lee et al. [21], Farhadi et al. [9] and Rohrbach et al. [31] addressed the problem with probabilistic graphical models (PGMs), utilizing Markov Random Fields (MRFs) or Conditional Random Fields (CRFs) to find the relationship between visual content and natural language. With the great success achieved in image classification by deep convolutional neural networks (e.g. GoogLeNet [42], VGG [36] and ResNet [16]), these networks provide researchers with powerful tools to extract image/video features in different fields, such as video captioning [26, 27] and video action recognition [23, 49]. In general, the basic video captioning framework adopts pre-trained deep CNNs (e.g. ResNet or C3D) to extract spatial and/or temporal features, and then applies an RNN (e.g. LSTM [34], GRU [5] or their extensions) to generate words.

Furthermore, Mnih et al. [25] proposed a novel recurrent neural network model that extracts information from images by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Their experimental results showed that it significantly outperforms a convolutional neural network baseline on a dynamic visual control problem. This strategy is known as visual attention. In fact, the basic idea of the attention mechanism is to selectively focus on the important information while ignoring the unimportant information as much as possible. Therefore, the first step of attention is to estimate which part is important and assign a higher weight to it. Inspired by its great success, a variety of visual attention models have been proposed [37, 51]. For example, [37, 51] introduced temporal attention to enhance video captioning by learning to weight frames differently, so that the most relevant temporal segments are selected during training. However, for a video, the temporal variance exists between sub-shots rather than between adjacent frames. From Figure 1, we can see that all of the frames are important for describing the video, but many of them are redundant. The top three frames describe a man who is holding a gun, while the bottom three frames describe the shooting action. Weighting each frame would incur excessive computational cost and result in low accuracy.

Figure 1

We extract six frames from a video, which contain distinct appearance information. However, the top three frames do not contain much variance, nor do the bottom three frames

Here, we argue that long-range temporal structure plays an important role in understanding dynamics for video captioning. However, mainstream video captioning frameworks [37, 51] usually focus on appearances and short-term motions, and thus lack the capacity to incorporate long-range temporal structure. In this paper, we aim to study the following problem: how to design an effective and efficient video-level framework for learning video representations that capture long-range temporal structure for improving video captioning. Moreover, in terms of long-range temporal structure modeling, a key observation is that consecutive frames are highly invariant [23, 49] (Figure 1), thus it is unnecessary to feed densely sampled frames directly into LSTMs. Therefore, we propose a novel framework, namely the temporal and spatial LSTM (TS-LSTM), which first uses a temporal pooling (TP) layer to keep the temporal invariance within a short video shot, and then a Long Short-Term Memory (LSTM) [34] to exploit the temporal dynamics between long-range video shots. In addition, a stacked Long Short-Term Memory (Stack-LSTM) is adopted to generate words in the final stage. This framework employs representations from spatial and temporal features to enhance video captioning. The contributions of this paper are as follows:

  • Given spatial and motion feature representations over time, we propose to integrate a temporal pooling layer and an LSTM to learn both temporal invariance and variance. This mechanism fuses high-level spatial and temporal features to learn long-range temporal dynamics over the whole video.

  • We introduce the TS-LSTM video captioning framework, which integrates the TP-LSTM with a mean pooling layer and a stacked LSTM to automatically generate words for describing a video. Specifically, the mean pooling is applied to a concatenation of visual features, motion features and long-term dynamics to extract useful information for the decoding process. In addition, inspired by the two-stream framework [10, 35], which has achieved great results in video action recognition, we adopt a fine-tuned ResNet-152 [16] to extract the temporal features. Compared with C3D [44] features, using ResNet-152 features achieves better results.

  • We perform experiments on two video captioning datasets, namely MSVD [4] and MSR-VTT [50], to verify the effectiveness of our method. The experimental results show that our method outperforms existing approaches.

2 Related work

2.1 Deep convolutional neural network

In the field of deep learning, deep convolutional neural networks (CNNs) have been widely applied to explore visual information, such as image recognition [20], object detection [30] and image retrieval [41]. From LeNet [20] to ResNet [16], the performance of such models has greatly improved on the task of image classification; in particular, ResNet-152 [16] even surpasses human-level performance. As a result, many researchers employ these networks to improve the performance of their tasks. For example, Feichtenhofer et al. [35] fine-tuned VGG [36] to improve performance on video action recognition, and Yao et al. [51] used a pre-trained GoogLeNet [42] to extract features for video captioning. Motivated by previous works [22, 37], we use ResNet-152 to extract both spatial and temporal information. In addition, all the above-mentioned deep CNNs contain pooling layers, which are commonly used to reduce the spatial size and alleviate over-fitting. Moreover, Scherer et al. [33] showed that pooling layers can provide spatial invariance; we therefore integrate a temporal pooling layer to exploit the temporal invariance within a short video snippet in this paper.

2.2 Recurrent neural networks

Compared with CNNs, Recurrent Neural Networks (RNNs) are good at modeling sequential data, thus they have been widely utilized in natural language processing and achieved great success [8, 17]. At each time step, an RNN observes an element and updates its internal states. In the field of speech recognition, the RNN Language Model (RNNLM) [24] models the output distribution by adding a softmax layer on top of the hidden states; its parameters are learned by maximizing the log-likelihood with gradient-based optimization.

However, the above-mentioned RNNs suffer from the “long-term dependencies” problem [2]. LSTM [34] is designed for learning long-term dependencies: it adds gates that explicitly allow the network to learn when to forget previous hidden states via the “forget gate” and when to update hidden states given new inputs. Previous studies showed that LSTM is capable of modeling data sequences, especially for encoding sentences and video features. Therefore, in this paper, we choose LSTM as the basic component for video captioning.

2.3 Video captioning

As a bridge connecting computer vision and natural language processing, video captioning has attracted great attention in both areas. How to automatically generate descriptions of images or videos is an old topic in computer vision [9, 19, 21]. For example, Kojima et al. [19] first detected human postures, including head positions, head directions and hand positions, then selected several predicates and objects with domain knowledge, and finally filled these syntactic components into case frames and translated the case frames into sentences with syntactic rules. A similar strategy has also been utilized to enhance other multimedia applications, such as [15, 18].

Later on, some researchers tried to describe videos/images with probabilistic graphical models [9, 21, 31]. For instance, Farhadi et al. [9] constructed three spaces: an image space, a sentence space and a meaning space. In order to find the relationship between images and the corresponding sentences, they projected both the image and sentence spaces into a common space: the meaning space. Specifically, the meaning space was represented by a triplet denoted as <object, action, scene>. Mapping the image space to the meaning space was reduced to predicting the triplets from images, while mapping the sentence space into the meaning space was conducted by extracting triplets from sentences and then computing the similarity between two triplets. In addition, Rohrbach et al. [31] explored the relationship between visual contents and semantic representations with a Conditional Random Field (CRF). However, all of these methods are highly dependent on sentence templates, which are insufficient to model the richness of natural language.

Recently, inspired by the great success of deep learning, many researchers [14, 27, 46, 51] applied deep neural networks to solve the video captioning problem. Specifically, Venugopalan et al. [46] employed a stacked LSTM to generate descriptions effectively: the first LSTM encodes the visual features from pre-trained CNNs and the second LSTM generates words. Pan et al. [27] leveraged the semantics of both the entire sentence and the video content to learn a visual-semantic embedding model. Some works [22, 28] showed that semantic attributes make a significant contribution to video captioning. Pan et al. [28] adopted Multiple Instance Learning (MIL) to learn semantic attributes from videos, then utilized the generated attributes to improve the performance of their models. Compared with mean pooling, [37, 51, 54] tackled video captioning with attention mechanisms. Yao et al. [51] introduced a temporal soft attention mechanism into video captioning to automatically select the most relevant frames. Yu et al. [54] introduced a supervised spatial attention mechanism to guide the model to learn the relevant spatial information for video captioning. Different from the above works, we focus on extracting more informative video features by exploiting the long-range temporal structure.

3 The proposed approach

In this section, we introduce our approach for video captioning. Firstly, we define the terms and notations. Next, we describe our proposed network. Finally, we introduce the loss function of our model.

3.1 Terms and notations

Given a video V, we extract its features \( {\mathbf {V}} = \left \{ {v_{1}, v_{2}, \cdots, v_{i}, \cdots, v_{N_{v}}} \right \} \in \mathbb {R}^{D_{v} \times N_{v}} \), where Dv denotes the dimension of the visual features and Nv denotes the number of frames sampled from the video. A sentence \({\mathbf {S}} = \left \{ {s_{1}, s_{2}, \cdots, s_{i}, \cdots, s_{N_{s}}} \right \} \in \mathbb {R}^{D_{s} \times N_{s}} \) consists of Ns words describing the video, where si is a one-hot vector and Ds is the size of the dictionary. We denote < BOS > as the start of a sentence. Our framework is shown in Figure 2 and consists of six major components. The first component is a spatial ResNet-152 network, which takes RGB frames as inputs and extracts visual features from each video frame, while the second component is a temporal ResNet-152 network, which takes optical flow as input and produces temporal features for each frame. The third component concatenates the outputs of the spatial and temporal ResNet-152 networks. Then, the TP-LSTM takes a set of these concatenations as inputs with a temporal pooling strategy. Next, a second concatenation integrates the visual features, temporal features and the outputs of the TP-LSTM into a new video representation. The last component is a stacked LSTM, which takes the new video representation and words to produce a natural language sentence.

Figure 2

The framework of our model. TP-LSTM explores the invariance and variance in the video, while a Stack-LSTM is applied to generate words for describing the video
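To make the data flow concrete, the following is a minimal PyTorch-style sketch of the pipeline in Figure 2. It is our own illustrative rendering, not the authors' released code: module names, dimensions and the use of torch.nn.LSTM in place of the hand-written cells are assumptions, and the two ResNet-152 feature extractors are assumed to run offline.

# Minimal sketch of the TS-LSTM data flow (assumed dimensions; the paper used Theano).
import torch
import torch.nn as nn

class TSLSTM(nn.Module):
    def __init__(self, feat_dim=4096, hid=512, vocab=13626, n_seg=3):
        super().__init__()
        self.n_seg = n_seg
        self.tp_lstm = nn.LSTM(feat_dim, hid, batch_first=True)              # long-range dynamics
        self.embed = nn.Embedding(vocab, hid)                                 # word embedding (Eq. 5)
        self.sent_lstm = nn.LSTM(hid, hid, batch_first=True)                  # sentence LSTM (Eq. 6)
        self.m_lstm = nn.LSTM(hid + feat_dim + hid, hid, batch_first=True)    # stand-in for the M-LSTM (Eq. 7)
        self.out = nn.Linear(hid, vocab)                                      # softmax logits (Eq. 8)

    def forward(self, feats, words):
        # feats: (B, N_v, feat_dim) concatenated spatial + temporal features
        # words: (B, N_s) word indices starting with <BOS>; assumes N_v divisible by n_seg
        B, Nv, D = feats.shape
        pooled = feats.view(B, self.n_seg, Nv // self.n_seg, D).mean(2)       # temporal pooling (Eq. 1)
        H, _ = self.tp_lstm(pooled)
        y = torch.cat([feats.mean(1), H.mean(1)], dim=1)                      # Eq. 4 + concatenation
        Q, _ = self.sent_lstm(self.embed(words))
        y_rep = y.unsqueeze(1).expand(-1, Q.size(1), -1)
        Hp, _ = self.m_lstm(torch.cat([Q, y_rep], dim=2))                     # fuse words with the video code
        return self.out(Hp)                                                   # (B, N_s, vocab) logits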

3.2 Temporal pooling LSTM

How to extract effective visual features is an important problem for analyzing videos. Owing to the rapid development of deep convolutional neural networks (CNNs), which have achieved great success in image classification [16], object detection [30] and video action recognition [35], it is common to apply deep CNNs to extract visual features. In this work, we use a ResNet-152 pre-trained on ImageNet to extract visual features from video frames. In addition, a video contains not only spatial information but also temporal information. Therefore, we utilize another, fine-tuned ResNet-152, which takes optical flow images as inputs, to extract the temporal features of the video. After that, we concatenate the two kinds of features. In order to model the invariance and variance of the input video, we propose a temporal pooling LSTM to process the fused features. More specifically, we divide the features into Ne parts along the temporal dimension, so that each part has Nk = Nv/Ne features. Next, we average the features within each part. This process is expressed as follows:

$$ e_{i} = \frac {{\sum}_{j=(i-1)\times{N_{k}}+1}^{i \times {N_{k}}} v_{j}}{N_{k}} \ \ \ \ \ i \in \{1,2,...,N_{e}\} $$
(1)

\({\mathbf {E}} = \left \{ {e_{1}, e_{2}, \cdots, e_{i}, \cdots, e_{N_{e}}}\right \} \in \mathbb {R}^{D_{v} \times N_{e}} \) is generated after the temporal pooling.
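Concretely, Eq. (1) amounts to splitting the Nv feature columns into Ne contiguous segments and averaging within each segment. Below is a minimal numpy sketch, assuming Nv is divisible by Ne:

import numpy as np

def temporal_pool(V, n_seg):
    """Eq. (1): average a D_v x N_v feature matrix within N_e contiguous segments."""
    d_v, n_v = V.shape
    assert n_v % n_seg == 0, "assumes N_v is divisible by N_e"
    n_k = n_v // n_seg
    # split the time axis into N_e blocks of N_k frames and average each block
    return V.reshape(d_v, n_seg, n_k).mean(axis=2)   # D_v x N_e

V = np.random.randn(4096, 30)   # e.g. D_v = 4096, N_v = 30
E = temporal_pool(V, n_seg=3)   # D_v x 3, matching N_e = 3 used in the paper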

In the next step, we aim to extract long-term dynamics across a video by applying a Recurrent Neural Network (RNN) on E. As mentioned above, we employ LSTM to model the long-term temporal dynamics of E. The structure of LSTM is described below:

$$\begin{array}{@{}rcl@{}} f_{t} &=& \sigma(W_{xf} e_{t} + W_{hf} h_{t-1} + b_{f}) \\ i_{t} &=& \sigma(W_{xi} e_{t} + W_{hi} h_{t-1} + b_{i}) \\ o_{t} &=& \sigma(W_{xo} e_{t} + W_{ho} h_{t-1} + b_{o}) \\ g_{t} &=& \phi(W_{xg} e_{t} + W_{hg} h_{t-1} + b_{g}) \\ c_{t} &=& f_{t} \odot c_{t-1} + i_{t} \odot g_{t} \\ h_{t} &=& o_{t} \odot \phi \left( {{c_{t}}} \right) \end{array} $$
(2)

where σ(⋅) denotes the sigmoid function, ϕ(⋅) denotes the hyperbolic tangent function, and ⊙ denotes element-wise multiplication. ct is the cell state vector and ht is the hidden state vector. The W terms are weight matrices and the b terms are bias vectors. For convenience, we define this function as:

$$ h_{t}, c_{t} = LSTM(e_{t}, h_{t-1}, c_{t-1};W,b) \ \ \ \ \ t \in \{1,...,N_{e}\} $$
(3)

where et is the input at the t-th time step, and h0, c0 are initialized vectors. In our model, we use ht as the output of the LSTM. After Ne time steps, we obtain \({\mathbf {H}} = \left \{ {h_{1}, h_{2}, \cdots, h_{i}, \cdots, h_{N_{e}}}\right \} \in \mathbb {R}^{D_{h} \times N_{e}} \), where Dh is the output dimension of the LSTM. Next, we average the outputs of the LSTM and the visual features, respectively:

$$\begin{array}{@{}rcl@{}} \overline{v} = \frac{{\sum}^{N_{v}}_{i = 1} v_{i}}{N_{v}} \ \ \ \ \ v_{i} \in \mathbf{V} \\ \overline{h} = \frac{{\sum}^{N_{e}}_{i = 1} h_{i}}{N_{e}} \ \ \ \ \ h_{i} \in \mathbf{H} \end{array} $$
(4)

Then, we concatenate them as \(y = [\overline {v},\overline {h}]\) and feed the result into our Stack-LSTM.
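Putting Eqs. (1)–(4) together, the encoder can be sketched in a few lines of PyTorch; here torch.nn.LSTM plays the role of the cell in Eq. (2), and the batch dimension and concrete sizes are our own assumptions:

import torch
import torch.nn as nn

d_v, n_v, n_e, d_h = 4096, 30, 3, 512          # assumed dimensions
lstm = nn.LSTM(input_size=d_v, hidden_size=d_h, batch_first=True)   # Eqs. (2)/(3)

V = torch.randn(1, n_v, d_v)                   # one video: N_v fused frame features
E = V.view(1, n_e, n_v // n_e, d_v).mean(2)    # temporal pooling, Eq. (1)
H, _ = lstm(E)                                 # hidden states h_1 .. h_{N_e}

v_bar = V.mean(dim=1)                          # Eq. (4): mean of visual features
h_bar = H.mean(dim=1)                          # Eq. (4): mean of LSTM outputs
y = torch.cat([v_bar, h_bar], dim=1)           # concatenated video code fed to the Stack-LSTM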

3.3 Stacked LSTM

In order to reduce the dimension of the one-hot vectors and explore their semantic information, we follow previous works [28, 37, 53] and embed each one-hot vector into a low-dimensional vector as follows:

$$ \mathbf{M} = {W_{s} \mathbf{S}} $$
(5)

where \(W_{s} \in \mathbb {R}^{D_{m} \times D_{s}} \) is a parameter matrix. After embedding, we obtain an embedding matrix \({\mathbf {M}} = \left \{ {m_{1}, m_{2}, \cdots, m_{i}, \cdots, m_{N_{s}}}\right \} \in \mathbb {R}^{D_{m} \times N_{s}} \).
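In practice, multiplying Ws by one-hot columns simply selects columns of Ws, so Eq. (5) can be implemented as an embedding lookup. A small sketch, where the sizes and word indices are hypothetical:

import torch
import torch.nn as nn

d_s, d_m = 13626, 512                         # assumed dictionary and embedding sizes
embed = nn.Embedding(d_s, d_m)                # rows of embed.weight play the role of W_s columns

word_ids = torch.tensor([[2, 57, 981, 3]])    # a sentence as word indices (hypothetical ids)
M = embed(word_ids)                           # (1, N_s, D_m), i.e. m_1 .. m_{N_s} of Eq. (5)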

Then we use LSTM layers to explore semantic information from both sentences and videos. Donahue et al. [7] suggested that two LSTM layers are better than one or four layers for image captioning. In contrast to their design, our first LSTM layer is used to encode the sentence information, while the second LSTM layer fuses both sentence and visual information to obtain semantic features. More specifically, we first use a standard LSTM to explore the relationship between words:

$$ q_{t}, u_{t} = LSTM(m_{t}, q_{t-1}, u_{t-1};W_{q},b_{q}) \ \ \ \ \ t \in \{1,...,N_{s}\} $$
(6)

where q0 and u0 are initialized vectors, and Wq and bq are parameters. After Ns time steps, we obtain a series of vectors \({\textbf {Q}} = \left \{ {q_{1}, q_{2}, \cdots, q_{i}, \cdots, q_{N_{s}}}\right \} \in \mathbb {R}^{D_{q} \times N_{s}} \), which contain the temporal information of the sentence. Next, we use a multi-modal LSTM (M-LSTM), which incorporates features from different information sources (i.e., video and words) into a set of higher-level representations. The M-LSTM integrates visual and word information into latent semantic features by adjusting their weights to improve video captioning performance. Its structure is described as follows:

$$\begin{array}{@{}rcl@{}} f^{\prime}_{t} &=& \sigma(W^{\prime}_{xf} q_{t} + W^{\prime}_{hf} h^{\prime}_{t-1} + W^{\prime}_{yf} y + b^{\prime}_{f}) \\ i^{\prime}_{t} &=& \sigma(W^{\prime}_{xi} q_{t} + W^{\prime}_{hi} h^{\prime}_{t-1} + W^{\prime}_{yi} y + b^{\prime}_{i}) \\ o^{\prime}_{t} &=& \sigma(W^{\prime}_{xo} q_{t} + W^{\prime}_{ho} h^{\prime}_{t-1} + W^{\prime}_{yo} y + b^{\prime}_{o}) \\ g^{\prime}_{t}&=& \phi (W^{\prime}_{xg} q_{t} + W^{\prime}_{hg} h^{\prime}_{t-1} + W^{\prime}_{yg} y + b^{\prime}_{g}) \\ c^{\prime}_{t} &=& f^{\prime}_{t} \odot c^{\prime}_{t-1} + i^{\prime}_{t} \odot g^{\prime}_{t} \\ h^{\prime}_{t}&=& o^{\prime}_{t} \odot \phi(c^{\prime}_{t}) \end{array} $$
(7)

where \(W^{\prime }_{*}\) and \(b^{\prime }_{*}\) are the parameters to be learned, and y is the concatenated feature defined after (4). \(h^{\prime }_{0} \in \mathbb {R}^{D_{h^{\prime }} \times 1}\) and \(c^{\prime }_{0} \in \mathbb {R}^{D_{h^{\prime }} \times 1}\) are initialized vectors. Finally, we use a softmax layer to estimate the conditional probability distribution over st+ 1:

$$ P(s_{t + 1}|s_{<t},\mathbf{V}) = softmax(W_{f} h^{\prime}_{t} + b_{f}) \ \ \ \ \ t \in \{1,...,N_{s}\} $$
(8)

where \(W_{f} \in \mathbb {R}^{D_{s} \times D_{h^{\prime }}}\) and \(b_{f} \in \mathbb {R}^{D_{s}}\) are the parameters. If the input is represented as \(x\in \mathbb {R}^{D_{s} \times 1}\), the softmax function can be expressed as:

$$ softmax(x_{i}) = \frac{e^{x_{i}}}{{\sum}_{j = 1}^{D_{s}} e^{x_{j}}}\ \ \ \ \ i \in \{1,...,D_{s}\} $$
(9)
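Since Eq. (7) differs from a standard LSTM only by the additional per-gate term W′y∗ y, one decoding step can be sketched as below. This is a hedged PyTorch illustration with assumed sizes; note that injecting y into every gate is equivalent to concatenating y with qt and using an ordinary LSTM cell.

import torch
import torch.nn as nn

class MLSTMCell(nn.Module):
    """One step of the multi-modal LSTM of Eq. (7): gates see q_t, h'_{t-1} and y."""
    def __init__(self, d_q, d_y, d_h):
        super().__init__()
        self.W_x = nn.Linear(d_q, 4 * d_h, bias=False)   # W'_x* q_t
        self.W_h = nn.Linear(d_h, 4 * d_h, bias=False)   # W'_h* h'_{t-1}
        self.W_y = nn.Linear(d_y, 4 * d_h, bias=True)    # W'_y* y + b'_*

    def forward(self, q_t, y, h_prev, c_prev):
        gates = self.W_x(q_t) + self.W_h(h_prev) + self.W_y(y)
        f, i, o, g = gates.chunk(4, dim=-1)
        f, i, o, g = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o), torch.tanh(g)
        c_t = f * c_prev + i * g
        h_t = o * torch.tanh(c_t)
        return h_t, c_t

d_q, d_y, d_h, d_s = 512, 4608, 512, 13626            # assumed sizes
cell, out = MLSTMCell(d_q, d_y, d_h), nn.Linear(d_h, d_s)   # out plays the role of W_f, b_f in Eq. (8)

q_t, y = torch.randn(1, d_q), torch.randn(1, d_y)
h, c = torch.zeros(1, d_h), torch.zeros(1, d_h)
h, c = cell(q_t, y, h, c)
p_next = torch.softmax(out(h), dim=-1)                # Eqs. (8)/(9): distribution over s_{t+1}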

3.4 Loss function

Previous works [46, 47, 53] defined their loss functions based on maximum likelihood estimation (MLE). In this work, we follow them and define our objective as the log-likelihood:

$$\begin{array}{@{}rcl@{}} \mathcal{L} &=& \log P(\mathbf{S}|\mathbf{V}) \\ &=&{\sum}_{t = 1}^{N_{s}} {s^{T}_{t}} \log P(s_{t}|s_{<t},\mathbf{V}) \end{array} $$
(10)

By maximizing this objective, we estimate the parameters of the whole model. After extracting features with deep CNNs, we simultaneously train the rest of the model (i.e., the TP-LSTM, the mean pooling and concatenation, and the Stack-LSTM in Figure 2). More specifically, we use the back-propagation through time (BPTT) algorithm to compute the gradients and conduct the optimization with Adadelta [55].
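Because each st is a one-hot vector, Eq. (10) is simply the sum of the log-probabilities assigned to the ground-truth words. A minimal sketch, with a padding mask added as our own assumption:

import torch
import torch.nn.functional as F

def caption_log_likelihood(logits, targets, mask):
    """Eq. (10): sum_t s_t^T log P(s_t | s_<t, V), summed over non-padded positions.

    logits:  (B, N_s, D_s) unnormalized scores from the Stack-LSTM
    targets: (B, N_s)      ground-truth word indices
    mask:    (B, N_s)      1 for real words, 0 for padding (our assumption)
    """
    log_p = F.log_softmax(logits, dim=-1)
    picked = log_p.gather(2, targets.unsqueeze(-1)).squeeze(-1)   # log P of the true words
    return (picked * mask).sum()

# Training maximizes this value, i.e. minimizes -caption_log_likelihood(...).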

4 Experiments

We evaluate our model on the task of video captioning. We first study the performance of different features, then evaluate the influence of different hyper-parameters, and finally compare our model with the state-of-the-art methods.

4.1 Datasets

In our experiments, we use two public video captioning benchmarks that have been widely used in many other works.

The Microsoft Video Description corpus (MSVD)

This dataset was proposed by Chen et al. [4]. It contains 1,970 short video clips collected from YouTube and about 80,000 descriptions collected via Amazon Mechanical Turk (AMT); the average length of a video clip is about 9 s, and each clip has on average about forty descriptions. The dataset is open-domain and covers a wide range of topics such as people, animals, sports, actions, music, scenarios and landscapes. In total, the descriptions contain nearly 16,000 unique words. Following previous work [26, 27, 53], we split this dataset into training, validation and testing sets with 1,200 (60%), 100 (5%) and 670 (35%) video clips, respectively.

MSR Video to Text (MSR-VTT)

Xu et al. [50] collected this dataset with a commercial video search engine. It is a large-scale, open-domain video captioning benchmark for supporting video understanding, especially the task of automatically describing videos. It contains 10K video clips and 200K descriptions, collected by Amazon Mechanical Turk (AMT) workers as for the MSVD dataset, with about 20 sentences for each short video, and it covers about 20 categories with diverse visual content. The updated version contains higher-quality sentences, so we conduct our experiments on it. The dataset is divided into three subsets: 65% for training, 5% for validation and 30% for testing, corresponding to 6,513, 497 and 2,990 clips, respectively.

4.2 Evaluation metrics

Following previous works [27, 46, 51], we evaluate the performance of our method with three metrics: BLEU [29], METEOR [6] and CIDEr [45].

4.3 Implementation details

Preprocessing

To preprocess the descriptions of the MSVD dataset, we first convert the sentences to lower case, then use the wordpunct_tokenizer in the NLTK library to tokenize the sentences and remove punctuation. Finally, we obtain a dictionary of size 15,903 on the training split.

To preprocess the descriptions of the MSR-VTT dataset, we directly split the descriptions on blank spaces, because they have already been tokenized, which yields a dictionary of size 23,662 on the training split. In this experiment, we only keep words that appear more than twice, which gives a final dictionary of size 13,626.
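The two preprocessing pipelines can be reproduced roughly as follows; the exact punctuation handling, the special tokens and the helper name are our own assumptions:

from collections import Counter
from nltk.tokenize import wordpunct_tokenize
import string

def build_vocab(captions, pre_tokenized=False, min_count=1):
    """Tokenize training captions and keep words above a frequency threshold."""
    counts = Counter()
    for cap in captions:
        if pre_tokenized:                     # MSR-VTT: already tokenized, split on spaces
            tokens = cap.split()
        else:                                 # MSVD: lowercase + wordpunct tokenization
            tokens = [t for t in wordpunct_tokenize(cap.lower())
                      if t not in string.punctuation]
        counts.update(tokens)
    words = [w for w, c in counts.items() if c >= min_count]
    return {w: i for i, w in enumerate(['<BOS>', '<EOS>'] + sorted(words))}

# e.g. vocab = build_vocab(msrvtt_train_captions, pre_tokenized=True, min_count=3)
# (msrvtt_train_captions is a hypothetical list of training sentences)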

For the visual features, we use the same extraction method on both datasets. For the spatial features, since ResNet-152 has achieved great results in image classification and video captioning [27, 46, 52], we use a ResNet-152 pre-trained on ImageNet [32] to extract visual features. We first select 30 equally-spaced frames from each video, then feed them into the pre-trained ResNet-152 and take the features from the pool5 layer, obtaining a 2048 × 30 feature matrix for each video. For the temporal features, inspired by [10], we first convert the RGB frames into optical flow images [3] stacked over 10 frames, and then use a ResNet-152 fine-tuned on UCF101 [10] to extract features from the pool5 layer, which again yields a 2048 × 30 feature matrix. Finally, we concatenate the spatial and temporal features and feed them into our model. In our experiments, Dv = 4096 and Nv = 30.
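For the spatial stream, the procedure amounts to sampling 30 equally-spaced frames and reading out the 2048-d pool5 activations of ResNet-152. Below is a hedged sketch with torchvision; the original experiments used a different toolchain, so this is only an approximate modern illustration, and frame_paths is a hypothetical list of frame image files.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()          # expose the 2048-d pooled (pool5) output
resnet.eval()

prep = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                  T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def spatial_features(frame_paths, n_frames=30):
    """Sample n_frames equally-spaced frames and return a 2048 x n_frames matrix."""
    idx = [round(i * (len(frame_paths) - 1) / (n_frames - 1)) for i in range(n_frames)]
    batch = torch.stack([prep(Image.open(frame_paths[i]).convert('RGB')) for i in idx])
    with torch.no_grad():
        feats = resnet(batch)             # (n_frames, 2048)
    return feats.t()                      # 2048 x 30, as described above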

Training details

In the training phase, the sentences in the corpus have varying lengths, so we add a begin-of-sentence flag < BOS > at the start of each sentence and an end-of-sentence flag < EOS > at the end. In the testing phase, we feed the < BOS > flag into our model to trigger the sentence generation process. Beam search, a heuristic search algorithm based on a greedy strategy, is utilized to find the sentence with the maximum partial probability; the beam width is set to 5.
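A width-5 beam search keeps the five highest-scoring partial sentences at each step and expands them until < EOS > is produced. The following is a simplified, model-agnostic sketch, where step is a hypothetical stand-in for one decoding step of the Stack-LSTM:

import heapq
import numpy as np

def beam_search(step, bos_id, eos_id, beam_width=5, max_len=30):
    """step(seq) -> log-probability vector over the next word (stand-in for the decoder)."""
    beams = [(0.0, [bos_id])]                        # (cumulative log-prob, sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos_id:                    # finished sentences are kept as-is
                candidates.append((score, seq))
                continue
            log_p = step(seq)
            for w in np.argsort(log_p)[-beam_width:]:
                candidates.append((score + float(log_p[w]), seq + [int(w)]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda x: x[0])
        if all(seq[-1] == eos_id for _, seq in beams):
            break
    return max(beams, key=lambda x: x[0])[1]         # best-scoring sentence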

In addition, all the LSTM unit sizes are set to 512 (\(D_{h}=D_{q}=D_{h^{\prime }}= 512\)) and the word embedding size is set to 512 (Dm = 512), empirically. In our experiments, we discard sentences longer than 30 words, so Ns ≤ 30. The batch size is set to 64 on the MSVD dataset and 256 on the MSR-VTT dataset. We apply the back-propagation through time (BPTT) algorithm to compute the gradients of the parameters and conduct the optimization with Adadelta [55]. We set the learning rate to 10− 4 to avoid gradient explosion, utilize dropout regularization with a rate of 0.5 in all layers, and clip gradients element-wise at 10. We stop training either when 500 epochs are reached or when the evaluation metric has not improved on the validation set for 20 epochs (a patience of 20). Moreover, we use the Theano [1] framework to conduct our experiments.
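These optimization settings correspond roughly to the following configuration; since the original code used Theano, the optimizer and clipping semantics below are only approximate PyTorch equivalents of the stated hyper-parameters.

import torch

model = torch.nn.LSTM(512, 512)   # placeholder for the full TS-LSTM model
optimizer = torch.optim.Adadelta(model.parameters(), lr=1e-4)

def train_step(loss):
    optimizer.zero_grad()
    (-loss).backward()                                                        # maximize the log-likelihood of Eq. (10)
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=10.0)      # element-wise clipping at 10
    optimizer.step()

# Early stopping: train for at most 500 epochs and stop if the validation
# metric has not improved for 20 consecutive evaluations (patience = 20).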

All experiments are conducted on Ubuntu 14.04 with an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz and a GeForce GTX TITAN X (Pascal) GPU.

4.4 Experiments on MSVD

To verify the effectiveness of our framework, we design the following experiments:

Effectiveness of different features

C3D features [44] are widely used for video captioning [22, 26, 53]. In this experiment, we evaluate the influence of the spatial ResNet-152 features (res_s) and compare our temporal ResNet-152 features (res_t) with the C3D features. The baseline is our model without the TP-LSTM part. The experimental results are shown in Table 1. From Table 1, we can see that simply applying the spatial ResNet-152 features is already quite effective for video captioning, with B@4 (51.5%), M (33.5%) and C (75.8%). Using both res_s and c3d, B@1, B@2, B@3 and B@4 improve, but M and C drop. In terms of video captioning evaluation, METEOR and CIDEr are more reliable than BLEU. Table 1 also shows that res_s combined with res_t performs best in terms of B@4, M and C. This demonstrates that our res_t features perform better than c3d for video captioning. In the following experiments, all the models take both res_s and res_t as inputs.

Table 1 Performances of our model with different features, where res_s stands for the spatial ResNet-152 feature, res_t stands for the temporal ResNet-152 feature, and c3d stands for the C3D feature

The effect of segmentation numbers

In order to explore the effectiveness of temporal pooling, we study the influence of the number of segments, denoted Ne. In this experiment, we set Ne = 1 (averaging all features), Ne = 3 and Ne = 30 (no averaging), and the results are shown in Table 2, where the baseline stands for our approach without the TP-LSTM. From Table 2, we can see that when Ne = 3, our model achieves better results than with Ne = 1 and Ne = 30. When Ne = 1, the model ignores the temporal variance between long-range video shots; when Ne = 30, the model ignores the temporal invariance within a short video shot. This shows that a reasonable number of segments can improve the performance of video captioning. Compared with the baseline, our model (TS-LSTM, Ne = 3) achieves better results, with gains of 2.2%, 2.7%, 2.3%, 1.8%, 0.2% and 3.4% on BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR and CIDEr, respectively. Therefore, in the following experiments, we set Ne = 3.

Table 2 Performances of our model with different Ne

Comparison with existing methods

To verify the effectiveness of our model, we compare our results with the following methods:

  • MP-LSTM. Venugopalan et al. [47] used a mean-pooling layer to aggregate all extracted frame-level features, then stacked two LSTM layers to explore semantic information.

  • SA-LSTM. Yao et al. [51] introduced a temporal attention mechanism to automatically select the relevant frames; combined with spatio-temporal 3-D convolutional neural network (3D-CNN) features, the model achieved great results on the video captioning task.

  • LSTM-E. Pan et al. [27] assumed that a low-dimensional embedding exists for the representation of video and sentence, thus they mapped the video features and sentence features to the visual-semantic embedding and minimized the relevance loss to adequately explore the semantic information from videos.

  • HRNE-AT. Pan et al. [26] proposed a Hierarchical Recurrent Neural Encoder (HRNE) structure, which stacks a short LSTM on a long LSTM for adequately exploring the temporal information of a video.

  • h-RNN. Yu et al. [53] designed a sentence generator and a paragraph generator to generate paragraphs. The paragraph generator is stacked on the sentence generator, receives the state of the sentence generator, and then initializes the sentence generator for the next sentence.

  • M3-LSTM. Wang et al. [48] designed a visual and semantic shared memory structure for achieving the long-term visual-semantic dependency to further guide global visual attention. In this way, the model can learn an effective mapping from visual space to language space.

  • MFA-LSTM. Long et al. [22] selected the most frequent subject and verb across the captions of each video, took them as semantic attributes, and used a multi-modal attention mechanism to explore the semantic information of videos.

  • LSTM-TSA. Pan et al. [28] introduced Multiple Instance Learning (MIL) and proposed a weakly-supervised method to learn attribute detectors, achieving great results.

  • hLSTMat. Song et al. [37] proposed an adjusted temporal attention mechanism, which can automatically decide whether to depend on the visual features or the semantic information, to improve attention-based video captioning.

4.5 Comparison results on MSVD

In this experiment, we first compare our method with the existing methods on the MSVD dataset; the results are shown in Table 3. From Table 3, we can see that our model obtains the best performance. In particular, the BLEU-4 of our model reaches 54.5%, improving over h-RNN, MFA-LSTM, LSTM-TSA and hLSTMat by 4.6%, 1.7%, 1.7% and 1.5%, respectively. The METEOR of our model is 34.5%, which outperforms h-RNN, MFA-LSTM, LSTM-TSA and hLSTMat by 1.9%, 1.1%, 1.0% and 0.9%, respectively.

Table 3 BLEU@N (B@N), METEOR (M), and CIDEr(C) scores of our model and other state-of-the-art methods

In Figure 3, we show some example sentences generated by our TS-LSTM model and the baseline mentioned in Section 4.4. The first column shows that both TS-LSTM and the baseline can generate correct sentences to describe each video. From the second column, we make the following observations: 1) the TS-LSTM model can generate sentences with accurate words to describe objects within a video, such as “bike” in the top video; 2) compared with the baseline, TS-LSTM is able to provide more detailed information for describing video contents, for instance, in the middle video, TS-LSTM indicates that a man is “eating pasta” instead of just “eating”; 3) the bottom video in the second column shows that TS-LSTM is able to count the objects within a video. In addition, the third column shows some failure cases. For the bottom video in the third column, TS-LSTM and the baseline generate “a monkey is playing” and “a tiger is playing”, respectively. Both are incorrect because the MSVD dataset contains few videos about cheetahs, so both trained models encounter an over-fitting problem.

Figure 3

Some example sentences on the MSVD dataset. These sentences are generated by our TS-LSTM model and the baseline. GT denotes the ground truth. Imprecise words are marked in red

4.6 Comparison results on MSR-VTT

To further illustrate the performance of our model, we compare our model with the state-of-the-art methods on the MSR-VTT dataset, which has the largest number of video-sentence pairs. The experimental results are shown in Table 4.

Table 4 BLEU@N (B@N), METEOR (M), and CIDEr(C) scores of our model and other state-of-the-art methods

From Table 4, we make the following observations. First, our model achieves the best performance on BLEU-3 (51.3%), BLEU-4 (39.9%) and METEOR (27.1%). Compared with MP-LSTM, SA-LSTM, M3-LSTM, hLSTMat and MFA-LSTM, our model improves BLEU-4 by 4.1%, 3.3%, 1.8%, 1.6% and 0.7%, respectively, and METEOR by 1.8%, 1.2%, 0.5%, 0.8% and 0.5%, respectively.

5 Conclusion

In this paper, we presented our temporal and spatial LSTM network (TS-LSTM), a video-level framework that models long-term temporal dynamics and integrates these dynamics with spatial and temporal features to improve video captioning. In this framework, the TP-LSTM explores the long-range temporal structure by taking segments of visual and motion features as inputs and produces informative long-term dynamics for video captioning. As demonstrated on two challenging datasets, our approach outperforms existing methods while keeping a reasonable computational cost, and the experimental results confirm the effectiveness of the proposed approach.