1 Introduction

Video captioning aims to automatically generate natural language descriptions of videos. Because it requires natural language processing as well as a comprehensive understanding of video content, generating human-like sentence descriptions from videos is a very challenging task in computer vision. Recently, this task has drawn increasing attention owing to the great success of image captioning [15, 31, 35, 38]. It also has many applications, such as navigation aids for the visually impaired, home surveillance, human-robot interaction, and video retrieval, so the related industry is expanding.

Early works on video captioning mainly followed template-based language models [7, 12, 27]. Such a method first predefines a set of language templates that follow specific grammatical rules and split sentences into semantic parts, such as subject, verb, and object. When a sentence is generated from a video, individual classifiers identify these semantic contents, and a sentence is then assembled according to the templates. However, this kind of method depends highly on the predefined templates, so it cannot model the richness of the language used in human-generated descriptions.

Thanks to the successful development of not only machine translation using recurrent neural networks (RNNs) [1, 4, 24] but also object recognition and detection using convolutional neural networks (CNNs) [13, 22, 26], most current video captioning networks are designed as a combination of an RNN and a CNN. For example, video features extracted by a CNN are embedded into a long short-term memory (LSTM) model, which learns the network parameters required for sentence generation. A benefit of this learning-based video captioning is that it describes a sentence directly from the video content [24], and its captioning performance is usually much better than that of rule-based systems. In particular, LSTM models, a type of RNN, have contributed greatly to sequence-to-sequence learning because they can avoid the long-term dependency problem [9]. Note that video and language are represented as sequences of frames and words, respectively.

Compared with images, videos contain not only static information but also temporal structure. Consecutive frames often deliver a great deal of information, such as scene changes, diverse sets of objects and actions, and interactions between people and objects. It is therefore difficult to determine the salient content and describe the corresponding events appropriately in context. To overcome this problem, recently developed video captioning networks generally exploit several modalities, such as frames, motion, and audio. Since these modalities complement each other, they allow the generated descriptions to be as rich as possible. Figure 1 shows an example video with a human-annotated sentence that illustrates the benefit of multimodality. Most of the object words in the sentence can be recognized from individual frames. However, the phrases “skiing rapidly down” and “while music plays in the background” are highly correlated with the motion and audio information, respectively. For example, how fast the skiing is can only be observed from the motion information, and whether background music is playing can only be determined from the audio information. Hence, using multimodal information in video captioning is highly effective.

Fig. 1

Example video with human-annotated sentences – “someone is skiing rapidly down a mountain towards a crack in the wall while music plays in the background”

To efficiently map a sequence of information, such as frames, motion, and audio, to a sequence of words, we propose an LSTM network based on deep multimodal embedding. Since a category can be derived from the video topic information, we consider four modalities for the embedding. The proposed network extends S2VT [29], the simplest encoder-decoder framework. Figure 2 shows the proposed network considering the four modalities. The first four stacked LSTM layers receive the features x1 to x4, which correspond to the audio, category, motion, and frame feature vectors, respectively, and the last encoding layer encodes them together into hidden representations. The decoding layer generates words from the encoded representations and outputs a sentence. Unlike the encoder, the decoder stacks only a single LSTM layer. Since the characteristics of the features differ from each other, they are embedded in different LSTM layers. For instance, a feature can be learned or engineered, and its dimension can be small or large. We therefore determine the optimal embedding layer for each feature based on these characteristics. In Fig. 2, the tags <pad>, <bos>, and <eos> denote a null-padded word, the beginning of a sentence, and the end of a sentence, respectively.

Fig. 2

Proposed network considering the four modalities

To our knowledge, this is the first approach that embeds different modalities in different LSTM layers. In the proposed network, the embedding order of the features is determined by their characteristics. Alternatively, all the features could be embedded in the same LSTM layer, or each feature could be embedded in an arbitrary LSTM layer. However, our analysis demonstrates that appropriately embedding the features of the different modalities into different LSTM layers is more efficient than these alternatives. Moreover, the experimental results show that the captioning performance of the proposed network is very competitive with that of conventional networks.

The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 describes in detail the proposed network, which embeds the frame, motion, audio, and category information. Experimental results are presented in section 4, and we conclude the paper in section 5.

2 Related works

Most recent studies on video captioning are based on CNN and LSTM models because of their good captioning performance. In the method proposed in [30], videos are translated directly into sentences by concatenating a CNN encoder and an LSTM decoder. The encoder uses a uniformly weighted feature, the average of the CNN features over all frames (mean pooling), and the decoder outputs each word conditioned on the previous words. A spatiotemporal CNN (3-D CNN) is introduced in [37] to represent the dynamic temporal structure. It includes a temporal attention mechanism that can focus more on interesting frames. In [29], S2VT is designed as an extended version of the mean-pooling model in [30] to generate more accurate sentences. Because the encoder and decoder employ a single LSTM, their parameters are shared during encoding and decoding. In [17], LSTM with visual-semantic embedding (LSTM-E) considers the relationship between sentence semantics and visual content. In [16], a hierarchical recurrent neural encoder (HRNE) is proposed to exploit the temporal structure over a longer range by reducing the length of the input information flow, thereby focusing on learning good video features. In contrast, a paragraph RNN (p-RNN) [39] introduces a hierarchical decoder that consists of a sentence generator and a paragraph generator to produce multiple sentences. The reconstruction network (RecNet) introduced in [32] is an encoder-decoder-reconstructor architecture that exploits both the video-to-sentence flow and the sentence-to-video flow.

To improve captioning performance, many studies have investigated multimodal networks that exploit various kinds of information, such as frames, motion, and audio. Since these features complement each other when translating videos into language, networks using multimodal features generally perform much better than networks using visual features alone. In [10], each feature is embedded in a multimodal fusion network, and a text sequence decoder generates a sentence; the fusion of the multimodal features significantly improves captioning performance. The multimodal video description network (MMVD) in [20] is built on top of S2VT [29]. Unlike S2VT, MMVD considers multimodal features for better performance and uses a single LSTM layer instead of a stack of two LSTM layers to reduce complexity. Multimodal features are used to predict video topics and generate topic-aware captions in [3]. In [36], a multirate multimodal network encodes video frames at different motion speeds and adaptively incorporates the temporal structure. In [34], a multimodal attention LSTM network (MA-LSTM) employs both multimodal streams and temporal attention to adaptively focus on specific elements during sentence generation. In particular, the attention mechanism attends not only to interesting parts within each modality but also to the contributions of the different modalities. A category-aware ensemble model and a reranking model are proposed in [11]. The ensemble model fuses the predictions of different captioning models according to the category information, and the reranking model chooses the best sentence among the candidates generated by the different captioning models. A consensus-based sequence training network (CST) is developed in [19] as a variant of the reinforcement learning framework; it tries to maximize the score of a candidate sentence as a reward.

3 Proposed network

Our proposed network is based on S2VT [29], which consists of an LSTM encoder and decoder. The encoder receives four different multimodal features, and the decoder outputs words. We first briefly introduce the S2VT network in section 3.1. Since this paper focuses on how the multimodal features are embedded in the network, section 3.2 describes and analyzes the multimodal features used in our experiments. In section 3.3, we propose a deep multimodal embedding network based on this analysis.

3.1 S2VT network

The proposed network consists of five encoding layers and one decoding layer, as shown in Fig. 2, whereas S2VT uses two LSTM layers that serve both encoding and decoding. Although the proposed network has more encoding layers and fewer decoding layers, the two networks are similar in their architecture. The first layer in S2VT reads a sequence of frame features extracted by a CNN and encodes them into a hidden representation. The second layer receives the hidden representation vectors and delivers them to the decoding phase. When the second layer is fed the tag <bos>, it starts decoding the representation into a sequence of words, conditioned on the previous words. Note that the previous words are the ground truth words during training but the words with the maximum probability after a softmax during testing. The second layer generates words until it emits the tag <eos>. S2VT employs the LSTM model to deal with the long-term dependency problem [9]. Its parameters are updated through back-propagation in the direction that maximizes the log-likelihood of the predicted words.
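To make the encode-then-decode procedure concrete, the sketch below shows greedy test-time decoding in the spirit of S2VT: the decoder is primed with <bos> and repeatedly feeds back the most probable word until <eos> is emitted. The `encode` and `decode_step` methods and the maximum length are hypothetical placeholders, not the original S2VT interface.

```python
import torch

def greedy_decode(model, video_feats, vocab, max_len=30):
    """Greedy test-time decoding in the spirit of S2VT (illustrative only).

    `model` is assumed to expose `encode(feats)` and `decode_step(word_id, state)`
    returning (logits, state); these names are placeholders, not the authors' API.
    """
    inv_vocab = {i: w for w, i in vocab.items()}      # id -> word
    state = model.encode(video_feats)                 # encode the feature sequence
    word_id = vocab["<bos>"]                          # prime the decoder with <bos>
    words = []
    for _ in range(max_len):
        logits, state = model.decode_step(word_id, state)
        word_id = int(torch.argmax(logits, dim=-1))   # most probable next word
        if word_id == vocab["<eos>"]:                 # stop once <eos> is emitted
            break
        words.append(inv_vocab[word_id])
    return " ".join(words)
```

During training, by contrast, the previous word fed to the decoder is the ground truth word rather than the predicted one, as described above.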

LSTM is one of the most widely used models for variable-length sequential data because it is especially good at modeling long-term sequence information. It is designed to address the vanishing and exploding gradient problems. In detail, its memory cell can keep track of previous information up to the current time step and update it with the current input under three gates: an input gate i_t, a forget gate f_t, and an output gate o_t. For example, i_t and f_t determine what information is stored in and forgotten from the cell state, respectively. Finally, o_t controls what is going to be output. Using these three gates, LSTM computes a hidden state h_t and a memory cell state c_t at each time step t, as follows:

$$
\begin{aligned}
i_t &= \sigma\left(W_{ix}x_t + U_{ih}h_{t-1} + b_i\right)\\
f_t &= \sigma\left(W_{fx}x_t + U_{fh}h_{t-1} + b_f\right)\\
o_t &= \sigma\left(W_{ox}x_t + U_{oh}h_{t-1} + b_o\right)\\
g_t &= \phi\left(W_{gx}x_t + U_{gh}h_{t-1} + b_g\right)\\
c_t &= f_t \otimes c_{t-1} + i_t \otimes g_t\\
h_t &= o_t \otimes \phi\left(c_t\right)
\end{aligned}
$$
(1)

where the weight matrices W and U and the bias vectors b are parameters to be learned. Here, x_t denotes the input feature vector; σ and φ represent the logistic sigmoid and the hyperbolic tangent function, respectively; and ⊗ denotes the element-wise product. Once the hidden state h_t has been computed from Eq. (1), the log-likelihood of the predicted words can be calculated.
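As a plain transcription of Eq. (1), the following NumPy step computes the gates and states for one time step; it is illustrative only and independent of the network implementation used in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step implementing Eq. (1).

    W: input weight matrices  {gate: (hidden, input)}
    U: recurrent weight matrices {gate: (hidden, hidden)}
    b: bias vectors {gate: (hidden,)}
    Gates: 'i' (input), 'f' (forget), 'o' (output), 'g' (candidate).
    """
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])
    g_t = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])
    c_t = f_t * c_prev + i_t * g_t          # element-wise products, as in Eq. (1)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```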

3.2 Multimodal features

The proposed network embeds frame, motion, audio, and category features. For our experiments, each feature was obtained with one of the most commonly used extraction methods; Table 1 summarizes the feature information. Inception-v4 is employed for the static frame feature [25]; it is extracted from the penultimate layer, and its dimension is 1536. I3D, the two-stream inflated 3D CNN [2], provides the motion feature; it is extracted from the logit layer, and its dimension is 400. MFCC stands for the mel-frequency cepstral coefficients, which are widely used in speech and audio recognition [5]. In our experiments, the MFCC feature is derived from a cepstral representation of every one-second audio clip. The first 20 coefficients and their first- and second-order derivatives are used together, so the total dimension is 60. The category information is tagged with each video and is represented as a one-hot vector. Each video belongs to one of 20 categories, such as music, food, or TV shows, so its dimension is 20.

Table 1 Feature information used in our experiments

These four features have different characteristics. First, both Inception-v4 and I3D are learned features, which means that they have already been trained on other datasets: Inception-v4 is pre-trained on ImageNet [25], and I3D is pre-trained on ImageNet and fine-tuned on Kinetics [2]. On the other hand, MFCC is an engineered feature. It is derived by taking the Fourier transform, mapping the powers of the frequency spectrum onto the mel scale, and then taking the discrete cosine transform [5]. Since Inception-v4 and I3D are well-learned features extracted with state-of-the-art methods, their captioning performance is usually better than that of an engineered feature such as MFCC; the corresponding results are shown in section 4.2. Learned audio features are now used in audio processing [8], but they have not been verified for video captioning, and MFCC is still the most popular feature in this area. The category feature is a one-hot vector provided with the video dataset.
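For illustration, the 60-dimensional audio descriptor in Table 1 (20 MFCCs plus their first- and second-order derivatives) could be assembled per one-second clip roughly as follows, using the librosa library; the sampling rate, frame settings, and the averaging over frames within each clip are our assumptions rather than details given in the paper.

```python
import numpy as np
import librosa

def audio_feature(wav_path, sr=16000):
    """Per-second 60-dim audio feature: 20 MFCCs + delta + delta-delta (sketch)."""
    y, sr = librosa.load(wav_path, sr=sr)
    feats = []
    for start in range(0, len(y) - sr + 1, sr):                 # one clip per second
        clip = y[start:start + sr]
        mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=20)   # (20, frames)
        d1 = librosa.feature.delta(mfcc)                        # first derivative
        d2 = librosa.feature.delta(mfcc, order=2)               # second derivative
        # Average over frames within the one-second clip -> 60-dim vector (assumed pooling).
        feats.append(np.concatenate([mfcc.mean(axis=1),
                                     d1.mean(axis=1),
                                     d2.mean(axis=1)]))
    return np.stack(feats)                                      # (seconds, 60)
```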

After each feature is embedded, it is encoded into the hidden representation, which is a fixed-size vector. However, the dimensions of the features are all different. For example, the dimension of the category feature is much lower than that of the frame, motion, and audio features, and the dimension of the frame feature is particularly high. Because the category feature is a very low-dimensional one-hot vector, it may take longer to encode into a useful representation. Hence, if the features in Table 1 are all embedded in the same LSTM layer, some features may already be encoded well and ready for word generation while others are not yet fully encoded, and the latter may have a negative influence on word generation.
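As one way to reconcile the very different raw dimensions in Table 1 with a fixed-size hidden representation, each modality can first be projected to a common input size. The snippet below is a minimal sketch of such per-modality projections; the use of plain linear layers and the 512-dimensional target size (the hidden size reported in section 4.1) are our assumptions, not details given in the paper.

```python
import torch.nn as nn

# Raw dimensions from Table 1 and a common embedding size
# (assumed 512, matching the LSTM hidden size used in the experiments).
MODALITY_DIMS = {"frame": 1536, "motion": 400, "audio": 60, "category": 20}
EMBED_DIM = 512

projections = nn.ModuleDict({
    name: nn.Linear(dim, EMBED_DIM) for name, dim in MODALITY_DIMS.items()
})

def project(features):
    """Map each raw modality tensor (batch, time, dim) to (batch, time, 512)."""
    return {name: projections[name](x) for name, x in features.items()}
```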

3.3 Deep multimodal embedding

As shown in Fig. 2, our proposed network stacks five LSTM layers. A feature embedded in an upper layer is processed by more layers than one embedded in a lower layer; in other words, the earlier a feature is embedded, the more times it is encoded. For instance, the first feature, embedded in the first LSTM layer, is encoded there, encoded again together with the second feature in the second layer, and then further encoded together with the third and fourth features in the layers below. The fourth feature, in contrast, is embedded and encoded in the fourth LSTM layer and is then almost immediately used for word generation after the last encoding layer. Therefore, the number of times each feature is encoded differs.

Taking into account the layer structure of the proposed network and the analysis in section 3.2, we determine the embedding order of the four features as follows. Engineered features and low-dimensional features are embedded in the upper layers, i.e., the first and second LSTM layers, whereas well-learned and high-dimensional features are fed to the layers below. Since MFCC, the audio feature, is an engineered feature, it is embedded in the first LSTM layer. The category feature is not an engineered feature, but it is a one-hot vector over 20 categories and thus has a very low dimension, so it is embedded in the second layer. Finally, both Inception-v4 (frame) and I3D (motion) are well-learned features; since the dimension of Inception-v4 is much higher than that of I3D, I3D and Inception-v4 are embedded in the third and fourth layers, respectively. With this embedding order, the features x1 to x4 in Fig. 2 correspond to the MFCC, category, I3D, and Inception-v4 features, respectively. Note that other feature types could be employed for the frame, motion, audio, and category modalities in the proposed network; in that case, the embedding order should be adjusted according to their characteristics.
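To make the embedding order concrete, the sketch below gives one plausible PyTorch realization of the five-layer encoder in Fig. 2: the audio (MFCC), category, motion (I3D), and frame (Inception-v4) features are injected into the first four LSTM layers in that order, and a fifth layer encodes everything jointly before a single-layer LSTM decoder generates words. The paper does not specify how an injected feature is combined with the lower layer's output, so the concatenation used here, and the assumption that all modalities are first projected to a common 512-dimensional size and temporally aligned, are ours.

```python
import torch
import torch.nn as nn

class DeepMultimodalEncoder(nn.Module):
    """Sketch of the five-layer multimodal LSTM encoder (assumed design)."""

    def __init__(self, embed_dim=512, hidden=512):
        super().__init__()
        # Layers 1-4 each receive one modality (already projected to embed_dim);
        # from layer 2 on, the input is [previous hidden ; newly injected modality].
        self.layer1 = nn.LSTM(embed_dim, hidden, batch_first=True)            # audio (MFCC)
        self.layer2 = nn.LSTM(hidden + embed_dim, hidden, batch_first=True)   # category
        self.layer3 = nn.LSTM(hidden + embed_dim, hidden, batch_first=True)   # motion (I3D)
        self.layer4 = nn.LSTM(hidden + embed_dim, hidden, batch_first=True)   # frame (Inception-v4)
        self.layer5 = nn.LSTM(hidden, hidden, batch_first=True)               # joint encoding

    def forward(self, audio, category, motion, frame):
        # All inputs: (batch, time, embed_dim), aligned to a common length.
        h1, _ = self.layer1(audio)
        h2, _ = self.layer2(torch.cat([h1, category], dim=-1))
        h3, _ = self.layer3(torch.cat([h2, motion], dim=-1))
        h4, _ = self.layer4(torch.cat([h3, frame], dim=-1))
        h5, state = self.layer5(h4)
        return h5, state   # passed on to a single-layer LSTM decoder
```

Under this reading, the modalities injected earliest pass through every subsequent layer, which mirrors the encoding-count argument above: the engineered and low-dimensional features receive the most additional encoding.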

4 Experimental results

Section 4.1 introduces the experimental setting, including the evaluation metrics, dataset, and implementation details. The performance of the proposed network is evaluated in detail in section 4.2.

4.1 Experimental setting

To verify the captioning performance of the proposed network, we used the recently released Microsoft Video-to-Text dataset (MSR-VTT 2017) [33], a large-scale dataset containing 10,000 training videos and 3,000 testing videos. Each video clip in MSR-VTT 2017 is annotated with 20 natural language descriptions. In addition, each video belongs to one of 20 categories, and this information is provided as metadata, so the category feature can be derived directly from the video metadata.

The diversity of video topics allows for a variety of expressions drawn from a wide vocabulary. This results in a large discrepancy between captions generated by video captioning networks and human-annotated ground truth captions. Figure 3 shows an example video with several ground truth captions; different people describe the same video with different expressions. To account for this in the performance evaluation, we used four evaluation metrics, namely BLEU4 [18], METEOR [6], ROUGE-L [14], and CIDEr [28].

Fig. 3

Example video with various ground truth captions – “two black haired ladies are having fun in the car”, “two ladies are having nice time together in a car”, and “two women are dancing funny and singing in a car”
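For reference, the four metrics above can be computed with the commonly used pycocoevalcap package, as sketched below; this is our own scoring sketch, not the official challenge evaluation code, and it assumes the package and METEOR's Java dependency are installed.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def score_captions(ground_truths, predictions):
    """ground_truths: {video_id: [ref1, ref2, ...]}, predictions: {video_id: [hypothesis]}."""
    scorers = [(Bleu(4), "BLEU4"), (Meteor(), "METEOR"),
               (Rouge(), "ROUGE-L"), (Cider(), "CIDEr")]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(ground_truths, predictions)
        # Bleu(4) returns a list of BLEU1-4 scores; keep only BLEU4.
        results[name] = score[-1] if isinstance(score, list) else score
    return results
```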

In the proposed network, the tags <bos> and <eos> were added to the vocabulary to indicate the beginning and end of a sentence, respectively. We did not pre-process the sentences other than removing punctuation and converting them to lower case. The dimension of the LSTM hidden state was set to 512. The trainable embedding weights were initialized from a uniform distribution in the range of −0.1 to 0.1. We used the Adam optimizer with a learning rate of 0.0001 and dropout with a rate of 0.5 to avoid over-fitting. When generating sentences, we used beam search with a beam size of 5.
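The training setup described above (512-dimensional hidden state, uniform initialization in [-0.1, 0.1] for the embedding weights, Adam with a learning rate of 0.0001, dropout of 0.5, and a beam width of 5) can be summarized in the following sketch; `model` is a placeholder for the full encoder-decoder, and the cross-entropy loss with padding ignored is our assumption for maximizing the word log-likelihood.

```python
import torch
import torch.nn as nn

# Special tokens added to the vocabulary (ids are placeholders).
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2
HIDDEN_SIZE = 512   # LSTM hidden state dimension
BEAM_SIZE = 5       # beam width used when generating sentences

def configure_training(model):
    """Set up initialization, regularization, optimizer, and loss (illustrative)."""
    # Initialize trainable embedding weights uniformly in [-0.1, 0.1].
    for name, param in model.named_parameters():
        if "embed" in name:
            nn.init.uniform_(param, a=-0.1, b=0.1)

    dropout = nn.Dropout(p=0.5)                                   # over-fitting control
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)     # learning rate 0.0001
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)          # skip <pad> targets
    return dropout, optimizer, criterion
```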

4.2 Performance evaluation

Table 2 shows the captioning performance of the proposed network alongside the leaderboard of the second Video to Language Challenge (MSR-VTT Challenge 2017) (http://ms-multimedia-challenge.com/2017/leaderboard). The proposed network outperforms all of the state-of-the-art networks except the first-ranked method [11], which used more training data than the other networks, including ours. Compared with the second-ranked network [36], which used the identical training dataset, the proposed network achieves higher scores on all metrics. This demonstrates that the proposed network is very competitive with conventional networks.

Table 2 Captioning performance of the proposed network and the leaderboard in the second Video to Language Challenge (MSR-VTT Challenge 2017)

Tables 3, 4, and 5 show the results that support our analysis; note that beam search was turned off in these tests. In section 3.2, we argued that the multimodal features differ in how well they are learned and in their dimensions, so they should be embedded in different LSTM layers for good captioning performance. In section 3.3, we accordingly proposed embedding MFCC (audio), the one-hot category vector, I3D (motion), and Inception-v4 (frame) as x1 to x4 in Fig. 2. Table 3 presents the captioning performance when the frame, motion, audio, and category features are embedded in the different LSTM layers in randomly chosen orders. For the given features, the embedding order of audio, category, motion, and frame is more efficient than the other combinations. In addition, when either the frame or the motion feature, which is already well learned, is embedded in the upper layers and thus encoded more times, as in the first and second combinations in Table 3, the captioning performance is relatively low, because the network then memorizes the frame and motion features of the training data particularly well and is more likely to overfit [21, 23]. Figure 4 shows example captions generated by the first and fourth combinations together with the ground truth caption. The generated captions have meanings similar to the ground truth caption; in particular, the captions generated with the fourth combination, which is the embedding order proposed in this paper, are more accurate and detailed than those generated with the first combination.

Table 3 Captioning performance when the frame (F), motion (M), audio (A), and category (C) features are randomly embedded in the different LSTM layers in Fig. 2
Table 4 Captioning performance when the frame (F) and audio (A) features are embedded in the same LSTM layer or in different LSTM layers in Fig. 5
Table 5 Captioning performance when the frame (F), motion (M), audio (A), and category (C) features are individually embedded in the single LSTM network in Fig. 6
Fig. 4

Examples of captions generated by the first and fourth combination cases in Table 3 and the ground truth caption

To support our argument more clearly, we conducted an additional simple experiment. Table 4 shows the captioning performance when the frame and audio features are embedded in the same LSTM layer or in different LSTM layers of Fig. 5. The simplified network in Fig. 5 is exactly the same as that in Fig. 2 except for the number of embedded features. If the frame and audio features are embedded in the same LSTM layer, x1 and x3 in Fig. 5 are replaced with the frame and audio features. When they are embedded in different LSTM layers, either the frame feature is embedded in the first layer and the audio feature in the second layer, or the audio feature is embedded in the first layer and the frame feature in the second layer. As the results show, embedding the features in different layers in the order of audio then frame is more efficient than the reverse order and than embedding both features in the same layer. This again demonstrates that embedding the features in different LSTM layers with the proper embedding order is more efficient than the alternatives.

Fig. 5

Simplified LSTM network embedding the frame and audio features

Finally, Table 5 presents the captioning performance when the frame, motion, audio, and category features are individually embedded in the single LSTM network shown in Fig. 6. This single LSTM network is designed to investigate the captioning performance of each feature on its own. As shown in Table 5, the captioning performance is highest when the feature x in Fig. 6 is the frame feature, followed by the motion, category, and audio features. In other words, the individual captioning performance increases in the order of audio, category, motion, and frame, which is exactly the embedding order used in the proposed network: the weaker a feature is on its own, the earlier it is embedded and the more it is encoded. These results demonstrate that our analysis is reasonable.

Fig. 6

Single LSTM network embedding the single feature

5 Conclusions

This paper proposed a deep multimodal embedding network that embeds the features of different modalities in different LSTM layers. Our experiments demonstrated that embedding the different modality features in different LSTM layers according to their characteristics achieves significantly higher captioning performance than embedding them in the same layer or embedding them in different layers in an arbitrary order. In particular, embedding the features in the order of audio, category, motion, and frame provided the most competitive captioning performance in the proposed network.