1 Introduction

Video captioning aims to automatically generate natural language descriptions of videos. Because it requires natural language processing as well as a comprehensive understanding of video content, generating human-like sentence descriptions from videos is a very challenging task in computer vision. Recently, this task has drawn increasing attention owing to the great success of image captioning [15, 31, 35, 38]. It also has many applications, such as navigation aids for the visually impaired, home surveillance, human-robot interaction, and video retrieval, so the related industry is expanding.

Early works on video captioning mainly followed template-based language models [7, 12, 27]. Such a method first predefines a set of language templates that follow specific grammatical rules and split sentences into semantic parts, such as subject, verb, and object. When a sentence is generated from a video, individual classifiers identify these semantic contents, and a sentence is then assembled according to the templates. However, this kind of method depends highly on the predefined templates, so it cannot model the richness of the language used in human-generated descriptions.

Thanks to the successful development of not only machine translation using recurrent neural networks (RNNs) [1, 4, 24] but also object recognition and detection using convolutional neural networks (CNNs) [13, 22, 26], most current video captioning networks are designed as a combination of an RNN and a CNN. For example, video features extracted by a CNN are embedded into a long short-term memory (LSTM) model, which learns the network parameters required for sentence generation. A benefit of this learning-based video captioning is that it describes a sentence directly from the video content [24], and its captioning performance is usually much better than that of rule-based systems. In particular, LSTM models, a type of RNN, have contributed greatly to sequence-to-sequence learning because they can avoid the long-term dependency problem [9]. Note that video and language are represented as sequences of frames and words, respectively.

Compared with images, videos contain not only static information but also temporal structure. Consecutive frames often deliver a great deal of information, such as scene changes, diverse sets of objects and actions, and interactions between people and objects. It is therefore difficult to determine the salient content and describe the corresponding events appropriately in context. To overcome this problem, recently developed video captioning networks generally exploit several modalities, such as frames, motion, and audio. Since these modalities complement each other, they allow the generated descriptions to be as rich as possible. Figure 1 shows an example video with a human-annotated sentence that illustrates the benefit of multimodality. Most of the object words in the sentence can be recognized from individual frames. However, the phrases “skiing rapidly down” and “while music plays in the background” are highly correlated with the motion and audio information, respectively. For example, how fast the skiing is can only be observed from the motion information, and whether background music is playing can only be determined from the audio information. Hence, using multimodal information in video captioning is highly effective.

Fig. 1

Example video with human-annotated sentences – “someone is skiing rapidly down a mountain towards a crack in the wall while music plays in the background”

To efficiently map a sequence of information, such as frames, motion, and audio, to a sequence of words, we propose an LSTM network based on deep multimodal embedding. Since a category can be derived from the video topic information, we consider four modalities for the embedding. The proposed network extends S2VT [29], the simplest encoder-decoder framework. Figure 2 shows the proposed network considering the four modalities. The first four stacked LSTM layers receive the features x1 to x4, which correspond to the audio, category, motion, and frame feature vectors, respectively, and the last encoding layer encodes them together into hidden representations. The decoding layer generates words from the encoded representations and outputs a sentence. Unlike the encoder, the decoder stacks only a single LSTM layer. Since the characteristics of the features differ from each other, they are embedded in different LSTM layers. For instance, a feature can be learned or engineered, and its dimension can be small or large. We therefore determine the optimal embedding layer for each feature based on these characteristics. In Fig. 2, the tags <pad>, <bos>, and <eos> denote a null-padded word, the beginning of a sentence, and the end of a sentence, respectively.

Fig. 2

Proposed network considering the four modalities

To our knowledge, this is the first approach that embeds different modalities in different LSTM layers. In the proposed network, the embedding order of the features is determined by their characteristics. Alternatively, all the features could be embedded in the same LSTM layer, or each feature could be embedded in an arbitrary LSTM layer. However, our analysis demonstrates that appropriately embedding the features of the different modalities into different LSTM layers is more efficient than these alternatives. Moreover, the experimental results show that the captioning performance of the proposed network is very competitive with that of conventional networks.

The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 describes in detail the proposed network, which embeds the frame, motion, audio, and category information. Experimental results are presented in section 4, and we conclude the paper in section 5.

2 Related works

Most recent studies on video captioning are based on CNN and LSTM models because of their good captioning performance. In the method proposed in [30], videos are translated directly into sentences by concatenating a CNN encoder and an LSTM decoder. The encoder uses a uniformly weighted feature, the average of the CNN features over all frames (mean pooling), and the decoder outputs each word conditioned on the previous words. A spatiotemporal CNN (3-D CNN) is introduced in [37] to represent the dynamic temporal structure. It includes a temporal attention mechanism that can focus more on interesting frames. In [29], S2VT is designed as an extended version of the mean-pooling model in [30] to generate more accurate sentences. Because the encoder and decoder employ a single LSTM, their parameters are shared during encoding and decoding. In [17], LSTM with visual-semantic embedding (LSTM-E) considers the relationship between sentence semantics and visual content. In [16], a hierarchical recurrent neural encoder (HRNE) is proposed to exploit the temporal structure over a longer range by reducing the length of the input information flow, thereby focusing on learning good video features. In contrast, a paragraph RNN (p-RNN) [39] introduces a hierarchical decoder that consists of a sentence generator and a paragraph generator to produce multiple sentences. The reconstruction network (RecNet) introduced in [32] is an encoder-decoder-reconstructor architecture that exploits both the video-to-sentence flow and the sentence-to-video flow.

To improve captioning performance, many studies have investigated multimodal networks that exploit various kinds of information, such as frames, motion, and audio. Since these features complement each other when translating videos into language, networks using multimodal features generally perform much better than networks using visual features alone. In [10], each feature is embedded in a multimodal fusion network, and a text sequence decoder generates a sentence; the fusion of the multimodal features significantly improves captioning performance. The multimodal video description network (MMVD) in [20] is built on top of S2VT [29]. Unlike S2VT, MMVD considers multimodal features for better performance and uses a single LSTM layer instead of a stack of two LSTM layers to reduce complexity. Multimodal features are used to predict video topics and generate topic-aware captions in [3]. In [36], a multirate multimodal network encodes video frames at different motion speeds and adaptively incorporates the temporal structure. In [34], a multimodal attention LSTM network (MA-LSTM) employs both multimodal streams and temporal attention to adaptively focus on specific elements during sentence generation. In particular, the attention mechanism attends not only to interesting parts within each modality but also to the contributions of the different modalities. A category-aware ensemble model and a reranking model are proposed in [11]. The ensemble model fuses the predictions of different captioning models according to the category information, and the reranking model chooses the best sentence among the candidates generated by the different captioning models. A consensus-based sequence training network (CST) is developed in [19] as a variant of the reinforcement learning framework; it tries to maximize the score of a candidate sentence as a reward.

3 Proposed network

Our proposed network is based on S2VT [29], which consists of an LSTM encoder and decoder. The encoder receives four different multimodal features, and the decoder outputs words. We first briefly introduce the S2VT network in section 3.1. Since this paper focuses on how the multimodal features are embedded in the network, section 3.2 describes and analyzes the multimodal features used in our experiments. In section 3.3, we propose a deep multimodal embedding network based on this analysis.

3.1 S2VT network

The proposed network consists of five encoding layers and one decoding layer, as shown in Fig. 2, whereas S2VT uses two LSTM layers that serve both encoding and decoding. Although the proposed network has more encoding layers and fewer decoding layers, the two networks are similar in their architecture. The first layer in S2VT reads a sequence of frame features extracted by a CNN and encodes them into a hidden representation. The second layer receives the hidden representation vectors and delivers them to the decoding phase. When the second layer is fed the tag <bos>, it starts decoding the representation into a sequence of words, conditioned on the previous words. Note that the previous words are the ground truth words during training but the words with the maximum probability after a softmax during testing. The second layer generates words until it emits the tag <eos>. S2VT employs the LSTM model to deal with the long-term dependency problem [9]. Its parameters are updated through back-propagation in the direction that maximizes the log-likelihood of the predicted words.
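To make the encode-then-decode procedure concrete, the sketch below shows greedy test-time decoding in the spirit of S2VT: the decoder is primed with <bos> and repeatedly feeds back the most probable word until <eos> is emitted. The `encode` and `decode_step` methods and the maximum length are hypothetical placeholders, not the original S2VT interface.

```python
import torch

def greedy_decode(model, video_feats, vocab, max_len=30):
    """Greedy test-time decoding in the spirit of S2VT (illustrative only).

    `model` is assumed to expose `encode(feats)` and `decode_step(word_id, state)`
    returning (logits, state); these names are placeholders, not the authors' API.
    """
    inv_vocab = {i: w for w, i in vocab.items()}      # id -> word
    state = model.encode(video_feats)                 # encode the feature sequence
    word_id = vocab["<bos>"]                          # prime the decoder with <bos>
    words = []
    for _ in range(max_len):
        logits, state = model.decode_step(word_id, state)
        word_id = int(torch.argmax(logits, dim=-1))   # most probable next word
        if word_id == vocab["<eos>"]:                 # stop once <eos> is emitted
            break
        words.append(inv_vocab[word_id])
    return " ".join(words)
```

During training, by contrast, the previous word fed to the decoder is the ground truth word rather than the predicted one, as described above.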

LSTM is one of the most widely used models for variable-length sequential data because it is especially good at modeling long-term sequence information. It is designed to address the vanishing and exploding gradient problems. In detail, its memory cell can keep track of previous information up to the current time step and update it with the current input under three gates: an input gate i_t, a forget gate f_t, and an output gate o_t. For example, i_t and f_t determine what information is stored in and forgotten from the cell state, respectively. Finally, o_t controls what is going to be output. Using these three gates, LSTM computes a hidden state h_t and a memory cell state c_t at each time step t, as follows:

$$
\begin{aligned}
i_t &= \sigma\left(W_{ix}x_t + U_{ih}h_{t-1} + b_i\right)\\
f_t &= \sigma\left(W_{fx}x_t + U_{fh}h_{t-1} + b_f\right)\\
o_t &= \sigma\left(W_{ox}x_t + U_{oh}h_{t-1} + b_o\right)\\
g_t &= \phi\left(W_{gx}x_t + U_{gh}h_{t-1} + b_g\right)\\
c_t &= f_t \otimes c_{t-1} + i_t \otimes g_t\\
h_t &= o_t \otimes \phi\left(c_t\right)
\end{aligned}
$$
(1)

where the weight matrices W and U and the bias vectors b are parameters to be learned. Here, x_t denotes the input feature vector; σ and φ represent the logistic sigmoid and the hyperbolic tangent function, respectively; and ⊗ denotes the element-wise product. Once the hidden state h_t has been computed from Eq. (1), the log-likelihood of the predicted words can be calculated.
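As a plain transcription of Eq. (1), the following NumPy step computes the gates and states for one time step; it is illustrative only and independent of the network implementation used in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step implementing Eq. (1).

    W: input weight matrices  {gate: (hidden, input)}
    U: recurrent weight matrices {gate: (hidden, hidden)}
    b: bias vectors {gate: (hidden,)}
    Gates: 'i' (input), 'f' (forget), 'o' (output), 'g' (candidate).
    """
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])
    g_t = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])
    c_t = f_t * c_prev + i_t * g_t          # element-wise products, as in Eq. (1)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```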

3.2 Multimodal features

The proposed network embeds frame, motion, audio, and category features. For our experiments, each feature was obtained with one of the most commonly used extraction methods; Table 1 summarizes the feature information. Inception-v4 is employed for the static frame feature [25]; it is extracted from the penultimate layer, and its dimension is 1536. I3D, the two-stream inflated 3D CNN [2], provides the motion feature; it is extracted from the logit layer, and its dimension is 400. MFCC stands for the mel-frequency cepstral coefficients, which are widely used in speech and audio recognition [5]. In our experiments, the MFCC feature is derived from a cepstral representation of every one-second audio clip. The first 20 coefficients and their first- and second-order derivatives are used together, so the total dimension is 60. The category information is tagged with each video and is represented as a one-hot vector. Each video belongs to one of 20 categories, such as music, food, or TV shows, so its dimension is 20.

Table 1 Feature information used in our experiments

These four features have different characteristics. First, both Inception-v4 and I3D are learned features, which means that they have already been trained on other datasets: Inception-v4 is pre-trained on ImageNet [25], and I3D is pre-trained on ImageNet and fine-tuned on Kinetics [2]. On the other hand, MFCC is an engineered feature. It is derived by taking the Fourier transform, mapping the powers of the frequency spectrum onto the mel scale, and then taking the discrete cosine transform [5]. Since Inception-v4 and I3D are well-learned features extracted with state-of-the-art methods, their captioning performance is usually better than that of an engineered feature such as MFCC; the corresponding results are shown in section 4.2. Learned audio features are now used in audio processing [8], but they have not been verified for video captioning, and MFCC is still the most popular feature in this area. The category feature is a one-hot vector provided with the video dataset.
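For illustration, the 60-dimensional audio descriptor in Table 1 (20 MFCCs plus their first- and second-order derivatives) could be assembled per one-second clip roughly as follows, using the librosa library; the sampling rate, frame settings, and the averaging over frames within each clip are our assumptions rather than details given in the paper.

```python
import numpy as np
import librosa

def audio_feature(wav_path, sr=16000):
    """Per-second 60-dim audio feature: 20 MFCCs + delta + delta-delta (sketch)."""
    y, sr = librosa.load(wav_path, sr=sr)
    feats = []
    for start in range(0, len(y) - sr + 1, sr):                 # one clip per second
        clip = y[start:start + sr]
        mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=20)   # (20, frames)
        d1 = librosa.feature.delta(mfcc)                        # first derivative
        d2 = librosa.feature.delta(mfcc, order=2)               # second derivative
        # Average over frames within the one-second clip -> 60-dim vector (assumed pooling).
        feats.append(np.concatenate([mfcc.mean(axis=1),
                                     d1.mean(axis=1),
                                     d2.mean(axis=1)]))
    return np.stack(feats)                                      # (seconds, 60)
```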

After each feature is embedded, it is encoded into the hidden representation, which is a fixed-size vector. However, the dimensions of the features are all different. For example, the dimension of the category feature is much lower than that of the frame, motion, and audio features, and the dimension of the frame feature is particularly high. Because the category feature is a very low-dimensional one-hot vector, it may take longer to encode into a useful representation. Hence, if the features in Table 1 are all embedded in the same LSTM layer, some features may already be encoded well and ready for word generation while others are not yet fully encoded, and the latter may have a negative influence on word generation.
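As one way to reconcile the very different raw dimensions in Table 1 with a fixed-size hidden representation, each modality can first be projected to a common input size. The snippet below is a minimal sketch of such per-modality projections; the use of plain linear layers and the 512-dimensional target size (the hidden size reported in section 4.1) are our assumptions, not details given in the paper.

```python
import torch.nn as nn

# Raw dimensions from Table 1 and a common embedding size
# (assumed 512, matching the LSTM hidden size used in the experiments).
MODALITY_DIMS = {"frame": 1536, "motion": 400, "audio": 60, "category": 20}
EMBED_DIM = 512

projections = nn.ModuleDict({
    name: nn.Linear(dim, EMBED_DIM) for name, dim in MODALITY_DIMS.items()
})

def project(features):
    """Map each raw modality tensor (batch, time, dim) to (batch, time, 512)."""
    return {name: projections[name](x) for name, x in features.items()}
```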

3.3 Deep multimodal embedding

As shown in Fig. 2, our proposed network stacks five LSTM layers. A feature embedded in an upper layer is processed by more layers than one embedded in a lower layer; in other words, the earlier a feature is embedded, the more times it is encoded. For instance, the first feature, embedded in the first LSTM layer, is encoded there, encoded again together with the second feature in the second layer, and then further encoded together with the third and fourth features in the layers below. The fourth feature, in contrast, is embedded and encoded in the fourth LSTM layer and is then almost immediately used for word generation after the last encoding layer. Therefore, the number of times each feature is encoded differs.

Taking into account the layer structure of the proposed network and the analysis in section 3.2, we determine the embedding order of the four features as follows. Engineered features and low-dimensional features are embedded in the upper layers, i.e., the first and second LSTM layers, whereas well-learned and high-dimensional features are fed to the layers below. Since MFCC, the audio feature, is an engineered feature, it is embedded in the first LSTM layer. The category feature is not an engineered feature, but it is a one-hot vector over 20 categories and thus has a very low dimension, so it is embedded in the second layer. Finally, both Inception-v4 (frame) and I3D (motion) are well-learned features; since the dimension of Inception-v4 is much higher than that of I3D, I3D and Inception-v4 are embedded in the third and fourth layers, respectively. With this embedding order, the features x1 to x4 in Fig. 2 correspond to the MFCC, category, I3D, and Inception-v4 features, respectively. Note that other feature types could be employed for the frame, motion, audio, and category modalities in the proposed network; in that case, the embedding order should be adjusted according to their characteristics.
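To make the embedding order concrete, the sketch below gives one plausible PyTorch realization of the five-layer encoder in Fig. 2: the audio (MFCC), category, motion (I3D), and frame (Inception-v4) features are injected into the first four LSTM layers in that order, and a fifth layer encodes everything jointly before a single-layer LSTM decoder generates words. The paper does not specify how an injected feature is combined with the lower layer's output, so the concatenation used here, and the assumption that all modalities are first projected to a common 512-dimensional size and temporally aligned, are ours.

```python
import torch
import torch.nn as nn

class DeepMultimodalEncoder(nn.Module):
    """Sketch of the five-layer multimodal LSTM encoder (assumed design)."""

    def __init__(self, embed_dim=512, hidden=512):
        super().__init__()
        # Layers 1-4 each receive one modality (already projected to embed_dim);
        # from layer 2 on, the input is [previous hidden ; newly injected modality].
        self.layer1 = nn.LSTM(embed_dim, hidden, batch_first=True)            # audio (MFCC)
        self.layer2 = nn.LSTM(hidden + embed_dim, hidden, batch_first=True)   # category
        self.layer3 = nn.LSTM(hidden + embed_dim, hidden, batch_first=True)   # motion (I3D)
        self.layer4 = nn.LSTM(hidden + embed_dim, hidden, batch_first=True)   # frame (Inception-v4)
        self.layer5 = nn.LSTM(hidden, hidden, batch_first=True)               # joint encoding

    def forward(self, audio, category, motion, frame):
        # All inputs: (batch, time, embed_dim), aligned to a common length.
        h1, _ = self.layer1(audio)
        h2, _ = self.layer2(torch.cat([h1, category], dim=-1))
        h3, _ = self.layer3(torch.cat([h2, motion], dim=-1))
        h4, _ = self.layer4(torch.cat([h3, frame], dim=-1))
        h5, state = self.layer5(h4)
        return h5, state   # passed on to a single-layer LSTM decoder
```

Under this reading, the modalities injected earliest pass through every subsequent layer, which mirrors the encoding-count argument above: the engineered and low-dimensional features receive the most additional encoding.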

4 Experimental results

Section 4.1 introduces the experimental setting, including the evaluation metrics, dataset, and implementation details. The performance of the proposed network is evaluated in detail in section 4.2.

4.1 Experimental setting

To verify the captioning performance of the proposed network, we used the recently released Microsoft Video-to-Text dataset (MSR-VTT 2017) [33], a large-scale dataset containing 10,000 training videos and 3,000 testing videos. Each video clip in MSR-VTT 2017 is annotated with 20 natural language descriptions. In addition, each video belongs to one of 20 categories, and this information is provided as metadata, so the category feature can be derived directly from the video metadata.

The diversity of video topics allows for a variety of expressions drawn from a wide vocabulary. This results in a large discrepancy between captions generated by video captioning networks and human-annotated ground truth captions. Figure 3 shows an example video with several ground truth captions; different people describe the same video with different expressions. To account for this in the performance evaluation, we used four evaluation metrics, namely BLEU4 [18], METEOR [6], ROUGE-L [14], and CIDEr [28].

Fig. 3

Example video with various ground truth captions – “two black haired ladies are having fun in the car”, “two ladies are having nice time together in a car”, and “two women are dancing funny and singing in a car”
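For reference, the four metrics above can be computed with the commonly used pycocoevalcap package, as sketched below; this is our own scoring sketch, not the official challenge evaluation code, and it assumes the package and METEOR's Java dependency are installed.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def score_captions(ground_truths, predictions):
    """ground_truths: {video_id: [ref1, ref2, ...]}, predictions: {video_id: [hypothesis]}."""
    scorers = [(Bleu(4), "BLEU4"), (Meteor(), "METEOR"),
               (Rouge(), "ROUGE-L"), (Cider(), "CIDEr")]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(ground_truths, predictions)
        # Bleu(4) returns a list of BLEU1-4 scores; keep only BLEU4.
        results[name] = score[-1] if isinstance(score, list) else score
    return results
```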

In the proposed network, the tags <bos> and <eos> were added to the vocabulary to indicate the beginning and end of a sentence, respectively. We did not pre-process the sentences other than removing punctuation and converting them to lower case. The dimension of the LSTM hidden state was set to 512. The trainable embedding weights were initialized from a uniform distribution in the range of −0.1 to 0.1. We used the Adam optimizer with a learning rate of 0.0001 and dropout with a rate of 0.5 to avoid over-fitting. When generating sentences, we used beam search with a beam size of 5.
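The training setup described above (512-dimensional hidden state, uniform initialization in [-0.1, 0.1] for the embedding weights, Adam with a learning rate of 0.0001, dropout of 0.5, and a beam width of 5) can be summarized in the following sketch; `model` is a placeholder for the full encoder-decoder, and the cross-entropy loss with padding ignored is our assumption for maximizing the word log-likelihood.

```python
import torch
import torch.nn as nn

# Special tokens added to the vocabulary (ids are placeholders).
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2
HIDDEN_SIZE = 512   # LSTM hidden state dimension
BEAM_SIZE = 5       # beam width used when generating sentences

def configure_training(model):
    """Set up initialization, regularization, optimizer, and loss (illustrative)."""
    # Initialize trainable embedding weights uniformly in [-0.1, 0.1].
    for name, param in model.named_parameters():
        if "embed" in name:
            nn.init.uniform_(param, a=-0.1, b=0.1)

    dropout = nn.Dropout(p=0.5)                                   # over-fitting control
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)     # learning rate 0.0001
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)          # skip <pad> targets
    return dropout, optimizer, criterion
```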

4.2 Performance evaluation

Table 2 shows the captioning performance of the proposed network alongside the leaderboard of the second Video to Language Challenge (MSR-VTT Challenge 2017) (http://ms-multimedia-challenge.com/2017/leaderboard). The proposed network outperforms all of the state-of-the-art networks except the first-ranked method [11], which used more training data than the other networks, including ours. Compared with the second-ranked network [36], which used the identical training dataset, the proposed network achieves higher scores on all metrics. This demonstrates that the proposed network is very competitive with conventional networks.

Table 2 Captioning performance of the proposed network and the leaderboard in the second Video to Language Challenge (MSR-VTT Challenge 2017)

Tables 3, 4, and 5 show the results that support our analysis; note that beam search was turned off in these tests. In section 3.2, we argued that the multimodal features differ in how well they are learned and in their dimensions, so they should be embedded in different LSTM layers for good captioning performance. In section 3.3, we accordingly proposed embedding MFCC (audio), the one-hot category vector, I3D (motion), and Inception-v4 (frame) as x1 to x4 in Fig. 2. Table 3 presents the captioning performance when the frame, motion, audio, and category features are embedded in the different LSTM layers in randomly chosen orders. For the given features, the embedding order of audio, category, motion, and frame is more efficient than the other combinations. In addition, when either the frame or the motion feature, which is already well learned, is embedded in the upper layers and thus encoded more times, as in the first and second combinations in Table 3, the captioning performance is relatively low, because the network then memorizes the frame and motion features of the training data particularly well and is more likely to overfit [21, 23]. Figure 4 shows example captions generated by the first and fourth combinations together with the ground truth caption. The generated captions have meanings similar to the ground truth caption; in particular, the captions generated with the fourth combination, which is the embedding order proposed in this paper, are more accurate and detailed than those generated with the first combination.

Table 3 Captioning performance when the frame (F), motion (M), audio (A), and category (C) features are randomly embedded in the different LSTM layers in Fig. 2
Table 4 Captioning performance when the frame (F) and audio (A) features are embedded in the same LSTM layer or in different LSTM layers in Fig. 5
Table 5 Captioning performance when the frame (F), motion (M), audio (A), and category (C) features are individually embedded in the single LSTM network in Fig. 6
Fig. 4

Examples of captions generated by the first and fourth combination cases in Table 3 and the ground truth caption

To support our argument more clearly, we conducted an additional simple experiment. Table 4 shows the captioning performance when the frame and audio features are embedded in the same LSTM layer or in different LSTM layers of Fig. 5. The simplified network in Fig. 5 is exactly the same as that in Fig. 2 except for the number of embedded features. If the frame and audio features are embedded in the same LSTM layer, x1 and x3 in Fig. 5 are replaced with the frame and audio features. When they are embedded in different LSTM layers, either the frame feature is embedded in the first layer and the audio feature in the second layer, or the audio feature is embedded in the first layer and the frame feature in the second layer. As the results show, embedding the features in different layers in the order of audio then frame is more efficient than the reverse order and than embedding both features in the same layer. This again demonstrates that embedding the features in different LSTM layers with the proper embedding order is more efficient than the alternatives.

Fig. 5

Simplified LSTM network embedding the frame and audio features

Finally, Table 5 presents the captioning performance when the frame, motion, audio, and category features are individually embedded in the single LSTM network shown in Fig. 6. This single LSTM network is designed to investigate the captioning performance of each feature on its own. As shown in Table 5, the captioning performance is highest when the feature x in Fig. 6 is the frame feature, followed by the motion, category, and audio features. In other words, the individual captioning performance increases in the order of audio, category, motion, and frame, which is exactly the embedding order used in the proposed network: the weaker a feature is on its own, the earlier it is embedded and the more it is encoded. These results demonstrate that our analysis is reasonable.

Fig. 6

Single LSTM network embedding the single feature

5 Conclusions

This paper proposed a deep multimodal embedding network that embeds the features of different modalities in different LSTM layers. Our experiments demonstrated that embedding the different modality features in different LSTM layers according to their characteristics achieves significantly higher captioning performance than embedding them in the same layer or embedding them in different layers in an arbitrary order. In particular, embedding the features in the order of audio, category, motion, and frame provided the most competitive captioning performance in the proposed network.