1 Introduction

Video description aims to translate video content, including persons/objects, actions, scenes and even relationships, into natural language with reasonable sentence patterns, correct grammar and appropriate wording. It belongs to high-level visual understanding and has broad application prospects in early education, automatic interpretation, visual aids and intelligent human–machine environments. However, the task is challenging because the model has to bridge vision and language, which belong to different modalities and are separated by a large semantic gap. To meet this challenge, object detection and action recognition techniques from computer vision are usually employed first to extract visual semantic entities or regions, which are then filled into a pre-designed template such as “S–O–V” for a sentence [1, 2], or the detected visual semantic concepts are composed into a sentence with special rules [3]. However, the generated sentences frequently lack accuracy, coherence and flexibility due to the inflexibility of templates or rules and the weak representational ability of handcrafted visual features.

Recently, deep convolutional neural networks (DCNN) have been widely employed to extract deep visual features, which offer a higher level of abstraction and more powerful representation. A number of DCNN models such as AlexNet [4], VGGNet [5], GoogLeNet [6] and ResNet [7] have been developed and applied to various visual tasks including image classification [4,5,6,7], action recognition [8, 9] and image captioning [10,11,12,13]. The breakthrough of deep learning also offers an opportunity to improve video description [14,15,16,17,18,19]. Generally speaking, the “encoding–decoding” pipeline from machine translation is adopted for video description. 2D/3D DCNN models pre-trained on large-scale datasets such as ImageNet [20] are first used to extract video features. The visual features are then fed into recurrent neural networks (RNN) for motion feature encoding and language modeling. The generated sentences are generally more accurate and semantically richer with the help of deep features and a flexible language decoder. Subsequently, attention mechanisms [21,22,23,24] and their variants (e.g., Transformer) [25, 26] as well as visual attributes/concepts [27, 28] have been employed for visual feature selection and optimization. Additionally, advanced optimization strategies such as reinforcement learning [29, 30] and adversarial learning [31] have been applied to generate descriptions of videos under different circumstances.

However, popular works frequently mine and use visual semantics or language features alone, leaving insufficient information available for performance improvement. To address this problem, a framework with hybrid enhancement and complementarity of visual and language semantics is proposed in this work, where two extra branches for visual and language semantics enhancement are appended to a multimodal model to compensate for visual and language information. In addition, a multi-objective optimization strategy provides more regularization information during model training by adding objective functions on the aforementioned three branches. During testing, the output probabilities of all branches are fused by weighted average for word prediction at each time step to further improve sentence quality. For simplicity, if the two extra visual and language branches are fused with the multimodal module by concatenation, the model is denoted as HE-VLSC; if element-wise addition is employed for the fusion, the model is abbreviated as HE-VLSA. Besides, a variant of the proposed HE-VLSC/A is designed to further boost performance, in which a module based on deep fusion and a module based on visual semantics enhancement are integrated, and two objective functions are designed to optimize the model jointly. Similar to HE-VLSC/A, sequential probability fusion with weighted average is employed during the test stage. As an example, the proposed HE-VLSA framework is presented in Fig. 1.

Fig. 1

Overview of the proposed HE-VLSA framework. Two extra branches, for visual semantics enhancing and language semantics enhancing, are appended to a multimodal model to capture comprehensive information

Experiments are conducted on two public video description datasets, MSVD [2, 32] and MSR-VTT2016 [33], and better performance is achieved compared to the baseline model, which demonstrates the effectiveness of the proposed model. Competitive results are also obtained compared to other state-of-the-art methods on several evaluation metrics. The contributions of this work can be summarized as follows.

  • A model based on visual semantics enhancement for video description is proposed, in which the visual information is modeled separately to further mine motion features, and the resulting module is appended to a multimodal-based module to improve the accuracy of word prediction.

  • A framework based on hybrid enhancement and complementarity of visual and language semantics is developed for video description, in which two branches for visual and language semantics enhancement are integrated with the multimodal-based module for more powerful representation, and multi-objective optimization and sequential late fusion are employed for model training and sentence generation.

  • Two deep fusion-based modules for visual and language semantics enhancement are combined into a wide model for video description, where joint optimization and sequential late fusion are employed for training and testing, respectively. Competitive results are achieved compared to the state-of-the-art methods on two public evaluation datasets.

The rest of this work is organized as follows: Sect. 2 briefly introduces related work on video description. Section 3 describes the motivation, the proposed models and the related formulas in detail. The experiments and result discussion are provided in Sect. 4, and Sect. 5 concludes the work.

2 Related works

The task of video description has been researched for decades. In the early days, computer vision techniques such as object detection and action recognition were usually employed to extract visual semantic entities, and the related words or phrases were then filled into templates designed in advance or composed into sentences according to rules. Nagel et al. generate descriptions of vehicle behavior by composing the detected motion tracks and types into phrases with a few rules [34]. Gupta et al. propose a model to generate a long description for a video, where the actions or events are detected first, the visual entities are described as simple sentences, and these are then recomposed into a detailed description based on the relationships among the actions or events [3]. Kojima et al. construct mappings between semantic objects (e.g., object, action) and specific concepts, and fill them into templates according to the corresponding syntactic components to form sentences [1].

Inspired by the pipeline of machine translation, the “encoder–decoder” framework is employed to generate descriptions for videos, where the videos to be described are treated as the “source language” and the generated sentences are regarded as the “target language.” Rohrbach et al. propose a model for video description with a flexible structure, where a conditional random field (CRF) is employed to model the relationships among the detected objects [35]. However, the handcrafted features and traditional object detection method in that work limit performance due to insufficient recognition and detection accuracy. The performance breakthrough of visual classification and recognition with deep learning inspired researchers to develop more advanced models for video description. In the general pipeline, a 2D/3D DCNN is employed to extract more abstract and representative deep features from frames or clips of a video for visual encoding, and an RNN is then applied to generate the sentence for decoding.

Venugopalan et al. propose a visual feature mean pooling-based framework for video captioning, in which the sampled frames are fed to a DCNN model for deep features; the element-wise mean feature vector is then calculated and used as the visual representation, which is sent to a recurrent neural network, namely long short-term memory (LSTM), for word prediction [36]. The method makes use of the information of all sampled frames, but the motion and semantic structure information in the video may be destroyed. Afterward, they borrow the sequence-to-sequence method from machine translation and propose the S2VT model [14], where the sampled frames are given to a CNN model in sequence, and the extracted feature sequence is fed to two stacked LSTM networks to capture motion information. When all the frame features are exhausted, the final visual representation is transmitted to the next time step in the same LSTM network for decoding. The model forces motion feature encoding and language decoding to share the same network and parameters, and hence possesses a concise architecture and good interpretability. Additionally, they develop the Deep-Glove model to further exploit language information [37]. In that model, a language branch pre-trained on a large-scale corpus is added on S2VT to improve the accuracy and semantics of the generated sentences. Shen et al. borrow the fully convolutional network (FCN) [38] to extract CNN representations of visual semantic regions in frames of a video; the motion of the visual objects is then tracked and a description is generated [39]. Besides, the event is usually employed as the elementary semantic object and description unit in a few works [40,41,42], and a number of sentences are generated to describe a video in more detail. Mun et al. argue that different events in a video are often interrelated, and thus the independent event descriptions should be recomposed into a paragraph with reasonable logic, with redundant sentences truncated [43]. In this work, S2VT [14] and Deep-Glove [37] are investigated and employed to implement the proposed idea in consideration of their simple architecture and excellent performance for video captioning and description.

In addition to sequence-to-sequence modeling with RNN for visual feature encoding, 3D CNN is another solution for visual representation in video description. In general, the method is integrated with RNN, attention mechanisms, hierarchical modeling, advanced memory units and multi-objective joint optimization to take full advantage of visual information. Yao et al. extract local temporal structure features of videos with a 3D CNN model; the visual features are then fed to an LSTM for decoding, and an attention mechanism assigns different weights to different 3D CNN features at different time steps to guide sentence generation [16]. Pan et al. employ the C3D model to extract visual features of clips in a video, which are then sent to an LSTM network together with the 2D CNN mean feature of all frames; finally, the visual module and language module are trained jointly for model optimization [15, 28]. Wang et al. present a hierarchical framework based on reinforcement learning and 3D CNN for fine-grained video description [30]. Additionally, a few other researchers pay attention to visual objects and their corresponding relationships in videos, and develop a series of effective models, in which 2D and 3D visual features are usually both employed [44,45,46], to complete the task of video captioning. Different from the aforementioned works, Park et al. adopt an adversarial learning strategy and develop a group of discriminators to evaluate the accuracy and relevance of candidate sentences from the generator [47].

Popular works tend to focus on a single aspect, mining either visual semantics or language features separately for better performance, but ignore the use of both and the mutual supplement between them during training and testing. As an example, the S2VT model [14] improves the flexibility of generated sentences, but neither the visual semantic information nor the language is fully exploited. In the Deep-Glove model [37], the language information is further exploited by an extra branch; however, the language branch is difficult to pre-train since a large-scale corpus is required, and the visual semantic information is not further mined and put to full use. In this work, a framework based on hybrid enhancement and complementarity of visual and language semantics is proposed for video description. For comprehensive visual and language information, visual semantics enhancing and language enhancing branches are designed and appended to a multimodal-based module, so that the three parts can be optimized cooperatively and complementarily. During training, besides the main objective function for the multimodal-based module, two extra objective functions for the visual semantics enhancing branch and the language enhancing branch are designed to provide more regularization information and hence improve the generalization ability of the whole model.

3 Proposed methods

Both visual semantics and language information are required for video description. In most popular works, the visual and language information is fed to a sequential model to form a multimodal representation, mapping the two modalities into a unified feature space; an RNN is then employed as the language decoder to generate a description sentence for the video. However, more noise from both vision and language may be introduced into the model when the multimodal feature is used, interfering with word prediction. Additionally, significant visual and language information may be lost during the repeated nonlinear transformations of the feature as the time steps progress in the RNN and its variants (e.g., LSTM), since the activation functions (i.e., Sigmoid and Tanh) may saturate once their inputs exceed the sensitive interval. To obtain richer language information and reduce the influence of noise on the model, a single language branch is pre-trained on a large-scale corpus and then integrated with S2VT by fusing the two sequential features at each time step in [37]. However, an extra domain corpus has to be constructed to optimize the language branch. Moreover, the further mining and discovery of visual semantic information is ignored, and the semantics of the generated sentences cannot be further improved.

Facing this problem, a visual semantic branch is designed based on S2VT [14] in this work to capture richer visual information, following the idea in [37]. The branch is then added on Deep-Glove [37], and the output features of the three parts are combined by early sequential fusion into a comprehensive and complementary representation for sentence generation, where the features are fused before they are fed to the word predictor at each time step. Additionally, a multi-objective optimization strategy is employed to prevent the model from over-fitting caused by the increase of parameters in the visual and language branches. In detail, each of the visual semantics enhancing branch, the language semantics enhancing branch and the multimodal-based module deploys an objective function. Because the extra branches lack complete modal information (e.g., the visual information is missing in the language enhancing branch), larger errors between ground truth and generated sentences may be produced, as well as disturbance during model optimization, which can be treated as extra regularization for the whole model. At test time, late sequential fusion is employed for word prediction, where the probabilities from the predictors in the three modules are fused by weighted average.

Suppose there are m frames in a video, and the frame sequence is \(<f_1,f_2,\cdots ,f_m>\). The corresponding feature sequence extracted from the CNN model is \(<v_1,v_2,\cdots ,v_m>\). The word sequence of one of the corresponding sentences is denoted as \(<w_1,w_2,\cdots ,w_k>\), where k is the length of the sentence. During training, the objective of the model is to predict the word sequence \(<w_1',w_2',\cdots ,w_{k'}'>\) conditioned on \(<v_1,v_2,\cdots ,v_m>\). Afterward, the errors between the output sequence and the reference are calculated and backpropagated through time (BPTT) to update and optimize the parameter set \(\varTheta\). The objective function can be formulated as:

$$\begin{aligned} \arg \min _\varTheta {\mathscr {L}}(<w_1',w_2',\cdots ,w_{k'}'>|<v_1,v_2,\cdots ,v_m>,\varTheta ), \end{aligned}$$
(1)

where \({\mathscr {L}}(\cdot )\) is the loss function, for which the cross-entropy strategy is usually employed. For different models, the function may take different forms (e.g., Eq. (2) and Eq. (7) in Sects. 3.1 and 3.3, respectively). As an important module in the backbone (Deep-Glove [37]) of the proposed model, S2VT is presented in Fig. 2. The visual feature sequence \(<v_1,v_2,\cdots ,v_m>\) is fed to the first LSTM layer at both the training and testing stages. The second LSTM receives \(<pad>\) to align with the subsequent word inputs. When all visual features are exhausted, the word sequence \(<w_1,w_2,\cdots ,w_k>\) is fed to the second LSTM together with the output from the first LSTM for language modeling. During training, each reference word is given to the language model, and the output at each time step is compared with the next reference word to compute the loss and gradients, while at the testing stage, the output words are predicted one by one, each depending on the previously generated word sequence.
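To make the two-layer pipeline concrete, a minimal PyTorch-style sketch of the S2VT stacking is given below. It is an illustration only: the class name `S2VTSketch`, the dimensions and the use of zero vectors as \(<pad>\) inputs are assumptions, not details of the original Caffe implementation.

```python
import torch
import torch.nn as nn

class S2VTSketch(nn.Module):
    """Two-layer S2VT-style stack: LSTM1 consumes frame features (then <pad>),
    LSTM2 consumes <pad> during encoding and word embeddings during decoding."""
    def __init__(self, feat_dim=1024, embed_dim=500, hidden_dim=1000, vocab_size=10000):
        super().__init__()
        self.reduce = nn.Linear(feat_dim, embed_dim)      # visual feature reduction
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embedding
        self.lstm1 = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.lstm2 = nn.LSTM(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, word_ids):
        # frame_feats: (B, m, feat_dim); word_ids: (B, k) reference words (training mode)
        B, m, _ = frame_feats.shape
        k = word_ids.shape[1]
        v = self.reduce(frame_feats)                      # (B, m, embed_dim)
        pad_v = torch.zeros(B, k, v.shape[-1])            # <pad> frames during decoding
        h1, _ = self.lstm1(torch.cat([v, pad_v], dim=1))  # (B, m+k, hidden_dim)
        w = self.embed(word_ids)                          # (B, k, embed_dim)
        pad_w = torch.zeros(B, m, w.shape[-1])            # <pad> words during encoding
        x2 = torch.cat([torch.cat([pad_w, w], dim=1), h1], dim=-1)
        h2, _ = self.lstm2(x2)
        return self.classifier(h2[:, m:])                 # logits at decoding steps only

# Example: logits = S2VTSketch()(torch.randn(2, 26, 1024), torch.randint(0, 10000, (2, 12)))
```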

Fig. 2

Overview of the S2VT framework, which follows the “encoding–decoding” pipeline. Panel a shows the training stage, where the visual feature sequence \(<v_1,v_2,\cdots ,v_m>\) is fed to the first LSTM during visual encoding, and the word sequence \(<w_1,w_2,\cdots ,w_k>\) is fed to the second LSTM together with the output from the first LSTM. Panel b shows testing, where the predicted word at the current time step depends on the previously generated word sequence

3.1 The visual semantics enhancing-based model

Visual features are the basis of video description. In the multimodal-based model, the visual features of frames are extracted from a CNN model pre-trained on a large-scale dataset and then fed to the LSTM in sequence for motion feature encoding. When all the visual features are exhausted, the final visual representation is delivered to the following time steps for language decoding. In fact, the visual and language features are combined together in the pipeline. However, the visual feature may be compromised as the language feature dominates word prediction at each time step in the decoding stage. In particular, more and more visual details are lost during the nonlinear transformation of the visual feature as the time step advances in the LSTM, leading to increasing cumulative error and decreasing accuracy and semantics of the generated sentence. For richer language information, Venugopalan et al. [37] develop an extra branch to enhance language semantics and then integrate it with S2VT [14] for a more powerful representation and a consequent improvement of the generated sentence. In this work, we follow this practice but add a branch for further visual semantics enhancement and fuse its output feature with that of S2VT to enhance the final representation and improve the accuracy of word prediction.

Fig. 3

Visual semantics enhancing-based model for video description (E-VSA). The sequence of visual features is fed to both \(\hbox {LSTM}_{{v}}\) and \(\hbox {LSTM}_{{1}}\) for visual motion feature modeling, and the outputs of \(\hbox {LSTM}_{{v}}\) and \(\hbox {LSTM}_2\) are fused for word prediction

Figure 3 shows the architecture of E-VSA. The visual CNN features are fed to the bottom LSTM in the multimodal-based module (marked as m-S2VT for simplicity) and to the single visual semantics enhancing branch (denoted as B-VE for simplicity) at the same time. At test time, the outputs of B-VE and m-S2VT are fused by the early sequential fusion method. For m-S2VT, there are two stages, visual “encoding” and language “decoding,” and the total number of time steps in the LSTM model is denoted as n. During visual encoding, suppose the number of sampled frames in a video is m; the time step is denoted as \(t1~(1 \le t1 \le m)\). At the decoding stage, the length of the corresponding reference sentence is \(n-m+1\), and the time step is \(t2~(m+1 \le t2 \le n\)). For the extra visual semantics enhancing branch, the time step is denoted by t3. It is worth noting that \(1 \le t3 \le m\) in B-VE since the branch models the motion feature of the same video. If \(m \equiv n-m+1\), the outputs of B-VE and m-S2VT can be fused completely at each time step. However, when \(m<n-m+1\), \(<pad>\) is used in place of the missing output from B-VE at the remaining time steps, while if \(m>n-m+1\), we let \(m = n-m+1\) and the extra frame features are truncated. In practice, two fusion methods are employed: feature concatenation (E-VSC) and element-wise addition (E-VSA). When E-VSC is employed, the model focuses on the balance of visual and language information; in particular, the generated sentence can be further improved in accuracy and semantics if the CNN feature in use possesses weaker representative ability (e.g., the GoogLeNet [6] feature). Conversely, when a more abstract feature (e.g., the ResNet152 [7] feature) is used, the sensitivity to useful information can be enhanced by employing E-VSA.
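The alignment between the B-VE and m-S2VT time steps described above can be sketched as follows; this is a minimal illustration that assumes zero vectors stand in for the \(<pad>\) output, and the helper names are hypothetical.

```python
import torch

def align_branch_outputs(h_ve, decode_len):
    """Align the B-VE hidden states (one per sampled frame, shape (m, d)) with the
    m-S2VT decoding length: pad with zero <pad> vectors or truncate extra frames."""
    m, d = h_ve.shape
    if m < decode_len:
        return torch.cat([h_ve, torch.zeros(decode_len - m, d)], dim=0)
    return h_ve[:decode_len]

def fuse(h_s2vt, h_ve, mode="add"):
    """Early sequential fusion at each decoding step: E-VSA (addition) or E-VSC (concat)."""
    h_ve = align_branch_outputs(h_ve, h_s2vt.shape[0])
    return h_s2vt + h_ve if mode == "add" else torch.cat([h_s2vt, h_ve], dim=-1)
```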

For E-VSC/A, the loss function is denoted as \({\mathscr {L}}_{\rm fusion}^{vs}\), which can be calculated by

$$\begin{aligned} {\mathscr {L}}_{\rm fusion}^{vs}(V,W,\varTheta _{vs}) = \frac{1}{|V|}\sum _{i=1}^{|V|} \sum _{t=1}^{k'} \log {\mathscr {P}}(w'_t|(h_{m+t-1}^{vs\_{\rm fusion}},w'_{1:t-1});\varTheta _{vs}), \end{aligned}$$
(2)

where \(V=\{<v_1^1,\cdots ,v_{m_1}^1>,<v_1^2,\cdots ,v_{m_2}^2>,\cdots ,<v_1^{|V|},\cdots ,v_{m_{|V|}}^{|V|}>\}\) is the set of visual feature sequences for the model in one iteration, and |V| is the size of the set. The \(m_i~(i\in \{1,2,\cdots ,|V|\})\) is the number of sampled frames in the \(i^{th}\) video. And \(\varTheta _{vs}=(\varTheta _{m-s2vt},\varTheta _v)\) is the parameter set, where \(\varTheta _{m-s2vt}\) and \(\varTheta _v\) stand for the parameters of m-S2VT and B-VE, respectively. The \(h_{m+t-1}^{vs\_{\rm fusion}}\) denotes the fusion of the output from the hidden state of the LSTM in B-VE and that of the top LSTM in m-S2VT at the \(m+t-1\) time step, which can be written as

$$\begin{aligned} h_{m+t-1}^{vs\_{\rm fusion}}={\mathscr {F}}(h_{m+t-1}^2,h_{m+t-1}^v). \end{aligned}$$
(3)

In the above equation, \(h_{m+t-1}^2\) is the output of the top LSTM in m-S2VT, while \(h_{m+t-1}^v\) is the output of the LSTM in the B-VE branch. The function \({\mathscr {F}}(\cdot )\) is the fusion operation. Taking element-wise addition as an example, the formulation is

$$\begin{aligned} {\mathscr {F}}(h_{m+t-1}^2,h_{m+t-1}^v)=h_{m+t-1}^2 \oplus h_{m+t-1}^v. \end{aligned}$$
(4)

Afterward, the fused feature is fed to the classification layer for word prediction. At the \(m+t-1\) time step, the output probabilities can be calculated by a Softmax function \({\mathscr {P}}(\cdot )\) as

$$\begin{aligned} {\mathscr {P}}(w'_{t2}|h_l^{vs\_{\rm fusion}})=\frac{\exp (W_{t2} \cdot h_l^{vs\_{\rm fusion}})}{\sum _{i=1}^{|VOC|} \exp (W_{i} \cdot h_l^{vs\_{\rm fusion}})}, \end{aligned}$$
(5)

where \(l=m+t-1\), \(t2 \in [1,n-m+1]\) and \(t2=t\). The \(W_{t2}\) denotes the weight vector of the hidden layer corresponding to the word \(w'_{t2}\), \(W_i\) is that of the \(i\)th word in the vocabulary, and |VOC| is the size of the vocabulary. Hereafter, the word in the vocabulary corresponding to the maximum probability is picked as the predicted output at this time step.
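As a minimal sketch of Eq. (5), the per-step word prediction can be written as a standard softmax over the vocabulary; the function and argument names below are illustrative.

```python
import torch
import torch.nn.functional as F

def predict_word(h_fused, W):
    """Eq. (5): softmax over the vocabulary given the fused hidden state.
    h_fused: (d,) fused state at one decoding step; W: (|VOC|, d) classifier weights."""
    logits = W @ h_fused                    # W_i . h for every word i in the vocabulary
    probs = F.softmax(logits, dim=0)        # exp(W_i . h) / sum_j exp(W_j . h)
    return probs, int(torch.argmax(probs))  # probability vector and predicted word index
```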

3.2 Hybrid enhancement and complementarity of visual and language semantics-based model

The proposed E-VSC/A ignores the further mining of language information, although richer visual details and semantics are achieved. For comprehensive information, E-VSC/A and the Deep-Glove framework are used as references, and a more effective model with richer language and visual semantics is developed. Specifically, the B-VE in E-VSC/A is integrated with the Deep-Glove framework [37], in which an extra language enhancing branch (abbreviated as B-LE, Branch for Language semantics Enhancing) is added on m-S2VT, to form a wide and comprehensive architecture in which vision and language complement each other. The output features from B-VE, m-S2VT and B-LE are fused by the early sequential fusion strategy to enhance the representative ability of the final feature at the training stage. During testing, the output probability from m-S2VT is used for word prediction. The model is named HE-VLSC/A for convenience.

Figure 4 shows the overview of HE-VLSA. The previously predicted word \(w_{t2-1}\) is fed to the top LSTM (\(\hbox {LSTM}_2\) at step t2) in the m-S2VT module and to the language enhancing branch (B-LE) at the t2 time step. During training, \(w_{t2-1}\) is the previous word of the reference sentence, while at test time it is the previously predicted word. For the three branches B-VE, m-S2VT and B-LE, the output features of their respective hidden states are fused by concatenation or element-wise addition at each time step, similar to the practice in E-VSC/A. Note that the inputs of B-LE and m-S2VT correspond to each other at all time steps since they share the same word sequence (sentence).

Fig. 4

Framework of hybrid enhancement and complementarity of visual and language semantics (here, an example of HE-VLSA). The output features from \(\hbox {LSTM}_v\), \(\hbox {LSTM}_{{lm}}\) and \(\hbox {LSTM}_2\) in m-S2VT are fused for word prediction

Similar to the E-VSC/A model, the loss function of HE-VLSC/A is denoted as \({\mathscr {L}}_{{\rm fusion}}^{lvs}(V,W,\varTheta _{lvs})\), where \(\varTheta _{lvs}=(\varTheta _{m-s2vt},\varTheta _v,\varTheta _{lm})\). The fused output state \(h_{m+t-1}^{lvs\_{\rm fusion}}\) can be computed by

$$\begin{aligned} h_{m+t-1}^{lvs\_{\rm fusion}}={\mathscr {F}}(h_{m+t-1}^2,h_{m+t-1}^v,h_{m+t-1}^{lm}), \end{aligned}$$
(6)

where \(h_{m+t-1}^2\), \(h_{m+t-1}^v\) and \(h_{m+t-1}^{lm}\) represent the output from the top LSTM in m-S2VT and the hidden states of B-VE and B-LE at the \(m+t-1\) time step, respectively. The \({\mathscr {F}}(\cdot )\) is the fusion function, implemented as feature concatenation or element-wise addition.
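The practical difference between the two fusion operations is mainly the dimension handed to the word classifier, as the following sketch (with assumed dimensions) illustrates.

```python
import torch
import torch.nn as nn

hidden, vocab = 1000, 10000                      # assumed dimensions
h_s2vt, h_ve, h_le = torch.randn(hidden), torch.randn(hidden), torch.randn(hidden)

# HE-VLSA: element-wise addition keeps the dimension, so the word classifier stays small
h_add = h_s2vt + h_ve + h_le                     # (hidden,)
clf_add = nn.Linear(hidden, vocab)

# HE-VLSC: concatenation triples the dimension and enlarges the word classifier
h_cat = torch.cat([h_s2vt, h_ve, h_le], dim=-1)  # (3 * hidden,)
clf_cat = nn.Linear(3 * hidden, vocab)
```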

3.3 HE-VLSC/A with multi-objective optimization

In the HE-VLSC/A model, the parameter scale increases because of the two added branches B-VE and B-LE for visual and language semantics enhancement, which makes the model prone to over-fitting. Additionally, the errors are backpropagated not only to m-S2VT but also to the B-VE and B-LE branches for parameter updating and optimization. However, the inputs of B-VE and B-LE are not the complete multimodal feature (i.e., only the visual feature for B-VE and only the language information for B-LE), which may lead to insufficient optimization of the whole model.

In this work, two extra objective functions are appended on the B-VE and B-LE branches in addition to the objective function on the fused feature. Figure 5 shows the architecture of the multi-objective-based HE-VLSA. Besides \(L_{{\rm fusion}}\), which is treated as the main loss function, B-VE and B-LE are equipped with their own loss functions \(L_v\) and \(L_{lm}\), respectively. During training, the errors from \(L_{{\rm fusion}}\) are propagated back to the three modules m-S2VT, B-VE and B-LE for parameter updating. Simultaneously, the errors of \(L_{lm}\) and \(L_v\) are fed back to their respective branches, so the parameters of these branches are optimized twice. To distinguish it from HE-VLSC/A, the model is marked as \(\hbox {HE-VLSC}^*/ \hbox {A}^*\).
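A minimal sketch of the multi-objective training signal is given below; combining the three cross-entropy terms by an unweighted sum is an assumption for illustration, since the text only states that each branch receives gradients from its own objective in addition to the fusion objective.

```python
import torch.nn.functional as F

def multi_objective_loss(logits_fusion, logits_lm, logits_v, target_ids):
    """Cross-entropy objectives on the fusion branch, B-LE and B-VE.
    All logits: (N, |VOC|); target_ids: (N,). Summing the three terms is an
    assumption; the paper only states that each branch is driven by its own
    objective in addition to the fusion objective."""
    L_fusion = F.cross_entropy(logits_fusion, target_ids)  # main multimodal objective
    L_lm = F.cross_entropy(logits_lm, target_ids)          # language-only branch (B-LE)
    L_v = F.cross_entropy(logits_v, target_ids)            # visual-only branch (B-VE)
    return L_fusion + L_lm + L_v
```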

Fig. 5

Model of multi-objective-based HE-VLSA (\(\hbox {HE-VLSA}^*\)). Three objective functions are employed for model training. The \(L_{lm}\) and \(L_v\) are for language semantics enhancing branch (i.e., B-LE) and visual feature enhancing branch (i.e., B-VE), respectively, while the \(L_{{\rm fusion}}\) is for the multimodal model (i.e., m-S2VT)

Because the corresponding language features and visual features are missing in B-VE and B-LE, respectively, the gradients from \(L_{lm}\) and \(L_v\) are usually larger than those from \(L_{{\rm fusion}}\). This may be detrimental to the optimization of the two branches, but the larger errors can be treated as extra regularization information for the whole model and improve its generalization ability. In Fig. 6, the loss trend and the performance on CIDEr of HE-VLSA and \(\hbox {HE-VLSA}^*\) are presented. Figure 6a, b shows that the two models possess a similar loss trend whether the GoogLeNet [6] feature or the ResNet152 [7] feature is employed, but the errors of \(\hbox {HE-VLSA}^*\) are higher than those of HE-VLSA on the whole, which indicates that the multi-objective optimization-based \(\hbox {HE-VLSA}^*\) is less likely to fall into over-fitting. On the other hand, the performance of \(\hbox {HE-VLSA}^*\) is superior to HE-VLSA on the CIDEr [48] metric (shown in Fig. 6c, d), which reveals that \(\hbox {HE-VLSA}^*\) effectively enhances the generalization ability and improves the semantics of the generated sentences.

Fig. 6

Comparison of the loss trend and performance on CIDEr with HE-VLSA and \(\hbox {HE-VLSA}^*\) (for a and c, the GoogLeNet feature is employed in model, while the ResNet152 feature is employed in the model for b and d)

When the sentence is generated, the output probabilities from the fusion branch based on m-S2VT, from B-VE and from B-LE are sequentially fused by weighted average for cooperative word prediction. In practice, the output probability from the fusion branch dominates the word selection, assisted by the output probabilities from the B-VE and B-LE branches. The fusion factors are generally chosen empirically.

For the \(\hbox {HE-VLSC}^*\)/\(\hbox {A}^*\) model, there are three loss functions: \({\mathscr {L}}_{{\rm fusion}}^{lvs}(V,W,\varTheta _{lvs})\) for the fusion branch, \({\mathscr {L}}^v(V,W,\varTheta _v)\) for the visual semantics enhancing branch B-VE and \({\mathscr {L}}^{lm}(W,\varTheta _{lm})\) for the language enhancing branch B-LE. The \({\mathscr {L}}^{lm}(W,\varTheta _{lm})\) is blind to visual information; it only provides gradients to the B-LE branch and updates its parameters with language information alone. The formulation of \({\mathscr {L}}^v\) is

$$\begin{aligned} {\mathscr {L}}^v (V,W,\varTheta _v)=\frac{1}{|V|}\sum _{i=1}^{|V|}\sum _{t=1}^{k'}\log {\mathscr {P}}(w'_t|(h_t^v,w'_{1:t-1});\varTheta _v), \end{aligned}$$
(7)

where \(h_t^v\) is the output of the hidden state of B-VE at the t time step, and the previous input is the CNN feature of the \(t-1\) frame. The three objective functions in \(\hbox {HE-VLSC}^*\)/\(\hbox {A}^*\) are independent of each other, but the objective function of the fusion branch contains the information from B-VE and B-LE, and hence the parameters of the two auxiliary branches are also optimized by the objective function of the fusion branch.

At test time, the output probabilities of the three branches at the t time step are fused with

$$\begin{aligned} {\mathscr {P}}_t^{{\rm fusion}}=\alpha {\mathscr {P}}_t^{m-s2vt}+\beta {\mathscr {P}}_t^{lm} + \gamma {\mathscr {P}}_t^v, \end{aligned}$$
(8)

where \({\mathscr {P}}_t^{{\rm fusion}}\) is the fused probability vector, and \({\mathscr {P}}_t^{m-s2vt}\), \({\mathscr {P}}_t^{lm}\) and \({\mathscr {P}}_t^v\) stand for the probability vectors of the fusion branch, B-LE and B-VE, respectively, at the t time step. And \(\alpha\), \(\beta\) and \(\gamma\) are the fusion factors of the three branches, which satisfy the constraint \(\alpha +\beta +\gamma \equiv 1\); the weights are assigned empirically in practice.
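A minimal sketch of this late sequential fusion at one time step follows; the default 8:1:1 ratio is one of the empirical settings reported in Sect. 4.2.

```python
import torch

def late_fuse(p_fusion, p_lm, p_v, alpha=0.8, beta=0.1, gamma=0.1):
    """Eq. (8): weighted average of the per-branch probability vectors at one time
    step; the default 8:1:1 ratio is one of the settings reported in Sect. 4.2."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-6
    p = alpha * p_fusion + beta * p_lm + gamma * p_v
    return int(torch.argmax(p))             # index of the predicted word
```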

Besides the \(\hbox {HE-VLSC}^*\)/\(\hbox {A}^*\) described above, another multi-objective-based variant is also investigated. Specifically, the backbone of Deep-Glove is employed as the language enhancing module in the framework (for convenience and distinction, it is renamed \(\hbox{E-LSC}^\#\)/\(\hbox{A}^{\#}\)), and the model E-VSC/A is employed as the visual semantics enhancing module (renamed \(\hbox {E-VSC}^\#\)/\(\hbox {A}^\#\)). The two modules are then integrated into a wider framework, where they share the same language embedding layer, which maps the “one-hot” word feature into an embedded language representation, and the same visual feature reduction layer, which maps the CNN feature to a lower-dimensional visual representation. During training, the separate \(\hbox {E-VSC}^\#\)/\(\hbox {A}^\#\) and \(\hbox {E-LSC}^\#\)/\(\hbox {A}^\#\) modules are optimized independently with their respective objective functions. However, it is worth noting that the two modules are actually trained jointly on the whole, because the parameter updating and optimization of the shared word embedding layer and visual feature reduction layer depend on both objective functions. At the test stage, the late sequential fusion method is employed to fuse the output probabilities of the two modules, and the fusion weights are set to be equal. For comparison convenience, the variant is denoted as \(\hbox {HE-VLSC}^\#\)/\(\hbox {A}^\#\), and the architecture is shown in Fig. 7.

Fig. 7

Architecture of \(\hbox {HE-VLSC}^\#\)/\(\hbox {A}^\#\). Two independently designed modules (i.e., a language semantics enhancing branch-based module and a visual semantics enhancing branch-based module) are integrated for comprehensive representation. During training, the two objective functions optimize the two modules jointly through the shared word embedding layer. At test time, the outputs of the two modules are fused by sequential weighted average fusion for the final word prediction at each time step

The optimization of \(\hbox {HE-VLSC}^\#\)/\(\hbox {A}^\#\) is similar to that of \(\hbox {HE-VLSC}^*\)/\(\hbox {A}^*\): the gradients from \(\hbox {E-LSC}^\#\)/\(\hbox {A}^\#\) and \(\hbox {E-VSC}^\#\)/\(\hbox {A}^\#\) are fed back not only to their own modules but also to each other through the shared language embedding layer and reduction layer, so the two modules complement each other. During testing, the output probabilities are fused by weighted average as well, where the fusion factors (marked as \(\alpha ^\#\) and \(\beta ^\#\)) are set to be equal empirically.
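The joint optimization through the shared layers can be illustrated with the following sketch; the dimensions and the dummy stand-in losses are assumptions used only to show that a single backward pass sends gradients from both objectives into the shared embedding and reduction layers.

```python
import torch
import torch.nn as nn

# Shared layers (dimensions are illustrative): both modules read words through
# `embed` and frame CNN features through `reduce`.
embed = nn.Embedding(10000, 500)
reduce = nn.Linear(2048, 500)

words = torch.randint(0, 10000, (4, 12))   # dummy word ids
feats = torch.randn(4, 26, 2048)           # dummy ResNet152-sized frame features

# Dummy stand-ins for the two module objectives; in the real model each is the
# cross-entropy of its own word predictor fed through the shared layers.
loss_ls = embed(words).mean()                         # language enhancing module (stand-in)
loss_vs = embed(words).mean() + reduce(feats).mean()  # visual enhancing module (stand-in)

(loss_ls + loss_vs).backward()             # one backward pass: the shared layers
print(embed.weight.grad is not None)       # accumulate gradients from both -> True
```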

4 Experiments

To evaluate the proposed models, experiments are conducted on two public datasets, MSVD [2, 32] and MSR-VTT2016 [33]. In this section, the employed datasets and evaluation metrics are described in detail first. Then the performance of the proposed models and comparable models is presented and discussed. Concretely, an ablation study is carried out to reveal the performance improvement brought by the proposed methods. Besides, a performance comparison with popular state-of-the-art methods is conducted to show the superiority of the proposed models. Finally, a few example sentences generated by the proposed models and other comparable models are presented and discussed to further reveal the effectiveness of the proposed models.

4.1 Evaluation dataset and metric

Two popular datasets, MSVD [2, 32] and MSR-VTT2016 [33], are employed to evaluate the proposed methods. The MSVD dataset [2, 32] contains 1997 simple video clips (e.g., cooking, exercise) and 80827 corresponding reference sentences in total. Following the common protocol, 1200 videos and the related 48774 references are used for training, and 100 videos with 4290 references are used for model validation and hyperparameter search. The remaining video–sentence pairs, 677 videos and their related sentences, are used for testing. The dataset contains limited training samples and simple video–sentence pairs. In comparison, MSR-VTT2016 [33] provides larger-scale and more complex videos and descriptions. The dataset includes 10000 videos, and each video corresponds to 20 description sentences. In general, 6513 videos and the related references are used for training, 497 video–sentence pairs for validation, and the remaining 2990 videos for evaluation.

Different from classification or object detection, the evaluation of video description is more complicated since more factors have to be taken into account, including accuracy, coherence and semantics. For a more comprehensive evaluation of the generated sentences, metrics from machine translation, including BLEU [49], METEOR [50] and ROUGE\(\_\)L [51], are usually borrowed for video description. BLEU [49] assesses the quality of generated sentences by measuring the matching degree of n-grams (where \(n \in \{1,2,3,4\}\)) between references and candidate sentences. Generally speaking, the larger the n and the higher the BLEU-n, the better the quality of the generated sentences (for simplicity, the metric is denoted as “B-n”). However, the BLEU metric focuses on the precision of word or phrase prediction and ignores recall. By comparison, both precision and recall are considered in METEOR [50]. Specifically, three matching strategies, exact matching, synonym matching and stem matching, are employed to build the matching alignment set between references and candidate sentences. The ratio between the set size and the candidate sentence length is treated as the precision, while the ratio between the set size and the reference sentence length is treated as the recall. Finally, the harmonic mean of the precision and recall is calculated as the evaluation score (the metric is marked as “M” for convenience). Different from METEOR, ROUGE\(\_\)L [51] is based on the longest common subsequence (LCS): the ratio between the LCS length and the candidate sentence length is used as the precision, and the ratio between the LCS length and the reference sentence length is used as the recall (the metric is abbreviated as “R” for convenience). Besides, for a more targeted evaluation of visual description, Vedantam et al. develop the CIDEr metric [48]. Based on the idea of “consensus,” the distribution of n-grams over all reference sentences of the image/video to be described is analyzed statistically, and each n-gram is assigned a weight according to its frequency. The similarity between references and candidate sentences is then calculated as the score (the metric is abbreviated as “C” for convenience).

In fact, each evaluation metric focuses on particular aspects of the generated sentences, and thus it is difficult to assess the ability of a model comprehensively with a single metric. In this work, we follow the popular practice and employ all of BLEU [49], METEOR [50], ROUGE\(\_\)L [51] and CIDEr [48] to measure the quality of the generated sentences.
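For reference, the core of BLEU-n is the clipped n-gram precision sketched below (brevity penalty omitted); in practice, the standard captioning evaluation toolkits are used to compute all four metrics.

```python
from collections import Counter

def ngram_precision(candidate, references, n):
    """Clipped n-gram precision, the core of BLEU-n (brevity penalty omitted).
    `candidate` and each element of `references` are lists of tokens."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

print(ngram_precision("a man is playing a guitar".split(),
                      ["a man plays a guitar".split()], 1))  # 4/6 unigram precision
```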

4.2 Experimental setting

All our experiments are implemented on 4 GPUs (TITAN X Pascal, 12GB) to speed up model convergence, and the popular deep learning framework Caffe [52] is employed for model configuration, training and testing. In order to reduce redundancy while making full use of the information in a video, one frame is sampled every 10 frames. As for the visual features, two CNN models, GoogLeNet [6] and ResNet152 [7], are employed to extract a feature from each sampled frame to verify the performance of the proposed models comprehensively. The two CNN models are not only pre-trained on the large-scale classification dataset ImageNet [20], but also fine-tuned on the image captioning dataset MSCOCO2014 [53]. During fine-tuning, the LRCN framework [11] is employed as the backbone, where the CNN model for visual encoding and the LSTM for language decoding are jointly optimized. The stochastic gradient descent solver is used, and the batch size is set to 16. The learning rate is initially set to 0.01 and then decreased step-wise by a factor of 0.1 after every 20000 iterations. The CNN model at 490000 iterations is picked to extract the visual feature of each frame since the model has converged and possesses the best performance on the validation set of MSCOCO2014.
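The frame sampling step can be sketched as follows; the 224×224 input size and the use of OpenCV are assumptions for illustration.

```python
import cv2
import numpy as np

def sample_frames(video_path, step=10, size=(224, 224)):
    """Keep every `step`-th frame of a video (step=10, matching the sampling
    interval above); frames are resized to the assumed CNN input size."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.resize(frame, size))
        idx += 1
    cap.release()
    return np.stack(frames) if frames else np.empty((0,) + size + (3,))
```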

The two-stage pre-training described above forces the CNN model to learn the category information of objects in the first stage. The CNN model is then further fine-tuned in the second stage, which enables it to perceive sufficient information in advance for visual description since there are enough image–sentence pairs in MSCOCO2014. Consequently, the two-stage pre-training strategy makes the model sensitive to frequently used words/phrases and sentence patterns in the subsequent video description task. The extracted CNN feature sequence of the sampled frames of a video is then fed in order to the proposed models and comparable models for motion encoding and the following language decoding. During training, the number of time steps in m-S2VT is set to 80, following the practice in S2VT [14]. When the sum of sampled frames and words in the corresponding reference exceeds this limit, all the words are fed to the model, while the extra frames are truncated. Regarding the visual semantics enhancing branch, the number of time steps is equal to that of the language decoding part of m-S2VT. However, the limit is removed at test time, which means that all the sampled frames are fed to the LSTM for visual motion encoding.
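A minimal sketch of the 80-step training budget mentioned above, with a hypothetical helper name:

```python
def fit_to_budget(frame_feats, word_ids, max_steps=80):
    """Enforce the 80-step training budget of m-S2VT: all reference words are
    kept and extra frame features are truncated when the total exceeds the budget."""
    n_frames = max(0, min(len(frame_feats), max_steps - len(word_ids)))
    return frame_feats[:n_frames], word_ids
```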

To further improve performance, beam search is employed to expand the search space of words, following the practice in most popular works. However, when the proposed models except \(\hbox {HE-VLSC}^\#\)/\(\hbox {A}^\#\) are run on MSVD [2, 32], the beam size is set to 1 empirically, which means that beam search reduces to greedy decoding. For all the remaining experiments, the beam size is set to 3. In the \(\hbox {HE-VLSC}^*\)/\(\hbox {A}^*\) model, the fusion factors \(\alpha\), \(\beta\) and \(\gamma\) are chosen by experimental experience; in general, the best performance is obtained when the ratio is set to 8:1:1 or 6:2:2. (The discussion and possible reasons are provided in Sect. 4.3.2.)
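A minimal beam search sketch is given below for illustration; with a beam size of 1 it reduces to the greedy decoding used on MSVD, and `step_fn` is a hypothetical callback standing in for the decoder.

```python
def beam_search(step_fn, bos, eos, beam=3, max_len=20):
    """Minimal beam search sketch. `step_fn(prefix)` returns (word_id, log_prob)
    continuations for a word-id prefix; beam=1 reduces to greedy decoding."""
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:                 # finished hypotheses are kept as-is
                candidates.append((prefix, score))
                continue
            for w, lp in step_fn(prefix):
                candidates.append((prefix + [w], score + lp))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam]
        if all(p[-1] == eos for p, _ in beams):
            break
    return beams[0][0]                            # best word-id sequence
```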

4.3 Experimental result and discussion

4.3.1 Ablation study

In order to evaluate the effectiveness of the proposed methods, including visual semantics enhancement and multi-objective optimization-based hybrid enhancement and complementarity of visual and language semantics, ablation experiments are conducted on MSVD [2, 32] and MSR-VTT2016 [33]. Concretely, the Deep-Glove framework [37] is employed as the backbone to construct the baseline model. In the original Deep-Glove [37], the extra language semantics branch is pre-trained on other large-scale corpora for richer language semantics, and the whole model is then trained on a video captioning dataset (e.g., MSVD [2, 32] or MSR-VTT2016 [33]) with the extra language branch fine-tuned and m-S2VT fully trained. We abandon this strategy and instead train the model directly on the video captioning dataset for fair comparison (the baseline model is denoted as E-LSC for convenience and distinction). In addition, the proposed E-VSC/A, HE-VLSC/A and \(\hbox {HE-VLSC}^*\)/\(\hbox {A}^*\) are implemented on the two datasets.

The performance of the baseline and proposed models on the MSVD dataset with GoogLeNet and ResNet152 features is shown in Table 1. When the GoogLeNet feature is in use, both the proposed E-VSC and E-VSA perform better than the baseline model (E-LSC) on MSVD. More specifically, E-VSC reaches 45.3 on B-4 and 66.5 on CIDEr, outperforming the baseline model by 3.6 and 5.5, respectively. Note that E-VSA is slightly inferior to E-VSC on all metrics, although it is still superior to the baseline model. This comparison reveals the effectiveness of the proposed visual semantics enhancing method. As for the proposed HE-VLSC and HE-VLSA, HE-VLSC/A generally performs better on several metrics compared to the baseline model and E-VSC/A. In particular, the CIDEr of HE-VLSC reaches 67.2, 0.7 points higher than the best E-VSC. However, the B-4, METEOR and ROUGE\(\_\)L of HE-VLSC are not effectively improved but decline slightly. This indicates that the B-LE branch has the ability to enhance the semantics of the generated sentences but is not good at improving coherence. On the contrary, HE-VLSA is more concerned with coherence, which is reflected in BLEU, but yields on overall semantics as measured by CIDEr. The reason is that the GoogLeNet feature possesses weaker abstraction and representative ability, so more visual noise may be introduced into the model by the further mining of visual semantics with an extra branch. What is more, the visual noise is further amplified by the element-wise addition operation for feature fusion, hindering performance improvement.

As for \(\hbox {HE-VLSC}^*\)/\(\hbox {A}^*\), it outperforms the other comparable models on the whole with the GoogLeNet feature. Taking \(\hbox {HE-VLSC}^*\) as an example, it performs better than HE-VLSC on all evaluation metrics except CIDEr. Additionally, \(\hbox {HE-VLSA}^*\) is also superior to HE-VLSA on every metric; in particular, the CIDEr of \(\hbox {HE-VLSA}^*\) outperforms HE-VLSA by 4.6. However, the comparison trend of the two models on CIDEr is opposite to that of HE-VLSA and HE-VLSC. This reveals that, with multi-objective optimization, element-wise addition can enhance the response of the features to semantic concepts to a certain extent.

Table 1 Performance comparison of each proposed model on MSVD dataset with GoogLeNet and ResNet152 feature, respectively (the “M,” “R” and “C” are the abbreviations of “METEOR,” “ROUGE\(\_\)L” and “CIDEr,” respectively, and the E-LSC is actually the baseline model)

Furthermore, when the ResNet152 feature is employed, the performance of E-VSC/A is inferior to the baseline model (E-LSC), in which the language enhancing branch is used. However, the performance of HE-VLSA, in which the hybrid enhancement of visual and language semantics is implemented, is greatly improved. In particular, its B-4 and CIDEr outperform the baseline model by 2.4 and 1.4, respectively. Moreover, the performance of \(\hbox {HE-VLSA}^*\) on B-4 and CIDEr surpasses the baseline by 4.2 and 4.4, sufficiently demonstrating the effectiveness and superiority of the proposed methods. Another remarkable trend is that the models with element-wise addition (i.e., HE-VLSA and \(\hbox {HE-VLSA}^*\)) perform better than those with concatenation (i.e., HE-VLSC and \(\hbox {HE-VLSC}^*\)) on most metrics. The possible reason is that the ResNet152 feature is sufficiently abstract compared to GoogLeNet, so the extra visual semantics enhancing branch cannot offer much additional effective information to the whole model.

The experimental results of the baseline model (E-LSC) and the proposed models on MSR-VTT2016 [33] with GoogLeNet and ResNet152 features are shown in Table 2. When the GoogLeNet feature is employed, the performance improvement of each proposed model on MSR-VTT2016 is not as obvious as that on the MSVD dataset [2, 32] compared to the baseline model. As shown in Table 2, the baseline model E-LSC commonly performs better than E-VSC/A on most of the metrics. This is rooted in the fact that the sentences in MSR-VTT2016 are usually longer and semantically richer than those in MSVD, so more language information can be captured by the B-LE branch. As for HE-VLSC/A and \(\hbox {HE-VLSC}^*\)/\(\hbox {A}^*\), they achieve better performance than the baseline model and the proposed E-VSC/A on the whole, but the best model, HE-VLSA, only outperforms the baseline by 1.3 and 2.0 on B-4 and CIDEr, respectively. Also, the better \(\hbox {HE-VLSA}^*\) (compared to \(\hbox {HE-VLSC}^*\)) is superior to the baseline by only 0.9 and 1.2. On the other hand, the performance of \(\hbox {HE-VLSC}^*\)/\(\hbox {A}^*\) is generally inferior to HE-VLSC/A, which is the opposite of the trend on MSVD. The possible reason is that MSR-VTT2016 is relatively clean, with less visual and language noise than MSVD, so the model is less likely to be disturbed by the errors from B-LE and B-VE, which limits the gain in generalization ability. At the test stage, the lower prediction accuracy of the two branches weakens the fused performance of the final word prediction.

Table 2 Performance comparison of each proposed model on MSR-VTT2016 dataset with GoogLeNet and ResNet152 feature, respectively

When the ResNet152 feature is employed, the comparison between the baseline model E-LSC and E-VSC/A shows a trend similar to that with the GoogLeNet feature. Another remarkable observation is that \(\hbox {HE-VLSC}^*\) performs better than \(\hbox {HE-VLSA}^*\) when the GoogLeNet feature is used as the visual representation, while the result is the opposite when the ResNet152 feature is in use. The trend indicates that the performance gain of concatenation is limited compared to element-wise addition when the more abstract visual feature (i.e., ResNet152) is used.

Besides the aforementioned model evaluation and discussion, the performances of \(\hbox {HE-VLSC}^\#\)/\(\hbox {A}^\#\) on MSVD [2, 32] and MSR-VTT2016 [33] with GoogLeNet and ResNet152 features are presented in Table 3 and Table 4, respectively. For comparison, \(\hbox {E-VSC}^\#\)/\(\hbox {A}^\#\) and \(\hbox {E-LSC}^\#\)/\(\hbox {A}^\#\) are evaluated, where the output probability from only the visual semantics enhancing-based module or only the language enhancing-based module, respectively, is used for word prediction during testing. From the results in Table 3, it is obvious that the performance of \(\hbox {HE-VLSC}^\#\)/\(\hbox {A}^\#\) is greatly improved compared to \(\hbox {E-VSC}^\#\)/\(\hbox {A}^\#\) or \(\hbox {E-LSC}^\#\)/\(\hbox {A}^\#\) on all evaluation metrics on MSVD. As an example, the B-4 and CIDEr of \(\hbox {HE-VLSA}^\#\) reach 47.8 and 69.4, respectively, on MSVD with the GoogLeNet feature. When the ResNet152 feature is employed, the performance on the two metrics reaches 52.4 and 81.5, respectively, outperforming \(\hbox {E-LSC}^\#\) or \(\hbox {E-VSC}^\#\) used alone. Additionally, the proposed \(\hbox {HE-VLSC}^\#\)/\(\hbox {A}^\#\) also performs well on MSR-VTT2016 regardless of whether the GoogLeNet or ResNet152 feature is used, and its performance is greatly improved in comparison with \(\hbox {E-LSC}^\#\)/\(\hbox {A}^\#\) and \(\hbox {E-VSC}^\#\)/\(\hbox {A}^\#\) (the results are presented in Table 4).

Table 3 Performance comparison of \(\hbox {E-VSC}^\#\)/\(\hbox {A}^\#\), \(\hbox {E-LSC}^\#\)/\(\hbox {A}^\#\) and \(\hbox {HE-VLSC}^\#\)/\(\hbox {A}^\#\) on MSVD dataset with GoogLeNet and ResNet152 features, respectively
Table 4 Performance comparison of \(\hbox {E-VSC}^\#\)/\(\hbox {A}^\#\), \(\hbox {E-LSC}^\#\)/\(\hbox {A}^\#\) and \(\hbox {HE-VLSC}^\#\)/\(\hbox {A}^\#\) on MSR-VTT2016 dataset with GoogLeNet and ResNet152 features, respectively

In addition, it can be found from the experimental results in Table 3 and Table 4 that the models with different fusion methods perform quite differently. In general, the models using feature concatenation perform better than those using element-wise addition. The reason can be attributed to the fact that no feature information is lost by the concatenation operation, so the representation is relatively more complete and comprehensive. By contrast, element-wise addition improves the sparsity of the features, retains the dimension and speeds up iteration, but some useful information may be obliterated.

4.3.2 Example of generated sentence and discussion

The weight assignment of \(\alpha\), \(\beta\) and \(\gamma\), belonging to m-S2VT, B-LE and B-VE, respectively, is investigated. The performance trends under different ratios of \(\alpha:\beta:\gamma\) on the MSVD dataset are presented in Fig. 8, where Fig. 8a, b shows the performance trends with the GoogLeNet feature, while Fig. 8c, d shows those with the ResNet152 feature. From the comparison, \(\hbox {HE-VLSC}^*\) and \(\hbox {HE-VLSA}^*\) generally achieve the best performance on B-4 when the ratio \(\alpha:\beta:\gamma\) is 6:2:2 or 4:3:3. However, with the ratio 4:3:3, the performance on CIDEr is not satisfactory when the GoogLeNet feature is used. The possible reason is that the B-LE and B-VE modules can only be treated as auxiliary branches for visual and language semantics enhancement and compensation since incomplete information is used as their input. In contrast, the m-S2VT module receives the complete multimodal information and fuses the outputs of the three modules for word prediction, and hence behaves better than the other two modules. Consequently, \(\alpha\) is usually given a larger weight than \(\beta\) and \(\gamma\). The ratio \(\alpha:\beta:\gamma\) is set to 8:1:1 or 6:2:2 empirically in our work.

Fig. 8

Performance trend comparison on B-4 and CIDEr under different ratios of \(\alpha:\beta:\gamma\) on MSVD dataset (a and b are with GoogLeNet feature, while c and d are with ResNet152 feature)

Fig. 9

Examples of references and sentences generated by different models (the ResNet152 feature is employed in all models; samples a and b are from the MSVD dataset, while c and d are from the MSR-VTT2016 dataset. “Ref” denotes the selected references of the corresponding videos)

In addition, a few examples of references and sentences generated by the baseline and proposed models are presented in Fig. 9; the samples are from MSVD and MSR-VTT2016, respectively. From the examples, the sentences generated by the proposed models possess better quality than those of the baseline model and E-VSC/A on the whole. In particular, the sentences generated by \(\hbox {HE-VLSC}^\#\) are superior to those of the other comparable models in accuracy and semantics. In the example of Fig. 9a, which is from MSVD, the visual content is described comprehensively and accurately in the sentence generated by \(\hbox {HE-VLSC}^\#\); in contrast, both the grammar and the logic are incorrect in the sentence from the baseline model, and the other comparable models often predict wrong words (e.g., “rice,” “water”). Similarly, the sentence element “A group of people” in the sentence generated by \(\hbox {HE-VLSC}^\#\) is more accurate than “A woman” in the baseline model’s sentence for the video in Fig. 9c, which is from MSR-VTT2016.

From Fig. 9, an interesting fact can be observed: different models may be sensitive to different videos. Taking Fig. 9b as an example, the baseline model describes the video plainly, while the sentences generated by HE-VLSC/A and \(\hbox {HE-VLSC}^*\)/\(\hbox {A}^*\) are generally coarser. However, the sentences from HE-VLSC/A and \(\hbox {HE-VLSA}^*\) are more appropriate for Fig. 9d; by contrast, the description from \(\hbox {HE-VLSA}^\#\) is ambiguous (“unk,” which stands for “unknown,” is used). Most importantly, there is still a great gap to be bridged between the generated sentences and the references in flexibility and semantics.

Table 5 Average time cost (ms) (GPU time) of each model for one sentence generation with GoogLeNet feature and ResNet152 feature

The average time cost (GPU time) of the baseline model and the proposed models for generating one sentence is given in Table 5. From the results, it is obvious that E-VSC and E-VSA cost less time than the baseline model (E-LSC) whether the GoogLeNet or ResNet152 feature is employed. Comparing the time cost of HE-VLSC and HE-VLSA, HE-VLSC takes more time than HE-VLSA since the element-wise addition operation does not extend the dimension of the feature used for word prediction. However, when \(\hbox {HE-VLSC}^*\)/\(\hbox {A}^*\) is implemented, the time cost increases significantly, because each branch (module) such as B-LE, B-VE and m-S2VT in \(\hbox {HE-VLSC}^*\)/\(\hbox {A}^*\) has to calculate a probability vector to be fused for the final word prediction. Additionally, similar to the trend of HE-VLSC/A, \(\hbox {HE-VLSA}^*\) possesses a lower time cost than \(\hbox {HE-VLSC}^*\), which also indicates that the concatenation operation consumes more time than element-wise addition.

As for the parameter scale of each module, there are 21.81 million and 0.51 million parameters for word embedding and visual feature dimension reduction, respectively (assuming the GoogLeNet feature is employed). The m-S2VT contains about 12.60 million parameters, and the parameter scales of B-LE and B-VE are about 6.30 million and 8.40 million, respectively. The output module generally possesses a relatively large scale, with about 74.40 million parameters per branch. Therefore, the parameter scales of HE-VLSA and HE-VLSC are about 124.02 million and 198.42 million, respectively, while when the multi-objective strategy is employed, the scales increase sharply to 272.82 million and 347.22 million in the two models (\(\hbox {HE-VLSA}^*\) and \(\hbox {HE-VLSC}^*\)). Additionally, \(\hbox {HE-VLSC}^\#\) has the largest scale, with about 359.82 million parameters; by contrast, the scale of \(\hbox {HE-VLSA}^\#\) is only about 211.02 million. The comparison details are presented in Table 6.
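For readers who want to relate these figures to layer dimensions, a generic helper for counting the parameters of a single LSTM layer is sketched below; the dimensions in the example are assumptions, and the helper is not meant to reproduce the exact figures in Table 6.

```python
def lstm_params(input_dim, hidden_dim):
    """Parameter count of one LSTM layer: four gates, each with an input weight
    matrix, a recurrent weight matrix and a bias vector."""
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim)

# Illustrative only: with a 500-d input and a 1000-d hidden state (assumed
# dimensions), a single LSTM layer holds about 6.0M parameters.
print(lstm_params(500, 1000))  # 6004000
```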

Table 6 Parameter scale comparison of each model

4.3.3 Comparison with the state-of-the-art methods

A performance comparison with popular models is made, and the results are shown in Tables 7 and 8. It is obvious that the proposed \(\hbox {HE-VLSA}^*\)(R), \(\hbox {HE-VLSA}^\#\)(R) and \(\hbox {HE-VLSC}^\#\)(R) (where “R” indicates that the ResNet152 feature is used in the model) achieve competitive performance on MSVD compared to the state of the art. In particular, the \(\hbox {HE-VLSC}^\#\)(R) model outperforms the popular GRU-\(\hbox {EVE}_{hft+sem}\) [54] by 3.4 on the CIDEr metric. On the MSR-VTT2016 dataset [33], the proposed HE-VLSA(R), \(\hbox {HE-VLSA}^\#\)(R) and \(\hbox {HE-VLSC}^\#\)(R) also perform better than most of the popular works. However, the proposed models are inferior to GRU-\(\hbox {EVE}_{hft+sem}\) [54] on CIDEr, although they perform better on the B-4 metric, which indicates that the proposed methods remain to be further improved on a relatively large and complex dataset.

Table 7 Performance comparison to the other popular models on MSVD dataset
Table 8 Performance comparison to the other popular models on MSR-VTT2016 dataset

5 Conclusion

A framework with hybrid enhancement and complementarity of visual and language semantics is proposed in this work, which aims to mine and use visual information efficiently and to let vision and language information compensate for each other. Inspired by the Deep-Glove model, a visual semantics enhancing branch is developed and appended to a multimodal-based module with a concatenation or element-wise addition operation to further take advantage of visual information. On this basis, the language and visual semantics enhancing branches and the multimodal-based module are integrated into a wider and more effective model. Additionally, a multi-objective optimization strategy is employed, and a further framework is developed, to optimize the proposed model and boost performance. Experiments on the MSVD and MSR-VTT2016 datasets are conducted, and competitive results are achieved. The comparison shows that the proposed models are more effective than and superior to not only the baseline model but also other popular methods. Specifically, when the more abstract visual feature (e.g., ResNet152) is employed, the visual semantics enhancing method can effectively supplement visual information to the model, and better performance can be obtained compared to language semantics enhancement alone; otherwise, the model with a language semantics enhancing branch performs better than that with a visual semantics enhancing branch. When both visual and language semantics enhancing methods are employed, the performance is generally further improved. In particular, the variant \(\hbox {HE-VLSC}^\#\) achieves competitive performance compared to the state-of-the-art methods on both datasets.

The results indicate that language semantics enhancement is better suited to situations where the reference sentences are clean and semantically rich, while visual semantics enhancement is more adaptable when there is more extra noise in the sentences. Naturally, the integration of the two methods facilitates the complementarity of visual and language information and further performance improvement. Furthermore, the concatenation operation is more sensitive to the complete information in each branch and to the performance balance among the different modules, as \(\hbox {HE-VLSC}^\#\) performs better than the other models. However, the accuracy and semantics of the generated sentences remain to be further improved. In fact, the proposed methods can be introduced into other popular and powerful frameworks to further boost performance. In future work, more prior knowledge such as visual concepts and attributes, attention units and related variants (e.g., Transformer), and reinforcement learning strategies will be introduced into the proposed framework for further performance improvement.