
1 Introduction

Over the past decade, the tremendous increase in multimedia data (e.g., images, audio, and video) has brought new challenges to information processing. Traditional approaches to representation learning, classification, and retrieval usually focus on a single modality. However, in reality, we receive data from different information channels, and one of the most common scenarios is the image-text paired document. It is worth noting that different data modalities carry different information at different semantic levels. Fig. 1 shows an example document that contains an image and a loosely related descriptive text. If the image and text information can be semantically fused, more expressive and representative features can be learned for this document, which in turn improves multimodal document classification accuracy. Realizing the importance of multimodal information, in this work we propose to address this problem by fusing visual and textual information with a deep neural network.

Joint modeling of multiple modalities is an open problem, as it requires bridging the “semantic gap” across modalities. The procedure for multimodal data modeling generally falls into two stages: (1) modality representation and (2) correlation learning. For image representation, one popular approach is to represent images as “bag-of-visual-words” (BOVW) using the scale-invariant feature transform (SIFT) [10] or Dense SIFT [16] descriptor. For text representation, text is commonly represented by topic features derived from Latent Dirichlet Allocation [2]. Recently, many approaches have been proposed to explore the correlation between different modalities, including Canonical Correlation Analysis (CCA) [14], Semantic Correlation Match (SCM) [12], and Cross-Modal Topic Correlations (CMTC) [19].

Fig. 1.

An example document paired with an image and a descriptive text

Unfortunately, the problem of fusing and combining different modalities has rarely been discussed for multimedia data classification. In this paper, from a different perspective, we focus on the multimodal fusion problem [1]. In [3], Clinchant et al. proposed a semantic combination approach for late fusion and image re-ranking in multimedia retrieval. Liu et al. [9] proposed the Sample Specific Late Fusion (SSLF) method, which learns optimal sample-specific fusion weights and enforces that positive samples receive the highest fusion scores. In [17], a deep neural network proved effective at fusing video keyframe and audio information for video classification. These works demonstrate the powerful capability of late fusion in areas such as video analysis [18], image retrieval [5], and object recognition [13].

Our work is distinguished from previous works in two aspects. First, we investigate deep convolutional neural network (CNN) features as the image representation, motivated by the recent success of deep CNN features in addressing various research questions such as speech recognition [6], image classification [8], and multimodal learning [11]. Compared to the commonly used SIFT features, we show that deep CNN features are more robust and representative for multi-modality fusion. Second, we propose to use a deep neural network (DNN) to capture the highly non-linear dependency between different modalities; in addition, late fusion with a linear interpolation rule is adopted to capture the semantic contribution of image and text. Our contributions can be summarized as follows:

  1. We propose to represent images and text as higher-level features using deep CNN features and topic features, respectively.

  2. We propose a novel approach to learn a visual-textual fusion feature, which serves as a unified representation for document categorization.

  3. Extensive experiments and discussion are provided to show the effectiveness of DNN-based late semantic fusion.

The remainder of this paper is organized as follows. Section 2 presents the proposed approach for late semantic fusion, and Sect. 3 describes our implementation details. Section 4 presents the experimental evaluation, which illustrates the effectiveness of DNN-based late semantic fusion. Section 5 concludes this work.

2 DNN Late Semantic Fusion

This section introduces the proposed DNN late semantic fusion. Given a set of N documents \(S=\{D_n\}, \forall {n}=1,2,...,N\), where each \(D_n\) is an image-text paired document, we extract a deep CNN feature and a topic feature for each document \(D_n\). At the feature level, document \(D_n\) can then be represented as \(D_n=\{I_n,T_n\}\), \(I_n\in \mathbb {R}^{d_i}\), \(T_n\in \mathbb {R}^{d_t}\), where \(d_i\) and \(d_t\) are the dimensionalities of the visual and textual features respectively. Traditionally, the visual feature \(I_n\) and the textual feature \(T_n\) can be combined at the feature level, which is called early fusion and is formulated as

$$\begin{aligned} F_{(n)}=\alpha _v f_{r}(I_n)+(1-\alpha _v)\,f_{r}(T_n), ~\forall {n=1,2,..,N},r \in [0,1] \end{aligned}$$
(1)

where \(F_{(n)}\) is the early fused feature used to represent document \(D_n\) and \(f_{r}(\cdot )\) is a normalization operator. \(\alpha _v\) \((0\le \alpha _v\le 1)\) and \(1-\alpha _v\) denote the fusion weights of the visual and textual features respectively. Another common approach, depicted in Eq. (2) and called late fusion, performs fusion at the decision level by combining the prediction scores of M pre-trained classifiers \(C_m\).

$$\begin{aligned} P_{(n)}=\alpha _v C_m(I_n)+(1-\alpha _v)\,C_m(T_n),\forall {n=1,..,N},\forall {m=1,..,M} \end{aligned}$$
(2)
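To make the two baseline strategies concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) implements Eqs. (1) and (2). Since \(d_i\ne d_t\) in general, the early-fusion example combines the weighted, normalized vectors by concatenation, which is one common reading of Eq. (1); the min-max normalizer, toy dimensions, and random inputs are assumptions made only for this example.

```python
import numpy as np

def min_max_normalize(x):
    """Normalization operator f_r(.): rescale a feature vector to [0, 1]."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo + 1e-12)

def early_fusion(I_n, T_n, alpha_v=0.5):
    """Eq. (1): weight the normalized visual/textual features and combine them.
    The weighted vectors are concatenated here because d_i != d_t in general."""
    return np.concatenate([alpha_v * min_max_normalize(I_n),
                           (1.0 - alpha_v) * min_max_normalize(T_n)])

def late_fusion(score_visual, score_textual, alpha_v=0.5):
    """Eq. (2): combine the class-score vectors of two pre-trained classifiers."""
    return alpha_v * score_visual + (1.0 - alpha_v) * score_textual

# Toy example: a 4096-d CNN feature, a 20-d topic feature, 10 classes.
rng = np.random.default_rng(0)
I_n, T_n = rng.normal(size=4096), rng.dirichlet(np.ones(20))
fused_feature = early_fusion(I_n, T_n, alpha_v=0.3)        # fed to a single classifier
fused_scores = late_fusion(rng.dirichlet(np.ones(10)),     # e.g. classifier scores on I_n
                           rng.dirichlet(np.ones(10)),     # e.g. classifier scores on T_n
                           alpha_v=0.3)
predicted_class = int(np.argmax(fused_scores))
```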
Fig. 2.

DNN late fusion framework. Red nodes are visual (image) feature inputs and blue nodes are textual (text) feature inputs. The visual-textual fusion feature can be extracted from the output of the \(4^{th}\) layer (Color figure online).

In both approaches, \(\alpha _v\) is usually assigned empirically to reflect the importance of an individual feature or classifier. Unfortunately, neither fusion strategy takes the correlation between visual and textual features into account. A good fusion approach should consider the underlying shared semantics across modalities and take advantage of their complementarity. To address this problem, besides heuristically assigning \(\alpha _v\) between 0 and 1 to capture the semantic contribution of each modality, we also learn latent fusion weights using a deep neural network to capture the relationships between image and text. To achieve this goal, we propose the DNN fusion architecture shown in Fig. 2. For a given training sample \(\{I_n, T_n, Y_n\}\), where \(I_n\) and \(T_n\) are the input image and text features respectively and \(Y_n\) is the ground-truth category label, the final output of the global network can be represented as

$$\begin{aligned} {\left\{ \begin{array}{ll} \hat{Y}^{(5)}=g^{(5)}(\hat{Y}^{(4)}W^{(4)}+b^{(4)})\\ \hat{Y}^{(4)}=f^{(4)}((\alpha _vP_v+(1-\alpha _v)P_t)W^{(3)}+b^{(3)}) \end{array}\right. } \end{aligned}$$
(3)

where \(\hat{Y}^{(l)}\) is the output of the \(l^{th}\) layer and \(W^{(l)}\) denotes the weights connecting to the \((l-1)^{th}\) layer (see also Fig. 2). \(g(\cdot )\) and \(f(\cdot )\) are activation functions and \(b^{(l)}\) is the bias term of the \(l^{th}\) layer. \(P_v\) and \(P_t\) are prediction scores computed from the input features \(I_n\) and \(T_n\) by

$$\begin{aligned} {\left\{ \begin{array}{ll} P_v=f^{(3)}[f^{(2)}(I_nW^{(1)}_v+b^{(1)}_v)W^{(2)}_v+b^{(2)}_v]\\ P_t=f^{(3)}[f^{(2)}(T_nW^{(1)}_t+b^{(1)}_t)W^{(2)}_t+b^{(2)}_t] \end{array}\right. } \end{aligned}$$
(4)

We utilize the sigmoid function \(f^{(2)}(x)\)=\(f^{(3)}(x)\)=\(f^{(4)}(x)\)=\(\frac{1}{1+e^{-x}}\) and the softmax function \(g^{(5)}(x)\)=\(e^{(x-\varepsilon )}/\sum _{k=1}^{K}e^{(x_k-\varepsilon )}\), where \(\varepsilon =\max _k(x_k)\). To learn the optimal weight sets \(\mathbf {W}\)=\(\{W^{(l)}\}\) and \(\mathbf {b}\)=\(\{b^{(l)}\}\), \(\forall {l}=1,2,3,4\), over all training samples, the objective is to minimize the following loss function

$$\begin{aligned} \arg \!\min _{\mathbf {W,b}} ~~~\frac{1}{2N}\sum _{n=1}^{N}{\parallel \hat{Y}^{(5)}_n-Y_n \parallel }^{2}+\frac{\lambda }{2}\sum _{l=1}^{L-1}{\parallel W^{(l)} \parallel }^{2} \end{aligned}$$
(5)

where the second term is a weight decay item that prevents overfitting. In the learning procedure, the weights \(W_m^{(1)}\) and \(W_m^{(2)}\), \(m\in \{v,t\}\), are first learned by intra-modality training. These weights can be regarded as local weights for achieving better individual predictions. \(W^{(3)}\) and \(W^{(4)}\) are then learned globally by fusing the image and text prediction scores. The output of the \(4^{th}\) layer is the fusion feature, which combines the visual and textual predictions.
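For clarity, the NumPy sketch below traces one forward pass through the fusion network of Eqs. (3) and (4) and evaluates the per-sample loss of Eq. (5). It is a minimal illustration under stated assumptions (random weight initialization, the layer sizes reported in Sect. 3, and a K-dimensional fusion input following Eq. (3) literally); it is not the authors' Matlab implementation, and backpropagation is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))          # g^(5) with the max shift for numerical stability
    return e / e.sum()

def init(d_in, d_out, rng):
    return rng.normal(scale=0.01, size=(d_in, d_out)), np.zeros(d_out)

rng = np.random.default_rng(0)
d_i, d_t, d_h, K = 4096, 20, 100, 10   # sizes from Sect. 3: [4096/100/10] and [20/100/10]

# Intra-modal branches (layers 1-3): visual and textual sub-networks.
W1v, b1v = init(d_i, d_h, rng); W2v, b2v = init(d_h, K, rng)
W1t, b1t = init(d_t, d_h, rng); W2t, b2t = init(d_h, K, rng)
# Global fusion layers (layers 4-5). Eq. (3) interpolates two K-d score vectors,
# giving a K-d input; Sect. 3 reports [20/100/10], which would correspond to
# concatenating P_v and P_t instead -- we follow Eq. (3) literally here.
W3, b3 = init(K, d_h, rng); W4, b4 = init(d_h, K, rng)

def forward(I_n, T_n, alpha_v=0.5):
    # Eq. (4): modality-specific prediction scores.
    P_v = sigmoid(sigmoid(I_n @ W1v + b1v) @ W2v + b2v)
    P_t = sigmoid(sigmoid(T_n @ W1t + b1t) @ W2t + b2t)
    # Eq. (3): linear interpolation of the scores, then the global fusion layers.
    Y4 = sigmoid((alpha_v * P_v + (1 - alpha_v) * P_t) @ W3 + b3)  # fusion feature
    Y5 = softmax(Y4 @ W4 + b4)                                     # class prediction
    return Y4, Y5

def loss(Y5, Y_true, weights, lam=1e-4):
    # Eq. (5) for a single sample: squared error plus weight decay.
    return 0.5 * np.sum((Y5 - Y_true) ** 2) + 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

I_n, T_n = rng.normal(size=d_i), rng.dirichlet(np.ones(d_t))
Y_true = np.eye(K)[3]                                 # one-hot ground-truth label
fusion_feature, prediction = forward(I_n, T_n)
print(loss(prediction, Y_true, [W1v, W2v, W1t, W2t, W3, W4]))
```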

3 Implementation

Our experimental configuration is as follows: image feature extraction was performed on Ubuntu 12.04 with an Nvidia GTX 780 GPU (3 GB memory); text model training and feature extraction on Ubuntu 12.04 with an Intel 3.20 GHz \(\times \) 4 CPU and 8 GB RAM; and training of the visual-textual joint model in Matlab on Windows 8 with an Intel 3.20 GHz \(\times \) 4 CPU and 8 GB RAM.

Dataset: Our experiments were conducted on the open benchmark Wikipedia datasetFootnote 1, which contains 2866 documents (2173 for training and 693 for testing). This dataset has 10 semantic categories such as “biology” and “geography”. Each document comprises an image and a short descriptive text, as in the example given in Fig. 1.

Image Representation: We use a deep convolutional neural network (deep CNN) [8], which has proved effective for image representation in recent years. Based on the Caffe framework [7], we extract image features with a Caffe model trained on the ImageNet [4] ILSVRC2012 dataset (more than 1.2M training images). By extracting the output of the \(7^{th}\) layer (F7), each image can be represented as a 4096-dimensional vector, that is, the visual feature \(I \in \mathbb {R}^{4096}\). Since these features are learned by the deep CNN, they can be considered a kind of high-level semantic feature.
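As a rough illustration of this step, a feature-extraction script along the lines of Caffe's standard Python classification example might look as follows. The prototxt/caffemodel file names, the omission of mean subtraction, and the input handling are assumptions; the paper does not state which ILSVRC2012-trained model definition or preprocessing settings were used.

```python
import caffe

caffe.set_mode_gpu()
# Assumed file names following Caffe's reference ImageNet model.
net = caffe.Net('deploy.prototxt', 'bvlc_reference_caffenet.caffemodel', caffe.TEST)

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))      # HxWxC -> CxHxW
transformer.set_raw_scale('data', 255)            # [0, 1] -> [0, 255]
transformer.set_channel_swap('data', (2, 1, 0))   # RGB -> BGR (mean subtraction omitted)

def fc7_feature(image_path):
    """Return the 4096-d activation of layer fc7 (F7) for one image."""
    image = caffe.io.load_image(image_path)
    net.blobs['data'].data[...] = transformer.preprocess('data', image)
    net.forward()
    return net.blobs['fc7'].data[0].copy()         # visual feature I in R^4096

# feature = fc7_feature('example.jpg')
```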

Text Representation: To represent text as a semantic feature, Latent Dirichlet Allocation (LDA) [2] is used to learn 20 topics. We compute the topic distribution of a given text document d over the 20 topics and obtain a 20-dimensional vector, that is, the textual feature \(T \in \mathbb {R}^{20}\).
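The topic features can be obtained with any standard LDA implementation; the gensim-based sketch below is our own stand-in (the paper does not name the LDA toolkit used), with deliberately naive tokenization.

```python
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

def lda_topic_features(train_texts, all_texts, num_topics=20):
    """Train a 20-topic LDA model and map each document to its topic distribution."""
    tokenized = [t.lower().split() for t in train_texts]          # naive tokenization
    dictionary = corpora.Dictionary(tokenized)
    bow_train = [dictionary.doc2bow(tokens) for tokens in tokenized]
    lda = LdaModel(corpus=bow_train, id2word=dictionary, num_topics=num_topics)

    features = []
    for text in all_texts:
        bow = dictionary.doc2bow(text.lower().split())
        topic_dist = lda.get_document_topics(bow, minimum_probability=0.0)
        vec = np.zeros(num_topics)
        for topic_id, prob in topic_dist:
            vec[topic_id] = prob                                  # textual feature T in R^20
        features.append(vec)
    return np.vstack(features)
```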

Fig. 3.

Top: mean squared error on the training and test sets against epochs. Bottom: test accuracy against epochs

Training: In DNN learning, the first three layers are designed for intra-modal regularization, which first optimizes the weights within each modality to improve performance; we therefore name our fusion framework RE-DNN. The intra-modal networks are designed as [4096/100/10] and [20/100/10] for image and text respectively, and the last three layers are set as [20/100/10]. In our experiments, a learning rate of \(\alpha \)=0.001 and momentum=0.9 achieved the best performance. According to the scale of our training data (2173 training samples), we adopted mini-batch gradient descent with a batch size of 41, and the number of epochs was fixed at K=200. Figure 3 shows the mean squared error on the training and test sets during training, as well as the increase of test accuracy against training epochs. The final test accuracy is 74.6 %.
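The optimization settings above amount to a standard mini-batch gradient-descent loop with momentum. The skeleton below shows only the bookkeeping; `forward_backward` is a hypothetical helper standing in for the gradient computation of Eq. (5), not a function from the paper.

```python
import numpy as np

def train(params, data, forward_backward, lr=0.001, momentum=0.9,
          batch_size=41, epochs=200, lam=1e-4):
    """Mini-batch gradient descent with momentum over the 2173 training samples.
    `params` is a dict of weight arrays; `forward_backward(params, batch)` is
    assumed to return (loss, grads) for Eq. (5) -- its implementation is omitted."""
    rng = np.random.default_rng(0)
    velocity = {k: np.zeros_like(v) for k, v in params.items()}
    n = len(data)
    for epoch in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            _, grads = forward_backward(params, batch)
            for k in params:
                # weight decay (second term of Eq. (5)) plus momentum update
                velocity[k] = momentum * velocity[k] - lr * (grads[k] + lam * params[k])
                params[k] += velocity[k]
    return params
```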

Table 1. Comparison between unimodal and multimodal fusion features. Top: visualization of visual features (a(1)), textual features (a(2)), and fusion features (a(3)) from test examples. Bottom: classification results including precision (P), recall (R), F1-score (F1), and accuracy (A).

4 Experimental Evaluation

To validate the effectiveness of the proposed RE-DNN approach for multimodal feature fusion, our experiments first consider unimodal features (visual or textual features separately) for the document categorization task and then compare them with the RE-DNN approach. In this work, we also explore early fusion and late fusion with mainstream classifiers such as K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Naive Bayes (NB), and Neural Network (NN).

Table 1 shows the comparison between classification based on unimodal features and on the multimodal fusion feature. We visualize the visual features \(I\in \mathbb {R}^{4096}\) and textual features \(T\in \mathbb {R}^{20}\) in 2D using t-SNE [15], as shown in a(1) and a(2) respectively. Comparing a(1) and a(2), we find that the class margins of the textual features tend to be clearer. We then apply these features to classification and note that text-based classification outperforms image-based classification for all employed classifiers. This confirms previous findings that text information is easier for machines to perceive and recognize than image information. The best classification accuracy on visual features is achieved by NB with 0.463, and a three-layer neural network (3L-NN) achieves the best accuracy of 0.695 on textual features. The 3L-NN configurations are {4096/100/10} for visual features and {20/100/10} for textual features, with learning rate 0.001 and momentum 0.9.

However, further improvements are obtained by fusing visual and textual features with the deep neural network. This relies on the fact that a paired image and text are perceived as belonging to the same semantics, and the latent relationships between visual and textual features are captured by the network. At this stage, we set \(\alpha _v\)=0.5, meaning the semantic contributions of the two modalities are treated as equal, so that we can observe the capability of RE-DNN in fusing features. Our final classification accuracy is 74.6 %. We extract the late fusion features from the output of the \(4^{th}\) layer in RE-DNN and visualize them in a(3). It is clear that the fusion features tend to be more discriminative than both the textual and visual features. Comparing Table 1 b(1)–b(3), we see that the overall performance of the RE-DNN approach, including precision, recall, F1, and accuracy, is higher than that of unimodal classification. The results show that late fusion based on RE-DNN improves on the approaches “3L-NN for textual” and “NB for visual” by 5.1 % and 28.3 % respectively.
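The 2D visualizations in Table 1 a(1)–a(3) can be reproduced with an off-the-shelf t-SNE implementation; the scikit-learn sketch below is our own stand-in following [15] (the paper does not name a specific toolkit), and the variable names are placeholders.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, labels, title):
    """Project features (N x d) to 2D with t-SNE and color the points by class."""
    embedded = TSNE(n_components=2, random_state=0).fit_transform(features)
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=8, cmap='tab10')
    plt.title(title)
    plt.show()

# plot_tsne(visual_features, y_test, 'a(1) visual (4096-d CNN)')
# plot_tsne(textual_features, y_test, 'a(2) textual (20-d LDA)')
# plot_tsne(fusion_features, y_test, 'a(3) RE-DNN fusion')
```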

Fig. 4.

Visual-textual early and late fusion

Further experiments were conducted to explore visual-textual early fusion and late fusion while taking the semantic contribution of each modality into consideration. In both fusion strategies, according to Eqs. (1) and (2), we heuristically vary \(\alpha _v\) (the image modality weight) from 0 to 1. For early fusion, the inputs are the raw image and text features. For late fusion, the inputs are the prediction scores of the different classifiers. Figure 4(a) shows how the accuracy changes under early fusion and Fig. 4(b) shows the late fusion results. We observe that late fusion outperforms early fusion at most levels of \(\alpha _v\). In early fusion, the accuracy of almost all classifiers decreases as \(\alpha _v\) increases. When we impose linear interpolation on RE-DNN, we note that RE-DNN late fusion performs well at all levels of \(\alpha _v\) and further improves the classification accuracy to 75.3 % at \(\alpha _v\)=0.3, which demonstrates the effectiveness of our approach.
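The sweep in Fig. 4 amounts to evaluating the fusion rules at a grid of \(\alpha _v\) values. A minimal sketch for the late-fusion case (our own illustration; `score_visual` and `score_textual` are the N x K score matrices of a classifier applied to each modality) is shown below.

```python
import numpy as np

def sweep_alpha(score_visual, score_textual, y_true, step=0.1):
    """Evaluate late-fusion (Eq. (2)) accuracy for alpha_v = 0, 0.1, ..., 1."""
    results = {}
    for alpha_v in np.arange(0.0, 1.0 + 1e-9, step):
        fused = alpha_v * score_visual + (1.0 - alpha_v) * score_textual
        accuracy = float(np.mean(np.argmax(fused, axis=1) == y_true))
        results[round(float(alpha_v), 1)] = accuracy
    return results
```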

5 Conclusions

In this paper, we have proposed a DNN framework for fusing visual and textual features. By imposing linear interpolation on the DNN, more discriminative and representative fusion features can be extracted. Our experiments on document categorization show that the proposed approach outperforms mainstream classifiers with both early fusion and late fusion.