
1 Introduction

Over the past decade, the tremendous increase in multimedia data (e.g., images, audio, and video) has brought new challenges to information processing. Traditional approaches to representation learning, classification, and retrieval usually focus on a single modality. However, in reality, we receive data from different information channels, and one of the most common scenarios is the image-text paired document. It is worth noting that different data modalities carry different information at different semantic levels. Fig. 1 shows an example document that contains an image and a loosely related descriptive text. If the image and text information can be semantically fused, more expressive and representative features can be learned for this document, which in turn improves multimodal document classification accuracy. Realizing the importance of multimodal information, in this work we propose to address this problem by fusing visual and textual information with a deep neural network.

Joint modeling of multiple modalities is an open problem, as it requires bridging the “semantic gap” across modalities. The procedure for multimodal data modeling generally falls into two stages: (1) modality representation and (2) correlation learning. For image representation, one popular approach is to represent images as “bag-of-visual-words” (BOVW) using the scale-invariant feature transform (SIFT) [10] or Dense SIFT [16] descriptor. For text representation, text is commonly represented by topic features derived from Latent Dirichlet Allocation [2]. Recently, many approaches have been proposed to explore the correlation between different modalities, including Canonical Correlation Analysis (CCA) [14], Semantic Correlation Match (SCM) [12], and Cross-Modal Topic Correlations (CMTC) [19].

Fig. 1.

An example document paired with an image and a descriptive text

Unfortunately, the problem of fusing and combining different modalities has rarely been discussed for multimedia data classification. In this paper, from a different perspective, we focus on the multimodal fusion problem [1]. In [3], Clinchant et al. proposed a semantic combination approach for late fusion and image re-ranking in multimedia retrieval. Liu et al. [9] proposed the Sample Specific Late Fusion (SSLF) method, which learns optimal sample-specific fusion weights and enforces that positive samples receive the highest fusion scores. In [17], a deep neural network proved effective at fusing video keyframe and audio information for video classification. These works demonstrate the powerful capability of late fusion in areas such as video analysis [18], image retrieval [5], and object recognition [13].

Our work is distinguished from previous works in two aspects. First, we investigate deep convolutional neural network (CNN) features as the image representation, motivated by the recent success of deep CNN features in addressing various research questions such as speech recognition [6], image classification [8], and multimodal learning [11]. Compared to the commonly used SIFT features, we show that deep CNN features are more robust and representative for multi-modality fusion. Second, we propose to use a deep neural network (DNN) to capture the highly non-linear dependency between different modalities; in addition, late fusion with a linear interpolation rule is adopted to capture the semantic contribution of image and text. Our contributions can be summarized as follows:

  1. We propose to represent images and text as higher-level features using deep CNN features and topic features, respectively.

  2. We propose a novel approach to learn a visual-textual fusion feature, which serves as a unified representation for document categorization.

  3. Extensive experiments and discussion are provided to show the effectiveness of DNN-based late semantic fusion.

The remainder of this paper is organized as follows. Section 2 presents the proposed approach for late semantic fusion, and Sect. 3 describes our implementation details. Section 4 presents the experimental evaluation, which illustrates the effectiveness of DNN-based late semantic fusion. Section 5 concludes this work.

2 DNN Late Semantic Fusion

This section introduces the proposed DNN late semantic fusion. Given a set of N documents \(S=\{D_n\}, \forall {n}=1,2,...,N\), where each \(D_n\) is an image-text paired document, we extract a deep CNN feature and a topic feature for each document \(D_n\). At the feature level, document \(D_n\) can then be represented as \(D_n=\{I_n,T_n\}\), \(I_n\in \mathbb {R}^{d_i}\), \(T_n\in \mathbb {R}^{d_t}\), where \(d_i\) and \(d_t\) are the dimensionalities of the visual and textual features respectively. Traditionally, the visual feature \(I_n\) and the textual feature \(T_n\) can be combined at the feature level, which is called early fusion and is formulated as

$$\begin{aligned} F_{(n)}=\alpha _v f_{r}(I_n)+(1-\alpha _v)\,f_{r}(T_n), ~\forall {n=1,2,..,N},r \in [0,1] \end{aligned}$$
(1)

where \(F_{(n)}\) is the early fused feature used to represent document \(D_n\) and \(f_{r}(\cdot )\) is a normalization operator. \(\alpha _v\) \((0\le \alpha _v\le 1)\) and \(1-\alpha _v\) denote the fusion weights of the visual and textual features respectively. Another common approach, depicted in Eq. (2) and called late fusion, performs fusion at the decision level by combining the prediction scores of M pre-trained classifiers \(C_m\).

$$\begin{aligned} P_{(n)}=\alpha _v C_m(I_n)+(1-\alpha _v)\,C_m(T_n),\forall {n=1,..,N},\forall {m=1,..,M} \end{aligned}$$
(2)
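To make the two baseline strategies concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) implements Eqs. (1) and (2). Since \(d_i\ne d_t\) in general, the early-fusion example combines the weighted, normalized vectors by concatenation, which is one common reading of Eq. (1); the min-max normalizer, toy dimensions, and random inputs are assumptions made only for this example.

```python
import numpy as np

def min_max_normalize(x):
    """Normalization operator f_r(.): rescale a feature vector to [0, 1]."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo + 1e-12)

def early_fusion(I_n, T_n, alpha_v=0.5):
    """Eq. (1): weight the normalized visual/textual features and combine them.
    The weighted vectors are concatenated here because d_i != d_t in general."""
    return np.concatenate([alpha_v * min_max_normalize(I_n),
                           (1.0 - alpha_v) * min_max_normalize(T_n)])

def late_fusion(score_visual, score_textual, alpha_v=0.5):
    """Eq. (2): combine the class-score vectors of two pre-trained classifiers."""
    return alpha_v * score_visual + (1.0 - alpha_v) * score_textual

# Toy example: a 4096-d CNN feature, a 20-d topic feature, 10 classes.
rng = np.random.default_rng(0)
I_n, T_n = rng.normal(size=4096), rng.dirichlet(np.ones(20))
fused_feature = early_fusion(I_n, T_n, alpha_v=0.3)        # fed to a single classifier
fused_scores = late_fusion(rng.dirichlet(np.ones(10)),     # e.g. classifier scores on I_n
                           rng.dirichlet(np.ones(10)),     # e.g. classifier scores on T_n
                           alpha_v=0.3)
predicted_class = int(np.argmax(fused_scores))
```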
Fig. 2.

DNN late fusion framework. Red nodes are visual (image) feature inputs and blue nodes are textual (text) feature inputs. The visual-textual fusion feature can be extracted from the output of the \(4^{th}\) layer (Color figure online).

In both approaches, \(\alpha _v\) is usually assigned empirically to reflect the importance of an individual feature or classifier. Unfortunately, neither fusion strategy takes the correlation between visual and textual features into account. A good fusion approach should consider the underlying shared semantics across modalities and take advantage of their complementarity. To address this problem, besides heuristically assigning \(\alpha _v\) between 0 and 1 to capture the semantic contribution of each modality, we also learn latent fusion weights using a deep neural network to capture the relationships between image and text. To achieve this goal, we propose the DNN fusion architecture shown in Fig. 2. For a given training sample \(\{I_n, T_n, Y_n\}\), where \(I_n\) and \(T_n\) are the input image and text features respectively and \(Y_n\) is the ground-truth category label, the final output of the global network can be represented as

$$\begin{aligned} {\left\{ \begin{array}{ll} \hat{Y}^{(5)}=g^{(5)}(\hat{Y}^{(4)}W^{(4)}+b^{(4)})\\ \hat{Y}^{(4)}=f^{(4)}((\alpha _vP_v+(1-\alpha _v)P_t)W^{(3)}+b^{(3)}) \end{array}\right. } \end{aligned}$$
(3)

where \(\hat{Y}^{(l)}\) is the output of the \(l^{th}\) layer and \(W^{(l)}\) denotes the weights connecting to the \((l-1)^{th}\) layer (see also Fig. 2). \(g(\cdot )\) and \(f(\cdot )\) are activation functions and \(b^{(l)}\) is the bias term of the \(l^{th}\) layer. \(P_v\) and \(P_t\) are prediction scores computed from the input features \(I_n\) and \(T_n\) by

$$\begin{aligned} {\left\{ \begin{array}{ll} P_v=f^{(3)}[f^{(2)}(I_nW^{(1)}_v+b^{(1)}_v)W^{(2)}_v+b^{(2)}_v]\\ P_t=f^{(3)}[f^{(2)}(T_nW^{(1)}_t+b^{(1)}_t)W^{(2)}_t+b^{(2)}_t] \end{array}\right. } \end{aligned}$$
(4)

We utilize the sigmoid function \(f^{(2)}(x)\)=\(f^{(3)}(x)\)=\(f^{(4)}(x)\)=\(\frac{1}{1+e^{-x}}\) and the softmax function \(g^{(5)}(x)\)=\(e^{(x-\varepsilon )}/\sum _{k=1}^{K}e^{(x_k-\varepsilon )}\), where \(\varepsilon =\max _k(x_k)\). To learn the optimal weight sets \(\mathbf {W}\)=\(\{W^{(l)}\}\) and \(\mathbf {b}\)=\(\{b^{(l)}\}\), \(\forall {l}=1,2,3,4\), over all training samples, the objective is to minimize the following loss function

$$\begin{aligned} \arg \!\min _{\mathbf {W,b}} ~~~\frac{1}{2N}\sum _{n=1}^{N}{\parallel \hat{Y}^{(5)}_n-Y_n \parallel }^{2}+\frac{\lambda }{2}\sum _{l=1}^{L-1}{\parallel W^{(l)} \parallel }^{2} \end{aligned}$$
(5)

where the second term is a weight decay item that prevents overfitting. In the learning procedure, the weights \(W_m^{(1)}\) and \(W_m^{(2)}\), \(m\in \{v,t\}\), are first learned by intra-modality training. These weights can be regarded as local weights for achieving better individual predictions. \(W^{(3)}\) and \(W^{(4)}\) are then learned globally by fusing the image and text prediction scores. The output of the \(4^{th}\) layer is the fusion feature, which combines the visual and textual predictions.
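For clarity, the NumPy sketch below traces one forward pass through the fusion network of Eqs. (3) and (4) and evaluates the per-sample loss of Eq. (5). It is a minimal illustration under stated assumptions (random weight initialization, the layer sizes reported in Sect. 3, and a K-dimensional fusion input following Eq. (3) literally); it is not the authors' Matlab implementation, and backpropagation is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))          # g^(5) with the max shift for numerical stability
    return e / e.sum()

def init(d_in, d_out, rng):
    return rng.normal(scale=0.01, size=(d_in, d_out)), np.zeros(d_out)

rng = np.random.default_rng(0)
d_i, d_t, d_h, K = 4096, 20, 100, 10   # sizes from Sect. 3: [4096/100/10] and [20/100/10]

# Intra-modal branches (layers 1-3): visual and textual sub-networks.
W1v, b1v = init(d_i, d_h, rng); W2v, b2v = init(d_h, K, rng)
W1t, b1t = init(d_t, d_h, rng); W2t, b2t = init(d_h, K, rng)
# Global fusion layers (layers 4-5). Eq. (3) interpolates two K-d score vectors,
# giving a K-d input; Sect. 3 reports [20/100/10], which would correspond to
# concatenating P_v and P_t instead -- we follow Eq. (3) literally here.
W3, b3 = init(K, d_h, rng); W4, b4 = init(d_h, K, rng)

def forward(I_n, T_n, alpha_v=0.5):
    # Eq. (4): modality-specific prediction scores.
    P_v = sigmoid(sigmoid(I_n @ W1v + b1v) @ W2v + b2v)
    P_t = sigmoid(sigmoid(T_n @ W1t + b1t) @ W2t + b2t)
    # Eq. (3): linear interpolation of the scores, then the global fusion layers.
    Y4 = sigmoid((alpha_v * P_v + (1 - alpha_v) * P_t) @ W3 + b3)  # fusion feature
    Y5 = softmax(Y4 @ W4 + b4)                                     # class prediction
    return Y4, Y5

def loss(Y5, Y_true, weights, lam=1e-4):
    # Eq. (5) for a single sample: squared error plus weight decay.
    return 0.5 * np.sum((Y5 - Y_true) ** 2) + 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

I_n, T_n = rng.normal(size=d_i), rng.dirichlet(np.ones(d_t))
Y_true = np.eye(K)[3]                                 # one-hot ground-truth label
fusion_feature, prediction = forward(I_n, T_n)
print(loss(prediction, Y_true, [W1v, W2v, W1t, W2t, W3, W4]))
```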

3 Implementation

Our experimental configuration is as follows: image feature extraction was performed on Ubuntu 12.04 with an Nvidia GTX 780 GPU (3 GB memory); text model training and feature extraction on Ubuntu 12.04 with an Intel 3.20 GHz \(\times \) 4 CPU and 8 GB RAM; and training of the visual-textual joint model in Matlab on Windows 8 with an Intel 3.20 GHz \(\times \) 4 CPU and 8 GB RAM.

Dataset: Our experiments were conducted on the open benchmark Wikipedia datasetFootnote 1, which contains 2866 documents (2173 for training and 693 for testing). This dataset has 10 semantic categories such as “biology” and “geography”. Each document comprises an image and a short descriptive text, as in the example given in Fig. 1.

Image Representation: We use a deep convolutional neural network (deep CNN) [8], which has proved effective for image representation in recent years. Based on the Caffe framework [7], we extract image features with a Caffe model trained on the ImageNet [4] ILSVRC2012 dataset (more than 1.2M training images). By extracting the output of the \(7^{th}\) layer (F7), each image can be represented as a 4096-dimensional vector, that is, the visual feature \(I \in \mathbb {R}^{4096}\). Since these features are learned by the deep CNN, they can be considered a kind of high-level semantic feature.
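As a rough illustration of this step, a feature-extraction script along the lines of Caffe's standard Python classification example might look as follows. The prototxt/caffemodel file names, the omission of mean subtraction, and the input handling are assumptions; the paper does not state which ILSVRC2012-trained model definition or preprocessing settings were used.

```python
import caffe

caffe.set_mode_gpu()
# Assumed file names following Caffe's reference ImageNet model.
net = caffe.Net('deploy.prototxt', 'bvlc_reference_caffenet.caffemodel', caffe.TEST)

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))      # HxWxC -> CxHxW
transformer.set_raw_scale('data', 255)            # [0, 1] -> [0, 255]
transformer.set_channel_swap('data', (2, 1, 0))   # RGB -> BGR (mean subtraction omitted)

def fc7_feature(image_path):
    """Return the 4096-d activation of layer fc7 (F7) for one image."""
    image = caffe.io.load_image(image_path)
    net.blobs['data'].data[...] = transformer.preprocess('data', image)
    net.forward()
    return net.blobs['fc7'].data[0].copy()         # visual feature I in R^4096

# feature = fc7_feature('example.jpg')
```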

Text Representation: To represent text as a semantic feature, Latent Dirichlet Allocation (LDA) [2] is used to learn 20 topics. We compute the topic distribution of a given text document d over the 20 topics and obtain a 20-dimensional vector, that is, the textual feature \(T \in \mathbb {R}^{20}\).
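The topic features can be obtained with any standard LDA implementation; the gensim-based sketch below is our own stand-in (the paper does not name the LDA toolkit used), with deliberately naive tokenization.

```python
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

def lda_topic_features(train_texts, all_texts, num_topics=20):
    """Train a 20-topic LDA model and map each document to its topic distribution."""
    tokenized = [t.lower().split() for t in train_texts]          # naive tokenization
    dictionary = corpora.Dictionary(tokenized)
    bow_train = [dictionary.doc2bow(tokens) for tokens in tokenized]
    lda = LdaModel(corpus=bow_train, id2word=dictionary, num_topics=num_topics)

    features = []
    for text in all_texts:
        bow = dictionary.doc2bow(text.lower().split())
        topic_dist = lda.get_document_topics(bow, minimum_probability=0.0)
        vec = np.zeros(num_topics)
        for topic_id, prob in topic_dist:
            vec[topic_id] = prob                                  # textual feature T in R^20
        features.append(vec)
    return np.vstack(features)
```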

Fig. 3.

Top: mean squared error on the training and test sets against epochs. Bottom: test accuracy against epochs

Training: In DNN learning, the first three layers are designed for intra-modal regularization, which first optimizes the weights within each modality to improve performance; we therefore name our fusion framework RE-DNN. The intra-modal networks are designed as [4096/100/10] and [20/100/10] for image and text respectively, and the last three layers are set as [20/100/10]. In our experiments, a learning rate of \(\alpha \)=0.001 and momentum=0.9 achieved the best performance. According to the scale of our training data (2173 training samples), we adopted mini-batch gradient descent with a batch size of 41, and the number of epochs was fixed at K=200. Figure 3 shows the mean squared error on the training and test sets during training, as well as the increase of test accuracy against training epochs. The final test accuracy is 74.6 %.
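The optimization settings above amount to a standard mini-batch gradient-descent loop with momentum. The skeleton below shows only the bookkeeping; `forward_backward` is a hypothetical helper standing in for the gradient computation of Eq. (5), not a function from the paper.

```python
import numpy as np

def train(params, data, forward_backward, lr=0.001, momentum=0.9,
          batch_size=41, epochs=200, lam=1e-4):
    """Mini-batch gradient descent with momentum over the 2173 training samples.
    `params` is a dict of weight arrays; `forward_backward(params, batch)` is
    assumed to return (loss, grads) for Eq. (5) -- its implementation is omitted."""
    rng = np.random.default_rng(0)
    velocity = {k: np.zeros_like(v) for k, v in params.items()}
    n = len(data)
    for epoch in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            _, grads = forward_backward(params, batch)
            for k in params:
                # weight decay (second term of Eq. (5)) plus momentum update
                velocity[k] = momentum * velocity[k] - lr * (grads[k] + lam * params[k])
                params[k] += velocity[k]
    return params
```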

Table 1. Comparison between unimodal and multimodal fusion features. Top: visualization of visual features (a(1)), textual features (a(2)), and fusion features (a(3)) from test examples. Bottom: classification results including precision (P), recall (R), F1-score (F1), and accuracy (A).

4 Experimental Evaluation

To validate the effectiveness of the proposed RE-DNN approach for multimodal feature fusion, our experiments first consider unimodal features (visual or textual features separately) for the document categorization task and then compare them with the RE-DNN approach. In this work, we also explore early fusion and late fusion with mainstream classifiers such as K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Naive Bayes (NB), and Neural Network (NN).

Table 1 shows the comparison between classification based on unimodal features and on the multimodal fusion feature. We visualize the visual features \(I\in \mathbb {R}^{4096}\) and textual features \(T\in \mathbb {R}^{20}\) in 2D using t-SNE [15], as shown in a(1) and a(2) respectively. Comparing a(1) and a(2), we find that the class margins of the textual features tend to be clearer. We then apply these features to classification and note that text-based classification outperforms image-based classification for all employed classifiers. This confirms previous findings that text information is easier for machines to perceive and recognize than image information. The best classification accuracy on visual features is achieved by NB with 0.463, and a three-layer neural network (3L-NN) achieves the best accuracy of 0.695 on textual features. The 3L-NN configurations are {4096/100/10} for visual features and {20/100/10} for textual features, with learning rate 0.001 and momentum 0.9.

However, further improvements are obtained by fusing visual and textual features with the deep neural network. This relies on the fact that a paired image and text are perceived as belonging to the same semantics, and the latent relationships between visual and textual features are captured by the network. At this stage, we set \(\alpha _v\)=0.5, meaning the semantic contributions of the two modalities are treated as equal, so that we can observe the capability of RE-DNN in fusing features. Our final classification accuracy is 74.6 %. We extract the late fusion features from the output of the \(4^{th}\) layer in RE-DNN and visualize them in a(3). It is clear that the fusion features tend to be more discriminative than both the textual and visual features. Comparing Table 1 b(1)–b(3), we see that the overall performance of the RE-DNN approach, including precision, recall, F1, and accuracy, is higher than that of unimodal classification. The results show that late fusion based on RE-DNN improves on the approaches “3L-NN for textual” and “NB for visual” by 5.1 % and 28.3 % respectively.
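The 2D visualizations in Table 1 a(1)–a(3) can be reproduced with an off-the-shelf t-SNE implementation; the scikit-learn sketch below is our own stand-in following [15] (the paper does not name a specific toolkit), and the variable names are placeholders.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, labels, title):
    """Project features (N x d) to 2D with t-SNE and color the points by class."""
    embedded = TSNE(n_components=2, random_state=0).fit_transform(features)
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=8, cmap='tab10')
    plt.title(title)
    plt.show()

# plot_tsne(visual_features, y_test, 'a(1) visual (4096-d CNN)')
# plot_tsne(textual_features, y_test, 'a(2) textual (20-d LDA)')
# plot_tsne(fusion_features, y_test, 'a(3) RE-DNN fusion')
```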

Fig. 4.

Visual-textual early and late fusion

Further experiments were conducted to explore visual-textual early fusion and late fusion while taking the semantic contribution of each modality into consideration. In both fusion strategies, according to Eqs. (1) and (2), we heuristically vary \(\alpha _v\) (the image modality weight) from 0 to 1. For early fusion, the inputs are the raw image and text features. For late fusion, the inputs are the prediction scores of the different classifiers. Figure 4(a) shows how the accuracy changes under early fusion and Fig. 4(b) shows the late fusion results. We observe that late fusion outperforms early fusion at most levels of \(\alpha _v\). In early fusion, the accuracy of almost all classifiers decreases as \(\alpha _v\) increases. When we impose linear interpolation on RE-DNN, we note that RE-DNN late fusion performs well at all levels of \(\alpha _v\) and further improves the classification accuracy to 75.3 % at \(\alpha _v\)=0.3, which demonstrates the effectiveness of our approach.
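The sweep in Fig. 4 amounts to evaluating the fusion rules at a grid of \(\alpha _v\) values. A minimal sketch for the late-fusion case (our own illustration; `score_visual` and `score_textual` are the N x K score matrices of a classifier applied to each modality) is shown below.

```python
import numpy as np

def sweep_alpha(score_visual, score_textual, y_true, step=0.1):
    """Evaluate late-fusion (Eq. (2)) accuracy for alpha_v = 0, 0.1, ..., 1."""
    results = {}
    for alpha_v in np.arange(0.0, 1.0 + 1e-9, step):
        fused = alpha_v * score_visual + (1.0 - alpha_v) * score_textual
        accuracy = float(np.mean(np.argmax(fused, axis=1) == y_true))
        results[round(float(alpha_v), 1)] = accuracy
    return results
```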

5 Conclusions

In this paper, we have proposed a DNN framework for fusing visual and textual features. By imposing linear interpolation on the DNN, more discriminative and representative fusion features can be extracted. Our experiments on document categorization show that the proposed approach outperforms mainstream classifiers with both early fusion and late fusion.