1 Introduction

Sentence classification is the task of assigning sentences to predefined categories and has been widely explored over the past decades. It requires modeling, representing, and mining a degree of semantic comprehension, based mainly on the structure or sentiment of sentences. This task is important for many practical applications, such as product recommendation [5], public opinion detection [24], and human-machine interaction [3].

Recently, deep learning has achieved state-of-the-art results across a range of Computer Vision (CV) [15], Speech Recognition [7], and Natural Language Processing (NLP) [11] tasks. In particular, Convolutional Neural Networks (CNNs) have achieved great success in sentence modeling. However, training deep models requires a great diversity of data so that more discriminative patterns can be mined for better prediction. Most existing work on sentence classification focuses on learning a better representation for a sentence given limited training data (i.e., the source language), resorting to sophisticated network architectures or learning paradigms such as attention models [31], multi-task learning [20], and adversarial training [19]. Inspired by recent advances in Machine Translation (MT) [30], we instead perform input data augmentation by exploiting multilingual data (i.e., the target language) generated by machine translation for sentence classification tasks. The generated data serve as auxiliary information and provide additional knowledge for learning a robust sentence representation. To exploit such multilingual data effectively, we further propose a novel deep consensus learning framework that mines their language-share and language-specific knowledge for sentence classification. It is worth noting that, since the machine translation model is pre-trained off-the-shelf with good generalization ability, we do not introduce any additional language data beyond what existing methods use in either the training or the testing phase.

Our main contributions are two-fold: 1) We are the first to propose utilizing multilingual data augmentation to assist sentence classification, which provides beneficial auxiliary knowledge for sentence modeling; 2) A novel deep consensus learning framework is constructed to fuse multilingual data and learn their language-share and language-specific knowledge for sentence classification. In this work, we use English as the source language and Chinese/Dutch as the target language, obtained from an English-Chinese/Dutch translator. The experimental results show that our model achieves very promising performance on several sentence classification tasks.

2 Related Work

2.1 Sentence Classification

Sentence classification is a well-studied research area in NLP, and various approaches have been proposed over the last few decades [6, 29]. Among them, Deep Neural Network (DNN) based models have shown very good results on several NLP tasks, and such methods have become increasingly popular for sentence classification. Various neural networks have been proposed to learn better sentence representations for classification. An influential one is the work of [13], where a simple Convolutional Neural Network (CNN) with a single convolution layer was used for feature extraction. Following this work, Zhang et al. [36] used CNNs for text classification with character-level features provided by a fully connected DNN. Liu et al. [20] used a multi-task learning framework to learn multiple related tasks together for sentence classification. Based on Recurrent Neural Networks (RNNs), they utilized three different mechanisms of sharing information to model text; in practice, they used a Long Short-Term Memory network (LSTM) to address the issue of learning long-term dependencies. Lai et al. [16] proposed a Recurrent Convolutional Neural Network (RCNN) model for text classification, which applied a recurrent structure to capture contextual information and employed a max-pooling layer to capture the key components in texts. Jiang et al. [10] proposed a text classification model based on a deep belief network and softmax regression: the deep belief network solves the sparse high-dimensional matrix computation problem of text data, and softmax regression then classifies the text. Yang et al. [31] used a Hierarchical Attention Network (HAN) for document classification, where a hierarchical structure mirrors the hierarchical structure of documents and two levels of attention mechanisms are applied at the word and sentence levels.

Another line of work on sentence classification uses more effective learning paradigms. Yogatama et al. [33] combined Generative Adversarial Networks (GANs) with RNNs for text classification. Billal et al. [1] solved the problem of multi-label text classification in a semi-supervised manner. Liu et al. [19] proposed a multi-task adversarial representation learning method for text classification. Zhang et al. [35] attempted to learn structured representations of text via deep reinforcement learning: they learned sentence representations by discovering optimized structures automatically, and presented two models, Information Distilled LSTM (ID-LSTM) and Hierarchically Structured LSTM (HS-LSTM), to build structured representations.

However, these methods do not take into account the auxiliary language information corresponding to the source language. Such an auxiliary language can provide additional knowledge for learning a more accurate sentence representation.

2.2 Deep Consensus Learning

Existing sentence classification works [1, 10, 13, 16, 33, 35, 36] mainly focus on feature representation or on learning a structured representation [35]. Deep learning based sentence classification models have obtained impressive performance, largely due to the powerful automatic learning and representation capacities of deep models, which benefit from big labelled training data and the establishment of large-scale sentence/document datasets [1, 33, 35]. However, all existing methods usually consider only one type of language information through a standard single-language process. Such methods not only ignore the potentially useful information of other languages, but also lose the opportunity to mine the correlated, complementary advantages across different languages. A related model is [20], which used synthetic source sentences to improve the performance of Neural Machine Translation (NMT). While sharing the high-level spirit of multilingual feature learning, the proposed consensus learning model has three distinguishing characteristics. (1) Beyond fusion by language concatenation, our model uniquely considers synergistic cross-language interaction learning and regularization by consensus propagation, which aims to overcome the challenge of learning discrepancy in multilingual feature optimization. (2) Instead of the traditional single-loss design, our model deploys a multi-loss concurrent supervision mechanism, which enforces and improves the model's ability to learn language-specific features. (3) Through NMT, we can eliminate some ambiguous words and highlight some key words.

3 Methodology

We aim to learn a deep feature representation model for sentence classification based on language-specific inputs, without any specific feature transformation. Figure 1 depicts our proposed framework, which consists of two stages: the first stage performs multilingual data augmentation with an off-the-shelf machine translator, and the second feeds the source language data and the generated target language data to our deep consensus learning model for sentence classification.

Fig. 1. The framework of our proposed model for sentence classification.

3.1 Multilingual Data Augmentation

Data augmentation is an important technique in machine learning that allows building better models. It has been successfully used for many tasks in CV and NLP, such as image recognition [15] and MT [35]. In MT, back-translation is a common data augmentation method [25, 39], which allows monolingual training data to be incorporated. Data augmentation is especially useful when the existing data are insufficient to learn a discriminative representation for a specific task.

In sentence classification, given an input sentence in one language, we perform data augmentation by translating the sentence into another language using existing machine translation methods. We call the input language the source language and the translated language the target language. This is motivated by the recent great advances in NMT [30]. Given an input sentence in the source language, we simply call the Google Translation API to obtain the translated data in the target language. Compared to other state-of-the-art NMT models, the Google translator offers both effectiveness and efficiency in real application scenarios. Since the target language is used only for multilingual data augmentation and its specific type is not important to the proposed model, we choose Chinese and Dutch respectively as target languages, while the source language depends on the language of the input sentence.
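As a minimal sketch of this augmentation step, the `translate` helper below is a hypothetical stand-in for the Google Translation API call (the paper does not give client code), and `augment_corpus` is an illustrative name:

```python
# Hypothetical stand-in for an off-the-shelf MT service such as the
# Google Translation API; the paper does not specify client code.
def translate(sentence: str, src: str, dest: str) -> str:
    raise NotImplementedError("plug in a machine translation service here")

def augment_corpus(sentences, src="en", dest="zh"):
    """Pair every source-language sentence with its target-language translation."""
    return [(s, translate(s, src=src, dest=dest)) for s in sentences]

# e.g. pairs = augment_corpus(["the movie was great"], src="en", dest="zh")
```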

3.2 Deep Consensus Learning Model

Learning a consensus classification model that combines several sources of beneficial information into one final prediction can lead to more accurate results [2]. We therefore use data in two languages, \(\left\{ S_1, S_2, S_3, \cdots , S_{N-1}, S_N\right\} \) and \(\left\{ T_1, T_2, T_3, \cdots , T_{N-1}, T_N\right\} \), to perform consensus learning for sentence classification. As shown in Fig. 1, our model has three parts: (1) two branches of language-specific subnetworks that learn the most discriminative features for each language; (2) one fusion branch responsible for learning the language-share representation through the optimal integration of the two kinds of language-specific knowledge; and (3) consensus propagation for feature regularization and learning optimization. The architecture components are described in detail below.

Language-Specific Network. We utilize the TextCNN architecture [13] for each language-specific branch, which has proven very effective for sentence classification. TextCNN can be divided into two stages: convolution layers for feature learning, followed by fully connected layers for classification. Given the training labels of the input sentences, the Softmax classification loss is used to optimize category discrimination. Formally, given a corpus of source-language sentences \(\left\{ S_1, S_2, S_3, \cdots , S_{N-1}, S_N\right\} \), the training loss on a batch of n sentences is computed as:

$$\begin{aligned} L_{S\_brch}= -\frac{1}{n}\sum _{i=1}^n\log {\left( \frac{\exp {\left( w_{y_i}^TS_i\right) }}{\sum _{k=1}^c\exp {\left( w_k^TS_i\right) }}\right) } \end{aligned}$$
(1)

where c is the number of sentence categories; \(y_i\) denotes the category label of sentence \(S_i\); and \(w_k\) is the prediction function parameter for category k. The training loss for the target language branch, \(L_{T\_brch}\), is computed in the same manner. Meanwhile, since the source language and the target language belong to different language spaces, the two language-specific branches are trained with the same architecture but different parameters.
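To make the branch design concrete, the following is a minimal PyTorch sketch of one language-specific TextCNN branch; the embedding size, filter widths, and filter counts are illustrative assumptions, not the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageSpecificBranch(nn.Module):
    """One TextCNN branch [13]: convolution layers for feature learning,
    then a fully connected classifier. Hyperparameters are assumptions."""
    def __init__(self, vocab_size, embed_dim=300, num_filters=100,
                 filter_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in filter_sizes)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Max-over-time pooling for each filter width, then concatenation;
        # `feats` is the vector later fed to the fusion branch.
        feats = torch.cat(
            [F.relu(conv(x)).max(dim=2).values for conv in self.convs], dim=1)
        return feats, self.fc(feats)            # features and class logits

# The branch loss of Eq. (1) is then the standard cross entropy:
# loss_s_brch = F.cross_entropy(logits, labels)
```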

Language-Share Network. We perform language-share feature learning on top of the two language-specific branches by fusing their features. For design simplicity and cost efficiency, we fuse the feature vectors from the concatenation layer before dropout in TextCNN through the operation Concat\(\rightarrow\)FC\(\rightarrow\)Dropout\(\rightarrow\)FC\(\rightarrow\)Softmax. This produces a category prediction score for the input pair (a sentence in the source language and its translation in the target language). We similarly utilize a Softmax classification loss \(L_{ST}\) for the language-share classification learning, as in the language-specific branches.
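A minimal sketch of this fusion head follows; the hidden width, dropout rate, and the ReLU nonlinearity between the two FC layers are assumptions on top of the Concat\(\rightarrow\)FC\(\rightarrow\)Dropout\(\rightarrow\)FC\(\rightarrow\)Softmax design stated above:

```python
import torch
import torch.nn as nn

class LanguageShareHead(nn.Module):
    """Fuses the two branches' concatenation-layer features:
    Concat -> FC -> Dropout -> FC (softmax is applied in the loss)."""
    def __init__(self, feat_dim, num_classes, hidden=256, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),
            nn.ReLU(),                  # nonlinearity assumed, not stated
            nn.Dropout(p_drop),
            nn.Linear(hidden, num_classes))

    def forward(self, feat_src, feat_tgt):
        # Category prediction score for the (source, translated) sentence pair.
        return self.net(torch.cat([feat_src, feat_tgt], dim=1))

# p_ST = softmax of these logits; L_ST is again a softmax cross-entropy loss.
```

Here `feat_dim` would be the per-branch feature size (num_filters × number of filter widths in the branch sketch above).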

Consensus Propagation. Inspired by the teacher-student learning approach, we propose to regularize the language-specific learning with consensus feedback from the language-share network. More specifically, we utilize the consensus probability \(P_{ST} = \left[ p_{ST}^1, p_{ST}^2, \cdots , p_{ST}^{c-1}, p_{ST}^c \right]\) from the language-share network as the teacher signal (a "soft label", versus the ground-truth one-hot "hard label") to guide the learning of all language-specific branches (students) concurrently through an additional regularization, formulated in a cross-entropy manner as:

$$\begin{aligned} \mathcal {H}_S=-\frac{1}{c}\sum _{i=1}^c\left( p_{ST}^i\ln {\left( p_S^i\right) }+\left( 1-p_{ST}^i\right) \ln {\left( 1-p_S^i\right) }\right) \end{aligned}$$
(2)

where \(P_S=[p_S^1,p_S^2,p_S^3,\cdots ,p_S^{c-1},p_S^c]\) denotes the probability predictions over all c sentence classes by the source language branch. The final loss function for the language-specific network is then re-defined by adding this regularization term to Eq. (1):

$$\begin{aligned} L_S= L_{S\_brch}+ \lambda \mathcal {H}_S \end{aligned}$$
(3)

where \(\lambda \) controls the importance tradeoff between the two terms. The regularization term \(\mathcal {H}_T\) and the loss \(L_T\) for the target language branch are computed in the same way.
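The consensus terms of Eqs. (2) and (3) can be sketched as follows; detaching the teacher probabilities and the value \(\lambda = 0.5\) are our assumptions, since the paper only describes \(\lambda\) as a tradeoff weight:

```python
import torch
import torch.nn.functional as F

def consensus_regularizer(p_st, p_branch, eps=1e-8):
    """Eq. (2): per-class binary cross entropy between the language-share
    (teacher) probabilities p_st and a branch's (student) probabilities."""
    c = p_st.size(1)
    h = -(p_st * torch.log(p_branch + eps)
          + (1.0 - p_st) * torch.log(1.0 - p_branch + eps)).sum(dim=1) / c
    return h.mean()

def branch_loss(branch_logits, p_st, labels, lam=0.5):
    """Eq. (3): branch softmax loss plus the lambda-weighted consensus term."""
    p_branch = F.softmax(branch_logits, dim=1)
    # Detaching p_st treats the teacher signal as fixed for this term
    # (an implementation assumption, following teacher-student practice).
    return F.cross_entropy(branch_logits, labels) \
        + lam * consensus_regularizer(p_st.detach(), p_branch)
```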

The training of our proposed model proceeds in two stages. First, we train each language-specific network separately, terminating with an early stopping strategy. Afterwards, the language-share network and the consensus propagation loss are introduced: we use the loss defined in Eq. (3) together with \(L_{ST}\) to train the language-specific networks and the language-share network simultaneously. At test time, given an input sentence and its translated sentence, the final prediction is obtained by averaging the three prediction scores from the two language-specific networks and the language-share network.
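Combining the sketches above, test-time prediction could look like the following, under the assumption that the averaged "prediction scores" are softmax probabilities:

```python
import torch.nn.functional as F

def predict(branch_s, branch_t, share_head, tokens_s, tokens_t):
    """Average the two language-specific and one language-share predictions."""
    feat_s, logits_s = branch_s(tokens_s)
    feat_t, logits_t = branch_t(tokens_t)
    logits_st = share_head(feat_s, feat_t)
    probs = (F.softmax(logits_s, dim=1) + F.softmax(logits_t, dim=1)
             + F.softmax(logits_st, dim=1)) / 3.0
    return probs.argmax(dim=1)
```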

4 Experiment and Analysis

In this section, we investigate the empirical performance of our proposed architecture on five benchmark datasets for sentence classification.

4.1 Datasets and Experimental Setup

The sentence classification datasets include:

(1) MR: This dataset includes movie reviews with one sentence per review, where the classification involves detecting positive/negative reviews [23].

(2) CR: This dataset contains annotated customer reviews of 5 products, and the target is to predict positive/negative reviews [8].

(3) Subj: This is a subjectivity dataset, which includes subjective or objective sentiments [22].

(4) TREC: This dataset focuses on the question classification task, involving 6 question types [18].

(5) SST-1: This dataset is the Stanford Sentiment Treebank, an extension of MR, which contains training/development/testing splits and fine-grained labels (very positive, positive, neutral, negative, very negative) [27].

Similar to [13], the initialized word vectors for the source language are obtained from the publicly available word2vec vectors trained on 100 billion words from Google News. For the target language of Chinese, we retrain the word2vec models on the Chinese Wikipedia Corpus; for the target language of Dutch, we retrain them on the Dutch Wikipedia Corpus. In our experiments, we choose the CNN-multichannel variant of TextCNN because of its better performance.

4.2 Ablation Study

We first compare our proposed model with several baseline models for sentence classification. Here, S+T indicates that the model's input contains both the source language and the target language, and T(*) indicates the type of target language: T(CH) means the target language is Chinese, and T(DU) means it is Dutch. Figures 2 and 3 show the comparison of classification accuracy on the five benchmark datasets. CNN(S) denotes the CNN-multichannel variant of TextCNN, which uses only the source language data (English) for training and testing. CNN(T) is a TextCNN model retrained on the translated target language data of Chinese (CH)/Dutch (DU), with all other settings the same as CNN(S). Ours(S+T(*)) denotes our model combining multilingual data augmentation with deep consensus learning. Ours(S+T(*)) performs much better than the baselines, which demonstrates the effectiveness of our framework. Evidently, multilingual data augmentation provides beneficial additional discrimination for learning a robust sentence representation for classification. It is worth noting that CNN(T) is even better than CNN(S) on MR, which indicates that existing machine translation methods not only keep the discriminative semantics of the source language, but can also create useful discrimination in the target language space.

Fig. 2. The comparison results with existing baseline models based on English\(\rightarrow\)Chinese MT.

Fig. 3. The comparison results with existing baseline models based on English\(\rightarrow\)Dutch MT.

Similar to TextCNN, we also evaluate several variants of our model to demonstrate its effectiveness. When a large supervised training set is lacking, word vectors obtained from unsupervised neural language models are commonly used for initialization to improve performance. We therefore validate the model with various word vector initialization methods.

The different word vector initialization methods include:

(1) Rand: All words are randomly initialized and can be trained during training.

(2) Static: All words of the input language are initialized with pre-trained vectors from the corresponding language's word2vec model and are kept static during training.

(3) Non-static: The same initialization as Static, but the pre-trained vectors can be fine-tuned during training.

(4) Multichannel: The model contains two types of word vectors, treated as different channels: one can be fine-tuned during training, while the other is kept static. Both are initialized with the same word embeddings from word2vec (see the sketch below).
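As a sketch of variant (4), both channels can be built from the same pre-trained matrix, freezing one and fine-tuning the other; the stacked-channel tensor layout is an assumption:

```python
import torch
import torch.nn as nn

class MultichannelEmbedding(nn.Module):
    """Two embedding channels from the same word2vec matrix: one static
    (frozen), one non-static (fine-tuned during training)."""
    def __init__(self, pretrained):  # pretrained: FloatTensor (vocab, dim)
        super().__init__()
        self.static = nn.Embedding.from_pretrained(pretrained.clone(), freeze=True)
        self.tuned = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)

    def forward(self, tokens):       # -> (batch, 2, seq_len, dim)
        return torch.stack([self.static(tokens), self.tuned(tokens)], dim=1)
```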

Table 1 shows the experimental results of different model variants based on English\(\rightarrow\)Chinese MT. Compared to the source language S, the classification accuracy of the target language T(CH) improves on some datasets and decreases on others, showing a strong dataset dependency. Since the proposed S+T(CH) model with Multichannel obtains the best results, we choose the Multichannel variant for our final model. Similarly, Table 2 shows the experimental results of different model variants based on English\(\rightarrow\)Dutch MT. Together, the results in Tables 1 and 2 validate our consensus learning method.

Table 1. The experimental results of different model variants based on English\(\rightarrow\)Chinese MT.
Table 2. The experimental results of different model variants based on English\(\rightarrow\)Dutch MT.
Table 3. The comparison results between the state-of-the-art approaches and ours.

4.3 Comparison with Existing Approaches

To further exhibit the effectiveness of our model, we compare our approach with several state-of-the-art approaches, including recent LSTM-based and CNN-based models. As shown in Table 3, our approach achieves very promising results compared to these methods. Performance is measured by classification accuracy. We roughly divide the existing approaches into four categories. The first category comprises RNN-based models: Standard-RNN refers to the Standard Recursive Neural Network [27], MV-RNN is the Matrix-Vector Recursive Neural Network [26], RNTN denotes the Recursive Neural Tensor Network [27], and DRNN represents the Deep Recursive Neural Network [9]. The second category comprises LSTM-based models: bi-LSTM stands for Bidirectional LSTM [28], SA-LSTM means Sequence Autoencoder LSTM [4], Tree-LSTM is Tree-Structured LSTM [28], and Standard-LSTM represents the Standard LSTM Network [28]. CNN-based models form the third category: DCNN denotes the Dynamic Convolutional Neural Network [12], CNN-Multichannel is the Convolutional Neural Network with Multichannel [13], MVCNN refers to the Multichannel Variable-Size Convolution Neural Network [32], Dep-CNN denotes the Dependency-based Convolutional Neural Network [21], MGNC-CNN stands for the Multi-Group Norm Constraint CNN [38], and DSCNN represents the Dependency Sensitive Convolutional Neural Network [34]. The fourth category is based on other methods: Combine-skip refers to the skip-thought model with the concatenation of the vectors from uni-skip and bi-skip [14], CFSF indicates initializing Convolutional Filters with Semantic Features [17], and GWS denotes exploiting domain knowledge via Grouped Weight Sharing [37].

On MR in particular, our S+T(CH) model achieves the best performance by a margin of nearly \(5\%\). This improvement demonstrates that our multilingual data augmentation and consensus learning make great contributions to such sentence classification tasks. Through multilingual data augmentation, important words are retained, and the NMT system can map ambiguous words in the source language to different word units in the target language, achieving word disambiguation. Essentially, our method enables CNNs to obtain better discrimination and generalization abilities. To further demonstrate the superiority of our proposed model, we also use English as the source language and Dutch as the target language to evaluate the S+T(DU) model. On the four benchmark datasets of MR, CR, Subj, and TREC, our S+T(CH) and S+T(DU) models both achieve the best results to date.

5 Conclusion and Future Work

In this paper, multilingual data augmentation is introduced to further improve sentence classification. A novel deep consensus learning model is established to fuse multilingual data and learn their language-share and language-specific knowledge. The experimental results demonstrate the effectiveness of our proposed framework. In addition, our method requires no external data compared to existing methods, which makes it very practical, with good generalization ability, in real application scenarios. In the future, we will explore the performance of the model on larger sentence/document datasets, and the linguistic features of different languages will also be considered when selecting the target language.