
1 Introduction

In the information age, people waste a great deal of time searching for the information that interests them. Consequently, it is crucial to extract the most relevant information quickly and effectively from a wide range of sources. Text classification can address this problem and is one of the main research areas of Natural Language Processing (NLP). Text classification is the assignment of texts to their respective categories, with applications such as spam filtering, article categorization, sentiment analysis, post classification, and hate speech identification. Recently, the use of word embeddings with deep learning methods has attracted considerable interest in text classification because of their ability to capture semantic relationships between words [2, 6, 8]. Words are treated as the basic unit in most NLP approaches that build continuous word vector representations. This paper focuses in particular on Myanmar text classification. There is no rule for determining word boundaries in the Myanmar language. Since Myanmar is morphologically rich, it is difficult to learn good word representations because many word types seldom occur in the training corpus. To classify Myanmar text with deep learning models, several pre-processing steps are taken: extracting large amounts of Myanmar text, removing unnecessary characters, determining word boundaries, and converting words into word vectors that preserve context information. Grave et al. [3] published pre-trained word vectors for 246 languages trained on Common Crawl and Wikipedia. They proposed bag-of-character n-grams based on the skip-gram model that capture sub-word information to enrich word vectors. Pre-trained sub-word vectors for 275 languages were also released by Heinzerling et al. [5]. Such work is very helpful for resource-scarce languages and can be applied to specific NLP tasks through transfer learning. This paper applies deep learning models to text classification and uses pre-trained word embeddings trained on Wikipedia to construct the embedding matrix.

The rest of the paper is organized as follows. Section 2 reviews related work on text classification for both English and Myanmar. Section 3 discusses the pre-processing steps performed before the embedding layer. Section 4 explains the proposed model. Section 5 presents the experiments, including the dataset collection, the comparison models, and the results, and the paper is concluded in Sect. 6.

2 Related Work

Conneau et al. [2] proposed very deep convolutional neural networks (VDCNN) with up to twenty-nine convolutional layers. VDCNN operates directly at the character level, and its performance was measured on eight datasets. Joulin et al. [6] developed an efficient and simple text classification system called fastText. Its accuracy is comparable to that of other deep learning classifiers, yet on a standard multicore CPU it trains on more than one billion words in less than ten minutes. Song et al. [13] introduced a context-LSTM-CNN model that uses an LSTM to capture long-range dependencies and convolution and max-pooling layers to extract local features at specific points. Lai et al. [10] applied a bi-directional RNN to capture meaning and max-pooling to capture the key components of texts. Kim [8] used a single convolution layer in a simple CNN and proposed the CNN variants CNN-rand, CNN-static, CNN-non-static, and CNN-multichannel. These models were evaluated on seven publicly available datasets and improved on the state of the art for four of the seven. Zhang et al. [15] compared character-level convolutional networks with word-level ConvNets and RNNs for English text classification. For Myanmar text classification, we also surveyed previous work on tasks such as news classification, spam filtering, and sentiment analysis. Aye et al. [1] improved prediction accuracy on informal Myanmar text by considering objective and intensifier words in Myanmar food and restaurant reviews. Khine et al. [7] compared the Naïve Bayes and k-Nearest Neighbors (KNN) algorithms for Myanmar news classification; their experiments showed that KNN achieves higher recall and accuracy than Naïve Bayes on a 1,200-document dataset with four categories. Yu et al. [14] developed a corpus annotated with sentiment polarity for Myanmar news, in which an N-gram model is used to select features and the Naïve Bayes algorithm to identify sentiment. Kyaw et al. [9] constructed a spam filtering corpus and proposed a Naïve Bayes-based learning algorithm for spam or ham classification. According to the literature, some deep learning models have also been improved and explored for automatic speech recognition, as in [12]. Most Myanmar text classification tasks have been performed with lexicon-based approaches, because a key challenge for Myanmar text classification is the large amount of resources required to train deep learning models. Using pre-trained word vectors can alleviate this resource problem. In our previous work [12], we performed a comparative analysis of CNN and RNN models at both the syllable and word level using three pre-trained vectors, and also collected and annotated six Myanmar article datasets. We use the pre-trained vector trained with the skip-gram model in the embedding layer. This paper presents a joint CNN and Bi-LSTM model and compares it with most of the baseline deep learning models and their combinations for Myanmar text classification on five datasets.

3 Pre-processing

Pre-processing is crucial for the Myanmar language because of its nature. Firstly, we extract sentences from text documents. Pre-processing includes removing non-Myanmar characters, punctuation marks, and numbers. As this work focuses on Myanmar text classification, we remove non-Myanmar characters, i.e. those that do not fall within the Unicode range [U1000-U104F]. The numbers [U1040-U1049] and the punctuation marks [U104A-U104B] are also removed. The Myanmar language has no rule for determining word boundaries. In this work, the BPE tokenizer (Footnote 1) is used to define word boundaries. Algorithms 1 and 2 show the step-by-step procedure of the pre-processing task. Algorithm 1 shows the step-by-step process for removing unnecessary characters from the text dataset; a sketch of this step is shown below. Table 1 shows the pre-processing steps for the sample input text “ Medicine Box ”. In this sample text, the non-Myanmar characters “Medicine Box” and the punctuation marks “ ” are removed, and the remaining text string “ ” is segmented as “ ” by the tokenizer.
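As a concrete illustration of Algorithm 1, the following is a minimal Python sketch of the character-filtering step, assuming the input is a Unicode sentence; the function and pattern names are illustrative, not the authors' exact code.

```python
import re

# Keep only characters in the Myanmar block U+1000-U+104F, then drop the
# digits U+1040-U+1049 and the punctuation marks U+104A-U+104B.
NON_MYANMAR = re.compile(r"[^\u1000-\u104F]")
DIGITS_AND_PUNCT = re.compile(r"[\u1040-\u104B]")

def clean_sentence(sentence: str) -> str:
    """Remove non-Myanmar characters, Myanmar digits, and punctuation marks."""
    sentence = NON_MYANMAR.sub(" ", sentence)
    sentence = DIGITS_AND_PUNCT.sub(" ", sentence)
    return re.sub(r"\s+", " ", sentence).strip()

# Example with arbitrary Myanmar code points mixed with English text.
print(clean_sentence("Medicine Box \u1019\u103C\u1014\u103A\u1019\u102C"))
```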

Table 1. Pre-processing steps for sample input text

3.1 Pre-trained Vector

In this work, we use the pre-trained vectors trained with the fastText skip-gram model (Footnote 2). The pre-trained file contains 91,497 word vectors of dimension 300 and is used as the vocabulary for converting words into word vectors. Algorithm 2 shows the conversion of segmented words into the embedding matrix. Figure 1 shows the step-by-step process before the embedding layer. Table 2 shows a sample of the embedding matrix for each segmented word.
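A hedged sketch of Algorithm 2 is given below: the released fastText vectors are read into a dictionary and each segmented word is mapped to a row of the embedding matrix. The file name and helper names are assumptions rather than the authors' exact code.

```python
import numpy as np

EMBEDDING_DIM = 300  # dimension of the pre-trained vectors

def load_pretrained_vectors(path="wiki.my.vec"):
    """Read a fastText .vec text file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the header line (vocabulary size and dimension)
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def build_embedding_matrix(word_index, vectors):
    """word_index maps each segmented word to an integer id, e.g. from a tokenizer."""
    matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:      # out-of-vocabulary words keep the zero vector
            matrix[idx] = vec
    return matrix
```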

Table 2. Sample result of embedding matrix for each segmented word
Fig. 1. Pre-processing steps before the embedding layer

4 Model

A joint CNN-Bi-LSTM model is illustrated in Fig. 2. It is basically composed of the following layers.

Embedding Layer:

After the pre-processing steps, the segmented words are matched against the vocabulary of the pre-trained word vectors trained with the skip-gram model. Each word in the vocabulary is attached to its corresponding vector, which captures context information.

Convolution Layer:

The convolution layer performs convolution with a stride of 1 and the ReLU activation function f(x) = max(x, 0). It is used to extract features from the embedding matrix. We discard the pooling layer because it captures only the most important information and loses the context information.

Bi-LSTM Layer:

The Bi-LSTM layer is applied in place of a pooling layer to capture long-term semantic information from both past and future contexts.

Fig. 2. A joint CNN and Bi-LSTM model

Fully Connected Layer:

The fully connected layer uses the sigmoid activation function to calculate the probability of each class. The sigmoid function \( p(c_{n} ) \) is

$$ {\text{p}}(c_{n} ) = \frac{1}{{1 + e^{{ - c_{n} }} }} $$
(1)

The probability of a class does not depend on the probabilities of the other classes, so the model can handle multi-label problems. Binary cross-entropy is used as the loss function, and the Adam optimizer with a dropout of 0.5, a batch size of 16, and 10 epochs is used as the hyper-parameter setting. In addition, bias and kernel regularizers (l2 = 0.01) are applied in the output layer to reduce overfitting.
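The following is a minimal Keras sketch of the joint CNN-Bi-LSTM model with the stated hyper-parameters (dropout 0.5, l2 = 0.01, Adam, binary cross-entropy, batch size 16, 10 epochs); the filter count, kernel size, LSTM units, dropout placement, and the decision to freeze the embedding weights are assumptions.

```python
from tensorflow.keras import layers, models, regularizers, initializers

NUM_CLASSES = 5  # number of categories in a dataset

def build_cnn_bilstm(embedding_matrix):
    vocab_size, dim = embedding_matrix.shape
    model = models.Sequential([
        layers.Embedding(vocab_size, dim,
                         embeddings_initializer=initializers.Constant(embedding_matrix),
                         trainable=False),                    # pre-trained skip-gram vectors
        layers.Conv1D(128, 3, strides=1, activation="relu"),  # feature extraction, no pooling
        layers.Bidirectional(layers.LSTM(64)),                # past and future context
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="sigmoid",
                     kernel_regularizer=regularizers.l2(0.01),
                     bias_regularizer=regularizers.l2(0.01)),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model.fit(x_train, y_train, batch_size=16, epochs=10, validation_data=(x_test, y_test))
```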

5 Experiment

5.1 Datasets

Empirical exploration is conducted on five Myanmar news datasets. These datasets are collected from five daily news websites [12]. The text data are converted into Unicode by using Rabbit Converter (Footnote 3). Each line represents one sentence annotated with its corresponding label. The text data are shuffled and split into 75% training and 25% testing sets, as sketched below. The first dataset is collected from the 7Day Daily (Footnote 4) website, with 10,884 and 3,628 sentences for the training and testing sets. The second dataset is collected from the DVB (Footnote 5) website; it covers five subjects, with 8,201 and 2,733 sentences for the training and testing sets. The third dataset is collected from The Voice (Footnote 6) news website and covers five subjects, with 7,660 and 2,586 sentences for the training and testing sets. The fourth dataset is from the Thit Htoo Lwin (Footnote 7) news website and includes five subjects, with 12,299 and 4,099 sentences for the training and testing sets. The last dataset is collected from the Myanmar Wikipedia (Footnote 8) website and contains four topics, with 11,299 and 3,766 sentences for the training and testing sets. Table 3 summarizes these datasets.
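As a small sketch of the shuffle and 75/25 split described above (the scikit-learn call and placeholder data are illustrative, not necessarily the authors' tooling):

```python
from sklearn.model_selection import train_test_split

# Placeholder data: one segmented sentence and one integer label per dataset line.
sentences = ["sentence one", "sentence two", "sentence three", "sentence four"]
labels = [0, 1, 0, 1]

x_train, x_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.25, shuffle=True, random_state=42)
```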

Table 3. Myanmar text classification datasets

5.2 Comparison Models

In this work, we perform a comparative analysis of the joint CNN and Bi-LSTM model against the CNN, RNN, Bi-LSTM, and CNN-LSTM models.

Convolutional Neural Networks (CNN):

It is a feed-forward artificial neural network most widely used for visual image analysis, and it has recently achieved significant success in text classification tasks. It has three basic components: convolution, pooling, and fully connected layers. The ReLU activation function f(x) = max(x, 0) is used in the convolution layer, and the network can have several convolution layers. The pooling layer extracts the most important features, most commonly with max-pooling. The fully connected layer is the model's output layer and predicts the class of the input sentence. It commonly uses the softmax function, \( f(x)_{i} = \frac{e^{x_{i}}}{\sum_{j=1}^{C} e^{x_{j}}} \), in which the probability of a class depends on the probabilities of all other classes.
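A baseline CNN of this form might look like the following Keras sketch; the filter count, kernel size, and the use of one-hot labels with categorical cross-entropy are assumptions.

```python
from tensorflow.keras import layers, models, initializers

def build_cnn_baseline(embedding_matrix, num_classes=5):
    vocab_size, dim = embedding_matrix.shape
    model = models.Sequential([
        layers.Embedding(vocab_size, dim,
                         embeddings_initializer=initializers.Constant(embedding_matrix)),
        layers.Conv1D(128, 3, activation="relu"),   # single convolution layer with ReLU
        layers.GlobalMaxPooling1D(),                # keeps the strongest feature per filter
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```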

Recurrent Neural Networks (RNN):

RNN is a generalization of feed-forward neural networks with the distinction that it has an internal memory that allows information to persist. It performs the same function for every input while learning from the previous inputs. The RNN produces the output y_t as in Eq. (2), where the hidden state h_t is computed as in Eq. (3).

$$ y_{t} = f\left( {W_{y} h_{t} } \right) $$
(2)
$$ h_{t} = \sigma \left( {W_{h} h_{t - 1} + W_{x} x_{t} } \right) $$
(3)
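A small NumPy illustration of the recurrence in Eqs. (2) and (3) is shown below; the dimensions and random weights are placeholders, and the sigmoid is used as the hidden activation σ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h, d_out = 300, 64, 5                   # input, hidden, and output sizes (assumed)
W_x = rng.normal(size=(d_h, d_in))
W_h = rng.normal(size=(d_h, d_h))
W_y = rng.normal(size=(d_out, d_h))

h_t = np.zeros(d_h)
for x_t in rng.normal(size=(10, d_in)):         # a sequence of ten word vectors
    h_t = sigmoid(W_h @ h_t + W_x @ x_t)        # Eq. (3)
    y_t = W_y @ h_t                             # Eq. (2), before the output activation f
```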

Bidirectional LSTMs:

It is an extension of the LSTM model that can learn from both past and future information for a specific task.

CNN-LSTM:

A joint CNN and LSTM framework was proposed in (Hassan A, 2018), in which the CNN produces the feature map and the LSTM captures long-term dependencies.

5.3 Experimental Result and Discussion

The experiments are run on Google Colaboratory (Footnote 9), which requires no Jupyter notebook configuration, using Keras (Footnote 10), a model-level library. The performance of the CNN-Bi-LSTM model is compared with the models described in Sect. 5.2, as listed in Table 4. The highest score for each dataset is highlighted in bold. According to the experiments, the proposed model improves accuracy on four of the five datasets; the CNN model performs equally well on two datasets, and the CNN-LSTM model performs better on two of the five datasets. We also measure the training time of each model. According to these measurements, the CNN model requires the least training time because only one convolution layer is used. Although the CNN-Bi-LSTM model performs better than the remaining models on three datasets, it requires more training time than the CNN-LSTM, RNN, and CNN models. The average training time of each model is listed in Table 5.

Table 4. Comparison of average testing accuracy
Table 5. Comparison of average training time

6 Conclusion

This paper presents a joint CNN-Bi-LSTM model that takes advantage of the CNN to extract features and the Bi-LSTM to capture long-term context information from both past and future contexts. A series of experiments is performed comparing the proposed model with the CNN, Bi-LSTM, RNN, and CNN-LSTM models in terms of accuracy on five Myanmar article datasets. According to the experiments, the proposed model performs better on three of the five datasets. The CNN model requires the least training time of all the models, and the CNN-Bi-LSTM model takes more time than the CNN, RNN, and CNN-LSTM models.