1 Introduction

Text classification has been widely applied to real-world problems, e.g., deceptive review identification [1, 2], sentiment analysis [3, 4], information retrieval [5], and email spam detection [6]. Many traditional text classification techniques, such as topic modeling [7], are generally based on either the bag-of-words (BOW) representation or simple statistics of ordered word combinations such as n-grams [8, 9]. However, BOW ignores word order, so that different sentences may share the same representation. Although the bag-of-n-grams model captures word order within short contexts, it is poorly suited to text classification because it produces sparse, high-dimensional representations. Traditional topic modeling methods, such as LDA (Latent Dirichlet Allocation), PLSA (Probabilistic Latent Semantic Analysis) and NMF (Non-negative Matrix Factorization) [10, 11], suffer from serious issues with optimization, noise sensitivity, and instability on complex data relationships [12,13,14]. Different from topic modeling, deep neural network models have been proposed to learn more effective vector representations of words, e.g., pre-trained word vectors that map words into a vector space in which semantically similar words have similar representations [15, 16].

By virtue of word embeddings, a family of CNN-based text classification models has been presented to explore the semantic representation of sentences. These methods are generally competitive with traditional models without requiring any knowledge of the syntactic or semantic structures of a language [17, 18]. Kim [17] proposed a CNN structure for text classification that utilizes pre-trained word embedding vectors as inputs; a standard CNN model is then applied to extract semantic features of sentences. To achieve better performance, most subsequent studies constructed more complex models by increasing the number of parameters or updating the architecture, such as using various word embedding techniques, increasing the number of layers, or introducing new pooling techniques [18,19,20]. However, these models generally converge slowly. In addition, if the embedding vectors of rare words are poorly estimated, the representations of surrounding words and the performance of the classification model are likely to suffer. This is especially problematic in morphologically rich languages with long-tailed frequency distributions, or in domains with dynamic vocabularies (e.g., social media).

Fortunately, many researchers have demonstrated that convolutional networks are useful for extracting information from raw signals [21,22,23,24,25,26,27,28,29,30], ranging from computer vision to speech recognition and beyond. Some convolutional approaches use features extracted at the word or word n-gram level to form a distributed representation [23, 24], while others utilize convolutional networks to extract character-level features for different languages [21]. Consequently, these models can automatically learn unusual character combinations, such as misspellings and emoticons.

Different from existing fast learning models based on convolutional neural networks [31, 32], we propose a character-level text classification model that utilizes both CNN and Bi-GRU to further improve the performance of existing methods [33, 34]. Meanwhile, a highway network and fully convolutional layers are incorporated into the proposed model to speed up the convergence rate. The main contributions of this work are summarized as follows:

  1. Unlike existing models, the fully connected layers are replaced by fully convolutional layers, which significantly reduces the number of parameters and makes the model more applicable to large-scale text classification tasks.

  2. By virtue of FCLs, an end-to-end character-level CNN-Highway-BiGRU network is constructed to handle raw text character sequences, and the argmax function is utilized to pre-train this end-to-end model, which achieves satisfactory classification results with a much faster convergence speed.

  3. By introducing the error minimized extreme learning machine [35], our model can update the output weights incrementally. Thus, compared with existing methods based on the softmax classifier, the proposed model is able to leverage the extracted features to achieve better performance.

The remainder of the paper is organized as follows. We review related work in Sect. 2. Section 3 presents the details of our CNN-Highway-BiGRU network. Experimental results are shown in Sect. 4. Finally, Sect. 5 concludes the paper.

2 Related work

We mainly discuss some representative works for the two subtasks of text categorization, i.e., text feature extraction and classifier design.

Traditional methods of text feature representation have some limitations for classification. Specifically, words that occur frequently across all documents tend to overshadow other words in the BOW model. TF-IDF, a term weighting scheme, is commonly used to alleviate this problem by combining term frequency (TF) with inverse document frequency (IDF). In addition, the bag-of-n-grams model exploits word order in short contexts and achieves better classification performance than BOW [7]. However, data sparsity, the curse of dimensionality and low utilization of semantic information remain challenging for these traditional methods [15, 16]. To this end, learning a low-dimensional vector representation of a word from its local context, i.e., word embedding, has been developed and widely used in natural language processing (NLP) [36,37,38,39,40]. By transforming each short text unit (or sentence) into a matrix, the CNN model can be naturally applied to text categorization. Among CNN-based methods, CNN-non-static, a single-layer, single-channel sentence model proposed by Kim [17], is the simplest and achieves satisfactory performance. Compared with word embedding based methods, CNN based feature extraction methods are more efficient on raw signals. Santos confirmed that the accuracy of short text classification can be substantially improved when the character sequence of an English short text is taken as the processing unit to learn word-level and sentence-level features, respectively [41]. Kanaris et al. [22] combined character-level n-grams with a linear classifier to obtain satisfactory text classification performance. Zhang et al. [21] incorporated character-level features into convolutional networks for classification tasks in different languages. Cho et al. [28] proposed a neural network language model that extracts subword information with a character-level convolutional neural network (CNN), whose output is used as the input to a recurrent neural network language model. Ling et al. [41] proposed a neural network that uses character-level features to encode and decode individual characters in the translation process. Huang et al. [42] proposed a bidirectional LSTM model that uses character-level features to learn word embeddings and character segmentation. Compared with traditional models, these approaches show superior performance in natural language processing. To speed up the convergence of deep CNNs, Srivastava et al. [43] proposed highway networks and combined this new structure with convolutional and fully connected networks.

On the other hand, the softmax classifier has been replaced by other classifiers to improve the performance of CNN-based text classification models. Some hybrid models, such as the CNN-SVM model, have been shown to outperform the traditional CNN model in sentiment analysis and face recognition [8,9,10]. However, when cross-validation is used in experiments, selecting the appropriate parameters is generally time-consuming. The extreme learning machine (ELM), proposed by Huang et al. [35], has been shown to be superior to SVM and requires fewer parameters to be tuned manually. Furthermore, EM-ELM is able to automatically choose the optimal number of hidden nodes and has the advantage of updating the output weights incrementally.

Motivated by these studies, we propose a novel character-level CNN-Highway-BiGRU network for text categorization, which achieves better classification performance with a much faster convergence speed. Different from existing models, the fully connected layers are replaced by fully convolutional layers to effectively reduce the number of parameters in our model. In addition, the argmax classifier is used to pre-train our end-to-end model, which efficiently extracts local and global features from raw text character sequences. Based on the extracted deep features, EM-ELM is introduced to further enhance the text classification model by automatically choosing the optimal number of hidden nodes and updating the output weights incrementally. Consequently, the proposed model not only has a faster convergence rate than the state-of-the-art methods, but also achieves better classification accuracy on text datasets.

3 Character-level text categorization based on CNN-highway-BiGRU

In this section, we develop a character-level deep learning model for text classification. The architecture of our model is shown in Fig. 1. In the proposed model, the fully connected layers are replaced by fully convolutional layers (FCLs). Instead of the softmax classifier, the argmax classifier is used to pre-train our end-to-end model. The pre-trained model then works as a deep feature extractor, and the normalized deep features are fed to the EM-ELM classifier.

Fig. 1 An illustration of the network architectures for pre-training and fine-tuning

At first, our model receives a sequence of characters (a sentence) as input, and then looks up the corresponding one-hot vector for each character in a dictionary containing m characters. Because sentence lengths vary within a dataset, the length of the longest sentence in the entire dataset is taken as l0 (i.e., the number of characters), and each sentence is padded to length l0 during preprocessing. Characters or spaces that do not appear in the dictionary are assigned the zero vector. For English datasets, the dictionary contains the following 70 characters: abcdefghijklmnopqrstuvwxyz-,;.!?:"'/\|_@#$%^&*~`+= <>()[]{}0123456789. After the character embeddings are looked up and stacked to form the input matrix, convolution operations are performed between the input matrix and multiple filter kernels. A max-over-time pooling operation is then applied to obtain a fixed-dimensional representation of the word, which is passed to the highway network. The outputs of the highway network are used as the inputs to a bidirectional gated recurrent unit (GRU) network, which learns the semantics of words while taking contextual information into consideration. After the entire network has been trained, the FCLs are removed and the hidden representations of the bidirectional GRU are fed to EM-ELM to perform the classification task.
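
To make the preprocessing concrete, the following minimal sketch (our own illustration, not the original implementation) encodes a sentence as a sequence of one-hot character vectors over the dictionary above (reproduced with ASCII quotes) and zero-pads it to length l0; the function name and toy sentence are assumptions for illustration.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz-,;.!?:\"'/\\|_@#$%^&*~`+= <>()[]{}0123456789"
CHAR2IDX = {c: i for i, c in enumerate(ALPHABET)}

def encode_sentence(sentence, l0):
    """Return an l0 x m matrix of one-hot rows; out-of-dictionary characters stay all-zero."""
    m = len(ALPHABET)
    x = np.zeros((l0, m), dtype=np.float32)
    for t, ch in enumerate(sentence.lower()[:l0]):
        idx = CHAR2IDX.get(ch)
        if idx is not None:
            x[t, idx] = 1.0
    return x  # sentences shorter than l0 are zero-padded on the right

x = encode_sentence("Great movie!", l0=128)
print(x.shape)  # (l0, m)
```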

3.1 Model description

Our method utilizes the CNN-non-static architecture, which is a single-layer, single-channel CNN-based sentence model.

In CNN-non-static, each word in a sentence is replaced with its vector representation. Let \( V \) be the vocabulary of characters, \( d \) be the dimensionality of character embeddings, and \( {\mathbf{A}} \in {\mathbb{R}}^{d \times \left| V \right|} \) be the matrix of character embeddings. Suppose that word \( w \) is made up of a sequence of characters \( \left[ {c_{1} , \ldots ,c_{l} } \right] \), where \( l \) is the length of word \( w \). Then the character-level representation of \( w \) is given by the matrix \( \varvec{E}^{w} \in {\mathbb{R}}^{d \times l} \), whose \( j \)-th column corresponds to the character embedding for \( c_{j} . \)

A narrow convolution is applied between \( \varvec{E}^{w} \) and a kernel \( {\mathbf{K}} \in {\mathbb{R}}^{d \times \omega } \) of width \( \omega \); after a bias \( b \) is added, this yields a feature map \( \varvec{f}^{w} \in {\mathbb{R}}^{l - \omega + 1} \), whose \( i \)-th element is defined by

$$ \varvec{f}^{w} \left[ i \right] = \tanh \left( {\left\langle {\varvec{E}^{w} \left[ {*,i:i + \omega - 1} \right],{\mathbf{K}}} \right\rangle + b} \right), $$
(1)

where \( \varvec{E}^{w} \left[ {*,i:i + \omega - 1} \right] \) denotes the \( i \)-to-\( \left( {i + \omega - 1} \right) \)-th columns of \( \varvec{E}^{w} \) and \( \left\langle {{\mathbf{M}},{\mathbf{N}}} \right\rangle = {\text{Tr}}\left( {{\mathbf{MN}}^{T} } \right) \) is the Frobenius inner product. Finally, we take the max-over-time

$$ y^{w} = \mathop {\hbox{max} }\limits_{i} \varvec{f}^{w} \left[ i \right] $$
(2)

as the feature corresponding to the filter \( {\mathbf{K}} \), i.e., the highest response of the given filter over the word. Each filter essentially picks out a character \( n \)-gram, where the size of the \( n \)-gram corresponds to the filter width.
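
The following NumPy sketch (our own illustration under the definitions above, not the authors' code) implements Eqs. (1) and (2) for a single word and a single filter: a narrow convolution via the Frobenius inner product, a tanh non-linearity, and max-over-time pooling. The sizes d, l and w below are arbitrary toy values.

```python
import numpy as np

def char_conv_max(E, K, b):
    """E: d x l character embeddings of one word; K: d x w kernel; b: scalar bias."""
    d, l = E.shape
    _, w = K.shape
    f = np.empty(l - w + 1)
    for i in range(l - w + 1):
        window = E[:, i:i + w]                   # E^w[*, i : i + w - 1]
        f[i] = np.tanh(np.sum(window * K) + b)   # Frobenius inner product plus bias, Eq. (1)
    return f.max()                               # max-over-time pooling, Eq. (2)

rng = np.random.default_rng(0)
d, l, w = 16, 9, 3
y = char_conv_max(rng.normal(size=(d, l)), rng.normal(size=(d, w)), 0.1)
```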

3.2 Highway network

To address the difficulty of training deep models, Srivastava et al. [43] proposed a structure that eases the optimization of deep networks, termed the highway network. Through its gating mechanism, a highway network can locally regulate the information flow. In a feedforward neural network consisting of \( L \) layers, each layer applies a non-linear transformation \( {\mathbf{G}} \) with parameters \( \varvec{W}_{\text{G}} \) to its input \( \varvec{x} \) to generate the output \( {\mathbf{z}} \), which can be represented as

$$ {\mathbf{z}} = {\mathbf{G}}\left( {\varvec{x},\varvec{W}_{\text{G}} } \right). $$
(3)

Highway networks introduce two non-linear transforms \( {\mathbf{T}} \) and \( {\mathbf{C}} \) into Eq. (3), so that the output \( {\mathbf{z}} \) can be rewritten as

$$ {\mathbf{z}} = {\mathbf{G}}\left( {\varvec{x},\varvec{W}_{\text{G}} } \right) \cdot {\mathbf{T}}\left( {\varvec{x},\varvec{W}_{\text{T}} } \right) + \varvec{x} \cdot {\mathbf{C}}\left( {\varvec{x},\varvec{W}_{\text{C}} } \right), $$
(4)

where \( {\mathbf{T}} \) is called the transform gate and \( {\mathbf{C}} \) is called the carry gate, which express how much of the output is produced by transforming the input and carrying it, respectively. For simplicity, \( {\mathbf{C}} \) is usually set as \( {\mathbf{1}} - {\mathbf{T}} \). For every layer of the highway network, we have

$$ {\mathbf{z}} = {\mathbf{G}}\left( {\varvec{x},\varvec{W}_{\text{G}} } \right) \cdot {\mathbf{T}}\left( {\varvec{x},\varvec{W}_{\text{T}} } \right) + \varvec{x} \cdot \left( {{\mathbf{1}} - {\mathbf{T}}\left( {\varvec{x},\varvec{W}_{\text{T}} } \right)} \right), $$
(5)

where \( {\mathbf{G}} \) is usually an affine transform followed by a non-linear activation function. The dimensionality of \( \varvec{x},{\mathbf{z}},{\mathbf{G}}\left( {\varvec{x},\varvec{W}_{\text{G}} } \right) \) and \( {\mathbf{T}}\left( {\varvec{x},\varvec{W}_{\text{T}} } \right) \) must be the same to guarantee the validity of Eq. (5). Thus, based on the output of the transform gates, a highway layer can smoothly vary its behavior.
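
As a concrete illustration of Eq. (5), the sketch below (a toy example under our own assumptions, not the paper's implementation) computes one highway layer with a tanh transformation G and a sigmoid transform gate T; following common practice, the gate bias is initialized negative so that the layer initially favors carrying the input.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def highway_layer(x, W_G, b_G, W_T, b_T):
    """x, G(x) and T(x) share the same dimensionality, as required by Eq. (5)."""
    g = np.tanh(W_G @ x + b_G)        # candidate transformation G(x, W_G)
    t = sigmoid(W_T @ x + b_T)        # transform gate T(x, W_T); carry gate is 1 - t
    return g * t + x * (1.0 - t)      # Eq. (5)

dim = 8
rng = np.random.default_rng(1)
x = rng.normal(size=dim)
z = highway_layer(x, rng.normal(size=(dim, dim)), np.zeros(dim),
                  rng.normal(size=(dim, dim)), -2.0 * np.ones(dim))  # negative gate bias
```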

3.3 Gated recurrent unit

Recurrent neural networks (RNNs) can capture contextual information in text sequences. However, traditional RNN models suffer from two major problems: vanishing and exploding gradients. The gated recurrent unit (GRU), a variant of LSTM, is designed to avoid these problems [33, 34]. The architectures of the LSTM unit and the GRU unit are shown in Fig. 2 for comparison.

Fig. 2 Architectures of LSTM and GRU. a LSTM, where \( i \), \( f \) and \( o \) are the input, forget and output gates, respectively; \( C \) and \( \tilde{C} \) denote the memory cell and the new memory cell content. b GRU, where \( r \) and \( z \) are the reset and update gates, and \( h \) and \( \tilde{h} \) are the activation and the candidate activation

As shown in Fig. 2, GRU combines the forget gate and the input gate into a single update gate. It also merges the cell state and the hidden state, among other changes. The resulting model is simpler than the standard LSTM unit. In addition, experiments indicate that GRU can achieve results competitive with or better than LSTM on NLP tasks, while converging faster and requiring fewer training epochs. For these reasons, we choose GRU to capture character-level and sentence-level semantics in our classification task. In the proposed model, a two-layer (forward and backward) GRU network is designed to encode the sentence. A forward GRU computes the state \( \overrightarrow {{\varvec{h}_{t} }} \) from the past (left) context of the sentence at character \( \varvec{c}_{t} \), while a backward GRU reads the same sentence in reverse and outputs \( \overleftarrow {{\varvec{h}_{t} }} \) given the future (right) context. Afterwards, we concatenate the outputs \( \overrightarrow {{\varvec{h}_{t} }} \) and \( \overleftarrow {{\varvec{h}_{t} }} \) as the output of the GRU network, \( \varvec{h}_{t} = \left[ {\overrightarrow {{\varvec{h}_{t} }} :\overleftarrow {{\varvec{h}_{t} }} } \right] \). Setting the number of hidden units to \( m \), the result of the GRU network for the input sentence can be expressed as follows:

$$ \varvec{H} = \left[ {h_{1} ;h_{2} ; \ldots ;h_{n} } \right], $$
(6)

where \( n \) is the length of the input sentence. The output of the recurrent network is \( \varvec{H} \in {\mathbb{R}}^{{n \times \left( {2 \times m} \right)}} \), where each row of \( \varvec{H} \) represents the feature of one word generated by the GRU.
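
The bidirectional encoding above can be sketched as follows (a simplified NumPy illustration, not the paper's TensorFlow code): gru_step implements the standard GRU equations (update gate, reset gate, candidate activation), and bigru_encode concatenates the forward and backward states into the matrix H of Eq. (6). The parameter names and toy sizes are our own assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, P):
    """One GRU step with parameter dictionary P."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h + P["bz"])              # update gate
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h + P["br"])              # reset gate
    h_tilde = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h) + P["bh"])  # candidate activation
    return (1.0 - z) * h + z * h_tilde

def bigru_encode(X, Pf, Pb, m):
    """X: list of n input vectors; returns H with shape n x (2*m), as in Eq. (6)."""
    n = len(X)
    hf, hb = np.zeros(m), np.zeros(m)
    H_fwd, H_bwd = [], [None] * n
    for t in range(n):                       # forward pass over the left context
        hf = gru_step(X[t], hf, Pf)
        H_fwd.append(hf)
    for t in reversed(range(n)):             # backward pass over the right context
        hb = gru_step(X[t], hb, Pb)
        H_bwd[t] = hb
    return np.stack([np.concatenate([f, b]) for f, b in zip(H_fwd, H_bwd)])

def init_params(d_in, m, rng):
    return {"Wz": rng.normal(size=(m, d_in)), "Uz": rng.normal(size=(m, m)), "bz": np.zeros(m),
            "Wr": rng.normal(size=(m, d_in)), "Ur": rng.normal(size=(m, m)), "br": np.zeros(m),
            "Wh": rng.normal(size=(m, d_in)), "Uh": rng.normal(size=(m, m)), "bh": np.zeros(m)}

rng = np.random.default_rng(2)
X = [rng.normal(size=32) for _ in range(10)]   # a toy sequence of 10 input vectors
H = bigru_encode(X, init_params(32, 512, rng), init_params(32, 512, rng), m=512)
print(H.shape)  # (10, 1024)
```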

3.4 Fully convolutional layers

Our model replaces the fully connected layers with convolutional layers. In Eq. (7), "*" denotes the convolution operator, the first operand \( \varvec{x} \) represents the input, which is the output of the preceding layers of the convolutional neural network, and the second operand \( \varvec{w} \) represents the weight vector of one convolution kernel. The time complexity of a single convolutional layer is given in Eq. (8), where \( M \) is the size of the output feature map, \( K \) is the convolution kernel size, \( C_{in} \) is the number of input channels, and \( C_{out} \) is the number of output channels. The spatial complexity of the model is given in Eq. (9). As can be seen from the formula, the spatial complexity depends only on the convolution kernel size \( K \) and the channel numbers \( C_{in} \) and \( C_{out} \), regardless of the input size. Thus, the neurons are locally connected to the input data and share parameters. In contrast, each node of a fully connected layer is connected to all nodes of the preceding layer, which results in a large number of parameters.

$$ \varvec{s} = \varvec{x}\left( t \right)*\varvec{w}\left( t \right) $$
(7)
$$ {\text{time}} = {\mathcal{O}}\left( {M^{2} \times K^{2} \times C_{in} \times C_{out} } \right) $$
(8)
$$ {\text{space}} = {\mathcal{O}}\left( {K^{2} \times C_{in} \times C_{out} } \right) $$
(9)
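
The following toy calculation illustrates the consequence of Eq. (9): the parameter count of a 1 × 1 convolutional layer depends only on K, C_in and C_out, whereas a fully connected layer over the same feature map grows with its spatial size. The concrete sizes below are assumed for illustration only.

```python
def conv_params(K, C_in, C_out):
    return K * K * C_in * C_out + C_out   # kernel weights plus biases, cf. Eq. (9)

def fc_params(n_in, n_out):
    return n_in * n_out + n_out           # dense weights plus biases

C_in, C_out, feat_len = 512, 512, 128     # assumed example sizes
print(conv_params(1, C_in, C_out))        # 262,656 parameters, regardless of input length
print(fc_params(feat_len * C_in, C_out))  # 33,554,944 parameters for a length-128 feature map
```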

3.5 Error minimized extreme learning machine for classification

In our model, to avoid the large number of parameters of fully connected layers, a classifier based on the argmax function is used to pre-train the network for binary or multiclass classification. Thus, the length of the last layer is determined by the number of classes. Then, the error minimized extreme learning machine (EM-ELM) [35], which can add random hidden nodes to SLFNs one by one or group by group (with varying group size), is utilized to achieve better classification results by incrementally updating the output weights. The EM-ELM algorithm is described as follows.

Compared with the standard ELM, which has to recalculate the output weights if the network architecture is updated, EM-ELM effectively reduces the computation complexity by updating the output weights incrementally. Furthermore, its convergence can still be guaranteed [35].

Figure a Pseudocode of the EM-ELM algorithm [35]
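
A condensed NumPy sketch of EM-ELM is given below (our paraphrase of the procedure in [35], not its original code): starting from a single random hidden node, new nodes are added one by one and the output weights are obtained through an incremental update of the hidden-layer pseudoinverse until the target error or the maximum number of hidden nodes L_max is reached. All function and variable names are our own.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def em_elm(X, T, L_max, eps, rng):
    """X: N x d normalized deep features; T: N x c one-hot targets."""
    N, d = X.shape
    Ws, Bs = [], []

    def new_node():
        w, b = rng.normal(size=(d, 1)), rng.normal(size=1)
        Ws.append(w); Bs.append(b)
        return sigmoid(X @ w + b)          # hidden-layer output of one random node

    H = new_node()                         # start from a single hidden node
    H_pinv = np.linalg.pinv(H)
    beta = H_pinv @ T
    while H.shape[1] < L_max and np.linalg.norm(H @ beta - T) > eps:
        dH = new_node()                    # add one new random hidden node
        # incremental pseudoinverse update for the augmented hidden matrix [H, dH]
        D = np.linalg.pinv((np.eye(N) - H @ H_pinv) @ dH)
        U = H_pinv @ (np.eye(N) - dH @ D)
        H_pinv = np.vstack([U, D])
        H = np.hstack([H, dH])
        beta = H_pinv @ T                  # output weights updated incrementally
    # prediction: argmax of sigmoid(x @ np.hstack(Ws) + np.concatenate(Bs)) @ beta
    return Ws, Bs, beta
```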

4 Experiments

In this section, we evaluate the performance of our model on large-scale datasets, including English and Chinese text datasets. The experiments were carried out on Ubuntu 14.04 with Python 2.7 and TensorFlow 1.13.1, using an Intel i7 4.0 GHz CPU and 64 GB of DDR4 memory.

4.1 Datasets

In the experiments, for fair comparison we used both English and Chinese large-scale text datasets to test the different models. The English datasets include MR, SST-2, Tweet, AG-News, Yah, DBPedia and Yelp Review Full (Yelp.F); the Chinese datasets include the Sogou News dataset and the Chinese Movie Reviews dataset. Detailed descriptions of these datasets are listed in Table 1.

Table 1 Statistics of English and Chinese datasets

4.2 Experimental settings

The kernel sizes were set to 1, 2, 3, 4, 5, 6 and 8, with 50, 100, 150, 150, 200, 200 and 200 channels, respectively. For fair comparison, the ReLU activation was used in all CNN-based models, the dropout rate was set to 0.5, and the mini-batch size to 32. In the GRU network, the number of hidden units \( m \) was set to 512 as in Ref. [34]. The length of each fully convolutional layer was set to 512 and 1 × 1 kernels were used. We utilized the Adam optimizer instead of stochastic gradient descent (SGD) to pre-train our model, with a learning rate of 0.001 and a dropout rate of 0.5. For EM-ELM, we used the sigmoid activation function. Table 2 reports the maximum number of hidden nodes \( L_{max} \) and the expected learning error for each dataset. The SLFN was initialized with one hidden node, and new random hidden nodes were then added one by one.
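
For reference, the pre-training hyperparameters listed above can be collected into a single configuration (the dictionary keys are our own naming for readability, not identifiers from the implementation):

```python
PRETRAIN_CONFIG = {
    "conv_kernels":  [1, 2, 3, 4, 5, 6, 8],              # kernel widths
    "conv_channels": [50, 100, 150, 150, 200, 200, 200],  # channels per kernel width
    "activation":    "relu",
    "dropout":       0.5,
    "batch_size":    32,
    "gru_hidden":    512,                                  # m, as in Ref. [34]
    "fcl_width":     512,                                  # fully convolutional layers, 1 x 1 kernels
    "optimizer":     "adam",
    "learning_rate": 0.001,
}
ELM_CONFIG = {"activation": "sigmoid"}  # L_max and the target error are per dataset (Table 2)
```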

Table 2 The settings of the hyperparameters in EM-ELM on different datasets

To verify the effectiveness and efficiency of the proposed model, we compared it with traditional classification methods and convolutional neural network classification models. The former include Naive Bayes, Multinomial Naive Bayes (MNB), KNN and Linear-SVM, and the latter include convolutional neural network text classifiers based on the CNN-rand, CNN-static, CNN-non-static and CNN-multichannel methods. We tested all algorithms using 10-fold cross-validation, and all algorithms were evaluated on the same folds.

For the traditional classification methods, on the Chinese datasets we first segmented sentences into words and then removed special characters, such as the space character, and stopwords. For the English datasets, we directly removed special characters and stopwords. Specifically, for each dataset, the bag-of-words model was constructed by selecting the 30,000 most frequent words from the training subset. The counts of each word were used as the term frequency, and the inverse document frequency was set as the logarithm of the ratio between the total number of samples and the number of training samples containing the word. To further reduce the dimensionality of the features, the Linear Discriminant Analysis (LDA) algorithm was applied to obtain low-dimensional vectors. The dimension of the embedding was set to 500 and the final features were normalized by dividing by the largest feature value. Finally, NB, MNB, KNN and SVM were run on the generated low-dimensional features. For KNN, we set k to 10 and used cosine similarity to obtain the k nearest neighbors. Owing to the large amount of training data, we only performed linear SVM using the sequential minimal optimization algorithm, with the penalty parameter C set to its default value of 1. For Multinomial Naive Bayes, we used the same parameters as those in [44].
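
A hedged scikit-learn sketch of this traditional pipeline is given below (our reconstruction, not the original scripts); it builds the 30,000-word TF-IDF representation and the MNB, KNN and linear SVM baselines, while the LDA dimensionality reduction and normalization steps described above are omitted for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# 30,000 most frequent words from the training subset, TF-IDF weighted.
baselines = {
    "MNB":        make_pipeline(TfidfVectorizer(max_features=30000), MultinomialNB()),
    "KNN":        make_pipeline(TfidfVectorizer(max_features=30000),
                                KNeighborsClassifier(n_neighbors=10, metric="cosine")),
    "Linear-SVM": make_pipeline(TfidfVectorizer(max_features=30000), LinearSVC(C=1.0)),
}
# Each pipeline is then evaluated with 10-fold cross-validation, e.g.
# sklearn.model_selection.cross_val_score(pipe, texts, labels, cv=10).
```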

In the CNN-rand model, all word vectors were initialized randomly and optimized during training. For the CNN-static model, the word embeddings were learnt from the training subset of each dataset with skip-gram [21], and the dimension of the word embedding was set to 128 as in Ref. [44]. In CNN-non-static, these word vectors could be fine-tuned. The CNN-multichannel model can be regarded as a combination of CNN-static and CNN-non-static. For the Chinese datasets, we employed the pypinyin package combined with the jieba Chinese segmentation system to produce Pinyin, a phonetic romanization of Chinese, as in Ref. [34]. The proposed model can then be applied to the Chinese datasets without change.

4.3 Experimental results

4.3.1 Experiments on English datasets

We first compared our method with traditional methods and CNN-based models on the English datasets. In this experiment, the number of layers of the highway network was set to 3. The experimental results are listed in Table 3. As can be seen, the CNN-based models achieve better classification accuracy than the traditional methods, because deep models can extract both global and local features through multilayer neural networks. Specifically, our method significantly outperforms both the traditional methods and the existing CNN-based models, achieving the best results on all 7 datasets. The performance of the proposed model is clearly superior to that of the CNN-non-static model, which shows that raw character information is useful for improving text classification. Our method is also much better than the CNN-LSTM hybrid model, which validates the effectiveness of integrating CNN, the highway network, GRU and fully convolutional layers into a unified model. In addition, different from existing CNN-based methods, we leverage the extracted features by means of EM-ELM. Consequently, the proposed model inherits the advantages of both traditional CNN-based deep neural networks and EM-ELM, which contributes to the performance improvement.

Table 3 Performance comparison between different text categorization methods on English datasets (%)

To further validate the effectiveness of our model, we tested different CNN-based text classification models using softmax and EM-ELM, respectively, and report the performance in Table 4. For CNN-EMELM, we replaced the softmax classifier with the EM-ELM classifier on the same network structure. Comparing CNN-softmax and CNN-EMELM, we can see that EM-ELM improves the classification accuracy using the same extracted features as CNN-softmax. In addition, Table 4 shows that our model is clearly superior to the counterparts based on softmax classifiers, and that EM-ELM enhances performance further. These experimental results validate the effectiveness of the proposed model.

Table 4 Performance comparison between CNN-based methods on English datasets (%)

4.3.2 Experiments on Chinese datasets

We further ran the different algorithms on the Chinese datasets to validate the effectiveness and efficiency of the proposed model. The experimental results are listed in Table 5.

Table 5 Performance comparison between different text categorization methods on Chinese datasets (%)

From Table 5, we can draw the same conclusion that the CNN-based models perform better than the traditional classification models on the Chinese datasets. Specifically, the performance of the CNN-rand model is similar to that of the CNN-char-static and CNN-char-non-static models, and is superior to that of Naive Bayes, MNB, KNN and Linear-SVM. The CNN-based models with highway networks outperform those without highway networks. The proposed model performs best among all models, which further validates its effectiveness on Chinese datasets.

The accuracy and convergence curves on the Chinese Movie Reviews dataset are displayed in Figs. 3 and 4, respectively. From these figures, we can see that our model performs better than the standard CNN and the highway network based CNN, achieving superior classification accuracy with a faster convergence speed during training.

Fig. 3 The accuracy curves over iterations on the Chinese Movie Reviews dataset

Fig. 4 The convergence curves over iterations on the Chinese Movie Reviews dataset

Finally, we compared our method with several widely used supervised text classification models, including the character-level convolutional model (char-CNN) [21], region embedding for text classification (Region.emb) [44], the character-based convolutional recurrent network (char-CRNN) [45], the bigram FastText (bigram-FastText) [46], the discriminative LSTM (D-LSTM) [47] and the very deep convolutional network (VDCNN) [48]. The experimental results are reported in Table 6. As can be seen, our method achieves the best result on 4 of the datasets. For the Yah dataset, the classification accuracy of our method on the test set is very close to that of Region.emb. On AG, DBPedia, Yah and Yelp.F, the performance of the proposed method is much better than that of the other methods. Notably, all algorithms show unsatisfactory classification performance on Yah and Yelp.F.

Table 6 Performance comparison with the state-of-the-art methods on several datasets (%)

To analyze the stability of our method, we also report the results of several repeated runs on Yah and Yelp.F in Tables 7 and 8, respectively. Five independent runs were conducted on each of these datasets; the standard deviations are within 0.051 and the maximum accuracy variations are within 0.13%, indicating that our method remains stable even when the accuracy is relatively low. Overall, our method is superior to the state-of-the-art algorithms on large-scale datasets.

Table 7 Performance variance through several repeated runs on Yah
Table 8 Performance variance through several repeated runs on Yelp F

5 Conclusion

In this paper, we proposed a character-level text categorization model based on a convolutional neural network, a highway network and gated recurrent units, which can efficiently extract both global and local textual semantics. In addition, fully convolutional layers are introduced to substantially reduce the large number of parameters arising from the original fully connected layers, so that the convergence of the model is significantly accelerated. Furthermore, combined with the error minimization extreme learning machine, the extracted features are leveraged to improve classification accuracy. Experimental results validate that the proposed method achieves satisfactory classification performance with a faster learning speed.