
1 Introduction

Most of the data generated on the Internet is stored as text, so text data occupy an important position. Manually organizing and managing textual information can no longer keep pace with the ever-expanding digital information of the Internet age. Given such a large amount of data in so many forms, finding the right method to manage and use these text data effectively is very important. Efficient feature extraction and text representation are challenges for text classification, and they are the first problems that must be solved.

Most early text representation methods were vector space models. Later, most researchers adopted the distributed representation of words [1], which was proposed by Hinton in 1986 and overcomes the shortcomings of the one-hot representation. In 2003, Bengio proposed using a three-layer neural network to train a text representation model [2]. In 2008, Hinton used hierarchical ideas to improve the training process from the hidden layer to the output layer in Bengio's method, speeding up model training [3]. In 2013, Mikolov proposed a neural network model to train distributed word vectors; the training tool word2vec implemented from this model has been widely used [4, 5]. In 2014, Hu used a convolutional neural network to extract semantic combination information from local words in a sentence [6].

Traditional text representation methods, such as Boolean models and vector space models, suffer from data sparseness and the curse of dimensionality. With the rapid development of machine learning and deep learning, researchers have begun to use various neural networks to construct text representation models and map texts to low-dimensional continuous vectors, improving the models' representation ability. However, existing neural-network text representation models still have problems. First, although a neural network captures better semantic information for the text representation, its ability to distinguish classes is lacking. Second, existing methods for extracting text features with a convolutional neural network take the length of the longest text in the data set as a benchmark, and texts shorter than this length are padded with special characters. Introducing so many non-semantic characters distorts the original information of the text and leads to a poor classification effect.

2 Related Work

Text representation is an important step in text classification. The original texts are unstructured data, so a suitable representation must be found to convert the text content into information that a computer can process. Text representation mainly involves two aspects, representation and calculation, which refer respectively to feature selection and feature extraction, and to the computation of weights and semantic similarity [7]. The Vector Space Model (VSM) was proposed by Salton in the 1970s [8]. It simplifies the processing of text content into vector operations in a vector space and expresses the similarity of text semantics by calculating spatial similarity. VSM is a common and very classic text representation, but it suffers from the curse of dimensionality.

The Convolutional Neural Network (CNN) is a deep neural network that has made major breakthroughs in computer vision and speech recognition and is widely used in image understanding [9, 10]. In recent years, convolutional neural networks have also been applied to natural language processing tasks such as text classification and element identification [11]. Wang proposed a semi-supervised convolutional neural network to enhance the semantic relevance of the context [12]. The C-LSTM model proposed by Zhou uses a convolutional neural network to extract sentence features and a long short-term memory recurrent neural network to obtain sentence representations [13]. Lai proposed a recurrent convolutional neural network for text classification, which introduces less noise than traditional neural networks [14].

3 Methods

3.1 Word2vec

Word2vec can train word vectors quickly and efficiently. There are two word2vec models: the CBOW model and the Skip-gram model. The CBOW model uses the c words before and after the word w(t) to predict the current word, whereas the Skip-gram model does the opposite: it uses the word w(t) to predict the c words before and after it. The two training methods are shown on the left and right of Fig. 1, respectively. This paper uses the CBOW model to train word vectors.

Fig. 1. CBOW and Skip-gram
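As a concrete illustration, the following minimal sketch trains CBOW word vectors with the gensim toolkit; gensim and all parameter values are assumptions made for the example, since the paper only states that the CBOW model of word2vec is used.

```python
# Minimal CBOW training sketch (assumed gensim 4.x API; corpus is illustrative).
from gensim.models import Word2Vec

# Each text is assumed to be pre-tokenized into a list of words.
corpus = [
    ["the", "stock", "market", "rose", "sharply", "today"],
    ["the", "home", "team", "won", "the", "match"],
]

# sg=0 selects the CBOW architecture; window is the number of context words c
# on each side; vector_size is the dimension of the trained word vectors.
model = Word2Vec(sentences=corpus, vector_size=300, window=5, sg=0, min_count=1)

vector = model.wv["market"]   # the 300-dimensional vector of one word
```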

3.2 Convolutional Neural Network

Convolution Layer

The convolutional layer is also called the feature extraction layer. This layer is the core part of the convolutional neural network and describes the local characteristics of the input data. The convolution kernel \( w \in Q^{hk} \) used in the convolution operation [11] generates a new feature value each time it passes over a word-sequence window of height \( h \) and width \( k \). For example, a feature value \( c_{i} \) in a feature map is the result of applying the convolution operation to the window \( x_{i:i + h - 1} \); that is, each feature value is obtained by formula 1:

$$ c_{i} = f(w \cdot x_{i:i + h - 1} + b) $$
(1)

where \( x_{i} \in Q^{k} \); \( w \) is the weight parameter of the convolution kernel; \( b \) is the offset term of the convolution layer; and \( f \) is a nonlinear activation function.

When training a convolutional neural network, a sliding stride must be set for the convolution kernel; it can be 1, 2 or more. The convolution kernel convolves the input data to obtain a feature map, as shown in formula 2:

$$ \mathbf{c} = [c_{1}, c_{2}, \cdots, c_{s - h + 1}] $$
(2)
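For illustration, the following sketch computes one feature map according to formulas 1 and 2 with NumPy; the variable names, the tanh activation, and the random inputs are assumptions made for the example, not choices stated in the paper.

```python
import numpy as np

def conv_feature_map(x, w, b, h):
    """x: (s, k) text matrix with one k-dimensional word vector per row;
    w: (h, k) convolution kernel; returns the feature map c of length s-h+1."""
    s = x.shape[0]
    c = np.empty(s - h + 1)
    for i in range(s - h + 1):
        window = x[i:i + h]                     # word window x_{i:i+h-1}
        c[i] = np.tanh(np.sum(w * window) + b)  # c_i = f(w . x + b), with f = tanh
    return c

x = np.random.randn(50, 300)             # 50 words, 300-dimensional word vectors
w = np.random.randn(3, 300)              # kernel of height h = 3
c = conv_feature_map(x, w, b=0.1, h=3)   # feature map with 50 - 3 + 1 = 48 values
```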

Pooling Layer

The pooling layer is generally placed between two consecutive convolution layers. It down-samples the feature maps output by the convolutional layer, aggregates their statistics, simplifies the information passed on by the convolutional layer, and reduces the number of features and network parameters.
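A minimal max-over-time pooling sketch is shown below; max pooling is a common choice for text CNNs, although the paper does not state which pooling function it uses, so this is an assumption.

```python
import numpy as np

# One feature map per convolution kernel (lengths differ with kernel height).
feature_maps = [np.random.randn(48), np.random.randn(47), np.random.randn(46)]

# Max-over-time pooling keeps only the strongest activation of each map,
# reducing each feature map to a single value regardless of text length.
pooled = np.array([fm.max() for fm in feature_maps])
```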

3.3 The Idea of the CD_STR Model

The main idea of IDF in the TF-IDF algorithm is that the fewer documents contain the feature word \( t \), the larger the IDF, indicating that \( t \) has better category discrimination ability. However, the simple structure of IDF in TF-IDF cannot effectively reflect the importance of words or the distribution of feature words. The CD_STR model first considers that if a word appears in every text, and its frequency of occurrence in each text or in each class of texts does not differ much, then the word contributes very little to distinguishing categories and should be filtered out or given a small weight. Conversely, feature words with uneven distributions should be given higher weights.

Suppose there are three feature words \( t_{1} \), \( t_{2} \), \( t_{3} \) whose distributions over the three categories \( c_{1} \), \( c_{2} \), \( c_{3} \) are (8, 8, 8), (5, 8, 5), and (1, 8, 5), respectively. Then the weights of these three feature words should increase in turn, because the frequency of \( t_{3} \) across the categories is more uneven than that of \( t_{1} \) and \( t_{2} \), so \( t_{3} \) is better at distinguishing categories.

CD_STR optimizes the TF-IDF algorithm mainly according to the distribution of feature words in each category; the improved weighting is given by formula 3:

$$ G(t, d) = \frac{tf_{ij}(t, d) \times idf_{i}(t) \times \sum\nolimits_{k} p(t \mid c_{k})^{2} \, p(c_{k} \mid t)^{2}}{\sqrt{\sum\nolimits_{i = 1}^{n} [tf_{ij}(t, d) \times idf_{i}(t)]^{2}}} $$
(3)

Among them, \( m_{ij} \) is the number of occurrences of feature word \( t_{i} \) in the text \( d_{j} \); \( n_{i} \) is the total number of texts containing feature word \( t_{i} \); \( N \) is the total number of texts in the corpus; \( F \) is a normalization factor; \( \sum\nolimits_{k} m_{k,j} \) is the total number of feature words in the text \( d_{j} \); \( p(t \mid c_{k}) \) is the probability that the feature word \( t \) appears in the category \( c_{k} \); and \( p(c_{k} \mid t) \) is the conditional probability of the category \( c_{k} \) given that the feature word \( t \) appears.

Then the optimized TF-IDF weighting is combined with word2vec: each feature word vector trained by word2vec is assigned the weight from formula 3, and the weighted word vectors are accumulated dimension by dimension to obtain the vector representation of each text. That is, each text vector is computed according to formula 4:

$$ GW(d) = \sum\nolimits_{t \in d} G(t, d) \times Word2vec(t) $$
(4)
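A minimal sketch of formula 4 is given below; g_weight stands in for the G(t, d) weighting of formula 3 and the word-vector lookup is assumed to come from a trained word2vec model, so all names are illustrative.

```python
import numpy as np

def cd_str_vector(tokens, word_vectors, g_weight):
    """tokens: the words of one text d; word_vectors: dict word -> np.ndarray;
    g_weight(t, tokens): returns the weight G(t, d) of feature word t in this text."""
    dim = len(next(iter(word_vectors.values())))
    vec = np.zeros(dim)
    for t in set(tokens):                                  # each feature word t in d
        if t in word_vectors:
            vec += g_weight(t, tokens) * word_vectors[t]   # G(t, d) x Word2vec(t)
    return vec
```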

3.4 The Idea of the CD_MTR Model

The TF-IDF weighted vector space model, the LSI model, and the CD_STR model each have their own advantages and express the text information in different ways. Therefore, the three single-text representation models are combined so that they complement each other and better express the content of the text. Thus, a text representation model that integrates multiple models (CD_MTR) is proposed.

The main idea of the CD_MTR model is that each single-text representation model selects an appropriate dimension to vectorize the original text, yielding three different sets of text representation vectors. The concatenation of the text vectors corresponding to the same text in these sets is taken as the final text vector, as shown in formula 5:

$$ CD\_MTR(d_{i}) = LSI(d_{i}) \oplus T\_VSM(d_{i}) \oplus CD\_STR(d_{i}) $$
(5)

Among them, \( d_{i} \) is the ith text in the data set \( D \); \( \oplus \) is the splicing operator; \( LSI(d_{i}) \) is the text vector obtained by training the LSI model on text \( d_{i} \); \( T\_VSM(d_{i}) \) is the vector obtained by the TF-IDF weighted vector space model; and \( CD\_STR(d_{i}) \) is the text vector obtained by the CD_STR model for text \( d_{i} \).
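Formula 5 amounts to a simple concatenation of the three single-model vectors of the same text, as in the sketch below; the dimensions are illustrative placeholders rather than values prescribed by the model.

```python
import numpy as np

lsi_vec   = np.random.randn(400)     # LSI(d_i)
tvsm_vec  = np.random.randn(1000)    # T_VSM(d_i), the TF-IDF weighted VSM vector
cdstr_vec = np.random.randn(400)     # CD_STR(d_i)

cd_mtr_vec = np.concatenate([lsi_vec, tvsm_vec, cdstr_vec])   # the spliced CD_MTR vector
```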

3.5 Structural Improvements for Convolutional Neural Network (MCNN)

In general, each convolution layer contains convolution kernels of only one size, and the number of kernels is set to a fixed value. For text data, in order to take the contextual information of each feature word into consideration, convolution kernels of several different sizes can be designed. In this paper, three kernel sizes are used, 3 × 360, 4 × 360 and 5 × 360, with 150, 100, and 50 kernels respectively.
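A PyTorch sketch of this multi-kernel structure is given below; the framework, the ReLU activation, and the classifier head are assumptions made for the example, but the kernel sizes and counts follow the ones stated above.

```python
import torch
import torch.nn as nn

class MCNN(nn.Module):
    """Three parallel convolution branches with kernel heights 3, 4 and 5 over
    360-wide rows, using 150, 100 and 50 kernels, followed by max-over-time pooling."""
    def __init__(self, num_classes, width=360):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(1, n_kernels, kernel_size=(h, width))
            for h, n_kernels in [(3, 150), (4, 100), (5, 50)]
        ])
        self.fc = nn.Linear(150 + 100 + 50, num_classes)

    def forward(self, x):                                 # x: (batch, 1, rows, 360), rows >= 5
        pooled = []
        for conv in self.branches:
            c = torch.relu(conv(x)).squeeze(3)            # (batch, n_kernels, rows - h + 1)
            pooled.append(torch.max(c, dim=2).values)     # max-over-time pooling
        return self.fc(torch.cat(pooled, dim=1))
```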

3.6 CD_MTR Combined with Convolutional Neural Network (MTR_MCNN)

For the input features of a convolutional neural network, the references [11, 15,16,17] use the length of the longest text in the data set as a benchmark, and all texts shorter than this length are padded with special characters. This padding scheme is the most commonly used way of preparing input data for text classification with convolutional neural networks. It has two disadvantages:

(1) Taking the length of the longest text in a data set as a benchmark, texts shorter than this length are padded with special characters. Short texts then contain too many non-semantic characters, which affects the classification effect.

(2) The text represented by a single model is used as the input of the convolutional neural network. The feature representation of the text is relatively limited, which is not conducive to text classification.

In order to overcome the drawbacks of the above method, a method combining the CD_MTR model with a convolutional neural network (MTR_MCNN) is proposed. The text vectors obtained by this method all have the same dimension, so no padding with special characters is needed and each text retains its original semantic information. Moreover, the text vector trained by the CD_MTR model expresses each text in multiple ways as the input of the convolutional neural network, allowing the network to extract deeper features and achieve better classification results.
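Under these assumptions, the input preparation could look as follows; reshaping each 1800-dimensional CD_MTR vector into a 5 × 360 matrix is an assumption consistent with the 360-wide kernels of Sect. 3.5, not a shape stated explicitly in the paper.

```python
import torch

cd_mtr_vectors = torch.randn(32, 1800)         # a batch of 32 CD_MTR text vectors
inputs = cd_mtr_vectors.view(32, 1, 5, 360)    # matrix form for the convolutional network, no padding needed
```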

4 Experimental Design

4.1 Experimental Data Sets

In order to verify the performance of the CD_MTR model on text classification, two classification data sets were selected for the experiments. A total of 24,000 texts were selected from 6 categories of the NetEase News Corpus: automotive, culture, economics, medicine, military, and sports. The Fudan Text Classification Corpus contributes 7,691 texts in 8 categories: art, history, computer, environment, agronomy, economics, politics, and sports. The two corpora differ in the number of categories and in the proportion of texts per category. The experiments use ten-fold cross validation to evaluate the effectiveness of the methods.

4.2 The Influence of Single-Text Representation Model Dimension on the Effect of CD_MTR Model

The CD_MTR model proposed in this paper combines the three single models T_VSM, LSI and CD_STR. For the CD_MTR model to represent text well, suitable dimensions must be chosen for T_VSM, LSI, and CD_STR before fusion. The effect of the CD_MTR model was therefore tested on the two data sets with the three single models taking different dimensions (the number of topics for LSI is 400). This paper tests combinations of the three single models with dimensions selected from [100, 200, 300, 400, 500, 1000, 1500]. Because there are too many combinations, only a part of them is reported here, as shown in Figs. 2, 3, and 4.

Fig. 2. The effect of the LSI and CD_STR dimensions on the CD_MTR classification effect when the T_VSM dimension is 100

Fig. 3. The effect of the T_VSM and CD_STR dimensions on the CD_MTR classification effect when the LSI dimension is 100

Fig. 4. The effect of the T_VSM and LSI dimensions on the CD_MTR classification effect when the CD_STR dimension is 100

The results in Figs. 2, 3 and 4 show that changes in the dimensions of the three single models T_VSM, LSI, and CD_STR affect the text representation ability of the fusion model CD_MTR. Considering the classification effect and classification speed of the four models T_VSM, LSI, CD_STR and CD_MTR in different dimensions, the dimensions of T_VSM, LSI and CD_STR are set to 1000, 500 and 400, respectively. The number of LSI topics was set to 400, so the dimension of the CD_MTR model is 1800.

4.3 Text Representation Model Comparison and Analysis

In order to verify the validity of the CD_MTR model, we compared it with different models: A_word2vec (the average of the word vectors of each text), T_word2vec (TF-IDF+word2vec), CD_STR, LDA fused with word2vec (LDA+word2vec), T_VSM fused with LSI (T_VSM+LSI), LSI fused with CD_STR (LSI+CD_STR), and T_VSM fused with CD_STR (T_VSM+CD_STR). The classification effect of each model is shown in Table 1.

Table 1. Classification effect of each model on two datasets

The results from Table 1 show that:

(1) The micro-average F1 value and the macro-average F1 value obtained by CD_STR on the two data sets are superior to those of the single models A_word2vec and T_word2vec. This result verifies that the CD_STR model considers the influence of a single word on the whole document and has better class discrimination ability.

(2) Compared with the other combined models, the CD_MTR model presented in this paper improves both the micro-average F1 value and the macro-average F1 value.

4.4 Ten-Fold Cross-Validation Results

In order to test the effectiveness of the MTR_MCNN method proposed in this paper, ten-fold cross validation was used. This method divides the data set into 10 equal and disjoint sub-samples. In each round, one sub-sample is used as the test data and the other 9 are used for training; each sub-sample serves as the test set exactly once, so the cross validation is repeated 10 times. Figure 5 shows the classification effect on the NetEase news texts and the Fudan texts under the different sub-samples.

Fig. 5. Ten-fold cross-validation results of the different methods
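The ten-fold protocol corresponds to a standard split such as the scikit-learn sketch below; the library, the random data and the classifier placeholder are assumptions made only to illustrate the splitting logic.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.random.randn(24000, 1800)          # e.g. CD_MTR vectors of the NetEase texts
y = np.random.randint(0, 6, size=24000)   # labels for the 6 categories

for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=10, shuffle=True).split(X)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # train the classifier on (X_train, y_train) and record its accuracy on (X_test, y_test)
```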

The results in Fig. 5 show that:

(1) Over the ten different training set/test set pairs, CD_STR+MCNN performs better than CD_STR+CNN in most cases, and MTR_MCNN is superior to CD_MTR+CNN.

(2) The classification accuracy of CD_MTR+CNN is better than that of CD_STR+CNN, and the classification accuracy of MTR_MCNN is also superior to that of CD_STR+MCNN. This shows that, under the same convolutional neural network structure, the fusion model CD_MTR presented in this paper can better represent text information, distinguish categories, and improve the accuracy of text classification.

(3) The word2vec_padding+CNN model performs better under certain training set/test set pairs, but its classification accuracy is lower than that of the MTR_MCNN method presented in this paper. In the experiments, the word2vec_padding+CNN method needs to pad most of the texts with special characters, so the padded text information deviates from the original text information, which affects the classification effect. In this method, the dimension of the matrix fed into the convolutional neural network is the number of words per text multiplied by the dimension of each feature word; when the number of feature words is large, the input matrix becomes very large, which slows down training.

4.5 Method Comparison and Analysis

To further verify the validity of the MTR_MCNN method, this paper compares it with other classification methods. Table 2 shows the average classification accuracy of each method under ten-fold cross validation.

Table 2. Classification accuracy of different text classification methods (%)

From the results in Table 2, we can conclude:

(1) The text vector represented by the single model CD_STR yields lower classification accuracy than the CD_MTR model proposed in Sect. 3.4, whether it is fed into the original or the improved convolutional neural network. This again demonstrates the performance of the CD_MTR model.

(2) The MTR_MCNN method proposed in this paper achieves the highest classification accuracy among all the methods, reaching 96.70% and 97.87% on the two data sets, and it is also more effective than the commonly used word2vec_padding+CNN method.

These results are obtained mainly because the MTR_MCNN method introduces a convolutional neural network to deepen the otherwise shallow feature extraction of the CD_MTR method. The MTR_MCNN method designs convolution kernels of different sizes and numbers, which extract text features from different angles. It uses the CD_MTR model to vectorize the text and converts the resulting text vectors into matrix form as the input of the convolutional neural network. In addition, the MTR_MCNN method does not need to pad shorter texts with special characters, which would distort the expression of the original text information, so it improves the text classification accuracy.

5 Conclusion

This paper proposes a single model, CD_STR, and a fusion model, CD_MTR, in which word2vec vectors weighted by an optimized TF-IDF scheme and carrying semantic information are combined with the LSI model and the TF-IDF weighted vector space model so that the models complement each other. Based on the CD_MTR model, the classification method MTR_MCNN, which combines CD_MTR with a convolutional neural network, is also proposed. In this method, the convolutional neural network structure is improved, and convolution kernels of different sizes and numbers are designed to extract text features from different angles. In addition, the text vectors obtained from CD_MTR are converted into matrix form as the input of the convolutional neural network. The MTR_MCNN method does not need special characters to pad the text, which avoids meaningless additional text information and reduces the dimension of the input matrix, thereby improving the training speed of the convolutional neural network. The experimental results show that the CD_STR model, the CD_MTR model, and the MTR_MCNN method all achieve good classification effects and are superior to the other methods.