1 Introduction

The development of computer technology and the explosive growth of text data have increased the text processing workload [8] and promoted research in many subfields of natural language processing (NLP). Text classification, which refers to dividing texts into different categories according to their topics, content, and attributes, has gradually become a hot issue. Furthermore, multi-label text classification assigns one or more category labels to each text and is widely used in sentiment analysis, information retrieval [9], news subject classification, and spam detection [20].

However, there are still many problems to be solved in multi-label text classification. For example, the appearance of new words reduces the accuracy of the classifier, the large volume of textual data affects the convergence of neural networks, and some semantic information is ignored during text quantization. Besides, due to the rapid growth of unstructured texts from social networks, mail systems and other platforms, text mining, as the basis of text classification, has also attracted great attention from researchers.

To address the above issues, we propose a novel text classification method that combines a dynamic semantic representation model with a deep neural network classifier (DSRM-DNN). The proposed method improves classification performance by updating the text mining technology and optimizing the text classifier. In summary, the contributions of this paper are as follows:

  • A dynamic semantic representation model is proposed to quantify the texts. The model combines a word embedding model and a clustering algorithm to select semantic words, and then utilizes the bag of words (BOW) and word attributes to obtain textual features. The dynamic semantic representation not only greatly reduces the complexity of the existing BOW, but also lays the foundation for the representation of new words.

  • A sparse constraint is adopted to build the relationship between new words and the existing semantic words. By utilizing the semantic words in DSRM to obtain the sparse representation of low-frequency words and new words, more words can be quantified and used in the training data.

  • An efficient multi-label text classifier is constructed based on deep belief network (DBN) and BP network. DBN is used to preprocess the input of the BP network, which reduces the risk of falling into local optimum and accelerates the convergence of the classifier network.

The rest of the paper is organized as follows. Section 2 introduces related work on multi-label text classification. In Section 3, the details of the proposed method are described. Experimental results and discussions are presented in Section 4. Finally, Section 5 summarizes the work and gives directions for further research.

2 Related work

Compared with single-label classification tasks, multi-label text classification has two characteristics. On the one hand, the corpus cannot be directly input into the classifier; the texts must first be quantified. On the other hand, different from multi-class classification, each sample in multi-label data can belong to several target categories that are not mutually exclusive. The final performance of multi-label text classification is jointly affected by the text quantization method and the multi-label text classifier.

2.1 Text quantization

The text quantization methods [5, 6, 13, 15] can be roughly divided into three categories: traditional language models, sequence or structured language models, and attention-based language models. The traditional language models based on BOW [14] are widely used in the data pre-processing of various research fields, such as text classification, cross-modal retrieval [10, 23], and so on. The second kind of language model quantifies texts with sequence or structured models, mainly including convolutional neural networks, recurrent neural networks, recursive auto-encoders, etc. The third kind of language model builds quantification features based on attention mechanisms that score words or sentences differentially; common attention mechanisms are hierarchical attention and self-attention.

In the traditional representation models, textual data are usually treated as unordered word sets that ignore grammar and word order, but they are widely used in NLP because of their simplicity and efficiency. To reduce resource consumption, some representative words in the corpus are selected as the basic elements of BOW. Statistical methods and topic models [33] are commonly used for keyword selection. The statistical method mainly extracts keywords according to word attributes (i.e. term frequency, TF-IDF, etc.), and the topic model is based on the following two assumptions: each document is a mixed distribution of several topics, and each topic follows a probability distribution over words. At present, the LDA topic model, which combines the PLSA model with a Dirichlet prior distribution, is widely used. In addition to the above two methods, methods based on graph theory [32] and complex networks can also be used to extract keywords.

2.2 Multi-label classifier

With the development of neural network research, deep learning has been extensively studied and widely used in many areas. The construction of text classifiers can be divided into two categories: one is based on the traditional machine learning algorithm, the other is based on deep neural network.

Before the emergence of deep neural networks, most studies were based on traditional machine learning algorithms, that is, on manually defined algorithms and hand-constructed experimental models. At present, a large number of machine learning methods are applied in text categorization systems, such as Naive Bayes [26], based on Bayesian theory and the conditional independence assumption of features; the support vector machine (SVM), based on statistical (VC) learning theory and the structural risk minimization principle; and decision tree methods [4], which obtain the optimal scheme by comparing different schemes using probabilities and tree structures from graph theory. In addition, there are k-nearest neighbor classification models [3], classification methods based on association rules, and so on [11, 21, 22]. Although the time complexity of the above models is low, the context and latent semantic relationships of text words are not fully considered.

Deep neural networks [1] can mine more complex text semantics and provide a new research direction for NLP. At present, convolutional neural network [7, 12], back-propagation network and recurrent neural network have been applied in text classification and achieved good experimental results.

Fig. 1 The framework of the text classification

3 The proposed method

In this section, we introduce the overall process of DSRM-DNN shown in Fig. 1. The proposed method consists of the text quantization module and the classifier module. The text quantization module is divided into three subtasks: semantic word extraction, word feature construction, and dynamic representation. First, DSRM-DNN selects semantic words by combining word embedding model and clustering algorithm. Then the selected words are taken as the elements of DSRM-DNN and quantified by the weighted combination of word attributes. During the classification process, DSRM-DNN can represent the low-frequency words and new words according to the existing words in DSRM-DNN through sparse representation. In the classifier module, deep belief network and back-propagation neural network are utilized to construct a text classifier.

3.1 Semantic word extraction

Before extracting semantic words, we preprocess the corpus \( \overset{\sim }{D} \) by unifying the format, performing word segmentation, and eliminating stopwords. We thereby obtain the preprocessed corpus D and the word set Tj corresponding to the j-th text.
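As an illustration of this preprocessing step, a minimal sketch is given below; it assumes NLTK (with its 'punkt' and 'stopwords' resources downloaded), and the function and variable names such as `preprocess` and `raw_texts` are illustrative:

```python
# Minimal preprocessing sketch: lower-casing, tokenization, stopword removal.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))

def preprocess(raw_texts):
    """Return the preprocessed corpus D as a list of word lists T_j (one list per text)."""
    corpus = []
    for text in raw_texts:
        tokens = word_tokenize(text.lower())  # uniform format + word segmentation
        words = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]  # stopword elimination
        corpus.append(words)
    return corpus
```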

During word extraction, the word embedding model is trained on the word sets. Specifically, the skip-gram model based on hierarchical softmax is adopted to obtain the embedding vectors, so that the text semantics are quantized and stored in the embedding vectors. In the hierarchical softmax, each leaf node of the binary (Huffman) tree corresponds to a word, and each non-leaf node is equivalent to a perceptron that outputs 0 or 1. Each word in the dataset is therefore represented by a unique code, and its code sequence corresponds to the sequence of binary decisions along the path from the root to its leaf node. Skip-gram is a language model that predicts the context from the current word, and Fig. 2 shows the model. For ease of reference, Table 1 lists the major notations used in the embedding model and their mathematical meanings.

Table 1 Summary of symbols in the word embedding model
Fig. 2 a The skip-gram model concept diagram with window size 5, where w(t) is the word at the current position t in the text and the other marks are defined in the same way. b The skip-gram network structure diagram, where v(w) is the vector representation of w
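As a sketch of this training step, the skip-gram model with hierarchical softmax can be trained with gensim; the parameter values below are illustrative (the minimum word frequency of 3 matches the setting used in Section 4.4.1), and `corpus` stands for the preprocessed word sets T_j from Section 3.1:

```python
# Sketch: training skip-gram embeddings with hierarchical softmax (gensim >= 4.x API).
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=corpus,   # the preprocessed word sets T_j
    vector_size=100,    # embedding dimension e
    window=5,           # context window size c
    sg=1,               # skip-gram
    hs=1, negative=0,   # hierarchical softmax instead of negative sampling
    min_count=3,        # ignore words occurring fewer than 3 times
)
v_w = model.wv["economy"]  # embedding vector v(w) of a word, if it is in the vocabulary
```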

The learning objective of the embedding model can be written as the maximum likelihood function:

$$ \mathcal{L}=\sum \limits_{w\in c}\log p\left( Context(w)|w\right) $$
(1)

The probability function of the language model is

$$ p\left( Context(w)|w\right)=\prod \limits_{u\in Context(w)}p\left(u|w\right) $$
(2)

where w is a word in the corpus and u is a word in the context of w. The context words u are assumed to be conditionally independent of each other. In hierarchical softmax, each conditional probability is decomposed along the Huffman-tree path of u, so we get:

$$ p\left(u|w\right)=\prod \limits_{j=2}^{l^u}p\left({d}_j^u|v(w),{\theta}_{j-1}^u\right) $$
(3)

where each multiplier in (3) is logistic regression:

$$ p\left({d}_j^u|v(w),{\theta}_{j-1}^u\right)=\left\{\begin{array}{ccc}\sigma \left(v(w),{\theta}_{j-1}^u\right),& & {d}_j^u=0\\ {}1-\sigma \left(v(w),{\theta}_{j-1}^u\right),& & {d}_j^u=1\end{array}\right. $$
(4)

Given that d is 0 or 1, we can easily express (4) in exponential form:

$$ p\left({d}_j^u|v(w),{\theta}_{j-1}^u\right)={\left[\sigma \left(v(w),{\theta}_{j-1}^u\right)\right]}^{1-{d}_j^u}\cdot {\left[1-\sigma \left(v(w),{\theta}_{j-1}^u\right)\right]}^{d_j^u} $$
(5)

Substituting (2), (3), (5) into (1), we get the new objective function:

$$ {\displaystyle \begin{array}{rcl}\mathcal{L}& =& \sum \limits_{w\in c}\sum \limits_{u\in Context(w)}\sum \limits_{j=2}^{l^u}\Big\{\left(1-{d}_j^u\right)\cdot \mathit{\log}\left[\sigma \left(v(w),{\theta}_{j-1}^u\right)\right]\\ {}& & +{d}_j^u\cdot \mathit{\log}\left[1-\sigma \left(v(w),{\theta}_{j-1}^u\right)\right]\Big\}\end{array}} $$
(6)

Each item in the objective function is denoted as:

$$ {\displaystyle \begin{array}{rcl}\mathcal{L}\left(w,u,j\right)& =& \left(1-{d}_j^u\right)\cdot \mathit{\log}\left[\sigma \left(v(w),{\theta}_{j-1}^u\right)\right]\\ {}& & +{d}_j^u\cdot \mathit{\log}\left[1-\sigma \left(v(w),{\theta}_{j-1}^u\right)\right]\end{array}} $$
(7)

u is fixed for a given training instance w and its context Context(w), so the only variables in (7) are v(w) and \( {\theta}_{j-1}^u \).

The partial derivatives of \( \mathcal{L}\left(w,u,j\right) \) with respect to v(w) and \( {\theta}_{j-1}^u \) are given as follows:

$$ \frac{\partial \mathcal{L}\left(w,u,j\right)}{\partial {\theta}_{j-1}^u}=\left[1-{d}_j^u-\sigma \left(v(w),{\theta}_{j-1}^u\right)\right]v(w) $$
(8)
$$ \frac{\partial \mathcal{L}\left(w,u,j\right)}{\partial v(w)}=\left[1-{d}_j^u-\sigma \left(v(w),{\theta}_{j-1}^u\right)\right]{\theta}_{j-1}^u $$
(9)

Therefore, we get the update functions for v(w) and \( {\theta}_{j-1}^u \):

$$ {\theta}_{j-1}^u={\theta}_{j-1}^u+\eta \left[1-{d}_j^u-\sigma \left(v(w),{\theta}_{j-1}^u\right)\right]v(w) $$
(10)
$$ v(w)=v(w)+\eta \sum \limits_{u\in Context(w)}\sum \limits_{j=2}^{l^u}\frac{\partial \mathcal{L}\left(w,u,j\right)}{\partial v(w)} $$
(11)

where η is the learning rate.
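The following NumPy sketch applies the update rules (10) and (11) for a single pair consisting of a word w and one of its context words u; the Huffman-path codes d_j^u and inner-node vectors θ_{j-1}^u are assumed to be given, and the function name is illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(v_w, path_thetas, path_codes, eta):
    """One hierarchical-softmax update for word w and a single context word u.
    v_w         : embedding vector v(w), updated in place
    path_thetas : inner-node vectors theta^u_{j-1} along the Huffman path of u
    path_codes  : codes d^u_j in {0, 1} along the same path
    eta         : learning rate
    """
    grad_v = np.zeros_like(v_w)
    for theta, d in zip(path_thetas, path_codes):
        g = 1.0 - d - sigmoid(np.dot(v_w, theta))  # common factor in (8) and (9)
        grad_v += g * theta                        # accumulate the gradient of (9)
        theta += eta * g * v_w                     # update (10), in place
    v_w += eta * grad_v                            # contribution of this context word to (11)
    return v_w
```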


After several update iterations, we obtain the embedding vector v(w) of each word and the feature matrix Ti, where each column of the feature matrix is the feature representation of a word. Then clustering analysis [16,17,18] is carried out for each text, and words with similar semantics are grouped into the same class. During each clustering process, the clustering centers are extracted as the semantic words of the text.

Assume that the set of embedding vectors of a text is {w1,w2,⋯ ,wn}. S is defined as the similarity matrix between samples, and s(i,j) > s(i,k) if and only if the similarity between wi and wj is greater than that between wi and wk. In addition, we define R as the responsibility matrix and A as the availability matrix, where r(i,k) describes the suitability of wk to serve as the clustering center of wi and a(i,k) describes the suitability of wi to select wk as its clustering center; the elements of R and A are initialized to 0.

The clustering algorithm is implemented by iteratively updating the responsibility r(i,k) and availability a(i,k):

$$ {\mathbf{R}}_{t+1}\left(i,k\right)=\left(1-\lambda \right){\hat{\mathbf{R}}}_{t+1}\left(i,k\right)+\lambda {\mathbf{R}}_t\left(i,k\right) $$
(12)

where \( {\hat{\mathbf{R}}}_{t+1}\left(i,k\right)=\left\{\begin{array}{ccc}\mathbf{S}\left(i,k\right)-\underset{j\ne k}{\max}\left\{{\mathbf{A}}_t\left(i,j\right)+\mathbf{S}\left(i,j\right)\right\},& & i\ne k\\ {}\mathbf{S}\left(i,k\right)-\underset{j\ne k}{\max}\left\{\mathbf{S}\left(i,j\right)\right\},& & i=k\end{array}\right. \) is the undamped update and λ is the damping coefficient.

$$ {\mathbf{A}}_{t+1}\left(i,k\right)=\left(1-\lambda \right){\hat{\mathbf{A}}}_{t+1}\left(i,k\right)+\lambda {\mathbf{A}}_t\left(i,k\right) $$
(13)

where \( {\hat{\mathbf{A}}}_{t+1}\left(i,k\right)=\left\{\begin{array}{ccc}\min \left\{0,{\mathbf{R}}_{t+1}\left(k,k\right)+{\sum}_{j\ne i,k}\max \left\{0,{\mathbf{R}}_{t+1}\left(j,k\right)\right\}\right\},& & i\ne k\\ {}{\sum}_{j\ne k}\max \left\{0,{\mathbf{R}}_{t+1}\left(j,k\right)\right\},& & i=k\end{array}\right. \) is the undamped update. The algorithm stops when the clustering centers remain unchanged over several consecutive iterations or when the number of iterations exceeds a predefined maximum.
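These update rules are those of affinity propagation clustering; a sketch using scikit-learn's AffinityPropagation is shown below (the similarity is scikit-learn's default negative squared Euclidean distance between embedding vectors, and the parameter values are illustrative):

```python
# Sketch: extracting the semantic words of one text via affinity propagation clustering.
import numpy as np
from sklearn.cluster import AffinityPropagation

def extract_semantic_words(words, vectors, damping=0.5, max_iter=200):
    """words: list of words in the text; vectors: their embedding vectors, shape (n, e)."""
    ap = AffinityPropagation(damping=damping, max_iter=max_iter,
                             affinity="euclidean", random_state=0)
    ap.fit(np.asarray(vectors))
    # The exemplars (cluster centers) are taken as the semantic words of the text.
    return [words[i] for i in ap.cluster_centers_indices_]
```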


Based on the above method, we can replace each text with its corresponding semantic words {sw1,sw2,⋯ }. Moreover, DSRM-DNN can update the semantic words dynamically when classifying texts, which further improves the adaptive ability of the model. Details are described in Section 3.3.

3.2 Word feature construction

According to the method in Section 3.1, the selected words are saved in the DSRM-DNN. The model contains n semantic words {sw1,sw2,sw3,⋯ ,swn}, each of which has a unique index (that is, swi ≠ swj for i,j = 1,2,3,⋯ ,n and i ≠ j). To maintain the representation ability of the textual vectors and accelerate the training of the classifier, we construct a fusion feature to replace the semantic words in the DSRM-DNN. Therefore, each text in the datasets can be represented by an n-dimensional vector, whose i-th component is computed as:

$$ {x}_i=M\_ TF\left(s{w}_i\right) $$
(14)

where M_TF(⋅) is the feature fusion function. Different from the traditional bag of words (BOW), which uses a single statistical feature to represent all of the words, we fuse several statistical features; these features are statistics of the frequency, position and span of words in the text.

Word frequency is the number of times a word appears in the text. The higher the word frequency is, the more important the word is. It is one of the word attributes commonly used as statistical features. For the calculation of the word frequency factor frei, we adopt the nonlinear function:

$$ fr{e}_i=\frac{n}{n+1} $$
(15)

where n is the number of appearances of the word swi in the text. The nonlinear function has two advantages: one is that the word frequency factor is positively related to the word frequency; the other is that when the word frequency increases beyond a certain extent, the growth of the word frequency factor slows down and saturates, which conforms to linguistic reality.

The part-of-speech factor is a quantification of part-of-speech. Part-of-speech analysis of the existing semantic words shows that most of them are nouns; compared with nouns, verbs and adjectives have relatively little influence. Since different parts-of-speech have different effects on text classification, we divide the words into three categories according to their part-of-speech as follows:

$$ po{s}_i=\left\{\begin{array}{ccc}1,& & {w}_i\in nouns\\ {}0.8,& & {w}_i\in verbs\kern0.60em or\kern0.60em adjectives\\ {}0.6,& & otherwise\end{array}\right. $$
(16)

The word length factor is computed as follows:

$$ lengt{h}_i=\frac{le{n}_i}{\mathit{\max}\left( le{n}_i\right)} $$
(17)

where leni is the word length of the i-th word swi, and max(leni) is the maximum length over all words in the text.

The position of a word in the text is also of great value in judging its importance. Different words appear in different positions in the text, and their ability to express the theme of the text is often different. Words in the title can better reflect the theme of the text than words in the abstract, the first paragraph and the body. We define the function according to different positions:

$$ lo{c}_i=\left\{\begin{array}{ccc}0.75,& & {w}_i\kern0.60em in\kern0.60em title;\\ {}0.67,& & {w}_i\kern0.60em in\kern0.60em abstract\kern0.60em or\kern0.60em paragrap{h}_{first};\\ {}0.5,& & {w}_i\kern0.60em in\kern0.60em body\kern0.60em part.\end{array}\right. $$
(18)

The word span factor can effectively reduce the influence of local words on full-text words. To improve on the existing calculation method, we define the word span factor as follows:

$$ spa{n}_i=\frac{las{t}_i- firs{t}_i+1}{su{m}_i}\times \frac{d_i}{D_i} $$
(19)

where firsti is the position at which the word swi first appears, lasti is the position at which it last appears, sumi is the total number of words in the text, di is the number of paragraphs that contain the word swi, and Di is the total number of paragraphs.

Here we construct the fusion feature to represent the semantic words, and the new feature is computed as follows:

$$ {\displaystyle \begin{array}{rcl}M\_ TF\left(s{w}_i\right)& =& {\alpha}_1 fr{e}_i+{\alpha}_2 po{s}_i+{\alpha}_3 lengt{h}_i\\ {}& & +{\alpha}_4 lo{c}_i+{\alpha}_5 spa{n}_i\end{array}} $$
(20)

where frei is the word frequency factor, posi is the part-of-speech factor, lengthi is the word length factor, loci is the word position factor, spani is the word span factor, and α1, α2, α3, α4 and α5 are the weights of the feature factors. Therefore, according to (14), we can use {x1,x2,x3,⋯ ,xn}j to represent Tj with n semantic words.
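A sketch of how the factors in (15)–(19) can be fused according to (20) for a single semantic word; the weight values and the helper name `m_tf` are illustrative:

```python
def m_tf(freq_count, pos_tag, word_len, max_len, position, first, last, word_total,
         para_count, para_total, weights=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """Weighted fusion of the five word-attribute factors, eq. (20). Weights are illustrative."""
    fre = freq_count / (freq_count + 1.0)                               # frequency factor, eq. (15)
    pos = {"noun": 1.0, "verb": 0.8, "adj": 0.8}.get(pos_tag, 0.6)      # part-of-speech, eq. (16)
    length = word_len / max_len                                         # word length, eq. (17)
    loc = {"title": 0.75, "abstract": 0.67, "first_para": 0.67}.get(position, 0.5)  # eq. (18)
    span = (last - first + 1) / word_total * (para_count / para_total)  # word span, eq. (19)
    a1, a2, a3, a4, a5 = weights
    return a1 * fre + a2 * pos + a3 * length + a4 * loc + a5 * span     # fusion, eq. (20)
```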

3.3 Dynamic representation

The training set and test set are assigned randomly, and the semantic words in the DSRM-DNN come only from the training set, so words extracted from the test set may not appear in the DSRM-DNN. If the words extracted from a test text do not appear in the DSRM-DNN (denoted as wordoutDSRM), we sparsely represent these words using other semantic words or non-semantic words (swinDSRM or wordinDSRM). The objective function is as follows:

$$ \underset{\alpha }{\mathrm{argmin}}\parallel y-\alpha X{\parallel}_2^2\kern0.90em s.t.\parallel \alpha {\parallel}_1\le \varepsilon $$
(21)

or

$$ \underset{\alpha }{\mathrm{argmin}}\parallel y-\alpha X{\parallel}_2^2+\lambda \parallel \alpha {\parallel}_1 $$
(22)

where y is the sample that needs to be reconstructed, X is a matrix of the embedding vectors, ε and λ are both small positive constants.

Although the l1-norm plays an implicit role in the selection of training samples in regression, the computational cost of its iterative solution is very high, so we replace the regularization term with the l2-norm. The objective function can be expressed as:

$$ \underset{\alpha }{\mathrm{argmin}}\parallel y-\alpha X{\parallel}_2^2+\lambda \parallel \alpha {\parallel}_2^2 $$
(23)

To improve the efficiency of the representation, the reconstruction of a new word is divided into the following two cases:

$$ \left\{\begin{array}{ccc}{\mathrm{argmin}}_{x_i}\parallel {k}_i-{x}_iK{\parallel}_2^2+\lambda \parallel {x}_i{\parallel}_2^2,& & {m}_1/\left({m}_1+{m}_2\right)<\alpha, \\ {}{\mathrm{argmin}}_{y_i}\parallel {k}_i-{y}_iW{\parallel}_2^2+\lambda \parallel {y}_i{\parallel}_2^2,& & {m}_1/\left({m}_1+{m}_2\right)\ge \alpha \end{array}\right. $$
(24)

where λ is the weight parameter, ki is the i th semantic word in the wordoutDSRM, \( {x}_i\in {\mathbb{R}}^{m_2} \) is the reconstruction vector, \( K\in {\mathbb{R}}^{m_2\times n} \) is composed of m2 semantic words in swinDSRM, \( {y}_i\in {\mathbb{R}}^{m_3} \) is the reconstruction vector, \( W\in {\mathbb{R}}^{m_3\times n} \) is composed of m3 non-semantic words in wordinDSRM, and m1 = num(wordoutDSRM), m2 = num(swinDSRM), m3 = num(wordinDSRM). Besides, new semantic words in the test texts (denoted as swoutDSRM) can be added to DSRM-DNN to further improve the representation and adaptability of DSRM-DNN.
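A sketch of the l2-regularized reconstruction in (23) and the case split in (24), using the closed-form ridge solution; the function names and the default values of λ and α are illustrative:

```python
import numpy as np

def reconstruct(y, X, lam=0.1):
    """Solve argmin_a ||y - a X||_2^2 + lam ||a||_2^2 in closed form (eq. 23).
    y: (n,) embedding vector of the out-of-DSRM word; X: (m, n) dictionary of embedding vectors."""
    m = X.shape[0]
    # a = y X^T (X X^T + lam I)^{-1}
    return y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(m))

def represent_new_word(k_i, K, W, m1, alpha=0.5, lam=0.1):
    """Eq. (24): choose the dictionary by comparing m1 / (m1 + m2) with the threshold alpha."""
    m2 = K.shape[0]
    if m1 / (m1 + m2) < alpha:
        return reconstruct(k_i, K, lam)  # reconstruct from semantic words in DSRM (swinDSRM)
    return reconstruct(k_i, W, lam)      # reconstruct from non-semantic words (wordinDSRM)
```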

3.4 Classifier construction

Fig. 3 The structure diagram of BP network and DBN

The BP network is a multilayer feedforward neural network based on the error back-propagation algorithm. Its principle is to recursively compute the difference between the actual output and the expected output, and then adjust the weights according to this difference. To reduce the computational cost of neural network training, we construct a BP network with an input layer, two hidden layers and an output layer, as shown in Fig. 3. The obtained feature matrix Hm×n is input to the network for training. The activation function and its derivative are given as follows:

$$ S(x)=\frac{1}{1+{e}^{-x}} $$
(25)
$$ {S}^{\prime }(x)=\frac{e^{-x}}{{\left(1+{e}^{-x}\right)}^2}=S(x)\left(1-S(x)\right) $$
(26)
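A minimal sketch of such a BP network with two sigmoid hidden layers, built here with Keras; the hidden-layer sizes and optimizer are illustrative, and in DSRM-DNN the weights would first be initialized by the DBN pre-training described next rather than randomly:

```python
import tensorflow as tf

def build_bp_network(n_labels, h1=256, h2=128):
    # Two sigmoid hidden layers and one sigmoid output node per label, trained with
    # error back-propagation; the input dimension is inferred from the feature matrix.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(h1, activation="sigmoid"),
        tf.keras.layers.Dense(h2, activation="sigmoid"),
        tf.keras.layers.Dense(n_labels, activation="sigmoid"),
    ])
    model.compile(optimizer="sgd", loss="binary_crossentropy")
    return model
```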

Since randomly initialized parameters may make the network converge to a local optimum and reduce the training effect, this paper adopts a deep belief network (DBN), co-constructed from the BP network and restricted Boltzmann machines [2], to initialize the parameters of the BP network.


Each restricted Boltzmann machine (RBM) has n visible elements and m hidden elements, whose states are represented by the vectors v and h respectively; the energy function is then:

$$ \mathrm{E}\left(v,h,\theta \right)=-\sum \limits_{i=1}^n\sum \limits_{j=1}^m{v}_i{W}_{ij}{h}_j-\sum \limits_{i=1}^n{a}_i{v}_i-\sum \limits_{j=1}^m{b}_j{h}_j $$
(27)

where \( \theta =\left\{{W}_{ij},{a}_i,{b}_j\right\} \) is the parameter set of the RBM, Wij is the connection weight between visible element i and hidden element j, and ai and bj are the biases of visible element i and hidden element j. We obtain the joint probability distribution and the likelihood function of v and h:

$$ \mathrm{p}\left(v,h|\theta \right)={\mathrm{e}}^{-\mathrm{E}\left(v,h|\theta \right)}/\mathrm{Z}\left(\theta \right),\kern0.60em \mathrm{Z}\left(\theta \right)=\sum \limits_{v,h}{\mathrm{e}}^{-\mathrm{E}\left(v,h|\theta \right)} $$
(28)
$$ \mathrm{p}\left(v|\theta \right)=\sum \limits_h{\mathrm{e}}^{-\mathrm{E}\left(v,h|\theta \right)}/\mathrm{Z}\left(\theta \right) $$
(29)

where Z\( \left(\theta \right)={\sum}_{v,h}{\mathrm{e}}^{-\mathrm{E}\left(v,h|\theta \right)} \) is the normalization factor.
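The text does not spell out the RBM training procedure; assuming the standard contrastive divergence (CD-1) rule, a single pre-training step for one RBM layer can be sketched as follows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, eta=0.01, rng=None):
    """One CD-1 update of an RBM with parameters theta = {W, a, b} (eq. 27).
    v0: (batch, n) visible states; W: (n, m) weights; a: (n,) visible bias; b: (m,) hidden bias."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Positive phase: sample hidden units given the data.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)
    # Gradient approximation and parameter update.
    batch = v0.shape[0]
    W += eta * (v0.T @ ph0 - pv1.T @ ph1) / batch
    a += eta * (v0 - pv1).mean(axis=0)
    b += eta * (ph0 - ph1).mean(axis=0)
    return W, a, b
```

After the stacked RBMs are pre-trained layer by layer in this way, their weights and biases serve as the initial parameters of the BP network above.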

4 Experiments

In this section, we introduce the experiments in detail and report the results of DSRM-DNN and the comparative methods on benchmark datasets. To comprehensively analyze the performance of the proposed method, we perform five experiments. The first experiment studies the settings of semantic word extraction, including the dimension of word embedding, the size of the context window, and the number of semantic words. In the second and third experiments, we analyze the effects of different word attributes and of dynamic representation. The fourth experiment compares the performance of different classification methods. Finally, we test DSRM-DNN against nine baselines on four datasets.

Table 2 Datasets statistics. D_num is the number of total texts, L_num is the number of labels, Vo_size is the size of the total vocabulary, L_ave is the average number of labels per text, Tr_num is the number of training texts, Te_num is the number of test texts

4.1 Datasets

We use four representative datasets, summarized in Table 2, to evaluate the proposed method. Each dataset is split into a training set and a test set: 80% of the texts, selected at random, are used to train the classifiers, and the remaining texts are used to evaluate the text classification methods.
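A sketch of this random 80/20 split with scikit-learn (variable names are illustrative):

```python
from sklearn.model_selection import train_test_split

# texts: list of documents; labels: (num_texts, L) binary indicator matrix of category labels.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42)  # 80% training / 20% test
```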

4.2 Baselines

To make the experimental comparison more comprehensive and objective, we not only use the traditional machine learning algorithms such as decision tree, k-nearest neighbor, but also use the latest methods such as neural networks for comparative analysis. The details of these methods are as follows:

  • Multi-label decision tree (ML-DT) [31] adopts decision tree techniques to process multi-label data, where an information gain criterion based on multi-label entropy is utilized to build the decision tree recursively.

  • Multi-label k-nearest neighbor (ML-KNN) [30] adopts k-nearest neighbor techniques to process multi-label data, where maximum posterior probability is utilized to predict the labels of nearest samples.

  • Binary relevance (BR) transforms the multi-label learning problem into multiple independent binary classification problems, where each binary classification problem corresponds to a possible label in the label space.

  • Classifier chains (CC) transforms the multi-label learning problem into a chain of binary classification problems, in which subsequent binary classifiers in the chain build on the predictions of preceding ones. In BR and CC, we use the Euclidean-SVMs [24] as the base classifiers.

  • Multi-label neural networks (ML-NN) [28] represents the multi-label classification problem as a neural network with multiple output nodes. Each label in the dataset corresponds to an output node of the network, and the output layer can model the dependencies between different categories.

  • Hierarchical ARAM neural network (HARAM) [27] is an extension of the fuzzy Adaptive Resonance Associative Map (ARAM) designed to speed up classification on high-dimensional and large-scale datasets.

  • Convolutional and recurrent neural networks (CNN-RNN) [29] utilizes the ensemble application of convolutional and recurrent neural networks to capture the global and local textual semantics, and to model label correlations while having a tractable computational complexity.

  • Hierarchical label set expansion (HLSE) [19] regularizes data labels and considers extreme multi-class and multi-label text classification when defining hierarchical label set.

  • Supervised representation learning (SERL) [25] is a neural network based framework that learns a global feature representation by jointly considering all labels in an effective supervised manner.

4.3 Evaluation metrics

To evaluate the performance of the multi-label classifiers, we use One-error, Hamming Loss, Ranking Loss and Micro/Macro-averaged Precision, Recall and F1-Score. The details are as follows:

  • One-error (O-error) is the proportion of samples for which the top-ranked label is not in the correct label set.

  • Hamming loss (H-Loss) computes the symmetric difference between the predicted label set and the relevant label set, normalized by the size of the label space.

  • Ranking loss (R-Loss) evaluates the fraction of reversely ordered label pairs, i.e., pairs in which an irrelevant label is ranked higher than a relevant label.

  • Precision, Recall and F1-score are calculated from the numbers of true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn). There are two ways to average these metrics over the whole test data: Micro-averaged and Macro-averaged. The former first sums the true positives, true negatives, false positives and false negatives over all labels and then computes the metrics from these overall counts, while the latter averages the per-label Precision, Recall and F1-score over labels. The computations of Micro/Macro-averaged Precision, Recall and F1-score are given below:

$$ {\displaystyle \begin{array}{rcl}{P}_{micro}& =& \frac{\sum_{l=1}^{L}t{p}_l}{\sum_{l=1}^{L}\left(t{p}_l+f{p}_l\right)},\kern1em {R}_{micro}=\frac{\sum_{l=1}^{L}t{p}_l}{\sum_{l=1}^{L}\left(t{p}_l+f{n}_l\right)},\\ {}{F_1}_{micro}& =& \frac{\sum_{l=1}^{L}2t{p}_l}{\sum_{l=1}^{L}\left(2t{p}_l+f{p}_l+f{n}_l\right)},\\ {}{P}_{macro}& =& \frac{1}{L}\sum_{l=1}^{L}\frac{t{p}_l}{t{p}_l+f{p}_l},\kern1em {R}_{macro}=\frac{1}{L}\sum_{l=1}^{L}\frac{t{p}_l}{t{p}_l+f{n}_l},\\ {}{F_1}_{macro}& =& \frac{1}{L}\sum_{l=1}^{L}\frac{2t{p}_l}{2t{p}_l+f{p}_l+f{n}_l}\end{array}} $$
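Given binary indicator matrices of true and predicted labels, these metrics can be computed with scikit-learn as sketched below (`y_score` denotes real-valued label scores used by the ranking-based loss; the variable names are illustrative):

```python
# Sketch: computing the multi-label metrics with scikit-learn.
from sklearn.metrics import precision_recall_fscore_support, hamming_loss, label_ranking_loss

# y_true, y_pred: (num_texts, L) binary indicator matrices; y_score: real-valued label scores.
p_mi, r_mi, f1_mi, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
p_ma, r_ma, f1_ma, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
h_loss = hamming_loss(y_true, y_pred)
r_loss = label_ranking_loss(y_true, y_score)
```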

4.4 Experimental results and analysis

4.4.1 Semantic words extraction

Fig. 4 The influence of word embedding and context window; e is the dimension of word embedding, c is the size of the context window

In the process of semantic word extraction, we analyze the effect of the embedding dimension and the context window size while fixing the other variables. Note that we set the minimum word frequency for the embedding model to 3 (that is, only words that occur at least three times are used to train the embedding vectors). This setting reduces the influence of uncommon words on the training process and improves the training speed of the word embedding model.

The results show that the Micro-averaged F1-Score (MiF) with word embedding dimension e = 100 is significantly higher than the MiF with e = 50, which indicates that the beneficial semantic information of embedding vectors increases as embedding dimension does. As shown in Fig. 4, the size of the context window has less impact on the classification performance than the dimension of word embedding. Besides, the increment of MiF decreases gradually with the increase of the context window, which indicates that the correlation of words decreases rapidly with the increase of distance between them. Note that in Fig. 5a the increase of vector dimension or context window enriches the semantic information of the word embedding, but the training time also increases greatly.

Fig. 5 The training time of word embedding and classifier on RCV1-v2

Fig. 6 The influence of the number of semantic words n

To obtain a reasonable number of semantic words, we set the number of semantic words in each text to 5, or to 5%, 10%, or 20% of the number of words in the text. In general, as more words are selected from each text, the final results gradually improve, because the text vectors gain stronger semantic representation ability, which benefits the training of the classifiers. On RCV1-v2, however, the MiF with n = 20% is significantly lower than that with n = 10%. The reason is that DSRM-DNN already contains most of the words conducive to classification when n = 10%; as n continues to increase, the newly selected words are irrelevant and interfere with the existing words, so the additional semantic words reduce the training effect of the classifier. The experimental results in Figs. 5b and 6 show that the proposed method obtains excellent accuracy at n = 10% and 20%, but the training time of the classifier increases rapidly with the growth of DSRM-DNN.

4.4.2 Word feature construction

There are many ways to quantify words in NLP, among which TF-IDF and word frequency are widely used. In the experiment, we test the influence of word attributes on RCV1-v2, and make a quantitative analysis of the representation ability of part-of-speech.

Fig. 7 The study of word feature construction. a We set the weight of the noun to 1 and seek the optimal combination by adjusting the weights of the verb, adjective and other words. b The performance of different word attributes on the final classification result

As shown in Fig. 7a, we test several possible combinations of part-of-speech weights. Since part-of-speech alone cannot quantify a text (using only part-of-speech to quantify words results in very poor performance), we analyze its effect indirectly through its weighted combination with word frequency. The experimental results show that (1, 0.8, 0.6) is the optimal weight combination for the parts-of-speech. Besides, the results in Fig. 7b show that the weighted combination is more effective than any single word attribute.

4.4.3 Dynamic representation

Fig. 8 The influence of dynamic representation; MiF-1/MaF-1 is the result without dynamic representation, MiF-2/MaF-2 is the result of the proposed method

To verify the effectiveness of the dynamic representation method, we test MiF and Macro-averaged F1-Score (MaF) of different classification methods on four datasets.

As shown in Fig. 8, the dynamic representation method improves the performance of the classifier on most datasets. The improvement on RCV1-v2 is small because the existing DSRM-DNN already contains most of the representative words, so few new words are added by dynamic representation. On EUR-Lex, the high average number of labels per text makes the selected semantic words particularly important and increases the impact of dynamic representation, so the proposed method achieves better performance there.

4.4.4 Comparison of different methods

Table 3 Comparison of different methods on benchmark datasets

We test the different classification methods on the datasets described in Section 4.1 and use the various evaluation metrics to analyze their performance. Since Hamming Loss does not clearly differentiate the methods on EUR-Lex and Bookmarks, we replace it with Rank Loss on these two datasets. In the experiments, we set the parameters of DSRM-DNN to e = 100, c = 7, and n = 10%.

Both BR and CC in Table 3 are based on Euclidean-SVMs and achieve significantly higher recall and precision than ML-DT and ML-KNN, which shows that the problem-transformation methods are more effective than the algorithm-adaptation methods on these datasets. Among the deep learning based methods, DSRM-DNN achieves better classification results on RCV1-v2, which shows that the proposed method is more suitable for multi-label text classification. In addition, we observe that many classification methods perform well on Reuters-21578; the reason is that the weak correlation of its labels leads to large differences between texts from different categories. The label spaces of EUR-Lex and Bookmarks in Table 2 are complex and difficult to classify, and the results show that the proposed method also outperforms the existing methods on them.

The experimental results also show that the performance of the classifiers based on deep learning is clearly better than that of the classifiers based on traditional machine learning algorithms. DSRM-DNN is optimal or sub-optimal on most datasets, and its classification results are better than those of the compared methods. Besides, our method can effectively reduce the negative impact of low-frequency words and new words, so it is especially suitable for datasets with strong label correlation and a wide distribution of semantic words.

5 Conclusion

In this paper, we propose DSRM-DNN, a method that extracts semantic words and constructs text features, and that stacks multiple restricted Boltzmann machines into a deep belief network to initialize the classifier and accelerate its convergence. Comparison with state-of-the-art methods on four datasets demonstrates that the proposed method achieves better results in both text quantization and multi-label classification. To further improve the speed and accuracy of text classification, we will focus on optimizing semantic extraction and the neural network construction in future research.