Keywords

1 Introduction

Continued development of the Internet and information technology has spawned a large number of text data in various forms. How to organize, manage and analyze such a huge data, and find the user information quickly, accurately and comprehensively is a big challenge. Text automatic classification is an important research point in the field of information mining. Compared to the traditional single classification problem, multi-label text classification has more value of research and application.

In multi-label learning, the text data are always in high dimensionality and sparsity. e.g. In a large number of feature words, only a few are related to the topic of a text and most of the rest are redundant. Therefore, introducing sparsity into machine learning has become a popular technology, which not only meet the need of practical problems but also can simplify the learning model. In resent years, extreme learning machine (ELM) [14] has attracted increasing attention and been widely used for its distinguishing characteristics: (1) fast learning speed, (2) good generalization performance on classification or regression, (3) less human intervention with randomly setted hidden layer parameters. For these reasons, the theoretical analysis and various improvement algorithms of ELM are put forward continuously.

In ELM network, the function of the random hidden layer nodes can be seen as feature mapping. It maps the data from the input feature space to the hidden layer feature space, which is called ELM feature space in literature [5]. In this ELM feature space, each instance may still remains the sparsity. Meantime, considering the characteristics of multi-label learning and the advantages of the classifier ELM, in this paper, we propose an embedded model for multi-label text classification, which is derived from a formulation based on ELM with \(L_{21}\)-norm minimization of the output weights matrix. Through the constraint of the \(L_{21}\)-norm regularization, the training model becomes simplified, also we can sufficiently preserve the intrinsic relation of different nodes in the ELM feature space and select them by joint multiple related labels, where the labels are not always independent to each other. Experimental results on several benchmark data sets verify the efficiency of our proposed algorithm.

The main contributions of this paper can be summarized below:

  • According to the characteristics of the multi-label text data we introduce the sparsity model.

  • Applying \(L_{21}\)-norm for joint hidden layer nodes selection and avoiding individual training for each label.

  • Using ELM for multi-label text classification.

The remainder of this paper is organized as follows. After reviewing the related works in Sect. 2, we present the algorithm \(L_{21}\)-ELM in Sect. 3 and describe the evaluation measures of multi-label learning in Sect. 4. Experimental results are presented in Sect. 5 and we conclude this paper in Sect. 6.

2 Related Work

2.1 Multi-label Learning

Unlike traditional supervised learning, in multi-label learning each instance may belong to multiple classes and for a new instance we try to predict its associated set of labels. This is a generalized case of the prevalent multi-class problems where in multiple classes each instance has only one class restrictedly.

Let \(\mathcal {X} \in \mathbb {R}^{d}\) denote the d-dimensional space of instances, \(\mathcal {Y}= \left\{ y_{1},\ldots ,y_{k}\right\} \) denote the label space with k possible class labels. Given the training data set \(\left\{ (x_{1},\textit{Y}_{1}),\ldots ,(x_{n},\textit{Y}_{n})\right\} \) where \(x_{i}\in \mathcal {X}\) and \(\textit{Y}_{i} \subseteq \mathcal {Y}\). the task of multi-label learning is to learn a multi-label classifier \(f: \mathcal {X}\rightarrow 2^{k}\) from the training data set. For any unknown instance \(x \in \mathcal {X}\) , the multi-label classifier \(f(\cdot )\) predicts \(f(x) \subseteq \mathcal {Y}\) as the set of proper labels. Existing multi-label learning algorithms can be divided into two main categories [6, 7].

Problem transformation methods. The main idea of most problem transformation methods is to transform the original multi-label learning problem into multiple single-label learning problems, which usually reconstructs the multi-label data sets and then existing classification algorithms can be applied directly.

The binary relevance (BR) [8] algorithm is a popular kind of this transformation method and has been widely used in many practical applications. This algorithm divides the multi-label classification problem into k independent binary classification problems, however, the assumption of label independence is too implicit and the label correlations are ignored. The label powerset (LP) [9] algorithm is another common transformation method. It considers each unique set of labels in a multi-label training data as one class in the new transformed data. While the computational complexity of LP is too big and it may pose class imbalance problem. The basic idea of the classifier chains (CC) [10] is to chain the transformed binary classifiers one by one, but the sequence of each classifier is a problem. The ensembles of classifier chains (ECC) [11] improved the CC algorithm and identify the sequence of each classifier effectively.

Algorithm adaptation methods. From another perspective, this method improves conventional algorithms to deal with multi-label data directly. Some representative algorithms include ML-kNN [12] adapting k-nearest neighbor techniques, which has the advantage of both lazy learning and Bayesian but ignores label correlations. ML-DT [13] adapting decision tree techniques, Rank-SVM [14] adapting kernel techniques, etc.

In this paper, the algorithm based on ELM we proposed is designed to deal with multi-label data directly, therefore, it can be considered as a kind of algorithm adaptation method.

2.2 \(L_{21}\)-norm Regularization for Parameter Estimation

In recent years, parameter estimation via sparsity-promoting regularization has be widely used in machine learning and statistics. Perhaps \(L_{1}\)-norm regularization is the most successful and common method to promote sparsity for the parameter vector (the lasso approach). Along with the development of multi-task learning, in 2006, Obozinski et al. [15, 16] proposed to constraint the sum of \(L_{2}\)-norms of the blocks of weights connected with each feature, and then leading to the \(L_{21}\)-norm regularized optimization problem (the group lasso).

In this section, we will briefly review the basics of this technique. Usually, the optimization problem can be described as following:

$$\begin{aligned} \min _{ X} : loss(X) + \lambda \parallel X \parallel _{2,1} \end{aligned}$$
(1)

where \(\lambda > 0\) is the regularization parameter, \( X \in \mathbb {R}^{n \times k} \) is the weights matrix, \(\parallel X \parallel _{2,1} = \sum _{i=1}^{n}\parallel X \parallel _{2}\) and loss(X) is a smooth and convex loss function (such as the logistic loss, the least square loss and the hinge loss). Take the least squares problem as an example, the Eq.1 is expressed as:

$$\begin{aligned} \min _{ X}: \frac{1}{2}\parallel AX-Y \parallel _{2}^{2}+ \lambda \parallel X \parallel _{2,1} \end{aligned}$$
(2)

where \(A \in \mathbb {R}^{m \times n}\), \( Y \in \mathbb {R}^{m \times k} \) are the data matrices, each row of X forms a feature group. Figure 1 visualizes this optimization problem.

Fig. 1
figure 1

Illustration of the data matrix AY, and the weights matrix X

This optimization problem will be more challenging to solve due to the non-smoothness and non-differential of the \(L_{21}\)-norm regularization. In this paper, we apply the strategy proposed in literature [17] to solve this problem, which reformulates the non-smooth \(L_{21}\)-norm regularized problem to an equivalent smooth convex optimization problem and can be solved in linear time.

3 \(L_{21}\)-minimization ELM (\(L_{21}\)-ELM)

In this section, we propose \(L_{21}\)-ELM algorithm for multi-label learning problem, which takes the significant advantages of ELM like affording good generalization performance at extremely fast learning speed, meantime, offers us some additional characteristics. Firstly, we will review the theories of ELM, then, introduce the algorithm we proposed.

3.1 Extreme Learning Machine

Extreme learning machine [2, 3] was originally proposed for single hidden layer feedforward neural networks and then extended to the generalized single hidden layer feedforward networks where the hidden layer need not be neuron alike [1]. Figure 2 shows the structure of ELM network. It contains an input layer, a hidden layer and an output layer.

Fig. 2
figure 2

Structure of ELM network

In ELM, the hidden layer parameters are chosen randomly, and the output function can be represented as following (take the case of p hidden layer nodes and one output layer node as an example):

$$\begin{aligned} \ f_{output}(x) = \sum _{i=1}^{p} \beta _{i} h_{i}(x) = \mathbf h (x) \beta \end{aligned}$$
(3)

where \(x \in \mathbb {R}^{n}\) is the input variable, \(\beta =(\beta _{1},\beta _{2},\ldots ,\beta _{p})^{T}\) is the weights vector between the hidden layer nodes and the output layer nodes. \(\mathbf h (x)= \left[ h_{1}(x),h_{2}(x),\ldots ,h_{p}(x) \right] \) is the output vector of the hidden layer with respect to the input vector x. \(h_{i}(x)\) is the ith activation function, its input weights vector and bias are \(w_{i}\) and \(b_{i}\).

Figure 2 shows that \(\mathbf h (x)\) actually maps the input variables from the n-dimension to the p-dimensional hidden layer space (ELM feature space), thus, it appears to be a feature mapping function.

The ELM reliably approximates m samples, \(X=[x_{1},\ldots ,x_{m}]\), with minimum error:

$$\begin{aligned} \min _{ \beta }: \parallel H\beta -Y \parallel _{2}^{2} \end{aligned}$$
(4)

where H is hidden layer output matrix,

$$\begin{aligned} H=\left[ \begin{array}{ccc} h(x_{1}) \\ \vdots \\ h(x_{m}) \\ \end{array} \right] =\left[ \begin{array}{ccc} g(w_1 \cdot x_1 + b_1) &{} \cdots &{} g(w_p \cdot x_1 + b_p) \\ \vdots &{} \ddots &{} \vdots \\ g(w_1 \cdot x_m + b_1) &{} \cdots &{} g(w_p \cdot x_m + b_p) \\ \end{array} \right] _{m \times p} \end{aligned}$$
(5)

and \(Y=[y_{1},\ldots ,y_{m}]^{T}\) is the target vector.

The analytical result of this least squares equation is:

$$\begin{aligned} \hat{\beta } = H^{\dagger } Y \end{aligned}$$
(6)

where \(H^{\dagger }\) is called Moore-Penrose generalized inverse of matrix H.

3.2 \(L_{21}\)-norm Minimization ELM for Multi-label Learning

In this section, we consider adapting the ELM network to solve the multi-label learning problem. Given the multi-label training data with m samples \((x_{i},y_{i})\), where \(x_{i}=(x_{i1},x_{i2},\ldots ,x_{in})^{T}\in \mathbb {R}^{n}\) and \(y_{i}=(y_{i1},y_{i2},\ldots ,y_{ik})\in \mathbb {R}^{k}\). As shown in the Fig. 2, we set the number of output layer nodes k , which equals the number of labels, and set the number of hidden layer nodes p randomly.

Inspired by ELM, we consider combining the smallest training error of ELM with the \(L_{21}\)-norm minimization of output weights matrix. It is reformulated as following:

$$\begin{aligned} \min _{ \beta } : \parallel H\beta -Y \parallel _{2}^{2} + \lambda \parallel \beta \parallel _{2,1} \end{aligned}$$
(7)

where \(\parallel \beta \parallel _{2,1} = \sum _{i=1}^{p}\parallel \beta _{i} \parallel _{2}\) is the \(L_{21}\)-norm of the matrix \(\beta \), and \(\beta _{i}=(\beta _{i1},\beta _{i2},\ldots ,\beta _{ik})\), \(\lambda \) is the regularization parameter.

To solve the nonsmooth optimization problem in Eq. (7), the literature [17] proposed to employ the Nesterov’s optimal method by optimizing its equivalent smooth convex reformulation. When using a constraint to replace the nonsmooth \(L_{21}\)-norm, the original problem can be equivalent to the \(L_{21}\)-ball constrained smooth convex optimization problem as following:

$$\begin{aligned} \min _{ \beta }: \parallel H\beta -Y \parallel _{2}^{2} s.t. \parallel \beta \parallel _{2,1} \le z \end{aligned}$$
(8)

When applying the Nesterov’s optimal method to solve Eq. (8), one key building block of this method is Euclidean projection onto the \(L_{21}\)-ball. The Euclidean projection problem is defined as:

$$\begin{aligned} \pi _{Z}(U)= arg \min _{\beta \in Z} \frac{1}{2} \parallel \beta -U \parallel _{2}^{2} \end{aligned}$$
(9)

where \(Z = \left\{ \beta \in \mathbb {R}^{p \times k} \mid \parallel \beta \parallel _{2,1} \le z \right\} \) is the \(L_{21}\)-ball and \(z\ge 0^{1}\) is the radius of \(L_{21}\)-ball. To solve the problem in Eq. (9), the Lagrangian variable \(\alpha \) is introduced for the inequality constrain \(\parallel \beta \parallel _{2,1} \le z\) , then we can lead to the Lagrangian function of Eq. (9) as:

$$\begin{aligned} \L (\beta ,\alpha )=\frac{1}{2}\parallel \beta -U \parallel _{2}^{2}+\alpha (\parallel \beta \parallel _{2,1}-z) \end{aligned}$$
(10)

Let \(\beta ^{*}\) be the primal optimal point, and \(\alpha ^{*}\) be the dual optimal point. This two points must satisfy the condition: \(\parallel \beta ^{*} \parallel _{2,1} \le z\) and \(\alpha ^{*} \ge 0\). Since considering the strong duality holds of the Slater’s condition, and values of the primal and dual optimal points are equal: \(\alpha ^{*}(\parallel \beta \parallel _{2,1}-z)=0\). Therefore, the primal optimal point \(\beta ^{*}\) can be given by Eq. (11) if the dual optimal point \(\alpha ^{*}\) is known.

$$\begin{aligned} \beta ^{*}_{i} = \left\{ \begin{array}{ccc} (1-\frac{\alpha ^{*}}{\parallel u^{i} \parallel })u^{i} ,&{} &{} \alpha ^{*}>0, \parallel u^{i} \parallel > \alpha ^{*}\\ 0 ,&{} &{} \alpha ^{*}>0, \parallel u^{i} \parallel \le \alpha ^{*}\\ u^{i} ,&{} &{} \alpha ^{*}=0 \\ \end{array} \right. \end{aligned}$$
(11)

where \(u^{i} \in \mathbb {R}^{1\times k}\) is the ith row of U.

According to Eq. (11), \(\beta ^{*}\) can be computed as long as \(\alpha ^{*}\) is solved. Now, the key step is how to compute the unknown dual optimal point \(\alpha ^{*}\). Liu et al. [17] gives the theorem : if \(\parallel U \parallel _{2,1} \le z\) the value of \(\alpha ^{*}\) is zero, otherwise, it can be solved as the unique root of the following auxiliary function.

$$\begin{aligned} \omega (\alpha )=\sum _{i=1}^{p}max(\parallel u^{i} \parallel -\alpha ,0)-z \end{aligned}$$
(12)

The Eq. (12) can be solved in O(p) flops by the bisection [18], and it costs O(pk) flops to compute \(\beta ^{*}\) by Eq. (11). Above all, for solving Eq. (7) the time complexity is O(pk). When testing an unseen instance, we will use a threshold function t(x) to determine its associated label set. For an actual outputs \(c_{j}\), \( Y = \left\{ j \mid c_{j} > t(x) \right\} \). An usual solution is to set t(x) to be zero. In this paper, we adopt the threshold category used in literature [19].

4 Evaluation Measures

Being different from the traditional single-label learning system, in multi-label learning an instance usually have one or more labels simultaneously, therefore those classical evaluation methods would be no longer applied in multi-label learning system. For this reason, a series of evaluation metrics for multi-label learning are proposed.

In order to compare our algorithm with other commonly used methods, we adopt five evaluation measures in multi-label learning in this section, including: hamming loss, one-error, coverage, ranking loss and average precision [6, 20, 21]. The following is a look at these measures based on a test data set \(S = \lbrace (x_{i},Y_{i}) \mid 1 \leqslant i \leqslant n \rbrace \) and a trained model \(f(\cdot , \cdot )\) or \(g(\cdot )\).

Hamming loss. This measure evaluates the error rate of all instances on all labels, e.g. a relevant label of an instance is not predicted or an irrelevant one is predicted. the smaller the value of hamming loss, the better the performance.

$$\begin{aligned} \ hloss_S(g) = \frac{1}{n} \sum _{i=1}^{n} \frac{1}{m} \mid g(x_i) \bigtriangleup Y_i \mid \end{aligned}$$
(13)

where \(\bigtriangleup \) stands for the symmetric difference between two sets, m is the total number of labels. It is worth noting that when most of these instances have little correlative labels, it can also get a small value of hamming loss even if all the labels of an instance are predicted in error. Therefore, we should integrate it with other measures.

One-error. This measure evaluates the times that the top-ranked label of an instance is not in its relevant label set. The smaller the value of \(one-error_S(f)\), the better the performance.

$$\begin{aligned} \ one-error_S(f) = \frac{1}{n} \sum _{i=1}^{n} \left[ arg \max _{y \in y } f(x_{i},y) \notin Y_{i} \right] \end{aligned}$$
(14)

One-error mainly focuses on the most relevant label being correct or not, and it don’t pay attention to other labels. Note that, it is equal to ordinary error identically in single-label classification problems.

Coverage. This measure evaluates the average steps we need to go down the ranked-label list for the sake of covering all the relevant labels. The smaller the value of coverage, the better the performance.

$$\begin{aligned} \ coverage_S(f) = \frac{1}{n} \sum _{i=1}^{n} \max _{ \ell \in Y_{i}} rank_{f}(x_{i}, \ell )-1 \end{aligned}$$
(15)

where the \(rank_{f}(\cdot ,\cdot )\) is derived from the real-valued function \(f(\cdot , \cdot )\), and the bigger the value of f, the smaller the \(rank_{f}\) is. The performance is perfect when \(coverage_S(f)=0\).

Ranking loss. This measure evaluates the average fraction of the reversely ordered label pairs. The smaller the value of \(rloss_{S}(f)\), the better the performance.

$$\begin{aligned} \ rloss_S(f) = \frac{1}{n} \sum _{i=1}^{n}\frac{1}{\mid Y_{i}\mid \mid \overline{Y_{i}}\mid } \mid \left\{ (y,\overline{y})|f(x_{i},y)\le f(x_{i},\overline{y}),(y,\overline{y})\in Y_{i} \times \overline{Y_{i}}\right\} \mid \end{aligned}$$
(16)

where \(Y_{i}\) and \(\overline{Y_{i}}\) denote the possible and impossible label sets of the instance \(x_{i}\). When the value is zero, it means that all impossible labels follow possible ones.

Average precision. This measure evaluates the average fraction of relevant labels ranked above a particular one \(\ell \in Y_{i}\). It is typically used in information retrieval (IR) system to evaluate the document ranking performance query retrieval [22]. The bigger the value of \(avgpec_{S}(f)\), the better the performance.

$$\begin{aligned} \ avgpec_{S}(f) = \frac{1}{n} \sum _{i=1}^{n} \frac{1}{\mid Y_{i}\mid } \sum _{ y \in Y_{i}} \frac{\mid L_{i} \mid }{rank_{f}(x_{i},y)} \end{aligned}$$
(17)

where \( L_{i}= \left\{ y' \mid rank_{f}(x_{i},y') \le rank_{f}(x_{i},y),y' \in Y_{i} \right\} \). Note that \(avgpec_{S}(f)\) =1 ranks perfectly, that means there is no instance \(x_{i}\) for which a label not in \(Y_{i}\) is ranked above on a label in \(Y_{i}\).

Table 1 Data sets

5 Experimental Results

In this section, \(L_{21}\)-ELM is compared with the performance of the original ELM as well as other common multi-label classification algorithms. The benchmark data sets we used are all in text areas, including: Enron for email analysis, Reuters for text categorization, BibTeX for tags of paper and Yahoo for web page categorization. Table 1 describes the datasets in detail. For Enron and Reuters without pre-separated training and testing sets, therefore, we decide to select 1,500 instances of them for training randomly and the rest data for testing. We repeat the data partition for thirty times randomly, and give the “average results” \(\pm \) “standard deviations”.

Table 2 Results on data set Enron
Table 3 Results on data set Reuters
Table 4 Results on data set Recreation
Table 5 Results on data set BibTeX

Table 2, 3, 4 and 5 shows the comparison results on a single data set. Among them, the symbol “\(\downarrow \)” means the smaller the better, “\(\uparrow \)” on the contrary. HL, OE, Co, RL and AP are the abbreviations of hamming loss, one-error, coverage, ranking loss and average precision respectively, unit of Time (training) is seconds. The number of ELM hidden layer nodes is randomly setted but not more than the training samples and the best results are selected.

Overall, compared with other algorithms, \(L_{21}\)-ELM achieves the best performance in most case. Especially, it shows the absolute advantage on coverage, ranking loss and average precision in all datasets. On hamming loss it is worse than Rank-SVM only on BibTeX data set, and performs better on other cases. On one-error, ECC achieves comparable performance with other approaches. Without consideration of ECC, \(L_{21}\)-ELM outperforms other approaches by the metric of one-error.

Compared with the original ELM approach, \(L_{21}\)-ELM achieves obviously better performance on almost all datasets over all the 5 criteria. This validates the effectiveness of the \(L_{21}\)-norm regularization on the original ELM and eliminating redundant information.

On the training time, the ELM group has faster training time than other approaches. This validates that \(L_{21}\)-ELM could fully inherit the merits of ELM with extreme learning speed. Compared with original ELM, \(L_{21}\)-ELM consumes more training time, but considering its better performance it is worth.

Note that Yahoo is comprised of 11 independent data sets, including: Arts, Business, Computers, Education, Entertainment, Health, Recreation, Reference, Science, Social and Society. We just give the average results over the 11 data sets. From the results as Table 6 shows, our approach could also achieve a better performance relatively.

Table 6 Results on data set Yahoo

6 Conclusion

In this paper, we propose a \(L_{21}\)-norm Minimization ELM algorithm for multi-label learning problem, which not only inherits the advantage of ELM but also offers us additional characteristics. Through the constraint of the \(L_{21}\)-norm regularization on the original ELM, the output weights matrix of the hidden layer nodes become sparse and then leading to the simplification of the learning model. Experimental results validate that \(L_{21}\)-ELM has highly competition to state-of-the-art multi-label algorithms (e.g. Rank-SVM, ML-kNN and ECC) especially in training time. Our approach greatly improves the performance of the original ELM, although it sacrifices more time.