Abstract
Extreme learning machine (ELM) was extended from the generalized single hidden layer feedforward networks where the input weights of the hidden layer nodes can be assigned randomly. It has been widely used for its much faster learning speed and less manual works. Considering the field of multi-label text classification, in this paper, we propose an ELM based algorithm combined with \(L_{21}\)-norm minimization of the output weights matrix called \(L_{21}\)-norm Minimization ELM, which not only fully inherits the merits of ELM but also facilitates group sparsity and reduces complexity of the learning model. Extensive experiments on several benchmark data sets show a more desirable performance compared with other common multi-label classification algorithms.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
Continued development of the Internet and information technology has spawned a large number of text data in various forms. How to organize, manage and analyze such a huge data, and find the user information quickly, accurately and comprehensively is a big challenge. Text automatic classification is an important research point in the field of information mining. Compared to the traditional single classification problem, multi-label text classification has more value of research and application.
In multi-label learning, the text data are always in high dimensionality and sparsity. e.g. In a large number of feature words, only a few are related to the topic of a text and most of the rest are redundant. Therefore, introducing sparsity into machine learning has become a popular technology, which not only meet the need of practical problems but also can simplify the learning model. In resent years, extreme learning machine (ELM) [1–4] has attracted increasing attention and been widely used for its distinguishing characteristics: (1) fast learning speed, (2) good generalization performance on classification or regression, (3) less human intervention with randomly setted hidden layer parameters. For these reasons, the theoretical analysis and various improvement algorithms of ELM are put forward continuously.
In ELM network, the function of the random hidden layer nodes can be seen as feature mapping. It maps the data from the input feature space to the hidden layer feature space, which is called ELM feature space in literature [5]. In this ELM feature space, each instance may still remains the sparsity. Meantime, considering the characteristics of multi-label learning and the advantages of the classifier ELM, in this paper, we propose an embedded model for multi-label text classification, which is derived from a formulation based on ELM with \(L_{21}\)-norm minimization of the output weights matrix. Through the constraint of the \(L_{21}\)-norm regularization, the training model becomes simplified, also we can sufficiently preserve the intrinsic relation of different nodes in the ELM feature space and select them by joint multiple related labels, where the labels are not always independent to each other. Experimental results on several benchmark data sets verify the efficiency of our proposed algorithm.
The main contributions of this paper can be summarized below:
-
According to the characteristics of the multi-label text data we introduce the sparsity model.
-
Applying \(L_{21}\)-norm for joint hidden layer nodes selection and avoiding individual training for each label.
-
Using ELM for multi-label text classification.
The remainder of this paper is organized as follows. After reviewing the related works in Sect. 2, we present the algorithm \(L_{21}\)-ELM in Sect. 3 and describe the evaluation measures of multi-label learning in Sect. 4. Experimental results are presented in Sect. 5 and we conclude this paper in Sect. 6.
2 Related Work
2.1 Multi-label Learning
Unlike traditional supervised learning, in multi-label learning each instance may belong to multiple classes and for a new instance we try to predict its associated set of labels. This is a generalized case of the prevalent multi-class problems where in multiple classes each instance has only one class restrictedly.
Let \(\mathcal {X} \in \mathbb {R}^{d}\) denote the d-dimensional space of instances, \(\mathcal {Y}= \left\{ y_{1},\ldots ,y_{k}\right\} \) denote the label space with k possible class labels. Given the training data set \(\left\{ (x_{1},\textit{Y}_{1}),\ldots ,(x_{n},\textit{Y}_{n})\right\} \) where \(x_{i}\in \mathcal {X}\) and \(\textit{Y}_{i} \subseteq \mathcal {Y}\). the task of multi-label learning is to learn a multi-label classifier \(f: \mathcal {X}\rightarrow 2^{k}\) from the training data set. For any unknown instance \(x \in \mathcal {X}\) , the multi-label classifier \(f(\cdot )\) predicts \(f(x) \subseteq \mathcal {Y}\) as the set of proper labels. Existing multi-label learning algorithms can be divided into two main categories [6, 7].
Problem transformation methods. The main idea of most problem transformation methods is to transform the original multi-label learning problem into multiple single-label learning problems, which usually reconstructs the multi-label data sets and then existing classification algorithms can be applied directly.
The binary relevance (BR) [8] algorithm is a popular kind of this transformation method and has been widely used in many practical applications. This algorithm divides the multi-label classification problem into k independent binary classification problems, however, the assumption of label independence is too implicit and the label correlations are ignored. The label powerset (LP) [9] algorithm is another common transformation method. It considers each unique set of labels in a multi-label training data as one class in the new transformed data. While the computational complexity of LP is too big and it may pose class imbalance problem. The basic idea of the classifier chains (CC) [10] is to chain the transformed binary classifiers one by one, but the sequence of each classifier is a problem. The ensembles of classifier chains (ECC) [11] improved the CC algorithm and identify the sequence of each classifier effectively.
Algorithm adaptation methods. From another perspective, this method improves conventional algorithms to deal with multi-label data directly. Some representative algorithms include ML-kNN [12] adapting k-nearest neighbor techniques, which has the advantage of both lazy learning and Bayesian but ignores label correlations. ML-DT [13] adapting decision tree techniques, Rank-SVM [14] adapting kernel techniques, etc.
In this paper, the algorithm based on ELM we proposed is designed to deal with multi-label data directly, therefore, it can be considered as a kind of algorithm adaptation method.
2.2 \(L_{21}\)-norm Regularization for Parameter Estimation
In recent years, parameter estimation via sparsity-promoting regularization has be widely used in machine learning and statistics. Perhaps \(L_{1}\)-norm regularization is the most successful and common method to promote sparsity for the parameter vector (the lasso approach). Along with the development of multi-task learning, in 2006, Obozinski et al. [15, 16] proposed to constraint the sum of \(L_{2}\)-norms of the blocks of weights connected with each feature, and then leading to the \(L_{21}\)-norm regularized optimization problem (the group lasso).
In this section, we will briefly review the basics of this technique. Usually, the optimization problem can be described as following:
where \(\lambda > 0\) is the regularization parameter, \( X \in \mathbb {R}^{n \times k} \) is the weights matrix, \(\parallel X \parallel _{2,1} = \sum _{i=1}^{n}\parallel X \parallel _{2}\) and loss(X) is a smooth and convex loss function (such as the logistic loss, the least square loss and the hinge loss). Take the least squares problem as an example, the Eq.1 is expressed as:
where \(A \in \mathbb {R}^{m \times n}\), \( Y \in \mathbb {R}^{m \times k} \) are the data matrices, each row of X forms a feature group. Figure 1 visualizes this optimization problem.
This optimization problem will be more challenging to solve due to the non-smoothness and non-differential of the \(L_{21}\)-norm regularization. In this paper, we apply the strategy proposed in literature [17] to solve this problem, which reformulates the non-smooth \(L_{21}\)-norm regularized problem to an equivalent smooth convex optimization problem and can be solved in linear time.
3 \(L_{21}\)-minimization ELM (\(L_{21}\)-ELM)
In this section, we propose \(L_{21}\)-ELM algorithm for multi-label learning problem, which takes the significant advantages of ELM like affording good generalization performance at extremely fast learning speed, meantime, offers us some additional characteristics. Firstly, we will review the theories of ELM, then, introduce the algorithm we proposed.
3.1 Extreme Learning Machine
Extreme learning machine [2, 3] was originally proposed for single hidden layer feedforward neural networks and then extended to the generalized single hidden layer feedforward networks where the hidden layer need not be neuron alike [1]. Figure 2 shows the structure of ELM network. It contains an input layer, a hidden layer and an output layer.
In ELM, the hidden layer parameters are chosen randomly, and the output function can be represented as following (take the case of p hidden layer nodes and one output layer node as an example):
where \(x \in \mathbb {R}^{n}\) is the input variable, \(\beta =(\beta _{1},\beta _{2},\ldots ,\beta _{p})^{T}\) is the weights vector between the hidden layer nodes and the output layer nodes. \(\mathbf h (x)= \left[ h_{1}(x),h_{2}(x),\ldots ,h_{p}(x) \right] \) is the output vector of the hidden layer with respect to the input vector x. \(h_{i}(x)\) is the ith activation function, its input weights vector and bias are \(w_{i}\) and \(b_{i}\).
Figure 2 shows that \(\mathbf h (x)\) actually maps the input variables from the n-dimension to the p-dimensional hidden layer space (ELM feature space), thus, it appears to be a feature mapping function.
The ELM reliably approximates m samples, \(X=[x_{1},\ldots ,x_{m}]\), with minimum error:
where H is hidden layer output matrix,
and \(Y=[y_{1},\ldots ,y_{m}]^{T}\) is the target vector.
The analytical result of this least squares equation is:
where \(H^{\dagger }\) is called Moore-Penrose generalized inverse of matrix H.
3.2 \(L_{21}\)-norm Minimization ELM for Multi-label Learning
In this section, we consider adapting the ELM network to solve the multi-label learning problem. Given the multi-label training data with m samples \((x_{i},y_{i})\), where \(x_{i}=(x_{i1},x_{i2},\ldots ,x_{in})^{T}\in \mathbb {R}^{n}\) and \(y_{i}=(y_{i1},y_{i2},\ldots ,y_{ik})\in \mathbb {R}^{k}\). As shown in the Fig. 2, we set the number of output layer nodes k , which equals the number of labels, and set the number of hidden layer nodes p randomly.
Inspired by ELM, we consider combining the smallest training error of ELM with the \(L_{21}\)-norm minimization of output weights matrix. It is reformulated as following:
where \(\parallel \beta \parallel _{2,1} = \sum _{i=1}^{p}\parallel \beta _{i} \parallel _{2}\) is the \(L_{21}\)-norm of the matrix \(\beta \), and \(\beta _{i}=(\beta _{i1},\beta _{i2},\ldots ,\beta _{ik})\), \(\lambda \) is the regularization parameter.
To solve the nonsmooth optimization problem in Eq. (7), the literature [17] proposed to employ the Nesterov’s optimal method by optimizing its equivalent smooth convex reformulation. When using a constraint to replace the nonsmooth \(L_{21}\)-norm, the original problem can be equivalent to the \(L_{21}\)-ball constrained smooth convex optimization problem as following:
When applying the Nesterov’s optimal method to solve Eq. (8), one key building block of this method is Euclidean projection onto the \(L_{21}\)-ball. The Euclidean projection problem is defined as:
where \(Z = \left\{ \beta \in \mathbb {R}^{p \times k} \mid \parallel \beta \parallel _{2,1} \le z \right\} \) is the \(L_{21}\)-ball and \(z\ge 0^{1}\) is the radius of \(L_{21}\)-ball. To solve the problem in Eq. (9), the Lagrangian variable \(\alpha \) is introduced for the inequality constrain \(\parallel \beta \parallel _{2,1} \le z\) , then we can lead to the Lagrangian function of Eq. (9) as:
Let \(\beta ^{*}\) be the primal optimal point, and \(\alpha ^{*}\) be the dual optimal point. This two points must satisfy the condition: \(\parallel \beta ^{*} \parallel _{2,1} \le z\) and \(\alpha ^{*} \ge 0\). Since considering the strong duality holds of the Slater’s condition, and values of the primal and dual optimal points are equal: \(\alpha ^{*}(\parallel \beta \parallel _{2,1}-z)=0\). Therefore, the primal optimal point \(\beta ^{*}\) can be given by Eq. (11) if the dual optimal point \(\alpha ^{*}\) is known.
where \(u^{i} \in \mathbb {R}^{1\times k}\) is the ith row of U.
According to Eq. (11), \(\beta ^{*}\) can be computed as long as \(\alpha ^{*}\) is solved. Now, the key step is how to compute the unknown dual optimal point \(\alpha ^{*}\). Liu et al. [17] gives the theorem : if \(\parallel U \parallel _{2,1} \le z\) the value of \(\alpha ^{*}\) is zero, otherwise, it can be solved as the unique root of the following auxiliary function.
The Eq. (12) can be solved in O(p) flops by the bisection [18], and it costs O(pk) flops to compute \(\beta ^{*}\) by Eq. (11). Above all, for solving Eq. (7) the time complexity is O(pk). When testing an unseen instance, we will use a threshold function t(x) to determine its associated label set. For an actual outputs \(c_{j}\), \( Y = \left\{ j \mid c_{j} > t(x) \right\} \). An usual solution is to set t(x) to be zero. In this paper, we adopt the threshold category used in literature [19].
4 Evaluation Measures
Being different from the traditional single-label learning system, in multi-label learning an instance usually have one or more labels simultaneously, therefore those classical evaluation methods would be no longer applied in multi-label learning system. For this reason, a series of evaluation metrics for multi-label learning are proposed.
In order to compare our algorithm with other commonly used methods, we adopt five evaluation measures in multi-label learning in this section, including: hamming loss, one-error, coverage, ranking loss and average precision [6, 20, 21]. The following is a look at these measures based on a test data set \(S = \lbrace (x_{i},Y_{i}) \mid 1 \leqslant i \leqslant n \rbrace \) and a trained model \(f(\cdot , \cdot )\) or \(g(\cdot )\).
Hamming loss. This measure evaluates the error rate of all instances on all labels, e.g. a relevant label of an instance is not predicted or an irrelevant one is predicted. the smaller the value of hamming loss, the better the performance.
where \(\bigtriangleup \) stands for the symmetric difference between two sets, m is the total number of labels. It is worth noting that when most of these instances have little correlative labels, it can also get a small value of hamming loss even if all the labels of an instance are predicted in error. Therefore, we should integrate it with other measures.
One-error. This measure evaluates the times that the top-ranked label of an instance is not in its relevant label set. The smaller the value of \(one-error_S(f)\), the better the performance.
One-error mainly focuses on the most relevant label being correct or not, and it don’t pay attention to other labels. Note that, it is equal to ordinary error identically in single-label classification problems.
Coverage. This measure evaluates the average steps we need to go down the ranked-label list for the sake of covering all the relevant labels. The smaller the value of coverage, the better the performance.
where the \(rank_{f}(\cdot ,\cdot )\) is derived from the real-valued function \(f(\cdot , \cdot )\), and the bigger the value of f, the smaller the \(rank_{f}\) is. The performance is perfect when \(coverage_S(f)=0\).
Ranking loss. This measure evaluates the average fraction of the reversely ordered label pairs. The smaller the value of \(rloss_{S}(f)\), the better the performance.
where \(Y_{i}\) and \(\overline{Y_{i}}\) denote the possible and impossible label sets of the instance \(x_{i}\). When the value is zero, it means that all impossible labels follow possible ones.
Average precision. This measure evaluates the average fraction of relevant labels ranked above a particular one \(\ell \in Y_{i}\). It is typically used in information retrieval (IR) system to evaluate the document ranking performance query retrieval [22]. The bigger the value of \(avgpec_{S}(f)\), the better the performance.
where \( L_{i}= \left\{ y' \mid rank_{f}(x_{i},y') \le rank_{f}(x_{i},y),y' \in Y_{i} \right\} \). Note that \(avgpec_{S}(f)\) =1 ranks perfectly, that means there is no instance \(x_{i}\) for which a label not in \(Y_{i}\) is ranked above on a label in \(Y_{i}\).
5 Experimental Results
In this section, \(L_{21}\)-ELM is compared with the performance of the original ELM as well as other common multi-label classification algorithms. The benchmark data sets we used are all in text areas, including: Enron for email analysis, Reuters for text categorization, BibTeX for tags of paper and Yahoo for web page categorization. Table 1 describes the datasets in detail. For Enron and Reuters without pre-separated training and testing sets, therefore, we decide to select 1,500 instances of them for training randomly and the rest data for testing. We repeat the data partition for thirty times randomly, and give the “average results” \(\pm \) “standard deviations”.
Table 2, 3, 4 and 5 shows the comparison results on a single data set. Among them, the symbol “\(\downarrow \)” means the smaller the better, “\(\uparrow \)” on the contrary. HL, OE, Co, RL and AP are the abbreviations of hamming loss, one-error, coverage, ranking loss and average precision respectively, unit of Time (training) is seconds. The number of ELM hidden layer nodes is randomly setted but not more than the training samples and the best results are selected.
Overall, compared with other algorithms, \(L_{21}\)-ELM achieves the best performance in most case. Especially, it shows the absolute advantage on coverage, ranking loss and average precision in all datasets. On hamming loss it is worse than Rank-SVM only on BibTeX data set, and performs better on other cases. On one-error, ECC achieves comparable performance with other approaches. Without consideration of ECC, \(L_{21}\)-ELM outperforms other approaches by the metric of one-error.
Compared with the original ELM approach, \(L_{21}\)-ELM achieves obviously better performance on almost all datasets over all the 5 criteria. This validates the effectiveness of the \(L_{21}\)-norm regularization on the original ELM and eliminating redundant information.
On the training time, the ELM group has faster training time than other approaches. This validates that \(L_{21}\)-ELM could fully inherit the merits of ELM with extreme learning speed. Compared with original ELM, \(L_{21}\)-ELM consumes more training time, but considering its better performance it is worth.
Note that Yahoo is comprised of 11 independent data sets, including: Arts, Business, Computers, Education, Entertainment, Health, Recreation, Reference, Science, Social and Society. We just give the average results over the 11 data sets. From the results as Table 6 shows, our approach could also achieve a better performance relatively.
6 Conclusion
In this paper, we propose a \(L_{21}\)-norm Minimization ELM algorithm for multi-label learning problem, which not only inherits the advantage of ELM but also offers us additional characteristics. Through the constraint of the \(L_{21}\)-norm regularization on the original ELM, the output weights matrix of the hidden layer nodes become sparse and then leading to the simplification of the learning model. Experimental results validate that \(L_{21}\)-ELM has highly competition to state-of-the-art multi-label algorithms (e.g. Rank-SVM, ML-kNN and ECC) especially in training time. Our approach greatly improves the performance of the original ELM, although it sacrifices more time.
References
Huang, G.B., Chen, L.: Convex incremental extreme learning machine. Neurocomputing 70, 3056–3062 (2007)
Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: a new learning scheme of feedforward neural networks. In: 2004 IEEE International Joint Conference on Neural Networks, 2004. Proceedings, vol. 2, pp. 985–990. IEEE (2004)
Huang, G.B., Chen, L., Siew, C.K.: Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans. Neural Netw. 17, 879–892 (2006)
Huang, G.B., Siew, C.K.: Extreme learning machine with randomly assigned RBF kernels. Int. J. Inf. Technol. 11(1), 16–24 (2005)
Huang, G.B., Ding, X., Zhou, H.: Optimization method based extreme learning machine for classification. Neurocomputing 74(1–3), 155–163 (2010)
Zhang, M.L., Zhou, Z.H.: A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 26(8), 1819–1837 (2014)
Tsoumakas, G., Katakis, I.: Multi-label classification: an overview. Int. J. Data Warehousing Min. 2007(3), 1–13 (2007)
Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recogn. 37(9), 1757–1771 (2004)
Tsoumakas, G., Vlahavas, I.: Random k-labelsets: an ensemble method for multilabel classification. Lecture Notes in Computer Science, pp. 406-417 (2007)
Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classier chains for multi-label classification. In: Proceedings of ECML-KDD, vol. 22, no. 4, pp. 829–840 (2009)
Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. Mach. Learn. 85(3), 333–359 (2011)
Zhang, M.L., Zhou, Z.H.: ML-kNN: a lazy learning approach to multi-label learning. Pattern Recogn. 40(7), 2038–2048 (2007)
Clare, A., King, R.D.: Knowledge Discovery in Multi-label Phenotype Data. Lecture Notes in Computer Science, pp. 42–53 (2001)
Elisseeff, A., Weston, J.: A Kernel Method for Multi-labelled Classification, pp. 681–687. MIT Press, USA (2002)
Obozinski, G., Taskar, B., Jordan, M.I.: Multi-task feature selection. Statistics Department, UC Berkeley, Technical Report, 1693–1696 (2006)
Obozinski, G., Taskar, B., Jordan, M.I.: Joint covariate selection for grouped classification. Statistics Department, UC Berkeley, Technical Report (2007)
Liu, J., Ji, S., Ye, J.: Multi-task feature learning via efficient \(L_{21}\)-norm minimization. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 339–348 (2009)
Liu, J., Ye, J.: Efficient Euclidean projections in linear time. In: Proceedings of the Twenty-Sixth Annual International Conference on Machine Learning, pp. 657–664. ACM, (2009)
Zhang, M.L., Zhou, Z.H.: Multi-label neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 18(10), 1338–1351 (2006)
Gjorgji, M.A., Dejan, G.A.: Two stage architecture for multi-label learning. Pattern Recogn. 45(3), 1019–1034 (2012)
Schapire, R.E., Singer, Y.: Boostexter: a boosting-based system for text categorization. Mach. Learn. 39(2–3), 135–168 (2000)
Salton, G.: Developments in automatic text retrieval. Science 253(5023), 974–980 (1991)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Jiang, M., Li, N., Pan, Z. (2016). Multi-label Text Categorization Using \(L_{21}\)-norm Minimization Extreme Learning Machine. In: Cao, J., Mao, K., Wu, J., Lendasse, A. (eds) Proceedings of ELM-2015 Volume 1. Proceedings in Adaptation, Learning and Optimization, vol 6. Springer, Cham. https://doi.org/10.1007/978-3-319-28397-5_10
Download citation
DOI: https://doi.org/10.1007/978-3-319-28397-5_10
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28396-8
Online ISBN: 978-3-319-28397-5
eBook Packages: EngineeringEngineering (R0)