1 Introduction

Text categorization (TC) is the task of automatically assigning predefined categories to a given text document based on its content [1]. A growing number of machine learning techniques have been used for TC, such as probabilistic models [2], k-nearest neighbor (KNN) [3], neural networks [4-6], support vector machines (SVM) [7, 8], and logistic regression [9, 10].

Among the above methods, SVM has been regarded as one of the most successful in TC [1, 8, 11]. However, SVM has some disadvantages; for example, learning its parameters is usually time-consuming [7, 12]. Moreover, extending the learning algorithm from binary to multi-class classification further increases the computational cost.

Neural networks are also an efficient and popular approach for TC, with which multiclass classification can be implemented easily [13]. Generally, the free parameters of a neural network are learnt via gradient descent algorithms [14], which are relatively slow and raise many convergence-related issues such as stopping criteria, learning rate, learning epochs, and local minima.

Recently, Huang et al. [15] proposed a novel learning algorithm for single-hidden layer feedforward neural networks called the extreme learning machine (ELM). It has been shown that ELM not only learns much faster and with higher generalization performance than traditional gradient-based neural network methods but also avoids the convergence difficulties mentioned above [16, 17]. However, a potential disadvantage of ELM is that it tends to require more hidden neurons than conventional tuning-based algorithms in many cases [18], so its scale becomes remarkably large when the input dimensionality is high, as is the case for text data.

In this article, we propose a novel approach based on a regularization extreme learning machine (RELM) for TC. First, latent semantic analysis (LSA) [19] is used to obtain a semantic representation of the text and to reduce the dimensionality to fit the input scale of ELM. Next, a regularization term is added to the linear system of ELM to construct a RELM whose output weights are obtained analytically. The aim of the regularization term is to help RELM overcome the overfitting problem, since the dimensionality of the semantic space may still be high after applying LSA, which can result in estimated weights with low bias but large variance. Finally, a TC algorithm is developed that covers both uni-label and multi-label situations.

The major contributions of this article are as follows: (1) introducing a regularization term into the linear system of ELM, together with its analytical solution and a theoretical proof; (2) proposing a framework combining LSA and RELM for TC (including uni-label and multi-label cases); (3) giving some experimental suggestions about parameter selection for TC based on RELM.

The rest of this article is organized as follows. Section 2 introduces the preliminaries and related works. Section 3 explains the proposed method in detail. Experimental results and analysis are shown in Sect. 4. Finally, we summarize the conclusions in Sect. 5.

2 Preliminaries and related work

A brief review of ELM is presented, and related work on neural networks for TC is introduced.

2.1 A review of extreme learning machine

ELM is a learning algorithm for single-hidden layer feedforward networks (SLFNs) in which the input weights are chosen randomly and the output weights are calculated analytically. For N arbitrary distinct samples \({(x_i,t_i) \in {\mathbb{R}}^k \times {\mathbb{R}}^m,}\) an SLFN with \(\tilde N\) hidden nodes and activation function g(x) is mathematically modeled as

$$ {o_j} = \sum\limits_{i = 1}^{\tilde N} {{\beta _i}} g({w_i} \cdot {x_j} + {b_i}), \quad j = 1,\ldots,N, $$
(1)

where \({w_i} = {[{w_{i1}},{w_{i2}}, \ldots ,{w_{ik}}]^{\rm{T}}}\) is the weight vector connecting the ith hidden node and the input nodes, \({w_i} \cdot {x_j}\) denotes the inner product of \(w_i\) and \(x_j\), \(b_i\) is the threshold of the ith hidden node, and \({\beta _i} = {[{\beta _{i1}},{\beta _{i2}}, \ldots ,{\beta _{im}}]^{\rm{T}}}\) is the weight vector connecting the ith hidden node and the output nodes. If an SLFN with \(\tilde N\) hidden nodes can approximate these N samples with zero error (i.e. \(\sum\nolimits_{j = 1}^{N} {||{o_j} - {t_j}} || = 0\)), there exist \(\beta_i\), \(w_i\) and \(b_i\) such that

$$ \sum\limits_{i = 1}^{\tilde N} {{\beta _i}} g({w_i} \cdot {x_j} + {b_i})= t_j,\quad j = 1,\ldots,N. $$
(2)

The above N equations can be written compactly as

$$ H\beta=T, $$
(3)

where

$$ H = {\left[ {\begin{array}{lll} {g({w_1} \cdot {x_1} + {b_1})}& \cdots & {g({w_{\tilde N}} \cdot {x_1} + {b_{\tilde N}})} \\ \vdots & \ddots & \vdots \\ {g({w_1} \cdot {x_N} + {b_1})} & \cdots & {g({w_{\tilde N}} \cdot {x_N} + {b_{\tilde N}})} \\ \end{array}} \right]_{N \times \tilde N}}, $$
(4)
$$ \beta = {\left[ {\begin{array}{l} {\beta _1^{\rm{T}}} \\ \vdots \\ {\beta _{\tilde N}^{\rm{T}}} \\ \end{array}} \right]_{\tilde N \times m}}\quad{\rm{ and}}\quad T = {\left[ {\begin{array}{l} {t_1^{\rm{T}}} \\ \vdots \\ {t_{N}^{\rm{T}}} \\ \end{array}} \right]_{N \times m}}, $$
(5)

H is called the hidden layer output matrix of the neural network; the ith column of H is the ith hidden node output with respect to inputs \(x_{1}, x_{2}, \ldots , x_{N}\).

Huang et al. [15, 16] proved that the hidden node parameters may be randomly chosen and fixed with almost any nonzero activation function, and the output weights can then be determined analytically when approximating any continuous target function on any compact input set. Therefore, (3) becomes a linear system and the output weights β are estimated as

$$ \hat {\beta} = {H^{\dag}}T, $$
(6)

where \(H^{\dag}\) is the Moore-Penrose generalized inverse of the hidden layer output matrix H. Thus, the output weights β are calculated in a single step, which avoids any long training procedure in which the network parameters are adjusted iteratively with appropriately chosen control parameters.
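As an illustration, the following is a minimal sketch of this training step in Python/NumPy. It is not an implementation from the paper; the sigmoid activation, the uniform random initialization, and the function names are assumptions made only for the example:

import numpy as np

def elm_train(X, T, n_hidden, seed=0):
    # X: (N, k) input samples; T: (N, m) targets.
    rng = np.random.default_rng(seed)
    k = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(k, n_hidden))   # random input weights w_i
    b = rng.uniform(-1.0, 1.0, size=n_hidden)        # random hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))           # hidden layer output matrix, Eq. (4)
    beta = np.linalg.pinv(H) @ T                     # Moore-Penrose solution, Eq. (6)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                                  # network outputs o_j, Eq. (1)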

Huang et al. [20] also showed that, from the standard optimization point of view, ELM for classification is equivalent to SVM, but ELM has fewer optimization constraints due to its special separability feature.

2.2 Neural networks for text categorization

Neural networks have been applied to TC tasks for many years. In [4], Ng et al. used perceptrons to construct a text classifier and reported a surprisingly high performance. Moreover, multilayer perceptrons have been used for subject categorization [21] and authorship attribution [22]. To overcome the high-dimensionality problem, Wang and Yu [5] introduced a combination of a modified back propagation neural network (BP) and LSA; they used LSA to map the high-dimensional term space into a low-dimensional semantic space, so the dimensionality was reduced dramatically and the performance was reported to improve. However, the learning process is quite slow because of convergence issues. In [23], Liu et al. introduced ELM for TC and reported a performance comparison with SVM; they pointed out experimentally that SVM still outperforms ELM in terms of the F1 value (see Sect. 4.2 for the definition of F1) even though ELM has higher accuracy. Nevertheless, they did not consider the multi-label case of TC, and the learning and classification times were not reported. In contrast, our RELM method tends to have better generalization performance due to the regularization constraint, and our classification algorithm can deal with both uni-label and multi-label TC.

3 Text categorization based on regularization extreme learning machine

Generally, TC based on machine learning techniques consists of three parts: a text representation method, a classification algorithm, and performance evaluation. LSA is a classical text representation method, which not only greatly reduces the dimensionality but also discovers important associative relationships between terms [19]. Thus, LSA is used to project the original high-dimensional term vectors into low-dimensional semantic vectors for text representation. Next, a regularization extreme learning machine (RELM) and its solution are presented. Finally, a TC algorithm based on RELM is developed.

3.1 Representation of text

Given a document \(d = ({t_1},{t_2}, \ldots ,{t_n})^{\rm{T}},\) where n is the dimensionality of the term space, the tfidf value [24] for each term is defined as:

$$ tfidf({t_i},d) = tf({t_i},d) \times idf({t_i}), $$
(7)

where \(tf(t_i, d)\) denotes the number of times that \(t_i\) occurs in d, and \(idf(t_i)\) is the inverse document frequency, defined as \(idf(t_i) = \log(N/df(t_i))\), where N is the number of documents in the training set and \(df(t_i)\) denotes the number of documents in the training set in which \(t_i\) occurs at least once. A document can then be represented as a vector:

$$ d = {({w_1},{w_2}, \ldots ,{w_n})^{\rm{T}}}, $$
(8)

where \({w_i} = tfidf({t_i},d)/\sqrt {\sum\nolimits_{j=1}^n {tfidf{{({t_j},d)}^2}} }\).
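As an illustration of (7)-(8), the sketch below builds the normalized tfidf vectors in Python/NumPy; the tokenized-document input format and the function name are assumptions made for the example, not part of the original method description:

import numpy as np

def tfidf_matrix(docs, vocab):
    # docs: list of token lists; vocab: list of terms (the term space).
    # Returns an (n, N) matrix of L2-normalized tfidf weights, Eqs. (7)-(8).
    index = {t: i for i, t in enumerate(vocab)}
    N, n = len(docs), len(vocab)
    tf = np.zeros((n, N))
    for j, doc in enumerate(docs):
        for tok in doc:
            if tok in index:
                tf[index[tok], j] += 1.0             # tf(t_i, d_j)
    df = np.count_nonzero(tf, axis=1)                # df(t_i)
    idf = np.log(N / np.maximum(df, 1))              # idf(t_i) = log(N / df(t_i))
    D = tf * idf[:, None]                            # tfidf(t_i, d_j), Eq. (7)
    norms = np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D / norms                                 # normalized weights w_i, Eq. (8)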

Here, all document vectors are combined into a term-by-document matrix \({D \in {\mathbb{R}}^{n \times N}},\) where n and N are commonly considerably large. Generally, the matrix D is not suitable to be used directly as input for ELM. Consider the singular value decomposition \(D = U \Upsigma {V^{\rm{T}}},\) where U and V are orthogonal matrices and \(\Upsigma = {\rm{diag}}({\sigma _1},{\sigma _2}, \ldots ,{\sigma _n})\) is the diagonal matrix of singular values. The best rank-k approximation of D is \({D_k} = {U_k} {\Upsigma _k} V_k^{\rm{T}},\) where \(U_k\) comprises the first k columns of U and \(V_k^{\rm T}\) comprises the first k rows of \(V^{\rm T}\) corresponding to the k largest singular values, which form the diagonal matrix \(\Upsigma_k = {\rm{diag}}({\sigma _1},{\sigma _2}, \ldots ,{\sigma _k})\). Thus \(D_k\) captures most of the important latent semantic structure of the term-by-document matrix D. Consequently, a document vector \(d = {({w_1},{w_2}, \ldots ,{w_n})^{\rm{T}}}\) can be projected from the term space into the k-dimensional semantic space and represented by

$$ \hat d = {d^{\rm{T}}}{U_k}\Upsigma _k^{ - 1}. $$
(9)

Therefore, the dimensionality is reduced from n to k, and all training and test examples can be represented in this way.
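A corresponding sketch of the rank-k projection in (9), again in Python/NumPy and with illustrative function names; it assumes the term-by-document matrix D has already been built as above:

import numpy as np

def lsa_fit(D, k):
    # Truncated SVD of the term-by-document matrix D (n x N);
    # returns U_k (n x k) and the k largest singular values.
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return U[:, :k], s[:k]

def lsa_project(d, U_k, s_k):
    # Project term vectors into the k-dimensional semantic space,
    # Eq. (9): d_hat = d^T U_k Sigma_k^{-1}.
    # d may be a single n-vector or an (n x M) matrix of column vectors.
    return d.T @ U_k / s_k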

3.2 Regularization extreme learning machine

Assuming \({X \in {\mathbb{R}}^{k \times N}}\) is a training example matrix obtained by (9), the ELM with \(\tilde N\) hidden nodes and activation function g(x) is mathematically modeled as Hβ = T (3), where H is the hidden layer output matrix of the neural network (4). Solving this linear system is equivalent to finding a least-squares solution \(\hat \beta\) that satisfies the following equation:

$$ ||H\hat \beta - T||_{\rm{F}}^2 = \mathop {\min }\limits_\beta ||H\beta - T||_{\rm{F}}^2, $$
(10)

where \(|| \cdot |{|_{\rm{F}}}\) is the Frobenius norm.

For text data, the high-dimensional and sparse characteristics might lead to overfitting. The estimated weights \(\hat \beta\) often have low bias but large variance, so that the model performs well on the training set but poorly on any other set. Regularization [25] is an effective way to deal with this problem, sacrificing a little bias to reduce the variance of the predicted values and hence possibly improving the overall prediction accuracy.

Many regularization methods have been used for linear systems, for example, ridge regression [26], the lasso [27], the elastic net [28], and even the nonconvex \(l_{1/2}\) regularizer [29] and the minimax concave penalty [30]. Nevertheless, these approaches, except ridge regression, need an iterative estimation algorithm. In order to keep the advantage that the linear system of ELM can be solved analytically, we use the Frobenius norm as a regularization term and rewrite (10) as

$$ ||H\hat \beta - T||_{\rm{F}}^2 + \lambda ||\hat \beta ||_{\rm{F}}^2 = \mathop {\min }\limits_\beta (||H\beta - T||_{\rm{F}}^2 + \lambda ||\beta ||_{\rm{F}}^2), $$
(11)

where λ is a parameter that controls the trade-off between the approximation error and the degree of regularization; \(\hat \beta\) can be obtained by Theorem 1.

Theorem 1

When λ is a positive constant, the minimization problem of Eq. (11) has an optimal solution, which is given by:

$$ \hat \beta = {({H^{\rm{T}}}H + \lambda {\rm{I}})^{ - 1}}{H^{\rm{T}}}T. $$
(12)

Proof

Let us denote the objective function of (11) by l(β), i.e. \(l(\beta) = ||H\beta - T||_{\rm F}^2 + \lambda ||\beta||_{\rm F}^2\). Setting

$$ \frac{{\rm{d}}l(\beta )}{{\rm{d}}\beta } = 0, $$
(13)

we have

$$ ({H^{\rm{T}}}H + \lambda {\rm{I}})\beta = {H^{\rm{T}}}T, $$
(14)

Since \(({H^{\rm{T}}}H + \lambda {\rm{I}})\) is an invertible matrix when λ > 0, we obtain

$$ \beta = {({H^{\rm{T}}}H + \lambda {\rm{I}})^{ - 1}}{H^{\rm{T}}}T. $$
(15)

The second-order derivative of l(β) w.r.t. β is

$$ \frac{{{\rm{d}}^2}l(\beta )}{{\rm{d}}\beta \, {\rm{d}}{\beta ^{\rm{T}}}} = 2({H^{\rm{T}}}H + \lambda {\rm{I}}), $$
(16)

which is a positive definite matrix when λ > 0. Therefore, (12) is the optimal solution of (11) when λ > 0. \(\square\)

It should be noted that a similar result is also mentioned in [17] from the point of view of stability. Recently, Huang et al. [31] presented a more general discussion of constrained-optimization-based ELM, in which different solutions can be obtained depending on efficiency concerns for different sizes of training datasets.
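For reference, a minimal sketch of the closed-form solution (12) in Python/NumPy; solving the regularized normal equations with np.linalg.solve rather than forming the explicit inverse is our numerical choice, not something prescribed by the paper:

import numpy as np

def relm_output_weights(H, T, lam):
    # RELM output weights from Theorem 1, Eq. (12):
    # beta = (H^T H + lambda * I)^{-1} H^T T, valid for lambda > 0.
    A = H.T @ H + lam * np.eye(H.shape[1])
    return np.linalg.solve(A, H.T @ T)   # solve the normal equations (14) directly

With λ set to 0 and H of full column rank, this reduces to the unregularized least-squares solution (6).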

3.3 Algorithm for text categorization

Generally, ELM focuses on function approximation applications. For the classification problem, we need a category discrimination function.

Here, we code the category information as the target vectors of the training set. In order to represent the coding for uni-label and multi-label corpora uniformly, we define the target vector corresponding to a document d as

$$ t = {({b_1}, \ldots ,{b_i}, \ldots ,{b_m})^{\rm{T}}}, $$
(17)

where m is the number of categories in the corpus, and \(b_i\) is equal to 1 or −1 depending on whether the document belongs to the corresponding category. For example, supposing there are five categories (m = 5) and a document \(d = {({w_1},{w_2}, \ldots ,{w_k})^{\rm{T}}}\) belongs to the first and the fourth categories, the related target vector is \(t = (1, -1, -1, 1, -1)^{\rm T}\). For the test set, the output target matrix can be evaluated as

$$ Y = \tilde H\hat \beta, $$
(18)

where \(\tilde H\) is the hidden layer output matrix of the testing data.

For a uni-label corpus, we define the category discrimination function as:

$$ Category(d_j) = \mathop {\arg\max }\limits_i \, Y_{ji}, $$
(19)

where \(d_j\) is the jth sample, \(Y_j\) is the jth row output vector of Y, and \(Y_{ji}\) is its ith component.

For a multi-label corpus, we define the category discrimination function as:

$$ Category(d_j) = \{\, i \mid Y_{ji} > \theta \,\}, $$
(20)

where \(d_j\) is the jth sample, \(Y_{ji}\) is the ith component of the jth row output vector of Y, and θ = 0 or can be estimated by cross-validation.

Algorithm 1 gives the implementation pseudo code of TC based on RELM.

Algorithm 1 TC based on RELM
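As a concrete reading of Algorithm 1, the following Python sketch assembles the steps of Sects. 3.1-3.3 (semantic representation, RELM training via Eq. (12), and the discrimination rules (19)-(20)). The radial-basis hidden layer with randomly drawn centers and a shared width, as well as all names and defaults, are our assumptions for illustration, not the authors' pseudo code:

import numpy as np

def rbf_hidden(X, C, sigma):
    # Hidden-layer outputs for radial basis nodes; rows of C are random
    # centers and sigma is a shared width (both assumptions of this sketch).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_relm_tc(X_sem, T, n_hidden, lam, seed=0):
    # X_sem: (N, k) semantic vectors from Eq. (9); T: (N, m) coded targets, Eq. (17).
    rng = np.random.default_rng(seed)
    lo, hi = X_sem.min(axis=0), X_sem.max(axis=0)
    C = rng.uniform(lo, hi, size=(n_hidden, X_sem.shape[1]))   # random centers
    sigma = float(np.mean(hi - lo)) + 1e-12                    # crude shared width
    H = rbf_hidden(X_sem, C, sigma)
    beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ T)  # Eq. (12)
    return C, sigma, beta

def classify(X_sem, C, sigma, beta, multi_label=False, theta=0.0):
    Y = rbf_hidden(X_sem, C, sigma) @ beta                     # Eq. (18)
    if multi_label:
        return [np.flatnonzero(y > theta) for y in Y]          # Eq. (20)
    return Y.argmax(axis=1)                                    # Eq. (19)

A caller would feed X_sem produced by the LSA projection of Sect. 3.1 and tune n_hidden and lam as suggested in Sect. 4.3.4.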

4 Experiments

This section first introduces the datasets used in the experiments; then the performance evaluation measures are given. Finally, the results and analysis are presented.

Some commonly used notations are listed in Table 1 for convenience. Moreover, boldface in a table indicates the better performance under the same setting.

Table 1 Some commonly used notations in the experiments

4.1 Datasets

Two popular TC benchmarks are tested in our experiments: Reuters-21578 and WebKB. The Reuters-21578 dataset is a standard multi-label TC benchmark and contains 135 categories. In our experiments, we use a subset of the data collection that includes the 10 most frequent categories among the 135 topics, which we call Reuters-top10. We divide it into training and testing sets using the standard "ModApte" split. The pre-processing procedure includes removing stop words, switching upper case to lower case, stemming, and removing low-frequency words (occurring fewer than three times). After that, 5,920 training documents and 2,315 testing documents with 5,585 term features are obtained.

The WebKB dataset is a standard uni-label TC benchmark containing web pages gathered from university computer science departments. We use the subset called WebKB4, which includes the four most populous entity-representing categories. After the same pre-processing procedure, 2,777 training documents and 1,376 testing documents with 7,287 term features are obtained.

4.2 Evaluation measures

In TC, the most commonly used performance measures are recall, precision, and their harmonic mean F1 [1]. Given a specific category \(c_i\) from the category space \({\{c_{1}, \ldots ,c_{m}\}},\) the corresponding recall (\(Re_i\)) and precision (\(Pr_i\)) are defined by:

$$ R{e_i} = \frac{{T{P_i}}}{{T{P_i} + F{N_i}}}, \quad P{r_i} = \frac{{T{P_i}}}{{T{P_i} + F{P_i}}}, $$
(21)

where \(TP_i\) (true positives) is the number of documents assigned correctly to class i, \(FP_i\) (false positives) is the number of documents that do not belong to class i but are assigned to it incorrectly, and \(FN_i\) (false negatives) is the number of documents that actually belong to class i but are not assigned to it. The corresponding \(F_{1i}\) is defined as:

$$ {F_1}_i = \frac{{2 \times R{e_i} \times P{r_i}}}{{R{e_i} + P{r_i}}}. $$
(22)

The average performance of a binary classifier over multiple categories is derived from the micro-average and the macro-average. For micro-averaging, the measures are computed globally without categorical discrimination. The micro-averaged recall \(\widehat{Re}^U\) and micro-averaged precision \({\widehat{Pr}^U}\) are defined as:

$$ {\widehat{Re}^U} = \frac{{\sum\nolimits_{i = 1}^m {\left| {T{P_i}} \right|} }}{{\sum\nolimits_{i = 1}^m {(\left| {T{P_i}} \right| + \left| {F{N_i}} \right|)} }}, \quad {\widehat{Pr}^U}= \frac{{\sum\nolimits_{i = 1}^m {\left| {T{P_i}} \right|} }}{{\sum\nolimits_{i = 1}^m {(\left| {T{P_i}} \right| + \left| {F{P_i}} \right|)} }}, $$
(23)

and the micro-averaged F 1 is defined as

$$ \text{micro-averaged}\;{F_1} = \frac{{2 \times {{\widehat{Pr}}^U} \times {{\widehat{Re}}^U}}}{{{{\widehat{Pr}}^U} + {{\widehat{Re}}^U}}}. $$
(24)

For macro-averaging, the F-measure is computed locally over each category \(c_i\) first, and then the average over all categories is taken:

$$ \text{macro-averaged}\;{F_1} = \left(\sum\limits_{i=1}^m {{F_1}_i}\right)/{m}. $$
(25)

To evaluate the overall performance, we adopt the micro-averaged F1 and the macro-averaged F1 as the performance measures.
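Both averages can be computed directly from per-category contingency counts; the short sketch below (Python/NumPy, with assumed names) mirrors Eqs. (21)-(25):

import numpy as np

def micro_macro_f1(tp, fp, fn):
    # tp, fp, fn: length-m arrays of per-category counts.
    tp, fp, fn = (np.asarray(a, dtype=float) for a in (tp, fp, fn))
    pr = tp / np.maximum(tp + fp, 1.0)               # precision, Eq. (21)
    re = tp / np.maximum(tp + fn, 1.0)               # recall, Eq. (21)
    f1 = 2 * pr * re / np.maximum(pr + re, 1e-12)    # Eq. (22)
    macro = f1.mean()                                # macro-averaged F1, Eq. (25)
    PR = tp.sum() / max(tp.sum() + fp.sum(), 1.0)    # micro precision, Eq. (23)
    RE = tp.sum() / max(tp.sum() + fn.sum(), 1.0)    # micro recall, Eq. (23)
    micro = 2 * PR * RE / max(PR + RE, 1e-12)        # micro-averaged F1, Eq. (24)
    return micro, macro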

4.3 Results and analysis

To verify the performance of RELM, we compare it with the standard ELM [23], the back propagation neural network (BP) [5], and SVM [7]. All experiments are carried out in a MATLAB 2010a environment running on a 2.8 GHz CPU with 8 GB of memory. For each experiment, we run the test 10 times and take the averaged values as the results. Because all experiments use the same representation method (LSA), we do not count the time cost of the dimensionality reduction. All experiments below using RELM or ELM take the radial basis function as the activation function.

4.3.1 Comparison with ELM

The performance comparisons between ELM and RELM are presented in Tables 2 and 3. For brevity, we only give three situations (about 1, 2, and 5% of the original dimensionality); the others are similar. As these tables show, the training speed of RELM is much faster than that of ELM. The performance of RELM increases as the number of hidden nodes increases; however, the performance of ELM becomes unstable and drops dramatically when the number of hidden nodes reaches certain values, which might be caused by overfitting. Therefore, RELM obtains more stable performance and possesses the potential to improve performance by increasing the scale of the network.

Table 2 Performance comparison between RELM and ELM in Reuters-top10
Table 3 Performance comparison between RELM and ELM in WebKB4

4.3.2 Comparison with BP

In the BP experiments, we assigned a small value to the number of hidden nodes because the training time increases tremendously as the number of hidden nodes increases, while the performance does not necessarily increase at the same time. Even with #node = 50, the time cost is remarkable. As shown in Table 4, the training speed of BP is much slower than that of RELM, and the performance of BP is inferior to RELM in most cases. So, RELM is clearly better than BP in our experiments.

Table 4 Performance comparison between RELM and BP

4.3.3 Comparison with SVM

Table 5 presents the comparison between RELM and SVM on Reuters-top10 and WebKB4. The dimensionality ranges from about 1 to 10% of the number of original features. From this table, we can observe that the F1 performance of RELM is slightly lower than that of SVM in most cases; however, the speed of RELM is much faster than SVM, especially when the dimensionality is relatively large.

Table 5 Performance comparison between RELM and SVM

4.3.4 Parameter discussions

The parameters of RELM include the activation function, the input dimensionality, the number of hidden nodes, and the regularization factor λ. Although these parameters also need to be tuned, they are actually easy to determine for TC. Here, we give some suggestions on how to tune these parameters experimentally. Since the full experimental results are too numerous to present, we only show some typical cases or give the conclusions directly for conciseness.

Regarding the selection of the activation function for TC, an experimental suggestion is: radial basis function ≥ triangular basis function ≥ sine function ≫ sigmoid ≥ hard limit function, where ≥ means the performance is slightly superior and ≫ means the performance is considerably superior.

From the experimental results above and Figs. 1 and 2, we can see that the performance is very poor when the dimensionality is exceedingly small (commonly <1% of the original dimensionality) or when it is close to or larger than the number of hidden nodes (these cases can be observed in the lines of #node = 200 in Fig. 1 and #node = 300 in Fig. 2). Apart from that, the performance is almost insensitive to the dimensionality, which means the dimensionality can be selected freely in a specific interval (commonly 2-5% of the original dimensionality) when the number of hidden nodes is large enough.

Fig. 1 Performance in Reuters-top10 while the dimensionality varies (the dimensionality can be selected randomly in the suggested interval when #node is relatively large)

Fig. 2 Performance in WebKB4 while the dimensionality varies (the dimensionality can be selected randomly in the suggested interval when #node is relatively large)

We also observe that the number of hidden nodes is an important factor. In Figs. 3 and 4, the performance increases as the number of hidden nodes increases until it reaches a stable level. Therefore, the number of hidden nodes should be taken large enough, for example, greater than 10 times the input dimensionality.

Fig. 3 Performance in Reuters-top10 while #node varies (the performance increases as #node increases until it reaches a stable level)

Fig. 4 Performance in WebKB4 while #node varies (the performance increases as #node increases until it reaches a stable level)

For the regularization factor λ, an experimental suggestion is \(\lambda \in [0.1, 20]\). The selection rule is: the larger the scale of RELM, the larger λ should be, i.e. λ should be increased as the dimensionality and the number of hidden nodes increase. Generally, λ can be selected in the interval \([0.1, 1]\) when the scale of RELM is small (for example, the cases #dim = 50 in Fig. 5 or #dim = 70 in Fig. 6), and in the interval \([10, 20]\) when the scale of RELM is relatively large (for example, the cases #dim = 150 in Fig. 5 or #dim = 300 in Fig. 6).

Fig. 5 Performance in Reuters-top10 while λ varies (λ can be selected in the suggested interval according to the scale of RELM)

Fig. 6 Performance in WebKB4 while λ varies (λ can be selected in the suggested interval according to the scale of RELM)

In summary, the radial basis function and the triangular basis function are suggested as activation functions for TC, and the input dimensionality can be selected randomly in the proposed interval (commonly 2-5% of the original dimensionality). Moreover, the number of hidden nodes should be relatively large (commonly greater than 10 times the input dimensionality), and λ should be tuned in \([0.1, 20]\) according to the rule: the larger the scale of RELM, the larger λ.

5 Conclusions

In this article, a regularization ELM is presented, together with its analytical solution and a theoretical proof. Moreover, a TC framework combining LSA and RELM is proposed, and an algorithm covering both uni-label and multi-label classification for TC is developed. The experimental results show that the proposed method produces good performance in most cases and learns faster than conventional popular learning algorithms such as feedforward neural networks and support vector machines.

The features of the proposed approach include much faster learning and classification speed, ease of implementation, and minimal human intervention. It may become a promising technique for TC and its applications such as news section classification, quick text retrieval, and real-time topic tracking.

In future research, we will try to incorporate cognitive information into RELM to further improve the classification performance.