Study on suitability and importance of multilayer extreme learning machine for classification of text data

Roul, Rajendra Kumar; Asthana, Shubham Rohan; Kumar, Gaurav

doi:10.1007/s00500-016-2189-8


Foundations
Published: 21 June 2016

Volume 21, pages 4239–4256, (2017)

Soft Computing




Abstract

The dynamic Web, which contains a huge number of digital documents, is expanding day by day, so searching for a particular document in such a large collection has become a tough challenge. Text classification can speed up search and retrieval tasks and is therefore the need of the hour. Aiming in this direction, this study proposes an efficient technique that uses the concept of the connected component (CC) of a graph and Wordnet along with four established feature selection techniques [TF-IDF, Chi-square, Bi-Normal Separation (BNS) and Information Gain (IG)] to select the best features from a given input dataset and prepare an efficient training feature vector. Next, multilayer extreme learning machine (ML-ELM), which is based on the architecture of deep learning, and other state-of-the-art classifiers are trained on this feature vector for classification of text data. The experimental work has been carried out on the DMOZ and 20-Newsgroups datasets. We have studied the behavior and compared the results of different classifiers under these four feature selection techniques. ML-ELM achieved the maximum overall F-measure among the compared classifiers: 72.28% on the DMOZ dataset using TF-IDF and 81.53% on the 20-Newsgroups dataset using BNS, which signifies the usefulness of the deep learning used by ML-ELM for classifying text data. Experimental results on these benchmark datasets show the stability and effectiveness of our approach over other competing approaches.


1 Introduction

The Web has indexed at least 4.76 billion documents.^{Footnote 1} Organizing these documents effectively is a real challenge for present-day search engines. The ultimate aim of a search engine is to satisfy the internet user, who looks for the desired information with every query. Searching for this information on the net is the most time-consuming job; if it is done efficiently, the user can effectively absorb and use the knowledge in the documents. Text classification is an attempt in this direction: it not only reduces the searching time but also makes the required information available to the user. It is a vital topic in machine learning, where learning is done over text. Classification is a well-known machine learning technique in which a set of labeled data is used to train the classifier before it is applied to the test dataset to decide the target class. Based on the number of classes involved, classification falls into two broad categories: binary classification, where a test instance is assigned to one of two predefined classes, and multi-class classification, where a test instance is assigned to one of more than two classes. To classify text data effectively, selecting the top features is essential; this gives rise to feature selection, on which the generalization capability of a machine learning algorithm depends. The performance of a classifier depends on how robust the feature vector is. Feature selection reduces the number of features by selecting a subset that helps in building the required model. It is important in text classification for two main reasons:

1.
The effective number of features is reduced, and hence, training the classifier consumes less network bandwidth, time and storage.
2.
Classification errors due to noise features are eliminated, and thus, the accuracy of the classification process improves.

Generally, feature selection methods are either unsupervised or supervised. As the name suggests, unsupervised methods require no class labels to select the top features, whereas supervised methods do. Some of the unsupervised feature selection methods are ‘document frequency,’ ‘term contribution,’ ‘TF-IDF metric,’ etc. Supervised feature selection methods are further divided into two sub-categories: accuracy-based and correlation-based.

1.
Accuracy-based: This method chooses the features that occur most often in the positive class and least often in the negative class. Some of the existing methods are ‘odds ratio,’ ‘probability ratio,’ ‘GU metric,’ ‘Bi-Normal Separation (BNS),’ ‘power metric,’ ‘Fisher criterion,’ etc.
2.
Correlation-based: This method evaluates the features by finding their correlation with the various classes and chooses the features which have the highest correlation score. For example, ‘Chi-square metric,’ ‘NGL coefficient,’ ‘GSS coefficient,’ ‘MI-judge’ and ‘Information Gain’ are some of the existing correlation-based methods.

The techniques used for feature selection are categorized as wrapper, filter and embedded methods. To construct a feature set, wrapper and embedded methods need the involvement of a classifier, which increases the running time and makes them computationally intensive. The filter method, in contrast, does not require any classifier interaction to prepare the feature set and is hence preferable to the other two.

After feature selection, the next important factor affecting the text classification process is an efficient classifier. Many traditional classifiers exist for text classification, including decision trees, k-nearest neighbor, Naive Bayes and SVM. However, they have their own limitations, and most of them use shallow neural network algorithms, which are restricted in their capability to approximate complex functions. Deep learning has aroused interest in the past decade in many research domains such as computer vision, automatic speech recognition and pattern recognition, and has recently attracted much attention in the field of machine learning. It is a multilayer perceptron artificial neural network algorithm. Deep learning does not suffer from the above restriction: it removes the difficulty of optimization associated with deep models (Ding et al. 2015) and achieves an approximation of complex functions. Extreme Learning Machine (Huang et al. 2006) is able to approximate any complex nonlinear mapping directly from the training samples, but it has a shallow architecture similar to traditional single-hidden-layer feed-forward networks (SLFNs). Hence, it may need a large network to fit highly variant input data perfectly, which is difficult to implement. The recently designed multilayer ELM (Kasun et al. 2013) addresses this issue: it combines deep learning (i.e., the ELM autoencoder) with ELM, decomposes the original input data into multiple hidden layers and performs unsupervised learning layer-wise.

Considering that informative features and an efficient classifier together yield good text classification performance, this study uses ML-ELM as the classifier, which quickly earned a name in the field of machine learning owing to its fast speed, easy implementation and ability to handle a large volume of data. To prepare an efficient feature vector, we have considered four standard feature selection techniques that are generally used for text classification: TF-IDF, Chi-square, BNS and IG. The concept of the connected component of a graph, along with Wordnet, helps us select the top features from each class of a given corpus after calculating the TF-IDF/Chi-square/BNS/IG score for each feature (i.e., keyword) of a class. Finally, the reduced feature vectors of all classes are combined to form the final reduced feature vectors, one for each feature selection technique (i.e., TF-IDF, Chi-square, BNS and IG). ML-ELM and other traditional classifiers, including ELM and SVM, are trained on these final reduced feature vectors for the classification of text data. The experimental work is carried out on two benchmark datasets: DMOZ (Open Directory Project) and 20-Newsgroups. The performance of the different classifiers is compared in the experimental section, where ML-ELM is observed to outperform the other established classifiers, including ELM and SVM. The empirical results show that the performance of the proposed approach is promising compared to other existing approaches.

The paper is organized as follows: Sect. 2 reviews the literature on different classification techniques used for text data. Section 3 describes different existing feature selection techniques and the model structure of ELM and ML-ELM. The proposed approach for classifying text data is discussed in Sect. 4. Section 5 analyzes the empirical results and compares the proposed approach with other existing approaches. We conclude our work with some future enhancements in Sect. 6.

2 Literature review

Recently, ELM and ML-ELM have attracted the attention of many researchers in the field of text classification. Working in this direction, Huang et al. (2012) have made three important points. First, ELM provides a unified learning platform; second, compared to PSVM and LS-SVM, ELM has fewer optimization constraints; and third, in theory ELM can classify any disjoint regions and approximate any continuous target function. Their simulation results show that ELM has good performance and scalability at a much faster learning speed compared to SVM and LS-SVM. Bai et al. (2014) have worked on sparse ELM and showed that it can reduce the training time and storage space compared to the unified ELM. It performs very well, with a faster learning speed than the state-of-the-art SVM classifier, and it can also handle large-scale binary classification better than the unified ELM. Ding et al. (2014) have introduced ELM and described the different principles and algorithms used in it. Their study describes typical variants of ELM such as incremental ELM, two-stage ELM, pruning ELM, error-minimized ELM, evolutionary ELM and online sequential ELM, and summarizes the applications of ELM to classification, function approximation, regression, pattern recognition, etc.

Very little research has been done where ML-ELM is used as the classifier (Ding et al. 2015; Mirza et al. 2016; Yang and Wu 2015; Tang et al. 2014). Many other state-of-the-art mechanisms have also been used for text classification. A new Web page classification method based on an SVM-weighted voting scheme has been proposed by Chen and Hsieh (2006). In their work, latent semantic analysis is used to find the hidden information in the documents and to extract text features from each Web page, which helps the SVM to classify the Web pages. Experimental results show that their approach is better than the traditional approaches. Wan et al. (2012) have introduced a new text document classification technique that combines k-nearest neighbor (kNN) and SVM. They have tested their approach on many benchmark datasets, and the results show that the accuracy of the combined approach is less sensitive to the parameter values than that of the traditional kNN technique. A rough set approach to SVM classification, which is mostly useful when handling noisy data, is proposed by Lingras and Butz (2007). Their work proposes two new approaches, extensions of (1-v-r) and (1-v-1) SVM multi-classification, using the boundary region in rough sets. They have justified that the extended (1-v-r) can reduce the training time of the traditional (1-v-r) approach, and their experimental results support their theoretical results. Gomez and Moens (2012) have discussed a method to classify Web documents into a predefined hierarchy using the textual content of the documents. They have developed a Stratified Discriminant Analysis (SDA) technique to reduce the feature vectors of the Web documents. Rujiang et al. (2011) have suggested a text classification model based on SUMO (the Suggested Upper Merged Ontology), which is integrated with the Wordnet ontology to classify Web pages. Experimentally, they claimed that their method can reduce the dimensionality of the vector space and increase the performance of text classification. Li et al. (2012) have proposed a hierarchical-vertical classification framework that builds a hierarchical classifier after discovering the inherent hierarchical structure of relationships among vertical Web pages based on flat datasets. They obtained their best results using SVM with odds ratio to select discriminative features. Klassen and Paturi (2010) have worked on a technique for Web page classification that uses keywords from the documents as attributes together with the random forest learning method. Their work finds that the random forest learning method is better than other state-of-the-art machine learning mechanisms for classification.

Introducing ML-ELM, which uses deep learning extensively, into the field of text classification can begin a new era in the field of machine learning. Our approach uses Wordnet and the connected components of a graph to select the best features under different feature selection techniques. Experimental results on two large benchmark datasets demonstrate the effectiveness of our approach over the other existing approaches.

3 Background

3.1 Different feature selection techniques

This section discusses the most important existing feature selection techniques that we have used in our proposed work to prepare feature vectors, together with the architecture of ELM and multilayer ELM.

i.
TF-IDF:

Rare appearance of features (or words) in a text document reflects the category of the text document in a better manner. To identify such important words, term frequency-inverse document frequency,^{Footnote 2} a statistical measure, has been used extensively. Term frequency (TF), or the local frequency of a word w in a document d, indicates how important w is for d: $TF_{w,d} = \big (\frac{p}{q}\big )$, where ‘p’ represents the frequency of w in d and ‘q’ represents the sum of the frequencies of all the words in d. Inverse document frequency (IDF), or the global frequency of a word w in the entire corpus C, measures how important w is for C: $IDF_w = \log \big (\frac{r}{s}\big )$, where ‘r’ represents the total number of documents in C and ‘s’ represents the number of documents of C which contain the word w.
$$\begin{aligned} (TF\hbox {-}IDF)_{w,d} = TF_{w,d} \times IDF_w \end{aligned}$$
ii.
Chi-square ($\chi ^ 2$):

This technique is based on Chi-square distribution of statistics and generally used to test the independence of two events. In feature selection, the two events are occurrence of the keyword and occurrence of the class. It measures the confidence in association between two categorical variables (based on available statistics). The keywords are ranked with respect to Eq. 1 mentioned below.
$$\begin{aligned} \chi ^2(w,c)= \sum \limits _{e_w\in \{0,1\}}\sum \limits _{e_c\in \{0,1\}}\frac{(O_{e_we_c}-E_{e_we_c})^2}{E_{e_we_c}} \end{aligned}$$
(1)
where w is the word and c is the class of documents, ‘O’ and ‘E’ represent the observed and the expected frequency, respectively (Manning et al. 2008), and $e_w$ and $e_c$ are the binary variables. If a document d contains w, then $e_w$ = 1 else $e_w$ = 0. Similarly, if the class c contains the document d, then $e_c$ = 1 else $e_c$ = 0.
iii.
Information Gain:

Information Gain (IG) of a word w measures how much the presence or absence of w in a document d contributes to a correct classification decision for the class c. It is a measure of the decrease in entropy of the class variable after the value for the word is observed, and it can be generalized to any number of classes (Yang and Pedersen 1997). Equation 2 measures the Information Gain of w.
$$\begin{aligned} IG(w)&= -\sum \limits _{i=1}^{m}p(c_i)\log p(c_i) \\&\quad + p(w)\sum \limits _{i=1}^{m}p(c_i|w)\log p(c_i|w)\\&\quad + p(\overline{w})\sum \limits _{i=1}^{m}p(c_i|\overline{w})\log p(c_i|\overline{w}) \end{aligned}$$
(2)
where,

m: number of predefined classes,

$p(c_i)$: prior probability of the ith class,

p(w): probability of word w in a given data set,

$p(c_i|w)$: conditional probability of ith class given w,

$p(\overline{w})$: complementary probability of p(w), and

$p(c_i|\overline{w})$: conditional probability of ith class in the absence of w.
iv.
Bi-Normal Separation:

Bi-Normal Separation (BNS), originally developed by Forman (2003), tries to find the words which have a high difference between their tpr (true-positive rate) and fpr (false-positive rate). It is the absolute difference between the inverse cumulative distribution function of the standard normal distribution evaluated at the true-positive rate and at the false-positive rate, as represented in Eq. 3.
$$\begin{aligned} BNS(w,c_i)=\Big |\phi ^{-1}\Big (\frac{n_{iw}}{n_i}\Big ) - \phi ^{-1}\Big (\frac{n_{\overline{i}w}}{n_{\overline{i}}}\Big )\Big | \end{aligned}$$
(3)
where,

$n_i$: number of documents belonging to class $c_i$,

$n_{iw}$: number of documents that contain the word w and belong to class $c_i$,

$n_{\overline{i}}$: number of documents not belonging to class $c_i$,

$n_{\overline{i}w}$: number of documents that contain the word w but do not belong to class $c_i$, and

$\phi ^{-1}$: inverse cumulative distribution function of the standard normal distribution.
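
The four scores above can be sketched in a few lines of Python. This is only an illustrative sketch on a hypothetical toy corpus: the names `DOCS`, `tf_idf`, `chi_square`, `information_gain` and `bns` are assumptions made for this example, the probabilities are estimated from document counts, and the clipping constant `eps` in `bns` (which keeps $\phi ^{-1}$ finite when a rate is exactly 0 or 1) is an added assumption that Eq. 3 does not specify.

```python
import math
from statistics import NormalDist

# Hypothetical toy corpus: each document is (class label, list of keywords).
DOCS = [
    ("sports", ["match", "team", "goal"]),
    ("sports", ["team", "coach", "match"]),
    ("politics", ["vote", "party", "team"]),
    ("politics", ["vote", "election", "party"]),
]

def tf_idf(word, doc_words, docs):
    """TF-IDF of `word` in one document: (p / q) * log(r / s)."""
    p = doc_words.count(word)                    # frequency of the word in d
    q = len(doc_words)                           # total number of words in d
    r = len(docs)                                # number of documents in the corpus
    s = sum(1 for _, ws in docs if word in ws)   # documents containing the word
    return (p / q) * math.log(r / s) if p and s else 0.0

def counts(word, cls, docs):
    """Document counts: (word, class), (word, not class), (no word, class), (no word, not class)."""
    n11 = sum(1 for c, ws in docs if c == cls and word in ws)
    n10 = sum(1 for c, ws in docs if c != cls and word in ws)
    n01 = sum(1 for c, ws in docs if c == cls and word not in ws)
    n00 = len(docs) - n11 - n10 - n01
    return n11, n10, n01, n00

def chi_square(word, cls, docs):
    """Eq. 1: sum of (O - E)^2 / E over the four (e_w, e_c) cells."""
    n11, n10, n01, n00 = counts(word, cls, docs)
    n = len(docs)
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    total = 0.0
    for (ew, ec), o in observed.items():
        row = (n11 + n10) if ew else (n01 + n00)   # documents with this e_w
        col = (n11 + n01) if ec else (n10 + n00)   # documents with this e_c
        e = row * col / n                          # expected count under independence
        if e > 0:
            total += (o - e) ** 2 / e
    return total

def _plogp(p):
    """p * log(p) with the convention 0 * log(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0

def information_gain(word, docs):
    """Eq. 2, with probabilities estimated from document counts."""
    n = len(docs)
    classes = sorted({c for c, _ in docs})
    all_labels = [c for c, _ in docs]
    with_w = [c for c, ws in docs if word in ws]
    without_w = [c for c, ws in docs if word not in ws]
    ig = -sum(_plogp(all_labels.count(k) / n) for k in classes)
    for subset in (with_w, without_w):
        if subset:
            weight = len(subset) / n               # p(w) or p(not w)
            ig += weight * sum(_plogp(subset.count(k) / len(subset)) for k in classes)
    return ig

def bns(word, cls, docs, eps=0.0005):
    """Eq. 3; rates are clipped to [eps, 1 - eps] (an assumption of this sketch)."""
    n11, n10, n01, n00 = counts(word, cls, docs)
    inv = NormalDist().inv_cdf                     # inverse standard normal CDF
    tpr = min(max(n11 / (n11 + n01), eps), 1 - eps)
    fpr = min(max(n10 / (n10 + n00), eps), 1 - eps)
    return abs(inv(tpr) - inv(fpr))

for w in ("vote", "team"):
    print(w, chi_square(w, "politics", DOCS), information_gain(w, DOCS),
          bns(w, "politics", DOCS))
```

In this toy corpus the word "vote" occurs in every "politics" document and in no other, so it scores higher than "team" under Chi-square, IG and BNS alike.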

3.2 Extreme learning machine

ELM, proposed by Huang et al. (2006), is a learning algorithm for single-hidden-layer feed-forward neural networks (SLFNs). ELM has become popular over other established classifiers, mainly for the following reasons:

(i)
Input weights and hidden layer biases adjustment which consumes more time are not required in ELM as they are assigned randomly.
(ii)
The hidden layer neither needs to be tuned nor needs to be neuron-like.
(iii)
Easy to implement and very fast learning speed.
(iv)
Ability to handle a large volume of data.
(v)
No back propagation.
(vi)
Gives good performance with less human intervention.
(vii)
Avoids local minima.
(viii)
Parallelization of computation.
(ix)
Produces one optimal solution with negligible errors.

The computational speed of ELM is exceptionally good compared to SVM, and this advantage becomes more pronounced as the training dataset grows (Liu et al. 2012).

ELM at a Glance:

Consider N arbitrary distinct examples $(x_i, y_i)$, where $x_i=[x_{i1}, x_{i2},\ldots , x_{in}]^T \in R^n$ and $y_i = [y_{i1}, y_{i2},\ldots , y_{im}]^T \in R^m$, such that $(x_i, y_i) \in R^n \times R^m$, $i = 1, 2,\ldots , N$. ELM has an activation function g(x) and L hidden nodes. For a given input $x_j$, the output function of the extreme learning machine is as follows:

$$\begin{aligned} g_L({\varvec{x}}_j) = \sum \limits _{i=1}^{L} {\varvec{\beta }}_i\, g({\varvec{w}}_i\cdot {\varvec{x}}_j+ b_i) = {\varvec{y}}_j, \quad j = 1,\dots ,N \end{aligned}$$

(4)

Here, $(w_i, b_i)$ are the randomly generated hidden node parameters for $i = 1,\ldots ,L$: $w_i = [w_{i1},w_{i2},\ldots ,w_{in}]^T$ is the weight vector connecting the n input nodes to the ith hidden node, and $b_i$ is the bias of the ith hidden node. $\varvec{\beta } = [\beta _1,\ldots ,\beta _L]^T$ is the output weight vector, which connects each hidden node to every output node. The hidden layer output $g(\mathbf {x})$ maps the n-dimensional input space to an L-dimensional feature space, and ${\varvec{H}}$ denotes the output matrix of the hidden layer. The compact form of Eq. 4 is given by Eq. 5 as follows:

$$\begin{aligned} {\varvec{H}} \varvec{\beta }={\varvec{Y}} \end{aligned}$$

(5)

where

$$\begin{aligned} {\varvec{H}}&= \left[ \begin{array}{ccc} g({\varvec{w}}_1\cdot {\varvec{x}}_1+b_1) & \ldots & g({\varvec{w}}_L\cdot {\varvec{x}}_1+b_L) \\ g({\varvec{w}}_1\cdot {\varvec{x}}_2+b_1) & \ldots & g({\varvec{w}}_L\cdot {\varvec{x}}_2+b_L) \\ \vdots & \ddots & \vdots \\ g({\varvec{w}}_1\cdot {\varvec{x}}_N+b_1) & \ldots & g({\varvec{w}}_L\cdot {\varvec{x}}_N+b_L) \end{array} \right] _{N \times L}, \\ {\varvec{\beta }}&= \left[ \begin{array}{ccc} \beta _{11} & \ldots & \beta _{1m} \\ \beta _{21} & \ldots & \beta _{2m} \\ \vdots & \ddots & \vdots \\ \beta _{L1} & \ldots & \beta _{Lm} \end{array} \right] _{L \times m}, \quad {\varvec{Y}} = \left[ \begin{array}{ccc} y_{11} & \ldots & y_{1m} \\ y_{21} & \ldots & y_{2m} \\ \vdots & \ddots & \vdots \\ y_{N1} & \ldots & y_{Nm} \end{array} \right] _{N \times m} \end{aligned}$$

As long as the number of hidden layer nodes is large enough, not all parameters of the network need to be adjusted (Huang 2003). ELM can achieve both the smallest training error and the smallest norm of output weights, which can be represented by Eq. 6 as follows:

$$\begin{aligned} \text {minimize: } \Vert {\varvec{H}} \varvec{\beta }-{\varvec{Y}} \Vert ^2 \text { and } \Vert \varvec{\beta } \Vert \end{aligned}$$

(6)

$\varvec{\beta }$ can be derived in many ways; one such technique uses the Moore–Penrose generalized inverse (Liang et al. 2006) ${\varvec{H}}^{\dagger }$ of the matrix ${\varvec{H}}$, which when multiplied with ${\varvec{Y}}$ gives $\varvec{\beta } = {\varvec{H}}^{\dagger }{\varvec{Y}}$. The system diagram of Extreme Learning Machine is shown in Fig. 1.
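
As a minimal sketch of Eqs. 4 to 6, assuming NumPy and a sigmoid activation, the ELM training step reduces to one random projection followed by one pseudo-inverse. The function names `elm_train` and `elm_predict`, the XOR-style toy data and the choice of 20 hidden nodes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, Y, n_hidden, rng):
    """Train a single-hidden-layer ELM and return (W, b, beta)."""
    n_features = X.shape[1]
    W = rng.standard_normal((n_features, n_hidden))  # random input weights, never tuned
    b = rng.standard_normal(n_hidden)                # random hidden biases, never tuned
    H = sigmoid(X @ W + b)                           # N x L hidden layer output matrix
    beta = np.linalg.pinv(H) @ Y                     # Moore-Penrose solution of H beta = Y
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Apply Eq. 4 to every row of X."""
    return sigmoid(X @ W + b) @ beta

rng = np.random.default_rng(0)

# Toy XOR problem (hypothetical data): not linearly separable in the input space.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 1, 1, 0])
Y = np.eye(2)[labels]                                # one-hot targets, N x m

W, b, beta = elm_train(X, Y, n_hidden=20, rng=rng)
scores = elm_predict(X, W, b, beta)
pred = scores.argmax(axis=1)
print(pred)
```

Because the pseudo-inverse returns the least-squares solution with the smallest norm, it addresses both objectives of Eq. 6 in a single step without any iterative weight updates.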

3.3 Multilayer ELM

Multilayer ELM is a machine learning approach based on the architecture of artificial neural networks and is inspired by deep learning and extreme learning machine. Deep learning was first proposed by Hinton and Salakhutdinov (2006), who used the deep structure of a multilayer autoencoder and established a multilayer neural network on unlabeled data. In their method, unsupervised training is first used to obtain the parameters in each layer, and the network is then fine-tuned by supervised learning. The deep belief network proposed by Hinton et al. (2006) outperforms traditional multilayer neural networks, SLFNs and SVMs, but it has a slow learning speed. Working in this direction, Kasun et al. (2013) recently proposed multilayer ELM, which performs unsupervised learning from layer to layer. It does not need to iterate during the training process and hence does not spend a long time in the training phase. Compared to other conventional deep networks, it has better or comparable performance. Figure 2 shows the system architecture of ML-ELM.

3.3.1 ELM autoencoder (ELM-AE)

An autoencoder is an unsupervised neural network whose outputs are the same as its inputs. Like ELM, ELM-AE has $n$ input layer nodes, a single hidden layer of $L$ nodes and $n$ output layer nodes. In spite of the many resemblances between the two, there are two major differences, which are as follows:

i.
ELM is a supervised neural network and the output of ELM is a class label while ELM-AE is an unsupervised one and its output is same as the input.
ii.
Input weights and biases of the hidden layer are random in case of ELM, but they are orthogonal in ELM-AE.

Depending on the number of hidden layer nodes, the ELM-AE can be divided into the following three categories.

(i)
Compressed representation ($n > L$):

In compressed representation, features of training dataset need to be represented from a higher-dimensional (or sparse) input signal space to a lower-dimensional (or compressed) feature space.
(ii)
Equal dimension representation ($n = L$):

In this representation of features, the dimension of input signal space and feature space needs to be equal.
(iii)
Sparse representation ($n < L$):

It is just the reverse of compressed representation where features of training dataset need to be represented from a lower-dimensional input signal space to a higher-dimensional (or sparse) feature space.

The multilayer ELM is considerably faster than deep networks because ML-ELM requires no iterative tuning mechanism, and it obtains better or similar performance compared to deep networks. It is also known that in ELM, for L hidden nodes and N training examples $({\varvec{x}}_j, {\varvec{y}}_j)$, the following Eq. 7 holds:

$$\begin{aligned} g_L({\varvec{x}}_j)= \sum \limits _{i=1}^{L} {\varvec{\beta }}_i\, g_i({\varvec{x}}_j,{\varvec{w}}_i, b_i)= {\varvec{y}}_j, \quad j=1,\ldots ,N \end{aligned}$$

(7)

where each symbol has the same meaning as in Eq. (4). In case of ELM-AE, the output weights ${\varvec{\beta }}$ can be computed using Eq. (8), (9) or (10), and this is different from the computation of ${\varvec{\beta }}$ in case of ELM.

To perform unsupervised learning, a few modifications have been made to regular ELM to obtain ELM-AE, whose working principle is otherwise similar. They are described as follows:

(1)
The output data and the input data remain the same for every hidden layer. Hence, for every input data ${\varvec{X}}$:
$$\begin{aligned} {\varvec{Y}} = {\varvec{X}} \end{aligned}$$
(2)
To improve performance, the weights and the biases of the random hidden nodes are chosen to be orthogonal, which can be represented as follows:
$$\begin{aligned} {\varvec{h}}= g({\varvec{w}}\cdot {\varvec{x}} + {\varvec{b}}), \quad {\varvec{w}}^T {\varvec{w}}={\varvec{I}} \quad \text {and} \quad {\varvec{b}}^T {\varvec{b}}=1 \end{aligned}$$
(3)
The output weight ${\varvec{\beta }}$ is decided based on the following conditions:
i.
  if $n > L$ then
  $$\begin{aligned} {\varvec{\beta }}= \left( \frac{{\varvec{I}}}{C}+{\varvec{H}}^{T}{\varvec{H}} \right) ^{-1}{\varvec{H}}^{T}{\varvec{X}} \end{aligned}$$
  (8)
ii.
  if $n = L$ then
  $$\begin{aligned} {\varvec{\beta }} ={\varvec{H}}^{-1}{\varvec{X}} \end{aligned}$$
  (9)
iii.
  if $n < L$ then
  $$\begin{aligned} {\varvec{\beta }}= {\varvec{H}}^{T}\left( \frac{{\varvec{I}}}{C}+{\varvec{H}}{\varvec{H}}^{T}\right) ^{-1}{\varvec{X}} \end{aligned}$$
  (10)

where C is a scale parameter which balances structural and empirical risk. ELM-AE is used for training the parameters in each layer of ML-ELM. The general equation representing ML-ELM is described as follows:

$$\begin{aligned} {\varvec{H}}^{\varvec{n}}= {\varvec{g}}(({\varvec{\beta }}^{\varvec{n}})^{\varvec{T}} {\varvec{H}}^{\varvec{n-1}}) \end{aligned}$$

(11)

For ${\varvec{n}} = 0$, the 0th hidden layer or the first layer is considered to be the input layer ${\varvec{X}}$. Equation (11) shows how the transformations of the data take place from layer to layer until it reaches the last but one layer before the final (i.e., output) layer ${\varvec{Y}}$. The final output matrix ${\varvec{Y}}$ can be obtained by computing the results between the last hidden layer and the output layer using the regularized least squares technique (Rifkin et al. 2003).
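
The layer-wise procedure of Eqs. 8 to 11 can be sketched as follows, assuming NumPy, a sigmoid activation and samples stored as rows (so Eq. 11 becomes `H @ beta.T`). The helper names, the hidden layer sizes, the value of `C` and the random toy data are all hypothetical. For the equal-dimension case the sketch substitutes the pseudo-inverse for ${\varvec{H}}^{-1}$, since ${\varvec{H}}$ is square only when N equals L; this is an assumption of the sketch and not the authors' stated procedure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def orthogonal_params(n_in, n_hidden, rng):
    """Random hidden weights made orthogonal via QR, plus a unit-norm bias."""
    A = rng.standard_normal((n_in, n_hidden))
    if n_in >= n_hidden:
        W, _ = np.linalg.qr(A)        # n_in x n_hidden with W.T @ W = I
    else:
        Q, _ = np.linalg.qr(A.T)      # n_hidden x n_in with orthonormal columns
        W = Q.T                       # n_in x n_hidden with W @ W.T = I
    b = rng.standard_normal(n_hidden)
    return W, b / np.linalg.norm(b)

def elm_ae_beta(X, n_hidden, C, rng):
    """Output weights of one ELM autoencoder whose target equals its input X."""
    n_samples, n_in = X.shape
    W, b = orthogonal_params(n_in, n_hidden, rng)
    H = sigmoid(X @ W + b)                                   # N x L
    if n_in > n_hidden:                                      # compressed, Eq. 8
        return np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ X)
    if n_in < n_hidden:                                      # sparse, Eq. 10
        return H.T @ np.linalg.solve(np.eye(n_samples) / C + H @ H.T, X)
    return np.linalg.pinv(H) @ X                             # equal dimension, stand-in for Eq. 9

def ml_elm_train(X, Y, hidden_sizes, C, rng):
    """Stack ELM-AEs layer by layer, then fit the output layer."""
    betas, H = [], X
    for n_hidden in hidden_sizes:
        beta = elm_ae_beta(H, n_hidden, C, rng)              # L x n
        betas.append(beta)
        H = sigmoid(H @ beta.T)                              # Eq. 11, row-sample form
    # Regularized least squares between the last hidden layer and the output.
    out = np.linalg.solve(np.eye(H.shape[1]) / C + H.T @ H, H.T @ Y)
    return betas, out

def ml_elm_predict(X, betas, out):
    H = X
    for beta in betas:
        H = sigmoid(H @ beta.T)
    return H @ out

rng = np.random.default_rng(0)
X = rng.random((30, 8))                        # 30 hypothetical samples, 8 features
Y = np.eye(2)[rng.integers(0, 2, size=30)]     # one-hot labels for 2 classes
betas, out = ml_elm_train(X, Y, hidden_sizes=[5, 5, 12], C=100.0, rng=rng)
scores = ml_elm_predict(X, betas, out)
print([beta.shape for beta in betas], out.shape, scores.shape)
```

The hidden sizes 5, 5 and 12 are chosen only so that the compressed, equal-dimension and sparse cases are each exercised once; no iterative fine-tuning takes place at any layer.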

4 Proposed approach

In this section, first we have discussed the architecture of our approach and then summarized the complete approach with the details of algorithms to implement it.

4.1 Architecture of the proposed approach

Given a corpus of classes containing text documents, the proposed approach involves the following steps:

1.
Preprocessing of text documents of different classes
i.
  Stop words and unwanted words are removed from the text documents of each class in the corpus.
ii.
  Words of other categories, such as verbs, adverbs, adjectives and pronouns, are ignored. Minipar^{Footnote 3} is used to select nouns as the keywords.
iii.
  Every class in the corpus now has preprocessed documents of keywords.
2.
Feature score generation with the help of the training dataset
i.
  Keywords from the preprocessed text documents of each class are taken to generate the term–document matrix.
ii.
  Separate new documents are made, where each new document represents a particular class in the corpus. A document ‘${\varvec{D}}_{\varvec{new}}$’ representing a class ‘${\varvec{C}}$’ is constructed by putting all the preprocessed content (i.e., keywords) of all documents (also known as the training dataset) belonging to the class ‘${\varvec{C}}$’ into the document ‘${\varvec{D}}_{\varvec{new}}$’. In other words, a pool of keywords is constructed from all documents of class ‘${\varvec{C}}$’ and stored in ‘${\varvec{D}}_{\varvec{new}}$.’ Hence, we now have only one new document per class, which consists of all keywords of that class; that is, the training set has one instance for each class.
iii.
  These documents (‘${\varvec{D}}_{\varvec{new}}$’) are then sent as input to the different feature selection techniques (TF-IDF/Chi-square/BNS/IG, as discussed in Sect. 3.1) separately, for comparison purposes, to generate the score of each feature (i.e., keyword). Then, for each class represented by ‘${\varvec{D}}_{\varvec{new}}$’, we have a list of keywords in that class along with their corresponding TF-IDF/Chi-square/BNS/IG scores, which represent different feature vectors, one for each feature selection technique.
3.
Reduced feature vector generation by selecting the most important keywords for each feature selection (TF-IDF/Chi-square/BNS/IG) technique

Next, we need to select the ‘${\varvec{n}}$’ most important keywords from each ‘${\varvec{D}}_{\varvec{new}}$,’ resulting in a vector of dimension ‘${\varvec{nm}}$,’ where ‘${\varvec{m}}$’ represents the number of predefined classes. To obtain these keywords, we use the idea of connected components from graph theory. An undirected graph ‘${\varvec{G}}$’ is called connected if there exists a path between any two vertices in the graph.^{Footnote 4} A graph consisting of only one vertex is always connected; in other words, every isolated vertex of a graph is a connected component of that graph. Figure 3 shows the three connected components (0-1-2-3-4, 5-6 and 7) of an undirected graph ‘${\varvec{G}}$.’ In our approach, each keyword is a vertex, and a semantic relationship between two keywords forms an edge between them, which generates an undirected graph. A keyword ‘a’ is related to a keyword ‘b’ if ‘b’ is in either the synonym list or the lemma list of ‘a.’ Each connected component therefore consists of related keywords, i.e., keywords of similar context. For example, ‘a’ together with its entire synonym and lemma list forms one connected component. Figure 4 shows the connected component of ‘a,’ where ‘b,’ ‘c,’ ‘f,’ ‘h,’ ‘m,’ ‘n,’ ‘s’ and ‘x’ are in the synonym and lemma list of ‘a.’ Table 1 shows the synonym lists of certain keywords. For each ‘${\varvec{D}}_{\varvec{new}}$,’ a list of connected components is generated using Wordnet.

Next, from each connected component of ‘${\varvec{D}}_{\varvec{new}}$,’ the keyword with the highest TF-IDF/Chi-square/BNS/IG score is selected as the representative keyword (or important keyword) of that component. At the end, a reduced feature vector with the ‘${\varvec{n}}$’ most important keywords is generated from each ‘${\varvec{D}}_{\varvec{new}}$’ based on the feature selection technique used (i.e., TF-IDF/Chi-square/BNS/IG). In other words, for every ‘${\varvec{D}}_{\varvec{new}}$,’ four reduced feature vectors with the ‘${\varvec{n}}$’ most important keywords are generated, one for each feature selection technique. Details are discussed in Steps 3–8 of Sect. 4.2.
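
The selection of representative keywords can be illustrated with a small union-find sketch in Python. The dictionary `SYNONYMS` is a hypothetical stand-in for the Wordnet synonym and lemma lists, the values in `SCORES` are made up for illustration, and keeping the ‘n’ highest-scoring representatives is an assumption of this sketch rather than the exact rule of Sect. 4.2.

```python
from collections import defaultdict

# Hypothetical stand-in for Wordnet synonym/lemma lists.
SYNONYMS = {
    "car": {"auto", "automobile"},
    "auto": {"car"},
    "film": {"movie", "picture"},
    "movie": {"film"},
}

# Hypothetical feature scores (e.g., TF-IDF) for the keywords of one D_new.
SCORES = {"car": 0.42, "auto": 0.10, "automobile": 0.31,
          "film": 0.25, "movie": 0.37, "picture": 0.05, "bank": 0.20}

def connected_components(keywords, synonyms):
    """Group keywords that are linked through synonym/lemma edges."""
    parent = {k: k for k in keywords}

    def find(k):
        # Follow parent links to the root, halving the path on the way.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for a in keywords:
        for b in synonyms.get(a, ()):
            if b in parent:                    # only keywords present in D_new
                parent[find(a)] = find(b)      # an undirected edge merges two components

    groups = defaultdict(set)
    for k in keywords:
        groups[find(k)].add(k)
    return list(groups.values())

def representative_keywords(scores, synonyms, n):
    """Pick the best-scoring keyword of each component, then keep the top n."""
    components = connected_components(list(scores), synonyms)
    reps = [max(component, key=scores.get) for component in components]
    return sorted(reps, key=scores.get, reverse=True)[:n]

components = connected_components(list(SCORES), SYNONYMS)
top = representative_keywords(SCORES, SYNONYMS, n=2)
print(components)
print(top)
```

Here "bank" has no synonym among the keywords, so it forms a single-vertex component of its own, while the other six keywords collapse into two components with one representative each.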

Table 1 Synonym list of certain keywords
