1 Introduction

The Web has indexed at least 4.76 billion documents.Footnote 1 Organizing these documents effectively is a real challenge for present-day search engines. The ultimate aim of a search engine is to satisfy the internet user by returning the desired information for every query. Searching for this information on the net is the most time-consuming task; if it is done efficiently, the user can absorb and use the knowledge in the documents effectively. Text classification is an attempt in this direction: it not only reduces the search time but also makes the required information available to the user. It is a vital topic in machine learning, where learning is done over text. Classification is a well-known machine learning technique in which a labeled dataset is used to train the classifier before it is applied to the test dataset to decide the target class. Based on the number of classes involved, classification can be broadly divided into two categories: binary classification, where a test instance is assigned to one of two predefined classes, and multi-class classification, where the test instance is assigned to one of more than two classes. To classify text data effectively, selecting the top features is essential; this is the task of feature selection, on which the generalization capability of a machine learning algorithm depends. The performance of a classifier depends on how robust the feature vector is. Feature selection reduces the number of features by selecting a subset of them that helps in building the required model. It is important in text classification for two main reasons:

  1. 1.

    The effective number of features is reduced, and hence, training the classifier consumes less network bandwidth, time and storage.

  2. 2.

    Classification errors caused by noise features are reduced, and thus, the accuracy of the classification process improves.

Feature selection methods are generally either unsupervised or supervised. As the name suggests, unsupervised methods require no class labels to select the top features, whereas supervised methods do require class labels. Some of the unsupervised feature selection methods are ‘document frequency,’ ‘term contribution,’ ‘TF-IDF metric,’ etc. Supervised feature selection methods are further divided into two sub-categories: accuracy-based and correlation-based.

  1. 1.

    Accuracy-based: These methods choose the features that occur most often in the positive class and least often in the negative class. Some of the existing methods are ‘odds ratio,’ ‘probability ratio,’ ‘GU metric,’ ‘Bi-Normal Separation (BNS),’ ‘power metric,’ ‘Fisher criterion,’ etc.

  2. 2.

    Correlation-based: These methods evaluate the features by measuring their correlation with the various classes and choose the features that have the highest correlation scores. For example, ‘Chi-square metric,’ ‘NGL coefficient,’ ‘GSS coefficient,’ ‘MI-judge’ and ‘Information Gain’ are some of the existing correlation-based methods.

The techniques used for feature selection are categorized as wrapper, filter and embedded methods. To construct a feature set, wrapper and embedded methods need the involvement of a classifier, which increases the running time and makes them computationally intensive. Filter methods do not require any classifier interaction to prepare the feature set and are hence often preferred over the other two.

After feature selection, the next important factor in the text classification process is an efficient classifier. Many traditional classifiers exist for text classification, including decision trees, k-nearest neighbor, Naive Bayes and SVM. They have their own limitations, however, and most of them use shallow architectures, which are restricted in their ability to approximate complex functions. Deep learning has aroused interest over the past decade in many research domains such as computer vision, automatic speech recognition and pattern recognition, and has recently attracted much attention in the field of machine learning. It is a multilayer perceptron artificial neural network algorithm. Deep learning is not subject to this restriction: it removes the difficulty of optimization associated with deep models (Ding et al. 2015) and achieves an approximation of complex functions. Extreme Learning Machine (Huang et al. 2006) is able to approximate any complex nonlinear mapping directly from the training samples, but it has a shallow architecture similar to traditional single-hidden-layer feed-forward neural networks (SLFNs). Hence, it may need a large network to fit highly variant input data well, which is difficult to implement. The recently designed multilayer ELM (Kasun et al. 2013) addresses this issue: it combines deep learning (i.e., the ELM autoencoder) with ELM, decomposes the original input data into multiple hidden layers and performs unsupervised learning layer-wise.

Since informative features and an efficient classifier together yield good performance in text classification, this study uses ML-ELM as the classifier, which quickly earned a name in the field of machine learning owing to its speed, ease of implementation and ability to handle a large volume of data. To prepare an efficient feature vector, we consider four standard feature selection techniques that are generally used for text classification: TF-IDF, Chi-square, BNS and IG. The connected components of a graph, together with Wordnet, are used to select the top features from each class of a given corpus after calculating the TF-IDF/Chi-square/BNS/IG score of each feature (i.e., keyword) of the class. Finally, the reduced feature vectors of all classes are combined to form the final reduced feature vectors, one for each feature selection technique (i.e., TF-IDF, Chi-square, BNS and IG). ML-ELM and other traditional classifiers, including ELM and SVM, are trained on these final reduced feature vectors to classify the text data. The experiments are carried out on two benchmark datasets: DMOZ (Open Directory Project) and 20-Newsgroups. The performance of the different classifiers is compared in the experimental section, where ML-ELM is observed to outperform the other established classifiers, including ELM and SVM. The empirical results show that the performance of the proposed approach is promising compared to other existing approaches.

The paper is organized as follows: The literature on different classification techniques used for text data is reviewed in Sect. 2. Section 3 describes different existing feature selection techniques and the model structure of ELM and ML-ELM. The proposed approach for classifying the text data is discussed in Sect. 4. Section 5 analyzes the empirical results and compares the proposed approach with other existing approaches. We conclude our work with some future enhancements in Sect. 6.

2 Literature review

Recently, ELM and ML-ELM have attracted the attention of many researchers in the field of text classification. Working in this direction, Huang et al. (2012) discussed three important points. First, ELM provides a unified learning platform; second, compared to PSVM and LS-SVM, ELM has fewer optimization constraints; and third, in theory ELM can classify any disjoint regions and approximate any continuous target function. Their simulation results show that ELM has good performance and scalability at a much faster learning speed than SVM and LS-SVM. Bai et al. (2014) worked on sparse ELM and showed that it can reduce the training time and storage space compared to the unified ELM. It performs very well, with a faster learning speed than the state-of-the-art SVM classifier. Compared to the unified ELM, it is also able to handle large-scale binary classification. Ding et al. (2014) introduced ELM and described the different principles and algorithms used in it. Their study describes typical variants of ELM such as incremental ELM, two-stage ELM, pruning ELM, error-minimized ELM, evolutionary ELM and online sequential ELM, and summarizes the applications of ELM to classification, function approximation, regression, pattern recognition, etc.

Very little research has used ML-ELM as the classifier (Ding et al. 2015; Mirza et al. 2016; Yang and Wu 2015; Tang et al. 2014). Many other state-of-the-art mechanisms have also been used for text classification. Chen and Hsieh (2006) proposed a Web page classification method based on an SVM-weighted voting scheme. In their work, latent semantic analysis is used to find hidden information in the documents and to extract text features from each Web page, which helps the SVM to classify the Web pages. Experimental results show that their approach is better than the traditional approaches. Wan et al. (2012) introduced a text document classification method that combines the k-nearest neighbor (kNN) and SVM techniques. They tested their approach on many benchmark datasets, and the results show that the accuracy of the combined approach is less affected by the parameter values than that of the traditional kNN technique. Lingras and Butz (2007) proposed a rough set approach to SVM classification, which is mostly useful when handling noisy data. Their work proposes two new approaches, extensions of (1-v-r) and (1-v-1), to SVM multi-classification using the boundary region in rough sets. They showed that the extended (1-v-r) can reduce the training time of the traditional (1-v-r) approach, and their experimental results support their theoretical results. Gomez and Moens (2012) discussed a method to classify Web documents into a predefined hierarchy using the textual content of the documents. They developed a Stratified Discriminant Analysis (SDA) technique to reduce the feature vectors of the Web documents. Rujiang et al. (2011) suggested a text classification model based on SUMO (the Suggested Upper Merged Ontology), which is integrated with the Wordnet ontology to classify Web pages. They claimed experimentally that their method can reduce the dimensionality of the vector space and increase the performance of text classification. Li et al. (2012) proposed a hierarchical-vertical classification framework that builds a hierarchical classifier after discovering the inherent hierarchical structure of relationships among vertical Web pages based on flat datasets. They used SVM with odds ratio to select discriminative features, which obtained the best results. Klassen and Paturi (2010) worked on a technique for Web page classification that uses keywords from the documents as attributes together with the random forest learning method. Their work finds that the random forest learning method is better than other state-of-the-art machine learning mechanisms for classification.

Introducing ML-ELM, which uses deep learning extensively, into the field of text classification can open a new direction in machine learning. Our approach uses Wordnet and the connected components of a graph to select the best features under different feature selection techniques. Experimental results on two large benchmark datasets demonstrate the effectiveness of our approach over other existing approaches.

3 Background

3.1 Different feature selection techniques

This section discusses the existing feature selection techniques used in our proposed work to prepare the feature vectors, together with the architectures of ELM and multilayer ELM.

  1. i.

    TF-IDF:

    Features (or words) that appear rarely reflect the category of a text document better. To identify such important words, term frequency-inverse document frequency,Footnote 2 a statistical measure, has been used extensively. Term frequency (TF), or the local frequency of a word w in a document d, indicates how important w is for d: \(TF_{w,d} = \big (\frac{p}{q}\big )\), where ‘p’ represents the frequency of w in d and ‘q’ represents the sum of the frequencies of all the words in d. Inverse document frequency (IDF), or the global frequency of a word w in the entire corpus C, measures how important w is for C: \(IDF_w = \log \big (\frac{r}{s}\big )\), where ‘r’ represents the total number of documents in C and ‘s’ represents the number of documents of C which contain the word w.

    $$\begin{aligned} (TF\hbox {-}IDF)_{w,d} = TF_{w,d} \times IDF_w \end{aligned}$$
  2. ii.

    Chi-square (\(\chi ^ 2\)):

    This technique is based on Chi-square distribution of statistics and generally used to test the independence of two events. In feature selection, the two events are occurrence of the keyword and occurrence of the class. It measures the confidence in association between two categorical variables (based on available statistics). The keywords are ranked with respect to Eq. 1 mentioned below.

    $$\begin{aligned} \chi ^ 2(w,c)= \sum \limits _{e_w\in \{0,1\}}\sum \limits _{e_c\in \{0,1\}}\frac{(O_{e_we_c}-E_{e_we_c})^2}{E_{e_we_c}} \end{aligned}$$
    (1)

    where w is the word and c is the class of documents, ‘O’ and ‘E’ represent the observed and the expected frequency, respectively (Manning et al. 2008), and \(e_w\) and \(e_c\) are the binary variables. If a document d contains w, then \(e_w\) = 1 else \(e_w\) = 0. Similarly, if the class c contains the document d, then \(e_c\) = 1 else \(e_c\) = 0.

  3. iii.

    Information Gain:

    Information Gain (IG) of a word w measures how much the presence or absence of w in a document d contributes to a correct classification decision for the class c. It is a measure of the decrease in entropy of the class variable after the value for the word is observed, and it can be generalized to any number of classes (Yang and Pedersen 1997). Equation 2 measures the Information Gain of w.

    $$\begin{aligned} \begin{aligned} IG(w)&= -\sum \limits _{i=1}^{m}p(c_i)\log \;p(c_i) \\&\quad + p(w)\sum \limits _{i=1}^{m}p(c_i|w)\log \; p(c_i|w)\\&\quad + p(\overline{w})\sum \limits _{i=1}^{m}p(c_i|\overline{w})\log \; p(c_i|\overline{w}) \end{aligned} \end{aligned}$$
    (2)

    where,

    m: number of predefined classes,

    \(p(c_i)\): prior probability of the ith class,

    p(w): probability of word w in a given data set,

    \(p(c_i|w)\): conditional probability of ith class given w,

    \(p(\overline{w})\): complementary probability of p(w), and

    \(p(c_i|\overline{w})\): conditional probability of ith class in the absence of w.

  4. iv.

    Bi-Normal Separation:

    Bi-Normal Separation (BNS), originally developed by Forman (2003), looks for the words that have a large difference between their tpr (true-positive rate) and fpr (false-positive rate). It is the absolute difference between the inverse standard normal cumulative distribution function evaluated at the true-positive rate and at the false-positive rate, as represented in Eq. 3.

    $$\begin{aligned} BNS(w,c_i)=\Big |\phi ^{-1}\Big (\frac{n_{iw}}{n_i}\Big ) - \phi ^{-1}\Big (\frac{n_{\overline{i}w}}{n_{\overline{i}}}\Big )\Big | \end{aligned}$$
    (3)

    where,

    \(n_i\): number of documents that belong to class \(c_i\),

    \(n_{iw}\): number of documents that contain the word w and belong to class \(c_i\),

    \(n_{\overline{i}}\): number of documents that do not belong to class \(c_i\),

    \(n_{\overline{i}w}\): number of documents that contain the word w but do not belong to class \(c_i\), and

    \(\phi ^{-1} \): inverse cumulative distribution function of the standard normal distribution.
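To make the four scores of this section concrete, the following Python sketch computes each of them from document counts. It is an illustration under stated assumptions rather than the implementation used in this work: the function and argument names are hypothetical, the natural logarithm is assumed because the text does not fix the base of the logarithm, and the clipping constant `eps` in `bns` (needed because the inverse normal CDF is undefined at 0 and 1) is likewise an assumed value.

```python
import math
from statistics import NormalDist


def tf_idf(p, q, r, s):
    """TF-IDF of word w in document d, using the symbols of Sect. 3.1.

    p: frequency of w in d, q: total word frequency in d,
    r: number of documents in the corpus, s: documents containing w.
    """
    return (p / q) * math.log(r / s)  # natural log is an assumption


def chi_square(n11, n10, n01, n00):
    """Eq. 1 from a 2x2 table of document counts.

    n11: contain w, in c      n10: contain w, not in c
    n01: lack w, in c         n00: lack w, not in c
    Every row and column total is assumed to be nonzero.
    """
    n = n11 + n10 + n01 + n00
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    word_total = {1: n11 + n10, 0: n01 + n00}
    class_total = {1: n11 + n01, 0: n10 + n00}
    score = 0.0
    for (e_w, e_c), o in observed.items():
        e = word_total[e_w] * class_total[e_c] / n  # expected count under independence
        score += (o - e) ** 2 / e
    return score


def information_gain(with_w, without_w):
    """Eq. 2; with_w[i] / without_w[i] count class-i documents with / without w."""
    n_w, n_not = sum(with_w), sum(without_w)
    n = n_w + n_not

    def plogp(p):
        return p * math.log(p) if p > 0 else 0.0  # convention: 0 log 0 = 0

    ig = -sum(plogp((a + b) / n) for a, b in zip(with_w, without_w))
    if n_w:
        ig += (n_w / n) * sum(plogp(a / n_w) for a in with_w)
    if n_not:
        ig += (n_not / n) * sum(plogp(b / n_not) for b in without_w)
    return ig


def bns(n_iw, n_i, n_not_iw, n_not_i, eps=0.0005):
    """Eq. 3; both rates are clipped to [eps, 1 - eps] before the inverse normal CDF."""
    inv = NormalDist().inv_cdf
    tpr = min(max(n_iw / n_i, eps), 1 - eps)
    fpr = min(max(n_not_iw / n_not_i, eps), 1 - eps)
    return abs(inv(tpr) - inv(fpr))
```

A word that is distributed identically inside and outside a class scores zero under Chi-square, IG and BNS alike, while a word that occurs only in one class receives the largest scores; this is the sense in which all three rank class-specific keywords first.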

3.2 Extreme learning machine

ELM, proposed by Huang et al. (2006), is a single-hidden-layer feed-forward neural network (SLFN). ELM has become popular over the other established classifiers mainly for the following reasons:

  1. (i)

    Input weights and hidden layer biases are assigned randomly, so their time-consuming adjustment is not required in ELM.

  2. (ii)

    The hidden layer needs neither to be tuned nor to be neuron-like.

  3. (iii)

    Easy to implement and very fast learning speed.

  4. (iv)

    Ability to handle a large volume of data.

  5. (v)

    No back propagation.

  6. (vi)

    Gives good performance with less human intervention.

  7. (vii)

    Avoids local minima.

  8. (viii)

    Parallelization of computation.

  9. (ix)

    Produces one optimal solution with negligible errors.

The computational speed of ELM is exceptionally good compared to SVM, and this advantage grows markedly as the training dataset grows (Liu et al. 2012).

ELM at a Glance:

Consider N arbitrary distinct examples \((x_i, y_i) \in R^n \times R^m\), \(i = 1, 2,\ldots , N\), where \(x_i=[x_{i1}, x_{i2},\ldots , x_{in}]^T \in R^n\) and \(y_i = [y_{i1}, y_{i2},\ldots , y_{im}]^T \in R^m\). In addition, ELM has an activation function g(x) and L hidden nodes. For a given input x, the output function of the extreme learning machine is as follows:

$$\begin{aligned} {{\varvec{g}}}_{{\varvec{{L}}}}({{\varvec{x}}}_{{\varvec{{j}}}}) = \sum \limits _{{\varvec{{i=1}}}}^{{\varvec{{L}}}} {\varvec{\beta }}_{{\varvec{{i}}}} {{\varvec{g}}}({{\varvec{w}}}_{{\varvec{{i}}}}\cdot {{\varvec{x}}}_{{\varvec{{j}}}}+ {{\varvec{b}}}_{{\varvec{{i}}}}) = {{\varvec{y}}}_{{\varvec{j}}}, {{\varvec{j}}} = {{\varvec{1}}},\dots ,{{\varvec{N}}} \end{aligned}$$
(4)

Here, \((w_i, b_i)\), \(i = 1,\ldots ,L\), are the randomly generated hidden node parameters: \(w_i = [w_{i1},w_{i2},\ldots ,w_{in}]^T\) is the weight vector that connects the ‘n’ input nodes to the ith hidden node, and \(b_i\) is the bias of the ith hidden node. \(\varvec{\beta } = [\beta _1,\ldots ,\beta _L]^T\) is the output weight vector, which connects each hidden node to every output node. The output vector \(g(\mathbf x )\) maps the n-dimensional input space to an L-dimensional feature space. Here, \({\varvec{H}}\) represents the output matrix of the hidden layer. The compact form of Eq. 4 is represented by Eq. 5 as follows:

$$\begin{aligned} {\varvec{H}} \varvec{\beta }={\varvec{Y}} \end{aligned}$$
(5)

where

$$\begin{aligned} {\varvec{H}}&= \left[ \begin{array}{ccc} g({\varvec{w}}_1\cdot {\varvec{x}}_1+b_1) & \ldots & g({\varvec{w}}_L\cdot {\varvec{x}}_1+b_L) \\ g({\varvec{w}}_1\cdot {\varvec{x}}_2+b_1) & \ldots & g({\varvec{w}}_L\cdot {\varvec{x}}_2+b_L) \\ \vdots & \ddots & \vdots \\ g({\varvec{w}}_1\cdot {\varvec{x}}_N+b_1) & \ldots & g({\varvec{w}}_L\cdot {\varvec{x}}_N+b_L) \end{array} \right] _{N \times L},\\ \varvec{\beta }&= \left[ \begin{array}{ccc} \beta _{11} & \ldots & \beta _{1m} \\ \beta _{21} & \ldots & \beta _{2m}\\ \vdots & \ddots & \vdots \\ \beta _{L1} & \ldots & \beta _{Lm}\end{array} \right] _{L \times m}, \quad {\varvec{Y}} = \left[ \begin{array}{ccc} y_{11} & \ldots & y_{1m} \\ y_{21} & \ldots & y_{2m} \\ \vdots & \ddots & \vdots \\ y_{N1} & \ldots & y_{Nm}\end{array} \right] _{N \times m} \end{aligned}$$

Provided the number of hidden layer nodes is large enough, not all parameters of the network need to be adjusted (Huang 2003). ELM achieves both the smallest training error and the smallest norm of output weights, which can be represented by Eq. 6 as follows:

$$\begin{aligned} \text {minimize: } \Vert {\varvec{H}} \varvec{\beta }-{\varvec{Y}} \Vert ^2 \text { and } \Vert \varvec{\beta } \Vert \end{aligned}$$
(6)

\(\varvec{\beta }\) can be derived in many ways; one such technique is to multiply the Moore–Penrose generalized inverse of the matrix \({\varvec{H}}\) (Liang et al. 2006) by \({\varvec{Y}}\), i.e., \(\varvec{\beta } = {\varvec{H}}^+ {\varvec{Y}}\). The system diagram of the Extreme Learning Machine is shown in Fig. 1.
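The training procedure of Eqs. 4 to 6 can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the implementation evaluated in this paper: the names `solve`, `hidden_output`, `train_elm` and `predict` are hypothetical, a sigmoid activation and weights drawn uniformly from [-1, 1] are assumed, and instead of the Moore–Penrose inverse the sketch solves the regularized system \((I/C + H^TH)\beta = H^TY\), the same form as Eq. (8) with \(Y\) in place of \(X\), whose solution tends to the pseudoinverse solution as C grows. The default value of `C` is arbitrary.

```python
import math
import random


def solve(A, B):
    """Solve A X = B (A square, nonsingular) by Gauss-Jordan elimination
    with partial pivoting. A and B are lists of rows."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]  # augmented matrix [A | B]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [row[n:] for row in M]


def hidden_output(X, W, b):
    """Hidden layer output matrix H (N x L): H[j][i] = g(w_i . x_j + b_i)."""
    return [[1.0 / (1.0 + math.exp(-(sum(wk * xk for wk, xk in zip(w, x)) + bi)))
             for w, bi in zip(W, b)] for x in X]


def train_elm(X, Y, L, C=1e3, seed=0):
    """Draw random hidden parameters (w_i, b_i), then solve for beta (L x m)."""
    rng = random.Random(seed)
    n, m = len(X[0]), len(Y[0])
    W = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(L)]
    b = [rng.uniform(-1, 1) for _ in range(L)]
    H = hidden_output(X, W, b)
    # Left-hand side I/C + H^T H (L x L) and right-hand side H^T Y (L x m).
    A = [[sum(h[a] * h[c] for h in H) + (1.0 / C if a == c else 0.0)
          for c in range(L)] for a in range(L)]
    R = [[sum(h[a] * y[k] for h, y in zip(H, Y)) for k in range(m)]
         for a in range(L)]
    return W, b, solve(A, R)


def predict(X, W, b, beta):
    """Network output H beta (N x m) for the inputs X."""
    H = hidden_output(X, W, b)
    m = len(beta[0])
    return [[sum(hi * row[k] for hi, row in zip(h, beta)) for k in range(m)]
            for h in H]
```

Note that only `beta` is computed from the data; `W` and `b` are drawn once and never updated, which is why no iterative tuning or back propagation is involved.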

Fig. 1 Architecture of ELM

Fig. 2 ELM-AE and multilayer ELM

3.3 Multilayer ELM

Multilayer ELM is a machine learning approach based on the architecture of artificial neural networks and inspired by deep learning and the extreme learning machine. Deep learning was first proposed by Hinton and Salakhutdinov (2006), who used the deep structure of a multilayer autoencoder and established a multilayer neural network on unlabeled data. In their method, unsupervised training is first used to obtain the parameters in each layer, and the network is then fine-tuned by supervised learning. The deep belief network proposed by Hinton et al. (2006) outperforms traditional multilayer neural networks, SLFNs and SVMs, but it learns slowly. Working in this direction, Kasun et al. (2013) recently proposed multilayer ELM, which performs unsupervised learning from layer to layer. It does not need to iterate during the training process and hence does not spend a long time in the training phase. Compared to other conventional deep networks, it has better or comparable performance. Figure 2 shows the system architecture of ML-ELM.

3.3.1 ELM autoencoder (ELM-AE)

An autoencoder is an unsupervised neural network whose outputs are the same as its inputs. Like ELM, ELM-AE has \(`n'\) input layer nodes, a single hidden layer of \(`L'\) nodes and \(`n'\) output layer nodes. Despite the many resemblances between the two, there are two major differences between them:

  1. i.

    ELM is a supervised neural network whose output is a class label, while ELM-AE is an unsupervised one whose output is the same as its input.

  2. ii.

    The input weights and biases of the hidden layer are random in ELM, whereas in ELM-AE they are additionally made orthogonal.

Depending on the number of hidden layer nodes, the ELM-AE can be divided into the following three categories.

  1. (i)

    Compressed representation (\(n > L\)):

    In compressed representation, features of training dataset need to be represented from a higher-dimensional (or sparse) input signal space to a lower-dimensional (or compressed) feature space.

  2. (ii)

    Equal dimension representation (\(n = L\)):

    In this representation of features, the dimension of input signal space and feature space needs to be equal.

  3. (iii)

    Sparse representation (\(n < L\)):

    It is just the reverse of compressed representation where features of training dataset need to be represented from a lower-dimensional input signal space to a higher-dimensional (or sparse) feature space.

The multilayer ELM is considerably faster than deep networks because ML-ELM requires no iterative tuning, and it obtains better or similar performance compared to deep networks. It is also known that in ELM, for L hidden nodes and N training examples \((\varvec{x_j, y_j})\), the following Eq. 7 holds:

$$\begin{aligned} {\varvec{g}}_{\varvec{L}}({\varvec{x}}_{\varvec{j}})= \sum \limits _{\varvec{{i=1}}}^{\varvec{L}} {\varvec{\beta }}_{\varvec{{i}}} {\varvec{g}}_{\varvec{i}}({{\varvec{x}}}_{{\varvec{{j}}}},{{\varvec{w}}}_{{\varvec{{i}}}}, {{\varvec{b}}}_{{\varvec{{i}}}})= {\varvec{y}}_{\varvec{j}}\;, \; {\varvec{j}}={\varvec{1}},\ldots ,N \end{aligned}$$
(7)

where each symbol has the same meaning as in Eq. (4). In case of ELM-AE, the output weights \({\varvec{\beta }}\) can be computed using Eq. (8), (9) or (10), and this is different from the computation of \({\varvec{\beta }}\) in case of ELM.

To perform unsupervised learning, a few modifications are made in ELM-AE, whose working principle is otherwise similar to that of regular ELM. They are described as follows:

  1. (1)

    The output data and the input data remain same for every hidden layer. Hence, for every input data \({\varvec{X}}\):

    $$\begin{aligned} {\varvec{Y}} = {\varvec{X}} \end{aligned}$$
  2. (2)

    To improve the performance of ELMs, we need to consider the weights and the biases of the random hidden nodes to be orthogonal and can be represented as follows:

    $$\begin{aligned} {\varvec{h}}= {\varvec{g}}({\varvec{w}}\cdot {\varvec{x}} + {\varvec{b}}), \quad {\varvec{w}}^{\varvec{T}}\cdot {\varvec{w}}={\varvec{I}} \;\text {and}\; {\varvec{b}}^{\varvec{T}}\cdot {\varvec{b}}=1 \end{aligned}$$
  3. (3)

    the output weight \({\varvec{\beta }}\) is decided based on the following conditions:

    1. i.

      if \({\varvec{n > L}}\) then

      $$\begin{aligned} {\varvec{\beta }}= \left( \frac{\varvec{I}}{C}+{\varvec{H^{T}}}{\varvec{H}} \right) ^{-1}\varvec{H^{T}{X}} \end{aligned}$$
      (8)
    2. ii.

      if \({\varvec{n = L}}\) then

      $$\begin{aligned} {\varvec{\beta }} ={\varvec{H}}^{\varvec{-1}}{\varvec{X}} \end{aligned}$$
      (9)
    3. iii.

      if \({\varvec{n < L}}\) then

      $$\begin{aligned} {\varvec{\beta }}= {\varvec{H}}^{\varvec{T}}\left( \frac{\varvec{I}}{\varvec{C}}+{\varvec{HH}}^{\varvec{T}}\right) ^{\varvec{-1}}{\varvec{X}} \end{aligned}$$
      (10)

where C is a scale parameter which balances structural and empirical risk. ELM-AE is used for training the parameters in each layer of ML-ELM. The general equation representing ML-ELM is described as follows:

$$\begin{aligned} {\varvec{H}}^{\varvec{n}}= {\varvec{g}}(({\varvec{\beta }}^{\varvec{n}})^{\varvec{T}} {\varvec{H}}^{\varvec{n-1}}) \end{aligned}$$
(11)

For \({\varvec{n}} = 0\), the 0th hidden layer or the first layer is considered to be the input layer \({\varvec{X}}\). Equation (11) shows how the transformations of the data take place from layer to layer until it reaches the last but one layer before the final (i.e., output) layer \({\varvec{Y}}\). The final output matrix \({\varvec{Y}}\) can be obtained by computing the results between the last hidden layer and the output layer using the regularized least squares technique (Rifkin et al. 2003).
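As a brief aside on Eqs. (8) and (10), and assuming \({\varvec{H}}\) is the \(N \times L\) hidden layer output matrix of Eq. (5), the two regularized forms can be checked to give the same \({\varvec{\beta }}\) for any \(C > 0\). The subscripts on the identity matrices below are added here only to make their sizes explicit. Since

$$\begin{aligned} {\varvec{H}}^{\varvec{T}}\left( \frac{{\varvec{I}}_N}{C}+{\varvec{H}}{\varvec{H}}^{\varvec{T}}\right) = \frac{{\varvec{H}}^{\varvec{T}}}{C}+{\varvec{H}}^{\varvec{T}}{\varvec{H}}{\varvec{H}}^{\varvec{T}} = \left( \frac{{\varvec{I}}_L}{C}+{\varvec{H}}^{\varvec{T}}{\varvec{H}}\right) {\varvec{H}}^{\varvec{T}}, \end{aligned}$$

and both bracketed matrices are positive definite and therefore invertible, multiplying on the left by \(\big (\frac{{\varvec{I}}_L}{C}+{\varvec{H}}^{\varvec{T}}{\varvec{H}}\big )^{-1}\) and on the right by \(\big (\frac{{\varvec{I}}_N}{C}+{\varvec{H}}{\varvec{H}}^{\varvec{T}}\big )^{-1}{\varvec{X}}\) gives

$$\begin{aligned} \left( \frac{{\varvec{I}}_L}{C}+{\varvec{H}}^{\varvec{T}}{\varvec{H}}\right) ^{-1}{\varvec{H}}^{\varvec{T}}{\varvec{X}} = {\varvec{H}}^{\varvec{T}}\left( \frac{{\varvec{I}}_N}{C}+{\varvec{H}}{\varvec{H}}^{\varvec{T}}\right) ^{-1}{\varvec{X}}. \end{aligned}$$

The left-hand side is Eq. (8) and the right-hand side is Eq. (10). The two forms differ only in the size of the matrix that has to be inverted, \(L \times L\) in Eq. (8) and \(N \times N\) in Eq. (10), which is a practical reason for choosing one over the other.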

4 Proposed approach

In this section, we first discuss the architecture of our approach and then summarize the complete approach with the details of the algorithms used to implement it.

4.1 Architecture of the proposed approach

Given a corpus of classes containing text documents, the proposed approach involves the following steps:

  1. 1.

    Preprocessing of text documents of different classes

    1. i.

      Stop words and unwanted words are removed from the text documents of each class from the corpus.

    2. ii.

      Words of other categories, such as verbs, adverbs, adjectives and pronouns, are ignored. MiniparFootnote 3 is used to select nouns as the keywords.

    3. iii.

      Now every class in the corpus has preprocessed documents of keywords.

  2. 2.

    Features score generation with the help of training dataset

    1. i.

      Keywords from preprocessed text document of each class are taken to generate the term–document matrix.

    2. ii.

      Separate new documents are made, where each new document represents a particular class in the corpus. A document ‘\({\varvec{D}}_{\varvec{new}}\)’ representing a class ‘\({\varvec{C}}\)’ is constructed by putting all the preprocessed content (i.e., keywords) of all documents (also known as the training dataset) belonging to the class ‘\({\varvec{C}}\)’ into the document ‘\({\varvec{D}}_{\varvec{new}}\)’. In other words, a pool of keywords is constructed from all documents of class ‘C’ and stored in ‘\({\varvec{D}}_{\varvec{new}}\).’ Hence, we now have only one new document per class, which consists of all keywords of that class; in other words, the training set has one instance for each class.

    3. iii.

      Now those documents (‘\({\varvec{D}}_{\varvec{new}}\)’) are sent as input to the different feature selection techniques (TF-IDF/Chi-square/BNS/IG, as discussed in Sect. 3.1) separately, for comparison purposes, to generate the score of each feature (i.e., keyword). Then, for each class represented by ‘\({\varvec{D}}_{\varvec{new}}\)’, we have a list of the keywords in that class along with their corresponding TF-IDF/Chi-square/BNS/IG scores, which represent different feature vectors, one for each feature selection technique.

  3. 3.

    Reduced feature vector generation by selecting the most important keywords for each feature selection (TF-IDF/Chi-square/BNS/IG) technique

    Next, we need to select the ‘\({\varvec{n}}\)’ most important keywords from each ‘\({\varvec{D}}_{\varvec{new}}\),’ resulting in a vector of dimension ‘\({\varvec{nm}}\),’ where ‘\({\varvec{m}}\)’ represents the number of predefined classes. To obtain the ‘\({\varvec{n}}\)’ most important keywords from each ‘\({\varvec{D}}_{\varvec{new}}\),’ we use the idea of connected components from graph theory. An undirected graph ‘\({\varvec{G}}\)’ is called connected if there exists a path between any two vertices in the graph.Footnote 4 A graph consisting of only one vertex is always connected; in other words, every isolated vertex of a graph is a connected component of that graph. Figure 3 shows the three connected components (0-1-2-3-4, 5-6 and 7) of an undirected graph ‘\({\varvec{G}}\).’ In our approach, each keyword is a vertex, and a semantic relationship between two keywords forms an edge between them, which generates an undirected graph. Here, each connected component consists of related keywords. A keyword ‘a’ is related to a keyword ‘\({\varvec{b}}\)’ if ‘\({\varvec{b}}\)’ is in either the synonym list or the lemma list of ‘a.’ Each connected component represents keywords of similar context. For example, ‘a’ together with its synonym and lemma lists becomes one connected component. Figure 4 shows the connected component of ‘a’ where ‘b,’ ‘c,’ ‘f,’ ‘h,’ ‘m,’ ‘n,’ ‘s’ and ‘x’ are in the synonym and lemma lists of ‘a.’ Table 1 shows the synonym lists of certain keywords. For each ‘\({\varvec{D}}_{\varvec{new}}\),’ a list of connected components is generated using Wordnet.

    Next, from each connected component of ‘\({\varvec{D}}_{\varvec{new}}\)’ the keyword with the highest TF-IDF/Chi-square/BNS/IG score is selected as the representative keyword (or important keyword) of that component. At the end, a reduced feature vector with the ‘\({\varvec{n}}\)’ most important keywords is generated from each ‘\({\varvec{D}}_{\varvec{new}}\)’ based on the feature selection technique used (i.e., TF-IDF/Chi-square/BNS/IG). In other words, for every ‘\({\varvec{D}}_{\varvec{new}}\),’ four reduced feature vectors with the ‘\({\varvec{n}}\)’ most important keywords are generated, one for each feature selection technique. Details are discussed in Steps 3–8 of Sect. 4.2.

Fig. 3 Connected components of an undirected graph

Fig. 4 Connected component of ‘a’

Table 1 Synonym list of certain keywords
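The keyword selection of Step 3 can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the algorithm of Sect. 4.2 itself: the `related` dictionary is a hypothetical stand-in for the synonym and lemma lists that would be looked up in Wordnet, the scores are invented toy values, and the function names are not taken from this work. Components are found with a small union-find structure, the highest-scoring keyword of each component is kept as its representative, and the n best representatives form the reduced feature vector.

```python
def connected_components(keywords, related):
    """Group keywords into connected components of the undirected graph whose
    edges join a keyword to each of its related keywords (union-find)."""
    parent = {k: k for k in keywords}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for a in keywords:
        for b in related.get(a, ()):
            if b in parent:  # ignore related words that are not keywords
                parent[find(a)] = find(b)

    groups = {}
    for k in keywords:
        groups.setdefault(find(k), []).append(k)
    return list(groups.values())


def select_keywords(scores, related, n):
    """Keep the top-scoring keyword of each component, then the n best of those."""
    components = connected_components(list(scores), related)
    representatives = [max(c, key=scores.get) for c in components]
    representatives.sort(key=scores.get, reverse=True)
    return representatives[:n]


# Toy data: scores could come from any of TF-IDF, Chi-square, BNS or IG.
scores = {"car": 0.9, "auto": 0.4, "automobile": 0.7,
          "dog": 0.5, "hound": 0.6, "tree": 0.2}
related = {"car": ["auto", "automobile"], "dog": ["hound"]}  # hypothetical synonym lists
top = select_keywords(scores, related, n=2)
```

Because only one keyword survives per component, near-synonyms such as ‘car’ and ‘automobile’ in the toy data do not both occupy positions in the reduced feature vector.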
  1. 4.

    ELM classification (One-Against-All)

    For ELM classification, we choose the ELM One-Against-All (OAA) scheme after generating the reduced feature vectors in Step 3.

    In this scheme, the number of nodes in the output layer is set equal to the number of distinct classes. In other words, for each training instance x, \({\varvec{m}}\) bits are required to represent the target output y, i.e., \(({\varvec{y}}_{\varvec{1}},\ldots , {\varvec{y}}_{\varvec{m}})^{\varvec{T}}\). Thus, for \({\varvec{N}}\) training examples \((\varvec{x_{j}, y_{j}}) \in R^n \times R^m\) and \({\varvec{L}}\) hidden nodes, we randomly assign the hidden node parameters \(({\varvec{w}}_{\varvec{i}}, {\varvec{b}}_{\varvec{i}})\), \({\varvec{i}} ={\varvec{1}},\ldots ,{\varvec{L}}\), and calculate the hidden node output matrix \({\varvec{H}}\) and the output weights \({\varvec{\beta }} = \varvec{H}^+ \varvec{Y}\), where \(\varvec{H}^+\) is the Moore–Penrose generalized inverse of \({\varvec{H}}\).

    The results of the \({\varvec{m}}\) output nodes are then submitted to a decision function to find the class label for the instance x, which is that of the output node having the maximum value. Mathematically, it can be represented by Eq. 12 as follows:

    $$\begin{aligned} \hat{\varvec{g}}(\mathbf x ) = {\varvec{argmax}}_{\varvec{i=1,\ldots ,m}}{\varvec{g}}_{\varvec{i}}(\mathbf x ) \end{aligned}$$
    (12)

    Figure 5 shows the ELM OAA architecture.

Fig. 5 ELM one-against-all
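The target encoding and the decision rule of Eq. 12 can be illustrated with a minimal sketch. It assumes 0-based class indices; the function names `one_hot` and `oaa_decide` and the example output values are hypothetical and are not taken from the authors' implementation.

```python
from typing import List, Sequence


def one_hot(label: int, m: int) -> List[int]:
    """Encode a 0-based class index as the m-bit target vector (y_1, ..., y_m)."""
    if not 0 <= label < m:
        raise ValueError("label must lie in [0, m)")
    return [1 if i == label else 0 for i in range(m)]


def oaa_decide(outputs: Sequence[float]) -> int:
    """Eq. 12: return the index of the output node with the maximum value.

    On ties, the lowest index wins (an assumption made for this sketch).
    """
    return max(range(len(outputs)), key=lambda i: outputs[i])


# Targets for three training instances of classes 0, 2 and 1 (m = 3)
targets = [one_hot(c, 3) for c in (0, 2, 1)]

# Hypothetical values of the m = 3 output nodes for one test instance
scores = [0.12, -0.40, 0.87]
predicted = oaa_decide(scores)
```

Here `predicted` is the index of the third output node, because that node has the largest value.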

  1. 5.

    SVM classification (One-Against-One)

    For SVM classification, the text documents of the test data are classified using a multi-class SVM with the one-against-one (OAO) approach, applied after the reduced feature vectors are prepared in Step 3. For \(m\) classes, \(\frac{m(m-1)}{2}\) binary classifiers are built, one for each pair of classes. The class of a new document is predicted by a voting scheme: the test document is given as input to each binary SVM, which assigns it to one of its two classes and thereby casts a vote for that class. After all \(\frac{m(m-1)}{2}\) comparisons, the test document is assigned to the class that receives the maximum number of votes.

  1. 6.

    Usefulness of Wordnet during Feature vector preparation and mapping

    Wordnet (Footnote 5), developed at Princeton University, is a thesaurus for the English language. Some of the semantic relations available in Wordnet are synonymy, antonymy, hyponymy, etc. Synonyms are words that have similar meanings. A synonym set, or synset, is a group of synonyms, and the synonyms contained within a synset are called lemmas. We have made use of Wordnet for feature vector preparation (discussed in Step 3).
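The one-against-one voting described for SVM classification can be sketched as follows. This is only an illustration of the voting step: the trained binary SVMs are replaced by a hypothetical callable `pairwise_decide(a, b, x)` that returns either `a` or `b`, and the toy nearest-centroid rule and its values are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Hashable, Sequence


def oao_predict(x, classes: Sequence[Hashable],
                pairwise_decide: Callable) -> Hashable:
    """Poll one binary classifier per class pair and return the most-voted class."""
    votes = Counter()
    for a, b in combinations(classes, 2):      # m(m-1)/2 class pairs
        votes[pairwise_decide(a, b, x)] += 1   # each pair casts one vote
    # On ties, max() keeps the class that appears first in `classes`
    return max(classes, key=lambda c: votes[c])


# Toy stand-in for trained binary SVMs: choose the nearer 1-D centroid
centroids = {"A": 0.0, "B": 5.0, "C": 10.0}


def nearest_of_pair(a, b, x):
    return a if abs(x - centroids[a]) <= abs(x - centroids[b]) else b


label = oao_predict(4.2, list(centroids), nearest_of_pair)
```

With three classes, three pairwise comparisons are made; class "B" wins two of them for the input 4.2.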

We summarized the above steps in a concise manner in Sect. 4.2, and implementation details are discussed in Algorithm 1 and .

4.2 Detailed summary of the proposed approach

  1. (1)

    For each preprocessed class (i.e., ‘\({\varvec{D}}_{\varvec{new}}\)’), separately calculate the score of each keyword in that class using the feature selection techniques (TF-IDF/Chi-square/BNS/IG).

  2. (2)

    Select a class ‘\({\varvec{D}}_{\varvec{new}}\)’ randomly from the corpus, remove duplicates from its keyword set and store them in a list called Keyword_List. Initially, each word of Keyword_List represents an individual connected component.

  3. (3)

    Using Wordnet, find the synonym list and lemma list of a keyword \({\varvec{k}}\) (selected randomly from the Keyword_List of \({\varvec{D}}_{\varvec{new}}\)) and store them in a list called Synonym_Lemma_List of \({\varvec{k}}\). Next, find the keywords common to the Synonym_Lemma_List of \({\varvec{k}}\) and the Keyword_List of \({\varvec{D}}_{\varvec{new}}\). If such common keywords are found, remove them from the Keyword_List of \({\varvec{D}}_{\varvec{new}}\) and add them to a list called Synonym_Required_List, which is the required connected component of \({\varvec{k}}\).

    Add the Synonym_Required_List of \({\varvec{k}}\) to a list called List_of_List that contains the connected components of all those keywords picked up randomly from the Keyword_List of \({\varvec{D}}_{\varvec{new}}\).

  4. (4)

    Repeat Step 3 until the Keyword_List of \({\varvec{D}}_{\varvec{new}}\) is exhausted. We now have a List_of_List in which each list represents a connected component.

  5. (5)

    From each component of List_of_List, find the keyword with the highest TF-IDF/Chi-square/BNS/IG value and add it to feature list \({\varvec{i}}\). This feature list represents the feature_vector of class \({\varvec{i}}\), \({\varvec{FV}}({\varvec{C}}_{\varvec{i}})\), with one such vector for each feature selection technique.

  6. (6)

    Repeat Steps 2 to 5 for each class (\({\varvec{D}}_{\varvec{new}}\)) of the corpus.

  7. (7)

    Now we have feature_vectors for each \({\varvec{D}}_{{\varvec{new}}_{\varvec{i}}}\) (\({\varvec{i}}\) from 1 to \({\varvec{m}}\), where \({\varvec{m}}\) is the number of classes). From each feature_vector of \({\varvec{D}}_{{\varvec{new}}_{\varvec{i}}}\), remove those keywords that occur in more than a threshold number of feature_vectors (Footnote 6). This threshold sets the maximum number of feature vectors in which any keyword may occur.

  8. (8)

    Select the top ‘\({\varvec{n}}\)’ keywords from each feature_vector of \({\varvec{D}}_{{\varvec{new}}_{\varvec{i}}}\), i.e., those with the highest TF-IDF/Chi-square/BNS/IG values. Thus, the words are filtered according to their priority, importance and semantics. Now we have the reduced feature vectors of \({\varvec{D}}_{{\varvec{new}}_{\varvec{i}}}\), one for each feature selection technique (i.e., TF-IDF/Chi-square/BNS/IG).

  9. (9)

    Combine all the feature_vectors of each class (separately for each feature selection technique used) to obtain the final unique reduced feature vector (\({\varvec{V}}\)) of size \({\varvec{nm}}\).

  10. (10)

    The final reduced feature vector (\({\varvec{V}}\)) is then used for training and testing all the traditional classifiers as follows:

    1. i.

      The final reduced feature vector (\({\varvec{V}}\)) is mapped into Multilayer ELM and other conventional classifiers for training purpose.

    2. ii.

      Once the model is trained, the test data (excluding class labels) are passed to ML-ELM and the other trained classifiers separately to predict the class to which each text instance belongs.

  11. (11)

    Calculate the precision, recall and F-measure of the target classifier (i.e., ML-ELM and other established classifiers) by using the known class label of a test instance and the output prediction of the target classifier for that test instance.
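Steps 2 to 5 above can be sketched in a few lines. This is a minimal illustration, not the authors' Algorithm: the Wordnet lookup is replaced by a hypothetical in-memory synonym table, the keyword \(k\) is taken from the front of Keyword_List instead of being picked randomly (so that the result is reproducible), and \(k\) is assumed to belong to its own Synonym_Lemma_List. All keywords and scores are invented for the example.

```python
from typing import Dict, List, Set


def group_keywords(keywords: List[str],
                   synonyms: Dict[str, Set[str]]) -> List[List[str]]:
    """Steps 2-4: split the de-duplicated keyword list into synonym components."""
    keyword_list = list(dict.fromkeys(keywords))   # remove duplicates, keep order
    list_of_list = []
    while keyword_list:                            # Step 4: until exhausted
        k = keyword_list[0]                        # assumption: first, not random
        related = synonyms.get(k, set()) | {k}     # Synonym_Lemma_List of k
        component = [w for w in keyword_list if w in related]
        keyword_list = [w for w in keyword_list if w not in related]
        list_of_list.append(component)             # Synonym_Required_List of k
    return list_of_list


def representatives(components: List[List[str]],
                    score: Dict[str, float]) -> List[str]:
    """Step 5: keep the highest-scoring keyword of each component."""
    return [max(comp, key=lambda w: score[w]) for comp in components]


# Hypothetical synonym table and per-class keyword scores (e.g., TF-IDF)
synonyms = {"car": {"auto", "automobile"}, "film": {"movie"}}
scores = {"car": 0.30, "auto": 0.55, "film": 0.40, "movie": 0.20, "train": 0.10}

components = group_keywords(
    ["car", "film", "auto", "movie", "car", "train"], synonyms)
feature_list = representatives(components, scores)
```

In this toy run the keywords fall into three components, and one representative keyword survives from each.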

figure a
figure b
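Steps 7 to 9 of Sect. 4.2 (threshold filtering, top-\(n\) selection and combination into \(V\)) can be sketched as below. The function name `reduce_and_combine`, the class names, the keywords and the scores are hypothetical; the sketch assumes that scores are stored per class, as in Step 1, and that a keyword shared by several classes is kept only once in \(V\).

```python
from collections import Counter
from typing import Dict, List


def reduce_and_combine(feature_vectors: Dict[str, List[str]],
                       score: Dict[str, Dict[str, float]],
                       threshold: int, n: int) -> List[str]:
    """Drop over-shared keywords, keep the top n per class, merge into V."""
    # Number of class feature vectors in which each keyword occurs
    spread = Counter(w for fv in feature_vectors.values() for w in set(fv))
    combined: List[str] = []
    for cls, fv in feature_vectors.items():
        # Step 7: remove keywords occurring in more than `threshold` vectors
        kept = [w for w in fv if spread[w] <= threshold]
        # Step 8: keep the n keywords with the highest score in this class
        top = sorted(kept, key=lambda w: score[cls][w], reverse=True)[:n]
        # Step 9: merge, keeping each keyword once (so |V| is at most n*m)
        for w in top:
            if w not in combined:
                combined.append(w)
    return combined


feature_vectors = {
    "sports": ["goal", "team", "news", "match"],
    "tech": ["chip", "code", "news", "cloud"],
}
score = {
    "sports": {"goal": 0.9, "team": 0.7, "news": 0.8, "match": 0.4},
    "tech": {"chip": 0.6, "code": 0.9, "news": 0.5, "cloud": 0.3},
}
V = reduce_and_combine(feature_vectors, score, threshold=1, n=2)
```

With a threshold of 1, the keyword "news" is removed because it occurs in both feature vectors, and the final vector has \(nm = 4\) entries.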

5 Experimental results and discussion

5.1 Experimental setup

To demonstrate the performance of our approach, precision, recall and F-measure are calculated. Testing is conducted on two benchmark datasets: the DMOZ Open Directory Project (Footnote 7) and 20-Newsgroups (Footnote 8). The approach is implemented in Python. The experiments were run on a machine with 16 processors (Intel Xeon E5-2690 @ 2.90 GHz) and 64 GB RAM, running Ubuntu 14.04. We used four different feature selection techniques (TF-IDF, Chi-square, BNS and IG) and observed their performance on these two datasets using different classifiers. We ran our algorithm extensively with various feature vector lengths for the different classifiers, different numbers of hidden layer nodes for ELM and different numbers of hidden layers for ML-ELM on both datasets. We report only the feature vector lengths, numbers of hidden layer nodes and numbers of hidden layers for which we obtained the maximum overall F-measure; these are shown in Tables 2 and 3 for the DMOZ and 20-Newsgroups datasets, respectively. The precision, recall and F-measure of the proposed approach are calculated as follows:

5.1.1 Precision (P)

Precision is the fraction of the documents retrieved by the proposed approach that are relevant.

$$\begin{aligned} \textit{P} = \frac{|(\hbox {relevant documents})\cap (\hbox {retrieved documents})|}{|\hbox {retrieved documents}|} \end{aligned}$$

5.1.2 Recall (R)

Recall is the fraction of the relevant documents that are retrieved by the proposed approach.

$$\begin{aligned} \textit{R} = \frac{|(\hbox {relevant documents})\cap (\hbox {retrieved documents})|}{|\hbox {relevant documents}|} \end{aligned}$$

5.1.3 F-Measure (F)

F-measure (Footnote 9) is the harmonic mean of precision and recall. It measures the overall performance of a system by giving equal importance to both, and is represented by Eq. 13 as follows:

$$\begin{aligned} \textit{F} = \frac{2 \times \textit{P} \times \textit{R}}{\textit{P} + \textit{R}} \end{aligned}$$
(13)
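As a small worked example of the three measures for a single category, the sketch below treats the relevant and retrieved documents as sets of identifiers. The function name and the document identifiers are hypothetical, and returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
from typing import Set, Tuple


def precision_recall_f(relevant: Set[str],
                       retrieved: Set[str]) -> Tuple[float, float, float]:
    """Return (P, R, F) for one category from two sets of document ids."""
    hits = len(relevant & retrieved)                     # |relevant ∩ retrieved|
    p = hits / len(retrieved) if retrieved else 0.0      # precision
    r = hits / len(relevant) if relevant else 0.0        # recall
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0      # Eq. 13
    return p, r, f


relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
p, r, f = precision_recall_f(relevant, retrieved)
```

Two of the three retrieved documents are relevant, so \(P = 2/3\), \(R = 2/4\) and \(F = 4/7\).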

The overall values for precision, recall and F-measure of the proposed approach using different classifiers on both the datasets have been calculated using Eqs. 14, 15 and 16, respectively.

$$\begin{aligned}&\hbox {Overall precision} = \frac{\sum \nolimits _{\varvec{i=1}}^{\varvec{n}}{({\varvec{p}}_{\varvec{i}}.{\varvec{d}}_{\varvec{i}})}}{\hbox {Total no. of test documents}} \end{aligned}$$
(14)
$$\begin{aligned}&\hbox {Overall recall} = \frac{\sum \nolimits _{\varvec{i=1}}^{\varvec{n}}{({\varvec{r}}_{\varvec{i}}.{\varvec{d}}_{\varvec{i}})}}{\hbox {Total no. of test documents}} \end{aligned}$$
(15)
$$\begin{aligned}&\hbox {Overall F-measure} = \frac{\sum \nolimits _{\varvec{i=1}}^{\varvec{n}}{({\varvec{f}}_{\varvec{i}}.{\varvec{d}}_{\varvec{i}})}}{\hbox {Total no. of test documents}} \end{aligned}$$
(16)

where \({\varvec{p}}_{\varvec{i}}\), \({\varvec{r}}_{\varvec{i}} \), \({\varvec{f}}_{\varvec{i}}\) and \({\varvec{d}}_{\varvec{i}}\) are the precision, recall, F-measure and the number of testing documents of the \({\varvec{i}}{\varvec{th}}\) category, respectively, and \({\varvec{n}}\) is the number of categories of the dataset.
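Eqs. 14 to 16 share one form: a per-category value weighted by the number of test documents in that category. A minimal sketch follows, with invented per-category values; the function name `overall` is an assumption.

```python
from typing import Sequence


def overall(values: Sequence[float], doc_counts: Sequence[int]) -> float:
    """Weighted average of per-category values (p_i, r_i or f_i) by d_i."""
    if len(values) != len(doc_counts):
        raise ValueError("one value and one document count per category")
    total = sum(doc_counts)                  # total no. of test documents
    return sum(v * d for v, d in zip(values, doc_counts)) / total


f_i = [0.80, 0.60, 0.70]    # hypothetical per-category F-measures
d_i = [100, 300, 100]       # hypothetical test documents per category
overall_f = overall(f_i, d_i)   # (80 + 180 + 70) / 500
```

Because the weights are document counts, large categories influence the overall value more than small ones.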

Table 2 Different parameters used for DMOZ dataset
Table 3 Different parameters used on 20-Newsgroups Dataset

5.2 DMOZ dataset

DMOZ is an Open Directory Project which consists of 14 categories of Web pages. For our work, we considered 69,068 documents out of which 38,000 documents are used for training and 31,068 documents are used for testing purposes. We observed the overall F-measure on DMOZ dataset for different classifiers using the four traditional feature selection techniques as follows:

  1. (1)

    It can be observed from Fig. 6 that LinearSVC using the ‘TF-IDF’ feature selection technique generates good results, as is evident from the overall F-measure of 0.7035 with a feature vector length of 1858. ‘IG’ is on a par with ‘TF-IDF’, having an overall F-measure of 0.6919 with a feature vector length of 1240, followed by Chi-square with an overall F-measure of 0.6820. ‘BNS’ shows poor performance, with an overall F-measure of 0.59. The category-wise performance of LinearSVC for TF-IDF, for which it achieved the maximum overall F-measure, is given in Table 4 for demonstration purposes.

  2. (2)

    When ELM is used as the classifier (shown in Fig. 7), the ‘TF-IDF’ selection technique has the maximum overall F-measure of 0.7055, followed by ‘IG’ and ‘Chi-square’ with overall F-measures of 0.6898 and 0.6824, respectively. However, the results with ‘BNS’ as the selection technique are not impressive, as the overall F-measure is 0.5996. To demonstrate the category-wise performance of ELM (shown in Table 5), we show only the TF-IDF feature selection technique, for which it achieved the maximum overall F-measure.

  3. (3)

    Similarly, when ML-ELM is used as the classifier (shown in Fig. 8), the results improve, achieving an overall F-measure of 0.7228 using ‘TF-IDF’ as the feature selection technique, followed by IG with 0.7147. The number of hidden nodes is the same as that used in ELM. Again, the result for ‘BNS’ is not impressive at 0.6068. The performance of ML-ELM on different categories using the different feature selection techniques is given in Tables 6, 7, 8 and 9, respectively.

From the above results, we conclude that ML-ELM yields better results than the SVM and ELM classifiers. For the ML-ELM classifier, the number of hidden layer nodes is set larger than the length of the input feature vector (i.e., \({\varvec{n}} <{\varvec{L}}\), as discussed in Sect. 3.3.1) so that the training feature set is mapped from a lower-dimensional input space to a higher-dimensional (or sparse) feature space. This in turn led to the good performance of ML-ELM on this dataset. Although the maximum overall F-measure is achieved by ML-ELM using ‘TF-IDF’ as the feature selection technique, ‘IG’ is on a par with ‘TF-IDF’ in most cases. The generalization capability of ELM across the feature selection techniques is either better than or almost similar to that of SVM on this dataset; a possible reason is the large training dataset, which may cause overfitting during the training process of ELM.
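The mapping from an \(n\)-dimensional input to an \(L\)-dimensional hidden representation with \(n < L\) can be illustrated as below. This is only a sketch of a basic ELM-style random hidden layer; it does not reproduce the ELM-autoencoder training used by ML-ELM. The sigmoid activation, the uniform initialization in \([-1, 1]\), the seed and the sizes are assumptions made for the example.

```python
import math
import random
from typing import List, Tuple


def random_hidden_layer(n: int, L: int,
                        seed: int = 0) -> Tuple[List[List[float]], List[float]]:
    """Draw random input weights (L x n) and biases (L) for the hidden layer."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(L)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(L)]
    return weights, biases


def hidden_output(x: List[float], weights: List[List[float]],
                  biases: List[float]) -> List[float]:
    """Compute h(x) = sigmoid(W x + b), one value per hidden node."""
    out = []
    for row, b in zip(weights, biases):
        z = sum(w * xi for w, xi in zip(row, x)) + b
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out


x = [0.2, 0.0, 0.7]                      # input with n = 3 features
W, b = random_hidden_layer(n=3, L=8)     # L = 8 hidden nodes, so n < L
h = hidden_output(x, W, b)               # 8-dimensional representation
```

The input vector of length 3 is expanded into a vector of length 8, which is the sense in which the feature set is represented in a higher-dimensional space.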

Fig. 6 Performance measurements of different feature selection techniques using SVM

Table 4 DMOZ dataset using SVM and 1858 feature vector length for TF-IDF
Fig. 7 Performance measurements of different feature selection techniques using ELM

Table 5 DMOZ dataset using ELM and 1858 feature vector length for TF-IDF
Fig. 8 Performance measurements of different feature selection techniques using ML-ELM

Table 6 DMOZ dataset using ML-ELM and 1858 feature vector length for TF-IDF
Table 7 DMOZ dataset using ML-ELM and 1408 feature vector length for Chi-square
Table 8 DMOZ dataset using ML-ELM and 1789 feature vector length for BNS
Table 9 DMOZ dataset using ML-ELM and 1240 feature vector length for IG

5.2.1 Comparison with existing approaches on DMOZ dataset

  1. 1.

    Proposed approach versus Heung-Seon Oh et al.

    We have compared our work with Oh et al. (2011), who proposed an algorithm consisting of two stages: search and classification. In the search stage, they used a search technique to retrieve from the entire hierarchy several candidate categories that are most similar to the input document. They then collected training data for each candidate from the documents associated with the candidate category (local information) and from the top-level categories (global information). Their proposed methods for determining the mixture weights are applied to each category node to modulate the relative contributions of the local and global models. The model was tested on the DMOZ dataset, and the maximum average F-measure they achieved is 0.3773, which is lower than that of our approach.

  2. 2.

    Proposed approach versus Gui-Rong Xue et al.

    Xue et al. (2008) have suggested a two-stage approach for large-scale hierarchical classification called deep classification. In the first stage, they organized the hierarchy of text into flat categories, where a search process is conducted on the large-scale hierarchy to retrieve the categories related to a given document. They then ranked the categories and selected the most useful ones. In the second stage, they trained a classification model on this small subset of the original hierarchy and used it to classify the given document. They evaluated their deep classification approach on the Open Directory Project and obtained an F-measure of 0.5180, which is lower than the maximum overall F-measure of 0.7228 of our approach.

  3. 3.

    Proposed approach versus Siddharth Gopal et al.

    Finally, we compared our work with Gopal and Yang (2013), who developed a recursive regularization framework along with a scalable optimization algorithm for large-scale hierarchical classification with hierarchical and graphical dependencies between the class labels. They developed two variants of their framework, using the logistic-loss function and the hinge-loss function. They used multiple benchmark datasets, including DMOZ, for their experiments and achieved consistent results. They achieved an F-measure of 0.5717 on the DMOZ dataset, which is significantly lower than the maximum overall F-measure obtained by our approach.

The comparison results are given in Table 10. Our work with ML-ELM as the classifier achieved an overall F-measure of 0.7228 using the ‘TF-IDF’ feature selection technique on the DMOZ dataset, which supports the significance of our approach compared to the above existing approaches.

Table 10 Comparison of results with different approaches

5.3 20-Newsgroups dataset

20-Newsgroups is one of the most popular datasets for text classification and contains 20 different newsgroups. The documents in it are grouped into 7 categories. For the experiments, 18,846 documents were considered, of which 11,318 documents are used for training and 7528 documents are used for testing.

Fig. 9 Performance measurements of different feature selection techniques using SVM

Table 11 20-Newsgroups dataset using SVM and 1462 feature vector length for IG

We observed the following F-measure on 20-Newsgroups dataset for different classifiers with different existing feature selection techniques:

  1. (1)

    LinearSVC using ‘IG’ as the feature selection technique generates decent results, as is evident from the overall F-measure of 0.8033 with a feature vector length of 1462 (shown in Fig. 9). It is followed by Chi-square with an F-measure of 0.8020 (feature vector length of 1510), and then by BNS and TF-IDF with overall F-measures of 0.7945 (feature vector length of 1756) and 0.7827 (feature vector length of 2260), respectively. Table 11 shows the performance of SVM on each category of 20-Newsgroups using ‘IG’ as the feature selection technique, for which it achieved the maximum overall F-measure.

  2. (2)

    When ELM is used as the classifier (shown in Fig. 10), the ‘BNS’ feature selection technique generates an overall F-measure of 0.7861, followed by ‘Chi-square’ and ‘IG’ with overall F-measures of 0.7754 and 0.7744, respectively. ELM using TF-IDF has the lowest performance of all the feature selection techniques. Table 12 shows the category-wise performance of ELM using the ‘BNS’ technique, for which it achieved the highest overall F-measure.

  3. (3)

    Similarly, when ML-ELM with 3 hidden layers is used as the classifier (shown in Fig. 11), the results improve, achieving an F-measure of 0.8153 using ‘BNS’ as the feature selection technique, followed by ‘Chi-square’ and ‘IG’ with overall F-measures of 0.8139 and 0.8134, respectively. Tables 13, 14, 15 and 16 show the category-wise performance of ML-ELM using TF-IDF, Chi-square, BNS and IG, respectively.

Fig. 10 Performance measurements of different feature selection techniques using ELM

From the above results, we conclude that ML-ELM yields better results than the SVM and ELM classifiers. The number of hidden layer nodes is set larger than the length of the input feature vector (i.e., \({\varvec{n}} <{\varvec{L}}\), as discussed in Sect. 3.3.1) in order to represent the training feature set in a higher-dimensional (or sparse) feature space. This in turn led to the good performance of ML-ELM on this dataset. ML-ELM using the BNS feature selection technique achieved the highest F-measure among the feature selection techniques. SVM performed better than ELM for all feature selection techniques on this dataset.

Table 12 20-Newsgroups dataset using ELM and 1756 feature vector length for BNS
Fig. 11 Performance measurements of different feature selection techniques using ML-ELM

Table 13 20-Newsgroups dataset using ML-ELM and 2260 feature vector length for TF-IDF
Table 14 20-Newsgroups dataset using ML-ELM and 1510 feature vector length for Chi-square
Table 15 20-Newsgroups dataset using ML-ELM and 1756 feature vector length for BNS
Table 16 20-Newsgroups dataset using ML-ELM and 1462 feature vector length for IG
Table 17 Comparison of results with different approaches

5.3.1 Comparison with existing approaches on 20-Newsgroups dataset

  1. 1.

    Proposed approach versus Zhang et al.

    The obtained results are compared with those of Zhang et al. (2009), who suggested a Fuzzy kNN algorithm to classify Web pages, with the feature vector prepared using a simple TF-IDF approach. They tested their approach on the 20-Newsgroups dataset, and from Table 2 of Zhang et al. (2009), we observe that the F-measure of their Fuzzy kNN approach is 0.7638 when the feature vector size is 2500. In comparison, our approach obtained a maximum overall F-measure of 0.8153 using ML-ELM as the classifier.

  2. 2.

    Proposed approach versus Gongde Guo et al.

    Guo et al. (2003) have used a simple kNN model-based approach to classify Web pages into different categories. From Table 3 of Guo et al. (2003), it is evident that their approach obtained an F-measure of 0.8079 on the 20-Newsgroups dataset, which is lower than what we achieve using ML-ELM as the classifier.

  3. 3.

    Proposed approach versus Weimao Ke et al.

    Ke (2012) has discussed the least information theory (LIT) to quantify the meaning of information in probability distributions and derived a new document representation for text classification. LIT offers an information-centric approach that weights terms based on their probability distributions in the documents versus in the collection. Two term weight quantities are suggested in the context of document classification: least information binary (LIB) and least information frequency (LIF). The LIB*LIF weighting scheme is shown to outperform TF*IDF in several experimental settings. The 20-Newsgroups dataset was used for the experiments, and an F-measure of 0.6030 is reported, which is lower than our results.

  4. 4.

    Proposed approach versus Yangqiu Song et al.

    A dataless hierarchical classification approach has been introduced by Song and Roth (2014). Their scheme is composed of two steps: a bootstrapping step and a semantic similarity step. In the bootstrapping step, they adapt to the specific document collection. In the semantic similarity step, they embed both labels and documents in a semantic space in order to compute a meaningful semantic similarity between a document and a potential label. They have argued that their algorithm is competitive with supervised classification algorithms. They achieved an F-measure of 0.6820 on the 20-Newsgroups dataset, which is significantly lower than what we obtained using ML-ELM as the classifier.

  5. 5.

    Proposed approach versus Nguyen Cao Truong Hai et al.

    Im Kim and Park (2009) have combined SVD with LDA for text classification. In their method, they first applied SVD to the input data, which converts it into a rank-specified reduced space, and later applied LDA on this reduced space to classify the text. They used the 20-Newsgroups dataset for their experiments and achieved an F-measure of 0.6442, which is lower than the overall F-measure of 0.8153 obtained by our approach using ML-ELM.

Table 18 Comparisons of ML-ELM with other state-of-the-art classifiers

The comparison results are given in Table 17. The proposed approach with ML-ELM as the classifier obtained an impressive overall F-measure of 0.8153 using ‘BNS’ as the feature selection technique on 20-Newsgroups dataset which signifies the importance of our approach compared to the above existing approaches.
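The margins between the F-measures quoted above for the 20-Newsgroups dataset can be computed with a short sketch. The values are exactly those reported in this section; the variable names are hypothetical.

```python
# F-measures on 20-Newsgroups as quoted in Sect. 5.3.1
reported_f = {
    "Zhang et al. (2009)": 0.7638,
    "Guo et al. (2003)": 0.8079,
    "Ke (2012)": 0.6030,
    "Song and Roth (2014)": 0.6820,
    "Im Kim and Park (2009)": 0.6442,
}
ours = 0.8153  # ML-ELM with 'BNS' (maximum overall F-measure)

# Absolute margin of the proposed approach over each existing approach
margins = {name: round(ours - f, 4) for name, f in reported_f.items()}

# Existing approach whose reported F-measure is closest to ours
closest = min(margins, key=margins.get)
```

The smallest margin is with respect to Guo et al. (2003), and the largest is with respect to Ke (2012).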

5.4 Comparisons with other traditional classifiers

The overall F-measures of the different classifiers using the different feature selection techniques on both datasets are given in Table 18. The experimental results show that representing the training data in a higher-dimensional feature space allows ML-ELM to generate the highest F-measure among all the traditional classifiers on both datasets, which supports the prominence of deep learning for classifying text data. Table 18 shows that on the DMOZ dataset, ML-ELM dominates all other classifiers for all feature selection techniques. Similarly, on the 20-Newsgroups dataset, ML-ELM dominates all other classifiers for all feature selection techniques, with the highest overall F-measure of 0.8153.

6 Conclusion

The paper presented an efficient approach for classifying text data using ML-ELM and other established classifiers. The proposed approach selects the best features (using the connected component technique and Wordnet) to prepare an effective training dataset. ML-ELM yields very good results, which demonstrate the efficiency of our approach compared to other existing approaches. Each of the feature selection techniques has been combined individually with the connected component technique to prepare a good feature vector. The experimental results on the DMOZ and 20-Newsgroups datasets using the SVM, ELM and ML-ELM classifiers are summarized as follows:

  1. (1)

    Using LinearSVC, SVM gives maximum overall F-measure of 0.7035 on DMOZ dataset with ‘TF-IDF’ as the feature selection technique and overall F-measure of 0.8033 on 20-Newsgroups dataset with ‘IG’ as the feature selection technique.

  2. (2)

    Similarly, using ELM as the classifier gives a maximum overall F-measure of 0.7055 on the DMOZ dataset with ‘TF-IDF’ as the feature selection technique and an overall F-measure of 0.7861 on the 20-Newsgroups dataset with ‘BNS’ as the feature selection technique.

  3. (3)

    Using ML-ELM as classifier gives maximum overall F-measure of 0.7228 on DMOZ dataset with ‘TF-IDF’ as the feature selection technique and overall F-measure of 0.8153 on 20-Newsgroups dataset with ‘BNS’ as the feature selection technique.

  4. (4)

    The possible reasons for ML-ELM showing better performance on both datasets compared to other state-of-the-art classifiers can be summarized as follows:

    1. i.

      the ability to represent the training feature set in a higher-dimensional (or sparse) feature space by using more nodes in the hidden layer than in the input layer.

    2. ii.

      the weighting parameters of each hidden layer are set by layer-wise unsupervised training.

    3. iii.

      in the deep architecture of ML-ELM, the presence of multiple layers gives multiple nonlinear transformations of the input data, which in turn can generate better representation learning.

    4. iv.

      the deep architecture of ML-ELM can capture higher-level abstractions, and every layer in the network can learn a distinct form of the input by performing representation learning using an unsupervised learning technique.

  5. (5)

    On DMOZ, whose training dataset is larger than that of 20-Newsgroups, ELM yields results similar to or better than those of SVM, possibly due to overfitting during the training phase of ELM. On the 20-Newsgroups dataset, however, SVM performs better than ELM. This indicates that ELM has lower generalization ability than SVM when the training dataset is small, whereas for a large training dataset, ELM performs similarly to or better than SVM.

The promising results of our approach demonstrate the suitability and importance of ML-ELM in the field of text classification compared to other state-of-the-art classifiers. This shows that deep learning has a high impact on the classification of text data. As future work, this approach can be implemented in a distributed environment, which would reduce processing time and help in load balancing. We believe that combining the feature space of ML-ELM with other state-of-the-art classifiers will further strengthen the results of text classification.