1 Introduction

Text mining can be understood as data mining on textual documents. Typical text mining tasks are text classification, clustering, retrieval, etc. Most of the earlier works used traditional machine learning techniques for text classification, such as support vector machines, naive Bayes, logistic regression, maximum entropy, and decision trees. However, these methods cannot automatically capture discriminative features from the training data; their performance depends heavily on hand-crafted data representations, which are labor-intensive to build. More recently, the text classification literature has been dominated by deep learning techniques, motivated by the outstanding results of deep neural networks in text mining, image processing, and natural language processing [1, 2]. Deep networks, however, also have limitations: they need large memory bandwidth, training via backpropagation is time-consuming, the architectures are complex, and preserving interdependencies among the internal layers over long periods is difficult. Hence it is not easy to generalize text classification models to a new domain. An efficient deep learning classifier called Multi-layer ELM was introduced in 2013 by Kasun et al. [3] to address these problems.

1.1 Research Motivation

Overall, the existing machine learning and deep learning classification techniques have the following limitations:

  i. Representing the data becomes more complex as the growing dataset size demands a large storage space.

  ii. When the input data grow exponentially in a limited-dimensional space, distinguishing the input features in the TF-IDF vector space becomes challenging.

  iii. The performance of machine and deep learning classifiers depends heavily on data representation, which is labor-intensive.

Feature selection, which selects an optimal subset of features from a massive dataset, can alleviate the curse of dimensionality but cannot separate the features in a lower-dimensional space when the data grow dynamically [4, 5]. Kernel approaches [6, 7] are commonly used in classification to deal with this challenge and have yielded good results in the past; a detailed survey of kernel and spectral methods for classification is given by Filippone et al. [8]. Although kernel methods can handle data that are not separable in a low-dimensional space by projecting them into a higher-dimensional space, they are expensive (i.e., time-consuming) because the structural similarity among input features is computed through dot products. Using the feature mapping technique of ELM, Huang et al. [9] showed that mapping the input vector non-linearly into a high-dimensional feature space makes the features simple and linearly separable, and can thus outperform kernel approaches [10]. ELM, however, is a single-layer architecture and therefore requires a very wide network, which is difficult to design so that it matches heavily changing input data.

In this vein, this research investigates the feature space of Multi-layer ELM (ML-ELM) [11], which extensively exploits the advantages of ELM feature mapping [12, 13] and the ELM autoencoder, to address the constraints mentioned above. The goal of this study is to investigate the extended, high-dimensional feature space of ML-ELM (HDFS-MLELM) and to thoroughly evaluate it for text classification in comparison with the TF-IDF vector space (VS-TFIDF).

1.2 Research Contribution

The major contributions of the paper can be summarized as follows:

  • This work studies HDFS-MLELM and uses text data to thoroughly investigate multiple classification algorithms on HDFS-MLELM and on VS-TFIDF.

  • The past literature shows that no research on text classification has been carried out on the Multi-layer ELM’s enlarged feature space. In light of the benefits mentioned above, this study can therefore be considered a new direction in the text classification domain.

  • A novel feature selection technique, termed Correlation-based Feature Selection (CORFS), is proposed for selecting the essential features from a big corpus.

  • To demonstrate its usefulness, the performance of Multi-layer ELM employing the proposed CORFS technique is compared with several machine and deep learning classifiers.

  • Text classification results of various traditional classifiers, obtained by running them on the ELM feature space and on HDFS-MLELM, are compared in order to show the effectiveness of HDFS-MLELM.

  • The experimental results of the proposed approach are compared with the state-of-the-art approaches.

The rest of the paper is organized as follows: Sect. 2 introduces the preliminaries of Multi-layer ELM and its feature mapping technique. The proposed methodology is discussed in Sect. 3. Section 4 presents the experimental work and analyzes the results. The paper is concluded in Sect. 5.

2 Preliminaries

2.1 Multi-layer ELM

As demonstrated in Fig. 3, Multi-layer ELM (ML-ELM) is a hybrid of ELM (shown in Fig. 1) and the ELM autoencoder (shown in Fig. 2) with more than one hidden layer; it is described in the following steps.

Fig. 1. Overview of ELM

Fig. 2. Overview of ELM-autoencoder

Fig. 3. Overview of Multi-layer ELM

  • Unsupervised training occurs between the hidden layers using the ELM autoencoder [14]. Unlike other deep networks, ML-ELM does not require fine-tuning, since the autoencoding capability of ELM is an excellent match for ML-ELM [15].

  • ELM autoencoders are stacked progressively to create a multi-layer neural network architecture: the output of one trained ELM autoencoder is fed as input to the next, and so on.

  • The first ELM autoencoder level learns a basic representation of the input data; each subsequent level refines the previous level’s output into a better representation, and so on. Equation 1 relates the \(i^{th}\) layer to the \((i-1)^{th}\) layer (a minimal sketch of this stacking is given after this list).

    $$\begin{aligned} H_i= g((\beta _i)^T H_{i-1}) \end{aligned}$$
    (1)

    where \(H_{i-1}\) and \(H_i\) are the input and output matrices of the \(i^{th}\) hidden layer, respectively. g(.) is the activation function, and \(\beta \) is the learning parameter. The input layer is \(H_0\), and the first hidden layer is \(H_1\). Regularized least squares is used to get the output weight \(\beta \) [16].

  • Finally, supervised learning is utilized to fine-tune the network (ELM is used for this purpose).
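To make the layer-wise training concrete, here is a minimal NumPy sketch of stacking ELM autoencoders as in Eq. 1. The function names, the default layer sizes, and the ridge constant C are illustrative assumptions, not the exact configuration used in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_autoencoder_beta(H_prev, n_hidden, C=1.0, seed=0):
    """Train one ELM autoencoder: random hidden mapping, then a regularized
    least-squares solve whose output weights beta reconstruct the layer input."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = H_prev.shape
    W = rng.standard_normal((n_features, n_hidden))  # random input weights (not trained)
    b = rng.standard_normal(n_hidden)                # random biases (not trained)
    H = sigmoid(H_prev @ W + b)                      # random feature mapping of the input
    # beta minimizes ||H beta - H_prev||^2 + (1/C)||beta||^2 (regularized least squares)
    beta = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ H_prev)
    return beta                                      # shape: (n_hidden, n_features)

def mlelm_hidden_representation(X, hidden_sizes=(150, 150, 150)):
    """Stack ELM autoencoders layer by layer: H_i = g(H_{i-1} beta_i^T),
    the row-sample form of Eq. 1."""
    H = X
    for i, L in enumerate(hidden_sizes):
        beta = elm_autoencoder_beta(H, L, seed=i)
        H = sigmoid(H @ beta.T)                      # next hidden-layer output H_i
    return H                                         # representation fed to the final ELM
```

The final supervised ELM output layer is omitted here for brevity.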

3 Methodology

  1. Documents Pre-processing:

    Let corpus P consist of C classes. At the beginning of feature engineering, all documents of each class are combined into a single set called \(D_{large}\). Then lexical analysis, stop-word removal, HTML tag removal, and stemming are performed on \(D_{large}\), and the Natural Language Toolkit (NLTK) is used to extract the index terms from \(D_{large}\). After this basic data cleaning, the first set of features is derived from \(D_{large}\) and a term-document matrix is created.

  2. Correlation Based Feature Selection (CORFS):

    Using the k-means clustering algorithm [17], \(D_{large}\) is divided into n term-document clusters \(td_i, i \in [1, n]\). The following steps describe how the important features are extracted from each cluster \(td_i\) (a compact sketch of the whole CORFS procedure is given at the end of this section, after step 5).

    i. Calculating Centroid:

      First, the centroid of \(td_i\) is calculated using Eq. 2.

      $$\begin{aligned} sc_i = \frac{\sum \limits _{j=1}^{r}t_{j}}{r} \end{aligned}$$
      (2)

      where \(t_j\) denotes the \(j^{th}\) term vector of \(td_i\) and r is the number of terms in \(td_i\). Then the cosine-similarity between each \(t_j \in td_i\) and \(sc_i\) is computed.

    ii. Generating correlation matrix:

      Equation 3 is used to find the correlation (cr) between a pair of terms \(t_i\) and \(t_j\); the resulting correlation matrix is shown in Table 1.

      $$\begin{aligned} cr_{t_it_j} = \frac{C_{t_it_j}}{\sqrt{(V_{t_i} * V_{t_j})}} \end{aligned}$$
      (3)

      where \(C_{t_it_j}\) is the covariance (the joint variability between two terms) of \(t_i\) and \(t_j\), and \(V_{t_i}\) and \(V_{t_j}\) are their respective variances, as defined below.

      $$ V_{t_i} = \frac{1}{b-1}\sum _{m=1}^{b}(X_{im} - \overline{X_i})^2 $$
      $$ V_{t_j} = \frac{1}{b-1}\sum _{m=1}^{b}(X_{jm} - \overline{X_j})^2 $$

      where \(\overline{X_i}\) and \(\overline{X_j}\) represent the means of \(t_i\) and \(t_j\) over the b documents, respectively. The covariance between \(t_i\) and \(t_j\) is computed using Eq. 4.

      $$\begin{aligned} C_{t_it_j} = \frac{1}{b-1}\sum _{m=1}^{b}(X_{im} - \overline{X_i})(X_{jm} - \overline{X_j}) \end{aligned}$$
      (4)
    iii. Rejection of highly correlated terms from \(td_i\):

      Terms that are highly correlated within a cluster are generally a sort of synonym of one another and hence do not discriminate well within the cluster; they should therefore be removed. To find such terms in \(td_i\), the term with the maximum cosine-similarity score in \(td_i\) is selected first. Then the set of terms that are highly correlated with it \((cr \le -0.87 \text{ or } cr \ge 0.89)\) is identified and removed from \(td_i\). This step is repeated with the term having the next-highest cosine-similarity score, and so on, until \(td_i\) is exhausted. In this way, all highly correlated terms are removed from \(td_i\).

    iv. Computing Discriminating Power Measure (DPM):

      DPM [18] measures the relevance, i.e., the importance, of a term within a cluster. If the DPM score of a term inside an unbiased cluster is very high, that term is important for the cluster: many documents of the cluster contain it, and its cohesion (tightness) to the cluster’s center is high.

      • For each \(t_i \in td_i\), the document frequencies inside (\(DF_{in, t_i}\)) and outside (\(DF_{out, t_i}\)) of \(td_i\) are calculated using Eqs. 5 and 6, respectively.

        $$\begin{aligned} DF_{in, t_i} = \frac{no.\; of\; documents\; \in td_i\; and\; have\; t_i}{no.\; of\; documents\; \in td_i} \end{aligned}$$
        (5)
        $$\begin{aligned} DF_{out, t_i} = \frac{no.\; of\; documents\; have \;t_i \;and\; \notin td_i}{no.\; of\; documents\; \notin td_i} \end{aligned}$$
        (6)
      • The difference between inside and outside document frequency of \(t_i \in td_i \) is computed using Eq. 7.

        $$\begin{aligned} DIFF_{td_i, t_i} = | DF_{{in}, t_i} - DF_{{out}, t_i} | \end{aligned}$$
        (7)
      • Equation 8 computes the DPM score of each term \(t_i\) by summing its DIFF values over all n clusters.

        $$\begin{aligned} DPM(t_i) = \sum _{j=1}^{n}DIFF_{td_j, t_i} \end{aligned}$$
        (8)
    v. Selection of candidate terms having high DPM scores:

      The terms of each term-document cluster are arranged according to their DPM scores, and the top \(k\%\) terms are selected as candidate terms. This step is repeated for each \(td_i\), so that every \(td_i\) retains its top \(k\%\) candidate terms.

  3. Input feature vector generation:

    To build the input feature vector, the top \(k\%\) features of every \(td_i\) are merged into a single list \(L_{list}\).

  4. Feature mapping of Multi-layer ELM:

    i. Multi-layer ELM heavily employs the universal classification [19, 20] and approximation [21, 22] capabilities of ELM.

    ii. ML-ELM cleverly leverages the extended representation (i.e., \(n < L\)) technique of the ELM autoencoder [12, 23], where n and L are the numbers of input and hidden-layer nodes, respectively.

    iii. The features are transferred from a low-dimensional feature space to a higher-dimensional feature space using Eq. 9. The mapping of the input vector into HDFS-MLELM is shown in Fig. 4, where \(h_i({\textbf {x}}) = g(w_i \cdot {\textbf {x}} + b_i)\).

      $$\begin{aligned} h({\textbf {x}}) = \left[ \begin{array}{c} h_1({\textbf {x}})\\ h_2({\textbf {x}})\\ h_3({\textbf {x}})\\ .\\ .\\ . \\ h_L({\textbf {x}}) \end{array} \right] ^T = \left[ \begin{array}{c} g(w_1, b_1, {\textbf {x}}) \\ g(w_2, b_2, {\textbf {x}})\\ g(w_3, b_3, {\textbf {x}}) \\ .\\ .\\ .\\ g(w_L, b_L, {\textbf {x}})\end{array} \right] ^T \end{aligned}$$
      (9)

      \(h({\textbf {x}}) = {[h_1({\textbf {x}}), h_2({\textbf {x}}),\cdots , h_i({\textbf {x}}), \cdots , h_L({\textbf {x}})]}^T \) transfers the input features to HDFS-MLELM [24, 25].

    iv. \(L_{list}\) is mapped into HDFS-MLELM using Eq. 9. Before the transformation, L is set to a value larger than n, which makes all the features of \(L_{list}\) linearly separable (a minimal sketch of this mapping is given after Fig. 4).

  5. Classification on HDFS-MLELM:

    Different supervised learning algorithms, employing \(L_{list}\) as the input feature vector, are run individually on VS-TFIDF and on HDFS-MLELM.
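As referenced in step 2, the CORFS steps above can be summarized in the following minimal NumPy sketch. It assumes each cluster is available as a documents-by-terms matrix; the function name corfs_select, the top_k default, and the single-cluster treatment of the DPM sum are illustrative choices, while the correlation cut-offs are the ones quoted in the text.

```python
import numpy as np

def corfs_select(X_in, X_out, top_k=0.10, low=-0.87, high=0.89):
    """Sketch of CORFS for a single term-document cluster td_i.
    X_in : documents-by-terms matrix of the documents in td_i.
    X_out: documents-by-terms matrix of the documents outside td_i
           (same term columns).  Returns the column indices of selected terms."""
    # i. centroid of the term vectors (Eq. 2) and cosine similarity of each term to it
    centroid = X_in.mean(axis=1)                       # average term vector, length = #docs
    denom = np.linalg.norm(X_in, axis=0) * np.linalg.norm(centroid) + 1e-12
    cos_sim = (X_in.T @ centroid) / denom

    # ii. term-term Pearson correlation matrix (Eqs. 3 and 4)
    corr = np.corrcoef(X_in, rowvar=False)

    # iii. greedy rejection: starting from the term with the highest cosine
    #      similarity, drop every term whose correlation with it falls in the
    #      "highly correlated" range quoted in the text
    alive = np.ones(X_in.shape[1], dtype=bool)
    for t in np.argsort(-cos_sim):
        if not alive[t]:
            continue
        rejected = (corr[t] <= low) | (corr[t] >= high)
        rejected[t] = False                            # keep the anchor term itself
        alive &= ~rejected

    # iv. per-cluster DPM contribution |DF_in - DF_out| (Eqs. 5-7); the full DPM
    #     additionally sums this quantity over all clusters (Eq. 8)
    df_in = (X_in > 0).mean(axis=0)
    df_out = (X_out > 0).mean(axis=0)
    dpm = np.abs(df_in - df_out)

    # v. keep the top k% of the surviving terms by DPM score
    survivors = np.flatnonzero(alive)
    k = max(1, int(top_k * survivors.size))
    return survivors[np.argsort(-dpm[survivors])[:k]]
```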

Fig. 4. Feature mapping technique of ML-ELM
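To complement Fig. 4 and Eq. 9, the following sketch (referenced in step 4) maps the input vectors into a higher-dimensional random feature space with L > n. The expansion factor 1.4 matches one of the extended-representation settings reported later, while the function name and the sigmoid choice of g(.) are assumptions for illustration.

```python
import numpy as np

def elm_feature_map(X, expansion=1.4, seed=0):
    """h(x) = [g(w_1.x + b_1), ..., g(w_L.x + b_L)] with L > n, as in Eq. 9."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]                          # number of input nodes n
    L = int(expansion * n)                  # extended representation: n < L
    W = rng.standard_normal((n, L))         # random, untrained weights w_i
    b = rng.standard_normal(L)              # random biases b_i
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid as g(.)
```

In the experiments below, classifiers are trained on such a mapped representation as well as on the original TF-IDF vectors.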

Table 1. Correlation matrix

4 Analysis of Experimental Results

The setup for the experimental study is detailed in this section, and the performance of state-of-the-art classification algorithms in the feature space of ML-ELM is examined thoroughly. Experiments were carried out on the feature space of ML-ELM by altering the number of hidden-layer nodes L of ML-ELM according to the three representations listed below, where n is the number of nodes in the input layer (the snippet after the list instantiates these settings).

  • compressed representation (\(n > L\)): \(L = 0.4n\) and \(L = 0.7n\)

  • extended representation (\(n < L\)): \(L = 1.2n\) and \(L = 1.4n\)

  • equal representation (\(n = L\)): \(L = 1.0n\)
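For concreteness, the hidden-layer sizes for the three representations can be enumerated as follows; n = 5000 is an arbitrary example dimensionality, not one of the dataset vocabularies.

```python
# Toy enumeration of the hidden-layer sizes for the three representations;
# n = 5000 is an arbitrary example input dimensionality.
n = 5000
hidden_layer_sizes = {
    "compressed (n > L)": [round(0.4 * n), round(0.7 * n)],   # 2000, 3500
    "equal (n = L)":      [round(1.0 * n)],                   # 5000
    "extended (n < L)":   [round(1.2 * n), round(1.4 * n)],   # 6000, 7000
}
```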

4.1 Experimental Setup

A Brief Description of the Datasets Utilized in the Experiment: To conduct the experiments, four benchmark datasets (WebKB, Classic4, 20-Newsgroups, and Reuters) are used; their details are shown in Table 2.

Table 2. Corpus statistics

Tuning Hyper-parameters: The proposed approach for the classification of text data is implemented in Python 3.7.3 on the Spyder IDE, running on a system with an Intel Core i11 processor, 32 GB RAM, and a 24 GB GPU. The GPU is used for the ANN, CNN, and RNN algorithms, and the CPU for Multi-layer ELM. For Multi-layer ELM, we use 3 hidden layers with 150 nodes in each layer, a Sigmoid activation function for the hidden layers, and Softmax for the output layer. The model is trained on a DGX workstation. Tables 3 and 4 show the parameters used for the machine learning and deep learning algorithms, respectively. All parameter values were fixed by repeating the experiments.
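For reference, the reported ML-ELM settings can be collected into a small configuration object; the dictionary below is a hypothetical structure of our own, not the API of an existing ML-ELM library.

```python
# Hypothetical configuration mirroring the reported ML-ELM setup;
# the key names are placeholders, not part of any existing library API.
mlelm_params = {
    "hidden_layers": [150, 150, 150],   # 3 hidden layers, 150 nodes each
    "hidden_activation": "sigmoid",     # activation for the hidden layers
    "output_activation": "softmax",     # activation for the output layer
    "device": "cpu",                    # ML-ELM is run on the CPU
}
```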

Table 3. Setting different parameters (machine learning)
Table 4. Setting different parameters (deep learning)

4.2 Discussion

Performance Evaluation of the CORFS Technique: The proposed CORFS technique is compared with different traditional feature selection techniques (Bi-Normal Separation (BNS), Mutual Information (MI), Chi-square, and Information Gain (IG)); the F-measures for the different datasets with the top 1%, 5%, and 10% of features are shown in Tables 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16, respectively, where bold indicates the maximum. The F-measure of the proposed CORFS approach is also compared with state-of-the-art approaches, as summarized in Table 17. The findings suggest that the proposed feature selection approach is equivalent to or better than the previous ones and can be used to classify text documents using the ML-ELM feature space.

Table 5. 20-NG (Top 1%)
Table 6. 20-NG (Top 5%)
Table 7. 20-NG (Top 10%)
Table 8. Classic4 (Top 1%)
Table 9. Classic4 (Top 5%)
Table 10. Classic4 (Top 10% )
Table 11. Reuters (Top 1%)
Table 12. Reuters (Top 5% )
Table 13. Reuters (Top 10%)
Table 14. DMOZ (Top 1% )
Table 15. DMOZ (Top 5% )
Table 16. DMOZ (Top 10%)

Performance Comparisons of Multi-layer ELM: It is worth noting that ML-ELM outperforms the other machine learning classifiers for most feature selection strategies across the various datasets, as shown in Table 18. Figures 5 and 6 show F-measure and accuracy comparisons of Multi-layer ELM with various deep learning techniques using the CORFS approach. The results indicate the effectiveness of ML-ELM over both machine and deep learning classifiers.

Reasons for Better Performance of Multi-layer ELM over Other Classifiers:

The following points highlight the basic reasons behind the superiority of ML-ELM.

  i. In ML-ELM, there is no need to fine-tune the hidden-node settings and other parameters, and no backpropagation is required. This saves training time, and the learning speed is exceedingly fast throughout the classification phase.

  ii. ML-ELM is less expensive than other deep learning architectures because it does not require a GPU to run, and it maintains excellent performance as the dataset size grows.

  iii. ML-ELM can map a huge volume of data into the extended space and separate it linearly there, thanks to its universal approximation and classification capabilities.

  iv. The training in ML-ELM is mostly unsupervised, except at the last level, where it is supervised.

  v. Multiple hidden layers provide a high-level abstraction of the data, and each layer learns new forms of the input, making ML-ELM more efficient.

Performance Evaluation of Classification Algorithms: For practical reasons, six distinct classification approaches are run on HDFS-MLELM and on VS-TFIDF, employing the four datasets individually. The obtained accuracies and F-measures are shown in Figs. 7, 8, 9 and 10 and Figs. 11, 12, 13 and 14, respectively; a rough sketch of this protocol is given below.
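As an illustration of this protocol (not the exact experimental code), the sketch below evaluates the same linear SVM on the TF-IDF vectors and on an ELM-style mapped version of them; scikit-learn is assumed, elm_feature_map refers to the earlier sketch, and the split ratio is arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def compare_spaces(docs, labels):
    """Train/evaluate one classifier on VS-TFIDF and on the mapped space."""
    X_tfidf = TfidfVectorizer().fit_transform(docs).toarray()   # VS-TFIDF
    X_mapped = elm_feature_map(X_tfidf)                         # ELM-style extended space
    scores = {}
    for name, X in [("VS-TFIDF", X_tfidf), ("mapped", X_mapped)]:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.3, random_state=42)
        clf = LinearSVC().fit(X_tr, y_tr)
        scores[name] = f1_score(y_te, clf.predict(X_te), average="macro")
    return scores
```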

Table 17. Performance of Feature selection algorithms (bold indicates maximum)
Table 18. Comparing ML-ELM with machine learning classifiers using CORFS

The following conclusions are drawn from the findings:

  i. Compared to VS-TFIDF, the empirical results in all three feature-space representations of Multi-layer ELM are superior.

  ii. Linear SVM outperforms the other supervised learning algorithms, owing to its convex optimization property [35] and generalization property [36], both of which are independent of the feature-space dimension.

  iii. F-measure and accuracy are better in HDFS-MLELM, whereas the results are close on the equal-dimensional space.

The performance of the proposed approach is compared with the state-of-the-art classification approaches, and the results are shown in Table 19, where bold indicates the maximum accuracy.

Table 19. Performance of text classification algorithms (bold indicates maximum)

4.3 Comparisons of ELM and ML-ELM Feature Space

Traditional classifiers are run on HDFS-MLELM and on the ELM feature space. Figures 15, 16, 17 and 18 compare the performances of different classifiers on the higher-dimensional feature space (\(L = 1.4n\)) of ML-ELM and of ELM. The results indicate that the classifiers perform better in the ML-ELM feature space than in the ELM feature space, owing to the multilayer processing of ML-ELM compared with the single layer of ELM. SVM shows the best performance among the classifiers on both feature spaces.

Fig. 5. F1-measure

Fig. 6. Accuracy

Fig. 7. 20-NG (Accuracy)

Fig. 8. Classic4 (Accuracy)

Fig. 9. Reuters (Accuracy)

Fig. 10. DMOZ (Accuracy)

Fig. 11. 20-NG (F1-measure)

Fig. 12. Classic4 (F1-measure)

Fig. 13. Reuters (F1-measure)

Fig. 14. DMOZ (F1-measure)

Fig. 15. F1-measure comparisons on ML-ELM and ELM feature space (20-NG)

Fig. 16. F1-measure comparisons on ML-ELM and ELM feature space (Classic4)

Fig. 17. F1-measure comparisons on ML-ELM and ELM feature space (Reuters)

Fig. 18. F1-measure comparisons on ML-ELM and ELM feature space (DMOZ)

5 Conclusion

The proposed approach investigates the significance of the Multi-layer ELM feature space in depth. Initially, the corpus is subjected to a novel feature selection technique (CORFS), which removes superfluous features from the corpus and improves classification performance. An extensive empirical study on several benchmark datasets has demonstrated the efficiency of the proposed technique on HDFS-MLELM compared to VS-TFIDF. According to the empirical investigations, SVM outperforms the other classifiers on both feature spaces for all datasets. After a thorough examination of the experimental results, it has been determined that the Multi-layer ELM feature space

  • is able to solve the three major problems faced by the current machine/deep learning techniques as highlighted in Sect. 1.

  • can replace the costly kernel techniques.

  • is more suitable and useful for text classification than the TF-IDF vector space.

This work can be extended on the following lines:

  i. Deep learning methods such as CNN, RNN, and ANN need a vast amount of data and many tuned parameters to train the network. As future work, combining these deep learning architectures with ML-ELM may reduce the number of parameters to tune without compromising performance.

  ii. More applications of ML-ELM can be studied to verify its generalization capability on huge, noisy datasets.

  iii. The variance of the hidden-layer weights is still under investigation to fully comprehend ML-ELM’s operation.