1 Introduction

Text mining can be understood as data mining on textual documents. Typical text mining tasks are text classification, clustering, retrieval, etc. Most of the earlier works used traditional machine learning techniques for text classification, such as support vector machines, naive Bayes, logistic regression, maximum entropy, and decision trees. However, these methods cannot automatically capture discriminative features from the training data; their performance depends heavily on hand-crafted data representations, which are labor-intensive to build. More recently, the text classification literature has been dominated by deep learning techniques, motivated by the outstanding results of deep neural networks in text mining, image processing, and natural language processing [1, 2]. Deep networks, however, also have limitations: they need large memory bandwidth, training via backpropagation is time-consuming, the architectures are complex, and preserving interdependencies among the internal layers over long periods is difficult. Hence it is not easy to generalize text classification models to a new domain. An efficient deep learning classifier called Multi-layer ELM was introduced in 2013 by Kasun et al. [3] to address these problems.

1.1 Research Motivation

Overall, the existing machine learning and deep learning classification techniques have the following limitations:

  i. Representing the data becomes more complex as the growing dataset size demands a large storage space.

  ii. When the input data grow exponentially in a limited-dimensional space, distinguishing the input features in the TF-IDF vector space becomes challenging.

  iii. The performance of machine and deep learning classifiers depends heavily on data representation, which is labor-intensive.

Feature selection, which selects an optimal subset of features from a massive dataset, can alleviate the curse of dimensionality but cannot separate the features in a lower-dimensional space when the data grow dynamically [4, 5]. Kernel approaches [6, 7] are commonly used in classification to deal with this challenge and have yielded good results in the past; a detailed survey of kernel and spectral methods for classification is given by Filippone et al. [8]. Although kernel methods can handle data that are not separable in a low-dimensional space by projecting them into a higher-dimensional space, they are expensive (i.e., time-consuming) because the structural similarity among input features is computed through dot products. Using the feature mapping technique of ELM, Huang et al. [9] showed that mapping the input vector non-linearly into a high-dimensional feature space makes the features simple and linearly separable, and can thus outperform kernel approaches [10]. ELM, however, is a single-layer architecture and therefore requires a very wide network, which is difficult to design so that it matches heavily changing input data.

In this vein, this research investigates the feature space of Multi-layer ELM (ML-ELM) [11], which extensively exploits the advantages of ELM feature mapping [12, 13] and the ELM autoencoder, to address the constraints mentioned above. The goal of this study is to investigate the extended, high-dimensional feature space of ML-ELM (HDFS-MLELM) and to thoroughly evaluate it for text classification in comparison with the TF-IDF vector space (VS-TFIDF).

1.2 Research Contribution

The major contributions of the paper can be summarized as follows:

  • This work studies HDFS-MLELM and uses text data to thoroughly investigate multiple classification algorithms on HDFS-MLELM and on VS-TFIDF.

  • The past literature shows that no research on text classification has been carried out on the Multi-layer ELM’s enlarged feature space. In light of the benefits mentioned above, this study can therefore be considered a new direction in the text classification domain.

  • A novel feature selection technique, termed Correlation-based Feature Selection (CORFS), is proposed for selecting the essential features from a big corpus.

  • To demonstrate its usefulness, the performance of Multi-layer ELM employing the proposed CORFS technique is compared with several machine and deep learning classifiers.

  • Text classification results of various traditional classifiers, obtained by running them on the ELM feature space and on HDFS-MLELM, are compared in order to show the effectiveness of HDFS-MLELM.

  • The experimental results of the proposed approach are compared with the state-of-the-art approaches.

The rest of the paper is organized as follows: Sect. 2 introduces the preliminaries of Multi-layer ELM and its feature mapping technique. The proposed methodology is discussed in Sect. 3. Section 4 presents the experimental work and analyzes the results. The paper is concluded in Sect. 5.

2 Preliminaries

2.1 Multi-layer ELM

As demonstrated in Fig. 3, Multi-layer ELM (ML-ELM) is a hybrid of ELM (shown in Fig. 1) and the ELM autoencoder (shown in Fig. 2) with more than one hidden layer; it is described in the following steps.

Fig. 1. Overview of ELM

Fig. 2. Overview of ELM-autoencoder

Fig. 3. Overview of Multi-layer ELM

  • Unsupervised training occurs between the hidden layers using the ELM autoencoder [14]. Unlike other deep networks, ML-ELM does not require fine-tuning, since the autoencoding capability of ELM is an excellent match for ML-ELM [15].

  • ELM autoencoders are stacked progressively to create a multi-layer neural network architecture: the output of one trained ELM autoencoder is fed as input to the next, and so on.

  • The first ELM autoencoder level learns a basic representation of the input data; each subsequent level refines the previous level’s output into a better representation, and so on. Equation 1 relates the \(i^{th}\) layer to the \((i-1)^{th}\) layer (a minimal sketch of this stacking is given after this list).

    $$\begin{aligned} H_i= g((\beta _i)^T H_{i-1}) \end{aligned}$$
    (1)

    where \(H_{i-1}\) and \(H_i\) are the input and output matrices of the \(i^{th}\) hidden layer, respectively. g(.) is the activation function, and \(\beta \) is the learning parameter. The input layer is \(H_0\), and the first hidden layer is \(H_1\). Regularized least squares is used to get the output weight \(\beta \) [16].

  • Finally, supervised learning is utilized to fine-tune the network (ELM is used for this purpose).
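To make the layer-wise training concrete, here is a minimal NumPy sketch of stacking ELM autoencoders as in Eq. 1. The function names, the default layer sizes, and the ridge constant C are illustrative assumptions, not the exact configuration used in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_autoencoder_beta(H_prev, n_hidden, C=1.0, seed=0):
    """Train one ELM autoencoder: random hidden mapping, then a regularized
    least-squares solve whose output weights beta reconstruct the layer input."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = H_prev.shape
    W = rng.standard_normal((n_features, n_hidden))  # random input weights (not trained)
    b = rng.standard_normal(n_hidden)                # random biases (not trained)
    H = sigmoid(H_prev @ W + b)                      # random feature mapping of the input
    # beta minimizes ||H beta - H_prev||^2 + (1/C)||beta||^2 (regularized least squares)
    beta = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ H_prev)
    return beta                                      # shape: (n_hidden, n_features)

def mlelm_hidden_representation(X, hidden_sizes=(150, 150, 150)):
    """Stack ELM autoencoders layer by layer: H_i = g(H_{i-1} beta_i^T),
    the row-sample form of Eq. 1."""
    H = X
    for i, L in enumerate(hidden_sizes):
        beta = elm_autoencoder_beta(H, L, seed=i)
        H = sigmoid(H @ beta.T)                      # next hidden-layer output H_i
    return H                                         # representation fed to the final ELM
```

The final supervised ELM output layer is omitted here for brevity.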

3 Methodology

  1. Documents Pre-processing:

    Let corpus P consist of C classes. At the beginning of feature engineering, all documents of each class are combined into a single set called \(D_{large}\). Then lexical analysis, stop-word removal, HTML tag removal, and stemming are performed on \(D_{large}\), and the Natural Language Toolkit (NLTK) is used to extract the index terms from \(D_{large}\). After this basic data cleaning, the first set of features is derived from \(D_{large}\) and a term-document matrix is created.

  2. Correlation Based Feature Selection (CORFS):

    Using the k-means clustering algorithm [17], \(D_{large}\) is divided into n term-document clusters \(td_i, i \in [1, n]\). The following steps describe how the important features are extracted from each cluster \(td_i\) (a compact sketch of the whole CORFS procedure is given at the end of this section, after step 5).

    i. Calculating Centroid:

      First, the centroid of \(td_i\) is calculated using Eq. 2.

      $$\begin{aligned} sc_i = \frac{\sum \limits _{j=1}^{r}t_{j}}{r} \end{aligned}$$
      (2)

      where \(t_j\) denotes the \(j^{th}\) term vector of \(td_i\) and r is the number of terms in \(td_i\). Then the cosine-similarity between each \(t_j \in td_i\) and \(sc_i\) is computed.

    ii. Generating correlation matrix:

      Equation 3 is used to find the correlation (cr) between a pair of terms \(t_i\) and \(t_j\); the resulting correlation matrix is shown in Table 1.

      $$\begin{aligned} cr_{t_it_j} = \frac{C_{t_it_j}}{\sqrt{(V_{t_i} * V_{t_j})}} \end{aligned}$$
      (3)

      where \(C_{t_it_j}\) is the covariance (the joint variability between two terms) of \(t_i\) and \(t_j\), and \(V_{t_i}\) and \(V_{t_j}\) are their respective variances, as defined below.

      $$ V_{t_i} = \frac{1}{b-1}\sum _{m=1}^{b}(X_{im} - \overline{X_i})^2 $$
      $$ V_{t_j} = \frac{1}{b-1}\sum _{m=1}^{b}(X_{jm} - \overline{X_j})^2 $$

      where \(\overline{X_i}\) and \(\overline{X_j}\) represent the means of \(t_i\) and \(t_j\) over the b documents, respectively. The covariance between \(t_i\) and \(t_j\) is computed using Eq. 4.

      $$\begin{aligned} C_{t_it_j} = \frac{1}{b-1}\sum _{m=1}^{b}(X_{im} - \overline{X_i})(X_{jm} - \overline{X_j}) \end{aligned}$$
      (4)
    iii. Rejection of highly correlated terms from \(td_i\):

      Terms that are highly correlated within a cluster are generally a sort of synonym of one another and hence do not discriminate well within the cluster; they should therefore be removed. To find such terms in \(td_i\), the term with the maximum cosine-similarity score in \(td_i\) is selected first. Then the set of terms that are highly correlated with it \((cr \le -0.87 \text{ or } cr \ge 0.89)\) is identified and removed from \(td_i\). This step is repeated with the term having the next-highest cosine-similarity score, and so on, until \(td_i\) is exhausted. In this way, all highly correlated terms are removed from \(td_i\).

    iv. Computing Discriminating Power Measure (DPM):

      DPM [18] measures the relevance, i.e., the importance, of a term within a cluster. If the DPM score of a term inside an unbiased cluster is very high, that term is important for the cluster: many documents of the cluster contain it, and its cohesion (tightness) to the cluster’s center is high.

      • For each \(t_i \in td_i\), the document frequencies inside (\(DF_{in, t_i}\)) and outside (\(DF_{out, t_i}\)) of \(td_i\) are calculated using Eqs. 5 and 6, respectively.

        $$\begin{aligned} DF_{in, t_i} = \frac{no.\; of\; documents\; \in td_i\; and\; have\; t_i}{no.\; of\; documents\; \in td_i} \end{aligned}$$
        (5)
        $$\begin{aligned} DF_{out, t_i} = \frac{no.\; of\; documents\; have \;t_i \;and\; \notin td_i}{no.\; of\; documents\; \notin td_i} \end{aligned}$$
        (6)
      • The difference between inside and outside document frequency of \(t_i \in td_i \) is computed using Eq. 7.

        $$\begin{aligned} DIFF_{td_i, t_i} = | DF_{{in}, t_i} - DF_{{out}, t_i} | \end{aligned}$$
        (7)
      • Equation 8 computes the DPM score of each term \(t_i\) by summing its DIFF values over all n clusters.

        $$\begin{aligned} DPM(t_i) = \sum _{j=1}^{n}DIFF_{td_j, t_i} \end{aligned}$$
        (8)
    v. Selection of candidate terms having high DPM scores:

      The terms of each term-document cluster are arranged according to their DPM scores, and the top \(k\%\) terms are selected as candidate terms. This step is repeated for each \(td_i\), so that every \(td_i\) retains its top \(k\%\) candidate terms.

  3. Input feature vector generation:

    To build the input feature vector, the top \(k\%\) features of every \(td_i\) are merged into a single list \(L_{list}\).

  4. Feature mapping of Multi-layer ELM:

    i. Multi-layer ELM heavily employs the universal classification [19, 20] and approximation [21, 22] capabilities of ELM.

    ii. ML-ELM cleverly leverages the extended representation (i.e., \(n < L\)) technique of the ELM autoencoder [12, 23], where n and L are the numbers of input and hidden-layer nodes, respectively.

    iii. The features are transferred from a low-dimensional feature space to a higher-dimensional feature space using Eq. 9. The mapping of the input vector into HDFS-MLELM is shown in Fig. 4, where \(h_i({\textbf {x}}) = g(w_i \cdot {\textbf {x}} + b_i)\).

      $$\begin{aligned} h({\textbf {x}}) = \left[ \begin{array}{c} h_1({\textbf {x}})\\ h_2({\textbf {x}})\\ h_3({\textbf {x}})\\ .\\ .\\ . \\ h_L({\textbf {x}}) \end{array} \right] ^T = \left[ \begin{array}{c} g(w_1, b_1, {\textbf {x}}) \\ g(w_2, b_2, {\textbf {x}})\\ g(w_3, b_3, {\textbf {x}}) \\ .\\ .\\ .\\ g(w_L, b_L, {\textbf {x}})\end{array} \right] ^T \end{aligned}$$
      (9)

      \(h({\textbf {x}}) = {[h_1({\textbf {x}}), h_2({\textbf {x}}),\cdots , h_i({\textbf {x}}), \cdots , h_L({\textbf {x}})]}^T \) transfers the input features to HDFS-MLELM [24, 25].

    iv. \(L_{list}\) is mapped into HDFS-MLELM using Eq. 9. Before the transformation, L is set to a value larger than n, which makes all the features of \(L_{list}\) linearly separable (a minimal sketch of this mapping is given after Fig. 4).

  5. Classification on HDFS-MLELM:

    Different supervised learning algorithms, employing \(L_{list}\) as the input feature vector, are run individually on VS-TFIDF and on HDFS-MLELM.
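As referenced in step 2, the CORFS steps above can be summarized in the following minimal NumPy sketch. It assumes each cluster is available as a documents-by-terms matrix; the function name corfs_select, the top_k default, and the single-cluster treatment of the DPM sum are illustrative choices, while the correlation cut-offs are the ones quoted in the text.

```python
import numpy as np

def corfs_select(X_in, X_out, top_k=0.10, low=-0.87, high=0.89):
    """Sketch of CORFS for a single term-document cluster td_i.
    X_in : documents-by-terms matrix of the documents in td_i.
    X_out: documents-by-terms matrix of the documents outside td_i
           (same term columns).  Returns the column indices of selected terms."""
    # i. centroid of the term vectors (Eq. 2) and cosine similarity of each term to it
    centroid = X_in.mean(axis=1)                       # average term vector, length = #docs
    denom = np.linalg.norm(X_in, axis=0) * np.linalg.norm(centroid) + 1e-12
    cos_sim = (X_in.T @ centroid) / denom

    # ii. term-term Pearson correlation matrix (Eqs. 3 and 4)
    corr = np.corrcoef(X_in, rowvar=False)

    # iii. greedy rejection: starting from the term with the highest cosine
    #      similarity, drop every term whose correlation with it falls in the
    #      "highly correlated" range quoted in the text
    alive = np.ones(X_in.shape[1], dtype=bool)
    for t in np.argsort(-cos_sim):
        if not alive[t]:
            continue
        rejected = (corr[t] <= low) | (corr[t] >= high)
        rejected[t] = False                            # keep the anchor term itself
        alive &= ~rejected

    # iv. per-cluster DPM contribution |DF_in - DF_out| (Eqs. 5-7); the full DPM
    #     additionally sums this quantity over all clusters (Eq. 8)
    df_in = (X_in > 0).mean(axis=0)
    df_out = (X_out > 0).mean(axis=0)
    dpm = np.abs(df_in - df_out)

    # v. keep the top k% of the surviving terms by DPM score
    survivors = np.flatnonzero(alive)
    k = max(1, int(top_k * survivors.size))
    return survivors[np.argsort(-dpm[survivors])[:k]]
```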

Fig. 4. Feature mapping technique of ML-ELM
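To complement Fig. 4 and Eq. 9, the following sketch (referenced in step 4) maps the input vectors into a higher-dimensional random feature space with L > n. The expansion factor 1.4 matches one of the extended-representation settings reported later, while the function name and the sigmoid choice of g(.) are assumptions for illustration.

```python
import numpy as np

def elm_feature_map(X, expansion=1.4, seed=0):
    """h(x) = [g(w_1.x + b_1), ..., g(w_L.x + b_L)] with L > n, as in Eq. 9."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]                          # number of input nodes n
    L = int(expansion * n)                  # extended representation: n < L
    W = rng.standard_normal((n, L))         # random, untrained weights w_i
    b = rng.standard_normal(L)              # random biases b_i
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid as g(.)
```

In the experiments below, classifiers are trained on such a mapped representation as well as on the original TF-IDF vectors.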

Table 1. Correlation matrix

4 Analysis of Experimental Results

The setup for the experimental study is detailed in this section, and the performance of state-of-the-art classification algorithms in the feature space of ML-ELM is examined thoroughly. Experiments were carried out on the feature space of ML-ELM by altering the number of hidden-layer nodes L of ML-ELM according to the three representations listed below, where n is the number of nodes in the input layer (the snippet after the list instantiates these settings).

  • compressed representation (\(n > L\)): \(L = 0.4n\) and \(L = 0.7n\)

  • extended representation (\(n < L\)): \(L = 1.2n\) and \(L = 1.4n\)

  • equal representation (\(n = L\)): \(L = 1.0n\)
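For concreteness, the hidden-layer sizes for the three representations can be enumerated as follows; n = 5000 is an arbitrary example dimensionality, not one of the dataset vocabularies.

```python
# Toy enumeration of the hidden-layer sizes for the three representations;
# n = 5000 is an arbitrary example input dimensionality.
n = 5000
hidden_layer_sizes = {
    "compressed (n > L)": [round(0.4 * n), round(0.7 * n)],   # 2000, 3500
    "equal (n = L)":      [round(1.0 * n)],                   # 5000
    "extended (n < L)":   [round(1.2 * n), round(1.4 * n)],   # 6000, 7000
}
```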

4.1 Experimental Setup

A Brief Description of the Datasets Utilized in the Experiment: To conduct the experiments, four benchmark datasets (WebKB, Classic4, 20-Newsgroups, and Reuters) are used; their details are shown in Table 2.

Table 2. Corpus statistics

Tuning Hyper-parameters: The proposed approach for the classification of text data is implemented in Python 3.7.3 on the Spyder IDE, running on a system with an Intel Core i11 processor, 32 GB RAM, and a 24 GB GPU. The GPU is used for the ANN, CNN, and RNN algorithms, and the CPU for Multi-layer ELM. For Multi-layer ELM, we use 3 hidden layers with 150 nodes in each layer, a Sigmoid activation function for the hidden layers, and Softmax for the output layer. The model is trained on a DGX workstation. Tables 3 and 4 show the parameters used for the machine learning and deep learning algorithms, respectively. All parameter values were fixed by repeating the experiments.
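For reference, the reported ML-ELM settings can be collected into a small configuration object; the dictionary below is a hypothetical structure of our own, not the API of an existing ML-ELM library.

```python
# Hypothetical configuration mirroring the reported ML-ELM setup;
# the key names are placeholders, not part of any existing library API.
mlelm_params = {
    "hidden_layers": [150, 150, 150],   # 3 hidden layers, 150 nodes each
    "hidden_activation": "sigmoid",     # activation for the hidden layers
    "output_activation": "softmax",     # activation for the output layer
    "device": "cpu",                    # ML-ELM is run on the CPU
}
```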

Table 3. Setting different parameters (machine learning)
Table 4. Setting different parameters (deep learning)

4.2 Discussion

Performance Evaluation of the CORFS Technique: The proposed CORFS technique is compared with different traditional feature selection techniques (Bi-Normal Separation (BNS), Mutual Information (MI), Chi-square, and Information Gain (IG)); the F-measures for the different datasets with the top 1%, 5%, and 10% of features are shown in Tables 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16, respectively, where bold indicates the maximum. The F-measure of the proposed CORFS approach is also compared with state-of-the-art approaches, as summarized in Table 17. The findings suggest that the proposed feature selection approach is equivalent to or better than the previous ones and can be used to classify text documents using the ML-ELM feature space.

Table 5. 20-NG (Top 1%)
Table 6. 20-NG (Top 5%)
Table 7. 20-NG (Top 10%)
Table 8. Classic4 (Top 1%)
Table 9. Classic4 (Top 5%)
Table 10. Classic4 (Top 10% )
Table 11. Reuters (Top 1%)
Table 12. Reuters (Top 5% )
Table 13. Reuters (Top 10%)
Table 14. DMOZ (Top 1% )
Table 15. DMOZ (Top 5% )
Table 16. DMOZ (Top 10%)

Performance Comparisons of Multi-layer ELM: It is worth noting that ML-ELM outperforms the other machine learning classifiers for most feature selection strategies across the various datasets, as shown in Table 18. Figures 5 and 6 show F-measure and accuracy comparisons of Multi-layer ELM with various deep learning techniques using the CORFS approach. The results indicate the effectiveness of ML-ELM over both machine and deep learning classifiers.

Reasons for Better Performance of Multi-layer ELM over Other Classifiers:

The following points highlight the basic reasons behind the superiority of ML-ELM.

  i. In ML-ELM, there is no need to fine-tune the hidden-node settings and other parameters, and no backpropagation is required. This saves training time, and the learning speed is exceedingly fast throughout the classification phase.

  ii. ML-ELM is less expensive than other deep learning architectures because it does not require a GPU to run, and it maintains excellent performance as the dataset size grows.

  iii. ML-ELM can map a huge volume of data into the extended space and separate it linearly there, thanks to its universal approximation and classification capabilities.

  iv. The training in ML-ELM is mostly unsupervised, except at the last level, where it is supervised.

  v. Multiple hidden layers provide a high-level abstraction of the data, and each layer learns new forms of the input, making ML-ELM more efficient.

Performance Evaluation of Classification Algorithms: For practical reasons, six distinct classification approaches are run on HDFS-MLELM and on VS-TFIDF, employing the four datasets individually. The obtained accuracies and F-measures are shown in Figs. 7, 8, 9 and 10 and Figs. 11, 12, 13 and 14, respectively; a rough sketch of this protocol is given below.
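As an illustration of this protocol (not the exact experimental code), the sketch below evaluates the same linear SVM on the TF-IDF vectors and on an ELM-style mapped version of them; scikit-learn is assumed, elm_feature_map refers to the earlier sketch, and the split ratio is arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def compare_spaces(docs, labels):
    """Train/evaluate one classifier on VS-TFIDF and on the mapped space."""
    X_tfidf = TfidfVectorizer().fit_transform(docs).toarray()   # VS-TFIDF
    X_mapped = elm_feature_map(X_tfidf)                         # ELM-style extended space
    scores = {}
    for name, X in [("VS-TFIDF", X_tfidf), ("mapped", X_mapped)]:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.3, random_state=42)
        clf = LinearSVC().fit(X_tr, y_tr)
        scores[name] = f1_score(y_te, clf.predict(X_te), average="macro")
    return scores
```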

Table 17. Performance of Feature selection algorithms (bold indicates maximum)
Table 18. Comparing ML-ELM with machine learning classifiers using CORFS

The following conclusions are drawn from the findings:

  i. Compared to VS-TFIDF, the empirical results in all three feature-space representations of Multi-layer ELM are superior.

  ii. Linear SVM outperforms the other supervised learning algorithms, owing to its convex optimization property [35] and generalization property [36], both of which are independent of the feature-space dimension.

  iii. F-measure and accuracy are better in HDFS-MLELM, whereas the results are close on the equal-dimensional space.

The performance of the proposed approach is compared with the state-of-the-art classification approaches, and the results are shown in Table 19, where bold indicates the maximum accuracy.

Table 19. Performance of text classification algorithms (bold indicates maximum)

4.3 Comparisons of ELM and ML-ELM Feature Space

Traditional classifiers are run on HDFS-MLELM and on the ELM feature space. Figures 15, 16, 17 and 18 compare the performances of different classifiers on the higher-dimensional feature space (\(L = 1.4n\)) of ML-ELM and of ELM. The results indicate that the classifiers perform better in the ML-ELM feature space than in the ELM feature space, owing to the multilayer processing of ML-ELM compared with the single layer of ELM. SVM shows the best performance among the classifiers on both feature spaces.

Fig. 5. F1-measure

Fig. 6. Accuracy

Fig. 7. 20-NG (Accuracy)

Fig. 8. Classic4 (Accuracy)

Fig. 9. Reuters (Accuracy)

Fig. 10. DMOZ (Accuracy)

Fig. 11. 20-NG (F1-measure)

Fig. 12. Classic4 (F1-measure)

Fig. 13. Reuters (F1-measure)

Fig. 14. DMOZ (F1-measure)

Fig. 15. F1-measure comparisons on ML-ELM and ELM feature space (20-NG)

Fig. 16. F1-measure comparisons on ML-ELM and ELM feature space (Classic4)

Fig. 17. F1-measure comparisons on ML-ELM and ELM feature space (Reuters)

Fig. 18. F1-measure comparisons on ML-ELM and ELM feature space (DMOZ)

5 Conclusion

The proposed approach investigates the significance of the Multi-layer ELM feature space in depth. Initially, the corpus is subjected to a novel feature selection technique (CORFS), which removes superfluous features from the corpus and improves classification performance. An extensive empirical study on several benchmark datasets has demonstrated the efficiency of the proposed technique on HDFS-MLELM compared to VS-TFIDF. According to the empirical investigations, SVM outperforms the other classifiers on both feature spaces for all datasets. After a thorough examination of the experimental results, it has been determined that the Multi-layer ELM feature space

  • is able to solve the three major problems faced by the current machine/deep learning techniques as highlighted in Sect. 1.

  • can replace the costly kernel techniques.

  • is more suitable and useful for text classification than the TF-IDF vector space.

This work can be extended on the following lines:

  i. Deep learning methods such as CNN, RNN, and ANN need a vast amount of data and many tuned parameters to train the network. As future work, combining these deep learning architectures with ML-ELM may reduce the number of parameters to tune without compromising performance.

  ii. More applications of ML-ELM can be studied to verify its generalization capability on huge, noisy datasets.

  iii. The variance of the hidden-layer weights is still under investigation to fully comprehend ML-ELM’s operation.