1 Introduction

The volume of digital documentation is growing at a rapid pace, which makes the classification of digital documents increasingly important. The main aim of digital document classification is to categorize documents into predefined classes. It is an active research area in information retrieval [1] and machine learning on digital text documents. Many supervised algorithms are employed to classify digital text documents, such as support vector machines [2], Naïve Bayes [3], decision trees [4], and nearest neighbors [5].

Text categorization [6] of digital documents has two phases: a training phase and a classification (testing) phase. Earlier, subject indexing and feature extraction methods [7] were used for text categorization, but these methods were not very successful. Such methods are based on term frequency and inverse document frequency; they count the occurrences of a term but do not consider its position. Therefore, they were not efficient at determining the class of text data, even though the position of a term is highly relevant for identifying a document.

The remainder of this paper is organized as follows: Sect. 2 discusses the related work. Section 3 describes the material and methodology used in this work. Section 4 presents the experimental results and discussion. Finally, Sect. 5 concludes the study.

2 Related Work

Earlier, text classification was done manually, which was not efficient. Later, many classification schemes came into existence, such as subject indexing [8], term frequency [9], the Gini index [10], mutual information, and information gain [11]. A significant amount of research has since been done on automatic text categorization (ATC). Term frequency and subject indexing have also been used for classification, but these techniques rely on term redundancy [12] and subject indexes while missing the relevance of the term. The Gini index is a global feature selection method for text classification and an improved attribute selection algorithm. Currently, weighted feature selection [13] algorithms are used for automatic text categorization, since they are based on the mutual information [14, 15] of the terms in the dataset. Mutual information and maximum entropy classification [16] are the basic techniques used by researchers for machine learning and information retrieval from text documents.

3 Material and Methodology

3.1 Data Source

Four datasets have been taken from the text classification datasets of the Knowledge Extraction based on Evolutionary Learning (KEEL) repository. They contain preprocessed text documents from the Ohsumed test collection, which is a subset of the MEDLINE database. MEDLINE is a bibliographic database of important, peer-reviewed medical literature maintained by the National Library of Medicine. Each dataset of the Ohsumed test collection [17] contains 100 attributes, which is sufficient for testing the various feature selection algorithms. A brief description of the datasets used is given in Table 1.

Table 1 A brief description of the data sets used

3.2 Methodology

Before any classification, the dataset needs to be preprocessed. Since the dataset is very large and has an enormous number of attributes, the number of attributes must be reduced using a preprocessing step known as feature selection. Various feature selection algorithms are available, but we use only those implemented in the feature selection toolbox developed at UTIA of the Czech Academy of Sciences. The methodology works as shown in Fig. 1.

Fig. 1 Methodology of text classification via feature selection
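As an illustration of the pipeline in Fig. 1, the following minimal sketch pairs a mutual information-based feature selector with a classifier and evaluates it for 10 to 50 selected features. It uses scikit-learn's SelectKBest with mutual_info_classif as a stand-in for the toolbox algorithms, and the data loading is a random placeholder, so it is not the exact setup used in the experiments.

```python
# Minimal sketch of the Fig. 1 pipeline: feature selection followed by
# classification, evaluated for 10-50 selected features (illustrative only).
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 100))        # placeholder: 200 documents, 100 term features
y = rng.integers(0, 2, 200)       # placeholder: binary class labels

for k in range(10, 51, 10):       # 10, 20, 30, 40, 50 selected features
    model = Pipeline([
        ("select", SelectKBest(mutual_info_classif, k=k)),
        ("clf", SVC(kernel="linear")),
    ])
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{k} features: mean cross-validated accuracy = {acc:.3f}")
```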

Feature Selection Algorithms

CMIM. The conditional mutual information maximization (CMIM) [18] algorithm selects a subset of features from the dataset to minimize the number of features. The selected features carry the most relevant information about the data according to conditional mutual information, which also saves computational time.
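The sketch below illustrates a CMIM-style greedy selection on discretized features, assuming the commonly cited formulation in which a candidate is scored by the minimum of its class-conditional mutual information given each already selected feature; it is not the toolbox implementation used in the paper, and the helper names are ours.

```python
# CMIM-style greedy selection (sketch). X is assumed to hold discrete features,
# e.g. binary term-presence indicators, and y the class labels.
import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mi(x, y, z):
    """I(x; y | z) for discrete vectors, averaged over the values of z."""
    total = 0.0
    for v in np.unique(z):
        mask = (z == v)
        total += mask.mean() * mutual_info_score(x[mask], y[mask])
    return total

def cmim_select(X, y, n_select):
    """Greedily pick features maximizing min over selected s of I(f; y | s)."""
    selected = []
    for _ in range(n_select):
        best_f, best_score = None, -np.inf
        for f in range(X.shape[1]):
            if f in selected:
                continue
            if selected:
                score = min(cond_mi(X[:, f], y, X[:, s]) for s in selected)
            else:
                score = mutual_info_score(X[:, f], y)   # first pick: plain relevance
            if score > best_score:
                best_f, best_score = f, score
        selected.append(best_f)
    return selected
```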

mRMR. The minimum redundancy maximum relevance (mRMR) [18] algorithm selects features that are minimally redundant with one another and maximally relevant to the class. The two commonly used objective functions in mRMR are the mutual information difference criterion (MID) and the mutual information quotient criterion (MIQ).
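For illustration, the two mRMR objectives can be sketched as follows for a single candidate feature, assuming discretized features and using scikit-learn's mutual_info_score; the helper name mrmr_scores is ours, not part of the toolbox.

```python
# MID and MIQ scores of one candidate feature given the already selected columns.
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_scores(X, y, candidate, selected):
    """Return (MID, MIQ) for `candidate` given the `selected` column indices."""
    relevance = mutual_info_score(X[:, candidate], y)
    redundancy = np.mean([mutual_info_score(X[:, candidate], X[:, s])
                          for s in selected]) if selected else 0.0
    mid = relevance - redundancy              # mutual information difference
    miq = relevance / (redundancy + 1e-12)    # mutual information quotient
    return mid, miq
```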

JMI. The joint mutual information (JMI) [19] criterion uses information theory, computing the mutual information and entropy of random variables jointly for feature selection. Mutual information is defined in Eq. (1).

$$ I(x, y) = H(x) - H(x \mid y) $$
(1)

where I is mutual information and H is entropy.
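As a small numeric check of Eq. (1), the following toy example (illustrative data only) computes H(x) and H(x|y) from the empirical joint distribution of two discrete variables and confirms that their difference matches the mutual information reported by scikit-learn.

```python
# Verify I(x, y) = H(x) - H(x | y) on toy discrete data.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
x = (y + (rng.random(1000) < 0.2)) % 2          # x is a noisy copy of y

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))               # entropy in nats

joint = np.histogram2d(x, y, bins=[2, 2])[0] / len(x)   # empirical p(x, y)
px, py = joint.sum(axis=1), joint.sum(axis=0)
h_x = entropy(px)
h_x_given_y = entropy(joint.ravel()) - entropy(py)       # H(x|y) = H(x, y) - H(y)
print(h_x - h_x_given_y, mutual_info_score(x, y))        # the two values agree
```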

Condred. The conditional redundancy (Condred) [20] feature selection method overcomes the race condition that arises from the redundancy of terms that are unrelated to the classification of a text document into predefined classes and to its statistical properties.

MIFS. The mutual information feature selection (MIFS) [21] algorithm is based entirely on the mutual information computed for each term of the dataset. For classifying a text document, the mutual information of a term is a more reliable signal than its raw frequency. MIFS gives more precise results, but it is somewhat slower since it calculates a weight for each term and then the weighted frequency.

ICAP. In the interaction capping (ICAP) [22] feature selection algorithm, features are ranked by the interaction of their term with other terms, with the interaction information capped.

DISR. The double-input symmetrical relevance (DISR) [23] feature selection algorithm exploits two main properties of variable complementarity, whereby a combination of features can give a different result than the features taken individually. When there is no information about the relations among the variables in the dataset, the most promising subsets are of size d − 1.

CIFE. The conditional infomax feature extraction (CIFE) [24] algorithm is based on information theory. This feature selection method systematically studies the structure of the document. It improves the performance on joint class-relevant information by reducing the class redundancy of the dataset [8].

BetaGamma. BetaGamma [25] is a conditional mutual information-based feature selection algorithm. In this algorithm, beta and gamma are two parameters that weight the redundancy and conditional redundancy terms of a feature's score. By default, the values of β (beta) and γ (gamma) are zero.
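For reference, the criterion commonly associated with this beta-gamma family scores a candidate feature $x_k$ against the already selected set $S$ as shown below; this is a hedged restatement of the usual formulation, and the exact form used in [25] may differ slightly.

$$ J(x_k) = I(x_k; y) - \beta \sum_{x_j \in S} I(x_k; x_j) + \gamma \sum_{x_j \in S} I(x_k; x_j \mid y) $$

With β = γ = 0 the score reduces to plain mutual information ranking, while particular non-zero settings recover criteria such as MIFS, mRMR, and CIFE.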

Classification Algorithms

Support Vector Machine. SVM is a supervised learning technique that separates classes with a hyperplane, using the feature values of the N training instances for classification.

Decision Tree. A decision tree is a tree-structured predictive model used for classification. In this supervised algorithm, the dataset is recursively broken down into subsets, and each subset is associated with a node of the tree, which contains decision, intermediate, and leaf nodes.

K-nearest neighbors. In this method, the prediction function is approximated locally, and the Euclidean distance [25], the Chebyshev norm, or the Mahalanobis distance is used for the distance computation.
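A minimal scikit-learn sketch of k-nearest neighbors with the three distance measures mentioned above is given below; the data is a random placeholder rather than the Ohsumed features.

```python
# k-NN with Euclidean, Chebyshev, and Mahalanobis distances (illustrative data).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = rng.integers(0, 2, 200)

for metric, kwargs in [
    ("euclidean", {}),
    ("chebyshev", {}),
    # Mahalanobis needs the inverse covariance matrix and brute-force search.
    ("mahalanobis", {"metric_params": {"VI": np.linalg.inv(np.cov(X.T))}}),
]:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric,
                               algorithm="brute", **kwargs)
    knn.fit(X, y)
    print(metric, knn.score(X, y))
```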

Gaussian Naïve Bayes. This algorithm is a probabilistic classifier that relies on the well-known Bayes' theorem. It is also a very popular method for text categorization. The algorithm applies the formula given in Eq. (2):

$$ P(c \mid X) = \frac{P(c)\prod_{i = 1}^{n} P(x_{i} \mid c)}{P(X)} $$
(2)

where P(c | X) is the posterior probability of class c given document X, P(c) is the class prior, and P(x_i | c) is the likelihood of the i-th feature given the class.
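The following minimal scikit-learn example trains a Gaussian Naïve Bayes classifier and reports the posterior P(c | X) of Eq. (2) via predict_proba; the data is a random placeholder rather than the selected term features.

```python
# Gaussian Naive Bayes on placeholder data, reporting accuracy and posteriors.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = rng.integers(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
gnb = GaussianNB().fit(X_tr, y_tr)
print("accuracy:", gnb.score(X_te, y_te))
print("posterior P(c|X) for the first test document:", gnb.predict_proba(X_te[:1]))
```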

4 Experimental Results and Discussions

This section summarizes the simulation results obtained on the four text categorization datasets. We consider four classifiers and nine feature selection techniques for performance evaluation, and results are reported for each classifier and feature selection technique pair. The accuracy values are recorded in Tables 2, 3, 4, and 5. The feature selection techniques used in this study follow the filter-based approach, which requires the number of features as an input parameter. Due to the uncertainty in choosing the optimal number of features, we performed multiple experiments, starting with ten features and increasing in steps of 10. In total, we capture five settings, each a multiple of ten features.

Table 2 Impact of feature selection algorithm on 6-ketoprostaglandin F1α data
Table 3 Impact of feature selection algorithm on uric acid data
Table 4 Impact of feature selection algorithm on heart valve data
Table 5 Impact of feature selection algorithm on brain chemistry data

Table 2 lists the experimental results on the 6-ketoprostaglandin F1α dataset. The accuracy values in the table are highest when the support vector machine is paired with DISR on ten features. When 20 features are selected, three feature selection techniques, CMIM, conditional redundancy (Condred), and mRMR, produce 100% accuracy.

The performance on the uric acid data is reported in Table 3. The table reveals that the CIFE feature selection technique with the Naïve Bayes classifier achieves 100% accuracy with only ten features.

On the heart valve data, the decision tree delivers the maximum accuracy, with the MIFS feature selection technique achieving 99.2% accuracy using ten features. The experimental results are listed in Table 4.

On the brain chemistry data, the combination of the Naïve Bayes algorithm and the CMIM feature selection technique has the highest classification rate. This classifier and feature selection pair achieves 100% accuracy when the number of selected features is 10. Table 5 reports the experimental outcome.

The objective of these experiments was to identify the best feature selection and classifier combination, so that choosing an efficient model becomes effortless. Although no single best pair can be identified, the Naïve Bayes classifier together with the CMIM feature selection technique could be an optimal choice for a text categorization model.

5 Conclusion

In this paper, nine weighted feature selection algorithms are applied to four preprocessed text classification datasets from the KEEL repository. Feature selection is performed with different numbers of features, ranging from 10 to 50 in intervals of 10. The experiments show that using weighted feature selection improves classification performance for text document categorization. The experimental results indicate that mutual information-based feature selection algorithms improve text classification results significantly. The weighted feature selection methods also work well because they account for the relevance of a term's position in the text document, and they reduce redundancy during document classification.