1 Introduction

The overwhelming and growing number of scientific research documents available in biomedical databases such as MEDLINE requires the development of tools that help researchers keep up with the relevant work being done all over the world. Moreover, the number of publicly available corpora of full scientific texts is also increasing rapidly. The availability of full texts may increase the chances of better analyses, but it also requires processing larger amounts of data. We argue that new and more powerful tools are required.

Our response to the growing amount of information to process is a novel feature selection algorithm that, on full-text data sets, achieves better results than existing competitors with a much smaller number of attributes. We have empirically assessed the performance of three feature selection algorithms in a series of experiments: Information Gain, Correlation and the newly developed feature selection algorithm.

Text classifiers can adequately be used to extract medical/biological information from very large repositories of scientific papers, as shown in [20]. Techniques like the one presented in this paper can contribute to building better text classifiers and therefore to retrieving more relevant papers from those large repositories.

The rest of the paper is organized as follows. Section 2 introduces feature selection methods, focusing on the ones used in the present study. Section 3 presents the new algorithm for feature selection, k-Best-Discriminative-Terms. Section 4 highlights feature selection work applied to the classification of biomedical full-text documents. In Sect. 5 we present the results of our study and finally in Sect. 6 we draw the conclusions of the work described in the paper.

2 Feature Selection Methods for Text Classification

According to [2], feature selection, or attribute reduction, is the process of selecting a subset of features that by itself best represents all the data. The rationale of feature selection in the context of text classification is to represent a document with a reduced number of highly representative/discriminative attributes.

Full-text document classification, especially in the biomedical domain, involves the manipulation of very large data sets. This brings several well-known problems, such as an increase in computation time. In addition, not all attributes are relevant for the classification task, and irrelevant attributes are known to degrade the performance of classifiers.

We have adopted the bag-of-words approach, in which each original document is seen as a vector containing a huge number of words. Since we are working with a large collection of documents, the number of words increases quite dramatically, which imposes memory and time restrictions on running learning algorithms. It is therefore essential to select the most relevant attributes for the classification process. That is the objective of feature selection algorithms.

According to [11] there are two main reasons for selecting some features over others. The first reason is related to the algorithm’s performance, i.e., algorithms produce better results when not considering all the attributes. This is because some attributes do not add information but noise, and removing them makes the classifier perform better. The second reason is scalability: a huge number of attributes demands more resources (computation power, memory, network bandwidth, storage, etc.), so running on a smaller subset decreases the computation time.

We have assessed the performance of three different feature selection methods:

  1. Information Gain (IG)

  2. Correlation (Corr)

  3. k-Best-Discriminative-Terms (k-BDT)

The first two methods are now described and the k-BDT is presented in Sect. 3.

2.1 Information Gain

Information Gain (IG) is used to determine which attribute in a given set of training feature vectors is most useful for discriminating the class values to be learned [4, 5].

IG is a “synonym” for Kullback-Leibler divergence [14] and it is often used for ranking individual features [15].

In document classification, IG measures the number of bits of information gained, with respect to deciding the class to which a document belongs, by using each word frequency of occurrence in the document. However, IG only evaluates features in an individual manner.
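As an illustration only, the following minimal Python sketch computes IG for a single term as the reduction in class entropy obtained by splitting the documents on that term. It assumes a binary presence/absence representation of the term rather than its frequency of occurrence, and the function names entropy and information_gain are ours, not part of any tool used in this study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(term_present, labels):
    """Entropy of the class minus its expected entropy after splitting
    the documents on the presence (True) or absence (False) of a term.

    Assumption: the term is represented as a boolean per document.
    """
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for x, y in zip(term_present, labels) if x == value]
        if subset:  # skip an empty side of the split
            gain -= (len(subset) / total) * entropy(subset)
    return gain


labels = ["pos", "pos", "neg", "neg"]
perfect = information_gain([True, True, False, False], labels)
useless = information_gain([True, False, True, False], labels)
```

On this balanced toy sample, a term present in every positive document and in no negative one gains one full bit, whereas a term distributed identically in both classes gains nothing. Because each term is scored separately, the sketch also makes visible that IG evaluates features individually.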

IG is a filter method, i.e., a feature selection method applied prior to and independently of the learning process: it computes a score for each attribute and then selects the attributes with the highest scores [6].

Based on their comparative study of filter methods, [7] and [11] concluded that IG and Chi-Square (CHI) are among the most effective methods of feature selection for classification.

2.2 Correlation

According to the correlation algorithm an attribute is very relevant if it is highly correlated with the class, otherwise it is irrelevant [12].

We have used the Weka CorrelationAttributeEval functionality, which evaluates the worth of an attribute by measuring its correlation (Pearson’s correlation) with the class. CorrelationAttributeEval requires the Ranker search method, which evaluates each attribute and lists the results in ranked order.
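The idea of ranking attributes by their correlation with the class can be sketched in a few lines of Python. This is not the Weka implementation: it assumes numeric attribute values, a class encoded as 0/1, and a ranking by the absolute value of Pearson’s correlation, and the names pearson and rank_by_correlation are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(columns, labels):
    """Return (term, score) pairs sorted by decreasing |correlation|.

    columns: dict mapping each term to its list of values, one per document.
    labels:  list of 0/1 class values, one per document.
    """
    scores = {term: abs(pearson(values, labels))
              for term, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


columns = {"a": [1, 2, 3, 4], "b": [1, 1, 1, 1], "c": [4, 1, 3, 2]}
ranked = rank_by_correlation(columns, [0, 0, 1, 1])
```

In this toy example the attribute a, whose values grow with the class, is ranked first, while the constant attribute b and the uncorrelated attribute c both receive a score of zero.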

3 k-Best-Discriminative-Terms

The rationale of the k-BDT method is to find the k terms in the corpus that best discriminate the two classes of documents (assuming a binary classification problem). Informally, the documents are first separated by class value and, for each class value, the metric Tf\(\times \)Df is computed for each term. This metric is the average term frequency in the class value multiplied by the document frequency. The justification for Df is that we aim at terms that are frequent across the documents of one class value but infrequent in the “other class value”.

The k-BDT method is described in detail in Algorithm 1. The documents of the two class values are separated in lines 2 and 3. The documents from one of the class values (say, POS) are processed in lines 5 to 11. First the term frequency (Tf, line 7) and document frequency (Df, line 8) are computed for each term and document, and then Tf\(\times \)Df is computed (line 9). Finally, we compute the average Tf\(\times \)Df for each term in lines 10–11. We repeat the same procedure for the other class value (NEGS) in lines 12–18. The “final values” are the differences between the Tf\(\times \)Df of POS and the corresponding Tf\(\times \)Df of NEGS (lines 19–21). The final values are sorted in descending order (line 22) and the first k terms are returned in line 23. In Algorithm 1 (line 21) we have used the abs function but, as described below, we have also considered an alternative procedure.

[Algorithm 1: the k-BDT method]

Two Alternative Implementations

To choose the best k discriminative terms we have adopted and evaluated (Sect. 5) two alternative methods of computing the “final value” of each term (line 21). The first approach, designated abs, computes for each term the absolute value of the difference between its Tf\(\times \)Df value in the positives and its value in the negatives, sorts these values in decreasing order and chooses the best k terms. The alternative approach, called half-k, also computes for each term the difference between its Tf\(\times \)Df in the positives and in the negatives, but does not take the absolute value. It then chooses the k/2 terms with the most positive values (those that appear most frequently in relevant documents) and the k/2 terms with the most negative values (those that appear most frequently in irrelevant documents). This latter approach aims at making sure that representative terms from both positive and negative texts are chosen.
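The procedure described in this section can be approximated by the following Python sketch. It is a reading of the textual description under stated assumptions, not a transcription of Algorithm 1: Tf is taken as a raw count, Df as the fraction of the documents of a class that contain the term, ties are broken alphabetically, and the names class_scores and k_bdt are ours.

```python
from collections import Counter


def class_scores(docs):
    """Average Tf multiplied by Df for every term of one class.

    docs: list of token lists. Assumptions: Tf is a raw count and Df is
    the fraction of the documents of the class that contain the term.
    """
    n_docs = len(docs)
    tf_total = Counter()
    df_count = Counter()
    for tokens in docs:
        tf_total.update(Counter(tokens))  # add the term counts of this document
        df_count.update(set(tokens))      # count each term once per document
    return {t: (tf_total[t] / n_docs) * (df_count[t] / n_docs)
            for t in tf_total}


def k_bdt(pos_docs, neg_docs, k, variant="abs"):
    """Select k terms with the abs or the half-k variant."""
    pos = class_scores(pos_docs)
    neg = class_scores(neg_docs)
    diff = {t: pos.get(t, 0.0) - neg.get(t, 0.0) for t in set(pos) | set(neg)}
    if variant == "abs":
        # largest absolute difference first, ties broken alphabetically
        return sorted(diff, key=lambda t: (-abs(diff[t]), t))[:k]
    if variant == "half-k":
        ranked = sorted(diff, key=lambda t: (-diff[t], t))
        half = k // 2
        top_pos = ranked[:half]
        top_neg = [t for t in reversed(ranked) if t not in top_pos][:half]
        return top_pos + top_neg
    raise ValueError("variant must be 'abs' or 'half-k'")


pos_docs = [["gene", "gene", "cell"], ["gene", "protein"]]
neg_docs = [["patient", "cell"], ["patient", "patient", "cell"]]
abs_terms = k_bdt(pos_docs, neg_docs, 3)
half_terms = k_bdt(pos_docs, neg_docs, 4, variant="half-k")
```

In the toy corpus, gene dominates the positive documents and patient the negative ones, so both variants select them first. The half-k variant additionally keeps protein, a weakly positive term that the abs variant ranks last.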

Given a set of labeled examples, the goal of a classifier is to discriminate the elements of the different classes. k-BDT is based on a similar principle: it identifies terms that discriminate relevant documents from non-relevant ones. It does so differently from the traditional tf \(\times \) idf approach. In traditional text classification, tf \(\times \) idf promotes terms that are highly represented in a single document, independently of the class to which the document belongs. k-BDT promotes terms that are highly represented in a large number of documents of one of the classes and at the same time rare in the documents of the other class, that is, terms that are good at discriminating the two classes. Because k-BDT looks for terms that are highly represented in the whole set of documents of each class, a small number of such terms is often enough to build a good classifier, as the experimental results show. This feature seems to be an advantage over the traditional tf \(\times \) idf approach.

The k-BDT technique is suitable to be applied to a text classification problem in any domain and text corpus. There are no domain restrictions to the application of the technique.

4 Related Work on Feature Selection for Attribute Reduction in Full-Text Documents Classification

The literature on feature selection is quite extensive, so we highlight only the feature selection work applied to the classification of biomedical full-text documents.

The recent work of [8] presents a study of the impact of feature selection on medical document classification. This study uses two data sets containing MEDLINE documents and compares two feature selection methods, the Gini Index and the Distinguishing Feature Selector, using two base classifiers: the C4.5 decision tree and the Bayesian network. The authors also used documents from ten different disease categories for the experiments. They concluded that the best accuracy is obtained with a combination of the two feature selection methods.

The authors in [9] present a novel method for attribute reduction using a data set of PubMed articles. They claim to have achieved better accuracy with their new method. The process involves a first phase of pre-processing the documents through tokenization, stemming and stop word removal. The new method is a variation of the Global Weighting Schema (GRW): it extracts unique terms from the documents and weights these terms using the proposed global weighting schema.

The authors in [10] propose a group of scoring measures for feature selection and apply them, using an SVM classifier, to the OHSUMED corpus. They claim that the results achieved by mixing their proposed scoring measures outperformed both Information Gain and Tf\(\times \)IDf in some cases. According to the authors, the proposed measures depend more on the distribution of the terms across the categories and on the distribution of the documents over the categories.

The work in [16] presents a novel feature selection method that reduces the dimension of the term space by taking into account a new semantic space between terms, based on the latent semantic indexing method. The idea is to capture the underlying conceptual similarity between terms and documents, which helps to improve the accuracy of text categorization.

Xu et al. [18] describe a work based on a very simple technique called Document Frequency thresholding (DF), which has been shown to be one of the best methods for both Chinese and English text data. To improve DF, Xu et al. added the Term Frequency (TF) factor. The extended method, called TFDF, was tested on the Reuters-21578 and OHSUMED corpora and showed better results than the original DF method. Although we also use document frequency (Df), the approach of Xu et al. is quite different from the novel method reported in this paper. Their work does not attempt to discriminate the class values directly by performing separate computations on the set of documents of each class value. Document Frequency thresholding (DF) is also defined differently from the Df used in this paper.

An extensive survey on text categorization techniques can be found in [19].

5 Experimental Work and Results

Methods

The empirical evaluation was done using the OHSUMED corpus [13]. We have used five OHSUMED data sets for which we managed to collect the full texts. From that corpus we have “created” two corpora: the original corpus with the full text papers, and a corpus with the same papers but with just title and abstract. For each corpus there are five data sets (c04, c06, c14, c20 and c23), which are characterized in Table 1 for title and abstract and in Table 2 for full text.

Table 1. Characterization of the data sets in the OHSUMED corpus (Title+Abstract).
Table 2. Characterization of the data sets in the OHSUMED corpus (full text).

We have performed three sets of experiments. We first conducted an experiment to estimate the best values of k for the title and abstract data sets and for the full text data sets. Secondly, with the best values of k for title and abstract, we have compared the performance of the three feature selection algorithms on the title and abstract corpus. Lastly, using the best values of k for full text, we have compared the algorithms on the full text corpus.

For the experiments we have used the Support Vector Machine (SVM) algorithm from Weka [17]. A 10-fold cross validation procedure was used as the evaluation method. The values used for k were 10, 50 and 100 for the title and abstract corpus and 50, 100, 500, 1000 and 1500 for the full text corpus. Both alternative implementations (abs and half-k) were used in the assessment of the novel approach.
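For readers unfamiliar with the evaluation protocol, the following Python sketch shows how a data set can be partitioned for 10-fold cross validation. It is only a generic illustration: it does not reproduce the Weka procedure, it does not stratify the folds by class, and the name k_fold_splits and the fixed seed are assumptions of the example.

```python
import random


def k_fold_splits(n_items, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Every item is used exactly once for testing and n_folds - 1 times
    for training. Assumption: folds are formed from a plain shuffle.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out, test_indices in enumerate(folds):
        train_indices = [idx
                         for j, fold in enumerate(folds) if j != held_out
                         for idx in fold]
        yield train_indices, test_indices


splits = list(k_fold_splits(25))
```

A classifier would be trained on each train_indices set and scored on the corresponding test_indices set, and the ten scores would then be averaged, which is how the average and standard deviation reported for a 10-fold cross validation are obtained.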

For the Information Gain feature selection method we have used a threshold of \(1.00e^{-10}\), the value used in a previous work [1]. In the Normalize component we have used the nominal representation.

The metric used to evaluate classifier performance was the F-measure, which combines precision and recall. Precision is the percentage of documents assigned to a class by the classifier that actually belong to that class, and recall is the percentage of documents belonging to the class that the classifier assigned to it. The F-measure is computed as the harmonic mean of precision and recall. The best performance of a classifier on a classification task corresponds to an F-measure of 1 (perfect precision and recall) and the worst to an F-measure of 0.
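The computation of the F-measure can be made concrete with the short Python sketch below, which uses the standard definitions: precision is the fraction of documents predicted as positive that are truly positive, and recall is the fraction of truly positive documents that are predicted as positive. The function name f_measure and the convention of returning 0 when a denominator is empty are assumptions of this example.

```python
def f_measure(true_labels, predicted_labels, positive=1):
    """Return (precision, recall, F-measure) for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumption: an undefined ratio (empty denominator) is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


precision, recall, f1 = f_measure([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

In the example, two of the three documents predicted as positive are correct and two of the three truly positive documents are found, so precision, recall and F-measure all equal 2/3.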

Results

Table 3 shows the results of the experiments to assess the impact of the parameter k and of the two alternative methods of choosing the attributes in the k-BDT method. We can see from those results that the novel method does not perform well on data sets that have a small number of terms. The term-document matrix of these data sets contains a very large number of zeros, i.e., it is very sparse, so there is a low probability of finding a frequent term common to a large number of documents of each class value.

Table 3. Choosing the values of k together with the best of abs or half-k alternatives. The title and abstract corpus was used. k = 100 was the best value among all alternatives tested for both abs and half-k.
Table 4. Choosing the values of k together with the best of abs or half-k alternatives. The full text corpus was used. k values are the best ones for each abs and half-k among other values tested.

Table 4 shows the results of the experiments to assess the impact of the parameter k and of the two alternative methods of choosing the attributes in the k-BDT method on the full text data sets. The results are the opposite of those obtained with title and abstract: the F-measure values are well above the reference values in all data sets.

Concerning the second set of experiments we have obtained the results shown in Table 5. The results in the table show that in the case of using only title and abstract the novel method is much worse than its competitors.

Table 5. Comparison of the feature selection methods on the corpus using only title and abstract. Cells of the table contain the average and standard deviation of the F-measure of a 10-fold cross validation. IG stands for Information Gain. k-BDT stands for k-Best-Discriminative-Terms. ‘+’ means that the value is statistically significantly better than the baseline value. Baseline values can be found in Table 3.
Table 6. Comparison of the feature selection methods on the data sets using full text. Cells of the table contain the average and standard deviation of the F-measure of a 10-fold cross validation. IG stands for Information Gain. k-BDT stands for k-Best-Discriminative-Terms. ‘+’ means that the value is statistically significantly better than the baseline value. Baseline values can be found in Table 4.

In the third set of experiments the novel method obtained the best results in all data sets. Table 6 shows the best results of the experiments comparing the feature selection methods of this study on the full text corpus. We can see that in all data sets the novel method achieves performances well above the baseline value and better than its competitors. In data set c06, using the half-k version of the k-BDT method, we need only 100 attributes to achieve a very good performance.

6 Conclusions

In this paper we have presented and empirically evaluated a novel feature selection method. The method is based on the idea of finding terms that are frequent in the documents of one of the class values and infrequent in the documents of the other class value. We have compared the novel method with two other feature selection approaches for title and abstract and for full-text document classification.

The results of the novel method are much better than those of its competitors in all full text data sets used. However, the novel method seems to be inadequate for data sets using title and abstract only.

The results suggest that the novel method requires a very small number of attributes to achieve good performance. In one of the data sets used in the study, the novel method needs only 100 attributes to achieve the best performance among the competitors.