1 Introduction

The rapid growth in the use of Internet technology has caused a significant increase in the number of electronic documents worldwide. This increase makes automatic text classification approaches quite important. The main task of automatic text classification is to assign electronic documents to the appropriate classes according to their content [1]. These documents can be retrieved from many different domains, and every domain may have slightly different problems and solutions due to its nature. Text classification can be used to solve a variety of problems such as the filtering of spam e-mails [2], author identification [3], classification of web pages [4], sentiment classification [5, 6] and classification of medical text documents [7,8,9].

Classification of medical abstracts is one of the main concerns of the medical text classification research field. Research on medical abstracts is generally carried out on the MEDLINE database [10]. MEDLINE is a bibliographic database containing over 21 million documents from about 5600 medical journals. This database consists of medical abstracts in English which are assigned to categories called medical subject headings (MeSH). It can be queried on the Internet through a search platform called PubMed [11]. Documents in the MEDLINE database are manually indexed by experts with the relevant MeSH categories. In the literature, there exist some studies on automatic classification of MEDLINE documents [8, 9, 12,13,14,15,16,17,18,19,20,21]. In these studies, datasets containing a certain number of MEDLINE documents are used. The most widely used dataset for automatic classification of MEDLINE documents is the Ohsumed dataset. It contains medical abstracts in English for 23 types of diseases. Owing to the structure of the MEDLINE database, Ohsumed is multi-label. It is therefore necessary to apply multi-label classification approaches whenever a study uses all documents of this dataset.

In a previous study, the use of words, medical phrases, and their combinations as features was investigated for medical document classification [8]. The results show that using a combination of words and phrases as features gives slightly better classification performance than the alternatives. In another study, the multi-label classification performance of an associative classifier was examined on medical articles [12]. In another study, hidden Markov models were used for classification [16]. There are also a number of studies in which ontology-based classification approaches are applied [14, 18]. In a recent study, an approach using support vector machines and latent semantic indexing was applied to several datasets, including ones consisting of medical abstracts [20]. Moreover, the performance of classifiers on medical document classification was analyzed with and without stemming [21]. The impact of different text representations of biomedical texts on classification performance was also analyzed [9]. In a recent study [22], several experiments were conducted on the OHSUMED corpus using a biomedical text categorization system based on three machine learning models: support vector machine (SVM), naïve Bayes (NB) and maximum entropy (ME). The results show that the context-based methods (SenseRelate and NoDistanceSenseRelate) outperform the others. In another study [23], a collection of 1499 PubMed abstracts, annotated according to the scientific evidence they provide for the 10 currently known hallmarks of cancer, was used to train a system that classifies PubMed literature according to the hallmarks. The system uses supervised machine learning and rich features largely based on biomedical text mining.

In another study [24], the authors designed and assessed a method for extracting clinically useful sentences from synthesized online clinical resources, that is, the sentences most useful for directly answering clinicians’ information needs. The feature-rich approach significantly outperformed general baseline methods as well as classifiers based on a single type of feature. In one of the recent studies [25], the impact of feature selection on medical document classification was analyzed using two datasets containing MEDLINE documents. Gini index and distinguishing feature selector were used as feature selection methods, and two pattern classifiers, namely Bayesian network and C4.5 decision tree, were utilized. As that study deals with single-label classification, a subset of the documents in OHSUMED and a self-constructed dataset were used to assess the feature selection methods. According to the experimental results, the combination of distinguishing feature selector and Bayesian network classifier gives better results than the other combinations in most cases.

Apart from studies that use MEDLINE documents, there exist some medical text classification studies using data obtained from various clinics [13, 26,27,28,29,30,31]. Some of these studies are concerned with medical text documents in languages other than English, such as German [13].

In this study, the performance of two widely known classifiers, namely Bayesian networks and C4.5 decision trees, is extensively analyzed using two feature selection methods on two different datasets consisting of MEDLINE documents. In addition, two widely known feature weighting methods are compared in order to find the best combination of feature selection method, feature weighting algorithm, and classifier for medical document classification. In order to generalize from the results, two datasets with different characteristics are used in the experiments. The first dataset is a subset of the well-known OHSUMED dataset. The second is a self-constructed dataset whose data is retrieved programmatically by querying the PubMed search platform. It differs from the first one in that it consists of MEDLINE documents originating from medical journals in Turkey. However, it contains less data than the first dataset.

The rest of the paper is organized as follows: the feature extraction and selection approaches used in the study are briefly described in Sect. 2. Section 3 explains the pattern classifiers used in this study. Section 4 presents the experimental study and results. Finally, some concluding remarks are given in Sect. 5.

2 Feature Extraction and Selection

2.1 Feature Extraction

As in most text classification studies, the bag-of-words approach [1, 21] can be used for feature extraction. In this approach, the order of terms within documents is ignored and only their occurrence frequencies are used [32, 39]. Each unique word in a text collection is therefore considered a separate feature, and a document is represented by a multi-dimensional feature vector [1]. Each dimension of a feature vector holds a weight computed by a scheme such as term frequency (TF) or term frequency-inverse document frequency (TF-IDF) [33].

It should also be noted that some preprocessing steps are necessary during feature extraction from text documents. Widely used preprocessing steps are “stopword removal” and “stemming”. In this study, both steps were applied. The Porter stemming algorithm [34] was used for stemming, and two term weighting approaches, TF and TF-IDF, were applied.
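The bag-of-words representation and the two weighting schemes can be illustrated with a minimal Python sketch. It is not the pipeline used in the experiments: the stopword list and the example sentences are invented for illustration, stemming is omitted for brevity, and the IDF variant log(N/df) is an assumption, since several variants exist and the text does not specify one.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "of", "in", "and", "a", "is", "with"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    tokens = (tok.strip(".,;:()") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOPWORDS]

def tf_vectors(docs):
    """Return (vocabulary, list of raw term-frequency vectors)."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

def tfidf_vectors(docs):
    """Weight each term frequency by log(N / document frequency)."""
    vocab, tf = tf_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[f * w for f, w in zip(row, idf)] for row in tf]

# Hypothetical example documents.
docs = [
    "Treatment of heart failure with heart transplant",
    "Tumor growth in lung tissue",
    "Heart disease and lung disease",
]
vocab, tf = tf_vectors(docs)
_, tfidf = tfidf_vectors(docs)
```

In this toy collection, "heart" occurs twice in the first document, so its TF weight there is 2, while its TF-IDF weight is scaled down because the term also appears in another document.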

2.2 Feature Selection

Feature selection techniques generally fall into three categories: filters, wrappers, and embedded methods. Filter techniques are computationally fast; however, they usually do not take feature dependencies into consideration [1]. Filter-based methods are widely preferred, especially in the text classification domain, and a large number of filter-based techniques exist for selecting distinctive features in text classification. In this study, two filter-based feature selection methods, namely Gini index (GI) and distinguishing feature selector (DFS), were used. These methods are explained in detail below.

2.2.1 Gini Index (GI)

GI is an improved version of the method originally used to find the best split of features in decision trees [35]. It is an accurate and fast method, defined as:

$$ GI(t) = \sum\limits_{i = 1}^{M} P(t \mid C_{i})^{2} \cdot P(C_{i} \mid t)^{2} $$
(1)

where \( M \) is the total number of classes, \( P(t \mid C_{i}) \) is the probability of term \( t \) given the presence of class \( C_{i} \), and \( P(C_{i} \mid t) \) is the probability of class \( C_{i} \) given the presence of term \( t \).
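As a rough illustration of Eq. (1), the sketch below estimates both probabilities from document counts and sums the per-class products. The estimation from document-level presence, the function name `gini_index`, and the toy documents and labels are assumptions made for this example only.

```python
def gini_index(docs, labels, term):
    """Estimate GI(t) from document counts.

    docs: list of sets of terms; labels: parallel list of class labels.
    """
    n_with_term = sum(1 for d in docs if term in d)
    if n_with_term == 0:
        return 0.0  # term never occurs, so P(C_i | t) is undefined
    score = 0.0
    for c in set(labels):
        in_class = [d for d, y in zip(docs, labels) if y == c]
        both = sum(1 for d in in_class if term in d)
        p_t_given_c = both / len(in_class)   # P(t | C_i)
        p_c_given_t = both / n_with_term     # P(C_i | t)
        score += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return score

# Hypothetical toy collection: each document is a set of (stemmed) terms.
docs = [
    {"heart", "failure"},
    {"heart", "valve"},
    {"tumor", "lung"},
    {"tumor", "heart"},
]
labels = ["C14", "C14", "C04", "C04"]

gi_tumor = gini_index(docs, labels, "tumor")
gi_heart = gini_index(docs, labels, "heart")
```

Here "tumor" appears in every C04 document and in no other class, so it receives the maximum score of 1, whereas "heart" is spread over both classes and scores lower.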

2.2.2 Distinguishing Feature Selector (DFS)

DFS is a recent and successful feature selection method for text classification [1]. Its aim is to select distinctive features while eliminating uninformative ones according to some pre-determined criteria. DFS can be expressed with the following formula:

$$ DFS(t) = \sum\limits_{i = 1}^{M} \frac{P(C_{i} \mid t)}{P(\overline{t} \mid C_{i}) + P(t \mid \overline{C}_{i}) + 1} $$
(2)

where \( M \) is the total number of classes, \( P(C_{i} \mid t) \) is the conditional probability of class \( C_{i} \) given the presence of term \( t \), \( P(\overline{t} \mid C_{i}) \) is the conditional probability of the absence of term \( t \) given class \( C_{i} \), and \( P(t \mid \overline{C}_{i}) \) is the conditional probability of term \( t \) given all the classes except \( C_{i} \).
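Eq. (2) can be sketched in the same way, again estimating the probabilities from document counts. The helper `select_top_k`, which ranks the vocabulary by score and keeps the best k terms, is a hypothetical illustration of how a filter method reduces dimensionality; the names and toy data are assumptions, not the implementation used in the experiments.

```python
def dfs_score(docs, labels, term):
    """Estimate DFS(t) from document counts.

    docs: list of sets of terms; labels: parallel list of class labels.
    """
    n_with_term = sum(1 for d in docs if term in d)
    if n_with_term == 0:
        return 0.0  # term never occurs, so P(C_i | t) is undefined
    score = 0.0
    for c in set(labels):
        in_class = [d for d, y in zip(docs, labels) if y == c]
        out_class = [d for d, y in zip(docs, labels) if y != c]
        hits_in = sum(1 for d in in_class if term in d)
        hits_out = sum(1 for d in out_class if term in d)
        p_c_given_t = hits_in / n_with_term              # P(C_i | t)
        p_not_t_given_c = 1 - hits_in / len(in_class)    # P(not t | C_i)
        # P(t | not C_i); taken as 0 if there is only one class.
        p_t_given_not_c = hits_out / len(out_class) if out_class else 0.0
        score += p_c_given_t / (p_not_t_given_c + p_t_given_not_c + 1)
    return score

def select_top_k(docs, labels, k):
    """Rank all terms by DFS (ties broken alphabetically) and keep k."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-dfs_score(docs, labels, t), t))
    return ranked[:k]

# Hypothetical toy collection: each document is a set of (stemmed) terms.
docs = [
    {"heart", "failure"},
    {"heart", "valve"},
    {"tumor", "lung"},
    {"tumor", "heart"},
]
labels = ["C14", "C14", "C04", "C04"]

best = select_top_k(docs, labels, 1)
ranking = select_top_k(docs, labels, 5)
```

On this toy data the term "tumor", which occurs in all C04 documents and nowhere else, is ranked first, and "heart", which occurs in both classes, is ranked last.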

3 Pattern Classifiers

In this study, two classifiers from the Weka [36] package were used programmatically: Bayesian networks and the C4.5 decision tree. These algorithms are explained in detail below.

3.1 Bayesian Networks (BN)

A BN is a graphical model that can be used to represent state transitions [37]. BNs are often used for modeling discrete and continuous variables of multinomial data. These networks encode the relationships between the variables in the modeled data. In a BN, the nodes are connected by arrows that indicate the direction of the dependency between them.

3.2 C4.5 Decision Tree (DT)

The main purpose of decision tree algorithms is to split the feature space into unique regions corresponding to the classes [1]. An unknown feature vector is assigned to a class via a sequence of Yes/No decisions along a path of nodes in the decision tree. C4.5 is an algorithm for generating such a tree and is known as one of the most successful decision tree classification algorithms.
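The sequence of Yes/No decisions can be pictured with a small hand-built tree. The sketch below only shows how a finished tree classifies a document; it does not implement the C4.5 induction procedure, and the tree, the terms, and the class labels are hypothetical.

```python
# A node is either a class label (leaf, a string) or a tuple
# (term, no_branch, yes_branch) that asks "does the document contain term?".
TOY_TREE = (
    "tumor",
    ("heart", "other", "C14"),  # "tumor" absent: ask about "heart"
    "C04",                      # "tumor" present: predict C04
)

def classify(tree, doc_terms):
    """Follow Yes/No decisions from the root until a leaf label is reached."""
    node = tree
    while not isinstance(node, str):
        term, no_branch, yes_branch = node
        node = yes_branch if term in doc_terms else no_branch
    return node

prediction = classify(TOY_TREE, {"heart", "valve"})
```

For the example document, the root test on "tumor" answers No and the test on "heart" answers Yes, so the path ends at the leaf labeled C14.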

4 Experimental Work

In this section, an in-depth investigation is carried out to measure the performance of the feature selection methods, term weighting methods and classifiers. For this purpose, combinations of the feature selection methods with the BN and DT classifiers are analyzed in order to determine the best combination for both datasets. Two term weighting methods, TF and TF-IDF, are used in these experiments. The effect of dimension reduction can also be inferred from the experimental results. In the following subsections, the datasets and success measures are briefly described, and then the experimental results are presented.

4.1 Datasets

In this study, two different datasets containing MEDLINE documents were used. The first one is a subset of the well-known Ohsumed dataset. It consists of medical abstracts collected in 1991 related to 23 cardiovascular disease categories. As this study deals with single-label text classification, the documents belonging to multiple categories were eliminated. Also, only 10 classes were used for classification so that the same set of classes is used as in the second dataset. The second dataset is a self-constructed dataset whose data was retrieved programmatically by querying the PubMed search platform. It was constructed by retrieving XML results containing medical abstracts and parsing them appropriately. Because the study concerns single-label classification, the documents having multiple categories were removed from this dataset as well. This dataset differs from the first one in its origin: it consists only of MEDLINE documents originating from medical journals in Turkey rather than from different locations. It has the same categories as the first dataset but contains less data. In this dataset, the 10 categories having a sufficient number of documents were used for the evaluation. Detailed information regarding the datasets is provided in Tables 1 and 2. In the experiments, 70% of the documents in each class were used for training and the remaining 30% were used for testing.
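A per-class 70/30 split of this kind can be sketched as follows. The function name, the fixed random seed, the use of shuffling before the cut, and the class sizes are assumptions for illustration; the text does not state how the documents within each class were assigned to the two parts.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for y in sorted(by_class):
        indices = by_class[y]
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Hypothetical label list: 10 documents of one class, 20 of another.
labels = ["C04"] * 10 + ["C14"] * 20
train_idx, test_idx = stratified_split(labels)
```

Splitting inside each class keeps the class proportions of the training and test parts equal to those of the full dataset, which matters when some categories have few documents.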

Table 1 Ohsumed dataset
Table 2 Self-constructed dataset

4.2 Accuracy Analysis

Varying numbers of features, selected by each selection method, were fed into the DT and BN classifiers. In the experiments, stopword removal and stemming were applied, with the widely known Porter stemmer as the stemming algorithm. GI and DFS were used as feature selection methods. Dimension reduction was carried out by constructing feature sets consisting of 300, 500, 1000, and 2000 features. The F-score [38] was used as the success measure and is reported both per class and as a weighted average. The F-scores obtained on the two datasets are listed in Tables 3 and 4 for TF weighting and in Tables 5 and 6 for TF-IDF weighting. The best results are shown in bold.
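The two forms of the F-score can be computed as in the sketch below. It assumes that the weighted average weights each class-specific score by the number of true instances of that class, which is a common convention but is not stated explicitly in the text; the label lists are invented for illustration.

```python
from collections import Counter

def f_scores(y_true, y_pred):
    """Return (per-class F-score dict, support-weighted average F-score)."""
    support = Counter(y_true)  # number of true instances per class
    per_class = {}
    for c in support:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # F = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
    total = sum(support.values())
    weighted = sum(per_class[c] * support[c] / total for c in support)
    return per_class, weighted

# Hypothetical true and predicted labels for eight test documents.
y_true = ["C04", "C04", "C04", "C14", "C14", "C14", "C14", "C14"]
y_pred = ["C04", "C04", "C14", "C14", "C14", "C14", "C14", "C04"]
per_class, weighted = f_scores(y_true, y_pred)
```

In this example the larger class contributes more to the average, so the weighted score lies closer to the C14 score than a plain mean of the two class scores would.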

Table 3 Results on Ohsumed dataset (tf-weighted)
Table 4 Results on self-constructed dataset (tf-weighted)
Table 5 Results on Ohsumed dataset (tf-idf weighted)
Table 6 Results on self-constructed dataset (tf-idf weighted)

Considering the highest weighted average F-scores, DFS is superior to GI in most cases. In a small number of experiments, DFS and GI give similar results on both datasets. It should be noted that DFS seems more successful when the feature size is low. Also, the scores obtained with TF weighting are generally higher than the ones obtained with TF-IDF weighting; only in a small number of experiments is TF-IDF weighting superior. Both TF and TF-IDF term weighting are successful when the feature size is high. Besides, in spite of originating from different sources and having different class-based distributions, the two datasets yield similar maximum classification performance. The BN classifier is more successful than the DT classifier in most cases.

Considering class-based F-scores, classification performance on the neoplasms (C4) and cardiovascular diseases (C14) categories is generally higher than on the others for the first dataset. This may be due to the large number of training instances for these two categories. For the self-constructed dataset, classification performance on the parasitic diseases (C3) and cardiovascular diseases (C14) categories is generally higher than on the others. In this case, these are not the classes with the maximum number of documents, which may be a consequence of the small amount of data available for most of the categories. On both datasets, these observations hold for both TF and TF-IDF term weighting. Also, for most of the class-based F-scores, the combination of DFS and BN seems better than the other combinations.

5 Conclusions

In this study, the performance of two widely known classifiers is extensively analyzed using two different feature selection methods. Two different term weighting methods are also used in the experiments. The analysis is carried out on two different datasets consisting of MEDLINE documents, with stopword removal and stemming applied as preprocessing steps. Experimental results show that the most successful setting is the combination of the Bayesian network classifier, the distinguishing feature selector, and TF term weighting. As future work, a new dataset containing Turkish versions of the documents in the self-constructed dataset may be compiled, and classification performance on these two datasets, which would contain the same documents in different languages, can be extensively analyzed. In this paper, we have revised and extended the research results presented earlier in [25].