On Feature Weighting and Selection for Medical Document Classification

Parlak, Bekir; Uysal, Alper Kursat

doi:10.1007/978-3-319-58965-7_19

Part of the book series: Studies in Computational Intelligence (SCI, volume 718)

Abstract

Medical document classification remains a popular research problem in the text classification domain. In this study, the impact of feature selection and feature weighting on medical document classification is analyzed using two datasets containing MEDLINE documents. The performances of two feature selection methods, namely Gini index and distinguishing feature selector, and two term weighting methods, namely term frequency (TF) and term frequency-inverse document frequency (TF-IDF), are analyzed using two pattern classifiers: Bayesian network and C4.5 decision tree. As this study deals with single-label classification, a subset of the documents in OHSUMED and a self-constructed dataset are used to assess these methods. Because some categories in the self-compiled dataset contain few documents, only documents belonging to 10 disease categories are used in the experiments for both datasets. Experimental results show that the best results are obtained with the combination of distinguishing feature selector, TF feature weighting, and Bayesian network classifier.

1 Introduction

The rapid increase in the use of Internet technology has caused significant growth in the number of electronic documents worldwide. This growth makes automatic text classification approaches quite important. The main task of automatic text classification is to assign electronic documents to the appropriate classes according to their content [1]. These documents can come from many different domains, and each domain may have slightly different problems and solutions due to its nature. Text classification can be used to solve a variety of problems such as filtering of spam e-mails [2], author identification [3], classification of web pages [4], sentiment classification [5, 6], and classification of medical text documents [7–9].

Classification of medical abstracts is one of the main concerns of medical text classification research. Research on medical abstracts is generally carried out on the MEDLINE database [10]. MEDLINE is a bibliographic database containing over 21 million documents from about 5600 medical journals. It consists of medical abstracts in English that are assigned to categories called medical subject headings (MeSH). The database can be queried online through a search platform called PubMed [11]. Documents in MEDLINE are manually indexed by experts with the relevant MeSH categories. In the literature, there are several studies on automatic classification of MEDLINE documents [8, 9, 12–21]. These studies use datasets containing a certain number of MEDLINE documents. The most widely used dataset for automatic classification of MEDLINE documents is the OHSUMED dataset, which contains medical abstracts in English for 23 types of diseases. Due to the structure of the MEDLINE database, OHSUMED is multi-label, so multi-label classification approaches are necessary whenever a study uses all documents in this dataset.

In a previous study, the use of words, medical phrases, and their combinations as features is investigated for medical document classification [8]. The results show that using a combination of words and phrases as features gives slightly better classification performance than the alternatives. In another study, multi-label classification performance based on an associative classifier is examined on medical articles [12]. In another, hidden Markov models are used for classification [16]. There are also a number of studies in which ontology-based classification approaches are applied [14, 18]. In a recent study, an approach using support vector machines and latent semantic indexing is applied to several datasets, including ones consisting of medical abstracts [20]. Moreover, the performance of classifiers on medical document classification is analyzed for the two cases where stemming is and is not applied [21]. The impact of different text representations of biomedical texts on classification performance is also analyzed [9]. In a recent study [22], several experiments are conducted on the OHSUMED corpus using a biomedical text categorization system based on three machine learning models: support vector machine (SVM), naïve Bayes (NB), and maximum entropy (ME). The results show that the context-based methods (SenseRelate and NoDistanceSenseRelate) outperform the others. In another study [23], a collection of 1499 PubMed abstracts, annotated according to the scientific evidence they provide for the 10 currently known hallmarks of cancer, is used to train a system that classifies PubMed literature according to the hallmarks. The system uses supervised machine learning and rich features largely based on biomedical text mining.

In another study [24], the authors design and assess a method for extracting clinically useful sentences from synthesized online clinical resources, that is, the sentences that most directly answer clinicians' information needs. The feature-rich approach significantly outperforms both general baseline methods and classifiers based on a single type of feature. In one of the recent studies [25], the impact of feature selection on medical document classification is analyzed using two datasets containing MEDLINE documents. Gini index and distinguishing feature selector are used as feature selection methods, and Bayesian network and C4.5 decision tree are used as pattern classifiers. As that study deals with single-label classification, a subset of the documents in OHSUMED and a self-constructed dataset are used to assess the feature selection methods. According to the experimental results, the combination of distinguishing feature selector and Bayesian network classifier gives better results than the others in most cases.

Apart from studies that use MEDLINE documents, there are medical text classification studies that use data obtained from various clinics [13, 26–31]. Some of these studies deal with medical text documents in languages other than English, such as German [13].

In this study, the performances of two widely known classifiers, namely Bayesian networks and C4.5 decision trees, are extensively analyzed using two feature selection methods on two different datasets consisting of MEDLINE documents. In addition, two widely known feature weighting methods are compared in order to find the best combination of feature selection method, feature weighting method, and classifier for medical document classification. To allow generalization from the results, two datasets with different characteristics are used in the experiments. The first dataset is a subset of the well-known OHSUMED dataset. The second is a self-constructed dataset whose data is retrieved programmatically by querying the PubMed search platform. It differs from the first in that it consists of MEDLINE documents originating from medical journals in Turkey; it also contains less data than the first dataset.

The rest of the paper is organized as follows: the feature extraction and selection approaches used in the study are briefly described in Sect. 2. Section 3 explains the pattern classifiers used in this study. Section 4 presents the experimental study and results. Finally, some concluding remarks are given in Sect. 5.

2 Feature Extraction and Selection

2.1 Feature Extraction

As in most text classification studies, the bag-of-words approach [1, 21] can be used for feature extraction. In this approach, the order of terms within documents is ignored and only their occurrence frequencies are used [32, 39]. Each unique word in a text collection is therefore treated as a separate feature, and a document is represented by a multi-dimensional feature vector [1]. Each dimension of the feature vector holds a weight computed with a scheme such as term frequency (TF) or term frequency-inverse document frequency (TF-IDF) [33].
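The two weighting schemes can be illustrated with a minimal sketch. The paper does not state which TF-IDF variant was used, so the code assumes raw term counts for TF and the common `log(N / df)` form for IDF; the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Raw term-frequency (TF) weights of one document: term -> count."""
    return Counter(tokens)


def inverse_document_frequencies(tokenized_docs):
    """IDF per term, assuming the log(N / df) variant (not confirmed by the paper)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(tokens, idf):
    """TF-IDF weights of one document: raw TF multiplied by IDF."""
    return {term: count * idf.get(term, 0.0)
            for term, count in term_frequencies(tokens).items()}


# Toy corpus of already tokenized documents (illustrative only).
docs = [
    ["heart", "failure", "heart"],
    ["tumor", "growth"],
    ["heart", "tumor"],
]
idf = inverse_document_frequencies(docs)
tf_first = term_frequencies(docs[0])
tfidf_first = tf_idf(docs[0], idf)
```

In this toy corpus, "heart" occurs in two of the three documents, so its IDF is lower than that of "failure", which occurs in only one.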

Some preprocessing steps are also necessary when extracting features from text documents. Widely used preprocessing steps are stopword removal and stemming, and both were applied in this study. The Porter stemming algorithm [34] was used for stemming, and two term weighting approaches, TF and TF-IDF, were applied.
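A rough sketch of such a preprocessing step is given below. It is not the authors' pipeline: the stopword list is a tiny illustrative sample, the tokenizer is a simple assumption, and the `stem` argument is a hypothetical hook that defaults to the identity function, where a Porter stemmer implementation would be plugged in.

```python
import re

# Tiny illustrative stopword list; a real experiment would use a fuller one.
STOPWORDS = {"the", "of", "and", "in", "a", "is", "with", "for", "to"}


def preprocess(text, stem=lambda token: token):
    """Lowercase, tokenize, drop stopwords, then stem each remaining token.

    `stem` is a placeholder callable (identity by default); a Porter stemmer
    would be passed in here. No stemming is implemented in this sketch.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOPWORDS]


tokens = preprocess("The treatment of heart failure in patients with diabetes")
```

Keeping the stemmer as a parameter makes it easy to compare results with and without stemming.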

2.2 Feature Selection

Feature selection techniques generally fall into three categories: filters, wrappers, and embedded methods. Filter techniques are computationally fast; however, they usually do not take feature dependencies into consideration [1]. Filter-based methods are widely preferred, especially in the text classification domain, and a large number of filter-based techniques exist for selecting distinctive features in text classification. In this study, two filter-based feature selection methods, namely Gini index (GI) and distinguishing feature selector (DFS), were used. These methods are explained in detail below.

2.2.1 Gini Index (GI)

GI is an improved version of the method originally used to find the best split of features in decision trees [35]. It is an accurate and fast method, computed as follows:

$$ GI(t) = \sum_{i=1}^{M} P(t \mid C_i)^{2} \, P(C_i \mid t)^{2} $$

(1)

where $ M $ is the total number of classes, $ P(t \mid C_i) $ is the probability of term $ t $ given the presence of class $ C_i $, and $ P(C_i \mid t) $ is the probability of class $ C_i $ given the presence of term $ t $.
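As an illustration of Eq. (1), the sketch below scores terms on a toy corpus. It assumes that the probabilities are estimated from document counts (the fraction of documents of a class that contain the term, and the fraction of documents containing the term that belong to the class); the paper does not spell out the estimator, and the function names and data are hypothetical.

```python
from collections import Counter, defaultdict


def gini_index_scores(tokenized_docs, labels):
    """Score every term with GI(t) = sum_i P(t|C_i)^2 * P(C_i|t)^2."""
    class_sizes = Counter(labels)
    df_in_class = defaultdict(Counter)  # term -> class -> number of documents
    for tokens, label in zip(tokenized_docs, labels):
        for term in set(tokens):
            df_in_class[term][label] += 1

    scores = {}
    for term, per_class in df_in_class.items():
        df_total = sum(per_class.values())
        score = 0.0
        # Classes in which the term never occurs contribute zero and are skipped.
        for label, df in per_class.items():
            p_t_given_c = df / class_sizes[label]
            p_c_given_t = df / df_total
            score += (p_t_given_c ** 2) * (p_c_given_t ** 2)
        scores[term] = score
    return scores


def top_k(scores, k):
    """Return the k highest-scoring terms."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


docs = [["heart", "pain"], ["heart", "valve"], ["tumor", "pain"], ["tumor", "growth"]]
labels = ["C14", "C14", "C4", "C4"]
gi = gini_index_scores(docs, labels)
selected = top_k(gi, 2)
```

Here "heart" appears in every C14 document and in no C4 document, so it receives the maximum score of 1, while "pain", which is spread evenly over both classes, scores 0.125.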

2.2.2 Distinguishing Feature Selector (DFS)

DFS is a recent, successful feature selection method for text classification [1]. It aims to select distinctive features while eliminating uninformative ones according to some predetermined criteria. DFS is expressed by the following formula:

$$ DFS(t) = \sum_{i=1}^{M} \frac{P(C_i \mid t)}{P(\overline{t} \mid C_i) + P(t \mid \overline{C}_i) + 1} $$

(2)

where $ M $ is the total number of classes, $ P(C_i \mid t) $ is the conditional probability of class $ C_i $ given the presence of term $ t $, $ P(\overline{t} \mid C_i) $ is the conditional probability of the absence of term $ t $ given class $ C_i $, and $ P(t \mid \overline{C}_i) $ is the conditional probability of term $ t $ given all the classes except $ C_i $.
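Eq. (2) can be sketched in the same way. As before, this is a minimal illustration that assumes document-count estimates of the three probabilities; the function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def dfs_scores(tokenized_docs, labels):
    """Score every term with
    DFS(t) = sum_i P(C_i|t) / (P(not t|C_i) + P(t|not C_i) + 1).
    """
    class_sizes = Counter(labels)
    n_docs = len(labels)
    df_in_class = defaultdict(Counter)  # term -> class -> number of documents
    for tokens, label in zip(tokenized_docs, labels):
        for term in set(tokens):
            df_in_class[term][label] += 1

    scores = {}
    for term, per_class in df_in_class.items():
        df_total = sum(per_class.values())
        score = 0.0
        for label, n_in_class in class_sizes.items():
            df = per_class[label]  # a Counter returns 0 for unseen classes
            p_c_given_t = df / df_total
            p_not_t_given_c = 1.0 - df / n_in_class
            n_other = n_docs - n_in_class
            p_t_given_not_c = (df_total - df) / n_other if n_other else 0.0
            score += p_c_given_t / (p_not_t_given_c + p_t_given_not_c + 1.0)
        scores[term] = score
    return scores


docs = [["heart", "pain"], ["heart", "valve"], ["tumor", "pain"], ["tumor", "growth"]]
labels = ["C14", "C14", "C4", "C4"]
dfs = dfs_scores(docs, labels)
```

On this toy corpus, "heart" (present in all C14 documents and in no others) scores 1.0, "pain" (spread evenly over both classes) scores 0.5, and "valve" (present in only half of the C14 documents) falls in between.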

3 Pattern Classifiers

In this study, two classifiers from the Weka [36] package were invoked programmatically: Bayesian networks and the C4.5 decision tree. These algorithms are explained in detail below.

3.1 Bayesian Networks (BN)

BN is one of the methods used for modeling and for representing state transitions [37]. It is often used to model discrete and continuous variables of multinomial data. These networks encode the relationships between the variables in the modeled data. In a BN, the nodes are connected by arrows that indicate the direction of the relationship between them.
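As general background on Bayesian networks (not a detail taken from this study), the directed structure corresponds to a factorization of the joint distribution over the modeled variables. The symbols $ X_1, \ldots, X_n $ for the variables and $ Pa(X_j) $ for the parents of node $ X_j $ are introduced here for illustration only:

$$ P(X_1, \ldots, X_n) = \prod_{j=1}^{n} P(X_j \mid Pa(X_j)) $$

For classification, one of the variables is the class, and a document is assigned to the class value with the highest probability given the observed feature values.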

3.2 C4.5 Decision Tree (DT)

The main purpose of decision tree algorithms is to split the feature space into unique regions corresponding to the classes [1]. An unknown feature vector is assigned to a class via a sequence of yes/no decisions along a path of nodes in the tree. C4.5 is an algorithm for generating a decision tree and is known as one of the successful decision tree classification algorithms.
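The sequence of yes/no decisions can be sketched as follows. The tree here is built by hand purely for illustration; it is not learned by C4.5, and the `Node` structure, terms, and thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: Optional[str] = None   # set only on leaf nodes
    term: Optional[str] = None    # feature tested at internal nodes
    threshold: float = 0.0
    yes: Optional["Node"] = None  # followed when weight > threshold
    no: Optional["Node"] = None   # followed otherwise


def classify(node, features):
    """Follow yes/no decisions from the root until a leaf is reached."""
    while node.label is None:
        weight = features.get(node.term, 0.0)
        node = node.yes if weight > node.threshold else node.no
    return node.label


# Hand-built example tree over TF weights (not produced by C4.5).
tree = Node(
    term="tumor", threshold=0.0,
    yes=Node(label="C4"),
    no=Node(term="heart", threshold=1.0,
            yes=Node(label="C14"),
            no=Node(label="other")),
)

prediction = classify(tree, {"heart": 2})
```

A document whose feature vector has no "tumor" weight but a "heart" weight above 1 ends at the C14 leaf after two decisions.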

4 Experimental Work

In this section, an in-depth investigation is carried out to measure the performance of the feature selection methods, term weighting methods, and classifiers. For this purpose, combinations of the feature selection methods with the BN and DT classifiers are analyzed to determine the best combination for both datasets, under two term weighting methods, TF and TF-IDF. The effect of dimension reduction can also be inferred from the experimental results. In the following subsections, the datasets and success measures are briefly described, and then the experimental results are presented.

4.1 Datasets

In this study, two different datasets containing MEDLINE documents were used. The first is a subset of the well-known OHSUMED dataset. It consists of medical abstracts collected in 1991 related to 23 cardiovascular disease categories. As this study deals with single-label text classification, documents belonging to multiple categories were eliminated. Also, only 10 classes were used so that the categories match those of the second dataset. The second dataset is a self-constructed dataset whose data was retrieved programmatically by querying the PubMed search platform. It was constructed by retrieving XML results containing medical abstracts and parsing them appropriately. Documents with multiple categories were likewise removed from this dataset, since the study concerns single-label classification. This dataset differs from the first in its origin: it consists only of MEDLINE documents from medical journals in Turkey rather than from different locations. It has the same categories as the first dataset but a smaller amount of data. In this dataset, the 10 categories with a sufficient number of documents were used for the evaluation. Detailed information on the datasets is provided in Tables 1 and 2. In the experiments, 70% of the documents in each class were used for training and the remaining 30% for testing.
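The single-label filtering and the per-class 70/30 split can be sketched as below. The paper does not say how documents were assigned to the two parts, so a seeded random shuffle within each class is assumed; the record layout `(text, list_of_categories)` and the function names are hypothetical.

```python
import random
from collections import defaultdict


def single_label_only(records):
    """Keep (text, label) pairs for documents assigned exactly one category."""
    return [(text, cats[0]) for text, cats in records if len(cats) == 1]


def split_per_class(labeled_docs, train_fraction=0.7, seed=0):
    """Split each class separately so both parts keep the class proportions."""
    by_class = defaultdict(list)
    for text, label in labeled_docs:
        by_class[label].append(text)

    rng = random.Random(seed)  # assumption: random assignment within a class
    train, test = [], []
    for label, texts in by_class.items():
        texts = texts[:]       # do not reorder the caller's data
        rng.shuffle(texts)
        cut = round(train_fraction * len(texts))
        train += [(t, label) for t in texts[:cut]]
        test += [(t, label) for t in texts[cut:]]
    return train, test


# Toy records: ten documents per class plus one multi-label document.
records = [(f"c4 doc {i}", ["C4"]) for i in range(10)]
records += [(f"c14 doc {i}", ["C14"]) for i in range(10)]
records.append(("mixed doc", ["C4", "C14"]))

labeled = single_label_only(records)
train, test = split_per_class(labeled)
```

With ten documents per class, each class contributes seven training and three test documents, and the multi-label document is dropped before the split.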

Table 1 Ohsumed dataset

Table 2 Self-constructed dataset

4.2 Accuracy Analysis

Varying numbers of features, selected by each feature selection method, were fed into the DT and BN classifiers. In the experiments, stopword removal and stemming were applied, with the widely known Porter stemmer as the stemming algorithm. GI and DFS were used as feature selection methods. Dimension reduction was carried out by constructing feature sets consisting of 300, 500, 1000, and 2000 features. The F-score [38] was used as the success measure and is reported both per class and as a weighted average. The F-scores obtained on the two datasets are listed in Tables 3 and 4 for TF weighting and in Tables 5 and 6 for TF-IDF weighting. The best results are shown in bold.
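The success measure can be illustrated with a short sketch that computes the class-specific F-score and a weighted average. It assumes the balanced F-score (harmonic mean of precision and recall) and weights each class by its number of test documents; the paper does not state the weighting explicitly, and the function name and labels are hypothetical.

```python
from collections import Counter


def f_scores(true_labels, predicted_labels):
    """Return (per-class F-scores, support-weighted average F-score)."""
    support = Counter(true_labels)         # test documents per class
    predicted = Counter(predicted_labels)  # predictions per class
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)

    per_class = {}
    for label in support:
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        recall = correct[label] / support[label]
        denom = precision + recall
        per_class[label] = 2 * precision * recall / denom if denom else 0.0

    total = sum(support.values())
    weighted = sum(per_class[c] * support[c] for c in support) / total
    return per_class, weighted


true = ["C4", "C4", "C4", "C14", "C14"]
pred = ["C4", "C4", "C14", "C14", "C14"]
per_class, weighted = f_scores(true, pred)
```

In this toy case C4 has precision 1 and recall 2/3, C14 has precision 2/3 and recall 1, so both classes and the weighted average come out at 0.8.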

Table 3 Results on Ohsumed dataset (tf-weighted)

Table 4 Results on self-constructed dataset (tf-weighted)

Table 5 Results on Ohsumed dataset (tf-idf weighted)

Table 6 Results on self-constructed dataset (tf-idf weighted)

Considering the highest weighted average F-scores, DFS is superior to GI in most cases. In a small number of experiments, DFS and GI give similar results on both datasets. DFS appears more successful when the feature size is low. The scores obtained with TF weighting are generally higher than those obtained with TF-IDF weighting, although TF-IDF is superior in a small number of experiments. Both TF and TF-IDF weighting are successful when the feature size is high. Moreover, although the two datasets originate from different sources and have different class distributions, the maximum classification performances obtained on them are similar. The BN classifier is more successful than the DT classifier in most cases.

Considering class-based F-scores, the classification performances obtained on the neoplasms (C4) and cardiovascular diseases (C14) categories are generally higher than the others for the first dataset, under both TF and TF-IDF weighting. This may be due to the large number of training instances in these two categories. For the self-constructed dataset, the classification performances obtained on the parasitic diseases (C3) and cardiovascular diseases (C14) categories are generally higher than the others, again regardless of whether TF or TF-IDF weighting is used. In this case, these are not the classes with the largest number of documents, which may be caused by the small amount of data for most of the categories. Also, for most of the class-based F-scores, the combination of DFS and BN appears better than the others.

5 Conclusions

In this study, the performances of two widely known classifiers are extensively analyzed using two feature selection methods and two term weighting methods. The analysis is carried out on two different datasets consisting of MEDLINE documents, with stopword removal and stemming applied as preprocessing steps. Experimental results show that the most successful setting is the combination of the Bayesian network classifier, distinguishing feature selector, and TF term weighting. As future work, a new dataset containing Turkish versions of the documents in the self-constructed dataset may be compiled, and classification performances on these two datasets, which would contain the same documents in different languages, can be extensively analyzed. This paper revises and extends the research results presented earlier in [25].

References

1. Uysal, A.K., Gunal, S.: A novel probabilistic feature selection method for text classification. Knowl.-Based Syst. 36, 226–235 (2012)
2. Idris, I., Selamat, A., Nguyen, N.T., Omatu, S., Krejcar, O., Kuca, K., Penhaker, M.: A combined negative selection algorithm—particle swarm optimization for an email spam detection system. Eng. Appl. Artif. Intell. 39, 33–44 (2015)
3. Zhang, C., Wu, X., Niu, Z., Ding, W.: Authorship identification from unstructured texts. Knowl.-Based Syst. 66, 99–111 (2014)
4. Ozel, S.A.: A Web page classification system based on a genetic algorithm using tagged-terms as features. Expert Syst. Appl. 38(4), 3407–3415 (2011)
5. Agarwal, B., Mittal, N.: Prominent Feature Extraction for Sentiment Analysis, pp. 21–45. Springer (2016)
6. Pak, M.Y., Gunal, S.: Sentiment classification based on domain prediction. Elektronika ir Elektrotechnika 22(2), 96–99 (2016)
7. Garla, V., Taylor, C., Brandt, C.: Semi-supervised clinical text classification with Laplacian SVMs: an application to cancer case management. J. Biomed. Inform. 46(5), 869–875 (2013)
8. Yetisgen-Yildiz, M., Pratt, W.: The effect of feature representation on MEDLINE document classification. In: AMIA Annual Symposium Proceedings, p. 849. American Medical Informatics Association (2005)
9. Yepes, A.J.J., Plaza, L., Carrillo-de-Albornoz, J., Mork, J.G., Aronson, A.R.: Feature engineering for MEDLINE citation categorization with MeSH. BMC Bioinform. 16(1), 1 (2015)
10. MEDLINE. [http://www.nlm.nih.gov/databases/databases_medline.html]. Accessed 2015
11. Pubmed. [http://www.ncbi.nlm.nih.gov/pubmed]. Accessed 2015
12. Rak, R., Kurgan, L.A., Reformat, M.: Multilabel associative classification categorization of MEDLINE articles into MeSH keywords. IEEE Eng. Med. Biol. Mag. 26(2), 47 (2007)
13. Spat, S., Cadonna, B., Rakovac, I., Gutl, C., Leitner, H., Stark, G., Beck, P.: Multi-label text classification of German language medical documents. In: Proceedings of the 12th World Congress on Health (Medical) Informatics; Building Sustainable Health Systems, p. 2343 (2007)
14. Camous, F., Blott, S., Smeaton, A.F.: Ontology-based MEDLINE document classification. In: Bioinformatics Research and Development, pp. 439–452. Springer, Berlin Heidelberg (2007)
15. Poulter, G.L., Rubin, D.L., Altman, R.B., Seoighe, C.: MScanner: a classifier for retrieving Medline citations. BMC Bioinform. 9(1), 108 (2008)
16. Yi, K., Beheshti, J.: A hidden Markov model-based text classification of medical documents. J. Inf. Sci. (2008)
17. Frunza, O., Inkpen, D., Matwin, S., Klement, W., O'Blenis, P.: Exploiting the systematic review protocol for classification of medical abstracts. Artif. Intell. Med. 51(1), 17–25 (2011)
18. Dollah, R.B., Aono, M.: Ontology based approach for classifying biomedical text abstracts. Int. J. Data Eng. (IJDE) 2(1), 1–15 (2011)
19. Albitar, S., Espinasse, B., Fournier, S.: Semantic enrichments in text supervised classification: application to medical domain. In: The Twenty-Seventh International Flairs Conference (2014)
20. Uysal, A.K., Gunal, S.: Text classification using genetic algorithm oriented latent semantic features. Expert Syst. Appl. 41(13), 5938–5947 (2014)
21. Parlak, B., Uysal, A.K.: Classification of medical documents according to diseases. In: 23rd IEEE Signal Processing and Communications Applications Conference (SIU), pp. 1635–1638 (2015)
22. Rais, M., Lachkar, A.: Evaluation of disambiguation strategies on biomedical text categorization. In: International Conference on Bioinformatics and Biomedical Engineering, pp. 790–801. Springer International Publishing (2016)
23. Baker, S., Silins, I., Guo, Y., Ali, I., Högberg, J., Stenius, U., Korhonen, A.: Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics 32(3), 432–440 (2016)
24. Morid, M.A., Fiszman, M., Raja, K., Jonnalagadda, S.R., Del Fiol, G.: Classification of clinically useful sentences in clinical evidence resources. J. Biomed. Inform. 60, 14–22 (2016)
25. Parlak, B., Uysal, A.K.: The impact of feature selection on medical document classification. In: 11th Iberian Conference on Information Systems and Technologies (CISTI), pp. 1–5 (2016)
26. Pakhomov, S.V., Buntrock, J.D., Chute, C.G.: Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques. J. Am. Med. Inform. Assoc. 13(5), 516–525 (2006)
27. Van Der Zwaan, J., Sang, E.T.K., de Rijke, M.: An experiment in automatic classification of pathological reports. In: Artificial Intelligence in Medicine, pp. 207–216. Springer, Berlin Heidelberg (2007)
28. Waraporn, P., Meesad, P., Clayton, G.: Ontology-supported processing of clinical text using medical knowledge integration for multi-label classification of diagnosis coding (2010). arXiv:1004.1230
29. Boytcheva, S.: Automatic matching of ICD-10 codes to diagnoses in discharge letters. In: Proceedings of the Workshop on Biomedical Natural Language Processing, pp. 11–18. Hissar, Bulgaria (2011)
30. Ceylan, N.M., Alpkocak, A., Esatoglu, A.E.: Tıbbi Kayıtlara ICD-10 Hastalık Kodlarının Atanmasına Yardımcı Akıllı Bir Sistem (2012)
31. Arifoglu, D., Deniz, O., Alecakır, K., Yondem, M.: CodeMagic: semi-automatic assignment of ICD-10-AM codes to patient records. In: Information Sciences and Systems 2014, pp. 259–268. Springer International Publishing (2014)
32. Uysal, A.K., Gunal, S., Ergin, S., Gunal, E.S.: Detection of SMS spam messages on mobile phones. In: 20th IEEE Signal Processing and Communications Applications Conference (SIU), pp. 1–4 (2012)
33. Manning, C.D., Raghavan, P., Schutze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, USA (2008)
34. Porter, M.F.: An algorithm for suffix stripping. Program 14, 130–137 (1980)
35. Shang, W., Huang, H., Zhu, H., Lin, Y., Qu, Y., Wang, Z.: A novel feature selection algorithm for text categorization. Expert Syst. Appl. 33(1), 1–5 (2007)
36. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11(1) (2009)
37. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, Jim Gray (ed.). Morgan Kaufmann Publishers, San Francisco (2005)
38. Goutte, C., Gaussier, E.: A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In: Proceedings of the Europe Conference Information Retrieval Research, pp. 345–359 (2005)
39. Rocha, A., Rocha, B.: Adopting nursing health record standards. Inform. Health Soc. Care 39(1), 1–14 (2014)

Acknowledgements

This work was supported by Anadolu University, Fund of Scientific Research Projects under grant number 1503F136.

Author information

Authors and Affiliations

Department of Computer Engineering, Anadolu University, Eskisehir, Turkey
Bekir Parlak & Alper Kursat Uysal

Corresponding author

Correspondence to Bekir Parlak .

Editor information

Editors and Affiliations

DEI, University of Coimbra, Coimbra, Portugal
Álvaro Rocha
DSI, University of Minho, Guimarães, Portugal
Luís Paulo Reis

Cite this paper

Parlak, B., Uysal, A.K. (2018). On Feature Weighting and Selection for Medical Document Classification. In: Rocha, Á., Reis, L. (eds) Developments and Advances in Intelligent Systems and Applications. Studies in Computational Intelligence, vol 718. Springer, Cham. https://doi.org/10.1007/978-3-319-58965-7_19

DOI: https://doi.org/10.1007/978-3-319-58965-7_19
Published: 08 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58963-3
Online ISBN: 978-3-319-58965-7
eBook Packages: Engineering, Engineering (R0)
