Abstract
In this paper, the problem of classification of imbalanced text data is addressed. Initially, the imbalance present across the classes is reduced by converting each class into multiple smaller subclasses. Each document is then represented in a lower-dimensional space of size equal to the number of subclasses using a term-class relevance (TCR) measure-based transformation technique. Each subclass is further represented as an interval-valued feature vector to achieve compactness and stored in a knowledgebase. A symbolic classifier is then used for the classification of unlabeled text documents. Experiments are conducted on the Reuters-21578 and TDT2 text datasets. The results reveal that the proposed method performs better than other existing methods.
1 Introduction
With the advancement of Web technology, the amount of text data on the Internet is increasing tremendously, and managing such a huge amount of text data is very difficult. Text classification, the task of automatically classifying unknown text documents into predefined categories, is central to most text management activities. Applications such as news categorization, online marketing, e-mail and spam filtering, and text mining have also motivated researchers to develop efficient methods for text classification.
The three important factors which affect the efficiency of a text classification system are (i) whether the text is structured or unstructured, (ii) the number of features used to represent the text documents, and (iii) the number of documents present in the different classes of the given collection. Numerous methods can be traced in the literature of text classification to handle these factors effectively [9]. Any machine learning approach for text classification requires a good representation to classify the text documents efficiently [23]. The most widely used representation scheme is the vector space model (VSM), where a document is represented as a vector of dimension equal to the number of terms present in the corpus. The conventional VSM is not effective, as it leads to a very high-dimensional and sparse representation of the documents. Moreover, as the number of words increases, the dimension of the representative feature vectors also increases, while only a small subset of the entire population of features helps achieve accurate classification. Feature selection is therefore mandatory to reduce the dimension by selecting only the important words for representing the documents.
In feature selection, the features are ranked based on some criterion which measures the importance of each feature, so that a best subset of features can be chosen for classification. The number of features to be used is usually fixed through empirical analysis. Some of the important feature selection methods available in the literature are terms-based discriminative information space [11], Fisher discriminant analysis [27], chi-square [26], information gain [1], term weighting using term frequency [18, 19, 22, 29], term frequency and document frequency-based feature selection [2], ontology-based selection [4], and global feature selection methods [14, 15, 21]. In the literature of text classification, we can also trace methods which use transformation of features for dimensionality reduction [9], such as latent semantic indexing (LSI) [13] and genetic algorithms (GA) [3].
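As an illustration of such ranking criteria, the chi-square statistic cited above [26] can be computed from a 2 × 2 contingency table of document frequencies for a term-class pair. The following sketch uses a toy corpus of our own invention, purely for illustration:

```python
def chi_square(docs, labels, term, cls):
    # 2x2 contingency counts over documents: presence of `term` vs membership in `cls`
    A = sum(1 for d, l in zip(docs, labels) if term in d and l == cls)      # term & class
    B = sum(1 for d, l in zip(docs, labels) if term in d and l != cls)      # term & other class
    C = sum(1 for d, l in zip(docs, labels) if term not in d and l == cls)  # no term & class
    D = sum(1 for d, l in zip(docs, labels) if term not in d and l != cls)  # neither
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

# toy corpus: each document is a set of terms, with a class label
docs = [{"wheat", "grain"}, {"wheat", "export"}, {"stock", "rally"}, {"stock", "bond"}]
labels = [0, 0, 1, 1]

# "wheat" perfectly separates class 0 from class 1, so its score is maximal
print(chi_square(docs, labels, "wheat", 0))  # 4.0
```

Terms are ranked by this score per class (or by the maximum or average across classes), and the top-k terms are retained as the reduced feature set.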
Most of the methods available in the literature work well for balanced data collections but fail to perform well on imbalanced collections [15, 18, 29]. A few works focus on the classification of imbalanced text collections based on clustering methods, different types of metrics for imbalanced text classification, and combinations of clustering and dimensionality reduction [12].
2 Background and Motivation
Classification of an imbalanced text collection is one of the challenging issues in designing an effective text classification system. To this end, [20] and [12] have recommended converting an imbalanced text collection into a balanced one by dividing larger classes into multiple smaller subclasses through classwise clustering. In their methods, once the subclasses are identified, each subclass is treated as a class in its own right. Hence, the original K-class classification problem becomes a Q-class classification problem, where K is the total number of classes originally present and Q (≫K) is the total number of subclasses obtained through clustering. Though the two methods are similar in how they handle class imbalance, the approach in [20] has outperformed that of [12]. The method in [12] uses a transformation approach which represents the documents in a K-dimensional space, where K is the number of classes present in the collection, while in [20] the dimensionality reduction is achieved through feature selection. Our observation is that, once the natural groups within the different classes are identified, representing the documents as feature vectors in the very small K-dimensional space is not advisable, as it is difficult to capture the variation within each group effectively. Hence, a classifier may not generalize well on such a lower-dimensional space. With this motivation, in this paper, once the original K classes are converted into Q clusters, we recommend representing the documents as Q-dimensional feature vectors by applying the same transformation used in [12]. Since the value of Q is greater than K but not as large as the size of the bag of words, a classifier trained on a Q-dimensional vector space is expected to effectively capture the variations across the different groups.
3 Representation of Documents in Lower-Dimensional Space
In the proposed method, the text documents are represented in a lower-dimensional space without using any explicit dimensionality reduction technique. The conventional bag-of-words representation leads to a high-dimensional vector space which is not suitable for effective classification unless dimensionality reduction is applied through an effective feature selection or feature transformation method. To overcome this difficulty, [10] proposed a text representation method which reduces the dimension of a text document from the number of features n to the number of classes K. The same method was later adopted by Guru and Suhil [8], who improved its performance in text categorization through the introduction of a new term weighting scheme. The methods of [10] and Guru and Suhil [8] are suitable for the classification of imbalanced text documents. Hence, we use the reduced representation of [10] along with the term-class relevance (TCR) measure of Guru and Suhil [8] to represent the text documents in a lower-dimensional form, as explained below.
In this method, each document is initially represented in the form of a matrix F of size n × K, as shown in Fig. 1, where n is the number of terms and K is the number of classes present in the corpus. The value at each location F(i, j) is the weight of the ith term ti with respect to the class Cj, computed using the TCR measure. Then, a feature vector f of dimension K is created as a representative vector of the document by aggregating the matrix F such that f(j) is the average relevance of the document's terms to the class Cj. Thus, the dimension of the document is reduced to the number of classes K, which is very small compared to the original dimension n. The TCR score for every term is calculated as
where
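The matrix-based reduction can be sketched as follows. The TCR weighting itself is defined in Guru and Suhil [8] and is not reproduced here; the `tcr` stand-in below (a class-conditional term frequency ratio) is purely an illustrative assumption, as are the toy vocabulary and frequency counts:

```python
import numpy as np

def represent(doc_terms, vocab, classes, class_term_freq):
    """Reduce a document to a K-dimensional vector via an n x K term-class weight matrix F."""
    n, K = len(vocab), len(classes)
    F = np.zeros((n, K))
    for i, t in enumerate(vocab):
        if t not in doc_terms:
            continue  # terms absent from the document contribute zero weight
        for j, c in enumerate(classes):
            freq_in_c = class_term_freq[c].get(t, 0)
            total = sum(class_term_freq[cc].get(t, 0) for cc in classes)
            # stand-in for TCR(t_i, C_j); the paper uses the measure of [8]
            F[i, j] = freq_in_c / total if total else 0.0
    return F.mean(axis=0)  # f(j): average relevance of the document's terms to C_j

# toy class-conditional term frequencies (illustrative assumption)
class_term_freq = {"grain": {"wheat": 5, "export": 2}, "markets": {"stock": 6, "export": 1}}
f = represent({"wheat", "export"}, ["wheat", "export", "stock"], ["grain", "markets"], class_term_freq)
print(f.shape)  # (2,) -- one entry per class
```

The document's dimension drops from n (vocabulary size) to K (number of classes), regardless of how large the vocabulary is.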
4 Proposed Method
The general architecture of the proposed model is given in Fig. 2. The different steps of proposed model are explained in subsequent subsections.
4.1 Clustering
It has been observed from the literature of text classification that most models work well on a balanced corpus. On an imbalanced corpus, the classes with a large number of documents will generally dominate the classes with fewer documents. One solution to this issue is to convert the imbalanced corpus into a balanced one. Lavanya et al. [12] and Suhil et al. [20] recommend splitting a larger class into smaller subclasses through classwise clustering, since larger classes contain large intraclass variations. Hence, in this model we convert large classes into small subclasses by applying a hierarchical clustering technique. Finally, the clusters obtained from all K classes are treated as classes, and learning is applied to these new classes.
Formally, let {C1, C2, C3, …, CK} be the K classes present in the corpus. Each class Cj is converted into clusters of almost equal size by applying a hierarchical clustering technique. Let \( \left\{ {\mathop {cl}\nolimits_{1}^{j} ,\mathop {cl}\nolimits_{2}^{j} ,\mathop {cl}\nolimits_{3}^{j} , \ldots ,\mathop {cl}\nolimits_{{\mathop Q\nolimits_{j} }}^{j} } \right\} \) be the Qj clusters obtained for the class Cj. Similarly, let {Q1, Q2, Q3, …, QK} be the numbers of clusters obtained for the K different classes, respectively; the total number of clusters is then given by \( Q = \sum\nolimits_{j = 1}^{K} {Q_{j} } \).
The number of clusters varies from class to class, based on the size of the class and the intraclass variations present in it. Since each cluster consists of similar documents, we can treat each cluster itself as a unique class. The representation scheme presented in Sect. 3 has been used to represent the documents in a lower dimension, since it is difficult to apply clustering in a high-dimensional space. Through classwise clustering we arrive at Q clusters, which we treat as Q classes; hence, the original K-class classification problem is converted into a Q-class classification problem.
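The classwise splitting step can be sketched with standard agglomerative clustering, cut by the inconsistency coefficient mentioned in Sect. 5.2. The data, linkage method, and threshold below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def split_class(X, t=1.0):
    """Split one class's documents (rows of X) into subclasses via hierarchical clustering."""
    Z = linkage(X, method="average")             # agglomerative merge tree
    # cut where links are "inconsistent" with the links merged below them
    return fcluster(Z, t, criterion="inconsistent")

# one synthetic "class" containing two internal groups of document vectors,
# so we expect the cut to recover at least two subclasses
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 5)), rng.normal(3, 0.1, (20, 5))])
subclass_ids = split_class(X)
print(len(set(subclass_ids)))
```

Each class is processed independently this way, and the resulting subclass labels replace the original class labels for the rest of the pipeline.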
4.2 Representation of Documents of Each Cluster
Given a cluster \( cl_{p}^{j} \), we represent each of its documents in the form of a Q-dimensional feature vector using TCR, as explained in Sect. 3. The major difference here is that the TCR of each term is now recomputed with respect to each cluster. Each document in a cluster is thus represented by a Q-dimensional feature vector \( f = \,\left\{ {f_{1} , \, f_{2} , \, f_{3} , \ldots ,f_{Q} } \right\} \), which is very small compared to the original dimension of the documents.
4.3 Creation of Knowledgebase of Interval-Valued Representatives
Recently, it has been shown that approaches based on symbolic data outperform conventional algorithms in clustering and classification [7, 16]. We can also find in the literature some works on symbolic text representation and classification [5, 8, 12]. In our method, cluster-based interval-valued features are used for a compact representation of documents to improve the performance on imbalanced text data.
Given a class Cj with Qj number of clusters, each cluster \( cl_{p}^{j} \) consisting of \( m_{p}^{j} \) number of documents is represented by an interval-valued feature vector. We propose to use interval-valued-type symbolic data to effectively capture the variations within a cluster of text documents. Another advantage of having such a representation is its simplicity in classifying an unknown document. Hence, the cluster \( cl_{p}^{j} \) is represented by an interval-valued symbolic representative vector \( {\text{R}}_{pj} \) as follows.
Let every document be represented by a feature vector of dimension Q given by \( \{ f_{1} ,f_{2} , \ldots ,f_{Q} \} \). Then, with respect to every feature fs, the documents of the cluster are aggregated in the form of an interval \( \left[ {\mu^{s} - \sigma^{s} ,\,\mu^{s} + \sigma^{s} } \right] \), where \( \mu^{s} \) and \( \sigma^{s} \) are, respectively, the mean and standard deviation of the values of \( f_{s} \) in the cluster. Hence, \( R_{pj} \) contains Q intervals corresponding to the Q features as,
where
\( R_{s}^{pj} \, = [\mu^{s} - \sigma^{s} ,\,\mu^{s} + \sigma^{s} ] \) is the interval formed for the sth feature of the pth cluster \( cl_{p}^{j} \) of the jth class Cj. This process of creating an interval-valued symbolic representative is applied to all Q clusters individually, to obtain Q interval representative vectors \( \{ R^{11} ,\,R^{12} , \ldots ,R^{{1Q_{1} }} ,R^{21} ,\,R^{22} , \ldots ,R^{{2Q_{2} }} , \ldots ,R^{K1} ,\,R^{K2} , \ldots ,R^{{KQ_{K} }} \} \), which are then stored in the knowledgebase for the purpose of classification.
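The construction of one interval-valued representative can be sketched in a few lines; the sample cluster below is an illustrative assumption:

```python
import numpy as np

def interval_representative(cluster):
    """Compact a cluster of Q-dimensional document vectors into Q intervals [mu-sigma, mu+sigma]."""
    cluster = np.asarray(cluster)          # shape (m_docs, Q)
    mu = cluster.mean(axis=0)              # per-feature mean over the cluster
    sigma = cluster.std(axis=0)            # per-feature standard deviation
    return np.stack([mu - sigma, mu + sigma], axis=1)  # shape (Q, 2): lower/upper bounds

# toy cluster of three 2-dimensional document vectors
cluster = [[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]
R = interval_representative(cluster)
print(R.shape)  # (2, 2) -- one [lower, upper] pair per feature
```

However many documents a cluster contains, its representative is a fixed-size vector of Q intervals, which keeps the knowledgebase small.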
4.4 Classification
Given an unlabeled text document Dq, its class label is predicted by comparing it with all the representative vectors present in the knowledgebase. Initially, Dq is converted into a feature vector \( \left\{ {f_{1}^{q} ,f_{2}^{q} , \ldots ,f_{Q}^{q} } \right\} \) of dimension Q, as explained in Sect. 4.2. Then, the similarity between the crisp vector Dq and an interval-based representative vector R is computed using the similarity measure proposed by [6] as follows:
where
Similarly, the similarity of Dq with all the Q representative vectors present in the knowledgebase is computed. The class of the cluster that attains the highest similarity with Dq is assigned as the class of Dq, as shown in Eq. (6).
where \( R^{ij} \) is the representative of the jth cluster of the ith class.
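The classification step can be sketched as follows. Since Eq. (6) and the measure of [6] are not reproduced above, the per-feature similarity here (full score inside the interval, decaying with distance outside it) is an assumption modeled on symbolic crisp-to-interval similarities of that kind, and the knowledgebase contents are toy values:

```python
def similarity(f, intervals):
    """Average per-feature similarity of a crisp vector to an interval representative."""
    total = 0.0
    for v, (lo, hi) in zip(f, intervals):
        if lo <= v <= hi:
            total += 1.0                                    # value falls inside the interval
        else:
            # decay with distance to the nearer interval bound (assumed form)
            total += max(1 / (1 + abs(v - lo)), 1 / (1 + abs(v - hi)))
    return total / len(f)

def classify(f, knowledgebase):
    # knowledgebase: list of (class_label, interval_vector) pairs, one per cluster;
    # the winning cluster's parent class is returned
    return max(knowledgebase, key=lambda kb: similarity(f, kb[1]))[0]

kb = [("grain", [(0.4, 0.6), (0.0, 0.2)]),
      ("markets", [(0.0, 0.2), (0.4, 0.6)])]
print(classify([0.5, 0.1], kb))  # grain
```

Because each comparison is a single pass over Q intervals, classifying a document costs O(Q²) overall (Q features against Q representatives), independent of the training set size.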
5 Experimentation and Results
We have conducted experiments to verify the efficiency of the proposed model under different training and testing sets. The performance of the proposed model has been evaluated using precision, recall, and F-measure, in terms of both micro- and macro-averaging. The following sections present the details of the imbalanced datasets considered for the experimentation and the results obtained.
5.1 Dataset and Experimental Setup
Dataset
To evaluate the efficiency of the model, we have conducted experiments on two benchmark imbalanced text datasets. The first is Reuters-21578, collected from the Reuters newswire; we have considered a total of 7285 documents from the top 10 of its 135 classes, with 18,221 features. The second dataset is TDT2; a total of 8741 documents have been considered from the top 20 of its 96 classes, with 36,771 features. Figure 3 shows the distribution of the number of documents across the classes.
Experimental Setup
Experimentation has been conducted on each dataset by varying the percentage of training data from 10 to 80% in steps of 10%, with 10 random trials for each split. Performance measures such as macro-precision, macro-recall, and F-measure, averaged over the 10 trials, have been tabulated.
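Macro-averaging, used throughout the tables, computes per-class scores independently and then averages them, so small classes weigh as much as large ones — the property that matters for imbalanced data. A minimal sketch, with toy predictions of our own:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F-measure: per-class F1 scores averaged with equal class weight."""
    classes = set(y_true) | set(y_pred)
    f_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f_scores) / len(f_scores)

y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.733
```

Micro-averaging, by contrast, pools the true/false positive counts over all classes before computing a single score, so it is dominated by the large classes; reporting both, as in Tables 1–3, exposes behavior on minority classes.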
5.2 Results and Analysis
In this section, we present the results of the proposed method on both datasets. The experiments were conducted by varying the number of clusters through the value of the inconsistency coefficient, and the number of clusters producing the best results was selected. More importantly, to evaluate the goodness of the proposed model, a quantitative comparative analysis against existing models is performed.
Tables 1 and 2 show the results of the proposed method on the Reuters-21578 and TDT2 datasets, respectively, in terms of macro-precision, macro-recall, macro-F-measure, and micro-F-measure. From these results, we can observe that the performance increases gradually with the percentage of training data.
To compare the performance of the proposed method with that of the available methods, we have selected two methods which try to handle the class imbalance by performing classwise clustering. The first method [20] uses classwise clustering for removing class imbalance and χ2 for feature selection. The second method [12] uses classwise clustering for handling class imbalance and TCR for representation. In [12], each document is represented as feature vector of dimension equal to the number of classes originally present in the dataset, whereas in the proposed method, each document is represented by a feature vector of dimension equal to the number of clusters identified after classwise clustering.
Table 3 presents the comparison of the proposed method with those of Suhil et al. [20] and Lavanya et al. [12] in terms of macro-F and micro-F for both datasets. The number of features used and the total number of clusters formed are also shown. It can be observed from Table 3 that the proposed method is better than the model of Lavanya et al. [12] in terms of both macro-F and micro-F. Against the model of Suhil et al. [20], the proposed model performs slightly worse; however, the number of features used by Suhil et al. [20] is very high compared to that of the proposed method. Thus, the model of Suhil et al. [20] is considerably more complex, as it involves handling very high-dimensional feature vectors.
6 Conclusion
In this paper, we have proposed a classwise cluster-based symbolic representation for imbalanced text classification using the term-class relevance measure. To validate the approach, we have conducted experiments on two different datasets, viz. Reuters-21578 and TDT2. The experimental results show that the proposed method works better than the class-based representation method. Hence, a classifier trained on a Q-dimensional vector space model can capture the variations across different classes. In the future, text classification can be explored with this method using different dimensionality reduction techniques and by clustering documents under various parameter choices, such as the number of clusters and the clustering technique.
References
Aghdam MH, Aghaee NG, Basiri ME (2009) Text feature selection using ant colony optimization. Expert Syst Appl 36(3):6843–6853
Azam N, Yao J (2012) Comparison of term frequency and document frequency based feature selection metrics in text categorization. Expert Syst Appl 39:4760–4768
Bharti KK, Singh PK (2015) Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering. Expert Syst Appl 42:3105–3114
Elhadad MK, Khaled M, Badran KM, Salama G (2017) A novel approach for ontology-based dimensionality reduction for web text document classification. In: International conference on information systems (ICIS)-2017, vol 978. IEEE, pp 5090–5507
Guru DS, Harish BS, Manjunath S (2010) Symbolic representation of text documents. In: Proceedings of the third annual ACM Bangalore conference (COMPUTE ‘10). ACM, New York, NY, USA, Article 18, 4 pp.
Guru DS, Nagendraswamy HS (2006) Symbolic representation of two-dimensional shapes. Pattern Recognit Lett 28:144–155
Guru DS, Prakash HN (2009) Online signature verification and recognition: an approach based on symbolic representation. IEEE TPAMI 31(6):1059–1073
Guru DS, Suhil M (2015) A novel term class relevance measure for text categorization. Procedia Comput Sci 45:13–22
Harish BS, Guru DS, Manjunath S (2010) Representation and classification of text documents: a brief review. IJCA Spec Issue on RTIPPR 110–119
Isa D, Lee LH, Kallimani VP, Rajkumar R (2008) Text document preprocessing with the Bayes formula for classification using the support vector machine. IEEE TKDE 20:1264–1272
Junejo KA, Karim A, Tahir MH, Jeon M (2016) Terms-based discriminative information space for robust text classification. Inf Sci 372:518–538
Lavanya NR, Suhil M, Guru DS, Harsha SG (2016) Cluster based symbolic representation for skewed text categorization. In: International conference on recent trends in image processing and pattern recognition (RTIP2R)-2016, vol 709. Springer-CCIS, pp 202–216
Meng J, Lin H, Yu Y (2011) A two-stage feature selection method for text categorization. Comput Math Appl 62(7):2793–2800
Pinheiro RHW, Cavalcanti GDC, Ren TI (2015) Data-driven global-ranking local feature selection methods for text categorization. Expert Syst Appl 42:1941–1949
Pinheiro RHW, Cavalcanti GDC, Correa RF, Ren TI (2012) A global-ranking local feature selection method for text categorization. Expert Syst Appl 39:12851–12857
Punitha P, Guru DS (2008) Symbolic image indexing and retrieval by spatial similarity: an approach based on B-tree. Pattern Recognit 41(6):2068–2085
Rehman A, Javed K, Babri HA (2017) Feature selection based on a normalized difference measure for text classification. Inf Process Manag 53:473–489
Rehman A, Javed K, Babri HA, Saeed M (2015) Relative discrimination criterion—a novel feature ranking method for text data. Expert Syst Appl 42:3670–3681
Sabbaha T, Selamat A, Selamat MH, Fawaz S, Viedmae AEH, Krejcarg O (2017) Modified frequency-based term weighting schemes for text classification. Appl Soft Comput 58:193–206
Suhil M, Guru DS, Lavanya NR, Harsha SG (2016) Simple yet effective classification model for skewed text categorization. In: International conference on computing, communications and informatics (ICACCI)-2016. IEEE, pp 904–910
Uysal AK (2016) An improved global feature selection scheme for text classification. Expert Syst Appl 43:82–92
Uysal AK, Gunal S (2012) A novel probabilistic feature selection method for text classification. Knowl-Based Syst 36:226–235
Vieira AS, Borrajo L, Iglesias EL (2016) Improving the text classification using clustering and a novel HMM to reduce the dimensionality. Comput Methods Programs Biomed 136:119–130
Wang D, Zhang H, Li R, Lv W, Wang D (2014) t-Test feature selection approach based on term frequency for text categorization. Pattern Recognit Lett 45:1–10
Yang J, Liu Y, Zhu X, Liu Z, Zhang X (2012) A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization. Inf Process Manag 48:741–754
Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization. In: Proceedings of the 14th international conference on machine learning, vol 97, pp 412–420
Zeina D, Al-Anzi FS (2017) Employing Fisher discriminant analysis for Arabic text classification. Comput Electr Eng 000:1–13
Zhang L, Jiang L, Li C, Kong G (2016) Two feature weighting approaches for naive Bayes text classifiers. Knowl-Based Syst 100(c):137–144
Zong W, Wu F, Chu LK, Sculli D (2015) A discriminative and semantic feature selection method for text categorization. Int J Prod Econ 165:215–222
© 2019 Springer Nature Singapore Pte Ltd.
Swarnalatha, K., Guru, D.S., Anami, B.S., Suhil, M. (2019). Classwise Clustering for Classification of Imbalanced Text Data. In: Sridhar, V., Padma, M., Rao, K. (eds) Emerging Research in Electronics, Computer Science and Technology. Lecture Notes in Electrical Engineering, vol 545. Springer, Singapore. https://doi.org/10.1007/978-981-13-5802-9_8