
1 Introduction

With the advancement of Web technology, the amount of text data on the Internet is increasing tremendously, and managing such a huge amount of text data is very difficult. Text classification, the task of automatically classifying unknown text documents into predefined categories, is central to most text management activities. Applications such as news categorization, online marketing, e-mail and spam filtering, and text mining have also driven researchers to develop efficient methods for text classification.

Three important factors affect the efficiency of a text classification system: (i) whether the text is structured or unstructured, (ii) the number of features used to represent the text documents, and (iii) the number of documents present in the different classes of the given collection. Numerous methods can be traced in the literature on text classification to handle these factors effectively [9]. Any machine learning approach for text classification requires a good representation to classify the text documents efficiently [23]. The most widely used representation scheme is the vector space model (VSM), where a document is represented as a vector whose dimension equals the number of terms present in the corpus. The conventional VSM is not effective, as it leads to a very high-dimensional and sparse representation of the documents. Moreover, as the number of words increases, the dimension of the representative feature vectors of the text documents also increases, yet only a small subset of the entire population of features is helpful in achieving accurate classification. Feature selection is therefore a mandatory requirement to reduce the dimension by selecting only the important words for representing the documents.

In feature selection, the features are ranked by some criterion which measures the importance of each feature, so that the best subset of features can be chosen for classification. The number of features to be used is usually fixed through empirical analysis. Some of the important feature selection methods available in the literature are terms-based discriminative information space [11], Fisher discriminant analysis [27], chi-square [26], information gain [1], term weighting using term frequency [18, 19, 22, 29], term frequency and document frequency-based feature selection [2], ontology-based methods [4], and global feature selection methods [14, 15, 21]. In the literature on text classification, we can also trace methods which use feature transformation for dimensionality reduction [9], such as latent semantic indexing (LSI) [13] and genetic algorithms (GA) [3].

Most of the methods available in the literature work well for balanced data collections but fail to perform well on imbalanced collections [15, 18, 29]. A few works focus on the classification of imbalanced text collections based on clustering methods, metrics designed for imbalanced text classification, and combinations of clustering and dimensionality reduction [12].

2 Background and Motivation

Classification of an imbalanced text collection is one of the challenging issues in designing an effective text classification system. To this end, [20] and [12] have recommended converting an imbalanced text collection into a balanced one by dividing larger classes into a number of smaller subclasses through classwise clustering. In their methods, once the subclasses are identified, each subclass is treated as a class in its own right. Hence, the original K-class classification problem becomes a Q-class classification problem, where K is the total number of classes originally present and Q (≫K) is the total number of subclasses obtained through clustering. Though both methods are similar from the point of view of handling class imbalance, the approach in [20] has outperformed that of [12]. In [12], the method uses a transformation approach which represents the documents in a K-dimensional space, where K is the number of classes present in the collection, while in [20], dimensionality reduction is achieved through feature selection. Our observation here is that, once the natural groups within the different classes are identified, representing the documents as feature vectors in a K-dimensional space, which is very small, is not advisable, as it is difficult to capture the variation within each group effectively. Hence, a classifier may not generalize well on such a low-dimensional space. With this motivation, in this paper, once the original K classes are converted into Q clusters, it is recommended to represent the documents in the form of Q-dimensional feature vectors by applying the same transformation applied in [12]. Since the value of Q is greater than that of K but not as large as the size of the bag of words, a classifier trained on the Q-dimensional vector space is expected to capture the variations across different groups effectively.

3 Representation of Documents in Lower-Dimensional Space

In the proposed method, it is recommended to represent the text documents in a lower-dimensional space without using any explicit dimensionality reduction technique. The conventional bag-of-words representation leads to a high-dimensional vector space which is not suitable for effective classification unless dimensionality reduction is applied through an effective feature selection or feature transformation method. To overcome this difficulty, [10] proposed a text representation method which reduces the dimension of a text document from the number of features n to the number of classes K. The same method was later adopted by Guru and Suhil [8] for better performance in text categorization through the introduction of a new term weighting scheme. The methods of [10] and Guru and Suhil [8] are suitable for the classification of imbalanced text documents. Hence, we have used the reduced representation of [10] along with the term-class relevance (TCR) measure of Guru and Suhil [8] to represent the text documents in a lower-dimensional form, as explained below.

In this method, each document is initially represented in the form of a matrix F of size n × K as shown in Fig. 1, where n is the number of terms and K is the number of classes present in the corpus. The value of each location F(i, j) is the weight of the ith term ti with respect to the class Cj, computed using the TCR measure. Then, a feature vector f of dimension K is created as a representative vector of the document by aggregating its matrix F such that f(j) is the average relevancy of all the document's terms with respect to the class Cj. Here, the dimension of the document is reduced to the number of classes K, which is very small compared to the original dimension n. The TCR score for every term is calculated as

$$ TCR\left( t_{i} ,C_{j} \right) = Class\_Weight(C_{j} ) \times Class\_Term\_Weight(t_{i} ,C_{j} ) \times Class\_Term\_Density(t_{i} ,C_{j} ) $$
(1)

where

$$ Class\_Weight(C_{j} ) = \frac{{\text{No. of documents in class }C_{j} }}{\text{No. of documents in the training set}} $$
(2)
$$ Class\_Term\_Weight(t_{i} ,C_{j} ) = \frac{{\text{No. of documents in class }C_{j} \text{ containing }t_{i} }}{{\text{No. of documents containing }t_{i} \text{ in the training set}}} $$
(3)
$$ Class\_Term\_Density\left( t_{i} ,C_{j} \right) = \frac{{\text{No. of occurrences of }t_{i} \text{ in class }C_{j} }}{{\text{No. of occurrences of }t_{i} \text{ in the training collection}}} $$
(4)
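To make Eqs. (1)–(4) concrete, the following Python sketch computes the TCR score from raw counts and builds the K-dimensional representative vector f of a document. It is a minimal illustration under assumed data structures (documents as token lists, labels as a parallel list), not the authors' implementation.

```python
import numpy as np

def tcr(term, cls, train_docs, train_labels):
    """TCR(t_i, C_j) following Eqs. (1)-(4).
    train_docs: list of token lists; train_labels: class of each document."""
    in_cls = [d for d, y in zip(train_docs, train_labels) if y == cls]
    docs_with_t = [d for d in train_docs if term in d]
    occ_total = sum(d.count(term) for d in train_docs)
    if not docs_with_t or occ_total == 0:
        return 0.0
    class_weight = len(in_cls) / len(train_docs)                           # Eq. (2)
    class_term_weight = sum(term in d for d in in_cls) / len(docs_with_t)  # Eq. (3)
    class_term_density = sum(d.count(term) for d in in_cls) / occ_total    # Eq. (4)
    return class_weight * class_term_weight * class_term_density           # Eq. (1)

def represent(doc, classes, train_docs, train_labels):
    """K-dimensional vector f: f[j] averages the TCR of the document's
    terms with respect to class C_j (column-wise aggregation of F)."""
    terms = sorted(set(doc))
    F = np.array([[tcr(t, c, train_docs, train_labels) for c in classes]
                  for t in terms])
    return F.mean(axis=0)
```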
Fig. 1 Representation of a document [10]

4 Proposed Method

The general architecture of the proposed model is given in Fig. 2. The different steps of the proposed model are explained in the subsequent subsections.

Fig. 2 Architecture of the proposed text classification method

4.1 Clustering

It has been observed from the literature on text classification that most models work well for a balanced corpus. In an imbalanced corpus, the classes with a large number of documents generally dominate the classes with a smaller number of documents. One solution to this issue is to convert the imbalanced corpus into a balanced one. Lavanya et al. [12] and Suhil et al. [20] recommended splitting a larger class into smaller subclasses through classwise clustering, since the larger classes contain large intraclass variations. Hence, in this model we convert large classes into small subclasses by applying a hierarchical clustering technique. Finally, the clusters obtained from all K classes are considered to be the classes, and learning is applied to these new classes.

Formally, let {C1, C2, C3, …, CK} be the K classes present in the corpus. Each class Cj is converted into clusters of almost equal size by applying a hierarchical clustering technique. Let \( \left\{ cl_{1}^{j} ,cl_{2}^{j} ,cl_{3}^{j} , \ldots ,cl_{{Q_{j} }}^{j} \right\} \) be the Qj clusters obtained for the class Cj. Similarly, let {Q1, Q2, Q3, …, QK} be the numbers of clusters obtained for the K different classes, respectively; the total number of clusters is then given by,

$$ Q = \sum\limits_{j = 1}^{K} {Q_{j} } $$
(5)

The number of clusters varies from class to class, depending on the size of and intraclass variations present in the class. Since each cluster consists of similar documents, we can treat each cluster itself as a unique class. The representation scheme presented in Sect. 3 has been used to represent the documents in a lower dimension, since it is difficult to apply clustering in a higher-dimensional space. Through classwise clustering, we arrive at Q clusters which we treat as Q classes, and hence the original K-class classification problem is converted into a Q-class classification problem.
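A possible realization of this step is sketched below in Python with SciPy's hierarchical clustering; the linkage method and the inconsistency threshold are assumptions on our part (Sect. 5.2 notes only that the inconsistency coefficient controls the number of clusters).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def classwise_clusters(X, y, threshold=1.0):
    """Split each class into subclasses via hierarchical clustering.
    X: document vectors (already in the reduced space of Sect. 3),
    y: class labels. Returns a label vector over Q pseudo-classes."""
    new_labels = np.empty(len(y), dtype=int)
    next_id = 0
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        if len(idx) == 1:          # a singleton class is its own cluster
            new_labels[idx] = next_id
            next_id += 1
            continue
        Z = linkage(X[idx], method='average')
        # 'inconsistent' criterion: the dendrogram is cut where the
        # inconsistency coefficient exceeds the threshold
        cl = fcluster(Z, t=threshold, criterion='inconsistent')
        new_labels[idx] = next_id + cl - 1   # fcluster ids start at 1
        next_id += cl.max()
    return new_labels                        # Q = next_id total clusters
```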

4.2 Representation of Documents of Each Cluster

Given a cluster \( cl_{p}^{j} \), we represent each of its documents in the form of a Q-dimensional feature vector using TCR as explained in Sect. 3. The major difference here is that the TCR of each term is now recomputed with respect to each cluster. Each document in a cluster is thus represented in the form of a Q-dimensional feature vector \( f = \left\{ f_{1} ,f_{2} ,f_{3} , \ldots ,f_{Q} \right\} \), which is very small compared to the original dimension of the documents.
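Continuing the earlier sketches (whose helper names and variables are our assumptions), this step amounts to rerunning the same TCR representation with the Q cluster labels in place of the original K class labels:

```python
# Pseudo-classes from classwise clustering (Sect. 4.1 sketch)
cluster_labels = classwise_clusters(X_reduced, train_labels)
clusters = sorted(set(cluster_labels))

# Each document now becomes a Q-dimensional vector: TCR is recomputed
# with respect to clusters rather than the original classes.
f_q = represent(doc, clusters, train_docs, cluster_labels)
```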

4.3 Creation of Knowledgebase of Interval-Valued Representatives

Recently, it has been shown that approaches using symbolic data outperform conventional algorithms in clustering and classification [7, 16]. We can also find in the literature some works on symbolic text representation and classification [5, 8, 12]. In our method, cluster-based interval-valued features are used for a compact representation of documents to improve the performance on imbalanced text data.

Given a class Cj with Qj clusters, each cluster \( cl_{p}^{j} \), consisting of \( m_{p}^{j} \) documents, is represented by an interval-valued feature vector. We propose to use interval-valued symbolic data to effectively capture the variations within a cluster of text documents. Another advantage of such a representation is its simplicity in classifying an unknown document. Hence, the cluster \( cl_{p}^{j} \) is represented by an interval-valued symbolic representative vector \( R^{pj} \) as follows.

Let every document be represented by a feature vector of dimension Q given by \( \{ f_{1} ,f_{2} , \ldots ,f_{Q} \} \). Then, with respect to every feature fs, the documents of the cluster are aggregated in the form of an interval \( \left[ \mu^{s} - \sigma^{s} ,\,\mu^{s} + \sigma^{s} \right] \), where \( \mu^{s} \) and \( \sigma^{s} \) are, respectively, the mean and standard deviation of the values of \( f_{s} \) in the cluster. Hence, \( R^{pj} \) contains Q intervals corresponding to the Q features as,

$$ R^{pj} = \left\{ {R_{1}^{pj} ,R_{2}^{pj} , \ldots ,R_{Q}^{pj} } \right\} $$

where

\( R_{s}^{ij} \, = [\mu^{s} - \sigma^{s} ,\,\mu^{s} + \sigma^{s} ] \) is the interval formed for the sth feature of the pth cluster \( cl_{p}^{i} \) of the jth class Cj. This process of creation of interval-valued symbolic representative is applied on all the Q clusters individually to obtain Q interval representative vector \( \{ R^{11} ,\,R^{12} , \ldots ,R^{{1Q_{1} }} ,R^{21} ,\,R^{22} , \ldots ,R^{{2Q_{2} }} , \ldots ,R^{K1} ,\,R^{K2} , \ldots ,R^{{KQ_{K} }} \} \) which are then stored in the knowledgebase for the purpose of classification.

4.4 Classification

Given an unlabeled text document Dq, its class label is predicted by comparing it with all the representative vectors present in the knowledgebase. Initially, Dq is converted and represented as a feature vector \( \left\{ f_{1}^{q} ,f_{2}^{q} , \ldots ,f_{Q}^{q} \right\} \) of dimension Q, as explained in Sect. 4.2. Then, the similarity between the crisp vector Dq and an interval-valued representative vector R is computed using the similarity measure proposed by [6] as follows:

$$ SIM(D_{q} ,R) = \frac{1}{Q}\sum\limits_{s = 1}^{Q} {SIM(D_{q}^{s} } ,R_{s} ) $$

where

$$ SIM(D_{q}^{s} ,R_{s} ) = \begin{cases} 1 & \text{if } (\mu^{s} - \sigma^{s} ) \le f_{s}^{q} \le (\mu^{s} + \sigma^{s} ) \\ \max \left[ \dfrac{1}{1 + \left| (\mu^{s} - \sigma^{s} ) - f_{s}^{q} \right|}, \dfrac{1}{1 + \left| (\mu^{s} + \sigma^{s} ) - f_{s}^{q} \right|} \right] & \text{otherwise} \end{cases} $$

Similarly, the similarity of Dq with all the Q representative vectors present in the knowledgebase is computed. The class of the cluster whose representative obtains the highest similarity with Dq is assigned as the class of Dq, as shown in Eq. (6).

$$ ClassLabel(D_{q} ) = Class\left( \mathop{\arg \max }\limits_{i,j} \left( SIM(D_{q} ,R^{ij} ) \right) \right) $$
(6)

where \( R^{ij} \) is the representative of the jth cluster of the ith class.
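The decision rule of Eq. (6) can be sketched as follows; the knowledgebase layout and helper names are the assumptions introduced in Sect. 4.3, and the per-feature similarity follows the measure of [6] quoted above.

```python
import numpy as np

def sim(f, R):
    """Similarity of a crisp Q-dim vector f to an interval vector R,
    where R[s] = [lower_s, upper_s]; averages per-feature similarities."""
    lo, hi = R[:, 0], R[:, 1]
    inside = (lo <= f) & (f <= hi)                 # similarity 1 inside the interval
    outside = np.maximum(1.0 / (1.0 + np.abs(lo - f)),
                         1.0 / (1.0 + np.abs(hi - f)))
    return np.where(inside, 1.0, outside).mean()   # (1/Q) * sum over features

def classify(f, knowledgebase):
    """knowledgebase: list of (class_label, interval_vector R^{ij}).
    Returns the class of the most similar cluster, per Eq. (6)."""
    scores = [sim(f, R) for _, R in knowledgebase]
    return knowledgebase[int(np.argmax(scores))][0]
```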

5 Experimentation and Results

We have conducted experiments to verify the efficiency of the proposed model by considering different training and testing sets. The performance of the proposed model has been evaluated using precision, recall, and F-measure, in terms of both micro- and macro-averaging. The following sections present details of the imbalanced datasets considered for the experimentation and the results obtained.

5.1 Dataset and Experimental Setup

Dataset

To evaluate the efficiency of the model, we have conducted experiments on two benchmark imbalanced text datasets. The first is Reuters-21578, collected from the Reuters newswire; we have considered a total of 7285 documents from the top 10 of its 135 classes, represented by 18,221 features. The second dataset considered for our experiments is TDT2; a total of 8741 documents have been taken from the top 20 of its 96 classes, represented by 36,771 features. Figure 3 shows the distribution of the number of documents across classes.

Fig. 3 Distribution of document samples in the Reuters-21578 and TDT2 datasets

Experimental Setup

Experimentation has been conducted on each dataset by varying the percentage of training data from 10 to 80% in steps of 10%, with the remainder used for testing and with 10 random trials for each split. Performance measures such as macro-precision, macro-recall, and F-measure, averaged over the 10 trials, have been tabulated.
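As an illustration of this protocol, the sketch below runs the 10-trial evaluation at each training percentage; `train_and_classify` is a hypothetical stand-in for the full pipeline of Sects. 4.1–4.4, and scikit-learn is assumed for splitting and scoring.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def evaluate(X, y, fractions=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8), trials=10):
    """Average macro-F and micro-F over `trials` random splits
    for each training fraction, following the protocol above."""
    results = {}
    for frac in fractions:
        macro, micro = [], []
        for seed in range(trials):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=frac, stratify=y, random_state=seed)
            y_pred = train_and_classify(X_tr, y_tr, X_te)  # assumed pipeline
            macro.append(f1_score(y_te, y_pred, average='macro'))
            micro.append(f1_score(y_te, y_pred, average='micro'))
        results[frac] = (np.mean(macro), np.mean(micro))
    return results
```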

5.2 Results and Analysis

In this section, we present the results of the proposed method on both datasets. The experiments were conducted by varying the number of clusters through the value of the inconsistency coefficient, and the number of clusters producing the best results was chosen as optimal. More importantly, to evaluate the goodness of the proposed model, a quantitative comparative analysis with existing models has been performed.

Tables 1 and 2 show the results of the proposed method on the Reuters-21578 and TDT2 datasets, respectively, in terms of macro-precision, macro-recall, macro-F-measure, and micro-F-measure. From these results, we can observe that the performance increases gradually with the percentage of training data.

Table 1 Performance of the proposed model on Reuters-21578 dataset for 102 clusters
Table 2 Performance of the proposed model on TDT2 dataset for 182 clusters

To compare the performance of the proposed method with that of the available methods, we have selected two methods which handle class imbalance by performing classwise clustering. The first method [20] uses classwise clustering to remove class imbalance and χ2 for feature selection. The second method [12] uses classwise clustering to handle class imbalance and TCR for representation. In [12], each document is represented as a feature vector of dimension equal to the number of classes originally present in the dataset, whereas in the proposed method, each document is represented by a feature vector of dimension equal to the number of clusters identified after classwise clustering.

Table 3 presents the comparison of the proposed method with those of Suhil et al. [20] and Lavanya et al. [12] in terms of macro-F and micro-F for both datasets. The number of features used and the total number of clusters formed are also shown. It can be observed from Table 3 that the proposed method outperforms the model of Lavanya et al. [12] in terms of both macro-F and micro-F. Compared with the model of Suhil et al. [20], the proposed model shows lower performance; however, the number of features used by Suhil et al. [20] is very high compared to the number of features used by the proposed method. Thus, the model of Suhil et al. [20] is much more complex, as it involves handling very high-dimensional feature vectors.

Table 3 Comparison of the proposed method against the class-based methods of Lavanya et al. [12] and Suhil et al. [20], with 70% training and 30% testing, on the Reuters-21578 and TDT2 datasets

6 Conclusion

In this paper, we have proposed a classwise cluster-based symbolic representation for imbalanced text classification using the term-class relevance measure. To validate our results, we have conducted experiments on two different datasets, viz. Reuters-21578 and TDT2. The experimental results show that the proposed method performs better than the class-based representation method. Hence, a classifier trained on a Q-dimensional vector space model can capture the variations across different classes. In the future, text classification with this method can be explored using different dimensionality reduction techniques and by studying the effect of clustering parameters such as the number of clusters and the choice of clustering technique.