1 Introduction

The volume of information on the web has grown drastically in the last decade, mainly due to the increased demand for sharing knowledge and cheaper storage media. Most existing hypertext documents, such as academic and business web pages, news articles, and forums, are inter-linked natural language files. Organizing such files into predefined categories for fast and effective retrieval and spam filtering for electronic mail are two primary application areas of automatic text categorization, which aims to save time and effort. The basic components of an automatic text categorization system are document representation and classifier design. The bag-of-words approach is generally used for document representation, where each word corresponds to a different feature [1]. In this approach, the order, meaning, structure, and grammar of words are not considered, and each document is represented by a high-dimensional feature vector whose number of entries equals the number of words selected from the vocabulary of the textual documents under concern [2]. A variety of pattern classification techniques, such as neural networks [3], the Rocchio method [4], naive Bayes [5, 6], k-nearest neighbors [7], and support vector machines (SVM) [8], have been widely studied for text categorization. The robustness of SVM in very high dimensional feature spaces makes it the state-of-the-art classifier for text categorization, since documents are generally represented as feature vectors consisting of thousands of entries [9]. In its simplest implementation, a linear SVM computes a hyperplane which separates the samples belonging to different categories with the largest margin. Recent studies clearly demonstrate that SVM achieves better scores than its competitors such as k-NN, naive Bayes, and neural networks [10, 11].

The text categorization problem is generally tackled as the solution of several independent binary classification sub-problems. For a particular category, the positive class includes all documents belonging to the target (minority) category, whereas the negative (majority) class consists of all documents from other categories. As a matter of fact, the distribution of training documents in the classes is generally uneven, which leads to the well-known class imbalance problem. Since the negative class has many more documents than the positive, learning algorithms are overwhelmed by the negative class and hence generally tend to classify the test samples as negative, producing many false negatives [12-14]. In other words, the performance of the categorization systems is generally poor on the positive class. Since the performance on the positive class is of primary concern, precision, the percentage of documents labeled as positive that truly belong to the positive class, and recall, the percentage of positive documents that are correctly classified, are generally used to compare different systems. However, since a categorization system can be tuned to maximize either precision or recall at the expense of the other, their harmonic mean, known as the F1 score, is considered more significant [15].
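As a concrete illustration of these measures, the following minimal Python sketch computes precision, recall and the F1 score from the confusion counts of a binary categorization task (the counts are illustrative, not taken from the paper's experiments):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1) from
    true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives
p, r, f1 = precision_recall_f1(40, 10, 20)
# p = 0.8, r = 2/3; F1 sits between them, so boosting one measure alone is not enough
```

Because F1 is a harmonic mean, it is dominated by the smaller of the two measures, which is why a system tuned to maximize only precision or only recall scores poorly on it.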

In text categorization, it is desirable to achieve higher F1 scores by boosting both precision and recall. Nevertheless, with imbalanced training examples, SVM-based categorization systems often provide high precision but low recall [16, 17]. Although class imbalance is a young subfield of pattern classification, studies in it are growing rapidly and various techniques have been proposed [18]. Resampling, also known as dataset balancing, is the most widely used approach to address the imbalance problem [19]. In this approach, the majority class is undersampled or the minority class is oversampled before classifier construction [20]. In random undersampling, a randomly selected subset of negative samples is used in training the categorization system. Alternatively, informed undersampling can be used. For instance, in the NearMiss-2 method [21, 22], the distance to the three farthest minority samples is considered in selecting the majority class samples. In general, undersampling techniques achieve better F1 scores by improving recall, whereas precision generally deteriorates [23]. This is due to the fact that undersampling mitigates the bias toward the majority class and hence increases the number of false positives. On the other hand, oversampling techniques such as SMOTE (Synthetic Minority Over-sampling TEchnique) [24] rarely produce higher precision in text categorization compared to the case where resampling is not applied, and the gain in recall is generally smaller than with undersampling [23]. It was recently shown by Sun et al. [16] that random undersampling provides better F1 scores than SMOTE on three benchmark text categorization datasets. Li et al. [17] have also recently shown that, despite the gain in recall, oversampling the minority class brings down the precision achieved by SVM in text categorization.
It is generally argued that undersampling techniques are more promising than oversampling in various domains including text categorization [22, 23, 25], although the opposite is observed in some cases [12].
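A random undersampling step of the kind described above can be sketched as follows (a minimal Python illustration with our own naming; in the paper's setting, `negatives` would hold the majority-class training documents):

```python
import random

def random_undersample(negatives, s, seed=None):
    """Keep a random 1/s fraction of the majority (negative) class;
    the positive class is always used in full."""
    rng = random.Random(seed)
    k = max(1, len(negatives) // s)
    return rng.sample(negatives, k)

neg = list(range(900))  # indices of 900 hypothetical negative documents
subset = random_undersample(neg, s=3, seed=0)
# the training set would then be: all positive documents plus the 300 sampled negatives
```

The discarded two-thirds of the negatives are simply never seen by the classifier, which is the information loss that motivates the ensemble schemes discussed next.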

Since a large number of negative samples is ignored, the main drawback of undersampling is the loss of potentially useful information. In order to avoid this, the use of an ensemble of classifiers is considered, and plenty of schemes have been proposed [13, 15, 18, 26]. Although the developed schemes differ in various aspects, the major difference is in the way the samples are selected. The most popular approaches are random partitioning, clustering, bagging, and boosting [26-33]. In bagging- and boosting-based approaches, the ensembles are made up of weak learners, each of which focuses on a different set of samples. SMOTEBoost and DataBoost-IM are examples of such schemes. In the former, samples from the minority class are synthetically generated by focusing on the difficult samples [34]. In the latter, synthetic samples are generated from the hard samples of both classes, followed by re-balancing the total weights of the classes to alleviate the bias toward the majority class [35]. RUSBoost, another AdaBoost-based iterative scheme, randomly undersamples the majority class until the desired number of samples is obtained; it was recently shown to surpass SMOTEBoost in about twice as many cases as SMOTEBoost outperforms RUSBoost [36]. In these schemes, C4.5, naive Bayes, and RIPPER are generally considered as the weak learners [37]. The joint decision is formed by averaging the outputs of the member classifiers or by weighted voting on the individual decisions [32, 38]. It is generally argued that an ensemble of undersampling-based classifiers outperforms a single classifier [15, 38]. However, due to its impressive performance on sparse and very high dimensional feature vectors, SVM is more widely used than boosted weak learners in the field of text categorization.

Improved F1 scores are achieved by balancing techniques mainly due to the gain in recall compared to the case when resampling is not used. However, precision may be the major consideration in some text categorization applications. For instance, it may be required to have a small number of false positives among the top results of web search applications, which corresponds to high precision [39, 40]. Similarly, a spam filter that can detect all spam messages, which form the positive class (i.e., perfect recall), but classifies many legitimate mails as spam (i.e., low precision) cannot be accepted [39, 41]. An improved F1 score with a poor precision value is not advantageous for such implementations. Hence, for SVMs, which generally provide high precision, improving the F1 score without dropping precision is an important problem. In other words, achieving a good trade-off between precision and recall is important [42].

Although majority voting is more popular in the pattern classification literature, the unanimity rule, which requires the agreement of all members on the positive class for a positive decision, is known to provide better precision but generally worse recall than majority voting [42-44]. In this study, we mainly focus on clarifying the characteristics of the unanimity and majority voting rules. An analytical investigation of these rules is firstly presented. To the best of our knowledge, this is the first study to prove that the unanimity rule can provide better F1 scores than majority voting when an ensemble of high-recall but low-precision classifiers is considered. In other words, the relative performance of the unanimity and majority voting rules is shown to depend on the ensemble under concern. In order to generate high-recall members for text categorization, category-based undersampling is proposed, where the number of subsets from the negative class is selected as the number of categories it includes. More specifically, each undersampled set consists of documents from a different category. This type of undersampling is shown to provide members having higher individual recall values compared to random partitioning. Then, fusion of classifiers generated using category-based undersampling by the unanimity rule is studied. Experiments conducted on three datasets have shown that a better trade-off between precision and recall can be achieved compared to the SVM-based baseline system and voting using a random undersampling-based ensemble.

Section 2 presents a review of the differences between the unanimity and majority voting rules, which are explained using an artificial example. In Sect. 3, the F1 scores provided by the unanimity and majority voting rules are studied by expressing them as functions of the precision and recall of the ensemble members. The category-based undersampling scheme for text categorization is also presented in that section. In order to evaluate the proposed approach, experiments are conducted on the Reuters-21578 ModApte Top10, WebKB and 7-Sectors datasets, which are presented in Sect. 4. Finally, Sect. 5 summarizes the conclusions drawn from this study.

2 Undersampling-based ensembles: a brief review

The superiority of SVM in text categorization over other well-known machine learning schemes lies in the trade-off it provides between precision and recall, generally expressed in terms of the F1 score. However, it is known that recall has room for improvement. This can simply be achieved using a resampled subset of the majority class. The cost of this improvement might be the deterioration of precision, since the decision boundary moves toward the majority class due to the reduced bias on that class [23, 33]. For further clarification of this well-known fact, consider Fig. 1, which shows the decision boundaries computed on an artificial dataset when either all majority samples (shown by '•'s) or a randomly undersampled subset of them is used. As seen in the figure, the decision boundary is moved toward the majority class when an undersampled training set is used. The new boundary corresponds to a reduction in the number of false negatives and hence an improvement in recall. However, it also corresponds to an increase in the number of false positives, which causes a decrease in precision. Consequently, if precision is of main interest, an improved F1 score due to a major gain in recall may not be considered valuable.

Fig. 1

The decision boundaries computed using SVM when an undersampled set or all samples from the majority class are used

Using an ensemble of classifiers for the solution of a binary classification problem, each of which is trained using an undersampled set of the majority class, generally produces a low-precision system as mentioned in Sect. 1. For a better understanding of this fact, assume that M different subsets \(r_1,r_2,\ldots,r_M\) from the majority class, denoted by c2, are formed and a binary classifier is trained for each subset, where all samples from the positive class c1 are used in each member. This corresponds to M binary classifiers, one for each of the sample set pairs \(\{c_1,r_1\}, \{c_1,r_2\}, \ldots, \{c_1,r_M\}\). It is plausible that there are similarities between some of the r_m's and c1. In other words, r_i may be more similar to c1 than r_j is. In such a case, a negative sample may be classified as positive by the corresponding classifiers. Since each wrong vote is assigned to the same (positive) class, these votes can reach a majority, leading to a false positive joint decision when majority voting is used. The practical consequences of this observation can be easily understood when the members trained on subsets of the majority samples are investigated using an artificial dataset as depicted in Fig. 2. The samples in the negative class are partitioned into three sets r1, r2 and r3, which are shown using the markers '⋄', '^' and '□', respectively. The samples of the positive class (c1) are marked using the symbol '+'. D1, D2 and D3, respectively, denote the boundaries computed using an SVM classifier for {c1, r1}, {c1, r2} and {c1, r3}. It can easily be seen in the figure that these decision boundaries correspond to classifiers which individually have poor precision due to a large number of false positives. Assume that voting is applied on the outputs of these binary classifiers. For this combination scheme, the samples in the regions labeled R1, R2, R6 and R7, which are bounded by the nearest decision boundaries and the axes, are classified as positive. The reason is that some samples in r1 (lying in R2) are more similar to those in c1 than to r2 and lie on the same side of the decision boundary D2. Similarly, some samples in r3 (lying in R6) are more similar to those in c1 than to r2 and again lie on the same side of D2. These similarities lead to false positives in R2 and R6. In the figure, five sets of samples belonging to the negative class that are wrongly classified as positive (i.e., false positives) are shown using circles. This explains the poor precision generally obtained when voting is applied on undersampling-based classifiers. The gain in recall can be explained by the reduced number of false negatives due to these decision regions. On the other hand, it can easily be seen that only the samples in R1 are labeled as positive by the unanimity rule, whereas samples in R2, R6 and R7 are labeled as negative. As a matter of fact, the number of false positives is reduced at the cost of an increased number of false negatives, which are located in R2, R6 and R7. This corresponds to a smaller recall value compared to the voting-based scheme, which is tolerable in applications where high precision is essential, provided that a better F1 score is achieved.

Fig. 2

Three decision boundaries computed using an SVM classifier. For each binary classifier, all positive samples are considered but only a subset of the negatives, marked by '⋄', '^' or '□', is used
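The two combination rules compared throughout this section can be stated compactly as follows (a minimal Python sketch; `votes` holds one binary decision per ensemble member):

```python
def majority_vote(votes):
    """Joint decision is positive when more than half of the members vote positive."""
    return sum(votes) > len(votes) / 2

def unanimity(votes):
    """Joint decision is positive only when every member votes positive."""
    return all(votes)

votes = [1, 1, 0]        # 1 = positive, 0 = negative, one entry per member
majority_vote(votes)     # True: two of three members suffice for a majority
unanimity(votes)         # False: unanimity requires agreement of all members
```

A single dissenting member is thus enough for unanimity to reject a sample, which is exactly why only region R1 in Fig. 2 remains positive under that rule.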

The review presented above can be summarized as follows:

  • Undersampling the majority class increases recall at the expense of precision.

  • Combination of undersampled classifiers using unanimity rule provides better precision than majority voting at the expense of recall.

  • Combination of undersampled classifiers using majority voting rule provides better recall than unanimity at the expense of precision.

Although the relative performance of the unanimity and majority voting rules in terms of precision and recall is well established, their relative performance in terms of F1 scores depends on the selected ensemble. In order to clarify this relationship, an analytical investigation is presented in the following section.

3 Analytical investigation of unanimity and majority voting rules

In the proposed analysis, the F1 score of the combined system is firstly formulated in terms of the classification behaviors of the individual members for both the majority voting and unanimity rules. For each rule, the best combined score that can be achieved is then defined as an optimization problem. The formulation of the optimization problem is given in “Formulation of the optimization problem” of the “Appendix”. In the analysis, it is assumed that there are M ensemble members, each providing the same precision (p) and recall (r) values. The findings of this analysis are presented in the following propositions.

Proposition 1

The F1 score of the unanimity rule-based combined system will be 1 if r = 1 and \(p \geq \frac{2M}{3M+1}\).

Proof

The proof is given in “Proof of proposition 1” of the “Appendix”. □

Proposition 2

The F1 score of the majority voting rule-based combined system will be 1 if r = 1 and \(p \geq \frac{2M}{3M-1}\).

Proof

The proof is given in “Proof of proposition 2” of the “Appendix”. □

The special case considered in Propositions 1 and 2 clearly shows that, for high values of individual recall, the unanimity rule is able to achieve F1 = 1 (a perfect combined system) for smaller precision values than majority voting. In other words, for small precision values, the performance of the unanimity rule can be better provided that the individual recall is large. In fact, this is the case when ensemble members are generated using undersampling: the members generally have small precision but high recall values, as mentioned in Sect. 2.

The solutions presented in the “Appendix” for proving Propositions 1 and 2 place restrictions on the joint distributions of the member outputs. However, the solutions of the optimization problems (defined in Eqs. 11, 13) are not unique. Because of this, other forms of solutions might provide different limits. Although different solutions can be computed, the solutions found in the proofs clearly verify that better combined scores can be obtained using the unanimity rule. In order to compare these rules by considering all feasible solutions and more general cases than the special one considered in the propositions, the optimization toolbox of Matlab is used to study the best combined F1 scores that can be achieved for \(p \in [0.5,1.0]\). Figure 3 presents the results for three ensemble members and three recall values, 1.0, 0.95, and 0.90, respectively, in parts (a), (b) and (c). The superiority of unanimity for low precision values is evident in the figures. In particular, when r = 1.0, a perfect combined system can be achieved using classifiers having \(p \geq \frac{2M}{3M+1}= \frac{3}{5}=0.6\), whereas p should be greater than \(\frac{2M}{3M-1}= \frac{3}{4}=0.75\) for majority voting, as found in the proofs of Propositions 1 and 2, respectively. It should be noted that, as seen in Fig. 3, the majority voting rule provides superior performance to unanimity when both precision and recall are large. The experimental results presented in [45] are consistent with this observation.

Fig. 3

The best combined F1 scores that can be computed using three ensemble members as a function of individual precision p, where (a) r = 1.0, (b) r = 0.95, (c) r = 0.90

For a better visualization of the relative performance of the two rules, Fig. 4 presents the contour lines where each line represents the values of p and r which provide the same difference between the best F1 scores that can be computed using the unanimity and majority voting rules, i.e., (F1^una − F1^mv). As seen in the figure, for small values of p and large values of r, unanimity performs better.

Fig. 4

The contour plot of the difference of the best F1 scores (F1^una − F1^mv) for \(p,r \in [0.5,1.0]\)

The method of undersampling is an important factor that may affect the performance of unanimity-based systems since, as shown above, high individual recall is essential. It should also be noted that we agree with the argument of Chang et al. [46] that random undersampling will generate geometrically inconsistent negative sample subsets. Clustering-based undersampling may be a better approach. However, the effectiveness of a clustering algorithm in partitioning the given data into homogeneous clusters depends on various parameters such as the clustering algorithm, the number of clusters, and the distance measure. In text categorization, where thousands of features are used, it is difficult to justify a particular choice of these design parameters by any technique other than empirical evaluation. Moreover, it is argued that the geometric relations between the majority and minority sub-populations may be lost if resampling is not done appropriately [29]. Because of these considerations, we followed a different path for determining the clusters of negative documents.

In text categorization, the documents in the negative class belong to one or more (in the multi-labeled case) pre-defined categories. These categories naturally constitute sub-populations in the negative class. Instead of applying a clustering algorithm to estimate the clusters, we propose to treat the set of documents belonging to each category as a separate cluster. The implementation of this rule is illustrated in Fig. 5. If the documents are multi-labeled, they will appear in more than one cluster. The present form of this approach does not require any design parameter to be tuned. The number of clusters, and correspondingly the number of ensemble members, depends on the number of categories involved in the negative class. This is reasonable since, as the number of categories increases, the negative class becomes more heterogeneous and it is advantageous to use more clusters. As will be shown in the following section, this form of undersampling leads to ensemble members having higher recall than random partitioning, which is advantageous for the unanimity rule of combination.

Fig. 5

Block diagram for the implementation of the unanimity rule in an ensemble of classifiers that are trained using category-based undersampling of the negative class
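The category-based construction of the negative subsets can be sketched as follows (an illustrative Python rendering of the rule in Fig. 5; the data structures and document ids are ours, not from an accompanying implementation):

```python
from collections import defaultdict

def category_based_subsets(documents, labels, target):
    """Build one negative subset per category: for a binary task with
    the given target category, every other category's documents form a
    separate cluster.  `labels` maps a document id to its set of
    category labels; multi-labeled documents appear in every matching
    subset, as described in the text."""
    subsets = defaultdict(list)
    for doc in documents:
        doc_labels = labels[doc]
        if target in doc_labels:
            continue  # documents carrying the target label are positive, not negative
        for cat in doc_labels:
            subsets[cat].append(doc)
    return dict(subsets)

# Toy multi-labeled collection (hypothetical document ids and labels)
labels = {
    "d1": {"acq"},
    "d2": {"acq", "money-fx"},
    "d3": {"earn"},
    "d4": {"earn", "acq"},
}
subs = category_based_subsets(["d1", "d2", "d3", "d4"], labels, target="earn")
# "d3" and "d4" are positive for "earn" and excluded;
# "d2" is multi-labeled and lands in both remaining subsets
```

One member classifier would then be trained per key of `subs`, each paired with all positive documents, so the ensemble size follows directly from the number of categories present in the negative class.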

4 Experiments

In this study, three widely used datasets are considered. The ModApte split of the top ten classes of Reuters-21578 is the first, where the negative classes are defined to include documents which belong to one or more of the remaining nine categories [47]. The highly imbalanced category distribution of Reuters-21578 ModApte Top10 makes it significant among other datasets. WebKB is a collection of web pages which belong to one of seven categories [48]. They were collected by the Carnegie Mellon University Text Learning Group from several universities in 1997. Four of the categories, namely “Student”, “Faculty”, “Course” and “Project”, which together contain 4,199 documents, are generally used in text categorization experiments [49]. The 7-Sectors dataset contains 4,581 web pages. Each page belongs to one parent category and one subcategory in hierarchical order. Following previous experiments in the literature [50], the seven parent categories are considered in the simulations. The training and test sets of the Reuters-21578 dataset are pre-defined. However, this is not the case for WebKB and 7-Sectors. Because of this, following the work of Bekkerman et al. [51] and Xue et al. [49], four-fold cross-validation is performed on these datasets. For this purpose, the available data is initially partitioned into four folds. Four experiments are then performed where, in each experiment, one fold’s data are used for testing while the data in the remaining folds are used for training. The average scores are reported.

Before training the classifiers, preprocessing is applied to the datasets. Firstly, stopwords such as prepositions (in, on, down, etc.), conjunctions (and, but, while, etc.) and articles (a, an, the, etc.) are removed using the SMART stoplist [52]. Subsequently, the Porter stemming algorithm is applied to group words with the same stem together [53]. The stem produced by the Porter algorithm need not be meaningful or identical to the morphological root of the word. For instance, the words “comparing”, “compared” and “comparable” are all stemmed to “compar”. Term frequencies in each document are cosine-normalized to eliminate the effect of varying document lengths. By exploiting different kernel functions, it is possible to generate linear and nonlinear SVMs [54]. Previous experiments have shown that linear SVMs perform better than nonlinear ones in text categorization [55]. Hence, a linear SVM is adopted in our experiments. The SVM-based classification toolbox SVMlight, with default parameters (\(C = 1/{\text {avg}}(\overline{x}^T\overline{x})\), the inverse of the average of the inner product values of the training data) and a linear kernel, is employed as the classification scheme [8, 56]. The overall performances of the schemes considered are firstly evaluated using precision, recall and F1 scores, where both macro and micro scores are presented [1]. More specifically, macro-precision, macro-recall and macro-F1 are computed as the averages of the corresponding scores obtained for each individual category. Micro-F1 scores, obtained by assigning equal weights to all documents, are also reported. Although the F1 score is the most commonly used performance measure in text categorization studies, the area under the precision-recall curve (AUP) is also considered a powerful metric, especially for evaluating the decision surface of the generated classifier [16]. AUP is therefore employed as an alternative metric for the evaluation of the proposed scheme.
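Parts of this preprocessing can be sketched in Python as follows (stopword removal and cosine normalization only; Porter stemming is omitted, and the tiny stoplist is illustrative rather than the full SMART list):

```python
import math
from collections import Counter

# A tiny illustrative stoplist; the paper uses the full SMART stoplist [52]
STOPWORDS = {"a", "an", "the", "and", "but", "while", "in", "on", "down"}

def preprocess(text):
    """Tokenize, drop stopwords and cosine-normalize the term
    frequencies so that long and short documents are comparable."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    tf = Counter(tokens)
    norm = math.sqrt(sum(f * f for f in tf.values()))
    return {term: f / norm for term, f in tf.items()}

vec = preprocess("the cat sat on the mat and the cat slept")
# tf: cat=2, sat=1, mat=1, slept=1; each weight is tf divided by sqrt(7)
```

After normalization the resulting vector has unit Euclidean length, so the dot product between two documents depends only on their term distributions, not on their lengths.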

The bag-of-words approach used for document representation generally leads to a very high-dimensional feature space consisting of tens of thousands of words even for medium-sized datasets. Although computational power is increasing rapidly, reducing the number of original features is still necessary: not all terms are useful for discriminating between categories, and the curse of dimensionality is encountered in many classification algorithms. In a recent study, it was observed that the F1 scores of most weighting schemes plateau after 5,000 features for SVM [55]. Because of this, the top 5,000 features ranked by the chi-square (χ2) term selection measure are considered in the experiments [57].
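The χ2 ranking can be sketched from the standard 2×2 contingency formulation as follows (a minimal Python illustration; the count notation A, B, C, D and the toy terms are ours):

```python
def chi_square(A, B, C, D):
    """Chi-square statistic of a term for a binary category from the
    2x2 contingency counts: A positive documents containing the term,
    B negative documents containing it, C positives lacking it,
    D negatives lacking it."""
    N = A + B + C + D
    den = (A + B) * (C + D) * (A + C) * (B + D)
    return N * (A * D - B * C) ** 2 / den if den else 0.0

def select_top_k(term_counts, k):
    """Rank terms by chi-square and keep the k best; `term_counts`
    maps each term to its (A, B, C, D) counts."""
    ranked = sorted(term_counts, key=lambda t: chi_square(*term_counts[t]), reverse=True)
    return ranked[:k]

counts = {"stock": (30, 5, 10, 55), "hello": (20, 20, 20, 40)}
select_top_k(counts, 1)  # "stock" wins: it co-occurs strongly with the positive class
```

A term distributed independently of the category (A·D = B·C) scores zero and is discarded first, which is the behavior the selection step relies on.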

After the feature set is specified, term weighting is generally applied as the following step [58]. The main idea is to quantify the relative importance of the selected terms, where discriminative terms are assigned larger weights [55]. Term weighting has traditionally been formulated as the product of the term frequency and the inverse document frequency, tf × idf [59]. In this study, relevance frequency (RF)-based weighting (tf × RF), which was recently proposed and shown to deliver the best results on several benchmark datasets, is used [55]. RF is defined as [55]

$$ {\text {RF}} = {\text {log}} \left( 2+\frac{A}{C} \right) $$
(1)

where A and C, respectively, represent the number of documents which contain the term under concern in the positive and negative classes.
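A direct rendering of Eq. 1 combined with the term frequency might look as follows (Python; the base-2 logarithm and the max(C, 1) guard against C = 0 are conventions we adopt for the sketch, not details stated in the text):

```python
import math

def tf_rf(tf, A, C):
    """tf x RF weight following Eq. 1; A and C are the numbers of
    positive and negative documents containing the term.  Base-2 log
    and the max(C, 1) guard are our conventions."""
    return tf * math.log2(2 + A / max(C, 1))

tf_rf(1, 2, 1)  # log2(2 + 2/1) = 2.0: a term twice as frequent in positives
tf_rf(3, 0, 5)  # log2(2) = 1, so the weight reduces to the raw term frequency
```

Since RF is at least log(2), the weight never vanishes for a term that occurs in the document; it only grows as the term concentrates in the positive class.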

The proposed approach, referred to as CATEGORY-UNANIMITY in the following, is compared with other well-known approaches. Firstly, we studied the performance of a single random undersampling-based categorization system. This system, referred to as UNDERSAMPLING in the following, is trained and tested ten times and the average scores are reported. The scheme is tested for eight undersampling rates, s, in the interval [2, 9]. For instance, for s = 3, all training samples from the positive category and a random subset consisting of one-third of the negative training documents are used, while the rest are not considered during training.

Secondly, random partitioning of the negative class is implemented, where the negative class is split into s partitions. For instance, for s = 3, the negative class is randomly partitioned into three non-overlapping subsets. Each subset is used separately, together with all positive documents, to construct M = 3 different classifiers. For small categories, if s is small, the data may still be imbalanced. It may be argued that the number of samples in the negative class should depend on the number of samples in the minority (positive) class; for instance, the number of majority (negative) samples can be set equal to that of the minority class. In fact, it has been shown that this is not a trivial task since the best-fitting number of majority class samples depends on the classification scheme and the problem under concern [60]. In order to investigate this, the UNDERSAMPLING system is tested with equal numbers of positive and negative samples. The macro-F1 and micro-F1 scores obtained are presented in the last column of Table 1. The scores obtained for s = 2, 3, 4 and 5 are also presented. It can be seen in the table that choosing the number of negative samples equal to the number of positive samples provides remarkably worse scores than using a larger number of negatives.

Table 1 The macro-F1 and micro-F1 scores obtained on three datasets when the number of negative samples is selected as the number of samples in the positive class and for s = 2, 3, 4 and 5

In text categorization, it has been shown that weighted voting performs better than plain voting [61]. In our simulations, we used the SVM scores as weights, where a higher positive score is considered a more confident positive decision and, similarly, a lower negative score a more confident negative decision. In binary text categorization, this weighted combination rule can be implemented simply by averaging the SVM scores. Hence, weighted voting (referred to as PARTITION-W_VOTING) and the unanimity rule (referred to as PARTITION-UNANIMITY) are used for the aggregation of the outputs provided by the individual classifiers trained on disjoint partitions. In weighted voting-based fusion, the joint decision is positive if the average score is greater than zero.
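The two aggregation rules on SVM outputs can be sketched as follows (a minimal Python illustration; `scores` holds one signed SVM margin per member classifier):

```python
def weighted_vote(scores):
    """Weighted voting on SVM margins: positive joint decision when
    the average score exceeds zero."""
    return sum(scores) / len(scores) > 0

def unanimity_rule(scores):
    """Unanimity on SVM outputs: every member must yield a positive margin."""
    return all(s > 0 for s in scores)

scores = [2.1, 0.4, -0.3]  # one signed SVM score per member classifier
weighted_vote(scores)      # True: the average margin is positive
unanimity_rule(scores)     # False: one member votes negative
```

The example shows how a single confident member can carry a weighted vote, whereas unanimity rejects the sample as soon as any margin is negative.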

In the implementation of the proposed system, M is dataset dependent since the number of categories included in the negative class varies across datasets. In the case of Reuters-21578, since there are ten categories, three of which are “Earn”, “Acquisition” and “Money-fx”, M = 9. For instance, in the binary problem of categorizing the test documents as belonging to “Earn” or not, nine classifiers are generated using the training data. The positive training samples of the first classifier are those having “Earn” as one of their labels, whereas the negative class involves the samples which have “Acquisition” as one of their labels but not “Earn”. Similarly, the positive class of the second member classifier is comprised of samples that have “Earn” as one of their labels, whereas its negative class involves the samples which have “Money-fx” as one of their labels but not “Earn”. The other member classifiers are built in the same manner. The values of M are three and six, respectively, for the WebKB and 7-Sectors datasets.

The macro-F1 and micro-F1 scores achieved on the three datasets are presented in Figs. 6 and 7, respectively, as functions of the sampling rate s. The system that exploits all negative documents in model training, denoted by SVM, is also provided as a reference. The results clearly show that the highest macro-F1 and micro-F1 scores are achieved by the proposed approach on all three datasets. The results should be evaluated with respect to both the type of resampling and the output aggregation method used. When CATEGORY-UNANIMITY and PARTITION-UNANIMITY are compared, it can easily be seen that the proposed approach provides higher macro-F1 and micro-F1 scores for all values of s. On the WebKB dataset, PARTITION-UNANIMITY cannot provide a higher macro-F1 or micro-F1 score than the baseline system (SVM) for any value of s. On 7-Sectors, its macro-F1 and micro-F1 scores drop below those of SVM for large values of s. It can be concluded that, when the unanimity rule is used, the proposed resampling approach provides better scores. This is in fact reasonable since, for deciding on the positive class, the unanimity rule necessitates individual classifiers having high recall. In other words, the number of misclassified positive documents (false negatives) should be very small. In the ideal case, the unanimity rule requires all positive documents to be correctly classified by all member classifiers. On the other hand, it has very high tolerance to a large number of false positives, since it is sufficient for one member classifier to provide the correct decision to avoid misclassifying a negative document as positive.

Fig. 6

The macro-F1 scores computed on three datasets for eight different sampling ratios

Fig. 7

The micro-F1 scores computed on three datasets for eight different sampling ratios

In order to investigate the recall values of the individual members, consider Fig. 8, which presents the average precision and recall values of the nine individual members of the category-based and random partitioning-based resampling schemes on the Reuters-21578 dataset. As seen in the figure, the proposed category-based partitioning provides higher recall but lower precision on all categories. Lower precision of the individual members means that the number of false positives is larger, which is a consequence of having higher recall. This is mainly due to the fact that the data used to train the ensemble members of the category-based partitioning approach do not cover the whole feature space. However, as stated above, the unanimity rule has high tolerance to a larger number of false positives since a decision on the positive class is made only if all members agree on that class.
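The precision/recall trade-off described above follows directly from the standard definitions. The counts in the sketch below are illustrative, not taken from Fig. 8.

```python
# Sketch of the precision/recall/F1 definitions used throughout the paper.

def precision_recall_f1(tp, fp, fn):
    """tp: true positives, fp: false positives, fn: false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A high-recall member: few false negatives (fn) but many false positives (fp)
p, r, f1 = precision_recall_f1(tp=90, fp=60, fn=10)
# precision = 0.6, recall = 0.9: exactly the member profile the unanimity rule favors
```

Recall depends only on tp and fn, so piling up false positives degrades precision without touching recall, which is why the high-recall, low-precision members in Fig. 8 remain useful under the unanimity rule.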

Fig. 8

The average precision and recall values of nine individual members of category-based and random partitioning-based (s = 9) resampling schemes

As can be seen in Fig. 8, the recall values obtained using category-based partitioning are smaller on categories where the number of training samples is small, such as "Wheat", "Ship" and "Corn". For the corresponding binary categorization problems, if the negative training documents belong to a category with a large number of training samples, there may still be a bias toward the negative class, which explains the comparatively smaller recall values. On the other hand, if the tested category is large, the comparatively smaller number of samples in a negative class corresponding to a small category does not create such a bias.

As can be seen in Fig. 6, the macro-F1 scores provided by PARTITION-W_VOTING and PARTITION-UNANIMITY generally degrade when four or more members are considered. However, PARTITION-UNANIMITY is more robust than PARTITION-W_VOTING. For a better understanding of this behavior, consider the macro-precision and macro-recall values presented in Fig. 9, which are computed as the averages of the corresponding scores obtained from the independent binary categorization tasks of each dataset. As can be seen in the figure, the recall values obtained using PARTITION-W_VOTING are higher than those of PARTITION-UNANIMITY for all values of s. The robustness of F1 is a consequence of the robustness of precision to increasing values of s, which is the main characteristic of the unanimity rule.
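A weighted-voting combiner of the kind used by PARTITION-W_VOTING can be sketched as below. The exact weighting used in the experiments is not specified in this excerpt, so validation-derived per-member weights are assumed here purely for illustration.

```python
# Hedged sketch of a weighted-voting combiner (weights are assumed, e.g.
# derived from each member's validation performance).

def weighted_vote(predictions, weights):
    """predictions: member outputs in {+1, -1}; weights: per-member scores."""
    score = sum(w * p for p, w in zip(predictions, weights))
    return 1 if score > 0 else -1

# Two well-weighted members outvote one dissenter, so a single veto is lost:
decision = weighted_vote([1, 1, -1], [0.9, 0.8, 0.3])  # score = 1.4 -> positive
```

Unlike the unanimity rule, the positive decision here survives dissenting members, which raises recall but lets false positives through as s grows; this matches the precision degradation of PARTITION-W_VOTING seen in Fig. 9.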

Fig. 9

The macro-precision and macro-recall values computed on three datasets for eight different undersampling rates

In order to gain further insight into the differences among the schemes, consider the F1 scores computed for each category as given in Table 2. The categories in the top ten rows are from the Reuters-21578 dataset, and the following four categories are from the WebKB dataset. On fourteen of the twenty-one categories, unanimity-based systems (PARTITION-UNANIMITY or CATEGORY-UNANIMITY) provide the best F1 scores and, on eleven of these, the proposed approach (CATEGORY-UNANIMITY) provides the best F1. On eighteen categories, PARTITION-W_VOTING performs better than UNDERSAMPLING. Nevertheless, these performance improvements are not significant.

Table 2 The F1 scores obtained for the individual categories of the Reuters-21578, WebKB and 7-Sectors datasets

PARTITION-W_VOTING achieves its best F1 scores when s is equal to 2 or 3 on seventeen categories. As the number of partitions is increased, its performance generally degrades. However, PARTITION-UNANIMITY provides its best F1 score using a larger number of splits than PARTITION-W_VOTING on eleven categories. This can be explained by the fact that, as mentioned above, the unanimity rule requires classifiers having high recall, which can be generated by using a smaller number of negative documents in their training.

For a further comparison of the unanimity- and voting-based schemes, we investigated their performance on category-based undersampling. The conventional maximum-voting scheme is employed, where the total numbers of positive and negative predictions are compared in making the decision. This scheme is named CATEGORY-MAX_VOTING. It should be noted that, as illustrated in Fig. 4, maximum-voting performs better than the unanimity rule when the member classifiers achieve high precision values. Since category-based undersampling provides smaller precision values than random partitioning, as shown in Fig. 8, the unanimity rule is expected to surpass the maximum-voting rule. The macro-F1 and micro-F1 scores achieved on the three datasets are presented in Table 3. As expected, the scores achieved by the unanimity rule are significantly superior to those provided by the maximum-voting-based scheme.
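The contrast between the two rules can be made concrete with a small sketch. The tie-breaking policy toward the negative class is an assumption here; the paper's exact policy is not stated in this excerpt.

```python
# Sketch contrasting maximum-voting (majority wins) with the unanimity rule.
# Tie-breaking toward the negative class is assumed for illustration.

def max_voting(predictions):
    """Positive if strictly more members vote positive than negative."""
    pos = sum(1 for p in predictions if p == 1)
    return 1 if pos > len(predictions) - pos else -1

def unanimity(predictions):
    return 1 if all(p == 1 for p in predictions) else -1

# With low-precision members, a spurious positive majority on a negative
# document is accepted by max-voting but vetoed under unanimity:
votes = [1, 1, -1]
# max_voting(votes) -> 1 (false positive), unanimity(votes) -> -1 (correct)
```

This is the mechanism behind the expectation stated above: when members have low precision, majorities of false-positive votes are common, and only the unanimity veto filters them out.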

Table 3 The macro-F1 and micro-F1 scores obtained on three datasets using unanimity- and maximum-voting-based schemes to combine classifiers trained on category-based undersampled sets

In order to assess the statistical significance of the improvements in the F1 scores provided by the proposed approach, hypothesis tests are performed using the t-test. The null hypothesis is defined as "H0: the mean of the improvement is equal to zero" and the alternative hypothesis as "H1: the mean of the improvement is greater than zero". The tests are first performed between CATEGORY-UNANIMITY and the baseline SVM-based system. The null hypothesis is rejected at the significance level α = 0.05, with p values of 0.0118 and 0.0057 for Reuters-21578 and 7-Sectors, respectively. On the other hand, PARTITION-W_VOTING provides statistically significant improvements only on Reuters-21578 and only for s = 2, whereas UNDERSAMPLING does not achieve significant improvements on any of the datasets under concern for any value of s. Significance tests are similarly performed between CATEGORY-UNANIMITY and PARTITION-UNANIMITY, for which the s values providing the highest macro-F1 scores are considered for each dataset. More specifically, the values of s are 7, 2, and 4 for Reuters-21578, WebKB and 7-Sectors, respectively, for the PARTITION-UNANIMITY system, as seen in Fig. 6. The null hypothesis is rejected at the significance level α = 0.05, with p values of 0.0291 and 0.0273 for WebKB and 7-Sectors, respectively.
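The test statistic behind this procedure is the one-sample t statistic on the per-category F1 improvements. The sketch below computes only the statistic; the p value would be obtained from the t distribution with n − 1 degrees of freedom (e.g. via scipy.stats.ttest_1samp in practice). The improvement values are illustrative, not the paper's data.

```python
import math

# Sketch of the one-sided, one-sample t statistic for H0: mean improvement = 0.

def t_statistic(improvements):
    """t = mean / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(improvements)
    mean = sum(improvements) / n
    var = sum((x - mean) ** 2 for x in improvements) / (n - 1)
    return mean / math.sqrt(var / n)

t = t_statistic([0.02, 0.03, 0.01, 0.04])  # hypothetical per-category F1 gains
# H0 is rejected at alpha = 0.05 if t exceeds the one-sided critical value
# of the t distribution with n - 1 degrees of freedom.
```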

In a recent study, it was observed that the relative performance of different schemes may not be consistent when different performance measures are used [16]. In particular, when AUP is considered, the SVM trained on the whole training set, which is considered the baseline, provided the best scores on two of the three datasets when compared to various undersampling, oversampling, and instance weighting-based schemes. On the other hand, its performance was the worst on all three datasets in terms of macro-F1, which is also consistent with the results of our work. This observation was explained by the fact that the precision-recall curve characterizes the performance of the classifier over different thresholds on the prediction values, whereas the F1 score employs the precision and recall obtained at the default threshold of zero. By tuning the threshold on the test data, the authors showed that the baseline SVM becomes the best in terms of the macro-F1 score, concluding that threshold setting is still an open problem. In this study, the proposed scheme is also evaluated in terms of the AUP score, and the scores achieved are compared with those of the baseline SVM, which was recently found to provide the best scores by Sun et al. [16]. Table 4 presents the average AUP scores for each dataset, where the best scores are shown in boldface. As can be seen in the table, the proposed scheme provided the best scores on two of the three datasets utilized in this study.
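The threshold-free nature of AUP can be illustrated with a simple estimator that sweeps the ranking induced by the prediction scores. This sketch uses the common average-precision estimator; the exact AUP definition used in [16] may differ in interpolation details.

```python
# Sketch: area under the precision-recall curve estimated as average precision
# over the ranked predictions (one precision sample per recalled positive).

def average_precision(scores, labels):
    """scores: classifier prediction values; labels: 1 positive, 0 negative."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp = fp = 0
    area = 0.0
    for i in order:
        if labels[i] == 1:
            tp += 1
            area += tp / (tp + fp)  # precision at this recall step
        else:
            fp += 1
    return area / n_pos

aup = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
# = (1/1 + 2/3) / 2 = 5/6: every threshold contributes, unlike F1 at threshold 0
```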

Table 4 The average AUP scores obtained on three datasets

It should be noted that Kumar and Gopal have recently shown that the one-versus-rest approach provides better scores than the one-versus-one approach in multi-class text categorization [62]. The one-versus-rest approach solves the multi-class problem by defining multiple binary classification problems, where the negative class of each problem is the union of the documents from all other categories. Instead of using this union as the negative class, the proposed scheme generates a different binary classifier for each negative category and then combines their outputs. The proposed scheme is therefore also a potential candidate for solving each binary classification problem in multi-class text categorization.

5 Conclusions and future work

In this study, an analytical investigation of the unanimity and majority voting rules is presented. It is shown that the unanimity rule can provide better F1 scores than majority voting when an ensemble of high-recall but low-precision classifiers is considered, whereas majority voting provides a better F1 score when both precision and recall are high. Then, in order to generate high-recall members, category-based undersampling is proposed, where the number of subsets drawn from the negative class is set to the number of categories it includes. The combination of the classifiers generated using category-based undersampling by the unanimity rule is then studied.

Experiments conducted on three datasets have shown that the random undersampling method provides individual members having higher precision than those obtained using category-based undersampling, whereas the latter approach provides members having higher recall. When applied to the members trained using category-based subsets of the negative documents, the unanimity rule is shown to attain a better precision-recall trade-off than the baseline SVM-based system, the random undersampling approach, and weighted voting on the outputs of ensemble members generated using randomly partitioned negative data, and to improve the F1 score to a satisfactory level.

Several tasks remain for future work. Firstly, in order to achieve better F1 scores, the proposed undersampling scheme can be modified for small target categories. The category selected from the negative class might include many more documents than the target category, and splitting the category-based negative sample sets into smaller groups may be a solution in such cases. Therefore, studying the effect of the number of samples in the subsets of the negative class will be considered in future work. Secondly, since the class imbalance problem is extensively studied in the literature, where numerous resampling schemes have been proposed, the relative performance of unanimity and majority voting should be studied for these alternative schemes to obtain a comprehensive experimental evaluation.