
1 Introduction

The massive amounts of unstructured data stored in public resources continue to increase. To organize and manage this data, efficient and effective methods must be considered. Text classification is an effective technique for information organization and management [1]. Different methods and algorithms have been developed for text classification, including Support Vector Machines (SVM) [2], the Naive Bayes probabilistic classifier (NB) [3], Rocchio similarity [4], K-Nearest Neighbour (KNN) [5], and C4.5 decision trees [1].

Binary classification is a key type of text classification with two predefined categories, namely, the relevant and irrelevant classes [6]; our research focuses on this setting. A binary text classifier determines a decision boundary to classify documents into two groups: the positive and negative classes [7]. However, drawing a clear boundary between the positive and negative classes of text documents is not easy for a classic binary text classifier [8, 9].

Solving classification problems with SVM, which was proposed by Vapnik in 1995, has gained increasing recognition and popularity among researchers due to its ability to handle high-dimensional data such as textual documents [10, 11]. SVM performs classification by finding a decision boundary (separating hyperplane) that partitions the feature space into two distinct classes of data, positive and negative, with the maximum margin, and it represents the decision boundary using a set of support vectors (SV) generated from the training dataset [12, 13]. However, it is difficult for an SVM classifier to deal with non-separable data because the margin between the positive and negative classes remains unclear. In such situations, due to the uncertainty, an SVM classifier might not provide the optimal classification.

In practical problems, most training datasets include uncertainties. With an uncertain boundary, learning a classifier is more complex: it is difficult to find the optimal line to separate the related objects, and a full separation of relevant and irrelevant documents would require a curve. However, it is not easy to obtain such a curve directly with high precision because it requires too much computation [8]. Even if this were possible, there is no guarantee that it could correctly classify all unknown testing samples because of the differences between the training and testing document sets [9]. Thus, a nonlinear classifier is inefficient for a prediction task where an uncertain boundary exists in the training set. It is therefore desirable to design a classifier model able to cope linearly with non-separable data. How to incorporate data with uncertainties into the learning phase to improve the performance of a binary classifier is thus a challenging problem.

This paper presents an effective binary classification model, the Multiple-SVMs with Sliding Window model (MSVMs-SW model), to overcome the limitations of existing classifiers and to achieve the best performance of linear SVM on data with uncertainties. Different from traditional binary classifiers, the MSVMs-SW model handles uncertainty by partitioning the training samples (with two labels) into three regions, namely, positive, boundary, and negative regions, in order to understand the decision boundary. This partitioning of the training set helps to describe relevant and non-relevant information and supports the design of a multiple-SVMs based classifier. We developed three different SVM classifiers (SVMP, SVMN, and SVMB), each trained on its own training set derived from the three regions. The training set for each classifier is different in order to improve the prediction results, to increase the certainty of the objects in the positive and negative regions, and to resolve the uncertainty in the boundary region. The main motivation for using multiple SVMs to classify new incoming documents is that a problem requiring expert knowledge is better solved by a committee of experts than by a single expert [6]. This research therefore makes three contributions to the field of text classification: (a) a new and effective model that deals with the uncertain decision boundary in text classification; the proposed model uses the training set with only minimal experimental parameters to identify the uncertain boundary, which makes it efficient; (b) an alternative solution to the hard uncertain-boundary problem that was traditionally addressed by non-linear SVMs; and (c) a structure to guide the design of a fusion of multiple classifiers. To measure the effectiveness of the proposed model, extensive experiments were conducted on the RCV1 dataset with TREC assessors' relevance judgements. The results show a significant improvement in F1 and Accuracy for binary text classification.

2 Related Work

Automated binary text classification is a significant research problem in the information filtering and information organization fields [15]. It determines a decision boundary that classifies textual documents into two distinct classes: relevant or irrelevant. Several approaches to binary text categorization, such as NB, KNN, decision trees, Rocchio, and SVM, have been developed to efficiently separate the relevant documents in a large dataset and to determine a clear boundary between the classes [1]. However, in practice, the decision boundary involves much uncertainty because of the limitations of traditional machine learning algorithms, the presence of noise in text documents, and feature scalability [16, 17].

SVM represents the training dataset as vectors, where each vector comprises a document's words with their frequencies, and then tries to locate the linear hyperplane that separates the two classes [13]. SVM can solve linear and nonlinear classification problems and works well on many practical problems [18, 19]. Although nonlinear SVM is effective for classifying nonlinear data, it has much higher computational complexity than linear SVM when making predictions on sparse data [19]. In addition, linear SVM performs better than nonlinear SVM when the number of features is very high, as in document classification [20, 21]. Therefore, if the number of features is extremely large, it is better to select linear SVM, given the difficulty of finding the optimal parameters of a nonlinear SVM classifier [22]. However, linear SVM still has no effective way to deal with uncertain factors; it is therefore desirable to have a classifier model with the efficiency of a linear classifier that can deal with data having uncertainty. Linear SVM is chosen in this study due to its computational and algorithmic simplicity.

The above limitations can be alleviated by employing the SW technique to divide the training set into three regions based on scores that represent the documents' degree of relevance, and then designing a multiple-SVMs based classifier in order to derive a linear decision boundary for each classifier. In our proposed model, the SW technique is optimized using entropy. The entropy measurement is chosen in this research because it is a well-understood measure in information theory and a fundamental measure of the randomness and uncertainty of data [14, 23].

3 Description of MSVMs-SW Model

The MSVMs-SW model attempts to use the training dataset effectively to deal with the probable uncertainty and to improve the accuracy of the classifier. Our proposed model uses SVM as a high-performance classifier and generates new training sets by dividing the universal set of documents into three disjoint parts: the positive region (POS), the boundary region (BND), and the negative region (NEG). However, a single SVM may not be sufficient to classify all unknown testing samples. Therefore, we propose to use a multiple-SVMs based classifier. The proposed model contains two stages, a training stage and a testing stage, as shown in Fig. 1.

Fig. 1. Architecture of a multiple-SVMs classifier

3.1 Training Stage of MSVMs-SW Model

To achieve the best performance in binary classification, the objective is to determine a decision boundary between the classes. Our proposed model uses only the training set to set the decision boundary and to explore the uncertainty, as shown in Fig. 2. It starts by calculating the scores of the training documents, and then regroups the training samples into three regions using the SW technique.

Fig. 2. Decision boundary setting

Document Scoring.

Scoring documents to indicate their importance is an effective way of ranking relevant information. For a collection of documents consisting of two sets (the positive document set D+ and the negative document set D−), the MSVMs-SW model calculates the weights of the terms extracted from D+ and ranks them, keeping the top-k features based on their values, for example, T = {t1, t2, t3, …, tk}. The value of k is determined experimentally. In our proposed model, we use Okapi BM25 as the term weighting function. BM25 is a probabilistic state-of-the-art retrieval model [24], calculated as follows:

$$ w(t) = \frac{tf \cdot (k_{1} + 1)}{k_{1} \cdot \left( (1 - b) + b\,\frac{DL}{AVDL} \right) + tf} \cdot \log \frac{(r + 0.5)/(n - r + 0.5)}{(R - r + 0.5)/(N - n - R + r + 0.5)} $$
(1)

where N is the total number of training documents; R is the number of relevant documents; n is the number of documents which contain the term t; r is the number of relevant documents which contain the term t; tf is the term frequency; DL and AVDL are the document length and average document length, respectively; and k1 and b are the tuning parameters.

The reason for using BM25 to calculate term weights is that BM25 is a probabilistic model, and in binary text classification we deal with uncertain information [24]. Probability is the standard measure for quantifying uncertainty in information, so probability theory is a natural way to handle these uncertainties. Next, the weighted terms are used to calculate the scores of all training documents d ∈ D as follows:

$$ score(d) = \sum_{t \in T} w(t) \cdot \tau(t, d) $$
(2)

where w(t) = BM25(t, D+); and τ(t, d) = 1 if t ∈ d, otherwise τ(t, d) = 0.

Once the scores of the documents are calculated, the documents are ranked in descending order based on their scores.
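To make the scoring step concrete, the following Python sketch implements Eqs. (1) and (2) under stated assumptions: documents are pre-tokenized lists of terms, the tf and DL statistics in Eq. (1) are aggregated over D+ (the paper does not spell out this aggregation), and the function names (bm25_weight, score_document, rank_training_documents) and the example top-k value are illustrative rather than part of the original model.

```python
import math

def bm25_weight(t, docs_pos, docs_all, k1=1.2, b=0.75):
    """Eq. (1): BM25 weight of term t, estimated from the positive set D+."""
    N = len(docs_all)                              # total training documents
    R = len(docs_pos)                              # relevant (positive) documents
    n = sum(1 for d in docs_all if t in d)         # documents containing t
    r = sum(1 for d in docs_pos if t in d)         # relevant documents containing t
    avdl = sum(len(d) for d in docs_all) / N       # average document length
    # tf and DL aggregated over D+ (one plausible reading of Eq. (1))
    tf = sum(d.count(t) for d in docs_pos)
    dl = sum(len(d) for d in docs_pos) / max(R, 1)
    tf_part = (tf * (k1 + 1)) / (k1 * ((1 - b) + b * dl / avdl) + tf)
    idf_part = math.log(((r + 0.5) / (n - r + 0.5)) /
                        ((R - r + 0.5) / (N - n - R + r + 0.5)))
    return tf_part * idf_part

def score_document(d, weights):
    """Eq. (2): sum the weights of the selected terms that occur in d."""
    return sum(w for t, w in weights.items() if t in d)

def rank_training_documents(docs_pos, docs_neg, top_k=150):
    """Weight terms from D+, keep the top-k, then score and rank all training documents."""
    docs_all = docs_pos + docs_neg
    vocab = {t for d in docs_pos for t in d}
    weights = {t: bm25_weight(t, docs_pos, docs_all) for t in vocab}
    top_terms = dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k])
    labelled = [(d, +1) for d in docs_pos] + [(d, -1) for d in docs_neg]
    scored = [(d, y, score_document(d, top_terms)) for d, y in labelled]
    return sorted(scored, key=lambda x: x[2], reverse=True)   # descending by score
```

The ranked list of (document, label, score) tuples returned here is the input to the sliding-window step described next.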

Sliding Window Technique.

After the training documents are ranked in the previous step, the most relevant documents are located at the top of the list, while irrelevant ones are located at the bottom of the ranked list, as shown in Fig. 2 (step 1). However, in most cases there are regions in which positive and negative documents are mixed because of the uncertain boundary. To find this area with many noisy documents, a sliding window technique and entropy are used to determine the boundary region effectively. Ko and Seo [25] used entropy and a sliding window to remove noisy data and to address a problem of the One-Against-All method. Our proposed model extends this idea, using a sliding window and an entropy measurement to construct the decision boundary.

In this research, the sliding window is used to identify the boundary values that delimit the region with the highest rate of noisy documents [25, 26]. The window size in this paper was set to 5 documents. The model starts sliding the window from the top documents of the ranked list and calculates the entropy value for the window. The window then slides down by one document and yields a new entropy value. It continues to slide and stops when the entropy exceeds the threshold; we chose a high entropy threshold (95%). The same process is applied from the bottom of the ranked list, as shown in Fig. 2 (step 1).

Entropy Algorithm.

Entropy is commonly used to quantify the uncertainty of a variable [23, 26]. In this paper, for each sliding window s, the entropy value is calculated from the numbers of positive and negative documents as follows:

$$ E(s) = -\left[ \frac{P}{P + N}\log_{2}\left( \frac{P}{P + N} \right) + \frac{N}{P + N}\log_{2}\left( \frac{N}{P + N} \right) \right] $$
(3)

where P and N are the numbers of positive and negative documents in the window s, respectively.
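For example (an illustrative split, not taken from the experiments), a window of size 5 containing 3 positive and 2 negative documents has, by Eq. (3),

$$ E(s) = -\left[ \tfrac{3}{5}\log_{2}\tfrac{3}{5} + \tfrac{2}{5}\log_{2}\tfrac{2}{5} \right] \approx 0.971, $$

which is close to the maximum value of 1, whereas a pure window (5 positive, 0 negative) has E(s) = 0 under the usual convention 0 · log2 0 = 0. A high entropy value therefore signals a window in which the two classes are heavily mixed.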

Next, we select the two windows with the highest entropy values: the first window (W1) from the top of the list and the second window (W2) from the bottom of the list. The irrelevant documents in W1 are denoted τN, and the relevant documents in W2 are denoted τP. In this study, the boundary values are calculated from the scores of the relevant documents (τP) and the irrelevant documents (τN): we select the highest score among the irrelevant documents in W1 as the maximum threshold (τmax), and the lowest score among the relevant documents in W2 as the minimum threshold (τmin), as shown in Fig. 2 (step 2). Hence, the upper and lower decision boundary values τmax and τmin are calculated as follows:

$$ \tau_{\max} = \max_{d_{i} \in D^{-} \cap W_{1}} \left\{ score\left( d_{i} \right) \right\} $$
(4)
$$ \tau_{\min} = \min_{d_{i} \in D^{+} \cap W_{2}} \left\{ score\left( d_{i} \right) \right\} $$
(5)

Three Regions for Partitioning the Training Set.

The MSVMs-SW model groups the training set into three regions rather than two classes. The training set D is split into three regions based on the document scores and the thresholds set in the previous step: the positive region (POS, possibly relevant); the boundary region (BND, uncertain); and the negative region (NEG, possibly irrelevant). These regions are defined as follows:

$$ \begin{aligned} POS &= \left\{ d \in D \mid score(d) > \tau_{\max} \right\} \\ BND &= \left\{ d \in D \mid \tau_{\min} \le score(d) \le \tau_{\max} \right\} \\ NEG &= \left\{ d \in D \mid score(d) < \tau_{\min} \right\} \end{aligned} $$

The boundary region BND contains many relevant and irrelevant documents whose classification is uncertain; it can be divided into two subsets: \( B^{+} = BND \cap D^{+} \) and \( B^{-} = BND \cap D^{-} \).
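Given the two thresholds, partitioning the labelled training documents into the three regions and the two BND subsets is direct. The short sketch below (with illustrative names) follows the set definitions above.

```python
def partition_training_set(ranked, tau_max, tau_min):
    """Split (document, label, score) tuples into POS, BND, NEG and B+, B-."""
    POS = [(d, y) for d, y, s in ranked if s > tau_max]
    BND = [(d, y) for d, y, s in ranked if tau_min <= s <= tau_max]
    NEG = [(d, y) for d, y, s in ranked if s < tau_min]
    B_pos = [(d, y) for d, y in BND if y == +1]   # B+ = BND ∩ D+
    B_neg = [(d, y) for d, y in BND if y == -1]   # B- = BND ∩ D-
    return POS, BND, NEG, B_pos, B_neg
```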

Designing the Multiple-SVMs Based Classifier.

A classifier is built by training an SVM on training documents chosen via the three regions. As shown on the left side of Fig. 1, we construct three different SVM classifiers: SVMP, SVMN, and SVMB. Algorithm 1 describes the training stage used to learn the classifiers. The first classifier, SVMP (step 8), takes the strong positive documents POS and all negative documents \( B^{-} \cup NEG \) as input and uses the SVM learner to build a prediction model. SVMP generates the hyperplane between POS and \( B^{-} \cup NEG \) that maintains the maximum margin between them. However, a potential problem arises when the number of training samples in POS is very low; in this case the class boundary would not be accurate because of insufficient positive training samples. To overcome this issue, we use a pseudo-feedback technique: we select the top-k scoring documents from the unlabeled testing set U and add them to POS, as shown in steps 1 to 6. Different values of k were tested, and we found that using 5 documents improved the performance, whereas k > 5 reduced it.

The second classifier, SVMN, is constructed from all positive documents \( POS \cup B^{+} \) and the strong negative documents NEG, as in step 9. For SVMB, it is difficult to build a classifier from the documents in the boundary region because SVM is very sensitive to noise; when the noise level is high, the resulting classifier is very poor. Therefore, for better classification we use the strong positive and negative samples (POS, NEG) to build SVMB in our model, as in step 10.

Algorithm 1. Training stage of the MSVMs-SW model
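A minimal training sketch follows, assuming scikit-learn is available as a stand-in for the SVM implementation used in the paper: LinearSVC plays the role of the linear-kernel SVM with C = 1 (Sect. 4.2), documents are raw text strings, `unlabeled` is the testing set U already ranked by score(d), and the region variables come from the partition above. Whether the pseudo-feedback documents feed only SVMP or the positive set as a whole is not fully specified in the paper; here they are added to the positive set. All names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def fit_svm(pos_docs, neg_docs, vectorizer):
    """Train one linear SVM on the given positive/negative documents."""
    X = vectorizer.transform(pos_docs + neg_docs)
    y = [+1] * len(pos_docs) + [-1] * len(neg_docs)
    return LinearSVC(C=1.0).fit(X, y)              # linear kernel, C = 1

def train_msvms(POS, B_pos, B_neg, NEG, unlabeled, top_k=5):
    """Sketch of Algorithm 1: build SVM_P, SVM_N and SVM_B from the three regions."""
    vectorizer = CountVectorizer().fit(
        [d for d, _ in POS + B_pos + B_neg + NEG] + list(unlabeled))

    # Pseudo feedback (steps 1-6): add the top-k highest-scoring unlabeled
    # documents to the positive side (k = 5 performed best in the experiments).
    pos_docs = [d for d, _ in POS] + list(unlabeled[:top_k])

    svm_p = fit_svm(pos_docs, [d for d, _ in B_neg + NEG], vectorizer)   # step 8: POS vs B- ∪ NEG
    svm_n = fit_svm(pos_docs + [d for d, _ in B_pos],
                    [d for d, _ in NEG], vectorizer)                     # step 9: POS ∪ B+ vs NEG
    svm_b = fit_svm(pos_docs, [d for d, _ in NEG], vectorizer)           # step 10: POS vs NEG
    return svm_p, svm_n, svm_b, vectorizer
```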

3.2 Testing Stage of MSVMs-SW Model

In the testing phase, each stage uses a different classification model, as shown on the right side of Fig. 1. The SVMP classification model concentrates on identifying positive documents. In this stage, documents classified as positive are counted as TP1 (true positives of stage one) if they are truly positive, or as FP1 (false positives of stage one) if they are actually negative. The objective of this stage is to achieve a high precision rate for positive documents and to minimize the FP rate, with an acceptable false negative rate FN. The SVMN classifier is applied in stage two to the documents that were predicted as negative in stage one. This stage focuses on increasing the precision rate for negative documents: documents classified as negative are counted as TN1 if they are truly negative, or as FN1 if they are actually positive. The documents predicted as positive in this stage are still uncertain, so the classifier collects them into the boundary set BND. To classify these documents, we use the final classifier, SVMB, which assigns them as positive or negative and produces four outputs, namely, TP2, FP2, TN2, and FN2. In our proposed classifier model, the true positives are TP = TP1 + TP2, the false positives FP = FP1 + FP2, the true negatives TN = TN1 + TN2, and the false negatives FN = FN1 + FN2, as listed in Algorithm 2.

Algorithm 2. Testing stage of the MSVMs-SW model
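The cascade of Algorithm 2 can be sketched as follows, reusing the classifiers and vectorizer from the training sketch above; the function name and return format are illustrative, and the TP/FP/TN/FN bookkeeping against gold labels is omitted for brevity.

```python
def classify_documents(test_docs, svm_p, svm_n, svm_b, vectorizer):
    """Sketch of Algorithm 2: cascade SVM_P -> SVM_N -> SVM_B over the test set.

    Returns (document, predicted_label) pairs with labels in {+1, -1}."""
    X = vectorizer.transform(test_docs)
    results, stage2_idx = [], []

    # stage 1: SVM_P harvests confident positives (counted as TP1 / FP1)
    for i, y in enumerate(svm_p.predict(X)):
        if y == +1:
            results.append((test_docs[i], +1))
        else:
            stage2_idx.append(i)                   # passed on to SVM_N

    # stage 2: SVM_N confirms confident negatives (TN1 / FN1);
    # documents it predicts positive fall into the boundary set BND
    boundary_idx = []
    if stage2_idx:
        for i, y in zip(stage2_idx, svm_n.predict(X[stage2_idx])):
            if y == -1:
                results.append((test_docs[i], -1))
            else:
                boundary_idx.append(i)

    # stage 3: SVM_B decides the remaining uncertain documents (TP2/FP2/TN2/FN2)
    if boundary_idx:
        for i, y in zip(boundary_idx, svm_b.predict(X[boundary_idx])):
            results.append((test_docs[i], int(y)))
    return results
```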

4 Experiments and Evaluation

4.1 Dataset and Evaluation Metrics

To evaluate the performance of our proposed model, the RCV1 dataset, which consists of 100 topics, was used. Each topic has been divided into training and testing sets with relevance judgements. The RCV1 corpus contains more than 804,000 documents, which are news stories in English published by Reuters journalists [27]. These documents are grouped into 100 collections with 100 different topics. In our experiments, we used the first 50 topics, for which the experiments are more reliable.

The effectiveness of the MSVMs-SW model and the baselines was measured using the F1-score and Accuracy, two metrics that are widely used in text classification research; for more details of these measures, refer to [6]. We also used t-test p-values to analyse the significance of the differences between the results of the MSVMs-SW model and those of the baselines.

4.2 Baseline Models and Settings

In order to make an extensive evaluation, we compared our proposed model with six baseline models. These are influential state-of-the-art models: libSVM, SVMperf [28], J48 [29], NB [3], IBk (Instance-Based Learning), and Rocchio. All six models were trained and tested on the same dataset, and each was run with its best settings obtained through experimental practice. For libSVM, the default settings were not used because the classifier's F1-scores are low with the default settings; different kernel functions and values of C were tested, and we found that setting k = 0 (linear kernel) and C = 1 gave better results. In addition, we set C = 10 for SVMperf, the best value recommended in [9]. For our proposed model we used the linear kernel because it is fast and efficient with very large numbers of features, as in document classification. For the BM25 parameters, k1 and b were set to 1.2 and 0.75, respectively.
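For reproducibility, the reported settings map onto a modern toolkit roughly as follows. This is a sketch only: the original experiments used the libSVM and SVMperf tools directly rather than scikit-learn, and the object names below are illustrative.

```python
from sklearn.svm import SVC, LinearSVC

# libSVM-style baseline: kernel type 0 (linear) with C = 1, the best setting found
libsvm_like_baseline = SVC(kernel="linear", C=1.0)

# component classifiers of the MSVMs-SW model: linear kernel, efficient for
# very high-dimensional document vectors
msvms_component = LinearSVC(C=1.0)

# SVMperf was run with C = 10 (the value recommended in [9]); it has no direct
# scikit-learn equivalent, so it is not reproduced here.

# BM25 tuning parameters used throughout the experiments
BM25_K1, BM25_B = 1.2, 0.75
```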

4.3 Experimental Results

The experimental results of the MSVMs-SW model and the baseline models are presented in Table 1. These results are averages over the 50 collections of the RCV1 dataset. The comparison between the proposed MSVMs-SW model and the six baseline models was made using two measures, F1 and Accuracy. The results in Table 1 are grouped into two categories: the first group includes the two SVM models (libSVM and SVMperf); the second group includes the other popular classifiers.

Table 1. Evaluation results of our model compared with the baselines.

Table 1 shows that our proposed model outperformed all baseline models for text classification. Compared to the SVM models, the MSVMs-SW model was significantly better on average, with a minimum improvement of 4.3% and a maximum improvement of 36.1%. Compared to the IBk model, which has the highest Accuracy in the second group, the F1 and Accuracy of the MSVMs-SW model improved significantly, by 40.1% and 2.6%, respectively.

The t-test p-values in Table 2 also indicate that the proposed model is statistically significantly better than the baseline models, with p-values < 0.05 on both F1 and Accuracy for both one-tailed and two-tailed tests.

Table 2. The p-values (one-tailed/two-tailed) of the baseline models in comparison with the MSVMs-SW model on RCV1.

To test the effectiveness of using multiple SVMs in our proposed model, we performed the same experiments with a single SVM classifier trained on the original training set. The aim of using multiple SVMs was to make the decision boundary better: we tried to separate the uncertain boundary in order to identify a clear boundary for both the relevant and irrelevant parts. Table 3 shows the performance of the single SVM classifier and the multiple SVMs on the RCV1 dataset, using precision, the F1-measure, and Accuracy as measures for comparison.

Table 3. Multiple-SVMs Results compared with single SVM on RCV1.

Table 3 shows that using multiple SVMs achieved an average increase of 30.4% in F1. In terms of precision, the multiple SVMs showed the best performance, especially for the relevant part (Precision+), with a 7.8% improvement on average. It is clear that using multiple SVMs instead of a single one leads to better classification and improves the overall accuracy on data having uncertainty.

Based on the results presented above, the MSVMs-SW model improved binary classification with the highest scores in both F1 and Accuracy, and particularly in F1, which best reflects the real situation in text classification.

5 Conclusion

The MSVMs-SW model was proposed to deal with data having uncertainty, a situation in which it is difficult to obtain good results when using non-linear SVM. The model uses the training set effectively to achieve high classification accuracy and to improve the performance of binary text classification. It handles uncertainty by dividing the training set into three regions, namely, positive, negative, and boundary, in order to improve the certainty of both the relevant and irrelevant parts and to reduce the impact of uncertainty in the boundary part. The partitioning of the training set is achieved by applying an effective SW technique and threshold setting and then reorganizing the training samples to generate new training sets. After the boundary region is identified, we use multiple SVMs instead of a single one to learn the classifiers and to classify new incoming documents. The experimental results on the standard RCV1 dataset show that the proposed model achieves significant improvements in F1 and Accuracy, especially F1, and outperforms existing classifiers, including state-of-the-art classifiers.