1 Introduction

Classification is an active research problem in the field of machine learning (Domingos 2012; Jordan and Mitchell 2015). It has been widely used in many areas, including document classification (Manevitz and Yousef 2002), biological medicine (Zeng et al. 2016) and face recognition (Cao et al. 2011; Su et al. 2009). The traditional classification paradigm relies heavily on labeled data to obtain an effective classifier. However, large amounts of labeled instances are often difficult, expensive or time-consuming to obtain, since labeling requires considerable manpower and material resources, while unlabeled instances may be easy to collect. Consequently, many practical applications offer a great quantity of unlabeled data but only a few labeled data. In this scenario, traditional supervised classification often fails to learn an appropriate classifier. Semi-supervised classification (SSC) (Zhu and Goldberg 2009; Sakai et al. 2017) addresses this problem: SSC aims to train better classifiers by using the abundant unlabeled data together with the few labeled data. Various SSC methods have been proposed. The common methods are as follows:

Generative method (Narayanaswamy et al. 2017; Nigam et al. 2000). This method assumes that all data, labeled and unlabeled, are generated by the same model. The assumption allows techniques such as the expectation maximization (EM) algorithm to estimate the model parameters and the labels of unlabeled data. Generative methods differ mainly in the hypothesized model, and different hypothesized models produce different methods. In addition, the EM algorithm depends strongly on the choice of initial values. Common remedies are to repeat the computation from multiple initial values and select the best result, or to obtain the optimal parameters by an optimization algorithm. These remedies reduce the sensitivity to initialization but increase the computational cost.

Graph-based method (Wang et al. 2013; Liu et al. 2014). This method maps all samples into a weighted graph. The nodes of the graph are the labeled and unlabeled samples, and the undirected edge between two vertices \({x_i}\) and \({x_j}\) represents the similarity of the two samples. Semi-supervised learning then corresponds to the process of label propagation on the graph. A graph corresponds to a matrix, so semi-supervised learning can be carried out through matrix operations, which makes it easy to explore the properties of graph-based algorithms through matrix analysis. However, the approach has obvious defects. For example, regarding storage, if the number of samples is m, the size of the matrix in the algorithm is \(O({m^2})\), so it is difficult to deal with large-scale data directly.

Semi-supervised support vector machine (S3VM) (Chen et al. 2014). S3VM attempts to find a partition hyperplane that separates the two classes of labeled data and traverses a low-density region. The most prominent S3VM is the transductive support vector machine (TSVM). Like the standard SVM, TSVM is proposed for binary classification problems. TSVM considers various label assignments to the unlabeled samples, labeling each unlabeled sample positive or negative, and among all these assignments it attempts to find a partition hyperplane that maximizes the margin over all samples. Once the partition hyperplane is determined, the final label assignment of the unlabeled data is the predicted result. Obviously, attempting all possible assignments is infeasible when there are massive unlabeled samples, so more efficient optimization strategies must be considered. Therefore, one of the key research problems of S3VM is how to design efficient optimization strategies.

Co-training method (Zhou and Li 2010; Zhang et al. 2014). Co-training is an important disagreement-based method originally designed for multi-view data. It assumes that the data have two or more sufficient and conditionally independent views. Each view trains a classifier individually on the labeled data; then, the classifiers learn from each other to predict the labels of unlabeled data. The two processes continue until the classifiers no longer change or a preset number of iterations is reached. In the co-training algorithm, if sufficient and conditionally independent feature views can be found, the performance of the trained classifier improves greatly. However, the conditional independence of features is often difficult to satisfy in practical applications, so co-training frequently fails to enhance classification accuracy.

Self-training method (Yun et al. 2012; Tanha et al. 2017). As its name implies, self-training iteratively enlarges the labeled training set. First, an initial classifier is trained with the small amount of labeled data. Second, the unlabeled samples selected with the highest confidence are incrementally added to the labeled set with their predicted labels, and the classifier is retrained. These steps are repeated until convergence.
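To make the loop concrete, the following is a minimal sketch of generic self-training (the function name, confidence threshold and KNN base classifier are our own illustrative choices, not part of any cited method):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def self_training(X_l, y_l, X_u, threshold=0.9):
    """Generic self-training: repeatedly move the most confident
    predictions on the unlabeled pool into the labeled set."""
    clf = KNeighborsClassifier(n_neighbors=3)
    while len(X_u) > 0:
        clf.fit(X_l, y_l)
        proba = clf.predict_proba(X_u)
        conf = proba.max(axis=1)              # confidence of each prediction
        picked = conf >= threshold
        if not picked.any():                  # nothing confident enough: stop
            break
        new_labels = clf.classes_[proba[picked].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[picked]])
        y_l = np.concatenate([y_l, new_labels])
        X_u = X_u[~picked]
    return clf.fit(X_l, y_l)
```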

Among the above SSC methods, one of the most effective and concise is self-training, which requires neither explicit feature splits nor specific assumptions. It has been widely used in many practical applications, such as text classification (Pavlinek and Podgorelec 2017), semantic segmentation (Zou et al. 2018) and sentiment classification (Zhang and He 2013). However, the effectiveness of self-training is limited by the number of initial labeled data and their distribution. If the amount of initial labeled data is quite small and does not represent the distribution of the entire data set, the initial classifier trained on it may perform poorly and easily misclassifies unlabeled samples. The updated classifier becomes worse as mislabeled samples are added directly into the next iteration, and iterating in this way leads to error accumulation; consequently, the final classifier performs extremely poorly. Thus, identifying mislabeled data and accounting for the distribution of the entire data set play important roles in the self-training algorithm.

To address simultaneously the influence of the data set distribution and of mislabeled samples on classifier performance, this paper proposes a self-training algorithm based on density peaks and cut edge weight statistic (ST-DP-CEWS). In ST-DP-CEWS, the underlying space structure of the entire data set is found by the density peaks clustering method, and during this process representative unlabeled samples are selected first for label prediction. Then, the cut edge weight statistic (CEWS) is used to determine whether the predicted labels are correct. The density peaks clustering method and CEWS take full advantage of the space structure and of the information in unlabeled samples, and they deal with the problem that some samples may be misclassified. Thereby, error accumulation in the iterative process is decreased, and the performance of the trained classifier is increased effectively. The proposed framework has two main advantages: (a) it is not limited by the number of initial labeled data or the distribution of the entire data space; (b) it identifies mislabeled instances during the self-training process without prior conditions. The experimental results on twelve benchmark data sets clearly demonstrate the effectiveness of the proposed algorithm, which is superior to previous works.

The remainder of the paper is organized as follows: Sect. 2 reviews the related works. In Sect. 3, the proposed framework and algorithm are described. Section 4 provides the experimental study on twelve benchmark data sets. The conclusion of this paper and some future plans are discussed in Sect. 5.

2 Related works

In SSC, a sample is described by a d-dimensional vector of attributes plus class label as follows:

$$\begin{aligned} x_i = (x_i^1,x_i^2, \ldots ,x_i^d,{{y_i}} ) \end{aligned}$$
(1)

where \(x_i\) is the ith instance and \(i = 1,2, \ldots ,m\). \(y_i \) indicates the label of \(x_i\), and \({y_i} \in \{ {\omega _1},{\omega _2}, \ldots ,{\omega _r}\} \), where r is the number of classes. L is the labeled set, in which \(y_i\) is known, and U is the unlabeled set, in which \(y_i\) is unknown. In a typical SSC problem, the number of samples in U is much larger than that in L. \(L \cup U\) forms the training set \(T_\mathrm{R}\). In addition, some unseen data with the same characteristics as \(x_i\) but unknown labels form the testing set \(T_\mathrm{s}\). The purpose of SSC is to learn a better classifier C by using \(T_\mathrm{R}\), rather than L only, to predict the class labels of the unlabeled set U or the test set \(T_\mathrm{s}\).

2.1 Previous algorithms to improve self-training

Self-training is an iterative and autonomous learning process. The iterative procedure must constantly strengthen the requirements for selecting new samples. This process needs to be carried out with caution, as inappropriate operations cause unlabeled samples to be labeled incorrectly, and then no classifier with good performance is obtained. To improve the performance of the self-training algorithm and increase classification accuracy, many algorithms have been studied. The details are as follows:

Zhou and Li proposed the Tri-training algorithm (Li and Guo 2012), a co-training-style SSC algorithm. In contrast to previous algorithms that use two classifiers, Tri-training generates three classifiers from the initial labeled data set and then refines them with the unlabeled data set during the Tri-training process. In detail, an unlabeled data point is labeled for a classifier if the other two classifiers agree on its label under certain conditions. Tri-training does not require sufficient and conditionally independent views, so it applies to ordinary data sets. However, its performance is usually unstable because unlabeled samples may often be labeled wrongly during the training process. Moreover, the important information carried by unlabeled samples is not utilized.

As unlabeled data may contain crucial information about the data space, Gan et al. proposed to improve the self-training algorithm with fuzzy c-means clustering (ST-FCM) (Gan et al. 2013). Fuzzy c-means (FCM) clustering is integrated into self-training as a helping strategy: it reveals the underlying space structure using both the labeled and unlabeled sets. In each iteration, FCM generates the membership degree of each unlabeled sample to the different classes, and the unlabeled samples with higher membership degrees are labeled by the classifier trained on the labeled data. Nevertheless, ST-FCM is not appropriate for data sets with non-Gaussian distributions, which appear quite often in real applications, and it may not find the real decision boundary. The reason may be that the FCM algorithm cannot discover the space structure of non-Gaussian data.

To avoid being limited by the distribution of the initial labeled samples and of the entire data space, a self-training SSC algorithm based on density peaks of data (ST-DP) (Wu and Shang 2018) was proposed as an improvement of ST-FCM. In ST-DP, the underlying data space structure is discovered by clustering based on density peaks (Rodriguez and Laio 2014a), and this structure is integrated into the self-training process to train a better classifier. Clustering based on density peaks can discover the underlying structure of a data set whether it is spherically or non-spherically distributed. However, ST-DP does not work well on data sets with large amounts of strongly overlapping data.

Wu et al. then proposed a self-training SSC algorithm based on density peaks of data and differential evolution (ST-DP-DE) (Wu et al. 2018). ST-DP-DE uses the differential evolution (DE) algorithm to optimize the positions of newly labeled data (Di et al. 2017) during the self-training process, and this optimization is incorporated into the framework proposed in ST-DP. Although ST-DP-DE alleviates the overlapping problem to some extent, it does not fundamentally remove the defect of ST-DP, and the optimization introduces considerable extra computation. The main reason is that overlapping samples are easily misclassified during self-training, and DE cannot correct the wrong labels.

These previous algorithms have shown promising prospects, but several issues remain to be considered, and the methods may work ineffectively in some circumstances. The four techniques above update the classifier by adding unlabeled samples with their predicted labels to the labeled set. However, the predicted labels may be incorrect when the number of initial labeled samples is very small, and the performance of the trained classifier is then degraded by the mislabeled data. Therefore, identifying mislabeled samples plays an important role in self-training algorithms.

2.2 Works related to mislabeled data

There are two main methods to identify the mislabeled samples (Triguero et al. 2014): One is based on the nearest neighbor rule, and the other is based on the classifier.

Methods based on the nearest neighbor rule include edited nearest neighbor (ENN), all KNN (ALLKNN), modified edited nearest neighbor (MENN) and nearest centroid neighbor edition (NCNEdit). ENN depends only on the distances from the sample to be classified: a sample is considered mislabeled if its label does not agree with the majority of its k nearest neighbors. ALLKNN is an extension of ENN in which the NN rule varies the number of neighbors from 1 to k, so the NN rule is applied k times; an example \(x_t\) is removed if it is misclassified by all k applications of the NN rule. MENN is a modified version of ENN: a data point \(x_t\) is considered mislabeled if its label does not agree with all of its \(k + l\) nearest neighbors, where l is the number of samples in \(L'\) at the same distance as the last neighbor of \(x_t\). NCNEdit slightly modifies ENN: for a given point \(x_t\), the notion of neighborhood considers not only the proximity of neighbors to \(x_t\), but also their symmetrical distribution around it. In particular, it takes into account the k nearest centroid neighbors (k NCNs), which are found iteratively: the first neighbor of \(x_t\) is its nearest neighbor \(x_r^1\), and the ith neighbor \(x_r^i\), \(i = 2, \ldots ,k\), is chosen so that the centroid of this and the previously selected neighbors, \(x_r^1, \ldots ,x_r^i\), is closest to \(x_t\).

The main idea of the classifier-based method is as follows: in each iteration, the existing labeled set is divided into n subsets. For each of these n parts, a learning algorithm, such as C4.5, is trained on the other \(n-1\) parts, yielding n different classifiers. Unlabeled samples are then classified by these classifiers, and their final labels are decided by consensus or majority voting schemes.
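As an illustration of the first family, here is a minimal sketch of the ENN rule (the function name is ours; it assumes integer-coded class labels, and the other editing rules differ mainly in how the neighborhood is formed):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_suspects(X, y, k=3):
    """Edited nearest neighbor: a sample is suspected mislabeled when its
    label disagrees with the majority label of its k nearest neighbors.
    Assumes integer-coded class labels."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)              # idx[:, 0] is the point itself
    neigh_labels = y[idx[:, 1:]]           # labels of the k true neighbors
    majority = np.array([np.bincount(row).argmax() for row in neigh_labels])
    return y != majority                   # True where the label is suspect
```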

Note that the nearest neighbor rule requires the distance measure and the value of k to be set in advance, and the classifier-based identification method is demanding about the division of samples and the selection of the learning algorithm. Inappropriate choices of these parameters lead to identification errors and affect the final classification. In addition, neither method utilizes the valuable information of unlabeled samples, which reduces identification accuracy.

Recently, the cut edge weight statistic (CEWS) was proposed to identify mislabeled instances (Muhlenbach et al. 2004). For a given sample, CEWS uses the sum of its cut edge weights as a statistic for hypothesis testing to determine whether the label of the sample is correct. CEWS was originally proposed to calculate a separability index in supervised learning (Zighed et al. 2002). Moreover, CEWS requires no parameters to be set in advance and can make full use of the information of unlabeled samples. Thus, in this paper, CEWS is integrated into the self-training process to identify mislabeled samples. CEWS is introduced briefly in the next section.

3 Self-training based on density peaks and cut edge weight statistic

To improve the performance of the self-training algorithm, two aspects are considered simultaneously: the distribution of the entire data set and the identification of mislabeled samples. In this paper, an improved self-training method based on density peaks of data and cut edge weight statistic (ST-DP-CEWS) is proposed and described in this section. In ST-DP-CEWS, the density peaks clustering method is first used to discover the underlying space structure of the data set, including both labeled and unlabeled samples. Then, the CEWS technique is used to identify mislabeled data. The space structure and the identification process are integrated into each iteration of the self-training framework. The proposed algorithm not only decreases the negative impact of mislabeled data on the trained classifier, but also takes the structure of the data space into account.

3.1 Clustering using density peaks for finding structure of data set

Clustering is a typical unsupervised learning method for analyzing unlabeled data (Jain 2010); it helps discover the underlying structure of the data space. Recently, a density peaks clustering algorithm was proposed to detect non-spherical clusters (Rodriguez and Laio 2014b), and it finds the correct number of clusters automatically. Density peaks clustering is an important clustering method that can discover the underlying structure of any data set, whether spherically or non-spherically distributed. For each sample \(x_i\), the algorithm computes two quantities: its local density and its distance from points of higher local density. The local density \(\rho _i\) of each data point \(x_i\) is defined as:

$$\begin{aligned} {\rho _i} = \sum \limits _j \chi ({d_{ij}} - {d_c}), \chi (x) = {\left\{ \begin{array}{ll} 1, &{} {x < 0}\\ 0, &{} \text {others} \end{array}\right. } \end{aligned}$$
(2)

where \(\chi (x)\) is the indicator function of x, and \(d_{ij}\) is the distance between data points \(x_i\) and \(x_j\). \(d_c\) is a cutoff distance without a fixed value; its choice depends on the data set (Wang and Xu 2017). The second measure \(\delta _i\) is the minimum distance between \({x_i}\) and any other data point with higher local density:

$$\begin{aligned} {\delta _i}= {\left\{ \begin{array}{ll} \min \limits _{j:{\rho _j} > {\rho _i}} ({d_{ij}}), &{} \exists j:{\rho _j} > {\rho _i}\\ \max \limits _j ({d_{ij}}), &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)

Note that the process of finding the underlying structure of the data space by density peaks clustering is introduced in great detail in (Wu and Shang 2018); here we provide only a brief introduction. After the two quantities \(\rho _i\) and \(\delta _i\) are computed for all samples, the real structure of the entire data space can be discovered by pointing each point \(x_i\) to its corresponding sample \(x_j\), the nearest point with local density higher than that of \(x_i\). Samples such as \(x_j\) that are pointed to are called “next” samples; meanwhile, samples such as \(x_i\) are called the “previous” samples of \(x_j\).
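A minimal sketch of how \(\rho _i\), \(\delta _i\) and the “next” pointers of Eqs. (2)-(3) might be computed (the function name is ours; the cutoff \(d_c\) is left to the caller, since its choice is data-dependent):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def density_peaks_structure(X, dc):
    """Compute local density rho (Eq. 2), distance delta (Eq. 3) and,
    for each sample, its 'next' sample: the nearest point of higher density."""
    d = squareform(pdist(X))                # pairwise distances d_ij
    rho = (d < dc).sum(axis=1) - 1          # chi(d_ij - d_c); minus the point itself
    n = len(X)
    delta = np.empty(n)
    nxt = np.full(n, -1)                    # -1 marks the global density peak
    for i in range(n):
        higher = np.where(rho > rho[i])[0]  # points of higher local density
        if higher.size == 0:                # x_i is the global density peak
            delta[i] = d[i].max()
        else:
            nxt[i] = higher[np.argmin(d[i, higher])]
            delta[i] = d[i, nxt[i]]
    return rho, delta, nxt
```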

3.2 Cut edge weight statistic to identify mislabeled data

Cut edge weight statistic (CEWS) is a technique for identifying and handling mislabeled instances. Its main idea is as follows: first, create a relative neighborhood graph on the data set and assign a weight to each edge; then, calculate the sum of cut edge weights for each data point; finally, apply a two-sided hypothesis test to identify the mislabeled instances. The details are described below.

To express the proximity between examples, a relative neighborhood graph is built on the data set. There is an edge between vertices \(x_i\) and \(x_j\) if the distance between the two vertices satisfies the following condition:

$$\begin{aligned} {{d(x_i,x_j) \le \max (d(x_i,x_m),d(x_j,x_m))}},\,\, \forall m \ne i,j \end{aligned}$$
(4)

where \(d({x_i},{x_j})\) denotes the distance between vertices \(x_i\) and \(x_j\). Unlike KNN editing techniques, the neighbors of a given data point \(x_i\) are found automatically through the above definition, without setting the number of neighbors in advance. Two samples connected by an edge are neighbors of each other. A cut edge is an edge connecting two samples with different labels. Intuitively, most examples in a neighborhood possess the same label, so the label of an example is suspected to be wrong if the example has many cut edges. Thus, cut edges play an important role in identifying mislabeled examples. In addition, different samples may have the same number of cut edges while the importance of each cut edge differs; therefore, a weight is assigned to each edge. We define \(w_{ij}\) as the weight of the edge connecting \(x_i\) and \(x_j\). There are two ways to define \(w_{ij}\), based on the distances or the ranks of the neighbors:

$$\begin{aligned} {w_{ij}} = {(1 + {d_{ij}})^{ - 1}} \quad \text {or} \quad {w_{ij}} = {1 \over {{r_j}}} \end{aligned}$$
(5)

where \(d_{ij}\) is the distance between \(x_i\) and \(x_j\), and \(r_j\) is the rank of the vertex \(x_j\) among the neighbors of the vertex \(x_i\).
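The following is a minimal, deliberately literal sketch of the graph construction of Eq. (4) with the distance-based weights of Eq. (5) (the function name is ours; the triple loop is \(O(m^3)\) and is kept simple rather than efficient):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def relative_neighborhood_graph(X):
    """Return the edges satisfying Eq. (4) with weights from Eq. (5):
    (i, j) is an edge iff no third point m is closer to both i and j
    than they are to each other."""
    d = squareform(pdist(X))
    n = len(X)
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            mask = np.ones(n, dtype=bool)
            mask[[i, j]] = False            # all m != i, j
            if np.all(d[i, j] <= np.maximum(d[i, mask], d[j, mask])):
                edges[(i, j)] = 1.0 / (1.0 + d[i, j])   # w_ij, Eq. (5)
    return edges
```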

In order to test whether a sample \(x_i\) is mislabeled, the sum of cut edge weights \(J_i\) running from this point is calculated. \(J_i\) is defined as follows:

$$\begin{aligned} {J_i} = \sum \nolimits _{j = 1}^{{n_i}} {{w_{ij}}{I_i}(j)}, \quad {I_i}(j)= {\left\{ \begin{array}{ll} 1, &{} {y_i} \ne {y_j}\\ 0, &{} {y_i} = {y_j} \end{array}\right. } \end{aligned}$$
(6)

where \(n_i\) denotes the number of samples belonging to the neighborhood of \(x_i\). \({y_i}\) is the class label of \(x_i\).

A sample \(x_i\) may be considered mislabeled if its \(J_i\) value is particularly large. Thus, \(J_i\) is selected as the statistic for hypothesis testing. First, the null hypothesis is defined as:

\(H_0\): the samples in the graph are labeled independently of each other on the basis of the same probability distribution \(\pi _y\), where \(\pi _y\) denotes the probability of class y, \(y \in \{ {\omega _1},{\omega _2}, \ldots ,{\omega _r}\} \).

Under \(H_0\), for any sample \(x_i\), each example in its neighborhood possesses a label other than \(y_i\) with probability \(1 - {\pi _{{y_i}}}\). Hence, an example \(x_i\) is considered correctly labeled if the value of \(J_i\) is significantly smaller than its expectation under \({H_0}\), and wrongly labeled if the value of \(J_i\) is abnormally greater than its expectation under \({H_0}\).

To test \(H_0\) with the statistic \(J_i\), a two-sided test is used, since both significantly smaller and abnormally greater values are of interest. The distribution of the statistic \({J_i}\) under \({H_0}\) must be studied first. The \({I_i}(j)\) are independent and identically distributed Bernoulli random variables with parameter \(1 - {\pi _{{y_i}}}\), where \(\pi _{{y_i}}\) is the global proportion of class \(y_i\) in the training set. As a result, the expectation \({\mu _0}\) and variance \({\sigma ^2}\) of \(J_i\) under \(H_0\) are as follows:

$$\begin{aligned} {{{\mu _0}}}= & {} (1 - {\pi _{{y_i}}})\sum \nolimits _{j = 1}^{{n_i}} {{w_{ij}}} \end{aligned}$$
(7)
$$\begin{aligned} {{{\sigma ^2}}}= & {} {\pi _{{y_i}}}(1 - {\pi _{{y_i}}})\sum \nolimits _{j = 1}^{{n_i}} {w_{ij}^2} \end{aligned}$$
(8)

Under the null hypothesis \(H_0\), \(J_i\) approximately follows the normal distribution \({J_i} \sim N({\mu _0},{\sigma ^2})\), and the test statistic is selected as follows:

$$\begin{aligned} u = {{{J_i} - {\mu _0}} \over \sigma } \end{aligned}$$
(9)

and thus \(u \sim N(0,1)\). Given the significance level \(\alpha \), the rejection region is

$$\begin{aligned} W = \{ \left| u \right| \ge {u_{1 - \alpha /2}}\} \end{aligned}$$
(10)

Thereby, the rejection region of \(J_i\) is

$$\begin{aligned} W = ( - \infty ,{\mu _0} - \sigma \cdot {u_{1 - \alpha /2}}] \cup [{\mu _0} + \sigma \cdot {u_{1 - \alpha /2}}, + \infty ) \end{aligned}$$
(11)

For sample \(x_i\), if the value of \(J_i\) is significantly less than the expectation under \(H_0\), that is, if \(J_i\) falls in the left rejection region, then \(x_i\) is labeled correctly; otherwise, it may be labeled wrongly. The detailed identification steps of CEWS are as follows:

(a) Create a relative neighborhood graph on data set S. Initialize the correctly labeled set \(T = \emptyset \) and the wrongly labeled set \(T' = \emptyset \).

(b) Calculate the weight of each edge. For each sample \(x_i\), calculate \(J_i\) and the expectation \({\mu _0}\) and variance \({\sigma ^2}\) of \(J_i\) under \(H_0\).

(c) Given the significance level \(\alpha \), obtain the rejection region for \(x_i\).

(d) If the value of \(J_i\) falls in the left rejection region, the label of \(x_i\) is correct; otherwise, \(x_i\) is a suspect sample whose label should be relabeled or re-predicted in the next iteration.

The pseudo-code of CEWS for identifying mislabeled samples in data set S is given in Algorithm 1.

Algorithm 1 (pseudo-code: CEWS for identifying mislabeled samples in data set S)
Algorithm 2 (pseudo-code: the proposed ST-DP-CEWS algorithm)
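As a companion to Algorithm 1, here is a minimal sketch of the test of Eqs. (6)-(11) under our reading (the function name is ours; `edges` is the weighted relative neighborhood graph of Eqs. (4)-(5)):

```python
import numpy as np
from scipy.stats import norm

def cews_suspects(edges, y, alpha=0.05):
    """Flag suspect labels via the cut edge weight statistic.
    edges: dict {(i, j): w_ij} from the relative neighborhood graph."""
    n = len(y)
    classes, counts = np.unique(y, return_counts=True)
    pi = dict(zip(classes, counts / n))       # global class proportions pi_y
    neigh = [[] for _ in range(n)]
    for (i, j), w in edges.items():
        neigh[i].append((j, w))
        neigh[j].append((i, w))
    u_crit = norm.ppf(1 - alpha / 2)          # u_{1 - alpha/2}
    suspect = np.zeros(n, dtype=bool)
    for i in range(n):
        if not neigh[i]:
            continue
        w = np.array([wij for _, wij in neigh[i]])
        J = sum(wij for j, wij in neigh[i] if y[j] != y[i])    # Eq. (6)
        p = pi[y[i]]
        mu0 = (1 - p) * w.sum()                                # Eq. (7)
        sigma = np.sqrt(p * (1 - p) * (w ** 2).sum())          # Eq. (8)
        # a label is kept as correct only if J_i falls in the
        # left rejection region of Eq. (11); anything else is suspect
        suspect[i] = J > mu0 - sigma * u_crit
    return suspect
```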

3.3 Proposed framework and algorithm

Self-training is a process of autonomous learning. In each iteration of the self-training algorithm, unlabeled samples are easily misclassified, and these errors propagate to the next iteration, degrading the trained classifier and reducing the performance of the algorithm. Identification of mislabeled samples is therefore important in the self-training process, especially in the early iterations. The CEWS method finds the neighbors of a sample according to Eq. (4), which avoids the negative impact of improper parameter selection. Therefore, to improve the performance of the self-training algorithm, CEWS is integrated into ST-DP to identify wrong labels. In this paper, ST-DP-CEWS is proposed to train a better classifier.

In ST-DP-CEWS, the space structure of the data set is first discovered by density peaks clustering. Representative unlabeled samples are selected preferentially for label prediction by utilizing the structure information in the iterative process, which improves the accuracy of the predicted labels. Then, CEWS is used to determine whether the predicted labels are correct, and the labeled set is gradually enlarged with the correctly labeled samples for the next training iteration. In this way, the disturbance of wrongly labeled samples to algorithm performance is decreased. These procedures are repeated until all unlabeled samples are labeled. Figure 1 describes the proposed algorithm scheme.

Fig. 1 Proposed algorithm scheme

Step 1. Discover the underlying space structure of the entire data set by finding the density peaks of the samples: each sample \(x_i\) is made to point to its unique nearest sample \(x_j\) with higher local density. The underlying structure of the entire data space is then used in Step 2 and Step 3, two similar steps that train a classifier.

Step 2. (a) An initial classifier is trained on the labeled set L with SVM or KNN as the base classifier.

(b) Select a subset \(L'\) from U, where each data sample \(x_k\) is the “next” sample of L according to the space structure. These samples are labeled by the trained classifier.

(c) Mislabeled samples in \(L'\) are identified by the CEWS algorithm, and the correctly labeled samples are retained. Then, update L and U and retrain the classifier.

(d) Repeat steps from (a) to (c) until all the “next” samples of L are labeled.

Step 3. (a) Select a subset \(L'\) from U, where each data sample \(x_k\) is the “previous” sample of the updated L. These samples are labeled by the classifier.

(b) Mislabeled samples in \(L'\) are identified by the CEWS algorithm, and the correctly labeled samples are retained. Then, update L and U and retrain the classifier.

(c) Repeat steps (a) and (b) until all the “previous” samples of L are labeled.

Obviously, Step 3 is similar to Step 2, except that “next” in Step 2 is replaced with “previous.” The pseudo-code of the proposed algorithm is described in Algorithm 2.
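The following compressed sketch shows one sweep of Step 2 (the function name and the callable `identify` are our own abstractions of the CEWS filter from the earlier sketches; Step 3 would mirror this sweep with the pointer direction reversed):

```python
import numpy as np

def st_dp_cews_sweep(X, y, labeled, nxt, clf, identify):
    """One sweep of Step 2: label the 'next' samples of L and keep only
    those that the CEWS-style filter `identify` does not flag as suspect.
    labeled: boolean mask of L; nxt: 'next' pointers from the density peaks."""
    y, labeled = y.copy(), labeled.copy()
    while True:
        cand = np.unique([nxt[i] for i in np.where(labeled)[0]
                          if nxt[i] >= 0 and not labeled[nxt[i]]])
        if cand.size == 0:                           # no 'next' samples left
            break
        clf.fit(X[labeled], y[labeled])
        y[cand] = clf.predict(X[cand])               # predicted labels for L'
        ok = ~identify(X, y, cand)                   # CEWS: keep correct ones
        if not ok.any():                             # everything rejected
            break
        labeled[cand[ok]] = True                     # enlarge L, shrink U
    clf.fit(X[labeled], y[labeled])
    return clf, y, labeled
```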

4 Experimental results and analysis

Twelve benchmark classification data sets are selected to validate the effectiveness of the proposed algorithm. These data sets are from the University of California Irvine (UCI) repository (Asuncion 2007) and the KEEL repository (Alcalá-Fdez et al. 2011). More information on these data sets is shown in Table 1. Samples with missing values are deleted from the Mammographic, Cleveland and Dermatology data sets; the remaining data sets are used as-is.

Table 1 Experimental data sets
Table 2 Parameters for all algorithms used in experiments

4.1 Implementation of experiment

In order to illustrate the effectiveness of the proposed algorithm ST-DP-CEWS, four algorithms are chosen for comparison with it: Tri-training (Li and Guo 2012), ST-FCM (Gan et al. 2013), ST-DP (Wu and Shang 2018) and ST-DP-DE (Wu et al. 2018). The parameters of these algorithms are shown in Table 2.

In the experiments, the tenfold cross-validation strategy is adopted. Each of the ten parts is selected in turn as the test set \(T_\mathrm{s}\), and the remaining nine folds form the training set \(T_\mathrm{R}\). In each experiment, \(10{{\% }}\) of the samples of \(T_\mathrm{R}\) are randomly selected as the initial labeled set L, and the remaining samples of \(T_\mathrm{R}\) form the unlabeled set U. Therefore, each data set is divided into three parts: L, U and \(T_\mathrm{s}\). To ensure the reliability of the experiment, the tenfold cross-validation is conducted ten times, and the average of the ten results is taken as the final result. In each tenfold cross-validation experiment, the accuracy rate (AR), mean accuracy rate (MAR) and standard deviation of AR (SD-AR) are selected as the bases for comparing algorithm performance. They are, respectively, computed as:

$$\begin{aligned} \mathrm{AR}{{ = }}{1 \over {{N_{{T_\mathrm{s}}}}}}\sum \limits _{i = 1}^{{N_{{T_\mathrm{s}}}}} {\psi (\omega ,f({x_i}))} \end{aligned}$$
(12)

where

$$\begin{aligned} \psi (\omega ,f({x_i}))= {\left\{ \begin{array}{ll} 1, &{} {{\omega = f({x_i})}}\\ 0, &{} \text {else} \end{array}\right. }. \end{aligned}$$
$$\begin{aligned}&\mathrm{MAR} = {1 \over n}\sum \limits _{k = 1}^n {\mathrm{AR}_k}. \end{aligned}$$
(13)
$$\begin{aligned}&\mathrm{SD-AR} = \sqrt{{1 \over n}\sum \limits _{k = 1}^n {{{(\mathrm{AR}_k - \mathrm{MAR})}^2}} }. \end{aligned}$$
(14)

Here, \(f({x_i})\) is the label predicted by the algorithm for \(x_i\), and \(\omega \) is the original label. \(\psi \) is an indicator function that determines whether the predicted label and the original label are the same. \({N_{{T_\mathrm{s}}}}\) is the number of samples in the test set \(T_\mathrm{s}\). n is the number of repeated computations of AR, and \(\mathrm{AR}_k\) is the result of the k-th computation. MAR represents the classification performance of an algorithm, and SD-AR represents its robustness.
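A minimal sketch of this evaluation protocol and of Eqs. (12)-(14) (the stratified selection is our own choice for stability; the paper specifies only a randomized \(10{{\% }}\) selection):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def ssc_splits(X, y, labeled_ratio=0.1, seed=0):
    """Tenfold CV: each fold yields index arrays for L, U and the test set T_s."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        L_idx, U_idx = train_test_split(
            train_idx, train_size=labeled_ratio,
            stratify=y[train_idx], random_state=seed)
        yield L_idx, U_idx, test_idx

def accuracy_rate(y_true, y_pred):
    """AR, Eq. (12): fraction of test samples whose prediction matches."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def mar_sd(ar_values):
    """MAR and SD-AR, Eqs. (13)-(14), over n repeated runs."""
    ar = np.asarray(ar_values, dtype=float)
    mar = ar.mean()
    return mar, np.sqrt(np.mean((ar - mar) ** 2))
```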

Table 3 Experimental results of tenfold cross-validation on data set Glass with KNN as base classifier (MAR ± SD-AR, \({{\% }}\))
Table 4 Experimental results of tenfold cross-validation on data set Glass with SVM as base classifier (MAR ± SD-AR, \({{\% }}\))
Table 5 Experimental results of tenfold cross-validation on data set Ionosphere with KNN as base classifier (MAR ± SD-AR, \({{\% }}\))
Table 6 Experimental results of tenfold cross-validation on data set Ionosphere with SVM as base classifier (MAR ± SD-AR, \({{\% }}\))
Table 7 Experimental results of tenfold cross-validation on different data sets with KNN as base classifier (MAR ± SD-AR, \({{\% }}\))

4.2 Analysis of experimental results

The classification results of tenfold cross-validation on the Glass and Ionosphere data sets are shown in Tables 3, 4, 5 and 6. As shown in these tables, Tri-training, ST-FCM, ST-DP and ST-DP-DE may yield worse classification accuracy than plain KNN or SVM. The reason might be that some unlabeled data samples are misclassified and used directly in the next iteration, so these methods fail to improve classification performance. By contrast, ST-DP-CEWS achieves better classification accuracy than both the base classifiers and the other four methods. We believe the reason is that ST-DP-CEWS uses CEWS to identify mislabeled data and exploits the distribution of the entire data set during the self-training process.

Table 8 Experimental results of tenfold cross-validation on different data sets with SVM as base classifier (MAR ± SD-AR, \({{\% }}\))
Fig. 2 The distribution of four data sets

Fig. 3 The relationship between MAR and the ratio of labeled data with KNN as base classifier

Fig. 4 The relationship between MAR and the ratio of labeled data with SVM as base classifier

The comparative experimental results on the remaining ten data sets are shown in Tables 7 and 8. As shown in these two tables, when KNN is used as the base classifier, the proposed algorithm may be less effective than the other algorithms on some data sets. To analyze the reason, four of the data sets are visualized in Fig. 2: Haberman, Banknote authentication, Wdbc and Heart. The Haberman, Banknote authentication and Wdbc data sets all contain an area in which samples of various classes are densely distributed. Samples in this area are difficult for the KNN classifier to classify, and it is hard to determine whether they are labeled incorrectly. The distribution of Heart is obviously different from the other three data sets: it is relatively balanced. In addition, note that the classification accuracy of the proposed algorithm on Cleveland and Climate... is not increased when SVM is the base classifier. The reason may be that most attribute values in these data sets are close to 0; for the same feature, the difference between samples is tiny, resulting in tiny differences among the samples overall. Hence, it is hard to find the decision boundary, which affects the classification result.

In summary, we can draw the conclusion from the experimental results that the proposed algorithm ST-DP-CEWS performs better than the other algorithms. The main reason is that ST-DP-CEWS uses CEWS to identify mislabeled samples; furthermore, the space structure of the entire data set and the valuable information of unlabeled samples are fully exploited in ST-DP-CEWS.

4.3 Impact of labeled data ratio

The performance of the proposed algorithm with respect to the labeled data ratio is discussed next. Figures 3 and 4, respectively, show the trend of the MAR of the different algorithms as the ratio of initial labeled data increases from \(10{{\% }}\) to \(50{{\% }}\).

It can be seen from Figures 3 and 4 that ST-DP-CEWS performs better than the other algorithms on the whole. When the ratio of labeled samples is very small, the classification accuracy of ST-DP-CEWS is significantly higher than that of the other algorithms. The reason is that ST-DP-CEWS uses CEWS to identify mislabeled samples while the labeled set is gradually expanded, which reduces the error accumulation caused by mislabeled samples in the iterative process and thereby improves classification accuracy. Figures 3 and 4 also show that the accuracy of ST-DP-CEWS approaches that of the other algorithms as the ratio of labeled data increases. This is because, when the number of labeled samples is sufficient, the average classification accuracy of these algorithms is relatively stable. ST-DP-CEWS is designed for the situation where very few labeled samples are available; it is mainly used in semi-supervised classification and is more suitable for environments with a low ratio of labeled samples.

Table 9 Result of the Wilcoxon test with KNN as base classifier
Table 10 Result of the Wilcoxon test with SVM as base classifier

4.4 Comparison with previous algorithms

To validate the effectiveness of our self-training algorithm, ST-DP-CEWS is compared with Tri-training, ST-FCM, ST-DP and ST-DP-DE. The Wilcoxon test and the Friedman test, executed with the KEEL software (Alcalá-Fdez et al. 2011), are used to detect statistical differences among the compared methods based on the PR values (Wang et al. 2015). Tables 9 and 10 collect the Wilcoxon test results, and Table 11 collects the average rankings of the algorithms by the Friedman test. As shown in Tables 9 and 10, ST-DP-CEWS achieves higher \(R^{+}\) values than \(R^{-}\) values in all cases. Moreover, the P value is less than 0.05, which means that ST-DP-CEWS has more reliable performance. In addition, according to the Friedman test in Table 11, ST-DP-CEWS exhibits the best ranking.

Furthermore, to verify whether the accuracy improvement of the proposed ST-DP-CEWS algorithm is statistically significant, the classification results of the KNN, SVM, Tri-training, ST-FCM, ST-DP, ST-DP-DE and ST-DP-CEWS algorithms are compared by the Friedman test (Fan et al. 2018; Zhang and Hong 2019). The Friedman test (Derrac et al. 2011) is a multiple-comparison test that aims to detect significant differences between the results of two or more algorithms. Its statistic F is as follows:

$$\begin{aligned} F = {{12N} \over {q(q + 1)}}\left[ {\sum \limits _{j = 1}^q {\mathrm{Rank}_j^2 - {{q{{(q + 1)}^2}} \over 4}} } \right] \end{aligned}$$
(15)

where N is the total number of data sets and q is the number of compared algorithms. \(\mathrm{Rank}_j\) is the average rank received by algorithm j over the classification results. The null hypothesis of the Friedman test is the equality of classification errors among the compared algorithms; the alternative hypothesis is defined as the negation of the null hypothesis. The test results, at the 0.05 significance level in a one-tailed test, are shown in Table 12. Clearly, the proposed ST-DP-CEWS self-training algorithm is significantly superior to the other algorithms.
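A small sketch of Eq. (15) (the ranking direction and the chi-square reference distribution reflect our reading of the standard Friedman test, not details stated in the paper):

```python
import numpy as np
from scipy.stats import rankdata, chi2

def friedman_test(errors):
    """Friedman statistic of Eq. (15) on an (N data sets x q algorithms)
    array of classification errors; lower error receives the better rank."""
    errors = np.asarray(errors, dtype=float)
    N, q = errors.shape
    ranks = np.vstack([rankdata(row) for row in errors])  # ranks per data set
    mean_rank = ranks.mean(axis=0)                        # Rank_j in Eq. (15)
    F = 12 * N / (q * (q + 1)) * (np.sum(mean_rank ** 2) - q * (q + 1) ** 2 / 4)
    return F, chi2.sf(F, df=q - 1)                        # statistic, p value
```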

Table 11 Average rankings of algorithms by Friedman test
Table 12 Friedman test for ST-DP-CEWS against compared other algorithms

5 Conclusion

In this paper, an improved self-training method based on density peaks and cut edge weight statistic is proposed. First, clustering based on density peaks is used to discover the underlying space structure of the data set, and representative unlabeled samples are selected preferentially for classification by utilizing the structure information. Second, the CEWS method is used to identify mislabeled data, and the correctly labeled samples are added to enlarge the labeled set. These procedures are repeated until all unlabeled samples are labeled. In the proposed algorithm, the valuable information of unlabeled samples is exploited in the course of discovering the space structure, and the mislabeled samples produced during self-training are handled by CEWS, which reduces the negative impact of wrong labels on the performance of the algorithm. The experimental results in this paper show the superiority of ST-DP-CEWS over the compared algorithms.

Certainly, the proposed algorithm has a weakness: in the identification process, predicted labels are judged either correct or incorrect, so the identification method essentially makes a zero-one decision. Two types of errors can occur while identifying mislabeled data. In the first, a correctly labeled sample is regarded as mislabeled and relabeled; in the second, a mislabeled sample is regarded as correctly labeled and retained. Both types of errors may harm classification performance. Therefore, in future work, each sample will be assigned a probability of being mislabeled, and samples with higher probabilities will be regarded as mislabeled data. We will further study how to use this probabilistic view to improve the performance of self-training.