Abstract
Most machine learning algorithms rely on having a sufficient amount of labeled data to train a reliable classifier. However, labeling data is often costly and time-consuming, while unlabeled data can be readily accessible. Therefore, learning from both labeled and unlabeled data has become an active research topic. Inspired by the co-training algorithm, we present a learning framework called CSCC, which combines semi-supervised clustering and classification to learn from both labeled and unlabeled data. Unlike existing co-training style methods that construct diverse classifiers to learn from each other, CSCC leverages the diversity between semi-supervised clustering and classification models to achieve mutual enhancement. Existing classification algorithms can be easily adapted to CSCC, allowing them to generalize from a few labeled data. In particular, to bridge the gap between class information and clustering, we propose a semi-supervised hierarchical clustering algorithm that utilizes labeled data to guide the process of cluster-splitting. Within the CSCC framework, we introduce two loss functions to supervise the iterative updating of the semi-supervised clustering and classification models, respectively. Extensive experiments conducted on a variety of benchmark datasets validate the superiority of CSCC over other state-of-the-art methods.
1 Introduction
In many real-world tasks, unlabeled data are readily available in abundance, while labeled data are often limited. As two typical learning paradigms, classification and clustering learn from labeled and unlabeled data, respectively. To learn a reasonably accurate classifier, classification algorithms require sufficient labeled data. On the other hand, exploiting only unlabeled data is not enough for clustering algorithms to find an optimal partition for the user. Therefore, developing a paradigm to learn from both labeled and unlabeled data has become a prominent topic in recent decades.
Semi-supervised classification algorithms incorporate unlabeled data into their loss functions based on specific model assumptions such as smoothness, cluster, or manifold assumptions (Van Engelen & Hoos, 2020). However, it is important to note that a mismatch between the problem structure and the model assumption can lead to a degradation in performance (Cholaquidis et al., 2020). Additionally, the incorporation of unlabeled data can make the loss functions nonconvex. By contrast, co-training (Blum & Mitchell, 1998) uses the predictions of base classifiers, known as pseudo-labeled data, to augment the training set. It does not rely on additional model assumptions, which helps to avoid the problems associated with mismatching and nonconvexity. Essentially, the success of co-training style methods relies on the diversity between base classifiers, which is typically achieved through different views. However, it is difficult to satisfy the strong assumption of having compatible but uncorrelated views in practice. While there has been some research (Goldman & Zhou, 2000; Jiang et al., 2013; Zhou & Li, 2005) on creating diversity through alternative means, it remains a challenging issue.
On the other hand, semi-supervised clustering algorithms (Basu et al., 2002; Wagstaff et al., 2001) use side information, in the form of class labels or pairwise constraints, to guide the clustering towards the desired partition. Typically, this side information is used to initialize the parameters of the clustering model or incorporated into the loss function to impose constraints on the clustering procedure. Liu et al. (2017) indicate that both the scale and quality of side information have a significant impact on semi-supervised clustering. Particularly, there is a gap between the clustering process and the side information. The improper utilization of side information can lead to a degradation in the clustering performance, as illustrated in Fig. 2.
In addition to semi-supervised learning, another approach directly combines clustering with classification to learn from both labeled and unlabeled data. In this approach, clustering is used as a pre-processing step for classification to achieve various objectives, such as reducing the size of the training set (Gallego et al., 2018), selecting more representative training data (Rashmi & Sankaran, 2019), or clustering features for high-dimensional data (Raskutti et al., 2002; Sachdeva et al., 2023). Based on the cluster assumption, some researchers partition the dataset into a set of disjoint clusters to facilitate the subsequent classification (Song et al., 2011). Some ensemble learning methods (Huang et al., 2023; Md. Jan & Verma, 2019; Verma & Rahman, 2011) use clustering techniques to partition the original data and generate a set of base classifiers that learn the boundaries between clusters. A fusion classifier is then used for the final prediction by mapping the confidences of the clusters to class decisions. In general, clustering is utilized as an auxiliary means of preprocessing data for classification, without leveraging its full potential.
Motivated by co-training, this paper presents a generalized framework for combining semi-supervised clustering and classification (CSCC) to learn from both labeled and unlabeled data. Unlike existing co-training style methods that focus on constructing diverse classifiers, CSCC takes advantage of the inherent diversity between semi-supervised clustering and classification to improve the generalization ability. In theory, any semi-supervised clustering and classification models can be integrated into CSCC to mutually enhance their performance. To maximize the strength of CSCC, we design a semi-supervised hierarchical clustering algorithm to bridge the gap between class information and clustering. With the supervision of labeled data, the proposed algorithm iteratively uses a cluster-splitting technique to refine the clustering result. Furthermore, we define loss functions to guide the iterative training of two components within the CSCC framework. Finally, the proposed method is validated through two series of experiments. The experimental results demonstrate the superiority of our method over other state-of-the-art methods. The main contributions of this paper are highlighted as follows.
1) We present a new learning paradigm that combines semi-supervised clustering and classification to learn from both labeled and unlabeled data. To the best of our knowledge, this paper is the first to incorporate semi-supervised clustering into a co-training framework. It provides a new approach to creating diversity for co-training.
2) To safely and efficiently leverage labeled data within CSCC, we propose a semi-supervised hierarchical clustering algorithm that uses labeled data to supervise the iterative cluster-splitting process.
3) The proposed method has been validated through extensive experiments on 27 representative datasets. The experimental results demonstrate the significant superiority of the proposed method over nine state-of-the-art algorithms.
The rest of this paper is organized as follows. Section 2 presents the background of the proposed method. In Sect. 3, we describe the CSCC framework and present a semi-supervised hierarchical clustering algorithm. The experimental evaluations are presented in Sect. 4. Finally, we present a conclusion and future work in the final section.
2 Related Works
2.1 Co-training Style Methods
Co-training style algorithms (Blum & Mitchell, 1998; Chen et al., 2022; Gong et al., 2022; Jiang et al., 2023a, 2023b) initialize two classifiers on two compatible but uncorrelated views (feature spaces) and iteratively train each classifier with the other’s confident predictions of unlabeled data. Multi-view learning (Ma et al., 2020; Sindhwani & Rosenberg, 2008) can be seen as a generalization of co-training, which takes advantage of multiple views. Note that there is a risk that the noise among pseudo-labeled data may be propagated to the next round of training. To reduce the noise, CoTrade (Zhang & Zhou, 2011) and Co-teaching (Han et al., 2018) apply data editing or cleaning techniques, respectively, to select pseudo-labeled data with higher confidence. Ma et al. combined co-training with the alternative optimization process of self-paced curriculum learning and presented a “draw with replacement” strategy to select pseudo-labeled data (Ma et al., 2017, 2020). Rather than iteratively training base classifiers, RKHS (Sindhwani & Rosenberg, 2008) and RANC (Ye et al., 2015) incorporate the information from other views as a regularization term into the loss function.
Essentially, co-training takes advantage of the diversity between base classifiers to improve the generalization ability. In traditional co-training style algorithms, the diversity comes from compatible but uncorrelated views. However, it is difficult to satisfy this requirement in many tasks. Alternatively, Goldman and Zhou trained two classifiers of different learning algorithms on the same view (Goldman & Zhou, 2000). Similarly, Jiang et al. presented a hybrid method (Jiang et al., 2013) combining a generative and a discriminative classifier under a co-training framework. Moreover, tri-training style methods (Dong-DongChen & WeiGao, 2018; Zhou & Li, 2005) generate three classifiers on different subsets of the original training set. Despite all this, creating diverse but accurate classifiers for co-training remains a challenging task. Distinct from existing co-training algorithms, our proposed method combines a semi-supervised clustering model and a classifier for mutual promotion in a co-training framework. It presents a novel approach to creating diversity for co-training.
2.2 Clustering-Based Classification Methods
Clustering is widely used to modify the training set or improve the feature representation for the subsequent classification tasks. Gallego et al. used clustering to down-sample training data to decrease the computation complexity (Gallego et al., 2018). Some class-imbalanced classification algorithms (Jiang et al., 2023a, 2023b; Lin et al., 2017) utilize clustering to improve the result of oversampling or undersampling. In addition, clustering is used for nonlinear dimensionality reduction (Rashmi & Sankaran, 2019) and feature selection (Jia et al., 2023; Song et al., 2011). ScanMix (Sachdeva et al., 2023) incorporates semantic clustering into deep neural networks to improve the feature representation. Raskutti et al. generated a new view for co-training by clustering both labeled and unlabeled data (Raskutti et al., 2002).
Clustering is also employed to provide constraints for classification based on the cluster assumption which asserts that data in one cluster are likely to share the same class label. Jan et al. (Md. Jan & Verma, 2019) generated a random subspace by incrementally clustering input data with a particle swarm optimization. A set of classifiers are then trained on these optimized clusters. Similarly, Verma et al. (Verma & Rahman, 2011) generated a set of base classifiers to learn the cluster boundaries. DRLSC (Xue et al., 2009) uses clustering to learn the local discriminative information and the manifold structure from unlabeled data. Subsequently, they are incorporated into the loss function as a graph-based regularization term. Notably, classification does not always benefit from the assistance of clustering due to its unsupervised nature. Thus, a few reported works turn to leveraging semi-supervised clustering. During self-training, semi-supervised clustering is introduced to estimate the data distribution to train a better classifier (Gan et al., 2013; Piroonsup & Sinthupinyo, 2018). SuperRLSC (Gan et al., 2018) uses supervised Kmeans to learn a graph-based regularization term and embeds it into the loss function of the classification.
In summary, existing methods typically utilize clustering as an auxiliary technique for data processing, while applying classification to enhance clustering remains relatively unexplored. This paper presents a new learning paradigm that enables semi-supervised clustering and classification to benefit from each other within a co-training style framework.
2.3 Semi-supervised Clustering Methods
Semi-supervised clustering uses side information, such as labeled data (Basu et al., 2002) or pairwise constraints (Melnykov & Melnykov, 2020; Wagstaff et al., 2001), to guide the clustering towards the desired partition. Compared with pairwise constraints, the labeling information is more suitable to describe data distribution at a high level. In Kmeans-style methods, labeled data can be used to initialize seeds or constrain the centroid-updating process (Basu et al., 2002; Jiang et al., 2022). In a similar way, the labeled data are used to estimate the local density parameters for density-based algorithms (Gertrudes et al., 2018). Another approach incorporates a regularization term into the loss function to penalize the partition inconsistent with the given label information (Zeng et al., 2013). Liu et al. proposed a partition level constrained clustering (PLCC) framework (Liu et al., 2017) where the class labels are added as additional dimensions to the original feature vector. In addition, semi-supervised clustering is combined with ensemble learning. Wu et al. proposed a Kmeans-based consensus clustering (KCC) algorithm (Wu et al., 2014). Within the ensemble clustering framework, Yu et al. presented a series of SSC ensemble methods (Yu et al., 2018) that assign different subsets of pairwise constraints to different ensemble members.
Based on the cluster assumption that data in one cluster share the same class label, semi-supervised clustering algorithms label each cluster according to its dominant labeled data. Thereby, semi-supervised clustering can be readily incorporated into co-training to work with a classifier. However, the cluster assumption does not always hold in existing semi-supervised clustering algorithms, as illustrated in Fig. 2. To address this issue, we propose a semi-supervised hierarchical clustering algorithm for CSCC in Sect. 3.2.
3 The Framework Combining Semi-supervised Clustering and Classification (CSCC)
3.1 Motivation
Essentially, co-training leverages the diversity between two classifiers to complement each other, thereby improving its generalization ability. As the key to the success of co-training, diversity can be achieved through different views, training subsets, or classification algorithms.
Motivated by clustering-based classification algorithms, we propose a co-training-like framework, CSCC, which combines semi-supervised clustering and classification to learn from each other. Clustering and classification offer distinct perspectives for grouping data. Therefore, we argue that the diversity harnessed by CSCC is more inherent compared to the diversity between the two classifiers. It presents a promising approach for co-training to generate and harness diversity.
The intuition of CSCC is depicted in Fig. 1. Initially, a traditional classification model, denoted as Λ, is trained on labeled data, while a semi-supervised clustering model, denoted as Θ, is learned on both labeled and unlabeled data. Subsequently, CSCC engages in an iterative co-training process between Λ and Θ. To be more specific, each model incorporates its confident predictions of unlabeled data (pseudo-labeled data) into the training set of the other model for the next round of retraining. Both models benefit from the diversity during co-training. The iterative process continues until a specific criterion is satisfied.
3.2 The Semi-supervised Hierarchical Clustering Algorithm
3.2.1 Intuition
To better exert the strength of semi-supervised clustering for CSCC, we propose a semi-supervised hierarchical clustering algorithm based on Kmeans. The intuition is to utilize labeled data to gradually find the optimum cluster number and centroids. Notably, there is a natural gap between the class information and the clustering, which can potentially misguide the label-based clustering. For example, there are two classes in Fig. 2a, each comprising two distinct clusters. Supervised by labeled data, Kmeans-style methods usually initialize two clusters with inappropriate centroids, resulting in a wrong partition, as shown in Fig. 2b. Moreover, irregular cluster shapes and variations in cluster sizes can further impede the performance of centroid-based clustering algorithms when utilizing class information.
According to the cluster assumption, data in one cluster are likely to share the same class label. Conversely, data belonging to the same class may be dispersed across multiple clusters. Our motivation is to initialize each cluster with a class of labeled data and then update the clusters on both labeled and unlabeled data.
If an updated cluster contains labeled data from different classes, it goes against the cluster assumption. In such cases, we divide the “impure” cluster into multiple subclusters, each of which only contains labeled data of one class. This cluster-splitting aims to rectify the clustering result and achieve a more satisfactory partition, as shown in Fig. 2c. Correspondingly, we propose a semi-supervised clustering method for Kmeans-style algorithms, which iteratively refines the clustering model with the supervision of labeled data until all labeled data within each cluster belong to the same class.
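The splitting idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the function names, the single Kmeans pass per round, the choice of sub-centroids as per-class means of the labeled members, and the toy 1-D data are all hypothetical.

```python
from collections import Counter, defaultdict


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for each point (ties go to the lower index)."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]


def labeled_groups(X_lab, y_lab, lab_assign, k):
    """Labeled members of cluster k, grouped by class."""
    groups = defaultdict(list)
    for x, y, a in zip(X_lab, y_lab, lab_assign):
        if a == k:
            groups[y].append(x)
    return groups


def split_impure_clusters(X_lab, y_lab, X_unl, max_rounds=20):
    X_all = list(X_lab) + list(X_unl)
    # Initialization: one cluster per class, seeded by that class's labeled data.
    classes = sorted(set(y_lab))
    centroids = [mean([x for x, y in zip(X_lab, y_lab) if y == c]) for c in classes]

    for _ in range(max_rounds):
        # Update every centroid on labeled and unlabeled data (one Kmeans pass).
        all_assign = assign(X_all, centroids)
        updated = []
        for k, c in enumerate(centroids):
            members = [x for x, a in zip(X_all, all_assign) if a == k]
            updated.append(mean(members) if members else c)
        centroids = updated

        # Split each cluster whose labeled members span several classes
        # (assumption: one sub-centroid per class present in the cluster).
        lab_assign = assign(X_lab, centroids)
        next_centroids, impure = [], False
        for k, c in enumerate(centroids):
            groups = labeled_groups(X_lab, y_lab, lab_assign, k)
            if len(groups) > 1:
                impure = True
                next_centroids.extend(mean(g) for _, g in sorted(groups.items()))
            else:
                next_centroids.append(c)
        centroids = next_centroids
        if not impure:
            break

    # Label each cluster by the dominant class among its labeled members.
    lab_assign = assign(X_lab, centroids)
    labels = []
    for k in range(len(centroids)):
        votes = Counter(y for y, a in zip(y_lab, lab_assign) if a == k)
        labels.append(votes.most_common(1)[0][0] if votes else None)
    return centroids, labels


# Hypothetical toy data: two classes, each made of two separate 1-D groups.
X_lab = [(0.0,), (1.0,), (10.0,), (11.0,), (4.0,), (5.0,), (20.0,), (21.0,)]
y_lab = ["A", "A", "A", "A", "B", "B", "B", "B"]
X_unl = [(0.5,), (10.5,), (4.5,), (20.5,)]
centroids, labels = split_impure_clusters(X_lab, y_lab, X_unl)
print(len(centroids), labels)
```

On this toy data the two class-seeded clusters are both impure after the first update, so each is split once and the procedure stops with four pure clusters.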
3.2.2 Loss Function and Algorithm
To begin with, we give a definition of the learning problem. Suppose Q is an unknown distribution defined on the instance space X, and Y = {y1,…, yC} is the class label set for X. Let DL = {XL,YL} = {(xi, yi)|i = 1,2,…,L} represent labeled data drawn from Q, and DU = {XU,YU} = {(xj, yj) |j = 1,2,…,U} represent unlabeled data whose labels YU are unobserved. Given a semi-supervised clustering algorithm, DL ∪ DU can be grouped into a cluster set P = {p1,…, pK}. Each pk ∈ P has a label sk ∈ Y, which is determined by the dominant labeled data in pk. Let M = {m1,…, mK} denote the centroid matrix of P, where each mk is the k-th centroid, estimated as
\[{m}_{k}=\frac{1}{|{p}_{k}|}\sum_{x\in {p}_{k}}x \quad (1)\]
where |pk| is the number of data in cluster pk.
Semi-supervised clustering algorithms aim to learn the clustering model Θ with the supervision of labeled data. For Kmeans-style algorithms, Θ can be expressed by the number of clusters in P and its centroid matrix M. The original loss function of Kmeans is defined as
\[\sum_{k=1}^{K}\sum_{x\in {p}_{k}}{\Vert x-{m}_{k}\Vert }^{2} \quad (2)\]
where ‖·‖ denotes the Euclidean distance. Given the number of clusters and initial centroids, Kmeans aims to minimize the sum of within-cluster distances. However, deciding the number of clusters and initial centroids is challenging. To address this issue, we propose a cluster-splitting technique and redefine the loss function for Kmeans-style algorithms as.
where λ ∈ [0,1] is a weighting parameter. The first item Err ∈ [0,1] is used to measure the empirical error, which is defined as
where \({H}_{i}({y}_{i},{y}_{i}^{\prime})\) is a discriminant function whose value is 1 when \({y}_{i}={y}_{i}^{\prime}\) and 0 otherwise. Y′L is the class label vector of XL predicted by Θ. For any xi, its predicted label yi′ is the label of its cluster, which is determined by the dominant labeled data. The clustering model Θ is then estimated by minimizing the loss function in Eq. (3) to seek an optimum partition. The corresponding solution is described in Algorithm 1.
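To make the two items of the loss concrete, the following Python sketch computes an empirical error on the labeled data and a Kmeans distance term, then combines them with λ. It is a hedged illustration: the convex combination λ·Err + (1 − λ)·distance is an assumption about the form of Eq. (3), the normalization step mentioned later is omitted, and all function names and the toy values are hypothetical.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def cluster_labels(assign_lab, y_lab, n_clusters):
    """Label each cluster by the dominant class among its labeled members."""
    labels = []
    for k in range(n_clusters):
        votes = Counter(y for a, y in zip(assign_lab, y_lab) if a == k)
        labels.append(votes.most_common(1)[0][0] if votes else None)
    return labels


def empirical_error(assign_lab, y_lab, labels):
    """Fraction of labeled points whose cluster label differs from their class."""
    wrong = sum(1 for a, y in zip(assign_lab, y_lab) if labels[a] != y)
    return wrong / len(y_lab)


def distortion(X, assignment, centroids):
    """Kmeans term: sum of squared distances to the assigned centroid."""
    return sum(sq_dist(x, centroids[a]) for x, a in zip(X, assignment))


def clustering_loss(err, dist, lam=0.7):
    """Assumed convex combination of the two items (not the paper's exact Eq. 3)."""
    return lam * err + (1.0 - lam) * dist


# Hypothetical toy example: four labeled 1-D points, two clusters.
X = [(0.0,), (1.0,), (2.0,), (9.0,)]
y = ["A", "A", "B", "B"]
centroids = [(1.0,), (9.0,)]
assignment = [0, 0, 0, 1]

labels = cluster_labels(assignment, y, len(centroids))
err = empirical_error(assignment, y, labels)
dist = distortion(X, assignment, centroids)
loss = clustering_loss(err, dist, lam=0.7)
print(labels, err, dist, loss)
```

The value λ = 0.7 mirrors the setting reported in the experiments. Because the distance term is unbounded while Err lies in [0,1], some rescaling of the two items is needed in practice before they are combined.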
In step 1, Θ is initialized with DL; then, a top-down cluster-splitting technique is applied in step 2 to find the optimal cluster number and initial centroids. Finally, the cluster centroids are iteratively updated in step 3. Considering that the two items in Eq. (3) may be on different scales, we normalize both items in the j-th (j > 1) iteration as.
Therefore, we have the final loss function in the j-th iteration:
The loss function in Eq. (6) is then used to guide the cluster-splitting in step 2. Due to the impact of noise or outliers, some newly generated clusters might degrade the performance. Therefore, we delete those clusters whose empirical errors are above the average level in step 2.5 for a better outcome. It is noteworthy that false predictions may degrade the clustering model when updating the cluster centroids. Therefore, the loss function in Eq. (6) is also used to supervise the updating of centroids in step 3.
Algorithm 1 The semi-supervised hierarchical clustering algorithm
After the semi-supervised clustering, unlabeled data in each cluster are given the same label as their cluster. Based on the prediction, the semi-supervised clustering model Θ can provide a classifier with pseudo-labeled data within the CSCC framework. To select reliable pseudo-labeled data, we estimate the confidence of x that belongs to class yi (yi ∈ Y):
where sj and mj denote the label and centroid of cluster pj, respectively. The P(pj|x; Θ) is the confidence that x belongs to cluster pj whose label sj = yi. Here, P(pj|x; Θ) is negatively correlated with the distance of x to its cluster centroid mj. Recall that a class consists of at least one cluster, while a cluster belongs to only one class. Therefore, when estimating the prediction confidence of x, we take into account its distances to all clusters with the label yi. Supposing pj represents a cluster whose cluster label is yi, we use the aggregate P(pj|x; Θ) to estimate P(yi|x; Θ) in Eq. (7).
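One way to realize such a confidence is sketched below in Python. The exponential weighting of distances and the summation over same-label clusters are assumptions chosen only because they satisfy the two properties stated above (confidence decreases with distance, and all clusters of a class contribute); they are not taken from Eq. (7). The function name and the example centroids are hypothetical.

```python
import math


def class_confidence(x, centroids, labels):
    """Estimate P(y | x) by aggregating cluster memberships per class label.

    Assumption: P(p_j | x) is proportional to exp(-distance(x, m_j)).
    """
    dists = [math.dist(x, m) for m in centroids]
    d_min = min(dists)
    # Subtract the minimum distance before exponentiating for numerical stability.
    weights = [math.exp(-(d - d_min)) for d in dists]
    total = sum(weights)
    conf = {}
    for w, s in zip(weights, labels):
        conf[s] = conf.get(s, 0.0) + w / total
    return conf


# Hypothetical clustering model: four centroids, two per class.
centroids = [(0.5,), (4.5,), (10.5,), (20.5,)]
labels = ["A", "B", "A", "B"]
conf = class_confidence((10.0,), centroids, labels)
print(conf)
```

A point near the second "A" centroid receives a high confidence for class A even though it is far from the first "A" centroid, which is the behavior the aggregation over clusters is meant to provide.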
3.3 The CSCC Algorithm
Let V1, V2 be pseudo-labeled datasets used for the semi-supervised clustering model Θ and classification model Λ, respectively. The task is to learn Θ and Λ on DL ∪ V1 ∪ DU and DL ∪ V2, respectively. Next, we provide a detailed illustration of CSCC in Algorithm 2. In step 1, we initialize a standard classification model Λ and run Algorithm 1 to learn Θ. Subsequently, Θ and Λ are iteratively retrained in step 2. In step 3, two learning models are combined for the final prediction of test data.
Algorithm 2 Co-training between semi-supervised clustering and classification (CSCC)
3.3.1 Selecting Pseudo-Labeled Data
The risk associated with co-training style algorithms arises from the presence of noise in pseudo-labeled data. A common remedy is to keep only the most confident predictions. However, selecting by confidence alone deviates from the i.i.d. (independent and identically distributed) assumption, thereby introducing distribution noise.
In step 2, we select the most confident predictions as pseudo-labeled data, according to the class proportion learned from initial labeled data. Therefore, we can acquire relatively accurate pseudo-labeled data that approximately obey the actual distribution. Considering that class imbalance can lead to a deficiency of pseudo-labeled data for minority classes, we undersample majority classes in V1 for Θ in step 2.4 and oversample minority classes in DL for Λ. Moreover, we use an incremental value del_n to gradually add pseudo-labeled data for Θ and Λ. To seek a balance between the quantity and the quality, we limit the number of pseudo-labeled data, n, such that it satisfies the condition n < 0.5*|DU| in step 2.9.
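The selection rule can be illustrated with a short Python sketch. It assumes that each class receives a quota equal to the floor of its proportional share of n, which is one plausible reading of "according to the class proportion"; the function name, the tuple layout of the predictions, and the example values are hypothetical, and the resampling of steps 2.4 is not shown.

```python
from collections import Counter


def select_pseudo_labeled(predictions, y_lab, n):
    """Pick up to n confident predictions with class-proportional quotas.

    predictions: list of (index, predicted_label, confidence) for unlabeled data.
    y_lab: labels of the initial labeled data, used only for class proportions.
    """
    # Enforce n < 0.5 * |DU|: the largest integer strictly below half.
    n = min(n, (len(predictions) - 1) // 2)
    if n <= 0:
        return []
    class_counts = Counter(y_lab)
    selected = []
    for label, count in class_counts.items():
        quota = int(n * count / len(y_lab))  # floor of the proportional share
        candidates = sorted((p for p in predictions if p[1] == label),
                            key=lambda p: p[2], reverse=True)
        selected.extend((idx, lab) for idx, lab, _ in candidates[:quota])
    return selected


# Hypothetical example: class A is three times as frequent as class B.
y_lab = ["A", "A", "A", "B"]
predictions = [
    (0, "A", 0.90), (1, "A", 0.60), (2, "B", 0.80), (3, "A", 0.95),
    (4, "B", 0.55), (5, "A", 0.70), (6, "A", 0.50), (7, "B", 0.65),
    (8, "A", 0.85), (9, "B", 0.40),
]
selected = select_pseudo_labeled(predictions, y_lab, n=6)
print(sorted(selected))
```

Here the requested n = 6 is capped at 4 because there are ten unlabeled points, and the quotas become three items for A and one for B. In an iterative setting, the caller would raise n by del_n after each round.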
3.3.2 Supervising the Training of Learning Models
To further mitigate the influence of noise, we evaluate both learning models in step 2 to prevent possible degradation. Recall that the clustering model Θ is evaluated according to Eq. (6). To supervise the iterative training of Λ, we propose a loss function as follows.
where T = DL ∪ V2 = {XT, YT} is the training set and λ1 ∈ [0,1] is a weighting parameter. The first item is used to measure the empirical risk defined in Eq. (4). The second item represents a structural risk, which is estimated according to the inter-class distance on T:
Within the CSCC framework, the proposed loss function is used to guide the iterative training of Λ by minimizing not only the empirical risk but also the structural risk. Considering that both items in Eq. (9) may be on different scales, we normalize both items in the t-th iteration, as done in Eq. (5). Therefore, we have the final loss function for the t-th iteration of step 2:
In step 2, Eq. (6) and Eq. (10) are used to supervise the iterative training of Θ and Λ, respectively, to decrease the risk of model degradation.
3.3.3 Combining Semi-supervised Clustering and Classification for the Final Prediction
To improve the generalization ability, we combine Θ and Λ in step 3 for the final prediction, as shown in Eq. (11).
where P(y|x; Θ) is the prediction probability of Θ, which is estimated according to Eq. (7). The P(y|x; Λ) is the prediction probability of Λ, which depends on a specific classification algorithm. The weighting parameter \(\mu\) is used to regulate the impact of Θ and Λ on final predictions. Usually, the value of \(\mu\) can be set by experience or cross-validation.
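A weighted combination of the two probability estimates might look like the following Python sketch. The form μ·P(y|x; Θ) + (1 − μ)·P(y|x; Λ) is an assumption about Eq. (11), including which model receives μ; the function name and the probability values are hypothetical.

```python
def combine_predictions(p_cluster, p_classifier, mu=0.5):
    """Blend class probabilities from the clustering model and the classifier.

    Assumption: score(y) = mu * P(y | x; Theta) + (1 - mu) * P(y | x; Lambda).
    Returns the predicted class and the per-class scores.
    """
    classes = sorted(set(p_cluster) | set(p_classifier))
    scores = {
        c: mu * p_cluster.get(c, 0.0) + (1.0 - mu) * p_classifier.get(c, 0.0)
        for c in classes
    }
    return max(classes, key=lambda c: scores[c]), scores


# Hypothetical probabilities for one test point where the two models disagree.
p_theta = {"A": 0.8, "B": 0.2}
p_lambda = {"A": 0.4, "B": 0.6}
label_equal, _ = combine_predictions(p_theta, p_lambda, mu=0.5)
label_clf, _ = combine_predictions(p_theta, p_lambda, mu=0.2)
print(label_equal, label_clf)
```

With equal weights the clustering model's stronger vote decides the label, while a small μ lets the classifier prevail, which is why μ is worth tuning by cross-validation when the two components differ in reliability.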
4 Experiments
In two series of experiments, the proposed method, CSCC, was compared with clustering-based classification methods and co-training style methods, respectively.
4.1 Comparisons with Clustering-Based Classification Methods
4.1.1 Setup
Datasets
For a fair comparison, we validate the proposed method on the same datasets used in SuperRLSC (Gan et al., 2018). Table 1 shows the details of 20 UCI datasets (Footnote 1), where the coefficient of variation (CV) is a statistical measure that expresses the relative variability or dispersion of a dataset. Following the experimental setting of SuperRLSC, we randomly divide each dataset into a training set and a test set while ensuring that both sets have almost equal numbers of data in each class.
Methods for Comparison
The CSCC is first compared with three clustering-based classification methods (Gan et al., 2018; Md. Jan & Verma, 2019; Xue et al., 2009). ECCS (Md. Jan & Verma, 2019) generates an ensemble of classifiers on a series of clusters optimized by a particle swarm optimization-based approach. DRLSC (Xue et al., 2009) learns the local discriminative information and the manifold structure from unlabeled data, which are then incorporated into the loss function as a regularization term. SuperRLSC (Gan et al., 2018) employs supervised Kmeans to build local and global data graphs to supervise the classification. Moreover, SVM and Seeded Kmeans (Basu et al., 2002) are employed as the baselines for the experimental comparison.
In the CSCC framework, we use SVM (Footnote 2) as the base classifier. Both the standard SVM and our SVM component use the default parameters of Libsvm (Footnote 2) with a linear kernel. Our method has two weighting parameters: λ, λ1 ∈ [0,1]. For the sake of simplicity, we empirically set λ = λ1 = 0.7. The regularization parameters in DRLSC and SuperRLSC are selected using tenfold cross-validation. All other parameters of ECCS, DRLSC, and SuperRLSC use their default settings.
4.1.2 Experimental Results
Table 2 shows the classification accuracy of different algorithms on 20 UCI datasets. Here, Kmeans-C and SVM-C represent two components of CSCC: the semi-supervised hierarchical clustering algorithm and a standard SVM, respectively. All results are the averages of 10 runs of the focused algorithms with random initializations.
The Effect of the Semi-supervised Cluster-Splitting Algorithm
Due to the different learning objectives between clustering and classification, it is unfair to compare the classification accuracies of semi-supervised clustering algorithms with those of classification algorithms. Thus, it is not surprising that Seeded Kmeans performs much worse than classification algorithms. Another reason, we argue, is that the natural gap between the clustering and classification information may misguide the clustering procedure. In contrast, Kmeans-C uses labeled data to supervise the cluster-splitting process, enabling it to find the optimum cluster number and centroids more effectively.
Table 3 shows the final cluster number after the cluster-splitting, where parenthesized values denote the number of classes. It is observed that more sub-clusters are identified by CSCC, especially on cmc, Vehicle, and Yeast. Obviously, the cluster-splitting technique is conducive to capturing the underlying data distribution and bridging the gap between the clustering and classification information.
Furthermore, the proposed centroid-updating technique decreases the impact of noise. As a result, Kmeans-C outperforms Seeded Kmeans by 12.8% on the average result. After the iterative training of CSCC, the accuracy of Kmeans-C is even close to SVM and DRLSC.
The Effect of the CSCC Framework
The CSCC takes advantage of the diversity between the semi-supervised clustering and classification to train its two components. As observed in columns 7 and 8 of Table 2, the performances of both components show improvements on nearly all datasets, except for slight declines on sonar, water, and cmc. This indicates that our tactic of selecting pseudo-labeled data effectively decreases noise. Furthermore, the supervision provided by the two proposed loss functions further reduces the risk associated with utilizing unlabeled data.
Initially, SVM-C obviously outperforms Kmeans-C. However, through co-training, Kmeans-C benefits more and shows greater improvement than SVM-C. Despite this, the accuracy of SVM-C still increases by 2.45%, demonstrating the effectiveness of incorporating semi-supervised clustering for final prediction.
The final label is determined by combining the predictions of semi-supervised clustering and classification. The CSCC algorithm outperforms its individual components on most datasets, except pid and cmc. Compared with Kmeans-C and SVM-C, the performance of CSCC increases by 4.76% and 1.23%, respectively. These results demonstrate the effectiveness of combining two components in CSCC.
The Comparison with Clustering-Based Classification Methods
Table 2 shows that CSCC performs better than its competitors in terms of average results. Additionally, CSCC achieves the best performance on 8 out of the 16 datasets. DRLSC utilizes structure information learned from unlabeled data to assist classification. However, without the guidance of class information, this learned information does not contribute significantly to the classification, especially when an ample amount of labeled data is available. As a result, DRLSC performs similarly to SVM. Moreover, the ensemble method ECCS (Md. Jan & Verma, 2019) does not exhibit significant improvement over SVM. In contrast, SuperRLSC employs supervised Kmeans to find meaningful information for the classification, leading to superior performance.
Rather than using clustering as an auxiliary means, CSCC takes advantage of the diversity between semi-supervised clustering and classification to improve the generalization ability. As shown in Table 2, CSCC outperforms other methods. However, unlike SuperRLSC, our method does not utilize tenfold cross-validation to select the best parameters for different datasets. We argue that this is the main reason why the superiority of CSCC over SuperRLSC is not extremely pronounced. It is widely acknowledged that semi-supervised methods, especially co-training style methods, are particularly effective when labeled data are scarce. However, this series of experiments was conducted with an abundance of labeled data. Therefore, it is not surprising that the majority of semi-supervised methods do not exhibit significant superiority over SVM on some datasets.
To quantify the statistical differences between CSCC and other clustering-based classification methods, we conducted the paired Wilcoxon signed-rank test (Pratt, 1959). All significance levels are measured at 5% in Table 4, where p-value is calculated to compare the accuracy values presented. If the p-value is greater than 0.05, then the null hypothesis that there is no significant difference between the performances of two methods cannot be rejected. Conversely, if the p-value is less than or equal to 0.05, then the null hypothesis is rejected, indicating a significant difference.
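For readers who want to see the mechanics of the test, the following Python sketch computes the signed-rank statistics and an approximate two-sided p-value. It is a simplified illustration: zero differences are dropped (Pratt's procedure treats them differently), the p-value uses the normal approximation without tie or continuity correction, and the accuracy values are hypothetical rather than taken from Table 2.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Paired signed-rank statistics with a normal-approximation p-value."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return w_plus, w_minus, p_value


# Hypothetical accuracies (in percent) of two methods on eight datasets.
acc_method_1 = [85, 90, 78, 92, 88, 75, 81, 95]
acc_method_2 = [80, 88, 79, 85, 84, 70, 78, 91]
w_plus, w_minus, p_value = wilcoxon_signed_rank(acc_method_1, acc_method_2)
print(w_plus, w_minus, p_value < 0.05)
```

With only a handful of datasets the normal approximation is rough, so an exact test from a statistics library is preferable for reported results.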
According to the results in Table 4, CSCC is statistically superior to other methods, except for SuperRLSC. The results of the aligned Friedman’s test comparing CSCC along with the other methods are presented in Table 5. It is observed that CSCC achieves the highest ranking among all six methods. Besides, the p-value obtained from the aligned Friedman’s test is very low (0.006592), indicating a significant difference between the methods compared.
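As a rough illustration of the ranking behind an aligned Friedman comparison, the sketch below computes the mean aligned rank of each method. It rests on several assumptions: the scores and method names are hypothetical, the function name mean_aligned_ranks is ours, rank 1 is assigned to the largest aligned accuracy, and the test statistic and p-value are deliberately left out.

```python
def mean_aligned_ranks(scores):
    """scores maps a method name to its accuracies, one per dataset (same order)."""
    methods = list(scores)
    n_datasets = len(scores[methods[0]])
    aligned = []  # (aligned accuracy, method)
    for d in range(n_datasets):
        # Align each dataset by subtracting its mean accuracy over all methods.
        location = sum(scores[m][d] for m in methods) / len(methods)
        for m in methods:
            aligned.append((round(scores[m][d] - location, 9), m))
    # Rank all aligned observations jointly; rank 1 is the largest (assumption).
    aligned.sort(key=lambda t: -t[0])
    rank_sum = {m: 0.0 for m in methods}
    pos = 0
    while pos < len(aligned):
        end = pos
        while end + 1 < len(aligned) and aligned[end + 1][0] == aligned[pos][0]:
            end += 1
        avg = (pos + end) / 2 + 1  # tied observations share their average rank
        for j in range(pos, end + 1):
            rank_sum[aligned[j][1]] += avg
        pos = end + 1
    return {m: rank_sum[m] / n_datasets for m in methods}


# Hypothetical accuracies (%) of three methods on three datasets.
toy_scores = {"A": [90, 80, 70], "B": [85, 75, 65], "C": [80, 70, 60]}
toy_ranks = mean_aligned_ranks(toy_scores)
```

A lower mean aligned rank indicates a better method under the convention assumed here.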
The Impact of the Weighting Parameters λ
There are two weighting parameters λ, λ1 ∈ [0, 1] used in CSCC for the semi-supervised clustering and classification models, respectively. In the following, we explore the impact of λ and λ1 on six representative datasets. Figure 3 shows the accuracy curves of CSCC when λ and λ1 vary from 0 to 1.
In general, the accuracy changes only slightly across different values of λ and λ1 on the six datasets, particularly on pid and glass. This indicates that CSCC is not significantly sensitive to λ and λ1 on these datasets. As shown in Fig. 3, when λ and λ1 vary from 0 to 1, most performance curves rise slightly at first, demonstrating the benefit of guidance from labeled data. After a relatively flat stage, most curves tend to end with a slight decline when λ, λ1 = 0.8 or 0.9. These results show that the use of unlabeled data is also crucial. Overall, the best performance in Fig. 3 is usually achieved when λ and λ1 are between 0.5 and 0.8. This suggests that there is an optimal balance between the impacts of labeled and unlabeled data within this range.
4.2 Comparisons with Co-training Style Methods
4.2.1 Setup
Methods for Comparison
In this subsection, CSCC is further compared with five co-training style algorithms: Co-training (Blum & Mitchell, 1998), RANC (Ye et al., 2015), CoTrade (Zhang & Zhou, 2011), SPaCo (Ma et al., 2017), and SPamCo (Ma et al., 2020). Co-training generates two base classifiers on different views and iteratively trains each classifier by predictions of the other. RANC assumes that predictions for unlabeled data under different views are consistent with each other. It enforces an affixed rank constraint on the optimization function of each view. CoTrade proposes a specific data editing technique for co-training to avoid undesirable classification noise. By introducing the self-paced curriculum learning, SPaCo designs a “draw with replacement” learning mode to decrease the classification noise. SPamCo improves upon SPaCo by incorporating two co-regularization terms for selecting pseudo-labeled data. It can naturally extend the algorithm to multi-view scenarios.
All the above-mentioned algorithms employ an SVM with a linear kernel as the base classifier. In CSCC, we empirically set the weighting parameters λ = λ1 = 0.7. All other methods for comparison adopt their default parameter settings. In addition, a standard SVM and Seeded Kmeans (Basu et al., 2002) are employed as the baselines for the experimental comparison.
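The basic co-training scheme described above (one classifier per view, each trained with the predictions of the other) can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: a nearest-centroid classifier stands in for the linear SVM, confidence is taken to be the distance margin between the two nearest centroids, and the function names and toy data are hypothetical.

```python
import math


def fit_centroids(X, y):
    """Nearest-centroid 'classifier': the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict(centroids, x):
    """Return (label, confidence); confidence is the margin between the two nearest centroids."""
    dists = sorted((math.dist(x, c), label) for label, c in centroids.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin


def co_train(view1, view2, labels, rounds=5, per_round=1):
    """view1, view2: feature vectors per example; labels: class, or None if unlabeled."""
    labels = list(labels)
    for _ in range(rounds):
        for view in (view1, view2):
            known = [i for i, l in enumerate(labels) if l is not None]
            unknown = [i for i, l in enumerate(labels) if l is None]
            if not unknown:
                return labels
            model = fit_centroids([view[i] for i in known], [labels[i] for i in known])
            # Most confident predictions on this view first.
            scored = sorted(((predict(model, view[i]), i) for i in unknown),
                            key=lambda t: -t[0][1])
            for (label, _), i in scored[:per_round]:
                # The pseudo-label joins the shared pool, so the classifier
                # on the other view is trained with it in the next step.
                labels[i] = label
    return labels


# Toy two-view data: examples 0 and 3 are labeled, the rest are unlabeled.
toy_view1 = [[0.0], [0.1], [0.3], [1.0], [0.9], [0.7]]
toy_view2 = [[5.0], [4.8], [4.9], [1.0], [1.2], [1.1]]
toy_labels = co_train(toy_view1, toy_view2, [0, None, None, 1, None, None])
```

Because no step filters out wrong pseudo-labels, a mistake made on one view propagates to the other, which is the noise risk that CoTrade, SPaCo, and SPamCo are designed to reduce.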
Datasets
Seven text datasets are used for the experimental comparison. Their characteristics are summarized in Table 6.
The Course dataset (Footnote 3) contains home pages collected from the websites of four universities. Each page has a page-based view and a link-based view. The task is to predict whether a page is about course or noncourse.
The Advertisement dataset (Footnote 4) contains advertising images on web pages, each of which is about an advertisement or not. We use two out of the three URL-based views (i.e., 1-image URL, 2-base URL, 3-destination URL) to create three datasets named ads12, ads13, and ads23.
The Newsgroup dataset consists of 16 newsgroups from the Mini-Newsgroup dataset (Footnote 5). Following the compared algorithms (Jiang et al., 2023a, 2023b; Lin et al., 2017), we divide these newsgroups into four groups and create two binary-class datasets, NG1 and NG2, with two artificially generated views.
The Ner dataset concerns sliding-window named entity recognition on the CoNLL2003 dataset (Footnote 6). It consists of nine classes.
Among these datasets, Course, Advertisement, and Newsgroup have two independent views. For some algorithms that do not require two views, we simply combined the two views. For the ner dataset, which only has one view, the attribute sets are randomly split into two disjoint views for algorithms that require two views.
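The random attribute split used for single-view data can be sketched as below. The equal-sized halves, the fixed seed, and the function names (split_views, project) are assumptions made for illustration; the text only states that the attribute set is randomly split into two disjoint views.

```python
import random


def split_views(n_features, seed=0):
    """Randomly partition feature indices 0..n_features-1 into two disjoint views."""
    indices = list(range(n_features))
    random.Random(seed).shuffle(indices)
    half = n_features // 2  # equal-sized halves are an assumption
    return sorted(indices[:half]), sorted(indices[half:])


def project(X, view):
    """Keep only the columns listed in `view` for every row of X."""
    return [[row[j] for j in view] for row in X]


# Toy data with seven attributes per example.
toy_X = [[10, 11, 12, 13, 14, 15, 16], [20, 21, 22, 23, 24, 25, 26]]
view_a, view_b = split_views(7)
X_a, X_b = project(toy_X, view_a), project(toy_X, view_b)
```

Conversely, algorithms that need a single view can be given the concatenation of the two projected views.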
For each dataset, 25% of the data are selected as training data (including the labeled and unlabeled data to be utilized), while the rest are used as test data. According to the different sizes of the datasets, we randomly choose \(2^{k}\) positive and \(3 \times 2^{k}\) negative labeled data for Course, \(2^{k}\) positive and \(6 \times 2^{k}\) negative labeled data for Advertisement, \(2 \times 2^{k+1}\) positive and \(2 \times 2^{k}\) negative labeled data for Newsgroup, and \(20 \times 2^{k+1}\) labeled data for Ner. On each dataset, three series of experiments are performed with k set to 1, 2, and 3, respectively.
4.2.2 Experimental Results
Table 7 shows the classification accuracy of different algorithms on seven datasets, where Kmeans-C and SVM-C denote two components of CSCC, respectively. All results are the averages of 50 runs of the focused algorithms with random initializations. In the following, we provide detailed analyses.
The Effect of the Semi-supervised Cluster-Splitting Algorithm
In Table 7, the performance of Seeded Kmeans is comparable to that of most classification algorithms. This may be attributed to the limited availability of labeled data. When compared to Seeded Kmeans, the accuracy of the Kmeans component, Kmeans-C, shows an improvement of 6.80%. In many instances, Kmeans-C even outperforms the majority of classification algorithms.
These results serve as evidence that our cluster-splitting method and centroid-updating technique aid in identifying the optimal number of clusters and their corresponding centroids.
Table 8 presents the final number of clusters after cluster-splitting, where parenthesized values indicate the number of classes. It is observed that a greater number of sub-clusters are identified, which explains why Kmeans-C outperforms Seeded Kmeans. As the amount of labeled data increases, both Seeded Kmeans and Kmeans-C exhibit steady improvements in performance. This observation highlights the valuable contribution of labeled data in semi-supervised clustering. Compared to the results obtained on the UCI datasets, the number of sub-clusters and the improvement of Kmeans-C are noticeably reduced. This can be attributed to the insufficient availability of labeled data, which limits the extent to which the cluster-splitting process can benefit from guidance.
The Effect of the CSCC Framework
By enlarging the training set with pseudo-labeled data, both Kmeans-C and SVM-C exhibit improvements on all datasets, as shown in Table 7. Notably, SVM-C outperforms the standard SVM by a margin of 3.48%. This improvement surpasses those achieved by other co-training methods. We also noticed that the enhancements are particularly significant when k = 1. This finding suggests that leveraging unlabeled data is more effective in scenarios where labeled data are scarce. Consequently, the improvements observed in both components are more pronounced compared to those seen in the UCI datasets.
Notably, the improvement achieved by SVM-C is particularly remarkable on NG1 and NG2, where the semi-supervised clustering methods perform much better than SVM. Therefore, SVM-C benefits more from Kmeans-C on NG1 and NG2. On the contrary, the improvements of the SVM component on ads and course are not significant due to the negative impact of Kmeans-C. Additionally, the proposed loss function’s supervision may halt the iterative training process on these datasets as a precautionary measure to mitigate the risks of utilizing unlabeled data.
The final prediction is generated by combining the predictions of both components. In comparison to the individual performance of Kmeans-C and SVM-C, CSCC shows an increase in performance by 2.59% and 0.72%, respectively. Moreover, CSCC outperforms both components on most datasets. This outcome serves as evidence that combining the two components effectively improves overall performance.
The Comparison with Other Co-training Style Methods
Table 7 shows that CSCC outperforms its competitors significantly on the average result. Additionally, CSCC performs best in 11 out of 21 scenarios. Co-training style methods face the risk of introducing noise through pseudo-labeled data when utilizing unlabeled data. Co-training and RANC do not take specific measures to mitigate the impact of noise. Consequently, the average accuracy of co-training is close to that of standard SVM. RANC performs even worse than standard SVM, particularly on ner, course, and NG.
On the other hand, CoTrade, SPaCo, SPamCo, and CSCC exhibit some superiority over standard SVM. CoTrade utilizes a specific data editing technique to avoid the inclusion of mispredicted data. Meanwhile, SPaCo and SPamCo employ a “draw with replacement” strategy to reduce classification noise. As a result, these methods achieve better performance than SVM in terms of average accuracy. Our method aims to address both classification and distribution noise when selecting pseudo-labeled data. Furthermore, CSCC supervises the iterative training process using the proposed loss function to prevent potential degradation of the models. Consequently, CSCC attains the best performance, with its superiority being particularly significant on the NG dataset. However, CSCC performs worse than SPaCo and SPamCo on course and certain ads scenarios. We argue that the proposed loss function might stop the iterative training prematurely on these datasets and prevent the further improvement of both components.
Indeed, this series of experiments is specifically designed and carried out in scenarios where only a limited number of labeled data is available. It is worth noting that these methods exhibit more significant improvements over SVM compared to clustering-based classification methods, which are trained using abundant labeled data. This phenomenon demonstrates the value and utility of semi-supervised learning in addressing classification tasks with a scarcity of labeled data.
Finally, we quantify the statistical differences between CSCC and other co-training style methods through a paired Wilcoxon signed-rank test (Pratt, 1959). All significance levels are measured at 5% in Table 9. If the p-value is greater than 0.05, then the null hypothesis that there is no significant difference between the two performances cannot be rejected; otherwise, if the p-value is less than or equal to 0.05, then the null hypothesis is rejected. As shown in Table 9, CSCC is statistically superior to all other methods except SPaCo.
The results of the aligned Friedman’s test comparing CSCC with other methods are presented in Table 10. The Friedman’s rank of CSCC is observed to be the best among all the compared methods. Additionally, the p-value of the aligned Friedman’s test is very low (0.008810), indicating a significant difference between the methods.
The Impact of the Weighting Parameters λ
Figure 4 shows the accuracy curves of CSCC on four datasets as the parameters λ and λ1 vary from 0 to 1. For each dataset, we list the curves representing different supervision levels (i.e., k = 1, 2, 3). Recall that the experimental results presented in Table 7 were obtained with λ and λ1 empirically set to 0.7. Generally, the variations in accuracy across different values of λ and λ1 are within 5% in all scenarios, indicating a generally stable performance. This result demonstrates that our method is not significantly sensitive to changes in λ and λ1 on these datasets.
It is observed that the variations in the curves on these datasets are more significant than those observed on the UCI datasets. This suggests that a smaller number of labeled data can lead to a larger variance in the impact on the loss function. As depicted in Fig. 4, when λ and λ1 vary from 0 to 1, most accuracy curves exhibit a slight increase initially, indicating the benefit of utilizing labeled data as guidance. After a relatively flat stage, most curves end with a slight decline. This phenomenon highlights the limitation of solely relying on empirical risk. An exception is that the curve on ads12 (k = 2) declines when λ and λ1 vary from 0 to 0.1. This observation may be attributed to the sampling of labeled data in this scenario.
The best performance in Fig. 4 is typically achieved when λ and λ1 fall within the range of 0.5 to 0.8. This observation suggests that there is an optimal balance between the impacts of labeled and unlabeled data when λ and λ1 fall in this range. Moreover, higher supervision levels tend to yield better performance compared to lower supervision levels, highlighting the strength of labeled data in model learning.
5 Conclusion
This paper introduces a novel learning paradigm, CSCC, which leverages the diversity between semi-supervised clustering and classification to enhance generalization capability. In particular, we propose a semi-supervised hierarchical clustering algorithm for CSCC, which utilizes a cluster-splitting technique to bridge the gap between class information and clustering. In two sets of experiments, CSCC has been compared with clustering-based classification methods and co-training style methods. The proposed method outperforms others on most datasets and achieves the best overall performance in both series of experiments. In particular, when only a few labeled data are available, the superiority of CSCC becomes even more evident. It is worth noting that the Kmeans component of CSCC performs especially well, surpassing many state-of-the-art classification algorithms on multiple text datasets.
Considering the potential issue of skewed decision boundaries, the class imbalance problem may become more pronounced when utilizing pseudo-labeled data to expand the training set. In the future, we plan to conduct further research on this topic to better control the risks associated with utilizing unlabeled data.
Data Availability
All datasets used in the experiments can be downloaded through the links provided in the footnotes.
Code Availability
The source code is available at: https://codeocean.com/capsule/7974318/tree.
References
Basu, S., Banerjee, A., & Mooney, R. J. (2002). Semi-supervised clustering by seeding. In Proceedings of the nineteenth international conference on machine learning (pp. 27–34). Morgan Kaufmann Publishers Inc. https://doi.org/10.5555/645531.656012
Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on computational learning theory (pp. 92–100). Association for Computing Machinery. https://doi.org/10.1145/279943.279962
Chen, M., Du, Y., Zhang, Y., Qian, S., & Wang, C. (2022). Semi-supervised learning with multi-head co-training. In Proceedings of the AAAI conference on artificial intelligence (Vol. 36(6), pp. 6278–6286).
Cholaquidis, A., Fraiman, R., & Sued, M. (2020). On Semi-Supervised Learning. TEST, 29(4), 914–937.
Chen, D.-D., Wang, W., Gao, W., & Zhou, Z.-H. (2018). Tri-net for semi-supervised deep learning. In Proceedings of the twenty-seventh international joint conference on artificial intelligence (pp. 2014–2020).
Gallego, A.-J., Calvo-Zaragoza, J., Valero-Mas, J. J., & Rico-Juan, J. R. (2018). Clustering-based k-nearest neighbor classification for large-scale data with neural codes representation. Pattern Recognition, 74, 531–543.
Gan, H., Sang, N., Huang, R., Tong, X., & Dan, Z. (2013). Using clustering analysis to improve semi-supervised classification. Neurocomputing, 101, 290–298.
Gan, H., Huang, R., Luo, Z., Xi, X., & Gao, Y. (2018). On using supervised clustering analysis to improve classification performance. Information Sciences, 454, 216–228.
Gertrudes, J. C., Zimek, A., Sander, J., & Campello, R. J. G. B. (2018). A unified framework of density-based clustering for semi-supervised classification. In Proceedings of the 30th international conference on scientific and statistical database management. Association for Computing Machinery. https://doi.org/10.1145/3221269.3223037
Goldman, S., & Zhou, Y. (2000). Enhancing supervised learning with unlabeled data. In Proceedings of the seventeenth international conference on machine learning (pp. 327–334).
Gong, M., Zhou, H., Qin, A. K., Liu, W., & Zhao, Z. (2022). Self-paced co-training of graph neural networks for semi-supervised node classification. IEEE Transactions on Neural Networks and Learning Systems, 34(11), 9234–9247.
Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., … Sugiyama, M. (2018). Co-teaching: Robust training of deep neural networks with extremely noisy labels. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 31). Curran Associates, Inc.
Huang, Q., Gao, R., & Akhavan, H. (2023). An ensemble hierarchical clustering algorithm based on merits at cluster and partition levels. Pattern Recognition, 136, 109255.
Jia, H., Zhu, D., Huang, L., Mao, Q., Wang, L., & Song, H. (2023). Global and local structure preserving nonnegative subspace clustering. Pattern Recognition, 138, 109388.
Jiang, Z., Zhang, S., & Zeng, J. (2013). A hybrid generative/discriminative method for semi-supervised classification. Knowledge-Based Systems, 37, 137–145.
Jiang, Z., Zhan, Y., Mao, Q., & Du, Y. (2022). Semi-supervised clustering under a “compact-cluster” assumption. IEEE Transactions on Knowledge and Data Engineering, 35(5), 5244–5256.
Jiang, Z., Zhao, L., Lu, Y., Zhan, Y., & Mao, Q. (2023a). A semi-supervised resampling method for class-imbalanced learning. Expert Systems with Applications, 221, 119733.
Jiang, Z., Zhao, L., & Zhan, Y. (2023b). A boosted co-training method for class-imbalanced learning. Expert Systems, 40(9), e13377.
Lin, W.-C., Tsai, C.-F., Hu, Y.-H., & Jhang, J.-S. (2017). Clustering-based undersampling in class-imbalanced data. Information Sciences, 409, 17–26.
Liu, H., Tao, Z., & Fu, Y. (2017). Partition level constrained clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(10), 2469–2483.
Ma, F., Meng, D., Dong, X., & Yang, Y. (2020). Self-paced multi-view co-training. Journal of Machine Learning Research, 21(57), 1–38.
Ma, F., Meng, D., Xie, Q., Li, Z., & Dong, X. (2017, 06–11 Aug). Self-paced co-training. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th international conference on machine learning (Vol. 70, pp. 2275–2284). PMLR. https://proceedings.mlr.press/v70/ma17b.html
Jan, Md. Z., & Verma, B. (2019). Evolutionary classifier and cluster selection approach for ensemble classification. ACM Transactions on Knowledge Discovery from Data (TKDD), 14(1), 1–18.
Melnykov, I., & Melnykov, V. (2020). A note on the formal implementation of the K-means algorithm with hard positive and negative constraints. Journal of Classification, 37(3), 789–809.
Piroonsup, N., & Sinthupinyo, S. (2018). Analysis of training data using clustering to improve semi-supervised self-training. Knowledge-Based Systems, 143, 65–80.
Pratt, J. W. (1959). Remarks on zeros and ties in the Wilcoxon signed rank procedures. Journal of the American Statistical Association, 54(287), 655–667.
Rashmi, M., & Sankaran, P. (2019). Optimal landmark point selection using clustering for manifold modeling and data classification. Journal of Classification, 36(1), 94–112.
Raskutti, B., Ferrá, H., & Kowalczyk, A. (2002). Combining clustering and co-training to enhance text classification using unlabelled data. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 620–625). Association for Computing Machinery.
Sachdeva, R., Cordeiro, F. R., Belagiannis, V., Reid, I., & Carneiro, G. (2023). ScanMix: Learning from severe label noise via semantic clustering and semi-supervised learning. Pattern Recognition, 134, 109121.
Sindhwani, V., & Rosenberg, D. S. (2008). An RKHS for multi-view learning and manifold co-regularization. In Proceedings of the 25th international conference on machine learning (pp. 976–983). Association for Computing Machinery. https://doi.org/10.1145/1390156.1390279
Song, Q., Ni, J., & Wang, G. (2011). A fast clustering-based feature subset selection algorithm for high-dimensional data. IEEE Transactions on Knowledge and Data Engineering, 25(1), 1–14.
Van Engelen, J. E., & Hoos, H. H. (2020). A survey on semi-supervised learning. Machine Learning, 109(2), 373–440.
Verma, B., & Rahman, A. (2011). Cluster-oriented ensemble classifier: Impact of multicluster characterization on ensemble classifier learning. IEEE Transactions on Knowledge and Data Engineering, 24(4), 605–618.
Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S., et al. (2001). Constrained k-means clustering with background knowledge. In Proceedings of the eighteenth international conference on machine learning (Vol. 1, pp. 577–584).
Wu, J., Liu, H., Xiong, H., Cao, J., & Chen, J. (2014). K-means-based consensus clustering: A unified view. IEEE Transactions on Knowledge and Data Engineering, 27(1), 155–169.
Xue, H., Chen, S., & Yang, Q. (2009). Discriminatively regularized least-squares classification. Pattern Recognition, 42(1), 93–104.
Ye, H.-J., Zhan, D.-C., Miao, Y., Jiang, Y., & Zhou, Z.-H. (2015). Rank consistency based multi-view learning: A privacy-preserving approach. In Proceedings of the 24th ACM international on conference on Information and knowledge management (pp. 991–1000). Association for Computing Machinery.
Yu, Z., Luo, P., Liu, J., Wong, H.-S., You, J., Han, G., & Zhang, J. (2018). Semi-supervised ensemble clustering based on selected constraint projection. IEEE Transactions on Knowledge and Data Engineering, 30(12), 2394–2407.
Zeng, S., Tong, X., Sang, N., & Huang, R. (2013). A study on semi-supervised FCM algorithm. Knowledge and Information Systems, 35, 585–612.
Zhang, M.-L., & Zhou, Z.-H. (2011). CoTrade: Confident co-training with data editing. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41(6), 1612–1626.
Zhou, Z.-H., & Li, M. (2005). Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering, 17(11), 1529–1541.
Acknowledgements
The authors would like to thank the editors and the anonymous reviewers for their valuable comments to improve the quality of our paper.
Funding
This research is supported by the National Natural Science Foundation of China (NSFC: 62176106), the key project of NSFC (U1836220).
Ethics declarations
Ethics Approval
The authors comply with all ethical standards. No research involving human participants and/or animals was conducted.
Conflict of Interest
The authors declare no competing interests.
About this article
Cite this article
Jiang, Z., Zhao, L. & Lu, Y. Combining Semi-supervised Clustering and Classification Under a Generalized Framework. J Classif (2024). https://doi.org/10.1007/s00357-024-09489-9