1 Introduction

In many real-world tasks, unlabeled data are readily available in abundance, while labeled data are often limited. As two typical learning paradigms, classification and clustering learn from labeled and unlabeled data, respectively. Classification algorithms require sufficient labeled data to learn a reasonably accurate classifier, whereas unlabeled data alone are not enough for clustering algorithms to find the partition the user desires. Therefore, developing a paradigm that learns from both labeled and unlabeled data has become a prominent topic in recent decades.

Semi-supervised classification algorithms incorporate unlabeled data into their loss functions based on specific model assumptions, such as the smoothness, cluster, or manifold assumptions (Van Engelen & Hoos, 2020). However, a mismatch between the problem structure and the model assumption can degrade performance (Cholaquidis et al., 2020). Additionally, incorporating unlabeled data can make the loss functions nonconvex. By contrast, co-training (Blum & Mitchell, 1998) uses the predictions of base classifiers, known as pseudo-labeled data, to augment the training set. It does not rely on additional model assumptions, which avoids the problems of assumption mismatch and nonconvexity. Essentially, the success of co-training style methods relies on the diversity between base classifiers, which is typically achieved through different views. However, the strong assumption of compatible but uncorrelated views is difficult to satisfy in practice. While there has been some research (Goldman & Zhou, 2000; Jiang et al., 2013; Zhou & Li, 2005) on creating diversity through alternative means, it remains a challenging issue.

On the other hand, semi-supervised clustering algorithms (Basu et al., 2002; Wagstaff et al., 2001) use side information, in the form of class labels or pairwise constraints, to guide the clustering towards the desired partition. Typically, this side information is used to initialize the parameters of the clustering model or is incorporated into the loss function to constrain the clustering procedure. Liu et al. (2017) indicate that both the scale and quality of side information have a significant impact on semi-supervised clustering. In particular, there is a gap between the clustering process and the side information: improper utilization of side information can degrade the clustering performance, as illustrated in Fig. 2.

In addition to semi-supervised learning, another approach directly combines clustering with classification to learn from both labeled and unlabeled data. In this approach, clustering is used as a pre-processing step for classification to achieve various objectives, such as reducing the size of the training set (Gallego et al., 2018), selecting more representative training data (Rashmi & Sankaran, 2019), or clustering features for high-dimensional data (Raskutti et al., 2002; Sachdeva et al., 2023). Based on the cluster assumption, some researchers partition the dataset into a set of disjoint clusters to facilitate the subsequent classification (Song et al., 2011). Some ensemble learning methods (Huang et al., 2023; Md. Jan & Verma, 2019; Verma & Rahman, 2011) use clustering techniques to partition the original data and generate a set of base classifiers that learn the boundaries between clusters; a fusion classifier then makes the final prediction by mapping the confidences of the clusters to class decisions. In general, clustering is utilized as an auxiliary approach to preprocessing data for classification, without fully leveraging its potential.

Motivated by co-training, this paper presents a generalized framework for combining semi-supervised clustering and classification (CSCC) to learn from both labeled and unlabeled data. Unlike existing co-training style methods that focus on constructing diverse classifiers, CSCC takes advantage of the inherent diversity between semi-supervised clustering and classification to improve the generalization ability. In theory, any semi-supervised clustering and classification models can be integrated into CSCC to mutually enhance their performance. To maximize the strength of CSCC, we design a semi-supervised hierarchical clustering algorithm to bridge the gap between class information and clustering. With the supervision of labeled data, the proposed algorithm iteratively uses a cluster-splitting technique to refine the clustering result. Furthermore, we define loss functions to guide the iterative training of two components within the CSCC framework. Finally, the proposed method is validated through two series of experiments. The experimental results demonstrate the superiority of our method over other state-of-the-art methods. The main contributions of this paper are highlighted as follows.

1)

    We present a new learning paradigm that combines semi-supervised clustering and classification to learn from both labeled and unlabeled data. To the best of our knowledge, this paper is the first to incorporate semi-supervised clustering into a co-training framework. It provides a new approach to creating diversity for co-training.

2)

    To safely and efficiently leverage labeled data within CSCC, we propose a semi-supervised hierarchical clustering algorithm that uses labeled data to supervise the iterative cluster-splitting process.

3)

    The proposed method has been validated through extensive experiments on 27 representative datasets. The experimental results demonstrate the significant superiority of the proposed method over nine state-of-the-art algorithms.

The rest of this paper is organized as follows. Section 2 presents the background of the proposed method. In Sect. 3, we describe the CSCC framework and present the semi-supervised hierarchical clustering algorithm. The experimental evaluations are presented in Sect. 4. Finally, we present a conclusion and future work in Sect. 5.

2 Related Works

2.1 Co-training Style Methods

Co-training style algorithms (Blum & Mitchell, 1998; Chen et al., 2022; Gong et al., 2022; Jiang et al., 2023a, 2023b) initialize two classifiers on two compatible but uncorrelated views (feature spaces) and iteratively train each classifier with the other’s confident predictions of unlabeled data. Multi-view learning (Ma et al., 2020; Sindhwani & Rosenberg, 2008) can be seen as a generalization of co-training, which takes advantage of multiple views. Note that there is a risk that the noise among pseudo-labeled data may be propagated to the next round of training. To reduce the noise, CoTrade (Zhang & Zhou, 2011) and Co-teaching (Han et al., 2018) apply data editing or cleaning techniques, respectively, to select pseudo-labeled data with higher confidence. Ma et al. combined co-training with the alternative optimization process of self-paced curriculum learning and presented a “draw with replacement” strategy to select pseudo-labeled data (Ma et al., 2017, 2020). Rather than iteratively training base classifiers, RKHS (Sindhwani & Rosenberg, 2008) and RANC (Ye et al., 2015) incorporate the information from other views as a regularization term into the loss function.

Essentially, co-training takes advantage of the diversity between base classifiers to improve the generalization ability. In traditional co-training style algorithms, the diversity comes from compatible but uncorrelated views. However, it is difficult to satisfy this requirement in many tasks. Alternatively, Goldman and Zhou trained two classifiers of different learning algorithms on the same view (Goldman & Zhou, 2000). Similarly, Jiang et al. presented a hybrid method (Jiang et al., 2013) combining a generative and a discriminative classifier under a co-training framework. Moreover, tri-training style methods (Dong-Dong Chen & Wei Gao, 2018; Zhou & Li, 2005) generate three classifiers on different subsets of the original training set. Nevertheless, creating diverse but accurate classifiers for co-training remains a challenging task. Distinct from existing co-training algorithms, our proposed method combines a semi-supervised clustering model and a classifier for mutual promotion in a co-training framework, presenting a novel approach to creating diversity for co-training.

2.2 Clustering-Based Classification Methods

Clustering is widely used to modify the training set or improve the feature representation for subsequent classification tasks. Gallego et al. used clustering to down-sample training data and decrease the computational complexity (Gallego et al., 2018). Some class-imbalanced classification algorithms (Jiang et al., 2023a, 2023b; Lin et al., 2017) utilize clustering to improve the result of oversampling or undersampling. In addition, clustering is used for nonlinear dimensionality reduction (Rashmi & Sankaran, 2019) and feature selection (Jia et al., 2023; Song et al., 2011). ScanMix (Sachdeva et al., 2023) incorporates semantic clustering into deep neural networks to improve the feature representation. Raskutti et al. generated a new view for co-training by clustering both labeled and unlabeled data (Raskutti et al., 2002).

Clustering is also employed to provide constraints for classification based on the cluster assumption, which asserts that data in one cluster are likely to share the same class label. Jan et al. (Md. Jan & Verma, 2019) generated a random subspace by incrementally clustering input data with particle swarm optimization; a set of classifiers is then trained on these optimized clusters. Similarly, Verma et al. (Verma & Rahman, 2011) generated a set of base classifiers to learn the cluster boundaries. DRLSC (Xue et al., 2009) uses clustering to learn local discriminative information and the manifold structure from unlabeled data, which are then incorporated into the loss function as a graph-based regularization term. Notably, classification does not always benefit from the assistance of clustering due to its unsupervised nature. Thus, a few reported works turn to leveraging semi-supervised clustering. During self-training, semi-supervised clustering has been introduced to estimate the data distribution and train a better classifier (Gan et al., 2013; Piroonsup & Sinthupinyo, 2018). SuperRLSC (Gan et al., 2018) uses supervised Kmeans to learn a graph-based regularization term and embeds it into the loss function of the classification.

In summary, existing methods typically utilize clustering as an auxiliary technique for data processing, while applying classification to enhance clustering remains relatively unexplored. This paper presents a new learning paradigm that enables semi-supervised clustering and classification to benefit from each other within a co-training style framework.

2.3 Semi-supervised Clustering Methods

Semi-supervised clustering uses side information, such as labeled data (Basu et al., 2002) or pairwise constraints (Melnykov & Melnykov, 2020; Wagstaff et al., 2001), to guide the clustering towards the desired partition. Compared with pairwise constraints, labeling information is more suitable for describing the data distribution at a high level. In Kmeans-style methods, labeled data can be used to initialize seeds or constrain the centroid-updating process (Basu et al., 2002; Jiang et al., 2022). In a similar way, labeled data are used to estimate the local density parameters in density-based algorithms (Gertrudes et al., 2018). Another approach incorporates a regularization term into the loss function to penalize partitions inconsistent with the given label information (Zeng et al., 2013). Liu et al. (2017) proposed a partition level constrained clustering (PLCC) framework where the class labels are added as additional dimensions to the original feature vector. In addition, semi-supervised clustering has been combined with ensemble learning. Wu et al. proposed a Kmeans-based consensus clustering (KCC) algorithm (Wu et al., 2014). Within the ensemble clustering framework, Yu et al. presented a series of SSC ensemble methods (Yu et al., 2018) that assign different subsets of pairwise constraints to different ensemble members.

Based on the cluster assumption that data in one cluster share the same class label, semi-supervised clustering algorithms label each cluster according to its dominant labeled data. Thereby, semi-supervised clustering can be readily incorporated into co-training to work with a classifier. However, the cluster assumption does not always hold in existing semi-supervised clustering algorithms, as illustrated in Fig. 2. To address this issue, we propose a semi-supervised hierarchical clustering algorithm for CSCC in Sect. 3.2.

3 The Framework Combining Semi-supervised Clustering and Classification (CSCC)

3.1 Motivation

Essentially, co-training leverages the diversity between two classifiers to complement each other, thereby improving its generalization ability. As the key to the success of co-training, diversity can be achieved through different views, training subsets, or classification algorithms.

Motivated by clustering-based classification algorithms, we propose a co-training-like framework, CSCC, which combines semi-supervised clustering and classification so that they learn from each other. Clustering and classification offer distinct perspectives for grouping data. Therefore, we argue that the diversity harnessed by CSCC is more inherent than the diversity constructed between two classifiers. It presents a promising approach for co-training to generate and harness diversity.

The intuition of CSCC is depicted in Fig. 1. Initially, a traditional classification model, denoted as Λ, is trained on labeled data, while a semi-supervised clustering model, denoted as Θ, is learned on both labeled and unlabeled data. Subsequently, CSCC engages in an iterative co-training process between Λ and Θ. To be more specific, each model incorporates its confident predictions of unlabeled data (pseudo-labeled data) into the training set of the other model for the next round of retraining. Both models benefit from the diversity during co-training. The iterative process continues until a specific criterion is satisfied.

Fig. 1
figure 1

Co-training between semi-supervised clustering and classification

3.2 The Semi-supervised Hierarchical Clustering Algorithm

3.2.1 Intuition

To better exert the strength of semi-supervised clustering for CSCC, we propose a semi-supervised hierarchical clustering algorithm based on Kmeans. The intuition is to utilize labeled data to gradually find the optimum cluster number and centroids. Notably, there is a natural gap between the class information and the clustering, which can potentially misguide the label-based clustering. For example, there are two classes in Fig. 2a, each comprising two distinct clusters. Supervised by labeled data, Kmeans-style methods usually initialize two clusters with inappropriate centroids, resulting in a wrong partition, as shown in Fig. 2b. Moreover, irregular cluster shapes and variations in cluster sizes can further impede the performance of centroid-based clustering algorithms when utilizing class information.

Fig. 2
figure 2

The gap between the clustering and class information. a Original data distribution; b Clustering result misguided by class information; c The result after cluster-splitting

According to the cluster assumption, data in one cluster are likely to share the same class label. Conversely, data belonging to the same class may be dispersed across multiple clusters. Our motivation is to initialize each cluster with a class of labeled data and then update the clusters on both labeled and unlabeled data.

If an updated cluster contains labeled data from different classes, it goes against the cluster assumption. In such cases, we divide the “impure” cluster into multiple subclusters, each of which only contains labeled data of one class. This cluster-splitting aims to rectify the clustering result and achieve a more satisfactory partition, as shown in Fig. 2c. Correspondingly, we propose a semi-supervised clustering method for Kmeans-style algorithms, which iteratively refines the clustering model with the supervision of labeled data until all labeled data within each cluster belong to the same class.
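A minimal Python sketch of one cluster-splitting step follows, assuming NumPy arrays with −1 marking unlabeled points. Seeding each sub-cluster with a class mean is our reading of the "one sub-cluster per labeled class" rule, and scikit-learn's KMeans stands in for the Kmeans-style updates; the function name is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_impure_cluster(X_c, y_c):
    """Split a cluster whose labeled points span several classes into
    sub-clusters, one per class. y_c uses -1 for unlabeled points.
    Returns index arrays into X_c, one per sub-cluster."""
    classes = np.unique(y_c[y_c != -1])
    if len(classes) <= 1:                      # "pure" cluster: nothing to split
        return [np.arange(len(X_c))]
    # Seed one sub-cluster with the mean of each class's labeled data.
    seeds = np.stack([X_c[y_c == c].mean(axis=0) for c in classes])
    km = KMeans(n_clusters=len(classes), init=seeds, n_init=1).fit(X_c)
    return [np.where(km.labels_ == k)[0] for k in range(len(classes))]
```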

3.2.2 Loss Function and Algorithm

To begin with, we define the learning problem. Suppose Q is an unknown distribution defined on the instance space X, and Y = {y1,…, yC} is the class label set for X. Let DL = {XL, YL} = {(xi, yi) | i = 1, 2,…, L} denote labeled data drawn from Q, and DU = {XU, YU} = {(xj, yj) | j = 1, 2,…, U} denote unlabeled data whose labels YU are unobserved. Given a semi-supervised clustering algorithm, DL ∪ DU can be grouped into a cluster set P = {p1,…, pK}. Each pk ∈ P has a label sk ∈ Y, determined by the dominant labeled data in pk. Let M = {m1,…, mK} denote the centroid matrix of P, where the k-th centroid mk is estimated as

$$m_k=\frac{\sum_{x_i\in p_k}x_i}{\left|p_k\right|},$$
(1)

where |pk| is the number of data in cluster pk.

Semi-supervised clustering algorithms aim to learn the clustering model Θ with the supervision of labeled data. For Kmeans-style algorithms, Θ can be expressed by the number of clusters in P and its centroid matrix M. The original loss function of Kmeans is defined as

$$\mathrm{SSE}(X;\Theta)=\sum_{k=1}^{K}\sum_{x_i\in p_k}\left\|x_i-m_k\right\|,$$
(2)

where ‖·‖ denotes the Euclidean distance. Given the number of clusters and initial centroids, Kmeans aims to minimize the sum of within-cluster distances. However, deciding the number of clusters and initial centroids is challenging. To address this issue, we propose a cluster-splitting technique and redefine the loss function for Kmeans-style algorithms as

$$J\left(X_L\cup X_U,Y_L,Y_L';\Theta\right)=\lambda\,\mathrm{Err}\left(Y_L,Y_L';\Theta\right)+\left(1-\lambda\right)\mathrm{SSE}\left(X_L\cup X_U;\Theta\right),$$
(3)

where λ ∈ [0,1] is a weighting parameter. The first term, Err ∈ [0,1], measures the empirical error and is defined as

$$\mathrm{Err}\left(Y_L,Y_L';\Theta\right)=\frac{1}{\left|D_L\right|}\sum_{i=1}^{\left|D_L\right|}H_i\left(y_i,y_i'\right),$$
(4)

where \(H_i(y_i, y_i')\) is a discriminant function whose value is 1 when \(y_i \neq y_i'\) and 0 otherwise, so that Err counts disagreements. \(Y_L'\) is the class label vector of XL predicted by Θ: for each xi, its predicted label yi' is the label of its cluster, determined by the dominant labeled data in that cluster. The clustering model Θ is then estimated by minimizing the loss function in Eq. (3) to seek an optimal partition. The corresponding solution is described in Algorithm 1.
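As a concrete reading of Eqs. (2)–(4), the following NumPy sketch evaluates the combined loss of Eq. (3). The function names and the default λ = 0.7 (the value used later in our experiments) are illustrative choices, not part of the formal algorithm.

```python
import numpy as np

def empirical_error(y_true, y_pred):
    """Err in Eq. (4): fraction of labeled points whose predicted cluster
    label disagrees with the true class label."""
    return np.mean(y_true != y_pred)

def sse(X, assign, centroids):
    """SSE in Eq. (2): sum of Euclidean distances of each point to the
    centroid of its assigned cluster (assign holds cluster indices)."""
    return np.sum(np.linalg.norm(X - centroids[assign], axis=1))

def clustering_loss(X_all, assign, centroids, y_l, y_l_pred, lam=0.7):
    """J in Eq. (3): lam * Err + (1 - lam) * SSE."""
    return (lam * empirical_error(y_l, y_l_pred)
            + (1 - lam) * sse(X_all, assign, centroids))
```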

In step 1, Θ is initialized with DL; then, a top-down cluster-splitting technique is applied in step 2 to find the optimal cluster number and initial centroids. Finally, the cluster centroids are iteratively updated in step 3. Considering that the two terms in Eq. (3) may be on different scales, we normalize both terms in the j-th (j > 1) iteration as

$$\mathrm{Err}_j'\left(Y_L,Y_L';\Theta_j\right)=\frac{\mathrm{Err}_j\left(Y_L,Y_L';\Theta_j\right)}{\sum_{t=1}^{j}\mathrm{Err}_t\left(Y_L,Y_L';\Theta_t\right)},\qquad \mathrm{SSE}_j'\left(X_L\cup X_U;\Theta_j\right)=\frac{\mathrm{SSE}_j\left(X_L\cup X_U;\Theta_j\right)}{\sum_{t=1}^{j}\mathrm{SSE}_t\left(X_L\cup X_U;\Theta_t\right)}.$$
(5)

Therefore, we have the final loss function in the j-th iteration:

$$J_j\left(X_L,X_U,Y_L,Y_L';\Theta_j\right)=\lambda\,\mathrm{Err}_j'\left(Y_L,Y_L';\Theta_j\right)+\left(1-\lambda\right)\mathrm{SSE}_j'\left(X_L\cup X_U;\Theta_j\right).$$
(6)

The loss function in Eq. (6) is then used to guide the cluster-splitting in step 2. Due to the impact of noise or outliers, some newly generated clusters might degrade the performance. Therefore, in step 2.5 we delete those clusters whose empirical errors are above the average level. It is noteworthy that false predictions may degrade the clustering model when updating the cluster centroids. Therefore, the loss function in Eq. (6) is also used to supervise the updating of centroids in step 3.
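The normalization in Eqs. (5)–(6) only requires the running history of raw Err and SSE values across iterations. A minimal sketch, assuming the histories are kept as plain Python lists (the epsilon guard is our own addition for the all-zero-error edge case):

```python
def normalized_loss(err_hist, sse_hist, lam=0.7):
    """Eqs. (5)-(6): normalize the current Err and SSE by their running
    sums over iterations 1..j, then recombine. The last list entry is
    the current iteration j."""
    err_j = err_hist[-1] / max(sum(err_hist), 1e-12)  # guard: all-zero errors
    sse_j = sse_hist[-1] / max(sum(sse_hist), 1e-12)
    return lam * err_j + (1 - lam) * sse_j
```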

Algorithm 1 The semi-supervised hierarchical clustering algorithm

figure a

After the semi-supervised clustering, unlabeled data in each cluster are given the same label as their cluster. Based on this prediction, the semi-supervised clustering model Θ can provide a classifier with pseudo-labeled data within the CSCC framework. To select reliable pseudo-labeled data, we estimate the confidence that x belongs to class yi (yi ∈ Y):

$$P\left(y_i\,|\,x;\Theta\right)=\sum_{s_j=y_i}P\left(p_j\,|\,x;\Theta\right)=\sum_{s_j=y_i}\left(1-\frac{\left\|x-m_j\right\|}{\sum_{k=1}^{\left|P\right|}\left\|x-m_k\right\|}\right),$$
(7)

where sj and mj denote the label and centroid of cluster pj, respectively. P(pj | x; Θ) is the confidence that x belongs to cluster pj, whose label is sj = yi; it is negatively correlated with the distance of x to the cluster centroid mj. Recall that a class consists of at least one cluster, while a cluster belongs to only one class. Therefore, when estimating the prediction confidence of x, we take into account its distances to all clusters with the label yi, aggregating P(pj | x; Θ) over such clusters to estimate P(yi | x; Θ) in Eq. (7).
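A direct NumPy transcription of Eq. (7); the function name and array layout are our own assumptions.

```python
import numpy as np

def class_confidence(x, centroids, cluster_labels, target):
    """Eq. (7): aggregate the per-cluster confidences of every cluster
    whose label equals `target`. Confidence decreases with the distance
    of x to each such centroid."""
    d = np.linalg.norm(centroids - x, axis=1)     # ||x - m_k|| for all clusters
    conf = 1.0 - d / d.sum()                      # P(p_j | x; Theta) per cluster
    mask = np.asarray(cluster_labels) == target   # clusters with s_j = y_i
    return conf[mask].sum()
```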

3.3 The CSCC Algorithm

Let V1, V2 be pseudo-labeled datasets used for the semi-supervised clustering model Θ and classification model Λ, respectively. The task is to learn Θ and Λ on DL ∪ V1 ∪ DU and DL ∪ V2, respectively. Next, we provide a detailed illustration of CSCC in Algorithm 2. In step 1, we initialize a standard classification model Λ and run Algorithm 1 to learn Θ. Subsequently, Θ and Λ are iteratively retrained in step 2. In step 3, two learning models are combined for the final prediction of test data.

Algorithm 2 Co-training between semi-supervised clustering and classification (CSCC)

figure b
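To make the control flow of Algorithm 2 concrete, the following Python skeleton sketches the co-training loop under our own naming assumptions: select_confident, fit_augmented, and stopping_criterion are hypothetical placeholders for the pseudo-label selection of Sect. 3.3.1 (a possible body for select_confident is sketched there), the retraining steps, and the loss-based supervision of Eqs. (6) and (10). It is a schematic sketch rather than a verbatim transcription of Algorithm 2.

```python
def cscc(theta, lam_model, X_l, y_l, X_u, max_iter=30):
    """Skeleton of the CSCC co-training loop (illustrative only).

    theta: semi-supervised clustering model (Algorithm 1);
    lam_model: any standard classifier. Each round, each model hands its
    most confident pseudo-labels to the *other* model for retraining."""
    theta.fit(X_l, y_l, X_u)                 # step 1: initialize Theta (Algorithm 1)
    lam_model.fit(X_l, y_l)                  # step 1: initialize Lambda on labeled data
    for _ in range(max_iter):                # step 2: iterative co-training
        V2 = select_confident(theta.predict_proba(X_u), X_u)      # Theta -> Lambda
        V1 = select_confident(lam_model.predict_proba(X_u), X_u)  # Lambda -> Theta
        theta.fit_augmented(X_l, y_l, V1, X_u)    # retrain on D_L, V1, and D_U
        lam_model.fit_augmented(X_l, y_l, V2)     # retrain on D_L and V2
        if stopping_criterion(theta, lam_model):  # supervised by Eqs. (6) and (10)
            break
    return theta, lam_model                  # step 3: combine for final prediction
```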

3.3.1 Selecting Pseudo-Labeled Data

The risk associated with co-training style algorithms arises from the presence of noise in pseudo-labeled data. A common remedy is to select only the most confident predictions; however, plain confidence-based selection deviates from the i.i.d. (independent and identically distributed) assumption, thereby introducing distribution noise.

In step 2, we select the most confident predictions as pseudo-labeled data, according to the class proportions learned from the initial labeled data. In this way, we acquire relatively accurate pseudo-labeled data that approximately obey the actual distribution. Considering that class imbalance can lead to a deficiency of pseudo-labeled data for minority classes, we undersample majority classes in V1 for Θ in step 2.4 and oversample minority classes in DL for Λ. Moreover, we use an incremental value del_n to gradually add pseudo-labeled data for Θ and Λ. To balance quantity and quality, we limit the number of pseudo-labeled data n such that n < 0.5 · |DU| in step 2.9.
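A possible body for the select_confident helper sketched above, making the class-proportion constraint explicit. The signature, the rounding scheme, and the greedy per-class pick are illustrative assumptions rather than the exact procedure of step 2.

```python
import numpy as np

def select_confident(proba, X_u, class_prop, n):
    """Select up to n pseudo-labeled points whose per-class counts follow
    the class proportions of the initial labeled data.

    proba: (U, C) prediction confidences for the unlabeled pool X_u;
    class_prop: length-C array summing to 1; n: budget (< 0.5 * |D_U|)."""
    pseudo_y = proba.argmax(axis=1)
    conf = proba.max(axis=1)
    picked = []
    for c, p in enumerate(class_prop):
        idx = np.where(pseudo_y == c)[0]
        take = min(int(round(p * n)), len(idx))
        picked.extend(idx[np.argsort(-conf[idx])][:take])  # most confident first
    picked = np.asarray(picked, dtype=int)
    return X_u[picked], pseudo_y[picked]
```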

3.3.2 Supervising the Training of Learning Models

To further mitigate the influence of noise, we evaluate both learning models in step 2 to prevent possible degradation. Recall that the clustering model Θ is evaluated according to Eq. (6). To supervise the iterative training of Λ, we propose a loss function as follows.

$$F\left(X_T,Y_T,Y_T';\Lambda\right)=\lambda_1\,\mathrm{Err}\left(Y_T,Y_T';\Lambda\right)-\left(1-\lambda_1\right)\sum_{C_i,C_j\in Y}\mathrm{Dis}_T\left(C_i,C_j\,|\,\Lambda\right),$$
(8)

where T = DL ∪ V2 = {XT, YT} is the training set and λ1 ∈ [0,1] is a weighting parameter. The first term measures the empirical risk defined in Eq. (4). The second term represents a structural risk, estimated from the inter-class distances on T:

$$\mathrm{Dis}_T\left(C_i,C_j\,|\,\Lambda\right)=\frac{1}{\left|C_i\right|\cdot\left|C_j\right|}\sum_{x_i\in C_i}\sum_{x_j\in C_j}\left\|x_i-x_j\right\|.$$
(9)

Within the CSCC framework, the proposed loss function guides the iterative training of Λ by minimizing not only the empirical risk but also the structural risk. Considering that both terms in Eq. (8) may be on different scales, we normalize them in the t-th iteration, as done in Eq. (5). Therefore, we have the final loss function for the t-th iteration of step 2:

$$F_t\left(X_T,Y_T,Y_T';\Lambda_t\right)=\lambda_1\,\mathrm{Err}_t'\left(Y_T,Y_T';\Lambda_t\right)-\left(1-\lambda_1\right)\sum_{C_i,C_j\in Y}\mathrm{Dis}_T'\left(C_i,C_j\,|\,\Lambda_t\right).$$
(10)

In step 2, Eq. (6) and Eq. (10) are used to supervise the iterative training of Θ and Λ, respectively, to decrease the risk of model degradation.
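For concreteness, a NumPy sketch of Eqs. (8)–(9) follows; we sum Dis_T over unordered class pairs and leave the per-iteration normalization of Eq. (10) to the caller. Function names and the λ1 = 0.7 default are our own illustrative choices.

```python
import numpy as np

def inter_class_distance(X, y, ci, cj):
    """Dis_T in Eq. (9): mean pairwise Euclidean distance between the
    points of classes ci and cj in the training set."""
    Xi, Xj = X[y == ci], X[y == cj]
    return np.linalg.norm(Xi[:, None, :] - Xj[None, :, :], axis=2).mean()

def classifier_loss(X_t, y_t, y_pred, classes, lam1=0.7):
    """F in Eq. (8): empirical risk minus the weighted structural term,
    so larger inter-class margins on the training set lower the loss."""
    err = np.mean(y_t != y_pred)
    dis = sum(inter_class_distance(X_t, y_t, ci, cj)
              for i, ci in enumerate(classes) for cj in classes[i + 1:])
    return lam1 * err - (1 - lam1) * dis
```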

3.3.3 Combining Semi-supervised Clustering and Classification for the Final Prediction

To improve the generalization ability, we combine Θ and Λ in step 3 for the final prediction, as shown in Eq. (11).

$$P\left(y\,|\,x;\Theta,\Lambda\right)=\mu\,P\left(y\,|\,x;\Theta\right)+\left(1-\mu\right)P\left(y\,|\,x;\Lambda\right),$$
(11)

where P(y | x; Θ) is the prediction probability of Θ, estimated according to Eq. (7), and P(y | x; Λ) is the prediction probability of Λ, which depends on the specific classification algorithm. The weighting parameter μ regulates the impact of Θ and Λ on the final prediction; its value is usually set by experience or cross-validation.
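Eq. (11) is a plain convex combination. The sketch below assumes both models expose class-probability vectors; μ = 0.5 is purely an example value.

```python
def combined_prediction(p_theta, p_lam, mu=0.5):
    """Eq. (11): weighted average of the clustering model's (Eq. (7)) and
    the classifier's class-probability vectors."""
    return mu * p_theta + (1 - mu) * p_lam

# Final label: the class with the highest combined probability, e.g.
# y_hat = combined_prediction(p_theta, p_lam).argmax(axis=-1)
```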

4 Experiments

In two series of experiments, the proposed method, CSCC, was compared with clustering-based classification methods and co-training style methods, respectively.

4.1 Comparisons with Clustering-Based Classification Methods

4.1.1 Setup

Datasets

For a fair comparison, we validate the proposed method on the same datasets used in SuperRLSC (Gan et al., 2018). Table 1 shows the details of the 20 UCI datasets, where the coefficient of variation (CV) expresses the relative variability or dispersion of the class distribution. Following the experimental setting of SuperRLSC, we randomly divide each dataset into a training set and a test set while maintaining approximately equal numbers of data per class in both sets.

Table 1 Statistics of 20 UCI Datasets

Methods for Comparison

CSCC is first compared with three clustering-based classification methods (Gan et al., 2018; Md. Jan & Verma, 2019; Xue et al., 2009). ECCS (Md. Jan & Verma, 2019) generates an ensemble of classifiers on a series of clusters optimized by a particle swarm optimization-based approach. DRLSC (Xue et al., 2009) learns local discriminative information and the manifold structure from unlabeled data, which are then incorporated into the loss function as a regularization term. SuperRLSC (Gan et al., 2018) employs supervised Kmeans to build local and global data graphs to supervise the classification. Moreover, SVM and Seeded Kmeans (Basu et al., 2002) are employed as baselines for the experimental comparison.

In the CSCC framework, we use SVM as the base classifier. Both the standard SVM and our SVM component use the default parameters of LIBSVM with a linear kernel. Our method has two weighting parameters λ, λ1 ∈ [0,1]; for simplicity, we empirically set λ = λ1 = 0.7. The regularization parameters in DRLSC and SuperRLSC are selected using tenfold cross-validation. All other parameters of ECCS, DRLSC, and SuperRLSC use their default settings.

4.1.2 Experimental Results

Table 2 shows the classification accuracy of the different algorithms on 20 UCI datasets. Here, Kmeans-C and SVM-C denote the two components of CSCC: the semi-supervised hierarchical clustering algorithm and a standard SVM, respectively. All results are averages over 10 runs with random initializations.

Table 2 Classification accuracy on 20 UCI datasets

The Effect of the Semi-supervised Cluster-Splitting Algorithm

Due to the different learning objectives of clustering and classification, it is unfair to compare the classification accuracies of semi-supervised clustering algorithms with those of classification algorithms. Thus, it is not surprising that Seeded Kmeans performs much worse than the classification algorithms. Another reason, we argue, is that the natural gap between the clustering and class information may misguide the clustering procedure. In contrast, Kmeans-C uses labeled data to supervise the cluster-splitting process, enabling it to find the optimal cluster number and centroids more effectively.

Table 3 shows the final cluster number after the cluster-splitting, where parenthesized values denote the number of classes. It is observed that more sub-clusters are identified by CSCC, especially on cmc, Vehicle, and Yeast. Obviously, the cluster-splitting technique is conducive to capturing the underlying data distribution and bridging the gap between the clustering and classification information.

Table 3 The cluster number on 20 UCI datasets

Furthermore, the proposed centroid-updating technique decreases the impact of noise. As a result, Kmeans-C outperforms Seeded Kmeans by 12.8% on the average result. After the iterative training of CSCC, the accuracy of Kmeans-C is even close to that of SVM and DRLSC.

The Effect of the CSCC Framework

The CSCC takes advantage of the diversity between the semi-supervised clustering and classification to train its two components. As observed in columns 7 and 8 of Table 2, the performances of both components show improvements on nearly all datasets, except for slight declines on sonar, water, and cmc. This indicates that our tactic of selecting pseudo-labeled data effectively decreases noise. Furthermore, the supervision provided by the two proposed loss functions further reduces the risk associated with utilizing unlabeled data.

Initially, SVM-C obviously outperforms Kmeans-C. However, through co-training, Kmeans-C benefits more and shows greater improvement than SVM-C. Despite this, the accuracy of SVM-C still increases by 2.45%, demonstrating the effectiveness of incorporating semi-supervised clustering for final prediction.

The final label is determined by combining the predictions of semi-supervised clustering and classification. The CSCC algorithm outperforms its individual components on most datasets, except pid and cmc. Compared with Kmeans-C and SVM-C, the performance of CSCC increases by 4.76% and 1.23%, respectively. These results demonstrate the effectiveness of combining two components in CSCC.

The Comparison with Clustering-Based Classification Methods

Table 2 shows that CSCC performs better than its competitors in terms of average results. Additionally, CSCC achieves the best performance on 8 out of the 16 datasets. DRLSC utilizes structure information learned from unlabeled data to assist classification. However, without the guidance of class information, this learned information does not contribute significantly to the classification, especially when an ample amount of labeled data is available. As a result, DRLSC performs on par with SVM. Moreover, the ensemble method (Verma & Rahman, 2011) does not exhibit significant improvement over SVM. In contrast, SuperRLSC employs supervised Kmeans to find meaningful information for the classification, leading to superior performance.

Rather than using clustering as an auxiliary means, CSCC takes advantage of the diversity between semi-supervised clustering and classification to improve the generalization ability. As shown in Table 2, CSCC outperforms the other methods. However, unlike SuperRLSC, our method does not utilize tenfold cross-validation to select the best parameters for different datasets. We argue that this is the main reason why the superiority of CSCC over SuperRLSC is not more pronounced. It is widely acknowledged that semi-supervised methods, especially co-training style methods, are particularly effective when labeled data are scarce. However, this series of experiments was conducted with an abundance of labeled data. Therefore, it is not surprising that the majority of semi-supervised methods do not exhibit significant superiority over SVM on some datasets.

To quantify the statistical differences between CSCC and the other clustering-based classification methods, we conducted the paired Wilcoxon signed-rank test (Pratt, 1959). All significance levels are measured at 5% in Table 4, where the p-value compares the accuracy values presented in Table 2. If the p-value is greater than 0.05, the null hypothesis that there is no significant difference between the performances of the two methods cannot be rejected. Conversely, if the p-value is less than or equal to 0.05, the null hypothesis is rejected, indicating a significant difference.
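The test can be reproduced with SciPy's paired Wilcoxon signed-rank test; the accuracy arrays below are placeholders standing in for the per-dataset results in Table 2, not actual reported values.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder paired per-dataset accuracies (stand-ins for Table 2's values).
acc_cscc  = np.array([0.91, 0.85, 0.78, 0.88, 0.93, 0.81])
acc_other = np.array([0.89, 0.84, 0.75, 0.88, 0.90, 0.79])

stat, p_value = wilcoxon(acc_cscc, acc_other)   # two-sided by default
significant = p_value <= 0.05                   # reject H0 at the 5% level
print(f"W = {stat:.1f}, p = {p_value:.4f}, significant: {significant}")
```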

Table 4 Wilcoxon signed-rank test results on 20 UCI datasets

According to the results in Table 4, CSCC is statistically superior to other methods, except for SuperRLSC. The results of the aligned Friedman’s test comparing CSCC along with the other methods are presented in Table 5. It is observed that CSCC achieves the highest ranking among all six methods. Besides, the p-value obtained from the aligned Friedman’s test is very low (0.006592), indicating a significant difference between the methods compared.

Table 5 Aligned Friedman’s test for the comparison of CSCC with other methods

The Impact of the Weighting Parameters λ

CSCC uses two weighting parameters λ, λ1 ∈ [0,1] for the semi-supervised clustering and classification models, respectively. In the following, we explore the impact of λ and λ1 on six representative datasets. Figure 3 shows the accuracy curves of CSCC as λ and λ1 vary from 0 to 1.

Fig. 3
figure 3

Impact of λ,λ1 on six datasets

In general, the variance of accuracy with respect to different λ and λ1 on the six datasets is relatively small, particularly on pid and glass. This indicates that CSCC is not significantly sensitive to λ and λ1 on these datasets. As shown in Fig. 3, as λ and λ1 vary from 0 to 1, most performance curves rise slightly at first, demonstrating the benefit of the guidance of labeled data. After a relatively flat stage, most curves end with a slight decline when λ and λ1 reach 0.8 or 0.9. These results show that the usage of unlabeled data is also crucial. Overall, the best performance in Fig. 3 is usually achieved when λ and λ1 lie between 0.5 and 0.8, suggesting an optimal balance between the impacts of labeled and unlabeled data within this range.

4.2 Comparisons with Co-training Style Methods

4.2.1 Setup

Methods for Comparison

In this subsection, CSCC is further compared with five co-training style algorithms: Co-training (Blum & Mitchell, 1998), RANC (Ye et al., 2015), CoTrade (Zhang & Zhou, 2011), SPaCo (Ma et al., 2017), and SPamCo (Ma et al., 2020). Co-training generates two base classifiers on different views and iteratively trains each classifier with the predictions of the other. RANC assumes that predictions for unlabeled data under different views are consistent with each other and enforces a fixed rank constraint on the optimization function of each view. CoTrade proposes a specific data editing technique for co-training to avoid undesirable classification noise. By introducing self-paced curriculum learning, SPaCo designs a "draw with replacement" learning mode to decrease classification noise. SPamCo improves upon SPaCo by incorporating two co-regularization terms for selecting pseudo-labeled data and naturally extends the algorithm to multi-view scenarios.

All above-mentioned algorithms employ SVM with a linear kernel as the base classifier. In CSCC, we empirically set the weighting parameters λ = λ1 = 0.7. All other methods adopt their default parameter settings. In addition, a standard SVM and Seeded Kmeans (Basu et al., 2002) are employed as baselines for the experimental comparison.

Datasets

Seven text datasets are used for the experimental comparison. Their characteristics are summarized in Table 6.

Table 6 Statistics of seven text datasets

The Course dataset contains home pages collected from the websites of four universities. Each page has a page-based view and a link-based view. The task is to predict whether or not a page is a course page.

The Advertisement dataset contains advertising images on web pages, each of which is either an advertisement or not. We use pairs of the three URL-based views (1-image URL, 2-base URL, 3-destination URL) to create three datasets named ads12, ads13, and ads23.

The Newsgroup dataset consists of 16 newsgroups from the Mini-Newsgroup dataset. As done in the comparison algorithms (Jiang et al., 2023a, 2023b; Lin et al., 2017), we divide these newsgroups into four groups and create two binary-class datasets, NG1 and NG2, with two artificially generated views.

The Ner dataset is a sliding-window named entity recognition task built from the CoNLL2003 dataset. It consists of nine classes.

Among these datasets, Course, Advertisement, and Newsgroup have two independent views. For algorithms that do not require two views, we simply combine the two views. For the Ner dataset, which has only one view, the attribute set is randomly split into two disjoint views for algorithms that require two views.

For each dataset, 25% of the data are selected as training data (including labeled data and the unlabeled data to be utilized), while the rest are used as test data. According to the different sizes of the datasets, we randomly choose 2k positive and 3 × 2k negative labeled data for Course, 2k positive and 6 × 2k negative for Advertisement, 2 × 2k+1 positive and 2 × 2k negative for Newsgroup, and 20 × 2k+1 for Ner. On each dataset, three series of experiments are performed with k set to 1, 2, and 3.

4.2.2 Experimental Results

Table 7 shows the classification accuracy of the different algorithms on seven text datasets, where Kmeans-C and SVM-C denote the two components of CSCC. All results are averages over 50 runs with random initializations. In the following, we provide detailed analyses.

Table 7 Classification accuracy on seven text datasets

The Effect of the Semi-supervised Cluster-Splitting Algorithm

In Table 7, the performance of Seeded Kmeans is comparable to that of most classification algorithms, which might be attributed to the limited availability of labeled data. Compared to Seeded Kmeans, the accuracy of the Kmeans component, Kmeans-C, improves by 6.80%. In many instances, Kmeans-C even outperforms the majority of classification algorithms.

These results serve as evidence that our cluster-splitting method and centroid-updating technique aid in identifying the optimal number of clusters and their corresponding centroids.

Table 8 presents the final cluster number after cluster-splitting, where parenthesized values indicate the number of classes. It is observed that a greater number of sub-clusters are identified, which explains why Kmeans-C outperforms Seeded Kmeans. As the amount of labeled data increases, both Seeded Kmeans and Kmeans-C exhibit steady improvements in performance. This observation highlights the valuable contribution of labeled data to semi-supervised clustering. Compared to the results obtained on the UCI datasets, the number of sub-clusters and the improvement of Kmeans-C are noticeably smaller. This can be attributed to the insufficient availability of labeled data, which limits the guidance available to the cluster-splitting process.

Table 8 The cluster number on seven text datasets

The Effect of the CSCC Framework

By enlarging the training set with pseudo-labeled data, both Kmeans-C and SVM-C exhibit improvements on all datasets, as shown in Table 7. Notably, SVM-C outperforms the standard SVM by a margin of 3.48%, an improvement surpassing those achieved by the other co-training methods. Besides, we also notice that the enhancements are particularly significant when k = 1, demonstrating that leveraging unlabeled data is more effective when labeled data are scarce. Consequently, the improvements observed in both components are more pronounced than those seen on the UCI datasets.

Notably, the improvement achieved by SVM-C is particularly remarkable on NG1 and NG2, where the semi-supervised clustering methods perform much better than SVM. Therefore, SVM-C benefits more from Kmeans-C on NG1 and NG2. By contrast, the improvements of the SVM component on ads and course are not significant due to the negative impact of Kmeans-C. Additionally, the supervision of the proposed loss function may halt the iterative training process on these datasets as a precautionary measure to mitigate the risks of utilizing unlabeled data.

The final prediction is generated by combining the predictions of both components. In comparison to the individual performance of Kmeans-C and SVM-C, CSCC shows an increase in performance by 2.59% and 0.72%, respectively. Moreover, CSCC outperforms both components on most datasets. This outcome serves as evidence that combining the two components effectively improves overall performance.

The Comparison with Other Co-training Style Methods

Table 7 shows that CSCC outperforms its competitors significantly on the average result. Additionally, CSCC performs best in 11 out of 21 scenarios. Co-training style methods face the risk of introducing noise through pseudo-labeled data when utilizing unlabeled data. Co-training and RANC do not take specific measures to mitigate the impact of noise. Consequently, the average accuracy of Co-training is close to that of the standard SVM, and RANC performs even worse than the standard SVM, particularly on ner, course, and NG.

On the other hand, CoTrade, SPaCo, SPamCo, and CSCC exhibit some superiority over the standard SVM. CoTrade utilizes a specific data editing technique to avoid the inclusion of mispredicted data, while SPaCo and SPamCo employ a "draw with replacement" strategy to reduce classification noise. As a result, these methods achieve better average accuracy than SVM. Our method aims to address both classification and distribution noise when selecting pseudo-labeled data. Furthermore, CSCC supervises the iterative training process using the proposed loss functions to prevent potential degradation of the models. Consequently, CSCC attains the best performance, with its superiority being particularly significant on the NG datasets. However, CSCC performs worse than SPaCo and SPamCo on course and certain ads scenarios. We argue that the proposed loss function might stop the iterative training prematurely on these datasets and prevent further improvement of both components.

Indeed, this series of experiments is specifically designed and carried out in scenarios where only a limited number of labeled data is available. It is worth noting that these methods exhibit more significant improvements over SVM compared to clustering-based classification methods, which are trained using abundant labeled data. This phenomenon demonstrates the value and utility of semi-supervised learning in addressing classification tasks with a scarcity of labeled data.

Finally, we quantify the statistical differences between CSCC and the other co-training style methods through a paired Wilcoxon signed-rank test (Pratt, 1959). All significance levels are measured at 5% in Table 9. If the p-value is greater than 0.05, the null hypothesis that there is no significant difference between the two performances cannot be rejected; if the p-value is less than or equal to 0.05, the null hypothesis is rejected. As shown in Table 9, CSCC is statistically superior to all other methods except SPaCo.

Table 9 Wilcoxon signed-rank test results on seven text datasets

The results of the aligned Friedman’s test comparing CSCC with other methods are presented in Table 10. The Friedman’s rank of CSCC is observed to be the best among all the compared methods. Additionally, the p-value of the aligned Friedman’s test is very low (0.008810), indicating a significant difference between the methods.

Table 10 Aligned Friedman’s test for the comparison of CSCC with other methods on seven text datasets

The Impact of the Weighting Parameters λ

Figure 4 showcases the accuracy curves of CSCC on four datasets as the parameter λ varies from 0 to 1. For each dataset, we plot the curves for different supervision levels (i.e., k = 1, 2, 3). Recall that the experimental results in Table 7 were obtained with λ and λ1 empirically set to 0.7. Generally, the accuracy variances across different values of λ and λ1 in all scenarios are within 5%, indicating a generally stable performance. This result demonstrates that our method is not significantly sensitive to changes in λ and λ1 on these datasets.

Fig. 4
figure 4

Impact of λ,λ1 on four datasets

It is observed that the variances in the curves on these datasets are larger than those observed on the UCI datasets. This suggests that fewer labeled data lead to a larger variance in the loss function's impact. As depicted in Fig. 4, as λ and λ1 vary from 0 to 1, most accuracy curves exhibit a slight increase initially, indicating the benefit of utilizing labeled data as guidance. After a relatively flat stage, most curves end with a slight decline, highlighting the limitation of relying solely on the empirical risk. An exception is the curve on ads12 (k = 2), which declines as λ and λ1 vary from 0 to 0.1. This observation may be attributed to the sampling of labeled data in this scenario.

The best performance in Fig. 4 is typically achieved when λ and λ1 fall within the range of 0.5 to 0.8, suggesting an optimal balance between the impacts of labeled and unlabeled data in this range. Moreover, higher supervision levels tend to yield better performance than lower ones, highlighting the strength of labeled data in model learning.

5 Conclusion

This paper introduces a novel learning paradigm, CSCC, which leverages the diversity between semi-supervised clustering and classification to enhance generalization capability. In particular, we propose a semi-supervised hierarchical clustering algorithm for CSCC, which utilizes a cluster-splitting technique to bridge the gap between class information and clustering. In two series of experiments, CSCC has been compared with clustering-based classification methods and co-training style methods. The proposed method outperforms the others on most datasets and achieves the best overall performance in both series of experiments. Particularly, when only a few labeled data are available, the superiority of CSCC becomes even more evident. It is worth noting that the Kmeans component of CSCC performs especially well, surpassing many state-of-the-art classification algorithms on multiple text datasets.

Because pseudo-labeled data used to expand the training set can skew decision boundaries, the class imbalance problem may become more pronounced. In the future, we plan to conduct further research on this topic to better control the risks associated with utilizing unlabeled data.