1 Introduction

The field of cluster analysis comprises many diverse methods that locate groups of similar observations in a dataset. Clustering is usually understood as the grouping of data without any additional considerations or restrictions, and we adopt the convention of the authors who refer to this kind of clustering as unsupervised. In contrast, methods that operate in the presence of restrictions on the proposed solution, or of other supplementary information, are called semi-supervised (Hennig et al. 2015; Yu et al. 2015; Liu and Fu 2015).

The development of methodology for grouping data under constraints goes back to DeSarbo and Mahajan (1984), who developed an algorithm for the formation of clusters with pre-determined sizes and used it to analyze information on Forbes 500 corporations. With the advances in machine learning theory and the expansion of computing capabilities, a variety of restrictions implemented in semi-supervised clustering have been explored. One scenario that enjoys much attention in the literature occurs when the classification of a part of the dataset is known and may be used to determine the classification of the rest of the data (Basu et al. 2002; Barbier et al. 2012; Gu and Lu 2012). A more general framework is to consider blocks of points whose classification is not necessarily known, but whose members are required to be joined together or separated in the clustering solution. When the points are joined together, such a constraint is called a positive or “must-link” constraint; when the points are separated, the constraint is called negative or “cannot-link” (Basu et al. 2008; Śmieja and Wiercioch 2017; Ruiz et al. 2010). In addition, a variety of approaches may be taken regarding the strictness of such membership rules. If a constraint must be satisfied in the solution, it is called hard, while if it can be violated (usually at a penalty), it is called soft (Hennig et al. 2015).

A substantial effort has been made to accommodate various semi-supervised scenarios within K-means, one of the most popular clustering algorithms (Basu et al. 2004; Wang et al. 2011; Dinler and Tural 2016). The existing modifications of the K-means algorithm oriented towards hard constraints usually take one of the following approaches. One strategy is to place conditions consistent with the constraints on the inclusion of points in classes and check the set of restrictions during the assignment of each subsequent point (Wagstaff et al. 2001; Gu and Lu 2012). This methodology can lead to vastly different solutions depending on the order in which the points are assigned to classes, since the constraints that come into play at any given time depend on the points that have already been assigned.

Some modifications of the K-means algorithm assume that the labels are known for a part of the dataset and use this information to initialize the means as well as to recompute them during subsequent iterations, while selecting labels only for the portion of the dataset that has not been classified yet. Two variations of this methodology are referred to as the “seeded” K-means algorithm and the “constrained” K-means algorithm (Basu et al. 2002). The seeded version uses the data with known labels to initialize the cluster centers at the first iteration; after this, the training data are set aside and the K-means algorithm runs essentially unchanged. The constrained method actively uses the labeled data at each iteration by taking it into account while recomputing the cluster centers. Both of these approaches have been shown to exceed the performance of the method suggested by Wagstaff et al. (2001). Although the described methods have clear benefits, algorithms relying on the partial labeling of the data cannot accommodate the scenario where hard constraints join or disconnect certain data points but do not insist on their membership in a particular class. In addition, when the overlap between clusters is substantial, complications of a different nature may arise because a part of the dataset receives its class labels immediately, before the classification process has even begun; this may lead to unintended misclassification of large portions of the data. We discuss this issue further in Section 3.

Other suggested adjustments to the semi-supervised K-means algorithm involve labeling of a subset of data by user in multiple stages (Fatehi et al. 2014), inclusion of the information on constraints at the level of classes rather than individual data points (Liu and Fu 2015), and use of adaptive distance measures that learn from the training set (Bilenko and Mooney 2003).

Despite the active research, there are no rigorous, theoretically sound approaches to semi-supervised K-means. All methods proposed in the literature to date represent ad hoc empirical algorithms that appeal to intuition but lack methodological rigor. While some of them rely on a labeled set of training data (Basu et al. 2002), the majority of such methods add an external check for possible constraint violations to the core K-means algorithm (Ruiz et al. 2010; Zhigang et al. 2013; Covões et al. 2013). In these circumstances, the algorithm often fails to converge, and some authors decide not to pursue all the constraints strictly but instead resort to minimizing the number of violated constraints (Covões et al. 2013; Davidson and Ravi 2005).

In this paper, we propose a formal modification of K-means that incorporates hard positive and negative constraints directly into the algorithm. While the suggested methodology shares the same goals as the procedures of Basu et al. (2002) or Wagstaff et al. (2001), our approach merges the restrictions completely with the K-means algorithm by modifying its objective function and making the constraints a part of the fabric of the algorithm, which was not done in the past.

We start by describing the methodology of the proposed approach in Section 2. The experimental justification is provided in Section 3 and the discussion of the results concludes the paper in Section 4.

2 Methodology

In this section, we will make a formal statement of our method and show the modifications that occur to the K-means algorithm in the presence of hard constraints.

2.1 Problem Statement and Notation

First, we consider the classical unconstrained K-means algorithm. Suppose that a dataset consists of n independent p-dimensional observations y1,y2,…,yn and that the number of clusters K is pre-determined. The goal of the K-means algorithm is then to find the partition of the dataset that minimizes the within-cluster variation given by

$$ S=\sum\limits_{k=1}^{K} \sum\limits_{i=1}^{n} \Vert{\boldsymbol{y}_{i}-\boldsymbol{\mu}_{k}}\Vert^{2}I(z_{i}=k), $$

where μk is the mean of the k th cluster, I(⋅) is the indicator function, and zi denotes the membership of the i th observation.

The algorithm starts with a collection of centers \(\hat{\boldsymbol{\mu}}_{k}\) that can be obtained randomly or by means of a specialized initialization technique. It then iterates between two steps: at step 1, the clusters are formed by assigning each observation to the cluster with the nearest center, and at step 2, the estimated cluster means are re-calculated according to the rule \(\hat{\boldsymbol{\mu}}_{k}=\frac{{\sum}_{i=1}^{n} I(z_{i}=k)\boldsymbol{y}_{i}}{{\sum}_{i=1}^{n} I(z_{i}=k)}\). This process continues until convergence, when no further changes occur to the partition, or until it becomes apparent that convergence will not be reached.
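For reference, a minimal sketch of these two alternating steps in Python/NumPy follows (the function and variable names are our own illustrative choices and not part of the original algorithm description):

```python
import numpy as np

def kmeans(Y, K, max_iter=100, seed=None):
    """Plain (unconstrained) K-means: Y is an (n, p) data array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    mu = Y[rng.choice(n, size=K, replace=False)].astype(float)  # random initial centers
    z = np.full(n, -1)
    for _ in range(max_iter):
        # Step 1: assign every observation to the cluster with the nearest current center
        d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, K) squared distances
        z_new = d2.argmin(axis=1)
        if np.array_equal(z_new, z):        # the partition no longer changes: convergence
            break
        z = z_new
        # Step 2: recompute each center as the mean of the observations assigned to it
        for k in range(K):
            if np.any(z == k):
                mu[k] = Y[z == k].mean(axis=0)
    return z, mu
```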

Let us now suppose that certain positive and negative restrictions have been placed on the inclusion of points in clusters. The positive constraints specify which points must belong to the same cluster in the final solution. Effectively, positive constraints define disjoint blocks of points with the sets of indices \({\mathscr{B}}_{b}, b=1,2,\ldots,B\), such that \(\bigcup_{b=1}^{B} {\mathscr{B}}_{b} = \{1, 2, \ldots, n\}\), where B is the total number of blocks. For any two distinct points yi and yj, \(\{ i,j \in {\mathscr{B}}_{b} \} \Rightarrow \{ z_{i}=z_{j} \}\). In the trivial case when no positive constraints are defined, each point can be viewed as a singleton block and B = n.

In contrast to the positive restrictions, negative constraints arise when certain points yi and yj cannot belong to the same cluster. It should be noted that such a restriction involves not only yi and yj but the whole blocks associated with these points, i.e., \(b(i) = \arg_{b}\{I(i \in {\mathscr{B}}_{b}) = 1\}\) and \(b(j) = \arg_{b}\{I(j \in {\mathscr{B}}_{b}) = 1\}\). Thus, in the presence of a negative restriction, \(\{ r \in {\mathscr{B}}_{b(i)}, q \in {\mathscr{B}}_{b(j)} \} \Rightarrow \{ z_{r} \neq z_{q} \}\).
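As a concrete illustration (the encoding below is our own and is not prescribed by the method), the constraint information can be stored as a list of index blocks together with a set of block pairs that must be kept apart:

```python
# Hypothetical constraints for a dataset of n = 8 points.
# Positive constraints: points 0, 1, 2 form one block and points 5, 6 another;
# the remaining points are singleton blocks, so B = 5 blocks in total.
blocks = [[0, 1, 2], [5, 6], [3], [4], [7]]

# Negative constraints are recorded between blocks, not individual points:
# here block 0 ({0,1,2}) and block 1 ({5,6}) may not share a cluster.
cannot_link = {(0, 1)}

# block_of[i] recovers b(i), the index of the block containing observation i.
block_of = {i: b for b, members in enumerate(blocks) for i in members}
```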

Accommodating either kind of constraint is not straightforward in the K-means framework, as it requires a modification of the objective function and a mechanism that accounts for these more complicated structures. We will address this issue using the connection between the K-means algorithm and a modified expectation-maximization (EM) algorithm (Dempster et al. 1977), but first we give the relevant background on mixture models.

2.2 K-means Algorithm in a Constrained Setting

The Gaussian mixture model is given by \(f(\boldsymbol {y}; \boldsymbol {\theta })={\sum }_{k=1}^{K}\tau _{k}\phi (\boldsymbol {y};\boldsymbol {\mu }_{k},\boldsymbol {\Sigma }_{k})\), where ϕ(⋅;μk,Σk) denotes a multivariate Gaussian density with mean vector μk and covariance matrix Σk. τk ∈ (0,1],k = 1,2,…,K are mixing proportions bound by the constraint \({\sum }_{k} \tau _{k} = 1\). The parameters of this mixture are most often estimated with the use of the EM algorithm which consists of two iteratively repeated steps: expectation and maximization (McLachlan and Peel 2000). During the expectation step, or E-step, the conditional expected value of the complete data log-likelihood is computed, which in the case of a Gaussian mixture model amounts to finding the posterior probabilities

$$ \pi_{ik} = \frac{\tau_{k} \phi(\boldsymbol{y}_{i}; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})}{{\sum}_{k^{\prime}=1}^{K}\tau_{k^{\prime}} \phi(\boldsymbol{y}_{i}; \boldsymbol{\mu}_{k^{\prime}}, \boldsymbol{\Sigma}_{k^{\prime}})} $$

that observation i originated from the k th component of the mixture. At the maximization step, or M-step, parameters τk,μk, and Σk are estimated by maximizing the conditional expectation of the complete data log-likelihood.

A modification of this method, called the classification EM (CEM) algorithm, adds a classification step performed right after the E-step, where each observation obtains a hard assignment to a particular cluster based on the highest posterior probability πik, i.e., \(z_{i}={\arg\max}_{k}\pi_{ik}\). Subsequently, at the M-step, πik are replaced by I(zi = k). This makes the algorithm closer in spirit to the K-means method, where each data point is also unequivocally assigned to a particular cluster. In fact, it has been shown by Celeux and Govaert (1993) that the K-means algorithm with Euclidean distances is equivalent to a CEM algorithm applied to a Gaussian mixture model with equal spherical covariance matrices and identical mixing proportions. Under these restrictions, \(\tau_{k}=\frac{1}{K}\) and Σk = σ2I for all k = 1,2,…,K, where I is the identity matrix and σ2 is a common variance parameter. The modification of the K-means method that accommodates positive or negative constraints occurs at the classification stage, when the assignment \(z_{i}={\arg\max}_{k}\pi_{ik}\) is carried out. The classification rules implemented by means of zi are determined by the particular configuration of constraints in each case. We now proceed to obtain the expressions for πik under several configurations of constraints.
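For concreteness, the reduction for a single observation can be written out explicitly (a short check that is implicit in the argument above): with \(\tau_{k}=1/K\) and Σk = σ2I, the normalizing constants of the Gaussian densities cancel, so that

$$ \pi_{ik} = \frac{\exp\left(-\frac{1}{2\sigma^{2}}\Vert\boldsymbol{y}_{i}-\boldsymbol{\mu}_{k}\Vert^{2}\right)}{{\sum}_{k^{\prime}=1}^{K}\exp\left(-\frac{1}{2\sigma^{2}}\Vert\boldsymbol{y}_{i}-\boldsymbol{\mu}_{k^{\prime}}\Vert^{2}\right)} \quad \text{and} \quad \underset{k}{\arg\max}\ \pi_{ik}=\underset{k}{\arg\min}\ \Vert\boldsymbol{y}_{i}-\boldsymbol{\mu}_{k}\Vert^{2}, $$

which is exactly the nearest-center assignment of the K-means algorithm.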

2.3 K-means Algorithm with Positive Constraints

Consider B blocks determined by positive constraints. It should be noted that for the set of indices \({\mathscr{B}}_{b}\) associated with block b, \(\pi _{ik}=\pi _{jk}, \forall i,j \in {\mathscr{B}}_{b}\); therefore, we can define \(\pi _{bk} \equiv \pi _{ik}, i \in {\mathscr{B}}_{b}\). It was shown by Melnykov et al. (2016) that

$$ \pi_{bk} = \frac{\tau_{k}^{|\mathcal{B}_{b}|} {\prod}_{j \in \mathcal{B}_{b}} \phi(\boldsymbol{y}_{j}; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})}{{\sum}_{k^{\prime} = 1}^{K} \tau_{k^{\prime}}^{|\mathcal{B}_{b}|} {\prod}_{j \in \mathcal{B}_{b}} \phi(\boldsymbol{y}_{j}; \boldsymbol{\mu}_{k^{\prime}}, \boldsymbol{\Sigma}_{k^{\prime}})}. $$

Consequently, we can write the rule for the assignment of block b to cluster k as \(\pi _{bk} \geq \pi _{bk^{\prime }}\), \(k^{\prime }=1,2,\ldots ,K\), which after taking logarithms and some straightforward manipulations becomes

$$ {\sum}_{j \in \mathcal{B}_{b}} (\boldsymbol{y}_{j}-\boldsymbol{\mu}_{k})^{\prime} \boldsymbol{\Sigma}_{k}^{-1} (\boldsymbol{y}_{j}-\boldsymbol{\mu}_{k}) \leq {\sum}_{j \in \mathcal{B}_{b}} (\boldsymbol{y}_{j}-\boldsymbol{\mu}_{k^{\prime}})^{\prime} \boldsymbol{\Sigma}_{k^{\prime}}^{-1} (\boldsymbol{y}_{j}-\boldsymbol{\mu}_{k^{\prime}}) + |\mathcal{B}_{b}| \log \frac{\tau_{k}^{2} |\boldsymbol{\Sigma}_{k^{\prime}}|} {\tau_{k^{\prime}}^{2}|\boldsymbol{\Sigma}_{k}|}, $$

where \(|{\mathscr{B}}_{b}| = {\sum}_{j=1}^{n}I(j \in {\mathscr{B}}_{b})\), and |Σk| and \(|\boldsymbol{\Sigma}_{k^{\prime}}|\) denote the determinants of Σk and \(\boldsymbol{\Sigma}_{k^{\prime}}\), respectively.

Since the assignments will now apply to whole blocks of points at once, we can define \(z_{b} \equiv z_{i}, i \in {\mathscr{B}}_{b}\). Taking into account the sphericity of covariance matrices and restrictions on mixing proportions,

$$ z_{b}=\underset{k}{\arg\max} \left[ K^{-|\mathcal{B}_{b}|} {\prod}_{j \in \mathcal{B}_{b}} \phi(\boldsymbol{y}_{j}; \boldsymbol{\mu}_{k}, \sigma^{2}\mathbf{I}) \right], $$

with k = 1,2,…,K. This expression can be further simplified to obtain the following rule for the assignment of block b to a particular class:

$$ z_{b}=\underset{k}{\arg\min} \left[ {\sum}_{j \in \mathcal{B}_{b}} \| \boldsymbol{y}_{j}-\boldsymbol{\mu}_{k} \|^{2} \right]. $$

We can see that in the presence of positive constraints, the K-means method decides the membership of a block of points in an intuitive way. Just as a single point is assigned to the cluster with the nearest center, a block is assigned to the center that minimizes the sum of squared deviations of the block's points from the cluster mean. Note, however, that this membership criterion operates with squared Euclidean distances rather than the distances themselves.
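A minimal sketch of this block-assignment step follows (Python/NumPy; the function name and the `blocks` structure holding the index sets \({\mathscr{B}}_{b}\) are our own illustrative choices):

```python
import numpy as np

def assign_blocks_positive(Y, mu, blocks):
    """Assign each positively constrained block to the cluster whose center
    minimizes the within-block sum of squared distances."""
    K = mu.shape[0]
    z_block = np.empty(len(blocks), dtype=int)
    for b, members in enumerate(blocks):
        # score[k] = sum over j in B_b of ||y_j - mu_k||^2
        score = np.array([((Y[members] - mu[k]) ** 2).sum() for k in range(K)])
        z_block[b] = score.argmin()
    return z_block
```

Within a full iteration of the constrained algorithm, the centers would then be re-estimated from all points whose block received label k, as in the unconstrained M-step.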

2.4 K-means Algorithm with Negative Constraints

We now turn to the situation where negative constraints are defined along with positive ones. Melnykov et al. (2016) described some difficulties in handling negative restrictions that arise mainly from the fact that the number of possible configurations grows very fast with the number of blocks involved. We will describe some common situations that occur with two and three blocks that can be readily generalized for a larger number of blocks.

First, let a negative constraint be defined between two blocks. Among the B blocks defined in the dataset, without loss of generality, we can number the blocks in such a way that the negative constraint disconnects blocks 1 and 2, with the sets of indices \({\mathscr{B}}_{1}\) and \({\mathscr{B}}_{2}\), respectively. Utilizing the results of Melnykov et al. (2016), we observe that the positive within-block relations remain in effect even when negative constraints are imposed in addition. With the kernel \(\tau_{k}^{|{\mathscr{B}}_{b}|} {\prod}_{j \in {\mathscr{B}}_{b}} \phi(\boldsymbol{y}_{j}; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})\) describing the structure of block b, the posterior probability that block 1 belongs to cluster k equals

$$ \begin{array}{@{}rcl@{}} \pi_{1k} &=& \tau_{k}^{|\mathcal{B}_{1}|} \prod\limits_{j_{1} \in \mathcal{B}_{1}} \phi(\boldsymbol{y}_{j_{1}}; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}) \underset{s \ne k}{\sum\limits_{s=1}^{K}} \tau_{s}^{|\mathcal{B}_{2}|} \prod\limits_{j_{2} \in \mathcal{B}_{2}} \phi(\boldsymbol{y}_{j_{2}}; \boldsymbol{\mu}_{s}, \boldsymbol{\Sigma}_{s}) \\ &&\times \left[ \sum\limits_{k^{\prime} = 1}^{K} \tau_{k^{\prime}}^{|\mathcal{B}_{1}|} \prod\limits_{j_{1} \in \mathcal{B}_{1}} \phi(\boldsymbol{y}_{j_{1}}; \boldsymbol{\mu}_{k^{\prime}}, \boldsymbol{\Sigma}_{k^{\prime}}) \underset{s^{\prime} \ne k^{\prime}}{\sum\limits_{s^{\prime}=1}^{K}} \tau_{s^{\prime}}^{|\mathcal{B}_{2}|} \prod\limits_{j_{2} \in \mathcal{B}_{2}} \phi(\boldsymbol{y}_{j_{2}}; \boldsymbol{\mu}_{s^{\prime}}, \boldsymbol{\Sigma}_{s^{\prime}}) \right]^{-1}. \end{array} $$

Then, it follows that the membership of block 1 is determined by

$$ z_{1} = \underset{k}{\arg\max} \left[ \tau_{k}^{|\mathcal{B}_{1}|} \prod\limits_{{j_{1}} \in \mathcal{B}_{1}} \phi(\boldsymbol{y}_{j_{1}}; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}) \times \left( \underset{s \neq k}{\sum\limits_{s=1}^{K}} \tau_{s}^{| \mathcal{B}_{2} |} \prod\limits_{j_{2} \in \mathcal{B}_{2}} \phi \left( \boldsymbol{y}_{j_{2}}; \boldsymbol{\mu}_{s}, \boldsymbol{\Sigma}_{s} \right) \right) \right], $$

which in turn can be re-written as

$$ z_{1}= \underset{k}{\arg\max} \left[ \sum\limits_{s=1, s \neq k}^{K} \exp \left( -\frac{1}{2\sigma^{2}} \left( \sum\limits_{j_{1} \in \mathcal{B}_{1}} \| \boldsymbol{y}_{j_{1}} - \boldsymbol{\mu}_{k} \|^{2} + \sum\limits_{j_{2} \in \mathcal{B}_{2}} \| \boldsymbol{y}_{j_{2}} - \boldsymbol{\mu}_{s} \|^{2} \right) \right) \right], $$

where we once again utilized the sphericity of covariance matrices and equality of mixing proportions. The corresponding expressions for block 2 are completely symmetric.
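A sketch of this rule for block \({\mathscr{B}}_{1}\) follows (Python; `D1[k]` and `D2[k]` are assumed to hold the within-block sums of squared distances \({\sum}_{j \in {\mathscr{B}}_{b}} \| \boldsymbol{y}_{j}-\boldsymbol{\mu}_{k} \|^{2}\); combining the exponentials on the log scale is purely a numerical-stability detail, not part of the derivation):

```python
import numpy as np
from scipy.special import logsumexp

def assign_block1_negative(D1, D2, sigma2):
    """Label for block B1 when a cannot-link constraint separates B1 from B2.
    D1[k], D2[k]: within-block sums of squared distances to center k."""
    K = len(D1)
    score = np.empty(K)
    for k in range(K):
        # sum over the clusters s != k that remain admissible for block B2
        others = np.delete(np.asarray(D2, dtype=float), k)
        score[k] = -D1[k] / (2 * sigma2) + logsumexp(-others / (2 * sigma2))
    return int(np.argmax(score))   # z1; z2 follows from the symmetric expression
```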

Similarly, in the case of three blocks represented by \({\mathscr{B}}_{1}\), \({\mathscr{B}}_{2}\), and \({\mathscr{B}}_{3}\), where negative constraints are established for each pair, i.e., all three blocks must belong to different clusters in the solution,

$$ \begin{array}{@{}rcl@{}} z_{1} & =& \underset{k}{\arg\max} \left[ \underset{s \neq k}{\sum\limits_{s=1,}^{K}} \underset{h \neq k, h \neq s}{\sum\limits_{h=1,}^{K}} \exp \left( -\frac{1}{2\sigma^{2}} \left( \sum\limits_{j_{1} \in \mathcal{B}_{1}} \| \boldsymbol{y}_{j_{1}} - \boldsymbol{\mu}_{k} \|^{2} \right.\right.\right.\\ &&\left.\left.\left. + \sum\limits_{j_{2} \in \mathcal{B}_{2}} \| \boldsymbol{y}_{j_{2}} - \boldsymbol{\mu}_{s} \|^{2} + \sum\limits_{j_{3} \in \mathcal{B}_{3}} \| \boldsymbol{y}_{j_{3}} - \boldsymbol{\mu}_{h} \|^{2} \right) \right) {\vphantom{\underset{s \neq k}{\sum\limits_{s=1,}^{K}} }}\right]. \end{array} $$

Thanks to the symmetry, z2 and z3 are determined by similar expressions with straightforward index adjustments.
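The same construction extends to an arbitrary system of negative constraints among the blocks: for block b, one sums the weights of all label combinations of the other blocks that respect the cannot-link pairs. A brute-force sketch of this rule follows (our own helper, not code from the paper; it enumerates all \(K^{B}\) label tuples and is therefore only intended for a small number of mutually constrained blocks):

```python
import itertools
import numpy as np

def assign_constrained_blocks(D, cannot_link, sigma2):
    """D: (B, K) array with D[b, k] = sum of squared distances of block b to center k.
    cannot_link: set of block pairs (b, b') that must receive different labels.
    Returns a label for each block by maximizing, for every block, the total weight
    of the admissible labelings in which that block carries a given label."""
    B, K = D.shape
    admissible = [labels for labels in itertools.product(range(K), repeat=B)
                  if all(labels[b] != labels[bp] for b, bp in cannot_link)]
    z = np.empty(B, dtype=int)
    for b in range(B):
        score = np.full(K, -np.inf)
        for labels in admissible:
            w = -sum(D[i, labels[i]] for i in range(B)) / (2 * sigma2)
            score[labels[b]] = np.logaddexp(score[labels[b]], w)  # accumulate on log scale
        z[b] = score.argmax()
    return z
```

For two blocks with a single cannot-link pair this reduces to the rule for z1 displayed above, and for three mutually separated blocks it reproduces the double sum over s and h.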

3 Experiments and Illustrations

3.1 Motivating Example

We begin by considering an example that emphasizes the flexibility of our method and shows the advantage of defining block restrictions over the direct labeling of points. Suppose that there are two classes of individuals overlapping quite substantially with one another. For instance, the two classes may represent two species of animals that, in principle, could be identified with a high degree of certainty by an expert. An example of such a process is detailed in Nimmo et al. (2018) in relation to the identification of insects that belong to the family Chironomidae. The identification of these insects involves their dissection on a microscopic scale, exposure to specialized chemical solutions, and further slide preparation that takes 4 weeks. For the classification described in Nimmo et al. (2018), the slides had to be mailed to an expert located off-site. Alternatively, the identification can be made by observing morphological characteristics of Chironomidae flies such as their length and the measurements of antennae, leg segments, and other parts of the body. The latter method is not error-proof but is much less demanding in terms of time and effort. To train the clustering algorithm, a small group of individuals can first be identified by an expert using the classical technique involving microscopic slide preparation. At this point, algorithms such as seeded K-means or constrained K-means would assign labels to the points that have been identified. The identification of Chironomidae can involve a large number of species; however, for illustrative purposes, we focus on only two classes in this example.

Figure 1 shows a two-dimensional dataset with n = 200 generated from two such classes of points. Both clusters here are spherical and have approximately equal proportions, making this dataset favorable for classification by the K-means algorithm. Three points in each class have been identified, and this information is used to start the K-means algorithm. In Fig. 1, these points are shown circled and connected by lines of the corresponding color. Note that the points used for training were all picked in the area of overlap between the clusters, where misclassifications can easily occur. We implement the constrained K-means approach with these starting conditions and observe that in the estimated classification the classes are essentially flipped. As a result, the majority of animals that belong to species 1 will be classified as species 2, while those from species 2 will be identified as species 1. Driven to maintain the labels of the points in the training set, the method does so at the expense of the vast majority of the observations.

Fig. 1 Differences between the true (a) and estimated classifications using the constrained K-means algorithm (b) and the proposed method (c). The points with known classification are shown via connecting lines

This scenario is less likely to occur if a large random sample of points is chosen for training. At the same time, the described situation is not far-fetched if one takes into account that the observations used in the training set are often the ones that are readily available to the researcher rather than selected at random. On the surface, it may also seem advantageous to determine the classification of points in the “gray” area with the intent of equipping the algorithm to classify observations in the most difficult circumstances. However, such a strategy can easily misfire, as shown by our example. We explore this phenomenon in more detail in the simulations with a larger number of clusters.

3.2 Experiments

We now proceed to consider a variety of practical situations with different configurations of positive and negative constraints. Where appropriate, we consider a comparison with the constrained K-means and seeded K-means approaches of Basu et al. (2002), which are often used as benchmarks for other algorithms (Gu and Lu 2012; Fatehi et al. 2014). In addition, in the situations where negative constraints are defined among all specified blocks, the constrained K-means approach is close in spirit to our proposed method. The R package MixSim (Melnykov et al. 2012) was used to simulate datasets with varying degrees of overlap and clustering complexity.

3.2.1 Ten Clusters with Positive and Negative Constraints in Place

We first consider a simulation with K = 10 classes where ten blocks were defined in such a manner that each class has exactly one block within it and all blocks are separated by negative constraints. This setup fits well with the constrained K-means and seeded K-means methods of Basu et al. (2002), but such a configuration captures only one in a variety of possible scenarios that our proposed method is capable of handling.

In addition, a “naive” semi-supervised clustering approach was implemented, in which a randomly chosen training set was used to find the initial mean of each of the ten clusters. Each of the remaining points was then assigned to the cluster with the nearest mean, but no recalculation of the means or reassignment of points took place. This approach amounts to a single iteration of the K-means algorithm.
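A sketch of this naive baseline (Python/NumPy; the function and argument names are our own, and every class is assumed to contribute at least one labeled training point):

```python
import numpy as np

def naive_semisupervised(Y, train_idx, train_labels, K):
    """Initialize each mean from the labeled training points and assign every point
    to the nearest mean in a single pass, with no further iterations."""
    mu = np.array([Y[train_idx[train_labels == k]].mean(axis=0) for k in range(K)])
    d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    z = d2.argmin(axis=1)            # one assignment step; the means are never updated
    z[train_idx] = train_labels      # the labeled points keep their known classes
    return z
```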

To generate the datasets of varying complexity, the number of dimensions p was taken at 5, 10, and 20 and the average pairwise overlap \(\bar {\omega }\) was set at 0.01 and 0.10. Here, the pairwise overlap is defined as the sum of misclassification probabilities when two clusters are considered at a time (Maitra and Melnykov 2010). For each combination of p and \(\bar {\omega }\), 100 datasets of n = 10,000 points each were generated with approximately equal allocations of points to the ten clusters. The training set was obtained by randomly extracting a fraction of points in each class.

The results of the simulation study are summarized in Tables 1 and 2 where the proposed method is labeled as “ssK-means” (for “semi-supervised K-means”). In Table 1, training percentages varied between 1 and 7% in 2% increments, while in Table 2 larger percentages between 10 and 70% were considered.

Table 1 Smaller training percentages
Table 2 Larger training percentages

It can be seen that all four methods performed roughly the same, with the “naive” method falling somewhat behind at lower training percentages. This effect holds for both the smaller and the larger value of the average pairwise overlap and across the different dimensionalities. Thus, the proposed method proves quite viable in the simulated scenario with training sets picked at random. It is worth noting that in practice training sets are often formed not at random but based on other considerations, such as the availability of points or their perceived usefulness for classifying future observations.

3.2.2 Selecting a Training Set in the Overlap Area

We now proceed to evaluate some less straightforward clustering situations that showcase the advantages of our proposed method. First, let us extend the scenario described in Section 3.1 to a larger number of clusters and consider how the choice of points for training affects the proportion of correct classifications. We consider K = 4 two-dimensional clusters with the average pairwise overlap \(\bar{\omega}\) equal to 0.2. Each dataset consists of a total of n = 500 points with equal proportions among the four generated classes. We have seen in Section 3.1 that the choice of a training set in the “gray” area carries a substantial misclassification risk. Let fk(x) and \(f_{k^{\prime}}(\boldsymbol{x})\), where \(k,k^{\prime}=1, \ldots, K\), \(k \neq k^{\prime}\), be two of the Gaussian pdfs used by MixSim to generate the classes of points, and let the “gray” area be defined by the choice of a constant C > 1 such that \(C^{-1} \leq \frac{f_{k}(\boldsymbol{x})}{f_{k^{\prime}}(\boldsymbol{x})} \leq C\). Figure 2 shows the proportion of correct classifications for the proposed method (green dots) and constrained K-means (blue triangles) as a function of C. The average proportions were determined over 10,000 runs for 3, 5, 7, and 9 points in the training set.
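The selection of candidate training points in the “gray” area can be expressed directly through this density ratio. A brief sketch under the assumption that the two Gaussian densities are known (the parameter names are illustrative; in the study the densities come from the MixSim-generated mixture):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gray_area_points(Y, mu_k, Sigma_k, mu_kp, Sigma_kp, C):
    """Indices of the points whose density ratio f_k / f_k' lies within [1/C, C]."""
    fk = multivariate_normal.pdf(Y, mean=mu_k, cov=Sigma_k)
    fkp = multivariate_normal.pdf(Y, mean=mu_kp, cov=Sigma_kp)
    ratio = fk / fkp
    return np.where((ratio >= 1.0 / C) & (ratio <= C))[0]
```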

Fig. 2 (a–d) Proportion of correct classifications versus the ratio of densities (C) defining the “gray” area for the proposed method (dots) and constrained K-means (triangles)

While the proposed method shows a stable performance for different training set sizes and even leads to slightly better outcomes for lower values of C, the constrained K-means algorithm experiences problems if the size of the training set is small and especially when the points for training are picked in the area of high overlap. For C < 3, it has lower classification proportions even when the training set is obtained from 9 randomly selected points that represent about 7% of the given class and are classified without error.

It is worth noting that a similar effect is observed if one or more points in the training set are mislabeled. When labels are determined for the points in the training set, it is typically assumed that these labels are 100% error-free. In practice, however, a point may be misidentified even by an expert, or some other error may occur. We simulated n = 10,000 points in equal proportions between K = 2 two-dimensional clusters with a pairwise overlap of 0.2 and generated at random two blocks of three labeled points, one in each class. In the absence of errors, both the proposed method and constrained K-means yield a 0.900 rate of correct classifications over 1000 simulations.

To determine the effect that an incorrect label may have, we kept one of the blocks error-free, while in the second one we included a point that was misidentified and in reality came from the opposite class. Since our method does not rely on direct labeling, its proportion of correct classifications remains unaffected at 0.900, owing to the negligible effect that one point has on a set of 10,000 observations. At the same time, the rate of the constrained K-means method drops to 0.882.

To summarize, if labels are assigned to a portion of the observations before the main classification stage, caution needs to be exercised due to the real possibility of misclassifying substantial portions of the data, especially if the majority of the training points are located in the area of cluster overlap. Consequently, it is beneficial to include in the training set some points with an easy, straightforward classification. The proposed algorithm overcomes this obstacle by holding off the assignment of labels to the points comprising each block until the stage of the algorithm at which all observations receive their labels.

3.2.3 Positive and Negative Constraint Configurations

As was pointed out at the beginning of Section 3.2, the proposed method can be used to model a variety of situations involving both positive and negative constraints that are not accessible to methods relying on the direct labeling of the points involved in the constraints. For the following illustrations, we consider simulations on two-dimensional datasets generated from K = 4 classes with n = 500 and \(\bar{\omega}=0.1\). A typical dataset of this nature is shown in Fig. 3.

Fig. 3 True classification in a dataset with \(K=4, p=2, \bar{\omega}=0.10\)

A configuration of blocks consistent with that of Section 3.2.1 involves a block from each class, defined by means of positive constraints, and a set of negative constraints disconnecting all blocks from one another. This configuration is shown in Fig. 4a with three points per block. The symbols defining each block (circles, squares, triangles, and diamonds) differ in order to indicate the negative constraints in place. Below, we adhere to the same conventions, showing positive constraints as connecting lines and indicating negative constraints by different symbols in the respective blocks. Figure 4b shows the proportion of correct classifications as a function of the number of points in each block. For the subsequent configurations of blocks, we omit the graphs of correct classification proportions as they are very similar to Fig. 4b.

Fig. 4 (a–f) Blocks with various configurations of constraints defined over four classes

Figure 4c shows a similar situation where only positive constraints are in place and no negative constraints are defined. Thus, in principle, the blocks that were defined can be allocated to the same class in the solution. The points participating in the four blocks are all shown by circles to indicate the lack of negative constraints.

Another possibility is for multiple blocks to be defined within the same class. Thus, in Fig. 4d, there were three blocks defined in the first and second classes but none were defined in the remaining two. Also, no negative constraints were imposed.

Figure 4e shows three blocks that were set up in three different classes in such a way that one of them, shown with squares, is separated from the other two by negative constraints. However, those two remaining blocks, both shown by circles, do not have a negative constraint defined between them. We use this opportunity to illustrate the construction of the membership function for multiple blocks. Let \({\mathscr{B}}_{1}\) and \({\mathscr{B}}_{3}\) represent the blocks marked with circles and \({\mathscr{B}}_{2}\) represent the block marked with squares in Fig. 4e. Then, for \({\mathscr{B}}_{1}\),

$$ \begin{array}{@{}rcl@{}} z_{1} & =& \underset{k}{\arg\max} \left[ \underset{s \neq k}{\sum\limits_{s=1,}^{K}} \underset{h \neq s}{\sum\limits_{h=1,}^{K}} \exp \left( -\frac{1}{2\sigma^{2}} \left( \sum\limits_{j_{1} \in \mathcal{B}_{1}} \| \boldsymbol{y}_{j_{1}} - \boldsymbol{\mu}_{k} \|^{2} \right.\right.\right.\\ &&\left.\left.\left.+ \sum\limits_{j_{2} \in \mathcal{B}_{2}} \| \boldsymbol{y}_{j_{2}} - \boldsymbol{\mu}_{s} \|^{2} + \sum\limits_{j_{3} \in \mathcal{B}_{3}} \| \boldsymbol{y}_{j_{3}} - \boldsymbol{\mu}_{h} \|^{2} \right) \right) {\vphantom{\underset{s \neq k}{\sum\limits_{s=1,}^{K}}}}\right]. \end{array} $$

The key to implementing negative constraints here is the careful treatment of the sums over s and h. The sum over s is matched with \({\mathscr{B}}_{2}\) and thus needs an exclusion for k, the summation index matched with \({\mathscr{B}}_{1}\). At the same time, the sum over h is matched with \({\mathscr{B}}_{3}\) and requires an exclusion for s, the index matched with \({\mathscr{B}}_{2}\).

The membership function for block \({\mathscr{B}}_{2}\) also needs to reflect the fact that \({\mathscr{B}}_{2}\) is separated from the other two blocks, while \({\mathscr{B}}_{1}\) and \({\mathscr{B}}_{3}\) do not have a negative constraint between them:

$$ \begin{array}{@{}rcl@{}} z_{2} & =& \underset{k}{\arg\max} \left[ \underset{s \neq k}{\sum\limits_{s=1,}^{K}} \underset{h \neq k}{\sum\limits_{h=1,}^{K}} \exp \left( -\frac{1}{2\sigma^{2}} \left( \sum\limits_{j_{1} \in \mathcal{B}_{1}} \| \boldsymbol{y}_{j_{1}} - \boldsymbol{\mu}_{s} \|^{2} \right.\right.\right.\\ && \left.\left.\left.+ \sum\limits_{j_{2} \in \mathcal{B}_{2}} \| \boldsymbol{y}_{j_{2}} - \boldsymbol{\mu}_{k} \|^{2} + \sum\limits_{j_{3} \in \mathcal{B}_{3}} \| \boldsymbol{y}_{j_{3}} - \boldsymbol{\mu}_{h} \|^{2} \right) \right) {\vphantom{\underset{s \neq k}{\sum\limits_{s=1,}^{K}}}}\right]. \end{array} $$

The membership function for \({\mathscr{B}}_{3}\) is analogous to that of \({\mathscr{B}}_{1}\) with appropriate index adjustments.
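In terms of the brute-force helper sketched at the end of Section 2.4 (our own illustration, with 0-based block indices), this configuration corresponds to a cannot-link set containing only the pairs that involve \({\mathscr{B}}_{2}\):

```python
# Blocks B1, B2, B3 are indexed 0, 1, 2; only B2 carries negative constraints.
cannot_link = {(0, 1), (1, 2)}
# z = assign_constrained_blocks(D, cannot_link, sigma2)   # D: (3, K) array of block distances
```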

Finally, Fig. 4f shows a configuration of four blocks where only two negative constraints were defined. The blocks shown with solid circles \(({\mathscr{B}}_{1})\) and solid squares (\({\mathscr{B}}_{2}\)) will be kept separate, but either of them may end up in the same class with the blocks shown by the blank symbols. Similarly, the blocks shown with blank circles \(({\mathscr{B}}_{3})\) and blank squares \(({\mathscr{B}}_{4})\) have a negative constraint only between the two of them. We present the membership function for \({\mathscr{B}}_{3}\), while other membership functions will be similar due to the fact that each block has only one negative constraint in place separating it from one other block:

$$ \begin{array}{@{}rcl@{}} z_{3} & =& \underset{k}{\arg\max} \left[ \sum\limits_{s=1}^{K} \underset{h \neq s}{\sum\limits_{h=1,}^{K}} \underset{t \neq k}{\sum\limits_{t=1,}^{K}} \exp \left( -\frac{1}{2\sigma^{2}} \left( \sum\limits_{j_{1} \in \mathcal{B}_{1}} \| \boldsymbol{y}_{j_{1}} - \boldsymbol{\mu}_{s} \|^{2}\right. \right.\right.\\ && \left.\left.\left.+ \sum\limits_{j_{2} \in \mathcal{B}_{2}} \| \boldsymbol{y}_{j_{2}} - \boldsymbol{\mu}_{h} \|^{2} + \sum\limits_{j_{3} \in \mathcal{B}_{3}} \| \boldsymbol{y}_{j_{3}} - \boldsymbol{\mu}_{k} \|^{2} + \sum\limits_{j_{4} \in \mathcal{B}_{4}} \| \boldsymbol{y}_{j_{4}} - \boldsymbol{\mu}_{t} \|^{2} \right) \right) {\vphantom{\sum\limits_{s=1}^{K} \underset{h \neq s}{\sum\limits_{h=1,}^{K}}}}\right]. \end{array} $$

Overall, the proposed method offers much needed flexibility in modeling positive or negative constraints. We once again emphasize that the configurations illustrated in Fig. 4c–f could not be accommodated by the methods that require hard class assignments of the points involved in the constraints.

3.2.4 Comparison with a Model-Based Semi-supervised Method

The proposed method enjoys the advantages associated with the K-means algorithm, such as relative simplicity and ease of practical implementation compared to model-based approaches. At the same time, it is worth exploring how ssK-means compares to potentially more flexible models, for example, those involving finite Gaussian mixtures. In this simulation study, we focus on the comparison of our algorithm with the semi-supervised method of Melnykov et al. (2016).

The datasets were generated from mixture models with K = 4 and K = 10, and the sample size was chosen to be 100K. The number of dimensions p was taken equal to 5, 10, and 20, while the maximum overlap \(\check{\omega}\) was set at 0.01 and 0.1. In each class, one block of points was defined at random. As discussed in previous sections, the observations within each block are bound together by positive constraints and, in principle, blocks drawn from different classes could be clustered together. Block sizes |B| equal to 1, 10, and 25 were evaluated; |B| = 1 represents the trivial case of unsupervised clustering with each block being a singleton. Finally, to evaluate the effect that cluster shapes have on the outcomes, all simulations were performed for clusters of elliptical as well as spherical shapes. The Gaussian mixtures are, of course, expected to outperform ssK-means in the case of elliptical clusters. In contrast, for the second scenario, in which spherical clusters with equal covariance matrices are simulated, we expect ssK-means to be highly competitive with semi-supervised Gaussian mixtures. Each simulation scenario was repeated for 250 mixtures generated by MixSim.

As expected, the model-based method performed noticeably better on elliptical clusters, especially in the presence of a larger overlap (\(\check{\omega}=0.1\)) (Table 3). At the same time, the ssK-means algorithm was not far behind, especially when larger blocks were used. This emphasizes the positive effect that the introduction of extra information in the form of constraints has on successful classification.

Table 3 Results of the simulation study for elliptical clusters with K = 4 and K = 10

On spherical clusters with equal covariance matrices, the K-means algorithm was expectedly very competitive. In fact, it was even more successful than the model-based clustering method in the vast majority of situations, as shown in Table 4. A similar phenomenon was pointed out by Steinley and Brusco (2011) in their study of unsupervised clustering. In our semi-supervised setting, this effect is observed across different degrees of overlap, numbers of dimensions, and block sizes in positive constraints. Even though the developed K-means-based method is expected to perform well under such circumstances, the fact that it is capable of outperforming its model-based counterpart is quite noteworthy.

Table 4 Results of the simulation study for spherical clusters with K = 4 and K = 10

4 Discussion

The methodology described in this paper makes it possible to accommodate both positive and negative constraints in semi-supervised clustering by making them part of the K-means algorithm framework. The proposed method performs well in situations that are generally favorable for the use of the K-means algorithm, in particular when the clusters are roughly spherical and have approximately equal representations.

The novelty of the proposed approach lies in making both kinds of hard constraints an organic part of the clustering algorithm, whereas the methods proposed to date either verify the restrictions concurrently with executing the algorithm or use the labeled data only at the training stage. Another advantage of the proposed method is that it does not require the training data to be labeled as belonging to a certain class before the algorithm starts, since the placement of blocks into different classes is ensured by means of negative constraints. Thus, the process does not prohibit the training data from being labeled in advance but does not rely on such labeling. As a result, a larger variety of clustering situations can be accommodated compared to methods that rely on direct labeling.