
1 Introduction

Clustering is a classical data analysis method that aims at creating natural groups from a set of objects by assigning similar objects to the same cluster while separating dissimilar objects into different clusters. Clustering solutions can be expressed in the form of a partition. Among partitional clustering methods, some produce hard [6, 18], fuzzy [10, 19] or credal partitions [2,3,4, 14]. A hard partition assigns an object to a cluster with total certainty, whereas a fuzzy partition represents the class membership of an object in the form of a probability distribution. The credal partition, developed in the framework of belief function theory, extends the concepts of hard and fuzzy partitions. It makes it possible to represent both uncertainty and imprecision regarding the class membership of an object.

Clustering is a challenging task since various clustering solutions can be valid although distinct. In order to guide clustering methods towards a specific and desired solution, semi-supervised clustering algorithms integrate background knowledge, generally in the form of instance-level constraints. In [2, 3, 19], labeled data constraints are taken into account to improve clustering performance. In [4, 6, 10, 18], two less informative constraints are introduced: the must-link constraint, which specifies that two objects have to be in the same cluster, and the cannot-link constraint, which indicates that two objects should not be assigned to the same cluster.

Combining the three types of instance-level constraints can help to retrieve as much information as possible and thus achieve better performance. However, there currently exist very few methods able to deal with such constraints [17] and, in particular, none generates a credal partition. In this paper, we propose to associate two evidential semi-supervised clustering algorithms, the first one handling pairwise constraints and the second one dealing with labeled data constraints. The goal is to create a more general algorithm that can obtain a large number of constraints from the background knowledge and that generates a credal partition.

The rest of the paper is organized as follows. Section 2 recalls the necessary background on belief functions, credal partitions and evidential clustering algorithms. Section 3 introduces the new algorithm, named LPECM, and presents its objective function as well as the optimization steps. Several experiments are presented in Sect. 4. Finally, Sect. 5 concludes the paper.

2 Background

2.1 Belief Function and Credal Partition

Evidence theory [15] (or belief function theory) is a mathematical framework that enables the representation of partial and uncertain knowledge. Let \(\mathbf {X}\) be a data set composed of n objects such that \(\mathbf {x}_{i} \in \mathbb {R}^{p}\) corresponds to the \(i^{th}\) object. Let \(\varOmega = \{\omega _1, \dots , \omega _c\}\) be the set of possible clusters. The mass function \(m_{i} : 2^\varOmega \rightarrow [0, 1]\) associated with the instance \(\mathbf {x}_{i}\) measures the degree of belief \(m_{ik} = m_i(A_k)\) that the true class of \(\mathbf {x}_{i}\) belongs to the subset \(A_{k} \subseteq \varOmega \). It satisfies:

$$\begin{aligned} \sum _{A_{k} \subseteq \varOmega }m_{ik} = 1. \end{aligned}$$
(1)

The collection \(\mathbf {M}=\left[ \mathbf {m}_{1}, \ldots , \mathbf {m}_{n}\right] \) such that \(\mathbf {m}_i=(m_{ik})\) forms a credal partition, which is a generalization of a fuzzy partition. Any subset \(A_k\) such that \(m_{ik} > 0\) is called a focal element of \(\mathbf {m}_i\). When all focal elements are singletons, the mass function is equivalent to a probability distribution. If this situation occurs for all objects, the credal partition \(\mathbf {M}\) can be seen as a fuzzy partition.

Several transformations of a mass function \(\mathbf {m}_i\) are possible in order to extract particular information. The plausibility function \(pl : 2^\varOmega \rightarrow [0, 1]\) defined in Eq. (2) corresponds to the maximal degree of belief that could be given to a subset A:

$$\begin{aligned} pl(A)=\sum _{A_{k} \cap A \ne \emptyset } m(A_{k}), \quad \forall A \subseteq \varOmega . \end{aligned}$$
(2)

To make a decision, a mass function can also be transformed into a pignistic probability distribution [16]. Finally, a hard credal partition can be obtained by assigning each object to the subset of clusters with the highest mass. This allows us to easily detect objects located in ambiguous regions.
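As an illustration, the following minimal Python sketch shows the pignistic transformation and the hard credal assignment described above. The subset enumeration (empty set first, then subsets of increasing cardinality) and the `subsets` helper are our own conventions for this example, not part of the original algorithm.

```python
import numpy as np
from itertools import combinations

def subsets(c):
    """Enumerate the subsets of Omega = {0, ..., c-1}: empty set first,
    then singletons, pairs, ..., up to Omega itself (assumed ordering)."""
    out = [()]
    for size in range(1, c + 1):
        out.extend(combinations(range(c), size))
    return out

def pignistic(m, c):
    """Pignistic transform: spread each non-empty mass uniformly over its
    singletons, then renormalise to discard the mass of the empty set."""
    betp = np.zeros(c)
    for mass, A in zip(m, subsets(c)):
        if len(A) > 0:
            betp[list(A)] += mass / len(A)
    return betp / betp.sum()

def hard_credal_assignment(m, c):
    """Assign the object to the subset of clusters with the highest mass."""
    return subsets(c)[int(np.argmax(m))]

# Example with c = 2 clusters and masses on (emptyset, {0}, {1}, {0,1})
m = np.array([0.0, 0.5, 0.2, 0.3])
print(pignistic(m, 2))             # -> [0.65, 0.35]
print(hard_credal_assignment(m, 2))  # -> (0,)
```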

2.2 Evidential C-Means Algorithm

Evidential C-Means (ECM) [14] is the credibilistic counterpart of the Fuzzy C-Means algorithm (FCM) [5]. In the FCM algorithm, each cluster is represented by a point called centroid or prototype. The ECM algorithm, which generates a credal partition, generalizes the cluster representation by considering a centroid \(\mathbf {v}_k\) in \(\mathbb {R}^p\) for each subset \(A_{k} \subseteq \varOmega \). The objective function is:

$$\begin{aligned} J_{\mathrm {ECM}}(\mathbf {M}, \mathbf {V})=\sum _{i=1}^{n} \sum _{A_{k} \ne \emptyset }\left| A_{k}\right| ^{\alpha } m_{ik}^{\beta } d_{ik}^{2}+\sum _{i=1}^{n} \rho ^{2} m_{i \emptyset }^{\beta }, \end{aligned}$$
(3)

subject to

$$\begin{aligned} \sum _{A_k \subseteq \varOmega , A_k \ne \emptyset } m_{ik} + m_{i \emptyset }=1 \quad \text { and } \quad m_{i k} \ge 0 \quad \forall i \in \{1,\dots , n\}. \end{aligned}$$
(4)

where \(|A_k|\) corresponds to the cardinality of the subset \(A_k\), \(\mathbf {V}\) is the set of prototypes and \(d^2_{ik}\) represents the squared Euclidean distance between \(\mathbf {x}_i\) and the centroid \(\mathbf {v}_k\). Outliers are handled through the masses \(m_{i\emptyset }, \forall i \in \{1,\dots ,n\}\), allocated to the empty set, together with the parameter \(\rho ^2 >0\). The two parameters \(\alpha \ge 0\) and \(\beta > 1\) are introduced to penalize the degree of belief assigned to subsets with high cardinality and to control the fuzziness of the partition.
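For illustration, a small sketch of the evaluation of \(J_{\mathrm{ECM}}\) of Eq. (3) could look as follows; the layout of the mass matrix (empty set in column 0) and the input shapes are assumptions made for this example only.

```python
import numpy as np

def ecm_objective(M, D, card, rho2=100.0, alpha=1.0, beta=2.0):
    """Evaluate Eq. (3).

    M    : (n, 2^c) mass matrix, column 0 holding m_{i,emptyset} (assumed layout)
    D    : (n, 2^c - 1) squared distances d_ik^2 to the non-empty subset centroids
    card : (2^c - 1,) cardinalities |A_k| of the non-empty subsets
    """
    data_term = np.sum((card ** alpha) * (M[:, 1:] ** beta) * D)
    outlier_term = rho2 * np.sum(M[:, 0] ** beta)
    return data_term + outlier_term
```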

An extension of the ECM algorithm has been proposed in order to deal with a Mahalanobis distance [4]. Such a metric is adaptive and handles various ellipsoidal cluster shapes, giving the algorithm more flexibility to recover the inherent structure of the data. The Mahalanobis distance \(d_{ik}^2\) between a point \(\mathbf {x}_i\) and a subset \(A_k\) is defined as follows:

$$\begin{aligned} d_{i k}^{2}=\left\| \mathbf {x}_{i}-\mathbf {v}_{k}\right\| _{\mathbf {S}_{k}}^{2}=\left( \mathbf {x}_{i}-\mathbf {v}_{k}\right) ^{T} \mathbf {S}_{k}\left( \mathbf {x}_{i}-\mathbf {v}_{k}\right) , \end{aligned}$$
(5)

where \(\mathbf {S}_{k}\) represents the evidential covariance matrix associated with the subset \(A_k\), calculated as the average of the covariance matrices of the singletons included in \(A_k\). Finally, objective function (3) has to be minimized with respect to the credal partition matrix \(\mathbf {M}\), the centroid matrix \(\mathbf {V}\) and \(\mathbf {S} = \{\mathbf {S}_1, \dots , \mathbf {S}_c\}\), the set of covariance matrices associated with the clusters.
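A possible sketch of this distance computation, following Eq. (5) and the averaging of the singleton matrices, is given below; the dictionary of per-cluster matrices and the encoding of subsets as tuples of cluster indices are assumptions for this illustration.

```python
import numpy as np

def mahalanobis_to_subset(x, v_k, singleton_S, A_k):
    """Squared distance of Eq. (5) between x and the centroid v_k of subset A_k.
    singleton_S maps each cluster index to its matrix S_j; S_k is the average
    of the matrices of the singletons contained in A_k."""
    S_k = np.mean([singleton_S[j] for j in A_k], axis=0)
    diff = np.asarray(x) - np.asarray(v_k)
    return float(diff @ S_k @ diff)
```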

2.3 Evidential Constrained C-Means Algorithm

Several evidential C-Means based algorithms have already been proposed [1,2,3,4, 8, 13] to deal with background knowledge. In each of them, constraints are expressed in the framework of belief functions and a term penalizing constraint violations is incorporated into the objective function of the ECM algorithm.

In [2, 3], labeled data constraints are introduced in the algorithms, i.e. the expert can express uncertainty about the label of an object by assigning it to a subset. The objective functions of these algorithms are written in such a way that any mass function which partially or fully respects a constraint on a specific subset gives a high weighted plausibility to a singleton included in that subset. This weighted plausibility \(T_{ij}\) is defined as:

$$\begin{aligned} T_{ij}=T_{i}\left( A_{j}\right) =\sum _{A_{j} \cap A_{l} \ne \emptyset } \frac{\left| A_{j} \cap A_{l}\right| ^{\frac{r}{2}}}{\left| A_{l}\right| ^{r}} m_{i l}, \quad \forall i \in \{1 \ldots n\},~A_{l} \subseteq \varOmega , \end{aligned}$$
(6)

where \(r\ge 0\) is a fixed parameter. Notice that if \(r = 0\), then \(\frac{\left| A_{j} \cap A_{l}\right| ^{\frac{r}{2}}}{\left| A_{l}\right| ^{r}}=1\), which implies that \(T_{ij}\) is identical to the plausibility \(pl_{ij}\).
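A direct transcription of Eq. (6) could look like the following sketch (subsets encoded as tuples of cluster indices, mass vector aligned with the same subset list as above):

```python
def weighted_plausibility(m_i, subsets_list, A_j, r=1.0):
    """T_i(A_j) of Eq. (6): weighted plausibility that object i belongs to A_j."""
    T = 0.0
    for mass, A_l in zip(m_i, subsets_list):
        inter = len(set(A_j) & set(A_l))
        if inter > 0:  # empty A_l is skipped, so no division by zero
            T += (inter ** (r / 2) / len(A_l) ** r) * mass
    return T
```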

In [4], the authors assumed that pairwise constraints (i.e. must-link and cannot-link constraints) are available. A plausibility for two objects to belong or not to the same class is then defined. These plausibilities allow us to add penalty terms taking high values when there exists a high plausibility that two objects are not (respectively are) in the same cluster although they are linked by a must-link constraint (respectively a cannot-link constraint):

$$\begin{aligned} \begin{aligned} pl_{l \times j}(\theta )&=\sum _{\left\{ A_l \times A_j \subseteq \varOmega ^{2} |(A_l \times A_j) \cap \theta \ne \emptyset \right\} } m_{l \times j}(A_l \times A_j) \\&=\sum _{A_l \cap A_j \ne \emptyset } m_{l}(A_l) m_{j}(A_j), \end{aligned} \end{aligned}$$
(7)
$$\begin{aligned} \begin{aligned} pl_{l \times j}(\overline{\theta })&=1-m_{l \times j}(\emptyset )-bel_{l \times j}(\theta ) \\&=1-m_{l \times j}(\emptyset )-\sum _{k=1}^{c} m_{l}\left( A_{k}\right) m_{j}\left( A_{k}\right) , \end{aligned} \end{aligned}$$
(8)

where \(\theta \), the event that objects \(\mathbf {x}_l\) and \(\mathbf {x}_j\) belong to the same class, corresponds to the subset \(\{(\omega _1,\omega _1), (\omega _2,\omega _2), \dots , (\omega _c,\omega _c)\}\) of \(\varOmega ^2\), whereas \(\overline{\theta }\), the event that they do not belong to the same class, corresponds to its complement.
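These two quantities can be computed directly from the mass vectors of the two objects, as in the illustrative transcription of Eqs. (7) and (8) below (same subset ordering as before, empty set at index 0 being an assumption of the sketch):

```python
def pl_same_class(m_l, m_j, subsets_list):
    """Eq. (7): plausibility that the two objects share a cluster."""
    return sum(a * b
               for a, A in zip(m_l, subsets_list)
               for b, B in zip(m_j, subsets_list)
               if set(A) & set(B))

def pl_different_class(m_l, m_j, subsets_list):
    """Eq. (8): plausibility that the two objects do not share a cluster."""
    # mass of the empty set in the product space: one of the two masses is empty
    m_empty = m_l[0] + m_j[0] - m_l[0] * m_j[0]
    bel_same = sum(a * b
                   for a, A in zip(m_l, subsets_list)
                   for b, B in zip(m_j, subsets_list)
                   if len(A) == 1 and A == B)
    return 1.0 - m_empty - bel_same
```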

3 The LPECM Algorithm with Instance-Level Constraints

3.1 Objective Function

We propose a new algorithm called Labeled and Pairwise constraints Evidential C-Means (LPECM), which is based on the ECM algorithm [14], handles the Mahalanobis distance and combines the advantages of pairwise constraints and labeled data constraints by adding three penalty terms:

$$\begin{aligned} J_{LPECM}(\mathbf {M},\mathbf {V},\mathbf {S})=\xi J_{ECM}(\mathbf {M},\mathbf {V},\mathbf {S})+ \gamma J_{\mathscr {M}}(\mathbf {M})+ \eta J_{\mathscr {C}} (\mathbf {M})+ \delta J_{\mathscr {L}}(\mathbf {M}), \end{aligned}$$
(9)

with respect to constraints (4). The formulation of \(J_{ECM}\) corresponds to Eqs. (3) and (5), \(J_{\mathscr {M}}\) is a penalty term used for must-link constraints, \(J_{\mathscr {C}}\) is dedicated to cannot-link constraints and \(J_{\mathscr {L}}\) handles labeled data constraints. The coefficients \(\xi \), \(\gamma \), \(\eta \) and \(\delta \) allow us to give more importance to the structure of the data, the must-link constraints, the cannot-link constraints or the labeled data constraints, respectively.

Penalty terms for pairwise constraints and labeled data constraints are defined similarly to [2, 4]:

$$\begin{aligned} J_{\mathscr {M}}(\mathbf {M})&= \sum _{(\mathbf {x}_i,\mathbf {x}_j) \in \mathscr {M}}\left( 1-(m_{i\emptyset } + m_{j\emptyset } - m_{i\emptyset } m_{j\emptyset })-\sum _{A_k \subseteq \varOmega , |A_k |= 1} m_{ik} m_{jk}\right) ,\end{aligned}$$
(10)
$$\begin{aligned} J_{\mathscr {C}}(\mathbf {M})&= \sum _{(\mathbf {x}_i,\mathbf {x}_j) \in \mathscr {C}}\sum _{A_{k} \cap A_{l} \ne \emptyset } m_{ik} m_{jl},\end{aligned}$$
(11)
$$\begin{aligned} J_{\mathscr {L}}(\mathbf {M})&= \sum _{i = 1}^n\sum _{A_{k} \subseteq \varOmega , A_{k} \ne \emptyset }b_{ik}\left( 1-\left( \sum _{A_{k} \cap A_{l}\ne \emptyset } \frac{\left| A_{k} \cap A_{l}\right| ^{\frac{r}{2}}}{\left| A_{l}\right| ^{r}} m_{il}\right) \right) , \end{aligned}$$
(12)

where \(b_{ik}\) denotes whether the \(i^{th}\) instance belongs to the subset \(A_k\) or not:

$$\begin{aligned} b_{ik}=\left\{ \begin{array}{cl} 1 &{} \text {if } \mathbf {x}_i \text { is constrained to subset } A_k,\\ 0 &{} \text {otherwise.} \end{array}\right. \end{aligned}$$
(13)

It should be emphasized that in this study, unlike [2], each labeled object is constrained to only one subset. This makes the set of constraints retrieved from the background knowledge more coherent. Constraints are gathered in three different sets such that \(\mathscr {M}\) corresponds to the set of must-link constraints, \(\mathscr {C}\) to the set of cannot-link constraints and \(\mathscr {L}\) denotes the labeled data constraints set. The \(J_{\mathscr {M}}\) term sums, over the must-link constrained pairs, the plausibility that the two objects do not belong to the same class. Similarly, \(J_{\mathscr {C}}\) sums, over the cannot-link constrained pairs, the plausibility that the two objects belong to the same class. The \(J_{\mathscr {L}}\) term computes, for each labeled object, the complement of the weighted plausibility of belonging to its label.
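Using the helpers sketched in Sect. 2, these penalty terms can be written compactly; the following is only an illustration of Eqs. (10)–(12), with \(\mathscr{M}\) and \(\mathscr{C}\) encoded as lists of index pairs and \(\mathscr{L}\) as a list of (index, subset) pairs.

```python
def J_must_link(M, ML, subsets_list):
    """Eq. (10): plausibility of being separated, summed over must-link pairs."""
    return sum(pl_different_class(M[i], M[j], subsets_list) for i, j in ML)

def J_cannot_link(M, CL, subsets_list):
    """Eq. (11): plausibility of sharing a cluster, summed over cannot-link pairs."""
    return sum(pl_same_class(M[i], M[j], subsets_list) for i, j in CL)

def J_labeled(M, L, subsets_list, r=1.0):
    """Eq. (12): complement of the weighted plausibility of the label,
    summed over the labeled objects (i, A_k) in L."""
    return sum(1.0 - weighted_plausibility(M[i], subsets_list, A_k, r)
               for i, A_k in L)
```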

3.2 Optimization

The objective function is minimized as the ECM algorithm, i.e. by carrying out an iterative scheme where first \(\mathbf {V}\) and \(\mathbf {S}\) are fixed to optimize \(\mathbf {M}\), second \(\mathbf {M}\) and \(\mathbf {S}\) are fixed to optimize \(\mathbf {V}\) and finally \(\mathbf {M}\) and \(\mathbf {V}\) are fixed to optimize \(\mathbf {S}\).

Centroids Optimization. It can be observed from (9) that the three penalty terms included in the objective function of the LPECM algorithm do not depend on the cluster centroids. Hence, the update scheme of \(\mathbf {V}\) is identical to the ECM algorithm [14].

Masses Optimization. In order to obtain a quadratic objective function with linear constraints, we set the parameter \(\beta = 2\). A classical optimization approach can then be used to solve the problem [7]. The following equations show how to transform the objective function (9) into a format accepted by most standard quadratic programming solvers.

Let us define \(\mathbf {m}_i^T = \left( m_{i\emptyset }, m_{i \omega _1}, \dots , m_{i\varOmega } \right) \) as the vector of masses for object \(\mathbf {x}_i\). The first term of \(J_{LPECM}\) is then:

$$\begin{aligned} J_{ECM}(\mathbf {M})=\sum _{i=1}^n \mathbf {m}_i^T \mathbf {\Phi }^i \mathbf {m}_i, \end{aligned}$$
(14)

where \(\mathbf {\Phi }^i=\left[ \phi _{kl}^i\right] \) is a diagonal matrix of size \((2^c \times 2^c)\) associated with object \(\mathbf {x}_i\) and defined as:

$$\begin{aligned} \phi _{k l}^{i}=\left\{ \begin{array}{ll}{\rho ^{2}} &{} { \text{ if } A_k=A_l \text{ and } A_{k} = \emptyset }, \\ {d_{i k}^{2}\left| A_{k}\right| ^{\alpha }} &{} { \text{ if } A_k=A_l \text{ and } A_{k} \ne \emptyset }, \\ {0} &{} { \text{ otherwise. } }\end{array}\right. \end{aligned}$$
(15)
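A sketch of the construction of \(\mathbf{\Phi}^i\), following Eq. (15) and assuming the empty set occupies index 0, is:

```python
import numpy as np

def phi_matrix(d2_i, card, rho2=100.0, alpha=1.0):
    """Diagonal matrix Phi^i of Eq. (15): rho^2 for the empty set,
    d_ik^2 |A_k|^alpha for the non-empty subsets (d2_i and card aligned)."""
    return np.diag(np.concatenate(([rho2], d2_i * card ** alpha)))
```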

The penalty term used for must-link constraints can be rewritten as follows:

$$\begin{aligned} \begin{aligned} J_{\mathscr {M}}(\mathbf {M}) = n_{\mathscr {M}} + \sum _{(\mathbf {x}_i,\mathbf {x}_j) \in \mathscr {M}} \left( \mathbf {F}_\mathscr {M}^T\mathbf {m}_i + \mathbf {F}_\mathscr {M}^T\mathbf {m}_j\right) + \sum _{(\mathbf {x}_i,\mathbf {x}_j) \in \mathscr {M}} \mathbf {m}_i^T \mathbf {\Delta }^\mathscr {M} \mathbf {m}_j, \end{aligned} \end{aligned}$$
(16)

where \(n_{\mathscr {M}}\) denotes the number of must-link constraints, \(\mathbf {F}_\mathscr {M}\) is a vector of size \(2^c\) and \(\mathbf {\Delta }^\mathscr {M}=\left[ \delta ^\mathscr {M}_{kl}\right] \) corresponds to a matrix \((2^c \times 2^c)\) such that:

$$\begin{aligned} \mathbf {F}^T_\mathscr {M} = \underbrace{\left[ -1, 0, \dots , 0 \right] }_{2^c} \quad \text {and}\quad \delta _{k l}^{\mathscr {M}}=\left\{ \begin{array}{ll}{1} &{} { \text{ if } A_{k}=\emptyset \text{ or } A_{l}=\emptyset }, \\ -1 &{} { \text{ if } A_{k}=A_{l} \text{ and } \left| A_{k}\right| =\left| A_{l}\right| =1}, \\ {0} &{} { \text{ otherwise. }}\end{array}\right. \end{aligned}$$
(17)
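These two quantities could be assembled as below; the sketch follows Eq. (17) literally, with the subset list starting with the empty set.

```python
import numpy as np

def must_link_blocks(subsets_list):
    """F_M and Delta^M of Eq. (17)."""
    K = len(subsets_list)
    F_M = np.zeros(K)
    F_M[0] = -1.0
    Delta_M = np.zeros((K, K))
    for k, A_k in enumerate(subsets_list):
        for l, A_l in enumerate(subsets_list):
            if len(A_k) == 0 or len(A_l) == 0:
                Delta_M[k, l] = 1.0
            elif A_k == A_l and len(A_k) == 1:
                Delta_M[k, l] = -1.0
    return F_M, Delta_M
```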

The penalty term associated to cannot-link constraints is:

$$\begin{aligned} J_{\mathscr {C}}(\mathbf {M}) = \sum _{(\mathbf {x}_i,\mathbf {x}_j) \in \mathscr {C}} \mathbf {m}_i^T \mathbf {\Delta }^\mathscr {C} \mathbf {m}_j, \end{aligned}$$
(18)

where \(\mathbf {\Delta ^\mathscr {C}}=\left[ \delta ^\mathscr {C}_{kl}\right] \) is a matrix \((2^c \times 2^c)\) such that:

$$\begin{aligned} \delta _{k l}^{\mathscr {C}}=\left\{ \begin{array}{ll}{1} &{} { \text{ if } A_{k} \cap A_{l} \ne \emptyset }, \\ {0} &{} { \text{ otherwise. }}\end{array}\right. \end{aligned}$$
(19)
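Similarly, a possible construction of \(\mathbf{\Delta}^\mathscr{C}\) from Eq. (19) is:

```python
import numpy as np

def cannot_link_block(subsets_list):
    """Delta^C of Eq. (19): entry (k, l) is 1 when A_k and A_l intersect."""
    K = len(subsets_list)
    Delta_C = np.zeros((K, K))
    for k, A_k in enumerate(subsets_list):
        for l, A_l in enumerate(subsets_list):
            if set(A_k) & set(A_l):
                Delta_C[k, l] = 1.0
    return Delta_C
```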

Finally, the penalty term for the labeled data constraints is expressed as follows:

$$\begin{aligned} J_{\mathscr {L}}(\mathbf {M}) = n_{\mathscr {L}} - \sum _{i=1}^n \mathbf {F}^T_{\mathscr {L}}\mathbf {m}_i, \end{aligned}$$
(20)

where \(n_{\mathscr {L}}\) denotes the number of labeled data constraints and \(\mathbf {F}_{\mathscr {L}}\) is a vector of size \(2^c\) such that:

$$\begin{aligned} \left( \mathbf {F}_{\mathscr {L}}\right) _{l}= & {} v_{ikl}\,c_{lk}, \quad \forall A_l \subseteq \varOmega ,\end{aligned}$$
(21)
$$\begin{aligned} c_{lk}= & {} \frac{\left| A_{k} \cap A_{l}\right| ^{\frac{r}{2}}}{\left| A_{l}\right| ^{r}},\end{aligned}$$
(22)
$$\begin{aligned} v_{ikl}= & {} \left\{ \begin{array}{ll} {1} &{} { \text{ if } \left( \mathbf {x}_i, A_{k} \right) \in \mathscr {L} \text { and }A_{k} \cap A_l \ne \emptyset },\\ {0} &{} { \text{ otherwise. }} \end{array}\right. \end{aligned}$$
(23)

where the expression \(\left( \mathbf {x}_i, A_{k} \right) \in \mathscr {L}\) means that the labeled data constraint on object i is the subset \(A_k\). The indicator \(v_{ikl} \in \{0,\ 1\}\) equals 1 for the subsets \(A_l\) that have a non-empty intersection with \(A_k\), given the constraint \(\mathbf {x}_i \in A_k\).
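For a single object constrained to the subset \(A_k\), the corresponding vector can be assembled as follows (an illustrative reading of Eqs. (21)–(23), with the subset list starting with the empty set):

```python
import numpy as np

def labeled_vector(subsets_list, A_k, r=1.0):
    """F_L for an object with labeled constraint (x_i, A_k): entry l equals
    |A_k ∩ A_l|^{r/2} / |A_l|^r when the two subsets intersect, 0 otherwise."""
    F_L = np.zeros(len(subsets_list))
    for l, A_l in enumerate(subsets_list):
        inter = len(set(A_k) & set(A_l))
        if inter > 0:
            F_L[l] = inter ** (r / 2) / len(A_l) ** r
    return F_L
```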

Now, let us define \(\mathbf {m}^T=\left( \mathbf {m}_1^T,\dots ,\mathbf {m}_n^T\right) \) the vector of size \(n2^c\) containing the masses for each object and each subset, \(\mathbf {H}\) a matrix of size \((n2^c \times n2^c)\) and \(\mathbf {F}\) a vector of size \(n2^c\) such that:

$$\begin{aligned} \mathbf {H}=\left( \begin{array}{cccc} \mathbf {\Phi }^1 &{} \mathbf {\Delta }_{12} &{} \cdots &{}\mathbf {\Delta }_{1n} \\ \mathbf {\Delta }_{21} &{} \mathbf {\Phi }^2 &{} \cdots &{} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \mathbf {\Delta }_{n1} &{} &{} {\cdots } &{} \mathbf {\Phi }^n \end{array}\right) , \quad \text {where}\quad \mathbf {\Delta }_{ij}={\left\{ \begin{array}{ll} \mathbf {\Delta }^\mathscr {M}, &{} \text {if } (\mathbf {x}_i, \mathbf {x}_j) \in \mathscr {M},\\ \mathbf {\Delta }^\mathscr {C}, &{} \text {else if } (\mathbf {x}_i, \mathbf {x}_j) \in \mathscr {C},\\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(24)
$$\begin{aligned} \mathbf {F}^T=\left( \begin{array}{ccccc} \mathbf {F}_1&\cdots&\mathbf {F}_i&\cdots&\mathbf {F}_n \end{array}\right) , \quad \text {where}\quad \mathbf {F}_i=t_i\mathbf {F}_\mathscr {M} - b_i\mathbf {F}_\mathscr {L}, \end{aligned}$$
(25)
$$\begin{aligned} t_i={\left\{ \begin{array}{ll} 1, &{}\text{ if } \mathbf {x}_i \in \mathscr {M},\\ 0, &{}\text {otherwise}. \end{array}\right. }, \quad \text {and}\quad b_i={\left\{ \begin{array}{ll} 1, &{}\text{ if } \mathbf {x}_i \in \mathscr {L},\\ 0, &{}\text {otherwise}. \end{array}\right. }. \end{aligned}$$
(26)

Finally, the objective function (9) can be rewritten as follows:

$$\begin{aligned} J_{LPECM}(\mathbf {M})= \mathbf {m}^T\mathbf {H}\mathbf {m}+ \mathbf {F}^T \mathbf {m}. \end{aligned}$$
(27)
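As an illustration of the mass update, the quadratic problem (27) under the constraints (4) can be handed to a generic solver; the sketch below uses SciPy's SLSQP method and is only meant to show the shape of the problem, not an efficient implementation (a dedicated QP solver is preferable for large n).

```python
import numpy as np
from scipy.optimize import minimize

def solve_masses(H, F, n, K):
    """Minimise m^T H m + F^T m (Eq. (27)) over the stacked mass vector m of
    length n*K, with each block of K masses non-negative and summing to 1."""
    def objective(m):
        return m @ H @ m + F @ m

    # one equality constraint per object: its K masses sum to 1
    constraints = [{"type": "eq",
                    "fun": lambda m, i=i: m[i * K:(i + 1) * K].sum() - 1.0}
                   for i in range(n)]
    m0 = np.full(n * K, 1.0 / K)
    res = minimize(objective, m0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * (n * K), constraints=constraints)
    return res.x.reshape(n, K)
```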

3.3 Metric Optimization

It can be observed from (9) that the three penalty terms of the LPECM objective function do not depend on the Mahalanobis distance. Since the set of metrics \(\mathbf {S}\) only appears in \(J_{ECM}\), the update method is identical to that of the ECM algorithm with Mahalanobis distance [4]. The overall procedure of the LPECM algorithm is summarized in Algorithm 1.

Algorithm 1. Overall procedure of the LPECM algorithm

4 Experiments

4.1 Experimental Protocols

The performance and time consumption of the LPECM algorithm have been tested on a toy data set and several classical data sets from the UCI Machine Learning Repository [9]. For the Letters data set, we kept only the three letters \(\{\)I,J,L\(\}\), as done in [6]. As in [14], the fixed parameters associated with the ECM algorithm were set to \(\alpha = 1\), \(\beta = 2\) and \(\rho ^2 = 100\). In order to balance the importance of the data structure, the must-link constraints, the cannot-link constraints and the labeled data constraints, we set \(\xi = \frac{1}{n2^c}\), \(\gamma = \frac{1}{|\mathscr {M}|}\), \(\eta = \frac{1}{|\mathscr {C}|}\) and \(\delta = \frac{1}{|\mathscr {L}|}\), respectively.

An experiment on a data set consists of 20 simulations with a random selection of the constraints. For each simulation, five runs of the LPECM algorithm with random initialization of the centroids are performed. Then, in order to avoid local optima, the clustering solution with the minimum value of the objective function is selected.

The accuracy of the obtained credal partition is measured with the Adjusted Rand Index (ARI) [12], the corrected-for-chance version of the Rand Index that compares a hard partition with the true partition of a data set. As a consequence, the credal partition generated by the LPECM algorithm is first transformed into a fuzzy partition using the pignistic transformation, and the maximum probability for each object is then retrieved to obtain a hard partition.
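In practice, this evaluation step can be sketched as follows, reusing the pignistic helper from Sect. 2.1 and scikit-learn's ARI (illustrative code, not the authors' implementation):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def credal_to_hard(M, c):
    """Pignistic transform of each mass vector, then argmax, as described above."""
    return np.array([int(np.argmax(pignistic(m, c))) for m in M])

# ari = adjusted_rand_score(true_labels, credal_to_hard(M, c))
```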

4.2 Toy Data Set

In order to show the interest of the LPECM algorithm, we started our experiments with a tiny synthetic data set composed of 15 objects and three classes. Figure 1 presents the hard credal partition obtained with the ECM algorithm. Large cross marks denote the centroid of each cluster. Centroids of subsets with higher cardinalities are not represented, to ease the reading. As can be observed, objects located between two clusters are assigned to subsets of cardinality two. Notice also that, due to the stochastic initialization of the centroids, there may be small differences between the results obtained from different executions of the ECM algorithm. After adding background knowledge in the form of must-link constraints, cannot-link constraints and labeled data constraints and running the LPECM algorithm with a Euclidean distance, it is interesting to observe that the previous uncertainties have vanished. Figure 2 presents the hard credal partition obtained. The magenta dashed lines describe cannot-link constraints, the light green solid lines represent must-link constraints and the circled points correspond to the labeled data constraints.

Figure 3 illustrates, for this execution of the LPECM algorithm, the mass distribution of the singletons with respect to the point numbers, giving a clearer view of the mass allocations. Table 1 displays the accuracy as well as the time consumption for the ECM algorithm and the LPECM algorithm when, first, only the cannot-link constraints are incorporated, second, when the cannot-link and must-link constraints are introduced (Cannot-Must-Link row in Table 1) and, finally, when all constraints are added (Cannot-Must-Labeled row in Table 1). Our results demonstrate that the combination of pairwise constraints and labeled data constraints improves the performance of the semi-supervised clustering algorithm with a tolerable time consumption. As expected, the more constraints are added, the better the performance.

Fig. 1. Hard credal partition obtained on Toy data set with the ECM algorithm

Fig. 2. Hard credal partition obtained on Toy data set with the LPECM algorithm

4.3 Real Data Sets

The LPECM algorithm has been tested on three well-known data sets from the UCI Machine Learning Repository, namely Iris, Glass and Wdbc, as well as the Letters data set derived from UCI. Table 2 indicates, for each data set, its number of objects, attributes and classes.

For each data set, we randomly created 5%, 8% and 10% of each type of constraints out of the whole set of objects, leading to a total of 15%, 24% and 30% of constraints. As an example, Fig. 4 shows the hard credal partition obtained on the Iris data set after executing the LPECM algorithm with a Mahalanobis distance and 24% of constraints in total. As can be observed, all the constrained objects are assigned with certainty to a singleton. Ellipses represent the covariance matrices obtained for each cluster.

Fig. 3. Mass curve obtained on Toy data set with the LPECM algorithm

Table 1. Performance obtained on toy data set with the LPECM algorithm

Tables 3 and 4 report, for all data sets, the accuracy results with a Euclidean and a Mahalanobis distance respectively when the different percentages of constraints are employed. Means and standard deviations are calculated over 20 simulations. As can be observed, incorporating constraints leads most of the time to a significant improvement of the clustering solution. Using a Mahalanobis distance particularly helps to achieve a better accuracy than using a Euclidean distance. Indeed, the Mahalanobis distance corresponds to an adaptive metric giving more freedom than a Euclidean distance to respect the constraints while finding a coherent data structure.

Table 2. Description of the data sets from UCIMLR
Fig. 4. Hard credal partition obtained on Iris data set with the LPECM algorithm

Table 3. LPECM’s performance (ARI) with Euclidean distance
Table 4. LPECM’s performance (ARI) with Mahalanobis distance

Regarding time consumption, as can be observed in Fig. 5: (1) adding constraints yields a higher computation time than using no constraints; (2) most of the time, the more constraints are added, the less time is needed to complete the computation.

Fig. 5. Time consumption (CPU) of the LPECM algorithm with Euclidean distance

5 Conclusion

In this paper, we introduced a new algorithm named Labeled and Pairwise constraints Evidential C-Means (LPECM). It generates a credal partition and combines the three main types of instance-level constraints, allowing us to retrieve more constraints from the background knowledge than other semi-supervised clustering algorithms. In addition, the belief function framework employed in our algorithm allows us (1) to represent doubts in the labeled data constraints and (2) to clearly express, through the resulting credal partition, the uncertainties about the class memberships of the objects. Experiments show that the LPECM algorithm does obtain a better accuracy with the introduction of constraints, particularly with a Mahalanobis distance. Further investigations have to be performed to fine-tune the parameters and to study the influence of the constraints on the clustering solution. The LPECM algorithm can also be applied to a real application to show the interest of gathering various types of constraints. In this framework, active learning schemes, which automatically retrieve a few informative constraints with the help of an expert, are interesting to study. Finally, in order to scale up and speed up the LPECM algorithm, a new minimization process could be developed by relaxing some optimization constraints.