
1 Introduction

The increasing complexity of modern technology leads to increasing requirements for the quality, reliability, and durability of industrial products. Product quality is determined by production tests. The quality of products within a single production batch is determined by the stability of the product parameters. Moreover, the stability of product parameters in manufactured batches can be improved by increasing the stability of the technological process.

In order to exclude potentially unreliable electronic and radio components (ERC) from installation in the onboard equipment of spacecraft with a long period of active existence, the entire electronic component base passes through specialized technical test centers [1, 2]. These centers carry out total input control of the ERC, total additional screening tests, total diagnostic non-destructive testing, and selective destructive physical analysis (DPA). To extend the DPA results to the entire batch of products, we must be sure that the products are manufactured from a single batch of raw materials. Therefore, identifying the original homogeneous ERC production batches within the shipped lots is one of the most important steps during testing [1].

The k-means model is well established for this problem [1, 3,4,5,6,7,8,9,10]. Its application allows us to achieve a sufficiently high accuracy of splitting the shipped lots into homogeneous production batches. The problem is solved as a k-means problem [11]: the aim is to find \(k\) points (centers or centroids) \(X_1,\dots ,X_k\) in a \(d\)-dimensional space such that the sum of the squared distances from the known points (data vectors) \(A_1,\dots ,A_N\) to the nearest of the required points reaches its minimum (1):

$$\begin{aligned} \arg \min F(X_{1},\dots ,X_{k})=\sum _{i=1}^{N}\min _{j\in \{\overline{1,k}\}}{\parallel X_{j}-A_{i}\parallel }^2. \end{aligned}$$
(1)
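
For illustration, a minimal NumPy sketch of objective (1) follows; the function name and array shapes are our own illustrative choices, not part of the original method.

```python
# Sketch: k-means objective (1) as a sum of squared Euclidean distances
# from each data vector to its nearest center.
import numpy as np

def kmeans_objective(A, X):
    """A: (N, d) data vectors A_1..A_N; X: (k, d) centers X_1..X_k."""
    d2 = ((A[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # (N, k) distances
    return d2.min(axis=1).sum()  # nearest-center distance, summed over points
```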

Factor analysis methods cannot significantly reduce the dimension of the space without a loss of accuracy in solving the problem [12]. Nevertheless, in some cases the accuracy of partitioning into homogeneous batches (the proportion of objects correctly assigned to “their” cluster representing a homogeneous batch of products) can be significantly improved, especially for samples containing more than 2 or 3 homogeneous batches. In addition, although factor analysis methods do not significantly reduce the dimension of the search space, they reveal linear statistical dependencies (correlations) between the ERC parameters within a homogeneous batch.

A slight increase in accuracy is achieved by using an ensemble of models [3]. We have also applied other clustering models, such as the Expectation-Maximization (EM) model and self-organizing Kohonen maps (SOM) [12].

The distance measure used in practical tasks of automatic object grouping depends on the features of the space. Changing the distance measure can improve the accuracy of automatic ERC grouping.

The idea of this work is to use the Mahalanobis distance measure in the k-means problem and to study the accuracy of the clustering results. We propose a new algorithm based on the k-means model that uses the Mahalanobis distance measure with an averaged estimation of the covariance matrix.

2 Mahalanobis Distance

In the k-means, k-median [13,14,15] and k-medoid [16,17,18] models, various distance measures may be applied [19, 20]. Correlation dependencies can be taken into account by moving from a search in a space with a Euclidean or rectangular distance to a search in a space with a Mahalanobis distance [21,22,23,24]. The square of the Mahalanobis distance, \(D_M^2\), is defined as follows (2):

$$\begin{aligned} D_{M}^{2}(X)={(X-\mu )}^{T}C^{-1}(X-\mu ), \end{aligned}$$
(2)

where \(X\) is the vector of measured parameter values, \(\mu \) is the vector of coordinates of the cluster center point (centroid), and \(C\) is the covariance matrix.
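
A minimal sketch of (2), assuming \(C\) is positive definite; solving a linear system instead of explicitly inverting \(C\) is our implementation choice:

```python
# Sketch: squared Mahalanobis distance (2).
import numpy as np

def mahalanobis_sq(x, mu, C):
    """x, mu: (d,) vectors; C: (d, d) covariance matrix (positive definite)."""
    diff = x - mu
    # np.linalg.solve(C, diff) computes C^{-1} diff without forming C^{-1}.
    return float(diff @ np.linalg.solve(C, diff))
```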

Experiments on automatic ERC grouping with the k-medoid and k-median models using the Mahalanobis distance show a slight increase in the clustering accuracy in simple cases (with 2–4 clusters)  [25].

3 Data and Preprocessing

In this study, we used the results of tests performed in the testing center on batches of integrated circuits (microchips) [26]. The source data is a set of ERC parameters measured during the mandatory tests. The sample (mixed lot) was originally composed of data on products belonging to different homogeneous batches (in accordance with the manufacturer's markup). The total amount of ERC is 3987 devices: Batch 1 contains 71 devices, Batch 2 contains 116, Batch 3 contains 1867, Batch 4 contains 1250, Batch 5 contains 146, Batch 6 contains 113, and Batch 7 contains 424. The items (devices) in each batch are described by 205 measured input parameters.

Computationally, the k-means problem, in which the sum of squared distances acts as the minimized objective function, is more convenient than the k-median model, which uses the sum of distances: with the sum of squared distances, the center point of the cluster (the centroid) coincides with the mean coordinate value of all objects in the cluster. When passing to the sum of squared Mahalanobis distances, this property is preserved.
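
This centroid property follows from the first-order optimality condition; the derivation below is a standard argument, not taken from the original paper:

$$\begin{aligned} \nabla _X\sum _{i=1}^{N}{(X-A_i)}^{T}C^{-1}(X-A_i)=2C^{-1}\sum _{i=1}^{N}(X-A_i)=0\;\Rightarrow \;X=\frac{1}{N}\sum _{i=1}^{N}A_i, \end{aligned}$$

since \(C^{-1}\) is positive definite; with \(C=I\) this reduces to the Euclidean case.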

Nevertheless, in many cases the use of the Mahalanobis distance in the problem of automatic ERC grouping decreases the accuracy in comparison with the results achieved with the Euclidean distance, because the advantage of the special data normalization approach is lost (Table 1; the hit percentage is computed as the sum of the algorithm's hits (true positives) in every batch divided by the number of products in the mixed lot).

Table 1. Comparison of the clustering results with different measures of distance, number of exact hits (proportion of hits)
Fig. 1. Statistical dependence of the ERC parameters 57 and 58

The assumption that the statistical dependencies of parameter values appear in a similar way in different ERC batches has experimental grounds. As can be seen from Fig. 1, the span and variance of the parameters vary significantly between batches. Even when the difference in the span and variance of some parameters is insignificant among separate batches, these values differ significantly from the span and variance of the entire mixed lot (Fig. 2). Thus, it is erroneous to take the variance and covariance coefficients in each of the homogeneous batches as equal to those of the whole sample. Experiments with the automatic grouping model based on a mixture of Gaussian distributions, fitted by maximizing the likelihood function with the EM algorithm [27], show a relatively high model adequacy only when using diagonal covariance matrices (i.e., uncorrelated distributions) that are, moreover, equal for all distributions. Apparent correlations between the parameters are thus not taken into account.

Fig. 2. Statistical dependence of the ERC parameters 23 and 24

The Mahalanobis distance is scale-invariant [28]. Due to this property, data normalization does not matter when this distance is applied. At the same time, binding the parameter boundaries to the boundaries determined by their physical nature sets a scale proportional to the permissible fluctuations of these parameters under operating conditions, without reference to the span and variance of these values in a particular production batch. A solution to the problem of preserving the scale could be to use the Mahalanobis distance with the correlation matrix \(R\) instead of the covariance matrix \(C\) (3):

$$\begin{aligned} D_{M}^{2}(X)={(X-\mu )}^{T}R^{-1}(X-\mu ). \end{aligned}$$
(3)

Each element of the matrix \(R\) is calculated as follows (4):

$$\begin{aligned} r_{XY}=\frac{\sum _{i=1}^{N}(X_i-\overline{X})(Y_i-\overline{Y})}{(N-1)S_{X}S_Y}, \end{aligned}$$
(4)

where \(S_X\) and \(S_Y\) are the standard deviations of parameters \(X\) and \(Y\), and \(\overline{X}\) and \(\overline{Y}\) are their mean values.
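
A sketch of distance (3) under the same assumptions, with \(R\) estimated by NumPy's corrcoef, which implements (4):

```python
# Sketch: Mahalanobis-like distance (3) with the correlation matrix R of (4).
import numpy as np

def correlation_distance_sq(x, mu, sample):
    """sample: (N, d) data used to estimate R; x, mu: (d,) vectors."""
    R = np.corrcoef(sample, rowvar=False)  # (d, d) matrix of coefficients (4)
    diff = x - mu
    return float(diff @ np.linalg.solve(R, diff))
```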

As shown by the experiments reported below, this approach offers no advantage over the other methods.

4 The K-Means Model with Supervised Mahalanobis Distance Measure

The clustering problem is a classical example of the unsupervised learning approach. However, in some cases, when solving the problem of automatic grouping, we have a sample of known composition. This sample can serve as a training (parameterizing) sample. In this case, a single covariance matrix \(C\) (see (2)) is calculated on this training sample and then used on other data. We call the Mahalanobis distance (2) with the covariance matrix \(C\) pre-calculated on a training sample the supervised (or parameterized) Mahalanobis distance.
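
A minimal sketch of this two-step use; the function names and NumPy conventions are illustrative:

```python
# Sketch: "supervised" Mahalanobis distance -- the covariance matrix C is
# estimated once on a training sample of known composition, then reused.
import numpy as np

def fit_covariance(train):
    """train: (N, d) training sample; returns the (d, d) matrix C of (2)."""
    return np.cov(train, rowvar=False)

def supervised_mahalanobis_sq(x, mu, C):
    diff = x - mu
    return float(diff @ np.linalg.solve(C, diff))
```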

If there is no training sample, well-known cluster analysis models can be used to isolate presumably homogeneous batches with some accuracy; the number of such batches can be determined by the silhouette criterion [29,30,31], as shown in the sketch below. With this approach, a mixed lot may be divided into a larger number of homogeneous batches than actually exist: smaller clusters are more likely to contain data of a single class, i.e., the probability of falsely assigning objects of different classes to one cluster is reduced. The proportion of objects of the same class falsely assigned to different classes is not so important for assessing the statistical characteristics of homogeneous groups of objects.
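
A hedged sketch of this selection step with scikit-learn; the candidate range of \(k\) and the k-means settings are our assumptions:

```python
# Sketch: choosing the number of presumably homogeneous batches by the
# silhouette criterion.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(data, k_range=range(2, 11)):
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    return max(scores, key=scores.get)  # k with the highest silhouette value
```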

Table 2. Comparison of the clustering results with different measures of distance, number of exact hits (proportion of hits) (Part 1)

In the next experiment, the training sample contained six batches: Batch 1 (71 devices), Batch 2 (116 devices), Batch 4 (1250 devices), Batch 5 (146 devices), Batch 6 (113 devices), and Batch 7 (424 devices). Using the covariance matrix \(C\), datasets containing two batches in all combinations were clustered with various distance measures. The results were compared with the traditional k-means clustering method with the squared Mahalanobis distance (unsupervised squared Mahalanobis distance; in Tables 2, 3 and 4 the proportion of hits is computed as the sum of the algorithm's hits in every batch divided by the number of products in the batch), and with the Euclidean and rectangular distances. For each model, we performed 5 experiments. Average clustering results are shown in Tables 2, 3 and 4.

The experiment showed that the results of solving the k-means problem with the supervised Mahalanobis distance measure are better than the results of the model with the unsupervised Mahalanobis distance, but still worse than those obtained with the Euclidean and rectangular distances.

5 The K-Means Model with Supervised Mahalanobis Distance Measure Based on Averaged Estimation of the Covariance Matrix

Since the original covariance matrices are of the same dimension, we can calculate an averaged estimation of the covariance matrix over all homogeneous batches of products in the training (parameterizing) sample:

$$\begin{aligned} C=\frac{1}{n}\sum _{j=1}^{k}C_{j}n_j, \end{aligned}$$
(5)

where \(n_j\) is the number of objects (components) in the \(j\)th production batch, \(n\) is the total sample size, and \(C_j\) are the covariance matrices calculated on the separate production batches, each of which can be calculated by (6):

$$\begin{aligned} C_j=E[(X-EX){(X-EX)}^T]. \end{aligned}$$
(6)
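
A direct sketch of (5)-(6) in NumPy; the list-of-arrays input format is our choice:

```python
# Sketch: averaged covariance estimation (5), a weighted average of the
# per-batch covariance matrices C_j (6) with weights n_j / n.
import numpy as np

def averaged_covariance(batches):
    """batches: list of (n_j, d) arrays, one per homogeneous batch."""
    n = sum(len(b) for b in batches)
    d = batches[0].shape[1]
    C = np.zeros((d, d))
    for b in batches:
        C += len(b) * np.cov(b, rowvar=False)  # n_j * C_j
    return C / n
```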

We propose a k-means algorithm using the Mahalanobis distance measure with the averaged estimation of the covariance matrix. The convergence of the k-means algorithm with a Mahalanobis distance is reviewed in [32]. The optimal \(k\) value was found by the silhouette criterion [30]:

[Algorithm listing: k-means with the Mahalanobis distance measure based on the averaged estimation of the covariance matrix]
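
Since the original listing did not survive extraction, the sketch below restates the proposed procedure as we read it from the text: standard k-means iterations in which the assignment step uses the squared Mahalanobis distance with the averaged covariance estimate (5). Initialization and the stopping rule are our assumptions:

```python
# Sketch: k-means with the squared Mahalanobis distance; the covariance
# matrix C is precomputed (e.g., by averaged_covariance()) on a training
# sample. Empty clusters are not handled in this sketch.
import numpy as np

def kmeans_mahalanobis(data, k, C, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    C_inv = np.linalg.inv(C)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: nearest center by squared Mahalanobis distance.
        diff = data[:, None, :] - centers[None, :, :]        # (N, k, d)
        d2 = np.einsum('nkd,de,nke->nk', diff, C_inv, diff)  # (N, k)
        labels = d2.argmin(axis=1)
        # Update: with squared distances the optimal center is the cluster mean.
        new_centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```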
Table 3. Comparison of the clustering results with different measures of distance, number of exact hits (proportion of hits) (Part 2)
Table 4. Comparison of the clustering results with different measures of distance, number of exact hits (proportion of hits) (Part 3)

6 Computational Experiments

A series of experiments was carried out on the data set described above. This mixed lot is convenient because its composition is known in advance, which allows us to evaluate the accuracy of the applied clustering models. Moreover, this data set is difficult to group with well-known models: some homogeneous batches in its composition are practically indistinguishable from each other, and the accuracy of known clustering models on this sample is low [12, 33].

As a measure of clustering accuracy, we use the Rand Index (RI) [34], which determines the proportion of object pairs on which the reference and the resulting cluster splittings agree.
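
A minimal pairwise sketch of RI; it is quadratic in the number of objects, which is acceptable for samples of this size:

```python
# Sketch: Rand index -- the share of object pairs on which the reference and
# the obtained partitions agree (both together or both apart).
from itertools import combinations

def rand_index(ref, pred):
    """ref, pred: equal-length label sequences."""
    pairs = combinations(range(len(ref)), 2)
    agree = sum((ref[i] == ref[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / (len(ref) * (len(ref) - 1) / 2)
```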

To train the model with the averaged Mahalanobis distance measure, new combinations of batches were compiled from the components of the mixed lot, containing devices belonging to different homogeneous batches. The new combinations consist of 2–7 homogeneous batches. The training sample includes the entire data from each batch.

The experiments were conducted with 5 different clustering models:

Model DM1: k-means with the Mahalanobis distance measure, where the covariance matrix is estimated on the entire training sample. The objective function is defined as the sum of the squared distances.

Model DC: k-means with a distance measure similar to the Mahalanobis distance but using a correlation matrix instead of a covariance matrix (3). The objective function is defined as the sum of the squared distances.

Model DM2: k-means with the Mahalanobis distance measure based on the averaged estimation of the covariance matrix (5). The objective function is defined as the sum of the squared distances.

Model DR: k-means with the Manhattan (rectangular) distance measure. The objective function is defined as the sum of the distances.

Model DE: k-means with the Euclidean distance measure. The objective function is defined as the sum of the squared distances.

This paper presents the results of three groups of experiments. In each group, for each working sample, the k-means algorithm was run 30 times with each of the five studied clustering models. In these experiments, the highest RI value was shown by the k-means algorithm with the Mahalanobis distance measure based on the averaged estimation of the covariance matrix.

Table 5. An experiment of the 1st group

First Group. The training set corresponds to the working sample for which clustering was carried out. Five series of experiments were performed. In each series, the sample is composed of a combination of products belonging to 2–7 homogeneous batches. Table 5 presents the maximum, minimum, mean value, and standard deviation of the Rand index and of the objective function for the 7-batch sample. For the objective function, the coefficient of variation (V) and the span factor (R, where \(R=Max-Min\)) were also calculated, as sketched below.
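
A small sketch of these per-run summary statistics; the dictionary output format is illustrative:

```python
# Sketch: summary statistics over repeated runs of the objective function:
# coefficient of variation V and span factor R = Max - Min.
import numpy as np

def run_statistics(values):
    v = np.asarray(values, dtype=float)
    return {
        'mean': v.mean(),
        'std': v.std(ddof=1),           # sample standard deviation
        'V': v.std(ddof=1) / v.mean(),  # coefficient of variation
        'R': v.max() - v.min(),         # span factor
    }
```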

Second Group. The training and working samples do not match. In practice, the test center can use retrospective data from the supply and testing of products of the same type as a training sample. In this series of experiments, no more than seven homogeneous batches are present in the training set. The working sample is represented by a new combination of products belonging to different homogeneous batches. Table 6 presents the results for a 5-batch working set and a 7-batch training set.

Table 6. An experiment of the 2nd group

Third Group. The training and working samples also do not match, but the results of automatic product grouping were used as the training sample (k-means in multistart mode with the Euclidean distance measure). In each series of experiments, the training set consists of 10 batches, which in turn are the result of applying the k-means algorithm to the training set containing the entire sample. The working sample is represented by a new combination of products belonging to different homogeneous batches. Table 7 shows the results for a 7-batch working set.

Table 7. An experiment of the 3rd group

In most cases, the coefficient of variation of the objective function values is highest for the DE model, which uses the Euclidean distance measure. The span factor of the objective function, in contrast, has the highest values for the DM2 model, which uses the Mahalanobis distance measure with the averaged estimation of the covariance matrix. Therefore, obtaining consistently good values of the objective function requires multiple runs of the k-means algorithm, or the use of other algorithms based on the k-means model, such as j-means [35] or greedy heuristic algorithms [36].

According to the Rand index, the DM2 model shows the best accuracy among the presented models (Fig. 3(a)–3(c)) in almost all series of experiments. In all cases, the DM2 model surpasses the traditional DE model with the Euclidean distance measure (Fig. 3(b), 3(c)).

The experiments showed that there is no correlation between the values of the objective function and the Rand index in the series of experiments with model DM1 for any combination of training and working samples (Fig. 4(a)). In the other models, as the volume of the training and working samples (\(n_t\) and \(n_w\), respectively) increases, the clustering accuracy becomes constant (Fig. 4(b)). For the DM2 model, there is an inverse correlation between the achieved value of the objective function and the clustering accuracy RI on a small sample (Fig. 5(a)).

Fig. 3. The mean value of the Rand index for a) the 1st group; b) the 2nd group; c) the 3rd group

In addition, it is noteworthy that when applying the Euclidean distance measure, the best (smaller) values of the objective function do not correspond to the best (larger) accuracy values (Fig. 5(b)). This fact shows that the model with the Euclidean distance measure is not entirely adequate: the most compact clusters do not exactly correspond to homogeneous batches.

Fig. 4. Dependence of the Rand index on the value of the objective function for a) the DM1 model (\(n_t\) = 3987, \(n_w\) = 2054); b) the DM2 model (\(n_t\) = 3987, \(n_w\) = 3987)

Fig. 5. Dependence of the Rand index on the value of the objective function for a) the DM2 model (\(n_t\) = 187, \(n_w\) = 187); b) the DE model (\(n_t\) = 187, \(n_w\) = 187)

7 Conclusion

The proposed clustering model and algorithm, which use the k-means model with the Mahalanobis distance and an averaged (weighted) estimation of the covariance matrix, were compared with the k-means model with the Euclidean and rectangular distances in solving the problem of automatic grouping of industrial products into homogeneous production batches.

Taking into account its higher average Rand Index value, the proposed optimization model and algorithm, applied to clustering electronic radio components into homogeneous production batches, have an advantage over models with the traditionally used Euclidean and rectangular (Manhattan) metrics.