
1 Introduction

Instance selection methods represent an important approach in different areas of data science. In [1], several key elements related to instance selection are discussed, distinguishing two main processes: training set selection and prototype selection.

The selection of a representative subset from a very large data set is summarized in the concept of “data condensation” [2]. This form of data reduction differs from the others and constitutes one of the families of instance selection methods. Data condensation approaches are mainly studied in the context of classification, particularly the k-nearest neighbor (KNN) methods, which aim to obtain a minimal consistent set that correctly classifies the entire original set. Figure 1 shows a simple representation of the KNN rule.

Fig. 1 K-nearest neighbor representation [3]
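As a minimal illustration of the KNN rule sketched in Fig. 1, the following Python snippet classifies a query point by majority vote among its k nearest training samples. It is a sketch with our own function name, not a reimplementation of any cited method.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training samples."""
    # Euclidean distances from the query to every training sample
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest samples
    nearest = np.argsort(dists)[:k]
    # Majority vote over the labels of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy example with two classes in the plane
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.8, 0.9]), k=3))  # -> 1
```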

One of the first pioneering methods that analyzed the data structure for instance selection was CNN [4]. Data condensation methods that are not tied to the classification process are known as generic data condensation methods; such condensation is performed through so-called vector quantization (VQ), an example of which is the self-organizing map, among other ways of organizing the data, as shown in Fig. 2.

Fig. 2 CNN decision diagram for data reduction task

1.1 Vector Quantization

Vector quantization (VQ) is a classic method that approximates a continuous probability density function p(x) of the vector input variable x by a finite number of codebook vectors mi, i = 1, 2, …, k; once these codebook vectors have been chosen, approximating x amounts to finding the reference vector closest to x. An optimal placement of the mi minimizes E, the expected rth power of the reconstruction error [5]:

$$E = \int {\left\| {x - m_{c} } \right\|^{r} p\left( x \right){\text{d}}x}$$
(1)

where dx represents the differential volume element in the input space and the index c = c(x) of the best-matching codebook vector (the winner) is a function of the input vector:

$$\left\| {x - m_{c} } \right\| = \underbrace {\min }_{i}\left\{ {\left\| {x - m_{i} } \right\|} \right\}$$
(2)

In general, a closed-form solution for the optimal placement of the mi is not available, so iterative approximation schemes are used.
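One common iterative scheme is an online (competitive) update that moves the winning codebook vector a small step toward each presented sample. The sketch below is a minimal illustration for the Euclidean case (r = 2); the function name and parameters are ours, not from any cited work.

```python
import numpy as np

def online_vq(X, k=4, lr=0.05, epochs=50, seed=0):
    """Iteratively approximate codebook vectors m_i by moving the winner
    toward each presented sample (online / competitive VQ update)."""
    rng = np.random.default_rng(seed)
    # Initialize the codebook with k samples drawn from the data
    m = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            c = np.argmin(np.linalg.norm(m - x, axis=1))  # winner index c(x)
            m[c] += lr * (x - m[c])                       # move winner toward x
    return m

# Toy data: two Gaussian blobs; the codebook should settle near their centers
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(3, 0.3, (100, 2))])
print(online_vq(X, k=2))
```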

1.2 Condensed Methods

Generic data condensation methods are based on density-oriented techniques; they work with the density function rather than minimizing the quantization error, so that for a specific input set a condensed output set is established [6].

Other approaches, such as data squashing or data clustering, are also used for sample selection. A squashing method seeks to compress the data in such a way that a statistical analysis performed on the compressed data yields the same result as on the original data. Clustering-based algorithms [7, 8] divide the data into groups of samples that are similar to each other and different from the samples belonging to other groups [1].

Figure 3 illustrates clustering with respect to a distance function, where the quality of a cluster can be measured by its diameter, that is, the maximum distance between two samples belonging to the same group.
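The diameter criterion can be computed directly; the short sketch below (our own helper, not a standard library routine) measures a cluster's diameter as the maximum pairwise Euclidean distance among its members.

```python
import numpy as np

def cluster_diameter(points):
    """Maximum Euclidean distance between any two samples of the same cluster."""
    points = np.asarray(points, dtype=float)
    # Pairwise difference tensor via broadcasting, then pairwise norms
    diffs = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diffs, axis=-1).max()

print(cluster_diameter([[0, 0], [1, 0], [0.5, 2.0]]))  # -> approx. 2.06
```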

Fig. 3 Three clusters obtained from a set of two-dimensional data

1.3 Machine Learning and Feature Selection

In machine learning, feature selection is the process of selecting characteristics, attributes or subsets of variables for use in model building. In [2], two feature selection strategies are mentioned: one based on feature ranking and the other based on best subset selection. Methods based on feature ranking rely on statistical metrics; the simplest use the correlation coefficient, while more elaborate ones use measures such as the Gini index, which can quantify inequalities in variable distributions. Other feature ranking methods mentioned in the literature [9] are the bivariate and multivariate methods; these compare the actual joint distribution of the characteristics of two or more variables with the joint distribution that would be obtained if those variables were independent, where the joint distribution represents the probability distribution of the observed cases. Among the multivariate methods is stepwise linear regression [10, 11], which has been used in clustering tasks and sample selection [12]; other slightly more complex approaches rely on machine learning and advanced statistics, for example partial least squares regression [13] and sensitivity analysis [14], as well as the performance analysis of virtual clusters [15] and wireless network architectures [16].
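To make the feature-ranking strategy concrete, the following sketch ranks features by the absolute Pearson correlation between each feature and the target. This is only a minimal illustration of the simplest ranking criterion mentioned above, with our own function name, not a reimplementation of any cited method.

```python
import numpy as np

def rank_features_by_correlation(X, y):
    """Rank features by |Pearson correlation| with the target (descending)."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]
    return order, scores[order]

# Toy data: feature 0 is informative, feature 1 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)
print(rank_features_by_correlation(X, y))  # feature 0 should rank first
```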

The second strategy, based on subset selection, focuses on selecting a subset of characteristics or attributes that have a significant effect on the prediction of a variable. Classic methods of data reduction and sample selection [17] stress its importance, since analyzing large amounts of data for each sample is time consuming and may lead to overfitting of the trained model.

All the approaches outlined so far show, in a very simple way, the importance of selecting a suitable sample in order to reduce computational cost and time, among other aspects. In what follows, the various efforts made to obtain results with the CNN method [18], together with the prototype approach that supports machine learning [19], are treated more rigorously.

The rest of this paper is structured as follows: Sect. 2 presents the theoretical background and an overview of the main problem addressed by the CNN method. Section 3 takes a more practical view, introducing the use of metrics in unsupervised learning and their relationship with CNN. Finally, the conclusions are presented in Sect. 4.

2 Theoretical Background and Overview

In practical problems, one of the most important tasks is the elimination of noise, redundancies and useless instances, and therefore the selection of prototypes, which constitutes the first step of any practical application.

2.1 Problem Definition

The goal is to isolate the smallest set of instances that predicts the class with the same or higher accuracy than the original set [20]:

Lemma 2.1.1.

Let \(X_{p}\) be an instance, \(X_{p} = \left( {X_{p1} , X_{p2} ,..., X_{pm} , X_{pc} } \right)\), where \(X_{p}\) belongs to the class c given by \(X_{pc}\), \(X_{pi}\) is the value of the ith feature of the \(p_{{{\text{th}}}}\) sample and \(X_{p} \in R^{m}\). A training set TR with N instances \(X_{p}\) and a validation set TS with t instances \(X_{p}\) are assumed. S ⊂ TR is the subset of selected samples that results from applying an instance selection algorithm.

Summarizing Lemma 2.1.1., the objective of an instance selection method is to obtain a subset S ⊂ TR such that S does not contain unnecessary instances and [21]:

$${\text{Acc}}\left( S \right) \cong {\text{Acc}}\left( X \right)$$
(3)

where Acc(X) is the classification accuracy obtained with the training set X.
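The criterion in Eq. (3) can be checked empirically by building the same classifier on the full training set and on the selected subset and comparing their accuracies on a held-out set. The sketch below assumes a 1-NN classifier; the random 30% subset is only a stand-in for the output of a real instance selection algorithm, and all names are ours.

```python
import numpy as np

def one_nn_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of the 1-NN rule built on (X_train, y_train) over a test set."""
    correct = 0
    for x, y_true in zip(X_test, y_test):
        nearest = np.argmin(np.linalg.norm(X_train - x, axis=1))
        correct += int(y_train[nearest] == y_true)
    return correct / len(y_test)

def compare_accuracies(X_tr, y_tr, X_ts, y_ts, subset_idx):
    """Compare Acc(S) against Acc(TR) as in Eq. (3), for a given subset S."""
    acc_full = one_nn_accuracy(X_tr, y_tr, X_ts, y_ts)
    acc_sub = one_nn_accuracy(X_tr[subset_idx], y_tr[subset_idx], X_ts, y_ts)
    return acc_full, acc_sub

# Toy example: a random 30% subset standing in for an instance selection result
rng = np.random.default_rng(0)
X_tr = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
y_tr = np.array([0] * 50 + [1] * 50)
X_ts = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y_ts = np.array([0] * 20 + [1] * 20)
idx = rng.choice(100, size=30, replace=False)
print(compare_accuracies(X_tr, y_tr, X_ts, y_ts, idx))
```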

2.2 Prototype-Based Approach on Unsupervised Learning

Models based on prototypes offer several appealing concepts, such as the explicit representation of observations or data by typical representatives, which bears some relation to psychology and neuroscience.

In Sect. 1, the relationship between condensation methods and vector quantization was approached in a very simple way; this subsection discusses how prototype selection matches the instance selection approach from a competitive perspective in unsupervised learning [18].

The mathematical statement of vector quantization is formulated in terms of a cost function that guides the computation of the prototype vectors. A prototype-based representation [22] of a given set of P feature vectors is defined in Lemma 2.2.1.

Lemma 2.2.1.

Consider the representation of a set of P feature vectors \(\left\{ {x^{\mu } \in {\mathbb{R}}^{n} } \right\},\;\mu = 1,2,...,P\), which represent particular input values.

A popular approach considers the assignment of any data point to the closest prototype, the so-called winner in the set \(W = \left\{ {w^{1} ,w^{2} , \ldots ,w^{K} } \right\}\) in terms of a predefined distance measure.

Using the Euclidean metric in feature space with:

$$d^{2} \left( {x,y} \right) = \left( {x - y} \right)^{2} \;{\text{for}}\;x,y \in {\mathbb{R}}^{N}$$
(4)

Having the quantization error [3] as the corresponding cost function:

$$H_{{{\text{VQ}}}} = \mathop \sum \limits_{i = 1}^{P} \frac{1}{2}d^{2} \left( {w^{*} \left( {x^{\mu } } \right),x^{\mu } } \right)$$
(5)

where \(w^{*} \left( {x^{\mu } } \right) \in {\mathbb{R}}^{N}\) denotes the prototype closest to \(x^{\mu } \in {\mathbb{R}}^{N}\) with respect to the Euclidean metric:

$$d\left( {w^{*} \left( {x^{\mu } } \right),x^{\mu } } \right) \le d\left( {w^{j} ,x^{\mu } } \right)\;{\text{for}}\;{\text{all}}\;j = 1,2, \ldots ,K$$
(6)

The quantization error quantifies the fidelity with which the set of prototypes represents the data.
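Equations (4)–(6) translate directly into code: assign each feature vector to its winner and accumulate half the squared Euclidean distance. The sketch below is a minimal illustration with our own function name.

```python
import numpy as np

def quantization_error(X, W):
    """H_VQ of Eq. (5): sum over samples of 0.5 * d^2(winner(x), x)."""
    X, W = np.asarray(X, float), np.asarray(W, float)
    # Squared Euclidean distances of every sample to every prototype
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)
    # Winner-takes-all assignment (Eq. (6)) and accumulation (Eq. (5))
    return 0.5 * d2.min(axis=1).sum()

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
W = np.array([[0.5, 0.0], [5.0, 5.0]])
print(quantization_error(X, W))  # 0.5 * (0.25 + 0.25 + 0.0) = 0.25
```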

2.3 The Condensed Nearest Neighbor Rule (CNN Rule)

An in-depth study of the pillars that support the CNN method is given in [23]; its main elements are specified below.

Let \(\left( {X_{1}^{^{\prime}} ,Y_{1}^{^{\prime}} } \right) \ldots \left( {X_{m}^{^{\prime}} ,Y_{m}^{^{\prime}} } \right)\) be a sequence that depends in some way on the data \(D_{n}\), and let \(g_{n}\) be the 1-nearest neighbor rule built with \(\left( {X_{1}^{^{\prime}} ,Y_{1}^{^{\prime}} } \right) \ldots \left( {X_{m}^{^{\prime}} ,Y_{m}^{^{\prime}} } \right)\), where m is fixed in advance. One way to choose this sequence is to find the subset of size m of the data for which the error committed by the 1-NN rule on the remaining n − m data points is minimal (this is known as Hart’s rule).

If:

$$\hat{L}_{n} = \left( \frac{1}{n} \right)\mathop \sum \limits_{i = 1}^{n} I_{{\{ g_{n} \left( {X_{i} } \right) \ne Y_{i} \} }}$$
(7)

And:

$$L_{n} = P\left\{ {g_{n} \left( X \right) \ne Y|D_{n} } \right\}$$
(8)

Then, we have the following:

Lemma 2.3.1.

∀ ε > 0,

$$P\left\{ {|L_{n} - \hat{L}_{n} {|} \ge \varepsilon } \right\} \le 8{\text{e}}^{{ - \frac{{n\varepsilon^{2} }}{32}}} \left( {\frac{ne}{{d + 1}}} \right)^{{\left( {d + 1} \right)m\left( {m - 1} \right)}}$$
(9)

where \(\hat{L}_{n}\) is the empirical estimate of the error probability \(L_{n}\).

Observe that:

$$\hat{L}_{n} = \left( \frac{1}{n} \right)\mathop \sum \limits_{j = 1}^{n} I_{{\left\{ {\left( {X_{j} ,Y_{j} } \right) \notin \bigcup\limits_{i = 1}^{m} {B_{i} \times \left\{ {Y_{i}^{^{\prime}} } \right\}} } \right\}}}$$
(10)

where \(B_{i}\) is the Voronoi cell of \(X_{i}^{^{\prime}}\) with respect to \(X_{1}^{^{\prime}} \ldots X_{m}^{^{\prime}}\), that is, \(B_{i} \subset\) \(R^{d}\) is the set of points closer to \(X_{i}^{^{\prime}}\) than to any other \(X_{j}^{^{\prime}}\). Similarly,

$$L_{n} = P\left\{ {\left( {X,Y} \right) \notin \bigcup\limits_{i = 1}^{m} {B_{i} \times \left\{ {Y_{i}^{^{\prime}} } \right\}|D_{n} } } \right\}$$
(11)

Using the simple upper bound:

$$|L_{n} - \hat{L}_{n} | \le \underbrace {{{\text{Sup}}}}_{{A \in A_{m} }}\left| {v_{n} \left( A \right) - v\left( A \right)} \right|$$
(12)

where \(v\) denotes the distribution of \(\left( {X,Y} \right)\), \(v_{n}\) is the corresponding empirical measure and \(A_{m}\) refers to the class of all subsets of \(R^{d} \times \left\{ {0,1} \right\}\) of the form \(\bigcup\nolimits_{i = 1}^{m} {B_{i} \times \left\{ {y_{i} } \right\}}\), where \(B_{1} , \ldots ,B_{m}\) are the Voronoi cells corresponding to \(x_{1} , \ldots ,x_{m} ,{ }x_{i} \in R^{d} ,{ }y_{i} \in \left\{ {0,1} \right\}\).

Using the Vapnik–Chervonenkis inequality [24]:

$$s\left( {A_{m} ,n} \right) \le s\left( {A,n} \right)^{m}$$
(13)

where A is the class of sets of the form \(B_{1} \times \left\{ {y_{1} } \right\}\), each of which is the intersection of at most \(m - 1\) halfspaces. Then:

$$s\left( {A,n} \right) \le \underbrace {{{\text{Sup}}}}_{{n_{0} ,n_{1} :n_{0} + n_{1} = n}}\left( {\mathop \prod \limits_{j = 0}^{1} \left( {\frac{{n_{j} e}}{d + 1}} \right)} \right)^{{\left( {d + 1} \right)\left( {m - 1} \right)}} \le \left( {\frac{ne}{{d + 1}}} \right)^{{\left( {d + 1} \right)\left( {m - 1} \right)}}$$
(14)

where \(n_{j}\) denotes the number of points in \(R^{d} \times \left\{ j \right\}\), and the result follows from the Vapnik–Chervonenkis inequality.
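For intuition on how the bound of Eq. (9) behaves, it can be evaluated numerically. The sketch below simply plugs sample values of n, m, d and ε into the logarithm of the right-hand side (computed in log space to avoid overflow); the parameter values are purely illustrative and are not taken from the cited work.

```python
import numpy as np

def log_hart_rule_bound(n, m, d, eps):
    """Natural log of the RHS of Eq. (9): log 8 - n*eps^2/32 + (d+1)m(m-1) log(ne/(d+1))."""
    return np.log(8) - n * eps**2 / 32 + (d + 1) * m * (m - 1) * np.log(n * np.e / (d + 1))

# The bound only becomes nontrivial (log < 0) once n is very large relative to m and d
for n in (1e4, 1e6, 1e8):
    print(int(n), log_hart_rule_bound(n, m=5, d=2, eps=0.1))
```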

Other condensation rules based on CNN have also been presented in [25, 26].

A basic version of the CNN algorithm [27, 28] can be stated as follows (a runnable sketch is given after the pseudocode):

Algorithm 1

1. \({\varvec{T}} \leftarrow \left\{ {{\varvec{x}}_{1} } \right\}\) (initialize with a single sample of X)

2. Do

3. \(\forall {\varvec{x}} \in {\varvec{X}}\left( {{\varvec{in}} {\varvec{random}} {\varvec{order}}} \right)\)

4. Find  \({\varvec{x}}^{\prime} \in {\varvec{T}}\) such that  

\(\left\| {{\varvec{x}} - {\varvec{x}}^{\prime}} \right\| = \underbrace {{{\varvec{min}}}}_{{{\varvec{x}}^{{\varvec{w}}} \in {\varvec{T}}}}\left\| {{\varvec{x}} - {\varvec{x}}^{{\varvec{w}}} } \right\|\)

5. If  \({\varvec{Class}}\left( {\varvec{x}} \right) \ne {\varvec{Class}}\left( {{\varvec{x}}^{\prime} } \right)\) insert x into T

6. While T changes (repeat until no sample is added)
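A runnable Python version of Algorithm 1 is sketched below. It is our own Hart-style implementation, not taken from any particular library: it seeds the condensed set with one sample and repeatedly adds every sample that the current 1-NN rule misclassifies, until a full pass produces no change. The `metric` parameter is an assumption added for later reuse with other distances.

```python
import numpy as np

def cnn_reduce(X, y, metric=None, seed=0):
    """Condensed nearest neighbor reduction (Hart-style sketch).
    Returns the indices of the retained (condensed) samples."""
    if metric is None:
        metric = lambda a, b: np.linalg.norm(a - b)   # default: Euclidean distance
    rng = np.random.default_rng(seed)
    store = [rng.integers(len(X))]                    # seed with one sample
    changed = True
    while changed:
        changed = False
        for i in rng.permutation(len(X)):             # random presentation order
            dists = [metric(X[i], X[j]) for j in store]
            winner = store[int(np.argmin(dists))]
            if y[winner] != y[i]:                     # misclassified -> keep it
                store.append(i)
                changed = True
    return np.array(sorted(set(store)))

# Toy example: two well-separated classes reduce to very few retained samples
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.4, (60, 2)), rng.normal(4, 0.4, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
S = cnn_reduce(X, y)
print(len(S), "of", len(X), "samples retained")
```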

Several investigations have been carried out to interpret, extend and enhance the traditional CNN algorithm [29, 30]. In Table 1, some novel variants of implementation and application of the CNN method are shown.

Table 1 New approaches based on traditional CNN methods

3 Results and Discussion

This brief review of the instance selection process has shown the high potential of sample selection techniques, whose application is valid across many areas of the modern world. The prototype-based approach provided by machine learning contributes to many investigations aimed at reducing the computational cost of processing and classifying huge amounts of data. Focusing on the condensed nearest neighbor (CNN) algorithm [31], it represents a conceptual and theoretical element that forms the basis of more evolved models.

CNN algorithms use the 1-nearest neighbor rule to iteratively decide whether a sample should be removed or retained [4].

3.1 Metric Considerations and Visual Scheme for the CNN

Many unsupervised algorithms learn distance metrics using information from the data itself or from the space in which the data are represented. In instance selection, the measurement of the distance between instances, that is, the metric used, is of crucial importance.

To formalize, let the vectors x and y represent the attributes of two instances x and y (class labels are excluded).

A widely used metric is the Minkowski metric, which is defined as:

$$d = \sqrt[p]{{\mathop \sum \limits_{j = 1}^{m} d_{j}^{p} }}$$
(15)

where, for continuous attributes, \(d_{j}\) is defined as \(d_{j} = \left| {x_{j} - y_{j} } \right|\).

For some values of p, the Minkowski distance corresponds to a special metric as reflected in Table 2.

Table 2 Minkowski metrics for different p values
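The special cases usually associated with particular values of p (Manhattan for p = 1, Euclidean for p = 2, and Chebyshev in the limit p → ∞) can be reproduced with the short sketch below, which evaluates Eq. (15) directly; the function name is ours.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of Eq. (15) for continuous attributes."""
    d = np.abs(np.asarray(x, float) - np.asarray(y, float))   # d_j = |x_j - y_j|
    return (d ** p).sum() ** (1.0 / p)

x, y = [0.0, 0.0, 0.0], [1.0, 2.0, 2.0]
print(minkowski(x, y, 1))    # Manhattan: 5.0
print(minkowski(x, y, 2))    # Euclidean: 3.0
print(minkowski(x, y, 50))   # large p approaches Chebyshev: max|x_j - y_j| = 2
```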

There are other important metrics, such as the Mahalanobis distance, which is often used to locate multivariate outliers, that is, unusual combinations of values across two or more variables.

A simple definition of this distance [10] is given by:

$$d\left( {{\text{Mahalanobis}}} \right) = \left[ {\left( {x_{B} - x_{A} } \right)^{{\text{T}}} C^{ - 1} \left( {x_{B} - x_{A} } \right)} \right]^{0.5}$$
(19)

where:

\(x_{A}\) and \(x_{B}\) are a pair of objects and C is the sample covariance matrix.
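Equation (19) can be evaluated directly from a data sample by estimating the covariance matrix C from the data and applying the formula; the sketch below is our own helper, for illustration only.

```python
import numpy as np

def mahalanobis(x_a, x_b, C):
    """Mahalanobis distance of Eq. (19) between two objects given covariance C."""
    diff = np.asarray(x_b, float) - np.asarray(x_a, float)
    return float(np.sqrt(diff @ np.linalg.inv(C) @ diff))

# Toy data with correlated features; C is the sample covariance matrix
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)
C = np.cov(data, rowvar=False)
print(mahalanobis(data[0], data[1], C))
```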

Figure 4 shows some examples of sample selection using the Euclidean and Mahalanobis distances with the CNN algorithm, comparing several values of the number of neighbors.

Figure 4 illustrates the importance of the choice and use of metrics at the time of clustering. As the classic instance selection methods indicate, obtaining a sample that is sufficiently representative of a large population is a difficult job. In this example, the red, green and blue points are selected and their color is reflected in a given area according to the Euclidean and Mahalanobis metrics, either combined with the CNN algorithm (right-hand panels in Fig. 4) or using the metrics alone (left-hand panels in Fig. 4). As can be seen, combining the CNN algorithm with either of the two metrics achieves a clearer and more precise reduction of the instances.

Fig. 4 Sample selection considering the Euclidean and Mahalanobis distance

4 Conclusion

The history of instance selection algorithms can be traced back to the CNN algorithm (condensed nearest neighbor rule), introduced by Hart in 1968. In its simplest form, the algorithm leaves in S a subset of T such that each element of T is closer to an element of S of the same class than to an element of S of a different class. From this idea, numerous variants with an elegant mathematical profile have been formulated, allowing, given their simplicity, the reduction of computational costs in various modern problems.

Finally, the aim of this work has been to present some theoretical elements on the importance of the sample selection process and the condensed nearest neighbor method, gathering the efforts of several authors who have tried to formalize complex aspects of the real world in order to provide solutions to today’s problems.