1 Introduction

Machine learning is a field of study that aims to give machines the ability to learn so that they can discover patterns in real-world problems. Supervised and unsupervised learning are two types of machine learning that are commonly used across many problems. Consider a training set \(T = \left\{ {\left( {x_{1} ,y_{1} } \right),\; \ldots ,\;\left( {x_{m} ,y_{m} } \right)} \right\} \in {\mathbb{R}}^{m \times d}\), which contains \(m\) instances described by \(d\) features and \(c\) classes, with input points \(X = \left\{ {x_{i} } \right\}_{i = 1}^{m} \Rightarrow x_{i} = \left( {x_{i}^{\left( 1 \right)} ,x_{i}^{\left( 2 \right)} ,\; \ldots ,\;x_{i}^{\left( d \right)} } \right) \in {\mathbb{R}}^{d} , \;i = 1,\; \ldots ,\;m\) and labels \(y_{j} \in Y, j = 1,\; \ldots ,\;c\). A supervised learning algorithm seeks a function \(h:X \to Y\) that is an element of the hypothesis space \(H\). As for unsupervised learning, let \(D\) be an unknown distribution; unsupervised learning then aims to learn \(X\sim D\) with respect to \(h \in H\). When supervised and unsupervised learning tasks are applied, batch learning is preferred in industry because it is faster and easier to implement than online learning, which consumes a real-time stream of data. Both learning settings, however, suffer from big data. Huge datasets are widespread in many areas, including data mining, text categorization, financial forecasting, multimedia databases, genome sequences, and meteorological, financial, industrial, and scientific repositories. Undoubtedly, the amount of data is growing quickly in all domains. Hence, constructing a model becomes more costly and requires more computing clusters. Furthermore, the outliers and noise introduced into the data set while collecting more data tend to degrade performance further. To address these problems, data reduction methods are often employed: outliers and noise are discarded, and the data set is refined. Data reduction can be conducted by feature selection or instance selection algorithms, or both. Some strategies used in feature selection are low variance filters, principal component analysis, high correlation filters, forward feature construction, and backward feature elimination methods [1]. In this paper, we address instance selection for data reduction in supervised and unsupervised learning problems. In this regard, the objective of instance selection is to find an optimal subset of the input set, i.e., \(S \subset X\). In this respect, instance selection (i.e., prototype selection) algorithms are effective for building models rapidly.

Instance selection is a process that discards noisy or redundant data [2]. In other words, the objective of instance selection algorithms is to remove useless data from the training set so that the classification accuracies obtained on the selected subset and on the original data set are close to or almost the same as each other. Instance selection is useful for decreasing the training and test time of lazy learners (e.g., k-Nearest Neighbors [KNN]) and of computationally costly algorithms such as Support Vector Machines (SVM) and neural networks. In addition, instance reduction algorithms have been presented to tackle problems in fields such as time series, class-imbalanced data sets, distributed learning, noise sensitivity, monotonic data sets, and lazy learners. In the literature, the approaches used to develop instance selection algorithms include the nearest neighbor rule, evolutionary computation, meta-learning, artificial immune models, divide-and-conquer strategies, Bayesian approaches, cluster-based approaches, hashing, geometry-based approaches, instance ranking, density-based approaches, the edge detection approach from image processing, metaheuristic algorithms, swarm intelligence algorithms, and graph-based approaches. Instance selection algorithms share a few common properties: evaluation of search, type of selection, and direction of search [2,3,4]. Besides, there are four criteria for contrasting instance selection algorithms: storage requirement, noise resistance, classification accuracy, and running time [3].

All these approaches have a variety of advantages and disadvantages. Moreover, we believe that instance selection methods will remain imperfect due to the accuracy-reduction dilemma, and that no single approach can outperform all other approaches on all data sets. In other words, because of its assumptions and characteristics, each method has data sets on which it succeeds and data sets on which it fails. Despite this, we can compare instance selection algorithms in terms of criteria such as reduction rate, classification accuracy, and running time. In this paper, we focus on faster instance selection and a simple accuracy rate-reduction rate trade-off controlled by a single parameter. This is because instance selection algorithms suffer from big data just like classification, regression, and clustering algorithms do. Moreover, adjusting the accuracy rate-reduction rate trade-off is a serious problem in instance selection algorithms that have many meta-parameters, and this complicates the management of the process. To implement and deploy instance selection algorithms efficiently in terms of software/hardware, it also matters that the algorithm is comprehensible and easy to apply. In this study, to devise an algorithm that meets the aforementioned criteria, conjectural hyper-rectangles are employed. A hyper-rectangle, whose edges are all mutually perpendicular in Euclidean \(n\)-space \(E^{n}\), is the generalization of a rectangle to higher dimensions. We call the pseudo-hyper-rectangles that we define conjectural hyper-rectangles because, unlike real hyper-rectangles, they are never explicitly constructed to contain points in \(E^{n}\). Hyper-rectangles are used to perform various tasks in machine learning (e.g., classification, clustering, instance selection, etc.). Solving a problem by creating hyper-rectangles in \(E^{n}\) is more efficient than using other geometric structures, and hyper-rectangles are easier to form; therefore, they are preferable to other structures. However, calculating the proper positions of real hyper-rectangles takes a long time on data sets with huge volumes. In particular, we note that hyper-rectangles can intersect with each other, in which case many unnecessary hyper-rectangles are created. In this respect, we prefer to exploit conjectural hyper-rectangles, both to avoid describing unnecessary hyper-rectangles and to speed up the algorithm by not creating real hyper-rectangles. Finally, we develop an instance selection algorithm that can be used in both supervised and unsupervised learning problems.

In this paper, we propose a fast unsupervised instance selection algorithm based on conjectural hyper-rectangles. The time complexity of our proposed method is log-linear. Besides, the proposed algorithm has obtained remarkable results on the data sets used in the experiments. The main contributions of the proposed method are as follows:

  • The proposed unsupervised algorithm is straightforward and rapidly processes huge data sets.

  • The accuracy-reduction trade-off can be easily adjusted by the scaling rate, the single parameter of the proposed algorithm.

The rest of the paper is organized as follows. Section 2 reviews the related literature. Section 3 introduces our algorithm. Section 4 describes the experimental setup. Section 5 presents the results and discussion in detail. Finally, Section 6 concludes the paper.

2 Background

Looking back in time, one of the first algorithms developed to reduce the training set is the Condensed Nearest Neighbor (CNN) rule [5]. CNN is an iterative algorithm that starts with an empty subset. It then randomly picks an instance from the training data and adds it to the subset if it is misclassified when the subset is used as training data; the process stops when no more instances can be added. CNN does not guarantee finding the optimum subset. Further, it generates different subsets on every run due to random instance selection [6]. The Reduced Nearest Neighbor (RNN) rule has been introduced as a modification of CNN to obtain smaller subsets [7]. Ullmann developed an algorithm that obtains smoother boundaries by improving CNN and RNN [8]. The Selective Nearest Neighbor (SNN) rule has been presented as an extension of CNN. SNN guarantees that the nearest neighbor of each instance of the original data set is a member of the subset and that the subset is minimal; however, SNN is a difficult and complex algorithm to implement [9]. Tomek proposed two methods to improve CNN and claimed that they ensure a minimal subset and a boundary close to the decision boundary [10]. Gowda and Krishna proposed another improvement of CNN, which prevents interior instances from being kept in the condensed set and instead tries to retain instances near the decision boundary [11]. The IB2 and IB3 methods have been introduced to discard superfluous and noisy instances from the training set [12]. The Fast Condensed Nearest Neighbor (FCNN) family contains a set of algorithms that run fast on big data; their worst-case time complexity is quadratic [13]. One of the first methods focused on removing noisy instances from a training set is the Edited Nearest Neighbor (ENN) rule [14]. The Edition Method Based on Local Sets (ELS) has been proposed as a parameter-free algorithm based on local sets with natural neighbors, which obtains more acceptable boundaries and filters noisy instances efficiently [15].

The algorithms in the literature can be categorized into several approaches according to the techniques applied. Various hybrid methods that combine condensing and editing approaches have been proposed [16,17,18]. As for ensemble approaches, a variety of ensemble instance selection methods have been proposed [6, 19, 20]. With respect to evolutionary approaches, many methods have been proposed [21,22,23]. Regarding distributed computation, the MapReduce solution for Prototype Reduction (MRPR) has been introduced as an instance selection algorithm that accelerates the classification stage and decreases the storage needs and the noise sensitivity of the nearest neighbor rule [24]. In the matter of hashing, several approaches have been proposed [25,26,27]. To speed up SVM, Shell Extraction (SE) has been proposed [28]; in addition, two instance selection methods, one based on a nature-inspired metaheuristic algorithm and the other on a normal individual-based swarm intelligence algorithm, have been proposed [29]. Relating to scoring the instances, instance selection algorithms based on the ranking approach have been proposed [30, 31]. Concerning solution approaches for different data set types (e.g., class-imbalanced data sets), a variety of methods have been proposed [32, 33]. Finally, on the point of geometrical approaches, many methods have been proposed [34,35,36,37,38].

Judging from recent works, it is seen that serious efforts have been made to solve important problems. The major difficulties in GA-based instance selection are high computational complexity and performance that degrades as the data set grows. To tackle these problems with a three-step strategy, fuzzy clustering decomposition based on a genetic algorithm has been proposed for regression problems [39]. A generic cluster-oriented instance selection algorithm has been proposed for classification problems, which runs the unsupervised K-Means clustering algorithm on the training set and, given a selection rate, chooses instances from the centers and the boundaries of the clusters [1]. Intrusion detection data sets consist of huge amounts of data, which affects the learning of the classifier; hence, there is a need to decrease the size of these data sets. Therefore, a new fast instance selection method has been proposed to provide better efficiency during the training stage without greatly impacting the effectiveness of the intrusion detection scheme [40]. The preservation rough set model, based on rough sets, has been proposed to deal with the instance selection problem and enhance lazy learners on hybrid and incomplete data sets [41]. Finally, a method inspired by the cross-validation and divide-and-conquer strategies, which uses genetic algorithms and an open-source framework to choose an optimal instance subset from huge data and relies on combined information entropy, has been proposed [42].

3 The proposed algorithm

The proposed algorithm discards instances by exploiting hyper-rectangles. However, since detecting hyper-rectangles directly can slow down the instance selection process, we do not try to find hyper-rectangles explicitly. Instead, we utilize the concept of data scaling from statistics. Thus, we avoid the cost of calculating the positions of real hyper-rectangles and form hyper-rectangles only indirectly. We call the hyper-rectangles created by this approach conjectural hyper-rectangles. Figure 1 shows the positions of the conjectural hyper-rectangles on the original seeds data set. Accordingly, thirty conjectural hyper-rectangles have been constituted by the proposed method on the original data set. These hyper-rectangles are rectangles in two-dimensional space. The proposed method keeps only a single instance in each rectangular region and discards the rest of the instances in that region. In higher-dimensional spaces these regions are hyper-rectangular, and the proposed method treats them in the same way. The proposed method relies on data scaling to calculate conjectural hyper-rectangles and, in this respect, assumes that the data are normally distributed. Considering one-dimensional space, when we scale the data by their standard deviation and then round the scaled values by a certain rule, some values fall into the same range. These ranges can be regarded as unevenly spaced points on a number line in one-dimensional space. In two-dimensional space, each number line corresponds to an axis, such as the x-axis and y-axis, and these axes constitute a Cartesian plane. The intersections of the points on each axis produce a grid composed of rectangles. In high-dimensional spaces, there is a hyper-grid consisting of hyper-rectangles. The data points fall into these hyper-rectangles, and the goal of the proposed method is to retain only the first data point in each hyper-rectangle and discard the remaining data points in that hyper-rectangle. The proposed method is simply constructed in this way.

Fig. 1

The positions of the conjectural hyper-rectangles on the original seeds data set

Recall the training set \(T = \left\{ {\left( {x_{1} ,y_{1} } \right),\; \ldots ,\;\left( {x_{m} ,y_{m} } \right)} \right\} \in {\mathbb{R}}^{m \times d}\), which contains \(m\) instances described by \(d\) features and \(c\) classes, with training points \(X = \left\{ {x_{i} } \right\}_{i = 1}^{m} \Rightarrow x_{i} = \left( {x_{i}^{\left( 1 \right)} ,\;x_{i}^{\left( 2 \right)} ,\; \ldots ,\;x_{i}^{\left( d \right)} } \right) \in {\mathbb{R}}^{d} , \;i = 1,\; \ldots ,\;m\) and labels \(y_{j} \in Y, \;j = 1,\; \ldots ,\;c\). The range of a value \(x_{i}^{\left( k \right)}\) in any dimension \(k = 1,\; \ldots ,\;d\) is calculated as in Eq. (1). Thus, we determine into which range the data within each dimension (or feature) fall; in other words, we scale all the values. Eventually, the real positions of the conjectural hyper-rectangles are implicitly determined through \(pos^{\left( k \right)} = x^{\prime}\sigma_{{x^{\left( k \right)} }} + \min \left( {x^{\left( k \right)} } \right)\) from Eq. (1). This is useful for viewing where the conjectural hyper-rectangles are located on the original data set.

$$x^{\prime} = \left[ {\frac{{x_{i}^{\left( k \right)} - \min \left( {x^{\left( k \right)} } \right)}}{{\sigma_{{x^{\left( k \right)} }} }}} \right]$$
(1)

The proposed method assumes that the data are normally distributed. In this regard, as values in a normal distribution move away from the mean by a certain deviation, the probabilities that they lie within the related range also change; hence, the hyper-rectangles are not all the same size. Accordingly, the maximum and minimum values of \(x^{\prime}\) in Eq. (1) are six and zero, respectively, according to the three-sigma rule of thumb, for example. In this case, each data cell can take seven possible values; in other words, it has seven possible states. We note, however, that the data distribution can be heavy-tailed or light-tailed relative to a normal distribution, or asymmetric. We add a scaling rate \(\alpha\) to Eq. (1), as shown in Eq. (2), to adjust the size of the hyper-rectangles. As the value of the scaling rate decreases, the size of the hyper-rectangles increases and, accordingly, the number of hyper-rectangles decreases, and vice versa. The default value of the scaling rate is 1.

$$x^{\prime} = \left[ {\alpha \left( {\frac{{x_{i}^{\left( k \right)} - \min \left( {x^{\left( k \right)} } \right)}}{{\sigma_{{x^{\left( k \right)} }} }}} \right)} \right]$$
(2)
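To make Eq. (2) concrete, the following short Python sketch (our own illustration, not the authors' MATLAB code; the bracket in Eq. (2) is read here as rounding to the nearest integer) scales a single synthetic feature and reports how the number of occupied cells, i.e., one-dimensional conjectural hyper-rectangles, changes with the scaling rate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # one synthetic feature

def cell_index(x, alpha=1.0):
    """Eq. (2): shift by the minimum, scale by the standard deviation,
    multiply by the scaling rate, and round to the nearest integer."""
    return np.round(alpha * (x - x.min()) / x.std()).astype(int)

for alpha in (0.5, 1.0, 2.0):
    idx = cell_index(x, alpha)
    print(f"alpha={alpha}: {np.unique(idx).size} occupied cells, "
          f"indices in [0, {idx.max()}]")
```

As the scaling rate grows, the cells become narrower, so more of them are occupied and fewer instances share a cell.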

The proposed algorithm reduces instances depending on the scaling rate. We call the proposed instance reduction algorithm Nimble Instance Selection (NIS) and describe it in Algorithm 1.

Algorithm 1 Nimble Instance Selection (NIS)

Now, we calculate the time and space complexities of the algorithm. NIS measures the rounded distance of the instances from the minimum value in terms of the standard deviation and accordingly removes some repetitive instances. The upper bound time and space complexities of scoring the instances are \(O\;\left( {2md} \right)\) and \(O\;\left( {md} \right)\), respectively. The upper bound time and space complexities of searching unique instances are \(O\;\left( {md\log_{2} md + md} \right)\) and \(O\;\left( {md} \right)\), respectively. In consequence, the total time complexity is \(O\;\left( {md\log_{2} md} \right)\) and the total space complexity is \(O\;\left( {md} \right)\).
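Since Algorithm 1 is given as a figure, the sketch below is our Python reconstruction of NIS from Eqs. (1)-(2) and the description above; it is not the authors' MATLAB implementation, and the rounding rule and the handling of constant features are our assumptions. It scores every instance per feature and keeps the first instance that falls into each occupied conjectural hyper-rectangle.

```python
import numpy as np

def nis(X, alpha=1.0):
    """Nimble Instance Selection sketch: return the indices of the
    instances retained, one per occupied conjectural hyper-rectangle."""
    X = np.asarray(X, dtype=float)
    mins = X.min(axis=0)
    stds = X.std(axis=0)
    stds[stds == 0] = 1.0                                    # guard constant features
    cells = np.round(alpha * (X - mins) / stds).astype(int)  # Eq. (2) per feature
    # implied cell positions for Fig. 1-style plots: cells * stds / alpha + mins
    _, first = np.unique(cells, axis=0, return_index=True)   # first row of each cell
    return np.sort(first)

# toy usage on random data
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
kept = nis(X, alpha=1.0)
print(f"kept {kept.size} of {X.shape[0]} instances "
      f"(reduction rate {1 - kept.size / X.shape[0]:.2%})")
```

Because the selection itself never uses class information, labels of the retained indices can simply be looked up afterwards when the reduced set is fed to a supervised learner.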

Now let us derive the equation that gives the average number of reduced instances under various assumptions. To facilitate the calculations, we assume that all the dimensions have the same characteristics. The probability of \(n\) successful events in a sample of \(m\) instances selected with replacement from the data set is shown in Eq. (3).

$$b\left( {n,\;m,\;p} \right) = \left( {\begin{array}{*{20}c} m \\ n \\ \end{array} } \right)p^{n} q^{{\left( {m - n} \right)}}$$
(3)

where \(p\) is the success probability and \(q\) is the failure probability. The success probability depends on the range spanned by the data (i.e., \(v\sigma - \left( { - v\sigma } \right) = 2v\sigma , \;v \in {\mathbb{R}}\) for the standard normal distribution) and on the number of dimensions. We note, however, that the data distribution can be skewed and kurtotic relative to a normal distribution. Let \(U:\Omega \to \left\{ {0,\; \cdots ,\;u} \right\}, \;u \in {\mathbb{Z}}\) be a discrete random variable taking the range values from Eq. (2), and let \(N\) be the number of possible outcomes of \(U\). Accordingly, the success probability is shown in Eq. (4).

$$p = \left( \frac{1}{N} \right)^{d}$$
(4)

Considering all possible cases of instances that fall into the same cell, we calculate the expected value of the number of selected instances (i.e., the number of reduced instances) by using Eq. (5).

$$E\left[ Z \right] = \mathop \sum \limits_{n = 0}^{m} \left( {\begin{array}{*{20}c} m \\ n \\ \end{array} } \right)p^{n} q^{{\left( {m - n} \right)}} n$$
(5)

where \(Z:\Omega \to \left\{ {1, \cdots ,m} \right\}\) is a discrete random variable denoting the number of reduced instances, and the symbol \(n\) refers to the number of reduced instances. Accordingly, let us calculate the expected value of the number of selected instances by means of Lemma 1.

Lemma 1

Let \(p,\;q \in {\mathbb{R}}_{ \ge 0} = \left\{ {x \in {\mathbb{R}} | \;x \ge 0} \right\}\) and \(m,\;n \in {\mathbb{Z}}_{ \ge 0} = \left\{ {x \in {\mathbb{Z}} | \;x \ge 0} \right\}\). Then \(mp\left( {p + q} \right)^{m - 1} = \sum\nolimits_{n = 0}^{m} {\left( {\begin{array}{*{20}c} m \\ n \\ \end{array} } \right)p^{n} q^{{\left( {m - n} \right)}} n}\).

Since \(E\left[ Z \right] = \sum\nolimits_{n = 0}^{m} {\left( {\begin{array}{*{20}c} m \\ n \\ \end{array} } \right)p^{n} q^{{\left( {m - n} \right)}} n}\) from Eq. (5), it can be rewritten as

$$\sum\limits_{n = 0}^{m} {\frac{m!}{{n!\left( {m - n} \right)!}}\left( n \right)p^{n} q^{{\left( {m - n} \right)}} }$$
(6)

Since the factor \(n\) in the numerator cancels with the \(n!\) in the denominator, \(\left( {n - 1} \right)!\) remains in the denominator; note that the \(n = 0\) term vanishes, so the summation can start at \(n = 1\). In addition, a factor of \(m\) and a factor of \(p\) can be separated from the general term. Thus, let us rewrite Eq. (6) as in Eq. (7).

$$\sum\limits_{n = 1}^{m} {\left( m \right)\frac{{\left( {m - 1} \right)!}}{{\left( {n - 1} \right)!\left( {m - n} \right)!}}p^{{\left( {n - 1} \right)}} q^{{\left( {m - n} \right)}} \left( p \right)}$$
(7)

Due to the constant multiple rule of summation, the constants \(m\) and \(p\) can be written outside the summation. Accordingly, let us rearrange Eq. (7) as below.

$$mp\mathop \sum \limits_{n = 1}^{m} \left( {\begin{array}{*{20}c} {m - 1} \\ {n - 1} \\ \end{array} } \right)p^{{\left( {n - 1} \right)}} q^{{\left( {m - n} \right)}}$$
(8)

Eventually, by invoking the binomial theorem \(\left( {p + q} \right)^{m} = \mathop \sum \nolimits_{n = 0}^{m} \left( {\begin{array}{*{20}c} m \\ n \\ \end{array} } \right)p^{n} q^{{\left( {m - n} \right)}}\) with \(m - 1\) in place of \(m\), we arrive at

$$mp\left( {p + q} \right)^{m - 1} = \mathop \sum \limits_{n = 0}^{m} \left( {\begin{array}{*{20}c} m \\ n \\ \end{array} } \right)p^{n} q^{{\left( {m - n} \right)}} n$$
(9)

We substitute Eq. (4) into Eq. (9) and, since \(p + q = 1\), we reach the expected value of the number of reduced instances in Eq. (10).

$$E\left[ Z \right] = m\left( \frac{1}{N} \right)^{d}$$
(10)

From Eq. (10), it can be concluded that large deviations (i.e., a large \(N\)) and high dimensionality negatively affect the expected number of reduced instances. As a result, the most important factor affecting the performance of the proposed algorithm is the number of dimensions, provided there is no noisy data in the data set. Furthermore, since the proposed method must discard at least one instance, the condition \(\frac{m}{{N^{d} }} \ge 1\), i.e., \(m \ge N^{d}\), must be satisfied. In the \(d > m\) or \(m < N^{d}\) cases, the reduction rate can be increased by adjusting the scaling rate. We note that \(m\) and \(d\) are constants, whereas \(\alpha\) is a variable.
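As a quick numerical check of Lemma 1 and Eq. (10), the snippet below (with arbitrary example values for \(m\), \(N\), and \(d\) chosen by us, not taken from the experiments) evaluates the binomial sum of Eq. (5), compares it with the closed form \(m\left( \frac{1}{N} \right)^{d}\), and tests the \(m \ge N^{d}\) condition.

```python
from math import comb

m, N, d = 500, 7, 2                  # arbitrary example values
p = (1 / N) ** d                     # Eq. (4)
q = 1 - p

ez_sum = sum(comb(m, n) * p**n * q**(m - n) * n for n in range(m + 1))  # Eq. (5)
ez_closed = m * p                    # Eq. (10), using p + q = 1

print(f"binomial sum = {ez_sum:.4f}, closed form = {ez_closed:.4f}")  # both about 10.20
print(f"m >= N^d ? {m >= N**d}")     # 500 >= 49, so the condition above holds
```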

Figure 2 shows the remaining instances and the reduction rates \(r\) after the reduction process on the seeds data set with respect to the scaling rate α. As can be seen from the results, as the scaling rate increases, the reduction rate decreases. The default value of the scaling rate is set to 1.

Fig. 2

The illustration of the remaining instances and reduction rates \({\varvec{r}}\) on the seeds data set, according to the scaling rate α

4 Experimental setup

In this section, we explain the experimental setup, including experimental data sets, instance selection algorithms used in the experiments, evaluation metrics, and implementations.

4.1 Data sets

We compare NIS with the state-of-the-art instance selection algorithms to measure its efficiency by using fifty-five data sets from the UCI database, the MATLAB sample data sets, and the OpenML data sets. The descriptive information belonging to those data sets is shown in Table 1.

Table 1 The characteristics of the data sets used in experiments

4.2 Instance selection algorithms

We compare the proposed algorithm with the state-of-the-art instance selection algorithms. The algorithms used in the experiments are shown in Table 2. All of them are supervised, filtering-based instance selection algorithms. Besides, we run them with their default parameter values in all the experiments.

Table 2 The state-of-the-art instance selection algorithms used in the experiments

4.3 Parameter setting and implementations

In this study, the baseline method refers to applying the 1NN algorithm to the whole data set. Besides, we apply tenfold cross-validation in all the experiments and repeat each experiment ten times to generate training sets with different fold combinations. The experiments have been performed in MATLAB R2021a on an i5-8265U CPU at 1.6 GHz with 8 GB of RAM on Windows 11 Pro (64-bit). In addition, we use the default parameter values of NIS unless otherwise stated.
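The experiments themselves were run in MATLAB; the following scikit-learn snippet is only our Python sketch of an equivalent protocol (ten repetitions of tenfold cross-validation with a 1NN classifier). The nis function from the earlier sketch is assumed, and the stratified folds and the wine data set are stand-ins of our own choosing.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)                      # stand-in data set
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

accs, reds = [], []
for train, test in cv.split(X, y):
    kept = nis(X[train], alpha=1.0)                    # unsupervised selection on the training fold
    clf = KNeighborsClassifier(n_neighbors=1).fit(X[train][kept], y[train][kept])
    accs.append(clf.score(X[test], y[test]))
    reds.append(1 - kept.size / len(train))

print(f"mean accuracy {np.mean(accs):.2%}, mean reduction {np.mean(reds):.2%}")
```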

4.4 Evaluation criteria

We compare the instance selection algorithms using three criteria: reduction rate, classification accuracy, and running time. The lower the reduction rate, the higher the storage requirement. Furthermore, as the reduction rate increases, the smaller the drop in classification accuracy, the better. Finally, the algorithms are expected to process large data sets quickly. As a result, a good algorithm should perform well on all of the above-mentioned criteria.
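The paper does not state the reduction rate formula explicitly; under the common definition, for a subset \(S\) selected from \(m\) training instances, it is

$$r = \frac{{m - \left| S \right|}}{m}$$

so a higher \(r\) means that fewer instances need to be stored by an instance-based learner such as 1NN.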

5 Results and discussion

In this section, we conduct experiments comparing the instance selection algorithms. Figure 3 shows the results of the experiments performed over the data sets in terms of classification accuracy, running time, and reduction rate. According to the results, after the instance reduction process, the highest average classification accuracy belongs to NIStuned with 81.99%. The average scaling rate of NIStuned is 1.48, and the scaling rate has been changed for only 15 out of 55 data sets. The average accuracy rate of 1NN (i.e., the baseline) on the original data sets is 82.33%. The average accuracy rate of NIS on the data sets is 80.24% when the scaling rate is 1. The average accuracy rates of BPLSH, DR.LSH, LSH-IS-S, LSH-IS-F, and Wilson's ENN are 81.19%, 81.45%, 81.44%, 81.69%, and 80.89%, respectively. NIS is faster than the other instance selection algorithms, and adjusting its scaling parameter does not significantly change its running time. Wilson's ENN is the slowest instance selection algorithm because it considers the neighborhoods between data points. The average running times of NIS, NIStuned, BPLSH, DR.LSH, LSH-IS-S, LSH-IS-F, and Wilson's ENN are 0.03, 0.03, 342.17, 134.63, 2.30, 0.67, and 1450.50 s, respectively. The reduction rate of NIS is larger than those of the other instance selection methods. The average reduction rates of NIS, NIStuned, BPLSH, DR.LSH, LSH-IS-S, LSH-IS-F, and Wilson's ENN are 35.50%, 30.46%, 21.59%, 14.89%, 15.10%, 15.10%, and 17.65%, respectively. Selecting a suitable scaling rate yields both high classification accuracy and a high reduction rate. As a result, although NIS does not use class information, it can rapidly deliver high accuracy rates and reduction rates on many data sets. Overall, the results show that the reduction rate decreases as the classification accuracy of the algorithms increases. Besides, better results can be obtained by adjusting the parameter values of the algorithms. The important point is the trade-off between accuracy, reduction, and speed-up. NIS promises to be a faster algorithm and to reduce more instances than the others for any data set. Furthermore, the results show that the classification accuracy can be further improved by changing the scaling rate parameter.

Fig. 3

The comparative results of the algorithms in terms of accuracy rate, running time, and reduction rate

The proposed method (with the default scaling parameter) cannot reduce instances on 8 out of 55 data sets. For example, the mean range of the dimensions on the arrhythmia data set is 11.12 and the number of dimensions is 279. Substituting these values gives \(m\left( \frac{1}{N} \right)^{d} = 452\left( \frac{1}{11} \right)^{279} \approx 0\). To overcome this problem, the scaling rate parameter can be adjusted. For instance, the accuracy and reduction rates are 56.64% and 3.76%, respectively, for \(\alpha = 0.07\). The scaling rate parameter changes the range, i.e., \(N\); thus, the mean range becomes 0.73 for \(\alpha = 0.07\). As a result, tuning the scaling rate is the key to dealing with high dimensionality and large ranges. Besides, the mean running time of the proposed method is very close to 0. While the proposed method ensures a low running time, it also achieves high reduction rates. The summarized information about whether there are statistically significant differences between the results is shown in Table 3. According to the Kruskal–Wallis test results, no accuracy rates have mean ranks significantly different from each other. According to Friedman's test, there is a significant difference between the accuracy rates: the accuracy rate of the baseline is significantly different from those of NIS, BPLSH, and Wilson's ENN, but there is no significant difference between the baseline and NIStuned, DR.LSH, LSH-IS-S, and LSH-IS-F. Besides, according to both tests, there are significant differences in terms of running time and reduction rate.

Table 3 The mean rank results of the Kruskal–Wallis test and Friedman's test (small values of p weaken the validity of the null hypothesis, which states that there is no significant difference between the results)

Table 4 shows the performance results of the instance selection methods on the Poker Hand data set in terms of accuracy rate, reduction rate, and running time. The Poker Hand data set has 1,025,010 instances and 10 features. From the results, the accuracy rates of all the instance selection methods are almost the same as the baseline. The accuracy rate difference between NIS and LSH-IS-S is 1.22%; this difference can be decreased by tuning the scaling rate parameter of NIS. NIS has the highest reduction rate at 43.09%, whereas DR.LSH, LSH-IS-S, and LSH-IS-F deliver the lowest reduction rate at 0.22%. Besides, NIS has the lowest running time at 1.17 s, while Wilson's ENN is the slowest method at 78,123.37 s. Among the hashing-based methods, LSH-IS is faster in terms of running time. As a result, the other methods get worse in terms of running time as the size of the data set increases. Consequently, NIS can yield good performance in terms of the accuracy rate-reduction rate dilemma and running time on a large data set.

Table 4 The performance results of the instance selection methods on the Poker Hand data set in terms of accuracy rate, reduction rate, and running time

Table 5 shows the comparative results of the instance selection methods on selected data sets in terms of accuracy rate, reduction rate, and running time. The scaling rate of NIS is 1 on 4 out of the 10 data sets, and the mean of the tuned scaling rates is 2.1. We have adjusted the scaling rate with one-decimal-digit precision; accordingly, we point out that one does not need to spend much time searching for a suitable scaling rate. From the results, Wilson's ENN has the highest accuracy rate on 4 out of 10 data sets. Besides, Wilson's ENN ranks first in terms of average accuracy rate and NIS ranks second; however, we underline that the classification accuracy performances of the methods are almost the same. NIS yields the highest reduction rate on 6 out of 10 data sets, and DR.LSH ranks second after NIS in terms of the number of highest reduction rates. Moreover, NIS ranks first and BPLSH ranks second in terms of average reduction rate. Furthermore, NIS is superior to the other methods in terms of running time, since it delivers the lowest running time on all the data sets. LSH-IS-S and LSH-IS-F are the second fastest instance selection algorithms after NIS, both on individual data sets and in terms of average running time. Wilson's ENN is the slowest instance selection algorithm. Briefly, NIS rapidly processes all the data sets. Regarding the characteristics of the data sets, as the number of instances increases, the running time of Wilson's ENN increases excessively; BPLSH and DR.LSH are affected by this as well, whereas NIS, LSH-IS-S, and LSH-IS-F are much less affected. BPLSH and DR.LSH are affected more negatively by the number of dimensions than by the number of instances. Wilson's ENN is impacted more unfavorably than the other algorithms by both the number of dimensions and the number of instances. Further, BPLSH is more influenced by the number of classes than the other algorithms, excluding Wilson's ENN.

Table 5 The comparative results of the instance selection methods on some data sets in terms of accuracy rate, reduction rate, and running time (the data sets are denoted by their number)

Figure 4 shows the accuracy and reduction rates obtained over the ten data sets according to the change in the scaling rate of NIS. The scaling rate has been varied in the range of 0.1 to 5 in increments of 0.1. According to the results, a high accuracy rate is obtained on the Nomao, HTRU2, and Human activity data sets when the scaling rate is 0.1; in other words, a high accuracy rate can also be achieved at high reduction rates. Besides, the scaling rate parameter does not affect the running time of the proposed algorithm; only the number of instances and the number of dimensions of a data set influence it. Accordingly, both high accuracy and high reduction rates can be delivered by a suitable scaling rate. Further, it is generally not necessary to search over large scaling rates to obtain an accuracy rate-reduction rate trade-off. Considering the experimental results, the optimal scaling rate for a data set can be sought in the range of 0 to 5 (zero not included), and an appropriate trade-off between accuracy rate and reduction rate can mostly be obtained when the scaling rate is in the range of 0 to 2 (zero not included). Finally, these results show that high accuracy rates cannot always be obtained at the same time as high reduction rates. Therefore, either the accuracy rate or the reduction rate should be sacrificed, or a common balance point should be sought for both.
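The search over the scaling rate described above can be sketched as follows. This is again a self-contained Python illustration of ours rather than the authors' code: the breast cancer data set, the single holdout split, and the simple sum of accuracy and reduction used to rank candidates are all arbitrary choices of this sketch.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)             # stand-in data set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

def nis(X, alpha=1.0):
    # same conjectural hyper-rectangle sketch as before (Eq. (2), rounding assumed)
    stds = np.where(X.std(axis=0) == 0, 1.0, X.std(axis=0))
    cells = np.round(alpha * (X - X.min(axis=0)) / stds).astype(int)
    return np.sort(np.unique(cells, axis=0, return_index=True)[1])

best = None
for alpha in np.arange(0.1, 5.01, 0.1):                 # 0.1 to 5 in steps of 0.1
    kept = nis(X_tr, alpha)
    acc = KNeighborsClassifier(n_neighbors=1).fit(X_tr[kept], y_tr[kept]).score(X_te, y_te)
    red = 1 - kept.size / len(X_tr)
    score = acc + red                                    # naive way to weigh the trade-off
    if best is None or score > best[0]:
        best = (score, round(float(alpha), 1), acc, red)

print(f"alpha={best[1]}: accuracy {best[2]:.2%}, reduction {best[3]:.2%}")
```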

Fig. 4

The accuracy and reduction rates obtained over the ten data sets according to the change in the scaling rate of NIS

In consequence, the proposed method can cope with big data rapidly in comparison with the other methods. Furthermore, the proposed algorithm yields larger reduction rates on average. Besides, higher classification accuracies can be obtained through the accuracy rate-reduction rate trade-off without decreasing the reduction rates too much. In addition, the idea of the scaling rate is the key to handling large data sets. Additionally, NIS achieves the best running time on all data sets across different domains and characteristics. In short, the advantages of NIS are as follows:

  • Data sets with huge volumes can be processed very fast.

  • It can conduct both supervised and unsupervised learning tasks.

  • It can easily manage the accuracy rate-reduction rate trade-off through a single parameter.

  • Simultaneously, it can provide both high accuracy and high reduction rates on the data sets.

  • NIS can also deliver successful results on imbalanced data sets.

  • Better outcomes on high-dimensional data sets can be obtained by adjusting the scaling rate parameter.

The disadvantages of NIS are as follows:

  • Since it selects the first instance in a hyper-rectangle without any criterion, the accuracy rate and reduction rate can be affected by this choice.

  • Some time must be spent finding a suitable scaling rate value for the data set.

6 Conclusion

In this study, we propose a new unsupervised instance selection algorithm called NIS. We have measured the success of NIS using fifty-five data sets from different domains and compared NIS with one conventional and four state-of-the-art instance selection algorithms from the recent literature. NIS yields a more desirable classification accuracy-reduction rate trade-off compared to the other algorithms. The accuracy and reduction rates of NIS are 80.24% and 35.50%, respectively. The closest competitor of NIS in terms of the accuracy-reduction trade-off is LSH-IS-F, whose accuracy and reduction rates are 81.69% and 15.10%, respectively. Additionally, when we roughly adjust the scaling rate of NIS, we obtain accuracy and reduction rates for NIStuned of 81.99% and 30.46%, respectively; in this case, the average scaling rate of NIS is 1.48. According to the experimental results, the optimum scaling rate for a data set can be searched for in the range of 0 to 5 (zero not included). NIStuned is superior to the other methods on 7 out of 55 data sets, while the numbers of data sets on which BPLSH, DR.LSH, LSH-IS-S, LSH-IS-F, and Wilson's ENN are superior to their competitors are 8, 8, 8, 12, and 16, respectively. Moreover, the numbers of data sets on which NIStuned, BPLSH, DR.LSH, LSH-IS-S, LSH-IS-F, and Wilson's ENN are the best among their competitors in terms of reduction rate are 20, 3, 3, 9, 9, and 21, respectively. The running time of NIS is quite low, and it can rapidly handle huge data sets. The average, maximum, and minimum running times of NIS are 0.0332, 1.1673, and 0.0003 s, respectively. The closest competitor of NIS in terms of running time is LSH-IS-F, whose average, maximum, and minimum running times are 0.6663, 29.9893, and 0.0007 s, respectively. NIS and NIStuned are faster than the other methods on all the data sets. Although NIS is an unsupervised algorithm, it can rapidly yield high accuracy and reduction rates over many data sets. In general, the reduction rate decreases while the accuracy rate of the methods increases. The key ability of the algorithms is to achieve a fitting trade-off between accuracy rate, reduction rate, and speed-up. NIS proves to be a faster algorithm and reduces more instances than the other methods while still achieving high accuracy rates on the data sets. Future work may involve developing a supervised version of the proposed method. Thus, the accuracy rates of the algorithm on data sets can be improved further, while keeping the reduction rates, by detecting instances from different classes located in the same region of the space.