1 Introduction

Machine learning is a field of study that aims to give machines the ability to learn so that they can discover patterns in real-world problems. Supervised and unsupervised learning are two types of machine learning that are commonly used across many problems. Consider a training set \(T = \left\{ {\left( {x_{1} ,y_{1} } \right),\; \ldots ,\;\left( {x_{m} ,y_{m} } \right)} \right\} \in {\mathbb{R}}^{m \times d}\), which contains \(m\) instances described by \(d\) features and \(c\) classes, with input points \(X = \left\{ {x_{i} } \right\}_{i = 1}^{m} \Rightarrow x_{i} = \left( {x_{i}^{\left( 1 \right)} ,x_{i}^{\left( 2 \right)} ,\; \ldots ,\;x_{i}^{\left( d \right)} } \right) \in {\mathbb{R}}^{d} , \;i = 1,\; \ldots ,\;m\) and labels \(y_{j} \in Y, j = 1,\; \ldots ,\;c\). A supervised learning algorithm seeks a function \(h:X \to Y\) that is an element of the hypothesis space \(H\). As for unsupervised learning, let \(D\) be an unknown distribution; unsupervised learning then aims to learn \(X\sim D\) with respect to \(h \in H\). When supervised and unsupervised learning tasks are applied, batch learning is preferred in industry because it is faster and easier to implement than online learning, which consumes a real-time stream of data. Both learning settings, however, suffer from big data. Huge datasets are widespread in many areas, including data mining, text categorization, financial forecasting, multimedia databases, genome sequences, and meteorological, financial, industrial, and scientific repositories. Undoubtedly, the amount of data is growing quickly in all domains. Hence, constructing a model becomes more costly and requires more computing clusters. Furthermore, the outliers and noise introduced into the data set while collecting more data tend to degrade performance further. To address these problems, data reduction methods are often employed: outliers and noise are discarded, and the data set is refined. Data reduction can be conducted by feature selection or instance selection algorithms, or both. Some strategies used in feature selection are low variance filters, principal component analysis, high correlation filters, forward feature construction, and backward feature elimination methods [1]. In this paper, we address instance selection for data reduction in supervised and unsupervised learning problems. In this regard, the objective of instance selection is to find an optimal subset of the input set, i.e., \(S \subset X\). In this respect, instance selection (i.e., prototype selection) algorithms are effective for building models rapidly.

Instance selection is a process that discards noisy or redundant data [2]. In other words, the objective of instance selection algorithms is to remove useless data from the training set so that the classification accuracies obtained on the selected subset and on the original data set are close to or almost the same as each other. Instance selection is useful for decreasing the training and test time of lazy learners (e.g., k-Nearest Neighbors [KNN]) and of computationally costly algorithms such as Support Vector Machines (SVM) and neural networks. In addition, instance reduction algorithms have been presented to tackle problems in fields such as time series, class-imbalanced data sets, distributed learning, noise sensitivity, monotonic data sets, and lazy learners. In the literature, the approaches used to develop instance selection algorithms include the nearest neighbor rule, evolutionary computation, meta-learning, artificial immune models, divide-and-conquer strategies, Bayesian approaches, cluster-based approaches, hashing, geometry-based approaches, instance ranking, density-based approaches, the edge detection approach from image processing, metaheuristic algorithms, swarm intelligence algorithms, and graph-based approaches. Instance selection algorithms share a few common properties: evaluation of search, type of selection, and direction of search [2,3,4]. Besides, there are four criteria for contrasting instance selection algorithms: storage requirement, noise resistance, classification accuracy, and running time [3].

All these approaches have a variety of advantages and disadvantages. Moreover, we believe that instance selection methods will remain imperfect due to the accuracy-reduction dilemma, and that no single approach can outperform all other approaches on all data sets. In other words, because of its assumptions and characteristics, each method has data sets on which it succeeds and data sets on which it fails. Despite this, we can compare instance selection algorithms in terms of criteria such as reduction rate, classification accuracy, and running time. In this paper, we focus on faster instance selection and a simple accuracy rate-reduction rate trade-off controlled by a single parameter. This is because instance selection algorithms suffer from big data just like classification, regression, and clustering algorithms do. Moreover, adjusting the accuracy rate-reduction rate trade-off is a serious problem in instance selection algorithms that have many meta-parameters, and this complicates the management of the process. To implement and deploy instance selection algorithms efficiently in terms of software/hardware, it also matters that the algorithm is comprehensible and easy to apply. In this study, to devise an algorithm that meets the aforementioned criteria, conjectural hyper-rectangles are employed. A hyper-rectangle, whose edges are all mutually perpendicular in Euclidean \(n\)-space \(E^{n}\), is the generalization of a rectangle to higher dimensions. We call the pseudo-hyper-rectangles that we define conjectural hyper-rectangles because, unlike real hyper-rectangles, they are never explicitly constructed to contain points in \(E^{n}\). Hyper-rectangles are used to perform various tasks in machine learning (e.g., classification, clustering, instance selection, etc.). Solving a problem by creating hyper-rectangles in \(E^{n}\) is more efficient than using other geometric structures, and hyper-rectangles are easier to form; therefore, they are preferable to other structures. However, calculating the proper positions of real hyper-rectangles takes a long time on data sets with huge volumes. In particular, we note that hyper-rectangles can intersect with each other, in which case many unnecessary hyper-rectangles are created. In this respect, we prefer to exploit conjectural hyper-rectangles, both to avoid describing unnecessary hyper-rectangles and to speed up the algorithm by not creating real hyper-rectangles. Finally, we develop an instance selection algorithm that can be used in both supervised and unsupervised learning problems.

In this paper, we propose a fast unsupervised instance selection algorithm based on conjectural hyper-rectangles. The time complexity of our proposed method is log-linear. Besides, the proposed algorithm has obtained remarkable results on the data sets used in the experiments. The main contributions of the proposed method are as follows:

  • The proposed unsupervised algorithm is straightforward and rapidly processes huge data sets.

  • The accuracy-reduction trade-off can be easily adjusted by the scaling rate, the single parameter of the proposed algorithm.

The rest of the paper is organized as follows. Section 2 reviews the related literature. Section 3 introduces our algorithm. Section 4 describes the experimental setup. Section 5 presents the results and discussion in detail. Finally, Section 6 concludes the paper.

2 Background

Looking back in time, one of the first algorithms developed to reduce the training set is the Condensed Nearest Neighbor (CNN) rule [5]. CNN is an iterative algorithm that starts with an empty subset. It then randomly picks an instance from the training data and adds it to the subset if it is misclassified when the subset is used as training data; the process stops when no more instances can be added. CNN does not guarantee finding the optimum subset. Further, it generates different subsets on every run due to random instance selection [6]. The Reduced Nearest Neighbor (RNN) rule has been introduced as a modification of CNN to obtain smaller subsets [7]. Ullmann developed an algorithm that obtains smoother boundaries by improving CNN and RNN [8]. The Selective Nearest Neighbor (SNN) rule has been presented as an extension of CNN. SNN guarantees that the nearest neighbor of each instance of the original data set is a member of the subset and that the subset is minimal; however, SNN is a difficult and complex algorithm to implement [9]. Tomek proposed two methods to improve CNN and claimed that they ensure a minimal subset and a boundary close to the decision boundary [10]. Gowda and Krishna proposed another improvement of CNN, which prevents interior instances from being kept in the condensed set and instead tries to retain instances near the decision boundary [11]. The IB2 and IB3 methods have been introduced to discard superfluous and noisy instances from the training set [12]. The Fast Condensed Nearest Neighbor (FCNN) family contains a set of algorithms that run fast on big data; their worst-case time complexity is quadratic [13]. One of the first methods focused on removing noisy instances from a training set is the Edited Nearest Neighbor (ENN) rule [14]. The Edition Method Based on Local Sets (ELS) has been proposed as a parameter-free algorithm based on local sets with natural neighbors, which obtains more acceptable boundaries and filters noisy instances efficiently [15].

The algorithms in the literature can be categorized into several approaches according to the techniques applied. Various hybrid methods that combine condensing and editing approaches have been proposed [16,17,18]. As for ensemble approaches, a variety of ensemble instance selection methods have been proposed [6, 19, 20]. With respect to evolutionary approaches, many methods have been proposed [21,22,23]. Regarding distributed computation, the MapReduce solution for Prototype Reduction (MRPR) has been introduced as an instance selection algorithm that accelerates the classification stage and decreases the storage needs and the noise sensitivity of the nearest neighbor rule [24]. In the matter of hashing, several approaches have been proposed [25,26,27]. To speed up SVM, Shell Extraction (SE) has been proposed [28]; in addition, two instance selection methods, one based on a nature-inspired metaheuristic algorithm and the other on a normal individual-based swarm intelligence algorithm, have been proposed [29]. Relating to scoring the instances, instance selection algorithms based on the ranking approach have been proposed [30, 31]. Concerning solution approaches for different data set types (e.g., class-imbalanced data sets), a variety of methods have been proposed [32, 33]. Finally, on the point of geometrical approaches, many methods have been proposed [34,35,36,37,38].

Judging from recent works, it is seen that serious efforts have been made to solve important problems. The major difficulties in GA-based instance selection are high computational complexity and performance that degrades as the data set grows. To tackle these problems with a three-step strategy, fuzzy clustering decomposition based on a genetic algorithm has been proposed for regression problems [39]. A generic cluster-oriented instance selection algorithm has been proposed for classification problems, which runs the unsupervised K-Means clustering algorithm on the training set and, given a selection rate, chooses instances from the centers and the boundaries of the clusters [1]. Intrusion detection data sets consist of huge amounts of data, which affects the learning of the classifier; hence, there is a need to decrease the size of these data sets. Therefore, a new fast instance selection method has been proposed to provide better efficiency during the training stage without greatly impacting the effectiveness of the intrusion detection scheme [40]. The preservation rough set model, based on rough sets, has been proposed to deal with the instance selection problem and enhance lazy learners on hybrid and incomplete data sets [41]. Finally, a method inspired by the cross-validation and divide-and-conquer strategies, which uses genetic algorithms and an open-source framework to choose an optimal instance subset from huge data and relies on combined information entropy, has been proposed [42].

3 The proposed algorithm

The proposed algorithm discards instances by exploiting hyper-rectangles. However, since detecting hyper-rectangles directly can slow down the instance selection process, we do not try to find hyper-rectangles explicitly. Instead, we utilize the concept of data scaling from statistics. Thus, we avoid the cost of calculating the positions of real hyper-rectangles and form hyper-rectangles only indirectly. We call the hyper-rectangles created by this approach conjectural hyper-rectangles. Figure 1 shows the positions of the conjectural hyper-rectangles on the original seeds data set. Accordingly, thirty conjectural hyper-rectangles have been constituted by the proposed method on the original data set. These hyper-rectangles are rectangles in two-dimensional space. The proposed method keeps only a single instance in each rectangular region and discards the rest of the instances in that region. In higher-dimensional spaces these regions are hyper-rectangular, and the proposed method treats them in the same way. The proposed method relies on data scaling to calculate conjectural hyper-rectangles and, in this respect, assumes that the data are normally distributed. Considering one-dimensional space, when we scale the data by their standard deviation and then round the scaled values by a certain rule, some values fall into the same range. These ranges can be regarded as unevenly spaced points on a number line in one-dimensional space. In two-dimensional space, each number line corresponds to an axis, such as the x-axis and y-axis, and these axes constitute a Cartesian plane. The intersections of the points on each axis produce a grid composed of rectangles. In high-dimensional spaces, there is a hyper-grid consisting of hyper-rectangles. The data points fall into these hyper-rectangles, and the goal of the proposed method is to retain only the first data point in each hyper-rectangle and discard the remaining data points in that hyper-rectangle. The proposed method is simply constructed in this way.

Fig. 1

The positions of the conjectural hyper-rectangles on the original seeds data set

Recall the training set \(T = \left\{ {\left( {x_{1} ,y_{1} } \right),\; \ldots ,\;\left( {x_{m} ,y_{m} } \right)} \right\} \in {\mathbb{R}}^{m \times d}\), which contains \(m\) instances described by \(d\) features and \(c\) classes, with training points \(X = \left\{ {x_{i} } \right\}_{i = 1}^{m} \Rightarrow x_{i} = \left( {x_{i}^{\left( 1 \right)} ,\;x_{i}^{\left( 2 \right)} ,\; \ldots ,\;x_{i}^{\left( d \right)} } \right) \in {\mathbb{R}}^{d} , \;i = 1,\; \ldots ,\;m\) and labels \(y_{j} \in Y, \;j = 1,\; \ldots ,\;c\). The range of a value \(x_{i}^{\left( k \right)}\) in any dimension \(k = 1,\; \ldots ,\;d\) is calculated as in Eq. (1). Thus, we determine into which range the data within each dimension (or feature) fall; in other words, we scale all the values. Eventually, the real positions of the conjectural hyper-rectangles are implicitly determined through \(pos^{\left( k \right)} = x^{\prime}\sigma_{{x^{\left( k \right)} }} + \min \left( {x^{\left( k \right)} } \right)\) from Eq. (1). This is useful for viewing where the conjectural hyper-rectangles are located on the original data set.

$$x^{\prime} = \left[ {\frac{{x_{i}^{\left( k \right)} - \min \left( {x^{\left( k \right)} } \right)}}{{\sigma_{{x^{\left( k \right)} }} }}} \right]$$
(1)

The proposed method assumes that the data are normally distributed. In this regard, as values in a normal distribution move away from the mean by a certain deviation, the probabilities that they lie within the related range also change; hence, the hyper-rectangles are not all the same size. Accordingly, the maximum and minimum values of \(x^{\prime}\) in Eq. (1) are six and zero, respectively, according to the three-sigma rule of thumb, for example. In this case, each data cell can take seven possible values; in other words, it has seven possible states. We note, however, that the data distribution can be heavy-tailed or light-tailed relative to a normal distribution, or asymmetric. We add a scaling rate \(\alpha\) to Eq. (1), as shown in Eq. (2), to adjust the size of the hyper-rectangles. As the value of the scaling rate decreases, the size of the hyper-rectangles increases and, accordingly, the number of hyper-rectangles decreases, and vice versa. The default value of the scaling rate is 1.

$$x^{\prime} = \left[ {\alpha \left( {\frac{{x_{i}^{\left( k \right)} - \min \left( {x^{\left( k \right)} } \right)}}{{\sigma_{{x^{\left( k \right)} }} }}} \right)} \right]$$
(2)
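To make Eq. (2) concrete, the following short Python sketch (our own illustration, not the authors' MATLAB code; the bracket in Eq. (2) is read here as rounding to the nearest integer) scales a single synthetic feature and reports how the number of occupied cells, i.e., one-dimensional conjectural hyper-rectangles, changes with the scaling rate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # one synthetic feature

def cell_index(x, alpha=1.0):
    """Eq. (2): shift by the minimum, scale by the standard deviation,
    multiply by the scaling rate, and round to the nearest integer."""
    return np.round(alpha * (x - x.min()) / x.std()).astype(int)

for alpha in (0.5, 1.0, 2.0):
    idx = cell_index(x, alpha)
    print(f"alpha={alpha}: {np.unique(idx).size} occupied cells, "
          f"indices in [0, {idx.max()}]")
```

As the scaling rate grows, the cells become narrower, so more of them are occupied and fewer instances share a cell.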

The proposed algorithm reduces instances depending on the scaling rate. We call the proposed instance reduction algorithm Nimble Instance Selection (NIS) and describe it in Algorithm 1.

Algorithm 1 Nimble Instance Selection (NIS)

Now, we calculate the time and space complexities of the algorithm. NIS measures the rounded distance of the instances from the minimum value in terms of the standard deviation and accordingly removes some repetitive instances. The upper bound time and space complexities of scoring the instances are \(O\;\left( {2md} \right)\) and \(O\;\left( {md} \right)\), respectively. The upper bound time and space complexities of searching unique instances are \(O\;\left( {md\log_{2} md + md} \right)\) and \(O\;\left( {md} \right)\), respectively. In consequence, the total time complexity is \(O\;\left( {md\log_{2} md} \right)\) and the total space complexity is \(O\;\left( {md} \right)\).
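Since Algorithm 1 is given as a figure, the sketch below is our Python reconstruction of NIS from Eqs. (1)-(2) and the description above; it is not the authors' MATLAB implementation, and the rounding rule and the handling of constant features are our assumptions. It scores every instance per feature and keeps the first instance that falls into each occupied conjectural hyper-rectangle.

```python
import numpy as np

def nis(X, alpha=1.0):
    """Nimble Instance Selection sketch: return the indices of the
    instances retained, one per occupied conjectural hyper-rectangle."""
    X = np.asarray(X, dtype=float)
    mins = X.min(axis=0)
    stds = X.std(axis=0)
    stds[stds == 0] = 1.0                                    # guard constant features
    cells = np.round(alpha * (X - mins) / stds).astype(int)  # Eq. (2) per feature
    # implied cell positions for Fig. 1-style plots: cells * stds / alpha + mins
    _, first = np.unique(cells, axis=0, return_index=True)   # first row of each cell
    return np.sort(first)

# toy usage on random data
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
kept = nis(X, alpha=1.0)
print(f"kept {kept.size} of {X.shape[0]} instances "
      f"(reduction rate {1 - kept.size / X.shape[0]:.2%})")
```

Because the selection itself never uses class information, labels of the retained indices can simply be looked up afterwards when the reduced set is fed to a supervised learner.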

Now let us derive the equation that gives the average number of reduced instances under various assumptions. To facilitate the calculations, we assume that all the dimensions have the same characteristics. The probability of \(n\) successful events in a sample of \(m\) instances selected with replacement from the data set is shown in Eq. (3).

$$b\left( {n,\;m,\;p} \right) = \left( {\begin{array}{*{20}c} m \\ n \\ \end{array} } \right)p^{n} q^{{\left( {m - n} \right)}}$$
(3)

where \(p\) is the success probability and \(q\) is the failure probability. The success probability depends on the range spanned by the data (i.e., \(v\sigma - \left( { - v\sigma } \right) = 2v\sigma , \;v \in {\mathbb{R}}\) for the standard normal distribution) and on the number of dimensions. We note, however, that the data distribution can be skewed and kurtotic relative to a normal distribution. Let \(U:\Omega \to \left\{ {0,\; \cdots ,\;u} \right\}, \;u \in {\mathbb{Z}}\) be a discrete random variable taking the range values from Eq. (2), and let \(N\) be the number of possible outcomes of \(U\). Accordingly, the success probability is shown in Eq. (4).

$$p = \left( \frac{1}{N} \right)^{d}$$
(4)

Considering all possible cases of instances that fall into the same cell, we calculate the expected value of the number of selected instances (i.e., the number of reduced instances) by using Eq. (5).

$$E\left[ Z \right] = \mathop \sum \limits_{n = 0}^{m} \left( {\begin{array}{*{20}c} m \\ n \\ \end{array} } \right)p^{n} q^{{\left( {m - n} \right)}} n$$
(5)

where \(Z:\Omega \to \left\{ {1, \cdots ,m} \right\}\) is a discrete random variable denoting the number of reduced instances, and the symbol \(n\) refers to the number of reduced instances. Accordingly, let us calculate the expected value of the number of selected instances by means of Lemma 1.

Lemma 1

Let \(p,\;q \in {\mathbb{R}}_{ \ge 0} = \left\{ {x \in {\mathbb{R}} | \;x \ge 0} \right\}\) and \(m,\;n \in {\mathbb{Z}}_{ \ge 0} = \left\{ {x \in {\mathbb{Z}} | \;x \ge 0} \right\}\). Then \(mp\left( {p + q} \right)^{m - 1} = \sum\nolimits_{n = 0}^{m} {\left( {\begin{array}{*{20}c} m \\ n \\ \end{array} } \right)p^{n} q^{{\left( {m - n} \right)}} n}\).

Since \(E\left[ Z \right] = \sum\nolimits_{n = 0}^{m} {\left( {\begin{array}{*{20}c} m \\ n \\ \end{array} } \right)p^{n} q^{{\left( {m - n} \right)}} n}\) from Eq. (5), it can be rewritten as

$$\sum\limits_{n = 0}^{m} {\frac{m!}{{n!\left( {m - n} \right)!}}\left( n \right)p^{n} q^{{\left( {m - n} \right)}} }$$
(6)

Since the factor \(n\) in the numerator cancels with the \(n!\) in the denominator, \(\left( {n - 1} \right)!\) remains in the denominator; note that the \(n = 0\) term vanishes, so the summation can start at \(n = 1\). In addition, a factor of \(m\) and a factor of \(p\) can be separated from the general term. Thus, let us rewrite Eq. (6) as in Eq. (7).

$$\sum\limits_{n = 1}^{m} {\left( m \right)\frac{{\left( {m - 1} \right)!}}{{\left( {n - 1} \right)!\left( {m - n} \right)!}}p^{{\left( {n - 1} \right)}} q^{{\left( {m - n} \right)}} \left( p \right)}$$
(7)

Due to the constant multiple rule of summation, the constants \(m\) and \(p\) can be written outside the summation. Accordingly, let us rearrange Eq. (7) as below.

$$mp\mathop \sum \limits_{n = 1}^{m} \left( {\begin{array}{*{20}c} {m - 1} \\ {n - 1} \\ \end{array} } \right)p^{{\left( {n - 1} \right)}} q^{{\left( {m - n} \right)}}$$
(8)

Eventually, by invoking the binomial theorem \(\left( {p + q} \right)^{m} = \mathop \sum \nolimits_{n = 0}^{m} \left( {\begin{array}{*{20}c} m \\ n \\ \end{array} } \right)p^{n} q^{{\left( {m - n} \right)}}\) with \(m - 1\) in place of \(m\), we arrive at

$$mp\left( {p + q} \right)^{m - 1} = \mathop \sum \limits_{n = 0}^{m} \left( {\begin{array}{*{20}c} m \\ n \\ \end{array} } \right)p^{n} q^{{\left( {m - n} \right)}} n$$
(9)

We substitute Eq. (4) into Eq. (9) and, since \(p + q = 1\), we reach the expected value of the number of reduced instances in Eq. (10).

$$E\left[ Z \right] = m\left( \frac{1}{N} \right)^{d}$$
(10)

From Eq. (10), it can be concluded that large deviations (i.e., a large \(N\)) and high dimensionality negatively affect the expected number of reduced instances. As a result, the most important factor affecting the performance of the proposed algorithm is the number of dimensions, provided there is no noisy data in the data set. Furthermore, since the proposed method must discard at least one instance, the condition \(\frac{m}{{N^{d} }} \ge 1\), i.e., \(m \ge N^{d}\), must be satisfied. In the \(d > m\) or \(m < N^{d}\) cases, the reduction rate can be increased by adjusting the scaling rate. We note that \(m\) and \(d\) are constants, whereas \(\alpha\) is a variable.
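As a quick numerical check of Lemma 1 and Eq. (10), the snippet below (with arbitrary example values for \(m\), \(N\), and \(d\) chosen by us, not taken from the experiments) evaluates the binomial sum of Eq. (5), compares it with the closed form \(m\left( \frac{1}{N} \right)^{d}\), and tests the \(m \ge N^{d}\) condition.

```python
from math import comb

m, N, d = 500, 7, 2                  # arbitrary example values
p = (1 / N) ** d                     # Eq. (4)
q = 1 - p

ez_sum = sum(comb(m, n) * p**n * q**(m - n) * n for n in range(m + 1))  # Eq. (5)
ez_closed = m * p                    # Eq. (10), using p + q = 1

print(f"binomial sum = {ez_sum:.4f}, closed form = {ez_closed:.4f}")  # both about 10.20
print(f"m >= N^d ? {m >= N**d}")     # 500 >= 49, so the condition above holds
```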

Figure 2 shows the remaining instances and the reduction rates \(r\) after the reduction process on the seeds data set with respect to the scaling rate α. As can be seen from the results, as the scaling rate increases, the reduction rate decreases. The default value of the scaling rate is set to 1.

Fig. 2

The illustration of the remaining instances and reduction rates \({\varvec{r}}\) on the seeds data set, according to the scaling rate α

4 Experimental setup

In this section, we explain the experimental setup, including experimental data sets, instance selection algorithms used in the experiments, evaluation metrics, and implementations.

4.1 Data sets

We compare NIS with the state-of-the-art instance selection algorithms to measure its efficiency by using fifty-five data sets from the UCI database, the MATLAB sample data sets, and the OpenML data sets. The descriptive information belonging to those data sets is shown in Table 1.

Table 1 The characteristics of the data sets used in experiments

4.2 Instance selection algorithms

We compare the proposed algorithm with the state-of-the-art instance selection algorithms. The algorithms used in the experiments are shown in Table 2. All of them are supervised, filtering-based instance selection algorithms. Besides, we run them with their default parameter values in all the experiments.

Table 2 The state-of-the-art instance selection algorithms used in the experiments

4.3 Parameter setting and implementations

In this study, the baseline method refers to applying the 1NN algorithm to the whole data set. Besides, we apply tenfold cross-validation in all the experiments and repeat each experiment ten times to generate training sets with different fold combinations. The experiments have been performed in MATLAB R2021a on an i5-8265U CPU at 1.6 GHz with 8 GB of RAM on Windows 11 Pro (64-bit). In addition, we use the default parameter values of NIS unless otherwise stated.
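The experiments themselves were run in MATLAB; the following scikit-learn snippet is only our Python sketch of an equivalent protocol (ten repetitions of tenfold cross-validation with a 1NN classifier). The nis function from the earlier sketch is assumed, and the stratified folds and the wine data set are stand-ins of our own choosing.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)                      # stand-in data set
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

accs, reds = [], []
for train, test in cv.split(X, y):
    kept = nis(X[train], alpha=1.0)                    # unsupervised selection on the training fold
    clf = KNeighborsClassifier(n_neighbors=1).fit(X[train][kept], y[train][kept])
    accs.append(clf.score(X[test], y[test]))
    reds.append(1 - kept.size / len(train))

print(f"mean accuracy {np.mean(accs):.2%}, mean reduction {np.mean(reds):.2%}")
```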

4.4 Evaluation criteria

We compare the instance selection algorithms using three criteria: reduction rate, classification accuracy, and running time. The lower the reduction rate, the higher the storage requirement. Furthermore, as the reduction rate increases, the smaller the drop in classification accuracy, the better. Finally, the algorithms are expected to process large data sets quickly. As a result, a good algorithm should perform well on all of the above-mentioned criteria.
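The paper does not state the reduction rate formula explicitly; under the common definition, for a subset \(S\) selected from \(m\) training instances, it is

$$r = \frac{{m - \left| S \right|}}{m}$$

so a higher \(r\) means that fewer instances need to be stored by an instance-based learner such as 1NN.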

5 Results and discussion

In this section, we conduct experiments comparing the instance selection algorithms. Figure 3 shows the results of the experiments performed over the data sets in terms of classification accuracy, running time, and reduction rate. According to the results, after the instance reduction process, the highest average classification accuracy belongs to NIStuned with 81.99%. The average scaling rate of NIStuned is 1.48, and the scaling rate has been changed for only 15 out of 55 data sets. The average accuracy rate of 1NN (i.e., the baseline) on the original data sets is 82.33%. The average accuracy rate of NIS on the data sets is 80.24% when the scaling rate is 1. The average accuracy rates of BPLSH, DR.LSH, LSH-IS-S, LSH-IS-F, and Wilson's ENN are 81.19%, 81.45%, 81.44%, 81.69%, and 80.89%, respectively. NIS is faster than the other instance selection algorithms, and adjusting its scaling parameter does not significantly change its running time. Wilson's ENN is the slowest instance selection algorithm because it considers the neighborhoods between data points. The average running times of NIS, NIStuned, BPLSH, DR.LSH, LSH-IS-S, LSH-IS-F, and Wilson's ENN are 0.03, 0.03, 342.17, 134.63, 2.30, 0.67, and 1450.50 s, respectively. The reduction rate of NIS is larger than those of the other instance selection methods. The average reduction rates of NIS, NIStuned, BPLSH, DR.LSH, LSH-IS-S, LSH-IS-F, and Wilson's ENN are 35.50%, 30.46%, 21.59%, 14.89%, 15.10%, 15.10%, and 17.65%, respectively. Selecting a suitable scaling rate yields both high classification accuracy and a high reduction rate. As a result, although NIS does not use class information, it can rapidly deliver high accuracy rates and reduction rates on many data sets. Overall, the results show that the reduction rate decreases as the classification accuracy of the algorithms increases. Besides, better results can be obtained by adjusting the parameter values of the algorithms. The important point is the trade-off between accuracy, reduction, and speed-up. NIS promises to be a faster algorithm and to reduce more instances than the others for any data set. Furthermore, the results show that the classification accuracy can be further improved by changing the scaling rate parameter.

Fig. 3

The comparative results of the algorithms in terms of accuracy rate, running time, and reduction rate

The proposed method (with the default scaling parameter) cannot reduce instances on 8 out of 55 data sets. For example, the mean range of the dimensions on the arrhythmia data set is 11.12 and the number of dimensions is 279. Substituting these values gives \(m\left( \frac{1}{N} \right)^{d} = 452\left( \frac{1}{11} \right)^{279} \approx 0\). To overcome this problem, the scaling rate parameter can be adjusted. For instance, the accuracy and reduction rates are 56.64% and 3.76%, respectively, for \(\alpha = 0.07\). The scaling rate parameter changes the range, i.e., \(N\); thus, the mean range becomes 0.73 for \(\alpha = 0.07\). As a result, tuning the scaling rate is the key to dealing with high dimensionality and large ranges. Besides, the mean running time of the proposed method is very close to 0. While the proposed method ensures a low running time, it also achieves high reduction rates. The summarized information about whether there are statistically significant differences between the results is shown in Table 3. According to the Kruskal–Wallis test results, no accuracy rates have mean ranks significantly different from each other. According to Friedman's test, there is a significant difference between the accuracy rates: the accuracy rate of the baseline is significantly different from those of NIS, BPLSH, and Wilson's ENN, but there is no significant difference between the baseline and NIStuned, DR.LSH, LSH-IS-S, and LSH-IS-F. Besides, according to both tests, there are significant differences in terms of running time and reduction rate.

Table 3 The mean rank results of the Kruskal–Wallis test and Friedman's test (small values of p weaken the validity of the null hypothesis, which states that there is no significant difference between the results)

Table 4 shows the performance results of the instance selection methods on the Poker Hand data set in terms of accuracy rate, reduction rate, and running time. The Poker Hand data set has 1,025,010 instances and 10 features. From the results, the accuracy rates of all the instance selection methods are almost the same as the baseline. The accuracy rate difference between NIS and LSH-IS-S is 1.22%; this difference can be decreased by tuning the scaling rate parameter of NIS. NIS has the highest reduction rate at 43.09%, whereas DR.LSH, LSH-IS-S, and LSH-IS-F deliver the lowest reduction rate at 0.22%. Besides, NIS has the lowest running time at 1.17 s, while Wilson's ENN is the slowest method at 78,123.37 s. Among the hashing-based methods, LSH-IS is faster in terms of running time. As a result, the other methods get worse in terms of running time as the size of the data set increases. Consequently, NIS can yield good performance in terms of the accuracy rate-reduction rate dilemma and running time on a large data set.

Table 4 The performance results of the instance selection methods on the Poker Hand data set in terms of accuracy rate, reduction rate, and running time

Table 5 shows the comparative results of the instance selection methods on selected data sets in terms of accuracy rate, reduction rate, and running time. The scaling rate of NIS is 1 on 4 out of the 10 data sets, and the mean of the tuned scaling rates is 2.1. We have adjusted the scaling rate with one-decimal-digit precision; accordingly, we point out that one does not need to spend much time searching for a suitable scaling rate. From the results, Wilson's ENN has the highest accuracy rate on 4 out of 10 data sets. Besides, Wilson's ENN ranks first in terms of average accuracy rate and NIS ranks second; however, we underline that the classification accuracy performances of the methods are almost the same. NIS yields the highest reduction rate on 6 out of 10 data sets, and DR.LSH ranks second after NIS in terms of the number of highest reduction rates. Moreover, NIS ranks first and BPLSH ranks second in terms of average reduction rate. Furthermore, NIS is superior to the other methods in terms of running time, since it delivers the lowest running time on all the data sets. LSH-IS-S and LSH-IS-F are the second fastest instance selection algorithms after NIS, both on individual data sets and in terms of average running time. Wilson's ENN is the slowest instance selection algorithm. Briefly, NIS rapidly processes all the data sets. Regarding the characteristics of the data sets, as the number of instances increases, the running time of Wilson's ENN increases excessively; BPLSH and DR.LSH are affected by this as well, whereas NIS, LSH-IS-S, and LSH-IS-F are much less affected. BPLSH and DR.LSH are affected more negatively by the number of dimensions than by the number of instances. Wilson's ENN is impacted more unfavorably than the other algorithms by both the number of dimensions and the number of instances. Further, BPLSH is more influenced by the number of classes than the other algorithms, excluding Wilson's ENN.

Table 5 The comparative results of the instance selection methods on some data sets in terms of accuracy rate, reduction rate, and running time (the data sets are denoted by their number)

Figure 4 shows the accuracy and reduction rates obtained over the ten data sets according to the change in the scaling rate of NIS. The scaling rate has been varied in the range of 0.1 to 5 in increments of 0.1. According to the results, a high accuracy rate is obtained on the Nomao, HTRU2, and Human activity data sets when the scaling rate is 0.1; in other words, a high accuracy rate can also be achieved at high reduction rates. Besides, the scaling rate parameter does not affect the running time of the proposed algorithm; only the number of instances and the number of dimensions of a data set influence it. Accordingly, both high accuracy and high reduction rates can be delivered by a suitable scaling rate. Further, it is generally not necessary to search over large scaling rates to obtain an accuracy rate-reduction rate trade-off. Considering the experimental results, the optimal scaling rate for a data set can be sought in the range of 0 to 5 (zero not included), and an appropriate trade-off between accuracy rate and reduction rate can mostly be obtained when the scaling rate is in the range of 0 to 2 (zero not included). Finally, these results show that high accuracy rates cannot always be obtained at the same time as high reduction rates. Therefore, either the accuracy rate or the reduction rate should be sacrificed, or a common balance point should be sought for both.
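The search over the scaling rate described above can be sketched as follows. This is again a self-contained Python illustration of ours rather than the authors' code: the breast cancer data set, the single holdout split, and the simple sum of accuracy and reduction used to rank candidates are all arbitrary choices of this sketch.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)             # stand-in data set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

def nis(X, alpha=1.0):
    # same conjectural hyper-rectangle sketch as before (Eq. (2), rounding assumed)
    stds = np.where(X.std(axis=0) == 0, 1.0, X.std(axis=0))
    cells = np.round(alpha * (X - X.min(axis=0)) / stds).astype(int)
    return np.sort(np.unique(cells, axis=0, return_index=True)[1])

best = None
for alpha in np.arange(0.1, 5.01, 0.1):                 # 0.1 to 5 in steps of 0.1
    kept = nis(X_tr, alpha)
    acc = KNeighborsClassifier(n_neighbors=1).fit(X_tr[kept], y_tr[kept]).score(X_te, y_te)
    red = 1 - kept.size / len(X_tr)
    score = acc + red                                    # naive way to weigh the trade-off
    if best is None or score > best[0]:
        best = (score, round(float(alpha), 1), acc, red)

print(f"alpha={best[1]}: accuracy {best[2]:.2%}, reduction {best[3]:.2%}")
```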

Fig. 4

The accuracy and reduction rates obtained over the ten data sets according to the change in the scaling rate of NIS

In consequence, the proposed method can cope with big data rapidly in comparison with the other methods. Furthermore, the proposed algorithm yields larger reduction rates on average. Besides, higher classification accuracies can be obtained through the accuracy rate-reduction rate trade-off without decreasing the reduction rates too much. In addition, the idea of the scaling rate is the key to handling large data sets. Additionally, NIS achieves the best running time on all data sets across different domains and characteristics. In short, the advantages of NIS are as follows:

  • Data sets with huge volumes can be processed very fast.

  • It can conduct both supervised and unsupervised learning tasks.

  • It can easily manage the accuracy rate-reduction rate trade-off through a single parameter.

  • Simultaneously, it can provide both high accuracy and high reduction rates on the data sets.

  • NIS can also deliver successful results on imbalanced data sets.

  • Better outcomes on high-dimensional data sets can be obtained by adjusting the scaling rate parameter.

The disadvantages of NIS are as follows:

  • Since it selects the first instance in a hyper-rectangle without any criterion, the accuracy rate and reduction rate can be affected by this choice.

  • Some time must be spent finding a suitable scaling rate value for the data set.

6 Conclusion

In this study, we propose a new unsupervised instance selection algorithm called NIS. We have measured the success of NIS using fifty-five data sets from different domains and compared NIS with one conventional and four state-of-the-art instance selection algorithms from the recent literature. NIS yields a more desirable classification accuracy-reduction rate trade-off compared to the other algorithms. The accuracy and reduction rates of NIS are 80.24% and 35.50%, respectively. The closest competitor of NIS in terms of the accuracy-reduction trade-off is LSH-IS-F, whose accuracy and reduction rates are 81.69% and 15.10%, respectively. Additionally, when we roughly adjust the scaling rate of NIS, we obtain accuracy and reduction rates for NIStuned of 81.99% and 30.46%, respectively; in this case, the average scaling rate of NIS is 1.48. According to the experimental results, the optimum scaling rate for a data set can be searched for in the range of 0 to 5 (zero not included). NIStuned is superior to the other methods on 7 out of 55 data sets, while the numbers of data sets on which BPLSH, DR.LSH, LSH-IS-S, LSH-IS-F, and Wilson's ENN are superior to their competitors are 8, 8, 8, 12, and 16, respectively. Moreover, the numbers of data sets on which NIStuned, BPLSH, DR.LSH, LSH-IS-S, LSH-IS-F, and Wilson's ENN are the best among their competitors in terms of reduction rate are 20, 3, 3, 9, 9, and 21, respectively. The running time of NIS is quite low, and it can rapidly handle huge data sets. The average, maximum, and minimum running times of NIS are 0.0332, 1.1673, and 0.0003 s, respectively. The closest competitor of NIS in terms of running time is LSH-IS-F, whose average, maximum, and minimum running times are 0.6663, 29.9893, and 0.0007 s, respectively. NIS and NIStuned are faster than the other methods on all the data sets. Although NIS is an unsupervised algorithm, it can rapidly yield high accuracy and reduction rates over many data sets. In general, the reduction rate decreases while the accuracy rate of the methods increases. The key ability of the algorithms is to achieve a fitting trade-off between accuracy rate, reduction rate, and speed-up. NIS proves to be a faster algorithm and reduces more instances than the other methods while still achieving high accuracy rates on the data sets. Future work may involve developing a supervised version of the proposed method. Thus, the accuracy rates of the algorithm on data sets can be improved further, while keeping the reduction rates, by detecting instances from different classes located in the same region of the space.