1 Introduction

The missing value problem is considered one of the main challenges in many applications, and it directly affects decision making. Several solutions have been suggested for this problem. First, the simplest solution is to reduce the dataset by eliminating all samples with missing values. This is possible when large datasets are available and missing values occur in only a small percentage of samples. Second, a data miner, together with a domain expert, can manually examine the samples that have missing values and enter a reasonable, probable, or expected value based on domain experience. This solution is straightforward for small numbers of missing values and relatively small datasets. However, if there is no obvious or plausible value for a given case, manually generating a value introduces noise into the dataset. Finally, missing values can be replaced automatically with some constant, as in the solutions explained later (Han and Kamber 2006; Graham 2012; Rubin 1976). Several such replacements are possible (a short sketch in pandas follows the list):

  • Replace all missing values with a single global constant

  • Replace a missing value with its feature mean

  • Replace a missing value with its feature mean for the given class

  • Replace a missing value with the nearest observed value above or below it in the same feature.
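
As a minimal illustration of these four strategies (not the paper's tool), the following pandas sketch uses a small hypothetical dataset; the column names, fill constant and class label are our own choices:

```python
import pandas as pd

df = pd.DataFrame({
    "feature": [1.0, None, 3.0, None, 5.0],
    "label":   ["a",  "a",  "b",  "b",  "b"],
})

# (1) single global constant
global_const = df["feature"].fillna(0.0)

# (2) feature mean over all samples
feature_mean = df["feature"].fillna(df["feature"].mean())

# (3) feature mean within the sample's class
class_mean = df.groupby("label")["feature"].transform(
    lambda s: s.fillna(s.mean()))

# (4) nearest observed value above, falling back to the one below
neighbor = df["feature"].ffill().bfill()
```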

In intelligent data analysis, the researcher is often interested in discovering knowledge that has a certain predictive power. The basic idea is to predict the value that some attribute(s) will take on in "the future," based on previously observed data. However, the missing value problem introduces mistakes into the knowledge discovery process, making the discovered knowledge difficult for the user to comprehend and often resulting in an incomplete intelligent analysis of dataset behavior (Abualigah and Hanandeh 2015; Abualigah and Khader 2017; Abualigah et al. 2017; Al-Janabi 2018).

In real-world applications of data mining, even when there are huge amounts of data, the subset of cases with complete data may be relatively small. Some data mining methods accept missing values and satisfactorily process the data to reach a final conclusion. Other methods require that all values be available (Rubin 1996). The question is whether these missing values can be filled in during data preparation, prior to the application of any data mining method. Moreover, multivariate supervised classification methods such as support vector machines, and multivariate statistical analysis methods such as principal component analysis, singular value decomposition and generalized singular value decomposition, cannot be applied to data with missing values (Liew et al. 2010). Therefore, missing value estimation is an important preprocessing step.

Missing data present a problem in many fields, and different techniques have been developed to handle this problem in datasets effectively. The simplest way is to remove each entire row that contains missing values, or to replace the missing entries with the row average, the median, or even zero (Ali 2012a). However, this leads to biases, since the correlation structure between the variables is ignored. Moreover, most missing value methods are designed for continuous data; on data with a mixture of numerical and categorical variables, their implementations can break down in challenging dataset settings. Their inability to deal with complex interactions between variables prevents these techniques from tailoring the missing data handling procedure to a given set of analysis goals. In addition, it is unclear how appropriate these methods are for the missing value problem, because no universal tool exists for the different types of datasets. For these reasons, there has been much interest in using machine learning (ML) methods to estimate missing data.

Random forest (RF) is a popular classifier that has shown significant achievements on a wide range of classification problems due to its ability to deal with high-dimensional data. Although RF has been shown to have many good features, it has the potential to perform even better when used for estimating missing values (Rieger et al. 2010; Cutler et al. 2007; Hapfelmeier et al. 2012; Waljee et al. 2013; Stekhoven and Bühlmann 2012; Verikas et al. 2011; Aljarah et al. 2018). RF offers several advantages for classification problems on large datasets: it does not overfit, it can estimate feature importance during training with little additional time and, most importantly, it effectively accommodates nonlinear relations and interactions among variables (Mafarja et al. 2018; Genuer et al. 2010). The local least squares (LLS) technique has also been employed for this task, and its performance has proved effective and suitable for missing value imputation (Ali 2012b; Wasito and Mirkin 2006; Moorthy et al. 2014; Bose et al. 2013).

In this study, a new tool called DRFLLS, based on a developed random forest (DRF) algorithm and an LLS imputation strategy, is applied to find the optimal estimation of missing values. Different similarity measures are used inside the original RF, rather than depending on correlation only, to estimate the optimal number of nearest neighbors (k). LLS is then used as the local strategy to estimate the missing values based on this number of neighbors. The Pearson correlation (PC) and the normalized root mean squared error (NRMSE) are used to assess the imputation accuracy, where the value that gives the highest PC and the lowest NRMSE represents the optimal value. The similarity measure that generated that k is thus considered the best generator of the number of neighbors.

The rest of the paper is organized as follows. In Sect. 2, we summarize and analyze related work. In Sect. 3, we describe the materials and methods used in this work. In Sect. 4, the design of the proposed method is discussed. In Sect. 5, the results and evaluation measures are presented. This section is followed by the conclusion of the study.

2 Related works

In recent years, ML techniques have been designed to estimate missing values. This section reviews the previous literature related to missing value estimation and the solutions that have been proposed.

Several recent works have been proposed to handle missing data. Al-Janabi (2017) presented a solution for the missing value problem that consists of several steps: first, dataset design via "vertical" and "horizontal" decomposition; second, new constraint shorthands, "distributed key" and "distributed foreign key"; third, a new dataset updating construct, "multiple assignment" (table level); and fourth, decomposition by query to derive (an improved) PERS_INFO when needed. However, this technique cannot find the missing values effectively, because it depends on finding substitution values from the tables containing the missing values based on propositional constructions.

In other research, Bruggeman et al. (2009) proposed the PhyloPars Web server to provide a statistically consistent technique. The authors merged an incomplete set of empirical records with a species phylogeny to produce a complete set of estimated parameters for all species. Their model was extended to enable better handling of missing data, and they stated that it makes more accurate use of all available information than ad hoc alternatives. Another method for replacing missing ensemble temperature data can be seen in the work of McCandless et al. (2011). The main objective of their work is to produce a consensus forecast through statistical post-processing techniques and to determine the effect of replacing missing data on these post-processing schemes. However, this method does not explain in detail how to handle the missing value problem.

Qi et al. (2005) presented a method to measure such similarities in the task of classifying pairs of proteins as interacting or not. They used direct and indirect information about interaction pairs to construct an RF from a training dataset. The resulting RF is employed to find the similarity between protein pairs using a modified k-nearest neighbor method. Their final results demonstrate that the RF approach achieved a high level of accuracy compared to other proposed tools. Carranza and Laborte (2015) used RF to investigate its suitability for data-driven predictive modeling and to examine its ability to handle missing values using Abra data from the Philippines. The analysis results indicate that RF is useful both for data-driven prediction and for missing value handling. Pantanowitz and Marwala (2009) applied five methods to impute missing data in an HIV seroprevalence dataset. The final results show that RF is a powerful and accurate method that can successfully be applied to handle missing values.

Golub et al. (2005) proposed an imputation method based on a least squares formulation. The authors used local similarity structures in the data and a least squares optimization process to estimate the missing values in a gene expression dataset. Their experiments show that this method achieved results comparable to other missing value estimation approaches on different datasets. In another work, four imputation approaches were assessed for handling missing values in epistatic miniarray profiling (E-MAP) data (Ryan et al. 2010). Three local (nearest neighbor-based) techniques and one global (BPCA-based) technique, adapted to work with symmetric pairwise data, were used. The experimental results prove that good missing value imputation can be achieved by using LLS. The work of Chiu et al. (2013) also confirmed LLS as the best and most suitable technique for handling missing values.

3 Methods and material

3.1 Random forest

Random forest (RF) is a supervised learning technique (Breiman 2001) that is widely used to solve classification and regression problems in different domains (Heidari et al. 2018, 2019; Adam et al. 2014; Elyan and Gaber 2016; Ali 2013; Xie et al. 2009). RF is an ensemble of trees grown from bootstrapped training data. For classification, the trees are combined using majority voting, with one vote per tree over all trees in the forest; for regression, the forest prediction is obtained by averaging over the trees. The samples not selected for training a given tree are collected into a subset called out-of-bag (OOB). This subset is used to assess the generated decision trees and to estimate the classification or regression error rate of the RF (James et al. 2013).
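
As an illustration of these mechanics (not the DRF itself), scikit-learn's RandomForestClassifier grows each tree on a bootstrap sample, combines the trees by voting and reports an OOB-based score; a minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Each tree is grown on a bootstrap sample of the training data;
# class predictions are combined by voting over all trees.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

# oob_score_ is the accuracy on out-of-bag samples, i.e. the samples
# a given tree never saw in its bootstrap sample.
print("OOB error estimate:", 1.0 - rf.oob_score_)
```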

3.2 Local least squares

Local least squares (LLS) imputation consists of two steps: the first is to select k similar records, and the second is to perform regression and estimation, regardless of how the k records were chosen. Traditional LLS uses the L2-norm or Pearson correlation coefficients to select the similar records (the k neighbors) and then recovers the missing values from the k records with the largest Pearson correlation coefficients (Kumar et al. 2008).
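
A minimal sketch of the neighbor-selection step, under the assumption that the candidate records have no missing entries in the target's observed columns:

```python
import numpy as np

def select_k_neighbors(candidates, target, observed_cols, k):
    """Rank candidate rows by the magnitude of their Pearson correlation
    with the target over its observed columns, and return the top k."""
    t = target[observed_cols]
    corrs = np.array([abs(np.corrcoef(row[observed_cols], t)[0, 1])
                      for row in candidates])
    return np.argsort(corrs)[::-1][:k]

# Hypothetical usage: 4 candidate records, columns 0-2 observed, k = 2
candidates = np.array([[1., 2., 3., 9.],
                       [2., 4., 6., 8.],
                       [9., 1., 1., 7.],
                       [3., 6., 9., 6.]])
target = np.array([1.5, 3.0, 4.5, np.nan])
print(select_k_neighbors(candidates, target, np.array([0, 1, 2]), k=2))
```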

3.3 Used datasets

In this work, six datasets of different types and with different features were used, as explained in Table 1. All of these datasets suffer from the missing value problem.

Table 1 Overview of the dataset information

4 DRFLLS as a novel tool

In intelligent data analysis, we are often concerned with discovering knowledge that has a certain predictive power. The goal is to predict the value that some attribute(s) will take on in 'the future,' based on previously observed data. However, the missing value problem leads to inaccurate conclusions being drawn from the data, and the process then becomes difficult for users to understand. To overcome this problem, this work proposes a tool called developed random forest local least squares (DRFLLS) to estimate the number of neighbors and find the optimal estimation of missing values. The DRFLLS design is divided into three parts: (1) the first part aims to generate the k similar records depending on seven different similarity measures, each of which replaces the correlation function of the random forest; as a result, this part generates seven different neighbor counts, i.e., it solves the problem of selecting the number of nearest neighbors; (2) the second part takes the different values of k resulting from the previous step and applies LLS to estimate the missing values, which generates seven candidate values for each missing value, one per similarity measure; (3) the third part evaluates which of these seven estimates is optimal by applying two evaluation measures: the Pearson correlation between the predicted and actual values and the normalized root mean squared error (NRMSE). Figure 1 shows the structure of DRFLLS.

Fig. 1 Novel tool DRFLLS

Figures 2 and 3 illustrate the pseudo-code that is used to handle missing values and the procedure that is applied to build the tree in the training phase, respectively.

Fig. 2 Missing value handling

Fig. 3 Building tree in the training phase

5 Results and performance evaluation

5.1 Estimating the number of nearest neighbors

One of the remaining problems in the various missing value methods is how to select the optimal local neighborhood of the missing values (the number of nearest neighbors). Figure 4 shows the outline of the DRF pseudo-code that is used to find the optimal number of nearest neighbors.

Fig. 4 DRF used to find the optimal number of nearest neighbors

In this paper, different types of correlation or similarity functions among the trees are used to estimate the optimal number of nearest neighbors. For a given forest f, the similarity between a target record X1 and another record Xj is computed in the following way. For each of the two records, we first propagate its values down all trees within f. Next, the terminal node position of each record in each of the trees is recorded. Let Z1 = (Z11, …, Z1L) be these terminal node positions for X1, and define Zj similarly. The similarity between X1 and Xj then takes different forms, applied in parallel, as follows:

  • Random forest with Simple Similarity (RFSS)

    This measure is computed using Eq. (1), where I denotes the indicator function; a sketch of this leaf-sharing proximity follows the list.

  • Random forest with Pearson correlation (RFPC)

    For this measure, Eq. (2) is used, where \( \bar{Z}_{j} \) represents the average of the values in Zj and σj is the standard deviation. The components of X1 that correspond to missing values are not included in computing the coefficients.

  • Random forest with fuzzy similarity measure 1 (RFFSM1)

    This measure is based on the fuzzy Minkowski distance: the smaller the distance between X1 and Xj, the greater the similarity between them. RFFSM1 is computed using Eq. (3), where r → ∞.

  • Random forest with fuzzy similarity measure 2 (RFFSM2)

    It is computed using Eq. (4), where i = 2, … , L.

  • Random forest with fuzzy similarity measure 3 (RFFSM3)

    This measure requires the notion of the cardinality of a fuzzy set, given by the number of elements in the set. This concept can be extended to fuzzy sets using the sigma count, defined as in Zhou et al. (2015); Eq. (5) is used to compute this measure.

  • Random forest with fuzzy similarity measure 4 (RFFSM4)

    This measure is computed using Eq. (6).

  • Random forest with fuzzy similarity measure 5 (RFFSM5)

    This measure is calculated using Eq. (7).
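
For instance, the leaf-sharing idea behind RFSS (the indicator of two records reaching the same terminal node, as we read Eq. (1)) can be sketched with scikit-learn's apply(), which returns the terminal node index of each record in each tree; this is the standard RF proximity, shown here as an illustrative assumption rather than the paper's exact formula:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# leaves[i, t] = index of the terminal node that record i reaches in tree t
leaves = rf.apply(X)

def simple_similarity(i, j):
    """Fraction of trees in which records i and j land in the same leaf,
    i.e. the mean of the per-tree indicator function."""
    return np.mean(leaves[i] == leaves[j])

print(simple_similarity(0, 1))
```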

As a result, we obtain seven forests, each built with one of the above correlation equations, and each providing one predicted value for the number of neighbors. Table 2 presents the DRF results for the datasets used in this paper.

Table 2 Number of nearest neighbor estimation by the developed random forests method

The optimal numbers of nearest neighbors generated by DRF for a given dataset are shown in bold. The table shows that when a dataset has a low rate of missing values, the best estimate of the number of nearest neighbors is given by DRFPC, followed by DRFFSM1 with r = 4. When a dataset has a high rate of missing values, the best estimate of the number of nearest neighbors is given by DRFFSM5, followed by DRFFSM3.

5.2 Missing value estimation

The DRF method gives seven estimates of the number of nearest neighbors (k1, k2, k3, k4, k5, k6, k7). Let r1 be a record containing n features, q of which have missing values. The local least squares method handles the case in which there is more than one missing value in a record vector. The q missing values, at any locations, are recovered as follows (a numpy sketch of the procedure appears after the steps):

(A) The nearest neighbor record vectors for r1 are found first. In this process of finding similar records, the q components of each record at the same locations as the missing values in r1 are ignored.

(B) Build a matrix \( \varvec{A} \in R^{k \times (n - q)} \), a two-dimensional matrix in which the number of rows equals the number of nearest neighbors ki, i = 1, …, 7, and the number of columns equals the total number of features n minus the number of columns containing missing values, q.

(C) Build a matrix \( \varvec{B} \in R^{k \times q} \), a two-dimensional matrix in which the number of rows equals the number of nearest neighbors ki, i = 1, …, 7, and the number of columns equals the number of columns containing missing values, q.

(D) Build a vector \( \varvec{w} \in R^{(n - q) \times 1} \), a column vector whose length equals the total number of features n minus the number of columns containing missing values, q.

(E) After the matrices A and B and the vector w are formed, the least squares problem is formulated as:

$$ \min_{x} \left\| A^{\mathrm{T}} x - \mathbf{w} \right\|_{2} $$
(8)
(F) The vector \( u = (\alpha_{1}, \alpha_{2}, \ldots, \alpha_{q})^{\mathrm{T}} \) of the q missing values can then be calculated as

$$ \mathbf{u} = \begin{pmatrix} \alpha_{1} \\ \vdots \\ \alpha_{q} \end{pmatrix} = B^{\mathrm{T}} x = B^{\mathrm{T}} \left(A^{\mathrm{T}}\right)^{\dagger} \mathbf{w} $$
(9)

where \( (A^{\mathrm{T}})^{\dagger} \) denotes the pseudo-inverse of \( A^{\mathrm{T}} \). The pseudo-inverse \( A^{\dagger} \) of A can be computed from the singular value decomposition as:

$$ A^{\dagger} = [V_{1} \;\; V_{2}] \begin{bmatrix} \Sigma_{1}^{-1} & 0 \\ 0 & 0 \end{bmatrix} [U_{1} \;\; U_{2}]^{\mathrm{T}} = V_{1} \Sigma_{1}^{-1} U_{1}^{\mathrm{T}} $$
(10)

where \( V_{1} \in R^{(n - q) \times \mathrm{rank}} \), \( \Sigma_{1}^{-1} \in R^{\mathrm{rank} \times \mathrm{rank}} \) and \( U_{1}^{\mathrm{T}} \in R^{\mathrm{rank} \times k} \).

The known elements of w can be represented as

$$ \mathbf{w} \simeq x_{1} a_{1} + x_{2} a_{2} + \cdots + x_{k} a_{k} $$
(11)

where the \( a_{i} \) are the rows of A (the neighbor records restricted to the known columns) and the \( x_{i} \) are the coefficients of the linear combination found from the least squares formulation of Eq. (8).

(G) As a result, the multiple regression represents the target record, i.e., the record with multiple missing features, as a linear combination of its nearest neighbors:

$$ \mathrm{Target} = x_{1} b_{1} + x_{2} b_{2} + \cdots + x_{k} b_{k} $$
(12)

    where bk is the kth nearest neighbor and xk represents the regression coefficient corresponding to that neighbor.
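
A compact numpy sketch of steps (A)-(G), under our reading of the notation; the neighbors are assumed to be already selected and complete in the relevant columns:

```python
import numpy as np

def lls_impute(target, neighbors, missing_idx):
    """Estimate the q missing entries of `target` from its k nearest
    neighbors via local least squares (Eqs. (8)-(12))."""
    n = target.size
    known_idx = np.setdiff1d(np.arange(n), missing_idx)

    A = neighbors[:, known_idx]    # k x (n - q): neighbors, known columns
    B = neighbors[:, missing_idx]  # k x q:       neighbors, missing columns
    w = target[known_idx]          # (n - q,):    known part of the target

    # Solve min_x ||A^T x - w||_2 via the pseudo-inverse (Eqs. (8), (10))
    x = np.linalg.pinv(A.T) @ w
    # u = B^T x yields the q missing values (Eq. (9))
    return B.T @ x

# Hypothetical usage: 3 neighbors, 4 features, entry 2 missing
neighbors = np.array([[1., 2., 3., 4.],
                      [2., 3., 4., 5.],
                      [1., 1., 2., 2.]])
target = np.array([1.5, 2.5, np.nan, 4.5])
print(lls_impute(target, neighbors, np.array([2])))
```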

5.3 Performance evaluation

To assess the performance and accuracy of the DRFLLS method, two measures were used: the Pearson correlation and the NRMSE, which evaluate the agreement and the differences, respectively, between predicted and actual values:

$$ \mathrm{NRMSE} = \sqrt{\frac{\mathrm{mean}\left[\left(y_{ij}^{\mathrm{answer}} - y_{ij}^{\mathrm{guess}}\right)^{2}\right]}{\mathrm{variance}\left[y_{ij}^{\mathrm{answer}}\right]}} $$
(13)

where the mean and variance are calculated over the missing elements of the whole matrix; the \( y_{ij}^{\mathrm{answer}} \) are the known (true) values, while the \( y_{ij}^{\mathrm{guess}} \) are the predicted values.
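
A direct transcription of Eq. (13), together with the Pearson correlation, applied to hypothetical true values of artificially hidden entries and their imputations:

```python
import numpy as np
from scipy.stats import pearsonr

def nrmse(answer, guess):
    """Eq. (13): root of the mean squared error over the imputed entries,
    normalized by the variance of the true values."""
    return np.sqrt(np.mean((answer - guess) ** 2) / np.var(answer))

# Hypothetical known values vs. their imputed estimates
answer = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
guess = np.array([2.8, 1.2, 3.9, 1.1, 4.7])

r, _ = pearsonr(answer, guess)
print("Pearson correlation:", r)       # higher is better
print("NRMSE:", nrmse(answer, guess))  # lower is better
```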

In each column of Table 3, the smallest NRMSE value for a given dataset is presented in bold, indicating that the corresponding DRF correlation function is the better function for that dataset. In addition, the correlation between predicted and actual values for each dataset, based on the number of nearest neighbors yielded by the different types of DRF correlation measures, is shown in Table 4.

Table 3 Accuracy, as measured by NRMSE for all datasets
Table 4 Accuracy, as measured by correlation for all datasets

In each column of Table 4, the highest correlation value for a given dataset is presented in bold, indicating that the corresponding DRF correlation function is the best function for that dataset.

The results from the correlation measures (DRFSS, DRFPC, DRFFSM1, DRFFSM2, DRFFSM3, DRFFSM4 and DRFFSM5) and the imputation accuracy measures on the six datasets are provided. A higher correlation and a lower NRMSE score indicate a more accurate imputation. The NRMSE results are provided in Table 3, and the Pearson correlation results in Table 4. In each column of Table 3, the smallest NRMSE values for a given dataset are shown in bold, while in Table 4 the highest correlation values are shown in bold. From these experimental results, we can identify the DRF correlation function that performs best for each of the datasets used in this study.

5.4 Relative importance of each similarity measure

To further illustrate the importance of each similarity measure in the estimation process, the error generated by the DRF is bounded by

$$ \mathrm{Generalization\ Error} \le \frac{\bar{S}\left(1 - \mathrm{ST}^{2}\right)}{\mathrm{ST}^{2}} $$
(14)

where \( \bar{S} \) is the average correlation among the trees and ST² is the 'strength' of the tree classifiers, i.e., the average performance of the classifiers (Waske et al. 2010).

The error generated for each dataset, based on the number of nearest neighbors yielded by the different types of DRF similarity measures, is provided. From Table 5 and Fig. 5, we can see that if a dataset has a small number of missing values, then the best estimate of the number of nearest neighbors is obtained from DRFPC, followed by DRFFSM1 with r = 4, as in the GIS dataset, which has both the fewest missing values and the smallest generated error. If a dataset has a large number of missing values, then the best estimate of the number of nearest neighbors is obtained from DRFFSM5, followed by DRFFSM3, as in the DNA dataset, which has the most missing values.

Table 5 Error generation for each dataset based on the number of nearest neighbors yielded by different types of DRF correlation measures
Fig. 5 Relationship between different types of DRF similarity measures and error rate

In this study, we used between 10 and 120 trees in the DRF, deciding the optimal number by trial and error. Table 6 shows the error generated for different numbers of trees used in the DRF.

Table 6 Error generation by different numbers of trees in DRF

In Table 6, the number of trees ranges over [10, 120]; at each stage, 10 trees were added. In other words, each dataset was tested 12 times, starting from 10 trees and going up to 120. In each test, the cumulative error was calculated according to Eq. (14). As the number of decision trees increases, the time required by the model to calculate missing values increases, while the error rate decreases. The main aim is to reach a state of stability between the number of trees used in constructing the model and the percentage of generated error. In Table 6, it can be seen that in the Communities and Crime dataset, when the number of trees reached 80, the cumulative error was 0.112; when the number of trees increased to 90, the error was 0.110; and the error became 0.108 when the number of trees reached 100. That is, each additional 10 trees decreased the error by only 0.002. Therefore, 80 trees were selected, because this represents the state of stability: the changes in the error rate beyond this number are relatively small compared to the extra computation time. The optimal numbers of trees for the Communities and Crime, DNA, P53 MUTANTS, SPLICE and GIS datasets are 80, 100, 120, 60 and 25, respectively, as shown in Fig. 6. The value of each parameter used in DRF for each dataset and the relative importance given to each similarity measure are summarized in Table 7.
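
A sketch of this trial-and-error loop, using scikit-learn's OOB error as a stand-in for the cumulative error of Eq. (14); the 0.002 stability tolerance mirrors the Communities and Crime example above but is otherwise our own choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

prev_err = None
for n_trees in range(10, 121, 10):  # 10, 20, ..., 120 trees, as in Table 6
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=0).fit(X, y)
    err = 1.0 - rf.oob_score_
    # Stop once adding 10 more trees improves the error only marginally,
    # i.e. the "state of stability" between tree count and error.
    if prev_err is not None and prev_err - err < 0.002:
        print(f"stable at {n_trees - 10} trees (error {prev_err:.3f})")
        break
    prev_err = err
```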

Fig. 6 Relationship between different numbers of trees used in DRF and error rate for each dataset

Table 7 Relative importance of each similarity measure

As shown in the figures and tables above, the proposed DRFLLS method provides a great improvement in both accuracy and stability for missing value estimation over the different types of datasets used in this paper.

5.5 Comparing DRFLLS with other proposed methods

The results obtained with DRFLLS on the datasets used in this study were compared with the methods proposed by other researchers. The comparison of DRFLLS with these methods, with respect to the employed tools, datasets, structure and method of determining the nearest neighbors, is illustrated in Table 8.

Table 8 Comparison among the previous studies and the current study for handling missing values

5.6 Assumptions and limitations

The main assumption of this paper is that we process original datasets in which many records have missing values at different locations within the record; we do not process missing values that may occur in the post-processing stages, i.e., the clustering, association rule and decision stages. RF is considered one of the statistical tools that perform well in many fields. However, our experiments found that combining RF with similarity measures to design the DRF method increases the time complexity, because it requires many mathematical operations, as explained in Eqs. (2)-(8) above; also, when the number of trees in the RF increases, the time required to handle missing values in the training data increases. The main advantages of DRFLLS are its ability to find missing values in both small and huge datasets, and its use of different measures to determine the number of nearest neighbors; these measures, including simple similarity, the Pearson similarity coefficient and fuzzy similarity (M1, M2, M3, M4 and M5), make the tool suitable for datasets that differ in their nature, the amount of missing values and the types of features containing missing values. The combination of DRF and LLS leads to the optimal estimation of missing values by LLS.

6 Conclusion and future directions

In this study, a new DRFLLS approach is proposed for the missing value problem and for estimating the optimal number of nearest neighbors. To attain this goal, RF was developed by replacing the correlation function of the original RF with seven different similarity measures: simple similarity, the Pearson similarity coefficient and fuzzy similarity (M1, M2, M3, M4 and M5). The optimal estimation of missing values was then obtained by LLS, which depends on the numbers of nearest neighbors generated by the developed random forest (DRF). We investigated the feasibility of the new method using six datasets from different disciplines. The DRFLLS method was evaluated using two imputation accuracy measures, PC and NRMSE, where the value that yields the highest PC and the lowest NRMSE represents the optimal value. DRFLLS showed considerably high performance and accuracy on the missing value problem, and the experimental results show that the proposed DRFLLS tool improves missing value estimation in comparison with other proposed methods.