1 Introduction

Missing values are a common problem and an important issue in the domain of data science and machine learning. Most off-the-shelf statistical and machine learning methods cannot learn from data containing missing values, and so prior to analysis or learning, either all instances with missing values must be removed or missing value imputation (MVI) must be performed. When many values are missing, the first approach of considering only complete instances (no missing values) can lead to a significant loss of information or even an empty dataset, and thus MVI becomes important.

Indeed, missing values can occur in many or most training samples, especially when there are substantially more features (p) than samples (N), i.e. when \(p \gg N\). Examples of this scenario include medical and bioinformatics arrays, classification problems in astronomy, tool development for finance data, and weather prediction (Johnstone and Titterington 2009).

Missing value patterns are traditionally classified into three types (Santos et al. 2019). Where values are Missing Completely at Random (MCAR), the presence or absence of missing values does not depend on observed or unobserved data. In the Missing at Random (MAR) case, the missingness is dependent on observed variables but not on the missing values themselves; and values Missing Not at Random (MNAR) are missing with probability dependent on the values that are missing. Like the methods we will refer to, we will assume that data follows the MCAR pattern.

In this paper we first propose a unifying framework for MVI that explicitly distinguishes between two types of approaches: 1) the procedural approach, where each value is imputed only once, and 2) the iterative approach, in which values are successively re-imputed until convergence.

Within this framework, we propose a novel MVI method, Autoreplicative Random Forests, that does not require a large number of instances to obtain accurate results and that leverages the statistical dependence among the surrounding features to predict the target missing values. The proposed approach can be carried out in either a procedural or an iterative fashion. To the best of our knowledge, using multi-label models in such an autoreplicative fashion without explicit encoding has not been studied in the literature.

We empirically demonstrate the advantages of this method in terms of marginal accuracy, joint accuracy, and likelihood. Furthermore, we show the computational efficiency of this method when dealing with a small number of instances.

We also notice that many MVI methods do not explicitly consider the underlying distribution, in particular the joint distribution. That is to say, there is little work that explicitly considers joint imputation. Rather, existing approaches simply do MVI in the view that each imputation will be treated independently (of other imputations) and identically (to existing/non-imputed values). Furthermore, they do not explicitly model the associated uncertainty of such imputation. To approach this task, we further propose in our general framework a probabilistic imputation method, distributional iterative Autoreplicative Random Forest (ditARF), that takes uncertainty into account during MVI iterations and provides a corresponding estimate of uncertainty (or, inversely speaking, confidence) associated with final imputations, both value-wise (marginal) and instance-wise (jointly).

We consider Autoreplicative Random Forests for MVI as a multi-label predictive method, which allows us not only to exploit target interdependencies but also to substantially reduce the time complexity when compared to leave-one-out schemes such as MICE. As we show in the empirical evaluation, itARF consistently outperforms its iterative competitors in terms of computation time while maintaining high imputation quality.

In this work, we focus on categorical features and treat MVI as a multi-label multi-output classification problem. Nothing in the proposed framework fundamentally prevents it from working with continuous features, and we believe that it could be easily adapted to such data. In the meantime, the categorical approach can be applied to continuous data via feature discretization. Data discretization is known to be an effective approach to regression in some contexts (Dougherty et al. 1995), particularly where interpretation is required.

To sum up, we contribute to the state-of-the-art with the following:

  • We propose a general MVI framework, incorporating both procedural and iterative imputation strategies;

  • In this framework, we identify weaknesses of existing methods and propose Autoreplicative Random Forests (ARF), represented by a variant in both procedural (pARF) and iterative (itARF) strategies;

  • We propose distributional iterative ARF (ditARF), a probabilistic variant that provides confidence over imputation hypotheses, both under the assumptions of marginal (individual) and joint (combinatorial) imputations;

  • We demonstrate the effectiveness of our proposed methods when compared to a range of competing methods on both standard-dimensional and high-dimensional (\(p \gg N\)) data.

The rest of the paper is organized as follows. Together with summarizing the background and related work, we present an imputing framework unifying different methods in Sect. 2. We expand this framework with a group of new methods, pARF, itARF, and ditARF, in Sect. 3. The results and their discussion as well as complexity analysis are described in Sect. 4. In Sect. 5, we draw conclusions and describe future work.

2 A general framework for missing value imputation

In this section, we describe a general framework unifying different approaches for MVI.

First, in Sect. 2.1, we formalize the problem and set out our notation. In Sect. 2.2 we classify the existing strategies as procedural and iterative, thus laying the foundation on which to consider related work, which we do in detail in Sect. 2.3. We then propose our novel methodology in Sect. 3.

2.1 Features, missing and imputed values

We define a dataset \(\mathcal {D} = \{X \cup \tilde{X} \}\), consisting of N rows (instances) and p columns (features), with observed and missing values denoted as X and \(\tilde{X}\), respectively. For example (\(N=5\), \(p=3\)),

$$\mathcal {D} = \begin{bmatrix} \tilde{x}_{1,1} &{} \tilde{x}_{1,2} &{} x_{1,3} \\ x_{2,1} &{} \tilde{x}_{2,2} &{} x_{2,3} \\ \tilde{x}_{3,1} &{} x_{3,2} &{} x_{3,3} \\ x_{4,1} &{} x_{4,2} &{} x_{4,3} \\ x_{5,1} &{} x_{5,2} &{} x_{5,3} \\ \end{bmatrix} \quad \text {where} \tilde{X} = \{\tilde{x}_{1,1},\tilde{x}_{1,2},\tilde{x}_{2,2},\tilde{x}_{3,1}\},$$

where \(x_{i,j}\) stands for an existing value in the dataset, and \(\tilde{x}_{i,j}\) implies that such a value is not yet known/realized (i.e. it is missing).

We further denote by \(\dot{x}_{i,j}^{[t]}\) the imputation of a missing value at iteration t. In addition, \(\varvec{x}_i\) corresponds to the i-th instance of the dataset \(\mathcal {D}\) and \(\varvec{\dot{x}}_i^{[t]}\) is the i-th instance after imputation t.

A model h (e.g. Autoencoder, Random Forest, ...) is parametrized by \(\varvec{\theta }\), and \(p_t(\dot{\varvec{x}}^{[t]}_i \mid \dot{\varvec{x}}^{[t-1]}_i, \varvec{\theta })\) is the probability that random vector \(\tilde{\varvec{x}}_i\) takes value \(\dot{\varvec{x}}_i^{[t]}\) at iteration t.

The missing value types can be formalized as follows. As above, let \(\mathcal {D} = \{X \cup \tilde{X} \}\) be a dataset, where X and \(\tilde{X}\) represent observed and missing data, respectively, and let \(X_{i,j}\) denote the value of observation i for variable j.

Let \(\mathcal {R}\) represent the indicator matrix where \(R_{i, j} = 1\) if \(X_{i, j}\) is observed and \(R_{i,j} = 0\) if \(X_{i, j}\) is missing.

Further, let \(P(\mathcal {R}_{i,j} = 0)\) be read as the probability that the j-th feature of the i-th row is missing. We consider the MCAR (Missing Completely At Random) framework, where \(\mathcal {R}_{i,j}\) follows a Bernoulli distribution with unknown parameter \(\pi\): \(\mathcal {R}_{i,j} \sim P_\pi (\cdot )\); the parameter is unknown, but each indicator is assumed to be independent of all other missingness. On the other hand, MAR (Missing At Random) is the case where \(\mathcal {R}_{i,j} \sim P_\pi (\cdot \mid \varvec{x}_i)\) (i.e., missingness depends on other observed features). Finally, the MNAR (Missing Not At Random) scenario is when \(\mathcal {R}_{i,j} \sim P_\pi (\cdot \mid \varvec{\tilde{x}}_i)\), i.e. missingness depends on the missing values themselves.
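As a minimal sketch of the MCAR case, the indicator matrix can be simulated by drawing each entry independently. The function names `mcar_indicator` and `apply_missingness` are hypothetical, and we assume here that \(\pi\) denotes the per-entry probability of being missing, with the convention that an entry of 1 means observed and 0 means missing.

```python
import random


def mcar_indicator(n_rows, n_cols, pi, seed=0):
    """Draw an indicator matrix R under MCAR.

    Assumption: pi is the probability that an entry is missing.
    R[i][j] = 1 means observed, R[i][j] = 0 means missing.
    Every entry is drawn independently of the data and of all other entries.
    """
    rng = random.Random(seed)
    return [[0 if rng.random() < pi else 1 for _ in range(n_cols)]
            for _ in range(n_rows)]


def apply_missingness(data, R, missing=None):
    """Replace entries flagged as missing (R[i][j] == 0) by a placeholder."""
    return [[v if r == 1 else missing for v, r in zip(row, r_row)]
            for row, r_row in zip(data, R)]


R = mcar_indicator(5, 3, pi=0.3)
```

MAR and MNAR would require the draw to depend on the observed or on the missing values, respectively, which this sketch deliberately does not do.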

The aforementioned scenarios are frequently encountered in real-world situations. For instance, MCAR is commonly found in biological data and, in particular, in the Single Nucleotide Polymorphism (SNP) data used in the experiments in this paper. Often, some of the numerous features obtained from the genome are not valid due to a failure of the tests, a failure of the machines that carry out the analyses, or a mistake by the practitioner. These faults cannot be directly associated with any specific cause and are considered random. At the same time, if the source can be recognized, for instance because there is a faulty machine, the MNAR scenario may arise, e.g. if a specific machine is prone to output samples as positive rather than negative due to faulty behavior. Finally, MAR can be described as the case when the machine is not able to analyze a specific type of individual. Hence, the dataset would result in the missingness of an entire observation due to its nature.

2.2 Missing value imputation: procedural vs iterative

In this work, we distinguish between two ways in which missing values can be imputed: procedurally or iteratively. Procedural methods impute values only once, based on the observed values. Iterative methods first impute values randomly and then update these imputations until some convergence criterion is met. We note that a method belonging to one of these families might be easily adapted to the other.

A general schema for the procedural methods is given in Algorithm 1. The following example illustrates one-shot row-wise procedural imputation (blue represents training samples):

$$\begin{bmatrix} \tilde{x}_{1,1} &{} \tilde{x}_{1,2} &{} x_{1,3} \\ x_{2,1} &{} \tilde{x}_{2,2} &{} x_{2,3} \\ \tilde{x}_{3,1} &{} x_{3,2} &{} x_{3,3} \\ x_{4,1} &{} x_{4,2} &{} x_{4,3} \\ x_{5,1} &{} x_{5,2} &{} x_{5,3} \\ \end{bmatrix} \Rightarrow \begin{bmatrix} \dot{x}^{[1]}_{1,1} &{} \dot{x}^{[1]}_{1,2} &{} x_{1,3} \\ x_{2,1} &{} \dot{x}^{[1]}_{2,2} &{} x_{2,3} \\ \dot{x}^{[1]}_{3,1} &{} x_{3,2} &{} x_{3,3} \\ x_{4,1} &{} x_{4,2} &{} x_{4,3} \\ x_{5,1} &{} x_{5,2} &{} x_{5,3} \\ \end{bmatrix}$$
Algorithm 1 General framework for procedural imputation

Algorithm 2 General framework for iterative imputation

Algorithm 2 summarizes the general schema for iterative imputation, and the following example illustrates this approach (all samples are used for training):

$$\begin{bmatrix} \tilde{x}_{1,1} &{} \tilde{x}_{1,2} &{} x_{1,3} \\ x_{2,1} &{} \tilde{x}_{2,2} &{} x_{2,3} \\ \tilde{x}_{3,1} &{} x_{3,2} &{} x_{3,3} \\ x_{4,1} &{} x_{4,2} &{} x_{4,3} \\ x_{5,1} &{} x_{5,2} &{} x_{5,3} \\ \end{bmatrix} \Rightarrow \begin{bmatrix} \dot{x}^{[0]}_{1,1} &{} \dot{x}^{[0]}_{1,2} &{} x_{1,3} \\ x_{2,1} &{} \dot{x}^{[0]}_{2,2} &{} x_{2,3} \\ \dot{x}^{[0]}_{3,1} &{} x_{3,2} &{} x_{3,3} \\ x_{4,1} &{} x_{4,2} &{} x_{4,3} \\ x_{5,1} &{} x_{5,2} &{} x_{5,3} \\ \end{bmatrix} \Rightarrow \ldots \Rightarrow \begin{bmatrix} \dot{x}^{[t]}_{1,1} &{} \dot{x}^{[t]}_{1,2} &{} x_{1,3} \\ x_{2,1} &{} \dot{x}^{[t]}_{2,2} &{} x_{2,3} \\ \dot{x}^{[t]}_{3,1} &{} x_{3,2} &{} x_{3,3} \\ x_{4,1} &{} x_{4,2} &{} x_{4,3} \\ x_{5,1} &{} x_{5,2} &{} x_{5,3} \\ \end{bmatrix} \Rightarrow \ldots$$

and so on for \(t+1,t+2,\ldots\) until convergence is established.
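The iterative schema can be sketched as a loop around a pluggable estimator. This is an illustration under stated assumptions, not the paper's Algorithm 2: the names `iterative_impute` and `column_mode_predict` are hypothetical, the stand-in estimator simply predicts each column's most frequent value, and the loop stops when no imputation changes or after `max_iter` rounds. It also assumes every column has at least one observed value.

```python
import random
from collections import Counter


def column_mode_predict(data, mask):
    """Stand-in estimator: 'train' on the current table and predict, for
    every cell, the most frequent value of its column. The mask is unused
    here but kept so that other estimators can share the same signature."""
    n_cols = len(data[0])
    modes = [Counter(row[j] for row in data).most_common(1)[0][0]
             for j in range(n_cols)]
    return [list(modes) for _ in data]


def iterative_impute(data, mask, fit_predict, max_iter=10, seed=0):
    """mask[i][j] is True where data[i][j] is missing."""
    rng = random.Random(seed)
    n_cols = len(data[0])
    # Observed values per column, used for the random initial imputation.
    observed = [[row[j] for row, m in zip(data, mask) if not m[j]]
                for j in range(n_cols)]
    current = [[rng.choice(observed[j]) if m[j] else row[j]
                for j in range(n_cols)]
               for row, m in zip(data, mask)]
    for _ in range(max_iter):
        predicted = fit_predict(current, mask)
        # Only missing cells are overwritten; observed cells never change.
        updated = [[predicted[i][j] if mask[i][j] else current[i][j]
                    for j in range(n_cols)]
                   for i in range(len(current))]
        if updated == current:
            break
        current = updated
    return current


T, F = True, False
data = [[None, None, 1], [0, None, 1], [None, 1, 1], [0, 1, 1], [0, 1, 0]]
mask = [[T, T, F], [F, T, F], [T, F, F], [F, F, F], [F, F, F]]
result = iterative_impute(data, mask, column_mode_predict)
```

Any multi-output learner could be substituted for `column_mode_predict`, provided it returns a full table of predictions.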

2.3 Missing value imputation: related work

We will now review existing work from the literature on MVI, roughly categorized into the above-mentioned approaches, procedural and iterative.

2.3.1 Basic statistical imputation

The procedural methods range from rather simple ones such as replacement with the column-wise mean, mode, or median statistics (Little and Rubin 2019) to techniques such as k-Nearest Neighbours (kNN) (Schwender 2012) and Cascade Imputation (CIM) (Montiel et al. 2018), leveraging machine learning methods. While the kNN method processes the data row-wise, i.e. extracting information from the k instances that are most similar to the one whose missing values need to be replaced, CIM first rearranges data, so that missing values may be imputed block by block.

2.3.2 Denoising autoencoders (DAE)

Denoising Autoencoders (DAE) (Vincent et al. 2008) can be considered state-of-the-art for MVI; here, missing values are treated as ‘noise’. They may be trained either procedurally on complete data (ignoring missing values) or iteratively, with missing values randomly imputed first (i.e., as noisy values), and then iteratively re-trained on updated re-imputed data until convergence, as in, e.g., (Seo et al. 2022; Wright 2015; McCoy et al. 2018). Principal Component Analysis (PCA), which can be seen as a special (linear) autoencoder, has been used in a similar way (Dray and Josse 2015), often via singular value decomposition (SVD), e.g., (Troyanskaya et al. 2001).

Classical Autoencoders implemented within neural networks architecture consist of encoder and decoder structures as illustrated in Fig. 1a. A classical Autoencoder learns to embed the input data into a hidden representation aiming at a low reconstruction error when decoded. Denoising Autoencoders try to eliminate noise from the data by first manually corrupting the input data, embedding such input into a hidden representation, and then performing reconstruction (see Fig. 1b). The reconstruction error is computed against the clean input data and hence, the Autoencoder learns to clean noisy data or, in other words, impute missing values. While the inner structure of hidden layers of these approaches can be very different, the typical common property is having at least one narrow middle layer H (so-called bottleneck) to restrict the model to learning only important information from the data.

Fig. 1 (a) Classic Autoencoder, which embeds the input data into a hidden representation H, often referred to as bottleneck; (b) Denoising Autoencoder, where input is corrupted with noise or missing values as \(X \cup \tilde{X}\) before encoding; and (c) Autoreplicator, or an Autoencoder without an explicit encoding, where we propose to bypass hidden representation and directly reconstruct input from its corruption \(X \cup \tilde{X}\). In all cases, the goal is to minimize the difference between input X and its reconstruction Z

2.3.3 Multiple imputation with chained equations (MICE)

Another well-known approach for MVI is Multiple Imputation with Chained Equations (MICE) (Buuren and Groothuis-Oudshoorn 2011). MICE, like iterative DAE approaches, also initially imputes the missing values randomly, but proceeds in a column-wise leave-one-out scheme using an off-the-shelf method (base learner). The procedure is further independently repeated with the goal to obtain multiple candidate values for imputation, i.e., an ensemble-like approach. The well-known MissForest imputer (Stekhoven and Buhlmann 2011) can be seen as a special case of MICE with Random Forest chosen as a base learner.

We also mention the idea of applying Gaussian Processes for MVI proposed in Jafrasteh et al. (2023). Gaussian Processes are non-parametric models that output predictive distributions for target variables, which can also be considered as uncertainty estimates. Sparse Gaussian Processes have been applied for MVI leveraging the idea of the MICE method, where the features are processed one by one in a cascaded fashion. While obtaining promising results, the proposed method has a much higher computational cost, both for training and prediction, than other baselines, including the already computationally expensive MICE. Also, in Jafrasteh et al. (2023), Gaussian Processes are applied to continuous variables and assume a normal distribution for each, which may not always be the case.

2.3.4 Expectation-maximization and coordinate-ascent methods

The approach of iterative imputation takes the general schema of coordinate ascent, with special cases including expectation maximization (EM) and classification maximization (CM, i.e. ‘hard EM’) (MacKay 2003; Dempster et al. 1977) where the expectation step E is replaced by a hard classification step (an actual value is imputed), and the maximization step M refers to training. Inspired by this idea, many MVI algorithms start from a random or data-driven (mode, mean, median, ...) initial imputation so that any classifier can be trained on the entire dataset. Next, the missing values are predicted and the model is re-trained after each imputation. This process is repeated until convergence is reached.

2.3.5 Other approaches

We would also like to mention the work of Van Wolputte and Blockeel (2020), which uses a Random-Forest-based predictor to perform imputation in the prediction phase, assuming complete data in the training phase; this corresponds to a different problem setting. While we think that, with some extra effort, one could adapt this method to the setup described in our work and thus incorporate it into the framework, we leave this question for future research.

2.4 Summary, and framework parameters

Table 1 Some example methods as specific parametrizations of a general framework. Imputation types CW = column-wise, RW = row-wise, BW = block-wise. Strategies p = procedural, it = iterative. Method families SO = single-output, MO = multi-output. Taking uncertainty into account SI = single(standard)-imputation, MI = multiple imputation where ‘-’ indicates that it could be implemented, but we are not aware of any reference doing so; with Ensemble (MICE) or via a predictive posterior Distribution (\(\tilde{\varvec{x}} \sim p(\cdot \, | \,\dot{\varvec{x}})\)) being options

Most of the iterative methods listed above (DAE, SVD, PCA) may be considered multi-output predictive models, as they impute all missing values simultaneously. In contrast, MICE is also iterative but single-output, as it processes features consecutively in a leave-one-out manner. The MICE method is very flexible with regard to the base model, i.e. any per-feature estimator is possible. It is commonly used for different types of data, in particular clinical data, and can be considered state-of-the-art for MVI; however, we have not found substantial evidence of MICE being used for high-dimensional datasets, as the computational cost increases drastically in this setting. The CIM method mentioned above may be considered a procedural version of MICE. Table 1 summarizes the characteristics of the discussed methods.

Many variations of the methodologies shown above are possible. Although we believe that these are out of the scope of this work, we think that it is worth sharing some insights. In procedural imputation, the main characteristic is that the imputation is done only once for each missing value. Following this idea, we could introduce a row-wise imputation in which we progressively impute the missing values and relearn the imputation model after each step. With this procedure, we would add more values to the training dataset after every imputation but, at the same time, we might supply incorrect imputations to the model as ground truth. Analogously, there is a column-wise imputation that matches the MICE imputation strategy.

2.4.1 Estimator

For the MICE method, any single-output estimator can be used. As a default parameter, we use Random Forests (of 20 trees each) as they proved to be a robust and stable method, though any other classifier may be provided manually to the framework. Among multi-output methods, we propose including Autoencoders and PCA as a standard choice and Autoreplicative Random Forests as a novelty (see Sect. 3).

2.4.2 Initial imputation

For iterative methods, an initial pre-starting imputation is needed. There are several possibilities for that: imputing all missing values with a constant, imputing with modes of the values for each feature, or imputing randomly with a uniform or simulated distribution over the observed values. We use random imputation with uniform distribution over the observed values as a default value.
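The initial-imputation options listed above can be sketched as follows. The function name `initial_imputation` and the strategy labels are hypothetical. We assume that "uniform" means drawing uniformly among the distinct observed values of a feature, and that a "simulated" distribution means sampling in proportion to observed frequencies (labelled "empirical" here).

```python
import random
from collections import Counter


def initial_imputation(data, mask, strategy="uniform", constant=0, seed=0):
    """Fill cells where mask[i][j] is True; observed cells are left untouched.

    Assumes every column has at least one observed value.
    """
    rng = random.Random(seed)
    n_cols = len(data[0])
    observed = [[row[j] for row, m in zip(data, mask) if not m[j]]
                for j in range(n_cols)]

    def fill(j):
        if strategy == "constant":
            return constant
        if strategy == "mode":
            return Counter(observed[j]).most_common(1)[0][0]
        if strategy == "uniform":
            # Uniform over the distinct observed values of feature j.
            return rng.choice(sorted(set(observed[j])))
        if strategy == "empirical":
            # Proportional to the observed frequencies of feature j.
            return rng.choice(observed[j])
        raise ValueError(f"unknown strategy: {strategy}")

    return [[fill(j) if m[j] else row[j] for j in range(n_cols)]
            for row, m in zip(data, mask)]


data = [["a", None], ["a", "x"], ["b", "y"], [None, "y"]]
mask = [[False, True], [False, False], [False, False], [True, False]]
by_mode = initial_imputation(data, mask, strategy="mode")
```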

2.4.3 Number of iterations

For iterative methods, the re-imputation process stops when convergence is reached. To measure convergence, we calculate the ratio of the number of labels changed by the last imputation to the number of all missing labels. If this ratio is lower than a provided parameter \(\varepsilon\), set to 0.005 by default, we stop and return the last imputed dataset as the final estimate. However, to keep the overall complexity feasible and to avoid infinite loops, we also provide a maximum-number-of-iterations parameter, set to 10 by default.
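A minimal sketch of this stopping rule is given below. The helper names `changed_fraction` and `should_stop` are hypothetical; the defaults follow the values stated in the text (\(\varepsilon = 0.005\), at most 10 iterations).

```python
def changed_fraction(previous, current, mask):
    """Share of missing cells whose imputed label changed between two
    consecutive imputations. mask[i][j] is True where the cell was missing."""
    n_missing = sum(1 for row in mask for m in row if m)
    if n_missing == 0:
        return 0.0
    n_changed = sum(
        1
        for p_row, c_row, m_row in zip(previous, current, mask)
        for p, c, m in zip(p_row, c_row, m_row)
        if m and p != c
    )
    return n_changed / n_missing


def should_stop(previous, current, mask, iteration, eps=0.005, max_iter=10):
    """Stop on convergence (fraction below eps) or when the cap is reached."""
    return changed_fraction(previous, current, mask) < eps or iteration >= max_iter


T, F = True, False
prev = [[0, 1, 1], [0, 0, 1]]
curr = [[1, 1, 1], [0, 1, 1]]
mask = [[T, T, F], [F, T, T]]
fraction = changed_fraction(prev, curr, mask)  # 2 of 4 missing cells changed
```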

3 Autoreplicative random forests

Following the description of the general imputation framework, we first introduce a new imputation approach, Autoreplicative Random Forests. Secondly, we propose its distributional extension.

3.1 Autoreplicative random forests (ARF)

Although apparently largely overlooked in the literature, we have noticed that any model designed for multi-label prediction can be used instead of a neural network as an Autoreplicator for data denoising. One such example is a combination of Decision Trees (İrsoy and Alpaydin 2016), where the first Decision Tree is used as an encoder and the second one, conversely, as a decoder. This idea can be simplified even further: in our approach, we use a multi-output Random Forest as an estimator.

In contrast to Denoising Autoencoders, we use Random Forests as Autoreplicators without an explicit encoding/representation, as shown in Fig. 1c, i.e. simply as an off-the-shelf multi-label model. Not having an explicit latent representation in matrix form does not concern us, as for MVI we aim to directly reconstruct the input from its corruption without modeling hidden structure.

Random Forests have been selected since they are naturally multi-label and multi-class classifiers and have proved to be competitive and robust classifiers in several works (Wood et al. 2023). Such an approach can facilitate the optimization process for the model on data containing a small number of samples; at the same time, tree-based models are both efficient and simple to understand and interpret. To the best of our knowledge, this simple but efficient idea has not been well studied in the literature. We argue, however, that it deserves attention and further investigation. Applying this idea, we propose Autoreplicative Random Forests.

It is worth noting that, while we choose Random Forests as a well-known and stable multi-label method with good performance, this idea may be developed by using other multi-label methods, such as Classifier Chains (Read et al. 2011), Multilabel k Nearest Neighbours (Zhang and Zhou 2007), Random k-Labelsets (Tsoumakas and Vlahavas 2007), or Conditional Dependency Networks (Guo and Gu 2011). However, a survey of base learners is not the main objective of this paper.

In the procedural approach, hereafter referred to as procedural Autoreplicative Random Forest (pARF), we first select the complete instances X of the entire dataset \(\mathcal {D} = \{X \cup \tilde{X} \}\), corrupt them manually to \(\tilde{X'}\) with induced missing values (uniformly distributed, following the missing value ratio in the original dataset), and train an Autoreplicative Random Forest to reproduce \(Z \sim X\), i.e. to fill the missing values in \(\tilde{X'}\) by minimizing a loss function between Z and X. In other words, a multi-label Random Forest is trained to predict p outputs corresponding to X from p features corresponding to the corrupted \(\tilde{X'}\). The fitted model is then used to impute the actual missing values in the instances \(\tilde{X}\). In the iterative Autoreplicative Random Forest (itARF), values are first imputed randomly; a Random Forest is then re-trained iteratively, at iteration t receiving \(\mathcal {D} = \{X \cup \tilde{X} \}\) as input, learning to reproduce \(Z \sim \dot{X}^{[t-1]}\) as output, and storing a prediction \(\dot{X}^{[t]}\) as the new imputation.
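The data-preparation step of pARF can be sketched as below. Only the construction of the training pairs is shown: the multi-output Random Forest itself is left out, and any multi-output classifier could then be fitted to map the corrupted rows to the clean ones. The name `parf_training_pairs` and the use of `None` as the missing-value marker are assumptions made for illustration.

```python
import random

MISSING = None  # hypothetical marker for a missing cell


def parf_training_pairs(data, seed=0):
    """Build (corrupted input, clean target) pairs from the complete rows.

    The corruption is uniform over cells, at the missing value ratio (MVR)
    measured on the whole dataset.
    """
    rng = random.Random(seed)
    n_cells = sum(len(row) for row in data)
    n_missing = sum(1 for row in data for v in row if v is MISSING)
    mvr = n_missing / n_cells
    # Complete instances X: rows without any missing value.
    complete = [list(row) for row in data if MISSING not in row]
    # Manually corrupted copy X': each cell is masked with probability mvr.
    corrupted = [[MISSING if rng.random() < mvr else v for v in row]
                 for row in complete]
    return corrupted, complete, mvr


data = [[None, None, 1], [0, None, 1], [None, 1, 1], [0, 1, 1], [0, 1, 0]]
corrupted, targets, mvr = parf_training_pairs(data)
```

A model fitted on `(corrupted, targets)` would then be applied to the incomplete rows, and only their missing cells would be overwritten with its predictions.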

3.2 Distributional iterative ARF (ditARF)

A known issue of using MVI in a machine-learning pipeline is the imputation of imperfect values. Imputation is inherently imperfect; furthermore, it masks the information that values were imputed, as well as any associated uncertainty about such values.

To address this issue, methods such as MICE propose using the technique of ‘multiple imputation’, that is repeating the imputation several times independently (essentially, bootstrapping) in order to obtain multiple plausible values and run further analysis on these datasets.

Here we propose a distributional variant of ARF (ditARF) which provides a probability distribution associated with imputations, i.e., encapsulating and expressing the uncertainty associated with any imputation.

In particular, we take into account a consideration not embraced by other methods, namely a model of the joint distribution for a given instance with missing values. Whereas imputation from a marginal distribution (j-th feature) can be expressed as

$$\begin{aligned} \dot{x}_j = \mathop {\textrm{argmax}}\limits _{\tilde{x}_j} p(\tilde{x}_j \mid \varvec{x}), \end{aligned}$$
(1)

the imputation (full row/vector) from a joint distribution is expressed as

$$\begin{aligned} \varvec{\dot{x}} = \mathop {\textrm{argmax}}\limits _{\tilde{\varvec{x}}} p(\tilde{\varvec{x}} \mid \varvec{x}). \end{aligned}$$
(2)

There is an issue with a naive implementation of multi-output Random Forests as formally this model produces an empirical distribution that can be interpreted (with some generalization) as

$$\begin{aligned} p(\tilde{\varvec{x}} \, | \, \varvec{x}) = \prod _{j=1}^p p(\tilde{x}_j \, | \, \varvec{x}) \end{aligned}$$
(3)

which assumes that each imputed feature is conditionally independent of the others for a given instance. This may not be the case in real-world data, and ignoring feature dependencies can hinder the accuracy of imputation. Indeed, in certain application domains such as medicine, it may be a critical mistake to make this assumption (Gerych et al. 2021).

Consider the illustration of Fig. 2, where the joint distribution \(p(\tilde{\varvec{x}})\) only gives non-zero probability to two values (\(x_1 x_2 = 00\) and \(x_1 x_2 = 11\)), yet the marginal probability (as would be estimated by Random Forest under Eq. (3)) indicates the equal probability for all combinations (00, 01, 10, 11). This means that even though the dataset does not contain values for two of the possible combinations, a Random Forest would produce them as predictions. As an example, a Random Forest predicting gender and type of cancer may assign a male gender and the presence of ovarian cancer, which do not co-exist in reality.
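The two-variable example can be checked numerically. In the sketch below, the joint probabilities of 0.5 for each of 00 and 11 are the values implied by the description (only two combinations are possible and all marginals are equal); the variable names are illustrative only.

```python
from itertools import product

# Joint distribution over (x1, x2): only 00 and 11 have non-zero probability.
joint = {(0, 0): 0.5, (1, 1): 0.5}


def marginal(joint, j):
    """Marginalize the joint distribution onto the j-th variable."""
    out = {}
    for combo, prob in joint.items():
        out[combo[j]] = out.get(combo[j], 0.0) + prob
    return out


m1, m2 = marginal(joint, 0), marginal(joint, 1)

# Factorized model of Eq. (3): the product of the per-feature marginals.
factorized = {(a, b): m1[a] * m2[b] for a, b in product(m1, m2)}

# Combinations that receive probability mass only under the factorized model.
impossible = [c for c in factorized if joint.get(c, 0.0) == 0.0]

# Joint imputation of Eq. (2) can only return a combination that occurs.
joint_imputation = max(joint, key=joint.get)
```

Under the factorized model each of the four combinations receives probability 0.25, so 01 and 10 become as likely as the two combinations that actually occur.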

The proposed ditARF variant is similar to itARF (introduced in Sect. 3.1) and learns to predict missing values iteratively. However, at every iteration, the instances are weighted by the output joint probability.

Fig. 2 Illustrating the difference between the joint and marginal distributions of two binary missing-value variables \(\tilde{\varvec{x}} = \{\tilde{x}_1, \tilde{x}_2 \}\) of an instance. The marginal distribution (the same distribution covers both variables, having been marginalized from the joint) indicates that all combinations of values 0 and 1 are equally likely (b), even though only two such combinations would occur (a). A multi-output Random Forest may impute values from which such impossible combinations appear (c). This indicates the potential importance of joint modeling, which would produce \(\varvec{\dot{x}} \in \{00,11\}\) according to Eq. (2)

The Label Powerset (LP) method (Tsoumakas and Katakis 2007), for example, transforms each combination of output values into a unique class and thus naturally models the labels jointly. However, such an approach could not be applied in the iterative setting as initial imputation creates value combinations that may not exist in the data while closing the opportunity to learn other possible combinations in future iterations.

We can imagine adapting, for example, a less strict and more generalized version of the LP approach, the Random k-Labelsets method (Tsoumakas and Vlahavas 2007), to tackle this issue, as well as inducing some randomness at each iteration. However, these possible solutions are out of the scope of this work and we leave them for future research.
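The Label Powerset transformation discussed above can be sketched in a few lines, which also makes its limitation in the iterative setting visible: a combination that never appears in the training rows has no class and therefore can never be predicted. The function name `label_powerset_encode` is hypothetical.

```python
def label_powerset_encode(rows):
    """Map each distinct combination of output values to one class id."""
    classes = {}
    encoded = []
    for row in rows:
        key = tuple(row)
        if key not in classes:
            classes[key] = len(classes)
        encoded.append(classes[key])
    return encoded, classes


rows = [(0, 0), (1, 1), (0, 0)]
encoded, classes = label_powerset_encode(rows)

# Decoding is the inverse lookup from class id back to the combination.
decode = {class_id: combo for combo, class_id in classes.items()}
```

Here the combinations 01 and 10 have no class, so a classifier trained on `encoded` could not output them, whatever the initial imputation was.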

The proposed solution is closely related to other well-known iterative methods such as the Expectation Maximization (EM) algorithm (Dempster et al. 1977) and to more general coordinate-ascent methods (Wright 2015). Such methods find the maximum likelihood parameters for the corresponding model based on data. EM can be used, for example, to fit a mixture of Gaussian distributions, while the coordinate-ascent method just performs a linear optimization in the log-likelihood function by iteratively learning and predicting data. ditARF also maximizes the log-likelihood after each iteration, which is computed as

$$\begin{aligned} \log \mathcal {L}(\varvec{\theta } \, | \, \mathcal {D}) = \sum _{i=1}^N \max _{j=1}^p \log (p(\tilde{x}_{i,j} \, | \, \varvec{x}_{i})). \end{aligned}$$

Similar to the EM algorithm, ditARF considers a set of weights \(\textbf{w}\), one per instance,

$$w_i = \prod _j \max p(\tilde{x}_{i,j} \, | \, \varvec{x}_{i}).$$

These weights are used when training a Random Forest classifier as weights for each instance, thus giving higher weights for instances where the model is more confident about the imputation. Following the strategy of the iterative version of ARFs, a Random Forest is iteratively re-trained until it reaches convergence in terms of likelihood. When this occurs, an estimate of the joint posterior distribution \(\tilde{\varvec{x}}^{[t]} \sim P\) is obtained and hence, we provide \(p(\tilde{x}_{i, j} \, | \, \varvec{x}_i)\) as a measure of uncertainty along with the imputed missing value \(\dot{x}_{i, j}\).
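The instance weights can be computed directly from the per-feature predictive distributions. The sketch below assumes these distributions are available as one dictionary per missing feature, mapping each candidate value to its estimated probability (for instance, the per-output class probabilities of a multi-output forest); the function names are hypothetical.

```python
import math


def instance_weight(marginals):
    """w_i: product over the missing features of instance i of the highest
    predicted probability. An instance without missing features gets 1.0."""
    w = 1.0
    for dist in marginals:
        w *= max(dist.values())
    return w


def marginal_imputation(marginals):
    """Per-feature argmax, as in Eq. (1)."""
    return [max(dist, key=dist.get) for dist in marginals]


# Two missing features of one instance, each with a predicted distribution.
marginals = [{0: 0.9, 1: 0.1}, {0: 0.6, 1: 0.4}]
w = instance_weight(marginals)           # 0.9 * 0.6
imputed = marginal_imputation(marginals)
```

Instances whose imputations the model is confident about thus receive weights close to 1, and uncertain ones are down-weighted in the next training round.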

4 Experimental study

Table 2 Datasets used in experiments, p features, N samples. Discretized continuous datasets are marked with \(^{\textit{d}}\). High-dimensional datasets (\(p > N\)) are marked with \({}^{*}\). Datasets with a large number of samples are marked with \({}^{\blacklozenge }\)

In order to compare the performance of the proposed solution, we perform several experiments on real-world datasets obtained from the UCI repository (Dua and Graff 2017) as well as on three high-dimensional (\(p > N\)) Single Nucleotide Polymorphism (SNP) datasets, which we have truncated to 1000 features in order to bound the memory consumption. Most of these datasets contain categorical multinomial variables, while three of them are described by continuous variables, which we uniformly discretize to b bins (Yeast, \(b=3\); Metro, \(b=3\); Energy, \(b=2\)). We include Metro and Energy to demonstrate the methods’ performance in a setting with a large number of samples. The datasets used in the experiments are summarized in Table 2. To simulate missing values as they occur in real-world situations, we followed the MCAR strategy by corrupting a percentage of the data values. These percentages range from \(1\%\) to \(30\%\). We refer to this parameter as the Missing Value Ratio (MVR) throughout the text.
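As an illustration of the discretization step, the following sketch bins a continuous feature into b intervals. We assume here that "uniformly discretize" means equal-width binning over the observed range; the function name `discretize_uniform` is hypothetical and this is not necessarily the exact procedure used in the experiments.

```python
def discretize_uniform(values, b):
    """Equal-width binning of a continuous feature into b categories 0..b-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature falls entirely into the first bin.
        return [0 for _ in values]
    width = (hi - lo) / b
    # The maximum value would land in bin b, so it is clipped to bin b - 1.
    return [min(int((v - lo) / width), b - 1) for v in values]


bins = discretize_uniform([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0], b=3)
```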

For the purpose of evaluating the proposed solution, we consider marginal accuracy, which is also known as Hamming Score, among the imputed values; and joint accuracy also referred to as Exact Match in the literature. Formally, marginal accuracy can be defined as

$$\begin{aligned} \frac{1}{N_m}\frac{1}{p_m^i}\sum \limits _{i=1}^{N_m} \sum \limits _{j=1}^{p_m^i} \mathbbm {1}(\dot{x}_{i,j}, x_{i, j}), \end{aligned}$$
(4)

where \(N_m\) is the number of instances with missing values and \(p_m^i\) is the number of features with missing values in instance i. Similarly, joint accuracy can be defined as

$$\begin{aligned} \frac{1}{N_m}\sum \limits _{i=1}^{N_m} \mathbbm {1}(\dot{\varvec{x}}_{i}, \varvec{x}_{i}). \end{aligned}$$
(5)
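Both scores can be computed as in the sketch below. It assumes that Eq. (4) is read as a per-instance average, i.e. the fraction of correctly imputed entries is computed for each instance with missing values and then averaged over those instances; when instances differ in their number of missing entries this is not the same as the pooled fraction over all missing entries. The function names and the toy arrays are hypothetical.

```python
import numpy as np


def marginal_accuracy(X_true, X_imputed, mask):
    """Eq. (4), read as the mean per-instance fraction of correct imputations."""
    rows = mask.any(axis=1)                       # instances with missing values
    correct = (X_true == X_imputed) & mask        # correct among missing entries
    per_row = correct[rows].sum(axis=1) / mask[rows].sum(axis=1)
    return per_row.mean()


def joint_accuracy(X_true, X_imputed, mask):
    """Eq. (5): fraction of instances whose missing entries are all correct."""
    rows = mask.any(axis=1)
    all_correct = ((X_true == X_imputed) | ~mask)[rows].all(axis=1)
    return all_correct.mean()


# Toy example (illustrative only).
X_true = np.array([[0, 1, 2], [1, 1, 0], [2, 0, 1]])
X_imputed = np.array([[0, 2, 2], [1, 1, 0], [2, 0, 1]])
mask = np.array([[True, True, False],
                 [False, False, False],
                 [True, False, False]])

marg = marginal_accuracy(X_true, X_imputed, mask)   # (1/2 + 1/1) / 2 = 0.75
joint = joint_accuracy(X_true, X_imputed, mask)     # one of two rows fully correct = 0.5
```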

Finally, since MVI is usually a preprocessing step for further classification tasks, we compare the classification accuracy obtained with a Random Forest classifier trained on full data, and on imputed data. The experiments have been run 5 times and the average of the scores of all runs is used.

We compare our method against a variety of well-known approaches from the literature. Autoencoder and PCA methods are implemented using the scikit-learn (Pedregosa et al. 2011) package. We tested the performance of both procedural and iterative Autoencoders in three variants: with one hidden layer of 0.1p neurons, one hidden layer of 0.2p neurons, or three hidden layers of 0.2p, 0.1p, and 0.2p neurons respectively, where p is the number of features. The model with one hidden layer of 0.1p neurons showed slightly better performance, although the difference was not significant; we report the results of this model in what follows. The PCA method was also implemented as a neural network with one hidden layer of 0.1p neurons but with an identity activation function. The kNN method is reported with \(k=2\) neighbors, a value selected during inner validation, where it consistently outperformed other values of k.

To select the best-performing parameters, we ran an internal grid search over the parameters of Autoreplicative Random Forests. As a result, we opted to use 20 trees (base classifiers) per forest (no significant difference compared to other values), each tree trained on all p provided features (better performance than with the default \(\sqrt{p}\) parameter), and a minimum number of samples per split equal to 5. The split criterion (gini/entropy) did not show an influence on the method’s performance. Other hyperparameters of the competitors are set to the default values given in their original papers and implemented in the scikit-learn Python library.
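For illustration, the selected configuration can be written with scikit-learn's `RandomForestClassifier` as below. The mapping of "all p provided features" to `max_features=None` is our reading of the text, and the variable name `arf_forest` is hypothetical; the remaining arguments are left at the library defaults.

```python
from sklearn.ensemble import RandomForestClassifier

# Sketch of the forest configuration described above (assumed mapping).
arf_forest = RandomForestClassifier(
    n_estimators=20,        # 20 trees per forest
    max_features=None,      # consider all p features at each split, not sqrt(p)
    min_samples_split=5,    # minimum number of samples required to split a node
    criterion="gini",       # reported to have no influence (gini or entropy)
)
```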

The main aim of this experimental study is to answer the following research questions:

(a) Analyze the imputation performance of the proposed solution and its competitors in terms of marginal (entry-wise) and joint (row-wise) accuracy.

(b) Study the performance of the methods under the curse of dimensionality (when p > N).

(c) Evaluate the impact of the number of features (p) and the number of samples (N) for each of the imputation methods w.r.t. the time taken to finish their imputation as well as predictive performance.

4.1 Results and discussion

4.1.1 Imputation performance

First, we empirically evaluate the convergence of the proposed itARF method and show the results for three datasets and different missing value ratios in Fig. 3. We observe that in all cases accuracy increases monotonically and reaches its maximum after several iterations. The number of iterations is small enough to keep the computation time of itARF imputation feasible.

Fig. 3 Convergence of the itARF method (accuracy vs number of iterations)

Further, we evaluate three different initial imputation strategies for iterative methods, namely imputing with a constant (0), with the mode of each feature, and with a value selected at random from the observed values of the corresponding feature. The results illustrated in Fig. 4 suggest that imputing with a constant consistently provides the worst imputation accuracy for all iterative methods, while random and mode imputation are competitive with each other, and the best choice may depend on the dataset. In all further experiments, a random initial imputation is used.
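The three initial imputation strategies can be sketched as follows. The function name `initial_impute`, the NaN encoding of missing entries, and the toy matrix are assumptions made for the example; the sketch also assumes that every feature has at least one observed value.

```python
import numpy as np


def initial_impute(X_missing, strategy, rng=None):
    """Hypothetical helper: fill NaN entries column by column.

    strategy: "constant" (0), "mode" (most frequent observed value),
    or "random" (draw from the observed values of the same feature).
    """
    X = np.array(X_missing, dtype=float)          # work on a copy
    for j in range(X.shape[1]):
        col = X[:, j]                             # view into X
        missing = np.isnan(col)
        if not missing.any():
            continue
        observed = col[~missing]
        if strategy == "constant":
            col[missing] = 0.0
        elif strategy == "mode":
            values, counts = np.unique(observed, return_counts=True)
            col[missing] = values[counts.argmax()]
        elif strategy == "random":
            col[missing] = rng.choice(observed, size=missing.sum())
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return X


# Toy example (illustrative only).
X_missing = np.array([[0, np.nan], [1, 2], [1, np.nan], [np.nan, 2]])
X_const = initial_impute(X_missing, "constant")
X_mode = initial_impute(X_missing, "mode")
X_rand = initial_impute(X_missing, "random", rng=np.random.default_rng(0))
```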

Fig. 4 Comparison of initial imputation strategies for iterative methods on four datasets. The results are averaged across 5 different missing value ratios and 5 independent runs

Then, we evaluate marginal and joint accuracies for the imputed missing values. Table 3 summarizes the performance of all methods measured by the marginal accuracy, i.e. the percentage of correctly imputed values out of the missing ones. Table 4 shows joint accuracy, i.e. the percentage of instances where all values were imputed correctly. MICE results are not shown for the datasets with a large number of features because of excessive computation time. The ditARF method was not evaluated on the datasets with a very large number of samples for computational reasons. For the same datasets, the kNN results are not included, since the computation time of kNN is quadratic w.r.t. the number of samples (Troyanskaya et al. 2001) and thus becomes impractical when the number of samples is large.

Table 3 Marginal accuracy. The best accuracy per column is in bold. The second best accuracy is underlined. All results are rounded to 3 dp. For [it]erative (includes MICE) and [p]rocedural versions of methods
Table 4 Joint accuracy. The best accuracy per column is in bold. The second best accuracy is underlined. All results are rounded to 3 dp. For [it]erative (includes MICE) and [p]rocedural versions of methods
Fig. 5 Friedman–Nemenyi diagrams comparing the ranking of the experimentally tested methods. A lower rank is better, statistically indistinguishable methods are connected by a horizontal line

When it can be evaluated, i.e. on low-dimensional datasets, the MICE method remains very competitive. Its running time is considerably higher than that of the ARF-based methods but stays feasible when the number of features is relatively small. The procedural and iterative ARFs show competitive performance. For the Mushroom dataset, pARF shows the best results when the missing value ratio is small but fails when this ratio is large, since there is then not enough data to train a reliable model. In most cases, the itARF method, along with its ditARF modification, ranks second best. We also observe that on the Nursery dataset MICE fails to predict relevant values, while the other methods achieve better results. The Friedman–Nemenyi diagrams in Fig. 5 show the statistical significance of the differences in performance between methods, confirming that the three ARF-based methods rank among the top methods together with MICE.

The ditARF method computes the probabilities of imputed values \(p(\tilde{\varvec{x}} \, | \, \varvec{x})\) at every iteration and uses them as a per-instance measure of confidence, in the form of sample weights for the model at the next iteration. To better understand its behavior, we illustrate in Fig. 6 how the probability of class ‘1’ changes through the iterations. We observe that after several iterations each probability ‘converges’ to a certain level and continues oscillating around it. From this evidence, we conclude that the model is not overfitting (otherwise we would expect convergence to 0 or 1) and can indeed provide a distribution over the possible values for imputation.

Fig. 6 Representation of the stability of the ditARF method over a set of binary missing values. Each line represents the changes in the probability of a missing value imputation throughout the different iterations. In this case, we opted to plot \(p(\tilde{\varvec{x}} = 1 \, | \, \varvec{x})\). Each plot shows the stability of the method on a different dataset

Fig. 7 Classification accuracy gain/loss when compared to a complete dataset (smaller value = better)

Figure 7 shows the difference in classification accuracy between a Random Forest classifier learned on ground-truth complete data and a Random Forest learned on datasets imputed by different methods. The analysis is performed for the datasets that have a label to predict. First, we observe that imputation quality and subsequent classification quality do not strictly correlate. This raises the question of whether the best strategy would be to perform imputation and classification simultaneously in order to optimize the performance of both. Second, in some cases, the proposed ARF method facilitates classification compared to the MICE method even when its imputation accuracy is lower. Third, we see in the Votes dataset that the ditARF method in some cases provides significantly better accuracy even when compared to the itARF method, which supports the need to consider prediction confidence during the imputation step.

4.1.2 Imputation performance in high-dimensional datasets

We study the performance of the proposed methods and their competitors in high-dimensional settings, when \(p > N\). The marginal imputation accuracy values are given in Table 3, where the datasets of interest are marked with an asterisk (\({}^{*}\)). Results for the MICE method could not be computed due to its excessive computation time. Similarly, procedural methods cannot be used as all instances are affected by missing values. We observe that the itARF and ditARF methods systematically outperform other iterative methods such as itAE and itPCA, as well as the kNN method, and thus prove to be a competitive and powerful alternative family of approaches.

4.1.3 Time complexity analysis

Fig. 8 Empirical results on time complexity (in seconds) for imputation methods. For each method, its average running time (across 5 runs) is shown as a line, together with its minimum and maximum (borders of the colored interval). In (a) the number of features varies from 10 to 80 while the number of samples is constant; in (b) the number of samples varies from 10 to 90 while the number of features is constant. Here, ditARF is not specifically included since it is already covered by iterative ARF (itARF)

The complexity of one Decision Tree with binary features is \(\mathcal {O} \left( p N \log N \right)\) with regard to the number of features p and the number of instances N. If all the trees in an Autoreplicative Random Forest are trained on all features, the total complexity of the forest remains the same. In the MICE method, a separate model is trained per feature, thus for one iteration, the complexity of the MICE method with Random Forest base estimator becomes quadratic \(\mathcal {O} \left( p^2 N \log N \right)\).

At the same time, with a multi-label Random Forest, the total complexity remains linear. Thus, both the methods itARF and pARF provide linear complexity with regard to the number of features, as the complexity of one forest is only multiplied by the number of iterations which typically is low as convergence is reached soon.

The complexity of both single- and multi-output Random Forests remains similar with regard to the number N of samples, i.e. \(N\log N\).
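The difference between the two scalings can be made concrete with a rough operation count. The sketch below simply evaluates the asymptotic expressions given above, ignoring constant factors; the function names, the default of 20 trees, and the example sizes are illustrative assumptions, not measurements.

```python
import math


def forest_cost(p, N, n_trees=20):
    """Order-of-magnitude cost of one multi-output forest: trees * p * N * log N."""
    return n_trees * p * N * math.log2(N)


def itarf_cost(p, N, n_iter, n_trees=20):
    """itARF: one multi-output forest per iteration (linear in p)."""
    return n_iter * forest_cost(p, N, n_trees)


def mice_rf_cost(p, N, n_iter, n_trees=20):
    """MICE with Random Forest base estimator: one forest per feature and iteration."""
    return n_iter * p * forest_cost(p, N, n_trees)


# Illustrative sizes only: p = 80 features, N = 90 samples, 5 iterations.
ratio = mice_rf_cost(80, 90, n_iter=5) / itarf_cost(80, 90, n_iter=5)
```

Under these assumptions the ratio between the two counts equals p, which is the extra factor that makes the per-feature approach quadratic in the number of features.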

These theoretical estimates are well supported by the simulation study, see Fig. 8. We empirically compare the time complexity of the imputation methods on subsets of the Eucalyptus dataset under the MCAR scenario with 10% missing values. The subsets are selected as (a) the first \(p_s\) features of the original dataset, \(10 \le p_s \le 80\), and (b) the first \(N_s\) samples, \(10 \le N_s \le 90\).

Further, we assess the actual computation times of all methods and present the results for a missing value ratio of 0.01 in Table 5. In all datasets, the procedural methods and kNN are very fast, but we have seen above that they do not always produce adequate results, or simply cannot be used if there are not enough complete instances for training. Also, across all datasets, the MICE method is several times slower and is not applicable when the number of features increases. At the same time, we observe that the itAE and itPCA methods often also require significant time while not necessarily producing high imputation accuracy.

Table 5 Computational time (in seconds) for the experiments with missing value ratio 0.01. Median times for 5 independent runs are shown. All times are rounded to 3 dp

5 Conclusions and future work

In this work, we propose a general framework for missing value imputation and analyze in depth the literature on missing value imputation schemes. We identify that, while there exist multi-output missing value imputation methods such as Autoencoders, this idea can be applied to any multi-output machine learning method but has not yet been presented in the literature.

Developing this idea, we propose multi-output Autoreplicative Random Forests (ARFs) for accurate missing value imputation, in three different variants. First, we propose procedural ARF (pARF), which leverages the idea of Denoising Autoencoders for missing value imputation and imputes the missing values only once. Second, we propose iterative ARF (itARF). The proposed itARF approach works as a deterministic iterative imputation method that not only obtains results competitive with the state-of-the-art methods but also drastically outperforms them in terms of computational time. We have shown that these approaches can provide significant improvements, especially when there is a lack of complete instances in the case of high-dimensional data. Moreover, we focused on the necessity of providing a measure of uncertainty with respect to the imputed missing values, and we proposed the distributional itARF (ditARF), which works similarly to the EM algorithm and estimates the posterior distribution. With the probabilistic versions of the proposed framework, we provide not only imputation for the missing values but also a measure of uncertainty, which we believe could be beneficial in numerous applications. Note that missing value imputation is commonly used in the preprocessing steps of broader machine-learning tasks. Hence, wrongly imputed values could significantly impact the subsequent learning tasks. This could be avoided by only considering confident imputations.

To evaluate the proposed solution, we have performed an extensive evaluation of the proposed and previously existing methods on low- and high-dimensional datasets, including a variety of datasets from the UCI repository and three SNP datasets. As can be seen, the proposed solutions drastically outperform existing approaches from the literature when \(p \gg N\). Finally, we have also tested the difference between training a Random Forest classifier on an imputed dataset and on ground-truth data. The results show that classifiers learned on data imputed by the ARF methods obtain accuracy similar to that of the classifier learned on ground-truth data.