1 Introduction

Artificial Neural Networks (ANN) are a broad class of mathematical models, inspired by the neural mechanisms of animals, that have been extensively investigated in the past decades. Successful applications of ANNs have been reported in various domains such as signal processing [47], control systems [36] and computer vision [51].

Among the aspects that influence the performance of an ANN, the learning method is one of the most relevant. In this context, gradient-based learning algorithms have played a dominant role in training ANNs [45]. However, problems such as slow convergence and convergence to local minima have increased the popularity of Neural Networks with Random Weights (NNRW). NNRWs mitigate these issues by randomly assigning the hidden weights and adjusting only the output neurons during the training step. In one of the first works to analyze NNRWs, [43] reported that the weights of the output layer are the most relevant ones, and that the other weights may not need to be tuned once they are properly initialized. The authors argue that randomly setting the parameters of the hidden neurons helps to remove the redundancy of the model in parameter space and thus makes the performance less sensitive to the resulting parameters compared with other typical learning rules such as back-propagation. Similar ideas were also reported in the seminal work of Broomhead and Lowe [3], where the authors showed that the random selection of centers among the training data is a valid alternative to the sophisticated clustering algorithms used in RBF neural networks.

Since the 1990s, NNRWs have become increasingly popular and various models have been proposed. For example, in [37], the authors proposed a new NNRW model named Random Vector Functional Link (RVFL). The RVFL model differs from the model presented by Schmidt et al. [43] in that it includes a direct link between the inputs and the output layer. Later, several authors showed that the direct link may enhance the performance of the RVFL [46]. Furthermore, various authors, such as [2, 14, 15], introduced NNs with a randomly initialized hidden layer trained using the pseudo-inverse. The universal approximation capability of NNRWs was also addressed in works such as [20, 26, 40]. More recently, several authors introduced NNRWs with different characteristics, such as constructive/selective NNs [49] and deep architectures [13, 48]. Such models achieved remarkable performance and may point to promising future research directions. It is worth pointing out that the literature on NNRWs is vast and, in this paper, we do not aim to present it in detail. For a more complete survey of NNRWs, including historical perspectives and current challenges, the reader may refer to [42, 52].

Despite the vast number of successful applications [5, 39, 41], the use of NNRWs can be limited by the presence of observations with one or more missing values. Missing values can arise for many reasons, such as measurement error, device malfunction and operator failure. To overcome the problems caused by missing components, it is necessary to understand the mechanisms governing this phenomenon. According to [29], the missingness mechanism can be characterized as Not Missing at Random (NMAR), Missing Completely at Random (MCAR) or Missing at Random (MAR).

The NMAR mechanism describes the situation in which the probability of a component being missing is related to the value of that component. If missing data are NMAR, building a missingness model is the only way to obtain unbiased estimates of the missing values. In MCAR, the missingness of a component is independent of its real value and of the values of the other components in the dataset. Such a characterization is very restrictive, which makes the MCAR case unlikely in real-life problems [8]. A more reasonable assumption states that the data are MAR. In the MAR mechanism, the missingness of a component is independent of the value itself but can be related to the observed values. Under the assumption that the data are MAR, strategies for dealing with missing data in machine learning methods (including neural networks) can be grouped into three approaches: deletion of incomplete cases, imputation of missing values, and design of learning methods that can handle missing data directly.

The deletion approach is, by far, the most common [50]. In listwise deletion, instances with at least one missing attribute are eliminated, and the analysis is performed using the remaining examples. Although commonly used, listwise deletion can lead to a performance loss when the number of instances with missing attributes is significant [12].

The imputation approach consists of strategies for filling in the missing values using the information available from the observed values of the dataset. These methods can be split into single and multiple imputation procedures. In single imputation, each missing value is filled with a single estimated value. Examples of single imputation methods are the incomplete-case k-nearest neighbors imputation (ICkNNI, [17]), where the value is imputed according to the k nearest neighbors obtained with an incomplete distance metric, and Expectation Conditional Maximization (ECM, [31]). In the ECM method, a parametric model is assumed for the dataset and the EM algorithm is used to estimate the parameters of the distribution; the missing values are then imputed with their expected values conditioned on the observed values of the instance. After the imputation procedure, a standard learning algorithm can be applied.

In contrast to single imputation, multiple imputation methods assign a set of possible values, rather than a single value, to the missing attributes. Hence, these methods generate several datasets in which the complete instances are identical but the incomplete instances take different values [23]. Learning algorithms are then trained on each dataset, and a final result is obtained by combining all generated models. It is worth noting that multiple imputation methods incur a significant increase in computational cost when compared to single imputation methods. However, multiple imputation approaches can propagate the uncertainty inherent in the missing value estimation process to the final machine learning model, often leading to better results.

The third group of strategies consists of adapting learning methods to handle missing data in their formulation. Such methods avoid direct imputation procedures [12]. In [9], the authors propose a method to estimate the expected value of pairwise distances between vectors with missing data using a Gaussian Mixture Model (GMM) for the distribution of the dataset; the method is used to design a k-NN classifier for datasets with missing values. Support Vector Machine (SVM) and Gaussian Process (GP) variants for missing data were proposed in [44], where both SVM and GP are cast in a general framework for kernel methods as an estimation problem using exponential families in feature space. Also for SVMs, the work in [38] proposes a modified risk function that incorporates a probability distribution for the missing data. Mesquita et al. [33] present a method to estimate the expected value of the Gaussian kernel computed over incomplete vectors.

Considering NNRWs, methods that handle missing data directly are indeed rare. Most works available in the literature are related to the use of neural networks for single imputation, as can be seen in [1, 19]. Other authors propose single imputation methods [35] or evaluate the impact of various well-known imputation methods on the performance of NNRWs [30]. It is worth noting that these single imputation methods were not specifically designed to work with NNRWs and can be used with almost any machine learning method.

In this paper, we propose a method to design NNRW models that handle missing components directly. In the proposed framework, we treat each missing entry as a random variable whose distribution is modeled with a GMM. We use the Unscented Transform (UT) to estimate both the expected value and the variance of each entry after it goes through the hidden layer. Then, we estimate the weights of the output layer using an uncertainty-robust least squares formulation. We tested our method with two NNRW models on synthetic and real datasets. The results show that our approach outperforms other commonly used missing data treatments, especially when the number of missing components is high. The proposed method differs from other missing data treatments because it accounts for the uncertainty of the imputation process, as multiple imputation methods do, while not relying on a computationally intensive sampling procedure. In this way, our method can be seen as an alternative to single and multiple imputation: it provides more accurate estimates than single imputation methods, since it considers the uncertainty, at a reduced computational cost when compared to multiple imputation procedures.

The remainder of this paper is organized as follows. Section 2 presents a brief review of basis functions and NNRW models. Section 3 describes the proposed method to design NNRW models that are capable of handling missing values. Section 4 summarizes the steps to be followed to implement the proposed method. Section 5 presents the results obtained in numerical experiments conducted to evaluate the performance of the proposed method. Concluding remarks are given in Sect. 6.

2 Basis Expansion and Neural Networks with Random Weights

Let \({\mathcal {D}} = \{ {\mathcal {X}}, {\mathcal {Y}} \}\) be a training set with N examples, such that \({\mathcal {X}}=\{X_i\}_{i=1}^N\) and \({\mathcal {Y}}=\{Y_i\}_{i=1}^N\) are, respectively, a set of D-dimensional input points and their corresponding outputs. Assuming the existence of a continuous mapping \({\mathfrak {f}}:{\mathcal {X}} \rightarrow {\mathcal {Y}}\) between the input and the output space, we want to estimate \({\mathfrak {f}}\) from data. In many NNRW models, \({\mathfrak {f}}\) can be expressed as a linear combination of nonlinear functions (basis functions) over the input variables as follows:

$$\begin{aligned} \hat{Y_i}=\sum _{j=0}^{{\mathcal {M}}}{\beta _j \phi _j(X_i)}, \end{aligned}$$
(1)

where the function \(\phi _j(\cdot ): {\mathbb {R}}^D \rightarrow {\mathbb {R}}\) is the j-th basis function, \(\beta _j\) is its corresponding weight and \({\mathcal {M}}\) is the number of hidden neurons. Usually \(\phi _0(\cdot ) = 1\) and \(\beta _0\) is the bias term.

The resulting model, referred to as a linear basis expansion, can approximate a nonlinear function while remaining linear in the parameters. Under the condition that the basis functions are infinitely differentiable, this model can approximate any given function [6, 11, 16]. Since the model is linear with respect to the basis function outputs, the parameters are typically estimated with maximum likelihood or Bayesian techniques.

The use of different basis functions gives rise to various neural network with random weights models, such as the Feedforward Neural Network with Random Weights (FNNRW, [43]), the Random Vector Functional Link (RVFL, [36]) and the q-Generalized Random Neural Network (QGRNN, [45]), among others. To illustrate our point, we present reformulated versions of FNNRW and QGRNN according to the basis expansion model. Similar procedures can be performed for other NNRW models and require only straightforward mathematical manipulations.

The FNNRW model, originally proposed in [43], was one of the first to introduce the concept of NNRW. The FNNRW is a multilayer perceptron in which the weights between the input and the hidden neurons are randomly assigned. Such a model can be written as:

$$\begin{aligned} {\hat{Y}}_i = \sum \limits _{j=0}^{{\mathcal {M}}} \beta _j {f}(\varvec{w}_j^T X_i + c_j ) \end{aligned}$$
(2)

where \(\varvec{w}_j\) is the weight vector connecting the input layer and the j-th hidden neuron, \(\beta _j\) is the weight connecting the j-th hidden neuron and the output layer, \(c_j\) is the threshold of the j-th hidden neuron and \(f(\cdot )\) is the logistic sigmoid function. As can be noticed, in the FNNRW model the basis function is a sigmoid applied to a linear combination of the input vector.
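To make the basis-expansion view concrete, the following is a minimal sketch (not the authors' code) of FNNRW training on complete data: random hidden weights \(\varvec{w}_j\) and thresholds \(c_j\), the hidden-output matrix of Eq. (1), and output weights obtained by least squares via the pseudo-inverse. All function names are illustrative.

```python
import numpy as np

def train_fnnrw(X, y, n_hidden=100, seed=0):
    """Minimal FNNRW sketch: random hidden layer + least-squares output weights."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = rng.standard_normal((D, n_hidden))           # random input-to-hidden weights w_j
    c = rng.standard_normal(n_hidden)                # random hidden thresholds c_j
    H = 1.0 / (1.0 + np.exp(-(X @ W + c)))           # logistic hidden outputs
    Psi = np.hstack([np.ones((N, 1)), H])            # prepend phi_0 = 1 (bias term)
    beta, *_ = np.linalg.lstsq(Psi, y, rcond=None)   # output weights via pseudo-inverse
    return W, c, beta

def predict_fnnrw(X, W, c, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + c)))
    Psi = np.hstack([np.ones((X.shape[0], 1)), H])
    return Psi @ beta
```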

The QGRNN model can be written similarly. The q-Generalized Random Neural Network is a recently proposed NNRW that uses radial basis neurons and a q-Gaussian activation function. The resulting model can be expressed as:

$$\begin{aligned} {\hat{Y}}_i = \sum \limits _{j=0}^{{\mathcal {M}}} \beta _j g_j(X_i) \end{aligned}$$
(3)

where:

$$\begin{aligned} g_j(\alpha ) = e_q{(-||\alpha - \gamma _j||^2/\eta _j^2)} \end{aligned}$$
(4)

where \(\gamma _j\) and \(\eta _j\) represent, respectively, the center and the width of the j-th q-Gaussian neuron, and \(e_q\) is the q-exponential defined by:

$$\begin{aligned} e_q(\alpha ) = [1 + (1-q)\alpha ]^{\frac{1}{1-q}} \end{aligned}$$
(5)

The choice of q, the entropic index, is of fundamental importance since it results in different deformations of the Gaussian distribution. Once again, the NNRW model can be expressed as a linear basis expansion model.
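For reference, the snippet below is a direct transcription of the q-exponential and q-Gaussian basis function of Eqs. (3)-(5); clipping the base of the q-exponential at zero (to keep the result real-valued) is an assumption of this sketch, not something stated in the text.

```python
import numpy as np

def q_exponential(alpha, q):
    """q-exponential of Eq. (5); reduces to exp(alpha) as q -> 1."""
    if np.isclose(q, 1.0):
        return np.exp(alpha)
    base = 1.0 + (1.0 - q) * alpha
    return np.maximum(base, 0.0) ** (1.0 / (1.0 - q))  # clip to keep the result real

def q_gaussian(x, center, width, q):
    """q-Gaussian basis function g_j of Eqs. (3)-(4)."""
    d2 = np.sum((x - center) ** 2)
    return q_exponential(-d2 / width**2, q)
```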

3 Neural Networks with Random Weights for Datasets with Missing Values

Given a set of possibly incomplete training inputs \({\mathcal {X}}\) and the set of corresponding outputs \({\mathcal {Y}}\), we wish to estimate the weight vector \({\mathcal {B}} =(\beta _0,\dots ,\beta _{\mathcal {M}})^T\). To do so, we model the missing entries as random variables and choose \({\mathcal {B}}\) to minimize the expected sum-of-squared-errors loss function. We derive a solution in terms of the expected values and covariance matrices of the transformed inputs and, in turn, provide a method to estimate these statistics given the distribution of the input data. Throughout, we assume that the missing values are MAR [29] and that the full versions of the examples in \({\mathcal {X}}\) are i.i.d.

3.1 Formulation

Given the aforementioned assumptions, each \(\phi _j(X_i)\) is a univariate random variable, since it is a transform of \(X_i\), for which several components may be missing. In such a scenario, \({\mathcal {B}}\) can be estimated by solving the following optimization problem:

$$\begin{aligned} \underset{{\mathcal {B}}}{\text {minimize}}\quad {\mathbb {E}}\Bigg [\sum _{i=1}^{N}{\Big (Y_i - \sum _{j=0}^{{\mathcal {M}}}{\beta _j \phi _j(X_i)}\Big )^2 }\Bigg ] \end{aligned}$$
(6)

For a compact notation, let us denote \(\psi _i = (\phi _0(X_i),\dots ,\phi _{\mathcal {M}}(X_i))^T\), \(\varPsi = (\psi _1,\dots ,\psi _N)^T\) and \(\varvec{Y} = (Y_1,\dots ,Y_N)^T\). Thus, the same problem can be expressed as:

$$\begin{aligned} \underset{{\mathcal {B}}}{\text {minimize}}\quad {\mathbb {E}}\big [||\varvec{Y} - \varPsi {\mathcal {B}}||_2^2\big ] \end{aligned}$$
(7)

We can regard \(\varPsi \) as a random matrix with mean \({\overline{\varPsi }} = {\mathbb {E}}[\varPsi ]\), so that \(\varPsi = {\overline{\varPsi }} + U\), where U is a zero-mean random matrix. By doing so, the objective function in Eq. (7) can be expressed as:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\big [||\varvec{Y} - \varPsi {\mathcal {B}}||_2^2\big ]&= {\mathbb {E}}\big [(\varvec{Y} - {\overline{\varPsi }}{\mathcal {B}}- U{\mathcal {B}})^T(\varvec{Y} - {\overline{\varPsi }}{\mathcal {B}} - U{\mathcal {B}})\big ]\\&= (\varvec{Y} - {\overline{\varPsi }}{\mathcal {B}})^T(\varvec{Y} - {\overline{\varPsi }}{\mathcal {B}}) + \mathrm {E}\big [{\mathcal {B}}^T U^T U {\mathcal {B}}\big ]\\&= ||\varvec{Y} - {\overline{\varPsi }}{\mathcal {B}}||_2^2 + {\mathcal {B}}^T P {\mathcal {B}} \end{aligned} \end{aligned}$$
(8)

where \(P = {\mathbb {E}}[U^TU]\) and the cross term vanishes because \({\mathbb {E}}[U] = 0\). Therefore, the problem has the form of a regularized least squares problem and its solution is given by:

$$\begin{aligned} {\mathcal {B}} = ({\overline{\varPsi }}^T{\overline{\varPsi }} + P)^{-1} {\overline{\varPsi }}^T \varvec{Y} \end{aligned}$$
(9)

We can express P as a function of the moments of the random matrix \(\varPsi \). For any \(l, l^\prime \in \{1, \ldots , {\mathcal {M}}\}\), the element \(P_{l, l^\prime }\) at the intersection of the l-th row and the \(l^\prime \)-th column of P is given by:

$$\begin{aligned} P_{l, l^\prime }&= \sum _{i=1}^{N} {\mathbb {E}}[U_{i, l} U_{i, l^\prime }] \\&= \sum _{i=1}^{N} \Big ( \mathrm {Cov}(U_{i, l}, U_{i, l^\prime }) + {\mathbb {E}}[U_{i, l}]\,{\mathbb {E}}[U_{i, l^\prime }] \Big ) \\&= \sum _{i=1}^{N} \mathrm {Cov}(U_{i, l}, U_{i, l^\prime }) \\&= \sum _{i=1}^{N} \mathrm {Cov}({\varPsi }_{i,l}, {\varPsi }_{i,l^\prime }) . \end{aligned}$$

According to this result, it follows directly that:

$$\begin{aligned} P = \sum _{i=1}^{N} \mathrm {Cov}(\psi _i) \end{aligned}$$
(10)

Consequently, estimating P consists of computing the covariance matrices of the projected input vectors \(\psi _1, \ldots , \psi _N\) and summing them. According to Eqs. (9) and (10), we thus need to estimate \({\mathbb {E}}[{\psi }]\) and \(\mathrm {Cov}({\psi })\) to obtain \({\mathcal {B}}\).
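As an illustration, the sketch below solves Eqs. (9)-(10) under the diagonal-covariance assumption adopted later in Sect. 3.2, taking the expected hidden-layer matrix and the per-neuron variances as inputs. The names are illustrative, not the authors' code.

```python
import numpy as np

def solve_output_weights(Psi_bar, psi_var, y):
    """Eq. (9): B = (Psi_bar^T Psi_bar + P)^{-1} Psi_bar^T y,
    with P = sum_i Cov(psi_i) assumed diagonal (Eq. (10)).

    Psi_bar : (N, M+1) expected hidden-layer outputs E[psi_i]
    psi_var : (N, M+1) per-neuron variances Var[phi_j(X_i)]
    y       : (N,) targets
    """
    P = np.diag(psi_var.sum(axis=0))          # Eq. (10) under the diagonal assumption
    A = Psi_bar.T @ Psi_bar + P
    return np.linalg.solve(A, Psi_bar.T @ y)  # linear solve instead of explicit inverse
```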

3.2 Computing \(\pmb {{\mathbb {E}}}[{\psi }]\) and \(\mathrm {Cov}({\psi })\)

Let \(X \in \{ X_1, \ldots , X_N \}\) be an input vector and \(\psi \) its projected version. The problem of computing the expectation of \(\psi \) consists in evaluating individually the expectation of its entries \(\phi _1(X), \ldots , \phi _{\mathcal {M}}(X)\), given by:

$$\begin{aligned} {\mathbb {E}}[\phi _{l}(X)] = \int _{{\mathbb {R}}^D} \phi _{l}(X) p(X) dX \quad \forall l \in \{1, \ldots , {\mathcal {M}} \}, \end{aligned}$$
(11)

while the elements of the covariance matrix \(\mathrm {Cov}(\psi )\) are given by:

$$\begin{aligned} \begin{aligned} \mathrm {Cov}(\phi _{l}(X), \phi _{l^\prime }(X)) = \int \nolimits _{{\mathbb {R}}^D} (\phi _{l}(X) - {\mathbb {E}}[\phi _{l}(X)])(\phi _{l^\prime }(X) - {\mathbb {E}}[\phi _{l^\prime }(X)])\, p(X)\, dX\\ \forall l, l^\prime \in \{1, \ldots , {\mathcal {M}}\} \end{aligned} \end{aligned}$$
(12)

for which there is no general closed-form solution; tailored solutions depend on the form of both \(\phi (\cdot )\) and \(p(\cdot )\). However, for any \(\phi (\cdot )\) and \(p(\cdot )\), it is possible to approximate Eqs. (11) and (12) via sampling or numerical integration methods.

The Unscented Transform (UT), initially proposed in [21], is a sampling-based method that estimates the statistical moments of a probability distribution associated with a random variable which results from a nonlinear transformation of another random variable [24]. Here, we wish to estimate the moments of \(\psi \), which results from the application of \(\phi _1(\cdot ), \ldots , \phi _{\mathcal {M}}(\cdot )\) on the random variable X.

The UT pipeline consists of three steps: (1) selecting a set of samples (sigma points) from the original distribution; we adopt the sampling scheme proposed by Julier and Uhlmann [22], in which \(L=2|M| + 1\) samples are deterministically chosen, where |M| is the number of missing entries of X, and a set of weights is also computed to later retrieve the moments of the transformed variable; (2) evaluating \(\phi _1(\cdot ), \ldots , \phi _{\mathcal {M}}(\cdot )\) at the selected samples; and (3) estimating the moments of the transformed variable \(\psi \) from the transformed samples obtained in step 2 and the weights from step 1. Further details regarding the UT are presented in the Appendix.
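As an illustration of these three steps, the sketch below implements a generic scalar-output UT with the standard Julier-Uhlmann sigma points and weights, applied to the |M|-dimensional distribution of the missing entries (so n = |M| below). The exact weighting used in the Appendix may differ (e.g., Sects. 3.2.1-3.2.2 use three points with uniform weights 1/3), and all names are illustrative.

```python
import numpy as np

def unscented_moments(f, mean, cov, kappa=1.0):
    """Generic UT sketch (Julier-Uhlmann scheme): estimate E[f(X)] and Var[f(X)]
    for X with the given mean and covariance and a scalar-valued nonlinearity f."""
    n = mean.size
    L = np.linalg.cholesky((n + kappa) * cov)        # matrix square root of (n + kappa) * cov
    sigma_pts = [mean] + [mean + L[:, i] for i in range(n)] \
                       + [mean - L[:, i] for i in range(n)]   # 2n + 1 sigma points
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    fx = np.array([f(p) for p in sigma_pts])         # step 2: transform the sigma points
    mu = np.dot(weights, fx)                         # step 3: weighted mean of transformed points
    var = np.dot(weights, (fx - mu) ** 2)            # and weighted variance
    return mu, var
```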

Although this UT approach could be applied directly to estimate the statistical moments of an arbitrary basis function, the number of required samples grows with the number of missing entries, increasing the computational burden of the procedure. To alleviate this problem, we propose two methodologies based on the UT that require only three one-dimensional sigma points, independently of |M|. The first one, presented in Sect. 3.2.1, is tailored to the logistic function and can be easily generalized to any \(\phi (\cdot )\) that can be expressed as a transform of \(\varvec{w}^T X\). The second one, presented in Sect. 3.2.2, deals with the q-Gaussian function and can be adapted to any function that is a transform of \(\Vert X - \varvec{w}\Vert ^2\). In this paper, we consider that \(\mathrm {Cov}(\psi _1), \ldots , \mathrm {Cov}(\psi _N)\) are diagonal matrices; thus, to obtain an approximation of P, it suffices to compute the individual variances of the basis functions evaluated at each \(X \in \{X_1, \ldots , X_N \}\).

3.2.1 Sigmoid Function

Given an input vector X, the sigmoid function is given by:

$$\begin{aligned} f(X) = \frac{1}{1 + e^{-\varvec{w}^T X}} \end{aligned}$$
(13)

where \(\varvec{w} = ({w}_1, \ldots , {w}_D)^T\) is a predefined constant vector.

Note that f(X) can be written as a transform of the random variable \(\varvec{w}^T X\), whose expectation is given by:

$$\begin{aligned} {\mathbb {E}}[\varvec{w}^T X] =\sum _{d=1}^D w_d{\mathbb {E}}[x_d] , \end{aligned}$$
(14)

and has variance:

$$\begin{aligned} \mathrm {Var}[\varvec{w}^T X] = \sum _{d=1}^D w_d^2 \mathrm {Var}[x_d]. \end{aligned}$$
(15)

Using the UT scheme (see details in Appendix), we can approximate \({\mathbb {E}}[f(X)]\) with \(L=3\) sigma points as follows:

$$\begin{aligned} {\mathbb {E}}[f(X)] \approx \sum _{i\in \{-1, 0,1\} } \frac{u_i}{3} \end{aligned}$$
(16)

where:

$$\begin{aligned} u_i = {\big (1 + \exp \{-{\mathbb {E}}[\varvec{w}^T X] +i \sqrt{\mathrm {Var}[\varvec{w}^T X]}\}\big )^{-1}} \end{aligned}$$
(17)

Analogously, the variance of the logistic function f(X) is given by:

$$\begin{aligned} \mathrm {Var}(f(X)) \approx \sum _{i\in \{-1, 0,1\} } \frac{1}{3}\, \Big ( u_i - {\mathbb {E}}[f(X)]\Big )^2 \end{aligned}$$
(18)

It is important to notice that this methodology also applies to any transform of X that can be written as a function of \(\varvec{w}^T X\).
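A sketch of this three-point scheme, assuming the diagonal (independent-component) treatment of Sect. 3.2: the moments of \(\varvec{w}^T X\) follow Eqs. (14)-(15) and the UT approximation follows Eqs. (16)-(18). Variable names are illustrative.

```python
import numpy as np

def sigmoid_moments(w, x_mean, x_var):
    """Approximate E[f(X)] and Var[f(X)] for the logistic basis function (Eqs. 14-18).
    x_mean, x_var: per-component mean and variance of X (observed entries have zero variance)."""
    m = w @ x_mean                            # Eq. (14): E[w^T X]
    s = np.sqrt(w**2 @ x_var)                 # Eq. (15): sqrt(Var[w^T X])
    sigma_pts = m + np.array([-1.0, 0.0, 1.0]) * s
    u = 1.0 / (1.0 + np.exp(-sigma_pts))      # Eq. (17): logistic at the three sigma points
    mean = u.mean()                           # Eq. (16): uniform weights 1/3
    var = np.mean((u - mean) ** 2)            # Eq. (18)
    return mean, var
```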

3.2.2 q-Gaussian Function

Given an input vector X, the q-Gaussian activation function can be expressed as:

$$\begin{aligned} g(X) = e_{q}({-\Vert X - \varvec{w}\Vert ^2 \nu ^{-1}}) \end{aligned}$$
(19)

where the center \(\varvec{w} = (w_1, \ldots , w_D)^T\) (playing the role of \(\gamma \) in Eq. (4)), the squared width \(\nu > 0\) (corresponding to \(\eta ^2\)) and the entropic index \(q \in {\mathbb {R}}\) are predefined constants, while

$$\begin{aligned} e_q(\alpha ) = [1 + (1-q)\alpha ]^{\frac{1}{1-q}} . \end{aligned}$$
(20)

Note that g(X) can be written as a transform of \(\Vert X - \varvec{w}\Vert ^2\), whose expectation is given by

$$\begin{aligned} {\mathbb {E}}[ \Vert X - \varvec{w}\Vert ^2 ] =\sum _{d=1}^D \Big [ ({\mathbb {E}}[x_d] - w_d)^2 + \mathrm {Var}[x_d] \Big ] , \end{aligned}$$
(21)

and has variance:

$$\begin{aligned} \mathrm {Var}[ \Vert X - \varvec{w}\Vert ^2 ] =\sum _{d=1}^D \Big [ {\mathbb {E}}[x_d^4] - {\mathbb {E}}[x_d^2]^2 + 4w_d^2 \mathrm {Var}[x_d] \Big ]. \end{aligned}$$
(22)

Thus, \({\mathbb {E}}[g(X)]\) can be approximated using the aforementioned UT scheme with \(L = 3\) sigma points, as follows:

$$\begin{aligned} {\mathbb {E}}[g(X)] \approx \sum _{i\in \{-1, 0,1\}} \frac{r_i}{3} \end{aligned}$$
(23)

where

$$\begin{aligned} r_i = {e_q(- ({\mathbb {E}}[ \Vert X - \varvec{w}\Vert ^2 ] + i \sqrt{\mathrm {Var}[ \Vert X - \varvec{w}\Vert ^2 ]}) \nu ^{-1} )} \end{aligned}$$
(24)

Analogously, the variance of the q-Gaussian activation function g(X) is given by:

$$\begin{aligned} \mathrm {Var}(g(X))\approx \sum _{i\in \{-1, 0,1\} } \frac{1}{3}\, \Big ( r_i - {\mathbb {E}}[g(X)]\Big )^2 \end{aligned}$$
(25)

Notice that this methodology can be trivially adapted to estimate the moments of any transform g(X) that can be expressed as a function of \(\Vert X - \varvec{w} \Vert ^2\).
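Analogously, a sketch of Eqs. (21)-(25) is given below. Eq. (22) requires the raw fourth moments \({\mathbb {E}}[x_d^4]\), which the text does not detail; here they are passed in explicitly (for a Gaussian marginal, \({\mathbb {E}}[x^4] = \mu ^4 + 6\mu ^2\sigma ^2 + 3\sigma ^4\)). The q_exponential helper is the one sketched in Sect. 2, and all names are illustrative.

```python
import numpy as np

def q_gaussian_moments(w, nu, q, x_mean, x_var, x_m4):
    """Approximate E[g(X)] and Var[g(X)] for the q-Gaussian basis (Eqs. 19-25).
    x_m4 holds the raw fourth moments E[x_d^4]; observed entries have zero variance
    and x_m4 = x_mean**4 for them."""
    x_m2 = x_var + x_mean**2                                  # raw second moments E[x_d^2]
    m = np.sum((x_mean - w) ** 2 + x_var)                     # Eq. (21): E[||X - w||^2]
    v = np.sum(x_m4 - x_m2**2 + 4.0 * w**2 * x_var)           # Eq. (22): Var[||X - w||^2]
    sigma_pts = m + np.array([-1.0, 0.0, 1.0]) * np.sqrt(max(v, 0.0))
    r = q_exponential(-sigma_pts / nu, q)                     # Eq. (24)
    mean = r.mean()                                           # Eq. (23): uniform weights 1/3
    var = np.mean((r - mean) ** 2)                            # Eq. (25)
    return mean, var
```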

3.3 Modeling the Data with a Gaussian Mixture Distribution

The developments presented so far require the expected values and variances of the missing components of each feature vector. For this purpose, we model the distribution of the input data with a Gaussian mixture. Then, we condition this distribution on the observed features, obtaining the statistical moments of interest.

In order to provide a flexible representation for the distribution from which the feature vectors were drawn, we assume that it can be modeled as a linear superposition of C D-dimensional Gaussian densities, each with mean \(\mu ^{(c)}\) and covariance matrix \(\varSigma ^{(c)}\), with \(c=1,\dots ,C\), i.e. a Gaussian Mixture Model (GMM). Given an arbitrary \(X_i \in {\mathbb {R}}^D\), the probability density function of a GMM with the aforementioned parameters takes the form [18]:

$$\begin{aligned} p(X_i)=\sum _{c=1}^{C}{w^{(c)}\,{\mathcal {N}}(X_i\,|\,\mu ^{(c)},\varSigma ^{(c)})}, \end{aligned}$$
(26)

where \(\{w^{(c)}\}_{c=1}^C\) is a set of non-negative scalars satisfying \(\sum _{c=1}^{C}{w^{(c)}}=1\). The GMM is a flexible and powerful modeling tool, capable of representing a wide range of continuous distributions provided that a sufficient number of Gaussian components is used. The parameters \(\{ w^{(c)}, \mu ^{(c)},\varSigma ^{(c)} \}_{c=1}^C\) can be estimated via Maximum Likelihood using Expectation Maximization [31].
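For complete data, a GMM of the form of Eq. (26) can be fitted with standard EM, for instance via scikit-learn as in the rough stand-in below; the paper instead relies on an EM/ECM variant adapted to incomplete data [18, 31], which scikit-learn does not provide, so fitting on the complete cases only is an assumption of this sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_complete_cases(X, n_components=3, seed=0):
    """Rough stand-in: fit a GMM by EM on the complete rows only.
    The paper uses EM/ECM adapted for incomplete data instead."""
    complete = ~np.isnan(X).any(axis=1)
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='full',
                          random_state=seed).fit(X[complete])
    # weights w^(c), means mu^(c) and covariances Sigma^(c) of Eq. (26)
    return gmm.weights_, gmm.means_, gmm.covariances_
```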

Consider an arbitrary vector \(X_i\), with missing component values \(X_{i,M}\) and observed component values \(X_{i,O}\), where M and O denote the sets of indexes of missing and observed component values, respectively. Then, the parameters \(\mu ^{(c)}\) and \(\varSigma ^{(c)}\) of the c-th component of the GMM can be partitioned into two blocks as follows:

$$\begin{aligned} \mu ^{(c)} = \begin{bmatrix} \mu ^{(c)}_{O} \\ \mu ^{(c)}_{M} \end{bmatrix}, \quad \varSigma ^{(c)} = \begin{bmatrix} \varSigma ^{(c)}_{OO}&\quad \varSigma ^{(c)}_{OM} \\ \varSigma ^{(c)}_{MO}&\quad \varSigma ^{(c)}_{MM} \\ \end{bmatrix}. \end{aligned}$$
(27)

Then, the mean vector \({\tilde{\mu }}^{(c)}_{i} = \mathrm {E}^{(c)}[X_{i,M}|X_{i,O}]\) and the covariance matrix of the c-th Gaussian in the GMM, conditioned on \(X_{i, O}\), \({\tilde{\varSigma }}_{i}^{(c)} = \mathrm {Var}^{(c)}[X_{i,M}|X_{i,O}]\), are given by:

$$\begin{aligned} {\tilde{\mu }}^{(c)}_{i}&= \mu _{M}^{(c)} + \varSigma _{MO}^{(c)}(\varSigma _{O O}^{(c)})^{-1}(X_{i,O} - \mu _{O}^{(c)}) , \end{aligned}$$
(28)
$$\begin{aligned} {\tilde{\varSigma }}_{i}^{(c)}&= \varSigma _{M M}^{(c)} - \varSigma _{M O}^{(c)}(\varSigma _{O O}^{(c)})^{-1}\varSigma _{O M}^{(c)} , \end{aligned}$$
(29)

and the expected value and the covariance matrix of the missing features \(X_{i,M}\) are given by:

$$\begin{aligned} {\mathbb {E}}[X_{i,M}]&= \sum _{c=1}^C w^{(c)}{\tilde{\mu }}^{(c)}_{i}, \end{aligned}$$
(30)
$$\begin{aligned} \mathrm {Cov}( X_{i,M})&= \sum _{c=1}^C w^{(c)}({\tilde{\varSigma }}_{i}^{(c)} + {\tilde{\mu }}^{(c)}_{i} {\tilde{\mu }}^{(c)^T}_{i}) - {\mathbb {E}}[X_{i,M}] {\mathbb {E}}[X_{i,M}]^T . \end{aligned}$$
(31)
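A sketch of Eqs. (27)-(31) for a single input vector is given below. The combination across components uses the mixture weights \(w^{(c)}\) exactly as written in Eqs. (30)-(31); for numerical robustness one would solve with a Cholesky factorization rather than an explicit inverse. Names are illustrative.

```python
import numpy as np

def conditional_missing_moments(x, miss, weights, means, covs):
    """Eqs. (27)-(31): expected value and covariance of the missing entries X_{i,M}
    of a single vector x, given its observed entries, under a fitted GMM.
    miss is a boolean mask marking the missing components."""
    obs = ~miss
    x_o = x[obs]
    mu_tilde, sigma_tilde = [], []
    for mu, S in zip(means, covs):
        S_oo = S[np.ix_(obs, obs)]
        S_mo = S[np.ix_(miss, obs)]
        S_mm = S[np.ix_(miss, miss)]
        K = S_mo @ np.linalg.inv(S_oo)
        mu_tilde.append(mu[miss] + K @ (x_o - mu[obs]))      # Eq. (28)
        sigma_tilde.append(S_mm - K @ S[np.ix_(obs, miss)])  # Eq. (29)
    w = np.asarray(weights)
    mu_tilde = np.array(mu_tilde)                            # shape (C, |M|)
    e_xm = w @ mu_tilde                                      # Eq. (30)
    cov_xm = sum(w_c * (S_c + np.outer(m_c, m_c))
                 for w_c, m_c, S_c in zip(w, mu_tilde, sigma_tilde)) \
             - np.outer(e_xm, e_xm)                          # Eq. (31)
    return e_xm, cov_xm
```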

4 Implementation

Given a dataset \({\mathcal {X}}=\{X_i\}_{i=1}^{N} \subset {\mathbb {R}}^D\) with possibly incomplete feature vectors, we wish to compute the weights of an NNRW model. The proposed method can be briefly described by the following steps (a code sketch combining these steps is given after the list):

1. Estimate the parameters of the Gaussian mixture distribution of the data \({\mathcal {X}}\) with C components (mean vectors \(\mu ^{(c)}\), covariance matrices \(\varSigma ^{(c)}\) and coefficients \(w^{(c)}\), with \(c=1,\dots ,C\)) by maximizing the likelihood function of the mixture model via Expectation-Maximization adapted for incomplete data [18, 31];

2. Compute the mean vectors \({\tilde{\mu }}^{(c)}_{i}\) and the covariance matrices \({\tilde{\varSigma }}_{i}^{(c)}\) of the missing components \(X_{i,M}\) for each of the \(c=1,\dots ,C\) components, using Eqs. (28) and (29);

3. Compute the expected value and the covariance matrix of \(X_{i,M}\), using Eqs. (30) and (31);

4. Compute the expected value \({\mathbb {E}}[{\psi }]\) and the covariance matrix \(\mathrm {Cov}({\psi })\) of the transformed input vectors, as described in Sect. 3.2;

5. Compute \({\mathcal {B}}\) using Eq. (9).
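Putting the steps together, a hedged end-to-end sketch for the logistic (FNNRW) variant is given below, reusing the helper functions sketched in Sects. 2-3 (fit_gmm_complete_cases, conditional_missing_moments, sigmoid_moments, solve_output_weights). These names are illustrative, hidden thresholds are omitted for brevity, and this is not the authors' implementation.

```python
import numpy as np

def train_fnnrw_missing(X, y, n_hidden=100, n_components=3, seed=0):
    """End-to-end sketch of Sect. 4 for the logistic basis function."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = rng.standard_normal((D, n_hidden))                 # random hidden weights
    weights, means, covs = fit_gmm_complete_cases(X, n_components, seed)   # step 1

    Psi_bar = np.ones((N, n_hidden + 1))                   # phi_0 = 1, with zero variance
    Psi_var = np.zeros((N, n_hidden + 1))
    for i in range(N):
        miss = np.isnan(X[i])
        x_mean, x_var = X[i].copy(), np.zeros(D)
        if miss.any():                                     # steps 2-3: conditional moments
            e_m, cov_m = conditional_missing_moments(X[i], miss, weights, means, covs)
            x_mean[miss], x_var[miss] = e_m, np.diag(cov_m)
        for j in range(n_hidden):                          # step 4: UT for each hidden neuron
            Psi_bar[i, j + 1], Psi_var[i, j + 1] = sigmoid_moments(W[:, j], x_mean, x_var)
    beta = solve_output_weights(Psi_bar, Psi_var, y)       # step 5: Eq. (9)
    return W, beta
```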

5 Experiments and Results

In order to assess the performance of the proposed framework, we conducted experiments with the logistic and the q-Gaussian basis functions, as well as with the NNRW models based on them (FNNRW and QGRNN). The objective of our experiments is twofold: to verify how well our proposal can approximate the value of a basis function computed over an incomplete vector, and to assess the performance of NNRW models that use our framework. All experiments are described in detail in the following subsections.

5.1 Multivariate Normal Data with Known Parameters

In this experiment, we want to estimate the q-Gaussian and the sigmoid activation functions computed over vectors with missing components. The vectors were drawn from a 5-dimensional multivariate normal distribution with known mean and covariance matrix, and we varied the number of missing components. Designing an experiment with a known distribution aims at verifying the effect of the number of missing components on our proposal without the influence of the distribution estimation algorithm (i.e., the GMM).

Initially, we randomly selected the parameters used to generate the test vectors. The mean vector was sampled from the standard normal distribution N(0, 1). The covariance matrix is given by \(\varSigma _s = L^T L\), where L is an upper triangular matrix whose non-zero entries were also drawn from N(0, 1). This procedure was repeated 20 times, and the results reported in this section are averages over these 20 rounds.

Given the mean \(\mu _s\) and the covariance matrix \(\varSigma _s\), we created a dataset with \(10^3\) samples drawn from this normal distribution. We gradually increased the number of missing entries and verified the impact on the performance of our method as well as of two other common imputation methods. In each round, the entries of \(\varvec{w}\), used for both the q-Gaussian and the sigmoid functions, were drawn at random from a standard normal distribution N(0, 1).

Our proposal was compared to Single Mean Imputation (SMI), which imputes each missing component with that component's average value computed from the observed cases, and to Conditional Mean Imputation (CMI, [31]). In CMI, the missing entries are imputed with their expected values conditioned on the observed entries of the same vector. We then calculated the basis function using the imputed values. For all methods, the data distribution was considered known; since all methods share the same statistical model, the performance of each one depends only on its formulation.

Tables 1 and 2 show the Average Root Mean Squared Error (ARMSE) between the real and the estimated basis function values for the logistic function and the q-Gaussian function, respectively.

Table 1 ARMSE for the logistic function
Table 2 ARMSE for the q-Gaussian function

From the results, we observe that the performance of all methods degrades as the number of missing entries increases. We also observe that all methods perform similarly at the lowest missingness levels; in such cases, the uncertainty in the imputation procedure seems to have a small impact on the estimate of the basis functions. It is worth noting that our method outperforms the others at the highest missingness levels. This result is expected, since a high number of missing components increases the uncertainty of the estimation process and our method is the only one that includes this factor in its formulation.

5.2 Experiments Using Real-World Data

We evaluate the performance of the proposed method using real-world datasets. For that purpose, eleven datasets (see Table 3) were selected from the UCI Machine Learning Repository [28].

For both the logistic and the q-Gaussian functions, twenty similar rounds of the experiment were carried out. In each round, we selected \(80\%\) of the dataset for training and \(20\%\) for testing. The percentage of instances with missing values was iteratively increased from 10 to \(50\%\) (in steps of \(10\%\)) of the dataset size. At each step, we used the different methods to estimate the basis function and measured the ARMSE between the obtained estimates and the real function values computed beforehand. Since the real distribution of the data is unknown, we estimated a GMM comprising three components.

Besides SMI and CMI, we also compared our method against the Incomplete-Case k-Nearest-Neighbors Imputation algorithm (ICkNNI, [17]) and Singular Value Thresholding (SVT) [4]. ICkNNI is a popular imputation method due to its ease of implementation and good performance; unlike most single imputation strategies, it does not rely on the estimation of a statistical model for the data. For ICkNNI, we used the parameters suggested in [17]. SVT is a matrix-factorization-based imputation method that has been widely used in real-world problems such as movie recommendation and image recovery [27]. All tests were performed using the SVT toolbox described in [25].

Tables 4 and 5 show the results, in terms of ARMSE, obtained for the logistic function and the q-Gaussian function, respectively. The comparisons between the proposed method, SMI and CMI lead to conclusions similar to those of the previous experiment: the UT-based method outperforms SMI and CMI as the number of missing components increases. It is important to point out that UT was also able to outperform ICkNNI and SVT on most of the datasets. This result is significant since ICkNNI and SVT do not rely on any assumption regarding the distribution of the dataset, whereas UT used a statistical model with a fixed, and possibly non-optimal, number of Gaussians.

To assess the statistical significance of our results, we conducted a Friedman test [10]. The Friedman test quantifies the consistency of the results obtained by a method when it is applied to several datasets. The success of each method is quantified by its performance ranking: for each dataset, the best performing algorithm gets rank 1, the second best rank 2, and so on. In the current setting, the null hypothesis \(H_0\) states that there is no statistical difference between the models.

Table 3 Datasets description

Following the methodology presented in [7], we verified that the null hypothesis was rejected with \(p = 5.3051 \times 10^{-23}\) for the logistic function and \(p = 9.0224 \times 10^{-16}\) for the q-Gaussian. We obtained a Critical Difference (CD) of 0.82251. From the CD and the average rankings, we can state that our method significantly outperformed all other methods for the logistic activation. For the q-Gaussian, our proposal showed a significant performance gain when compared to SVT and SMI. It is worth pointing out that, even though the performance gap between UT and the other two methods (ICkNNI and CMI) was not significant, our proposal had the best average ranking.

Table 4 Comparison of different approaches to estimate the logistic function in presence of missing entries on real-world data
Table 5 ARMSE for the QGRNN models

5.3 Application to Neural Networks with Random Weights

In this last experiment, our goal is to verify the impact of the basis function estimates on NNRW models. The proposed variants of FNNRW and QGRNN were compared to their original formulations trained on datasets imputed with SMI, CMI, SVT and ICkNNI. All FNNRW and QGRNN models have 100 hidden neurons.

Table 6 ARMSE for the FNNRW models
Table 7 Comparison of different approaches to estimate the Q-Gaussian function in presence of missing entries on real-world data

In this experiment, the number of missing components was iteratively increased from 10 to \(50\%\) (in steps of \(10\%\)). SMI, CMI and UT used GMM models comprising three Gaussians, whose parameters were estimated with the ECM algorithm. We repeated the experiments twenty times and computed the ARMSE. For all datasets, we randomly selected \(80\%\) for training and \(20\%\) for testing. Tables 6 and 7 show the results for FNNRW and QGRNN, respectively.

The results presented in Tables 6 and 7 show that our proposal outperformed all other methods in more than \(90\%\) of the tested cases. This experiment reinforces the belief that better estimates of the basis functions may lead to NNRWs with better performance, although no theoretical guarantee can be given. Beyond the performance gain, one crucial aspect that shall be highlighted is that our model has an automatically selected regularization term: in Eq. (9), the matrix P can be seen as a regularization term that depends on the uncertainty of the estimates. Thus, our formulation can be seen as an uncertainty-robust approach.

Once again, the statistical significance of our results was tested with the Friedman test. The null hypothesis \(H_0\) was rejected with \(p = 2.3275 \times 10^{-17}\) for the FNNRW and \(p = 5.9173 \times 10^{-8}\) for the QGRNN. The CD of 0.82251 and the average rankings show that the models based on our proposal significantly outperformed all other approaches.

6 Conclusion

In this paper, we presented a methodology to design NNRWs for datasets with missing components. Our proposal consists of a training procedure for NNRWs that can profit from incomplete feature vectors without the need for a prior imputation step.

In the proposed approach, the data are modeled with a GMM, and the output of the hidden layer is estimated in terms of its expected values and variances. These statistics are then used to solve a robust least-squares problem that estimates the coefficients of the output layer.

To validate our method, we performed three sets of experiments. The first two aim to verify how well our method estimates basis functions that are part of an NNRW; we considered the logistic and the q-Gaussian basis functions as case studies, using both simulated and real datasets. The last experiment tested our proposal on two NNRWs, the QGRNN and the FNNRW, on real datasets.

The results show that our proposal was able to estimate the basis functions more accurately and also resulted in NNRWs with better performance in almost all tested cases. It is worth noting that our method was compared to both parametric and non-parametric approaches, and a statistical analysis of the results shows that its superior performance is significant. Even though the method showed promising results, it is important to point out that its performance is strongly related to the chosen distribution estimation procedure (in this paper, a GMM). Another important aspect that shall be highlighted is that the UT is an approximate procedure to obtain the moments of the distribution; thus, exact algorithms should be preferred when available (see [32, 34, 38] for some examples). In future work, we intend to compare our proposal to other recent imputation methods and also analyze the impact of our strategy on other models such as RVFL and deep neural networks.