Abstract
In this paper, we propose a method to design Neural Networks with Random Weights in the presence of incomplete data. We present a method, under the general assumption that the data is missing-at-random, to estimate the weights of the output layer as a function of the uncertainty of the missing data estimates. The proposed method uses the Unscented Transform to approximate the expected values and the variances of the training examples after the hidden layer. We model the input data as a Gaussian Mixture Model with parameters estimated via a maximum likelihood approach. The validity of the proposed method is empirically assessed under a range of conditions on simulated and real problems. We conduct numerical experiments to compare the performance of the proposed method to the performance of popular, parametric and non-parametric, imputation methods. By the results observed in the experiments, we conclude that our proposed method consistently outperforms its counterparts.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
Artificial Neural Networks (ANN) are a broad class of mathematical models, inspired by the neural mechanism of animals, that have been extensively investigated in the past decades. Successful applications of ANN have been reported in various domains such as signal processing [47], control systems [36] and computer vision [51].
Among the aspects that influence the performance of an ANN, the learning method is one of the most relevant. In this context, gradient-based learning algorithms have played a dominant role in training ANN [45]. However, problems such as slow convergence and convergence to local minima increased the popularity of Neural Networks with Random Weights (NNRW). NNRWs provide a solution to these issues by randomly assigning the hidden weights and adjusting only the output neurons during the training step. In one of the first works to analyze NNRW, [43] reported that the weights of the output layer are the most relevant, and the other weights may not need to be tuned once they are properly initialized. The authors argue that randomly setting the parameters in the hidden neurons helps to remove the redundancy of the model in parameter space and thus makes the performance less sensitive to the resulting parameters compared with other typical learning rules such as back-propagation. Similar basic ideas were also reported in the seminal work of Broomhead and Lowe [3], where the authors showed that the random selection of centers among the training data is a valid alternative to sophisticated clustering algorithms used in RBF neural networks.
Since the 90s, NNRWs have become increasingly popular and various models were proposed. For example, in [37], the authors proposed a new NNRW model named Random Vector Functional Link (RVFL). The RVFL model differs from the model presented by Schmidt et al. [43] since it includes a direct link between the inputs and the output layer. Later, several authors showed that the direct link may enhance the performance of RVFL [46]. Furthermore, various authors such as [2, 14, 15] introduced NNs with a randomly initialized hidden layer, trained using the pseudo-inverse. The universal approximation capability of NNRWs was also addressed in works such as [20, 26, 40]. More recently, several authors introduced NNRWs with different characteristics such as constructive/selective NNs [49] and deep architectures [13, 48]. Such models achieved remarkable performances and may point promising future research directions. It is worth pointing that the literature of NNRW is vast and, in this paper, we do not aim to present it in details. For a more complete survey on NNRW including historical perspectives and current challenges, the reader may refer to [42, 52].
Despite the vast number of successful applications [5, 39, 41], the use of NNRW can be limited by the presence of observations with one or more missing values. The occurrence of missing values can be caused by many reasons such as measurement error, device malfunction and operator failure. To overcome the problems derived from the occurrence of these missing components, it is necessary to understand the mechanisms governing this phenomenon. According [29], the missingness mechanism can be characterized as Not Missing at Random (NMAR), Missing Completely at Random (MCAR) and Missing at Random (MAR).
The NMAR mechanism describes the situation in which the probability of a component being missing is related to the value of this component. If missing data are NMAR, building a missingness model is the only way to obtain unbiased estimates of the missing values. In MCAR, the missingness of a component is independent of its real value or any value of other components on the dataset. Such a characterization is very restrictive, and it makes the MCAR case unlikely in real life problems [8]. A more reasonable assumption states that the data are MAR. In the MAR mechanism, the missingness of a component is independent of the value itself but can be related to the observed values. Under the assumption that the data are MAR, strategies for dealing with missing data in machine learning methods (including neural networks) can be grouped into three different approaches: deletion of incomplete cases, imputation of missing values and design of learning methods that can handle missing data directly.
The deletion approach is, by far, the most common [50]. In the listwise deletion, instances with at least one missing attribute are eliminated, and the analysis is performed using the rest of the examples. Although commonly used, listwise deletion can lead to a performance loss when the number of instances with missing attributes is significant [12].
The imputation approach consists of strategies for filling the missing values using the information available from the observed values of the dataset. These methods can be split into single and multiple imputation procedures. In single imputation, each missing value is filled with a single estimated value. Examples of single imputation methods are the incomplete case nearest neighbors imputation (ICk-NNI, [17]), where the value is imputed according to the k-nearest neighbors obtained with an incomplete distance metric and the Expectation Conditional Maximization (ECM, [31]). In the ECM method, a parametric model is assumed for the dataset, and the EM algorithm is used to estimate the parameters of the distribution. After that, the missing values are imputed with its expected value conditioned to the observed values of the instance. After the imputation procedure, a standard learning algorithm can be applied.
In contrast to single imputation, multiple imputation methods attribute a set of possible values rather than a single value for the missing attributes. Hence, these methods generate some different datasets where the complete instances are identical, but the incomplete instances have different values [23]. After that, learning algorithms are trained on each dataset, and a final result is obtained by combining all generated models. It is worth noting that multiple imputation methods present a significant increase in computational cost when compared to single imputation methods. However, multiple imputation approaches can handle the inherent uncertainty from the missing value estimation process in the final machine learning model thus leading to better results.
The third group of strategies consists of the adaptation of learning methods to handle missing data on its formulation. Such methods avoid direct imputation procedures [12]. In [9], the authors propose a method to estimate the expected value of pairwise distances between vectors with missing data using a Gaussian Mixture Model (GMM) for the distribution of the dataset. The proposed method is used to design a k-NN classifier for datasets with missing values. Support Vector Machines (SVM) and Gaussian Process (GP) variants for missing data were proposed in [44], where both SVM and GP are cast in a general framework for kernel methods as an estimation problem using exponential families in feature space. Also for SVM, the work in [38] propose a modified risk function that incorporates a probability distribution for missing data. Mesquita et al. [33] presents a method to estimate the expected value of the Gaussian kernel calculated over incomplete vectors.
Considering NNRWs, methods for directly handling missing data are indeed rare. Most of the works available in the literature are related to the use of neural networks for single imputation, as can be seen in [1, 19]. Other authors propose single imputation methods [35] or evaluate the impact of various well-known imputation methods on the performance of NNRW [30]. It is worth noting that these single imputation methods were not particularly designed to work with NNRW and can be used with almost any machine learning method.
In this paper, we propose a method to design NNRW models that can handle missing components directly. In the proposed framework, we consider each entry as a random variable with its distribution modeled with a GMM. We use the Unscented Transform (UT) to estimate both the expected value and the variance of each entry after going through the hidden layer. Then, we estimate the weights of the output layer using an uncertainty robust least squares formulation. We tested our method on two NNRW models in synthetic and real datasets. Results show that our approach outperformed other commonly used missing data treatments, especially when the number of missing components is high. The proposed method differs from other missing data treatments because it accounts for the uncertainty of the imputation process, similar to multiple imputation methods, whereas it does not rely on a computationally intensive sampling procedure. In this way, our method can be seen as an alternative to single and multiple imputation methods, since it provides more accurate estimates than single imputation methods (since it considers the uncertainty) with a reduced computational cost when compared to multiple imputation procedures.
The remainder of this paper is organized as follows. Section 2 presents a brief review of basis functions and NNRW models. Section 3 describes the proposed method to design NNRW models that are capable of handling missing values. Section 4 summarizes the steps to be followed to implement the proposed method. Section 5 presents the results obtained in numerical experiments conducted to evaluate the performance of the proposed method. Concluding remarks are given in Sect. 6.
2 Basis Expansion and Neural Networks with Random Weights
Let \({\mathcal {D}} = \{ {\mathcal {X}}, {\mathcal {Y}} \}\) be a training set with N examples, such that \({\mathcal {X}}=\{X_i\}_{i=1}^N\) and \({\mathcal {Y}}=\{Y_i\}_{i=1}^N\) are, respectively, a set of D-dimensional input points and their corresponding outputs. Assuming the existence of a continuous mapping \({\mathfrak {f}}:{\mathcal {X}} \rightarrow {\mathcal {Y}}\) between the input and the output space, we want to estimate \({\mathfrak {f}}\) from data. In many NNRW models, \({\mathfrak {f}}\) can be expressed as a linear combination of nonlinear functions (basis functions) over the input variables as follows:
where the function \(\phi _j(\cdot ): {\mathbb {R}}^D \rightarrow {\mathbb {R}}\) is the j-th basis function, \(\beta _j\) is its corresponding weight and \({\mathcal {M}}\) is the number of hidden neurons. Usually \(\phi _0(\cdot ) = 1\) and \(\beta _0\) is the bias term.
The resulting model, referred as linear basis expansion, can approximate a nonlinear function while still linear in the parameters. Under the condition that these basis functions are infinitely differentiable, this model can approximate any given function ([6, 11, 16]). Since the model is linear concerning the basis functions outputs, the parameters are typically estimated with maximum likelihood or Bayesian techniques.
The use of different basis functions defines various neural networks with random weights models such as the Feedforward Neural Network with Random Weights (FNNRW, [43]), the Random Vector Functional Link (RVFL, [36]) and the q-Generalized Random Neural Network (QGRNN, [45]) among others. To illustrate our point, we present reformulated versions of FNNRW and QGRNN according to the basis expansion model. Similar procedures can be performed for other NNRW models and require only straightforward mathematical manipulations.
The FNNRW model, originally proposed in [43], was one of the first to introduce the concept of NNRW. The FNNRW comprises a multilayer perceptron model where the weights between the input and the hidden neurons are randomly assigned. Such model can be written as:
where \(\varvec{w}_j\) is the weight vector connecting the input layer and the j-th hidden neuron, \(\beta _j\) is the weight connecting the j-th hidden neuron and output layer, \(c_j\) is the threshold of the j-th hidden neuron and \(f(\cdot )\) is the logistic sigmoid function. As can be noticed, in the FNNRW model the basis function is defined as a sigmoid function over the linear combination of the input vector.
The QGRNN model can be written similarly. The q-Generalized Random Neural Network is a recently proposed NNRW that uses radial basis neurons and a q-Gaussian activation function. The resulting model can be expressed as:
where:
where \(\gamma \) and \(\eta \) represent, respectively, the center and the width of the q-Gaussian, and \(e_q\) is the q-exponential defined by:
The choice of q, the entropic index, is of fundamental importance since it results in different deformations of the Gaussian distribution. Once again, the NNRW model can be expressed as a linear basis expansion model.
3 Neural Networks with Random Weights for Datasets with Missing Values
Given a set of possibly incomplete training inputs \({\mathcal {X}}\) and the set of corresponding outputs \({\mathcal {Y}}\), we wish to estimate the weight vector \({\mathcal {B}} =(\beta _0,\dots ,\beta _{\mathcal {M}})^T\). To do so, we model the missing entries as random variables and choose \({\mathcal {B}}\) to minimize the expected sum of square errors loss function. We derive a solution in terms of the expected values and the covariance matrices of the transformed inputs. In turn, we provide a method to estimate these statistics given the distribution of the input data. Furthermore, we consider that missing values are MAR [29] and that the full version of the examples in \({\mathcal {X}}\) are i.i.d.
3.1 Formulation
Given the aforementioned assumptions, each \(\phi _j(X_i)\) is an univariate random variable since it is a transform of \(X_i\) for which several components may be missing. In such scenario, \({\mathcal {B}}\) can be estimated by solving the following optimization problem:
For a compact notation, let us denote \({\psi _i} = (\phi _0(X_i),\dots ,\phi _{\mathcal {M}}(X_i))^T\), \(\varPsi = (\varvec{\psi _1},\dots ,\varvec{\psi _N})^T\), \(\varvec{Y} = (Y_1,\dots ,Y_N)^T\) and \({\mathcal {B}} = (\beta _0,\dots ,\beta _{\mathcal {M}})^T\). Thus, the same problem can be expressed as:
We can assume that \(\varPsi \) is a random variable with \({\overline{\varPsi }} = \mathrm {E}[\varPsi ]\), so we can describe \(\varPsi \) as \(\varPsi = {\overline{\varPsi }} + U\), where U is a random matrix with zero mean. By doing so, the objective function in Eq. (7) can be expressed as:
where \(P = \mathrm {E}[U^TU]\). Therefore, the problem has the form of a regularized least squares problem and the solution is given by:
We can express P as a function of the moments of the random variable \(\varPsi \). For any \(l, l^\prime \in \{1, \ldots , {\mathcal {M}}\}\), the element \( P_{l, l^\prime }\), in the intersection of the l-th line with the \(l^\prime \)-th column of P is given by:
According to this result, it follows directly that:
Consequently, estimating P consists in computing the covariance matrices of the projected input vectors \(\psi _1, \ldots , \psi _N\) and summing over them. According to the results presented in Eqs. (9) and (10), we have to estimate \({\mathbb {E}}[{\psi }]\) and \(\mathrm {Cov}({\psi })\) to obtain \({\mathcal {B}}\).
3.2 Computing \(\pmb {{\mathbb {E}}}[{\psi }]\) and \(\mathrm {Cov}({\psi })\)
Let \(X \in \{ X_1, \ldots , X_N \}\) be an input vector and \(\psi \) its projected version. The problem of computing the expectation of \(\psi \) consists in evaluating individually the expectation of its entries \(\phi _1(X), \ldots , \phi _{\mathcal {M}}(X)\), given by:
while the elements of the covariance matrix \(\mathrm {Cov}(\psi )\) are given by:
for which there is no trivial general solution and tailored ones depend on both the format of \(\phi (\cdot )\) and \(p(\cdot )\). However, for any \(\phi (\cdot )\) and \(p(\cdot )\), it is possible to approximate Eqs. (11) and (12) via sampling or numerical integration methods.
The Unscented Transform (UT), initially proposed in [21], is a sampling-based method that estimates the statistical moments of a probability distribution associated with a random variable which results from a nonlinear transformation of another random variable [24]. Here, we wish to estimate the moments of \(\psi \), which results from the application of \(\phi _1(\cdot ), \ldots , \phi _{\mathcal {M}}(\cdot )\) on the random variable X.
The UT pipeline consists of (1) selecting a set of samples from the original distribution. We consider the sampling scheme proposed by Julier and Uhlmann [22], where a set of \(L=2|M| + 1\) samples is deterministically chosen, where |M| is the number of missing entries of X. Additionally, we compute a set of weights that will be later used to retrieve the moments of the transformed variable. (2) Evaluating the values of \(\phi _1(\cdot ), \ldots , \phi _{\mathcal {M}}(\cdot )\) on the previously selected samples. (3) Estimate the moments of the transformed variable \(\psi \) using the transformed samples obtained in step 2 and the weights from step 1. Further details regarding the UT are presented in the Appendix.
Although this UT approach could be applied directly to estimate the statistical moments of an arbitrary basis function, the number of required samples grows with the number of missing entries, increasing the computational burden of the procedure. To alleviate this problem, we propose two methodologies based on the UT that require only three one-dimensional sigma points, independent of |M|. The first one, presented in Sect. 3.2.1, is tailored to the logistic function and can be easily generalized to any \(\phi (\cdot )\) that can be expressed as a transform of \(\varvec{w}^T X\). The second one, presented in Sect. 3.2.2, deals with the q-Gaussian function and can be adapted for any function that is a transform of \(\Vert \varvec{w}^T X\Vert ^2\). In this paper, we consider that \(\mathrm {Cov}(\psi _i), \ldots , \mathrm {Cov}(\psi _N)\) are diagonal matrices. Thus, to obtain an approximation of P, it suffices to compute the individual variances of the basis functions evaluated at each \(X \in \{X_i, \ldots , X_N \}\).
3.2.1 Sigmoid Function
Given an input vector X, the sigmoid function is given by:
where \(\varvec{w} = ({w}_1, \ldots , {w}_D)^T\) is a predefined constant vector.
Note that f(X) can be written as a transform of the random variable \(\varvec{w}^T X\), whose expectation is given by:
and has variance:
Using the UT scheme (see details in Appendix), we can approximate \({\mathbb {E}}[f(X)]\) with \(L=3\) sigma points as follows:
where:
Analogously, the variance of the logistic function f(X) is given by:
It is important to notice this methodology also applies to any transform of X that can be written as a function of \(\varvec{w}^T X\).
3.2.2 q-Gaussian Function
Given an input vector X, the q-Gaussian activation function can be expressed as:
where \(\varvec{w} = (w_1, \ldots , w_D)^T\), \(\nu > 0\) and \(q \in {\mathbb {R}}\) are predefined constants while
Note that g(X) can be written as a transform of \(\Vert X - \varvec{w}\Vert ^2\), whose expectation is given by
and has variance:
Thus, \({\mathbb {E}}[g(X)]\) can be approximated using the aforementioned UT scheme with \(L = 3\) sigma points, as follows:
where
Analogously, the variance of the q-Gaussian activation function g(X) is given by:
Notice that this methodology can be trivially adapted to estimate the value of any specific transform g(X) that can be expressed as a function of \(\Vert X - \varvec{w} \Vert ^2\).
3.3 Modeling the Data with a Gaussian Mixture Distribution
The developments presented so far require the computation of the expected values and variances of the missing components in each feature vector. For this purpose, we modeled the distribution of the input data with a Gaussian Mixture. Then, we condition this distribution on the observed features, obtaining the statistical moments of interest.
In order to provide a flexible representation for the distribution from which the feature vectors were drawn, we assume that it is possible to model the distribution as a linear superposition of CD-dimensional Gaussian densities. Each Gaussian density has mean \(\mu ^{(c)}\) and covariance matrix \(\varSigma ^{(c)}\), with \(c=[1,\dots ,C\)], i.e. a Gaussian Mixture Model (GMM). Given an arbitrary \(X_i \in {\mathbb {R}}^D\), the probability density function of a GMM with the aforementioned parameters takes the form [18]:
where \(\{w^{(c)}\}_{c=1}^C\) is a set of non-negative scalars that satisfies the constraint \(\sum _{c=1}^{C}{w^{(c)}}=1\). The GMM model is a flexible and powerful modeling tool capable to model a wide range of continuous distributions, provided a sufficient number of Gaussian components. The parameters \(\{ w^{(c)}, \mu ^{(c)},\varSigma ^{(c)} \}_{c=1}^C\) of the GMM can be estimated via Maximum Likelihood using Expectation Maximization [31].
Consider an arbitrary vector \(X_i\), with missing component values \(X_{i,M}\) and observed component values \(X_{i,O}\), where M and O denote the sets of indexes of missing and observed component values, respectively. Then, the parameters \(\mu ^{(c)}\) and \(\varSigma ^{(c)}\) of the c-th component of the GMM can be partitioned into two blocks as follows:
Then, the mean vector \({\tilde{\mu }}^{(c)}_{i} = \mathrm {E}^{(c)}[X_{i,M}|X_{i,O}]\) and the covariance matrix of the c-th Gaussian in the GMM, conditioned on \(X_{i, O}\), \({\tilde{\varSigma }}_{i}^{(c)} = \mathrm {Var}^{(c)}[X_{i,M}|X_{i,O}]\), are given by:
and the expected value and the covariance matrix of the missing features \(X_{i,M}\) are given by:
4 Implementation
Given a dataset \({\mathcal {X}}=\{X_i\}_{i=1}^{N} \subset {\mathbb {R}}^D\) with possibly incomplete feature vectors, we wish to compute the weights for a NNRW model. The proposed method can be briefly described by the following steps:
- 1.
Estimate the parameters of the Gaussian mixture distribution of the data \({\mathcal {X}}\) with C components (mean vectors \(\mu ^{(c)}\), covariance matrices \(\varSigma ^{(c)}\) and coefficients \(w^{(c)}\), with \(c=1\dots ,C\)) by maximizing the likelihood function of the mixture model by Expectation-Maximization adapted for incomplete data [18, 31];
- 2.
Compute the mean vectors \({\tilde{\mu }}^{(c)}_{i}\) and the covariance matrices \({\tilde{\varSigma }}_{i}^{(c)}\) of the missing components \(X_{i,M}\) for each of the \(c=1,\dots ,C\) components, using Eqs. (28) and (29);
- 3.
Compute the expected value and the covariance matrix \(X_{i,M}\), using Eqs. (30) and (31);
- 4.
Compute the expected value \({\mathbb {E}}[{\psi }]\) and the covariance matrix \(\mathrm {Cov}({\psi })\) of the transformed input vectors, as described in Sect. 3.2;
- 5.
Compute \({\mathcal {B}}\) using Eq. (9).
5 Experiments and Results
In order to assess the performance of the proposed framework, we conducted experiments with the logistic and the q-Gaussian basis functions as well as the NNRW models based on both basis functions (FNNRW and QGRNN). The objective of our experiments is twofold: verify how well our proposal can approximate the value of a basis function over an incomplete vector and assess the performance of NNRW models that use our framework. All experiments are described in details in the following subsections.
5.1 Multivariate Normal Data with Known Parameters
In this experiment, we want to estimate the q-Gaussian and the sigmoid activation functions computed over a vector with missing components. The vectors were drawn from a 5-dimensional multivariate normal distribution with known mean and covariance matrix. We varied the number of missing components. The design of an experiment with a known distribution aims at verifying the effect of the number of missing components in our proposal without the influence of the distribution estimation algorithm (i.e. GMM).
Initially, we randomly selected the parameters that will generate the tested vectors. For that, the mean vector was sampled from the standard normal distribution N(0, 1). The covariance matrix is given by \(\varSigma _s = L^T L\), where L is an upper triangular matrix whose non-zero entries were also drawn from the same distribution N(0, 1). The procedure to generate those parameters was repeated 20 times and the results reported in this section are the average of these 20 rounds.
Given the mean \(\mu _s\) and covariance matrix \(\varSigma _s\), we created a dataset with \(10^3\) samples drawn from this Normal distribution. We gradually increased the number of missing entries and verified the impact on the performance of our method as well as on two other common imputation methods. At each round, the entries of \(\varvec{w}\), used for both the q-Gaussian and the sigmoid functions, were drawn at random from a standard normal N(0, 1).
Our proposal was compared to the Single Mean Imputation (SMI), that imputes all missing components with the average value calculated using the complete components and the Conditional Mean Imputation (CMI, [31]). In CMI, the missing entries are imputed with their expected values conditioned to the observed entries of the same vector. Then, we calculated the basis function using the imputed values. For all methods, the data distribution was considered to be known. Since all methods share the same statistical model, the performance of each one depends only on its formulation.
Tables 1 and 2 show, respectively, the Average Root Mean Squared Errors (ARMSE) between real and estimated basis functions computed for the logistic function and the q-Gaussian function.
By the results, we observe that the performance of all methods degrades as the number of missing entries increases. We also conclude that all methods have similar performances for the lowest missingness levels. In such cases, the uncertainty on the imputation procedure seems to have a small impact on the estimate of the basis functions. It is worth noting that our method outperforms all other methods for the highest missingness levels. This result is expected since the high number of missing components increases the uncertainty of the estimation process and our method is the only one that includes this factor in its formulation.
5.2 Experiments Using Real-World Data
We evaluate the performance of the proposed method using real-world datasets. For that purpose, eleven datasets (see Table 3) were selected from the UCI Machine Learning Repository [28].
For both the logistic and the q-Gaussian functions, twenty similar rounds of the experiment were carried out. In each of these rounds, we selected \(80\%\) of the dataset for training and \(20\%\) for testing. The percentage of instances with missing samples was interactively increased from 10 to \(50\%\) (in steps of \(10\%\)) of the dataset size. In each step, we used different methods to estimate the basis function and measured the ARMSE between the obtained estimates and the real function values computed beforehand. Since the real distribution of the data is unknown, we estimate a GMM comprising three components.
Besides SMI and CMI, we also compare our method against the Incomplete-Case k-Nearest-Neighbors Imputation algorithm (ICkNNI), [17] and the Singular Value Thresholding (SVT) [4]. ICkNNI is a popular imputation method due to its easiness of implementation and good performance. Unlike most of single imputation strategies, ICkNNI does not rely on the estimation of a statistical model for the data. For ICkNNI, we used the parameters suggested in [17]. SVT is a matrix factorization based imputation method that has been largely used in several real-world problems such as movie recommendation and image recovery [27]. All tests were performed using the SVT toolbox described in [25].
Tables 4 and 5 show the results, in terms of ARMSE, obtained for the logistic function and the q-gaussian, respectively. The comparisons between the proposed method, SMI and CMI lead to conclusions that are similar to the ones obtained in previous experiments. The UT-based method outperforms SMI and CMI when the number of missing components increases. It is important to point out that UT was also able to outperform ICkNNI and SVT in most of the datasets. This result is significant since ICkNNI and SVT do not rely on any assumption regarding the distribution of the dataset and UT used a statistical model with a fixed, and possibly non-optimal, number of Gaussians.
Aiming at assessing the statistical significance of our results, we conducted a Friedman statistical test [10]. The Friedman test quantifies the consistency of the results obtained by a method when applied in several datasets. The success of each method is quantified according to their performance rankings (for each dataset, the best performing algorithm getting the rank of 1, the second best rank 2 and so on). In the current setting, the null hypothesis \(H_0\) states that there is no statistical difference between all models.
Following the methodology presented in [7], we verify that the null hypothesis was rejected with \(p = 5.3051e-23\) for the logistic and \(p = 9.0224e-16\) for the q-Gaussian. We obtained a Critical Difference (CD) of \(CD = 0.82251\). By observing the CD and the average rankings, we can state that our method significantly outperformed all other methods for the logistic activation. For the q-Gaussian, our proposal showed a significant performance gain when compared to SVT and SMI. It is worth pointing that, even though the performance gap between UT and the other two methods (ICkNNI and CMI) was not significant, our proposal had the best average ranking.
5.3 Application to Neural Networks with Random Weights
In this last experiment, our goal is to verify the impact of the basis function estimates on NNRW models. The proposed variants of FNNRW and QGRNN were compared to its original formulations trained with imputed datasets using SMI, CMI, SVT and ICkNNI. All FNNRW and QGRNN models have 100 hidden neurons.
In this experiment, the number of missing components was iteratively increased from 10 to \(50\%\) (in steps of \(10\%\)). SMI, CMI and UT used GMM models comprising three Gaussian. The parameters of each Gaussian was estimated with the ECM algorithm. We repeated the experiments twenty times and computed the ARMSE. For all datasets, we randomly selected \(80\%\) for training and \(20\%\) for testing. Tables 6 and 7 show the results for FNNRW and QGRNN, respectively.
The results presented in Tables 6 and 7 show that our proposal outperformed all other methods in more than \(90\%\) of the tested cases. This experiment enforces the belief that better estimates of basis functions may lead to NNRW with better performance, although no theoretical guarantee could be given. Despite the performance gain, one crucial aspect that shall be highlighted it that our model has an automatically selected regularization term. In Eq. (9), we can see matrix P as a regularization term that depends on the uncertainty of the estimates. Thus, our formulation can be seen as an uncertainty robust approach.
Once again, the statistical significance of our results was tested with the Friedman test. The null hypothesis \(H_0\) was rejected for the FNNRW with \(p = 2.3275e-17\) and with \(p = 5.9173e-08\) for the QGRNN. The \(CD = 0.82251\) and the average rankings showed that the models based on our proposals significantly outperformed all other approaches.
6 Conclusion
In this paper, we presented a methodology to design NNRWs for datasets with missing components. Our proposal presents a training procedure for NNRWs which can profit from incomplete feature vectors without the need of a previous imputation step.
In the proposed approach, data is modeled with a GMM, and the output of the hidden layer is estimated in terms of its expected values and variances. Such statistical measures are used to solve a robust least-squares problem to estimate the coefficients of the output layer.
To validate our method, we perform three sets of experiments. The first two sets aim to verify how well our method estimates basis functions that are part of a NNRW. We considered the logistic and the q-Gaussian basis functions as case studies. Our experiments included simulated and real datasets. The last experiment tested our proposal on two NNRW, the QGRNN and the FNNRW, in real datasets.
The results obtained show that our proposal was able to estimate the basis function more accurately and also result in NNRW with better performances for almost all tested cases. It is worth noting that our method was compared to both parametric and non-parametric approaches. A statistical analysis of the results shows that the superior performance of our method is significant. Even though the method showed promising results, it is important to point that its performance is strongly related to the chosen distribution estimation procedure (in this paper, we used a GMM). Another import aspect that shall be highlighted is that the UT consists of an approximated procedure to obtain the moments of the distribution, thus exact algorithm shall be preferred when available (see [32, 34, 38] for some examples). In future works, we intend to compare our proposal to other recent imputation methods and also analyze the impact of our strategy in other models like RVFL and deep neural networks.
References
Abdella M, Marwala T (2005) The use of genetic algorithms and neural networks to approximate missing data in database. In: IEEE 3rd international conference on computational cybernetics ICCC 2005, pp 207–212
Braake HAT, Straten GV (1995) Random activation weight neural net (rawn) for fast non-iterative training. Eng Appl Artif Intell 8(1):71–80. https://doi.org/10.1016/0952-1976(94)00056-S
Broomhead DS, Lowe D (1988) Multivariable functional interpolation and adaptive networks. Complex Syst 2:321–355
Cai J, Candès E, Shen Z (2010) A singular value thresholding algorithm for matrix completion. SIAM J Optim 20(4):1956–1982. https://doi.org/10.1137/080738970
Cox D, Pinto N (2011) Beyond simple features: a large-scale feature search approach to unconstrained face recognition. Face Gesture 2011:8–15. https://doi.org/10.1109/FG.2011.5771385
Cybenko G (1989) Approximation by superpositions of a sigmoidal function. Math Control, Signals Syst 2(4):303–314
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Ding Y, Simonoff JS (2010) An investigation of missing data methods for classification trees applied to binary response data. J Mach Learn Res 11:131–170
Eirola E, Lendasse A, Vandewalle V, Biernacki C (2014) Mixture of gaussians for distance estimation with missing data. Neurocomputing 131:32–42
Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1):86–92
Funahashi KI (1989) On the approximate realization of continuous mappings by neural networks. Neural Netw 2(3):183–192
Garcia-Laencina PJ, Sancho-Gomez JL, Figueiras-Vidal AR (2010) Pattern classification with missing data: a review. Neural Comput Appl 19(2):263–282
Giryes R, Sapiro G, Bronstein AM (2016) Deep neural networks with random gaussian weights: a universal classification strategy? IEEE Trans Signal Process 64:3444–3457
Guo P (2018) A vest of the pseudoinverse learning algorithm. CoRR arXiv:1805.07828
Guo P, Chen PC, Sun Y (1995) An exact supervised learning for a three-layer supervised neural network. In: International conference on neural information processing (ICONIP), Beijing, pp 1041–1044
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2(5):359–366
Hulse JV, Khoshgoftaar TM (2014) Incomplete-case nearest neighbor imputation in software measurement data. Inf Sci 259:596–610
Hunt L, Jorgensen M (2003) Mixture model clustering for mixed data with missing information. Comput Stat Data Anal 41(3–4):429–440
Gheyas IA, Smith LS (2010) A neural network-based framework for the reconstruction of incomplete data sets. Neurocomputing 73(16–18):3039–3065
Igelnik B, Pao YH (1995) Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Trans Neural Netw 6(6):1320–1329. https://doi.org/10.1109/72.471375
Julier SJ, Uhlmann JK (1997) A new extension of the Kalman filter to nonlinear systems. In: SPIE aerosense symposium, pp 182–193
Julier SJ, Uhlmann JK (2004) Unscented filtering and nonlinear estimation. Proc IEEE 92(3):401–422
Kang P (2013) Locally linear reconstruction based missing value imputation for supervised learning. Neurocomputing 118:65–78
Leão BP, Yoneyama T (2011) On the use of the unscented transform for failure prognostics. In: IEEE aerospace conference. IEEE, Big Sky
Li C, Zhou H (2017) svt: Singular value thresholding in MATLAB. J Stat Softw, Code Snippets 81(2):1–13. https://doi.org/10.18637/jss.v081.c02
Li M, Wang D (2017) Insights into randomized algorithms for neural networks: practical issues and common pitfalls. Inf Sci 382–383:170–178. https://doi.org/10.1016/j.ins.2016.12.007
Li Y, Yu W (2017) A fast implementation of singular value thresholding algorithm using recycling rank revealing randomized singular value decomposition. CoRR arXiv:1704.05528
Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 5 Jan 2018
Little RJA, Rubin DB (2002) Statistical analysis with missing data. Wiley, Hoboken
Luengo J, García S, Herrera F (2010) A study on the use of imputation methods for experimentation with radial basis function network classifiers handling missing attribute values: the good synergy between RBFNs and eventcovering method. Neural Netw 23(3):406–418
Meng XL, Rubin DB (1993) Maximum likelihood estimation via the ecm algorithm: a general framework. Biometrika 80(2):267–278
Mesquita DP, Gomes JP, Souza AH Jr, Nobre JS (2017) Euclidean distance estimation in incomplete datasets. Neurocomputing 248:11–18. https://doi.org/10.1016/j.neucom.2016.12.081
Mesquita DP, Gomes JP, Corona F, Souza AH, Nobre JS (2019) Gaussian kernels for incomplete data. Appl Soft Comput 77:356–365. https://doi.org/10.1016/j.asoc.2019.01.022
Mesquita DPP, Gomes JPP, Souza AH Jr (2017) Epanechnikov kernel for incomplete data. Electron Lett 53(21):1408–1410. https://doi.org/10.1049/el.2017.0507
Oliveira PG, Coelho AL (2009) Genetic versus nearest-neighbor imputation of missing attribute values for RBF networks. In: Koppen M, Kasabov N, Coghill G (eds) Advances in neuro-information processing. Springer, Berlin, pp 276–283
Pao YH, Phillips SM, Sobajic DJ (1992) Neural-net computing and the intelligent control of systems. Int J Control 56(2):263–289
Pao YH, Park GH, Sobajic DJ (1994) Learning and generalization characteristics of the random vector functional-link net. Neurocomputing 6(2):163–180. https://doi.org/10.1016/0925-2312(94)90053-1
Pelckmans K, Brabanter JD, Suykens J, Moor BD (2005) Handling missing values in support vector machine classifiers. Neural Netw 18(5–6):684–692
Pinto N, Doukhan D, DiCarlo JJ, Cox DD (2009) A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLOS Comput Biol 5(11):1–12. https://doi.org/10.1371/journal.pcbi.1000579
Rudi A, Rosasco L (2017) Generalization properties of learning with random features. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates, Inc., pp 3215–3225. http://papers.nips.cc/paper/6914-generalization-properties-of-learning-with-random-features.pdf
Saxe AM, Koh PW, Chen Z, Bhand M, Suresh B, Ng AY (2011) On random weights and unsupervised feature learning. In: Proceedings of the 28th international conference on machine learning ICML’11. Omnipress, Madison, pp 1089–1096
Scardapane S, Wang D (2017) Randomness in neural networks: an overview. Wiley Interdisc Rev: Data Min Knowl Discov 7:e1200
Schmidt WF, Kraaijveld MA, Duin RPW (1992) Feedforward neural networks with random weights. In: Proceedings, 11th IAPR international conference on pattern recognition, conference B: pattern recognition methodology and systems, vol 2, pp 1–4
Smola AJ, Vishwanathan SVN, Hofmann T (2005) Kernel methods for missing variables. In: Proceedings of the tenth international workshop on artificial intelligence and statistics, pp 325–332
Stosica D, Stosic D, Zanchettin C, Ludermir T, Stosic B (2017) QRNN: \(q\)-generalized random neural network. IEEE Trans Neural Netw Learn Syst 28(2):383–390
Suganthan PN (2018) Letter: on non-iterative learning algorithms with closed-form solution. Appl Soft Comput 70:1078–1082. https://doi.org/10.1016/j.asoc.2018.07.013
Vidya L, Vivekanand V, Shyamkumar U, Mishra D (2015) RBF-network based sparse signal recovery algorithm for compressed sensing reconstruction. Neural Netw 63:66–78
Wang D, Li M (2017) Deep stochastic configuration networks: universal approximation and learning representation. CoRR arXiv:1702.05639
Wang D, Li M (2017) Stochastic configuration networks: fundamentals and algorithms. IEEE Trans Cyber 47(10):3466–3479. https://doi.org/10.1109/TCYB.2017.2734043
Yu Q, Miche Y, Eirola E, van Heeswijk M, SÃl’verin E, Lendasse A (2013) Regularized extreme learning machine for regression with missing data. Neurocomputing 102:45–51
Ding Z, Fu Y (2018) Deep domain generalization with structured low-rank constraint. IEEE Trans Image Process 27(1):304–313. https://doi.org/10.1109/TIP.2017.2758199
Zhang L, Suganthan P (2016) A survey of randomized algorithms for training neural networks. Inf Sci 364–365:146–155. https://doi.org/10.1016/j.ins.2016.01.039
Acknowledgements
The authors would like to thank the Brazilian National Council for Scientific and Technological Development (CNPq) for the financial support (Grant No. 305048/2016-3)
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: Unscented Transform (UT)
Appendix: Unscented Transform (UT)
Given a D-dimensional random variable X, we are interested in estimating statistical moments of \(\psi \), which results from an application of a non-linear function \(h(\cdot )\) to X. These values could be obtained via standard sampling procedures or numerical integration methods. However, such procedures can be computationally intensive and depend on many factors such as proper initialization, stop criteria, etc. The Unscented Transform (UT) provides a scheme to estimate the moments of \(\psi \) using a small set of deterministically chosen samples, referred to as sigma points (SPs), from the space of X.
There are different possible ways to choose the SPs. A common approach is to use a symmetric set of \(S = 2D + 1\) SPs as described in Eqs. (32) to (34).
where \(\left[ \sqrt{D \, \varSigma _{}}\right] _s\) denotes the s-th row of the matrix square root of \(D \, \varSigma _{}\), which is the covariance matrix \(\varSigma _{}\) of X.
Given the SPs and a set of weights \(\{k_s\}^S_{s=1} \subset {\mathbb {R}}\), we can approximate the moments of \(\psi \) using a simple set of rules. For instance \({\mathbb {E}}[\psi ]\) and \(\mathrm {Cov}(\psi )\) can then be approximated using the following equations:
Although there is no restriction on their sign, the weights \(k_1, \ldots , k_S\) must respect the convexity constraint.
to provide an unbiased estimate [22]. In this paper, we set \(k_1 = k_2 = \cdots = k_S = 1/S\).
Rights and permissions
About this article
Cite this article
Mesquita, D.P.P., Gomes, J.P.P. & Rodrigues, L.R. Artificial Neural Networks with Random Weights for Incomplete Datasets. Neural Process Lett 50, 2345–2372 (2019). https://doi.org/10.1007/s11063-019-10012-0
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11063-019-10012-0