1 Introduction

Pre-training, the breakthrough that made training strategies for deep architectures effective, rests on the same approach regardless of the underlying algorithm, be it the restricted Boltzmann machine (RBM) (Hinton et al. 2006) or the autoencoder (Bengio et al. 2007): an unsupervised training phase followed by a supervised fine-tuning phase. A comparison between Bernoulli RBM layers, stacked denoising autoencoders and standard feed-forward multilayer neural networks (Erhan et al. 2009, 2010) shows empirically that unsupervised pre-training not only provides a relatively good set of initial values for the fine-tuning phase but also seems to act as a regularizer. These properties persist across different network structures, dataset sizes and orderings of the observations, which demonstrates the importance of the starting point in this non-convex optimization problem.

The existing research on unsupervised pre-training (Yu et al. 2010) mostly relies on the same fine-tuning method: back-propagation (BP) with gradient descent or one of its improved variants. BP-based tuning requires iterative updates of the parameters according to the loss function, which often takes a considerable amount of time and raises cost-effectiveness concerns. In this paper, focusing on a non-iterative tuning phase, we incorporate the idea of RBM pre-training into extreme learning machines (ELMs) and evaluate the resulting RBM-ELM model. The ELM, a type of single-hidden-layer feed-forward neural network (SLFN) proposed in Huang et al. (2004, 2006), has been studied extensively in recent decades (Wang et al. 2012, 2017b, 2018). Some researchers focus on various data types and training tasks (Ding et al. 2017a, b; Wang et al. 2015), and several studies have shown that the ELM is capable of handling big data (Mao et al. 2017; Zhai et al. 2017; Wang et al. 2017a; Li et al. 2017; Zhang et al. 2017). The training of the original ELM consists of two parts: first, the weights and biases between the input and hidden layers are randomly assigned; second, the weights between the hidden and output layers are obtained by solving a system of linear equations using the generalized inverse (GI). Here, instead of random assignment, an RBM is used as an unsupervised pre-training phase for the weights and biases between the input and hidden layers; the weights between the hidden and output layers are then solved analytically by the generalized inverse.

Noting that such an approach for SLFNs has already been mentioned in Pacheco et al. (2017) and Chen et al. (2017) and extended to multiple-hidden-layer feed-forward neural networks (MLFNs) in Wang et al. (2017c) and Meng et al. (2017), we focus, through extensive experimentation, on how the RBM affects the initial values of the network and on what causes the performance difference between RBM-ELM and the plain ELM across various datasets and hidden layer sizes. In contrast to previous studies, our experimental results show that RBM pre-training does not always improve generalization ability. Following this observation, we find that for some datasets the trained RBM leads to a hidden layer matrix with low variance. In this situation the hidden layer matrix can be viewed as a constant matrix plus a perturbation, which causes a large variance in the weight matrix between the hidden and output layers due to the discontinuity of the generalized inverse. Theoretical analysis and explanations related to matrix perturbation are given.

The rest of this paper is organized as follows. Section 2 gives a brief review of the concepts and algorithms of the ELM and the RBM. Section 3 presents theoretical results on the generalized inverse. Details of the RBM-ELM model and its performance evaluation are given in Sect. 4. Section 5 concludes the paper.

2 Related work

2.1 Extreme learning machine

The ELM is a three-layer feed-forward network with a single hidden layer in which the weights and biases between the input and hidden layers are randomly assigned and the weights between the hidden and output layers are solved from a system of linear equations using the generalized inverse. A simple ELM structure for a regression problem is shown in Fig. 1, with n nodes in the input layer, m nodes in the hidden layer and only one node in the output layer; for a classification problem, the number of output nodes equals the number of categories.

Fig. 1 A simple ELM structure

Given a set of samples \({\mathbf {S}}=\{({\mathbf {x}}_i,{\mathbf {t}}_i) | {\mathbf {x}}_i \in {\mathbf {R}}^n, {\mathbf {t}}_i \in {\mathbf {R}}^d\}_{i=1}^{N}\), the training process of the ELM is to determine the model parameters \(\{w_{ij},b_j,\beta _j\}\) \((i=1,2,\ldots ,n;j=1,2,\ldots ,m)\). Since the weights \(w_{ij}\) and biases \({b_j}\) are randomly selected, the training process reduces to determining the connections \(\beta _j\) between the hidden and output layers. Let

$$\begin{aligned} {\mathbf {G}}_{N\times m} = \left[ \begin{array}{ccc} {\mathbf {w}}_1{\mathbf {x}}_1 + b_1 &{} \ \cdots \ &{} {\mathbf {w}}_m{\mathbf {x}}_1 + b_m\\ {\mathbf {w}}_1{\mathbf {x}}_2 + b_1&{} \ \cdots \ &{} {\mathbf {w}}_m{\mathbf {x}}_2 + b_m\\ \vdots &{}\ \ddots \ &{} \vdots \\ {\mathbf {w}}_1{\mathbf {x}}_N + b_1&{} \ \cdots \ &{} {\mathbf {w}}_m{\mathbf {x}}_N + b_m \end{array} \right] \end{aligned}$$
(1)

be the middle matrix, where \(\mathbf {w_j}\) is the jth column of the weight matrix \(\mathbf {W}\) between the input and hidden layers. Let \(g(\cdot )\) be the sigmoid function and \({\mathbf {H}}\) be the hidden layer matrix; then

$$\begin{aligned} {\mathbf {H}}_{N\times m} = [g({\mathbf {G}})]_{N\times m} = [h_{ij}]_{N\times m} \end{aligned}$$
(2)

Suppose the target matrix is \({\mathbf {T}} = [{\mathbf {t}}_1,{\mathbf {t}}_2,\ldots ,{\mathbf {t}}_N]^T\); then the training of the ELM reduces to solving the system of linear equations \({\mathbf {H}}\varvec{\beta }={\mathbf {T}}\). In general, this system does not have a unique solution. It is suggested in Huang et al. (2006, 2012) to use the minimum-norm least-squares solution, so instead of solving the system of linear equations directly, the optimization problem becomes:

$$\begin{aligned} \min _{||\varvec{\beta }||}(\min _{\varvec{\beta }\in {\mathbf {R}}^m}|| {\mathbf {T}} - {\mathbf {H}}\varvec{\beta } ||) \end{aligned}$$
(3)

which means that \(\varvec{\beta }_0\) is the solution of (3) if it has the smallest norm among all least-squares solutions. This solution is given by the Moore–Penrose generalized inverse \({\mathbf {H}}^\dagger \):

$$\begin{aligned} {\mathbf {H}}\varvec{\beta }={\mathbf {T}} \rightarrow \hat{\varvec{\beta }} = {\mathbf {H}}^\dagger {\mathbf {T}} \end{aligned}$$
(4)

Now the training process of an ELM can be viewed as three steps:

1. Increasing the data dimension from the input data \({\mathbf {S}}\) to the middle matrix \({\mathbf {G}}\). In most cases, the number of hidden nodes m is greater than the number of input attributes n;

2. Transforming the middle matrix \({\mathbf {G}}\) into the hidden layer matrix \({\mathbf {H}}\), whose rank is increased by the sigmoid activation function;

3. Solving a system of linear equations whose coefficient matrix has full rank.
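To make these three steps concrete, the following minimal NumPy sketch implements Eqs. (1)–(4); the function and variable names are illustrative and not taken from any particular ELM implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_elm(X, T, m, seed=0):
    """X: N x n input matrix; T: N x d target matrix; m: number of hidden nodes."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n, m))  # step 1: random input-to-hidden weights
    b = rng.uniform(-1.0, 1.0, size=m)       # ... and random hidden biases
    H = sigmoid(X @ W + b)                   # step 2: middle matrix G = XW + b, then H = g(G)
    beta = np.linalg.pinv(H) @ T             # step 3: minimum-norm least-squares solution, Eq. (4)
    return W, b, beta

def predict_elm(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta
```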

In the following, we give some propositions and remarks regarding the ELM training phase.

Proposition 1

Assume that \({\mathbf {v}} = \{{\mathbf {v}}_1,{\mathbf {v}}_2,\ldots ,{\mathbf {v}}_N\}\), \({\mathbf {v}}_i = \{v_{i1},v_{i2},\ldots ,v_{in}\}\), \(i = 1,2,\ldots ,N\) denotes a set of n-dimensional vectors, such that \(1 \le rank({\mathbf {v}}) \le n\). Then with probability 1, the sigmoid transformation will transfer \({\mathbf {v}}\) into a set of vectors of full rank:

$$\begin{aligned} rank({\mathbf {H}}) = n \quad w.p.1 \end{aligned}$$
(5)

where \({\mathbf {H}} = \{\mathbf {h}_1,\mathbf {h}_2,\ldots ,\mathbf {h}_N\}\), \(\mathbf {h}_i = \{h_{i1},h_{i2},\ldots ,h_{in}\}\), \(h_{ij} = sigmoid(v_{ij}) = 1/(1+e^{-v_{ij}})\), \(i = 1,2,\ldots ,N\), \(j = 1,2,\ldots ,n\).

Proposition 2

The generalized inverse \({\mathbf {H}}^\dagger \) is continuous if \({\mathbf {H}}\) is a full rank matrix.

The proofs of Propositions 1 and 2 can be found in Fu et al. (2014). According to the ELM training process and these two propositions, we have the following remarks.

Remark 1

In step 2, the middle matrix \({\mathbf {G}}\) comes from the input data \({\mathbf {S}}\) via a linear transformation and is generally rank-deficient. Proposition 1 guarantees that the sigmoid transformation turns the rank-deficient matrix \({\mathbf {G}}\) into a full rank matrix \({\mathbf {H}}\).

Remark 2

In step 3, to obtain the minimum-norm least-squares estimator of \(\varvec{\beta }\), the generalized inverse of a full rank matrix is computed. Proposition 2 guarantees the stability of the solution.

Remark 3

Since the input matrix of the ELM is transformed into the middle matrix by a set of random parameters, the relationship between the perturbation of the weights and the sensitivity of the solution measures the stability of the model. According to the experimental results in Fu et al. (2014, Section 5), the full rank matrix \({\mathbf {H}}\) is insensitive to such perturbations and yields a more stable solution of \({\mathbf {H}}\varvec{\beta }={\mathbf {T}}\).

2.2 Restricted Boltzmann machine

The RBM is a generative stochastic model that can be used to capture the probability distribution over a set of inputs (Hinton et al. 2012). Figure 2 shows the structure of an RBM. The topology of an RBM is a two-layer bipartite graph: the lower visible layer, denoted \({\mathbf {v}} = [v_1, v_2, \ldots , v_n]\), receives the input data, and the upper hidden layer, denoted \(\mathbf {h} = [h_1, h_2, \ldots ,h_m]\), generates a new vector based on the visible layer and the training algorithm.

Fig. 2 Structure of an RBM

Suppose a training set \(S = \{{\mathbf {v}}_1, {\mathbf {v}}_2, \ldots , {\mathbf {v}}_N\}\) contains N observations, where \({\mathbf {v}}^{\tau } = [v_{\tau 1}, v_{\tau 2}, \ldots , v_{\tau n} ]\), \(\tau = 1, \ldots ,N\), is the \(\tau \)th sample observation. Then, the energy (Smolensky 1986) of an RBM configuration is defined as follows:

$$\begin{aligned} E_{\theta }(\mathbf {v,h})= - \sum \limits _{j=1}^{n}a_j v_j- \sum \limits _{i=1}^{m}b_ih_i - \sum \limits _{j=1}^{n} \sum \limits _{i=1}^{m} w_{ij} v_j h_i \end{aligned}$$
(6)

where \(\mathbf {a} = [a_1, a_2, \ldots , a_n ] \in {\mathbf {R}}^n \) is the visible layer bias, \(\mathbf {b} = [b_1, b_2, \ldots , b_m ] \in {\mathbf {R}}^m\) is the hidden layer bias, \(\mathbf {W} = [w_{ij}] \in {\mathbf {R}}^{m\times n}\) is the weight matrix and \(\theta = \{\mathbf {W},\mathbf {a},\mathbf {b}\}\) denotes the set of model parameters. According to (6), the probability that the system is in the current configuration is given by

$$\begin{aligned} p_{\theta }\mathbf {(v,h)} = \frac{1}{Z_\theta }e^{-E_\theta \mathbf {(v,h)}} \end{aligned}$$
(7)

where \(Z_\theta = \sum _{\mathbf {v,h}}e^{-E_\theta (\mathbf {v,h})}\) is the normalization factor. From the joint distribution (7), we obtain the marginal probability distribution of \({\mathbf {v}}\):

$$\begin{aligned} p_\theta ({\mathbf {v}}) = \frac{1}{Z_\theta }\sum \limits _{\mathbf {h}}e^{-E_\theta \mathbf {(v,h)}} \end{aligned}$$

The objective of training an RBM is to minimize the Kullback–Leibler divergence between the true data distribution \(p_{t}\) and the model distribution \(p_{\theta }\), i.e., \(D_{KL}(p_{t}||p_{\theta })\). This is equivalent to maximizing the log-likelihood function

$$\begin{aligned} {\mathcal {L}}(\theta ;{\mathbf {v}}) = \sum \limits _{\tau =1}^{N}\ln \frac{1}{Z_\theta }\sum \limits _{\mathbf {h}}e^{-E_\theta (\mathbf {v}^{\tau },\mathbf {h})} \end{aligned}$$

Assume the data in the visible and hidden layers all follow Bernoulli distributions, so that the value of each node lies in \(\{0,1\}\). For numerical attributes, one can refer to the Gaussian–Bernoulli RBM in Hinton (2010) and Salakhutdinov et al. (2007). Differentiating the log-likelihood with respect to \(\theta \) gives the following:

$$\begin{aligned} \begin{aligned} \frac{\partial {\mathcal {L}}(\theta ;{\mathbf {v}})}{\partial w_{ij}} =&\sum \limits _{\tau =1}^{N}\Big [p(h_i=1|{\mathbf {v}}^\tau )v^{\tau }_{j} - \sum \limits _{{\mathbf {v}}}p({\mathbf {v}})p(h_i=1|{\mathbf {v}})v_{j}\Big ]\\ \frac{\partial {\mathcal {L}}(\theta ;{\mathbf {v}})}{\partial a_j} =&\sum \limits _{\tau =1}^{N}\Big [v^{\tau }_{j} - \sum \limits _{{\mathbf {v}}}p({\mathbf {v}})v_{j}\Big ]\\ \frac{\partial {\mathcal {L}}(\theta ;{\mathbf {v}})}{\partial b_i} =&\sum \limits _{\tau =1}^{N}\Big [p(h_i=1|{\mathbf {v}}^\tau ) - \sum \limits _{{\mathbf {v}}}p({\mathbf {v}})p(h_i=1|{\mathbf {v}})\Big ] \end{aligned} \end{aligned}$$
(8)

where the conditional probabilities are given by

$$\begin{aligned} {\left\{ \begin{array}{ll} p(h_i=1|{\mathbf {v}}) = {\mathrm {sigmoid}}\left( b_i + \sum \limits _{j=1}^{n}w_{ij}v_j\right) \\ p(v_j=1|\mathbf {h}) = {\mathrm {sigmoid}}\left( a_j + \sum \limits _{i=1}^{m}w_{ij}h_i\right) \end{array}\right. } \end{aligned}$$

The term \(\sum _{{\mathbf {v}}}p({\mathbf {v}})\), which stems from \(Z_\theta \), appears on the right-hand side of all three equations in (8), and its computational complexity is \(O(2^{m+n})\). Given realistic numbers of visible and hidden nodes, updating the parameters with these exact gradient formulas is intractable. An efficient approximation method named contrastive divergence (CD) was introduced by Hinton (2006). The main idea of k-step CD is that instead of minimizing \(D_{KL}(p_{t}||p_{\theta })\), the method minimizes the difference between \(p_{t}\) and \(p_{\theta }^k\), i.e., \(D_{KL}(p_{t}||p_{\theta }^k)\), where \(p_{\theta }^k\) is the distribution over the k-step reconstructions of the data vectors generated by Gibbs sampling. In practice, one usually chooses \(k=1\); the resulting CD-1 updates are listed below

$$\begin{aligned} \begin{aligned} \frac{\partial {\mathcal {L}}(\theta ;{\mathbf {v}})}{\partial w_{ij}} \approx&\sum \limits _{\tau =1}^{N}\Big [p(h_i=1|{\mathbf {v}}^{(\tau ,0)})v^{(\tau ,0)}_{j} \\&- p(h_i=1|{\mathbf {v}}^{(\tau ,1)})v^{(\tau ,1)}_{j}\Big ]\\ \frac{\partial {\mathcal {L}}(\theta ;{\mathbf {v}})}{\partial a_j} \approx&\sum \limits _{\tau =1}^{N}\Big [v^{(\tau ,0)}_{j} - v^{(\tau ,1)}_{j}\Big ]\\ \frac{\partial {\mathcal {L}}(\theta ;{\mathbf {v}})}{\partial b_i} \approx&\sum \limits _{\tau =1}^{N}\Big [p(h_i=1|{\mathbf {v}}^{(\tau ,0)}) - p(h_i=1|{\mathbf {v}}^{(\tau ,1)})\Big ] \end{aligned} \end{aligned}$$
(9)

where \({\mathbf {v}}^{(\tau ,0)}\) is the starting point of the Gibbs chain, taken from the training dataset, and \({\mathbf {v}}^{(\tau ,1)}\) is the point sampled after one step of the CD-1 algorithm. We use (9) in place of the exact gradient formulas (8).
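For the Bernoulli case, the CD-1 update in (9) can be sketched in a few lines of NumPy. The batch-averaged gradients and the helper names below are illustrative choices of ours, not details of the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V0, W, a, b, eta, rng=None):
    """One CD-1 step for a Bernoulli RBM.
    V0: N x n batch of visible vectors; W: m x n weights; a: visible bias (n,); b: hidden bias (m,)."""
    rng = rng or np.random.default_rng(0)
    ph0 = sigmoid(V0 @ W.T + b)                        # p(h=1 | v^(tau,0)), positive phase
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden states
    pv1 = sigmoid(h0 @ W + a)                          # reconstruct the visible layer
    V1 = (rng.random(pv1.shape) < pv1).astype(float)   # v^(tau,1)
    ph1 = sigmoid(V1 @ W.T + b)                        # p(h=1 | v^(tau,1)), negative phase
    # gradient approximations of Eq. (9), averaged over the batch rather than summed
    dW = (ph0.T @ V0 - ph1.T @ V1) / V0.shape[0]
    da = (V0 - V1).mean(axis=0)
    db = (ph0 - ph1).mean(axis=0)
    return W + eta * dW, a + eta * da, b + eta * db
```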

3 Perturbation of generalized inverse

In this section, we will give some new theoretical findings for the ELM training phase, based on the preliminaries introduced in Sect. 2.

From Proposition 2, we already know that the generalized inverse is continuous if the matrix has full rank. We now focus on a special case: the matrix \({\mathbf {H}}\) is still of full rank, but its entries are all close to a constant. We have two aims in this section. The first is to prove the discontinuity of the generalized inverse when the matrix \({\mathbf {H}}\) is not of full rank. The second is to discuss the pattern and trend of the hidden layer output when the generalized inverse is discontinuous. Because the generalized inverse is closely related to the singular value decomposition (SVD), we start from the Courant–Fischer min–max theorem and prove a perturbation rule for the singular values.

Proposition 3

(Courant–Fischer) Suppose \({\mathbf {A}}\) is an \(n\times n\) real symmetric matrix, and let \(\{\mathbf {p}_1,\mathbf {p}_2,\ldots ,\mathbf {p}_r\}\), \(r = 0,1,\ldots ,n-1\), denote a set of n-dimensional vectors. Enumerate the n eigenvalues of \({\mathbf {A}}\), \(\lambda _1,\lambda _2,\ldots ,\lambda _n\), in increasing order, i.e., \(\lambda _1 \le \cdots \le \lambda _n\). Then, we have

$$\begin{aligned}&\min _{\{\mathbf {p}_i\}} \max _{\begin{array}{c} ||{\mathbf {x}}||=1\\ {\mathbf {x}}^T\mathbf {p_i}=0 \end{array}}{\mathbf {x}}^T{\mathbf {A}}{\mathbf {x}} = \lambda _{r+1} \end{aligned}$$
(10)
$$\begin{aligned}&\max _{\{\mathbf {p}_i\}} \min _{\begin{array}{c} ||{\mathbf {x}}||=1\\ {\mathbf {x}}^T\mathbf {p_i}=0 \end{array}}{\mathbf {x}}^T{\mathbf {A}}{\mathbf {x}} = \lambda _{n-r} \end{aligned}$$
(11)

The proof of Proposition 3 can be found in Parlett (1998). Now suppose we have an \(m\times n \) matrix \({\mathbf {A}}\) with \(rank({\mathbf {A}}) = k\); the norms of \({\mathbf {A}}\) and its generalized inverse \({\mathbf {A}}^\dagger \) can be expressed in terms of its singular values.

Proposition 4

Suppose the singular values of \({\mathbf {A}}^{m\times n }\) are \(\lambda _1 \ge \lambda _2 \ge \cdots \ge \lambda _k > 0\), then

$$\begin{aligned} ||{\mathbf {A}}|| = \lambda _1 \quad {\mathrm {and}} \quad ||{\mathbf {A}}^\dagger || = \lambda _k^{-1} \end{aligned}$$
(12)

Proof

By the definition of the matrix norm,

$$\begin{aligned} ||{\mathbf {A}}|| = \max _{||{\mathbf {x}}||=1}||{\mathbf {A}}{\mathbf {x}}||, \quad {\mathbf {x}} \in {\mathbf {R}}^n \end{aligned}$$

According to the definition of Euclidean norm

$$\begin{aligned} ||{\mathbf {A}}{\mathbf {x}}||^2 = {\mathbf {x}}^T{\mathbf {A}}^T{\mathbf {A}}{\mathbf {x}} \end{aligned}$$

The eigenvalues of \({\mathbf {A}}^T{\mathbf {A}}\) are \(\lambda _1^2 \ge \lambda _2^2 \ge \cdots \ge \lambda _k^2\) and eigenvectors \({\mathbf {v}}_1,{\mathbf {v}}_2,\ldots ,{\mathbf {v}}_k\), so

$$\begin{aligned} \begin{aligned} \max _{||{\mathbf {x}}||=1}||{\mathbf {A}}{\mathbf {x}}||^2&= \max _{||{\mathbf {x}}||=1}\left( {\mathbf {x}}^T{\mathbf {A}}^T{\mathbf {A}}{\mathbf {x}}\right) \\&= \max _{||{\mathbf {x}}||=1} {\mathbf {x}}^T\left( \sum \limits _{i=1}^{k}\lambda _i^2{\mathbf {v}}_i{\mathbf {v}}_i^T\right) {\mathbf {x}}\\&= \max _{||{\mathbf {x}}||=1}\sum \limits _{i=1}^{k}\lambda _i^2 \left( {\mathbf {x}}^T{\mathbf {v}}_i\right) ^2 \end{aligned} \end{aligned}$$

Since \(\sum _{i=1}^{k}({\mathbf {x}}^T{\mathbf {v}}_i)^2 \le 1\), it follows that \(\max \limits _{||{\mathbf {x}}||=1}||{\mathbf {A}}{\mathbf {x}}||^2 \le \lambda _1^2\). Taking \({\mathbf {x}} = {\mathbf {v}}_1\) gives

$$\begin{aligned} {\mathbf {x}}^T{\mathbf {A}}^T{\mathbf {A}}{\mathbf {x}} = \lambda _1^2 \leftrightarrow \max _{||{\mathbf {x}}||=1}||{\mathbf {A}}{\mathbf {x}}||^2 = \lambda _1^2 \leftrightarrow ||{\mathbf {A}}|| = \lambda _1 \end{aligned}$$

Now consider \(||{\mathbf {A}}^\dagger ||\). Assume \({\mathbf {A}}\) has the compact singular value decomposition (SVD) \({\mathbf {A}} = \mathbf {U}\varvec{\Sigma }\mathbf {V}^T\); then \( {\mathbf {A}}^\dagger = \mathbf {V}\varvec{\Sigma }^{-1}\mathbf {U}^T\), where

$$\begin{aligned}&\varvec{\Sigma } = \left[ \begin{array}{ccc} \lambda _1 &{} &{} \\ &{} \ \ddots \ &{} \\ &{} &{} \lambda _k \end{array} \right] \quad {\mathrm {and}} \quad \varvec{\Sigma }^{-1} = \left[ \begin{array}{ccc} \lambda _1^{-1} &{} &{} \\ &{} \ \ddots \ &{} \\ &{} &{} \lambda _k^{-1} \end{array} \right] \\&\begin{aligned} ||{\mathbf {A}}^\dagger ||^2&=\max _{||{\mathbf {x}}||=1} ||{\mathbf {A}}^\dagger {\mathbf {x}}||^2\\&= \max _{||{\mathbf {x}}||=1}\left\{ \left( \mathbf {V}\varvec{\Sigma }^{-1}\mathbf {U}^T{\mathbf {x}}\right) ^T\left( \mathbf {V}\varvec{\Sigma }^{-1}\mathbf {U}^T{\mathbf {x}}\right) \right\} \\&= \max _{||\mathbf {y}||=1}\mathbf {y}^T\varvec{\Sigma }^{-2}\mathbf {y} \end{aligned} \end{aligned}$$

In the same way as for the norm of \({\mathbf {A}}\), the norm of \({\mathbf {A}}^\dagger \) is the square root of the largest eigenvalue of \(\varvec{\Sigma }^{-2}\), which is \(\lambda _k^{-1}\). \(\square \)

Now suppose a small perturbation \(\varvec{\delta }{\mathbf {A}}\) is applied to \({\mathbf {A}}\) and let \({\mathbf {B}} = {\mathbf {A}} +\varvec{\delta }{\mathbf {A}}\). Regarding the singular values of \({\mathbf {A}}\) and \({\mathbf {B}}\), we have the following proposition.

Proposition 5

Suppose \(rank({\mathbf {A}}) = rank({\mathbf {B}}) = k\) and the singular values of \({\mathbf {A}}\) are \(\lambda _1 \ge \lambda _2 \ge \cdots \ge \lambda _k\); since \({\mathbf {B}}\) has the same rank as \({\mathbf {A}}\), \({\mathbf {B}}\) has singular values \(\sigma _1 \ge \sigma _2 \ge \cdots \ge \sigma _k\). Then, for \(i = 1,2,\ldots ,k\),

$$\begin{aligned} \sigma _i \le \lambda _i + ||\varvec{\delta }{\mathbf {A}}|| \end{aligned}$$
(13)

Proof

According to the singular value decomposition (SVD), \({\mathbf {A}}^T{\mathbf {A}}\) has eigenvalues \(\lambda _1^2,\lambda _2^2,\ldots ,\lambda _k^2\) with eigenvectors \({\mathbf {v}}_1,{\mathbf {v}}_2,\ldots ,{\mathbf {v}}_k\). Applying the min–max rule of Proposition 3 with \(\mathbf {p}_i = {\mathbf {v}}_i\), we have

$$\begin{aligned} \begin{aligned} \sigma ^2_{r+1}&\le \max _{\begin{array}{c} ||{\mathbf {x}}||=1\\ {\mathbf {x}}^T\mathbf {p_i}=0 \end{array}} {\mathbf {x}}^T{\mathbf {B}}^T{\mathbf {B}}{\mathbf {x}} \\&= \max _{\begin{array}{c} ||{\mathbf {x}}||=1\\ {\mathbf {x}}^T\mathbf {p_i}=0 \end{array}}{\mathbf {x}}^T\left( {\mathbf {A}} +\varvec{\delta } {\mathbf {A}}\right) ^T\left( {\mathbf {A}} +\varvec{\delta } {\mathbf {A}}\right) {\mathbf {x}}\\&\le \max _{\begin{array}{c} ||{\mathbf {x}}||=1\\ {\mathbf {x}}^T\mathbf {p_i}=0 \end{array}} \left\{ \left( {\mathbf {x}}^T{\mathbf {A}}^T{\mathbf {A}}{\mathbf {x}}\right) ^\frac{1}{2} +\left( {\mathbf {x}}^T\left( \varvec{\delta }{\mathbf {A}}\right) ^T\left( \varvec{\delta }{\mathbf {A}}\right) {\mathbf {x}}\right) ^ \frac{1}{2}\right\} ^2\\&\le \Big \{ \max _{\begin{array}{c} ||{\mathbf {x}}||=1\\ {\mathbf {x}}^T\mathbf {p_i}=0 \end{array}} \left( {\mathbf {x}}^T{\mathbf {A}}^T{\mathbf {A}}{\mathbf {x}}\right) ^\frac{1}{2} \\&\quad + \max _{\begin{array}{c} ||{\mathbf {x}}||=1\\ {\mathbf {x}}^T\mathbf {p_i}=0 \end{array}}\left( {\mathbf {x}}^T\left( \varvec{\delta } {\mathbf {A}}\right) ^T \left( \varvec{\delta }{\mathbf {A}}\right) {\mathbf {x}}\right) ^\frac{1}{2} \Big \}^2\\&\le \left( \lambda _{r+1} + ||\varvec{\delta }{\mathbf {A}}||\right) ^2, \quad r=0,1,\ldots , k-1. \end{aligned} \end{aligned}$$

Thus

$$\begin{aligned} \sigma _{r+1} \le \lambda _{r+1} + ||\varvec{\delta }{\mathbf {A}}||, \quad {\mathrm {i.e.,}} \quad \sigma _{i} \le \lambda _{i} + ||\varvec{\delta }{\mathbf {A}}||, \quad i=1,2,\ldots ,k \end{aligned}$$

\(\square \)

Proposition 6

If the \(m\times n\) (\(m < n\)) matrix \({\mathbf {A}}\) is rank-deficient, \(rank({\mathbf {A}}) = k < m\), and a small perturbation \(\varvec{\delta }{\mathbf {A}}\) increases the rank of \({\mathbf {B}} = {\mathbf {A}} + \varvec{\delta }{\mathbf {A}}\), i.e.,

$$\begin{aligned} rank({\mathbf {A}} + \varvec{\delta }{\mathbf {A}}) > rank({\mathbf {A}}) = k \end{aligned}$$
(14)

then we have the inequality:

$$\begin{aligned} ||({\mathbf {A}} +\varvec{\delta }{\mathbf {A}})^\dagger || \ge \frac{1}{||\varvec{\delta } {\mathbf {A}}||} \end{aligned}$$
(15)

Proof

Assume \(rank({\mathbf {A}} +\varvec{\delta }{\mathbf {A}}) = r > k\); then the rth singular value of the matrix \({\mathbf {A}}\) is \(\lambda _{r} = 0\). According to Proposition 5, the rth singular value \(\sigma _{r}\) of \({\mathbf {A}} +\varvec{\delta }{\mathbf {A}}\) satisfies

$$\begin{aligned} \sigma _{r} \le ||\varvec{\delta }{\mathbf {A}}|| \end{aligned}$$

Meanwhile, applying Proposition 4 to \({\mathbf {A}} +\varvec{\delta }{\mathbf {A}}\), whose rank is r, gives

$$\begin{aligned} ||({\mathbf {A}} +\varvec{\delta }{\mathbf {A}})^\dagger || = \frac{1}{\sigma _{r}} \end{aligned}$$

Therefore

$$\begin{aligned} ||({\mathbf {A}} +\varvec{\delta }{\mathbf {A}})^\dagger || \ge \frac{1}{||\varvec{\delta }{\mathbf {A}}||} \end{aligned}$$

\(\square \)

Example 1

Let \({\mathbf {H}} = \left[ \begin{array}{cc} 1 &{} 0 \\ 0 &{} 0 \\ 0 &{} 0 \end{array} \right] \); then \(rank({\mathbf {H}}) = 1\), so \({\mathbf {H}}\) is not of full rank. It is easy to calculate that \({\mathbf {H}}^\dagger = \left[ \begin{array}{ccc} 1 &{} 0 &{} 0\\ 0 &{} 0 &{} 0 \end{array} \right] \). Suppose that \(\varvec{\delta } {\mathbf {H}} = \left[ \begin{array}{cc} 0 &{} 0 \\ 0 &{} \epsilon \\ 0 &{} 0 \end{array} \right] \), \(\epsilon \ne 0\); then \({\mathbf {H}} + \varvec{\delta } {\mathbf {H}} = \left[ \begin{array}{cc} 1 &{} 0 \\ 0 &{} \epsilon \\ 0 &{} 0 \end{array} \right] \). Noting that the rank increases from 1 to 2 and \({\mathbf {H}} + \varvec{\delta } {\mathbf {H}}\) has full rank, we get \(({\mathbf {H}} + \varvec{\delta } {\mathbf {H}})^\dagger = \left[ \begin{array}{ccc} 1 &{} 0 &{} 0\\ 0 &{} \epsilon ^{-1} &{} 0 \end{array} \right] \). It is easy to see that the limit of \(({\mathbf {H}} + \varvec{\delta } {\mathbf {H}})^\dagger \) does not exist as \(\epsilon \rightarrow 0\), and the variance of its entries increases as \(\epsilon \) decreases.
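The blow-up in Example 1 is easy to verify numerically; the following NumPy snippet (an illustration of ours, not part of the original experiments) prints the norm of the generalized inverse for shrinking \(\epsilon \).

```python
import numpy as np

H = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 0.0]])
for eps in (1e-1, 1e-4, 1e-8):
    dH = np.zeros_like(H)
    dH[1, 1] = eps                                       # the perturbation of Example 1
    print(eps, np.linalg.norm(np.linalg.pinv(H + dH)))   # grows roughly like 1/eps
```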

Remark 4

Inequality (15) and Example 1 demonstrate that if \({\mathbf {A}}\) is a rank-deficient matrix and a perturbation increases the rank of \({\mathbf {A}} +\varvec{\delta }{\mathbf {A}}\), then the smaller \(||\varvec{\delta }{\mathbf {A}}||\) is, the larger \(||({\mathbf {A}} +\varvec{\delta }{\mathbf {A}})^\dagger ||\) becomes. When \(||\varvec{\delta }{\mathbf {A}}|| \rightarrow 0\), \(||({\mathbf {A}} +\varvec{\delta }{\mathbf {A}})^\dagger || \rightarrow \infty \), so the limit of the generalized inverse does not exist. We conclude that the generalized inverse \({\mathbf {H}}^\dagger \) is discontinuous if \({\mathbf {H}}\) is rank-deficient. This conclusion is related to the continuity of the singular values: for a diagonal matrix \(\varvec{\Sigma }\), the generalized inverse is obtained by taking the reciprocal of each nonzero element on the diagonal, leaving the zeros in place and then transposing the matrix.

Remark 5

In the next section, we will show that for some datasets the RBM pre-trained hidden layer outputs are close to one another. Although the hidden layer matrix is still of full rank, its low variance means that it can be viewed as a perturbation around a constant matrix. Thus, when the connections between the hidden and output layers are solved for, the generalized inverse is determined by the scale of the perturbation rather than by the values themselves. We call this degeneration of the hidden layer output “pseudo-full rank”; it has a negative impact on the generalization ability of the model. The situation may arise not only with unsupervised pre-training but also with random initialization. Monitoring the variance of the hidden layer matrix is therefore necessary when working with ELMs; a simple check is sketched below.
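A minimal sketch of such a check is given below; the variance threshold is an illustrative assumption rather than a value prescribed in this paper.

```python
import numpy as np

def pseudo_full_rank(H, var_tol=1e-3):
    """Flag a hidden layer matrix that is full rank yet hovers around a single constant."""
    full_rank = np.linalg.matrix_rank(H) == min(H.shape)
    return full_rank and H.var() < var_tol
```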

4 Algorithm and experiments

In this section, we conduct empirical studies to validate the theoretical findings of Sect. 3. More specifically, we determine the hidden layer of an ELM with an unsupervised pre-training strategy. Instead of randomly selecting the weights and biases, we train an RBM based on the input data S, the number of visible nodes n and the number of hidden nodes m. The training steps of RBM-ELM are given in Algorithm 1. Benchmark datasets with different sizes and numbers/types of attributes are then chosen from the UCI Machine Learning Repository (Lichman 2013) and handwritten digit databases (LeCun et al. 2010), and an empirical comparison is made between the original ELM and RBM-ELM.

Algorithm 1 Training steps of RBM-ELM
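The RBM-ELM training flow of Algorithm 1 can be sketched roughly as follows, reusing the sigmoid and cd1_update helpers from the sketches in Sect. 2; the hyper-parameter names and the plain full-batch epoch loop are assumptions of ours rather than details taken from Algorithm 1.

```python
import numpy as np

def train_rbm_elm(X, T, m, eta=0.1, iters=50, seed=0):
    """X: N x n inputs scaled to [0, 1]; T: N x d targets; m: number of hidden nodes."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.normal(0.0, 0.01, size=(m, n))          # RBM weights (m hidden x n visible)
    a = np.zeros(n)                                 # visible bias
    b = np.zeros(m)                                 # hidden bias
    for _ in range(iters):                          # unsupervised pre-training with CD-1
        W, a, b = cd1_update(X, W, a, b, eta, rng)  # from the sketch in Sect. 2.2
    H = sigmoid(X @ W.T + b)                        # pre-trained W, b replace the random ones
    beta = np.linalg.pinv(H) @ T                    # output weights via the generalized inverse
    return W, b, beta
```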

In these experiments, we mainly focus on the following aspects: (1) the significance of the performance differences; (2) the influence of the hidden layer size; (3) the variation of the parameters during training. Finally, we summarize the effect of RBM unsupervised pre-training on the ELM.

Table 1 Datasets for performance comparison
Table 2 Tenfold cross-validation of selected datasets

4.1 Datasets and experiment

We collect 14 representative classification datasets covering various learning fields. These datasets contain both categorical and numerical attributes and are used to evaluate the performance of RBM-ELM. The metadata of these datasets, together with the training configurations, are shown in Table 1.

Datasets 1–5 have categorical attributes and datasets 6–14 have numerical attributes. All numerical attributes are normalized to the range [0, 1]. To handle categorical data, each category is converted into a list of indicator values. For example, suppose the categorical attribute \(F=\{x,y,z\}\). We transform F into 3 binary attributes, taking the value 0 or 1, which are indicators of x, y and z, respectively. Since one categorical attribute is expanded into a list of indicators, the final dataset fed to the model may have a noticeably larger number of attributes.
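A minimal sketch of this indicator encoding, assuming pandas is available (the paper does not state which tool was used):

```python
import pandas as pd

# Categorical attribute F with categories {x, y, z} expanded into indicator columns.
df = pd.DataFrame({"F": ["x", "y", "z", "x"]})
print(pd.get_dummies(df, columns=["F"]))  # indicator columns F_x, F_y, F_z
```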

Both ELM and RBM-ELM are trained according to the parameter column in Table 1, and the same number of hidden nodes m is used for both models. For RBM-ELM, \(\eta \) is the learning rate and ITER is the maximum number of iterations in Algorithm 1. To verify the generalization ability comprehensively and avoid over-fitting, tenfold cross-validation is applied to each dataset. During training, the original dataset is randomly partitioned into 10 equal-sized subsets. Of the 10 subsets, a single subset is retained as the validation set for testing the model, and the remaining 9 subsets combined are used as training samples. The cross-validation process is then repeated 10 times, with each of the 10 subsets used exactly once as the validation set. The mean and standard deviation of the testing accuracy on the validation sets are used for performance comparison. The algorithms are implemented in Python 3.5 on a machine with an AMD Ryzen 7 1700X CPU, 32 GB RAM and a 64-bit Ubuntu 17.04 operating system.
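The cross-validation loop can be sketched as follows, assuming scikit-learn's KFold; the paper does not specify the implementation, so the function names here are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_accuracy(X, Y, fit, predict, n_splits=10, seed=0):
    """X, Y: NumPy arrays; fit(X, Y) returns a model; predict(model, X) returns labels."""
    accs = []
    for tr, te in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        model = fit(X[tr], Y[tr])                             # train on 9 folds
        accs.append(np.mean(predict(model, X[te]) == Y[te]))  # test on the held-out fold
    return np.mean(accs), np.std(accs)                        # mean and std of testing accuracy
```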

Fig. 3 Different network structures

Fig. 4 Weight variance comparison between ELM and RBM-ELM on Letter

4.2 Result analysis

The experimental results are listed in Table 2. The highest testing accuracy is shown in bold face, and Student's t test is used to assess the significance of the difference between the two models. From the table, we can see that the testing accuracy of RBM-ELM is better than that of ELM on 8 out of 14 datasets, and the small p values indicate that the differences between the two models are significant. The situation is reversed on 5 other datasets, where ELM performs better. In Sect. 4.3, we will show that this low accuracy of RBM-ELM is caused by the discontinuity of the generalized inverse proved in Sect. 3. Furthermore, for dataset ID = 8, the two algorithms give almost the same performance. Hence neither ELM nor RBM-ELM is uniformly better with respect to testing accuracy; in other words, the effect of RBM pre-training depends on the dataset. Note that the above experiments are conducted with a fixed hidden layer size, with the same number of hidden nodes for both models. Whether changing the hidden layer size has a critical impact on the testing accuracy is still unknown; in particular, we are concerned with whether the performance advantage remains when the number of hidden nodes changes. To answer this question, we conduct an additional experiment. For each dataset, 90% of the samples are selected as the training set and the remaining 10% as the testing set. Starting close to the number of input attributes, we increase the number of hidden layer nodes in fixed steps and record the resulting testing accuracies.

Fig. 5 Weight variance comparison between ELM and RBM-ELM on MNIST

Fig. 6 Hidden layer output comparison between ELM and RBM-ELM on Letter

Fig. 7 Hidden layer output comparison between ELM and RBM-ELM on MNIST

In Fig. 3, each dataset in Table 1 is trained with different hidden layer structures. Overall, the accuracy increases with more hidden nodes, and the performance advantage holds when the hidden layer size changes. For example, the first sub-figure shows the MNIST handwritten digits dataset. With the hidden layer size varying from 50 to 800 in steps of 50, the RBM-ELM testing accuracy (green curve) is always above the ELM testing accuracy (blue curve), which matches the significance result in Table 2.

4.3 Network comparison

According to the previous results, we have found a significant performance difference between RBM-ELM and ELM in most cases. Since the hidden nodes are designed using an RBM, it is worth tracking the values of the parameters and of intermediate results such as the hidden layer matrix and its generalized inverse. Looking into such details, we can tell that the difference between the two algorithms lies not only in the final performance but also inside the training process. Thus, we carry out another experiment comparing the statistics of the weight matrix \(\mathbf {W}\), the hidden layer matrix \({\mathbf {H}}\) and its generalized inverse \({\mathbf {H}}^\dagger \), particularly on the datasets on which RBM-ELM performs worse than ELM.

For the sake of simplicity, the visualization in this part is conducted only on a negative case (RBM-ELM performs worse than ELM), the Letter Recognition dataset, and a positive case (RBM-ELM performs better than ELM), the MNIST dataset; the patterns on the other datasets are similar. With 800 hidden nodes, Figs. 4 and 5 show the distribution of weight variances for Letter Recognition and MNIST, respectively. In the left two sub-figures, each dot stands for the variance of the weights connecting to one hidden node, i.e., the column variance of \(\mathbf {W}\). For example, the input dataset has 16 attributes and the hidden layer has 800 nodes, so hidden node \(h_1\) has 16 connections, \(w_{1,1}\) to \(w_{16,1}\), and the first dot represents the variance of these connections. Surprisingly, Fig. 4 shows that the RBM pre-trained weight matrix for Letter Recognition has very low column variance, which means the weights differ only slightly from one another. As discussed in Sect. 2 and Algorithm 1, the weight matrix between the visible and hidden layers transforms the input data into the hidden layer matrix. Therefore, with a low-variance weight matrix, nearly identical hidden layer outputs can be expected. As a result, RBM-ELM performs worse than the traditional ELM on Letter Recognition. On the contrary, as shown in Fig. 5, the low-variance phenomenon does not occur for MNIST; thus, RBM-ELM performs better than the traditional ELM on this dataset.

Figure 6 shows the hidden layer output values for a randomly chosen observation of the Letter Recognition dataset. Although both hidden layer matrices have full rank, the ELM hidden layer output is close to a uniform distribution on [0, 1], whereas the RBM hidden layer output has low variance and is nearly a perturbation around the constant 0.5. When we repeat the procedure on the MNIST dataset in Fig. 7, there is no such pseudo-full-rank degeneration of the hidden layer output.

Furthermore, the next step is the same for both RBM-ELM and ELM: compute the generalized inverse of the hidden layer matrix \({\mathbf {H}}\). Table 3 shows the difference between the generalized inverse (GI) matrices of ELM and RBM-ELM for the Letter Recognition dataset. Both have zero mean, but the large variance of the GI in RBM-ELM is noteworthy. This phenomenon is due to the discontinuity of the generalized inverse near a rank-deficient matrix. The experimental result matches the theoretical analysis in Sect. 3.

5 Conclusion

In this paper, to study how initial weights affect non-iterative training, an RBM is used as an unsupervised pre-training phase for the ELM. We first give a detailed account of the background and related work, followed by an introduction to the RBM-ELM algorithm. Then, we prove the discontinuity of the generalized inverse when a full rank matrix is close to a perturbation of a rank-deficient one. Next, to evaluate the model, experiments are carried out on benchmark datasets and a comparison is made between the standard ELM and RBM-ELM. Based on the empirical study, we summarize the effect of combining the RBM with the ELM. The conclusions are as follows:

1. For some of the experimental datasets, the performance of unsupervised pre-training using an RBM is significantly better than that of random initialization.

2. When the hidden layer size is increased by adding more nodes, the testing accuracy improves for both the standard ELM and RBM-ELM. Moreover, for a given dataset, the advantage of RBM-ELM remains when the hidden layer size changes.

3. According to our experiments, RBM-ELM does not always surpass the standard ELM. The training of the RBM can lead to a low-variance hidden layer. Because of the discontinuity of the generalized inverse, this ultimately causes large parameter variation in the solution.

4. When the hidden layer output floats around a constant, it should be treated as a perturbation of a rank-deficient matrix rather than as a full rank matrix.

Table 3 Generalized inverse (GI) comparison