1 Introduction

The restricted Boltzmann machine (RBM) is a stochastic artificial neural network with two layers, a visible input layer and a hidden layer, that can learn the probability distribution over the raw features [1,2,3]. According to the distribution of the visible units, two different RBMs have been proposed, i.e., the Bernoulli-Bernoulli RBM (BBRBM) [4] and the Gaussian-Bernoulli RBM (GBRBM) [5, 6]. The BBRBM assumes that the visible units take binary values, while the GBRBM assumes that the visible units are Gaussian distributed. In practice, the appropriate RBM type can be selected according to the dataset. The deep-belief network (DBN) is a probabilistic generative model that can be viewed as a composition of simple, unsupervised networks such as RBMs [1, 4, 7]. Widely regarded as one of the most effective deep learning models, the DBN obtains a multi-layer nonlinear representation of the data by greedy layer-wise training [8,9,10]. The DBN possesses an inherent power for unsupervised feature learning [11], and it has been widely used in many fields, e.g., image classification, document processing, object segmentation, social networks, and automatic control [12, 13].

A DBN consists of one layer of visible units and multiple layers of hidden units, and the neurons between neighboring layers are typically fully interconnected [1]. However, studies of the brain’s nervous system have shown that it employs a highly sparse mechanism. In the brain’s organization and information processing, two kinds of sparseness exist: connection sparseness and response sparseness [14]. Connection sparseness is also known as the sparse structure, or sparse topology, of the network, and response sparseness refers to the sparse representation, or sparse coding, of neurons. These two kinds of sparseness underlie the strong generalization ability of human neural systems. Compared with a distributed information representation, sparsity can provide higher-quality storage and better generalization capabilities [14]. A combination of experimental, computational, and theoretical studies has verified the existence of an underlying principle in sensory information processing: information is represented by a relatively small number of simultaneously active neurons out of a large population, commonly referred to as sparse coding [15]. In [14], the existence of sparse connectivity within the basal ganglia was confirmed. Recently, DeepMind reproduced the function of grid cells to achieve human-like automatic tracking, where the use of regularization to obtain sparsity was critical to the emergence of grid-like representations [16].

At present, there are many approaches to obtain the topology sparseness as well as the representation sparseness of DBNs, and introducing regularization terms into the parameter-updating procedure is the most commonly used one. The purpose of a regularization algorithm is to solve a constrained optimization problem and to remove unnecessary connections during training. In this way, it can both reduce the complexity of the model and improve the generalization ability of the network. Methods of weight constraint include the Gaussian regularizer [17], the Laplace regularizer [18], weight elimination [19], and soft weight sharing [20]. Among them, the most widely used are the Gaussian and Laplace regularizers, which place an L2-norm or L1-norm penalty on the connections. Moreover, other techniques, such as Dropout, DropConnect, pruning, low-rank matrix decomposition, and knowledge distillation, have also been used to achieve topology sparseness or reduce the complexity of neural networks [17,18,19,20,21,22,23,24].

In addition to adding regularization on the connections to achieve a sparse topology, encouraging sparse responses of the hidden neurons is another important way to improve the generalization ability. Hinton et al. [25] showed that networks with neurons of low response frequency have better interpretability and generalization ability. Lee and Ng penalized the mean of the hidden neurons’ responses with the Kullback-Leibler (KL) divergence to obtain response sparsity in the hidden layer [26, 27]. Ranzato et al. [28] used an energy-based method to add a coding pattern with sparse representation. By adding an L1/L2 penalty to the response probabilities of the hidden neurons, a network was proposed to capture the local correlation of the hidden neurons [29]. The L1-norm constraint was applied directly to the responses of the hidden neurons in the fine-tuning process [30]. Srivastava, Hinton et al. used Dropout to prevent neural networks from over-fitting [31]. Almost all of the above methods for obtaining the sparseness of a DBN are based on penalization with the L1-norm, the L2-norm, or the KL divergence.

For controlling sparsity, the most direct way is to use the L0-norm, since it directly counts the number of non-zero elements. However, since the optimization problem based on the L0-norm is NP-hard, many methods for obtaining sparseness are based on the L1-norm, the L2-norm, or the KL divergence instead. Although the L1-norm is generally used as a substitute for the L0-norm, there is a key difference between them, namely, the dependence on magnitude [24]. The L0-norm penalizes all non-zero components democratically, whereas the L1-norm penalizes larger components more heavily than smaller ones. The L2-norm cannot give a sparse solution at all. The KL divergence measures how different two distributions are, and it is usually applied to control the average activation values of hidden neurons. Finding a better way to address the issue of sparsity in DBNs is still a challenging topic. In this paper, by introducing a log-sum minimization problem and its corresponding iterative re-weighted-L1 minimization algorithm, we seek an approximate solution of the L0-norm optimization problem and, further, sparse connections for DBNs.

On the other hand, for DBNs, a sampling method based on the Markov Chain Monte Carlo (MCMC) strategy is often used. In such a sampling process, each variable is sampled from a posterior distribution given the current state of the other variables. An unbiased estimate of the model expectation in the log-likelihood gradient is then obtained by multiple alternating Gibbs sampling steps. However, this procedure is time-consuming and requires a large number of samples for an accurate estimate. Hinton proposed a fast learning algorithm, i.e., k-step contrastive divergence (CD), to overcome the complication of Gibbs sampling [32]. It is known that maximizing the log likelihood of the data is equivalent to minimizing the KL divergence (denoted by \(KL(P^{0}\|{P}_{\theta }^{\infty })\)) between the initial data distribution P0 and the equilibrium distribution of the visible variables, \({P}_{\theta }^{\infty }\), where 𝜃 refers to the coefficients of the learning machine and \(P^{\infty }_{\theta }\) is generated by sustained Gibbs sampling from the generative model. The k-step CD algorithm minimizes the difference between \(KL(P^{0}\|P^{\infty }_{\theta })\) and \(KL({P}_{\theta }^{k}\|{P}_{\theta }^{\infty })\) instead of minimizing \(KL(P^{0}\|{P}_{\theta }^{\infty })\) directly, where \(P^{k}_{\theta }\) denotes the reconstruction distribution of the data generated by k steps of Gibbs sampling. The k-step CD algorithm is an effective alternative to maximum likelihood learning, which relieves the high computational demand for samples drawn from the equilibrium distribution. As Hinton pointed out, the intuitive motivation for the k-step CD algorithm is to leave the initial distribution over the visible variables unaltered when running Gibbs sampling in the Markov chain. Rather than running the Markov chain to its equilibrium state and comparing the initial and final states, one can simply run the Markov chain for k steps and update the parameters of the model to reduce the tendency of the chain to drift away from the initial distribution.

In this paper, we aim to obtain a DBN with sparsity-learning capability by embedding the iterative re-weighted-L1 minimization algorithm into the k-step CD algorithm, and we further apply it to biomedical data, such as brain magnetic resonance imaging (MRI) data and single nucleotide polymorphism (SNP) data. The MRI data contain the concentration of grey matter in different brain regions and the correlation values between different regions of interest (ROI), which have no spatial topological structure. Similarly, the SNP data contain DNA polymorphisms caused by single-nucleotide variation and possess no spatial topological structure either. Thus, deep learning models that rely on spatial relationships in the data, such as the convolution operation of convolutional neural networks, are not applicable here. Additionally, it should be noted that biomedical datasets usually have a small sample size but a large number of features. Although deep networks have constantly been shown to be one of the most powerful tools in big data analysis, their application to biomedical data is still limited, because deep learning models easily over-fit data with n ≪ p, where n is the sample size and p is the number of features [33, 34]. Seeking a deep network with a strong sparse self-learning ability is an effective solution to this issue. It will be shown in Section 2 that the re-weighted-L1 function approximates the L0-norm better than the L1-norm or the L2-norm. Thus, by combining an iterative re-weighted-L1 algorithm with the k-step CD algorithm, the connections of a DBN can be updated adaptively in a sparse way; such a network is called a self-adaptive sparse DBN. In this way, the deviation of the Markov chain from the initial distribution can be reduced, which efficaciously lowers the probability that the model distorts the data. Experiments on the biomedical data show that the proposed self-adaptive sparse DBN effectively avoids over-fitting, and it achieves good identification ability by learning the sparse structure of the feature space.

2 Iterative algorithm for re-weighted L1 minimization

In this section, the iterative re-weighted-L1 minimization algorithm will be presented; it provides a good approximation to the sparse optimum solution of the L0-norm minimization problem.

In [35], it was proved that minimization of \({\sum }_{i=1}^{n}\log |x_{i}|\) is consistent with minimization of \({\sum }_{i=1}^{n}|x_{i}|^{p}\) when \(p\rightarrow 0\). Further, because

$$ \begin{array}{@{}rcl@{}} \underset{x}{\min} \|x\|_{0} = \underset{x}{\min} \underset{p\rightarrow0}{\lim}\sum\limits_{i=1}^{n} |x_{i}|^{p} \end{array} $$
(1)

where x is an n-dimensional vector with xi being the i-th element, [24, 35, 36] put forward that the minimization of \({\sum }_{i=1}^{n}\log |x_{i}|\) is an effective approximation to that of \(\|x\|_{0}\). Furthermore, in order to ensure the existence of \(\log |x_{i}|\) when \(|x_{i}|\rightarrow 0\), a small constant 𝜖 > 0 is usually added to make the minimization problem meaningful, and the solution of

$$ \begin{array}{@{}rcl@{}} \underset{x}{\min}\sum\limits_{i=1}^{n}\log(|x_{i}|+\epsilon) \end{array} $$
(2)

is proposed as an alternative that yields a better approximate solution of the L0-norm minimization problem than the commonly used L1-norm [24, 36]. The objective function of (2) is called the log-sum function.

In order to get the iterative solution of (2) for self-adaptive sparse CD algorithm with k steps, we introduce the absolute value optimization problem

$$ \begin{array}{@{}rcl@{}} \underset{x}{\min} \sum\limits_{i=1}^{n} c_{i}|x_{i}| \end{array} $$
(3)

where ci ≥ 0. Noting that |xi| is the minimum of zi subject to the condition |xi| ≤ zi, (3) has the following equivalent form

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{ll} \underset{z}{\min}\sum\limits_{i=1}^{n} c_{i}z_{i}& \\ s.t.\ \ |x_{i}|\leq z_{i},\ i=1,\cdots,n & \end{array} \right. \end{array} $$
(4)

By (4), the optimization problem (2) can be rewritten in its equivalent form

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{ll} \underset{u}{\min}\sum\limits_{i=1}^{n} \log(u_{i}+\epsilon)& \\ s.t.\ \ |x_{i}|\leq u_{i},\ i=1,\cdots,n & \end{array} \right. \end{array} $$
(5)

That is, if x is the optimum solution of (2), then u = |x| is the optimum solution of (5); conversely, if u is the optimum solution of (5), then any x with |x| = u is an optimum solution of (2).

A general form of (5) is

$$ \begin{array}{@{}rcl@{}} \underset{v}{\min }~ g(v) &s.t.\ v\in C \end{array} $$
(6)

where g is a concave function and C is a convex set. The solution of problem (6) can be improved step by step by minimizing a linearization of g:

$$ \begin{array}{@{}rcl@{}} v^{(l+1)}=\underset{v}{\arg\min}~g(v^{(l)})+\nabla g(v^{(l)})\cdot(v-v^{(l)})\ &s.t.\ v\in C \end{array} $$
(7)

Each iteration of (7) solves a convex problem, and the resulting iterates approximate a solution of (6). In particular, for problem (5), since the gradient of \({\sum }_{i=1}^{n}\log (u_{i}+\epsilon )\) with respect to \(u_{i}\) at \(u^{(l)}\) is \(1/(u_{i}^{(l)}+\epsilon )\), one can get that

$$ \begin{array}{@{}rcl@{}} u^{(l+1)}=\underset{u_{i}}{{\arg\min}} \sum\limits_{i=1}^{n} \frac{u_{i}}{u_{i}^{(l)}+\epsilon} & s.t.\ |x_{i}|\leq u_{i},i=1,\cdots,n \end{array} $$

Thus, by the equivalent relationship between the optimum solutions of (2) and (5), we have

$$ \begin{array}{@{}rcl@{}} x^{(l+1)}=\underset{x_{i}}{{\arg\min}} \sum\limits_{i=1}^{n} \frac{|x_{i}|}{|{x}_{i}^{(l)}|+\epsilon} \end{array} $$
(8)

Let

$$ \begin{array}{@{}rcl@{}} \eta^{(l)}_{i}=\frac{1}{|x^{(l)}_{i}|+\epsilon} \end{array} $$
(9)

then (8) can be rewritten as

$$ \begin{array}{@{}rcl@{}} x^{(l+1)}=\underset{x_{i}}{{\arg\min}} \sum\limits_{i=1}^{n} \eta^{(l)}_{i}|x_{i}| \end{array} $$
(10)

Thus, the solution of (2) can be approximated by the iterative solutions of (10).

The following iterative re-weighted-L1 minimization algorithm, i.e., Algorithm 1, gives the iterative process for solving the minimization problem (2). Based on it, the solution of (2) can be updated dynamically, and it is a good approximation to the sparse optimum solution of the L0-norm minimization problem.

Algorithm 1 Iterative re-weighted-L1 minimization
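As an illustration only (not the authors' code), the following Python sketch applies the re-weighting scheme of Algorithm 1 to a least-squares problem, so that each weighted-L1 subproblem (10) can be solved by an off-the-shelf Lasso after rescaling the columns; the function name `reweighted_l1` and all parameter values are our own assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def reweighted_l1(A, b, lam=0.01, eps=0.1, n_iter=5):
    """Iterative re-weighted-L1 sketch: approximately minimize a least-squares loss
    plus the log-sum penalty by solving a sequence of weighted-L1 problems."""
    n_features = A.shape[1]
    eta = np.ones(n_features)                 # initial weights eta^(0)
    x = np.zeros(n_features)
    for _ in range(n_iter):
        A_scaled = A / eta                    # substitute x_i = z_i / eta_i
        z = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(A_scaled, b).coef_
        x = z / eta                           # undo the substitution
        eta = 1.0 / (np.abs(x) + eps)         # re-weighting step, Eq. (9)
    return x
```

Each pass solves the weighted-L1 problem (10) with the weights held fixed and then recomputes the weights from the new iterate, which is the alternation that Algorithm 1 formalizes.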

Furthermore, in order to show the advantage of the re-weighted-L1 function (i.e., \(f_{R,\epsilon }\triangleq \eta \cdot |x|=\frac {|x|}{|x|+\epsilon }\)) in achieving sparsity, we compare it with the following penalty functions: f0(x) = 1{x≠ 0} and f1(x) = |x|. Figure 1 shows that fR,𝜖 approximates f0 well as \(\epsilon \rightarrow 0\).

Fig. 1 For different 𝜖, e.g., 𝜖 = 1, 0.1, 0.01, fR,𝜖 approximates f0 better than f1 does, and as 𝜖 gradually tends to 0, fR,𝜖 approaches f0
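As a small numerical companion to Fig. 1 (our own check, not part of the paper), the snippet below evaluates f0, f1 and fR,𝜖 at a few magnitudes; it shows that for small 𝜖 the re-weighted-L1 penalty charges small and large non-zero entries almost equally, as the L0 indicator does, whereas f1 grows with the magnitude.

```python
import numpy as np

def f0(x):                 # L0-style indicator penalty
    return (x != 0).astype(float)

def f1(x):                 # L1 penalty
    return np.abs(x)

def fR(x, eps):            # re-weighted-L1 penalty |x| / (|x| + eps)
    return np.abs(x) / (np.abs(x) + eps)

x = np.array([0.0, 0.01, 0.1, 1.0, 10.0])
print("f0:", f0(x))
print("f1:", f1(x))
for eps in (1.0, 0.1, 0.01):
    print(f"fR (eps={eps}):", np.round(fR(x, eps), 3))
```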

As will be shown, by combining the iterative re-weighted-L1 minimization algorithm with the k-step CD algorithm, one can obtain a sparser solution of the connection weights than by using the L1-norm or the one-step CD algorithm. In this way, we can alleviate the over-fitting problem and improve the generalization ability of DBNs.

3 Self-adaptive sparse DBNs (SAS-DBNs)

In this section, by embedding the iterative re-weighted-L1 minimization algorithm, i.e., Algorithm 1, we present sparse self-adaptive learning for the k-step CD algorithm and obtain a sparse architecture for each component of the DBN, namely the RBM. As a result, an unsupervised sparse learning solution for the whole DBN can be achieved.

A DBN is stacked from and trained as multi-layer RBMs, and each RBM finds a compressed feature representation of its current input data. An RBM has two layers: a visible input layer and a hidden layer. There are connections between the layers but no connections between units within each layer. Assume v is the Nv-dimensional vector of visible units, h is the Nh-dimensional vector of hidden units, α is the Nv-dimensional bias of the visible units, β is the Nh-dimensional bias of the hidden units, and \(W=\{W_{ij}\}_{N_{v}\times N_{h}}\) with Wij being the connection weight between vi and hj. The likelihood function P𝜃(v) is the distribution of the observed data v under the parameters 𝜃, where 𝜃 = {W,α,β} is the set of all parameters. Maximizing P𝜃(v) is the main task of training an RBM, which amounts to determining the parameters that fit the given training samples, i.e., to finding a 𝜃 such that

$$ \begin{array}{@{}rcl@{}} \theta^{*}=\underset{\theta}{\arg\max}~\mathcal{L}(\theta) \end{array} $$

where \(\mathcal{L}(\theta ) = \log P_{\theta }(v)\) is the objective function to be optimized for training the RBM. By performing stochastic steepest ascent in the log probability of the training data, the weights can be updated according to

$$ \begin{array}{@{}rcl@{}} {\varDelta} W_{ij} &=& \zeta \cdot \frac{\partial \mathcal{L}}{\partial W_{ij}} = \zeta \cdot (\langle v_{i}h_{j}\rangle_{data}-\langle v_{i}h_{j}\rangle_{model}) \end{array} $$

in which ζ is the learning rate and 〈⋅〉 is the expectation operator with respect to the distribution indicated by the subscript.

Due to the difficulty of calculating the partition function Z𝜃, it is quite hard to obtain an unbiased sample of 〈vihj〉model. A much faster learning method is the k-step CD algorithm, a fast learning algorithm for RBMs proposed by Hinton [32]. Different from the Gibbs sampling method, the k-step CD algorithm initializes the system states, i.e., the neurons in the visible layer, with the training data. In this way, as Hinton pointed out, only a few state transitions are needed to obtain a good approximation of the model distribution. In the k-step CD algorithm, the initial states of the visible units, v(0), are first assigned the values of the training data, and the initial states of the hidden units h(0) are sampled from the conditional probability p(h = 1|v(0)). Then, the conditional probability p(v = vr|h(0)) is used to obtain v(1), where vr is 1 for the BBRBM and a real value for the GBRBM. These steps are executed alternately until v(k) and h(k) are obtained, which are used to approximate 〈vihj〉model. Specifically, for the BBRBM,

$$ \begin{array}{@{}rcl@{}} p(h_{j}=1|v)&=&f\left( \beta_{j}+\underset{i}{\sum}v_{i}W_{ij}\right)\\ p(v_{i}=1|h)&=&f\left( \alpha_{i}+\underset{j}{\sum}h_{j}W_{ij}\right) \end{array} $$

For GBRBM,

$$ \begin{array}{@{}rcl@{}} p(h_{j}=1|v)&=&f\left( \beta_{j}+\underset{i}{\sum}W_{ij}\frac{v_{i}}{{\sigma_{i}^{2}}}\right)\\ p(v_{i}=v_{r}|h)&=&\mathcal{N}\left( v_{r}|\alpha_{i}+\underset{j}{\sum}h_{j}W_{ij},{\sigma_{i}^{2}}\right) \end{array} $$

in which f(⋅) is the sigmoid activation function, σi is the standard deviation associated with the Gaussian visible unit vi, and \(\mathcal {N}(\cdot |\mu ,\sigma ^{2})\) denotes the Gaussian probability density function with mean μ and variance σ2.
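For concreteness, a minimal mini-batch sketch of the k-step CD estimate for a BBRBM, written directly from the conditional probabilities above, is given below; the array layout (samples in rows), the helper name `cd_k_gradients`, and the use of hidden probabilities rather than samples in the gradient are our own choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_gradients(v0, W, alpha, beta, k, rng):
    """k-step CD estimate of the BBRBM gradients for a mini-batch v0 of shape (n, N_v)."""
    ph0 = sigmoid(v0 @ W + beta)                 # p(h=1 | v^(0)), positive phase
    vk = v0
    for _ in range(k):                           # k alternating Gibbs steps
        hk = (rng.random((v0.shape[0], W.shape[1])) < sigmoid(vk @ W + beta)).astype(float)
        vk = (rng.random(v0.shape) < sigmoid(hk @ W.T + alpha)).astype(float)
    phk = sigmoid(vk @ W + beta)                 # p(h=1 | v^(k)), negative phase
    n = v0.shape[0]
    dW = (v0.T @ ph0 - vk.T @ phk) / n           # <v_i h_j>_data - <v_i h_j>_model
    d_alpha = (v0 - vk).mean(axis=0)
    d_beta = (ph0 - phk).mean(axis=0)
    return dW, d_alpha, d_beta
```

A GBRBM version would replace the visible-layer sampling step by a Gaussian draw with mean αi + Σj hjWij and variance σi², following the formulas above.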

For each RBM, if there is no restriction on the responses of the hidden neurons or on the connections between layers, an unstructured network will be generated, which leads to poor interpretability, high computational complexity and large space consumption [37]. Adding sparse regularization terms to the objective function for training the RBM makes it more likely to learn the structural characteristics of the data [38]. In Section 2, it has been shown that the re-weighted-L1 function is an effective approximation to the sparse solution of the L0-norm minimization problem; thus, by introducing a sparse penalty term on the connection weights with the re-weighted-L1 function and embedding its corresponding iterative re-weighted-L1 minimization algorithm into the k-step CD algorithm, a sparse connection architecture of the RBM can be obtained. At the same time, since an L2-norm penalty term can reduce the noise or variation in the data, we also introduce an L2-norm term on the connection weights. Additionally, the KL divergence on the average responses of the hidden neurons is utilized to achieve sparse responses of the hidden-layer neurons. The improved objective function to be optimized for training the RBM is

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{new}(\theta)&=&\ \mathcal{L}(\theta)-\frac{1}{2}\lambda_{1}\|W\|^{2}_{2} -\lambda_{2} \underset{i,j}{\sum}\ln(|W_{ij}|+\epsilon)\\ &&-\lambda_{3} \sum\limits_{j=1}^{N_{h}}KL(\rho\parallel p_{j}) \end{array} $$
(11)

where λ1, λ2 and λ3 are penalty parameters and 𝜖 is a given constant. The KL divergence \(KL(\rho \parallel p_{j})=\rho \log \frac {\rho }{p_{j}}+(1-\rho )\log \frac {1-\rho }{1-p_{j}}\) is the relative entropy between two Bernoulli random variables with means ρ and pj, and it measures the difference between the two distributions. The purpose of introducing the KL divergence is to force the average response value of the hidden neurons to be approximately equal to the preset target value ρ. Here \(p_{j}=\frac {1}{N_{s}}{\sum }^{N_{s}}_{q=1}\frac {1}{1+e^{-{\sum }^{N_{v}}_{i=1}v^{(q)}_{i}W_{ij}-\beta _{j}}}\) is the average activation degree of the j-th neuron in the hidden layer over Ns samples, and Nv is the number of nodes in the current visible layer. The derivatives of the KL divergence with respect to the parameters are

$$ \begin{array}{@{}rcl@{}} &&\frac{\partial}{\partial W_{ij}}\sum\limits_{j=1}^{N_{h}}KL(\rho\parallel p_{j})\\ &=& \frac{1}{N_{s}}\left( -\frac{\rho}{p_{j}}+\frac{1-\rho}{1-p_{j}}\right)\sum\limits_{q=1}^{N_{s}}\sigma^{(q)}_{j}\left( 1-\sigma^{(q)}_{j}\right)v^{(q)}_{i} \end{array} $$
(12)
$$ \begin{array}{@{}rcl@{}} &&\frac{\partial}{\partial \beta_{j}}\sum\limits_{j=1}^{N_{h}}KL(\rho\parallel p_{j})\\ &=& \frac{1}{N_{s}}\left( -\frac{\rho}{p_{j}}+\frac{1-\rho}{1-p_{j}}\right)\sum\limits_{q=1}^{N_{s}}\sigma^{(q)}_{j}\left( 1-\sigma^{(q)}_{j}\right) \end{array} $$
(13)

where \(\sigma ^{(q)}_{j}=\sigma \left ({\sum }_{i=1}^{N_{v}}v^{(q)}_{i}W_{ij}+\beta _{j}\right )=\frac {1}{1+e^{-{\sum }^{N_{v}}_{i=1}v^{(q)}_{i}W_{ij}-\beta _{j}}}\). Thus, by combining the iterative re-weighted-L1 minimization algorithm with the k-step CD sampling process, the parameters of an unsupervised RBM can be updated in a sparse self-adaptive way by Algorithm 2.

Algorithm 2 Sparse self-adaptive k-step CD (SAS-k-CD) for training an RBM
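A rough sketch of the weight update inside Algorithm 2, under the notation of (9), (11) and (12), is shown below; it is our own illustration, not the authors' implementation. `dW_cd` stands for the k-step CD gradient estimate (e.g., from the `cd_k_gradients` sketch above), and the re-weighted-L1 term uses weights η recomputed from the current W, which is the self-adaptive step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sas_kcd_weight_update(W, beta, dW_cd, v0, zeta, lam1, lam2, lam3, rho, eps):
    """One sparse self-adaptive update of W, ascending the penalized objective (11)."""
    Ns = v0.shape[0]
    eta = 1.0 / (np.abs(W) + eps)            # re-weighted-L1 weights, Eq. (9)
    s = sigmoid(v0 @ W + beta)               # s[q, j] = sigma_j^(q)
    p = s.mean(axis=0)                       # average activation p_j of each hidden unit
    # gradient of the KL sparsity term with respect to W, Eq. (12)
    dKL_dW = (v0.T @ (s * (1.0 - s))) / Ns * (-rho / p + (1.0 - rho) / (1.0 - p))
    # CD term minus the L2, re-weighted-L1 (log-sum) and KL penalty gradients
    grad = dW_cd - lam1 * W - lam2 * eta * np.sign(W) - lam3 * dKL_dW
    return W + zeta * grad
```

The bias β would be updated analogously with the CD estimate and the KL gradient (13); the sparsity penalties in (11) act only on the connection weights.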

Based on Algorithm 2, each RBM can be trained to have sparse connections adaptively; this process is called reconstructing an RBM with SAS-k-CD. Thus, the unsupervised part of a DBN with self-adaptive sparsity (SAS-DBN) can be obtained by stacking such sparse RBMs. Figure 2 presents the whole learning process of the SAS-DBN with stacked sparse RBMs.

Fig. 2 The unsupervised SAS-DBN with stacked sparse RBMs updated by Algorithm 2

Furthermore, the learned sparse DBN can be fine-tuned with standard methods, e.g., the back-propagation (BP) algorithm. Note, however, that the fine-tuning process will redistribute the weights as well as the representations of the entire network, i.e., the learned sparsity of the SAS-DBN will be destroyed. To ensure that the sparse structure learned by the SAS-DBN is retained during fine-tuning, two methods are generally used. One is to fix the sparse structure of the SAS-DBN and then carry out fine-tuning; the other is to bring sparse learning into the fine-tuning itself. In our experiments, we compared both methods, namely, fine-tuning based on the retained sparse structure learned by the SAS-DBN, and fine-tuning with added sparsity regularization terms, such as the KL divergence, the L1-norm and the log-sum function. No significant difference was found between them, so we only consider the first method here.

4 Experiments

One of the main purposes of imposing sparseness on deep learning models is to avoid over-fitting and improve the identification ability, especially for high-dimensional data with small sample size. In this section, we use both brain functional magnetic resonance imaging data and single nucleotide polymorphism data to demonstrate the effectiveness of the proposed DBN with self-adaptive sparsity. For the MRI data, adding the fine-tuning process does improve the accuracy rate, but there is nearly no improvement in the accuracy rate on the SNP data with or without fine-tuning. Considering the time cost of the fine-tuning process, we only used fine-tuning for the MRI data and did not use it for the SNP data. Moreover, the experiments show that although these datasets have a large number of features but few samples, the SAS-DBN can better represent the data with a sparse architecture, and it has the best classification accuracy among all the classifiers we tested.

4.1 Experiments on MRI database

This dataset is from the MLSP 2014 Schizophrenia Classification Challenge and can be downloaded from https://www.kaggle.com/c/mlsp-2014-mri. The data were collected on a 3T MRI scanner at the Mind Research Network. Image preprocessing was performed with the Statistical Parametric Mapping software, and feature extraction was performed with independent component analysis (ICA), yielding features from two imaging modalities: source-based morphometry (SBM) loadings from structural MRI (sMRI) and functional network connectivity (FNC) features from resting-state functional MRI (rs-fMRI). The data consist of 40 patients with schizophrenia or schizoaffective disorder (both denoted by SZs) and 46 healthy controls (HCs). A diagnosis of SZ was made using the Structured Clinical Interview for DSM-IV (DSM: Diagnostic and Statistical Manual of Mental Disorders) [39]. 410 features (32 SBM and 378 FNC) are extracted for each sample. SBM loadings are weights of brain maps obtained from ICA, which indicate the concentration of grey matter in different regions of the subject’s brain; the concentration of grey matter is indicative of the computational power of a particular brain region [40]. FNC features are the pair-wise correlation values between the time courses of 28 ICA brain maps, which can be seen as features describing the subject’s overall level of synchronicity between brain areas [41]. We denote the data as \(D=\{(x_{i},y_{i})\}^{n}_{i=1}\), where n = 86 (40 SZs and 46 HCs), xi is the 410-dimensional feature vector of the i-th subject (32 features from the SBM and 378 from the FNC), and yi ∈{0,1} (0 = ’Healthy Control’, 1 = ’Schizophrenic Patient’). The 86 subjects were randomly divided into two subsets, 70% for training and the remaining 30% for testing, and to obtain a stable classification estimate this random split was repeated 30 times.

The deep architecture of the SAS-DBN used here contains 5 hidden layers with 200, 100, 50, 50, and 50 units, respectively. In this stacked DBN, we adopt a GBRBM as the first RBM and BBRBMs as the others. Because the grid search method makes a complete search over a given parameter space, is easily parallelized, and yields stable optimal parameters for deep learning, it is used to tune the parameters in this paper. Specifically, one parameter is selected by grid search while the other parameters are fixed, and all parameters are optimized by repeating this process. Based on the grid search, the optimal parameters for the GBRBM and the BBRBMs are, respectively: learning rate ζ of 0.001 and 0.1, L2 penalty parameter λ1 of 0.03 and 0.0002, re-weighted-L1 penalty parameter λ2 of 0.001 and 0.001, 𝜖 of 0.1 and 0.1, KL divergence penalty parameter λ3 of 0.1 and 0.1, and sparsity target ρ of 0.01 and 0.01. For simplicity, the same number of CD steps k is used for every RBM.
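The coordinate-wise grid search described above can be sketched as follows; this is only an illustration of the procedure, and the candidate grids, the `evaluate` callback and the number of sweeps are hypothetical.

```python
def coordinate_grid_search(grids, evaluate, n_sweeps=2):
    """Tune one hyperparameter at a time over its grid while the others stay fixed,
    then repeat the sweep; `evaluate(params)` should return a validation accuracy."""
    params = {name: values[0] for name, values in grids.items()}
    for _ in range(n_sweeps):
        for name, values in grids.items():
            best_val, best_acc = params[name], float("-inf")
            for v in values:
                acc = evaluate({**params, name: v})
                if acc > best_acc:
                    best_val, best_acc = v, acc
            params[name] = best_val
    return params

# hypothetical usage:
# grids = {"zeta": [1e-3, 1e-2, 1e-1], "lam1": [2e-4, 3e-2], "lam2": [1e-3],
#          "lam3": [0.05, 0.1], "rho": [0.01, 0.05], "eps": [0.1]}
# best = coordinate_grid_search(grids, evaluate=validation_accuracy_of_sas_dbn)
```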

4.1.1 Model selection for the steps of CD

We first discuss how to select the number of CD steps, and train SAS-DBNs with different k chosen from 1 to 7. The other parameters of the SAS-DBNs are chosen as above. Figure 3 shows the classification accuracy rates (AR) and the non-zero ratio of the network connections for different k, where AR is defined as

$$ \begin{array}{@{}rcl@{}} AR &=& \frac{Corr_{p}+Corr_{c}}{Corr_{p}+Corr_{c}+InCorr_{c}+InCorr_{p}} \end{array} $$

in which, Corrp, Corrc, InCorrc and InCorrp denote the number of correctly predicted schizophrenic patients, correctly predicted healthy subjects, healthy subjects incorrectly classified as schizophrenic patients, and schizophrenic patients incorrectly classified as healthy subjects, respectively.

Fig. 3 a AR of SAS-DBNs trained with k-step CD for k = 1 to 7. b For each k, a bar graph of the average non-zero ratio of connections in the first RBM combined with a line chart of the AR; the horizontal axis gives k, the primary (left) vertical axis gives the average non-zero ratio of connections, and the secondary (right) vertical axis gives the AR. Only the connection ratio of the first RBM is shown; the connections of the other RBMs show the same tendency

Figure 3 shows that the SAS-DBN achieves its best AR at k = 5. Meanwhile, the non-zero ratio of the network connections decreases as k increases, and the decline becomes gentle after k = 5. Thus, we choose 5 as the number of CD steps.

4.1.2 Weight sparsity

Weight sparsity is an important index for demonstrating the ability of a deep learning machine to overcome over-fitting. To show the sparse learning ability of the proposed SAS-DBN, we compare it with seven deep learning models, namely, stacked RBMs with an SVM classifier (denoted by SR-SVM), a DBN with an L2-norm penalty on the connection weights (denoted by DBN), a DBN with an L1-norm penalty on the connection weights and a KL divergence penalty on the hidden neurons (denoted by sDBN), a DBN with Dropout (denoted by DP-DBN), a DBN with DropConnect (denoted by DPC-DBN), a DBN with singular value decomposition (denoted by SVD-DBN), and a DBN with pruning (denoted by Pr-DBN). Each network is well trained. In particular, for DP-DBN, when the dropout ratio is 0.1, 0.2, 0.3, 0.4 and 0.5, the average accuracy over 30 runs is 0.709, 0.6833, 0.6795, 0.6795 and 0.6859, respectively, so we choose the DP-DBN with dropout ratio 0.1. Similarly, for DPC-DBN, the average accuracy over 30 runs is 0.7358, 0.7385, 0.7321, 0.7346 and 0.7410 with drop-connect ratios of 0.1, 0.2, 0.3, 0.4 and 0.5, so we choose the DPC-DBN with drop-connect ratio 0.5. For Pr-DBN, we select the model with a weight-pruning ratio of 0.001. Each network is trained 30 times; each time, 70% of the original data are randomly sampled as the training data and the rest are the testing data.

In what follows, we show the weight sparsity of the SAS-DBN in comparison with the other DBNs. This weight-sparsity learning capability strongly indicates why the SAS-DBN works well for this kind of high-dimensional, small-sample-size data.

For a quantifiable comparison, we list the weight sparsity of the SAS-DBN and of the other seven deep learning methods with two different indicators. The first indicator is the non-zero ratio of connections, and the results are shown in Table 1. Additionally, in Fig. 4a, we further compare the SAS-DBN with the two models, DPC-DBN and Pr-DBN, that have the highest average classification accuracy among the other deep networks. In this figure, the five colors red, yellow, green, pink and blue refer to the non-zero ratios of w(i) (i = 1,⋯ ,5), respectively.

Table 1 The non-zero ratio of connections for different deep networks
Fig. 4 a The non-zero ratios of connections for the three deep networks with the sparsest connections; red, yellow, green, pink and blue refer to the non-zero ratios of w(1), w(2), w(3), w(4) and w(5) for all three networks. b The average HSM of these three networks

There is another normalized indicator for measuring the sparsity of the connections within the network, the Hoyer sparseness measure (HSM), which is based on the relationship between the L1-norm and the L2-norm of the weights [42, 43]. The HSM of the weight matrix Wm×n is defined as follows:

$$ \begin{array}{@{}rcl@{}} HSM(W)=\frac{\sqrt{m\times n}-({\sum}_{i=1}^{m}{\sum}_{j=1}^{n}|W_{ij}|)/\sqrt{{\sum}_{i=1}^{m}{\sum}^{n}_{j=1}W^{2}_{ij}}}{\sqrt{m\times n}-1} \end{array} $$
(14)

The value of HSM lies in [0,1]. The closer HSM is to 1, the sparser the connections of the network, i.e., most of the connection values are zero and only a few have significant values. The average HSM over 30 random training runs for the three deep networks is shown in Table 2 and Fig. 4b.
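Equation (14) can be transcribed directly into code (a small utility of our own, with a hypothetical function name):

```python
import numpy as np

def hoyer_sparseness(W):
    """Hoyer sparseness measure of a weight matrix, Eq. (14)."""
    d = np.sqrt(W.size)                     # sqrt(m * n)
    l1 = np.abs(W).sum()
    l2 = np.sqrt((W ** 2).sum())
    return (d - l1 / l2) / (d - 1.0)

# a matrix with a single non-zero entry gives HSM = 1,
# while a dense matrix with equal-magnitude entries gives HSM = 0
```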

Table 2 The average HSM of connections for different deep networks

Further, Fig. 5 shows the weight-sparseness learning ability of SAS-DBN, DPC-DBN and Pr-DBN. The SAS-DBN has sparser connections than DPC-DBN and Pr-DBN, which is an important reason why it has better identification and sparse-feature extraction abilities than the other two deep learning models.

Fig. 5 The weight-sparseness learning ability of SAS-DBN, DPC-DBN and Pr-DBN. The horizontal axis indexes the connections between different layers and the vertical axis gives the values of those connections; w(i) (i = 1,⋯ ,5) refers to the connections between layers i and i + 1

From Tables 1 and 2 and Figs. 4 and 5, one can conclude that the proposed SAS-DBN has the sparsest connections among the compared deep networks, and in particular it learns a sparse w(1). The learning of w(1) is quite crucial, since it is the first and most basic step for extracting essential features and removing redundancy and noise from the original data.

4.1.3 Classification accuracy rates of fifteen methods

To show the classification results, we compare the SAS-DBN with several typical classifiers, i.e., linear SVM (denoted by SVM-I), SVM with an RBF kernel (denoted by SVM-II), KNN, Naive Bayes (NB), Random Forest (RF), the sparse-representation-based classifier (SRC), the discriminant analysis classifier (DAC), and the seven deep learning models described above [44,45,46]. Each classifier is run 30 times; each time, 70% of the original data are randomly sampled as the training data and the rest are the testing data.

The convergence behaviour of the proposed method on the training and testing data is shown in Fig. 6. There is a sharp fall between epochs 20 and 40 because the first 30 epochs correspond to the training of the SAS-DBN and the subsequent epochs to the fine-tuning. The average results are shown in Table 3 and Fig. 7. Among these fifteen methods, the proposed method has the best performance; compared with the other methods, its AR is increased by 11.92%, 13.22%, 9.55%, 15.23%, 11.93%, 23.83%, 13.70%, 6.93%, 2.55%, 1.56%, 5.97%, 1.39%, 3.36%, and 0.68%, respectively. Together with the previous two subsections, this shows that the SAS-DBN has the best sparse learning ability and discrimination capability on the fMRI data.

Fig. 6 The convergence performance for the fMRI data; the red line is the average error and the shaded area is the standard deviation of the error

Table 3 AR of fifteen classifiers
Fig. 7 AR results of the fifteen classifiers

4.2 Experiments on SNP database

In this section, subject recruitment and data collection were conducted by the Mind Clinical Imaging Consortium (MCIC). The SNP data were collected from 208 subjects, including 92 schizophrenia patients (age: 34 ± 11, 22 females) and 116 healthy controls (age: 32 ± 11, 44 females). All of them provided written informed consent. Healthy participants were free of any medical, neurological or psychiatric illnesses and had no history of substance abuse. Patients met the criteria for DSM-IV-TR schizophrenia [48, 49], based on a clinical interview for DSM-IV-TR disorders [47] or the comprehensive assessment of symptoms and history. Antipsychotic history was collected as part of the psychiatric assessment.

A blood sample was obtained from each participant and DNA was extracted. Genotyping of all participants was performed at the Mind Research Network using the Illumina Infinium HumanOmni1-Quad assay covering 1,140,419 SNP loci, and BeadStudio was used to make the final genotype calls. The PLINK software package (http://pngu.mgh.harvard.edu/purcell/plink) was used to perform a series of standard quality-control procedures, resulting in a final data set spanning 777,635 SNP loci. Each SNP was categorized into three clusters based on its genotype and represented by a discrete number: 0 for \('BB^{\prime }\) (no minor allele), 1 for \('AB^{\prime }\) (one minor allele) and 2 for \('AA^{\prime }\) (two minor alleles). SNPs with more than 20% missing data were deleted, the remaining missing values were imputed, and SNPs with a minor allele frequency below 1% were removed. For genetic diseases, because pathways organize genes into biologically functional groups and model the interactions that capture correlations between genes, genes involved in the same pathway have similar effects and act together in regulating a biological system [50]. Therefore, using pathway information can improve not only the predictive performance but also the interpretability of the selected features. In this paper, we select the SNPs falling into the pathways of schizophrenia-related genes for analysis. After this selection, the dimension of each sample is 12513.
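A rough sketch of the quality-control filters described above, applied to a hypothetical genotype matrix `G` coded 0/1/2 with NaN for missing calls (in practice these steps were done with PLINK; the thresholds follow the text, everything else is assumed):

```python
import numpy as np

def filter_snps(G, max_missing=0.20, min_maf=0.01):
    """Drop SNPs with >20% missing calls or minor allele frequency < 1%,
    then impute the remaining missing entries with the per-SNP mode."""
    G = G[:, np.isnan(G).mean(axis=0) <= max_missing]
    maf = np.nanmean(G, axis=0) / 2.0             # allele frequency under 0/1/2 coding
    maf = np.minimum(maf, 1.0 - maf)              # fold to the minor allele
    G = G[:, maf >= min_maf]
    for j in range(G.shape[1]):                   # mode imputation per SNP
        col = G[:, j]
        if np.isnan(col).any():
            vals, counts = np.unique(col[~np.isnan(col)], return_counts=True)
            col[np.isnan(col)] = vals[np.argmax(counts)]
    return G
```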

The deep architecture of the SAS-DBN used here contains three hidden layers with 2000, 1000, and 500 units, respectively, built from three stacked BBRBMs. Based on the grid search method, the learning rate ζ is 0.1, the L2 penalty parameter λ1 is 0.0002, the re-weighted-L1 penalty parameter λ2 is 0.0002, 𝜖 is 0.1, the KL divergence penalty parameter λ3 is 0.05, the sparsity target ρ is 0.05, and the number of CD steps is 5. The 208 subjects were randomly divided into two subsets: 148 subjects for training and the remaining 60 for testing. We ran the SAS-DBN 30 times to obtain the average accuracy rate. The convergence performance of the proposed method on the training and testing data is shown in Fig. 8, which indicates that the proposed method converges quickly within a few iterations.

Fig. 8 The convergence performance for the SNP data; the red line is the average error and the shaded area is the standard deviation of the error

4.2.1 Weight sparsity

We compare the weight sparsity of the SAS-DBN with that of the other six deep networks mentioned in Section 4.1.2. In the experiments, each network is trained separately. In particular, for DP-DBN, when the dropout ratio is 0.1, 0.2, 0.3, 0.4 and 0.5, the average accuracy over 30 runs is 0.8483, 0.8517, 0.765, 0.825 and 0.8717, respectively, so we choose the DP-DBN with dropout ratio 0.5. Similarly, for DPC-DBN, the average accuracy over 30 runs is 0.8717, 0.8633, 0.8783, 0.82 and 0.8717 with drop-connect ratios of 0.1, 0.2, 0.3, 0.4 and 0.5, so we choose the DPC-DBN with drop-connect ratio 0.3. For Pr-DBN, we select the model with a weight-pruning ratio of 0.001. We perform the learning process 30 times; each time, the training data are a random 70% sample of the original data.

The ratio of connection weights with non-zero values between different layers on the SNP data is shown in Table 4a, and the HSM is presented in Table 4b. The SAS-DBN has the lowest non-zero connection ratio and the highest HSM among the deep networks. Additionally, in Fig. 9a and b, we further compare the SAS-DBN with the two models, sDBN and SVD-DBN, which have the highest average classification accuracy among the other six deep networks.

Table 4 Two different sparse indexes of connections in SAS-DBN and other deep networks
Fig. 9 a and b show the two sparseness indexes of the connections in SAS-DBN, sDBN and SVD-DBN on the SNP data. a: the non-zero ratio of connections for the three models. b: the HSM of the connections for the three models

4.2.2 Classification results on the SNP database

The performance of each classifier is quantified by AR, sensitivity (SS), specificity (SC), positive predictive value (PPV) and negative predictive value (NPV). The last four are defined as follows

$$ \begin{array}{@{}rcl@{}} SS &=& \frac{Corr_{p}}{Corr_{p}+InCorr_{c}}\\ SC &=& \frac{Corr_{c}}{Corr_{c}+InCorr_{p}} \\ PPV &=& \frac{Corr_{p}}{Corr_{p}+InCorr_{p}}\\ NPV &=& \frac{Corr_{c}}{Corr_{c}+InCorr_{c}} \end{array} $$

AR, SS, SC, PPV and NPV provide different assessments of the classifier’s performance from five aspects, i.e., the proportion of all samples that are correctly predicted, the proportion of patients correctly predicted among all the predicted patients, the proportion of healthy subjects correctly predicted among all the predicted healthy subjects, the proportion of correctly predicted patients among the true patients, and the proportion of healthy subjects correctly predicted among those true healthy subjects.
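These quantities can be computed directly from the four counts defined in Section 4.1.1, transcribing the formulas above (a small utility sketch of our own):

```python
def classification_metrics(corr_p, corr_c, incorr_c, incorr_p):
    """AR, SS, SC, PPV and NPV from the four prediction counts, following the text."""
    ar = (corr_p + corr_c) / (corr_p + corr_c + incorr_c + incorr_p)
    ss = corr_p / (corr_p + incorr_c)
    sc = corr_c / (corr_c + incorr_p)
    ppv = corr_p / (corr_p + incorr_p)
    npv = corr_c / (corr_c + incorr_c)
    return {"AR": ar, "SS": ss, "SC": sc, "PPV": ppv, "NPV": npv}
```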

We compare the proposed method, SAS-DBN, with several classifiers that have fairly good performance. The classification results on the testing data for the different classifiers are summarized in Table 5, where CLASSIFIER I refers to the linear SVM, II to the SVM with a multi-layer perceptron kernel, III to the sparse-representation-based classifier, IV to DBN, V to sDBN, VI to DP-DBN, VII to SVD-DBN, VIII to DPC-DBN, IX to Pr-DBN and X to SAS-DBN. Each classifier is run 30 times; each time, 70% of the original data are randomly sampled as the training data and the rest are the testing data. The average results in Table 5 show that the SAS-DBN performs the best among the classifiers: its AR is higher by 7.59%, 22.93%, 5.90%, 11.29%, 6.91%, 11.80%, 6.40%, 11.13% and 13.99%, respectively. Figure 10 further shows the classification results of all methods, where the horizontal axis represents the different classifiers, the primary (left) vertical axis gives the AR, and the secondary (right) vertical axis gives the other metrics, i.e., SS, SC, PPV and NPV. From this comparison, we conclude that the SAS-DBN has the best classification accuracy.

Table 5 Classification results by different classifiers
Fig. 10 The classification results of the 10 classifiers on the SNP data; the left y-axis is the scale of the bar chart (AR) and the right y-axis is the scale of the line chart (SS, SC, PPV and NPV)

Compared with [47, 51], which use the same data as ours, we use 148 samples for training and 60 for testing, whereas those papers use 90% of the subjects for training and only the remaining 10% for testing. The average accuracy here is over 98%, while the classification accuracy in those papers is around 85%; Li et al. [52] report a best accuracy of 65% and an average accuracy of about 55%.

4.3 Remark

For two kinds of high-dimensional data with small sample sizes, the above experiments show that, by combining the iterative re-weighted-L1 minimization algorithm with the k-step CD algorithm, the proposed deep belief network with adaptive sparsity achieves better classification performance than the other typical classifiers, including several deep networks. A crucial reason for the good classification results of the SAS-DBN is that the iterative re-weighted-L1 minimization algorithm can effectively obtain an approximate solution of the L0 minimization problem; in particular, by combining it with the k-step CD algorithm, the parameters of the network can be adjusted adaptively in a very sparse way. Furthermore, the learned sparse deep network can effectively remove noise and redundant features from the original data. With the sparse structure, discriminative features of the data can be extracted in a sparse, layer-wise way, which is critical for preventing over-fitting and improving identification accuracy.

5 Conclusion

This work is motivated by the sparsity mechanism of brain neural networks. By exploring a substitute for the L0-norm minimization problem, we embed the re-weighted-L1 minimization algorithm into the k-step CD sampling process and thereby propose the self-adaptive sparse DBN. In this way, layer-wise sparse features of the data can be effectively extracted, which is vital for learning the true distribution of the data and thereby enhances the generalization capacity as well as the discrimination performance of the network. The self-adaptive sparse DBN promises to be a practical solution to problems with a small number of samples but a large number of features. Numerical experiments on brain imaging and genomics data demonstrate its effectiveness and show that the proposed method can not only obtain a sparse deep network architecture in a self-adaptive way but also avoid the over-fitting problem.