1 Introduction

The restricted Boltzmann machine (RBM) is a stochastic artificial neural network with two layers, a visible input layer and a hidden layer, that can learn the probability distribution over the raw features [1,2,3]. According to the distribution of the visible units, two different RBMs have been proposed, i.e., the Bernoulli-Bernoulli RBM (BBRBM) [4] and the Gaussian-Bernoulli RBM (GBRBM) [5, 6]. The BBRBM assumes that the visible units take binary values, while the GBRBM assumes that the visible units are Gaussian distributed. In practice, the appropriate RBM type can be selected according to the dataset. The deep-belief network (DBN) is a probabilistic generative model that can be viewed as a composition of simple, unsupervised networks such as RBMs [1, 4, 7]. Widely regarded as one of the most effective deep learning models, the DBN obtains a multi-layer nonlinear representation of the data by greedy layer-wise training [8,9,10]. The DBN possesses an inherent power for unsupervised feature learning [11], and it has been widely used in many fields, e.g., image classification, document processing, object segmentation, social networks, and automatic control [12, 13].

A DBN consists of one layer of visible units and multiple layers of hidden units, and the neurons between neighboring layers are typically fully interconnected [1]. However, studies of the brain’s nervous system have shown that it employs a highly sparse mechanism. In the brain’s organization and information processing, two kinds of sparseness exist: connection sparseness and response sparseness [14]. Connection sparseness is also known as the sparse structure, or sparse topology, of the network, and response sparseness refers to the sparse representation, or sparse coding, of neurons. These two kinds of sparseness underlie the strong generalization ability of human neural systems. Compared with a distributed information representation, sparsity can provide higher-quality storage and better generalization capabilities [14]. A combination of experimental, computational, and theoretical studies has verified the existence of an underlying principle in sensory information processing: information is represented by a relatively small number of simultaneously active neurons out of a large population, commonly referred to as sparse coding [15]. In [14], the existence of sparse connectivity within the basal ganglia was confirmed. Recently, DeepMind reproduced the function of grid cells to achieve human-like automatic tracking, where the use of regularization to obtain sparsity was critical to the emergence of grid-like representations [16].

At present, there are many approaches to obtain the topology sparseness as well as the representation sparseness of DBNs, and introducing regularization terms into the parameter-updating procedure is the most commonly used one. The purpose of a regularization algorithm is to solve a constrained optimization problem and to remove unnecessary connections during training. In this way, it can both reduce the complexity of the model and improve the generalization ability of the network. Methods of weight constraint include the Gaussian regularizer [17], the Laplace regularizer [18], weight elimination [19], and soft weight sharing [20]. Among them, the most widely used are the Gaussian and Laplace regularizers, which place an L2-norm or L1-norm penalty on the connections. Moreover, other techniques, such as Dropout, DropConnect, pruning, low-rank matrix decomposition, and knowledge distillation, have also been used to achieve topology sparseness or reduce the complexity of neural networks [17,18,19,20,21,22,23,24].

In addition to adding regularization on the connections to achieve a sparse topology, encouraging sparse responses of the hidden neurons is another important way to improve the generalization ability. Hinton et al. [25] showed that networks with neurons of low response frequency have better interpretability and generalization ability. Lee and Ng penalized the mean of the hidden neurons’ responses with the Kullback-Leibler (KL) divergence to obtain response sparsity in the hidden layer [26, 27]. Ranzato et al. [28] used an energy-based method to add a coding pattern with sparse representation. By adding an L1/L2 penalty to the response probabilities of the hidden neurons, a network was proposed to capture the local correlation of the hidden neurons [29]. The L1-norm constraint was applied directly to the responses of the hidden neurons in the fine-tuning process [30]. Srivastava, Hinton et al. used Dropout to prevent neural networks from over-fitting [31]. Almost all of the above methods for obtaining the sparseness of a DBN are based on penalization with the L1-norm, the L2-norm, or the KL divergence.

For controlling sparsity, the most direct way is to use the L0-norm, since it directly counts the number of non-zero elements. However, since the optimization problem based on the L0-norm is NP-hard, many methods for obtaining sparseness are based on the L1-norm, the L2-norm, or the KL divergence instead. Although the L1-norm is generally used as a substitute for the L0-norm, there is a key difference between them, namely, the dependence on magnitude [24]. The L0-norm penalizes all non-zero components democratically, whereas the L1-norm penalizes larger components more heavily than smaller ones. The L2-norm cannot give a sparse solution at all. The KL divergence measures how different two distributions are, and it is usually applied to control the average activation values of hidden neurons. Finding a better way to address the issue of sparsity in DBNs is still a challenging topic. In this paper, by introducing a log-sum minimization problem and its corresponding iterative re-weighted-L1 minimization algorithm, we seek an approximate solution of the L0-norm optimization problem and, further, sparse connections for DBNs.

On the other hand, for DBNs, a sampling method based on the Markov Chain Monte Carlo (MCMC) strategy is often used. In such a sampling process, each variable is sampled from a posterior distribution given the current state of the other variables. An unbiased estimate of the model expectation in the log-likelihood gradient is then obtained by multiple alternating Gibbs sampling steps. However, this procedure is time-consuming and requires a large number of samples for an accurate estimate. Hinton proposed a fast learning algorithm, i.e., k-step contrastive divergence (CD), to overcome the complication of Gibbs sampling [32]. It is known that maximizing the log likelihood of the data is equivalent to minimizing the KL divergence (denoted by \(KL(P^{0}\|{P}_{\theta }^{\infty })\)) between the initial data distribution P0 and the equilibrium distribution of the visible variables, \({P}_{\theta }^{\infty }\), where 𝜃 refers to the coefficients of the learning machine and \(P^{\infty }_{\theta }\) is generated by sustained Gibbs sampling from the generative model. The k-step CD algorithm minimizes the difference between \(KL(P^{0}\|P^{\infty }_{\theta })\) and \(KL({P}_{\theta }^{k}\|{P}_{\theta }^{\infty })\) instead of minimizing \(KL(P^{0}\|{P}_{\theta }^{\infty })\) directly, where \(P^{k}_{\theta }\) denotes the reconstruction distribution of the data generated by k steps of Gibbs sampling. The k-step CD algorithm is an effective alternative to maximum likelihood learning, which relieves the high computational demand for samples drawn from the equilibrium distribution. As Hinton pointed out, the intuitive motivation for the k-step CD algorithm is to leave the initial distribution over the visible variables unaltered when running Gibbs sampling in the Markov chain. Rather than running the Markov chain to its equilibrium state and comparing the initial and final states, one can simply run the Markov chain for k steps and update the parameters of the model to reduce the tendency of the chain to drift away from the initial distribution.

In this paper, we aim to obtain a DBN with sparsity-learning capability by embedding the iterative re-weighted-L1 minimization algorithm into the k-step CD algorithm, and we further apply it to biomedical data, such as brain magnetic resonance imaging (MRI) data and single nucleotide polymorphism (SNP) data. The MRI data contain the concentration of grey matter in different brain regions and the correlation values between different regions of interest (ROI), which have no spatial topological structure. Similarly, the SNP data contain DNA polymorphisms caused by single-nucleotide variation and possess no spatial topological structure either. Thus, deep learning models that rely on spatial relationships in the data, such as the convolution operation of convolutional neural networks, are not applicable here. Additionally, it should be noted that biomedical datasets usually have a small sample size but a large number of features. Although deep networks have constantly been shown to be one of the most powerful tools in big data analysis, their application to biomedical data is still limited, because deep learning models easily over-fit data with n ≪ p, where n is the sample size and p is the number of features [33, 34]. Seeking a deep network with a strong sparse self-learning ability is an effective solution to this issue. It will be shown in Section 2 that the re-weighted-L1 function approximates the L0-norm better than the L1-norm or the L2-norm. Thus, by combining an iterative re-weighted-L1 algorithm with the k-step CD algorithm, the connections of a DBN can be updated adaptively in a sparse way; such a network is called a self-adaptive sparse DBN. In this way, the deviation of the Markov chain from the initial distribution can be reduced, which efficaciously lowers the probability that the model distorts the data. Experiments on the biomedical data show that the proposed self-adaptive sparse DBN effectively avoids over-fitting, and it achieves good identification ability by learning the sparse structure of the feature space.

2 Iterative algorithm for re-weighted L1 minimization

In this section, the iterative re-weighted-L1 minimization algorithm will be presented; it provides a good approximation to the sparse optimum solution of the L0-norm minimization problem.

In [35], it was proved that minimization of \({\sum }_{i=1}^{n}\log |x_{i}|\) is consistent with minimization of \({\sum }_{i=1}^{n}|x_{i}|^{p}\) when \(p\rightarrow 0\). Further, because

$$ \begin{array}{@{}rcl@{}} \underset{x}{\min} \|x\|_{0} = \underset{x}{\min} \underset{p\rightarrow0}{\lim}\sum\limits_{i=1}^{n} |x_{i}|^{p} \end{array} $$
(1)

where x is an n-dimensional vector with xi being the i-th element, [24, 35, 36] put forward that the minimization of \({\sum }_{i=1}^{n}\log |x_{i}|\) is an effective approximation to that of \(\|x\|_{0}\). Furthermore, in order to ensure the existence of \(\log |x_{i}|\) when \(|x_{i}|\rightarrow 0\), a small constant 𝜖 > 0 is usually added to make the minimization problem meaningful, and the solution of

$$ \begin{array}{@{}rcl@{}} \underset{x}{\min}\sum\limits_{i=1}^{n}\log(|x_{i}|+\epsilon) \end{array} $$
(2)

is proposed as an alternative that yields a better approximate solution of the L0-norm minimization problem than the commonly used L1-norm [24, 36]. The objective function of (2) is called the log-sum function.

In order to get the iterative solution of (2) for self-adaptive sparse CD algorithm with k steps, we introduce the absolute value optimization problem

$$ \begin{array}{@{}rcl@{}} \underset{x}{\min} \sum\limits_{i=1}^{n} c_{i}|x_{i}| \end{array} $$
(3)

where ci ≥ 0. Noting that |xi| is the minimum of zi subject to the condition |xi| ≤ zi, (3) has the following equivalent form

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{ll} \underset{z}{\min}\sum\limits_{i=1}^{n} c_{i}z_{i}& \\ s.t.\ \ |x_{i}|\leq z_{i},\ i=1,\cdots,n & \end{array} \right. \end{array} $$
(4)

By (4), the optimization problem (2) can be rewritten in its equivalent form

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{ll} \underset{u}{\min}\sum\limits_{i=1}^{n} \log(u_{i}+\epsilon)& \\ s.t.\ \ |x_{i}|\leq u_{i},\ i=1,\cdots,n & \end{array} \right. \end{array} $$
(5)

That is, if x is the optimum solution of (2), then u = |x| is the optimum solution of (5); conversely, if u is the optimum solution of (5), then any x with |x| = u is an optimum solution of (2).

A general form of (5) is

$$ \begin{array}{@{}rcl@{}} \underset{v}{\min }~ g(v) &s.t.\ v\in C \end{array} $$
(6)

where g is a concave function and C is a convex set. The solution of problem (6) can be improved step by step by minimizing a linearization of g:

$$ \begin{array}{@{}rcl@{}} v^{(l+1)}=\underset{v}{\arg\min}~g(v^{(l)})+\nabla g(v^{(l)})\cdot(v-v^{(l)})\ &s.t.\ v\in C \end{array} $$
(7)

Each iteration of (7) solves a convex problem, and the resulting iterates approximate a solution of (6). In particular, for problem (5), since the gradient of \({\sum }_{i=1}^{n}\log (u_{i}+\epsilon )\) with respect to \(u_{i}\) at \(u^{(l)}\) is \(1/(u_{i}^{(l)}+\epsilon )\), one can get that

$$ \begin{array}{@{}rcl@{}} u^{(l+1)}=\underset{u_{i}}{{\arg\min}} \sum\limits_{i=1}^{n} \frac{u_{i}}{u_{i}^{(l)}+\epsilon} & s.t.\ |x_{i}|\leq u_{i},i=1,\cdots,n \end{array} $$

Thus, by the equivalent relationship between the optimum solutions of (2) and (5), we have

$$ \begin{array}{@{}rcl@{}} x^{(l+1)}=\underset{x_{i}}{{\arg\min}} \sum\limits_{i=1}^{n} \frac{|x_{i}|}{|{x}_{i}^{(l)}|+\epsilon} \end{array} $$
(8)

Let

$$ \begin{array}{@{}rcl@{}} \eta^{(l)}_{i}=\frac{1}{|x^{(l)}_{i}|+\epsilon} \end{array} $$
(9)

then (8) can be rewritten as

$$ \begin{array}{@{}rcl@{}} x^{(l+1)}=\underset{x_{i}}{{\arg\min}} \sum\limits_{i=1}^{n} \eta^{(l)}_{i}|x_{i}| \end{array} $$
(10)

Thus, the solution of (2) can be approximated by the iterative solutions of (10).

The following iterative re-weighted-L1 minimization algorithm, i.e., Algorithm 1, gives the iterative process for solving the minimization problem (2). Based on it, the solution of (2) can be updated dynamically, and it is a good approximation to the sparse optimum solution of the L0-norm minimization problem.

Algorithm 1 Iterative re-weighted-L1 minimization
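As an illustration only (not the authors' code), the following Python sketch applies the re-weighting scheme of Algorithm 1 to a least-squares problem, so that each weighted-L1 subproblem (10) can be solved by an off-the-shelf Lasso after rescaling the columns; the function name `reweighted_l1` and all parameter values are our own assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def reweighted_l1(A, b, lam=0.01, eps=0.1, n_iter=5):
    """Iterative re-weighted-L1 sketch: approximately minimize a least-squares loss
    plus the log-sum penalty by solving a sequence of weighted-L1 problems."""
    n_features = A.shape[1]
    eta = np.ones(n_features)                 # initial weights eta^(0)
    x = np.zeros(n_features)
    for _ in range(n_iter):
        A_scaled = A / eta                    # substitute x_i = z_i / eta_i
        z = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(A_scaled, b).coef_
        x = z / eta                           # undo the substitution
        eta = 1.0 / (np.abs(x) + eps)         # re-weighting step, Eq. (9)
    return x
```

Each pass solves the weighted-L1 problem (10) with the weights held fixed and then recomputes the weights from the new iterate, which is the alternation that Algorithm 1 formalizes.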

Furthermore, in order to show the advantage of the re-weighted-L1 function (i.e., \(f_{R,\epsilon }\triangleq \eta \cdot |x|=\frac {|x|}{|x|+\epsilon }\)) in achieving sparsity, we compare it with the following penalty functions: f0(x) = 1{x≠ 0} and f1(x) = |x|. Figure 1 shows that fR,𝜖 approximates f0 well as \(\epsilon \rightarrow 0\).

Fig. 1 For different 𝜖, e.g., 𝜖 = 1, 0.1, 0.01, fR,𝜖 approximates f0 better than f1 does, and as 𝜖 gradually tends to 0, fR,𝜖 approaches f0
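As a small numerical companion to Fig. 1 (our own check, not part of the paper), the snippet below evaluates f0, f1 and fR,𝜖 at a few magnitudes; it shows that for small 𝜖 the re-weighted-L1 penalty charges small and large non-zero entries almost equally, as the L0 indicator does, whereas f1 grows with the magnitude.

```python
import numpy as np

def f0(x):                 # L0-style indicator penalty
    return (x != 0).astype(float)

def f1(x):                 # L1 penalty
    return np.abs(x)

def fR(x, eps):            # re-weighted-L1 penalty |x| / (|x| + eps)
    return np.abs(x) / (np.abs(x) + eps)

x = np.array([0.0, 0.01, 0.1, 1.0, 10.0])
print("f0:", f0(x))
print("f1:", f1(x))
for eps in (1.0, 0.1, 0.01):
    print(f"fR (eps={eps}):", np.round(fR(x, eps), 3))
```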

As will be shown, by combining the iterative re-weighted-L1 minimization algorithm with the k-step CD algorithm, one can obtain a sparser solution of the connection weights than by using the L1-norm or the one-step CD algorithm. In this way, we can alleviate the over-fitting problem and improve the generalization ability of DBNs.

3 Self-adaptive sparse DBNs (SAS-DBNs)

In this section, by embedding the iterative re-weighted-L1 minimization algorithm, i.e., Algorithm 1, we present sparse self-adaptive learning for the k-step CD algorithm and obtain a sparse architecture for each component of the DBN, namely the RBM. As a result, an unsupervised sparse learning solution for the whole DBN can be achieved.

A DBN is stacked from and trained as multi-layer RBMs, and each RBM finds a compressed feature representation of its current input data. An RBM has two layers: a visible input layer and a hidden layer. There are connections between the layers but no connections between units within each layer. Assume v is the Nv-dimensional vector of visible units, h is the Nh-dimensional vector of hidden units, α is the Nv-dimensional bias of the visible units, β is the Nh-dimensional bias of the hidden units, and \(W=\{W_{ij}\}_{N_{v}\times N_{h}}\) with Wij being the connection weight between vi and hj. The likelihood function P𝜃(v) is the distribution of the observed data v under the parameters 𝜃, where 𝜃 = {W,α,β} is the set of all parameters. Maximizing P𝜃(v) is the main task of training an RBM, which amounts to determining the parameters that fit the given training samples, i.e., to finding a 𝜃 such that

$$ \begin{array}{@{}rcl@{}} \theta^{*}=\underset{\theta}{\arg\max}~\mathcal{L}(\theta) \end{array} $$

where \(\mathcal{L}(\theta ) = \log P_{\theta }(v)\) is the objective function to be optimized for training the RBM. By performing stochastic steepest ascent in the log probability of the training data, the weights can be updated according to

$$ \begin{array}{@{}rcl@{}} {\varDelta} W_{ij} &=& \zeta \cdot \frac{\partial \mathcal{L}}{\partial W_{ij}} = \zeta \cdot (\langle v_{i}h_{j}\rangle_{data}-\langle v_{i}h_{j}\rangle_{model}) \end{array} $$

in which ζ is the learning rate and 〈⋅〉 is the expectation operator with respect to the distribution indicated by the subscript.

Due to the difficulty of calculating the partition function Z𝜃, it is quite hard to obtain an unbiased sample of 〈vihj〉model. A much faster learning method is the k-step CD algorithm, a fast learning algorithm for RBMs proposed by Hinton [32]. Different from the Gibbs sampling method, the k-step CD algorithm initializes the system states, i.e., the neurons in the visible layer, with the training data. In this way, as Hinton pointed out, only a few state transitions are needed to obtain a good approximation of the model distribution. In the k-step CD algorithm, the initial states of the visible units, v(0), are first assigned the values of the training data, and the initial states of the hidden units h(0) are sampled from the conditional probability p(h = 1|v(0)). Then, the conditional probability p(v = vr|h(0)) is used to obtain v(1), where vr is 1 for the BBRBM and a real value for the GBRBM. These steps are executed alternately until v(k) and h(k) are obtained, which are used to approximate 〈vihj〉model. Specifically, for the BBRBM,

$$ \begin{array}{@{}rcl@{}} p(h_{j}=1|v)&=&f\left( \beta_{j}+\underset{i}{\sum}v_{i}W_{ij}\right)\\ p(v_{i}=1|h)&=&f\left( \alpha_{i}+\underset{j}{\sum}h_{j}W_{ij}\right) \end{array} $$

For GBRBM,

$$ \begin{array}{@{}rcl@{}} p(h_{j}=1|v)&=&f\left( \beta_{j}+\underset{i}{\sum}W_{ij}\frac{v_{i}}{{\sigma_{i}^{2}}}\right)\\ p(v_{i}=v_{r}|h)&=&\mathcal{N}\left( v_{r}|\alpha_{i}+\underset{j}{\sum}h_{j}W_{ij},{\sigma_{i}^{2}}\right) \end{array} $$

in which f(⋅) is the sigmoid activation function, σi is the standard deviation associated with the Gaussian visible unit vi, and \(\mathcal {N}(\cdot |\mu ,\sigma ^{2})\) denotes the Gaussian probability density function with mean μ and variance σ2.
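For concreteness, a minimal mini-batch sketch of the k-step CD estimate for a BBRBM, written directly from the conditional probabilities above, is given below; the array layout (samples in rows), the helper name `cd_k_gradients`, and the use of hidden probabilities rather than samples in the gradient are our own choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_gradients(v0, W, alpha, beta, k, rng):
    """k-step CD estimate of the BBRBM gradients for a mini-batch v0 of shape (n, N_v)."""
    ph0 = sigmoid(v0 @ W + beta)                 # p(h=1 | v^(0)), positive phase
    vk = v0
    for _ in range(k):                           # k alternating Gibbs steps
        hk = (rng.random((v0.shape[0], W.shape[1])) < sigmoid(vk @ W + beta)).astype(float)
        vk = (rng.random(v0.shape) < sigmoid(hk @ W.T + alpha)).astype(float)
    phk = sigmoid(vk @ W + beta)                 # p(h=1 | v^(k)), negative phase
    n = v0.shape[0]
    dW = (v0.T @ ph0 - vk.T @ phk) / n           # <v_i h_j>_data - <v_i h_j>_model
    d_alpha = (v0 - vk).mean(axis=0)
    d_beta = (ph0 - phk).mean(axis=0)
    return dW, d_alpha, d_beta
```

A GBRBM version would replace the visible-layer sampling step by a Gaussian draw with mean αi + Σj hjWij and variance σi², following the formulas above.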

For each RBM, if there is no restriction on the responses of the hidden neurons or on the connections between layers, an unstructured network will be generated, which leads to poor interpretability, high computational complexity and large space consumption [37]. Adding sparse regularization terms to the objective function for training the RBM makes it more likely to learn the structural characteristics of the data [38]. In Section 2, it has been shown that the re-weighted-L1 function is an effective approximation to the sparse solution of the L0-norm minimization problem; thus, by introducing a sparse penalty term on the connection weights with the re-weighted-L1 function and embedding its corresponding iterative re-weighted-L1 minimization algorithm into the k-step CD algorithm, a sparse connection architecture of the RBM can be obtained. At the same time, since an L2-norm penalty term can reduce the noise or variation in the data, we also introduce an L2-norm term on the connection weights. Additionally, the KL divergence on the average responses of the hidden neurons is utilized to achieve sparse responses of the hidden-layer neurons. The improved objective function to be optimized for training the RBM is

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{new}(\theta)&=&\ \mathcal{L}(\theta)-\frac{1}{2}\lambda_{1}\|W\|^{2}_{2} -\lambda_{2} \underset{i,j}{\sum}\ln(|W_{ij}|+\epsilon)\\ &&-\lambda_{3} \sum\limits_{j=1}^{N_{h}}KL(\rho\parallel p_{j}) \end{array} $$
(11)

where λ1, λ2 and λ3 are penalty parameters and 𝜖 is a given constant. The KL divergence \(KL(\rho \parallel p_{j})=\rho \log \frac {\rho }{p_{j}}+(1-\rho )\log \frac {1-\rho }{1-p_{j}}\) is the relative entropy between two Bernoulli random variables with means ρ and pj, and it measures the difference between the two distributions. The purpose of introducing the KL divergence is to force the average response value of the hidden neurons to be approximately equal to the preset target value ρ. Here \(p_{j}=\frac {1}{N_{s}}{\sum }^{N_{s}}_{q=1}\frac {1}{1+e^{-{\sum }^{N_{v}}_{i=1}v^{(q)}_{i}W_{ij}-\beta _{j}}}\) is the average activation degree of the j-th neuron in the hidden layer over Ns samples, and Nv is the number of nodes in the current visible layer. The derivatives of the KL divergence with respect to the parameters are

$$ \begin{array}{@{}rcl@{}} &&\frac{\partial}{\partial W_{ij}}\sum\limits_{j=1}^{N_{h}}KL(\rho\parallel p_{j})\\ &=& \frac{1}{N_{s}}\left( -\frac{\rho}{p_{j}}+\frac{1-\rho}{1-p_{j}}\right)\sum\limits_{q=1}^{N_{s}}\sigma^{(q)}_{j}\left( 1-\sigma^{(q)}_{j}\right)v^{(q)}_{i} \end{array} $$
(12)
$$ \begin{array}{@{}rcl@{}} &&\frac{\partial}{\partial \beta_{j}}\sum\limits_{j=1}^{N_{h}}KL(\rho\parallel p_{j})\\ &=& \frac{1}{N_{s}}\left( -\frac{\rho}{p_{j}}+\frac{1-\rho}{1-p_{j}}\right)\sum\limits_{q=1}^{N_{s}}\sigma^{(q)}_{j}\left( 1-\sigma^{(q)}_{j}\right) \end{array} $$
(13)

where \(\sigma ^{(q)}_{j}=\sigma \left ({\sum }_{i=1}^{N_{v}}v^{(q)}_{i}W_{ij}+\beta _{j}\right )=\frac {1}{1+e^{-{\sum }^{N_{v}}_{i=1}v^{(q)}_{i}W_{ij}-\beta _{j}}}\). Thus, by combining the iterative re-weighted-L1 minimization algorithm with the k-step CD sampling process, the parameters of an unsupervised RBM can be updated in a sparse self-adaptive way by Algorithm 2.

Algorithm 2 Sparse self-adaptive k-step CD (SAS-k-CD) for training an RBM
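A rough sketch of the weight update inside Algorithm 2, under the notation of (9), (11) and (12), is shown below; it is our own illustration, not the authors' implementation. `dW_cd` stands for the k-step CD gradient estimate (e.g., from the `cd_k_gradients` sketch above), and the re-weighted-L1 term uses weights η recomputed from the current W, which is the self-adaptive step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sas_kcd_weight_update(W, beta, dW_cd, v0, zeta, lam1, lam2, lam3, rho, eps):
    """One sparse self-adaptive update of W, ascending the penalized objective (11)."""
    Ns = v0.shape[0]
    eta = 1.0 / (np.abs(W) + eps)            # re-weighted-L1 weights, Eq. (9)
    s = sigmoid(v0 @ W + beta)               # s[q, j] = sigma_j^(q)
    p = s.mean(axis=0)                       # average activation p_j of each hidden unit
    # gradient of the KL sparsity term with respect to W, Eq. (12)
    dKL_dW = (v0.T @ (s * (1.0 - s))) / Ns * (-rho / p + (1.0 - rho) / (1.0 - p))
    # CD term minus the L2, re-weighted-L1 (log-sum) and KL penalty gradients
    grad = dW_cd - lam1 * W - lam2 * eta * np.sign(W) - lam3 * dKL_dW
    return W + zeta * grad
```

The bias β would be updated analogously with the CD estimate and the KL gradient (13); the sparsity penalties in (11) act only on the connection weights.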

Based on Algorithm 2, each RBM can be trained to have sparse connections adaptively; this process is called reconstructing an RBM with SAS-k-CD. Thus, the unsupervised part of a DBN with self-adaptive sparsity (SAS-DBN) can be obtained by stacking such sparse RBMs. Figure 2 presents the whole learning process of the SAS-DBN with stacked sparse RBMs.

Fig. 2 The unsupervised SAS-DBN with stacked sparse RBMs updated by Algorithm 2

Furthermore, the learned sparse DBN can be fine-tuned with standard methods, e.g., the back-propagation (BP) algorithm. Note, however, that the fine-tuning process will redistribute the weights as well as the representations of the entire network, i.e., the learned sparsity of the SAS-DBN will be destroyed. To ensure that the sparse structure learned by the SAS-DBN is retained during fine-tuning, two methods are generally used. One is to fix the sparse structure of the SAS-DBN and then carry out fine-tuning; the other is to bring sparse learning into the fine-tuning itself. In our experiments, we compared both methods, namely, fine-tuning based on the retained sparse structure learned by the SAS-DBN, and fine-tuning with added sparsity regularization terms, such as the KL divergence, the L1-norm and the log-sum function. No significant difference was found between them, so we only consider the first method here.

4 Experiments

One of the main purposes of imposing sparseness on deep learning models is to avoid over-fitting and improve the identification ability, especially for high-dimensional data with small sample size. In this section, we use both brain functional magnetic resonance imaging data and single nucleotide polymorphism data to demonstrate the effectiveness of the proposed DBN with self-adaptive sparsity. For the MRI data, adding the fine-tuning process does improve the accuracy rate, but there is nearly no improvement in the accuracy rate on the SNP data with or without fine-tuning. Considering the time cost of the fine-tuning process, we only used fine-tuning for the MRI data and did not use it for the SNP data. Moreover, the experiments show that although these datasets have a large number of features but few samples, the SAS-DBN can better represent the data with a sparse architecture, and it has the best classification accuracy among all the classifiers we tested.

4.1 Experiments on MRI database

This dataset is from the MLSP 2014 Schizophrenia Classification Challenge and can be downloaded from https://www.kaggle.com/c/mlsp-2014-mri. The data were collected on a 3T MRI scanner at the Mind Research Network. Image preprocessing was performed with the Statistical Parametric Mapping software, and feature extraction was performed with independent component analysis (ICA), yielding features from two imaging modalities: source-based morphometry (SBM) loadings from structural MRI (sMRI) and functional network connectivity (FNC) features from resting-state functional MRI (rs-fMRI). The data consist of 40 patients with schizophrenia or schizoaffective disorder (both denoted by SZs) and 46 healthy controls (HCs). A diagnosis of SZ was made using the Structured Clinical Interview for DSM-IV (DSM: Diagnostic and Statistical Manual of Mental Disorders) [39]. 410 features (32 SBM and 378 FNC) are extracted for each sample. SBM loadings are weights of brain maps obtained from ICA, which indicate the concentration of grey matter in different regions of the subject’s brain; the concentration of grey matter is indicative of the computational power of a particular brain region [40]. FNC features are the pair-wise correlation values between the time courses of 28 ICA brain maps, which can be seen as features describing the subject’s overall level of synchronicity between brain areas [41]. We denote the data as \(D=\{(x_{i},y_{i})\}^{n}_{i=1}\), where n = 86 (40 SZs and 46 HCs), xi is the 410-dimensional feature vector of the i-th subject (32 features from the SBM and 378 from the FNC), and yi ∈{0,1} (0 = ’Healthy Control’, 1 = ’Schizophrenic Patient’). The 86 subjects were randomly divided into two subsets, 70% for training and the remaining 30% for testing, and to obtain a stable classification estimate this random split was repeated 30 times.

The deep architecture of the SAS-DBN used here contains 5 hidden layers with 200, 100, 50, 50, and 50 units, respectively. In this stacked DBN, we adopt a GBRBM as the first RBM and BBRBMs as the others. Because the grid search method makes a complete search over a given parameter space, is easily parallelized, and yields stable optimal parameters for deep learning, it is used to tune the parameters in this paper. Specifically, one parameter is selected by grid search while the other parameters are fixed, and all parameters are optimized by repeating this process. Based on the grid search, the optimal parameters for the GBRBM and the BBRBMs are, respectively: learning rate ζ of 0.001 and 0.1, L2 penalty parameter λ1 of 0.03 and 0.0002, re-weighted-L1 penalty parameter λ2 of 0.001 and 0.001, 𝜖 of 0.1 and 0.1, KL divergence penalty parameter λ3 of 0.1 and 0.1, and sparsity target ρ of 0.01 and 0.01. For simplicity, the same number of CD steps k is used for every RBM.
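The coordinate-wise grid search described above can be sketched as follows; this is only an illustration of the procedure, and the candidate grids, the `evaluate` callback and the number of sweeps are hypothetical.

```python
def coordinate_grid_search(grids, evaluate, n_sweeps=2):
    """Tune one hyperparameter at a time over its grid while the others stay fixed,
    then repeat the sweep; `evaluate(params)` should return a validation accuracy."""
    params = {name: values[0] for name, values in grids.items()}
    for _ in range(n_sweeps):
        for name, values in grids.items():
            best_val, best_acc = params[name], float("-inf")
            for v in values:
                acc = evaluate({**params, name: v})
                if acc > best_acc:
                    best_val, best_acc = v, acc
            params[name] = best_val
    return params

# hypothetical usage:
# grids = {"zeta": [1e-3, 1e-2, 1e-1], "lam1": [2e-4, 3e-2], "lam2": [1e-3],
#          "lam3": [0.05, 0.1], "rho": [0.01, 0.05], "eps": [0.1]}
# best = coordinate_grid_search(grids, evaluate=validation_accuracy_of_sas_dbn)
```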

4.1.1 Model selection for the steps of CD

We first discuss how to select the number of CD steps, and train SAS-DBNs with different k chosen from 1 to 7. The other parameters of the SAS-DBNs are chosen as above. Figure 3 shows the classification accuracy rates (AR) and the non-zero ratio of the network connections for different k, where AR is defined as

$$ \begin{array}{@{}rcl@{}} AR &=& \frac{Corr_{p}+Corr_{c}}{Corr_{p}+Corr_{c}+InCorr_{c}+InCorr_{p}} \end{array} $$

in which, Corrp, Corrc, InCorrc and InCorrp denote the number of correctly predicted schizophrenic patients, correctly predicted healthy subjects, healthy subjects incorrectly classified as schizophrenic patients, and schizophrenic patients incorrectly classified as healthy subjects, respectively.

Fig. 3 a AR of SAS-DBNs trained with k-step CD for k = 1 to 7. b For each k, a bar graph of the average non-zero ratio of connections in the first RBM combined with a line chart of the AR; the horizontal axis gives k, the primary (left) vertical axis gives the average non-zero ratio of connections, and the secondary (right) vertical axis gives the AR. Only the connection ratio of the first RBM is shown; the connections of the other RBMs show the same tendency

Figure 3 shows that the SAS-DBN achieves its best AR at k = 5. Meanwhile, the non-zero ratio of the network connections decreases as k increases, and the decline becomes gentle after k = 5. Thus, we choose 5 as the number of CD steps.

4.1.2 Weight sparsity

Weight sparsity is an important index for demonstrating the ability of a deep learning machine to overcome over-fitting. To show the sparse learning ability of the proposed SAS-DBN, we compare it with seven deep learning models, namely, stacked RBMs with an SVM classifier (denoted by SR-SVM), a DBN with an L2-norm penalty on the connection weights (denoted by DBN), a DBN with an L1-norm penalty on the connection weights and a KL divergence penalty on the hidden neurons (denoted by sDBN), a DBN with Dropout (denoted by DP-DBN), a DBN with DropConnect (denoted by DPC-DBN), a DBN with singular value decomposition (denoted by SVD-DBN), and a DBN with pruning (denoted by Pr-DBN). Each network is well trained. In particular, for DP-DBN, when the dropout ratio is 0.1, 0.2, 0.3, 0.4 and 0.5, the average accuracy over 30 runs is 0.709, 0.6833, 0.6795, 0.6795 and 0.6859, respectively, so we choose the DP-DBN with dropout ratio 0.1. Similarly, for DPC-DBN, the average accuracy over 30 runs is 0.7358, 0.7385, 0.7321, 0.7346 and 0.7410 with drop-connect ratios of 0.1, 0.2, 0.3, 0.4 and 0.5, so we choose the DPC-DBN with drop-connect ratio 0.5. For Pr-DBN, we select the model with a weight-pruning ratio of 0.001. Each network is trained 30 times; each time, 70% of the original data are randomly sampled as the training data and the rest are the testing data.

In what follows, we show the weight sparsity of the SAS-DBN in comparison with the other DBNs. This weight-sparsity learning capability strongly indicates why the SAS-DBN works well for this kind of high-dimensional, small-sample-size data.

For a quantifiable comparison, we list the weight sparsity of the SAS-DBN and of the other seven deep learning methods with two different indicators. The first indicator is the non-zero ratio of connections, and the results are shown in Table 1. Additionally, in Fig. 4a, we further compare the SAS-DBN with the two models, DPC-DBN and Pr-DBN, that have the highest average classification accuracy among the other deep networks. In this figure, the five colors red, yellow, green, pink and blue refer to the non-zero ratios of w(i) (i = 1,⋯ ,5), respectively.

Table 1 The non-zero ratio of connections for different deep networks
Fig. 4 a The non-zero ratios of connections for the three deep networks with the sparsest connections; red, yellow, green, pink and blue refer to the non-zero ratios of w(1), w(2), w(3), w(4) and w(5) for all three networks. b The average HSM of these three networks

There is another normalized indicator for measuring the sparsity of the connections within the network, the Hoyer sparseness measure (HSM), which is based on the relationship between the L1-norm and the L2-norm of the weights [42, 43]. The HSM of the weight matrix Wm×n is defined as follows:

$$ \begin{array}{@{}rcl@{}} HSM(W)=\frac{\sqrt{m\times n}-({\sum}_{i=1}^{m}{\sum}_{j=1}^{n}|W_{ij}|)/\sqrt{{\sum}_{i=1}^{m}{\sum}^{n}_{j=1}W^{2}_{ij}}}{\sqrt{m\times n}-1} \end{array} $$
(14)

The value of HSM lies in [0,1]. The closer HSM is to 1, the sparser the connections of the network, i.e., most of the connection values are zero and only a few have significant values. The average HSM over 30 random training runs for the three deep networks is shown in Table 2 and Fig. 4b.
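Equation (14) can be transcribed directly into code (a small utility of our own, with a hypothetical function name):

```python
import numpy as np

def hoyer_sparseness(W):
    """Hoyer sparseness measure of a weight matrix, Eq. (14)."""
    d = np.sqrt(W.size)                     # sqrt(m * n)
    l1 = np.abs(W).sum()
    l2 = np.sqrt((W ** 2).sum())
    return (d - l1 / l2) / (d - 1.0)

# a matrix with a single non-zero entry gives HSM = 1,
# while a dense matrix with equal-magnitude entries gives HSM = 0
```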

Table 2 The average HSM of connections for different deep networks

Further, Fig. 5 shows the weight-sparseness learning ability of SAS-DBN, DPC-DBN and Pr-DBN. The SAS-DBN has sparser connections than DPC-DBN and Pr-DBN, which is an important reason why it has better identification and sparse-feature extraction abilities than the other two deep learning models.

Fig. 5 The weight-sparseness learning ability of SAS-DBN, DPC-DBN and Pr-DBN. The horizontal axis indexes the connections between different layers and the vertical axis gives the values of those connections; w(i) (i = 1,⋯ ,5) refers to the connections between layers i and i + 1

From Tables 1 and 2 and Figs. 4 and 5, one can conclude that the proposed SAS-DBN has the sparsest connections among the compared deep networks, and in particular it learns a sparse w(1). The learning of w(1) is quite crucial, since it is the first and most basic step for extracting essential features and removing redundancy and noise from the original data.

4.1.3 Classification accuracy rates of fifteen methods

To show the classification results, we compare the SAS-DBN with several typical classifiers, i.e., linear SVM (denoted by SVM-I), SVM with an RBF kernel (denoted by SVM-II), KNN, Naive Bayes (NB), Random Forest (RF), the sparse-representation-based classifier (SRC), the discriminant analysis classifier (DAC), and the seven deep learning models described above [44,45,46]. Each classifier is run 30 times; each time, 70% of the original data are randomly sampled as the training data and the rest are the testing data.

The convergence behaviour of the proposed method on the training and testing data is shown in Fig. 6. There is a sharp fall between epochs 20 and 40 because the first 30 epochs correspond to the training of the SAS-DBN and the subsequent epochs to the fine-tuning. The average results are shown in Table 3 and Fig. 7. Among these fifteen methods, the proposed method has the best performance; compared with the other methods, its AR is increased by 11.92%, 13.22%, 9.55%, 15.23%, 11.93%, 23.83%, 13.70%, 6.93%, 2.55%, 1.56%, 5.97%, 1.39%, 3.36%, and 0.68%, respectively. Together with the previous two subsections, this shows that the SAS-DBN has the best sparse learning ability and discrimination capability on the fMRI data.

Fig. 6 The convergence performance for the fMRI data; the red line is the average error and the shaded area is the standard deviation of the error

Table 3 AR of fifteen classifiers
Fig. 7 AR results of the fifteen classifiers

4.2 Experiments on SNP database

In this section, subject recruitment and data collection were conducted by the Mind Clinical Imaging Consortium (MCIC). The SNP data were collected from 208 subjects, including 92 schizophrenia patients (age: 34 ± 11, 22 females) and 116 healthy controls (age: 32 ± 11, 44 females). All of them provided written informed consent. Healthy participants were free of any medical, neurological or psychiatric illnesses and had no history of substance abuse. Patients met the criteria for DSM-IV-TR schizophrenia [48, 49], based on a clinical interview for DSM-IV-TR disorders [47] or the comprehensive assessment of symptoms and history. Antipsychotic history was collected as part of the psychiatric assessment.

A blood sample was obtained from each participant and DNA was extracted. Genotyping of all participants was performed at the Mind Research Network using the Illumina Infinium HumanOmni1-Quad assay covering 1,140,419 SNP loci, and BeadStudio was used to make the final genotype calls. The PLINK software package (http://pngu.mgh.harvard.edu/purcell/plink) was used to perform a series of standard quality-control procedures, resulting in a final data set spanning 777,635 SNP loci. Each SNP was categorized into three clusters based on its genotype and represented by a discrete number: 0 for \('BB^{\prime }\) (no minor allele), 1 for \('AB^{\prime }\) (one minor allele) and 2 for \('AA^{\prime }\) (two minor alleles). SNPs with more than 20% missing data were deleted, the remaining missing values were imputed, and SNPs with a minor allele frequency below 1% were removed. For genetic diseases, because pathways organize genes into biologically functional groups and model the interactions that capture correlations between genes, genes involved in the same pathway have similar effects and act together in regulating a biological system [50]. Therefore, using pathway information can improve not only the predictive performance but also the interpretability of the selected features. In this paper, we select the SNPs falling into the pathways of schizophrenia-related genes for analysis. After this selection, the dimension of each sample is 12513.
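A rough sketch of the quality-control filters described above, applied to a hypothetical genotype matrix `G` coded 0/1/2 with NaN for missing calls (in practice these steps were done with PLINK; the thresholds follow the text, everything else is assumed):

```python
import numpy as np

def filter_snps(G, max_missing=0.20, min_maf=0.01):
    """Drop SNPs with >20% missing calls or minor allele frequency < 1%,
    then impute the remaining missing entries with the per-SNP mode."""
    G = G[:, np.isnan(G).mean(axis=0) <= max_missing]
    maf = np.nanmean(G, axis=0) / 2.0             # allele frequency under 0/1/2 coding
    maf = np.minimum(maf, 1.0 - maf)              # fold to the minor allele
    G = G[:, maf >= min_maf]
    for j in range(G.shape[1]):                   # mode imputation per SNP
        col = G[:, j]
        if np.isnan(col).any():
            vals, counts = np.unique(col[~np.isnan(col)], return_counts=True)
            col[np.isnan(col)] = vals[np.argmax(counts)]
    return G
```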

The deep architecture of the SAS-DBN used here contains three hidden layers with 2000, 1000, and 500 units, respectively, built from three stacked BBRBMs. Based on the grid search method, the learning rate ζ is 0.1, the L2 penalty parameter λ1 is 0.0002, the re-weighted-L1 penalty parameter λ2 is 0.0002, 𝜖 is 0.1, the KL divergence penalty parameter λ3 is 0.05, the sparsity target ρ is 0.05, and the number of CD steps is 5. The 208 subjects were randomly divided into two subsets: 148 subjects for training and the remaining 60 for testing. We ran the SAS-DBN 30 times to obtain the average accuracy rate. The convergence performance of the proposed method on the training and testing data is shown in Fig. 8, which indicates that the proposed method converges quickly within a few iterations.

Fig. 8 The convergence performance for the SNP data; the red line is the average error and the shaded area is the standard deviation of the error

4.2.1 Weight sparsity

We compare the weight sparsity of the SAS-DBN with that of the other six deep networks mentioned in Section 4.1.2. In the experiments, each network is trained separately. In particular, for DP-DBN, when the dropout ratio is 0.1, 0.2, 0.3, 0.4 and 0.5, the average accuracy over 30 runs is 0.8483, 0.8517, 0.765, 0.825 and 0.8717, respectively, so we choose the DP-DBN with dropout ratio 0.5. Similarly, for DPC-DBN, the average accuracy over 30 runs is 0.8717, 0.8633, 0.8783, 0.82 and 0.8717 with drop-connect ratios of 0.1, 0.2, 0.3, 0.4 and 0.5, so we choose the DPC-DBN with drop-connect ratio 0.3. For Pr-DBN, we select the model with a weight-pruning ratio of 0.001. We perform the learning process 30 times; each time, the training data are a random 70% sample of the original data.

The ratio of connection weights with non-zero values between different layers on the SNP data is shown in Table 4a, and the HSM is presented in Table 4b. The SAS-DBN has the lowest non-zero connection ratio and the highest HSM among the deep networks. Additionally, in Fig. 9a and b, we further compare the SAS-DBN with the two models, sDBN and SVD-DBN, which have the highest average classification accuracy among the other six deep networks.

Table 4 Two different sparse indexes of connections in SAS-DBN and other deep networks
Fig. 9 a and b show the two sparseness indexes of the connections in SAS-DBN, sDBN and SVD-DBN on the SNP data. a: the non-zero ratio of connections for the three models. b: the HSM of the connections for the three models

4.2.2 Classification results on the SNP database

The performance of each classifier is quantified by AR, sensitivity (SS), specificity (SC), positive predictive value (PPV) and negative predictive value (NPV). The last four are defined as follows

$$ \begin{array}{@{}rcl@{}} SS &=& \frac{Corr_{p}}{Corr_{p}+InCorr_{c}}\\ SC &=& \frac{Corr_{c}}{Corr_{c}+InCorr_{p}} \\ PPV &=& \frac{Corr_{p}}{Corr_{p}+InCorr_{p}}\\ NPV &=& \frac{Corr_{c}}{Corr_{c}+InCorr_{c}} \end{array} $$

AR, SS, SC, PPV and NPV provide different assessments of the classifier’s performance from five aspects, i.e., the proportion of all samples that are correctly predicted, the proportion of patients correctly predicted among all the predicted patients, the proportion of healthy subjects correctly predicted among all the predicted healthy subjects, the proportion of correctly predicted patients among the true patients, and the proportion of healthy subjects correctly predicted among those true healthy subjects.
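These quantities can be computed directly from the four counts defined in Section 4.1.1, transcribing the formulas above (a small utility sketch of our own):

```python
def classification_metrics(corr_p, corr_c, incorr_c, incorr_p):
    """AR, SS, SC, PPV and NPV from the four prediction counts, following the text."""
    ar = (corr_p + corr_c) / (corr_p + corr_c + incorr_c + incorr_p)
    ss = corr_p / (corr_p + incorr_c)
    sc = corr_c / (corr_c + incorr_p)
    ppv = corr_p / (corr_p + incorr_p)
    npv = corr_c / (corr_c + incorr_c)
    return {"AR": ar, "SS": ss, "SC": sc, "PPV": ppv, "NPV": npv}
```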

We compare the proposed method, SAS-DBN, with several classifiers that have fairly good performance. The classification results on the testing data for the different classifiers are summarized in Table 5, where CLASSIFIER I refers to the linear SVM, II to the SVM with a multi-layer perceptron kernel, III to the sparse-representation-based classifier, IV to DBN, V to sDBN, VI to DP-DBN, VII to SVD-DBN, VIII to DPC-DBN, IX to Pr-DBN and X to SAS-DBN. Each classifier is run 30 times; each time, 70% of the original data are randomly sampled as the training data and the rest are the testing data. The average results in Table 5 show that the SAS-DBN performs the best among the classifiers: its AR is higher by 7.59%, 22.93%, 5.90%, 11.29%, 6.91%, 11.80%, 6.40%, 11.13% and 13.99%, respectively. Figure 10 further shows the classification results of all methods, where the horizontal axis represents the different classifiers, the primary (left) vertical axis gives the AR, and the secondary (right) vertical axis gives the other metrics, i.e., SS, SC, PPV and NPV. From this comparison, we conclude that the SAS-DBN has the best classification accuracy.

Table 5 Classification results by different classifiers
Fig. 10 The classification results of the 10 classifiers on the SNP data; the left y-axis is the scale of the bar chart (AR) and the right y-axis is the scale of the line chart (SS, SC, PPV and NPV)

Compared with [47, 51], which use the same data as ours, we use 148 samples for training and 60 for testing, whereas those papers use 90% of the subjects for training and only the remaining 10% for testing. The average accuracy here is over 98%, while the classification accuracy in those papers is around 85%; Li et al. [52] report a best accuracy of 65% and an average accuracy of about 55%.

4.3 Remark

For two kinds of high-dimensional data with small sample sizes, the above experiments show that, by combining the iterative re-weighted-L1 minimization algorithm with the k-step CD algorithm, the proposed deep belief network with adaptive sparsity achieves better classification performance than the other typical classifiers, including several deep networks. A crucial reason for the good classification results of the SAS-DBN is that the iterative re-weighted-L1 minimization algorithm can effectively obtain an approximate solution of the L0 minimization problem; in particular, by combining it with the k-step CD algorithm, the parameters of the network can be adjusted adaptively in a very sparse way. Furthermore, the learned sparse deep network can effectively remove noise and redundant features from the original data. With the sparse structure, discriminative features of the data can be extracted in a sparse, layer-wise way, which is critical for preventing over-fitting and improving identification accuracy.

5 Conclusion

This work is motivated by the sparsity mechanism of brain neural networks. By exploring a substitute for the L0-norm minimization problem, we embed the re-weighted-L1 minimization algorithm into the k-step CD sampling process and thereby propose the self-adaptive sparse DBN. In this way, layer-wise sparse features of the data can be effectively extracted, which is vital for learning the true distribution of the data and thereby enhances the generalization capacity as well as the discrimination performance of the network. The self-adaptive sparse DBN promises to be a practical solution to problems with a small number of samples but a large number of features. Numerical experiments on brain imaging and genomics data demonstrate its effectiveness and show that the proposed method can not only obtain a sparse deep network architecture in a self-adaptive way but also avoid the over-fitting problem.