On the Assessment of Nature-Inspired Meta-Heuristic Optimization Techniques to Fine-Tune Deep Belief Networks

Passos, Leandro Aparecido; Rosa, Gustavo Henrique de; Rodrigues, Douglas; Roder, Mateus; Papa, João Paulo

doi:10.1007/978-981-15-3685-4_3

Part of the book series: Natural Computing Series (NCS)

Abstract

Machine learning techniques are capable of talking, interpreting, creating, and even reasoning about virtually any subject. Also, their learning power has grown exponentially throughout the last years due to advances in hardware architecture. Nevertheless, most of these models still struggle in practical usage since they require a proper selection of hyper-parameters, which are often chosen empirically. Such requirements are even stronger for deep learning models, which commonly involve a higher number of hyper-parameters. A collection of nature-inspired optimization techniques, known as meta-heuristics, arises as a straightforward solution to tackle such problems since these techniques do not employ derivatives, thus alleviating the computational burden. Therefore, this work proposes a comparison among several meta-heuristic optimization techniques in the context of Deep Belief Networks hyper-parameter fine-tuning. An experimental setup was conducted over three public datasets in the task of binary image reconstruction and demonstrated consistent results, posing meta-heuristic techniques as a suitable alternative to the problem.

1 Introduction

In the past years, multimedia-based applications fostered the generation of a massive amount of data. These data provide a wide range of opportunities for machine learning applications in several areas of knowledge, such as medicine, financial market, intelligent manufacturing, and event classification. Among such machine learning approaches, deep learning methods have received significant attention due to their excellent results, often surpassing even humans.

Deep learning models try to mimic the way the human brain processes information. The basic idea is to use multiple layers to extract higher-level features progressively, where each layer learns to transform input data into a more abstract representation. Regarding applications in the image processing area, lower layers may identify edges, while higher layers may identify human-meaningful items such as faces and objects. The most employed methods include Convolutional Neural Networks (CNNs) [8], Deep Belief Networks (DBNs) [5], and Deep Boltzmann Machines (DBMs) [23], among others.

Since “deep” in deep learning refers to the architecture complexity, the more complex it becomes, the higher the number of hyper-parameters to fit. Yosinski and Lipson [36], for instance, highlighted some approaches for visualizing the behavior of a single Restricted Boltzmann Machine (RBM) [24], an energy-based model that can be used to build DBNs and DBMs, during its learning procedure, and provided an overview toward understanding such complexities. Such a problem has usually been tackled using auto-learning tools, which combine parameter fine-tuning with feature selection techniques [26]. Nevertheless, it can also be posed as an optimization task in which one wants to choose suitable hyper-parameters.

Therefore, meta-heuristic algorithms have become a viable alternative to solve optimization problems due to their simple implementation. Kuremoto et al. [7], for instance, employed Particle Swarm Optimization (PSO) [6] in the context of hyper-parameter fine-tuning concerning RBMs, while Liu et al. [10] and Levy et al. [9] applied Genetic Algorithms (GA) [29] for model selection and automatic painter classification using RBMs, respectively. Later, Rosa et al. [22] employed the Firefly Algorithm to fine-tune DBN hyper-parameters. Finally, Passos et al. [15, 16] proposed a similar approach comparing several meta-heuristic techniques to fine-tune hyper-parameters in DBMs, infinity Restricted Boltzmann Machines [13, 18], and RBM-based models in general [14].

Following this idea, this chapter presents a comparison among ten different swarm- and differential evolution-based meta-heuristic algorithms in the context of fine-tuning DBN hyper-parameters. We present a discussion about the viability of such approaches in three public datasets, as well as a statistical evaluation through the Wilcoxon signed-rank test. The remainder of this chapter is organized as follows. Section 3.2 introduces the theoretical background concerning RBMs and DBNs. Section 3.3 describes the meta-heuristic optimization algorithms. Sections 3.4 and 3.5 present the methodology and the experimental results, respectively. Finally, Sect. 3.6 states conclusions and future works.

2 Theoretical Background

In this section, we present a theoretical background concerning Restricted Boltzmann Machines and Deep Belief Networks.

2.1 Restricted Boltzmann Machines

Restricted Boltzmann Machines are well-known stochastic-nature neural networks inspired by physical laws of statistical mechanics and parameterized by concepts like energy and entropy. These networks are commonly employed in the field of unsupervised learning, having at least two layers of neurons, i.e., one visible and one hidden.

The basic Restricted Boltzmann Machine architecture is composed of a visible layer $\mathbf{v}=\{v_1, v_2, \ldots, v_m\}$ with m units and a hidden layer $\mathbf{h}=\{h_1, h_2, \ldots, h_n\}$ with n units. Furthermore, a real-valued matrix $\mathbf{W}_{m\times n}$ is responsible for modeling the restricted connections, i.e., the weights, between the visible and hidden neurons, where $w_{ij}$ represents the connection between the visible unit $v_i$ and the hidden unit $h_j$. Figure 3.1 describes the vanilla RBM architecture.

Regarding the learning process, a layer composed of visible units represents the input data to be processed, while the hidden layer is employed to extract deep-seated patterns and information from this data. Besides, both visible and hidden units assume only binary values, i.e., $\mathbf{v}\in\{0,1\}^m$ and $\mathbf{h}\in\{0,1\}^n$, since the sampling process is derived from a Bernoulli distribution [4]. Finally, the training process is performed by minimizing the system’s energy considering both the visible and hidden layers’ units, as well as the biases associated with each layer. The energy can be computed as follows:

$$\displaystyle \begin{aligned} E(\mathbf{v},\mathbf{h})=-\sum_{i=1}^ma_iv_i-\sum_{j=1}^nb_jh_j-\sum_{i=1}^m\sum_{j=1}^nv_ih_jw_{ij}, \end{aligned} $$

(3.1)

where a and b represent the biases of visible and hidden units, respectively.

Computing the system’s probability is an intractable task due to the computational cost. However, one can estimate the probability of activating a single visible neuron i given the hidden units through Gibbs sampling over a Markov chain, as follows:

$$\displaystyle \begin{aligned} P(v_i=1|\mathbf{h})=\phi\left(\sum_{j=1}^nw_{ij}h_j+a_i\right), \end{aligned} $$

(3.2)

and, in a similar fashion, the probability of activating a single hidden neuron j given the visible units is stated as follows:

$$\displaystyle \begin{aligned} P(h_j=1|\mathbf{v})=\phi\left(\sum_{i=1}^mw_{ij}v_i+b_j\right), \end{aligned} $$

(3.3)

where ϕ(⋅) stands for the logistic-sigmoid function.
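
Equations (3.1), (3.2), and (3.3) can be illustrated with a minimal pure-Python sketch. The function names and the toy values below are illustrative assumptions and do not come from the chapter; W[i][j] is taken to connect visible unit i to hidden unit j, as in the definition of the weight matrix above.

```python
import math

def sigmoid(x):
    """Logistic-sigmoid function, phi in Eqs. (3.2) and (3.3)."""
    return 1.0 / (1.0 + math.exp(-x))

def energy(v, h, W, a, b):
    """Energy E(v, h) of Eq. (3.1); W[i][j] links visible i to hidden j."""
    visible_term = sum(a[i] * v[i] for i in range(len(v)))
    hidden_term = sum(b[j] * h[j] for j in range(len(h)))
    interaction = sum(v[i] * h[j] * W[i][j]
                      for i in range(len(v)) for j in range(len(h)))
    return -visible_term - hidden_term - interaction

def p_visible_given_hidden(h, W, a):
    """P(v_i = 1 | h) for every visible unit i, Eq. (3.2)."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))) + a[i])
            for i in range(len(a))]

def p_hidden_given_visible(v, W, b):
    """P(h_j = 1 | v) for every hidden unit j, Eq. (3.3)."""
    return [sigmoid(sum(W[i][j] * v[i] for i in range(len(v))) + b[j])
            for j in range(len(b))]

# Toy RBM with m = 3 visible and n = 2 hidden units (arbitrary values).
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.0]]
a = [0.0, 0.1, -0.1]
b = [0.2, -0.3]
v = [1, 0, 1]
h = [1, 0]

print(energy(v, h, W, a, b))
print(p_hidden_given_visible(v, W, b))
print(p_visible_given_hidden(h, W, a))
```

Because there are no connections within a layer, each conditional factorizes over units, which is why the two probability functions return one independent value per unit.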

The training process consists of maximizing the product of probabilities given a set of parameters θ = (W, a, b) and the data probability distribution over the training samples. Such a process can be easily computed using either the Contrastive Divergence (CD) [3] or the Persistent Contrastive Divergence (PCD) [27] algorithms.

2.2 Contrastive Divergence

Hinton [3] introduced a faster methodology to compute the energy of the system based on contrastive divergence. The idea is to initialize the visible units with a training sample, to compute the states of the hidden units using Eq. (3.3), and then to compute the states of the visible units (reconstruction step) using Eq. (3.2). In short, this is equivalent to performing Gibbs sampling with k = 1 and initializing the chain with the training samples.

Therefore, the equation below leads to a simple learning rule for updating the weights matrix W, and biases a and b at iteration t:

$$\displaystyle \begin{aligned} {\mathbf{W}}^{t+1}={\mathbf{W}}^t+\underbrace{\eta(P(\mathbf{h}|\mathbf{v}){\mathbf{v}}^T-P(\tilde{\mathbf{h}}|\tilde{\mathbf{v}})\tilde{\mathbf{v}}^T)+\varPhi}_{=\varDelta{\mathbf{W}}^t}, \end{aligned} $$

(3.4)

$$\displaystyle \begin{aligned} {\mathbf{a}}^{t+1}={\mathbf{a}}^t+\underbrace{\eta(\mathbf{v}-\tilde{\mathbf{v}})+\varphi\varDelta {\mathbf{a}}^{t-1}}_{=\varDelta{\mathbf{a}}^t}, \end{aligned} $$

(3.5)

$$\displaystyle \begin{aligned} {\mathbf{b}}^{t+1}={\mathbf{b}}^t+\underbrace{\eta(P(\mathbf{h}|\mathbf{v})-P(\tilde{\mathbf{h}}|\tilde{\mathbf{v}}))+\varphi\varDelta {\mathbf{b}}^{t-1}}_{=\varDelta{\mathbf{b}}^t}, \end{aligned} $$

(3.6)

where η stands for the learning rate, φ denotes the momentum, $\tilde {\mathbf {v}}$ stands for the reconstruction of the visible layer given h, and $\tilde {\mathbf {h}}$ denotes an estimation of the hidden vector h given $\tilde {\mathbf {v}}$. In a nutshell, Eqs. (3.4), (3.5), and (3.6) show the optimization algorithm, the well-known Gradient Descent. The additional term Φ in Eq. (3.4) is used to control the values of matrix W during the convergence process, and it is described as follows:

$$\displaystyle \begin{aligned} \varPhi = -\lambda{\mathbf{W}}^t+\varphi\varDelta{\mathbf{W}}^{t-1}, \end{aligned} $$

(3.7)

where λ stands for the weight decay.
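
The update rules in Eqs. (3.4) to (3.7) can be sketched for a single training sample as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the in-place update style, and the hyper-parameter values in the usage lines are assumptions made for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, b):
    """P(h_j = 1 | v), Eq. (3.3)."""
    return [sigmoid(sum(W[i][j] * v[i] for i in range(len(v))) + b[j])
            for j in range(len(b))]

def visible_probs(h, W, a):
    """P(v_i = 1 | h), Eq. (3.2)."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))) + a[i])
            for i in range(len(a))]

def sample(probs, rng):
    """Draw binary states from independent Bernoulli distributions."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_step(v, W, a, b, dW, da, db, eta, lam, mom, rng):
    """One CD-1 update for a single sample v; all parameters change in place."""
    ph = hidden_probs(v, W, b)                       # P(h | v)
    h = sample(ph, rng)
    v_tilde = sample(visible_probs(h, W, a), rng)    # reconstruction
    ph_tilde = hidden_probs(v_tilde, W, b)           # P(h~ | v~)
    for i in range(len(a)):
        for j in range(len(b)):
            grad = ph[j] * v[i] - ph_tilde[j] * v_tilde[i]
            # Eq. (3.4), with Phi from Eq. (3.7): weight decay plus momentum.
            dW[i][j] = eta * grad - lam * W[i][j] + mom * dW[i][j]
            W[i][j] += dW[i][j]
        da[i] = eta * (v[i] - v_tilde[i]) + mom * da[i]      # Eq. (3.5)
        a[i] += da[i]
    for j in range(len(b)):
        db[j] = eta * (ph[j] - ph_tilde[j]) + mom * db[j]    # Eq. (3.6)
        b[j] += db[j]
    return v_tilde

# Usage with arbitrary toy values (m = 4 visible units, n = 2 hidden units).
rng = random.Random(0)
m, n = 4, 2
W = [[rng.gauss(0.0, 0.01) for _ in range(n)] for _ in range(m)]
a, b = [0.0] * m, [0.0] * n
dW = [[0.0] * n for _ in range(m)]
da, db = [0.0] * m, [0.0] * n
v = [1, 0, 1, 0]
v_tilde = cd1_step(v, W, a, b, dW, da, db, eta=0.1, lam=0.0001, mom=0.5, rng=rng)
```

The previous increments dW, da, and db are kept between calls so that the momentum term can reuse them at the next iteration.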

2.3 Persistent Contrastive Divergence

Most of the issues related to the Contrastive Divergence approach concern the number of iterations employed to approximate the model to the real data. Although the approach proposed by Hinton [3] takes k = 1 and works well for real-world problems, one can set different values for k [1] (Note 1).

Notwithstanding, Contrastive Divergence provides a good approximation to the likelihood gradient, i.e., it gives a reasonable estimation of the model to the data when k → ∞. However, its convergence might become poor when the Markov chain has “low mixing.” Moreover, it tends to converge well only in the early iterations and to slow down as iterations go by, thus demanding the use of parameter decay.

Therefore, Tieleman [27] proposed the Persistent Contrastive Divergence, an interesting alternative to contrastive divergence that emulates higher values of k while keeping the computational burden relatively low. The idea is quite simple: in CD-1, each training sample is employed to start an RBM and rebuild a model after a single Gibbs sampling iteration. Once every training sample is presented to the RBM, we have a so-called epoch. The process is repeated for each next epoch, i.e., the same training samples are used to feed the RBM, and the Markov chain is restarted at each epoch. PCD, in contrast, does not restart the chain: the state reached by the Markov chain after one update is kept and used as the starting point for the next one [27].
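
The practical difference between CD-1 and PCD lies in where the Gibbs chain that produces the reconstruction starts. The sketch below isolates that single aspect: with CD-1 the chain is restarted at each training sample, whereas a persistent chain (as in Tieleman's PCD) carries its state from one update to the next. The function names and the single-chain simplification are assumptions made for illustration, and the parameter update itself is omitted.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gibbs_step(v, W, a, b, rng):
    """One full Gibbs step v -> h -> v' using Eqs. (3.3) and (3.2)."""
    h = [1 if rng.random() < sigmoid(sum(W[i][j] * v[i] for i in range(len(v))) + b[j]) else 0
         for j in range(len(b))]
    return [1 if rng.random() < sigmoid(sum(W[i][j] * h[j] for j in range(len(h))) + a[i]) else 0
            for i in range(len(a))]

def negative_samples(data, W, a, b, rng, persistent):
    """Yield (v, v_tilde) pairs for one pass over the data.

    persistent=False: CD-1, the chain is restarted at every training sample.
    persistent=True: one chain whose state survives from one sample to the next.
    """
    chain = list(data[0])                  # initial state of the persistent chain
    for v in data:
        start = chain if persistent else v
        v_tilde = gibbs_step(start, W, a, b, rng)
        if persistent:
            chain = v_tilde                # carried over, never reset to the data
        yield v, v_tilde

# Usage: W, a, and b may be updated by the caller between two yielded pairs.
rng = random.Random(1)
data = [[0, 1], [1, 1]]
W = [[0.0], [0.0]]
pairs = list(negative_samples(data, W, [0.0, 0.0], [0.0], rng, persistent=True))
```

Since the weights are plain mutable lists, a training loop can apply the CD update rule after each yielded pair and the persistent chain will continue under the new parameters.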

2.4 Deep Belief Networks

Deep Belief Networks [5] are graphical models composed of a visible and L hidden layers, where each layer is connected to the next one through a weight matrix $\mathbf{W}_l$, l ∈ {1, 2, …, L}, and there is no connection between units from the same layer. In a nutshell, one can consider each set of two subsequent layers as an RBM trained in a greedy fashion such that the trained hidden layer of the bottommost RBM feeds the next RBM’s visible layer, and so on. Figure 3.2 depicts the model. Notice $\mathbf{v}$ and $\mathbf{h}_l$ stand for the visible and the l-th hidden layers, respectively.

Although this work focuses on image reconstruction, one can use DBNs for supervised classification tasks. Such an approach requires, after the greedy feedforward pass mentioned above, fine-tuning the network weights using either backpropagation or gradient descent. Afterward, a softmax layer is added at the top of the model to attribute the predicted labels.
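
To make the stacking concrete, the following sketch propagates an input up through a stack of already-trained RBMs and back down to obtain a reconstruction. It is only an assumption about how such a pass could look: the chapter does not detail the reconstruction procedure, the greedy training of each layer is omitted, and probabilities are propagated instead of sampled states for simplicity.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def up(v, W, b):
    """Hidden activation probabilities of one RBM given its visible layer."""
    return [sigmoid(sum(W[i][j] * v[i] for i in range(len(v))) + b[j])
            for j in range(len(b))]

def down(h, W, a):
    """Visible activation probabilities of one RBM given its hidden layer."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))) + a[i])
            for i in range(len(a))]

def dbn_reconstruct(v, layers):
    """layers is a list of (W, a, b) tuples, bottommost RBM first."""
    x = v
    for W, a, b in layers:             # bottom-up: each hidden layer feeds the next RBM
        x = up(x, W, b)
    for W, a, b in reversed(layers):   # top-down: back to the visible layer
        x = down(x, W, a)
    return x

# Two stacked RBMs: 3 visible -> 2 hidden -> 1 hidden (all-zero toy parameters).
rbm1 = ([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0.0, 0.0, 0.0], [0.0, 0.0])
rbm2 = ([[0.0], [0.0]], [0.0, 0.0], [0.0])
print(dbn_reconstruct([1, 0, 1], [rbm1, rbm2]))
```

The hidden size of one RBM must match the visible size of the RBM stacked on top of it, which is why rbm2 has two visible units here.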

3 Meta-heuristic Optimization Algorithms

This section presents a brief description of the meta-heuristic optimization techniques employed in this work.

Improved Harmony Search (IHS) [11]: an improved version of the Harmony Search optimization algorithm that employs dynamic values for both the Pitch Adjusting Rate (PAR), considering values in the range $[\mathrm{PAR}_{\min}, \mathrm{PAR}_{\max}]$, and the Harmony Memory Considering Rate (HMCR), which assumes values in the range $[\mathrm{HMCR}_{\min}, \mathrm{HMCR}_{\max}]$. Additionally, the algorithm uses the bandwidth variable ϱ in the range $[\varrho_{\min}, \varrho_{\max}]$ to calculate PAR.
Particle Swarm Optimization with Adaptive Inertia Weight (AIWPSO) [12]: an improved version of the Particle Swarm Optimization that employs self-adjusting inertia weights w for each particle along the search space, aiming to balance global exploration and local exploitation. Notice the method uses the variables $c_1$ and $c_2$ to control the particles’ acceleration.
Flower Pollination Algorithm (FPA) [21, 35]: a meta-heuristic optimization algorithm that tries to mimic the pollination process performed by flowers. The algorithm employs four basic rules: (1) the cross-pollination, which stands for the pollination performed by birds and insects, (2) the self-pollination, representing the pollination performed by wind diffusion or similar approaches, (3) the constancy of birds/insects, representing the probability of reproduction, and (4) the interaction of local and global pollination, controlled by the probability parameter p. Additionally, the algorithm employs a parameter β to control the amplitude of the distribution.
Bat Algorithm (BA) [34]: based on the bats’ echolocation system while searching for food and prey. The algorithm employs a swarm of virtual bats randomly flying in the search space at different velocities, and also following a random walk approach for local search intensification. Additionally, it applies a dynamically updated wavelength frequency in the range $[f_{\min}, f_{\max}]$ according to the distance from the objective, as well as the loudness A and the pulse rate r.
Firefly Algorithm (FA) [31]: the algorithm is based on the fireflies’ approach for attracting potential prey and mating partners. It employs the attractiveness parameter β, which influences the brightness of each agent, depending on its position and the light absorption coefficient γ. Moreover, the model employs a random perturbation α used to perform a random walk and avoid local optima.
Cuckoo Search (CS) [20, 32, 33]: the model combines the parasitic behavior of some cuckoo species with a τ-step random walk over a Markov chain. It employs three basic concepts: (1) each cuckoo lays a single egg per iteration in another bird’s randomly chosen nest, (2) $p_a \in [0, 1]$ defines the probability of this bird discovering and discarding the cuckoo’s egg, or abandoning the nest and creating a new one, i.e., a new solution, and (3) the nests with the best eggs will carry over to the next generations.
Differential Evolution (DE) [25]: an evolutionary algorithm that maintains a population of candidate solutions, which are combined and improved over the following generations, aiming to find the characteristics that best fit the problem. The algorithm employs a mutation factor to control the mutation amplitude, as well as a parameter to control the crossover probability.
Backtracking Search Optimization Algorithm (BSA) [2, 17]: an evolutionary algorithm that employs a random selection of a historical population for mutation and crossover operations to generate a new population of individuals based on past experiences. The algorithm controls the number of elements to be mutated using a mixing rate (mix_rate) parameter, as well as the amplitude of the search-direction matrix with the parameter F.
Differential Evolution Based on Covariance Matrix Learning and Bimodal Distribution Parameter Setting Algorithm (CoBiDE) [28]: a differential evolution model that represents the search space coordinate system using a covariance matrix according to the probability parameter $P_b$, and the proportion of individuals employed in the process using the $P_s$ variable. Moreover, it employs a bimodal distribution to control the mutation and crossover rates, aiming at a better trade-off between exploitation and exploration.
Adaptive Differential Evolution with Optional External Archive (JADE) [19, 37]: JADE is a differential evolution-based algorithm that employs the “DE/current-to-p-best” strategy, i.e., only the p-best agents are used in the mutation process. Further, the algorithm employs both a historical population and a control parameter, which is adaptively updated. Finally, it requires a proper selection of the rate of adaptation parameter c, as well as the mutation greediness parameter g.
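
As a concrete reference point for the population-based methods listed above, the sketch below implements the classic DE/rand/1/bin scheme, i.e., plain Differential Evolution with a mutation factor F and a crossover probability CR. It is not CoBiDE, JADE, or any other variant evaluated in the chapter, and the default values of F and CR, the function names, and the sphere test function are assumptions chosen for illustration only.

```python
import random

def differential_evolution(fitness, bounds, n_agents=5, n_iters=50,
                           F=0.5, CR=0.9, seed=0):
    """Minimize `fitness` inside `bounds` with DE/rand/1/bin (needs n_agents >= 4)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_agents)]
    fit = [fitness(x) for x in pop]
    for _ in range(n_iters):
        for i in range(n_agents):
            # Three distinct agents, all different from the current one.
            r1, r2, r3 = rng.sample([k for k in range(n_agents) if k != i], 3)
            j_rand = rng.randrange(dim)     # guarantees at least one mutated gene
            trial = []
            for j, (lo, hi) in enumerate(bounds):
                if rng.random() < CR or j == j_rand:
                    value = pop[r1][j] + F * (pop[r2][j] - pop[r3][j])
                else:
                    value = pop[i][j]
                trial.append(min(max(value, lo), hi))   # clip to the search space
            f_trial = fitness(trial)
            if f_trial <= fit[i]:           # greedy selection: keep the better one
                pop[i], fit[i] = trial, f_trial
    best = min(range(n_agents), key=lambda k: fit[k])
    return pop[best], fit[best]

def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(xi * xi for xi in x)

best_x, best_f = differential_evolution(sphere, [(-5.0, 5.0)] * 4)
print(best_x, best_f)
```

The same interface (a fitness function plus box bounds) is what a hyper-parameter search needs: the fitness would train a DBN with the decoded hyper-parameters and return its reconstruction error.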

4 Methodology

This section introduces the intended procedure for DBN hyper-parameter fine-tuning. Additionally, it describes the employed datasets and the experimental setup.

4.1 Modeling DBN Hyper-parameter Fine-tuning

The learning procedure of each RBM employs four hyper-parameters, as specified in Sect. 3.2.1: the learning rate η, weight decay λ, momentum φ, and the number of hidden units n. Since DBNs are built over RBM blocks, they employ a similar process to fine-tune each of their layers individually. In short, a four-dimensional search space composed of three real-valued variables and one integer-valued variable should be selected for each layer. Notice the variable values are intrinsically real numbers, thus requiring a type casting to obtain the nearest integer. Such an approach aims at selecting the set of DBN hyper-parameters that minimizes the reconstruction error over the training images, measured by the mean squared error (MSE). Subsequently, the selected set of parameters is applied to reconstruct the unseen images of the test set. Figure 3.3 depicts the procedure.
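
The mapping from an agent's position to DBN hyper-parameters, together with the error used as fitness, can be sketched as below. The ordering of the four variables inside the vector, the dictionary keys, and the averaging of the squared error over all pixels and images are assumptions made for this example; the chapter only states that the number of hidden units is cast to the nearest integer.

```python
def decode(position, n_layers):
    """Split a 4L-dimensional agent position into per-layer hyper-parameters."""
    if len(position) != 4 * n_layers:
        raise ValueError("expected 4 values per layer")
    layers = []
    for l in range(n_layers):
        n, eta, lam, mom = position[4 * l: 4 * l + 4]
        layers.append({
            "n_hidden": int(round(n)),   # real value cast to the nearest integer
            "learning_rate": eta,
            "weight_decay": lam,
            "momentum": mom,
        })
    return layers

def mse(originals, reconstructions):
    """Mean squared error over every pixel of every image."""
    total, count = 0.0, 0
    for x, y in zip(originals, reconstructions):
        for xi, yi in zip(x, y):
            total += (xi - yi) ** 2
            count += 1
    return total / count

print(decode([42.6, 0.5, 0.7, 0.005], n_layers=1))
print(mse([[1, 0], [0, 1]], [[1, 1], [0, 0]]))
```

Note that Python's built-in round resolves exact halves to the nearest even integer, so a value such as 42.5 becomes 42.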

4.2 Datasets

We employed three datasets, as described below:

MNIST dataset (Note 2): a dataset composed of images of the handwritten digits “0”–“9”. Regarding the pre-processing, the images were converted from gray-scale to binary, as well as resized to 14 × 14. Additionally, the training was performed over 2% of the training set, i.e., 1200 images, due to the demanded computational burden. Moreover, the complete set of 10,000 test images was employed for testing.
Semeion Handwritten Digit Dataset (Note 3): similar to MNIST, Semeion is composed of 1593 images of the handwritten digits “0”–“9”. In this paper, we resized the samples to 16 × 16 and binarized each pixel.
CalTech 101 Silhouettes Dataset (Note 4): a dataset composed of 101 classes of silhouettes with a resolution of 28 × 28. No pre-processing step was applied to the image samples.

Figure 3.4 displays some training examples from the above datasets.
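
The chapter does not state how the images were resized or which threshold was used for binarization. Purely as an illustration of that kind of pre-processing, the sketch below halves the resolution with 2 × 2 average pooling (e.g., 28 × 28 to 14 × 14) and thresholds at 0.5 for intensities scaled to [0, 1]; both choices, as well as the function names, are assumptions.

```python
def downsample_2x2(image):
    """Halve the resolution by averaging non-overlapping 2 x 2 blocks."""
    rows, cols = len(image), len(image[0])
    return [[(image[r][c] + image[r][c + 1] +
              image[r + 1][c] + image[r + 1][c + 1]) / 4.0
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]

def binarize(image, threshold=0.5):
    """Map gray-scale intensities in [0, 1] to binary pixel values."""
    return [[1 if p >= threshold else 0 for p in row] for row in image]

# Toy 4 x 4 image with intensities already scaled to [0, 1].
toy = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [1, 0, 0, 0],
       [1, 1, 0, 0]]
print(binarize(downsample_2x2(toy)))
```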

4.3 Experimental Setup

Experiments were conducted over 20 runs and a 2-fold cross-validation for statistical analysis using the Wilcoxon signed-rank test [30] at a 5% significance level. Each meta-heuristic technique employed five agents (particles) over 50 iterations for convergence purposes over the three configurations, i.e., DBNs with 1, 2, and 3 layers. Additionally, the paper compares different techniques, ranging from music-composition-inspired to swarm-based and evolutionary methods, in the context of DBN hyper-parameter fine-tuning, as presented in Sect. 3.3.

Table 3.1 exhibits the parameter configuration for every meta-heuristic technique (Note 5).

Table 3.1 Meta-heuristic algorithms’ parameter configuration

Finally, each DBN layer is composed of an RBM whose hyper-parameters are randomly initialized according to the following ranges: n ∈ [5, 100], η ∈ [0.1, 0.9], λ ∈ [0.1, 0.9], and φ ∈ [10⁻⁵, 10⁻¹]. Additionally, the experiments were conducted over three different depth configurations, i.e., DBNs composed of 1, 2, and 3 RBM layers, which implies fine-tuning 4-, 8-, and 12-dimensional sets of hyper-parameters, respectively. We also employed T = 10 as the number of epochs for the DBN weight-learning procedure, with mini-batches of size 20. In order to present a more in-depth experimental validation, all DBNs were trained with both the Contrastive Divergence (CD) [3] and the Persistent Contrastive Divergence (PCD) [27] algorithms. Figure 3.5 depicts the pipeline proposed in this paper.
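
The resulting search space can be written down directly from the ranges above. The sketch below builds the bounds for a DBN with a given number of layers and draws one random agent; the ordering of the variables and the function names are assumptions made for illustration.

```python
import random

# Per-layer ranges stated in the text, in the (assumed) order n, eta, lambda, phi.
LAYER_BOUNDS = [(5, 100), (0.1, 0.9), (0.1, 0.9), (1e-5, 1e-1)]

def search_space(n_layers):
    """Bounds of the 4L-dimensional search space for an L-layer DBN."""
    return LAYER_BOUNDS * n_layers

def random_agent(bounds, rng):
    """Uniformly sample one candidate solution inside the bounds."""
    return [rng.uniform(lo, hi) for lo, hi in bounds]

rng = random.Random(0)
for n_layers in (1, 2, 3):
    bounds = search_space(n_layers)
    agent = random_agent(bounds, rng)
    print(n_layers, len(bounds), agent)
```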

5 Experimental Results

This section introduces the results obtained during the experiments and provides a detailed discussion about them. Tables 3.2, 3.3, and 3.4 present the average MSE and its standard deviation regarding the MNIST, Semeion Handwritten Digit, and CalTech 101 Silhouettes datasets, respectively. The best results according to the Wilcoxon signed-rank test at a 5% significance level are presented in bold.
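
For reference, a two-sided Wilcoxon signed-rank test on paired samples can be computed exactly with the standard library alone, as sketched below. This is not the authors' implementation, which is not described in the chapter: it drops zero differences, assigns average ranks to ties, and enumerates the exact null distribution by dynamic programming. The paired values in the usage lines are toy numbers, not experimental results.

```python
def wilcoxon_signed_rank(x, y):
    """Return (statistic, two-sided exact p-value) for paired samples x and y."""
    d = [xi - yi for xi, yi in zip(x, y) if xi != yi]   # drop zero differences
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks2 = [0] * n          # doubled ranks, so that tied (average) ranks stay integer
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2    # 2 * average of the 1-based ranks i+1..j+1
        i = j + 1
    total2 = sum(ranks2)
    w_plus2 = sum(r for r, di in zip(ranks2, d) if di > 0)
    stat2 = min(w_plus2, total2 - w_plus2)
    # Null distribution: every rank is positive or negative with probability 1/2.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]
    tail = sum(counts[: stat2 + 1])
    p_value = min(1.0, 2.0 * tail / 2 ** n)
    return stat2 / 2.0, p_value

# Toy paired errors of two hypothetical techniques over five runs.
errors_a = [0.10, 0.12, 0.11, 0.13, 0.15]
errors_b = [0.09, 0.10, 0.08, 0.09, 0.10]
stat, p = wilcoxon_signed_rank(errors_a, errors_b)
print(stat, p, "similar" if p > 0.05 else "different")
```

With only five pairs, the smallest attainable two-sided p-value is 2/32, so no pair of techniques could be declared different at the 5% level; a larger number of runs, such as the 20 runs used in the experiments, allows much smaller p-values.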

Table 3.2 Average MSE values considering MNIST dataset

Table 3.3 Average MSE values considering Semeion Handwritten Digit dataset

Table 3.4 Average MSE values considering CalTech 101 Silhouettes dataset

Table 3.2 presents the results concerning the MNIST dataset. IHS obtained the lowest errors using the Contrastive Divergence algorithm over one single layer. BA and AIWPSO obtained statistically similar results using the PCD algorithm over two and three layers, respectively. One can notice that FPA using CD over a single layer also obtained the same average errors as IHS, although the Wilcoxon signed-rank test does not consider both statistically similar. Moreover, the evolutionary algorithms also obtained good results, though these were not statistically similar either.

Regarding the Semeion Handwritten Digit dataset, Table 3.3 demonstrates that the best results were obtained using the CoBiDE technique over the CD algorithm with one layer. It is worth pointing out that none of the other methods achieved statistically similar results, which confirms the robustness of evolutionary-based meta-heuristic optimization algorithms.

Similar to the MNIST dataset, the best results over the CalTech 101 Silhouettes dataset were obtained using the IHS method with the CD algorithm over a single-layered DBN, as presented in Table 3.4. IHS was also the sole technique to achieve the lowest errors since none of the other methods obtained statistically similar results.

5.1 Training Evaluation

Figure 3.6 depicts the learning steps considering the MNIST dataset. Except for the BA algorithm (and the random search), all techniques converged to the same point from the initial iterations. Notice FA outperformed such results, achieving the lowest error at iteration number 20. However, its training error then regresses to the initial values, which suggests the problem presents a local optimum that is hard to overcome, given the set of optimized parameters.

An interesting behavior is depicted in Fig. 3.7, concerning the Semeion Handwritten Digit dataset. One can observe that AIWPSO converges faster than the other techniques, obtaining an average MSE of 0.2 after ten iterations. However, AIWPSO gets stuck at this time step and is outperformed by both JADE and DE after approximately 15 iterations. Moreover, DE still improves its performance until reaching its optimum at nearly 40 iterations. This behavior is not observed over the testing set, where, although DE obtained good results, CoBiDE was the most accurate technique.

Regarding the CalTech 101 Silhouettes dataset, the learning curve depicted in Fig. 3.8 shows that AIWPSO behaves similarly to what was observed over the Semeion dataset, converging faster in the 15 initial iterations and being outperformed by JADE afterward. Notice that IHS and FPA also demonstrated a good convergence, which is expected since IHS obtained the best results over the testing set and FPA achieved very close results. Additionally, CoBiDE and BSA are also among the best techniques together with JADE and DE, confirming the robustness of evolutionary techniques in the task of DBN hyper-parameter fine-tuning.

5.2 Time Analysis

Tables 3.5, 3.6, and 3.7 present the computational burden, in hours, regarding the MNIST, Semeion Handwritten Digit, and CalTech 101 Silhouettes datasets, respectively. One can observe that CS is the fastest technique, followed by IHS. Such a result is expected since IHS evaluates a single solution per iteration, and CS employs a probability to decide whether or not each solution is evaluated. On the other hand, the remaining techniques evaluate every solution at each iteration, contributing to a higher computational burden.

Table 3.5 Average computational burden (in hours) considering MNIST dataset

Table 3.6 Average computational burden (in hours) considering Semeion Handwritten Digit dataset

Table 3.7 Average computational burden (in hours) considering CalTech 101 Silhouettes dataset

Additionally, evolutionary algorithms, in general, present a higher computational burden than swarm-based approaches. AIWPSO is an exception, being the most costly technique of all due to its updating mechanism.

In most cases, the best results were obtained using a single layer as well as the CD algorithm. Such behavior is probably related to the limited number of epochs employed for training, i.e., more complex models composed of a larger number of layers would require more epochs for convergence than the 10 epochs employed in this work. However, running the experiments under such conditions is not feasible in this context due to the massive number of executions performed for the comparisons presented in the chapter. The same is valid for the PCD algorithm.

5.3 Hyper-Parameters Analysis

This section provides a complete list of the average values of hyper-parameters obtained during the execution of every possible experimental configuration. Notice that values in bold stand for the configuration that obtained the best results according to the Wilcoxon signed-rank test.

Table 3.8 presents the average hyper-parameter values considering the MNIST dataset. The similarity between IHS and FPA considering a single layer over the Contrastive Divergence algorithm is evident, which is expected since both obtained similar results. However, comparing these results with the ones obtained with a higher number of layers over the PCD algorithm, i.e., BA with 2 layers and AIWPSO with 3 layers, is a harder task, since the number of hyper-parameters is also higher, and each one exerts a degree of influence over the others.

Table 3.8 Average hyper-parameter values considering MNIST dataset

Regarding the Semeion Handwritten Digit dataset, presented in Table 3.9, one can once again identify some relation between the set of hyper-parameters and the final results. Although IHS did not obtain the best results over the CD algorithm with a single layer, its results are pretty close to the best obtained using the CoBiDE algorithm. The resemblance is reflected in their hyper-parameter sets. Another example of this resemblance is observed in the 1-layered FPA and BSA over the CD algorithm: a close set of hyper-parameters leads to close results in the experiments.

Table 3.9 Average hyper-parameter values considering Semeion Handwritten Digit dataset

An analogous behavior is observed in Table 3.10 regarding the CalTech 101 Silhouettes dataset. Although FPA, BSA, and CoBiDE did not obtain statistically similar results to IHS according to the Wilcoxon signed-rank test, their results are very much alike, which is perceptible in their selected sets of hyper-parameters. Regarding more complex models, i.e., with two and three layers, one can still observe some likeness. Notice, for instance, the similarity between AIWPSO trained with both CD and PCD, and BSA trained with CD over three layers. However, since such models require a larger number of hyper-parameters to be fine-tuned, the number of possible combinations grows exponentially, thus providing more diverse combination sets.

Table 3.10 Average hyper-parameter values considering CalTech 101 Silhouettes dataset

3	CD	1	Hidden units	43	51	51	34	45	48	31	39	59	37	50
			Learning rate	0.52239	0.51207	0.52097	0.48690	0.63751	0.52362	0.51773	0.42346	0.38359	0.47503	0.56581
			Weight decay	0.71342	0.72279	0.61374	0.62583	0.68437	0.67406	0.69570	0.68193	0.67821	0.71961	0.73120
			Momentum	0.00431	0.00514	0.00497	0.00553	0.00428	0.00600	0.00446	0.00391	0.00512	0.00546	0.00577
		2	Hidden units	49	54	52	41	43	47	49	49	61	49	44
			Learning rate	0.52924	0.51316	0.53610	0.42102	0.49939	0.52684	0.50960	0.57533	0.57925	0.52413	0.54938
			Weight decay	0.61271	0.48068	0.44030	0.44930	0.57062	0.53085	0.40449	0.46727	0.56396	0.40061	0.44172
			Momentum	0.00442	0.00495	0.00597	0.00473	0.00428	0.00505	0.00604	0.00419	0.00422	0.00506	0.00605
		3	Hidden units	55	52	55	63	46	67	50	59	57	51	55
			Learning rate	0.51658	0.48736	0.45940	0.53802	0.55106	0.53752	0.55456	0.51279	0.55418	0.55914	0.50365
			Weight decay	0.39922	0.52719	0.58714	0.45855	0.58377	0.58716	0.54719	0.51086	0.42597	0.54949	0.49739
			Momentum	0.00594	0.00399	0.00633	0.00518	0.00428	0.00457	0.00588	0.00396	0.00575	0.00387	0.00523
	PCD	1	Hidden units	56	53	49	51	49	58	51	37	59	48	49
			Learning rate	0.46399	0.40623	0.37432	0.50307	0.48364	0.59600	0.53473	0.44439	0.38359	0.43948	0.49811
			Weight decay	0.69432	0.63476	0.59465	0.51970	0.59724	0.63553	0.61455	0.60310	0.67821	0.63087	0.65155
			Momentum	0.00341	0.00550	0.00477	0.00531	0.00451	0.00472	0.00516	0.00504	0.00512	0.00408	0.00444
		2	Hidden units	56	51	51	52	59	40	52	49	61	40	51
			Learning rate	0.51054	0.42227	0.52618	0.55580	0.50243	0.42248	0.57549	0.55893	0.57925	0.56304	0.52403
			Weight decay	0.55409	0.42987	0.47076	0.46793	0.42596	0.43440	0.47748	0.49206	0.56396	0.54592	0.51342
			Momentum	0.00586	0.00452	0.00493	0.00378	0.00451	0.00623	0.00508	0.00499	0.00422	0.00492	0.00446
		3	Hidden units	59	49	52	42	42	55	50	49	57	44	50
			Learning rate	0.49215	0.53762	0.52214	0.59451	0.45669	0.49241	0.58855	0.51762	0.55418	0.48230	0.58397
			Weight decay	0.53427	0.55750	0.46411	0.45421	0.51050	0.51246	0.47173	0.51046	0.42597	0.52163	0.44847
			Momentum	0.00522	0.00467	0.00540	0.00457	0.00451	0.00454	0.00527	0.00536	0.00575	0.00434	0.00619

Bold values denote the lowest average MSE or values whose Wilcoxon’s p-value is above 0.05, i.e., values that are statistically similar

6 Conclusions and Future Works

This chapter dealt with the problem of Deep Belief Network hyper-parameter fine-tuning through meta-heuristic approaches. Experiments were conducted using three architectures, i.e., one (naïve RBM), two, and three layers, which were trained using both the Contrastive Divergence and the Persistent Contrastive Divergence algorithms. Further, the performance of ten techniques, as well as a random search, was compared over three public binary image datasets. Results demonstrated that Improved Harmony Search obtained the best results in two out of three datasets, while CoBiDE obtained the best values regarding the Semeion Handwritten Digit dataset, denoting the efficiency of differential evolution-based techniques. Concerning the training steps, in general, AIWPSO converges faster than the other methods in the initial iterations. However, it is outperformed by evolutionary techniques after approximately 15 iterations. Finally, one can also verify that CS is the fastest technique, followed by IHS. On the other hand, AIWPSO is the slowest one.

Regarding future works, we intend to compare meta-heuristic approaches for fine-tuning DBNs in the task of classification.

Notes

1. Usually, contrastive divergence with a single iteration is called CD-1.
2. http://yann.lecun.com/exdb/mnist/.
3. https://archive.ics.uci.edu/ml/datasets/Semeion+Handwritten+Digit.
4. https://people.cs.umass.edu/~marlin/data.shtml.
5. Note that these values were empirically chosen according to their authors’ definitions.

References

Carreira-Perpiñán, M.A., Hinton, G.E.: On contrastive divergence learning. In: Cowell, R.G., Ghahramani, Z. (eds.) Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics. Society for Artificial Intelligence and Statistics, pp. 33–40 (2005)
Civicioglu, P.: Backtracking search optimization algorithm for numerical optimization problems. Appl. Math. Comput. 219(15), 8121–8144 (2013)
Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8), 1771–1800 (2002)
Hinton, G.E.: A practical guide to training restricted Boltzmann machines. In: Montavon, G., Orr, G., Müller, K.R. (eds.) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol. 7700, pp. 599–619. Springer, Berlin (2012)
Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)
Kennedy, J., Eberhart, R.C.: Swarm Intelligence. Morgan Kaufmann Publishers Inc., San Francisco (2001)
Kuremoto, T., Kimura, S., Kobayashi, K., Obayashi, M.: Time series forecasting using restricted Boltzmann machine. In: International Conference on Intelligent Computing, pp. 17–22. Springer, Berlin (2012)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Levy, E., David, O.E., Netanyahu, N.S.: Genetic algorithms and deep learning for automatic painter classification. In: Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, pp. 1143–1150. ACM, New York (2014)
Liu, K., Zhang, L.M., Sun, Y.W.: Deep Boltzmann machines aided design based on genetic algorithms. In: Applied Mechanics and Materials, vol. 568, pp. 848–851. Trans Tech, Clausthal (2014)
Mahdavi, M., Fesanghary, M., Damangir, E.: An improved harmony search algorithm for solving optimization problems. Appl. Math. Comput. 188(2), 1567–1579 (2007)
Nickabadi, A., Ebadzadeh, M.M., Safabakhsh, R.: A novel particle swarm optimization algorithm with adaptive inertia weight. Appl. Soft Comput. 11, 3658–3670 (2011)
Passos, L.A., Papa, J.P.: Fine-tuning infinity restricted Boltzmann machines. In: 2017 30th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pp. 63–70. IEEE, New York (2017)
Passos, L.A., Papa, J.P.: On the training algorithms for restricted Boltzmann machine-based models. Ph.D. thesis, Universidade Federal de São Carlos (2018)
Passos, L.A., Papa, J.P.: A metaheuristic-driven approach to fine-tune deep Boltzmann machines. Appl. Soft Comput., 105717 (2019, in press). https://www.sciencedirect.com/science/article/abs/pii/S1568494619304983
Passos, L.A., Rodrigues, D., Papa, J.P.: Fine tuning deep Boltzmann machines through meta-heuristic approaches. In: 2018 IEEE 12th International Symposium on Applied Computational Intelligence and Informatics (SACI), pp. 419–424. IEEE, New York (2018)
Passos, L.A., Rodrigues, D., Papa, J.P.: Quaternion-based backtracking search optimization algorithm. In: 2019 IEEE Congress on Evolutionary Computation. IEEE, New York (2019)
Passos, L.A., de Souza Jr, L.A., Mendel, R., Ebigbo, A., Probst, A., Messmann, H., Palm, C., Papa, J.P.: Barrett’s esophagus analysis using infinity restricted Boltzmann machines. J. Vis. Commun. Image Represent. 59, 475–485 (2019)
Pereira, C.R., Passos, L.A., Rodrigues, D., Nunes, S.A., Papa, J.P.: JADE-based feature selection for non-technical losses detection. In: VII ECCOMAS Thematic Conference on Computational Vision and Medical Image Processing: VipIMAGE 2019 (2019)
Rodrigues, D., Pereira, L.A.M., Almeida, T.N.S., Papa, J.P., Souza, A.N., Ramos, C.O., Yang, X.S.: BCS: a binary cuckoo search algorithm for feature selection. In: IEEE International Symposium on Circuits and Systems, pp. 465–468 (2013)
Rodrigues, D., de Rosa, G.H., Passos, L.A., Papa, J.P.: Adaptive improved flower pollination algorithm for global optimization. In: Nature-Inspired Computation in Data Mining and Machine Learning, pp. 1–21. Springer, Berlin (2020)
Rosa, G., Papa, J.P., Costa, K., Passos, L.A., Pereira, C., Yang, X.S.: Learning parameters in deep belief networks through firefly algorithm. In: IAPR Workshop on Artificial Neural Networks in Pattern Recognition, pp. 138–149. Springer, Berlin (2016)
Salakhutdinov, R., Hinton, G.: Deep Boltzmann machines. In: Artificial Intelligence and Statistics, pp. 448–455 (2009)
Smolensky, P.: Information processing in dynamical systems: foundations of harmony theory. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, pp. 194–281. MIT Press, Cambridge (1986)
Storn, R., Price, K.: Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 11(4), 341–359 (1997)
Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 847–855. ACM, New York (2013)
Tieleman, T.: Training restricted Boltzmann machines using approximations to the likelihood gradient. In: Proceedings of the 25th International Conference on Machine Learning (ICML ’08), pp. 1064–1071. ACM, New York (2008)
Wang, Y., Li, H.X., Huang, T., Li, L.: Differential evolution based on covariance matrix learning and bimodal distribution parameter setting. Appl. Soft Comput. 18, 232–247 (2014)
Whitley, D.: A genetic algorithm tutorial. Stat. Comput. 4(2), 65–85 (1994)
Wilcoxon, F.: Individual comparisons by ranking methods. Biom. Bull. 1(6), 80–83 (1945)
Yang, X.S.: Firefly algorithm, stochastic test functions and design optimisation. Int. J. Bio-Inspired Comput. 2(2), 78–84 (2010)
Yang, X.S., Deb, S.: Cuckoo search via Lévy flights. In: World Congress on Nature and Biologically Inspired Computing (NaBIC 2009), pp. 210–214. IEEE, New York (2009)
Yang, X.S., Deb, S.: Engineering optimisation by cuckoo search. Int. J. Math. Model. Numer. Optim. 1, 330–343 (2010)
Yang, X.S., Gandomi, A.H.: Bat algorithm: a novel approach for global engineering optimization. Eng. Comput. 29(5), 464–483 (2012)
Yang, X.S., Karamanoglu, M., He, X.: Flower pollination algorithm: a novel approach for multiobjective optimization. Eng. Optim. 46(9), 1222–1237 (2014)
Yosinski, J., Lipson, H.: Visually debugging restricted Boltzmann machine training with a 3D example. In: Representation Learning Workshop, 29th International Conference on Machine Learning (2012)
Zhang, J., Sanderson, A.C.: JADE: adaptive differential evolution with optional external archive. IEEE Trans. Evol. Comput. 13(5), 945–958 (2009)

Acknowledgements

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001. The authors also appreciate FAPESP grants #2013/07375-0, #2014/12236-1, #2016/19403-6, #2017/02286-0, #2017/25908-6, #2018/21934-5 and #2019/02205-5, and CNPq grants 307066/2017-7 and 427968/2018-6.

Author information

Authors and Affiliations

Department of Computing, São Paulo State University, Bauru, Brazil
Leandro Aparecido Passos, Gustavo Henrique de Rosa, Mateus Roder & João Paulo Papa
Department of Computing, São Carlos Federal University, São Carlos, Brazil
Douglas Rodrigues

Corresponding author

Correspondence to Leandro Aparecido Passos.

Editor information

Editors and Affiliations

Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Hitoshi Iba
School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, NSW, Australia
Nasimul Noman

Cite this chapter

Passos, L.A., Rosa, G.H.d., Rodrigues, D., Roder, M., Papa, J.P. (2020). On the Assessment of Nature-Inspired Meta-Heuristic Optimization Techniques to Fine-Tune Deep Belief Networks. In: Iba, H., Noman, N. (eds) Deep Neural Evolution. Natural Computing Series. Springer, Singapore. https://doi.org/10.1007/978-981-15-3685-4_3

DOI: https://doi.org/10.1007/978-981-15-3685-4_3
Published: 21 May 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3684-7
Online ISBN: 978-981-15-3685-4
eBook Packages: Computer Science, Computer Science (R0)