1 Introduction

In recent years, multimedia-based applications have fostered the generation of a massive amount of data. These data provide a wide range of opportunities for machine learning applications in several areas of knowledge, such as medicine, financial markets, intelligent manufacturing, and event classification. Among such machine learning approaches, deep learning methods have received significant attention due to their excellent results, which in some tasks surpass even human performance.

Deep learning models try to simulate how the human brain processes information. The basic idea is to use multiple layers to extract higher-level features progressively, where each layer learns to transform its input into a more abstract representation. In image processing applications, for instance, lower layers may identify edges, while higher layers may identify human-meaningful items such as faces and objects. The most employed methods include Convolutional Neural Networks (CNNs) [8], Deep Belief Networks (DBNs) [5], and Deep Boltzmann Machines (DBMs) [23], among others.

Since “deep” in deep learning refers to the architecture complexity, the more complex it becomes, the higher the number of hyper-parameters to fit. Yosinski and Lipson [36], for instance, highlighted some approaches for visualizing the behavior of a single Restricted Boltzmann Machine (RBM) [24], an energy-based model that can be used to build DBNs and DBMs, during its learning procedure, and provided an overview that helps in understanding such complexities. Such a problem is usually tackled using auto-learning tools, which combine parameter fine-tuning with feature selection techniques [26]. Nevertheless, it can also be posed as an optimization task in which one wants to choose suitable hyper-parameters.

Therefore, meta-heuristic algorithms have become a viable alternative to solve optimization problems due to their simple implementation. Kuremoto et al. [7], for instance, employed Particle Swarm Optimization (PSO) [6] for hyper-parameter fine-tuning of RBMs, while Liu et al. [10] and Levy et al. [9] applied Genetic Algorithms (GA) [29] for model selection and automatic painter classification using RBMs, respectively. Later, Rosa et al. [22] employed the Firefly Algorithm to fine-tune DBN hyper-parameters. Finally, Passos et al. [15, 16] proposed a similar approach comparing several meta-heuristic techniques to fine-tune hyper-parameters in DBMs, infinity Restricted Boltzmann Machines [13, 18], and RBM-based models in general [14].

Following this idea, this chapter presents a comparison among ten different swarm- and differential evolution-based meta-heuristic algorithms in the context of fine-tuning DBN hyper-parameters. We discuss the viability of such approaches on three public datasets and evaluate them statistically through the Wilcoxon signed-rank test. The remainder of this chapter is organized as follows. Section 3.2 introduces the theoretical background concerning RBMs and DBNs. Section 3.3 describes the meta-heuristic optimization algorithms. Sections 3.4 and 3.5 present the methodology and the experimental results, respectively. Finally, Sect. 3.6 states conclusions and future works.

2 Theoretical Background

In this section, we present a theoretical background concerning Restricted Boltzmann Machines and Deep Belief Networks.

2.1 Restricted Boltzmann Machines

Restricted Boltzmann Machines are well-known stochastic neural networks inspired by the laws of statistical mechanics and parameterized by concepts like energy and entropy. These networks are commonly employed for unsupervised learning and have at least two layers of neurons, i.e., one visible and one hidden.

The basic architecture of a Restricted Boltzmann Machine is composed of a visible layer \(\mathbf{v}=\{v_1, v_2, \ldots, v_m\}\) with m units and a hidden layer \(\mathbf{h}=\{h_1, h_2, \ldots, h_n\}\) with n units. Furthermore, a real-valued matrix \({\mathbf{W}}_{m\times n}\) is responsible for modeling the restricted connections, i.e., the weights, between the visible and hidden neurons, where \(w_{ij}\) represents the connection between the visible unit \(v_i\) and the hidden unit \(h_j\). Figure 3.1 describes the vanilla RBM architecture.

Fig. 3.1 Vanilla RBM architecture

Regarding the learning process, the visible layer represents the input data to be processed, while the hidden layer is employed to extract deep-seated patterns and information from these data. Besides, both visible and hidden units assume only binary values, i.e., \(\mathbf{v}\in\{0,1\}^m\) and \(\mathbf{h}\in\{0,1\}^n\), since the sampling process is derived from a Bernoulli distribution [4]. Finally, the training process is performed by minimizing the system’s energy considering the units of both the visible and hidden layers, as well as the biases associated with each layer. The energy can be computed as follows:

$$\displaystyle \begin{aligned} E(\mathbf{v},\mathbf{h})=-\sum_{i=1}^ma_iv_i-\sum_{j=1}^nb_jh_j-\sum_{i=1}^m\sum_{j=1}^nv_ih_jw_{ij}, \end{aligned} $$
(3.1)

where a and b represent the biases of visible and hidden units, respectively.
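As a concrete illustration, Eq. (3.1) can be evaluated directly for a small RBM. The following is a minimal Python sketch; the function name `rbm_energy` and the toy values chosen for W, a, and b are illustrative assumptions rather than part of the original experiments.

```python
def rbm_energy(v, h, W, a, b):
    """Energy E(v, h) of Eq. (3.1) for binary vectors v (length m) and h (length n).

    W is stored as an m x n list of lists, a and b are the visible and
    hidden biases.
    """
    m, n = len(v), len(h)
    visible_term = sum(a[i] * v[i] for i in range(m))
    hidden_term = sum(b[j] * h[j] for j in range(n))
    interaction = sum(v[i] * h[j] * W[i][j] for i in range(m) for j in range(n))
    return -visible_term - hidden_term - interaction


# Toy RBM with m = 2 visible and n = 2 hidden units (illustrative values only).
W = [[0.5, -1.0],
     [2.0, 0.0]]
a = [0.1, 0.2]
b = [-0.3, 0.4]
energy = rbm_energy([1, 0], [1, 1], W, a, b)  # approximately 0.3
```

Lower energies correspond to configurations the model considers more likely, which is why training seeks parameters that assign low energy to the training samples.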

Computing the joint probability of the system is intractable, since its normalization term requires a sum over every possible configuration of the visible and hidden units. However, one can estimate the probability of activating a single visible neuron i given the hidden units through Gibbs sampling over a Markov chain, as follows:

$$\displaystyle \begin{aligned} P(v_i=1|\mathbf{h})=\phi\left(\sum_{j=1}^nw_{ij}h_j+a_i\right), \end{aligned} $$
(3.2)

and, in a similar fashion, the probability of activating a single hidden neuron j given the visible units is stated as follows:

$$\displaystyle \begin{aligned} P(h_j=1|\mathbf{v})=\phi\left(\sum_{i=1}^mw_{ij}v_i+b_j\right), \end{aligned} $$
(3.3)

where ϕ(⋅) stands for the logistic-sigmoid function.
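Equations (3.2) and (3.3) translate almost literally into code. The sketch below, written in plain Python, computes both conditional probabilities and performs one Gibbs sampling step by drawing Bernoulli samples from them. The function names and the m × n list-of-lists layout of W are assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic-sigmoid function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def prob_hidden_given_visible(v, W, b):
    """Eq. (3.3): P(h_j = 1 | v) for every hidden unit j."""
    m, n = len(W), len(W[0])
    return [sigmoid(sum(W[i][j] * v[i] for i in range(m)) + b[j]) for j in range(n)]


def prob_visible_given_hidden(h, W, a):
    """Eq. (3.2): P(v_i = 1 | h) for every visible unit i."""
    m, n = len(W), len(W[0])
    return [sigmoid(sum(W[i][j] * h[j] for j in range(n)) + a[i]) for i in range(m)]


def sample_bernoulli(probs, rng):
    """Draw one binary state per unit from its activation probability."""
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_step(v, W, a, b, rng):
    """One Gibbs step: sample h from P(h | v), then v from P(v | h)."""
    h = sample_bernoulli(prob_hidden_given_visible(v, W, b), rng)
    v_new = sample_bernoulli(prob_visible_given_hidden(h, W, a), rng)
    return v_new, h
```

Because units within a layer are not connected to each other, every unit of a layer can be sampled independently given the other layer, which is what makes each half of the Gibbs step a simple list comprehension.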

The training process consists of maximizing the product of probabilities given a set of parameters θ = (W, a, b) and the data probability distribution over the training samples. Such a process can be easily computed using either the Contrastive Divergence (CD) [3] or the Persistent Contrastive Divergence (PCD) [27] algorithms.

2.2 Contrastive Divergence

Hinton [3] introduced a faster methodology to compute the energy of the system based on contrastive divergence. The idea is to initialize the visible units with a training sample, to compute the states of the hidden units using Eq. (3.3), and then to compute the states of the visible units (reconstruction step) using Eq. (3.2). In short, this is equivalent to performing Gibbs sampling with k = 1 and initializing the chain with the training samples.

This procedure leads to a simple learning rule for updating the weight matrix W and the biases a and b at iteration t:

$$\displaystyle \begin{aligned} {\mathbf{W}}^{t+1}={\mathbf{W}}^t+\underbrace{\eta(P(\mathbf{h}|\mathbf{v}){\mathbf{v}}^T-P(\tilde{\mathbf{h}}|\tilde{\mathbf{v}})\tilde{\mathbf{v}}^T)+\varPhi}_{=\varDelta{\mathbf{W}}^t}, \end{aligned} $$
(3.4)
$$\displaystyle \begin{aligned} {\mathbf{a}}^{t+1}={\mathbf{a}}^t+\underbrace{\eta(\mathbf{v}-\tilde{\mathbf{v}})+\varphi\varDelta {\mathbf{a}}^{t-1}}_{=\varDelta{\mathbf{a}}^t}, \end{aligned} $$
(3.5)
$$\displaystyle \begin{aligned} {\mathbf{b}}^{t+1}={\mathbf{b}}^t+\underbrace{\eta(P(\mathbf{h}|\mathbf{v})-P(\tilde{\mathbf{h}}|\tilde{\mathbf{v}}))+\varphi\varDelta {\mathbf{b}}^{t-1}}_{=\varDelta{\mathbf{b}}^t}, \end{aligned} $$
(3.6)

where η stands for the learning rate, φ denotes the momentum, \(\tilde {\mathbf {v}}\) stands for the reconstruction of the visible layer given h, and \(\tilde {\mathbf {h}}\) denotes an estimation of the hidden vector h given \(\tilde {\mathbf {v}}\). In a nutshell, Eqs. (3.4), (3.5), and (3.6) show the optimization algorithm, the well-known Gradient Descent. The additional term Φ in Eq. (3.4) is used to control the values of matrix W during the convergence process, and it is described as follows:

$$\displaystyle \begin{aligned} \varPhi = -\lambda{\mathbf{W}}^t+\varphi\varDelta{\mathbf{W}}^{t-1}, \end{aligned} $$
(3.7)

where λ stands for the weight decay.
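The update rule of Eqs. (3.4) to (3.7) can be sketched for a single training sample as follows. This is a minimal plain-Python illustration under a few assumptions: W is stored as an m × n list of lists (so the outer product is indexed as visible unit i times hidden unit j), the reconstruction is sampled as a binary vector, and the names `cd1_step` and `prev` are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic-sigmoid function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hidden_probs(v, W, b):
    """P(h_j = 1 | v), Eq. (3.3)."""
    return [sigmoid(sum(W[i][j] * v[i] for i in range(len(v))) + b[j])
            for j in range(len(b))]


def visible_probs(h, W, a):
    """P(v_i = 1 | h), Eq. (3.2)."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))) + a[i])
            for i in range(len(a))]


def cd1_step(v, W, a, b, prev, eta, lam, phi, rng):
    """Apply one CD-1 update for a single binary sample v.

    prev = (dW, da, db) holds the increments of the previous iteration,
    which feed the momentum terms of Eqs. (3.4)-(3.7).
    """
    dW_prev, da_prev, db_prev = prev
    m, n = len(a), len(b)
    ph = hidden_probs(v, W, b)                          # P(h | v)
    h = [1 if rng.random() < p else 0 for p in ph]      # sampled hidden states
    pv = visible_probs(h, W, a)
    v_rec = [1 if rng.random() < p else 0 for p in pv]  # reconstruction of v
    ph_rec = hidden_probs(v_rec, W, b)                  # P(h~ | v~)

    # Eq. (3.4) with the extra term of Eq. (3.7): weight decay plus momentum.
    dW = [[eta * (v[i] * ph[j] - v_rec[i] * ph_rec[j])
           - lam * W[i][j] + phi * dW_prev[i][j]
           for j in range(n)] for i in range(m)]
    da = [eta * (v[i] - v_rec[i]) + phi * da_prev[i] for i in range(m)]   # Eq. (3.5)
    db = [eta * (ph[j] - ph_rec[j]) + phi * db_prev[j] for j in range(n)]  # Eq. (3.6)

    W_new = [[W[i][j] + dW[i][j] for j in range(n)] for i in range(m)]
    a_new = [a[i] + da[i] for i in range(m)]
    b_new = [b[j] + db[j] for j in range(n)]
    return W_new, a_new, b_new, (dW, da, db)
```

The returned increments are meant to be passed back as `prev` on the next call, so that the momentum term reuses the previous step. In practice the update is usually averaged over a mini-batch rather than applied per sample.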

2.3 Persistent Contrastive Divergence

Most of the issues related to the Contrastive Divergence approach concern the number of iterations employed to approximate the model to the real data. Although the approach proposed by Hinton [3] takes k = 1 and works well for real-world problems, one can set different values for k [1].

Notwithstanding, Contrastive Divergence provides a good approximation to the likelihood gradient, i.e., it gives a reasonable estimation of the model to the data when \(k \rightarrow \infty\). However, its convergence might become poor when the Markov chain mixes slowly. Moreover, it converges well only in the early iterations and slows down as the iterations go by, thus demanding the use of parameter decay.

Therefore, Tieleman [27] proposed the Persistent Contrastive Divergence, an interesting alternative to contrastive divergence using higher values for k while keeping the computational burden relatively low. The idea is quite simple: in CD-1, each training sample is employed to start an RBM and rebuild a model after a single Gibbs sampling iteration. Once every training sample is presented to the RBM, we have a so-called epoch. The process is repeated for each next epoch, i.e., the same training samples are used to feed the RBM, and the Markov chain is restarted at each epoch. In PCD, by contrast, the chain is not restarted: the state reached at the end of one update is kept and used to initialize the Gibbs sampling of the next one, so that the chain keeps running throughout training.
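The practical difference between CD-1 and PCD lies in where the negative-phase chain starts. The sketch below assumes the usual formulation of PCD, in which the chain state is carried over from one parameter update to the next instead of being reset to a training sample. It is a simplified plain-Python illustration (no momentum or weight decay), and the names `pcd_step` and `chain_v` are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic-sigmoid function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hidden_probs(v, W, b):
    return [sigmoid(sum(W[i][j] * v[i] for i in range(len(v))) + b[j])
            for j in range(len(b))]


def visible_probs(h, W, a):
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))) + a[i])
            for i in range(len(a))]


def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def pcd_step(v, chain_v, W, a, b, eta, rng):
    """One PCD update: positive phase from the data, negative phase from the chain.

    chain_v is the visible state of the persistent Markov chain. The updated
    state is returned so that the caller can pass it to the next call.
    """
    m, n = len(a), len(b)
    ph_data = hidden_probs(v, W, b)                      # positive phase
    h_chain = sample(hidden_probs(chain_v, W, b), rng)
    chain_v = sample(visible_probs(h_chain, W, a), rng)  # advance the chain one Gibbs step
    ph_chain = hidden_probs(chain_v, W, b)               # negative phase
    W = [[W[i][j] + eta * (v[i] * ph_data[j] - chain_v[i] * ph_chain[j])
          for j in range(n)] for i in range(m)]
    a = [a[i] + eta * (v[i] - chain_v[i]) for i in range(m)]
    b = [b[j] + eta * (ph_data[j] - ph_chain[j]) for j in range(n)]
    return W, a, b, chain_v


# Toy usage: the chain is initialized once and never restarted between epochs.
rng = random.Random(0)
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
W = [[0.0] * 2 for _ in range(4)]
a, b = [0.0] * 4, [0.0] * 2
chain = list(data[0])
for epoch in range(5):
    for v in data:
        W, a, b, chain = pcd_step(v, chain, W, a, b, eta=0.1, rng=rng)
```

A CD-1 version of the same loop would replace `chain` with the current sample `v` at every call, which is exactly the restart that PCD avoids.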

2.4 Deep Belief Networks

Deep Belief Networks [5] are graphical models composed of a visible and L hidden layers, where each layer is connected to the next one through a weight matrix \({\mathbf{W}}^{l}\), l ∈{1, 2, …, L}, and there is no connection between units from the same layer. In a nutshell, one can consider each set of two subsequent layers as an RBM trained in a greedy fashion such that the trained hidden layer of the bottommost RBM feeds the next RBM’s visible layer, and so on. Figure 3.2 depicts the model. Notice that \(\mathbf{v}\) and \({\mathbf{h}}^{l}\) stand for the visible and the l-th hidden layers, respectively.

Fig. 3.2 DBN architecture with two hidden layers

Although this work focuses on image reconstruction, one can use DBNs for supervised classification tasks. Such an approach requires, after the greedy feedforward pass mentioned above, fine-tuning the network weights using either backpropagation or gradient descent. Afterward, a softmax layer is added at the top of the model to attribute the predicted labels.
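The greedy layer-wise scheme described in this section can be sketched as a loop that trains one RBM at a time and feeds its hidden layer to the next one. The code below is a compact plain-Python illustration under stated assumptions: each RBM is trained with plain CD-1 (no momentum or weight decay), sampled binary hidden states are propagated upward, and the names `train_rbm` and `train_dbn` are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic-sigmoid function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hidden_probs(v, W, b):
    return [sigmoid(sum(W[i][j] * v[i] for i in range(len(v))) + b[j])
            for j in range(len(b))]


def visible_probs(h, W, a):
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))) + a[i])
            for i in range(len(a))]


def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def train_rbm(data, n_hidden, eta, epochs, rng):
    """Train one RBM with plain CD-1 (simplified for brevity)."""
    m = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(m)]
    a, b = [0.0] * m, [0.0] * n_hidden
    for _ in range(epochs):
        for v in data:
            ph = hidden_probs(v, W, b)
            v_rec = sample(visible_probs(sample(ph, rng), W, a), rng)
            ph_rec = hidden_probs(v_rec, W, b)
            for i in range(m):
                for j in range(n_hidden):
                    W[i][j] += eta * (v[i] * ph[j] - v_rec[i] * ph_rec[j])
                a[i] += eta * (v[i] - v_rec[i])
            for j in range(n_hidden):
                b[j] += eta * (ph[j] - ph_rec[j])
    return W, a, b


def train_dbn(data, hidden_sizes, eta=0.1, epochs=10, seed=0):
    """Greedy layer-wise training: each RBM is fed by the layer below it."""
    rng = random.Random(seed)
    layers, layer_input = [], data
    for n_hidden in hidden_sizes:
        W, a, b = train_rbm(layer_input, n_hidden, eta, epochs, rng)
        layers.append((W, a, b))
        # Hidden states of this RBM become the visible data of the next one.
        layer_input = [sample(hidden_probs(v, W, b), rng) for v in layer_input]
    return layers


# Toy usage: a DBN with two hidden layers of 3 and 2 units.
data = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
layers = train_dbn(data, [3, 2], epochs=5)
```

Propagating activation probabilities instead of sampled states is a common alternative; the choice made here simply keeps every layer binary, in line with the Bernoulli units assumed throughout.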

3 Meta-heuristic Optimization Algorithms

This section presents a brief description of the meta-heuristic optimization techniques employed in this work.

  • Improved Harmony Search (IHS) [11]: an improved version of the Harmony Search optimization algorithm that employs dynamic values for both the Pitch Adjusting Rate (PAR), considering values in the range [PAR\(_{\min },\)PAR\(_{\max }]\), and the Harmony Memory Considering Rate (HMCR), which assumes values in the range [HMCR\(_{\min },\)HMCR\(_{\max }]\). Additionally, the algorithm uses the bandwidth variable ϱ in the range \([\varrho _{\min },\varrho _{\max }]\) to calculate PAR.

  • Particle Swarm Optimization with Adaptive Inertia Weight (AIWPSO) [12]: an improved version of the Particle Swarm Optimization that employs a self-adjusting inertia weight w for each particle as it moves through the search space, aiming to balance global exploration and local exploitation. Notice the method uses the variables \(c_1\) and \(c_2\) to control the particles’ acceleration.

  • Flower Pollination Algorithm (FPA) [21, 35]: a meta-heuristic optimization algorithm that tries to mimic the pollination process performed by flowers. The algorithm employs four basic rules: (1) the cross-pollination, which stands for the pollination performed by birds and insects, (2) the self-pollination, representing the pollination performed by the wind diffusion or similar approaches, (3) the constancy of birds/insects, representing the probability of reproduction, and (4) the interaction of local and global pollination, controlled by the probability parameter p. Additionally, the algorithm employs a parameter β to control the amplitude of the distribution.

  • Bat Algorithm (BA) [34]: based on the bats’ echolocation system while searching for food and prey. The algorithm employs a swarm of virtual bats randomly flying in the search space at different velocities, also following a random walk approach for local search intensification. Additionally, it applies a dynamically updated frequency in the range \([f_{\min },f_{\max }]\) according to the distance from the objective, as well as the loudness A and the pulse rate r.

  • Firefly Algorithm (FA) [31]: the algorithm is based on the fireflies’ approach for attracting potential prey and mating partners. It employs the attractiveness parameter β, which influences the brightness of each agent, depending on its position and the light absorption coefficient γ. Moreover, the model employs a random perturbation α used to perform a random walk and avoid local optima.

  • Cuckoo Search (CS) [20, 32, 33]: the model combines the parasitic behavior of some cuckoo species with a τ-step random walk over a Markov chain. It employs three basic concepts: (1) each cuckoo lays a single egg per iteration at another bird’s randomly chosen nest, (2) \(p_a \in [0, 1]\) defines the probability of this bird discovering and discarding the cuckoo’s egg or abandoning the nest and creating a new one, i.e., a new solution, and (3) the nests with the best eggs will carry over to the next generations.

  • Differential Evolution (DE) [25]: an evolutionary algorithm that maintains a population of candidate solutions, which are combined and improved over the following generations, aiming to find the characteristics that best fit the problem. The algorithm employs a mutation factor to control the mutation amplitude, as well as a parameter to control the crossover probability.

  • Backtracking Search Optimization Algorithm (BSA) [2, 17]: an evolution algorithm that employs a random selection of a historical population for mutation and crossover operations to generate a new population of individuals based on past experiences. The algorithm controls the number of elements to be mutated using a mixing rate (mix_rate) parameter, as well as the amplitude of the search-direction matrix with the parameter F.

  • Differential Evolution Based on Covariance Matrix Learning and Bimodal Distribution Parameter Setting Algorithm (CoBiDE) [28]: a differential evolution model that represents the search space coordinate system using a covariance matrix according to the probability parameter \(P_b\), and the proportion of individuals employed in the process using the \(P_s\) variable. Moreover, it employs a bimodal distribution to control the mutation and crossover rates, aiming at a better trade-off between exploitation and exploration.

  • Adaptive Differential Evolution with Optional External Archive (JADE) [19, 37]: JADE is a differential evolution-based algorithm that employs the “DE/current-to-p-best” strategy, i.e., only the p-best agents are used in the mutation process. Further, the algorithm employs both a historical population and a control parameter, which is adaptively updated. Finally, it requires a proper selection of the rate of adaptation parameter c, as well as the mutation greediness parameter g.
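To make the role of the mutation factor and the crossover probability more tangible, the following plain-Python sketch implements the classic DE/rand/1/bin scheme, one common variant of Differential Evolution. It is not the implementation used in the experiments: the function name, the default values of `mutation_factor` and `crossover_prob`, and the sphere objective are assumptions chosen for illustration, while the five agents and 50 iterations mirror the budget reported in this chapter.

```python
import random


def differential_evolution(objective, bounds, n_agents=5, n_iterations=50,
                           mutation_factor=0.5, crossover_prob=0.9, seed=0):
    """Minimize `objective` inside box `bounds` with DE/rand/1/bin (needs n_agents >= 4)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_agents)]
    fitness = [objective(x) for x in pop]
    for _ in range(n_iterations):
        new_pop, new_fitness = [], []
        for i in range(n_agents):
            # Pick three distinct agents, all different from the current one.
            others = [k for k in range(n_agents) if k != i]
            r1, r2, r3 = rng.sample(others, 3)
            forced = rng.randrange(dim)  # at least one coordinate comes from the mutant
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < crossover_prob:
                    value = pop[r1][d] + mutation_factor * (pop[r2][d] - pop[r3][d])
                else:
                    value = pop[i][d]
                trial.append(min(max(value, lo), hi))  # keep the trial inside the bounds
            trial_fitness = objective(trial)
            # Greedy selection: the trial replaces the agent only if it is not worse.
            if trial_fitness <= fitness[i]:
                new_pop.append(trial)
                new_fitness.append(trial_fitness)
            else:
                new_pop.append(pop[i])
                new_fitness.append(fitness[i])
        pop, fitness = new_pop, new_fitness
    best = min(range(n_agents), key=lambda k: fitness[k])
    return pop[best], fitness[best]


def sphere(x):
    """Toy objective standing in for the DBN reconstruction error."""
    return sum(t * t for t in x)


bounds = [(-1.0, 1.0)] * 4
best, best_f = differential_evolution(sphere, bounds, seed=0)
```

In the setting of this chapter, the objective would be the reconstruction error of a DBN trained with the hyper-parameters encoded in each agent, which makes every objective evaluation far more expensive than in this toy example.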

4 Methodology

This section introduces the intended procedure for DBN hyper-parameter fine-tuning. Additionally, it describes the employed datasets and the experimental setup.

4.1 Modeling DBN Hyper-parameter Fine-tuning

The learning procedure of each RBM employs four hyper-parameters, as specified in Sect. 3.2.1: the learning rate η, weight decay λ, momentum φ, and the number of hidden units n. Since DBNs are built over RBM blocks, they employ a similar process to fine-tune each of their layers individually. In short, a four-dimensional search space composed of three real-valued variables and one integer-valued variable should be explored for each layer. Notice the meta-heuristic techniques handle every variable as a real number, so the number of hidden units requires a type cast to the nearest integer. Such an approach aims at selecting the set of DBN hyper-parameters that minimizes the reconstruction error over the training images, measured by the mean squared error (MSE). Subsequently, the selected set of hyper-parameters is applied to reconstruct the unseen images of the test set. Figure 3.3 depicts the procedure.

Fig. 3.3 DBN hyper-parameter optimization approach
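The mapping between an agent's position and a set of DBN hyper-parameters can be sketched as follows. The code is a plain-Python illustration, not the authors' implementation: the names `decode_candidate` and `reconstruction_mse`, the dictionary keys, and the ordering of the four variables inside each layer block are assumptions, while the search ranges follow the ones reported in the experimental setup of this chapter.

```python
# Search ranges per RBM layer, in the assumed order:
# (hidden units n, learning rate eta, weight decay lambda, momentum phi).
LAYER_BOUNDS = [(5, 100), (0.1, 0.9), (0.1, 0.9), (1e-5, 1e-1)]


def decode_candidate(position, bounds=LAYER_BOUNDS):
    """Split a flat position of 4 values per layer into per-layer hyper-parameters."""
    if len(position) % len(bounds) != 0:
        raise ValueError("position length must be a multiple of 4")
    layers = []
    for start in range(0, len(position), len(bounds)):
        chunk = position[start:start + len(bounds)]
        # Clip every variable to its range before using it.
        n, eta, lam, phi = [min(max(x, lo), hi) for x, (lo, hi) in zip(chunk, bounds)]
        layers.append({
            "hidden_units": int(round(n)),  # real value cast to the nearest integer
            "learning_rate": eta,
            "weight_decay": lam,
            "momentum": phi,
        })
    return layers


def reconstruction_mse(images, reconstructions):
    """Mean squared error over every pixel of every image."""
    squared = [(x - y) ** 2
               for img, rec in zip(images, reconstructions)
               for x, y in zip(img, rec)]
    return sum(squared) / len(squared)


# An 8-dimensional position encodes a DBN with two layers.
layers = decode_candidate([42.7, 0.5, 0.3, 0.01, 120.0, 0.05, 0.95, 0.5])
error = reconstruction_mse([[1, 0], [0, 1]], [[1, 1], [0, 0]])  # 0.5
```

A fitness function for any of the meta-heuristics would decode the position, train a DBN with the resulting hyper-parameters, and return the reconstruction MSE over the training images.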

4.2 Datasets

We employed three datasets, as described below:

  • MNIST dataset: a dataset composed of images of handwritten digits from “0” to “9”. Regarding the pre-processing, the images were converted from gray-scale to binary and resized to 14 × 14. Additionally, the training was performed over 2% of the training set, i.e., 1200 images, due to the computational burden. Moreover, the complete set of 10,000 test images was employed for testing.

  • Semeion Handwritten Digit Dataset: similar to MNIST, Semeion is also a dataset of “0”–“9” handwritten digit images, comprising 1593 images. In this paper, we resized the samples to 16 × 16 and binarized each pixel.

  • CalTech 101 Silhouettes Dataset: a dataset composed of 101 classes of silhouettes with a resolution of 28 × 28. No pre-processing step was applied to the image samples.

Figure 3.4 displays some training examples from the above datasets.

Fig. 3.4 Some training examples from (a) MNIST, (b) Semeion, and (c) CalTech 101 Silhouettes datasets

4.3 Experimental Setup

Experiments were conducted over 20 runs and a 2-fold cross-validation for statistical analysis using the Wilcoxon signed-rank test [30] at a 5% significance level. Each meta-heuristic technique employed five agents (particles) over 50 iterations for convergence purposes over the three configurations, i.e., DBNs with 1, 2, and 3 layers. Additionally, the paper compares techniques inspired by the music composition process, swarm behavior, and evolution in the context of DBN hyper-parameter fine-tuning, as presented in Sect. 3.3.

Table 3.1 exhibits the parameter configuration for every meta-heuristic technique.

Table 3.1 Meta-heuristic algorithms’ parameter configuration

Finally, each DBN layer is composed of an RBM whose hyper-parameters are randomly initialized according to the following ranges: n ∈ [5, 100], η ∈ [0.1, 0.9], λ ∈ [0.1, 0.9], and \(\varphi \in [10^{-5}, 10^{-1}]\). Additionally, the experiments were conducted over three different depth configurations, i.e., DBNs composed of 1, 2, and 3 RBM layers, which implies fine-tuning a 4-, 8-, and 12-dimensional set of hyper-parameters, respectively. We also employed T = 10 as the number of epochs for the DBN weight-learning procedure, with mini-batches of size 20. In order to present a more in-depth experimental validation, all DBNs were trained with the Contrastive Divergence (CD) [3] and Persistent Contrastive Divergence (PCD) [27]. Figure 3.5 depicts the pipeline proposed in this paper.

Fig. 3.5 Proposed pipeline to the task of DBN hyper-parameter fine-tuning
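The statistical comparison mentioned in this section relies on the Wilcoxon signed-rank test applied to paired results of two techniques. The sketch below is a plain-Python illustration of the two-sided test using the normal approximation with tie correction and without continuity correction. The function name is hypothetical and the MSE values are synthetic, not results from this chapter. In practice, a library routine such as `scipy.stats.wilcoxon` could be used instead.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W, p_value), where W = min(W+, W-). Zero differences are discarded.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_correction = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_correction += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4.0
    var = n * (n + 1) * (2 * n + 1) / 24.0 - tie_correction / 48.0
    z = (w - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return w, min(1.0, p_value)


# Synthetic paired MSE values for two techniques over 20 runs (illustrative only).
mse_a = [0.100 + 0.001 * k for k in range(20)]
mse_b = [value + 0.002 * (k + 1) for k, value in enumerate(mse_a)]
w, p = wilcoxon_signed_rank(mse_a, mse_b)
similar = p > 0.05  # techniques are deemed statistically similar above the 5% level
```

With small sample sizes, an exact version of the test is preferable to the normal approximation used here.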

5 Experimental Results

This section introduces the results obtained during the experiments and provides a detailed discussion about them. Tables 3.2, 3.3, and 3.4 present the average MSE and its standard deviation regarding the MNIST, Semeion Handwritten Digit, and CalTech 101 Silhouettes datasets, respectively. The best results according to the Wilcoxon signed-rank test with a 5% significance level are presented in bold.

Table 3.2 Average MSE values considering MNIST dataset
Table 3.3 Average MSE values considering Semeion Handwritten Digit dataset
Table 3.4 Average MSE values considering CalTech 101 Silhouettes dataset

Table 3.2 presents the results concerning the MNIST dataset. IHS obtained the lowest errors using the Contrastive Divergence algorithm over one single layer. BA and AIWPSO obtained statistically similar results using the PCD algorithm over two and three layers, respectively. One can notice that FPA using CD over a single layer obtained the same average error as IHS, although the Wilcoxon signed-rank test does not consider them statistically similar. Moreover, the evolutionary algorithms also obtained good results, though they were not statistically similar to the best ones either.

Regarding the Semeion Handwritten Digit dataset, Table 3.3 shows that the best results were obtained using the CoBiDE technique with the CD algorithm and one layer. It is worth pointing out that none of the other methods achieved statistically similar results, which supports the robustness of evolutionary-based meta-heuristic optimization algorithms.

Similar to the MNIST dataset, the best results over the CalTech 101 Silhouettes dataset were obtained using the IHS method with the CD algorithm over a single-layered DBN, as presented in Table 3.4. IHS was also the sole technique to achieve the lowest errors since none of the other methods obtained statistically similar results.

5.1 Training Evaluation

Figure 3.6 depicts the learning steps considering the MNIST dataset. Except for the BA algorithm (and the random search), all techniques converged to the same point since the initial iterations. Notice FA outperformed such results, achieving the lowest error at iteration 20. However, its training error then regresses to the initial values, which suggests the problem presents a local optimum that is hard to escape, given the set of optimized parameters.

Fig. 3.6 Training convergence (a) MSE and (b) log pseudo-likelihood using the CD algorithm and a single layer of hidden units over the MNIST dataset

An interesting behavior is depicted in Fig. 3.7. One can observe AIWPSO converges faster than the other techniques obtaining an average MSE of 0.2 after ten iterations. However, AIWPSO gets stuck at this time step and is outperformed by both JADE and DE after approximately 15 iterations. Moreover, DE still improves its performance until reaching its optimum at nearly 40 iterations. The behavior is not observed over the testing set, where although DE obtained good results, CoBiDE was the most accurate technique.

Fig. 3.7 Training convergence (a) MSE and (b) log pseudo-likelihood using the CD algorithm and a single layer of hidden units over the Semeion Handwritten Digit dataset

Regarding the CalTech 101 Silhouettes dataset, the learning curve depicted in Fig. 3.8 shows that AIWPSO behaved similarly to what was observed over the Semeion dataset, i.e., it converged faster in the 15 initial iterations and was outperformed by JADE afterward. Notice that IHS and FPA also demonstrated a good convergence, which is expected since IHS obtained the best results over the testing set and FPA achieved very close results. Additionally, CoBiDE and BSA are also among the best techniques together with JADE and DE, confirming the robustness of evolutionary techniques to the task of DBN hyper-parameter fine-tuning.

Fig. 3.8 Training convergence (a) MSE and (b) log pseudo-likelihood using the CD algorithm and a single layer of hidden units over the CalTech 101 Silhouettes dataset

5.2 Time Analysis

Tables 3.5, 3.6, and 3.7 present the computational burden, in hours, regarding the MNIST, Semeion Handwritten Digit, and CalTech 101 Silhouettes datasets, respectively. One can observe that CS is the fastest technique, followed by IHS. Such a result is expected since IHS evaluates a single solution per iteration, and CS employs a probability of evaluating or not each solution. On the other hand, the remaining techniques evaluate every solution at each iteration, contributing to a higher computational burden.

Table 3.5 Average computational burden (in hours) considering MNIST dataset
Table 3.6 Average computational burden (in hours) considering Semeion Handwritten Digit dataset
Table 3.7 Average computational burden (in hours) considering CalTech 101 Silhouettes dataset

Additionally, evolutionary algorithms, in general, present a higher computational burden than swarm-based approaches. AIWPSO is an exception, being the most costly technique among all, due to its updating mechanism.

In most cases, the best results were obtained using a single layer as well as the CD algorithm. Such behavior is probably related to the limited number of epochs employed for training, i.e., more complex models composed of a larger number of layers would require more epochs for convergence than the 10 epochs employed in this work. However, running the experiments under such conditions is not feasible in this context due to the massive number of executions performed for the comparisons presented in the chapter. The same is valid for the PCD algorithm.

5.3 Hyper-Parameters Analysis

This section provides a complete list of the average values of hyper-parameters obtained during the execution of every possible experimental configuration. Notice that values in bold stand for the configuration that obtained the best results according to the Wilcoxon signed-rank test.

Table 3.8 presents the average hyper-parameter values considering the MNIST dataset. The similarity between IHS and FPA considering a single layer over the Contrastive Divergence algorithm is evident, which is expected since both obtained similar results. However, the best configurations obtained with a higher number of layers, i.e., BA with 2 layers and AIWPSO with 3 layers over the PCD algorithm, represent a harder task, since the number of hyper-parameters is also higher, and each one exerts a degree of influence over the others.

Table 3.8 Average hyper-parameter values considering MNIST dataset

Regarding the Semeion Handwritten Digit dataset, presented in Table 3.9, one can once again identify some relation between the set of hyper-parameters and the final results. Although IHS did not obtain the best results over the CD algorithm with a single layer, its results are close to the best ones, obtained using the CoBiDE algorithm. The resemblance is reflected in their hyper-parameter sets. Another example of this resemblance is observed in the 1-layered FPA and BSA over the CD algorithm: a close set of hyper-parameters leads to close results in the experiments.

Table 3.9 Average hyper-parameter values considering Semeion Handwritten Digit dataset

An analogous behavior is observed in Table 3.10 regarding the CalTech 101 Silhouettes dataset. Although FPA, BSA, and CoBiDE did not obtain statistically similar results to IHS according to the Wilcoxon signed-rank test, their results are very much alike, which is perceptible in their selected sets of hyper-parameters. Regarding more complex models, i.e., with two and three layers, one can still observe some likeness. Notice, for instance, the similarity between AIWPSO trained with both CD and PCD, and BSA trained with CD over three layers. However, since such models require a larger number of hyper-parameters to be fine-tuned, the number of possible combinations grows exponentially, thus providing more diverse combination sets.

Table 3.10 Average hyper-parameter values considering CalTech 101 Silhouettes dataset

3  CD   1  Hidden units   43       51       51       34       45       48       31       39       59       37       50
           Learning rate  0.52239  0.51207  0.52097  0.48690  0.63751  0.52362  0.51773  0.42346  0.38359  0.47503  0.56581
           Weight decay   0.71342  0.72279  0.61374  0.62583  0.68437  0.67406  0.69570  0.68193  0.67821  0.71961  0.73120
           Momentum       0.00431  0.00514  0.00497  0.00553  0.00428  0.00600  0.00446  0.00391  0.00512  0.00546  0.00577
        2  Hidden units   49       54       52       41       43       47       49       49       61       49       44
           Learning rate  0.52924  0.51316  0.53610  0.42102  0.49939  0.52684  0.50960  0.57533  0.57925  0.52413  0.54938
           Weight decay   0.61271  0.48068  0.44030  0.44930  0.57062  0.53085  0.40449  0.46727  0.56396  0.40061  0.44172
           Momentum       0.00442  0.00495  0.00597  0.00473  0.00428  0.00505  0.00604  0.00419  0.00422  0.00506  0.00605
        3  Hidden units   55       52       55       63       46       67       50       59       57       51       55
           Learning rate  0.51658  0.48736  0.45940  0.53802  0.55106  0.53752  0.55456  0.51279  0.55418  0.55914  0.50365
           Weight decay   0.39922  0.52719  0.58714  0.45855  0.58377  0.58716  0.54719  0.51086  0.42597  0.54949  0.49739
           Momentum       0.00594  0.00399  0.00633  0.00518  0.00428  0.00457  0.00588  0.00396  0.00575  0.00387  0.00523
   PCD  1  Hidden units   56       53       49       51       49       58       51       37       59       48       49
           Learning rate  0.46399  0.40623  0.37432  0.50307  0.48364  0.59600  0.53473  0.44439  0.38359  0.43948  0.49811
           Weight decay   0.69432  0.63476  0.59465  0.51970  0.59724  0.63553  0.61455  0.60310  0.67821  0.63087  0.65155
           Momentum       0.00341  0.00550  0.00477  0.00531  0.00451  0.00472  0.00516  0.00504  0.00512  0.00408  0.00444
        2  Hidden units   56       51       51       52       59       40       52       49       61       40       51
           Learning rate  0.51054  0.42227  0.52618  0.55580  0.50243  0.42248  0.57549  0.55893  0.57925  0.56304  0.52403
           Weight decay   0.55409  0.42987  0.47076  0.46793  0.42596  0.43440  0.47748  0.49206  0.56396  0.54592  0.51342
           Momentum       0.00586  0.00452  0.00493  0.00378  0.00451  0.00623  0.00508  0.00499  0.00422  0.00492  0.00446
        3  Hidden units   59       49       52       42       42       55       50       49       57       44       50
           Learning rate  0.49215  0.53762  0.52214  0.59451  0.45669  0.49241  0.58855  0.51762  0.55418  0.48230  0.58397
           Weight decay   0.53427  0.55750  0.46411  0.45421  0.51050  0.51246  0.47173  0.51046  0.42597  0.52163  0.44847
           Momentum       0.00522  0.00467  0.00540  0.00457  0.00451  0.00454  0.00527  0.00536  0.00575  0.00434  0.00619

  1. Bold values denote the lowest average MSE or values whose Wilcoxon’s p-value is above 0.05, i.e., values that are statistically similar

6 Conclusions and Future Works

This chapter dealt with the problem of Deep Belief Network hyper-parameter fine-tuning through meta-heuristic approaches. Experiments were conducted using three architectures, i.e., one (naïve RBM), two, and three layers, which were trained using both the Contrastive Divergence and the Persistent Contrastive Divergence algorithms. Further, the performance of ten techniques, as well as a random search, was compared over three public binary image datasets. Results demonstrated that Improved Harmony Search obtained the best results in two out of three datasets, while CoBiDE obtained the best values regarding the Semeion Handwritten Digit dataset, denoting the efficiency of differential evolution-based techniques. Concerning the training steps, in general, AIWPSO converges faster than the other methods on the initial iterations. However, it is outperformed by evolutionary techniques after approximately 15 iterations. Finally, one can also verify that CS is the fastest technique, followed by IHS. On the other hand, AIWPSO is the slowest one.

Regarding future works, we intend to compare meta-heuristic approaches for fine-tuning DBNs on classification tasks.