
1 Introduction

Machine learning (ML) [1] is a subfield of artificial intelligence (AI). It has expanded at an incredible rate, drawing a large number of academics interested in studying how a system may learn to do a task. An ML system does not follow explicit instructions but instead learns from experience: it makes predictions or decisions based on data and continuously improves its performance as new data are reviewed. ML research has achieved outstanding results on several complex cognitive tasks, including computer vision [2,3,4,5], medical diagnosis [6,7,8,9], signal processing [10, 11], and recommendation systems [12]. Deep learning (DL) [13, 14] architectures have proved their capacity to deal with progressively more voluminous data during the previous two decades. Furthermore, DL has gradually become the most extensively employed computational strategy in the field of machine learning, generating exceptional results on a variety of cognitive tasks, equaling or even surpassing human performance in some cases. The capacity to learn from huge volumes of data is both one of the benefits and one of the challenges of deep learning.

In a similar vein, convolutional neural networks (CNNs) [2, 15, 16] are one of the state-of-the-art deep learning techniques. CNNs are designed to automatically and adaptively learn spatial hierarchies of features through backpropagation [17, 18] by using multiple building blocks, such as convolution layers, pooling layers, and fully connected layers. However, training a CNN is a challenging task, especially for deep architectures involving a large number of parameters (model weights) to be estimated. Sophisticated optimization algorithms therefore need to be used. Optimization is indeed the key step in fitting a given architecture to the training data so as to minimize the error between ground truth and estimates.

Many optimization techniques have been presented in recent years [19]. The convexity and differentiability of the target loss function have a significant impact on the performance of the deployed algorithms. Hence, choosing an optimization strategy that seeks the global optimum in the learning stage is generally challenging, especially when the number of parameters is large. An inappropriate optimization technique may, for instance, leave the network stuck in a local minimum during the training phase. Speeding up the optimization process is also a challenging issue for large databases.

In this context, Bayesian approaches have made significant progress in a number of areas over the years, and they offer several practical benefits. The core concept is to use probabilities to represent all uncertainties throughout the model. One of the most significant benefits is the ability to incorporate prior information. Moreover, recent developments in Markov Chain Monte Carlo (MCMC) methods [20,21,22,23,24] facilitate the implementation of Bayesian analyses of complex data sets that contain missing observations and involve multidimensional outcomes. The main goal of this paper is to present a Bayesian model for minimizing the target cost function of a learning model through hyperparameter adjustment.

Specifically, we propose a Bayesian optimization method to minimize the target cost function and derive the optimal weights vector. We demonstrate that the proposed method leads to high accuracy results, which cannot be reached using competing optimizers.

The rest of this paper is organized as follows. The addressed problem is formulated in Sect. 2. The proposed efficient Bayesian optimization scheme is developed in Sect. 3 and validated in Sect. 4. Finally, conclusions and future work are drawn in Sect. 5.

2 Problem Formulation

It is well known that weights optimization is one of the key steps to design an efficient artificial neural network. For instance, if we consider a classification problem, the ANN weight vector W is updated during the learning phase by minimizing an error between the ground truth and the labels estimated using the network. An iterative procedure is generally performed, and gradient-based optimization procedures are used. For the sake of efficiency, regularization can also be performed in order to have a more accurate weights configuration. In this sense, smooth regularizers such as the \(\ell _2\) norm are used. In this case, gradient-based algorithms could still be used. However, if one aims at promoting sparse networks, sparse regularizations such as the \(\ell _1\) norm should be used, which makes the use of gradient-based algorithms inefficient since the error to be minimized in this case is no longer differentiable.

In this paper, we propose a method to allow weights optimization under non-smooth regularizations. Let us denote by x an input to be presented to the ANN. The estimated label will be denoted by \(\widehat{y}(x,W)\) as a non-linear function of the input x and the weights vector \(W \in \mathbb {R}^N\), while the ground truth label will be denoted by y.

Using a quadratic error with an \(\ell _1\) regularization with M input data for the learning step, the weights vector can be estimated as:

$$\begin{aligned} \begin{aligned} \widehat{W}&= \arg \min _{W} \mathcal {L}(W) \\&= \arg \min _{W} \sum _{m=1}^{M} \Vert \widehat{y}(x^m;W) - y^{(m)} \Vert ^2_2 + \lambda \Vert W \Vert _1 \end{aligned} \end{aligned}$$
(1)

where \(\lambda \) is a regularization parameter balancing the solution between the data fidelity and regularization terms, and M is the number of learning data.
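As a minimal sketch of the criterion in (1), the snippet below evaluates the quadratic data-fidelity term plus the \(\ell _1\) penalty. The names `predict` and `l1_loss`, the toy data and the linear predictor are illustrative assumptions; in the paper \(\widehat{y}\) is the output of a neural network.

```python
def predict(x, w):
    """Toy stand-in for the network output y_hat(x; W): a linear model (assumption)."""
    return sum(xi * wi for xi, wi in zip(x, w))


def l1_loss(w, inputs, labels, lam):
    """Quadratic error over the M training pairs plus lam * ||W||_1, as in (1)."""
    data_term = sum((predict(x, w) - y) ** 2 for x, y in zip(inputs, labels))
    penalty = lam * sum(abs(wi) for wi in w)
    return data_term + penalty


# Hypothetical toy data with M = 3 samples and N = 2 weights.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [1.0, 0.0, 1.0]

sparse_loss = l1_loss([1.0, 0.0], inputs, labels, lam=0.001)  # fits the data exactly
dense_loss = l1_loss([0.5, 0.5], inputs, labels, lam=0.001)   # pays a larger data term
print(sparse_loss, dense_loss)
```

The absolute value in the penalty has no derivative at zero, which is the point where sparse solutions live; this is what rules out a plain gradient step on the full criterion.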

Since the criterion in (1) is not differentiable, gradient-based algorithms with back-propagation cannot be applied directly. In this case, the learning process is costly and very complicated.

In Sect. 3 we present a method to efficiently estimate the weights vector without increasing the learning complexity. The optimization problem in (1) is formulated and solved in a Bayesian framework.

3 Bayesian Optimization

As stated above, the weights optimization problem is formulated in a Bayesian framework. In this sense, the problem parameters and hyperparameters are assumed to follow probability distributions. More specifically, a likelihood distribution is defined to model the link between the target weights vector and the data, while a prior distribution is defined to model the prior knowledge about the target weights.

3.1 Hierarchical Bayesian Model

According to the principle of minimizing the error between the reference label y and the estimated one \(\widehat{y}\), and assuming a quadratic error (first term in (1)), we define the likelihood distribution as

$$\begin{aligned} f\left( y;W,\sigma \right) \propto \prod _{m=1}^{M} \exp \left( -\dfrac{1}{2\sigma ^2} \Vert \widehat{y}(x^m;W) - y^{(m)} \Vert ^2 \right) , \end{aligned}$$
(2)

where \(\sigma ^2\) is a positive parameter to be set.

As regards the prior knowledge on the weights vector W, we propose the use of a Laplace distribution in order to promote the sparsity of the neural network:

$$\begin{aligned} f(W;\lambda ) \propto \prod _{k=1}^{N} \exp \left( - \dfrac{\Vert W^{[k]}\Vert _1 }{\lambda }\right) , \end{aligned}$$
(3)

where \(\lambda \) is a hyperparameter to be fixed or estimated.

Adopting a Maximum A Posteriori (MAP) approach, we first need to express the posterior distribution. Based on the defined likelihood and prior, this posterior can be written as:

$$\begin{aligned}&f(W;y,\sigma ,\lambda )\propto f(y;W,\sigma )f(W;\lambda )\nonumber \\&\propto \prod _{m=1}^{M} \exp \left( -\dfrac{1}{2\sigma ^2} \Vert \widehat{y}(x^m;W) - y^{(m)} \Vert ^2 \right) \prod _{k=1}^{N} \exp \left( -\dfrac{\Vert W^{[k]}\Vert _1 }{\lambda }\ \right) . \end{aligned}$$
(4)
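
As a short worked step under the definitions (2) and (3), taking the negative logarithm of (4) makes the link with the criterion in (1) explicit:

$$\begin{aligned} -\log f(W;y,\sigma ,\lambda ) = \dfrac{1}{2\sigma ^2} \sum _{m=1}^{M} \Vert \widehat{y}(x^m;W) - y^{(m)} \Vert ^2 + \dfrac{1}{\lambda } \sum _{k=1}^{N} \Vert W^{[k]}\Vert _1 + C, \end{aligned}$$

where C is a constant that does not depend on W. Multiplying by \(2\sigma ^2\) shows that maximizing (4) amounts to minimizing a criterion of the same form as (1), with a regularization weight equal to \(2\sigma ^2/\lambda \), where \(\lambda \) here denotes the scale of the Laplace prior in (3).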

It is clear that this posterior is not straightforward to handle in order to derive a closed-form expression of the estimate \(\widehat{W}\). For this reason, we resort to a stochastic sampling approach in order to numerically approximate the posterior, and hence to calculate an estimator for \(\widehat{W}\). The following Section details the adopted sampling procedure.

3.2 Hamiltonian Sampling

Let us denote \(\alpha =\dfrac{\lambda }{\sigma ^2}\) and \(\theta = \{\sigma ^2,\lambda \}\). For a weight \(W^k\) we define the following energy function

$$\begin{aligned} E_{\theta }^k (W^{k})= \dfrac{\alpha }{2} \sum \limits _{m=1}^{M}\Vert \widehat{y}(x^m;W) - y^{(m)} \Vert ^2_2 + \Vert W^{k}\Vert _1. \end{aligned}$$
(5)

The posterior in (4) can therefore be reformulated as

$$\begin{aligned} f(W;y,\theta ) \propto \exp { \left( -\sum _{k=1}^{N} E_{\theta }^k (W^{k}) \right) }. \end{aligned}$$
(6)

To sample according to this exponential posterior, and since direct sampling is not possible due to the form of the energy function \(E_{\theta }^k\), Hamiltonian sampling is adopted. Indeed, the Hamiltonian dynamics strategy [25] has been widely used in the literature to sample high-dimensional vectors. However, sampling using Hamiltonian dynamics requires computing the gradient of the energy function, which is not possible in our case due to the \(\ell _1\) term. To overcome this difficulty, we resort to a non-smooth Hamiltonian Monte Carlo (ns-HMC) strategy as proposed in [26]. More specifically, we use the plug and play procedure developed in [27]. This strategy requires calculating the proximity operator only at an initial point, and uses the shift property [28, 29] to deduce the proximity operator during the iterative procedure [27, Algorithm 1].

As regards the proximity operator calculation, let us denote by \(G_{\mathcal {L}}(W^{k})\) the gradient of the quadratic term of the loss function \(\mathcal {L}\) with respect to the weight \(W^{k}\). Let us also denote by \(\varphi ( W^{k}) = \Vert W^{k}\Vert _1\). Following the standard definition of the proximity operator [28, 29], we can write for a point z

$$\begin{aligned} \mathrm {prox}_{E_{\theta }^k} (z) = p \Leftrightarrow&z - p \in \partial E_{\theta }^k(p). \end{aligned}$$
(7)

Straightforward calculations lead to the following expression of the proximity operator:

$$\begin{aligned} \mathrm {prox}_{E_{\theta }^k} (z) = \mathrm {prox}_{\varphi } \left( z -\dfrac{\alpha }{2} G_{\mathcal {L}}(W^{k}) \right) . \end{aligned}$$
(8)

Since \( \mathrm {prox}_{\varphi }\) is nothing but the soft thresholding operator [29], the proximity operator in (8) can be easily calculated once a single gradient step (back-propagation) is applied to calculate \(G_{\mathcal {L}}(W^{k})\).
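
The computation in (8) can be sketched for a single scalar weight as follows. The function names are illustrative assumptions, and the gradient value is passed in as a plain number rather than obtained by back-propagation.

```python
def soft_threshold(z, t=1.0):
    """Proximity operator of t * |.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def prox_energy(z, grad, alpha):
    """Expression (8): shift z by (alpha / 2) * gradient, then soft-threshold."""
    return soft_threshold(z - 0.5 * alpha * grad)


print(soft_threshold(2.5))                    # 1.5
print(soft_threshold(-0.4))                   # 0.0, small entries are set to zero
print(prox_energy(3.0, grad=2.0, alpha=1.0))  # soft_threshold(2.0) = 1.0
```

The threshold is 1 here because \(\varphi \) is the unweighted \(\ell _1\) norm; the exact zeros produced by the operator are what promote sparse weights.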

The main steps of the proposed method are detailed in Algorithm 1.

Algorithm 1.
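
The following is a minimal, self-contained sketch of a non-smooth HMC sampler on a toy one-dimensional energy; it is not a reproduction of Algorithm 1. It assumes a scalar energy \(E(x) = \frac{a}{2}(x-y)^2 + |x|\) whose proximity operator has a closed form, replaces the gradient in the leapfrog steps by \(z - \mathrm {prox}_{E}(z)\) (an element of the subdifferential, following the characterization in (7)), and applies a Metropolis-Hastings test with the exact energy. The function names, the step size, the number of leapfrog steps and the toy energy are all illustrative assumptions.

```python
import math
import random


def soft_threshold(z, t):
    """Proximity operator of t * |.|."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def energy(x, a, y):
    """Toy energy: smooth quadratic fit plus non-smooth l1 term."""
    return 0.5 * a * (x - y) ** 2 + abs(x)


def prox_energy(z, a, y):
    """Closed-form proximity operator of the toy energy at z."""
    return soft_threshold((z + a * y) / (1.0 + a), 1.0 / (1.0 + a))


def ns_hmc(n_samples, a=1.0, y=2.0, eps=0.3, n_leapfrog=10, seed=0):
    """Return the chain of samples and the acceptance rate."""
    rng = random.Random(seed)
    x = 0.0
    chain, accepted = [], 0
    for _ in range(n_samples):
        q = rng.gauss(0.0, 1.0)                 # fresh momentum
        x_new, q_new = x, q
        # Leapfrog integration; the gradient is replaced by z - prox_E(z).
        q_new -= 0.5 * eps * (x_new - prox_energy(x_new, a, y))
        for step in range(n_leapfrog):
            x_new += eps * q_new
            scale = 1.0 if step < n_leapfrog - 1 else 0.5
            q_new -= scale * eps * (x_new - prox_energy(x_new, a, y))
        # Metropolis-Hastings test on the exact Hamiltonian.
        h_old = energy(x, a, y) + 0.5 * q * q
        h_new = energy(x_new, a, y) + 0.5 * q_new * q_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
            accepted += 1
        chain.append(x)
    return chain, accepted / n_samples


chain, rate = ns_hmc(5000)
burn_in = 1000
posterior_mean = sum(chain[burn_in:]) / (len(chain) - burn_in)
print(rate, posterior_mean)
```

Because the leapfrog force depends on the position only, the proposal remains reversible and volume preserving, so the acceptance test with the exact energy keeps the toy target invariant even though the force is a surrogate for the gradient.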

After convergence, Algorithm 1 provides chains of coefficients sampled according to the target distribution of each \(W^k\). These chains can be used to compute an MMSE (minimum mean square error) estimator (after discarding the samples corresponding to the burn-in period).
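
A minimal sketch of the MMSE estimator for one weight is given below: the samples of the burn-in period are discarded and the remaining ones are averaged. The function name and the example chain are hypothetical.

```python
def mmse_estimate(chain, burn_in):
    """Posterior-mean (MMSE) estimate from the samples kept after burn-in."""
    kept = chain[burn_in:]
    if not kept:
        raise ValueError("burn_in discards the whole chain")
    return sum(kept) / len(kept)


# Hypothetical chain for a single weight; the first two samples are burn-in.
chain_k = [5.0, 3.0, 1.2, 0.9, 1.1, 1.0, 0.8]
w_hat = mmse_estimate(chain_k, burn_in=2)
print(w_hat)
```

Applied to every chain returned by the sampler, this yields one estimate per weight.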

It is worth noting that hyperprior distributions can be put on \(\lambda \) and \(\sigma \) in order to integrate them in the hierarchical Bayesian model. These hyperparameters can therefore be estimated from the data at the expense of some additional complexity.

4 Experimental Validation

In order to validate the proposed method, two image classification experiments are conducted using two different datasets: a COVID-19 dataset of computed tomography (CT) images [30], and a standard dataset, namely CIFAR-10 [31]. For the sake of comparison, two kinds of optimizers are used: (i) MCMC-based methods, namely the standard Metropolis-Hastings (MH) algorithm and the random walk Metropolis-Hastings (rw-MH) algorithm [32], and (ii) the most popular optimization techniques used in DL: Adam and Adagrad [33]. One of the key hyperparameters to set when training a neural network with these optimizers is the learning rate. This parameter scales the magnitude of the weight updates that minimize the network’s loss function. In the experiments, the learning rate is equal to \(10^{-3}\). In addition, the hyperparameters \(\beta _1\) and \(\beta _2\) are equal to 0.9 and 0.999, respectively. They stand for the exponential decay rates used when estimating the first and second moments of the gradient. As regards coding, we used the Python programming language with the Keras and TensorFlow libraries on an Intel(R) Core(TM) i7-2720QM CPU at 2.20 GHz with 16 GB of memory. The same behavior is observed in terms of computational time and accuracy, which supports the effectiveness of the proposed MCMC method.
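
To illustrate the role of the learning rate and of \(\beta _1\) and \(\beta _2\), here is a minimal sketch of one Adam update for a scalar weight, written in plain Python rather than with the Keras optimizer. The function name, the stabilizing constant `eps` and the example values are assumptions.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update for a scalar weight w with gradient g at step t >= 1."""
    m = beta_1 * m + (1.0 - beta_1) * g        # first-moment estimate
    v = beta_2 * v + (1.0 - beta_2) * g * g    # second-moment estimate
    m_hat = m / (1.0 - beta_1 ** t)            # bias corrections
    v_hat = v / (1.0 - beta_2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# First step from w = 1.0 with gradient 2.0 and zero-initialized moments.
w1, m1, v1 = adam_step(w=1.0, g=2.0, m=0.0, v=0.0, t=1)
print(w1)
```

At the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly the learning rate whatever the gradient magnitude.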

4.1 ConvNet Models

Two CNN architectures are used in this study. Like the LeNet model [34], the first one (CNN_1) includes three convolutional layers (Conv3 \(\times \) 3-32, Conv3 \(\times \) 3-64, Conv3 \(\times \) 3-128) and two fully-connected layers (FC-64 and FC-softmax). The second one (CNN_2) has six convolutional layers (Conv3 \(\times \) 3-32, Conv3 \(\times \) 3-32, Conv3 \(\times \) 3-64, Conv3 \(\times \) 3-64, Conv3 \(\times \) 3-128, Conv3 \(\times \) 3-128) and three FC layers (FC-128, FC-64, FC-softmax) that are organized similarly to VGG-Net [35]. Both architectures involve convolutional layers with 3 \(\times \) 3 kernels in addition to \(2 \times 2\) max-pooling, with a stride equal to 1. All layers in both configurations use ReLU as the activation function, except the output layer.

As deep neural networks can easily overfit when trained with small datasets, the CNNs used here are extended with three regularization techniques [33]:

  • Batch Normalization: deals with the feature space distribution variability during the training. The input of the layer is normalized to be zero-mean with unitary variance. This step not only acts as a regularizer, but also allows faster training, higher learning rates, and less sensitivity to weights initialization.

  • \(\ell _{1}\) Regularization: \(\ell _1\) regularization is the preferred choice when having a high number of features as it provides sparse solutions. In our case, the regularization parameter was set to \(\lambda =0.001\).

  • Dropout: random disabling of neurons during training with rate p. Temporarily ignoring some activations forces the other neurons to learn a more robust representation of the input data while reducing the sensitivity to specific neurons. In our study, the dropout rate is set by cross validation to \(p=0.35\).
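
The three regularizers listed above can be sketched on plain Python lists as follows. This is an illustration only, not the Keras layers used in the experiments; the function names are assumptions, and the rescaling by 1/(1 - p) in the dropout function is the usual "inverted dropout" convention.

```python
import random


def batch_normalize(values, eps=1e-5):
    """Normalize a batch of activations to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def l1_penalty(weights, lam=0.001):
    """l1 regularization term added to the loss: lam * sum of |w|."""
    return lam * sum(abs(w) for w in weights)


def dropout(values, p, rng):
    """Zero each activation with probability p and rescale the kept ones."""
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]


normalized = batch_normalize([1.0, 2.0, 3.0, 4.0])
penalty = l1_penalty([0.5, -1.5, 0.0])
dropped = dropout([1.0] * 1000, p=0.35, rng=random.Random(0))
print(normalized, penalty, dropped.count(0.0) / len(dropped))
```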

4.2 Experiment 1: Challenging Case

A challenging classification case is addressed in this experiment. The same CNNs are used for CT image classification to distinguish COVID-19 infections from other pneumonia. This task is challenging due to the rich content of CT images and the similarity between COVID-19 infection and other pneumonia. The COVID-CT dataset contains 349 CT images positive for COVID-19 belonging to 216 patients and 397 CT images that are negative for COVID-19. The dataset is open-sourced to the public. We used 566 images for training and 180 images for testing, with a size of \(230 \times 230\).

The reported scores in Table 1 indicate that the proposed method clearly outperforms the competing optimizers in training both models to solve this challenging classification problem. Moreover, a severe performance decrease is observed for some optimizers such as Adagrad. This is mainly due to the difficulty of the classification task, which leads to a more complex learning process.

Table 1. Experiment 1: results for CT image classification using CNN_1 and CNN_2.

In order to confirm this performance decrease, Figs. 1 and 2 show the loss and accuracy curves obtained using the competing optimizers for CNN_1 and CNN_2, respectively. The displayed curves clearly indicate an overfitting effect for classical optimizers, in contrast to the proposed method.

Fig. 1. Experiment 1: train and test curves using CNN_1.

Fig. 2. Experiment 1: train and test curves using CNN_2.

4.3 Experiment 2: CIFAR-10 Image Classification

In this scenario, the learning performance using the competing optimization algorithms is evaluated using the standard CIFAR-10 dataset. The CIFAR-10 dataset consists of 60000 32\(\,\times \,\)32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

The reported scores in Table 2 indicate that the proposed method outperforms the competing optimizers in terms of learning precision, and hence classification performance. Furthermore, the competing optimizers do not perform well when training either CNN on the CIFAR-10 dataset. This confirms the ability of the proposed method to allow different networks to reach high accuracy levels, in contrast to standard optimizers, even when regularization is used. The gain in terms of computational time using the proposed method is larger in this experiment.

Table 2. Experiment 2: results for CIFAR-10 image classification using CNN_1 and CNN_2.

5 Conclusion

In this paper, we proposed a new Bayesian optimization method for fitting weights for artificial neural networks. The suggested method uses Hamiltonian dynamics to solve the problem of sparse regularization optimization. Our results demonstrated the good performance of the proposed method in comparison with standard optimizers, as well as classical Bayesian ones. Moreover, the proposed technique allows simple networks to enjoy high accuracy and generalization properties. Future work will focus on testing our proposed optimizer with larger datasets, as well as proposing a distributed or parallel implementation.