
1 Introduction

Deep learning is a recent area of machine learning research that was introduced with the aim of bringing machine learning closer to one of its main goals: artificial intelligence. Deep learning algorithms are inspired by the structure and functioning of the brain and can learn several levels of representation [28,29,30].

The performance of many deep learning methods is highly sensitive to a large number of decisions, including the choice of the right neural architecture, the training procedure and the hyperparameter optimization method, all of which must be made well to obtain the desired result. This is a problem for new users and experts alike. Automated Machine Learning (AutoML) can therefore improve performance while saving considerable time and money. The AutoML field aims to make these decisions in a data-driven, objective and automated manner. One of the most helpful and remarkable AutoML methods is Neural Architecture Search (NAS) [9].

We chose one of the latest NAS methods, which uses the ant colony optimization algorithm to design the network structure with minimal weights [7]. The general concept of ant colony optimization algorithms, inspired by the behavior of real ants [17], is to combine prior knowledge about a promising solution structure with basic information about previously obtained network structures [4]. The family includes several variants; the one of interest in this study is the Ant Colony System (ACS), in which a group of ants cooperates to explore good solutions, using an indirect form of pheromone-mediated communication: pheromone is deposited on the edges of the travelling salesman problem (TSP) graph while solutions are built [8].

Hyperparameter optimization has a significant impact on the performance of a neural network. Many techniques have been applied successfully; the most common ones are grid search, random search, and Bayesian optimization [2]. Grid search evaluates all possible combinations, so a lot of time is devoted to the search for hyperparameters. Random search builds on grid search: a grid of hyperparameter values is created and combinations are drawn from it at random. This process still takes a long time and cannot guarantee convergence to the global optimum or a stable, competitive result. Both methods require all possible values of each parameter to be enumerated. Bayesian optimization, on the other hand, only needs the range of the values and searches for a global optimum in a minimal number of steps [12].
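
To make the contrast concrete, the following minimal Python sketch compares grid search (which enumerates every combination) with random search (which samples a fixed budget of combinations). The hyperparameter space and the objective function are illustrative placeholders, not the settings used in this work.

# Minimal sketch contrasting grid search and random search over a small,
# illustrative hyperparameter space; objective() is a placeholder standing in
# for training a model with the given settings and returning its accuracy.
import itertools
import random

space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
    "dropout": [0.1, 0.2, 0.3],
}

def objective(config):
    # Placeholder: in practice, train a model and return validation accuracy.
    return random.random()

# Grid search: evaluates every combination (3 * 3 * 3 = 27 trials here).
grid_trials = [dict(zip(space, values)) for values in itertools.product(*space.values())]
best_grid = max(grid_trials, key=objective)

# Random search: samples a fixed budget of combinations from the same grid.
random_trials = [{k: random.choice(v) for k, v in space.items()} for _ in range(10)]
best_random = max(random_trials, key=objective)

print(best_grid, best_random)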

In this work, we are interested in modeling a new optimization process for a convolutional neural network model. To verify the feasibility of our work, we compare it with previous optimization methods such as Deepswarm [7], Udeas [2], and LDWPSO [3].

The remainder of the paper is organized as follows: Sect. 2 presents the study background concerning deep learning, ant colony optimization and Bayesian hyperparameter optimization; Sect. 3 introduces our proposed method; Sect. 4 provides an evaluation of our method; Sect. 5 concludes the paper and explores possible future directions.

2 Study Background

2.1 Deep Learning

Deep learning supports computational models composed of several processing layers that learn representations of data at increasingly abstract levels; these techniques have dramatically advanced the state of the art in various domains [28, 29].

A deep neural network, within deep learning, can take many forms, including the convolutional neural network (CNN). Convolution refers to a specialized type of linear operation; CNNs are simply neural networks that use convolution instead of general matrix multiplication in at least one of their layers. They first appeared in the 1990s, when high-performing models were developed to recognize handwritten digits [18]. CNN architectures have kept improving, with the error rate on the ImageNet challenge dropping below 4% as researchers moved from the 8 layers of AlexNet to architectures with 152 layers [19]. CNNs have also excelled in other computer vision tasks, such as human action recognition [20], object localization [21, 22], pedestrian detection [23], and face recognition [24]. CNNs subsequently proved effective in natural language and speech processing, achieving excellent results in classification [25], sentence modeling [26], and speech recognition [27].

2.2 Ant Colony Optimization

The ant colony optimization algorithm mainly aims to find the shortest path to a target [8]. Ants secrete a substance called pheromone in order to leave a trail for the following ants; by nature, an ant follows the freshest and strongest pheromone scent, which corresponds to the ant that took the shortest path towards the target. The choice between one node and the next is made with a pseudo-random proportional rule (based on probability). The environment is altered in two different ways, a local trail update and a global trail update, as shown in Fig. 1. The amounts of pheromone on the edges along which ants move between nodes are updated with the aim of building new solutions. Each ant applies the local pheromone update while constructing its path; the global trail update takes place once all ants have completed their paths, by reinforcing the edges that belong to the best path found so far [4].

Fig. 1. Ant colony optimization steps
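
To make the two update rules concrete, the following minimal Python sketch illustrates the ACS pseudo-random proportional selection rule together with the local and global pheromone updates on a generic graph; the parameter values (q0, beta, rho, alpha, tau0) are typical textbook choices, not those used in this paper.

# Illustrative sketch of the two ACS ingredients described above: the
# pseudo-random proportional node-selection rule and the local/global
# pheromone updates on the edges of a generic graph.
import random

q0, beta = 0.9, 2.0        # exploitation threshold, heuristic weight
rho, alpha = 0.1, 0.1      # local and global evaporation rates
tau0 = 0.01                # initial pheromone level

def choose_next(current, candidates, tau, eta):
    """Pseudo-random proportional rule: exploit with probability q0, else sample."""
    scores = {j: tau[(current, j)] * (eta[(current, j)] ** beta) for j in candidates}
    if random.random() < q0:
        return max(scores, key=scores.get)            # exploitation
    total = sum(scores.values())
    r, acc = random.uniform(0, total), 0.0
    for j, s in scores.items():                       # biased exploration
        acc += s
        if acc >= r:
            return j
    return j

def local_update(tau, edge):
    """Applied by every ant right after traversing an edge."""
    tau[edge] = (1 - rho) * tau[edge] + rho * tau0

def global_update(tau, best_tour, best_length):
    """Applied once per iteration, only on the edges of the best tour so far."""
    for edge in zip(best_tour, best_tour[1:]):
        tau[edge] = (1 - alpha) * tau[edge] + alpha * (1.0 / best_length)

# Tiny demo on a 4-node complete graph with uniform heuristic information.
nodes = [0, 1, 2, 3]
tau = {(i, j): tau0 for i in nodes for j in nodes if i != j}
eta = {(i, j): 1.0 for i in nodes for j in nodes if i != j}
nxt = choose_next(0, [1, 2, 3], tau, eta)
local_update(tau, (0, nxt))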

The connection weights are adjusted according to the number of ants, so that different combinations of connection weight values are explored. ACO has been applied both as a standalone training procedure and in an ACO-BP hybrid to train feedforward neural networks for pattern classification [10]. Through the global search of the ant colony, the weights and biases of artificial neural network (ANN) and deep neural network (DNN) models have been adjusted to achieve optimal performance in capital cost prediction [1]. ACO has also been used to form an ant colony that exploits pheromone information to collectively search for the best neural architecture [7], and to evolve the structure of recurrent neural networks [11].

2.3 Bayesian Hyperparameter Optimization

The Bayesian hyperparameter optimization algorithm can be seen as a loop repeated t − 1 times [12]. Within this loop, the acquisition function is maximized and the prior distribution is then updated. With each cycle the distribution is refined and, based on the new posterior, the point at which the acquisition function is maximal is evaluated and added to the training data set. The whole process is repeated until the maximum number of iterations is reached or the difference between the current value and the best value obtained so far falls below a predetermined threshold [12]. Bayesian optimization has been used to manage data and compare models in the search for an ideal network [13]. The Gaussian process (GP) is a common surrogate model of learning performance; its behavior is affected by choices such as the kernel type and the handling of its own hyperparameters. A further advance is that the algorithms can take into account the variable cost of learning-algorithm experiments and can benefit from multiple cores to run experiments in parallel [14]. For a long time, the only available benchmarks were artificial test functions that do not represent practical applications; to alleviate this problem, a library of benchmarks drawn from prominent hyperparameter optimization applications was introduced [15]. In [16], the authors define a new kernel for conditional parameter spaces that explicitly includes information about which parameters are relevant in a given structure, in order to share the collected performance data across different architectures. This is useful when searching over structures with varying numbers of parameters; for example, we might want to explore neural network architectures with an unknown number of layers.
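
The loop described above can be sketched from scratch as follows, using a Gaussian process surrogate from scikit-learn and the expected improvement acquisition function; the one-dimensional objective is only a placeholder for a real validation-accuracy measurement.

# Minimal sketch of the Bayesian optimization loop described above:
# fit a Gaussian process surrogate, maximize an acquisition function
# (expected improvement here), evaluate that point, and repeat.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    return -(np.sin(3 * x) + 0.5 * x)   # placeholder black-box function

def expected_improvement(candidates, gp, y_best):
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma           # we are minimizing the objective
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(3, 1))      # a few random initial evaluations
y = np.array([objective(x[0]) for x in X])

for _ in range(15):                     # further iterations of the loop
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    candidates = np.linspace(0, 5, 500).reshape(-1, 1)
    ei = expected_improvement(candidates, gp, y.min())
    x_next = candidates[np.argmax(ei)]  # point maximizing the acquisition
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

print("best x:", X[np.argmin(y)], "best value:", y.min())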

The central idea of BO is to optimize the hyperparameters of the neural network. In this work, we propose an improvement that we call BayesACO.

3 Proposed Algorithm

The parameters embedded in our model are internal to the neural network and are estimated automatically, i.e. learned from training samples. Hyperparameters, on the other hand, are external settings that are not learned by the neural network; they have a great influence on its accuracy, yet their best values are hard to determine. Our algorithm takes place in two phases, as shown in Fig. 2:

  • Initial phase: a neural architecture search with ant colony optimization finds the convolutional neural network structure, and the associated weights, that yield the best performance of the CNN model.

  • Second phase: Bayesian hyperparameter optimization tunes a chosen set of hyperparameters. Because each evaluation builds on the results of the previous ones, better hyperparameters can be found in less time; a high-level sketch of the combined workflow is given after this list.
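
The following high-level Python sketch illustrates how the two phases fit together. It is an illustration only, not the authors' implementation: the two helper functions are stubs standing in for the ACO-based architecture search and the Bayesian hyperparameter optimization, and their return values merely echo the MNIST settings reported in Sect. 4.

# High-level sketch of the two-phase idea; the helpers are stubs, not the
# real first and second phases.

def search_architecture_with_aco(dataset, ant_count=16, max_depth=10):
    # Stub: in the real first phase, an ant colony builds and evaluates CNN
    # topologies, guided by pheromone deposited on well-performing choices.
    return {"layers": ["conv", "conv", "maxpool", "dropout", "flatten", "dense", "dense"]}

def tune_hyperparameters_bayesian(architecture, dataset, n_iterations=20):
    # Stub: in the real second phase, a surrogate model proposes hyperparameter
    # settings, each trial informed by the outcomes of the previous ones.
    return {"learning_rate": 0.1, "batch_size": 64, "epochs": 15}

def bayes_aco(dataset):
    architecture = search_architecture_with_aco(dataset)                     # phase one
    hyperparameters = tune_hyperparameters_bayesian(architecture, dataset)   # phase two
    return architecture, hyperparameters

print(bayes_aco(dataset="mnist"))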

3.1 Bayesian Hyperparameter Optimization

Throughout this phase, we apply Bayesian hyperparameter optimization. First, we specify the function to be optimized. Then we select the hyperparameters of the model at random. Afterwards, we train the model, evaluating and updating it until it reaches its best performance. This stage is repeated for a number of iterations specified by the user, in such a way that each iteration depends on the previous ones.
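
As an example of this phase, the sketch below tunes a few hyperparameters with the gp_minimize routine of the scikit-optimize library (assuming that library is available); train_and_evaluate is a placeholder for training the phase-one CNN and returning its validation accuracy.

# Sketch of the second phase: Bayesian optimization of a few hyperparameters.
from skopt import gp_minimize
from skopt.space import Categorical, Real

def train_and_evaluate(learning_rate, batch_size, epochs):
    # Placeholder: build the phase-one CNN, train it with these settings,
    # and return the validation accuracy.
    return 0.99 - abs(learning_rate - 0.1)

space = [
    Real(1e-3, 0.3, name="learning_rate"),
    Categorical([32, 64, 128], name="batch_size"),
    Categorical([5, 10, 20], name="epochs"),
]

def objective(params):
    learning_rate, batch_size, epochs = params
    # gp_minimize minimizes, so return the negative validation accuracy.
    return -train_and_evaluate(learning_rate, batch_size, epochs)

result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("best hyperparameters:", result.x, "best accuracy:", -result.fun)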

Fig. 2. BayesACO workflow

4 Application

In this section, we deal with hyperparameter and parameter optimization. The evaluation of the hyperparameters and of the model structure of the CNN, in order to obtain the best possible performance, is performed on the MNIST and Fashion-MNIST datasets.

Table 1 shows the optimization parameters of our methodology. The number of filters of the convolution node is optimized between 32 and 128 and its kernel size between 1 and 5; the rate of the dropout node is between 0.1 and 0.3; the stride of the pooling node is between 2 and 3 and its type is max or average; the size of the dense node can be 64 or 128 and its activation function ReLU or Sigmoid; the validation split is between 0.0 and 0.3; the batch size is 32, 64 or 128; and the number of epochs is 5, 10 or 20.

Table 1. Optimization parameters of BayesACO.
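
For illustration, the Table 1 ranges can be expressed as a search space, here with scikit-optimize dimension objects (assuming that library is available); the parameter names are ours, not taken from the paper.

# The Table 1 ranges written as a scikit-optimize search space.
from skopt.space import Categorical, Integer, Real

search_space = [
    Integer(32, 128, name="conv_filter_count"),
    Integer(1, 5, name="conv_kernel_size"),
    Real(0.1, 0.3, name="dropout_rate"),
    Integer(2, 3, name="pool_stride"),
    Categorical(["max", "average"], name="pool_type"),
    Categorical([64, 128], name="dense_size"),
    Categorical(["relu", "sigmoid"], name="dense_activation"),
    Real(0.0, 0.3, name="validation_split"),
    Categorical([32, 64, 128], name="batch_size"),
    Categorical([5, 10, 20], name="epochs"),
]
print(search_space)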

The main parameters of BayesACO used for optimization on MNIST are as follows: the ant count is 16 and the maximum depth is 10; the number of epochs is 15, the batch size is 64, and the learning rate is 0.1.

The first model, presented in Fig. 3, is composed of two convolutional layers, two fully connected layers, and one max-pooling, one dropout and one flatten layer.

Fig. 3. The best architecture discovered with MNIST
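
For illustration, a Keras model of the kind described above could look as follows; the exact filter counts, kernel sizes, layer ordering and optimizer of Fig. 3 are assumptions chosen within the Table 1 ranges, not the values actually discovered.

# Illustrative Keras reconstruction of an architecture with two convolutional
# layers, one max-pooling layer, a dropout layer, a flatten layer and two
# fully connected layers, as described above.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Dropout(0.2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# The optimizer type is an assumption; the paper only reports a learning rate
# of 0.1, a batch size of 64 and 15 epochs for the MNIST run.
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()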

From Fig. 4, the training and testing accuracy increases with the number of epochs, which reflects that with each epoch the model learns more information. If the accuracy were insufficient, the model would need more training and the number of epochs would have to be increased, and vice versa. Likewise, the training and validation error decreases with the number of epochs. We also note that 57 images in total are misclassified, i.e. an error rate of 0.57%, while 9943 images are correctly classified, i.e. an accuracy of 99.43%.

Fig. 4. Accuracy and error for the MNIST model

Fashion-MNIST dataset

The initial parameters of BayesACO used for optimization on Fashion-MNIST are as follows: the ant count is 16 and the maximum depth is 15; the number of epochs is 20, the batch size is 64, and the learning rate is 0.1.

The second model, presented in Fig. 5, is composed of three convolutional layers, two average-pooling layers, and one dropout and one fully connected layer.

Fig. 5. The best architecture discovered with Fashion-MNIST

Beyond this, we compare the results of our final method with those of other methods for improving convolutional neural networks, namely Deepswarm, Udeas and LDWPSO.

The variations in the results of the algorithms clearly indicate the effectiveness of our proposed methodology in terms of cost, defined here as the value of the test accuracy, as shown in Table 2.

Table 2. Results of optimization methods.

We can conclude that our result on the MNIST dataset is 3.63% higher than that of the XGBoost architecture and 0.48% higher than that of the LDWPSO optimization method, while on the Fashion-MNIST dataset the accuracy obtained is 12.4% higher than that of the LeNet-5 architecture and 8.71% higher than that of the Bayesian optimization method.

5 Conclusion

In this paper, we are interested in integrating Bayesian hyperparameter optimization into the stages of an existing neural architecture search. This system was developed to optimize convolutional neural networks.

Combining hyperparameter optimization with neural architecture search reduces human intervention, since the process of deriving the network becomes fully automated. We thus save time and obtain more accurate results.

As future work, we consider it important to run the proposed method on other databases, to evaluate it in terms of time against competing methods, and to develop the approach using more advanced techniques than those that already exist in order to obtain better results.