Keywords

1 A Brief History

In the last decades, the use of Neural Networks (NNs) has become one of the most effective approaches for solving classification and regression tasks. This is principally due to their capability of identifying and modeling complex correlations. They have been proposed for the first time in the 40s in order to try to achieve two main objectives: to study the functioning of the human brain by defining models able to simulate its neuro-physiological phenomena, and to use such models to extract the principles guiding human reasoning in terms of mathematical calculations, to develop artificial systems able to reason as a human but possibly faster and more efficiently.

The concept of NN, as the name explicitly says, was originally used to define networks simulating the neurons and their interactions in the human brain. The first theory was developed by McCulloch and Walter Pitts [31] and was very simple and effective. They defined the neuron as a computational unit that applies a function to the inputs implementing a binary classification. The inputs were multiplied by weights fixed apriori and static. In 1958, this first definition was at the basis of the implementation of the first model of neural network, called Perceptron [39], allowing the training of a single neuron. From the work of those years, besides the Perceptron, a second learning algorithm also emerged, which was a special case of the Stochastic Gradient Descent (SGD) technique, which is at the basis of most learning algorithms now. This learning algorithm was used to train the weights of the ADAptive LINear Element (ADALINE) model [48], used for linear regression. Basically, SGD needs an error function that returns, for a given output of the model, how far the output is from the label of the input. SGD takes the error function and computes the gradient of this error on the weights. In this way, it is possible to update the weights moving along the gradient in order to reduce the error. Performing these operations iteratively allows minimizing the error. Nowadays, this idea has been declined in many ways, considering for example also the second derivative or the derivative of the previous iterations to better guide the tuning of the weights. However, a single neuron is not effective for even the simplest classification and regression task because it implements a too simple function. For this reason, this idea was abandoned at that time.

In the second half of the eighties, Rumelhart et al. [40] used the back-propagation mechanism to train larger networks, giving new life to the study of the topic. The idea was simple but effective: using more neurons means combining more functions, defining a model able to represent more complex scenarios. Unfortunately, at that time, developing and experimenting with these ideas was difficult due to hardware limitations that made the training unfeasible with the increase of the complexity of the model. However, the idea of deep networks dates back to these years. Deep networks are networks with many layers of neurons, the internal layers are called hidden and the depth of the network is the number of layers. Moreover, the eighties saw the first theorization of Convolutional Neural Networks (CNN) by Yan LeCun [27], whose work resulted in the definition of the well-known LeNet5 model [26]. This model was one of the first effective CNN as it was able to achieve super-human results in the task of handwritten digit recognition.

Nowadays, the idea of NNs is that of extremely complex networks with many neurons organized in many layers. The concept of depth is stressed even further, partly thanks to the advances in hardware. Other important aspects that helped to increase the importance of these models are the increase of the size of the data and the availability of powerful (and in many cases user-friendly) systems and frameworks for the design, the implementation and the use of these models. Frameworks such as Tensorflow,Footnote 1 PyTorchFootnote 2 and CaffeFootnote 3 allow a better user friendliness and consequently a easier prototyping of models.

The classical definition of a NN, a set of neurons grouped in different layers where a neuron in a layer communicates with all the neurons of the next layer, is now usually called Fully Connected (Deep) Neural Network, Artificial Neural Network, Multilayer Perceptron, or Deep Feedforward Network  [1, 13], and has been later extended in order to define more complex models.

Alongside this definition, new models have gained increasing importance in the field. On one hand, the already mentioned convolutional networks, mainly used in the field of computer vision, and on the other hand Recurrent Neural Networks [40], defined for input data in the form of sequences, such as written text. Their great innovation is that they allow the network to maintain a memory of previous data, enabling the management of such sequences.

The rest of this chapter is organized as follow. Section 2 discusses Multilayer Perceptrons, Sect. 3 introduces Convolutional Neural Networks, and Sect. 4 Recurrent Neural Networks. Section 5 discusses the problem of tuning the hyper-parameters to guide the training. Finally, Sect. 6 concludes the chapter.

2 Multilayer Perceptrons

The objective of a Multilayer Perceptron (MLP) is that of approximating a function \(\hat{f}:\mathbb {R}^{N\times M}\rightarrow \mathbb {R}^K\) by means of other functions such that \(\hat{f}(X)=f_n(f_{n-1}(\dots (f_1(f_0(X)))))\), where X is a tensorFootnote 4 of size \(N\times M\) representing the input. The corresponding model is a network of n layers, each representing the function \(f_i\) with i the number of the layer. Figure 1 shows an example of MLP with 4 layers. Layer 0 is the input layer, the leftmost one, then layers 1 and 2 are internal layers, also called hidden layers. Finally, the last layer is the output one, returning the results of the computation of the network. The output function is \(\hat{Y}=\hat{f}(X)=f_3(f_2(f_1(f_0(X)))))\), with \(\hat{f}:\mathbb {R}^{3\times 1}\rightarrow \mathbb {R}^3\). This network is developed to solve a multiclass classification where each input x can be labelled with three different classes, associated with different neurons of the output layer. Before describing the overall flow, let us concentrate on a single neuron, e.g., neuron \(h_1\) in Fig. 1. Figure 2 shows the computation performed in each single neuron. Basically, each neuron takes as input the output of each neurons of the previous layer, for the case of \(h_1\), it takes \(X=[x_1, x_2, x_3]^{\top }\) as input. Moreover, each neuron has another input value, the bias, a constant term set to 1 which represents background noise. Thus, toghether with the weight matrix \(W_1\), containing one weight \(w_i^1\) for each \(x_i\), another weight \(b_1^1\) is considered, associated to the bias. This weight is usually automatically added to the neuron by the frameworks used to model the MLP, thus it is not necessary to explicitly add it to the model, as shown in Fig. 1, where the bias terms are not represented.

Fig. 1
figure 1

Example of a MLP with four layers: the input, two hidden and the output layers. The input \(X=[x_1, x_2, x_3]^{\top }\) represents a tensor of size 3\(\,\times \,\)1. The output \(Y=[y_1,y_2,y_3]^{\top }\) represents the 3 possible classes to be assigned to the input, one class for each value in Y

Fig. 2
figure 2

Computation of neuron \(h_1\) in Fig. 1

In detail, first of all, the neuron computes the network input, i.e., the input it takes by the network. So, \(z_1^1=b_1^1+\sum _{i=1}^{n}x_i w_i^1\) or, in matrix notation \(z_1^1=b_1^1+X^{\top }W_1\), with \(n=3\) the size of X, i.e., the number of input from the previous layer. Then, \(z_1^1\) is given to the activation function \(f_{act}\) that computes the output of the neuron, which is given as input to the neurons of the next layer. Thus, \(f_1^1(X)=f_{act}(b_1^1+X^{\top }W_1)\).

There are many possible activation functions, varying from hyperbolic tangent \(\tanh \) to the sigmoid or to Rectified Linear Unit (ReLU) [24]. In particular, the last two functions are the most used. Given Z the input of a neuron, the sigmoid function \(\sigma (Z)=\frac{1}{1+e^{-Z}}\), shown in Fig. 3a, is used as the activation function of the output layers in case of binary classification. It returns a value between 0 and 1, thus the output layer is designed to have one neuron returning the probability of the input to belong to the positive class (one of the two classes). On the other hand, the ReLU activation function, shown in Fig. 3b, is the most used for the neurons of the hidden layers. It was introduced to improve the performance in terms of training time and defined as \(ReLU(H)=\max (0,Z)\).

Fig. 3
figure 3

Sigmoid a and ReLU b activation functions

Therefore, the overall flow is as follow. The input data is multiplied by weights matrix \(W_1\) and given to the neurons of the first layer. Here, every neuron takes the result of this multiplication and applies the activation function, returning the output of the neuron. This value is intended as an indicator of the level of activation of the neuron, the higher the value, the most active that neuron.

This process is then repeated sequentially for the next layers until the output layer is reached. Thus, for example, the output of the neuron \(h_2\) in Fig. 1 is \(f_1^2(X)=f_{act}(b_1^2+Z_1^{\top }W_2)\) where \(b_1^2\) is the bias weight of neuron \(h_2\), and \(Z_1\) is the tensor containing the output of the neurons of the previous layer.

The last layer will output a probability distribution among the classes considered in the data in the case of classification, or a real value predicting the output label of the example in the case of regression.

During learning, this output distribution is used to compute a loss measure representing how far the predicted output of the network is from the actual label of the input examples. The gradient of this loss measure is calculated w.r.t. the weights. These are then updated by taking a step in the opposite of the direction given by the gradient by means of the gradient descent algorithm or one of its evolutions. This approach is called back-propagation and it is at the basis of the training of each neural network, with some variants due to the different architectures adopted.

The need of calculating the gradient guided the search for good activation functions for all the layers. What one would like to have is a function easy to derivate, able to maintain the output values of each neuron in a range that is functional for the training and, at the same time, that does not reduce the information passing through the neuron. Values in the range [0, 1] are the easiest to manage because they avoid an explosion of the weights’ values. The counterpart is that multiplying many values lesser than 1 makes the output close to 0, making its gradient close to 0 as well. This problem is called vanishing gradient. To try to avoid this problem, the choice of the value of the update step for the weights must be wisely chosen and, in many cases, it must be changed during the training as well.

3 Convolutional Neural Networks

When the input data presents some spatial structure, i.e., each value of the input is connected in some way with other values, MLP may have problems to represent these spatial relations, that usually connect smaller portions of the input example and may appear in different positions and numbers. MLPs need a neuron for each value of the input example, which is a tensor, thus they may require too many neurons to effectively handle the problem. Input data with spatial structure are, e.g. images, where each pixel is connected with the pixels surrounding it, or time series, where each value is connected with the previous and the following value in the series. If we consider a picture of a person, the set of pixels representing an eye are related and the same relation appears in two different places inside the picture. Moreover, the eye is related to the face of the subject but not, for example, to a car passing behind the person in the background. In these cases, if a MLP is considered, the input layer should have a neuron for each value of the input, i.e., one neuron for each pixel in case of a greyscale image or three neurons for each pixel in case of an RGB picture (one for each channel). This may force the network to have an extremely large number of weights to tune. Consider, for example, a greyscale image with a resolution of \(256\times 256\) pixels. The input layer has 65,536 neurons. If the first hidden layer has 100 neurons, which is usually a number too small to be effective given the input size, the network needs 65,536\(\cdot \)100=6,553,600 weights to connect the two layers. This number will increase even further adding more layers, resulting in a hard to train network.

Convolutional Neural Networks (CNNs) [27] have come to the fore to solve problems where data presents spatial and/or topological structure as in the previous example. CNNs are built using convolutional layers, performing convolutions on the data they receive and extracting features from it, typically followed by some fully connected layers to carry out the classification by considering only the extracted features instead of the entire input data. Thus, considering the previous example, the convolutional part checks which features are present in the picture, e.g., if eyes or a mouth or wheels are present. Then, the fully connected part considers the set of features identified in the image and decides which label to assign.

The convolution operation is the integration of two functions x and w, and it is denoted as \(x*w\). Function x is called input while function w is the kernel. The result is called feature map, or simply feature. Convolution is defined as

$$\begin{aligned} (x*w)(t)= & {} \int _{-\infty }^\infty x(\tau )w(t-\tau )d\tau \end{aligned}$$

or, in the discrete case

$$\begin{aligned} (x*w)(t)= & {} \sum \limits _{\tau =-\infty }^\infty x(\tau )w(t-\tau ) \end{aligned}$$

In practical scenarios, the input and the kernel functions are tensors, i.e., multidimensional arrays, such as a 2D grid of pixels for an image or a 1D vector for a time series. The kernel scrolls all the input, moving in the directions given by the size of the input itself (in one direction in the case of 1D input, in two directions in the case of 2D input, etc.). Good surveys on the application of CNNs to images and time series are respectively [20, 36].

The result of a convolution operation represents the features extracted from the input. The advantage of using CNNs is that the kernel is used on the entire input. The only weights to train are the values of the kernel, therefore, if we consider a single kernel of size \(5\times 5\), the convolutional layer has only 25 weights to train irrespective of the size of the input of the layer. Even if the input has size \(256\times 256\), the number of weights to train will be always 25.

In practice, every convolutional layer applies many kernels at a time, extracting many features from the same image, one for each kernel. For example, one of the first CNN, LeNet5 [26], was defined to train 10 kernels of size \(28\times 28\) in the first convolutional layer. Later, Simonyan and Zisserman [42] showed that it is possible to further reduce the number of weights by maintaining the same receptive field by reducing the size of the kernel and adding more convolutional layers, as shown in Fig. 4.

The receptive field is the number of values of the input affecting a single value of the output of a convolution operation. Considering a single convolutional layer having \(5\times 5\) kernels, the receptive field of the output of this layer is 25 (Fig. 4a). Consider now replacing this single layer with two convolutional layers having \(3\) kernels each, as depicted in Fig. 4b. After the second convolutional layer, every value of the output depends on 25 values of the input given to the first convolutional layer, i.e., the receptive field of each value of the output of the second layer is a square \(3\times 3\), so 9 values, of the output of the first layer. In the output of the first layer, each value has a receptive field of \(3\times 3\) values of the input. By considering these two layers as a single black-box, the receptive field of the output combines those of the two layers, becoming, as said before, a square of \(5\times 5\) values of the input. This is true if we consider a stride equal to 1. The stride indicates by how many pixels the kernel moves in a certain direction to calculate the feature map. The advantage of using more layers is that, while with a single \(5\times 5\) convolutional layer there are 25 weights to train, with two \(3\times 3\) convolutional layers there are \(2\cdot 9=18\) weights to train. If we consider a single layer having a \(7\times 7\) kernel and replace it with three layers having \(3\times 3\) kernel and using stride 1, the same receptive field is achieved with \(3\cdot 9=27\) weights instead of \(7\cdot 7=49\). Since every convolutional layer contains several different kernels, the gain in terms of number of weights to train increases fast. By exploiting this approach, it is possible to extract hundreds of features by maintaining the number of weights to train feasible.

Fig. 4
figure 4

Comparison between 1 convolution operation with kernel \(5\times 5\) (a) and 2 convolution operations with kernel \(3\times 3\) (b)

However, with the increase of layers, the problem of vanishing gradient will pop up. For this reason, He et al. [16] introduced ResNet where the configuration of layers combines the output of sets of convolutional layers with the input of the first layer in the sets creating a short-circuit between the input and the output of the set of layers, as shown in Fig. 5. This configuration of layers is calledResidual Block and presents an output function that is \(H(X)=F(X)+X\), where F(X) is the output of the set of convolutional layers. Moreover, the increase of complexity of a layer in terms of operations performed also implies an increase of the number of weights. To reduce their number, a possibility is to resort to \(1\times 1\) kernels, as in GoogLeNet [46]. Basically, \(1\times 1\) convolution is used to reduce the number of features computed by the previous convolutional layer by combining them in a meaningful way. This combination is learned automatically during the training phase to maintain the information of the features taken as input by the \(1\times 1\) convolutional layer while reducing their number. GoogLeNet also introduced the concept of inception, i.e., the design of a good local convolutional network architecture, usually with parallel branches, and its use as a new layer.

Fig. 5
figure 5

The architecture of a Residual Block as defined in ResNet

In the last years, many other different architectures have been presented. Most of them combine and extend the ideas presented for GoogLeNet and ResNet by including MLP networks inside layers. An example is the well-known Network-in-Network model [28], combining the idea of residual blocks and inception [49], creating fractal architectures [25], where the model is no more defined by means of layers but by means of a fractal function which is defined recursively, or heavily using \(1\times 1\) convolution [19] or short-circuits [18].

Another important operation used by CNNs to reduce feature maps is pooling, also called sub-sampling, which usually follows convolution operations. This is used to reduce the size of a feature map by summarizing its information. This summarization can be performed by, e.g., averaging neighbour values or extracting from them the maximum value. The use of convolution and pooling operations allows the extraction of the important features contained in the input data X, representing them by smaller tensors. These features can be used to help classification, or the recognition of certain patterns contained in the data, irrespectively of their position or scale.

Indeed, CNNs’ final layers are usually classical fully connected layers that perform classification on the features extracted by previous layers. Therefore, roughly speaking, a CNN can be divided into two subnetworks, where the first subnetwork processes the input data by means of convolution operations. The output of these operations is then passed to the second subnetwork, which is a fully connected network used to classify the input data. Figure 6 shows the architecture of LeNet5 (Fig. 6a) [26] and of ResNet (Fig. 6b) [16]. From this figure it is possible to see that while the overall architecture has remained the same, i.e., a convolutional part that feeds a set of features to a fully connected part, the depth has increased significantly, going from 6 layers of LeNet5 to 152 layers of ResNet.

Fig. 6
figure 6

The architecture of LeNet5 (a) and ResNet (b)

Considering images as input, besides the classification task, CNNs are also used to solve more specialized tasks, for example, semantic and instance segmentation, or object detection. The first task consists of identifying which part of the image each pixel belongs to. For example, consider a picture having two dogs in the foreground and a grassland with sky as background, semantic segmentation’s objective is that of “classifying” each pixel as sky, grass and dog, while instance segmentation adds the capability of discriminating among the two dogs. Thus, the difference is that semantic segmentation does not discriminate between different subjects, the pixels representing the two dogs are all classified as “dog”, while in instance segmentation the pixels representing the first dog are kept separated from those representing the second dog. This task is performed by superimposing a mask on the initial image that colours each pixel depending on how it is classified. This task is usually performed by applying convolution and deconvolution to the image so that the fully connected part of the network is replaced by a sequence of deconvolutional layers [30, 32, 41]. In particular, deconvolution is the dual of convolution, i.e., instead of extracting features from an image, reducing its size, it is aimed at recreating an image starting from a set of features, also possibly increasing its size.

Object detection allows to locate and surround with a box the objects represented in the image. Therefore, given an image, the objective is to identify which objects the image contains, classify them and spatially locate them in the image. This task is performed by defining and training the fully connected layers at the end of the model so that they also output the coordinates of the box highlighting each object. This approach can be used also to estimate the pose of the subject of the image [47], for example to identify if a person is sitting or walking. Object detection and pose estimation can be done by iteratively selecting portions of the image randomly sampled and using this portion as input of the CNN. In this way, the model can extract the information contained in each portion and thus in the entire image [12].

4 Recurrent Neural Networks

If we must process sequential data, the architectures seen before are not the best choice. Sequential data may have different length and be very long, however the model needs to be able to analyse the sequence as a whole. MLPs and CNNs take as input data of fixed size and are designed to handle different characteristics in the data, such as grid-like topology for the CNNs. When sequential data needs models having a memory of what they have seen previously in the sequence, Recurrent Neural Networks (RNNs) [40] may come in handy. An example of an application of the RNNs can be text translation, where given a sentence, the system has to produce a different sentence, which is the translation of the input in a different language. The vanilla RNN is composed of a single neuron that takes as input a single value of the sequence at a time and a value representing its previous state, i.e., the state, usually the output or a function of the output, obtained considering the previous values of the sequence. In a sequence, every individual value represents the value of the sequence at a certain time step t, therefore, the output of a RNN is \(h_t(x_t)=f(h_{t-1},x_t)\) for \(t\ge 0\), \(h_t\) is the output of the recurrent layer at time t, f the function computed by the layer, and \(x_t\) the input at time t. The function f can vary a lot and different types of functions have been defined to manage the process memory in different ways, by replacing the single neuron of the vanilla version with sub-networks containing more neurons. Each version of f uses weights that are learned during the training phase. An important thing to note is that the network is composed by a single neuron/sub-network applying function f, therefore, the weights are tuned by considering the whole sequences, while the network needs less memory than the other types of network discussed in this chapter.

This simple architecture can be used for different purposes, given an input sequence the model could return as output a different sequence or a single value, or else, the model could take a single value as input instead of a sequence and use only its previous state to create a sequence from scratch. This can be easily done by deciding when to return an output, either at each time step, at the end of the input sequence, or every intermediate solution.

The training method used for this type of network is called back-propagation through time and in its basic definition is computed for each time step. The difference is that in this case, at each time step, the number of times the weights of the network are considered increases. In fact, at time 1 the output considers the weights of the neuron only once, at time 2 the output is the output of time 1 multiplied by the weights, so they are considered twice, and this holds for every time step until the end of the sequence.

We can also define Bidirectional-RNN, where for each time step the output depends also on the future values of the sequence, thus the whole sequence must be read forward and backward in order to return the output. It is worth noting that the possibility of reading the sequence bidirectionally opens to the possibility of taking as input not only sequences but also input that have more dimensions, such as images. This can be done by considering different directions for each dimension, e.g., in the case of an image the network should consider 4 directions.

With the combination of RNNs and CNNs it is also possible to solve the image captioning problem, i.e., given an image, generate a caption describing the content of the image [22]. This can be done by exploiting a CNN trained for solving object recognition task. This network returns a set of labels telling what the image contains. A RNN is then fed with these labels to generate a caption, i.e., a sequence of words, for the image.

The main problem of vanilla RNNs is that the network tends to forget the most distant parts of the sequence. Thus, as seen before, more complex architectures have been proposed to improve the memory of the network. One of the most important architecture is called Long Short Term Memory RNN (LSTM) [17]. LSTMs replace the single neuron of the vanilla RNN with four gates (basically four neurons) that, for each time step, take as input the current value of the input sequence, and the previous output and state, as shown in Fig. 7. One gate decides which parts of the previous state \(C_{t-1}\) to remember by checking the previous output \(h_{t-1}\), another gate computes which parts of the current input \(x_t\) to remember, a third gate decides how these two parts are combined together to compute the current state \(C_t\) and so to manage what to keep in memory, and the fourth gate computes the current output \(h_t\).

Fig. 7
figure 7

Long Short Term Memory architecture

As can be seen in Fig. 7, an LSTM is a network containing several MLP neurons, called gates. To compute the output and the state of the LSTM, it is necessary to compute the output of each gate.

The output of gate f at time t, called forget gate, telling what to forget from the previous state is

$$f_t=\sigma (W_f\cdot [h_{t-1},x_t]+b_f)$$

where \(\sigma (\cdot )\) is the sigmoid function, \(W_f\) and \(b_f\) are the weights matrix and the bias weight of gate f,and \([h_{t-1},x_t]\) is the concatenation of the two tensors \(h_{t-1}\) and \(x_t\).

The output of gate i at time t, called input gate and telling which values to update, is

$$i_t = \sigma (W_i\cdot [h_{t-1},x_t]+b_i)$$

where \(W_i\) and \(b_i\) are the weights matrix and the bias weight of gate i.

The output of gate g at time t, called gate gate and creating a vector of new candidate values, is

$$g_t=\tanh (W_g\cdot [h_{t-1},x_t]+b_g)$$

where \(\tanh (\cdot )\) is the hyperbolic tangent, \(W_g\) and \(b_g\) are the weights matrix and the bias weight of gate g.

Now, it is possible to compute \(C_t\) as

$$C_t=f_t\times C_{t-1}+i_t\times g_t$$

where \(\times \) is the element-wise multiplication.

To find the output \(h_t\), the output of the gate o, called output gate, must be computed as:

$$o_t=\sigma (W_o\cdot [h_{t-1},x_t]+b_o)$$

where \(W_o\) and \(b_o\) are the weights matrix and the bias weight of gate o. Finally, the output of the LSTM at time t can be computed as

$$h_t=o_t\times \tanh (C_t)$$

Starting from LSTM, a plethora of new models have been proposed. Some of them combine or replace gates to simplify the training, such as the well-known Gated Recurrent Unit (GRU) [7], or introduce spyholes [11] giving also the previous state as input to (some) of the four gates of a standard LSTM.

The idea of adding sub-networks has also been applied in the case of RNNs[14, 33]. For example, the current state may be passed as input to an MLP whose output is considered as the previous state from the recurrent part [14].

A generalization of RNNs are the Recursive Neural Networks [5, 35]. In this case, the model presents a tree-like structure, as shown in Fig. 8. The main advantage is that the number of operations necessary to compute the output given an input sequence is reduced to be logarithmic in the length of the sequence. The drawback is that to achieve the best results the structure of the tree should be tailored to the input. This structure is usually computed by analysing the task. For example, for natural language processing, one can exploit some parsers to get a parse tree to be transformed into the recursive network [44, 45], or define a learner able to automatically create the structure of the tree [5]. However, given the difficulties to correctly identify the best architecture, recursive neural networks are rarely used.

Fig. 8
figure 8

Recursive neural network

5 Hyper-parameter Optimization

An important aspect to consider for each machine learning approach is the choice of good hyper-parameters. These can be the number of neurons in each layer, the number of layers, the size of the kernels or the number of iterations to perform during training. As seen in the previous sections, with the increasing sophistication of Deep Learning systems, the hyper-parameters and the possible configurations of the network architectures become more and more complex. Furthermore, due to the pervasiveness of Deep Learning systems, the tuning process of the hyper-parameters and the choice of the best neural architecture needs to be addressed even by non-experts.

In the automation of Deep Learning, we can distinguish two families of algorithms: Hyper-Parameters Optimization (HPO) and Neural Architecture Search (NAS) algorithms [9]. With HPO algorithms we modify only the hyper-parameters of the models while with NAS we act only on the architecture of the neural networks.

The four main HPO approaches are Grid Search, Random Search, Bayesian Optimization and with Genetic algorithm. Grid search [51] performs exhaustive search on the specified hyper-parameters space. This algorithm performs a new indipendent training session for each combination of the hyper-parameters. This algorithm ensures that the optimal configuration is found as long as sufficient resources are provided. This method is guided by a performance metric that decides which of the tested configurations is the best, measuring the performance of the neural model in the training or validation phase. However, due to the fact that the computational resources increase exponentially with the number of hyper-parameters to set, Grid Search suffers from the curse of dimensionality [51].

Random Search [3] searches randomly in the user-defined hyper-parameters space. Random search leads to better results than Grid Search, especially when only a small number of hyper-parameters affect the final performance of the learning algorithm [3]. Unlike Grid Search, this algorithm is not guaranteed to achieve the optimum, but may require less computation time while finding a reasonably good model in most cases [3] because the maximum computation time is set before starting the search. With Random Search, it is also possible to include prior knowledge by specifying the distribution from which to sample the hyper-parameter values.

Bayesian Optimization aims to find the global optimum with the minimum number of trials. It is a probabilistic model-based approach for optimizing objective functions which are very expensive or slow to evaluate [8]. Bayesian Optimization builds a probabilistic model (called surrogate model) of the objective function and quantifies the uncertainty in this surrogate using a regression model. Then, it uses an acquisition function to decide where to sample the next set of hyper-parameters values [10]. Its effectiveness in optimizing the hyper-parameters of NNs derives from the fact that it limits the number of training sessions by spending more time choosing the next set of hyper-parameters to try. In the literature, there are many examples of application of this kind of HPO to NNs [4, 23, 43].

A genetic algorithm (GA) is a metaheuristic based on Charles Darwin’s evolution theory [50] frequently used to generate high-quality solutions to optimization problems. This algorithm tries to imitate the process of natural selection where the fittest individuals are selected for reproduction to produce the population of the next generation. Evolution starts with a randomly generated population of individuals (each individual is a solution to the optimization problem). One of the key points of GA is the fitness function. The fitness function determines the fitness score of each individual. Fitness score represents the probability that an individual will be selected for reproduction. At each iteration of the algorithm, with the selection phase, the fittest individuals are selected so that they pass on their genes to the next generation. Individuals with high fitness score have more chance to be selected for reproduction. Through the crossover phase (also called recombination), for each new solution to be produces, a pair of parent solution of the actual generation are selected and their genetic information are combined to create new child solution. In addition to the crossover, to maintain the genetic diversity, there is also the mutation. Mutation alters one or more gene values in an individual from its initial state and occurs during the creation of the new population, according to a user-definable mutation probability. The algorithm terminates where the population has converged (GA does not produce new population that are significantly different from the previous generation).

Then, in the case of DL hyper-parameters optimization, the Genetic algorithm starts with an initial population of N DL models with some predefined hyper-parameters. Then, we can calculate the accuracy (or loss) of the models and uses that as a fitness score. Finally, we can generate a new offspring of DL models. This method is slow (at each iteration, new neural networks are generated which need to be trained) and not guaranteed to find the optimal solution.

Fig. 9
figure 9

Components of NAS with their interactions

NAS is a technique for automating the design of NNs architectures strictly correlated to HPO. NAS methods have already been shown to be capable of overcoming manually designed architectures [37, 52]. The three main elements on which NAS are based are: Search Space, Search Strategy and Performance Estimation Strategy as can be seen in Fig. 9. Search Space refers to all possible architectures that can be generated during the optimization process. Search Strategy refers to the methods for exploring all possible architectures that can be generated by NAS. Performance Estimation Strategy are the methods for measuring the quality of the generated NN [9]. There are different search strategies that can be used to explore the search space of neural architectures. These strategies include: random search, Bayesian Optimization [21], reinforcement learning [34], gradient-based methods [29] and evolutionary algorithms [9, 38]. NAS approaches can be categorized in two groups: classical NAS and one-shot NAS [2]. The first group follows the traditional search approach also used by Grid Search, where each generated NN is trained independently. One-shot NAS algorithms use weight sharing among models in the search space to train a super-net and use this to select better models. This type of algorithms reduces computation resources compared to the classical NAS algorithm. Therefore, a super-net is a single large network that contains every possible operation in the search space. All possible network architecture in the super-net can be considered as a sub-net with shared weights between common edges. Then, rather than training thousands of separate models from scratch like in the classical NAS, one can train a single large network (super-net) capable of emulating any architecture in the search space. Once the super-net is trained, it is used for evaluating the performance of many different architectures sampled at random by zeroing out or removing some operations.

The state-of-the-art of one-shot NAS are: Efficient Neural Architecture Search (ENAS) [34], Differentiable Architecture Search (DARTS) [29], Single Path One-Shot (SPOS) [15] and ProxylessNAS [6]. Nowadays, different software libraries implement this kind of algorithms. We can cite AutokerasFootnote 5 [21] and Neural Network Intelligence (NNI).Footnote 6 Autokeras implements one-shot NAS with Bayesian Optimization-like search strategy and NNI implements both HPO algorithms and classical and one-shot NAS algorithms.

6 Conclusions

This chapter illustrates the main concepts of the Deep Learning field. In particular, it discusses the most important architectures, i.e., Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). While MLPs are a necessary component for performing classification or regression, the other two types are more tailored to specific input types.

CNNs are extremely useful when dealing with input that shows a grid-like topology, such as images and, to some extent, time series. These types of input data are multi-dimensional tensors where each value is connected with the neighbouring values. CNNs owe their success to their capability of extracting features from the input data.

RNNs are well suited for sequential data, such as sentences. Indeed, this kind of architecture is often used for natural language processing, both for reading sentences as input, and generating sentences as output. The main feature of RNNs, absent in CNNs and MLPs, is that they are designed to keep memory of the (most important parts of the) whole input sequence. Therefore, unlike CNNs, RNNs can consider the sequence as a whole instead of considering only a bunch of neighbouring values.

All these architectures can be combined to create powerful tools. For example, to classify images given their content, a CNN needs to send its output to a MLP, which will use the extracted features to perform the classification. Another interesting combination is between CNNs and RNNs to label images. In this case, the extracted features are given as input to a RNN to compose a sentence describing the content of the image.

The architecture of the networks and the algorithms used to train the networks depend on many hyper-parameters, that must be optimized to obtain good results. Since the number of these hyper-parameters can be high, we need mechanisms to automatically tune their values. For this reason, in this chapter we have surveyed the relevant literature about hyper-parameters’ optimization.