1 Introduction

This chapter introduces the basics of deep learning that will be used in deep reinforcement learning. Readers who are already familiar with the fundamentals should feel free to skip this chapter. The book's content is meant to be self-contained, but readers may wish to consult other books, such as Bishop (2006) and Goodfellow et al. (2016), to understand some of the topics in depth. Unlike classical reinforcement learning, which uses analytical methods for function approximation, deep reinforcement learning relies on deep neural networks so that it can leverage large data volumes and increased computing resources. In general, there are two types of models.

Discriminative Models

study the conditional probability p(y|x) with input data x and a target label y. In other words, discriminative models predict the label y given the input data x. Discriminative models are mostly adopted in tasks such as classification and regression, which require discriminative judgement. More specifically, in classification, a model is designed to categorize the input data into specific classes from a set of given classes. Binary classification, the most fundamental classification task, predicts one class from two candidates. For example, in sentiment analysis (Maas et al. 2011), a piece of text is classified as either positive or negative. In contrast, in multi-label classification, the input data can be assigned several classes at the same time. In some cases, instead of identifying the class directly, a classification model needs to calculate a probability distribution over the classes. For example, the input data may have a probability of 80% of belonging to class A and a probability of 20% of belonging to class B. This probabilistic representation is mostly needed during training for optimization purposes. Deep learning has achieved great success on classification tasks such as image classification (Krizhevsky et al. 2009) and text classification (Yang et al. 2019). Unlike classification, which produces discrete class labels, regression studies continuous values. An example of regression is predicting future traffic speed based on historical traffic data (Liao et al. 2018a,b). Regression models remain discriminative models as long as they learn the conditional probability.

Generative Models

are designed to study the joint probability p(x, y). Generative models are usually used to generate observed data by learning the distribution of the observed data. For example, generative adversarial networks (GANs) (Goodfellow et al. 2014) are adopted to generate, reconstruct, and denoise images (Ledig et al. 2017; Yang et al. 2018). Nonetheless, some deep learning techniques like GANs have no explicit relationship with the distribution of the observed data but focus more on the similarity between generated samples and observed data. Meanwhile, generative models are also used for classification purposes, as in Naive Bayes (Ng and Jordan 2002; Rish et al. 2001). Although both generative models and discriminative models are used for classification, discriminative models only consider which label should be assigned given the observed data, while generative models try to learn the distribution of the observed data. For example, Naive Bayes studies the likelihood p(x|y), i.e. the probability that the observed data are generated given a label.

Most deep neural networks that have been explored are discriminative models, regardless of whether they were initially designed for discriminative or generative problems. This is because many generative problems in practice can be simplified to classification or regression problems. For example, question answering (Devlin et al. 2019) selects which part of the provided context is the answer to the given question; abstractive summarization (Zhang et al. 2019b) selects words from a vocabulary to assemble summaries based on the probability of each word. Both cases try to generate something, but one uses a classification approach and the other a regression approach.

Concretely, this chapter covers the mechanical components and techniques, such as the definitions of neurons, activation functions, and optimizers, that build up deep neural networks and deep learning applications. Fundamental deep neural networks such as the multilayer perceptron (MLP), convolutional neural networks (CNNs), and recurrent neural networks (RNNs) are also within the scope of this chapter. Finally, Sect. 1.10 introduces examples of implementing deep neural networks with TensorFlow and TensorLayer. Please refer to Goodfellow et al. (2016) for a more detailed introduction to deep learning.

2 Perceptron

2.1 One Output

A neuron (node) is the basic unit of deep neural networks. Originally, the neuron was proposed as an abstract representation of the real neuron in the brain, which receives electrical impulses from its dendrites. When this neuron is polarized enough, it sends an action potential spike via its axon to the adjacent neurons. In a real biological system, these steps do not take place at once but at a more granular scale, and spiking neural networks are better suited to describing the underlying biological processes. At the moment, however, the deep learning community relies more on deep neural networks (DNNs), also known as artificial neural networks (ANNs), whose neurons are formalized with numerical inputs and outputs. A neuron can have many output neurons in the next layer, and a neuron can also have many input neurons in the previous layer; this is a many-to-many relationship. A neuron in one layer aggregates the signals passed through from its input neurons in the previous layer. This aggregated signal is then passed through an activation function that determines the neuronal behavior. Concretely, if the aggregated signal is strong enough, the activation function will "activate" this neuron and pass forward a high value to the output neurons in the next layer. Otherwise, a low value will be passed forward instead (Fig. 1.1).

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} z = w_{1}x_{1}+ w_{2}x_{2} + w_{3}x_{3}. \end{array} \end{aligned} $$
(1.1)
Fig. 1.1 A neural network with three input neurons and one output neuron

A neural network can have an arbitrary number of neurons with arbitrary connections among them, but for ease of computation, the neurons are organized layer after layer. Typically, a simple neural network has at least two layers, namely the input layer and the output layer, as shown in Fig. 1.1. Such a network can be formalized by Eq. (1.1) and can help with simple decision-making. An example is helping a group of students decide whether or not they can play soccer on a given day based on the weather conditions. The decision may also rely on some other factors, such as the expense of the soccer field and the students' availability. If the weather condition has a higher impact on the decision, the corresponding weight (w) should have a greater absolute value. In contrast, factors of less importance should have weights with lower absolute values. If a weight is set to zero, the corresponding input factor is discarded in the decision-making process. This kind of neural network is also called a single-layer neural network or perceptron.

Fig. 1.2 A neural network with bias

2.2 Bias and Decision Boundary

A bias is an extra scalar added to the neuron to shift the value of the output. For example, Fig. 1.2 shows a single-layer neural network with a bias, which can be formalized as:

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} z = w_{1}x_{1}+ w_{2}x_{2}+ w_{3}x_{3} + b. \end{array} \end{aligned} $$
(1.2)

The bias can help a neural network fit the data better. For example, let us define a binary classification problem, in which the label y is 1 if the weighted sum z is positive and 0 otherwise:

$$\displaystyle \begin{aligned} y =\left\{ \begin{array}{@{}ll@{}} 1 & \text{when}~z>0 \\ 0 & \text{otherwise} \\ \end{array}\right. \end{aligned} $$
(1.3)

Then the distribution of data samples is shown in Fig. 1.3, and we need to find a set of weights and a bias that best fit the data. The decision boundary is defined to partition the data samples into the two classes of the binary classification. Formally, the decision boundary is \(\{(x_1, x_2, x_3) \mid w_1x_1 + w_2x_2 + w_3x_3 + b = 0\}\).

Fig. 1.3 Decision boundary of a linear model with two and three inputs. Left: \(z = w_1x_1 + w_2x_2 + b\). Right: \(z = w_1x_1 + w_2x_2 + w_3x_3 + b\)

Let us first simplify this problem to only two inputs, i.e. \(z = w_1x_1 + w_2x_2 + b\). As shown on the left-hand side of Fig. 1.3, without the bias component, i.e. b = 0, the decision boundary must cross the origin of the Cartesian coordinate system, as demonstrated by the blue line in the bottom-left corner. This apparently cannot fit the data distribution well, as the data samples of both classes fall on the same side of the boundary. If the bias is non-zero, the decision boundary crosses the two axes at \((0, -\frac {b}{w_2})\) and \((-\frac {b}{w_1},0)\), respectively, and it can fit the data distribution better if the weights and bias are well chosen.

If we come back to the original setting of the problem, where the neuron has three inputs, i.e. \(z = w_1x_1 + w_2x_2 + w_3x_3 + b\), the decision boundary becomes a plane, as shown on the right-hand side of Fig. 1.3. In a linear model like the single-layer neural network defined in Eq. (1.2), the decision boundary is also called a hyperplane.

2.3 More Than One Output

A single-layer neural network can have multiple output neurons. Figure 1.4 shows an example of a single-layer neural network with two outputs, which are computed by Eq. (1.4). Since each output is connected with all of the inputs, the output layer is also called a dense layer, or fully connected (FC) layer:

$$\displaystyle \begin{aligned} \begin{aligned} {} z_{1} &= w_{11}x_{1} + w_{12}x_{2} + w_{13}x_{3} + b_1 \\ z_{2} &= w_{21}x_{1} + w_{22}x_{2} + w_{23}x_{3} + b_2. \\ \end{aligned} \end{aligned} $$
(1.4)
Fig. 1.4 A neural network with three input neurons and two output neurons

In practice, the dense layer can be represented by matrix multiplication:

$$\displaystyle \begin{aligned} \boldsymbol{z} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b} \end{aligned} $$
(1.5)

where \(\boldsymbol {W}\in \mathbb {R}^{m\times n}\) is a matrix representing the weights and \(\boldsymbol {z}\in \mathbb {R}^{m}, \boldsymbol {x}\in \mathbb {R}^{n}, \boldsymbol {b}\in \mathbb {R}^{m}\) are column vectors representing the outputs, inputs, and biases, respectively. In the example of Eq. (1.4), m = 2 and n = 3:

$$\displaystyle \begin{aligned} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23}\\ \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \end{aligned} $$
(1.6)
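As a concrete illustration of Eqs. (1.5) and (1.6), the following minimal NumPy sketch computes the two outputs of a dense layer; the weight, bias, and input values are arbitrary and chosen only for this example. (The book's own implementations in Sect. 1.10 use TensorFlow and TensorLayer; plain NumPy is used here to keep the sketches self-contained.)

```python
import numpy as np

# Arbitrary example values: m = 2 output neurons, n = 3 input neurons.
W = np.array([[0.2, -0.5, 0.1],
              [0.7, 0.3, -0.4]])  # weights, shape (2, 3)
b = np.array([0.1, -0.2])         # one bias per output neuron
x = np.array([1.0, 2.0, 3.0])     # input vector

z = W @ x + b                     # Eq. (1.5): z = Wx + b
print(z)                          # [-0.4 -0.1]
```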

3 Multilayer Perceptron (MLP)

A multilayer perceptron (MLP) (Rosenblatt 1958; Ruck et al. 1990) extends the single dense layer to at least two dense layers. Figure 1.5 presents an MLP consisting of four dense layers. The three layers between the input and output layers are "hidden" because they are typically not accessible from outside the network, and we will refer to them as the hidden layers. Compared with a network with a single dense layer, an MLP can fit more complex data. In other words, an MLP has a stronger learning capability than a single-layer neural network. However, more hidden layers in an MLP do not necessarily lead to stronger learning capacity. The universal approximation theorem states that a feedforward network with one hidden layer (e.g., an MLP with one hidden layer) and any squashing activation function (e.g., sigmoid or tanh) can approximate any Borel measurable function, given that the hidden layer has sufficient hidden units (Samuel 1959; Hornik et al. 1989; Goodfellow et al. 2016). However, in practice, such a network can be difficult to train or prone to overfitting if the hidden layer is extremely large. Therefore, deep neural networks, including MLPs, typically have several hidden layers.

Fig. 1.5 An example of a multilayer perceptron (MLP) with three hidden layers and one output layer. The neurons are represented by \(a^l_i\), where l is the layer index and i is the output index

We start with logic operations to demonstrate how a network approximates a function. The logic operations, including AND, OR, NOR, NAND, XNOR, and XOR, take two binary numbers and return either zero or one. For example, AND returns one if and only if the two binary numbers are both one. Simple logic operations can be easily approximated by the perceptron defined by Eq. (1.7).

$$\displaystyle \begin{aligned} f(\boldsymbol{x})=\left\{ \begin{array}{@{}ll@{}} {} 1 & \text{ if}\ z>0 \\ 0 & \text{ otherwise} \\ \end{array}\right. \text{where}~~~z = w_1x_1 + w_2x_2 + b \end{aligned} $$
(1.7)

Figure 1.6 shows that hyperplanes defined by the perceptron can easily be found to separate the points labeled zero from those labeled one for AND, OR, NOR, and NAND. However, it is not possible to do the same for XOR or XNOR.

Fig. 1.6 Top left: the perceptron with two inputs and one output. The rest: hyperplanes can be found to separate the points labeled zero (green) from those labeled one (orange) for AND, OR, NOR, and NAND, but no hyperplane defined by the perceptron can be found for XOR or XNOR

XOR cannot be approximated by a linear model working directly on the original inputs \(x_1, x_2\), like the perceptron, so we need to transform the inputs first. As an example, we use an MLP with one hidden layer, as shown in Fig. 1.7, to approximate XOR. This MLP first transforms the inputs \(x_1, x_2\) into a new space by approximating the logic operations OR and NAND; then, in the transformed space, the points are linearly separable by an approximation of AND. The transformed space is also called a feature space, and this example shows how learning features can improve the learning capacity of a model.

Fig. 1.7 Left: an MLP that approximates XOR. Middle and right: transformation from the original data space to the feature space, where the data points are linearly separable
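To make this construction concrete, here is a minimal NumPy sketch of the MLP in Fig. 1.7, using the step activation of Eq. (1.7). The particular weights and biases are one valid hand-picked choice, not the only one.

```python
import numpy as np

def step(z):
    """The perceptron activation of Eq. (1.7): 1 if z > 0, else 0."""
    return (z > 0).astype(int)

# Hidden unit 1 approximates OR, hidden unit 2 approximates NAND,
# and the output unit approximates AND of the two hidden outputs.
W_hidden = np.array([[1.0, 1.0],     # OR
                     [-1.0, -1.0]])  # NAND
b_hidden = np.array([-0.5, 1.5])
w_out = np.array([1.0, 1.0])         # AND
b_out = -1.5

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W_hidden @ np.array([x1, x2]) + b_hidden)  # feature space
    y = step(w_out @ h + b_out)
    print(f"XOR({x1}, {x2}) = {y}")  # prints 0, 1, 1, 0
```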

4 Activation Functions

Matrix addition and multiplication are both linear operators, but the learning capability of a purely linear model is rather limited. For example, a linear model cannot easily approximate a cosine function. Most real-world problems that deep neural networks are applied to cannot be simplified to a linear transformation, so non-linearity is important for deep neural networks. In practice, the non-linearity of deep neural networks is introduced by activation functions, which are typically element-wise operations. In addition, activation functions are necessary when a model needs to produce probability vectors instead of vectors with arbitrary values. The choice of activation function varies across applications. Even though some functions work well in most deep learning applications, other functions may perform better on a case-by-case basis. Therefore, the design of activation functions remains an active research area. This section introduces four commonly used activation functions, namely sigmoid, tanh, ReLU, and softmax (Fig. 1.8).

Fig. 1.8 Demonstration of three element-wise activation functions: sigmoid, tanh, and ReLU. The sigmoid constrains values between 0 and 1, while the tanh returns values between −1 and 1. The ReLU returns zero when the input is non-positive and is equivalent to f(x) = x when the input is positive

The logistic sigmoid outputs values ranging between 0 and 1, as defined by Eq. (1.8). The sigmoid function can be used at the output layer for classification purposes. For example, a binary classifier with one output neuron uses the sigmoid to constrain the output value between 0 and 1 and then converts it to a discrete class label (either 0 or 1) by using a threshold such as 0.5.

$$\displaystyle \begin{aligned} f(\boldsymbol{z}) = \frac{1}{1+e^{-\boldsymbol{z}}}. \end{aligned} $$
(1.8)

Similar to the sigmoid function, the hyperbolic tangent (tanh) constrains output values to a limited range between −1 and 1, as defined by Eq. (1.9). The tanh function can be used in the hidden layers (Glorot et al. 2011) to provide non-linearity. It can also be used in the output layer, e.g. when generating images whose pixel values are normalized to the range between −1 and 1.

$$\displaystyle \begin{aligned} f(\boldsymbol{z}) = \frac{e^{\boldsymbol{z}}-e^{-\boldsymbol{z}}}{e^{\boldsymbol{z}}+e^{-\boldsymbol{z}}}. \end{aligned} $$
(1.9)

The rectified linear unit (ReLU), also known as the rectifier, is defined by Eq. (1.10). The study by Glorot et al. (2011) shows that ReLU is more promising than sigmoid and tanh, and ReLU has also been widely adopted in recent works (He et al. 2016; Cao et al. 2017; Noh et al. 2015). The empirical advantages of ReLU are:

  • Easier to implement and compute: ReLU requires only a comparison with zero, after which the activation is set to either zero or z accordingly. In contrast, sigmoid and tanh require computing the exponential function, which is more expensive, especially in large networks.

  • Easier for a network to optimize: the ReLU function is close to linear, consisting of two linear pieces. This property keeps the gradients large and consistent. The gradient of an active neuron under ReLU is always one, whereas the gradient of a neuron under sigmoid or tanh vanishes as the activated value approaches the saturation limits (i.e., −1, 0, or 1).

$$\displaystyle \begin{aligned} f(\boldsymbol{z})=\left\{ \begin{array}{@{}ll@{}} 0 & \ \text{when} \ \boldsymbol{z}\leq 0 \\ \boldsymbol{z} & \ \text{when} \ \boldsymbol{z}>0\\ \end{array}\right. \end{aligned} $$
(1.10)

However, merely setting negative values to zero in ReLU can lead to information loss. If a neuron constantly outputs zero, it also receives zero gradient, so it will always output zero in the future and is unlikely to recover; this can happen because of an inappropriate learning rate or a negative bias. The work by Xu et al. (2015) studies a solution to this, another activation function called leaky ReLU, which is defined in Eq. (1.11). The scalar α in this equation is a small positive value that controls the slope (e.g., 0.01 or 0.02), so that a little information from the negative range can be retained.

$$\displaystyle \begin{aligned} f(\boldsymbol{z})=\left\{ \begin{array}{@{}ll@{}} \alpha\boldsymbol{z} & \text{when}\ \boldsymbol{z}\leq 0 \\ \boldsymbol{z} & \text{when}\ \boldsymbol{z}>0\\ \end{array}\right. \end{aligned} $$
(1.11)

The parametric ReLU (PReLU) (He et al. 2015) is similar to the leaky ReLU except that it treats α as a trainable parameter. There is no clear evidence showing that any one of ReLU, leaky ReLU, or PReLU is significantly better than the others, since the choice varies across scenarios.

Unlike the activation functions mentioned above, the softmax function, defined by Eq. (1.12), normalizes over all values of the previous layer's output. The softmax function first computes the exponential \(e^{z_i}\) of each entry and then divides each result by the sum of all the exponentials.

$$\displaystyle \begin{aligned} f(\boldsymbol{z})_{i} = \frac{e^{z_{i}}}{\sum_{k=1}^{K}e^{z_{k}}} \end{aligned} $$
(1.12)

In practice, the softmax function is typically used only in the output layer, to normalize the output vector z into a probability vector in which each entry is non-negative and all entries sum to one. Therefore, the softmax function is widely used for classification.
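The activation functions above (plus the leaky ReLU of Eq. (1.11)) are simple to implement directly; the NumPy sketch below mirrors Eqs. (1.8)–(1.12). Subtracting the maximum inside the softmax is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # Eq. (1.8): outputs in (0, 1)

def tanh(z):
    return np.tanh(z)                       # Eq. (1.9): outputs in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)               # Eq. (1.10)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)    # Eq. (1.11)

def softmax(z):
    e = np.exp(z - np.max(z))               # stability trick
    return e / e.sum()                      # Eq. (1.12)

z = np.array([-2.0, 0.0, 3.0])
print(softmax(z))   # non-negative entries that sum to one
```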

5 Loss Functions

In deep learning, loss functions are defined to quantify an error, also known as the loss value or cost, between the prediction and target (i.e., ground truth, gold standard). The loss value is normally used as the objective to optimize the parameters of neural networks, such as the weights and biases. This section introduces some commonly used loss functions and Sect. 1.6 will introduce how to optimize the parameters based on the loss values.

5.1 Cross-Entropy Loss

Before we introduce the cross-entropy loss, we start with a closely related concept, the Kullback–Leibler (KL) divergence. The KL divergence measures the difference between two distributions P(x) and Q(x):

$$\displaystyle \begin{aligned} D_{\text{KL}}(P\|Q) = \mathbb{E}_{x\sim P}\bigg[\log \frac{P(x)}{Q(x)}\bigg] = \mathbb{E}_{x\sim P}[\log P(x) - \log Q(x)] \end{aligned} $$
(1.13)

The KL divergence is non-negative and equals 0 if and only if P and Q are the same distribution. Since the first term of the KL divergence, \(\mathbb{E}_{x\sim P}[\log P(x)]\), does not depend on Q, we can drop it and obtain the cross-entropy:

$$\displaystyle \begin{aligned} H(P, Q) = - \mathbb{E}_{x\sim P}\log Q(x) \end{aligned} $$
(1.14)

Therefore, minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence. As mentioned before, in some deep learning applications, e.g. classification, deep neural networks calculate a probability distribution over classes in practice instead of identifying the target class directly. Therefore, we can use the cross-entropy to measure how well the predicted distribution matches the target distribution and then update the network accordingly.

We start with binary classification as an example. In binary classification, for each input data sample \(x_i\) with target \(y_i\) (i.e., 0 or 1), a model needs to predict the probability of each candidate class, \(\hat {y}_{i,1}\) and \(\hat {y}_{i,2}\). Since \(\hat {y}_{i,1} + \hat {y}_{i,2} = 1\), we can rewrite the prediction as a single value \(\hat {y}_{i}\), the probability of one class, so the probability of the other class is \(1 - \hat {y}_{i}\). Therefore, a neural network for binary classification typically has only one output neuron (with sigmoid), and, following the definition of cross-entropy, we have:

$$\displaystyle \begin{aligned} \begin{aligned} \mathcal{L} &= - \frac{1}{N} \sum_{i=1}^N \bigg(y_{i}\log\hat{y}_{i} + (1 - y_{i})\log(1 - \hat{y}_{i})\bigg), \end{aligned} \end{aligned} $$
(1.15)

where N represents the total number of data samples. Since \(y_i\) is either 0 or 1, only one of \(y_{i}\log \hat {y}_{i}\) and \((1 - y_{i})\log (1 - \hat {y}_{i})\) is retained for each data sample. If \(~\forall i,~y_i=\hat {y}_i\), the cross-entropy loss is zero.

In multinomial classification, where each data sample \(x_i\) is classified into one of three or more candidate classes, a model predicts the probability of each class, \(\{\hat {y}_{i,1}, \hat {y}_{i,2}, \dots , \hat {y}_{i,M}\}\), where M ≥ 3 and \(\sum _{j=1}^M \hat {y}_{i,j}=1\). The target of each data sample is referred to as \(c_i\), an integer in [1, M], which can be converted to a one-hot vector \(\boldsymbol{y}_i = [y_{i,1}, y_{i,2}, \dots, y_{i,M}]\), where only \(y_{i, c_i}=1\) and all other entries are zero. Then, we can write the cross-entropy loss for multinomial classification as below:

$$\displaystyle \begin{aligned} \mathcal{L} &= - \frac{1}{N} \sum_{i=1}^N \sum_{j=1}^M y_{i,j}\log \hat{y}_{i,j} = - \frac{1}{N} \sum_{i=1}^N (0 + \dots + y_{i, c_i}\log \hat{y}_{i, c_i} + \dots + 0)\\ &= - \frac{1}{N} \sum_{i=1}^N \log \hat{y}_{i, c_i}. \end{aligned} $$
(1.16)
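The following NumPy sketch evaluates Eqs. (1.15) and (1.16) on toy values. The small constant eps, which keeps probabilities away from zero before the logarithm, is a common implementation detail rather than part of the definitions, and the class indices are zero-based here whereas the text counts classes from 1.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Eq. (1.15): y contains 0/1 targets, y_hat the predicted probabilities.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def multinomial_cross_entropy(c, y_hat, eps=1e-12):
    # Eq. (1.16): c[i] is the (zero-based) target class of sample i,
    # y_hat[i] is the predicted probability vector of sample i.
    probs = np.clip(y_hat[np.arange(len(c)), c], eps, 1.0)
    return -np.mean(np.log(probs))

print(binary_cross_entropy(np.array([1, 0, 1]),
                           np.array([0.9, 0.2, 0.6])))
print(multinomial_cross_entropy(np.array([0, 2]),
                                np.array([[0.7, 0.2, 0.1],
                                          [0.1, 0.3, 0.6]])))
```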

5.2 \(\mathcal {L}_{p}\) Norm

Given a vector x, the p-norm measures its scale, such that a vector with larger values has a larger scale. It is defined as follows, where p is an integer greater than or equal to 1.

$$\displaystyle \begin{aligned} \|\boldsymbol{x}\|_{p} = \left(\sum_{i=1}^{N}|x_{i}|^{p}\right)^{1/p}, \quad \text{i.e.,} \quad \|\boldsymbol{x}\|_{p}^{p} = \sum_{i=1}^{N}|x_{i}|^{p} \end{aligned} $$
(1.17)

In deep learning, a p-norm can be used to measure the difference between two vectors, written as \(\mathcal {L}_{p}\) as in Eq. (1.18), where \(\boldsymbol{y}\) and \(\hat {\boldsymbol {y}}\) are the target and prediction, respectively.

$$\displaystyle \begin{aligned} \mathcal{L}_{p} = \|\boldsymbol{y}-\hat{\boldsymbol{y}}\|_{p}^{p} = \sum_{i=1}^{N}|y_{i}-\hat{y}_{i}|^{p}. \end{aligned} $$
(1.18)

5.3 Mean Squared Error

The mean squared error (MSE) is the averaged \(\mathcal {L}_{2}\) norm as defined by Eq. (1.19). The MSE can be used for regression problems in which the outputs of a neural network are continuous values. For example, the difference between two images can be measured by MSE between pixels of the two images.

$$\displaystyle \begin{aligned} \mathcal{L} = \frac{1}{N}\|\boldsymbol{y}-\hat{\boldsymbol{y}}\|^{2}_{2} = \frac{1}{N}\sum_{i=1}^N(y_{i}-\hat{y}_{i})^{2}, \end{aligned} $$
(1.19)

where N is the number of data samples, and \(\boldsymbol{y}\) and \(\hat {\boldsymbol {y}}\) are the target and prediction, respectively.

5.4 Mean Absolute Error

Similar to MSE, the mean absolute error (MAE) can also be used for regression problems and is defined as the averaged \(\mathcal {L}_{1}\) norm.

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}|y_{i}-\hat{y}_{i}| {} \end{array} \end{aligned} $$
(1.20)

Both MSE and MAE minimize the difference between \(\boldsymbol{y}\) and \(\hat {\boldsymbol {y}}\). MSE has the nicer mathematical property: it is differentiable everywhere, which makes it easy to compute the partial derivatives required by gradient descent. In contrast, since the absolute value is not differentiable at \(y_i = \hat {y}_i\), the partial derivative of MAE needs to work around this case. In addition, when the difference between \(y_i\) and \(\hat {y}_i\) is greater than 1, MSE yields a larger error value than MAE (i.e., \((y_i-\hat {y}_i)^2\) vs \(|y_i-\hat {y}_i|\)), which can lead to quicker optimization of a network.
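Both losses are one-liners in NumPy, as in the sketch below; the toy values illustrate how a single large difference dominates MSE more than MAE.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)     # Eq. (1.19)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))    # Eq. (1.20)

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 1.0, 5.0])
# The difference of 2 on the last sample dominates MSE.
print(mse(y, y_hat))   # 1.75
print(mae(y, y_hat))   # 1.1666...
```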

6 Optimization

In this section we describe the optimization of deep neural networks, or in other words, how the parameters of deep neural networks are trained. This section covers back-propagation, gradient descent, stochastic gradient descent, and the selection of hyper-parameters.

6.1 Gradient Descent and Error Back-Propagation

Given a neural network and a loss function, training the neural network means learning its parameters θ so that the loss \(\mathcal {L}\) is minimized. Finding the minimum by directly searching for \(\boldsymbol {\theta }~~s.t.~~\bigtriangledown _{\boldsymbol {\theta }}\mathcal {L}=0\) in a brute-force fashion is infeasible in practice, especially when the function is as complex as a deep neural network. Therefore, we consider a process that approaches the minimum in small steps, a technique called gradient descent.

Figure 1.9 illustrates two examples of gradient descent. The learning process of gradient descent starts from a randomly picked point, and the loss \(\mathcal {L}\) decreases along with the updates of the parameters, as denoted by the red dotted path. Similarly, in a neural network, the parameters are first randomly initialized and then updated at each step based on the partial derivative \(\frac {\partial \mathcal {L}}{\partial \boldsymbol {\theta }}\). More specifically, the parameters are updated iteratively by \(\boldsymbol {\theta }:=\boldsymbol {\theta }-\alpha \frac {\partial \mathcal {L}}{\partial \boldsymbol {\theta }}\), where α is the learning rate of each step and θ mostly consists of the weights W and biases b of each layer.

Fig. 1.9 Examples of gradient descent with one trainable parameter (left) and two trainable parameters (right). In gradient descent, the learning process starts from a randomly picked point. With the parameter updates shown by the red arrows, the loss \(\mathcal {L}\) gradually decreases toward a minimum. Note that there is no guarantee that gradient descent finds the global minimum, but in most cases a local minimum is approached

Back-Propagation

(Rumelhart et al. 1986; LeCun et al. 2015) is a technique to compute the partial derivative \(\frac {\partial \mathcal {L}}{\partial \boldsymbol {\theta }}\) in the network. To make the computation of \(\frac {\partial \mathcal {L}}{\partial \boldsymbol {\theta }}\) clearer, we introduce an intermediate value \(\boldsymbol {\delta }=\frac {\partial \mathcal {L}}{\partial \boldsymbol {z}}\), which is the partial derivative of the loss \(\mathcal {L}\) with respect to the layer’s output z. Then, the partial derivatives of the loss \(\mathcal {L}\) with respect to each parameter, which assemble \(\frac {\partial \mathcal {L}}{\partial \boldsymbol {\theta }}\), are computed based on the intermediate value δ.

The layers are indexed as \(l = 1, 2, \dots, L\), where L is the index of the output layer. Each layer has an output \(\boldsymbol{z}^l\), an intermediate value \(\boldsymbol {\delta }^l=\frac {\partial \mathcal {L}}{\partial \boldsymbol {z}^l}\), and an activation output \(\boldsymbol{a}^l = f(\boldsymbol{z}^l)\) (where f is the activation function). We use an MLP with the MSE loss and the sigmoid activation function as an example. Given \(\boldsymbol{z}^l = \boldsymbol{W}^l\boldsymbol{a}^{l-1} + \boldsymbol{b}^l\), \(\boldsymbol {a}^l=f(\boldsymbol {z}^l)=\frac {1}{1+e^{-\boldsymbol {z}^l}}\), and \(\mathcal {L}=\frac {1}{2}\|\boldsymbol {y}-\boldsymbol {a}^{L}\|{ }^2_2\), the partial derivative of the activation output with respect to the layer output is \(\frac {\partial \boldsymbol {a}^l}{\partial \boldsymbol {z}^l}=f'(\boldsymbol {z}^l)=f(\boldsymbol {z}^l)(1-f(\boldsymbol {z}^l))=\boldsymbol {a}^l(1-\boldsymbol {a}^l)\), and the partial derivative of the loss with respect to the activation output is \(\frac {\partial \mathcal {L}}{\partial \boldsymbol {a}^{L}}=(\boldsymbol {a}^{L}-\boldsymbol {y})\). To compute the partial derivative of the loss with respect to the output layer's output, we apply the chain rule as follows:

  • \(\boldsymbol {\delta }^{L} =\frac {\partial \mathcal {L}}{\partial \boldsymbol {z}^{L}} =\frac {\partial \mathcal {L}}{\partial \boldsymbol {a}^{L}}\frac {\partial \boldsymbol {a}^L}{\partial \boldsymbol {z}^{L}}=\left (\boldsymbol {a}^L-\boldsymbol {y}\right )\odot \left (\boldsymbol {a}^L\left (1-\boldsymbol {a}^L\right )\right )\)

Then, the partial derivative of the loss with respect to all the other layers’ outputs can be computed recursively as follows, where l = 1, 2, …, L − 1.

  • Given \(\boldsymbol{z}^{l+1} = \boldsymbol{W}^{l+1}\boldsymbol{a}^{l} + \boldsymbol{b}^{l+1}\)

  • Then \(\boldsymbol {\delta }^{l} =\frac {\partial \mathcal {L}}{\partial \boldsymbol {z}^{l}} =\frac {\partial \mathcal {L}}{\partial \boldsymbol {z}^{l+1}}\frac {\partial \boldsymbol {z}^{l+1}}{\partial \boldsymbol {a}^{l}}\frac {\partial \boldsymbol {a}^{l}}{\partial \boldsymbol {z}^{l}} =\left (\boldsymbol {W}^{l+1}\right )^T\boldsymbol {\delta }^{l+1}\odot \left (\boldsymbol {a}^l\left (1-\boldsymbol {a}^l\right )\right )\)

The second step of back-propagation is to compute the partial derivatives of the loss with respect to the parameters, \(\frac {\partial \mathcal {L}}{\partial \boldsymbol {W}^l}\) and \(\frac {\partial \mathcal {L}}{\partial \boldsymbol {b}^l}\), of each layer based on the intermediate value \(\boldsymbol{\delta}^l\).

  • Given \(\boldsymbol{z}^l = \boldsymbol{W}^l\boldsymbol{a}^{l-1} + \boldsymbol{b}^l\), we have \(\frac {\partial \boldsymbol {z}^{l}}{\partial \boldsymbol {W}^l}=\boldsymbol {a}^{l-1}\) and \(\frac {\partial \boldsymbol {z}^{l}}{\partial \boldsymbol {b}^l}=1\)

  • Then \(\frac {\partial \mathcal {L}}{\partial \boldsymbol {W}^l}=\frac {\partial \mathcal {L}}{\partial \boldsymbol {z}^l}\frac {\partial \boldsymbol {z}^l}{\partial \boldsymbol {W}^l}=\boldsymbol {\delta }^l\left (\boldsymbol {a}^{l-1}\right )^T\), \(\frac {\partial \mathcal {L}}{\partial \boldsymbol {b}^l}=\frac {\partial \mathcal {L}}{\partial \boldsymbol {z}^l}\frac {\partial \boldsymbol {z}^l}{\partial \boldsymbol {b}^l}=\boldsymbol {\delta }^l\)

Finally, we use the \(\frac {\partial \mathcal {L}}{\partial \boldsymbol {W}^l}\) and \(\frac {\partial \mathcal {L}}{\partial \boldsymbol {b}^l}\) to update the parameters W l and b l as follows:

  • \(\boldsymbol {W}^l:=\boldsymbol {W}^l-\alpha \frac {\partial \mathcal {L}}{\partial \boldsymbol {W}^l}\)

  • \(\boldsymbol {b}^l:=\boldsymbol {b}^l-\alpha \frac {\partial \mathcal {L}}{\partial \boldsymbol {b}^l}\)

With the partial derivative \(\frac {\partial \mathcal {L}}{\partial \boldsymbol {\theta }}\), gradient descent updates the parameters iteratively and converges to a minimum point of the loss function, as in Fig. 1.9. In practice, the converged point is typically a local minimum rather than the global one. However, as deep neural networks offer good representation capacity, the local minima tend to be close to the global minimum (Goodfellow et al. 2016).
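The derivation above translates into a compact program. The following NumPy sketch trains a one-hidden-layer MLP with sigmoid activations and the MSE loss on a single sample; the layer sizes, learning rate, and random initialization are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))                # input a^0
y = np.array([[0.0], [1.0]])               # target
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))
alpha = 0.5                                # learning rate

for _ in range(100):
    # Forward pass
    z1 = W1 @ x + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
    # Backward pass: delta^L and delta^l as derived above
    delta2 = (a2 - y) * a2 * (1 - a2)
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)
    # Gradient descent updates: W^l -= alpha * delta^l (a^{l-1})^T
    W2 -= alpha * delta2 @ a1.T; b2 -= alpha * delta2
    W1 -= alpha * delta1 @ x.T;  b1 -= alpha * delta1

print(0.5 * np.sum((y - a2) ** 2))         # the loss has decreased
```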

In gradient descent, the computation of the loss value \(\mathcal {L}\) in each iteration can be expensive if the size of the dataset (i.e., the total number of data samples) N is large. Given the MSE in the example above, we can expand it to:

$$\displaystyle \begin{aligned} \mathcal{L}=\frac{1}{2}\|\boldsymbol{y}-\boldsymbol{a}^{L}\|{}^2_2=\frac{1}{2}\sum_{i=1}^N\left(y_i-a^L_i\right)^2 \end{aligned} $$
(1.21)

In practice, the size of a dataset can be more than tens of thousands, so gradient descent suffers from inefficiency due to the computation of \(\mathcal {L}\). To tackle this problem, we introduce stochastic gradient descent, which computes \(\mathcal {L}\) on a small batch of data samples.

6.2 Stochastic Gradient Descent and Adaptive Learning Rate

Instead of computing the loss \(\mathcal {L}\) over all training data in each iteration, stochastic gradient descent (SGD) (Bottou and Bousquet 2007) randomly selects a small number of data samples from the training set. This small set of data samples is called a mini-batch, and the number of data samples in the mini-batch is referred to as the batch size. We can rewrite Eq. (1.21) with batch size B, where B ≪ N, so that the computation of \(\mathcal {L}\) is much more efficient.

$$\displaystyle \begin{aligned} \mathcal{L}=\frac{1}{2}\|\boldsymbol{y}-\boldsymbol{a}^{L}\|{}^2_2=\frac{1}{2}\sum_{i=1}^B\left(y_i-a^L_i\right)^2 \end{aligned} $$
(1.22)

The training process of stochastic gradient descent is outlined in Algorithm 1. If the parameters are updated a sufficient number of times (i.e., over sufficient training steps/iterations), the mini-batches can cover the entire training set.

Algorithm 1 The training process of stochastic gradient descent (SGD)

The learning rate controls the step size of each update in SGD. If the learning rate is too large, SGD may fail to find the minimum, as shown in Fig. 1.10. If the learning rate is too small, SGD can be slow to converge (Fig. 1.10) or become stuck in a local minimum with high error (Fig. 1.9). Therefore, it is difficult to determine a proper fixed learning rate. Recent studies have proposed adaptive learning rate methods, such as Adam (Kingma and Ba 2014), RMSProp (Tieleman and Hinton 2017), and Adagrad (Duchi et al. 2011), which speed up the training process by automatically adapting the learning rate. Adam is one of the most frequently used algorithms. Instead of using the gradients to update the parameters directly, Adam computes a running average of the gradients and of the second moment of the gradients and uses them to update the parameters, as shown in Algorithm 2. The \(\beta_1\) and \(\beta_2\) terms are the forgetting factors, also known as momentum, for the gradients and the second moment of the gradients, respectively. By default, \(\beta_1\) is 0.9 and \(\beta_2\) is 0.999 (Kingma and Ba 2014).

Fig. 1.10 A large learning rate may accelerate the training process but can also make it hard to train a model with ideal parameters. In the left figure, which has a larger learning rate than the right figure, the loss value may increase after a parameter update, and it can be hard to approach the minimum. In contrast, in the right figure, which has a lower learning rate, the loss value decreases consistently but more slowly

Algorithm 2 The training process of Adam optimization
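As a sketch of the update rule summarized in Algorithm 2, the following follows the formulas of Kingma and Ba (2014), including the bias-correction terms; the toy usage at the end minimizes f(θ) = θ², whose gradient 2θ is supplied by hand.

```python
import numpy as np

def adam_update(theta, grad, m, v, t, alpha=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a parameter array theta at iteration t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad        # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running average of 2nd moment
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([5.0])                       # minimize f(theta) = theta^2
m = v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_update(theta, 2 * theta, m, v, t, alpha=0.05)
print(theta)                                  # close to 0
```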

6.3 Hyper-Parameter Selection

In deep learning, hyper-parameters refer to the settings of a model, such as the number of layers, and the settings of the training process, such as the number of steps, batch size, and learning rate. These settings can significantly affect the performance of a model, so selecting these hyper-parameters appropriately is essential to obtain an ideal model.

To evaluate the performance of different hyper-parameters, the data is usually split into training, validation, and testing sets. Then, multiple hyper-parameter settings are applied on the training set, and the resulting models are evaluated on the validation set. Finally, the model whose hyper-parameters perform best on the validation set is selected for a final evaluation on the testing set.

6.3.1 Cross-Validation

For a small dataset, splitting the data into training and testing sets may be problematic. If the training set is too small, the performance of a model can be harmed because there is insufficient training data. On the other hand, if the testing set is too small, the model cannot be adequately evaluated. To tackle this problem, cross-validation is introduced.

In k-fold cross-validation, a dataset is split into k non-overlapping subsets of the same size. The training/testing process is repeated k times and, each time, one of the subsets is selected for testing and the remainder for training. The final evaluation result is the average of the results across the k trials. Figure 1.11 illustrates an example of four-fold cross-validation.

Fig. 1.11 An example of four-fold cross-validation. The dataset is split into four subsets (each row is a subset, for demonstration purposes). In each trial, the blue subsets are the training set and the green subset is the testing set. The final evaluation result is the average over the four trials
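A minimal sketch of the k-fold split is given below; the train-and-evaluate step is omitted, and libraries such as scikit-learn provide equivalent, battle-tested utilities for this purpose.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k non-overlapping folds."""
    indices = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

for fold, (train_idx, test_idx) in enumerate(k_fold_indices(100, k=4)):
    # Train on train_idx and evaluate on test_idx here (omitted);
    # the folds partition the 100 samples into 75 training / 25 testing.
    print(fold, len(train_idx), len(test_idx))
```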

7 Regularization

Regularization refers to a collection of methods designed to make sure a model works well not only on the training set but also on testing data and new data. This section introduces the concept of overfitting and some regularization methods, including weight decay, dropout, and batch normalization.

7.1 Overfitting

A machine learning model is optimized to minimize the training error (i.e., the loss), but this cannot guarantee that the model also performs well on the testing data. If the model is optimized "overly," it may even have a significantly large testing error. This case is called overfitting. For example, in Fig. 1.12, the polynomial model represented by the dashed curve suffers from overfitting: it fits the training data accurately but fails to fit the testing data. Such an overfitted model can be unreliable in real-world applications, where there is always new data. In contrast, the linear model represented by the solid straight line has fewer parameters while offering a better fit for the testing data.

Fig. 1.12 A demonstration of overfitting. The blue dots represent training data, and the orange dots are testing data. Though the linear model represented by the solid line has a larger training error, it has a much smaller testing error than the polynomial model represented by the dashed curve. Thus, we can say the polynomial model suffers from overfitting

Underfitting is the opposite of overfitting: the model cannot even fit the training data, resulting in large errors on both the training and testing data. In practice, underfitting can be solved by using a larger model (more layers, more parameters, etc.), but solving overfitting is more challenging. The simplest way to alleviate overfitting is to use more training data, which is not always possible since data acquisition and data labeling can be expensive.

7.2 Weight Decay

Weight decay is a simple yet effective regularization method targeting the overfitting problem. It introduces a regularization term as a penalty to encourage parameters θ with smaller absolute values. For example, as Fig. 1.12 shows, if the parameters c to h of the polynomial model have smaller absolute values, the model will have a lower swing range and thus fit the data better. The loss function with the parameter norm penalty is defined as follows:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathcal{L}_{\mathrm{total}}=\mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})+\lambda \Omega(\theta), {} \end{array} \end{aligned} $$
(1.23)

where \(\mathcal {L}(\boldsymbol {y}, \hat {\boldsymbol {y}})\) is the original loss function computed from the target y and prediction \(\hat {\boldsymbol {y}}\), Ω is the parameter norm penalty function, and λ is a small value that controls the strength of the regularization. Two of the most commonly used parameter norm penalty functions are \(\mathcal {L}_{1}=\|\boldsymbol {W}\|_{1}\) and \(\mathcal {L}_{2}=\|\boldsymbol {W}\|{ }_{2}^{2}\). The parameters of deep neural networks often have absolute values smaller than 1, so \(\mathcal {L}_{1}\) can lead to a larger penalty than \(\mathcal {L}_{2}\), since \(|w| > w^2\) when \(|w| < 1\). Therefore, the loss function with \(\mathcal {L}_{1}\) encourages the parameters of a network to have rather small values, or even zeros. This enables the network to implicitly perform feature selection, i.e. discarding some input features by setting the corresponding parameters to zero or to some small values. As Fig. 1.13 shows, given two parameters \(w_1, w_2\) in the coordinate system, \(w_1^2 + w_2^2 = r^2\) is a circle with radius r and \(|w_1| + |w_2| = r\) is a square with diagonal length 2r, both demonstrated by the blue contour lines. The red contour lines indicate the original loss \(\mathcal {L}(\boldsymbol {y}, \hat {\boldsymbol {y}})\). The intersection points of the parameter norm penalties and the original loss, represented by the red crosses, indicate that \(\mathcal {L}_{1}\) is more likely to produce parameters valued zero than \(\mathcal {L}_{2}\), while \(\mathcal {L}_{2}\) may produce parameters with similar absolute values.

Fig. 1.13 Left: a demonstration of the contour lines of the original loss (red) and \(\mathcal {L}_2\) (blue). Right: a demonstration of the contour lines of the original loss (red) and \(\mathcal {L}_1\) (blue). The intersection points (red crosses) of the two contour lines in each sub-figure indicate that \(\mathcal {L}_1\) tends to produce parameters valued zero, while \(\mathcal {L}_2\) may produce parameters with similar absolute values
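In code, weight decay simply adds the penalty term of Eq. (1.23) to the original loss. The sketch below assumes the network's weight matrices are available as a list; the value of λ and the example numbers are arbitrary.

```python
import numpy as np

def loss_with_weight_decay(loss, weights, lam=1e-4, penalty="l2"):
    """Eq. (1.23): total loss = original loss + lambda * Omega(theta)."""
    if penalty == "l2":
        omega = sum(np.sum(W ** 2) for W in weights)       # ||W||_2^2
    else:
        omega = sum(np.sum(np.abs(W)) for W in weights)    # ||W||_1
    return loss + lam * omega

W1 = np.array([[0.5, -0.2], [0.1, 0.8]])
print(loss_with_weight_decay(loss=0.3, weights=[W1]))
print(loss_with_weight_decay(loss=0.3, weights=[W1], penalty="l1"))
```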

7.3 Dropout

Deep neural networks with large numbers of neurons can suffer from the co-adaptation of neurons, which can result in overfitting. Co-adaptation means that the neurons depend on each other: if one neuron fails, all dependent neurons may also fail, which can lead to the failure of the entire neural network. Dropout (Hinton et al. 2012; Srivastava et al. 2014) is a popular technique to address this problem by preventing the co-adaptation of neurons (i.e., parameters). To prevent this co-adaptation, during training, the hidden outputs are randomly set to zero, which resembles randomly disconnecting neurons from one layer to the next, as illustrated in Fig. 1.14. During back-propagation, a zero-valued output a makes the corresponding partial derivative of the loss with respect to the layer output, δ, zero; in other words, only the remaining connected neurons are updated. Therefore, dropout effectively trains different sub-networks while allowing all of them to share the same parameters (Hinton et al. 2012). During testing, dropout is disabled and no outputs are set to zero, which means that all the sub-networks work together to predict the final result (i.e., ensemble learning (Hara et al. 2016)). A theoretical justification was not presented in the original work by Hinton et al. (2012), but more recent studies explain its effectiveness through ensemble learning (Hara et al. 2016) and Bayesian approximation (Gal and Ghahramani 2016).

Fig. 1.14 Applying dropout to an MLP, where some neurons are randomly deactivated
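A minimal sketch of a dropout layer follows. It implements the "inverted dropout" variant commonly used in practice, which rescales the kept activations by 1/(1 − rate) during training so that, as described above, nothing needs to change at test time.

```python
import numpy as np

def dropout(a, rate=0.5, training=True, rng=np.random.default_rng(0)):
    """Randomly zero each hidden output with probability `rate`."""
    if not training:
        return a                              # dropout disabled at test time
    mask = rng.random(a.shape) >= rate        # 1 = keep, 0 = drop
    return a * mask / (1.0 - rate)            # rescale the kept activations

a = np.array([0.2, 0.9, 0.5, 0.7])
print(dropout(a))                  # some entries are set to zero
print(dropout(a, training=False))  # unchanged at test time
```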

7.4 Batch Normalization

Batch normalization (Ioffe and Szegedy 2015) normalizes the inputs of a layer to have a mean of 0 and a variance of 1, which can improve the performance of a neural network and its training stability. Specifically, during training, the batch normalization layer normalizes the batch inputs with the mean and variance of the current mini-batch, while maintaining moving averages of these statistics. During testing, the moving mean and variance are fixed and applied to normalize the inputs.
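The following sketch shows this training/testing behavior for inputs of shape (batch size, features). The learnable scale γ and shift β from Ioffe and Szegedy (2015) are included for completeness, and the momentum value is an arbitrary but typical choice.

```python
import numpy as np

def batch_norm(x, moving_mean, moving_var, gamma, beta,
               training=True, momentum=0.9, eps=1e-5):
    """Normalize a batch x of shape (batch_size, features)."""
    if training:
        mean, var = x.mean(axis=0), x.var(axis=0)    # batch statistics
        moving_mean = momentum * moving_mean + (1 - momentum) * mean
        moving_var = momentum * moving_var + (1 - momentum) * var
    else:
        mean, var = moving_mean, moving_var          # fixed at test time
    x_hat = (x - mean) / np.sqrt(var + eps)          # mean 0, variance 1
    return gamma * x_hat + beta, moving_mean, moving_var

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(8, 4))
out, mm, mv = batch_norm(x, np.zeros(4), np.ones(4),
                         gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.var(axis=0).round(6))  # ~0 and ~1
```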

Besides improving performance and stability, batch normalization provides regularization. Similar to dropout, which adds a random factor to the hidden values, the moving mean and variance of batch normalization introduce randomness, as they are updated at each iteration according to the random mini-batch. Therefore, the neural network is encouraged during training to be robust enough to deal with this variation (Fig. 1.15).

Fig. 1.15 An example of image data augmentation. The top-left image is the original image and the others are obtained by randomly flipping, rotating, shearing, shifting, and zooming the original image

7.5 Other Methods for Alleviating Overfitting

There are many other methods designed to prevent overfitting, such as early stopping and data augmentation. Early stopping terminates the training process early once it matches an empirical criterion, such as a threshold of accuracy on the validation set. Figure 1.16 shows that the testing loss may start to increase during training (i.e., overfitting starts), and early stopping can be applied so that the training process is terminated before that point. Data augmentation increases the size of the training data by augmenting the existing training data. For example, image data can be augmented by simply flipping, rotating, shifting, and zooming. Data augmentation methods that generate arbitrary but reasonable data can reduce overfitting and improve the performance of a model (Simonyan and Zisserman 2015; He et al. 2016; Howard et al. 2017; Dong et al. 2017b). As with images, audio can be augmented by adding noise or perturbation. A recent study by Ko et al. (2015) showed that audio data augmentation with speed perturbation can improve the performance of speech recognition algorithms.

Fig. 1.16 A demonstration of where overfitting starts. Early stopping can be applied so that the training process is terminated before the overfitting starts

However, it is not appropriate to apply similar augmenting transformations to textual data, since the order of words carries specific meaning. For example, "people like dogs" is not semantically equivalent to "dogs like people." A practical way to augment textual data is to rephrase sentences by replacing words with pre-defined synonyms (Zhang et al. 2015). Moreover, instead of augmenting the raw textual data, another study (Reed et al. 2016) interpolates the text embeddings of two random sentences so that the model is aware of the gaps in the text latent space.

8 Convolutional Neural Networks

Convolutional neural networks (CNNs) (LeCun et al. 1989) are a variant of the MLP and are particularly useful in computer vision (Krizhevsky et al. 2012; Simonyan and Zisserman 2015; He et al. 2016), time series prediction (van den Oord et al. 2016), natural language processing (Zhang et al. 2019a; Yin et al. 2017), and also reinforcement learning (Rusu et al. 2016; James et al. 2019). Many deployed real-world machine learning systems are built on CNNs, which often demonstrate far superior performance compared with conventional methods. In this section, we introduce two kinds of layers commonly used to construct CNNs, namely the convolutional layer and the pooling layer.

Convolutional Layer

The convolutional layer is the most distinguishing feature of CNNs. The idea of its design again stems from the study of the human brain, where an array of nearby neurons processes a subset of the visual input. Concretely, as Fig. 1.17 shows, the convolution volume uses four different neurons to process the same region of the input image. Different neurons can be responsible for different tasks, such as edge, color, or angle detection. A neuron in the convolution volume is locally connected rather than being connected to all units from the previous layer. Convolutional layers can also be stacked one after another, which means a convolutional layer can be applied to the output of another convolutional layer. A benefit of the convolutional layer is that it has far fewer connections to the previous layer than a dense layer, so it typically can be trained more quickly. Figure 1.17 also shows that each neuron in a convolutional layer contains the information of a small region across all channels. For example, if the input layer is an RGB image input layer, then a neuron in the convolutional layer holds the result of applying a filter to a small region of the image across all the RGB channels.

Fig. 1.17 Computation of the convolution volume from a sample image. There are four neurons applied to the same region in this example

Regarding the convolution operation inside the convolutional layer, it uses filters to extract various important features. Consider a layer whose input has height/width W. When we convolve the input with a filter of size F, we simply compute a dot product between the input and the filter values in a sliding-window fashion and then move on to apply the filter to the next block. The stride S describes how far consecutive input blocks are from each other. For instance, with a stride of two (S = 2), the filter is moved two elements at a time, essentially skipping one row/column. Lastly, in order to ensure that boundary values are well considered, we sometimes have to add zeros on the edges, which is called padding; we let the padding size be P. The output size of a convolutional layer can then be computed by

$$\displaystyle \begin{aligned} \Bigg\lfloor~~\frac{W - F + 2P}{S} + 1~~\Bigg\rfloor {} \end{aligned} $$
(1.24)

The output volume has the same depth (number of output channels) as the number of filters. Figure 1.18 shows a concrete example of the convolution operation. In this example, there is an image of size 4 × 4 (height × width) with 3 input channels (RGB), and 1 filter of size 3 × 3 × 3 (filter height × filter width × input channels) with stride S = 1 and padding P = 0. According to Eq. (1.24), the output height/width is (4 − 3 + 0)∕1 + 1 = 2. The depth of the output (number of output channels) is 1 since there is 1 filter. To compute the top-left value of the output, we first compute the dot products between the input image and the filter in each channel, which generates three values, and then sum up the three values. The convolution operation is a special case of \(\sum_i w_i x_i\), where \(w_i\) is non-zero only on a much smaller set. The output can then be passed through an activation function, which introduces non-linearity.

Fig. 1.18 Illustration of the convolution operation. In this example, 1 filter of size 3 × 3 × 3 (filter height × filter width × input channels) is applied to an image of size 4 × 4 (height × width) with 3 input channels (RGB). The dot products between the image and the filter are computed per channel, and the resulting values are summed up to produce the top-left value of the output
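The worked example of Fig. 1.18 can be reproduced with a naive implementation, as below. Strictly speaking, this sketch computes a cross-correlation, which is what most deep learning libraries implement under the name "convolution"; the random image and filter values are placeholders.

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Convolve one multi-channel image (H, W, C) with one filter (F, F, C),
    without padding."""
    H, W, C = image.shape
    F = kernel.shape[0]
    out_size = (H - F) // stride + 1            # Eq. (1.24) with P = 0
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            block = image[i*stride:i*stride+F, j*stride:j*stride+F, :]
            out[i, j] = np.sum(block * kernel)  # dot product across channels
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(4, 4, 3))    # 4 x 4 image with 3 channels (RGB)
kernel = rng.normal(size=(3, 3, 3))   # one 3 x 3 x 3 filter
print(conv2d_single(image, kernel).shape)  # (2, 2), as computed in the text
```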

Pooling Layer

Pooling takes advantage of the fact that, in images, neighboring pixels are similar, so proper down-sampling, such as retaining only the maximum or the average of a small region, is assumed to be beneficial for modeling. There are typically two types of pooling layers for reducing the dimensions, namely max-pooling and average-pooling. Figure 1.19 shows examples of max-pooling and average-pooling on a 4 × 4 input with a stride of 2. The pooling layer reduces the dimensions of the output significantly, which makes computation in the following layers more efficient. For example, there can be hundreds of channels after a convolutional layer; before the output is passed to a dense layer, reducing its dimensions by pooling is preferred so that the subsequent dense layer has a smaller computational workload.

Fig. 1.19 2 × 2 max-pooling and average-pooling examples with a stride of 2 on a 4 × 4 input
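Both pooling variants of Fig. 1.19 reduce a 4 × 4 input to 2 × 2 with a stride of 2, as in the following sketch on a single-channel input.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """2D pooling on a single-channel input x."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = reduce(x[i*stride:i*stride+size,
                                 j*stride:j*stride+size])
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, mode="max"))  # [[ 5.  7.] [13. 15.]]
print(pool2d(x, mode="avg"))  # [[ 2.5  4.5] [10.5 12.5]]
```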

Overall, the convolutional layer and the pooling layer, together with the dense layer, are the basic components for constructing CNNs. Figure 1.20 demonstrates a CNN with two convolutional layers, a max-pooling layer, and a dense layer. Note that activation functions can be applied to the output of the convolutional layers in the same way as for the dense layer.

Fig. 1.20 An example of a CNN with two convolutional layers, a max-pooling layer, and a dense layer. Figure created by NN-SVG (http://alexlenail.me/NN-SVG/LeNet.html)

CNNs adopt the idea of parameter sharing, which differs from the MLP. Sharing parameters across different parts of a model makes the model more efficient (fewer parameters and less memory) and able to handle variable data forms (different lengths and sizes). Recall that, in a dense layer, there is a weight matrix whose element \(w_{ij}\) denotes the connectivity between the i-th neuron in the previous layer and the j-th neuron in the current layer. In a convolutional layer, however, the filters are essentially the weights, and they are used repeatedly when the output values are computed. This repeated use of filters reduces the number of parameters needed in a convolutional layer, which is why a convolutional layer typically has far fewer parameters than a dense layer given similar input and output sizes.

Batch normalization (batch-norm layers) (Ioffe and Szegedy 2015) can be integrated with CNNs to accelerate training by reducing the internal covariate shift. As mentioned above, the input of a batch-norm layer is normalized by a mean and a variance that are computed independently of the other layers. Therefore, intuitively, batch normalization simplifies the interactions between layers in the gradient update and allows larger learning rates, which accelerates training.

LeNet (LeCun et al. 1998), AlexNet (Krizhevsky et al. 2012), and VGGNet (Simonyan and Zisserman 2015) are some popular CNNs. How to design the architecture of CNNs for a specific task or a general scenario is still an ongoing research topic. The design can be an empirically driven exercise and requires many trials. However, recent works on neural architecture search have provided more insights (Zoph and Le 2016; Zoph et al. 2018).

9 Recurrent Neural Networks

Recurrent neural networks (RNNs) (Rumelhart et al. 1986) are another class of deep learning architectures, designed to process sequential data. Unlike images, which can be represented by a grid of values, sequential data refers to a sequence of values \(\{x_1, x_2, \dots, x_n\}\), which is also a common data format. For example, a document is composed of a sequence of words, and the value of a stock over time can be represented by a sequence of stock prices.

An important feature of sequential data is the interaction among elements within the sequence. For example, provided with a snippet of text, a human reader may easily infer the content that comes next by reading only the beginning. However, modeling such interactions within the sequence becomes more challenging as the sequence grows longer. Therefore, RNNs should be able to effectively accumulate the information provided by the sequential data and adequately consider the impact of earlier values on later ones in the sequence.

The design of RNNs, like that of CNNs, also adopts parameter sharing. Parameter sharing allows the same weights to be used repeatedly across multiple positions in the input sequence. For example, an RNN should be able to learn that the sentences "Deep learning has been popular since the 2010s." and "Since the 2010s, deep learning has been popular." express the same meaning even though the positions of the words are different. Similarly, when a CNN is used to classify an image of a cat, the position of the cat in the image should not change the decision made by the CNN (Fig. 1.21).

Fig. 1.21 An illustration of the RNN architecture. The cell ingests the value \(\boldsymbol{x}_t\) and the previous hidden state \(\boldsymbol{h}_{t-1}\), and then outputs the new hidden state \(\boldsymbol{h}_t\)

Simple Cell

Similar to CNNs, which can process images of variable sizes, RNNs can easily be adjusted to process sequences of variable lengths. The idea of RNNs is to define a computation unit, referred to as a cell, which is applied repeatedly to the values in the sequence, one by one. The cell has a state that accumulates the information seen so far. At each step, the cell takes a value from the sequence and the previous state of the cell as inputs, and then generates a new state, which will be used in the next computation round. The simplest RNN cell applies a linear transformation, which can be defined as follows:

$$\displaystyle \begin{aligned} \boldsymbol{h}_t = \boldsymbol{W}[\boldsymbol{x}_t; \boldsymbol{h}_{t-1}] + \boldsymbol{b} \end{aligned} $$
(1.25)

In this equation, the previous state of the cell h t−1 is concatenated with the value x t and then multiplied by the linear kernel W. A bias b can also be added to the state. An RNN constructs a deep computational graph in which the linear kernel is multiplied repeatedly. Such a deep computational graph may cause exploding gradients if the eigenvalues of W are greater than 1 in magnitude, or vanishing gradients if they are less than 1 in magnitude. Exploding gradients can make the learning process volatile, while vanishing gradients make the optimization of the objective (cost or loss) less effective. An RNN with the simple cell may suffer from either problem when the input sequence is long.
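A minimal sketch of the simple cell in Eq. (1.25), with assumed input and hidden sizes, could look like this:

```python
import tensorflow as tf

input_size, hidden_size = 8, 16

# The shared weights: reused at every position of the sequence.
W = tf.Variable(tf.random.normal([input_size + hidden_size, hidden_size], stddev=0.1))
b = tf.Variable(tf.zeros([hidden_size]))

def simple_cell(x_t, h_prev):
    # Concatenate the input value with the previous hidden state, then
    # apply the linear transformation (row-vector form of Eq. (1.25)).
    return tf.matmul(tf.concat([x_t, h_prev], axis=-1), W) + b

# Process a toy sequence of length 5 one value at a time.
h = tf.zeros([1, hidden_size])
for x_t in tf.random.normal([5, 1, input_size]):
    h = simple_cell(x_t, h)
```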

LSTM

The long short-term memory networks, or LSTMs (Hochreiter and Schmidhuber 1997), are more sophisticated RNNs designed to handle long-term dependencies in long sequences, and the LSTM computation can serve as a cell in an RNN.

Unlike the simple cell, the LSTM cell has two states: the cell state C t and the hidden state h t. The update process of the cell state forms an information highway (the orange line in Fig. 1.22) which runs across the entire sequence with simple computations. This feature allows information to flow more easily throughout the sequence, so that the dependency between two values located far away from each other in the sequence (i.e., a long-term dependency) can be properly considered. Meanwhile, the hidden state is updated through gated computations. A gate controls whether information is forgotten or added to the flow, and it is implemented by the sigmoid function, whose output is restricted between 0 and 1. When the sigmoid function outputs 1, the corresponding information is kept entirely; when it outputs 0, the corresponding information is forgotten entirely.

Fig. 1.22

An illustration of RNN with the LSTM cell. There are two states in the LSTM which are the cell state C t and the hidden state h t. In addition, the three gates control whether information should be removed or added. Figure reproduced based on Olah (2015)

There are three kinds of gates in an LSTM cell: the forget gate, the input gate, and the output gate. The forget gate first determines whether certain information should be removed from the cell state based on the new input. The input gate then controls whether the new input should be added to the cell state for longer storage and to replace any information that has been forgotten. Finally, the output gate decides what the cell should output based on the new cell state. The three gates and the computation within the LSTM cell can be formally defined as follows, where σ represents the sigmoid function.

$$\displaystyle \begin{aligned} \begin{aligned} \text{Forget gate:}~~~\boldsymbol{f}_t &= \sigma(\boldsymbol{W}_f[\boldsymbol{h}_{t-1}; \boldsymbol{x}_t] + \boldsymbol{b}_f) \\ \text{Input gate:}~~~\boldsymbol{i}_t &= \sigma(\boldsymbol{W}_i[\boldsymbol{h}_{t-1}; \boldsymbol{x}_t] + \boldsymbol{b}_i) \\ \text{Output gate:}~~~\boldsymbol{o}_t &= \sigma(\boldsymbol{W}_o[\boldsymbol{h}_{t-1}; \boldsymbol{x}_t] + \boldsymbol{b}_o) \\ \text{Update cell state:}~~~\boldsymbol{C}_t &= \boldsymbol{f}_t \times \boldsymbol{C}_{t-1} + \boldsymbol{i}_t \times \text{tanh}(\boldsymbol{W}_C[\boldsymbol{h}_{t-1}; \boldsymbol{x}_t] + \boldsymbol{b}_C) \\ \text{Update hidden state:}~~~\boldsymbol{h}_t &= \boldsymbol{o}_t \times \text{tanh}(\boldsymbol{C}_t) \end{aligned} \end{aligned} $$
(1.26)
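As a sketch, the gates and state updates in Eq. (1.26) can be implemented directly as follows; the input and hidden sizes are assumptions.

```python
import tensorflow as tf

input_size, hidden_size = 8, 16

def make_weights():
    W = tf.Variable(tf.random.normal([hidden_size + input_size, hidden_size], stddev=0.1))
    b = tf.Variable(tf.zeros([hidden_size]))
    return W, b

W_f, b_f = make_weights()  # forget gate weights
W_i, b_i = make_weights()  # input gate weights
W_o, b_o = make_weights()  # output gate weights
W_C, b_C = make_weights()  # cell-state update weights

def lstm_cell(x_t, h_prev, C_prev):
    z = tf.concat([h_prev, x_t], axis=-1)
    f_t = tf.sigmoid(tf.matmul(z, W_f) + b_f)                    # forget gate
    i_t = tf.sigmoid(tf.matmul(z, W_i) + b_i)                    # input gate
    o_t = tf.sigmoid(tf.matmul(z, W_o) + b_o)                    # output gate
    C_t = f_t * C_prev + i_t * tf.tanh(tf.matmul(z, W_C) + b_C)  # new cell state
    h_t = o_t * tf.tanh(C_t)                                     # new hidden state
    return h_t, C_t
```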

The LSTM is a member of the family of gated RNNs; another popular member is the network based on the gated recurrent unit (GRU) (Cho et al. 2014). Recent works have investigated different RNN architectures, but it is still unclear whether any one of them is clearly better than the others (Cho et al. 2014; Jozefowicz et al. 2015).

RNNs are widely adopted in deep learning to process sequential data such as natural language and time series (Liao et al. 2018b; Chung et al. 2014; Mikolov et al. 2010), and they are also applied to reinforcement learning problems (Peng et al. 2018; Wierstra et al. 2010). Depending on the relation between inputs and outputs, the RNN architecture can be adapted to different scenarios. A typical example of a sequence input with a single output is text classification (Zhang et al. 2019a; Lee and Dernoncourt 2016), where the input is a sequence of words (a sentence or a document) and the output is a single label representing the predicted class. More challenging tasks such as machine translation (Sutskever et al. 2014; Luong et al. 2015; Bahdanau et al. 2015) and text summarization (Nallapati et al. 2017) have both a sequence input and a sequence output.

10 Deep Learning Examples

This section introduces examples of how to implement deep learning models in TensorFlowFootnote 2 and TensorLayer.Footnote 3 TensorFlow (Abadi et al. 2016) by Google is an open-source library that enables researchers and engineers to develop deep learning models, while TensorLayer (Dong et al. 2017a) provides a moderate abstraction over TensorFlow to make such development easier and more flexible. The content of this section has been validated on Python 3, TensorFlow 2.0, and TensorLayer 2.0 or later. In the future, TensorLayer will support different computational backends, not only TensorFlow.

10.1 Tensor and Gradients

The tensor is the most fundamental computation unit in TensorFlow, and it is used to represent the outputs of an operation. A tensor can be created by operations such as tf.constant, tf.matmul, etc. Historically, a tensor did not store the values of the operation’s outputs but provided access to the computation of those values in a TensorFlow session. In TensorFlow 2.0, there is no need to run a session manually: with eager execution, graphs and sessions stay in the backend. For example, in the matrix multiplication shown below, matrices can be created by tf.constant and the multiplication is computed by tf.matmul, whose output is another matrix.

Matrix multiplication in TensorFlow by Tensor.
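The original listing is not reproduced here; a minimal sketch of such a snippet might be:

```python
import tensorflow as tf

# Create two constant matrices as tensors.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

# With eager execution (the TensorFlow 2 default), the product is
# computed immediately and can be inspected directly.
c = tf.matmul(a, b)
print(c)  # [[19. 22.], [43. 50.]]
```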

In the forward propagation of deep neural networks, tensors are automatically connected to each other as a graph. Based on this graph and the automatic differentiation provided by TensorFlow, gradients can be computed in back-propagation. TensorFlow 2.0 provides tf.GradientTape to compute the gradients of recorded operations with respect to their input variables. The code below shows an example of computing gradients in back-propagation. The forward propagation and the computation of the loss are within the scope of tf.GradientTape, while the back-propagation and the weight update are outside the scope. tf.GradientTape records all operations executed within its scope onto a tape. The gradients associated with each recorded operation and its input variables are then computed by reverse-mode automatic differentiation. Once the function tape.gradient() is called, the resources held by the tf.GradientTape are released.

Gradients computation in TensorFlow and TensorLayer.
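A minimal sketch of such a listing, using plain tf.Variables for a toy linear model (the model, data, and learning rate here are assumptions), might be:

```python
import tensorflow as tf

# A toy linear model y = xW + b with trainable weights.
W = tf.Variable(tf.random.normal([8, 4]))
b = tf.Variable(tf.zeros([4]))
x = tf.random.normal([2, 8])
y_true = tf.zeros([2, 4])

with tf.GradientTape() as tape:
    # Forward propagation and the loss are recorded on the tape.
    y_pred = tf.matmul(x, W) + b
    loss = tf.reduce_mean(tf.square(y_pred - y_true))

# Reverse-mode automatic differentiation; the tape's resources are
# released after this call.
grads = tape.gradient(loss, [W, b])

# The weight update happens outside the tape's scope.
optimizer = tf.optimizers.SGD(learning_rate=0.01)
optimizer.apply_gradients(zip(grads, [W, b]))
```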

10.2 Define a Model

In TensorLayer 2.0, a Model is an entity that consists of multiple Layers and defines the propagation between those Layers. TensorLayer 2.0 provides two sets of APIs to define a model: static model APIs allow users to build a model fluently, while dynamic model APIs provide more flexibility in the forward propagation. A static model requires users to construct a graph manually and compile it; once the model is compiled, the forward propagation cannot be changed. Unlike a static model, a dynamic model is executed eagerly, like normal Python code, and its forward propagation is mutable.

In the implementation of models, as shown in the examples below, the difference between a static model and a dynamic model can be summarized in two aspects. First, when layers in a static model are declared, the connections between layers (i.e., the forward propagation) are defined explicitly at the same time. Based on these connections, TensorLayer can automatically infer the size of the input variables of each layer from the previous layers and then construct the weights. When the Model is finally instantiated, only the inputs and outputs need to be specified, and TensorLayer automatically builds a graph based on the connections. However, when a dynamic model is initialized, the forward propagation is still unknown; it is only defined later in the function forward. Thus, the size of the input variables cannot be inferred automatically and has to be provided manually via the argument in_channels.

Second, the forward propagation of a static model is fixed once the model is constructed, so it is easier to accelerate its computation. TensorFlow 2.0 provides a new feature, tf.function, which can be used as a decorator to accelerate computation. Unlike in a static model, the forward propagation in a dynamic model can be more flexible: for example, the forward flow can be controlled based on input values or arguments specified by users, and users are allowed to use or skip any layer in the forward propagation of a dynamic model.

An example of a static model: multilayer perceptron (MLP)
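The original listing is not reproduced here; a sketch of a static MLP in TensorLayer 2.0, with assumed layer sizes, might look like this:

```python
import tensorflow as tf
import tensorlayer as tl

# Declaring a layer and connecting it to the previous layer happen at the
# same time, so input sizes are inferred automatically.
ni = tl.layers.Input([None, 784])
nn = tl.layers.Dense(n_units=800, act=tf.nn.relu)(ni)
nn = tl.layers.Dense(n_units=800, act=tf.nn.relu)(nn)
nn = tl.layers.Dense(n_units=10, act=None)(nn)

# Only inputs and outputs are specified; TensorLayer builds the graph.
MLP = tl.models.Model(inputs=ni, outputs=nn, name='mlp')
```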

An example of a dynamic model: multilayer perceptron (MLP)
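Correspondingly, a sketch of the dynamic version (again with assumed sizes) might be:

```python
import tensorflow as tf
import tensorlayer as tl

class MLP(tl.models.Model):

    def __init__(self):
        super(MLP, self).__init__()
        # The forward propagation is unknown at initialization, so the
        # input size of each layer is given manually via in_channels.
        self.dense1 = tl.layers.Dense(n_units=800, act=tf.nn.relu, in_channels=784)
        self.dense2 = tl.layers.Dense(n_units=800, act=tf.nn.relu, in_channels=800)
        self.dense3 = tl.layers.Dense(n_units=10, act=None, in_channels=800)

    def forward(self, x):
        # The forward flow is mutable and may depend on inputs or arguments.
        x = self.dense1(x)
        x = self.dense2(x)
        x = self.dense3(x)
        return x

model = MLP()
```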

10.3 Customized Layers

TensorLayer 2.0 provides more than a hundred layers for users, and it also supports the Lambda layer so that users can easily customize layers. The simplest usage is to pass a lambda function into a Lambda layer, as shown below. Users may also define a customized function with extra arguments, which can be passed via fn_args when the Lambda layer is initialized or called.
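A sketch of both usages; the function and its argument are illustrative:

```python
import tensorlayer as tl

# Simplest case: wrap a lambda function in a Lambda layer.
ni = tl.layers.Input([8, 3])
nn = tl.layers.Lambda(lambda x: 2 * x)(ni)

# A customized function with an extra argument passed via fn_args.
def scale_fn(x, scale=1.0):
    return x * scale

nn = tl.layers.Lambda(scale_fn, fn_args={'scale': 5.0})(ni)
```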

A Lambda layer can also have trainable weights. The example below shows a weight that is defined outside the customized function and passed into the Lambda layer via fn_weights.
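A sketch of this pattern; the variable and function are illustrative:

```python
import tensorflow as tf
import tensorlayer as tl

# The trainable weight is defined outside the customized function.
a = tf.Variable(1.0)

def customize_fn(x):
    return x + a

ni = tl.layers.Input([8, 3])
# Passing the weight via fn_weights makes it trainable with the model.
nn = tl.layers.Lambda(customize_fn, fn_weights=[a])(ni)
```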

Moreover, the Lambda layer enables compatibility with Keras in TensorLayer. Users may define a Keras model and pass it into a Lambda layer as a function, since a Keras model is callable. The trainable weights of the Keras model need to be fetched and passed into the Lambda layer so that the Keras model can be updated together with the customized model.
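A sketch of wrapping a Keras model; the Keras model itself is an arbitrary example:

```python
import numpy as np
import tensorflow as tf
import tensorlayer as tl

# An arbitrary Keras model; Keras models are callable and thus can act
# as the Lambda function.
perceptron = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation=tf.nn.relu),
    tf.keras.layers.Dense(5, activation=tf.nn.sigmoid),
])
# Call the model once so that its weights are created before being fetched.
_ = perceptron(np.random.random([100, 5]).astype(np.float32))

ni = tl.layers.Input([100, 5])
# Pass the Keras model's trainable weights so they are updated together
# with the rest of the customized model.
nn = tl.layers.Lambda(perceptron, fn_weights=perceptron.trainable_weights)(ni)
```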

10.4 MLP: Image Classification on MNIST

With the Models, Layers, and other supportive APIs provided by TensorLayer 2.0, users can design and implement their own deep learning models in a straightforward and flexible manner. To help readers better understand how to write a deep learning model with TensorLayer, let us start with an MLP that classifies images in the MNIST dataset (LeCun et al. 1998), which collects 70,000 images of handwritten digits. The implementation of a deep learning example typically has five steps: loading data, building the model, training, testing, and saving the model.

TensorLayer provides APIs in the submodule tl.files to load various popular datasets, including MNIST, CIFAR-10, PTB, CelebA, etc. For example, the MNIST dataset can be loaded by tl.files.load_mnist_dataset with a specific shape. The dataset is typically split into three subsets: the training set, the validation set, and the testing set.
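For example, loading MNIST with each image flattened into a 784-dimensional vector:

```python
import tensorlayer as tl

# Load MNIST; the shape argument flattens each 28x28 image to 784 values.
X_train, y_train, X_val, y_val, X_test, y_test = \
    tl.files.load_mnist_dataset(shape=(-1, 784))
```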

As introduced in Sect. 1.10.2, an MLP model can be implemented as either a static or a dynamic model in TensorLayer 2.0. In this example, the MLP model has three Dense layers and is implemented as a static model. Unlike a conventional MLP, the model in this example also has three Dropout layers, which are used to prevent overfitting.
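A sketch of this static MLP with Dropout layers; the layer sizes and keep probabilities are assumptions:

```python
import tensorflow as tf
import tensorlayer as tl

ni = tl.layers.Input([None, 784])
nn = tl.layers.Dropout(keep=0.8)(ni)
nn = tl.layers.Dense(n_units=800, act=tf.nn.relu)(nn)
nn = tl.layers.Dropout(keep=0.5)(nn)
nn = tl.layers.Dense(n_units=800, act=tf.nn.relu)(nn)
nn = tl.layers.Dropout(keep=0.5)(nn)
nn = tl.layers.Dense(n_units=10, act=None)(nn)
MLP = tl.models.Model(inputs=ni, outputs=nn, name='mlp')
```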

Training the MLP model on the MNIST dataset learns the weights of the model. Users can trigger the training process by simply calling the function tl.utils.fit. The testing step then validates whether the model has properly learned from the data and can be triggered by tl.utils.test.
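A sketch of the training and testing calls, continuing from the snippets above; the hyperparameters and the accuracy helper are assumptions:

```python
import numpy as np
import tensorflow as tf
import tensorlayer as tl

# MLP, X_train, y_train, X_val, y_val, X_test, y_test are assumed to be
# defined as in the previous sketches.

def acc(y_pred, y_true):
    # Fraction of correctly classified samples.
    return np.mean(np.equal(np.argmax(y_pred, 1), y_true))

tl.utils.fit(
    MLP, train_op=tf.optimizers.Adam(learning_rate=0.0001),
    cost=tl.cost.cross_entropy, X_train=X_train, y_train=y_train, acc=acc,
    batch_size=256, n_epoch=20, X_val=X_val, y_val=y_val, eval_train=True,
)

tl.utils.test(MLP, acc, X_test, y_test, batch_size=None, cost=tl.cost.cross_entropy)
```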

Finally, the weights of the trained MLP model can be saved to a local file so that the model can be restored later for inference.Footnote 4
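A sketch, with an assumed file name:

```python
# Save the trained weights to a local file ...
MLP.save_weights('./mlp.h5')

# ... and restore them later for inference.
MLP.load_weights('./mlp.h5')
MLP.eval()  # switch to evaluation mode, e.g. disabling dropout
```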

10.5 CNN: Image Classification on CIFAR10

The CIFAR-10 dataset (Krizhevsky et al. 2009) has been a challenging and popular benchmark for image classification. It collects images from 10 classes, with 6000 images per class. The images are 32 × 32 in RGB color, and each image focuses exclusively on one single object (class) such as a dog, an airplane, or a ship. In TensorLayer 2.0, CIFAR-10 can be easily loaded and augmented using the Dataset and Dataloader APIs.
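For example, the raw dataset can be fetched via tl.files; augmentation with the Dataset and Dataloader APIs is omitted in this sketch:

```python
import tensorlayer as tl

# Load CIFAR-10 as 32x32 RGB images.
X_train, y_train, X_test, y_test = tl.files.load_cifar10_dataset(
    shape=(-1, 32, 32, 3), plotable=False)
```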

In this example, a CNN model with batch normalization (Ioffe and Szegedy 2015) is trained to classify the images from CIFAR-10. The model has two convolution blocks, each of which contains a batch normalization layer, and the blocks are followed by three dense layers.Footnote 5
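A sketch of such a model; the filter counts, kernel sizes, and dense-layer sizes are assumptions:

```python
import tensorflow as tf
import tensorlayer as tl

ni = tl.layers.Input([None, 32, 32, 3])

# First convolution block: convolution, batch normalization, pooling.
nn = tl.layers.Conv2d(64, (5, 5), (1, 1), padding='SAME', b_init=None)(ni)
nn = tl.layers.BatchNorm2d(act=tf.nn.relu)(nn)
nn = tl.layers.MaxPool2d((3, 3), (2, 2), padding='SAME')(nn)

# Second convolution block.
nn = tl.layers.Conv2d(64, (5, 5), (1, 1), padding='SAME', b_init=None)(nn)
nn = tl.layers.BatchNorm2d(act=tf.nn.relu)(nn)
nn = tl.layers.MaxPool2d((3, 3), (2, 2), padding='SAME')(nn)

# Three dense layers for classification.
nn = tl.layers.Flatten()(nn)
nn = tl.layers.Dense(384, act=tf.nn.relu)(nn)
nn = tl.layers.Dense(192, act=tf.nn.relu)(nn)
nn = tl.layers.Dense(10, act=None)(nn)

CNN = tl.models.Model(inputs=ni, outputs=nn, name='cnn')
```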

10.6 RNN and Seq2seq: Chatbot

Chatbots generally conduct conversations by audio or text. In this example, we simplify the chatbot so that it takes text as input and responds in text. In this sense, the seq2seq model (Sutskever et al. 2014) is a good fit for the chatbot. The seq2seq model has a sequence input and a sequence output. For example, both the input and the output can be a sentence, i.e., a sequence of words. In the chatbot, the seq2seq model takes a sentence as input and is trained to respond properly with another sentence. The seq2seq model was originally proposed for machine translation but has potential in many other sequence-to-sequence scenarios such as traffic prediction (Liao et al. 2018b) and text summarization (Liu et al. 2018). In practice, the seq2seq model consists of two RNNs: an encoder and a decoder. The encoder RNN learns the representation of the input sequence, and the decoder RNN generates the response to the input. TensorLayer provides APIs to build a seq2seq model with one line of code.
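A sketch of that one-line construction; the vocabulary size, embedding size, and cell choice are assumptions:

```python
import tensorflow as tf
import tensorlayer as tl
from tensorlayer.models.seq2seq import Seq2seq

# Build a seq2seq model: an RNN encoder and an RNN decoder sharing an
# embedding layer over the vocabulary.
model = Seq2seq(
    decoder_seq_length=20,
    cell_enc=tf.keras.layers.GRUCell,
    cell_dec=tf.keras.layers.GRUCell,
    n_layer=3,
    n_units=256,
    embedding_layer=tl.layers.Embedding(vocabulary_size=8000, embedding_size=1024),
)
```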

An example output of the seq2seq-based chatbot modelFootnote 6 is demonstrated below. The model ingests an input query, which is a sentence, and outputs several candidate responses.