11.1 The Problem

Rule-based systems and Bayesian networks cannot effectively solve problems such as image or speech recognition. Artificial neural networks (ANNs), or simply neural networks, are effective in solving such complex problems, i.e., in modeling complex nonlinear functions. ANNs model the functioning of the brain’s neurons; an ANN can be trained to “learn” how to recognize patterns and classify data [1].

11.2 A Practical Example

11.2.1 Example 1

Let us take an example of a dataset that has eight instances with two variables, x and y, and two classes (i.e., class “grey” and class “black”), which are drawn in Fig. 11.1. We can notice two groups of instances: those in black and those in grey. However, no single straight line can classify these instances into the two classes/categories. If we have two lines like those in Fig. 11.2, we can correctly classify the instances. So, the function that separates these two classes cannot be linear; we therefore have a nonlinear solution to this classification problem (Fig. 11.2). Every time a linear classification cannot work, we can make use of an artificial neural network (ANN), or more accurately, an ANN with hidden layers.

Fig. 11.1 Eight instances belonging to two classes represented by black dots and grey dots

Fig. 11.2 Eight instances belonging to two classes separated by a nonlinear function

To make our point clear, we can draw two straight lines to separate the two classes (Fig. 11.3).

Fig. 11.3 An example of two straight lines drawn in an attempt to separate the two classes

Each line is expressed as y = ax + b, or to write it slightly differently, y − ax − b = 0, which is equivalent to

$$ {w}_2{x}_2+{w}_1{x}_1+{w}_0=0, $$

where w2=1, w1=−a, and w0=−b.

We will see in the Multilayer Perceptron section how an artificial neural network can solve this problem.

11.3 The Algorithm

A typical biological neuron can be schematized as in the following figure (Fig. 11.4).

Fig. 11.4 Typical biological neuron (synaptic gap, synapse, synaptic terminals, dendrites, soma, nucleus, and axon)

A brain neuron can be considered an information-processing unit. Neurons communicate through electrical signals. By discharging chemicals known as neurotransmitters, the synaptic terminals of one neuron produce a voltage pulse which is communicated to the soma through the dendrites of another neuron. At the soma, the potentials are added, and when the summation rises above a critical threshold, then an electrical signal travels through the axon to the synaptic terminals [2, 3].

Hence, dendrites play the role of input, and the axon, the role of output. As a processing unit, the neuron has many inputs and one output that is connected to many other processing units [3].

Synapses might excite or inhibit the dendrites; exciting a dendrite results in a positive direction of its potential, while inhibiting it results in a negative direction of its potential. Hence, the inputs communicated through the dendrites are “weighted”: some signals are positive (excite), and others are negative (inhibit). At the soma, the weighted inputs are added, and if the sum crosses a threshold, the neuron fires (i.e., gives output). A neuron can fire between 0 and 1500 times per second [2]. The neuron either fires or does not, but what changes is the rate of firing.

11.3.1 The McCulloch–Pitts Neuron

In 1943, Warren McCulloch and Walter Pitts proposed a mathematical model of the neuron known today as the McCulloch–Pitts (M-P) neuron [4, 5] (Fig. 11.5). The inputs (e.g., dendrites) of an M-P neuron are either 0 or 1, and it can be thought of as formed of two parts: the first part sums up all input values, and the second makes a decision about the resulting sum. The decision function f will provide an output of 1 if the sum of the inputs is greater than or equal to a certain threshold θ (pronounced theta) and 0 otherwise.

Fig. 11.5 A McCulloch–Pitts neuron

Let us use an M-P neuron to decide whether to go to the movie theater or not. Suppose that we base our decision on four binary parameters: it is a weekday (x1), it is after 6:00 p.m. (x2), it is not during the COVID-19 pandemic (x3), and the actor is Shah Rukh Khan (x4). A decision will be made to go to watch the movie if at least three of the four conditions are met (θ = 3): f(g(x)) = 1 if g(x) ≥ θ; f(g(x)) = 0 if g(x) < θ.

θ is called the bias; we can think of it as the prior prejudice. For example, for a certain group of people, it might be enough that two of the conditions are met to decide to go to the movie theater; for others, the threshold could be 4 or even 0.

What would be the M-P decision on a Tuesday at 5:00 p.m. during the pandemic if the actor was Shah Rukh Khan?

$$ o=g(x)={x}_1+{x}_2+{x}_3+{x}_4=1+0+0+1=2 $$

f(o) = f(g(x)) = f(2) = 0; the decision is not to go to the movie theater (i.e., the neuron will not fire).

What would be the M-P decision on a Tuesday at 7:00 p.m. during the pandemic if the actor was Shah Rukh Khan?

$$ g(x)={x}_1+{x}_2+{x}_3+{x}_4=1+1+0+1=3 $$

f(g(x)) = f(3) = 1; the decision is to go to the movie theater (i.e., the neuron will fire).
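The two decisions above can be checked with a minimal Python sketch of the M-P neuron; the function name `mp_neuron` is mine, for illustration.

```python
# A minimal McCulloch-Pitts neuron: binary inputs, fixed threshold theta.
# The movie example from the text: theta = 3, inputs are
# (weekday, after 6 p.m., not during pandemic, Shah Rukh Khan is the actor).

def mp_neuron(inputs, theta):
    """Fire (return 1) if the sum of the binary inputs reaches the threshold."""
    return 1 if sum(inputs) >= theta else 0

# Tuesday, 5:00 p.m., during the pandemic, Shah Rukh Khan: g(x) = 2 -> do not go
print(mp_neuron([1, 0, 0, 1], theta=3))  # 0
# Tuesday, 7:00 p.m., during the pandemic, Shah Rukh Khan: g(x) = 3 -> go
print(mp_neuron([1, 1, 0, 1], theta=3))  # 1
```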

The M-P neuron was the first step towards today’s neural network; however, it was very restrictive. First, not all our inputs are binary; they can be numerical or categorical. Also, the output we desire is not always binary—we might want to predict a number in the case of regression or predict a class out of multiple existing classes (more than 2) in the case of classification problems.

11.3.2 The Perceptron

To overcome these limitations, the perceptron model was proposed by Frank Rosenblatt in 1958; the model was refined by Minsky and Papert in 1969. Mainly, the perceptron proposed to add adaptive weights to the inputs (Fig. 11.6).

Fig. 11.6 The perceptron

The neuron has many inputs (x1, …, xn) and adaptive weights (w1, …, wn); each input xi is multiplied by its corresponding weight wi, and then the results are summed up, mimicking the dendrites-soma-axon behavior. When the summed-up result is higher than a threshold θ, the outcome y is set to 1; otherwise, y is set to 0. y is in fact a function of the weighted sum \( y=f\left(g(x)\right)=f\left(\sum \limits_{i=1}^n\left({w}_i\times {x}_i\right)\right) \).

If we go back to the same problem above—the decision to go to the movie theater—but we add to it the weight for each input, the weights can be decided based on knowledge about the importance of each input: a highly important input for making the right decision can be assigned a high weight, and inputs that do not play a major role can be assigned lower weights. Finding ways to determine the best weights and θ for a decision problem is the main goal in the next paragraphs.

Suppose that the perceptron is deciding for a group of Shah Rukh Khan diehard fans, hence the weight w4 could be 10, while the other weights are set as follows: w1 = 2, w2 = 3, w3 = −5. For the sake of this example, let us change x3 to represent pandemic if it is equal to 1 and no pandemic if it is equal to 0.

What would the perceptron’s decision be on a Tuesday at 7:00 p.m. during the pandemic if the actor was Shah Rukh Khan?

g(x) = w1x1 + w2x2 + w3x3 + w4x4 = 2(1) + 3(1) + (−5)(1) + 10(1) = 10 > θ = 3; the decision is to go to watch the movie.

Now, we can change the weights for people who are more reasonable and decide that watching a movie with Shah Rukh Khan (or any other actor) is not of a higher value than their and others’ lives; we can set w3 = −100, which ensures the perceptron decides “do not go to the theater” whenever there is a pandemic.
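The weighted variant can be sketched the same way; the function name and return convention are illustrative, not from the text.

```python
# A perceptron step for the movie example with the weights from the text:
# w1 = 2 (weekday), w2 = 3 (after 6 p.m.), w3 = -5 (pandemic), w4 = 10 (SRK),
# threshold theta = 3. Returns (decision, weighted sum).

def perceptron(x, w, theta):
    """Weighted sum followed by a hard threshold."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > theta else 0, s

# Tuesday, 7 p.m., pandemic, Shah Rukh Khan -> g(x) = 2 + 3 - 5 + 10 = 10 > 3
y, s = perceptron([1, 1, 1, 1], [2, 3, -5, 10], theta=3)
print(s, y)   # 10 1

# The "reasonable" variant: w3 = -100 keeps the neuron from firing in a pandemic.
y, s = perceptron([1, 1, 1, 1], [2, 3, -100, 10], theta=3)
print(s, y)   # -85 0
```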

Mathematically, we could look at the inputs (x1, …, xn) and the weights (w1, …, wn) as vectors.

$$ \mathrm{The}\ \mathrm{input}\ \mathrm{vector}\ x\ \mathrm{is}\ \mathrm{defined}\ \mathrm{as}\ x=\left[\begin{array}{c}{x}_1\\ {}{x}_2\\ {}\dots \\ {}{x}_n\end{array}\right]. $$

And the weights’ vector transpose is defined as wT = [w1 w2 … wn].

In mathematics, the multiplication of two vectors wT and x is written wTx and is expressed as follows:

$$ {w}^Tx=\left[{w}_1\ {w}_2\dots {w}_n\right]\left[\begin{array}{c}{x}_1\\ {}{x}_2\\ {}\dots \\ {}{x}_n\end{array}\right]=\sum \limits_{i=1}^n\left({w}_i\times {x}_i\right) $$

The function f that we have used to provide the output y is a function of the total weighted sum (i.e., g(x)) and is called the activation function because it allows us to activate the neuron when the value is greater than or equal to θ.

The activation function compares the weighted sum to θ and decides to activate the neuron if the weighted sum is greater than θ (Fig. 11.7).

Fig. 11.7 A perceptron with threshold θ

The same result can be achieved if we subtract the value θ from the sum (Fig. 11.8). The result y is the same; however, the activation decision is made based on whether \( \sum \limits_{i=1}^n\left({w}_i\times {x}_i\right)-\theta >0 \) or not.

Fig. 11.8 A perceptron with the threshold θ subtracted from the weighted sum

We can move one step further by treating −θ as an extra weight called w0 multiplied by an attribute x0 of value 1 (Fig. 11.9).

Fig. 11.9 −θ as an input weight w0 for an attribute x0 of value 1

w0 is the bias of the model, sometimes represented by the letter b, familiar from linear functions (i.e., y = ax + b).

Hence the following:

$$ g(x)=\sum \limits_{i=1}^n\left({w}_i\times {x}_i\right)+b $$
$$ y=f\left(g(x)\right)=f\left(\sum \limits_{i=1}^n\left({w}_i\times {x}_i\right)+b\right) $$

can also be written

$$ g(x)=\sum \limits_{i=1}^n\left({w}_i\times {x}_i\right)+\left(-\theta \right)=\sum \limits_{i=1}^n\left({w}_i\times {x}_i\right)+\left({w}_0\times 1\right)=\sum \limits_{i=0}^n\left({w}_i\times {x}_i\right) $$
$$ y=f\left(g(x)\right)=f\left(\sum \limits_{i=0}^n\left({w}_i\times {x}_i\right)\right) $$

11.3.3 The Perceptron as a Linear Function

In fact, the perceptron estimates a linear function. Let us take the following example with two input variables, x1 and x2, and their corresponding weights, w1 and w2. Suppose that we have the following values for the dataset we are trying to model using the perceptron (Table 11.1). That dataset is plotted in Fig. 11.10, where the points corresponding to a zero output are in grey.

Table 11.1 A training dataset
Fig. 11.10 The dataset plotted on a graph

It is obvious that we can find a solution to differentiate between the points in grey and the others: we can plot a straight line that separates the two sets of data points. A line such as y = x will do the job (Fig. 11.11).

Fig. 11.11 A line (f(x) = x) separating the data points that belong to two different categories

If we use a perceptron with weights w1 = −1, w2 = 1, and w0 = 0, the perceptron behaves exactly like f(x) = x.

Let us start with f(x) = x; we can rewrite it as y = x.

We can also write it as x2 − x1 = 0, or x2 − x1 − 0 = 0, or even w2x2 + w1x1 + w0 = 0, where w0 = 0, w1 = −1, and w2 = 1.

All the points on the line have in common the property x2 − x1 − 0 = 0.

The points on both sides of the line satisfy either of these two conditions: x2 − x1 − 0 > 0 or x2 − x1 − 0 < 0.

We are in a situation of a summation \( \sum \limits_{i=0}^2\left({w}_i\times {x}_i\right) \) and then a decision based on comparison of the resulting sum with a threshold θ = 0. We are in the domain of the perceptron. The perceptron is modeling a straight line, so it is a linear model.
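As a quick check, the following sketch (with the hypothetical helper `linear_perceptron`) implements the perceptron with w0 = 0, w1 = −1, w2 = 1 and threshold θ = 0:

```python
# The perceptron as a linear model: w0 = 0, w1 = -1, w2 = 1 computes
# g(x) = x2 - x1, i.e. the line x2 = x1, with threshold theta = 0.

def linear_perceptron(x1, x2, w=(0.0, -1.0, 1.0)):
    g = w[0] * 1 + w[1] * x1 + w[2] * x2   # w0*x0 + w1*x1 + w2*x2
    return 1 if g > 0 else 0               # fire for points above the line

# Points above the line x2 = x1 are classified 1; points on or below it, 0.
print(linear_perceptron(1.0, 3.0))  # 1
print(linear_perceptron(3.0, 1.0))  # 0
```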

11.3.4 Activation Functions

The activation function f can be different from the one mentioned above; it can, for example, propose that the output be −1 instead of 0; such a function is called bipolar as opposed to unipolar (i.e., output positive or zero). The passage from one output to another (0 to 1 or −1 to 1) was abrupt in the previous paragraph, but we can use activation functions with a smoother passage; such functions are called soft-limiting (Fig. 11.12).

Fig. 11.12 Unipolar and bipolar activation functions

However, for reasons we will discuss below (i.e., gradient descent), we will need to compute the derivative of the activation function; hence, it must be differentiable. We have many activation functions to choose from [6].

11.3.4.1 The Sigmoid Function

A function that satisfies the differentiability criterion and that can play the role of a soft-limiting activation function is the sigmoid function, defined as:

$$ f(x)=\frac{1}{1+{e}^{-\lambda x}} $$

The graph of the sigmoid function is given in Fig. 11.13:

Fig. 11.13 Sigmoid function for λ = 1

where λ determines the steepness of the sigmoid function. We can notice that the outcome of the sigmoid function varies between 0 and 1.

The gradient of the sigmoid is defined as follows:

$$ {f}^{\prime }(x)=\mathrm{sigmoid}(x)\times \left(1-\mathrm{sigmoid}(x)\right) $$

Since the sigmoid function’s output always lies between 0 and 1, the gradient sigmoid(x) × (1 − sigmoid(x)) is always positive, whatever the value of x (Fig. 11.14). In fact, the gradient approaches 0 above +3 and below −3, which indicates that little learning is done in those regions.

Fig. 11.14 A graph showing the gradient of the sigmoid function
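The vanishing-gradient behavior is easy to verify numerically; a small sketch, assuming λ = 1 (function names are mine):

```python
import math

# Sigmoid (lambda = 1) and its gradient sigmoid(x) * (1 - sigmoid(x)).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 (value 0.25) and nearly vanishes beyond +/-3,
# the "little learning" region mentioned in the text.
print(round(sigmoid_grad(0.0), 3))   # 0.25
print(round(sigmoid_grad(5.0), 4))   # 0.0066
```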

We will overcome this issue if we scale the sigmoid function, and that is the solution proposed by the tanh function.

11.3.4.2 The Tanh Function

The hyperbolic tangent function is defined as follows:

$$ f(x)=\tanh \left(\frac{1}{2}\ \lambda x\right) $$
$$ \mathrm{for}\ \lambda =2,\tanh (x)=\frac{1-{e}^{-2x}}{1+{e}^{-2x}} $$

The graph of the tanh function is like that of the sigmoid, but it is scaled so that it is symmetric around zero (Fig. 11.15).

Fig. 11.15 The tanh function for λ = 2

The gradient of the tanh function is defined as follows:

$$ {f}^{\prime }(x)=1-{\left(\tanh (x)\right)}^2 $$

Note that the tanh function itself is symmetric around zero, so its output can be positive or negative, unlike the sigmoid’s; its gradient is always positive, peaking at 1 when x = 0 (Fig. 11.16).

Fig. 11.16 A graph showing the gradient of the tanh function

11.3.4.3 The ReLU Function

The rectified linear unit function (ReLU) is defined as follows:

$$ \mathrm{ReLU}(x)=f(x)=\left\{\begin{array}{c}0\kern1.75em \mathrm{if}\ x<0\\ {}x\kern0.5em \mathrm{otherwise}\ \end{array}\right. $$

The ReLU graph is shown in Fig. 11.17.

Fig. 11.17 The ReLU function

For all values below 0, the activation function outputs 0; hence, ReLU activates only a subset of the neurons, which makes it more computationally efficient than other activation functions. The gradient of ReLU is a constant (0 or 1) and is defined as

$$ {f}^{\prime }(x)=\left\{\begin{array}{c}0\kern1.75em \mathrm{if}\ x<0\\ {}1\kern0.5em \mathrm{otherwise}\ \end{array}\right. $$

Since the gradient might be 0 for some neurons, during backpropagation, some weights and biases will not be updated, and the corresponding neurons might never get activated; we call such neurons “dead neurons.” The leaky ReLU activation function addresses this problem.

11.3.4.4 The Leaky ReLU Function

Leaky ReLU is defined as follows:

$$ \mathrm{LeakyReLU}(x)=f(x)=\left\{\begin{array}{c}0.01x\kern1.75em \mathrm{if}\ x<0\\ {}x\kern2.5em \mathrm{otherwise}\ \end{array}\right. $$

The leaky ReLU graph is shown in Fig. 11.18.

Fig. 11.18 The leaky ReLU function

The leaky ReLU gradient function is defined as follows:

$$ {f}^{\prime }(x)=\left\{\begin{array}{c}0.01\kern2.25em \mathrm{if}\ x<0\\ {}1\kern2.5em \mathrm{otherwise}\ \end{array}\right. $$

With leaky ReLU, the gradient of the negative inputs will never be 0; hence, there will be no dead neurons.
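A minimal sketch contrasting the two gradients as defined above (function names are mine):

```python
# ReLU vs. leaky ReLU gradients, as defined in the text: for x < 0, ReLU's
# gradient is 0 (a potential "dead neuron"), while leaky ReLU's is 0.01.

def relu_grad(x):
    return 0.0 if x < 0 else 1.0

def leaky_relu_grad(x):
    return 0.01 if x < 0 else 1.0

print(relu_grad(-2.0), leaky_relu_grad(-2.0))  # 0.0 0.01
print(relu_grad(2.0), leaky_relu_grad(2.0))    # 1.0 1.0
```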

11.3.4.5 The Parameterized ReLU Function

The parameterized ReLU function adds flexibility for the negative values of x as it introduces the slope as a parameter (instead of the constant slope 0.01). The function is defined as follows:

$$ \mathrm{PReLU}(x)=f(x)=\left\{\begin{array}{c} ax\kern2.75em \mathrm{if}\ x<0\\ {}x\kern1.75em \mathrm{otherwise}\ \end{array}\right. $$

The only caveat is that the artificial neural network will also learn the slope a for an optimal convergence.

The parameterized ReLU gradient function is defined as follows:

$$ {f}^{\prime }(x)=\left\{\begin{array}{c}a\kern2.25em \mathrm{if}\ x<0\\ {}1\kern1em \mathrm{otherwise}\ \end{array}\right. $$

11.3.4.6 The Swish Function

The swish activation function has been reported to perform better than ReLU in deep networks while remaining computationally efficient; it is defined as follows (Fig. 11.19):

$$ f(x)=\frac{x}{1+{e}^{-x}} $$
Fig. 11.19 The swish function

11.3.4.7 The SoftMax Function

The SoftMax function turns a vector x of K real values xj, j = 1 to K, into a vector of K real values that sum to 1; it is defined as:

$$ \sigma \left({x}_i\right)=\frac{e^{{x}_i}}{\sum \limits_{j=1}^K{e}^{{x}_j}} $$

Since the SoftMax function returns values between 0 and 1, we can treat these values as probabilities that an input belongs to a particular class. The SoftMax activation function is very useful for multiclass classification, where the ANN has multiple neurons as output.
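A small sketch of the SoftMax computation, using the standard positive-exponent form (the function name is mine):

```python
import math

# SoftMax: turns k real scores into k positive values that sum to 1.
def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # largest score -> largest "probability"
print(round(sum(probs), 6))          # 1.0
```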

11.3.4.8 Which Activation Function to Choose?

There is no formula; however, the following are rules of thumb:

  • Sigmoid functions work well in classification problems.

  • Sigmoid and tanh functions have one notable drawback: the vanishing gradient.

  • The ReLU function is generic and is widely used.

  • In the case of dead neurons, use leaky ReLU.

  • Use ReLU first; if it does not provide you with a good solution, then you can try other activation functions.

  • Use SoftMax for multiclass classification problems.

11.3.5 Training the Perceptron

The question is how to find the right weights for the perceptron. We will do that by gradient descent, which we have already seen in the context of regression.

Let us define an error function E for the perceptron. If we take the error function as the mean squared error (MSE), and suppose that we have a training set of N instances (xi, yi), then E can be formulated as:

$$ E=\frac{1}{2N}{\sum}_{i=1}^N{\left({y}^i-{\hat{y}}^i\right)}^2=\frac{1}{2N}{\sum}_{i=1}^N{\left({y}^i-f\left({w}^T{x}^i+{w}_0\right)\right)}^2 $$
  • The value \( \frac{1}{2} \) is chosen for convenience in later calculations (i.e., derivatives).

  • The training dataset is formed of N instances {(x1, y1), (x2, y2)…(xN, yN)}. Each xi is an input vector with n attributes/features (xi1, xi2, …xin), and each yi is an expected output for vector xi.

  • wT is the transpose of the weight vector (w1, w2, …wn), where w0 is the bias.

  • \( {\hat{y}}^i \) is the thresholded output computed by the perceptron for the input vector xi.

Our aim is to find the set of weights that minimizes E.

The error value depends on w0, which is the bias b, and on all the other weights represented by the vector w, so E is a function of both variables. To obtain E(w, b), we replace f(wTxi + b) with wTxi + b (i.e., we temporarily drop the threshold function f). The error function is then expressed as follows:

$$ E\left(w,b\right)=\frac{1}{2N}{\sum}_{i=1}^N{\left({y}^i-\left({w}^T{x}^i+b\right)\right)}^2 $$

When wTxi + b = yi for all xi, i = 1 to N, then E = 0; our aim is to find a set of weights that makes E as close as possible to 0. Once the perceptron learns to fit the unthresholded outputs wTxi + b to the desired outputs yi, we can take the same weights, apply them to an input vector xi, and pass the result through a threshold function f to obtain a perceptron output \( {\hat{y}}^i \) that correctly classifies xi. For example, suppose the vectors xi belong to two classes, yi = 1 and yi = 0; if wTxi + b is close to 1 for the vectors of the first class and close to 0 for those of the second, then an activation function f that outputs 1 when wTxi + b ≥ 0.5 and 0 otherwise will classify the xi correctly.

So, we will be interested in minimizing the error function for the output \( {\hat{y}}^i \):

$$ E\left(w,b\right)=\frac{1}{2N}{\sum}_{i=1}^N{\left({y}^i-\left({w}^T{x}^i+b\right)\right)}^2 $$

As was the case with the regression, we will use gradient descent to minimize the error function until convergence is reached.

$$ \frac{\partial E}{\partial {w}_j}=\frac{\partial \left(\frac{1}{2N}{\sum}_{i=1}^N{\left({y}^i-\left(\sum \limits_{j=1}^n{w}_j{x}_j^i+{w}_0\right)\right)}^2\right)}{\partial {w}_j} $$

The error ei made for the ith sample can be expressed as follows:

$$ {e}^i={y}^i-\left(\sum \limits_{j=1}^n{w}_j{x}_j^i+{w}_0\right) $$

Hence

$$ \frac{\partial E}{\partial {w}_j}=\frac{\partial \left(\frac{1}{2N}{\sum}_{i=1}^N{\left({e}^i\right)}^2\right)}{\partial {w}_j} $$
$$ \frac{\partial E}{\partial {w}_j}=\frac{1}{2N}\frac{\partial \left({\sum}_{i=1}^N{\left({e}^i\right)}^2\right)}{\partial {w}_j} $$
$$ \frac{\partial E}{\partial {w}_j}=\frac{1}{2N}{\sum}_{i=1}^N\frac{\partial {\left({e}^i\right)}^2}{\partial {w}_j} $$
$$ \frac{\partial E}{\partial {w}_j}=\frac{1}{2N}{\sum}_{i=1}^N2{e}^i\frac{\partial \left({e}^i\right)}{\partial {w}_j} $$
$$ \frac{\partial E}{\partial {w}_j}=\frac{1}{N}{\sum}_{i=1}^N{e}^i\frac{\partial \left({e}^i\right)}{\partial {w}_j} $$
$$ \frac{\partial E}{\partial {w}_j}=\frac{1}{N}{\sum}_{i=1}^N{e}^i\frac{\partial \left({y}^i-\left(\sum \limits_{j=1}^n{w}_j{x}_j^i+{w}_0\right)\right)}{\partial {w}_j} $$
$$ \frac{\partial E}{\partial {w}_j}=\frac{1}{N}{\sum}_{i=1}^N\left({e}^i\left(-{x}_j^i\right)\right) $$
$$ \frac{\partial E}{\partial {w}_j}=-\frac{1}{N}{\sum}_{i=1}^N{e}^i{x}_j^i=-\frac{1}{N}{\sum}_{i=1}^N\left(\left({y}^i-\left(\sum \limits_{j=1}^n{w}_j{x}_j^i+{w}_0\right)\right){x}_j^i\right) $$

Now, we will compute \( \frac{\partial E}{\partial b} \), which is \( \frac{\partial E}{\partial {w}_0} \), where w0 is the weight for \( {x}_0^i \) and \( {x}_0^i=1 \).

$$ \frac{\partial E}{\partial b}=\frac{\partial E}{\partial {w}_0}=-\frac{1}{N}{\sum}_{i=1}^N\left(\left({y}^i-\left(\sum \limits_{j=1}^n{w}_j{x}_j^i+{w}_0\right)\right){x}_0^i\right)=-\frac{1}{N}{\sum}_{i=1}^N\left({y}^i-\left(\sum \limits_{j=1}^n{w}_j{x}_j^i+{w}_0\right)\right) $$

We can start at the first iteration at a random value for the weights w and the bias b; then, we adjust these values at the next iteration based on the following formula, where k is the current iteration:

$$ {w}_{\left(k+1\right)}={w}_{(k)}-\alpha \frac{\partial E}{\partial {w}_{(k)}} $$
$$ {b}_{\left(k+1\right)}={b}_{(k)}-\alpha \frac{\partial E}{\partial {b}_{(k)}} $$

where α is the learning rate (e.g., 0.01) and \( \frac{\partial E}{\partial {w}_{(k)}} \) and \( \frac{\partial E}{\partial {b}_{(k)}} \) are the gradient of E with respect to w and b, at iteration k, respectively.

The updates of the perceptron parameters w and b are calculated as the difference (represented by delta Δ) between their values at the next iteration k + 1 and at the current iteration k [7]:

$$ {w}_{\left(k+1\right)}={w}_{(k)}-\alpha \frac{\partial E}{\partial {w}_{(k)}} $$
$$ {w}_{j\left(k+1\right)}={w}_{j(k)}+\alpha \frac{1}{N}{\sum}_{i=1}^N{e}^i{x}_j^i $$
$$ {w}_{j\left(k+1\right)}={w}_{j(k)}+\frac{\alpha }{N}{\sum}_{i=1}^N\left(\left({y}^i-\left(\sum \limits_{j=1}^n{w}_{j(k)}{x}_j^i+{w}_{0(k)}\right)\right){x}_j^i\right) $$
$$ {b}_{\left(k+1\right)}={b}_{(k)}-\alpha \frac{\partial E}{\partial {b}_{(k)}} $$
$$ {b}_{\left(k+1\right)}={b}_{(k)}+\alpha \frac{1}{N}{\sum}_{i=1}^N{e}^i $$
$$ {b}_{\left(k+1\right)}={b}_{(k)}+\frac{\alpha }{N}{\sum}_{i=1}^N\left({y}^i-\left(\sum \limits_{j=1}^n{w}_{j(k)}{x}_j^i+{w}_{0(k)}\right)\right), $$

which is equivalent to writing \( {w}_{0\left(k+1\right)}={w}_{0(k)}+\alpha \frac{1}{N}{\sum}_{i=1}^N\left({y}^i-\left(\sum \limits_{j=1}^n{w}_{j(k)}{x}_j^i+{w}_{0(k)}\right)\right) \).

To train the perceptron, we can proceed by:

  1. Forward calculation: calculating wTxi + b for all xi. Such a run through the N instances of the training set is called an epoch.

  2. Updating the weights and the bias w(k + 1) and w0(k + 1).

  3. Repeating steps 1 and 2 until E(w, b) converges.

  4. Using the last calculated weights and bias to predict the output y for any new input x.

As we can guess, the perceptron is a linear model and cannot solve a nonlinear problem.

In practice, using all the available instances to make a single update of the weights can be extremely slow; instead, we sample a smaller random batch of the training dataset to compute each update. This method is called minibatch stochastic gradient descent.
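The training procedure can be sketched as follows; the dataset, learning rate, and function names are illustrative, not from the text. With the batch size equal to N this performs the batch updates derived above; a smaller batch size gives minibatch stochastic gradient descent.

```python
import random

# Gradient-descent training of the unthresholded perceptron with the MSE
# error from the text; a sketch on a toy dataset, not a production routine.

def train(data, n_features, lr=0.05, epochs=1000, batch_size=None):
    w = [0.0] * n_features
    b = 0.0
    batch_size = batch_size or len(data)   # default: full-batch updates
    for _ in range(epochs):
        random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            n = len(batch)
            # e^i = y^i - (w . x^i + b) for each sample in the batch
            errs = [(y - (sum(wj * xj for wj, xj in zip(w, x)) + b), x)
                    for x, y in batch]
            # w_j <- w_j + (alpha/N) sum_i e^i x_j^i ; b <- b + (alpha/N) sum_i e^i
            for j in range(n_features):
                w[j] += lr / n * sum(e * x[j] for e, x in errs)
            b += lr / n * sum(e for e, _ in errs)
    return w, b

# Learn to separate points above the line x2 = x1 (class 1) from those below (class 0).
data = [([1, 3], 1), ([0, 2], 1), ([2, 5], 1), ([3, 1], 0), ([2, 0], 0), ([5, 2], 0)]
w, b = train(list(data), n_features=2)
predict = lambda x: 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0.5 else 0
print([predict(x) for x, _ in data])  # matches the labels [1, 1, 1, 0, 0, 0]
```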

11.3.6 Perceptron Limitations: XOR Modeling

The exclusive-or (XOR) function is a function with two input variables, x1 and x2, whose output is 1 if exactly one of x1 and x2 is 1, and 0 otherwise. The truth table is shown in Table 11.2, and the corresponding plot is in Fig. 11.20.

Table 11.2 XOR function truth table
Fig. 11.20 The XOR function with two discriminant lines

To model the XOR function, we need to find a line that separates outputs 1 (black dots in Fig. 11.20) from outputs 0 (grey dots in Fig. 11.20). However, we cannot find a linear function that separates those outputs; we can see an example failing to represent XOR in Fig. 11.20.

The perceptron is a linear classifier and hence cannot correctly classify the XOR function. We can, however, extend the perceptron by adding more layers so that it becomes a multilayer perceptron (MLP), which enables it to model complex nonlinear functions and, virtually, any function.
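To see how extra layers help, here is a hand-built (not trained) two-layer network that computes XOR; the weights are chosen by hand for illustration:

```python
# A hand-built two-layer perceptron for XOR: h1 fires for "x1 OR x2",
# h2 fires for "x1 AND x2", and the output fires for "OR but not AND",
# which is exactly XOR. Weights and thresholds are chosen by hand.

def step(z):
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    h1 = step(x1 + x2 - 0.5)     # OR neuron
    h2 = step(x1 + x2 - 1.5)     # AND neuron
    return step(h1 - h2 - 0.5)   # OR and not AND

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_mlp(x1, x2))
# 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```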

11.3.7 Multilayer Perceptron (MLP)

A multilayer perceptron (MLP) has one or more hidden layers. Figure 11.21 shows an example with one input layer, two hidden layers, and one output layer. It is important to note that the perceptron principles operate in the hidden layers and the output layer; the input layer simply holds the data instances from the training dataset, and no perceptron computation is involved in it. That is why many authors do not count the input layer in the total number of layers, although some authors do.

Fig. 11.21 An MLP with two hidden layers and one output

The inputs in Fig. 11.21 and their corresponding weights are fed into the first hidden layer, composed of several neurons, which in turn generate their own outputs using an activation function f1 and feed them with their own weights to the second hidden layer, which in turn applies an activation function f2 and feeds its outputs to the output layer; the latter applies an activation function f3 and generates the final MLP output.

The activation functions in each layer (i.e., f1, f2, and f3) can be the same or different. If the activation function of the output layer is a linear function, the MLP generates a regression model, while if it is nonlinear, then the model is nonlinear (i.e., if the function is logistic, then the model is logistic regression or binary classification) [8]. Figure 11.22 shows an MLP with one hidden layer and one output, while Fig. 11.23 shows an MLP with one hidden layer and three outputs.

Fig. 11.22 An MLP with one hidden layer and one output y

Fig. 11.23 An MLP with one hidden layer and three outputs y1, y2, and y3

Within an MLP, we have different weights and different biases (i.e., constant b) for each layer; hence, expressing the learning problem becomes more elaborate, but it follows the same principle as in the case of one perceptron.

What do hidden layers do exactly? We will continue the practical example to answer this question.

We have seen in Fig. 11.3 that we need two lines to separate the two given classes. We know that a perceptron models a linear function; needing n lines to separate two classes is equivalent to saying that we need n perceptrons. In our example, we will need two perceptrons in one hidden layer, where each hidden perceptron (i.e., neuron) produces one line. Since we need to join the two lines in order to have one model that separates the two classes, we will need to join the two neurons’ outputs into one neuron (the output neuron). The result is shown in Fig. 11.24.

Fig. 11.24
A chart of a multilayer perceptron with two neurons hidden layer and nonlinear model output. The chart from left to right has an input layer, a hidden layer L subscript 1, and an output layer.

A multilayer perceptron with a two-neuron hidden layer to model a nonlinear classifier solving the problem in Fig. 11.3

This example illustrates the benefit of hidden neurons; in complex real-life problems, we cannot simply guess the number of hidden neurons and the number of hidden layers required to create the model.
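To make the idea concrete, here is a hand-built network in the spirit of Fig. 11.24: two hidden step-activation perceptrons, each realizing one line, and an output perceptron that fires only when both agree. The particular lines and test points are invented for illustration and are not those of Fig. 11.3:

```python
import numpy as np

def step(z):
    # A step activation: 1 if the weighted sum is positive, else 0
    return (z > 0).astype(int)

def predict(X):
    # Hidden neuron 1 fires for points below the line x2 = x1 + 1
    h1 = step(X[:, 0] - X[:, 1] + 1.0)
    # Hidden neuron 2 fires for points above the line x2 = x1 - 1
    h2 = step(X[:, 1] - X[:, 0] + 1.0)
    # Output neuron: class 1 only when BOTH hidden neurons fire,
    # i.e., the point lies in the band between the two lines
    return step(h1 + h2 - 1.5)

X = np.array([[0.0, 0.0], [2.0, 2.1], [0.0, 3.0], [3.0, 0.0]])
print(predict(X))   # prints [1 1 0 0]: first two points inside the band
```

A single perceptron could not express this band, but two hidden lines joined by one output neuron can, which is exactly the point of the figure.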

11.3.8 MLP Algorithm Overview

The MLP follows the algorithm below:

  • Initialize the input layer

  • Initialize the weights’ vectors and bias vectors for all layers

  • PHASE 1: forward computation

  • For each layer l from layer 2 to the output layer L (layer 1 being the MLP inputs)

    • For each neuron i in layer l

      • Compute the sum based on layer l’s weights, bias, and the previous layer outputs

        $$ {z}_i^{(l)}={\sum}_{j=1}^{N^{\left(l-1\right)}}{w}_{ij}^{(l)}\times {a}_j^{\left(l-1\right)}+{b}_i^{(l)} $$
      • Compute the output based on the previous sum

        $$ {a}_i^{(l)}={f}^{(l)}\left({z}_i^{(l)}\right) $$
    • End For

  • End For
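The forward computation phase above can be sketched in a few lines of NumPy; the sigmoid activation, the 2-3-1 architecture, and the random weights are illustrative choices of ours, not values from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, activations):
    """PHASE 1: compute a^{(l)} layer by layer, starting from the inputs.

    weights[l] has shape (N_l, N_{l-1}); biases[l] has shape (N_l,).
    Returns the activations of every layer (backpropagation needs them later).
    """
    a = [x]
    for W, b, f in zip(weights, biases, activations):
        z = W @ a[-1] + b   # z_i^{(l)} = sum_j w_ij^{(l)} a_j^{(l-1)} + b_i^{(l)}
        a.append(f(z))      # a_i^{(l)} = f^{(l)}(z_i^{(l)})
    return a

# A 2-3-1 network with small random weights (illustrative only)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [np.zeros(3), np.zeros(1)]
acts = forward(np.array([0.5, -0.2]), weights, biases, [sigmoid, sigmoid])
print(acts[-1])   # the final MLP output
```

Note how the two nested For loops of the algorithm collapse into one matrix-vector product per layer.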

Learning the model entails iteratively calculating the gradient for a cost function such as the mean squared error (MSE) until the minimum is found (the algorithm converges).

The example here uses the MSE:

$$ E(X)=\frac{1}{2N}{\sum}_{i=1}^N{\left({\hat{y}}_i-{y}_i\right)}^2 $$

Starting with the last layer and working backward, compute for every neuron

$$ \frac{\partial E(X)}{\partial {w}_{ji}^l}\ \mathrm{and}\ \frac{\partial E(X)}{\partial {b}_i^l} $$

PHASE 2: Backward propagation

We start from the output layer and update the weights and biases of each layer backward: layer L first, then L-1, and we continue until the first layer of weights is updated.

Weights’ and biases’ update (t denotes the training round):

$$ {w}_{ji}^l\left(t+1\right)={w}_{ji}^l(t)-\alpha \frac{\partial E(X)}{\partial {w}_{ji}^l} $$
$$ {b}_i^l\left(t+1\right)={b}_i^l(t)-\alpha \frac{\partial E(X)}{\partial {b}_i^l} $$

Then we repeat the two phases until convergence, i.e., the cost is less than a certain threshold.

11.3.9 Backpropagation

The problem we are facing is to find a method to minimize the error the MLP can produce by minimizing the error function that estimates the difference between the final output of the MLP and the expected outcome.

Backpropagation is a technique that allows us to achieve such aim; it performs a gradient descent by working backward from the output layer to the input layer, calculating in each layer the gradient of the error function with respect to the neural network’s weights. The gradients of the last layer of weights are computed first and then used in computation of the gradient for the previous layer; the process continues until we reach the first layer of weights [9].

The mathematical notation is complex if we start with a fully connected neural network, so we will begin with a simple example and then move toward the fully connected case.

We will use the following notation:

  • E denotes our error (i.e., cost) function

  • L denotes the number of layers

  • Nl denotes the number of neurons in layer l

  • \( {w}_{ij}^{(l)} \) denotes the weight for neuron i in layer l in relation to the incoming neuron j in layer l-1

  • \( {b}_i^{(l)} \) denotes the bias for neuron i in layer l

  • \( {z}_i^{(l)} \) denotes the product sum plus bias for neuron i in layer l: \( {z}_i^{(l)}={\sum}_{j=1}^{N^{\left(l-1\right)}}{w}_{ij}^{(l)}{a}_j^{\left(l-1\right)}+{b}_i^{(l)} \)

  • σ denotes a nonlinear activation function in layer l

  • \( {a}_i^{(l)} \) denotes the output at a neuron i in layer l: \( {a}_i^{(l)}=\sigma \left({z}_i^{(l)}\right) \)

  • a(l) denotes the output vector for layer l: \( {a}^{(l)}=\left\{{a}_1^{(l)},{a}_2^{(l)},\dots {a}_{N_l}^{(l)}\right\} \);

  • \( {w}_i^{(l)} \) denotes the weight vector for neuron i in layer l: \( {w}_i^{(l)}=\left\{{w}_{i1}^{(l)},{w}_{i2}^{(l)},\dots, {w}_{i{N}_{l-1}}^{(l)}\right\} \)


11.3.9.1 Simple 1–1–1 Network

Let us take an example of a three-layer neural network (L = 3) with an input layer with 1 neuron, a hidden layer with 1 neuron, and an output layer with 1 neuron (Fig. 11.25).

Fig. 11.25
A schematic diagram of a three layer neural network with the layers from left to right labeled as layer L 2, layer L 1, and layer L.

A three-layer neural network; each layer has one neuron

There are only three layers: layer L (output), layer L-1 (hidden), and layer L-2 (input). We have one neuron in each layer, so we will not use the subscript i; for example, instead of \( {a}_i^l \) we will use al, and the same applies for all other notations.

$$ {a}^{(L)}=\sigma \left({z}^{(L)}\right) $$
$$ {z}^{(L)}={w}^{\left(L-1\right)}{a}^{\left(L-1\right)}+{b}^{\left(L-1\right)} $$
$$ {z}^{\left(L-1\right)}={w}^{\left(L-2\right)}{a}^{\left(L-2\right)}+{b}^{\left(L-2\right)} $$

We need to compute the gradient (partial derivative) of the error function E (or cost function C) with respect to the weights and the biases. That is, we would like to know how our cost function would change if we changed the weights and biases of the network.

Starting in the last layer, we then investigate how this gradient propagates backward through the network.

11.3.9.1.1 Computation with Respect to Layer L-1

Having one node per layer will help us understand the computational work and its implications. We will start with the layer L and calculate the gradient of our error function E with respect to the weights and biases of neurons in the previous layer L−1, \( \frac{\partial E}{\partial {w}^{\left(L-1\right)}} \). Figure 11.26 clarifies the relationship between the cost function and the weights and biases.

Fig. 11.26
A schematic diagram of a relationship between the cost function and the weights and biases. A of L 1, b of L 2, and a dashed arrow labeled w of L 2 point towards a circle in layer l and points E towards the error function.

The cost function E’s relationship with the weights and biases passes through a chain from E to a, from a to z, and from z to the weights and biases

Let us consider the mean squared error as an error function. Since we have only one neuron in the output:

$$ E=\frac{1}{2}{\left({a}^{(L)}-y\right)}^2 $$
$$ \left(\mathrm{the}\ \frac{1}{2}\ \mathrm{is}\ \mathrm{for}\ \mathrm{convenience}\right) $$

Using the chain rule, we can write:

$$ \frac{\partial E}{\partial {w}^{\left(L-1\right)}}=\frac{\partial E}{\partial {a}^{(L)}}\frac{\partial {a}^{(L)}}{\partial {z}^{(L)}}\frac{\partial {z}^{(L)}}{\partial {w}^{\left(L-1\right)}} $$
$$ \frac{\partial E}{\partial {a}^{(L)}}=\frac{1}{2}\frac{\partial {\left({a}^{(L)}-y\right)}^2}{\partial {a}^{(L)}}=\frac{1}{2}2\left({a}^{(L)}-y\right)=\left({a}^{(L)}-y\right) $$
$$ \frac{\partial {a}^{(L)}}{\partial {z}^{(L)}}=\frac{\partial \sigma \left({z}^{(L)}\right)}{\partial {z}^{(L)}}={\sigma}^{\prime}\left({z}^{(L)}\right) $$
$$ \frac{\partial {z}^{(L)}}{\partial {w}^{\left(L-1\right)}}=\frac{\partial \left({w}^{\left(L-1\right)}{a}^{\left(L-1\right)}+{b}^{\left(L-1\right)}\right)}{\partial {w}^{\left(L-1\right)}}={a}^{\left(L-1\right)} $$

Hence, we can solve \( \frac{\partial E}{\partial {w}^{\left(L-1\right)}} \):

$$ \frac{\boldsymbol{\partial E}}{\boldsymbol{\partial}{\boldsymbol{w}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}}=\left({\boldsymbol{a}}^{\left(\boldsymbol{L}\right)}-\boldsymbol{y}\right){\boldsymbol{\sigma}}^{\prime}\left({\boldsymbol{z}}^{\left(\boldsymbol{L}\right)}\right)\ {\boldsymbol{a}}^{\left(\boldsymbol{L}-\mathbf{1}\right)} $$

Note that the sigmoid is used here as the nonlinear activation function; then \( \sigma =\frac{1}{1+{e}^{\left(-z\right)}} \), and its derivative is \( {\sigma}^{\prime }=\frac{e^{\left(-z\right)}}{{\left(1+{e}^{\left(-z\right)}\right)}^2} \). Also note that we would like to see how the cost function changes with the change of the weight w(L − 2); we will come to that shortly. First, let us see how the cost function changes with the change of the bias b(L − 1). Using the chain rule, \( \frac{\partial E}{\partial {b}^{\left(L-1\right)}} \) can be written as follows:

$$ \frac{\partial E}{\partial {b}^{\left(L-1\right)}}=\frac{\partial E}{\partial {a}^{(L)}}\frac{\partial {a}^{(L)}}{\partial {z}^{(L)}}\frac{\partial {z}^{(L)}}{\partial {b}^{\left(L-1\right)}} $$
$$ \frac{\partial E}{\partial {a}^{(L)}}=\left({a}^{(L)}-y\right) $$
$$ \frac{\partial {a}^{(L)}}{\partial {z}^{(L)}}={\sigma}^{\prime}\left({z}^{(L)}\right) $$
$$ \frac{\partial {z}^{(L)}}{\partial {b}^{\left(L-1\right)}}=\frac{\partial \left({w}^{\left(L-1\right)}{a}^{\left(L-1\right)}+{b}^{\left(L-1\right)}\right)}{\partial {b}^{\left(L-1\right)}}=1 $$

Therefore,

$$ \frac{\boldsymbol{\partial E}}{\boldsymbol{\partial}{\boldsymbol{b}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}}=\left({\boldsymbol{a}}^{\left(\boldsymbol{L}\right)}-\boldsymbol{y}\right){\boldsymbol{\sigma}}^{\prime}\left({\boldsymbol{z}}^{\left(\boldsymbol{L}\right)}\right) $$

So, based on the weights and biases’ initial values, we have used a training instance to compute the predicted output a(L), then computed the gradient of the cost (i.e., error) with respect to the weights and biases, as we have just seen. We can use those gradients in the following equations to update the weights and biases before going forward with another training round (time t + 1):

$$ {w}^{\left(L-1\right)}\left(t+1\right)={w}^{\left(L-1\right)}(t)-\alpha \frac{\partial E}{\partial {w}^{\left(L-1\right)}} $$
$$ {b}^{\left(L-1\right)}\left(t+1\right)={b}^{\left(L-1\right)}(t)-\alpha \frac{\partial E}{\partial {b}^{\left(L-1\right)}} $$

where α is the training rate and t denotes a round of training.
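A good sanity check on the derived gradient \( \frac{\partial E}{\partial {w}^{\left(L-1\right)}} \) is to compare it against a finite-difference estimate of the same derivative. The network values below (input, target, weights, and biases of the 1–1–1 network) are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A 1-1-1 network with arbitrary values (illustrative)
a_L2, y = 0.8, 1.0    # input a^{(L-2)} and target y
w1, b1 = 0.5, 0.1     # w^{(L-2)}, b^{(L-2)}: feed the hidden layer L-1
w2, b2 = -0.3, 0.2    # w^{(L-1)}, b^{(L-1)}: feed the output layer L

def cost(w2_val):
    # E = (1/2)(a^{(L)} - y)^2 as a function of w^{(L-1)}
    a_L1 = sigmoid(w1 * a_L2 + b1)
    a_L = sigmoid(w2_val * a_L1 + b2)
    return 0.5 * (a_L - y) ** 2

# Analytical gradient from the chain rule: (a^{(L)} - y) σ'(z^{(L)}) a^{(L-1)}
a_L1 = sigmoid(w1 * a_L2 + b1)
z_L = w2 * a_L1 + b2
sigma_prime = sigmoid(z_L) * (1 - sigmoid(z_L))
grad_analytic = (sigmoid(z_L) - y) * sigma_prime * a_L1

# Numerical gradient by central finite differences
eps = 1e-6
grad_numeric = (cost(w2 + eps) - cost(w2 - eps)) / (2 * eps)
print(grad_analytic, grad_numeric)   # the two values agree closely
```

The same check can be repeated for the bias gradient, dropping the trailing a^{(L-1)} factor, as derived above.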

11.3.9.1.2 Computation with Respect to Layer L-2

We will now proceed further up the network and calculate the gradient of our error function E with respect to the weights and biases of neurons in the previous layer L-2. This is a measurement of how much E changes with respect to changes in weights and biases at level L-2.

But before we can know that, we will need \( \frac{\partial E}{\partial {a}^{\left(L-1\right)}} \), so let us compute that derivative.

$$ \frac{\partial E}{\partial {a}^{\left(L-1\right)}}=\frac{\partial E}{\partial {a}^{(L)}}\frac{\partial {a}^{(L)}}{\partial {z}^{(L)}}\frac{\partial {z}^{(L)}}{\partial {a}^{\left(L-1\right)}} $$
$$ \frac{\partial E}{\partial {a}^{(L)}}=\left({a}^{(L)}-y\right) $$
$$ \frac{\partial {a}^{(L)}}{\partial {z}^{(L)}}={\sigma}^{\prime}\left({z}^{(L)}\right) $$
$$ \frac{\partial {z}^{(L)}}{\partial {a}^{\left(L-1\right)}}={w}^{\left(L-1\right)} $$

Hence,

$$ \frac{\partial E}{\partial {a}^{\left(L-1\right)}}=\left({a}^{(L)}-y\right){\sigma}^{\prime}\left({z}^{(L)}\right){w}^{\left(L-1\right)} $$

Now that we have found the gradient of E with respect to a(L − 1), we can proceed with our investigation:

$$ \frac{\partial E}{\partial {w}^{\left(L-2\right)}}=\frac{\partial E}{\partial {a}^{\left(L-1\right)}}\frac{\partial {a}^{\left(L-1\right)}}{\partial {z}^{\left(L-1\right)}}\frac{\partial {z}^{\left(L-1\right)}}{\partial {w}^{\left(L-2\right)}} $$
$$ \frac{\partial {a}^{\left(L-1\right)}}{\partial {z}^{\left(L-1\right)}}=\frac{\partial \sigma \left({z}^{\left(L-1\right)}\right)}{\partial {z}^{\left(L-1\right)}}={\sigma}^{\prime}\left({z}^{\left(L-1\right)}\right) $$
$$ \frac{\partial {z}^{\left(L-1\right)}}{\partial {w}^{\left(L-2\right)}}=\frac{\partial \left({w}^{\left(L-2\right)}{a}^{\left(L-2\right)}+{b}^{\left(L-2\right)}\right)}{\partial {w}^{\left(L-2\right)}}={a}^{\left(L-2\right)} $$
$$ \frac{\partial E}{\partial {w}^{\left(L-2\right)}}=\frac{\partial E}{\partial {a}^{\left(L-1\right)}}{\boldsymbol{\sigma}}^{\prime}\left({\boldsymbol{z}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}\right)\ {\boldsymbol{a}}^{\left(\boldsymbol{L}-\mathbf{2}\right)} $$
$$ \frac{\boldsymbol{\partial E}}{\boldsymbol{\partial}{\boldsymbol{w}}^{\left(\boldsymbol{L}-\mathbf{2}\right)}}=\left({\boldsymbol{a}}^{\left(\boldsymbol{L}\right)}-\boldsymbol{y}\right){\boldsymbol{\sigma}}^{\prime}\left({\boldsymbol{z}}^{\left(\boldsymbol{L}\right)}\right){\boldsymbol{w}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}{\boldsymbol{\sigma}}^{\prime}\left({\boldsymbol{z}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}\right){\boldsymbol{a}}^{\left(\boldsymbol{L}-\mathbf{2}\right)} $$

Similarly, \( \frac{\partial E}{\partial {b}^{\left(L-2\right)}} \) can be computed as follows:

$$ \frac{\partial E}{\partial {b}^{\left(L-2\right)}}=\frac{\partial E}{\partial {a}^{\left(L-1\right)}}\frac{\partial {a}^{\left(L-1\right)}}{\partial {z}^{\left(L-1\right)}}\frac{\partial {z}^{\left(L-1\right)}}{\partial {b}^{\left(L-2\right)}} $$
$$ \frac{\partial {a}^{\left(L-1\right)}}{\partial {z}^{\left(L-1\right)}}={\sigma}^{\prime}\left({z}^{\left(L-1\right)}\right) $$
$$ \frac{\partial {z}^{\left(L-1\right)}}{\partial {b}^{\left(L-2\right)}}=\frac{\partial \left({w}^{\left(L-2\right)}{a}^{\left(L-2\right)}+{b}^{\left(L-2\right)}\right)}{\partial {b}^{\left(L-2\right)}}=1 $$
$$ \frac{\boldsymbol{\partial E}}{\boldsymbol{\partial}{\boldsymbol{b}}^{\left(\boldsymbol{L}-\mathbf{2}\right)}}=\frac{\partial E}{\partial {a}^{\left(L-1\right)}}{\boldsymbol{\sigma}}^{\prime}\left({\boldsymbol{z}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}\right) $$
$$ \frac{\boldsymbol{\partial E}}{\boldsymbol{\partial}{\boldsymbol{b}}^{\left(\boldsymbol{L}-\mathbf{2}\right)}}=\left({\boldsymbol{a}}^{\left(\boldsymbol{L}\right)}-\boldsymbol{y}\right){\boldsymbol{\sigma}}^{\prime}\left({\boldsymbol{z}}^{\left(\boldsymbol{L}\right)}\right){\boldsymbol{w}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}{\boldsymbol{\sigma}}^{\prime}\left({\boldsymbol{z}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}\right) $$

We can also update the weights in level L-2 using the usual formula:

$$ {w}^{\left(L-2\right)}\left(t+1\right)={w}^{\left(L-2\right)}(t)-\alpha \frac{\boldsymbol{\partial E}}{\boldsymbol{\partial}{\boldsymbol{w}}^{\left(\boldsymbol{L}-\mathbf{2}\right)}} $$
$$ {b}^{\left(L-2\right)}\left(t+1\right)={b}^{\left(L-2\right)}(t)-\alpha \frac{\boldsymbol{\partial E}}{\boldsymbol{\partial}{\boldsymbol{b}}^{\left(\boldsymbol{L}-\mathbf{2}\right)}} $$

11.3.9.2 Fully Connected Neural Network

There are a few adjustments that we have to consider when we have a fully connected neural network.

The mean squared error function E is still a function of the weights vector and the bias b but is now expressed as an average:

$$ E\left(w,b\right)=\frac{1}{2N}{\sum}_{i=1}^N{\left({a}_i-{y}_i\right)}^2 $$

where N is the number of instances in the training set.

11.3.9.2.1 Computation with Respect to Layer L-1
$$ \frac{\boldsymbol{\partial E}}{\boldsymbol{\partial}{{\boldsymbol{w}}_{\boldsymbol{i}\boldsymbol{j}}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}}=\frac{\boldsymbol{\partial E}}{\boldsymbol{\partial}{{\boldsymbol{a}}_{\boldsymbol{i}}}^{\left(\boldsymbol{L}\right)}}\frac{\boldsymbol{\partial}{{\boldsymbol{a}}_{\boldsymbol{i}}}^{\left(\boldsymbol{L}\right)}}{\boldsymbol{\partial}{{\boldsymbol{z}}_{\boldsymbol{i}}}^{\left(\boldsymbol{L}\right)}}\frac{\boldsymbol{\partial}{{\boldsymbol{z}}_{\boldsymbol{i}}}^{\left(\boldsymbol{L}\right)}}{\boldsymbol{\partial}{{\boldsymbol{w}}_{\boldsymbol{i}\boldsymbol{j}}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}} $$
$$ \frac{\boldsymbol{\partial E}}{\boldsymbol{\partial}{{\boldsymbol{w}}_{\boldsymbol{i}\boldsymbol{j}}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}}=\left({{\boldsymbol{a}}_{\boldsymbol{i}}}^{\left(\boldsymbol{L}\right)}-{\boldsymbol{y}}_{\boldsymbol{i}}\right){\boldsymbol{\sigma}}^{\prime}\left({{\boldsymbol{z}}_{\boldsymbol{i}}}^{\left(\boldsymbol{L}\right)}\right){{\boldsymbol{a}}_{\boldsymbol{i}}}^{\left(\boldsymbol{L}-\mathbf{1}\right)} $$
$$ \frac{\partial E}{\partial {a_j}^{\left(L-1\right)}}={\sum}_{i=1}^{N^{(L)}}\frac{\partial E}{\partial {a_i}^{(L)}}\frac{\partial {a_i}^{(L)}}{\partial {z_i}^{(L)}}\frac{\partial {z_i}^{(L)}}{\partial {a_j}^{\left(L-1\right)}} $$

The sum appears because the activation a_j of every neuron in layer L-1 affects the activations of all the neurons in layer L, which in turn affect the cost of the neural network.

11.3.9.2.2 Computation with Respect to Layer L-2
$$ \frac{\boldsymbol{\partial E}}{\boldsymbol{\partial}{{\boldsymbol{w}}_{\boldsymbol{i}\boldsymbol{j}}}^{\left(\boldsymbol{L}-\mathbf{2}\right)}}=\frac{\boldsymbol{\partial E}}{\boldsymbol{\partial}{{\boldsymbol{a}}_{\boldsymbol{i}}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}}\frac{\boldsymbol{\partial}{{\boldsymbol{a}}_{\boldsymbol{i}}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}}{\boldsymbol{\partial}{{\boldsymbol{z}}_{\boldsymbol{i}}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}}\frac{\boldsymbol{\partial}{{\boldsymbol{z}}_{\boldsymbol{i}}}^{\left(\boldsymbol{L}-\mathbf{1}\right)}}{\boldsymbol{\partial}{{\boldsymbol{w}}_{\boldsymbol{i}\boldsymbol{j}}}^{\left(\boldsymbol{L}-\mathbf{2}\right)}} $$

We can see clearly from the above that the error in a layer l depends on the error in the next layer l + 1; hence, the errors propagate backward from the last to the first layer. Once we compute the error at the output layer and once the partial derivatives for all the neurons are known, the weights can be updated. The process is repeated until convergence.

Note that strictly speaking, the term “backpropagation” refers not to the learning process but to the method used to compute the gradient [10].

11.3.10 Backpropagation Algorithm

The backpropagation algorithm runs in five steps:

  1. Forward phase: Proceeding from the input layer to the output layer, for each input-output pair in the training dataset, calculate the predicted output and save the result for each neuron.

  2. Backward phase: Proceeding from the output layer to the input layer, calculate and save the resulting gradients.

  3. Combine the individual gradients to obtain the total gradient.

  4. Update the weights using α and the total gradient.

  5. Repeat until the minimum cost is reached.
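These steps can be sketched in a compact NumPy training loop on the classic XOR problem. The architecture (a 2-3-1 sigmoid network), the seed, the learning rate, and the number of epochs are illustrative choices of ours, not values from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: a problem a single perceptron cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 3)), np.zeros((1, 3))   # hidden layer, 3 neurons
W2, b2 = rng.normal(size=(3, 1)), np.zeros((1, 1))   # output layer
alpha = 2.0                                          # learning rate (hand-tuned)
losses = []

for epoch in range(10000):
    # Step 1 - forward phase: compute and save each layer's output
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    losses.append(float(((a2 - y) ** 2).mean()))
    # Step 2 - backward phase: gradients of the squared-error cost
    d2 = (a2 - y) * a2 * (1 - a2) / len(X)
    d1 = (d2 @ W2.T) * a1 * (1 - a1)
    # Steps 3-4 - combine the per-instance gradients and update
    W2 -= alpha * a1.T @ d2; b2 -= alpha * d2.sum(axis=0)
    W1 -= alpha * X.T @ d1;  b1 -= alpha * d1.sum(axis=0)

print(losses[0], losses[-1])   # the cost decreases as training repeats (step 5)
```

Matrix operations process all four training instances at once, so one pass of the loop performs the forward and backward phases for the whole dataset.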

11.4 Final Notes: Advantages, Disadvantages, and Best Practices

Neural networks are considered among the most prominent ML models given their ability to deal with different types of outputs, including discrete values, real values, vectors, and images, among many others. These models can learn and model complex, nonlinear, and highly volatile data. Their architecture makes them robust to noise during the training period. Even with long training periods, neural networks can generate interesting results. Note that neural networks can also be used for anomaly detection (even when dealing with unlabeled data); in this case, the learning results can be used to give a fast second opinion with good accuracy in the target application.

Like many ML models, neural networks need parallel processing power, which makes them hardware dependent in a way. Although they give promising results, those results are in many cases unexplainable in terms of why and how a decision was reached, which might affect trust in such models. In terms of technical structure, there is no well-defined rule on how to design the architecture (number of hidden layers, number of hidden nodes, error thresholds for the best training time and optimal results); it is more of a trial-and-error process.

With this in mind, we tend to depend on best practices to optimize neural network results. Some key practices include the following:

  • Always check the size of the training data; if it is not sufficient, increase it.

  • If the model overfits, you can use a simpler network (a smaller number of hidden layers/nodes), use dropout layers, increase the number of data samples, or remove some features (i.e., preprocess the data again).

  • If the model underfits, you can add more features (using feature engineering techniques).

  • Starting with a large batch size can reduce the training time in some cases.

  • If the model suffers from the vanishing gradient problem, using a lower learning rate might allow the model to converge.

  • Normalizing the inputs of every layer might help the stability and performance of the model.

11.5 Key Terms

  1. Artificial neural networks (ANN)

  2. McCulloch–Pitts (M-P) neuron

  3. Perceptron

  4. Linear function

  5. Linear model

  6. Bipolar activation functions

  7. Unipolar activation functions

  8. Sigmoid function

  9. Hyperbolic tangent function

  10. Tanh function

  11. Rectified unit function

  12. ReLU function

  13. Leaky ReLU function

  14. Parameterized ReLU function

  15. Swish function

  16. SoftMax function

  17. Training the perceptron

  18. Gradient descent

  19. Stochastic gradient descent

  20. XOR

  21. Exclusive OR

  22. Multilayer perceptron

  23. MLP

  24. Backpropagation

  25. Chain rule

  26. Fully connected neural network

11.6 Test Your Understanding

  1. Can we identify a perceptron as a linear classifier or a nonlinear one?

  2. What type of problems does a perceptron solve?

  3. Why should the activation function of a multilayer perceptron be nonlinear?

  4. What is the aim of backpropagation?

  5. Explain backpropagation in simple words for a specialist.

  6. The hyperbolic tangent function overcomes a problem we find in the sigmoid functions. What is it?

  7. Why does ReLU perform better than tanh and sigmoid functions?

  8. Explain the “dead” neuron problem and how to overcome it.

  9. What kind of issues does a leaky ReLU overcome in comparison with a ReLU?

  10. SoftMax is very useful to solve a specific kind of problem; what is it?

11.7 Read More

  1. Cao, J., Qian, S., Zhang, H., Fang, Q., & Xu, C. (2021). Global Relation-Aware Attention Network for Image-Text Retrieval. Proceedings of the 2021 International Conference on Multimedia Retrieval, Taipei, Taiwan. https://doi.org/10.1145/3460426.3463615

  2. Chatterjee, B., & Sen, S. (2021). Energy-Efficient Deep Neural Networks with Mixed-Signal Neurons and Dense-Local and Sparse-Global Connectivity. Proceedings of the 26th Asia and South Pacific Design Automation Conference, Tokyo, Japan. https://doi.org/10.1145/3394885.3431614

  3. Collins, J., Sun, S., Guo, C., Podgorsak, A., Rudin, S., & Bednarek, D. R. (2021). Estimation of Patient Eye-Lens Dose During Neuro-Interventional Procedures using a Dense Neural Network (DNN). Proc SPIE Int Soc Opt Eng, 11595. https://doi.org/10.1117/12.2580723

  4. Grasemann, U., Peñaloza, C., Dekhtyar, M., Miikkulainen, R., & Kiran, S. (2021). Predicting language treatment response in bilingual aphasia using neural network-based patient models. Sci Rep, 11(1), 10497. https://doi.org/10.1038/s41598-021-89443-6

  5. Hasan, N. (2021). A Hybrid Method of Covid-19 Patient Detection from Modified CT-Scan/Chest-X-Ray Images Combining Deep Convolutional Neural Network and Two-Dimensional Empirical Mode Decomposition. Comput Methods Programs Biomed Update, 1, 100022. https://doi.org/10.1016/j.cmpbup.2021.100022

  6. Kimura, Y., Kadoya, N., Oku, Y., Kajikawa, T., Tomori, S., & Jingu, K. (2021). Error detection model developed using a multi-task convolutional neural network in patient-specific quality assurance for volumetric-modulated arc therapy. Med Phys. https://doi.org/10.1002/mp.15031

  7. Maharaj, S., Qian, T., Ohiba, Z., & Hayes, W. (2021). Common Neighbors Extension of the Sticky Model for PPI Networks Evaluated by Global and Local Graphlet Similarity. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 18(1), 16–26. https://doi.org/10.1109/tcbb.2020.3017374

  8. Minagawa, A., Koga, H., Sano, T., Matsunaga, K., Teshima, Y., Hamada, A., Houjou, Y., & Okuyama, R. (2021). Dermoscopic diagnostic performance of Japanese dermatologists for skin tumors differs by patient origin: A deep learning convolutional neural network closes the gap. J Dermatol, 48(2), 232–236. https://doi.org/10.1111/1346-8138.15640

  9. Pan, Q., Zhang, L., Jia, M., Pan, J., Gong, Q., Lu, Y., Zhang, Z., Ge, H., & Fang, L. (2021). An interpretable 1D convolutional neural network for detecting patient-ventilator asynchrony in mechanical ventilation. Comput Methods Programs Biomed, 204, 106057. https://doi.org/10.1016/j.cmpb.2021.106057

  10. Shihada, B., Elbatt, T., Eltawil, A., Mansour, M., Sabir, E., Rekhis, S., & Sharafeddine, S. (2021). Networking research for the Arab world: from regional initiatives to potential global impact. Commun. ACM, 64(4), 114–119. https://doi.org/10.1145/3447748

  11. Shorfuzzaman, M., Masud, M., Alhumyani, H., Anand, D., & Singh, A. (2021). Artificial Neural Network-Based Deep Learning Model for COVID-19 Patient Detection Using X-Ray Chest Images. J Healthc Eng, 2021, 5513679. https://doi.org/10.1155/2021/5513679

  12. Sridhara, S., Wirz, F., Ruiter, J. d., Schutijser, C., Legner, M., & Perrig, A. (2021). Global Distributed Secure Mapping of Network Addresses. Proceedings of the ACM SIGCOMM 2021 Workshop on Technologies, Applications, and Uses of a Responsible Internet, Virtual Event, USA. https://doi.org/10.1145/3472951.3473503

  13. Valizadeh, A., Jafarzadeh Ghoushchi, S., Ranjbarzadeh, R., & Pourasad, Y. (2021). Presentation of a Segmentation Method for a Diabetic Retinopathy Patient’s Fundus Region Detection Using a Convolutional Neural Network. Comput Intell Neurosci, 2021, 7714351. https://doi.org/10.1155/2021/7714351

  14. Xiao, Y., Wang, X., Li, Q., Fan, R., Chen, R., Shao, Y., Chen, Y., Gao, Y., Liu, A., Chen, L., & Liu, S. (2021). A cascade and heterogeneous neural network for CT pulmonary nodule detection and its evaluation on both phantom and patient data. Comput Med Imaging Graph, 90, 101889. https://doi.org/10.1016/j.compmedimag.2021.101889

  15. Zhong, Y. W., Jiang, Y., Dong, S., Wu, W. J., Wang, L. X., Zhang, J., & Huang, M. W. (2021). Tumor radiomics signature for artificial neural network-assisted detection of neck metastasis in patient with tongue cancer. J Neuroradiol. https://doi.org/10.1016/j.neurad.2021.07.006

  16. Zhu, Y., Xie, R., Zhuang, F., Ge, K., Sun, Y., Zhang, X., Lin, L., & Cao, J. (2021). Learning to Warm Up Cold Item Embeddings for Cold-start Recommendation with Meta Scaling and Shifting Networks. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1167–1176). Association for Computing Machinery. https://doi.org/10.1145/3404835.3462843

11.8 Lab

11.8.1 Working Example in Python

The diabetes dataset that is used in this lab can be downloaded from the following link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

This is a binary classification problem. This dataset contains the following information:

  • Pregnancies: number of pregnancies

  • Glucose: plasma glucose concentration

  • BloodPressure: diastolic blood pressure measurement

  • SkinThickness: triceps skinfold thickness (mm)

  • Insulin: 2-hour serum insulin

  • BMI: body mass index (BMI)

  • DiabetesPedigreeFunction: diabetes pedigree function

  • Age: the person’s age

  • Outcome: tested positive for diabetes or not (1 or 0)

11.8.1.1 Load Diabetes for Pima Indians Dataset

Before loading the Pima Indians dataset, note that we need to install the Keras, TensorFlow, and SciKeras libraries using the pip install command in order to create the sequential neural network model.

For visualizing the neural network, the Graphviz library is used. Graphviz can be downloaded from the following link: https://www.graphviz.org/download/

After downloading Graphviz, the path in the system environment variables needs to be edited to include:

C:\Program Files\Graphviz\bin

C:\Program Files\Graphviz\bin\dot.exe

We start by importing the required libraries, loading the dataset, and displaying a bar chart for the outcomes as well as pair plots for the features (Fig. 11.27). The displayed graphs are partially shown in Fig. 11.28.

Fig. 11.27
An algorithm to load the Pima Indians diabetes dataset. The algorithm has the codes for the followings: Import required libraries, import warnings, load Pima Indians diabetes dataset, and data visualisation.

Load Pima Indians diabetes dataset

Fig. 11.28
A set of 33 graphs to visualize diabetic versus nondiabetic Pima Indians. A bar chart that depicts count versus outcome has a decreasing trend.

Visualizing diabetic vs. nondiabetic Pima Indians

11.8.1.2 Visualize Data

We explore the data visually (Fig. 11.28).

11.8.1.3 Split Dataset into Training and Testing Datasets

The next task is to choose the features and the target (the “Outcome”). We then split the dataset into training and testing sets and standardize both (Fig. 11.29).

Fig. 11.29
An algorithm to split and scale the dataset of Pima Indians. The algorithm has the codes for the followings: prepare the feature vector x and the outcome vector y, and so on.

Splitting and scaling Pima Indians diabetes dataset
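Since the code in Fig. 11.29 appears only as an image, here is a hedged sketch of the usual scikit-learn idiom for this step, run on synthetic stand-in data with the same shape as the Pima dataset (768 instances, 8 features); in the lab you would use the loaded DataFrame instead:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 Pima features and the binary Outcome
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(768, 8))
y = rng.integers(0, 2, size=768)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.shape, X_test_s.shape)
```

Fitting the scaler on the training split alone avoids leaking information from the test set into the model.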

11.8.1.4 Create Neural Network Model

The next task is to create the sequential neural network model using the Keras library. As expected, the input layer has eight nodes to accommodate the eight features. We have chosen to add two hidden layers, one with ten nodes and the other with eight (Fig. 11.30).

Fig. 11.30
An algorithm to add two hidden layers with 10 and 8 nodes. The algorithm has the codes as follows: Create the model, add the input layer with 8 nodes and R e L U activation, and so on.

Creating sequential neural network model

We can display the NN structure using the Graphviz library (Fig. 11.31).

Fig. 11.31
An algorithm to display the N N structure using the Graphviz library and an illustration of a sequential neural network. The input layer with 8 inputs leads to a single output through many layers.

Displaying a sequential neural network model

11.8.1.5 Optimize Neural Network Model Using Hyperparameter

For model optimization, we use the grid search cross-validation approach (Fig. 11.32). The hyperparameters used for the grid search are the batch size and the number of epochs. We conclude that the model has fair performance (AUC = 72% and accuracy 75%) and can be used on an unseen dataset.

Fig. 11.32
An algorithm of grid search cross-validation and a confusion 2 x 2 matrix. The algorithm has the codes to optimize the N N and print the confusion matrix for the model with the dataset.

Optimize the neural network model using grid search and its performance
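The grid-search idea in Fig. 11.32 can be sketched in a self-contained way using scikit-learn's MLPClassifier as a stand-in for the Keras/SciKeras pipeline; the data, the grid values, and the estimator here are illustrative only and do not reproduce the figure's exact setup:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (the lab uses the Pima features instead)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # a learnable toy target

param_grid = {
    "hidden_layer_sizes": [(10, 8), (8,)],
    "max_iter": [200, 400],   # plays the role of the number of epochs
}
grid = GridSearchCV(MLPClassifier(random_state=0), param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Each hyperparameter combination is cross-validated, and `best_params_` keeps the combination with the highest mean validation score, which is the same procedure the figure applies to batch size and epochs.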

11.8.2 Working Example in Weka

Download the Boston housing dataset from the following website:

https://www.kaggle.com/prasadperera/the-boston-housing-dataset

Open the file in Weka, go to the Classify tab, and choose the Multilayer Perceptron algorithm from Functions (Fig. 11.33).

Fig. 11.33
A screenshot of the Weka explorer. A file named multilayer perceptron is selected under functions, Bayes, classifiers, and Weka folder.

Weka multilayer perceptron algorithm

Click on the function and notice the parameters for the algorithm (Fig. 11.34).

Fig. 11.34
A screenshot of a Weka dot g u I dot generic object editor. The screenshot reads various parameters from top to bottom. A pop up reads this will cause the learning rate to decrease.

Multilayer perceptron parameters window in Weka

One of the most important parameters is the number of hidden layers; it is set to automatic by default (the letter a denotes automatic), but it can be set to any number you want. The learning rate can also be changed; the default is 0.3. Set the GUI parameter to True, click OK, and then click Start to run the algorithm. The result is shown in Fig. 11.35, and its graphical representation is shown in Fig. 11.36.

Fig. 11.35
A screenshot of Weka explorer. Classify tab has various sections such as classifier and test options on the left and classifier output on the right. A left arrow points at text 4.7344 in the output section.

MLP results in Weka; we can notice RMSE = 4.73

Fig. 11.36
A screenshot of a neural network window. A graphical representation of the neural network is on the screen with labeled points c r i m, z n, r m and so on. The option to start and accept are at the bottom.

The neural network’s graphical representation

11.8.3 Do it Yourself

11.8.3.1 Diabetes Revisited

How can you enhance the results of the neural network above? Hint: consider changing the number of hidden layers and the number of nodes in each. We used standardization above, but a neural network often works best with values between 0 and 1; would normalization allow the NN to perform better?
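As a starting point for the exercise, scikit-learn's MinMaxScaler rescales each feature onto the [0, 1] range (normalization), in contrast to the StandardScaler used earlier (standardization). A minimal sketch on toy data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Normalization: each column mapped linearly onto [0, 1].
X_norm = MinMaxScaler().fit_transform(X)
print(X_norm)

# Standardization: each column rescaled to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)
print(X_std)
```

Swapping one scaler for the other in the earlier pipeline is a one-line change, so both can easily be compared on the diabetes data.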

11.8.3.2 Choose your Own Problem

Pick a problem of your own and apply a neural network to it. Compare your results with someone else who used an NN to solve the same problem, and note the differences.

11.8.4 Do More Yourself

Solve each of the following predictive problems using neural networks.

  1.

    Boston house prices:

    You can load the Boston house prices data file (as well as many other datasets) from within Python by writing: from sklearn import datasets; boston = datasets.load_boston()

  2.

    Predicting stock prices using neural networks.

    Download the dataset from https://www.kaggle.com/datasets/paultimothymooney/stock-market-data

  3.

    Handwritten digit recognition.

    Download the dataset from http://yann.lecun.com/exdb/mnist/