
In an effort to overcome the practical limitations of the single-layer neural network, the neural network evolved into a multi-layer architecture. However, it took approximately 30 years just to add a hidden layer to the single-layer neural network. It’s not easy to understand why this took so long, but the problem involved the learning rule. As the training process is the only method for the neural network to store information, untrainable neural networks are useless. A proper learning rule for the multi-layer neural network took quite some time to develop.

The previously introduced delta rule is ineffective for training the multi-layer neural network. This is because the error, the essential element for applying the delta rule, is not defined in the hidden layers. The error of the output node is defined as the difference between the correct output and the output of the neural network. However, the training data does not provide correct outputs for the hidden layer nodes, and hence the error cannot be calculated using the same approach as for the output nodes. Then, what? Isn’t the real problem how to define the error at the hidden nodes? You got it. You just formulated the back-propagation algorithm, the representative learning rule of the multi-layer neural network.

In 1986, the introduction of the back-propagation algorithm finally solved the training problem of the multi-layer neural network. The significance of the back-propagation algorithm was that it provided a systematic method to determine the error of the hidden nodes. Once the hidden layer errors are determined, the delta rule is applied to adjust the weights. See Figure 3-1.

Figure 3-1. Illustration of back-propagation

The input data of the neural network travels through the input layer, hidden layer, and output layer. In the back-propagation algorithm, by contrast, the output error starts from the output layer and moves backward until it reaches the hidden layer immediately next to the input layer. This process is called back-propagation, as it resembles an output error propagating backward. Even in back-propagation, the signal still flows through the connecting lines and the weights are multiplied. The only difference is that the input and output signals flow in opposite directions.

Back-Propagation Algorithm

This section explains the back-propagation algorithm using an example of a simple multi-layer neural network. Consider a neural network that consists of two nodes for both the input and output and a hidden layer, which has two nodes as well. We will omit the bias for convenience. The example neural network is shown in Figure 3-2, where the superscript denotes the layer indicator.

Figure 3-2. Neural network that consists of two nodes for the input and output and a hidden layer with two nodes

In order to obtain the output error, we first need the neural network’s output for the given input data. Let’s try. As the example network has a single hidden layer, the input data goes through two computations before the output is produced. First, the weighted sum of the hidden nodes is calculated as:

$$ \begin{bmatrix} v_1^{(1)} \\ v_2^{(1)} \end{bmatrix} \;=\; \begin{bmatrix} w_{11}^{(1)} & w_{12}^{(1)} \\ w_{21}^{(1)} & w_{22}^{(1)} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \;\triangleq\; W_1\, x $$
(Equation 3.1)

When we put this weighted sum, Equation 3.1, into the activation function, we obtain the output from the hidden nodes.

$$ \begin{bmatrix} y_1^{(1)} \\ y_2^{(1)} \end{bmatrix} \;=\; \begin{bmatrix} \varphi\!\left(v_1^{(1)}\right) \\ \varphi\!\left(v_2^{(1)}\right) \end{bmatrix} $$

where $y_1^{(1)}$ and $y_2^{(1)}$ are the outputs from the corresponding hidden nodes. In a similar manner, the weighted sum of the output nodes is calculated as:

$$ \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \;=\; \begin{bmatrix} w_{11}^{(2)} & w_{12}^{(2)} \\ w_{21}^{(2)} & w_{22}^{(2)} \end{bmatrix} \begin{bmatrix} y_1^{(1)} \\ y_2^{(1)} \end{bmatrix} \;\triangleq\; W_2\, y^{(1)} $$
(Equation 3.2)

As we put this weighted sum into the activation function, the neural network yields the output.

$$ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \;=\; \begin{bmatrix} \varphi(v_1) \\ \varphi(v_2) \end{bmatrix} $$
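As a quick illustration (this sketch is ours, not part of the original text), the forward pass of Equations 3.1 and 3.2 boils down to two matrix multiplications and two activation calls in MATLAB. The weight values here are arbitrary examples, and Sigmoid stands in for the activation function φ; it is defined later in this chapter.

x  = [1; 0];          % example input vector
W1 = [0.2 0.4;        % hypothetical input-to-hidden weights
      0.6 0.8];
W2 = [0.5 0.3;        % hypothetical hidden-to-output weights
      0.1 0.7];

v1 = W1*x;            % Equation 3.1: weighted sum at the hidden nodes
y1 = Sigmoid(v1);     % output of the hidden nodes
v  = W2*y1;           % Equation 3.2: weighted sum at the output nodes
y  = Sigmoid(v)       % output of the neural network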

Now, we will train the neural network using the back-propagation algorithm. The first thing to calculate is the delta, δ, of each node. You may ask, “Is this delta the one from the delta rule?” It is! In order to avoid confusion, the diagram in Figure 3-3 has been redrawn with the unnecessary connections dimmed out.

Figure 3-3. Train the neural network using the back-propagation algorithm

In the back-propagation algorithm, the delta of the output node is defined identically to the delta rule of the “Generalized Delta Rule” section in Chapter 2, as follows:

$$ \begin{aligned} e_1 &= d_1 - y_1 \\ \delta_1 &= \varphi'(v_1)\, e_1 \\ e_2 &= d_2 - y_2 \\ \delta_2 &= \varphi'(v_2)\, e_2 \end{aligned} $$
(Equation 3.3)

where $\varphi'(\cdot)$ is the derivative of the activation function of the output node, $y_i$ is the output from the output node, $d_i$ is the correct output from the training data, and $v_i$ is the weighted sum of the corresponding node.

Since we have the delta for every output node, let’s proceed leftward to the hidden nodes and calculate the delta (Figure 3-4). Again, unnecessary connections are dimmed out for convenience.

Figure 3-4. Proceed leftward to the hidden nodes and calculate the delta

As addressed at the beginning of the chapter, the issue with the hidden node is how to define its error. In the back-propagation algorithm, the error of the node is defined as the weighted sum of the back-propagated deltas from the layer on the immediate right (in this case, the output layer). Once the error is obtained, the calculation of the delta from the node is the same as that of Equation 3.3. This process can be expressed as follows:

$$ \begin{aligned} e_1^{(1)} &= w_{11}^{(2)}\, \delta_1 + w_{21}^{(2)}\, \delta_2 \\ \delta_1^{(1)} &= \varphi'\!\left(v_1^{(1)}\right) e_1^{(1)} \\ e_2^{(1)} &= w_{12}^{(2)}\, \delta_1 + w_{22}^{(2)}\, \delta_2 \\ \delta_2^{(1)} &= \varphi'\!\left(v_2^{(1)}\right) e_2^{(1)} \end{aligned} $$
(Equation 3.4)

where $v_1^{(1)}$ and $v_2^{(1)}$ are the weighted sums of the forward signals at the respective nodes. It is noticeable from this equation that the forward and backward processes are applied identically to the hidden nodes as well as the output nodes. This implies that the output and hidden nodes experience the same backward process. The only difference is the error calculation (Figure 3-5).

Figure 3-5. The error calculation is the only difference

In summary, the error of the hidden node is calculated as the backward weighted sum of the delta, and the delta of the node is the product of the error and the derivative of the activation function. This process begins at the output layer and repeats for all hidden layers. This pretty much explains what the back-propagation algorithm is about.

The two error calculation formulas of Equation 3.4 are combined in a matrix equation as follows:

$$ \begin{bmatrix} e_1^{(1)} \\ e_2^{(1)} \end{bmatrix} \;=\; \begin{bmatrix} w_{11}^{(2)} & w_{21}^{(2)} \\ w_{12}^{(2)} & w_{22}^{(2)} \end{bmatrix} \begin{bmatrix} \delta_1 \\ \delta_2 \end{bmatrix} $$
(Equation 3.5)

Compare this equation with the neural network output of Equation 3.2. The matrix of Equation 3.5 is the transpose of the weight matrix, $W_2$, of Equation 3.2. Therefore, Equation 3.5 can be rewritten as:

$$ \begin{bmatrix} e_1^{(1)} \\ e_2^{(1)} \end{bmatrix} \;=\; W_2^{T} \begin{bmatrix} \delta_1 \\ \delta_2 \end{bmatrix} $$
(Equation 3.6)

This equation indicates that we can obtain the error as the product of the transposed weight matrix and the delta vector. This very useful attribute allows an easier implementation of the algorithm.
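In code, this attribute makes the backward pass one line per layer. Here is a minimal MATLAB sketch of Equations 3.3, 3.6, and 3.4 for the example network; it is ours, not the book’s listing, and it assumes the sigmoid activation function, whose derivative is $y(1-y)$:

e     = d - y;             % output error (Equation 3.3)
delta = y.*(1-y).*e;       % output delta; y.*(1-y) is the sigmoid derivative
e1     = W2'*delta;        % hidden layer error (Equation 3.6)
delta1 = y1.*(1-y1).*e1;   % hidden layer delta (Equation 3.4)

The same four lines reappear in the BackpropXOR function later in this chapter.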

If we have additional hidden layers, we just repeat the same backward process for each hidden layer and calculate all the deltas. Once all the deltas have been calculated, we are ready to train the neural network. Just use the following equation to adjust the weights of the respective layers.

$$ \begin{aligned} \Delta w_{ij} &= \alpha\, \delta_i\, x_j \\ w_{ij} &\leftarrow w_{ij} + \Delta w_{ij} \end{aligned} $$
(Equation 3.7)

where $x_j$ is the input signal for the corresponding weight. For convenience, we omit the layer indicator from this equation. What do you see now? Isn’t this equation the same as that of the delta rule of the previous chapter? Yes, they are the same. The only difference is the deltas of the hidden nodes, which are obtained from the backward calculation using the output error of the neural network.

We will proceed a bit further and derive the equation to adjust the weight using Equation 3.7. Consider the weight $w_{21}^{(2)}$ as an example.

Figure 3-6. Derive the equation to adjust the weight

The weight $w_{21}^{(2)}$ of Figure 3-6 can be adjusted using Equation 3.7 as:

$$ w_{21}^{(2)} \;\leftarrow\; w_{21}^{(2)} + \alpha\, \delta_2\, y_1^{(1)} $$

where $y_1^{(1)}$ is the output of the first hidden node. Here is another example.

Figure 3-7. Derive the equation to adjust the weight, again

The weight $w_{11}^{(1)}$ of Figure 3-7 is adjusted using Equation 3.7 as:

$$ w_{11}^{(1)} \;\leftarrow\; w_{11}^{(1)} + \alpha\, \delta_1^{(1)}\, x_1 $$

where $x_1$ is the output of the first input node, i.e., the first input of the neural network.

Let’s organize the process to train the neural network using the back-propagation algorithm.

  1. Initialize the weights with adequate values.

  2. Enter the input from the training data { input, correct output } and obtain the neural network’s output. Calculate the error of the output to the correct output and the delta, δ, of the output nodes.

     $$ \begin{aligned} e &= d - y \\ \delta &= \varphi'(v)\, e \end{aligned} $$

  3. Propagate the output node delta, δ, backward, and calculate the deltas of the immediate next (left) nodes.

     $$ \begin{aligned} e^{(k)} &= W^{T} \delta \\ \delta^{(k)} &= \varphi'\!\left(v^{(k)}\right) e^{(k)} \end{aligned} $$

  4. Repeat Step 3 until it reaches the hidden layer that is on the immediate right of the input layer.

  5. Adjust the weights according to the following learning rule.

     $$ \begin{aligned} \Delta w_{ij} &= \alpha\, \delta_i\, x_j \\ w_{ij} &\leftarrow w_{ij} + \Delta w_{ij} \end{aligned} $$

  6. Repeat Steps 2-5 for every training data point.

  7. Repeat Steps 2-6 until the neural network is properly trained.

Other than Steps 3 and 4, in which the output delta propagates backward to obtain the hidden node deltas, this process is basically the same as that of the delta rule, which was previously discussed. Although this example has only one hidden layer, the back-propagation algorithm is applicable to training many hidden layers. Just repeat Step 3 of the previous algorithm for each hidden layer, as the sketch below illustrates.
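To make the repetition of Step 3 concrete, here is a minimal MATLAB sketch (ours, not from the original text) of a single training pass through a network with an arbitrary number of layers. It assumes sigmoid activations throughout, assumes the weight matrices are stored in a cell array W, and reuses the Sigmoid function defined later in this chapter; x, d, and alpha are the input vector, correct output, and learning rate.

L = numel(W);                       % number of weight layers

y    = cell(L+1, 1);                % Step 2: forward pass
y{1} = x;
for l = 1:L
  y{l+1} = Sigmoid(W{l}*y{l});
end

delta    = cell(L, 1);
e        = d - y{L+1};              % Step 2: output error
delta{L} = y{L+1}.*(1-y{L+1}).*e;   % output delta (sigmoid derivative)
for l = L-1:-1:1                    % Steps 3-4: repeat per hidden layer
  e        = W{l+1}'*delta{l+1};    % Equation 3.6: back-propagated error
  delta{l} = y{l+1}.*(1-y{l+1}).*e;
end

for l = 1:L                         % Step 5: adjust the weights
  W{l} = W{l} + alpha*delta{l}*y{l}';
end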

Example: Back-Propagation

In this section, we implement the back-propagation algorithm. The training data contains four elements, as shown in the following table. Of course, as this is about supervised learning, the data includes input and correct output pairs. The rightmost column of the table is the correct output. As you may have noticed, this is the same data that we used in Chapter 2 for the training of the single-layer neural network; the one that the single-layer neural network had failed to learn.

Input      Correct Output
0 0 1      0
0 1 1      1
1 0 1      1
1 1 1      0

Ignoring the third value, the Z-axis, of the input, this dataset actually provides the XOR logic operations. Therefore, if we train the neural network with this dataset, we would get the XOR operation model.

Consider a neural network that consists of three input nodes and a single output node, as shown in Figure 3-8. It has one hidden layer of four nodes. The sigmoid function is used as the activation function for the hidden nodes and the output node.

Figure 3-8. Neural network that consists of three input nodes and a single output node

This section employs SGD for the implementation of the back-propagation algorithm. Of course, the batch method will work as well; what we would have to do is use the average of the weight updates, as shown in the example in the “Example: Delta Rule” section of Chapter 2. Since the primary objective of this section is to understand the back-propagation algorithm, we will stick to a simpler and more intuitive method: SGD.

XOR Problem

The function BackpropXOR, which implements the back-propagation algorithm using the SGD method, takes the network’s weights and training data and returns the adjusted weights.

[W1 W2] = BackpropXOR(W1, W2, X, D)

where W1 and W2 carry the weight matrices of the respective layers. W1 is the weight matrix between the input layer and the hidden layer, and W2 is the weight matrix between the hidden layer and the output layer. X and D are the input and correct output of the training data, respectively. The following listing shows the BackpropXOR.m file, which implements the BackpropXOR function.

function [W1, W2] = BackpropXOR(W1, W2, X, D)
  alpha = 0.9;

  N = 4;
  for k = 1:N
    x = X(k, :)';
    d = D(k);

    v1 = W1*x;
    y1 = Sigmoid(v1);
    v  = W2*y1;
    y  = Sigmoid(v);

    e     = d - y;
    delta = y.*(1-y).*e;

    e1     = W2'*delta;
    delta1 = y1.*(1-y1).*e1;

    dW1 = alpha*delta1*x';
    W1  = W1 + dW1;

    dW2 = alpha*delta*y1';
    W2  = W2 + dW2;
  end
end

The code takes a point from the training dataset, calculates the weight update, dW, using the delta rule, and adjusts the weights. So far, the process is almost identical to that of the example code of Chapter 2. The slight differences are the two calls of the function Sigmoid for the output calculation and the addition of the hidden layer delta (delta1) calculation using the back-propagation of the output delta, as follows:

e1     = W2'*delta;
delta1 = y1.*(1-y1).*e1;

where the calculation of the error, e1, is the implementation of Equation 3.6. As this involves the back-propagation of the delta, we use the transposed matrix, W2'. The delta (delta1) calculation has an element-wise product operator, .*, because the variables are vectors. The element-wise operator of MATLAB has a dot (period) in front of the normal operator and performs the operation on each element of the vector. This operator enables simultaneous calculation of the deltas of many nodes.
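For example, the element-wise product of two vectors behaves as follows:

[1 2 3] .* [4 5 6]    ➔    [4 10 18]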

The function Sigmoid, which the BackpropXOR code calls, also replaces the division with the element-wise division, ./, to account for vector inputs.

function y = Sigmoid(x)
  y = 1 ./ (1 + exp(-x));
end

The modified Sigmoid function can operate using vectors as shown by the following example:

Sigmoid([-1 0 1])    ➔    [0.2689    0.5000    0.7311]

The program listing that follows shows the TestBackpropXOR.m file, which tests the function BackpropXOR. This program calls in the BackpropXOR function and trains the neural network 10,000 times. The input is given to the trained network, and its output is shown on the screen. The training performance can be verified as we compare the output to the correct outputs of the training data. Further details are omitted, as the program is almost identical to that of Chapter 2.

clear all

X = [ 0 0 1;
      0 1 1;
      1 0 1;
      1 1 1;
    ];

D = [ 0
      1
      1
      0
    ];

W1 = 2*rand(4, 3) - 1;
W2 = 2*rand(1, 4) - 1;

for epoch = 1:10000           % train
  [W1 W2] = BackpropXOR(W1, W2, X, D);
end

N = 4;                        % inference
for k = 1:N
  x  = X(k, :)';
  v1 = W1*x;
  y1 = Sigmoid(v1);
  v  = W2*y1;
  y  = Sigmoid(v)
end

Execute the code and find the following values on the screen. These values are very close to the correct output, D, indicating that the neural network has been properly trained. Now we have confirmed that the multi-layer neural network solves the XOR problem, which the single-layer network had failed to model properly.

$$ \begin{bmatrix} 0.0060 \\ 0.9888 \\ 0.9891 \\ 0.0134 \end{bmatrix} \quad\Longleftrightarrow\quad D \;=\; \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} $$

Momentum

This section explores variations of the weight adjustment. So far, the weight adjustment has relied on the simplest forms of Equations 2.7 and 3.7. However, there are various weight adjustment forms available. The benefits of using the advanced weight adjustment formulas include higher stability and faster speeds in the training process of the neural network. These characteristics are especially favorable for Deep Learning, as it is hard to train. This section only covers the formulas that contain momentum, which have been used for a long time.

The momentum, m, is a term that is added to the delta rule for adjusting the weight. The use of the momentum term drives the weight adjustment in a certain direction to some extent, rather than producing an immediate change. It acts similarly to physical momentum, which impedes the reaction of the body to external forces.

$$ \begin{aligned} \Delta w &= \alpha\, \delta\, x \\ m &= \Delta w + \beta\, m^{-} \\ w &= w + m \\ m^{-} &= m \end{aligned} $$
(Equation 3.8)

where $m^{-}$ is the previous momentum and β is a positive constant that is less than 1. Let’s briefly see why we modify the weight adjustment formula in this manner. The following steps show how the momentum changes over time:

$$ \begin{aligned} m(0) &= 0 \\ m(1) &= \Delta w(1) + \beta\, m(0) = \Delta w(1) \\ m(2) &= \Delta w(2) + \beta\, m(1) = \Delta w(2) + \beta\, \Delta w(1) \\ m(3) &= \Delta w(3) + \beta\, m(2) = \Delta w(3) + \beta\, \Delta w(2) + \beta^{2}\, \Delta w(1) \\ &\;\;\vdots \end{aligned} $$

It is noticeable from these steps that the previous weight updates, i.e., Δw(1), Δw(2), Δw(3), etc., are added to each momentum over the process. Since β is less than 1, the older weight updates exert a lesser influence on the momentum. Although the influence diminishes over time, the old weight updates remain in the momentum. Therefore, the weight is not solely affected by a particular weight update value, and the learning stability improves. In addition, the momentum grows more and more with the weight updates. As a result, the weight update becomes greater and greater as well, and the learning is effectively accelerated.
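A quick numeric check (ours, not from the text) makes the growth concrete. If the weight update Δw were held constant, the momentum would grow geometrically toward Δw/(1 − β), which is ten times the plain update for β = 0.9:

dw   = 0.1;          % hypothetical constant weight update
beta = 0.9;
m    = 0;
for k = 1:50
  m = dw + beta*m;   % momentum recursion of Equation 3.8
end
m                    % ➔ 0.9948, approaching dw/(1-beta) = 1.0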

The BackpropMmt function implements the back-propagation algorithm with momentum. It operates in the same manner as that of the previous example; it takes the weights and training data and returns the adjusted weights:

[W1 W2] = BackpropMmt(W1, W2, X, D)

This listing employs the same variables as defined in the BackpropXOR function. The following listing shows the BackpropMmt.m file, which implements the BackpropMmt function.

function [W1, W2] = BackpropMmt(W1, W2, X, D)
  alpha = 0.9;
  beta  = 0.9;

  mmt1 = zeros(size(W1));
  mmt2 = zeros(size(W2));

  N = 4;
  for k = 1:N
    x = X(k, :)';
    d = D(k);

    v1 = W1*x;
    y1 = Sigmoid(v1);
    v  = W2*y1;
    y  = Sigmoid(v);

    e     = d - y;
    delta = y.*(1-y).*e;

    e1     = W2'*delta;
    delta1 = y1.*(1-y1).*e1;

    dW1  = alpha*delta1*x';
    mmt1 = dW1 + beta*mmt1;
    W1   = W1 + mmt1;

    dW2  = alpha*delta*y1';
    mmt2 = dW2 + beta*mmt2;
    W2   = W2 + mmt2;
  end
end

The code initializes the momentums, mmt1 and mmt2, as zeros when it starts the learning process. The weight adjustment formula is modified to reflect the momentum as:

dW1  = alpha*delta1*x';
mmt1 = dW1 + beta*mmt1;
W1   = W1 + mmt1;

The following program listing shows the TestBackpropMmt.m file, which tests the function BackpropMmt. This program calls the BackpropMmt function and trains the neural network 10,000 times. The training data is fed to the neural network and the output is shown on the screen. The performance of the training is verified by comparing the output to the correct output of the training data. As this code is almost identical to that of the previous example, further explanation is omitted.

clear all

X = [ 0 0 1;
      0 1 1;
      1 0 1;
      1 1 1;
    ];

D = [ 0
      1
      1
      0
    ];

W1 = 2*rand(4, 3) - 1;
W2 = 2*rand(1, 4) - 1;

for epoch = 1:10000           % train
  [W1 W2] = BackpropMmt(W1, W2, X, D);
end

N = 4;                        % inference
for k = 1:N
  x  = X(k, :)';
  v1 = W1*x;
  y1 = Sigmoid(v1);
  v  = W2*y1;
  y  = Sigmoid(v)
end

Cost Function and Learning Rule

This section briefly explains what the cost function is and how it affects the learning rule of the neural network. The cost function is a rather mathematical concept that is associated with optimization theory. You don’t have to know it. However, it is good to know if you want to better understand the learning rule of the neural network. It is not a difficult concept to follow.

The cost function is related to supervised learning of the neural network. Chapter 2 addressed that supervised learning of the neural network is a process of adjusting the weights to reduce the error of the training data. In this context, the measure of the neural network’s error is the cost function. The greater the error of the neural network, the higher the value of the cost function is. There are two primary types of cost functions for the neural network’s supervised learning.

$$ J \;=\; \sum_{i=1}^{M} \frac{1}{2} \left( d_i - y_i \right)^2 $$
(Equation 3.9)

$$ J \;=\; \sum_{i=1}^{M} \left\{ -d_i \ln(y_i) - (1-d_i) \ln(1-y_i) \right\} $$
(Equation 3.10)

where $y_i$ is the output from the output node, $d_i$ is the correct output from the training data, and $M$ is the number of output nodes.

First, consider the sum of squared error shown in Equation 3.9. This cost function is the square of the difference between the neural network’s output, y, and the correct output, d. If the output and correct output are the same, the error becomes zero. In contrast, a greater difference between the two values leads to a larger error. This is illustrated in Figure 3-9.

Figure 3-9. The greater the difference between the output and the correct output, the larger the error

It is clearly noticeable that the cost function value is proportional to the error. This relationship is so intuitive that no further explanation is necessary. Most early studies of the neural network employed this cost function to derive learning rules . Not only was the delta rule of the previous chapter derived from this function, but the back-propagation algorithm was as well. Regression problems still use this cost function.

Now, consider the cost function of Equation 3.10. The formula inside the curly braces is called the cross entropy function.

$$ E \;=\; -d \ln(y) - (1-d) \ln(1-y) $$

It may be difficult to intuitively capture the cross entropy function’s relationship to the error. This is because the equation is contracted for simpler expression. Equation 3.10 is the combination of the following two cases:

$$ E \;=\; \begin{cases} -\ln(y) & d = 1 \\ -\ln(1-y) & d = 0 \end{cases} $$

Due to the definition of the logarithm, the output, y, should be within 0 and 1. Therefore, the cross entropy cost function often teams up with sigmoid and softmax activation functions in the neural network. Now we will see how this function is related to the error. Recall that cost functions should be proportional to the output error. What about this one?

Figure 3-10 shows the cross entropy function at $d = 1$.

Figure 3-10. The cross entropy function at d = 1

When the output y is 1, i.e., the error ($d - y$) is 0, the cost function value is 0 as well. In contrast, when the output y approaches 0, i.e., the error grows, the cost function value soars. Therefore, this cost function is proportional to the error.

Figure 3-11 shows the cost function at $d = 0$. If the output y is 0, the error is 0, and the cost function yields 0. When the output approaches 1, i.e., the error grows, the function value soars. Therefore, this cost function is proportional to the error in this case as well. These cases confirm that the cost function of Equation 3.10 is proportional to the output error of the neural network.

Figure 3-11. The cross entropy function at d = 0

The primary difference between the cross entropy function and the quadratic function of Equation 3.9 is its geometric increase. In other words, the cross entropy function is much more sensitive to the error. For this reason, the learning rules derived from the cross entropy function are generally known to yield better performance. It is recommended that you use the cross entropy-driven learning rules except for inevitable cases such as regression.
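You can reproduce the curves of Figures 3-10 and 3-11, and compare them with the quadratic cost, with a short MATLAB script like this one (ours, for illustration only):

y = 0.001:0.001:0.999;         % output range, avoiding log(0)
plot(y, -log(y), 'r')          % cross entropy at d = 1 (Figure 3-10)
hold on
plot(y, -log(1-y), 'b')        % cross entropy at d = 0 (Figure 3-11)
plot(y, 0.5*(1-y).^2, 'k:')    % quadratic cost at d = 1, for comparison
xlabel('y')
ylabel('E')
legend('-ln(y), d = 1', '-ln(1-y), d = 0', 'quadratic, d = 1')

The cross entropy curves shoot up toward infinity as the error grows, while the quadratic cost levels off; this is the sensitivity difference described above.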

We had a long introduction to the cost function because the selection of the cost function affects the learning rule, i.e., the formula of the back-propagation algorithm. Specifically, the calculation of the delta at the output node changes slightly. The following steps detail the procedure for training the neural network with the sigmoid activation function at the output node using the cross entropy-driven back-propagation algorithm.

  1. Initialize the neural network’s weights with adequate values.

  2. Enter the input of the training data { input, correct output } to the neural network and obtain the output. Compare this output to the correct output, calculate the error, and calculate the delta, δ, of the output nodes.

     $$ \begin{aligned} e &= d - y \\ \delta &= e \end{aligned} $$

  3. Propagate the delta of the output node backward and calculate the deltas of the subsequent hidden nodes.

     $$ \begin{aligned} e^{(k)} &= W^{T} \delta \\ \delta^{(k)} &= \varphi'\!\left(v^{(k)}\right) e^{(k)} \end{aligned} $$

  4. Repeat Step 3 until it reaches the hidden layer that is next to the input layer.

  5. Adjust the neural network’s weights using the following learning rule:

     $$ \begin{aligned} \Delta w_{ij} &= \alpha\, \delta_i\, x_j \\ w_{ij} &\leftarrow w_{ij} + \Delta w_{ij} \end{aligned} $$

  6. Repeat Steps 2-5 for every training data point.

  7. Repeat Steps 2-6 until the network has been adequately trained.

Did you notice the difference between this process and that of the “Back-Propagation Algorithm” section? It is the delta, δ, in Step 2. It has been changed as follows:

$$ \delta = \varphi'(v)\, e \quad\longrightarrow\quad \delta = e $$

Everything else remains the same. On the outside, the difference seems insignificant. However, it contains the huge topic of the cost function based on optimization theory. Most of the neural network training approaches of Deep Learning employ the cross entropy-driven learning rules, due to their superior learning speed and performance.
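The derivation behind the simplified delta is short, although the text does not spell it out; here it is for the curious. With the sigmoid output $y = \varphi(v)$, the derivative is $\varphi'(v) = y(1-y)$, and differentiating the cross entropy E with respect to the weighted sum v gives:

$$ \delta \;=\; -\frac{\partial E}{\partial v} \;=\; -\frac{\partial E}{\partial y}\, \varphi'(v) \;=\; \left( \frac{d}{y} - \frac{1-d}{1-y} \right) y\,(1-y) \;=\; d\,(1-y) - (1-d)\,y \;=\; d - y \;=\; e $$

The derivative of the sigmoid exactly cancels the denominators of the cross entropy’s gradient, which is why the delta reduces to the plain error.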

Figure 3-12 depicts what this section has explained so far. The key fact is that the output and hidden layers employ different formulas for the delta calculation when the learning rule is based on the cross entropy and the sigmoid function.

Figure 3-12. The output and hidden layers employ different formulas for the delta calculation

While we are at it, we will address just one more thing about the cost function. You saw in Chapter 1 that overfitting is a challenging problem that every technique of Machine Learning faces. You also saw that one of the primary approaches used to overcome overfitting is making the model as simple as possible using regularization. In a mathematical sense, the essence of regularization is adding the sum of the squared weights to the cost function, as shown here. Of course, applying the following new cost function leads to a different learning rule formula.

$$ J \;=\; \frac{1}{2} \sum_{i=1}^{M} \left( d_i - y_i \right)^2 \;+\; \lambda\, \frac{1}{2} \left\| w \right\|^2 $$

$$ J \;=\; \sum_{i=1}^{M} \left\{ -d_i \ln(y_i) - (1-d_i) \ln(1-y_i) \right\} \;+\; \lambda\, \frac{1}{2} \left\| w \right\|^2 $$

where λ is the coefficient that determines how much of the connection weight is reflected in the cost function.

This cost function maintains a large value when either the output error or the weights remain large. Therefore, solely making the output error zero will not suffice in reducing the cost function. In order to drop the value of the cost function, both the error and the weights should be controlled to be as small as possible. However, if a weight becomes small enough, the associated nodes will be practically disconnected. As a result, unnecessary connections are eliminated, and the neural network becomes simpler. For this reason, overfitting of the neural network can be reduced by adding the sum of the squared weights to the cost function.
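In terms of the weight update, differentiating the added term $\lambda \frac{1}{2}\|w\|^2$ contributes $\lambda w$ to the gradient, so the delta rule of Equation 3.7 picks up a weight decay term. As a hedged sketch of what the adjustment in BackpropXOR might look like under the regularized cost function (lambda is a hypothetical regularization coefficient, not part of the book’s listings):

lambda = 0.01;                        % hypothetical regularization coefficient
dW1 = alpha*(delta1*x' - lambda*W1);  % delta rule plus weight decay
W1  = W1 + dW1;
dW2 = alpha*(delta*y1' - lambda*W2);
W2  = W2 + dW2;

Each update now shrinks the weights slightly toward zero, which is exactly the disconnection effect described above.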

In summary, the learning rule of the neural network’s supervised learning is derived from the cost function. The performance of the learning rule and the neural network varies depending on the selection of the cost function. The cross entropy function has been attracting recent attention as the cost function. The regularization process that is used to deal with overfitting is implemented as a variation of the cost function.

Example: Cross Entropy Function

This section revisits the back-propagation example, but this time, the learning rule derived from the cross entropy function is used. Consider the training of the neural network shown in Figure 3-13, which consists of three input nodes, a hidden layer with four nodes, and a single output node. The sigmoid function is employed as the activation function for the hidden nodes and the output node.

Figure 3-13. Neural network with a hidden layer with four nodes, three input nodes, and a single output node

The training data contains the same four elements, as shown in the following table. When we ignore the third numbers of the input data, this training dataset presents the XOR logic operation. The rightmost column of the table is the correct output.

Input      Correct Output
0 0 1      0
0 1 1      1
1 0 1      1
1 1 1      0

Cross Entropy Function

The BackpropCE function trains the XOR data using the cross entropy function. It takes the neural network’s weights and training data and returns the adjusted weights.

[W1 W2] = BackpropCE(W1, W2, X, D)

where W1 and W2 are the weight matrices for the input-hidden layers and hidden-output layers, respectively. In addition, X and D are the input and correct output matrices of the data, respectively. The following listing shows the BackpropCE.m file, which implements the BackpropCE function.

function [W1, W2] = BackpropCE(W1, W2, X, D)
  alpha = 0.9;

  N = 4;
  for k = 1:N
    x = X(k, :)';        % x = a column vector
    d = D(k);

    v1 = W1*x;
    y1 = Sigmoid(v1);
    v  = W2*y1;
    y  = Sigmoid(v);

    e     = d - y;
    delta = e;

    e1     = W2'*delta;
    delta1 = y1.*(1-y1).*e1;

    dW1 = alpha*delta1*x';
    W1  = W1 + dW1;

    dW2 = alpha*delta*y1';
    W2  = W2 + dW2;
  end
end

This code pulls out a training data point, calculates the weight updates (dW1 and dW2) using the delta rule, and adjusts the neural network’s weights using these values. So far, the process is almost identical to that of the previous example. The difference arises when we calculate the delta of the output node as:

e     = d - y;
delta = e;

Unlike the previous example code, the derivative of the sigmoid function no longer appears. This is because, for the learning rule derived from the cross entropy function, if the activation function of the output node is the sigmoid, the delta equals the output error. Of course, the hidden nodes follow the same process that is used by the previous back-propagation algorithm:

e1     = W2'*delta;
delta1 = y1.*(1-y1).*e1;

The following program listing shows the TestBackpropCE.m file, which tests the BackpropCE function. This program calls the BackpropCE function and trains the neural network 10,000 times. The trained neural network yields the output for the training data input, and the result is displayed on the screen. We verify the proper training of the neural network by comparing the output to the correct output. Further explanation is omitted, as the code is almost identical to that from before.

clear all

X = [ 0 0 1;
      0 1 1;
      1 0 1;
      1 1 1;
    ];

D = [ 0
      1
      1
      0
    ];

W1 = 2*rand(4, 3) - 1;
W2 = 2*rand(1, 4) - 1;

for epoch = 1:10000                    % train
  [W1 W2] = BackpropCE(W1, W2, X, D);
end

N = 4;                                 % inference
for k = 1:N
  x  = X(k, :)';
  v1 = W1*x;
  y1 = Sigmoid(v1);
  v  = W2*y1;
  y  = Sigmoid(v)
end

Executing this code produces the values shown here. The output is very close to the correct output, D. This proves that the neural network has been trained successfully.

$$ \begin{bmatrix} 0.00003 \\ 0.9999 \\ 0.9998 \\ 0.00036 \end{bmatrix} \quad\Longleftrightarrow\quad D \;=\; \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} $$

Comparison of Cost Functions

The only difference between the BackpropCE function from the previous section and the BackpropXOR function from the “XOR Problem” section is the calculation of the output node delta. We will examine how this seemingly insignificant difference affects the learning performance. The following listing shows the CEvsSSE.m file, which compares the mean errors of the two functions. The architecture of this file is almost identical to that of the SGDvsBatch.m file in the “Comparison of the SGD and the Batch” section in Chapter 2.

clear all

X = [ 0 0 1;
      0 1 1;
      1 0 1;
      1 1 1;
    ];

D = [ 0
      0
      1
      1
    ];

E1 = zeros(1000, 1);
E2 = zeros(1000, 1);

W11 = 2*rand(4, 3) - 1;      % Cross entropy
W12 = 2*rand(1, 4) - 1;      %
W21 = W11;                   % Sum of squared error
W22 = W12;                   %

for epoch = 1:1000
  [W11 W12] = BackpropCE(W11, W12, X, D);
  [W21 W22] = BackpropXOR(W21, W22, X, D);

  es1 = 0;
  es2 = 0;
  N   = 4;
  for k = 1:N
    x = X(k, :)';
    d = D(k);

    v1  = W11*x;
    y1  = Sigmoid(v1);
    v   = W12*y1;
    y   = Sigmoid(v);
    es1 = es1 + (d - y)^2;

    v1  = W21*x;
    y1  = Sigmoid(v1);
    v   = W22*y1;
    y   = Sigmoid(v);
    es2 = es2 + (d - y)^2;
  end

  E1(epoch) = es1 / N;
  E2(epoch) = es2 / N;
end

plot(E1, 'r')
hold on
plot(E2, 'b:')
xlabel('Epoch')
ylabel('Average of Training error')
legend('Cross Entropy', 'Sum of Squared Error')

This program calls the BackpropCE and BackpropXOR functions and trains the neural networks 1,000 times each. The squared sum of the output errors (es1 and es2) is calculated at every epoch for each neural network, and their averages (E1 and E2) are computed. W11, W12, W21, and W22 are the weight matrices of the respective neural networks. Once the 1,000 trainings have been completed, the mean errors are compared over the epochs on the graph. As Figure 3-14 shows, the cross entropy-driven training reduces the training error at a much faster rate. In other words, the cross entropy-driven learning rule yields a faster learning process. This is the reason that most cost functions for Deep Learning employ the cross entropy function.

Figure 3-14. Cross entropy-driven training reduces the training error at a much faster rate

This completes the contents for the back-propagation algorithm. If you had a hard time catching on, don’t be discouraged. Actually, understanding the back-propagation algorithm is not a vital factor when studying and developing Deep Learning. As most of the Deep Learning libraries already include the algorithm, we can just use it. Cheer up! Deep Learning is just one chapter away.

Summary

This chapter covered the following concepts:

  • The multi-layer neural network cannot be trained using the delta rule; it should be trained using the back-propagation algorithm, which is also employed as the learning rule of Deep Learning.

  • The back-propagation algorithm defines the hidden layer error as it propagates the output error backward from the output layer. Once the hidden layer error is obtained, the weights of every layer are adjusted using the delta rule. The importance of the back-propagation algorithm is that it provides a systematic method to define the error of the hidden node.

  • The single-layer neural network is applicable only to linearly separable problems, and most practical problems are linearly inseparable.

  • The multi-layer neural network is capable of modeling the linearly inseparable problems.

  • Many types of weight adjustments are available in the back-propagation algorithm. The development of various weight adjustment approaches is due to the pursuit of more stable and faster learning of the network. These characteristics are particularly beneficial for hard-to-train Deep Learning.

  • The cost function addresses the output error of the neural network and is proportional to the error. Cross entropy has been widely used in recent applications. In most cases, the cross entropy-driven learning rules are known to yield better performance.

  • The learning rule of the neural network varies depending on the cost function and activation function. Specifically, the delta calculation of the output node is changed.

  • Regularization, which is one of the approaches used to overcome overfitting, is also implemented as an addition of a weight term to the cost function.