
1 Introduction

The methods of Artificial Intelligence are broadly used in the modern world. Genetic algorithms, neural networks and their applications are the subject of much research [11, 12, 14, 15]. The most common uses of modern AI can be found in the areas of optimization, classification and pattern recognition [1, 2, 13, 16, 17, 21]. The Levenberg-Marquardt algorithm (also called LM or LMA) is a highly efficient and popular teaching method for feedforward neural networks [9]. It is classified as a quasi-Newton method and is widely used for function minimization problems such as neural network teaching [8]. One of the biggest drawbacks of the LMA is its high computational complexity compared to classical teaching methods, e.g. the well-known backpropagation algorithm and all its variants [3, 7, 10]. This limitation makes the LMA impractical as a teaching algorithm for larger networks. The following paper describes a parallel modification of the LMA which can be applied to any feedforward multi-layered neural network.

2 The Classic Levenberg-Marquardt Algorithm

The Levenberg-Marquardt method is an iterative algorithm mostly used for non-linear function optimization. It combines the advantages of the steepest descent and Gauss-Newton methods. In order to teach a neural network, the LMA is used to minimize the following error function

$$\begin{aligned} E\left( {\mathbf{{w}}\left( n \right) } \right) = \frac{1}{2}\sum \limits _{t = 1}^Q {\sum \limits _{r = 1}^{{N_L}} {\varepsilon _r^{{{\left( L \right) }^2}}} \left( t \right) } = \frac{1}{2}\sum \limits _{t = 1}^Q {\sum \limits _{r = 1}^{{N_L}} {{{\left( {y_r^{\left( L \right) }\left( t \right) - d_r^{\left( L \right) }\left( t \right) } \right) }^2}} } \end{aligned}$$
(1)

where \(\varepsilon _r^{\left( L \right) } \) stands for the non-linear neuron error and is defined as

$$\begin{aligned} \varepsilon _r^{\left( L \right) }(t) = \varepsilon _r^{\left( {Lr} \right) }(t) = y_r^{\left( L \right) }(t) - d_r^{\left( L \right) }(t) \end{aligned}$$
(2)

while \(d_r^{\left( L \right) }(t)\) is the r-th expected network response to the t-th teaching sample. In the Levenberg-Marquardt algorithm the weight update is given by

$$\begin{aligned} \varDelta \left( {\mathbf{{w}}(n)} \right) = - {\left[ {{\nabla ^\mathbf{{2}}}{} \mathbf{{E}}\left( {\mathbf{{w}}(n)} \right) } \right] ^{ - 1}}\nabla \mathbf{{E}}\left( {\mathbf{{w}}(n)} \right) \end{aligned}$$
(3)

where \(\nabla \mathbf{{E}}\left( {\mathbf{{w}}(n)}\right) \) stands for the gradient vector

$$\begin{aligned} \nabla \mathbf{{E}}\left( {\mathbf{{w}}(n)} \right) = {\mathbf{{J}}^T}\left( {\mathbf{{w}}(n)} \right) \mathbf{{\varepsilon }}\left( {\mathbf{{w}}(n)} \right) \end{aligned}$$
(4)

and \(\nabla ^\mathbf{{2}}{} \mathbf{{E}}\left( {\mathbf{{w}}(n)} \right) \) stands for the Hessian matrix

$$\begin{aligned} {\nabla ^\mathbf{{2}}}{} \mathbf{{E}}\left( {\mathbf{{w}}(n)} \right) = {\mathbf{{J}}^T}\left( {\mathbf{{w}}(n)} \right) \mathbf{{J}}\left( {\mathbf{{w}}(n)} \right) + \mathbf{{S}}\left( {\mathbf{{w}}(n)} \right) \end{aligned}$$
(5)

while \(\mathbf{{J}}\left( {\mathbf{{w}}(n)}\right) \) is the Jacobian matrix

$$\begin{aligned} \mathbf{{J}}(\mathbf{{w}}\left( n \right) ) = \left[ {\begin{array}{*{20}{c}} {\frac{{\partial \varepsilon _1^{\left( L \right) }(1)}}{{\partial w_{10}^{(1)}}}}&{}{\frac{{\partial \varepsilon _1^{\left( L \right) }(1)}}{{\partial w_{11}^{(1)}}}}&{} \cdots &{}{\frac{{\partial \varepsilon _1^{\left( L \right) }(1)}}{{\partial w_{ij}^{(k)}}}}&{} \cdots &{}{\frac{{\partial \varepsilon _1^{\left( L \right) }(1)}}{{\partial w_{{N_L}{N_{L - 1}}}^{(L)}}}}\\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ {\frac{{\partial \varepsilon _{{N_L}}^{\left( L \right) }(1)}}{{\partial w_{10}^{(1)}}}}&{}{\frac{{\partial \varepsilon _{{N_L}}^{\left( L \right) }(1)}}{{\partial w_{11}^{(1)}}}}&{} \cdots &{}{\frac{{\partial \varepsilon _{{N_L}}^{\left( L \right) }(1)}}{{\partial w_{ij}^{(k)}}}}&{} \cdots &{}{\frac{{\partial \varepsilon _{{N_L}}^{\left( L \right) }(1)}}{{\partial w_{{N_L}{N_{L - 1}}}^{(L)}}}}\\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ {\frac{{\partial \varepsilon _{{N_L}}^{\left( L \right) }(Q)}}{{\partial w_{10}^{(1)}}}}&{}{\frac{{\partial \varepsilon _{{N_L}}^{\left( L \right) }(Q)}}{{\partial w_{11}^{(1)}}}}&{} \cdots &{}{\frac{{\partial \varepsilon _{{N_L}}^{\left( L \right) }(Q)}}{{\partial w_{ij}^{(k)}}}}&{} \cdots &{}{\frac{{\partial \varepsilon _{{N_L}}^{\left( L \right) }(Q)}}{{\partial w_{{N_L}{N_{L - 1}}}^{(L)}}}} \end{array}} \right] . \end{aligned}$$
(6)

The non-linear errors \(\varepsilon _i^{\left( lr \right) } \) of the neurons in hidden layers are calculated as follows

$$\begin{aligned} \varepsilon _i^{\left( {lr} \right) }\left( t \right) \buildrel \wedge \over = \sum \limits _{m = 1}^{{N_{l + 1}}} {\delta _m^{\left( {l + 1,r} \right) }\left( t \right) w_{mi}^{\left( {l + 1} \right) }}, \end{aligned}$$
(7)
$$\begin{aligned} \delta _i^{\left( {lr} \right) }\left( t \right) = \varepsilon _i^{\left( {lr} \right) }\left( t \right) f'\left( {s_i^{\left( {lr} \right) }\left( t \right) } \right) . \end{aligned}$$
(8)

This makes it possible to calculate the entries of the Jacobian matrix for all weights in the network

$$\begin{aligned} \frac{{\partial \varepsilon _r^{\left( L \right) }\left( t \right) }}{{\partial w_{ij}^{\left( l \right) }}} = \delta _i^{\left( {lr} \right) }\left( t \right) x_j^{\left( l \right) }\left( t \right) . \end{aligned}$$
(9)
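As an illustration, the following minimal sketch computes the rows of the Jacobian from Eq. (6) for a single teaching sample using the recursion of Eqs. (7)-(9). It assumes a plain NumPy representation of the network with tanh hidden units, a linear output layer and the bias stored as the first weight of every neuron; these choices are not dictated by the paper and are made only to keep the example self-contained.

```python
import numpy as np

def forward(weights, x):
    """Forward pass; weights[l] has shape (N_l, N_{l-1} + 1), bias in column 0."""
    xs, ss, y = [], [], np.asarray(x, dtype=float)
    for l, W in enumerate(weights):
        xl = np.concatenate(([1.0], y))                  # x^{(l)} with bias input
        s = W @ xl                                       # s^{(l)}
        y = s if l == len(weights) - 1 else np.tanh(s)   # linear output layer (assumption)
        xs.append(xl)
        ss.append(s)
    return xs, ss, y                                     # y is the network output y^{(L)}

def jacobian_rows(weights, x):
    """Rows d(eps_r)/d(w_ij^{(l)}) of Eq. (6) for one sample, via Eqs. (7)-(9)."""
    xs, ss, _ = forward(weights, x)
    L, n_out = len(weights), weights[-1].shape[0]
    rows = []
    for r in range(n_out):
        delta = np.zeros(n_out)
        delta[r] = 1.0                                   # seed: d(eps_r)/d(s^{(L)}) for a linear output
        grads = [None] * L
        for l in reversed(range(L)):
            grads[l] = np.outer(delta, xs[l])            # Eq. (9): delta_i^{(lr)} * x_j^{(l)}
            if l > 0:
                eps = weights[l][:, 1:].T @ delta        # Eq. (7): backpropagated error
                delta = eps * (1.0 - np.tanh(ss[l - 1]) ** 2)  # Eq. (8) with f = tanh
        rows.append(np.concatenate([g.ravel() for g in grads]))
    return np.array(rows)                                # shape (N_L, nw)
```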

In the classic LMA all weights of the neural network are stored as a single vector. The \(\mathbf{{S}}\left( {\mathbf{{w}}(n)} \right) \) factor from Eq. (5) is given by

$$\begin{aligned} \mathbf{{S}}\left( {\mathbf{{w}}(n)} \right) = {\sum \limits _{t = 1}^Q {\sum \limits _{r = 1}^{{N_L}} {\varepsilon _r^{\left( L \right) }\left( t \right) } \nabla ^2 } }\varepsilon _r^{\left( L \right) }\left( t \right) . \end{aligned}$$
(10)

In the Gauss-Newton method it is assumed that \(\mathbf{{S}}\left( {\mathbf{{w}}(n)} \right) \approx 0\), so Eq. (3) can be simplified to

$$\begin{aligned} \varDelta \left( {\mathbf{{w}}(n)} \right) = - {\left[ {{\mathbf{{J}}^T}\left( {\mathbf{{w}}(n)} \right) \mathbf{{J}}\left( {\mathbf{{w}}(n)} \right) } \right] ^{ - 1}}{\mathbf{{J}}^T}\left( {\mathbf{{w}}(n)} \right) \mathbf{{\varepsilon }}\left( {\mathbf{{w}}(n)} \right) . \end{aligned}$$
(11)

For the needs of the Levenberg-Marquardt algorithm it is assumed that \(\mathbf{{S}}\left( {\mathbf{{w}}(n)} \right) = \mu \mathbf I\), so Eq. (3) takes the form

$$\begin{aligned} \varDelta \left( {\mathbf{{w}}(n)} \right) = - {\left[ {{\mathbf{{J}}^T}\left( {\mathbf{{w}}(n)} \right) \mathbf{{J}}\left( {\mathbf{{w}}(n)} \right) + \mu \mathbf{{I}}} \right] ^{ - 1}}{\mathbf{{J}}^T}\left( {\mathbf{{w}}(n)} \right) \mathbf{{\varepsilon }}\left( {\mathbf{{w}}(n)} \right) . \end{aligned}$$
(12)

Let

$$\begin{aligned} \begin{array}{c} \mathbf{{A}}\left( n \right) = - \left[ {{\mathbf{{J}}^T}\left( {\mathbf{{w}}(n)} \right) \mathbf{{J}}\left( {\mathbf{{w}}(n)} \right) + \mu \mathbf{{I}}} \right] \\ \\ \mathbf{{h}}\left( n \right) = {\mathbf{{J}}^T}\left( {\mathbf{{w}}(n)} \right) \mathbf{{\varepsilon }}\left( {\mathbf{{w}}(n)} \right) \end{array} \end{aligned}$$
(13)

so Eq. (12) can be written as

$$\begin{aligned} \varDelta \left( {\mathbf{{w}}(n)} \right) = \mathbf{{A}}{\left( n \right) ^{ - 1}}{} \mathbf{{h}}\left( n \right) . \end{aligned}$$
(14)

At this stage the QR decomposition can be used to solve Eq. (14), which yields the desired weight update vector \(\varDelta \left( {\mathbf{{w}}(n)} \right) \):

$$\begin{aligned} {\mathbf{{Q}}^T}\left( n \right) \mathbf{{A}}\left( n \right) \varDelta \left( {\mathbf{{w}}(n)} \right) = {\mathbf{{Q}}^T}\left( n \right) \mathbf{{h}}\left( n \right) , \end{aligned}$$
(15)
$$\begin{aligned} \mathbf{{R}}\left( n \right) \varDelta \left( {\mathbf{{w}}(n)} \right) = {\mathbf{{Q}}^T}\left( n \right) \mathbf{{h}}\left( n \right) . \end{aligned}$$
(16)

For the results presented in Sect. 4 the QR decomposition based on Givens rotations has been used. The Levenberg-Marquardt teaching algorithm can be summarized in five steps (a minimal code sketch of this loop follows the list):

1. Calculate the network outputs and errors across all teaching samples and evaluate the goal criterion.
2. Run the backpropagation method and calculate the Jacobian matrix.
3. Perform the QR decomposition in order to obtain \(\varDelta \left( {\mathbf{{w}}(n)} \right) \).
4. Apply \(\varDelta \left( {\mathbf{{w}}(n)} \right) \) to the neural network and calculate the goal criterion once again. If the new error is smaller than the original one, commit the weights, divide \(\mu \) by \(\beta \) and proceed to step 1. If the new error is bigger than the original one, multiply \(\mu \) by \(\beta \) and go back to step 3.
5. The algorithm is deemed finished once the gradient falls below the accepted threshold or the network error satisfies the predefined error goal.
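The sketch below puts the five steps together. It reuses the forward, jacobian_rows and lm_update helpers from the earlier sketches; the flatten/unflatten helpers and the stopping constants are illustrative assumptions rather than part of the original algorithm description.

```python
import numpy as np

def flatten(weights):
    return np.concatenate([W.ravel() for W in weights])

def unflatten(vec, like):
    out, k = [], 0
    for W in like:
        out.append(vec[k:k + W.size].reshape(W.shape))
        k += W.size
    return out

def train_lm(weights, X, D, mu=0.01, beta=2.0, goal=1e-3, grad_tol=1e-6, max_epochs=1000):
    """Five-step LM loop; X and D are lists of input and target vectors."""
    for _ in range(max_epochs):
        # Step 1: outputs, errors and the goal criterion of Eq. (1).
        eps = np.concatenate([forward(weights, x)[2] - d for x, d in zip(X, D)])
        error = 0.5 * float(eps @ eps)
        if error < goal:
            break                                        # Step 5: error goal reached
        # Step 2: Jacobian of Eq. (6) via backpropagation.
        J = np.vstack([jacobian_rows(weights, x) for x in X])
        if np.linalg.norm(J.T @ eps) < grad_tol:
            break                                        # Step 5: gradient small enough
        # Steps 3-4: retry the QR-based update, adapting mu until the error drops.
        improved = False
        while mu < 1e10:
            trial = unflatten(flatten(weights) + lm_update(J, eps, mu), weights)
            eps_new = np.concatenate([forward(trial, x)[2] - d for x, d in zip(X, D)])
            if 0.5 * float(eps_new @ eps_new) < error:
                weights, mu = trial, mu / beta           # commit, relax damping
                improved = True
                break
            mu *= beta                                   # reject, increase damping
        if not improved:
            break                                        # damping exhausted
    return weights
```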

3 Parallel Modification

As shown in the previous section (especially in Eq. (6)), the classic LMA requires computing a Jacobian, which is a \(\left[ no \cdot np\right] \times \left[ nw\right] \) matrix of error derivatives, where no stands for the number of network outputs, np is the number of teaching samples and nw is the total number of weights in the considered network. Then Eq. (14) needs to be solved, which involves the inversion of a \(\left[ nw\right] \times \left[ nw\right] \) matrix. The core of the presented modification is to create a set of Jacobians, one for each neuron of the network. The computation can then be done in parallel for all neurons. In such a case the size of each subproblem is bounded by the largest weight vector of a single neuron, which significantly reduces the algorithm's complexity. A rough sketch of such a parallel dispatch is given below.
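The sketch illustrates the parallel dispatch only: it assumes that the per-neuron Jacobian J_n and error vector eps_n have already been assembled for every neuron as the LMP method prescribes (the container per_neuron_systems is hypothetical), and it solves each small damped system of Eq. (12) independently in a worker pool.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def solve_neuron(args):
    """One small system of Eq. (12), restricted to a single neuron's weight vector."""
    J_n, eps_n, mu = args
    A = -(J_n.T @ J_n + mu * np.eye(J_n.shape[1]))
    h = J_n.T @ eps_n
    Q, R = np.linalg.qr(A)
    return np.linalg.solve(R, Q.T @ h)                   # this neuron's weight update

def parallel_lmp_updates(per_neuron_systems, mu, workers=8):
    """per_neuron_systems: list of (J_n, eps_n) pairs, one entry per neuron."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        jobs = [(J_n, eps_n, mu) for J_n, eps_n in per_neuron_systems]
        return list(pool.map(solve_neuron, jobs))        # updates returned in neuron order
```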

Fig. 1. The considered neural network. One of the neurons with the biggest weight vector and its connections are highlighted.

The neural network shown in Fig. 1 is taken into further consideration. The network contains 2 inputs and 3 layers. The two hidden layers consist of 100 neurons each, while the output layer contains a single neuron. To simplify the calculations, the teaching set is assumed to contain 100 samples. In the classic Levenberg-Marquardt algorithm (from now on also called LMC) the Jacobian size is the following

$$\begin{aligned} \left[ no \cdot np\right] \times \left[ \sum \limits _{l = 1}^{L}{N_l \cdot \left( N_{l-1}+1\right) } \right] . \end{aligned}$$
(17)

Substituting the considered network's values into the above equation, the Jacobian size is

$$\begin{aligned} \left[ 1 \cdot 100\right] \times \left[ 100 \cdot 3 + 100 \cdot 101 + 1 \cdot 101\right] = \left[ 100\right] \times \left[ 10501\right] . \end{aligned}$$
(18)

In the parallel approach to the Levenberg-Marquardt algorithm (from now on also called LMP) a separate Jacobian matrix is created for each neuron of the network, so its size can be expressed as

$$\begin{aligned} \left[ np\right] \times \left[ \max {\left( N_{l-1}+1\right) } \right] , \end{aligned}$$
(19)

which in the considered network translates into the biggest Jacobian of size

$$\begin{aligned} \left[ 100\right] \times \left[ 101\right] . \end{aligned}$$
(20)

It is easy to see that the smaller the Jacobian matrix, the faster the QR decomposition completes. Table 1 shows the comparison of Jacobian sizes across all layers in the considered network for both LMC and LMP. Table 2 presents the average times of solving Eq. (14) for matrices of size \(\left[ 10501\right] \times \left[ 10502\right] \) and \(\left[ 101\right] \times \left[ 102\right] \). An Intel Core i7 CPU running at 4.40 GHz has been used for the computation. A short sketch reproducing these sizes is given after the table captions.

Table 1. Comparison of Jacobian sizes in LMC and LMP (LMP is divided into layers).
Table 2. Average times (in milliseconds) for solving Eq. (14) for matrices of size \(\left[ 10501\right] \times \left[ 10502\right] \) and \(\left[ 101\right] \times \left[ 102\right] \).
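The sizes quoted in Eqs. (18) and (20) and summarized in Table 1 can be reproduced with a few lines of arithmetic; the helper below simply evaluates Eqs. (17) and (19) for the 2-100-100-1 network of Fig. 1 with 100 teaching samples.

```python
def lmc_jacobian_size(layer_sizes, n_inputs, n_samples):
    """Eq. (17): rows = no * np, columns = total number of weights (biases included)."""
    dims = [n_inputs] + layer_sizes
    n_weights = sum(dims[l + 1] * (dims[l] + 1) for l in range(len(layer_sizes)))
    return layer_sizes[-1] * n_samples, n_weights

def lmp_jacobian_size(layer_sizes, n_inputs, n_samples):
    """Eq. (19): rows = np, columns = the largest single-neuron weight vector."""
    dims = [n_inputs] + layer_sizes
    return n_samples, max(dims[l] + 1 for l in range(len(layer_sizes)))

print(lmc_jacobian_size([100, 100, 1], n_inputs=2, n_samples=100))   # (100, 10501), Eq. (18)
print(lmp_jacobian_size([100, 100, 1], n_inputs=2, n_samples=100))   # (100, 101),   Eq. (20)
```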

4 Results

4.1 Approximating \(y = 4x(1 - x)\) Function

The first teaching problem is the approximation of the logistic curve given by the formula \(y = 4x(1 - x)\). A fully connected 1-5-1 network has been used. The teaching set consists of 11 samples covering the argument range \(x \in \left[ 0, 1\right] \). A maximum error of 0.001 has been set as the teaching goal. The LMP algorithm achieved its best teaching results with narrow initial weight values \(w_{init} \in \left[ -0.5, 0.5\right] \) and a small value of the \(\beta \in \left[ 1.2, 4\right] \) factor. Figure 2 compares the outputs of the networks trained by LMC and LMP with the expected curve. Figure 3 shows the error for all teaching samples after training the neural network with the LMP algorithm. A sketch of this experimental setup is given below.
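The sketch recreates the setup under a few assumptions: the evenly spaced sampling of [0, 1], the random seed and the uniform weight initialisation are chosen only to obtain a runnable example and are not specified in the paper beyond the ranges quoted above. The resulting data could be fed to a routine such as the train_lm sketch from Sect. 2.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 11)            # 11 teaching samples covering [0, 1]
d = 4.0 * x * (1.0 - x)                  # logistic-curve targets
X = [np.array([xi]) for xi in x]         # network inputs
D = [np.array([di]) for di in d]         # expected responses

topology = [1, 5, 1]                     # fully connected 1-5-1 network
rng = np.random.default_rng(0)           # fixed seed (assumption)
weights = [rng.uniform(-0.5, 0.5, size=(topology[l + 1], topology[l] + 1))
           for l in range(len(topology) - 1)]   # w_init in [-0.5, 0.5]

# teaching goal: max epoch error < 0.001; beta chosen from [1.2, 4]
```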

Fig. 2. Comparison of network outputs for the \(y = 4x(1 - x)\) function approximation problem. Teaching target is set as max epoch error \(< 0.001\).

Fig. 3. Network error for the \(y = 4x(1 - x)\) function approximation problem trained by LMP. Teaching target is set as max epoch error \(< 0.001\).

Fig. 4. Comparison of network outputs for the \(f\left( x\right) = \sin x \cdot \log x\) function approximation problem. Teaching target is set as max epoch error \(< 0.001\).

Fig. 5. Network error for the \(f\left( x\right) = \sin x \cdot \log x\) function approximation problem trained by LMP. Teaching target is set as max epoch error \(< 0.001\).

4.2 Approximating \(y = \sin {x} \cdot \log {x}\) Function

The second teaching problem is the approximation of the curve given by the formula \(y = \sin {x} \cdot \log {x}\). A fully connected 1-15-1 network has been used. The teaching set consists of 40 samples covering the argument range \(x \in \left[ 0.1, 4\right] \). A maximum error of 0.001 has been set as the teaching goal. The LMP algorithm achieved its best teaching results with narrow initial weight values \(w_{init} \in \left[ -0.5, 0.5\right] \) and a small value of the \(\beta \in \left[ 1.2, 4\right] \) factor. Figure 4 compares the outputs of the networks trained by LMC and LMP with the expected curve. Figure 5 shows the error for all teaching samples after training the neural network with the LMP algorithm.

5 Conclusion

Parallelization of teaching algorithms seems to be a good direction for optimizing neural network training. This topic has been approached multiple times by many researchers, e.g. [5, 18, 19, 20]. A parallel approach to the popular Levenberg-Marquardt neural network teaching algorithm has been presented in this paper. In its classic form the LMA is characterized by a very high computational complexity which grows with the square of the network size. The presented parallel modification of the Levenberg-Marquardt algorithm seems to overcome this limitation while maintaining the reasonable accuracy and performance of the original. As shown in Sect. 4, in both approximation problems the networks trained by LMP achieved the assumed requirements. The LMP method seems to be a good direction for optimizing the Levenberg-Marquardt algorithm.

In the near future additional work on this matter can be undertaken. First, a complete parallel implementation of the LMP algorithm can be developed, e.g. using the CUDA platform and GPU capabilities. Then, more complex approximation problems can be trained, e.g. higher-dimensional functions. A deeper analysis of the influence of the \(\beta \) parameter and the initial network weights can also be performed. Finally, the LMP implementation can be compared to similar parallel optimizations of neural network training algorithms, e.g. [4, 6].