Abstract
The paper presents a parallel approach to the Levenberg-Marquardt algorithm (also called LM or LMA). The first section covers the mathematical basics of the classic LMA; the parallel modification is then introduced. The classic Levenberg-Marquardt algorithm is sufficient for training small neural networks, but for bigger networks its complexity becomes prohibitive for effective training. The main goal of this paper is to propose a more computationally efficient approach to LMA based on parallel computation. The proposed modification has been tested on several function approximation problems and compared with the classic LMA. The paper concludes that the parallel modification to LMA can significantly improve the algorithm's performance for bigger networks. The summary also contains several proposals for possible future work in the considered area.
Keywords
- Feed-forward neural network
- Parallel neural network training algorithm
- Optimization problem
- Levenberg-Marquardt algorithm
- QR decomposition
- Givens rotation
1 Introduction
The methods of Artificial Intelligence are broadly used in the modern world. Genetic algorithms, neural networks and their applications are the subject of much research [11, 12, 14, 15]. The most common uses of modern AI can be found in the areas of optimization, classification and pattern recognition [1, 2, 13, 16, 17, 21]. The Levenberg-Marquardt algorithm (also called LM or LMA) is a highly efficient and popular training method for feedforward neural networks [9]. It is classified as a quasi-Newton method and is widely used for function minimization problems such as neural network training [8]. One of the biggest drawbacks of the LMA is its high computational complexity compared with classical training methods such as the well-known Back Propagation algorithm and its variants [3, 7, 10]. That limitation makes LMA impractical as a training algorithm for bigger networks. The following paper describes a parallel modification of LMA which can be applied to any feedforward multi-layered neural network.
2 The Classic Levenberg-Marquardt Algorithm
The Levenberg-Marquardt method is an iterative algorithm mostly used for non-linear function optimization. It combines the advantages of the steepest descent and Gauss-Newton methods. In order to train a neural network, LMA is used to minimize the following error function
where \(\varepsilon _i^{\left( L \right) } \) stands for the non-linear neuron error and is given by
while \(d_r^{\left( L \right) }(t)\) is the r-th expected network response to the t-th training sample. In the Levenberg-Marquardt algorithm the weight update is given by
where \(\nabla \mathbf{{E}}\left( {\mathbf{{w}}(n)}\right) \) stands for the gradient vector
and \(\nabla ^\mathbf{{2}}{} \mathbf{{E}}\left( {\mathbf{{w}}(n)} \right) \) stands for the Hessian matrix
while \(\mathbf{{J}}\left( {\mathbf{{w}}(n)}\right) \) is the Jacobian matrix
The non-linear errors \(\varepsilon _i^{\left( lr \right) } \) of the neurons in hidden layers are calculated as follows
This makes it possible to calculate the entries of the Jacobian matrix across all weights in the network
In the classic LMA all weights of a neural network are stored as a single vector. The \(\mathbf{{S}}\left( {\mathbf{{w}}(n)} \right) \) factor from Eq. (5) is given by
In the Gauss-Newton method it is assumed that \(\mathbf{{S}}\left( {\mathbf{{w}}(n)} \right) \approx 0\), so Eq. (3) can be simplified
For the needs of the Levenberg-Marquardt algorithm it is assumed that \(\mathbf{{S}}\left( {\mathbf{{w}}(n)} \right) = \mu \mathbf I\), so Eq. (3) takes the form
Let
so Eq. (12) can be written as
At this stage QR decomposition can be used to solve Eq. (14), which yields the desired weight update vector \(\varDelta \left( {\mathbf{{w}}(n)} \right) \).
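The paper does not reproduce its solver here. As an illustration only (our sketch, not the authors' code), the damped system of the standard LM form \(\left( \mathbf{{J}}^T\mathbf{{J}} + \mu \mathbf{{I}}\right) \varDelta \mathbf{{w}} = -\mathbf{{J}}^T\varvec{\varepsilon }\) can be triangularized with Givens rotations and then solved by back-substitution:

```python
import numpy as np

def givens_qr_solve(A, b):
    """Solve the square system A x = b by zeroing sub-diagonal entries
    of A with Givens rotations (accumulating Q^T b in v), then running
    back-substitution on the resulting upper-triangular factor R."""
    R = np.array(A, dtype=float)
    v = np.array(b, dtype=float)
    n = R.shape[0]
    for j in range(n):                       # eliminate column j
        for i in range(n - 1, j, -1):        # rows below the pivot row j
            if R[i, j] == 0.0:
                continue
            r = np.hypot(R[j, j], R[i, j])
            c, s = R[j, j] / r, R[i, j] / r
            Rj, Ri = R[j, :].copy(), R[i, :].copy()
            R[j, :] = c * Rj + s * Ri        # rotate rows j and i so that
            R[i, :] = -s * Rj + c * Ri       # R[i, j] becomes exactly zero
            v[j], v[i] = c * v[j] + s * v[i], -s * v[j] + c * v[i]
    x = np.zeros(n)                          # back-substitution: R x = v
    for k in range(n - 1, -1, -1):
        x[k] = (v[k] - R[k, k + 1:] @ x[k + 1:]) / R[k, k]
    return x
```

In LMA one would call this with `A = J.T @ J + mu * np.eye(nw)` and `b = -J.T @ eps`; the rotation-by-rotation structure is what makes the Givens variant attractive for parallel hardware.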
For the results presented in Sect. 4 the QR decomposition based on Givens rotations has been used. The Levenberg-Marquardt training algorithm can be summarized in 5 steps:
1. Calculate the network outputs and errors across all training samples and evaluate the goal criterion.
2. Run the backpropagation method and calculate the Jacobian matrix.
3. Perform QR decomposition in order to obtain \(\varDelta \left( {\mathbf{{w}}(n)} \right) \).
4. Apply \(\varDelta \left( {\mathbf{{w}}(n)} \right) \) to the neural network and evaluate the goal criterion once again. If the new error is smaller than the original one, commit the weights, divide \(\mu \) by \(\beta \) and proceed to step 1. If the new error is bigger than the original one, multiply \(\mu \) by \(\beta \) and go back to step 3.
5. The algorithm is deemed finished once the gradient is reduced below the accepted threshold or the network error satisfies the predefined error goal.
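The five steps above can be sketched as a generic training loop. This is a minimal illustration under our own naming (not the authors' implementation), and for brevity the QR solve of step 3 is replaced by NumPy's dense solver:

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, w, mu=1e-2, beta=2.0,
                        err_goal=1e-12, grad_tol=1e-10, max_iter=100):
    """Generic LM loop: residual(w) returns the error vector eps,
    jacobian(w) returns J; mu is adapted by the factor beta."""
    e = residual(w)                     # step 1: errors and goal criterion
    E = 0.5 * float(e @ e)
    for _ in range(max_iter):
        J = jacobian(w)                 # step 2: Jacobian
        g = J.T @ e                     # gradient J^T eps
        if E < err_goal or np.linalg.norm(g) < grad_tol:
            break                       # step 5: stopping conditions
        while True:                     # steps 3-4: solve and adapt mu
            dw = np.linalg.solve(J.T @ J + mu * np.eye(len(w)), -g)
            e_new = residual(w + dw)
            E_new = 0.5 * float(e_new @ e_new)
            if E_new < E:               # accept: commit weights, shrink mu
                w, e, E = w + dw, e_new, E_new
                mu /= beta
                break
            mu *= beta                  # reject: grow mu and retry
            if mu > 1e12:               # give up if damping explodes
                return w, E
    return w, E
```

As a toy usage, fitting the two parameters of \(y = a\,x(1-x) + b\) to samples of the paper's logistic curve \(y = 4x(1-x)\) converges to \(a \approx 4\), \(b \approx 0\).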
3 Parallel Modification
As shown in the previous section (especially in Eq. (6)), the classic LMA requires computing a Jacobian, which is a \(\left[ no \cdot np\right] \times \left[ nw\right] \) matrix of error derivatives, where no stands for the number of network outputs, np is the number of training samples and nw is the total number of weights in the considered network. Then Eq. (14) needs to be solved, which involves the inversion of an \(\left[ nw\right] \times \left[ nw\right] \) matrix. The essence of the presented modification is to create a set of Jacobians, one for each neuron of the network. The computation can then be done in parallel for all neurons. In that case the algorithm complexity is significantly reduced, being bounded by the size of the biggest weight vector of a single neuron.
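Since each neuron's subproblem is independent, the solves can be dispatched concurrently. The sketch below only illustrates this dispatch pattern with hypothetical per-neuron Jacobians and errors (the paper does not specify its parallel runtime, and the helper names are ours):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def neuron_update(Jn, en, mu):
    """Solve one small per-neuron damped system:
    (Jn^T Jn + mu I) d = -Jn^T en, where Jn has fan_in + 1 columns."""
    A = Jn.T @ Jn + mu * np.eye(Jn.shape[1])
    return np.linalg.solve(A, -(Jn.T @ en))

def parallel_updates(jacobians, errors, mu):
    """Dispatch one independent subproblem per neuron; each solve
    touches only that neuron's weight vector, so no locking is needed."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda je: neuron_update(*je, mu),
                             zip(jacobians, errors)))
```

The key point is that each system is only \((fan\_in + 1) \times (fan\_in + 1)\) instead of \(nw \times nw\), so the per-worker cost is governed by the widest layer rather than the whole network.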
The neural network shown in Fig. 1 is used in further considerations. The network contains 2 inputs and 3 layers. The two hidden layers consist of 100 neurons each, while the output layer contains a single neuron. To simplify the calculations, the training vector is assumed to contain 100 samples. In the classic Levenberg-Marquardt algorithm (from now on also called LMC) the Jacobian size is the following
Filling the above equation with the considered network's values, the Jacobian size is calculated as follows
In the parallel approach to the Levenberg-Marquardt algorithm (from now on also called LMP) a separate Jacobian matrix is created for each neuron of the network, so its size can be expressed as
which in the considered network translates into the biggest Jacobian of size
It is easy to see that the smaller the Jacobian matrix is, the faster the QR decomposition is completed. Table 1 shows the comparison of Jacobian sizes across all layers in the considered network for both LMC and LMP. Table 2 presents the average times of solving Eq. (14) for matrices of size \(\left[ 10501\right] \times \left[ 10502\right] \) and \(\left[ 101\right] \times \left[ 102\right] \). An Intel Core i7 CPU clocked at 4.40 GHz has been used for the computation.
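The matrix sizes quoted above can be reproduced directly from the 2-100-100-1 architecture (a short bookkeeping sketch; variable names are ours):

```python
# Weight and Jacobian bookkeeping for the example 2-100-100-1 network
layers = [2, 100, 100, 1]          # inputs, hidden, hidden, output
n_samples, n_outputs = 100, 1

# Total weights: each neuron carries (fan-in + 1 bias) weights.
nw = sum(n * (fan_in + 1) for fan_in, n in zip(layers, layers[1:]))

# Classic LMC: one [no*np] x [nw] Jacobian; the linear system,
# augmented with its right-hand side, is [nw] x [nw + 1].
lmc_system = (nw, nw + 1)

# Parallel LMP: the cost is bounded by the biggest per-neuron
# weight vector, i.e. the widest fan-in plus the bias.
biggest_wv = max(fan_in + 1 for fan_in in layers[:-1])
lmp_system = (biggest_wv, biggest_wv + 1)
```

This recovers the \(\left[ 10501\right] \times \left[ 10502\right] \) and \(\left[ 101\right] \times \left[ 102\right] \) systems compared in Table 2.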
4 Results
4.1 Approximating \(y = 4x(1 - x)\) Function
The first training problem is the approximation of the logistic curve given by the formula \(y = 4x(1 - x)\). A fully connected 1-5-1 network has been used. The training set consists of 11 samples covering the argument range \(x \in \left[ 0, 1\right] \). A maximum error of 0.001 has been set as the training goal. The LMP algorithm achieved the best training results with narrow initial weight values \(w_{init} \in \left[ -0.5, 0.5\right] \) and a small value of the \(\beta \in \left[ 1.2, 4\right] \) factor. Figure 2 shows the network outputs trained by LMC and LMP together with the expected curve. Figure 3 shows the error for all training samples after training the neural network with the LMP algorithm.
4.2 Approximating \(y = \sin {x} \cdot \log {x}\) Function
The second training problem is the approximation of the curve given by the formula \(y = \sin {x} \cdot \log {x}\). A fully connected 1-15-1 network has been used. The training set consists of 40 samples covering the argument range \(x \in \left[ 0.1, 4\right] \). A maximum error of 0.001 has been set as the training goal. The LMP algorithm achieved the best training results with narrow initial weight values \(w_{init} \in \left[ -0.5, 0.5\right] \) and a small value of the \(\beta \in \left[ 1.2, 4\right] \) factor. Figure 4 shows the network outputs trained by LMC and LMP together with the expected curve. Figure 5 shows the error for all training samples after training the neural network with the LMP algorithm.
5 Conclusion
Parallelization of training algorithms seems to be a good direction for neural network training optimization. This topic has been approached multiple times by many researchers, e.g. [5, 18,19,20]. A parallel approach to the popular Levenberg-Marquardt neural network training algorithm has been presented in this paper. In its classic form the LMA is characterized by a very high computational complexity, which grows quadratically with the network size. The presented parallel modification to the Levenberg-Marquardt algorithm seems to overcome this limitation while maintaining reasonable accuracy and the performance of the archetype. As shown in Sect. 4, in both approximation problems the networks trained by LMP achieved the assumed requirements. The LMP method seems to be a good direction for Levenberg-Marquardt algorithm optimization.
In the near future additional work in this matter can be undertaken. First, a complete parallel implementation of the LMP algorithm can be developed, e.g. using the CUDA platform and GPU capabilities. Then, more complex approximation problems can be trained, e.g. higher-dimensional functions. A deeper analysis of the influence of the parameter \(\beta \) and the initial network weights can also be performed. Finally, the LMP implementation can be compared to similar parallel optimizations of neural network training algorithms, e.g. [4, 6].
References
Starczewski, A.: A new validity index for crisp clusters. Pattern Anal. Appl. 20(3), 687–700 (2015)
Starczewski, A., Krzyżak, A.: Improvement of the validity index for determination of an appropriate data partitioning. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds.) ICAISC 2017. LNCS (LNAI), vol. 10246, pp. 159–170. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59060-8_16
Bilski, J., Wilamowski, B.M.: Parallel Levenberg-Marquardt algorithm without error backpropagation. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds.) ICAISC 2017. LNCS (LNAI), vol. 10245, pp. 25–39. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59063-9_3
Bilski, J., Kowalczyk, B., Żurada, J.M.: Parallel implementation of the Givens rotations in the neural network learning algorithm. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds.) ICAISC 2017. LNCS (LNAI), vol. 10245, pp. 14–24. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59063-9_2
Bilski, J., Smola̧g, J.: Parallel realisation of the recurrent RTRN neural network learning. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp. 11–16. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-69731-2_2
Bilski, J., Smoląg, J.: Parallel architectures for learning the RTRN and Elman dynamic neural networks. IEEE Trans. Parallel Distrib. Syst. 26(9), 2561–2570 (2015)
Bilski, J., Smoląg, J., Żurada, J.M.: Parallel approach to the Levenberg-Marquardt learning algorithm for feedforward neural networks. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds.) ICAISC 2015. LNCS (LNAI), vol. 9119, pp. 3–14. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19324-3_1
Marquardt, D.W.: An algorithm for least-squares estimation of nonlinear parameters. J. Soc. Ind. Appl. Math. 11(2), 431–441 (1963)
Hagan, M.T., Menhaj, M.B.: Training feedforward networks with the Marquardt algorithm. IEEE Trans. Neural Netw. 5(6), 989–993 (1994)
Werbos, P.J.: Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Harvard University, Cambridge (1974)
Cpałka, K., Łapa, K., Przybył, A.: A new approach to design of control systems using genetic programming. Inf. Technol. Control 44(4), 433–442 (2015)
Łapa, K., Cpałka, K.: On the application of a hybrid genetic-firework algorithm for controllers structure and parameters selection. In: Borzemski, L., Grzech, A., Świątek, J., Wilimowska, Z. (eds.) Information Systems Architecture and Technology: Proceedings of 36th International Conference on Information Systems Architecture and Technology – ISAT 2015 – Part I. AISC, vol. 429, pp. 111–123. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-28555-9_10
Łapa, K., Cpałka, K., Galushkin, A.I.: A new interpretability criteria for neuro-fuzzy systems for nonlinear classification. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds.) ICAISC 2015. LNCS (LNAI), vol. 9119, pp. 448–468. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19324-3_41
Khan, N.A., Shaikh, A.: A smart amalgamation of spectral neural algorithm for nonlinear Lane-Emden equations with simulated annealing. J. Artif. Intell. Soft Comput. Res. 7(3), 215–224 (2017)
Liu, H., Gegov, A., Cocea, M.: Rule based networks: an efficient and interpretable representation of computational models. J. Artif. Intell. Soft Comput. Res. 7(2), 111–123 (2017)
Notomista, G., Botsch, M.: A machine learning approach for the segmentation of driving Maneuvers and its application in autonomous parking. J. Artif. Intell. Soft Comput. Res. 7(4), 243–255 (2017)
Rotar, C., Iantovics, L.B.: Directed evolution - a new metaheuristic for optimization. J. Artif. Intell. Soft Comput. Res. 7(3), 183–200 (2017)
Rutkowska, D., Nowicki, R., Hayashi, Y.: Parallel processing by implication-based neuro-fuzzy systems. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2001. LNCS, vol. 2328, pp. 599–607. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-48086-2_66
Smoląg, J., Bilski, J.: A systolic array for fast learning of neural networks. In: V NNSC, pp. 754–758 (2000)
Smoląg, J., Bilski, J., Rutkowski, L.: Systolic array for neural networks. In: IV KSNiIZ, pp. 487–497 (1999)
Villmann, T., Bohnsack, A., Kaden, M.: Can learning vector quantization be an alternative to SVM and deep learning? Recent trends and advanced variants of learning vector quantization for classification learning. J. Artif. Intell. Soft Comput. Res. 7(1), 65–81 (2017)
© 2018 Springer International Publishing AG, part of Springer Nature
Bilski, J., Kowalczyk, B., Grzanek, K. (2018). The Parallel Modification to the Levenberg-Marquardt Algorithm. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2018. Lecture Notes in Computer Science(), vol 10841. Springer, Cham. https://doi.org/10.1007/978-3-319-91253-0_2