1 INTRODUCTION

Gradient algorithms are frequently used for neural network (NN) training. These algorithms are only locally optimal: the basic limitation of gradient-based algorithms is the strong dependence of the quality of the obtained solution on the choice of the initial approximation point. This problem is most acute when the number of local minima is large. A universal way to mitigate this shortcoming is the application of multi-start methods [1], which repeatedly run such algorithms from different initial approximation points; however, this substantially increases their computational complexity and still does not guarantee that the solution is optimal.

In [2], we proposed a multi-start algorithm with cutting. The main idea of this algorithm is to launch (initialize) a number of parallel or pseudoparallel runs (starts) of a locally optimal algorithm with different initial approximations. At the early stages of the algorithm's work, we identify "unpromising" starts, cut them off, and continue the search for a solution on a narrower set of starts. In this way, we decrease the number of steps spent on ineffective starts and thus decrease the run time of the algorithm. However, this method does not solve the problem of the algorithm getting stuck at local optima.

The second problem arising when using gradient algorithms is their high computational complexity, which is due to the necessity of calculating the gradient at each step. The backpropagation method was developed for training feedforward neural networks. D. Rumelhart and G. Hinton are believed to be the first to propose this algorithm in [3]; however, a theoretical basis sufficient for its implementation was developed in earlier papers [4–6]. The backpropagation algorithm uses a technique that allows one to quickly calculate the vector of partial derivatives (the gradient) of a complex multivariable function when the structure of this function is known. The efficiency of this algorithm can be increased by using some of its modifications, for example, the Levenberg–Marquardt algorithm [7].

Another problem is the choice of the step size for variation of the parameters to be optimized. If the step size is too small, the algorithm converges very slowly; if it is too large, the algorithm can lose stability [8].

In the present paper, we analyze the possibility of using a random search algorithm with self-learning [9] in place of locally optimal gradient algorithms. This algorithm

•is able to leave local optima,

•has a speed comparable with the speeds of the gradient algorithms in areas with a regular relief of the error function [9],

•has a low computational complexity per step.

2 DIRECTED RANDOM SEARCH ALGORITHMS WITH SELF-LEARNING

In this section, we discuss the main principles of development for the directed random search algorithms with self-learning [9].

The basis of the directed random search method is the iterative process

$$X_{k+1} = X_k + \alpha_k \frac{\xi}{\left\| \xi \right\|}, \quad k = 0, 1, \ldots,$$

In this equation, αk is the step size and ξ = (ξ1, …, ξn) is a realization of an n-dimensional random vector.
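As an illustration, here is a minimal Python sketch of this iteration (the function name and the choice of a standard normal distribution for ξ are our assumptions; the method only requires ξ to be a random n-dimensional vector):

```python
import numpy as np

def directed_search_step(x, alpha, rng):
    """One iteration X_{k+1} = X_k + alpha_k * xi / ||xi||."""
    xi = rng.standard_normal(x.shape)  # realization of a random n-dimensional vector
    return x + alpha * xi / np.linalg.norm(xi)

# usage: x = directed_search_step(x, alpha_k, np.random.default_rng())
```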

The directed random search algorithms with self-learning involve a restructuring of the probabilistic characteristics of the search, that is, a directed impact on the random vector ξ.

We can do this by introducing a memory vector

$$P^{k} = \left( p_1^k, p_2^k, \ldots, p_n^k \right),$$

where \(p_{j}^{k}\) is the probability of choosing the positive direction for the jth coordinate at the kth step.

Three elements define the directed random search algorithms with self-learning:

(1) An algorithm for selecting trial states at the current step.

(2) A decision rule according to which the algorithm chooses a new current approximation of the solution at each step.

(3) A self-learning algorithm, which refines the memory vector taking into account the information obtained at the current step.

Let us examine a self-learning algorithm with an arbitrary law of the probability change [9].

Suppose \(p_{j}^{{(k)}} = p\left( {w_{j}^{{(k)}}} \right)\). The form of the function p(w) may vary, but it has to be

•monotone,

•non-decreasing.

In [9], the following functions p(w) are proposed:

a linear function

$$p(w) = \begin{cases} 0 & \text{when } w < -1, \\ \frac{1}{2}(w + 1) & \text{when } -1 \leqslant w < 1, \\ 1 & \text{when } 1 \leqslant w, \end{cases}$$

and an exponential function

$$p(w) = \begin{cases} \frac{1}{2} e^{aw} & \text{when } w \leqslant 0, \\ 1 - \frac{1}{2} e^{-aw} & \text{when } w > 0, \end{cases}$$

where a > 0 is a parameter. The quantities \(w_j^{(k)}\) are updated recurrently as

$$w_j^{(k + 1)} = w_j^{(k)} - \delta \operatorname{sign}\left( \Delta x_j^{(k)} \Delta f^{(k)} \right),$$

where \(\Delta x_j^{(k)}\) is the change of the jth coordinate at the kth step, \(\Delta f^{(k)}\) is the corresponding change of the objective function, and \(\delta\) is a step that defines the learning rate.
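A minimal sketch of the two probability laws and the memory update in Python, vectorized over the coordinates (the function names and the default a = 1.0 are our assumptions):

```python
import numpy as np

def p_linear(w):
    """Linear law: 0 for w < -1, (w + 1)/2 on [-1, 1), 1 for w >= 1."""
    return np.clip(0.5 * (np.asarray(w) + 1.0), 0.0, 1.0)

def p_exponential(w, a=1.0):
    """Exponential law: e^{aw}/2 for w <= 0, 1 - e^{-aw}/2 for w > 0."""
    w = np.asarray(w, dtype=float)
    return np.where(w <= 0,
                    0.5 * np.exp(a * np.minimum(w, 0.0)),
                    1.0 - 0.5 * np.exp(-a * np.maximum(w, 0.0)))

def update_memory(w, dx, df, delta):
    """Self-learning step: w_j <- w_j - delta * sign(dx_j * df)."""
    return w - delta * np.sign(dx * df)
```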

Consequently, in the course of its work the algorithm redefines the probabilistic characteristics of the search by means of a goal-directed impact on the random vector ξ. This vector ceases to be equiprobable and, as a result of self-learning, gains a certain advantage in the directions of the best steps. This is achieved by introducing the memory vector: at each iteration, the algorithm recurrently corrects the values of its components depending on how successful or unsuccessful the previous step was (that is, on how much the value of the objective function changed).

3 APPLICATION OF RANDOM SEARCH ALGORITHM WITH SELF-LEARNING FOR NEURAL NETWORK TRAINING

We define a neural network (NN) as a triplet NET = (G, T, W) where

G is a directed acyclic graph defining the topology of the neural network: its vertices correspond to neurons and its arcs correspond to interneural connections;

T is the set of the neural network transition functions;

W is the set of the neural network weight coefficients.

By an initialized NN, we mean a NN with a given architecture to which an initial set of weight coefficients has been assigned.

A finite set S = {(Xi, Yi)}, i = 1, 2, …, P, of P pairs "argument/value of the function under approximation" will be called a sample. Here Xi ∈ RN and Yi ∈ RM, where N is the number of input neurons (the number of function arguments) and M is the number of output neurons. The sample is correct if Xi ≠ Xj ∀i, j: i ≠ j.

The problem of supervised learning of a neural network with a given architecture is to select values of the weight coefficients that maximize the accuracy of solving an applied problem. When we use the NN to approximate a tabulated continuous function, we can estimate the accuracy of the approximation by the value of the mean square error (MSE) [10]:

$$MSE(NET, S) = \frac{1}{P}\sum\limits_{i = 1}^{P} (Y_i - y_i)^2,$$

where yi is the vector of the values of the output neurons of the NN. When we apply the NN for pattern recognition, we estimate the accuracy by the value of MATCH:

$$MATCH(NET, S) = \frac{1}{P}\sum\limits_{i = 1}^{P} I_{Y_i = y_i},$$

where yi is the number of the class to which the network has assigned the ith input pattern and I is the indicator function.
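Both metrics are straightforward to compute; here is a sketch (we sum the squared componentwise differences of the output vectors in the MSE, an assumption, since the formula does not fix the vector norm):

```python
import numpy as np

def mse(Y, y):
    """MSE(NET, S): Y and y are (P, M) arrays of desired and actual outputs."""
    Y, y = np.asarray(Y), np.asarray(y)
    return np.mean(np.sum((Y - y) ** 2, axis=1))

def match(Y_cls, y_cls):
    """MATCH(NET, S): fraction of patterns assigned to the correct class."""
    return np.mean(np.asarray(Y_cls) == np.asarray(y_cls))
```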

For a given architecture of the NN, the supervised learning algorithm includes the following steps [8]:

(1) To initialize the NN architecture (to assign it an initial set of weight coefficients).

(2) To choose a subsequent training pair.

(3) To calculate the NN output.

(4) Depending on the problem, to calculate the error (MSE) or the accuracy (MATCH) of the NN.

(5) To minimize the error by correcting the NN weights.

(6) To repeat steps 2–5 for each pair of the sample S until the error reaches an acceptable level on the entire set S (a skeleton of this loop is sketched below).
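The skeleton below follows steps 1–6; the NN-specific operations are left as hypothetical helpers (initialize_weights, forward, correct_weights, and error_on are not defined in the paper):

```python
def train(net, sample, acceptable_error):
    """Generic supervised learning loop following steps 1-6 above (a sketch)."""
    initialize_weights(net)               # step 1: assign an initial set of weights
    while True:
        for X, Y in sample:               # step 2: choose the next training pair
            y = forward(net, X)           # step 3: calculate the NN output
            # steps 4-5: estimate the error (MSE) or accuracy (MATCH)
            # and correct the weights to decrease it
            correct_weights(net, X, Y, y)
        if error_on(net, sample) <= acceptable_error:   # step 6: stop criterion
            return net
```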

To initialize the neural network, we use the method of He et al. [11]. In the framework of this method, we consider fully connected neural networks consisting of L neuron layers with lk neurons in the kth layer. The initial weight of the connection between the ith neuron of the (k − 1)th layer and the jth neuron of the kth layer is defined as

$$W_{kij} = \xi \sqrt{\frac{2}{l_k}}, \quad \xi \sim N(0, 1).$$
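A sketch of this initialization for a list of layer sizes [l1, …, lL] (here we scale each weight by the size of the layer the connection leaves, matching the usual fan-in convention; the layer indexing in the formula above is ambiguous on this point):

```python
import numpy as np

def he_initialize(layer_sizes, rng=None):
    """W = xi * sqrt(2 / l), xi ~ N(0, 1); weights[k] maps layer k to layer k + 1."""
    rng = rng or np.random.default_rng()
    return [rng.standard_normal((layer_sizes[k + 1], layer_sizes[k]))
            * np.sqrt(2.0 / layer_sizes[k])
            for k in range(len(layer_sizes) - 1)]

# usage: weights = he_initialize([768, 100, 10])
```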

Step 5 is usually performed using one of the backpropagation algorithms. The major disadvantages of these methods are the high computational complexity of calculating the gradient and the inability to escape from "bad" local optima.

In the present paper, to eliminate these defects, we propose to use the random search algorithm with self-learning at step 5.

The NN weight coefficients are fed to the algorithm as the parameters to be optimized; depending on the problem, we use the error (MSE) or the accuracy (MATCH) as the objective function.

For the initialized neural network, the memory vector is initialized as P = (0.5, …, 0.5), where pi defines the probability of making a positive step when changing the weight wi. As the parameters of the algorithm, we define the number of steps K, a slowing coefficient α (α ≤ 1), and a limiting probability ε (ε ≤ \(p_i^k\) ≤ 1 – ε). As the initial value of the step size for the weight changes, we choose D = 1.0. The algorithm works step-by-step:

(1) The counter of the number of iterations k is set to 0.

(2) The value of E1 = MSE(NET, S) (or E1 = 1 – MATCH(NET, S) in the case of a pattern recognition problem) is calculated for a given neural network.

(3) The test selection of the direction is performed:

— For each weight wi, the direction of the step hi (positive or negative) is chosen with the corresponding probability pi.

— The previous value of each weight is stored: vi = wi. A new value of each weight is calculated: wi = wi + hiD.

(4) By analogy with step 2, the metric E2 is calculated.

(5) If E1 ≥ E2, the trial step is considered successful: E1 = E2, and the jump to step 7 is performed. Otherwise, the trial step is considered unsuccessful.

(6) The step is canceled: wi = vi. Since the error increased in the trial direction, pi = max(ε, min(1 – ε, piΔi)), where Δi is equal to 0.5 if hi > 0 and to 2 if hi < 0. The step size changes as D = αD. The jump to step 3 is performed.

(7) The counter k increases by 1. If k = K, the algorithm terminates. Otherwise, D = 0.5D, and the jump to step 2 is performed.

The output of the algorithm is a single neural network over which the random search algorithm with self-learning has performed K training steps.
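The following Python sketch assembles steps 1–7 over a flat vector of weights (error_fn stands for MSE or 1 − MATCH on the sample S; the defaults for α and ε are our assumptions, and we read the step 7 update as D = 0.5D):

```python
import numpy as np

def random_search_train(weights, error_fn, K, alpha=0.95, eps=0.05, rng=None):
    """Random search with self-learning over a flat weight vector (a sketch)."""
    rng = rng or np.random.default_rng()
    n = weights.size
    p = np.full(n, 0.5)                      # memory vector P = (0.5, ..., 0.5)
    D = 1.0                                  # initial step size
    k = 0                                    # step 1: iteration counter
    E1 = error_fn(weights)                   # step 2: error of the current network
    while k < K:
        # step 3: for each weight choose h_i = +1 with probability p_i, else -1
        h = np.where(rng.random(n) < p, 1.0, -1.0)
        v = weights.copy()                   # store the previous values
        weights = weights + h * D            # trial step
        E2 = error_fn(weights)               # step 4
        if E1 >= E2:                         # step 5: successful trial step
            E1 = E2
            k += 1                           # step 7
            D *= 0.5
        else:                                # step 6: cancel the step and re-learn
            weights = v
            p = np.clip(p * np.where(h > 0, 0.5, 2.0), eps, 1.0 - eps)
            D *= alpha
    return weights
```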

4 COMPARISON OF PROPOSED AND GRADIENT ALGORITHMS

To compare the random search algorithm with self-learning described in the previous section with the backpropagation algorithm, we use the following criteria:

(1) Accuracy (the value of MSE after training).

(2) Training time on a specific device.

(3) Total number of test iterations (including unsuccessful steps for the random search method with self-learning).

The characteristics of the test environment are:

•processor: Intel® Core™ i7-3520M, 2 × 2.8 GHz,

•RAM: 8 GB DDR3.

When testing, we averaged the results over several runs.

Prior to testing, we generated the initialized neural network with the aid of the method of He et al. [11].

For testing the training algorithms, we chose the following problems:

(1) The MNIST dataset as an example of a problem with a large number of parameters.

(2) Approximation of functions of 1, 2, or 3 variables as an example of problems with a small number of parameters.

We consider fully connected feedforward neural networks and choose a sigmoid function as the activation function. We define a neural network by the number of layers and the number of neurons in each layer. For simplicity, we denote the neural network topology as [l1, l2, …, lL], where L is the number of layers, li is the number of neurons in the ith layer, l1 is the size of the input layer, and lL is the size of the output layer.

For the MNIST problem, we performed the comparative testing on two neural networks: Topology I, [768, 100, 10], and Topology II, [768, 100, 50, 10]. A grayscale image of size 28 by 28 points was fed to the network input, and the maximal value among the 10 outputs was interpreted as the selection of one of the digits by the neural network. The testing time was measured on the same device under the same conditions.

When examining the approximation problem, we considered the functions listed in Table 1 of the Appendix. The topologies of the corresponding neural networks are shown in the second and third columns of Table 1.

Table 1.   Neural network topologies for approximation problems

In Tables 2–4 of the Appendix, we show the averaged test results for the case of one start with the same initial initialization, as well as for the case of a number of starts with selection of the best result. We also present the averaged test results of multi-start versions of the algorithms for 10 starts, with and without trial steps; the value of the parameter K = 8. The training time and the number of iterations are given in total over all the starts.

Table 2.   Averaged test results for approximation problems for the gradient method and for random search with and without trial steps
Table 3.   Averaged test results for approximation problems for multi-start versions of the gradient and random search (with trial steps) algorithms
Table 4.   Averaged test results for MNIST for the gradient, random search with trial steps, and random search without trial steps algorithms

In Tables 2 and 3, the best result for each algorithm is colored light gray, and the best result among all the algorithms is colored dark gray. In the first and second columns, we show the functions from Table 1 and the neural network topologies for these functions, respectively. In the table headers, for each test we indicate the algorithm and the values of its parameters (η is the learning rate parameter, which defines the rate of motion in the direction of the error gradient, and α is a parameter of the random search).

Table 2 corresponds to the problem of function approximation in the case of one start. As we see from this table, there is only one case in which the random search with self-learning and one start is more effective with regard to the error metric than the gradient algorithm. The random search algorithm with self-learning and without trial steps turned out to be the least effective. (The algorithm without trial steps is much the same as the random search algorithm with self-learning described above, but it performs no computation of the values of E1 and E2, and step 6 follows immediately after step 4.) From Table 3, we see that sometimes the values of the errors differ only in the second decimal place of the mantissa (for example, 1.02 × 10–3 and 1.03 × 10–3). This means that, in some sense, the random search algorithm with self-learning is comparable with the backpropagation algorithm; however, it is still less effective. On average, the training time for the random search algorithm was 10 times longer than that for the gradient method.

In Table 3 (the problem of function approximation), we show the results for multi-start versions of the random search and gradient algorithms. As we see from this table, the random search algorithm with self-learning is now better than the gradient method in 8 out of 18 cases. However, it has to be mentioned that, on average, the training time of the multi-start random search algorithm is 10 times longer than that of the gradient method, which may be crucial when choosing the training algorithm.

In Tables 4 and 5 (the problem of recognition of handwritten digits), the best results for each algorithm separately are shown in light gray, and the best result among all the algorithms is colored dark gray. In the table headers, for each test we indicate the algorithm and its parameters (η is the learning rate parameter, which defines the rate of motion in the direction of the error gradient, and α is a parameter of the random search).

Table 5.   Averaged test results for MNIST for multi-start versions of the gradient and random search (with trial steps) algorithms

In Table 4 (the problem of recognition of handwritten digits), we show the test results for the random search algorithms with self-learning with and without trial steps and for the gradient algorithm. As we see from this table, the result of the random search is 5 times worse than the result of the gradient method; consequently, it is not applicable to this problem. It also has to be mentioned that, on average, the training time for the random search algorithm with self-learning was 100 times longer than that for the gradient method.

In Table 5 (the problem of recognition of handwritten digits), we show the test results for multi-start versions of the random search algorithm with self-learning and the gradient method. As we see, the result of the random search is again 5 times worse than the result of the gradient method; consequently, it is not applicable to this problem. On average, the training time for the random search algorithm with self-learning was again 100 times longer than that for the gradient method.

5 CONCLUSIONS

As our simulations show, the efficiency of the multi-start random search algorithm is comparable with that of the gradient training algorithm in only half of the presented approximation problems and is not comparable with it in any of the MNIST problems. The proposed algorithm can be used for problems with a small number of parameters; however, even in problems of approximation of simple functions, it may be inefficient with regard to the error metric. A separate and frequently important drawback of this algorithm is its long run time: with regard to this parameter, it again turned out to be many times worse than the gradient method.

Thus, when the multi-start method is used, the random search algorithm with self-learning may be applicable to approximation problems with a small number of parameters if the training time is not important. In other cases, it is useless due to its inefficiency.