1 Introduction

Computer vision enables machines to view and comprehend digital images, something that humans do naturally with high accuracy. Image processing is an important field of computer vision, with major real-world applications such as autonomous vehicles [2], industry [15], medical diagnosis [19], and face recognition [21]. Image processing covers many tasks, such as regularization [7, 8], clustering [13], localization [6] and classification [9]. In this paper we are concerned with image classification, that is, the process of assigning a class label to an image.

Typically, the current state-of-the-art solutions to image classification use artificial neural networks [19], convolutional neural networks [17] and deep neural networks [11, 14], but these approaches have limitations such as the need for carefully designed structures and poor interpretability. The training algorithms used for image classification are gradient descent algorithms [4, 15, 19] and genetic algorithms [9].

However, gradient descent and stochastic algorithms tend to converge slowly and can become trapped in local minima. In fact, image classification can be divided into two phases, feature extraction and classification, which can be performed in sequence. This avoids adjusting the parameters of the entire network simultaneously and reduces the difficulty of parameter tuning.

Based on the above considerations, an improved gradient descent method, the stochastic gradient descent algorithm (SGD), has been proposed. SGD combines the advantages of gradient descent, back propagation [4] and a stochastic strategy, and it is used as a training algorithm for the classifier, as are related methods such as the Nesterov accelerated gradient (NAG) [3].

In this paper, we propose a fast iterative algorithm for image classification with neural networks, called the fast gradient descent algorithm (FGD), which combines the SGD algorithm with a learning rate controlled by the Armijo rule and accelerates the iterations with a Nesterov step.

The neural network is composed of stacked restricted Boltzmann machines [10] or auto-encoders [1], which are used to extract the image features; the extracted features are then classified by a softmax classifier.

FGD is proposed as the training algorithm of the softmax classifier and is used to find the optimal values of the softmax classifier parameters.

We discuss the convergence of the proposed FGD algorithm and present promising numerical results on the MNIST dataset, which show that the proposed FGD classification method achieves better accuracy and resistance to over-fitting than other image classification methods such as SGD and NAG.

These algorithms are implemented in Python and evaluated on the MNIST dataset [12].

The paper is organized as follows. Section 2 introduces notation and assumptions. Section 3 describes image classification with neural networks. Section 4 presents the gradient algorithms. The results are discussed in Sect. 5, and Sect. 6 concludes.

2 Notations and assumptions

First, we establish notation for future use:

  • the features \(x_i\) are the input variables,

  • the target \(y_i\) is the output variable that we are trying to predict,

  • a training example is a pair \((x_i,y_i)\),

  • the training set is the list of n training examples that we use for learning:

    $$\begin{aligned} \{(x_i,y_i): i=1,\dots ,n\}. \end{aligned}$$

Note that the subscript ‘i’ in the above notation is simply an index into the training set. We will also use the following notation:

  • \(E=\mathfrak {R}^n\) is a finite dimensional Euclidean space of input variables,

  • \(x^{t}\) the transpose of \(x=(x_{1},x_{2},\ldots ,x_{n})\in E\);

  • \(|| x||_2 =\sqrt{x^{t}x} =\sqrt{(x_{1}^{2}+\cdots +x_{n}^{2})} \) is the Euclidean norm of x;

  • \(<x,y>=x^ty\) is the inner product, with the corresponding norm \(\Vert x\Vert _2\).

To describe the classification problem slightly more formally, given a training set, our objective is to learn a function \(h:E\longrightarrow \mathfrak {R}\) called a hypothesis, so that h(x) is an optimal predictor for the corresponding value of y.

We consider the following basic problem of convex optimization

$$\begin{aligned} \min \limits _{\mathbf {x\in E}}\mathbf {f(x)}, \end{aligned}$$
(1)

where \(\mathbf {f}: E\longrightarrow \mathfrak {R}\) is a smooth convex function of type \(C^{1,1}\), i.e., continuously differentiable with a Lipschitz continuous gradient of constant L:

$$\begin{aligned} \Vert \nabla \mathbf {f(x)}-\nabla \mathbf {f(y)}\Vert \le L\Vert \mathbf {x}-\mathbf {y}\Vert ~~~~\forall \mathbf {x}, \mathbf {y}\in E, \end{aligned}$$

and \(\nabla \mathbf {f(x)}\) is the gradient of \(\mathbf {f(x)}\).

It is assumed that a solution \(\mathbf {x}^*\) of (1) exists.

Typically, computational algorithms for (1) rely on gradient oracles, which provide, at an arbitrary point \(\mathbf {x}\), the value of the objective function \(\mathbf {f(x)}\) and its gradient \(\mathbf {g}=\nabla \mathbf {f(x)}\).

3 Image classification with neural networks

3.1 Model

We define the following activation functions (a short code sketch is given after the list):

  • the sigmoid function

    $$\begin{aligned} \begin{array}{rll} \sigma : &{} \mathbb {R}\rightarrow &{} (0,1)\\ &{} x \mapsto &{} \frac{1}{1+e^{-x}} \end{array} \end{aligned}$$
  • the hyperbolic tangent function

    $$\begin{aligned} \begin{array}{rll} \tanh : &{} \mathbb {R}\rightarrow &{} (-1,1)\\ &{} x \mapsto &{} \frac{e^{2x}-1}{e^{2x}+1} \end{array} \end{aligned}$$
  • the rectified linear unit function

    $$\begin{aligned} \begin{array}{rll} ReLU : &{} \mathbb {R}\rightarrow &{} \mathbb {R}^+\\ &{} x \mapsto &{} \max \{0,x\} \end{array} \end{aligned}$$
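For concreteness, a minimal NumPy sketch of these three activation functions is given below; the function names and vectorized form are our own choices, not part of the original formulation.

    import numpy as np

    def sigmoid(x):
        # sigma(x) = 1 / (1 + exp(-x)), mapping R to (0, 1).
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        # tanh(x) = (exp(2x) - 1) / (exp(2x) + 1), mapping R to (-1, 1).
        return np.tanh(x)

    def relu(x):
        # ReLU(x) = max(0, x), mapping R to R^+.
        return np.maximum(0.0, x)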

Neural networks are machine learning models [18] which, as functions of their input, are alternating compositions of linear (affine) transformations, i.e.,

$$\begin{aligned} \begin{array}{ll} \mathbb {R}^n \rightarrow &{} \mathbb {R}^m\\ x \mapsto &{} Wx + b \end{array} \end{aligned}$$

where \(b\in \mathbb {R}^m\) and W is an \(m\times n\) matrix, with activation functions applied to each coordinate. The biases b and weights W are the parameters of the model to be estimated.

We collect the biases b and weights W of all trainable layers into a single parameter vector \(\mathbf {w}\).

We denote by \(C=\{0, 1, \dots , \mathbf {c}-1\}\) the set of classes, where \(\mathbf {c}\) is the number of classes. On input x, the neural network model must output a probability vector \(y\in \{v\in \mathbb {R}^\mathbf {c}: \sum _iv_i=1\}\). For \(k\in C\), the softmax function

$$\begin{aligned} \sigma (y)_k = \frac{e^{y_k}}{\sum _i e^{y_i}} \end{aligned}$$

is applied to the network outputs.
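As an illustration, a numerically stable softmax can be sketched as follows; subtracting the maximum before exponentiating is a standard safeguard we add here and does not change the result.

    import numpy as np

    def softmax(y):
        # sigma(y)_k = exp(y_k) / sum_i exp(y_i); shifting by max(y)
        # prevents overflow and leaves the result unchanged.
        e = np.exp(y - np.max(y))
        return e / np.sum(e)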

The objective function in our minimization problem is typically the cross-entropy, a smooth approximation of the classification error:

$$\begin{aligned} f(\mathbf {w})= -\frac{1}{N}\sum \limits _{j=1}^N\sum \limits _{q=0}^{\mathbf {c}-1}z_{jq}\log (\hat{z_q}(x_j;\mathbf {w})) \end{aligned}$$
(2)

where \( z_{jq}=\left\{ \begin{array}{ll} 1 &{} \hbox {if } x_j \hbox { is in class } q\\ 0 &{} \hbox {otherwise} \end{array} \right. \) and \(\hat{z_q}(x_j;\mathbf {w})\) is the predicted probability that \(x_j\) belongs to class q.

We define the training loss function \(L(\mathbf {w})\) as:

$$\begin{aligned} L(\mathbf {w}) = f(\mathbf {w}) + \theta R(\mathbf {w}) \end{aligned}$$

where \(\theta \) is the regularization parameter and \(R(\mathbf {w}) =\Vert \mathbf {w}\Vert ^2\) is the \(\ell _2\) regularization term.
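A possible sketch of the loss (2) together with the regularized training loss \(L(\mathbf {w})\), assuming the labels are supplied as one-hot rows of Z and the predicted probabilities as rows of Z_hat (the names are ours), is:

    import numpy as np

    def cross_entropy(Z, Z_hat, eps=1e-12):
        # f(w) = -(1/N) * sum_j sum_q z_{jq} * log(zhat_q(x_j; w)); eps avoids log(0).
        N = Z.shape[0]
        return -np.sum(Z * np.log(Z_hat + eps)) / N

    def training_loss(Z, Z_hat, w, theta):
        # L(w) = f(w) + theta * R(w) with R(w) = ||w||^2.
        return cross_entropy(Z, Z_hat) + theta * np.sum(w ** 2)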

The most popular algorithms for estimating the parameters \(\mathbf {w}\) of neural networks are gradient descent algorithms combined with back propagation.

At each step t, the technique used in SGD algorithms is to replace the classification error f(\(\mathbf {w}\)) of Eq. (2) by

$$\begin{aligned} f(\mathbf {w,t})= -\frac{1}{|M_t|}\sum \limits _{j: x_j\in M_t}\sum \limits _{q=0}^{\mathbf {c}-1}z_{jq}\log (\hat{z_q}(x_j;\mathbf {w})) \end{aligned}$$
(3)

where \(M_t\) is a mini-batch. Equation (3) satisfies

$$\begin{aligned} \mathbb {E}[f(w,t)] = f(w) \end{aligned}$$

and

$$\begin{aligned} \mathbb {E}[\nabla _wf(w,t)] = \nabla _wf(w) \end{aligned}$$

In practice, the training sample is shuffled arbitrarily once, separated into batches of a fixed size, and these batches are used one at a time until all of them have been consumed; the process then repeats from the first batch in the same order. Each epoch thus consists of sequential batches covering all the training data, starting with the first batch and ending with the last one.
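A minimal sketch of this epoch structure, assuming the data are NumPy arrays X, Y and that a mini-batch update step(w, X_batch, Y_batch) is supplied elsewhere, could be:

    import numpy as np

    def run_epochs(w, X, Y, step, batch_size=64, epochs=10, seed=0):
        # Shuffle once, split into fixed-size mini-batches, and reuse the
        # same batch order in every epoch, as described above.
        rng = np.random.default_rng(seed)
        order = rng.permutation(X.shape[0])
        for _ in range(epochs):
            for start in range(0, X.shape[0], batch_size):
                idx = order[start:start + batch_size]   # mini-batch M_t
                w = step(w, X[idx], Y[idx])             # one SGD-type update
        return w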

3.2 Back propagation

The weight parameters are updated and the errors are propagated back through the layers of the neural network by combining the GD algorithm with the back propagation algorithm.

Let \(\ell _{W,b}(x)\) denote the layer activation:

$$\begin{aligned} \ell _{W,b}(x) = \sigma (Wx+b) \end{aligned}$$

where x is the input of a training sample. The loss function is:

$$\begin{aligned} g(W,b,x,y)= \frac{1}{2}\Vert y-\ell _{W,b}(x)\Vert ^2 \end{aligned}$$

Since the sigmoid function satisfies

$$\begin{aligned} \alpha = \sigma (z),~~~ \frac{\partial \alpha }{\partial z} = \alpha (1-\alpha ), \end{aligned}$$

where \(z=Wx+b\), the chain rule gives:

$$\begin{aligned} \frac{\partial }{\partial z}g(W,b,x,y)= \frac{\partial g}{\partial \alpha } \frac{\partial \alpha }{\partial z}=-(y-\alpha )\alpha (1-\alpha ) \end{aligned}$$

Then

$$\begin{aligned} \frac{\partial }{\partial W}g(W,b,x,y)= & {} \frac{\partial g}{\partial z} \frac{\partial z}{\partial W}=-(y-\alpha )\alpha (1-\alpha )x^T \\ \frac{\partial }{\partial b}g(W,b,x,y)= & {} \frac{\partial g}{\partial z} \frac{\partial z}{\partial b}=-(y-\alpha )\alpha (1-\alpha ). \end{aligned}$$
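These derivatives translate directly into code; the following sketch computes the gradients of g for a single sigmoid layer and is only an illustration of the formulas above, not the full network update.

    import numpy as np

    def layer_gradients(W, b, x, y):
        # Forward pass: alpha = sigma(W x + b); loss g = 0.5 * ||y - alpha||^2.
        z = W @ x + b
        alpha = 1.0 / (1.0 + np.exp(-z))
        delta = -(y - alpha) * alpha * (1.0 - alpha)   # dg/dz
        grad_W = np.outer(delta, x)                    # dg/dW = dg/dz * x^T
        grad_b = delta                                 # dg/db = dg/dz
        return grad_W, grad_b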

4 Gradient methods

4.1 Gradient descent (GD)

The neural network algorithm uses an optimization method to search for an optimum. The optimization method considered here is the GD algorithm, represented by Eq. (4), a well-referenced first-order optimization algorithm that helps us find a local minimum. GD is used to minimize a function, in this case the cost function: the aim is to find a value of w which yields the lowest error and lets the cost function reach a local minimum. Each iteration of this method seeks a new value of w that yields a slightly lower error than the previous one, and a learning rate controls how large a step we take downhill at each iteration. The GD algorithm starts with some initial w and repeatedly performs the update:

$$\begin{aligned} w_i = w_i - \eta \frac{\partial }{\partial w_i}f(w),~~~~i=0,\dots ,n. \end{aligned}$$
(4)

Here, \(\eta \) is a constant learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of f. Since differentiation and summation are interchangeable, we can calculate the gradient as a sum of per-example components, which avoids unnecessarily complicated analytical formulas for the neural network. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient at the current point; if instead one takes steps proportional to the positive of the gradient, one approaches a local maximum, and the procedure is then known as gradient ascent. In our case, we are looking for the minimum value of the cost function, represented by Eq. (5).

$$\begin{aligned} y(w)=f(w) \end{aligned}$$
(5)

Note that the value of the step size \(\eta \), known as the learning rate, is allowed to change at every iteration of the optimization process. With certain assumptions on the function y and particular choices of \(\eta \), convergence to a local minimum can be guaranteed. When the function y is convex, all local minima are also global minima, so in this case gradient descent can converge to the global solution.
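A generic sketch of the update (4), assuming a gradient oracle grad_f is available, is given below, together with a toy run on a convex quadratic.

    import numpy as np

    def gradient_descent(w0, grad_f, eta=0.1, iters=200):
        # Update (4): repeatedly step against the gradient, w <- w - eta * grad f(w).
        w = np.asarray(w0, dtype=float)
        for _ in range(iters):
            w = w - eta * grad_f(w)
        return w

    # Toy run on the convex quadratic f(w) = ||w||^2, whose minimizer is w = 0.
    w_min = gradient_descent(np.array([3.0, -2.0]), lambda w: 2.0 * w)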

4.2 Control gradient algorithms

The simplest algorithms for (1) are gradient methods of the kind

$$\begin{aligned} \mathbf {x}^{k+1}=\mathbf {x}^k-\eta _k\mathbf {g}_k,~~ \mathbf {g}^k=\nabla \mathbf {f}(\mathbf {x}^k),~~~~k=0,1,\ldots \end{aligned}$$
(6)

which have been under intensive study since the 1960s. It was shown that (6) converges under very mild conditions for a learning rate \(\eta _k\) satisfying the “divergent series” condition \(\sum _k\eta _k=\infty ,~~~~\eta _k\rightarrow +0\).

Let us suppose a learning rate \(\eta _{k-1}\) was used at iteration \(k-1\) and let us identify at \(x^k\) the relation between \(\eta _{k-1}\) and the learning rate providing minimization of f in the direction \(-g^{k-1}\). To this end, Uryasev used in [20] the scalar product

$$\begin{aligned} u_k=<g^k,g^{k-1}> \end{aligned}$$

It was heuristically suggested in [20] to correct the learning rate according to the formula

$$\begin{aligned} \eta _{k+1}=\left\{ \begin{array}{rll} \eta _\mathrm{decr}\eta _k,&{}\hbox {if}&{}\mathbf {u}_{k+1} \le 0,\\ \eta _\mathrm{incr}\eta _k,&{}\hbox {otherwise},&{} \end{array} \right. \end{aligned}$$
(7)

where:

  • \(\eta _\mathrm{decr}\) is a decreasing coefficient,

  • \(\eta _\mathrm{incr}\) is an increasing coefficient,

  • \(0<\eta _\mathrm{decr}<1<\eta _\mathrm{incr}\) and \(\eta _\mathrm{incr}\,\eta _\mathrm{decr}<1\).

Remark

Uryasev’s learning-rate control is equivalent to the second Wolfe condition:

$$\begin{aligned} <g^k,g^{k-1}> \ge c \Vert g^k\Vert ^2 \end{aligned}$$

where c is a positive real constant.

In this paper, we use the Armijo condition [5]:

$$\begin{aligned} f(\mathbf {x}^{k}) -f(\mathbf {x}^{k+1}) \ge \beta <g^k, (\mathbf {x}^{k}-\mathbf {x}^{k+1})>. \end{aligned}$$
(8)

where \(0<\beta < 0.5\). The controlled learning rate can then be calculated by:

$$\begin{aligned} \eta _{k+1}=\left\{ \begin{array}{rl} \eta _\mathrm{decr}\eta _k,&{}\hbox {if Eq. (8) is satisfied},\\ \eta _\mathrm{incr}\eta _k,&{}\hbox {otherwise}. \end{array} \right. \end{aligned}$$
(9)
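A sketch of the learning-rate control (9) driven by the Armijo test (8) is given below; the values of beta, eta_decr and eta_incr are illustrative choices, not the tuned settings of the paper.

    import numpy as np

    def armijo_satisfied(f, g_k, x_k, x_next, beta=0.3):
        # Armijo condition (8): f(x^k) - f(x^{k+1}) >= beta * <g^k, x^k - x^{k+1}>.
        return f(x_k) - f(x_next) >= beta * np.dot(g_k, x_k - x_next)

    def update_learning_rate(eta_k, satisfied, eta_decr=0.7, eta_incr=1.3):
        # Control rule (9): decrease eta when (8) holds, increase it otherwise.
        # Note 0 < eta_decr < 1 < eta_incr and eta_incr * eta_decr < 1.
        return eta_decr * eta_k if satisfied else eta_incr * eta_k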

4.3 Fast gradient descent (FGD) algorithm

Gradient and Nesterov step [16]

  1. Select a point \(\mathbf {y}_0\in \mathbf {E}\). Put

     $$\begin{aligned} k=0,~~\mathbf {b}_0=1,~~\mathbf {x}^{-1}=\mathbf {y}_0. \end{aligned}$$
  2. kth iteration.

     (a) Compute \(\mathbf {g}_k=\nabla \mathbf {f}(\mathbf {y}_k)\).

     (b) Put

         $$\begin{aligned} \begin{array}{lll} \mathbf {x}^k=\mathbf {y}_k-\eta _k\mathbf {g}_k,\\ \mathbf {b}_{k+1}=0.5(1+\sqrt{4\mathbf {b}_k^2+1}~),\\ \mathbf {y}_{k+1}=\mathbf {x}^k+(\frac{\mathbf {b}_k-1}{\mathbf {b}_{k+1}})(\mathbf {x}^k-\mathbf {x}^{k-1}). \end{array} \end{aligned}$$
         (10)

The recalculation of the point \(\mathbf {y}_{k+1}\) in (10) is done using a “ravine” step, and \(\eta _k\) is the controlled learning rate (9).

Remark

We assume that

$$\begin{aligned} \eta _k \le \frac{1}{L} \end{aligned}$$
(11)
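Putting the pieces together, one possible reading of the FGD scheme, combining the gradient and ravine steps (10) with the Armijo-controlled learning rate (9), can be sketched as follows (the constants are illustrative).

    import numpy as np

    def fgd(f, grad_f, y0, eta0=0.1, iters=100, beta=0.3, eta_decr=0.7, eta_incr=1.3):
        # Fast gradient descent: gradient step from y_k, then a "ravine" (Nesterov) step.
        y = np.asarray(y0, dtype=float)
        x_prev = y.copy()                 # x^{-1} = y_0
        b = 1.0                           # b_0 = 1
        eta = eta0
        for _ in range(iters):
            g = grad_f(y)
            x = y - eta * g                                    # x^k = y_k - eta_k g_k
            b_next = 0.5 * (1.0 + np.sqrt(4.0 * b * b + 1.0))  # b_{k+1}
            y_next = x + ((b - 1.0) / b_next) * (x - x_prev)   # ravine step
            # Armijo test (8) between y_k and x^k, then control rule (9).
            if f(y) - f(x) >= beta * np.dot(g, y - x):
                eta *= eta_decr
            else:
                eta *= eta_incr
            x_prev, y, b = x, y_next, b_next
        return x_prev

    # Toy usage on f(x) = ||x||^2, minimized at the origin.
    x_min = fgd(lambda x: np.dot(x, x), lambda x: 2.0 * x, np.array([3.0, -2.0]))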

4.3.1 Convergence rate of FGD

Lemma 1

For any \(x,y\in \mathbb {R}^n\), we have

$$\begin{aligned} f(y)\le f(x) + <\nabla f(x),y-x> + \frac{L}{2}\Vert y-x\Vert ^2. \end{aligned}$$

Proof

For all \(x,y\in \mathbb {R}^n\), we have

$$\begin{aligned} \begin{array}{rl} f(y) &{}= f(x) + \int _{0}^{1}<\nabla f(x+t(y-x)),y-x>\mathrm{d}t\\ &{}= f(x) +<\nabla f(x),y-x> \\ &{}\quad + \int _{0}^{1}<\nabla f(x+t(y-x))-\nabla f(x),y-x>\mathrm{d}t.\\ \end{array} \end{aligned}$$

Therefore, if \(A=f(y)-f(x)-<\nabla f(x),y-x>\) then

$$\begin{aligned} \begin{array}{rl} A &{}\le |f(y)-f(x)-<\nabla f(x),y-x>|\\ &{}= | \int _{0}^{1}<\nabla f(x+t(y-x))-\nabla f(x),y-x>\mathrm{d}t|\\ &{}\le \int _{0}^{1}|<\nabla f(x+t(y-x))-\nabla f(x),y-x>|\mathrm{d}t\\ &{}\le \int _{0}^{1}\Vert \nabla f(x+t(y-x))-\nabla f(x)\Vert .\Vert y-x\Vert \mathrm{d}t\\ &{}\le \int _{0}^{1}tL\Vert y-x\Vert ^2\mathrm{d}t\\ &{}= \frac{L}{2}\Vert y-x\Vert ^2. \end{array} \end{aligned}$$

Then, Lemma 1 holds. \(\square \)

Lemma 2

The sequence \((b_{k})_{k\ge 1}\) generated by the scheme (10) satisfies the following:

$$\begin{aligned} \frac{k+1}{2}\le b_{k}~~\forall k\ge 1. \end{aligned}$$
(12)

Proof

We have

$$\begin{aligned} b_{k} = 0.5(1+\sqrt{4b_{k-1}^2+1})\ge 0.5(1+2b_{k-1}) \end{aligned}$$

then

$$\begin{aligned} b_{k} \ge \frac{1}{2} + b_{k-1} \end{aligned}$$

Therefore, by induction, we have \(b_{k}\ge \frac{k+1}{2}\). \(\square \)
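The bound of Lemma 2 can also be checked numerically by iterating the recursion for \(b_k\) from (10); a quick sketch:

    import math

    b = 1.0                                     # b_0 = 1
    for k in range(1, 21):
        b = 0.5 * (1.0 + math.sqrt(4.0 * b * b + 1.0))
        assert b >= (k + 1) / 2.0               # Lemma 2: b_k >= (k+1)/2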

Lemma 3

Let \(\{x^k,y_k\}\) be the sequence generated by the FGD, then for any \(x\in \mathfrak {R}^n\) we have

$$\begin{aligned} f(x)\ge f(x^k)+\frac{L}{2}\Vert x-x^k\Vert ^2- \frac{L}{2}\Vert x-y_k\Vert ^2. \end{aligned}$$
(13)

Proof

By the convexity of f we have

$$\begin{aligned} f(x)\ge f(y_k) + <\nabla f(y_k),x-y_k>. \end{aligned}$$
(14)

By Lemma 1

$$\begin{aligned} f(x^k)\le f(y_k) + <\nabla f(y_k),x^k-y_k> + \frac{L}{2}\Vert x^k-y_k\Vert ^2 \end{aligned}$$
(15)

Combining (14) and (15), we have

$$\begin{aligned} f(x) \ge f(x^k) + <\nabla f(y_k),x-x^k> -\frac{L}{2}\Vert y_k-x^k\Vert ^2.\quad \end{aligned}$$
(16)

Since \(x^k = y_k -\eta _k \nabla f(y_k)\), then

$$\begin{aligned} f(x) \ge f(x^k) + \frac{1}{\eta _k}<y_k-x^k,x-x^k> -\frac{L}{2}\Vert y_k-x^k\Vert ^2. \end{aligned}$$
(17)

Using (11), we obtain

$$\begin{aligned} f(x) \ge f(x^k) + L<y_k-x^k,x-x^k> -\frac{L}{2}\Vert y_k-x^k\Vert ^2.\quad \end{aligned}$$
(18)

Since

$$\begin{aligned} <y_k-x^k,x-x^k>&=\frac{1}{2}(\Vert y_k-x^k\Vert ^2\\&\quad + \Vert x-x^k\Vert ^2-\Vert x-y_k\Vert ^2) \end{aligned}$$

then from (18) the inequality (13) holds. \(\square \)

Theorem 1

If the sequence \(\{x_k\}_{k\ge 0}\) is constructed by FGD, then there exists a constant C such that

$$\begin{aligned} f(x_k)-f(x^*)\le C/k^2. \end{aligned}$$
(19)
Fig. 1 Samples from the MNIST dataset

Table 1 Comparison of the classification accuracy of FGD and four other classification methods

Proof

We denote

$$\begin{aligned} z_k=\frac{1}{b_{k+1}}x^*+(1-\frac{1}{b_{k+1}})x^k. \end{aligned}$$
(20)

We know that \(\frac{1}{b_{k+1}}\in (0,1]~~\forall k\ge 0\); by the convexity of f we have

$$\begin{aligned} f(z_k) \le \frac{1}{b_{k+1}}f(x^*)+(1-\frac{1}{b_{k+1}})f(x^k). \end{aligned}$$
(21)

Multiplying (21) by \(b_{k+1}^2\), we get

$$\begin{aligned} b_{k+1}^2f(z_k) \le b_{k+1}f(x^*)+(b_{k+1}^2-b_{k+1})f(x^k) \end{aligned}$$
(22)

Since \(b_{k+1}^2-b_{k+1}=b_k^2\), this yields

$$\begin{aligned} b_{k+1}^2f(z_k) \le b_{k+1}^2f(x^*)-b_{k}^2f(x^*)+b_{k}^2f(x^k). \end{aligned}$$
(23)

Multiplying (23) by \(-1\) and adding \(b_{k+1}^2f(x^{k+1})\) to both sides, we get

$$\begin{aligned} \begin{array}{rl} b_{k+1}^2(f(x^{k+1})-f(z_k)) \ge &{} b_{k+1}^2(f(x^{k+1})-f(x^*))\\ &{}-b_{k}^2(f(x^{k})-f(x^*)). \end{array} \end{aligned}$$
(24)

By Lemma 3, we have

$$\begin{aligned} f(x^{k+1})- f(z_k)\le \frac{L}{2}\Vert z_k-y_{k+1}\Vert ^2 -\frac{L}{2}\Vert z_k-x^{k+1}\Vert ^2.\quad \end{aligned}$$
(25)

Then,

$$\begin{aligned} \begin{array}{rl} &{}b_{k+1}^2\frac{L}{2}( \Vert z_k-y_{k+1}\Vert ^2 -\Vert z_k-x^{k+1}\Vert ^2)\ge b_{k+1}^2(f(x^{k+1})\\ &{}\quad -f(x^*))-b_{k}^2(f(x^{k})-f(x^*)). \end{array} \end{aligned}$$
(26)

Let \(p_k=x^{k-1}+b_k(x^k-x^{k-1})\), then

$$\begin{aligned} \begin{array}{rl} z_k -y_{k+1} =&{} \frac{1}{b_{k+1}}x^*+(1-\frac{1}{b_{k+1}})x^k - x^k\\ &{}-(\frac{b_k-1}{b_{k+1}})(x^k-x^{k-1})\\ =&{} \frac{1}{b_{k+1}}(x^*-x^{k-1}+b_k(x^{k-1}-x^k))\\ =&{} \frac{1}{b_{k+1}}(x^*-p_k) \end{array} \end{aligned}$$
(27)

and

Fig. 2 Comparison of the training epochs of FGD with SGD, NAG, Momentum and Adadelta

$$\begin{aligned} \begin{array}{rl} z_k -x^{k+1} &{}= \frac{1}{b_{k+1}}x^*+(1-\frac{1}{b_{k+1}})x^k - x^{k+1}\\ &{}= \frac{1}{b_{k+1}}(x^*-x^{k}+b_{k+1}(x^{k}-x^{k+1}))\\ &{}= \frac{1}{b_{k+1}}(x^*-p_{k+1}). \end{array} \end{aligned}$$
(28)

Using Eqs. (27) and (28) in (26), we have

$$\begin{aligned} \begin{array}{rl} &{}\frac{L}{2}\Vert x^*-p_{k}\Vert ^2-\frac{L}{2}\Vert x^*-p_{k+1}\Vert ^2\\ &{}\quad \ge b_{k+1}^2(f(x^{k+1})-f(x^*)) -b_{k}^2(f(x^{k})-f(x^*)). \end{array} \end{aligned}$$

Since \(b_0=0\) and \(p_0=x^{-1}=x^0\), we get

$$\begin{aligned} b_k^2(f(x^k)-f(x^*)) \le \frac{L}{2}\Vert x^*-x^0\Vert ^2. \end{aligned}$$

Let \(C=2L\Vert x^*-x^0\Vert ^2\); using Lemma 2, the theorem holds. \(\square \)

Fig. 3 Comparison of the CPU time of FGD with SGD, NAG, Momentum and Adadelta

Fig. 4 Predicted samples using FGD on MNIST

5 Computational experiment

All the experiments were carried out on an HP personal computer with an i3 CPU at 1.80 GHz and 4 GB of RAM, running 64-bit Windows 8.1, using Python 3.7.

In this section, the dataset used is MNIST [12], consisting of 60,000 training and 10,000 test \(28\times 28\) grayscale images of handwritten digits, 0 to 9 (10 classes), as shown in Fig. 1. The MNIST dataset is used to verify the performance of the FGD algorithm.

To train the network, we present digits from the 60,000-image MNIST training set to the network. The gradient descent optimizers SGD, NAG, Momentum and Adadelta, as well as our approach FGD, are used in these experiments. The comparison results are reported in Table 1 and Figs. 2 and 3; they show that FGD is faster and reaches the best classification accuracy.
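For illustration, a hedged sketch of such an experiment (not the authors’ exact code) trains a plain softmax classifier on MNIST with an FGD-style update; fetching MNIST through scikit-learn’s fetch_openml and using softmax regression instead of the full stacked network are our own simplifying assumptions.

    import numpy as np
    from sklearn.datasets import fetch_openml

    # Load MNIST (70,000 images of 28x28 = 784 pixels) and use the standard split.
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = X.astype(np.float64) / 255.0
    y = y.astype(int)
    X_train, y_train, X_test, y_test = X[:60000], y[:60000], X[60000:], y[60000:]

    c = 10
    Z_train = np.eye(c)[y_train]                      # one-hot targets z_{jq}

    def loss_and_grad(W, Xb, Zb, theta=1e-4):
        # Mini-batch softmax cross-entropy (3) with l2 penalty, and its gradient in W.
        scores = Xb @ W
        scores -= scores.max(axis=1, keepdims=True)   # numerical safeguard
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)
        m = Xb.shape[0]
        loss = -np.sum(Zb * np.log(P + 1e-12)) / m + theta * np.sum(W ** 2)
        grad = Xb.T @ (P - Zb) / m + 2.0 * theta * W
        return loss, grad

    # One epoch of FGD-style training: gradient step, Armijo control (9), ravine step (10).
    rng = np.random.default_rng(0)
    order = rng.permutation(60000)
    W = np.zeros((X_train.shape[1], c))
    W_prev = W.copy()
    b_k, eta = 1.0, 0.5
    for start in range(0, 60000, 128):
        idx = order[start:start + 128]
        Xb, Zb = X_train[idx], Z_train[idx]
        f_y, g = loss_and_grad(W, Xb, Zb)
        W_new = W - eta * g                           # x^k = y_k - eta_k g_k
        f_x, _ = loss_and_grad(W_new, Xb, Zb)
        eta *= 0.7 if f_y - f_x >= 0.3 * np.sum(g * (W - W_new)) else 1.3
        b_next = 0.5 * (1.0 + np.sqrt(4.0 * b_k * b_k + 1.0))
        W, W_prev = W_new + ((b_k - 1.0) / b_next) * (W_new - W_prev), W_new
        b_k = b_next

    accuracy = np.mean((X_test @ W_prev).argmax(axis=1) == y_test)
    print("test accuracy after one epoch: %.3f" % accuracy)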

The class-assigned neuron responses are then used on the 10,000-image MNIST test set to measure the network’s classification accuracy: the estimated digit is calculated by averaging each neuron’s responses per class and then choosing the class with the highest average response; see Figs. 4 and 5.

Fig. 5 Average confusion matrix of the testing results over ten presentations of the 10,000 MNIST test set digits using the FGD algorithm

6 Concluding remarks

In this paper, the gradient descent algorithm is accelerated by a Nesterov step and a learning rate controlled via the Armijo condition, yielding the FGD algorithm for minimizing the loss function in image classification. The convergence analysis shows that FGD attains an \(O(1/k^2)\) rate, and the numerical comparison of FGD with four gradient descent algorithms on the MNIST dataset shows that FGD is robust and faster than the other methods.