1 Introduction

Computer vision enables machines to view and comprehend digital images, something that humans do naturally with high accuracy. Image processing is an important field of computer vision, with major real-world applications such as autonomous vehicles [2], industry [15], medical diagnosis [19], and face recognition [21]. Image processing covers many tasks, such as regularization [7, 8], clustering [13], localization [6] and classification [9]. In this paper we are concerned with image classification, that is, the process of assigning a class label to an image.

Typically, the current state-of-the-art solutions to image classification use artificial neural networks [19], convolutional neural networks [17] and deep neural networks [11, 14], but these approaches have limitations such as the need for carefully designed structures and poor interpretability. The training algorithms used for image classification are gradient descent algorithms [4, 15, 19] and genetic algorithms [9].

However, gradient descent and stochastic algorithms tend to converge slowly and can become trapped in local minima. In fact, image classification can be divided into two phases, feature extraction and classification, which can be performed in sequence. This avoids adjusting the parameters of the entire network simultaneously and reduces the difficulty of parameter tuning.

Based on the above considerations, an improved gradient descent method, the stochastic gradient descent algorithm (SGD), has been proposed. SGD combines the advantages of gradient descent, back propagation [4] and a stochastic strategy, and it is used as a training algorithm for the classifier, as are related methods such as the Nesterov accelerated gradient (NAG) [3].

In this paper, we propose a fast iterative algorithm for image classification with neural networks, called the fast gradient descent algorithm (FGD), which combines the SGD algorithm with a learning rate controlled by the Armijo rule and accelerates the iterations with a Nesterov step.

The neural network is composed of stacked restricted Boltzmann machines [10] or auto-encoders [1], which are used to extract the image features; the extracted features are then classified by a softmax classifier.

FGD is proposed as the training algorithm of the softmax classifier and is used to find the optimal values of the softmax classifier parameters.

We discuss the convergence of the proposed FGD algorithm and present promising numerical results on the MNIST dataset, which show that the proposed FGD classification method achieves better accuracy and resistance to over-fitting than other image classification methods such as SGD and NAG.

These algorithms are implemented in Python and evaluated on the MNIST dataset [12].

The paper is organized as follows. Section 2 introduces notation and assumptions. Section 3 describes image classification with neural networks. Section 4 presents the gradient algorithms. The results are discussed in Sect. 5, and Sect. 6 concludes.

2 Notations and assumptions

First, we establish notation for future use:

  • the features \(x_i\) are the input variables,

  • the target \(y_i\) is the output variable that we are trying to predict,

  • a training example is a pair \((x_i,y_i)\),

  • the training set is the list of n training examples that we use for learning:

    $$\begin{aligned} \{(x_i,y_i): i=1,\dots ,n\}. \end{aligned}$$

Note that the subscript ‘i’ in the above notation is simply an index into the training set. We will also use the following notation:

  • \(E=\mathfrak {R}^n\) is a finite dimensional Euclidean space of input variables,

  • \(x^{t}\) the transpose of \(x=(x_{1},x_{2},\ldots ,x_{n})\in E\);

  • \(|| x||_2 =\sqrt{x^{t}x} =\sqrt{(x_{1}^{2}+\cdots +x_{n}^{2})} \) is the Euclidean norm of x;

  • \(<x,y>=x^ty\) is the inner product, with the corresponding norm \(\Vert x\Vert _2\).

To describe the classification problem slightly more formally, given a training set, our objective is to learn a function \(h:E\longrightarrow \mathfrak {R}\) called a hypothesis, so that h(x) is an optimal predictor for the corresponding value of y.

We consider the following basic problem of convex optimization

$$\begin{aligned} \min \limits _{\mathbf {x\in E}}\mathbf {f(x)}, \end{aligned}$$
(1)

where \(\mathbf {f}: E\longrightarrow \mathfrak {R}\) is a smooth convex function of type \(C^{1,1}\), i.e., continuously differentiable with a Lipschitz continuous gradient of constant L:

$$\begin{aligned} \Vert \nabla \mathbf {f(x)}-\nabla \mathbf {f(y)}\Vert \le L\Vert \mathbf {x}-\mathbf {y}\Vert ~~~~\forall \mathbf {x}, \mathbf {y}\in E, \end{aligned}$$

and \(\nabla \mathbf {f(x)}\) is the gradient of \(\mathbf {f(x)}\).

It is assumed that a solution \(\mathbf {x}^*\) of (1) exists.

Typically, computational algorithms for (1) rely on gradient oracles, which provide, at an arbitrary point \(\mathbf {x}\), the value of the objective function \(\mathbf {f(x)}\) and its gradient \(\mathbf {g}=\nabla \mathbf {f(x)}\).

3 Image classification with neural networks

3.1 Model

We define the following activation functions (a short code sketch is given after the list):

  • the sigmoid function

    $$\begin{aligned} \begin{array}{rll} \sigma : &{} \mathbb {R}\rightarrow &{} (0,1)\\ &{} x \mapsto &{} \frac{1}{1+e^{-x}} \end{array} \end{aligned}$$
  • the hyperbolic tangent function

    $$\begin{aligned} \begin{array}{rll} \tanh : &{} \mathbb {R}\rightarrow &{} (-1,1)\\ &{} x \mapsto &{} \frac{e^{2x}-1}{e^{2x}+1} \end{array} \end{aligned}$$
  • the rectified linear unit function

    $$\begin{aligned} \begin{array}{rll} ReLU : &{} \mathbb {R}\rightarrow &{} \mathbb {R}^+\\ &{} x \mapsto &{} \max \{0,x\} \end{array} \end{aligned}$$
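For concreteness, a minimal NumPy sketch of these three activation functions is given below; the function names and vectorized form are our own choices, not part of the original formulation.

    import numpy as np

    def sigmoid(x):
        # sigma(x) = 1 / (1 + exp(-x)), mapping R to (0, 1).
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        # tanh(x) = (exp(2x) - 1) / (exp(2x) + 1), mapping R to (-1, 1).
        return np.tanh(x)

    def relu(x):
        # ReLU(x) = max(0, x), mapping R to R^+.
        return np.maximum(0.0, x)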

Neural networks are machine learning models [18] which, as functions of their input, are alternating compositions of linear (affine) transformations, i.e.,

$$\begin{aligned} \begin{array}{ll} \mathbb {R}^n \rightarrow &{} \mathbb {R}^m\\ x \mapsto &{} Wx + b \end{array} \end{aligned}$$

where \(b\in \mathbb {R}^m\) and W is an \(m\times n\) matrix, with activation functions applied to each coordinate. The biases b and weights W are the parameters of the model to be estimated.

We collect the biases b and weights W of all trainable layers into a single parameter vector \(\mathbf {w}\).

We denote by \(C=\{0, 1, \dots , \mathbf {c}-1\}\) the set of classes, where \(\mathbf {c}\) is the number of classes. On input x, the neural network model must output a probability vector \(y\in \{v\in \mathbb {R}^\mathbf {c}: \sum _iv_i=1\}\). For \(k\in C\), the softmax function

$$\begin{aligned} \sigma (y)_k = \frac{e^{y_k}}{\sum _i e^{y_i}} \end{aligned}$$

is applied to the network outputs.
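As an illustration, a numerically stable softmax can be sketched as follows; subtracting the maximum before exponentiating is a standard safeguard we add here and does not change the result.

    import numpy as np

    def softmax(y):
        # sigma(y)_k = exp(y_k) / sum_i exp(y_i); shifting by max(y)
        # prevents overflow and leaves the result unchanged.
        e = np.exp(y - np.max(y))
        return e / np.sum(e)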

The objective function in our minimization problem is typically the cross-entropy, a smooth approximation of the classification error:

$$\begin{aligned} f(\mathbf {w})= -\frac{1}{N}\sum \limits _{j=1}^N\sum \limits _{q=0}^{\mathbf {c}-1}z_{jq}\log (\hat{z_q}(x_j;\mathbf {w})) \end{aligned}$$
(2)

where \( z_{jq}=\left\{ \begin{array}{ll} 1 &{} \hbox {if } x_j \hbox { is in class } q\\ 0 &{} \hbox {otherwise} \end{array} \right. \) and \(\hat{z_q}(x_j;\mathbf {w})\) is the predicted probability that \(x_j\) belongs to class q.

We define the training loss function \(L(\mathbf {w})\) as:

$$\begin{aligned} L(\mathbf {w}) = f(\mathbf {w}) + \theta R(\mathbf {w}) \end{aligned}$$

where \(\theta \) is the regularization parameter and \(R(\mathbf {w}) =\Vert \mathbf {w}\Vert ^2\) is the \(\ell _2\) regularization term.
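A possible sketch of the loss (2) together with the regularized training loss \(L(\mathbf {w})\), assuming the labels are supplied as one-hot rows of Z and the predicted probabilities as rows of Z_hat (the names are ours), is:

    import numpy as np

    def cross_entropy(Z, Z_hat, eps=1e-12):
        # f(w) = -(1/N) * sum_j sum_q z_{jq} * log(zhat_q(x_j; w)); eps avoids log(0).
        N = Z.shape[0]
        return -np.sum(Z * np.log(Z_hat + eps)) / N

    def training_loss(Z, Z_hat, w, theta):
        # L(w) = f(w) + theta * R(w) with R(w) = ||w||^2.
        return cross_entropy(Z, Z_hat) + theta * np.sum(w ** 2)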

The most popular algorithms for estimating the parameters \(\mathbf {w}\) of neural networks are gradient descent algorithms combined with back propagation.

At each step t, the technique used in SGD algorithms is to replace the classification error f(\(\mathbf {w}\)) of Eq. (2) by

$$\begin{aligned} f(\mathbf {w,t})= -\frac{1}{|M_t|}\sum \limits _{j: x_j\in M_t}\sum \limits _{q=0}^{\mathbf {c}-1}z_{jq}\log (\hat{z_q}(x_j;\mathbf {w})) \end{aligned}$$
(3)

where \(M_t\) is a mini-batch. Equation (3) satisfies

$$\begin{aligned} \mathbb {E}[f(w,t)] = f(w) \end{aligned}$$

and

$$\begin{aligned} \mathbb {E}[\nabla _wf(w,t)] = \nabla _wf(w) \end{aligned}$$

In practice, the training sample is shuffled arbitrarily once, separated into batches of a fixed size, and these batches are used one at a time until all of them have been consumed; the process then repeats from the first batch in the same order. Each epoch thus consists of sequential batches covering all the training data, starting with the first batch and ending with the last one.
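A minimal sketch of this epoch structure, assuming the data are NumPy arrays X, Y and that a mini-batch update step(w, X_batch, Y_batch) is supplied elsewhere, could be:

    import numpy as np

    def run_epochs(w, X, Y, step, batch_size=64, epochs=10, seed=0):
        # Shuffle once, split into fixed-size mini-batches, and reuse the
        # same batch order in every epoch, as described above.
        rng = np.random.default_rng(seed)
        order = rng.permutation(X.shape[0])
        for _ in range(epochs):
            for start in range(0, X.shape[0], batch_size):
                idx = order[start:start + batch_size]   # mini-batch M_t
                w = step(w, X[idx], Y[idx])             # one SGD-type update
        return w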

3.2 Back propagation

The weight parameters are updated and the errors are propagated back through the layers of the neural network by combining the GD algorithm with the back propagation algorithm.

Let \(\ell _{W,b}(x)\) denote the layer activation:

$$\begin{aligned} \ell _{W,b}(x) = \sigma (Wx+b) \end{aligned}$$

where x is the input of a training sample. The loss function is:

$$\begin{aligned} g(W,b,x,y)= \frac{1}{2}\Vert y-\ell _{W,b}(x)\Vert ^2 \end{aligned}$$

Since the sigmoid function satisfies

$$\begin{aligned} \alpha = \sigma (z),~~~ \frac{\partial \alpha }{\partial z} = \alpha (1-\alpha ), \end{aligned}$$

where \(z=Wx+b\), the chain rule gives:

$$\begin{aligned} \frac{\partial }{\partial z}g(W,b,x,y)= \frac{\partial g}{\partial \alpha } \frac{\partial \alpha }{\partial z}=-(y-\alpha )\alpha (1-\alpha ) \end{aligned}$$

Then

$$\begin{aligned} \frac{\partial }{\partial W}g(W,b,x,y)= & {} \frac{\partial g}{\partial z} \frac{\partial z}{\partial W}=-(y-\alpha )\alpha (1-\alpha )x^T \\ \frac{\partial }{\partial b}g(W,b,x,y)= & {} \frac{\partial g}{\partial z} \frac{\partial z}{\partial b}=-(y-\alpha )\alpha (1-\alpha ). \end{aligned}$$
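These derivatives translate directly into code; the following sketch computes the gradients of g for a single sigmoid layer and is only an illustration of the formulas above, not the full network update.

    import numpy as np

    def layer_gradients(W, b, x, y):
        # Forward pass: alpha = sigma(W x + b); loss g = 0.5 * ||y - alpha||^2.
        z = W @ x + b
        alpha = 1.0 / (1.0 + np.exp(-z))
        delta = -(y - alpha) * alpha * (1.0 - alpha)   # dg/dz
        grad_W = np.outer(delta, x)                    # dg/dW = dg/dz * x^T
        grad_b = delta                                 # dg/db = dg/dz
        return grad_W, grad_b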

4 Gradient methods

4.1 Gradient descent (GD)

The neural network algorithm uses an optimization method to search for an optimum. The optimization method considered here is the GD algorithm, represented by Eq. (4), a well-referenced first-order optimization algorithm that helps us find a local minimum. GD is used to minimize a function, in this case the cost function: the aim is to find a value of w which yields the lowest error and lets the cost function reach a local minimum. Each iteration of this method seeks a new value of w that yields a slightly lower error than the previous one, and a learning rate controls how large a step we take downhill at each iteration. The GD algorithm starts with some initial w and repeatedly performs the update:

$$\begin{aligned} w_i = w_i - \eta \frac{\partial }{\partial w_i}f(w),~~~~i=0,\dots ,n. \end{aligned}$$
(4)

Here, \(\eta \) is a constant learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of f. Since differentiation and summation are interchangeable, we can calculate the gradient as a sum of per-example components, which avoids unnecessarily complicated analytical formulas for the neural network. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient at the current point; if instead one takes steps proportional to the positive of the gradient, one approaches a local maximum, and the procedure is then known as gradient ascent. In our case, we are looking for the minimum value of the cost function, represented by Eq. (5).

$$\begin{aligned} y(w)=f(w) \end{aligned}$$
(5)

Note that the value of the step size \(\eta \), known as the learning rate, is allowed to change at every iteration of the optimization process. With certain assumptions on the function y and particular choices of \(\eta \), convergence to a local minimum can be guaranteed. When the function y is convex, all local minima are also global minima, so in this case gradient descent can converge to the global solution.
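A generic sketch of the update (4), assuming a gradient oracle grad_f is available, is given below, together with a toy run on a convex quadratic.

    import numpy as np

    def gradient_descent(w0, grad_f, eta=0.1, iters=200):
        # Update (4): repeatedly step against the gradient, w <- w - eta * grad f(w).
        w = np.asarray(w0, dtype=float)
        for _ in range(iters):
            w = w - eta * grad_f(w)
        return w

    # Toy run on the convex quadratic f(w) = ||w||^2, whose minimizer is w = 0.
    w_min = gradient_descent(np.array([3.0, -2.0]), lambda w: 2.0 * w)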

4.2 Control gradient algorithms

The simplest algorithms for (1) are gradient methods of the kind

$$\begin{aligned} \mathbf {x}^{k+1}=\mathbf {x}^k-\eta _k\mathbf {g}_k,~~ \mathbf {g}^k=\nabla \mathbf {f}(\mathbf {x}^k),~~~~k=0,1,\ldots \end{aligned}$$
(6)

which have been under intensive study since the 1960s. It was shown that (6) converges under very mild conditions for a learning rate \(\eta _k\) satisfying the “divergent series” condition \(\sum _k\eta _k=\infty ,~~~~\eta _k\rightarrow +0\).

Let us suppose a learning rate \(\eta _{k-1}\) was used at iteration \(k-1\) and let us identify at \(x^k\) the relation between \(\eta _{k-1}\) and the learning rate providing minimization of f in the direction \(-g^{k-1}\). To this end, Uryasev used in [20] the scalar product

$$\begin{aligned} u_k=<g^k,g^{k-1}> \end{aligned}$$

It was heuristically suggested in [20] to correct the learning rate according to the formula

$$\begin{aligned} \eta _{k+1}=\left\{ \begin{array}{rll} \eta _\mathrm{decr}\eta _k,&{}\hbox {if}&{}\mathbf {u}_{k+1} \le 0,\\ \eta _\mathrm{incr}\eta _k,&{}\hbox {otherwise},&{} \end{array} \right. \end{aligned}$$
(7)

where:

  • \(\eta _\mathrm{decr}\) is a decreasing coefficient,

  • \(\eta _\mathrm{incr}\) is an increasing coefficient,

  • \(0<\eta _\mathrm{decr}<1<\eta _\mathrm{incr}\) and \(\eta _\mathrm{incr}\,\eta _\mathrm{decr}<1\).

Remark

Uryasev’s learning-rate control is equivalent to the second Wolfe condition:

$$\begin{aligned} <g^k,g^{k-1}> \ge c \Vert g^k\Vert ^2 \end{aligned}$$

where c is a positive real constant.

In this paper, we use the Armijo condition [5]:

$$\begin{aligned} f(\mathbf {x}^{k}) -f(\mathbf {x}^{k+1}) \ge \beta <g^k, (\mathbf {x}^{k}-\mathbf {x}^{k+1})>. \end{aligned}$$
(8)

where \(0<\beta < 0.5\). The controlled learning rate can then be calculated by:

$$\begin{aligned} \eta _{k+1}=\left\{ \begin{array}{rl} \eta _\mathrm{decr}\eta _k,&{}\hbox {if Eq. (8) is satisfied},\\ \eta _\mathrm{incr}\eta _k,&{}\hbox {otherwise}. \end{array} \right. \end{aligned}$$
(9)
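A sketch of the learning-rate control (9) driven by the Armijo test (8) is given below; the values of beta, eta_decr and eta_incr are illustrative choices, not the tuned settings of the paper.

    import numpy as np

    def armijo_satisfied(f, g_k, x_k, x_next, beta=0.3):
        # Armijo condition (8): f(x^k) - f(x^{k+1}) >= beta * <g^k, x^k - x^{k+1}>.
        return f(x_k) - f(x_next) >= beta * np.dot(g_k, x_k - x_next)

    def update_learning_rate(eta_k, satisfied, eta_decr=0.7, eta_incr=1.3):
        # Control rule (9): decrease eta when (8) holds, increase it otherwise.
        # Note 0 < eta_decr < 1 < eta_incr and eta_incr * eta_decr < 1.
        return eta_decr * eta_k if satisfied else eta_incr * eta_k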

4.3 Fast gradient descent (FGD) algorithm

Gradient and Nesterov step [16]

  1. Select a point \(\mathbf {y}_0\in \mathbf {E}\). Put

     $$\begin{aligned} k=0,~~\mathbf {b}_0=1,~~\mathbf {x}^{-1}=\mathbf {y}_0. \end{aligned}$$
  2. kth iteration.

     (a) Compute \(\mathbf {g}_k=\nabla \mathbf {f}(\mathbf {y}_k)\).

     (b) Put

         $$\begin{aligned} \begin{array}{lll} \mathbf {x}^k=\mathbf {y}_k-\eta _k\mathbf {g}_k,\\ \mathbf {b}_{k+1}=0.5(1+\sqrt{4\mathbf {b}_k^2+1}~),\\ \mathbf {y}_{k+1}=\mathbf {x}^k+(\frac{\mathbf {b}_k-1}{\mathbf {b}_{k+1}})(\mathbf {x}^k-\mathbf {x}^{k-1}). \end{array} \end{aligned}$$
         (10)

The recalculation of the point \(\mathbf {y}_{k+1}\) in (10) is done using a “ravine” step, and \(\eta _k\) is the controlled learning rate (9).

Remark

We assume that

$$\begin{aligned} \eta _k \le \frac{1}{L} \end{aligned}$$
(11)
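Putting the pieces together, one possible reading of the FGD scheme, combining the gradient and ravine steps (10) with the Armijo-controlled learning rate (9), can be sketched as follows (the constants are illustrative).

    import numpy as np

    def fgd(f, grad_f, y0, eta0=0.1, iters=100, beta=0.3, eta_decr=0.7, eta_incr=1.3):
        # Fast gradient descent: gradient step from y_k, then a "ravine" (Nesterov) step.
        y = np.asarray(y0, dtype=float)
        x_prev = y.copy()                 # x^{-1} = y_0
        b = 1.0                           # b_0 = 1
        eta = eta0
        for _ in range(iters):
            g = grad_f(y)
            x = y - eta * g                                    # x^k = y_k - eta_k g_k
            b_next = 0.5 * (1.0 + np.sqrt(4.0 * b * b + 1.0))  # b_{k+1}
            y_next = x + ((b - 1.0) / b_next) * (x - x_prev)   # ravine step
            # Armijo test (8) between y_k and x^k, then control rule (9).
            if f(y) - f(x) >= beta * np.dot(g, y - x):
                eta *= eta_decr
            else:
                eta *= eta_incr
            x_prev, y, b = x, y_next, b_next
        return x_prev

    # Toy usage on f(x) = ||x||^2, minimized at the origin.
    x_min = fgd(lambda x: np.dot(x, x), lambda x: 2.0 * x, np.array([3.0, -2.0]))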

4.3.1 Convergence rate of FGD

Lemma 1

For any \(x,y\in \mathbb {R}^n\), we have

$$\begin{aligned} f(y)\le f(x) + <\nabla f(x),y-x> + \frac{L}{2}\Vert y-x\Vert ^2. \end{aligned}$$

Proof

For all \(x,y\in \mathbb {R}^n\), we have

$$\begin{aligned} \begin{array}{rl} f(y) &{}= f(x) + \int _{0}^{1}<\nabla f(x+t(y-x)),y-x>\mathrm{d}t\\ &{}= f(x) +<\nabla f(x),y-x> \\ &{}\quad + \int _{0}^{1}<\nabla f(x+t(y-x))-\nabla f(x),y-x>\mathrm{d}t.\\ \end{array} \end{aligned}$$

Therefore, if \(A=f(y)-f(x)-<\nabla f(x),y-x>\) then

$$\begin{aligned} \begin{array}{rl} A &{}\le |f(y)-f(x)-<\nabla f(x),y-x>|\\ &{}= | \int _{0}^{1}<\nabla f(x+t(y-x))-\nabla f(x),y-x>\mathrm{d}t|\\ &{}\le \int _{0}^{1}|<\nabla f(x+t(y-x))-\nabla f(x),y-x>|\mathrm{d}t\\ &{}\le \int _{0}^{1}\Vert \nabla f(x+t(y-x))-\nabla f(x)\Vert .\Vert y-x\Vert \mathrm{d}t\\ &{}\le \int _{0}^{1}tL\Vert y-x\Vert ^2\mathrm{d}t\\ &{}= \frac{L}{2}\Vert y-x\Vert ^2. \end{array} \end{aligned}$$

Then, Lemma 1 holds. \(\square \)

Lemma 2

The sequence \((b_{k})_{k\ge 1}\) generated by the scheme (10) satisfies the following:

$$\begin{aligned} \frac{k+1}{2}\le b_{k}~~\forall k\ge 1. \end{aligned}$$
(12)

Proof

We have

$$\begin{aligned} b_{k} = 0.5(1+\sqrt{4b_{k-1}^2+1})\ge 0.5(1+2b_{k-1}) \end{aligned}$$

then

$$\begin{aligned} b_{k} \ge \frac{1}{2} + b_{k-1} \end{aligned}$$

Therefore, by induction, we have \(b_{k}\ge \frac{k+1}{2}\). \(\square \)
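The bound of Lemma 2 can also be checked numerically by iterating the recursion for \(b_k\) from (10); a quick sketch:

    import math

    b = 1.0                                     # b_0 = 1
    for k in range(1, 21):
        b = 0.5 * (1.0 + math.sqrt(4.0 * b * b + 1.0))
        assert b >= (k + 1) / 2.0               # Lemma 2: b_k >= (k+1)/2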

Lemma 3

Let \(\{x^k,y_k\}\) be the sequence generated by the FGD, then for any \(x\in \mathfrak {R}^n\) we have

$$\begin{aligned} f(x)\ge f(x^k)+\frac{L}{2}\Vert x-x^k\Vert ^2- \frac{L}{2}\Vert x-y_k\Vert ^2. \end{aligned}$$
(13)

Proof

By the convexity of f we have

$$\begin{aligned} f(x)\ge f(y_k) + <\nabla f(y_k),x-y_k>. \end{aligned}$$
(14)

By Lemma 1

$$\begin{aligned} f(x^k)\le f(y_k) + <\nabla f(y_k),x^k-y_k> + \frac{L}{2}\Vert x^k-y_k\Vert ^2 \end{aligned}$$
(15)

Combining (14) and (15), we have

$$\begin{aligned} f(x) \ge f(x^k) + <\nabla f(y_k),x-x^k> -\frac{L}{2}\Vert y_k-x^k\Vert ^2.\quad \end{aligned}$$
(16)

Since \(x^k = y_k -\eta _k \nabla f(y_k)\), then

$$\begin{aligned} f(x) \ge f(x^k) + \frac{1}{\eta _k}<y_k-x^k,x-x^k> -\frac{L}{2}\Vert y_k-x^k\Vert ^2. \end{aligned}$$
(17)

Using (11), we obtain

$$\begin{aligned} f(x) \ge f(x^k) + L<y_k-x^k,x-x^k> -\frac{L}{2}\Vert y_k-x^k\Vert ^2.\quad \end{aligned}$$
(18)

Since

$$\begin{aligned} <y_k-x^k,x-x^k>&=\frac{1}{2}(\Vert y_k-x^k\Vert ^2\\&\quad + \Vert x-x^k\Vert ^2-\Vert x-y_k\Vert ^2) \end{aligned}$$

then from (18) the inequality (13) holds. \(\square \)

Theorem 1

If the sequence \(\{x_k\}_{k\ge 0}\) is constructed by FGD, then there exists a constant C such that

$$\begin{aligned} f(x_k)-f(x^*)\le C/k^2. \end{aligned}$$
(19)
Fig. 1 Samples from the MNIST dataset

Table 1 Comparison of the classification accuracy of FGD and four other classification methods

Proof

We denote

$$\begin{aligned} z_k=\frac{1}{b_{k+1}}x^*+(1-\frac{1}{b_{k+1}})x^k. \end{aligned}$$
(20)

We know that \(\frac{1}{b_{k+1}}\in (0,1]~~\forall k\ge 0\); by the convexity of f we have

$$\begin{aligned} f(z_k) \le \frac{1}{b_{k+1}}f(x^*)+(1-\frac{1}{b_{k+1}})f(x^k). \end{aligned}$$
(21)

Multiplying (21) by \(b_{k+1}^2\), we get

$$\begin{aligned} b_{k+1}^2f(z_k) \le b_{k+1}f(x^*)+(b_{k+1}^2-b_{k+1})f(x^k) \end{aligned}$$
(22)

Since \(b_{k+1}^2-b_{k+1}=b_k^2\), this yields

$$\begin{aligned} b_{k+1}^2f(z_k) \le b_{k+1}^2f(x^*)-b_{k}^2f(x^*)+b_{k}^2f(x^k). \end{aligned}$$
(23)

Multiplying (23) by \(-1\) and adding \(b_{k+1}^2f(x^{k+1})\) to both sides, we get

$$\begin{aligned} \begin{array}{rl} b_{k+1}^2(f(x^{k+1})-f(z_k)) \ge &{} b_{k+1}^2(f(x^{k+1})-f(x^*))\\ &{}-b_{k}^2(f(x^{k})-f(x^*)). \end{array} \end{aligned}$$
(24)

By Lemma 3, we have

$$\begin{aligned} f(x^{k+1})- f(z_k)\le \frac{L}{2}\Vert z_k-y_{k+1}\Vert ^2 -\frac{L}{2}\Vert z_k-x^{k+1}\Vert ^2.\quad \end{aligned}$$
(25)

Then,

$$\begin{aligned} \begin{array}{rl} &{}b_{k+1}^2\frac{L}{2}( \Vert z_k-y_{k+1}\Vert ^2 -\Vert z_k-x^{k+1}\Vert ^2)\ge b_{k+1}^2(f(x^{k+1})\\ &{}\quad -f(x^*))-b_{k}^2(f(x^{k})-f(x^*)). \end{array} \end{aligned}$$
(26)

Let \(p_k=x^{k-1}+b_k(x^k-x^{k-1})\), then

$$\begin{aligned} \begin{array}{rl} z_k -y_{k+1} =&{} \frac{1}{b_{k+1}}x^*+(1-\frac{1}{b_{k+1}})x^k - x^k\\ &{}-(\frac{b_k-1}{b_{k+1}})(x^k-x^{k-1})\\ =&{} \frac{1}{b_{k+1}}(x^*-x^{k-1}+b_k(x^{k-1}-x^k))\\ =&{} \frac{1}{b_{k+1}}(x^*-p_k) \end{array} \end{aligned}$$
(27)

and

Fig. 2 Comparison of the training epochs of FGD with SGD, NAG, Momentum and Adadelta

$$\begin{aligned} \begin{array}{rl} z_k -x^{k+1} &{}= \frac{1}{b_{k+1}}x^*+(1-\frac{1}{b_{k+1}})x^k - x^{k+1}\\ &{}= \frac{1}{b_{k+1}}(x^*-x^{k}+b_{k+1}(x^{k}-x^{k+1}))\\ &{}= \frac{1}{b_{k+1}}(x^*-p_{k+1}). \end{array} \end{aligned}$$
(28)

Using Eqs. (27) and (28) in (26), we have

$$\begin{aligned} \begin{array}{rl} &{}\frac{L}{2}\Vert x^*-p_{k}\Vert ^2-\frac{L}{2}\Vert x^*-p_{k+1}\Vert ^2\\ &{}\quad \ge b_{k+1}^2(f(x^{k+1})-f(x^*)) -b_{k}^2(f(x^{k})-f(x^*)). \end{array} \end{aligned}$$

Since \(b_0=0\) and \(p_0=x^{-1}=x^0\), we get

$$\begin{aligned} b_k^2(f(x^k)-f(x^*)) \le \frac{L}{2}\Vert x^*-x^0\Vert ^2. \end{aligned}$$

Let \(C=2L\Vert x^*-x^0\Vert ^2\); using Lemma 2, the theorem holds. \(\square \)

Fig. 3 Comparison of the CPU time of FGD with SGD, NAG, Momentum and Adadelta

Fig. 4 Predicted samples using FGD on MNIST

5 Computational experiment

All the experiments were carried out on an HP personal computer with an i3 CPU at 1.80 GHz and 4 GB of RAM, running 64-bit Windows 8.1, using Python 3.7.

In this section, the dataset used is MNIST [12], consisting of 60,000 training and 10,000 test \(28\times 28\) grayscale images of handwritten digits, 0 to 9 (10 classes), as shown in Fig. 1. The MNIST dataset is used to verify the performance of the FGD algorithm.

To train the network, we present digits from the 60,000-image MNIST training set to the network. The gradient descent optimizers SGD, NAG, Momentum and Adadelta, as well as our approach FGD, are used in these experiments. The comparison results are reported in Table 1 and Figs. 2 and 3; they show that FGD is faster and reaches the best classification accuracy.
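For illustration, a hedged sketch of such an experiment (not the authors’ exact code) trains a plain softmax classifier on MNIST with an FGD-style update; fetching MNIST through scikit-learn’s fetch_openml and using softmax regression instead of the full stacked network are our own simplifying assumptions.

    import numpy as np
    from sklearn.datasets import fetch_openml

    # Load MNIST (70,000 images of 28x28 = 784 pixels) and use the standard split.
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = X.astype(np.float64) / 255.0
    y = y.astype(int)
    X_train, y_train, X_test, y_test = X[:60000], y[:60000], X[60000:], y[60000:]

    c = 10
    Z_train = np.eye(c)[y_train]                      # one-hot targets z_{jq}

    def loss_and_grad(W, Xb, Zb, theta=1e-4):
        # Mini-batch softmax cross-entropy (3) with l2 penalty, and its gradient in W.
        scores = Xb @ W
        scores -= scores.max(axis=1, keepdims=True)   # numerical safeguard
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)
        m = Xb.shape[0]
        loss = -np.sum(Zb * np.log(P + 1e-12)) / m + theta * np.sum(W ** 2)
        grad = Xb.T @ (P - Zb) / m + 2.0 * theta * W
        return loss, grad

    # One epoch of FGD-style training: gradient step, Armijo control (9), ravine step (10).
    rng = np.random.default_rng(0)
    order = rng.permutation(60000)
    W = np.zeros((X_train.shape[1], c))
    W_prev = W.copy()
    b_k, eta = 1.0, 0.5
    for start in range(0, 60000, 128):
        idx = order[start:start + 128]
        Xb, Zb = X_train[idx], Z_train[idx]
        f_y, g = loss_and_grad(W, Xb, Zb)
        W_new = W - eta * g                           # x^k = y_k - eta_k g_k
        f_x, _ = loss_and_grad(W_new, Xb, Zb)
        eta *= 0.7 if f_y - f_x >= 0.3 * np.sum(g * (W - W_new)) else 1.3
        b_next = 0.5 * (1.0 + np.sqrt(4.0 * b_k * b_k + 1.0))
        W, W_prev = W_new + ((b_k - 1.0) / b_next) * (W_new - W_prev), W_new
        b_k = b_next

    accuracy = np.mean((X_test @ W_prev).argmax(axis=1) == y_test)
    print("test accuracy after one epoch: %.3f" % accuracy)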

The class-assigned neuron responses are then used on the 10,000-image MNIST test set to measure the network’s classification accuracy: the estimated digit is calculated by averaging each neuron’s responses per class and then choosing the class with the highest average response; see Figs. 4 and 5.

Fig. 5 Average confusion matrix of the testing results over ten presentations of the 10,000 MNIST test set digits using the FGD algorithm

6 Concluding remarks

In this paper, the gradient descent algorithm is accelerated by a Nesterov step and a learning rate controlled via the Armijo condition, yielding the FGD algorithm for minimizing the loss function in image classification. The convergence analysis shows that FGD attains an \(O(1/k^2)\) rate, and the numerical comparison of FGD with four gradient descent algorithms on the MNIST dataset shows that FGD is robust and faster than the other methods.