
1 Introduction

An artificial neural network is a powerful tool in many domains [1,2,3,4,5]. Over the last decade machine learning techniques have played the leading role in the field of artificial intelligence [1]. This is confirmed by recent qualitative achievements in image, video and speech recognition, natural language processing, big data processing and visualization, etc. [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]. These achievements are primarily associated with a new paradigm in machine learning, namely deep neural networks and deep learning [2, 6,7,8,9,10,11,12,13,14,15,16,17,18]. However, in many real-world applications an important constraint is limited computational resources, which does not permit the use of deep neural networks. Therefore the further development of shallow architectures remains an important task. It should be noted especially that for many real applications a shallow architecture can show accuracy comparable to that of deep neural networks.

This paper deals with a convolutional neural network for handwritten digit classification. We propose a simplified architecture of convolutional neural networks, which permits classifying handwritten digits more accurately than the conventional convolutional neural network LeNet-5. We show that better classification results can be obtained with a very simple convolutional neural network.

The rest of the paper is organized as follows. Section 2 introduces the standard convolutional neural network. In Sect. 3 we propose a simplified convolutional network. Section 4 demonstrates the results of experiments, and finally Sect. 5 gives the conclusion.

2 Related Works

A convolutional neural network is a further development of the multilayer perceptron and the neocognitron and is widely used for image processing [19,20,21,22]. This kind of neural network is invariant to shifts and distortions of the input. A convolutional neural network integrates three approaches, namely local receptive fields, shared weights and spatial subsampling [20,21,22]. Using local receptive fields, the neural units of the first convolutional layer can extract primitive features such as edges, corners, etc. The general structure of a convolutional neural network is shown in Fig. 1.

Fig. 1. General structure of a convolutional neural network

A convolutional layer consists of a set of feature maps, and the neural units of each map share the same set of weights and thresholds. As a result, each neuron in a feature map performs the same operation on different parts of the image. The sliding-window technique is used for image scanning. Therefore, if the size of the window (the receptive field) is p × p, then each unit in a convolutional layer is connected to p² units of the corresponding receptive field. Each receptive field in the input space is mapped into a specific neuron in each feature map. Then, if the stride of the sliding window is one, the number of neurons in each feature map is given by

$$ D(C_{1} ) = (n - p + 1)(n - p + 1) $$
(1)

where n × n is the size of the image. If the stride of the sliding window is s, the number of neurons in each feature map is defined in the following way:

$$ D(C_{1} ) = (\frac{n - p}{s} + 1)(\frac{n - p}{s} + 1) $$
(2)
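As a quick illustration, the sketch below (plain Python, our own illustration rather than part of the original work) evaluates Eq. (2) for the image and window sizes used later in the paper.

```python
# A minimal sketch: evaluating Eq. (2) for the layer sizes used in this paper.
def feature_map_size(n, p, s=1):
    """Neurons per feature map for an n x n input, a p x p receptive field
    and stride s, following Eq. (2) (Eq. (1) is the special case s = 1)."""
    d = (n - p) // s + 1
    return d * d

print(feature_map_size(32, 5))   # 784 = 28 * 28, layer C1 of LeNet-5 (Sect. 2)
print(feature_map_size(28, 5))   # 576 = 24 * 24, layer C1 of the simplified network (Sect. 3)
```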

Accordingly, the total number of synaptic weights in the convolutional layer is defined by

$$ V(C_{1} ) = M(p^{2} + 1) $$
(3)

where \( M \) is the number of feature maps in the convolutional layer. Let us represent the pixels of the input image in one-dimensional space. Then the output of the ij-th unit of the k-th feature map in the convolutional layer is given by

$$ y_{ij}^{k} = F(S_{ij}^{k} ) $$
(4)
$$ S_{ij}^{k} = \sum\limits_{c} {w_{cij}^{k} } x_{c} - T_{ij}^{k} $$
(5)

where \( c = 1, \ldots ,p^{2} \), \( F \) is the activation function, \( S_{ij}^{k} \) is the weighted sum of the ij-th unit in the k-th feature map, \( w_{cij}^{k} \) is the weight from the c-th unit of the input layer to the ij-th unit of the k-th feature map, and \( T_{ij}^{k} \) is the threshold of the ij-th unit of the k-th feature map.
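To make Eqs. (4)–(5) concrete, the following NumPy sketch (our illustration, not the authors' code) computes the feature maps of one convolutional layer; thresholds are treated as shared per map, consistent with the parameter count in Eq. (3).

```python
import numpy as np

# A NumPy sketch of Eqs. (4)-(5): M feature maps whose units share one
# p x p weight set and one threshold per map.
def conv_forward(image, weights, thresholds, F=lambda s: 1.0 / (1.0 + np.exp(-s))):
    """image: n x n array, weights: M x p x p, thresholds: length-M vector.
    Returns M feature maps of size (n - p + 1) x (n - p + 1)."""
    n = image.shape[0]
    M, p, _ = weights.shape
    d = n - p + 1
    maps = np.empty((M, d, d))
    for k in range(M):
        for i in range(d):
            for j in range(d):
                x = image[i:i + p, j:j + p]                     # receptive field x_c
                s = np.sum(weights[k] * x) - thresholds[k]      # weighted sum, Eq. (5)
                maps[k, i, j] = F(s)                            # activation, Eq. (4)
    return maps

maps = conv_forward(np.random.rand(28, 28), np.random.randn(8, 5, 5), np.zeros(8))
print(maps.shape)   # (8, 24, 24)
```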

As already said, the neural units of each feature map share the same set of weights and thresholds. As a result, multiple features can be extracted at the same location. These features are then combined by the higher layer using pooling in order to reduce the resolution of the feature maps [22]. This layer is called the subsampling or pooling layer and performs local averaging or maximization over different regions of the image. To this end, each map of the convolutional layer is divided into non-overlapping areas of size k × k, and each area is mapped into one unit of the corresponding map in the pooling layer. It should be noted that each map of the convolutional layer is connected only with the corresponding map in the pooling layer. Each unit of the pooling layer computes the average or the maximum of k² neurons of the convolutional layer:

$$ z_{j} = \frac{1}{{k^{2} }}\sum\limits_{i = 1}^{{k^{2} }} {y_{i} } $$
$$ z_{j} = \mathop {\hbox{max} }\limits_{i} (y_{i} ) $$
(6)

The number of neurons in each pooling map is given by

$$ D(S_{2} ) = \frac{{D(C_{1} )}}{{k^{2} }} $$
(7)

The number of feature maps in the pooling layer is the same as in the convolutional layer and equals M. Thus a convolutional neural network is a combination of convolutional and pooling layers, which perform a nonlinear hierarchical transformation of the input data space. The last block of the convolutional neural network is a multilayer perceptron, SVM or another classifier (Fig. 2).
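A short NumPy sketch of the pooling operation in Eqs. (6)–(7) is given below (again an illustration under the stated assumptions, not the paper's implementation).

```python
import numpy as np

# A NumPy sketch of the pooling layer, Eqs. (6)-(7): each non-overlapping
# k x k block of a feature map is reduced to one unit by averaging or by max.
def pool(feature_map, k=2, mode="avg"):
    d = feature_map.shape[0] // k                            # Eq. (7), per dimension
    blocks = feature_map[:d * k, :d * k].reshape(d, k, d, k)
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    return blocks.max(axis=(1, 3))

fmap = np.random.rand(24, 24)
print(pool(fmap).shape, pool(fmap, mode="max").shape)        # (12, 12) (12, 12)
```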

Fig. 2. General representation of a convolutional neural network

Let us consider the conventional convolutional neural network LeNet-5 for handwritten digit classification (Fig. 3) [22]. The input image has size 32 × 32. A sliding window of size 5 × 5 scans the image, and the segments of the image enter layer C1 of the neural network. Layer C1 is a convolutional layer with 6 feature maps, and each feature map contains 28 × 28 neurons. Layer S2 is a subsampling layer with 6 feature maps and a 2 × 2 kernel for each feature map. As a result, each feature map of this layer contains 14 × 14 units. Layer C3 is a convolutional layer with 16 feature maps and a 5 × 5 kernel for each feature map. The number of neural units in each feature map is 10 × 10. The connections between layers S2 and C3 are not fully connected [22], as shown in Table 1.

Fig. 3. Architecture of LeNet-5

Table 1. Connections between layers \( S_{2} \) and \( C_{3} \)

Layer S4 is a subsampling layer with 16 feature maps and a 2 × 2 kernel for each feature map. As a result, each feature map of this layer contains 5 × 5 units. Each receptive field of size 5 × 5 is mapped into 120 neurons of the next layer C5. Therefore layer C5 is a convolutional layer with 120 neurons. The next layer F6 and the output layer are fully connected layers.
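For reference, a minimal PyTorch sketch of a LeNet-5-style network is shown below. It assumes full S2–C3 connectivity and average pooling, so it only approximates the original architecture in Fig. 3 and Table 1; PyTorch itself is an assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

# A LeNet-5-style sketch with full S2-C3 connectivity and sigmoid activations.
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),     # C1: 6 maps of 28x28 (input 32x32)
    nn.Sigmoid(),
    nn.AvgPool2d(2),                    # S2: 6 maps of 14x14
    nn.Conv2d(6, 16, kernel_size=5),    # C3: 16 maps of 10x10
    nn.Sigmoid(),
    nn.AvgPool2d(2),                    # S4: 16 maps of 5x5
    nn.Conv2d(16, 120, kernel_size=5),  # C5: 120 maps of 1x1
    nn.Sigmoid(),
    nn.Flatten(),
    nn.Linear(120, 84),                 # F6
    nn.Sigmoid(),
    nn.Linear(84, 10),                  # output layer
)

x = torch.randn(1, 1, 32, 32)           # one dummy 32x32 grayscale image
print(lenet5(x).shape)                  # torch.Size([1, 10])
```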

3 The Simplified Convolutional Network

In this section we propose a convolutional neural network which has a simpler architecture than LeNet-5. The simplified convolutional neural network for handwritten digit classification is shown in Fig. 4. This network consists of a convolutional layer (C1), a pooling layer (S2), a convolutional layer (C3), a pooling layer (S4) and a convolutional layer (C5). The convolutional layer C1 has 8 feature maps, and each feature map contains 24 × 24 neurons. The pooling layer S2 contains 8 feature maps and 12 × 12 units for each feature map, i.e. k = 2. Layer C3 is a convolutional layer with 16 feature maps and 8 × 8 neurons in each feature map. In contrast to the conventional network LeNet-5, layers S2 and C3 are fully connected. Layer S4 is a pooling layer with 16 feature maps and 4 × 4 units for each feature map. The last layer C5 is the output layer; it contains 10 units and performs the classification. As can be seen, the main differences are the following: 1) the last two layers of LeNet-5 are removed; 2) layers S2 and C3 are fully connected; 3) the sigmoid transfer function is used in all convolutional layers and the output layer. The goal of learning is to minimize the total mean square error (MSE), which characterizes the difference between the real and desired outputs of the neural network. In order to minimize the MSE we use the gradient descent technique. The mean square error for L samples is defined using the outputs of the last layer:

$$ E_{s} = \frac{1}{2}\sum\limits_{k = 1}^{L} {\sum\limits_{j = 1}^{m} {(y_{j}^{k} - e_{j}^{k} )^{2} } } $$
(8)
Fig. 4. Architecture of the simplified convolutional neural network

where \( y_{j}^{k} \) and \( e_{j}^{k} \) are the real and desired outputs of the j-th unit for the k-th sample, respectively. Then, using the gradient descent approach, we can write in the case of mini-batch learning that

$$ w_{cij} (t + 1) = w_{cij} (t) - \alpha \frac{\partial E(r)}{{\partial w_{cij} (t)}} $$
(9)

where \( \alpha \) is the learning rate and \( E(r) \) is the mean square error for \( r \) samples (the size of the mini-batch). Since the units of each feature map in a convolutional layer share the same set of weights, the partial derivative of \( E(r) \) with respect to the shared weight \( w_{c} \) is equal to the sum of the partial derivatives \( \frac{\partial E(r)}{{\partial w_{cij} (t)}} \) computed at all positions (i, j) of the feature map:

$$ \frac{\partial E(r)}{{\partial w_{c} (t)}} = \sum\limits_{i,j} {\frac{\partial E(r)}{{\partial w_{cij} (t)}}} $$
(10)

As a result, in the case of batch learning we obtain the following delta rule to update the synaptic weights:

$$ w_{cij} (t + 1) = w_{cij} (t) - \alpha (t)\sum\limits_{i,j} {\sum\limits_{k} {\gamma_{ij}^{k} F'(s_{ij}^{k} )x_{c}^{k} } } $$
(11)

where \( c = 1, \ldots ,p^{2} \), \( F^{'} \left( {s_{ij}^{k} } \right) = \frac{{\partial y_{ij}^{k} }}{{\partial S_{ij}^{k} }} \) is the derivative of the activation function for the k-th sample, \( s_{ij}^{k} \) is the weighted sum, \( \gamma_{ij}^{k} \) is the error of the ij-th unit in a feature map for the k-th sample, and \( x_{c}^{k} \) is the c-th input.
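The sketch below gives one possible PyTorch reading of the simplified network in Fig. 4, together with a single mini-batch MSE/SGD update in the spirit of Eqs. (8)–(9). It is our interpretation of the text (in particular, C5 is realized as a 4 × 4 convolution producing one 1 × 1 map per class), not the authors' code.

```python
import torch
import torch.nn as nn

# A sketch of the simplified network (Fig. 4): C1 -> S2 -> C3 -> S4 -> C5,
# sigmoid activations everywhere, average pooling with k = 2.
simplified = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5),    # C1: 8 maps of 24x24 (input 28x28)
    nn.Sigmoid(),
    nn.AvgPool2d(2),                   # S2: 8 maps of 12x12
    nn.Conv2d(8, 16, kernel_size=5),   # C3: 16 maps of 8x8, fully connected to S2
    nn.Sigmoid(),
    nn.AvgPool2d(2),                   # S4: 16 maps of 4x4
    nn.Conv2d(16, 10, kernel_size=4),  # C5: 10 output units (one 1x1 map per class)
    nn.Sigmoid(),
    nn.Flatten(),
)

# One mini-batch gradient step on the MSE of Eq. (8), applied as in Eq. (9).
images = torch.rand(50, 1, 28, 28)                                 # r = 50 dummy samples
targets = nn.functional.one_hot(torch.randint(0, 10, (50,)), 10).float()
loss = 0.5 * ((simplified(images) - targets) ** 2).sum() / 50      # Eq. (8), averaged over the batch
loss.backward()
lr = 0.8                                                           # learning rate alpha
with torch.no_grad():
    for p in simplified.parameters():
        p -= lr * p.grad                                           # w(t+1) = w(t) - alpha * dE/dw
```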

4 Experiments

In order to illustrate the performance of the proposed technique, we present simulation results for handwritten digit classification using the MNIST dataset. The MNIST dataset contains 28 × 28 grayscale handwritten digits and has a training set of 60000 samples and a test set of 10000 samples. Some examples of handwritten digits from the MNIST dataset are shown in Fig. 5.

Fig. 5. Examples of handwritten digits

We used the plain backpropagation algorithm for convolutional neural network training without any modifications. The size of the mini-batch is 50; the learning rate is decreased from 0.8 to 0.0001. The results of the experiments are presented in Table 2. As can be seen, we achieve a test error rate of 0.71% using a simple shallow convolutional neural network. The best result for the convolutional network LeNet-5 without distortions is 0.95%. Thus the use of the simplified convolutional network with the elementary backpropagation technique permits obtaining better performance than the conventional architecture. The processing results of each layer for the digit 7 are shown in Table 3.
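The training setup described above can be reproduced, for example, with the hedged sketch below (mini-batches of 50, learning rate annealed from 0.8 towards 0.0001). The torchvision data loading, the number of epochs and the particular decay rule are our assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# The simplified network of Sect. 3 (compact form of the earlier sketch).
model = nn.Sequential(
    nn.Conv2d(1, 8, 5), nn.Sigmoid(), nn.AvgPool2d(2),
    nn.Conv2d(8, 16, 5), nn.Sigmoid(), nn.AvgPool2d(2),
    nn.Conv2d(16, 10, 4), nn.Sigmoid(), nn.Flatten())

train_set = datasets.MNIST("./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=50, shuffle=True)      # mini-batch size 50

lr, lr_min, epochs = 0.8, 0.0001, 30      # epoch count and decay factor are assumptions
for epoch in range(epochs):
    for images, labels in loader:
        targets = nn.functional.one_hot(labels, 10).float()      # desired outputs e_j
        loss = 0.5 * ((model(images) - targets) ** 2).sum() / images.size(0)  # Eq. (8)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad                                  # plain gradient descent, Eq. (9)
    lr = max(lr * 0.7, lr_min)            # anneal the learning rate towards 0.0001
```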

Table 2. Comparative analysis
Table 3. The layer-by-layer processing results for a handwritten digit

5 Conclusion

This paper deals with a convolutional neural network for handwritten digit classification. We propose a simplified architecture of convolutional neural networks, which permits classifying handwritten digits more accurately than the conventional convolutional neural network LeNet-5. The main differences from the conventional LeNet-5 are the following: the last two layers of LeNet-5 are removed; layers S2 and C3 are fully connected; the sigmoid transfer function is used in all convolutional layers and the output layer. We have shown that this simple neural network is capable of achieving a test error rate of 0.71% on MNIST handwritten digit classification.