1 Introduction

Artificial neural networks (ANNs) are information processing systems inspired by the way a biological nervous system (e.g., the brain) works. They are made up of a large number of interconnected units called neurons. These neurons are arranged in a distributed manner and learn from the input in order to produce the final output. The basic model and the mathematical model of a simple ANN are shown in Figs. 1 and 2. The information (generally a multidimensional vector) is fed to the input layer and propagated to the hidden layers. These middle layers learn features, and the decision taken by each layer depends upon the output of the preceding layer [1].
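
For reference, the standard mathematical model of a single artificial neuron (as depicted in Fig. 2) computes a weighted sum of its inputs followed by a nonlinear activation; in conventional notation,

$$ y = \varphi\!\left(\sum_{i=1}^{n} w_i x_i + b\right), $$

where the x_i are the inputs, the w_i the connection weights, b the bias, and φ the activation function.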

Fig. 1 Simple feed-forward neural network [2, 3]

Fig. 2 Mathematical model for an artificial neuron [2, 3]

Research in artificial neural networks began almost a century ago. For a long time, there was no broadly accepted biological model for visual neural systems, until experimental work clarified the structure and function of the mammalian visual cortex [4]. Thereafter, theoreticians developed models that resemble biological neural networks.

Up to the 1950s, the Perceptron, Hebbian learning, ADALINE, and MADALINE were the network models that proved to be milestones in the history of artificial neural networks. Thereafter, the invention of back propagation in the 1960s and of convolutional neural networks in the late 1990s completely changed the way we look at neural networks. The first convolutional neural network, LeNet, was developed in 1998 and has since been applied to visual tasks. However, despite a few scattered applications, CNNs remained largely dormant until the mid-2000s. Owing to the abundance of data, efficient algorithms, and computational power, exponential growth has been observed since 2012. Table 1 summarizes the brief history of neural networks and highlights the key events.

Figure 3 shows the basic structure of a common neural network. It is a feed-forward network with an input layer, two hidden layers, and one output layer. Neural networks that have more than one hidden layer (as shown below) are called deep neural networks. Deep neural networks (DNNs) are also referred to as deep learning networks. They differ from traditional single-hidden-layer neural networks in the number of hidden layers, i.e., their depth.

Fig. 3 Deep neural network with several hidden layers [5]

In deep learning networks, each layer of neurons trains on a particular set of features based on the output of the previous layer. The deeper you go, the more complex and composite the features that can be recognized. The most significant and robust deep learning network is the convolutional neural network (CNN). CNNs are a class of deep learning techniques that have become dominant in various computer vision tasks and are drawing attention across many domains. A CNN is made up of numerous layers, such as convolution layers, pooling layers, and fully connected layers, and is designed to automatically and adaptively learn spatial features through the back propagation algorithm [6, 7] (Fig. 4).

Fig. 4 Layer-wise detailed CNN architecture [8]

The complete CNN architecture is summarized below:

1. Input layer

2. Convolution layer (feature extraction)

3. Pooling layer (feature extraction)

4. Fully connected layer (classification/prediction/recognition, etc.)

A CNN accepts a three-dimensional input and transforms it through its connected layers into a set of class scores given by the output layer. First, a convolution operation is performed between the input image pixels and a window filter (kernel, generally of size 3 × 3 or 5 × 5). The resulting output is called an activation map or feature map. Next, a nonlinear activation function, the ReLU operation, is applied to the feature map; it replaces all negative values with zero and introduces nonlinearity. Pooling is then performed to reduce the spatial size of the feature map, which decreases training time and the amount of computation over the parameters. The last layer of a CNN is the classification layer (fully connected layer), where the higher-order features are transformed into class scores or probabilities. The back propagation algorithm is then used for training the CNN (ConvNet), so as to lessen the difference between the estimated predictions and the actual ground truth labels [8]. A minimal sketch of this pipeline is given below.
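
The following is a minimal PyTorch sketch of this convolution → ReLU → pooling → fully connected pipeline; the input resolution, channel counts, and number of classes are illustrative assumptions rather than values from the text.

```python
import torch
import torch.nn as nn

# Minimal CNN sketch: convolution -> ReLU -> pooling -> fully connected classifier.
# Assumes a 3-channel 32 x 32 input image and 10 output classes (illustrative only).
class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # 3 x 3 window filter
        self.relu = nn.ReLU()                                    # zero out negative values
        self.pool = nn.MaxPool2d(kernel_size=2)                  # halve the spatial size
        self.fc = nn.Linear(16 * 16 * 16, num_classes)           # class scores

    def forward(self, x):
        x = self.pool(self.relu(self.conv(x)))
        return self.fc(torch.flatten(x, 1))

# Training minimizes the gap between predictions and ground-truth labels via back propagation.
model = TinyCNN()
loss = nn.CrossEntropyLoss()(model(torch.randn(4, 3, 32, 32)), torch.randint(0, 10, (4,)))
loss.backward()
```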

As mentioned in Table 1, many convolutional architectures have been released since 2012. Practically all CNN structures follow a similar general design principle of progressively applying convolutional layers to the input while down-sampling the spatial dimensions and learning feature maps [9]. Earlier models contained just a stack of convolutional layers, but recent architectures introduce new and imaginative ways of building convolutional layers that enable more accurate and efficient learning. LeNet, AlexNet, ZFNet, and VGGNet are the standard networks; DenseNet, GoogLeNet, Inception (all four versions), Xception, ResNeXt, Network in Network, and many more are the advanced networks. In the following sections, a few classic architectures and the main modern architectures are described (Fig. 5).

Table 1 Important contributions toward neural networks and deep learning architectures [10]
Fig. 5 Layer-wise LeNet architecture for handwritten digit recognition [11]

2 LeNet (1998) [11, 12]

LeNet is the oldest convolutional network, designed for handwritten and machine-printed character recognition. Its main features (see the sketch after this list) are:

  • Uses average pooling

  • The nonlinear activation function is sigmoid or tanh

  • Uses FC layers at the end

  • LeNet is trained on approximately 60 K training images (the MNIST database)
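
The following is a minimal PyTorch-style sketch of a LeNet-5-like network reflecting the features listed above (5 × 5 convolutions, tanh activations, average pooling, FC layers); the exact layer sizes are the commonly cited ones and should be treated as assumptions here.

```python
import torch
import torch.nn as nn

# A minimal LeNet-5-style sketch (assumed 32 x 32 grayscale input, 10 digit classes).
class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),      # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),      # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

scores = LeNet5()(torch.randn(1, 1, 32, 32))  # -> shape (1, 10)
```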

3 AlexNet (2012) [13, 14]

AlexNet was the ILSVRC 2012 winner. It significantly outperformed all earlier contenders and won the challenge by reducing the top-5 error from 26% to 15.3%. The architectures of AlexNet and LeNet are fundamentally the same, except that the former is deeper, with more filters and more convolution layers. AlexNet uses the ImageNet database and was trained on two Nvidia GeForce GTX 580 GPUs for six days. Main features are:

  • Uses max pooling

  • Uses ReLU as the nonlinear activation function (faster to train than tanh)

  • Uses data augmentation techniques to enlarge the dataset, allowing a larger model with seven hidden layers and 60 M parameters (Fig. 6); a sample augmentation pipeline is sketched after the figure caption below.

    Fig. 6 Layer-wise AlexNet architecture [15]
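
As an illustration of the data augmentation mentioned above, the sketch below builds a typical crop-and-flip training pipeline with torchvision; the crop size and normalization statistics are common defaults assumed for illustration, not values taken from the AlexNet description above.

```python
import torchvision.transforms as T

# A sketch of AlexNet-style training augmentation: random crops and horizontal flips
# enlarge the effective dataset. Crop size and normalization values are assumptions.
train_transform = T.Compose([
    T.RandomResizedCrop(224),          # random crop resized to a 224 x 224 network input
    T.RandomHorizontalFlip(),          # mirror images to enlarge the dataset
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],  # commonly used ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])
```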

4 VGGNet (2014) [1, 14]

VGGNet was the runner-up in the ILSVRC 2014 classification task and the winner of the localization task. It is similar in spirit to AlexNet but uses only 3 × 3 convolutions, with many more filters. It was trained on four GPUs for 2–3 weeks. It is a truly deep network with 16 convolutional/fully connected layers. VGGNet consists of 140 M parameters, which makes it challenging and costly to train and deploy. Main features are:

  • Uniform architecture

  • Large receptive fields are replaced by consecutive layers of 3 × 3 convolution filters

  • Keeps the benefits of small filter sizes (minimal loss of spatial information)

  • The number of filters roughly doubles after each pooling layer; hence, spatial resolution decreases while the depth of the network increases

  • Works well on both image classification and localization tasks (Fig. 7).

    Fig. 7 Layer-wise VGGNet architecture [1]

The development of the VGGNet architecture demonstrated that the depth of the network is a crucial component for good performance.

A drawback of VGGNet is its cost: it requires significantly more memory and parameters to evaluate. Most of these parameters are in the first fully connected layers, and it was found that these FC layers can be removed without downgrading the architecture's performance. A sketch of the characteristic stacked 3 × 3 convolution block is given below.
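
The sketch below shows the characteristic VGG building block, stacked 3 × 3 convolutions followed by 2 × 2 max pooling; the channel counts are illustrative assumptions.

```python
import torch.nn as nn

# A sketch of a VGG-style block: two stacked 3 x 3 convolutions give the same receptive
# field as one 5 x 5 convolution, with fewer parameters and more nonlinearity.
# Channel counts (64 -> 128) are illustrative, not taken from the paper's tables.
def vgg_block(in_channels: int, out_channels: int, num_convs: int = 2) -> nn.Sequential:
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halve the spatial resolution
    return nn.Sequential(*layers)

block = vgg_block(64, 128)  # e.g., the second stage of a VGG-like network
```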

5 Inception (GoogLeNet) (2014)

Up to the VGG architecture, accuracy and the number of parameters grew directly with the depth, i.e., the number of layers. However, if the training data are limited, bigger models are more likely to overfit, and with deeper networks the number of parameters, and hence the complexity and computational cost, increases. The Inception model proposes the new idea of moving to a sparsely connected architecture. This approach reduces the error and maintains the "computational budget" while increasing the depth and width of the network.

Hence, just like any other architecture, the reason behind the development of Inception was to reduce error. Before this, convolutional networks were simply made deeper to increase accuracy, but this led to overfitting with limited data and required an exponential increase in computational resources.

The Inception network was a breakthrough in the improvement of CNN classifiers. It is carefully designed and uses many methods to improve performance in terms of training speed and accuracy. GoogLeNet, or Inception, was the winner of ILSVRC 2014, achieving a top-5 error rate of 6.67%. Its continuous evolution led to the formation of many improved variants of the network. First, the network was introduced as GoogLeNet or Inception-v1. The next variant is Inception-v2, in which the concept of factorization was introduced. The ideas of asymmetric factorization and batch normalization were then introduced in Inception-v3. Inception-v4 and Inception-ResNet are explained together. Each version is an iterative improvement over the previous one [16].

5.1 Inception Module V1 (2014)

The Inception architecture of GoogLeNet is designed to perform well even under strict constraints on memory and computational budget. The architecture is 22 convolutional layers deep. Inception V1 has 5 M parameters, which is 12 times fewer than the 60 M parameters of AlexNet [17].

With Inception, the network was not only made deeper but also wider, i.e., instead of having just a single filter size at each level, multiple filters of different sizes were introduced. Convolution is performed with three different filter sizes (1 × 1, 3 × 3, and 5 × 5) and max pooling is applied in a parallel branch; the outputs are then concatenated into a single output vector and sent to the next inception module. The number of input channels was limited by introducing an extra 1 × 1 convolution before the 3 × 3 and 5 × 5 convolutions. Since 1 × 1 convolutions are computationally much cheaper than 5 × 5 convolutions, the reduced number of inputs lowers the computational requirement (dimensionality reduction). By using sparsely connected layers in this way, Inception significantly reduces the computation, by up to 84% in some cases, while increasing accuracy [18] (Figs. 8 and 9). A simplified sketch of the module is given below.
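
The following is a simplified PyTorch sketch of an Inception-v1-style module with 1 × 1 dimension reduction; the branch channel counts are illustrative assumptions rather than the exact values of GoogLeNet's tables.

```python
import torch
import torch.nn as nn

# A sketch of an Inception-v1-style module with 1 x 1 dimension reduction.
class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, kernel_size=1),              # 1 x 1 reduction
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, kernel_size=1),              # 1 x 1 reduction
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1))

    def forward(self, x):
        # Concatenate the four parallel branches along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

out = InceptionModule(192, 64, 96, 128, 16, 32, 32)(torch.randn(1, 192, 28, 28))
# out.shape -> (1, 64 + 128 + 32 + 32, 28, 28)
```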

Fig. 8 Inception module version 1, naïve version [18]

Fig. 9 Inception module with dimension reduction [18]

5.2 Inception V2 (2015)

Inception v2 is an upgrade of Inception v1. In the inception module, a decrease in dimensions causes a loss of information, known as a representational bottleneck. Inception v2 was built with this problem in mind; it occurs mostly in very deep convolutional networks, where the image size is reduced by a fraction at each layer, so after every level the information that can be extracted from the image is reduced [16]. To retain the spatial information, a smart factorization method is used, which made Inception v2 more accurate and more efficient in terms of computational complexity. In this factorization method, the 5 × 5 convolution is broken down into two 3 × 3 convolutions, as two 3 × 3 convolutions are faster and cheaper to compute than a single 5 × 5 convolution. The factorization procedure is illustrated below (Fig. 10).
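
As a quick check of the saving, counting the weights per input–output channel pair (ignoring biases):

$$ 5 \times 5 = 25 \quad \text{versus} \quad 2 \times (3 \times 3) = 18, \qquad \frac{25 - 18}{25} = 28\% \text{ fewer parameters.} $$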

Fig. 10 Inception module after convolution and filter factorization: the left-most 5 × 5 convolution of the old inception module is now represented as two blocks of 3 × 3 convolutions [17]

5.3 Inception V3 (2015)

Inception V3 is one of the modern architectures that attained a new state of the art in accuracy on the ILSVRC image classification benchmark, and it was the first runner-up for image classification in ILSVRC 2015 [19]. To increase the speed and efficiency of the inception module, asymmetric factorization is used, i.e., an n × n (e.g., 3 × 3) convolution is broken down into a 1 × n convolution followed by an n × 1 convolution. This is approximately 33% cheaper for the same receptive field with the same hardware resources (Fig. 11).
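
The same weight-counting argument, shown here for n = 3, gives:

$$ n \times n = 9 \quad \text{versus} \quad (1 \times n) + (n \times 1) = 6, \qquad \frac{9 - 6}{9} \approx 33\% \text{ cheaper.} $$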

Fig. 11 Inception module after filter size reduction [17]

To reduce the representational bottleneck, the depth of the module was reduced and the module was made wider: multiple convolutions were shifted to the same level to make sure the feature map size does not become extremely small and therefore useless (Fig. 12).

Fig. 12 Inception module using asymmetric factorization [17]

Inception v3 incorporates all the features of Inception v2 and, in addition, uses asymmetric factorization, dropout in the auxiliary classifiers, and label smoothing. Both the auxiliary classifiers and label smoothing act as regularizers and help to reduce overfitting. Inception V3 is a 42-layer-deep architecture, yet it uses fewer parameters and its complexity is similar to that of VGGNet. This version can also handle the factorization of 7 × 7 convolutions, whereas the previous version only went up to 5 × 5.
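
For reference, label smoothing replaces the one-hot training target with a slightly softened distribution; a standard formulation, with smoothing factor ε and K classes, is

$$ q'(k \mid x) = (1 - \epsilon)\,\delta_{k,y} + \frac{\epsilon}{K}, $$

where δ_{k,y} equals 1 for the ground-truth class y and 0 otherwise.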

5.4 Inception V4 and Inception-ResNet (2016)

Inception V4 and Inception-ResNet are explained together. The cost and design structure of Inception-ResNet-v2 roughly matches that of the Inception-v4 network, with only minor differences in step time; Inception-ResNet converges faster during training. As the name suggests, Inception-ResNet is a hybrid, stimulated by the performance of residual networks: it combines the Inception architecture with the residual connections proposed in Microsoft's ResNet. It performs on par with the computationally expensive pure Inception network. On the ILSVRC classification task, this hybrid network achieved a 3.08% top-5 error [20].

The idea for this combination came from the finding that residual connections are inherently important for training very deep networks. Since Inception networks are very deep, replacing the filter concatenation stage of the inception module with residual connections allows Inception to enjoy the gains of ResNet while keeping its computational efficiency intact [16].

In Inception-ResNet, the initial operations performed before the Inception blocks, known as the stem, were modified to make the model more uniform. The overall scheme of the pure Inception V4 network and the detailed composition of the stem configuration for Inception-ResNet are shown in Fig. 13.

Fig. 13 Overall scheme of the pure Inception V4 network and detailed composition of the stem configuration for Inception-ResNet [20]

This model introduced special blocks known as reduction blocks, which change the width and height of the feature maps, making the model more tunable and thus easier to adapt. The inception blocks are very flexible, and many parameters, including the filter sizes, can be modified (Figs. 14 and 15).

Fig. 14 Reduction block A (35 × 35 → 17 × 17 size reduction) and reduction block B (17 × 17 → 8 × 8 size reduction) [20]

Fig. 15 Layout of Inception V4 and layout of Inception-ResNet [20]

6 ResNet

Deep residual networks appeared shortly after Google's Inception V3. Inception was based on increasing width and keeping the network relatively shallow to reduce overall error. In deep CNNs, going deeper means the ability to solve more complex tasks and a significant improvement in object recognition capability; however, as the network gets deeper, training becomes more difficult (slow and tedious), and accuracy saturates and then degrades. Residual learning was developed to take care of these issues: its developers sought to reduce this degradation while maintaining the depth of the network, and for that purpose residual blocks were introduced.

ResNet (Residual Network), developed by Kaiming He et al., introduced the technique of residual learning. ResNet won the ILSVRC 2015 challenge and the COCO 2015 competition, covering ImageNet detection and localization as well as COCO detection and segmentation. ResNet makes use of special skip connections and batch normalization [21].

The main idea is that learning the difference (or change) produced by a transformation is simpler than learning the transformation directly. So, in residual learning, instead of learning the features at the end of a layer directly, the network learns a residual: the difference between the layer's output and the features received from the preceding layer. ResNet does this by utilizing identity shortcut connections (directly connecting the input of the nth layer to some (n + x)th layer), i.e., skipping one or a few layers. Thanks to this shortcut (skip) structure, residual networks are easy to train; networks with 100 and even 1000 layers can be trained efficiently with lower complexity. The shortcut connection is shown in Fig. 16, and a minimal residual block is sketched below.
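
The sketch below shows a basic two-layer residual block with an identity shortcut; it assumes the input and output shapes match so they can be added directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A sketch of a basic two-layer residual block (identity shortcut).
class BasicResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(x + residual)   # y = F(x) + x: the block learns the residual F(x)

y = BasicResidualBlock(64)(torch.randn(1, 64, 56, 56))  # same shape in and out
```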

Fig. 16 Residual learning: a building block [22]

A ResNet block is either two or three layers deep (two-layer blocks are used for ResNet 18 and 34, and three-layer blocks for ResNet 50, 101, and 152; see Fig. 17).

Fig. 17 A deeper residual function. Left: a ResNet 34 block. Right: a bottleneck building block for ResNet 50 [23]

In ResNet 50, the two-layer residual block is replaced with a three-layer bottleneck block, which uses 1 × 1 convolutions to reduce the computation required around the 3 × 3 convolution. This model has about 25 M parameters. A simplified bottleneck sketch follows.
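
The sketch below illustrates the bottleneck idea (1 × 1 reduce, 3 × 3, 1 × 1 expand, plus the identity shortcut); batch normalization is omitted for brevity and the channel sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A simplified ResNet-50-style bottleneck block: 1 x 1 reduce -> 3 x 3 -> 1 x 1 expand,
# with an identity shortcut (assumes in_channels == 4 * mid_channels-style expansion so shapes match).
class Bottleneck(nn.Module):
    def __init__(self, in_channels: int, mid_channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.conv3 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False)
        self.expand = nn.Conv2d(mid_channels, in_channels, kernel_size=1, bias=False)

    def forward(self, x):
        out = F.relu(self.reduce(x))   # cheap 1 x 1: shrink the channel dimension
        out = F.relu(self.conv3(out))  # 3 x 3 convolution on the reduced representation
        out = self.expand(out)         # 1 x 1: restore the original channel dimension
        return F.relu(x + out)

y = Bottleneck(256, 64)(torch.randn(1, 256, 56, 56))
```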

7 ResNeXt

ResNeXt, the first runner-up of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2016 classification task, is an extension of the ResNet and Inception models, and it is also inspired by VGG. ResNeXt uses a homogeneous, multi-branch architecture for image classification and, like Inception, follows the split-transform-merge paradigm. The main difference is that in the Inception module the outputs of the different paths are depth-concatenated, whereas in ResNeXt they are merged by adding them together. ResNeXt uses grouped convolutions preceded by point-wise (1 × 1) convolutions: the input is divided into groups of feature maps, a normal convolution is performed within each group, and the group outputs are depth-concatenated and then fed to a 1 × 1 convolutional layer (in ResNeXt all paths share the same topology). Thus, in ResNeXt, convolution is performed on lower-dimensional representations that are later merged to produce the result, instead of performing convolution on the complete feature map [24] (Fig. 18). A simplified sketch using grouped convolution is given below.
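
The sketch below shows a ResNeXt-style block implemented with a grouped 3 × 3 convolution, where the number of groups plays the role of cardinality; the channel widths follow the commonly quoted 256 → 128 → 256 pattern but are assumptions here, not the paper's exact table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A ResNeXt-style block sketch using a grouped 3 x 3 convolution (cardinality = groups).
class ResNeXtBlock(nn.Module):
    def __init__(self, channels: int = 256, width: int = 128, cardinality: int = 32):
        super().__init__()
        self.reduce = nn.Conv2d(channels, width, kernel_size=1, bias=False)
        self.grouped = nn.Conv2d(width, width, kernel_size=3, padding=1,
                                 groups=cardinality, bias=False)  # 32 parallel paths
        self.expand = nn.Conv2d(width, channels, kernel_size=1, bias=False)

    def forward(self, x):
        out = F.relu(self.reduce(x))
        out = F.relu(self.grouped(out))   # split-transform within groups
        out = self.expand(out)            # merge back to the original width
        return F.relu(x + out)            # residual addition, as in ResNet

y = ResNeXtBlock()(torch.randn(1, 256, 56, 56))
```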

Fig. 18 ResNet-50 block vs. ResNeXt block [25]

According to the research conducted by Saining Xie et al., residual networks provide better optimization, while aggregated transformations (as in Inception networks) provide stronger representations. This made ResNeXt one of the most powerful convolutional networks of its time, as it combined the best of both networks, providing a deeper and wider network.

8 DenseNet (Dense Convolutional Network)

Convolutional networks perform better (in terms of accuracy, efficiency, and training time) when they are deeper and contain shorter connections between the layers close to the input and those close to the output. In DenseNet, each layer is connected to every other layer in a feed-forward fashion, rather than only to the next layer as in a traditional network. For each layer, the feature maps of that layer and all preceding layers are used as input to all succeeding layers, which leads to a colossal increase in the number of connections: an L-layer DenseNet has L(L + 1)/2 direct connections. The feature maps are aggregated by depth-concatenation, which preserves the features, increases the variance of the outputs, and encourages feature reuse. The figure below illustrates this layout schematically [26] (Fig. 19).

Fig. 19 A 5-layer densely connected network; every layer takes the feature maps of the preceding layers as input [26]

For each layer, the feature maps of all previous layers are used as input, and its own output is used as input to all subsequent layers. This significantly improves the performance of the network, since fewer parameters are required and the direct connections to each layer alleviate the vanishing gradient problem. Because of the L(L + 1)/2 connections in an L-layer network, this architecture exhibits dense connectivity and is hence named the dense convolutional network (DenseNet). A simplified sketch of a dense block is given below.
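
The sketch below shows a dense block in which every layer receives the depth-concatenation of all earlier feature maps; the growth rate and number of layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A DenseNet-style dense block sketch: each layer sees the concatenation of all previous
# feature maps. Growth rate and layer count are illustrative assumptions.
class DenseBlock(nn.Module):
    def __init__(self, in_channels: int, growth_rate: int = 32, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Each layer receives the depth-concatenation of the input and all earlier outputs.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

out = DenseBlock(64)(torch.randn(1, 64, 32, 32))  # out has 64 + 4 * 32 = 192 channels
```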

9 Conclusion

Convolutional neural networks (the backbone of numerous deep learning algorithms) have shown state-of-the-art performance in high-level computer vision tasks. To the best of our knowledge, this paper covers the literature that focuses on the most advanced deep learning architectures. A brief introduction to classical neural network models has been included to give the necessary background on the subject of interest. The literature on the most recent and complex architectures, such as Inception, DenseNet, and GoogLeNet, has been included to give the reader a detailed explanation of deep learning techniques. The views and findings of experts have been presented year-wise to hold the interest and curiosity of the reader. The general findings show that CNNs constitute a promising approach with high-grade performance in terms of accuracy, precision, and classification. However, the success of each convolutional neural network model depends heavily on the nature of the dataset used.