1 Introduction

In residual networks, every fraction of a percent of additional accuracy costs nearly a doubling of the number of layers, and this large increase in depth brings a problem of diminishing feature reuse during training, which makes training very slow. Diminishing feature reuse during forward propagation is the forward-direction counterpart of the vanishing-gradient problem. Residual networks address the degradation problem through skip connections, or shortcuts, which short-circuit shallow layers to deeper layers. The Deep Residual Network in Network (DrNIN) architecture exhibited accuracy improvements and convergence behavior superior to other Deep Network in Network (DNIN) architectures. Increasing the depth of the DNIN [1] unexpectedly caused a degradation of accuracy, and the proposed solution was to reformulate the convolutional layers as residual learning functions. So far, therefore, the study of DNINs has focused mainly on the order and number of layers inside a DMLPconv block and on the depth of the networks. The main disadvantage of residual networks lies in their large number of layers: the depth of deep residual networks, and of deep residual network-in-network models, can grow to thousands of layers. To address this problem, a work published in 2016 [40] showed that width has a greater effect than depth. Guided by this principle, we propose to widen the DrMLPConv blocks of DrNIN [18] in order to obtain a more accurate classification. In this paper, we address the challenge of improving classification by introducing a wider and more efficient DrNIN [18] architecture for computer vision, whose name combines that of the DrNIN paper [18] with the well-known "WRN" [40]. The benefits of this work are validated experimentally on the CIFAR-10 classification challenge. The contributions of this work are:

  1. We propose a new layer architecture, WDrMLPConv, which represents the core of the Wide Deep Residual Networks in Networks (WDrNIN) architecture with improved performance.

  2. We present a detailed experimental study that thoroughly examines several important aspects of WDrMLPconv blocks.

  3. We present a detailed experimental study of multi-width deep model architectures that broadly examines several important aspects of WDrMLPconv layers.

  4. Finally, we show that our proposed WDrNIN models achieve interesting results on CIFAR-10, significantly improving accuracy and learning speed.

The rest of this article is organized as follows: Section 2 provides an overview of related work. Section 3 presents the proposed strategy. Experimental results are presented in Section 4. Evaluations and comparative analyses are presented and discussed in Section 5. Section 6 is dedicated to implementation details. The advantages of WDrNIN and future work are presented in Section 7. The work is concluded in Section 8.

2 Related works

Over the years, various techniques have been applied to improve accuracy, as is evident in the work that followed AlexNet up to the publication of ResNet. In general, deep networks have demonstrated their success in many works of the post-2015 period [1,2,3,4–6, 7, 10, 11, 16, 17–19, 20,21,22,23,24, 26, 28–31, 32,33,34, 37, 41]. These solutions involve modifications at the level of the convolution layer, such as increasing the depth [10, 17, 41] and/or the width [21, 40], modifying the filter type and parameters [8, 9], reducing the filter size [1, 18, 38], and changing the number of channels and feature maps [38, 40], as well as modifications at the level of the pooling layers [12, 14, 15, 25–28, 29,30,31,32,33,34,35,36, 39] and of the activation function [5, 27]. In classical CNNs, simple linear filters are the beating heart of the computations inside the convolutional layer. In network-in-network based models, by contrast, nonlinear filters such as the multilayer perceptron (MLP) are exploited instead of simple classical linear filters [1, 28, 30]. Various works have exploited this type of nonlinear filter, such as the NIN model [28], DNIN [1], and DrNIN [18]. NIN [28] consists of several MLPconv layers, stacked in succession, each integrating a linear convolution layer and two MLP layers with a ReLU activation function. A global average pooling layer is used instead of the fully connected layers traditionally used in CNNs. The computation performed by the MLPconv layer is as follows:

$$ \begin{array}{c} f_{i,j,k_1}^{1}=\max\left(\omega_{k_1}^{1\,T}\, x_{i,j}+b_{k_1},\,0\right)\\ \vdots\\ f_{i,j,k_n}^{n}=\max\left(\omega_{k_n}^{n\,T}\, f_{i,j}^{\,n-1}+b_{k_n},\,0\right)\end{array} $$

where:

  • (i, j) represents the pixel index in the feature map,

  • x_{i,j} denotes the input patch centered at location (i, j),

  • k indexes the channels of the feature map, and n denotes the number of layers in the MLP.

Figure 1 illustrates the overall structure of the NIN [28] architecture.

Fig. 1 Network in network
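For illustration, the MLPconv computation above can be sketched in tf.keras (the framework used later in Section 6). This is only a minimal sketch: the filter counts, kernel sizes, and pooling layout below are assumptions and do not reproduce the exact NIN [28] configuration.

```python
# Minimal sketch of a NIN-style MLPconv stack in tf.keras.
# Filter counts and kernel sizes are illustrative assumptions, not the NIN [28] settings.
import tensorflow as tf
from tensorflow.keras import layers

def mlpconv_block(x, filters, kernel_size):
    """One MLPconv layer: a linear convolution followed by two 1x1 convolutions
    (the MLP layers), each applying the max(w^T x + b, 0) step of the formula above."""
    x = layers.Conv2D(filters, kernel_size, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 1, activation="relu")(x)  # first MLP layer
    x = layers.Conv2D(filters, 1, activation="relu")(x)  # second MLP layer
    return x

inputs = tf.keras.Input(shape=(32, 32, 3))
x = mlpconv_block(inputs, 96, 5)
x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)
x = mlpconv_block(x, 96, 5)
x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)
x = mlpconv_block(x, 10, 3)                 # one channel per class
x = layers.GlobalAveragePooling2D()(x)      # replaces the fully connected layers
outputs = layers.Softmax()(x)
model = tf.keras.Model(inputs, outputs)
```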

A modification of the NIN model [28] is presented in the Deep Network In Network (DNIN) model [1]. This model consists of stacked DMLPconv blocks, each integrating two convolutional layers of size 3 × 3 and the nonlinear activation unit "eLU" instead of ReLU. In this architecture, the eLU function [9] is used instead of ReLU [36] to alleviate the problem of vanishing gradients and accelerate learning. The DNIN [1] is shown in Fig. 2.

Fig. 2 Deep Network in Network

In [18], the authors proposed the DrNIN model based on DNIN [1]. It improves the DNIN [1] architecture by applying the residual learning framework to the different MLPconv layers and reformulating the convolutional layers of DMLPConv as residual learning functions. Figure 3 illustrates the DrNIN model composed of three DrMLPconv layers.

Fig. 3 Deep residual network in network

The first model to apply the residual function in CNNs was the ResNet model [17], published in 2015, in which a residual block was proposed to facilitate the training of very deep networks. In this model, a global average pooling layer followed by a Softmax layer is exploited as the classification layer. The residual block with identity mapping is described by Eq. (1), where x_{l+1} and x_l are the output and input of the l-th unit in the network, F is a residual function, and W_l are the parameters of the block.

$$ x_{l+1}=x_l+F\left(x_l,\,W_l\right) $$
(1)
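As an illustration of Eq. (1), a residual unit with an identity shortcut can be written in tf.keras as follows. The choice of two 3 × 3 convolutions for F is an assumption made for the sketch, and the input is presumed to already have `filters` channels so that the addition is valid.

```python
# Sketch of Eq. (1): x_{l+1} = x_l + F(x_l, W_l), with F chosen here as two 3x3
# convolutions. Assumes x_l already has `filters` channels so the addition is valid.
import tensorflow as tf
from tensorflow.keras import layers

def residual_unit(x_l, filters):
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(x_l)
    f = layers.Conv2D(filters, 3, padding="same")(f)   # F(x_l, W_l)
    return layers.Add()([x_l, f])                      # identity mapping: x_l + F(x_l, W_l)
```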

In 2016, Wide ResNet [40] showed that using wider residual blocks is a more effective way to improve accuracy than deepening residual networks.

Building on this work, we adopt this approach to improve the DrNIN model [18] and obtain better performance: the DrMLPconv blocks are widened with a widening factor k that scales the width of the l WDrMLPconv blocks.

3 Proposed model

Compared to the original DrMLPconv architecture of the DrNIN model [18], a widening is applied to the DrMLPconv layers with a widening factor k. The new layer is named "wide deep residual MLPconv" (WDrMLPconv). The original "basic" DrMLPconv block is shown in Fig. 4a, and the new WDrMLPconv blocks are shown in Fig. 4b–d.

Fig. 4 a A schematic example of a "basic" DrMLPconv layer, b a schematic example of a "basic" wide DrMLPconv layer, c a schematic example of a bottleneck layer, d a schematic example of a "basic" dropout DrMLPconv layer

Figure 4a presents the structure of DrMLPconv, which is based on a residual block [17] with two 3 × 3 convolution layers and two multilayer perceptron (MLP) layers, each followed by an eLU activation. Figure 4b presents the basic wide residual block architecture, integrating two consecutive widened 3 × 3 convolution layers before the addition back to the main stream. Figure 4c shows a bottleneck architecture that integrates two 1 × 1 convolution layers and one 3 × 3 convolution layer: the first 1 × 1 convolution reduces the feature dimension, the 3 × 3 convolution operates on this reduced representation, and the second 1 × 1 convolution restores the dimension before the addition back to the main stream. This configuration is exploited to make the WDrMLPConv block even thinner. Figure 4d presents a basic WDrMLPConv architecture integrating a dropout layer between two consecutive 3 × 3 convolution layers, each with batch normalization and ReLU, before the addition back to the main stream.

The WDrNIN model only uses 3 × 3 filters because they have a very small receptive field: 3 × 3 is the smallest size able to capture the notions of left/right, top/bottom, and center. In the rest of the article, the notation WDrNIN-L-k is used; for example, a network with 20 layers and widening factor k = 3 is denoted WDrNIN-20-3. The new basic structure of DrMLPconv is based on a residual block [17] and a multilayer perceptron (with a depth of two layers), which acts as a complex nonlinear filter. Note that the basic DrMLPconv, as shown in Fig. 4a, consists of two 3 × 3 convolution layers and the MLP layers, each followed by an eLU activation. Let WDrMLPConv(X) denote the WDrMLPconv layer, where X is the list of layers used in the WDrMLPconv structure. For example, WDrMLPConv(3, 3, D) denotes the basic WDrMLPconv structure integrating a residual block applied to two 3 × 3 convolution layers and the two MLP layers, with a dropout layer between the two 1 × 1 convolution layers. WDrMLPConv(3, 3) is the same structure but without the dropout layer. WDrMLPConv(1, 3, 1) denotes a structure with two 1 × 1 layers that embed a 3 × 3 layer between them. All configurations of the WDrMLPconv layer are equipped with the eLU nonlinearity [13]. The different WDrMLPconv structures adopted in this work are shown in Table 1, and a code sketch of these configurations is given below.

Table 1 The configurations of WDrMLPConv
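To make these configurations concrete, the sketch below expresses the WDrMLPConv(3,3), WDrMLPConv(3,3,D), and WDrMLPConv(1,3,1) variants as a single parameterized tf.keras block. It is an illustrative interpretation of the descriptions above rather than the implementation itself: the base width, the 1 × 1 projection on the shortcut, and the placement of the MLP (1 × 1) layers are assumptions.

```python
# Hypothetical tf.keras sketch of a WDrMLPConv block widened by a factor k.
# config=(3, 3) gives WDrMLPConv(3,3); dropout>0 gives WDrMLPConv(3,3,D);
# config=(1, 3, 1) gives the bottleneck WDrMLPConv(1,3,1).
import tensorflow as tf
from tensorflow.keras import layers

def wdrmlpconv(x, base_filters, k=1, config=(3, 3), dropout=0.0):
    width = base_filters * k                                 # widening factor k scales the block width
    shortcut = layers.Conv2D(width, 1, padding="same")(x)    # projection so channels match (assumption)
    f = x
    for i, ksize in enumerate(config):
        f = layers.Conv2D(width, ksize, padding="same")(f)
        f = layers.ELU()(f)                                  # eLU nonlinearity used in all configurations
        if dropout > 0.0 and i < len(config) - 1:
            f = layers.Dropout(dropout)(f)                   # dropout between the convolutions ("D" variant)
    # the two MLP layers (1x1 convolutions) of the underlying DrMLPconv design
    f = layers.Conv2D(width, 1, activation="elu")(f)
    f = layers.Conv2D(width, 1, activation="elu")(f)
    return layers.Add()([shortcut, f])                       # residual addition back to the main stream
```

With this sketch, WDrMLPConv(3,3) corresponds to `wdrmlpconv(x, f, k, (3, 3))`, WDrMLPConv(3,3,D) to `wdrmlpconv(x, f, k, (3, 3), dropout=0.5)`, and WDrMLPConv(1,3,1) to `wdrmlpconv(x, f, k, (1, 3, 1))`.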

The general structure consists of L groups of WDrMLPConvx, with x belonging to {1, 2, 3, 4, 5}, and ends with an average pooling layer and a final classification layer. The filter size of the global average pooling layer depends on the depth factor "L". Table 2 summarizes the sizes of these global average pooling layers. The widening factor k represents the width of the model and changes from one network to another. The number of convolution kernels for each convolution group and block is described in Table 2.

Table 2 The number of kernels for WDrMLPconv

As mentioned earlier, the WDrNIN network admits two factors: a deepening factor l and a widening factor k. Note that a network is said to be "wide" only if k is greater than 1; hence, when k = 1, WDrMLPconv has the same width as DrMLPconv. The notation WDrNIN-l-k is used to describe a WDrNIN model with its depth and width parameters. The general structure is made up of l WDrMLPconv blocks, followed by an average pooling layer and a final classification layer. Figure 5 shows the structure of the WDrNIN-3-1 wide residual network.

Fig. 5 Structure of the WDrNIN-3-1 wide residual network
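Continuing the sketch above, a WDrNIN-l-k model of the kind shown in Fig. 5 could be assembled roughly as follows. This reuses the illustrative `wdrmlpconv()` function defined earlier; the per-group widths and the pooling between groups are assumptions, not the configuration of Table 2.

```python
# Hypothetical assembly of a WDrNIN-l-k model: l widened blocks followed by
# global average pooling and the classification layer. Reuses the illustrative
# wdrmlpconv() sketch above; widths and inter-group pooling are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_wdrnin(l=3, k=1, num_classes=10, input_shape=(32, 32, 3)):
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for i in range(l):
        x = wdrmlpconv(x, base_filters=16 * 2 ** min(i, 2), k=k, config=(3, 3))
        if i < l - 1:
            x = layers.MaxPooling2D(2)(x)          # spatial reduction between groups (assumed)
    x = layers.Conv2D(num_classes, 1, activation="elu")(x)  # one channel per class
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Softmax()(x)
    return tf.keras.Model(inputs, outputs)

model = build_wdrnin(l=3, k=1)   # WDrNIN-3-1, as in Fig. 5
```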

4 Experimental results

We evaluate our different configurations on a reference data set: CIFAR-10. In the following, we describe our results and analyze the performance.

4.1 Type of convolutions, number of convolutions and width of a residual block

We start by reporting results for the WDrNIN models with l = 3, 4, and 5 and k = 2 for WDrMLPConv(1,3,1), WDrMLPConv(3,3), and WDrMLPConv(3,3,D). The test accuracy is computed as the average of 2 runs, as is the time per training epoch.

Table 3 presents the test error (%) on CIFAR-10 with a factor k = 2 and the two different block types WDrMLPConv(1,3,1) and WDrMLPConv(3,3).

Table 3 Test error (%) on CIFAR-10 with k = 2 and the two different types of blocks. Averages of 2 runs. Time (s) measures one training epoch

From this table, we can see that WDrMLPConv(1,3,1) yields a test error of 8.09% and takes 42.09 s per training epoch, whereas WDrMLPConv(3,3) takes 58.6 s per training epoch and yields a test error of 7.62%.

In the following, we restrict our work to WDrNIN with WDrMLPConv(3,3) in order to be consistent with the other techniques and methods. We then test and analyze the block deepening factor l to see its effect on performance. Table 4 shows the test accuracy (%) on CIFAR-10 with k = 2 and WDrMLPConv(3,3) for various values of l. Note that the number of parameters increases linearly with the depth factor "l". In addition, we also test and analyze the effect of the widening factor k.

Table 4 Test accuracy (%) on CIFAR-10 with k = 2 and WDrMLPConv(3,3) for various l

It is observed that WDrNIN with l = 4, k = 2, and WDrMLPConv(3,3) performs best compared to the same network with values of l other than 4. The WDrNIN model with WDrMLPConv(3,3) is also the fastest in terms of time per training epoch (time, s) compared to the models that use the same parameters l and k with the other block configurations. We note that all our results were obtained with a batch size of 128.

4.2 Dropout in residual blocks

As the widening factor increases, so does the number of parameters. Although the networks already include batch normalization layers, which provide a stabilizing effect, a Dropout layer [36] is added after the eLU layer in each residual block, as shown in Fig. 4d, in order to prevent the networks from overfitting. Overall, Dropout [36] is a regularization technique for reducing overfitting in neural networks; it avoids complex co-adaptations on the training data.

We trained the models with the Dropout layer inserted into the residual block between the convolutions, with a dropout probability of 0.5. Exploiting the dropout layer improves test accuracy: it leads to an increase in test accuracy on CIFAR-10 ranging from 0.027% up to 0.043%, averaged over 2 runs, for the WDrNIN-4-2 models shown in Table 5.

Table 5 The effect of the Dropout in WDrNIN. (The average of 2 runs)
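In terms of the illustrative block sketched in Section 3 (an assumption, not the actual implementation), this setting corresponds to enabling the dropout argument:

```python
# Usage sketch: WDrMLPConv(3,3,D) with dropout probability 0.5, using the
# illustrative wdrmlpconv() function from Section 3 (an assumption, not the authors' code).
import tensorflow as tf

inputs = tf.keras.Input(shape=(32, 32, 3))
x = wdrmlpconv(inputs, base_filters=16, k=2, config=(3, 3), dropout=0.5)
```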

5 Discussions

The main objective of this work is to examine and evaluate the success of our proposed architecture in image classification and to compare its performance with models from the literature. We recall that the proposed model was trained by performing transfer learning. As shown in Table 6, the WDrNIN-4-2 model achieved slightly better accuracy than most studies in the literature on the original dataset, and with the Dropout layer the WDrNIN-4-2 model was better still in terms of accuracy. The experimental studies show that the WDrNIN models offer among the best classification performance of the works considered in the literature. A comparison with the results of different studies and works is presented in Table 6.

Table 6 CIFAR-10 Test Accuracy

The experimental results demonstrate the effectiveness of the proposed contribution. Moreover, they show that the WDrNIN-4-2 with the dropout layer offers better results, in terms of classification accuracy, than the various other models.

6 Implementation details

All models used in this study were compiled with CPU support. All experimental studies were conducted in a Google Cloud environment on a Linux operating system running on a Dell Intel Core i5-2450M 2.50 GHz processor with 6 GB of DDR4-2400 RAM. All code is written in Python on top of the "TensorFlow" deep learning framework, an open-source deep neural network library, to classify and recognize images. All experiments are run for a total of 160 epochs. During training we use stochastic gradient descent with a momentum of 0.9 in all experiments. The base learning rate is 0.005 and decreases by a factor of 10 every 10 epochs. The weight decay is 0.0005.
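A hedged sketch of this training setup in tf.keras is given below. It assumes the illustrative `build_wdrnin()` function from Section 3 and a plain CIFAR-10 pipeline; the exact data augmentation is not reproduced, and the 0.0005 weight decay would be applied as an L2 kernel regularizer on the convolution layers rather than through the optimizer.

```python
# Sketch of the training configuration described above: SGD with momentum 0.9,
# base learning rate 0.005 divided by 10 every 10 epochs, batch size 128, 160 epochs.
# Assumes the illustrative build_wdrnin() from Section 3; the 0.0005 weight decay
# would be added as kernel_regularizer=tf.keras.regularizers.l2(5e-4) on the Conv2D layers.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def lr_schedule(epoch, lr):
    return 0.005 * (0.1 ** (epoch // 10))    # divide the base rate by 10 every 10 epochs

model = build_wdrnin(l=4, k=2)               # WDrNIN-4-2
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.005, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(
    x_train, y_train,
    batch_size=128,
    epochs=160,
    validation_data=(x_test, y_test),
    callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)],
)
```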

7 Advantages and future work

The WDrNIN models offer an interesting test accuracy that places them among the leading works reported in the literature. The importance of the WDrNIN model also lies in its repetitive and homogeneous structure, which makes it very suitable for integration into embedded system applications. In future work, we plan to expand and artificially augment the CIFAR-10 dataset by increasing the number of classes. This will contribute to the development of models and architectures capable of achieving higher and more interesting accuracies. By deploying these models and architectures in embedded electronics and mobile applications, experts, researchers, and the visually impaired will be able to discover and classify images and make useful and necessary decisions.

8 Conclusion

Classification based on deep learning [23] has become popular in the field of image processing. In this work, a WDrNIN deep learning model is proposed for image classification and detection. In addition, a study is presented on the width of WDrNIN networks as well as on the use of dropout in these architectures. WDrNIN was compared to widely known deep learning models used in image detection and classification on the CIFAR-10 dataset. The WDrNIN-4-2 models, with and without the dropout layer, were found to be the most accurate compared to other widely known CNN models. The WDrNIN-4-2 model with the dropout layer achieved an accuracy of 93.553% on the CIFAR-10 dataset, while the WDrNIN-4-2 model without dropout achieved an accuracy of 93.51%. Additionally, when analyzing the training time per epoch, the WDrNIN-4-2 model was found to be the fastest on CIFAR-10, although its accuracy was lower than that of the other WDrNIN models with different depths and widths. Finally, the WDrNIN models achieve accuracies that are well placed within the literature for CIFAR-10.