
1 Introduction

Artificial neural networks are relatively crude electronic networks of neurons that aim to make machines learn the way a human does. The network takes one input at a time, processes it, and learns by comparing its result with the desired output. The error calculated from this comparison is fed back to the network and used to modify the weights between the neurons. This error-correction process is repeated for many iterations. A neuron has two major components:

  1. Input values and the random weights associated with them.

  2. A summation function [u] that sums the weighted inputs and maps them to an output [y].
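In symbols, for inputs $x_1, \ldots, x_k$ with weights $w_1, \ldots, w_k$, the summation and output can be written as follows (a standard formulation; the bias term $b$ and the activation function $f$ are conventional additions not stated explicitly above):

$$u = \sum_{i=1}^{k} w_i x_i + b, \qquad y = f(u)$$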

An artificial neural network consists of different layers of neurons. The input layer is composed only of the input values, not of neurons, and acts as the input to the next layer.

The next layers are the hidden layers; there may be several of them, and this paper focuses on how varying the number of hidden layers correlates with the accuracy of the model (Fig. 1).

Fig. 1 Structure of a simple neural network

The hidden layers take weighted inputs, and the output of each layer is produced by an activation function. There is no fixed rule about the number of hidden layers that should be used to create a neural network.

2 Hidden Layer and Its Working

To train a neural network, the following steps are performed in a loop so that the weights on the inputs to each hidden neuron can be adjusted to obtain the least error possible:

  1. In the first step, forward propagation is performed.

  2. In the second step, the loss is computed.

  3. In the third step, backward propagation is performed to obtain the gradients used to adjust the weights.

  4. In the fourth step, the parameters are updated to reduce the error.

  5. In the final step, the loop repeats: forward propagation is performed again with the updated parameters.

In the first step, the hidden layers generally use the ReLU activation function and the output layer uses the sigmoid activation function.

The forward propagation is computed using the following equations:

  • Computation at first layer of activation:

$$Y^{[1]} = W^{[1]}X + b^{[1]}, \quad A^{[1]} = \mathrm{ReLU}(Y^{[1]})$$
  • Computation at nth activation layer:

$$Y^{[n]} = W^{[n]}A^{[n-1]} + b^{[n]}, \quad A^{[n]} = \mathrm{ReLU}(Y^{[n]})$$
  • Computation at last activation layer:

$$Y^{[L]} = W^{[L]}A^{[L-1]} + b^{[L]}, \quad A^{[L]} = \mathrm{sigmoid}(Y^{[L]})$$
  • Computation of loss function:

$$\frac{-1}{m}\sum_{i=1}^{m}\left( y^{(i)}\log\left(a^{[L](i)}\right) + \left(1 - y^{(i)}\right)\log\left(1 - a^{[L](i)}\right) \right)$$
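A minimal NumPy sketch of these forward-propagation and loss computations is given below. The layer sizes, random initialization, and synthetic data are illustrative assumptions; the paper does not specify its implementation.

```python
import numpy as np

def relu(z):
    # ReLU activation used in the hidden layers
    return np.maximum(0, z)

def sigmoid(z):
    # Sigmoid activation used in the output layer
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, weights, biases):
    """Compute A[1]..A[L] for an L-layer network.
    X has shape (n_features, m); weights[l] has shape (n_l, n_{l-1})."""
    A = X
    L = len(weights)
    for l in range(L):
        Y = weights[l] @ A + biases[l]            # Y[l] = W[l] A[l-1] + b[l]
        A = relu(Y) if l < L - 1 else sigmoid(Y)  # ReLU for hidden layers, sigmoid for the output
    return A

def cross_entropy_loss(A_L, y):
    # -(1/m) * sum( y*log(a) + (1-y)*log(1-a) )
    m = y.shape[1]
    return -np.sum(y * np.log(A_L) + (1 - y) * np.log(1 - A_L)) / m

# Illustrative example: 4 input features, two hidden layers of 5 units, 1 output unit
rng = np.random.default_rng(0)
layer_sizes = [4, 5, 5, 1]
weights = [rng.standard_normal((layer_sizes[l + 1], layer_sizes[l])) * 0.01
           for l in range(len(layer_sizes) - 1)]
biases = [np.zeros((layer_sizes[l + 1], 1)) for l in range(len(layer_sizes) - 1)]

X = rng.standard_normal((4, 10))          # 10 examples
y = rng.integers(0, 2, size=(1, 10))      # binary labels
A_L = forward_propagation(X, weights, biases)
print("loss:", cross_entropy_loss(A_L, y))
```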

After implementing forward propagation, backward propagation is calculated using the following steps:

  • First, we perform linear backward propagation.

  • After that, linear-to-activation backward is performed, where the derivative of the ReLU or sigmoid activation is computed.

  • Finally, these steps are chained for the entire model: the [linear → ReLU] backward block repeated (L − 1) times, followed by the [linear → sigmoid] backward block.

After completion of all the above-mentioned steps, we use gradient descent to update the parameters.
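The sketch below illustrates the complete loop (forward propagation, loss, backward propagation, gradient-descent update) for a one-hidden-layer network in NumPy. The network size, learning rate, and synthetic data are assumptions made for illustration only, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic binary-classification data: 2 features, 200 examples
X = rng.standard_normal((2, 200))
y = (X[0:1, :] * X[1:2, :] > 0).astype(float)    # label 1 when the two features share a sign

# One hidden layer with 8 ReLU units, sigmoid output
W1 = rng.standard_normal((8, 2)) * 0.1; b1 = np.zeros((8, 1))
W2 = rng.standard_normal((1, 8)) * 0.1; b2 = np.zeros((1, 1))
lr, m, eps = 0.1, X.shape[1], 1e-12

for epoch in range(2000):
    # Step 1: forward propagation
    Y1 = W1 @ X + b1;  A1 = np.maximum(0, Y1)             # ReLU hidden layer
    Y2 = W2 @ A1 + b2; A2 = 1 / (1 + np.exp(-Y2))         # sigmoid output layer

    # Step 2: compute the loss (binary cross-entropy; eps guards against log(0))
    loss = -np.mean(y * np.log(A2 + eps) + (1 - y) * np.log(1 - A2 + eps))

    # Step 3: backward propagation (gradients of the loss w.r.t. the parameters)
    dY2 = A2 - y                                           # sigmoid + cross-entropy shortcut
    dW2 = dY2 @ A1.T / m;  db2 = dY2.sum(axis=1, keepdims=True) / m
    dA1 = W2.T @ dY2
    dY1 = dA1 * (Y1 > 0)                                   # derivative of ReLU
    dW1 = dY1 @ X.T / m;   db1 = dY1.sum(axis=1, keepdims=True) / m

    # Step 4: gradient-descent parameter update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    # Step 5: the loop repeats, so forward propagation runs again with the new parameters

accuracy = np.mean((A2 > 0.5) == y)
print(f"final loss {loss:.3f}, training accuracy {accuracy:.2f}")
```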

3 Procedure

In this paper, an artificial neural network [1] was trained on four different datasets, and for each model the number of hidden layers was varied to see the effect of the number of layers on the accuracy of the model. While changing the number of layers, all other factors, such as the number of neurons per layer, the activation function, and other variables, were kept constant. After training, all the neural networks for a dataset were tested on the same data. Then, for every dataset, a graph was plotted that visualizes accuracy [2] against the number of hidden layers.
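The paper does not list the exact libraries or hyperparameters, so the following scikit-learn sketch is only an illustration of the procedure: the number of hidden layers is varied while the layer width, activation, and all other settings are held constant, and the test accuracy is recorded for each depth. The synthetic dataset and the width of 10 neurons per layer are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for one of the four datasets (the real data is not reproduced here)
X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
# 25:75 train/test split, as stated in Sect. 4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.75, random_state=0)

accuracies = {}
for n_layers in range(1, 7):
    # Same width, activation, and training settings for every depth; only the depth varies
    model = MLPClassifier(hidden_layer_sizes=(10,) * n_layers,
                          activation="relu", max_iter=1000, random_state=0)
    model.fit(X_train, y_train)
    accuracies[n_layers] = model.score(X_test, y_test)   # test-set accuracy

print(accuracies)
```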

4 Dataset

Four different datasets are used for the following experiment. The number of rows ranges from 1,000 to 10,000 and the number of columns ranges from 4 to 8. Each dataset is further divided 25:75 into training [3] and testing sets, respectively. Accuracy is used as the performance measure of the neural network in this experiment; it is defined as the ratio of correct predictions on the testing set to the total number of cases used for testing.
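Expressed as a formula, the accuracy measure used throughout the experiments is:

$$\text{accuracy} = \frac{\text{number of correct predictions on the testing set}}{\text{total number of testing cases}}$$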

5 Results

The following graphs are for the four different datasets; the X-axis shows the number of hidden layers, and the Y-axis shows the accuracy attained. In addition to the hidden layers, there is an output layer, which is not counted when plotting the graphs.

5.1 Dataset 1

The graph below shows the accuracy on dataset 1 as the number of hidden layers is increased from 1 to 6 (Fig. 2).

Fig. 2 Accuracy of dataset 1 with varying number of hidden layers

5.2 Dataset 2

See Fig. 3.

Fig. 3 Accuracy of dataset 2 with varying number of hidden layers

5.3 Dataset 3

See Fig. 4.

Fig. 4 Accuracy of dataset 3 with varying number of hidden layers

5.4 Dataset 4

See Fig. 5.

Fig. 5 Accuracy of dataset 4 with varying number of hidden layers

6 Observations

The graphs above show how accuracy varies with the number of hidden layers. It can be observed that the model's accuracy initially increases gradually up to a certain number of layers and then drops abruptly after reaching a saturation point.

In dataset 1, accuracy starts at around 0.8385 and reaches its maximum with three hidden layers. All the graphs can be summarized in a similar way, and Table 1 gives the maximum accuracy and the number of hidden layers at which it was found.

Table 1 Number of hidden layers at which maximum accuracy was found

Theoretically, if an appropriate number of neurons is selected for the first hidden layer, the network can fit most hypotheses, and there is rarely a need to add more hidden layers to the network.

A single hidden layer can approximate any function that contains a continuous mapping from one finite space to another.

Two hidden layers can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy.

With no hidden layer, the network can only represent functions that are linearly separable.

So, from the above observations, it is clear that for small datasets accuracy can be improved by increasing the number of hidden layers from 0 to 1, 1 to 2, or 2 to 3.

7 Correlation Between Accuracy and Number of Hidden Layers

If the number of hidden layers used to build the network is much larger than what the given dataset requires, the accuracy on the test set will decrease. Such a network will overfit the training data; that is, it will learn the training data almost perfectly but fail to generalize to the test data.

Figure 6 illustrates the problems of underfitting and overfitting. Here, we have a set of data points and try to fit the best function we can to the data.

Fig. 6 Simulating the model graph with increasing number of hidden layers [4]

In the first panel, we try to fit a linear function to the data points. The function is not complex enough to fit them all, so it suffers from underfitting. In the second panel, we generalize the data with a more complex function; the model has learned the trend the points follow, which is a parabola. In the last panel, we have increased the number of hidden layers beyond what the model requires, and it suffers from overfitting: it has not learned the underlying trend and therefore fails to predict the results for the test data correctly. Thus, by increasing the number of hidden layers in the neural network beyond what is needed, the model fails to generalize the trend to new data and gives poor accuracy on the testing set, which is reflected in the results section of this paper.
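A simple way to reproduce the effect in Fig. 6 is to fit polynomials of increasing complexity to noisy parabolic data; the degrees, noise level, and data below are illustrative assumptions, with the high-degree fit playing the role of the over-parameterized network.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 12)
y = x ** 2 + rng.normal(scale=0.1, size=x.shape)      # noisy parabola, as in Fig. 6

x_test = np.linspace(-1, 1, 200)                      # unseen points for evaluation
y_test = x_test ** 2

for degree in (1, 2, 9):                              # underfit, good fit, overfit
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```

The degree-1 fit has high error everywhere (underfitting), the degree-2 fit captures the parabolic trend, and the degree-9 fit drives the training error toward zero while its test error grows (overfitting).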

8 Conclusion

In most practical cases with small datasets, there is no need for more than two hidden layers to obtain good accuracy. Increasing the number of hidden layers further will reduce the accuracy of the model, since the backpropagation algorithm loses its effectiveness.

When the number of hidden layers in the neural network is increased, the error obtained when using the model to predict the test dataset will increase, even though the model predicts the training set correctly, because of overfitting.

The accuracy of the model depends on the architecture of the network and the algorithm used, as measured on its test dataset.

When a network tries to fit the data very closely, it will have a large generalization error and a very high variance because of overfitting.

To decrease this variance, we need to smooth the network outputs; but while the variance is being reduced, the bias may grow very large and the generalization error will be large again. This is the case of underfitting. Thus, the balance between bias and variance plays a huge role in applying neural networks to practical applications.

The following solutions can be used to avoid the problem of underfitting:

  1. There should be enough hidden nodes in the network for the function to fit the dataset properly; the network should be capable of representing the mappings of the data points.

  2. The network should be trained for long enough to reduce the sum-of-squared-errors cost.

To prevent overfitting:

  1. The network should not be trained for so long that it stops learning the trend and merely fits the training data.

  2. The adjustable parameters, such as the number of hidden layers of the network, must be restricted so that the chance of overfitting is reduced.
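As an illustration of these two remedies, the scikit-learn sketch below restricts the architecture to a single small hidden layer and uses early stopping, a standard technique that operationalizes "not training for too long" by halting when a held-out validation score stops improving. The dataset and parameter values are assumptions for demonstration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic demonstration data and an arbitrary split for this sketch
X, y = make_classification(n_samples=2000, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Remedy for underfitting: enough hidden units to represent the mapping (here 16)
# Remedies for overfitting: a restricted architecture and early stopping on a validation split
model = MLPClassifier(hidden_layer_sizes=(16,),
                      early_stopping=True,        # stop when the validation score stops improving
                      validation_fraction=0.2,
                      n_iter_no_change=10,
                      max_iter=1000,
                      random_state=1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```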