Introduction

Deep learning is a form of hierarchical learning that comprises algorithms and network topologies for solving a wide variety of problems. It is a feature-learning method with many levels of representation, which makes it possible to learn nonlinear representations one layer at a time [1]. Lower-level features capture minor details and are used to transform the representation at one level into higher-level features built on top of it [2]; complex functions can therefore be learned through such successive transformations. Over the last few decades, deep learning has become an extremely popular and powerful tool because it can handle huge quantities of data, and deep learning architectures have greatly expanded the number and types of problems that neural networks can address. Researchers and academicians have advanced deep learning considerably over the last two decades [3]. Neural networks are architectures that imitate biological neural networks [4]. The rudimentary building block of an artificial neural network is the neuron, a mathematical function that simulates the behavior of a biological neuron [5]. Neural networks are programs built from many simple, highly interconnected computing components that mimic the structure and operation of the biological brain in order to model how the brain processes information. Figure 1 shows a simple neural network. A neural network is characterized by three elements, namely its activation function, its learning technique, and its weights, and its neurons are interlinked to form a network [6]. The role of each neuron depends on whether it is used for input, output, or in one of the hidden layers.

Fig. 1 Simple neural network

The two key hyperparameters that control the architecture or topology of a neural network are the total number of layers and the number of nodes in each hidden layer, as shown in Fig. 2.

Fig. 2 Complex neural network

Input layer: It receives input either from an outside source or from other neighboring nodes. Every node is attached to a node of the succeeding layer, and each connection has a specific weight. Weights are assigned to a neuron according to its degree of importance relative to the other inputs. Once all the node values from the input side are multiplied by their corresponding weights and summed, the value passed to the hidden layer is generated [6]. The output of the input layer is given by the equation:

$$y_i = \sum \left( {{\text{weights}}*{\text{input }} + {\text{bias}}} \right)$$
(1)

This value can range from −∞ to +∞, so it is necessary to bound the output to obtain the desired prediction or generalized results.
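
As a minimal illustration of Eq. (1), the following NumPy sketch (with arbitrary example values for the inputs, weights, and bias) computes the pre-activation output of a single neuron.

```python
import numpy as np

# Arbitrary example values, not taken from the text.
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8, 0.1, -0.4])
bias = 0.2

# Weighted sum of Eq. (1); without an activation function this value
# is unbounded and may fall anywhere from -infinity to +infinity.
y = np.dot(weights, inputs) + bias
print(y)
```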

Hidden layers: Hidden layers always lie between the input and output layers and are hidden from the outside world. The number of hidden layers varies from network to network and is determined by the nature and size of the problem. Once a hidden layer receives information from the input side, it performs all the computational tasks and produces a result [6, 7], which is then forwarded to the output layer. Hidden layers refine the input weightings until the marginal error of the neural network is small.

Output layer: The output nodes are collectively known as the output layer, and they are responsible for producing the final result. The output layer is designed to shape and improve the final results of the iterative task [8]. It receives the input from the hidden layers and uses its neurons to complete the computations, after which the output is generated.

Activation Function

In neural networks, activation functions are mathematical functions associated with each neuron in the network. The activation function of a neuron decides whether the neuron should be turned on or off depending on its input or set of input values [9]. Activation functions help normalize the output of the neurons and map it into a range such as 0 to 1 or −1 to 1.

The activation functions can be classified mainly into the following types:

  1. Identity or linear activation function

  2. Nonlinear activation functions.

Identity or Linear Activation Function

The input values multiplied by their corresponding weights from each neuron are given as input to an activation function, and the activation function produces an output corresponding to those input values [10]. With only a linear activation function, a neural network is nothing but a linear regression model: the linear activation function has limited power and limited capacity to handle complex input data. The equation of a linear function is that of a straight line, i.e., f(x) = a * x, and its range is from −∞ to +∞. The graphical representation is shown in Fig. 3.

Fig. 3 Linear function

When a linear activation function is employed in a multi-layered neural network, all the layers remain linear irrespective of how many layers are present; the final layer is therefore simply a linear transformation of the input [11]. The linear function f(x) has a constant derivative that does not depend on the input value x, so backpropagation cannot be used effectively to train the model: the gradient is always the same, and hence the error cannot be reduced.
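
The following sketch (with hypothetical random weight matrices W1 and W2) illustrates why stacking purely linear layers adds no expressive power: two linear layers collapse into a single equivalent linear transformation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # input vector (arbitrary values)
W1 = rng.normal(size=(4, 3))  # first linear layer
W2 = rng.normal(size=(2, 4))  # second linear layer

two_layers = W2 @ (W1 @ x)    # output of two stacked linear layers
one_layer = (W2 @ W1) @ x     # output of one equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True: the layers collapse
```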

Nonlinear Activation Function

Nonlinear activation functions can be used to represent almost any process as a computational function in a neural network [12]. They allow the model to develop a complex mapping between the network's inputs and outputs, which is essential for learning and modeling complex nonlinear data such as images, video, audio, and other data sets. Thus, to resolve the problems of the linear activation function, nonlinear activation functions are commonly used [13]. Their derivative depends on the input, which permits backpropagation, and they allow a deep neural network to be built by stacking multiple hidden layers of neurons. This makes it possible to model complex data sets with high levels of precision using deep neural networks.

Binary Step Activation Function

The step function is among the most basic activation functions available and provides a binary output [9], which is why it is also called the binary step function. A threshold value is chosen: when the input exceeds the threshold, the function produces the value 1 (true) and the neuron is activated; if the input does not exceed the threshold, the function produces the value 0 (false) and the neuron is deactivated. This makes it useful for binary classification tasks. A graphical representation of the binary step activation function is shown in Fig. 4.

Fig. 4 Binary step activation function

Mathematically, the binary step activation function can be described as

$$f(x) = \begin{cases} 0, & \text{for}\ x < 0 \\ 1, & \text{for}\ x \ge 0 \end{cases}$$
(2)
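
A minimal NumPy sketch of Eq. (2); the function name and the threshold parameter are illustrative choices, not part of the original formulation.

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """Binary step activation of Eq. (2): 1 if x >= threshold, else 0."""
    return np.where(x >= threshold, 1, 0)

print(binary_step(np.array([-2.0, 0.0, 3.5])))  # [0 1 1]
```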

Sigmoid or Logistic Activation Function

The sigmoid function resembles an "S"-shaped curve. It can be used to map predicted values to probabilities: it maps input values of any magnitude to output values in the interval between 0 and 1, normalizing the output of each neuron [14]. The graphical representation of the sigmoid activation function is shown in Fig. 5; note that the output is not zero-centered.

Fig. 5 Logistic/sigmoid function

Mathematically, the sigmoid activation function can be described as

$$f\left( x \right) = \frac{1}{{\left( {1 + {\text{e}}^{ - x} } \right)}}$$
(3)

The sigmoid function is monotonic over its entire domain and is easily differentiable; however, its derivative is not monotonic. For very low or very high values of x, there is almost no change in the output, which creates the vanishing gradient problem. This leads to a situation in which the network refuses to learn further or becomes too slow to reach an accurate prediction. During training, the logistic sigmoid function can therefore cause the neural network to become stuck.
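
A short NumPy sketch of Eq. (3) that also evaluates the sigmoid's derivative, illustrating how the gradient vanishes for large positive or negative inputs.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation of Eq. (3)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
y = sigmoid(x)
grad = y * (1.0 - y)  # derivative of the sigmoid
print(y)     # outputs squashed into (0, 1), not zero-centered
print(grad)  # near-zero gradient for large |x| (vanishing gradient)
```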

Tanh Function

The tanh activation function almost always works better than the sigmoid function. It is a mathematically shifted and scaled version of the sigmoid that maps inputs to outputs ranging from −1 to 1 [15], and its gradient is stronger than that of the sigmoid. A major advantage of the tanh activation function is that strongly negative inputs are mapped to strongly negative outputs and zero inputs are mapped near zero, which is not the case for the sigmoid function, whose range is between 0 and 1, as shown in Fig. 6.

Fig. 6 Tanh function

The activations coming out of a hidden layer therefore have a mean much closer to zero, so the data are more centered, making learning easier and faster for the next layer. The function itself is monotonic, although its derivative is not.

Mathematically, the tanh activation function can be described as

$$\tanh (x) = \frac{1 - \text{e}^{-2x}}{1 + \text{e}^{-2x}}$$
(4)

The main disadvantage of the tanh activation function is that its gradient can become very small, even approaching zero, which can stall gradient descent. The most difficult aspect of implementing these functions is that they require an exponential term to produce their nonlinear behavior.
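
A minimal NumPy sketch of Eq. (4); the derivative 1 − tanh²(x) shows the gradient shrinking toward zero for large inputs, as discussed above.

```python
import numpy as np

def tanh(x):
    """Tanh activation of Eq. (4); equivalent to np.tanh(x)."""
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

x = np.array([-3.0, 0.0, 3.0])
print(tanh(x))             # zero-centered outputs in (-1, 1)
print(1.0 - tanh(x) ** 2)  # gradient approaches zero for large |x|
```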

Arctan Function

The arctan function is similar to the sigmoid and tanh functions and is obtained as the inverse of the tangent function. This activation function maps inputs to outputs in the range (−π/2, π/2). The graph of the arctan function is a slightly flatter S-shape than that of the tanh function, which provides better classification power [11]. The arctan function can be mathematically described as

$$f\left( x \right) = \tan^{ - 1} \left( x \right)$$
(5)

Figure 7 shows the graphical representation of the arctan function.

Fig. 7 Arctan function

For larger values of the input, its derivative converges to zero only polynomially, whereas the derivative of the sigmoid activation function converges to zero exponentially.
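
A minimal NumPy sketch of Eq. (5), with the derivative 1/(1 + x²) printed to show its polynomial decay.

```python
import numpy as np

def arctan_activation(x):
    """Arctan activation of Eq. (5); outputs lie in (-pi/2, pi/2)."""
    return np.arctan(x)

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(arctan_activation(x))
print(1.0 / (1.0 + x ** 2))  # derivative decays only polynomially
```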

Softmax Function

The Softmax function is another mathematical function that always produces outputs ranging from 0 to 1, irrespective of whether the input values are positive, negative, zero, or greater than one, and the sum of all the output probabilities equals 1. Thus, the Softmax function is used to compute a normalized output probability distribution of K probabilities from an input vector of K real numbers. The formula for the Softmax function can be specified as follows [16]:

$$\sigma \left( {z_i } \right) = \frac{{{\text{e}}^{z_i } }}{{\sum_{j = 1}^K {\text{e}}^{z_j } }}$$
(6)

The expression computes the exponential of each input value as well as the sum of all the exponential values; each Softmax output is proportional to the exponential of its input value and is normalized by the sum of the exponential values.

The graph in Fig. 8 shows how the output probabilities vary with the input values. For multi-class classification problems, the Softmax function can be used as the activation function in the output layer of neural network models.

Fig. 8 Softmax function
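
A minimal NumPy sketch of Eq. (6); subtracting the maximum input before exponentiating is a standard numerical-stability trick and is not part of the original equation.

```python
import numpy as np

def softmax(z):
    """Softmax of Eq. (6), with max-subtraction for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # each value lies in (0, 1)
print(probs.sum())  # the probabilities sum to 1.0
```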

Softsign Function

The softsign function is another type of activation function employed in neural networks. It closely resembles the hyperbolic tangent activation function, the main difference being that, unlike the tanh function, which converges exponentially, the softsign function converges polynomially [17]. The softsign function is zero-centered and its output ranges between −1 and +1, so the network learns effectively.

Figure 9 shows the graphical interpretation of the softsign function. The formula for the softsign function can be specified as follows:

$$f\left( x \right) = \frac{x}{1 + \left| x \right|}$$
(7)
Fig. 9 Softsign function

The softsign activation function is characterized by a high degree of nonlinearity and good error tolerance, so it can be used in neural networks to transform the input into a nonlinear distribution. The main advantage of the softsign function is that its output is centered on zero and it approaches its asymptotes more smoothly [11]; the output therefore saturates gradually on both sides, which alleviates the vanishing gradient problem to some degree.
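
A minimal NumPy sketch of Eq. (7), with the softsign derivative printed to show its gradual, polynomial saturation.

```python
import numpy as np

def softsign(x):
    """Softsign activation of Eq. (7): zero-centered output in (-1, 1)."""
    return x / (1.0 + np.abs(x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(softsign(x))                   # saturates toward -1 and +1 polynomially
print(1.0 / (1.0 + np.abs(x)) ** 2)  # derivative of the softsign function
```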

Rectified Linear Unit (ReLU) Function

In DL models, the rectified linear unit (ReLU) is one of the most famous and most frequently used activation functions. This function preserves the characteristics of a linear function for positive inputs. It also prevents the vanishing gradient problem seen in earlier activation functions by rectifying input values less than zero to zero; otherwise, it passes the input directly to the output [18]. ReLU maps its output to the range [0, +∞) and can be represented as

$$f(x) = \begin{cases} x, & \text{if}\ x \ge 0 \\ 0, & \text{if}\ x < 0 \end{cases} = \max (x, 0)$$
(8)

The graph shown in Fig. 10 represents the ReLU activation function.

Fig. 10 ReLU function

ReLU enables faster learning and outperforms other activation functions, including the sigmoid and tanh functions, in terms of efficiency and generalization [11]. Many varieties of neural network models are easier to train with ReLU as the default activation function.
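
A minimal NumPy sketch of Eq. (8).

```python
import numpy as np

def relu(x):
    """ReLU activation of Eq. (8): max(x, 0)."""
    return np.maximum(x, 0.0)

print(relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))  # negatives rectified to 0
```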

Exponential Linear Units (ELUs) Function

The exponential linear unit (ELU) is another form of activation function, similar to ReLU but with certain variations. The ELU activation function can be specified mathematically as follows:

$$f(x) = \begin{cases} \alpha \left( \text{e}^x - 1 \right), & x \le 0 \\ x, & x > 0 \end{cases}$$
(9)

For positive values of the input x, the equation outputs x itself, the same as ReLU. For negative inputs, the output is α times \(\left( \text{e}^x - 1 \right)\), where α is a hyperparameter that controls the value at which ELU saturates for negative inputs. This is an effective way of handling negative inputs [18]. A graphical comparison between the ReLU and ELU activation functions is shown in Fig. 11.

Fig. 11 Comparison between ReLU and ELU activation functions

ELUs produce negative values that push the mean of the activations closer to zero, which brings the gradient closer to the natural gradient and enables faster learning. ELU does not suffer from the dying-neuron problem because its gradient is non-zero for all negative values, and it is a smooth, differentiable activation function at all points.
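
A minimal NumPy sketch of Eq. (9); alpha defaults to 1.0 here as an illustrative choice.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU activation of Eq. (9); alpha sets the negative saturation value."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))  # negative inputs saturate smoothly toward -alpha
```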

Swish Function

The swish activation function is an innovative activation function that can be effectively used in deep learning models across a variety of complicated data sets. Mathematically, the swish function can be defined as follows:

$$f\left( x \right) = \frac{x}{{\left( {1 + {\text{e}}^{ - x} } \right)}}$$
(10)

The function is simply the input x multiplied by the sigmoid of x; its graphical representation is shown in Fig. 12.

Fig. 12 Swish function

Swish is unbounded in the upper portion of the graph, so the output does not saturate to a maximum value for large inputs [19]. However, it is bounded in the lower portion of the graph, and for negative inputs it does not simply return zero as ReLU does. Swish is smooth, non-monotonic, and continuous at all points, which differentiates it from most common activation functions.
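
A minimal NumPy sketch of Eq. (10).

```python
import numpy as np

def swish(x):
    """Swish activation of Eq. (10): x multiplied by sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))  # unbounded above, bounded below; small negatives preserved
```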

Flatten-T Swish (FTS) Function

Flatten-T swish (FTS) was introduced by Chieng as a novel activation function. It incorporates the activation features of both the swish and rectified linear unit (ReLU) activation functions into a single new function [20] and is used to cope with the negative cancellation property of ReLU. Mathematically, FTS is formulated as follows:

$${\text{FTS}}(x) = \begin{cases} \dfrac{x}{1 + \text{e}^{-x}}, & x \ge 0 \\ 0, & x < 0 \end{cases}$$
(11)

When x ≥ 0, the FTS function has properties identical to those of the swish activation function; when x is less than zero, the function behaves like ReLU. The graphical representation of the FTS function is shown in Fig. 13.

Fig. 13 Flatten-T swish function
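
A minimal NumPy sketch of Eq. (11).

```python
import numpy as np

def flatten_t_swish(x):
    """Flatten-T swish of Eq. (11): swish for x >= 0, zero otherwise."""
    return np.where(x >= 0, x / (1.0 + np.exp(-x)), 0.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(flatten_t_swish(x))
```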

FTS retains desirable network properties such as dynamicity, pliability, and nonlinear representation capacity. Table 1 gives a summary of activation functions and their corresponding equations, derivatives, and applications.

Table 1 Activation functions and their corresponding equations

Conclusion

Deep learning approaches employ neural networks consisting of several hidden layers to perform complex tasks. In the design of neural networks, the activation function plays a vital role: the activation function of the hidden layers dictates how well the network model learns the training data set, while the activation function employed in the output layer determines the kind of predictions the model can offer. The vanishing gradient is an unstable behavior that inhibits the training of deep neural networks with saturating activation functions. As the network's layers become deeper, training efficiency and precision encounter numerous challenges, which has stimulated the development of different kinds of activation functions. Activation functions are therefore an important component of neural networks, and selecting proper activation functions and analyzing their impact on the network will help optimize the efficiency of the DL model.