1 Introduction

LSTM, introduced by Hochreiter and Schmidhuber [1], is a recurrent neural network (RNN) architecture shown to be effective for a variety of learning problems, especially those involving sequential data [2]. The LSTM architecture contains blocks, each a set of recurrently connected units. In standard RNNs, the gradient of the error function can grow or decay exponentially over time; the decaying case is known as the vanishing gradient problem. In LSTMs, the network units are redesigned to alleviate this problem. Each LSTM block consists of one or more self-connected memory cells along with input, forget, and output multiplicative gates. The gates allow the memory cells to store and access information over longer time periods, improving performance [2].

LSTMs and bidirectional LSTMs [3] have been successfully applied to various tasks, especially classification. Applications of these networks include online handwriting recognition [4, 5], phoneme classification [2, 3, 6], and online mode detection [7]. LSTMs are also employed in speech processing, for generation [8], translation [9], emotion recognition [10], acoustic modeling [11], and synthesis [12]. These networks are further used for language modeling [13], protein structure prediction [14], analysis of audio and video data [15, 16], and human behavior analysis [17].

Generally, the behavior of a neural network depends on several factors, such as the network structure, the learning algorithm, and the activation function used in each node. However, the emphasis in neural network research has been on learning algorithms and architectures, and the importance of activation functions has been less investigated [18,19,20]. The activation function determines the decision boundaries and the total input and output signal strength of a node. Activation functions also affect the complexity and performance of the network and the convergence of the learning algorithms [19,20,21]. Careful selection of the activation function therefore has a large impact on network performance.

The most popular activation functions adopted in LSTM blocks are the sigmoid (log-sigmoid) and the hyperbolic tangent. In other neural network architectures, however, different kinds of activation functions have been successfully applied. Among them are the complementary log–log, probit, and log–log functions [22], periodic functions [23], rational transfer functions [24], Hermite polynomials [25], non-polynomial functions [26, 27], Gaussian bars [28], new classes of sigmoidals [20, 29], and combinations of different functions such as polynomial, periodic, sigmoidal, and Gaussian [30]. These activation functions may also perform well when applied in LSTMs. In this paper, a total of 23 activation functions, including several of those just mentioned, are analyzed in LSTMs.

The properties that should generally be fulfilled by an activation function are as follows: the function should be continuous and bounded [31, 32], and it should either be sigmoidal [31, 32] or satisfy the following limit conditions [33]:

$$\lim_{x \to -\infty } f(x) = \alpha$$
(1)
$$\lim_{x \to +\infty } f(x) = \beta$$
(2)
$$\text{with } \alpha < \beta$$
(3)

The activation function’s monotonicity is not a compulsory requirement for the existence of the Universal Approximation Property (UAP) [32].
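
As a quick numerical illustration (not part of the original analysis), the limit conditions of Eqs. 1–3 can be checked for a candidate activation such as softsign:

```python
# Minimal sketch: numerically check Eqs. 1-3 (finite, distinct limits) and
# boundedness for a candidate activation (softsign here).
import numpy as np

def softsign(x):
    return x / (1.0 + np.abs(x))

alpha, beta = softsign(-1e12), softsign(1e12)   # numerical proxies for the limits
assert alpha < beta                             # Eq. 3
x = np.linspace(-1e6, 1e6, 1001)
assert np.all(np.abs(softsign(x)) < 1.0)        # bounded
print(f"alpha ~ {alpha:.6f}, beta ~ {beta:.6f}")
```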

In this paper, we investigate the effect of 23 different activation functions, employed in the input, output, and forget gates of the LSTM, on the classification performance of the network. To the best of our knowledge, this is the first study to aggregate a comprehensive set of activation functions and extensively compare them in LSTM networks. Using the IMDB and Movie Review data sets, the misclassification errors of LSTM networks with different structures and activation functions are compared. The results specifically show that the most commonly used activation functions in LSTMs do not yield the best network performance. Accordingly, the main contributions of this paper are as follows:

  1.

    Compiling an extensive list of applicable activation functions in LSTMs.

  2.

    Applying and analyzing different activation functions in three gates of an LSTM network for classification.

  3.

    Comparing the performance of LSTM networks with various activation functions and different numbers of blocks in the hidden layer.

The rest of the paper is organized as follows: in Sect. 2, the LSTM architecture and the activation functions are described. In Sect. 3, the experimental results are reported and discussed. The conclusion is presented in Sect. 4.

2 System model

In this section, the LSTM architecture and the activation functions employed in the network are described.

2.1 LSTM architecture

We use a basic LSTM with a single hidden layer, followed by average pooling and a logistic regression output layer for classification. The LSTM architecture, illustrated in Fig. 1, has three parts, namely the input layer, a single hidden layer, and the output layer. The hidden layer consists of single-cell blocks, each a set of recurrently connected units. At time t, the input vector \(x_{t}\) is fed into the network. The elements of each block are defined by Eqs. 4–9.

Fig. 1

The LSTM architecture consisting of the input layer, a single hidden layer, and the output layer [2]

$$f_{t} = \sigma \left( {W_{f} x_{t} + U_{f} h_{t - 1} + b_{f} } \right)$$
(4)
$$i_{t} = \sigma \left( {W_{i} x_{t} + U_{i} h_{t - 1} + b_{i} } \right)$$
(5)
$$o_{t} = \sigma \left( {W_{o} x_{t} + U_{o} h_{t - 1} + b_{o} } \right)$$
(6)
$$\tilde{C}_{t} = \tanh \left( {W_{C} x_{t} + U_{C} h_{t - 1} + b_{C} } \right)$$
(7)
$$C_{t} = f_{t } \odot C_{t - 1 } + i_{t } \odot \tilde{C}_{t }$$
(8)
$$h_{t} = o_{t} \odot \tanh \left( C_{t} \right)$$
(9)

The forget, input, and output gates of each LSTM block are defined by Eqs. 4–6, respectively, where \(f_{t}\), \(i_{t}\), and \(o_{t}\) denote the forget, input, and output gates. The input gate decides which values should be updated, the forget gate allows forgetting and discarding of information, and the output gate together with the block output selects the outgoing information at time t. \(\tilde{C}_{t}\), defined in Eq. 7, is the block input at time t, computed by a tanh layer; together with the input gate, it determines the new information to be stored in the cell state. \(C_{t}\) is the cell state at time t, which is updated from the old cell state (Eq. 8). Finally, \(h_{t}\) is the block output at time t.

The LSTM block is illustrated in Fig. 2. The three gates (input, forget, and output) and the block input and block output activation functions are displayed in the figure. The output of the block is recurrently connected back to the block input and to all of the gates. \(W\) and \(U\) are weight matrices, and \(b\) denotes the bias vectors. The \(\odot\) sign denotes point-wise multiplication of two vectors. Functions \(\sigma\) and \(\tanh\) are the point-wise logistic sigmoid and hyperbolic tangent activation functions, respectively.

Fig. 2

A single LSTM block with tanh block input and output and with the sigmoidal gates shown with \(\sigma\) [2]. The \(\odot\) sign is the point-wise multiplication
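
To make Eqs. 4–9 concrete, the following is a minimal NumPy sketch of a single LSTM step with a pluggable gate activation. The weight shapes, the initialization, and the `gate_act` parameter are illustrative assumptions, not the exact implementation used in our experiments.

```python
# Minimal NumPy sketch of one LSTM step (Eqs. 4-9); gate_act is the sigmoidal
# gate activation that later sections replace with the functions of Table 1.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params, gate_act=sigmoid):
    W_f, U_f, b_f = params["f"]                         # forget-gate parameters
    W_i, U_i, b_i = params["i"]                         # input-gate parameters
    W_o, U_o, b_o = params["o"]                         # output-gate parameters
    W_C, U_C, b_C = params["C"]                         # block-input parameters

    f_t = gate_act(W_f @ x_t + U_f @ h_prev + b_f)      # Eq. 4
    i_t = gate_act(W_i @ x_t + U_i @ h_prev + b_i)      # Eq. 5
    o_t = gate_act(W_o @ x_t + U_o @ h_prev + b_o)      # Eq. 6
    C_tilde = np.tanh(W_C @ x_t + U_C @ h_prev + b_C)   # Eq. 7
    C_t = f_t * C_prev + i_t * C_tilde                  # Eq. 8 (point-wise products)
    h_t = o_t * np.tanh(C_t)                            # Eq. 9
    return h_t, C_t

# Example with random (hypothetical) parameters: input size 3, hidden size 4.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
params = {k: (0.1 * rng.standard_normal((n_hid, n_in)),
              0.1 * rng.standard_normal((n_hid, n_hid)),
              np.zeros(n_hid)) for k in ("f", "i", "o", "C")}
h, C = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), params)
```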

2.2 Activation functions

Three main aspects of a neural network play important roles in its performance: the network architecture and the pattern of connections between units, the learning algorithm, and the activation functions used in the network. Most research on neural networks has focused on the learning algorithm, whereas the importance of the activation functions has been largely neglected [18,19,20].

We analyze the LSTM network in this paper by changing the activation functions of the forget, input, and output gates (sigmoidal gates of Eqs. 4, 5, and 6). We compare 23 different activation functions in terms of their effect on network performance when employed in sigmoidal gates of a basic LSTM block for classification.

The sigmoid and hyperbolic tangent functions are the most popular activation functions used in neural networks. However, some individual studies have considered other activation functions. We have compiled a comprehensive list of 23 such functions, shown in Table 1 and discussed below. We experimentally observed that adding a value of 0.5 to some functions makes them applicable as activation functions in the network. Changing the range of activation functions has also been explored in previous work [41].

Table 1 Label, definition, corresponding derivative and range of each activation function

In Table 1, the first activation function is the Aranda-Ordaz function, introduced as an activation by Gomes et al. [18] and labeled Aranda. The second to fifth functions are the bimodal activation functions proposed by Singh et al. [36], labeled Bi-sig1, Bi-sig2, Bi-tanh1, and Bi-tanh2, respectively. The sixth function is the complementary log–log [22]. The next function is a modified version of cloglog, named cloglogm [21]. Next come the Elliott, Gaussian, logarithmic, and loglog functions. The 12th function is a modified logistic sigmoid proposed by Singh and Chandra [20], labeled logsigm. The logistic sigmoid, referred to as log-sigmoid, comes next, followed by the modified Elliott function. The 15th function is a sigmoid function with roots [19], called rootsig. The 16th to 19th functions are the Saturated function, the hyperbolic secant (Sech), and two modified sigmoidals labeled sigmoidalm and sigmoidalm2. The tunable activation function proposed by Yuan et al. [37], labeled sigt, is the 20th function. Next is a skewed derivative activation function proposed by Chandra et al. [38], labeled skewed-sig. The softsign function proposed by Elliott [39] and the wave function proposed by Hara and Nakayama [40] come last. Some other activation functions, such as the rectifier [43], were also applied in the network but turned out to be ineffective due to the exploding gradient problem.
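
For illustration, a few of the functions discussed above can be written as follows. The authoritative definitions and ranges are those in Table 1; the expressions below follow commonly cited forms, with the +0.5 shift mentioned above applied where the listed range is [− 0.5, 1.5], and the modified Elliott form is our assumption.

```python
# Illustrative definitions of a few activations from Table 1; see the table for
# the authoritative forms, derivatives, and ranges.
import numpy as np

def log_sigmoid(x):                 # logistic sigmoid, range [0, 1]
    return 1.0 / (1.0 + np.exp(-x))

def elliott(x):                     # Elliott [39], range [0, 1]
    return 0.5 * x / (1.0 + np.abs(x)) + 0.5

def softsign_shifted(x):            # softsign + 0.5, range [-0.5, 1.5]
    return x / (1.0 + np.abs(x)) + 0.5

def modified_elliott_shifted(x):    # assumed form x / sqrt(1 + x^2) + 0.5, range [-0.5, 1.5]
    return x / np.sqrt(1.0 + x * x) + 0.5

def cloglog(x):                     # complementary log-log [22], range [0, 1]
    return 1.0 - np.exp(-np.exp(x))
```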

2.3 Methodology

To evaluate the effect of different activation functions on classification performance, we vary the activation of the input, output, and forget gates, which we refer to as the sigmoidal gates, and keep the tanh units unchanged. In each configuration, all three sigmoidal gates use the same function, chosen from the set introduced in Table 1.

To train the network, the backpropagation through time (BPTT) algorithm [34] is used with either ADADELTA [35] or RMSprop [44] as the optimization method. ADADELTA is a gradient descent-based learning algorithm proposed as an improvement over Adagrad [45]; it adapts the learning rate of each parameter over time. RMSprop is also an extension of Adagrad that addresses its radically diminishing learning rates. Both optimization methods are popular for LSTM networks and achieve fast convergence [46, 47].
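
For reference, the two per-parameter update rules can be sketched as follows; this is a simplified NumPy version, and the hyperparameter values shown are common defaults rather than our exact settings (except where stated in Sect. 3).

```python
# Simplified per-parameter update rules for RMSprop and ADADELTA.
import numpy as np

def rmsprop_update(param, grad, state, lr=0.001, rho=0.9, eps=1e-6):
    # Running average of squared gradients adapts the step size per parameter.
    state["Eg2"] = rho * state["Eg2"] + (1.0 - rho) * grad ** 2
    return param - lr * grad / np.sqrt(state["Eg2"] + eps)

def adadelta_update(param, grad, state, rho=0.95, eps=1e-6):
    # ADADELTA additionally accumulates past updates, removing the global learning rate.
    state["Eg2"] = rho * state["Eg2"] + (1.0 - rho) * grad ** 2
    delta = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
    state["Edx2"] = rho * state["Edx2"] + (1.0 - rho) * delta ** 2
    return param + delta

# The running averages start at zero, one pair per parameter tensor.
state = {"Eg2": np.zeros(3), "Edx2": np.zeros(3)}
```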

The mini-batch method is used for the training and test phases. The network is trained and tested three times for each activation function with the same training and test data. The initial network weights and the batches are chosen randomly in each experiment. The error interval is reported over the three experiments of each configuration. The training and validation errors are measured at the end of each batch. Dropout with a probability of 0.5 is used to prevent overfitting [48]. The network is trained until the training classification error is low and approximately constant and the validation error has been stable for 10 consecutive batches; the test errors at this stage are reported.
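
A minimal sketch of this stopping rule is given below; the tolerance value is an assumption on our part, used only to make the notion of a "stable" validation error concrete.

```python
# Hypothetical helper: the validation error is considered stable when its
# spread over the last `window` measurements stays within `tol`.
def validation_stable(val_errors, window=10, tol=1e-3):
    if len(val_errors) < window:
        return False
    recent = val_errors[-window:]
    return max(recent) - min(recent) <= tol
```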

3 Experimental results

To analyze the performance of the LSTM network, two sets of experiments are designed with different types of data sets. In both sets of experiments, different LSTM architectures are evaluated, and in each configuration the input, output, and forget gates of the LSTM blocks use an identical activation function from Table 1. In what follows, we describe the results of the analysis.

3.1 First set of experiments

In the first set of experiments, we use two movie review data sets. The first one [49] is referred to as Movie Review in this paper, and the other is the IMDB large movie review data set [50]. The Movie Review data set consists of 10,662 review sentences, with equal numbers of positive and negative examples. From this data set, we use a total of 8162 sentences for training, and the rest are used in the test phase. Both sets contain equal numbers of positive and negative sentences. From the IMDB data set, we use 2000 sentences for training the network (with 5% held out as a validation set) and 500 sentences for testing. Again, the numbers of positive and negative sentences are equal, and the sentences have a maximum length of 100 words.

The mini-batch method is used, with the batch sizes for the training and test phases set to 16 and 64, respectively. The batch sizes were chosen experimentally for better performance. We use the backpropagation through time (BPTT) algorithm with ADADELTA as the optimization method, with the epsilon parameter set to 1e−6. Hyperparameters are not tuned separately for each LSTM configuration and are identical in all experiments. The test misclassification error is used to rank the activation functions. Each experiment is repeated three times.

Table 2 shows the average training error values on the Movie Review data set for different configurations. In this set of experiments, the number of LSTM blocks in the hidden layer increases exponentially from 2 to 64, and the number of epochs in each run is set to 20 (in each epoch, all the training data are presented to the network in mini-batches). For each activation function, the average test error of the configuration that produced the lowest training error is shown in the last column. The test errors for all configurations are presented in the “Appendix.” As observed, on this data set the modified Elliott function has the lowest average test error (22.52%). Overall, the modified Elliott (with range [− 0.5, 1.5]), cloglogm ([− 0.5, 1.5]), softsign ([− 0.5, 1.5]), logsigm ([0.5, 1.5]), and saturated ([− 0.5, 1.5]) functions yield the lowest average error values: 22.52%, 22.62%, 22.85%, 23.01%, and 23.06%, respectively. The optimal number of LSTM blocks in the hidden layer when using modified Elliott was 16, while for cloglogm, softsign, logsigm, and saturated it was 4, 16, 16, and 32, respectively. Interestingly, log-sigmoid ranks 17th among all functions, with an average error of 23.52%. For the Movie Review data set, training errors are negatively correlated with the number of units for most functions (although the correlations are mostly weak). Most activations perform poorly with a very small number of units (e.g., 2), but as the number of units increases, the errors become less sensitive to it, and the standard deviation of the error values for most activations is below 2.

Table 2 Average train errors per each activation function for the Movie Review data set

The training error values for the IMDB data set are shown in Table 3, along with the average test error for the best configuration of each activation. The number of LSTM blocks in the hidden layer was increased exponentially from 4 to 256, and the number of epochs in each run was 50. On this data set, as for the first data set, modified Elliott has the lowest average error (12.46%). After modified Elliott (with range [− 0.5, 1.5]), the saturated ([− 0.5, 1.5]), cloglogm ([− 0.5, 1.5]), and softsign ([− 0.5, 1.5]) functions have the lowest average error values of 13.06%, 13.13%, and 13.13%, respectively. The optimal number of LSTM blocks in the hidden layer for modified Elliott was 256, while for cloglogm, saturated, and softsign it was 128 in each case. Interestingly again, log-sigmoid does not appear among the four best results; it ranks 10th out of 23 with an average error of 13.6%. For the IMDB data set, no clear pattern of correlation is observed between the error values and the number of units.

Table 3 Average train errors per each activation function for the IMDB data set

Most functions produced their best training results with 16 and 256 blocks for the Movie Review and IMDB data sets, respectively. The average test and training errors and the average number of convergence epochs for the Movie Review and IMDB data sets are presented in Tables 4 and 5, respectively. Note that in each epoch all of the training data are presented to the network in a sequence of mini-batches; the average number of actual iterations is reported in parentheses. To test the significance of the results, ANOVA tests were conducted on the results for 16 and 256 blocks for the Movie Review and IMDB data sets, respectively. The obtained p values were 3.33e−6 and 7.61e−11; both are well below 0.05, indicating that the differences are significant.

Table 4 Average test and train errors for Movie Review data set, and average number of convergence epochs for 16 blocks in hidden layer of LSTM
Table 5 Average test and train errors for IMDB data set, and average number of convergence epochs for 256 blocks in hidden layer of LSTM
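
The significance test amounts to a one-way ANOVA over the repeated-run errors of each activation for a fixed number of blocks. The sketch below uses hypothetical error values purely to show the procedure; the actual per-run errors are those behind Tables 4 and 5.

```python
# Hypothetical sketch of the one-way ANOVA over per-activation error samples.
from scipy.stats import f_oneway

errors_by_activation = {
    "modified Elliott": [22.4, 22.6, 22.6],   # illustrative numbers only
    "log-sigmoid":      [23.4, 23.5, 23.7],
    # ... one entry (three runs) per activation in Table 1
}
f_stat, p_value = f_oneway(*errors_by_activation.values())
print(f"p = {p_value:.2e}")   # p < 0.05 indicates a significant difference
```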

The results show that for both data sets the modified Elliott activation function has the best performance. For large data sets, using this activation function may require tuning the hyperparameters, for example using a smaller epsilon parameter in the ADADELTA optimization method. Interestingly, the log-sigmoid activation function, which is commonly used in neural networks and in LSTM networks, does not produce the best results, and the modified Elliott function performs better when employed in the sigmoidal gates. Additionally, it was observed that sigmoidal activations with a range of [− 0.5, 1.5] result in a more accurate LSTM network than those with a range of [0, 1]. The maximum sentence lengths for the Movie Review and IMDB data sets used in the experiments were 64 and 100 words, respectively. When applied to the IMDB data set, the LSTM network required more hidden blocks and more epochs per run, which can be attributed to the greater complexity of this data set.

The error levels measured in the current study are consistent with other studies in the literature. Lenc and Hercig [51] report a 38.3% error for classification of Movie Review with an LSTM, and Dai and Le [52] report an error of 13.5% for classification of the IMDB data with an LSTM. The overall error difference across all functions is at most 2% and 5% for the Movie Review and IMDB data sets, respectively. In our experiments, the difference between the best measured error values of the modified Elliott function and the popular log-sigmoid function is 0.36 and 1.14 for the Movie Review and IMDB data sets, respectively. Although small, these differences can be meaningful depending on the application.

3.2 Second set of experiments

For the second set of experiments, we use the MNIST data set of handwritten digits. The mini-batch method is again used, with the batch size for both the training and test phases set to 128. The batch sizes were chosen experimentally. We use RMSprop as the optimization method, with the learning rate set to 0.001.

The MNIST data set of handwritten digits has a training set of 60,000 examples (of which 5000 are used for validation) and a test set of 10,000 examples. Each image is 28 × 28 pixels. We use one-hot encoding to predict the 10 digits (0–9), i.e., 10 classes.
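
The conversion of images into input sequences is not detailed here; a common choice, assumed in the sketch below, is to treat each 28-pixel row as one timestep (28 timesteps of 28 features) and to one-hot encode the labels.

```python
# Assumed MNIST preprocessing: rows as timesteps, labels one-hot encoded.
import numpy as np

def image_to_sequence(img):
    # 28x28 image -> sequence of 28 timesteps, each a 28-dimensional vector.
    return img.reshape(28, 28).astype(np.float32) / 255.0

def one_hot(label, num_classes=10):
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[label] = 1.0
    return vec
```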

Table 6 shows the average training error values for the MNIST data set, with the test error reported for the best configuration of each activation function. In this set of experiments, two configurations of 64 and 128 LSTM blocks in the hidden layer are considered, and the number of epochs for each run is set to six. As observed, on this data set Elliott and softsign have the lowest average error (1.66%). Overall, the softsign (with range [− 0.5, 1.5]), Elliott ([0, 1]), rootsig ([− 0.5, 1.5]), Bi-tanh2 ([− 0.5, 1.5]), Gaussian ([0, 1]), Bi-sig1 ([0, 1]), Bi-sig2 ([0, 1]), and modified Elliott ([− 0.5, 1.5]) functions yield the lowest average error values: 1.66%, 1.66%, 1.9%, 1.93%, 1.96%, 2%, 2.03%, and 2.03%, respectively. The optimal number of LSTM blocks in the hidden layer for softsign, Elliott, rootsig, Bi-tanh2, Gaussian, Bi-sig1, Bi-sig2, and modified Elliott was 128, and most of the activation functions worked better with this number of blocks. Interestingly, log-sigmoid ranks 10th with an average error of 2.16%.

Table 6 Average train errors per each activation function for the MNIST data set

The average test and training errors and the average number of convergence epochs for the MNIST data set with 128 blocks in the hidden layer are presented in Table 7. The ANOVA p value for all experiments with 128 blocks is 3.7e−21, which indicates that the results are significant. The error levels are consistent with the study of Arjovsky et al. [53], which reports a classification error of 1.8% for MNIST data with an LSTM.

Table 7 Average test and train errors for MNIST data set, and average number of convergence epochs for 128 blocks in hidden layer of LSTM

3.3 Discussion

In this paper, we aggregated a list of 23 applicable activation functions that can be used in place of the sigmoidal gates in an LSTM network. We compared the classification performance of the network using these functions with different numbers of hidden blocks. The results showed the following:

  1.

    Overall, the results on both data sets suggest that less-recognized activation functions (such as Elliott, modified Elliott, and softsign which are interestingly all in the Elliott family) can produce more promising results compared to the common functions in the literature.

  2.

    Activation functions with a range of [− 0.5, 1.5] generally produced better results, indicating that a wider codomain (than the sigmoid range of [0, 1]) can yield better performance.

  3.

    The log-sigmoid activation function, which is most commonly used in LSTM blocks, produces weak results compared to several other activation functions.

Burhani et al. [41], in their study on denoising autoencoders, reported a similar result: the modified Elliott function performs better and yields lower error than the log-sigmoid activation function. In addition, in the first set of experiments we found cloglogm to be the second best activation, which is consistent with Gomes et al. [21], who report that cloglogm performs well for forecasting financial time series. The top activations (the Elliott family and cloglogm) along with the popular log-sigmoid activation are displayed in Fig. 3. As the figure shows, modified Elliott, softsign, and cloglogm are much steeper than log-sigmoid around zero and also have a wider range. Figure 4 compares the average error values of these five functions for the Movie Review, IMDB, and MNIST data sets with 16, 256, and 128 blocks, respectively.

Fig. 3

The modified Elliott, cloglogm, log-sigmoid, softsign, and Elliott activation functions

Fig. 4

Comparison of the average error values for the cloglogm, Elliott, log-sigmoid, modified Elliott, and softsign activation functions for MNIST, IMDB and Movie Review data sets with 128, 256, and 16 blocks, respectively

There are two widely known issues in training recurrent neural networks: the vanishing and exploding gradient problems [54]. LSTM networks alleviate the vanishing gradient problem through their special design; the exploding gradient problem can, however, still occur. A gradient norm clipping strategy was proposed by Pascanu et al. [55] to deal with exploding gradients. Gradient clipping is a technique to prevent exploding gradients in very deep networks; a common method is to rescale the gradients of a parameter vector when its L2 norm exceeds a certain threshold [55]. Although we did not perform gradient norm clipping when training the LSTM network, the method suggests that the exploding gradient problem is closely related to the norm of the gradient matrix and that smaller norms are preferable.
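
For reference, the clipping rule of Pascanu et al. [55] can be sketched as follows (it was not used in our training, and the threshold value here is arbitrary):

```python
# Gradient norm clipping: rescale the gradient when its L2 norm exceeds a threshold.
import numpy as np

def clip_gradient_norm(grad, threshold=5.0):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```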

We evaluated the norm of the gradient matrix in the second set of experiments and, interestingly, observed that the norm of the gradient matrix for the Elliott activation was low, in fact the second lowest among all the activations. The norms of the gradient matrix after convergence are presented in Table 8. As observed, the norm of the gradient matrix for most of the activation functions achieving lower classification errors is quite low (less than 0.1).

Table 8 Norm of gradient matrix for MNIST data set in increasing order

The tanh function is one of the most popular activation functions and is widely used in LSTM networks [2]. From a conceptual point of view, the two tanh activations in an LSTM block squash the block input and output and can be considered to have a different role from the three gates. However, they can still have a significant effect on the overall network performance and can be replaced by other activations that fulfill the properties mentioned in Sect. 1. Such a change will specifically affect the gradients, ranges, and derivatives of the activation functions and blocks. Analyzing the effect of other activation functions when used in place of the tanh activations is left for future work.

Some follow-up studies have proposed modifications to the original LSTM architecture; evaluating different activation functions on these architectures would be an interesting future study. Gers and Schmidhuber [56] introduced peephole connections that pass directly from the internal state to the input and output gates of a node. According to their observations, these connections improve performance on timing tasks where the network must learn to measure precise intervals between events. Another line of research concerns alternative and similar architectures that are popular alongside the LSTM. The bidirectional recurrent neural network (BRNN) was first proposed by Schuster and Paliwal [57]. This architecture involves two layers of hidden nodes, both of which are connected to the input and output; the first hidden layer has recurrent connections from past time steps, and in the second the direction of the recurrent connections is flipped. The gated recurrent unit (GRU) was proposed by Cho et al. [58] to allow each recurrent unit to adaptively capture dependencies at different time scales. These modifications can improve network performance.

4 Conclusions

In LSTM blocks, the two most popular activation functions are the logistic sigmoid and the hyperbolic tangent. In this study, we evaluated the performance of an LSTM network with 23 different activation functions that can be used in place of the sigmoidal gates. We varied the number of hidden blocks in the network and employed three different data sets for classification. The results showed that some less-recognized activations, such as the Elliott function and its modifications, can yield lower error levels than the most popular functions.

More research is needed on other parts and details of an LSTM network, such as the effect of replacing the hyperbolic tangent functions used for the block input and block output. Variants of the LSTM network can also be analyzed. Additionally, larger data sets and different tasks can be employed to further analyze network performance under different configurations.