1 Introduction

LSTM, introduced by Hochreiter and Schmidhuber [1], is a recurrent neural network (RNN) architecture shown to be effective for a variety of learning problems, especially those involving sequential data [2]. The LSTM architecture contains blocks, each a set of recurrently connected units. In standard RNNs, the gradient of the error function can grow or decay exponentially over time; the decaying case is known as the vanishing gradient problem. In LSTMs, the network units are redesigned to alleviate this problem. Each LSTM block consists of one or more self-connected memory cells along with input, forget, and output multiplicative gates. The gates allow the memory cells to store and access information over longer time periods, improving performance [2].

LSTMs and bidirectional LSTMs [3] have been successfully applied to various tasks, especially classification. Applications of these networks include online handwriting recognition [4, 5], phoneme classification [2, 3, 6], and online mode detection [7]. LSTMs are also employed in speech processing, for generation [8], translation [9], emotion recognition [10], acoustic modeling [11], and synthesis [12]. These networks are further used for language modeling [13], protein structure prediction [14], analysis of audio and video data [15, 16], and human behavior analysis [17].

Generally, the behavior of a neural network depends on several factors, such as the network structure, the learning algorithm, and the activation function used in each node. However, the emphasis in neural network research has been on learning algorithms and architectures, and the importance of activation functions has been less investigated [18,19,20]. The activation function determines the decision boundaries and the total input and output signal strength of a node. Activation functions also affect the complexity and performance of the network and the convergence of the learning algorithms [19,20,21]. Careful selection of the activation function therefore has a large impact on network performance.

The most popular activation functions adopted in LSTM blocks are the sigmoid (log-sigmoid) and the hyperbolic tangent. In other neural network architectures, however, different kinds of activation functions have been successfully applied. Among them are the complementary log–log, probit, and log–log functions [22], periodic functions [23], rational transfer functions [24], Hermite polynomials [25], non-polynomial functions [26, 27], Gaussian bars [28], new classes of sigmoidals [20, 29], and combinations of different functions such as polynomial, periodic, sigmoidal, and Gaussian [30]. These activation functions may also perform well when applied in LSTMs. In this paper, a total of 23 activation functions, including several of those just mentioned, are analyzed in LSTMs.

The properties that should generally be fulfilled by an activation function are as follows: the function should be continuous and bounded [31, 32], and it should either be sigmoidal [31, 32] or satisfy the following limit conditions [33]:

$$\lim_{x \to -\infty } f(x) = \alpha$$
(1)
$$\lim_{x \to +\infty } f(x) = \beta$$
(2)
$$\text{with } \alpha < \beta$$
(3)

The activation function’s monotonicity is not a compulsory requirement for the existence of the Universal Approximation Property (UAP) [32].
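
As a quick numerical illustration (not part of the original analysis), the limit conditions of Eqs. 1–3 can be checked for a candidate activation such as softsign:

```python
# Minimal sketch: numerically check Eqs. 1-3 (finite, distinct limits) and
# boundedness for a candidate activation (softsign here).
import numpy as np

def softsign(x):
    return x / (1.0 + np.abs(x))

alpha, beta = softsign(-1e12), softsign(1e12)   # numerical proxies for the limits
assert alpha < beta                             # Eq. 3
x = np.linspace(-1e6, 1e6, 1001)
assert np.all(np.abs(softsign(x)) < 1.0)        # bounded
print(f"alpha ~ {alpha:.6f}, beta ~ {beta:.6f}")
```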

In this paper, we investigate the effect of 23 different activation functions, employed in the input, output, and forget gates of the LSTM, on the classification performance of the network. To the best of our knowledge, this is the first study to aggregate a comprehensive set of activation functions and extensively compare them in LSTM networks. Using the IMDB and Movie Review data sets, the misclassification errors of LSTM networks with different structures and activation functions are compared. The results specifically show that the most commonly used activation functions in LSTMs do not yield the best network performance. Accordingly, the main contributions of this paper are as follows:

  1.

    Compiling an extensive list of applicable activation functions in LSTMs.

  2.

    Applying and analyzing different activation functions in three gates of an LSTM network for classification.

  3.

    Comparing the performance of LSTM networks with various activation functions and different numbers of blocks in the hidden layer.

The rest of the paper is organized as follows: in Sect. 2, the LSTM architecture and the activation functions are described. In Sect. 3, the experimental results are reported and discussed. The conclusion is presented in Sect. 4.

2 System model

In this section, the LSTM architecture and the activation functions employed in the network are described.

2.1 LSTM architecture

We use a basic LSTM with a single hidden layer, followed by average pooling and a logistic regression output layer for classification. The LSTM architecture, illustrated in Fig. 1, has three parts, namely the input layer, a single hidden layer, and the output layer. The hidden layer consists of single-cell blocks, each a set of recurrently connected units. At time t, the input vector \(x_{t}\) is fed into the network. The elements of each block are defined by Eqs. 4–9.

Fig. 1

The LSTM architecture consisting of the input layer, a single hidden layer, and the output layer [2]

$$f_{t} = \sigma \left( {W_{f} x_{t} + U_{f} h_{t - 1} + b_{f} } \right)$$
(4)
$$i_{t} = \sigma \left( {W_{i} x_{t} + U_{i} h_{t - 1} + b_{i} } \right)$$
(5)
$$o_{t} = \sigma \left( {W_{o} x_{t} + U_{o} h_{t - 1} + b_{o} } \right)$$
(6)
$$\tilde{C}_{t} = \tanh \left( {W_{C} x_{t} + U_{C} h_{t - 1} + b_{C} } \right)$$
(7)
$$C_{t} = f_{t } \odot C_{t - 1 } + i_{t } \odot \tilde{C}_{t }$$
(8)
$$h_{t} = o_{t} \odot \tanh \left( C_{t} \right)$$
(9)

The forget, input, and output gates of each LSTM block are defined by Eqs. 4–6, respectively, where \(f_{t}\), \(i_{t}\), and \(o_{t}\) denote the forget, input, and output gates. The input gate decides which values should be updated, the forget gate allows forgetting and discarding of information, and the output gate together with the block output selects the outgoing information at time t. \(\tilde{C}_{t}\), defined in Eq. 7, is the block input at time t, computed by a tanh layer; together with the input gate, it determines the new information to be stored in the cell state. \(C_{t}\) is the cell state at time t, which is updated from the old cell state (Eq. 8). Finally, \(h_{t}\) is the block output at time t.

The LSTM block is illustrated in Fig. 2. The three gates (input, forget, and output) and the block input and block output activation functions are displayed in the figure. The output of the block is recurrently connected back to the block input and to all of the gates. \(W\) and \(U\) are weight matrices, and \(b\) denotes the bias vectors. The \(\odot\) sign denotes point-wise multiplication of two vectors. Functions \(\sigma\) and \(\tanh\) are the point-wise logistic sigmoid and hyperbolic tangent activation functions, respectively.

Fig. 2

A single LSTM block with tanh block input and output and with the sigmoidal gates shown with \(\sigma\) [2]. The \(\odot\) sign is the point-wise multiplication
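
To make Eqs. 4–9 concrete, the following is a minimal NumPy sketch of a single LSTM step with a pluggable gate activation. The weight shapes, the initialization, and the `gate_act` parameter are illustrative assumptions, not the exact implementation used in our experiments.

```python
# Minimal NumPy sketch of one LSTM step (Eqs. 4-9); gate_act is the sigmoidal
# gate activation that later sections replace with the functions of Table 1.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params, gate_act=sigmoid):
    W_f, U_f, b_f = params["f"]                         # forget-gate parameters
    W_i, U_i, b_i = params["i"]                         # input-gate parameters
    W_o, U_o, b_o = params["o"]                         # output-gate parameters
    W_C, U_C, b_C = params["C"]                         # block-input parameters

    f_t = gate_act(W_f @ x_t + U_f @ h_prev + b_f)      # Eq. 4
    i_t = gate_act(W_i @ x_t + U_i @ h_prev + b_i)      # Eq. 5
    o_t = gate_act(W_o @ x_t + U_o @ h_prev + b_o)      # Eq. 6
    C_tilde = np.tanh(W_C @ x_t + U_C @ h_prev + b_C)   # Eq. 7
    C_t = f_t * C_prev + i_t * C_tilde                  # Eq. 8 (point-wise products)
    h_t = o_t * np.tanh(C_t)                            # Eq. 9
    return h_t, C_t

# Example with random (hypothetical) parameters: input size 3, hidden size 4.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
params = {k: (0.1 * rng.standard_normal((n_hid, n_in)),
              0.1 * rng.standard_normal((n_hid, n_hid)),
              np.zeros(n_hid)) for k in ("f", "i", "o", "C")}
h, C = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), params)
```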

2.2 Activation functions

Three main aspects of a neural network play important roles in its performance: the network architecture and the pattern of connections between units, the learning algorithm, and the activation functions used in the network. Most research on neural networks has focused on the learning algorithm, whereas the importance of the activation functions has been largely neglected [18,19,20].

We analyze the LSTM network in this paper by changing the activation functions of the forget, input, and output gates (sigmoidal gates of Eqs. 4, 5, and 6). We compare 23 different activation functions in terms of their effect on network performance when employed in sigmoidal gates of a basic LSTM block for classification.

The sigmoid and hyperbolic tangent functions are the most popular activation functions used in neural networks. However, some individual studies have considered other activation functions. We have compiled a comprehensive list of 23 such functions, shown in Table 1 and discussed below. We experimentally observed that adding a value of 0.5 to some functions makes them applicable as activation functions in the network. Changing the range of activation functions has also been explored in previous work [41].

Table 1 Label, definition, corresponding derivative and range of each activation function

In Table 1, the first activation function is the Aranda-Ordaz function, introduced as an activation by Gomes et al. [18] and labeled Aranda. The second to fifth functions are the bimodal activation functions proposed by Singh et al. [36], labeled Bi-sig1, Bi-sig2, Bi-tanh1, and Bi-tanh2, respectively. The sixth function is the complementary log–log [22]. The next function is a modified version of cloglog, named cloglogm [21]. Next come the Elliott, Gaussian, logarithmic, and loglog functions. The 12th function is a modified logistic sigmoid proposed by Singh and Chandra [20], labeled logsigm. The logistic sigmoid, referred to as log-sigmoid, comes next, followed by the modified Elliott function. The 15th function is a sigmoid function with roots [19], called rootsig. The 16th to 19th functions are the Saturated function, the hyperbolic secant (Sech), and two modified sigmoidals labeled sigmoidalm and sigmoidalm2. The tunable activation function proposed by Yuan et al. [37], labeled sigt, is the 20th function. Next is a skewed derivative activation function proposed by Chandra et al. [38], labeled skewed-sig. The softsign function proposed by Elliott [39] and the wave function proposed by Hara and Nakayama [40] come last. Some other activation functions, such as the rectifier [43], were also applied in the network but turned out to be ineffective due to the exploding gradient problem.
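
For illustration, a few of the functions discussed above can be written as follows. The authoritative definitions and ranges are those in Table 1; the expressions below follow commonly cited forms, with the +0.5 shift mentioned above applied where the listed range is [− 0.5, 1.5], and the modified Elliott form is our assumption.

```python
# Illustrative definitions of a few activations from Table 1; see the table for
# the authoritative forms, derivatives, and ranges.
import numpy as np

def log_sigmoid(x):                 # logistic sigmoid, range [0, 1]
    return 1.0 / (1.0 + np.exp(-x))

def elliott(x):                     # Elliott [39], range [0, 1]
    return 0.5 * x / (1.0 + np.abs(x)) + 0.5

def softsign_shifted(x):            # softsign + 0.5, range [-0.5, 1.5]
    return x / (1.0 + np.abs(x)) + 0.5

def modified_elliott_shifted(x):    # assumed form x / sqrt(1 + x^2) + 0.5, range [-0.5, 1.5]
    return x / np.sqrt(1.0 + x * x) + 0.5

def cloglog(x):                     # complementary log-log [22], range [0, 1]
    return 1.0 - np.exp(-np.exp(x))
```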

2.3 Methodology

To evaluate the effect of different activation functions on classification performance, we vary the activation of the input, output, and forget gates, which we refer to as the sigmoidal gates, and keep the tanh units unchanged. In each configuration, all three sigmoidal gates use the same function, chosen from the set introduced in Table 1.

To train the network, the backpropagation through time (BPTT) algorithm [34] is used with either ADADELTA [35] or RMSprop [44] as the optimization method. ADADELTA is a gradient descent-based learning algorithm proposed as an improvement over Adagrad [45]; it adapts the learning rate of each parameter over time. RMSprop is also an extension of Adagrad that addresses its radically diminishing learning rates. Both optimization methods are popular for LSTM networks and achieve fast convergence [46, 47].
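
For reference, the two per-parameter update rules can be sketched as follows; this is a simplified NumPy version, and the hyperparameter values shown are common defaults rather than our exact settings (except where stated in Sect. 3).

```python
# Simplified per-parameter update rules for RMSprop and ADADELTA.
import numpy as np

def rmsprop_update(param, grad, state, lr=0.001, rho=0.9, eps=1e-6):
    # Running average of squared gradients adapts the step size per parameter.
    state["Eg2"] = rho * state["Eg2"] + (1.0 - rho) * grad ** 2
    return param - lr * grad / np.sqrt(state["Eg2"] + eps)

def adadelta_update(param, grad, state, rho=0.95, eps=1e-6):
    # ADADELTA additionally accumulates past updates, removing the global learning rate.
    state["Eg2"] = rho * state["Eg2"] + (1.0 - rho) * grad ** 2
    delta = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
    state["Edx2"] = rho * state["Edx2"] + (1.0 - rho) * delta ** 2
    return param + delta

# The running averages start at zero, one pair per parameter tensor.
state = {"Eg2": np.zeros(3), "Edx2": np.zeros(3)}
```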

The mini-batch method is used for the training and test phases. The network is trained and tested three times for each activation function with the same training and test data. The initial network weights and the batches are chosen randomly in each experiment. The error interval is reported over the three experiments of each configuration. The training and validation errors are measured at the end of each batch. Dropout with a probability of 0.5 is used to prevent overfitting [48]. The network is trained until the training classification error is low and approximately constant and the validation error has been stable for 10 consecutive batches; the test errors at this stage are reported.
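
A minimal sketch of this stopping rule is given below; the tolerance value is an assumption on our part, used only to make the notion of a "stable" validation error concrete.

```python
# Hypothetical helper: the validation error is considered stable when its
# spread over the last `window` measurements stays within `tol`.
def validation_stable(val_errors, window=10, tol=1e-3):
    if len(val_errors) < window:
        return False
    recent = val_errors[-window:]
    return max(recent) - min(recent) <= tol
```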

3 Experimental results

To analyze the performance of the LSTM network, two sets of experiments are designed with different types of data sets. In both sets of experiments, different LSTM architectures are evaluated, and in each configuration the input, output, and forget gates of the LSTM blocks use an identical activation function from Table 1. In what follows, we describe the results of the analysis.

3.1 First set of experiments

In the first set of experiments, we use two movie review data sets. The first one [49] is referred to as Movie Review in this paper, and the other is the IMDB large movie review data set [50]. The Movie Review data set consists of 10,662 review sentences, with equal numbers of positive and negative examples. From this data set, we use a total of 8162 sentences for training, and the rest are used in the test phase. Both sets contain equal numbers of positive and negative sentences. From the IMDB data set, we use 2000 sentences for training the network (with 5% held out as a validation set) and 500 sentences for testing. Again, the numbers of positive and negative sentences are equal, and the sentences have a maximum length of 100 words.

The mini-batch method is used, with the batch sizes for the training and test phases set to 16 and 64, respectively. The batch sizes were chosen experimentally for better performance. We use the backpropagation through time (BPTT) algorithm with ADADELTA as the optimization method, with the epsilon parameter set to 1e−6. Hyperparameters are not tuned separately for each LSTM configuration and are identical in all experiments. The test misclassification error is used to rank the activation functions. Each experiment is repeated three times.

Table 2 shows the average training error values on the Movie Review data set for different configurations. In this set of experiments, the number of LSTM blocks in the hidden layer increases exponentially from 2 to 64, and the number of epochs in each run is set to 20 (in each epoch, all the training data are presented to the network in mini-batches). For each activation function, the average test error of the configuration that produced the lowest training error is shown in the last column. The test errors for all configurations are presented in the “Appendix.” As observed, on this data set the modified Elliott function has the lowest average test error (22.52%). Overall, the modified Elliott (with range [− 0.5, 1.5]), cloglogm ([− 0.5, 1.5]), softsign ([− 0.5, 1.5]), logsigm ([0.5, 1.5]), and saturated ([− 0.5, 1.5]) functions yield the lowest average error values: 22.52%, 22.62%, 22.85%, 23.01%, and 23.06%, respectively. The optimal number of LSTM blocks in the hidden layer when using modified Elliott was 16, while for cloglogm, softsign, logsigm, and saturated it was 4, 16, 16, and 32, respectively. Interestingly, log-sigmoid ranks 17th among all functions, with an average error of 23.52%. For the Movie Review data set, training errors are negatively correlated with the number of units for most functions (although the correlations are mostly weak). Most activations perform poorly with a very small number of units (e.g., 2), but as the number of units increases, the errors become less sensitive to it, and the standard deviation of the error values for most activations is below 2.

Table 2 Average train errors per each activation function for the Movie Review data set

The training error values for the IMDB data set are shown in Table 3, along with the average test error for the best configuration of each activation. The number of LSTM blocks in the hidden layer was increased exponentially from 4 to 256, and the number of epochs in each run was 50. On this data set, as for the first data set, modified Elliott has the lowest average error (12.46%). After modified Elliott (with range [− 0.5, 1.5]), the saturated ([− 0.5, 1.5]), cloglogm ([− 0.5, 1.5]), and softsign ([− 0.5, 1.5]) functions have the lowest average error values of 13.06%, 13.13%, and 13.13%, respectively. The optimal number of LSTM blocks in the hidden layer for modified Elliott was 256, while for cloglogm, saturated, and softsign it was 128 in each case. Interestingly again, log-sigmoid does not appear among the four best results; it ranks 10th out of 23 with an average error of 13.6%. For the IMDB data set, no clear pattern of correlation is observed between the error values and the number of units.

Table 3 Average train errors per each activation function for the IMDB data set

Most functions produced their best training results with 16 and 256 blocks for the Movie Review and IMDB data sets, respectively. The average test and training errors and the average number of convergence epochs for the Movie Review and IMDB data sets are presented in Tables 4 and 5, respectively. Note that in each epoch all of the training data are presented to the network in a sequence of mini-batches; the average number of actual iterations is reported in parentheses. To test the significance of the results, ANOVA tests were conducted on the results for 16 and 256 blocks for the Movie Review and IMDB data sets, respectively. The obtained p values were 3.33e−6 and 7.61e−11; both are well below 0.05, indicating that the differences are significant.

Table 4 Average test and train errors for Movie Review data set, and average number of convergence epochs for 16 blocks in hidden layer of LSTM
Table 5 Average test and train errors for IMDB data set, and average number of convergence epochs for 256 blocks in hidden layer of LSTM
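
The significance test amounts to a one-way ANOVA over the repeated-run errors of each activation for a fixed number of blocks. The sketch below uses hypothetical error values purely to show the procedure; the actual per-run errors are those behind Tables 4 and 5.

```python
# Hypothetical sketch of the one-way ANOVA over per-activation error samples.
from scipy.stats import f_oneway

errors_by_activation = {
    "modified Elliott": [22.4, 22.6, 22.6],   # illustrative numbers only
    "log-sigmoid":      [23.4, 23.5, 23.7],
    # ... one entry (three runs) per activation in Table 1
}
f_stat, p_value = f_oneway(*errors_by_activation.values())
print(f"p = {p_value:.2e}")   # p < 0.05 indicates a significant difference
```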

The results show that for both data sets the modified Elliott activation function has the best performance. For large data sets, using this activation function may require tuning the hyperparameters, for example using a smaller epsilon parameter in the ADADELTA optimization method. Interestingly, the log-sigmoid activation function, which is commonly used in neural networks and in LSTM networks, does not produce the best results, and the modified Elliott function performs better when employed in the sigmoidal gates. Additionally, it was observed that sigmoidal activations with a range of [− 0.5, 1.5] result in a more accurate LSTM network than those with a range of [0, 1]. The maximum sentence lengths for the Movie Review and IMDB data sets used in the experiments were 64 and 100 words, respectively. When applied to the IMDB data set, the LSTM network required more hidden blocks and more epochs per run, which can be attributed to the greater complexity of this data set.

The error levels measured in the current study are consistent with other studies in the literature. Lenc and Hercig [51] report a 38.3% error for classification of Movie Review with an LSTM, and Dai and Le [52] report an error of 13.5% for classification of the IMDB data with an LSTM. The overall error difference across all functions is at most 2% and 5% for the Movie Review and IMDB data sets, respectively. In our experiments, the difference between the best measured error values of the modified Elliott function and the popular log-sigmoid function is 0.36 and 1.14 for the Movie Review and IMDB data sets, respectively. Although small, these differences can be meaningful depending on the application.

3.2 Second set of experiments

For the second set of experiments, we use the MNIST data set of handwritten digits. The mini-batch method is again used, with the batch size for both the training and test phases set to 128. The batch sizes were chosen experimentally. We use RMSprop as the optimization method, with the learning rate set to 0.001.

The MNIST data set of handwritten digits has a training set of 60,000 examples (of which 5000 are used for validation) and a test set of 10,000 examples. Each image is 28 × 28 pixels. We use one-hot encoding to predict the 10 digits (0–9), i.e., 10 classes.
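
The conversion of images into input sequences is not detailed here; a common choice, assumed in the sketch below, is to treat each 28-pixel row as one timestep (28 timesteps of 28 features) and to one-hot encode the labels.

```python
# Assumed MNIST preprocessing: rows as timesteps, labels one-hot encoded.
import numpy as np

def image_to_sequence(img):
    # 28x28 image -> sequence of 28 timesteps, each a 28-dimensional vector.
    return img.reshape(28, 28).astype(np.float32) / 255.0

def one_hot(label, num_classes=10):
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[label] = 1.0
    return vec
```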

Table 6 shows the average training error values for the MNIST data set, with the test error reported for the best configuration of each activation function. In this set of experiments, two configurations of 64 and 128 LSTM blocks in the hidden layer are considered, and the number of epochs for each run is set to six. As observed, on this data set Elliott and softsign have the lowest average error (1.66%). Overall, the softsign (with range [− 0.5, 1.5]), Elliott ([0, 1]), rootsig ([− 0.5, 1.5]), Bi-tanh2 ([− 0.5, 1.5]), Gaussian ([0, 1]), Bi-sig1 ([0, 1]), Bi-sig2 ([0, 1]), and modified Elliott ([− 0.5, 1.5]) functions yield the lowest average error values: 1.66%, 1.66%, 1.9%, 1.93%, 1.96%, 2%, 2.03%, and 2.03%, respectively. The optimal number of LSTM blocks in the hidden layer for softsign, Elliott, rootsig, Bi-tanh2, Gaussian, Bi-sig1, Bi-sig2, and modified Elliott was 128, and most of the activation functions worked better with this number of blocks. Interestingly, log-sigmoid ranks 10th with an average error of 2.16%.

Table 6 Average train errors per each activation function for the MNIST data set

The average test and training errors and the average number of convergence epochs for the MNIST data set with 128 blocks in the hidden layer are presented in Table 7. The ANOVA p value for all experiments with 128 blocks is 3.7e−21, which indicates that the results are significant. The error levels are consistent with the study of Arjovsky et al. [53], which reports a classification error of 1.8% for MNIST data with an LSTM.

Table 7 Average test and train errors for MNIST data set, and average number of convergence epochs for 128 blocks in hidden layer of LSTM

3.3 Discussion

In this paper, we aggregated a list of 23 applicable activation functions that can be used in place of the sigmoidal gates in an LSTM network. We compared the classification performance of the network using these functions with different numbers of hidden blocks. The results showed the following:

  1.

    Overall, the results on both data sets suggest that less-recognized activation functions (such as Elliott, modified Elliott, and softsign which are interestingly all in the Elliott family) can produce more promising results compared to the common functions in the literature.

  2.

    Activation functions with a range of [− 0.5, 1.5] generally produced better results, indicating that a wider codomain (than the sigmoid range of [0, 1]) can yield better performance.

  3.

    The log-sigmoid activation function, which is most commonly used in LSTM blocks, produces weak results compared to several other activation functions.

Burhani et al. [41], in their study on denoising autoencoders, reported a similar result: the modified Elliott function performs better and yields lower error than the log-sigmoid activation function. In addition, in the first set of experiments we found cloglogm to be the second best activation, which is consistent with Gomes et al. [21], who report that cloglogm performs well for forecasting financial time series. The top activations (the Elliott family and cloglogm) along with the popular log-sigmoid activation are displayed in Fig. 3. As the figure shows, modified Elliott, softsign, and cloglogm are much steeper than log-sigmoid around zero and also have a wider range. Figure 4 compares the average error values of these five functions for the Movie Review, IMDB, and MNIST data sets with 16, 256, and 128 blocks, respectively.

Fig. 3

The modified Elliott, cloglogm, log-sigmoid, softsign, and Elliott activation functions

Fig. 4

Comparison of the average error values for the cloglogm, Elliott, log-sigmoid, modified Elliott, and softsign activation functions for MNIST, IMDB and Movie Review data sets with 128, 256, and 16 blocks, respectively

There are two widely known issues in training recurrent neural networks: the vanishing and exploding gradient problems [54]. LSTM networks alleviate the vanishing gradient problem through their special design; the exploding gradient problem can, however, still occur. A gradient norm clipping strategy was proposed by Pascanu et al. [55] to deal with exploding gradients. Gradient clipping is a technique to prevent exploding gradients in very deep networks; a common method is to rescale the gradients of a parameter vector when its L2 norm exceeds a certain threshold [55]. Although we did not perform gradient norm clipping when training the LSTM network, the method suggests that the exploding gradient problem is closely related to the norm of the gradient matrix and that smaller norms are preferable.
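
For reference, the clipping rule of Pascanu et al. [55] can be sketched as follows (it was not used in our training, and the threshold value here is arbitrary):

```python
# Gradient norm clipping: rescale the gradient when its L2 norm exceeds a threshold.
import numpy as np

def clip_gradient_norm(grad, threshold=5.0):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```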

We evaluated the norm of the gradient matrix in the second set of experiments and, interestingly, observed that the norm of the gradient matrix for the Elliott activation was low, in fact the second lowest among all the activations. The norms of the gradient matrix after convergence are presented in Table 8. As observed, the norm of the gradient matrix for most of the activation functions achieving lower classification errors is quite low (less than 0.1).

Table 8 Norm of gradient matrix for MNIST data set in increasing order

The tanh function is one of the most popular activation functions and is widely used in LSTM networks [2]. From a conceptual point of view, the two tanh activations in an LSTM block squash the block input and output and can be considered to have a different role from the three gates. However, they can still have a significant effect on the overall network performance and can be replaced by other activations that fulfill the properties mentioned in Sect. 1. Such a change will specifically affect the gradients, ranges, and derivatives of the activation functions and blocks. Analyzing the effect of other activation functions when used in place of the tanh activations is left for future work.

Some follow-up studies have proposed modifications to the original LSTM architecture; evaluating different activation functions on these architectures would be an interesting future study. Gers and Schmidhuber [56] introduced peephole connections that pass directly from the internal state to the input and output gates of a node. According to their observations, these connections improve performance on timing tasks where the network must learn to measure precise intervals between events. Another line of research concerns alternative and similar architectures that are popular alongside the LSTM. The bidirectional recurrent neural network (BRNN) was first proposed by Schuster and Paliwal [57]. This architecture involves two layers of hidden nodes, both of which are connected to the input and output; the first hidden layer has recurrent connections from past time steps, and in the second the direction of the recurrent connections is flipped. The gated recurrent unit (GRU) was proposed by Cho et al. [58] to allow each recurrent unit to adaptively capture dependencies at different time scales. These modifications can improve network performance.

4 Conclusions

In LSTM blocks, the two most popular activation functions are the logistic sigmoid and the hyperbolic tangent. In this study, we evaluated the performance of an LSTM network with 23 different activation functions that can be used in place of the sigmoidal gates. We varied the number of hidden blocks in the network and employed three different data sets for classification. The results showed that some less-recognized activations, such as the Elliott function and its modifications, can yield lower error levels than the most popular functions.

More research is needed on other parts and details of an LSTM network, such as the effect of replacing the hyperbolic tangent functions used for the block input and block output. Variants of the LSTM network can also be analyzed. Additionally, larger data sets and different tasks can be employed to further analyze network performance under different configurations.