1 Introduction

Recurrent or very deep neural networks are difficult to train, as they often suffer from the exploding/vanishing gradient problem (Hochreiter 1991; Kolen and Kremer 2001). The LSTM architecture (Hochreiter and Schmidhuber 1997a) was introduced to overcome this shortcoming when learning long-term dependencies. The learning ability of LSTM has had both a practical and a theoretical impact on several fields, establishing it as a state-of-the-art model. Google uses the model for its speech recognition (Sak et al. 2015) and to improve machine translations on Google Translate (Wu et al. 2016; Metz 2016), Amazon employs it to improve Alexa’s functionalities (Vogels 2016), and Facebook put it to use for over 4 billion LSTM-based translations per day as of 2017 (Pino et al. 2017).

Due to its high applicability and popularity, this neural architecture has also found its way into the world of gaming. For example, Google’s Deepmind created AlphaStar (The AlphaStar Team 2019b), an artificial intelligence designed to play Starcraft II. During its development, AlphaStar started to master the game (The AlphaStar Team 2019a), climbing the global rankings in a way not seen before. Research in this field is of course not limited to Starcraft II, as the research interest spans the entire RTS gaming genre due to its complexity (Zhang et al. 2019e). Applying reinforcement learning in yet another setting, OpenAI succeeded in building a robot hand called Dactyl, which taught itself to manipulate objects in a human-like fashion (Sabina Aouf 2019).

Of course, a neural architecture would not be so widely adopted in practice without a strong theoretical foundation. An extensive review of several LSTM variants and their performance relative to the so-called vanilla model was recently conducted by Greff et al. (2017). The vanilla LSTM is interpreted as the original LSTM block with the addition of the forget gate and peephole connections. In total, eight variants were identified for experimentation. In a nutshell, the vanilla architecture performs well on a number of tasks, and none of the eight investigated variants significantly outperforms the others. This justifies why most applications found in the literature employ the vanilla LSTM.

A recent study conducted by Yu et al. (2019) provides an overview of the LSTM cell, its functionalities and different architectures. A distinction is made between LSTM-dominated neural networks and integrated LSTM networks, the latter of which combine LSTM with other components to take advantage of its properties, effectively forming hybrid neural networks. As will be illustrated in Sect. 4, this work complements the review study of Yu et al. (2019) with a broad applications overview showing where integrated networks are most useful. As each problem is largely unique, there is often a better solution than employing solely the standard LSTM model.

Therefore, in this paper, we present a comprehensive review of the LSTM model that complements the theoretical findings presented in Greff et al. (2017) and Yu et al. (2019). Our review study focuses on three main directions, moving from theory to practice. In the first part, we broadly describe the LSTM components, how they interact with each other, and how the learnable parameters can be estimated. These considerations are particularly relevant for readers who want to master the model from a theoretical perspective rather than approach it only as practitioners. In the second part, we outline interesting applications that show the potential of LSTM as a state-of-the-art method within the deep learning field. Interesting application domains include text recognition, time series forecasting, natural language processing, computer vision, and image and video captioning, among others. In the last part, we present a code example in Tensorflow that aims to predict the next word of a sample short story.

As for the search criteria on which this literature review is based, we evaluated 409 papers containing the terms “Long short-term memory” or “LSTM” in either the title, abstract or keywords. Only non-paid journals with a recognised peer-review system were considered. Moreover, we included papers presented at mainstream conferences such as the Conference on Neural Information Processing Systems, the Conference on Computer Vision and Pattern Recognition, the AAAI Conference on Artificial Intelligence, the Conference on Empirical Methods in Natural Language Processing, etc. Having constructed our starting library, we selected papers making a relevant contribution in terms of theory or practice in which LSTM played a key role. This does not imply, however, that papers not included in our review fail to fulfil this criterion; we simply could not cover the whole literature concerning the LSTM model. For example, a rough search carried out in February 2020 using the criteria mentioned above reported 11,931 documents indexed by Scopus. Therefore, we have given priority to the most recent contributions.

The remainder of this paper is structured as follows. In Sect. 2 we describe the theoretical foundations behind the LSTM model, followed by a concise description of a procedure to adjust the learnable parameters in Sect. 3. Section 4 zooms in on different applications of the model as found in the literature. An example implementation in Tensorflow can be found in Sect. 5. Finally, we summarise our conclusions in Sect. 6.

2 Long short-term memory

The LSTM model (Hochreiter and Schmidhuber 1997a) is a powerful recurrent neural system specially designed to overcome the exploding/vanishing gradient problems that typically arise when learning long-term dependencies, even when the minimal time lags are very long (Hochreiter and Schmidhuber 1997b). These problems are prevented by using a constant error carousel (CEC), which maintains the error signal within each unit’s cell. Such cells are in fact recurrent networks themselves, with the interesting architectural property that the CEC is extended with additional features, namely the input gate and the output gate, which together form the memory cell. The self-recurrent connections indicate feedback with a lag of one time step.

A vanilla LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. This forget gate was not initially a part of the LSTM network, but was proposed by Gers et al. (2000) to allow the network to reset its state. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information associated with the cell. In the remainder of this section, LSTM will refer to the vanilla version as this is the most popular LSTM architecture (Greff et al. 2017). This does not imply, however, that it is also the superior one in every situation. This will be elaborated on in Sect. 4.

In short, the LSTM architecture consists of a set of recurrently connected sub-networks, known as memory blocks. The idea behind the memory block is to maintain its state over time and regulate the information flow through non-linear gating units. Figure 1 displays the architecture of a vanilla LSTM block, which involves the gates, the input signal \(x^{(t)}\), the output \(y^{(t)}\), the activation functions, and peephole connections (Gers and Schmidhuber 2000). The output of the block is recurrently connected back to the block input and all of the gates.

Fig. 1 Architecture of a typical vanilla LSTM block

Aiming to clarify how the LSTM model works, let us assume a network comprised of N processing blocks and M inputs. The forward pass in this recurrent neural system is described below.

Block input This step is devoted to updating the block input component, which combines the current input \(x^{(t)}\) and the output of that LSTM unit \(y^{(t-1)}\) in the last iteration. This can be done as depicted below:

$$\begin{aligned} z^{(t)} = g(W_z x^{(t)} + R_z y^{(t-1)} + b_z) \end{aligned}$$
(1)

where \(W_z\) and \(R_z\) are the weights associated with \(x^{(t)}\) and \(y^{(t-1)}\), respectively, while \(b_z\) stands for the bias weight vector.

Input gate In this step, we update the input gate that combines the current input \(x^{(t)}\), the output of that LSTM unit \(y^{(t-1)}\) and the cell value \(c^{(t-1)}\) in the last iteration. The following equation shows this procedure:

$$\begin{aligned} i^{(t)} = \sigma (W_i x^{(t)} + R_i y^{(t-1)} + p_i \odot c^{(t-1)} + b_i) \end{aligned}$$
(2)

where \(\odot \) denotes point-wise multiplication of two vectors, \(W_i\), \(R_i\) and \(p_i\) are the weights associated with \(x^{(t)}\), \(y^{(t-1)}\) and \(c^{(t-1)}\), respectively, while \(b_i\) represents the bias vector associated with this component.

In the previous steps, the LSTM layer determines which information should be retained in the network’s cell states \(c^{(t)}\). This includes the selection of the candidate values \(z^{(t)}\) that could potentially be added to the cell states, and the activation values \(i^{(t)}\) of the input gates.

Forget gate In this step, the LSTM unit determines which information should be removed from its previous cell states \(c^{(t-1)}\). Therefore, the activation values \(f^{(t)}\) of the forget gates at time step t are calculated based on the current input \(x^{(t)}\), the outputs \(y^{(t-1)}\) and the state \(c^{(t-1)}\) of the memory cells at the previous time step \((t-1)\), the peephole connections, and the bias terms \(b_f\) of the forget gates. This can be done as follows:

$$\begin{aligned} f^{(t)} = \sigma (W_f x^{(t)} + R_f y^{(t-1)} + p_f \odot c^{(t-1)} + b_f) \end{aligned}$$
(3)

where \(W_f\), \(R_f\) and \(p_f\) are the weights associated with \(x^{(t)}\), \(y^{(t-1)}\) and \(c^{(t-1)}\), respectively, while \(b_f\) denotes the bias weight vector.

Cell This step computes the cell value, which combines the block input \(z^{(t)}\), the input gate \(i^{(t)}\) and the forget gate \(f^{(t)}\) values, with the previous cell value. This can be done as depicted below:

$$\begin{aligned} c^{(t)} = z^{(t)} \odot i^{(t)} + c^{(t-1)} \odot f^{(t)}. \end{aligned}$$
(4)

Output gate This step calculates the output gate, which combines the current input \(x^{(t)}\), the output of that LSTM unit \(y^{(t-1)}\) in the last iteration, and the current cell value \(c^{(t)}\). This can be done as depicted below:

$$\begin{aligned} o^{(t)} = \sigma ( W_o x^{(t)} + R_o y^{(t-1)} + p_o \odot c^{(t)} + b_o ) \end{aligned}$$
(5)

where \(W_o\), \(R_o\) and \(p_o\) are the weights associated with \(x^{(t)}\), \(y^{(t-1)}\) and \(c^{(t)}\), respectively, while \(b_o\) denotes the bias weight vector.

Block output Finally, we calculate the block output, which combines the current cell value \(c^{(t)}\) with the current output gate value as follows:

$$\begin{aligned} y^{(t)} = g(c^{(t)}) \odot o^{(t)}. \end{aligned}$$
(6)

In the above steps, \(\sigma \), g and h denote point-wise non-linear activation functions. The logistic sigmoid \(\sigma (x) = \frac{1}{1+e^{-x}}\) is used as the gate activation function, while the hyperbolic tangent \(g(x) = h(x) = \tanh (x)\) is often used as the block input and output activation function.
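To make the forward pass concrete, the following is a minimal NumPy sketch of Eqs. (1)–(6) for a single layer of LSTM blocks. All weights and the toy sizes (M = 3 inputs, N = 4 blocks, five time steps) are hypothetical, chosen only for illustration.

```python
import numpy as np

M, N = 3, 4  # M inputs, N blocks (hypothetical toy sizes)
rng = np.random.default_rng(42)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))  # logistic sigmoid gate activation

# randomly initialised illustrative weights
Wz, Wi, Wf, Wo = (rng.standard_normal((N, M)) * 0.1 for _ in range(4))  # input weights
Rz, Ri, Rf, Ro = (rng.standard_normal((N, N)) * 0.1 for _ in range(4))  # recurrent weights
bz, bi, bf, bo = (np.zeros(N) for _ in range(4))                        # bias vectors
p_i, p_f, p_o = (rng.standard_normal(N) * 0.1 for _ in range(3))        # peephole weights

def lstm_step(x, y_prev, c_prev):
    z = np.tanh(Wz @ x + Rz @ y_prev + bz)               # Eq. (1): block input
    i = sig(Wi @ x + Ri @ y_prev + p_i * c_prev + bi)    # Eq. (2): input gate
    f = sig(Wf @ x + Rf @ y_prev + p_f * c_prev + bf)    # Eq. (3): forget gate
    c = z * i + c_prev * f                               # Eq. (4): cell state
    o = sig(Wo @ x + Ro @ y_prev + p_o * c + bo)         # Eq. (5): output gate (peephole reads current c)
    y = np.tanh(c) * o                                   # Eq. (6): block output
    return y, c

y, c = np.zeros(N), np.zeros(N)
for x in rng.standard_normal((5, M)):  # unroll over five time steps
    y, c = lstm_step(x, y, c)
print(y.shape)  # (4,)
```

Note that the output is bounded, since \(y^{(t)}\) is a product of a tanh and a sigmoid term.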

It seems appropriate to mention that the functionality of this architecture inspired the authors in Kumar Srivastava et al. (2015) to enhance the training of very deep networks. The gating mechanism was employed in the so-called highway networks to allow for an unimpeded information flow across many layers. This can be considered another proof of concept showing that the gating mechanism works.

Even though the vanilla LSTM already performs very well, several works have studied possibilities to improve its performance. For example, Su and Kuo (2019) developed the Extended LSTM model, further improving prediction accuracy in several application fields by enhancing the memory capability. This shows that theoretical improvements can still be made to an already state-of-the-art architecture. In the work of Bayer et al. (2009), the search for improvements of the model was already ongoing. The authors looked for an architectural alternative to LSTM to optimise sequence learning capabilities. They succeeded in evolving memory cell structures capable of learning context-sensitive formal languages through gradient descent, in many ways comparable to LSTM in terms of performance. The authors in Bellec et al. (2018) built upon recurrent networks of spiking neurons, developing Long short-term memory Spiking Neural Networks (LSNN) that include adapting neurons. In tests where the LSNN was comparable in size to an LSTM network, its performance proved very similar to that of LSTM. This is yet another illustration of how accurate LSTM is and remains.

3 How to train your model

The LSTM model described in Sect. 2 uses the full gradient training presented by Graves and Schmidhuber (2005a) to adjust the learnable parameters (weights) of the network. More specifically, Backpropagation Through Time (Werbos 1990) is used to compute the gradients of the weights that connect the different components in the network. During the backward pass, the cell state \(c^{(t)}\) receives gradients from \(y^{(t)}\) as well as from the next cell state \(c^{(t+1)}\). These gradients are accumulated before being backpropagated to the current layer.

In the last iteration T, the change \(\delta _{y}^{(T)}\) corresponds with the network error \(\partial E / \partial y^{(T)}\), where E denotes the loss function. Otherwise, \(\delta _{y}^{(t)}\) combines the vector of delta values \(\varDelta ^{(t)}\) passed down from the layer above with the recurrent dependencies. This can be done as follows:

$$\begin{aligned} \delta _{y}^{(t)} = \varDelta ^{(t)} + R_z^T \delta _{z}^{(t+1)} + R_i^T \delta _{i}^{(t+1)} + R_f^T \delta _{f}^{(t+1)} + R_o^T \delta _{o}^{(t+1)}. \end{aligned}$$
(7)

In a second step, the delta values associated with the gates and the memory cell are calculated as:

$$\begin{aligned} \delta _{o}^{(t)} = \delta _y^{(t)} \odot h(c^{(t)}) \odot \sigma '({\hat{o}}^{(t)}) \end{aligned}$$
(8)
$$\begin{aligned} \delta _c^{(t)} = \delta _y^{(t)} \odot o^{(t)} \odot h'(c^{(t)}) + p_o \odot \delta _{o}^{(t)} + p_i \odot \delta _{i}^{(t+1)} + p_f \odot \delta _f^{(t+1)} + \delta _c^{(t+1)} \odot f^{(t+1)} \end{aligned}$$
(9)
$$\begin{aligned} \delta _{f}^{(t)} = \delta _c^{(t)} \odot c^{(t-1)} \odot \sigma '({\hat{f}}^{(t)}) \end{aligned}$$
(10)
$$\begin{aligned} \delta _{i}^{(t)} = \delta _c^{(t)} \odot z^{(t)} \odot \sigma '({\hat{i}}^{(t)}) \end{aligned}$$
(11)
$$\begin{aligned} \delta _{z}^{(t)} = \delta _c^{(t)} \odot i^{(t)} \odot g'({\hat{z}}^{(t)}) \end{aligned}$$
(12)

where \({\hat{o}}^{(t)}\), \({\hat{i}}^{(t)}\), \({\hat{z}}^{(t)}\) and \({\hat{f}}^{(t)}\) denote the pre-activation values of the output gate, the input gate, the block input and the forget gate, respectively, i.e. the raw values before being transformed by the corresponding activation function.

As pointed out in Greff et al. (2017), the delta values for the inputs are only required if there is a layer below that needs to be trained, thus:

$$\begin{aligned} \delta _{x}^{(t)} = W_z^T \delta _{z}^{(t)} + W_i^T \delta _{i}^{(t)} + W_f^T \delta _{f}^{(t)} + W_o^T \delta _{o}^{(t)}. \end{aligned}$$
(13)

Finally, the gradients for the weights are calculated as follows:

$$\begin{aligned} \delta _{W_{*}}&= \sum _{t=0}^T \delta _{*}^{(t)} \otimes x^{(t)},&\delta _{p_{i}}&= \sum _{t=0}^{T-1} c^{(t)} \odot \delta _{i}^{(t+1)}, \\ \delta _{R_{*}}&= \sum _{t=0}^{T-1} \delta _{*}^{(t+1)} \otimes y^{(t)},&\delta _{p_{f}}&= \sum _{t=0}^{T-1} c^{(t)} \odot \delta _{f}^{(t+1)}, \\ \delta _{b_{*}}&= \sum _{t=0}^T \delta _{*}^{(t)},&\delta _{p_{o}}&= \sum _{t=0}^T c^{(t)} \odot \delta _{o}^{(t)} \end{aligned}$$

such that \(\otimes \) represents the outer product of two vectors, whereas \(*\) can be any component associated with the weights: the block input \({\hat{z}}\), the input gate \({\hat{i}}\), the forget gate \({\hat{f}}\) or the output gate \({\hat{o}}\).
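As a hedged sketch of this training procedure, the NumPy snippet below implements the forward pass of Sect. 2 and the backward recursions of Eqs. (7)–(12), accumulating \(\delta_{W_z}\), and checks the result against a central finite-difference approximation. The toy loss \(E = \frac{1}{2}\sum_t \Vert y^{(t)}\Vert^2\) (so \(\varDelta^{(t)} = y^{(t)}\)) and all sizes and parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, T = 3, 4, 5  # inputs, blocks, time steps (hypothetical)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

W = {k: rng.standard_normal((N, M)) * 0.1 for k in "zifo"}  # input weights
R = {k: rng.standard_normal((N, N)) * 0.1 for k in "zifo"}  # recurrent weights
b = {k: np.zeros(N) for k in "zifo"}                        # biases
p = {k: rng.standard_normal(N) * 0.1 for k in "ifo"}        # peepholes
xs = rng.standard_normal((T, M))

def forward(xs):
    y, c, cache = np.zeros(N), np.zeros(N), []
    for x in xs:
        c_prev, y_prev = c, y
        z = np.tanh(W["z"] @ x + R["z"] @ y_prev + b["z"])                # Eq. (1)
        i = sig(W["i"] @ x + R["i"] @ y_prev + p["i"] * c_prev + b["i"])  # Eq. (2)
        f = sig(W["f"] @ x + R["f"] @ y_prev + p["f"] * c_prev + b["f"])  # Eq. (3)
        c = z * i + c_prev * f                                            # Eq. (4)
        o = sig(W["o"] @ x + R["o"] @ y_prev + p["o"] * c + b["o"])       # Eq. (5)
        y = np.tanh(c) * o                                                # Eq. (6)
        cache.append((x, c_prev, z, i, f, c, o, y))
    return cache

def loss(cache):  # toy loss E = 0.5 * sum_t ||y^(t)||^2, hence Delta^(t) = y^(t)
    return 0.5 * sum((st[-1] ** 2).sum() for st in cache)

def grad_Wz(cache):  # backward pass, Eqs. (7)-(12), accumulating delta_{W_z}
    dz = di = df = do = dc = np.zeros(N)
    f_next, dWz = np.zeros(N), np.zeros((N, M))
    for x, c_prev, z, i, f, c, o, y in reversed(cache):
        # dz, di, df, do still hold the (t+1) deltas when used on the right-hand side
        dy = y + R["z"].T @ dz + R["i"].T @ di + R["f"].T @ df + R["o"].T @ do  # Eq. (7)
        do = dy * np.tanh(c) * o * (1 - o)                                      # Eq. (8)
        dc = (dy * o * (1 - np.tanh(c) ** 2) + p["o"] * do
              + p["i"] * di + p["f"] * df + dc * f_next)                        # Eq. (9)
        df = dc * c_prev * f * (1 - f)                                          # Eq. (10)
        di = dc * z * i * (1 - i)                                               # Eq. (11)
        dz = dc * i * (1 - z ** 2)                                              # Eq. (12)
        dWz += np.outer(dz, x)                                                  # delta_{W_z} term
        f_next = f
    return dWz

analytic = grad_Wz(forward(xs))

# central finite-difference check of dE/dW_z
eps, num = 1e-6, np.zeros((N, M))
for a in range(N):
    for j in range(M):
        W["z"][a, j] += eps; up = loss(forward(xs))
        W["z"][a, j] -= 2 * eps; dn = loss(forward(xs))
        W["z"][a, j] += eps
        num[a, j] = (up - dn) / (2 * eps)

print(np.max(np.abs(analytic - num)))  # close to zero (finite-difference error only)
```

The same recursions yield the gradients for the remaining weight matrices, biases and peepholes by substituting the corresponding delta values into the summations above.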

Alternatively, one can train the model with Evolino (Schmidhuber et al. 2007; Wierstra et al. 2005). This method evolves the weights of the nonlinear hidden nodes of recurrent neural networks while computing optimal linear mappings from the hidden states to the output.

4 Relevant applications

The LSTM network is applied in a wide array of problem domains, both individually and in combination with other deep learning architectures. As previously discussed, LSTM is one of the most advanced networks to process temporal sequences. For this reason, the vanilla LSTM is still one of the most popular network choices, even though it is possible to combine it with other networks to create hybrid models. LSTM is well suited to handle time series predictions, but also any other problem that requires temporal memory.

This section gives an overview of the topics the model is most suitable for and how it can be used to solve complex problems. In that regard, we will discuss the applications per problem domain.

4.1 Time series prediction

When it comes to temporal sequences in data, time series data immediately comes to mind. Still, this is a broad concept. In the more literal sense of time series predictions, the LSTM model has been applied to financial market predictions in, for example, Fischer and Krauss (2018) and Yan and Ouyang (2017). Due to complex features such as non-linearity, non-stationarity and sequence correlation, financial data pose a huge forecasting challenge. It was shown by Fischer and Krauss, however, that the LSTM network outperforms more traditional benchmarks: the random forest, a standard deep neural network and a standard logistic regression.

Sagheer and Kotb (2019) confirmed this finding that LSTM outperforms standard approaches when predicting petroleum production. In their research, they stacked multiple layers of LSTM blocks on top of one another in a hierarchical fashion. This increased the model’s ability to process temporal tasks and enabled it to better capture the structure of data sequences. Attempting to forecast the oil market price, the authors in Cen and Wang (2019) came to the same conclusions. The proposed prediction model, composed of the vanilla LSTM architecture, proved to be superior. Liu (2019) compared LSTM with other models, like support vector machines. This research in estimating financial stock volatility not only showed it was relatively easy to calibrate LSTM, but also that it results in accurate predictions for even large time intervals.

In Rodrigues et al. (2019) the authors used LSTM to model the time series observations and later on improve predictions by combining this with text data input. Using this technique, they attempted to predict taxi demand in New York, even though the proposed method is generalisable for other applications as well. This is an important benefit of the architecture, as it was empirically shown that the forecast error can be greatly reduced with this approach.

Receiving a time series as input does not necessarily mean the model will predict the next values in the series, as it can also be used to train a classifier. This is the case for the fault diagnosis framework in Lei et al. (2019), where the authors used an LSTM-based framework for condition monitoring of wind turbines using only raw time series data gathered by one or more sensors. They explicitly made this choice to avoid a heavy reliance on expert knowledge. LSTM was utilised to capture long-term dependencies in the data to perform proper fault classification. Experiments in this study have shown that this framework is not only robust, but also outperforms state-of-the-art methods. Other fault classification works include batteries in electric vehicles (Hong et al. 2019), where LSTM predicts the battery voltage levels, laying the groundwork for fault classification, and electro-mechanical actuators in aircraft (Yang et al. 2019) based on recorded sensor data. The study of Saeed et al. (2020) presented the steps and process required to create a flexible fault diagnosis model, in which LSTM plays a pivotal role in the statistical analysis. The authors tested their model in the nuclear energy domain.

As another LSTM-based classification example, Uddin (2019) demonstrated that the model can also detect which activity was performed given input data from multiple wearable healthcare sensors obtained via an edge device like a laptop. This sensor data was fed to LSTM with which twelve different human activities were modelled. Again, the proposed method was proven to be robust and achieve better results than the reported traditional models.

Given several data points produced by a number of sensors, Elsheikh et al. (2019) predicted the remaining useful life (RUL) of physical systems, for example, production resources. This was done by proposing a new bidirectional LSTM (BLSTM) architecture. The term bidirectional in recurrent neural networks comes from the idea of processing the input sequence in both directions: forwards and backwards. From these two hidden layers, the output layer simultaneously receives information about both past and future states (Schuster and Paliwal 1997). The same concept can be applied to the LSTM network, as it is a recurrent network that takes sequences as input. Graves and Schmidhuber (2005b) presented the BLSTM network, comparing it with the unidirectional variant in a classification context. It was already confirmed that LSTM outperforms traditional recurrent neural networks, but the findings from this research indicated that BLSTM can achieve even better results.
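The bidirectional idea can be sketched independently of the LSTM equations: run one recurrent pass forwards and one backwards over the same sequence, then concatenate the two hidden states at each time step. The snippet below uses a generic tanh recurrence for brevity (a real BLSTM would substitute the LSTM block of Sect. 2); all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, T = 3, 4, 6  # inputs, hidden units, time steps (illustrative)

# separate parameters for the forward and backward passes
W_fw, U_fw = rng.standard_normal((N, M)) * 0.1, rng.standard_normal((N, N)) * 0.1
W_bw, U_bw = rng.standard_normal((N, M)) * 0.1, rng.standard_normal((N, N)) * 0.1

def run(xs, Wx, Uh):
    """Generic tanh recurrence; a BLSTM would use the LSTM block here."""
    h, out = np.zeros(N), []
    for x in xs:
        h = np.tanh(Wx @ x + Uh @ h)
        out.append(h)
    return np.stack(out)

xs = rng.standard_normal((T, M))
h_fwd = run(xs, W_fw, U_fw)              # processes t = 0 .. T-1
h_bwd = run(xs[::-1], W_bw, U_bw)[::-1]  # processes t = T-1 .. 0, re-aligned
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)  # (T, 2N): past and future context per step
print(h_bi.shape)  # (6, 8)
```

Each row of `h_bi` thus summarises the sequence up to that time step in one half and from that time step onwards in the other.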

Since the problem at hand in Elsheikh et al. (2019) is only about the RUL, the authors were not interested in intermediate predictions. Therefore, the input sequence is not processed in both directions simultaneously. Rather, the forward direction is processed first, after which the LSTM final states initialise the backward processing cells. This forces the network to produce two different yet linked mappings of the data to the RUL. Since the initial wear of the physical system is usually unknown, the network is trained to anticipate requirements such as replacement of parts. This method outperforms the conventional network types according to the conducted experiments.

Li et al. (2019) presented a lithium-ion battery RUL prediction based on the empirical mode decomposition algorithm in combination with a novel Elman-LSTM hybrid network. First, empirical mode decomposition extracts both high and low frequency signals from the input data. The Elman neural network was introduced to the solution as it has the capability of short-term memory (Li et al. 2014); it is included to capture capacity recovery in certain cycles of the batteries. With the long-term capabilities of LSTM, the battery degradation is regarded as a time series. With both models working together, capturing both degradation and recovery characteristics, the battery RUL is predicted. Experiments show that this novel method offers superior performance in comparison to other state-of-the-art models.

As shown in Li et al. (2019), predictions can be improved by combining different model architectures. Not only can these models work side by side, the input sequence can also be preprocessed for LSTM by another (deep learning) architecture, again forming a hybrid model. Wu et al. (2019) used ensemble empirical mode decomposition to select the proper intrinsic mode functions, which then serve as input for the LSTM model to predict crude oil price movements. This proved to be a very effective methodology, even when the number of decomposition results varies.

As the reader can notice from the applications above, LSTM is an appropriate model to deal with time series data. But these are of course not all contributions which can be found in the literature. Many other applications exist for time series prediction scenarios, like emotion ratings (Ringeval et al. 2015), topic evolution in text streams (Lu et al. 2019), building energy usage (Wang et al. 2019b), carbon emission levels (Huang et al. 2019b), flight delays (McCarthy et al. 2019), financial trading strategy selection (Sang and Pierro 2019), air quality monitoring (Zhao et al. 2019a), wind speed forecasting (Chen et al. 2019b), precipitation nowcasting (Xingjian et al. 2015), real-time occupancy prediction (Kim et al. 2019b) and health predictive analytics (Manashty and Light 2019). The LSTM’s prediction power speaks for the wide usability of the model to overcome problems in which other recurrent neural systems are likely to fail.

4.2 Natural language processing

LSTM is a force to be reckoned with when it comes to language learning, both context-free and context-sensitive (Gers and Schmidhuber 2001; Gers et al. 2002). Natural language processing is the field of research that explores how computers can be used to understand and manipulate natural language text or speech to do useful things (Chowdhury 2003). For example, dialog systems—also known as conversational agents—allow human beings to interact with a machine via speech. Speech recognition with the use of the LSTM model was first performed in Graves et al. (2004) as it has the main benefit of dealing with long time lags. Results comparable to the hidden Markov model (HMM) (Rabiner 1986) were obtained in this experiment. This work then initiated the further exploration of LSTM for this domain.

Dialog systems should be able to respond to out-of-domain speech input, instead of giving a random response. In Ryu et al. (2017) a binary classifier was built using two LSTM layers. The classifier was trained with only in-domain data, with the goal to later on recognise out-of-domain sentences. This method achieved higher prediction rates than the former state-of-the-art models.

Of course, while identifying differences between sentences is interesting from the classification perspective, understanding the meaning of the sentence can be quite a challenge for machines. Take the following sentence as an example: “Apple will launch a trio of new iPads this spring, according to Barclays research analysts.” Given the context provided in the sentence, algorithms should be able to understand that “Apple” refers to the company, not the fruit. This is the purpose of entity disambiguation. The architecture in Sun et al. (2017) was comprised of two LSTM networks to learn the context of the sentences, which performed well. However, this already strong baseline was outperformed by the memory network proposed in Sun et al. (2018).

If the context of sentences and words is understood, then questions can be answered about them. One example of this application is visual question answering, a computer vision task where a system is provided with a text-based question about an image and the answer must be inferred (Kafle and Kanan 2017; Gao et al. 2015). Section 4.4 elaborates on other computer vision applications. Alternatively, community question answering—selecting the correct answer to a question—can be performed. The authors in Wen et al. (2019) built a hybrid attention network which considers the importance of words in the current sentence but also the mutual importance with respect to the counterpart sentence for sentence representation learning. The representations of the questions and answers are learned with an LSTM-based architecture.

Given the model’s capability to remember long-term contexts, LSTM can be used to detect dialogue breakdowns. Conversational agents have to timely detect such a breakdown as it helps the agents recover from mistakes. A recent contribution from Takayama et al. (2019) investigated variations when determining if a response causes breakdowns in a conversation (subjectivity), and variations in breakdown types due to designs of agents (variationality). The variationality was addressed with three models: LSTM which considers the global context, the convolutional neural network (CNN) which is sensitive to the local characteristics, and also the combination of both. The authors did investigate BLSTM as well, but chose to employ unidirectional LSTM since results were comparable.

On the other hand, Hori et al. (2019) did adopt both the uni- and bidirectional networks, along with a hierarchical recurrent encoder decoder. The responses given by these three models are combined by a minimum Bayes risk based system in order to go to the generation phase of system responses in a Twitter help-desk dialog task. Note that this proposal is not for a real-time dialog system as it is impossible to feed the input in a reverse order in this case.

Building a dialog system is, as expected, also possible with BLSTM. As with text recognition, consecutive words in a sentence depend not only on the previous word but can also be related to the next one. The authors in Kim et al. (2019a) employed the MemN2N architecture (Sukhbaatar et al. 2015) to perform automatic system responses, but added BLSTM to the beginning of their network to better reflect the temporal information. Numerical simulations showed that the performance of state-of-the-art methods for an end-to-end network is comparable to the hierarchical LSTM model. Just as in Hori et al. (2019), this is not a real-time dialog system.

Wang et al. (2019c) attempted to generate natural and informative responses for customer service oriented dialog systems. For this purpose, two frameworks were proposed: one to encode the entire dialogue history, while the other integrates external knowledge extracted from a search engine. It is in the former that the authors explored both CNN and LSTM networks. The simulation results showed that the recurrent network was more effective in improving the reply quality when compared with the CNN variant, which can only capture the problem semantics through short patterns.

Although the interaction with humans is pivotal in this kind of problem, it does not have to end with providing appropriate responses or the information the user is seeking. Opinions can be analysed and extracted from sentences as well, as done by Zhang et al. (2019a), who employed a multilayered BLSTM architecture. D’Andrea et al. (2019) put the strengths of LSTM to use, in combination with other architectures, to track opinions on vaccination. From this study, however, it seemed that LSTM was not amongst the best architectures to tackle the problem.

Likewise, text can be classified into certain categories, as done by Dabiri and Heaslip (2019) with the use of CNN and LSTM, from which the authors concluded that their approach reports superior results when compared with other algorithms. Combinatory Categorial Grammar supertagging can also be performed by means of deep learning architectures, as illustrated in Kadari et al. (2018), where the authors used BLSTM and a conditional random field. This hybrid architecture attained good results in a very efficient manner.

Liu et al. (2019) employed both LSTM and CNN networks to build a rumor identification classifier in the social media environment. For this purpose, they analysed forwarded comments, included the influence, authority and popularity of so-called influencers, and captured diffusion structures. Results show that the proposed models are quite capable of learning hidden clues and contextual information.

Of course, before any analysis can be conducted, the structure of the document must be machine-understandable, which is the domain of document representation, studied in Zhang et al. (2019d). First, an attention-based LSTM is used to generate hidden representations of word sequences. Next, a latent topic-modelling layer and tree-structured LSTM generate the semantic representations of the document. This method proved its value over the state-of-the-art.

Naturally, a great number of works have contributed to the field of natural language processing using (a variant of) LSTM. Examples are Chinese word segmentation (Ma et al. 2018a; Gong et al. 2019), morphological segmentation (Wang et al. 2016), relation extraction (Song et al. 2018), mapping natural text to knowledge base entities (Kartsaklis et al. 2018), emoji prediction (Barbieri et al. 2018), and translation tasks (Sutskever et al. 2014).

4.2.1 Sentiment analysis

Sentiment analysis is closely related to natural language processing. Many data sources can be used to detect emotions, such as physiological data, the environment and videos. Kanjo et al. (2019) put to use sensor signals coming from these multi-modal data sources. More specifically, these signals originated from smartphones and wearable devices. In fact, they were the first to perform emotion recognition using physiological, environmental and location data. To analyse all the data, four models were built, all based on a CNN-LSTM architecture: one for the on-body data, one for the environmental data, one for the location data, and finally one for the fusion of all data inputs. The authors concluded that, by using this hybrid network, the accuracy level was increased by more than 20% compared to a traditional multi-layer perceptron model.

When it comes to video-based sentiment analysis, Li and Xu (2019) introduced a new feature extraction method, hvnLBP-TOP, followed by a sequence learning module containing BLSTM. In short, they extracted the individual frames from a video, detected the human faces in them, and assembled these into a human face video: a fixed-size picture sequence. A comparative analysis with state-of-the-art models showed this new method to be effective.

Identifying people’s emotions can also help detect conversation anomalies. For this, Sun et al. (2019) investigated the hybrid model of CNN-LSTM combined with a Markov chain Monte Carlo method. The former is applied to identify the emotion of the conversation texts, while the latter detects the emotion transitions. A limitation of this research, however, is that the initial and stimulating emotion cannot be set at any time during the conversation without guiding it in a specific direction.

Kraus and Feuerriegel (2019) proposed a method that builds upon the discourse structure of documents, namely a tensor-based, tree-structured deep neural network named Discourse-LSTM. The method is based on rhetorical structure theory, which structures documents hierarchically into a discourse tree that Discourse-LSTM can process in full. The tensor structure reveals the salient text passages and thereby provides explanatory insights, all the while returning a superior performance.

In the work of Song et al. (2019) a method of sentiment lexicon embedding in Korean was proposed. After extensive data preprocessing, attention-based LSTM was responsible for sentiment classification. The authors’ approach resulted in improved accuracy of this classification. Zhou et al. (2016) used an attention-based BLSTM to model bilingual texts with the goal of cross-language sentiment classification. The authors confirmed the attention mechanism proves to be very effective, especially on the word-level. A similar architecture was used in Yang et al. (2017) with the goal of improving target-dependent sentiment classification, and achieving better or comparable results. The attention mechanism was also employed by Ma et al. (2018b), in their variants of LSTM, incorporating commonsense knowledge, which outperformed the competition in targeted aspect sentiment analysis.

The authors in Zhao et al. (2019b) experimented with a one- and a two-dimensional CNN-LSTM architecture to identify emotions from speech data. The results show that the proposed methods achieve outstanding performance. Especially the two-dimensional variant outperformed the benchmark models: deep belief networks and CNN-based architectures. A similar study was performed by Huang et al. (2019a), with similar results: CNN-LSTM outperforms not only CNN, but also support vector machine predictions.

In speech data emotion recognition, Fayek et al. (2017) conducted a review study to compare various deep learning architectures, both feed-forward and recurrent networks. In this study, CNN yielded the best accuracy. This likely explains why most contributions use a hybrid network combining the strengths of both models.

4.3 Image and video captioning

So far, we have discussed basic time series predictions and natural language processing. However, a computer can also be requested to describe, in natural language, what can be seen in an image or a video. This is referred to as image and video captioning.

From the perspective of our literature study, we notice that this domain often operates with a CNN-LSTM hybrid architecture, thus following the encoder-decoder framework: a CNN encodes a sequence of frames and feeds it to LSTM layers that generate the word sequence. Venugopalan et al. (2016), for example, additionally integrated linguistic knowledge from large text corpora into such a pipeline.

The study of Chen et al. (2017a) built upon the state-of-the-art in computer vision to enable “robots to speak”. In other words, the goal of this research was to feed the model with images of cars, as car detection is a hot topic in self-driving technology, and have it generate a textual description of the image in understandable human language. In this regard, a CNN was applied to extract car region proposals and embed them into fixed-size windows. The LSTM then generates from the image input a one-sentence description of variable length that is closely related to the input image. Compared with four other relevant algorithms, the proposed model by Chen et al. (2017a) proved to be superior.

For more general image captioning, He et al. (2019) attempted to exploit the structure information of a natural sentence. The CNN extracts from the image a high-level feature representation; this vector describes the global content of the input image. The LSTM network is then employed to generate words from the image representation in a recurrent process, guided by part of speech. A part-of-speech tagger is a system which automatically assigns the part of speech to words using contextual information (Schmid 1994). The performance of LSTM in part-of-speech tagging was studied in Plank et al. (2016) and later confirmed in Horsmann and Zesch (2017), thus establishing the superiority of LSTM in this domain.

To increase the fluidity and descriptive nature of the generated image captions, a deep network consisting of three steps was proposed in Kinghorn et al. (2019). The first step in their proposed method is, of course, system preprocessing. For this purpose, the region proposal network is applied to generate regions of interest—the regions likely to contain objects or people. Second, to start with image description generation, the model conducts object and scene classification and attribute predictions. The third step is language “conversion”. The generated attributes and class labels are converted into fully descriptive image captions. It is in these two final steps that LSTM plays a key role, as it is the language generating neural network.

Chen et al. (2017b) proposed a new reference-based LSTM model to caption images where the training images are considered to be the references. This way, the authors attempted to solve two problems, namely identifying the important words in a caption, and misrecognition of objects and scenes. Experiments show their approach offers superior performance.

Liu et al. (2017a) argued that, in order to perform proper video captioning, one should not ignore the rich contextual information that is also available, such as objects, scenes and actions. In this work, LSTM was employed to first learn the multimodal and dynamic representation of the video. Next, the model is leveraged to generate the description words one by one.

To conclude this section, the study of Ren et al. (2016) applied multimodal LSTM to perform speaker identification: the task of localising the face of the person corresponding to the identity of the ongoing voice in a video. Therefore, a collective perception of both visuals and audio is required. The authors show that modelling the temporal dependency across face and voice can significantly improve the robustness. In the end, their system outperforms the state-of-the-art systems with both a lower false alarm rate and a higher recognition accuracy.

4.4 Computer vision

LSTM can also be used for gesture and action recognition. This field includes identifying human poses and interactions. In Bilakhia et al. (2015) the authors investigated automatic recognition of mimicry behaviour, as mimicry has the power to influence social judgements and behaviours. Mimicry behaviour is here defined as face and head movements. Video recordings of mimicry behaviour are fed to the network, and the approach is compared with other methods, namely cross-correlation and generalised time-warping. LSTM reported an outstanding performance due to the model’s inherent ability to process spatio-temporal transformations. A significant variance in the performance was detected in these experiments, however, suggesting there was still room for improvement. Zhang et al. (2018) explored the effects of attention in convolutional LSTM with regard to gesture recognition, and discovered that convolutional structures in the gates do not play the role of spatial attention. Instead, a reduction of these structures results in better accuracy, a smaller parameter size and lower computational cost. Therefore, they introduced a new variant of LSTM.

A similar study was conducted by Chen et al. (2017c). The network architecture consists of a global LSTM network comprised of multiple blocks. The video sequence is passed to the network as a set of images, while the output consists of estimates of facial landmark coordinates of the corresponding image. As these coordinates highly correlate throughout the sequence, the output of one image is used as input for the next one. In this work, however, the global network only produces the initial coordinates. Such points are fine-tuned with two feed-forward deep neural networks, which increases the accuracy of the coordinates while maintaining the shape of the face.

Hou et al. (2018) presented a facial landmark detection method for images and videos under uncontrolled conditions. This method is a unified framework which integrates, amongst others, LSTM to make full use of the spatial and temporal middle stage information to improve the accuracy. Based on experiments on publicly available datasets, the method proved to be more effective than the state-of-the-art approaches at that time.

LSTM can also recognise gestures made by hand and even track the entire human body. This is skeleton-based human activity tracking (Núñez et al. 2018) where the authors used three-dimensional data sequences obtained from full-body and hand skeletons to address human activity and hand gesture recognition, respectively. To accomplish this, the hybrid model CNN-LSTM was utilised. Extensive simulations concluded that this hybrid model has a similar performance as state-of-the-art methods. Zhu et al. (2016) also recognised the importance of skeleton joints as a good representation of the skeleton for describing actions. They introduced a new dropout algorithm which has proven its effectiveness in experiments. This dropout algorithm allowed the dropping of internal gates, cell and output responses for an LSTM neuron to encourage each unit to learn better parameters.

The applicability is not limited to tracking body characteristics, of course. The location of any arbitrary object in a sequence of frames can also be tracked given the initial position (Chen et al. 2019a). LSTM was employed to better keep track of the historical context while performing more reliable inferences in the current step. The authors fairly stated they did not propose a real-time algorithm in their research, as run-time speeds should be improved.

Humans also pass as objects for tracking in a sequence of frames. Pei et al. (2019) proposed an LSTM-based model to predict human trajectories in crowded scenes, where one LSTM is used for each object. This model includes the assumption that humans adjust their paths to accommodate for other people’s movements, as well as that of their partners. In other words, social interactions are also taken into account. However, the employed architecture failed to identify obstacles. In Alahi et al. (2016), on the other hand, the authors reported a superior performance of their human trajectory prediction study. The main difference is that Alahi et al. (2016) only consider human interactions, as objects and the scene are completely left out of the equation. In this context, Zhang et al. (2019b) worked on joint trajectory predictions and proposed a new states refinement module for LSTM. A comparative analysis shows the effectiveness of this method.

In Wang et al. (2017) spatio-temporal LSTM was introduced in order to predict the next images based on historical frames, achieving state-of-the-art prediction performance on three video prediction datasets. In the context of predicting future frames, Fan et al. (2019) introduced cubic LSTM, consisting of three branches, namely a spatial, a temporal and an output branch, the latter of which combines the first two to generate the predicted frames.

Similar to Pei et al. (2019), the study from Li et al. (2017) was performed in the context of intelligent surveillance. First, a fixed-size window is slid over a static image to generate an image sequence. This sequence serves as input for a CNN, which extracts a feature sequence from it. Finally, this feature sequence is passed in proper order to the LSTM to memorise and recognise sequential patterns. This model is designed to predict potential object locations in the scenes. In terms of accuracy, this algorithm attained the best performance on three surveillance datasets. However, the authors note that the method does not perform real-time detection.

One can go further than surveillance. Zhao et al. (2016) built upon the theme, the scene and the temporal structure of video footage to identify specific, potentially harmful, videos. Preventing the spread of videos containing harmful ideas is a valuable application. As usual, LSTM is employed to process the temporal information, while the theme and scene are learned by a variety of other models. The bottom line of the research is that impressive results were obtained, thus establishing a benchmark architecture.

Another application of LSTM with regard to the computer vision field is the estimation of the human pose in three dimensions when the input consists of two-dimensional images (Núñez et al. 2019). This is accomplished in three distinct steps. First, the two-dimensional poses are analysed, during which key skeleton points are identified using a CNN. Second, a three-dimensional point is constructed for each key point; the initial guess is obtained by optical triangulation, from which convergence was expected to be faster, given that the starting point is not random. Third, the full body pose is estimated using the LSTM network, as a series of poses is available, thus integrating the spatial and temporal information. This method performed very well when compared with state-of-the-art approaches. In Brattoli et al. (2017), the human pose was analysed with CNN and LSTM to reveal distinct functional deficits and their restoration during recovery, and the approach proved widely applicable.

Nguyen et al. (2017) modelled the human-to-human multi-modal interactive behaviour with LSTM. This behaviour consists of speech, gaze and gestures which are jointly modelled. To be more specific, the one-directional LSTM was used for on-line model comparison, and the bidirectional input is employed for off-line comparison. For both off-line and on-line prediction tasks, the chosen model yielded better results than the conventional benchmark methods when generating the appropriate overt actions.

For end-to-end sequence learning of actions in video, VideoLSTM was introduced by Li et al. (2018b). However, their approach went the other way around: instead of adapting the data to the model, the authors adapted the model to the data. They argued that, to account for the spatio-temporal nature of video, they should hard-wire the LSTM network with convolutions and spatio-temporal attention, thus creating a convolutional attention LSTM architecture.

In order to tackle the problem of parallelising multidimensional LSTMs on GPUs, Stollenga et al. (2015) proposed the PyraMiD-LSTM, a re-arrangement of the traditional cuboid order of computations in a pyramidal fashion. Experiments have shown that this model is easy to parallelise, especially for 3D data, and that it outperformed the state-of-the-art in pixel-wise image segmentation.

Naturally, there are plenty of other contributions in computer vision (Luo et al. 2018; Feng et al. 2019b; Perrett and Damen 2019; Si et al. 2019; Ma et al. 2016; Liu et al. 2017b; Guo et al. 2018; Baddar and Ro 2019), with or without proposing a variation of the LSTM model to improve performance.

4.4.1 Text recognition

Another field in which LSTM has proved to be especially proficient is text recognition, a known subtask of computer vision. For example, Naz et al. (2016) investigated text recognition possibilities for cursive scripts, specifically Urdu. The challenge imposed by cursive scripts originates from the large number of character shapes, inter- and intra-word overlaps, context sensitivity and diagonality of text. In this study, sliding windows over the text lines are employed for feature extraction, and the resulting vector is input to a multi-dimensional LSTM network. The final output is provided by a connectionist temporal classification layer. In other words, the architecture consists of three stages. This technique achieved the best results reported for the benchmark problems. Note that, a few years earlier, Graves and Schmidhuber (2009) applied a similar architecture with multi-dimensional recurrent networks (Graves et al. 2007) and connectionist temporal classification, which was at that time also a breakthrough with respect to accuracy.

A second study by Naz et al. (2017) on Urdu was conducted a year later, to further improve character recognition of the cursive script. This was done with explicit feature extraction, rather than implicit, by employing a CNN. The CNN extracted lower-level features, convolved the learned kernels with text line images, and finally fed the features to a multi-dimensional LSTM (Graves et al. 2007), which is used as the classifier in this system. The simulations, carried out on a public dataset, showed this architecture again outperformed the state-of-the-art models, including the three-stage model the authors had proposed a year earlier.

The results in Naz et al. (2016, 2017) suggest that creating hybrid architectures provides more accurate classifications than those computed with the vanilla LSTM model. This would explain the wide usage of hybrid models reported in the literature. A similar framework is applied by Bhunia et al. (2019), where a three-stage approach was used. First, stacked convolutional layers extract precise translation-invariant image features. These layers generate feature vectors of varying dimension that are fed into an LSTM network to exploit the spatial dependencies present in text script images. Second, patch weights are obtained via an attention network followed by a softmax layer. Third, the features obtained in the second stage are integrated by employing attention-based dynamic weighting.

Before going to the recognition step, Frinken et al. (2014) performed several text preprocessing tasks. To summarise, after extracting the text lines, the skew angle was determined and removed by rotation. Afterwards, the slant is corrected to normalise the directions of long vertical strokes. After this estimation, a shear transformation is applied, to finally scale all characters both vertically and horizontally. The BLSTM neural network is then employed for keyword spotting. Keyword spotting is a detection task consisting in discovering the presence of specific spoken words in speech signals (Fernández et al. 2007). That study by Fernández et al. used the power of BLSTM to handle information through time and showed that it outperforms the classic HMM in terms of accuracy. Another example of BLSTM’s power is provided in Álvaro et al. (2016), who used it to recognise handwritten mathematical expressions, which they see as collections of strokes; their state-of-the-art model outperformed previous ones. Van Phan and Nakagawa (2016) wanted a highly accurate and quick method for text/non-text classification, for which they experimented with BLSTM and achieved top-tier results, while also identifying several future research challenges. The authors in Zamora-Martínez et al. (2014) used, amongst others, a BLSTM as a text recogniser whose output feeds the language modelling network. In Sachan et al. (2019) the BLSTM network was explored for text classification using both supervised and semi-supervised learning.

In He et al. (2017) the authors proposed the Tensorised LSTM, in which the hidden states are represented by tensors and updated via a cross-layer convolution. The goal of this model is to increase capacity without adding additional parameters and with only a slightly longer runtime. Experiments on the MNIST dataset, in which handwritten digits are to be recognised, showed the potential of the proposed model.

The features that CNNs extract from text as input for an LSTM network can also be fed in both directions, so the CNN-BLSTM hybrid network is also an architecture to be considered. For example, in Toledo et al. (2019) a BLSTM was added to the already existing CNN model from previous research. This enabled the authors to extract semantic categories and thus populate a knowledge database with the document contents. Experimental results showed better performances than the state-of-the-art approaches.

Naturally, CNN is not the only alternative for creating a hybrid model. The HMM can also be utilised. Liwicki and Bunke (2009) were, as reported back in 2009, the first to propose combining such diverse recognition architectures at the decision level in the field of handwritten text line recognition. Finally, just as LSTM can be combined with a connectionist temporal classification output layer, so can BLSTM (Yousfi et al. 2017).

In contrast with time series applications, the vanilla LSTM architecture does not always provide the most accurate results for text recognition. The most successful approaches exploit the input signal in both directions, or create hybrid deep learning architectures.

4.5 Other application domains

Of course, the use of the LSTM architecture is not limited to applications discussed above. A wide variety of problems are suited to the use of this model. In this section, we will illustrate how the model can be applied to this large range of problems.

For example, in Portegys (2010) the maze learning performance of LSTM is compared to two other neural network architectures. A maze is defined as a network of distinctly marked rooms randomly interconnected by doors that open probabilistically. This study investigated two characteristics of the models: the retention of long-term state information and the modular use of the learned information. It appeared that the performance varied. LSTM is capable of learning the context maze tasks with non-modular training. The fact that it does have problems with modular training highlights the problem of using this model for tasks of a dynamic nature which may cause components to change, necessitating retraining.

An application in traffic analysis, namely the analysis of short-term crash risk, is proposed by Bao et al. (2019). In this work, three architectures are used: CNN to extract spatial features, LSTM for the temporal features, and finally convolutional LSTM for the spatio-temporal features, all in a stacked hierarchy. The simulations showed that the hybrid model performs better than standard machine learning approaches at capturing the spatio-temporal characteristics for citywide short-term crash risk prediction.

Several applications are identified in the field of computer science. The authors in Feng et al. (2019a) use an LSTM as part of a denial-of-service and privacy attack detection model, in which the LSTM is focused on the identification of XSS and SQL attacks. This proposed system was not only very accurate, but also acceptably fast. Homayoun et al. (2019) analysed the capability of CNN-LSTM to classify abnormalities related to ransomware activities. In Zuo et al. (2019) LSTM was applied to path planning in network traffic engineering under constrained conditions, where it proved to be a superior method. LSTM also proved useful in terms of information security, as demonstrated by Kang et al. (2019) in their analysis of malware detection and classification.

The network can also be found in the healthcare sector. For example, Pei et al. (2017) adopted LSTM to map small bowel images to the corresponding diameters. In Li et al. (2018a) an architecture was set up to recognise irregular entities in biomedical text. The contribution of Turan et al. (2018) refers to a system that estimates the real-time pose of actively controlled endoscopic capsule robots. The authors in Andersen et al. (2019) developed an approach for automatic detection of atrial fibrillation. These applications all apply combinations of network types, where LSTM learns the temporal relations in the data. In contrast, after several data preprocessing steps, LSTM is part of an abnormal heart sound detection method in the study of Zhang et al. (2019c). Perhaps more impressively, Yi et al. (2019) applied the model in the fight against cancer, identifying anticancer peptides with great success. In the BLSTM context, Steenkiste et al. (2019) demonstrated the value of the architecture in predicting outcomes of blood culture tests. The authors found that prediction power decreased only slightly even when predicting hours ahead.

In the field of sound recognition, Wöllmer and Schuller (2014) used BLSTM in combination with bottleneck feature generation to develop a front-end allowing the production of context-sensitive probabilistic feature vectors of arbitrary size for speech recognition. Besides speech, all kinds of sounds can be detected. In Laffitte et al. (2019) several neural networks were compared with regard to the detection of screams and shouts. All tested models performed virtually equally well; however, a distinction could be made when it came to speech recognition. The tests again highlight the temporal structure of speech, since recurrent neural networks outperformed the other network types.

Eck and Schmidhuber (2002) wondered whether LSTM would be a good candidate to learn how to compose music, since standard recurrent neural networks often lack global coherence. Their results show that LSTM can learn to play blues music, while also composing novel melodies in that style. The model also does not drift from the learned structure.

Airport runways have to be of high quality to ensure a safe landing. In this capacity, the authors in Cai et al. (2019) developed GrooveNet: a classification model for identifying shallow and worn grooves in a runway consisting of two LSTM layers, two dropout layers and one fully connected layer. This methodology proved not only to be very robust, but also very accurate.

A perhaps more exotic application of LSTM concerns block-sparsity recovery (Lyu et al. 2019). Here, the authors employed the LSTM framework to better capture the correlations and dependencies among nonzero elements of signals. This proposed method proved to be superior compared to alternative methods. Finally, Wang et al. (2019a) proposed a novel deep learning waveform recognition method, using two-channel CNNs combined with BLSTM. The contributors claim that this approach has a significantly better performance than the state-of-the-art.

5 Implementing LSTM in Tensorflow

As discussed above, LSTM can be utilised in a wide variety of situations. To run the model on your personal computer, several code libraries have been developed. In this section, we shall briefly describe how to build an LSTM neural network in Python, specifically using the Tensorflow framework, with code examples. Note that, for the sake of relevance, we limit ourselves to the most important code extracts.

First of all, the required code libraries must be imported.

figure a
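The original listing is not reproduced here; as an illustration, a minimal sketch of the imports (assuming a recent TensorFlow release, with illustrative constants for the window size and layer width) could be:

```python
import numpy as np          # numeric arrays for inputs, labels and accuracy
import tensorflow as tf     # the deep learning framework used in this section

n_input = 3     # number of symbols fed to the network at each step
n_hidden = 512  # number of LSTM units (an illustrative choice)
```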

Let us suppose we want to train an LSTM to predict the next word of a sample short story. If we feed it sequences of three symbols from the text as inputs, each with the following symbol as label, the neural network will eventually learn to predict the next symbol correctly.

The LSTM only supports numeric inputs. A way to convert symbols to numeric inputs is to assign a unique sequential number to each symbol based on its order of appearance. The reverse dictionary is also generated, since it will be used to decode the outputs.

figure b
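The corresponding listing appears as a figure in the original; the mapping can be sketched in plain Python as follows (the sample sentence below is only a stand-in for the short story):

```python
def build_dataset(words):
    # Assign each unique symbol a sequential id in order of first appearance.
    dictionary = {}
    for word in words:
        if word not in dictionary:
            dictionary[word] = len(dictionary)
    # The reverse dictionary decodes predicted ids back into symbols.
    reverse_dictionary = {idx: word for word, idx in dictionary.items()}
    return dictionary, reverse_dictionary

story = "long ago the mice had a general council to consider the cat".split()
dictionary, reverse_dictionary = build_dataset(story)
```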

The LSTM produces an output vector of probabilities of the next symbol, normalised by the softmax function. The index of the element with the highest probability is the predicted index of the symbol in the reverse dictionary. This model is at the core of the application, and is very simple to implement in Tensorflow.

figure d
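The model listing is shown as a figure in the original, which used TF1-style RNN cells; a minimal sketch of an equivalent model in the Keras API (the layer width and vocabulary size are illustrative, not the paper's values) might read:

```python
import tensorflow as tf

vocab_size = 112  # illustrative vocabulary size of the sample story

# A single LSTM layer reads the window of three symbol ids (one scalar
# feature per time step), and a dense softmax layer turns its final
# state into a probability distribution over the vocabulary.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(512),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

probs = model(tf.zeros([1, 3, 1]))  # a batch of one 3-step window
```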

In the training process, at each step, three symbols are retrieved from the training data to form the input vector. These three symbols are converted to numeric values.

figure e
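As a sketch of this step (with a hypothetical toy dictionary standing in for the one built from the full story):

```python
training_data = "the mice had a general council to consider".split()
# Toy dictionary: each unique word gets an id in order of appearance.
dictionary = {word: idx for idx, word in enumerate(dict.fromkeys(training_data))}

offset = 2  # position of the current training window in the text
symbols_in_keys = [dictionary[training_data[i]] for i in range(offset, offset + 3)]
# the words "had", "a", "general" are now represented by their numeric ids
```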

The training label is a one-hot vector coming from the symbol after the three input symbols.

figure f
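A sketch of the one-hot encoding (with an illustrative vocabulary size and label index):

```python
import numpy as np

vocab_size = 8   # illustrative vocabulary size
label_index = 5  # id of the symbol that follows the three input symbols

# One-hot training label: all zeros except at the label's index.
onehot_label = np.zeros(vocab_size, dtype=np.float32)
onehot_label[label_index] = 1.0
```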

After reshaping the input to fit the feed dictionary, the optimisation runs.

figure g
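The reshaping step can be sketched as follows; in the TF1-style original, the reshaped array would then be passed through a feed dictionary to the optimiser's run call:

```python
import numpy as np

symbols_in_keys = [2, 3, 4]  # ids of the three input symbols

# Reshape to [batch size, time steps, features] as expected by the LSTM.
x = np.reshape(np.array(symbols_in_keys), [-1, 3, 1])
```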

The accuracy and loss are accumulated to monitor the progress of the training.

figure h
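The bookkeeping can be sketched as plain running totals (the per-step values below are placeholders for what a real training step would return):

```python
display_step = 1000  # report progress every display_step iterations
acc_total, loss_total = 0.0, 0.0

for step in range(display_step):
    # Placeholder values standing in for a real training step's output.
    acc, loss = 0.9, 0.05
    acc_total += acc
    loss_total += loss

average_acc = acc_total / display_step
average_loss = loss_total / display_step
```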

The cost is the cross entropy between the label and the prediction, minimised using a gradient descent optimiser at a learning rate of 0.001.

figure j
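The cost itself can be sketched numerically (the label and prediction vectors below are illustrative):

```python
import numpy as np

learning_rate = 0.001  # step size for the gradient descent optimiser

# Cross entropy between a one-hot label and a predicted distribution.
label = np.array([0.0, 1.0, 0.0])
prediction = np.array([0.1, 0.8, 0.1])
cost = -np.sum(label * np.log(prediction))  # equals -log(0.8)
```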

The accuracy of the LSTM can be improved by adding layers.

figure k
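A sketch of such a deeper model in the Keras API (again with illustrative sizes): intermediate recurrent layers must return the full sequence so the next LSTM receives one vector per time step.

```python
import tensorflow as tf

vocab_size = 112  # illustrative vocabulary size

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(512, return_sequences=True),  # emits all time steps
    tf.keras.layers.LSTM(512),                         # emits final state only
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

probs = model(tf.zeros([1, 3, 1]))
```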

After training, we can test the network by predicting the word that follows the input words “had”, “a” and “general”.

figure l
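Decoding a prediction can be sketched as follows, with a mocked probability vector standing in for the trained network's output (assuming the sample story continues with “council”):

```python
import numpy as np

words = "long ago the mice had a general council".split()
dictionary = {w: i for i, w in enumerate(words)}
reverse_dictionary = {i: w for w, i in dictionary.items()}

# Mocked output distribution for the input window ("had", "a", "general").
onehot_pred = np.zeros(len(words))
onehot_pred[dictionary["council"]] = 0.9

# The most probable index is looked up in the reverse dictionary.
predicted_word = reverse_dictionary[int(np.argmax(onehot_pred))]
```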

We can also test the accuracy on a batch of samples.

figure m
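Batch accuracy amounts to comparing the most probable predicted symbol against the label for each sample (the small batch below is illustrative):

```python
import numpy as np

# Fraction of samples whose most probable predicted symbol matches the label.
predictions = np.array([[0.1, 0.8, 0.1],
                        [0.6, 0.3, 0.1],
                        [0.2, 0.2, 0.6]])
labels = np.array([1, 0, 2])
accuracy = float(np.mean(np.argmax(predictions, axis=1) == labels))
```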

As the reader can notice, only a handful of lines of code are needed in the Tensorflow framework in order to predict the next word in a sequence. Alternatively, one can also opt for Keras or PyTorch to implement LSTM-based solutions.

Table 1 Recommendations: LSTM-dominated or integrated networks

6 Conclusions

In this paper, we have reviewed the recent applications of LSTM reported in the literature. Our survey has illustrated the ability of this recurrent system to handle a wide variety of problems, including time series forecasting, text recognition, natural language processing, image and video captioning, sentiment analysis and computer vision. When modelling most of these problems, we found that a common practice is to hybridise CNNs with LSTM with the aim of obtaining an optimal performance. In such hybrid models, convolution and pooling layers are used to reduce the problem dimensionality while greatly suppressing the redundancy in representations. As the choice of networks to integrate is vast, Table 1 shows a summary per application domain with recommended network types. Note that the recommendations are limited to either LSTM-dominated networks or (standard) integrated architectures. As discussed in Sect. 4, further customisation of such architectures can always be applied to improve accuracy. Keep in mind, first, that, as described in Yu et al. (2019), there is no variant surpassing the standard LSTM in all aspects, and integrated networks could also use improvements. Second, our recommendations are based on the results reported in the literature, but it is wise to remember the heterogeneity of problems; as such, results will vary. Therefore, our recommendations should be taken with a grain of salt.

Together with relevant LSTM applications, the fundamental underpinnings behind this recurrent system are detailed, including its main components, their interaction with each other, and a gradient-based method to compute the weight matrix. The experimental study in Greff et al. (2017) concluded that the forget gate and the output transfer function are the most critical components of the LSTM block, whereas the learning rate is the most important hyperparameter in the backpropagation algorithm. Hence, further studying these components may lead to LSTM variants with improved prediction capabilities. Another equally relevant research line refers to less computationally demanding learning procedures to adjust the learnable parameters.