1 Introduction

Recurrent or very deep neural networks are difficult to train, as they often suffer from the exploding/vanishing gradient problem (Hochreiter 1991; Kolen and Kremer 2001). The LSTM architecture (Hochreiter and Schmidhuber 1997a) was introduced to overcome this shortcoming when learning long-term dependencies. The learning ability of LSTM has had both a practical and a theoretical impact on several fields, establishing it as a state-of-the-art model. Google uses the model for its speech recognition (Sak et al. 2015) and to improve machine translations on Google Translate (Wu et al. 2016; Metz 2016), Amazon employs it to improve Alexa’s functionalities (Vogels 2016), and Facebook put it to use for over 4 billion LSTM-based translations per day as of 2017 (Pino et al. 2017).

Due to its high applicability and popularity, this neural architecture has also found its way into the world of gaming. For example, Google’s Deepmind created AlphaStar (The AlphaStar Team 2019b), an artificial intelligence designed to play Starcraft II. During its development, AlphaStar started to master the game (The AlphaStar Team 2019a), climbing the global rankings in a way not seen before. Research in this field is of course not limited to Starcraft II, as the research interest spans the entire RTS gaming genre due to its complexity (Zhang et al. 2019e). Applying reinforcement learning in yet another setting, OpenAI succeeded in building a robot hand called Dactyl, which taught itself to manipulate objects in a human-like fashion (Sabina Aouf 2019).

Of course, a neural architecture would not be so widely adopted in practice without a strong theoretical foundation. An extensive review of several LSTM variants and their performance relative to the so-called vanilla model was recently conducted by Greff et al. (2017). The vanilla LSTM is interpreted as the original LSTM block with the addition of the forget gate and peephole connections. In total, eight variants were identified for experimentation. In a nutshell, the vanilla architecture performs well on a number of tasks, and none of the eight investigated variants significantly outperforms the others. This justifies why most applications found in the literature employ the vanilla LSTM.

A recent study conducted by Yu et al. (2019) provides an overview of the LSTM cell, its functionalities and different architectures. A distinction is made between LSTM-dominated neural networks and integrated LSTM networks, the latter of which combine LSTM with other components to take advantage of its properties, effectively forming hybrid neural networks. As will be illustrated in Sect. 4, this work complements the review study of Yu et al. (2019) with a broad applications overview showing where integrated networks are most useful. As each problem is largely unique, there is often a better solution than employing solely the standard LSTM model.

Therefore, in this paper, we present a comprehensive review of the LSTM model that complements the theoretical findings presented in Greff et al. (2017) and Yu et al. (2019). Our review study focuses on three main directions, moving from theory to practice. In the first part, we broadly describe the LSTM components, how they interact with each other, and how the learnable parameters can be estimated. These considerations are particularly relevant for readers who want to master the model from a theoretical perspective rather than approach it only as practitioners. In the second part, we outline interesting applications that show the potential of LSTM as a state-of-the-art method within the deep learning field. Interesting application domains include text recognition, time series forecasting, natural language processing, computer vision, and image and video captioning, among others. In the last part, we present a code example in Tensorflow that aims to predict the next word of a sample short story.

As for the search criteria on which this literature review is based, we evaluated 409 papers containing the terms “Long short-term memory” or “LSTM” in either the title, abstract or keywords. Only non-paid journals with a recognised peer-review system were considered. Moreover, we included papers presented at mainstream conferences such as the Conference on Neural Information Processing Systems, the Conference on Computer Vision and Pattern Recognition, the AAAI Conference on Artificial Intelligence, the Conference on Empirical Methods in Natural Language Processing, etc. Having constructed our starting library, we selected papers making a relevant contribution in terms of theory or practice in which LSTM played a key role. This does not imply, however, that papers not included in our review fail to fulfil this criterion; we simply could not cover the whole literature concerning the LSTM model. For example, a rough search carried out in February 2020 using the criteria mentioned above reported 11,931 documents indexed by Scopus. Therefore, we have given priority to the most recent contributions.

The remainder of this paper is structured as follows. In Sect. 2 we describe the theoretical foundations behind the LSTM model, followed by a concise description of a procedure to adjust the learnable parameters in Sect. 3. Section 4 zooms in on different applications of the model as found in the literature. An example implementation in Tensorflow can be found in Sect. 5. Finally, we summarise our conclusions in Sect. 6.

2 Long short-term memory

The LSTM model (Hochreiter and Schmidhuber 1997a) is a powerful recurrent neural system specially designed to overcome the exploding/vanishing gradient problems that typically arise when learning long-term dependencies, even when the minimal time lags are very long (Hochreiter and Schmidhuber 1997b). These problems are prevented by using a constant error carousel (CEC), which maintains the error signal within each unit’s cell. Such cells are in fact recurrent networks themselves, with the interesting architectural property that the CEC is extended with additional features, namely the input gate and the output gate, which together form the memory cell. The self-recurrent connections indicate feedback with a lag of one time step.

A vanilla LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. This forget gate was not initially a part of the LSTM network, but was proposed by Gers et al. (2000) to allow the network to reset its state. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information associated with the cell. In the remainder of this section, LSTM will refer to the vanilla version as this is the most popular LSTM architecture (Greff et al. 2017). This does not imply, however, that it is also the superior one in every situation. This will be elaborated on in Sect. 4.

In short, the LSTM architecture consists of a set of recurrently connected sub-networks, known as memory blocks. The idea behind the memory block is to maintain its state over time and regulate the information flow through non-linear gating units. Figure 1 displays the architecture of a vanilla LSTM block, which involves the gates, the input signal \(x^{(t)}\), the output \(y^{(t)}\), the activation functions, and peephole connections (Gers and Schmidhuber 2000). The output of the block is recurrently connected back to the block input and all of the gates.

Fig. 1 Architecture of a typical vanilla LSTM block

Aiming to clarify how the LSTM model works, let us assume a network comprised of N processing blocks and M inputs. The forward pass in this recurrent neural system is described below.

Block input This step is devoted to updating the block input component, which combines the current input \(x^{(t)}\) and the output of that LSTM unit \(y^{(t-1)}\) in the last iteration. This can be done as depicted below:

$$\begin{aligned} z^{(t)} = g(W_z x^{(t)} + R_z y^{(t-1)} + b_z) \end{aligned}$$
(1)

where \(W_z\) and \(R_z\) are the weights associated with \(x^{(t)}\) and \(y^{(t-1)}\), respectively, while \(b_z\) stands for the bias weight vector.

Input gate In this step, we update the input gate that combines the current input \(x^{(t)}\), the output of that LSTM unit \(y^{(t-1)}\) and the cell value \(c^{(t-1)}\) in the last iteration. The following equation shows this procedure:

$$\begin{aligned} i^{(t)} = \sigma (W_i x^{(t)} + R_i y^{(t-1)} + p_i \odot c^{(t-1)} + b_i) \end{aligned}$$
(2)

where \(\odot \) denotes point-wise multiplication of two vectors, \(W_i\), \(R_i\) and \(p_i\) are the weights associated with \(x^{(t)}\), \(y^{(t-1)}\) and \(c^{(t-1)}\), respectively, while \(b_i\) represents the bias vector associated with this component.

In the previous steps, the LSTM layer determines which information should be retained in the network’s cell states \(c^{(t)}\). This includes the selection of the candidate values \(z^{(t)}\) that could potentially be added to the cell states, and the activation values \(i^{(t)}\) of the input gates.

Forget gate In this step, the LSTM unit determines which information should be removed from its previous cell states \(c^{(t-1)}\). Therefore, the activation values \(f^{(t)}\) of the forget gates at time step t are calculated based on the current input \(x^{(t)}\), the outputs \(y^{(t-1)}\) and the state \(c^{(t-1)}\) of the memory cells at the previous time step \((t-1)\), the peephole connections, and the bias terms \(b_f\) of the forget gates. This can be done as follows:

$$\begin{aligned} f^{(t)} = \sigma (W_f x^{(t)} + R_f y^{(t-1)} + p_f \odot c^{(t-1)} + b_f) \end{aligned}$$
(3)

where \(W_f\), \(R_f\) and \(p_f\) are the weights associated with \(x^{(t)}\), \(y^{(t-1)}\) and \(c^{(t-1)}\), respectively, while \(b_f\) denotes the bias weight vector.

Cell This step computes the cell value, which combines the block input \(z^{(t)}\), the input gate \(i^{(t)}\) and the forget gate \(f^{(t)}\) values, with the previous cell value. This can be done as depicted below:

$$\begin{aligned} c^{(t)} = z^{(t)} \odot i^{(t)} + c^{(t-1)} \odot f^{(t)}. \end{aligned}$$
(4)

Output gate This step calculates the output gate, which combines the current input \(x^{(t)}\), the output of that LSTM unit \(y^{(t-1)}\) in the last iteration, and the current cell value \(c^{(t)}\). This can be done as depicted below:

$$\begin{aligned} o^{(t)} = \sigma ( W_o x^{(t)} + R_o y^{(t-1)} + p_o \odot c^{(t)} + b_o ) \end{aligned}$$
(5)

where \(W_o\), \(R_o\) and \(p_o\) are the weights associated with \(x^{(t)}\), \(y^{(t-1)}\) and \(c^{(t)}\), respectively, while \(b_o\) denotes the bias weight vector.

Block output Finally, we calculate the block output, which combines the current cell value \(c^{(t)}\) with the current output gate value as follows:

$$\begin{aligned} y^{(t)} = g(c^{(t)}) \odot o^{(t)}. \end{aligned}$$
(6)

In the above steps, \(\sigma \), g and h denote point-wise non-linear activation functions. The logistic sigmoid \(\sigma (x) = \frac{1}{1+e^{-x}}\) is used as the gate activation function, while the hyperbolic tangent \(g(x) = h(x) = \tanh (x)\) is often used as the block input and output activation function.
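To make the forward pass concrete, the following is a minimal NumPy sketch of Eqs. (1)–(6) for a single layer of LSTM blocks. All weights and the toy sizes (M = 3 inputs, N = 4 blocks, five time steps) are hypothetical, chosen only for illustration.

```python
import numpy as np

M, N = 3, 4  # M inputs, N blocks (hypothetical toy sizes)
rng = np.random.default_rng(42)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))  # logistic sigmoid gate activation

# randomly initialised illustrative weights
Wz, Wi, Wf, Wo = (rng.standard_normal((N, M)) * 0.1 for _ in range(4))  # input weights
Rz, Ri, Rf, Ro = (rng.standard_normal((N, N)) * 0.1 for _ in range(4))  # recurrent weights
bz, bi, bf, bo = (np.zeros(N) for _ in range(4))                        # bias vectors
p_i, p_f, p_o = (rng.standard_normal(N) * 0.1 for _ in range(3))        # peephole weights

def lstm_step(x, y_prev, c_prev):
    z = np.tanh(Wz @ x + Rz @ y_prev + bz)               # Eq. (1): block input
    i = sig(Wi @ x + Ri @ y_prev + p_i * c_prev + bi)    # Eq. (2): input gate
    f = sig(Wf @ x + Rf @ y_prev + p_f * c_prev + bf)    # Eq. (3): forget gate
    c = z * i + c_prev * f                               # Eq. (4): cell state
    o = sig(Wo @ x + Ro @ y_prev + p_o * c + bo)         # Eq. (5): output gate (peephole reads current c)
    y = np.tanh(c) * o                                   # Eq. (6): block output
    return y, c

y, c = np.zeros(N), np.zeros(N)
for x in rng.standard_normal((5, M)):  # unroll over five time steps
    y, c = lstm_step(x, y, c)
print(y.shape)  # (4,)
```

Note that the output is bounded, since \(y^{(t)}\) is a product of a tanh and a sigmoid term.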

It seems appropriate to mention that the functionality of this architecture inspired the authors in Kumar Srivastava et al. (2015) to enhance the training of very deep networks. The gating mechanism was employed in the so-called highway networks to allow for an unimpeded information flow across many layers. This can be considered another proof of concept showing that the gating mechanism works.

Even though the vanilla LSTM already performs very well, several works have studied possibilities to improve its performance. For example, Su and Kuo (2019) developed the Extended LSTM model, further improving prediction accuracy in several application fields by enhancing the memory capability. This shows that theoretical improvements can still be made to an already state-of-the-art architecture. In the work of Bayer et al. (2009), the search for improvements of the model was already ongoing. The authors looked for an architectural alternative to LSTM to optimise sequence learning capabilities. They succeeded in evolving memory cell structures capable of learning context-sensitive formal languages through gradient descent, in many ways comparable to LSTM in terms of performance. The authors in Bellec et al. (2018) built upon recurrent networks of spiking neurons, developing Long short-term memory Spiking Neural Networks (LSNN) that include adapting neurons. In tests where the LSNN was comparable in size to an LSTM network, its performance proved very similar to that of LSTM. This is yet another illustration of how accurate LSTM is and remains.

3 How to train your model

The LSTM model described in Sect. 2 uses the full gradient training presented by Graves and Schmidhuber (2005a) to adjust the learnable parameters (weights) of the network. More specifically, Backpropagation Through Time (Werbos 1990) is used to compute the gradients of the weights that connect the different components in the network. During the backward pass, the cell state \(c^{(t)}\) receives gradients from \(y^{(t)}\) as well as from the next cell state \(c^{(t+1)}\). These gradients are accumulated before being backpropagated to the current layer.

In the last iteration T, the change \(\delta _{y}^{(T)}\) corresponds with the network error \(\partial E / \partial y^{(T)}\), where E denotes the loss function. Otherwise, \(\delta _{y}^{(t)}\) combines the vector of delta values \(\varDelta ^{(t)}\) passed down from the layer above with the recurrent dependencies. This can be done as follows:

$$\begin{aligned} \delta _{y}^{(t)} = \varDelta ^{(t)} + R_z^T \delta _{z}^{(t+1)} + R_i^T \delta _{i}^{(t+1)} + R_f^T \delta _{f}^{(t+1)} + R_o^T \delta _{o}^{(t+1)}. \end{aligned}$$
(7)

In a second step, the delta values associated with the gates and the memory cell are calculated as:

$$\begin{aligned} \delta _{o}^{(t)} = \delta _y^{(t)} \odot h(c^{(t)}) \odot \sigma '({\hat{o}}^{(t)}) \end{aligned}$$
(8)
$$\begin{aligned} \delta _c^{(t)} = \delta _y^{(t)} \odot o^{(t)} \odot h'(c^{(t)}) + p_o \odot \delta _{o}^{(t)} + p_i \odot \delta _{i}^{(t+1)} + p_f \odot \delta _f^{(t+1)} + \delta _c^{(t+1)} \odot f^{(t+1)} \end{aligned}$$
(9)
$$\begin{aligned} \delta _{f}^{(t)} = \delta _c^{(t)} \odot c^{(t-1)} \odot \sigma '({\hat{f}}^{(t)}) \end{aligned}$$
(10)
$$\begin{aligned} \delta _{i}^{(t)} = \delta _c^{(t)} \odot z^{(t)} \odot \sigma '({\hat{i}}^{(t)}) \end{aligned}$$
(11)
$$\begin{aligned} \delta _{z}^{(t)} = \delta _c^{(t)} \odot i^{(t)} \odot g'({\hat{z}}^{(t)}) \end{aligned}$$
(12)

where \({\hat{o}}^{(t)}\), \({\hat{i}}^{(t)}\), \({\hat{z}}^{(t)}\) and \({\hat{f}}^{(t)}\) denote the pre-activation values of the output gate, the input gate, the block input and the forget gate, respectively, i.e. the raw values before being transformed by the corresponding activation function.

As pointed out in Greff et al. (2017), the delta values for the inputs are only required if there is a layer below that needs to be trained, thus:

$$\begin{aligned} \delta _{x}^{(t)} = W_z^T \delta _{z}^{(t)} + W_i^T \delta _{i}^{(t)} + W_f^T \delta _{f}^{(t)} + W_o^T \delta _{o}^{(t)}. \end{aligned}$$
(13)

Finally, the gradients for the weights are calculated as follows:

$$\begin{aligned} \delta _{W_{*}}&= \sum _{t=0}^T \delta _{*}^{(t)} \otimes x^{(t)},&\delta _{p_{i}}&= \sum _{t=0}^{T-1} c^{(t)} \odot \delta _{i}^{(t+1)}, \\ \delta _{R_{*}}&= \sum _{t=0}^{T-1} \delta _{*}^{(t+1)} \otimes y^{(t)},&\delta _{p_{f}}&= \sum _{t=0}^{T-1} c^{(t)} \odot \delta _{f}^{(t+1)}, \\ \delta _{b_{*}}&= \sum _{t=0}^T \delta _{*}^{(t)},&\delta _{p_{o}}&= \sum _{t=0}^T c^{(t)} \odot \delta _{o}^{(t)} \end{aligned}$$

such that \(\otimes \) represents the outer product of two vectors, whereas \(*\) can be any component associated with the weights: the block input \({\hat{z}}\), the input gate \({\hat{i}}\), the forget gate \({\hat{f}}\) or the output gate \({\hat{o}}\).
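As a hedged sketch of this training procedure, the NumPy snippet below implements the forward pass of Sect. 2 and the backward recursions of Eqs. (7)–(12), accumulating \(\delta_{W_z}\), and checks the result against a central finite-difference approximation. The toy loss \(E = \frac{1}{2}\sum_t \Vert y^{(t)}\Vert^2\) (so \(\varDelta^{(t)} = y^{(t)}\)) and all sizes and parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, T = 3, 4, 5  # inputs, blocks, time steps (hypothetical)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

W = {k: rng.standard_normal((N, M)) * 0.1 for k in "zifo"}  # input weights
R = {k: rng.standard_normal((N, N)) * 0.1 for k in "zifo"}  # recurrent weights
b = {k: np.zeros(N) for k in "zifo"}                        # biases
p = {k: rng.standard_normal(N) * 0.1 for k in "ifo"}        # peepholes
xs = rng.standard_normal((T, M))

def forward(xs):
    y, c, cache = np.zeros(N), np.zeros(N), []
    for x in xs:
        c_prev, y_prev = c, y
        z = np.tanh(W["z"] @ x + R["z"] @ y_prev + b["z"])                # Eq. (1)
        i = sig(W["i"] @ x + R["i"] @ y_prev + p["i"] * c_prev + b["i"])  # Eq. (2)
        f = sig(W["f"] @ x + R["f"] @ y_prev + p["f"] * c_prev + b["f"])  # Eq. (3)
        c = z * i + c_prev * f                                            # Eq. (4)
        o = sig(W["o"] @ x + R["o"] @ y_prev + p["o"] * c + b["o"])       # Eq. (5)
        y = np.tanh(c) * o                                                # Eq. (6)
        cache.append((x, c_prev, z, i, f, c, o, y))
    return cache

def loss(cache):  # toy loss E = 0.5 * sum_t ||y^(t)||^2, hence Delta^(t) = y^(t)
    return 0.5 * sum((st[-1] ** 2).sum() for st in cache)

def grad_Wz(cache):  # backward pass, Eqs. (7)-(12), accumulating delta_{W_z}
    dz = di = df = do = dc = np.zeros(N)
    f_next, dWz = np.zeros(N), np.zeros((N, M))
    for x, c_prev, z, i, f, c, o, y in reversed(cache):
        # dz, di, df, do still hold the (t+1) deltas when used on the right-hand side
        dy = y + R["z"].T @ dz + R["i"].T @ di + R["f"].T @ df + R["o"].T @ do  # Eq. (7)
        do = dy * np.tanh(c) * o * (1 - o)                                      # Eq. (8)
        dc = (dy * o * (1 - np.tanh(c) ** 2) + p["o"] * do
              + p["i"] * di + p["f"] * df + dc * f_next)                        # Eq. (9)
        df = dc * c_prev * f * (1 - f)                                          # Eq. (10)
        di = dc * z * i * (1 - i)                                               # Eq. (11)
        dz = dc * i * (1 - z ** 2)                                              # Eq. (12)
        dWz += np.outer(dz, x)                                                  # delta_{W_z} term
        f_next = f
    return dWz

analytic = grad_Wz(forward(xs))

# central finite-difference check of dE/dW_z
eps, num = 1e-6, np.zeros((N, M))
for a in range(N):
    for j in range(M):
        W["z"][a, j] += eps; up = loss(forward(xs))
        W["z"][a, j] -= 2 * eps; dn = loss(forward(xs))
        W["z"][a, j] += eps
        num[a, j] = (up - dn) / (2 * eps)

print(np.max(np.abs(analytic - num)))  # close to zero (finite-difference error only)
```

The same recursions yield the gradients for the remaining weight matrices, biases and peepholes by substituting the corresponding delta values into the summations above.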

Alternatively, one can train the model with Evolino (Schmidhuber et al. 2007; Wierstra et al. 2005). This method evolves the weights of the nonlinear hidden nodes of recurrent neural networks while computing optimal linear mappings from the hidden states to the output.

4 Relevant applications

The LSTM network is applied in a wide array of problem domains, both individually and in combination with other deep learning architectures. As previously discussed, LSTM is one of the most advanced networks to process temporal sequences. For this reason, the vanilla LSTM is still one of the most popular network choices, even though it is possible to combine it with other networks to create hybrid models. LSTM is well suited to handle time series predictions, but also any other problem that requires temporal memory.

This section gives an overview of the topics the model is most suitable for and how it can be used to solve complex problems. In that regard, we will discuss the applications per problem domain.

4.1 Time series prediction

When it comes to temporal sequences in data, time series data immediately comes to mind. Still, this is a broad concept. In the more literal sense of time series predictions, the LSTM model has been applied to financial market predictions in, for example, Fischer and Krauss (2018) and Yan and Ouyang (2017). Due to complex features such as non-linearity, non-stationarity and sequence correlation, financial data pose a huge forecasting challenge. It was shown by Fischer and Krauss, however, that the LSTM network outperforms more traditional benchmarks: the random forest, a standard deep neural network and a standard logistic regression.

Sagheer and Kotb (2019) confirmed this finding that LSTM outperforms standard approaches when predicting petroleum production. In their research, they stacked multiple layers of LSTM blocks on top of one another in a hierarchical fashion. This increased the model’s ability to process temporal tasks and enabled it to better capture the structure of data sequences. Attempting to forecast the oil market price, the authors in Cen and Wang (2019) came to the same conclusions. The proposed prediction model, composed of the vanilla LSTM architecture, proved to be superior. Liu (2019) compared LSTM with other models, like support vector machines. This research in estimating financial stock volatility not only showed it was relatively easy to calibrate LSTM, but also that it results in accurate predictions for even large time intervals.

In Rodrigues et al. (2019) the authors used LSTM to model the time series observations and later on improve predictions by combining this with text data input. Using this technique, they attempted to predict taxi demand in New York, even though the proposed method is generalisable for other applications as well. This is an important benefit of the architecture, as it was empirically shown that the forecast error can be greatly reduced with this approach.

Receiving a time series as input does not necessarily mean the model will predict the next values in the series, as it can also be used to train a classifier. This is the case for the fault diagnosis framework in Lei et al. (2019), where the authors used an LSTM-based framework for condition monitoring of wind turbines using only raw time series data gathered by one or more sensors. They explicitly made this choice to avoid a heavy reliance on expert knowledge. LSTM was utilised to capture long-term dependencies in the data to perform proper fault classification. Experiments in this study have shown that this framework is not only robust, but also outperforms state-of-the-art methods. Other fault classification works include batteries in electric vehicles (Hong et al. 2019), where LSTM predicts the battery voltage levels, laying the groundwork for fault classification, and electro-mechanical actuators in aircraft (Yang et al. 2019) based on recorded sensor data. The study of Saeed et al. (2020) presented the steps and process required to create a flexible fault diagnosis model, in which LSTM plays a pivotal role in the statistical analysis. The authors tested their model in the nuclear energy domain.

As another LSTM-based classification example, Uddin (2019) demonstrated that the model can also detect which activity was performed given input data from multiple wearable healthcare sensors obtained via an edge device like a laptop. This sensor data was fed to LSTM with which twelve different human activities were modelled. Again, the proposed method was proven to be robust and achieve better results than the reported traditional models.

Given several data points produced by a number of sensors, Elsheikh et al. (2019) predicted the remaining useful life (RUL) of physical systems, for example, production resources. This was done by proposing a new bidirectional LSTM (BLSTM) architecture. The term bidirectional in recurrent neural networks comes from the idea of processing the input sequence in both directions: forwards and backwards. From these two hidden layers, the output layer simultaneously receives information about both past and future states (Schuster and Paliwal 1997). The same concept can be applied to the LSTM network, as it is a recurrent network that takes sequences as input. Graves and Schmidhuber (2005b) presented the BLSTM network, comparing it with the unidirectional variant in a classification context. It was already confirmed that LSTM outperforms traditional recurrent neural networks, but the findings from this research indicated that BLSTM can achieve even better results.
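The bidirectional idea can be sketched independently of the LSTM equations: run one recurrent pass forwards and one backwards over the same sequence, then concatenate the two hidden states at each time step. The snippet below uses a generic tanh recurrence for brevity (a real BLSTM would substitute the LSTM block of Sect. 2); all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, T = 3, 4, 6  # inputs, hidden units, time steps (illustrative)

# separate parameters for the forward and backward passes
W_fw, U_fw = rng.standard_normal((N, M)) * 0.1, rng.standard_normal((N, N)) * 0.1
W_bw, U_bw = rng.standard_normal((N, M)) * 0.1, rng.standard_normal((N, N)) * 0.1

def run(xs, Wx, Uh):
    """Generic tanh recurrence; a BLSTM would use the LSTM block here."""
    h, out = np.zeros(N), []
    for x in xs:
        h = np.tanh(Wx @ x + Uh @ h)
        out.append(h)
    return np.stack(out)

xs = rng.standard_normal((T, M))
h_fwd = run(xs, W_fw, U_fw)              # processes t = 0 .. T-1
h_bwd = run(xs[::-1], W_bw, U_bw)[::-1]  # processes t = T-1 .. 0, re-aligned
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)  # (T, 2N): past and future context per step
print(h_bi.shape)  # (6, 8)
```

Each row of `h_bi` thus summarises the sequence up to that time step in one half and from that time step onwards in the other.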

Since the problem at hand in Elsheikh et al. (2019) is only about the RUL, the authors were not interested in intermediate predictions. Therefore, the input sequence is not processed in both directions simultaneously. Rather, the forward direction is processed first, after which the LSTM final states initialise the backward processing cells. This forces the network to produce two different yet linked mappings of the data to the RUL. Since the initial wear of the physical system is usually unknown, the network is trained to anticipate requirements such as replacement of parts. This method outperforms the conventional network types according to the conducted experiments.

Li et al. (2019) presented a lithium-ion battery RUL prediction based on the empirical mode decomposition algorithm in combination with a novel Elman-LSTM hybrid network. First, empirical mode decomposition extracts both high and low frequency signals from the input data. The Elman neural network was introduced to the solution as it has the capability of short-term memory (Li et al. 2014); it is included to capture capacity recovery in certain cycles of the batteries. With the long-term capabilities of LSTM, the battery degradation is regarded as a time series. With both models working together, capturing both degradation and recovery characteristics, the battery RUL is predicted. Experiments show that this novel method offers superior performance in comparison to other state-of-the-art models.

As shown in Li et al. (2019), predictions can be improved by combining different model architectures. Not only can these models work side by side, the input sequence can also be preprocessed for LSTM by another (deep learning) architecture, again forming a hybrid model. Wu et al. (2019) used ensemble empirical mode decomposition to select the proper intrinsic mode functions, which then serve as input for the LSTM model to predict crude oil price movements. This proved to be a very effective methodology, even when the number of decomposition results varies.

As the reader can notice from the applications above, LSTM is an appropriate model to deal with time series data. But these are of course not all contributions which can be found in the literature. Many other applications exist for time series prediction scenarios, like emotion ratings (Ringeval et al. 2015), topic evolution in text streams (Lu et al. 2019), building energy usage (Wang et al. 2019b), carbon emission levels (Huang et al. 2019b), flight delays (McCarthy et al. 2019), financial trading strategy selection (Sang and Pierro 2019), air quality monitoring (Zhao et al. 2019a), wind speed forecasting (Chen et al. 2019b), precipitation nowcasting (Xingjian et al. 2015), real-time occupancy prediction (Kim et al. 2019b) and health predictive analytics (Manashty and Light 2019). The LSTM’s prediction power speaks for the wide usability of the model to overcome problems in which other recurrent neural systems are likely to fail.

4.2 Natural language processing

LSTM is a force to be reckoned with when it comes to language learning, both context-free and context-sensitive (Gers and Schmidhuber 2001; Gers et al. 2002). Natural language processing is the field of research that explores how computers can be used to understand and manipulate natural language text or speech to do useful things (Chowdhury 2003). For example, dialog systems—also known as conversational agents—allow human beings to interact with a machine via speech. Speech recognition with the use of the LSTM model was first performed in Graves et al. (2004) as it has the main benefit of dealing with long time lags. Results comparable to the hidden Markov model (HMM) (Rabiner 1986) were obtained in this experiment. This work then initiated the further exploration of LSTM for this domain.

Dialog systems should be able to respond to out-of-domain speech input, instead of giving a random response. In Ryu et al. (2017) a binary classifier was built using two LSTM layers. The classifier was trained with only in-domain data, with the goal to later on recognise out-of-domain sentences. This method achieved higher prediction rates than the former state-of-the-art models.

Of course, while identifying differences between sentences is interesting from the classification perspective, understanding the meaning of the sentence can be quite a challenge for machines. Take the following sentence as an example: “Apple will launch a trio of new iPads this spring, according to Barclays research analysts.” Given the context provided in the sentence, algorithms should be able to understand that “Apple” refers to the company, not the fruit. This is the purpose of entity disambiguation. The architecture in Sun et al. (2017) was comprised of two LSTM networks to learn the context of the sentences, which performed well. However, this already strong baseline was outperformed by the memory network proposed in Sun et al. (2018).

If the context of sentences and words is understood, then questions can be answered about them. One example of this application is visual question answering, a computer vision task where a system is provided with a text-based question about an image and the answer must be inferred (Kafle and Kanan 2017; Gao et al. 2015). Section 4.4 elaborates on other computer vision applications. Alternatively, community question answering—selecting the correct answer to a question—can be performed. The authors in Wen et al. (2019) built a hybrid attention network which considers the importance of words in the current sentence but also the mutual importance with respect to the counterpart sentence for sentence representation learning. The representations of the questions and answers are learned with an LSTM-based architecture.

Given the model’s capability to remember long-term contexts, LSTM can be used to detect dialogue breakdowns. Conversational agents have to timely detect such a breakdown as it helps the agents recover from mistakes. A recent contribution from Takayama et al. (2019) investigated variations when determining if a response causes breakdowns in a conversation (subjectivity), and variations in breakdown types due to designs of agents (variationality). The variationality was addressed with three models: LSTM which considers the global context, the convolutional neural network (CNN) which is sensitive to the local characteristics, and also the combination of both. The authors did investigate BLSTM as well, but chose to employ unidirectional LSTM since results were comparable.

On the other hand, Hori et al. (2019) did adopt both the uni- and bidirectional networks, along with a hierarchical recurrent encoder decoder. The responses given by these three models are combined by a minimum Bayes risk based system in order to go to the generation phase of system responses in a Twitter help-desk dialog task. Note that this proposal is not for a real-time dialog system as it is impossible to feed the input in a reverse order in this case.

Building a dialog system is, as expected, also possible with BLSTM. As with text recognition, consecutive words in a sentence depend not only on the previous word but can also be related to the next one. The authors in Kim et al. (2019a) employed the MemN2N architecture (Sukhbaatar et al. 2015) to perform automatic system responses, but added BLSTM to the beginning of their network to better reflect the temporal information. Numerical simulations showed that the performance of state-of-the-art methods for an end-to-end network is comparable to the hierarchical LSTM model. Just as in Hori et al. (2019), this is not a real-time dialog system.

Wang et al. (2019c) attempted to generate natural and informative responses for customer service oriented dialog systems. For this purpose, two frameworks were proposed: one to encode the entire dialogue history, while the other integrates external knowledge extracted from a search engine. It is in the former that the authors explored both CNN and LSTM networks. The simulation results showed that the recurrent network was more effective in improving the reply quality when compared with the CNN variant, which can only capture the problem semantics through short patterns.

Although the interaction with humans is pivotal in this kind of problem, it does not have to end with providing appropriate responses or the information the user is seeking. Opinions can be analysed and extracted from sentences as well, as done by Zhang et al. (2019a), who employed a multilayered BLSTM architecture. D’Andrea et al. (2019) put the strengths of LSTM to use, in combination with other architectures, to track opinions on vaccination. From this study, however, it seemed that LSTM was not amongst the best architectures to tackle the problem.

Likewise, text can be classified into certain categories, as done by Dabiri and Heaslip (2019) with the use of CNN and LSTM, from which the authors concluded that their approach reports superior results when compared with other algorithms. Combinatory Categorial Grammar supertagging can also be performed by means of deep learning architectures, as illustrated in Kadari et al. (2018), where the authors used BLSTM and a conditional random field. This hybrid architecture attained good results in a very efficient manner.

Liu et al. (2019) employed both LSTM and CNN networks to build a rumor identification classifier in the social media environment. For this purpose, they analysed forwarded comments, included the influence, authority and popularity of so-called influencers, and captured diffusion structures. Results show that the proposed models are quite capable of learning hidden clues and contextual information.

Of course, before any analysis can be conducted, the structure of the document must be machine-understandable, which is the domain of document representation, studied in Zhang et al. (2019d). First, an attention-based LSTM is used to generate hidden representations of word sequences. Next, a latent topic-modelling layer and tree-structured LSTM generate the semantic representations of the document. This method proved its value over the state-of-the-art.

Naturally, a great number of works have contributed to the field of natural language processing using (a variant of) LSTM. Examples are Chinese word segmentation (Ma et al. 2018a; Gong et al. 2019), morphological segmentation (Wang et al. 2016), relation extraction (Song et al. 2018), mapping natural text to knowledge base entities (Kartsaklis et al. 2018), emoji prediction (Barbieri et al. 2018), and translation tasks (Sutskever et al. 2014).

4.2.1 Sentiment analysis

Sentiment analysis is closely related to natural language processing. Many data sources can be used to detect emotions, such as physiological data, the environment and videos. Kanjo et al. (2019) put to use sensor signals coming from these multi-modal data sources. More specifically, these signals originated from smartphones and wearable devices. In fact, they were the first to perform emotion recognition using physiological, environmental and location data. To analyse all the data, four models were built, all based on a CNN-LSTM architecture: one for the on-body data, one for the environmental data, one for the location data, and finally one for the fusion of all data inputs. The authors concluded that, by using this hybrid network, the accuracy level was increased by more than 20% compared to a traditional multi-layer perceptron model.

When it comes to video-based sentiment analysis, Li and Xu (2019) introduced a new feature extraction method, hvnLBP-TOP, followed by a sequence learning module containing BLSTM. In short, they extracted the individual frames from a video, detected the human faces in them, and assembled these into a human face video: a fixed-size picture sequence. A comparative analysis with state-of-the-art models showed this new method to be effective.

Identifying people’s emotions can also help detect conversation anomalies. For this, Sun et al. (2019) investigated the hybrid model of CNN-LSTM combined with a Markov chain Monte Carlo method. The former is applied to identify the emotion of the conversation texts, while the latter detects the emotion transitions. A limitation of this research, however, is that the initial and stimulating emotion cannot be set at any time during the conversation without guiding it in a specific direction.

Kraus and Feuerriegel (2019) proposed a method that builds upon the discourse structure of documents, namely a tensor-based, tree-structured deep neural network named Discourse-LSTM. The method is based on rhetorical structure theory, which structures documents hierarchically into a discourse tree that Discourse-LSTM can process in full. The tensor structure reveals the salient text passages and thereby provides explanatory insights, all the while returning a superior performance.

In the work of Song et al. (2019) a method of sentiment lexicon embedding in Korean was proposed. After extensive data preprocessing, attention-based LSTM was responsible for sentiment classification. The authors’ approach resulted in improved accuracy of this classification. Zhou et al. (2016) used an attention-based BLSTM to model bilingual texts with the goal of cross-language sentiment classification. The authors confirmed the attention mechanism proves to be very effective, especially on the word-level. A similar architecture was used in Yang et al. (2017) with the goal of improving target-dependent sentiment classification, and achieving better or comparable results. The attention mechanism was also employed by Ma et al. (2018b), in their variants of LSTM, incorporating commonsense knowledge, which outperformed the competition in targeted aspect sentiment analysis.

The authors in Zhao et al. (2019b) experimented with a one- and a two-dimensional CNN-LSTM architecture to identify emotions from speech data. The results show that the proposed methods achieve outstanding performance. Especially the two-dimensional variant outperformed the benchmark models: deep belief networks and CNN-based architectures. A similar study was performed by Huang et al. (2019a), with similar results: CNN-LSTM outperforms not only CNN, but also support vector machine predictions.

In speech data emotion recognition, Fayek et al. (2017) conducted a review study to compare various deep learning architectures, both feed-forward and recurrent networks. In this study, CNN yielded the best accuracy. This likely explains why most contributions use a hybrid network combining the strengths of both models.

4.3 Image and video captioning

So far, we have discussed basic time series predictions and natural language processing. However, a computer can also be requested to describe, in natural language, what can be seen in an image or a video. This is referred to as image and video captioning.

From the perspective of our literature study, we notice that this domain often operates with a CNN-LSTM hybrid architecture, thus following the encoder-decoder framework: a CNN encodes a sequence of frames and feeds it to LSTM layers that generate the word sequence. Venugopalan et al. (2016), for example, additionally integrated linguistic knowledge from large text corpora into such a pipeline.

The study of Chen et al. (2017a) built upon the state-of-the-art in computer vision to enable “robots to speak”. In other words, the goal of this research was to feed the model with images of cars, as car detection is a hot topic in self-driving technology, and have it generate a textual description of the image in understandable human language. In this regard, a CNN was applied to extract car region proposals and embed them into fixed-size windows. The LSTM then generates from the image input a one-sentence description of variable length that is closely related to the input image. Compared with four other relevant algorithms, the proposed model by Chen et al. (2017a) proved to be superior.

For more general image captioning, He et al. (2019) attempted to exploit the structure information of a natural sentence. The CNN extracts from the image a high-level feature representation; this vector describes the global content of the input image. The LSTM network is then employed to generate words from the image representation in a recurrent process, guided by part of speech. A part-of-speech tagger is a system which automatically assigns the part of speech to words using contextual information (Schmid 1994). The performance of LSTM in part-of-speech tagging was studied in Plank et al. (2016) and later confirmed in Horsmann and Zesch (2017), thus establishing the superiority of LSTM in this domain.

To increase the fluidity and descriptive nature of the generated image captions, a deep network consisting of three steps was proposed in Kinghorn et al. (2019). The first step in their proposed method is, of course, system preprocessing. For this purpose, the region proposal network is applied to generate regions of interest—the regions likely to contain objects or people. Second, to start with image description generation, the model conducts object and scene classification and attribute predictions. The third step is language “conversion”. The generated attributes and class labels are converted into fully descriptive image captions. It is in these two final steps that LSTM plays a key role, as it is the language generating neural network.

Chen et al. (2017b) proposed a new reference-based LSTM model to caption images where the training images are considered to be the references. This way, the authors attempted to solve two problems, namely identifying the important words in a caption, and misrecognition of objects and scenes. Experiments show their approach offers superior performance.

Liu et al. (2017a) argued that, in order to perform proper video captioning, one should not ignore the rich contextual information that is also available, such as objects, scenes and actions. In this work, LSTM was employed to first learn the multimodal and dynamic representation of the video. Next, the model is leveraged to generate the description words one by one.

To conclude this section, the study of Ren et al. (2016) applied multimodal LSTM to perform speaker identification: the task of localising the face of the person corresponding to the identity of the ongoing voice in a video. Therefore, a collective perception of both visuals and audio is required. The authors show that modelling the temporal dependency across face and voice can significantly improve the robustness. In the end, their system outperforms the state-of-the-art systems with both a lower false alarm rate and a higher recognition accuracy.

4.4 Computer vision

LSTM can also be used for gesture and action recognition. This field includes identifying human poses and interactions. In Bilakhia et al. (2015) the authors investigated automatic recognition of mimicry behaviour, as mimicry has the power to influence social judgements and behaviours. Mimicry behaviour is here defined as face and head movements. Video recordings of mimicry behaviour are fed to the network, and the approach is compared with other methods, namely cross-correlation and generalised time-warping. LSTM reported an outstanding performance due to the model’s inherent ability to process spatio-temporal transformations. A significant variance in the performance was detected in these experiments, however, suggesting there was still room for improvement. Zhang et al. (2018) explored the effects of attention in convolutional LSTM with regard to gesture recognition, and discovered that convolutional structures in the gates do not play the role of spatial attention. Instead, a reduction of these structures results in better accuracy, a smaller parameter size and lower computational cost. Therefore, they introduced a new variant of LSTM.

A similar study was conducted by Chen et al. (2017c). The network architecture consists of a global LSTM network comprised of multiple blocks. The video sequence is passed to the network as a set of images, while the output consists of estimates of facial landmark coordinates of the corresponding image. As these coordinates highly correlate throughout the sequence, the output of one image is used as input for the next one. In this work, however, the global network only produces the initial coordinates. Such points are fine-tuned with two feed-forward deep neural networks, which increases the accuracy of the coordinates while maintaining the shape of the face.

Hou et al. (2018) presented a facial landmark detection method for images and videos under uncontrolled conditions. This method is a unified framework which integrates, amongst others, LSTM to make full use of the spatial and temporal middle stage information to improve the accuracy. Based on experiments on publicly available datasets, the method proved to be more effective than the state-of-the-art approaches at that time.

LSTM can also recognise gestures made by hand and even track the entire human body. This is skeleton-based human activity tracking (Núñez et al. 2018) where the authors used three-dimensional data sequences obtained from full-body and hand skeletons to address human activity and hand gesture recognition, respectively. To accomplish this, the hybrid model CNN-LSTM was utilised. Extensive simulations concluded that this hybrid model has a similar performance as state-of-the-art methods. Zhu et al. (2016) also recognised the importance of skeleton joints as a good representation of the skeleton for describing actions. They introduced a new dropout algorithm which has proven its effectiveness in experiments. This dropout algorithm allowed the dropping of internal gates, cell and output responses for an LSTM neuron to encourage each unit to learn better parameters.

The applicability is not limited to tracking body characteristics, of course. The location of any arbitrary object in a sequence of frames can also be tracked given the initial position (Chen et al. 2019a). LSTM was employed to better keep track of the historical context while performing more reliable inferences in the current step. The authors fairly stated they did not propose a real-time algorithm in their research, as run-time speeds should be improved.

Humans also pass as objects for tracking in a sequence of frames. Pei et al. (2019) proposed an LSTM-based model to predict human trajectories in crowded scenes, where one LSTM is used for each object. This model includes the assumption that humans adjust their paths to accommodate for other people’s movements, as well as that of their partners. In other words, social interactions are also taken into account. However, the employed architecture failed to identify obstacles. In Alahi et al. (2016), on the other hand, the authors reported a superior performance of their human trajectory prediction study. The main difference is that Alahi et al. (2016) only consider human interactions, as objects and the scene are completely left out of the equation. In this context, Zhang et al. (2019b) worked on joint trajectory predictions and proposed a new states refinement module for LSTM. A comparative analysis shows the effectiveness of this method.

In Wang et al. (2017) spatio-temporal LSTM was introduced in order to predict the next images based on historical frames, achieving state-of-the-art prediction performance on three video prediction datasets. In the context of predicting future frames, Fan et al. (2019) introduced cubic LSTM, consisting of three branches, namely a spatial, a temporal and an output branch, the latter of which combines the first two to generate the predicted frames.

Similar to Pei et al. (2019), the study from Li et al. (2017) was performed in the context of intelligent surveillance. First, a fixed-size window is slid over a static image to generate an image sequence. This sequence serves as input for a CNN, which extracts a feature sequence from it. Finally, this feature sequence is passed in proper order to the LSTM to memorise and recognise sequential patterns. This model is designed to predict potential object locations in the scenes. In terms of accuracy, this algorithm attained the best performance on three surveillance datasets. However, the authors note that the method does not perform real-time detection.

One can go further than surveillance. Zhao et al. (2016) built upon the theme, the scene and the temporal structure of video footage to identify specific, potentially harmful, videos. Preventing the spread of videos containing harmful ideas is a valuable application. As usual, LSTM is employed to process the temporal information, while the theme and scene are learned by a variety of other models. The bottom line of the research is that impressive results were obtained, thus establishing a benchmark architecture.

Another application of LSTM with regard to the computer vision field is the estimation of the human pose in three dimensions when the input consists of two-dimensional images (Núñez et al. 2019). This is accomplished in three distinct steps. First, the two-dimensional poses are analysed, during which key skeleton points are identified using a CNN. Second, a three-dimensional point is constructed for each key point; the initial guess is obtained by optical triangulation, from which convergence was expected to be faster, given that the starting point is not random. Third, the full body pose is estimated using the LSTM network, as a series of poses is available, thus integrating the spatial and temporal information. This method performed very well when compared with state-of-the-art approaches. In Brattoli et al. (2017), the human pose was analysed with CNN and LSTM to reveal distinct functional deficits and their restoration during recovery, and the approach proved widely applicable.

Nguyen et al. (2017) modelled the human-to-human multi-modal interactive behaviour with LSTM. This behaviour consists of speech, gaze and gestures which are jointly modelled. To be more specific, the one-directional LSTM was used for on-line model comparison, and the bidirectional input is employed for off-line comparison. For both off-line and on-line prediction tasks, the chosen model yielded better results than the conventional benchmark methods when generating the appropriate overt actions.

For end-to-end sequence learning of actions in video, VideoLSTM was introduced by Li et al. (2018b). However, their approach went the other way around: instead of adapting the data to the model, the authors adapted the model to the data. They argued that, to account for the spatio-temporal nature of video, they should hard-wire the LSTM network with convolutions and spatio-temporal attention, thus creating a convolutional attention LSTM architecture.

In order to tackle the problem of parallelising multidimensional LSTMs on GPUs, Stollenga et al. (2015) proposed the PyraMiD-LSTM, a re-arrangement of the traditional cuboid order of computations in a pyramidal fashion. Experiments have shown that this model is easy to parallelise, especially for 3D data, and that it outperformed the state-of-the-art in pixel-wise image segmentation.

Naturally, there are plenty of other contributions in computer vision (Luo et al. 2018; Feng et al. 2019b; Perrett and Damen 2019; Si et al. 2019; Ma et al. 2016; Liu et al. 2017b; Guo et al. 2018; Baddar and Ro 2019), with or without proposing a variation of the LSTM model to improve performance.

4.4.1 Text recognition

Another field in which LSTM has proved to be especially proficient is text recognition, a known subtask of computer vision. For example, Naz et al. (2016) investigated text recognition possibilities for cursive scripts, specifically Urdu. The challenge imposed by cursive scripts originates from the large number of character shapes, inter- and intra-word overlaps, context sensitivity and diagonality of text. In this study, sliding windows over the text lines are employed for feature extraction, and the resulting vector is input to a multi-dimensional LSTM network. The final output is provided by a connectionist temporal classification layer. In other words, the architecture consists of three stages. This technique achieved the best results reported for the benchmark problems. Note that, a few years earlier, Graves and Schmidhuber (2009) applied a similar architecture with multi-dimensional recurrent networks (Graves et al. 2007) and connectionist temporal classification, which was at that time also a breakthrough with respect to accuracy.

A second study by Naz et al. (2017) on Urdu was conducted a year later, to further improve character recognition of the cursive script. This was done with explicit feature extraction, rather than implicit, by employing a CNN. The CNN extracted lower-level features, convolved the learned kernels with text line images, and finally fed the features to a multi-dimensional LSTM (Graves et al. 2007), which is used as the classifier in this system. The simulations, carried out on a public dataset, showed this architecture again outperformed the state-of-the-art models, including the three-stage model the authors had proposed a year earlier.

The results in Naz et al. (2016, 2017) suggest that creating hybrid architectures provides more accurate classifications than those computed with the vanilla LSTM model. This would explain the wide usage of hybrid models reported in the literature. A similar framework is applied by Bhunia et al. (2019), where a three-stage approach was used. First, stacked convolutional layers extract precise translation-invariant image features. These layers generate feature vectors of varying dimension that are fed into an LSTM network to exploit the spatial dependencies present in text script images. Second, patch weights are obtained via an attention network followed by a softmax layer. Third, the features obtained in the second stage are integrated by employing attention-based dynamic weighting.

Before going to the recognition step, Frinken et al. (2014) performed several text preprocessing tasks. To summarise, after extracting the text lines, the skew angle was determined and removed by rotation. Afterwards, the slant is corrected to normalise the directions of long vertical strokes. After this estimation, a shear transformation is applied, to finally scale all characters both vertically and horizontally. The BLSTM neural network is then employed for keyword spotting. Keyword spotting is a detection task consisting in discovering the presence of specific spoken words in speech signals (Fernández et al. 2007). That study by Fernández et al. used the power of BLSTM to handle information through time and showed that it outperforms the classic HMM in terms of accuracy. Another example of BLSTM’s power is provided in Álvaro et al. (2016), who used it to recognise handwritten mathematical expressions, which they see as collections of strokes; their state-of-the-art model outperformed previous ones. Van Phan and Nakagawa (2016) wanted a highly accurate and quick method for text/non-text classification, for which they experimented with BLSTM and achieved top-tier results, while also identifying several future research challenges. The authors in Zamora-Martínez et al. (2014) used, amongst others, a BLSTM as a text recogniser whose output feeds the language modelling network. In Sachan et al. (2019) the BLSTM network was explored for text classification using both supervised and semi-supervised learning.

In He et al. (2017) the authors proposed the Tensorised LSTM, in which the hidden states are represented by tensors and updated via a cross-layer convolution. The goal of this model is to increase capacity without adding additional parameters and with only a slightly longer runtime. Experiments on the MNIST dataset, in which handwritten digits are to be recognised, showed the potential of the proposed model.

The features that CNNs extract from text as input for an LSTM network can also be fed in both directions, so the CNN-BLSTM hybrid network is also an architecture to be considered. For example, in Toledo et al. (2019) a BLSTM was added to the already existing CNN model from previous research. This enabled the authors to extract semantic categories and thus populate a knowledge database with the document contents. Experimental results showed better performances than the state-of-the-art approaches.

Naturally, CNN is not the only alternative for creating a hybrid model. The HMM can also be utilised. Liwicki and Bunke (2009) were, as reported back in 2009, the first to propose combining such diverse recognition architectures at the decision level in the field of handwritten text line recognition. Finally, just as LSTM can be combined with a connectionist temporal classification output layer, so can BLSTM (Yousfi et al. 2017).

In contrast with time series applications, the vanilla LSTM architecture does not always provide the most accurate results for text recognition. The most successful approaches exploit the input signal in both directions, or create hybrid deep learning architectures.

4.5 Other application domains

Of course, the use of the LSTM architecture is not limited to applications discussed above. A wide variety of problems are suited to the use of this model. In this section, we will illustrate how the model can be applied to this large range of problems.

For example, in Portegys (2010) the maze learning performance of LSTM is compared to two other neural network architectures. A maze is defined as a network of distinctly marked rooms randomly interconnected by doors that open probabilistically. This study investigated two characteristics of the models: the retention of long-term state information and the modular use of the learned information. It appeared that the performance varied. LSTM is capable of learning the context maze tasks with non-modular training. The fact that it does have problems with modular training highlights the problem of using this model for tasks of a dynamic nature which may cause components to change, necessitating retraining.

An application in traffic analysis, namely the analysis of short-term crash risk, is proposed by Bao et al. (2019). In this work, three architectures are used: CNN to extract spatial features, LSTM for the temporal features, and finally convolutional LSTM for the spatio-temporal features, all in a stacked hierarchy. The simulations showed that the hybrid model performs better than standard machine learning approaches at capturing the spatio-temporal characteristics for citywide short-term crash risk prediction.

Several applications are identified in the field of computer science. The authors in Feng et al. (2019a) use an LSTM as part of a denial-of-service and privacy attack detection model, in which the LSTM is focused on the identification of XSS and SQL attacks. This proposed system was not only very accurate, but also acceptably fast. Homayoun et al. (2019) analysed the capability of CNN-LSTM to classify abnormalities related to ransomware activities. In Zuo et al. (2019) LSTM was applied to path planning in network traffic engineering under constrained conditions, where it proved to be a superior method. LSTM also proved useful in terms of information security, as demonstrated by Kang et al. (2019) in their analysis of malware detection and classification.

The network can also be found in the healthcare sector. For example, Pei et al. (2017) adopted LSTM to map small bowel images to the corresponding diameters. In Li et al. (2018a) an architecture was set up to recognise irregular entities in biomedical text. The contribution of Turan et al. (2018) refers to a system that estimates the real-time pose of actively controlled endoscopic capsule robots. The authors in Andersen et al. (2019) developed an approach for automatic detection of atrial fibrillation. These applications all apply combinations of network types, where LSTM learns the temporal relations in the data. In contrast, after several data preprocessing steps, LSTM is part of an abnormal heart sound detection method in the study of Zhang et al. (2019c). Perhaps more impressively, Yi et al. (2019) applied the model in the fight against cancer, identifying anticancer peptides with great success. In the BLSTM context, Steenkiste et al. (2019) demonstrated the value of the architecture in predicting outcomes of blood culture tests. The authors found that prediction power decreased only slightly even when predicting hours ahead.

In the field of sound recognition, Wöllmer and Schuller (2014) used BLSTM in combination with bottleneck feature generation to develop a front-end allowing the production of context-sensitive probabilistic feature vectors of arbitrary size for speech recognition. Besides speech, all kinds of sounds can be detected. In Laffitte et al. (2019) several neural networks were compared with regard to the detection of screams and shouts. All tested models performed virtually equally well; however, a distinction could be made when it came to speech recognition. The tests again highlight the temporal structure of speech, since recurrent neural networks outperformed the other network types.

Eck and Schmidhuber (2002) wondered whether LSTM would be a good candidate to learn how to compose music, since standard recurrent neural networks often lack global coherence. Their results show that LSTM can learn to play blues music, while also composing novel melodies in that style. The model also does not drift from the learned structure.

Airport runways have to be of high quality to ensure a safe landing. In this capacity, the authors in Cai et al. (2019) developed GrooveNet: a classification model for identifying shallow and worn grooves in a runway consisting of two LSTM layers, two dropout layers and one fully connected layer. This methodology proved not only to be very robust, but also very accurate.

A perhaps more exotic application of LSTM concerns block-sparsity recovery (Lyu et al. 2019). Here, the authors employed the LSTM framework to better capture the correlations and dependencies among nonzero elements of signals. This proposed method proved to be superior compared to alternative methods. Finally, Wang et al. (2019a) proposed a novel deep learning waveform recognition method, using two-channel CNNs combined with BLSTM. The contributors claim that this approach has a significantly better performance than the state-of-the-art.

5 Implementing LSTM in Tensorflow

As discussed above, LSTM can be utilised in a wide variety of situations. To run the model on your personal computer, several code libraries have been developed. In this section, we shall briefly describe how to build an LSTM neural network in Python, specifically using the Tensorflow framework, with code examples. Note that, for the sake of relevance, we limit ourselves to the most important code extracts.

First of all, the required code libraries must be imported.

figure a
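The original listing is not reproduced here; as an illustration, a minimal sketch of the imports (assuming a recent TensorFlow release, with illustrative constants for the window size and layer width) could be:

```python
import numpy as np          # numeric arrays for inputs, labels and accuracy
import tensorflow as tf     # the deep learning framework used in this section

n_input = 3     # number of symbols fed to the network at each step
n_hidden = 512  # number of LSTM units (an illustrative choice)
```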

Let us suppose we want to train an LSTM to predict the next word of a sample short story. If we feed it sequences of three symbols from the text as inputs, each with the following symbol as label, the neural network will eventually learn to predict the next symbol correctly.

The LSTM only supports numeric inputs. A way to convert symbols to numeric inputs is to assign a unique sequential number to each symbol based on its order of appearance. The reverse dictionary is also generated, since it will be used to decode the outputs.

figure b
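The corresponding listing appears as a figure in the original; the mapping can be sketched in plain Python as follows (the sample sentence below is only a stand-in for the short story):

```python
def build_dataset(words):
    # Assign each unique symbol a sequential id in order of first appearance.
    dictionary = {}
    for word in words:
        if word not in dictionary:
            dictionary[word] = len(dictionary)
    # The reverse dictionary decodes predicted ids back into symbols.
    reverse_dictionary = {idx: word for word, idx in dictionary.items()}
    return dictionary, reverse_dictionary

story = "long ago the mice had a general council to consider the cat".split()
dictionary, reverse_dictionary = build_dataset(story)
```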

The LSTM produces an output vector of probabilities of the next symbol, normalised by the softmax function. The index of the element with the highest probability is the predicted index of the symbol in the reverse dictionary. This model is at the core of the application, and is very simple to implement in Tensorflow.

figure d
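The model listing is shown as a figure in the original, which used TF1-style RNN cells; a minimal sketch of an equivalent model in the Keras API (the layer width and vocabulary size are illustrative, not the paper's values) might read:

```python
import tensorflow as tf

vocab_size = 112  # illustrative vocabulary size of the sample story

# A single LSTM layer reads the window of three symbol ids (one scalar
# feature per time step), and a dense softmax layer turns its final
# state into a probability distribution over the vocabulary.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(512),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

probs = model(tf.zeros([1, 3, 1]))  # a batch of one 3-step window
```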

In the training process, at each step, three symbols are retrieved from the training data to form the input vector. These three symbols are converted to numeric values.

figure e
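As a sketch of this step (with a hypothetical toy dictionary standing in for the one built from the full story):

```python
training_data = "the mice had a general council to consider".split()
# Toy dictionary: each unique word gets an id in order of appearance.
dictionary = {word: idx for idx, word in enumerate(dict.fromkeys(training_data))}

offset = 2  # position of the current training window in the text
symbols_in_keys = [dictionary[training_data[i]] for i in range(offset, offset + 3)]
# the words "had", "a", "general" are now represented by their numeric ids
```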

The training label is a one-hot vector coming from the symbol after the three input symbols.

figure f
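A sketch of the one-hot encoding (with an illustrative vocabulary size and label index):

```python
import numpy as np

vocab_size = 8   # illustrative vocabulary size
label_index = 5  # id of the symbol that follows the three input symbols

# One-hot training label: all zeros except at the label's index.
onehot_label = np.zeros(vocab_size, dtype=np.float32)
onehot_label[label_index] = 1.0
```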

After reshaping the input to fit the feed dictionary, the optimisation runs.

figure g
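The reshaping step can be sketched as follows; in the TF1-style original, the reshaped array would then be passed through a feed dictionary to the optimiser's run call:

```python
import numpy as np

symbols_in_keys = [2, 3, 4]  # ids of the three input symbols

# Reshape to [batch size, time steps, features] as expected by the LSTM.
x = np.reshape(np.array(symbols_in_keys), [-1, 3, 1])
```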

The accuracy and loss are accumulated to monitor the progress of the training.

figure h
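The bookkeeping can be sketched as plain running totals (the per-step values below are placeholders for what a real training step would return):

```python
display_step = 1000  # report progress every display_step iterations
acc_total, loss_total = 0.0, 0.0

for step in range(display_step):
    # Placeholder values standing in for a real training step's output.
    acc, loss = 0.9, 0.05
    acc_total += acc
    loss_total += loss

average_acc = acc_total / display_step
average_loss = loss_total / display_step
```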

The cost is the cross entropy between the label and the prediction, minimised using a gradient descent optimiser at a learning rate of 0.001.

figure j
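The cost itself can be sketched numerically (the label and prediction vectors below are illustrative):

```python
import numpy as np

learning_rate = 0.001  # step size for the gradient descent optimiser

# Cross entropy between a one-hot label and a predicted distribution.
label = np.array([0.0, 1.0, 0.0])
prediction = np.array([0.1, 0.8, 0.1])
cost = -np.sum(label * np.log(prediction))  # equals -log(0.8)
```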

The accuracy of the LSTM can be improved by adding layers.

figure k
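A sketch of such a deeper model in the Keras API (again with illustrative sizes): intermediate recurrent layers must return the full sequence so the next LSTM receives one vector per time step.

```python
import tensorflow as tf

vocab_size = 112  # illustrative vocabulary size

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(512, return_sequences=True),  # emits all time steps
    tf.keras.layers.LSTM(512),                         # emits final state only
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

probs = model(tf.zeros([1, 3, 1]))
```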

After training, we can test the network by predicting the word that follows the input words “had”, “a” and “general”.

figure l
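Decoding a prediction can be sketched as follows, with a mocked probability vector standing in for the trained network's output (assuming the sample story continues with “council”):

```python
import numpy as np

words = "long ago the mice had a general council".split()
dictionary = {w: i for i, w in enumerate(words)}
reverse_dictionary = {i: w for w, i in dictionary.items()}

# Mocked output distribution for the input window ("had", "a", "general").
onehot_pred = np.zeros(len(words))
onehot_pred[dictionary["council"]] = 0.9

# The most probable index is looked up in the reverse dictionary.
predicted_word = reverse_dictionary[int(np.argmax(onehot_pred))]
```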

We can also test the accuracy on a batch of samples.

figure m
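Batch accuracy amounts to comparing the most probable predicted symbol against the label for each sample (the small batch below is illustrative):

```python
import numpy as np

# Fraction of samples whose most probable predicted symbol matches the label.
predictions = np.array([[0.1, 0.8, 0.1],
                        [0.6, 0.3, 0.1],
                        [0.2, 0.2, 0.6]])
labels = np.array([1, 0, 2])
accuracy = float(np.mean(np.argmax(predictions, axis=1) == labels))
```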

As the reader can notice, only a handful of lines of code are needed in the Tensorflow framework in order to predict the next word in a sequence. Alternatively, one can also opt for Keras or PyTorch to implement LSTM-based solutions.

Table 1 Recommendations: LSTM-dominated or integrated networks

6 Conclusions

In this paper, we have reviewed the recent applications of LSTM reported in the literature. Our survey has illustrated the ability of this recurrent system to handle a wide variety of problems, including time series forecasting, text recognition, natural language processing, image and video captioning, sentiment analysis and computer vision. When modelling most of these problems, we found that a common practice is to hybridise CNNs with LSTM with the aim of obtaining an optimal performance. In such hybrid models, convolution and pooling layers are used to reduce the problem dimensionality while greatly suppressing the redundancy in representations. As the choice of networks to integrate is vast, Table 1 shows a summary per application domain with recommended network types. Note that the recommendations are limited to either LSTM-dominated networks or (standard) integrated architectures. As discussed in Sect. 4, further customisation of such architectures can always be applied to improve accuracy. Keep in mind, first, that, as described in Yu et al. (2019), there is no variant surpassing the standard LSTM in all aspects, and integrated networks could also use improvements. Second, our recommendations are based on the results reported in the literature, but it is wise to remember the heterogeneity of problems; as such, results will vary. Therefore, our recommendations should be taken with a grain of salt.

Together with relevant LSTM applications, the fundamental underpinnings behind this recurrent system are detailed, including its main components, their interaction with each other, and a gradient-based method to compute the weight matrix. The experimental study in Greff et al. (2017) concluded that the forget gate and the output transfer function are the most critical components of the LSTM block, whereas the learning rate is the most important hyperparameter in the backpropagation algorithm. Hence, further studying these components may lead to LSTM variants with improved prediction capabilities. Another equally relevant research line refers to less computationally demanding learning procedures to adjust the learnable parameters.