1 Introduction

Deep learning models constitute the current state of the art in most artificial intelligence applications, from computer vision to robotics or medicine. When dealing with sequential data, Recurrent Neural Networks (RNNs), especially architectures with gating mechanisms such as the LSTM [7], the GRU [3] and other variants, are usually the default choice. One of the most interesting applications of RNNs lies in Natural Language Processing, where most tasks, such as machine translation, document summarization or language modeling, involve the manipulation of sequences of textual data. Of these, language modeling has been extensively used to test innovations in recurrent architectures, mainly due to the ease of obtaining very large datasets that can be used to train neural networks with millions of parameters.

Sequence modeling consists of predicting the next element in a sequence given the past history. In language modeling, the sequence is a text, and hence the task is to predict the next word or the next character. In this context, some of the best performing architectures include the Mogrifier LSTM [12] and different variations of the Averaged SGD Weight-Drop (AWD) LSTM [13], usually combined with dynamic evaluation and Mixture of Softmaxes (MoS) [6, 20]. These models achieve state-of-the-art performance on moderately sized datasets, such as the Penn Treebank [15] or the Wikitext-2 [14] corpora, when no additional data are used during training. When larger datasets are considered, or when external data are used to pre-train the networks, attention-based architectures usually outperform other models [2, 18].

In this work we use moderate-scale language modeling datasets to explore the effect of a mechanism recently proposed by [16], when combined with different LSTM-based models in the language modeling context. The idea consists of modifying a recurrent architecture by introducing a direct connection between the input and the output of the recurrent module. This has been shown to improve both the model’s generalization results and its readability in simple tasks related to the recognition of regular languages.

In a standard RNN, the output depends only on the network’s hidden state, \({h}_{t}\), which in turn depends on both the input, \({x}_{t}\), and the recent past, \({h}_{t-1}\). But there is no explicit dependence of the network’s output on its input. In some cases this could be a shortcoming, since the transformation of \({x}_{t}\) needed to compute the network’s internal state is not necessarily the most appropriate to compute the output. However, an explicit dependence of the output on \({x}_{t}\) can be forced by adding a dual connection that skips the recurrent layers. We claim that this strategy may be of general application in RNN models.

To test our hypothesis we perform a thorough comparison of several state-of-the-art RNN architectures, with and without the dual connection, on the Penn Treebank (PTB) and the Wikitext-2 (WT2) datasets. Our results show that, under all experimental conditions, the dual architectures outperform their non-dual counterparts. In addition, the Mogrifier-LSTM enhanced with a dual connection establishes a new state-of-the-art word-level perplexity for the Penn Treebank dataset when no additional data are used to train the models.

The remainder of the article is organized as follows. First, in Sect. 2, we present the different models we have used and the two possible architectures, the standard recurrent architecture and the dual architecture. In Sect. 3, we describe the datasets and the experimental setup. In Sect. 4, we present our results. Finally, in Sect. 5, we draw some conclusions and discuss further lines of research.

2 Models

We start by presenting the standard recurrent architecture, which is common to all the models. In the absence of a dual connection, the basic architecture involves an embedding layer, a recurrent layer and a fully-connected layer with softmax activation:

$$\begin{aligned} {e}_t &= {W}^{ex} {x}_t \end{aligned}$$
(1)
$$\begin{aligned} {h}_t &= REC({e}_t, {S}_{t-1}) \end{aligned}$$
(2)
$$\begin{aligned} {y}_t &= softmax({W}^{yh} {h}_t + {b}^{y}), \end{aligned}$$
(3)

where \({W}^{**}\) and \({b}^{*}\) are weight matrices and biases, respectively, and \({x}_t\) is the input vector at time t. The REC module represents an arbitrary recurrent layer, with \({S}_{t-1}\) being a set of vectors describing its internal state at the previous time step. In the most general case, this module will simply be an LSTM cell, but we consider other possibilities as well, as described below.

The dual architecture introduces an additional layer, with ReLU activation, which is fed with both the output of the embedding layer and the output of the recurrent module:

$$\begin{aligned} {e}_t &= {W}^{ex} {x}_t \end{aligned}$$
(4)
$$\begin{aligned} {h}_t &= REC({e}_t, {S}_{t-1}) \end{aligned}$$
(5)
$$\begin{aligned} {d}_t &= ReLU({W}^{de}{e}_t + {W}^{dh}{h}_t + {b}^{d}) \end{aligned}$$
(6)
$$\begin{aligned} {y}_t &= softmax({W}^{yd} {d}_t + {b}^{y}). \end{aligned}$$
(7)

This way, the network's input can reach the softmax layer along two different paths: through the recurrent layer and through the dual connection. In the following we consider different forms for the recurrent module in Eqs. 2 and 5.
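As a concrete illustration, the dual architecture of Eqs. 4–7 can be written in a few lines of Keras. The sketch below is a minimal, single-layer version; the vocabulary size, layer widths, and the plain LSTM recurrent module are illustrative assumptions rather than the configurations used in our experiments.

```python
# Minimal Keras sketch of the dual architecture (Eqs. 4-7).
# Sizes are illustrative; the recurrent module here is a plain LSTM.
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, embed_dim, rec_units, dual_units = 10000, 400, 400, 400

x = layers.Input(shape=(None,), dtype="int32")           # token indices (batch, time)
e = layers.Embedding(vocab_size, embed_dim)(x)           # Eq. 4: e_t = W^{ex} x_t
h = layers.LSTM(rec_units, return_sequences=True)(e)     # Eq. 5: h_t = REC(e_t, S_{t-1})
d = layers.Dense(dual_units, activation="relu")(
        layers.Concatenate()([e, h]))                    # Eq. 6: one Dense on [e_t; h_t]
y = layers.Dense(vocab_size, activation="softmax")(d)    # Eq. 7: next-word distribution

dual_model = Model(inputs=x, outputs=y)
```

Note that applying a single Dense layer to the concatenation \([{e}_t; {h}_t]\) is equivalent to the sum \({W}^{de}{e}_t + {W}^{dh}{h}_t + {b}^{d}\) in Eq. 6.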

2.1 The LSTM Module

In the simplest approach the recurrent module consists of an LSTM cell, where the internal state includes both the output and the memory, \({S}_{t} = \{{h}_{t}; {c}_{t}\}\), which are computed as follows:

$$\begin{aligned} {f}_{t} &= \sigma ({W}^{fe} {e}_{t} + {W}^{fh} {h}_{t-1} + {b}^{f}) \end{aligned}$$
(8)
$$\begin{aligned} {i}_{t} &= \sigma ({W}^{ie} {e}_{t} + {W}^{ih} {h}_{t-1} + {b}^{i}) \end{aligned}$$
(9)
$$\begin{aligned} {o}_{t} &= \sigma ({W}^{oe} {e}_{t} + {W}^{oh} {h}_{t-1} + {b}^{o}) \end{aligned}$$
(10)
$$\begin{aligned} {z}_{t} &= \tanh ({W}^{ze} {e}_{t} + {W}^{zh} {h}_{t-1} + {b}^{z}) \end{aligned}$$
(11)
$$\begin{aligned} {c}_{t} &= {f}_{t} \odot {c}_{t-1} + {i}_{t} \odot {z}_{t} \end{aligned}$$
(12)
$$\begin{aligned} {h}_{t} &= {o}_{t} \odot \tanh ({c}_{t}), \end{aligned}$$
(13)

where, as before, \({W}^{**}\) are weight matrices and \({b}^{*}\) are bias vectors. The \(\odot \) operator denotes an element-wise product, and \(\sigma \) is the logistic sigmoid function. For convenience, we summarize the joint effect of Eqs. 8–13 as:

$$\begin{aligned} {h}_t = LSTM({e}_t, \{{h}_{t-1}; {c}_{t-1}\}). \end{aligned}$$
(14)

In the literature it is quite common to stack several LSTM layers. Here we consider a double-layer LSTM, where the output \({h}_t\) of the recurrent module is obtained by the successive application of two LSTM layers:

$$\begin{aligned} {h}'_t &= LSTM_{1}({e}_t, \{{h}'_{t-1}; {c}'_{t-1}\}) \end{aligned}$$
(15)
$$\begin{aligned} {h}_t &= LSTM_{2}({h}'_t, \{{h}_{t-1}; {c}_{t-1}\}). \end{aligned}$$
(16)

We refer to this double LSTM module as dLSTM:

$$\begin{aligned} {h}_t &= dLSTM({e}_t, \{{h}_{t-1}; {c}_{t-1}; {h}'_{t-1}; {c}'_{t-1}\}) \end{aligned}$$
(17)
$$\begin{aligned} &= LSTM_{2}(LSTM_{1}({e}_t, \{{h}'_{t-1}; {c}'_{t-1}\}), \{{h}_{t-1}; {c}_{t-1}\}). \end{aligned}$$
(18)
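In Keras terms, the dLSTM module is just the composition of two LSTM layers. A minimal sketch follows; the unit counts are illustrative assumptions, not the values used in our experiments.

```python
# Sketch of the dLSTM module (Eqs. 15-18): two stacked LSTM layers.
from tensorflow.keras import layers

lstm_1 = layers.LSTM(400, return_sequences=True)   # keeps {h'_{t-1}; c'_{t-1}}
lstm_2 = layers.LSTM(400, return_sequences=True)   # keeps {h_{t-1}; c_{t-1}}

def dlstm(e):
    """h_t = LSTM_2(LSTM_1(e_t, ...), ...), as in Eq. 18."""
    return lstm_2(lstm_1(e))
```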

2.2 The Mogrifier-LSTM Module

The Mogrifier-LSTM [12] is one of the state-of-the-art variations of the standard LSTM architecture, achieving some of the lowest perplexity scores in language modeling tasks. It basically consists of a standard LSTM block, but the input \({e}_{t}\) and the hidden state \({h}_{t-1}\) are transformed before entering Eqs. 8–13. The mogrifier transformation involves several steps where \({e}_{t}\) and \({h}_{t-1}\) modulate each other:

$$\begin{aligned} {e}_{t}^{i} &= 2 \sigma ({Q}^{i} {h}_{t-1}^{i-1}) \odot {e}_{t}^{i-2}, \quad \text {for odd } i \in \{1, 2, \ldots , r\} \end{aligned}$$
(19)
$$\begin{aligned} {h}_{t-1}^{i} &= 2 \sigma ({R}^{i} {e}_{t}^{i-1}) \odot {h}_{t-1}^{i-2}, \quad \text {for even } i \in \{1, 2, \ldots , r\}, \end{aligned}$$
(20)

where \({Q}^{i}\) and \({R}^{i}\) are weight matrices and we have \({e}_{t}^{-1} = {e}_{t}\) and \({h}_{t-1}^{0} = {h}_{t-1}\). The linear transformations \({Q}^{i} {h}_{t-1}^{i-1}\) and \({R}^{i} {e}_{t}^{i-1}\) can also include the addition of a bias vector, which has been omitted for the sake of clarity. The constant r is a hyperparameter whose value defines the number of rounds of the transformation. We refer to this recurrent module, including the mogrifier transformation and the subsequent application of the LSTM layer, as:

$$\begin{aligned} {h}_t = mLSTM({e}_t, \{{h}_{t-1}; {c}_{t-1}\}) = LSTM({e}_t^{*}, \{{h}_{t-1}^{*}; {c}_{t-1}\}), \end{aligned}$$
(21)

where \({e}_t^{*}\) and \({h}_{t-1}^{*}\) are the highest indexed \({e}_t^{i}\) and \({h}_{t-1}^{i}\) in Eqs. 19 and 20. Note that the choice \(r = 0\) recovers the standard LSTM model.
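For concreteness, the mogrifier rounds can be sketched in plain TensorFlow as below. The per-round matrices \({Q}^{i}\) and \({R}^{i}\) are taken as full matrices and biases are omitted; in practice they are low-rank factorized (see the rank hyperparameter in Sect. 4.2). This is an illustrative sketch, not the exact implementation used in our experiments.

```python
# Sketch of the mogrifier transformation (Eqs. 19-20), applied before Eqs. 8-13.
# Q and R are dicts mapping the round index to its weight matrix:
# Q[i] maps hidden -> embedding dimensions, R[i] maps embedding -> hidden dimensions.
import tensorflow as tf

def mogrify(e_t, h_prev, Q, R, rounds):
    """Return e_t^* and h_{t-1}^* after `rounds` alternating gating steps."""
    for i in range(1, rounds + 1):
        if i % 2 == 1:   # odd i: the hidden state gates the input (Eq. 19)
            e_t = 2.0 * tf.sigmoid(tf.matmul(h_prev, Q[i])) * e_t
        else:            # even i: the (updated) input gates the hidden state (Eq. 20)
            h_prev = 2.0 * tf.sigmoid(tf.matmul(e_t, R[i])) * h_prev
    return e_t, h_prev   # fed to the standard LSTM step of Eq. 21
```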

The original work also explored the use of a double-layer LSTM enhanced with the mogrifier transformation. This strategy can be summarized as follows:

$$\begin{aligned} {h}_t &= mdLSTM({e}_t, \{{h}_{t-1}; {c}_{t-1}; {h}'_{t-1}; {c}'_{t-1}\}) \end{aligned}$$
(22)
$$\begin{aligned} &= mLSTM_{2}(mLSTM_{1}({e}_t, \{{h}'_{t-1}; {c}'_{t-1}\}), \{{h}_{t-1}; {c}_{t-1}\}). \end{aligned}$$
(23)

3 Experiments

3.1 Datasets

We perform experiments on two datasets: the Penn Treebank (PTB) corpus [11], as preprocessed by [15], and the WikiText-2 (WT2) dataset [14]. In both cases, the data are used without any additional preprocessing.

The Penn Treebank dataset has been widely used in the literature to experiment with language modeling. The standard data preprocessing is due to [15], and includes the transformation of all letters to lower case, the elimination of punctuation symbols, and the replacement of all numbers with a special token. The vocabulary is limited to the 10,000 most frequent words. The data are split into a training set containing almost 930,000 tokens, and validation and test sets with around 80,000 words each.

The WikiText-2 dataset, introduced by [14], is a more realistic benchmark for language modeling tasks. It consists of more than 2 million words extracted from Wikipedia articles. The training, validation and test sets contain around 2,125,000, 220,000, and 250,000 words, respectively. The vocabulary includes over 30,000 words, and the data retain capitalization, punctuation, and numbers.

3.2 Experimental Setup

All the considered models follow one of the two architectures discussed in Sect. 2, either the Embedding-Recurrent-Softmax (ERS) architecture (Eqs. 1–3) or the dual architecture (Eqs. 4–7). In either case, the recurrent module can be any of LSTM, dLSTM, or mdLSTM. Weight tying [8, 17] is used to couple the weight matrices of the embedding and the output layers. This reduces the number of parameters and prevents the model from learning a one-to-one correspondence between the input and the output [13].
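As an illustration, weight tying can be implemented in Keras with a small custom layer that reuses the embedding matrix for the output projection. The sketch below (a hypothetical TiedSoftmax layer, assuming the layer feeding the softmax has the same width as the embedding) is not the exact implementation used in our experiments.

```python
# Sketch of weight tying: the output projection reuses the embedding matrix, transposed.
import tensorflow as tf
from tensorflow.keras import layers

class TiedSoftmax(layers.Layer):
    """Softmax output layer whose weights are tied to a given Embedding layer."""
    def __init__(self, embedding_layer, **kwargs):
        super().__init__(**kwargs)
        self.embedding = embedding_layer

    def build(self, input_shape):
        self.bias = self.add_weight(
            name="bias", shape=(self.embedding.input_dim,), initializer="zeros")

    def call(self, x):
        # x: (batch, time, dim); embedding matrix: (vocab, dim)
        logits = tf.einsum("btd,vd->btv", x, self.embedding.embeddings) + self.bias
        return tf.nn.softmax(logits)
```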

We run two different sets of experiments. First, we analyze the effect of the dual connection by comparing the performances of the two architectures (ERS vs Dual), using each of the recurrent modules, on both the PTB and the WT2 datasets. In this setting the hyperparameters are tuned for the ERS architecture, and then transferred to the dual case. Second, we search for the best hyperparameters for the dual architecture using the mdLSTM recurrence, and compare the perplexity score with current state-of-the-art values. All the experiments have been performed using the Keras library [4].

The networks are trained using the Nadam optimizer [5], a variation of Adam [9] where Nesterov momentum is applied. The number of training epochs differs between experimental conditions. On the one hand, when the objective is to perform a pairwise comparison between dual and non-dual architectures, we train the models for 100 epochs. On the other hand, when the goal is to compare the dual network with state-of-the-art approaches, we let the models run for 300 epochs. We use batch sizes of 32 and 128 for the PTB and the WT2 problems, respectively, and set the sequence length to 25 in all cases. The remaining hyperparameters are searched in the ranges described in Table 1.

Finally, all the models are run twice, both with and without dynamic evaluation [10]. Dynamic evaluation is a standard method that adapts the parameters learned during training by also making use of the validation data. This allows the networks to adjust to the new evaluation conditions, which in general improves their performance. In order to keep the models as simple as possible, no additional modifications have been considered.
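For illustration purposes only, a simplified form of dynamic evaluation is sketched below; the original method of [10] uses a more elaborate update rule, with RMS normalization and decay towards the trained weights, which is replaced here by plain SGD.

```python
# Simplified sketch of dynamic evaluation: keep taking gradient steps on the
# evaluation stream itself while measuring the loss, so the model adapts as it reads.
import math
import tensorflow as tf

def dynamic_eval_perplexity(model, segments, lr=1e-4):
    """`segments` yields consecutive (inputs, targets) batches in document order."""
    optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in segments:
        with tf.GradientTape() as tape:
            loss = loss_fn(targets, model(inputs, training=False))
        n = int(tf.size(targets))
        total_nll += float(loss) * n
        total_tokens += n
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))  # adapt weights
    return math.exp(total_nll / total_tokens)  # word-level perplexity
```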

Table 1. List of all the hyperparameters and the search range associated with each of them. Those marked with an asterisk \((^*)\) refer to the dual architectures only.

4 Results

We first show the results of the ERS vs Dual comparative analysis, and then focus on the search for the optimal hyperparameters for the dual architecture with the mdLSTM recurrence.

4.1 Dual vs Non-dual Architectures

Table 2 displays the validation and test perplexity scores obtained for each of the experimental configurations on the PTB and the WT2 problems, both with and without dynamic evaluation. To facilitate the comparison, each pair of rows contains the results for one of the recurrent modules (LSTM, dLSTM or mdLSTM) using the two architectures, ERS and Dual, with the best values shown in bold. In each case, the hyperparameters are tuned for the standard ERS architecture and then used within the dual networks without any additional adaptation. The exceptions are hyperparameters, such as the dual dropout, which do not exist in the ERS configuration (those marked with an asterisk in Table 1). To give a measure of model complexity, Table 2 also contains the approximate number of trainable parameters for each configuration.

Table 2. Validation and test word-level perplexity obtained for each of the experimental configurations on the PTB (top) and the WT2 (bottom) datasets.

As expected, dynamic evaluation improves the results regardless of the model or the dataset. The main observation, however, is that networks enhanced with the dual connection display lower perplexity scores for almost all the training conditions on both the PTB and the WT2 datasets. The advantage of the Dual over the ERS architecture is larger for less complex models, and narrows as the model complexity increases. Nevertheless, even for networks with mdLSTM recurrence, the dual architectures outperform their non-dual counterparts by more than 2 perplexity points on the test set when dynamic evaluation is used.

In order to test that this improvement is due to the dual connection and not to the mere presence of an extra processing layer, we performed an additional experiment with a Dual mdLSTM model in which the term \(W^{de}{e}_t\) is removed from Eq. 6. The results for the PTB dataset are shown in Table 2 as mdLSTM+. Note that, in spite of slightly improving the baseline, this enhanced mogrifier model still falls well short of the result obtained with the full dual architecture.

Finally, it is worth noting that all the results presented correspond to our own implementation of the models, and that in most cases we do not include several of the training or validation adaptations frequently used in the literature (such as AWD or MoS, for example). This can explain the difference with respect to the results reported by [12] for the Mogrifier-LSTM model. We would expect a further improvement of the results if these additional mechanisms were implemented.

4.2 Dual Mogrifier Fine Tuning

The second part of the experiments consists of searching for the best hyperparameters in the configuration that provided the smallest perplexity in the previous setup, that is the Dual mdLSTM architecture. We carry out this experiment with the PTB problem. After an extensive search (see Table 1), the best performance is obtained with a model with 850 units in the embedding layer, 850 units in each of the mogrifier LSTM layers, and 850 units also in the dual layer. The input, recurrent, internal, and output dropout rates are all set to 0.5, the dual input and output dropout rates are set to 0.5 and 0.4, respectively, and the mogrifier dropout rate is set to 0.15. Both the embedding and the dual L2 regularization parameters are set to \(10^{-5}\). The mogrifier number of rounds is set to 4, and the rank to 100. All the remaining hyperparameters are set to 0.

After the training phase, we continue with a fine tuning of some additional hyperparameters, using the validation data. First, we look for the best sequence length in the range [5, 70], and then we fine-tune the softmax temperature in the range [0.9, 1.3]. When using dynamic evaluation, we also look for the best gradient clipping value (in the range [0.0, 1.0]) and, following [12], we repeat the whole procedure with the \(\beta _1\) parameter of the Nadam optimizer set to 0, which resembles the RMSProp optimizer without momentum. The results are shown in Table 3, together with the top perplexity scores reported in the literature for the same problem.
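As a small illustration of the temperature step in this fine tuning, the sketch below computes the validation perplexity for a candidate temperature T applied to pre-softmax logits; the logits and targets arrays, and the hypothetical `val_logits` and `val_targets` names, are assumptions for the example.

```python
# Sketch of the softmax-temperature search: pick the T in [0.9, 1.3] that
# minimizes validation perplexity when logits are divided by T before the softmax.
import numpy as np

def perplexity_at_temperature(logits, targets, T):
    """logits: (N, vocab) pre-softmax scores; targets: (N,) gold next-word ids."""
    scaled = logits / T
    scaled -= scaled.max(axis=-1, keepdims=True)                      # numerical stability
    log_probs = scaled - np.log(np.exp(scaled).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))

# best_T = min(np.arange(0.9, 1.31, 0.05),
#              key=lambda T: perplexity_at_temperature(val_logits, val_targets, T))
```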

Table 3. Best validation and test word-level perplexity scores reported in the literature for the Penn Treebank dataset, with and without dynamic evaluation. Missing values in the last two columns correspond to works where the dynamic evaluation approach was not considered. The last row in the table displays the results obtained with our Dual mdLSTM network.

The state of the art is dominated by several variations of the AWD-LSTM network [13], the most common being the inclusion of a Mixture of Softmaxes (MoS) [21]. Other add-ons include the Direct Output Connection (DOC) [19], which is a generalization of MoS, Frequency Agnostic word Embedding (FRAGE) [6], Past Decode Regularization (PDR) [1], and Partial Shuffling (PS) with Adversarial Training (Adv) [20]. The Mogrifier-LSTM described in Sect. 2.2 combines many of these ideas with a mutual gating between the input and the hidden state vectors to obtain the best results reported in the literature for the PTB problem among networks that do not use additional data during the training phase. Compared with all these models, our current approach leads the ranking with a perplexity score of 44.61, even though most of the aforementioned adjustments have not been considered.

Finally, it is important to mention that the last two rows in the table do not correspond to comparable models. While the penultimate row shows the results reported by Melis et al. [12] with their Mogrifier-LSTM model, the last table row contains the results of our Dual mdLSTM model, which uses the mogrifier transformation but lacks many of the additional characteristics of the Melis et al. model. Hence, regarding the improvement associated with fine-tuning the Dual mdLSTM model, the fair comparison is with the results shown in Table 2, which have also been included in Table 3 for the sake of clarity. In this case the dual model outperforms its non-dual equivalent by more than 5 perplexity points on the test set (50.27 versus 44.61).

5 Discussion

In this work, we have presented a new network design for the Language Modeling task based on the dual network proposed by [16]. This network adds a direct connection between the input and the output, skipping the recurrent module, and can be adapted to any of the traditional Embedding-Recurrent-Softmax (ERS) models, opening the way to new approaches for this task. We have based our work on the Penn Treebank [15] and the WikiText-2 [14] datasets, comparing the ERS approach and its dual alternative. Regardless of the configuration, the dual version always performs better, even though it faces a slight disadvantage, since most of the hyperparameters are tuned using the ERS model. We can expect much better performance if the complete set of hyperparameters is properly tuned for the dual network.

This is in fact the case for the second experiment, where a Dual mdLSTM, which includes a simplified version of the Mogrifier-LSTM [12] within a dual architecture, is fine-tuned for the Penn Treebank dataset. After a thorough search of the hyperparameter space, we have found a network configuration that establishes a new state-of-the-art score for this problem. Interestingly, this new record has been obtained in spite of leaving aside many of the standard features used in most state-of-the-art approaches, such as AWD [13] or MoS [21]. The incorporation of these features into the dual architecture can be expected to further increase the model performance.

The dual architecture was first proposed as an alternative that reduces the computational load on the recurrent layer, letting it concentrate on modeling the temporal dependencies only. From a more abstract point of view, it has been argued that the dual architecture can be understood as a sort of Mealy machine, where the output explicitly depends on both the hidden state and the input [16]. Our results show that this explicit dependence on the input can indeed lead to better performance on language modeling tasks. This emphasizes the importance of the current input in RNN models.

Finally, although the new approach has not been tested with large-scale language corpora, we expect our results to scale well to larger datasets. Work in progress addresses this extension. Further research is also needed on deeper variants of the dual architecture, on other variations of Language Modeling, and on families of problems not necessarily related to Natural Language Processing. This work opens a new line of research to be considered when processing any sequence or time series. The utility of this approach in more general problems will be addressed as future work.