
1 Introduction

In this article, we describe a variant of the Tree-LSTM neural network [16] for phrase-level sentiment classification. The contribution of this paper is an evaluation of various strategies for fine-tuning this model for a morphologically rich language with relatively loose word order, namely Polish. We explored the effects of several variants of a regularization technique known as zoneout [9], as well as of pre-trained word embeddings enhanced with sub-word information [2].

The system was evaluated in the PolEval competition, a SemEval-inspired evaluation campaign for natural language processing tools for Polish. The task that we undertook was phrase-level sentiment classification, i.e. labeling the sentiment of each node in a given dependency tree. The dataset format was analogous to that of the seminal Stanford Sentiment Treebank for English [14].

The source code of our system is publicly available at https://github.com/tomekkorbak/treehopper.

2 Phrase-Level Sentiment Analysis

Sentiment analysis is the task of identifying and extracting subjective information (the attitude of the speaker or the emotion they express) from text. In a typical formulation, it boils down to classifying the sentiment of a piece of text, where sentiment is understood as either a binary (positive or negative) or a multinomial label, and where classification takes place at the document or sentence level. This approach, however, is of limited effectiveness for texts expressing multiple (possibly contradictory) opinions about multiple entities (or aspects thereof) [17]. What is needed is a more fine-grained way of assigning sentiment labels, for instance to the phrases that build up a sentence.

Apart from the aspect-specificity of sentiment labels, another important consideration is to account for the effect of syntactic and semantic composition on sentiment. Consider the role negation plays in the sentence “The movie was not terrible”: it flips the sentiment label of the whole sentence [14]. In general, computing the sentiment of a complex phrase requires knowing the sentiment of its subphrases and a procedure for composing them. Applying this approach to full sentences requires a tree representation of a sentence.

There are two broad families of formalisms used to represent sentential syntactic structure: constituency grammars and dependency grammars. The choice of formalism depends heavily on the peculiarities of the language of interest. For instance, English can be captured nicely with a constituency grammar, which is why the Stanford Sentiment Treebank represents sentences as binary constituency trees. Polish, on the other hand, has relatively loose word order and rich morphology, making dependency approaches more suitable.

The PolEval dataset represents sentences as dependency trees. Dependency grammar models a sentence in terms of tokens and (binary, directed) relations between them, with some additional constraints: there must be a single root node with no incoming edges, and each non-root node must have exactly one incoming arc and a unique path to the root node. This entails that each phrase has a single head that governs how its subphrases are to be composed [7].

The PolEval dataset consisted of a 1200-sentence training set and a 350-sentence evaluation set. Each token in a sentence is annotated with its head (the token it depends on), a relation type (e.g. coordination, conjunction) and a sentiment label (positive, neutral, negative). For an example, consider Fig. 1.

Fig. 1. An entry in the PolEval dataset consists of (1) an ordered list of tokens, (2) dependency relations between them, (3) types of these relations (not used by our model, hence not shown) and (4) sentiment labels for each head (−1, 0, 1).
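To make this format concrete, below is a minimal sketch of one way such an entry could be represented in Python. It is not the actual PolEval file format or our data loader; it assumes tokens, 1-based head indices and per-token sentiment labels are given as parallel lists, and the example sentence and its labels are hypothetical.

```python
# A minimal, hypothetical representation of a PolEval-style entry.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DependencyTree:
    tokens: List[str]       # ordered word tokens of the sentence
    heads: List[int]        # heads[i] is the 1-based index of token i's head; 0 marks the root
    sentiments: List[int]   # per-token sentiment labels: -1, 0 or 1
    children: List[List[int]] = field(default_factory=list)

    def __post_init__(self):
        # Build a child list for every token so the tree can be traversed bottom-up.
        self.children = [[] for _ in self.tokens]
        for i, head in enumerate(self.heads):
            if head > 0:
                self.children[head - 1].append(i)


# Hypothetical three-token sentence whose second token is the root.
tree = DependencyTree(
    tokens=["To", "jest", "dobre"],
    heads=[2, 0, 2],
    sentiments=[0, 0, 1],
)
print(tree.children)  # [[], [0, 2], []]
```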

3 LSTM and Tree-LSTM Neural Networks

3.1 Recurrent Neural Networks

Recurrent neural networks (RNNs) are a class of neural networks designed to handle sequential data. This includes EEG signals, protein sequences and natural language sentences, modeled as linearly ordered sequences of words rather than tree structures. The power of recurrent neural networks lies in their ability to take advantage of previously seen samples in modeling subsequent ones.

Let us focus on the class of per-time-step classification problems (also known as sequence labeling tasks). Let \(\{x^{(t)}\}\) denote the sequence of input vectors, where t ranges from 1 to \(\tau \), let \(\{\hat{y}^{(t)}\}\) be the sequence of predicted labels and let \(\{y^{(t)}\}\) be the sequence of ground truth labels. Then the objective function maximized in the course of learning is

$$\begin{aligned} \prod _{t=1}^{\tau } P( \hat{y}^{(t)} = y^{(t)} \mid \theta ; x^{(t)}, x^{(t-1)},\ldots , x^{(1)}) \end{aligned}$$
(1)

where \(\theta \) denotes all parameters (connection weights) of a model that are optimized.
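In practice, maximizing the product in Eq. 1 is usually implemented as minimizing the sum of per-time-step negative log-likelihoods. A minimal PyTorch sketch (the shapes and values are illustrative assumptions, not our actual training code):

```python
import torch
import torch.nn.functional as F

tau, num_classes = 5, 3
logits = torch.randn(tau, num_classes, requires_grad=True)   # unnormalized scores for each time-step
targets = torch.randint(0, num_classes, (tau,))               # ground-truth labels y^(t)

# Summing cross-entropy over time-steps equals the negative log of the product in Eq. 1.
loss = F.cross_entropy(logits, targets, reduction="sum")
loss.backward()   # gradients flow back into the logits (and, in a real model, into theta)
```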

The fundamental concept behind RNNs is sharing parameters across different parts of the model [5]. While a typical feedforward neural network has separate parameters for each region of a fixed-size input, an RNN reuses the same parameters across the time-steps of a sequence.

The idea of parameter sharing can be formalized in terms of the computational graphs of RNNs and their unfolding. A computational graph of a neural network is a directed graph \(G = (N, E)\), where nodes \( n_i \in N\) denote variables and edges \(e_i \in E\) denote operations on variables (i.e. applications of functions). The folded computational graph of an RNN contains cycles, which can be interpreted as recurrent connections between variables; they enable the flow of information across time-steps during forward propagation and backpropagation through time.

An RNN can be described as a dynamical system with transition function f:

$$\begin{aligned} h_t = f(h_{t-1}, x_t; \theta ) \end{aligned}$$
(2)

where \(h_t\) denotes the hidden state at time-step t, \(h_{t-1}\) the hidden state at the previous time-step, \(x_t\) the t-th sample and \(\theta \) the model parameters (weight matrices).

The output \(\hat{y}_t\) is then a function of current hidden state \(h_t\), current sample \(x_t\) and parameters \(\theta \):

$$\begin{aligned} \hat{y}^{(t)} = g(h^{(t)}, x^{(t)}; \theta ) \end{aligned}$$
(3)

In the simplest case (known as the vanilla RNN, or Elman network, cf. [4]), both f and g can be defined as affine transformations of the concatenation of a hidden state and an input, that is:

$$\begin{aligned} f(h_{t-1}, x_t; \theta ) = W_h [h_{t-1}, x_t] + b_h \end{aligned}$$
(4)
$$\begin{aligned} g(h_t, x_t; \theta ) = W_y [h_t, x_t] + b_y \end{aligned}$$
(5)

for some \(W_h, W_y, b_h, b_y \in \theta \). Importantly, none of these parameters depends on t; they are shared across time-steps.
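The following is a minimal PyTorch sketch of an Elman-style recurrent step following Eqs. 2–5; the class and parameter names are our own illustrative choices, not part of [4] or of our system.

```python
import torch
import torch.nn as nn


class ElmanRNNCell(nn.Module):
    """One step of the vanilla RNN of Eqs. 2-5 (illustrative names and sizes)."""

    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        # W_h, b_h act on [h_{t-1}, x_t]; W_y, b_y act on [h_t, x_t].
        self.hidden = nn.Linear(hidden_size + input_size, hidden_size)
        self.output = nn.Linear(hidden_size + input_size, num_classes)

    def forward(self, x_t, h_prev):
        # Eq. 4: purely affine, as in the text (in practice a tanh usually wraps this map).
        h_t = self.hidden(torch.cat([h_prev, x_t], dim=-1))
        # Eq. 5: the output is computed from the current hidden state and input.
        y_t = self.output(torch.cat([h_t, x_t], dim=-1))
        return h_t, y_t


# Unrolling over a sequence reuses the same parameters at every time-step:
cell = ElmanRNNCell(input_size=300, hidden_size=300, num_classes=3)
h = torch.zeros(300)
for x in torch.randn(5, 300):   # a toy sequence of 5 token vectors
    h, y = cell(x, h)
```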

3.2 LSTM Cells and Learning Long-Term Dependencies

Notice the recursion inherent in the definition of f. In principle, any RNN could be defined non-recursively by substituting Eq. 2 into itself \(\tau \) times, for instance

$$\begin{aligned} h^{(t)} = f(f(... f(h^{(1)}, x^{(1)}; \theta ) ..., x^{(t-1)}; \theta ), x^{(t)}; \theta ) \end{aligned}$$
(6)

for some initial hidden state \(h^{(1)}\). This re-formulation corresponds to unfolding a folded computation graph, yielding a directed acyclic graph.

Thanks to recurrent connections, RNNs are capable of maintaining a working memory (or short-term memory, as opposed to the long-term memory captured in the weights of forward connections) that stores information about earlier time-steps and can be used for classifying subsequent ones. Linguistically, this corresponds to learning the constraints that previous words in a sentence place on subsequent words or sentences. In theory, RNNs can handle arbitrarily complex and long-distance dependencies, as they have been proved capable of computing any function computable by a Turing machine [12]. In practice, however, the distance between two time-steps has a huge effect on the learnability of the constraints they impose on each other. This difficulty with long-term dependencies is known as the vanishing gradient problem [1].

The long short-term memory (LSTM) architecture [6] was designed to address the vanishing gradient problem by enforcing constant error flow across time-steps. This is done by introducing a structure called a memory cell; a memory cell has one self-recurrent connection with constant weight that carries short-term memory information through time-steps. Information stored in the memory cell is thus relatively stable despite noise, yet it can be updated at each time-step. These updates are regulated by three gates mediating between the memory cell, the inputs and the hidden states: the input gate, the forget gate and the output gate.

For time-step t, let input gate \(i_t\), forget gate \(f_t\) and output gate \(o_t\) be defined in terms of the following Eqs. (7–9):

$$\begin{aligned} i_t = \sigma (W^{(i)} x^{(t)} + U^{(i)} h_{t-1}) \end{aligned}$$
(7)
$$\begin{aligned} f_t = \sigma (W^{(f)} x^{(t)} + U^{(f)} h_{t-1}) \end{aligned}$$
(8)
$$\begin{aligned} o_t = \sigma (W^{(o)} x^{(t)} + U^{(o)} h_{t-1}) \end{aligned}$$
(9)

where \(W^{(i)}, W^{(f)}, W^{(o)}\) and \(U^{(i)}, U^{(f)}, U^{(o)}\) denote the weight matrices of the input-to-cell connections (acting on \(x^{(t)}\)) and of the hidden-to-cell connections (acting on \(h_{t-1}\)), respectively, for the input gate, forget gate and output gate. \(\sigma \) denotes the sigmoid function.

Gates are then used for updating short-term memory. Let new memory cell candidate \(\widetilde{c}_t\) at time-step t be defined as

$$\begin{aligned} \widetilde{c}_t = \tanh (W^{(c)} x_t + U^{(c)} h_{t-1}) \end{aligned}$$
(10)

where \(W^{(c)}, U^{(c)}\), analogously, are weight matrices for input-to-cell and hidden-to-cell connections and where \(\tanh \) denotes hyperbolic tangent function.

Intuitively, \(\widetilde{c}_t\) can be thought of as summarizing relevant information about the word-token \(x_t\). Then, \(\widetilde{c}_t\) is used to update \(c_t\) according to the forget gate and the input gate.

$$\begin{aligned} c_t = f_t \circ c_{t-1} + i_t \circ \widetilde{c}_t \end{aligned}$$
(11)

where \(A \circ B\) denotes the Hadamard product of two matrices, i.e. element-wise multiplication.

The forget gate, given the current input and hidden state, decides which information from the previous memory cell \(c_{t-1}\) can be dropped, while the input gate, given the same input and hidden state, decides which information from the candidate memory cell should be incorporated into the new memory cell \(c_t\).

Finally, \(c_t\) is used to compute the next hidden state \(h_t\), modulated by the output gate (defined in Eq. 9), which takes into account the input and hidden state at the current time-step.

$$\begin{aligned} h_t = o_t \circ \tanh (c_t) \end{aligned}$$
(12)

In a sequence labeling task, \(h_t\) is then used to compute the label \(\hat{y}_t\) as defined by Eq. 5. Forward propagation in an LSTM network is done by recursively applying Eqs. 7–12 while incrementing t.
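A minimal PyTorch sketch of a single LSTM step implementing Eqs. 7–12 (biases are omitted, as in the equations above; names and sizes are illustrative):

```python
import torch
import torch.nn as nn


class SimpleLSTMCell(nn.Module):
    """One LSTM step following Eqs. 7-12 (a didactic sketch, not nn.LSTMCell)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One pair of (input-to-cell, hidden-to-cell) matrices per gate and for the candidate cell.
        self.W = nn.ModuleDict({k: nn.Linear(input_size, hidden_size, bias=False)
                                for k in ("i", "f", "o", "c")})
        self.U = nn.ModuleDict({k: nn.Linear(hidden_size, hidden_size, bias=False)
                                for k in ("i", "f", "o", "c")})

    def forward(self, x_t, h_prev, c_prev):
        i_t = torch.sigmoid(self.W["i"](x_t) + self.U["i"](h_prev))   # Eq. 7
        f_t = torch.sigmoid(self.W["f"](x_t) + self.U["f"](h_prev))   # Eq. 8
        o_t = torch.sigmoid(self.W["o"](x_t) + self.U["o"](h_prev))   # Eq. 9
        c_tilde = torch.tanh(self.W["c"](x_t) + self.U["c"](h_prev))  # Eq. 10
        c_t = f_t * c_prev + i_t * c_tilde                            # Eq. 11
        h_t = o_t * torch.tanh(c_t)                                   # Eq. 12
        return h_t, c_t
```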

3.3 Recursive Neural Networks and Tree Labeling

Recursive neural networks, or tree-structured neural networks, form a superset of recurrent neural networks: their computational graphs generalize the computational graphs of recurrent neural networks from a chain to a tree. Whereas a recurrent neural network's hidden state \(h_t\) depends only on the previous hidden state \(h_{t-1}\), the hidden state of a recursive neural network at node j depends on the hidden states of its children, \(\{h_k : k \in C(j)\}\), where C(j) denotes the set of children of node j.

Tree-structured neural networks have a clear linguistic advantage over chain-structured neural networks: trees are a very natural way of representing the syntax of natural languages, i.e. how more complex phrases are composed of simpler ones. Specifically, in this paper we are concerned with a tree labeling task, an analogous generalization of sequence labeling to tree-structured inputs: each node of a tree is assigned a label, possibly dependent on all of its children.


3.4 Tree-LSTM Neural Networks

A Tree-LSTM (as described by [16]) is a natural combination of the approaches described in the two previous subsections. Here we focus on a particular variant of Tree-LSTM known as Child-Sum Tree-LSTM. This variant allows a node to have an unbounded number of children and assumes no order over those children. Thus, Child-Sum Tree-LSTM is particularly well-suited for dependency trees.

Let C(j) again denote the set of children of node j. For a given node j, the Child-Sum Tree-LSTM takes as input a vector \(x_j\) and the hidden states \(h_k\) for every \(k \in C(j)\). The hidden state \(h_j\) and cell state \(c_j\) are computed using the following equations:

$$\begin{aligned} \widetilde{h}_j = \sum _{k \in C(j)}^{} h_k \end{aligned}$$
(13)
$$\begin{aligned} i_j = \sigma (W^{(i)}x_j + U^{(i)} \widetilde{h}_j + b_i) \end{aligned}$$
(14)
$$\begin{aligned} f_{jk} = \sigma (W^{(f)} x_j + U^{(f)} \widetilde{h}_j + b_f) \end{aligned}$$
(15)
$$\begin{aligned} o_j = \sigma (W^{(o)} x_j + U^{(o)} \widetilde{h}_j + b_o) \end{aligned}$$
(16)
$$\begin{aligned} u_j = \tanh (W^{(u)} x_j + U^{(u)} \widetilde{h}_j + b_u) \end{aligned}$$
(17)
$$\begin{aligned} c_j = i_j \circ u_j + \sum _{k \in C(j)}^{} f_{jk} \circ c_k \end{aligned}$$
(18)
$$\begin{aligned} h_j = o_j \circ \tanh {(c_j)} \end{aligned}$$
(19)

Equations 14–19 are analogous to Eqs. 7–12; they correspond to applying the input gate, forget gate, output gate and update (candidate) gate, and to computing the cell and hidden states.

In a tree labeling task, we will additionally have an output function

$$\begin{aligned} \hat{y}_j = W^{(y)} h_j + b_y \end{aligned}$$
(20)

for computing a label of each node.
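As an illustration, the following PyTorch sketch implements the Child-Sum Tree-LSTM update of Eqs. 13–20 for a single node. It is not our actual implementation (see the repository linked in Sect. 1); stacking the i, o and u projections into one linear layer is merely a common efficiency trick, and all names are illustrative.

```python
import torch
import torch.nn as nn


class ChildSumTreeLSTMNode(nn.Module):
    """Child-Sum Tree-LSTM node update (Eqs. 13-20), sketched for a single node."""

    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.iou_x = nn.Linear(input_size, 3 * hidden_size)                # W^(i), W^(o), W^(u) stacked
        self.iou_h = nn.Linear(hidden_size, 3 * hidden_size, bias=False)   # U^(i), U^(o), U^(u) stacked
        self.f_x = nn.Linear(input_size, hidden_size)                      # W^(f), b_f
        self.f_h = nn.Linear(hidden_size, hidden_size, bias=False)         # U^(f)
        self.out = nn.Linear(hidden_size, num_classes)                     # W^(y), b_y of Eq. 20

    def forward(self, x_j, child_h, child_c):
        # child_h, child_c: tensors of shape (num_children, hidden_size); empty for leaves.
        h_tilde = child_h.sum(dim=0)                                        # Eq. 13
        i, o, u = (self.iou_x(x_j) + self.iou_h(h_tilde)).chunk(3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)         # Eqs. 14, 16, 17
        f = torch.sigmoid(self.f_x(x_j) + self.f_h(child_h))                # Eq. 15, one f_jk per child
        c_j = i * u + (f * child_c).sum(dim=0)                              # Eq. 18
        h_j = o * torch.tanh(c_j)                                           # Eq. 19
        y_j = self.out(h_j)                                                 # Eq. 20
        return h_j, c_j, y_j


# For a leaf node, pass empty child tensors:
# h, c, y = node(x_j, torch.zeros(0, hidden_size), torch.zeros(0, hidden_size))
```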

4 Experiments

We chose to implement our model in PyTorch due to the convenience of its dynamic computation graph framework.

We evaluated our model on tree labeling as described in Subsect. 3.3 using PolEval 2017 Task 2 dataset. (For an example entry, see Fig. 1).

4.1 Regularizing with Zoneout

Zoneout [9] is a regularization technique, a variant of dropout [15], designed specifically for regularizing the recurrent connections of LSTMs or GRUs. Dropout is known to be successful in preventing feature co-adaptation (a common cause of overfitting) by randomly applying a zero mask to the outputs of a given layer. More formally,

$$\begin{aligned} h := d_t \circ h \end{aligned}$$
(21)

where \(d_t\) is a random mask (a tensor with values sampled from a Bernoulli distribution).

However, dropout typically cannot be applied to the recurrent hidden and cell states of an LSTM, since applying a zero mask over a sufficient number of time-steps effectively zeros them out (this is reminiscent of the vanishing gradient problem).

Zoneout addresses this problem by randomly swapping the current value of a hidden state with its value from the previous time-step rather than zeroing it out. Therefore, contrary to dropout, gradient information and state information are more readily propagated through time. Zoneout has yielded significant performance improvements on various NLP tasks when applied to the cell and hidden states of LSTMs. It can be understood as augmenting Eqs. 11 and 12 with the following updates:

$$\begin{aligned} c_t := d^c_t \circ c_t + (1-d^c_t) \circ c_{t-1} \end{aligned}$$
(22)
$$\begin{aligned} h_t := d^h_t \circ h_t + (1-d^h_t) \circ h_{t-1} \end{aligned}$$
(23)

where 1 denotes a tensor of ones and \(d^c_t\) and \(d^h_t\) are random, Bernoulli-sampled masks for a given time-step.
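A minimal sketch of how Eqs. 22 and 23 can be applied inside an unrolled sequential LSTM loop; a mask value of 1 keeps the newly computed state, matching the equations above. Using the expected mask at test time follows common zoneout practice and is an assumption here rather than something specified in this paper.

```python
import torch


def zoneout(new_state, prev_state, keep_prob, training=True):
    """Mix a freshly computed state with its previous value (Eqs. 22-23)."""
    if not training:
        # At test time the stochastic mask is replaced by its expectation.
        return keep_prob * new_state + (1.0 - keep_prob) * prev_state
    mask = torch.bernoulli(torch.full_like(new_state, keep_prob))  # d_t in Eqs. 22-23
    return mask * new_state + (1.0 - mask) * prev_state


# Usage inside an unrolled LSTM loop, after h_t and c_t have been computed:
# c_t = zoneout(c_t, c_prev, keep_prob=0.75)
# h_t = zoneout(h_t, h_prev, keep_prob=0.75)
```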

Notably, zoneout was originally designed with sequential LSTMs in mind. We explored several ways of adapting it to tree-structured LSTMs. Below we consider only hidden state updates, since cell state updates are analogous.

As a Tree-LSTM's nodes are no longer linearly ordered, the notion of a previous hidden state must be replaced with the notion of the hidden states of children nodes. The most obvious approach, which we call “sum-child”, is to randomly replace the hidden state of node j with the sum of its children's hidden states, i.e.

$$\begin{aligned} h_j := d^h_j \circ h_j + (1-d^h_j ) \circ \sum _{k \in C(j)}^{} h_k \end{aligned}$$
(24)

Another approach, which we call “choose-child”, is to randomly choose a single child whose hidden state replaces that of the node:

$$\begin{aligned} h_j := d^h_j \circ h_j + (1-d^h_j ) \circ h_k \end{aligned}$$
(25)

where k is sampled uniformly at random from C(j).

Apart from that, we explored different values for \(d^h\) and \(d^c\) as well as keeping a mask fixed across time-steps, i.e. \(d_t\) being constant for all t.
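The two strategies can be sketched as follows (a simplified illustration, not our exact implementation; the function name and signature are hypothetical):

```python
import random
import torch


def tree_zoneout(h_j, child_h, keep_prob, strategy="sum-child"):
    """Tree-LSTM zoneout on a node's hidden state (Eqs. 24-25)."""
    # child_h: tensor of shape (num_children, hidden_size); leaves have no children.
    if child_h.size(0) == 0:
        return h_j
    if strategy == "sum-child":
        replacement = child_h.sum(dim=0)                           # Eq. 24
    elif strategy == "choose-child":
        replacement = child_h[random.randrange(child_h.size(0))]   # Eq. 25
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    mask = torch.bernoulli(torch.full_like(h_j, keep_prob))        # d^h_j
    return mask * h_j + (1.0 - mask) * replacement
```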

4.2 Using Pre-trained Word Embeddings

Standard deep learning approaches to distributional lexical semantics (e.g. word2vec [10]) were not designed with morphologically rich languages like Polish in mind and cannot take advantage of the compositional relations between word forms. Consider the example of “chodziłem” and “chodziłam” (the Polish masculine and feminine past continuous forms of “walk”, respectively). The model has no sense of the morphological similarity between these words and has to infer it from distributional information alone. This poses a problem when the number of occurrences of a specific orthographic word form is small or zero, and some Polish words have up to 30 orthographic forms (thus, the effective number of occurrences of a form can be up to 30 times smaller than the number of occurrences of its lemma).

One approach we explore is to use word embeddings pre-trained on lemmatized data. The other, more promising approach is to take advantage of morphological information by enhancing word embeddings with subword information. We evaluate fastText word vectors as described by [2]. Their work extends the model of [10] with an additional representation of morphological structure as a bag of character-level n-grams (for \(3 \le n \le 6\)). Each character n-gram has its own vector representation and the resulting word embedding is the sum of the word vector and its character n-gram vectors. The authors reported significant improvements in language modeling tasks, especially for Slavic languages (8% for Czech and 13% for Russian; Polish was not evaluated), compared to a pure word2vec baseline.
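To illustrate the subword idea, the sketch below builds a word vector as the sum of a whole-word vector and character n-gram vectors (3 ≤ n ≤ 6), using random stand-in vectors. Real fastText additionally hashes n-grams into a fixed-size table, and in practice one would load the pre-trained Polish vectors rather than generate random ones.

```python
import numpy as np


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers, as in the fastText paper."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]


def subword_embedding(word, lookup, dim=300):
    # Unseen pieces get a random stand-in vector here; fastText hashes n-grams instead.
    pieces = [word] + char_ngrams(word)
    return sum(lookup.setdefault(p, np.random.randn(dim)) for p in pieces)


lookup = {}
v1 = subword_embedding("chodziłem", lookup)
v2 = subword_embedding("chodziłam", lookup)
# The two forms share most of their character n-grams, so their vectors end up
# correlated even if the orthographic forms themselves are rare in the corpus.
```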

5 Results

We conducted a thorough grid search over a number of other hyperparameters (not reported here in detail due to space limitations). We found that the best results were obtained with a minibatch size of 25, Tree-LSTM hidden state and cell state size of 300, a learning rate of 0.05, a weight decay rate of 0.0001 and an L2 regularization rate of 0.0001. No significant difference was found between the Adam [8] and Adagrad [3] optimization algorithms. It takes between 10 and 20 epochs for the system to converge.

Here we focus on two fine-tunings we introduced: fastText word embeddings and zoneout regularization.

The following word embedding models were used:

  • word2vec [10], 300 dimensions, pre-trained on Polish Wikipedia and National Corpus of Polish [11] using lemmatized word forms. Lemmatization was done using Concraft morphosyntactic tagger [18].

  • word2vec [10], same as above, but using orthographical word forms.

  • fastText [2], 300 dimensions, pre-trained on Polish Wikipedia using orthographical word forms and sub-word information.

Table 1. Results of our faulty solution as evaluated by the PolEval organizing committee. “Ensemble epochs” denotes the number of training epochs we averaged the weights over to obtain a snapshot-based ensemble model.
Table 2. A comparison of the effect of pre-trained word embeddings on the model's accuracy. “emb lr” denotes the learning rate of the embedding layer; 0.0 means the layer was kept fixed and not optimized during training. “time” denotes wall-clock training time on a CPU, measured in minutes.
Table 3. Results extracted from a grid search over zoneout hyperparameters. “Mask” denotes how the mask vector is sampled from a Bernoulli distribution: “common” means all nodes share the same mask, while “distinct” means the mask is sampled per node. “Strategy” denotes the zoneout strategy as described in Subsect. 4.1. “\(d^c_j\)” and “\(d^h_j\)” denote the zoneout rates for the cell and hidden states of the Tree-LSTM, respectively. No significant differences in training time were observed.

Our results for different parametrizations of pre-trained word embeddings and zoneout are shown in Tables 2 and 3, respectively. The effects of word embeddings and zoneout were analyzed separately, i.e. the results in Table 2 were obtained with no zoneout and the results in Table 3 were obtained with the best word embeddings, i.e. fastText.

Note that these results differ from what is reported in the official PolEval benchmark. Our results as evaluated by the organizing committee, reported in Table 1, left us behind the winner (0.795) by a huge margin. This was due to a bug in our implementation, which was hard to spot as it manifested only in inference mode. The bug broke the mapping between word tokens and weights in our embedding matrix. All results reported in Tables 2 and 3 were obtained after fixing the bug (the model was trained on the training dataset and evaluated on the evaluation dataset, after the ground truth labels were disclosed). Note that these results beat the best reported solution by a small margin.

6 Conclusions

As far as word2vec embeddings are concerned, both training on lemmatized word forms and further optimizing the embeddings yielded small improvements, and the two effects were cumulative. FastText vectors, however, beat all word2vec configurations by a significant margin. This result is interesting, as the fastText embeddings were originally trained on a smaller corpus (Wikipedia, as opposed to Wikipedia + NKJP in the case of word2vec).

When it comes to zoneout, it barely affected accuracy (an improvement of about 0.6 percentage points) and we did not find a hyperparameter configuration that stands out. More work is needed to determine whether zoneout can yield robust improvements for Tree-LSTMs.

Unfortunately, our system did not manage to win the Task 2 competition, owing to a simple bug. However, the results obtained after the evaluation indicate that its overall design was very promising and that, if implemented correctly, it could have beaten the other participants by a small margin. We intend to prepare and improve it for next year's competition, having learned some important lessons on fine-tuning and regularizing Tree-LSTMs for sentiment analysis.