1 Introduction

The application of deep learning methods to Natural Language Processing (NLP) is possible due to the representation of words as vectors in a low-dimensional continuous space. These traditional word embeddings are static: each word has a single vector, regardless of its context [20, 21]. This generates several problems, especially that all the senses of a polysemic word have to share the same representation. Recent developments in the field produced deep neural language models such as ELMo [23] and BERT [10], which have successfully created contextualized word representations, word vectors that are sensitive to the context in which they appear. Using contextualized representations rather than static embeddings has resulted in significant improvements in a variety of NLP tasks, such as question answering and coreference resolution.

In this paper, we present experiments carried out to evaluate different word representation models for Portuguese, including both Brazilian and European variants, for semantic similarity tasks. To our knowledge, this is the first paper to evaluate deep neural language models on semantic similarity tasks in the Portuguese language.

Our experiments indicate that, when no language model is fine-tuned, the ELMo language model achieves better accuracy than any other pretrained model that has been made publicly available for the Portuguese language. We have found that performing vocabulary reduction on the corpus before training not only improved the standalone performance of ELMo, but also improved its performance when combined with classical word embeddings. We also demonstrate that FastText skip-gram embeddings [2] can perform significantly better on semantic similarity tasks than previous studies in this field indicated.

In Sect. 2 we describe some of the approaches for generating deep neural language models proposed in the literature. The approaches investigated in this paper are described in Sect. 3. The experiments carried out for evaluating deep neural language models for Portuguese are described in Sect. 4. Section 5 finishes this paper with its conclusions and proposals for future work.

2 Related Work

Hartmann et al. [12] trained 31 word embedding models using FastText, GloVe, Wang2Vec and Word2Vec. The authors evaluated them intrinsically on syntactic and semantic analogies and extrinsically on POS tagging and sentence semantic similarity tasks. Their contributions include a variety of pre-trained word embeddings, intrinsic and extrinsic task comparisons, and preprocessing and evaluation code. We used this work as a baseline for deep neural language models.

Quinta de Castro et al. [6] took the four types of word embeddings pre-trained by [12] and evaluated them extrinsically on the Named Entity Recognition (NER) task. The authors used only 100-dimensional word embeddings, applying them to the same BiLSTM-CRF deep learning architecture from [15], and improved the previous state-of-the-art on the HAREM [26] benchmark for the Portuguese language using Wang2Vec [16] embeddings.

An ELMo [23] model trained for Portuguese has been previously evaluated by Quinta de Castro [7] on NER tasks for the IberLEF evaluation [11]. Quinta de Castro [7] also made their model publicly available through the AllenNLP library. The authors experimented with different NER scenarios on Portuguese corpora, using a BiLSTM-CRF network from the AllenNLP library. With the best configuration found, the results reached state-of-the-art performance.

Santos et al. [27] assessed how different combinations of static word embeddings and contextualized embeddings impact NER for the Portuguese language. The authors present a comparative study of 16 different combinations of static and contextualized embeddings and evaluate NER performance using the HAREM benchmark. Their best NER system outperforms the previous state-of-the-art in Portuguese NER by 5.99 absolute percentage points.

Quinta de Castro [5] evaluated different combinations of word representations, such as character level embeddings, static word embeddings from [12] and ELMo embeddings [23] on the NER task. The author performed a comparative study on two different domains for the Portuguese language (general and legal), performing the pre-training of the ELMo embeddings for each domain, and comparing them to a fine-tuned version of the model on different NER corpora, for each domain. The author reached a new state-of-the-art for the HAREM benchmark using the fine-tuned ELMo embeddings, combined with 100-dimensional Wang2Vec embeddings.

To our knowledge, this is the first paper to evaluate deep neural language models on semantic similarity tasks in the Portuguese language. The semantic similarity task provided by the ASSIN dataset is equivalent to the Semantic Textual Similarity Benchmark (STS-B), and works that evaluated deep neural language models on the STS-B task, such as [22], can be taken as a reference for what to expect of the performance of these models in other linguistic contexts.

3 Word Representations

In this paper, two kinds of word representation were evaluated on semantic similarity tasks for Portuguese: contextualized and static word representations. They were tested both individually and concatenated pairwise with each other. Each approach is explained in the next sections.

3.1 Static Word Representations

Word representations are numerical vectors which can represent words or concepts in a low-dimensional continuous space, reducing the inherent sparsity of traditional vector-space representations [25]. These vectors, also known as embeddings, are able to capture useful syntactic and semantic information, such as regularities in natural language. They are based on the distributional hypothesis, which establishes that the meaning of a word is given by its context of occurrence [3]. A numerical vector representing a word can be visualized in a continuous vector space, accepting algebraic operations such as the cosine distance.
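As a minimal sketch of the cosine operation mentioned above, the following Python snippet compares toy vectors. The three-dimensional vectors and the word choices are invented for illustration and are not taken from any trained model; returning 0.0 for a zero vector is a convention assumed here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "rato": [0.9, 0.1, 0.3],
    "camundongo": [0.8, 0.2, 0.25],
    "teclado": [0.1, 0.9, 0.4],
}

sim_close = cosine_similarity(toy_vectors["rato"], toy_vectors["camundongo"])
sim_far = cosine_similarity(toy_vectors["rato"], toy_vectors["teclado"])
print(sim_close > sim_far)
```

Vectors that point in similar directions receive a cosine close to 1, which is the property the semantic similarity experiments rely on.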

The ability of static word embeddings to capture knowledge has been exploited in several tasks, such as Machine Translation [20], Word Sense Disambiguation [9] and Language Understanding [18].

Although very useful in many applications, static word embeddings, such as those generated by Word2Vec [19], GloVe [21], Wang2Vec [16] and FastText [2], have an important limitation: each word is associated with only one vector representation, ignoring the fact that polysemous words can assume multiple meanings. This limitation is called Meaning Conflation Deficiency: the possible meanings of a word are mixed into a single representation [4]. For instance, in the sentence “My mouse was broken, so I bought a new one yesterday.” the word “mouse” should be associated with its meaning as a computer device, rather than with the animal. Figure 1 illustrates this Meaning Conflation Deficiency in a 2D semantic space.

Fig. 1. Example of meaning conflation deficiency of the ambiguous word “mouse”. The words in blue refer to the sense of animal and the words in green to the sense of device. (Color figure online)

Because they create a single representation for each word, a notable problem with static word embeddings is that all senses of a polysemous word must share a single vector.

3.2 Contextualized Word Representations

The limitations of static word embeddings have led to the creation of context-sensitive word representations. ELMo [23], BERT [10], and GPT-2 [24] are examples of deep neural language models that can be fine-tuned to create models for a wide variety of downstream NLP tasks. As GPT-2 is not yet available for the Portuguese language, we performed our experiments solely on ELMo and a multilingual version of BERT. The internal representations of words in these language models are called contextualized word representations because they are a function of the entire input sentence. In this study, sentence embeddings were built by summing these representations. The success of this approach suggests that these representations capture highly transferable and task-agnostic properties of natural languages [17].

ELMo. ELMo [23] is a two-layer bidirectional LSTM language model, built over a context-independent character CNN layer and originally trained on the Billion Word Benchmark dataset [8], consisting primarily of newswire text. In order to obtain a representation for each word, we performed a linear concatenation of all three ELMo layers, without learning any task-specific weights. During our experiments, we considered two ELMo language models that were trained exclusively for the Portuguese language. The first model has been made publicly available through the AllenNLP library. The second model was trained by ourselves in an attempt to improve on the accuracy of this public model: although it took the same dataset used by the first model as its starting point, words that occurred fewer than three times were removed from the dataset before training the model. This additional vocabulary reduction step was accompanied by corresponding adjustments to the softmax layer and the network architecture.
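The vocabulary reduction step can be sketched as follows. This is only an illustration under stated assumptions: the function name `reduce_vocabulary` is hypothetical, the corpus is a toy example, and rare tokens are dropped outright, whereas the text above does not specify whether they were dropped or replaced by a placeholder.

```python
from collections import Counter


def reduce_vocabulary(sentences, min_count=3):
    """Drop every token that occurs fewer than `min_count` times in the corpus.

    `sentences` is a list of tokenized sentences (lists of strings).
    Returns the filtered sentences and the set of surviving tokens.
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    kept = {tok for tok, c in counts.items() if c >= min_count}
    reduced = [[tok for tok in sent if tok in kept] for sent in sentences]
    return reduced, kept


# Toy corpus, invented for illustration.
corpus = [
    ["o", "rato", "comeu", "o", "queijo"],
    ["o", "gato", "viu", "o", "rato"],
    ["o", "rato", "fugiu"],
]
reduced_corpus, vocabulary = reduce_vocabulary(corpus, min_count=3)
print(sorted(vocabulary))
```

A smaller vocabulary shrinks the output softmax of the language model, which is consistent with the adjustments to the softmax layer mentioned above.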

BERT. BERT [10] is a deep Transformer [28] encoder trained jointly as a masked language model and on next-sentence prediction, originally trained on the concatenation of the Toronto Books Corpus [29] and the English Wikipedia. As with ELMo, we did not fine-tune the encoder weights. We used the publicly released BERT-multilingual model, which was trained simultaneously on the Wikipedia dumps for 104 different languages. In order to achieve better accuracy on the semantic similarity task, we considered only the final layer of the model for generating its sentence embeddings.
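The two layer-selection choices described in this section (concatenating all three ELMo layers, and keeping only the final BERT layer), each followed by summation into a sentence embedding, can be sketched as below. The layer activations are toy numbers and the helper names are hypothetical; a real model would supply one list of token vectors per layer.

```python
def concat_layers(layers):
    """Join the vectors of all layers end to end, token by token.

    `layers` is a list of layer outputs; each layer output is a list with
    one vector per token.
    """
    n_tokens = len(layers[0])
    return [[x for layer in layers for x in layer[t]] for t in range(n_tokens)]


def last_layer(layers):
    """Keep only the final layer's token vectors."""
    return layers[-1]


def sum_pool(token_vectors):
    """Sum the token vectors dimension by dimension into one sentence vector."""
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) for d in range(dim)]


# Toy activations: 3 layers, 2 tokens, 2 dimensions per layer (invented).
toy_layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [1.5, 1.5]],
    [[0.0, 1.0], [1.0, 0.0]],
]

elmo_style_sentence = sum_pool(concat_layers(toy_layers))  # 6 dimensions
bert_style_sentence = sum_pool(last_layer(toy_layers))     # 2 dimensions
print(len(elmo_style_sentence), len(bert_style_sentence))
```

Concatenation multiplies the dimensionality by the number of layers, while the final-layer choice keeps the model's native hidden size.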

4 Experiments and Results

In this section we show the experiments carried out to evaluate the two approaches under investigation: word embeddings (Word2Vec, FastText) and deep neural language models (ELMo, BERT) on semantic similarity tasks.

4.1 Evaluation

Following [12], this experiment evaluates neural language models on a task of semantic similarity between sentences. Word embeddings were chosen as baselines.

Dataset. ASSIN (Avaliação de Similaridade Semântica e Inferência Textual) was a workshop co-located with PROPOR-2016 which encompassed two shared tasks: (i) semantic similarity and (ii) entailment. We chose the first one to evaluate our contextualized vectors extrinsically in a semantic task. In ASSIN, the participants of the semantic similarity shared task were asked to assign similarity values between 1 and 5 to pairs of sentences (gold score). The workshop made available the training and test sets for Brazilian (PT-BR) and European (PT-EU) Portuguese.

Algorithm. The objective of this task is to predict, through a linear regression, the similarity score between two sentences. The model is trained on the training set, which contains sentence pairs with the gold score. Predictions are made on the test set, which contains sentence pairs without the gold score. As we also have this same test set with the gold score, it is possible to calculate Pearson’s Correlation (\(\rho \)) and Mean Squared Error (MSE) between the predicted and gold scores. These results show how closely the automatic prediction approaches the human judgment.
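The regression step can be sketched with a closed-form least-squares fit, assuming a single scalar feature per sentence pair. The feature values and gold scores below are invented and the function names are hypothetical; this is not the implementation used in the experiments.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, xs):
    """Apply the fitted line to new feature values."""
    return [slope * x + intercept for x in xs]


# Invented training data: one similarity feature and one gold score per pair.
train_features = [0.2, 0.5, 0.8, 0.95]
train_gold = [1.5, 2.5, 4.0, 4.75]

slope, intercept = fit_simple_regression(train_features, train_gold)
test_predictions = predict(slope, intercept, [0.3, 0.9])
print(test_predictions)
```

The regressor only rescales the feature onto the 1 to 5 range of the gold scores, so the quality of the prediction depends almost entirely on the feature itself.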

The results were obtained after training a linear regressor with the cosine similarity between the summations of the word representations of each sentence’s words, in a procedure almost equivalent to the one performed by [12]. However, we applied the following changes to their original approach: for word embeddings, we avoided most occurrences of out-of-vocabulary words by applying to the test set the same tokenization and normalization steps that were performed on the training set before the word embeddings were trained. These steps were described by [12] and we performed them through their standard implementation.

This approach significantly reduced the number of out-of-vocabulary words for word embeddings, and the remaining ones were simply ignored, instead of being replaced by a single UNKNOWN token. In the case of language models, such as ELMo and BERT, no preprocessing was applied. Whenever evaluating the combination of a language model and a word embedding, we simply concatenated the sentence embeddings produced from each source.
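The feature construction described above can be sketched as follows: static vectors are summed while skipping out-of-vocabulary tokens, the result is concatenated with a sentence embedding from a language model, and the cosine between the two sentences is taken. All vectors are toy values, the contextual sentence vectors are placeholders for the output of a language model, and the helper names are hypothetical.

```python
import math


def sentence_embedding(tokens, embeddings, dim):
    """Sum the static vectors of the tokens, ignoring out-of-vocabulary ones."""
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # OOV tokens are skipped, not mapped to an UNKNOWN vector
        for d in range(dim):
            total[d] += vec[d]
    return total


def concatenate(static_vec, contextual_vec):
    """Join the sentence embeddings from the two sources end to end."""
    return list(static_vec) + list(contextual_vec)


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # assumed convention for zero vectors


# Toy static embeddings (invented); "xyzzy" is deliberately out of vocabulary.
static_vectors = {"o": [0.5, 0.0], "rato": [0.25, 1.0], "gato": [0.25, 0.75]}
static_a = sentence_embedding(["o", "rato", "xyzzy"], static_vectors, dim=2)
static_b = sentence_embedding(["o", "gato"], static_vectors, dim=2)

# Placeholder sentence embeddings standing in for a language model's output.
contextual_a = [2.0, -1.0, 0.5]
contextual_b = [1.5, -0.5, 0.5]

combined_a = concatenate(static_a, contextual_a)
combined_b = concatenate(static_b, contextual_b)
feature = cosine_similarity(combined_a, combined_b)
print(round(feature, 3))
```

Because summation is not normalized, the source with larger vector norms dominates the cosine of the concatenated embedding; this sketch does not rescale either source, which is an assumption rather than a detail stated in the text.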

Table 1. Best results for extrinsic evaluation on the semantic similarity task. Arrows indicate whether lower (\(\downarrow \)) or higher (\(\uparrow \)) is better. A hyphen (-) indicates the absence of either a word embedding or a language model. All word embeddings present in the table below have 1000 dimensions.

Evaluation Metrics. The Pearson correlation coefficient measures the linear relationship between two sets of scores: one annotated by the participants and another which is output by the system. Like other correlation coefficients, this one varies between −1 and +1, with 0 meaning no correlation. Correlations of −1 or +1 mean an exact linear relationship. The Mean Squared Error (MSE) of an estimator measures the average of the squares of the errors, that is, the average squared difference between the estimated value and the expected value.
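Both metrics can be written in a few lines of Python. The scores below are invented for illustration, and raising an error for constant input is a design choice of this sketch, since the correlation is undefined in that case.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("Pearson correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def mean_squared_error(predicted, gold):
    """Average squared difference between predicted and gold scores."""
    return sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold)


# Invented scores on the 1 to 5 scale used by ASSIN.
gold_scores = [1.0, 2.5, 4.0, 5.0]
system_scores = [1.5, 2.0, 3.5, 4.5]

rho = pearson(gold_scores, system_scores)
mse = mean_squared_error(system_scores, gold_scores)
print(rho, mse)
```

The two metrics are complementary: a system whose scores are shifted by a constant keeps a correlation of 1 but is penalized by the MSE.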

Discussion of Results. Table 1 shows the performance of our models for the Brazilian Portuguese and European Portuguese test sets, through the Pearson’s Correlation (\(\rho \)) and mean squared error (MSE).

All semantic similarity tests for word embeddings listed by [12] were repeated during our experiments. Although most word embeddings retained exactly the same relative accuracy, FastText skip-gram embeddings exhibited a noticeable increase in performance; in fact, the FastText skip-gram embedding at 1000 dimensions achieved the best standalone accuracy among all word embeddings considered for the ASSIN semantic similarity task. This happened because the approach deployed by [12] produced a larger number of out-of-vocabulary (OOV) words, and the FastText embeddings were abnormally sensitive to their adopted strategy of replacing all OOV words by a single UNKNOWN token. A survey of OOV word replacement techniques that can avoid this handicap is provided by [14]. We therefore conclude that the performance oscillations in FastText word embeddings reported by [12] should first be regarded as a result of their approach to word preprocessing and OOV word replacement after the training stage, rather than as a by-product of intrinsic properties of the word embeddings themselves.

Furthermore, three language models were evaluated, both in isolation and in combination with each of the word embeddings made publicly available by [12]: BERT, and ELMo with and without vocabulary reduction. While concatenating ELMo without vocabulary reduction with any word embedding produced worse results than using ELMo by itself, the reduced version of ELMo significantly improved its accuracy after being concatenated with Word2Vec embeddings. No other word embedding achieved such an improvement. The best combination of Word2Vec and reduced ELMo is reported in Table 1: the results belong to the concatenation of the reduced version of ELMo with a Word2Vec embedding that has 1000 dimensions and follows the Continuous Bag of Words (CBOW) model.

It is also important to notice that, while ELMo retained a relatively stable performance across the Brazilian and European versions of the dataset, BERT-multilingual lost a measurable portion of its accuracy when performing semantic similarity tasks in European Portuguese. This decline is most likely due to the imbalanced proportion between Brazilian and European Portuguese articles in Wikipedia, on which BERT-multilingual was trained.

5 Conclusion and Future Work

Our experiments have shown that the ELMo model that has been made publicly available through the AllenNLP library is already able to consistently perform better on semantic similarity tasks than multilingual versions of BERT that have not been subject to fine-tuning, or classical word embeddings, even when both the Brazilian and European dialects of the Portuguese language are taken into account. Although results for word embeddings superior to those achieved by ELMo have already been reported in the literature, they combine multiple word embeddings [13] or combine a word embedding with several linguistic features [1].

Furthermore, we have also seen that vocabulary reduction not only improved the standalone performance of ELMo, but also made it suitable to be concatenated with Word2Vec embeddings on semantic similarity tasks, which suggests that vocabulary reduction made ELMo more amenable to ensemble approaches for improving its accuracy.

In the future, we should also evaluate whether similar results hold on other downstream tasks, such as sentiment analysis, part-of-speech tagging and named entity recognition. Given the current lack of pretrained deep language models for the Portuguese language, we may also consider introducing in our next experiments not only existing multilingual models, but also more deep language models trained by ourselves and optimized to work exclusively with the Portuguese language.