
1 Introduction

Paraphrase generation transforms a natural language text into a new text with the same semantic meaning but a different syntactic or lexical surface form [7]. This is a challenging problem, commonly approached using supervised learning [2, 17].

While this task has been extensively explored for English, little work exists for other languages, particularly Portuguese. We are aware of one work exploring paraphrase generation for (Brazilian) Portuguese [28]. No existing work targets paraphrase generation for European Portuguese, and only two small phrasal datasets of aligned paraphrases are available [4, 5], neither of which is publicly accessible. For English, by contrast, approaches have been developed for generating freely available datasets with millions of sentential paraphrase pairs [12, 37].

In this paper, we describe the creation of a dataset containing more than 1.5 million sentential paraphrase pairs. We use neural machine translation (NMT) to translate the English side of a large English-Portuguese parallel corpus, namely OpenSubtitles [23]. We pair the Portuguese translations with the European Portuguese references to form paraphrase pairs. We call this dataset OSPT, as an abbreviation of OpenSubtitles for Portuguese. This dataset covers a broad range of paraphrase phenomena (we cover this analysis in more detail in Sect. 3).

We show the utility of the dataset by using it to train paraphrastic sentence embeddings. We primarily evaluate our sentence embeddings on the ASSIN2 [29] semantic textual similarity (STS) competition. Although it was built for Brazilian Portuguese, in the absence of a better alternative we deem this competition a good option for intrinsically evaluating the quality of our data. We compare sentence embeddings trained on the official training set from the competition against sentence embeddings trained on a small subset of OSPT, and find that the embeddings trained on our dataset outperform those trained on the curated training split.

Lastly, we show that our dataset can be used for paraphrase generation. Using the European Portuguese sentences as targets when fine-tuning a multilingual pre-trained language model produces a pseudo-translation effect: the generations are markedly more European Portuguese-like than the sources, which exhibit Brazilian-like features.

We release our dataset, trained sentence embeddings, paraphrase generators, and all the code needed to reproduce them. As far as we know, OSPT is the most extensive collection of Portuguese sentential paraphrases released to date. We hope it can motivate new research directions in Portuguese and be used to create powerful Natural Language Processing models, while adding robustness to existing ones by incorporating paraphrastic knowledge.

2 Related Work

We discuss prior work on automatically building paraphrase corpora from parallel text, on learning sentence embeddings and similarity functions, and on paraphrase generation in Portuguese.

Paraphrase Discovery and Generation

Many methods have been developed for generating or finding paraphrases, including using multiple translations of the same source material [6], using comparable articles from multiple news sources [10], crowdsourcing [18], using diverse machine translation systems to translate a single source sentence [34], and using tweets with matching URLs [21].

Beyond these techniques, the most influential prior work uses bilingual corpora. Bannard and Callison-Burch [3] used methods from statistical machine translation to find lexical and phrasal paraphrases in parallel text. Ganitkevitch et al. [13] scaled up these techniques to produce the Paraphrase Database (PPDB), which, since the method only requires parallel text, has since been extended to many languages [12]. Wieting et al. [38] used NMT to translate the non-English side of sentential parallel text to obtain English-English paraphrase pairs and claimed their data quality to be on par with manually written English paraphrase pairs. The same authors later scaled up the method to produce a larger dataset [37]. We follow the same approach but produce Portuguese-Portuguese paraphrase pairs.

Sentence Embeddings

As in Wieting and Gimpel's work [37, 38], we train sentence embeddings to demonstrate the quality of the dataset. These works trained models on noisy paraphrase pairs and evaluated them primarily on semantic textual similarity (STS) tasks. Prior work in learning general sentence embeddings has used autoencoders [16], encoder-decoder architectures [11], and other learning frameworks [1, 9, 27]. More recently, approaches leverage the embeddings of pre-trained language models, such as SimCSE [14] or Sentence-BERT (SBERT) [30]. We use the latter for our STS task.

Parallel Text for Learning Embeddings

Prior work has shown that parallel text, and resources built from parallel text such as NMT systems and PPDB, can be used to learn word and sentence embeddings. Some works have used PPDB as a knowledge resource for training or improving embeddings [26, 36]. Others have used NMT architectures and training settings to obtain better embeddings: Mallinson et al. [25] adapted trained NMT models to produce sentence similarity scores in semantic evaluations, and Wieting and Gimpel [37] proposed mega-batches to expand the search space when selecting negative examples for each paraphrase pair, which then feed a margin-based triplet loss [30]. In this work, we opt for a multiple negatives ranking loss [15] because we do not have negative examples. This loss assumes that every other target sentence in the batch (aside from the target sentence of the pair being evaluated) is a negative example.
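For concreteness, the sketch below shows the core of a multiple negatives ranking loss in PyTorch, assuming paired source/target embeddings; the similarity scale factor is a common library default, not a value specified in this paper.

```python
import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(src_emb, tgt_emb, scale=20.0):
    # src_emb, tgt_emb: (batch, dim) embeddings of paired sentences (a_i, b_i).
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    # Cosine similarity between every source and every target in the batch.
    scores = src @ tgt.T * scale  # (batch, batch)
    # For row i, column i is the positive; all other columns act as in-batch negatives.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```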

3 The Dataset

To create our dataset, we used back-translation [38]. We used an English-Portuguese NMT system to translate English sentences from the training data into Portuguese. We paired the translations with the European Portuguese references to form Portuguese-Portuguese paraphrase pairs (i.e., \(\langle \text {Mixed Portuguese}, \text {European Portuguese}\rangle \) pairs).

Throughout the document, we refer to Portuguese as a mixture of European and Brazilian Portuguese, as most pre-trained multilingual models do not distinguish between the two variants. To refer to a specific variant, we explicitly say so.

Table 1. Examples of machine-translated sentences from the source dataset that form paraphrase pairs in our dataset. Each entry consists of the original English sentence (“en-XX”), its Portuguese machine translation (“MT pt-XX”), and the European Portuguese reference (“pt-PT”). The pairs exhibit varying degrees of lexical diversity.

Because pivot translation can diminish the fidelity of the information carried into the target language, we chose parallel data that already contains European Portuguese text, so we only need to translate the side that is not European Portuguese. This is the approach of [37]. Additionally, in [38], the authors found little difference among Czech, German, and French as source languages for back-translation into English. For Portuguese, we found no prior work on the best source language to translate from. As such, to maximize performance, we chose English as our source language and an English-centric multilingual pre-trained language model, mBART-50 [35], as our translation system. This model extends the original mBART [24] to encompass more languages, including Portuguese.

3.1 Choosing a Data Source

As far as we know, the two primary publicly available datasets with European Portuguese bitext are Europarl [20] and OpenSubtitles [23]. As per the study conducted in [37], Europarl exhibits low diversity in terms of rare word usage, vocabulary entropy, and parse entropy, mainly due to the formulaic and repetitive nature of parliamentary speech. In [37], the authors chose the CzEng dataset [8], of which a significant portion is movie subtitles, which tend to use a vast vocabulary and exhibit diverse sentence structures. This serves as a strong motivation for conducting our experiments using OpenSubtitles.

The OpenSubtitles dataset has over 33 million English-European Portuguese bitext pairs. Because of the computational expense of translating such an extensive dataset, we sample 3 million entries. When translating the English sentences into Portuguese, we used beam search with a beam size of 5 and selected the highest-scoring translation. We show illustrative examples in Table 1. Note that the matching is not always perfect, mainly because the original bitext pairs are not perfect translations (there are instances where the meaning is significantly different). The translations are of very high quality, with sporadic errors such as gender mismatches, since English nouns are not gendered, or translations failing to discern whether a second-person pronoun (“you”) is singular or plural.
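As an illustration, the translation step can be written with the Hugging Face transformers library as below; the checkpoint name is our assumption, since the text only specifies mBART-50 [35], while the beam size follows the setup above.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Assumed mBART-50 checkpoint; the paper only names the model family.
model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

def translate_en_to_pt(sentences):
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **batch,
        forced_bos_token_id=tokenizer.lang_code_to_id["pt_XX"],
        num_beams=5,  # beam search with beam size 5, keeping the top hypothesis
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```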

3.2 Automatic Quality Assessment, Cleaning, and Filtering

As manually evaluating such an extensive dataset is very expensive and time-consuming, we resort to automatic mechanisms to assess the dataset’s quality, clean the data, and filter out uninteresting entries.

On manual inspection, we found recurring problems such as closed captions, leading dialogue hyphens, and sentence misalignment. For example, “(vomita) Tu queres saber o que é de loucos?” contains a closed caption that should be removed. Similarly, in “- Deem-me dois minutos.”, the hyphen should be removed to match the target sentence. An example of misalignment is “E Dr\(^{\text {a}}\). Lin, tente não me chamar.” \(\rightarrow \) “Sim.”, where the two sentences do not share the same meaning. To find such pairs, we search for large differences in token count between source and target. We use the following criterion to prune heavily uneven word counts while normalizing for text length:

$$\begin{aligned} |n\_tokens_{src} - n\_tokens_{tgt}| / \max (n\_tokens_{src}, n\_tokens_{tgt}) > 0.5 \end{aligned}$$

We set the threshold to 0.5 based on a few empirical experiments. For a random sample of 100,000 entries, we find around 3,500 entries that satisfy the inequality above and are therefore deemed unfit to keep. The mean SBERT score for these pruned entries is 81.69, a low value for SBERT, indicating that pairs with heavily uneven word counts tend to have low semantic similarity.

Finally, we remove sentence pairs that are exactly the same. This occurs most prominently for very short sentences (\({<}4\) tokens).
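A minimal sketch of this cleaning step, assuming whitespace tokenization (the text does not specify the tokenizer):

```python
def keep_pair(src: str, tgt: str, ratio_threshold: float = 0.5) -> bool:
    """Return True if the (source, target) pair survives the filters of Sect. 3.2."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # Prune heavily uneven word counts, normalized by the longer side.
    if abs(n_src - n_tgt) / max(n_src, n_tgt, 1) > ratio_threshold:
        return False
    # Drop pairs whose sentences are exactly the same.
    if src == tgt:
        return False
    return True

# pairs: iterable of (source, target) strings loaded beforehand.
pairs = [(s, t) for s, t in pairs if keep_pair(s, t)]
```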

3.3 Data Analysis

We further analyze the relevance of the data. As per Li et al. [22], relevance regards how semantically close the paraphrased text is to the original text. We study semantic similarity using Sentence-BERT (SBERT) [30]. Specifically, we conduct preliminary testing with multilingual SBERT (mSBERT) [31] and a Brazilian Portuguese SBERT trained on ASSIN2 [29]. Despite being more general-purpose, mSBERT performs better than the latter. Using mSBERT and normalizing the scores to the range [0, 100], we get an average value of 87.724, which suggests that most pairs have high semantic similarity. Nonetheless, we prune pairs with semantic scores lower than 80: empirical assessment showed that below this threshold most sentence pairs are misaligned.
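The scoring and pruning step can be sketched as follows with the sentence-transformers library; the mSBERT checkpoint name is an assumption, as the text only cites the model family [31].

```python
from sentence_transformers import SentenceTransformer

# Assumed multilingual SBERT checkpoint; the paper cites mSBERT [31] generically.
msbert = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def semantic_scores(sources, targets):
    src = msbert.encode(sources, convert_to_tensor=True, normalize_embeddings=True)
    tgt = msbert.encode(targets, convert_to_tensor=True, normalize_embeddings=True)
    cos = (src * tgt).sum(dim=-1)  # pairwise cosine similarity in [-1, 1]
    return (cos + 1) / 2 * 100     # rescaled to [0, 100]

# Keep only pairs scoring at least 80.
scores = semantic_scores([s for s, _ in pairs], [t for _, t in pairs])
pairs = [p for p, sc in zip(pairs, scores) if sc >= 80]
```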

We do not conduct any particular study regarding fluency (the syntactic and grammar correctness of the paraphrased text [22]), relying on the assumption that pre-trained language models are inherently good grammar inductors [19].

OSPT has 1,519,554 pairs. For reference, two widely used English sentential parallel paraphrase datasets, QQP and PAWS [39], have 149,263 and 28,904 paraphrase pairs, respectively. TaPaCo [32], a corpus of sentential paraphrases for various languages, has 36,451 Brazilian Portuguese paraphrase pairs. OSPT averages 8 words for both source and target sentences, as subtitles are rarely long. QQP averages around 11 words per sentence, PAWS around 21, and TaPaCo around 7.

4 Learning Sentence Embeddings

We assess the quality of the dataset intrinsically, using it to train sentence embeddings.

4.1 Experimental Setup

We fine-tune an mSBERT [31] model. We train the model for 10 epochs with a batch size of 64, a learning rate of 2e-5, the AdamW optimizer, and a linear scheduler with 100 warmup steps. As mentioned in Sect. 2, the training loss we use allows learning good-quality sentence embeddings without explicit negative examples. The training data for the loss consists of sentence pairs \([(a_1, b_1), \ldots , (a_n, b_n)]\), where we assume that \((a_i, b_i)\) are similar sentences and \((a_i, b_j)\) are dissimilar sentences for \(i \ne j\). The loss maximizes the cosine similarity between \(a_i\) and \(b_i\) while minimizing the cosine similarity between \(a_i\) and \(b_j\) for all \(i \ne j\).
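This setup maps directly onto the sentence-transformers training API. The sketch below uses the hyper-parameters above; the checkpoint name and the load_ospt_pairs helper are assumptions for illustration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed mSBERT checkpoint; load_ospt_pairs is a hypothetical helper
# returning (source, target) string tuples.
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
pairs = load_ospt_pairs()

train_examples = [InputExample(texts=[src, tgt]) for src, tgt in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=100,               # linear warmup, then linear decay (library default)
    optimizer_params={"lr": 2e-5},  # AdamW is the library's default optimizer
)
```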

We evaluate sentence embeddings using the ASSIN2 semantic textual similarity (STS) task [29]. Given two sentences, the aim of the STS task is to predict their similarity on a 0-5 scale, where 0 indicates the sentences are on different topics and 5 means they are entirely equivalent. To fairly compare OSPT with ASSIN2’s official training data (6,500 pairs), we randomly sampled a subset of 6,500 pairs from our dataset. We further compare with a 6,500-pair subset of the TaPaCo dataset.
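For evaluation, we compute the official ASSIN2 STS metrics (Pearson’s r and MSE) between gold labels and cosine-based predictions; mapping cosine similarity onto the 0-5 scale by simple rescaling is our assumption, not a detail specified here.

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate_sts(model, sents1, sents2, gold):
    # gold: array of human similarity judgments on the 0-5 ASSIN2 scale.
    emb1 = model.encode(sents1, convert_to_tensor=True, normalize_embeddings=True)
    emb2 = model.encode(sents2, convert_to_tensor=True, normalize_embeddings=True)
    cos = (emb1 * emb2).sum(dim=-1).cpu().numpy()
    pred = np.clip(cos, 0, 1) * 5.0  # rescale cosine similarity to 0-5 (our assumption)
    gold = np.asarray(gold, dtype=float)
    return pearsonr(gold, pred)[0], float(np.mean((gold - pred) ** 2))
```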

4.2 Results

In Table 2, we report the scores for the official task’s evaluation metrics.

Table 2. Results for STS on the ASSIN2 test set. We compare three fine-tuned SBERT models: one trained on the ASSIN2 training data (6,500 pairs), another on a random subset of 6,500 samples from TaPaCo [32], and a third on a random subset of 6,500 samples from OSPT. We report the official metrics from the STS task of the ASSIN2 competition. The best results are in bold.

The reported results compare the same model trained under the same conditions and with the same amount of data, changing only the data source. The mSBERT trained on a subset of OSPT performed best on the task, achieving the highest Pearson’s r and the best MSE. Assuming the randomly sampled subsets are representative of the data as a whole (which is hard to verify for sentences), we can conclude that the data is of good quality, or at least good enough to produce high-quality sentence embeddings.

5 Paraphrase Generation

Besides creating state-of-the-art paraphrastic sentence embeddings, we show that our dataset can help produce interesting paraphrase generators for data augmentation.

5.1 Experimental Setup

We fine-tune three mBART [24] models: two on subsets of OSPT and one on the TaPaCo dataset. Since our dataset is so large, training paraphrase generation models on it in its entirety is computationally demanding. As such, we filter the data to create a training set of 240K samples, with 30K samples for validation and 30K for testing. Additionally, we build a subset of OSPT with 36,451 training pairs (the same size as TaPaCo) for a fair comparison. We train the models for four epochs, with a batch size of 64, a learning rate of 1e-4, the AdamW optimizer, and a linear scheduler with 100 warmup steps.
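A sketch of this fine-tuning with the Hugging Face Seq2SeqTrainer, assuming an mBART-50 checkpoint (the text cites the mBART family [24]) and a datasets.Dataset train_ds with “source” (mixed Portuguese) and “target” (European Portuguese) columns:

```python
from transformers import (MBartForConditionalGeneration, MBart50TokenizerFast,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/mbart-large-50"  # assumed checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(
    model_name, src_lang="pt_XX", tgt_lang="pt_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

def preprocess(batch):
    # Source: mixed Portuguese translation; target: European Portuguese reference.
    inputs = tokenizer(batch["source"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

args = Seq2SeqTrainingArguments(
    output_dir="mbart-ospt",
    num_train_epochs=4,
    per_device_train_batch_size=64,
    learning_rate=1e-4,
    warmup_steps=100,
    lr_scheduler_type="linear",  # AdamW is the Trainer's default optimizer
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds.map(preprocess, batched=True),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```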

Following recent work [17], we use as our primary evaluation metric the iBLEU [33] score:

$$\begin{aligned} \text {iBLEU} & = \alpha \cdot \text {BLEU}(outputs, references) \\ {} & \quad - (1-\alpha ) \cdot \text {BLEU}(outputs, inputs) \end{aligned}$$

iBLEU measures the fidelity of generated outputs to reference paraphrases as well as the level of diversity introduced. We set \(\alpha = 0.7\), following the original paper [33]. Additionally, to probe the semantic retention of the generations, we measure semantic similarity using mSBERT [31]. We chose this metric because it was found to have the lowest coupling between semantic similarity and linguistic diversity [2].
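Computing iBLEU reduces to two corpus-level BLEU calls; a minimal sketch with sacrebleu:

```python
import sacrebleu

def ibleu(outputs, references, inputs, alpha=0.7):
    """iBLEU [33]: reward fidelity to the references, penalize copying the input."""
    bleu_ref = sacrebleu.corpus_bleu(outputs, [references]).score
    bleu_src = sacrebleu.corpus_bleu(outputs, [inputs]).score
    return alpha * bleu_ref - (1 - alpha) * bleu_src
```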

5.2 Results

We evaluate paraphrase generation using the ASSIN2 competition’s test set.

Table 3. Top-1 results for automatic evaluation on the ASSIN2 test set. The Source as prediction baseline serves as a dataset quality indicator. The naming convention matches the number of pairs used to train the models. The best results are in bold.

Table 3 shows the performance of the mBART-based models we fine-tuned. The results are bound to the basic statistics of the data, which is why we report the Source as prediction baseline, i.e., using the source sentences directly as predictions. The ASSIN2 pairs have high word overlap, expressed as a low iBLEU score in the Source as prediction baseline. Consequently, models trained on that data will produce sentences similar to the sources, which is why the iBLEU scores are low across the board. These iBLEU values could be raised by increasing the \(\alpha \) hyper-parameter, but that would reduce the contribution of lexical diversity to the results. Nevertheless, we can see that more diverse generations (expressed as a higher iBLEU score) come with a drop in semantic similarity (even though the metric is not fully decoupled from the vocabulary used). The model trained on OSPT-36k achieves the highest diversity, but at the cost of some semantic preservation. Ramping up the number of training examples to 240k yields a minimal decrease in diversity with increased semantic fidelity, much closer to the model trained on TaPaCo. Note that we did not tune hyper-parameters, and four epochs may not suffice for optimal performance considering the complexity and size of our model, which may explain why the model trained on more data is not clearly better than the smaller one. Note also that TaPaCo, like ASSIN2, is a Brazilian Portuguese dataset, making it likely to perform better in this specific context, whereas we aim to produce European Portuguese text. Moreover, the ASSIN2 test set contains texts with low syntactic diversity and many uses of the gerund form of verbs, a pattern most prevalent in Brazilian Portuguese.

Table 4. Example generations from the mBART-OSPT-240k model on the ASSIN2 test set illustrating the pseudo-translation.

Table 4 shows examples of these sentences and the respective generations from the mBART-OSPT-240k model. By building the training pairs with European Portuguese as the target, we can produce European Portuguese paraphrases even when paraphrasing from Brazilian Portuguese: our model performs a pseudo-translation from Brazilian Portuguese to European Portuguese.

Future work could use the above-mentioned properties of the paraphrase generator to further denoise the dataset we present in this paper: the generations could be used to convert the source sentences of our dataset into European-like Portuguese. We can also consider generalizing the approach, employing this technique to convert any Brazilian Portuguese text into European Portuguese.

6 Conclusion

We described the creation of a dataset of more than 1.5M Portuguese sentential paraphrase pairs. We showed how to use this dataset to train paraphrastic sentence embeddings that outperform systems trained on other data on STS tasks, as well as how it can be used to generate paraphrases for data augmentation and to pseudo-translate Brazilian Portuguese into European Portuguese.

The key advantage of our approach is that it only requires parallel text and a translation system. Hundreds of millions of parallel sentence pairs exist, and more are generated continually. Our procedure applies immediately to the wide range of languages for which parallel text is available. Additionally, the quality of datasets generated with this approach will increase alongside improvements in machine translation.

We release our dataset, code, and pre-trained sentence embeddings.

This work is supported by LIACC, funded by national funds through FCT/MCTES (PIDDAC), with reference UIDB/00027/2020.