
1 Introduction

Linguistic corpora have been widely used in Natural Language Processing (NLP) tasks in recent years. Experiments have shown that a well-constructed and well-analyzed corpus can be exploited to improve the quality of the linguistic objects produced by NLP algorithms in general, and by Automatic Text Generation (ATG) procedures in particular. However, the construction of consistent literary corpora is often unattainable [17] due to the complexity of the process, which demands a great deal of time for analysis.

Moreno-Jiménez and collaborators [10, 14] recently presented a corpus for use in NLP formed only by literary texts in Spanish. This corpus was applied to tasks such as sentiment analysis [9] and the automatic generation of literary sentences [12, 14]. The corpus reached approximately 200 million (M) tokens of literature in Spanish, and for this reason it was named MegaLite. The latest available version of the Spanish section contains approximately 5 000 documents, 1 300 different authors, approximately 15 M sentences, 200 M tokens, and 1 000 M characters. In [11], the corpus was extended with a section composed of literature in French. This addition contains 2 690 documents, 1 336 different authors, approximately 10 M sentences, close to 182 M tokens, and approximately 1 082 M characters. To reflect the addition of a new language, the sections of the corpus acquired two new names: MegaLiteES for the Spanish section and MegaLiteFR for the French one.

In this work, we present an extension of MegaLite [20], formed by adding a new section based on literature produced in Portuguese in different lusophone countries, such as Brazil, Portugal, and Mozambique, to name a few. It also contains literature translated into Portuguese from sources around the globe. We describe how the corpus was produced and formatted, its main properties, and some ATG experiments carried out on it, along with their results. We use two different representations of the corpus to better understand its structure and possible applications.

In Sect. 2, we present some work related to the development and analysis of corpora. In Sect. 3, we describe the new corpus MegaLitePT. Section 4 briefly describes the algorithm for ATG and, in Sect. 5, we present some experiments and evaluate the performance of the ATG algorithms trained with MegaLitePT. Finally, in Sect. 6, we propose some ideas for future work before concluding.

2 Related Work

In this section, we discuss work related to the construction of literary corpora. We note that most such corpora are composed of documents written in English. For this reason, we have concentrated our efforts on collecting literary documents written in Portuguese, in order to extend the MegaLite [11] corpus, which already contains a section of literary documents in Spanish and another in French. We hypothesize that the richness and variability of styles found in literature can improve the quality of texts obtained with ATG algorithms, overcoming the limitations of the overly rigid style of technical documents and the stereotypes of the journalistic style.

In [17], the authors introduced the RiQua corpus, composed of literary quotation structures in 19th-century English. The RiQua corpus provides a rich view of dialogue structures, focusing on the importance of the relation between the content of a quotation and the context in which it is inserted in a text. Another interesting approach, presented in [19], describes the SLäNDa corpus, which consists of 44 chapters of Swedish narratives with over 220 K manually annotated tokens. The annotation process identified 4 733 occurrences of quoted material (quotes and signs) separate from the main narrative, and 1 143 named speaker-to-speech correspondences. This corpus has been useful for the development of computer tools for analyzing literary narratives and discourse.

A Spanish corpus called LiSSS has been proposed in [9]. It consists of literary sentences collected manually from numerous literary works. The LiSSS corpus has been annotated according to five emotions: love, anger, happiness, hope, and fear. It is available in two versions: the first has 500 sentences manually multi-annotated (by 13 persons), and the second has 2 000 manually mono-annotated sentences.

Concerning corpora with emotional content, the SAB corpus was introduced in [15]. This corpus is composed of tweets in Spanish representing reviews of seven types of commercial products. The tweets are classified into eight categories: Confidence, Satisfaction, Happiness, Love, Fear, Disaffection, Sadness, and Anger. In [2], a comprehensive work comprising three resources is described. The first is an emotional lexicon composed of words from 136 of the most spoken languages in the world. The second is a knowledge graph that includes 7 M words from the same 136 languages, with about 131 M inter-language semantic links. Finally, the authors detected the emotional coherence expressed in Wikipedia texts about historical figures in 30 languages.

3 MegaLite Corpus

This section describes MegaLitePT, a literary corpus for the Portuguese language. It consists of thousands of literary documents, spanning more than a thousand authors from different countries, writing styles, and literary genres. The documents in this corpus come from a personal collection and hence, for copyright reasons, we are not allowed to share them in their original form. Nevertheless, following the same formatting standards used in the MegaLiteES and MegaLiteFR sections, the corpus is available as files indexed by author surname and title: as embeddings, in a Part-Of-Speech (POS) tag version and a lemma version, and as files listing the frequencies of unigrams, bigrams, and SU4-bigrams.

Table 1. Properties of MegaLitePT, with 4 311 literary texts (\(K = 10^3\) and \(M = 10^6\)).

3.1 Structure of the Corpus

The original corpus was built from literary documents in the Portuguese language, written by lusophone authors, together with texts translated from other languages into Portuguese. The corpus contains 4 311 documents from 1 418 authors, in different literary genres such as plays, poems, novels, essays, and chronicles. The original documents, obtained in heterogeneous formats (plain text, EPUB, PDF, HTML, ODT, DOC, etc.), were processed and stored as plain-text, UTF-8 files. Textual metadata such as indexes, titles, remarks, author notes, and page numbering were filtered out using regular expressions and pattern matching, and by manual removal. Afterwards, we performed textual segmentation using a tool developed in Perl 5.0 based on regular expressions [5]. Some of the properties of the corpus, after pre-processing, are detailed in Table 1.
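To make the pre-processing concrete, the following is a minimal Python sketch of regex-based cleaning and sentence segmentation, in the spirit of the Perl tool described above; the patterns shown are illustrative assumptions, not the actual rules used.

```python
import re

# Hypothetical cleaning patterns (illustrative only, not the authors' rules).
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")                    # lone page numbers
CHAPTER_HEADING = re.compile(r"^\s*CAP[IÍ]TULO\b.*$", re.IGNORECASE)
SENTENCE_END = re.compile(r"(?<=[.!?…])\s+(?=[A-ZÀ-Ú])")        # split after end punctuation

def clean_and_segment(raw_text: str) -> list[str]:
    """Drop metadata-like lines, then split the remaining text into sentences."""
    kept_lines = [
        line for line in raw_text.splitlines()
        if line.strip()
        and not PAGE_NUMBER.match(line)
        and not CHAPTER_HEADING.match(line)
    ]
    text = " ".join(kept_lines)
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]

raw = ("CAPÍTULO I\n"
       "Não tive filhos. Não transmiti a nenhuma criatura o legado da nossa miséria.\n"
       "42\n")
print(clean_and_segment(raw))
# -> ['Não tive filhos.', 'Não transmiti a nenhuma criatura o legado da nossa miséria.']
```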

In its current state, the MegaLite corpus is very extensive, containing literary documents in French and Spanish, and it is suitable for use in machine learning and translation. It does, however, contain a small amount of noise: a few textual objects that were not detected in the pre-processing stages, leading to occasional mistakes in the segmentation process. This is not unusual for a corpus of the size of MegaLite; the same kinds of objects can be found in most corpora of unstructured text, and they also occur in the Portuguese section MegaLitePT.

Table 2. Properties of MegaLitePT. Numbers of documents, authors, sentences, tokens, and characters in each directory, which is identified by the initials of the last name of the authors.

The names of all files in MegaLitePT follow the same naming pattern used in the other sections of MegaLite, that is, authorLastName,_authorName-title. We also group all authors whose last names share the same initial into a common directory. In Table 2, we display the properties of the corpus for each of the directories, identified by the initial of the authors' last names.
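As an illustration, a hypothetical helper reproducing this naming and directory convention might look as follows (the .txt extension and the space handling are our assumptions):

```python
from pathlib import Path

def corpus_path(root: str, last_name: str, first_name: str, title: str) -> Path:
    """Build the path authorLastName,_authorName-title under the initial's directory."""
    file_name = f"{last_name},_{first_name}-{title}".replace(" ", "_")
    return Path(root) / last_name[0].upper() / f"{file_name}.txt"

print(corpus_path("MegaLitePT", "Assis", "Machado de", "Memorias Postumas de Bras Cubas"))
# MegaLitePT/A/Assis,_Machado_de-Memorias_Postumas_de_Bras_Cubas.txt
```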

3.2 Word2vec Embeddings

Word embeddings are representations of words that quantify semantic similarities between linguistic terms. These embeddings can be determined from the analysis of relations among words in a large corpus. Embeddings for the MegaLitePT corpus were generated using the Word2vec model [7] with the Gensim library [18], resulting in a set of 389 340 embeddings. Each embedding is an s-dimensional vector whose elements were obtained from semantic relationships among words in the MegaLitePT corpus. The training process used the parameters shown in Table 3. Iterations, i, is the number of training epochs. Minimal count, m, is the minimal frequency of occurrence a word needs in the corpus to be added to the vocabulary. For any word x, vector size, s, specifies the dimension of the vector representation of x, and window size, ws, is the number of words adjacent to x in a sentence (related to it within the sentence) that are considered to form the embedding. In this model, we used the skip-gram approach [6], with negative sampling of five words and a downsampling threshold of 0.001; a minimal training sketch is given after Table 3.

Table 3. Word2Vec configuration parameters.
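The training setup described above can be reproduced with Gensim 4.x roughly as follows. The corpus here is a toy stand-in, and the window, minimal-count, and epoch values are assumptions in place of the Table 3 values; only the 60-dimensional vector size (mentioned in Sect. 6) comes from the paper.

```python
from gensim.models import Word2Vec

# Toy corpus; in practice every tokenized sentence of MegaLitePT is used.
sentences = [
    ["não", "tive", "filhos"],
    ["não", "transmiti", "a", "nenhuma", "criatura", "o", "legado", "da", "nossa", "miséria"],
]

model = Word2Vec(
    sentences=sentences,
    sg=1,             # skip-gram architecture [6]
    vector_size=60,   # s: dimension of each embedding (Sect. 6)
    window=5,         # ws: context window size (assumed value)
    min_count=1,      # m: minimal word frequency (1 here so the toy corpus trains)
    epochs=20,        # i: number of training iterations (assumed value)
    negative=5,       # negative sampling of five words
    sample=0.001,     # downsampling threshold
)
model.save("megalite_pt.w2v")
```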

Table 4 displays the 10 nearest tokens found in MegaLitePT for the query words Azul (blue), Mulher (woman), and Amor (love). The distance between the query and a token is determined by the cosine similarity given by Eq. (2) (see the model description in Sect. 4). For each query word, Q, in Table 4, the left column shows a word, x, associated with Q chosen from the corpus by Word2vec, and the right column shows the cosine similarity between Q and x. We chose not to translate the words associated with the queries within the table, since many of them are synonymous with each other or have no single-word English translation. An interesting feature of MegaLite is that it captures literary/artistic meanings of words which normally do not emerge from non-literary corpora.

Table 4. List of 10 nearest tokens found in MegaLitePT for queries Azul, Mulher and Amor.
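Queries such as those of Table 4 can be reproduced with Gensim's most_similar, which already ranks tokens by the cosine similarity of Eq. (2). This sketch assumes a model trained on the full corpus, so that the query word is in the vocabulary.

```python
from gensim.models import Word2Vec

model = Word2Vec.load("megalite_pt.w2v")  # model from the training sketch above
for token, cosine in model.wv.most_similar("amor", topn=10):
    print(f"{token}\t{cosine:.3f}")
```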

3.3 POS Tag and Lemma Representations

In this section, we present two representations of MegaLitePT: the first uses only POS tags and the second only lemmas. This solution enables sharing the corpus without breaking copyright laws, while still preserving semantic information. Table 5 contains a very small subset of these representations: a few sentences from Machado de Assis's "Memórias Póstumas de Brás Cubas". The first column displays the line number, which corresponds to the order of the sentence in the original text document (its line number in the file). The second column shows the sentence as it appears in the original text. The third column displays the POS tag representation of the sentence, and the fourth column shows its lemma representation. These two representations of MegaLitePT are formed as described below.

POS Tag Corpus. This representation is constructed by performing a morpho-syntactic analysis of each document and replacing each word with its corresponding POS tag. The analysis was performed using Freeling version 4.0 [16]. The POS tag encodes grammatical information for each word within a given sentence.

Lemma Corpus. The second representation is a lemmatized version of the original documents. Freeling POS tags were used as references to extract only meaningful lexical words, in this case verbs, nouns, and adjectives. Each extracted word was then substituted by its lemma, which is the base form of the word, without conjugation, in the singular and in the neuter or masculine gender. Words with all other types of POS tags, i.e. not verbs, nouns, or adjectives, were removed from this corpus.
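The following sketch illustrates both representations. The paper's versions were produced with Freeling 4.0; here spaCy and its Portuguese model serve only as an illustrative stand-in tagger, so the tags and lemmas may differ from Freeling's output.

```python
import spacy

# pip install spacy && python -m spacy download pt_core_news_sm
nlp = spacy.load("pt_core_news_sm")

def pos_tag_version(sentence: str) -> str:
    """Replace every word by its POS tag (spaCy uses UPOS; Freeling uses EAGLES tags)."""
    return " ".join(tok.pos_ for tok in nlp(sentence))

def lemma_version(sentence: str) -> str:
    """Keep only verbs, nouns, and adjectives, each replaced by its lemma."""
    return " ".join(tok.lemma_ for tok in nlp(sentence)
                    if tok.pos_ in {"VERB", "NOUN", "ADJ"})

s = "Não tive filhos, não transmiti a nenhuma criatura o legado da nossa miséria."
print(pos_tag_version(s))
print(lemma_version(s))   # e.g. "ter filho transmitir criatura legado miséria"
```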

Table 5. Samples of sentences recovered from Machado de Assis’s novel “Memórias Póstumas de Brás Cubas”, in different versions of MegaLitePT.

3.4 n-Gram Statistics

MegaLite also provides the frequencies of occurrence of unigrams, bigrams, and skip-grams of the type SU4-bigram [1]. SU4-bigrams are obtained by taking a pair of words from a sentence such that the second word lies n steps after the first, i.e., using n-sized skip-grams for \(n = 1, 2, 3, 4\). For example, in the sentence "Não tive filhos, não transmiti a nenhuma criatura o legado da nossa miséria.", the SU4-bigrams starting at the word filhos are: filhos/não, filhos/transmiti, filhos/a, and filhos/nenhuma. The same procedure is applied to every token of every sentence in the text. All occurrences of the same pair are then summed to obtain the total frequency of each pair of tokens, and the pairs are sorted in decreasing order of frequency. In Table 6, we display the 5 most frequent bigrams and SU4-bigrams for 4 texts by different authors.
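A minimal sketch of the SU4-bigram extraction, matching the example above:

```python
from collections import Counter

def su4_bigrams(tokens: list[str]) -> list[tuple[str, str]]:
    """Pairs (w_i, w_{i+n}) for n = 1, 2, 3, 4, over one tokenized sentence."""
    pairs = []
    for i, w in enumerate(tokens):
        for skip in range(1, 5):
            if i + skip < len(tokens):
                pairs.append((w, tokens[i + skip]))
    return pairs

sentence = ["não", "tive", "filhos", "não", "transmiti", "a", "nenhuma",
            "criatura", "o", "legado", "da", "nossa", "miséria"]
counts = Counter(su4_bigrams(sentence))
# The pairs starting at "filhos" are filhos/não, filhos/transmiti,
# filhos/a, and filhos/nenhuma, as in the example above.
print(counts.most_common(5))
```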

Table 6. Bigrams and SU4-Bigrams with the 5 highest frequencies from 4 literary works in MegaLitePT.

4 Model for Generating Artificial Literary Sentences

In this section, we briefly describe an adaptation of our previously developed model for literary sentence generation [8, 12, 13]. We have used this model to generate sentences in Spanish and French, using MegaLiteES and MegaLiteFR, and in the next section we show results of its use in ATG experiments with MegaLitePT. The model consists of the two following stages.

First Stage - Canned Text. This step uses the canned text method, commonly employed in ATG [3]. The process begins by selecting a sentence f from the original version of MegaLitePT, which will be used to generate a new phrase. Sentence f is parsed with FreeLing [16] to replace its lexical words with their morpho-syntactic labels (POS tags), thus generating a Partially Empty Grammatical Structure (PGS). Functional words such as prepositions, pronouns, auxiliary verbs, and conjunctions are kept in the sentence. To maintain semantic accuracy, the selected sentences must have at least 3 lexical words, but no more than 10. Once the PGS has been generated, it is analyzed by the semantic module in the second stage; a sketch of this step is shown below.
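A hedged sketch of the PGS construction, again with spaCy standing in for FreeLing; the set of lexical classes follows the definition in Subsect. 3.3 (verbs, nouns, adjectives).

```python
import spacy

nlp = spacy.load("pt_core_news_sm")
LEXICAL = {"VERB", "NOUN", "ADJ"}   # lexical word classes, per Subsect. 3.3

def to_pgs(sentence: str) -> str:
    """Replace lexical words by POS-tag slots; keep functional words as-is."""
    doc = nlp(sentence)
    n_lex = sum(tok.pos_ in LEXICAL for tok in doc)
    if not 3 <= n_lex <= 10:
        raise ValueError("sentence outside the 3-10 lexical-word range")
    return " ".join(tok.pos_ if tok.pos_ in LEXICAL else tok.text for tok in doc)

print(to_pgs("Não tive filhos, não transmiti a nenhuma criatura o legado da nossa miséria."))
# e.g. "Não VERB NOUN , não VERB a nenhuma NOUN o NOUN da nossa NOUN ."
```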

Second Stage - Semantic Module (Word2vec) Training. We next replace the POS tags of the PGS with lexical words using the Word2vec model. This model was implemented for our experiments with the skip-gram architecture [6], using MegaLitePT for training. We used the hyper-parameter values specified in Table 3 during the Word2vec training phase, obtaining 389 340 embeddings.

In order to select the vocabulary that will replace the POS tags in the PGS formed from f, we implemented a procedure based on the arithmetic analogy proposed in [4]. We consider the three embeddings corresponding to the words Q, O, and A, defined as follows:

\(\vec Q\): the embedding associated with the context word Q, the query, given by the user;

\(\vec O\): the embedding associated with the original word O in f which has been replaced by the POS tag;

\(\vec A\): the embedding associated with the word adjacent to O on the left in the sentence f.

With these embeddings, we calculated a fourth embedding \(\vec y\) with the expression

$$\begin{aligned} \vec {y} = \vec {A} - \vec {O} + \vec {Q} \, . \end{aligned}$$
(1)

This embedding \(\vec y\) has the features of \(\vec {A}\) and \(\vec {Q}\) enhanced and those of \(\vec {O}\) diminished, so that it is more distant from \(\vec O\).

We then obtain, with Word2vec, the embeddings of the best word associations related to \(\vec y\), and store the first \(M=4~000\) of these in a list \(\mathcal {L}\), i.e. we take the first 4 000 outputs of Word2vec when \(\vec y\) is given as input. \(\mathcal {L}\) is thus an ordered list of 4 000 vectors (a matrix) in which each row, j, corresponds to the embedding of a word \(w_j\) associated with \(\vec y\). The value of M was established as a compromise between execution time and the quality of the embeddings. The next step consists of ranking the M embeddings in \(\mathcal {L}\) by calculating the cosine similarity between the \(j^{th}\) embedding in \(\mathcal {L}\), \(\vec {L_j}\), and \(\vec {y}\) as

$$\begin{aligned} \theta _j = \cos (\vec {L_j},\vec {y}) = \frac{\vec {L_j} \cdot \vec {y}}{||\vec L_j|| \cdot ||\vec y||} \,\,\,\, 1 \le j \le M . \end{aligned}$$
(2)

\(\mathcal {L}\) is ranked in decreasing order of \(\theta _j\).

Another important characteristic to consider when choosing the substitute word is grammatical coherence. We therefore implemented a bigram analysis, estimating the conditional probability of the presence of the \(n^{th}\) word, \(w_n\), in a sentence, given the presence of the previous, adjacent word on the left, \(w_{n-1}\):

$$\begin{aligned} P(w_n|w_{n-1})=\frac{P(w_n \wedge w_{n-1})}{P(w_{n-1})} \, . \end{aligned}$$
(3)

The conditional probability of Eq. (3) is estimated from the frequencies of occurrence of each bigram in MegaLitePT, obtained with the n-gram detection procedure used when constructing the corpus, as described in Subsect. 3.4. Among the bigrams in MegaLitePT, we considered only those formed by lexical and functional words (punctuation, numbers, and symbols are ignored), forming a list, LB, used to calculate the frequencies.

For each \(\vec L_j\) in \(\mathcal {L}\), we compute two bigrams, \(b1_j\) and \(b2_j\): \(b1_j\) is formed by the word adjacent to O on the left in f (corresponding to embedding \(\vec A\)) concatenated with the word \(w_j\) (corresponding to embedding \(\vec L_j\)), and \(b2_j\) is formed by \(w_j\) concatenated with the word adjacent to O on the right in f. We then calculate the arithmetic mean, \(bm_j\), of the frequencies of occurrence of \(b1_j\) and \(b2_j\) in LB. If O is the last word in f, \(bm_j\) is simply the frequency of \(b1_j\). The value \(bm_j\) for each \(\vec L_j\) is then combined with the cosine similarity \(\theta _j\) obtained with Eq. (2), and the list \(\mathcal {L}\) is re-ranked in decreasing order of the new value

$$\begin{aligned} \theta _j\,{:}{=}\,\frac{\theta _j + bm_j}{2} \, , \,\,\,\,\, 1 \le j \le M \, . \end{aligned}$$
(4)
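Putting Eqs. (1)-(4) together, the vocabulary-selection step can be sketched as follows. Here bigram_freq stands for the list LB (an assumed dictionary of relative frequencies), Gensim's similar_by_vector supplies the cosine similarities \(\theta _j\) of Eq. (2), and the final Freeling-based inflection step is omitted.

```python
from gensim.models import Word2Vec

model = Word2Vec.load("megalite_pt.w2v")
bigram_freq: dict[tuple[str, str], float] = {}    # LB: (w1, w2) -> relative frequency

def select_word(query: str, original: str, left: str, right: str | None,
                M: int = 4000) -> str:
    """Pick the replacement for O following Eqs. (1)-(4)."""
    y = model.wv[left] - model.wv[original] + model.wv[query]     # Eq. (1)
    candidates = model.wv.similar_by_vector(y, topn=M)            # (w_j, theta_j), Eq. (2)

    def score(item: tuple[str, float]) -> float:
        w, theta = item
        bm = bigram_freq.get((left, w), 0.0)                      # frequency of b1_j
        if right is not None:                                     # O is not last in f
            bm = (bm + bigram_freq.get((w, right), 0.0)) / 2      # mean with b2_j
        return (theta + bm) / 2                                   # Eq. (4)

    return max(candidates, key=score)[0]
# The chosen word is then inflected to O's gender and number with Freeling.
```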
Fig. 1. Second step: vocabulary selection.

Next, we take the word corresponding to the first embedding in \(\mathcal {L}\) as the candidate to replace O. The idea is to select the word semantically closest to \(\vec {y}\), based on the analysis performed by Word2vec, while keeping the coherence of the text through the linguistic analysis done by the language model and the structure of MegaLitePT. The definition of \(\vec y\) given by Eq. (1) should allow the substitution of O by a word more distant in meaning, so that potentially more creative phrases may arise. Finally, to respect the syntactic information given by the POS tag, we use Freeling to convert the selected word to the gender and number inflection of the original word O, as specified by its POS tag. This process is repeated for each replaceable word in f (each POS tag). The result is a new sentence that does not exist in the MegaLitePT corpus. The model is illustrated in Fig. 1, where the sentence f converted to a PGS is shown at the top. The PGS sends inputs to the Word2vec module, which receives Q, A, and O to generate the list \(\mathcal {L}\). This list is then filtered with the language model to obtain the best choice, with the correct grammatical structure returned by Freeling.

5 Experiments of Automatic Sentence Generation in Portuguese

In this section, we describe a group of experiments designed to evaluate the influence of the MegaLitePT corpus on the task of automatic sentence generation. We describe the evaluation protocol, show some examples of generated sentences, and present our results. We chose 45 sentences, with different grammatical structures and lengths varying from 3 to 10 lexical words, to be used as input to the canned text method. In Table 7, we display some of the queries and the corresponding sentences generated with the model explained in Sect. 4.

Table 7. Generated sentences based on user input queries.

5.1 Evaluation Protocol and Results

Using the method described in Sect. 4, we automatically generated a set of fifteen sentences for each of the queries amor, guerra, and sol, for a total of 45 sentences. We grouped the sentences by query and submitted them for human evaluation to 18 persons, each of whom completed the evaluation survey. Each sentence was evaluated according to the three following qualitative categories.

  • Grammaticality. This category measures the grammatical quality of the generated text. The main characteristics to be evaluated are orthography, verb conjugation, gender and number agreement, and punctuation. Other grammatical rules may also be evaluated, but with a lesser degree of importance.

  • Coherence. In this case, we ask how harmonious and well placed the words are within the sentence. The principal points of analysis are the correct use of words and word sequences; the sentence should have a clear meaning and should read without difficulty.

  • Context. This category represents how the sentence relates to the topic of the query. Naturally, in a literary sentence, the relation with the topic can be subtle or even antagonistic.

Each of these criteria is evaluated with a discrete numerical value of 0, 1, or 2, where 0 means that the sentence does not match the category at all, 1 means that the sentence satisfies some of the conditions of the category but not all, and 2 means that the sentence seems correct with respect to the category.

In the instructions for the evaluators, we stated that some sentences were generated by a computational algorithm and others were extracted from various literary works. We did not inform the evaluators of the actual ratio between these two categories. We also performed an adapted Turing test in which, for each sentence, we asked the evaluator to predict whether the sentence is artificial, that is, generated by a computer, or natural, that is, written by a human author.

Fig. 2. Evaluation of coherence, grammaticality and context of 45 automatically generated sentences.

The results of our evaluation procedure can be seen in Fig. 2: 55% of the sentences were evaluated as grammatically correct, 28% as acceptable, and only 17% as bad. The coherence values also show positive results, with 47%, 30%, and 23% evaluated as good, acceptable, and bad, respectively. Finally, in the context category, 40%, 27%, and 33% were evaluated as good, acceptable, and bad, respectively. All values were rounded to integers, and the sum is 100% in each category, as expected.

Figure 3 shows, for each of the 45 sentences, the distribution of judgments in the adapted Turing test. Each bar sums to 100%: the blue portion represents the percentage of evaluators who considered the text as written by a human (a natural sentence), while the yellow portion represents the percentage who considered it generated by a computer (an artificial sentence). In fact, all sentences were generated by the model. The dashed line indicates the mean ratio of sentences evaluated as natural versus artificial; it shows that, on average, 56% of the evaluators considered the sentences to have been written by humans. Table 8 shows some evaluated sentences and how they were categorized by a large majority (more than 80%) of the evaluators.

Fig. 3. Evaluations for the adapted Turing test.

Table 8. Examples of sentences and results for the adapted Turing test.

6 Conclusions and Perspectives

We have introduced MegaLitePT, an extension of the MegaLite literary corpus consisting of literary documents in Portuguese. We have provided versions of MegaLitePT in POS tag format and in lemmatized form, and we have also made available the lists and frequency distributions of unigrams, bigrams, and SU4-bigrams for statistical analysis. The embeddings, 60-dimensional vectors, were obtained using the Word2vec model. Our experiments show that MegaLitePT is useful for NLP tasks such as automatic sentence generation; our embeddings capture a high degree of literary information and are well suited for creative tasks.

In our human evaluation, on average 56% of the evaluators judged the sentences produced by our model to have been written by human authors. The sentences were rated with good grammaticality, only 17% being considered bad in that category. Good coherence and context were also perceived, with only 23% and 33% of the sentences considered bad in those respective categories. We therefore recommend MegaLitePT for NLP tasks such as deep learning, textual assessment, text generation, and text classification.

6.1 Future Work

We plan to extend this corpus by building a subset of MegaLitePT containing only native writers. We believe that such a corpus would better model the nuances, details, and characteristics of Portuguese literature. We also intend to carry out deep statistical analyses of our corpus to find patterns and metrics that could help us investigate structural properties of literature, of artistic texts, and of ATG.