1 Introduction

Port4NooJ (Barreiro 2010) is a set of resources that allows the generation of paraphrases for Portuguese, feeding the linguistic engine of the eSPERTo paraphrasing system, which is based on NooJ technology (Silberztein 2015). The term paraphrase is commonly used to refer to the relation between two or more constructions that are morpho-syntactically and/or semantically related (e.g. to make a presentation (of) = to present). In most cases, this relation is established between constructions corresponding to the same syntactic unit. However, the transformations described in lexicon-grammar tables also allow establishing relations between different syntactic units (e.g. These teachers are Portuguese = The Portuguese teachers [...]). Hence, we extend the term paraphrase to the semantico-syntactic relation between two or more sentences and/or their constituents.

This paper presents an enhanced Port4NooJ that includes 15 lexicon-grammar tables describing the distributional properties of 4,248 human intransitive adjectives formalized by Carvalho (2007). Among other properties, these linguistic resources provide information on: (i) the syntactic and semantic nature of the subject modified by each adjective, which can correspond to a human noun, a complex noun phrase involving an appropriate noun, or to a finite or non-finite clause; (ii) the copulative verbs (and aspectual variants) selected by each adjective; (iii) the constraints related to the quantification of adjectives by an adverb or a degree morpheme; (iv) the position of adjectives in adnominal context (pre- or post-nominal position); (v) the possibility of certain adjectives being optionally followed by an infinitive clause, with causal interpretation, or by a human noun phrase introduced by the preposition para (for). In addition to general properties, these resources also describe particular constructions in which human intransitive adjectives may occur, as in: (vi) generic and cross-constructions, where the adjective fills the head of a noun phrase; (vii) characterizing indefinite constructions, where the adjective occurs after an indefinite article; (viii) exclamative sentences expressing insult. Moreover, these lexicon-grammar tables specify the morpho-syntactically related predicative nouns and verbs, whenever they exist, as well as the appropriate nouns that can appear in specific adjective constructions.

The properties described in the lexicon-grammar tables add new paraphrasing capabilities to eSPERTo. The initial Port4NooJ paraphrases involved transformations of support verb constructions (or their stylistic or aspectual variants) into single verbs. Later on, new paraphrasing capabilities were added to Port4NooJ, namely transformations of phrasal verbs into equivalent expressions, compound adverbs into single adverbs, relatives into participial adjectives, relatives into possessives, relatives into compound nouns, and agentive passives into actives. Section 4 presents examples of these paraphrases; a more detailed description of the first Port4NooJ paraphrasing capabilities can be found in Barreiro (2009) and Barreiro (2011).

The linguistic knowledge described in the integrated tables allows the mapping of several other types of paraphrasing constructions that rest on a semantic relationship between adjective, noun and verb predicates. The lexicon-grammar tables enable eSPERTo to paraphrase, among others: (i) morphologically related adjective, noun and verb constructions (está zangado (is angry) = zangou-se (got (self) angry) = esteve envolvido numa zanga (was involved in anger)); (ii) adjective constructions supported by different copulative verbs (estar perdido (to be lost) = andar perdido (walk around lost)); (iii) constructions involving nationality and other membership relations (de origem portuguesa (of Portuguese origin/roots) = portugueses (Portuguese) = de Portugal (from Portugal); benfiquista (Benfica fan) = do Sport Lisboa e Benfica (a fan of Sport Lisboa e Benfica)); (iv) cross-constructions (o idiota do rapaz (the idiot of the boy) = o rapaz é um idiota (the boy is an idiot)); (v) appropriate noun constructions (foi moderado nos seus comentários (he was moderate in his comments) = os seus comentários foram moderados (his comments were moderate) = foi moderado (he was moderate)); (vi) generic noun phrases (é um indivíduo estúpido (he is a fool) = é um estúpido (he is a fool) = é estúpido (he is foolish)).

2 Related Work

Our work follows previous attempts to integrate lexicon-grammar tables in natural language processing systems.

Machonis (2010) presents experiments in NooJ with lexicon-grammar tables of transitive and neutral phrasal verbs in English to identify these types of constructions in texts with and without insertion. The author corroborates that, on the one hand, NooJ is a very powerful tool for parsing compound expressions that involve insertion, such as English phrasal verbs, and, on the other hand, a lexicon-grammar of phrasal verbs can help distinguish prepositional usage from true particles in natural language processing.

Vietri (2010) describes the integration, also into NooJ, of one of her 13 lexicon-grammar tables, table EpinC, which contains 900 Italian idioms with the structure \(N_0\) essere in C, where essere (to be) is a support verb, and in C is a frozen or semi-frozen prepositional phrase starting with the preposition in. The author shows that this integration made it possible to refine the linguistic analysis of this type of sequence.

Baptista et al. (2014) discuss lexical and parsing issues of integrating a lexicon-grammar of Portuguese verbal idioms into STRING, a hybrid statistical and rule-based pipeline for natural language processing of Portuguese. More than 2,000 rules were created semi-automatically for ten formal classes of verbal idioms. The system's precision was estimated after processing a large Portuguese corpus of news texts.

3 The eSPERTo Project

This research work was developed within the scope of the eSPERTo project. The main objective of this project is the development of a context-sensitive and linguistically enhanced paraphrase generator that recognizes semantico-syntactic, multiword and other phrasal units, and transforms them into semantically equivalent phrases, expressions, or sentences. This semantically-driven paraphrasing system uses a new hybrid technique that combines statistics and local grammars to acquire linguistic knowledge applied in the identification and generation of new and increasingly more complex paraphrases. Currently, eSPERTo is integrated in an interactive application that helps Portuguese language learners in producing and revising their texts. The utility of eSPERTo's paraphrasing capabilities is now being explored in two other application scenarios: (i) in a question-answering system, to increase the linguistic knowledge of an intelligent conversational virtual agent, and (ii) in a summarization tool, to assist the paraphrasing task. Figure 1 shows the current interactive Web interface of this application. Among other functionalities, the platform includes text-editing mechanisms, which provide a variety of alternatives for each expression, allowing the user to choose among several suggestions that can be immediately applied to the text. For the sentence illustrated in Fig. 1, O homem americano apresentou o trabalho (The American man presented the work), eSPERTo suggests its equivalent passive paraphrase: O trabalho foi apresentado pelo homem americano (The work was presented by the American man). For the noun phrase o homem americano (the American man), eSPERTo suggests paraphrases such as: o homem que é americano (the man who is American), o homem de nacionalidade americana (the man with American nationality), o homem de naturalidade americana (the man with American place of origin), o homem de origem americana (the man with American origin). The user can then select any of the paraphrases listed, or provide his/her own paraphrase.

Fig. 1. Online use of eSPERTo in text editing and revision

4 Port4NooJ and Its First Paraphrases

Port4NooJ is the Portuguese linguistic module of NooJ. The module can be downloaded from the NooJ website or from Linguateca's resources repository. The initial Port4NooJ resources derive from OpenLogos, an open source derivative of the commercial Logos system that is downloadable from the DFKI website and available for testing at INESC-ID. The Logos system was built on the Logos Model (Scott 2003; Barreiro et al. 2011). In order to create Port4NooJ, the OpenLogos English-Portuguese bilingual dictionary was converted into NooJ format and its language pair order was inverted. Besides the large-coverage electronic dictionary with English transfers, the Port4NooJ module contains two other important components: (i) the rules that formalize and document Portuguese inflectional and derivational descriptions, and (ii) different types of grammars, namely morphological, disambiguation, semantico-syntactic, multiword expression, and translation and paraphrasing grammars. The different components of Port4NooJ interact with one another and are used to process texts. Several processing functions can be performed with these resources, among others, part-of-speech annotation, pattern recognition, semantic unit analysis, concordances, information extraction, paraphrasing and translation. Barreiro (2008) and Barreiro (2010) describe in detail the initial dictionary and its enhancement with new linguistic knowledge, namely inflectional, derivational and morpho-syntactic properties, and semantic relations that permitted the generation of paraphrases.

Initially, Port4NooJ covered paraphrases involving support verb constructions or their stylistic or aspectual variants and corresponding single verbs (fazer/realizar/efetuar uma apresentação (make a presentation (of)) = apresentar (present)), compound and single adverbs (de uma forma interativa (in an interactive way) = interativamente (interactively); com entusiasmo (with enthusiasm) = entusiasticamente (enthusiastically)), relatives and participial adjectives (que foram escritos (that were written) = escritos (written)), relatives and possessive constructions (o papel que a Europa tem/desempenha (the role that Europe plays) = o papel da Europa (the role of Europe)), and active/passive constructions (A solta B (A releases B) = B é solto por A (B is released by A)), among others. In Sect. 5, we describe the new Port4NooJ paraphrases resulting from the transformation of human intransitive adjective constructions described in the lexicon-grammar tables.

5 Lexicon-Grammar of Human Intransitive Adjectives

The lexicon-grammar tables explored in this study describe 4,248 human intransitive adjectives, i.e. adjectives that select a human noun as subject and do not require any complement. These adjectives were grouped into 15 subclasses, which present different lexico-syntactic properties, as illustrated in Table 1.

Table 1. Classes and distributional properties of human intransitive adjectives

These properties relate specifically to: (i) the syntactic nature of the subject: depending on the adjective class, this position can be filled by a human noun (Nhum), by a complex noun phrase involving an appropriate noun (Nap de Nhum), and/or by a finite clause (QueF); (ii) the nature of the copular verb (Cop) selected by the adjective: some adjectives co-occur only with the verb ser or only with estar, while others co-occur with both copular verbs; (iii) the possibility of the predicative adjective appearing in a characterizing indefinite construction (CCI).

In addition to these generic properties, the tables describe specific distributional and transformational attributes of the adjective, allowing recognition and generation of a variety of syntactic constructions in which each adjective occurs. Some of these properties depend on the adjective subclass, while others depend on the adjective itself. Figure 2 presents some examples extracted from the lexicon-grammar table that describes the adjectives of nationality. As illustrated in the table, the adjectives of this semantico-syntactic class require a human noun (Nhum) to fill the subject position (N0). In predicative context, they are linked to their subject by the verb ser, and they cannot be quantified or intensified (Quant). Moreover, in adnominal context, the adjectives of nationality can only occur in post-nominal position. However, some properties in the table vary depending on the adjective. For example, even though all the adjectives of this class are able to modify the generic predicative classifier origem (designation of origin), the presence of a more specific noun classifier, such as nacionalidade (nationality/country of origin), naturalidade (place of origin), etnia (ethnicity) or raça (race), depends on the semantics of the adjective (i.e., on the nature of the locative noun with which each adjective is associated). The information described in Fig. 2 allows generating the following constructions, among others:

figure a
Fig. 2. Lexicon-grammar table SAN describing adjectives of nationality

Fig. 3. Lexicon-grammar table SAHC1 describing a subclass of predicative adjectives

Figure 3 presents an excerpt of the table describing the predicative adjectives classified as SAHC1. The adjectives of this class select the verb ser and some of its aspectual or stylistic variants, in particular mostrar-se (show self as), revelar-se (reveal self as), and/or tornar-se (become). Their subject position can be filled by a human noun (Nhum), an appropriate noun (Nap), or a finite clause. Most adjectives described in this table can be quantified by an adverb (Adv), and some can also receive a degree morpheme (Sup). In adnominal context, some adjectives, like antipático (unfriendly), can occur in both post- and pre-nominal position. Among other properties, this table provides information on particular syntactic constructions derived from a set of transformations involving the subject (Reest N0). Below, we present some examples represented in this lexicon-grammar table.

figure b

6 Integration of the Lexicon-Grammar Tables in Port4NooJ

6.1 From Lexicon-Grammar Tables to NooJ Dictionaries

The conversion of the 15 lexicon-grammar tables into a NooJ dictionary of human intransitive adjectives was mostly done automatically with different scripts. The process is hence easily adaptable to integrate other lexicon-grammar tables. For each entry in a lexicon-grammar table, we applied the following main steps:

  1. If the adjective is already in Port4NooJ, merge the lexicon-grammar properties with every homograph adjectival entry in the Port4NooJ dictionary by adding the new properties to that entry; otherwise, create a new entry;

  2. Create inflectional (FLX) and derivational (DRV) codes and corresponding rules as needed;

  3. Check for missing FLX and DRV codes, and create new ones as needed.
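The first step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scripts used in the project: the dictionary is modelled as a plain mapping from lemma to a list of entries, and the function name `merge_adjective`, the toy entries and the FLX codes shown are hypothetical.

```python
def merge_adjective(dictionary, lemma, lg_props):
    """Add lexicon-grammar properties to every adjectival homograph of
    `lemma`, or create a new adjectival entry if none exists."""
    entries = dictionary.setdefault(lemma, [])
    adjectival = [e for e in entries if e["pos"] == "A"]
    if not adjectival:
        # Lemma missing, or present only with another part of speech.
        entries.append({"pos": "A", "props": list(lg_props)})
        return "created"
    for entry in adjectival:
        for prop in lg_props:
            if prop not in entry["props"]:
                entry["props"].append(prop)
    return "merged"


# Toy dictionary (hypothetical entries and FLX codes).
toy = {
    "alegre": [{"pos": "A", "props": ["FLX=CODE1"]},
               {"pos": "A", "props": ["FLX=CODE2"]}],
    "solteiro": [{"pos": "N", "props": ["FLX=CODE3"]}],
}

status = {lemma: merge_adjective(toy, lemma, ["IH"])
          for lemma in ("alegre", "solteiro", "abissínio")}
```

In this toy run, both homographs of alegre receive the new property, while solteiro (noun only) and abissínio (missing) each get a new adjectival entry.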

6.1.1 Representation of Lexicon-Grammar Table Properties

Two properties were added to all human intransitive adjectives. The first, +IH, indicates that the adjective is a human intransitive adjective, and the second refers to the lexicon-grammar table where the adjective properties are formalized.

For each distinct column in a lexicon-grammar table, a corresponding property was created. If the adjective row is marked with the value + in that column, the property was added to the adjective entry. Properties that have a value other than +/- were added together with their value. For the properties Nome and Verbo, instead of creating plain properties, a script translates the adjective/noun and adjective/verb pair(s), if they exist, into a derivation paradigm and creates the corresponding derivation attributes for the noun and the verb, respectively.
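The column-to-property mapping can be illustrated as follows. This is a minimal sketch: the output notation (`+Column` and `+Column=value`) is an assumption chosen for illustration, the function name `row_to_properties` is hypothetical, and the sample row is invented rather than taken from the tables.

```python
# Columns handled separately through derivation paradigms.
DERIVATION_COLUMNS = {"Nome", "Verbo"}


def row_to_properties(row):
    """Turn one lexicon-grammar row (column -> cell value) into a
    property string. Assumed notation: +Column or +Column=value."""
    props = []
    for column, value in row.items():
        if column in DERIVATION_COLUMNS:
            continue                      # handled by the derivation step
        if value == "+":
            props.append(f"+{column}")    # binary property that holds
        elif value not in ("-", ""):
            props.append(f"+{column}={value}")  # valued property
    return "".join(props)


# Invented sample row for illustration only.
sample_row = {"Nhum": "+", "Quant": "-", "Cop": "ser", "Nome": "alegria"}
sample_props = row_to_properties(sample_row)  # "+Nhum+Cop=ser"
```

Cells marked - simply produce no property, which keeps the dictionary entries compact.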

The drv code is determined and formalized automatically by finding the radical between the adjective and the noun or verb. For example, the adjective alegre (happy) is associated with the corresponding noun alegria (happiness) and the corresponding verb alegrar (become happy) through derivation rules (cf. A2B143 and A2V6 below), which replace the adjectival ending -e with the noun and verb endings -ia and -ar, respectively.

figure c
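Finding the radical shared by the adjective and its derived form can be sketched as below. This is an illustrative approximation, assuming that the radical is simply the longest common prefix of the two forms; the function names are hypothetical and the rule is returned as a plain pair rather than in NooJ notation.

```python
import os


def derivation_rule(base, derived):
    """Return (number of characters to strip from `base`, suffix to add),
    using the longest common prefix as the radical."""
    radical = os.path.commonprefix([base, derived])
    return len(base) - len(radical), derived[len(radical):]


def apply_rule(base, rule):
    """Apply a (strip, suffix) rule to a base form."""
    strip, suffix = rule
    return base[:len(base) - strip] + suffix


noun_rule = derivation_rule("alegre", "alegria")   # (1, "ia")
verb_rule = derivation_rule("alegre", "alegrar")   # (1, "ar")
```

For alegre, both rules strip the final -e of the radical alegr- and then add -ia or -ar, which matches the replacement described above.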

The inflection of the derived word (flx_code1 or flx_code2) is determined by consulting Port4NooJ (cf. FLX=CASA for the noun and FLX=FALAR for the verb).

figure d

In cases where the derived forms did not exist, their codes were assigned automatically. The FLX code of the base form, the adjective, was determined in the same way: the inflection code is looked up in Port4NooJ or assigned automatically in cases where the word does not exist.

Additional properties were created to account for specific knowledge required in paraphrasing. For example, the property +TopDET={o|a|os|as|undef} indicates the determiner that co-occurs with a toponym:

figure e

The value of +TopDET was determined automatically by consulting the AC/DC corpora (28 corpora covering different variants of Portuguese, with a total of 1,279 million words) and counting how often each determiner occurs, or is absent, in the context of the prepositions de (of) and em (in). For each toponym in the lexicon-grammar tables, each corpus was queried with a query of the following structure, written in CQP, the language used to query the AC/DC corpora:

figure f

In cases where the toponym did not occur in the corpora, +TopDET=undef was used to distinguish those cases from toponyms that do not accept determiners, for which the property +TopDET was not added to the adjective entry.
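The decision logic for +TopDET can be sketched as follows. The exact criterion applied to the corpus counts is not spelled out here, so this sketch assumes that the most frequent option wins; the function name `top_det_property` and the counts in the examples are hypothetical.

```python
def top_det_property(counts):
    """Choose the +TopDET value from corpus counts.

    `counts` maps a determiner ("o", "a", "os", "as") or None (toponym
    used without determiner) to its frequency after de/em.
    Assumption: the most frequent option decides the value.
    """
    if sum(counts.values()) == 0:
        return "+TopDET=undef"            # toponym not found in the corpora
    most_frequent = max(counts, key=counts.get)
    if most_frequent is None:
        return ""                         # no determiner: property not added
    return f"+TopDET={most_frequent}"


# Hypothetical counts for illustration.
with_determiner = top_det_property({"o": 950, None: 50})      # "+TopDET=o"
without_determiner = top_det_property({None: 990, "o": 10})   # ""
unseen = top_det_property({})                                 # "+TopDET=undef"
```

Keeping undef distinct from the empty result preserves the difference between "no evidence" and "evidence that no determiner is used".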

6.1.2 Integration with eSPERTo Dictionary Entries

After the properties of each adjective in the lexicon-grammar tables were created, the script merged those properties with the information corresponding to that adjective in the Port4NooJ entries. When the adjective already existed in Port4NooJ, the lexicon-grammar properties were added to all the adjective homograph entries in Port4NooJ. This means that, for example, the following Port4NooJ entries:

figure g

became the following new entries:

figure h

Initially, this process was done blindly, i.e., the Port4NooJ entries were not checked for properties that excluded them from being human intransitive adjectives, and that, consequently, should not receive lexicon-grammar attributes. In a second round, entries with at least the attribute +AB, i.e., adjectives that are classified as “abstract”, such as the first entry above, should be discarded to obtain a more accurate version of the dictionary of human intransitive adjectives.
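The second round can be pictured as a simple filter over dictionary lines. This is a sketch under assumptions: entries are treated as strings of the form `lemma,CATEGORY+prop+prop=value`, the helper names are hypothetical, and the two sample entries (including their FLX code) are invented.

```python
def has_property(entry, name):
    """True if a dictionary line carries property `name` (with or without
    a value). Assumed line format: lemma,CATEGORY+prop+prop=value"""
    _, _, info = entry.partition(",")
    properties = info.split("+")[1:]      # first item is the category
    return any(p == name or p.startswith(name + "=") for p in properties)


def drop_abstract(entries):
    """Discard entries classified as abstract (+AB)."""
    return [e for e in entries if not has_property(e, "AB")]


# Invented sample entries for illustration only.
merged = ["moderado,A+FLX=CODE1+AB+IH", "moderado,A+FLX=CODE1+IH"]
filtered = drop_abstract(merged)
```

Only the homograph without +AB survives, so the lexicon-grammar attributes remain attached to the reading that can plausibly be a human intransitive adjective.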

When the adjective did not exist in Port4NooJ, new entries were created. This happened for different reasons: the adjective was missing from Port4NooJ (e.g., abissínio), it derived from another base form (e.g., arranhado is the past participle of arranhar), or it had another part-of-speech tag in Port4NooJ (e.g., solteiro is a noun only in Port4NooJ). In any of those cases, the inflection code was assigned automatically, given that the suffixes of the human intransitive adjectives were very regular. A few exceptional adjectives with less productive suffixes were missing FLX codes. Those entries were reviewed by linguists and their codes were assigned manually. New FLX codes and corresponding inflectional paradigms were created as needed. All other properties of new adjectival entries came from the lexicon-grammar tables:

figure i

6.2 From Lexicon-Grammar Tables to NooJ Grammars

Syntactic grammars in NooJ can be described and used in two different ways: for syntactic parsing, and for transformational analysis. We explored both, as described in Sects. 6.2.1 and 6.2.2. However, for the time being, eSPERTo is generating paraphrases through syntactic parsing.

6.2.1 Option 1: Syntactic Parsing

NooJ syntactic grammars that are used to parse a text need to describe, for each input, the corresponding paraphrases that will be generated in the output. For example, the possibility of having the indefinite article, a construction common to several tables, can be described by recognizing the sequence without the article and generating the construction including the determiner, or the reverse. However, swapping the input with the output duplicates information (cf. top path with bottom path in Fig. 4). With just two equivalent constructions, this is not a big problem. With a set of more than two paraphrases (e.g., o homem americano | o homem dos EUA | o homem de origem americana | o homem de nacionalidade americana | etc.), recognizing every construction in the set and generating the alternative constructions would require at least \(n \times (n-1)\) paths, where n is the number of paraphrases in the set. For example, the grammar in Fig. 5 recognizes only o homem americano and generates the corresponding paraphrases. Similar grammars would have to be constructed for each of the other paraphrases to be recognized in a text as well.
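The \(n \times (n-1)\) growth can be checked by enumerating every ordered (input, output) pair in a paraphrase set. The snippet below is only a counting illustration over the four constructions quoted above; it does not model NooJ grammars.

```python
from itertools import permutations

paraphrases = [
    "o homem americano",
    "o homem dos EUA",
    "o homem de origem americana",
    "o homem de nacionalidade americana",
]

# Each ordered pair stands for one recognize-then-generate path.
paths = list(permutations(paraphrases, 2))
n = len(paraphrases)
print(len(paths), n * (n - 1))  # prints "12 12"
```

With four paraphrases there are already 12 directed paths, and adding a fifth would raise the count to 20.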

Fig. 4. Characterizing indefinite constructions: paraphrasing through parsing

Fig. 5. Paraphrasing constructions involving patronymic adjectives

6.2.2 Option 2: Transformational Module

A better description than option 1 is to state once that the construction without the indefinite article (top path in Fig. 4) and the construction with the indefinite article (bottom path in Fig. 4) are equivalent, as described in Fig. 6. This equivalence is expressed through the use of the global variable @A. That grammar can then be used in the transformational module to generate all the equivalent constructions.

Fig. 6. Characterizing indefinite constructions: paraphrasing through transformational analysis

Table 2. Statistics on the merge between human intransitive adjectives and Port4NooJ adjectives

7 Preliminary Results

The Port4NooJ dictionary formalizes 40,336 lemmas that recognize 1,006,424 word forms. There are 13,051 entries formalizing adjectives, which correspond to 6,115 different adjectives.

The new standalone dictionary of human intransitive adjectives integrated in Port4NooJ includes 5,177 entries, which correspond to 4,138 different adjectives. Table 2 shows, for each lexicon-grammar table, how many adjectives already existed in Port4NooJ and how many were added. Only 26% of the adjectives formalized in the lexicon-grammar tables were already in Port4NooJ. This means that the number of different adjectives in Port4NooJ increased by about 50%.
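The figure of about 50% can be reproduced from the numbers reported above. The check below assumes that the 6,115 different adjectives refer to Port4NooJ before the integration and that the rounded 26% share applies to the 4,138 different adjectives; both are readings of the text, not statements made in it.

```python
adjectives_in_tables = 4138     # different human intransitive adjectives
share_already_present = 0.26    # rounded share already in Port4NooJ
adjectives_before = 6115        # assumed: count before the integration

newly_added = adjectives_in_tables * (1 - share_already_present)
relative_increase = newly_added / adjectives_before

print(f"{newly_added:.0f} new adjectives, {relative_increase:.0%} increase")
```

Under these assumptions, roughly 3,062 adjectives are new, which is close to half of the earlier adjective inventory.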

A few grammars were constructed that explore the information in the new dictionary to extend eSPERTo paraphrase knowledge. We started by developing grammars to recognize and paraphrase (i) constructions involving patronymic adjectives, (ii) characterizing indefinite constructions, (iii) the possibility of alternating Vcop ser and estar with other aspectual variants, and (iv) cross constructions.

Fig. 7. Using properties of the human intransitive adjectives in noun phrase grammars

The information in the new dictionary of human intransitive adjectives was also used to improve the recognition of human noun phrases (see Fig. 7).

8 Conclusions and Future Work

We successfully integrated 15 lexicon-grammar tables describing the distributional properties of human intransitive adjectives into Port4NooJ by creating a standalone dictionary of human intransitive adjectives and by creating grammars that use information provided by the new dictionary to describe equivalent constructions involving those adjectives. In this way, we extended eSPERTo's paraphrasing capabilities.

In the near future, we intend to: (i) create additional grammars to recognize the remaining constructions formalized in lexicon-grammar tables of human intransitive adjectives; (ii) revise and evaluate the new resources; (iii) integrate and adapt additional lexicon-grammar tables, such as the ones formalizing constructions with Vsup ser de (Baptista 2000) and Vsup fazer (Chacoto 2005).

We will also use the Port4NooJ paraphrase knowledge to annotate a corpus with paraphrases. This corpus will be used to develop, train and test eSPERTo's hybrid paraphrase acquisition engine. In turn, the new paraphrases will be merged with the existing paraphrases in Port4NooJ.