1 Introduction

Paraphrases are key linguistic assets in natural language interpretation and generation and play an important role in NLP applications. Paraphrasing is a technique that involves the replacement or displacement of words, phrases or expressions in a sentence with semantically equivalent linguistic structures so that the meaning of the output sentence remains the same. Support verb constructions represent an area of substantial interest in paraphrasing, as they often have a corresponding verb associated with them. For example, the support verb construction fazer uma sugestão (make a suggestion), where the verb fazer has a weak semantic value and the predicate noun sugestão has a strong semantic value, is equivalent to the (strong) verb sugerir (suggest). Furthermore, support verb constructions are a rich source of paraphrases because of the stylistic variance allowed by the equivalence among different types of support verbs (elementary and non-elementary), among other reasons stated in earlier work (Barreiro 2009).

In this paper, we describe in detail the integration in the lexicon of Port4NooJ, the Portuguese module of NooJ (Silberztein 2016), of nearly 3,000 predicate nouns that co-occur with the support verb fazer, one of the most frequent verbs in Portuguese (Footnote 1), thus complementing existing lexicon-grammar resources available for the Portuguese language. The resource integration work was developed within the eSPERTo (Footnote 2) project, whose main objective is to develop a smart, context-sensitive and linguistically enhanced paraphrasing system. At its current stage of development, eSPERTo comprises a paraphrase generator, a paraphrase acquisition module, and an interactive web application. eSPERTo recognizes semantico-syntactic, multiword and other phrasal units, and transforms them into semantically equivalent phrases, expressions, or sentences. It uses local grammars to acquire linguistic knowledge that is applied in the identification/recognition and generation of different types of paraphrase, such as the support verb construction paraphrases described in Sect. 3. The new resources, which combine the dictionary of predicate nouns created by merging the information in lexicon-grammar tables with the entries in Port4NooJ plus the new transformational grammars, contributed to over 5,000 entries before revision (Footnote 3). The utility of eSPERTo’s paraphrasing capabilities has been explored in a question-answering system to increase the linguistic knowledge of an intelligent conversational virtual agent and in a summarization tool to assist the paraphrasing task, but the resources described in this paper have not been tested in these applications. The broader application envisaged for the paraphrasing system is e-learning, namely helping Portuguese language learners edit and revise their texts. Precise paraphrases can also be helpful in professional translation, editing, and proofreading, among other tasks.

2 Related Work

Support verbs have been extensively and systematically studied within the lexicon-grammar theory proposed by Gross (1975), from the theoretical and methodological points of view, for many languages. Practical analytic formalizations of multiword units (some including support verb constructions) in the Romance languages have been presented by several authors (D’Agostino and Elia 1998; Laporte and Voyatzi 2008; Silberztein 1993), among others. Support verb constructions have also been taken into account in contrastive studies for English and French (Salkoff 1990). With regard to Portuguese, most studies on support verb constructions outline the representation of argument mapping relations between a support verb and a nominalization (Baptista 2005; Chacoto 2005; Ranchhod 1990). Although support verbs often combine with autonomous predicate nouns, some studies focus on predicate adjective constructions (Carvalho 2007; Casteleiro 1981).

Mota et al. (2016) integrated the lexicon-grammar of human intransitive adjectives formalized by Carvalho (2007) into Port4NooJ, and showed how the properties contained in the lexicon-grammar tables can be used in paraphrasing tasks. The next step was to integrate complementary lexicon-grammars to expand the paraphrastic capabilities of Port4NooJ, building on that previous work. Thus, the objective of the present work was to integrate the lexicon-grammar of predicate nouns that co-occur with the support verb fazer, formalized by Chacoto (2005). The distributional and transformational properties of the predicate nouns addressed were used as the main criteria for establishing the 19 classes and subclasses described in Sect. 3.

3 Lexicon-Grammar of Predicate Nouns with \(V_{sup}\) fazer

The systematic survey of predicate noun constructions with support verb fazer performed by Chacoto (2005) was carried out by assessing seven dictionaries and a subcorpus of five million words, i.e., Part 20 of the on-line 180 million word corpus CETEMPúblico (Footnote 4). The distributional and transformational properties of predicate nouns were used as the main criteria for establishing the classes of these predicates, namely the syntactic structure, the number, and the type of constituents. Each class is represented in the form of a table, i.e., a binary matrix whose rows correspond to predicate nouns and whose columns represent the lexical and syntactic properties that these nouns have (+) or do not have (−). In total, 19 lexical and syntactic subclasses have been identified, as mentioned above.
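The binary-matrix layout described above can be pictured with a minimal Python sketch. The column names are properties mentioned elsewhere in this paper, but the selection of columns and all +/- values are invented purely for illustration and are not data from Chacoto (2005).

```python
# Illustrative only: the +/- values below are invented, not taken from the
# actual lexicon-grammar tables.
COLUMNS = ["N0=:N-hum", "Vsup=:ter", "V", "Adj"]

# One row per predicate noun, one mark per column.
TABLE = {
    "sugestão": ["-", "-", "+", "-"],
    "espuma":   ["+", "-", "+", "+"],
}

def properties_of(noun, table=TABLE, columns=COLUMNS):
    """Return the column names marked '+' in the noun's row."""
    row = table[noun]
    return [col for col, mark in zip(columns, row) if mark == "+"]

def nouns_with(prop, table=TABLE, columns=COLUMNS):
    """Return the nouns whose row has '+' in the given column."""
    idx = columns.index(prop)
    return [noun for noun, row in table.items() if row[idx] == "+"]
```

Reading a row gives the properties of one noun, while reading a column gives the nouns that share a property, which is the kind of lookup a paraphrasing grammar would rely on.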

Table 1. Classes and distributional properties of predicate nouns with \(V_{sup}\) fazer (\(N_0\) fazer Det (Npred + C) W)

The predicate nouns in the lexicon are, in general, everyday vocabulary (simple and compound nouns), with the exception of a group of predicate nouns from the sports and medical domains. The properties represented in the lexicon-grammar tables for each predicate noun lead to the generation of a variety of paraphrases, involving nominal group formation (active, passive and relative nominal groups), restructuring of the dative, shifting of symmetrical nouns and symmetrical complements, conversion of the arguments, conversion into aspectual and stylistic variants of the support verb fazer, transformation of the support verb, and the relation between fazer and other support verbs. These properties establish semantic relationships between predicates, enabling new paraphrasing features that have been described in NooJ grammars. For the sake of brevity, we illustrate only the case of paraphrasing based on the shifting of symmetrical nouns (Table 2), which are used with the predicates aliar-se a (ally self with/to), casar com (marry with), or fazer um acordo com (make an agreement with). The paraphrase results from shifting the nouns from the position of subject to the position of indirect object and vice versa.

Table 2. Paraphrasing based on symmetrical nouns

4 Integration of Lexicon-Grammar Tables in Port4NooJ

The integration of the lexicon-grammar tables formalizing the properties of nominal predicates into Port4NooJ is a two-step process: first, one needs to create a dictionary from the lexicon-grammar and merge it with Port4NooJ; then, it is necessary to create grammars that make use of the syntactic and distributional information described in the lexicon-grammar to identify and relate equivalent predicate constructions. In our case, the second step aimed at expanding eSPERTo’s paraphrasing capabilities. Before describing the integration process, we start by mentioning our major challenges to date.

Given our prior experience integrating the lexicon-grammar tables of the human intransitive adjectives, we expected the integration of the lexicon-grammar tables for the predicate nouns with the support verb fazer to be straightforward. However, this was not the case, and we faced challenges at different levels.

The first one worth mentioning is that 65% of the predicate nouns already existed in Port4NooJ as nominalizations. These nominalizations derive from the corresponding verb, but did not contain syntactic and distributional information as detailed as that formalized in the lexicon-grammar tables. Nonetheless, these entries include relevant information that cannot be discarded, such as semantico-syntactic (SAL) information and English transfers. Another challenge faced in the integration process was that, in contrast with the lexicon-grammar tables of the human intransitive adjectives, which had the morphologically equivalent verb and adjective explicitly mentioned in the table, the lexicon-grammar tables under discussion simply indicate whether the predicate noun has morphologically equivalent predicates (the corresponding entry has a plus (+) sign), but neither the equivalent verb nor the equivalent adjective is explicit in the table. Instead, they are listed, together with additional information, in a cross-reference table given as an appendix. Not having all the information in a single place makes integration a more complex task. Additionally, there are 800 multiword predicate nouns composed of a noun and an adjective, such as transcrição integral (full transcript), whose morphological equivalent is a multiword construction [verb + adverb], such as transcrever integralmente (fully transcribe). These cases need further revision, as the equivalent constructions need to be treated in a slightly different way. Finally, the spelling of a few predicate nouns in the lexicon-grammar was updated to comply with the Portuguese Orthographic Agreement (Footnote 5), but Port4NooJ does not yet comply with this Agreement.

4.1 From Lexicon-Grammar Tables to NooJ Dictionaries

We were compelled to adjust and create new scripts to overcome those challenges, even if the general process to create the equivalent standalone dictionary remained identical. For each entry in a lexicon-grammar table we had to (i) convert the corresponding lexicon-grammar properties into NooJ dictionary attributes; (ii) either create a new dictionary entry with those attributes or add those properties to an existing Port4NooJ dictionary entry or entries; and (iii) add the new entries or the old entries merged with the lexicon-grammar properties to the standalone dictionary. Each table was converted into a dictionary separately, but then all dictionaries were combined in a single standalone dictionary.

Representation of Lexicon-Grammar Table Properties. Given a lexicon-grammar entry, the new dictionary entry lemma is the predicate noun formalized in the lexicon-grammar, its POS tag is N, and its inflection code (the FLX attribute) is looked up in Port4NooJ or assigned automatically in cases where the word does not yet exist in the dictionary. The entry also receives the following attributes: +Npred, which indicates that it is a predicate noun, +Vsup=fazer, which indicates that the support verb of the noun is fazer, and +Table=<name_of_the_LG_table>, whose value is the name of the lexicon-grammar table where its distributional and syntactic properties are described.
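A minimal sketch of how such a base entry could be assembled is given below, assuming the usual NooJ dictionary line layout (lemma, comma, POS tag, then attributes introduced by +). The function name, the table name, the inflection code and the fallback value are hypothetical placeholders; the order of the attributes is also an assumption, and this is not the script used in the project.

```python
def base_entry(noun, table_name, flx_codes, fallback_flx="TODO"):
    """Build the base dictionary line for a predicate noun.

    flx_codes is a hypothetical mapping from lemma to inflection code,
    standing in for the lookup in Port4NooJ; fallback_flx marks nouns
    whose code would have to be assigned by some other means.
    """
    flx = flx_codes.get(noun, fallback_flx)
    return f"{noun},N+FLX={flx}+Npred+Vsup=fazer+Table={table_name}"


# Hypothetical inflection code and table name, for illustration only.
example = base_entry("espuma", "TabX", {"espuma": "CASA"})
```

The properties marked + in the noun's row would then be appended to this base line as further attributes.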

Similar to what was done previously in Mota et al. (2016), for each different column in a lexicon-grammar table, a property +<name_of_prop> was created. If the noun row is marked with the value +, then that property was added to the noun entry, after removing or replacing special characters (e.g. Vsupter was generated from the LG property Vsup=:ter, and N0Nnhum was generated from N0=:N-hum), with the exception of the properties V and Adj, which indicate whether a noun has a morphologically-related verb and adjective, respectively. For these attributes, if the row has a + sign, instead of simply adding +V and +Adj attributes, a script looks up the noun in a file that lists the morphologically-related verbs and adjectives for each noun and tries to obtain a derivation between the noun and the verb or between the noun and the adjective. If the pair(s) noun/verb and noun/adjective exist, then a derivation paradigm is created and the following attributes are added to the entry: +DRV=N2V<drv_code1>:<flx_code1> and +DRV=N2A<drv_code2>:<flx_code2>, respectively. The drv code is determined and formalized automatically by finding the radical shared by the noun and the verb or adjective. For example, the noun espuma (foam) is associated with the corresponding verb espumar (turn into foam) and the corresponding adjective espumoso (foamy) through derivation rules (N2V2=r/V and A2V14=<B1>oso/A), which add an –r to the noun to create the verb, and replace the nominal ending –a with the adjectival ending –oso, respectively. As with the inflection code of the lemma, the inflection of the derived word (flx_code1 or flx_code2) is determined by consulting Port4NooJ (in this case, FLX=FALAR for the verb and FLX=ALTO for the adjective).
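The two conversions described above can be sketched in Python as follows. The character replacements are an assumption: they are simply one rule set consistent with the two examples given (Vsup=:ter and N0=:N-hum), and the actual scripts may handle other special characters differently. The derivation helper takes the longest common prefix of noun and derived word as the shared radical, then emits the number of characters to delete from the noun followed by the ending to add, using the <B1> notation of the espuma example. Both function names are hypothetical.

```python
import os

def normalize_property(lg_name):
    """Turn a lexicon-grammar column name into an attribute name.

    Assumed rules: drop '=:' and replace '-' with 'n'.
    """
    return lg_name.replace("=:", "").replace("-", "n")

def derivation_rule(noun, derived, pos):
    """Derive a rule body such as 'r/V' or '<B1>oso/A' from a word pair."""
    radical = os.path.commonprefix([noun, derived])
    n_delete = len(noun) - len(radical)   # characters to remove from the noun
    ending = derived[len(radical):]       # characters to append afterwards
    delete_op = f"<B{n_delete}>" if n_delete else ""
    return f"{delete_op}{ending}/{pos}"
```

Applied to the pairs espuma/espumar and espuma/espumoso, this reproduces the rule bodies r/V and <B1>oso/A quoted above; assigning the paradigm names and inflection codes is left out of the sketch.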

After all the attributes of a lexicon-grammar entry have been converted to NooJ format, we need to integrate the entry with the remaining Port4NooJ dictionaries.

Integration with eSPERTo Dictionary Entries. As previously mentioned, we had to use two dictionaries in order to properly integrate the lexicon-grammar entries into Port4NooJ: the current version, i.e., the version prior to integrating the lexicon-grammar entries, and the old version, i.e., the dictionary version that preceded the current one, before the entries marked with +Npred that derive from verbs were removed. In this way, even if the nominal predicate does not exist in the current version, it can still inherit the Port4NooJ properties as long as it is part of the older version. There are 608 nominal predicates that only exist in the older version. If the nominal predicate exists in both dictionaries (there are 840 nominal predicates in this situation) and the linguistic information is the same in both, the entry in the new standalone dictionary inherits that information. Otherwise, it inherits only the information that exists in the older dictionary, since the information in the current dictionary does not correspond to a nominal predicate; if it did, it would have been removed as well.

Given those two versions of the Portuguese dictionary, a new entry in the standalone dictionary of predicate nouns that co-occur with the support verb fazer is created in one of the following situations.

Scenario 1: if neither the predicate noun nor its spelling prior to the Orthographic Agreement (Footnote 6) exists in the Port4NooJ dictionary (neither in the current nor in the old version), then add the new entry created following the process described in the previous section as is. For example, the following entries did not exist in Port4NooJ and were created directly from the lexicon-grammar table:

figure a

Scenario 2: if the entry for the noun (or the compliant noun) (i) exists in the current version but not in the old version or (ii) is the same as in the old version, then merge the lexicon-grammar properties with the current entry, as in:

figure b

All information up to +Npred already existed, whereas the information after +Npred was obtained from the lexicon-grammar table.

Table 3. Distribution of nominal predicates with support verb fazer by table attribute after integration into Port4NooJ

Scenario 3: if the noun (or the compliant noun) (i) exists in the old version only or (ii) exists in both, but the current and old entries differ, then merge the lexicon-grammar properties with the old entry. For example, the noun cruzamento exists both in the current and old versions, but the entries differ. So, the lexicon-grammar attributes are only added to the old version, as the following 4 entries illustrate. The first three entries existed in the current version and the last entry only existed in the old version; consequently, that is the information to which the lexicon-grammar attributes were added:

figure c

Additionally, remove previous Npred-related properties, and then remove the corresponding nominalizations from the current version.

Finally, create inflectional (FLX) and derivational (DRV) codes and corresponding rules as needed, and check for missing inflectional (FLX) and derivational (DRV) codes.
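The choice among the three scenarios can be summarized with the following sketch. It assumes each dictionary version is a mapping from a lemma to its list of entry lines, and it deliberately omits the lookup of pre-Agreement spellings, the removal of earlier Npred-related properties, and the creation of FLX and DRV codes. All function and parameter names are hypothetical.

```python
def choose_scenario(noun, current, old):
    """Return 1, 2 or 3 following the three scenarios described above."""
    in_current, in_old = noun in current, noun in old
    if not in_current and not in_old:
        return 1  # unknown noun: keep the entry built from the table as is
    if in_current and (not in_old or current[noun] == old[noun]):
        return 2  # merge the table properties into the current entry
    return 3      # old version only, or the versions differ: use the old entry

def integrate(noun, new_entry, lg_attributes, current, old):
    """Return the entry lines to write to the standalone dictionary."""
    scenario = choose_scenario(noun, current, old)
    if scenario == 1:
        return [new_entry]
    base = current[noun] if scenario == 2 else old[noun]
    # Append the lexicon-grammar attributes to every inherited entry line.
    return [entry + lg_attributes for entry in base]
```

Keeping the scenario decision separate from the merge makes it easy to count how many nouns fall into each case before any entry is rewritten.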

We started by integrating 18 of the 19 tables (Footnote 7). The first attempt to convert the tables into a standalone Port4NooJ dictionary resulted in 6,205 nominal entries, corresponding to 1,918 different noun lemmas. Most entries (55%) already existed in Port4NooJ, so the remaining new entries correspond to an increase of about 7% over the 11,719 different noun lemmas that existed in the previous version of Port4NooJ (i.e., before removing entries corresponding to nominal predicates that are nominalizations). In terms of predicate nouns only, there was an increase of 25%. In addition, 332 new derivational paradigms were automatically created. Table 3 shows the distribution of nominal predicates in the lemma dictionary by table attribute, sorted by the most frequent table assigned.
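As a rough consistency check on these figures, the short computation below assumes that the 55% share applies to the 1,918 noun lemmas, so that the remaining 45% are new lemmas. That reading is an assumption, since the text does not state whether the percentage is counted over entries or over lemmas.

```python
lemmas_from_tables = 1918      # different noun lemmas obtained from the tables
share_already_present = 0.55   # share that already existed in Port4NooJ
previous_noun_lemmas = 11719   # noun lemmas in the previous Port4NooJ version

# Assumption: the 55% is measured over lemmas rather than over entries.
new_lemmas = lemmas_from_tables * (1 - share_already_present)
relative_increase = new_lemmas / previous_noun_lemmas

print(f"{new_lemmas:.0f} new lemmas, {relative_increase:.1%} increase")
```

Under that assumption the result is a little over 7%, in line with the increase reported above.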

Fig. 1. Grammar to identify and paraphrase constructions that allow symmetry

4.2 From Lexicon-Grammar Tables to NooJ Grammars

Although the preliminary version of the standalone dictionary still needs to be reviewed, we have started using the information in the lexicon-grammar tables of these constructions for paraphrasing. In particular, we started by addressing the following issues: (i) Port4NooJ grammars already paraphrase support verb constructions (for example, paraphrasing the verbal construction with the nominal construction and vice versa, or alternating between the support verb and other stylistic or aspectual verbs); these grammars are being updated with attributes from the new tables and are also being extended to take into account other paraphrases involving this type of construction; (ii) the grammars that paraphrase active and passive constructions were only paraphrasing sentences involving a verbal predicate; we are modifying these grammars so that they can also paraphrase nominal predicates in sentences and noun groups.

In addition, we started developing new grammars to paraphrase equivalent constructions based on specific properties formalized in those lexicon-grammar tables. One such grammar, shown in Fig. 1, paraphrases symmetric predicates such as those illustrated in Table 2. Applied to the sentences O homem apostou com a mulher (The man bet with the woman) and O homem fez uma aposta com a mulher (The man made a bet with the woman), the grammar in Fig. 1 produces the paraphrase O homem e a mulher fizeram uma aposta (The man and the woman made a bet).
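The effect of this symmetry paraphrase on the two example sentences can be imitated with a toy Python sketch based on regular expressions. It is not the NooJ grammar of Fig. 1: the two small lexicons are hypothetical stand-ins for the dictionary attributes, only third-person singular past forms are covered, and no agreement or morphological analysis is performed.

```python
import re

# Hypothetical mini-lexicon: symmetric predicate nouns and their determiner.
SYMMETRIC_NPRED = {"aposta": "uma", "acordo": "um"}
# Hypothetical map from an inflected verb form to its predicate noun.
VERB_TO_NPRED = {"apostou": "aposta"}

NOMINAL = re.compile(r"^(?P<n0>.+?) fez (?:uma|um) (?P<npred>\w+) com (?P<n1>.+)$")
VERBAL = re.compile(r"^(?P<n0>.+?) (?P<verb>\w+) com (?P<n1>.+)$")

def symmetric_paraphrase(sentence):
    """Move the 'com' complement into a coordinated subject, or return None."""
    text = sentence.rstrip(".")
    match = NOMINAL.match(text)
    if match:
        npred = match.group("npred")
    else:
        match = VERBAL.match(text)
        if not match:
            return None
        npred = VERB_TO_NPRED.get(match.group("verb"))
    if npred not in SYMMETRIC_NPRED:
        return None  # not known to be a symmetric predicate
    det = SYMMETRIC_NPRED[npred]
    return f"{match.group('n0')} e {match.group('n1')} fizeram {det} {npred}"
```

Both the verbal and the nominal input sentence from the example above are mapped to the same coordinated-subject paraphrase, while sentences whose predicate is not listed as symmetric are left without a paraphrase.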

5 Conclusions and Future Work

This paper described the ongoing process and preliminary results of integrating the lexicon-grammar that formalizes the properties of predicate noun constructions with the support verb fazer. This process will only be complete after consolidation of the linguistic information in Port4NooJ dictionaries and grammars. Although we already had scripts to generate a standalone dictionary from the lexicon-grammar of adjectives, these scripts had to be modified to generate a new standalone dictionary from the lexicon-grammar of predicate noun constructions, as the challenges faced during this integration were different from the ones encountered during the integration of the adjectives. We will continue this integration work by creating all the grammars necessary to process the constructions formalized in the lexicon-grammar tables in question. After completion of this integration task, we intend to revise and evaluate the new resources in distinct applications.

In the near future, we plan to continue integrating and adapting additional lexicon-grammars, such as that of the constructions with the support verb ser de (be of), as in ser de uma ajuda inestimável (be of invaluable help), formalized by Baptista (2005), as these are a rich resource for paraphrasing in Portuguese.