
1 Introduction

This work describes a computational model of nominal ellipsis in Spanish. We start from the theoretical assumptions of generative grammar, following studies by Merchant [5]; Sag and Hankamer [6]; Culicover and Jackendoff [7]; and, in particular, Saab [1, 2]. Specifically, the paper describes an algorithm developed to automatically recognize ellipses and restore the elided element in natural language texts. Ellipsis is a grammatical mechanism that avoids lexical redundancy [8]: an element of the clause is not pronounced (i.e., it is elided) under certain syntactic restrictions. In this regard, Saab [1] and Merchant [9] point out that the silenced element is recognized through a structural identity relationship with a preceding element, which makes it possible to identify the elided element and its position inside the clause.

This phenomenon has been addressed from different perspectives, including syntactic [5, 10, 11, 12, 13, 14] and semantic [6, 9] approaches, as well as within the frameworks of Head-Driven Phrase Structure Grammar [15] and Simpler Syntax [16]. Within computational linguistics, the works of Hardt [17]; Mitkov [18]; Nielsen [19]; Rello, Baeza-Yates, and Mitkov [20]; and McShane and Babkin [21] have analyzed ellipsis via corpus-labeling methodologies that identify and retrieve the elided element. However, those proposals tend not to include implicit syntactic mechanisms. To address this gap, this work formalizes nominal ellipsis based on the processing of morphosyntactic information. The contribution to the literature is twofold: on the one hand, we provide an efficient algorithm for applied computational linguistics tasks; on the other, we develop a tool to evaluate the reach of theoretical generative studies as applied to nominal ellipsis.

For such purposes, we propose a formal description of ellipsis, modeled using an electronic dictionary under NooJ syntactic grammars, and tested on journalistic texts extracted from the Internet. The results obtained (100% precision; 82% coverage; 90.10% F-measure) show that the developed algorithm is useful for the recognition of ellipsis in natural language texts, while also showing that the generative theoretical proposals are adequate for the analysis of this phenomenon.

The article is organized as follows: first, we present the general context of ellipsis; second, we discuss the concepts relevant to the theory of Identity [1, 2], with special focus on aspects relevant to ellipsis; third, we describe the data processing tasks that we performed, considering the syntactic structure of determiner phrases (DP) in Spanish; and fourth, we present our results and conclusions.

2 Ellipsis: Definition and Types

Ellipsis is a syntactic mechanism through which, under certain structural particularities, an element is not pronounced (that is, it is elided) [1, 2, 5, 22]. In this regard, two types can be recognized: nominal ellipsis, in which a noun is elided (1a); and verbal ellipsis, in which a verb is elided (1b):

figure a

This mechanism acts at different syntactic levels. Thus, Spanish nominal ellipsis is a nuclear ellipsis of a DP, whereas Spanish verbal ellipsis involves mechanisms of sentence ellipsis. It is important to clarify that nominal ellipses occur not only between coordinated DPs (2a), but also as part of a sentence ellipsis (2b-c). Here, the verbal assembling domains expand the locality between the preceding element (in bold) and its place in the phrase in which it is elided (strikethrough). The types of verbal ellipsis are shown in (3). Because verbal ellipsis is outside the scope of this work, only the verbal types that appear in the different studies are mentioned.

figure b
figure c

One of the theories that best explains nominal (2) and verbal (3) ellipses in Spanish is that of Saab [1, 2], which is based on Distributed Morphology [DM, 23]. According to this theory, ellipsis is defined as a mechanism of non-insertion of lexical features in which, during the derivation, a particular feature, designated [I], blocks the insertion of phonological features. The assignment of this feature [I] is the product of a transformational operation called Identity, whose result is the non-pronunciation (i.e., silencing) of the elements that take part in the syntactic operation. Saab [1] formally defines the ellipsis domain as follows:

  1. (4)

    a. An abstract morpheme α is identical to the abstract morpheme β if and only if α and β match all their morphosyntactic and semantic features.

    b. An A root is identical to a B root if and only if A and B share the same index.

This definition (4) establishes three contexts in which ellipsis intervenes: (i) cases of partial identity with grammatical results; (ii) cases of partial identity with ungrammatical results; and (iii) cases in which the identity is total, but the result is ungrammatical. This work considered only the first two, which correspond to the features of number (5) and gender (6) in Spanish, respectively.

  1. (5)

    a. Antonio quiere más a sus gatos que al gato de Pedro

    [Antonio loves his own cats more than Pedro’s cat.]

    b. Antonio quiere más a su gato que a los gatos de Pedro

    [Antonio loves his own cat more than Pedro’s cats.]

  2. (6)

    a. *Antonio quiere más a su gato que a la gata de Pedro

    [Antonio loves his own cat more than Pedro’s [female] cat.]

    b. Antonio quiere más a sus gatas que al gato de Pedro

    [Antonio loves his [female] cats more than Pedro’s cat.]

For Saab [1, 2], the ungrammaticality of the examples in (6) arises because, in Spanish, gender features are a property of the morphological root: the root combines with the gender feature first, and only then with the number feature. Therefore, for that author, number is a functional category that does not intervene in the domain of nominal ellipsis in Spanish, as shown in (7).

(7)

figure d
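
As an illustration only, the asymmetry between number and gender can be sketched in Python. The sketch assumes a simplified reading of (4) in which a noun is reduced to a root index, a gender value, and a number value. The class `Noun`, the function `licenses_ellipsis`, and the index value are hypothetical names introduced here; they are not part of Saab's formalization or of the NooJ implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Noun:
    root_index: int  # hypothetical index identifying the root, as in (4b)
    gender: str      # "m" or "f"; treated here as a property of the root
    number: str      # "sg" or "pl"; treated here as a functional category


def licenses_ellipsis(antecedent: Noun, target: Noun) -> bool:
    """Return True if `target` may be elided under identity with `antecedent`.

    Assumption of this sketch: root index and gender must match,
    while number is deliberately left out of the comparison.
    """
    return (antecedent.root_index == target.root_index
            and antecedent.gender == target.gender)


GATO = 1  # hypothetical root index for 'gat-'

# (5a): gatos ... gato, a number mismatch only
print(licenses_ellipsis(Noun(GATO, "m", "pl"), Noun(GATO, "m", "sg")))  # True
# (6a): gato ... gata, a gender mismatch
print(licenses_ellipsis(Noun(GATO, "m", "sg"), Noun(GATO, "f", "sg")))  # False
```

Leaving `number` out of the comparison is the design choice that mirrors the claim that number does not intervene in the domain of nominal ellipsis.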

3 The Formalization of Syntactic Information

This work adopts Abney’s [24] DP hypothesis, in which the determiner is a functional category that can take a noun phrase as a complement; that is, the phrase can be represented as a constituent structure with a particular internal organization, such as the one in (8).

figure e

In Spanish, the determiner occupies the highest position, which allows the ellipsis of N in the lowest position. Furthermore, structurally speaking, the determiner is the nucleus of the phrase, which is why it cannot be omitted (9a and 10b); it is also a mechanism for recognizing the elided element, since it must agree with the elided N. In (10), the elided element is indicated with strikethrough text.

  1. (9)

    a. *guitarra roja

    [red guitar]

    b. La guitarra roja

    [the red guitar]

figure f

Spanish data were analyzed in terms of the constituent organization of the determiner phrase (8), which yielded three types of syntactic structures with nuclear ellipses (N): coordination between determiner phrases (11a), coordination of clauses (12a), and predicative argument (13a). These three syntactic structures were formalized as shown in (11b), (12b), and (13c), respectively. The elided element is indicated with strikethrough text.

figure g

These structures explain the phenomenon of ellipsis as a mechanism that involves two procedures: elision (in the case of production) and identification (in the case of comprehension).

3.1 Rules for Ellipsis: Elision and Identification

The elision procedure defines how a nominal ellipsis is produced, relating the syntactic and surface structures of Spanish (14).

  1. (14)

    a. Given a coordination A^B, where A and B are DPs, B has the same structure as A, and the root of the NP nucleus of B is identical to that of A, elide the NP nucleus of B.

    b. Given a Predicate-Argument Structure (PAS) A with arguments α, β, γ…, where an argument to the right of α has the same structure as α and contains an NP whose nucleus root is identical to that of the NP in α, elide the nucleus of that NP.
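
The coordination case of the elision rule (case a above) can be sketched as follows. This is a minimal illustration under stated assumptions, not the NooJ grammar: the `DP` class, the helper names, the reduction of "same structure" to an equal number of modifiers, and the example phrase with *azul* are all hypothetical choices made for the sketch.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class DP:
    det: str
    noun: Optional[str]            # surface form of the nucleus; None once elided
    root: str                      # hypothetical root label, e.g. "guitarr-"
    modifiers: Tuple[str, ...] = ()


def same_structure(a: DP, b: DP) -> bool:
    # Simplifying assumption: two DPs have equal structure
    # when they carry the same number of modifiers.
    return len(a.modifiers) == len(b.modifiers)


def elide_in_coordination(a: DP, b: DP) -> DP:
    """Return B with its nucleus elided if A and B match; otherwise return B unchanged."""
    if same_structure(a, b) and a.root == b.root:
        return replace(b, noun=None)
    return b


def surface(dp: DP) -> str:
    # Spell out the DP, skipping the nucleus when it has been elided.
    parts = [dp.det] + ([dp.noun] if dp.noun else []) + list(dp.modifiers)
    return " ".join(parts)


a = DP("la", "guitarra", "guitarr-", ("roja",))
b = DP("la", "guitarra", "guitarr-", ("azul",))
print(surface(a), "y", surface(elide_in_coordination(a, b)))
# la guitarra roja y la azul
```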

On the other hand, identification describes the comprehension of the elliptical phenomenon, specifically, how the syntactic elements missing in the grammatical structure are computationally recognized (15).

  1. (15)

    a. Given a coordination A^B, where B is a DP with the same structure as A and the NP nucleus of B is elided:

    • Copy the root and the gender of the NP nucleus of A

    • Copy the number of the Det of B

    b. Given a PAS A: P(α, β, γ), with an argument to the right of α whose NP nucleus is elided:

    • Copy the root of the NP nucleus of α

    • Copy the number of the determiner of β or γ (according to which argument contains the elision)
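
The feature copying in the coordination case of the identification rule (case a above) can be illustrated with a small Python sketch. The inflection table `FORMS`, the `DP` class, and the function `identify` are hypothetical names assumed for the example; the actual implementation relies on the NooJ dictionary and grammars described in Section 3.2.

```python
from dataclasses import dataclass
from typing import Optional

# Toy inflection table: (root, gender, number) -> surface form.
# Hypothetical lexicon used only for this sketch.
FORMS = {
    ("gat", "m", "sg"): "gato",
    ("gat", "m", "pl"): "gatos",
    ("gat", "f", "sg"): "gata",
    ("gat", "f", "pl"): "gatas",
}


@dataclass(frozen=True)
class DP:
    det_number: str         # number carried by the determiner: "sg" or "pl"
    root: Optional[str]     # root of the NP nucleus; None if the nucleus is elided
    gender: Optional[str]   # gender of the NP nucleus; None if the nucleus is elided


def identify(antecedent: DP, gapped: DP) -> str:
    """Recover the elided noun: root and gender from A, number from the Det of B."""
    root = antecedent.root
    gender = antecedent.gender
    number = gapped.det_number
    return FORMS[(root, gender, number)]


# (5a): 'sus gatos' ... 'el __ de Pedro'
a = DP("pl", "gat", "m")
b = DP("sg", None, None)
print(identify(a, b))  # gato
```

Taking the number from the determiner of B rather than from A is what lets the sketch recover the singular form in (5a) even though the preceding noun is plural.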

This work computationally modeled (14) and (15). The creation of resources in NooJ is described below.

3.2 Computational Modeling in NooJ

The computational implementation used a general Spanish language dictionary [25] containing 72,593 entries. Lemmas corresponding to nouns, adjectives, and verbs were associated with inflectional paradigm grammars.

For the identification and replacement of the elided element in the nominal ellipsis, the following syntactic grammar was created.

Fig. 1. Grammar for the identification and replacement of nominal ellipsis in Spanish.

The numbers in Fig. 1 indicate the types of ellipsis found in the corpus: [1] marks cases of predication, and [2] and [3] mark coordination between determiner phrases.

The variables, shown in parentheses, embed grammars that encode syntactic restrictions, such as the gender and number agreement that is typical of Spanish and is preserved in nominal ellipsis [1]. Thus, the variable PRENOM (see Fig. 2) specifies the restrictions on the categories of the determiner and its agreement with the adjective.

Fig. 2. Grammar embedded for determiner restrictions.

The POSTN variable specifies the restriction on the adjective with regard to its preceding N (see Fig. 3).

Fig. 3. Grammar embedded for adjective restrictions.

Finally, some grammars can be recursive (see Fig. 4). The SP grammar describes the variation in the prepositional phrases found in the corpus.

Fig. 4. Recursive grammars for phrase variation.

The replacement instruction rewrites the sequence of variables $PRENOM $NOM [1, 2, or 3] ($PREP) $DET and then inserts $NOM again, inflected to agree in number with $DET. When a [1] $CONTR sequence appears, $NOM is inserted in the singular.
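
The effect of this rewriting on running text can be approximated outside NooJ with a short Python sketch. It is only an approximation under stated assumptions: the mini-lexicon, the set of contractions, and the heuristic that a determiner or contraction immediately followed by *de* signals an elided noun are hypothetical simplifications introduced here, and the function `restore` is not part of the original grammar.

```python
# Hypothetical mini-lexicon: surface form -> (root, gender, number)
NOUNS = {"gato": ("gat", "m", "sg"), "gatos": ("gat", "m", "pl")}
FORMS = {features: form for form, features in NOUNS.items()}
DETS = {"el": "sg", "la": "sg", "los": "pl", "las": "pl", "su": "sg", "sus": "pl"}
CONTRACTIONS = {"al", "del"}  # a + el, de + el: always singular


def restore(tokens):
    """Re-insert an elided noun after a determiner or contraction followed by 'de'."""
    out, antecedent = [], None
    for i, tok in enumerate(tokens):
        out.append(tok)
        if tok in NOUNS:
            antecedent = NOUNS[tok]  # remember the most recent overt noun
            continue
        # Contractions force the singular; other determiners carry their own number.
        number = "sg" if tok in CONTRACTIONS else DETS.get(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if number and antecedent and nxt == "de":
            root, gender, _ = antecedent
            out.append(FORMS[(root, gender, number)])
    return out


sentence = "Antonio quiere más a sus gatos que al de Pedro"
print(" ".join(restore(sentence.split())))
# Antonio quiere más a sus gatos que al gato de Pedro
```

The contraction branch mirrors the special case described above, in which the noun is restored in the singular regardless of the number of the preceding noun.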

By applying the grammar, it is possible to annotate, in the text annotation structure (TAS), a corpus sequence that matches a formally described grammatical sequence (see Fig. 5).

Fig. 5. Annotation of a grammatical sequence of the corpus.

Furthermore, the implemented grammar recognizes, identifies, and replaces the elided element in the identified grammatical structure (see Fig. 6).

Fig. 6. Results of the automatic recognition of nominal ellipses.

4 Results

Table 1 summarizes the main results of the computational implementation on the collected texts, a small corpus of 5,000 words containing 100 sentences with ellipsis.

Table 1. Results obtained

As can be observed, the algorithm produced no incorrect labels. Moreover, although there were coverage problems caused by lexical units not listed in the electronic dictionary, the results show that the syntactic restrictions that structure nominal ellipsis are useful for its automatic identification in natural language texts. We also note that, in addition to its applications in syntactic recognition for optimizing results in computational linguistics [26], the algorithm is also a tool that provides information on grammatical knowledge beyond the numerical data.
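
For reference, the reported F-measure can be checked against the reported precision and coverage. The sketch below assumes that coverage plays the role of recall and that the F-measure is the balanced harmonic mean (F1); both are assumptions of this illustration, and the function name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values: 100% precision and 82% coverage (treated here as recall)
print(f"{f_measure(1.00, 0.82):.4f}")  # 0.9011
```

Under these assumptions the result is approximately 0.901, which is in line with the reported 90.10%.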

5 Closing Remarks

This paper presented a computational model of nominal ellipsis in Spanish, based on a generative grammar proposal [2] and implemented with an electronic dictionary and morphological grammars. The resulting algorithm achieved high precision, coverage, and F-measure. This suggests that the descriptions stemming from formal studies constitute an adequate basis for building computational devices. Despite these promising initial results, the small size of the corpus is a limitation that merits further research.

Furthermore, in seeking to improve the ability of such models to resolve complex grammar when analyzing ellipses, future work will include comparative structures, such as (16), and those with preceding elements located in peripheral positions of the sentence, such as (17):

figure h

Lastly, we expect to include nouns in multi-word structures, as in (18).

figure i

For the latter, the complexity of the electronic dictionary will be increased with multi-word expressions that can also be linked to syntactic grammars.