
1 Introduction

Natural language inference (NLI) is the problem of determining whether a natural language hypothesis h can reasonably be inferred from a given premise p. For example:

  (1)

    p: Every firm polled saw costs grow more than expected, even after adjusting for inflation.

    h: Every big company in the poll reported cost increases.

A capacity for open-domain NLI is clearly necessary for full natural language understanding, and NLI can also enable more immediate applications, such as semantic search and question answering. Consequently, NLI has been the focus of intense research effort in recent years, centered around the annual Recognizing Textual Entailment (RTE) competition (Dagan et al. 2006).

For a semanticist, the most obvious approach to NLI relies on full semantic interpretation: first, translate p and h into some formal meaning representation, such as first-order logic (FOL), and then apply automated reasoning tools to determine inferential validity. While the formal approach can succeed in restricted domains, it struggles with open-domain NLI tasks such as RTE. For example, the FOL-based system of Bos and Markert (2005) was able to find a proof for less than 4% of the problems in the RTE1 test set. The difficulty is plain: truly natural language is fiendishly complex. The formal approach faces countless thorny problems: idioms, ellipsis, paraphrase, ambiguity, vagueness, lexical semantics, the impact of pragmatics, and so on. Consider for a moment the difficulty of fully and accurately translating example (1) to a formal meaning representation.

Yet example (1) also demonstrates that full semantic interpretation is often not necessary for determining inferential validity. To date, the most successful NLI systems have relied on surface representations and approximate measures of lexical and syntactic similarity to ascertain whether p subsumes h (Glickman et al. 2005; MacCartney et al. 2006; Hickl et al. 2006). However, these approaches face a different problem: they lack the precision needed to properly handle such commonplace phenomena as negation, antonymy, downward-monotone quantifiers, non-factive contexts, and the like. For example, if every were replaced by some or most throughout (1), the lexical and syntactic similarity of h to p would be unaffected, yet the inference would be rendered invalid.

In this paper, we explore a middle way, by developing a model of what Lakoff (1970) called natural logic, which characterizes valid patterns of inference in terms of syntactic forms which are as close as possible to surface forms. For example, the natural logic approach might sanction (1) by observing that: in ordinary upward monotone contexts, deleting modifiers preserves truth; in downward monotone contexts, inserting modifiers preserves truth; and every is downward monotone in its restrictor NP. Natural logic thus achieves the semantic precision needed to handle inferences like (1), while sidestepping the difficulties of full semantic interpretation.

The natural logic approach has a very long history,Footnote 1 originating in the syllogisms of Aristotle (which can be seen as patterns for natural language inference) and continuing through the medieval scholastics and the work of Leibniz. It was revived in recent times by van Benthem (1988, 1991) and Sánchez Valencia (1991), whose monotonicity calculus explains inferences involving semantic containment and inversions of monotonicity, even when nested, as in Nobody can enter without a valid passport ⊨ Nobody can enter without a passport. However, because the monotonicity calculus lacks any representation of semantic exclusion, it fails to license many simple inferences, such as Stimpy is a cat ⊨ Stimpy is not a poodle.

Another model which arguably belongs to the natural logic tradition (though not presented as such) was developed by Nairn et al. (2006) to explain inferences involving implicatives and factives, even when negated or nested, as in Ed did not forget to force Dave to leave ⊨ Dave left. While the model bears some resemblance to the monotonicity calculus, it does not incorporate semantic containment or explain interactions between implicatives and monotonicity, and thus fails to license inferences such as John refused to dance ⊨ John didn’t tango.

We propose a new model of natural logic which extends the monotonicity calculus to incorporate semantic exclusion, and partly unifies it with Nairn et al.’s account of implicatives. We first define an inventory of basic entailment relations which includes representations of both containment and exclusion (Sect. 2). We then describe a general method for establishing the entailment relation between a premise p and a hypothesis h. Given a sequence of atomic edits which transforms p into h, we determine the lexical entailment relation generated by each edit (Sect. 4); project each lexical entailment relation into an atomic entailment relation, according to properties of the context in which the edit occurs (Sect. 5); and join atomic entailment relations across the edit sequence (Sect. 3). We have previously presented an implemented system based on this model (MacCartney and Manning 2008); here we offer a detailed account of its theoretical foundations.

2 An Inventory of Entailment Relations

The simplest formulation of the NLI task is as a binary decision problem: the relation between p and h is to be classified as either entailment (p ⊨ h) or non-entailment (\(p \not\models h\)). The three-way formulation refines this by dividing non-entailment into contradiction (p ⊨ ¬h) and compatibility (\(p \not\models h \wedge p \not\models \neg h\)).Footnote 2 The monotonicity calculus carves things up differently: it interprets entailment as a semantic containment relation ⊑ analogous to the set containment relation ⊆, and thus permits us to distinguish forward entailment (p ⊑ h) from reverse entailment (p ⊒ h). Moreover, it defines ⊑ for expressions of every semantic type, including not only complete sentences but also individual words and phrases. Unlike the three-way formulation, however, it lacks any way to represent contradiction (semantic exclusion). For our model, we want the best of both worlds: a comprehensive inventory of entailment relations that includes representations of both semantic containment and semantic exclusion.

Following Sánchez Valencia, we proceed by analogy with set relations. In a universe U, the set of ordered pairs 〈x,y〉 of subsets of U can be partitioned into 16 equivalence classes, according to whether each of the four sets \(x \cap y\), \(x \cap \overline {y}\), \(\overline {x} \cap y\), and \(\overline {x} \cap \overline {y}\) is empty or non-empty.Footnote 3 Of these 16 classes, nine represent degenerate cases in which either x or y is either empty or universal. Since expressions having empty denotations (e.g., round square cupola) or universal denotations (e.g., exists) fail to divide the world into meaningful categories, they can be regarded as semantically vacuous. Contradictions and tautologies may be common in logic textbooks, but they are rare in everyday speech. Thus, in a practical model of informal natural language inference, we will rarely go wrong by assuming the non-vacuity of the expressions we encounter.Footnote 4 We therefore focus on the remaining seven classes, which we designate as the set \(\mathfrak{B}\) of basic entailment relations, shown in Table 1.

Table 1 The set \(\mathfrak{B}\) of seven basic entailment relations

First, the semantic containment relations (⊑ and ⊒) of the monotonicity calculus are preserved, but are factored into three mutually exclusive relations: equivalence (≡), (strict) forward entailment (⊏), and (strict) reverse entailment (⊐). Next, we have two relations expressing semantic exclusion: negation (^), or exhaustive exclusion, which is analogous to set complement; and alternation (|), or non-exhaustive exclusion. The next relation is cover (\(\mathrel{\smallsmile}\)), or non-exclusive exhaustion. Though its utility is not immediately obvious, it is the dual under negation of the alternation relation.Footnote 5 Finally, the independence relation (#) covers all other cases: it expresses non-equivalence, non-containment, non-exclusion, and non-exhaustion. Note that # is the least informative relation, in that it places the fewest constraints on its arguments.Footnote 6

Following Sánchez Valencia, we define the relations in \(\mathfrak{B}\) for all semantic types. For semantic types which can be interpreted as characteristic functions of sets,Footnote 7 the set-theoretic definitions can be applied directly. The definitions can then be extended to other types by interpreting each type as if it were a type of set. For example, propositions can be understood (per Montague) as denoting sets of possible worlds. Thus two propositions stand in the | relation iff there is no world where both hold (but there is some world where neither holds). Likewise, names can be interpreted as denoting singleton sets, with the result that two names stand in the ≡ relation iff they refer to the same entity, or the | relation otherwise.

By design, the relations in \(\mathfrak{B}\) are mutually exclusive, so that we can define a function β(x,y) which maps every ordered pair of expressionsFootnote 8 to the unique relation in \(\mathfrak{B}\) to which it belongs.
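The set-theoretic definitions of the seven basic relations can be coded directly. The following is a minimal sketch over finite sets, with the relation symbols represented as strings (using ^ for negation and ⌣ for cover); it assumes, per the non-vacuity discussion above, that neither argument is empty or universal.

```python
def beta(x, y, U):
    """Classify a pair of non-vacuous sets (x, y) within universe U into
    one of the seven basic entailment relations of Table 1, using the
    set-theoretic definitions directly."""
    x, y, U = frozenset(x), frozenset(y), frozenset(U)
    if x == y:
        return '≡'          # equivalence
    if x < y:
        return '⊏'          # (strict) forward entailment
    if x > y:
        return '⊐'          # (strict) reverse entailment
    disjoint = not (x & y)
    exhaustive = (x | y) == U
    if disjoint and exhaustive:
        return '^'          # negation: exhaustive exclusion
    if disjoint:
        return '|'          # alternation: non-exhaustive exclusion
    if exhaustive:
        return '⌣'          # cover: non-exclusive exhaustion
    return '#'              # independence

U = {1, 2, 3, 4}
print(beta({1, 2}, {3, 4}, U))   # ^  (set complement)
print(beta({1}, {3}, U))         # |  (disjoint but not exhaustive)
print(beta({1}, {1, 2}, U))      # ⊏
```

Because the seven cases are mutually exclusive and jointly exhaustive over non-vacuous pairs, the function is total, mirroring the definition of β in the text.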

3 Joining Entailment Relations

If we know that entailment relation R holds between x and y, and that entailment relation S holds between y and z, then what is the entailment relation between x and z? The join of entailment relations R and S, which we denote R ⋈ S,Footnote 9 is defined by:

$$R \Join S \stackrel{\text{def}}{=} \{{ \langle x, z \rangle } : \exists y\ ({ \langle x, y \rangle } \in R \wedge { \langle y, z \rangle } \in S) \} $$

Some joins are quite intuitive. For example, it is immediately clear that ⊏ ⋈ ⊏ = ⊏, ⊐ ⋈ ⊐ = ⊐, ^ ⋈ ^ = ≡, and for any R, (R ⋈ ≡) = (≡ ⋈ R) = R. Other joins are less obvious, but still accessible to intuition. For example, | ⋈ ^ = ⊏. This can be seen with the aid of Venn diagrams, or by considering simple examples: fish | human and human ^ nonhuman, thus fish ⊏ nonhuman.

But we soon stumble upon an inconvenient truth: not every join yields a relation in \(\mathfrak{B}\). For example, if x|y and y|z, the relation between x and z is not determined. They could be equivalent, or one might contain the other. They might be independent or alternative. All we can say for sure is that they are not exhaustive (since both are disjoint from y). Thus, the result of joining | and | is not a relation in \(\mathfrak{B}\), but a union of such relations, specifically ⋃{≡,⊏,⊐,|,#}.Footnote 10
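These joins can be verified mechanically by brute-force enumeration over a small universe, per the definition of ⋈ above. The sketch below (relation symbols as strings, ^ for negation, ⌣ for cover) collects every basic relation β(x, z) witnessed by some intermediate y; a four-element universe happens to suffice to exhibit all five members of the union produced by | ⋈ |.

```python
from itertools import combinations

U = frozenset(range(4))
# Non-vacuous denotations: every subset except the empty and universal sets.
DENOTS = [frozenset(c) for r in range(1, len(U))
          for c in combinations(sorted(U), r)]

def beta(x, y):
    if x == y: return '≡'
    if x < y:  return '⊏'
    if x > y:  return '⊐'
    disjoint, exhaustive = not (x & y), (x | y) == U
    if disjoint and exhaustive: return '^'
    if disjoint:                return '|'
    if exhaustive:              return '⌣'
    return '#'

def join(R, S):
    """R ⋈ S by enumeration: every relation beta(x, z) witnessed by some y
    with beta(x, y) = R and beta(y, z) = S."""
    return {beta(x, z)
            for x in DENOTS for y in DENOTS for z in DENOTS
            if beta(x, y) == R and beta(y, z) == S}

print(join('^', '^') == {'≡'})                      # True
print(join('|', '^') == {'⊏'})                      # True
print(join('|', '|') == {'≡', '⊏', '⊐', '|', '#'})  # True
```

The last line confirms the claim above: joining | with | yields not a basic relation but the union ⋃{≡,⊏,⊐,|,#} (and, in particular, never ^ or ⌣, since x and z are both confined to the complement of y).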

We will refer to (non-trivial) unions of relations in \(\mathfrak{B}\) as union relations.Footnote 11 Of the 49 possible joins of relations in \(\mathfrak{B}\), 32 yield a relation in \(\mathfrak{B}\), while 17 yield a union relation, with larger unions conveying less information. Union relations can be further joined, and we can establish that the smallest set of relations which contains \(\mathfrak{B}\) and is closed under joining contains just 16 relations.Footnote 12 One of these is the total relation, which contains all pairs of (non-vacuous) expressions. This relation, which we denote •, is the black hole of entailment relations, in the sense that (a) it conveys zero information about pairs of expressions which belong to it, and (b) joining a chain of entailment relations will, if it contains any noise and is of sufficient length, lead inescapably to •.Footnote 13 This tendency of joining to devolve toward less-informative entailment relations places an important limitation on the power of the inference method described in Sect. 7.

A complete join table for relations in \(\mathfrak{B}\) is shown in Table 2.Footnote 14

Table 2 The join table for the basic entailment relations

In an implemented model, the complexity introduced by union relations is easily tamed. Every union relation which results from joining relations in \(\mathfrak{B}\) contains #, and thus can safely be approximated by #. After all, # is already the least informative relation in \(\mathfrak{B}\)—loosely speaking, it indicates ignorance of the relationship between two expressions—and further joining will never serve to strengthen it. Our implemented model therefore has no need to represent union relations.
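An implemented join can therefore be a finite lookup table with # as the fallback. The sketch below is partial, covering only the joins worked out in this article (a full implementation would tabulate all 32 basic cases of Table 2); any pair not listed is approximated by #, which is safe for the union relations as just argued.

```python
# Partial join table over the basic relations ('^' = negation, '⌣' = cover);
# only the joins discussed in this article are listed.
BASIC_JOINS = {
    ('⊏', '⊏'): '⊏', ('⊐', '⊐'): '⊐', ('^', '^'): '≡',
    ('|', '^'): '⊏', ('^', '|'): '⊐', ('⊏', '^'): '|',
    ('|', '⊐'): '|',
}

def join(R, S):
    """Join two basic relations; ≡ is the identity, and anything not
    tabulated is approximated by the least informative relation #."""
    if R == '≡':
        return S
    if S == '≡':
        return R
    return BASIC_JOINS.get((R, S), '#')

print(join('|', '^'))   # ⊏  (fish | human, human ^ nonhuman, so fish ⊏ nonhuman)
print(join('|', '|'))   # #  (a union relation, safely approximated by #)
```

Note that with this partial table the fallback also coarsens any omitted basic join to #; that loses information but never produces a wrong answer, since # places no constraints on its arguments.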

4 Lexical Entailment Relations

Suppose x is a compound linguistic expression, and let e(x) be the result of applying an atomic edit e (the deletion, insertion, or substitution of a subexpression) to x. The entailment relation which holds between x and e(x), which we denote β(x,e(x)), will depend on (1) the lexical entailment relation generated by e, which we label β(e), and (2) other properties of the context x in which e is applied (to be discussed in Sect. 5). For example, suppose x is red car. If e is sub(car, convertible), then β(e) is ⊐ (because convertible is a hyponym of car). On the other hand, if e is del(red), then β(e) is ⊏ (because red is an intersective modifier). Crucially, β(e) depends solely on the lexical items involved in e, independent of context.

How are lexical entailment relations determined? Ultimately, this is the province of lexical semantics, which lies outside the scope of this work. However, the answers are fairly intuitive in most cases, and we can make a number of useful observations.

Substitutions

The entailment relation generated by a substitution edit is simply the relation between the substituted terms: β(sub(x,y)) = β(x,y). For open-class terms such as nouns, adjectives, and verbs, we can often determine the appropriate relation by consulting a lexical resource such as WordNet. Synonyms belong to the ≡ relation (sofa ≡ couch, forbid ≡ prohibit); hyponym-hypernym pairs belong to the ⊏ relation (crow ⊏ bird, frigid ⊏ cold, soar ⊏ rise); and antonyms and coordinate terms generally belong to the | relation (hot | cold, cat | dog).Footnote 15 Proper nouns, which denote individual entities or events, will stand in the ≡ relation if they denote the same entity (USA ≡ United States), or the | relation otherwise (JFK | FDR). Pairs which cannot reliably be assigned to another entailment relation will be assigned to the # relation (hungry # hippo). Of course, there are many difficult cases, where the most appropriate relation will depend on subjective judgments about word sense, topical context, and so on—consider, for example, the pair system and approach. And some judgments may depend on world knowledge not readily available to an automatic system. For example, plausibly skiing | sleeping, but skiing # talking.

Closed-class terms may require special handling. Substitutions involving generalized quantifiers generate a rich variety of entailment relations: all ≡ every, every ⊏ some, some ^ no, no | every, at least four \(\mathrel{\smallsmile}\) at most six, and most # ten or more.Footnote 16 Two pronouns, or a pronoun and a noun, should ideally be assigned to the ≡ relation if it can be determined from context that they refer to the same entity, though this may be difficult for an automatic system to establish reliably. Prepositions are somewhat problematic. Some pairs of prepositions can be interpreted as antonyms, and thus assigned to the | relation (above | below), but many prepositions are used so flexibly in natural language that they are best assigned to the ≡ relation (on [a plane] ≡ in [a plane] ≡ by [plane]).

Generic Deletions and Insertions

For deletion edits, the default behavior is to generate the ⊏ relation (thus red car ⊏ car). Insertion edits are symmetric: by default, they generate the ⊐ relation (sing ⊐ sing off-key). This heuristic can safely be applied whenever the affected phrase is an intersective modifier, and can usefully be applied to phrases much longer than a single word (car which has been parked outside since last week ⊏ car). Indeed, this principle underlies most current approaches to the RTE task, in which the premise p often contains much extraneous content not found in the hypothesis h. Most RTE systems try to determine whether p subsumes h: they penalize new content inserted into h, but do not penalize content deleted from p.

Special Deletions and Insertions

However, some lexical items exhibit special behavior upon deletion or insertion. The most obvious example is negation, which generates the ^ relation (didn’t sleep ^ did sleep). Implicatives and factives (such as refuse to and admit that) constitute another important class of exceptions, but we postpone discussion of them to Sect. 6. Then there are non-intersective adjectives such as former and alleged. These exhibit varied behavior: deleting former seems to generate the | relation (former student | student), while deleting alleged seems to generate the # relation (alleged spy # spy). We lack a complete typology of such cases, but consider this an interesting problem for lexical semantics. Finally, for pragmatic reasons, we typically assume that auxiliary verbs and punctuation marks are semantically vacuous, and thus generate the ≡ relation upon deletion or insertion. When combined with the assumption that morphology matters little in inference,Footnote 17 this allows us to establish, e.g., that is sleeping ≡ sleeps and did sleep ≡ slept.
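The defaults and special cases for deletions and insertions can be sketched as a small dispatch function. The mini-lexicons below are hypothetical stand-ins for real lexical resources; relation symbols are strings, with ^ for negation.

```python
# Hypothetical mini-lexicons for the special cases noted above.
NEGATIONS = {"not", "n't"}
VACUOUS = {"did", "do", "is", "was", "have", ","}   # auxiliaries, punctuation

def beta_edit(op, phrase):
    """Lexical entailment relation generated by an atomic deletion or
    insertion edit; op is 'del' or 'ins'."""
    if phrase in NEGATIONS:
        return '^'                   # negation, in either direction
    if phrase in VACUOUS:
        return '≡'                   # semantically vacuous material
    # Default: treat the phrase as an intersective modifier, so deletion
    # broadens the meaning and insertion restricts it.
    return '⊏' if op == 'del' else '⊐'

print(beta_edit('del', 'red'))        # ⊏   (red car ⊏ car)
print(beta_edit('ins', 'off-key'))    # ⊐   (sing ⊐ sing off-key)
print(beta_edit('ins', 'not'))        # ^
print(beta_edit('ins', 'did'))        # ≡
```

A real implementation would also need entries for the implicatives, factives, and non-intersective adjectives discussed in the text, which this sketch omits.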

5 Entailment Relations and Semantic Composition

How are entailment relations affected by semantic composition? In other words, how do the entailment relations between compound expressions depend on the entailment relations between their parts? Say we have established the value of β(x,y), and let f be an expression which can take x or y as an argument. What is the value of β(f(x),f(y)), and how does it depend on the properties of f?

The monotonicity calculus of Sánchez Valencia provides a partial answer. It explains the impact of semantic composition on the entailment relations ≡, ⊏, ⊐, and # by assigning semantic functions to one of three monotonicity classes: up, down, and non. If f has monotonicity up (the default), then the entailment relation between x and y is projected through f without change: β(f(x),f(y)) = β(x,y). Thus some parrots talk ⊏ some birds talk. If f has monotonicity down, then ⊏ and ⊐ are swapped. Thus no carp talk ⊐ no fish talk. Finally, if f has monotonicity non, then ⊏ and ⊐ are projected as #. Thus most humans talk # most animals talk.

The monotonicity calculus also provides an algorithm for computing the effect on entailment relations of multiple levels of semantic composition. Although Sánchez Valencia’s presentation of this algorithm uses a complex scheme for annotating nodes in a categorial grammar parse, the central idea can be recast in simple terms: propagate a lexical entailment relation upward through a semantic composition tree, from leaf to root, while respecting the monotonicity properties of each node along the path. Consider the sentence Nobody can enter without pants. A plausible semantic composition tree for this sentence could be rendered as (nobody (can ((without pants) enter))). Now consider replacing pants with clothes. We begin with the lexical entailment relation: pants ⊏ clothes. The semantic function without has monotonicity down, so without pants ⊐ without clothes. Continuing up the semantic composition tree, can has monotonicity up, but nobody has monotonicity down, so we get another reversal, and find that nobody can enter without pants ⊏ nobody can enter without clothes.
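This leaf-to-root propagation can be sketched in a few lines: walk the monotonicity classes of the functions on the path from the edited leaf to the root, innermost first, flipping or blocking the containment relations as each class dictates. Only the relations handled by the monotonicity calculus are covered here.

```python
SWAP = {'⊏': '⊐', '⊐': '⊏'}

def project(relation, path):
    """Propagate a lexical entailment relation upward through a semantic
    composition tree, given the monotonicity classes ('up', 'down', 'non')
    of the functions on the path from leaf to root, innermost first."""
    for mono in path:
        if relation in SWAP:
            if mono == 'down':
                relation = SWAP[relation]   # down swaps ⊏ and ⊐
            elif mono == 'non':
                relation = '#'              # non blocks containment
        # ≡ and # project without change through any function
    return relation

# sub(pants, clothes) in (nobody (can ((without pants) enter))):
# pants ⊏ clothes, projected through without (down), can (up), nobody (down).
print(project('⊏', ['down', 'up', 'down']))   # ⊏  (two inversions cancel)
```

The two downward-monotone operators cancel, reproducing the conclusion that nobody can enter without pants ⊏ nobody can enter without clothes.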

While the monotonicity calculus elegantly explains the impact of semantic composition on the containment relations (chiefly, ⊏ and ⊐), it lacks any account of the exclusion relations (^ and |, and, indirectly, \(\mathrel{\smallsmile}\)). To remedy this lack, we propose to generalize the concept of monotonicity to a concept of projectivity. We categorize semantic functions into a number of projectivity signatures, which can be seen as generalizations of both the three monotonicity classes of Sánchez Valencia and the nine implication signatures of Nairn et al. (see Sect. 6). Each projectivity signature is defined by a map \(\mathfrak{B} \mapsto \mathfrak{B}\) which specifies how each entailment relation is projected by the function. (Binary functions can have different signatures for each argument.) In principle, there are up to \(7^7\) possible signatures; in practice, probably no more than a handful are realized by natural language expressions. Though we lack a complete inventory of projectivity signatures, we can describe a few important cases.

Negation

We begin with simple negation (not). Like most functions, it projects ≡ and # without change (not happy ≡ not glad and isn’t swimming # isn’t hungry). As a downward monotone function, it swaps ⊏ and ⊐ (didn’t kiss ⊐ didn’t touch). But we can also establish that it projects ^ without change (not human ^ not nonhuman) and swaps | and \(\mathrel{\smallsmile}\) (not French \(\mathrel{\smallsmile}\) not German and not more than 4 | not less than 6). Its projectivity signature therefore maps ≡ ↦ ≡, ⊏ ↦ ⊐, ⊐ ↦ ⊏, ^ ↦ ^, | ↦ \(\mathrel{\smallsmile}\), \(\mathrel{\smallsmile}\) ↦ |, and # ↦ #.

Intersective Modification

Intersective modification has monotonicity up, but projects both ^ and | as | (living human | living nonhuman and French wine | Spanish wine), and projects \(\mathrel{\smallsmile}\) as # (metallic pipe # nonferrous pipe). Its signature therefore maps ≡ ↦ ≡, ⊏ ↦ ⊏, ⊐ ↦ ⊐, ^ ↦ |, | ↦ |, \(\mathrel{\smallsmile}\) ↦ #, and # ↦ #.Footnote 18
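Projectivity signatures are just maps over the seven relations, so they can be encoded as dictionaries and chained outward. The sketch below encodes the two signatures described in prose above (relation symbols as strings, ^ for negation, ⌣ for cover).

```python
# Projectivity signatures as explicit maps over the seven basic relations.
NOT_SIG = {'≡': '≡', '⊏': '⊐', '⊐': '⊏', '^': '^',
           '|': '⌣', '⌣': '|', '#': '#'}

# Intersective modification: upward monotone, but both ^ and | project
# as |, and ⌣ projects as #.
INTERSECTIVE_SIG = {'≡': '≡', '⊏': '⊏', '⊐': '⊐', '^': '|',
                    '|': '|', '⌣': '#', '#': '#'}

def project(relation, signatures):
    """Project a relation through a chain of functions, innermost first."""
    for sig in signatures:
        relation = sig[relation]
    return relation

print(project('⊏', [NOT_SIG]))            # ⊐  (didn't kiss ⊐ didn't touch)
print(project('^', [NOT_SIG]))            # ^  (not human ^ not nonhuman)
print(project('^', [INTERSECTIVE_SIG]))   # |  (living human | living nonhuman)
```

Restricted to {≡, ⊏, ⊐, #}, these dictionaries reduce to the monotonicity classes down and up respectively, which is the sense in which projectivity generalizes monotonicity.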

Quantifiers

While semanticists are well acquainted with the monotonicity properties of common quantifiers, how they project the exclusion relations may be less familiar. Table 3 summarizes the projectivity signatures of the most common binary generalized quantifiers for each argument position.

Table 3 Projectivity signatures for various quantifiers

A few observations:

  • All quantifiers (like most other semantic functions) project ≡ and # without change.

  • The table confirms well-known monotonicity properties: no is downward-monotone in both arguments, every in its first argument, and not every in its second argument.

  • Relation | is frequently “blocked” by quantifiers (i.e., projected as #). Thus no fish talk # no birds talk and someone was early # someone was late. A notable exception is every in its second argument, where | is preserved: everyone was early | everyone was late. (Note the similarity to intersective modification.)

  • Because no is the negation of some, its projectivity signature can be found by projecting the signature of some through the signature of not. Likewise for not every and every.

  • Some results depend on assuming the non-vacuity of the other argument to the quantifier: some entries in Table 3 are marked as assuming it to be non-empty, while others are marked as assuming it to be non-universal. Without these assumptions, # is projected.

Verbs

Verbs (and verb-like constructions) exhibit diverse behavior. Most verbs are upward-monotone (though not all—see Sect. 6), and many verbs project ^, |, and \(\mathrel{\smallsmile}\) as # (eats humans # eats nonhumans, eats cats # eats dogs, and eats mammals # eats nonhumans). However, verbs which encode functional relations seem to exhibit the same projectivity as intersective modifiers, projecting ^ and | as |, and \(\mathrel{\smallsmile}\) as #.Footnote 19 Categorizing verbs according to projectivity is an interesting problem for lexical semantics, which may involve codifying some amount of world knowledge.

6 Implicatives and Factives

Nairn et al. (2006) offer an elegant account of inferences involving implicatives and factivesFootnote 20 such as manage to, refuse to, and admit that. Their model classifies such operators into nine implication signatures, according to their implications—positive (+), negative (−), or null (∘)—in both positive and negative contexts. Thus refuse to has implication signature −/∘, because it carries a negative implication in a positive context (refused to dance implies didn’t dance), and no implication in a negative context (didn’t refuse to dance implies neither danced nor didn’t dance).

Most of the phenomena observed by Nairn et al. can be explained within our framework by specifying, for each implication signature, the relation generated when an operator of that signature is deleted from (or inserted into) a compound expression, as shown in Table 4.

Table 4 Implicatives and factives

This table invites several observations. First, as the examples make clear, there is room for variation regarding the appearance of infinitive arguments, complementizers, passivization, and morphology. An implemented model must tolerate such diversity.

Second, some of the examples may seem more intuitive when one considers their negations. For example, deleting signature ∘/− generates ⊐; under negation, this is projected as ⊏ (he wasn’t permitted to live ⊏ he didn’t live). Likewise, deleting signature ∘/+ generates \(\mathrel{\smallsmile}\); under negation, this is projected as | (he didn’t hesitate to ask | he didn’t ask).
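These negated readings can be checked mechanically by composing the deletion relations with the negation signature described in Sect. 5. The sketch below encodes only the implication signatures whose deletion relations are stated in the surrounding text; the rest of Table 4 is omitted.

```python
# Negation's projectivity signature ('^' = negation, '⌣' = cover).
NOT_SIG = {'≡': '≡', '⊏': '⊐', '⊐': '⊏', '^': '^',
           '|': '⌣', '⌣': '|', '#': '#'}

# Relation generated by deleting an operator of a given implication
# signature (only the cases discussed in the text).
DELETE_SIG = {
    '-/o': '|',   # refused to dance | danced
    'o/-': '⊐',   # was permitted to live ⊐ lived
    'o/+': '⌣',   # hesitated to ask ⌣ asked
}

# Under negation, deleting o/- projects as ⊏:
#   he wasn't permitted to live ⊏ he didn't live
print(NOT_SIG[DELETE_SIG['o/-']])   # ⊏
# and deleting o/+ projects as |:
#   he didn't hesitate to ask | he didn't ask
print(NOT_SIG[DELETE_SIG['o/+']])   # |
```

As the next paragraph notes, this composition breaks down for the factive signatures, whose implications are presuppositions rather than entailments and so project differently.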

Third, a fully satisfactory treatment of the factives (signatures +/+, −/−, and ∘/∘) would require an extension to our present theory. For example, deleting signature +/+ generates ⊏; yet under negation, this is projected not as ⊐, but as | (he didn’t admit that he knew | he didn’t know). The problem arises because the implication carried by a factive is not an entailment, but a presupposition.Footnote 21 As is well known, the projection behavior of presuppositions differs from that of entailments (van der Sandt 1992). It seems likely that our model could be elaborated to account for projection of presuppositions as well as entailments, but we leave this for future work.

We can further cement implicatives and factives within our model by specifying the monotonicity class for each implication signature: signatures +/−, +/∘, and ∘/− have monotonicity up (force to tango ⊏ force to dance); signatures −/+, −/∘, and ∘/+ have monotonicity down (refuse to tango ⊐ refuse to dance); and signatures +/+, −/−, and ∘/∘ (the propositional attitudes) have monotonicity non (think tangoing is fun # think dancing is fun). We are not yet able to specify the complete projectivity signature corresponding to each implication signature, but we can describe a few specific cases. For example, implication signature −/∘ seems to project ^ as | (refuse to stay | refuse to go) and both | and \(\mathrel{\smallsmile}\) as # (refuse to tango # refuse to waltz).

7 Putting It All Together

We now have the building blocks of a general method to establish the entailment relation between a premise p and a hypothesis h. The steps are as follows:

  1. 1.

    Find a sequence of atomic edits ⟨e₁, …, eₙ⟩ which transforms p into h: thus h = (eₙ ∘ … ∘ e₁)(p). For convenience, let us define x₀ = p, xₙ = h, and xᵢ = eᵢ(xᵢ₋₁) for i ∈ [1, n].

  2. 2.

    For each atomic edit eᵢ:

    1. a.

      Determine the lexical entailment relation β(eᵢ), as in Sect. 4.

    2. b.

      Project β(eᵢ) upward through the semantic composition tree of expression xᵢ₋₁ to find an atomic entailment relation β(xᵢ₋₁, xᵢ), as in Sect. 5.

  3. 3.

    Join atomic entailment relations across the sequence of edits, as in Sect. 3:

    $$\beta(p, h) = \beta(x_0, x_n) = \beta(x_0, x_1) \Join \ldots \Join \beta(x_{i-1}, x_i) \Join \ldots \Join \beta(x_{n-1}, x_n)$$

However, this inference method has several important limitations, including the need to find an appropriate edit sequence connecting p and h;Footnote 22 the tendency of the join operation toward less informative entailment relations, as described in Sect. 3; and the lack of a general mechanism for combining information from multiple premises.Footnote 23 Consequently, the method has less deductive power than first-order logic, and fails to sanction some fairly simple inferences, including de Morgan’s laws for quantifiers. But the method neatly explains many inferences not handled by the monotonicity calculus.

For example, while the monotonicity calculus notably fails to explain even the simplest inferences involving semantic exclusion, such examples are easily accommodated in our framework. We encountered an example of such an inference in Sect. 1: Stimpy is a cat ⊨ Stimpy is not a poodle. Clearly, this is a valid natural language inference. To establish this using our inference method, we must begin by selecting a sequence of atomic edits which transforms the premise p into the hypothesis h. While there are several possibilities, one obvious choice is first to replace cat with dog, then to insert not, and finally to replace dog with poodle. An analysis of this edit sequence is shown in Table 5. In this representation (of which we will see several more examples in the following pages), we show three entailment relations associated with each edit eᵢ, namely:

  • β(eᵢ), the lexical entailment relation generated by eᵢ,

  • β(xᵢ₋₁, xᵢ), the atomic entailment relation which holds across eᵢ, and

  • β(x₀, xᵢ), the cumulative join of all atomic entailment relations up through eᵢ. This can be calculated in the table as β(x₀, xᵢ₋₁) ⋈ β(xᵢ₋₁, xᵢ).

Table 5 An example inference involving semantic exclusion

In Table 5, x₀ is transformed into x₃ by a sequence of three edits. First, replacing cat with its coordinate term dog generates the lexical entailment relation |. Next, inserting not generates ^, and | joined with ^ yields ⊏. Finally, replacing dog with its hyponym poodle generates ⊐. Because of the downward-monotone context created by not, this is projected as ⊏, and ⊏ joined with ⊏ yields ⊏. Therefore, premise x₀ entails hypothesis x₃.
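The three-edit analysis above can be replayed end to end with the machinery of the previous sections. The sketch below hard-codes only the join entries and the negation signature this example needs (relation symbols as strings, ^ for negation, ⌣ for cover).

```python
# Negation's projectivity signature and the two joins this example needs.
NOT_SIG = {'≡': '≡', '⊏': '⊐', '⊐': '⊏', '^': '^',
           '|': '⌣', '⌣': '|', '#': '#'}
JOIN = {('|', '^'): '⊏', ('⊏', '⊏'): '⊏'}

# Each edit: (lexical relation, signatures in whose scope it occurs,
# innermost first).
edits = [
    ('|', []),         # sub(cat, dog): coordinate terms
    ('^', []),         # ins(not)
    ('⊐', [NOT_SIG]),  # sub(dog, poodle): hyponym, inside the new negation
]

cumulative = '≡'
for lexical, sigs in edits:
    atomic = lexical
    for sig in sigs:                 # project lexical -> atomic relation
        atomic = sig[atomic]
    # join with the running relation (≡ is the identity for ⋈)
    cumulative = atomic if cumulative == '≡' else JOIN[(cumulative, atomic)]

print(cumulative)   # ⊏  — Stimpy is a cat ⊏ Stimpy is not a poodle
```

The intermediate values of `cumulative` (|, then ⊏, then ⊏) are exactly the third column of Table 5.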

For an example involving an implicative, consider the inference in Table 6. Again, x₀ is transformed into x₃ by a sequence of three edits.Footnote 24 First, deleting permitted to generates ⊐, according to its implication signature; but because not is downward-monotone, this is projected as ⊏. Next, deleting not generates ^, and ⊏ joined with ^ yields |. Finally, inserting Cuban cigars restricts the meaning of smoked, generating ⊐, and | joined with ⊐ yields |. So x₃ contradicts x₀.

Table 6 An example inference involving an implicative

Let’s now look at a more complex example (first presented in MacCartney and Manning 2008) that demonstrates the interaction of a number of aspects of the model we’ve presented. The inference is:

p: Jimmy Dean refused to move without blue jeans.

h: James Dean didn’t dance without pants.

Of course, the example is quite contrived, but it has the advantage that it compactly exhibits several phenomena of interest: semantic containment (between move and dance, and between pants and jeans); semantic exclusion (in the form of negation); an implicative (namely, refuse to); and nested inversions of monotonicity (created by refuse to and without). In this example, the premise p can be transformed into the hypothesis h by a sequence of seven edits, as shown in Table 7. This time we include even “light” edits yielding ≡ for the sake of completeness.

Table 7 Analysis of a more complex inference

We analyze these edits as follows. The first edit simply substitutes one variant of a name for another; since both substituends denote the same entity, the edit generates the ≡ relation. The second edit deletes an implicative (refuse to) with implication signature −/∘. As described in Sect. 6, deleting an operator of this signature generates the | relation, and ≡ joined with | yields |. The third edit inserts an auxiliary verb (did); since auxiliaries are more or less semantically vacuous, this generates the ≡ relation, and | joined with ≡ yields | again. The fourth edit inserts a negation, generating the ^ relation. Here we encounter the first interesting join: as explained in Sect. 3, | joined with ^ yields ⊏. The fifth edit replaces move with its hyponym dance, generating the ⊐ relation. However, because the edit occurs within the scope of the newly introduced negation, ⊐ is projected as ⊏, and ⊏ joined with ⊏ yields ⊏. The sixth edit deletes a generic modifier (blue), which generates the ⊏ relation by default. This time the edit occurs within the scope of two downward-monotone operators (without and negation), so we have two inversions of monotonicity, and ⊏ is projected as ⊏. Again, ⊏ joined with ⊏ yields ⊏. Finally, the seventh edit replaces jeans with its hypernym pants, generating the ⊏ relation. Again, the edit occurs within the scope of two downward-monotone operators, so ⊏ is projected as ⊏, and ⊏ joined with ⊏ yields ⊏. Thus p entails h.
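The chain of joins in this analysis can be checked by folding the join operation over the seven atomic entailment relations just derived. Only the join entries this example needs are tabulated (^ for negation).

```python
from functools import reduce

JOIN = {('|', '^'): '⊏', ('⊏', '⊏'): '⊏'}

def join(R, S):
    """Join two basic relations; ≡ is the identity for ⋈."""
    if R == '≡':
        return S
    if S == '≡':
        return R
    return JOIN[(R, S)]

# Atomic entailment relations across the seven edits analyzed above, in order:
# sub(name), del(refuse to), ins(did), ins(negation),
# sub(move, dance), del(blue), sub(jeans, pants).
atomic = ['≡', '|', '≡', '^', '⊏', '⊏', '⊏']
print(reduce(join, atomic))   # ⊏  — so p entails h
```

The running values of the fold (≡, |, |, ⊏, ⊏, ⊏, ⊏) match the cumulative column described in the analysis.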

Of course, the edit sequence shown in Table 7 is not the only sequence which can transform p into h. A different edit sequence might yield a different sequence of intermediate steps, but the same final result. Consider, for example, the edit sequence shown in Table 8. Note that the lexical entailment relation \(\beta(e_{i})\) generated by each edit is the same as before. But because the edits involving downward-monotone operators (namely, ins(n’t) and del(refused to)) now occur at different points in the edit sequence, many of the atomic entailment relations \(\beta(x_{i-1}, x_{i})\) have changed, and thus the sequence of joins has changed as well. In particular, edits 3 and 4 occur within the scope of three downward-monotone operators (negation, refuse, and without), with the consequence that the ⊏ relation generated by each of these lexical edits is projected as ⊐. Likewise, edit 5 occurs within the scope of two downward-monotone operators (negation and refuse), and edit 6 occurs within the scope of one downward-monotone operator (negation), so that | is projected as \(\mathrel{\smallsmile}\). Nevertheless, the ultimate result is still ⊏.
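The monotonicity bookkeeping for ⊏ and ⊐ follows a simple parity rule: an even number of enclosing downward-monotone operators leaves the relation unchanged, while an odd number inverts it. The sketch below models only this inversion (the full account of Sect. 5 also projects ^, |, and ⌣ on a per-operator basis, which this toy function does not attempt).

```python
# Parity sketch of projection through downward-monotone contexts.
# Models only the ⊏/⊐ inversion; ^, |, ⌣ need per-operator treatment.
FWD, REV = "⊏", "⊐"

def project(relation, downward_monotone_count):
    """Project a lexical entailment relation through its enclosing
    contexts, counting only downward-monotone operators."""
    if relation in (FWD, REV) and downward_monotone_count % 2 == 1:
        return REV if relation == FWD else FWD
    return relation  # ≡ and # pass through unchanged

# Edit 7 of Table 7: jeans -> pants generates ⊏ inside two
# downward-monotone operators (without, negation): inversions cancel.
print(project(FWD, 2))  # ⊏
# Edits 3 and 4 of Table 8 sit under three such operators: ⊏ flips.
print(project(FWD, 3))  # ⊐
```

The two calls mirror the contrast between Tables 7 and 8: the same lexical relation projects differently depending on how many downward-monotone operators enclose the edit site at that point in the sequence.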

Table 8 An alternative analysis of the inference from Table 7

However, not every edit sequence which transforms p into h yields equally satisfactory results. Consider the sequence shown in Table 9. The crucial difference in this edit sequence is that the insertion of not, which generates lexical entailment relation ^, occurs within the scope of refuse, so that ^ is projected as atomic entailment relation | (see Sect. 5). But the deletion of refuse to also produces atomic entailment relation | (see Sect. 6), and | joined with | yields a relatively uninformative union relation, namely ⋃{≡,⊏,⊐,|,#} (which could also be described as the non-exhaustion relation). The damage has been done: further joining leads directly to the “black hole” relation •, from which there is no escape. Note, however, that even for this infelicitous edit sequence, our inference method has not produced an incorrect answer (because the • relation includes the ⊏ relation), only an uninformative answer (because it includes all other relations in \(\mathfrak{B}\) as well).
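That | joined with | yields exactly the union {≡,⊏,⊐,|,#} can be checked directly from the set-theoretic definitions of the basic relations. The following brute-force sketch (an independent verification, not part of NatLog) enumerates all triples of sets over a four-element universe, excluding the empty and universal sets as the definitions require, and collects every relation that can hold between x and z when x | y and y | z.

```python
# Brute-force check that | joined with | is the union {≡,⊏,⊐,|,#}.
from itertools import combinations

U = frozenset(range(4))  # small universe of four elements

def rel(x, z):
    """Basic entailment relation between two non-empty, non-universal
    sets, per the standard set-theoretic definitions."""
    if x == z:
        return "≡"
    if x < z:
        return "⊏"
    if x > z:
        return "⊐"
    empty_meet = not (x & z)
    full_join = (x | z) == U
    if empty_meet and full_join:
        return "^"
    if empty_meet:
        return "|"
    if full_join:
        return "⌣"
    return "#"

def subsets(u):
    s = sorted(u)
    for r in range(1, len(s)):          # exclude the empty set and U
        for c in combinations(s, r):
            yield frozenset(c)

outcomes = set()
for x in subsets(U):
    for y in subsets(U):
        if rel(x, y) != "|":
            continue
        for z in subsets(U):
            if rel(y, z) == "|":
                outcomes.add(rel(x, z))
print(outcomes)  # exactly the five relations {≡, ⊏, ⊐, |, #}
```

The enumeration confirms that ^ and ⌣ never arise: if x and z were exhaustive, any y disjoint from both would have to be empty. All five remaining relations are attainable, so the join is genuinely a five-way union.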

Table 9 A third analysis of the inference from Table 7

Additional examples are presented in MacCartney (2009).

8 Implementation and Evaluation

The model of natural logic described here has been implemented in software as the NatLog system. In previous work (MacCartney and Manning 2008), we presented a description and evaluation of NatLog; this section summarizes the main results. NatLog faces three primary challenges:

  1.

    Finding an appropriate sequence of atomic edits connecting premise and hypothesis. NatLog does not address this problem directly, but relies instead on edit sequences from other sources. We have investigated this problem separately in MacCartney et al. (2008).

  2.

    Determining the lexical entailment relation for each edit. NatLog learns to predict lexical entailment relations by using machine learning techniques and exploiting a variety of manually and automatically constructed sources of information on lexical relations.

  3.

    Computing the projection of each lexical entailment relation. NatLog identifies expressions with non-default projectivity and computes the likely extent of their arguments in a syntactic parse using hand-crafted tree patterns.

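As a toy illustration of the second challenge, the sketch below stands in for NatLog's learned classifier with a tiny hand-coded lexicon. The edit types and default relations follow the examples of Sect. 7; the lexicon, function name, and the blanket defaults for insertion and deletion are hypothetical simplifications (in particular, implicatives such as refuse to receive special treatment in NatLog rather than the generic-modifier default shown here).

```python
# Toy stand-in for NatLog's learned lexical-entailment classifier.
# The lexicon and defaults are hypothetical, for illustration only.
HYPONYM_OF = {("dance", "move"), ("jeans", "pants")}  # (narrower, broader)

def lexical_relation(edit_type, old=None, new=None):
    """Map an atomic edit to a basic entailment relation."""
    if edit_type == "SUB":
        if old == new:
            return "≡"
        if (new, old) in HYPONYM_OF:   # substituted a hyponym
            return "⊐"
        if (old, new) in HYPONYM_OF:   # substituted a hypernym
            return "⊏"
        return "#"                     # unknown pair: assume independence
    if edit_type == "DEL":
        return "⊏"                     # deleting a generic modifier
    if edit_type == "INS":
        return "⊐"                     # inserting a generic modifier
    raise ValueError(edit_type)

print(lexical_relation("SUB", "move", "dance"))   # ⊐
print(lexical_relation("SUB", "jeans", "pants"))  # ⊏
```

NatLog replaces this hand-coded table with a classifier trained over features drawn from WordNet and other lexical resources, but the input/output contract is the same: an atomic edit in, a basic entailment relation out.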
We have evaluated NatLog on two different test suites. The first is the FraCaS test suite (Cooper et al. 1996), which contains 346 NLI problems, divided into nine sections, each focused on a specific category of semantic phenomena. The goal is three-way entailment classification, as described in Sect. 2. On this task, NatLog achieves an average accuracy of 70 %. In the section concerning quantifiers, which is both the largest and the most amenable to natural logic, the system answers all problems but one correctly. Unsurprisingly, performance is mediocre in four sections concerning semantic phenomena (e.g., ellipsis) not relevant to natural logic and not modeled by the system. But in the other five sections (representing about 60 % of the problems), NatLog achieves accuracy of 87 %. What’s more, precision is uniformly high, averaging 89 % over all sections. Thus, even outside its areas of expertise, the system rarely predicts entailment when none exists.

The RTE3 test suite (Giampiccolo et al. 2007) differs from FraCaS in several important ways: the goal is binary entailment classification; the problems have much longer premises and are more “natural”; and the problems employ a diversity of types of inference—including paraphrase, temporal reasoning, and relation extraction—which NatLog is not designed to address. Consequently, the NatLog system by itself achieves mediocre accuracy (59 %) on RTE3 problems. However, its precision is comparatively high, which suggests a strategy of hybridizing with a broad-coverage RTE system. We were able to show that adding NatLog as a component in the Stanford RTE system (Chambers et al. 2007) led to accuracy gains of 4 %.

9 Conclusion

The model of natural logic presented here is by no means a universal solution to the problem of natural language inference. Many NLI problems hinge on types of inference not addressed by natural logic, and the inference method we describe faces a number of limitations on its deductive power (discussed in Sect. 7). Moreover, there is further work to be done in fleshing out our account of projectivity, particularly in establishing the proper projectivity signatures for a broader range of quantifiers, verbal constructs, implicatives and factives, logical connectives, and other semantic functions.

Nevertheless, we believe our model of natural logic fills an important niche. While approximate methods based on lexical and syntactic similarity can handle many NLI problems, they are easily confounded by inferences involving negation, antonymy, quantifiers, implicatives, and many other phenomena. Our model achieves the logical precision needed to handle such inferences without resorting to full semantic interpretation, which is in any case rarely possible. The practical value of the model is demonstrated by its success in evaluations on the FraCaS and RTE3 test suites.