1 Argument semantic complexity

It is a well-known fact that nouns differ in their acceptability as predicate arguments. Traditionally, linguistic theory has modeled this as a binary contrast between acceptable and impossible arguments:

(1) [examples contrasting an acceptable and an impossible argument omitted]

Impossible arguments are those that violate the combinatorial constraints (aka selectional preferences) of the predicate to such a degree that we are not able to build any coherent representation for the described situation, as in the impossible example in (1). Recently, psycholinguistic and neurocognitive research has questioned the dichotomous nature of the phenomenon, arguing that arguments differ in their degree of acceptability, as shown by the following sentence:

(2) [example omitted: a semantically acceptable but atypical sentence with the verb play]

Although the selectional constraints of play are satisfied both in the acceptable example in (1) and in (2), the latter expresses a more unusual event. Investigations on event-related potentials (ERP), the electrophysiological responses of the brain to a stimulus measured with electroencephalography (EEG), have provided extensive evidence that such sentences, despite both being semantically acceptable, have a different cognitive status. In particular, sentences such as (2), containing possible but unexpected combinations of lexemes, evoke stronger N400 components than plausible ones. The N400 component, originally described by Kutas and Hillyard (1980), is a negative-going deflection that peaks around 400 ms after the presentation of the stimulus, and since its discovery the amplitude of this component has been taken to reflect the complexity of semantic composition: unusual combinations of lexemes require extra cognitive effort to be understood, as they are not coherent with the unfolding semantic representation of the context (Baggio and Hagoort 2011; Baggio et al. 2012). We refer to this phenomenon as argument complexity, to distinguish it from other types of syntactic and semantic complexity arising during online sentence comprehension. Over the years, several linguistic theories and computational models have been proposed to account for processing differences between natural language sentences, among them Dependency Locality Theory (Gibson 2000), the ACT-R based model by Lewis and Vasishth (2005) and Surprisal Theory (Hale 2001, 2016). A common trait of all these frameworks is their focus on the syntactic factors of complexity, identified sometimes with the length of dependencies, sometimes with the probability of a given syntactic analysis in a given context, and so on. The notion of argument complexity we analyse in this work instead concerns the semantic factors involved in the construction of sentence meaning via predicate-argument composition.

A different but strongly related phenomenon is (complement) coercion, in which an argument is reinterpreted to overcome the violation of its predicate selectional preferences (Lauwers and Willems 2011). One widely studied case of coercion is logical metonymy, which is traditionally considered as a theoretical challenge for classical models of compositionality (Pustejovsky and Batiukova 2019):

(3) [example of a logical metonymy omitted]

Logical metonymy is described as a type clash between an event-selecting metonymic verb (e.g., aspectual verbs like begin) and an entity-denoting nominal object, which triggers the recovery of a hidden event (e.g., writing). Crucially, previous research has brought extensive evidence that such metonymic constructions also determine extra processing costs and increased complexity during online sentence comprehension (McElree et al. 2001; Traxler et al. 2002), apparently due to “the deployment of operations to construct a semantic representation of the event” (Frisson and McElree 2008). Therefore, logical metonymy, as well as complement coercion in general, can be regarded as an instance of argument complexity caused by the effort required to repair the violation of the verb selectional preferences.

The N400 effects and the processing costs of logical metonymy suggest that “not all arguments are processed equal”, and that the semantic complexity of an argument depends on its compatibility with the selectional constraints of the predicate. Argument compatibility is a graded, rather than binary notion and is typically referred to as thematic fit. Several psycholinguistic studies making use of different experimental paradigms (self-paced reading, eye-tracking, EEG, etc.) indicate that argument complexity is determined by information about event contingencies and specific predicate-argument combinations stored in semantic memory. This event knowledge has a key role in human sentence processing: Verbs (e.g., eat) activate expectations about nouns typically occurring as their arguments (e.g., pizza) (McRae et al. 1998), and in turn entity-denoting nouns prime verbs referring to the events they typically participate in (McRae et al. 2005). Arguments that are coherent with the activated expectations have a lower semantic complexity and are read faster by subjects.

Moreover, priming experiments show that nouns also prime other nouns co-occurring as arguments in the same events (Hare et al. 2009). More specifically: (i) event nouns prime the participants (sale-shopper) and objects (trip-luggage) typically found at those events; (ii) locations prime the people/animals and objects (hospital-doctor, barn-hay) typically found at those locations; (iii) instrument nouns prime the things on which they are commonly used (key-door). All these event-based priming effects support the hypothesis of a mental lexicon arranged as a web of mutual expectations that are exploited online to compute the thematic fit of the argument nouns as fillers of the verb roles. In the literature, this knowledge contained in semantic memory is generally referred to as Generalized Event Knowledge (GEK), and it is acquired by humans from both first-hand experience (e.g., playing music) and linguistic experience (e.g., talking and reading about playing music) (McRae and Matsuki 2009).

The expectations for the predicate fillers and the resulting argument complexity depend on wide event scenarios. As shown by some recent studies (Bicknell et al. 2010; Matsuki et al. 2011), the expectations about the likely filler of a given verb argument (e.g., the patient role) depend on the fillers of the other verb arguments (e.g., the agent). For example, given an agent noun like boxer, the most likely patient for the verb dodge is punch, while if the agent noun is politician, something like question will be much more likely as a patient filler. In other words, argument complexity and thematic fit have a context-sensitive nature and are affected by the general situation described by the sentence. Sentences including congruent argument combinations elicit significantly smaller N400 amplitudes than incongruent ones (Bicknell et al. 2010), as they show lower processing complexity. After an analysis of the evidence presented in the previously-cited studies, Jeffrey Elman proposed that words should be conceived as cues to event knowledge (words-as-cues hypothesis), and that sentence meaning consists precisely of the event representations that the lexical items in the sentence activate (Elman 2009, 2014). As new information comes in during online linguistic processing, new constraints on the possible interpretations of the sentence are progressively added. Importantly, logical metonymy too is affected by the whole configuration of verb arguments. For instance, the event recovered to overcome the type clash depends on both the patient and the agent roles (Lascarides and Copestake 1998; Zarcone et al. 2014). Therefore, argument complexity in general is a compositional phenomenon that must be addressed within the context of the cognitive processes leading to sentence meaning construction.

1.1 Argument complexity in distributional semantics

Computational models of argument complexity have been extensively investigated in distributional semantics (Lenci 2018). Erk et al. (2010) were among the first authors to measure the correlation between human-elicited thematic fit ratings and the scores assigned by a syntax-based Distributional Semantic Model (DSM). The plausibility of each verb-argument pair was computed as the similarity between new candidate nouns and previously attested exemplars for each specific verb-role pairing, as already proposed in Erk (2007). Baroni and Lenci (2010) adopted an approach to thematic fit modeling that has become dominant in the literature: For each verb role, they used their Distributional Memory (henceforth DM) framework to build a prototype vector by averaging the dependency-based vectors of its most typical fillers. The higher the similarity of a noun with a role prototype, the higher its plausibility as a filler for that role. Lenci (2011) later extended this model to account for the dynamic update of the expectations on an argument, depending on how another role is filled. By using the same DM tensor, this study tested an additive and a multiplicative model (Mitchell and Lapata 2010) to compose and update the expectations on the patient filler of the subject–verb–object triples of the dataset used in the study by Bicknell et al. (2010). More recent contributions aimed at improving the original model by Baroni and Lenci (2010), either by using semantic role labels instead of syntactic dependencies as the context for the vectors (Sayeed et al. 2015) or by clustering the verb fillers in order to better deal with polysemy (Greenberg et al. 2015). Another variant of the model, introduced by Santus et al. (2017), achieves better results by replacing cosine with a metric based on the semantic feature overlap between the prototype and the candidate fillers.

A different approach to the thematic fit problem was proposed by Tilk et al. (2016), who presented two neural architectures for generating probability distributions over the possible arguments for each thematic role. Their models took advantage of supervised training on two role-labeled corpora to optimize the distributional representation for thematic fit modeling, and managed to obtain significant improvements over the other systems on almost all the evaluation datasets. They also evaluated their model on the task of composing and updating verb argument expectations, obtaining a performance comparable to that of Lenci (2011). More recently, Chersoni et al. (2019) proposed a general distributional model for incremental sentence meaning representation that has been tested on human ratings of compositional argument plausibility. A notion closely related to thematic fit is that of selectional preference (Resnik 1997), the main difference being that the former refers to a gradient compatibility between arguments and thematic roles, while the latter involves discrete semantic types (Lebani and Lenci 2018). The acquisition of selectional preferences has mostly been seen as an auxiliary task for improving the performance of systems with different goals, such as semantic role classification (Collobert et al. 2011; Zapirain et al. 2013; Roth and Lapata 2015) or coreference resolution (Heinzerling et al. 2017). Some notable recent exceptions are the studies by Zhang et al. (2019, 2020), which introduced large-scale evaluation benchmarks for the task and proposed multiplex embedding models incorporating both the overall semantics of a word and its relational dependencies in context.

Concerning the Natural Language Processing (NLP) research on logical metonymy, previous studies have focused on two different and complementary aspects of the phenomenon: on the one hand, the retrieval of the covert event, which has been approached by means of either probabilistic methods (Lapata and Lascarides 2003) or distributional similarity models (Zarcone et al. 2012); on the other hand, the modeling of the processing differences observed in the experimental literature, a problem mainly tackled, again, with DSMs (Zarcone et al. 2013). In our view, a computational model should be able to deal with both aspects in order to provide a complete account of logical metonymy and its processing consequences.

Leveraging and extending these previous results, we introduce in Sect. 2 a distributional model of argument complexity inspired by the Memory, Unification and Control framework by Hagoort (2013). Our proposal has two major elements of novelty. First, it subsumes the graded acceptability of arguments as in (2) and the coercion in (3) under the same general computational approach to argument complexity. Second, it is grounded in the assumption that distributional semantics can provide a useful model of (at least a subset of) GEK and of its role in constructing compositional semantic representations. In Sect. 3, we evaluate our model on two psycholinguistic datasets, respectively on the task of composing and updating verb argument expectations and on modeling logical metonymy.

2 A distributional model for argument complexity

The objectives of our model are (i) to build an incremental distributional representation of a sentence, and (ii) to introduce a compositional weight to account for its argument complexity. We assume that sentences represent events consisting of various participants playing different roles, and that their argument complexity depends on two main factors: (a) the availability and salience of “ready-to-use” event information already stored in GEK and (b) the cost of unifying the GEK portions activated by the context into a coherent semantic representation, a cost mainly depending on the mutual semantic coherence of the event participants. We thus predict that sentences containing highly familiar argument combinations are easier to process than sentences containing novel ones, like the one in (2). Moreover, the complexity of novel combinations depends on how “compatible” they are with the event knowledge stored in the semantic memory.

GEK is assumed to be a highly structured repository, organized at various levels of complexity, granularity, and schematicity. It includes information about fully-specified micro-events (e.g., students reading books, gardeners cutting grass, etc.) and about more complex scenarios. In fact, sentences can be regarded as partial descriptions of events, since many details about the described situations can be left unspecified, and it is up to the comprehender to infer the missing parts by retrieving relevant information in GEK: for example, when we hear a sentence like The soldier killed all the enemies, we can infer that he used some sort of weapon (e.g., a rifle, a machine gun, etc.) to perform the killing event. Consistent with the psycholinguistic findings reviewed in Sect. 1 and with Elman's words-as-cues hypothesis, each linguistic expression works as a cue for recovering portions of GEK. Not only verbs, but also nouns (and possibly adjectives) activate GEK: more specifically, they activate the events involving those entities. For instance, hearing the noun student in a sentence leads to the activation of student-related events in GEK. As long as comprehenders manage to retrieve the "right" event scenario, they are also able to anticipate upcoming arguments in the sentence and to fill in unexpressed elements (e.g., the covert event in metonymic sentences).

Comprehension consists in recovering the most likely event expressed by a sentence (Kuperberg 2016), and it is an incremental process leading to the construction of a semantic representation, which is in turn obtained by combining the subsets of GEK activated by the different constructions in the sentence. Analogously to Hagoort (2013), we distinguish between two components of our model:

  • a Memory component, representing the storage of event structures in GEK contained in the semantic memory. In this study, we only consider the GEK subset derived from linguistic experience, which we model with distributional information extracted from corpora;

  • a Unification component, which combines the GEK portions activated by linguistic expressions, in order to generate new and more complex structures.

2.1 The memory component: modeling GEK in long-term memory

Fig. 1

A fragment of the DEG representing GEK with several instances of events, each represented by a sequence of co-indexed edges. The \(\sigma \) values are the activation scores of the events

In our framework, we assume that each lexical item \(w_{i}\) activates a set of events \(\langle e_{1},\sigma _{1} \rangle ,\ldots ,\langle e_{n}, \sigma _{n} \rangle \) such that \(e_{i}\) is an event in GEK, and \(\sigma _{i}\) is an activation score computed as the conditional probability \(P(e|w_{i})\). In other words, the activation level of e is quantified as its probability given the context word \(w_{i}\). Therefore, processing a linguistic expression in a given sentence will lead to the activation of a set of events in the semantic memory, each one associated with a \(\sigma \) score.

In a previous work, Chersoni et al. (2019) represented GEK with a Distributional Event Graph (DEG) that contains events extracted from dependency parsed sentences (Fig. 1).Footnote 1 The DEG nodes were distributional vectors (i.e., embeddings), meant as “out-of-context” encoding of lexical items. Notice that, in principle, any type of distributional vector can be used to this purpose. The edges corresponded to syntactic relations as an approximation of deeper semantic roles (e.g., the subject relation for the agent, the direct object relation for the patient, etc.), and they were weighted with activation scores identifying the most prototypical event-entity links.

In this work, the approach that we followed for representing events is to extract syntactic joint contexts (Chersoni et al. 2016b). A syntactic joint context includes the whole set of dependencies of a given lexical head (ignoring determiners and modifiers), and we take it as a surface representation of an event. For example, from the dependency structure of the sentence The student reads a book we extract the following event, corresponding to a path in the DEG in Fig. 1:

\([_{E}\) nsubj:\({\mathbf {student}}\) head:\({\mathbf {read}}\) dobj:\({\mathbf {book}}]\)

Events in GEK can be cued by several lexical items, with a strength depending on the salience of the event given the item. For example, the event above is cued by student, read and book. Besides complete events, we assume GEK to contain schematic (i.e., underspecified) events too, obtained by abstracting away from one or more arguments. For instance, from the sentence The student reads a book we also generate the schematic event \([_{E_{1}}\) nsubj:\({\mathbf {student}}\) dobj:\({\mathbf {book}}]\) describing an underspecified event schema with a student agent and a book patient, which can be instantiated by different actions (e.g., reading, borrowing, etc.). Therefore, GEK is not a flat list of events, but a structured repository of prototypical knowledge about event contingencies.
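To make the representation concrete, the following Python sketch (a simplified illustration rather than the actual extraction pipeline; the function name extract_events and the toy dependency format are ours) builds the full joint context and its schematic sub-events for The student reads a book:

```python
from itertools import combinations

# Dependency relations retained in a syntactic joint context (cf. Sect. 3.1)
KEEP_RELS = {"nsubj", "dobj", "iobj", "xcomp", "prepcomp"}

def extract_events(head, deps):
    """Return the full joint-context event plus its schematic sub-events.

    `head` is the verbal head lemma; `deps` is a list of (relation, lemma)
    pairs from a dependency parse, with modifiers already stripped.
    """
    slots = [("head", head)] + [(r, l) for r, l in deps if r in KEEP_RELS]
    full_event = tuple(slots)
    events = {full_event}
    # Schematic events abstract away one or more slots of the full event
    for size in range(1, len(slots)):
        for subset in combinations(slots, size):
            events.add(tuple(subset))
    return events

# "The student reads a book" -> full event plus schematic sub-events
for e in sorted(extract_events("read", [("nsubj", "student"), ("dobj", "book")]),
                key=len, reverse=True):
    print(e)
```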

2.2 The unification component: building semantic representations

Language can be seen as a set of instructions that the comprehender uses to represent the situation described by the speaker. In our framework, the event currently being processed is stored in a data structure called Semantic Representation (henceforth SR), which is akin to Discourse Representation Structures in DRT (Kamp 2013; Chersoni et al. 2019). Comprehension always occurs within the context of an existing SR: during online sentence processing, lexical items cue portions of GEK and the SR is dynamically updated by unifying the current content with the new information.

As anticipated, in our view the goal of sentence comprehension consists in recovering (reconstructing) the event e that the sentence is most likely to describe. The event e is the event that best satisfies all the constraints set by the lexical items in the sentence and by the active SR. Let \(w_{1}, w_{2}, \ldots , w_{n}\) be an input linguistic sequence (e.g., a sentence) that is currently being processed. Let \(SR_{i}\) be the semantic representation built for the input processed so far, \(w_{1},\ldots , w_{i}\), and let \(e_{i}\) be the event representation in \(SR_{i}\). When we process \(w_{i+1}\):

  1. \(GEK[w_{i+1}]\), the event knowledge associated with \(w_{i+1}\) in the lexicon, is activated;

  2. \(GEK[w_{i+1}]\) is integrated with \(SR_{i}\) to produce \(SR_{i+1}\).

We model semantic composition as an event construction and update function F, whose aim is to build a coherent SR by integrating the GEK cued by the linguistic elements that are composed:

$$\begin{aligned} F(SR_{i}, GEK[w_{i+1}]) = SR_{i+1}. \end{aligned}$$
(1)

The composition function carries out two distinct processes:

  • F unifies the events activated by two lexical items into a new complex event:

    \([_{E_{i}}\) nsubj:\({\mathbf {mechanic}}\) dobj:\({\mathbf {engine}}]\) + \([_{E_{j}}\) nsubj:\({\mathbf {mechanic}}\) head:\({\mathbf {check}}]\) = \([_{E_{k}}\) nsubj:\({\mathbf {mechanic}}\) head:\({\mathbf {check}}\) dobj:\({\mathbf {engine}}]\)

    In this example, the event of a mechanic performing an action on an engine activated by the noun mechanic and the event of a mechanic checking something activated by the verb check are unified into a new complex event of a mechanic checking an engine;

  • F weights the unified event \(e_{k}\) with a pair of scores \(\langle \theta _{e_{k}}, \sigma _{e_{k}} \rangle \), which express its semantic coherence (\(\theta _{e_{k}}\)) and the salience of its activation (\(\sigma _{e_{k}}\)).

Semantic coherence and activation salience, which will be illustrated in the following section, are the essential factors of our model of the argument complexity of semantic representations.

2.2.1 The cost of unification: semantic coherence

We introduce the score \(\theta _{e_{k}}\) to quantify the degree of semantic coherence of a unified event \(e_{k}\), under the assumption that such coherence depends on the mutual typicality of its components. Consider the following sentences:

[Examples omitted: a sentence describing a typical event (a student writing a thesis) and a sentence describing a much less typical event]

The event represented in the first sentence has a high degree of semantic coherence because all its components are mutually typical: student is a typical subject of the verb write, and thesis is strongly typical both as an object of write and as an object occurring in student-related events. Conversely, the components of the event expressed by the second sentence have a low level of mutual typicality, resulting in an event with much lower semantic coherence. Although the sentence is perfectly understandable, the described situation is more unusual.

Verb-argument typicality is measured in the computational and psycholinguistic literature with thematic fit values (McRae et al. 1998). In the present proposal, the notion of thematic fit is extended in order to account for the degree of coherence of the events described by whole sentences. In computational approaches (Baroni and Lenci 2010), thematic fit is modeled with vector cosine in the following way:

Given a list of lexemes \(\mathbf {c_{1}},\ldots ,\mathbf {c_{n}}\) occurring in the same event structures as \({\mathbf {b}}\) with the role \(s_{i}\), ordered by decreasing salience, \(\theta ({\mathbf {a}}|s_{i}, {\mathbf {b}})\) (the thematic fit of \({\mathbf {a}}\) given \({\mathbf {b}}\) and the role \(s_{i}\)) is the cosine between \({\mathbf {a}}\) and the prototype vector built out of the top \(k\) values \(\mathbf {c_{1}},\ldots ,\mathbf {c_{k}}\), with \(1 \le k \le n\).

For instance, the thematic fit of student as an agent in writing-events is given by the cosine between the embedding of student and the centroid vector built out of the k most salient agents of write. Similarly, the typicality of thesis as a patient related to student (i.e., as a patient in events involving student as an agent) could be assessed by measuring the cosine between the embedding of thesis and the centroid vector built out of the k most salient patients related to student, and the typicality of thesis as a patient of write can be measured in the same way. In other words, typical fillers of a given role are used to build a sort of abstract distributional representation of an “ideal” filler for that role, and the thematic fit of a new candidate is computed as the distance between its embedding and the vector of the ideal filler.
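As a rough illustration of this procedure (the function name thematic_fit and the toy salience-ranked filler list are ours, and random vectors stand in for corpus-derived embeddings):

```python
import numpy as np

def thematic_fit(candidate, ranked_fillers, vectors, k=20):
    """Cosine between the candidate vector and the centroid (prototype)
    of the k most salient fillers of the role (e.g., LMI-ranked)."""
    top = sorted(ranked_fillers, key=lambda x: x[1], reverse=True)[:k]
    prototype = np.mean([vectors[w] for w, _ in top], axis=0)
    c = vectors[candidate]
    return float(c @ prototype / (np.linalg.norm(c) * np.linalg.norm(prototype)))

# Toy data: salience-ranked agents of "write" and random 50-d embeddings
rng = np.random.default_rng(0)
vectors = {w: rng.standard_normal(50)
           for w in ["student", "writer", "journalist", "teacher"]}
agents_of_write = [("writer", 310.2), ("journalist", 150.7), ("student", 120.5)]
print(thematic_fit("student", agents_of_write, vectors, k=2))
```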

Although we adopt the same approach for measuring the typicality of the participants, an important problem is how the partial scores of single event-participant combinations are combined in a global semantic coherence score. In our work, we experimented with two different solutions:

  • as in Chersoni et al. (2016a) and Chersoni et al. (2017a), semantic coherence is assessed as the product of all the partial thematic fit scores for all the event-participant (and inter-participant) combinations within a sentence;Footnote 2

  • similarly to Lenci (2011) and Chersoni et al. (2017b), semantic coherence is assessed as the cosine similarity between the arguments of the sentence and the prototype vector of current argument expectations, which is dynamically updated as new information from newly-saturated arguments comes in.

In the first case, the global score \(\theta _{e_{k}}\) of an event \(e_{k}\) is defined as:

$$\begin{aligned} \theta _{e_{k}} = \prod _{a,b,s_{i} \in e}{\theta ({\mathbf {a}} | s_{i}, {\mathbf {b}})} \end{aligned}$$
(2)

For example, given a sentence like The student drinks beer, the score \(\theta _{e_{k}}\) would be the product of three factors: (i) the thematic fit of student as an agent (AG) of drink; (ii) the thematic fit of beer as a co-participant (CO) of student; (iii) the thematic fit of beer as a patient (PA) of drink. Thus, \(\theta _{e_{k}}\) would be computed as:

$$\begin{aligned} \theta _{e_{k}} = \theta (\mathbf {student} | AG, \mathbf {drink}) \cdot \theta (\mathbf {beer} | CO, \mathbf {student}) \cdot \theta (\mathbf {beer} | PA, \mathbf {drink}). \end{aligned}$$
(3)

The product between thematic fit scores directly captures the idea of the mutual typicality between all event participants. Indeed, as an effect of the product, if the partial thematic fit score between an argument pair is low (e.g., the agent–patient combination), this will decrease the semantic coherence of the entire event. In the experiments in Sect. 3, we refer to the models using this computation of semantic coherence score as \(\mathbf {ThetaProd}\) models.
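A minimal sketch of the ThetaProd computation, assuming the three role prototypes have already been built as centroids of typical fillers (all vectors below are random placeholders):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def theta_prod(agent_vec, patient_vec, proto_ag_of_verb, proto_pa_of_verb, proto_co_of_agent):
    """Eqs. (2)-(3): product of the partial thematic fit scores
    theta(agent | AG, verb) * theta(patient | CO, agent) * theta(patient | PA, verb)."""
    return (cosine(agent_vec, proto_ag_of_verb)
            * cosine(patient_vec, proto_co_of_agent)
            * cosine(patient_vec, proto_pa_of_verb))

# Toy usage for "The student drinks beer" (all vectors are random stand-ins)
rng = np.random.default_rng(1)
student, beer, ag_drink, pa_drink, co_student = (rng.standard_normal(50) for _ in range(5))
print(theta_prod(student, beer, ag_drink, pa_drink, co_student))
```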

The alternative approach consists in building a prototype vector for the final argument that needs to be predicted (e.g., the patient in an agent–verb–patient triple) using a single representation that incorporates the updated expectations for the verb given the previously-realized arguments (Lenci 2011; Chersoni et al. 2017b). In this model, the update on the expectation EX for a given filler caused by a new input word (e.g., a verb combining with an agent) is modeled with a function f(x) that combines the prototypes built out of the typical fillers for every word \(w_{i}\).

$$\begin{aligned} EX_{role}(\langle \mathbf {w_{1}}, \mathbf {w_{2}} \rangle ) = f(EX_{role_{1}}(\mathbf {w_{1}}), EX_{role_{2}}(\mathbf {w_{2}})). \end{aligned}$$
(4)

Once the expectation vector has been calculated, the filler fit for a role can be computed by measuring the cosine similarity between the filler and the expectation vector. For example, the procedure to estimate how likely burglar is as a patient of the policeman arrested the... is the following:

  1. we first build a prototype vector out of the embeddings of nouns typically co-occurring with the agent policeman;

  2. then we build another prototype vector out of the embeddings of typical patients of the verb arrest;

  3. we combine the prototype vectors with f(x);

  4. at this point, we can estimate the filler thematic fit by calculating its cosine similarity (cosSim) to the updated prototype vector:

    $$\begin{aligned}&EX_{PA}(\mathbf {burglar} | \langle \mathbf {policeman}, \mathbf {arrest} \rangle ) = cosSim(\mathbf {burglar}, f(EX_{CO}(\mathbf {policeman}),\nonumber \\&\quad EX_{PA}(\mathbf {arrest}))). \end{aligned}$$
    (5)

In Chersoni et al. (2017b), the best performing function f turned out to be the simple vector sum of the prototype vectors, and thus we used vector sum for the experiments presented in Sect. 3. According to this second model, semantic coherence is conceived as the coherence between the dynamically-updated expectations for the participants of the event described by a sentence and the fillers saturating the participant roles. In this case, the global semantic coherence depends on how well the last sentence argument matches the expectations generated from the sentence context:

$$\begin{aligned} \theta _{e_{k}} = EX_{lastRole}. \end{aligned}$$
(6)

In our experiments, we refer to this model as ThetaProtoSum.
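A comparable sketch for the ThetaProtoSum strategy, again with random placeholder vectors (in the actual model the partial prototypes are centroids of corpus-derived typical fillers):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def theta_proto_sum(filler_vec, partial_prototypes):
    """Eqs. (4)-(6): sum the partial expectation prototypes into a single
    expectation vector and score the filler by cosine with it."""
    expectation = np.sum(partial_prototypes, axis=0)
    return cosine(filler_vec, expectation)

# "The policeman arrested the burglar": fit of "burglar" as the patient
rng = np.random.default_rng(2)
burglar = rng.standard_normal(50)
ex_co_policeman = rng.standard_normal(50)  # prototype of co-participants of "policeman"
ex_pa_arrest = rng.standard_normal(50)     # prototype of patients of "arrest"
print(theta_proto_sum(burglar, [ex_co_policeman, ex_pa_arrest]))
```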

2.2.2 The cost of unification: event salience

In our perspective, event representations are not necessarily built on the fly: Events already stored in the GEK are activated during processing and they can progressively change their activation levels, as new words are processed. Ideally, events that satisfy all the constraints imposed by the incoming words should increase their activation, becoming the “best candidates” of a retrieval operation.

In order to account for the role of event memorization and retrieval, a second score, \(\sigma _{e_{k}}\), is used to weight the salience of the unified event \(e_{k}\) by combining the weights of \(e_{i}\) and \(e_{j}\) into a new weight assigned to \(e_{k}\). The activation of an event e in GEK is computed by summing the activation scores of the single lexical items cuing it (Eq. 8), which are in turn estimated with conditional probabilities of the event given each lexical item in the input (Eq. 7):

$$\begin{aligned} \sigma _{i}= & {} P (e | i) = \frac{P(e,i)}{P(i)}, \end{aligned}$$
(7)
$$\begin{aligned} F(\sigma _{i},\sigma _{j})= & {} \sigma _{e_{k}} = \sigma _{i} + \sigma _{j}. \end{aligned}$$
(8)

The score \(\sigma _{e_{k}}\) measures the degree to which a unified event is activated by the linguistic expressions composing it. Consequently, events that are cued by many constructions in the sentence incrementally increase their salience.

It should be pointed out that the activation mechanism works not only for fully-specified events, but also for schematic ones (e.g., the noun student is supposed to also activate generic student reading events in GEK). When we compute the global activation score for a sentence \(s_{e_{k}}\), we sum the scores of (i) the entire event \(e_{k}\), if such an event is stored in GEK; (ii) the sub-events corresponding to all the partial combinations of the verb and its arguments. The global activation score for the sentence \(s_{e_{k}}\) is computed as follows:

$$\begin{aligned} \sigma _{e_{k}} = \sum _{e_{i} \in E} \sigma _{e_{i}}, \end{aligned}$$
(9)

where the set of events E includes both the full event \(e_{k}\) and all the sub-events \(e_{i}\) activated by the lexical items in the input sentence.
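The following sketch illustrates this activation computation with toy maximum-likelihood estimates; the data structures, function name and counts are invented for exposition:

```python
def global_sigma(matching_events, cue_words, event_word_counts, word_counts):
    """Eqs. (7)-(9): for every matching (sub-)event e and cue word w, add the
    conditional probability P(e | w) = count(e, w) / count(w)."""
    total = 0.0
    for e in matching_events:
        for w in cue_words:
            if word_counts.get(w, 0) > 0:
                total += event_word_counts.get((e, w), 0) / word_counts[w]
    return total

# Toy counts for "The student reads the book"
full = ("nsubj:student", "head:read", "dobj:book")
schematic = ("nsubj:student", "dobj:book")
event_word_counts = {(full, "student"): 15, (full, "read"): 15, (full, "book"): 15,
                     (schematic, "student"): 40, (schematic, "book"): 40}
word_counts = {"student": 1000, "read": 800, "book": 1200}
print(global_sigma([full, schematic], ["student", "read", "book"],
                   event_word_counts, word_counts))
```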

To sum up, we weigh unified events along two dimensions: internal semantic coherence (\(\theta \)), and degree of activation by linguistic expressions (\(\sigma \)). The latter is used to estimate the importance of "ready-to-use" event structures stored in GEK and retrieved during sentence processing. Salience scores can also be used to identify missing pieces of information, such as implicit arguments. For instance, suppose that we have the sentence The student reads the book, with the location role left unexpressed. If library-related events are simultaneously cued by student, read and book, their score will get higher during the integration, with the result that library will become a highly salient (i.e., highly probable) location for the event described in the sentence. This is a piece of unexpressed information that will be recovered during sentence comprehension. On the other hand, the \(\theta \) score allows us to weigh events that are not available in the Memory component. In fact, the Unification component can construct new events never observed before, thereby accounting for the ability to comprehend novel sentences representing atypical and yet possible events.

Given an input sentence s, its interpretation INT(s) is the event \(e_{k}\) with the highest semantic composition weight (SCW), defined as follows:

$$\begin{aligned} \text {INT}(s_{k})= & {} \underset{e_{k}}{\text {argmax}} (\text {SCW}(e_{k})), \end{aligned}$$
(10)
$$\begin{aligned} \text {SCW}(e_{k})= & {} \theta _{e_{k}}+\sigma _{e_{k}}. \end{aligned}$$
(11)

Finally, we model the argument complexity (ArgComp) of a sentence \(s_{e_{k}}\) as inversely related to the SCW of the event representing its interpretation:

$$\begin{aligned} \text {ArgComp}(s) = \frac{1}{\text {SCW}(\text {INT}(s))}. \end{aligned}$$
(12)

The less internally coherent the event represented by the sentence, and the weaker its activation by the lexical items, the more cognitively expensive the unification and the higher the argument complexity of the sentence. Therefore, the joint effect of the \(\sigma \) and \(\theta \) scores is meant to capture the "balance between storage and computation" driving sentence processing (Baggio and Hagoort 2011), and the two scores can be considered as facilitating factors in the process of building semantic representations for the events described in natural language.
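To make the combination of the two scores concrete, the following sketch (with purely illustrative numbers) computes the SCW of Eq. 11 and the resulting ArgComp of Eq. 12:

```python
def scw(theta, sigma):
    """Semantic composition weight (Eq. 11)."""
    return theta + sigma

def arg_comp(theta, sigma):
    """Argument complexity (Eq. 12): the inverse of the SCW of the interpretation."""
    return 1.0 / scw(theta, sigma)

# A familiar, coherent event vs. a novel, weakly activated one (toy values)
print(arg_comp(theta=0.42, sigma=0.11))   # lower complexity
print(arg_comp(theta=0.15, sigma=0.0))    # higher complexity
```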

3 Case studies

We test the ability of our distributional model of argument complexity to account for the different processing costs of (i) typical vs. atypical verb-argument combinations (Sect. 3.2), and (ii) metonymic vs. non-coercion sentences (Sect. 3.3).Footnote 3

3.1 Experimental settings

First of all, we populated the DEG modelling GEK with events extracted from parsed corpora. We followed the procedure proposed in Chersoni et al. (2016b) to extract syntactic joint contexts from a concatenation of four different corpora: the Reuters Corpus Vol. 1 (Lewis et al. 2004); the ukWac and the Wackypedia Corpus (Baroni et al. 2009) and the British National Corpus (Leech 1992).Footnote 4 For each sentence, we generated a surface event representation by extracting the verb and its direct dependencies. In the present case, the dependency relations of interest are subject (nsubj), direct (dobj) and indirect object (iobj), infinitive and gerund complements (xcomp), and a generic prepositional complement relation (prepcomp), onto which we mapped all the complements introduced by a preposition. As in Chersoni et al. (2016b), we discarded all the adjectival/adverbial modifiers and just kept their heads. For instance, from the joint context director-n-nsubj write-v-head article-n-dobj we generated the event \([_{E}\) nsubj:\(\mathbf {director}\) head:\(\mathbf {write}\) dobj:\(\mathbf {article}]\). For each joint context, we also generated schematic events from its dependency subsets. We extracted a total of 1,043,766 events, each including at least one of the words of the evaluation datasets.

All the lexemes in the events are represented as distributional vectors. We built a syntax-based distributional semantic model by using as targets the 20K most frequent nouns and verbs in our concatenated corpus, plus any other word occurring in the events in GEK; words with frequency below 100 were excluded. The total number of targets is 20,560. As vector dimensions, we used the same target words typed with the dependency relations used to build the joint contexts (e.g., nsubj:chef and dobj:pizza are examples of contexts for the verb to cook). Syntactic co-occurrences were weighted with Local Mutual Information (Evert 2004):

$$\begin{aligned} LMI(t,r,f) = log\left( \frac{O_{trf}}{E_{trf}}\right) \cdot O_{trf} \end{aligned}$$
(13)

\(O_{trf}\) is the co-occurrence frequency of the target t, the syntactic relation r and the filler f, and \(E_{trf}\) is the expected co-occurrence frequency under independence. LMI values were then used to rank the typical fillers for the roles in the computation of the \(\theta \) components. Since our datasets are composed of agent–verb–patient triplets, we used the following approximations for semantic roles (Baroni and Lenci 2010; Lenci 2011): (i) the nsubj relation for the agent role; (ii) the dobj relation for the patient role; (iii) a generic verb relation for co-participants. Concretely, this relation links noun pairs that appear as subject and direct object of the same verb.
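For reference, the following sketch illustrates the LMI weighting of Eq. 13, assuming the expected frequency under independence is estimated from the marginal counts (all counts below are invented for illustration):

```python
import math

def lmi(o_trf, count_t, count_r, count_f, n_triples):
    """Local Mutual Information (Eq. 13): log(O/E) * O, where the expected
    frequency under independence is E = count_t * count_r * count_f / n**2."""
    if o_trf == 0:
        return 0.0
    expected = (count_t * count_r * count_f) / (n_triples ** 2)
    return math.log(o_trf / expected) * o_trf

# Toy counts for a (cook, dobj, pizza) triple
print(lmi(o_trf=50, count_t=2000, count_r=100000, count_f=500, n_triples=1_000_000))
```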

3.2 Case Study 1: modeling verb argument expectations

As a first test of our framework, we measure the argument complexity of the sentences in the dataset by Bicknell et al. (2010) (henceforth, the Bicknell dataset). The Bicknell dataset was collected to verify the hypothesis that the typicality of a verb's direct object depends on the subject argument. For this purpose, the authors selected 50 verbs, each paired with 2 agent nouns that altered the scenario evoked by the agent–verb combination.

Plausible patients for each agent–verb pair were obtained with production norms, in order to generate triplets where the patient was congruent with the agent and with the verb. For each congruent triple, an incongruent one was generated by combining each verb–congruent patient pair with the other agent noun, in order to obtain items describing atypical situations. The final dataset includes 100 pairs of agent–verb–patient triplets, which were used to build the stimuli for a self-paced reading and an ERP experiment. For instance, subjects were presented with sentence pairs such as:

[Example stimuli omitted: a congruent and an incongruent agent–verb–patient sentence sharing the same verb and patient]

The sentences of each pair contain the same verb and the same patient, differing only in the agent. Given the agent, the patient is a preferred argument of the verb in the congruent condition, while it is far less plausible in the incongruent condition. Bicknell et al. (2010) reported that the congruent condition produced shorter reading times and smaller N400 signatures. Their conclusion was that verb argument expectations are dynamically updated during sentence processing, by integrating some kind of general knowledge about events and their typical participants. Later, Lenci (2011) was the first to use the Bicknell dataset to evaluate a distributional model for composing argument expectations, on the task of assigning a higher thematic fit score to the congruent combinations than to the incongruent ones.

We interpret Bicknell’s experimental data as suggesting that congruent sentences have less argument complexity than incongruent sentences. Consistently, we predict that our models will assign a higher argument complexity score to incongruent triplets than to congruent ones. Given a congruent–incongruent triple pair, we score a hit each time a model assigns a higher ArgComp score to the incongruent one. Models are primarily evaluated in terms of their accuracy in this binary classification task.

3.2.1 Complexity models

For each test triple, we computed the \(\sigma \) and \(\theta \) scores:

  • \(\theta \) represents the semantic coherence of the event represented by the sentence, and is obtained by measuring the mutual typicality of its components. As we illustrated in Sect. 2.2.1, we tested two models that differ in the way they estimate semantic coherence:

    1. In the ThetaProd model, we computed the \(\theta \) values as the product of partial thematic fit scores. Following Eq. 2, we computed \(\theta _{e}\) for each triple as the product of (i) the thematic fit of nsubj given the verb head, \(\theta _{S,V}\); (ii) the thematic fit of dobj given the verb head, \(\theta _{O,V}\); and (iii) the thematic fit of dobj given nsubj, \(\theta _{S,O}\). In particular, \(\theta _{S,V}\) is the cosine between the vector of nsubj and the prototype vector built out of the k most salient subjects of the verb head (e.g., the cosine between the vector of journalist and the centroid vector of the most salient subjects of check); \(\theta _{O,V}\) is the cosine between the vector of dobj and the centroid vector built out of the k most salient direct objects of the verb head (e.g., the cosine between the vector of article and the prototype vector of the most salient objects of check); and \(\theta _{S,O}\) is the cosine between the vector of dobj and the centroid vector built out of the k most salient direct objects occurring in events where the subject is nsubj (e.g., the cosine between the vector of article and the prototype vector of the most salient objects of events whose subject is journalist);

    2. In the ThetaProtoSum model, the \(\theta _{e}\) of each triple was computed as the similarity score between the vector of the dobj and the vector of the expectations for the dobj given nsubj and the verb head, as in Eq. 4. Vector sum is the function that we used to combine partial prototypes in the global expectation vector for the patient (Chersoni et al. 2017b).

    We identified the typical fillers for each role as the set of filler nouns with the strongest LMI score with the target word t and the relation r. Following Baroni and Lenci (2010), we set the parameter k (i.e., the number of typical fillers used to build the prototypes) to 20;

  • to compute the \(\sigma \) score, given an event \(e_{k}\), we looked for a matching syntactic joint context in our DEG repository and for schematic events matching the sub-chunks of \(e_{k}\) (some examples are shown in Table 1). For each of these events \(e_{i}\), we computed the activation score by using Eqs. 7 and 8. Partial scores were then summed with Eq. 9 to obtain the global \(\sigma _{e}\).

Table 1 Examples of schematic events retrieved from the DEG to compute the \(\sigma \) of a given joint context

Finally, after computing \(\theta _{e}\) and \(\sigma _{e}\) for each of our test triplets, we used Eqs. 10, 11, and 12 to derive the final ArgComp scores.

3.2.2 Baseline models

Besides our models of argument complexity, we computed two baselines inspired by the early models of compositional distributional semantics. Mitchell and Lapata (2010) proposed two simple models for vector composition. Given the vectors of the word u and the word v, the vector representation of the expression p that they compose is computed as follows:

  • in the simplified additive model (Sum):

    $$\begin{aligned} \mathbf {p} = \alpha \mathbf {u} + \beta \mathbf {v}, \end{aligned}$$
    (14)

    where both the \(\alpha \) and \(\beta \) weights are set to 1 (i.e., the output vector is the component-wise sum of the input ones);

  • in the pointwise multiplicative model (Product):

    $$\begin{aligned} p_{i} = u_{i} \cdot v_{i}. \end{aligned}$$
    (15)

Despite their simplicity, such models turned out to be extremely efficient and competitive in a wide variety of compositionality-related tasks (Rimell et al. 2016). For each triple in our dataset, we used Sum and Product to build a vector representation of the patient expectations given the agent–verb combination. Then, we measured the cosine similarity between the output vector and the patient vector, scoring a hit whenever the score was higher for the congruent condition than for the incongruent one. The principle is the same as in the ThetaProtoSum model: The fit of the expectations is assessed in terms of similarity between the vector of the last argument to be predicted and a vector representing the previous context, the difference being that the baseline models do not have information about typical role fillers and simply combine the vectors of the verb and its agent.
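As a toy illustration (with random vectors standing in for the real embeddings; the helper names are ours), the two baselines and the scoring step amount to the following:

```python
import numpy as np

def compose(u, v, method="sum"):
    """Mitchell & Lapata (2010) composition: component-wise sum (Eq. 14)
    or component-wise product (Eq. 15), with alpha = beta = 1."""
    return u + v if method == "sum" else u * v

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Score the patient against the composed agent-verb vector (toy vectors)
rng = np.random.default_rng(3)
agent, verb, patient = (rng.standard_normal(50) for _ in range(3))
for method in ("sum", "product"):
    print(method, cosine(compose(agent, verb, method), patient))
```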

Another baseline model is based on the notion of Surprisal. After extracting all the subject–verb–object triples, we computed the probabilities of the trigrams and of the subject–verb bigrams with Add-One Smoothing (Jurafsky and Martin 2014). For each triple t, Surprisal estimates were then computed as follows:

$$\begin{aligned} Surprisal(t) = - \log _{2} P(nsubj_{t}, verb_{t}, dobj_{t} | nsubj_{t}, verb_{t}), \end{aligned}$$
(16)

where \(nsubj_{t}\), \(verb_{t}\) and \(dobj_{t}\) are, respectively, the agent, the verb and the patient of t. The model accuracy is computed as the percentage of atypical triples to which it assigns a higher surprisal score.
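A rough sketch of this estimate (the exact smoothing setup may differ; the counts and vocabulary size below are invented):

```python
import math

def surprisal(s, v, o, trigram_counts, bigram_counts, vocab_size):
    """Surprisal of the patient given agent and verb (Eq. 16),
    with Add-One (Laplace) smoothing of the conditional estimate."""
    tri = trigram_counts.get((s, v, o), 0) + 1
    bi = bigram_counts.get((s, v), 0) + vocab_size
    return -math.log2(tri / bi)

# Toy counts
trigram_counts = {("journalist", "check", "article"): 12}
bigram_counts = {("journalist", "check"): 40}
print(surprisal("journalist", "check", "article", trigram_counts, bigram_counts, 20000))
print(surprisal("mechanic", "check", "article", trigram_counts, bigram_counts, 20000))
```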

Our models are also compared with the best configuration in Lenci (2011), that is, the Product model (prod-l11). This model is based on the Distributional Memory data and estimates thematic fit by composing a prototype for the expectations on the patient, given the agent and the verb. In prod-l11, a single prototype for the patient slot is built by updating the typicality scores: If a filler f has a score \(\alpha _{subj}\) given the agent and a score \(\alpha _{verb}\) given the verb, its typicality will be computed as \(\alpha _{subj}*\alpha _{verb}\), and the prototype is built out of the 20 top fillers in the updated ranking. This way, arguments that are not compatible with both the verb and the agent are filtered out.

3.2.3 Results on the Bicknell dataset

Table 2 Model accuracy and coverage for the classification task on the Bicknell dataset

All models except for the Sum baseline differentiate between the two conditions. The Wilcoxon rank sum test on the output scores of the different models reveals that:

  • the ArgComp scores assigned by ThetaProtoSum to the incongruent condition are significantly higher (\(p < 0.05\));

  • the ArgComp scores assigned by ThetaProd to the incongruent condition are significantly higher (\(p < 0.01\));

  • the thematic fit scores assigned by the baseline Product to the incongruent condition are significantly lower (\(p < 0.01\)).Footnote 5

Perhaps surprisingly, the simple Product baseline manages to obtain the best accuracy in the binary classification task (cf. Table 2). This confirms that it is difficult to beat baselines based on simple vector operations in many compositionality-related tasks, a finding reported also by other studies on compositional distributional models (Mitchell and Lapata 2010; Rimell et al. 2016). Moreover, it has been noticed that vector multiplication eases the problem of lexical ambiguity, since dimensions that are inconsistent with the more appropriate meaning in context are filtered out. This could explain the particularly strong performance of this baseline. Still, despite being outperformed, our models also achieve high levels of accuracy and assign significantly different scores to the two conditions.

We consider the performance of ThetaProd to be particularly satisfactory, as it manages to outperform the original model of expectation update by Lenci (2011) when tested on the covered triplets (73.8%).Footnote 6 Moreover, its classification accuracy does not differ significantly from that of the best-performing Product baseline (\(p = 0.4\)), while the same baseline retains a marginally significant advantage over the other complexity model, ThetaProtoSum (\(p < 0.1\)).Footnote 7 Compared to the other baselines, the advantage of ThetaProd over Sum is significant at \(p < 0.05\), while the difference with the Surprisal baseline is only marginally significant (\(p < 0.1\)).

Concerning the coverage of our models, we should also mention that for several of the triplets in the dataset (48 out of 200) the contribution of the \(\sigma \) component was null, as no matching joint context was retrieved from the DEG. Moreover, a syntactic joint context for the entire event could be retrieved for only 22 out of the 200 triplets. Another important point is that the task of composing and updating argument expectations is generally addressed by means of thematic fit models (Lenci 2011; Chersoni et al. 2017b) corresponding to our \(\theta \) component. Thus, one might wonder if it is worth making the model more complex by introducing the extra parameter \(\sigma \).

Table 3 Accuracy scores for the two complexity models without the \(\sigma \) component and the accuracy loss with respect to the full model

Table 3 shows the results for our complexity models after excluding the \(\sigma \) scores from the computation. The accuracy of the ThetaProtoSum model remains unchanged, meaning that the direct retrieval of events from the DEG does not contribute to the correct classification of the triplets. On the other hand, the accuracy of ThetaProd slightly drops, which means that the two components of this version of the model do not correctly classify exactly the same triplets. Although the difference (also considering the small size of the dataset) is too small to reach significance, the contribution of the two components seems to be more balanced in ThetaProd. From these data, it seems clear that an implementation of the Memory component based only on textual corpora suffers from data sparsity (a problem shared with Surprisal models, even when smoothed), and future developments of argument complexity models will have to take this factor into account.

3.3 Case study 2: logical metonymy

In the second case study, we test our distributional approach to argument complexity on two different tasks: (i) modeling the reading times of metonymic sentences, and (ii) predicting the covert event that is implicitly recovered as part of their interpretation (cf. Sect. 1). For our experiments, we used two datasets created for previous psycholinguistic studies: the McElree dataset (McElree et al. 2001) and the Traxler dataset (Traxler et al. 2002). Each dataset includes three different experimental conditions, contrasting constructions requiring a type shift with those requiring normal composition:

(8)
a. The author started the book.
b. The author wrote the book.
c. The author read the book.

Sentence (8a) corresponds to the metonymic condition (MET), while sentences (8b) and (8c) correspond to non-metonymic constructions, with the difference that (8b) contains a typical event given the subject and the object (HIGH_TYP), whereas (8c) expresses a plausible but less typical event (LOW_TYP). The McElree dataset was created for the self-paced reading study by McElree et al. (2001), and includes 99 sentences arranged into 33 triplets like (8), while the Traxler dataset was used in the eye-tracking experiment by Traxler et al. (2002) and contains 108 sentences (36 triplets). Three triplets of the McElree dataset were discarded, because some of their words had very low frequency in the training corpora.

3.3.1 Modeling the processing times of logical metonymy

The models were tested on the triplets corresponding to the agent–verb–patient combinations of the original datasets, and the \(\sigma \) and \(\theta \) scores were computed as in Case Study 1. We predict that our models ThetaProd and ThetaProtoSum will assign higher ArgComp scores to metonymic sentences than to non-coercion sentences, because the former do not comply with the semantic preferences of the event-selecting verb. According to Zarcone et al. (2014), it is exactly the low thematic fit between the event-selecting verb and the entity-denoting object that triggers complement coercion and that, at the same time, causes the extra processing load.

The baselines are the same ones we used for Case Study 1 (cf. Sect. 3.2.2), plus the following ones:

ZetAl13:

Zarcone et al. (2013) proposed to model the processing costs of the same datasets by using a simpler distributional model, in which the cost of each dataset triple was computed as

$$\begin{aligned} 1 - \theta (\mathbf {noun}|patient, \mathbf {verb}) \end{aligned}$$
(17)

Therefore, this model only considers the thematic fit \(\theta \) of the patient noun, without taking into account the agent filler.

SurprisalD17:

A second surprisal model, similar to the one described in the study by Delogu et al. (2017) on logical metonymy, is based on the probabilities of the trigrams composed of the verb, a determiner and the object noun. Given a trigram t, its surprisal score is computed as follows (for simplicity, we abstract away from the specific determiner):

$$\begin{aligned}&SurprisalD17(t) \nonumber \\&\quad = - \log _{2} P(verb_{t}, DET, dobj_{t} | verb_{t}, DET), \end{aligned}$$
(18)

where \(verb_{t}\) and \(dobj_{t}\) are, respectively, the verb and the patient of the triple t, and DET is a generic determiner of the direct object. In their eye-tracking and ERP experiments, Delogu et al. (2017) reported that surprisal can fully account for the extra processing costs of logical metonymies. In other words, the expectedness of the object noun was shown to be the main determining factor of processing difficulty, without the need of postulating coercion-specific costs.

The ThetaProd model turns out to be the one most faithful to the psycholinguistic results. On the McElree dataset (cf. Table 4; Fig. 2 top), the Kruskal–Wallis rank sum test revealed a main effect of sentence type on the ArgComp scores assigned by ThetaProd (\(\chi ^{2} = 17.18\), \(p < 0.001\)). Post hoc tests showed that the ArgComp scores for the HIGH_TYP condition are significantly lower than those in the LOW_TYP (\(p < 0.05\)) and MET conditions (\(p < 0.001\)). These results exactly mirror those of McElree et al. (2001) for the reading times at the type-shifted noun (both conditions engendered significantly longer reading times than the preferred condition).

Table 4 Results of the pairwise post hoc comparisons for the three conditions on the McElree dataset (Wilcoxon rank sum test with Bonferroni correction), scores assigned by ThetaProd

A main effect of sentence type on the ArgComp scores was also found for the Traxler dataset (\(\chi ^{2} = 15.39\), \(p < 0.001\)). In their eye-tracking experiment (Experiment 1), Traxler et al. (2002) found no significant difference between the HIGH_TYP and LOW_TYP conditions, but they observed higher values for second-pass and total time data in the MET condition with respect to the other two. Interestingly, the ThetaProd model produced similar results (cf. Table 5; Fig. 2 bottom): post hoc tests reveal no difference between the non-coerced conditions, but significantly higher ArgComp scores for metonymic sentences with respect to both the HIGH_TYP (\(p < 0.001\)) and the LOW_TYP condition (\(p < 0.05\)).

Table 5 Results of the pairwise post hoc comparisons for the three conditions on the Traxler dataset (Wilcoxon rank sum test with Bonferroni correction), scores assigned by ThetaProd
Fig. 2

ArgComp scores for the McElree (left) and Traxler (right) datasets, computed with the ThetaProd model

The ThetaProtoSum model also assigned significantly different scores to the three conditions, both in the McElree (\(\chi ^{2} = 28.64\), \(p < 0.001\)) and in the Traxler dataset (\(\chi ^{2} = 26.656\), \(p < 0.001\)). However, this model did not reproduce the experimental results as accurately, since the assigned scores simply discriminate between metonymic and non-metonymic conditions in both datasets (see Tables 8, 9). This pattern is very close to the one found with ZetAl13, which discriminates between HIGH_TYP and MET (\(p < 0.001\)) and between LOW_TYP and MET (\(p < 0.01\)) on both datasets. Additionally, ZetAl13 found a marginally significant difference between HIGH_TYP and LOW_TYP in the McElree dataset.

Table 6 Results of the pairwise post hoc comparisons for the three conditions on the McElree dataset (Wilcoxon rank sum test with Bonferroni correction), scores assigned by ThetaProtoSum
Table 7 Results of the pairwise post hoc comparisons for the three conditions on the Traxler dataset (Wilcoxon rank sum test with Bonferroni correction), scores assigned by ThetaProtoSum
Table 8 Summary table with the results of all the pairwise comparisons on the McElree dataset for all models
Table 9 Summary table with the results of all the pairwise comparisons on the Traxler dataset for all models

Concerning the baseline models, the original Surprisal (with Add-One smoothing) fails to differentiate between conditions in both datasets. SurprisalD17, instead, generates significantly different scores on both the McElree (\(\chi ^{2} = 6.05\), \(p < 0.05\)) and the Traxler dataset (\(\chi ^{2} = 7.02\), \(p < 0.05\)), but the only conditions that differ are HIGH_TYP and MET (in both cases, \(p < 0.05\)). Finally, both the simple DSM baselines struggle to differentiate between the three experimental conditions: for the Kruskal–Wallis test, the differences between the scores assigned by Sum and Product never reach significance, with the only exception of Sum on the McElree dataset (\(p < 0.05\)). Turning to pairwise comparisons, the pattern is different from the one reported by McElree and colleagues, since no significant difference between HIGH_TYP and LOW_TYP has been found (\(p = 0.9\)).

3.3.2 Identifying the covert event

We assume that the SR of a metonymic sentence like The author starts the book contains the following complex event:

\([_{E}\) \({\mathbf {author}}\) \({\mathbf {start}}\) E\(_{cov}]\) \([_{E_{cov}}\) \({\mathbf {author}}\) E\(_{cov}\) \({\mathbf {book}}]\)

where E\(_{cov}\) is the covert event recovered when interpreting the sentence (e.g., writing). We modeled covert event retrieval as a binary classification task: Given a set of candidate hidden events, we argue that the selected interpretation is the one that minimizes argument complexity. This claim was tested with the following procedure:

  1. for each metonymic sentence (e.g., The author starts the book) in the McElree and Traxler datasets, we selected as candidate covert events (E\(_{cov}\)) the verbs of the non-coerced sentences, which we refer to as HIGH_TYP_EVENT (e.g., write) and LOW_TYP_EVENT (e.g., read), respectively. We therefore obtain pairs of quadruples like the following:

figure j
  2. for each sentence \(S V_{met} O\), we computed SCW(e) (cf. Eq. 11) for the events composing its interpretation, that is \([_{E}\) S V\(_{met}\) E\(_{cov}]\) and \([_{E_{cov}}\) S E\(_{cov}\) O] (i.e., we computed it for both the HIGH_TYP and the LOW_TYP quadruple in each pair);Footnote 8

  3. the model accuracy was computed as the percentage of test items for which SCW(E\(_{cov}\) = HIGH_TYP_EVENT) > SCW(E\(_{cov}\) = LOW_TYP_EVENT). A minimal code sketch of this procedure is given below.
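The following sketch illustrates the classification procedure. The SCW(e) score of Eq. 11 is replaced here by a toy lookup table, and the two SCW scores of each interpretation are combined by summation; both choices are illustrative assumptions, since only the comparison in step 3 matters for the accuracy.

```python
# Sketch of covert event classification. SCW(e) (Eq. 11) is replaced by a toy
# lookup table; all names and scores are illustrative.
TOY_SCW = {
    ("author", "start", "write"): 0.8,  # [S V_met E_cov]
    ("author", "write", "book"): 0.9,   # [S E_cov O]
    ("author", "start", "read"): 0.7,
    ("author", "read", "book"): 0.6,
}

def scw(subject, verb, obj):
    """Toy stand-in for the SCW score of the event [subject verb obj]."""
    return TOY_SCW.get((subject, verb, obj), 0.0)

def covert_event_score(subject, met_verb, obj, cov_event):
    # The interpretation of "S V_met O" with candidate covert event E_cov is the
    # pair of events [S V_met E_cov] and [S E_cov O]; their SCW scores are
    # combined here by summation (an assumption; any monotonic combination
    # works for the comparison in step 3).
    return scw(subject, met_verb, cov_event) + scw(subject, cov_event, obj)

def accuracy(items):
    """Percentage of items for which the HIGH_TYP candidate outscores the LOW_TYP one."""
    correct = sum(
        covert_event_score(s, v, o, high) > covert_event_score(s, v, o, low)
        for (s, v, o, high, low) in items
    )
    return correct / len(items)

items = [("author", "start", "book", "write", "read")]
print(accuracy(items))  # 1.0 on this single toy item
```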

We compared our distributional approach with the probabilistic model introduced by Zarcone et al. (2012), computing the probability P(e) that a candidate verb e is the hidden event E\(_{cov}\) as:

$$\begin{aligned} P(e) = P(verb) \cdot P(subject | verb) \cdot P(object | verb). \end{aligned}$$
(19)

We refer to this model as ZetAl12. It is a generative model, since it first assumes a hidden event E\(_{cov}\) and then generates the arguments given the choice of E\(_{cov}\). When compared with other distributional models of logical metonymy, ZetAl12 achieved the highest accuracy, but lower coverage, due to the zero counts of many of the co-occurrences needed to compute the probabilities in (19).
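A hedged sketch of how Eq. (19) can be estimated with maximum-likelihood counts from a parsed corpus is given below; the toy triples and counts are illustrative and are only meant to show where the zero-count problem, and hence the coverage limitation, comes from.

```python
# Maximum-likelihood estimate of P(e) = P(verb) * P(subject|verb) * P(object|verb)
# from a toy corpus of subject-verb-object triples (illustrative data only).
from collections import Counter

triples = [
    ("author", "write", "book"), ("author", "write", "book"),
    ("author", "write", "novel"), ("student", "read", "book"),
    ("author", "read", "newspaper"),
]

verb_counts = Counter(v for _, v, _ in triples)
subj_verb_counts = Counter((s, v) for s, v, _ in triples)
verb_obj_counts = Counter((v, o) for _, v, o in triples)
n_triples = len(triples)

def p_covert_event(verb, subject, obj):
    """Probability of a candidate verb being the covert event, as in Eq. (19)."""
    if verb_counts[verb] == 0:
        # Unseen co-occurrences yield zero probabilities: this is what limits
        # the coverage of the probabilistic model.
        return 0.0
    p_verb = verb_counts[verb] / n_triples
    p_subj = subj_verb_counts[(subject, verb)] / verb_counts[verb]
    p_obj = verb_obj_counts[(verb, obj)] / verb_counts[verb]
    return p_verb * p_subj * p_obj

# Candidate covert events for "The author starts the book":
print(p_covert_event("write", "author", "book"))  # 0.4 on this toy corpus
print(p_covert_event("read", "author", "book"))   # 0.1 on this toy corpus
```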

The results for covert event identification are shown in Table 10. Overall, the ThetaProd model is again the best performing one, correctly classifying almost all the triplets, and it is the only one that significantly outperforms a random baseline at \(p < 0.05\) in both the McElree and the Traxler dataset.Footnote 9 Conversely, ThetaProtoSum, Sum, Product and Surprisal struggle in this classification task, classifying only a few more triplets than a random baseline.

Table 10 Accuracy (and coverage) of the models and of the baselines on the binary classification task for covert event retrieval

The model coming closest to ThetaProd in terms of accuracy is the reimplementation of ZetAl12. As in the original study, this probabilistic model has very high accuracy, but it also struggles with data sparsity and has more limited coverage. Again, we tested the ThetaProd model after removing the \(\sigma \) component, in order to assess its contribution to the classification task. Once again, the contribution of the \(\sigma \) component is limited to a few triplets, especially on the Traxler dataset, which includes several rare words (cf. Table 11). It is the \(\theta \) component that plays the crucial role in covert event prediction, whereas for unusual and rare events there is simply no matching joint context that can be retrieved from the DEG representing GEK.

Table 11 Accuracy of ThetaProd after the removal of \(\sigma \) and performance drop on the McElree and the Traxler datasets

As a final experiment, we tested the claim by Zarcone et al. (2013, 2014), according to which thematic fit estimation is the mechanism responsible for triggering logical metonymy. Their hypothesis was that the recovery of the implicit event could be a consequence of the dispreference of the verb for the entity-denoting argument. In our framework, this corresponds to saying that the low thematic fit between verb and patient triggers a retrieval operation aimed at increasing the semantic coherence of the event represented in the SR. To test this claim, we compared the \(\theta \) scores of the events containing the HIGH_TYP covert event (i.e., \([_{E}\) S V\(_{met}\) E\(_{cov}]\) + \([_{E_{cov}}\) S E\(_{cov}\) O]) with those of the corresponding MET events (i.e., \([_{E}\) S V\(_{met}\) O]), predicting that the former are more semantically coherent than the latter.Footnote 10 This hypothesis turned out to be correct: according to the Wilcoxon rank sum test, both in the McElree (\(W = 199, p < 0.01\)) and in the Traxler dataset (\(W = 157, p < 0.01\)) the \(\theta \) scores of the structures with the covert events are significantly higher.
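The comparison just reported can be illustrated with the following sketch, where the \(\theta \) score arrays are made-up placeholders (not the paper's data) and scipy's Mann–Whitney U implementation is used as the Python equivalent of the Wilcoxon rank sum test.

```python
# Illustrative comparison of theta scores: interpretations with the HIGH_TYP
# covert event vs. the plain metonymic (MET) events. The score arrays are
# fabricated placeholders, not the paper's data.
from scipy.stats import mannwhitneyu  # Wilcoxon rank sum test == Mann-Whitney U

theta_with_covert = [0.71, 0.64, 0.80, 0.69, 0.75]  # theta([S V_met E_cov] + [S E_cov O])
theta_met = [0.42, 0.51, 0.38, 0.47, 0.44]          # theta([S V_met O])

stat, p = mannwhitneyu(theta_with_covert, theta_met, alternative="greater")
print(stat, p)  # a small p supports the "repair" hypothesis on these toy numbers
```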

4 Discussion

We introduced a framework for argument complexity relying on the two components of Memory and Unification, as in the MUC framework by Hagoort (2013). The first refers to the storage of GEK that we represent by means of the corpus-derived DEG, whereas the second concerns the constraint-driven combination of the units stored in the DEG into more complex structures. Our hypothesis is that GEK stores information about typical events and participants, and that this knowledge allows speakers to anticipate the upcoming linguistic input during sentence processing. Human lexical knowledge, as argued by several modern theories of language processing (Libben 2005; Marzi and Pirrelli 2015), does not seem to be organized to minimize storage, but rather to maximize processing efficiency.

Words work as cues to GEK (Elman 2014), and the recovered information is dynamically unified to build a representation of the events that natural language sentences are likely to communicate. Unlike other approaches, which mainly look at syntactic factors, we focused on the semantics of the events described by natural language sentences and used syntax only to identify aspects of their structure. However, a complete model of processing complexity could separately represent the relevant information for each linguistic domain by means of different constraints (Blache 2016), and the resulting domain-specific complexity indexes could then be combined and integrated to account for the different sources of complexity.

In the proposed DSM-based implementation, event representations are weighted along two different dimensions:

  • the semantic coherence \(\theta \) of the unified event, which depends on the mutual typicality of the participants and is computed with a distributional model of thematic fit (a minimal sketch is given after this list);

  • the activation by lexical items, or salience \(\sigma \), which corresponds to the activation strength of GEK events cued by lexical items. Activation values are modeled as simple conditional probability scores, and the global activation of an event is computed by also taking into account the contribution of schematic events.
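As a rough illustration of the first dimension, the following sketch computes \(\theta \) as the product of per-participant thematic fit scores, which is how the ThetaProd variant discussed below operationalizes it; the thematic fit lookup is a toy stand-in, and the way \(\theta \) and \(\sigma \) are combined into the final SemComp score follows the equations given earlier in the paper and is not reproduced here.

```python
# Sketch of the theta component as a product of event-participant thematic fit
# scores (the ThetaProd variant). thematic_fit() is a toy lookup, not the
# paper's distributional model.
from math import prod

TOY_FIT = {  # thematic fit of a filler for a verb's role (illustrative values)
    ("arrest", "AGENT", "policeman"): 0.9,
    ("arrest", "PATIENT", "burglar"): 0.8,
    ("arrest", "PATIENT", "contract"): 0.1,
}

def thematic_fit(verb, role, filler):
    return TOY_FIT.get((verb, role, filler), 0.1)  # small default for unseen fillers

def theta(verb, role_fillers):
    """Semantic coherence of an event as the product of its thematic fit scores."""
    return prod(thematic_fit(verb, role, filler) for role, filler in role_fillers)

print(theta("arrest", [("AGENT", "policeman"), ("PATIENT", "burglar")]))   # 0.72
print(theta("arrest", [("AGENT", "policeman"), ("PATIENT", "contract")]))  # 0.09
```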

An important assumption of our model is that the argument complexity of a sentence is inversely related to these two factors: (i) the activation strength of a corresponding event stored in GEK, and (ii) the mutual typicality of its participants, which results in a more predictable situation. In our experiments, we compared the predictions of our model with the findings of several psycholinguistic studies. The most successful version of the model turned out to be ThetaProd, which computes the \(\theta \) component as the product of the single event-participant thematic fit scores. We argue that this approach has several strengths:

  • it achieved competitive performance on the binary classification task for the update of context-sensitive argument typicality, evaluated on the data by Bicknell et al. (2010), being outperformed only by a strong Product baseline, which however obtains suboptimal performance in the other tasks;

  • in modeling the processing cost of logical metonymy observed in the studies by McElree et al. (2001) and Traxler et al. (2002), ThetaProd closely reproduced the behavioral data showing significant differences between the three experimental conditions (typical, non-typical and metonymic event);

  • in retrieving the covert event of logical metonymy, which turned out to be difficult for all the models, it achieved the best performance and was the only system that managed to significantly outperform a random baseline. Moreover, it does not suffer from the coverage problems of probabilistic models (Zarcone et al. 2012);

  • the \(\theta \) component assigns significantly lower scores to metonymic verbs (e.g., finish) combined with a non-event-denoting direct object (e.g., book) than to the corresponding structure after the integration of the covert event. This is coherent with the hypothesis by Zarcone et al. (2013, 2014), according to which covert event retrieval is triggered by a low thematic fit between verb and object, and is aimed at “repairing” the low degree of semantic coherence of the metonymic structure;

  • finally, the addition of the \(\sigma \) component leads to some improvement (although not a significant one) over the thematic fit model alone (\(\theta \)), suggesting that the two components play complementary roles.

A current limitation of the model is the coverage of the \(\sigma \) component, which was low on all datasets. On the one hand, this makes sense, as it is hard to imagine a semantic memory component storing all possible events: in most cases, the semantic representations of events are likely to be built from scratch. On the other hand, it would be desirable for future extensions of the model to implement some form of generalization based on the similarity between arguments. For example, there might be no distributional information stored for the event of a policeman arresting a burglar, but there might be for a policeman arresting a crook. The ability to generalize, by recognizing the similarity between the two situations and adapting the stored representation to the new event, would be extremely useful for increasing the contribution of the \(\sigma \) component (a hypothetical sketch is given below). At the same time, the results of our experiments confirm that argument complexity and its online processing effects need to be explained within a general model of the incremental and compositional construction of the semantic representations of sentences describing previously unseen events, which is the very essence of natural language productivity.
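The following is a purely hypothetical sketch of such a similarity-based generalization: when no stored event exactly matches the input, the model could back off to the most similar stored event by comparing the distributional vectors of the arguments. Vectors, stored activations, thresholds and names are all illustrative assumptions, not part of the current model.

```python
# Hypothetical similarity-based back-off for the sigma component: if the exact
# event is not stored, reuse the activation of the most similar stored event.
import numpy as np

TOY_VECTORS = {  # toy distributional vectors for the arguments
    "burglar": np.array([0.9, 0.1, 0.3]),
    "crook": np.array([0.85, 0.15, 0.35]),
    "contract": np.array([0.1, 0.9, 0.2]),
}

STORED_SIGMA = {("policeman", "arrest", "crook"): 0.6}  # toy activations in the DEG

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def generalized_sigma(subject, verb, obj, threshold=0.95):
    """Exact lookup first; otherwise reuse the activation of the most similar object."""
    if (subject, verb, obj) in STORED_SIGMA:
        return STORED_SIGMA[(subject, verb, obj)]
    candidates = [
        (cosine(TOY_VECTORS[obj], TOY_VECTORS[o]), sigma)
        for (s, v, o), sigma in STORED_SIGMA.items()
        if s == subject and v == verb and o in TOY_VECTORS and obj in TOY_VECTORS
    ]
    best_sim, best_sigma = max(candidates, default=(0.0, 0.0))
    return best_sigma if best_sim >= threshold else 0.0

print(generalized_sigma("policeman", "arrest", "burglar"))   # falls back to 'crook'
print(generalized_sigma("policeman", "arrest", "contract"))  # too dissimilar: 0.0
```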

5 Conclusions and future work

In this work, we have presented a distributional model of argument complexity and tested it on the tasks of accounting for sentence typicality and resolving logical metonymy. In our view, these are two aspects of the same phenomenon, since in both cases argument typicality determines processing complexity.

On the computational side, one of our models proved able to handle both phenomena in two different psycholinguistic datasets, showing greater generality than previous approaches. It should be pointed out, however, that our datasets were not ideal for an exhaustive comparison between the models, given their small size and the relatively simple structure of their items, which we modeled as subject–verb–object triplets. As we anticipated in the introduction, we treated the problem of semantic complexity mainly in relation to argument typicality, but this entails ruling out several potential sources of complexity, such as richer event structures (i.e., events also including roles like instruments and locations), the presence of argument modifiers, and semantic relatedness effects due to the sentence or the wider discourse context. Many current approaches to the estimation of argument typicality also limit themselves to relatively easy tasks, one of the main reasons being the well-known scarcity of benchmark datasets (Vassallo et al. 2018). Hopefully, the joint effort of NLP and psycholinguistic research in the coming years will produce more robust benchmarks, built with the goal of evaluating argument complexity models on a wider variety of structures and taking into account semantic complexity stemming from different linguistic domains (Blache 2011, 2016).