
1 Introduction

Traditional theories of grammar distinguish between lexical knowledge (the individual words that a speaker knows) and grammatical knowledge (the rules for combining words into meaningful utterances). However, there is a rich range of linguistic phenomena in the less explored area between words and combinatory rules/constraints. For example, a multiword lexeme such as take the train has an idiosyncratic semantics (“use a train as mode of transport”) that suggests its treatment as a lexical unit, since the meaning cannot be compositionally derived in a general manner. But take the train also behaves as a syntactic phrase, undergoing various alternative means of expression (e.g., took a train, take the fast train, take trains all over Europe). Much research on language has thus focused on a range of multiword lexemes such as idioms, light verb constructions, noun compounds, and collocations (e.g., [15, 20, 22, 46, 48, 55, 76]). Psycholinguists have also shown the importance of co-occurrence and contingent frequency effects between words, and between words and syntactic patterns, in the learning and processing of language (e.g., [5, 57, 70, 72]).

In theories of language acquisition, especially usage-based accounts of language learning (which eschew complex innate linguistic knowledge), the role of multiword constructions has been emphasized (e.g., [40, 41, 74]). However, computational modelling of language acquisition has continued to focus on various aspects of word learning (e.g., [33, 37, 49, 65, 77]) or grammar learning (e.g., [17, 69]). Work on intermediate constructions has mostly been limited to identifying general properties of verb argument usages (e.g., [3, 4, 13, 18, 23, 61, 63]), rather than focusing on multiword lexemes. Recent work by Borensztajn et al. [9] uses a probabilistic model (in the DOP framework) to show that a grammar learner can progress from highly lexicalized to multiword tree fragments, on the basis of statistical patterns in the kind of input children receive. Bannard and Matthews [7] further give evidence from human subjects that children are sensitive to the frequencies of multiword sequences. These studies provide evidence that children recognize and produce certain (e.g., high-frequency) multiword sequences in their input, but do not address what sort of cues (other than, e.g., frequency) a child might use to identify, and treat differentially, the various types of multiword lexemes distinguished by linguistic analyses.

Thus in the study of child language acquisition, much remains to be explored concerning the precise computational mechanisms that underlie how children learn to identify different types of multiword lexemes—that is, how they recognize that an idiosyncratic semantics is associated with a sequence of words (rather than single words plus combinatory rules), and how the idiosyncratic meaning relates to the surface (lexical and syntactic) form of a particular combination. In contrast, there has been significant work in computational linguistics on this very topic, with the development of statistical measures, both for identifying multiword lexemes in a corpus, and for determining the syntactic and semantic behaviour of the particular type of multiword lexeme in question (e.g., [8, 19, 21, 25, 28, 30, 43, 50, 53, 67, 71, 75]). The goal of our research here is to explore whether this computational work on multiword lexemes can be extended in a natural way to the domain of child language acquisition, where an informative cognitive model must take into account two issues: what kind of data the child is exposed to, and what kinds of processing of that data are cognitively plausible for a child.

In pursuing these questions, we focus in particular on the acquisition of multiword verbs, such as take the train and give a kiss. These constructions are a rich and productive source of predication that children in most languages must master, and they do so at very young ages [41]. For example, consider the following conversation from the CHILDES database ([11], sarah130a.cha):

*MOT: you're not gonna take any toys down to the beach today you know.
*CHI: why?
      ...
*MOT: we have to take the train.

Here, the mother uses the verb take first in its core literal meaning (in take any toys), and then within a multiword lexeme in which take has a non-literal meaning and combines with the particular argument to express the use of a mode of transportation (in take the train). The child’s further responses within this conversation give no indication that she is puzzled by these very different usages of take. Yet they do pose a very significant puzzle for researchers: It has been noted that children learn highly frequent verbs (such as take) first (e.g., [41]), and yet it is precisely these verbs that are also the most polysemous, showing a wide range of metaphorical sense extensions in multiword lexemes, which children recognize and deal with effectively [16, 44, 73].

Research over the last few years has shown that the distinctions between literal and non-literal verb–argument combinations (such as take the toys versus take the train or take a nap) are in principle learnable based on statistics over usages of such expressions (e.g., [30, 75]). However, such work depends on very large amounts of data (from corpora on the order of 100 million words) and on sophisticated statistical and grammatical calculations over such data. The goal here is to determine what is learnable through the means available to a child—that is, on the basis of data in child-directed speech and using simpler, cognitively plausible calculations.

We begin by summarizing the motivation and approach to deriving simple statistics based on the linguistic properties of the multiword lexemes under study (first presented in [32]). We then present new experiments that show that such statistics can be informative in identifying such multiword lexemes in child-directed speech. Then we turn to a novel approach for incorporating these statistical measures into an existing model of word learning, to show further that such statistics can be used within a natural process of word learning to associate a single meaning with a sequence of words. In this way, we take a first step toward computational modelling of acquisition of the kinds of multiword verbs that children must master early in language learning, shedding light on the mechanisms that could underlie a usage-based model of this process.

2 Multiword Lexemes with Basic Verbs

The highly frequent and highly polysemous verbs referred to above include what are called “basic” verbs—those that express physical actions or states central to human experience, such as give, get, take, put, see, and stand, among others. These verbs undergo metaphorical sense extensions of their core physical meanings that enable them to combine with various arguments to form multiword lexemes [15, 58, 59, 62]. We focus here on expressions in which a basic verb is combined with a noun in its direct object position to form either a literal combination (as in take the toys) or a multiword lexeme (such as take the train, take a nap). We refer to all such expressions (both literal and non-literal) as verb–noun combinations or verb–noun pairs, with the understanding that the verb is a basic verb.

Verb–noun combinations that form multiword lexemes are very frequent in many languages (e.g., [1, 20, 45, 46, 51, 54]). Such expressions show a range of semantic idiosyncrasy, where the semantics of the multiword lexeme is more or less related to the semantics of the verb and the noun separately [38, 66]. Thus, verb–noun combinations can be viewed as lying on a continuum (without completely clear boundaries) from entirely literal and compositional, to highly idiomatic. However, for convenience we can think of classes of constructions on this continuum, each identified by a particular way in which the verb and the noun component contribute to the meaning of the construction. Following [30], we consider four possible classes; these are listed below with an example from the child-directed speech used in our experiments along with some information about the semantic contribution of the components of expressions in that class:

1. Literal combination or lit
   • Give (me) the lion
     • Give: physical transfer of possession
     • Lion: a physical entity

2. Abstract combination or abs
   • Give (her) time
     • Give: abstract transfer or allocation
     • Time: an abstract meaning

3. Light verb construction or lvc
   • Give (the doll) a bath
     • Give: convey/conduct an action
     • Bath: a predicative meaning

4. Idiomatic combination or idm
   • Give (me) the slip
     • Give, slip: no/highly abstract contribution

These classes are important in the context of child language acquisition because there is a clear connection between the linguistic properties of each class and the meaning of the expressions in the class. Such a relation can enable language learners to generalize their item-specific knowledge by making predictions about the meanings of new expressions based on their likely class. For example, when a child hears a new expression such as give a shout, if she recognizes that this is likely an lvc, then she can infer that it roughly means the same thing as the noun—i.e., shout—which contributes the predicative meaning, and also infer any other properties holding of lvcs more generally.

The four classes of expressions above have differing linguistic behaviours that can be cues to the underlying distinctions among the classes [30]. Specifically, expressions from each class exhibit particular lexical and syntactic behaviour that closely relate to the semantic properties of the class. We next elaborate on these properties and behaviours, and describe how they can form the basis for statistical measures for distinguishing the classes.

3 Linguistic Properties and the Usage-Based Measures

It has been shown that children are sensitive to the frequency of occurrence of multiword sequences (e.g., [7]). However, the simple co-occurrence frequency of a verb and a noun (or measures of association between the two) does not suffice for accurate identification of multiword verb–noun lexemes [29]. We thus hypothesize that children are also sensitive to the syntactic and semantic properties of each class of verb–noun combination. As a first step in examining this hypothesis, we need to verify whether information about such properties is available in the input children receive, and whether the available information is useful for determining the semantic class of a given combination. We note that there is some overlap in the properties exhibited by the various non-literal classes. We therefore simplify our task here by aiming to distinguish the non-literal expressions (those from abs, lvc, idm) from literal ones (lit). Since there is only one instance of an idm in our data, our presentation of the measures discusses the relevant properties with respect to the abs + lvc classes.

As noted earlier, computational linguistic studies have developed sophisticated statistical measures based on such properties, which have achieved success in identifying non-literal combinations when evaluated on large amounts of text corpus data (e.g., [28, 30]). Given the hypothesized importance of simplicity in language learning (cf. [60]), our goal here is to use simpler measures (tapping into similar properties) that are more cognitively plausible, and that are robust when used with smaller amounts of child-directed speech (CDS). We note that some of the measures explained in this section are adapted for this purpose from Fazly [29]. The resulting measures fit into three groups based on the linguistic properties of the verb and the noun in a verb–noun combination: the degree of association of the verb and noun, the semantic properties of the noun, and the degree of syntactic fixedness of the expression.

3.1 Association of a Verb–Noun Pair

In a literal verb–noun combination, where the verb contributes its core physical semantics, a wide variety of nouns can occur as the noun component (e.g., one can give an apple, a book, a car, a dog, etc.). In contrast, in a non-literal combination, the verb has an abstract and/or metaphorical meaning and hence can combine with a set of nouns that is semantically, and somewhat idiosyncratically, restricted (e.g., give a groan/cry/yell, but not give a gripe, [31]). Moreover, the nouns in the latter group often contribute a specific abstract meaning to the combinations they appear in, and hence may not occur as the direct object of other verbs as frequently as do concrete nouns. As a result, we expect the verb and the noun component in non-literal expressions to co-occur more often than the components of literal combinations [14, 27]. Below we explain two measures capturing this marked association of a verb–noun pair.

The simplest way to measure the association of a verb and a noun is by the frequency of co-occurrence of the verb–noun pair ⟨v, n⟩, as in:

$$\mathrm{Cooc}(v,n) \doteq \mathrm{freq}(v,n,\mathrm{gr=dobj}) \qquad\qquad (1)$$

where gr = dobj indicates that the noun is the direct object of the verb. We assume that children are able to keep track of simple counts of such verb–noun pairs.

Although non-literal expressions are expected to co-occur more often than literal expressions, the co-occurrence of some literal expressions is also significant (e.g., take the toy in child-directed speech). However, the noun in a non-literal expression generally does not occur with as diverse a set of verbs as a noun in a literal expression. For example, apple can be used in many literal expressions with different verbs: give the apple, take the apple, eat the apple, and wash the apple, whereas decision only occurs in one non-literal verb–noun combination: make a decision. In other words, while the verb in a lit expression is typically thought of as selecting for a noun in direct object position, in a non-literal expression the noun can be viewed as selecting for a verb (e.g., [24, 43]). We measure this property by computing the conditional probability of a verb–noun pair given the noun (CProb).

$$\mathrm{CProb}(v,n) \doteq P(v \mid n,\mathrm{gr=dobj}) = \frac{\mathrm{freq}(v,n,\mathrm{gr=dobj})}{\sum\limits_{v'}\mathrm{freq}(v',n,\mathrm{gr=dobj})} \qquad\qquad (2)$$

This measure is still a very simple one for children, since it is composed of two frequency counts; we should note, however, that it does assume that children are able to keep track of the count of a noun appearing as the direct object of any verb.
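To make these two association measures concrete, the following minimal sketch computes Cooc and CProb from a list of (verb, noun) direct-object pairs; the input format and function name are illustrative assumptions, not part of the original study.

```python
from collections import Counter

def cooc_and_cprob(dobj_pairs):
    """Compute Cooc(v,n) and CProb(v,n) from (verb, noun) direct-object pairs.

    dobj_pairs: one (verb, noun) tuple per direct-object usage observed in
    the (parsed) child-directed speech.  This input format is illustrative.
    """
    pair_freq = Counter(dobj_pairs)                   # freq(v, n, gr=dobj)
    noun_freq = Counter(n for _, n in dobj_pairs)     # freq(., n, gr=dobj), any verb

    cooc = dict(pair_freq)                                                  # Eq. (1)
    cprob = {(v, n): f / noun_freq[n] for (v, n), f in pair_freq.items()}   # Eq. (2)
    return cooc, cprob

# Toy example:
pairs = [("give", "kiss"), ("give", "kiss"), ("give", "apple"),
         ("take", "apple"), ("eat", "apple")]
cooc, cprob = cooc_and_cprob(pairs)
# cooc[("give", "kiss")] == 2; cprob[("give", "kiss")] == 1.0
# cprob[("give", "apple")] is about 0.33: apple is shared across many verbs
```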

3.2 Semantic Properties of the Noun

There is evidence that children are sensitive to the semantic differences between the nouns in a literal versus non-literal verb–noun combination [64]. For example, whereas the noun in a non-literal verb–noun combination is often non-referential, abstract, and/or predicative (as in take time and give a hug), the noun in a literal combination tends to be referential and concrete (as in take the toys and give a banana). Earlier work has used WordNet [35] to estimate non-referentiality and predicativeness by looking at the noun’s position in the taxonomy, and its morphological relation to a verb [30]. However, WordNet’s conceptual and lexical organization most likely does not reflect that of a child. Next, we explain two measures that instead aim to capture these properties with simple statistics over the surface behaviour of the noun.

Non-referential nouns (such as those in non-literal expressions) tend to appear in particular syntactic forms [42]—typically preceded by an indefinite determiner (such as a/an) or no determiner [34, 76]. Moreover, it has been shown that children indeed associate certain semantic properties with surface syntactic forms [10]. Here we assume that a noun is recognized as non-referential to the extent that it occurs in this preferred pattern of determiner use, i.e.:

$$\mathrm{NRef}(n) \doteq P(pt_{nref} \mid n) = \frac{\mathrm{freq}(n,pt_{nref})}{\mathrm{freq}(n)} \qquad\qquad (3)$$

where pt_nref = ⟨det:a/an/null n⟩, freq(n, pt_nref) is the frequency of occurrence of n in pattern pt_nref, and the denominator estimates the frequency of n in any pattern. Note that we look at all occurrences of a noun irrespective of its grammatical relation to a verb; this is thus a simple relative frequency for a child to determine: of the instances she sees of this noun, what proportion are in this particular pattern.
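As an illustration, NRef could be computed from a simple record of the determiner (if any) immediately preceding each occurrence of the noun; this input representation is an assumption made only for the sketch below.

```python
def nref(determiners_before_noun):
    """NRef(n): proportion of all occurrences of a noun that appear in the
    pattern <det:a/an/null n> (Eq. 3).

    determiners_before_noun: the determiner observed immediately before each
    occurrence of the noun, with None for bare (determiner-less) uses.
    This input representation is an assumption made for the example.
    """
    if not determiners_before_noun:
        return 0.0
    in_pattern = sum(1 for det in determiners_before_noun if det in ("a", "an", None))
    return in_pattern / len(determiners_before_noun)

# e.g. the noun "hug" seen in "give me a hug", "want a hug", "the hug":
print(nref(["a", "a", "the"]))   # ~0.67: mostly indefinite or bare uses
```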

In a non-literal verb–noun combination, such as make a decision, the predicative meaning is contributed mainly by the noun component, i.e., decision. Moreover, in such expressions the noun is often morphologically related to a verb (e.g., decision as the nominalized form of decide). To capture this property, previous work has looked at whether the noun has a morphologically-related verb form [30]. We cannot assume that full knowledge of morphology is in place before a child starts learning about non-literal expressions. But it has been shown that young children can accurately predict whether a word is used as a verb or a noun in a given context [10]. We thus measure predicativeness of the noun n in a verb–noun pair as the relative frequency of the form n (e.g., push in give a push) being used as a verb (as in, e.g., push the door).

$$\mathrm{Pred}(n) \doteq \frac{\mathrm{freq}(n_{V})}{\mathrm{freq}(n_{V}) + \mathrm{freq}(n_{N})} \qquad\qquad (4)$$

where freq(n_V) is the frequency of the form n appearing as a verb, and freq(n_N) is the frequency of the form n appearing as a noun.
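A small sketch of Pred follows, assuming (hypothetically) that each relevant token in the child-directed speech has been coarsely tagged as a verb or a noun.

```python
def pred(pos_tagged_tokens, form):
    """Pred(n): relative frequency with which a word form is used as a verb
    rather than as a noun (Eq. 4).

    pos_tagged_tokens: (word, pos) pairs from the child-directed speech, with
    pos coarsely marked as 'V' or 'N' (a simplification for this sketch).
    """
    freq_v = sum(1 for w, pos in pos_tagged_tokens if w == form and pos == "V")
    freq_n = sum(1 for w, pos in pos_tagged_tokens if w == form and pos == "N")
    total = freq_v + freq_n
    return freq_v / total if total else 0.0

tokens = [("push", "V"), ("push", "V"), ("push", "N"), ("lion", "N")]
print(pred(tokens, "push"))   # ~0.67: push is used mostly as a verb
print(pred(tokens, "lion"))   # 0.0: lion never appears as a verb
```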

3.3 Degree of Syntactic Fixedness

Young children show evidence of learning associations between a complex syntactic form and a specific semantic interpretation (e.g., [36, 70]). It is thus reasonable to assume that children can use the information about the surface syntactic behaviour of a verb–noun combination to identify its semantic class. Here we devise statistical measures that aim at capturing the differing syntactic behaviour of non-literal and literal combinations.

Non-literal expressions are known to have a fixed syntactic structure and not occur in a variety of forms [20, 26]. More specifically, abs + lvc expressions, while allowing some variation, are relatively restricted compared to lit expressions. For example, an lvc such as give a shout allows limited noun and determiner variation; e.g., give some shouts and give the shout are not as acceptable as give a shout. This is also true for abs expressions. For example, take a time and take times are not recognized as acceptable variations of take time. In contrast, literal expressions are generally much more syntactically flexible, e.g., take an apple, take the apple, and take three apples are all acceptable.

Although there is some variation, most lvc and abs expressions appear in the form pt_fixed = ⟨v det:a/an/null n⟩. (Note that the noun is in the same pattern as for NRef above; the difference is that here the focus is on the degree to which the particular verb–noun combination leads to the use of that pattern for the noun.) Measures of this type of syntactic fixedness have required keeping track of probability distributions over a wide range of items and patterns [6, 30]. Here, we estimate the degree of syntactic fixedness of a target verb–noun combination with a much simpler measure—the relative frequency of the pair in the preferred pattern:

$$\mathrm{Fixed}(v,n) \doteq P(pt_{fixed} \mid v,n,\mathrm{gr=dobj}) = \frac{\mathrm{freq}(v,n,\mathrm{gr=dobj},pt_{fixed})}{\mathrm{freq}(v,n,\mathrm{gr=dobj})} \qquad\qquad (5)$$

Children appear to store specific information about the frequency of occurrence of multiword sequences in general (e.g., [7]), and about verb–argument structures in particular (e.g., [41, 74]). We thus expect the above calculations to be plausible for children.
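Under the same simplifying assumptions as the earlier sketches, the Fixed measure could be computed as follows from hypothetical (verb, determiner, noun) triples for parsed direct-object usages.

```python
def fixed(dobj_usages, verb, noun):
    """Fixed(v,n): of all direct-object usages of the pair (verb, noun), the
    proportion appearing in the pattern <v det:a/an/null n> (Eq. 5).

    dobj_usages: (verb, det, noun) triples, one per parsed direct-object
    usage, with det None when no determiner is present.  Illustrative format.
    """
    pair_usages = [d for v, d, n in dobj_usages if v == verb and n == noun]
    if not pair_usages:
        return 0.0
    in_pattern = sum(1 for d in pair_usages if d in ("a", "an", None))
    return in_pattern / len(pair_usages)

usages = [("give", "a", "shout"), ("give", "a", "shout"), ("give", "the", "shout"),
          ("take", "the", "apple"), ("take", None, "apples")]
print(fixed(usages, "give", "shout"))   # ~0.67: give-shout strongly prefers "a"
```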

We have described five simple statistical measures that may be plausible for children to keep track of. In the remainder of the paper, we first present experiments that evaluate how well the measures can identify non-literal verb–noun combinations in child-directed speech, and then describe extensions to a word learning model that enable it to learn the meaning of such expressions by incorporating these statistical measures.

4 Evaluating the Statistical Measures

In this section, we present two types of experiments to determine the potential of our statistical measures to identify non-literal verb–noun combinations in child-directed speech. Each of our measures assigns a numerical score to the expressions that reflects one of the linguistic properties that may be useful to a child in determining which are literal and which are non-literal. To evaluate their effectiveness, we first (in Sect. 4.2) apply a hierarchical agglomerative clustering algorithm that uses the scores to separate all the experimental expressions into two clusters, and then see how closely those clusters correspond to the actual labels on the expressions as lit, or as abs + lvc. Since we assume that, in any learning situation, a combination of the cues might be at work, we use all five measures as input to the clustering algorithm.
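The paper does not specify the clustering implementation; as an illustration only, the sketch below applies scikit-learn's hierarchical agglomerative clustering to invented measure scores for a handful of expressions, standardizing the five scores before clustering them into two groups.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Each row is one verb-noun expression with its five measure scores:
# [Cooc, CProb, NRef, Pred, Fixed].  The values below are invented.
features = np.array([
    [12, 0.90, 0.95, 0.80, 0.92],   # e.g. give a kiss   (non-literal)
    [ 9, 0.75, 0.88, 0.70, 0.85],   # e.g. take a nap    (non-literal)
    [15, 0.20, 0.30, 0.05, 0.40],   # e.g. take the toys (literal)
    [ 7, 0.15, 0.25, 0.02, 0.35],   # e.g. give the lion (literal)
])

# Standardize so that the raw Cooc counts do not dominate the distances,
# then perform two-way hierarchical agglomerative clustering.
X = StandardScaler().fit_transform(features)
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)   # e.g. [1 1 0 0]: the two clusters separate the two groups
```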

The clustering results thus show the effectiveness of the measures working together to separate non-literal from literal combinations. We further analyze (in Sect. 4.3) each individual measure in its ability to separate literal and non-literal expressions, in order to better understand how relevant each measure is to the identification of multiword lexemes. We begin by presenting the details of the experimental data and evaluation methods.

4.1 Experimental Setup

To gather input for our experiments, we use the American English section of the CHILDES database [52], removing 16 corpora that either lack child-directed speech (CDS) or belong to a special group with a particular language use (e.g., socio-economically distinguished). All the data are automatically parsed with the parser of Sagae et al. [68]. Because we are interested in what is learnable from the input a child is exposed to, the statistics for all experiments are extracted from CDS. The CDS portion of the corpus comprises about 600,000 utterances, which contain nearly 3.2 million words (including punctuation).

In this work we focus on two basic verbs, take and give, because they are highly polysemous and frequently used in verb–noun combinations [15]. We extract verb–noun combinations that contain these verbs from the CDS portion of the data. The final expression list used in the experiments includes those verb–noun pairs with a frequency of at least 5. In some experiments, we further restrict the data to higher-frequency verb–noun combinations, i.e., those occurring at least 10 times. Dealing with low-frequency items is important in modelling child language acquisition, and here we vary this relatively low cutoff to see whether it helps to include more (but lower-frequency) items. The final list of expression types was annotated by a native English speaker with four classes: lit, abs, lvc, and idm. Note that we consider expression types, not tokens. Thus, if a verb–noun combination had usages that fall into more than one class, the annotator chose the class that seemed to reflect the predominant usage. Invalid expressions (due to parsing errors) and the single instance of an idm were removed from the expression list. Table 1 presents the number of expressions in each class, as well as the total number of non-literal expressions (abs + lvc).

Table 1 A detailed breakdown of the experimental expressions

To evaluate the clustering experiments, we assign to each resulting cluster a label (either lit or abs + lvc), which is the label of the majority of items in the cluster, and calculate accuracy (Acc) and completeness (Comp) as measures of the goodness of the cluster. Accuracy gives the proportion of expressions in a cluster that have the same label as the cluster; completeness gives the proportion of all expressions with the same label as the cluster that are actually placed in that cluster. (Note that Acc is similar to precision, and Comp to recall.)
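As an example of this cluster evaluation, the sketch below computes a cluster's label, Acc, and Comp from gold labels; the helper function and its input format are illustrative.

```python
from collections import Counter

def acc_and_comp(cluster_gold_labels, all_gold_labels):
    """Acc and Comp for one cluster, given the gold labels of its members
    and of all expressions.  The cluster's label is its majority gold label."""
    majority, majority_count = Counter(cluster_gold_labels).most_common(1)[0]
    acc = majority_count / len(cluster_gold_labels)               # like precision
    comp = majority_count / Counter(all_gold_labels)[majority]    # like recall
    return majority, acc, comp

gold = ["lit"] * 6 + ["abs+lvc"] * 4
cluster = ["abs+lvc", "abs+lvc", "abs+lvc", "lit"]   # members of one cluster
print(acc_and_comp(cluster, gold))   # ('abs+lvc', 0.75, 0.75)
```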

Recall that our measures are designed such that each is expected to be higher for the non-literal expressions than for the literal ones. In evaluating the measures individually, we can thus use each measure to rank the expressions and see whether abs + lvc expressions are generally ranked higher than lit ones. We do this for take and give expressions separately, and for all expressions together. We use a standard evaluation metric, namely average precision (AvgPrec), which reflects the goodness of a measure in placing expressions from the target classes (abs and lvc) before those from the other class (lit), and is calculated as the average of precision scores at different thresholds.

We also compare the performance of each measure against a baseline that reflects how hard the task is. We randomly assign a value between 0 and 1 to each expression in a set, generating a random ranked list. We repeat this process 1,000 times and report the average of the AvgPrec values over these random lists as our baseline. We also calculate the relative error rate reduction (ERR) of each measure over the random baseline. To calculate ERR for a measure, we divide the difference between the error rates of the measure and the baseline by that of the baseline.
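The following sketch shows one common way of computing AvgPrec over a ranked list, together with the random baseline and ERR just described; treating the error rate as 1 − AvgPrec is an assumption of this sketch.

```python
import random

def avg_prec(scores, labels, target=("abs", "lvc")):
    """Average precision: rank expressions by score (highest first) and
    average the precision obtained at the rank of each target-class item."""
    ranked = sorted(zip(scores, labels), key=lambda x: x[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label in target:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

def random_baseline(labels, n_runs=1000, seed=0):
    """Mean AvgPrec of randomly ranked lists (the paper's baseline)."""
    rng = random.Random(seed)
    return sum(avg_prec([rng.random() for _ in labels], labels)
               for _ in range(n_runs)) / n_runs

def err(measure_ap, baseline_ap):
    """Relative error rate reduction, taking the error rate to be 1 - AvgPrec
    (an assumption of this sketch)."""
    return ((1 - baseline_ap) - (1 - measure_ap)) / (1 - baseline_ap)

labels = ["lvc", "lit", "abs", "lit", "lit"]
print(avg_prec([0.9, 0.4, 0.8, 0.3, 0.2], labels))   # 1.0: both targets ranked first
print(err(0.85, random_baseline(labels)))            # error reduction over the baseline
```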

4.2 Measures in Combination: Clustering

Results of the clustering experiments are shown in Table 2. We can see that Acc for non-literal expressions is high only for the higher-frequency expressions (compare C2 in each panel of the table). We also see that literal expressions are better separated than non-literal ones since their Comp score is much higher (compare C1 and C2 for each panel of the table). Looking closely at the number of expressions of different labels (lit, lvc, and abs) in each cluster, it is clear that abs expressions are more mixed with lit expressions compared to lvc ones. Consequently, the measures are better in separating lvc from lit than abs from lit.

Table 2 Two-way clustering results. Ci represents Cluster i; Label is the majority class in the cluster; Acc and Comp are explained in the text

We performed two-way clustering on the assumption that a two-way distinction would be easier for the measures than a three-way distinction. However, the poor performance on abs expressions may be due to a weakness of the measures, or may be due to a need for three clusters to capture the pattern in the data. We thus also performed a three-way clustering to examine the goodness of measures in dividing expressions into abs, lvc, and lit classes (see Table 3). According to the results, abs expressions do not form a separate cluster, and are again mixed in with the lit and lvc clusters. Future work will need to verify whether this is due to an inconsistent annotation of the abs expressions, or because our measures do not adequately capture properties of this class. Interestingly, however, a three-way clustering results in forming a more coherent lvc class: compare Acc and Comp for C3 in Table 3 with those for C2 in the top panel of Table 2.

Table 3 Three-way clustering results. Ci represents Cluster i; Label is the majority class in the cluster; Acc and Comp are explained in the text

4.3 Performance of the Individual Measures

We test the performance of each measure, for take and give expressions separately, and for all the expressions with take and give. The results in Table 4 show that all measures perform better than the baseline (at separating non-literal expressions from literal ones), with CProb, Pred, and Fixed having the best performance. These results suggest that simple statistical measures that draw on specific linguistic properties of non-literal verb–noun combinations—measures which are plausible for children to keep track of—can indeed be effective in recognizing non-literal expressions.

Table 4 Performance (AvgPrec) of the individual measures. The numbers in parentheses show the ERR of the measures for take and give expressions combined

We also observe that, in general, our measures perform better on the expressions composed with take than the expressions with give. A possible explanation is that the give expressions are more complicated, because give more often occurs in a double object construction (in comparison to take). It remains to be tested whether children also show more difficulty in learning give expressions.

Looking at performance on higher-frequency expressions, we see that all measures show an improvement. Note, however, that only for two of the measures (NRef and Fixed) is the gain in performance substantially greater than the increase in the baseline. These two measures summarize the syntactic behaviour of a word or a combination by examining all of its usages. For higher-frequency expressions (with more usages), it is possible that the evidence available to these measures is more reliable, resulting in better performance.

5 Embedding the Measures into a Word Learning Model

The results presented so far suggest that simple statistics over the usages of a verb–noun combination (and its components) have the potential to provide useful cues for a child to identify non-literal expressions. We need to explore further how children learning the vocabulary of their native language might use such statistical cues to recognize that certain combinations of words in their input actually form multiword lexemes. We investigate this issue by incorporating (some of) the statistical measures into the operations of an existing computational model of early word learning in children, namely, that of [33].

We first give a brief overview of the original word learning model in Sect. 5.1 (we refer the interested reader to [33] for a full explanation of this model). When processing a multiword lexeme, such as take a nap, the original model finds a meaning for each individual word (take, a, nap) just as it does for a literal combination of words, such as take any toys. There is no mechanism for the model to associate a single meaning with the sequence of words take a nap. We thus add a preprocessing step, described in Sect. 5.2, in which the model draws on statistics collected thus far to decide whether a given sequence of words in the input utterance should be considered as a multiword lexeme. Section 5.3 presents an evaluation of the new model with respect to the acquisition of multiword lexemes of the form verb–noun.

5.1 The Original Word Learning Model

We use the model of Fazly et al. [33], which is a probabilistic incremental model of cross-situational word learning in children. The input to the model is a list of pairs of an utterance (what the child hears, represented as a set of words) and a scene (what the child perceives or conceptualizes, represented as a set of meaning symbols), as in:

Utterance: Joe is happily eating an apple

Scene: joe, is, happily, eat, a, apple

The model incrementally learns a meaning for each word in the input as a probability distribution over all meaning symbols, P(m | w), referred to as the meaning probability of the word.

Prior to receiving any usages of a given word, the model assumes that all symbols have equal probability as its meaning. The model then updates the meanings of words by processing each utterance–scene pair in two steps.

As the first step in processing an input utterance–scene pair, the model, like children, must determine which meaning symbol in the scene is associated with each word in the utterance. (Note that the input does not indicate which meaning goes with which word.) This process is called the alignment of words and meaning symbols. Alignment is probabilistic, so that each word is aligned more or less strongly with each meaning, according to the model’s partially-learned knowledge of meaning probabilities as calculated thus far. Specifically, the probability of aligning a meaning symbol with a word in the current input is proportional to the current meaning probability of that symbol for the word, and inversely related to the meaning probabilities of that symbol for the other words in the utterance. That is, a word w and meaning m are strongly aligned if P(m | w) is relatively high and P(m | w′) is relatively low for the other words w′ in the utterance.

As the second step, the meaning probabilities of the words in the current utterance are updated according to the accumulated (probabilistic) evidence from prior co-occurrences of words and meaning symbols (reflected in the alignment probabilities). This evidence is collected by maintaining a running total of the alignment probabilities over all input pairs encountered so far, yielding an accumulated frequency of co-occurrence of a word–meaning pair, weighted by the strength of alignment between the two each time they are observed together. Meaning probabilities for current words are then re-calculated from these incrementally-accumulated alignment probabilities.
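To make the two-step process concrete, here is a heavily simplified sketch of an incremental cross-situational learner in this spirit; the smoothing scheme and exact update equations of Fazly et al. [33] are replaced by stand-ins, so this illustrates the idea rather than reproducing their model.

```python
from collections import defaultdict

class CrossSituationalLearner:
    """Heavily simplified sketch of an incremental cross-situational word
    learner in the spirit of Fazly et al. [33]; the smoothing scheme and
    exact update equations of the original model are replaced by stand-ins."""

    def __init__(self, small=1e-5):
        self.assoc = defaultdict(float)       # accumulated word-meaning alignments
        self.word_total = defaultdict(float)  # total accumulated alignment per word
        self.small = small                    # stand-in for the model's smoothing

    def meaning_prob(self, w, m):
        """Current P(m | w); (near-)uniform before any evidence for w."""
        return (self.assoc[(w, m)] + self.small) / (self.word_total[w] + 1.0)

    def process(self, utterance, scene):
        # Step 1: align each meaning symbol with each word, in proportion to
        # P(m | w) relative to the other words in the utterance.
        align = {}
        for m in scene:
            denom = sum(self.meaning_prob(w2, m) for w2 in utterance)
            for w in utterance:
                align[(w, m)] = self.meaning_prob(w, m) / denom
        # Step 2: accumulate the alignments as weighted co-occurrence evidence,
        # from which the meaning probabilities are (re)computed.
        for (w, m), a in align.items():
            self.assoc[(w, m)] += a
            self.word_total[w] += a

learner = CrossSituationalLearner()
learner.process(["joe", "is", "happily", "eating", "an", "apple"],
                ["joe", "is", "happily", "eat", "a", "apple"])
print(learner.meaning_prob("apple", "apple"))   # grows with further consistent input
```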

5.2 Learning the Verb–Noun Multiword Lexemes

The approach described above learns a separate meaning probability distribution for each word. To enable the model to learn a meaning distribution for a verb–noun combination such as give a kiss, the model must be able to identify the expression as a single unit of meaning. To achieve this, we add an input pre-processing step to the original model and slightly modify the way alignment probabilities are calculated.

We assume that upon receiving an utterance–scene pair containing any verb–noun combination (literal or non-literal), a learner (here the model) simultaneously considers two possible interpretations: that the verb–noun combination is a multiword lexeme, or that it is a literal combination. That is, when the original model receives an input such as:

  • U : give me a kiss

  • S : give, me, a, kiss

our modified model will also consider the alternative interpretation in which the verb and noun form a single unit of meaning:

  • U′ : give-kiss me a

  • S′ : me, a, give-kiss

This alternative interpretation is created by merging the verb and the noun into a single word (give-kiss), and by creating a new meaning symbol for the associated event (give-kiss). We assume that the learner has a certain degree of confidence in each of these interpretations, given what has been learned about words and meanings in the input thus far. Specifically, the learner calculates a probability prob_mwl(v,n), which reflects its confidence that the verb–noun combination in the utterance is a non-literal multiword lexeme, as in (U′–S′) above. This probability combines the two statistical measures, namely CProb and Pred, which were the best at separating literal and non-literal expressions in our earlier experiments. More formally, prob_mwl(v,n) is computed as:

$$\mathrm{prob}_{mwl}(v,n) = \alpha \cdot \mathrm{CProb}(v,n) + (1 - \alpha) \cdot \mathrm{Pred}(n)$$

where α is set to 0.5, weighting the evidence from the two statistical measures equally. Thus, the interpretation that a verb–noun combination is a multiword lexeme, as in (U′–S′) above, is assigned a confidence score equal to prob_mwl(v,n), and the other interpretation, as in (U–S) above, is given the confidence score 1 − prob_mwl(v,n).

Whenever there is a verb–noun pair in an utterance, we calculate separate alignment probabilities over the two possible utterance–scene pairs corresponding to the two interpretations. The two sets of alignment probabilities are then combined, using prob_mwl(v,n) as a weight, to get a single alignment probability for each word and meaning symbol in the input pair:

$$\mathrm{align}(w \mid m) = \mathrm{prob}_{mwl}(v,n) \cdot \mathrm{align}_{1}(w \mid m) + (1 - \mathrm{prob}_{mwl}(v,n)) \cdot \mathrm{align}_{2}(w \mid m)$$

Note that for a w–m pair that occurs in only one interpretation (e.g., give-kiss–give-kiss), its alignment is zero in the other interpretation. This means that the learner aligns each word and meaning symbol to the extent that it is confident that the corresponding interpretation is accurate. The modified alignment probabilities are then used to calculate the meaning probabilities as in the original model.
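A minimal sketch of this weighting scheme is given below, with made-up alignment values; the dictionary-based representation of alignments is an assumption of the example, not the model's actual data structures.

```python
def mwl_confidence(v, n, cprob, pred, alpha=0.5):
    """prob_mwl(v,n): confidence that the verb-noun pair is a multiword
    lexeme, combining CProb and Pred with equal weight (alpha = 0.5)."""
    return alpha * cprob[(v, n)] + (1 - alpha) * pred[n]

def combine_alignments(align_mwl, align_literal, p_mwl):
    """Weight the alignments from the two interpretations (merged verb-noun
    unit vs. literal) by the learner's confidence p_mwl; a word-meaning pair
    absent from one interpretation contributes zero from it."""
    keys = set(align_mwl) | set(align_literal)
    return {k: p_mwl * align_mwl.get(k, 0.0)
               + (1 - p_mwl) * align_literal.get(k, 0.0)
            for k in keys}

# Illustrative use for "give me a kiss", with made-up alignment values:
p = mwl_confidence("give", "kiss", cprob={("give", "kiss"): 0.9}, pred={"kiss": 0.7})
a_mwl = {("give-kiss", "give-kiss"): 0.95, ("me", "me"): 0.6}              # from U'-S'
a_lit = {("give", "give"): 0.8, ("kiss", "kiss"): 0.9, ("me", "me"): 0.6}  # from U-S
print(combine_alignments(a_mwl, a_lit, p))   # p = 0.8 here
```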

5.3 Experiments on the Modified Word Learner

We expect the modified word learning model to learn a single meaning for non-literal verb–noun pairs but not for literal ones. That is, we expect a meaning probability such as P(GIVE-KISS | give-kiss) to be high, since give-kiss is a multiword lexeme that expresses a kissing event. By contrast, P(GIVE-PRESENT | give-present) should be low, since give a present is literal, with individual associations of give to GIVE and present to PRESENT.

We use the same data as in Fazly et al. [33]: 180,499 utterance–scene pairs, where the utterances are taken from the Manchester corpus in the CHILDES database [52], and the scene representations are automatically constructed using an input-generation lexicon containing a symbol as the meaning of each word. Because the Manchester corpus is British English and some American English verb–noun multiword lexemes with take occur with other basic verbs in British English, we only consider the verb–noun combinations with give in the current experiments. Since children can learn meanings of very low-frequency words, we do not apply a frequency cut-off, but rather consider all verb–noun combinations with give in the corpus. The number of lit, abs, and lvc expressions used in our experiments is shown in Table 5.

Table 5 The number and percentage of verb–noun combinations in each class that are learned correctly: i.e., as literal for the lit class, and as non-literal for the abs and lvc classes

In Fazly et al. [33], a word–meaning pair is considered learned if the probability of the correct meaning given the word is above 0.7. This is a somewhat arbitrary cut-off, but to be consistent we use the same threshold. We say that a verb–noun combination is “learned as a multiword lexeme” if the probability P(VERB-NOUN | verb-noun) is above this threshold—that is, if the combination of the verb and noun words is associated with a single (correct) meaning. We say that a verb–noun combination is “learned correctly” if the combination is non-literal and is learned as a multiword lexeme, or the combination is literal and is not learned as a multiword lexeme. To evaluate the model’s ability to learn multiword lexemes, we look at the proportion of expressions from each class that are learned correctly; see Table 5.
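For illustration, the evaluation criterion can be sketched as follows, where p_merged_meaning stands for the learned probability of the merged meaning given the merged word (e.g., P(GIVE-KISS | give-kiss)); the function and its inputs are hypothetical.

```python
def learned_correctly(gold_class, p_merged_meaning, threshold=0.7):
    """An expression counts as correct if it is non-literal and the
    probability of the merged meaning given the merged word (e.g.
    P(GIVE-KISS | give-kiss)) exceeds the 0.7 threshold, or if it is
    literal and that probability stays at or below the threshold."""
    learned_as_mwl = p_merged_meaning > threshold
    return (gold_class in ("abs", "lvc")) == learned_as_mwl

print(learned_correctly("lvc", 0.85))   # True: lvc learned as a multiword lexeme
print(learned_correctly("lit", 0.85))   # False: literal pair wrongly merged
```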

The results in Table 5 show that the model performs very well on the lvc and lit expressions (75% and 91%, respectively), but only a small proportion (33%) of the abs expressions are learned correctly. A closer look at the results shows that many of the non-literal expressions with a frequency of only 1 are not learned correctly. This includes 46% of lvc expressions with frequency 1, and 85% of abs expressions with frequency 1. This finding is in line with what has been observed in children: children are faster at producing more familiar (frequent) multiword sequences [7]. It remains to be tested whether children, too, are unable to learn some of these multiword lexemes (as single lexemes) from a single exposure.

6 Conclusions

Our results confirm that simple statistical measures that draw on linguistic properties of non-literal expressions are useful in identifying them. The best measure for give and take expressions is Pred, i.e., the normalized frequency of the usages of the noun as a verb. The success of this measure indicates that the predicativeness of the noun is a salient property of non-literal verb–noun combinations. The goodness of CProb in identifying non-literal expressions suggests that the verb–noun pair in such expressions is more entrenched than in literal ones and exhibits collocational behaviour. However, collocational behaviour alone is not a very good indicator of non-literal expressions: the CProb measure consistently outperforms Cooc (which only quantifies the entrenchment of the verb–noun pair). The key difference between these two measures is that in CProb we also measure the degree to which the noun selects for the appropriate verb. The Fixed measure, which looks at a specific syntactic pattern for non-literal expressions, performs as well as CProb for all expressions, but is the best measure for expressions with a frequency of at least ten, for which there is sufficient evidence of typical syntactic usage.

Our measures are generally better for higher-frequency expressions. However, two of the best measures (Pred and CProb) perform well both on expressions with a frequency of at least 5 and on higher-frequency expressions, suggesting that children might be able to learn verb–noun combinations even with very little input. Our results also show that the performance of our measures is better for take expressions than for give expressions. The Fixed measure in particular performs well on take, but less well on give, suggesting that the more complex syntactic constructions that give appears in (e.g., the double object construction) may cause children difficulty.

We also integrate our measures into a word learning model, and show that the new model can successfully learn the meaning of many lvc expressions. Future work will need to further investigate why it is harder for the model to learn the meaning of abs expressions. In the experiments presented in this article, we have focused on a small number of verb–noun combinations (namely, 117) formed around one particular verb (i.e., give). To better understand the generalizability of our findings, future research will need to extend these experiments to other verbs (e.g., take) and to other types of multiword lexemes (e.g., noun compounds).

Another limitation of the model is that it learns word meanings by mapping each word to a distinct ‘concept’ (e.g., give-kiss must be mapped to give-kiss). In the future, we need to use a richer semantic representation where each concept is comprised of finer-grained semantic primitives. The use of such a representation would enable the model to determine semantic similarities among words (e.g., the similarity between the meaning of the expression give-kiss and that of the verb kiss), which would further allow it to make generalizations across different types of lexical items.