
1 Introduction: Theoretical Background

One of the paradoxes of recent research in corpus and computational linguistics is that the theoretical underpinnings are rarely addressed in detail, as if applying algorithms to an object (language) whose very structure is so controversial were self-evident. In a matter as complex as recurrent linguistic sequences, it is moreover necessary to have recourse to huge linguistic corpora and to take linguistic diversity into account.

Foreign language learners and translators are often struck by the overwhelming importance of prefabricated elements and figurative senses in language. For more than 50 years, corpus and computational linguists have therefore been trying to extract those elements automatically from corpora. However, as pointed out by [13], the results of the automatic extraction of collocations in the broad sense are still disappointing, and new avenues of research ought to be explored. Indeed, another paradox of research on the automatic extraction of recurrent sequences in language is that it is not clear exactly what one (and the algorithm) should actually extract.

Phraseology as defined by [4] encompasses all phraseological units or phraseologisms, from collocations (weakly idiomatic phrases) to proverbs, but collocation is widely used in corpus or computational linguistics [13] as a generic term (covering all phraseological units). In the same way, idiom has sometimes been used for very idiomatic phrases [4] but is also a generic term for all set phrases of a language [17].

Yet another way of describing all language blocks that are accessed as one element of meaning is the notion of formulaic language [22]: a heteromorphic lexicon makes it possible to store linguistic material of different sizes (morphemes, words, multiword strings); moreover, [22] claims that, by default, native speakers will use formulas and will only break down linguistic material into smaller components when they have specific needs (the principle of NOA, Needs Only Analysis). Adults, on the whole, will tend to analyze input and store smaller lexical units. Using formulaic language is considered a way of promoting the speaker's own interests, because larger, holistic units, with their pragmatic and cultural associations, make it possible to exercise better control over the hearer's interpretation of the message. In other words, formulaic language is a way of manipulating the hearer. The secondary and somewhat artificial character of written language, as opposed to spontaneous spoken dialogue, is also stressed.

In addition to the great terminological diversity around the notion of phraseology, its extraction raises another thorny issue: where does a phraseological unit (PU) start and where does it end? On the one hand, many PUs are discontinuous (as in the more you … the more you or X take Y into account); on the other hand, the exact beginning or end of a PU is not always straightforward (as in to VERB (and to VERB) (…) was with N/Pro the work of an instant).

Finally, developments in cognitive linguistics over the past 30 years, and among constructionist approaches in particular [17], have proposed a radically new approach to the study of constructions in general. For construction grammar [10, 12], all constructions are Saussurean signs, i.e. conventional and learned pairings of form and function at varying levels of complexity and abstraction. Thus, constructions include partially filled words or morphemes (e.g. pre-N or V-ing), words, but also idioms in the generic sense of PUs: filled (spill the beans), partially filled (take X for granted), minimally filled (the more… the more), and even abstract constructions such as the Ditransitive Verb Construction (give X to Y) or the Passive Construction.

The constructionist approaches to language have fundamentally changed the vision of PUs, because of the proposed continuum between lexicon and syntax, the constructicon [9, 11]: all language sequences are basically of the same type, as they are all constructions, ranging from very schematic and abstract constructions, such as the passive, to substantive or specific constructions (morphemes and words), some of which are complex (idioms, i.e. PUs). According to this approach, PUs are in no fundamental way different from ordinary words or from syntactic constructions; on the contrary, as pointed out by [23], all constructions are in a sense idioms. Thus, partially schematic complex constructions (e.g. X SPILL the beans (on/about Y)) are traditionally labelled idioms (or PUs), but for [23] this is simply due to the fact that such constructions make the effects of idiomatic variation very clear, the main point being that they are not fundamentally different from other constructions.

An even more radical view is taken by [7]: all constructions are language-specific, and categories are, moreover, construction-specific. On the basis of a rigorous analysis of various languages of the world, the author comes to the conclusion that there is no such thing as, for instance, a universal passive construction (similarities can only occur between cognate languages) or a universal verb category.

Those fascinating and challenging insights were not only gained through introspection and the comparison of relevant examples, but also by means of rigorous psycholinguistic experiments, and they were further confirmed by the collostructional approach.

This methodology [14, 15, 21] makes it possible to quantify association strength in and between constructions, and is derived from collocational approaches used in corpus linguistics. For instance, [21] shows that there is a statistical association between verbs and Argument Structure constructions and that verbs display very different association strengths within those constructions.

In constructionist approaches, grammar is conceived as a network of constructions whose nature is largely probabilistic [8, 21].

These theoretical and practical issues were the starting point of the IdiomSearch experiment. In this paper, a summary is given of its methodology, main results, and possible further developments.

2 Methodology

In order to gain fresh insights into both the practical aspects and the theoretical underpinnings of extracting PUs from corpora, we chose a big data approach. In the first place, it has been demonstrated by several studies, such as [5, 19], that large linguistic corpora (of 100 million tokens or more) are necessary in order to encounter a sufficient variety of examples of the most common PUs. Moreover, as pointed out by [13], dispersion across corpora has not been sufficiently taken into account, and many studies are of very limited scope compared with the huge number of PUs in a single language.

[13] also lays stress on additional limitations of current methods for extracting PUs from a corpus: most statistical measures are not directional (they treat that you and you that as the same PU), they are not easy to reconcile with psycholinguistic or cognitive principles, and they do not extend easily to longer sequences such as trigrams and 4-grams, let alone 6-grams.

Within the framework of the IdiomSearch experiment, it was therefore decided to use the cpr-score (Corpus Proximity Ratio) described in [6]. This experimental score is non-parametric and directional, as it is derived from information retrieval [1], and more specifically from metric clustering techniques (Fig. 1):

Fig. 1. The cpr-score

It basically corresponds to the average distance between the component grams of an n-gram, given a window W, set between 20 and 50 tokens according to the language. In order to compute it, it suffices to record all the offsets (positions in the text file) of the n-gram and its component grams in a corpus. For instance, the PU spill the beans occurs 14 times in a 200 million token web corpus of English; this exact (contiguous) frequency is then divided by the frequency of spill the beans with a maximum of 20 tokens between the grams, which yields 15 occurrences. Dividing 14 by 15 gives a cpr-score of 0.93. The significance threshold for PUs has been set experimentally at 0.40, while scores as low as 0.065 still yield partly fixed phrases or elements of phrases, as explained in [6].
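
From this description, the score for an n-gram can be read as the ratio between the frequency of the exact, contiguous n-gram and the frequency of its component grams occurring in the same order within the window W. As a purely illustrative reconstruction (the function name, the greedy in-order matching strategy and the data representation below are our assumptions, not the reference implementation of [6]), such a ratio could be computed as follows:

    def cpr_score(tokens, ngram, window=20):
        """Illustrative cpr-style score: frequency of the exact (contiguous)
        n-gram, divided by the frequency of its component grams occurring
        in order with at most `window` tokens between them. A sketch
        reconstructed from the description above, not the reference
        implementation of [6]."""
        n = len(ngram)
        exact = windowed = 0
        for i, tok in enumerate(tokens):
            if tok != ngram[0]:
                continue
            # contiguous occurrence of the full n-gram
            if tuple(tokens[i:i + n]) == tuple(ngram):
                exact += 1
            # in-order occurrence within the window (greedy matching)
            pos, found = i, True
            for gram in ngram[1:]:
                nxt = next((j for j in range(pos + 1, min(pos + 1 + window, len(tokens)))
                            if tokens[j] == gram), None)
                if nxt is None:
                    found = False
                    break
                pos = nxt
            if found:
                windowed += 1
        return exact / windowed if windowed else 0.0

    # e.g. cpr_score(corpus_tokens, ("spill", "the", "beans")) would give
    # 14/15 = 0.93 on the corpus described above.

Since the grams must occur in the given order, such a score is directional: spill the beans and beans the spill receive different scores.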

Bigrams such as easy rider, New York, sharp criticism, etc. are easy to explain to informants who have to evaluate the results of automatic extraction. Once the trigram level is reached, and in particular for longer n-grams, making clear to non-linguists what exactly is meant by PUs (or collocations, formulas and so on) is not an easy matter, especially if the diversity of languages is taken into account.

This is the reason why recall was difficult to measure, in the absence of reliable gold standards in several languages, as is the case for most automated tasks in NLP (natural language processing). Precision on the basis of native speaker judgment or dictionaries, on the other hand, is very high for very idiomatic phrases (cpr > 0.40). The novel feature of the cpr-score is that it is very stable, whatever the length of the n-gram (at least between bigrams and 7-grams). Thus, in Table 1, the very fixed PUs or idioms chosen at random from the dictionary all yielded a very high cpr-score on a 200 million token web corpus (note 1).

At the lower end of the spectrum, however, linguistic structures yielding only partly significant association scores (with a cpr-score between 0.065 and 0.40) were problematic. As a rule, separating the wheat from the chaff at this end of the phraseological spectrum is nigh on impossible, as already pointed out by [4].

In order to shed new light on the interplay between lexis and grammar, and particularly with respect to phraseology [4], formulaic language [22] and construction grammar [17], the cpr-score was computed on large web corpora of 200 million tokens. As mentioned in note 1, the corpora were assembled automatically, using a balanced list of seed words. Web corpora of 200 million tokens were created for English, French, Spanish and Chinese (Mandarin, simplified characters) (note 2).

For each language, all n-grams ranging from bigrams to 7-grams were extracted from the corpus, with a frequency threshold of 3 occurrences (for 200 million tokens). The cpr-score was computed for each selected n-gram. Thanks to an optimization of the database requests by means of a query likelihood model [18], this took no more than one week per language, with an average time of 0.07 seconds per request on a Linux machine. Several regex-based passes were then applied to the results, in order to deal with the problem of phraseological encapsulation: smaller n-grams (bigrams, trigrams) may be included in the results of longer n-grams (e.g. take the rough as a partial result for take the rough with the smooth); in this case, n-grams included in larger n-grams were discarded, unless their own association score was high, which is sometimes the case when competing PUs are at stake (e.g. take a walk and take a walk down memory lane). A minimal sketch of such a filter is given below.
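
The following sketch illustrates such an encapsulation filter, assuming the extracted n-grams and their scores are held in a dictionary keyed by token tuples (the function name and this representation are our assumptions; the actual experiment used several regex-based passes whose details are not reproduced here):

    def filter_encapsulated(scored, keep_threshold=0.40):
        """Discard n-grams contained in a longer extracted n-gram, unless
        their own association score is high enough to stand as a competing
        PU (e.g. 'take a walk' next to 'take a walk down memory lane').
        Sketch only, not the regex-based passes of the experiment."""
        kept = dict(scored)
        for long_ng in sorted(scored, key=len, reverse=True):
            for short_ng in list(kept):
                if len(short_ng) >= len(long_ng):
                    continue
                # is short_ng a contiguous sub-sequence of long_ng?
                contained = any(long_ng[i:i + len(short_ng)] == short_ng
                                for i in range(len(long_ng) - len(short_ng) + 1))
                if contained and kept[short_ng] < keep_threshold:
                    del kept[short_ng]
        return kept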

In order to allow easy access for researchers and students, all selected n-grams were put in a database, freely accessible from a web application (note 3). The user is presented with results highlighted in colors ranging from pale yellow (partly fixed and frequent) to deep red (very fixed and not frequent). The number of words in the results is also indicated, as well as the number of phrases per word (PW ratio) and the proportion of phrases in the text (PT ratio). The PT ratio is computed by checking how many words (tokens) of the text are included in PUs. Thus, a PT ratio of 0.45 means that just 55% of the words of the text are not included in phrases.
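
Under this definition, the PT ratio can be computed from the token spans of the extracted PUs, for instance as in the following sketch (the span representation is an assumption):

    def pt_ratio(num_tokens, pu_spans):
        """Share of the text's tokens covered by at least one extracted PU.
        `pu_spans` is assumed to hold (start, end) token offsets, with
        `end` exclusive; overlapping PUs are counted only once."""
        covered = set()
        for start, end in pu_spans:
            covered.update(range(start, end))
        return len(covered) / num_tokens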

3 Results and Discussion

3.1 The Proportion of Phraseology in Texts

Before providing an overview of the results produced by the IdiomSearch experiment, it may be useful to start from a small text fragment, in order to show what types of PUs the algorithm is able to extract: a few lines from an opinion piece on Brexit, published in a British newspaper (note 4).

“Don’t get me wrong, I – and I’m sure many other Labour voters – consider Brexit to be the biggest act of political self-sabotage in my lifetime, with stark consequences for my generation and those following it. Were it to be miraculously cancelled I would be over the moon. But there’s a sense that some would like everything to be about Brexit and only Brexit. When it comes to what people care about in 2017 – and vote on the basis of – that simply is not the case.”

The algorithm used by the IdiomSearch application makes it possible to extract on the fly the PUs shown in Table 2.

As one of the aims of the experiment is to receive feedback from users and to improve the algorithm and the selection of corpora, no systematic survey of different registers or text types has been carried out so far; however, more than 200 newspaper articles have been tested in the different languages, confirming that the method yields a high percentage of phraseology (as expressed by the PT ratio) in newspaper articles. For opinion pieces published by British or American newspapers, the overall PT ratio, i.e. the total percentage of phraseology that could be extracted by the algorithm, lies between 0.30 and 0.55. For the whole of the example text above (803 tokens), it was precisely 0.50 (i.e. roughly half of the text consisted of PUs). Table 3 presents the results (number of tokens and PT ratio) for 5 newspaper articles (comments on the news).

A comparison was also made with about 200 fragments (of comparable length) from available corpora of spoken English. An example with 5 texts is given in Table 4, comprising 5 randomly selected passages from the Corpus of Spoken Professional American-English (Athelstan) (note 5).

A word of caution is necessary in the interpretation of these results. It should be stressed again that the extraction of PUs by means of the cpr-score crucially depends on the reference corpus used. As mentioned in the preceding section, a balanced English web corpus of 200 million tokens was used for the experiment, but there is no guarantee that all phrases from a specific text are also present in the reference corpus; if they are not, they cannot, of course, be extracted by the program. The PT ratio is therefore an indication of the minimal percentage of phraseology (in the broad sense).

It may also come as a surprise that fragments from spoken corpora did not show major differences with written corpora as far as the total percentage of phraseology (PT ratio) is concerned. This is partly due to the topics that were discussed during the interviews: the Athelstan fragments (Table 4), for instance, contain interviews about university topics. Many differences appear in the types of PUs used, with many typical phrases of spoken English (Thank you very much, I would strongly suggest, I’m not sure, I wondered if), but the average percentage is not very different from the results obtained for the written texts.

From a theoretical point of view, these results partly confirm the hypothesis, known as the idiom principle [20], that about 50% of all text consists of phraseology in the broad sense, as the PT ratio (see examples in Tables 3 and 4) reaches figures between 0.35 and 0.60.

The examples of extracted PUs, as illustrated in Table 2, also provide convincing evidence for the existence of a network of statistical associations between the elements of complex structures such as idioms (spill the beans), grammatical collocations (care about, consequences for) and communicative formulas (don't get me wrong, when it comes to), which is compatible with construction grammar [17].

3.2 Experiments with Spanish, French and Chinese

Within the framework of the IdiomSearch experiment, similar tests were conducted for Spanish, French and Chinese (Mandarin, simplified characters).

For Spanish and French, the results were comparable to those obtained for English, with roughly the same percentages of phraseology in the broad sense. As is often the case in computational linguistics, the main difficulties were technical: compiling web corpora by means of the robot required special attention to possible encoding errors (note 6). As Spanish and French are also Indo-European, segmented and inflectional languages, it comes as no surprise that the algorithm was able to work in much the same way as for English. Chinese, on the other hand, represented in many respects a daunting challenge for the IdiomSearch experiment.

Chinese is, in the first place, an unsegmented language: there are no blanks between words. Modern Mandarin Chinese remains largely an isolating language (there is a very low morpheme-per-word ratio, and no inflectional morphology). In classical Chinese, one character (hanzi) corresponded to one word, but most words in modern Chinese consist of two characters, and sometimes more. The situation is therefore rather complex, which also makes it particularly interesting for testing linguistic hypotheses.

According to [7], not only constructions but also categories are language-specific, and Chinese is often cited as an example of a language apparently functioning in a totally different way with respect to grammatical categories such as Noun and Verb. As pointed out by [22], several studies have moreover confirmed that Chinese native speakers make different decisions when they have to segment a text into words, and that even the same persons do not always confirm their first choices. Words, therefore – if they exist at all – function in a very different way in Chinese.

For these reasons, it was not an easy matter to adapt the IdiomSearch algorithm to Mandarin Chinese, as the cpr-score is based on the average distance between words in a corpus. As a temporary solution, it was decided to consider the distance between Chinese characters instead, which made it possible to reach very high precision scores on established Chinese phrases. Table 5 shows the frequency and the cpr-score for a few common chengyu, the four-syllable idioms described in [16], from which the examples are borrowed.
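
Under this character-based adaptation, a score such as the cpr_score sketched in Sect. 2 can simply be applied to the sequence of characters, treating every character as a gram. In the toy example below, the corpus file name is hypothetical and the chengyu is a common one, not necessarily among those of Table 5:

    # Character-based reuse of the cpr_score sketch from Sect. 2:
    chengyu = tuple("画蛇添足")  # 'draw a snake and add feet', a common chengyu
    # Hypothetical corpus file, read as one long sequence of characters:
    corpus_chars = list(open("zh_corpus.txt", encoding="utf-8").read())
    print(cpr_score(corpus_chars, chengyu))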

The examples of chengyu in Table 5 clearly show what the cpr-score can achieve for very fixed Chinese phrases: contrary to what might have been expected, the score works particularly well. Being non-inflectional, mostly isolating and having undergone few influences from other languages, Mandarin Chinese is actually well suited for testing linguistic extraction algorithms, albeit with some technical adaptations.

An important finding of the IdiomSearch experiment, thanks to extensive testing with Chinese, is that the statistical association of morphemes/words is partly discontinuous, even within established phrases. This contradicts the intuition that adding one element at a time to an n-gram makes it possible to narrow down the probabilities. Not only is frequency of minor importance in the statistical association of words or morphemes, but there is no strict continuity between the elements. Suppose, for instance, that ABCD is a common PU or constructional idiom in English. Thanks to the cpr-score, it is possible to measure the statistical association between A + B + C + D. It is also possible to compute the score for A + B + C. Intuitively, one may be tempted to think that A + B + C is incomplete (as D is missing), and that the statistical score for ABC will therefore be lower than for ABCD. This is indeed often the case, but there are many counterexamples showing that the statistical association is much more complex, as internal PUs may interact with the overall score.

Table 6 illustrates this point for the communicative English phrases long time no see and the next thing I knew, and for the Chinese proverb 书中自有黄金屋 (note 7).

For both English phrases and for the Chinese proverb in Table 6, one can clearly see that adding a gram to the sequence, although it brings the sequence closer to the complete phrase, does not necessarily yield a higher statistical score at each level. These examples also suggest that the best method for extracting longer PUs is not bottom-up but top-down: starting from a single gram, adding one gram at a time, and checking the statistical association in the corpus at each level, will not yield good results, because the association is sometimes discontinuous, as between long time and long time no. The method used in the IdiomSearch experiment was, on the contrary, top-down, in the sense that all n-grams (ranging from bigrams to 7-grams) were extracted (with a frequency threshold of 3 occurrences for 200 million tokens); the association was measured for each n-gram at all levels, which made it possible to extract even idiomatic 7-grams. The difficulty remains, however, for cases such as the Chinese proverb in Table 6, because the maximum frequency and association scores are already reached at the level of the 5-gram, whereas the full proverb is a 7-gram. This is one of the reasons why the algorithm has to be slightly adapted for each language, applying the general principle of the matryoshka (Russian nesting dolls), or encapsulation: for long PUs such as 7-grams, a fine-tuned analysis of the associations between the internal grams is crucial.
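
Reusing the cpr_score sketch from Sect. 2, the discontinuity can be made visible by scoring each prefix of a phrase separately (illustrative only; the actual figures are those of Table 6):

    # Score each prefix of a phrase to expose discontinuous association:
    phrase = ("long", "time", "no", "see")
    for k in range(2, len(phrase) + 1):
        print(phrase[:k], cpr_score(corpus_tokens, phrase[:k]))
    # A lower score for ("long", "time", "no") than for the full 4-gram
    # is the kind of discontinuity discussed above.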

3.3 A Probabilistic Network of Constructions

As mentioned in the introduction, constructionist approaches view grammar as a complex network of constructions, and several researchers hold the view that this network is based on probabilistic principles [8, 21].

Given the specific problems posed by the Chinese language (as it is non-inflectional and unsegmented), some additional experiments were carried out within the framework of IdiomSearch.

In the first place, applying the statistical association score to an unsegmented language raises the question of the partly artificial segmentation of European languages. Tea cup, for instance, can be written as one word or two; à l'aéroport is considered a sequence of 3 words in French, but the Spanish equivalent al aeropuerto a two-word sequence.

A widely accepted view in construction grammar [3] is precisely that constructional idioms exist both at the syntactic and at the morphological level. A constructional idiom is then defined as a syntactic or morphological schema in which at least one position is fixed [3]. This is, for instance, the case in This is the life, but also in adjectives such as un-believable (in which -able cannot be replaced by -ible or other suffixes). Constructional morphology treats complex words in the same way as complex syntactic constructions. All constructions are partly fixed, but there is a cline from schematic to substantive constructions.

The specific claims of constructional morphology [3] have been made on the basis of solid examples and of cognitive experimentation, but the question of their statistical foundation in corpora remains open. In order to test this hypothesis on a wide scale, it would be necessary to apply morphological segmentation to a whole corpus. In the meantime, preliminary experiments with the cpr-score indeed confirm that many words composed of several morphemes display association scores that are quite comparable to those obtained for PUs, especially fully idiomatic PUs or idioms.

In the examples presented in Table 7, a number of English words were treated as sequences of separate morphemes, and the cpr-score was computed for their association.
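
In practice, this amounts to applying a score such as the cpr_score sketched in Sect. 2 to a corpus pre-segmented into morphemes, treating each morpheme as a gram; the segmentation of unbelievable below is an illustrative assumption:

    # Morphemes as grams, on a toy, hypothetically pre-segmented corpus:
    morpheme_corpus = ["quite", "un", "believ", "able", ",", "really"]
    print(cpr_score(morpheme_corpus, ("un", "believ", "able")))  # 1.0 on this toy input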

As illustrated by the examples from Table 7, there is indeed very little difference between the statistical associations prevailing within idiomatic PUs (Table 1) and within morphological constructs (Table 7).

Table 1. cpr-score for a few English idioms
Table 2. Examples of extracted PUs
Table 3. PT ratio (percentage of phraseology) for 5 newspaper articles in English
Table 4. PT ratio (percentage of phraseology) for 5 texts from the Athelstan corpus
Table 5. cpr-score for a few Chinese chengyu
Table 6. cpr-score for successive n-grams within the phrases long time no see and the next thing I knew
Table 7. cpr-score for the association of morphemes in a few English words

One may actually go one step further, as [3] does, and consider that constructions such as the English perfect construction 'have + past participle' are also constructional idioms, in which the auxiliary is fixed and the participle slot schematic. Again, this hypothesis can be supported by the cpr-score. We may, for instance, compute the score for a given form, say has, followed by a maximum of two tokens, followed by the suffix -ed, which marks many regular past participles. If the above-mentioned structure is indeed a constructional idiom, we expect a very significant score (cpr > 0.40). This is indeed the case: our 200 million token corpus yields a score of 0.58 for 52,350 occurrences. In other words, the construction itself can to some extent be captured by the cpr-score, as the score can be extended to 7-grams or even longer sequences.
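
One way such a score for a partially schematic pattern could be approximated is sketched below: occurrences of has followed within at most two tokens by an -ed form, divided by occurrences of has with an -ed form anywhere in the larger window. This ratio is our assumption about how the cpr-score was adapted; the exact procedure is not spelled out above.

    def schematic_cpr(tokens, max_gap=2, window=20):
        """Loose cpr-style analogue for the schematic pattern 'has ... V-ed':
        near co-occurrences (participle at most max_gap tokens after 'has')
        divided by co-occurrences within the full window. An assumption,
        not the procedure actually used in the experiment."""
        near = far = 0
        for i, tok in enumerate(tokens):
            if tok.lower() != "has":
                continue
            if any(t.endswith("ed") for t in tokens[i + 1:i + 2 + max_gap]):
                near += 1
            if any(t.endswith("ed") for t in tokens[i + 1:i + 1 + window]):
                far += 1
        return near / far if far else 0.0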

4 Conclusion

As the extraction of phraseological units/idioms that consist of more than two words is fraught with a wide range of difficulties, both practical and theoretical, the IdiomSearch experiment sought to determine whether a corpus-driven experimental score, the cpr-score [6], based on techniques derived from information retrieval, could yield acceptable results for different languages. The experiment was carried out on English, Spanish, French and (Mandarin) Chinese.

The recourse to large web corpora of 200 million tokens each, to optimized database storage and to a user-friendly web application has already made it possible to receive extensive feedback from students and other researchers, while also shedding fresh light on the theoretical underpinnings of any attempt to derive associative meaning from n-grams.

Although both the statistical score and the qualitative and quantitative aspects of the web corpora can still be improved, the preliminary results indicate that most common PUs, and even a high number of relatively rare and very fixed PUs, can be extracted by the cpr-score for European languages such as English, Spanish and French. The fact that the results are at present slightly better for English than for Spanish and French may be due to two main reasons. First, there are many more pages in English on the Web (note 8), which might explain why robots assembling pages on the basis of seed words in specific combinations yield more representative results for English; the second reason has to do with technical issues around the encoding of special characters in languages other than English.

The results obtained for Chinese are, however, particularly interesting, because they confirm that we should relativize our Eurocentric vision of language as being assembled from words and syntax. The whole cline of statistical associations measured by the cpr-score starts at the level of morphemes and ends at the level of schematic constructions, which is quite compatible with, and may even serve as evidence for, the claims made by the constructionist approaches to language. Thus, extracting phraseology from corpora looks like an achievable target, but the whole enterprise may only make sense against the backdrop of a probabilistic network of constructions.