
1 Introduction

The linguistic resources presented here are intended to support the description of phenomena occurring in Quechua-French and Quechua-Spanish projects.

NooJ4Qu is a set of linguistic resources developed with NooJ for the automatic processing of Quechua. We follow the structure of the NooJ modules developed over the last ten years for various languages (Barreiro 2008 for Portuguese; Bogacki 2008 for Polish; Chadjipapa et al. 2008 and Gavrillidou et al. 2008 for Greek; Georganta and Papadopoulo 2012 for Ancient Greek; Aoughlis et al. 2014 for Tamazight; Dobrovolic 2014 for Slovene). The module comes with a demo text and contains two types of elements necessary for the linguistic analysis of texts. The first comprises dictionaries of lemmas, which can operate independently. The second is a system containing a large number of local grammars describing the inflections and derivations obtained with nominal and verbal suffixes. It also includes a local grammar describing numerals ranging from 1 to 999999, a grammar that recognizes and annotates dates, and a grammar for the recognition and annotation of future forms.

It is not the purpose of this module to give users exhaustive dictionaries and complete libraries of local grammars, but rather to show how Quechua, with its very rich morphology and specific peculiarities, fits into NooJ’s linguistic development environment.

2 The Main Components of NooJ4Qu

The lexical analysis of a text makes use of the following components launched simultaneously:

  • Dictionaries of nouns, adjectives and non-verbal POSs,

  • A dictionary of Quechua verbs,

  • A library of morphosyntactic grammars and of inflectional and derivational rules.

The files describing the inflection of the entries, together with the demo text, are:

  • Ngrammars.flx corresponding to the compiled dictionary Nquechua.nod,

  • Vgrammars.flx corresponding to Vquechua.nod,

  • A Quechua text containing 8 tales.

3 The Noun Electronic Dictionary

The electronic dictionary of simple nouns contains 1,472 entries. It does not include proper nouns. Every entry is assigned an inflectional paradigm, marked by FLX=. For instance, the entry wasi (house) inflects according to the paradigm class NVOCAL, so the entry reads: wasi,N+FR=“maison”+FLX=NVOCAL. Inflectional paradigms are standard pattern models or prototypes based on morphological suffixation rules. These rules cover variation in number, diminutives and superlatives, verbalizations, and cases.
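The entry format quoted above (lemma, category, then `+`-separated properties) can be read mechanically. The following is a minimal Python sketch, not NooJ's own parser: the function name `parse_entry` and the returned dictionary layout are assumptions made for illustration, and the sketch assumes that property values contain neither `+` nor `,`.

```python
def parse_entry(line):
    """Split an entry of the form 'lemma,CAT+KEY=VALUE+FLAG' into its parts.

    Hypothetical helper for illustration only; it is not part of NooJ.
    """
    lemma, _, info = line.partition(",")
    category, *fields = info.split("+")
    features = {}
    for field in fields:
        key, sep, value = field.partition("=")
        # Properties without '=' are stored as True; straight or curly
        # quotes around a value are dropped.
        features[key.strip()] = value.strip().strip('"“”') if sep else True
    return {"lemma": lemma.strip(), "category": category.strip(), "features": features}


entry = parse_entry('wasi,N+FR="maison"+FLX=NVOCAL')
print(entry["lemma"], entry["category"], entry["features"]["FLX"])
```

Keeping the properties in a dictionary makes it easy to look up the paradigm name (FLX) or the French translation (FR) of an entry.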

Quechua is a highly inflectional language:

The category of case is represented by the following set of suffixes: {-hina, -kama, -man, -manta, -nta, -ninta, -ninka, -nka, -p, -pa, -pi, -paq, -pura, -rayku, -ta, -wan, -y!, -niy!}. The verbalization of a noun is obtained by one or more of the following derivation suffixes: {-y, -yay, -chay}.

3.1 Nominal Inflections

There is another subset which is classified as enclitics: {-ña, -raq, -puni, }

The thousands of nominal forms that can be obtained by the inflectional paradigms are generated by the following set of nominal suffixes and their combinations in different layers:

The tag poss (7v+7c) represents the two sets of possessive suffixes:

  • poss (7v)={-y, -yki, -n, -nchik, -yku, -ykichik, -nku} applied to nouns with a vowel ending.

  • poss (7c)={-niy, -niyki, -nin, -ninchik, -niyku, -niykichik, -ninku} applied to nouns with a consonant ending.

They give us the following paradigms of possessive forms:
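As a rough illustration of how the two possessive sets are selected, here is a minimal Python sketch. The suffix lists are those given above; the function name `possessive_forms` and the simple "last letter is a vowel" test are assumptions of this sketch, not the module's implementation.

```python
# Possessive suffix sets as listed in the text.
POSS_VOWEL = ["y", "yki", "n", "nchik", "yku", "ykichik", "nku"]
POSS_CONSONANT = ["niy", "niyki", "nin", "ninchik", "niyku", "niykichik", "ninku"]
VOWELS = "aeiou"


def possessive_forms(noun):
    """Return the seven possessive forms of a noun (hypothetical helper)."""
    # Assumption: the choice depends only on the last letter of the noun.
    suffixes = POSS_VOWEL if noun[-1].lower() in VOWELS else POSS_CONSONANT
    return [noun + suffix for suffix in suffixes]


print(possessive_forms("wasi"))  # vowel-ending noun
print(possessive_forms("ñan"))   # consonant-ending noun
```

For wasi this yields wasiy, wasiyki, wasin, wasinchik, wasiyku, wasiykichik and wasinku; for ñan the consonant set is used instead (ñanniy, ñanniyki, and so on).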

The morphosyntactic grammar of dimension 1 (where only a single suffix intervenes) for the inflection of nouns is shown in Fig. 1.

Fig. 1. Local grammar of one-dimension nominal inflection
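To give an idea of what a one-dimension paradigm produces, the sketch below attaches a single case suffix to a noun. It is written in Python rather than as a NooJ grammar, and it uses only the case suffixes from the list in Sect. 3 that do not come in pairs; the pairs that look like ending-conditioned variants (-nta/-ninta, -nka/-ninka, -p/-pa, -y!/-niy!) are left out because the text does not state when each one applies. The names `CASE_SUFFIXES` and `one_dimension_forms` are hypothetical.

```python
# Case suffixes taken from the list in the text, restricted to those
# without a paired variant (an assumption of this sketch).
CASE_SUFFIXES = ["hina", "kama", "man", "manta", "paq", "pi", "pura", "rayku", "ta", "wan"]


def one_dimension_forms(noun):
    """Return the forms obtained by adding exactly one case suffix."""
    return [noun + suffix for suffix in CASE_SUFFIXES]


for form in one_dimension_forms("wasi"):
    print(form)
```

Each printed form (wasihina, wasikama, wasiman, ...) corresponds to one path through a one-layer grammar.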

For the combination of two nominal suffixes, we have developed an algorithm (see Duran 2013a) which yields 517 valid agglutinations, some of which appear in the following paradigm:

For the three-layer combinations of nominal suffixes we have obtained 2,108 valid ones, and for the four-layer combinations we obtain fewer than 1,800.
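The validity rules behind these counts are described in Duran 2013a and are not reproduced here. Purely as an illustration of the general idea (enumerate suffix sequences and keep only those that respect an ordering constraint), the Python sketch below assigns each of six suffixes a hypothetical position slot, inferred only from the order in which they appear in the six-suffix form of wasi discussed in this section. The `SLOT` table and the "strictly increasing slot" rule are assumptions of the sketch, not the authors' algorithm, so its counts are not comparable to 517 or 2,108.

```python
from itertools import permutations

# Hypothetical slot numbers, inferred from the order cha-lla-ykichik-paq-hina-raq.
SLOT = {"cha": 0, "lla": 1, "ykichik": 2, "paq": 3, "hina": 4, "raq": 5}


def is_valid(chain):
    """A chain is accepted when its slot numbers strictly increase (assumed rule)."""
    slots = [SLOT[suffix] for suffix in chain]
    return all(a < b for a, b in zip(slots, slots[1:]))


def valid_chains(layers):
    """All accepted suffix chains with the given number of layers."""
    return [chain for chain in permutations(SLOT, layers) if is_valid(chain)]


def inflect(stem, chain):
    """Concatenate the stem and the suffixes of a chain."""
    return stem + "".join(chain)


print(len(valid_chains(2)))                   # accepted two-layer chains
print(inflect("wasi", valid_chains(6)[0]))    # the single six-layer chain
```

With this toy table, half of the 30 ordered pairs are rejected, which shows how an ordering constraint prunes the raw combinatorial space.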

Nominal inflections like wasi-cha-lla-ykichik-paq-hina-raq, “it is something that would look nice in your house”, are generated by paradigms containing six agglutinated suffixes like:

Quechua has no grammatical gender. Natural gender is differentiated in lemmas, e.g. maqta (boy), pasña (girl), or by noun phrases, e.g. urqu allqu (male dog), china allqu (female dog). Most kinship terms distinguish the sex of both persons involved in the relationship:

3.2 Properties of the Noun Dictionary

According to their linguistic attributes, words belong to different hierarchical ontology classes and subclasses. For many entries we have also provided syntactic-semantic properties. (Remark: most of the examples come from the QU-FR or QU-SP dictionaries (Itier 2011; Perroud 1970) and, in some cases, we give their English translation.) For instance:

  1. QU allqu (FR chien) is classified as a common noun (imapa sutin IS, in QU), vertebrate animal (wasa tulluyuq quñi yawarniyuq WTQY, in QU), mammal (ñuñuq Ñ, in QU).

  2. QU pacha (EN cloth) is classified as common noun (IS, in QU), pachakunapaq (PA, in QU), clothing, soft thing made of fabric, leather, and so on.

  3. QU llaqta (EN city) is classified as a common noun (imapas sutin IS, in QU), proper name (runa kaqlla sutiyuq RKS, in QU) denoting a geographic (allpaman qatiq, AQ in QU) place, geographical entity, and geographical location.

Let us comment on the properties accompanying some entries of the sample presented in Fig. 2.

Fig. 2. A sample of the dictionary of nouns

  1. allqu is marked (N) for noun; it inflects according to the morphological paradigm NVOCAL (FLX=NVOCAL) for nouns ending in a vowel; the mark FR gives the French translation of the entry: “chien” (dog). We add some semantic properties: common noun (NC in FR), mammal (MAMIF, for FR mammifère); the entry can derive verbs with the suffixes -yay and -chay;

  2. chakra is a noun (N) which inflects according to the morphological paradigm NVOCAL (FLX=NVOCAL) for nouns ending in a vowel, corresponding to the FR noun “terrain agricole” (cultivated field), with the semantic properties common noun (NC in FR) and geographical location (GEO in FR); it can derive a verb with the suffix -yay;

  3. Inca ñan is a noun (N) which inflects according to the morphological paradigm NCONSO (FLX=NCONSO) for nouns ending in a consonant, a proper noun (NPROP) corresponding to the FR noun Chemin de l’Inca (Inca Trail), a multi-word unit defined as unambiguous (UNAMB);

  4. pacha is a noun (N) which inflects according to the morphological paradigm NVOCAL (FLX=NVOCAL) for nouns ending in a vowel, corresponding to the FR noun “la Terre” (the earth), with the semantic properties common noun (NC) and geographical location (GEO); it can derive a verb with the suffix -yay; it differs from the following homonym;

  5. pacha is a noun (N) which inflects according to the morphological paradigm NVOCAL (FLX=NVOCAL) for nouns ending in a vowel, corresponding to the FR noun “le temps” (time), with the semantic properties common noun (NC) and relating to time (TP); it does not derive a verb; this form has the following third homonym:

  6. pacha is a noun (N) which inflects according to the morphological paradigm NVOCAL (FLX=NVOCAL) for nouns ending in a vowel, corresponding to the FR noun “vêtement” (cloth), with the semantic properties common noun (NC) and relating to clothing (VTM); it can derive a verb with the suffix -yay.

Note that each QU entry corresponds to one FR transfer (translation), which yields disambiguated words. For instance, pacha=temps (time) and pacha=vêtement (clothing) are separate entries: the first refers to time (TP) and the second denotes a concrete thing made of fabric, leather or another material that is worn (VTM). We thus try to disambiguate at the dictionary level by enriching the lexicon with syntactic-semantic properties; this is why we try to have as many entries as there are meanings for the same word in the source language. The compiled Nquechua.nod dictionary is able to recognize more than one million inflected forms deriving from 1500 lemmas.
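The one-entry-per-meaning design can be pictured as a lookup filtered by semantic property. The Python sketch below uses the three pacha entries and the property codes described above (GEO, TP, VTM); the list `ENTRIES`, the function `translations` and the data layout are assumptions made for illustration and do not reflect how NooJ stores or queries its dictionaries.

```python
# Toy lexicon built from the entries discussed in the text.
ENTRIES = [
    {"lemma": "pacha", "fr": "la Terre", "features": {"NC", "GEO"}},
    {"lemma": "pacha", "fr": "le temps", "features": {"NC", "TP"}},
    {"lemma": "pacha", "fr": "vêtement", "features": {"NC", "VTM"}},
    {"lemma": "allqu", "fr": "chien", "features": {"NC", "MAMIF"}},
]


def translations(lemma, required=()):
    """French translations of a lemma whose entry carries all required properties."""
    wanted = set(required)
    return [e["fr"] for e in ENTRIES if e["lemma"] == lemma and wanted <= e["features"]]


print(translations("pacha"))            # ambiguous: three candidate entries
print(translations("pacha", {"TP"}))    # narrowed down by the time property
```

Without a constraint the lemma stays three-way ambiguous; a single property supplied by the context is enough to select one entry.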

4 The Electronic Dictionary of Quechua Verbs

The electronic dictionary of simple verbs contains 1,472 entries. It includes neither compound verbs nor phrasal verbs, which are presented in another dictionary. Each verb is assigned an inflectional paradigm. For instance, the entry rimay (to talk) inflects according to the paradigm class V_TR, so the entry becomes rimay,V+FR=“parler”+FLX=V_TR, whose structure is relatively complex, as we will see.

V_TR contains, among other things, the paradigms of conjugation in the present and future tenses, PR and FUT:

It also contains some syntactic and semantic information, like the two main classes of verbs:

Transitive (TR): rimay to talk, Intransitive (ITR): mikuy to eat.

The intransitive class is quite small: it contains fewer than one hundred stems. The class of impersonal verbs (INP) includes those relating to weather: paray (to rain), lastay (to snow).

As we showed in a previous work (Duran 2013 and 2014), a typical Quechua inflected verbal form has the following structure:

<V><IPS><PR ENDING><PPS>, where:

  • V: Verb stem,

  • IPS: Interposed suffixes (placed between the lemma and the ending),

  • PPS: Post-posed suffix (placed after the ending),

  • PR ENDING: the set of seven present tense personal endings (which behave as fixed points during the inflections).

Examples:

In these examples, each class of suffix intervenes only once, on either side of the ending. However, Quechua grammar allows several layers of IPS and PPS. (For a more detailed explanation of the construction of valid agglutinations of IPS and PPS suffixes and their mixed patterns, see Duran 2014.)

In fact, considered separately, there are 231 valid combinations of two PPS, 132 of three PPS and 92 of four PPS. Here are some samples of the corresponding paradigms:

On the other hand, a very large number of inflected forms mix one or more layers of IPS with one or more PPS; we call them mixed inflections (MIFLX).

For instance, in the form rima-ri-lla-nki-man-raq, ‘I think you should talk before’, we have two IPS suffixes (-ri, -lla) and two PPS suffixes (-man, -raq). Likewise, in the form miku-cha-ku-na-lla-n-paq, ‘hoping that she will kindly eat it’, we have four IPS suffixes (-cha, -ku, -na, -lla) and one PPS suffix (-paq) (4 + 1).
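The structure stem + interposed suffixes + personal ending + post-posed suffixes can be made explicit by locating the ending in an already segmented form. The Python sketch below does this for the two forms above. The set `PRESENT_ENDINGS` is an assumption of this sketch (the seven endings are not listed in this section), and `split_form` is a hypothetical helper that expects hyphen-segmented input; it does not perform morphological segmentation itself.

```python
# Assumed set of the seven present-tense personal endings (not given in the text).
PRESENT_ENDINGS = {"ni", "nki", "n", "nchik", "niku", "nkichik", "nku"}


def split_form(segmented):
    """Split 'stem-suf-...-ending-suf-...' around the first personal ending."""
    stem, *suffixes = [part.strip() for part in segmented.split("-")]
    for i, suffix in enumerate(suffixes):
        if suffix in PRESENT_ENDINGS:
            return {
                "stem": stem,
                "ips": suffixes[:i],       # suffixes before the ending
                "ending": suffix,
                "pps": suffixes[i + 1:],   # suffixes after the ending
            }
    raise ValueError(f"no personal ending found in {segmented!r}")


print(split_form("rima-ri-lla-nki-man-raq"))
print(split_form("miku-cha-ku-na-lla-n-paq"))
```

The first form splits into two interposed and two post-posed suffixes, the second into four interposed and one post-posed suffix, matching the counts given in the text.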

In order to generate all these forms automatically, we have constructed a series of mixed paradigms which we denote V_MIXn. We present the details for V_MIX1, which contains 2*2 = 4 sub-grammars (2 for the IPS consisting of one layer, ending in a vowel or a consonant, and 2 for the PPS consisting of one layer placed after the endings, ending in a vowel or a consonant):

Thus, summarizing, one of the dominant paradigms for the inflection of a verb contains all these grammars:

4.1 Verb-Verb Derivations: Generating New ULAVs

Among the IPS, we find a subset of 27 suffixes which derive a verb from another verb. This strategy allows Quechua to generate new ULAVs. These new forms can be conjugated as if they were simple verbs. We name this set “SIP_DRV”.

For example, the NooJ grammar of Fig. 3 generates all the new ULAVs for the verb rimay shown in Fig. 4.

Fig. 3. Graphical grammar for derivation of a transitive verb

Fig. 4. ULAVs obtained by derivation of the verb rimay (to talk)

We can obtain combinations of two, three or four IPS suffixes. We may obtain in this way more than 2,000 derived forms for a single verb. They look like the ones appearing in the following sample which includes different layers.

An important question is what these generated forms mean. Are they only theoretical forms with no tangible sense, or are they currently used in daily conversation? Many are in current use and are meaningful to speakers, but their translation requires long descriptions. For instance, we have the translation of the entry:

as follows: ‘a kind demand to start talking on behalf of a third person’.

On the other hand, if we attempt an automatic translation based on the meaning of each suffix participating in the combination, we may use the ‘meaning value’ of each of the suffixes noted in the entry, and put them in grammatical order as follows:

The symbols used carry the following meanings:

Thus, ‘a demand + to do the action of talking + done at present + executed gently on behalf of a third person + to start now’ should be equivalent to ‘a kind demand to start talking on behalf of a third person’, which seems to be the case.

The complete translation of the new ULAVs is a real challenge for NLP and MT research in Quechua.

4.2 Nominalization

The nominalization suffixes are placed at the end of the verbal lemma (for details on their construction, see Duran 2013b). Nominalized forms are generated by the following set of suffixes:

As we can see in the following examples, certain verbal stems can be nominalized by more than one of them.

These forms can be generated by the grammar shown in Fig. 5:

Fig. 5. Nominalization grammar

Gathering all this information concerning the verbs, we have built the dictionary of verbs contained in the Quechua Module for NooJ. A sample is shown in Fig. 6:

Fig. 6. The verb dictionary of the module

5 Dictionaries for Other POSs

We have included in the module a dictionary of adjectives and another for other POSs, containing adverbs, pronouns and numerals. The sets of suffixes that inflect each category are given separately. For example, the following sets correspond to adjectives and pronouns:

For instance, here is a sample of paradigm for adjectives ending in a vowel:

This will generate 56 inflected one-dimension forms for the adjective taksa (small) as follows:

As in the case of nouns, Quechua allows agglutination of adjectival suffixes up to seven layers. The following sample shows forms with two suffixes:

Applying methods similar to those used for nouns and verbs has led to hundreds of paradigms for the inflection of these categories with two or more agglutinated suffixes. Figure 7 shows a sample of the dictionary of pronouns.

Fig. 7. The dictionary of pronouns

6 Some Additional Grammars to the Library

We present, as an illustration, three graphical grammars contained in the Quechua NooJ Module: Pachayupay.nog (dates in QU), NumQU.nog (numbers in QU) and Future.nog (the future in QU).

6.1 The Dates and Numerals in Quechua

Recognition and annotation of the dates is done by the following grammar (Fig. 8):

Fig. 8. Pachayupay.nog grammar (the dates in QU)

The grammar in Fig. 9 annotates numeric determinants written in full letters for the numbers from 1 to 999999: from “huk” to “isqun pachak isqun chunka isqunniyuq waranqa isqun pachak isqun chunka isqunniyuq”.

Fig. 9. Annotation of numeric determinants in full letters
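The range covered by the numeral grammar can be illustrated in the generation direction with a short Python sketch. Only huk, isqun, chunka, pachak, waranqa and the form isqunniyuq are attested in the text; the spellings of the digits 2 to 8 and the rule that a final units digit takes -yuq after a vowel and -niyuq otherwise are assumptions of this sketch. Numbers whose last non-zero part is a ten or a hundred (110, 1,100 and the like) are left unmarked here, since the text does not show how they are written. The function names are hypothetical.

```python
# Digit names; only "huk" and "isqun" are attested in the text, the rest are assumed.
UNITS = ["", "huk", "iskay", "kimsa", "tawa", "pichqa", "suqta", "qanchis", "pusaq", "isqun"]


def with_yuq(word):
    """Mark an added units digit: -yuq after a vowel, -niyuq otherwise (assumed rule)."""
    return word + ("yuq" if word[-1] in "aiu" else "niyuq")


def below_thousand(n):
    """Words for 1..999 as a list."""
    hundreds, rest = divmod(n, 100)
    tens, units = divmod(rest, 10)
    words = []
    if hundreds:
        words += ([UNITS[hundreds]] if hundreds > 1 else []) + ["pachak"]
    if tens:
        words += ([UNITS[tens]] if tens > 1 else []) + ["chunka"]
    if units:
        # A units digit is marked only when it follows a higher term.
        words.append(with_yuq(UNITS[units]) if words else UNITS[units])
    return words


def quechua_numeral(n):
    """Spell out an integer from 1 to 999999."""
    if not 1 <= n <= 999_999:
        raise ValueError("supported range is 1..999999")
    thousands, rest = divmod(n, 1000)
    words = []
    if thousands:
        words += (below_thousand(thousands) if thousands > 1 else []) + ["waranqa"]
    if rest:
        tail = below_thousand(rest)
        if words and rest < 10:
            tail = [with_yuq(tail[0])]  # e.g. 1001: units digit directly after waranqa
        words += tail
    return " ".join(words)


print(quechua_numeral(1))
print(quechua_numeral(999_999))
```

The two printed values reproduce the endpoints quoted above, which is the only check the text itself allows.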

6.2 Recognition Grammar for the Future Tense

(See Fig. 10)

Fig. 10. Future Quechua.nog grammar (The future in QU)

7 Linguistic Analysis

The linguistic resources that we have presented can be applied to analyze texts. Figure 11 shows an extract of the results of the concordance query <N>+<V+POL1><ADV> applied to a text of Quechua tales. It recognizes and annotates 1,320 forms.

Fig. 11. An extract from the results of the query <N>+<V+POL1><ADV>

8 Conclusion and Perspectives

In this paper we have described the first version of the Quechua Module for NooJ. We have seen how the electronic dictionaries are built. We have shown different grammars and how they interact with the dictionaries and the morphological paradigms. We have also seen a simple example of the application of these grammars to a text. The dictionary of simple verbs was extracted from M. Duran's Quechua-Français dictionary (Duran 2009). Our next step will be to enhance the French-Quechua verb dictionary by including our current research on the electronic bilingual French-Quechua dictionary, which contains the translation of the 25,000 French verbal senses of Dubois and Dubois-Charlier (2000). We also expect to add some syntactic-semantic grammars and disambiguation grammars.