
1 Introduction

The Prague Dependency Treebank (PDT) has a multi-layered scenario designed on the theoretical basis of Functional Generative Description (FGD). Though the theoretical framework itself focuses mainly on syntactic issues, the PDT annotation project started with annotation at the morphological layer. Information included at this layer was extensively used during annotation at both the layer of surface syntax and the deep-syntactic layer (tectogrammatics).

In this paper, the formal approach to Czech inflectional morphology is introduced first (see Sect. 2). An overview of tools for morphological analysis and disambiguation is followed by a description of the part-of-speech (POS) tags and morphological lemmas. The core of the paper presents the annotation of morphological categories in PDT within the theoretical framework of FGD (Sects. 3.1 and 3.2). A lemma and a positional POS tag capturing formally expressed inflectional categories were assigned manually to each token at the morphological layer (Sect. 3.3), and reinterpreted in a semi-automatic procedure during the annotation at the tectogrammatical layer; here, meanings of semantically relevant morphological categories were represented as values of special attributes (called grammatemes) assigned to nodes of the tectogrammatical tree (Sect. 3.4). The PDT annotation scenario served as one of the resources for other treebanks mentioned in Sect. 3.5.

In Sect. 4, recent topics are outlined that are immediately connected with the presented approach to Czech morphology, namely named entity recognition in Czech, formemes encoding morpho-syntactic information in the dependency-based machine translation system, and development of a lexical database of derivational relations based partially on information provided by the morphological analyser.

2 Computational Morphology of Czech

2.1 Tools for Morphological Analysis and Disambiguation

Czech is a Slavic language with a complex system of both inflectional and derivational morphology. Though the traditional separation of inflections and derivations, which is documented in influential grammars of Czech, has been partially overcome in some NLP approaches to Czech, the main focus is still on inflectional morphology.

This section is limited to morphological analysis and morphological disambiguation (tagging) as two subtasks of morphological processing of Czech;Footnote 1 the former of them consists in assigning pairs of a tag and a lemma to an individual word form (usually regardless of the context) while the latter subtask is to select a single tag–lemma pair for the respective word form, mostly with respect to a (close) context.

A computational approach to Czech morphology dates back to the 1990s; cf. the first experiments in automatic morphological analysis and disambiguation of Czech by Hladká and Hajič [13, 18, 23]. Morphological analysis was based on the Czech morphological dictionary (now published under the name MorfFlex CZ; [14]), which contains more than 350 thousand manually entered entries; the recogniser recognises about 12 million Czech word forms.

For first tagging experiments [23], it was possible to use manually annotated data, thanks to a pioneering corpus annotation project which was carried out at the Institute of the Czech Language of the Academy of Sciences of the Czech Republic from 1971 to 1985 (the corpus was called Korpus věcného stylu ‘Practical Corpus’ and, later on, converted into the Czech Academic Corpus with morphological and analytical annotation compatible with PDT; [24, 66, 67]).

Table 1. Comparison of the taggers according to their accuracy on Czech (based on [51, 56])

The next, feature-based tagger was trained already on PDT data, which were manually annotated with positional POS tags and lemmas (Sects. 2.2 and 3.3). The tagger was based on a statistical algorithm with an exponential model [11], and distributed, along with a tool for morphological analysis, as a part of the PDT 2.0 release [16]. An implementation based on Hidden Markov Models is available as well [29].

In line with efforts to develop and to improve POS taggers for English and other languages inspired by Collins [6] and others, a tagger based on averaged perceptron, called Morče (an acronym of Morfologie češtiny ‘Morphology of Czech’; [68]), was published in 2006. The Morče tagger was trained on manually annotated data of PDT, achieving state-of-the-art performance on Czech; later on, it was used in experiments combining it with the feature-based tagger, the HMM tagger and a rule-based component [52], and in semi-supervised training experiments [51].Footnote 2 The semi-supervised version of the Morče tagger outperformed its original implementation as well as the combination with other taggers; see Table 1.

The most recent implementation, MorphoDiTa (Morphological Dictionary and Tagger; [53, 56]), is an open-source tool for morphological analysis, tagging, and lemmatisation as well as for tokenisation and morphological generation; it is available along with trained linguistic models.

Table 2. Positions of the positional POS tag

The feature-based tagger and the Morče tagger were used for morphological processing of large (100,000,000+ tokens) corpora of the SYN series, built at the Institute of the Czech National Corpus.Footnote 3 Experiments with rule-based disambiguation of large corpus data have been carried out as well [31, 36, 37, 39]. More recently, improvements in tagging have been reported for a combined disambiguation system including the Morče tagger and a rule-based component [40]; compare previous approaches to combining statistical and rule-based methods in [15, 50], or [52].

All the tools described above use compact tags or, predominantly, positional POS tags (both described in Sect. 2.2) as the output tag format.

An alternative system of encoding Czech morphology has been developed in the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University in Brno, and implemented in the ajka analyser, which provides both inflectional and (to a limited extent) derivational analysis of Czech based on a large-coverage dictionary [44, 45].

Last but not least, a weakly supervised (resource-light) approach to morphological analysis and tagging should be mentioned, which substantially reduces the need for cost-intensive manual input [8, 20]. Though weak supervision is often accompanied by lower accuracy, such approaches are advantageous especially for under-resourced languages.

2.2 Tag Sets for Czech, Positional POS Tag and Morphological Lemma Used in the Prague Dependency Treebank

There have been several tag sets used for Czech. From the chronological perspective, the tag set used in the original annotation of the Czech Academic Corpus (CAC; see Sect. 2.1) should be mentioned first [66, 67].

In the original CAC tag set,Footnote 4 tags of at most eight positions were used. At the first and second position, the part-of-speech class of the token was specified; the remaining positions were associated with morphological categories that are relevant for the particular part-of-speech class. Thus, for instance, the fourth tag position encodes mood with verb forms but gender with nouns, adjectives, pronouns, and numerals. The values to be filled in at a particular position were defined with respect to the part-of-speech class as well and encoded with digits. Therefore, for instance, the same digit in the same position is to be interpreted differently with adjectives and with verbs. Compare the original CAC tags assigned to the tokens Pokládáte ‘(you) find’, za ‘for’, and standardní ‘standard’ (the first three tokens from the sentence analysed in Table 3) and their interpretation:

Pokládáte

5251_19

verb – imperfective – 2nd person plural – indicative present active – [imperative:default] – one-word form – gender not expressed

za

774

preposition – primary – with accusative

standardní

22_414

adjective – primary – [subclass:default] – neuter – singular – accusative
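
The position-dependent reading of CAC tags can be sketched as a lookup keyed by part-of-speech class and tag position. The sketch is illustrative only: the table contains just the values that occur in the three examples above, the underscore is read as the default value, and the names CAC_VALUES and decode_cac are hypothetical.

```python
# Values observed in the three example tags only; the real CAC tag set is larger.
# Key: first tag character (POS class) -> one {symbol: meaning} dict per position.
CAC_VALUES = {
    "5": [  # verb
        {"5": "verb"},
        {"2": "imperfective"},
        {"5": "2nd person plural"},
        {"1": "indicative present active"},
        {"_": "imperative: default"},
        {"1": "one-word form"},
        {"9": "gender not expressed"},
    ],
    "7": [  # preposition
        {"7": "preposition"},
        {"7": "primary"},
        {"4": "with accusative"},
    ],
    "2": [  # adjective
        {"2": "adjective"},
        {"2": "primary"},
        {"_": "subclass: default"},
        {"4": "neuter"},
        {"1": "singular"},
        {"4": "accusative"},
    ],
}


def decode_cac(tag):
    """Interpret every symbol relative to the POS class and its position."""
    positions = CAC_VALUES[tag[0]]
    return [
        meanings.get(symbol, f"unknown value {symbol!r}")
        for meanings, symbol in zip(positions, tag)
    ]


# The fourth position carries mood for the verb but gender for the adjective.
print(decode_cac("5251_19")[3])
print(decode_cac("22_414")[3])
```

In the adjective tag, the digit 4 also occurs twice with different readings (neuter in the fourth position, accusative in the sixth), which is why the lookup has to be indexed by position as well as by part-of-speech class.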

A system of compact tags was defined by Hajič [11], and used in compilation of the morphological dictionary (MorfFlex CZ; [14]) and in tagging experiments, e.g. [13]. This tag system works with positions, specifying a combination (a “pattern”) of relevant morphological categories (each associated with a tag position) for each part-of-speech (sub)class.Footnote 5 Compact tags for the same three tokens should be interpreted as follows:Footnote 6

Pokládáte

VPp2A

verb – indicative present – plural – 2nd person – affirmative

za

R4

preposition – with accusative

standardní

ANS41A

adjective – neuter – singular – accusative – no gradation – affirmative

Table 3. Morphological lemma and positional POS tag assigned to tokens of the sentence Pokládáte za standardní, když se s Mečiarovou vládou nelze téměř na ničem rozumně dohodnout? (lit.: Find for standard, when REFL with Mečiar’s government is-not-possible almost on nothing reasonably agree?) ‘Do you find it standard when almost nothing can be reasonably agreed on with Mečiar’s government?’ at the morphological layer of PDT, and conversion of the positional POS tags into the Interset interlingua attribute–value pairs (last column)

As an alternative to compact tags, a system of positional POS tags was developed and gradually preferred to the former one; cf. Hajič [11].Footnote 7 Positional POS tags, along with two-component lemmas (described below), were assigned to the PDT data at the morphological layer; see Sect. 3.3.

A positional POS tag consists of 15 positions: The part of speech and a (functionally or formally delimited) subpart of it are encoded in the first and second positions of the tag, respectively. Positions 3 to 12 are each associated with a particular morphological category, positions 13 and 14 are reserved for a potential extension of the tag information, and the 15th position captures information of variants, register features etc.; see Table 2.Footnote 8 Part-of-speech classes as well as values of morphological categories were delimited in accordance with their description in the academic grammar of Czech [25].
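
Because every category has a fixed position, a positional tag can be unpacked by index alone. The following is a minimal sketch under stated assumptions: the position names follow the naming commonly used in the PDT documentation and are not given in this section, the sample tag is an illustrative value for a 2nd person plural present-tense verb form, and the helper name parse_positional_tag is hypothetical.

```python
# Assumed names of the 15 positions (positions 13 and 14 are reserved).
POSITION_NAMES = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def parse_positional_tag(tag):
    """Map a 15-character tag to {category: value}, skipping '-' (not applicable)."""
    if len(tag) != len(POSITION_NAMES):
        raise ValueError(f"expected {len(POSITION_NAMES)} positions, got {len(tag)}")
    return {name: value for name, value in zip(POSITION_NAMES, tag) if value != "-"}


# Illustrative tag for a form such as Pokládáte (assumed value).
print(parse_positional_tag("VB-P---2P-AA---"))
```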

In spite of combinatorial restrictions implied by the language itself,Footnote 9 there is a considerable number of combinations of the category values attested in the language data; cf. 1,574 different positional POS tags (and 71,503 different morphological lemmas) assigned to 1,957,247 tokens of the PDT 3.0 data annotated at the morphological layer. The positional POS tag, which allows for a combination of values of single categories, thus makes it possible to describe the rich inflection in an economical way (compare, for instance, the POS tag set used in the Penn Treebank project [32]).

Besides a positional POS tag, each token was assigned a morphological lemma composed of two parts at the morphological layer of PDT. The first part of the lemma (the so-called lemma proper) is a string of characters mostly corresponding to the base form of the word (namely, nominative singular form of nouns, nominative singular masculine of pronouns and numerals, nominative singular masculine positive form of adjectives, infinitive form of verbs, and positive form of adverbs).Footnote 10 Since the lemma was proposed as a unique identifier, ambiguous base forms were disambiguated with a digit attached by a hyphen to the string of characters (cf. the lemmas assigned to the prepositions za, s, and na in Table 3).

The second part of the lemma is a technical suffix. It is attached to the lemma proper by an underscore. Technical suffixes do not occur with most lemmas; however, if needed, more technical suffixes are possible with a single lemma. The suffix contains either a comment on verbal aspect (cf. the suffix of the verb lemma pokládat in Table 3), or a comment explaining the respective meaning (suffix of the pronoun se), a label identifying the named entity type (_;S with the lemma Mečiarův identifying surnames), or derivational information (namely, formally encoded changes to be carried out to arrive at the base word; cf. with the same lemma: two characters should be removed in order to get the base word Mečiar).
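
The two-part structure of the lemma can be taken apart mechanically, as the following sketch suggests. It assumes that the lemma proper contains no underscore of its own; the disambiguating digit in za-1 and the notation ^(*2) for the derivational suffix are assumptions made for illustration, and parse_lemma is a hypothetical helper.

```python
import re


def parse_lemma(lemma):
    """Split a morphological lemma into lemma proper, index, and technical suffixes."""
    proper, *suffixes = lemma.split("_")
    # A trailing "-<digit(s)>" distinguishes homonymous base forms.
    match = re.fullmatch(r"(.+)-(\d+)", proper)
    if match:
        base, index = match.group(1), int(match.group(2))
    else:
        base, index = proper, None
    return {"base": base, "index": index, "suffixes": suffixes}


print(parse_lemma("Mečiarův_;S"))        # named-entity suffix from the text
print(parse_lemma("za-1"))               # assumed disambiguating digit
print(parse_lemma("Mečiarův_;S_^(*2)"))  # assumed notation for derivational information
```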

Motivated by the needs of parsing, machine translation and other NLP subtasks, a method for conversion of different sets of POS tags has been developed: Interset is a set of universal morpho-syntactic features to which tag sets used in different corpora can be converted; it has been proposed as a sort of interlingua for POS tags [71]. The most recent Interset version covers 64 different tag sets of 37 languages [70]. See the positional POS tags used in PDT converted into the Interset attribute–value structures in Table 3.

3 Annotation of Morphological Categories in the Prague Dependency Treebank

3.1 Theoretical Background of the Prague Dependency Treebank: Functional Generative Description

Functional Generative Description is a theoretical linguistic framework formulated in Prague in the 1960s [48, 49]. It is rooted in the structuralist approach of the Prague Linguistic Circle; however, it has responded to similar stimuli as foreign approaches with fundamentally different backgrounds.

FGD decomposes the language system into several levels;Footnote 11 the “lowest” of them corresponds to linear text (either spoken or written) whereas the “highest” level represents the linguistic meaning of the sentence and is modelled as a dependency tree structure.Footnote 12 Between these two levels (phonetic and tectogrammatical level, respectively), another three levels were discerned in the original proposal, namely the morphonological level, morphological level, and level of surface syntax.

The theoretical fundamentals of FGD, to which – besides multiple levels – the dependency approach to syntax and the theory of valency belong, served as a starting point for the design of the annotation scenario of PDT [5]. Out of the set of levels differentiated in FGD, three layers have been included in the PDT scenario: the morphological layer, surface-syntactic layer, and tectogrammatical layer. Differences between the layout of the PDT layers and levels in FGD were motivated by the needs of NLP tasks, e.g. parsing, and were analysed by Štěpánek [65].

The formalised approach to morphology as a separate level of the language system model and the description of the meanings of morphological categories at the tectogrammatical level is a stable part of the FGD frameworkFootnote 13 and has been adopted into the annotation scenario of PDT as well.

Fig. 1. Sentence Pokládáte za standardní, když se s Mečiarovou vládou nelze téměř na ničem rozumně dohodnout? ‘Do you find it standard when almost nothing can be reasonably agreed on with Mečiar’s government?’ annotated at the analytical layer of PDT 3.0. Nodes are labelled with word forms and surface-syntactic functions (e.g., Sb for subject, Adv for adverbials, the Aux labels are assigned to different types of function words)

3.2 History of the Prague Dependency Treebank

The Prague Dependency Treebank is a collection of Czech newspaper texts from the 1990s, processed at four layers. At the first (non-annotation) layer, called the word layer, the source text is segmented into documents and paragraphs, and tokens are associated with unique identifiers. At the morphological layer, as the lowest annotation layer, each token is assigned a positional POS tag and a lemma; see Table 3. At the surface-syntactic (analytical) layer, the syntactic structure of each sentence is represented as a dependency-tree structure. Nodes of the analytical tree are in a one-to-one correspondence to tokens at the morphological layer and are labelled with surface-syntactic functions (such as subject Sb, object Obj etc.; Fig. 1). At the tectogrammatical layer (the highest layer of annotation), the underlying syntactic structure of the sentence is also represented as a dependency tree, which, however, differs from the analytical one in several aspects.

Every token annotated at the morphological layer has exactly one corresponding node in the analytical tree. The correspondence between the nodes of the tectogrammatical tree and the analytical tree, though explicitly recorded in the data in the form of cross-layer references, is not always one-to-one: only content words are represented as tectogrammatical nodes, and new nodes are constructed for deletions (cf. the node with the lemma #PersPron representing the pro-dropped subject pronoun of the verb pokládáte in Fig. 2) or for grammatical elements which do play a role in the syntactic structure but cannot be expressed in the surface shape of the sentence (see the #Cor node in Fig. 2, which is the subject of the infinitive dohodnout se and is relevant for coreference annotation). Nodes of the tectogrammatical tree were labelled with

  • semantic roles (functors; e.g. ACT for Actor, PAT for Patient, MANN for Manner),

  • labels defining the type of the respective node and its semantic part of speech (cf. the nodetype and sempos attributes),

  • meanings of morphological categories (grammatemes), and

  • labels identifying the node as an element of the topic or focus part of the sentence; see Fig. 2.

Non-dependency relations are annotated on the top of dependencies in the tectogrammatical tree; see the coreference arrow in Fig. 2. Annotation at the tectogrammatical layer is documented in [35].

There are four releases of the PDT data available: PDT 1.0, PDT 2.0, PDT 2.5, and PDT 3.0.Footnote 14 PDT 1.0 was published in 2001 and contains data annotated at the morphological layer and at the analytical layer [19]. Annotation of both types is available for 1,583 documents (containing 1,255,590 tokens in 81,614 sentences); there are also another 14 documents (469,652 tokens in 29,561 sentences) annotated at the morphological layer only and 314 documents (251,743 tokens in 16,649 sentences) with analytical annotation only. A sample of 3,490 tokens (in 203 sentences) with morphological and analytical annotation is annotated at the tectogrammatical layer as well.

The complete three-layer annotation is available for a large part of the data from PDT version 2.0 onwards. PDT 2.0, published in 2006 [16], contains 3,165 documents (with 833,195 tokens in 49,431 sentences) with morphological, analytical, and tectogrammatical annotations. Another 2,165 documents (with 670,544 tokens in 38,482 sentences) are annotated at the morphological and analytical layer, and for yet another 1,780 documents (with 453,508 tokens in 27,931 sentences) only morphological annotation is available in PDT 2.0. The data at each layer were divided into train data (app. 80 % of the data set with the respective annotation combination), development-test data (app. 10 %), and evaluation-test data (app. 10 %).
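
The three portions of PDT 2.0 can be added up as a quick consistency check. In the sketch below the figures are those quoted above; the dictionary and variable names are hypothetical. The token total agrees with the 1,957,247 morphologically annotated tokens reported for PDT 3.0 in Sect. 2.2.

```python
# (documents, tokens, sentences) per annotation combination in PDT 2.0
PDT20_PORTIONS = {
    "morphological + analytical + tectogrammatical": (3165, 833195, 49431),
    "morphological + analytical": (2165, 670544, 38482),
    "morphological only": (1780, 453508, 27931),
}

# Column-wise sums over the three portions.
documents, tokens, sentences = (sum(column) for column in zip(*PDT20_PORTIONS.values()))

print(documents, tokens, sentences)
```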

In PDT 2.5 and PDT 3.0 (released in 2011 and 2013, respectively),Footnote 15 the texts of PDT 2.0 are enriched with new annotations at the tectogrammatical and analytical layer, but neither the size of the data nor the portions of the data annotated at individual layers have changed; particular mistakes were corrected in the recent releases as well [3, 4]. The following annotations were new in the PDT 2.5 as compared to PDT 2.0:

  • annotation of multiword expressions at the tectogrammatical layer,

  • a new grammateme identifying a special usage of plural forms of nouns (pair/group meaning) at the tectogrammatical layer,

  • clause segmentation at the analytical layer.

For the PDT 3.0 release, the tectogrammatical layer was further modified:

  • changes in the modality grammatemes,

  • an extended annotation of coreference and bridging anaphora,

  • annotation of discourse relations,

  • genre specification.

Table 4. Values of the nodetype attribute assigned to each tectogrammatical node

3.3 Morphology as a Layer of Annotation in the Prague Dependency Treebank

As one can see from the history of the PDT releases, data of PDT were annotated at the morphological layer first. Each token was assigned a positional POS tag and a morphological lemma within a manual procedure which was preceded by an automatic morphological analysis.

The manual annotation was carried out by eight annotators [21]. Each file was annotated by two annotators in parallel; their task was a manual disambiguation of the results of the morphological analysis using the DA and LAW (Lexical Annotation Workbench) editors of morphological annotations.Footnote 16 When the lemma was not offered by the tagger, it was created manually by the annotator and, subsequently, included into the morphological dictionary. After the parallel annotation was finished, instances of disagreement were decided by a third annotator. See the morphological annotation of a sentence in Table 3.
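
The parallel annotation with a follow-up decision by a third annotator can be sketched as a simple merge. This is an illustration of the workflow only, not of the DA or LAW editors: the function name, the callback interface, and the placeholder tags are all hypothetical.

```python
def merge_parallel_annotations(first, second, adjudicate):
    """Merge two parallel annotations of the same tokens.

    first, second: lists of (lemma, tag) pairs, one per token.
    adjudicate: callable(index, a, b) -> (lemma, tag), standing in for the
    decision of the third annotator.
    Returns the merged annotation and the indices where the annotators disagreed.
    """
    if len(first) != len(second):
        raise ValueError("both annotations must cover the same tokens")
    merged, disagreements = [], []
    for index, (a, b) in enumerate(zip(first, second)):
        if a == b:
            merged.append(a)
        else:
            disagreements.append(index)
            merged.append(adjudicate(index, a, b))
    return merged, disagreements


# Placeholder tags TAG_A..TAG_C are invented for the example.
annotator_1 = [("za-1", "TAG_A"), ("standardní", "TAG_B")]
annotator_2 = [("za-1", "TAG_A"), ("standardní", "TAG_C")]
merged, disagreements = merge_parallel_annotations(
    annotator_1, annotator_2, adjudicate=lambda index, a, b: b
)
print(merged, disagreements)
```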

Annotation at the morphological layer was used during annotation at the analytical and, more importantly, at the tectogrammatical layer, being the main source of information for automatic assignment of grammatemes.

Morphological annotation, after a separate checking at this layer, was involved in the cross-layer checking of analytical and tectogrammatical annotations before the public release of the data. Štěpánek [64] gives examples of rather simple comparisons of POS tag values with surface-syntactic functions at the analytical layer and with functors at the tectogrammatical layer (e.g. with conjunctions), and describes checking of named entity information involved in the technical suffix of the morphological lemma against the tectogrammatical annotation, or a complex verification whether all valency slots defined by the valency lexicon are filled in with tectogrammatical nodes representing the requested word forms.

Table 5. Frequency of the nodetype values in the PDT 3.0 data annotated at all three layers

3.4 Morphological Meanings at the Tectogrammatical Layer

Following the Praguian tradition of distinguishing form and function, functions (meanings) of morphological categories are captured by grammateme attributes in the tectogrammatical tree. The inclusion of grammatemes into the tectogrammatical layer responds to the claim of self-containedness and unambiguity of the sentence representation at each layer. If, for instance, meanings conveyed by the grammatical number with nouns, degree of comparison with adjectives, or tense with verbs were not specified at the tectogrammatical layer, several semantically different sentences could be generated from a single tectogrammatical tree.

Since morphological meanings are conveyed only by some nodes of the tectogrammatical tree and, moreover, not all grammatemes are relevant for all nodes, tectogrammatical nodes were classified in two subsequent steps. First, eight general types of nodes were distinguished according to their functor and/or tectogrammatical lemma in a fully automatic procedure. Grammatemes are relevant for nodes of just one type (for complex nodes); cf. the nodetype values and their frequency in PDT 3.0 in Tables 4 and 5.

Second, complex nodes were subdivided into four groups, called semantic parts of speech (semantic nouns, semantic adjectives, semantic verbs, and semantic adverbs) within which 19 more specific subgroups were discerned automatically. Accordingly, the sempos attribute with 19 values was defined (Table 6). Each subgroup was associated with a set of relevant grammatemes.

Table 6. Frequency of the sempos values in the PDT 3.0 data annotated at all three layers

As annotation of grammatemes was the last task in the PDT 2.0 annotation procedure, it could profit from the annotation at lower layers as well as from annotations already done at the tectogrammatical layer (mainly from the tree structure, functors, and coreference).

Nearly 1,600,000 grammateme values in total (with more than 550 thousand complex nodes) were assigned at the tectogrammatical layer of PDT 2.0, most of them automatically. Manual annotation, carried out by two annotators in parallel, with a follow-up decision by a third annotator in cases of disagreement, is responsible for approximately 17,500 out of the grammateme values [42].

The set of grammatemes and values assigned at the tectogrammatical layer was based on the FGD framework [49]. However, the repertoire has been revisited and changed according to the recent linguistic research during the annotation of individual PDT releases. In this paper, we present the grammateme annotation which is available in PDT 3.0.

There are 15 grammatemes annotated at the tectogrammatical layer of PDT 3.0. Grammatemes number, gender, person, politeness, and typgroup were assigned to nodes classified as semantic nouns. The grammatemes degcmp, negation, numertype, and indeftype were annotated with semantic nouns and with semantic adjectives. Semantic adverbs were assigned grammatemes degcmp, negation, and indeftype. Semantic verbs were assigned a special subset of verbal grammatemes: tense, aspect, factmod, deontmod, diatgram, and iterativeness.
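
The assignment of grammatemes to semantic parts of speech listed above can be written down as a small lookup table. The sketch works at the level of the four groups only (in the data, relevance is defined for the 19 more specific subgroups); the names GRAMMATEMES_BY_GROUP and is_relevant are hypothetical.

```python
GRAMMATEMES_BY_GROUP = {
    "semantic noun": {
        "number", "gender", "person", "politeness", "typgroup",
        "degcmp", "negation", "numertype", "indeftype",
    },
    "semantic adjective": {"degcmp", "negation", "numertype", "indeftype"},
    "semantic adverb": {"degcmp", "negation", "indeftype"},
    "semantic verb": {
        "tense", "aspect", "factmod", "deontmod", "diatgram", "iterativeness",
    },
}

# Union over all groups: the full grammateme inventory.
ALL_GRAMMATEMES = set().union(*GRAMMATEMES_BY_GROUP.values())


def is_relevant(group, grammateme):
    """True if the grammateme is listed for the given semantic part of speech."""
    return grammateme in GRAMMATEMES_BY_GROUP[group]


print(len(ALL_GRAMMATEMES))
```

The union of the four sets contains exactly the 15 grammatemes mentioned in the text.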

Fig. 2. Sentence Pokládáte za standardní, když se s Mečiarovou vládou nelze téměř na ničem rozumně dohodnout? ‘Do you find it standard when almost nothing can be reasonably agreed on with Mečiar’s government?’ annotated at the tectogrammatical layer of PDT 3.0. Nodes are labelled with a tectogrammatical lemma, with a functor (e.g. ACT, MANN), topic-focus annotation (in front of the functor), a nodetype value (e.g., root or qcomplex), or a semantic part of speech and grammatemes (only with complex nodes, displayed under the functor). The predicate node of the tree (functor PRED) was assigned a sentence modality value (here, inter for interrogative sentences)

Seven out of the 15 grammatemes correlate with morphological categories which are traditionally addressed in the grammatical description of Czech. Nevertheless, the grammateme values mostly cannot be interpreted from a single word form (its POS tag): either a more complex structure including auxiliaries had to be involved in the value assignment procedure (cf. grammatemes tense, factmod, deontmod, or diatgram described below), or manual annotation was needed, for instance, to assign number with pluralia tantum, to identify absolute usage of comparative forms of adjectives and adverbs, or to identify polite usage of 2nd person plural verbs.

  • The number grammateme captures the number of entities to which the particular noun refers. In most cases, the value (sg or pl) correlates with the morphological category but is different, for instance, with pluralia tantum nouns (e.g., otevřel dveře.sg na terasu ‘he opened the door to the terrace’ vs. několikery dveře.pl ‘several doors’).

  • Values of the gender grammateme (anim for animate masculines, inan for inanimates, fem and neut) correspond to the morphological gender of nouns, but if the grammatical gender does not coincide with the natural gender, the grammateme value was chosen according to the former one (cf. the neuter noun děvče ‘girl’).

  • The person grammateme (values 1 for the speaker, 2 for the hearer, and 3 for a person/object it is talked about) was assigned with nodes representing pronouns. The grammateme values were non-trivially interpreted from agreement markers expressed by relevant verb forms.

  • Values pos (positive), comp (comparative), and sup (superlative) of the degcmp grammateme correspond mostly to the category of degree of comparison, but comparative forms with an absolute (non-comparative) meaning were identified manually and assigned a fourth value, acomp (e.g., starší žena ‘an elder(ly) woman’).

  • Values of the tense grammateme distinguish the presented actions/states according to whether they preceded the moment of utterance or another action (ant), followed it (post), or happened simultaneously with it (sim). If the particular node represented a more complex verb form, the grammateme value had to be interpreted carefully. For example, future verbal tense in Czech is expressed by a simple inflected form (with perfectives; dohodne se ‘(he) will-agree’), or by an auxiliary verb (imperfectives; bude pokládat ‘(he) will find’), or by prefixing (lexically limited; cf. the future form pojede ‘(he) will-go’ of the verb jet ‘to go’).

  • For the factmod grammateme, four meanings (values) were distinguished according to the inner structure of the mood category in Czech, namely, asserted for actions/states presented as given (mostly by an indicative verb form), potential for potential events (expressed by a present conditional form), irreal for events expressed by a past conditional, and appeal for required events (conveyed by an imperative form).

  • Values proc (processual/imperfective) and cpl (complex/perfective) of the aspect grammateme correlate with the aspect information captured by the technical lemma suffix at the morphological layer.
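
The interplay of automatic defaults and manual exceptions described above for the number and degcmp grammatemes can be sketched as follows. The mapping from tag values (S, P for number; 1, 2, 3 for degree) is an assumption about the positional tag values, and assign_grammateme is a hypothetical helper, not the procedure used in PDT.

```python
# Assumed default mappings from positional-tag values to grammateme values.
DEFAULTS = {
    "number": {"S": "sg", "P": "pl"},
    "degcmp": {"1": "pos", "2": "comp", "3": "sup"},
}


def assign_grammateme(name, tag_value, manual_value=None):
    """Use the manual decision when there is one, else the default derived from the tag."""
    if manual_value is not None:
        return manual_value
    return DEFAULTS[name][tag_value]


# Regular plural noun: the value follows the morphological number.
print(assign_grammateme("number", "P"))
# Plurale tantum referring to a single door (otevřel dveře): manual value sg.
print(assign_grammateme("number", "P", manual_value="sg"))
# Comparative form with absolute meaning (starší žena): manual value acomp.
print(assign_grammateme("degcmp", "2", manual_value="acomp"))
```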

Another four grammatemes are considered grammaticalised meanings in the FGD framework as well:

  • Values polite and basic of the politeness grammateme were assigned to personal pronouns to distinguish the polite form (Vy .polite jste se už přihlásil? ‘Have you.polite logged in already?’) from a common usage (Vy .basic jste se už přihlásili? ‘Have you.basic logged in already?’).

  • The typgroup grammateme was included into the grammateme system to capture the pair/group meaning (like in koupil si boty ‘(he) bought a-pair-of shoes’) expressed by plural forms; the pair/group meaning was delimited as another meaning of plural in Czech (besides the common usage referring to several single entities, cf. vystaveny byly jen pravé boty ‘only right shoes were displayed’; [58]).

  • The diatgram grammateme captures meanings subsumed under grammaticalised diatheses, which are expressed by different verbal forms with a scale of auxiliaries: act for active voice, pas for passive voice, res1, res2.1 and res2.2 for different types of resultative forms, recip for recipient diathesis, disp for verb forms expressing dispositional modality, and deagent for deagentive verb forms.

  • The deontmod grammateme was used to represent modal verbs as auxiliaries at the tectogrammatical layer; seven values were delimited according to modal meanings of necessity, possibility etc.

Although subsumed under the term grammatemes, the following attributes capture derivational morphology,Footnote 17 rather than inflection:

  • The iterativeness grammateme makes it possible to represent an iterative verb by the tectogrammatical lemma of its non-iterative counterpart.

  • The negation grammateme represents the negative meaning (expressed mostly by the ne- prefix) of nouns, adjectives and adverbs.Footnote 18

  • The indeftype grammateme made it possible to reduce pronouns and pronominal adverbs to a small set of lemmas at the tectogrammatical layer, exploiting the semantically relevant regularities within this closed class [62]. Cf. the node with the lemma co ‘what’ in Fig. 2, which represents the pronoun (na) ničem ‘(on) nothing’ (the negative semantic feature was captured by the negat value).

  • Similarly, the numertype is used to capture the specific meanings of different types of numerals (e.g. ordinal numerals, multipliers) that are represented by the tectogrammatical lemma of the corresponding cardinal numeral.

In addition to the approach described above for selected derivational relations captured by grammatemes, two types of highly regular derivatives, namely possessive adjectives and deadjectival adverbs, were converted into their base words, i.e., into nouns and adjectives, respectively. Since both these types of derivatives differ from their base words just in the function they play within the tectogrammatical structure,Footnote 19 it is sufficient to use the functor to encode the difference between the derived word and the base; see the nodes with the lemma Mečiar and rozumný ‘reasonable’ in Fig. 2.
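
The effect of both mechanisms on the tectogrammatical lemma can be illustrated with the three forms of the sample sentence discussed above. The sketch is a hand-filled illustration, not the conversion procedure used in PDT: the class TNode, the table EXAMPLES, and the function to_tnode are hypothetical, and functor values are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class TNode:
    """A minimal stand-in for a tectogrammatical node."""
    t_lemma: str
    grammatemes: dict = field(default_factory=dict)


# Surface form -> tectogrammatical representation, limited to cases named in the text.
EXAMPLES = {
    "ničem": TNode("co", {"indeftype": "negat"}),  # pronoun reduced to co + grammateme
    "Mečiarovou": TNode("Mečiar"),                 # possessive adjective -> base noun
    "rozumně": TNode("rozumný"),                   # deadjectival adverb -> base adjective
}


def to_tnode(form):
    """Look up a listed form; any other form is kept as its own lemma."""
    return EXAMPLES.get(form, TNode(form))


print(to_tnode("ničem"))
print(to_tnode("rozumně"))
```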

Possible extension of the annotation of derivational morphology at the tectogrammatical layer is discussed in Sect. 4.3.

3.5 PDT-Style Annotations in Other Treebanks

The Czech Academic Corpus, mentioned above in Sect. 2, was converted from its original annotation (carried out in the 1970s and 1980s) into the PDT annotation scheme after the PDT 2.0 release; cf. CAC 1.0 [67] and CAC 2.0 [66]. CAC 2.0 contains morphological and analytical annotation for nearly 500 thousand tokens (plus a further portion of data with morphological annotation only) and is now fully compatible with PDT.

Besides CAC, the PDT annotation scenario has also been used for Arabic [17] and English [12], has served as one of the resources for annotation schemes for Slovak (Slovak Treebank, which is a part of the Slovak National Corpus), Slovenian (Slovene Dependency Treebank),Footnote 20 Ancient Greek and Latin (Ancient Greek and Latin Dependency Treebanks),Footnote 21 and has been an inspiration for other treebanking projects.

In 2011, an important project to bring treebanks of different languages (some of them just mentioned) under a common annotation scheme was proposed under the acronym HamleDT (HArmonized Multi-LanguagE Dependency Treebank). The treebanks were harmonised into the Prague Dependencies annotation style (based on the analytical PDT annotation; [73]) and, recently, converted into Stanford Universal Dependencies [33]. Thirty treebanks are available in HamleDT 2.0 [43, 72].Footnote 22

4 Morphology in Named Entity Recognition, Dependency-Based Machine Translation, and in a Database of Derivational Relations in Czech

4.1 Named Entity Recognition in Czech

In a pilot approach to named entity (NE) classification and recognition, started only in 2007 [60], technical suffixes of morphological lemmas were used as an important resource for this task. Based on a survey of previous NE research using a low number of coarse-grained categories (such as [9]) on the one hand, or detailed categories (preferred in semantically oriented tasks, cf. [47]) on the other, a two-level classification has been proposed for Czech, which suits both robust processing and research interested in finer-grained categorisation.

At the first level of the classification, ten rough categories were distinguished; at the second level, these were further subclassified into 62 detailed categories. For instance, within the category of geographical names, subcategories of names of continents, states, towns, hydronyms etc. were discerned. This classification was used in the Czech Named Entity Corpus (CNEC), which consists of 6 thousand sentences (more than 150 thousand tokens) manually annotated with NE categories [57, 61]. The data were used for the development of several recognisers of NEs in Czech texts; cf. [26–28, 55, 60], and the most recent of them, NameTag [54, 56], which is an open-source tool for NE recognition, distributed along with trained linguistic models.

4.2 Formemes in Dependency-Based Machine Translation

The complex dependency deep-syntactic analysis has been used as a transfer layer in a machine translation system developed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague. The MT system, originally called TectoMT [75], has been extended with a number of modules into a modular NLP framework Treex, which is either available for installation from CPAN,Footnote 23 or can be run on-line under the LINDAT/CLARIN repository [46]. Recently, the Treex framework has been used, for instance, in the QTLeap European machine translation project.Footnote 24

The deep-syntactic analysis provided by the Treex framework has introduced a special type of attributes, called formemes, into the deep-syntactic tree. Formemes are node attributes in which the form of the word represented by the respective node is encoded by a combination of morphological and syntactic features. Taking the example of the prepositional phrase s (Mečiarovou) vládou in Fig. 2 and its English equivalent with (Mečiar’s) government, the formeme n:with+X is assigned to the tectogrammatical node representing the (source) phrase with government within the English-to-Czech machine translation, while the node representing the (target) phrase s vládou is assigned the formeme n:s+7, in which the morphological case (7 for instrumental) is specified in addition to the particular preposition. A complete list of formemes implemented in Treex can be found in [7].
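The internal structure of the two formemes quoted above can be made explicit with a small Python sketch. It is not part of Treex: the names Formeme and parse_formeme are hypothetical, and the parser assumes only the shape seen in n:with+X and n:s+7 (a part-of-speech label, a colon, a function word, a plus sign, and a final slot that holds X in the English example and the case number in the Czech one). Formemes of other shapes listed in [7] are not covered.

```python
from typing import NamedTuple


class Formeme(NamedTuple):
    """Hypothetical container for a formeme of the shape pos:word+slot."""
    pos: str            # e.g. "n" in the examples from the text
    function_word: str  # e.g. the preposition "with" or "s"
    case: str           # "X" in the English example, "7" in the Czech one


def parse_formeme(text: str) -> Formeme:
    """Split a formeme such as 'n:s+7' into its three parts."""
    pos, _, rest = text.partition(":")
    function_word, plus, case = rest.partition("+")
    if not (pos and plus and function_word and case):
        raise ValueError(f"unsupported formeme shape: {text!r}")
    return Formeme(pos, function_word, case)


source = parse_formeme("n:with+X")  # English side: with government
target = parse_formeme("n:s+7")     # Czech side: s vládou
```

Comparing source and target field by field shows what the transfer step has to supply in this example: a different preposition and a concrete morphological case.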

From the perspective of the PDT annotation scheme, information encoded in formemes is a combination of information involved in POS tags at the morphological layer and in surface-syntactic functions at the analytical layer of PDT with selected auxiliary words (e.g., prepositions).

4.3 Derivational Morphology in Czech

Besides a basic NE annotation, the technical suffix of the morphological lemma provides information on regular derivational relations as well.Footnote 25 In PDT, derivational information involved in the lemma suffix at the morphological layer was extended by derivational information captured in selected grammatemes or in functors at the tectogrammatical layer (see Sect. 3.4).

This rather preliminary approach to interconnecting Czech derivational morphology with inflection on the one hand, and with syntax on the other, has indicated a way to bridge the separation of derivation from inflectional morphology which is documented in all representative grammars of Czech.Footnote 26

In order to put the annotation of derivations in PDT on a solid basis but, primarily, to build a reliable resource of derivational data for Czech, a lexical network of derivationally related words (DeriNet; [59]) is being developed. The current version DeriNet 0.9 contains more than 305 thousand lexemes which were connected with more than 117 thousand links that correspond to derivational relations between pairs of lexemes (i.e., between a base lexeme and a lexeme derived from it).Footnote 27 The pairs of derivationally related lexemes can be arranged into a tree graph; see the derivational tree with the root standard ‘standard’ (displayed by DeriNet Viewer)Footnote 28 in Fig. 3.

Fig. 3. The derivational tree of the noun standard ‘standard’ in the lexical network DeriNet
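How links between a base lexeme and its derivative add up to a tree can be sketched in a few lines of Python. The sketch assumes that every derived lexeme points to exactly one base; the dictionary layout and the function names are hypothetical, and the two links are illustrative examples chosen here, not data copied from DeriNet or from Fig. 3.

```python
from collections import defaultdict

# derived lexeme -> base lexeme (illustrative links only)
BASE_OF = {
    "standardní": "standard",
    "standardně": "standardní",
}


def root_of(lexeme: str, base_of: dict) -> str:
    """Follow base links upwards until a lexeme without a base is reached."""
    seen = set()
    while lexeme in base_of:
        if lexeme in seen:
            raise ValueError(f"cycle in derivational links at {lexeme!r}")
        seen.add(lexeme)
        lexeme = base_of[lexeme]
    return lexeme


def children_of(base_of: dict) -> dict:
    """Invert the links so that each base lists its direct derivatives."""
    children = defaultdict(list)
    for derived, base in base_of.items():
        children[base].append(derived)
    return dict(children)
```

Storing one base per derivative keeps each link a simple pair, while the inverted mapping gives the top-down view needed to draw a tree from its root.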

The network was initialised with a set of lexemes whose existence was supported by corpus evidence. As the data were morphologically processed by the Morče tagger, technical suffixes including derivational information were available, and were extensively used in creating derivational links in the network. This initial annotation phase has been followed by several rounds of semi-automatic annotation, in which special attention had to be devoted to vowel and consonant alternations that occur very frequently during derivation in Czech. Since some of the alternations are involved in the inflectional paradigm as well, recent efforts to exploit the inflectional morphological dictionary seem to make it possible to build a model of alternations which will allow derivationally related lexemes to be coupled automatically with high precision even if they differ substantially due to the alternations.Footnote 29

Though DeriNet is still being developed (besides exploitation of the inflectional data, the main focus is on addition of new edges and correction of mistakes),Footnote 30 it is, to the best of our knowledge, the most complex and the only freely available resource of derivational data for Czech, and it belongs to a relatively small number of derivational resources in general (cf. CELEX [2] for English, German and Dutch, DerivBase for German [69], DerivBase.Hr for Croatian [63], or most recently, the Démonette network for French [22]).

After arriving at a final version of the DeriNet data, semantic labelling of the derivational relations is proposed as the next step. Here, dealing with ambiguity and homonymy is expected to be the biggest challenge.Footnote 31

The DeriNet network enriched with semantic labels is then envisaged to be used as the main resource for an extension of the derivational annotation of tectogrammatical data in PDT. Nevertheless, it is expected that only the most frequent semantic classes of derivatives with a transparent derivational meaning will be processed in order not to “overload” the data and to keep them usable for both NLP tasks and linguistic research.

5 Conclusions

The aim of the present paper was to put together a complex picture of the role of morphology in the richly annotated data of the Prague Dependency Treebank. Morphological annotation constitutes a separate layer in the treebank; nevertheless, it has been used as a source of information encoded at the higher, structural layers of annotation. Correlations between morphological categories captured at the morphological layer and grammateme attributes included in the tectogrammatical tree were analysed in detail.

Though tagging has been argued to be a more or less solved task for at least “sufficiently resourced” languages [10], probably including Czech, it remains an interesting and appealing task since, particularly in a morphologically rich language like Czech, high-quality lemmatisation and POS tagging are considered a common prerequisite of most NLP tasks.

In the paper we briefly outlined several topics that are based on morphological tools, and on morphologically annotated data as well. An outlook, concerning the proposed extension of the tectogrammatical annotation with derivations, documents the importance of morphology in efforts to deepen the syntactic analysis of language data.