1 Introduction

The “PolNet-Polish Wordnet” project started in 2006 under the Polish Platform for Homeland Security grant “Language Resources and text processing technologies oriented to public security applications”. Inspired by the Princeton WordNet, the project was intended to serve two main objectives: first, to provide a reference semantic lexicon for the core of the Polish language; second, to serve as a linguistically motivated ontology supporting reasoning in AI systems with natural language competence. The research was preceded by our earlier studies on linguistically motivated ontologies (cf. Vetulani 2003, 2004). As is the case for practically all wordnet development projects, PolNet is an open-ended project run by an interdisciplinary team of linguists, language engineers and computer scientists (Footnote 1). Since the beginning, the project has evolved from a network of noun-based synsets (Footnote 2) organized in a hyponymy/hyperonymy hierarchy towards a lexicon grammar with a verbnet part as its backbone.

2 Methodology

The initial PolNet was built as a wordnet for nouns organized into synsets, interrelated primarily by relations corresponding to traditional hyponymy/hyperonymy. Keeping in mind that a correctly constructed wordnet should render the conceptualization of the real world as reflected in the language, we decided to apply the so-called “merge development model”, in which the lexical network is built from scratch, as opposed to the so-called “expand method”, which consists in “translating” a wordnet built for some other language (typically English). Applying the merge model is expensive and time-consuming, as it requires substantial involvement of well-trained staff, mainly lexicographers. In building a wordnet from scratch, we followed the methodology elaborated within the Princeton WordNet and EuroWordNet projects, so that we could reuse the linguistic knowledge accumulated in traditional written resources, mainly dictionaries and grammars.

In order to limit methodological arbitrariness, we deliberately decided to avoid statistical and AI-based methods (such as genetic algorithms, machine learning, etc.) at this stage (Footnote 3). The PolNet development algorithm (Vetulani et al. 2007) makes essential use of traditional Polish dictionaries and of the DEBVisDic platform (Pala et al. 2007) as a wordnet development tool. The work was organized incrementally, starting with general and frequently used vocabulary extracted from a corpus (IPI PAN corpus; Przepiórkowski 2004). The one important exception to this rule arose from the need to have the system tested in a real application at the earliest possible stage of development. This is why we introduced, from the beginning, some basic domain terminology from the area of homeland security. (From 2009 to 2010 PolNet was tested as an ontology in the application POLINT-112-SMS (Vetulani et al. 2009, 2010a, b).)
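The incremental organization of the work can be illustrated with a minimal sketch. It is not the PolNet development algorithm of Vetulani et al. (2007); it only shows one way of ordering candidate nouns so that domain terminology comes first and general vocabulary follows by corpus frequency. The function name `build_worklist`, the batch size and the toy data are assumptions made for illustration.

```python
from collections import Counter


def build_worklist(corpus_lemmas, domain_terms, batch_size):
    """Return batches of candidate nouns: domain terms first, then by corpus frequency."""
    counts = Counter(corpus_lemmas)
    # Domain terminology is scheduled first, whatever its corpus frequency.
    ordered = list(dict.fromkeys(domain_terms))  # de-duplicate, keep order
    seen = set(ordered)
    # General vocabulary follows, most frequent lemmas first.
    for lemma, _count in counts.most_common():
        if lemma not in seen:
            ordered.append(lemma)
            seen.add(lemma)
    # Split the ordered list into increments of at most batch_size lemmas.
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Toy input (hypothetical; real input would be noun lemmas drawn from a corpus).
corpus_lemmas = ["dom", "dom", "dom", "człowiek", "człowiek", "czas"]
domain_terms = ["policja"]
batches = build_worklist(corpus_lemmas, domain_terms, batch_size=2)
print(batches)
```

Each batch would then be handed to lexicographers, who build the synsets manually with the help of dictionaries; the sketch only covers the ordering of the input vocabulary.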

With the inclusion of the verb category, we brought to PolNet ideas inspired by the FrameNet (Fillmore et al. 2002) and VerbNet (Palmer 2009) projects. The verbal part became the backbone of the whole network, with the system of semantic roles as its organizing principle.

Semantic roles, as relations connecting noun synsets to verb synsets, describe the semantic requirements of the predicate. This permits us to consider the verb-extended PolNet as a situational semantics network of concepts in which verb synsets represent situations (events, states), whereas semantic roles (Agent, Patient, Beneficent, …) provide information on the ontological nature of the actors involved in the situation. The abstract roles (Manner, Time, …) describe the situation (event, state) with respect to time, space and possibly also some abstract, qualitative landmarks. Formally, the semantic roles are functions (in the mathematical sense) associated with the argument positions in the syntactic patterns corresponding to synsets. The values of these functions are ontological concepts represented by synsets (Footnote 4). For example, for many verbs, the semantic role BENEFICENT takes as its value the concept of humans. The set of semantic roles we used is adapted from Palmer (2009).
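The view of semantic roles as functions whose values are noun synsets can be made concrete with a small sketch. The classes `NounSynset` and `VerbSynset`, the method names and the sample entries are hypothetical; they do not reproduce the PolNet data format, and the tiny hierarchy is invented for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class NounSynset:
    """A noun synset with an optional hypernym (hyponymy/hyperonymy link)."""
    literals: Tuple[str, ...]
    hypernym: Optional["NounSynset"] = None

    def is_a(self, other):
        """True if this synset is `other` or lies below it in the hierarchy."""
        node = self
        while node is not None:
            if node == other:
                return True
            node = node.hypernym
        return False


@dataclass
class VerbSynset:
    """A verb synset whose semantic roles map to noun synsets."""
    literals: Tuple[str, ...]
    roles: Dict[str, NounSynset] = field(default_factory=dict)

    def role(self, name):
        """The role as a function: return the concept required for `name`, if any."""
        return self.roles.get(name)

    def accepts(self, name, filler):
        """Check whether `filler` satisfies the concept required for role `name`."""
        required = self.role(name)
        return required is not None and filler.is_a(required)


# Invented mini-hierarchy: entity > human > policeman, entity > stone.
entity = NounSynset(("byt",))
human = NounSynset(("człowiek", "osoba"), entity)
policeman = NounSynset(("policjant",), human)
stone = NounSynset(("kamień",), entity)

# Hypothetical verb synset for "to help" with two role constraints.
help_ = VerbSynset(("pomagać", "pomóc"), {"Agent": human, "Beneficent": human})

print(help_.accepts("Beneficent", policeman))
print(help_.accepts("Beneficent", stone))
```

Because the required concept is itself a synset, any hyponym of it (here, the policeman synset under the human synset) is accepted as a role filler, which is how the noun hierarchy and the verb part interact in this sketch.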

3 Development Phases

The PolNet project was supported by several research programs and grants. After the initial phase, in which the noun-based lexical ontology was developed for use in the public security application, PolNet continued with the support of the City of Poznań (Footnote 5), then within a resources-oriented project of the National Program for Humanities (NPRH) (Footnote 6), and finally as part of the research program of the Faculty of Mathematics and Computer Science, AMU. At the end of the first phase (March 2010), PolNet consisted of approximately 10,600 synsets covering 18,800 word senses for 10,900 nouns. By that time, preparatory work for the inclusion of the verb component was already well advanced. On this resource we ran mapping experiments between PolNet and the so-called Global Wordnet Grid (for 1,200 noun synsets), as well as between PolNet and Princeton WordNet (for approximately 2,400 synsets). Intensive work within the CITTA-Ontology project resulted in the first public PolNet distribution, at the Language and Technology Conference 2011 (November) in Poznań, Poland, and the Global Wordnet Conference 2012 (January) in Matsue, Japan.

This first release was freely distributed as PolNet 1.0 under a Creative Commons (CC) license. Its noun part amounted to approximately 11,700 synsets covering over 20,300 word senses for 12,000 nouns. Its verb part comprised more than 1,500 synsets corresponding to some 2,900 word-meaning pairs for 900 of the most important Polish simple verbs.

4 From PolNet 1.0 to PolNet 2.0

The present development is being carried out within the NPRH project “Development of Digital Resources of Polish in the Area of Valency Dictionaries Towards the Lexicon Grammar Oriented to Computer Applications in the Humanities”/11H11 010080/. It focuses on the development of the verbnet part of PolNet. Although verb synsets may be related to other verb synsets through the hyponymy/hyperonymy relation, the main interest is in relating verb synsets (representing predicative concepts) to noun synsets (representing general concepts) in order to show the semantic connectivity constraints that correspond to the particular argument positions opened by verbs. The inclusion of predicative information, combined with morphosyntactic constraints, gives PolNet the status (and strength) of a lexicon grammar. Our attempts to transform PolNet into a lexicon grammar of Polish were inspired by two historical reference projects of the 1970s: Lexicon-Grammar (Gross 1994) and the Syntactic-Generative Dictionary of Polish Verbs (Polański 1992).

At the present stage we focus our efforts on inserting verb-noun collocations into PolNet. Collocations assume predicative functions in a similar way to simple verbs. Verb-noun collocations often do not have one-word synonyms and, therefore, must be considered an essential part of the Polish predication system.
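As a minimal sketch of why collocations need synsets of their own, the fragment below represents simple verbs and verb-noun collocations uniformly as predicates and flags the synsets that contain no simple verb. The class `Predicate`, the function `lacks_simple_verb` and the two toy synsets are assumptions made for illustration; the groupings are not taken from PolNet or from the collocation dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Predicate:
    """A predicative expression: a simple verb, or a verb with a predicative noun."""
    verb: str
    noun: Optional[str] = None  # set only for verb-noun collocations

    @property
    def is_collocation(self):
        return self.noun is not None


def lacks_simple_verb(synset):
    """True if every member of the synset is a verb-noun collocation."""
    return all(p.is_collocation for p in synset)


# Toy synsets (illustrative groupings only).
help_synset = {Predicate("pomóc"), Predicate("udzielić", "pomocy")}
be_right_synset = {Predicate("mieć", "rację")}

print(lacks_simple_verb(help_synset))
print(lacks_simple_verb(be_right_synset))
```

A synset for which `lacks_simple_verb` holds would have no counterpart at all in a network restricted to simple verbs, so the corresponding predicative concept could only be reached through the collocation.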

Our present work benefits from recent advances in descriptive research on collocations. We directly use the “Dictionary of Polish Verb-Noun Collocations” (Vetulani, G. 2000, 2012) as the basic resource. This resource is still under development and we may expect it to expand substantially in the future. The most challenging and time-consuming part of our work consists in adding the semantic information missing from G. Vetulani’s verb-noun collocations dictionary (cf. Example 1 above for the form of dictionary entries). This part of the work can hardly be done automatically without an uncontrolled loss of quality.

The intended release of PolNet 2.0 (planned for 2014/2015) will contain new synsets corresponding to some 3,600 collocations.

5 Future Research

The “PolNet - Polish WordNet” project is in progress and will be continued in the foreseeable future. Our short-term priority is to complete the verbnet part of the project, which means including both simple verbs and collocations. In parallel, we plan to continue our work, already started for PolNet 1.0, on the alignment of synsets to the upper ontology SUMO (Pease 2011). The long-term goal does not change: it is to transform PolNet into a complete lexicon grammar of Polish integrating all the grammatical information necessary (and sufficient) for advanced AI and Language Engineering (LE) applications.