1 Introduction

What do words mean and how are the words in different languages related? We make a start at answering these questions with a large multilingual lexical database and formal ontology. Each formalism captures knowledge about words and language in a different way. Linked together, they form a unified representation of knowledge suitable for language processing and logical reasoning.

An electronic lexicon is a fundamental resource for computational linguistics in any language, and Princeton English WordNet (PWN) (Fellbaum 1998) has become a de facto standard in English computational linguistics. WordNet represents meanings in terms of lexical and conceptual links between concepts and word senses. This allows us to model how concepts are represented in various languages. Ontologies offer a complementary representation where concepts are defined more axiomatically and can be formally reasoned with. The Suggested Upper Merged Ontology (SUMO) model of meaning (Pease 2011) addresses language-independent concepts, formalized in first- and higher-order logic. Bringing these two models together (Niles and Pease 2003) has resulted in a uniquely powerful resource for multilingual computational processes.

There have been a number of efforts to create wordnets in languages other than English. The EuroWordNet (EWN) project provided a first solution for connecting these wordnets to each other by introducing a shared Interlingual Index (ILI) (Vossen 1998). The ILI was based on the English WordNet (mainly for pragmatic reasons) and was treated as an unstructured fund of concepts for linking synsets across wordnets.

Most wordnets developed since EWN have used PWN as a common pivot to which each new wordnet is linked. This has the drawback of making English a privileged language and creating a certain linguistic bias. Since all languages have a different set of lexicalized concepts, it is not possible to have an interlingua where everything is lexicalized in all languages. A solution to this was proposed in the ILI using the union of synsets from all languages, arranged and related via the semantic links of PWN (Laparra et al. 2012). In this case, wordnets in the individual languages do not have to lexicalize all synsets but can still be linked together.

Another approach is to use a language-independent formal ontology—SUMO (Pease 2006)—as the common hub, which allows for the creation of arbitrary new concepts that can eventually encompass the union of lexicalized concepts in all languages. This has additional advantages such as a logical language for creating definitions of concepts that can be checked automatically for logical consistency and a much larger inventory of possible relations among concepts. Using the ILI as an intermediate approach collects and arranges synsets that are in need of formalization while deferring that effort to a later time. It is hoped that by cataloging these synsets, it should be possible to have some of the benefits of a common hub while speeding construction. This will likely be used as input to full SUMO-based formalizations in the future.

Currently, we are exploring both approaches in parallel—creating an ILI (not yet released) and extending SUMO (which has been released and is regularly updated).

A key organizational challenge for a true multilingual lexico-semantic database has been the large-scale nature of the effort needed. Each wordnet project has generally had its own funding and processes, even when coordinated in a broad sense with the original PWN. A variety of formats have proliferated. Wordnets do not all link to one another or to a central ontology. Another challenge has been that some wordnets have not been released under open licenses and thus cannot be legally redistributed. This situation has greatly improved since the initial survey in Bond and Paik (2012), with many more wordnets being made open (Bond and Foster 2013). Some years ago, we introduced the idea of combining wordnets in a single resource (Pease et al. 2008). This original vision has now been realized in the Open Multilingual Wordnet (OMW) described in Sect. 4. At the time of this writing, there are 22 wordnets that have been put into a common database format and linked to SUMO.

In the next section, we describe the Princeton WordNet in more detail. We then introduce the linked ontology, SUMO (Sect. 3). After that, we describe how we built and made accessible the OMW, the main new resource described here (Sect. 4). Finally, we discuss how it can be extended to cover more languages better (Sect. 5).

2 Princeton English WordNet

Princeton WordNet (PWN: Fellbaum 1998) is a large lexical database comprising nouns, verbs, adjectives, and adverbs. Cognitively synonymous word forms are grouped into synsets, each expressing a distinct concept. Within each synset, words are linked by synonymy. Synsets are interlinked by means of lexical relations (among specific word forms) and conceptual relations (among synsets). Examples of the former are antonymy and the morphosemantic relation; examples of the latter are hyponymy, meronymy, and a set of entailment relations. The resulting network can be navigated to explore semantic similarity among words and synsets. PWN’s graph structure allows one to measure and quantify semantic similarity by simple edge counting; this makes PWN a useful tool for computational linguistics and natural language processing.
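The edge-counting idea can be sketched in a few lines of Python. This is a minimal illustration over a toy hypernym graph, not the PWN data or any particular library's implementation: the synset names, the `HYPERNYMS` table, and the `1 / (1 + distance)` scoring are all assumptions made for the example.

```python
from collections import deque

# Toy hypernym links (child -> parents). The names are illustrative,
# not actual PWN synset identifiers.
HYPERNYMS = {
    "car": ["motor_vehicle"],
    "truck": ["motor_vehicle"],
    "motor_vehicle": ["vehicle"],
    "bicycle": ["vehicle"],
    "vehicle": ["artifact"],
    "artifact": [],
}


def build_undirected(hypernyms):
    """Turn child -> parents links into an undirected adjacency map."""
    graph = {node: set() for node in hypernyms}
    for child, parents in hypernyms.items():
        for parent in parents:
            graph[child].add(parent)
            graph.setdefault(parent, set()).add(child)
    return graph


def edge_distance(graph, source, target):
    """Breadth-first search: number of edges on the shortest path, or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(graph, a, b):
    """Assumed scoring: 1 / (1 + shortest path length); None if unconnected."""
    dist = edge_distance(graph, a, b)
    return None if dist is None else 1.0 / (1 + dist)


GRAPH = build_undirected(HYPERNYMS)
```

With this toy graph, `car` and `truck` are two edges apart (via `motor_vehicle`), whereas `car` and `bicycle` are three edges apart (via `vehicle`), so the former pair scores as more similar.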

The main relation among words in PWN is synonymy, as between the words shut and close or car and automobile. Synonyms (words that denote the same concept and are interchangeable in many contexts) are grouped into an unordered set, the synset. Synsets are linked to other synsets by means of a small number of conceptual relations, such as hyperonymy, meronymy, and entailment. Additionally, each synset contains a brief definition and, in most cases, one or more short sentences illustrating the use of the synset members. Word forms with several distinct meanings are represented by appearing in as many distinct synsets as there are meanings. Thus, each form-meaning pair (or sense) in PWN is unique.

3 Suggested Upper Merged Ontology

SUMO (Niles and Pease 2001; Pease 2011) began as just an upper-level ontology encoded in first-order logic. The logic has since expanded to include higher-order elements. SUMO itself is now a bit of a misnomer, as it refers to a combined set of theories: (1) the original upper level, consisting of roughly 1,000 terms, 4,000 axioms, and some 750 rules; (2) a MId-Level Ontology (MILO) of several thousand additional terms and the axioms that define them, covering knowledge that is less general than that in the upper level (we should note that there is no objective standard for what should be considered upper level); and (3) a few dozen domain ontologies on various topics, including theories of economy, geography, finance, and computing. Together, all ontologies total roughly 22,000 terms and 90,000 axioms. There is also a growing group of ontologies that consist largely of ground facts, semiautomatically created from other sources and aligned with SUMO. These include Yet Another Great Ontology (YAGO) (de Melo et al. 2008), which is the largest of these sorts of resources and has millions of facts.

SUMO is defined in the Suggested Upper Ontology-Knowledge Interchange Format (SUO-KIF) language, which is a derivative of the original KIF (Genesereth 1991). It has also been translated automatically into the W3C Web Ontology Language (OWL), although this translation is necessarily very lossy. The translation also includes a version of PWN in OWL and the mappings between them.

SUMO proper has a significant set of manually created language display templates that allow terms and definitions to be paraphrased in various natural languages. These include Arabic, French, English, Czech, Tagalog, German, Italian, Hindi, Romanian, and Chinese (traditional and simplified characters).

SUMO has been mapped by hand to the entire PWN lexicon (Niles and Pease 2003). The mapping statistics are given in Table 1. There are a number of other approaches for mapping ontologies to wordnets (Fellbaum and Vossen 2012; Vossen and Rigau 2010). However, these have not involved ontologies that are comparable to SUMO in either size or degree of formalization.

Table 1 SUMO WordNet mappings (115,261 total)

4 Open Multilingual Wordnet

Wordnets have now been made for many languages. The Global Wordnet Association currently lists over 60 wordnets. The individual wordnets are the result of many different projects and vary greatly in size and accuracy. The OMW (Bond and Paik 2012) provides access to some of these, all linked to the PWN and SUMO. The goal is to make it easy to access lexical meaning in multiple languages. OMW has (1) extracted and normalized the data, (2) linked it to PWN 3.0, and (3) put it in one place. It includes a simple search interface that uses the SQL database developed by the Japanese Wordnet.

In order to make the wordnets more accessible, we have built a simple server with information from those wordnets whose licenses allow us to do so. It is based on a single shared database with all the languages in it. We only include data that is open: “anyone is free to use, reuse, and redistribute it—subject only, at most, to the requirement to attribute and/or share-alike.”

The accessibility of the data means that it is becoming widely used. BabelNet 2.0, a very large multilingual encyclopedic dictionary and semantic network, is made by combining the OMW, PWN, Wikipedia, and OmegaWiki (a large collaborative multilingual dictionary). Google Translate also uses the OMW data.

The majority of freely available wordnets have been based on the expand approach, which basically adds lemmas in new languages to existing PWN synsets (Vossen 1998, p. 11). These wordnets can easily be combined by using the PWN as a pivot. We realize that this is an incomplete solution, and a better one is discussed in Sect. 5.2. Some wordnets are based on the merge approach, where independent language-specific structures are built first and some synsets are then linked to the PWN. For the merge-based wordnets in the OMW (Danish and Polish), only a small subset of synsets is actually linked, due more to a lack of resources for linking than to semantic incompatibility.

Adding a new language to the OMW turned out to be difficult for two reasons. The first problem was that the wordnets were linked to various versions of PWN. In order to combine them into a single multilingual structure, we had to map them to a common version. The second problem was the wide variety of formats in which the wordnets are distributed. Almost every project used a different format and thus required a new script to convert it. In fact, different releases from the same project often had slightly different formats. These two problems mean that, even if a wordnet is legally available, there is still a technical hurdle before it becomes easily accessible.

The first problem can largely be overcome using the mappings from Daude et al. (2003). Mapping introduces some distortions. In particular, when a synset is split, we chose to only map the translations to the most probable mapping, so some new synsets will have no translations. For example, the synset pwn16-leg n: 8 “a section or portion of a journey or course” in PWN 1.6 maps to two senses in PWN 3.0: pwn30-leg n: 9 “a section or portion of a journey or course” and pwn30-leg n: 8 “the distance traveled by a sailing vessel on a single tack”. pwn16-leg n: 8 to pwn30-leg n: 9 is the most probable mapping, so any lemmas associated with pwn16-leg n: 8 will be associated only with pwn30-leg n: 9.
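The "most probable mapping" policy described above can be sketched as follows. This is only an illustration under stated assumptions: the identifier strings, the confidence scores, and the lemma are invented placeholders, and the real mappings of Daude et al. (2003) are distributed in their own format.

```python
# Hypothetical mapping table: old synset id -> [(new synset id, confidence)].
# Identifiers and scores are illustrative placeholders, not real mapping data.
MAPPING_16_TO_30 = {
    "pwn16-leg-n-8": [("pwn30-leg-n-9", 0.8), ("pwn30-leg-n-8", 0.2)],
    "pwn16-foal-n-1": [("pwn30-foal-n-1", 1.0)],
}


def remap_lemmas(lemmas_by_old_synset, mapping):
    """Attach each old synset's lemmas to its single most probable new synset.

    Old synsets with no mapping are dropped. When an old synset was split,
    only the highest-scoring target receives the lemmas, so the remaining
    targets end up with no translations.
    """
    remapped = {}
    for old_id, lemmas in lemmas_by_old_synset.items():
        candidates = mapping.get(old_id)
        if not candidates:
            continue
        best_id, _score = max(candidates, key=lambda pair: pair[1])
        remapped.setdefault(best_id, set()).update(lemmas)
    return remapped


# Illustrative input: one lemma attached to the old "leg of a journey" synset.
DEMO = remap_lemmas({"pwn16-leg-n-8": {"etapa"}}, MAPPING_16_TO_30)
```

Here `DEMO` contains an entry for `pwn30-leg-n-9` only; the sailing sense `pwn30-leg-n-8` receives no lemmas, which is exactly the distortion noted above.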

The second problem we have currently solved through brute force, writing a new script for every new wordnet we add. We discuss better possible solutions in Sect. 5.2. In the future, we hope people will move to a common standard for exchange, with Wordnet-LMF being the strongest contender (Vossen et al. 2013).

The server currently includes English (Fellbaum 1998); Albanian (Ruci 2008); Arabic (Black et al. 2006); Chinese (Huang et al. 2010; Wang and Bond 2013); Danish (Pedersen et al. 2009); Finnish (Lindén and Carlson 2010); French (Sagot and Fišer 2008); Hebrew (Ordan and Wintner 2007); Indonesian and Malaysian (Nurril Hirfana et al. 2011); Italian (Pianta et al. 2002); Japanese (Isahara et al. 2008); Norwegian (Bokmål and Nynorsk: Lars Nygaard 2012, p.c.); Persian (Montazery and Faili 2010); Polish (Piasecki et al. 2009); Portuguese (de Paiva and Rademaker 2012); Thai (Thoongsup et al. 2009); and Basque, Catalan, Galician, and Spanish from the Multilingual Common Repository (Gonzalez-Agirre et al. 2012).

The wordnets are all in a shared SQLite database with either Python or Perl CGI clients using the wordnet module produced by the Japanese Wordnet project (Isahara et al. 2008). The database is based on the logical structure of the PWN, with an additional language attribute for lemmas, examples, definitions, and senses. It is thus effectively a single open multilingual resource. We summarize the size of the wordnets and their coverage of core concepts in Table 2. Core concepts are the 5,000 synsets proposed as a core lexicon based on the frequency of the word forms in the British National Corpus (Burnard 2000) and an intuitive sense of salience (Boyd-Graber et al. 2006). That is, the core concepts are frequently occurring concepts (at least in British English).

Table 2 Available wordnets

We make available the synset-lemma pairs as tab-separated files, which can be used by the Natural Language Toolkit (Bird et al. 2009), as well as in the WordNet-LMF (Lexical Markup Framework: Vossen et al. 2013) and lemon (McCrae et al. 2011) formats.
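Reading such a file takes only a few lines. The sketch below assumes a layout of one record per line with three tab-separated fields (synset identifier, a `lang:lemma` type tag, and the lemma) plus `#` comment lines; this layout, the sample identifiers, and the placeholder URL are assumptions for illustration and should be checked against the released files.

```python
import io
from collections import defaultdict

# Assumed layout: "<synset>\t<lang>:lemma\t<lemma>", with "#" comment lines.
# The sample content and the example.org URL are placeholders.
SAMPLE = (
    "# Demo Wordnet\tfra\thttp://example.org/\tCC BY\n"
    "00000001-n\tfra:lemma\tchien\n"
    "00000001-n\tfra:def\tanimal domestique\n"
    "00000002-n\tfra:lemma\tchat\n"
    "00000002-n\tfra:lemma\tminet\n"
)


def read_synset_lemmas(handle):
    """Collect lemmas per synset, skipping comments and non-lemma records."""
    lemmas = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # malformed record; ignore rather than guess
        synset, kind, value = fields[0], fields[1], fields[2]
        if kind.endswith(":lemma"):
            lemmas[synset].append(value)
    return dict(lemmas)


PAIRS = read_synset_lemmas(io.StringIO(SAMPLE))
```

Splitting on tabs by hand, rather than using a CSV parser, avoids any special treatment of quote characters that may occur inside lemmas.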

Finally, we also make the SQL database available (with all languages except French and Basque, whose licenses are incompatible with the others). We use a simple database schema extended from the schema for the Japanese wordnet (Bond et al. 2009). When we use the combined database in applications, we typically use the database directly or through the Perl interface. Licenses that allow redistribution of derivative works allow people to make the entire lexicons available in any format, thus greatly improving their usefulness. There are also APIs for the database produced by other researchers in Python, Java, Ruby, Objective-C, Gauche, and an alternative Perl module.
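To give a feel for how a single database with a language attribute supports cross-lingual lookup, here is a minimal sketch using Python's built-in `sqlite3` module. The two-table schema, the table and column names, and the sample rows are hypothetical simplifications, not the actual OMW or Japanese Wordnet schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE synset (synset TEXT PRIMARY KEY, pos TEXT);
        CREATE TABLE sense (synset TEXT, lemma TEXT, lang TEXT);
        """
    )
    conn.execute("INSERT INTO synset VALUES (?, ?)", ("dog-n-1", "n"))
    conn.executemany(
        "INSERT INTO sense VALUES (?, ?, ?)",
        [
            ("dog-n-1", "dog", "eng"),
            ("dog-n-1", "chien", "fra"),
            ("dog-n-1", "perro", "spa"),
        ],
    )
    return conn


def translations(conn, lemma, source_lang, target_lang):
    """Lemmas in target_lang that share a synset with lemma in source_lang."""
    rows = conn.execute(
        """
        SELECT DISTINCT t.lemma
        FROM sense AS s JOIN sense AS t ON s.synset = t.synset
        WHERE s.lemma = ? AND s.lang = ? AND t.lang = ?
        ORDER BY t.lemma
        """,
        (lemma, source_lang, target_lang),
    )
    return [row[0] for row in rows]


CONN = build_demo_db()
```

Because every sense row carries its language, pivoting through the shared synset is a single self-join: `translations(CONN, "dog", "eng", "fra")` yields the French lemmas attached to the same synset.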

There has been much research on making wordnets available to the Semantic Web, including formatting as RDF (van Assem et al. 2006; Koide et al. 2006), serving LMF directly (Savas et al. 2010), or serving them through the lemon format (McCrae et al. 2011). Typically, these do not involve any changes in the actual content; the emphasis is instead on making it more easily accessible as Linked Open Data (Berners-Lee 2009). The proliferation of these approaches suggests that there is still some way to go before we have an agreed-upon universal standard. Therefore, our approach has been to make our data open, clearly documented, well formatted, and validated in a simple format we use ourselves (tab-separated text) and in some standard formats for exchange (LMF and lemon). The data can then be straightforwardly converted to whatever other format is desired. Currently, in most of our use scenarios (principally word sense disambiguation and semantic processing), the latency of a Web interface is problematic; we expect that most of the users of our data will want to download the entire lexicon, and this is what we offer.

4.1 Possible Wordnet Structural Enhancements

In this section, we will discuss some extensions people have suggested to the structure of the original PWN: these are not currently part of the open wordnet. One advantage of having many language-specific projects loosely coordinated is that there can be a wide variety of experimentation.

Our conversion scripts basically reduce each wordnet to a list of synset-lemma pairs, plus frequency, definitions, and examples if available. Everything is mapped to PWN 3.0 synsets. Therefore, the current version loses any synsets not in the English 3.0 wordnet. Many of the wordnets have such synsets, as well as metadata, definitions, examples, and other useful information. One of the ongoing goals of the OMW project is to make this information more easily accessible between projects.

We do not consider wordnets with licenses that do not allow redistribution, as we cannot legally include them. This includes some very well-constructed wordnets with excellent coverage, such as the Dutch, German, and Korean wordnets (Vossen et al. 2008; Kunze and Lemnitzer 2002; Yoon et al. 2009). It is unfortunate that they cannot be integrated into the Open Wordnet. Some wordnets are built with their own structure and do not link to the PWN. These also cannot be included. Finally, some wordnets were not included even though they were open, because they had been built automatically with very little quality control and their quality was still too poor.

Many of the wordnet projects extend the PWN relations in some way. For example, EWN defined many cross-part-of-speech links: hammer n: 1 is an involved-role of hammer v: 1 (Vossen 1998, pp. 97–110). Another extension is found in the Chinese Wordnet (Taiwan), which takes a different approach to representing lexical meanings. Unlike most models of lexical ambiguity resolution, which assume that only one meaning is chosen in a given context, it allows more than one (related) meaning to coexist in the same context. A lexical item is actively complex if it allows simultaneous multiple readings. Meaning extensions are thus divided into two types: sense and meaning facet (Ahrens et al. 1998). These can be distinguished as follows: given multiple possible meanings of a lemma, if a sentence can be found that allows those readings to coexist, the distinction is recognized as a meaning facet distinction; otherwise, it is a sense distinction. The coexistence test for the sense/meaning facet distinction can be illustrated in (1)–(4). The lemma kànbìng “seeing-sickness” in (1) allows two readings (“seeing the doctor” or “examining the patient”). The ambiguity can be resolved given more contextual information, and we cannot find a sentence that allows the two readings to coexist. Therefore, they are treated as two senses of that lemma. However, the lemma zázhì “magazine” can refer to the physical object in (2) or to the information it contains in (3); moreover, we can find a sentence like (4) in which the lemma refers to both the physical object and the information contained in that object. We therefore consider this meaning distinction of zázhì “magazine” to be a meaning facet rather than a sense.
Interestingly, among the 5,890 meaning facets identified in the Chinese Wordnet, 9 regular systematic patterns have been extracted, which are similar to the regular polysemy (Apresjan 1973) (of complex types) proposed by Pustejovsky (1995). This fine-grained distinction is implemented by extending the types of semantic relations within the Chinese wordnet. Many (perhaps most) of these relations are not specific to Chinese. One of the advantages of the OMW is that we can look at research like this being done for one language and easily test its applicability to other languages.

5 Extending the Multilingual Wordnet

In this section, we discuss the immediate plans to extend the wordnets to deal with multilingual issues. As was demonstrated in EWN, we can expect most languages to have concepts that are not lexicalized in English. In addition, there are still many concepts lexicalized in English, but not in PWN. Thus, different wordnets will have synsets that do not appear in most or even any other existing wordnet (this was the case for seven of the wordnets in the OMW). Consider the example of the Tagalog word hilamos “to wash one’s face” (Borra et al. 2010).

Words such as this form part of the motivation for using a formal ontology. While some wordnets have used English as an interlingua and created phrases to stand in the place of otherwise unlexicalized concepts, another approach is to use SUMO as an interlingua which can contain concepts which stand for the lexicalized concepts of any particular language.

Exactly what counts as lexicalized can be hard to determine. Consider the following example: foal is lexicalized in English, so it must be in the English wordnet. In Malay, the closest equivalent is a phrase, anak kuda “horse child”, which can be produced compositionally by fully productive syntactic rules. In Japanese, it is ko-uma “child+horse”, a word produced by a semiproductive process. So it is not clear whether the Malay wordnet should have an entry here. On the one hand, the phrase is produced by a fully productive process. On the other, it is useful to have an entry, even if fully compositional, for completeness. We suggest that it should be entered but marked as syntagmatic using metadata, following the example of the Italian, Basque, and Hungarian wordnets (Pianta et al. 2002; Pociello et al. 2011). Vincze and Almázi (2014) show how it is possible to exploit this metadata to automatically make two versions of a monolingual wordnet: one showing translation equivalents and one showing only concepts lexicalized in a particular language.

EWN distinguished a few types of nonuniversal lexicalizations and expressions, which call for different methods of handling:

Cultural concepts: Concepts that exist in some cultures and not in others, for example, Dutch klunen “to walk on skates”.

Pragmatic lexicalizations: Concepts that are known in all or most cultures but are not lexicalized in all of them. For example, we all know the concept of a small fish, but Spanish happens to have a separate word for it, alevin.

Morphosyntactic mismatches: Concepts that are lexicalized through words with different morphosyntactic properties across languages. For example, Dutch has no equivalent for like but uses the adjective aardig.

Differences in perspective: Some languages distinguish things depending on who is doing what to whom in ways that other languages do not. For example, English distinguishes teach and learn, whereas French uses apprendre for both.

A pertinent question is what defines a word and what defines a concept. Commonly occurring collocations may have transparent, compositional semantics, yet we may still consider them words. For example, noun compounds such as sailing boat are so common and ready-made that we consider them to be one word. Another point is that the relation between the components cannot be predicted from the structure: who is doing the sailing, who has the sail, and what is being sailed? A classical Dutch example is kindermeel “meal for children” versus tarwemeel “flour made of wheat”. From the structure, we cannot infer the relation. It needs to be learned or inferred, but Dutch speakers are probably not deriving it over and over again.

We are also extending the wordnets in terms of their size and coverage both within individual projects and by exploiting the disambiguating power of multilingual data to link to other open resources such as Wiktionary (Bond and Foster 2013). The core idea is that by looking at multiple translations of a concept, we can pinpoint the meaning exactly: bat in English is ambiguous between the sporting equipment and the flying mammal, but adding, for example, French, removes the ambiguity (batte vs. chauve-souris).
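The disambiguating effect of multiple translations amounts to a set intersection over candidate synsets. The following sketch uses an invented sense inventory (`SENSES`) with made-up synset labels purely to illustrate the idea; it is not the procedure or data of Bond and Foster (2013).

```python
# Hypothetical sense inventory: (language, lemma) -> candidate synsets.
# The synset labels are invented for illustration.
SENSES = {
    ("eng", "bat"): {"bat-sport", "bat-animal"},
    ("fra", "batte"): {"bat-sport"},
    ("fra", "chauve-souris"): {"bat-animal"},
}


def shared_synsets(senses, *words):
    """Intersect the candidate synsets of all the given (lang, lemma) pairs."""
    candidate_sets = [senses.get(word, set()) for word in words]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)


AMBIGUOUS = shared_synsets(SENSES, ("eng", "bat"))
RESOLVED = shared_synsets(SENSES, ("eng", "bat"), ("fra", "batte"))
```

English bat alone leaves two candidates in `AMBIGUOUS`; adding the French translation batte narrows `RESOLVED` to the sporting-equipment synset.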

We are investigating two (compatible) methods of dealing with these new concepts. One is to create a concept in an external ontology and use this to link languages. In this approach, as hilamos is not lexicalized in English, it is not linked directly to English wash in the English wordnet. The fundamental value of the ontology is to define meaning using axioms in an expressive logic so that the meanings can then be manipulated without recourse to a human’s intuition about the meaning of a word.

The second approach is to have a shared group of synsets for all languages, but not have them lexicalized in all languages. In this model, English wash and Filipino hugas are both lexicalizations of the same synset, and the synset for hilamos “wash one’s face” inherits from this but would be marked as unlexicalized in English. Most expand-style wordnets take this approach, with nonlexicalized synsets being either just left blank or explicitly marked as nonlexicalized (as in, e.g., the MCR (Gonzalez-Agirre et al. 2012)).

5.1 Wordnets Linked to External Ontologies

Using ontologies to link words (the first approach) is more labor intensive but offers other advantages.

Consider the notion of earlier. PWN has a synset for this word, but not a way to use it in temporal inference. SUMO however has a relation for earlier and a formal rule (among others) that allows an automated inference system such as those available with Sigma (Pease and Benzmüller 2013; Pease et al. 2010) to conclude that an interval that is earlier than another has an endpoint that precedes the start point of the following interval. This is a necessary and sufficient definition for earlier and uses the bi-implication or equivalence sign <=>:

(<=>
  (earlier ?INTERVAL1 ?INTERVAL2)
  (before
    (EndFn ?INTERVAL1)
    (BeginFn ?INTERVAL2)))
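The content of this axiom can be mirrored in a few lines of Python. This is only a rough analogue under simplifying assumptions: intervals are modeled as pairs of numeric time points and `before` as numeric `<`, whereas in SUMO these are logical terms whose meaning comes from their axioms and is derived by a theorem prover. The `Interval` class is a hypothetical helper introduced for the example.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical stand-in for a SUMO time interval with numeric endpoints."""
    begin: float  # plays the role of (BeginFn ?INTERVAL)
    end: float    # plays the role of (EndFn ?INTERVAL)


def before(point1, point2):
    """Assumed model of SUMO's `before` on time points: strict numeric order."""
    return point1 < point2


def earlier(interval1, interval2):
    """Both directions of the <=> at once: true exactly when the end of the
    first interval is before the beginning of the second."""
    return before(interval1.end, interval2.begin)


MORNING = Interval(begin=8.0, end=12.0)
AFTERNOON = Interval(begin=13.0, end=17.0)
LUNCH_MEETING = Interval(begin=11.5, end=13.5)
```

Under this model, `MORNING` is earlier than `AFTERNOON`, but neither is earlier than `LUNCH_MEETING` in the sense of the axiom, since the intervals overlap.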

Another example is the SUMO-based content developed to represent Muslim cultural concepts in Arabic Wordnet (Black et al. 2006). The Udhiyah ritual is performed during the period of Eid al-Adha and involves the slaughtering of a lamb by a Muslim. If a lamb has the attribute of being Udhiyah, then there necessarily exists an UdhiyahRitual in which it is the subject of the ritual:

Each of these symbols is further formalized, allowing them to be checked for logical consistency by automated theorem provers. This is a key advantage of a formal logic representation. The more expressive the representation and the more extensive the set of formalizations for each concept, the more that can be checked automatically. A conventional dictionary must be checked by humans to ensure the correctness of its definitions. The same is true of a conventional data dictionary, in which concepts in a database are defined in natural language in the hope of ensuring their correct usage. But when such a corpus of definitions grows large, into the thousands or more, it is not likely that a human or even many humans will be able to find all inconsistencies. Automated means are needed. At that point, expressiveness also matters. In a taxonomy, the only error that can be caught automatically is the presence of a cycle in the graph. With a description logic, many more checks can be performed. In a higher-order language such as that used by SUMO, theorem proving (Benzmüller and Pease 2010) can find much deeper and subtler errors, leading to definitions of considerable depth and consistency.
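The one check available for a plain taxonomy, detecting a cycle in the subclass graph, can be sketched with a depth-first search. The term names and the `parents` mapping below are illustrative assumptions, and the code is a generic graph algorithm rather than part of Sigma or SUMO tooling.

```python
def find_cycle(parents):
    """Return a list of terms forming a cycle in a term -> parents map,
    or None if the taxonomy is acyclic (depth-first search)."""
    unvisited, in_progress, done = 0, 1, 2
    state = {}
    path = []

    def visit(term):
        state[term] = in_progress
        path.append(term)
        for parent in parents.get(term, ()):
            parent_state = state.get(parent, unvisited)
            if parent_state == in_progress:
                # Found a back edge: the cycle runs from parent to term.
                return path[path.index(parent):] + [parent]
            if parent_state == unvisited:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        state[term] = done
        return None

    for term in list(parents):
        if state.get(term, unvisited) == unvisited:
            cycle = visit(term)
            if cycle:
                return cycle
    return None


# Illustrative taxonomies; the term names are invented for the example.
SOUND = {"Dog": ["Mammal"], "Mammal": ["Animal"], "Animal": []}
BROKEN = {"Dog": ["Mammal"], "Mammal": ["Animal"], "Animal": ["Dog"]}
```

For `SOUND` the function returns `None`, while for `BROKEN` it returns the offending chain of terms. A recursive search is adequate for a sketch; a very deep taxonomy would call for an iterative version.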

SUMO terms are mathematical symbols whose semantics is given solely by their logical axioms. Therefore, unlike in taxonomies or semantic networks, the symbol names can be changed without altering their meaning. In fact, the current Sigma browser can display terms with their names in different languages in order to emphasize this point and make them more accessible to logicians who may not speak English.

5.2 Interlingual Index

The second approach is basically that of the Interlingual Index (ILI: Peters et al. 1998). The variety of approaches in the EWN initially resulted in wordnets that were mapped to very different sets of concepts in the ILI. As a result, only a small set of synsets could be traced to other languages through the ILI. To harmonize the output, EWN took two measures: (1) the definition of a shared set of (1,000 up to 5,000) Base Concepts that were manually aligned and (2) the classification of these Base Concepts using a small top ontology of 63 terms. Base Concepts (not to be confused with the “Basic Level Categories” of Rosch (1978)) represent synsets that have the highest connectivity to the other synsets. The top-ontology classification of these synsets provided a shared semantic framework. Each wordnet made sure the Base Concepts were represented properly in its language and manually mapped to the ILI. The minimal intersection across these wordnets through the ILI is thus the set of Base Concepts, but in practice the intersection is much larger. During the EWN project, it became clear that there are many problems with the ILI being based on PWN and that there are many possibilities to improve the ILI for linking wordnets (Vossen et al. 1999).

6 Conclusion

Several goals are being pursued in parallel: (1) research on building wordnets for individual languages, (2) research on building a more formal upper ontology, and (3) research on linking wordnets in many languages to make a multilingual resource. The ontology as well as some of the lexicons have been expressed in OWL, as well as their original formats, for use on the Semantic Web and in Linked Data. This effort builds on WordNet, Global Wordnet, and SUMO to create a rich Web of linguistic data and mathematically specified world knowledge.