1 Introduction

World knowledge is a requirement for dealing with the semantic level of natural languages. Conceptualisations of reality have occupied human beings since the Ancient Greeks, when the term Ontology (from the Greek ὄν, genitive ὄντος: of being (participle of εἶναι: to be) and -λογία: science, study, theory) was introduced by Aristotle (1908). Much later, at the end of the 20th century, the first attempts to give common sense to computers by building Knowledge Bases (KBs) were initiated in the field of Artificial Intelligence. Examples of this are the CYC project (Lenat 1998), MindNet (Richardson et al. 1998) and, more related to natural language, WordNet (Miller 1995).

Computational Linguistics is an interdisciplinary field related to Artificial Intelligence that deals with human-level understanding and generation of natural languages. World knowledge is necessary for attaining truly intelligent computer systems. In the case of language, this knowledge is contained in Language Resources (LRs), and in fact, these play a central role in the field of Computational Linguistics as they are practically indispensable for carrying out any automatic understanding of language. The research community has therefore dedicated a lot of effort to the manual construction of LRs during the last two decades.

In spite of the amount of work devoted to LRs, which has led to the availability of robust, high-coverage LRs, some types of linguistic information are not exhaustively covered in these resources. Two paradigmatic examples are Named Entities (NEs) Footnote 1 and domain-specific terms. It is clear that manually populating and maintaining these two kinds of terms in LRs would be unfeasible: the number of terms involved is huge and their nature, especially that of NEs, is much more volatile than that of the terms that make up the core of traditional LRs (common nouns, adjectives, verbs and adverbs). This is in line with the following assertion: “building a proper noun ontology is more difficult than building a common noun ontology as the set of proper nouns grows more rapidly” (Mann 2002). The problem is then that a proper noun resource should be constantly updated. In keeping with this, Philpot et al. (2005) state that “the need for machine-assisted ontology construction is stronger than ever” because “humans cannot manually structure the available knowledge at the same pace as it becomes available”. Hence, in order to fill this gap, automatic procedures are needed. The so-called knowledge acquisition bottleneck is a recognised issue within the Natural Language Processing (NLP) community.

In order to clarify this issue, let us take a look at the state of NEs in WordNet, the most widely used English LR nowadays. Since version 2.1, this LR explicitly distinguishes between common nouns (called classes) and proper nouns (called instances) (Miller and Hristea 2006). While WordNet’s coverage of open-domain common nouns is quite high, it contains very few proper nouns (only 7,669 synsets are tagged as instances in WordNet 2.1).

Regarding NEs, most of the research done so far relates directly to their recognition and classification in text according to small predefined sets of categories, such as the four-category set (person, organisation, location, miscellaneous) of CoNLL (Tjong Kim Sang 2002). With regard to NE resources, even if mature repositories of geographical NEs (also called gazetteers) do exist (e.g. geonames Footnote 2), there is a lack of more general resources. However, the availability of general LRs with NEs could be very useful for NLP tasks; Mann (2002) shows how the use of a proper noun ontology, even one with low coverage, improves the precision of a Question Answering (QA) system. Moreover, this kind of resource could play a crucial role in NE Recognition systems that consider an extended hierarchy of entity types like the one proposed by Sekine et al. (2002).

Let us clarify the role that a NE-rich LR could play in NLP by presenting a QA example. Consider question 161 from the QA track at the 2006 edition of CLEF Footnote 3: “Who is Fernando Henrique Cardoso?”. This question would be easily answered if this person NE were present in a LR with semantic links to other entries, such as being an instance of “Brazilian”, “politician”, “president” or “minister”.

1.1 Motivation and roadmap

Our present work aims at devising a generic methodology to extend existing LRs with NEs. The approach should be general enough to be applicable to different kinds of LRs and, furthermore, it should be language independent. NEs should not only be introduced into the LR but also linked to relevant existing entries by means of semantic relations. Moreover, the procedure should be fully automatic and produce a high-quality final resource.

Because of the requirements posed by the task (high-quality automatic extension of LRs with up-to-date NEs), we have come up with two main ideas that characterise our approach. The first is to exploit the information already present in LRs; these resources have been manually built by expert lexicographers and hence the information encoded has high quality and can be used to support and guide their own extension. The second is to take advantage of the so-called New Text sources.

Up to now, research devoted to the automatic population of LRs has mostly focused on extracting the required information from two kinds of sources: Machine Readable Dictionaries (MRDs) and raw corpora. However, both present disadvantages. While MRDs are small in size and thus limit the quantity of information that can be extracted, corpora consist of unstructured text and therefore make it harder to extract valuable information.

According to Hearst (1998), relations found in unrestricted text tend to be subjective judgements compared to the more established statements present in dictionaries and encyclopaedias. This is in line with the study conducted by Wiebe et al. (2004). They analysed the Wall Street Journal Treebank Corpus and divided it into opinion and non-opinion pieces. They found that 70% of the sentences in opinion pieces are subjective and 30% are objective, whereas in non-opinion pieces only 44% of the sentences are subjective and 56% are objective. Therefore, unless some post-processing is carried out, these kinds of textual sources are not appropriate for an automatic acquisition process. Wiebe and Riloff (2005) tackle this problem by creating subjective and objective sentence classifiers. Nevertheless, the results are far from perfect; the best classifier, which is supervised, obtains 76% accuracy while the best unsupervised one achieves 73.8%.

Another drawback of corpus-based methods is that, if no special treatment is applied, they might acquire the same instance with different lexical forms (Fleischman et al. 2003) (e.g. Bill Clinton and William Clinton) and therefore include them as different instances in the created resource.

However, new types of text, the so-called New Text, have emerged as a consequence of the appearance of new forms of communication. By New Text we refer to “new types of text—dynamic, reactive, multilingual, with numerous cooperating or even adversarial authors and little or no editorial control” which have arisen due to “recent advances in publication and dissemination systems” (Karlgren 2006). We are interested in using these kinds of sources because (1) they tend to have some degree of structure, which facilitates the extraction of valuable information, and (2) they are dynamic and thus a suitable source for guaranteeing up-to-date information. Making use of these new kinds of sources could present important advantages for Information Extraction compared to the aforementioned ones. New types of sources such as folksonomies (aka social tagging) and wikis contain semi-structured semantic information (categorisation tags, interlingual and multilingual links, attribute-value tables, etc.) that is not only useful for recognising the elements to be extracted but also for disambiguating and normalising them. Besides, these sources are dynamic, thus change over time, and, because they are collaboratively built, reflect language variety. The challenge consists of adapting state-of-the-art extraction techniques in order to derive the maximum benefit from these new kinds of sources.

One of these new kinds of text is known as wiki. Wikis can be defined as on-line texts that allow users to easily edit and change the contents. These characteristics make them an effective tool for collaborative authoring. The most widely known example of a wiki resource is Wikipedia, a multilingual encyclopaedia that follows the wiki philosophy. Wikipedia is an interesting textual source for the automatic creation of LRs because, being an encyclopaedia, it contains facts dealing with the entire range of human knowledge and, as it is developed by a large number of people, Footnote 4 it reflects the variations of language and human thought. The quality of Wikipedia’s content is comparable to that of traditional encyclopaedias, according to Giles (2005), who compares its English version to the Encyclopaedia Britannica, and to a study carried out by the WIND research institute for the Stern magazine Footnote 5 , Footnote 6, which compares the German version with the Brockhaus On-line encyclopaedia.

Several aspects make this research different from previous work on lexical and semantic knowledge acquisition. Compared to research that relies on corpora, our research avoids problems due to subjective judgements Footnote 7 and inconsistencies caused by referring to the same instance in different ways, whereas compared to research that uses MRDs, our method is not limited by the small size of the input resource.

Table 1 compares the relevant characteristics of corpora, MRDs and Wikipedia for their application to knowledge acquisition. Taking into account all four features considered (structure, subjectivity, size and nature), Wikipedia emerges as the resource offering the best trade-off.

Table 1 Comparison of corpora, MRDs and Wikipedia

Apart from the knowledge acquisition bottleneck, another important problem of the field has to do with interoperability. The lack of long-term planning has led to LRs in different formats (often incompatible), aimed at specific subfields. It is only in recent years that the community has become aware of this problem. Several actions are now being taken to address it, including, to mention but a few:

  • The establishment in 2002 of a technical subcommittee in ISO, TC37/SC4, Footnote 8 devoted to the creation of standards for LRs in order to maximise their applicability.

  • Research efforts to create linked resources; examples are the Global WordNet Association, Footnote 9 constituted in 2000, and the Meaning project. Footnote 10

  • The creation of an international conference devoted to LR interoperability (The International Conference on Global Interoperability for LRs (ICGL) Footnote 11), whose first edition was held in 2008.

An added value of our proposal is the use of standards, both to make the procedures more generic and independent of the specific resource(s) used and to improve the interoperability and future sharing of different LRs. Concerning this matter, we will study the use of the Lexical Markup Framework (LMF), an ISO standard for LRs, as the representation format of the resulting NE resource. The aims of this format are to provide a common model for the creation and use of lexicons, to manage the exchange of data between these resources and to enable the merging of resources.

The rest of the paper is organised as follows. The following section discusses the state-of-the-art. Next, we describe the LRs used in the present research. After that we present our methodology. This is followed by a discussion of the experiments that have been carried out. Finally, we introduce an application to QA and present the conclusions.

2 Background

This section reports on the state-of-the-art and is divided into three subsections. First, we present a survey on general lexical acquisition and the automatic construction of Language Resources. This is followed by a more specific section on the acquisition of NEs and the construction of onomastica. Finally, the section closes with a summary of the use of Web 2.0 sources, and more specifically Wikipedia, in NLP during the last years.

2.1 General lexical acquisition and enrichment of Language Resources

Research on automatic lexical acquisition began in the 1980s and initially focused on acquiring lexical information from MRDs. During the next decade, due both to the availability of large corpora and of the NLP tools needed for their accurate processing (PoS taggers, chunkers, etc.) and to the drawbacks of MRDs, the emphasis shifted to corpus-based approaches. Recent years have seen what could be called “a quantitative evolution”; the increasing processing power of computers together with the availability of robust statistical NLP tools have led to research proposals where the reference corpus is the World Wide Web.

The ACQUILEX project (Acquisition of Lexical Knowledge for NLP Systems, 1989–1992) pioneered the derivation of lexica from early samples of MRDs. Relevant publications from this period include Calzolari (1992), Nakamura and Nagao (1988) and Alshawi (1987).

A later work, Rigau (1998), presents a detailed proposal regarding the massive acquisition of lexical knowledge from monolingual and bilingual MRDs. Apart from designing a productive methodology to build and validate a multilingual KB, the author implemented a software system called SEISD.

Hearst (1992) criticises the utilisation of MRDs in knowledge acquisition because of their fixed size and proposes the extraction of semantic knowledge from corpora by using lexical patterns. Six patterns are proposed together with a methodology to find new ones. The follow-up of ACQUILEX, ACQUILEX-II (1993–1995), made considerable use of corpora as a further source of data for the semi-automatic construction of lexical resources. SPARKLE (1995–1996) demonstrated the important role of shallow parsing for acquiring several types of linguistic information such as subcategorisation, argument structure or selectional preferences. MEANING (2002–2005) (Atserias et al. 2004) acquired EuroWordNet-based information from corpora to support Word Sense Disambiguation. Snow et al. (2006) Footnote 12 extends WordNet with up to 400,000 new synsets by applying a semantic taxonomy induction algorithm that exploits heterogeneous evidence.

Agichtein and Gravano (2000) addresses the scalability problem and proposes an efficient method when dealing with large corpora. Etzioni et al. (2008) introduces Open Information Extraction, an extraction paradigm designed for large corpora in which the system makes a single pass and extracts tuples without any human input. The authors also present TextRunner, an implementation of this paradigm.

2.2 Onomastica acquisition and creation

This section presents an overview of research work regarding the creation and acquisition of onomastica, i.e. dictionaries of proper nouns. The most relevant approaches found in the literature follow.

Sheremetyeva et al. (1998) presents the structure of a multilingual onomasticon made up of a set of monolingual onomastica cross–referenced by translation links. The entries are organised in a hierarchy made up of 45 semantic categories. A semi-automatic population procedure is proposed, which is supported by an acquisition and administration interface.

Prolexbase, a multilingual database of proper nouns, was created within the Prolex project (Tran et al. 2004; Krstev et al. 2005; Maurel 2008). It is based on an ontology which has four layers (instances, linguistic, conceptual and meta-conceptual) and several relations (synonymy, meronymy, antonomasia, etc.). Entries are linked to EuroWordNet’s Inter-Lingual Index. Prolexbase is populated manually. It mainly contains French proper nouns (75,368 lemmas), together with translations for Serbian and German (13,000 entries).

Mann (2002) creates a proper noun ontology from newswire text. The proposal consists of extracting phrases from a 1 gigabyte corpus by applying a Part-of-Speech pattern (a common noun followed by a proper noun). This allows the author to gather 113,000 different proper nouns and to reach a precision of 60% (84% for proper nouns referring to people and 47% for the rest). The author also points out that the employed methodology is problematic with polysemous words and that it is not straightforward to integrate the resulting proper noun ontology with the WordNet noun taxonomy.

Fleischman et al. (2003) extracts concept-instance relations from 15 gigabytes of newspaper text by using two Part-of-Speech patterns (common nouns followed by a proper noun, and appositions). Machine Learning techniques are applied to increase the precision of the extracted information. 500,000 unique instances (Bill Clinton and William Clinton are considered as two different instances) are extracted. An evaluation over 100 concept-instance items is carried out, achieving a precision of 93%.

Sundheim et al. (2006) studies the linkage of a gazetteer to WordNet. The paper proposes to incorporate the instances of a geographic nature from WordNet into the Integrated Gazetteer Database (IGDB). This is justified by the fact that both resources contain complementary information.

De Loupy et al. (2004) proposes to use WordNet as a proper noun thesaurus for a QA system by enriching it with 130,675 proper nouns. These nouns are extracted from several knowledge bases (the authors do not specify which) and from the Internet. 55 types of entries are enriched with proper nouns. However, not all of them seem to contain proper nouns (e.g. “professions” contains “Academic teacher”, “political titles” contains “1st secretary”). The methodology followed to build this thesaurus is not mentioned, which leads us to think that both the acquisition of proper nouns and their insertion into the corresponding synsets are carried out manually.

REPENTINO (REPositório para reconhecimento de ENTidades com NOme) (Sarmento et al. 2006) is a repository of monolingual (Portuguese) NEs. This resource contains 450,129 entities, which are organised according to a taxonomy made up of several top categories (abstract, art and media, nature, event, legal, localisation, organisation, product, being and substance), which in turn are subdivided into subcategories. The NEs are extracted from several corpora and web sources by using semi-automated methods. Details about the number of NEs extracted from each source per category can be found at http://poloclup.linguateca.pt/cgi-bin/repentino/fontes.p. There is a web interface Footnote 13 that allows users both to browse the repository and to suggest new NEs to be added.

2.3 Wikipedia and NLP

In the last few years there has been a growing academic interest in Web 2.0 collaborative resources and, among them, especially in Wikipedia.Footnote 14 This is particularly true for the area of Computational Linguistics, which perceives Wikipedia as a new LR of huge dimensions.

Wikipedia is an on-line encyclopaedia which is constantly built in a collaborative way by a huge number of volunteers. It has versions for more than 200 languages. Wikipedia contains several elements which make it an interesting potential source for Computational Linguistics. We briefly outline the main elements of its structure:

  • Pages. The page is the main element of Wikipedia. It represents the concept of article or encyclopaedic entry.

  • Redirects, which can be associated with pages. They represent orthographic variants of entry titles.

  • Categories, to which pages can be associated. The categories form a taxonomy; a category can have one or several subcategories and belongs to a supercategory.

  • Intralingual links. They connect two pages that belong to the same language.

  • Interlingual links. They connect equivalent pages that belong to different languages.

Several events in which Wikipedia has a central role have been organised lately in this field, including the evaluation tasks WiQA Footnote 15 and GikiCLEF Footnote 16 and the workshops NEW TEXT, Footnote 17 WikiAI08, Footnote 18 and The People’s Web Meets NLP. Footnote 19

The community has also developed tools that allow researchers to access the information encoded in Wikipedia and other similar resources. Examples of these are JWPL Footnote 20 and JWKTL,Footnote 21 APIs that provide access to the information contained in Wikipedia and Wiktionary respectively (Zesch et al. 2008).

Wikipedia has been exploited for a wide range of tasks such as monolingual (Ahn et al. 2005; Jijkoun et al. 2005; Buscaldi and Rosso 2006) and multilingual (Ferrández et al. 2007a, b) QA, semantic relatedness (Ponzetto and Strube 2007; Gabrilovich and Markovitch 2007; Milne and Witten 2008), Information Extraction (Wu et al. 2008), NE Disambiguation (Bunescu and Pasca 2006) or NE Recognition (Nothman et al. 2009). Furthermore, several researchers have used it to build LRs. Gregorowicz and Kramer (2006) mine a term-concept network from Wikipedia. Suchanek et al. (2007) introduces an ontology automatically derived from Wikipedia and WordNet. Auer et al. (2008) extracts structured information from Wikipedia and makes it available on the Web. Pedro et al. (2008) extracts a medical ontology. Milne et al. (2006) mines a thesaurus for the agriculture domain. Medelyan and Legg (2008) integrates Cyc and Wikipedia. Jones et al. (2008) builds a domain-specific multilingual dictionary by extracting the entries from a Wikipedia category, which is then used to customise a Machine Translation system. Ruiz-Casado and Castells (2006) extracts relations between Wikipedia entries which are added to their corresponding WordNet entries.

3 Language Resources

This section introduces the LRs used in the present research for the different languages covered (English, Italian and Spanish). The LRs are WordNet (for English), EuroWordNet (for Italian and Spanish) and PAROLE-SIMPLE-CLIPS (for Italian). The following subsections briefly describe each of these LRs.

3.1 WordNet

WordNet is an on-line lexical database for English developed at Princeton University that contains nouns, verbs, adjectives and adverbs organised into sets of synonyms, called synsets, together with several types of semantic relations among its nodes (Miller 1995). It is manually developed by a team of linguists and its design is inspired by psycholinguistic theories of human lexical memory. This resource is widely used within the NLP community. In fact, it has become the de facto standard for several NLP tasks such as Word Sense Disambiguation.

The version of WordNet used in this research, 2.1, is made up of 117,597 synsets (81,426 nouns, 13,650 verbs, 18,877 adjectives and 3,644 adverbs) and 155,327 variants (117,097 nouns, 11,488 verbs, 22,141 adjectives and 4,601 adverbs).

3.2 EuroWordNet

EuroWordNet (EWN) (Vossen 1998) is a project funded by the European Union with the aim of developing a multilingual database of inter-connected wordnets for several European languages. This project is inspired by WordNet but introduces important improvements. EWN contains new types of relationships, including some across parts of speech. Moreover, EWN is a multilingual resource; a module called the inter-lingual-index (ILI) links “equivalent” synsets in the various wordnets by using the synsets of WordNet 1.5 as a pivot. EWN introduces a set of 1,024 top concepts common to all the languages and a language-independent Top Ontology built from the 63 most abstract of these top concepts.

In this research we use two wordnets that belong to the EWN model: the Italian and Spanish wordnets.

3.2.1 Italian WordNet

The Italian WordNet (IWN) (Alonge et al. 1999) was built from different Italian lexical and corpora sources, such as the Italian Machine Dictionary, the Italian Reference Corpus and the PAROLE lexicon. IWN was originally created in the framework of the EWN project and then further extended in the Italian national project “Integrated System for the Automatic Language Processing” (SI-TAL).

In its current status, IWN provides the semantic description for around 67,000 Italian word senses (9,096 verbs, 32,099 common nouns, 3,450 proper nouns, 4,356 adjectives, and 513 adverbs, either single or multi-word units), which are clustered in approximately 50,000 synsets. IWN employs the same set of semantic relations used in EWN and there are currently 117,068 instances of language-internal relations.

3.2.2 Spanish WordNet

The Spanish WordNet (Verdejo 1999) was built within the EWN project by a research team belonging to three universities: UNED, University of Barcelona and Technical University of Catalonia. It was afterwards extended, enriched and mapped to WordNet 1.6. The version used in this research Footnote 22 contains 30,485 synsets, 52,515 variants, 73,665 language internal relations and 28,283 equivalence relations to the ILI.

3.3 PAROLE-SIMPLE-CLIPS

PAROLE-SIMPLE-CLIPS (PSC) is an Italian computational lexicon which has been developed in the framework of three different projects. The first two, PAROLE (Ruimy et al. 1998) and SIMPLE (Lenci et al. 2000), were funded by the European Union and were devoted to the research and development of wide-coverage, multi-purpose and harmonised computational lexicons for twelve European languages. While PAROLE dealt with the morphological and syntactic layers, SIMPLE added a semantic layer to the PAROLE data. Finally, CLIPS (Ruimy et al. 2002) was a subsequent Italian national project in which the Italian lexicon was enlarged and refined.

The semantic layer of PSC, the relevant one for the current research, contains about 55,000 semantic units (i.e. senses) organised in an ontology made up of 153 semantic types (i.e. ontology nodes).

From a theoretical point of view, the linguistic background of PSC is based on the Generative Lexicon theory (Pustejovsky 1991). In this theory, the sense is viewed as a complex bundle of orthogonal dimensions that express the multidimensionality of word meaning. The most important component for representing the lexical semantics of a word sense is the qualia structure which consists of four qualia roles (formal, constitutive, agentive and telic).

Each qualia role can be considered as an independent element or dimension of the vocabulary for semantic description. The qualia structure enables us to express different or orthogonal aspects of word sense whereas a one-dimensional inheritance can only capture standard hyperonymic relations.

3.4 Mapping between PSC and IWN

Although PSC and IWN follow different lexical models, they also present compatible aspects (as a matter of fact, the ontologies of SIMPLE and EWN are compatible). Linking both resources offers the end-user more exhaustive lexical information, combining features offered by the two lexical models. It provides not only reciprocal enhancements but also a validation of the two resources. Moreover, the linking has a multilingual dimension: on the one hand, IWN is linked to wordnets for other languages through the ILI; on the other hand, PSC shares the theoretical model, the representation language, the building methodology and a set of core entries with 11 other European lexicons. Regarding the current status of this linking, 72.37% of word senses about concrete entities and 69.59% of word senses about abstract entities and events have been mapped (Roventini et al. 2007; Roventini and Ruimy 2008).

4 Procedure

In this section we explain thoroughly the procedure followed to derive a lexicon of NEs from existing LRs and Wikipedia. It consists of several sequential phases which we will refer to as: mapping, disambiguation, extraction, NE identification and post-processing. A graphic depicting the overall process is presented in Fig. 1. As previously stated, the approach followed takes advantage of information already present in LRs and exploits the semi structured nature of New Text.

Fig. 1 Diagram of the procedure

Our method maps the noun is-a hierarchy of the LRs to Wikipedia categories, disambiguates any ambiguous mappings, extracts the articles present in the mapped categories and identifies which of them are NEs. Several pieces of information about the NEs, such as written variants, definitions, etc., are introduced into a NE lexicon. In a post-processing phase, (1) additional NEs are extracted by exploiting the interlingual links of Wikipedia and (2) the extracted NEs are linked to ontologies.

The following subsections deal with each phase. Afterwards we present the structure of the resulting NE lexicon.

4.1 Mapping

In this first step the instantiable nouns Footnote 23 present in the LRs are mapped to Wikipedia categories. These mappings are obtained by comparing the lemmas of the nouns to those of the categories. In order to do this, the categories of Wikipedia are lemmatised with Freeling 2.0 (Atserias et al. 2004), as this tool provides PoS-tagging machinery for the different languages considered (English, Spanish and Italian).

Once we have the subset of nouns that are instantiable and the categories have been lemmatised, we map the LR nouns to Wikipedia categories by matching their lemmas. For example, the noun “country” would be mapped to the category “Countries” as the PoS-tagger would obtain the same lemma for both words.
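To make the lemma matching concrete, the following minimal Python sketch illustrates the idea (our own illustration, not the original implementation; lemmatise, wordnet_nouns and wiki_categories are hypothetical placeholders for a FreeLing wrapper and the two input sets).

    def build_mapping(wordnet_nouns, wiki_categories, lemmatise):
        """Map LR nouns to Wikipedia categories by lemma identity.

        wordnet_nouns   -- iterable of instantiable noun lemmas from the LR, e.g. ["country", ...]
        wiki_categories -- iterable of category titles, e.g. ["Countries", ...]
        lemmatise       -- callable returning the lemma of a word (e.g. a FreeLing wrapper)
        """
        # Index the categories by the lemma of their title.
        lemma_to_categories = {}
        for cat in wiki_categories:
            lemma_to_categories.setdefault(lemmatise(cat), []).append(cat)

        # "country" and "Countries" share the lemma "country", so they are mapped together.
        mapping = {}
        for noun in wordnet_nouns:
            for cat in lemma_to_categories.get(lemmatise(noun), []):
                mapping.setdefault(noun, []).append(cat)
        return mapping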

4.2 Disambiguation

Once the nouns of the LR have been mapped to categories, a further mandatory step must be carried out for those nouns that are polysemous: the sense that corresponds to the mapped category should be identified. Other approaches have neglected this step, e.g. YAGO (Suchanek et al. 2007) chooses the most frequent sense as the correct one and, subsequently, errors produced by this assumption are manually corrected.

We have devised two different approaches to do this automatically. The first looks for common instances in the hyponym trees of both the noun senses and the category, while the second performs text similarity between the definitions of the noun senses and the category. The following subsections present both approaches.

4.2.1 Instances intersection

We hypothesise that instances could be useful to disambiguate polysemous WordNet words with respect to Wikipedia categories. E.g. the English word “obelisk” is mapped to the category “Obelisks”. It has two senses in WordNet (1. stone pillar, 2. character used in printing). The first sense has one instance (“Washington Monument”) while the second has none. In the Wikipedia category “Obelisks” we find the instance “Washington Monument”. Thus, the sense chosen for the mapping would be the first one.

As the taxonomy of Wikipedia is usually deeper than that of WordNet, we look for instances not only in the mapped categories but also in their hyponyms (subcategories). However, the subcategory relation in the category taxonomy of Wikipedia does not always follow the hyponymy relation.Footnote 24 Therefore, in order to exploit subcategories, we need to identify whether they are hyponyms or not. We propose to apply regular expression patterns which can hold both lexical and Part-of-Speech elements. If a subcategory matches a pattern then it is considered a hyponym. From studying the category structure of Wikipedia, we came up with the following patterns for English (for each pattern we provide an example of a matching subcategory for the category “Philosophers”):

  • ^ category " by|in|from|of ", e.g. “philosophers of mind”

  • ^ category " stubs" $, e.g. “philosophers stubs”

  • ^ (JJ|JJR|NN|NP)+ (CC(JJ|JJR|NN|NP)+)* " " category $, e.g. “Spanish philosophers”

As an example, we show how the word philosopher (1. specialist in philosophy, 2. wise person who is calm and rational) is disambiguated with respect to the category “Philosophers”. The first sense contains several instances such as “Averroes” while the second contains none. “Averroes” is not present in the mapped category but it is found in a subcategory that follows the hyponymy relation (“Philosophers” → “philosophers by nationality” → “Spanish philosophers”).
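A minimal Python sketch of this instance-intersection test follows (our own illustration under simplifying assumptions: senses_of, instances_of, articles_in and subcategories_of are hypothetical accessors over WordNet and the Wikipedia dump, and the hyponym patterns are crude stand-ins for the lexical/PoS patterns listed above).

    import re

    # Simplified stand-ins for the English hyponymy patterns listed above.
    HYPONYM_PATTERNS = [
        r"^{cat} (by|in|from|of) ",     # e.g. "philosophers of mind"
        r"^{cat} stubs$",               # e.g. "philosophers stubs"
        r"^[\w-]+( [\w-]+)* {cat}$",    # crude proxy for the PoS pattern, e.g. "spanish philosophers"
    ]

    def is_hyponym(subcategory, category):
        """Return True if the subcategory title matches one of the hyponymy patterns."""
        sub, cat = subcategory.lower(), re.escape(category.lower())
        return any(re.search(p.format(cat=cat), sub) for p in HYPONYM_PATTERNS)

    def disambiguate(noun, category, senses_of, instances_of, articles_in, subcategories_of):
        """Select the noun senses whose WordNet instances also occur in the mapped category tree."""
        # Articles from the mapped category and from its hyponym subcategories.
        articles = set(articles_in(category))
        for sub in subcategories_of(category):
            if is_hyponym(sub, category):
                articles |= set(articles_in(sub))
        # Keep the senses sharing at least one instance with those articles,
        # e.g. sense 1 of "philosopher" via the instance "Averroes".
        return [sense for sense in senses_of(noun) if set(instances_of(sense)) & articles]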

Equivalent patterns have also been built for the other languages considered, i.e. for Spanish:

  • ^ category " por|de|del|en ", e.g. “Filósofos de la Edad Antigua”

  • ^ "Wikipedia:esbozo " category $, e.g. “Wikipedia:Esbozo filósofos”

  • ^ category " " (AQ[0-9A-Z]+|N[0-9A-Z]+)+(CC|SP[0-9A-Z]+ (AQ[0-9A-Z]+|N[0-9A-Z]+)+)* $, e.g. “Filósofos árabes”

and for Italian:

  • ^ category " per|di|del|dell’|della|delle|degli ", e.g. “Filosofi del XX secolo”

  • ^ "stub " category $, e.g. “stub Filosofi”

  • ^ category " " (AQ[0-9A-Z]+|N[0-9A-Z]+)+(CC|SP[0-9A-Z]+ (AQ[0-9A-Z]+|N[0-9A-Z]+)+)* $, e.g. “Filosofi atei”

4.2.2 Text similarity

The second disambiguation approach relies on the definitions of the mapped elements: it applies text similarity to identify the correct sense of the polysemous nouns mapped to Wikipedia categories. For each such noun, it computes the similarity between the gloss of each of its senses and the abstract of the mapped category.

As there are different approaches to compute text similarity, we have decided to consider a set of representative methods in order to find out which works best for the current task:

  • Semantic Vectors, a Latent Semantic Analysis-like algorithm based on random projection (Widdows and Ferraro 2008).Footnote 25 It relies on Apache Lucene Footnote 26 for tokenisation and indexing in order to create a term-document matrix. Semantic Vectors then creates a WORDSPACE model by applying random projection. It provides a class (CompareTerms) that calculates the similarity between two terms (which can be words or texts).

    For the current task we have gathered a corpus made up of WordNet glosses and Wikipedia abstracts. On one hand, it contains the glosses of all the synsets present in WordNet 2.1. On the other hand, it contains the abstracts of all the entries present in a Wikipedia dump obtained in January 2008. The final corpus has 1,292,447 terms.

  • A Textual Entailment system (Ferrández et al. 2007a, b; Balahur et al. 2008) which implements several inferences aimed at solving entailment relations: on the one hand, lexical inferences based on distance measures (Levenshtein, Smith-Waterman, etc.); on the other hand, semantic inferences focused on semantic distances between concepts (WordNet-based similarity measures, verb similarities according to relations encoded in VerbNet and VerbOcean, and reasoning about NE correspondences between texts).

    For the application of the system to the current task, we adapted it in order to manage bidirectional meaning relations. Linking WordNet glosses to Wikipedia categories is not a clear entailment phenomenon: it can occur that the gloss is entailed by the category, that the category is entailed by the gloss, or that the entailment holds in both directions. Therefore, to handle these situations we opted for computing the average of the two system outputs, one for each unidirectional relation (as shown in Eq. 1).

    $$ BiSim(Gloss_i, Catg_j) = \frac{sim(Gloss_i \rightarrow Catg_j) + sim(Catg_j \rightarrow Gloss_i)}{2} \quad (1) $$

  • A LR-based algorithm which applies Personalised PageRank to WordNet (Agirre and Soroa 2009). The LR is represented as a graph; nodes represent concepts and dictionary words while relations among concepts are represented by undirected edges. Dictionary words are linked to the concepts associated to them by directed edges.

    Given a pair of texts and a graph-based representation of a LR, this method has basically two steps: it first computes the personalised PageRank over the LR separately for each of the texts, producing a probability distribution over LR concepts. It then compares how similar these two discrete probability distributions are by encoding them as vectors and computing the cosine between the vectors (a minimal sketch of this comparison follows the list).
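As referenced above, the following Python sketch (our own illustration, not the systems’ actual code) shows how such scores are computed and combined: bisim averages the two directional entailment scores as in Eq. 1, and cosine compares two concept probability distributions such as those produced by Personalised PageRank.

    import math

    def bisim(sim_gloss_to_cat, sim_cat_to_gloss):
        """Bidirectional similarity as in Eq. 1: the average of the two directional scores."""
        return (sim_gloss_to_cat + sim_cat_to_gloss) / 2.0

    def cosine(p, q):
        """Cosine between two probability distributions over LR concepts (dicts concept -> mass)."""
        dot = sum(p[c] * q.get(c, 0.0) for c in p)
        norm_p = math.sqrt(sum(v * v for v in p.values()))
        norm_q = math.sqrt(sum(v * v for v in q.values()))
        return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

    # Toy example: distributions for a WordNet gloss and a Wikipedia category abstract.
    gloss_dist = {"concept-a": 0.6, "concept-b": 0.3, "concept-c": 0.1}
    abstract_dist = {"concept-a": 0.5, "concept-b": 0.4, "concept-d": 0.1}
    print(cosine(gloss_dist, abstract_dist))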

4.3 Extraction

Once the mapping has been carried out, NEs can be extracted from the mapped categories. For each mapped category we extract all its subcategories which are hyponyms (see Sect. 4.2.1). From the resulting set of categories (i.e. the mapped category plus all its hyponyms), we obtain the articles they contain and identify which are NEs, as explained in Sect. 4.4. Thus we obtain the set of articles which are NEs. From them we gather further relevant information such as their abstracts and their redirects (this information is explicitly available in a structured form in the Wikipedia database dumps and thus obtaining it is straightforward). Finally, all this information is uploaded to the NE lexicon (see Sect. 4.6).

Let us take the example of the mapped category “Countries”. First, the procedure would obtain all the hyponym subcategories: “Fictional countries”, “Countries by language”, “Arabic-speaking countries”, etc. Subsequently, all the articles from the resulting category set would be extracted: “Neverland”, “Algeria”, “Fictional country” etc. and only those being NEs are considered (in this example “Fictional country” would be discarded). From the articles that are NEs we gather other information; from “Neverland” we would get the redirects “Never Land” and “Never Never Land” and the abstract “Neverland (also spelled Never Land or expanded as Never Never Land) is a fictional world featured in the works of J. M. Barrie and those based on them”.
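The extraction step can be summarised by the following Python sketch (our own illustration; is_hyponym is the helper from the earlier sketch, while articles_in, subcategories_of, is_named_entity, redirects_of and abstract_of are hypothetical accessors over the Wikipedia dump).

    def extract_named_entities(category, articles_in, subcategories_of, is_hyponym,
                               is_named_entity, redirects_of, abstract_of):
        """Collect NEs, with their redirects and abstracts, from a mapped category and its hyponyms."""
        # The mapped category plus its hyponym subcategories,
        # e.g. "Countries", "Fictional countries", "Arabic-speaking countries", ...
        categories = [category] + [sub for sub in subcategories_of(category)
                                   if is_hyponym(sub, category)]

        extracted = {}
        for cat in categories:
            for article in articles_in(cat):
                # Keep only articles identified as NEs ("Neverland", "Algeria", ...);
                # class articles such as "Fictional country" are discarded.
                if is_named_entity(article):
                    extracted[article] = {
                        "variants": redirects_of(article),   # e.g. "Never Land", "Never Never Land"
                        "abstract": abstract_of(article),
                        "class": category,                   # the LR noun sense it will be linked to
                    }
        return extracted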

4.4 NE identification

We have explored three different possibilities for identifying which of the extracted articles are NEs. The first relies on a web search engine, the second on the content of Wikipedia entries, while the third combines the two. All three share a common aspect, though: they exploit the capitalisation norms followed in some languages, i.e. that proper nouns begin with an uppercase letter while common nouns begin with a lowercase one. A detailed explanation of each of them follows.

4.4.1 Web search

The article’s title is searched for on the World Wide Web by using a web search engine. The first 50 results where the title is found are returned and an algorithm calculates the number of times the article’s title appears in the website’s description (1) with all its words beginning with capital letters, (2) with some words beginning with capital letters and (3) with no word beginning with a capital letter. Besides, a threshold is established in order to discriminate between articles that are instances and those that are not, according to the different capitalisation models.
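A minimal Python sketch of this capitalisation test follows (our own illustration; snippets stands for the result descriptions returned by whatever search-engine wrapper is used, and the threshold value is only an example).

    def is_ne_by_web(title, snippets, threshold=0.85):
        """Classify a title as a NE if its occurrences in search snippets are mostly fully capitalised."""
        all_caps = some_caps = no_caps = 0
        for snippet in snippets:
            start = snippet.lower().find(title.lower())
            if start == -1:
                continue
            words = snippet[start:start + len(title)].split()
            capitalised = sum(1 for w in words if w[:1].isupper())
            if capitalised == len(words):
                all_caps += 1       # e.g. "Fernando Henrique Cardoso"
            elif capitalised > 0:
                some_caps += 1
            else:
                no_caps += 1
        total = all_caps + some_caps + no_caps
        return total > 0 and all_caps / total >= threshold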

4.4.2 Wikipedia search

This approach also takes advantage of capitalisation norms, but instead of looking for entry occurrences on the World Wide Web, it looks for them in the body of the entry’s article, following Bunescu and Pasca (2006). The difference is that our method is language independent due to the use of Wikipedia’s interlingual links. For a given Wikipedia article title, whatever its language, we obtain its equivalents in a set of ten languages that follow the aforementioned capitalisation rules (Catalan, Dutch, English, French, Italian, Norwegian, Portuguese, Romanian, Spanish and Swedish). Apart from the language independence, considering the entry in ten languages presents another important advantage: the amount of text in which we look for occurrences of the entry is bigger, hence the results are more representative.

In order to obtain the entry title for each of these languages we use the interlingual links of Wikipedia that connect equivalent entries in different languages (translations). We look for occurrences of the article title in the body of each translation and compute the percentage of times it begins with uppercase. Finally, as in the previous approach, if the percentage is higher than a threshold then the article title is classified as a NE.
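The following Python sketch illustrates this multilingual check (our own illustration; translations_of and plain_body_of are hypothetical accessors over the interlingual links and the converted article bodies, and the threshold value is only an example).

    CAPITALISING_LANGUAGES = ["ca", "nl", "en", "fr", "it", "no", "pt", "ro", "es", "sv"]

    def find_occurrences(title, text):
        """Yield the surface strings in text that match the title case-insensitively."""
        needle, haystack = title.lower(), text.lower()
        start = haystack.find(needle)
        while start != -1:
            yield text[start:start + len(title)]
            start = haystack.find(needle, start + 1)

    def is_ne_by_wikipedia(title, language, translations_of, plain_body_of, threshold=0.95):
        """Classify a title as a NE if, across its translations, it mostly occurs capitalised."""
        uppercase = total = 0
        # Follow the interlingual links to the equivalent titles in the ten languages.
        for lang, local_title in translations_of(title, language, CAPITALISING_LANGUAGES):
            body = plain_body_of(local_title, lang)
            for occurrence in find_occurrences(local_title, body):
                total += 1
                if occurrence[:1].isupper():
                    uppercase += 1
        return total > 0 and uppercase / total >= threshold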

From a technical point of view it is worth mentioning that the article bodies of Wikipedia are not plain text but MediaWiki markup and thus are not directly processable by text tools. In order to carry out the current procedure, we first transformed the article bodies into plain text by using two Perl modules, Text::MediawikiFormat Footnote 27 and html2text.Footnote 28

All in all, this approach presents two advantages over the previous one:

  • Language independence. Whatever the language we apply these procedures to, we can obtain the Wikipedia entry titles for languages which follow the aforementioned capitalisation norms.

  • Avoidance of sense variation. A problem of the previous method is that some nouns have senses in which they are instances and others in which they are classes. If an extracted entry is a NE but also has a class sense, the method could fail to classify it as a NE, since both senses would be found on the Web. E.g. the Wikipedia entry “Children’s Machine” is a NE referring to a laptop developed by the OLPC (acronym of One Laptop Per Child). However, this term can also be found in the string “The children’s machine”, the title of a book by Seymour Papert in which “children” and “machine” are classes. With the new method we look for “Children’s Machine” in the body of its article, and so it is very unlikely to find this string referring to the book.

4.4.3 Combining Wikipedia and the Web

While searching for occurrences of the title in Wikipedia avoids noise due to possible sense variation, the web method presents an important advantage: the amount of text available is considerably bigger and therefore more occurrences can be found.

We conclude that the advantages of these two approaches can be combined by extracting salient terms from the entry’s body text in Wikipedia (the tf-idf measure is applied) and then searching the Web for pages where the entry title and these terms appear. Therefore, our combination method consists of the web search method refined with significant terms from the Wikipedia entry.

Following the example presented for the previous method (the entry “Children’s Machine”), of the first ten results from Google, six correspond to the computer and the remaining four to Papert’s book. However, if we extract the two most significant terms from the body text of the Wikipedia entry according to tf-idf (“OLPC” and “$100 laptop”), and then search for the three terms (the title plus these two terms) in Google, then all of the first ten results correspond to the computer.
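A hedged sketch of this combined query construction follows (our own illustration; tokenise, document_frequency and num_documents are hypothetical helpers over the Wikipedia collection, and the resulting query string is simply handed to the search-engine wrapper).

    import math
    from collections import Counter

    def top_tfidf_terms(body_text, tokenise, document_frequency, num_documents, k=2):
        """Return the k most salient terms of an entry body according to tf-idf."""
        counts = Counter(tokenise(body_text))
        scores = {term: tf * math.log(num_documents / (1 + document_frequency(term)))
                  for term, tf in counts.items()}
        return [term for term, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]]

    def build_refined_query(title, body_text, tokenise, document_frequency, num_documents):
        """Combine the entry title with its salient terms,
        e.g. "Children's Machine" plus "OLPC" and "$100 laptop"."""
        terms = top_tfidf_terms(body_text, tokenise, document_frequency, num_documents)
        return " ".join(['"%s"' % title] + ['"%s"' % term for term in terms])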

4.5 Postprocessing

The aim of the postprocessing phase is to improve and augment the information extracted and introduced into the NE lexicon. Two different actions are carried out: (1) introducing additional NEs and (2) linking NEs to ontologies.

Additional NEs are introduced into the lexicon by exploiting Wikipedia multilingual links. If a NE has been extracted for language a but its equivalent in language b has not, we gather the NE for language b and add it to the lexicon.

Links to ontologies present in some of the LRs could be exploited to connect the extracted NEs to them. On one hand, the English WordNet has been linked to the SUMO ontology (Niles and Pease 2003). On the other hand, the Italian PSC contains an ontology itself. Therefore, the extracted NEs are connected to these two ontologies.

4.6 The Named Entity Lexicon

In order to make the procedures independent of specific LRs and to facilitate interoperability with other LRs, we provide a standard-compliant output format. The elements that are part of this output are mainly NEs, orthographic variants of these NEs and classes to which these NEs belong (by means of “instance of” relations). Since this data can be naturally represented by means of a LR, and since we use it to build one, we have decided to follow the Lexical Markup Framework (LMF), an ISO standard for the representation of lexicons (Francopoulo et al. 2008; ISO 24613 2008), in order to encode the output.

A description of the LMF elements we have considered and their role in our lexicon follows. The “Lexicon” element holds each NE monolingual dictionary. “LexicalEntry” acts as a container for all the information that regards each NE and contains two child elements. The first, “Lemma”, contains the lemma of the NE and its orthographic variant/s (by making use of “FormRepresentation”). The second, “Sense”, holds semantic information which can be one of two types: (1) relations to other lexical entries (“SenseRelation”) and (2) links to other resources (“MonolingualExternalRef”).

Furthermore, we make use of the NLP multilingual notations extension of LMF to create a multilingual lexicon where NEs for different languages might be related by means of interlingual links. The element of LMF employed for this purpose is the “SenseAxis”; it represents the relationships between different closely related senses in different languages (each of these senses is contained in a “SenseAxisElements”). This element groups together monolingual senses that correspond to one another. Within a “SenseAxis” element we use the “InterlingualExternalRef” in order to link its elements to ontologies.
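To make the element nesting concrete, the following Python sketch builds a skeletal entry with the elements described above (our own illustration, not the actual serialiser; the attribute names are simplified placeholders rather than the exact LMF data categories).

    import xml.etree.ElementTree as ET

    def build_lexical_entry(lexicon, lemma, variants, related_senses, external_refs):
        """Append a skeletal LMF-style LexicalEntry to a Lexicon element."""
        entry = ET.SubElement(lexicon, "LexicalEntry")

        # Lemma plus its orthographic variant(s) via FormRepresentation.
        lemma_el = ET.SubElement(entry, "Lemma")
        for written_form in [lemma] + list(variants):
            ET.SubElement(lemma_el, "FormRepresentation", {"writtenForm": written_form})

        # Sense: relations to other lexical entries and links to external resources.
        sense = ET.SubElement(entry, "Sense")
        for target in related_senses:
            ET.SubElement(sense, "SenseRelation", {"type": "instance of", "target": target})
        for resource, node in external_refs:
            ET.SubElement(sense, "MonolingualExternalRef",
                          {"externalSystem": resource, "externalReference": node})
        return entry

    # Toy usage: an entry for "Neverland" linked to a hypothetical WordNet class and SUMO node.
    root = ET.Element("LexicalResource")
    lexicon = ET.SubElement(root, "Lexicon", {"language": "en"})
    build_lexical_entry(lexicon, "Neverland", ["Never Land", "Never Never Land"],
                        ["imaginary place"], [("SUMO", "hypothetical-node-id")])
    print(ET.tostring(root, encoding="unicode"))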

As for the output structure, we have designed a NE lexicon as a database whose structure is compliant (isomorphic) with LMF. Figure 2 presents the ER diagram of this database.

Fig. 2 ER diagram of the NE lexicon database

As an example of the information extracted, we provide the LMF compliant XML notation and the corresponding database entries for a NE for English and Italian (see "Appendix").

5 Results and discussion

This section introduces the experimental setting and presents the evaluation and subsequent discussion of the different phases of the methodology described in the previous section.

5.1 Data

In the current research we use database dumps of English, Italian and Spanish Wikipedia from January 2008.Footnote 29 From these dumps, we have used the page, pagelinks, categorylinks, text and abstract data. Table 2 shows the number of categories and articles for each of the Wikipedia dumps. Concerning LRs, we have used WordNet 2.1, Spanish WordNet 1.6 and PSC.

Table 2 Number of categories and articles in Wikipedia per language

Apart from the aforementioned Wikipedia dumps and LRs, some of the experiments rely on specific test data. In that case, datasets are described with the experiments they were created for.

5.2 Mapping

In order to map the English WordNet, we departed from its noun classes that contain instances (as we are interested in extending a LR with NEs, we decided to consider a set of instantiable nouns; clearly, if a noun is instantiated it is instantiable). Apart from this set of noun classes, following the inheritance principle of the hyponymy relation (i.e. if a noun class is instantiable, so are its hyponyms), we also consider the noun classes that are hyponyms of this set.

For Spanish, as the EuroWordNet model does not include a specific type of relation for instantiation, rather than begin with the Spanish WordNet we are forced to start with the English one. From it we extract the nouns that contain instances and their hyponyms. We obtain the equivalent synset offsets in WordNet 1.6 by using the mapping sets between WordNet versions provided by Daudé et al. (2003).

From the set of 15,906 synsets with instances (and their hyponyms) present in WordNet 2.1, 2,140 cannot be mapped to WordNet 1.6 because no mapping is found, 13,640 have 1-to-1 mappings and 126 have 1-to-n (n > 1) mappings. In the latter case the mapping with the highest confidence score is preserved. We end up with a set of 13,278 instantiable synsets of WordNet 1.6. From these, exploiting the ILI, we obtain the corresponding 15,094 variants of the Spanish WordNet. 2,966 of them are proper nouns (e.g. “África”, “Nuevo Testamento”) and are therefore discarded. This leads us to a set of 12,128 variants, of which 7,739 are monosemous and the remaining 4,389 polysemous. Subsequently, these variants can be mapped to categories of the Spanish Wikipedia.

For Italian, we proceed in an analogous manner. We departed from the set of instantiable nouns of the English WordNet. From this we obtained the equivalent synsets in WordNet 1.5, which is connected to the Italian WordNet through the ILI. Finally, from the Italian WordNet, we obtained the equivalent entries of PSC. From the set of 15,906 synsets with instances (and their hyponyms) present in WordNet 2.1, 2,806 cannot be mapped to WordNet 1.5 because no mapping is found, 12,946 have 1-to-1 mappings and the remaining 154 have 1-to-n (n > 1) mappings. This leads to a set of 13,183 synsets of WordNet 1.5. Following the ILI we gather 12,488 corresponding synsets from ItalWordNet. Finally, exploiting the ItalWordNet-PSC mapping, we obtain 10,498 variants of PSC. After discarding instances we have 6,977 monosemous and 3,067 polysemous nouns.

Table 3 shows the mapping results obtained for English, Spanish and Italian. The table presents two types of results (columns nh without considering hyponyms of the initial synsets and columns h considering them).

Table 3 Mapping for English, Spanish and Italian

The amount of mapped nouns is notably lower for Spanish and Italian than for English, both for monosemous and polysemous nouns. This is expected because both the total number of monosemous and polysemous nouns and the number of Wikipedia categories (45,796 and 39,019 vs. 312,941) are substantially lower for these two languages.

The percentages are considerably lower when considering hyponyms. This is expected, as in doing so we map very specific nouns from deep nodes of the LR taxonomy, for which it is less probable that a corresponding Wikipedia category exists (e.g. a category is expected to exist for the noun “sword” but is less likely for a more specific hyponym such as “rapier”). However, considering hyponyms boosts the total amount of nouns mapped in all cases.

5.3 Mapping analysis

We present an analysis of the mapping results for English (column nh in Table 3). Table 4 shows the percentages of monosemous words, polysemous words and synsets that get mapped to Wikipedia categories for three different dumps, from April 2007, November 2007 and January 2008. As can be seen, the continuous growth of Wikipedia allows us to increase the mapping percentage. 57.44% of the synsets were mapped to the April 2007 dump. This percentage increases to 60.02% for the November 2007 dump and 65.39% for the January 2008 dump (the one we are currently working with).

Table 4 Mapping percentages for different Wikipedia dumps

In order to get a better understanding of the mapping procedure, we have manually analysed a randomly selected set of WordNet classes which do not get mapped to any Wikipedia category. In most of the cases (75%), although there is no matching category, there is a matching article in Wikipedia to which the class could be mapped. E.g. “oracle” could be mapped to the article “Oracle”. In 13% of the cases there is neither a matching category nor a matching article (e.g. “formal garden”). In 10% of the cases there is a matching category but the class is not mapped to it due to a PoS tagger error. E.g. the class “aquarium” is not mapped to the category “Aquaria” because the tagger fails to obtain “aquarium” as the lemma. The remaining 2% is due to the class and the matching category being in different English variants. E.g. the class “railroad tunnel” (American) should be mapped to the category “railway tunnels” (British) but is not, as their lemmas do not match.

5.4 Disambiguation

We have evaluated the two automatic methods (instances intersection and semantic similarity, described in 4.2.1 and 4.2.2 respectively) for English.

In order to evaluate these methods we took a set of 207 mappings of polysemous words from WordNet to Wikipedia categories. For these words we manually selected the sense/s corresponding to the mapped category. In most of the cases (154, 74.4%) there is a one-to-one correspondence. For 37 (17.9%) mappings, more than one sense corresponds to the mapped category; this usually occurs because WordNet senses tend to be finer-grained than Wikipedia categories. Concerning the remaining 16 (7.7%) mappings, no sense corresponds to the mapped category. Additional information is provided for the semantic similarity method: the glosses of the nouns and the abstracts of the categories. This evaluation set is publicly available at http://computing.dcu.ie/~atoral/#Resources.

5.4.1 Instance intersection

This algorithm disambiguates 39% of the words. This low recall, which is due to the low number of instances present in WordNet, is compensated by a very high precision. In fact, all the disambiguated entries were correct. We analysed the reasons why 61% of the words were not disambiguated. There are two main causes:

  • One of the senses from WordNet corresponds to the category but no common instance is found. This happens in 78% of the cases. For 74% of the words there is simply no common instance in both resources. For the remaining 4% a common instance does exist, but it is in a subcategory that, although being a hyponym of the mapped category, the hyponymy patterns are not able to identify as such. E.g. “Colosseum, Amphitheatrum Flavium” is an instance of the second sense of “amphitheater”, which is mapped to the category “Amphitheaters”. “Colosseum” is present in the category “Roman amphitheatre buildings”, which is a subcategory of “Amphitheaters”. However, the aforementioned patterns do not identify “Roman amphitheatre buildings” as a hyponym of “amphitheater”.

  • No sense from WordNet corresponds to the category or the category has been changed. This occurs for the remaining 22%. An example of no sense corresponding to the mapped category is the word “assemblage”, which has four senses: “a group of persons together in one place”, “a system of components assembled together for a particular purpose”, “the social act of assembling” and “several things grouped together or considered as a whole”. The mapped category, “Assemblage”, is “for assemblage artists”. As an example of a category change, the word “college” is mapped to the category “Colleges”, but its content has been moved to “Universities and colleges”. Obviously we cannot map “college” to “Universities and colleges”, as by doing so we would end up with instances of universities under the class college.

5.4.2 Semantic similarity

We have evaluated the systems presented in Sect. 4.2.2 together with two baselines:

  • First Sense, which follows the assumption that senses in WordNet are ordered according to their usage predominance (i.e. the first sense is the most frequent one). First Sense always chooses the first sense of WordNet as the one corresponding to the mapped Wikipedia category.

  • Word Overlap, which calculates the similarity between two texts by counting the number of overlapping words. In order to do this we have used the software package Text::Similarity.Footnote 30

Hypothesising that the different nature of the considered systems might make their results complementary, we have also explored their combination; we present three strategies:

  • Voting. For each mapping it ranks senses according to the number of times they are returned by the different systems which are combined. Finally, it outputs the first-ranked sense. Voting returns more than one sense if two or more senses are ranked first with the same score (a minimal sketch of this strategy follows the list).

  • Unsupervised combination. Within this combination, the methods taken into account have the same relevance; a simple average is computed over their outputs.

  • Supervised combination. The whole set of inferences carried out by the Textual Entailment system, together with the scores returned by the other methods, are used as features for a machine learning algorithm. We have used the BayesNet implementation provided by Weka (Witten and Frank 2005) and obtained 10-fold cross-validation results over our gold standard corpus.
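As referenced above, a minimal Python sketch of the voting strategy follows (our own illustration; each system’s output is assumed to be the list of sense identifiers it selects for a given mapping).

    from collections import Counter

    def vote(system_outputs):
        """Return the top-voted sense(s) for a mapping; ties at the top are all returned.

        system_outputs -- e.g. [["obelisk#1"], ["obelisk#1"], ["obelisk#2"]]
        """
        counts = Counter(sense for output in system_outputs for sense in output)
        if not counts:
            return []
        best = max(counts.values())
        return [sense for sense, count in counts.items() if count == best]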

Table 5 presents the scores obtained by the different systems, the baselines and the combinations.Footnote 31

Table 5 Semantic similarity results

The first thing to note is the high score obtained by the First Sense baseline (64.7%). In fact, leaving supervision aside, only one system, Textual Entailment, is able to reach its score.

Regarding combinations, all three of them outperform the best single system; the improvement is slight both for the unsupervised combination (65.7% vs. 64.7%) and for the voting approach (68% vs. 64.7%), while it is more substantial for the supervised combination (77.11% vs. 64.7%).

5.5 Extraction

We have extracted NEs for the mapped nouns (see Table 3) of each language. Table 6 provides quantitative data about the extracted NEs: not only the number of NEs added to the lexicon, but also the number of orthographic variants (written forms) of these NEs and the number of extracted instance relations linked to the LRs used.

Table 6 Extracted NEs

The number of NEs extracted for Spanish and Italian is notably lower than the number of NEs for English. This result was expected because both the number of pages in Wikipedia (305,000 and 388,000 vs. 2,100,000) and the number of mapped categories (see Sect. 5.2) are significantly lower.

Table 7 provides results about the nature of the English NEs added to the lexicon. It shows the number of instances added per noun lexicographic file of WordNet. For each lexicographic file to which a substantial number of instances is added, we include an example instance together with the synset it is attached to.

Table 7 Number of English NEs per lexicographic file

5.6 NE identification

A set of 278 articles from the English Wikipedia was randomly selected and manually tagged as being instances or classes. This set was used to evaluate the different methods we applied to NE identification: the web-based method, the Wikipedia-based method and their combination (see Sect. 4.4). Concerning the capitalisation model, we chose the one that considers how often the first word of the string begins with a capital letter; a threshold is applied, i.e. the minimum percentage of occurrences in which the article title begins with a capital letter for the title to be considered a NE. The next paragraphs report on the results obtained by these methods.
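The following Python sketch illustrates the capitalisation criterion just described; the function name, the default threshold value and the example occurrences are ours and are given purely for illustration.

  def is_named_entity(occurrences, threshold=0.87):
      """Decide whether an article title denotes a NE: the fraction of its
      occurrences whose first word starts with a capital letter must reach
      the threshold (the capitalisation model described above)."""
      if not occurrences:
          return False
      capitalised = sum(1 for occ in occurrences if occ.split()[0][:1].isupper())
      return capitalised / len(occurrences) >= threshold

  # The title string as matched in three contexts (web snippets or Wikipedia text).
  occurrences = ["Colosseum", "colosseum", "Colosseum"]
  print(is_named_entity(occurrences, threshold=0.6))  # True: 2 of 3 start with a capital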

5.6.1 Web

Table 8 shows the results obtained by the web method. For several values of the threshold (Thr), precision (P), recall (R) and F-measure (β = 1 and β = 0.5) are included. F β=1 weights precision and recall evenly, whereas F β=0.5 weights precision twice as much as recall. Numbers in bold in the table indicate the highest values of F β=0.5.
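For reference, the weighted F-measure used throughout this section follows its standard definition (setting β = 0.5 gives precision twice the weight of recall):

  F_{\beta} = \frac{(1+\beta^{2})\,P\,R}{\beta^{2}\,P + R},
  \qquad
  F_{\beta=1} = \frac{2\,P\,R}{P + R},
  \qquad
  F_{\beta=0.5} = \frac{1.25\,P\,R}{0.25\,P + R}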

Table 8 NE identification results using the web

It can be seen that the highest F β=0.5 (79.04%, with a precision of 76.74%) is obtained when the threshold is set to 0.85 or 0.87. Although other threshold values yield higher values of F β=1, the aim of the approach is to extend a knowledge resource, so we consider precision more important than recall: it is preferable to link fewer NEs to the LRs while making sure that the quality of the final resource is high.

5.6.2 Wikipedia

We have evaluated this approach in two settings: looking for entry occurrences only in the English Wikipedia, and looking for them in the English Wikipedia plus the nine other aforementioned Wikipedias. The aim is to increase the precision (76.74%) of the web-based method without a negative effect on recall (89.80%). Table 9 presents the results obtained for each scenario.

Table 9 NE identification results using Wikipedia

The best F β=0.5 is obtained for thresholds 0.91 to 0.95 when only using the English Wikipedia (75.08%) and for threshold 0.95 when using ten Wikipedias (78.60%). For this threshold, using more text yields roughly 6% higher precision (76.47% vs. 71.81%) at the cost of 3.7% recall (88.44% vs. 91.84%), which supports our hypothesis that using several Wikipedias to increase the amount of text is beneficial. Compared to the web search approach, the current one obtains 1.5% lower recall (88.44% vs. 89.80%) and practically the same precision (76.47% vs. 76.74%). By analysing the results, we have found a drawback of this approach compared to web search: the number of occurrences found per article is quite low, 7.97 when only using the English Wikipedia and 13.59 when also using the others. These values contrast with those obtained with web search; in that experiment we capped the number of occurrences per article at 100, and this number was reached for all the articles of the evaluation set.

5.6.3 Combining Wikipedia and the Web

Finally, we present the results obtained when combining both methods. We have refined the web method by adding salient words from the Wikipedia article to the query (a sketch of this query construction is given below). Table 10 presents the results of adding one, two and three words from the article to the query.
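The sketch below shows one way such a combined query could be built. The salience criterion (simple frequency after stopword removal), the quoting of the title and the stopword list are our assumptions for illustration only; the actual implementation may differ.

  import re
  from collections import Counter

  STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "was", "to", "for"}

  def salient_words(article_text, n=2):
      """Pick the n most frequent non-stopword tokens of the article body
      (an illustrative salience criterion, not necessarily the one used)."""
      tokens = [t for t in re.findall(r"[a-z]+", article_text.lower()) if t not in STOPWORDS]
      return [w for w, _ in Counter(tokens).most_common(n)]

  def combined_query(title, article_text, n=2):
      """Build the refined web query: the article title plus n salient words."""
      return '"{}" {}'.format(title, " ".join(salient_words(article_text, n)))

  print(combined_query("Colosseum", "The Colosseum is an ancient amphitheatre in Rome"))
  # "Colosseum" colosseum ancient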

Table 10 NE identification results using the combination method

Of the three configurations, the best F β=0.5 is obtained when considering two additional words from the article body (81.20% with threshold 0.89). The best results obtained with one and three words are slightly lower: 80.37% (threshold 0.87) and 79.89% (threshold 0.93), respectively. Compared to the other methods, this configuration obtains both better precision (79.17% vs. 76.47% and 76.74%) and better recall (90.48% vs. 88.44% and 89.80%).

Figure 3 shows the values of F β=0.5 for the different identification methods in the threshold range [0.81–0.95]. The combination method (with two extra words) obtains better results than the web and Wikipedia methods across the whole range.Footnote 32

Fig. 3 F β=0.5 values for the different identification methods

5.7 Postprocessing

To close the results section, we present the results of the post-processing step: the NEs added by exploiting multilingual links and the links established to the SUMO and SIMPLE ontologies.

By exploiting Wikipedia’s multilingual links we are able to extract 26,157 additional NEs for English, 38,253 for Spanish and 47,168 for Italian. Therefore, after this step the lexicon contains 974,567 NEs for English, 137,583 for Spanish and 125,806 for Italian.

In this step we also connect sets of equivalent NEs in different languages (encoded in the “SenseAxis” element) to two ontologies through the “InterlingualExternalRef” element: 814,251 such sets are linked to SUMO, while 42,824 are linked to SIMPLE. The substantial difference in the number of sets connected to these ontologies (roughly a 20 to 1 factor) is due to the fact that, in order to connect a set of NEs to SIMPLE, it has to contain an Italian NE linked to PSC, whereas to connect a set to SUMO it needs to contain an English NE linked to WordNet. The results are thus expected, as the NE lexicon contains many more NEs linked to WordNet (1,366,899) than to PSC (139,190).
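The following Python sketch illustrates the linking constraint just described; the data structures, field names and example values are illustrative only and do not correspond to the actual LMF encoding.

  def ontology_links(ne_set):
      """Given a set of cross-lingually equivalent NEs (one "SenseAxis"),
      decide which ontologies it can be linked to, following the constraints
      described above. Field names are illustrative only."""
      links = []
      if any(ne["lang"] == "en" and ne.get("wordnet_synset") for ne in ne_set):
          links.append("SUMO")    # SUMO is reached through the WordNet mapping
      if any(ne["lang"] == "it" and ne.get("psc_sense") for ne in ne_set):
          links.append("SIMPLE")  # SIMPLE is reached through the PSC mapping
      return links

  colosseum = [
      {"lang": "en", "form": "Colosseum", "wordnet_synset": "amphitheater.n.02"},
      {"lang": "it", "form": "Colosseo", "psc_sense": None},
  ]
  print(ontology_links(colosseum))  # ['SUMO']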

Table 11 shows the number of NE sets linked to the different nodes of the SIMPLE ontology. For each ontology node to which a substantial number of NE sets is linked, it shows the actual number of NEs, an example of an Italian NE and the PSC word sense to which this NE is connected.

Table 11 Number of NE sets linked to the SIMPLE ontology

6 Question answering application

With the aim of applying our NE lexicon to a real-world NLP task and validating its usefulness, we have added the knowledge it encodes to a QA process.Footnote 33 The main idea is to plug the lexicon into a QA system and to use its knowledge to validate the answers given by the system.

For this purpose, we have used the BRILIW (Spanish acronym for “QA using Inter Lingual Index module of EuroWordNet and Wikipedia”) system (Ferrández et al. 2007a, b). It was designed to locate answers in documents when the answers and the input questions are written in different languages. BRILIW was presented at CLEF 2006, where it ranked first in the bilingual English-Spanish QA task (Magnini et al. 2006; Ferrández et al. 2006).

The BRILIW architecture is built on three main pillars which set it apart from other state-of-the-art Cross-Lingual QA systems: (1) the use of several multilingual knowledge resources to cross-reference words between languages (the ILI module of EuroWordNet and the multilingual knowledge encoded in Wikipedia); (2) the consideration of more than one translation per word when searching for candidate answers; and (3) the analysis of the question in the original language without any translation process.

The architecture of BRILIW is organised as a sequential set of modules. First, the language of the input question is detected. Next, the NEs of the input question are identified and classified with a NE recognition tool and then translated by using Wikipedia. This is followed by an analysis of the input question, in which its answer type and its main syntactic blocks are detected. Later on, the target-language equivalents of the question words are extracted by exploiting the Inter Lingual Index (ILI) module of EuroWordNet; this is done for common nouns and verbs, but not for NEs, as these have already been translated. Subsequently, using as input the translations from the ILI (common nouns and verbs) and Wikipedia (NEs), the passages relevant to the input question are fetched with an Information Retrieval tool. Finally, an ordered list of answers is extracted from the set of relevant passages by applying syntactic patterns.

At this point, we have added a Validation module which uses the knowledge encoded in the NE lexicon to validate the correctness of the answers and, where appropriate, reorder the list of answers provided by BRILIW, with the aim of improving the effectiveness of the whole system.

Using the NE lexicon, the Validation module is able to validate two types of questions: (i) those that expect a NE as the answer (e.g. Who is the General Secretary of Interpol?); and (ii) those which ask for definitions of NEs (e.g. Who is Vigdis Finnbogadottir?). With this objective, the module tags each answer as:

  • UNKNOWN: if the NE given as the answer (type i) or the NE in the question (type ii) is not present in the lexicon.

  • CORRECT: if the NE given as the answer or the NE in the question is present in the lexicon and its type (person, location, etc.) matches the type tagged by BRILIW.

  • INCORRECT: if the NE given as the answer or the NE in the question is present in the lexicon and its type does not match the type tagged by BRILIW.

Once the answers are tagged, the Validation module reorders the list of answers provided by the system according to the following preferential ranking: CORRECT, UNKNOWN and INCORRECT (a minimal sketch of this tagging and reordering logic is given below). Using an official CLEF 2006 question, we show an example of the process in Table 12. This example shows how, by using the knowledge encoded in the NE lexicon, the correct answer is returned in first place, thereby improving the overall accuracy of the system.
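The sketch below illustrates the tagging and reordering logic under the simplifying assumption that the lexicon can be queried as a mapping from NE strings to coarse types; the names, data structures and example lexicon are ours, not the actual implementation.

  PREFERENCE = {"CORRECT": 0, "UNKNOWN": 1, "INCORRECT": 2}

  def tag_answer(answer_ne, expected_type, lexicon):
      """Tag a candidate answer using the NE lexicon (simplified: `lexicon`
      maps NE strings to their coarse types)."""
      if answer_ne not in lexicon:
          return "UNKNOWN"
      return "CORRECT" if lexicon[answer_ne] == expected_type else "INCORRECT"

  def validate(answers, expected_type, lexicon):
      """Reorder the ranked answers of the QA system by the preferential
      ranking CORRECT > UNKNOWN > INCORRECT, keeping the original order
      within each group (stable sort)."""
      return sorted(answers, key=lambda a: PREFERENCE[tag_answer(a, expected_type, lexicon)])

  lexicon = {"Vigdis Finnbogadottir": "person", "Reykjavik": "location"}
  print(validate(["Reykjavik", "Vigdis Finnbogadottir"], "person", lexicon))
  # ['Vigdis Finnbogadottir', 'Reykjavik']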

Table 12 Example of the validation module in QA

We have evaluated the effectiveness of the Validation module and the extent to which its knowledge improves the overall performance of the system. For this purpose we have used the CLEF 2006 set of questions, the EFE corpora, the evaluation measures Footnote 34 proposed by the CLEF organisation (Magnini et al. 2006) and our official results in that competition. In this campaign the CLEF organisation adopted accuracy as the main evaluation score, defined as the average of the per-question scores over all 200 questions. We have used this metric to calculate the overall improvement provided by the Validation module. The results obtained are very promising (see Table 13): BRILIW obtains an improvement of 28.1% compared to the former official results (Ferrández et al. 2006).
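Restating this definition as a formula, where Q is the set of 200 evaluation questions and score(q) is the assessment score assigned to question q:

  \text{accuracy} = \frac{1}{|Q|} \sum_{q \in Q} \text{score}(q), \qquad |Q| = 200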

Table 13 QA results

7 Conclusions and future work

This paper has presented a generic methodology to automatically create a NE lexicon by combining the complementary views of community-driven and authoritative sources. We have motivated and demonstrated that lexical and semantic knowledge acquisition can benefit from exploiting New Text sources such as wikis, showing their potential advantages over common approaches that rely on unrestricted corpora and MRDs.

An important feature of the proposed approach is its high degree of language independence. The method can be directly applied to any language for which a version of Wikipedia, a LR with a noun taxonomy and a lemmatiser are available. In fact, we have applied it to LRs based on different theories and covering three languages (English, Spanish and Italian).

The different phases regarding the construction of this resource have been discussed in detail and have been evaluated. These include an initial mapping procedure, the treatment of polysemous nouns, the extraction and identification of NEs and a post-processing step. Finally, we have built a lexicon of NEs that holds the extracted information and whose representation is compliant with the LMF standard (ISO 24613:2008).

The resulting resource contains 974,567, 137,583 and 125,806 NEs for English, Spanish and Italian respectively and 1,366,860, 141,055 and 139,190 “instance of” relations. This resource, together with two APIs (C++ and PHP), is publicly available at http://computing.dcu.ie/~atoral/#Resource.

While there exist other previous approaches to build NE repositories (see Sect. 2.2), our proposal clearly represents a step forward in terms of automation, language independence, amount of entities acquired and richness of the information represented in the resulting repository (a comparison of our approach to previous ones across a set of features is shown in Table 14). Therefore, we think that this innovative approach could be applied to other types of linguistic phenomena and lead to important advances in the automatic creation and extension of LRs.

Table 14 Comparison of the NE lexicon to previous approaches

We have tested the usefulness of the created resource for real-world applications by applying it to validate the answers produced by a state-of-the-art QA system. With the knowledge of the NE lexicon, the performance of the system increases by 28.1%. The lexicon could also be exploited by systems that attempt to classify NEs across a high number of categories. Moreover, as we provide a classification of entities in the nodes of a taxonomy instead of isolated lists of entities for each category, the resource can be used with different levels of granularity for entity recognition.

As mentioned, the methodology introduced has a high degree of language independence. This has been demonstrated by applying it to a set of Indo-European languages, including two Romance languages (Spanish and Italian) and a Germanic language (English). A further step towards proving this claim would be to apply our approach to a language belonging to a different family; in this direction, ongoing work exploits the methodology introduced here to extract Arabic NEs.

It would also be interesting to extract additional types of information in order to enrich the resulting lexicon, for example relations between NEs. In this case, we plan to identify relations between pairs of Wikipedia articles and to detect their types.