1 Introduction

Named entity recognition (NER) is the problem of locating and categorizing important nouns and proper nouns in a text. For example, in news stories names of persons, organizations and locations are typically important. In the following example, the highlighted named entities hold key information and are useful for language processing applications.

Before joining UCB, Lisa North worked for Pegasus Books in North Berkeley.

Named entity recognition plays an important role in applications such as Information Extraction, Question Answering and Machine Translation. For example, information about named entities such as Lisa North helps a machine translation system to avoid translating them erroneously word by word.

The NER task has been studied extensively for many languages [54], including Arabic and Hebrew. Throughout the past two decades, numerous systems and data resources have been developed for NER. Moreover, there have been several forums and evaluation programs focused on named entity recognition and other related tasks.

In this chapter, we review the general state of NER research, the relevant challenges and the current state-of-the-art work on Semitic NER. Specifically, we look into two case studies for Arabic and Hebrew named entity recognition. We also review Semitic NLP tasks which overlap with named entity recognition. We close with an overview of the available resources for Semitic NER and some of the open research questions.

2 The Named Entity Recognition Task

2.1 Definition

Named entities (NEs) are words or phrases which are named or categorized in a certain topic. They usually carry key information in a sentence and serve as important targets for most language processing systems. Accurate named entity recognition can be a useful source of information for different NLP applications. For example, the performance of applications like Question Answering [69], Machine Translation [7] or Information Retrieval [39] has been improved by named entity information. Table 7.1 shows an example sentence annotated with the named entity information, using different representation schemes. The three intuitive classes of person (PER), location (LOC) and organization (ORG), along with the loosely defined miscellaneous (MIS) class, are used in most NER systems. These classes are mostly relevant to news-related corpora. For other domains, NER systems are expected to be trained and tested with other relevant class labels.

Table 7.1 Sample NER output with the mention-level (SGML), BIO and BILOU representations

Table 7.1 also presents different representations of named entity annotation. Early NER approaches used the mention (chunk) level representation, which annotates a named entity as a whole chunk [66]. As the task evolved into a statistical learning problem, the sequence labeling framework became the standard approach [16, 49]. In sequence labeling, the entire sequence of tokens (usually the sentence) is labeled jointly. BIO labeling is the representation generally used for sequence labeling: each token is marked as being at the Beginning of, Inside or Outside a named entity. In the alternative BILOU representation, the L and U labels are additionally used for the Last token of a multi-token entity and for Unit-length named entities, respectively (Footnote 1).
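The relation between the mention-level and the token-level representations can be illustrated with a minimal Python sketch. The function name `encode`, the span format (token indices with an exclusive end) and the tokenization of the example sentence are assumptions made for illustration only; they are not taken from any particular NER toolkit.

```python
def encode(tokens, mentions, scheme="BIO"):
    """Turn mention-level spans into per-token tags.

    mentions: list of (start, end, label) token spans, end exclusive.
    scheme:   "BIO" or "BILOU".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in mentions:
        length = end - start
        for i in range(start, end):
            if scheme == "BILOU" and length == 1:
                prefix = "U"          # unit-length entity
            elif i == start:
                prefix = "B"          # first token of the entity
            elif scheme == "BILOU" and i == end - 1:
                prefix = "L"          # last token of a multi-token entity
            else:
                prefix = "I"          # any other token inside the entity
            tags[i] = f"{prefix}-{label}"
    return tags


# Example sentence from the introduction, tokenized by hand.
tokens = ("Before joining UCB , Lisa North worked for "
          "Pegasus Books in North Berkeley .").split()
mentions = [(2, 3, "ORG"), (4, 6, "PER"), (8, 10, "ORG"), (11, 13, "LOC")]

bio = encode(tokens, mentions, "BIO")
bilou = encode(tokens, mentions, "BILOU")
```

In this sketch the two schemes differ only in how they mark single-token entities (UCB) and entity-final tokens (North, Books, Berkeley); everything else is tagged identically.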

The scope of named entity recognition has evolved over the past couple of decades. Originally, NER was limited to the extraction of news-related proper nouns such as names of persons, organizations and locations. With the expansion of NLP into other domains, those few traditional named entity classes were no longer sufficient. For example, for an article about science or technology, the three traditional classes are not enough and other named entity classes need to be considered. Moreover, named entities should not be limited to proper nouns. In certain areas of study such as nuclear physics, one might highlight terms such as proton or uranium as named entities (Footnote 2). Thus, despite the common focus on the person, location and organization classes, one can say that NER encompasses the extraction of all important entities in a given context.

2.2 Challenges in Named Entity Recognition

Named entity recognition consists of the following two sub-problems: (1) recognition of named entity boundaries; (2) recognition of named entity categories (classes). These problems are usually (but not necessarily) addressed concurrently. As with most problems in language processing, ambiguities in the language add to the challenge of the task. In the following, we present examples of ambiguities in both the recognition and the categorization of named entities. In the first sentence, there is an ambiguity in the recognition of the named entity Reading, which can be read either as the gerund form of a verb or as a proper noun (a city name). In the second example, the ambiguity is in the named entity type; Fox can be interpreted as a person, an organization or a non-named entity. Furthermore, Washington might refer to a person, a location or an organization (the U.S. government).

  • Reading is located between two major highways.

     <LOC>Reading</LOC> is located between two major highways.

  • Fox criticized Washington.

     <ORG>Fox</ORG> criticized <ORG>Washington</ORG>.

Most challenges of NER lie in its heavily lexicalized and domain-dependent nature. Names make up a large part of a language and are constantly evolving in different domains. In order to have a robust NER system for any given domain (e.g. tourism), we need labeled corpora and lexicons (e.g. names of monuments). Creating and updating such resources for various topics is an expensive task and requires linguistic and domain expertise. In the following, we review the two frameworks of rule-based and statistical NER and discuss their data requirements and robustness.

2.3 Rule-Based Named Entity Recognition

Early approaches to named entity recognition were primarily rule-based. Most rule-based systems used three major components: (1) a set of named entity extraction rules, (2) gazetteers (Footnote 3) for different types of named entity classes, and (3) the extraction engine which applies the rules and the lexicons to the text. The rule set and the lexicons were either completely handcrafted by humans or bootstrapped from a few handcrafted examples. A successful example of the rule-based framework was the AutoSlog Information Extraction system [61]. Table 7.2 presents samples of AutoSlog’s rules and the extracted named entities (Footnote 4). The system starts with a set of simple seed rules for some known entities like Nicaragua. In an iterative bootstrapping framework, the rules are applied and extended to extract new entities like San Sebastian.

Table 7.2 Examples of rules used to extract named entities

Rule-based systems are relatively precise but usually have low coverage and work well only on narrow domains. Their performance usually depends on how comprehensive the rules and lexicons are. Bootstrapping frameworks like [61] are still limited to the domain of the seed rules and lexicon. Furthermore, incorporating deeper knowledge beyond the surface words and lexicons into a rule-based system requires expensive manual effort. In contrast, statistical frameworks are more flexible in incorporating richer linguistic knowledge (e.g. syntax), which results in more robust systems.

2.4 Statistical Named Entity Recognition

The rising popularity of statistical NLP methods, along with the expansion of available data resources, has directed NER research towards data-driven and statistical methods. The use of statistical methods reduced the human effort needed for the tedious construction of rule sets and gazetteers. Soon after their development, statistical and hybrid systems like [51, 52] outperformed the state-of-the-art rule-based systems.

Statistical named entity recognition usually uses the following two main components:

  1. Labeled training data: text corpora where named entities are annotated (similar to examples in Table 7.1).

  2. A statistical model: a probabilistic representation of the training data.

A statistical model is made of parameters which map a language event to a probability. For example, a statistical model trained on our earlier example (Fox criticized Washington) might have parameters such as the probability of the first word in a sentence being a named entity, or the probability of a certain word (e.g. Fox) being labeled as an organization.

As a supervised learning problem, named entity recognition can be modeled as a classification task for each individual token. However, such an approach fails to consider the interdependency between different tokens. Instead, NER is usually seen as a structured learning problem over a sequence of variables. This is the sequence labeling view, in which the learner predicts the labels for the entire sequence of tokens (usually a sentence). This approach allows the modeling of the dependencies that exist between different tokens. In the earlier example, the class disambiguation for the word Fox is easier if the entire sequence (especially the word Washington) is included in the prediction.

In a sequence labeling framework a sentence is represented by a set of token variables \(t_{1},t_{2},\ldots,t_{N}\). The labeler is expected to find the most likely sequence of named entity labels, \(y_{1},y_{2},\ldots,y_{N}\). The set of labels consists of the BIO boundaries along with the named entity types. Thus, the class possibilities for a model which labels person, location, organization are: B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG and O.

Formulating the problem probabilistically, we would like to find the label sequence which satisfies:

$$\displaystyle{ S ={\rm argmax}_{y_{1}\ldots y_{N}}P(y_{1}\ldots y_{N}\vert t_{1}\ldots t_{N}) }$$
(7.1)

Using Bayes’ theorem, we can rewrite and simplify the above formula as:

$$\displaystyle{ S ={\rm argmax}_{y_{1}\ldots y_{N}}P(t_{1}\ldots t_{N}\vert y_{1}\ldots y_{N})P(y_{1}\ldots y_{N}) }$$
(7.2)
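The step from (7.1) to (7.2) can be spelled out as follows. Bayes’ theorem introduces the denominator \(P(t_{1}\ldots t_{N})\), which does not depend on the labels and therefore does not affect the argmax:

$$\displaystyle{ {\rm argmax}_{y_{1}\ldots y_{N}}P(y_{1}\ldots y_{N}\vert t_{1}\ldots t_{N}) ={\rm argmax}_{y_{1}\ldots y_{N}}\frac{P(t_{1}\ldots t_{N}\vert y_{1}\ldots y_{N})P(y_{1}\ldots y_{N})} {P(t_{1}\ldots t_{N})} ={\rm argmax}_{y_{1}\ldots y_{N}}P(t_{1}\ldots t_{N}\vert y_{1}\ldots y_{N})P(y_{1}\ldots y_{N}) }$$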

There are different ways of modeling the sequence labeling problem. One well-known approach is the hidden Markov model (HMM) [58]. HMM is based on two concepts:

  1. 1.

    A probabilistic graphical model in which class variables are represented by states which are able to generate tokens.

  2. 2.

    An assumption that there is a Markov process in the generation of the tokens. The assumption is that the probability of assigning a class to a token depends only on a few earlier tokens (and their class labels).

HMM formulates the labeling problem as:

$$\displaystyle\begin{array}{rcl} S& =& {\rm argmax}_{y_{1}\ldots y_{N}}P(t_{1}\ldots t_{N}\vert y_{1}\ldots y_{N})P(y_{1}\ldots y_{N}){}\end{array}$$
(7.3)
$$\displaystyle\begin{array}{rcl} & =& {\rm argmax}_{y_{1}\ldots y_{N}}\prod _{i=1}^{N}P(t_{i}\vert y_{i})P(y_{i}\vert y_{i-1}){}\end{array}$$
(7.4)

In the formulation shown above, the Markov assumption allows us to shorten the context for computing \(P(y_{1}\ldots y_{N})\) and simply use \(P(y_{i}\vert y_{i-1})\), where \(y_{0}\) stands for a designated start label. This is the first-order HMM, in which the model includes the contextual information of one previous label. Richer, higher-order models use a longer context at the cost of a much larger parameter space.

Fig. 7.1 A simplified HMM for detecting NE boundaries

Figure 7.1 presents an HMM for the simplified task of finding named entity boundaries. In this model, the class labels are limited to only three boundary labels (B, I and O). The Start and End states are used to enforce boundaries for the sequence labeling task. Here, the sequence labeling of named entity boundaries follows a generative story:

  1. The sequence begins at the Start state.

  2. For each token position in the sequence, there is a probabilistic state transition where the class label gets decided.

  3. After each transition, the destination state generates a word.

  4. The sequence finishes at the End state.

The HMM framework above requires two sets of parameters, which are estimated during training:

  1. \(P(y_{i}\vert y_{i-1})\): the state transition probability, which is the conditional probability of the current token’s label given the previous token’s label.

  2. \(P(t_{i}\vert y_{i})\): the probability of generating a token, given its label.

During training, the model learns these two sets of parameters by counting the different state transitions and word generations in the training data and converting the counts into probabilities.
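The counting procedure can be sketched in a few lines of Python. This is a minimal illustration using unsmoothed relative frequencies; the function name `train_hmm`, the `<START>` and `<END>` markers and the two-sentence toy corpus are assumptions made for the example and do not come from any of the cited systems. A practical system would also need smoothing for unseen words.

```python
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Estimate transition and emission probabilities by relative frequency."""
    transition_counts = defaultdict(Counter)  # previous label -> next label
    emission_counts = defaultdict(Counter)    # label -> generated token
    for sentence in tagged_sentences:
        previous = "<START>"
        for token, label in sentence:
            transition_counts[previous][label] += 1
            emission_counts[label][token] += 1
            previous = label
        transition_counts[previous]["<END>"] += 1

    def normalize(counts):
        # Divide each count by the total count of its conditioning context.
        return {
            context: {event: n / sum(events.values()) for event, n in events.items()}
            for context, events in counts.items()
        }

    return normalize(transition_counts), normalize(emission_counts)


# Toy corpus with boundary labels only (B, I, O), invented for illustration.
corpus = [
    [("Fox", "B"), ("criticized", "O"), ("Washington", "B"), (".", "O")],
    [("Lisa", "B"), ("North", "I"), ("worked", "O"), (".", "O")],
]
transitions, emissions = train_hmm(corpus)
```

Here `transitions["B"]["I"]` plays the role of \(P(y_{i} = I\vert y_{i-1} = B)\) and `emissions["B"]["Fox"]` the role of \(P(t_{i} = Fox\vert y_{i} = B)\).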

Given a trained HMM, we can choose the tag sequence that maximizes the product of the transition and generation probabilities. Since the labeling takes place globally for the entire sequence, the model can deal with some of the class ambiguities. Figure 7.2 presents the correct and an incorrect sequence of HMM states (labels) for an ambiguous sequence. Here, the tagging of Fox as a (news) organization influences the following state sequence and results in the tagging of Washington as a (government) organization. In the second labeling, the model collectively labels Fox as a non-NE and Washington as a person.

Fig. 7.2 An ambiguous example with the correct and an incorrect labeling by HMM

In general, the procedure for finding the most likely label (state) sequence is called decoding. Methods such as the Viterbi algorithm, which uses dynamic programming, are commonly used for HMM decoding (Footnote 5).
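A compact Python sketch of Viterbi decoding for a first-order HMM over the boundary labels may help to make the dynamic programming concrete. The probability tables below are hand-picked toy values, not estimates from real data, and the small `floor` probability for unseen events is a simplification assumed here in place of proper smoothing. The End state is left out for brevity.

```python
import math


def viterbi(tokens, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most likely state sequence under a first-order HMM."""
    def log(p):
        # Unseen events get a small floor probability instead of zero.
        return math.log(p if p > 0 else floor)

    # best[i][s]: log-probability of the best path that ends in state s at i
    best = [{s: log(start_p.get(s, 0)) + log(emit_p[s].get(tokens[0], 0))
             for s in states}]
    back = [{}]
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] + log(trans_p[p].get(s, 0))) for p in states),
                key=lambda pair: pair[1],
            )
            best[i][s] = score + log(emit_p[s].get(tokens[i], 0))
            back[i][s] = prev

    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Hypothetical toy parameters.
states = ["B", "I", "O"]
start_p = {"B": 0.6, "I": 0.0, "O": 0.4}
trans_p = {
    "B": {"B": 0.1, "I": 0.4, "O": 0.5},
    "I": {"B": 0.1, "I": 0.3, "O": 0.6},
    "O": {"B": 0.3, "I": 0.0, "O": 0.7},
}
emit_p = {
    "B": {"Lisa": 0.5, "North": 0.2, "Fox": 0.3},
    "I": {"North": 0.8, "Books": 0.2},
    "O": {"worked": 0.5, "criticized": 0.4, "for": 0.1},
}

labels = viterbi(["Lisa", "North", "worked"], states, start_p, trans_p, emit_p)
```

With these toy values the decoder labels Lisa North as a two-token entity followed by a non-entity token. Working in log space avoids numerical underflow on longer sentences.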

In order to train richer NER models, one would like to incorporate deeper linguistic information such as long-distance dependencies and morphological agreement. An HMM assumes that each token depends only on its own label and is otherwise independent of the other tokens. This assumption limits the scope of the contextual information that the NER model can use. Thus, learning features are limited to the current token [16].

In richer discriminative models such as the Maximum Entropy [15], the Perceptron [20] and the Conditional Random Fields (CRF) [41], there is no assumption made about the independence of the words and their class labels. This relaxed framework allows the model to benefit from diverse overlapping (non-independent) features [13, 49]. For example, the model can use different lexicons of foreign names or cultural genres [59]. Moreover, global features which are collected in context beyond the current sentence have also been incorporated into discriminative models [19, 59].
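To illustrate what overlapping features look like in practice, the following Python sketch builds a feature dictionary for one token. The feature names, the three-character affixes and the `gazetteer` argument are illustrative assumptions; they are not the feature sets of the cited systems.

```python
def token_features(tokens, i, gazetteer=frozenset()):
    """Overlapping (non-independent) features for the token at position i."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "in_gazetteer": word in gazetteer,
        # Context features look beyond the current token.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</S>",
    }


features = token_features(["Fox", "criticized", "Washington"], 2, {"Washington"})
```

Several of these features are clearly correlated (for instance the word itself and its suffix), which a generative HMM cannot easily accommodate but a discriminative learner can.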

2.5 Hybrid Systems

Hybrid named entity recognition systems combine two or more systems to reach a collective decision. These systems have shown improvement over their baseline counterparts. The work of [17] in combining statistical and rule-based systems in the MUC competitions as well as the work of [26] in combining different statistical learning algorithms are two successful examples of hybrid NER. In Sect. 7.4 we will discuss two Semitic NER systems that use hybrid frameworks, with different learning algorithms.

2.6 Evaluation and Shared Tasks

Named entity recognition systems are evaluated by running them on human-labeled data and comparing their results against this gold-standard. The comparison is usually at the phrase level, giving full credit for complete boundary and category matches and no credit for partial matches. The commonly used evaluation metrics are the precision and recall which have been borrowed from Information Retrieval evaluation. Recall measures the coverage of the system i.e. the percentage of gold-standard named entities that the system is able to recognize. Precision measures the accuracy, i.e. the percentage of the labeled named entities that agree with the gold standard.

A third measure (\(F_{1}\)) is used to combine these two metrics, as shown in the following:

$$\displaystyle{ \mathit{Precision} = \frac{C} {L}\qquad \mathit{Recall} = \frac{C} {G}\qquad F_{1} = \frac{2 \times \mathit{Precision} \times \mathit{Recall}} {\mathit{Precision} + \mathit{Recall}} }$$

Where:

  • L: Number of labeled named entities

  • G: Number of gold-standard named entities

  • C: Number of correctly labeled named entities
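The phrase-level scoring described above can be sketched in Python as follows. The sketch assumes BIO-tagged input and exact matching of both boundaries and category; the helper names `spans` and `precision_recall_f1` are hypothetical, and a stray I- tag without a preceding B- tag is treated as the start of a new entity, which is one of several possible conventions.

```python
def spans(tags):
    """Extract (start, end, label) entity spans from a BIO tag sequence."""
    found, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final entity
        inside = tag.startswith("I-") and tag[2:] == label
        if start is not None and not inside:
            found.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return found


def precision_recall_f1(gold_tags, predicted_tags):
    gold, predicted = spans(gold_tags), spans(predicted_tags)
    correct = len(gold & predicted)               # C: exact span and class match
    precision = correct / len(predicted) if predicted else 0.0   # C / L
    recall = correct / len(gold) if gold else 0.0                # C / G
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# "Fox criticized Washington ." with Washington mislabeled as a person.
gold = ["B-ORG", "O", "B-ORG", "O"]
predicted = ["B-ORG", "O", "B-PER", "O"]
scores = precision_recall_f1(gold, predicted)
```

In this example one of the two labeled entities matches the gold standard, so precision, recall and \(F_{1}\) are all 0.5; the mislabeled Washington counts both as a spurious entity and as a missed one.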

The \(F_{1}\) measure has been the de facto evaluation and optimization metric for named entity recognition because of its simplicity and generality. However, there have been debates about how informative this metric really is. In an NLP blog note (Footnote 6), Chris Manning compares various types of errors in NER and argues that \(F_{1}\) penalizes some types of errors too much. For example, a perfect boundary recognition with an incorrect categorization receives the same penalty as a total miss of a named entity. Furthermore, Manning shows that optimizing for such an evaluation metric biases the system towards labeling fewer named entities.

2.7 Evaluation Campaigns

Since its introduction, named entity recognition has been a popular subject for group evaluation. There have been three major NER evaluation campaigns as part of NLP conferences. The shared tasks at the 6th and the 7th Message Understanding Conferences (MUC) were the first NER system competitions (Footnote 7), which consisted of extracting entities like person, location and organization names as well as temporal and numeric expressions [66]. The evaluation followed the template-filling framework of Information Extraction (IE) with the standard precision and recall metrics. MUC’s evaluation gives partial credit for cases in which either the boundary of the entity or its class is incorrect.

In 2002 and 2003, the Conference on Natural Language Learning (CoNLL) included a language-independent shared task on named entity recognition. These were important forums for language-independent NER (Footnote 8), where a diverse set of learning techniques and features were explored. The BIO encoding of the NER problem, the addition of the miscellaneous (MISC) class of named entities (Footnote 9) and the exact-match criterion in the evaluations were protocols introduced in the CoNLL shared tasks that have since been followed by many researchers.

The Automatic Content Extraction (ACE) program was a multilingual (Arabic, Chinese and English) program focused on tasks such as named entity recognition and mention detection [23]. The program created a substantial amount of gold-standard data for the three languages. The Arabic corpus is probably one of the most important datasets for Semitic NER. ACE introduced a few new conventions for named entity recognition: in addition to the standard person, location and organization classes, ACE added entity types such as facility, vehicle, weapon and geo-political entity (GPE). Furthermore, ACE used a more comprehensive evaluation framework, which incorporated several kinds of errors into an integrated scoring mechanism. This was intended to address some of the concerns regarding the exact-match criterion of CoNLL.

2.8 Beyond Traditional Named Entity Recognition

In the past decade, the scope of named entity recognition has been extended to new categories and topics. Depending on the topic, there can be various categories of named entities. Works such as [63] constructed extended ontologies of named entity categories. These ontologies are useful for NER in multi-topic texts like Wikipedia or weblogs. Balasuriya et al. [8] highlight the substantial difference between entities appearing in English Wikipedia and those in traditional corpora, and the effects of this difference on NER performance. There is evidence that models trained on Wikipedia data generally perform well on corpora with narrower domains. Nothman et al. [56] and Balasuriya et al. [8] show that NER models trained on both automatically and manually annotated Wikipedia corpora perform reasonably well on news corpora. The reverse does not hold: models trained on news text suffer a major performance drop.

It is no surprise that state-of-the-art news-based NER systems perform less impressively when applied to new topics and domains. Domain and topic diversity of named entities has been studied within the framework of domain adaptation research. In domain adaptation studies, the traditional domain, which for the most part matches the labeled training data, is the source domain, and the novel domain, which usually lacks large amounts of labeled data, is the target domain. One group of methods uses semi-supervised learning frameworks such as self-training and selects the most informative features and training instances to adapt a source-domain learner to a new target domain. Wu et al. [71] bootstrap the NER learner with a subset of unlabeled instances that bridge the source and target domains. Jiang and Zhai [36] as well as [21] make use of some labeled target-domain data, augmenting the feature space of the source model with features specific to the target domain.
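One simple way to picture feature-space augmentation is to give every feature a shared copy and a domain-specific copy, so that a single learner can weight general and domain-specific evidence separately. The Python sketch below illustrates only this general idea; it should not be read as the exact formulation of the cited works, and the function name `augment` and the prefix scheme are assumptions.

```python
def augment(features, domain):
    """Copy each feature into a shared version and a domain-specific version."""
    augmented = {}
    for name, value in features.items():
        augmented[f"general:{name}"] = value    # shared across all domains
        augmented[f"{domain}:{name}"] = value   # active only for this domain
    return augmented


source_instance = augment({"word=fox": 1, "is_capitalized": 1}, "source")
target_instance = augment({"word=fox": 1, "is_capitalized": 1}, "target")
```

Source and target instances then overlap only in their `general:` features, which lets the learner keep what transfers across domains while fitting target-specific behavior from the small amount of labeled target data.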

There is also a body of work on the extraction of named entities from biological and medical text (Footnote 10). In these works, target named entities range from the names of enzymes and proteins in biology texts to symptoms, medicines and diseases in medical records.

3 Named Entity Recognition for Semitic Languages

Named entity recognition inherits many of the general problems of Semitic NLP: complex morphology, the optional nature of short vowels (diacritics) and generally non-standard orthography are well-known problems in the processing of Semitic languages which also affect NER.

Except for Arabic, NER is an under-studied problem for Semitic languages. There is a small to medium amount of labeled data for Arabic and Hebrew NER, and for the rest of the Semitic languages there are almost no resources. In the following sections we review the common challenges and some solutions for Semitic NER, with a special focus on Arabic and Hebrew.

3.1 Challenges in Semitic Named Entity Recognition

There are four main problems involved with Semitic languages which make Semitic NER a challenging task. Table 7.3 illustrates samples for some of these problems in Arabic and Hebrew (Footnote 11).

Absence of capitalization: For English and other Latin-scripted languages, the use of capitalization is a helpful indicator for named entities (Footnote 12). Maltese is the only Semitic language that uses capitalization in a similar fashion. The lack of capitalization in other Semitic languages like Arabic and Hebrew increases the ambiguity in both the recognition and the categorization of named entities.

Table 7.3 Examples of morphological and orthographic challenges in Semitic NER

Optional vowels: Vowels are present to different degrees in Semitic languages. Short vowels (diacritics) are optional in Arabic and Hebrew. In Amharic writing, vowels are mostly present (except in the case of gemination), and Maltese’s Latin script explicitly incorporates vowels. Whenever vowels are optional (as they are in Hebrew and Arabic), ambiguity increases. For example, in Table 7.3 the non-vocalized surface form of the Hebrew word alwn can be interpreted as the verb alun or the person name Alon. Similarly, the Arabic token BrAd might refer to the Arabic noun brrAd (with an optional gemination) or the Western name Brad.

Complex morphology: The concatenative morphology of Semitic languages makes it possible for a named entity to attach to different clitics and form a longer phrase. For example, in Table 7.3 the Arabic entity (Amryky: American) is attached to the Al (definite) proclitic and the yn (plural) suffix and forms a noun phrase (the Americans). In order to recognize and categorize such entities, morphological analysis needs to be performed. Thus, morphological analysis and disambiguation are expected to play an important role in Semitic NER.

Transliteration and diversity of spelling: Multiple transliterations of named entities are a common problem in most languages, including the Semitic family. The non-standard mapping of cross-lingual consonants results in various spellings of phonologically complex names such as Schwarzenegger in Arabic or Hebrew. Moreover, in most Semitic languages we observe some diversity of spelling for both local and foreign names. For example, the first letter of the person name Haylü in Amharic can take multiple forms, which results in six different spellings of the name [65]. Another example is the one-to-many mapping of the “h” or “t” consonants of Roman-script languages to Arabic (Footnote 13).

3.2 Approaches to Semitic Named Entity Recognition

There is an extensive body of work on Arabic named entity recognition, including the creation of gazetteers and labeled datasets as well as statistical and rule-based systems. The system in [64] is an example of a rule-based approach. The approach includes the creation of name lists for named entities and non-entities (white and black lists) along with extraction rules (in the form of regular expressions). The RENAR system [73] is a more recent rule-based approach. It is based on searching gazetteers, followed by a set of hand-crafted grammar rules for extracting out-of-lexicon entities. Finally, the system of [57] is a more recent hybrid approach that combines a rule-based system with various statistical classifiers to extract a large set of named entity classes.

A range of statistical learning algorithms have been applied to Arabic NER: Nezda et al. [55] and Benajiba et al. [11] use Maximum Entropy; Benajiba et al. [12] and Abdul-Hamid and Darwish [1] use Support Vector Machines; and Farber et al. [24] as well as [53] use the Perceptron. A range of lexical, morphological and syntactic features have been used in these statistical systems. The development and distribution of tools such as MADA [30], AMIRA [22] and SAMA [46] led to studies on the role and effects of morphological features in Arabic named entity recognition. Moreover, the English translation information provided by MADA has provided useful bilingual features. For example, Farber et al. [24] use the gloss translations to estimate a capitalization feature for Arabic words. In other studies such as [12], the MADA package has been used extensively to explore different morphological features with different learning frameworks. In the next section we will review the work in [12] as a case study for Semitic NER (Footnote 14).

There are two major published works on Hebrew NER. Lemberski [44] (Footnote 15) uses a Maximum Entropy sequence classifier and a set of lexical and morphological features. Features include the lexeme, the POS tag, several named entity lexicons and information extracted from hand-crafted regular expression patterns. In order to train the system with labeled data, a morphologically tagged corpus was manually annotated with named entity information. The annotation followed the MUC-7 framework on a set of 50 Hebrew news articles. In an extended work, Ben Mordecai and Elhadad [14] use three systems separately and jointly for Hebrew named entity recognition. In the following section we will review this work as a case study for Semitic NER.

As with English, the majority of the systems for Arabic and Hebrew NER are trained and evaluated on news corpora. The named entity categories usually include the traditional person, organization and location classes. Some of the Arabic NER works go beyond the traditional classes and introduce additional classes relevant to the domain. Shaalan and Raza [64] extract ten named entity classes related to the business news domain. Some of the numeric classes are non-conventional (e.g. phone number) and contributed to the development of a new labeled dataset for evaluation. The system in [55] uses an extensive annotation of text from the Arabic Treebank with 18 classes of named entities. The categories include several quantitative and temporal classes such as money and time.

Arabic Wikipedia has been the test-bed for a few recent studies on named entity recognition. Mohit et al. [53] demonstrate that traditional named entity classes are insufficient for a multi-topic corpus like Wikipedia. They use a relaxed annotation framework in which article-specific classes are considered and labeled. For example, for an article about the atom, annotators introduced and labeled particle names (e.g. electron, proton). Furthermore, Mohit et al. [53] develop an NER system which recognizes (but does not categorize) their extended set of named entity classes for Arabic Wikipedia. Extended classes of named entities have also been used as a taxonomy for Arabic Wikipedia. Alotaibi and Lee [4] use a supervised classification framework to assign Wikipedia articles to one of their eight coarse-grained named entity classes.

Semitic NER has been studied as part of other relevant tasks. For example, Kirschenbaum and Wintner [40] locate named entities for the purpose of translating them from Hebrew to English. We will review these works in Sect. 7.5 along with other works relevant to Semitic NER.

4 Case Studies

In this section we review the work of Benajiba et al. [12] and of Ben Mordecai and Elhadad [14] as case studies in Arabic and Hebrew named entity recognition, respectively. The two works share a common approach to Semitic NER: exploring different learning algorithms and feature sets, as well as lexicon construction, to achieve optimal performance. Benajiba et al. [12] aim at finding the optimal feature set for different classes of Arabic named entities. Ben Mordecai and Elhadad [14] include a brief analysis of effective features, but mainly focus on combining different learning methods for optimizing Hebrew NER. In the following, we review different aspects of these two works.

4.1 Learning Algorithms

The system in [12] is an empirical framework to study the effects of different features on Arabic NER. It uses two discriminative learners (support vector machines and conditional random fields) to construct classifiers for each named entity class. Thus, there are classifiers for the person class, location class, etc. that label the named entity boundaries. After the initial per-class labeling, a collective NER classification takes place with a voting mechanism.

Ben Mordecai and Elhadad [14] explore a baseline rule-based system made of regular expressions and two statistical classifiers (Hidden Markov Model and Maximum Entropy). After trying different HMM schemes, they chose a structure where each state is made of a named entity class joined with the POS tag. Moreover, the HMM states emit a feature representation of the words. By such joint inclusion of the class label and the POS tag, they incorporate some structural knowledge into their model. In contrast, their standard maximum entropy model of NER is not constrained and freely uses features independent of each other.

4.2 Features

Feature selection is an important component of these two case studies and of most other Arabic and Hebrew NER studies. As discussed earlier, NER is a heavily lexicalized task and models rely strongly on lexical and contextual features. A standard set of contextual features, such as the preceding and following tokens and morphemes, is inherited from the English systems. Furthermore, the morphological complexities of Semitic languages require explicit inclusion of morphological features in the models. In Arabic, for example, gender or number agreement between adjacent proper nouns is an important hint for finding the span of a named entity. In the absence of robust morphological and syntactic analyzers (e.g. in Hebrew systems), models benefit from shallow structural and morphological features such as affixes or the token’s position in the sentence.

Table 7.4 compares the features used in our two case studies [12, 14]. The feature set used in the Arabic system includes lexical, contextual and morphological features as well as features from named entity lexicons built from resources like Wikipedia. Most of the morphological features are extracted using the Arabic MADA toolkit. The effectiveness of the features has been estimated for each of the named entity classes. Some of these features tend to contribute to most named entity classes (e.g. the morphological aspect or English capitalization). However, because each class has its own classifier and feature analysis, there is not always a strong consensus about the general effectiveness of a certain feature.

Table 7.4 Features in the Arabic [12] and the Hebrew [14] systems

The feature set in [14] comprises morphological, structural, lexical and contextual features. The morphological features involve little Hebrew-specific analysis and are limited to POS tags, affixes and the lemma. However, a set of regular expressions and structural features gives the model some language-specific flavor. Furthermore, the gazetteer features use a few lexicons that hold a comprehensive list of frequent nouns and expressions, as well as geographical and organizational lists.

4.3 Experiments

Both studies use system combination algorithms. However, the combination is aimed at different goals. For Benajiba et al. [12], each entity class has a separate classifier and feature set. The feature-based ranking framework (Fuzzy Borda Voting Scheme) is a mechanism to combine these different classifiers into one final classifier. There is an average improvement of 2% in the F1 score after reaching the optimum feature set of classifier voting. The support vector machine classifier outperforms the others for the majority of classes and datasets, while lexical features contribute the most in most experiments.
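To give a feel for rank-based voting, the sketch below implements the classical Borda count. This is a deliberate simplification: it is not the fuzzy variant used in [12], and the rankings, labels and function names are invented here for illustration. Each voter ranks the candidates, a candidate in position r of a ranking of n items receives n - 1 - r points, and the candidate with the highest total wins.

```python
from collections import defaultdict

def borda_count(rankings):
    """Sum classical Borda points over all rankings (best candidate listed first)."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for rank, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - rank
    return dict(scores)

def borda_winner(rankings):
    """Return the top-scoring candidate; ties are broken alphabetically."""
    scores = borda_count(rankings)
    return max(sorted(scores), key=scores.get)

# Three hypothetical voters ranking the same three labels.
rankings = [["PER", "ORG", "O"], ["ORG", "PER", "O"], ["PER", "O", "ORG"]]
scores = borda_count(rankings)
winner = borda_winner(rankings)
```

In a fuzzy scheme the voters would express graded preferences between candidates instead of strict ranks; the aggregation by summed scores follows the same general idea.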

System combination in [14] is based on a simple recall-oriented heuristic: take the output of the best individual system (maximum entropy) and use the other two taggers as back-off. Finally, the empirical experiments show that dictionary features, along with the POS tag, tend to contribute the most.
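One possible reading of this back-off heuristic is sketched here. The token-level granularity, the "O" outside label and the order of the back-off taggers are assumptions; the source does not specify these details. The primary tagger's label is kept whenever it proposes an entity, and otherwise the first back-off tagger that proposes one is used, which can only add entity labels and therefore favours recall.

```python
def backoff_combine(primary, *backoffs, outside="O"):
    """Combine per-token label sequences, preferring the primary tagger."""
    combined = []
    for i, label in enumerate(primary):
        if label == outside:
            # The primary tagger found nothing here: consult the back-off taggers in order.
            for tags in backoffs:
                if tags[i] != outside:
                    label = tags[i]
                    break
        combined.append(label)
    return combined

# Hypothetical outputs of three taggers for a four-token sentence.
maxent_tags = ["PER", "O", "O", "O"]
hmm_tags = ["PER", "PER", "O", "O"]
regex_tags = ["O", "ORG", "O", "LOC"]
combined = backoff_combine(maxent_tags, hmm_tags, regex_tags)
```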

To summarize, the Arabic system in [12] and the Hebrew system in [14] are successful examples of Semitic NER using a hybrid mixture of supervised learners. Both systems explore language-specific aspects of the problem, but in different ways: Ben Mordecai and Elhadad [14] use language-specific regular expressions to locate potential entities, whereas Benajiba et al. [12] explicitly incorporate linguistic knowledge (e.g. Arabic morphology) as features into their hybrid learning framework.

5 Relevant Problems

The importance of named entities for multilingual applications such as machine translation and cross-language information retrieval has led researchers to focus on a few other problems which overlap with NER. Here we give a brief overview of three such problems for which Semitic languages (Arabic and Hebrew) have been studied.

5.1 Named Entity Translation and Transliteration

Multilingual named entity information is useful for applications such as cross-language information retrieval or machine translation. For example, Hermjakob et al. [32] have shown that inclusion of transliteration information improves machine translation quality. Also, Babych and Hartley [7] showed that incorporating bilingual named entity information generally improves machine translation quality.

Named entities are usually either translated or transliterated across languages. Compound named entities which are composed of simple nominals (as opposed to proper nouns) might be translated across languages. For example, an organizational entity like The State Department usually gets translated. In contrast, named entities composed of proper nouns such as IBM or Adidas usually get transliterated across languages. Table 7.5 presents examples of translation and transliteration for Arabic and Hebrew named entities.

There is a body of work on translation and transliteration of named entities for Arabic and Hebrew. Al-Onaizan and Knight [3] address the named entity translation problem. Their approach has two stages: a baseline translation and transliteration of the named entities, followed by filtering based on the target language corpus. The underlying assumption concerns the occurrence of named entities in international news: names which are important and frequent in the source language (Arabic) are also frequent in the target language (English).
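The filtering idea can be sketched as a simple frequency check against target-language text. This is only an illustration of the assumption described above, not the method of [3]: the candidate spellings, the toy English corpus, the single-token restriction and the `min_count` threshold are all invented here.

```python
from collections import Counter

def filter_candidates(candidates, target_tokens, min_count=1):
    """Keep candidates seen in the target corpus, most frequent first."""
    counts = Counter(token.lower() for token in target_tokens)
    kept = [(c, counts[c.lower()]) for c in candidates if counts[c.lower()] >= min_count]
    # sorted() is stable, so equally frequent candidates keep their original order.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical candidate spellings produced by a first-stage module.
candidates = ["Mohamed", "Muhammad", "Mhmd"]
corpus = "Muhammad spoke to Mohamed and Muhammad left".split()
ranked = filter_candidates(candidates, corpus)
```

Candidates that never occur in the target corpus are discarded, which reflects the assumption that important names are frequent in both languages.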

An important decision for a multilingual system (e.g. machine translation) is whether to translate or transliterate a given source language named entity. Hermjakob et al. [32] address this problem using a supervised classification approach. They use a parallel corpus of phrases which includes bilingual transliterated name pairs. The Arabic side of the transliterated bitext is used to train a classifier which highlights the words of a monolingual (Arabic) text that can be transliterated. Similar classification frameworks have also been examined for the translation vs. transliteration decision in Hebrew [28, 40].

Machine transliteration deals with named entities that are translated with preserved pronunciation [38]. There are specific challenges in Arabic and Hebrew orthography and phonetics which add to the transliteration challenge. These include the optional nature of vowels, the absence of certain sounds (e.g. p in Arabic), and zero-or-many mappings of certain sounds to Latin-based letters (e.g. multiple h in Arabic or khaf in Hebrew). An earlier approach to the problem is described in [2], which is a hybrid combination of phonetic-based and spelling-based models. The extracted transliterations are post-processed by a target language (English) spell checker. There are also transliteration studies which do not involve transliterating the term from scratch. In [32], the transliteration candidates are extracted from a bilingual phrase corpus and the transliteration problem is practically converted to a search problem. There, the system uses a scoring function to filter out noisy transliterations using a large English corpus (Footnote 16). In a related framework, the work of Azab et al. [6] aims at automating the English to Arabic translation vs. transliteration decision and reducing the out-of-vocabulary terms of the MT system. They model the decision as a binary classification problem and later use their classifier within an SMT pipeline to direct a subset of source language named entities to a transliteration module.

Table 7.5 Translation vs. transliteration of named entities in Arabic and Hebrew

For Hebrew, Goldberg and Elhadad [28] identify borrowed and transliterated words. Their decision is binary: a word is generated either by a Hebrew language model or by a foreign language model. They train a generative classifier using a noisy list of borrowed words along with regular Hebrew text. The work of Kirschenbaum and Wintner [40] is also an effort to locate and transliterate the appropriate Hebrew terms. Their framework is a single-class classifier which locates entities that are supposed to be transliterated.
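The binary generative decision can be illustrated with two character bigram language models. The sketch is a toy under stated assumptions and not the model of [28]: it uses add-one smoothing, equal class priors, Latin-script stand-ins for Hebrew words, and tiny invented training lists. A word is assigned to whichever model gives it the higher log-probability.

```python
import math
import string
from collections import Counter

class CharBigramLM:
    """Character bigram model with add-one smoothing (illustrative only)."""

    def __init__(self, words, alphabet=string.ascii_lowercase):
        # Possible next symbols: every letter plus the end-of-word marker "$".
        self.num_symbols = len(alphabet) + 1
        self.bigrams = Counter()
        self.contexts = Counter()
        for word in words:
            padded = "^" + word + "$"
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, word):
        padded = "^" + word + "$"
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.contexts[prev] + self.num_symbols
            total += math.log(numerator / denominator)
        return total

def classify(word, native_lm, foreign_lm):
    """Pick the model under which the word is more likely (equal priors assumed)."""
    if native_lm.log_prob(word) >= foreign_lm.log_prob(word):
        return "native"
    return "foreign"

# Tiny hypothetical training lists in Latin transliteration.
native_lm = CharBigramLM(["shalom", "shamar", "shalosh", "melekh"])
foreign_lm = CharBigramLM(["telefon", "internet", "televizia", "elektroni"])
```

With realistic data the two models would be trained on large word lists, and a class prior could be added to the comparison.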

5.2 Entity Detection and Tracking

Mention detection is a subtask of information extraction which is focused on the identification of entities and the tracking of their associations to each other. Mentions can be named entities, nominals, or pronominals. Table 7.6 presents an Arabic example of entity detection, along with gloss and literal translations. A detection system is expected to highlight and link the two bold segments of the Arabic example. Entity detection is usually modeled as a sequence classification task where each token in a sentence gets assigned to an entity within the sentence. As in NER, there are tokens which are independent of entities and get an O label. The detection part of the task is similar to NER. The tracking part might involve a separate linking model and coreference decoding.

Table 7.6 An Arabic example of entity detection and tracking with gloss and literal translations

Arabic mention detection was one of the tasks introduced in the ACE program. Florian et al. [27] presented a multilingual system which included an Arabic mention detection component. Their system uses two Maximum Entropy models, one for detection and the other for tracking. The tracking component is a binary linking model where each token either gets linked to another entity or starts a new entity. Also, there have been two recent studies on the effects of morphology and syntactic analysis on Arabic mention detection [9, 10], in which richer Arabic linguistic knowledge boosted the performance.
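The link-or-start-new decision can be sketched as a greedy left-to-right procedure. This is not the Maximum Entropy model of [27]: the scoring function here is a hypothetical head-word match standing in for a trained linking model, and the threshold and example mentions are likewise assumptions.

```python
def track_entities(mentions, link_score, threshold=0.5):
    """Greedily attach each mention to the best-scoring entity or start a new one."""
    entities = []
    for mention in mentions:
        best_index, best_score = None, threshold
        for index, entity in enumerate(entities):
            # Score the mention against the most compatible earlier mention of the entity.
            score = max(link_score(mention, previous) for previous in entity)
            if score >= best_score:
                best_index, best_score = index, score
        if best_index is None:
            entities.append([mention])  # no entity is good enough: start a new one
        else:
            entities[best_index].append(mention)
    return entities

def head_match(mention_a, mention_b):
    """Toy stand-in for a trained linking model: 1.0 if the last words match."""
    same_head = mention_a.split()[-1].lower() == mention_b.split()[-1].lower()
    return 1.0 if same_head else 0.0

chains = track_entities(["Lisa North", "Pegasus Books", "North"], head_match)
```

The toy scorer wrongly merges any two mentions that share a final word, which is one reason a trained model with richer features is used in practice.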

5.3 Projection

The availability of parallel corpora, automatic word alignment and translation systems has resulted in a body of work on resource projection [72]. In a projection framework we use a word-aligned corpus to project some linguistic information (e.g. named entity boundaries) from one language (e.g. English) to another language (e.g. Hebrew). This has been a useful framework for equipping resource-poor languages with some labeled data. Projection is not always a deterministic operation, and cross-lingual differences can make it a challenging task. Figure 7.3 demonstrates an example of named entity projection from English to Hebrew. It can be seen that the morphological richness of Hebrew does not allow a 1-1 entity mapping across the two languages. Thus morphological analysis and segmentation should be considered as part of a projection pipeline.

Fig. 7.3 An NER projection example from English to Hebrew
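Label projection through word alignments can be sketched in a few lines. The alignment format (pairs of source and target token indices), the label names and the toy example are assumptions made for illustration. The example mimics a case where an English preposition and a location name both align to a single unsegmented target token, so the projected label covers more than the name itself.

```python
def project_labels(source_labels, alignment, target_length, outside="O"):
    """Copy entity labels from source tokens to their aligned target tokens."""
    target_labels = [outside] * target_length
    for source_index, target_index in alignment:
        label = source_labels[source_index]
        if label != outside:
            # An entity label is never overwritten by an outside label.
            target_labels[target_index] = label
    return target_labels

# English "in Berkeley" aligned to one hypothetical unsegmented target token.
source_labels = ["O", "LOC"]
alignment = [(0, 0), (1, 0)]
projected = project_labels(source_labels, alignment, target_length=1)
```

Segmenting the target token into its prefix and stem before alignment would let the projection label only the stem, which is why segmentation belongs in the pipeline.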

There have been some successful attempts at the projection of entity information for Arabic. Hassan et al. [31] extract bilingual named entity pairs from parallel and comparable corpora using similarity metrics that use phonetic and translation model information. Zitouni and Florian [74] study the use of projection (through English to Arabic machine translation) to improve Arabic mention detection. Benajiba and Zitouni [10] directly project the mention detection information using automatic word alignments. The projected Arabic corpus provides new features which augment and improve the baseline Arabic mention detection system. Huang et al. [34] study the problem of finding the various English spellings of Arabic names, which affects machine translation and information extraction systems. They use a projection framework to locate the various spellings of a given Arabic name.

6 Labeled Named Entity Recognition Corpora

As with the research itself, data resources for Semitic NER have been limited to Arabic and Hebrew. The Automatic Content Extraction (ACE) program is a multilingual information extraction effort focused on Arabic, Chinese and English. Over the past decade, Arabic has been one of the focus languages of the Entity Detection and Tracking (EDT) task of ACE. As a result, ACE has prepared a few standard Arabic corpora with named entity information [70]. These corpora are primarily in the newswire domain, with recent additions of weblogs and broadcast news text. The named entity categories are targeted towards political news. They include Person, Location, Organization, Facility, Weapon, Vehicle and Geo-Political Entity (GPE). The Arabic named entity annotations are performed with character-level information, which boosts the accuracy of the data for morphologically compound tokens (Footnote 17). ACE has been releasing most of its datasets through the Linguistic Data Consortium (LDC).

In addition to the standard ACE datasets, a few projects have resulted in the annotation of new NER datasets. The Ontonotes project [33] is an ongoing large-scale multilingual annotation effort with several layers of linguistic information on texts collected from a diverse set of domains. Arabic Ontonotes includes annotation of parsing, word senses, coreferences and named entities (Footnote 18). The publicly released (Footnote 19) Arabic ANER corpus [11] is a token-level annotated newswire corpus with four named entity classes: person, location, organization and miscellaneous. Mohit et al. [53] have also released a corpus of Arabic Wikipedia articles with an extended set of named entity categories. Finally, Attia et al. [5] created a large-scale lexicon of Arabic named entities from resources such as Wikipedia (Footnote 20).

Named entity annotation for Hebrew has been limited to a few projects that we discussed earlier. Hebrew corpus annotation of named entities is reported in [14, 44]. Furthermore, the annotated corpus in [35] includes a layer of named entity information.

7 Future Challenges and Opportunities

Named entity recognition is still far from a solved problem for Semitic languages. Amharic, Syriac and Maltese lack the basic data resources for building a system. The F1 performance of the best Arabic and Hebrew systems varies between 60% and 80% depending on the text genre. Most of the available labeled datasets are newswire corpora, so NER performance might degrade in other topics and domains.

There are many interesting open questions to be explored. For low-resource languages like Amharic or Syriac, well-established frameworks such as active learning or projection can be explored to create the basic data resources and estimate basic models. Online resources such as Wikipedia can also provide basic named entity corpora and lexicons (Footnote 21).

For medium-resource languages like Arabic and Hebrew, NER needs to be tested in new topics and genres with extended named entity classes. To do so, semi-supervised learning frameworks along with domain adaptation methods are the natural starting points. Morphological information plays an important role in Semitic NER. Thus, richer incorporation of morphology in NER models in the form of joint modeling is an interesting avenue to explore. Moreover, richer linguistic information such as constituency and dependency parsing, and semantic resources such as WordNet and Ontonotes, are expected to enrich NER models.

8 Summary

We reviewed named entity recognition (NER) as an important task for processing Semitic languages. We first sketched an overview of NER research, its history and the current state of the art. We followed with problems specific to Semitic NER and reviewed a wide range of approaches for Arabic and Hebrew NER. We observed that complex morphology and the lack of capitalization create additional challenges for Semitic NER. We focused on two case studies for Arabic and Hebrew and reviewed their learning frameworks and features. Moreover, we explored the state of data resources and research on relevant tasks such as named entity translation, transliteration and projection for Hebrew and Arabic. We concluded that Semitic NER is still an open problem. For low-resource languages such as Amharic and Syriac, basic data resources are still needed for constructing baseline systems. For Arabic and Hebrew, inclusion of richer linguistic information (e.g. dependency parsing) and adaptation of the current systems to new text domains are interesting avenues to explore.