1 Introduction

Chunking is a natural language processing (NLP) task that divides a text into syntactically correlated, non-overlapping and non-exhaustive groups of words, i.e., a word can only be a member of one chunk and not all words belong to a chunk (Tjong et al. 2000). Chunking is widely used as an intermediate step to parsing with the purpose of improving the performance of the parser. It also helps to identify non-overlapping phrases from a stream of text, which are further used in the development of different NLP applications such as information retrieval, information extraction, named entity recognition, question answering, text mining and text summarization. These NLP tasks consist of recognizing some type of structure that represents linguistic elements of the analysis and their relations.

The main goal of chunking is to divide a text into segments which correspond to certain syntactic units such as noun phrases, verb phrases and prepositional phrases. Abney (1991) introduced the concept of a chunk as an intermediate step providing input to further full parsing stages. Thus, chunking can be seen as a basic task underlying full parsing. Although the detailed information of a full parse is lost, chunking is a valuable process in its own right when the entire grammatical structure produced by a full parse is not required. For example, various studies indicate that the information obtained by chunking or partial parsing, rather than full parsing, is sufficient for information retrieval systems (Yangarber and Grishman 1998). In addition, partial syntactic information can help to solve many NLP tasks, such as text summarization, machine translation and spoken language understanding (Molina and Pla 2002). For example, Kutlu (2010) stated that finding noun phrases and verb phrases is enough for information retrieval systems. Phrases that give us information about agents, times, places, objects, etc. are more significant for question answering, information extraction, text mining and automatic summarization than the complete configurational syntactic analysis of a sentence.

Unlike full parsers, chunkers do not necessarily assign every word in a sentence to a higher-level constituent. They identify simple phrases but do not require that the sentence be represented by a single structure, whereas full parsers attempt to discover a single structure which incorporates every word in the sentence. Abney (1995) proposed to divide sentences into labeled, non-overlapping sequences of words based on superficial analysis and local information. In general, many NLP applications require syntactic analysis at various levels, including full parsing and chunking. The chunking level identifies all possible phrases, while full parsing analyzes the phrase structure of a sentence. The choice of syntactic analysis level depends on the speed and accuracy requirements of an application; chunking is more efficient and faster in terms of processing than full parsing (Thao et al. 2009). Chunkers can identify syntactic chunks at different levels of the parser, so a group of chunkers can build a complete parser (Abney 1995). Most of the parsers developed for languages like English and German use chunkers as components. Brants (1999) used a cascade of Markov model chunkers to obtain parsing results for the German NEGRA corpus. Today, there are many chunking systems developed for various languages such as Turkish (Kutlu 2010), Vietnamese (Thao et al. 2009), Chinese (Xu et al. 2006) and Urdu (Ali and Hussain 2010).

Although Amharic is the working language of Ethiopia, a country with a population of about 90 million at present, it is still one of the less-resourced languages, with few linguistic tools available for Amharic text processing. This work is aimed at developing an Amharic base phrase chunker that generates base phrases. The remaining part of this paper is organized as follows. Section 2 presents the Amharic language with emphasis on its phrase structure. Amharic base phrase chunking, along with error pruning, is discussed in Sect. 3. In Sect. 4, we present experimental results. Conclusions and future work are highlighted in Sect. 5. References are provided at the end.

2 Linguistic Structures of Amharic

2.1 Amharic Language

Amharic is the working language of Ethiopia. Although many languages are spoken in Ethiopia, Amharic is the lingua franca of the country and the most commonly learned second language throughout the country (Lewis et al. 2013). It is also the second most widely spoken Semitic language in the world, next to Arabic. Amharic is written in the Ethiopic script, which has 33 basic characters (consonants), and from each base character six further characters representing consonant-vowel combinations are derived. The base character carries the vowel ä, and the six derived characters carry the remaining vowels in a fixed order (u, i, a, e, ə, o).

2.2 Phrasal Categories

Phrases are syntactic structures that consist of one or more words but lack the subject-predicate organization of a clause. A phrase is composed of either a head word alone or a head word combined with other words or phrases. The words or phrases combined with the head in phrase construction can be specifiers, modifiers and complements. Yimam (2000) classified Amharic words into five classes: nouns, verbs, adverbs, adjectives and prepositions. In line with this classification, Yimam (2000) and Amare (2010) classified the phrase structures of the Amharic language as noun phrases, verb phrases, adjectival phrases, adverbial phrases and prepositional phrases.

Noun Phrase.

An Amharic noun phrase (NP) is a phrase that has a noun as its head. In this phrase construction, the head is always found at the end of the phrase. An NP can be made from a single noun or from a noun combined with other word classes, including other nouns. Examples are: (qäläbät/ring), (yä’almaz qäläbät/diamond ring), (tĭlĭq yä’almaz qäläbät/big diamond ring), (ya tĭlĭq yä’almaz qäläbät/that big diamond ring), etc.

Verb Phrase.

An Amharic verb phrase (VP) is constructed with a verb as its head, found at the end of the phrase, together with other constituents such as complements, modifiers and specifiers. Not all verbs take the same category of complement: verbs are divided into two classes, transitive and intransitive, where transitive verbs take noun phrases as their complements and intransitive verbs do not. Examples are: (lĭkolatal/[he] sent [her] [something]), (gänzäb lĭkolatal/[he] sent [her] money), (lämeri gänzäb lĭkolatal/[he] sent money to Mary), (bäbank lämeri gänzäb lĭkolatal/[he] sent money to Mary via bank), etc.

Adjectival Phrase.

An Amharic adjectival phrase (AdjP) is constructed with an adjective as its head word and other constituents such as complements, modifiers and specifiers. The head word is placed at the end. Examples are: (gobäz/clever), (bäţam gobäz/very clever), (ĭndä wändĭmu bäţam gobäz/very clever like his brother), etc.

Prepositional Phrase.

An Amharic prepositional phrase (PP) is made up of a preposition head and other constituents such as nouns, noun phrases and prepositional phrases. Unlike the heads of other phrase types, a preposition alone cannot form a phrase; it must be combined with other constituents, which may come either before or after the preposition. If the complements are nouns or NPs, the preposition precedes them, whereas if the complements are PPs, the preposition shifts to the end of the phrase. Examples are: (ĭndä lĭj/like a child), (käwänzu aţägäb/close to the river), etc.

Adverbial Phrases.

Amharic adverbial phrases (AdvP) are made up of one adverb as the head word and one or more other lexical categories, including other adverbs, as modifiers. The head of the AdvP is placed at the end. Unlike other phrases, AdvPs do not take complements. Most of the time, the modifiers of AdvPs are PPs, which always come before the adverb. Examples are: (kĭfuña/severely), (bäţam kĭfuña/very severely), (ĭndä wändĭmu bäţam kĭfuña/very severely like his brother), etc.

2.3 Sentence Formation

The Amharic language follows a subject-object-verb (SOV) grammatical pattern, unlike, for example, English, which follows a subject-verb-object (SVO) order (Yimam 2000; Amare 2010). For instance, the Amharic equivalent of the sentence “John killed the lion” is written as (jon/John) (anbäsawn/the lion) (gädäläw/killed). Amharic sentences can be constructed from a simple or complex NP and a simple or complex VP. Simple sentences are constructed from a simple NP followed by a simple VP which contains only a single verb. The following examples show the various structures of simple sentences.

Complex sentences are sentences that contain at least one complex NP, one complex VP, or both. Complex NPs are phrases that contain at least one embedded sentence in the phrase construction; the embedded sentence can serve as a complement. The following examples show the various structures of complex Amharic sentences.

3 Base Phrase Chunking

3.1 Chunk Representation

Chunk tags can denote noun phrases, verb phrases, adjectival phrases, etc., in line with the construction rules of the language. There are many decisions to be made about where the boundaries of a group should lie and, as a consequence, there are many different ‘styles’ of chunking, with different types of chunk tags and chunk boundary identification. In order to identify the boundaries of each chunk in a sentence, the following boundary types are used (Ramshaw and Marcus 1995): IOB1, IOB2, IOE1, IOE2, IO, “[”, and “]”. The first four are complete chunk representations which can identify both the beginning and the end of phrases, while the last three are partial chunk representations. All boundary types use an “I” tag for words that are inside a phrase and an “O” tag for words that are outside a phrase; they differ in their treatment of chunk-initial and chunk-final words.

IOB1: the first word inside a phrase immediately following another phrase receives a B tag.

IOB2: all phrase-initial words receive a B tag.

IOE1: the final word inside a phrase immediately preceding another phrase of the same type receives an E tag.

IOE2: all phrase-final words receive an E tag.

IO: words inside a phrase receive an I tag; all other words receive an O tag.

“[”: all phrase-initial words receive a “[” tag; other words receive a “.” tag.

“]”: all phrase-final words receive a “]” tag; other words receive a “.” tag.

An example of chunk representation for the sentence (hulätu lĭjoc bätĭlĭq mäkina wäda gojam hedu / The two children went to Gojjam by a big car) is shown in Table 1.

Table 1. Chunk representation for the sentence

In this work, we considered six different kinds of chunks, namely noun phrase (NP), verb phrase (VP), adjectival phrase (AdjP), adverbial phrase (AdvP), prepositional phrase (PP) and sentence (S). To identify the chunks, it is necessary to find the positions where a chunk can end and a new chunk can begin. The part-of-speech (POS) tag assigned to every token is used to discover these positions. We used the IOB2 tag set to identify the boundaries of each chunk in sentences extracted from chunk-tagged text. Using the IOB2 tag set along with the chunk types considered, a total of 13 phrase tags were used in this work: B-NP, I-NP, B-VP, I-VP, B-PP, I-PP, B-ADJP, I-ADJP, B-ADVP, I-ADVP, B-S, I-S and O. The following are examples of chunk-tagged sentences.
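To make the IOB2 representation concrete, the following minimal sketch (our own illustration, not part of the original system) decodes a sequence of IOB2 chunk tags back into labeled phrases. The tags assigned to the example sentence from Table 1 are a plausible annotation supplied here for illustration, not the paper's own.

```python
def iob2_to_chunks(tokens, tags):
    """Group (token, IOB2 tag) pairs into labeled chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # a new chunk begins here
            if current:
                chunks.append((label, current))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            current.append(token)                # continue the open chunk
        else:                                    # an O (or stray) tag closes any open chunk
            if current:
                chunks.append((label, current))
            current, label = [], None
    if current:
        chunks.append((label, current))
    return chunks

# hulätu lĭjoc bätĭlĭq mäkina wäda gojam hedu
# "The two children went to Gojjam by a big car"
tokens = ["hulätu", "lĭjoc", "bätĭlĭq", "mäkina", "wäda", "gojam", "hedu"]
tags = ["B-NP", "I-NP", "B-PP", "I-PP", "B-PP", "I-PP", "B-VP"]
print(iob2_to_chunks(tokens, tags))
# [('NP', ['hulätu', 'lĭjoc']), ('PP', ['bätĭlĭq', 'mäkina']),
#  ('PP', ['wäda', 'gojam']), ('VP', ['hedu'])]
```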

3.2 Architecture of the Chunker

To implement the chunker component, we used a hidden Markov model (HMM) enhanced by a set of rules to prune errors. The HMM part has two phases: the training phase and the testing phase. In the training phase, the system first accepts words with POS tags and chunk tags, and the HMM is trained with this training set. In the test phase, the system accepts words with POS tags and outputs an appropriate chunk tag for each POS tag using the HMM model. Figure 1 illustrates the workflow of the chunking process.

Fig. 1. Workflow of the chunking process.
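As a concrete illustration of the training phase, the sketch below estimates the HMM parameters by maximum-likelihood counting over chunk-tagged training data. It is our own minimal reconstruction under standard HMM assumptions, not the authors' implementation; in particular, smoothing of unseen events is omitted.

```python
from collections import defaultdict

def train_hmm(corpus):
    """corpus: list of sentences, each a list of (pos_tag, chunk_tag) pairs."""
    trans = defaultdict(lambda: defaultdict(int))  # chunk tag -> next chunk tag counts
    emit = defaultdict(lambda: defaultdict(int))   # chunk tag -> POS tag counts
    start = defaultdict(int)                       # sentence-initial chunk tag counts
    for sent in corpus:
        prev = None
        for pos, chunk in sent:
            emit[chunk][pos] += 1
            if prev is None:
                start[chunk] += 1
            else:
                trans[prev][chunk] += 1
            prev = chunk

    def normalize(counts):                         # counts -> relative frequencies
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}

    return (normalize(start),
            {c: normalize(d) for c, d in trans.items()},
            {c: normalize(d) for c, d in emit.items()})
```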

In this work, chunking is treated as a tagging problem. We use a POS-tagged sentence as input, from which we observe a sequence of POS tags denoted T. We hypothesize that the corresponding sequence of chunk tags is hidden and has Markovian properties. Thus, we used a hidden Markov model (HMM) in which the chunk tags serve as hidden states and the POS tags as observations. The HMM is trained with sequences of POS tags and chunk tags extracted from the training corpus and is then used to predict the sequence of chunk tags C for a given sequence of POS tags T. This problem corresponds to finding the C that maximizes the probability P(C|T), which is formulated as:

$$ C^{\prime} = \mathop{\arg\max}\limits_{C} \, P\left( C \mid T \right) $$
(1)

where Cʹ is the optimal chunk tag sequence. Applying Bayes’ rule to Eq. (1) and dropping the term P(T), which is constant with respect to C, yields:

$$ C^{\prime} = \mathop{\arg\max}\limits_{C} \, P\left( T \mid C \right) \, P\left( C \right) $$
(2)

which is in fact a decoding problem, solved by making use of the Viterbi algorithm. The output of the decoder is the sequence of chunk tags which groups words based on syntactic correlation. The output chunk sequence is then analyzed to improve the result by applying linguistic rules derived from the grammar of Amharic. For a given Amharic word w, linguistic rules (sample rules are shown in Algorithm 1) were used to correct wrongly chunked words (“w−1” and “w+1” denote the previous and next word, respectively).
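To make the decoding step of Eq. (2) concrete, the following sketch applies the Viterbi algorithm over the probability tables produced by the training sketch above. It is an illustrative reconstruction, not the authors' code, and the rule-based pruning of Algorithm 1 is not shown.

```python
import math

def viterbi(pos_seq, start, trans, emit, states):
    """Return the chunk tag sequence C maximizing P(T|C) * P(C)."""
    def lp(table, key):                        # log-probability; unseen events get -inf
        p = table.get(key, 0.0)
        return math.log(p) if p > 0 else float("-inf")

    # initialization with the first POS tag
    v = [{c: lp(start, c) + lp(emit.get(c, {}), pos_seq[0]) for c in states}]
    back = []
    for pos in pos_seq[1:]:                    # recursion over the remaining POS tags
        scores, ptr = {}, {}
        for c in states:
            prev = max(states, key=lambda s: v[-1][s] + lp(trans.get(s, {}), c))
            scores[c] = (v[-1][prev] + lp(trans.get(prev, {}), c)
                         + lp(emit.get(c, {}), pos))
            ptr[c] = prev
        v.append(scores)
        back.append(ptr)

    # follow the back-pointers from the best final state
    best = max(states, key=lambda c: v[-1][c])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Given the tables returned by `train_hmm`, a call such as `viterbi(pos_seq, start, trans, emit, states=list(emit))` returns the predicted chunk tag sequence for an input POS tag sequence.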

4 Experiment

4.1 The Corpus

The major source of the dataset we used for training and testing the system was the Walta Information Center (WIC) news corpus, which is at present widely used for research on Amharic natural language processing. The corpus contains 8067 sentences in which words are annotated with POS tags. Furthermore, we collected additional text from an Amharic grammar book authored by Yimam (2000). The sentences in the corpus were split into training and testing sets using a 10-fold cross-validation technique.

4.2 Test Results

In 10-fold cross-validation, the original sample is randomly partitioned into 10 equal-sized subsamples. Of the 10 subsamples, a single subsample is retained as the validation data for testing the model, and the remaining 9 subsamples are used as training data. The cross-validation process is then repeated 10 times, with each of the 10 subsamples used exactly once as the validation data. Accordingly, we obtain 10 results, one per fold, which are averaged to produce a single estimate of the model’s predictive performance. The overall chunking accuracy of the system, obtained by averaging all ten results, is presented in Table 2.
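A minimal sketch of this evaluation protocol is given below; `train_fn` and `eval_fn` are hypothetical placeholders for the system's training procedure and its per-fold accuracy measurement, neither of which is specified in code form in the paper.

```python
import random

def cross_validate(sentences, train_fn, eval_fn, k=10, seed=0):
    """Average the evaluation score over k folds."""
    sentences = sentences[:]                      # avoid mutating the caller's list
    random.Random(seed).shuffle(sentences)
    folds = [sentences[i::k] for i in range(k)]   # k roughly equal partitions
    scores = []
    for i in range(k):
        held_out = folds[i]                       # one fold for validation
        training = [s for j, f in enumerate(folds) if j != i for s in f]
        scores.append(eval_fn(train_fn(training), held_out))
    return sum(scores) / k
```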

Table 2. Test result for Amharic base phrase chunker.

5 Conclusion and Future Works

Amharic is one of the most morphologically complex and less-resourced languages. This complexity poses difficulties in the development of natural language processing applications for the language. Despite the efforts being undertaken to develop various Amharic NLP applications, only a few usable tools are publicly available at present. One of the main reasons frequently cited by researchers is the morphological complexity of the language, from which Amharic text parsing also suffers. However, not all Amharic natural language processing applications require full parsing. In this work, we tried to overcome this problem by employing a chunker. Chunking appears to be a more manageable problem than parsing because the chunker does not require a deep analysis of texts and is therefore less affected by the morphological complexity of the language. Thus, future work should be directed at improving the chunker and using it to develop Amharic natural language processing applications that do not rely on deep analysis of linguistic structures.