
1 Introduction

Annotated corpora are quite common even for under-resourced languages, but there are languages with tens of millions of native speakers that lack high-quality text corpora. Amharic is such a case.

Amharic is one of the official working languages of Ethiopia. It is the second most spoken Semitic language in the world, with over 20 million native speakers. With so many speakers and official status, it is hard to believe that it counts as an under-resourced language. However, there are not many language resources for Amharic, and most of those available are of poor quality, small and/or not easily accessible. The same holds for text corpora. There are several text corpora available (see Sect. 2), but there is only one morphologically annotated corpus, and it is small and of poor quality. One of the reasons for this situation is the special script used for writing Amharic: Ge’ez.

Ge’ez script, also called Fidel in Amharic, is a syllabic script. There are more than 300 characters, each representing a consonant-vowel pair. There are 26 consonant letters combined with 7 or more vowels. The Ge’ez script also has its own symbols for numbers and punctuation. See Table 1 for an example of the Ge’ez characters and Fig. 1 for an Amharic text written in Ge’ez. The Ge’ez script is also used for writing Tigrinya and several smaller languages of Ethiopia. Not all characters are used in all languages: some characters are used in only one language. The Ge’ez script has been supported by the Unicode standard since version 3.0 (1999). Several rarely used characters were added in versions 4.1 (2005) and 6.0 (2010).
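As a small illustration of the Unicode support described above, the following Python sketch tests whether a character belongs to the Ethiopic (Ge’ez) script by inspecting its Unicode character name. The helper names `is_ethiopic` and `ethiopic_ratio` are hypothetical and are not part of any tool used in this work; which characters are recognised depends on the Unicode version shipped with the Python interpreter.

```python
import unicodedata


def is_ethiopic(ch: str) -> bool:
    """Return True if the Unicode name of `ch` starts with 'ETHIOPIC'."""
    # unicodedata.name raises ValueError for unnamed code points,
    # so an empty string is passed as the default value.
    return unicodedata.name(ch, "").startswith("ETHIOPIC")


def ethiopic_ratio(text: str) -> float:
    """Share of non-whitespace characters that are Ethiopic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_ethiopic(c) for c in chars) / len(chars)
```

A ratio of this kind could serve as a quick sanity check that a text is written in Fidel rather than in a Latin transliteration.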

Because of the poor support for displaying and writing Ge’ez script on computers, there have been many attempts to transliterate the script into other alphabets. There are at least ten different transliteration systems in Latin script. Not all of them define a mapping for all Ge’ez characters, but all are based on a phonetic transcription, hence the differences are not big. The most complete, and the one that differs most from the others, is SERA [2]. It requires no accents and uses only ASCII characters (the English alphabet), so it is easy to type SERA on any keyboard. The transliteration of several Ge’ez characters is listed in Table 1. We use SERA in all our corpora together with the original Ge’ez script.

Table 1. Transliteration of selected Ge’ez characters in the SERA system.
Fig. 1. Example of annotated WIC Corpus

2 Existing Corpora

2.1 WIC News Amharic Corpus

Amharic text corpora range from morphologically annotated to parallel corpora. Compared to similar corpora in other (even smaller) languages, all Amharic corpora are small. WIC Corpus [1] is the only manually morphologically annotated corpus. It consists of about 210,000 words in 1,065 documents. Texts were taken from the Web news published by the Walta Information Center (http://www.waltainfo.com) in 2001. A sample of the corpus is displayed in Fig. 1.

2.2 Morphological Annotation

Amharic has a rich morphology: nouns and adjectives are inflected and there are complex rules for deriving verbs. Several part-of-speech tag systems were proposed earlier, all working with about 10 tags for the basic parts of speech. No existing tag-set includes tags for annotating gender, number and other grammatical categories. In some cases, nouns, pronouns, adjectives, verbs and numerals have variants with attached prepositions and/or conjunctions. For example, there are N = noun, NP = noun with a preposition as a prefix, NC = noun with a conjunction as a suffix, and NPC = noun with a preposition as a prefix and a conjunction as a suffix. In total, there are 30 different PoS tags in the WIC Corpus.
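The naming pattern of the noun tags can be sketched in a few lines of Python. The function `compose_tag` is a hypothetical helper written only for illustration. The text confirms only the four noun tags N, NP, NC and NPC; applying the same pattern to other base tags is an assumption.

```python
def compose_tag(base: str, has_preposition: bool = False,
                has_conjunction: bool = False) -> str:
    """Build a tag such as N, NP, NC or NPC from a base tag and two flags."""
    tag = base
    if has_preposition:
        tag += "P"  # preposition attached to the word as a prefix
    if has_conjunction:
        tag += "C"  # conjunction attached to the word as a suffix
    return tag


# The four noun tags listed in the text.
noun_tags = [
    compose_tag("N"),
    compose_tag("N", has_preposition=True),
    compose_tag("N", has_conjunction=True),
    compose_tag("N", has_preposition=True, has_conjunction=True),
]
```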

3 New Amharic Corpora

We have created two new corpora. The first one is a cleaned version of the WIC Corpus, the second one is a new big corpus from the Web. Both corpora are available for querying on the web page of the HaBiT project at https://habit-project.eu/corpora.

3.1 Cleaned WIC Corpus

There were several attempts to use the WIC Corpus for training automatic part-of-speech taggers, for example [3, 4, 11]. All of them found that the corpus has many annotation inconsistencies: missing tags, misspelled tags, multiword expressions and others. There were two separate versions of the corpus: one in the original Ge’ez script and one in SERA transliteration. Several research papers report different numbers of tokens for the two versions. We have unified both versions and corrected non-matching words either in Ge’ez or in SERA, depending on a native speaker's decision. We have applied all cleaning procedures described in the above-mentioned papers.

We have added further unifications of numbers and dates. For example, most numbers containing a decimal point were written as “6 8” where “” means “point”. This is the result of the original transcription from the hand-written “paper” annotation into a computer. Sometimes such a string formed one token, while in other cases it formed three tokens. We have normalised all such occurrences into the correct form (6.8 in this case) with the respective PoS tag. The size of the cleaned corpus is 200,561 tokens. Each token is represented by a word in Ge’ez, its transliteration in SERA and the respective PoS tag.

The cleaned WIC corpus was used to train a PoS tagger. Because of the small number of tags in the tag-set, we chose TreeTagger [9], which works very well in such conditions. To evaluate the accuracy of the resulting tagging model, we divided the corpus into 10 parts, each containing about 20,000 tokens. For each part, we trained a TreeTagger model on the nine remaining parts, ran TreeTagger on the held-out part, and compared the result with the manual annotation. The whole evaluation was done separately on the Fidel part of the corpus and on the SERA part, and for both on the data before and after the final cleaning procedure. The results are summarised in Table 2; the average accuracy is 87.4 %. The final cleaning did not influence the results much, and TreeTagger performs slightly better on the Fidel script than on the SERA transliteration.
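The ten-part evaluation scheme can be sketched in Python as follows. This is only a minimal sketch: TreeTagger is an external program, so a most-frequent-tag baseline (`train_baseline`, a hypothetical stand-in) replaces it here, and the corpus is assumed to be a list of (word, tag) pairs with at least as many tokens as parts.

```python
from collections import Counter, defaultdict


def split_into_folds(tokens, k=10):
    """Split a list of (word, tag) pairs into k contiguous parts of near-equal size."""
    size, rest = divmod(len(tokens), k)
    folds, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < rest else 0)
        folds.append(tokens[start:end])
        start = end
    return folds


def train_baseline(train_tokens):
    """Stand-in for tagger training: remember the most frequent tag of each word."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for word, tag in train_tokens:
        per_word[word][tag] += 1
        all_tags[tag] += 1
    default_tag = all_tags.most_common(1)[0][0]  # used for unknown words
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lexicon, default_tag


def cross_validate(tokens, k=10):
    """Train on k-1 parts, tag the held-out part, return accuracy per part."""
    folds = split_into_folds(tokens, k)
    accuracies = []
    for i, test in enumerate(folds):
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        lexicon, default_tag = train_baseline(train)
        correct = sum(lexicon.get(w, default_tag) == t for w, t in test)
        accuracies.append(correct / len(test))
    return accuracies
```

Averaging the returned list gives a single figure comparable in kind to the average accuracy reported in Table 2, although the baseline itself says nothing about TreeTagger's performance.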

Table 2. Accuracy of TreeTagger on ten parts of the WIC corpus

3.2 Building an Amharic Web Corpus

We used the following steps to create a big Web corpus. First, following the Corpus factory method [6], bigrams of Amharic words from the Crúbadán database [8] were used to query the Bing search engine for documents in Amharic. The 354 queries yielded 6,453 URLs. The URLs of 3,145 successfully downloaded documents were used as starting points for the web crawler SpiderLing [10]. URLs of documents crawled in 2013 using a similar approach were added to the set of starting points.

The following language models were created:

  • Character trigram model for language detection. 5.2 MB of text from the WIC Corpus and Amharic Wikipedia was used to train the model.

  • Byte trigram model for character encoding detection. The model was trained using web pages obtained by the Corpus factory method.

  • The most frequent Amharic words from the WIC Corpus wordlist were used as a resource for boilerplate removal tool jusText [7].
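The idea behind a character trigram model for language detection can be sketched in Python as below. This is a simplified illustration, not the model used by the crawler: the scoring function (an average smoothed log-probability) and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def char_trigrams(text):
    """Yield overlapping character trigrams of the text."""
    for i in range(len(text) - 2):
        yield text[i:i + 3]


def train_trigram_model(text):
    """Count character trigrams in the training text."""
    return Counter(char_trigrams(text))


def avg_log_prob(model, text, alpha=1.0):
    """Average add-alpha smoothed log-probability of the trigrams in `text`."""
    trigrams = list(char_trigrams(text))
    if not trigrams:
        return float("-inf")
    total = sum(model.values())
    vocab = len(model) + 1  # +1 reserves some mass for unseen trigrams
    return sum(
        math.log((model[t] + alpha) / (total + alpha * vocab)) for t in trigrams
    ) / len(trigrams)


def detect_language(models, text):
    """Return the name of the model that gives `text` the highest score."""
    return max(models, key=lambda name: avg_log_prob(models[name], text))
```

In use, one model would be trained per candidate language and a downloaded page would be kept only if the Amharic model scores best. The byte trigram model for encoding detection follows the same pattern with bytes in place of characters.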

The crawler was set to harvest web domains in the Ethiopian national top-level domain et and in other general TLDs: com, org, info, net, edu. 3.6 GB of HTTP responses were gathered in the process. HTML tags and boilerplate paragraphs were removed from the raw data. 42 % of paragraphs were identified as duplicates or near duplicates and removed using the onion tool [7]. 66 MB of deduplicated text obtained by the same process in 2013 was added to the data. Sentence boundaries were marked at positions with either of two Amharic end-of-sentence characters. The final size of the corpus (containing data from the years 2013, 2015 and 2016) is 461 MB, or more than 17 million words. Finally, the corpus was tagged by TreeTagger with a model trained on the cleaned version of the WIC Corpus. The corpus is called amWaC 16.
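Paragraph-level near-duplicate removal can be illustrated with the following Python sketch based on word n-grams (shingles). It is not the algorithm implemented in onion, whose details are not described here; the n-gram length `n=5` and the `threshold=0.5` are assumed values chosen for illustration only.

```python
def shingles(paragraph, n=5):
    """Return the set of word n-grams (shingles) of a paragraph."""
    words = paragraph.split()
    if len(words) < n:
        # Short paragraphs are represented by a single shorter tuple.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(paragraphs, n=5, threshold=0.5):
    """Keep a paragraph only if fewer than `threshold` of its shingles were seen before."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = shingles(paragraph, n)
        if not grams:
            continue  # skip empty paragraphs
        share_seen = len(grams & seen) / len(grams)
        if share_seen < threshold:
            kept.append(paragraph)
        seen |= grams
    return kept
```

Because every paragraph is compared with all text seen so far, the order of the input matters: the first occurrence of a repeated paragraph is kept and later copies are dropped.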

3.3 Corpus Properties

Basic properties of corpus sources are summarised in Tables 3 and 4.

We observe that content from news/politics and religious portals has a significant presence in the corpus sources. Since only 138 domains with more than 10 documents are represented in the corpus, we admit that the resulting collection would benefit from a greater variety of sources.

The most frequent parts of speech in both corpora are nouns and verbs. For details see Fig. 2.

Table 3. The size of corpus structures.
Table 4. Document count – the most frequent web domains and domain size distribution.
Fig. 2. Relative frequency of tags in both Amharic corpora. (End of sentence token is marked by a PUNCT tag in WIC.)

Table 5. Keyword comparison of amWaC 16 to WIC: words most characteristic for the web corpus, sorted by keyword score.
Table 6. Keyword comparison of WIC to amWaC 16: words most characteristic for the news corpus, sorted by keyword score.

Tables 5 and 6 show the main differences between the corpora using keyword comparison. As expected, the language is much more formal and the main topic is politics in the news-only corpus. Religion-related words are noticeable in the WaC corpus. Differences in tokenisation can be observed too; for example, one morpheme is represented as a separate token in the WaC corpus.

The Keyword Score KS of a word is calculated according to [5] as

$$KS = \frac{fpm_{foc} + n}{fpm_{ref} + n}$$

where \(fpm_{foc}\) is the normalised (per million words) count of the word in the focus corpus, \(fpm_{ref}\) is the normalised count of the word in the reference corpus and \(n = 100\) is the Simple Maths smoothing parameter.
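The formula can be checked with a short Python sketch. The function names and the counts in the usage note are invented for illustration and are not taken from Tables 5 and 6.

```python
def freq_per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size


def keyword_score(count_foc, size_foc, count_ref, size_ref, n=100):
    """Keyword score KS = (fpm_foc + n) / (fpm_ref + n)."""
    fpm_foc = freq_per_million(count_foc, size_foc)
    fpm_ref = freq_per_million(count_ref, size_ref)
    return (fpm_foc + n) / (fpm_ref + n)
```

For instance, a word occurring 300 times in a focus corpus of one million words and 100 times in a reference corpus of the same size gets KS = (300 + 100) / (100 + 100) = 2. The smoothing parameter keeps the score finite for words absent from the reference corpus and reduces the influence of rare words.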

4 Conclusion

We have built a web corpus of Amharic texts comprising more than 15 million words. To our knowledge, it is the largest Amharic corpus for language technology use currently available. We expect that corpus linguistics, lexicography and language teaching in Ethiopia will greatly benefit from such a resource.

We have also cleaned the WIC corpus and unified its Fidel and SERA versions. This resource could be used for building language models (like the TreeTagger model) and for other natural language processing applications for Amharic.

A similar approach is being applied to obtain web corpora in other East African languages: Afaan Oromo, Tigrinya and Somali. All corpora compiled within the project are available for browsing and querying by corpus manager Sketch Engine at https://habit-project.eu/corpora. The full source text was not made public because of possible copyright issues.