Keywords

1 Introduction

Stemming is the method by which the inflected or derived words are reduced to a stem, base or root form. The software which produces the stem or the root word is referred to as a stemmer [1]. A text in a natural language may contain several variants of the same word. These morphologically similar words have their meanings rooted in the stem or the root word [2, 3]. Today, several stemmers have been developed for many languages across the world, mainly for English and other European languages as well [4]. However, not much work in the field of stemming has been carried out for South Asian languages like Hindi and Nepali [5].

Languages like Hindi and Nepali are written using a writing methodology known as Devanagari script which is considered to be a descendant of Brahmi script. It is written and read from left to right direction and there is no distinction between upper case and lower case letters [6]. The Hindi and the Nepali alphabets share many similar characteristics in the way they are written, as both are written using the Devanagari script. The Devanagari script is the fourth most widely adopted writing system in the world and is composed of 47 primary characters, including 14 vowels and 33 consonants [7]. In order to develop a stemmer for any language, it is very important to understand the word constructs pattern for that language. Nepali is an Indo-Aryan language written in Devanagari script. It follows a Subject + Object + Verb pattern in sentences which is different for languages like English. From the aspect of applicability, stemming techniques find their use in various information retrieval tasks used by various search engines like Google, Yahoo etc. The stemming techniques also find their use in finding domain vocabularies related to a particular domain of interest.

1.1 Types of Stemmer

Stemming algorithms can be broadly classified into the following three categories.

1.2 Rule-Based Stemming

It is a structural stemming approach that utilizes the structure of the words in a language as well as the morphological rules to identify the stem of each word. The rules are written based on the morphology of the language and its word derivation structure [8].

1.3 Statistical Stemming

Rule-based stemmers have the disadvantage of being reliant on a fixed set of rules for carrying out the stemming operation. Rule-based stemmers also require an adequate amount of language expertise [9]. On the other hand, statistical stemming approaches do not require language expertise and use statistical information from a large corpus of a given language to learn the morphology of words [10].

1.4 Hybrid Stemming

This approach combines the features of a rule-based stemmer and a statistical stemmer for stemming. Combining the features of both stemmer helps to increase the accuracy of the stemming algorithm [11].

2 Challenges in Stemming

Stemming algorithms are often challenged with two kinds of errors that occur frequently depending upon the nature of algorithms used by the stemmer. The types of resulting errors are given below.

2.1 Over-Stemming

Over-stemming happens when two words with different stems are derived from the same root. Over-stemming may also be known to be false-positive.

2.2 Under-Stemming

Under-stemming happens when two words that do not have separate stems are derived from the same root. It is possible to interpret under-stemming as false-negative.

3 History of Stemming

Initially, the only language where stemming was carried out was the English language. The first-ever algorithm [1] for stemming was proposed by Julie Beth Lovins in the year 1968 and published in the Journal of Mechanical Translation and Computational Linguistics. An inflected word is a result of a combination of the word with prefix or suffix or both. The stemmer was a rule-based stemmer basically aimed at extracting the word by suffix removal. The Lovins stemmer has 294 endings, 29 conditions and 35 transformation rules [1].

Inspired by the work of Lovins, an algorithm [2] for stemming was proposed by Martin Porter in the year 1980 and published in the journal named Program. This stemmer is the most widely used stemmer in the world and is the most cited paper on stemming [7]. It is a de facto standard for stemming. The algorithm used was a rule-based approach. However, the stemmer had a lesser number of rules as compared to Lovins stemmer when it was derived.

The third significant work on stemming was carried by Christopher D Paice in the year 1990. The paper [3] was published in the proceedings of the conference on Special Interest Group on Information Retrieval. The stemmer was a rule-based stemmer with an added advantage of the inclusion of a subroutine for index compression in the algorithm. It was faster but produced a relatively large number of over-stemming errors [3] compared to the previous algorithms on stemming.

4 Stemming Techniques for Hindi Language

Various algorithms have been proposed for stemming based on the Hindi language. Ramanathan and Rao [4] proposed a lightweight stemmer for Hindi which uses handcrafted set of suffix list and looks for longest match stripping. They have used the name “Light stemming” as the algorithm is used for tripping of a small set of either prefixes or suffixes or both, without trying to deal with infixes, or recognize patterns and find roots. Out of 35,977 words used as input to the algorithm, 4.6% of words were found to be under-stemmed while 13.8% were found to be over-stemmed [4].

Pandey and Siddiqui [5] proposed an unsupervised Hindi stemmer with the aim of improving the combining various prefix and suffix rules based on heuristics. This paper focuses on the development of an unsupervised stemmer for Hindi and the evaluation of the approach using manually segmented words. The algorithm was evaluated on 1000 words randomly extracted words from the Hindi WordNet-1 database [12]. The training data has been constructed by extracting 106,403 words extracted from EMILLE (Enabling Minority Language Engineering) 2 corpus [13]. The observed accuracy was found to be 89.9% after applying some heuristic measures. The F-score was 94.96% [5].

Ganguly et al. (6) proposed two separate rule-based stemmers for the Bengali and Hindi languages. In this paper, linguistic knowledge was used to manually craft the rules for removing the commonly occurring plural suffixes for Hindi and Bengali. A baseline was fixed by choosing words randomly from websites of news articles written in Hindi on the web. The improvement obtained with the incorporation of new rules for stemming by handling some exceptional words which were not a part of the baseline was 5.03% [6].

Mishra and Prakash [7] proposed a stemmer named “Maulik” for the Hindi language. This stemmer is purely based on Devanagari script and it uses the hybrid approach which combination of brute force and suffix removal approach. A lookup table with 15,000 words was maintained in the database. The average accuracy of the stemmer was obtained to be 91.59% [7].

Anand et al. [8] proposed a semi-supervised approach for stemming text written in the Hindi language. This paper uses an algorithm to find the stem of a word in Hindi. The proposed algorithm uses word2vec, which is a semi-supervised learning algorithm, for finding the 10 most similar words from a corpus. Also, a mathematical function is used to find the stem. The results are verified by selecting a set of 1000 Hindi words randomly taken from a corpus [8].

5 Stemming Techniques for the Nepali Language

Bal and Shrestha [9] proposed a morphological analyzer and a stemmer for the Nepali language. This earliest stemming technique did not handle words formed as a result of the combination of two free morphemes. This paper discusses the design and implementation issues as well as the linguistic aspects of a morphological analyzer and a stemmer for the Nepali language [9].

Sitaula [10] proposed a hybrid algorithm for stemming Nepali text. The hybrid Nepali stemming algorithm uses affix stripping in conjunction with a string similarity function and reports a recall rate of 72.1% on 1200 words. The handwritten rules comprised 150 suffixes and around 35 prefixes were considered. The accuracy of this hybrid algorithm is 70.10% [10].

Paul et al. [11] proposed an affix removal stemmer for the Nepali language. This work has a rule base of 120 suffixes and 25 prefixes and a root lexicon of over 1000 words and reports an overall accuracy of 90.48% [11].

Shrestha and Dhakal [14] proposed a new stemmer for the Nepali language and classifies suffixes into three categories and stem them according to different criteria. The proposed algorithm was implemented in Ruby and was tested in a data set of 5000 words, extracted from a corpus containing E-Kantipur news. The accuracy of the algorithm is obtained as 88.78% [14].

Koirala and Shakya [15] proposed a Nepali rule-based stemmer and analyzed its performance on different NLP applications. The corpus contained articles from various different areas, including news, sports, politics, literature etc. Corpus contained a total of 438 news articles with a total word count of 11,813 and a total unique word count of 11,346. Each news article, on average, contained 269 total words and 181 unique words. The F1 score was 0.79 [15].