A systematic review of text stemming techniques

Singh, Jasmeet; Gupta, Vishal

doi:10.1007/s10462-016-9498-2

Published: 01 August 2016

Volume 48, pages 157–217 (2017)

Artificial Intelligence Review

Abstract

Stemming is a procedure that maps the morphological variants of a word to its root word. It is extensively used as a pre-processing tool in natural language processing, information retrieval, and language modeling. Although many advances have been made in the field, an organized account of previous work is lacking. In this paper, we present a review of text stemming theory, algorithms, and applications. The paper first describes the existing literature relevant to text stemming by classifying it according to certain key parameters; it then presents an in-depth analysis of some well-known stemming algorithms on standard data sets. Finally, the current state of the art and certain open issues related to unsupervised stemming are presented. The main aim of this paper is to provide an extensive and useful understanding of the important aspects of text stemming. The open issues and the analysis of current stemming techniques should help researchers identify new lines of future research.

1 Introduction

In the age of the Internet, the presence of different communities on the World Wide Web has increased remarkably, and a large amount of data is available in digital form in multiple languages. The major concern today is to identify the challenges of handling document collections efficiently in a large number of languages and to develop technology that meets those challenges. Hence, the development of intelligent tools for language processing and information retrieval has become an active research area. Earlier language processing tools were mainly developed for English. However, with the increase in multilingual documents and the lack of tools and resources for languages other than English, research is now also being carried out for non-English languages. The various areas of information retrieval and natural language processing require text pre-processing tools for lexical, morphological, syntactic and semantic analysis. Stemming is one such pre-processing tool and is useful in tasks such as text classification, clustering, searching, summarization, POS tagging, etc.

1.1 Stemming

In most languages, the basic word form is modified to produce variant word forms according to the function of the word in a sentence. These word forms are produced through different linguistic processes such as compounding (combination of two or more words), affixation (addition of prefixes and/or suffixes), conversion (formation of new words from existing ones), etc. These word forms often share the same meaning. Stemming is a procedure in which the various morphological variants of words are matched to their stem (root, base word). For instance, the words maintaining, maintained and maintenance are matched to their root word maintain through stemming. The programs that perform stemming are called stemmers. Stemming is the simplest type of language processing used in IR systems and is found to be more beneficial in languages with complex morphology, where a single word has a large number of variants (Xu and Croft 1998). Though text stemming started in the late sixties (Lovins 1968), it has experienced tremendous growth in recent years, and a large number of techniques and algorithms have been proposed to handle this task. However, designing completely unsupervised, language-independent stemmers remains a great challenge.

Stemming has been viewed from different perspectives, and researchers in information retrieval and linguistic processing consider it a desirable and important step for different reasons (Manning et al. 2008). Firstly, stemming is viewed as a tool to improve retrieval accuracy. Stemming reduces the variant forms to the root word during indexing and searching of documents. This addresses the problem of vocabulary mismatch between documents and queries, so that documents that do not exactly match the query terms are also retrieved. In information retrieval systems, stemming is considered a recall-enhancing device, but for languages with complex morphology it improves precision as well by promoting the relevant documents to higher ranks (Xu and Croft 1998).

Secondly, stemming is viewed as a mechanism to reduce the size of the index file, since the variant terms are reduced to a single term (Melucci and Orio 2003; Bhamidipati and Pal 2007). In some cases the index file is reduced to half of its original size. From another perspective, stemming is also viewed as a clustering or feature reduction/selection mechanism, where the major aim is to select the most appropriate dimension, class or concept. Stemming is also viewed as a mechanism to normalize the different concepts or senses used in the query terms (Krovetz 1993). The stemming rules help in identifying which word forms are related to each other, and such relations can be used by the stemmer to resolve the meaning of the word. For instance, suffixes are only added to stems with specific parts of speech, and this knowledge can be used to differentiate between two homographs [intimation derived from intimate (verb) and intimately derived from intimate (adjective)]. Hence, stemming is a mechanism that is widely accepted by users and valued for its consistency.
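The two effects described above, the handling of vocabulary mismatch and the reduction of index size, can be illustrated with a minimal sketch. The suffix list, the `toy_stem` function and the three sample documents are assumptions made purely for illustration; they do not correspond to any stemmer reviewed in this article.

```python
from collections import defaultdict

# Hypothetical suffix list for a toy stemmer (illustration only).
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def identity(word):
    """No normalization: index the surface form as it appears."""
    return word


def build_index(docs, normalize):
    """Build an inverted index mapping each normalized term to document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize):
    """Return the ids of documents containing the normalized query term."""
    return index.get(normalize(query.lower()), set())


# Hypothetical sample documents.
docs = {
    1: "the engineer maintained the engine",
    2: "maintaining engines is costly",
    3: "the budget was approved",
}

raw_index = build_index(docs, identity)
stem_index = build_index(docs, toy_stem)

print(search(raw_index, "maintain", identity))   # no exact match in any document
print(search(stem_index, "maintain", toy_stem))  # documents 1 and 2 are retrieved
print(len(raw_index), len(stem_index))           # the stemmed index has fewer terms
```

In this sketch the same normalization function is applied at indexing and at query time, which is what allows a query term to match variants it does not literally share with the documents.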

1.1.1 Stemming and lemmatizing

Stemming and lemmatizing are often considered sibling processes and treated together. Both processes are related and perform the similar function of reducing the variant words in the input text. The basic difference between the two processes lies in their outputs: the output of stemming is a ‘stem’ and that of lemmatization is a ‘lemma’. Stems usually have distinct meaning and are often task-oriented. A stem is a part of a word (with or without meaning) that is used to form new words through various linguistic methods such as compounding (e.g. six-pack, day-dream) or affixation (e.g. perish-able, dur-able) (Huddleston 1988). The stem may be a valid, fully understandable word (free stem) or an invalid word that requires an affix to make a word (bound stem). For instance, ‘perish’ is a free stem and ‘dur’ is a bound stem. Lemmas, on the other hand, are valid linguistic components: a lemma is the dictionary form of a lexeme. A lexeme corresponds to the collection of all the word variant forms that have similar meaning, and the lemma is one particular variant used to represent the lexeme. For instance, run, ran, runs and running are different forms of one lexeme, which is represented by the lemma ‘run’.

Stemming is a simpler and faster process that uses rules to determine the stem without considering the vocabulary, the context of the word or its part-of-speech, whereas lemmatization is a comparatively complex procedure that first determines the part-of-speech and context of the word in order to return the lemma (Jivani 2011). Lemmatization performs complete morphological analysis of the words to determine the lemma, whereas stemming removes the variations and may produce forms that are not morphologically correct words. For instance, the word forms introduces, introducing and introduction are mapped to the lemma ‘introduce’ by a lemmatizer, but a stemmer will map them to the stem ‘introduc’. This oddity cannot be considered a flaw of stemmers, as the stemming of document keywords and queries is invisible to the user.
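The contrast between the two processes can be sketched as follows. The suffix rules and the small lemma dictionary are hypothetical and serve only to show that the stemmer works on the surface string alone, whereas the lemmatizer needs vocabulary knowledge and a part-of-speech tag (assumed here to be supplied by the caller).

```python
# Hypothetical suffix rules: the stemmer uses no vocabulary and no part-of-speech.
SUFFIXES = ("tion", "ing", "es", "s")

# Hypothetical lemma dictionary keyed by (word form, part-of-speech).
LEMMAS = {
    ("introduces", "VERB"): "introduce",
    ("introducing", "VERB"): "introduce",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a valid word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word, pos):
    """Look up the dictionary form; unknown (word, pos) pairs are returned unchanged."""
    return LEMMAS.get((word, pos), word)


for form in ("introduces", "introducing", "introduction"):
    print(form, "->", toy_stem(form))          # each form maps to 'introduc'

print(toy_lemmatize("introducing", "VERB"))    # a valid word: 'introduce'
print(toy_lemmatize("saw", "VERB"))            # 'see'
print(toy_lemmatize("saw", "NOUN"))            # 'saw'
print(toy_stem("saw"))                         # the stemmer cannot tell the two apart
```

The ‘saw’ example shows why the part-of-speech matters to a lemmatizer: the same surface form has two different lemmas, while the rule-based stemmer returns one output regardless of context.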

In some cases, stemmers and lemmatizers cannot replace each other, as stemmers cannot be used where the desired output must be a valid word of the language. On the other hand, stemmers can be designed to remove derivational suffixes, whereas lemmatizers only remove inflectional variations (Brychcín and Konopík 2015). Stemmers are semantically oriented and tend to combine lexemes that are semantically related. Moreover, stemmers can be developed in an unsupervised manner without the use of any linguistic resource or expert, but currently no method has been proposed in the literature for training a lemmatizer in an unsupervised manner.

1.2 Motivation for conducting the survey

The motivation for conducting this survey is to highlight the present status of stemming by tracing its historical development. An extensive survey of stemming techniques is conducted, and the various methods are compared using several metrics. The following objectives specifically motivate this survey:

To present a comprehensive review of various stemming techniques, covering not only the functioning details of the methods but also identifying their distinguishing features. A number of useful tables highlighting features, advantages, disadvantages and performance analysis of various stemming techniques are synthesized cohesively.
To analyze the retrieval performance of various well-known stemming techniques in different language families. The retrieval performance of stemmers is also evaluated with respect to factors such as the size of the training data, the nature of the documents, the type of training and different indexing and searching schemes. Besides information retrieval, stemmer performance is compared in web searching and text classification tasks.
To identify challenges and future research directions for unsupervised stemming. We highlight various open issues related to unsupervised statistical stemming, such as discovering rules other than affix stripping, learning advanced semantic relations from the corpus, unsupervised parameter tuning, evaluation independent of the IR system, etc.

The rest of the article is organized as follows. In Sect. 2, a classification of stemming techniques is presented with reference to previous work. Section 3 reviews various language-specific stemming techniques proposed in the literature. Besides English, linguistic stemmers in nearly thirty languages belonging to nine language families are reviewed in this section. In Sect. 4, methods related to unsupervised statistical stemming are described. Various mechanisms used for evaluating stemmers are discussed in Sect. 5. The evaluation results and analysis of some well-known stemmers on standard TREC, CLEF, and FIRE collections are also presented in this section. In Sect. 6, the performance of stemmers according to various factors and tasks is analyzed. Applications of stemming in various fields are discussed in Sect. 7. Various open issues and future directions related to unsupervised stemming are presented in Sect. 8. Section 9 concludes this article.

2 Classification of stemming techniques

Stemming is a well-studied technology, and a number of stemmers based on different techniques are presented in the literature. The classification of stemming techniques is presented in Fig. 1. Broadly, stemming techniques are classified into two categories: Language Specific (Rule-Based) and Statistical (Corpus-Based) techniques.

Language-specific or rule-based stemmers make use of pre-defined language-related rules to map the morphological variants of a word to its base form. These rules are created manually by language experts or linguists. The output quality of rule-based stemmers is generally better than that of statistical stemmers because they not only strip affixes from the word but can also change the complete word (‘ate’ to ‘eat’). However, the creation of rule-based stemmers is very time-consuming and requires linguistic experts and resources such as dictionaries, stem tables, etc. Language-specific stemming methods are further classified into three categories: Table Lookup, Affix Stripping, and Morphological.

Table lookup (brute force stemming) These techniques make use of a lookup table that contains the root word corresponding to each inflected or derived word (Frakes 1992). To find the root of a word, the table is checked; if a match is found, the root is returned. These algorithms are also called dictionary-based algorithms. They are simple and easy to use, and can also take care of exceptional cases. However, they require various language resources and cannot handle words outside the dictionary.
Affix stripping algorithms A prefix or suffix of a word is called an affix. Affix removal algorithms delete the suffix and/or prefix of a word according to specific rules or a suffix list (Baeza-Yates and Ribeiro-Neto 2011; Frakes 1992). Much more work has been done on suffix stripping than on prefix stripping. Developing the rules for the stemmer requires thorough expertise in the language and various language resources. These techniques cannot handle variations caused by compounding or spelling variations, and they produce a number of errors because the strings left after affix stripping are sometimes not real words.
Morphological stemmers These stemmers take into account the morphology of the language while stemming. Inflectional stemmers consider inflectional morphology, i.e. they detect changes in the word caused by syntax, such as the forms of nouns and verbs or the change from singular to plural, where the part-of-speech (POS) remains the same (Krovetz 1993). Derivational stemmers take into account derivational morphology, i.e. changes in category such as nominalization (a noun phrase generated from another class, such as ‘informer’ from ‘inform’), deadjectival (a word derived from an adjective, such as ‘happiness’ from ‘happy’), deverbal (a word derived from a verb, usually a noun or adjective, such as ‘readable’ from ‘read’), and denominal (a word derived from a noun, such as ‘useful’ from ‘use’). These stemmers take into account dictionary information such as context, word meanings, vocabulary, etc. The effectiveness of these stemmers is quite high, as the algorithms consider both the syntax and the semantics of the language. However, their development requires complete knowledge of the language and its morphology.

Statistical techniques are based on unsupervised learning of the language, either by analyzing the lexicon or by finding the co-occurrence or context of words in the corpus. They are also called corpus-based techniques. These algorithms also perform suffix stripping, but only after performing some statistical analysis on the corpus (Paik et al. 2011a). Statistical techniques allow a new language to be incorporated into a system with very little effort, which is especially useful for applications related to information retrieval. Moreover, these techniques can deal with languages that have complex morphology and sparse data. Their major advantage is that they do not require any prior knowledge of the language or any language resources, which is useful for the many languages whose resources are either unavailable or too incomplete to provide effective results. On the other hand, statistical techniques involve a lot of computation and take time, as they learn the morphology of the language from the corpus on their own. Statistical techniques can be further divided into the following categories:

Lexicon analysis based These stemmers learn the morphology of a language by analyzing its lexicon (Paik and Parui 2011). Word variants are identified from the lexicon using different methods such as computation of the frequencies of substrings (Goldsmith 2001; Bacchin et al. 2005) or string distances or similarities (Majumder et al. 2007b).
Corpus analysis based These stemmers use the context or co-occurrence statistics of the corpus words to perform stemming. Various statistical analysis methods such as expected maximization (Xu and Croft 1998), co-occurrence strength (Paik et al. 2011b, 2013), distributional similarity (Bhamidipati and Pal 2007), etc. are employed to learn the morphological rules for stemming.
Character N-gram based These stemmers use n-grams derived from the words of the language to perform stemming. The frequency or probability information derived from n-grams can help in identifying the variants, as the frequency or probability of prefixes and suffixes is high compared with that of roots or stems (Smirnov 2008).
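As an illustration of how word variants might be identified from surface strings alone, the following minimal sketch groups words by the overlap of their character bigrams. The use of the Dice coefficient as the similarity measure, the greedy grouping strategy, the threshold of 0.6 and the sample word list are all assumptions made for this example; the sketch is not the method of any of the works cited above.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def group_variants(words, threshold=0.6):
    """Greedy single-pass grouping.

    Each word joins the first group whose representative (its first member)
    is at least `threshold` similar; otherwise it starts a new group.
    """
    groups = []
    for word in words:
        for group in groups:
            if dice(word, group[0]) >= threshold:
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


# Hypothetical lexicon.
lexicon = ["connect", "connected", "connection", "retrieve", "retrieval"]
for group in group_variants(lexicon):
    print(group)
```

No language-specific rule or dictionary is used here: the grouping depends only on the strings in the lexicon and on the chosen threshold, which would need tuning for each language and collection.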

Besides the techniques mentioned above, stemmers are also developed as hybrids in which stemming is performed by combining two or more algorithms. For instance, the table lookup and suffix stripping approaches can be combined: the word to be stemmed is first searched in the lookup table; if the stem is found it is returned, otherwise suffix stripping rules are applied to the word. Table 1 compares the various stemming approaches on the basis of some distinguishing features.
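The lookup-then-strip combination described above can be sketched as follows. The contents of the lookup table, the suffix list and the minimum stem length are hypothetical values chosen for illustration only.

```python
# Hypothetical lookup table for irregular forms that suffix rules cannot handle.
LOOKUP = {"ate": "eat", "went": "go", "mice": "mouse"}

# Hypothetical suffix list, tried in order.
SUFFIXES = ("ing", "ed", "s")


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def hybrid_stem(word):
    """Consult the lookup table first; fall back to suffix stripping."""
    word = word.lower()
    if word in LOOKUP:
        return LOOKUP[word]
    return strip_suffix(word)


for w in ("ate", "walking", "maintained", "walk"):
    print(w, "->", hybrid_stem(w))
```

The table handles exceptional forms such as ‘ate’, which suffix stripping alone would leave unchanged, while the stripping rules cover regular words that are absent from the table.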

Table 1 Comparison of stemming techniques
