Keywords

1 Introduction

Natural language processing (NLP) is a field of computational linguistics whose goal is to analyze, understand, and generate human-understandable language. This goal is not easy to reach, because each language has its own grammatical structure. Understanding the dependencies among words and sentences, resolving word ambiguity, and linking these concepts together in a meaningful way are challenging tasks in NLP [1].

1.1 Stop Words

A stop word is a word that carries less significant meaning than other tokens. Identifying and removing stop words is a basic preprocessing phase in NLP and data mining applications. For any NLP tool, there is no single universal stop word list for a given language, because stop word lists are generally domain specific [2].

1.2 Diacritics

A diacritic is a mark used to change the sound value of a character. Each diacritic mark can be identified by a unique UTF-8 value, and by combining it with any consonant of the Gujarati language it is possible to produce multiple meanings. A list of diacritic marks is presented in Table 1 and is further elaborated based on the wide or rare usage of each diacritic [3].
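As an illustrative sketch (in Python, purely for exposition; not part of the system described later), the snippet below shows how one diacritic mark maps to a unique code point and UTF-8 byte sequence. The widely used vowel sign AA is taken as the example:

```python
import unicodedata

# The Gujarati vowel sign AA (U+0ABE) is one example diacritic mark.
sign_aa = "\u0abe"

name = unicodedata.name(sign_aa)          # its official Unicode name
utf8_hex = sign_aa.encode("utf-8").hex()  # "e0aabe": a unique 3-byte sequence
```

Every other diacritic in Table 1 likewise has its own code point, which is what makes UTF-8-based identification possible.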

Table 1 Diacritics for Gujarati document

1.3 Gujarati Language

Gujarati is an official and regional language of the state of Gujarat in India. It is the 23rd most widely spoken language in the world today, spoken by more than 46 million people. Approximately 45.5 million Gujarati speakers live in India, and half a million live outside India, including in Tanzania, Uganda, Pakistan, Kenya, and Zambia. Gujarati belongs to the Indo-Aryan branch of the Indo-European language family and is closely related to Hindi [4].

1.4 Unicode Transformation Format (UTF)

Unicode Transformation Format (UTF) is a character encoding [5] used to display characters written in Indian languages. We have used the 8-bit encoding scheme to process Gujarati documents, since it is not possible to display every Gujarati character using the American Standard Code for Information Interchange (ASCII). There are several UTF representations, including UTF-8, UTF-16, and UTF-32, of which UTF-8 is the most widely used in web technology and mobile applications for Indian languages.
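To make the encoding choice concrete, a small sketch (Python here, for illustration only) compares the byte cost of a single Gujarati letter, KA, under the three UTF representations; ASCII, being 7-bit, cannot represent it at all:

```python
# The Gujarati letter KA (U+0A95) under the three UTF encodings.
ka = "\u0a95"

utf8 = ka.encode("utf-8")       # 3 bytes: code points above U+07FF need three
utf16 = ka.encode("utf-16-be")  # 2 bytes: one 16-bit code unit (BMP character)
utf32 = ka.encode("utf-32-be")  # 4 bytes: fixed-width encoding
```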

2 Related Works and Existing Approaches

Pandey and Siddiqui [6] prepared a list of stop words for the Hindi language based on word frequency and some manual operations. For their experiments they used the EMILLE corpus, and precision and recall were used for evaluation. By removing stop words from raw content, it is possible to improve retrieval accuracy [6].

Kaur and Saini [7] presented a natural language processing approach to identify stop words in Punjabi literature, concentrating on poetry and news articles for data collection. They identified 256 stop words from the selected categories and released them for public use [7]. Kaur and Saini [8] described preprocessing phases for the Punjabi language in which they manually analyzed a data set of Punjabi text documents and identified 1,500 stop words; high-frequency terms occurring in the documents were also considered stop words [8]. Kaur and Saini [9] provided an enhanced understanding of stop words in Punjabi based on part-of-speech tagging. They constructed a data set from five categories of Punjabi literature (nature, romantic, religious, patriotic, and philosophical), manually populated with 250 poems. They prepared 256 stop words manually, owing to the unavailability of Punjabi stop words in the public domain [9].

Thangarasu and Manavalan [10] developed a stemmer for the Tamil language; their stemming algorithm plays an important role in creating the stop word list. They created a list of the tokens available in a text corpus; after sorting that list, they prepared the stop word list based on token frequency and discarded the other words [10].

Yao and Zen-wen [11] created a list of 1,289 Chinese-English stop words by combining domain-specific stop words with a list of classical stop words [11]. For the Mongolian language, [12] used entropy calculation to create a stop word list: they calculated the entropy of each word in an initially created stop word list and combined the result with Mongolian part-of-speech information to prepare the final list [12]. Alajmi et al. [13] used a statistical approach to generate a stop word list for the Arabic language [13].

Chauhan et al. [14] presented a stemmer for the Gujarati language using a rule-based approach to improve an information retrieval system. They used a Gujarati newspaper corpus for their experiments and created a list of 280 stop words based on words that occur frequently but carry little importance in a document [14]. Joshi et al. [15] presented a stop word elimination approach for information retrieval (IR) of the Gujarati language to improve mean average precision (MAP). They collected data from the FIRE corpus and, based on their experiments, constructed a list of 400 words of little importance that are extensively used in Gujarati. From this list they derived 282 stop words through analysis and manual inspection by a linguistic expert [15]. Rakholia and Saini [16] studied and analyzed the different stemmer algorithms and preprocessing approaches available for processing written Gujarati documents. Through their literature survey they found that stop word removal is an important preprocessing step in natural language processing applications [16].

Based on this detailed literature review of the most relevant research works, and on our analysis of the stop word identification process for Gujarati documents, we found that most researchers have obtained average accuracies of 85% in training and 67% in testing for Gujarati stop word identification. This motivated the presented research work, as no effective stop word identification method has yet been developed for Gujarati documents that yields performance good enough to be practically acceptable in the real world.

3 Our Approach

We have used a rule-based approach to identify stop words in Gujarati documents. We did not consider word length when identifying stop words, because a Gujarati document can be written using consonants, vowels, and diacritic signs. It is noteworthy that the length of the stop words found by the methods of other researchers is therefore dependent on and influenced by the usage of diacritics. To design and implement a length-independent approach, we exploit the fact that each diacritic mark in a written Gujarati document counts as a single character and, from a computational linguistic perspective, has a unique UTF-8 hexadecimal value.
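The length-independence idea can be sketched as follows (a Python illustration, under the assumption that Gujarati diacritics are exactly the combining marks of the Gujarati Unicode block; the system described in this paper was itself implemented in JSP):

```python
import unicodedata

def is_diacritic(ch: str) -> bool:
    """True if ch is a Gujarati combining sign (vowel matra, anusvara, virama).

    Gujarati occupies the Unicode block U+0A80..U+0AFF; within it, the
    combining signs carry category Mn (non-spacing) or Mc (spacing combining).
    """
    return (0x0A80 <= ord(ch) <= 0x0AFF
            and unicodedata.category(ch) in ("Mn", "Mc"))

def regular_length(word: str) -> int:
    """Number of regular (non-diacritic) characters in a word."""
    return sum(1 for ch in word if not is_diacritic(ch))
```

Counting only regular characters in this way makes the rules below independent of how many diacritic marks a word happens to carry.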

The following rules are applied to identify stop words appearing in a Gujarati document.

Rule 1: All single-consonant or single-vowel words, with or without diacritics, were considered stop words and eliminated, except only

For instance: With diacritics:

Without diacritics:

Rule 2: A word that contains three regular Gujarati characters other than diacritic signs was considered a stop word and eliminated if the word is terminated with , its middle character has the “” diacritic sign, and its first character has either the “” or the “” sign.

For instance:

If a word contains two regular Gujarati characters other than diacritic signs and is terminated with , then it was considered a stop word and eliminated. For instance:

Rule 3: A word that contains only two regular Gujarati characters other than diacritic signs was considered a stop word and eliminated if it is terminated with .

For instance:

Rule 4: A word that contains only two regular Gujarati characters other than diacritic signs was considered a stop word and eliminated if it is terminated with and does not start with any of the three diacritic signs {}, because in most cases these three diacritic signs {} are used to form proper nouns (e.g., names of girls) in the Gujarati language.

For instance:

Rule 5: A word that contains only two regular Gujarati characters other than diacritic signs was considered a stop word and eliminated if it is terminated by with at least one diacritic sign and does not start with .

For instance:

Rule 6: A word that contains only two regular Gujarati characters other than diacritic signs was considered a stop word and eliminated if it is terminated by and its starting character has only the “” diacritic sign.

For instance:

Rule 7: A word that contains only two regular Gujarati characters other than diacritic signs was considered a stop word and eliminated if it is terminated by and its first character either has no diacritic sign or has only the “” diacritic sign.

For instance:

Rule 8: A word that contains only two regular Gujarati characters other than diacritic signs was treated as a stop word and eliminated if its last character has at least one diacritic sign and its first character has the “” or the “” sign. Using this rule, it was also possible to identify past-tense sentences written in the Gujarati language.

For instance:

Rule 9: A word that contains only two regular Gujarati characters other than diacritic signs was considered a stop word and eliminated if it is terminated with .

For instance:

Rule 10: A word that contains two regular Gujarati characters other than diacritic signs was considered a stop word and eliminated if it is terminated with and its first character contains at least one diacritic sign other than “”.

For instance:

Rule 11: A word that contains two or three regular Gujarati characters other than diacritic signs was considered a stop word and eliminated if it is terminated with .

For instance:
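To show how such rules can be encoded without depending on overall word length, the sketch below (Python, illustrative only; the suffix used is a hypothetical placeholder, since the actual Gujarati signs are those listed in the rules above) represents a rule as a required count of regular characters plus a terminating character:

```python
import unicodedata
from dataclasses import dataclass

def is_diacritic(ch: str) -> bool:
    # Gujarati combining signs: block U+0A80..U+0AFF, categories Mn/Mc.
    return (0x0A80 <= ord(ch) <= 0x0AFF
            and unicodedata.category(ch) in ("Mn", "Mc"))

@dataclass
class Rule:
    regular_len: int  # required number of non-diacritic characters
    suffix: str       # required terminating character

def matches(word: str, rule: Rule) -> bool:
    """Length-independent check: diacritics are excluded from the count."""
    regulars = [ch for ch in word if not is_diacritic(ch)]
    return len(regulars) == rule.regular_len and word.endswith(rule.suffix)

# Hypothetical placeholder rule: two regular characters, ending in MA (U+0AAE).
example_rule = Rule(regular_len=2, suffix="\u0aae")
```

Additional per-rule conditions (on the diacritic carried by the first, middle, or last character) can be layered onto the same regular/diacritic split.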

4 Comparison with Other Approaches

Most researchers have created stop word lists for the Gujarati language by manual inspection by linguistic experts and on the basis of word frequency. A list of the existing approaches used for Indian languages to identify stop words is presented in Table 2.

Table 2 Existing approaches

Apart from these approaches, statistical approaches have also been used to generate stop word lists. In almost all existing approaches, the first step is a frequency calculation for each word. But in many cases a word with high frequency carries significant meaning in the document and cannot be considered a stop word. Second, many researchers have used statistical approaches for the English language and achieved good accuracy, because many English stop words do not have multiple forms, for instance: “any,” “is,” “a,” “the,” “an.” But for the Gujarati language a statistical approach will lead to a loss of accuracy, because a single stop word has multiple forms, for instance:
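The frequency-first step shared by these statistical approaches can be sketched as follows (an illustrative Python fragment with a toy English corpus; it is not any cited author's implementation, and it exhibits exactly the weakness noted above, since a frequent content word would rank just as high as a true stop word):

```python
from collections import Counter

def top_frequency_words(documents, n):
    """Naive first step of frequency-based stop word listing:
    rank words by corpus frequency and keep the top n as candidates."""
    counts = Counter(word for doc in documents for word in doc.split())
    return [word for word, _ in counts.most_common(n)]

docs = ["the cat is on the mat", "the dog is in the house"]
candidates = top_frequency_words(docs, 2)  # ["the", "is"]
```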

4.1 Precise Benefits of Proposed Approach Over Existing Approaches

The research works found in the related literature are based on a training dataset and/or the length of the word. The proposed approach is free from both the length of the word and the requirement of a training data set. It is noteworthy that deploying a training dataset often leads to biased training of the system, more so in the absence of a standard text corpus for a resource-scarce language like Gujarati. The proposed approach is hence free from machine-learning-based techniques and, consequently, free from the risk of becoming obsolete over time.

4.2 Known Limitations of Proposed Approach

The proposed work, in its present state, “will not perform well” only for stop words that contain more than three characters. It will also “not perform well” with specific words belonging to a peculiar domain. Still, two points are worth mentioning here. First, the phrase “will not perform well” should be taken with a pinch of salt, as the only detrimental effect on the system is a slight reduction in accuracy. Second, the probability of encountering peculiar domain-specific stop words is very low, more so during usual text processing and natural language processing tasks for any language, and even more so for a resource-scarce language like Gujarati. In neither case do the proposed rules prove detrimental enough to prevent wide implementation of the system and its acceptance in the scientific community.

5 Empirical Setup and Results

Indeed, there is no a priori definition of stop words, and their handling is governed by the domain and application area in which they are used. Still, NLP tasks like machine translation (MT), POS tagging, and classification make use of a general stop-word removal phase. The term “general stop-word removal” emphasizes that there are words with high frequency whose removal helps in faster processing as well as in dimension reduction in terms of space requirements. This paper neither intends to highlight the domain or application area in which stop words should be removed, nor focuses on the number of stop words to be removed. The scientific literature of natural language processing has many instances of stop-word removal. This is true for the Gujarati language, other Indo-Aryan languages, and various international languages. This paper emphasizes that if stop words have to be removed from Gujarati documents, there is no need to implement a word-frequency-based approach, a word-length-based approach, or manual inspection. Exploiting the morphological structure and symmetry of Gujarati stop words, this paper proposes a rule-based approach for stop word removal from Gujarati documents. This approach can be used anywhere that general (i.e., non-application- and non-domain-specific) removal of stop words is required. Even where application- and domain-specific removal of stop words is required for extrinsic evaluation of a system, the proposed “generic” rules could be applied before implementing the domain and application specificities. As the proposed rules can be applied anywhere that removal of stop words is required, we term them “generic.”

This section describes the source of the data collected for the empirical implementation of the proposed rules. The system was implemented using Java Server Pages (JSP) technology; the results follow.

5.1 Data Sets

The data was collected randomly from multiple free Gujarati websites to avoid the bias of any single website on the proposed work. For experimental purposes, 373 documents were prepared for the routine Gujarati category, each containing more than 400 words. We also prepared 224 documents for the domain-specific (medical and engineering) categories, each containing more than 275 words.

5.2 Results

For the Gujarati language, there is no automated tool readily available to calculate accuracy; hence, we manually went through each document, evaluated the performance of the system, and recorded the accuracy results side by side. The average accuracy on routine Gujarati documents was 98.10%. Similarly, for the domain-specific medical and engineering categories, the average accuracy was 94.08%. We also pondered the reasons for not reaching 100% accuracy on routine Gujarati documents and found the cause to be the presence of stop words containing more than three characters. Similarly, the lack of 100% accuracy on the domain-specific categories owes to the presence of peculiar domain-biased words. The average accuracy on routine Gujarati documents exceeds that on specific-domain documents by 4.02% because the latter contain many domain-specific words that were not identified by any rule.

6 Conclusion and Future Work

We have presented an effective approach to accurately identify and eliminate a high percentage of the stop words in written Gujarati documents. The proposed work uses a rule-based approach to identify stop words dynamically. The average accuracy for routine Gujarati documents was 98.10%, and for the specific domains (medical and engineering) we obtained 94.08% accuracy. We advocate that these results are reproducible on other large corpora of routine Gujarati documents as well. We propose that this approach is more efficient than the other existing approaches available for identifying stop words in Gujarati documents. The approach presented here is currently limited in that it does not handle stop words containing more than three characters or words belonging to a specific domain; this is our focus for future work. The proposed approach can be applied as a preprocessing step for many NLP tasks, including text classification, information retrieval, and document clustering, to name a few.