Named Entity Recognition from Gujarati Text Using Rule-Based Approach

Shah, Dikshan N.; Bhadka, Harshad B.

doi:10.1007/978-3-319-76348-4_76


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 736)

Included in the following conference series:

International Conference on Intelligent Systems Design and Applications


Abstract

Named Entity Recognition (NER) is an application of Natural Language Processing (NLP) and a subtask of Information Extraction. It is used for automated text processing in many industries and is a key concept in academia, artificial intelligence, robotics, bioinformatics, and other fields. NER is an essential step in major NLP tasks such as machine translation, question answering, and document summarization. Most NER work has been done for European languages, and only a few Indian constitutional languages have been covered. Progress has been limited by challenges such as lack of resources, ambiguity in the language, and rich morphology. In this paper, rules are defined using a rule-based approach to identify various named entities in a text document. Based on the defined rules, three test cases were run on the training dataset, achieving an accuracy of 70%.


1 Introduction

The phrase Named Entity (NE) was coined during the 6th Message Understanding Conference (MUC-6) in 1995. Many NER systems were developed after that. Most of the early work targeted European languages, and those systems were highly precise [6]. A named entity is a piece of structured information in text that refers to a predefined category: proper names such as persons, locations, and organizations, as well as temporal and numeric expressions such as years, dates, months, monetary amounts, and percentages [2].

Named Entity Recognition (NER) systems have proved significant for many tasks in Natural Language Processing (NLP), such as information retrieval, machine translation, information extraction, and question answering. The objective of NER is to classify each word of a document into one of the predefined named entity classes.

1.1 Existing NER Approaches

Existing NER systems have been built mainly using knowledge-based (linguistic) and machine learning approaches.

1.1.1 Rule-Based Approach

The linguistic or knowledge-based approach is commonly called the rule-based approach. It uses a set of hand-crafted rules designed and described by human experts, especially linguists. The rules are patterns that combine grammatical, syntactic, linguistic, and orthographic features with dictionaries. Building such a system requires thorough knowledge of the target language and is time-consuming.

1.1.2 Statistical Approach

Machine learning, or the statistical approach, is a faster way to build an NER system. It either supplements a rule-based system or uses sequence labeling algorithms to learn from a collection of training examples. The accuracy of this approach depends entirely on the training dataset. Machine learning models used for NER include the Hidden Markov Model, Conditional Random Field, and Maximum Entropy.

1.1.3 Hybrid System

A hybrid system combines statistical tools with linguistic rules. Using both approaches together makes the system more precise and effective.

1.2 About the Gujarati Language

Gujarati is an Indo-Aryan language of the Indo-European family, and its script was adapted from the Devanagari script. Its alphabet mainly comprises 34 consonants and 14 vowels [1]. The language is widely spoken, with more than 50 million speakers across India, and it is the official language of the state of Gujarat.

2 Related Work on Different Indian Languages in NER

NER work has been done for some of the constitutional Indian languages. Table 1 lists the NER approaches used for various Indian languages together with their accuracies.

Table 1. Approaches used for various Indian languages and their accuracies


3 Rule-Based Approach

For example, a morphological analyzer for Hindi analyzes Hindi sentences and produces the features of each word along with its root word [7]. Because a rule-based approach is domain-specific, rules defined for one language do not apply to other languages. Some of the rules used to identify different tags in Gujarati are as follows:

3.1 Date and Time

This rule is applied to input that contains various date and time formats. Regular expressions are used to identify these tags [15]. The date and time tagset includes the following:
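As a rough illustration of regular-expression matching for date and time tags, the sketch below accepts both Hindu-Arabic and Gujarati digits. The specific formats (DD/MM/YYYY with `/`, `.` or `-` separators, and HH:MM) and the names `DATE_RE`, `TIME_RE`, `find_dates` and `find_times` are assumptions made for this example; the paper does not list its actual patterns.

```python
import re

# Assumption: dates look like DD/MM/YYYY (separators / . -) and times like HH:MM.
# Digits may be Hindu-Arabic (0-9) or Gujarati (U+0AE6 to U+0AEF).
DIGIT = r"[0-9\u0AE6-\u0AEF]"

DATE_RE = re.compile(
    rf"(?<!{DIGIT})({DIGIT}{{1,2}})[/.-]({DIGIT}{{1,2}})[/.-]({DIGIT}{{4}})(?!{DIGIT})"
)
TIME_RE = re.compile(rf"(?<!{DIGIT})({DIGIT}{{1,2}}):({DIGIT}{{2}})(?!{DIGIT})")


def find_dates(text):
    """Return every substring that matches the assumed date format."""
    return [m.group(0) for m in DATE_RE.finditer(text)]


def find_times(text):
    """Return every substring that matches the assumed time format."""
    return [m.group(0) for m in TIME_RE.finditer(text)]
```

The lookarounds keep a match from starting or ending in the middle of a longer digit run, so a six-digit year is not reported as a date.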

3.1.1 Year

Samvat refers to the epoch of the several Hindu calendar systems used in India and Nepal. The three most significant Samvat eras are the Vikrama era, the Old Shaka era, and the Shaka era of 78 AD [15].

3.1.2 Month Names

The names of the Indian months vary by region. Hindu calendars are based on the lunar cycle, and the month names are usually phonetic variants of each other [15].

3.1.3 Days

The Hindu calendar has two measures of a day, one based on lunar movement and the other on solar movement. The solar or civil day is called divas, and the lunar day is called tithi. A lunar month has 30 tithis. The lunar months begin with Kartak [16].

3.2 Location

Suffix matching is used for location names and terms. Many names of Indian states and cities are formed with one of the following suffixes [17]:

Location names that end with ‘pur’
Location names that end with ‘Ghar’
Location names that end with ‘stan’
Location names that end with ‘bad’
Location names that end with ‘Nagar’
Location names that end with ‘pat’
Location names that end with ‘nath’
Location names that end with ‘mer’
Location names that end with ‘kot’
Location names that end with ‘Ishwar’
Location names that end with ‘Wada’
Location names that end with ‘giri’
Location names that end with ‘Puram’
Location names that end with ‘uru’
Location names that end with ‘patnam’
Location names that end with ‘guri’
Location names that end with ‘tal’
Location names that end with ‘Dwar’
Location names that end with ‘Wada’
Location names that end with ‘Palli’
Location names that end with ‘Malai’
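A minimal sketch of suffix matching for location candidates is given below. The Gujarati-script spellings chosen for three of the suffixes ('pur', 'Nagar', 'kot'), the sample tokens, and the names `LOCATION_SUFFIXES` and `is_location_candidate` are assumptions for illustration and are not taken from the paper's rule set.

```python
# Assumption: Gujarati-script spellings of three suffixes from the list above
# ('pur', 'Nagar', 'kot'). A full rule set would cover all listed suffixes.
LOCATION_SUFFIXES = ("પુર", "નગર", "કોટ")


def is_location_candidate(token):
    """Return True if the token ends with a known location suffix.

    The token must be longer than the suffix, so the bare suffix
    (for example the common noun for 'town') is not tagged by itself.
    """
    return any(
        token.endswith(suffix) and len(token) > len(suffix)
        for suffix in LOCATION_SUFFIXES
    )
```

Suffix matching alone can over-generate, because ordinary words may share an ending with a place name, so a gazetteer check on the result is a natural complement.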

3.3 Abbreviations in Date and Time Tag Entities

An abbreviation stands for an original full name. Some date, month, and year entities appear in text in abbreviated form [15].

3.4 Numerals

Numbers can be written in more than one way. We use two numeral systems: Hindu-Arabic numerals and Gujarati numerals. Number names also differ between languages. The numbers 0 to 100 are written differently in the two systems, and each has its own Gujarati name [18] (Table 2).
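The handling of the two numeral systems and of number names can be sketched as a digit translation plus a dictionary lookup. The code below is only an illustration: `NUMBER_NAMES` holds a small subset (0 to 5) of Gujarati number names chosen for this example, and the function name `numeral_value` is hypothetical.

```python
# Gujarati digits occupy U+0AE6 (zero) to U+0AEF (nine).
GUJARATI_DIGITS = "".join(chr(0x0AE6 + i) for i in range(10))
TO_HINDU_ARABIC = str.maketrans(GUJARATI_DIGITS, "0123456789")

# Assumption: a small illustrative subset of the number-names dictionary.
NUMBER_NAMES = {"શૂન્ય": 0, "એક": 1, "બે": 2, "ત્રણ": 3, "ચાર": 4, "પાંચ": 5}


def numeral_value(token):
    """Return the integer value of a numeral or number name, else None."""
    normalized = token.translate(TO_HINDU_ARABIC)
    if normalized.isascii() and normalized.isdigit():
        return int(normalized)
    return NUMBER_NAMES.get(token)
```

Normalizing Gujarati digits to Hindu-Arabic digits first means a single numeric check covers both systems, and the dictionary is needed only for numbers written out in words.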

Table 2. Number Names of Hindu Arabic Numerals in Gujarati


4 Research Methodology

We developed rules, following a rule-based approach, to recognize various named entities. As a corpus, we collected Gujarati-language documents from the e-newspaper ‘Gujarat Samachar’. The newspaper covers various news categories, such as entertainment, sports, and religion. From these, we gathered 100 documents from the sports category to identify the named entity tagset.

A. Preparation of Database

Based on the categories of the tagset, the following dictionaries were created.

Date Dictionary: The date tagset contains the day, the month number and name, and the year. Days are also categorized according to the Hindu calendar and the Panchang. Tithis and days are stored in a gazetteer list.

Location Dictionary: Location names are limited to a specific area or country. 21 suffix-stripping rules were created for location names, covering the cities, states, and villages of India.

Abbreviation Dictionary: Lists the abbreviations of dates, months, and days.

Number-names Dictionary: Lists the Gujarati number names for the Hindu-Arabic numerals 0 to 100 (Fig. 1).

B. Architecture of System

Step – 1 Input text - Upload a text file containing raw data in the Gujarati language.
Step – 2 Preprocessing - Prepare the Gujarati text document for processing.
Step – 3 Tokenization - Tokenize the input text word by word for pattern matching.
Step – 4 Entity detection - Detect Date, Time, and Location entities using the defined rules. If any rule matches, go to step 7.
Step – 5 Abbreviation detection - Detect abbreviations by comparing tokens with the gazetteer list. If a match is found, go to step 7.
Step – 6 Numeral detection - Detect numbers and number names written out in words. If a match is found, go to step 7.
Step – 7 Output - Display the tagged output generated by the system, together with the untagged text, to the user.
Step – 8 End
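The flow of steps 3 to 7 can be sketched as a chain of rule checks applied to each token in order. This is a simplified illustration under stated assumptions, not the authors' implementation: the date pattern, the three location suffixes, the single abbreviation entry, and the number-name subset are hypothetical stand-ins for the dictionaries described above.

```python
import re

# Hypothetical stand-ins for the rule set and dictionaries.
DATE_RE = re.compile(r"^\d{1,2}[/.-]\d{1,2}[/.-]\d{4}$")  # \d also matches Gujarati digits
LOCATION_SUFFIXES = ("નગર", "પુર", "કોટ")
ABBREVIATIONS = {"તા."}  # assumed abbreviation entry for 'date'
NUMBER_NAMES = {"એક", "બે", "ત્રણ"}


def tag_token(token):
    """Apply the rules in the order of steps 4 to 6; first match wins."""
    if DATE_RE.match(token):
        return "DATE"
    if any(token.endswith(s) and len(token) > len(s) for s in LOCATION_SUFFIXES):
        return "LOCATION"
    if token in ABBREVIATIONS:
        return "ABBREVIATION"
    if token.isdigit() or token in NUMBER_NAMES:
        return "NUMBER"
    return None  # untagged token


def tag_text(text):
    """Step 3 (whitespace tokenization) followed by per-token tagging."""
    tokens = [t.strip(",;") for t in text.split()]
    return [(t, tag_token(t)) for t in tokens if t]
```

Returning every token with its tag, including `None` for untagged tokens, mirrors step 7, where the tagged output is shown alongside the untagged text.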

5 Experimental Result and Analysis

The core objective of the experiment is to identify the patterns of named entities recognized by the proposed NER algorithm. We collected documents from the sports category to recognize named entities such as dates, day names, month names, tithis, locations, and numerals (Table 3).

Table 3. Test cases applied to the dataset


Among the given 363 words, 194 entities were correctly identified by applying the test cases, and the system achieved an accuracy of 70%.

6 Conclusions

New techniques can still be developed to improve the performance of NER for the Gujarati language. We developed rules to identify various named entities, which is useful in many significant applications. We studied the existing NER approaches and found that, among the constitutional Indian languages, there is still wide scope for NER work. By applying our rules to the given dataset, we attained an accuracy of 70%. As future work, we plan to build more precise rules covering more named entity types to achieve better accuracy.

References

1. Athavale, V., Bharadwaj, S., Pamecha, M., Prabhu, A., Shrivastava, M.: Towards Deep Learning in Hindi NER: An approach to tackle the Labelled Data Scarcity (2016)
2. Jiandani, K.S.D., Bhattacharyya, P.: Hybrid inflectional stemmer and rule-based derivational stemmer for Gujarati. In: Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (WSSANLP 2011), November 2011
3. Amarappa, S., Sathyanarayana, S.V.: Kannada named entity recognition and classification (NERC) based on multinomial naïve Bayes (MNB) classifier. Int. J. Nat. Lang. Comput. (IJNLC) 4, 39–52 (2015)
4. Alfred, R., Leong, L.C., On, C.K., Anthony, P.: Malay named entity recognition based on rule-based approach. Int. J. Mach. Learn. Comput. 4(3), 300–306 (2014)
5. Sathyanarayana, S.A.: A hybrid approach for named entity recognition, classification and extraction (NERCE) in Kannada documents. In: Proceedings of International Conference on Multimedia Processing, Communication, and Info. Tech., MPCIT (2013)
6. Singh, A.K.: Named entity recognition for South and South East Asian languages: taking stock. In: Proceedings of the IJCNLP Workshop on NER for South and South East Asian Languages, pp. 5–16 (2008)
7. Agarwal, A., Singh, S.P., Kumar, A., Darbari, H.: Morphological analyser for Hindi - a rule-based implementation. Int. J. Adv. Comput. Res. 4(1), 19 (2014)
8. Sharma, L.K., Mittal, N.: Named entity based answer extraction from Hindi text corpus using n-grams. In: 11th International Conference on Natural Language Processing, p. 362, December 2014
9. Sasan, T.S., Jamwal, S.S.: Transliteration of name entities using rule-based approach. Int. J. Adv. Res. Comput. Sci. Soft. Eng. 6(6) (2016)
10. Jahan, N., Morwal, S., Chopra, D.: Named entity recognition in Indian languages using gazetteer method and hidden Markov model: a hybrid approach. IJCSET, March 2012
11. Abinaya, N., Kumar, M.A., Soman, K.P.: Randomized kernel approach for named entity recognition in Tamil. Indian J. Sci. Technol. 8(24), 1–7 (2015)
12. Kaur, Y., Kaur, E.: Named Entity Recognition system for Hindi Language using a combination of rule-based approach and list lookup approach. Int. J. Sci. Res. Manag. (IJSRM) 3(3), 2300–2306 (2015)
13. Aboaoga, M., Ab Aziz, M.J.: Arabic person names recognition by using a rule-based approach. J. Comput. Sci. 9(7), 922 (2013)
14. Bhalla, D., Joshi, N., Mathur, I.: Rule-based transliteration scheme for English to Punjabi (2013)
15. Guj-Ind-StyleGuide. http://download.microsoft.com/download/7/2/0/720b015e-94f9-4b6e-911f-539f38c60774/guj-ind-styleguide.pdf
16. Tithi (Internet). https://en.wikipedia.org/wiki/Tithi
17. Indian Place Names (Internet). http://www.irfca.org/docs/place-names.html
18. Gujarati Number names for Digits (Internet). https://www.omniglot.com/language/numbers/gujarati.htm


Author information

Authors and Affiliations

Faculty of Computer Applications, S S Agrawal Institute of Computer Science, Navsari, India
Dikshan N. Shah
Faculty of Computer Science, C U Shah University, Wadhwan, India
Harshad B. Bhadka


Corresponding author

Correspondence to Dikshan N. Shah.

Editor information

Editors and Affiliations

Machine Intelligence Research Labs, Auburn, Washington, USA
Ajith Abraham
Department of Computer Science, South Asian University, Chanakyapuri, Delhi, India
Pranab Kr. Muhuri
Faculty of Information and Communication Technology, Universiti Teknikal Malaysia Melaka, Durian Tunggal, Melaka, Malaysia
Azah Kamilah Muda
Machine Intelligence Research Labs, Auburn, Washington, USA
Niketa Gandhi


About this paper

Cite this paper

Shah, D.N., Bhadka, H.B. (2018). Named Entity Recognition from Gujarati Text Using Rule-Based Approach. In: Abraham, A., Muhuri, P., Muda, A., Gandhi, N. (eds) Intelligent Systems Design and Applications. ISDA 2017. Advances in Intelligent Systems and Computing, vol 736. Springer, Cham. https://doi.org/10.1007/978-3-319-76348-4_76


DOI: https://doi.org/10.1007/978-3-319-76348-4_76
Published: 22 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-76347-7
Online ISBN: 978-3-319-76348-4
eBook Packages: Engineering, Engineering (R0)
