
1 Introduction

In communication, a language acts as a model for transferring information through a standardized set of rules called its grammar. A grammar checker built with machine learning combines an application of Artificial Intelligence with Computational Linguistics. Its general functionality is depicted in Fig. 1.

Fig. 1 Diagram of grammar checker functionality

Although a great deal of research has been carried out on grammar checking, particularly for English and other foreign languages, far less has been done for Indian languages such as Punjabi. Statistics reveal that there are 6,900 spoken languages throughout the world. Punjabi falls among the top ten, with 120 million total speakers, of whom 109 million are native speakers; Mandarin is the most widely spoken language, and English ranks fourth. The irony of the situation is that on the internet English holds the lion's share at 26.8%, Chinese accounts for 24.2%, and Spanish for 7.8%, whereas all other languages together contribute a meager 26.4%. This imbalance is sufficient motivation for the research community to contribute in this sphere. Punjabi belongs to the Indo-Aryan family of languages, generally referred to as Indic languages, and is morphologically rich.

Grammar checking systems are mostly an integral part of specific word processors. For instance, for English the feature is built into Microsoft Office by default, and for Punjabi such functionality is provided in AKHAR (software designed exclusively for literary purposes). An Urdu grammar checker was developed by [1], a Bangla grammar checker by [2], a Punjabi grammar checker was proposed by [3], and a grammar checker for Hindi was contributed by [4].

Grammar checking methodologies are rule-based [5], statistical (data-driven) [6], or hybrid [7]. The rule-based approach is used more frequently than the other techniques. In this technique, a corpus is used to frame rules in if-then-else form, and a given sentence is then supplied as input to check the accuracy of the grammar checker. A notable strength of this technique is that such rules are crafted easily and can be modified as and when required. Another motivation for using it is that programming is not a prerequisite, so a linguist can contribute to rule creation. Additionally, details of any error are provided easily. Last but not least, such rules can handle the basic features of a specific language without major modifications to accommodate an input sentence. Rule-based systems have been developed for languages such as Dutch [8], Slavic [9], English [10,11,12,13], Punjabi [14], Swedish [15,16,17,18,19,20], German [21], Korean [22], Danish [23], French [24, 25], Portuguese [26], Persian [27], Afan Oromo [28], Chinese [29], and Malay [30].
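The idea that rules can be written and edited without programming can be illustrated with a small sketch in which the rules are plain data and the matching code never changes. This is only an illustration under stated assumptions: the tag names (`ADJ_M`, `NOUN_F`, and so on), the rule format, and the `check` function are hypothetical and are not taken from any of the cited systems.

```python
# Hypothetical sketch: each rule is a data record (a tag pattern plus an
# error message), so a linguist could add, edit, or delete rules without
# touching the matching code. Tags and messages are invented for illustration.
RULES = [
    {"pattern": ("ADJ_M", "NOUN_F"),
     "message": "adjective and noun disagree in gender"},
    {"pattern": ("ADJ_F", "NOUN_M"),
     "message": "adjective and noun disagree in gender"},
]


def check(tags):
    """Return (position, message) for every place a rule pattern matches."""
    errors = []
    for rule in RULES:
        width = len(rule["pattern"])
        for i in range(len(tags) - width + 1):
            if tuple(tags[i:i + width]) == rule["pattern"]:
                errors.append((i, rule["message"]))
    return errors
```

Because each match carries its own message, the checker can report details of the error directly, which mirrors the advantage described above.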

A statistical grammar checker uses an annotated corpus obtained from journals, magazines, or other documents. Rules for this system are manually generated. The correctness of a sentence is validated in two steps: the sentence is first passed through a rule of thumb, and on success it is checked against the grammar checker with the help of the corpus. If it passes, the sentence is deemed grammatically correct; otherwise it is flagged as a grammatical error. In the case of supervised learning, rules are framed from the given sample as production rules and are used to check the accuracy of the given sentence. The drawback of this technique is that it is very difficult to detect and recognize an error in the sentence or in the system.

An alternative is a hybrid implementation that combines rule-based and statistical grammar checking, which results in a more robust system with higher efficiency.

This paper is organized as follows. Sections 2 and 3 present the literature on computational linguistics and the existing rule-based Punjabi Grammar Checker. Section 4 presents a critical analysis of the shortcomings of existing techniques in light of sentences procured from standardized organizations and corpora such as CDAC, TDIL, language newspapers, and texts. Section 5 presents a novel model to justify an advanced Punjabi Grammar Checker. Section 6 reports the results, and Section 7 closes the paper and suggests some areas for future investigation.

2 Existing Punjabi Grammar Checker

A notable aspect of the prevailing Grammar Checker is that it follows a purely rule-based philosophy and makes no use of the statistical approach, i.e., it relies on an exhaustive set of hand-crafted rules. Rules can easily be edited, added, or deleted as and when required, because they are production rules written by a linguistic expert without any specific intervention by a programmer.

In the current system, to evaluate the correctness of a sentence, the input is given to the Grammar Checker, which identifies the end of each sentence with the help of punctuation and breaks the input down into units, i.e., tokenization and detection of phrases are done here [31].

In the preliminary phase, the data are pre-processed. Pre-processing checks for the presence of phrases and tokenizes the sentence into individual words. Once this is complete, the checker performs Morphological Analysis (MA), Part-of-Speech (POS) tagging, Error Detection, and Correction. This rule-based approach analyzes the language at the morphological and syntactic levels. The Morphological Analyzer analyzes each input word and assigns grammatical information as part-of-speech tags. The suggestions generated for grammatical errors use the root of a particular word along with a full-form lexicon. The Part-of-Speech Tagger and Phrase Chunker also follow the rule-based approach. The Phrase Chunker groups words according to predefined phrase chunking rules. Finally, rules are applied at the sentence level to check for grammatical errors. The components of the system are described as follows:

2.1 Pre-Processing Phase

In the preliminary phase, a Punjabi text is given as input and undergoes tokenization, identification of punctuation symbols, detection of contractions, and identification of colloquial forms and phrases, if any. In essence, this phase prepares the input text for the next phase, morphological analysis, as shown in Fig. 2.

Fig. 2 Pre-processing system design

2.2 Morphological Analyzer

Using a full-form lexicon, all possible tags are assigned to each word of the given text. Twenty-two word classes, such as noun, adjective, pronoun, verb, adverb, conjunction, interjection, postposition, ordinal, and cardinal, are used for classification as per Punjabi grammar. Adjectives are categorized as inflected or uninflected. Similarly, pronouns are classified as personal, interrogative, demonstrative, relative, reflexive, or indefinite, and verbs as main, auxiliary, or operator verbs. Additionally, details such as number, gender, and tense are added depending on the word class. The lexicon used by this analyzer is full-form, i.e., all common words from the literature are stored with their respective root and relevant grammatical information, as shown in Fig. 3.

Fig. 3 Morphological analyzer flow diagram
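A full-form lexicon lookup of the kind described in this section might look like the sketch below. The entries are romanized, illustrative, and far from complete; the field names and the `analyze` function are assumptions made for the example, not the data format of the existing analyzer.

```python
# Hypothetical full-form lexicon: every surface form is stored with its root
# and grammatical information. A form may have more than one analysis.
LEXICON = {
    "munda": [
        {"root": "munda", "pos": "noun", "gender": "m", "number": "sg", "case": "direct"},
    ],
    "munde": [
        {"root": "munda", "pos": "noun", "gender": "m", "number": "pl", "case": "direct"},
        {"root": "munda", "pos": "noun", "gender": "m", "number": "sg", "case": "oblique"},
    ],
    "changa": [
        {"root": "changa", "pos": "adjective", "gender": "m", "number": "sg", "case": "direct"},
    ],
}


def analyze(word):
    """Return every analysis stored for a word, or one 'unknown' analysis."""
    return LEXICON.get(word, [{"root": word, "pos": "unknown"}])
```

A form such as `munde` receives two analyses here, which is the kind of ambiguity a later tagging stage has to resolve, and a word missing from the lexicon falls back to an "unknown" analysis.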

2.3 POS Tagger

In case of ambiguity, i.e., when multiple tags are assigned to a single word, a rule-based POS (part-of-speech) tagger is used to resolve it. The current system uses a tag set of more than 600 tags, and word-specific tags are used in addition. For instance, the tag NMSD denotes a noun that is masculine, singular, and direct. Since no statistical corpus is used, the existing system is purely rule-based. The rules are applied in sequential order, as shown in Fig. 4.

Fig. 4 POS Tagger flow diagram
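Sequential, rule-based disambiguation can be sketched as below. Only the tag NMSD comes from the description above; the tags `NMSO` (oblique singular), `NMPD` (direct plural), `PP` (postposition), and `VB`, the two rules, and all function names are hypothetical and grossly simplified.

```python
def _is_oblique_noun(tag):
    # Assumed tag convention: nouns start with "N"; a final "O" marks oblique case.
    return tag.startswith("N") and tag.endswith("O")


def _next_tags(i, candidates):
    return candidates[i + 1] if i + 1 < len(candidates) else set()


def rule_oblique_before_postposition(i, candidates):
    """A noun directly before a postposition keeps only its oblique readings."""
    if _next_tags(i, candidates) == {"PP"}:
        kept = {t for t in candidates[i] if _is_oblique_noun(t)}
        return kept or candidates[i]  # never leave a word without a tag
    return candidates[i]


def rule_direct_elsewhere(i, candidates):
    """Elsewhere, oblique noun readings are dropped when an alternative exists."""
    if _next_tags(i, candidates) != {"PP"}:
        kept = {t for t in candidates[i] if not _is_oblique_noun(t)}
        return kept or candidates[i]
    return candidates[i]


# Rules are applied one after another, in this fixed order.
RULES = [rule_oblique_before_postposition, rule_direct_elsewhere]


def disambiguate(candidates):
    """candidates: one set of possible tags per word. Returns the reduced sets."""
    result = [set(c) for c in candidates]
    for rule in RULES:
        for i in range(len(result)):
            result[i] = rule(i, result)
    return result
```

For a two-word input whose first word is ambiguous between `NMPD` and `NMSO` and whose second word is a postposition, only `NMSO` survives for the first word.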

2.4 Phrase Chunker

Text is grouped into phrases on the basis of phrase chunking rules, again following a rule-based approach. Different tag sets are used for different cases, such as direct or indirect. The polarity of a sentence, i.e., its meaning, is also considered when framing such rules, as shown in Fig. 5.

Fig. 5 Phrase chunker flow diagram
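As a rough illustration of rule-based chunking, the sketch below groups a run of adjectives followed by a noun into one noun phrase. The single chunking rule, the coarse tags (`ADJ`, `NOUN`, `VERB`), and the chunk labels are assumptions chosen for brevity; the existing chunker's rules and tag sets are not reproduced here.

```python
def chunk(tagged):
    """Group (word, tag) pairs into chunks.

    Assumed rule: zero or more adjectives followed by a noun form a noun
    phrase ("NP"). Every other word becomes a single-word chunk labeled
    with its own tag.
    """
    chunks = []
    pending_adjectives = []

    def flush():
        # Adjectives that were not followed by a noun become their own chunks.
        for word in pending_adjectives:
            chunks.append(("ADJ", [word]))
        pending_adjectives.clear()

    for word, tag in tagged:
        if tag == "ADJ":
            pending_adjectives.append(word)
        elif tag == "NOUN":
            chunks.append(("NP", pending_adjectives + [word]))
            pending_adjectives.clear()
        else:
            flush()
            chunks.append((tag, [word]))
    flush()
    return chunks
```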

3 Error Checker and Corrections

In this phase, rules that address grammatical errors in phrases and in sentence-level agreement are applied. Error detection rules identify potential errors, and relevant corrections are suggested on the basis of the contextual information where an error occurs.

The overall system is summarized in Fig. 6.

Fig. 6 Model of existing Punjabi grammar checker
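An agreement check that produces a correction from the root of a word and a full-form table could be sketched as follows. The `FORMS` table, the feature names, and the `suggest` function are hypothetical, and the romanized forms are illustrative only.

```python
# Hypothetical full forms of one adjective root, keyed by (gender, number).
FORMS = {
    "changa": {("m", "sg"): "changa", ("f", "sg"): "changi"},
}


def suggest(adjective, noun):
    """Return a corrected adjective form, or None when no change is needed.

    Both arguments are dicts with the keys 'root', 'gender', and 'number'.
    None is also returned when the lexicon has no suitable form.
    """
    wanted = (noun["gender"], noun["number"])
    if (adjective["gender"], adjective["number"]) == wanted:
        return None  # agreement already holds
    return FORMS.get(adjective["root"], {}).get(wanted)
```

Note that this sketch always corrects the adjective and never the noun, which is the behaviour that Section 4 lists among the shortcomings of the current rules.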

4 Critical Analysis and Shortcomings

The existing Punjabi Grammar Checker detects grammatical mistakes only in simple sentences; it lacks support for compound and complex sentences and raises false alarms. It has no component for guessing unknown words. Further, its limited coverage of certain words affects its precision and recall. Moreover, spell checking is not available. The structure also lacks support for other languages of the Modern Indo-Aryan family, such as Hindi and Bengali [32]. The distinct features of such languages are highlighted in Table 1.

Table 1 Analysis of Indo-Aryan languages

Similar theories have been put forward for other languages, including European ones [35,36,37,38,39]. The existing Punjabi Grammar Checker was run against seventy-five sentences collected from a standardized repository (as stated earlier), and the results were disappointing. The chosen sentences were processed at the following URLs:

a. http://punjabi.aglsoft.com/

b. http://pgc.learnpunjabi.org/

The analysis report gives the total number of errors (including false alarms) arising from the individual phases of the Grammar Checker and helps in visualizing the inefficiency of its individual components [40,41,42,43]. The report is presented in Table 2.

Table 2 Analysis of Punjabi sentences

The component-wise reasons for such errors may be attributed to the following factors:

a. In a Punjabi sentence, modifiers must agree with the noun in gender, number, and case.

b. In noun-adjective agreement, the noun sometimes needs to be changed and not only the adjective. Under the current rule, the adjective is always changed.

c. The POS tagger was not able to remove ambiguity; contrary to its purpose, it passed on the same result as the MA.

d. Whenever a word is encountered whose root cannot be traced, the “unknown” tag is assigned.

5 Proposed Framework for Punjabi Grammar Checker

The shortcomings listed above may be overcome by a hybrid technique that combines grammar rules with machine learning [44]. Until now, a hybrid approach has not been used for Punjabi grammatical error detection because no standard Punjabi corpus is available for machine learning [45]. A two-step approach may be followed.

a. Step One

The working of each component of the existing rule-based system is studied through the following flow of steps.

As shown in Fig. 7, once an incorrect Punjabi sentence is given as input, efficiency is calculated phase-wise, i.e., after MA, Tagging, Chunking, Error Detection, and Error Correction, respectively, so as to evaluate the accuracy of each component.

Fig. 7 Proposed model for measuring accuracy
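Phase-wise measurement can be sketched as a comparison of each phase's output with a reference output, together with the usual precision and recall computed over flagged errors. The data layout (one list of per-sentence results per phase), the availability of reference outputs, and the function names are assumptions made for this sketch.

```python
PHASES = ["MA", "Tagging", "Chunking", "Error Detecting", "Error Correcting"]


def phase_accuracy(outputs, gold):
    """Fraction of sentences for which each phase matches the reference.

    outputs and gold map a phase name to a list of per-sentence results.
    A phase with no data is reported as 0.0.
    """
    report = {}
    for phase in PHASES:
        pairs = list(zip(outputs.get(phase, []), gold.get(phase, [])))
        correct = sum(1 for produced, reference in pairs if produced == reference)
        report[phase] = correct / len(pairs) if pairs else 0.0
    return report


def precision_recall(flagged, actual):
    """flagged: set of errors the checker reported; actual: set of real errors."""
    true_positives = len(flagged & actual)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

In this formulation a false alarm is a flagged item that is not in the set of real errors, so false alarms lower precision while missed errors lower recall.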

b. Step Two

The components responsible for false alarms are identified through a two-phase process, and an algorithm to improve these components is proposed. In Phase I, the Grammar Checker performs a preliminary check with the help of certain rules. A sentence found incorrect is passed on to Phase II. In Phase II, the output of Phase I passes through each component (step) in turn to check whether that component is faulty. A component is faulty if its output is incorrect; otherwise, the output is passed to the next step, and so on. The step-by-step approach is described in Fig. 8.

Fig. 8 Proposed model for evaluating faulty component
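Phase II of this procedure can be sketched as a loop that runs the components in order and stops at the first one whose output differs from a reference. The sketch assumes that Phase I has already selected the sentence, that a reference output exists for every component, and that each component is a function of the previous component's output; all names are hypothetical.

```python
from collections import Counter


def find_faulty_component(sentence, components, expected):
    """Return the name of the first component whose output is incorrect.

    components: ordered list of (name, function) pairs.
    expected:   dict mapping each component name to its reference output.
    Returns None when every component produces the expected output.
    """
    data = sentence
    for name, run in components:
        data = run(data)
        if data != expected[name]:
            return name  # faulty component found, stop here
    return None


def fault_shares(faults):
    """Percentage of faults attributed to each component (None is ignored)."""
    counts = Counter(f for f in faults if f is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}
```

Running `find_faulty_component` over a test set and passing the collected results to `fault_shares` gives a per-component percentage breakdown of where faults first appear.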

6 Results and Discussions

On a corpus collected from various standard texts and authorized resource centers such as TDIL, as discussed above, we identified the Morphological Analyzer as the component contributing most to the generation of errors and false alarms, followed by the POS Tagger. Their individual contributions were 58.13% and 26.74%, respectively, and the combined error percentage is 28. This paves the way for further research in this area: since these are the preliminary and most important steps in the overall procedure, rectifying them would help in checking grammatical errors with much greater accuracy.

7 Conclusion and Future Work

This paper has analyzed the accuracy of each component of the existing rule-based Punjabi Grammar Checker. The effect of each component is analyzed because it has an implication for the overall accuracy of the system. Recall and Precision were taken as the measures. The paper also proposes a “Fault Determination System” that identifies the “Faulty Component” by following a two-phase approach, and it concludes that the Morphological Analyzer and the POS Tagger were the faulty components, generating false alarms and errors to the tune of 58.13% and 26.74%, respectively.

Based on these findings, further research can be carried out to develop a model that overcomes these ambiguities using machine learning techniques in a “Hybrid” mechanism. Such a “Hybrid” framework may be used for other morphologically rich Indian languages such as Oriya, Sanskrit, Hindi, and Bengali, and can be further extended to various Natural Language Processing (NLP) tasks associated with Punjabi and other languages.