
1 Introduction

Tagging defines an association. POS tagging assigns Part-Of-Speech labels to the tokens of a sentence in a given language. POS tagging is widely used in text mining and can be framed as either supervised or unsupervised classification. A supervised tagger learns counts from a labeled data set, and its efficiency relies on the existence of a tagging dictionary. Unsupervised tagging, on the other hand, can be a reasonable alternative because it works with un-annotated language models. The efficiency of any POS tagger depends on a series of criteria, listed below:

  • The syntactic/grammatical structure of the sentence should be correct. Syntactical mistakes in a given sentence can quickly degrade even an otherwise efficient POS tagger.

  • When a supervised model is used, differences in writing style, font styles, or embedded text between the input data and the labeled/training data substantially reduce the robustness of the system.

  • Ambiguities in certain word forms of the English language. For example, should words such as ‘have’, ‘be’, ‘book’ and ‘keep’ be treated as verbs or in their base/noun forms?

  • If a probabilistic measure is used for tagging, how large should the support threshold be? How should the base dependency of a tagger be determined for a given architecture?

To address the above issues, we propose a probabilistic tagging model that is data-rich and overcomes the data scarcity encountered in unlabelled data sets. Intuitively, the proposed model establishes a tag assignment for each word of the operational data against the trained dataset. In the next step, Markov models are used to visit the node probabilities of the hidden tags. A rigorous performance evaluation is carried out to study the transitions among the labels and the encoding of the tag determiners [POS tags of Standard English].

2 Literature Review

Recent works in the field of NLP have reached a new horizon. There are multiple text processing systems that show remarkable improvements in terms of accuracy and efficiency. Various researchers have proposed elementary methods to address the problem of POS tagging. The Penn Treebank for English corpora has been the most widely used resource for the study of POS annotations. Dipanjan Das et al. [1] work on an annotated corpus by building a dictionary; the word alignment exhibited in this work makes use of parallel data for label propagation. The scarcity of standard language corpora and the challenges of defining a manual tag construction are studied by Griffins et al. [2].

In a recent study by Yoong Keok Lee et al. [3], an unsupervised POS tagger is constructed. Regular distributions of language projections are enforced in this study; however, the method is costly and not suitable for many systems across a diverse range of applications. In [4], the authors show an elegant way to enrich the corpus by labeling each word and then assigning the tagged labels to new words. The tagging schemes shown there make fine-grained grammatical distinctions, and as a result a large data set is formed, which is found helpful for a wide category of applications. Leon et al. [5] propose a tagger using bootstrap constraints for unlabeled data, integrated with the probabilities of the existing label tokens; a tagging accuracy of 88.7% is achieved. The work also addresses tagging issues specific to Twitter text. Hand-annotated tagging is shown by Clark and Ritter [6].

Reasonably, a classifier designed with combined machine learning models can provide results that transcend the ability of a human to tag the corpus. With this in mind, we have worked on the uncertainties observed in the existing literature. To overcome the problems associated with annotation, we perform automatic tagging using Markov models, and in scenarios where tagging is not possible with that model, probabilistic measures are used. The performance is subsequently enhanced by appending new features to the tag-rule table. In our work we show how POS tagging can be treated as a classification problem. Two lists are created: the first holds the available tags and the second holds the POS tags taken as the training set. “HMM-POS [Hidden Markov Model – POS]” is the algorithm proposed in our work; it tags the POS for various lexical units.

3 Our Approach – An Overview of the Methodologies

POS tagging can be achieved when it is viewed as a classification problem. The training set and the set of possible tags are the two major considerations here. However, the problem of POS tagging is not that simple, and there are three prime elements for this deliberation.

The performance of a POS tagger is influenced by the evaluation paradigm and the assessment criterion. The effectiveness of a tagger relies on the tagset. The tagset includes the tagging rules, which can be projected as a statistical measure. In order to define a tagset, either a raw or an annotated corpus is required. The granularity of the tagset is the main prerequisite: as the granularity increases, the accuracy of the tagger also increases. This may, however, chop off some areas and take into account only the major considerations. In Fig. 1, the POS tagging model is shown. The figure shows the creation of tags and tag rules with the help of statistical models such as maximum entropy models, conditional random fields and Hidden Markov Models. These models are mainly used for POS tagging, Word Sense Disambiguation (WSD) and other areas of research interest in NLP.

Fig. 1. A POS tagging model, our approach

3.1 Hidden Markov Model Based Tagging Approach

A Hidden Markov Model (HMM) is a statistical language model that helps to model sequential data. It is a process of relating unobserved/unknown states to known observations. In the POS tagging problem, the most appropriate and probable tag/label is selected for an input sentence by observing the previously marked labels in the chosen corpus. The tag sequence is chosen to maximize the probability shown in Eq. (1).

\( P(W_{n} |T) \cdot P(T|T_{1} \ldots T_{n - 1} ) \)  (1). The current word \( W_{n} \) is allotted the tag \( T \) by observing \( T \) together with the tags of the previous words.

Taggers are usually developed using one of the approaches described below:

Tagging with a Stochastic Model: This evaluates the maximum likelihood of the observed tagset. The HMM follows the structure below:

$$ present\_state(DET - N) > present\_state(DET - DET) $$

Tagging using Association Rules: Here the maximum likelihood of the tagset is determined using predefined association rules of the form shown below.

$$ \begin{aligned} & if < any\_pattern > \\ & then < Some\_pattern > \\ \end{aligned} $$
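As an illustration, the sketch below applies a handful of such if-then rules; the rule table and the small lexicon are hypothetical placeholders rather than the rules actually used in this work.

```python
# Minimal sketch of rule-based (if <pattern> then <pattern>) tagging.
# The rule table and lexicon below are purely illustrative.
RULES = {
    "DT": "NN",   # if the previous tag is a determiner, guess noun
    "TO": "VB",   # if the previous tag is the infinitive marker, guess verb
    "MD": "VB",   # if the previous tag is a modal, guess verb
}

def rule_tag(words, lexicon):
    """Tag words from a small lexicon; fall back to the rule table otherwise."""
    tags, prev = [], None
    for w in words:
        tag = lexicon.get(w.lower()) or RULES.get(prev, "NN")
        tags.append((w, tag))
        prev = tag
    return tags

print(rule_tag(["I", "have", "to", "book", "a", "flight"],
               {"i": "PRP", "have": "VBP", "to": "TO", "a": "DT", "flight": "NN"}))
```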

The prime approaches that can be adopted while defining the tagging are: (1) HMM-based tagging, where all the prevalent information is used to derive an assumption, and the accuracy is then calculated to study the integrity of this assumption; this is called the bold approach. (2) Tagging based on constraint grammar. Here no tagging assumption/hypothesis is made as in (1); instead, the impossible/irrelevant tags are removed. This is called the cautious approach. (3) Tagging based on adaptation. A hypothesis is made and can later be changed if the need arises; this kind of approach is also called the whimsical approach. In this section an algorithm based on the HMM structure used for tagging is described. In this approach we choose the most likely tag for a given sentence from which a POS can be derived. The criterion for searching a particular tag is given below.

$$ P(n^{th}\_word|tag) \cdot P(tag|tags\_of(n - 1)\_word\_sequence) $$

The tag sequence that maximizes the above condition is chosen as the most likely tag out of all the possible tags.
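A minimal sketch of this selection criterion is given below, assuming the transition probabilities P(tag | previous tag) and emission probabilities P(word | tag) have already been estimated from a corpus; the numeric values are illustrative placeholders only.

```python
# Pick the tag t that maximizes P(word | t) * P(t | previous tag).
# The probability tables below are illustrative placeholders.
transition = {("TO", "VB"): 0.83, ("TO", "NN"): 0.17}      # P(tag | previous tag)
emission = {("book", "VB"): 0.004, ("book", "NN"): 0.009}  # P(word | tag)

def best_tag(word, prev_tag, candidate_tags):
    scores = {t: emission.get((word, t), 0.0) * transition.get((prev_tag, t), 0.0)
              for t in candidate_tags}
    return max(scores, key=scores.get), scores

tag, scores = best_tag("book", "TO", ["NN", "VB"])
print(tag, scores)  # VB wins here because P(VB | TO) dominates
```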

3.2 Hidden Markov Model Based Tagging Algorithm

The algorithm (Fig. 2) described in this section explains how tagging is done using the HMM structure. The algorithm is explained by considering the two examples (Ex. 1 and Ex. 2) stated below.

Fig. 2. HMM based POS tagging algorithm

I/PRP have/VBP to/TO book/VB a/DT flight/NN today/NN ---------Ex.1

The/DT book/NN series/NNS of/IN Spider/NN man/NN are/VBP really/RB good/JJ------Ex.2

  • Consider that all the words in the above sentences (Ex. 1 and Ex. 2) are tagged, except the word ‘book’, which is not tagged. Perform step 3 of the above algorithm.

    $$ \begin{aligned} & {\text{to}}/{\text{TO}}\;\;{\text{book}}/? \quad {\text{for Ex}}.\,1 \\ & {\text{The}}/{\text{DT}}\;\;{\text{book}}/? \quad {\text{for Ex}}.\,2 \\ \end{aligned} $$
  • Possible tags for the word ‘book’ are NN and VB

  • Apply (2) of step4 from the algorithm.

We choose the tag with the maximum likelihood, as described here.

\( P(NN|TO)\,P(book|NN)\quad {\text{or}}\quad P(VB|TO)\,P(book|VB) \)

Phase 1:

In this phase we scan the corpus to find the probabilities \( P(NN|TO) \) and \( P(VB|TO) \) (a). In reality, the probability of ‘to’ preceding a verb is higher (example: to race, to fly, to eat…). However, there are also sentences in which the word ‘to’ precedes a noun (example: run to temple, come to house, go to school…). We therefore calculate the likelihood ratio in the corpus. In our experiments we have considered three major corpora available in NLTK: Crubadan, Brown and INSPEC. The probability ratio of (a) is studied in each of these corpora and the results obtained are tabulated in Table 1.

Table 1. Occurrences of NN and VB after TO in the three corpora

Phase 2:

\( P(book|NN) \) and \( P(book|VB) \) (b)

As observed in Phase 1, the likelihood of the tag following ‘to’ being VB is higher; Phase 1 thus gives the maximum-probability tag after the word ‘to’. In this phase we estimate the maximum likelihood of ‘book’ being NN or VB. Tests were conducted on the same three corpora, and the results are shown in Table 2. From Table 2 it is observed that the word ‘book’ has a higher probability of being NN rather than VB, and this is evident in all three corpora. Finally we combine the probabilities of Phases 1 and 2 by taking the highest probable factor. This evaluation is shown in Table 3.

Table 2. Occurrences of ‘book’ as NN & VB in the three corpora
Table 3. The probability of the word ‘book’

As shown in Table 3, the combined probability of the word ‘book’ being VB is higher than that of NN (in the to/TO book/? context of Ex. 1). Therefore, in the final stage the tag VB is assigned as the POS for ‘book’. Phases 1 and 2 explain how POS tagging is done for a single word; for a multiple-word sequence, the same paradigm is used but has to be elaborated using Bayes’ theorem. This is portrayed in Phase 3.
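Before moving to Phase 3, Phases 1 and 2 can be reproduced with simple relative-frequency counts. The sketch below uses only the Brown corpus shipped with NLTK as an illustration, since the other two corpora may not be bundled with NLTK in tagged form.

```python
import nltk
from collections import Counter
from nltk.corpus import brown

# nltk.download('brown')  # uncomment on first use

tagged = list(brown.tagged_words())  # [(word, tag), ...] in the Brown tagset
transitions = Counter((t1, t2) for (_, t1), (_, t2) in nltk.bigrams(tagged))
emissions = Counter((w.lower(), t) for w, t in tagged)
tag_counts = Counter(t for _, t in tagged)

# Phase 1: P(NN | TO) versus P(VB | TO)
p_nn_after_to = transitions[("TO", "NN")] / tag_counts["TO"]
p_vb_after_to = transitions[("TO", "VB")] / tag_counts["TO"]

# Phase 2: P(book | NN) versus P(book | VB)
p_book_nn = emissions[("book", "NN")] / tag_counts["NN"]
p_book_vb = emissions[("book", "VB")] / tag_counts["VB"]

# Combine both phases and keep the tag with the larger product
scores = {"NN": p_nn_after_to * p_book_nn, "VB": p_vb_after_to * p_book_vb}
print(scores, "->", max(scores, key=scores.get))
```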

Phase 3:

In this phase, every word pair in a given sentence is assigned its tag sequence based on the likelihood parameter. In order to do so, Bayes’ theorem with the chain rule is applied. Consider a word sequence W1, W2 … WN and a tag sequence T1, T2 … TN, and let the tag-rule set be \( \gamma \). According to Bayes’ rule, we have,

$$ \begin{aligned} & T \in \gamma \\ & \hat{T} = \mathop {\arg \hbox{max} }\nolimits_{T \in \gamma } P(T|W) \\ & \hat{T} = \mathop {\arg \hbox{max} }\nolimits_{T \in \gamma } \frac{P(T)P(W|T)}{P(W)} \\ & \hat{T} = \mathop {\arg \hbox{max} }\nolimits_{T \in \gamma } P(T)P(W|T) \\ \end{aligned} $$

According to the chain rule,

$$ \begin{aligned} & P(X,Y) = P(X|Y)P(Y) = P(Y|X)P(X) \\ & P(X,Y,Z) = P(Y,Z|X)P(X) = P(Z|X,Y)P(Y|X)P(X) \\ & P(W,X,Y,Z) = P(W)P(X|W)P(Y|W,X)P(Z|W,X,Y) \\ \end{aligned} $$

Applying the chain rule to the quantity to be maximized gives,

$$ P(T)P(W|T) = \mathop \prod \limits_{i = 1}^{n} P(T_{i} |T_{1} \ldots T_{i - 1} )P(W_{i} |W_{1} \ldots W_{i - 1} ,T_{1} \ldots T_{i} ) $$
(A)

The trigram approximation can be deduced using the following conditions. First, each word is assumed to depend only on its own tag, as shown in (a).

$$ P(W_{i} |W_{1} T_{1} \ldots T_{i - 1} T_{i} ) = P(W_{i} |T_{i} ) $$
(a)

In addition to (a), the trigram assumption conditions each tag only on the two most recent preceding tags. This is indicated in (b).

$$ P(T_{i} |W_{1} T_{1} \ldots T_{i - 1} ) = P(T_{i} |T_{i - 2} T_{i - 1} ) $$
(b)

By substituting (a) and (b) into (A), we get,

$$ P(T)P(W|T) = [P(T_{1} )P(T_{2} |T_{1} )\mathop \prod \limits_{i = 3}^{n} P(T_{i} |T_{i - 2} T_{i - 1} )\mathop \prod \limits_{i = 1}^{n} P(W_{i} |T_{i} )] $$
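As a sketch, the snippet below evaluates this product (in log space) for one candidate tag sequence. A complete tagger would search over all candidate sequences, e.g. with the Viterbi algorithm, but the scoring itself is exactly the formula above; the probability tables here are hypothetical placeholders rather than estimates from our corpora.

```python
import math

# Illustrative placeholder probabilities, not values estimated from any corpus.
p_init = {"DT": 0.10}                                   # P(T1)
p_bigram = {("DT", "NN"): 0.50}                         # P(T2 | T1)
p_trigram = {("DT", "NN", "VBP"): 0.20}                 # P(Ti | Ti-2, Ti-1)
p_emit = {("the", "DT"): 0.06, ("book", "NN"): 0.01, ("is", "VBP"): 0.05}

def sequence_log_score(words, tags):
    """log[ P(T1) P(T2|T1) * prod_{i>=3} P(Ti|Ti-2 Ti-1) * prod_i P(Wi|Ti) ]"""
    score = math.log(p_init[tags[0]]) + math.log(p_bigram[(tags[0], tags[1])])
    score += sum(math.log(p_trigram[(tags[i - 2], tags[i - 1], tags[i])])
                 for i in range(2, len(tags)))
    score += sum(math.log(p_emit[(w, t)]) for w, t in zip(words, tags))
    return score

print(sequence_log_score(["the", "book", "is"], ["DT", "NN", "VBP"]))
```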

4 Results and Discussions

In order to evaluate our results, an interface using the Natural Language Tool Kit (NLTK) was developed. The toolkit provides all the linguistic features needed to implement the required algorithm. The results were evaluated on three corpora, namely CRUBADAN, INSPEC, & BROWN. An accurate means of calculating the efficiency is by Recall and Precision. However, since each word is associated with at most one tag for any given instance, a single overall F-measure score does not make much sense. Instead, one can calculate the recall and precision for each individual tag, e.g., Recall and Precision for the NN variations.

For every tag encountered in the training data set (the tagging associated with each word in the chosen corpus), three counters are maintained: True Positive (TP), False Positive (FP) and False Negative (FN). If the tag of the trained dataset and the test dataset match, then the value of TP is incremented by 1. If there is no match between them, we increment FN for the actual/original tag and also FP for the tag that our proposed algorithm mistakenly chose. With the obtained values, Recall and Precision can be calculated. The size of the corpus in terms of word count is approximately 10,000 words. Recall defines the likelihood of identifying the positive samples; it can be viewed as the proportion of all positive test samples that is modeled correctly. Recall is computed using formula (i).

$$ Recall = \frac{TP}{TP + FN} $$
(i)

Precision relates to the hypothesis and assumptions. It is defined as the proportion of all positive predictions made on the test samples that are correct. Precision is calculated using formula (ii).

$$ Precision = \frac{TP}{TP + FP} $$
(ii)

Finally we arrive at the F-measure. This is the weighted harmonic mean of the recall and precision computed in (i) & (ii). The F-measure is given by,

$$ F - Measure = \frac{1}{{\alpha \frac{1}{P} + (1 - \alpha )\frac{1}{R}}} $$
(iii)

The component \( \alpha \) is called the balance coefficient. The value of \( \alpha \) is taken as 0.5; this value can be changed conventionally. With the proposed algorithm, we arrive at the values of Recall, Precision and F-measure obtained by applying (i), (ii) & (iii), as indicated in Table 4.

Table 4. Calculation of F-Measure with the values of Recall and Precision
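A minimal sketch of this per-tag bookkeeping is given below, assuming the gold and predicted tags are available as two parallel lists; the example sequences are illustrative only.

```python
from collections import Counter

def per_tag_metrics(gold, predicted, alpha=0.5):
    """Per-tag TP/FP/FN counts, then precision, recall and the weighted F-measure."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1  # the true tag was missed
            fp[p] += 1  # the predicted tag was wrong
    metrics = {}
    for tag in set(gold) | set(predicted):
        prec = tp[tag] / (tp[tag] + fp[tag]) if (tp[tag] + fp[tag]) else 0.0
        rec = tp[tag] / (tp[tag] + fn[tag]) if (tp[tag] + fn[tag]) else 0.0
        f = 1.0 / (alpha / prec + (1 - alpha) / rec) if prec and rec else 0.0
        metrics[tag] = {"precision": prec, "recall": rec, "f_measure": f}
    return metrics

gold = ["DT", "NN", "VB", "NN", "TO", "VB"]
predicted = ["DT", "NN", "NN", "NN", "TO", "VB"]
print(per_tag_metrics(gold, predicted))
```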

5 Conclusions

The POS tagger, a very important component of language analysis, is found in many variations, but performance is a major issue with these taggers. With the advent of many POS taggers, accuracy has become a major challenge. Hence, in this work, a novel POS tagger is defined adopting the features of HMM. To prove the accuracy of our system, three corpora have been chosen. We showed how the efficiency can be increased; the results show that the recommended approach is about 98% efficient. The drawback of this system is the data processing time, which is directly proportional to the size of the input. Since the corpora chosen for our experiments are large, a time of about 20 s was incurred to arrive at the output. Further extensions comprise defining complex tags to increase the speed of the tagger. In addition, one can define a morphological analyzer for multiple natural languages. POS tagging rules can be prepared beforehand, which helps both the speed and the accuracy.