1 Introduction

Part-of-speech tagging is a basic problem in natural language processing (NLP). Potential applications for part-of-speech (POS) tagging exist in many areas including speech recognition, speech synthesis, machine translation and information retrieval (Greene and Rubin 1971; Varile and Zampolli 1997; Voutilainen 2003; Karkaletsis et al. 2015). POS tagging tries to tag (or label) each word in a sentence with the correct POS.

In POS tagging, each word or punctuation mark in the text is assigned with its tag. Different tagging systems can use different sets of tags. Typically a tag describes a word class and some word class specific features, such as number and gender.

Most POS tagger involves two problems:

  1. (1)

    Finding the exact tags for each word. This can be easy if the word is in a word tag lexicon, but if the word is unknown, this may be tough to do.

  2. (2)

    Choosing between the possible tags. This is called syntactic disambiguation, and it has to be solved for each word that is ambiguous in its POS.

Ambiguous words are very common in most languages, for example the English word set “can” be either a noun, an adjective, or a verb.

A lot of work and research has been going on in this area with great success (Jamatia et al. 2015). Most previous works applied different kinds of machine learning algorithms to POS tagging. Two factors that determine the tag of a word are its lexical probability and its contextual probability (Voutilainen 2003; Sun et al. 2008). Some approaches have been adopted, in most reports which can mainly fall into rule-based approaches and statistical approaches (Greene and Rubin 1971; Sun et al. 2008).

Rule-based approaches apply language rules to improve the accuracy of tagging (Brill 1992). However, the monumental manual work required and the need to have qualified people with a working knowledge of linguistics make this approach too inefficient to be practical.

Another rule-based approach “Transformation-Based method” was introduced in Ref. (Brill 1995). Brill pioneered a rule-based tagging approach using the transformation-based learning (TBL) methodology where the rules are not manually constructed, but learned from corpora. The three-phased process starts with an annotator to create the initial state by assigning a tag to each word in the corpus.

Statistical tagging uses large amount of data to establish the language of each situation and neither require knowledge of the rules of the language nor try to deduce them. Most of these systems are based on Decision Tree (Magerman 1995), Hidden Markov model (HMM) (Rabiner 1989; Carlberger and Kann 1999; Thede and Harper 1999; Lee et al. 2000; Brants 2000), Maximum Entropy Model (Ratnaparkhi 1996) or Support Vector Machines (SVM) (Giménez and Marquez 2004).

Using intelligent computing is another choice. A third and rather new approach is tagging with neural networks (Lippmann 1989; Schmid 1994; Marques and Lopes 2001), genetic algorithms (Araujo 2002) and gene expression programming (Lv et al. 2010). In the area of speech recognition neural networks have been used for a decade now. They have shown performances comparable to that of HMM systems or even better (Lippmann 1989; Schmid 1994). Evolutionary algorithms are among the most efficient methods to deal with complex optimization problems (Goldberg 1989). They have been applied to different issues of natural language processing such as query translation (Davis and Dunning 1995), inference of context free grammar (Smith and Witten 1995) and parsing (Araujo 2001; Bernd Bohnet and Joakim Nivre 2012). The POS tagging model with GA is the use of an evolutionary algorithm to find the tagging of new sentences and can achieve an accuracy of 96 % (Araujo 2002).

Gene Expression Programming (GEP) is a revolutionary member in the family of intelligence computing introduced by Candida in 2001 (Ferreira 2001). GEP has achieved a great progress in dressing the problem of machine learning and unknown things prediction. For the first time lv applied GEP to POS tagging, and obtained a high accuracy of 97.4 % (Lv et al. 2010).

With these methods, efficient and fast taggers have been developed. The best reported taggers have attained a high accuracy of 96–98 %. However, although the percentage is high, it is not so good.

Taking these figures into account one may think that POS tagging is a solved and closed problem being this accuracy perfectly acceptable for most NLP systems. So why waste time in finding a new tagger with a higher accuracy? What does an increasing of 0.3 % in accuracy really mean? We think that there is an important reason for thinking that there is still work to do in the field of automatic POS tagging:

A corpus may have thousands of sentences, and each sentence may have an average of around 30 words. The rate of 97 % means that there is one word tagged in error per sentence, 100 in a 100-sentence document, and 1 million in a corpus with 10,000 documents. Since the POS tagging is one of the earlier steps in many natural language processes. Some NLP tasks are very sensitive to POS disambiguation errors which can be found in the domain of Word Sense Disambiguation (Wilks and Stevenson 2000) and Information Retrieval (Krovetz 1997). Starting with an error in each sentence could be a severe drawback, especially considering that the propagation of this error could grow more than linearly.

There are two major problems which keeping tagging accuracy from 100 %: ambiguous words and unknown words. The ambiguity problem stems from the fact that the English word can be a noun, a finite verb or an infinitive. For example, consider this sentence: ‘We can can the can’ (Voutilainen 2003). The same word can is used in three different syntactic ways: as an auxiliary, a verb, and a noun. For a machine to determine what tag goes with which can is a difficult problem. It is not a difficult problem for a human. When knowing the word’s context and the syntax of the sentence, disambiguation ceases to be a problem (Martinez 2011).

The other obstacle in achieving 100 % accuracy in POS tagging is the set of words the tagger had not encountered in its training corpus. This is known as the unknown words problem. The accuracy rates of the methods mentioned above are limited to no more than 95 % while dealing with unknown words. Syntactic parsers, whose dependence on the output of the tagger is critical, will be stumped when encountering a word without a tag. Even with these limitations, taggers are being used in information retrieval (IR), question answering systems, partial parsing, lexical acquisition, information extraction (IE), and text data mining (Manning et al. 1999).

2 Tagging methods

This section presents the main approaches which are frequently used for POS-tagging. Of particular relevance are the exact features which are used by each tagger for the disambiguation of ambiguous and unknown words, so these will be treated in some detail. In Sect 2.1, a rule-based approaches “Transformation-Based method” was introduced and the performance of TBL model was reported. HMM and SVM model were analyzed in Sects 2.2 and 2.3 which represent the stochastic method. After that, two typical evolution algorithms model: Neural Network and Genetic Expression Programming model were spread out and discussed in subsection D and F.

2.1 POS tagging with transformation-based learning

Transformation-based learning tagging is a typical rule-based tagging where the rules are not manually constructed, but learned from corpora. The TBL paradigm as applied to POS tagging was first described in (Brill 1995). The TBL method is a machine learning technique which takes a tagged corpus as input from which it can learn how to correctly tag a test sample. The tagger learns a set of rules for assigning tags from the training data which gives the least possible errors and applies them to test data.

The algorithm relies on the fact that for the task it is trivial to devise a very simple tagging mechanism which achieves quite reasonable results. Such a method can be used to create an initial annotation of the text. Of course while such an annotation will have a large percentage of tags correct, there will still be a substantial proportion which is incorrectly tagged. The TBL algorithm aims to correct these errors by successively applying rules which correct such errors based on contextual and word-form-derived information.

The learning phase follows with the use of a set of predefined rule (transformation) templates. By instantiating each template with data from the annotated corpus, a set of rules is created. Each rule is then applied to the corpus that has been tagged by the annotator. The output is compared to the manually annotated corpus, which is considered to be the ‘truth’. In this step, transformations are learned and listed. These transformations are applied one at a time to the output of the annotator and again compared to the truth. In essence, a single transformation is learned when the ‘learner’ tries every possible transformation, while keeping a tally of the number of tagging errors after each transformation is tried. The best performing transformation or rule (i.e., the one that ‘resulted in the greatest error reduction’) is selected. The learning phase ends when there are no transformations that would reduce the error beyond some predefined threshold.

This approach produces results very close to those which use far more mathematically complex algorithms. Brill reports an overall accuracy of 96.6 % on the Penn Treebank WSJ corpus using 900,000 words of training data (split as 600,000 for learning contextual rules and 350,000 for learning rules for unknown words). The only drawback of this scheme is the long training time required, since in each iteration the counts for each possible rule must be regenerated, as previous rule applications will have probably changed the score of that rule since counts were last generated.

One of the most successful approaches to deal with this problem was that devised by Ngai and Florian (2001), which vastly reduces the training time with no reductions in accuracy. The basic idea is to avoid repetition by generating and storing good and bad counts for each rule “r” once at the beginning, and updating the counts only if they are modified by the application of another rule.

2.2 Hidden markov model

A HMM (Dan Garrette and Jason Baldridge 2013; Owoputi et al. 2013) is a typical statistical tagging model that can be used to solve classification problems that have an inherent state sequence representation. The model can be visualized as an interlocking set of states. These states are connected by a set of transition probabilities, which indicate the probability of traveling between two given states. A process begins in some state, then at discrete time intervals, the process “moves” to a new state as dictated by the transition probabilities. In an HMM, the exact sequence of states that the process generates is unknown (i.e., hidden). As the process enters each state, one of a set of output symbols is emitted by the process. Exactly which symbol is emitted is determined by a probability distribution that is specific to each state. The output of the HMM is a sequence of output symbols.

A complete and excellent description of the equations used in the standard Markov model for part-of-speech tagging is found in (Charniak et al. 1993). A text of n words is seen as a sequence of random variables W 1…n  = W 1 W 2 …W n , and the corresponding tagging is also a sequence of random variables T 1…n  = T 1 T 2 …T n . A particular sequence of values of W 1…n (T 1…n ) is denoted w 1…n (t 1…n ). The definition of the tagging problem is then:

$$ f(w_{{1{ \ldots }n}} ) = \mathop {\arg \hbox{max} }\limits_{{t_{{1{ \ldots }n}} }} P(t_{{1{ \ldots }n}} |w_{{1{ \ldots }n}} ) $$
(1)

where the operator “argmax” computes the tagging maximizing the probability, according to the model, that word sequence \( w_{{1{ \ldots }n}} \) is tagged \( t_{{1{ \ldots }n}} \). Making the two assumptions (Carlberger and Kann 1999):

$$ P(w_{i} |t_{{1{ \ldots }i}} ,w_{{1{ \ldots }i - 1}} ) = P(w_{i} |t_{i} ) $$
(2)
$$ P(t_{i} |t_{{1{ \ldots }i - 1}} ,w_{{1{ \ldots }i - 1}} ) = P(t_{i} |t_{{1{ \ldots }i - 1}} ) $$
(3)

That is, the word itself only depends on its tag, and the tag only depends on the i−1 preceding tags in the text. The tagging problem can now be formulated as formulary 4.

$$ f(w_{{1{ \ldots }n}} ) = \mathop {\arg \hbox{max} }\limits_{{t_{{1{ \ldots }n}} }} \sum\limits_{i = 1}^{n} {P(t_{i} |t_{{1{ \ldots }i - 1}} )} P(w_{i} |t_{i} ) $$
(4)

An unattractive feature of this formulation is that the quantities P(w i |t i ) are very small and difficult to estimate. Since the reversed conditional probabilities P(t i |w i ) are much more attractive in this respect, the following is a plausible alternative:

$$ f(w_{{1{ \ldots }n}} ) = \mathop {\arg \hbox{max} }\limits_{{t_{{1{ \ldots }n}} }} \sum\limits_{i = 1}^{n} {P(t_{i} |t_{{1{ \ldots }i - 1}} )} P(t_{i} |w_{i} ) $$
(5)

Both of these equations (and in particular, their corresponding first order Markov model equations) have been used in different stochastic taggers, but in Charniak et al. (1993), the two equations were compared, and Eq. (1) was found to be significantly better when tagging texts with quite a large training text. If all the probabilities are known, the optimal solution to the tagging problem using Eq. (1) is most efficiently computed with dynamic programming using the so called Viterbi algorithm (Viterbi 1967). This algorithm avoids the polynomial expansion of a breadth first search by trimming the search tree at each level.

In recent years, different kinds of Markov model had been adopted in POS tagging. In which two methods are worth mentioned.

Merialdo (1994) conducted several experiments which are the use of untagged text in the training of a simple triclass Markov model. Two approaches in particular are compared and combined: using text that has been tagged by hand and computing relative frequency counts, using text without tags and training the model as a hidden Markov process, according to a Maximum Likelihood principle. Training with the tagged corpus used Random Field (RF), and training with the untagged corpus used maximum likelihood (ML). Both approaches are used to determine the probability of a sequence of tags. The ML was done using the forward–backward (FB) algorithm. In the first experiment using tagged text, The RF was used for training and the Viterbi algorithm for tagging. Using 42,000 tagged sentences (approximately 1 million words) an accuracy of 97 % was reported. In the second experiment, untagged text was used (40,000 sentences). The training was done using ML and the tagging using the Viterbi algorithm. The reported accuracy was around 88.4 %.

The second POS tagger is the Trigrams ‘n’ Tags (TnT) created by (Brants 2000). The tagger makes use of a combination of smoothing using context-independent linear interpolation and word features like suffixes and capitalization. He reports 96.6 % accuracy with an added 0.5 % (97.1 %) accuracy when the model is tested and trained with data from one annotator; that is, data tagged manually by one and the same person (Brants 2000). Brants makes a point of stating that his tagger does as well as maximum entropy taggers, but faster.

A few other POS taggers based on HMMs have been proposed in the last 15 years (Rabiner 1989; Carlberger and Kann 1999; Thede and Harper 1999; Lee et al. 2000). Although none of them have achieved 100 % accuracy, several kinds of taggers using HMMs are wildly used in POS tagging (Yuan Tian and David Lo 2015).

2.3 Support vector machine-based POS tagging

Support vector machines (SVMs) were first applied to POS tagging in Nakagawa et al. (2001). An SVM is a binary classification algorithm based on a geometric interpretation of the feature values for each instance. As detailed in (Cristianini and Shawe-Taylor 2000), given a set of training instances each consisting of a vector of binary or numeric feature values and a true classification y ϵ {−1, 1}, an SVM learns a classification function f(x) which can be used to classify a test instance with feature vector x. In binary classification problems, the classification rule is then sgn (f(x)). The classification rule f(x) is dependent on what is known as the kernel function, which effectively maps the data into a higher dimensional feature space allowing the correct classification of instance which have non-linearly separable feature values in the original feature space.

Another SVM-based tagger extend binary support vector machines to cover multiclass classification using a strategy known as one-per-class binarisation (Giménez and Marquez 2004). A SVM is constructed for each POS which contains ambiguous lexical items (reportedly 34 for the Penn Treebank), and in the classification stage, the most confident prediction from all of the SVMs is selected as the tag for the word.

The contextual features used in Gim´enez and M`arquez’s tagger include unigrams, bigrams and trigrams of words and POS, derived from the tokens appearing in a context window of two tokens on either side of the target. POS features for ambiguous words which have not yet been tagged can be replaced with ambiguity classes. These nominal features are binarised to act as input to the SVM in the usual way: a nominal feature with k possible values is represented by k binary features each of which is true when the original feature takes one particular value and false otherwise. The accuracy reported for SVMTool is 97.16 % for all tokens and 89.01 % for unknown tokens.

2.4 POS tagging with neural networks

In the area of speech recognition neural networks have been used for a decade now. They have shown performances comparable to that of HMM systems or even better (Lippmann 1989). POS prediction is another area, closer to POS tagging, where neural networks have been applied successfully. Nakamura et al. (1990) trained a 4-layer feed-forward network with up to three preceding POS tags, as input to predict the word category of the next word. The prediction accuracy was similar to that of a trigram based predictor. Using tile predictor, Nakamura’s tagger is able to improve the recognition rate of their speech recognition system from 81.0–86.9 %.

Figure 1 shows the structure of the Net-Tagger without hidden layer; the arrow symbolizes the connections between the layers.

Fig. 1
figure 1

The structure of the net-tagger

In the output layer of the multilayer perceptron network (MLP), each unit corresponds to one of the tags in the tagset. The network learns during the training to activate that output unit which represents the correct tag and to deactivate all other output units in the trained network, the output unit with the highest activation indicates which tag should be attached to the word that is currently processed.

The input of the network comprises all the information which the system has about the parts of speech of the current word, the preceding words and the following words. More precisely, for each POS tag pos j and each of the p + 1 + f words in the context, there is an input unit whose activation in ij represents the probability that word i has part of speech pos j .

For the word which is being tagged and the following words, the lexical POS probability p(pos j |word i ) is all we know about the part of speech. This probability does not take into account any contextual influences. So, we get the following input representation for the currently tagged word and the following words:

$$ in_{ij} = p\left( {pos_{j} |word_{i} } \right), \, if \, i \ge 0 $$
(6)

For tile preceding words, there is more information available, because they have already been tagged. The activation values of the output units at the time of processing are here used instead of the lexieal POS probabilities:

$$ in_{ij} \left( t \right) = out_{j} \left( {t + i} \right), \, if \, i < 0 $$
(7)

The Net-Tagger (Schmid 1994) was trained on a 2 million word subpart of the Penn-Treebank corpus. Its performance was tested on a 100,000 word subpart which was not part of the training corpus. The training of the tagger took one day on a Sparcl0 workstation and the tagging of 100,000 words took 12 min on the same machine.

In Table 1, the accuracy rate of the Net-Tagger is compared to that of a trigram based tagger (Kempe 1993) and a HMM tagger (Cutting et al. 1992) which were trained and tested on the same data.

Table 1 The comparison of taggers

The accuracy of Net-Tagger can achieve a little higher than that of the trigram tagger and HMM tagger. The robustness on small training corpora is as good as that of the HMM tagger. Thus, the Net-Tagger combines advantages of both of these methods. The Net-Tagger has the additional advantage that problematic decisions between tags are easy to detect, so that in some cases an additional tag can be given in the output. In this way, the final decision can be delayed to a later processing stage, e.g. a parser. A disadvantage of the presented method may be its lower processing speed compared to statistical methods.

2.5 POS tagging with GEP

In Lv et al. (2010), Lv introduce a rather new evolutionary algorithm, GEP, for POS tagging. GEP is similar to GAs (Araujo 2002) and it differs from GAs mainly in chromosome encoding. GEP encodes individuals as chromosomes and implement them as liner stings with fixed lengths (Ferreira 2001; Zuo et al. 2002). The separation of genotype and phenotype has endowed GEP with more flexibility and power of exploring the entire search space. The chromosomes of GEP are simple and linear. It can be operated by the genetic process easily, and it has the capability to handle complex problems (Ferreira 2003; Zhou et al. 2003; Zuo et al. 2004; Jing et al. 2005; Karakasis and Stafylopatis 2008). GEP has been applied in the problem of machine learning and unknown things prediction and has achieve a great progress (Ferreira 2003; Zhou et al. 2003; Zuo et al. 2004; Jing et al. 2005; Karakasis and Stafylopatis 2008).

A GEP tagger is able to learn from a training corpus to produce an expression which observes the pattern of the tags in corpus. Chromosome is generated from the tags in trained corpus. The evolution procedure remains the different contexts of each tag. The table can be computed by the training text and recording the different contexts and the number of occurrences of each of them for every tag in the training text.

Figure 2 shows the scheme of GEP training. The evolution process is run for each sentence in the text to be tagged. Evolution steps aim to maximize the total probability of the tagging of the sentences in the test corpus. The process finishes either if the fitness deviation lies below a threshold value or if the evolutionary process has been running for a maximum number of generations.

Fig. 2
figure 2

Genetic expression programming train scheme

Evaluation of the performance of GEP has been undertaken, compared to various taggers. They compare the results on brown corpus of several different versions of intelligent algorithms: neural networks (Lippmann 1989; Schmid 1994; Marques and Lopes 2001) or genetic algorithms (Araujo 2002). They also give the performance of HMM method on Brown Corpus (Brants 2000).

The details of the brown corpus are as follows: the number of the words is 1,000,000; the number of sentences is 50,000; the number of tags is 80. The brown corpus was segmented into two parts, the training set of 90 % and the test set of 10 %, in the way that each sentence in the test set was extracted from every ten sentences.

From Table 2, it can be figured out that comparing with GA, neural network and HMM taggers the advantage of GEP can be obtained in a very efficient way in pos tagging. GEP tagger comes from GAs tagger, but is more efficient than it. The main disadvantage of GEP tagger is that it will spend a little time and spaces for data training just as GA and neural networks tagger do (Table 2). Accuracy rate on brown corpus.

Table 2 Comparison on brown corpus

3 Other taggers

The following are some other taggers worth be listing:

  • Decision trees and statistical decision tree taggers produce output that is easier to interpret. A classical supervised algorithm of the machine learning field has been applied (Magerman 1995), in order to automatically acquire a language model for POS tagging based on statistical decision trees. This learning algorithm uses more complex contextual information than usual n-gram models and it can easily accept other kinds of information. However, critical for this type of tagger is the use of the correct set of questions in the decision part of the process (Màrquez et al. 2000).

  • Finite state transducers seems a natural formalism to use for POS tagging which requires the sequential processing of inputs. In the context of POS tagging, the states represent the sequence of tags, and the output from the states is the words associated with the tags. See Sánchez-Villamil et al. (2004) for details.

  • GAs tagger (Araujo 2002) use a genetic algorithm which, after the “evolution” of sequences of the taggers for the words in the text, select the best individual as solution. Gas tagger has developed a genetic algorithm that works with a population of potential tagging for each input sentence in the text. The evolution of individuals is based on a training table composed of contexts extracted from an in advance labelled training text.

4 Conclusions

POS tagging is a well-studied problem in natural language processing, in which the aim, given a natural language text, is to a label each word in that sample with a POS tag such as noun, verb or adjective. There are three main types of approaches to POS tagging: rule-based, stochastic methods and intelligent algorithm.

The feasibility of training computers to perform this task has been due to the development of annotated corpora, for example, the Brown corpus, Bank of English and the Penn Treebank. The annotations in a corpus provide ‘word–tag’ pairs which can be used to build a model and provide the expectations of which tag would accompany word w i in the target corpus.

A typical rule-based method, TBL tagging is analysed. The TBL method uses the rules learned from the corpus and is a machine learning technique which takes a tagged corpus as input from which it can learn how to correctly tag a test sample. It reports an overall accuracy of 96.6 % on the Penn Treebank WSJ corpus using 900,000 words of training data.

Stochastic methods, more than rule-based methods, have used annotated corpora for POS tagging.

Two of the well understood and used stochastic methods were discussed: Markov models and SVM methods. These approaches and many others have performed with accuracies ranging from 96–97 %. It is believed that this is the level of accuracy that can be attained with the present annotated corpora due to annotation inconsistencies.

Rather new approaches with intelligent algorithm then are introduced. Net-Tagger and GEP tagger are detailed. Neural network and gene expression programming have shown their advantages on dressing the problem of machine learning and unknown things prediction. When they are applied on POS tagging, they can achieve a rather high accuracies (96.22 % for Net-Tagger, 97.4 % for GEP tagger).

As intelligent computing algorithms for POS tagging, a disadvantage of the intelligent methods may be their lower processing speed compared to statistical methods for their preceding training cost. In the light of the high speed of present computer hardware, however, this does not seem to be a serious drawback.

A number of opportunities exist for future research. It is possible that some of the modifications added which kept performance at an approximately constant level would actually result in better performance in downstream applications such as chunk parsing. Another area of potential research in this domain is that if unlexicalised tagging were considered, it is possible that new distinctions in the tag set could be far more productive, since the baseline tagger would have more valuable information.