Abstract
This chapter provides an overview of the maximum entropy framework and its application to a problem in natural language processing. The framework provides a way to combine many pieces of evidence from an annotated training set into a single probability model. The framework has been applied to many tasks in natural language processing, including part-of-speech tagging. This chapter covers the maximum entropy formulation, its relationship to maximum likelihood, a parameter estimation method, and the details of the part-of-speech tagging application.
Definition
The term maximum entropy refers to an optimization framework in which the goal is to find the probability model that maximizes entropy over the set of models that are consistent with the observed evidence.
The information-theoretic notion of entropy is a way to quantify the uncertainty of a probability model; higher entropy corresponds to more uncertainty in the probability distribution. The rationale for choosing the maximum entropy model – from the set of models that meet the evidence – is that any other model assumes evidence that has not been observed (Jaynes 1957).
In most natural language processing problems, observed evidence takes the form of co-occurrence counts between some prediction of interest and some linguistic context of interest. These counts are derived from a large number of linguistically annotated examples, known as a corpus. For example, the frequency in a large corpus with which the word that co-occurs with the tag corresponding to determiner, or DET, is a piece of observed evidence. A probability model is consistent with the observed evidence if its calculated estimates of the co-occurrence counts agree with the observed counts in the corpus.
The goal of the maximum entropy framework is to find a model that is consistent with the co-occurrence counts, but is otherwise maximally uncertain. It provides a way to combine many pieces of evidence into a single probability model. An iterative parameter estimation procedure is usually necessary in order to find the maximum entropy probability model.
Motivation and Background
The early 1990s saw a resurgence in the use of statistical methods for natural language processing (Church and Mercer 1993). In particular, the IBM TJ Watson Research Center was a prominent advocate in this field for statistical methods such as the maximum entropy framework. Language modeling for speech recognition (Lau et al. 1993) and machine translation (Berger et al. 1996) were among the early applications of this framework.
Structure of Learning System
The goal of a typical natural language processing application is to automatically produce linguistically motivated categories or structures over freely occurring text. In statistically based approaches, it is convenient to produce the categories with a conditional probability model p such that p(a | b) is the probability of seeing a prediction of interest a (e.g., a part-of-speech tag) given a linguistic context of interest b (e.g., a word).
The maximum entropy framework discussed here follows the machine learning approach to NLP, which assumes the existence of a large corpus of linguistically annotated examples. This annotated corpus is used to create a training set, which in turn is used to estimate the probability model p.
Representing Evidence
Evidence for the maximum entropy model is derived from the training set. The training set is a list of (prediction, linguistic context) pairs that are generated from the annotated data. However, in practice, we do not record the entire linguistic context. Instead, linguistically motivated Boolean-valued questions reduce the entire linguistic context to a vector of question identifiers. Therefore, each training sample looks like:
| Prediction | Question vector |
| --- | --- |
| a | \(q_{1}\ldots q_{n}\) |

where a is the prediction and \(q_{1}\ldots q_{n}\) is the vector of questions that answered true for the linguistic context corresponding to this training sample. The questions must be designed by the experimenter in advance, specifically for the annotated data and the problem space.
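As a concrete illustration, a training sample of this shape might be built as follows. This is a minimal sketch, not from the original text; the question names and the context dictionary are invented.

```python
# Sketch (illustrative): representing a training sample as a
# (prediction, question-vector) pair.
def make_sample(prediction, context, questions):
    # Keep only the identifiers of the questions that answer true.
    true_questions = [name for name, q in questions.items() if q(context)]
    return (prediction, true_questions)

# Hypothetical Boolean-valued questions over a linguistic context.
questions = {
    "word=that": lambda c: c["word"] == "that",
    "prev_word=said": lambda c: c.get("prev_word") == "said",
    "word-is-capitalized": lambda c: c["word"][:1].isupper(),
}

sample = make_sample("DET", {"word": "that", "prev_word": "said"}, questions)
```

The full linguistic context is discarded; only the identifiers of the true questions are stored, which matches the reduced representation described above.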
In the framework discussed here, any piece of evidence is represented with a feature. A feature \(f_{j}\) correlates a prediction a with an aspect of a linguistic context b, captured by some question \(q_{j}\):

\(f_{j}(a,b) = \left \{\begin{array}{ll} 1 &\mbox{ if }a = a_{j}\mbox{ and }q_{j}(b) =\mathrm{ true}\\ 0 &\mbox{ otherwise} \end{array} \right.\)

where \(a_{j}\) is the prediction associated with the feature.
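A feature of this kind can be sketched as an indicator function pairing a target prediction with a question over the context. The names below are illustrative assumptions, not from the original text.

```python
# Sketch (illustrative): a feature as an indicator function that
# correlates a target prediction with a question over the context.
def make_feature(target, question):
    def f(a, b):
        return 1 if a == target and question(b) else 0
    return f

# Hypothetical feature: fires when the tag is DET and the word is "that".
f_det_that = make_feature("DET", lambda b: b["word"] == "that")
```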
Combining the Evidence
The maximum entropy framework provides a way to combine all the features into a probability model. In the conditional maximum entropy formulation (Berger et al. 1996), the desired model p∗ is given by:

\(p^{{\ast}} =\mathop{\mathrm{argmax}}_{p\in P}H(p),\qquad H(p) = -\sum _{a,b}\tilde{p}(b)\,p(a\vert b)\log p(a\vert b)\)

where H(p) is the conditional entropy of p, \(\tilde{p}(b)\) is the observed probability of the linguistic context b in the training set, and P is the set of models that are consistent with the observed data. A model p is consistent if its own feature expectation \(E_{p}f_{j}\) is equal to the observed feature expectation \(E_{\tilde{p}}f_{j}\):

\(E_{p}f_{j} = E_{\tilde{p}}f_{j},\quad j = 1\ldots k\qquad (1)\)

\(E_{\tilde{p}}f_{j}\) can be interpreted as the observed count of \(f_{j}\) in the training sample, normalized by the training sample size. Both expectations are defined as follows:

\(E_{\tilde{p}}f_{j} =\sum _{a,b}\tilde{p}(a,b)f_{j}(a,b)\qquad E_{p}f_{j} =\sum _{a,b}\tilde{p}(b)\,p(a\vert b)f_{j}(a,b)\)
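The observed feature expectation can be computed directly from the training sample as a normalized count. A minimal sketch, with invented data:

```python
# Sketch (illustrative): the observed expectation of a feature is its
# count in the training sample divided by the sample size.
def observed_expectation(f, samples):
    return sum(f(a, b) for a, b in samples) / len(samples)

# Invented toy sample of (prediction, context) pairs.
samples = [("DET", "that"), ("PRON", "that"), ("DET", "the")]
f_det_that = lambda a, b: 1 if a == "DET" and b == "that" else 0
e_obs = observed_expectation(f_det_that, samples)
```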
According to the maximum entropy framework, the optimal model p∗ is the most uncertain model among those that satisfy the feature constraints. It is possible to show that the form of the optimal model must be log-linear:

\(p^{{\ast}}(a\vert b) = \frac{\prod _{j=1\ldots k}\alpha _{j}^{f_{j}(a,b)}}{Z(b)}\qquad (2)\)

Here \(Z(b) =\sum _{a}\prod _{j=1\ldots k}\alpha _{j}^{f_{j}(a,b)}\) is a normalization factor, and \(\alpha _{j} > 0\). Each model parameter \(\alpha _{j}\) can be viewed as the "strength" of its corresponding feature \(f_{j}\); the conditional probability is the normalized product of the weights of the active features.
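The normalized product of feature weights can be sketched as follows; the features, weights, and prediction space below are invented for illustration.

```python
# Sketch (illustrative): conditional probability as the normalized
# product of the weights of the active features.
def prob(a, b, predictions, features, alphas):
    def score(x):
        s = 1.0
        for f, alpha in zip(features, alphas):
            if f(x, b):           # feature is active for (x, b)
                s *= alpha
        return s
    z = sum(score(x) for x in predictions)   # normalization factor Z(b)
    return score(a) / z

# Invented features and weights for a two-tag toy problem.
features = [lambda a, b: a == "DET" and b == "that",
            lambda a, b: a == "PRON" and b == "that"]
alphas = [3.0, 1.0]
p_det = prob("DET", "that", ["DET", "PRON"], features, alphas)
```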
Relationship to Maximum Likelihood
The maximum entropy framework described here has an alternate interpretation under the more commonly used technique of maximum likelihood estimation:

\(q^{{\ast}} =\mathop{\mathrm{argmax}}_{p\in Q}L(p),\qquad L(p) =\sum _{a,b}\tilde{p}(a,b)\log p(a\vert b)\)

Here Q is the set of models of form (2), \(\tilde{p}(a,b)\) is the observed probability of prediction a together with linguistic context b, L(p) is the log-likelihood of the training set, and q∗ is the maximum likelihood model. It can be shown that p∗ = q∗: maximum likelihood estimation for models of the form (2) gives the same answer as maximum entropy estimation under the constraints on feature counts (1). The difference between the approaches is that the maximum likelihood approach assumes the form of the model, whereas the maximum entropy approach assumes the constraints on feature expectations and derives the model form.
Parameter Estimation
The Generalized Iterative Scaling (GIS) algorithm (Darroch and Ratcliff 1972) is the easiest way to estimate the parameters for this kind of model. The iterative updates are given below:

\(\alpha _{j}^{(n+1)} =\alpha _{j}^{(n)}\left (\frac{E_{\tilde{p}}f_{j}}{E_{p^{(n)}}f_{j}}\right )^{\frac{1}{C}}\)

where \(p^{(n)}\) is the model after the nth iteration.
GIS requires the use of a "correction" feature g and a constant C > 0, defined so that \(g(a,b) = C -\sum _{j=1\ldots k}f_{j}(a,b)\) for any (a, b) pair in the training set. Normally, the correction feature g must be trained in the model along with the k original features, although Curran and Clark (2003) show that GIS converges even without the correction feature. The number of iterations needed to achieve convergence depends on certain aspects of the data, such as the training sample size and the feature set size, and is typically tuned for the problem at hand.
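A toy version of GIS under these definitions might look as follows. The correction feature and update rule follow the description above; the data, names, and iteration count are invented for illustration.

```python
from math import prod

# Sketch (illustrative): Generalized Iterative Scaling on a toy problem,
# with an explicit correction feature padding each event up to C.
def gis(samples, predictions, features, iterations=50):
    k = len(features)
    C = max(sum(f(a, b) for f in features)
            for _, b in samples for a in predictions)
    feats = features + [lambda a, b: C - sum(f(a, b) for f in features)]
    alphas = [1.0] * (k + 1)
    n = len(samples)
    # Observed expectations: normalized feature counts in the sample.
    obs = [sum(f(a, b) for a, b in samples) / n for f in feats]

    def p(a, b):  # model of form (2): normalized product of weights
        scores = {x: prod(al ** f(x, b) for f, al in zip(feats, alphas))
                  for x in predictions}
        return scores[a] / sum(scores.values())

    for _ in range(iterations):
        # Model expectations under the current parameters.
        exp = [sum(p(a, b) * f(a, b)
                   for _, b in samples for a in predictions) / n
               for f in feats]
        alphas = [al * (o / e) ** (1 / C)
                  for al, o, e in zip(alphas, obs, exp)]
    return p

# Invented toy data: "that" is tagged DET twice and PRON once.
samples = [("DET", "that"), ("DET", "that"), ("PRON", "that")]
features = [lambda a, b: 1 if a == "DET" and b == "that" else 0]
model = gis(samples, ["DET", "PRON"], features)
```

On this data the fitted model matches the observed counts, giving p(DET | that) = 2/3.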
Other algorithms for parameter estimation include the Improved Iterative Scaling (Berger et al. 1996) algorithm and the Sequential Conditional GIS (Goodman 2002) algorithm. The list given here is not complete; many other numerical algorithms can be applied to maximum entropy parameter estimation, see Malouf (2002) for a comparison.
It is usually difficult to assess the reliability of features that occur infrequently in the training set, especially those that occur only once. When the parameters are trained from low frequency feature counts, maximum entropy models – as well as many other statistical learning techniques – have a tendency to “overfit” the training data. In this case, performance on training data appears very high, but performance on the intended test data usually suffers. Smoothing or regularization techniques are designed to alleviate this problem for statistical models; some smoothing techniques for maximum entropy models are reviewed in Chen and Rosenfeld (1999).
Applications
This framework has been used as a generic machine learning toolkit for many problems in natural language processing. Like other generic machine learning techniques, the core of the maximum entropy framework is invariant across different problem spaces. However, some information is specific to each problem space:
- Predictions: The space of predictions for this model
- Questions: The space of questions for this model
- Feature Selection: Any possible (question, prediction) pair can be used as a feature. In complex models, only a small subset of all the possible features is used. The feature selection strategy specifies how to choose the subset.
For a given application, it suffices to give the above three pieces of information to fully specify a maximum entropy probability model.
Part-of-Speech Tagging
Part-of-speech tagging is a well-known task in computational linguistics in which the goal is to disambiguate the part-of-speech of all the words in a given sentence. For example, it can be non-trivial for a computer to disambiguate the part-of-speech of the word flies in the following famous examples:
- Fruit flies like a banana.
- Time flies like an arrow.
The word flies behaves like a noun in the first case, and like a verb in the second case. In the machine learning approach to this problem, co-occurrence statistics of tags and words in the linguistic context are used to create a predictive model for part-of-speech tags.
The computational linguistics community has created annotated corpora to help build and test algorithms for tagging. One such corpus, known as the Penn treebank (Marcus et al. 1994), has been used extensively by machine learning and statistical NLP practitioners for problems like tagging. In this corpus, roughly one million words from the Wall Street Journal have been manually assigned part-of-speech tags. This corpus can be converted into a set of training samples, which in turn can be used to train a maximum entropy model.
Model Specification
For tagging, the goal is a maximum entropy model p that will produce a probability of seeing a tag at position i, given the linguistic context of the ith word, the surrounding words, and the previously predicted tags, written as \(p(t_{i}\vert t_{i-1}\ldots t_{1},w_{1}\ldots w_{n})\). The intent is to use the model left-to-right, one word at a time. The maximum entropy model for tagging (Ratnaparkhi 1996) is specified as:
- Predictions: The 45 part-of-speech tags of the Penn treebank
- Questions: Listed below are the questions and question patterns. A question pattern has a placeholder variable (e.g., X, Y) that is instantiated by scanning the annotated corpus for examples in which the patterns match. Let i denote the position of the current word in the sentence, and let \(w_{i}\) and \(t_{i}\) denote the word and tag at position i, respectively.
  - Does \(w_{i} = X\)?
  - Does \(w_{i-1} = X\)?
  - Does \(w_{i-2} = X\)?
  - Does \(w_{i+1} = X\)?
  - Does \(w_{i+2} = X\)?
  - Does \(t_{i-1} = X\)?
  - Does \(t_{i-1}t_{i-2} = X,Y\)?
  - For words that occur fewer than 5 times in the training set:
    - Are the first K (for K ≤ 4) characters \(X_{1}\ldots X_{K}\)?
    - Are the last K (for K ≤ 4) characters \(X_{1}\ldots X_{K}\)?
    - Does the current word contain a number?
    - Does the current word contain a hyphen?
    - Does the current word contain an uppercase character?
- Feature Selection: Any feature whose count in the training data is less than 10 is discarded.
While the features for each probability decision could in theory look at the entire linguistic context, they actually only look at a small window of words surrounding the current word, and a small window of tags to the left. Therefore each decision effectively makes the Markov-like assumption given in Eq. (3):

\(p(t_{i}\vert t_{i-1}\ldots t_{1},w_{1}\ldots w_{n}) \approx p(t_{i}\vert t_{i-1}t_{i-2},w_{i-2}\ldots w_{i+2})\qquad (3)\)
Equation (4) is the maximum entropy model for tagging:

\(p(t_{i}\vert t_{i-1}t_{i-2},w_{i-2}\ldots w_{i+2}) = \frac{\prod _{j=1\ldots k}\alpha _{j}^{f_{j}(t_{i},\,t_{i-1}t_{i-2}w_{i-2}\ldots w_{i+2})}}{Z(t_{i-1}t_{i-2},w_{i-2}\ldots w_{i+2})}\qquad (4)\)

Each conditional probability of a prediction \(t_{i}\) given some context \(t_{i-1}t_{i-2}w_{i-2}w_{i-1}w_{i}w_{i+1}w_{i+2}\) is the normalized product of the weights of the features that are active for that (prediction, context) pair.
Training Data
The training set is created by applying the questions to each word in the annotated corpus. For example, when scanning the word flies in the sentence "Time flies like an arrow," the training sample would be:
| Prediction | Question vector |
| --- | --- |
| verb | \(w_{i} =\mathrm{flies},\ w_{i-1} =\mathrm{Time},\ w_{i-2} =\mathrm{{\ast}bd{\ast}},\ w_{i+1} =\mathrm{like},\ w_{i+2} =\mathrm{an},\ t_{i-1} = \mathit{noun},\ t_{i-1}t_{i-2} = \mathit{noun},\mathrm{{\ast}bd{\ast}}\) |
Here *bd* is a special symbol for boundary. The tags have been simplified for this example; the actual tags in the Penn treebank are more fine-grained than noun and verb.
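The question patterns could be instantiated for this example roughly as follows. This is a sketch; the question-naming scheme is an invented convention, and only the previously predicted tags are available when scanning left-to-right.

```python
# Sketch (illustrative): instantiating the question patterns for the
# word "flies" (position i = 1) in "Time flies like an arrow", using
# the simplified tags from the text. *bd* marks a sentence boundary.
def question_vector(words, prev_tags, i, bd="*bd*"):
    w = lambda j: words[j] if 0 <= j < len(words) else bd
    t = lambda j: prev_tags[j] if 0 <= j < len(prev_tags) else bd
    return [f"w[i]={w(i)}", f"w[i-1]={w(i-1)}", f"w[i-2]={w(i-2)}",
            f"w[i+1]={w(i+1)}", f"w[i+2]={w(i+2)}",
            f"t[i-1]={t(i-1)}", f"t[i-1]t[i-2]={t(i-1)},{t(i-2)}"]

qv = question_vector(["Time", "flies", "like", "an", "arrow"], ["noun"], 1)
```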
Hundreds of thousands of training samples are used to create candidate features. Any possible (prediction, question) pair that occurs in training data is a candidate feature. The feature selection strategy is a way to eliminate unreliable or noisy features from the candidate set. For the part-of-speech model described here, a simple frequency threshold is used to implement feature selection.
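The frequency-threshold selection can be sketched as a count over candidate pairs; the data and threshold below are invented (the tagger described in this entry uses a threshold of 10).

```python
from collections import Counter

# Sketch (illustrative): count every candidate (question, prediction)
# pair in the training samples and keep only those at or above the
# frequency threshold.
def select_features(samples, threshold):
    counts = Counter((q, a) for a, qv in samples for q in qv)
    return {pair for pair, c in counts.items() if c >= threshold}

# Invented toy samples: each is (prediction, question vector).
samples = [("DET", ["w=that"]), ("DET", ["w=that"]),
           ("DET", ["w=that"]), ("PRON", ["w=that"])]
selected = select_features(samples, 2)
```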
Given a selected feature set, the GIS algorithm is then used to find the optimal value for the corresponding α j parameters. For this application, roughly 50 iterations of GIS sufficed to achieve convergence.
Search for Best Sequence
The probability model described thus far will produce a distribution over tags, given a linguistic context including and surrounding the current word. In practice we need to tag entire sentences, which means that the model must produce a sequence of tags. Tagging is typically performed left-to-right, so that each decision has the left context of previously predicted tags. The probability of a tag sequence \(t_{1}\ldots t_{n}\) for an n-word sentence \(w_{1}\ldots w_{n}\) is factored as:

\(p(t_{1}\ldots t_{n}\vert w_{1}\ldots w_{n}) =\prod _{i=1}^{n}p(t_{i}\vert t_{i-1}t_{i-2},w_{i-2}\ldots w_{i+2})\)

The desired tag sequence is the one with the highest conditional sequence probability:

\(t_{1}^{{\ast}}\ldots t_{n}^{{\ast}} =\mathop{\mathrm{argmax}}_{t_{1}\ldots t_{n}}p(t_{1}\ldots t_{n}\vert w_{1}\ldots w_{n})\)
A dynamic programming procedure known as the Viterbi algorithm is typically used to find the highest probability sequence.
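A simplified Viterbi sketch is given below, assuming a model conditioned on only one previous tag rather than the two tags and five-word window used by the tagger described above; the toy probabilities are invented.

```python
# Sketch (illustrative): Viterbi search for the most probable tag
# sequence. prob(tag, prev_tag, i) stands in for the conditional model.
def viterbi(n, tags, prob, start="*bd*"):
    best = {start: (1.0, [])}      # best (score, sequence) ending in each tag
    for i in range(n):
        nxt = {}
        for prev, (score, seq) in best.items():
            for t in tags:
                s = score * prob(t, prev, i)
                if t not in nxt or s > nxt[t][0]:
                    nxt[t] = (s, seq + [t])
        best = nxt
    return max(best.values())[1]   # sequence with the highest probability

# Invented toy distribution that prefers noun at position 0, verb at 1.
def toy_prob(t, prev, i):
    return {(0, "noun"): 0.9, (1, "verb"): 0.8}.get((i, t), 0.1)

best_seq = viterbi(2, ["noun", "verb"], toy_prob)
```

Because the dynamic program keeps only the best path ending in each tag at each position, the search is linear in sentence length rather than exponential in the number of tag sequences.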
Other NLP Applications
Other NLP applications have used maximum entropy models to predict a wide variety of linguistic structure. The statistical parser in Ratnaparkhi (1999) uses separate maximum entropy models for part-of-speech, chunk, and parse structure prediction. The system in Borthwick (1999) uses maximum entropy models for named entity detection, while the system in Ittycheriah et al. (2001) uses them as sub-components for both answer type prediction and named entity detection. Typically, such applications do not need to change the core framework, but instead need to modify the meaning of the predictions, questions, and feature selection to suit the intended task of the application.
Future Directions
Conditional random fields (Lafferty et al. 2001), or CRFs, are an alternative to maximum entropy models that address the label bias issue. Label bias affects sequence models that predict one element at a time, in which features at a given state (or word, in the case of POS tagging) compete with each other, but do not compete with features at any other state in the sequence. In contrast, a CRF model directly produces a probability distribution over the entire sequence, and therefore allows global competition of features across the entire sequence. The parameter estimation for CRFs is related to the Generalized Iterative Scaling algorithm used for maximum entropy models. See Sha and Pereira (2003) for an example of CRFs applied to noun phrase chunking.
A more recent direction is the work of Collobert et al. (2011), which presents a multi-layer neural network approach for several sequence labeling tasks, including POS tagging. This approach avoids task-specific feature engineering – like the questions in section "Model Specification" – and instead uses the neural network training algorithm to discover internal representations for the word and tag context. It also uses large amounts of unlabeled data to enhance the internal representations for words.
Recommended Reading
Berger AL, Della Pietra SA, Della Pietra VJ (1996) A maximum entropy approach to natural language processing. Comput Linguist 22(1):39–71
Borthwick A (1999) A maximum entropy approach to named entity recognition. PhD thesis, New York University
Chen S, Rosenfeld R (1999) A Gaussian prior for smoothing maximum entropy models. Technical report CMUCS-99-108, Carnegie Mellon University
Church KW, Mercer RL (1993) Introduction to the special issue on computational linguistics using large corpora. Comput Linguist 19(1):1–24
Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537
Curran JR, Clark S (2003) Investigating GIS and smoothing for maximum entropy taggers. In: Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics-Volume 1. Association for Computational Linguistics, pp 91–98
Darroch J, Ratcliff D (1972) Generalized iterative scaling for log-linear models. Ann Math Stat 43(5):1470–1480
Goodman J (2002) Sequential conditional generalized iterative scaling. In: Proceedings of the Association for Computational Linguistics
Ittycheriah A, Franz M, Zhu W, Ratnaparkhi A (2001) Question answering using maximum-entropy components. In: Procedings of NAACL
Jaynes ET (1957) Information theory and statistical mechanics. Phys Rev 106(4):620–630
Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th international conference on machine learning. Morgan Kaufmann, San Francisco, pp 282–289
Lau R, Rosenfeld R, Roukos S (1993) Adaptive language modeling using the maximum entropy principle. In: Proceedings of the ARPA human language technology workshop. Morgan Kaufmann, San Francisco, pp 108–113
Malouf R (2002) A comparison of algorithms for maximum entropy parameter estimation. In: Sixth conference on natural language learning, pp 49–55
Marcus MP, Santorini B, Marcinkiewicz MA (1994) Building a large annotated corpus of English: the Penn Treebank. Comput Linguist 19(2): 313–330
Ratnaparkhi A (1996) A maximum entropy model for part-of-speech tagging. In: Brill E, Church K (eds) Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, Somerset, pp 133–142
Ratnaparkhi A (1999) Learning to parse natural language with maximum entropy models. Mach Learn 34(1–3):151–175
Sha F, Pereira F (2003) Shallow parsing with conditional random fields. In: Proceedings of HLT-NAACL, pp 213–220
© 2017 Springer Science+Business Media New York

Ratnaparkhi, A. (2017). Maximum Entropy Models for Natural Language Processing. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7687-1_525. Print ISBN: 978-1-4899-7685-7. Online ISBN: 978-1-4899-7687-1.