Synonyms

Maxent models; Log-linear models; Statistical natural language processing

Definition

The term maximum entropy refers to an optimization framework in which the goal is to find the probability model that maximizes entropy over the set of models that are consistent with the observed evidence.

The information-theoretic notion of entropy is a way to quantify the uncertainty of a probability model; higher entropy corresponds to more uncertainty in the probability distribution. The rationale for choosing the maximum entropy model – from the set of models that meet the evidence – is that any other model assumes evidence that has not been observed (Jaynes 1957).

In most natural language processing problems, observed evidence takes the form of co-occurrence counts between some prediction of interest and some linguistic context of interest. These counts are derived from a large number of linguistically annotated examples, known as a corpus. For example, the frequency in a large corpus with which the word “that” co-occurs with the tag corresponding to determiner, or DET, is a piece of observed evidence. A probability model is consistent with the observed evidence if its calculated estimates of the co-occurrence counts agree with the observed counts in the corpus.

The goal of the maximum entropy framework is to find a model that is consistent with the co-occurrence counts, but is otherwise maximally uncertain. It provides a way to combine many pieces of evidence into a single probability model. An iterative parameter estimation procedure is usually necessary in order to find the maximum entropy probability model.

Motivation and Background

The early 1990s saw a resurgence in the use of statistical methods for natural language processing (Church and Mercer 1993). In particular, the IBM TJ Watson Research Center was a prominent advocate in this field for statistical methods such as the maximum entropy framework. Language modeling for speech recognition (Lau et al. 1993) and machine translation (Berger et al. 1996) were among the early applications of this framework.

Structure of Learning System

The goal of a typical natural language processing application is to automatically produce linguistically motivated categories or structures over freely occurring text. In statistically based approaches, it is convenient to produce the categories with a conditional probability model p such that p(a | b) is the probability of seeing a prediction of interest a (e.g., a part-of-speech tag) given a linguistic context of interest b (e.g., a word).

The maximum entropy framework discussed here follows the machine learning approach to NLP, which assumes the existence of a large corpus of linguistically annotated examples. This annotated corpus is used to create a training set, which in turn is used to estimate the probability model p.

Representing Evidence

Evidence for the maximum entropy model is derived from the training set. The training set is a list of (prediction, linguistic context) pairs that are generated from the annotated data. However, in practice, we do not record the entire linguistic context. Instead, linguistically motivated Boolean-valued questions reduce the entire linguistic context to a vector of question identifiers. Therefore, each training sample looks like:

Prediction | Question vector
a | \(q_{1}\ldots q_{n}\)

where a is the prediction and \(q_{1}\ldots q_{n}\) is the vector of questions that answered true for the linguistic context corresponding to this training sample. The questions must be designed by the experimenter in advance and are tailored to the annotated data and the problem space.

In the framework discussed here, any piece of evidence is represented with a feature. A feature correlates a particular prediction x with an aspect of a linguistic context b, captured by some question q:

$$\displaystyle\begin{array}{rcl} f_{j}(a,b) = \left \{\begin{array}{ll} 1&\mbox{ if $a = x$ and $q(b) =$true}\\ 0 &\mbox{ otherwise}\\ \end{array} \right .& & {}\\ \end{array}$$
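As a concrete, purely illustrative sketch, such a feature can be written as a small Python function; the helper make_feature and the dictionary representation of the context b are assumptions made for this example, not part of any particular toolkit.

```python
def make_feature(x, q):
    """Return a feature f(a, b) that fires when the prediction is x and q(b) is true."""
    def f(a, b):
        return 1 if a == x and q(b) else 0
    return f

# Example: a feature correlating the tag DET with the current word being "that".
f_det_that = make_feature("DET", lambda b: b["w_i"] == "that")
f_det_that("DET", {"w_i": "that"})   # 1
f_det_that("NOUN", {"w_i": "that"})  # 0
```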

Combining the Evidence

The maximum entropy framework provides a way to combine all the features into a probability model. In the conditional maximum entropy formulation (Berger et al. 1996), the desired model p is given by:

$$\displaystyle\begin{array}{rcl} P& =& \left \{p\vert E_{p}f_{j} = E_{\tilde{p}}f_{j},j = 1\ldots k\right \} \\ H(p)& =& -\sum _{a,b}\tilde{p}(b)p(a\vert b)\log p(a\vert b) \\ p^{{\ast}}& =& \text{argmax}_{ p\in P}H(p) {}\end{array}$$
(1)

where H(p) is the conditional entropy of p, \(\tilde{p}(b)\) is the observed probability of the linguistic context b in the training set, and P is the set of models that are consistent with the observed data. A model p is consistent if its own feature expectation \(E_{p}f_{j}\) is equal to the observed feature expectation \(E_{\tilde{p}}f_{j}\), for all \(j = 1\ldots k\) features. \(E_{\tilde{p}}f_{j}\) can be interpreted as the observed count of \(f_{j}\) in the training sample, normalized by the training sample size. Both are defined as follows:

$$\displaystyle\begin{array}{rcl} E_{p}f_{j}& =& \sum _{a,b}\tilde{p}(b)p(a\vert b)f_{j}(a,b) {}\\ E_{\tilde{p}}f_{j}& =& \sum _{a,b}\tilde{p}(a,b)f_{j}(a,b) {}\\ \end{array}$$
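As a rough Python sketch, reusing the feature representation assumed above, both expectations can be computed from a training set stored as a list of (prediction, context) pairs; summing over the contexts in the training set and dividing by its size weights each context b by its observed probability \(\tilde{p}(b)\). The helper names are illustrative.

```python
def observed_expectation(f, training_set):
    # E_p~ f: the count of the feature in the training sample, normalized by its size
    return sum(f(a, b) for a, b in training_set) / len(training_set)

def model_expectation(f, training_set, p, predictions):
    # E_p f = sum_{a,b} p~(b) p(a|b) f(a,b), where p(a, b) returns the model's p(a|b)
    return sum(p(a, b) * f(a, b)
               for _, b in training_set
               for a in predictions) / len(training_set)
```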

According to the maximum entropy framework, the optimal model p is the most uncertain model among those that satisfy the feature constraints. It is possible to show that the form of the optimal model must be log-linear:

$$\displaystyle\begin{array}{rcl} p^{{\ast}}(a\vert b)& =&{ 1 \over Z(b)}\prod _{j=1\ldots k}\alpha _{j}^{f_{j}(a,b)} \\ Z(b)& =& \sum _{a'}\prod _{j=1\ldots k}\alpha _{j}^{f_{j}(a',b)} {}\end{array}$$
(2)

Here Z(b) is a normalization factor, and \(\alpha_{j} > 0\). Each model parameter \(\alpha_{j}\) can be viewed as the “strength” of its corresponding feature \(f_{j}\); the conditional probability is the normalized product of the feature weights of the active features.
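A minimal Python sketch of the model form (2), under the same illustrative assumptions as the earlier sketches: the score of a candidate prediction is the product of the weights of its active features, and Z(b) normalizes over all candidate predictions.

```python
def prob(a, b, alphas, features, predictions):
    def score(candidate):
        s = 1.0
        for alpha, f in zip(alphas, features):
            if f(candidate, b):      # multiply in the weight of each active feature
                s *= alpha
        return s
    z = sum(score(candidate) for candidate in predictions)  # normalization factor Z(b)
    return score(a) / z
```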

Relationship to Maximum Likelihood

The maximum entropy framework described here has an alternate interpretation under the more commonly used technique of maximum likelihood estimation.

$$\displaystyle\begin{array}{rcl} Q& =& \left \{p\vert p(a\vert b) ={ 1 \over Z(b)}\prod _{j=1\ldots k}\alpha _{j}^{f_{j}(a,b)}\right \} {}\\ L(p)& =& \sum _{a,b}\tilde{p}(a,b)\log p(a\vert b) {}\\ q^{{\ast}}& =& \mathop{\text{argmax}}\limits_{p \in Q}L(p) {}\\ \end{array}$$

Here Q is the set of models of form (2), \(\tilde{p}(a,b)\) is the observed probability of prediction a together with linguistic context b, L(p) is the log-likelihood of the training set, and \(q^{{\ast}}\) is the maximum likelihood model. It can be shown that \(p^{{\ast}} = q^{{\ast}}\); maximum likelihood estimation for models of the form (2) gives the same answer as maximum entropy estimation over the constraints on feature counts (1). The difference between the approaches is that the maximum likelihood approach assumes the form of the model, whereas the maximum entropy approach assumes the constraints on feature expectations and derives the model form.

Parameter Estimation

The Generalized Iterative Scaling (GIS) algorithm (Darroch and Ratcliff 1972) is one of the simplest ways to estimate the parameters of this kind of model. The iterative updates are given below:

$$\displaystyle\begin{array}{rcl} \alpha _{j}^{(0)}& =& 1 {}\\ \alpha _{j}^{(n)}& =& \alpha _{ j}^{(n-1)}\left [{E_{\tilde{p}}f_{j} \over E_{p}f_{j}}\right ]^{{ 1 \over C} } {}\\ \end{array}$$

GIS requires the use of a “correction” feature g and a constant C > 0, which are defined so that \(g(a,b) = C -\sum _{j=1\ldots k}f_{j}(a,b)\) for any (a, b) pair in the training set. Normally, the correction feature g must be trained in the model along with the k original features, although Curran and Clark (2003) show that GIS converges even without the correction feature. The number of iterations needed to achieve convergence depends on certain aspects of the data, such as the training sample size and the feature set size, and is typically tuned for the problem at hand.
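The updates can be sketched as follows, reusing the illustrative helpers from the earlier sketches. The correction feature is omitted here, in line with the observation of Curran and Clark (2003) cited above, and the iteration count is simply fixed rather than tuned.

```python
def gis(features, training_set, predictions, C, iterations=50):
    alphas = [1.0] * len(features)                       # alpha_j^(0) = 1
    observed = [observed_expectation(f, training_set) for f in features]
    for _ in range(iterations):
        p = lambda a, b: prob(a, b, alphas, features, predictions)
        expected = [model_expectation(f, training_set, p, predictions) for f in features]
        alphas = [alpha * (obs / exp) ** (1.0 / C)       # multiplicative GIS update
                  for alpha, obs, exp in zip(alphas, observed, expected)]
    return alphas
```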

Other algorithms for parameter estimation include the Improved Iterative Scaling (Berger et al. 1996) algorithm and the Sequential Conditional GIS (Goodman 2002) algorithm. This list is not complete; many other numerical algorithms can be applied to maximum entropy parameter estimation, and Malouf (2002) provides a comparison.

It is usually difficult to assess the reliability of features that occur infrequently in the training set, especially those that occur only once. When the parameters are trained from low frequency feature counts, maximum entropy models – as well as many other statistical learning techniques – have a tendency to “overfit” the training data. In this case, performance on training data appears very high, but performance on the intended test data usually suffers. Smoothing or regularization techniques are designed to alleviate this problem for statistical models; some smoothing techniques for maximum entropy models are reviewed in Chen and Rosenfeld (1999).

Applications

This framework has been used as a generic machine learning toolkit for many problems in natural language processing. Like other generic machine learning techniques, the core of the maximum entropy framework is invariant across different problem spaces. However, some information is specific to each problem space:

  • Predictions: The space of predictions for this model

  • Questions: The space of questions for this model

  • Feature Selection: Any possible (question, prediction) pair can be used as a feature. In complex models, only a small subset of all the possible features are used in a model. The feature selection strategy specifies how to choose the subset.

For a given application, it suffices to give the above three pieces of information to fully specify a maximum entropy probability model.

Part-of-Speech Tagging

Part-of-speech tagging is a well-known task in computational linguistics in which the goal is to disambiguate the part-of-speech of all the words in a given sentence. For example, it can be non-trivial for a computer to disambiguate the part-of-speech of the word flies in the following famous examples:

  • Fruit flies like a banana.

  • Time flies like an arrow.

The word flies behaves like a noun in the first case, and like a verb in the second case. In the machine learning approach to this problem, co-occurrence statistics of tags and words in the linguistic context are used to create a predictive model for part-of-speech tags.

The computational linguistics community has created annotated corpora to help build and test algorithms for tagging. One such corpus, known as the Penn treebank (Marcus et al. 1994), has been used extensively by machine learning and statistical NLP practitioners for problems like tagging. In this corpus, roughly one million words from the Wall Street Journal have been manually assigned part-of-speech tags. This corpus can be converted into a set of training samples, which in turn can be used to train a maximum entropy model.

Model Specification

For tagging, the goal is a maximum entropy model p that will produce a probability of seeing a tag at position i, given the linguistic context of the ith word, the surrounding words, and the previously predicted tags, written as \(p(t_{i}\vert t_{i-1}\ldots t_{1},w_{1}\ldots w_{n})\). The intent is to use the model left-to-right, one word at a time. The maximum entropy model for tagging (Ratnaparkhi 1996) is specified as:

  • Predictions: The 45 part-of-speech tags of the Penn treebank

  • Questions: Listed below are the questions and question patterns. A question pattern has a placeholder variable (e.g., X, Y) that is instantiated by scanning the annotated corpus for examples in which the patterns match. Let i denote the position of the current word in the sentence, and let \(w_{i}\) and \(t_{i}\) denote the word and tag at position i, respectively.

    • Does \(w_{i} = X\)?

    • Does \(w_{i-1} = X\)?

    • Does \(w_{i-2} = X\)?

    • Does \(w_{i+1} = X\)?

    • Does \(w_{i+2} = X\)?

    • Does \(t_{i-1} = X\)?

    • Does \(t_{i-1}t_{i-2} = X,Y\)?

    • For words that occur less than 5 times in the training set:

      • Are the first K (for K ≤ 4) characters \(X_{1}\ldots X_{K}\)?

      • Are the last K (for K ≤ 4) characters \(X_{1}\ldots X_{K}\)?

      • Does the current word contain a number?

      • Does the current word contain a hyphen?

      • Does the current word contain an uppercase character?

  • Feature Selection: Any feature whose count in the training data is less than 10 is discarded.

While the features for each probability decision could in theory look at the entire linguistic context, they actually only look at a small window of words surrounding the current word, and a small window of tags to the left. Therefore each decision effectively makes the Markov-like assumption given in Eq. (3).

$$\displaystyle\begin{array}{rcl} & & p(t_{i}\vert t_{i-1}\ldots t_{1},w_{1}\ldots w_{n}) \\ & & = p(t_{i}\vert t_{i-1}t_{i-2}w_{i-2}w_{i-1}w_{i}w_{i+1}w_{i+2}){}\end{array}$$
(3)
$$\displaystyle\begin{array}{rcl} & & ={ \prod _{j=1\ldots k}\alpha _{j}^{f_{j}(t_{i},t_{i-1}t_{i-2}w_{i-2}w_{i-1}w_{i}w_{i+1}w_{i+2})} \over Z(t_{i-1}t_{i-2}w_{i-2}w_{i-1}w_{i}w_{i+1}w_{i+2})} {}\end{array}$$
(4)

Equation (4) is the maximum entropy model for tagging. Each conditional probability of a prediction \(t_{i}\) given some context \(t_{i-1}t_{i-2}w_{i-2}w_{i-1}w_{i}w_{i+1}w_{i+2}\) is the normalized product of the weights of the features that are active for that (prediction, context) pair.

Training Data

The training set is created by applying the questions to each word in the annotated corpus. For example, when scanning the word flies in the sentence “Time flies like an arrow,” the training sample would be:

Prediction | Question vector
verb | \(w_{i} =\mathrm{flies}, w_{i-1} =\mathrm{Time}, w_{i-2} =\mathrm{{\ast}bd{\ast}}, w_{i+1} =\mathrm{like}, w_{i+2} =\mathrm{an}, t_{i-1} = \mathit{noun}, t_{i-1}t_{i-2} = \mathit{noun},\mathrm{{\ast}bd{\ast}}\)

Here *bd* is a special symbol for boundary. The tags have been simplified for this example; the actual tags in the Penn treebank are more fine-grained than noun and verb.
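The conversion of a tagged sentence into training samples of this form can be sketched in Python as follows; the simplified tags, the question identifiers, and the function name are illustrative.

```python
def make_samples(words, tags, bd="*bd*"):
    """Convert a tagged sentence into (prediction, question vector) training samples."""
    pw = [bd, bd] + list(words) + [bd, bd]   # pad the word window with boundary symbols
    pt = [bd, bd] + list(tags)               # pad the tag history with boundary symbols
    samples = []
    for i, tag in enumerate(tags):
        questions = [
            ("w_i", pw[i + 2]), ("w_i-1", pw[i + 1]), ("w_i-2", pw[i]),
            ("w_i+1", pw[i + 3]), ("w_i+2", pw[i + 4]),
            ("t_i-1", pt[i + 1]), ("t_i-1 t_i-2", (pt[i + 1], pt[i])),
        ]
        samples.append((tag, questions))
    return samples

samples = make_samples(["Time", "flies", "like", "an", "arrow"],
                       ["noun", "verb", "prep", "det", "noun"])
# samples[1] is the sample for "flies": prediction "verb", with w_i = "flies",
# w_i-1 = "Time", w_i-2 = "*bd*", t_i-1 = "noun", and (t_i-1, t_i-2) = ("noun", "*bd*").
```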

Hundreds of thousands of training samples are used to create candidate features. Any possible (prediction, question) pair that occurs in training data is a candidate feature. The feature selection strategy is a way to eliminate unreliable or noisy features from the candidate set. For the part-of-speech model described here, a simple frequency threshold is used to implement feature selection.
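A sketch of this frequency threshold, assuming training samples in the (prediction, question vector) form of the previous sketch:

```python
from collections import Counter

def select_features(samples, threshold=10):
    counts = Counter()
    for prediction, questions in samples:
        for question in questions:
            counts[(question, prediction)] += 1   # candidate feature = (question, prediction)
    # keep only features seen at least `threshold` times in the training data
    return [feature for feature, count in counts.items() if count >= threshold]
```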

Given a selected feature set, the GIS algorithm is then used to find the optimal value for the corresponding \(\alpha_{j}\) parameters. For this application, roughly 50 iterations of GIS sufficed to achieve convergence.

Search for Best Sequence

The probability model described thus far will produce a distribution over tags, given a linguistic context including and surrounding the current word. In practice we need to tag entire sentences, which means that the model must produce a sequence of tags. Tagging is typically performed left-to-right, so that each decision has the left context of previously predicted tags. The probability of the best tag sequence for an n-word sentence is factored as:

$$\displaystyle\begin{array}{rcl} & & p(t_{1}\ldots t_{n}\vert w_{1}\ldots w_{n}) {}\\ & & =\prod _{i=1\ldots n}p(t_{i}\vert t_{i-1}\ldots t_{1},w_{1}\ldots w_{n}) {}\\ \end{array}$$

The desired tag sequence is the one with the highest conditional sequence probability:

$$\displaystyle\begin{array}{rcl} t_{1}^{{\ast}}\ldots t_{ n}^{{\ast}} =\mathop{ \text{argmax}}\limits_{t_{ 1}\ldots t_{n}}p(t_{1}\ldots t_{n}\vert w_{1}\ldots w_{n})& & {}\\ \end{array}$$

A dynamic programming procedure known as the Viterbi algorithm is typically used to find the highest probability sequence.
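A minimal Viterbi sketch for this factored model is given below. The conditional model is assumed to be available as a function cond_prob(t, t_prev1, t_prev2, words, i) returning \(p(t_{i}\vert t_{i-1}t_{i-2}w_{i-2}w_{i-1}w_{i}w_{i+1}w_{i+2})\); the function name, the tag set argument, and the boundary symbol are illustrative, and in practice a beam that keeps only the highest-scoring states is often used to limit the search.

```python
import math

def viterbi(words, tagset, cond_prob, bd="*bd*"):
    # best[(t_prev2, t_prev1)] = (log-probability, tags so far) of the best
    # partial sequence ending in the tag pair (t_prev2, t_prev1)
    best = {(bd, bd): (0.0, [])}
    for i in range(len(words)):
        new_best = {}
        for (t2, t1), (logp, seq) in best.items():
            for t in tagset:
                score = logp + math.log(cond_prob(t, t1, t2, words, i))
                key = (t1, t)                      # the new pair of previous tags
                if key not in new_best or score > new_best[key][0]:
                    new_best[key] = (score, seq + [t])
        best = new_best
    return max(best.values())[1]                   # highest-probability tag sequence
```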

Other NLP Applications

Other NLP applications have used maximum entropy models to predict a wide variety of linguistic structure. The statistical parser in Ratnaparkhi (1999) uses separate maximum entropy models for part-of-speech, chunk, and parse structure prediction. The system in Borthwick (1999) uses maximum entropy models for named entity detection, while the system in Ittycheriah et al. (2001) uses them as sub-components for both answer type prediction and named entity detection. Typically, such applications do not need to change the core framework, but instead need to modify the meaning of the predictions, questions, and feature selection to suit the intended task of the application.

Future Directions

Conditional random fields (Lafferty et al. 2001), or CRFs, are an alternative to maximum entropy models that address the label bias issue. Label bias affects sequence models that predict one element at a time, in which features at a given state (or word, in the case of POS tagging) compete with each other, but do not compete with features at any other state in the sequence. In contrast, a CRF model directly produces a probability distribution over the entire sequence, and therefore allows global competition of features across the entire sequence. The parameter estimation for CRFs is related to the Generalized Iterative Scaling algorithm used for maximum entropy models. See Sha and Pereira (2003) for an example of CRFs applied to noun phrase chunking.

A more recent direction is the work of Collobert et al. (2011), which presents a multi-layer neural network approach for several sequence labeling tasks, including POS tagging. This approach avoids task-specific feature engineering – like the questions in section “Model Specification” – and instead uses the neural network training algorithm to discover internal representations for the word and tag context. It also uses large amounts of unlabeled data to enhance the internal representations for words.