5.1 The Traditional Framework of Text Classification

For simplicity of description, in the following we use “document” to refer to a piece of text at any level, and the text classification or text categorization problem can accordingly also be called document classification or document categorization. If not specified otherwise, the methods described below apply not only to document classification but also to text classification at other levels (e.g., sentence classification). As shown in Fig. 5.1, the goal of document classification is to divide a collection of documents into a set of predefined categories such as “technology,” “sports,” or “entertainment.”

Fig. 5.1 An example of document classification

The traditional framework of a document classification system is represented in Fig. 5.2. The system consists of three separate components: text representation, feature selection, and classification. The literature (Sebastiani 2002) summarizes text classification techniques according to this framework. The three stages are normally separate in traditional document classification. In the following three subsections, we will introduce these three stages respectively.

Fig. 5.2 The main components of text classification based on traditional machine learning

In document classification, a document must be correctly and efficiently represented for subsequent classification algorithms. The representation method must faithfully reflect the content of the text and have sufficient ability to distinguish different types of text. We have systematically introduced text representation methods in Chap. 3. For more details on the text representation methods, particularly the traditional vector space model, readers can refer to Sect. 3.1, and we will not go into detail on these here. However, it is worth noting that the selection of a text representation method depends on the choice of classification algorithm. For example, discriminative classification models (such as ME and SVM) usually use the vector space model (VSM) for text representation. Text representation in a generative model (such as NB) is determined by the class-conditional distribution hypothesis, e.g., the multinomial distribution or the multivariate Bernoulli distribution.

There are two steps to using the vector space model for text representation: (1) generating a feature vector composed of a sequence of terms (e.g., the vocabulary) based on the training data and (2) assigning a weight to each term in the vector and performing some normalization for each document in the training and testing datasets. The vector space model is simple to use, but it loses too much information from the original documents.
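As a minimal sketch of these two steps (the function names and the tiny corpus below are our own illustrations), one can build the vocabulary from the training documents and then map any document to a term-frequency vector with L2 normalization:

```python
import math
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Step 1: collect the set of terms seen in the training data."""
    terms = sorted({term for doc in tokenized_docs for term in doc})
    return {term: idx for idx, term in enumerate(terms)}

def doc_to_vector(tokens, vocab):
    """Step 2: assign a term-frequency weight to each vocabulary term
    and L2-normalize the resulting vector."""
    vec = [0.0] * len(vocab)
    for term, freq in Counter(tokens).items():
        if term in vocab:                 # out-of-vocabulary terms are dropped
            vec[vocab[term]] = float(freq)
    norm = math.sqrt(sum(w * w for w in vec)) or 1.0
    return [w / norm for w in vec]

train_docs = [["university", "computer", "game"], ["volleyball", "medal", "game"]]
vocab = build_vocabulary(train_docs)
print(doc_to_vector(["computer", "game", "game"], vocab))
```

TF-IDF weighting or any other scheme from Sect. 3.1 could replace the raw term frequency without changing the overall procedure.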

We construct a small dataset for text classification, as shown in Table 5.1. The dataset includes two classes: “education” and “sport.” The training set includes four documents (train_d1 and train_d2 belong to the education class, and train_d3 and train_d4 belong to the sport class), and the test set consists of two documents (test_d1 and test_d2). Table 5.2 provides the vocabulary of the text classification dataset. Each document can be represented as a vector in the vector space based on this vocabulary.

Table 5.1 Text classification dataset
Table 5.2 Vocabulary for the text classification dataset given in Table 5.1

5.2 Feature Selection

The traditional vector space model represents a document based on a high-dimensional vector space. To reduce the noise contained in such a high-dimensional vector and improve the computational efficiency, it is necessary to reduce its dimension before performing classification. In machine learning and pattern recognition, dimension reduction methods fall into two main categories: feature extraction and feature selection.

The purpose of feature extraction is to map the original high-dimensional sparse feature space into a low-dimensional dense feature space. The classical feature extraction methods include principal component analysis (PCA) and independent component analysis (ICA).

Feature selection is the process of selecting a subset of features for text representation and classification. In comparison with feature extraction, feature selection has been more widely discussed and used for text data. The feature selection methods for text classification are normally divided into unsupervised and supervised methods. The former can be applied to a corpus without category annotation, but its effect is often limited; representative unsupervised approaches include term frequency (TF) and document frequency (DF). The latter relies on category annotation and can therefore select a better subset of features for text classification more effectively. The commonly used supervised methods include the mutual information (MI), information gain (IG), and chi-square statistic (χ 2) methods. Yang and Pedersen (1997) and Forman (2003) systematically summarized the feature selection methods used in text classification and pointed out that a good feature selection algorithm can effectively reduce the feature space, remove redundant and noisy features, and improve the efficiency of the classifier.

In the following subsections, we introduce the supervised feature selection methods.

5.2.1 Mutual Information

In information theory, suppose that X is a discrete random variable whose probability distribution is \(p\left ( x\right )=P\left ( {X=x} \right )\). The entropy of X is defined as follows:

$$\displaystyle \begin{aligned} H\left( X \right)=-\sum \limits_x p\left( x \right)\log p\left( x \right) \end{aligned} $$
(5.1)

Entropy, also known as the expectation of self-information, is used to measure the average level of “information” or “uncertainty” inherent in the variable’s possible outcomes. If a random variable has greater entropy, it has greater uncertainty, and consequently, a larger amount of information is needed to represent it, while less entropy means less uncertainty and requires less information.

Suppose X and Y are a pair of discrete random variables with the joint distribution \(p\left ( {x,y} \right )=P\left ({X=x,Y=y} \right )\). Then, the joint entropy of X and Y is defined as:

$$\displaystyle \begin{aligned} H\left( {X,Y} \right)=-\sum \limits_x \sum_y p\left( {x,y} \right)\log p\left( {x,y} \right) \end{aligned} $$
(5.2)

The joint entropy indicates the uncertainty (i.e., the amount of information needed for representation) of a pair of random variables.

The conditional entropy describes the uncertainty of random variable Y given the value of random variable X. In other words, it indicates the amount of additional information needed to represent Y under the condition that the value of X is known. It can be defined as follows:

$$\displaystyle \begin{aligned} H\left( {Y\vert X} \right)&=\sum \limits_x p\left( x \right)H\left( {Y\vert X=x} \right)\\ &=-\sum \limits_x \sum \limits_y p\left( {x,y} \right)\log p\left( {y\vert x} \right) \end{aligned} $$
(5.3)

H(Y |X) = 0 if and only if the value of Y is completely determined by X. Conversely, H(Y |X) = H(Y ) if and only if Y and X are independent of each other.

The relationship between entropy, joint entropy, and conditional entropy can be described as follows:

$$\displaystyle \begin{aligned} H\left( {Y\vert X} \right)=H\left( {X,Y} \right)-H\left( X \right)\end{aligned} $$
(5.4)

Figure 5.3 displays the relationship between entropy, joint entropy, and conditional entropy. Suppose that the circle on the left represents entropy H(X) and the circle on the right represents entropy H(Y ). The union of the two circles represents the joint entropy H(X, Y ), the crescent on the left represents the conditional entropy H(X|Y ), the crescent on the right represents the conditional entropy H(Y |X), and the intersection of the two circles is called the mutual information of X and Y , as defined below.

Fig. 5.3 The relationship between entropy, joint entropy, and conditional entropy

Mutual information reflects the degree to which two random variables are related to each other. For two discrete random variables X and Y , their mutual information is defined as:

$$\displaystyle \begin{aligned} I\left( {X;Y} \right)=\sum \limits_{x,y} p\left( {x,y} \right)\log \frac{p\left( {x,y} \right)}{p\left( x \right)p\left( y \right)} \end{aligned} $$
(5.5)

The relationship between entropy, conditional entropy, and mutual information is as follows:

$$\displaystyle \begin{aligned} I\left( {X;Y} \right)=H\left( Y \right)-H\left( {Y\vert X} \right)=H\left( X \right)-H\left( {X\vert Y} \right)\end{aligned} $$
(5.6)

Mutual information is a measure of the interdependence between two random variables. It can be regarded as the amount of the reduction in uncertainty in a random variable when another random variable is known.

Let \(I\left ( {x;y} \right )=\log \frac {p(x,y)}{p(x)p(y)}\) denote the pointwise mutual information (PMI) of X and Y when they take the value (x, y). Equation (5.5) shows that MI is the expectation of PMI. In text classification, PMI measures the amount of discriminative information about class c j provided by feature t i.
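These identities can be checked numerically. The following sketch (the joint distribution is an arbitrary illustrative example) computes the entropies and the mutual information from a joint probability table and verifies that Eq. (5.5) and Eq. (5.6) give the same value:

```python
import numpy as np

# Joint distribution p(x, y) over two binary variables (rows: x, columns: y).
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_x, H_y = entropy(p_x), entropy(p_y)          # Eq. (5.1)
H_xy = entropy(p_xy.ravel())                   # Eq. (5.2)
H_y_given_x = H_xy - H_x                       # Eq. (5.4)
mi = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))   # Eq. (5.5)

print(round(mi, 6), round(H_y - H_y_given_x, 6))          # identical, per Eq. (5.6)
```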

For a given collection of documents, we first compute, for each term t i and class c j, the document-frequency counts shown in Table 5.3. \(N_{t_i,c_j}\) denotes the number of documents in class c j that contain term t i; \(N_{t_i,\bar {c}_j}\) denotes the number of documents outside class c j that contain t i; \(N_{\bar {t}_i,c_j}\) denotes the number of documents in class c j that do not contain t i; and \(N_{\bar {t}_i,\bar {c}_j}\) denotes the number of documents outside class c j that do not contain t i. \(N=N_{t_i,c_j} +N_{t_i,\bar {c}_j}+N_{\bar {t}_i,c_j } + N_{\bar {t}_i,\bar {c}_j }\) is the total number of documents.

Table 5.3 The document frequency statistic for each feature and class

Based on the principle of maximum likelihood estimation, we can estimate the following probability:

$$\displaystyle \begin{aligned}p\left( {c_j } \right)&=\frac{N_{t_i,c_j } +N_{\bar{t}_i,c_j }}{N} \end{aligned} $$
(5.7)
$$\displaystyle \begin{aligned}p\left( {t_i } \right)&=\frac{N_{t_i,c_j } +N_{t_i,\bar{c}_j }}{N} \end{aligned} $$
(5.8)
$$\displaystyle \begin{aligned}p\left( {c_j \vert t_i } \right)&=\frac{N_{t_i,c_j } +1}{N_{t_i,c_j }+N_{t_i,\bar{c}_j } +M} \end{aligned} $$
(5.9)
$$\displaystyle \begin{aligned}p\left( {c_j \vert \bar{t}_i } \right)&=\frac{N_{\bar{t}_i,c_j } +1}{N_{\bar{t}_i,c_j } +N_{\bar{t}_i,\bar{c}_j } +M} \end{aligned} $$
(5.10)

where M denotes the number of categories and \(p\left ( {c_j\vert t_i } \right )\) and \(p\left ({c_j \vert \bar {t}_i }\right )\) are estimated with Laplace smoothing to avoid zero probabilities.

On this basis, the mutual information \(I\left ( {t_i ;c_j } \right )\) between t i and c j can be calculated as

$$\displaystyle \begin{aligned} I\left( {t_i ;c_j } \right)=\log \frac{N_{t_i,c_j } N}{\left( {N_{t_i,c_j } +N_{\bar{t}_i,c_j } } \right)\left( {N_{t_i,c_j } +N_{t_i ,\bar{c}_j } } \right)}\end{aligned} $$
(5.11)

Finally, we can take either the weighted average of I(t i;c j):

$$\displaystyle \begin{aligned} I_{\text{avg}} (t_i)=\sum \limits_j p\left( {c_j } \right)I\left( {t_i;c_j } \right)\end{aligned} $$
(5.12)

or the maximum value among different classes

$$\displaystyle \begin{aligned} I_{\text{max}} (t_i)=\mathop {\max }\limits_j \left\{ {I\left( {t_i; c_j } \right)} \right\}\end{aligned} $$
(5.13)

to measure the amount of discriminative information that term t i contains for all classes.

The process of feature selection, hence, first calculates the MI score (Eq. (5.12) or Eq. (5.13)) for all terms, then ranks the terms according to their MI scores, and finally selects a subset of the top-ranked terms as the selected features.
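The following sketch illustrates this process on a toy corpus (the function name, the corpus, and the decision to skip zero counts are our own choices). It gathers the document-frequency counts of Table 5.3, scores each term with Eq. (5.11), averages over classes as in Eq. (5.12), and keeps the top-ranked terms; Eq. (5.13) would simply replace the average with a maximum:

```python
import math
from collections import defaultdict

def mi_feature_selection(docs, labels, top_k):
    """docs: list of token lists; labels: list of class labels."""
    N = len(docs)
    classes = set(labels)
    df = defaultdict(int)        # df[(term, c)] = number of docs of class c containing the term
    df_term = defaultdict(int)   # number of docs containing the term
    n_class = defaultdict(int)   # number of docs per class
    for tokens, c in zip(docs, labels):
        n_class[c] += 1
        for term in set(tokens):
            df[(term, c)] += 1
            df_term[term] += 1

    scores = {}
    for term in df_term:
        avg = 0.0
        for c in classes:
            n_tc = df[(term, c)]
            if n_tc == 0:
                continue                          # skip zero counts (PMI would be -inf)
            # Eq. (5.11): I(t;c) = log( N_tc * N / ((N_tc + N_t̄c) * (N_tc + N_tc̄)) )
            i_tc = math.log(n_tc * N / (n_class[c] * df_term[term]))
            avg += (n_class[c] / N) * i_tc        # Eq. (5.12): weight by p(c_j)
        scores[term] = avg
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

docs = [["university", "computer"], ["computer", "game"],
        ["volleyball", "medal"], ["volleyball", "game"]]
labels = ["education", "education", "sport", "sport"]
print(mi_feature_selection(docs, labels, top_k=3))
```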

Table 5.4 gives the results of the feature selection for the text classification dataset (Table 5.1) based on the MI method.

Table 5.4 MI feature selection results for the text classification dataset

5.2.2 Information Gain

Information gain (IG) denotes the reduction in uncertainty of the random variable Y given the condition that random variable X is observed:

$$\displaystyle \begin{aligned} G\left( {Y\vert X} \right)=H\left( Y \right)-H\left( {Y\vert X} \right) \end{aligned} $$
(5.14)

Such a reduction in uncertainty can be represented by the difference between H(Y ) and H(Y |X).

In the text classification task, we can regard a feature \(T_i \in \{t_i,\bar {t}_i\}\) as a binary random variable that has a Bernoulli distribution (also called a 0-1 distribution) and regard the class C as a random variable that has a categorical distribution. Based on this, information gain can be defined as the difference between entropy H(C) and conditional entropy H(C|T i):

$$\displaystyle \begin{aligned} G\left( {T_i } \right)&=H\left( C \right)-H\left( {C\vert T_i } \right)\\ &=- \sum \limits_j p\left( {c_j } \right)\log p\left( {c_j } \right)-\left[ \left( {- \sum \limits_j p\left( {c_j,t_i } \right)\log p\left( {c_j \vert t_i } \right)} \right)\right.\\ {} &\quad +\left.\left( {- \sum \limits_j p\left( {c_j,\bar{t}_i } \right)\log p\left( {c_j \vert \bar{t}_i } \right)} \right) \right] \end{aligned} $$
(5.15)

When using the same number of top-ranked features for text classification, IG performs significantly better than MI in many applications because information gain takes both t i and \(\bar t_i\) into consideration and can be viewed as a weighted average of the pointwise mutual information I(t i;c j) and \(I(\bar {t}_i;c_j)\) (Yang and Pedersen 1997):

$$\displaystyle \begin{aligned} G\left( {T_i } \right)= \sum \limits_j \left[ p(t_i,c_j) \cdot I(t_i;c_j)+p\left( {\bar{t}_i,c_j } \right) \cdot I\left( \bar{t}_i;c_j \right) \right] \end{aligned} $$
(5.16)
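A minimal sketch of Eq. (5.15) computed from the same document-frequency counts is given below; the names are illustrative, and Laplace smoothing as in Eqs. (5.9) and (5.10) keeps the conditional probabilities away from zero:

```python
import math

def information_gain(n_tc, n_class, N):
    """n_tc[c]: number of docs of class c containing the term;
    n_class[c]: number of docs of class c; N: total number of docs."""
    M = len(n_class)
    n_t = sum(n_tc.values())                       # docs containing the term
    # H(C)
    h_c = -sum((n_class[c] / N) * math.log(n_class[c] / N) for c in n_class)
    h_c_given_t = 0.0
    for c in n_class:
        n_tbar_c = n_class[c] - n_tc[c]
        # Laplace-smoothed p(c|t) and p(c|t̄), as in Eqs. (5.9)-(5.10)
        p_c_t = (n_tc[c] + 1) / (n_t + M)
        p_c_tbar = (n_tbar_c + 1) / ((N - n_t) + M)
        h_c_given_t += -(n_tc[c] / N) * math.log(p_c_t) \
                       - (n_tbar_c / N) * math.log(p_c_tbar)
    return h_c - h_c_given_t                       # Eq. (5.15)

# A term that appears only in the two "education" documents of a 4-document corpus.
print(information_gain({"education": 2, "sport": 0},
                       {"education": 2, "sport": 2}, N=4))
```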

Table 5.5 gives the feature selection results for the text classification dataset (Table 5.1) based on IG.

Table 5.5 IG feature selection results for the text classification dataset

5.2.3 The Chi-Squared Test Method

The chi-square (χ 2) test is a statistical hypothesis testing method. It is widely used to test the independence of two random variables by determining whether there is a statistically significant difference between the expected frequency and the observed frequency.

As applied to text classification, suppose term \(T_i \in \{t_i,\bar {t}_i \}\) and class \(C_j\in \left \{ {c_j,\bar {c}_j } \right \}\) are two binary random variables that both obey a Bernoulli distribution, where t i and \(\bar {t}_i\) represent whether t i appears in a document or not and c j and \(\bar {c}_j\) represent whether the class of a document is c j or not.

On this basis, we formulate the null hypothesis that T i and C j are independent of each other, that is, \(p(T_i,C_j)=p\left ( {T_i }\right ) \cdot p\left ({C_j } \right )\). For each term T i and class C j, we calculate the chi-square statistic:

$$\displaystyle \begin{aligned} \chi ^2\left( {T_i,C_j } \right)= \sum \limits_{T_i \in \left\{ {t_i ,\bar{t}_i } \right\}}\ \sum \limits_{C_j \in \left\{ {c_j,\bar{c}_j } \right\}} \frac{(N_{T_i,C_j } -E_{T_i,C_j })^2}{E_{T_i,C_j } } \end{aligned} $$
(5.17)

where \(N_{T_i, C_j}\) denotes the observed document frequency defined in Table 5.3 and \(E_{T_i, C_j}\) denotes the expected document frequency based on the null hypothesis (i.e., T i and C j are independent of each other).

Under the null hypothesis, \(E_{t_i,c_j }\) can be estimated using the probability estimates in Eqs. (5.7) and (5.8) as follows:

$$\displaystyle \begin{aligned} E_{t_i,c_j } &=N\cdot p\left( {t_i,c_j } \right)=N\cdot p\left( {t_i } \right)\cdot p\left( {c_j } \right)\\ &=N\cdot \frac{N_{t_i,c_j } +N_{t_i ,\bar{c}_j } }{N}\cdot \frac{N_{t_i,c_j } +N_{\bar{t}_i,c_j } }{N} \end{aligned} $$
(5.18)

Similarly, we can estimate \(E_{\bar {t}_i,c_j } \), \(E_{t_i,\bar {c}_j}\), and \(E_{\bar {t}_i,\bar {c}_j }\). Finally, substituting these results into Eq. (5.17), the chi-square statistic can be written as:

$$\displaystyle \begin{aligned} \chi^2\left(T_i,C_j\right)=\frac{N\cdot \left(N_{t_i,c_j} N_{\bar{t}_i,\bar{c}_j}-N_{\bar{t}_i,c_j}N_{t_i,\bar{c}_j} \right)^2}{\left( N_{t_i,c_j}+N_{\bar{t}_i,c_j}\right)\cdot \left(N_{t_i,c_j}+N_{t_i,\bar{c}_j}\right)\cdot \left(N_{t_i,\bar{c}_j} +N_{\bar{t}_i,\bar{c}_j} \right)\cdot \left(N_{\bar{t}_i,c_j} +N_{\bar{t}_i,\bar{c}_j } \right)}\end{aligned} $$
(5.19)

The higher the \(\chi ^2\left ( {T_i,C_j } \right )\) value, the less valid the null hypothesis is, and the higher the correlation between T i and C j.

As with MI, the weighted average or maximum of \(\chi ^2\left ( {T_i,C_j }\right )\) across all classes can measure the amount of discriminative information contained in term T i, and the top-ranked terms can be used as the selected features:

$$\displaystyle \begin{aligned} \chi _{\text{max}}^2 \left( {T_i } \right)&=\mathop{\max }\limits_{j=1,\ldots,M} \left\{ {\chi ^2\left( {T_i,C_j } \right)} \right\} \end{aligned} $$
(5.20)
$$\displaystyle \begin{aligned} \chi _{\text{avg}}^2 \left( {T_i } \right)&=\sum \limits_{j=1}^M p(c_j)\chi ^2\left( {T_i,C_j } \right)\end{aligned} $$
(5.21)
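Because Eq. (5.19) only needs the four counts of Table 5.3, the statistic is straightforward to compute. The following sketch (names are illustrative) scores one term-class pair; feature selection then keeps the terms with the largest maximum (Eq. (5.20)) or class-weighted average (Eq. (5.21)) score:

```python
def chi_square(n_tc, n_tcbar, n_tbarc, n_tbarcbar):
    """Eq. (5.19): chi-square statistic from the 2x2 document-frequency table."""
    n = n_tc + n_tcbar + n_tbarc + n_tbarcbar
    num = n * (n_tc * n_tbarcbar - n_tbarc * n_tcbar) ** 2
    den = ((n_tc + n_tbarc) * (n_tc + n_tcbar)
           * (n_tcbar + n_tbarcbar) * (n_tbarc + n_tbarcbar))
    return num / den if den else 0.0

# A term perfectly correlated with a class in a 4-document corpus: chi2 = N = 4.
print(chi_square(n_tc=2, n_tcbar=0, n_tbarc=0, n_tbarcbar=2))
```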

Table 5.6 shows the results from the feature selection for the text classification dataset (Table 5.1) using the χ 2 method.

Table 5.6 Chi-square feature selection results for the text classification dataset

5.2.4 Other Methods

Nigam et al. (2000) proposed a weighted log-likelihood ratio (WLLR) method to measure the correlation between term t i and class c j for feature selection for text classification:

$$\displaystyle \begin{aligned} \text{WLLR}(t_i,c_j)&=p\left( {t_i \vert c_j } \right) \cdot \log \frac{p\left( {t_i \vert c_j } \right)}{p\left( {t_i \vert \bar{c} _j } \right)}\\ &=\frac{N_{t_i,c_j } }{N_{t_i,c_j } +N_{\bar{t}_i,c_j } } \cdot \log \frac{N_{t_i,c_j } (N_{t_i,\bar{c} _j } +N_{\bar{t}_i,\bar{c}_j })}{N_{t_i,\bar{c} _j } (N_{t_i,c_j } +N_{\bar{t}_i,c_j })} \end{aligned} $$
(5.22)

Li et al. (2009a) further analyzed six kinds of feature selection methods (MI, IG, χ 2, WLLR, and so on). They found that the frequency \(p\left ({t_i \vert c_j} \right )\) and the odds ratio \( \frac {p\left ({t_i \vert c_j } \right )}{p\left ({t_i\vert \bar {c}_j }\right )}\) are two basic components of feature selection and that the above feature selection methods can be formulated as combinations of these two components. Based on this, they proposed a general feature selection method for text classification called weighted frequency and odds (WFO):

$$\displaystyle \begin{aligned} \text{WFO}\left(t_i,c_j\right)&=p\left(t_i \vert c_j\right)^{\lambda} \left(\log \frac{p\left(t_i \vert c_j\right)}{p\left(t_i \vert \bar{c}_j\right)}\right)^{1-{\lambda}}\\ {} &=\left( \frac{N_{t_i,c_j}}{N_{t_i,c_j}+N_{\bar{t}_i,c_j}} \right)^{\lambda} \left(\log \frac{N_{t_i,c_j} (N_{t_i,\bar{c}_j}+ N_{\bar{t}_i,\bar{c}_j})}{N_{t_i,\bar{c}_j}(N_{t_i,c_j}+N_{\bar{t}_i,c_j})}\right)^{1-{\lambda}} \end{aligned} $$
(5.23)
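A minimal sketch of Eq. (5.23) is shown below; λ = 1 emphasizes the frequency component and λ = 0 the log-odds component. The names, the small smoothing constant, and the choice to clamp non-positive log-odds at zero are our own additions:

```python
import math

def wfo(n_tc, n_tcbar, n_tbarc, n_tbarcbar, lam=0.5, eps=1e-9):
    """Eq. (5.23): weighted frequency and odds score of a term for a class."""
    p_t_c = n_tc / (n_tc + n_tbarc + eps)                 # p(t|c)
    p_t_cbar = n_tcbar / (n_tcbar + n_tbarcbar + eps)     # p(t|c̄)
    log_odds = math.log((p_t_c + eps) / (p_t_cbar + eps))
    if log_odds <= 0:
        return 0.0      # the term is not positively associated with the class
    return (p_t_c ** lam) * (log_odds ** (1 - lam))

print(wfo(n_tc=2, n_tcbar=0, n_tbarc=0, n_tbarcbar=2, lam=0.5))
```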

We assume that the feature space obtained after feature selection is {computer, volleyball, game, medal, university}. Based on the reduced feature set, the text representations of the documents in Table 5.1 are shown in Table 5.7.

Table 5.7 The text classification dataset (Table 5.1) after feature selection

5.3 Traditional Machine Learning Algorithms for Text Classification

After text representation and feature selection, the next step is to employ a classification algorithm to predict the class labels of the documents. Early text classification algorithms included the Rocchio approach, the K-nearest neighbor classifier, and decision trees. The most widely used text classification algorithms in traditional machine learning are naïve Bayes (NB), maximum entropy (ME), and support vector machines (SVM).

5.3.1 Naïve Bayes

The Bayesian model is a kind of generative algorithm that models the joint distribution p(x, y) of the observation x and its class y. In practice, the joint distribution is transformed into the product of the class-prior distribution p(y) and the class-conditional distribution p(x|y):

$$\displaystyle \begin{aligned} p\left( {{\boldsymbol x},y} \right)=p(y) \times p({\boldsymbol x}\vert y) \end{aligned} $$
(5.24)

The Bernoulli distribution or the categorical distribution can be used to model the former for binary and multiclass classifications, respectively. The remaining problem is how to estimate the class-conditional distribution p(x|y) for different applications.

In text classification, to solve the above problem, it is necessary to further simplify the class-conditional distribution of documents. A simple way is to ignore the word order in the document and assume that a document is a bag of words in which the individual words are interchangeable. Mathematically, such a simplification can be described as the assumption that the class-conditional distributions of words are independent of each other. Based on this assumption, the class-conditional distribution of a document can be written as the product of the class-conditional distributions of its words. Such a bag-of-words assumption is also consistent with discriminative models, in which a document is represented by the vector space model. The Bayesian model under this assumption is called the naïve Bayes model.

There are two main hypotheses for the class-conditional distribution, known as the multinomial distribution and the multivariate Bernoulli distribution (McCallum et al. 1998). The multivariate Bernoulli distribution only captures the presence of words in a document and ignores their frequency. In comparison, the multinomial distribution is used more often and has generally better classification performance. In this section, we will introduce the naïve Bayes model based on the multinomial distribution.

First, we represent a document x as a sequence of words:

$$\displaystyle \begin{aligned} {\boldsymbol x}=\left[ {w_1,w_2,\cdots,w_{\left| {\boldsymbol x} \right|} } \right]\end{aligned} $$
(5.25)

Under the bag-of-words assumption, p(x|y) has the form of a multinomial distribution:

$$\displaystyle \begin{aligned} {p({\boldsymbol x}\vert c_j)} & {=p([w_1,w_2,\cdots,w_{\left| {\boldsymbol x} \right|} ]\vert c_j)}\\ {} & =\prod \limits_{i=1}^V p(t_i \vert c_j)^{N(t_i,{\boldsymbol x})} \end{aligned} $$
(5.26)

where V is the dimension of the vocabulary, t i is the i-th term in the vocabulary, θ i|j = p(t i|c j) is the probability of occurrence of t i in class c j, and N(t i, x) is the term frequency of t i in document x.

We take the multiclass classification problem as an example for description. We assume that the class y obeys the categorical distribution:

$$\displaystyle \begin{aligned} p\left( {y=c_j } \right)=\pi _j \end{aligned} $$
(5.27)

According to the assumption of a multinomial distribution, the joint distribution of \(p\left ( {{\boldsymbol x},y} \right )\) can be written as

$$\displaystyle \begin{aligned} p({\boldsymbol x},y=c_j)=p(c_j) \cdot p({\boldsymbol x}\vert c_j)=\pi _j \prod \limits_{i=1}^V \theta _{i\vert j} ^{N(t_i,{\boldsymbol x})}\end{aligned} $$
(5.28)

Naïve Bayes learns the parameters (π, θ) based on the principle of maximum likelihood estimation (MLE). Given the training set \(\left \{ {{\boldsymbol x}_k,y_k } \right \}_{k=1}^N \), the optimization objective is to maximize the log-likelihood function \(L\left ( {{\boldsymbol \pi },{\boldsymbol \theta }} \right )=\log \prod \limits _{k=1}^N p\left ( {{\boldsymbol x}_k,y_k }\right )\). By solving the MLE problem, we obtain the estimated value of the parameters:

$$\displaystyle \begin{aligned} \pi_j &=\frac{\sum_{k=1}^N I(y_k =c_j)}{\sum_{k=1}^N\sum_{j'=1}^C I(y_k =c_{j'})}=\frac{N_j }{N} \end{aligned} $$
(5.29)
$$\displaystyle \begin{aligned} \theta_{i\vert j} &=\frac{\sum_{k=1}^N I(y_k =c_j)N(t_i,{\boldsymbol x}_k)}{\sum_{k=1}^N I(y_k =c_j)\sum_{i'=1}^V N(t_{i'},{\boldsymbol x}_k)}\end{aligned} $$
(5.30)

It can be seen that the estimated class-prior probability π j is the proportion of training documents belonging to the j-th class, and the estimated class-conditional probability of term t i in class c j is the relative frequency of t i among all term occurrences in the documents of class c j.

To prevent the occurrence of zero probabilities, a Laplace smoothing technique is often applied to Eq. (5.30):

$$\displaystyle \begin{aligned} \theta _{i\vert j} =\frac{\sum_{k=1}^N I(y_k =c_j)N(t_i ,{\boldsymbol x}_k)+1}{\sum_{i'=1}^V \sum_{k=1}^N I(y_k =c_j)N(t_{i'},{\boldsymbol x}_k)+V}\end{aligned} $$
(5.31)
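The following sketch (class and variable names are ours) estimates the parameters with Eqs. (5.29) and (5.31) and predicts with the joint probability of Eq. (5.28), computed in log space; the toy corpus is illustrative rather than the exact contents of Table 5.7:

```python
import math
from collections import Counter

class SimpleMultinomialNB:
    def fit(self, docs, labels):
        """docs: list of token lists; labels: list of class labels."""
        self.vocab = sorted({t for d in docs for t in d})
        n_docs = Counter(labels)
        self.log_prior = {c: math.log(n / len(docs)) for c, n in n_docs.items()}  # Eq. (5.29)
        counts = {c: Counter() for c in n_docs}
        for d, c in zip(docs, labels):
            counts[c].update(d)
        V = len(self.vocab)
        self.log_cond = {}
        for c in n_docs:
            total = sum(counts[c].values())
            # Eq. (5.31): Laplace-smoothed class-conditional term probabilities
            self.log_cond[c] = {t: math.log((counts[c][t] + 1) / (total + V))
                                for t in self.vocab}
        return self

    def predict(self, tokens):
        scores = {}
        for c in self.log_prior:
            # log p(x, c) = log pi_c + sum_t N(t, x) * log theta_{t|c}   (Eq. (5.28))
            scores[c] = self.log_prior[c] + sum(self.log_cond[c][t]
                                                for t in tokens if t in self.vocab)
        return max(scores, key=scores.get)

model = SimpleMultinomialNB().fit(
    [["university", "computer"], ["computer", "game"],
     ["volleyball", "medal"], ["volleyball", "game", "medal"]],
    ["education", "education", "sport", "sport"])
print(model.predict(["computer", "university", "game"]))
```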

We train a multinomial naïve Bayes model on the dimension-reduced training set (Table 5.7). Let t 1 = computer, t 2 = volleyball, t 3 = game, t 4 = medal, t 5 = university, and y = 1 for the class “education” and y = 0 for the class “sport.” The parameter estimation results are shown in Table 5.8.

Table 5.8 Naïve Bayes parameter estimation on the dimension-reduced text classification dataset (Table 5.7)

We classify the test documents in Table 5.7 based on the above model. Suppose the representation of test document test_d1 is x 1. The joint probabilities of x 1 and each class are

According to Bayes’ theorem, the posterior probabilities of x 1 belonging to each class are

Thus, test_d1 belongs to the “education” class.

Similarly, the joint probabilities of the test document test_d2 belonging to each class are

The posterior probabilities are

Thus, test_d2 belongs to the “sport” class.

5.3.2 Logistic/Softmax and Maximum Entropy

Logistic regression is a classification algorithm, although its name contains the term “regression.” It is a linear classification model that is widely used for binary classification. Softmax regression is its extension from binary classification to multiclass classification. In natural language processing, there is also a commonly used model called maximum entropy (ME). Although softmax regression and maximum entropy were proposed in different ways, their essence is the same.

We now introduce the three models in turn, with an emphasis on logistic regression.

We begin with the sigmoid function \(\sigma \left ( z \right )= \frac {1}{1+\text{e}^{-z}}\), which maps any real number (from negative infinity to positive infinity) into the interval (0, 1) and is therefore often used to turn a real-valued score into a probability. Its derivative is

$$\displaystyle \begin{aligned} \frac{\text{d}\sigma (z)}{\text{d}z}=\sigma \left( z \right)\left( {1-\sigma \left( z \right)}\right)\end{aligned} $$
(5.32)

For a binary classification problem, let y ∈{0, 1} denote its class, x denote the feature vector, and θ denote the weight vector. Logistic regression defines the posterior probability of y ∈{0, 1} given x as follows:

$$\displaystyle \begin{aligned} \left\{ {{\begin{array}{*{20}l} {p(y=1\vert {\boldsymbol x};{\boldsymbol \theta})=h_{\boldsymbol \theta} ({\boldsymbol x})=\sigma ({\boldsymbol \theta}^{\text{T}}{\boldsymbol x})} \\ {p(y=0\vert {\boldsymbol x};{\boldsymbol \theta})=1-h_{\boldsymbol \theta} ({\boldsymbol x})} \\ \end{array} }} \right.\end{aligned} $$
(5.33)

where the probability p(y = 1|x) is defined by a logistic function.

The two equations above can be written in a unified form:

$$\displaystyle \begin{aligned} p(y\vert {\boldsymbol x};{\boldsymbol \theta})&=(h_{\boldsymbol \theta}({\boldsymbol x}))^y(1-h_{\boldsymbol \theta} ({\boldsymbol x}))^{\left( {1-y} \right)}\\ &=\left( {\frac{1}{1+\text{e}^{-{\boldsymbol \theta}^{\text{T}}{\boldsymbol x}}}} \right)^y\left( {1-\frac{1}{1+\text{e}^{-{\boldsymbol \theta}^{\text{T}}{\boldsymbol x}}}} \right)^{\left( {1-y} \right)}\end{aligned} $$
(5.34)

For the hypothesis given by Eq. (5.34), logistic regression estimates the parameters based on the principle of maximum likelihood estimation. Given a training set \(\left \{{\left ( {\boldsymbol x}_i,y_i \right )}\right \},i=1,\cdots ,N\), the log-likelihood of the model is

$$\displaystyle \begin{aligned} l({\boldsymbol \theta})=\sum \limits_{i=1}^N y_i\mbox{log}\,h_{\boldsymbol \theta} ({\boldsymbol x}_i)+(1-y_i)\mbox{log}\left( {1-h_{\boldsymbol \theta} ({\boldsymbol x}_i)} \right)\end{aligned} $$
(5.35)

First-order optimization methods such as gradient ascent and stochastic gradient ascent are usually used to solve this optimization problem. In addition, quasi-Newton methods such as BFGS (Broyden–Fletcher–Goldfarb–Shanno) and L-BFGS (limited-memory BFGS) are also used to increase learning efficiency for large-scale training data.
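A minimal numpy sketch of logistic regression trained by batch gradient ascent on the log-likelihood of Eq. (5.35) is given below; the gradient used is \(\sum _i (y_i -h_{\boldsymbol \theta }({\boldsymbol x}_i)){\boldsymbol x}_i\), and the toy data and hyperparameters are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """X: (N, d) feature matrix (a bias column can be appended by the caller);
    y: (N,) array of 0/1 labels. Returns the weight vector theta."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        h = sigmoid(X @ theta)          # h_theta(x_i) for every example
        theta += lr * X.T @ (y - h)     # gradient ascent on Eq. (5.35)
    return theta

# Toy data: two features plus a constant bias feature.
X = np.array([[1.0, 0.2, 1], [0.9, 0.1, 1], [0.1, 0.9, 1], [0.2, 1.0, 1]])
y = np.array([1, 1, 0, 0])
theta = train_logistic_regression(X, y)
print(sigmoid(X @ theta).round(2))      # predicted p(y=1|x) for each example
```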

Softmax regression is the extension of logistic regression from binary classification to multiclass classification, and it is also called multiclass logistic regression. Logistic regression can also be viewed as a special case of softmax regression where the number of classes is two. Softmax regression is the most widely used classification algorithm in traditional machine learning and is often used as the last layer of deep neural networks to perform classification.

Given the feature vector x, the posterior probability of class y = c j is defined in terms of a softmax function as follows:

$$\displaystyle \begin{aligned} p(y=c_j \vert {\boldsymbol x};{\boldsymbol \varTheta })& =h_j ({\boldsymbol x})\\ &=\frac{\text{exp}({\boldsymbol \theta} _j^{\text{T}} {\boldsymbol x})}{\sum_{l=1}^C \text{exp}({\boldsymbol \theta}_l^{\text{T}} {\boldsymbol x})},\quad j=1,2,\cdots,C \end{aligned} $$
(5.36)

where the parameters of the model are Θ = {θ j}, j = 1, ⋯ , C.

Given the training set \(\left \{ {\left ({{\boldsymbol x}_1,y_1}\right ),\dots ,\left ({ {\boldsymbol x}_N,y_N}\right )} \right \}\), the log-likelihood of softmax regression is

$$\displaystyle \begin{aligned} L\left( {\boldsymbol \varTheta} \right)=\sum \limits_{i=1}^N \sum \limits_{j=1}^C I(y_i=c_j)\mbox{log}\ h_j ({\boldsymbol x}_i) \end{aligned} $$
(5.37)

Note that the negative log-likelihood of softmax regression is also called the cross-entropy loss function and is widely used in classification. It is worth noting that softmax regression and naïve Bayes can be seen as a discriminative-generative pair of models (Ng and Jordan 2002).

Maximum entropy is another widely used model for NLP classification tasks and is closely related to softmax regression. Maximum entropy assigns a joint probability to observation-label pairs (x, y) based on a log-linear model that is quite similar to softmax regression:

$$\displaystyle \begin{aligned} \mathop{p} \limits_{\vec{\boldsymbol{\theta}}}(\boldsymbol{x},y)=\frac{\exp(\vec{\boldsymbol{\theta}} \cdot f(\boldsymbol{x},y))}{\sum_{\boldsymbol{x}',y'}\exp(\vec{\boldsymbol{\theta}} \cdot f(\boldsymbol{x}', y'))} \end{aligned} $$
(5.38)

where \(\vec {\boldsymbol {\theta }}\) is a vector of weights, as mentioned earlier, and f is a function that maps a pair (x, y) to a binary-valued feature vector.

The feature vector in softmax regression is defined based on the vector space model of the observation x. In maximum entropy, it is defined by the following feature function, which describes the known relationships between the observation x and the class label y:

$$\displaystyle \begin{aligned} f_i \left( {{\boldsymbol x},y} \right)=\left\{ \begin{array}{ll} 1,& {\boldsymbol x}\mbox{ satisfies a certain fact, and }y\mbox{ belongs to a certain category}\\ 0,& \mbox{otherwise}\\ \end{array}\right. \end{aligned} $$
(5.39)

By using the text classification dataset in Table 5.7, for example, we can construct one feature function of (x, y) as follows: whether the category is “education” when the document contains the word “university.” When the feature template is consistent with the definition of the vector space model of softmax regression, the two models are equivalent. It has also been proven that the parameter estimation principles of the two methods (i.e., maximum entropy and maximum likelihood) are also identical.
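The feature function of Eq. (5.39) is simply an indicator over an (x, y) pair; the example from the text can be written as follows (an illustrative sketch with our own names):

```python
def f_university_education(x_tokens, y):
    """Eq. (5.39)-style binary feature: fires when the document contains
    the word "university" and its label is "education"."""
    return 1 if ("university" in x_tokens and y == "education") else 0

print(f_university_education(["university", "computer"], "education"))  # 1
print(f_university_education(["volleyball", "medal"], "education"))     # 0
```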

5.3.3 Support Vector Machine

Support vector machine (SVM) is a supervised discriminative learning algorithm for binary classification. It is one of the most popular and widely discussed algorithms in traditional machine learning. There are two core ideas in SVM: first, if the data points are linearly separable, a good separating hyperplane is the one that has the largest distance to the nearest training data points on both sides; second, if the data points are not linearly separable, SVM can map the data into a higher-dimensional space through a nonlinear transformation based on kernel functions, where the data points may become linearly separable. The linear SVM is widely used for text classification.

The logistic regression model mentioned above is also a kind of linear binary classification model that uses maximum likelihood as its learning criterion. The learning criterion used in linear SVM is called the maximum margin.

For a linear classification hypothesis

$$\displaystyle \begin{aligned} f\left( {\boldsymbol x} \right)={\boldsymbol w}^{\text{T}}{\boldsymbol x}+b\end{aligned} $$
(5.40)

the corresponding classification hyperplane is w T x + b = 0. The maximum margin criterion can be expressed as

$$\displaystyle \begin{aligned} &\mathop{\min }\limits_{{\boldsymbol w},b}\ \frac{1}{2}\|{\boldsymbol w}\|{}^2 {}\\ {} &~\text{s.t.}\quad y_i\left( {{\boldsymbol w}^{\text{T}}{\boldsymbol x}_i+b} \right)\ge 1,\quad i=1,\cdots,N\end{aligned} $$
(5.41)

As a quadratic optimization problem, it can be solved with any off-the-shelf quadratic programming optimization package. However, instead of directly solving the primal optimization problem, SVM tends to solve the following dual problem based on the Lagrange multiplier method:

$$\displaystyle \begin{aligned} &\mathop{\max }\limits_{\boldsymbol \alpha}\ \sum \limits_{i=1}^N \alpha _i -\frac{1}{2}\sum \limits_{i=1}^N \sum \limits_{j=1}^N \alpha _i \alpha _j y_i y_j {\boldsymbol x}_i^{\text{T}} {\boldsymbol x}_j\\ {} &~\text{s.t.}\quad \sum \limits_{i=1}^N \alpha _i y_i =0,\quad \alpha _i \ge 0,\quad i=1,\cdots,N\end{aligned} $$
(5.42)

where α i ≥ 0 is the Lagrange multiplier. The solutions of the dual problem satisfy the Karush–Kuhn–Tucker (KKT) conditions. According to the KKT conditions, only the weights of data points on the boundaries of the margin are positive (α i > 0), and the weights of the remaining data points are all zero (α i = 0). It can be further inferred that the classification hyperplane is only supported by the data points on the boundaries. This is the main reason why the model is called a “support vector” machine.

The hard-margin SVM works only when the data are completely linearly separable. In the presence of noise points or outliers, the margin may become very small, or a separating hyperplane may not exist at all. To solve this problem, the soft-margin SVM was proposed, which introduces slack variables into the primal problem:

$$\displaystyle \begin{aligned} &\mathop{\min }\limits_{{\boldsymbol w},b,{\boldsymbol \xi}}\ \frac{1}{2}\|{\boldsymbol w}\|{}^2+C\sum \limits_{i=1}^N \xi _i\\ {} &~\text{s.t.}\quad y_i\left( {{\boldsymbol w}^{\text{T}}{\boldsymbol x}_i+b} \right)\ge 1-\xi _i,\quad \xi _i \ge 0,\quad i=1,\cdots,N\end{aligned} $$
(5.43)

where ξ i is the slack variable and C is the parameter that determines the tradeoff between increasing the margin size and ensuring that the points lie on the correct side of the margin boundary. The corresponding dual problem of soft-margin SVM is

$$\displaystyle \begin{aligned} &\mathop{\max }\limits_{\boldsymbol \alpha}\ \sum \limits_{i=1}^N \alpha _i -\frac{1}{2}\sum \limits_{i=1}^N \sum \limits_{j=1}^N \alpha _i \alpha _j y_i y_j {\boldsymbol x}_i^{\text{T}} {\boldsymbol x}_j\\ {} &~\text{s.t.}\quad \sum \limits_{i=1}^N \alpha _i y_i =0,\quad 0\le \alpha _i \le C,\quad i=1,\cdots,N\end{aligned} $$
(5.44)

Meanwhile, to address the linearly nonseparable classification problem in low-dimensional space, SVM introduces the kernel function, which allows the algorithm to fit the maximum-margin hyperplane in a transformed high-dimensional feature space. Although the problem is linearly nonseparable in the original input space, it might be linearly separable in the transformed feature space.

The kernel function is defined as the inner product of two data points in the transformed feature space:

$$\displaystyle \begin{aligned} K({\boldsymbol x},{\boldsymbol z})=\varphi ({\boldsymbol x})^{\text{T}}\varphi \left( {\boldsymbol z} \right)\end{aligned} $$
(5.45)

According to Eq. (5.44), the data points x enter the SVM only through inner products. We can therefore use a kernel function to compute the inner product in the transformed feature space directly, without needing to know the exact mapping function \(\varphi \). The resulting dual problem is formally similar to the previous one:

$$\displaystyle \begin{aligned} &\mathop{\max }\limits_{\boldsymbol \alpha}\ \sum \limits_{i=1}^N \alpha _i -\frac{1}{2}\sum \limits_{i=1}^N \sum \limits_{j=1}^N \alpha _i \alpha _j y_i y_j K\left( {{\boldsymbol x}_i,{\boldsymbol x}_j } \right)\\ {} &~\text{s.t.}\quad \sum \limits_{i=1}^N \alpha _i y_i =0,\quad 0\le \alpha _i \le C,\quad i=1,\cdots,N\end{aligned} $$
(5.46)

and the decision function is accordingly

$$\displaystyle \begin{aligned} f\left( {\boldsymbol x} \right)&=\sum \limits_{i=1}^N \alpha _i^{\ast} y_i\langle\varphi \left( {{\boldsymbol x}_i } \right),\varphi \left( {\boldsymbol x} \right)\rangle+b^{\ast}\\ {} &=\sum \limits_{i=1}^N \alpha _i^{\ast} y_iK\left( {{\boldsymbol x}_i,{\boldsymbol x}} \right)+b^{\ast}\end{aligned} $$
(5.47)

The commonly used kernel functions include:

  • Linear kernel: K(x, z) = x T z,

  • Polynomial kernel: \(K({\boldsymbol x},{\boldsymbol z})=\left ( {{\boldsymbol x}^{\text{T}}{\boldsymbol z}+c} \right )^d\),

  • Radial basis function: \(K\left ( {{\boldsymbol x},{\boldsymbol z}} \right )=\exp \left ( {- \frac {\|{\boldsymbol x}-{\boldsymbol z}\|^2}{2\delta ^2}} \right )\)

as well as some other kernels, such as the sigmoid kernel, pyramid kernel, string kernel, and tree kernel functions. The linear kernel function is mostly used in text classification because the feature space representing a document is usually high-dimensional and linearly separable.
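In practice, a linear SVM over a sparse bag-of-words or TF-IDF representation is a strong text classification baseline. A minimal sketch using scikit-learn is shown below, assuming the library is available; the tiny corpus and labels are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["the university opened a new computer lab",
               "students attend computer science lectures",
               "the volleyball team won a gold medal",
               "an exciting volleyball game and a medal ceremony"]
train_labels = ["education", "education", "sport", "sport"]

# TF-IDF vector space model + linear-kernel SVM (the usual choice for text).
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)
print(clf.predict(["a medal for the volleyball game"]))   # expected: ['sport']
```

Replacing LinearSVC with SVC(kernel="rbf") or SVC(kernel="poly") would correspond to the nonlinear kernels listed above, although this is rarely necessary for text.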

Thus far, we have introduced how to convert the primal problem of SVM into the dual problem shown in Eq. (5.46), but we still need to solve the dual problem and obtain the optimal parameters \({\boldsymbol \alpha }^{\ast }\) and \(b^{\ast }\). A representative method for this task is the sequential minimal optimization (SMO) algorithm. Interested readers can refer to Platt (1998) for more details.

As a representative classification algorithm in traditional machine learning, SVM has been widely used in text classification since the 1990s. According to the comparative study of Yang and Liu (1999), SVM's performance on topic-based text classification is significantly better than that of NB, linear least squares fit, and a three-layer feed-forward neural network, and is comparable to or slightly better than that of the k-nearest neighbor classifier. For the sentiment classification task (Pang et al. 2002), it was also reported that SVM performed better than NB and ME on the movie review corpus.

5.3.4 Ensemble Methods

Ensemble methods are motivated by the intuition that appropriately combining different models can leverage their distinct strengths. In traditional machine learning, ensemble methods combine multiple learning algorithms to obtain better predictive performance than any of the base learning algorithms alone. There are three main ways to generate multiple base classifiers: (1) training on different data subsets; (2) training on different feature sets; and (3) adopting different classification algorithms.

Bagging (bootstrap aggregating) and boosting belong to the first category. Bagging, proposed by Breiman (1996), trains each base classifier on a randomly drawn subset of the training set and obtains the ensemble prediction by voting over the base classifiers. Boosting builds an ensemble incrementally by training each base classifier iteratively to emphasize the training instances that previous base classifiers misclassified. AdaBoost (Freund et al. 1996) is a representative variant of the boosting algorithm.
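A minimal scikit-learn sketch of this first category is given below (assuming the library is available); the synthetic data stand in for any vectorized document collection, and the base classifier and hyperparameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; in text classification X would be the document-term matrix.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

base = DecisionTreeClassifier(max_depth=3)
bagging = BaggingClassifier(base, n_estimators=20, random_state=0).fit(X, y)   # data-subset ensembles
boosting = AdaBoostClassifier(n_estimators=20, random_state=0).fit(X, y)       # reweights misclassified examples

print(bagging.score(X, y), boosting.score(X, y))
```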

Ensemble learning has been successfully applied to text classification. An early study (Larkey and Croft 1996) combined different types of machine learning algorithms to obtain an ensemble classifier with better performance for text classification. Schapire and Singer (2000) proposed BoosTexter, a text classification system based on boosting, which performed better than traditional algorithms. Xia et al. (2011) performed a comparative study of the effectiveness of the ensemble technique for sentiment classification by integrating different feature sets and classification algorithms to synthesize a more accurate sentiment classification procedure.

5.4 Deep Learning Methods

Traditional text representation and classification algorithms rely on manually designed features, which have many shortcomings, such as the high-dimensional problem, data-sparsity problem, and poor representation learning ability. In recent years, deep learning techniques, represented by deep neural networks, have made great breakthroughs in speech recognition, image processing, and text mining. Because of its powerful representation learning ability and the end-to-end learning framework, deep learning has been widely applied to and made great progress in many text mining tasks, including text classification.

In the following, we will introduce several representative deep learning methods for text classification.

5.4.1 Multilayer Feed-Forward Neural Network

A multilayer feed-forward neural network is an artificial neural network that maps input vectors to output vectors through a sequence of fully connected layers. Compared with a linear classifier, a multilayer feed-forward neural network adds one or more hidden layers with nonlinear activation functions. In principle, a multilayer feed-forward network with sufficiently many hidden units can approximate any continuous nonlinear function.

Figure 5.4 shows the structure of a three-layer feed-forward network. Suppose \({\boldsymbol x}\in {\mathbb R}^M,{\boldsymbol h}\in {\mathbb R}^S,{\boldsymbol y}\in {\mathbb R}^C\) are the input layer, hidden layer, and output layer, respectively. The nodes of two adjacent layers are fully connected. For example, the hidden node b h is connected with all input nodes, x 1, …, x i, …, x M, and the output node y j is connected with all hidden nodes, b 1, …, b h, …, b S. \({\boldsymbol V}\in {\mathbb R}^{M\times S}\) represents the weight matrix between the input layer and the hidden layer, where v ih is the weight of the connection between x i and b h, and \({\boldsymbol W}\in {\mathbb R}^{S\times C}\) represents the weight matrix between the hidden layer and the output layer, where w hj is the weight of the connection between b h and y j. The network structure can be formulated as follows:

$$\displaystyle \begin{gathered} {} b_h = \sigma \left( {\alpha _h } \right)=\sigma \left( {\sum_{i=1}^M v_{ih} x_i +\gamma _h } \right) \end{gathered} $$
(5.48)
Fig. 5.4 The structure of a three-layer feed-forward neural network

$$\displaystyle \begin{gathered} {} \hat{y}_j=\sigma \left( {\beta _j } \right)=\sigma \left( {\sum \limits_{h=1}^S w_{hj} b_h +\theta _j } \right) \end{gathered} $$
(5.49)

where σ(⋅) is a nonlinear activation function such as sigmoid.

Given a training set \(D=\left \{ {\left ( {{\boldsymbol x}_1,{\boldsymbol y}_1}\right ),\left ( {{\boldsymbol x}_2,{\boldsymbol y}_2} \right ),\ldots ,\left ( {{\boldsymbol x}_N,{\boldsymbol y}_N} \right )}\right \}\), define the following least mean squares loss function:

$$\displaystyle \begin{aligned} E=\frac{1}{2}\sum_{k=1}^N \sum_{j=1}^C \left( {\hat{y}_{kj} -y_{kj}} \right)^2 \end{aligned} $$
(5.50)

Learning or training can be viewed as the process of optimizing the loss function, i.e., finding the parameters of the model that best fit the training data according to the loss function. A feed-forward neural network is trained with the back-propagation (BP) algorithm, which computes the gradient of the loss with respect to all parameters and updates them by (stochastic) gradient descent.
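A minimal PyTorch sketch of Eqs. (5.48)-(5.50) is given below, assuming PyTorch is installed: a sigmoid hidden layer, a sigmoid output layer, the squared-error loss, and parameter updates computed by back-propagation with gradient descent. The dimensions and the random toy batch are illustrative:

```python
import torch
import torch.nn as nn

M, S, C = 8, 4, 2                       # input, hidden, and output dimensions
model = nn.Sequential(
    nn.Linear(M, S), nn.Sigmoid(),      # Eq. (5.48): hidden layer b = sigma(Vx + gamma)
    nn.Linear(S, C), nn.Sigmoid(),      # Eq. (5.49): output layer y_hat = sigma(Wb + theta)
)
loss_fn = nn.MSELoss(reduction="sum")   # Eq. (5.50), up to the constant 1/2
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Toy batch: 10 random "documents" with one-hot targets over C classes.
x = torch.rand(10, M)
y = torch.eye(C)[torch.randint(0, C, (10,))]

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                     # back-propagation computes the gradients
    optimizer.step()                    # gradient-descent parameter update
print(loss.item())
```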

Artificial neural networks were investigated in early research on text classification (Yang and Liu 1999) but were not widely used at the time because of their computational inefficiency. Moreover, neural networks were only used as a classifier module rather than in the end-to-end joint framework of representation learning and classification that is widely used now. A document was first represented as a sparse feature vector \({\boldsymbol x}=\left [{x_1,x_2,x_3,\cdots }\right ]^{\text{T}}\) based on the manually designed vector space and then sent to the feed-forward neural network for classification only, similar to the process for traditional classification algorithms such as naïve Bayes and SVM.

Recently, with the development of representation learning and end-to-end learning methods, artificial neural network models, now referred to as deep learning, have achieved great success in many text data mining tasks, including text classification. Representative deep learning architectures include convolutional neural networks and recurrent neural networks.

5.4.2 Convolutional Neural Network

Convolutional neural network (CNN) is a special kind of feed-forward neural network in which the hidden layers consist of a series of convolutional and pooling layers. In comparison with multilayer feed-forward neural networks, a CNN has the characteristics of local connectivity, weight sharing, and translation invariance.

Figure 5.5 shows the basic structure of a convolutional neural network for text classification, which consists of an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer.

Fig. 5.5 The basic structure of CNN for text classification

A text classification model based on CNN usually has the following steps:

  (1) The input text is normally subjected to morphological processing (e.g., tokenization for English or word segmentation for Chinese) and converted into a word sequence, and word embeddings are then used to initialize the input representation of the network.

  (2) Feature extraction is then performed by the convolutional layer. Taking Fig. 5.5 as an example, there are three sizes of convolution kernels, 2 × 5, 3 × 5, and 4 × 5, with two kernels of each size. Note that when computing the convolutional representation of the input text, the two-dimensional convolution is performed in one direction only (i.e., the kernel width equals the dimension of the word vectors, so the kernel slides only along the word direction), and the stride of the convolution is set to 1. Each convolution kernel operates on the representation matrix of the input text and yields a vector representation of the input text.

  (3) The pooling layer downsamples the feature vectors outputted by the convolutional layer and obtains an abstract text representation whose dimension is equal to the number of convolution kernels. The vector representation outputted by the pooling layer is then fed into the softmax layer (i.e., a fully connected layer plus a normalization layer) for classification.

Kim (2014) first proposed using convolutional neural networks for text classification and found that they achieved significantly better performance than classical machine learning methods on both topic and sentiment classification tasks. Kalchbrenner et al. (2014) proposed a dynamic convolutional neural network that uses dynamic k-max pooling to downsample the feature maps, keeping the several most important features as the local representation after the convolution operations. Zhang et al. (2015) proposed a character-level CNN that represents the text and performs convolution operations at a finer granularity (i.e., characters) and achieved better or competitive results in comparison with word-level CNNs and RNNs.
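A minimal PyTorch sketch of the structure in Fig. 5.5 is given below: kernels of widths 2, 3, and 4 slide along the word dimension of the embedding matrix, max-over-time pooling keeps one value per kernel, and a fully connected layer produces the class logits (the softmax is applied inside the cross-entropy loss at training time). The vocabulary size, embedding dimension, and number of kernels are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=5, num_kernels=2,
                 kernel_widths=(2, 3, 4), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # One Conv1d per kernel width; the convolution runs along the word dimension only.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, num_kernels, kernel_size=w) for w in kernel_widths)
        self.fc = nn.Linear(num_kernels * len(kernel_widths), num_classes)

    def forward(self, token_ids):                       # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)   # (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]  # max pooling
        return self.fc(torch.cat(pooled, dim=1))        # class logits

model = TextCNN()
logits = model(torch.randint(0, 1000, (3, 12)))         # a batch of 3 "sentences" of length 12
print(logits.shape)                                     # torch.Size([3, 2])
```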

5.4.3 Recurrent Neural Network

  (1) RNN, LSTM, Bi-LSTM, and GRU

A recursive neural network is a kind of deep neural network created by applying the same set of weights recursively over a structured input. It has been widely used for learning over sequences and tree structures in natural language processing. A recursive neural network applied over time (i.e., over a sequence) is usually called a recurrent neural network. In the following, RNN refers to the recurrent neural network unless stated otherwise.

The structure of a recurrent neural network is shown in Fig. 5.6. The left side shows the structure that runs recurrently over time, and the right side shows the same structure unrolled over the sequence. Suppose x t is the input at time step t and o t is the output of the model. It can be seen that o t depends not only on x t but also on the hidden state of the previous time step, s t−1. o t can be described as follows:

$$\displaystyle \begin{aligned} {\boldsymbol s}_t &=f\left( {{\boldsymbol U}{\boldsymbol x}_t +{\boldsymbol W}{\boldsymbol s}_{t-1} } \right) \end{aligned} $$
(5.51)
Fig. 5.6 The structure of a recurrent neural network

$$\displaystyle \begin{aligned} {\boldsymbol o}_t &={\boldsymbol V}{\boldsymbol s}_t\end{aligned} $$
(5.52)

where \({\boldsymbol U}\in \mathbb {R}^{h\times d}\), \({\boldsymbol W}\in \mathbb {R}^{h\times h}\), and \({\boldsymbol V}\in \mathbb {R}^{c\times h}\) are the weight matrices of the input node to the hidden node, the current hidden node to the next hidden node, and the hidden node to the output node, respectively. d, h, and c are the dimensions of the input layer, the hidden layer, and the output, respectively. f is a nonlinear activation function (e.g., tanh). By feeding o t to a softmax layer, we can perform classification for each node or the entire sequence:

$$\displaystyle \begin{aligned} {\boldsymbol p}_t =\mbox{softmax}\left( {{\boldsymbol o}_t } \right) \end{aligned} $$
(5.53)

RNN learns the model parameters by the back-propagation through time (BPTT) algorithm, which is a generalization of the back-propagation algorithm of feed-forward neural networks.

To address the problems of vanishing gradients and exploding gradients when processing long sequence data, Hochreiter and Schmidhuber (1997) proposed the long short-term memory (LSTM) model, which was further improved and promoted by Gers et al. (2002). Schuster and Paliwal (1997) proposed the bidirectional RNN to make better use of the forward and backward context information. Graves et al. (2013) employed bidirectional LSTM (Bi-LSTM) in speech recognition to encode the sequence from front to back and back to front. To address the complexity and redundancy of LSTM, Cho et al. (2014) proposed a gated recurrent unit (GRU) based on LSTM. GRU simplifies the structure of LSTM by combining the forget gate and the input gate into an update gate while merging the cell state and the hidden layer.

When using an RNN to model sequence data, one can learn from the attention mechanism of the human brain, which adaptively selects some key information from a large number of input signals. This approach can improve the performance and efficiency of the model. Inspired by this, the attention mechanism was proposed to differentiate the importance of component units in sequence in semantic composition. For example, the representation of a sentence will be the weighted sum of the representations of the words it contains, and furthermore, the representation of a document will be the weighted sum of the representations of the sentences it contains.

More details of LSTM, GRU, and the attention mechanism can be found in Chap. 3 of this book.

  (2) Sentence-Level Classification Model Based on RNN

In this section, we take sentence-level sentiment classification as an example to introduce how to apply RNN to text classification. Let us assume that the input sentence is “I like this movie” and the class label is “positive.”

As shown in Fig. 5.7, we first obtain the initial representation of the sentence with the pretrained word vectors \(\left [ {x_1,x_2,\ldots ,x_T } \right ]\). Each word embedding x t (denoted w t in Eqs. (5.54) and (5.55)) is fed into the Bi-LSTM in word order:

$$\displaystyle \begin{gathered} \overrightarrow{{\boldsymbol c}}_t,\overrightarrow{{\boldsymbol h}}_t =\mbox{LSTM}\left( {\overrightarrow{{\boldsymbol c}}_{t-1},\overrightarrow{{\boldsymbol h}}_{t-1},{\boldsymbol w}_t } \right) \end{gathered} $$
(5.54)
Fig. 5.7 The basic structure of RNN for sentence-level text classification

$$\displaystyle \begin{gathered} \overleftarrow{{\boldsymbol c}}_t,\overleftarrow{{\boldsymbol h}}_t =\mbox{LSTM}\left( {\overleftarrow{{\boldsymbol c}}_{t+1},\overleftarrow{{\boldsymbol h}}_{t+1},{\boldsymbol w}_t } \right) \end{gathered} $$
(5.55)

The hidden vector is

$$\displaystyle \begin{aligned} {\boldsymbol h}_t =\left[\overrightarrow{{\boldsymbol h}}_t,\overleftarrow{{\boldsymbol h}}_t\right] \end{aligned} $$
(5.56)

After processing all words, the hidden states are \(\left [ {{\boldsymbol h}_1,{\boldsymbol h}_2,\ldots ,{\boldsymbol h}_T}\right ]\).

Then, we calculate the attention weight α t according to the attention mechanism:

$$\displaystyle \begin{aligned} \alpha _t =\mbox{softmax}\left( {{\boldsymbol u}_t^{\text{T}} {\boldsymbol q}} \right)\end{aligned} $$
(5.57)

where \({\boldsymbol u}_t =\tanh \left ( {{\boldsymbol W}{\boldsymbol h}_t +{\boldsymbol b}}\right )\) and q is the query vector. The final sentence representation vector is obtained in the form of the weighted sum of the hidden state of each word in the sentence:

$$\displaystyle \begin{aligned} {\boldsymbol r}=\sum \limits_t \alpha _t {\boldsymbol h}_t\end{aligned} $$
(5.58)

The prediction is finally obtained by feeding r to a softmax layer:

$$\displaystyle \begin{aligned} {\boldsymbol p}=\mbox{softmax}({\boldsymbol W}_c {\boldsymbol r}+{\boldsymbol b}_c) \end{aligned} $$
(5.59)

where W c and b c are the weight matrix and the bias term, respectively.

The cross-entropy E between the ground truth y and the prediction distribution p is used as the loss function:

$$\displaystyle \begin{aligned} E=-\sum \limits_{j=1}^{ C} y_j \log p_j \end{aligned} $$
(5.60)

The model parameters are learned through the BPTT algorithm.
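A minimal PyTorch sketch of Eqs. (5.54)-(5.60) is given below: a Bi-LSTM encoder, attention pooling with a learned query vector q, a softmax classifier, and the cross-entropy loss. Dimensions, names, and the random toy batch are illustrative:

```python
import torch
import torch.nn as nn

class AttnBiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=50, hidden=32, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn_proj = nn.Linear(2 * hidden, 2 * hidden)   # u_t = tanh(W h_t + b)
        self.query = nn.Parameter(torch.randn(2 * hidden))   # query vector q
        self.fc = nn.Linear(2 * hidden, num_classes)         # W_c, b_c

    def forward(self, token_ids):                             # (batch, seq_len)
        h, _ = self.bilstm(self.embedding(token_ids))         # Eqs. (5.54)-(5.56): (batch, T, 2*hidden)
        u = torch.tanh(self.attn_proj(h))
        alpha = torch.softmax(u @ self.query, dim=1)          # Eq. (5.57): (batch, T)
        r = (alpha.unsqueeze(-1) * h).sum(dim=1)              # Eq. (5.58): weighted sum
        return self.fc(r)                                     # logits for Eq. (5.59)

model = AttnBiLSTMClassifier()
tokens = torch.randint(0, 1000, (4, 10))                      # batch of 4 sentences of length 10
labels = torch.tensor([1, 0, 1, 0])
loss = nn.CrossEntropyLoss()(model(tokens), labels)           # Eq. (5.60)
loss.backward()                                               # gradients via BPTT
print(loss.item())
```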

  (3) Hierarchical Document-Level Text Classification Model

Document-level text classification refers to text classification for the entire document, where each document is assigned a class label. A simple method for document-level text classification is to treat the document as a long sentence and employ an RNN to encode and classify it. However, this approach does not consider the hierarchical structure of the document.

A document usually contains multiple sentences, and each sentence contains multiple words. Therefore, a document can be modeled according to such a “word-sentence-document” hierarchy. Tang et al. (2015a) first employed a CNN (or LSTM) to encode the word sequence of each sentence and then used a gated RNN to encode the sentence sequence of the document. Yang et al. (2016) further proposed a hierarchical attention GRU model that consists of five parts: the word-level encoding layer, the word-level attention layer, the sentence-level encoding layer, the sentence-level attention layer, and the softmax layer, as shown in Fig. 5.8; a minimal code sketch is given after the component list below.

  • Word-level encoding layer: For each sentence, we send the initialized word embedding to Bi-GRU and obtain the forward hidden state \(\overrightarrow {{\boldsymbol h}}_{it}\) and the backward hidden state \(\overleftarrow {{\boldsymbol h}}_{it}\) of each word. Their concatenation is used as the representation of each word \({\boldsymbol h}_{it}=\left [ \overrightarrow {{\boldsymbol h}} _{it},\overleftarrow {\boldsymbol h}_{it}\right ]\).

    Fig. 5.8 The hierarchical structure of RNN for document-level text classification

  • Word-level attention layer: We first calculate the weight according to \(\alpha _{it} = \frac {\exp \left ( {{\boldsymbol u}_{it}^{\text{T}} {\boldsymbol u}_w } \right )}{\sum _t \exp \left ( {{\boldsymbol u}_{it}^{\text{T}} {\boldsymbol u}_w } \right )}\), where \({\boldsymbol u}_{it} =\tanh \left ( {{\boldsymbol W}_w {\boldsymbol h}_{it} +{\boldsymbol b}_w } \right )\) and u w is a query vector that measures the importance of each word in the sentence. It can be seen as a high-level representation of the query statement “Which word is more important?” It is randomly initialized in the model and trained with the other parameters of the model jointly. Finally, the weighted sum of the hidden representation of each word is used as the representation of the sentence s i =∑t α it h it.

  • Sentence-level encoding layer: After word-level encoding and attention, each sentence obtains its representation. A document consists of multiple sentences. Similar to the word-level encoding layer, the representation of each sentence is sent to the Bi-GRU to obtain the forward embedding vector \(\overrightarrow {{\boldsymbol h}}_i\) and the backward embedding vector \(\overleftarrow {{\boldsymbol h}}_i\). The concatenation is used as the hidden representation of each sentence \({\boldsymbol h}_i =\left [\overrightarrow {{\boldsymbol h}} _i,\overleftarrow {{\boldsymbol h}}_i\right ]\).

  • Sentence-level attention layer: We introduce the attention mechanism again to distinguish the importance of different sentences for document representation. The weight for each sentence is \(\alpha _i = \frac {\exp \left ( {{\boldsymbol u}_i^{\text{T}} {\boldsymbol u}_s } \right )}{\sum _i \exp \left ( {{\boldsymbol u}_i^{\text{T}} {\boldsymbol u}_s } \right )}\) , where \({\boldsymbol u}_i =\tanh \left ( {{\boldsymbol W}_s {\boldsymbol h}_i +{\boldsymbol b}_s } \right )\). The final document representation is also a weighted sum of the representations of all sentences v =∑i α i h i.

  • Softmax layer: The document representation v is sent to a softmax layer for document classification: p = softmax(W c v + b c), where W c and b c are the weight matrix and bias, respectively.
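A compact PyTorch sketch of this hierarchical structure is given below. It is our simplification of the model of Yang et al. (2016): the same Bi-GRU-plus-attention encoder is applied once over the words of each sentence and once over the resulting sentence vectors, and padding, masking, and minibatching of documents are omitted:

```python
import torch
import torch.nn as nn

class AttnGRUEncoder(nn.Module):
    """Bi-GRU encoding followed by attention pooling (used at both levels)."""
    def __init__(self, in_dim, hidden):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 2 * hidden)
        self.query = nn.Parameter(torch.randn(2 * hidden))   # plays the role of u_w or u_s

    def forward(self, x):                       # x: (batch, steps, in_dim)
        h, _ = self.gru(x)                      # (batch, steps, 2*hidden)
        alpha = torch.softmax(torch.tanh(self.proj(h)) @ self.query, dim=1)
        return (alpha.unsqueeze(-1) * h).sum(dim=1)          # attention-weighted sum

class HierAttnClassifier(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=50, hidden=32, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.word_enc = AttnGRUEncoder(emb_dim, hidden)       # words -> sentence vectors
        self.sent_enc = AttnGRUEncoder(2 * hidden, hidden)    # sentences -> document vector
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, doc):                     # doc: (num_sentences, words_per_sentence)
        sent_vecs = self.word_enc(self.embedding(doc))        # (num_sentences, 2*hidden)
        doc_vec = self.sent_enc(sent_vecs.unsqueeze(0))       # (1, 2*hidden)
        return self.fc(doc_vec)                               # document-level logits

model = HierAttnClassifier()
doc = torch.randint(0, 1000, (3, 8))            # a document with 3 sentences of 8 words
print(model(doc).shape)                         # torch.Size([1, 2])
```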

5.5 Evaluation of Text Classification

Assume that there are M categories in a text classification task, represented by \(C_1, \ldots , C_M\). For each class, we count the number of documents in each of the following four cases:

  (1) True positive (TP): the system correctly predicts the document as a positive example of this class (i.e., both the prediction and the ground truth belong to this class).

  (2) True negative (TN): the system correctly predicts the document as a negative example of this class (i.e., neither the prediction nor the ground truth belongs to this class).

  (3) False positive (FP): the system incorrectly predicts the document as a positive example (i.e., the prediction belongs to this class, but the ground truth does not).

  (4) False negative (FN): the system incorrectly predicts the document as a negative example (i.e., the ground truth belongs to this class, but the prediction does not).

After computing TP, TN, FP, and FN for each class, we obtain the microlevel statistics shown in Table 5.9.

Table 5.9 The microlevel statistics of text classification
  I. Recall, Precision, and \(F_1\) Score

By using j ∈{1, 2, …, M} to denote the class index, we define the following metrics for each class:

  (1) Recall is defined as the proportion of examples correctly predicted as this class among all examples whose ground truth is this class:

    $$\displaystyle \begin{aligned} R_j =\frac{\text{TP}_j }{\text{TP}_j +\text{FN}_j } \times 100\% \end{aligned} $$
    (5.61)
  (2) Precision is defined as the proportion of examples correctly predicted as this class among all examples predicted as this class:

    $$\displaystyle \begin{aligned} P_j =\frac{\text{TP}_j }{\text{TP}_j +\text{FP}_j } \times 100\% \end{aligned} $$
    (5.62)
  (3) An ideal system with both high precision and high recall returns many results, with all results labeled correctly. A system with high recall but low precision returns many positive predictions, but most of them are incorrect compared with the ground truth labels. A system with high precision but low recall returns very few positive predictions, but most of its predicted labels are correct. We ideally hope that a classification system has both high recall and high precision, but the two are often in conflict. We therefore define the harmonic mean of precision and recall, \(F_1 =\frac {2PR}{P+R} \times 100\%\), as the \(F_1\) score to comprehensively evaluate these two aspects.

To weight the relative importance of recall and precision, we can also define the more general \(F_\beta\) score:

$$\displaystyle \begin{aligned} F_\beta =\frac{(\beta ^2+1)PR}{\beta ^2P+R} \times 100\% \end{aligned} $$
(5.63)

When \(\beta = 1\), \(F_\beta\) reduces to the standard \(F_1\) score.
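As a concrete illustration of Eqs. (5.61)–(5.63), the short sketch below computes precision, recall, and the \(F_\beta\) score from per-class TP, FP, and FN counts; the counts in the usage line are hypothetical.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Per-class precision, recall, and F_beta from raw counts (Eqs. 5.61-5.63)."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (beta ** 2 + 1) * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f_beta


# Hypothetical counts for one class: 8 TP, 2 FP, 4 FN; beta=1 gives the F1 score.
p, r, f1 = precision_recall_f(tp=8, fp=2, fn=4)
print(f"P={p:.3f}, R={r:.3f}, F1={f1:.3f}")   # P=0.800, R=0.667, F1=0.727
```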

  II. Accuracy, Macroaverage, and Microaverage

Recall, precision, and the F score evaluate the classification performance only for a single class. To measure the performance on the entire classification task, we define the classification accuracy as follows:

$$\displaystyle \begin{aligned} \text{Acc}=\frac{\# \text{Correct}}{N} \times 100\% \end{aligned} $$
(5.64)

where N is the number of all examples and #Correct is the number of examples that are correctly predicted.

In addition to classification accuracy, we can also use the macroaverage and microaverage of previous class-oriented measures across all classes to evaluate the performance of the entire classification task.

The precision, recall, and \(F_1\) score based on the macroaverage are defined as follows:

$$\displaystyle \begin{aligned} \text{Macro}\_{\,\text{P}}&=\frac{1}{M}\sum_{j=1}^M \frac{\text{TP}_j }{\text{TP}_j +\text{FP}_j } \end{aligned} $$
(5.65)
$$\displaystyle \begin{aligned} \text{Macro}\_{\,\text{R}}&=\frac{1}{M}\sum_{j=1}^M \frac{\text{TP}_j }{\text{TP}_j +\text{FN}_j } \end{aligned} $$
(5.66)
$$\displaystyle \begin{aligned} \text{Macro}\_{\,\text{F}}_1&=\frac{2\times \text{Macro}\_{\,\text{P}}\times \text{Macro}\_{\,\text{R}}}{\text{Macro}\_{\,\text{P}}+\text{Macro}\_{\,\text{R}}} \end{aligned} $$
(5.67)

The precision, recall, and \(F_1\) score based on the microaverage are defined as follows:

$$\displaystyle \begin{aligned} \text{Micro}\_{\,\text{P}}&=\frac{\sum_{j=1}^M \text{TP}_j }{\sum_{j=1}^M (\text{TP}_j +\text{FP}_j)} \end{aligned} $$
(5.68)
$$\displaystyle \begin{aligned} \text{Micro}\_{\,\text{R}}&=\frac{\sum_{j=1}^M \text{TP}_j }{\sum_{j=1}^M (\text{TP}_j +\text{FN}_j)} \end{aligned} $$
(5.69)
$$\displaystyle \begin{aligned} \text{Micro}\_{\,\text{F}}_1&=\frac{2\times \text{Micro}\_{\,\text{P}}\times \text{Micro}\_{\,\text{R}}}{\text{Micro}\_{\,\text{P}}+\text{Micro}\_{\,\text{R}}} \end{aligned} $$
(5.70)

Based on the binary classification results shown in Table 5.10, we calculate all the aforementioned measures in Table 5.11.

  III. P-R Curve and ROC Curve

    Table 5.10 An example of binary classification results
    Table 5.11 The evaluation of the classification results in Table 5.10

In a classification problem, predictions are made by comparing the prediction score with a predefined threshold. For example, the threshold of logistic regression is normally set to 0.5: when the predicted positive probability is at least 0.5, we predict the example as positive; otherwise, we predict it as negative.

To evaluate classification models more comprehensively under different recall levels, we can adjust the prediction threshold of the classifier and plot the corresponding precision-recall (P-R) curve, with recall on the x-axis and precision on the y-axis. The P-R curve shows the tradeoff between precision and recall for different thresholds, and the area under the P-R curve can be used to measure the overall performance of a classification system. A large area under the curve represents both high recall and high precision: high precision indicates that few negative examples are misclassified as positive, and high recall indicates that few positive examples are missed. High scores on both show that the classifier returns accurate results (high precision) and covers the majority of all positive examples (high recall). The average precision can be viewed as an approximation of the area under the P-R curve: it averages the precision over recall levels ranging from 0 to 1. For example, we can set the recall levels to 0, 0.1, 0.2, …, 0.9, and 1.0 and use the resulting 11-point average precision for evaluation; when this average precision is further averaged over multiple classes or queries, it is called the mean average precision (mAP).

Similar to the P-R curve, we can also plot the ROC (receiver operating characteristic) curve by using the false-positive rate as the x-axis and the true positive rate (i.e., recall) as the y-axis. The area under the ROC curve is called AUC (area under the ROC curve). The higher the AUC value is, the better the general classification performance of the classifier.

The ROC curve summarizes the tradeoff between recall and the false-positive rate. The P-R curve summarizes the tradeoff between precision and recall. The ROC curve is appropriate when the observations are balanced between each class, whereas the P-R curve is more suitable for imbalanced datasets.
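As a minimal illustration, the sketch below computes the P-R and ROC curves with scikit-learn on made-up scores and derives two summary numbers: an 11-point average precision (here the common interpolated variant, which takes the maximum precision at or beyond each recall level) and the AUC.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve, roc_auc_score

# Hypothetical gold labels and predicted positive scores for ten examples.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

# P-R curve: precision and recall at every distinct threshold.
precision, recall, _ = precision_recall_curve(y_true, y_score)

# 11-point average precision: at each recall level r, take the maximum precision
# achieved at any recall >= r, then average over the levels 0.0, 0.1, ..., 1.0.
ap11 = np.mean([precision[recall >= r].max() for r in np.linspace(0.0, 1.0, 11)])

# ROC curve (false-positive rate vs. true positive rate) and its area.
fpr, tpr, _ = roc_curve(y_true, y_score)
auc_value = roc_auc_score(y_true, y_score)

print(f"11-point AP = {ap11:.3f}, AUC = {auc_value:.3f}")
```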

5.6 Further Reading

Classification algorithms based on statistical machine learning can be roughly divided into two categories: the discriminative model and the generative model. In general, a discriminative model models the decision boundary between the classes (i.e., learns the decision function y = f(x) or the posterior probability p(y|x) directly); a generative model explicitly models the distribution of each class p(x|y) as well as the joint distribution of observation and class label (i.e., \(p\left ( {{\boldsymbol x},y} \right )=p(y)p({\boldsymbol x}\vert y)\)). With application to text classification, the typical generative model is the naïve Bayes model, and the typical discriminative models include logistic/softmax regression, maximum entropy model, support vector machine, and artificial neural networks.

The classification models introduced in this chapter are all designed for classification of an entire document, and the discussion has not involved structure prediction within the document. Given a piece of text x that consists of multiple nodes \(x_t\), it is a classification task to predict the label of x and a sequence labeling task to predict the labels of all nodes \(x_t\) in x. In a sequence labeling task, each node \(x_t\) has a label \(y_t\). Typical sequence labeling models include hidden Markov models (HMMs) and conditional random fields (CRFs). An HMM can be viewed as an extension of the naïve Bayes model from classification to sequence labeling. In addition to modeling the relationship between \(x_t\) and \(y_t\), HMMs also use state transition probabilities to model the relationship between \(y_{t-1}\) and \(y_t\). Similarly, the CRF model is the extension of the maximum entropy model from classification to sequence labeling. The CRF model adopts the log-linear model hypothesis of the maximum entropy model and defines similar feature functions. In addition, the CRF model also defines a state transition feature function to learn the structural relationships in a sequence. Interested readers can refer to Zong (2013) and Li (2019) for more details about sequence labeling models.

Recurrent neural networks naturally have the ability to handle both classification and sequence labeling problems. In the RNN structure shown in Fig. 5.6, if we perform prediction on each node of the sequence, it is a sequence labeling problem; if we obtain the representation of the entire document via semantic composition (e.g., attention) and only perform classification for the document, it is a classification problem. Such a high degree of flexibility is also a major advantage of deep neural networks for text modeling in comparison with traditional machine learning models.
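The sketch below (an illustrative assumption, not code from this chapter) makes this flexibility explicit: a single recurrent encoder feeds both a per-token output layer for sequence labeling and a pooled output layer for document classification; mean pooling stands in here for the attention-based composition mentioned above.

```python
import torch
import torch.nn as nn


class SharedRNNEncoder(nn.Module):
    """One Bi-GRU encoder with two heads: token tagging and document classification."""

    def __init__(self, vocab_size=1000, emb_size=64, hidden_size=64,
                 num_tags=5, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.encoder = nn.GRU(emb_size, hidden_size, bidirectional=True, batch_first=True)
        self.tagger = nn.Linear(2 * hidden_size, num_tags)        # one label per token
        self.classifier = nn.Linear(2 * hidden_size, num_classes) # one label per document

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        h, _ = self.encoder(self.embedding(tokens))   # (batch, seq_len, 2 * hidden)
        tag_logits = self.tagger(h)                   # sequence labeling output
        doc_logits = self.classifier(h.mean(dim=1))   # mean pooling, then classification
        return tag_logits, doc_logits


model = SharedRNNEncoder()
tokens = torch.randint(0, 1000, (2, 7))          # a hypothetical batch of 2 sequences
tag_logits, doc_logits = model(tokens)
print(tag_logits.shape, doc_logits.shape)        # torch.Size([2, 7, 5]) torch.Size([2, 3])
```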

Exercises

5.1

Please derive the naïve Bayes model under the assumption of the multivariable Bernoulli distribution, following McCallum et al. (1998).

5.2

What are the main differences between the softmax regression model and the maximum entropy model?

5.3

Why is the linear kernel more widely used than nonlinear kernels when applying SVM to text classification?

5.4

What are the main differences between the multilayer feed-forward neural network and the convolutional neural network?

5.5

Can a convolutional neural network capture n-gram grammatical features in text? Why?

5.6

What are the main differences between recurrent neural networks and convolutional neural networks? Which do you think is more suitable for document classification? Why?