Abstract
Recently, supervised topic modeling approaches have received considerable attention. However, the representative labeled latent Dirichlet allocation (L-LDA) method has a tendency to over-focus on the pre-assigned labels, and does not give potentially lost labels and common semantics sufficient consideration. To overcome these problems, we propose an extension of L-LDA, namely supervised labeled latent Dirichlet allocation (SL-LDA), for document categorization. Our model makes two fundamental assumptions, i.e., Prior 1 and Prior 2, that relax the restriction of label sampling and extend the concept of topics. In this paper, we develop a Gibbs expectation-maximization algorithm to learn the SL-LDA model. Quantitative experimental results demonstrate that SL-LDA is competitive with state-of-the-art approaches on both single-label and multi-label corpora.
1 Introduction
Recently, considerable attention has been focused on topic modeling approaches [8]. The original goals of such methods were (1) to obtain a brief description of document collection for basic tasks [1] such as classification, clustering, and dimension reduction, and (2) to use the concept of latent topics to capture the semantics behind documents. Latent Dirichlet allocation (LDA) [5] is acknowledged as the most successful topic model. It simulates the generative process of a corpus, where each document is composed of latent topics, and each topic is described by a multinomial distribution over words. To control the capacity of the model parameters and avoid the over-fitting problem, a Dirichlet prior is given over all topics beyond the corpus.
Most approaches to topic modeling are unsupervised, and LDA is no exception. Unsupervised LDA neglects supervised information, resulting in some significant messages being wasted. For document classification tasks [5, 6, 14], LDA commonly acts as an upstream method for dimension reduction before various classifiers are executed, e.g., LDA with support vector machines (SVMs). Intuitively, some discriminative features must be lost when LDA transforms the original word distribution into the latent topic distribution.
The investigation of supervised LDA models faces the challenge of incorporating supervision into the learning procedure. Recently, some modifications have been developed (e.g., supervised LDA (sLDA) [4], discriminative LDA (DiscLDA) [12] and maximum entropy discrimination LDA (MedLDA) [23, 24] for single-label classification; labeled LDA (L-LDA) [15], Dirichlet process with mixed random measures (DP-MRM) [11], Prior-LDA (Pr-LDA), and Dependency-LDA (Dep-LDA) [17] for multi-label classification).
To the best of our knowledge, L-LDA was the first supervised LDA model that could be applied to multi-label corpora. L-LDA defines a 1-1 correspondence between labels and topics, and constrains each document to its pre-assigned label set. The empirical results reported in [15] show that, under certain conditions, L-LDA can achieve performance that is competitive with state-of-the-art discriminative methods, e.g., SVMs [13]. However, it suffers from two problems: (1) L-LDA is over-focused on pre-assigned labels for each document, resulting in worse performance for corpora with small cardinality values (i.e., the average number of labels per document), and even degenerating to the mixture of unigrams model for single-label corpora; (2) from a generative perspective, L-LDA lacks a mechanism to capture potentially lost labels and common semantics.
In this paper, we consider these two problems, and then propose a Supervised L-LDA (SL-LDA) model for both single- and multi-label document categorization. Our model makes two fundamental assumptions beyond those used in L-LDA: Prior 1 provides a label threshold to highlight the pre-assigned labels, instead of restricting them; Prior 2 extends the topics concept to cover the semantics of lost labels and common topics. We develop a Gibbs Expectation-Maximization (Gibbs-EM) algorithm to infer the SL-LDA model. Naturally, as a supervised model, SL-LDA can directly predict the response labels for the test documents.
Extensive experiments are conducted to evaluate the proposed SL-LDA model. Several traditional approaches are chosen as performance baselines, including: 1) for the single-label case, four LDA-based approaches, i.e., LDA-SVM, sLDA, DiscLDA, and MedLDA; 2) for the multi-label case, three supervised LDA models, i.e., L-LDA, Pr-LDA, and Dep-LDA, and the state-of-the-art discriminative algorithm SVMs [13, 21]. The empirical results show that SL-LDA achieves a performance level that is competitive with the state-of-the-art methods. Some important notation used in this paper is summarized in Table 1.
The remainder of this paper is organized as follows. Section 2 describes the proposed SL-LDA model in detail. Section 3 presents the parameter estimation and inference methods for SL-LDA. In Section 4, we introduce some related studies on supervised topic models. Sections 5 and 6 present and discuss the empirical results for single- and multi-label corpora, respectively. Finally, our conclusions and suggestions for future work are given in Section 7.
2 Proposed model
This section introduces the novel SL-LDA model. We begin by reviewing the L-LDA model.
2.1 L-LDA
L-LDA [15] is a supervised topic model for describing labeled corpora. Similar to LDA, L-LDA represents documents as multinomial distributions over topics, where each topic is represented by a multinomial distribution over words. Additionally, L-LDA defines a 1-1 correspondence between the labels (tagged by human annotators) and topics. Each document d is constrained to be described by its pre-assigned label set \({y_{d}} \subseteq \left \{ {1,2, {\cdots } ,{K_{t}}} \right \}\).
L-LDA is formalized as follows. For each topic k, the topic-word distribution \(\overrightarrow {{\phi _{k}}}\) is drawn from the Dirichlet prior \(\overrightarrow {\beta }\). However, for each document d, L-LDA restricts the multinomial distribution \(\overrightarrow {{\theta _{d}}}\) to topics within y d . To meet this requirement, the topic presence/absence indicator \(\overrightarrow {{\Lambda }_{d}} = \left ({{l_{1}},{l_{2}}, {\cdots } ,{l_{{K_{t}}}}} \right )\) is generated via a Bernoulli distribution η k , where \({l_{k}} \in \left \{ {0,1} \right \}\), and \({y_{d}} = \left \{ {k|{{\Lambda }_{d,k}} = 1} \right \}\). The document-label projection matrix L d [15] projects the topic Dirichlet prior \( \overrightarrow \alpha \) onto a prior over the present topics, \(\overrightarrow {{\alpha _{d}}} {\kern 1pt} = {({\alpha _{d,{l_{1}}}},{\alpha _{d,{l_{2}}}}, {\cdots } ,{\alpha _{d,{l_{{K_{d}}}}}})^{T}}\). The topics and words can then be sampled as in the traditional LDA model.
In summary, L-LDA assumes the following generative process for a labeled corpus W:
1.
For each topic k

(a) Choose \(\overrightarrow {{\phi _{k}}} = {\left ({{\phi _{k,1}},{\phi _{k,2}}, {\cdots } ,{\phi _{k,V}}} \right )^{T}} \sim Dirichlet\left ({\overrightarrow \beta } \right )\)

2.
For each document d in the corpus W

(a) For each topic k

(i) Choose \({{\Lambda }_{d,k}} \in \left \{ {0,1} \right \} \sim Bernoulli\left ({{\eta _{k}}} \right )\)

(b) Choose \(\overrightarrow {{\alpha _{d}}} = {L_{d}} \times \overrightarrow \alpha \)

(c) Choose \(\overrightarrow {{\theta _{d}}} = {({\theta _{d,{l_{1}}}},{\theta _{d,{l_{2}}}}, {\cdots } ,{\theta _{d,{l_{{K_{d}}}}}})^{T}} \sim Dirichlet\left ({\overrightarrow {{\alpha _{d}}} } \right )\)

(d) For each of the N d words w d,n

(i) Choose a topic \({z_{d,n}} \in \left \{y_{d,1},{y_{d,2}}, {\cdots } ,{y_{d,{K_{d}}}} \right \} \sim Multinomial\left ({\overrightarrow {{\theta _{d}}} } \right )\)

(ii) Choose a word w d,n from \(p\left ({w_{d,n}}|z_{d,n},{\Phi }\right )\), a multinomial probability conditioned on the topic z d,n
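The restriction to y d can be seen concretely in a small simulation. The NumPy sketch below follows the generative steps above under simplifying assumptions: symmetric priors, a fixed document length, and the label set supplied directly rather than drawn from the Bernoulli indicators; all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

K_t, V, N_d = 4, 10, 20      # number of labels/topics, vocabulary size, doc length
alpha, beta = 0.5, 0.1       # symmetric Dirichlet hyper-parameters

# Step 1: one topic-word distribution per topic.
phi = rng.dirichlet([beta] * V, size=K_t)

def generate_document(y_d):
    """Generate one document whose topics are restricted to the label set y_d
    (the effect of the L_d projection in step 2(b))."""
    theta_d = rng.dirichlet([alpha] * len(y_d))   # distribution over y_d only
    z = rng.choice(y_d, size=N_d, p=theta_d)      # topics drawn from y_d alone
    w = [rng.choice(V, p=phi[k]) for k in z]      # words conditioned on topics
    return list(z), w

z, w = generate_document(y_d=[1, 3])
```

Note that every sampled topic lies in y d , which is exactly the constraint SL-LDA later relaxes.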
2.2 SL-LDA
In reviewing the L-LDA model, we found that it suffers from two problems in supervised tasks. First, the assumption that each document is sampled only from its own label set y d is too strong for corpora with small cardinality. For single-label corpora in particular, L-LDA degenerates to the simple mixture of unigrams model. This undermines topic modeling's advantage of capturing latent semantics, and leads to worse classification performance. Second, because L-LDA defines a 1-1 correspondence between labels and topics, its number of topics equals the true number of labels in the corpus. From a generative perspective, L-LDA ignores hidden labels (i.e., lost labels), which might be missed during manual annotation. More importantly, the L-LDA framework lacks a mechanism to cover common semantics. For example, the words “news” and “report”, which express common semantics in newsgroup collections, might occur in most documents. Under L-LDA, such common words are forcibly assigned to a pre-defined label. Because these words are less discriminative, this harms classification.
To overcome these problems, we propose a novel supervised model. SL-LDA makes two fundamental assumptions beyond those of L-LDA, namely Prior 1 and Prior 2. The aim of Prior 1 is to relax the restriction of label sampling. Following the intuition that a document involves a wide range of labels, but that only the main labels will be tagged (e.g., given a document d with label proportion \({\left [ {0.1,0.2,0.2,0.5} \right ]}\), its label set might be manually given as \({y_{d}} = \left \{ 4 \right \}\)), Prior 1 supposes that each document d samples from all labels and assigns the labels in y d a dominant total weight, controlled by the label threshold 𝜃 ∗. Under this assumption, SL-LDA never degenerates to a simpler model. Prior 2 extends the concept of topics to handle the problem of semantic loss. It assumes that each document d is represented by K topics, consisting of K t observed labels and K h hidden topics. The hidden topics cover the potentially lost labels and the common semantics expressed by common words. Consequently, less discriminative common words, such as “news” and “report” in newsgroup collections, contribute to the hidden topics rather than the pre-defined labels. Although Prior 2 seems straightforward, it provides significant benefits to classification tasks.
Note that in the structure of SL-LDA the observed labels and hidden topics are on the same level. Thus, in this paper, we do not differentiate between the concepts of “label” and “topic”. Prior 1 and Prior 2 are summarized as follows:
– Prior 1: each document d samples from all topics; the labels in y d are given dominant proportions, defined by 𝜃 ∗.

– Prior 2: there are a total of K topics, consisting of K t observed labels and K h hidden topics.
Formally, SL-LDA generates the topic-word distribution \(\overrightarrow {{\phi _{k}}}\) as in L-LDA. For each document d, it first samples the topic presence/absence indicator \(\overrightarrow {{{\Lambda }_{d}}}\) under the observed labels, and then generates the document-topic distribution \(\overrightarrow {{\theta _{d}}} {\kern 1pt}\) via the following conditional function:
where 𝜃 ∗ ≤ 1. For each document d, \(\sum \nolimits _{i \in {y_{d}}} {{\theta _{d,i}}} = {\theta ^{*}}\) and 𝜃 d,i = 𝜃 d,j for all i,j ∈ y d . \(Dir^{\prime }(\overrightarrow {\alpha } ,{y_{d}})\) is a pseudo-Dirichlet distribution with the following probability density function (PDF):
where \({\Gamma } \left (x \right )\) is the Gamma function.
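The stated constraints pin down the labeled part of \(\overrightarrow {{\theta _{d}}}\): the labels in y d share the total mass 𝜃 ∗ equally. One concrete completion is to let the remaining topics split the leftover mass 1 − 𝜃 ∗ via an ordinary Dirichlet draw; the sketch below implements this reading, which is our illustrative assumption and not necessarily the paper's exact sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_theta(alpha, y_d, theta_star, K):
    """Sample theta_d under Prior 1: labels in y_d share total mass theta_star
    equally; the remaining topics split 1 - theta_star via a Dirichlet draw
    (the completion for the unlabeled topics is an assumption)."""
    theta = np.zeros(K)
    theta[list(y_d)] = theta_star / len(y_d)
    rest = [k for k in range(K) if k not in y_d]
    theta[rest] = (1.0 - theta_star) * rng.dirichlet([alpha] * len(rest))
    return theta

theta = sample_theta(alpha=0.5, y_d={0, 2}, theta_star=0.6, K=6)
```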
The generative process of SL-LDA can be summarized as follows:
1.
For each topic k

(a) Choose \(\overrightarrow {{\phi _{k}}} = {\left ({{\phi _{k,1}},{\phi _{k,2}}, {\cdots } ,{\phi _{k,V}}} \right )^{T}} \sim Dirichlet\left ({\overrightarrow \beta } \right )\)

2.
For each document d in the corpus W

(a) For each existing topic k

(i) Choose \({{\Lambda }_{d,k}} \in \left \{ {0,1} \right \} \sim Bernoulli\left ({{\eta _{k}}} \right )\)

(b) Choose \(\overrightarrow {{\theta _{d}}} = {({\theta _{d,1}},{\theta _{d,2}}, {\cdots } ,{\theta _{d,K}})^{T}} \sim f(\overrightarrow \alpha ,{y_{d}},{\theta ^{*}},{K_{d}})\)

(c) For each of the N d words w d,n

(i) Choose a topic \({z_{d,n}} \in \left \{ {{k_{1}},{k_{2}}, {\cdots } ,{k_{K}}} \right \} \sim Multinomial\left ({\overrightarrow {{\theta _{d}}} } \right )\)

(ii) Choose a word w d,n from \(p\left ({w_{d,n}}|{z_{d,n}},{\Phi } \right )\), a multinomial probability conditioned on the topic z d,n
Discussion of parameter 𝜃 ∗
The parameter 𝜃 ∗ is a crucial threshold that determines the degree of attention given to the marked labels. We suggest tuning within the interval \(\theta ^{*}\in \left [ {0.5,0.8} \right ]\), following the intuition that: (1) a document is assigned a label (or labels) provided that at least 50 % of it focuses on this label (or labels); (2) a document is over-focused on the marked label (or labels) if \(\theta ^{*}\) approaches 1.
In Sections 5.2.3 and 6.4, we study 𝜃 ∗ experimentally. The results show that the optimum performance is achieved when 𝜃 ∗ is within the suggested interval.
2.3 Evidence for SL-LDA
Given the parameters Φ and Θ , the joint probability of the labeled corpus W and a set of topic assignments \(\overrightarrow z \) is:
where N d,k is the number of times that the topic k occurs in document d; and N k,v is the number of times that the word v has been assigned to topic k.
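The counts N d,k and N k,v can be accumulated in a single pass over the token stream; a minimal sketch (the array names are ours):

```python
import numpy as np

def count_matrices(z, w, d, D, K, V):
    """Accumulate N_dk (occurrences of topic k in document d) and N_kv
    (assignments of word v to topic k) from parallel arrays of topic
    assignments z, word ids w, and document ids d."""
    N_dk = np.zeros((D, K), dtype=int)
    N_kv = np.zeros((K, V), dtype=int)
    for zi, wi, di in zip(z, w, d):
        N_dk[di, zi] += 1
        N_kv[zi, wi] += 1
    return N_dk, N_kv

# Three tokens across two documents, two topics, three vocabulary words.
N_dk, N_kv = count_matrices(z=[0, 1, 0], w=[2, 0, 2], d=[0, 0, 1], D=2, K=2, V=3)
```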
SL-LDA places a Dirichlet prior over Φ:
and a pseudo-Dirichlet prior over Θ:
Given the hyper-parameters \(\overrightarrow \alpha \) and \(\overrightarrow \beta \), we obtain the evidence for a labeled corpus W by combining (2), (3) and (4):
where variable C ∗ is defined, for simplification, as follows:
3 Estimation and inference
In this section, we describe the process of parameter estimation and inference with respect to SL-LDA.
3.1 Estimation for hyper-parameters
We develop a Gibbs-EM algorithm to learn the hyper-parameters \(\overrightarrow \alpha \) and \(\overrightarrow \beta \) in SL-LDA. Gibbs sampling [2, 19] is a popular method for approximate learning in high-dimensional models. It approximates a high-dimensional probability distribution via the stationary state of a Markov chain Monte Carlo (MCMC) chain. We first use Gibbs sampling to approximate the expectation of the topic assignments \(\overrightarrow z \), \(E\left [ {P\left ({\overrightarrow z |W,\overrightarrow \alpha ,\overrightarrow \beta } \right )} \right ]\), and then maximize the likelihood of (5) to estimate the hyper-parameters. Repeating this procedure until convergence yields the optimized values \({\overrightarrow \alpha ^{M}}\) and \({\overrightarrow \beta ^{M}}\). The Gibbs-EM algorithm is summarized as follows:
1.
Initialize topics \({\overrightarrow z^{0}}\) and hyper-parameters \({\overrightarrow \alpha ^{0}}\) and \({\overrightarrow \beta ^{0}}\)

2.
For i = 1, 2, ⋯ until convergence:

(a) E-step: sample \({\overrightarrow z^{i}}\) from \(P\left ({\overrightarrow z |W,\overrightarrow \alpha ,\overrightarrow \beta } \right )\) by Gibbs sampling.

(b) M-step: maximize \(\log P\left ({W,{{\overrightarrow z }^{i}}|\overrightarrow \alpha ,\overrightarrow \beta } \right )\) w.r.t. \({\overrightarrow \alpha ^{i}}\) and \({\overrightarrow \beta ^{i}}\) using the fixed-point method.
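The alternation above can be written as a skeleton with the E-step and M-step passed in as callables; the stubs below stand in for the real Gibbs sampler and fixed-point update, and the scalar hyper-parameters are a simplification of the vectors \(\overrightarrow \alpha \) and \(\overrightarrow \beta \).

```python
def gibbs_em(z0, alpha0, beta0, e_step, m_step, max_iter=50, tol=1e-4):
    """Alternate a Gibbs-sampling E-step with a fixed-point M-step until the
    hyper-parameters stop moving (illustrative skeleton only)."""
    z, alpha, beta = z0, alpha0, beta0
    for _ in range(max_iter):
        z = e_step(z, alpha, beta)        # sample z ~ P(z | W, alpha, beta)
        new_alpha, new_beta = m_step(z)   # maximize log P(W, z | alpha, beta)
        if abs(new_alpha - alpha) < tol and abs(new_beta - beta) < tol:
            alpha, beta = new_alpha, new_beta
            break
        alpha, beta = new_alpha, new_beta
    return z, alpha, beta

# Toy run with stub callables that converge immediately after one update.
z, alpha, beta = gibbs_em(z0=[0, 1], alpha0=1.0, beta0=1.0,
                          e_step=lambda z, a, b: z,
                          m_step=lambda z: (0.5, 0.1))
```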
In the E-step, we sample each topic assignment z t alternately from the distribution determined by all other topic assignments. Let \(\widetilde {{z_{t}}}\) denote the set of topic assignments excluding the current variable z t . The conditional probability for z t is then:
where w t is the word corresponding to z t ; d t is the document containing w t ; and variables with the subscript “¬” have 1 subtracted, e.g., N ¬k ∗ = N k ∗ − 1.
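For intuition, a collapsed Gibbs update of this form can be sketched as below. We use the standard collapsed LDA conditional; folding the 𝜃 ∗-weighted document prior into the α term is our simplifying assumption, since the exact SL-LDA update equation is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(t, z, w, d, N_dk, N_kv, N_k, alpha, beta):
    """Resample z[t] from its conditional given all other assignments
    (standard collapsed LDA update; the SL-LDA prior is assumed folded
    into alpha)."""
    k_old, v, di = z[t], w[t], d[t]
    # Subtract the current assignment (the "not" counts with subscript ¬).
    N_dk[di, k_old] -= 1; N_kv[k_old, v] -= 1; N_k[k_old] -= 1
    V = N_kv.shape[1]
    p = (N_dk[di] + alpha) * (N_kv[:, v] + beta) / (N_k + V * beta)
    z[t] = rng.choice(len(p), p=p / p.sum())
    # Add the new assignment back into the counts.
    N_dk[di, z[t]] += 1; N_kv[z[t], v] += 1; N_k[z[t]] += 1
    return z[t]

# Tiny example: one token, two topics, two vocabulary words.
z, w, d = [0], [0], [0]
N_dk = np.array([[1, 0]]); N_kv = np.array([[1, 0], [0, 0]]); N_k = np.array([1, 0])
k = sample_z(0, z, w, d, N_dk, N_kv, N_k, alpha=0.5, beta=0.1)
```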
The M-step uses the fixed point method. Using the samples \(\overrightarrow z \) obtained in the E-step, we update the hyper-parameter \(\alpha _{k}^{i + 1}\) as follows:
and for \(\beta _{v}^{i + 1}\)
where Ψ(x) is the Digamma function, i.e., the logarithmic derivative of the Gamma function.
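A common concrete instance of such a fixed-point update is Minka's rule for Dirichlet hyper-parameters. The sketch below shows one step for \(\overrightarrow \alpha \), with a hand-rolled digamma; it is an illustrative stand-in, since the paper's exact formulas are not reproduced here.

```python
import math

def digamma(x):
    """Psi(x) via the recurrence psi(x) = psi(x+1) - 1/x and an asymptotic
    series, accurate to roughly 1e-8 for x > 0."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))

def update_alpha(alpha, N_dk):
    """One Minka-style fixed-point step for each alpha_k given the
    document-topic counts N_dk (rows are documents)."""
    alpha_sum = sum(alpha)
    denom = sum(digamma(sum(row) + alpha_sum) - digamma(alpha_sum) for row in N_dk)
    return [a * sum(digamma(row[k] + a) - digamma(a) for row in N_dk) / denom
            for k, a in enumerate(alpha)]

new_alpha = update_alpha([0.5, 0.5], N_dk=[[2, 1], [1, 3]])
```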
Consistency over multiple runs
Across multiple runs, Gibbs sampling may yield inconsistent results because of its random sampling procedure [9]. This problem persists in our model, but it is almost insignificant for the classification tasks described here. Note that the topics in SL-LDA are composed of observed labels and hidden topics. Given a set of training data, the observed labels are fixed, so they are necessarily consistent. The hidden topics, meanwhile, are used to collect common words, so they contribute to classification regardless of whether they are consistent.
3.2 Inference for unlabeled documents
Let \(d^{\prime }\) be a document from the testing corpus \(W^{\prime }\) and \(U = \left \{ {\overrightarrow z ,\overrightarrow w } \right \}\) be a stationary MCMC state for the training corpus W.
We employ Gibbs sampling to infer the unlabeled document \(d^{\prime }\) by estimating the posterior distribution of topic assignments [7]: \(P\left ({z^{\prime }|{{\overrightarrow w }_{d^{\prime }}},U,{{\overrightarrow \alpha }^{M}},{{\overrightarrow \beta }^{M}}} \right )\), where \({{\overrightarrow w }_{d^{\prime }}}\) is the vector of \(d^{\prime }\), and \({{\overrightarrow \alpha }^{M}},{{\overrightarrow \beta }^{M}}\) are obtained during the parameter estimation procedure. As the test document is unlabeled, the update rule is as follows:
where \({w^{\prime }_{t}}\) is the t-th word in \({w_{d^{\prime }}}\); \({N^{\prime }_{{k^{*}},v}}\) is the number of times that v has been generated by topic k ∗.
Finally, the topic distribution of document \(d^{\prime }\) is estimated as follows:
where \(N_{d^{\prime },i}^{*}\) is the number of times that topic i has occurred in document \(d^{\prime }\).
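With the sampled counts in hand, a smoothed point estimate of the test document's topic distribution takes the usual form \((N_{d^{\prime },i}^{*} + \alpha _{i}) / (\sum \nolimits _{j} N_{d^{\prime },j}^{*} + \sum \nolimits _{j} \alpha _{j})\); as a sketch:

```python
def estimate_theta(N_di, alpha):
    """Smoothed-count point estimate of a test document's topic distribution
    from its sampled per-topic counts N_di and the learned prior alpha."""
    total = sum(N_di) + sum(alpha)
    return [(n + a) / total for n, a in zip(N_di, alpha)]

theta = estimate_theta([3, 1], [0.5, 0.5])
```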
4 Related work
4.1 Topic model
A number of variants of LDA have been developed for supervised cases. Representative single-label corpora models include sLDA [4], which captures document labels as a classification response, DiscLDA [12], where documents are associated with labels and topic mixtures, and MedLDA [23, 24], which combines maximum margin technology and LDA. In terms of multi-label corpora, L-LDA [15] was the first supervised topic model. The authors of [17] developed the equivalent Flat-LDA, and further extended this model to Pr-LDA and Dep-LDA via observations of label frequency and label dependency, respectively. Other modifications for multi-label corpora include DP-MRM [11] and Partially LDA [16].
The proposed SL-LDA builds upon the L-LDA model by introducing a topic threshold 𝜃 ∗ and extending the concept of topics. In a sense, DiscLDA also extends the topics, as it represents documents by labels and topic mixtures. However, DiscLDA appears to be more complicated than our model. Several models that consider common topics have been explored, e.g., the Cluster-based Topic Model [3] and Multi-Grain Cluster Topic Model [20]. However, as unsupervised models, these cannot be directly applied to classification.
The state-of-the-art Pr-LDA and Dep-LDA models can also be deemed as extensions of L-LDA. Pr-LDA assumes that there is a corpus-wide distribution with respect to the label occurrence frequency in the corpus. A document’s topic priors are generated by this frequency distribution, instead of using the same prior. Based on Pr-LDA, Dep-LDA further introduces a topic level beyond the label level to capture label dependency. Although these two models have significantly improved multi-label document classification [17], the two problems of L-LDA still exist. More importantly, these models (particularly Dep-LDA) are complex and contain too many parameters. In terms of their application to different corpora, tuning the parameters might be time-consuming. In contrast, our model involves a simpler construction and fewer parameters, and so is straightforward and can be easily controlled. Another advantage of SL-LDA is that it can be applied to both single- and multi-label corpora. This leads to better scalability in practice.
4.2 Computational complexity
We now discuss the computational complexity of training various supervised topic models. In the original papers, these models were trained using different inference algorithms, and so we provide a descriptive comparison. The traditional LDA model is used as the baseline.
As SL-LDA does not change the basic construction of the traditional LDA model, its complexity will be the same as that of LDA. All single-label models perform extra computations, e.g., calculating the normalization factor in sLDA, learning the transition matrix in DiscLDA, and solving the dual problem in MedLDA. In terms of multi-label models, it is clear that L-LDA is as efficient as LDA, but Pr-LDA and Dep-LDA must compute the upper-topic distribution over the labels. In particular, proper inference using Dep-LDA is time-consuming, because it repeatedly computes the expensive Gamma function. Thus, we argue that the proposed SL-LDA model is more efficient than most of these related supervised topic models. Empirical tests are reported in the following sections.
5 Evaluation on single-label setting
For a single-label corpus (K d = 1), we evaluate both the text modeling and document classification performance of SL-LDA. Experiments are run on the balanced Newsgroups collection, which consists of 19,997 documents in 20 related categories (i.e., K t = 20). By convention, we remove stop words from a standard list, as well as words that occur only once in the corpus.
In the experiments, the parameter 𝜃 ∗ is tuned to a value in the set \(\left \{ {0.5,0.6,0.7,0.8} \right \}\). Other parameters are set as follows: 1) hyper-parameters \(\overrightarrow \alpha \) and \(\overrightarrow \beta \) are initialized to 50/K and 0.1, respectively; 2) maximum number of iterations is 50, and the termination precision is 1 × 10−4; 3) Gibbs sampling uses a burn-in of 500 iterations in the E-step.
5.1 Text modeling
We conducted a simple text modeling experiment to examine the topic structures given by SL-LDA. We fitted the SL-LDA model to the Newsgroups corpus, where 𝜃 ∗=0.5 and K h = 5.
Table 2 lists the 10 most frequent words over 20 observed labels and 5 hidden topics. This result is similar to that given by DiscLDA [12], where words that occurred in the labeled rows are significantly cross-referenced with their corresponding labels, and words in the hidden topics show no obvious preference to any class.
5.2 Classification performance
For a single-label corpus, SL-LDA assigns each test document d a label \(y_{d}^{*}\) by:
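Assuming the rule is the natural one, i.e., pick the observed label with the largest estimated proportion while excluding hidden topics, a minimal sketch:

```python
def predict_label(theta_d, K_t):
    """Predict the single label as the most probable OBSERVED topic;
    hidden topics (indices >= K_t) are excluded from the argmax."""
    return max(range(K_t), key=lambda k: theta_d[k])

# Index 3 is a hidden topic, so it cannot be predicted even though it is largest.
label = predict_label([0.1, 0.3, 0.2, 0.4], K_t=3)
```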
Following previous studies [12, 23], we evaluated SL-LDA in terms of binary- and multi-class document classification. Several existing supervised topic models were chosen as performance baselines, i.e., LDA-SVM, sLDA, DiscLDA, and MedLDA. Average scores were obtained from 20 runs, and pairwise t-tests between SL-LDA and the baselines were conducted at the 5 % significance level. As in [22], an indicator •/∘ is used to denote whether SL-LDA was found to be statistically superior/inferior to the compared algorithm.
5.2.1 Binary classification
The binary-class (K t = 2) classification experiments were performed on two subgroups of Newsgroups, i.e., alt.atheism and talk.religion.misc, following the design described in [12, 23].
For LDA-SVM, we used the Gibbs-EM algorithm to fit the upstream LDA model, and employed the well-known LibSVM as the downstream classifier. We used a radial basis function kernel, and optimized its parameters via grid search. The public implementations of sLDA and MedLDA were employed for dependable results, with their parameters set according to the discussions in the corresponding publications. The DiscLDA results were taken from the original paper, as we could not obtain its primary implementation.
Table 3 lists the results for different topic numbers K. We can see that SL-LDA attains a competitive level of performance, and is statistically superior to the other models in most cases. Our model obtains better scores for relatively small K, and achieves the highest score of 0.819 when K = 15. As K becomes larger, SL-LDA performs slightly worse than the state-of-the-art MedLDA, e.g., by 0.09 for K = 25 and 0.02 for K = 35. We believe this is because larger values of K introduce too many hidden topics for binary-class settings, resulting in reduced performance. In summary, our model can provide higher performance with fewer topics, i.e., with less computational expense. This characteristic is quite meaningful in practice.
5.2.2 Multi-class classification
The multi-class classification experiments considered the full Newsgroups collection. We compared SL-LDA with LDA-SVM, sLDA, and MedLDA. All three benchmark models were set up as for the binary-class experiment.
The experimental results are given in Table 4. Clearly, SL-LDA achieves the top scores for all K, and is also statistically superior to the other models in most cases. Compared with LDA-SVM and sLDA, SL-LDA yields significant improvements, scoring around 0.2 higher than LDA-SVM and 0.3 higher than sLDA. More importantly, SL-LDA outperforms the state-of-the-art MedLDA by about 0.02.
5.2.3 Study on parameter 𝜃 ∗
In this section, we examine the label threshold 𝜃 ∗. Figure 1 illustrates the binary-class results for different values of 𝜃 ∗. We can clearly observe that the performance at values of 0.5 and 0.6 is better than in the other cases. The multi-class cases (Fig. 2) exhibit a similar trend, with 0.5 and 0.6 dominating the performance. These results conform to the discussion in Section 2.2.
5.3 Running time
We now examine the time efficiency of SL-LDA on a 3.1GHz Intel Core i5 2400 CPU. To ensure a fair comparison, two public models, i.e., sLDA and MedLDA, were used as baselines.
Figure 3 shows that SL-LDA is more efficient than sLDA. As reported in [24], we found that sLDA is quite time-consuming. This is mainly because the normalization factor strongly couples the topic assignments of all the words. SL-LDA is also faster than MedLDA, which needs to solve an extra dual problem during training.
6 Evaluation on multi-label setting
In this section, we evaluate the performance of SL-LDA for multi-label (K d > 1) document classification. The parameter 𝜃 ∗ is tuned using values in the set \(\left \{ {0.5,0.55,0.6,0.65,0.7,0.75,0.8} \right \}\), and the other parameters are set as for the single-label case.
6.1 Metric
The multi-label classification problem requires more metrics than the single-label case. In this experiment, we employ several popular metrics [18] to evaluate SL-LDA. For each document d in the test corpus \(W^{\prime }\), y d and \(y_{d}^{*}\) denote the true and estimated label sets, respectively, where \( {y_{d}}, y_{d}^{*} \subseteq \left \{ {1,2, {\cdots } ,{K_{t}}} \right \}\).
6.1.1 Rank-based metric
The estimated rank of label k for document d is denoted by \({r_{d}}\left (k \right )\). We introduce three rank-based metrics, for each of which smaller values imply better classification.
Ranking loss
This measures the number of times that irrelevant labels are ranked higher than relevant labels:
where \(\overline {{y_{d}}} \) is the complement of y d .
One error
This measures how many times the top-ranked label is not in the true label set:
where
Margin
This measures the difference in ranking between the top-ranked irrelevant label and the lowest-ranked relevant label:
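The three rank-based metrics can be sketched for a single document as follows; the margin function implements one common reading of the definition above, so treat it as illustrative.

```python
def ranks(scores):
    """Map each label to its rank, with rank 1 = highest score."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    return {k: r + 1 for r, k in enumerate(order)}

def ranking_loss(scores, y):
    """Fraction of (relevant, irrelevant) label pairs ordered wrongly."""
    y_bar = [k for k in range(len(scores)) if k not in y]
    pairs = [(i, j) for i in y for j in y_bar]
    return sum(scores[i] <= scores[j] for i, j in pairs) / len(pairs)

def one_error(scores, y):
    """1 if the top-ranked label is not a true label, else 0."""
    return int(max(range(len(scores)), key=lambda k: scores[k]) not in y)

def margin(scores, y):
    """Rank gap between the lowest-ranked relevant label and the top-ranked
    irrelevant label (one reading of the definition; smaller is better)."""
    r = ranks(scores)
    y_bar = [k for k in range(len(scores)) if k not in y]
    return max(r[i] for i in y) - min(r[j] for j in y_bar)
```

For example, with scores [0.5, 0.3, 0.2] and true label set {2}, every relevant/irrelevant pair is mis-ordered, so the ranking loss is 1 and the one-error fires.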
6.1.2 Binary prediction metric
The two binary prediction metrics used in our experiments are the Macro-F1 and Micro-F1 scores. Larger Macro-F1 and Micro-F1 scores denote better performance.
We define the Recall, Precision and F1-score for a document d as follows:
After computing the F1-scores for all the test documents, the Macro-F1 metric is obtained by averaging all of the individual F1-scores. The Micro-F1 considers the full testing corpus as a large document. It can be directly computed using (15).
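The per-document (macro) versus pooled (micro) computation can be sketched as follows, matching the description above:

```python
def f1(y_true, y_pred):
    """Per-document F1 over label sets."""
    tp = len(y_true & y_pred)
    p = tp / len(y_pred) if y_pred else 0.0
    r = tp / len(y_true) if y_true else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f1(trues, preds):
    """Macro-F1 averages per-document F1 scores; Micro-F1 pools the
    true-positive counts over the whole test corpus."""
    macro = sum(f1(t, p) for t, p in zip(trues, preds)) / len(trues)
    tp = sum(len(t & p) for t, p in zip(trues, preds))
    prec = tp / sum(len(p) for p in preds)
    rec = tp / sum(len(t) for t in trues)
    micro = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return macro, micro

macro, micro = macro_micro_f1([{0}, {1}], [{0}, {0}])
```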
6.2 Datasets
Yahoo! Arts and Health
These two corpora are from the Yahoo! collection. Arts consists of 7,741 documents and 19 unique labels, and Health contains 9,109 documents and 14 unique labels. The cardinality of both datasets is relatively small, i.e., 1.7 (Arts) and 1.6 (Health), and about 55 % of the documents have only a single label.
Following the preprocessing steps in [10], we randomly sampled 1,000 documents from each dataset, ensuring that each label appeared at least once, to form the training data. The remaining documents were used as the test data. This process was repeated 5 times to give 5 available training/test splits.
RCV1-v2 [13]
The RCV1-v2 dataset is another popular benchmark for multi-label document classification. It consists of over 800,000 documents with 103 labels. In our experiment, we used the original training set from the LYRL2004 split [13], which contains 23,149 documents tagged with 101 labels. We selected 30,000 documents at random from the LYRL2004 test split to form the test data. This procedure was repeated 10 times, giving 10 training/test splits. The cardinality of RCV1-v2 is about 3.1, larger than that of the Yahoo! subdirectories.
6.3 Classification performance
Several existing approaches were compared with SL-LDA, including L-LDA [15], Pr-LDA, Dep-LDA [17], and Tuned-SVMs (T-SVMs) [13]. We implemented in-house code for the three supervised LDA models, with all parameters set according to the discussions in the original papers. We implemented T-SVMs using LibSVM and the parameters from [13]. Each approach was executed 20 times on each training/test split (i.e., a total of 100 times for Arts and Health and 200 times for RCV1-v2), and pairwise t-tests between SL-LDA and the baselines were conducted at the 5 % significance level.
In early experiments, we found the performance to be stable when the number of topics K was slightly larger than the true number of labels, but to decrease for significantly larger values of K. We argue that this is reasonable because, intuitively, a large K exaggerates the effect of the hidden topics. In this section, we report the results for the two Yahoo! datasets and the RCV1-v2 dataset with K = 100 and K = 240, respectively.
6.3.1 Rank-based performance
SL-LDA outputs the distribution over topics for each unlabeled document. Thus, we can directly rank each existing label k for document d, i.e., \({r_{d}}\left (k \right )\). The experimental results are given in Table 5: our model performs well in terms of both numerical results and statistical tests.
Among the supervised LDA models, SL-LDA performs much better than L-LDA and Pr-LDA, and achieves competitive performance with the state-of-the-art Dep-LDA. The difference between L-LDA and SL-LDA is significant. The difference between Pr-LDA and SL-LDA is also large, except for the Yahoo! Arts dataset. We believe this is because of the small size and low label density of this dataset. SL-LDA also outperformed the state-of-the-art Dep-LDA on 5/6 scores across the Yahoo! datasets and on 2/3 evaluation metrics across the larger RCV1-v2 dataset. Although SL-LDA is simpler than Dep-LDA, it attains better performance in terms of rank-based metrics.
SL-LDA also achieved higher scores than T-SVMs for 8/9 metrics across all three datasets. These results demonstrate that SL-LDA is competitive with the state-of-the-art discriminative method.
6.3.2 Binary prediction performance
To compute the binary prediction metrics, we must transform the label ranking of each test document d into an estimated label set \(y_{d}^{*}\), which consists of the top N ranked labels. Here, we set N equal to the label cardinality (the average number of labels per document) of each dataset.
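This transformation amounts to a cutoff on the rank vector. The sketch below is illustrative only; it assumes `ranks` is a per-document label ranking in which rank 1 denotes the best-supported label, and keeps the N top-ranked labels as the estimated set \(y_{d}^{*}\).

```python
def top_n_label_set(ranks, n):
    """Estimated label set y_d*: keep the n best-ranked labels.

    ranks -- ranks[k] is the rank of label k (1 = best supported)
    n     -- cutoff, set to the dataset's label cardinality
    """
    return {k for k, r in enumerate(ranks) if r <= n}
```

For example, with ranks `[3, 1, 2]` and N = 2, the estimated label set is `{1, 2}`.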
The binary prediction results are listed in Table 6. We can see that SL-LDA attains better performance, and its advantage is statistically significant in most cases. SL-LDA outperforms the other supervised LDA models in terms of both Macro-F1 and Micro-F1 across all three datasets. Against the simpler models, SL-LDA scores about 0.09-0.11 higher than L-LDA and about 0.04-0.08 higher than Pr-LDA on the Yahoo! datasets, and achieves greater improvements on the larger RCV1-v2 dataset. In particular, SL-LDA outscores the state-of-the-art Dep-LDA by about 0.005-0.02. Regarding T-SVMs, the proposed model scores slightly lower on the RCV1-v2 dataset (similar results have been reported in [13]). However, our method outperforms T-SVMs on 3/4 metrics across the smaller Yahoo! datasets (an improvement of around 0.02 in Macro-F1 on Yahoo! Arts and 0.03 in Micro-F1 on Yahoo! Health). These results indicate that SL-LDA is competitive with state-of-the-art approaches.
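For readers unfamiliar with the two F1 variants compared here: Macro-F1 averages the per-label F1 scores (weighting rare and frequent labels equally), while Micro-F1 pools true/false positives and negatives over all labels before computing a single F1. The sketch below shows one standard way to compute both from predicted and true label sets; it is a generic illustration, not the evaluation code used in the paper.

```python
def micro_macro_f1(pred_sets, true_sets, num_labels):
    """Micro- and Macro-F1 over a collection of documents.

    pred_sets  -- list of predicted label sets (one set per document)
    true_sets  -- list of true label sets (one set per document)
    num_labels -- total number of labels
    """
    tp = [0] * num_labels
    fp = [0] * num_labels
    fn = [0] * num_labels
    for pred, true in zip(pred_sets, true_sets):
        for k in range(num_labels):
            if k in pred and k in true:
                tp[k] += 1          # correctly predicted label
            elif k in pred:
                fp[k] += 1          # predicted but not true
            elif k in true:
                fn[k] += 1          # true but missed

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro-F1: average of per-label F1 scores
    macro = sum(f1(tp[k], fp[k], fn[k]) for k in range(num_labels)) / num_labels
    # Micro-F1: F1 of the pooled counts
    micro = f1(sum(tp), sum(fp), sum(fn))
    return micro, macro
```

The gap between the two scores is informative: a Macro-F1 well below Micro-F1 usually indicates weak performance on rare labels.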
6.4 Study on parameter 𝜃∗
We now study the effect of the parameter 𝜃∗ in the multi-label setting. Two datasets are used: (1) the Yahoo! Arts collection, which has a small label cardinality of 1.7; and (2) the RCV1-v2 collection, which has a larger cardinality of about 3.1. As different metrics exhibit similar trends, we use only the ranking loss to compare different values of 𝜃∗ on the two datasets.
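The ranking loss referred to here is the standard multi-label metric: for each document, the fraction of (relevant, irrelevant) label pairs that are mis-ordered, i.e., where an irrelevant label is ranked above a relevant one (lower is better). The following sketch computes it for a single document under the assumption that `ranks[k]` gives the rank of label k with 1 being best; it is a generic illustration of the metric, not the paper's evaluation code.

```python
def ranking_loss(ranks, true_labels):
    """Ranking loss for one document.

    ranks       -- ranks[k] is the rank of label k (1 = best)
    true_labels -- set of pre-assigned (relevant) labels
    Returns the fraction of (relevant, irrelevant) pairs where the
    irrelevant label outranks the relevant one.
    """
    relevant = set(true_labels)
    irrelevant = [k for k in range(len(ranks)) if k not in relevant]
    if not relevant or not irrelevant:
        return 0.0  # loss is undefined without both kinds of labels
    bad_pairs = sum(1 for r in relevant for i in irrelevant
                    if ranks[i] < ranks[r])
    return bad_pairs / (len(relevant) * len(irrelevant))
```

A perfect ranking, with every relevant label above every irrelevant one, yields a loss of 0; a fully inverted ranking yields 1.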
The experimental results for the Yahoo! Arts collection are shown in Fig. 4. It is clear that higher performance is obtained when 𝜃 ∗ = 0.5 or 0.6. Larger values tend to degrade the performance. The trend for the Yahoo! Arts collection is very similar to that of the single-label corpus in Section 5. This is because Yahoo! Arts contains fewer labels, and about half the documents are tagged with only a single label.
The results for the RCV1-v2 collection (Fig. 5) are somewhat different, with larger values of 𝜃∗ tending to improve performance. The peak score is obtained at 𝜃∗ = 0.75. Because documents in RCV1-v2 contain an average of around three labels, a higher weighted proportion of pre-assigned labels is needed to describe a labeled document.
These experimental results also conform to the discussion in Section 2.2. Beyond the specific multi-label settings above, we suggest that a small (large) value of 𝜃∗ be applied for datasets with small (large) label cardinality.
6.5 Running time
Finally, we compared the time efficiency of SL-LDA with that of two state-of-the-art approaches, Dep-LDA and T-SVMs, for multi-label classification. Pr-LDA was not included, because it can be regarded as a special case of Dep-LDA.
Figure 6 shows the running time of the different approaches on the Yahoo! Arts and RCV1-v2 collections. Clearly, SL-LDA is more efficient than Dep-LDA and T-SVMs. For the smaller Yahoo! Arts dataset, SL-LDA is slightly faster than Dep-LDA, and the efficiency gap widens on the RCV1-v2 dataset. This is because Dep-LDA adds a label layer to the model, so its computational complexity is sensitive to the number of labels. The slowness of T-SVMs is mainly due to the parameter optimization process; in contrast, SL-LDA has few parameters, and following our discussion on the choice of 𝜃∗, an appropriate value can be determined quickly. Thus, we argue that SL-LDA is more efficient in practice than Dep-LDA and T-SVMs.
7 Conclusion
In this paper, we have proposed a novel supervised model for document classification in both single-label and multi-label settings. Based on the L-LDA model, SL-LDA relaxes the restriction of label sampling and extends the concept of topics to capture lost labels and common semantics. These modifications significantly improve classification performance. We developed a Gibbs-EM algorithm for the estimation and inference of our model. A series of evaluation experiments were conducted, and the results show that: (1) in single-label cases, SL-LDA outperforms LDA-SVM, sLDA, DiscLDA, and the state-of-the-art MedLDA in most instances; (2) in multi-label cases, SL-LDA significantly outperforms L-LDA and Pr-LDA, and, more importantly, is competitive with the state-of-the-art Dep-LDA and T-SVMs.
In the future, we intend to: (1) develop online approaches for large-scale multi-label corpora; (2) investigate collections that contain many labels; and (3) apply our model to other applications, e.g., summarization and filtering.
Notes
In fact, \(Dir^{\prime}(\overrightarrow{\alpha}, y_{d})\) is a pseudo-PDF, because \(\int_{-\infty}^{\infty} Dir^{\prime}(\overrightarrow{\alpha}, y_{d}) = 1 - \theta^{*}\). Here, it is used to sample the absent topic components in \(\overrightarrow{\Lambda_{d}}\).
Results for the Yahoo! Health collection are very similar.
We set K = 100 for Yahoo! Arts and K = 240 for RCV1-v2.
References
Ali D, Faqir M (2012) Group topic modeling for academic knowledge discovery. Appl Intell 36(4):870–886
Andrieu C, Freitas ND, Doucet A, Jordan MI (2003) An introduction to MCMC for machine learning. Mach Learn 50(1):5–43
Blei DM, Lafferty JD (2007) A correlated topic model of science. Ann Appl Stat 1(1):17–35
Blei DM, McAuliffe JD (2007) Supervised topic models. In: Neural information processing systems
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Fei-Fei L, Perona P (2005) A Bayesian hierarchical model for learning natural scene categories. In: IEEE computer society conference on computer vision and pattern recognition, vol 2, pp 524–531
Heinrich G (2005) Parameter estimation for text analysis. http://www.arbylon.net/publications/textest
Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp 50–57
Jaegul C, Changhyun L, Chandan KR, Park H (2013) Utopian: user-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Trans Vis Comput Graph 19(12):1992–2001
Ji S, Tang L, Yu S, Ye J (2008) Extracting shared subspace for multi-label classification. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, pp 381–389
Kim D, Kim S, Oh A (2012) Dirichlet process with mixed random measures: a nonparametric topic model for labeled data. In: 29th International conference on machine learning, pp 727–734
Lacoste-Julien S, Sha F, Jordan MI (2009) Disclda: discriminative learning for dimensionality reduction and classification. In: Neural information processing systems, pp 897–904
Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397
Quelhas P, Monay F, Odobez JM, Gatica-Perez D, Tuytelaars T, Van Gool L (2005) Modeling scenes with local descriptors and latent aspects. Comput Vis IEEE Int Conf 1:883–890
Ramage D, Hall D, Nallapati R, Manning CD (2009) Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In: Conference on empirical methods in natural language processing, pp 248–256. Association for Computational Linguistics
Ramage D, Manning CD, Dumais S (2011) Partially labeled topic models for interpretable text mining. In: ACM SIGKDD international conference on knowledge discovery and data mining, pp 457–465
Rubin TN, Chambers A, Smyth P, Steyvers M (2012) Statistical topic models for multi-label document classification. Mach Learn 88(1–2):157–208
Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv (CSUR) 34(1):1–47
Wallach H (2006) Topic modeling: beyond bag-of-words. In: Proceedings of the 23rd international conference on Machine learning, pp 977–984. ACM
Xie P, Xing EP (2013) Integrating document clustering and topic modeling. In: Proceedings of the 20th conference on uncertainty in artificial intelligence, pp 694–703
Xu Y, Guo R (2014) An improved nu-twin support vector machine. Appl Intell 41(1):42–54
Zhang ML, Zhang K (2010) Multi-label learning by exploiting label dependency. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, pp 999–1008
Zhu J, Ahmed A, Xing E (2009) Medlda: maximum margin supervised topic models for regression and classification. In: Proceedings of the 26th annual international conference on machine learning, pp 1257–1264. ACM
Zhu J, Ahmed A, Xing E (2012) Medlda: maximum margin supervised topic models. J Mach Learn Res 13:2237–2278
Acknowledgments
This work was supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61170092, 61133011, and 61103091.
Li, X., Ouyang, J., Zhou, X. et al. Supervised labeled latent Dirichlet allocation for document categorization. Appl Intell 42, 581–593 (2015). https://doi.org/10.1007/s10489-014-0595-0