1 Introduction

Recently, considerable attention has been focused on topic modeling approaches [8]. The original goals of such methods were (1) to obtain a concise description of a document collection for basic tasks [1] such as classification, clustering, and dimension reduction, and (2) to use the concept of latent topics to capture the semantics behind documents. Latent Dirichlet allocation (LDA) [5] is widely acknowledged as the most successful topic model. It simulates the generative process of a corpus, where each document is a mixture of latent topics, and each topic is described by a multinomial distribution over words. To control the model capacity and avoid over-fitting, a corpus-level Dirichlet prior is placed over the topic distributions.

Most approaches to topic modeling are unsupervised, and LDA is no exception. Unsupervised LDA neglects label information, so significant discriminative signals are wasted. For document classification tasks [5, 6, 14], LDA commonly acts as an upstream method for dimension reduction before various classifiers are executed, e.g., LDA with support vector machines (SVMs). Intuitively, some discriminative features must be lost when LDA transforms the original word distribution into the latent topic distribution.

The investigation of supervised LDA models faces the challenge of incorporating supervision into the learning procedure. Recently, some modifications have been developed (e.g., supervised LDA (sLDA) [4], discriminative LDA (DiscLDA) [12] and maximum entropy discrimination LDA (MedLDA) [23, 24] for single-label classification; labeled LDA (L-LDA) [15], Dirichlet process with mixed random measures (DP-MRM) [11], Prior-LDA (Pr-LDA), and Dependency-LDA (Dep-LDA) [17] for multi-label classification).

To the best of our knowledge, L-LDA was the first supervised LDA model that could be applied to multi-label corpora. L-LDA defines a 1-1 correspondence between labels and topics, and constrains each document to its pre-assigned label set. The empirical results reported in [15] show that, under certain conditions, L-LDA can achieve performance that is competitive with state-of-the-art discriminative methods, e.g., SVMs [13]. However, it suffers from two problems: (1) L-LDA is over-focused on pre-assigned labels for each document, resulting in worse performance for corpora with small cardinality values (i.e., the average number of labels per document), and even degenerating to the mixture of unigrams model for single-label corpora; (2) from a generative perspective, L-LDA lacks a mechanism to capture potentially lost labels and common semantics.

In this paper, we consider these two problems, and then propose a Supervised L-LDA (SL-LDA) model for both single- and multi-label document categorization. Our model makes two fundamental assumptions beyond those used in L-LDA: Prior 1 introduces a label threshold that highlights the pre-assigned labels instead of restricting sampling to them; Prior 2 extends the topic set to cover the semantics of lost labels and common topics. We develop a Gibbs Expectation-Maximization (Gibbs-EM) algorithm to infer the SL-LDA model. Naturally, as a supervised model, SL-LDA can directly predict the response labels for test documents.

Extensive experiments are conducted to evaluate the proposed SL-LDA model. Several traditional approaches are chosen as performance baselines, including: 1) for the single-label case, four LDA-based approaches, i.e., LDA-SVM, sLDA, DiscLDA, and MedLDA; 2) for the multi-label case, three supervised LDA models, i.e., L-LDA, Pr-LDA, and Dep-LDA, and the state-of-the-art discriminative algorithm SVMs [13, 21]. The empirical results show that SL-LDA achieves a performance level that is competitive with the state-of-the-art methods. Some important notation used in this paper is summarized in Table 1.

Table 1 Notation descriptions

The remainder of this paper is organized as follows. Section 2 describes the proposed SL-LDA model in detail. Section 3 presents the parameter estimation and inference methods for SL-LDA. In Section 4, we introduce some related studies on supervised topic models. Sections 5 and 6 present and discuss the empirical results for single- and multi-label corpora, respectively. Finally, our conclusions and suggestions for future work are given in Section 7.

2 Proposed model

This section introduces the novel SL-LDA model. We begin by reviewing the L-LDA model.

2.1 L-LDA

L-LDA [15] is a supervised topic model for describing labeled corpora. Similar to LDA, L-LDA represents documents as multinomial distributions over topics, where each topic is represented by a multinomial distribution over words. Additionally, L-LDA defines a 1-1 correspondence between the labels (tagged by human annotators) and topics. Each document d is constrained to be described by its pre-assigned label set \({y_{d}} \subseteq \left \{ {1,2, {\cdots } ,{K_{t}}} \right \}\).

L-LDA is formalized as follows. For each topic k, the topic-word distribution \(\overrightarrow {{\phi _{k}}}\) is drawn from the Dirichlet prior \(\overrightarrow {\beta }\). For each document d, however, L-LDA restricts the multinomial distribution \(\overrightarrow {{\theta _{d}}}\) to topics within \(y_{d}\). To meet this requirement, the topic presence/absence indicator \(\overrightarrow {{\Lambda }_{d}} = \left ({{l_{1}},{l_{2}}, {\cdots } ,{l_{{K_{t}}}}} \right )\) is generated from Bernoulli distributions with parameters \(\eta_{k}\), where \({l_{k}} \in \left \{ {0,1} \right \}\), and \({y_{d}} = \left \{ {k|{{\Lambda }_{d,k}} = 1} \right \}\). The document-label projection matrix \(L_{d}\) [15] projects the topic Dirichlet prior \( \overrightarrow \alpha \) onto the present topics, giving \(\overrightarrow {{\alpha _{d}}} = {({\alpha _{d,{l_{1}}}},{\alpha _{d,{l_{2}}}}, {\cdots } ,{\alpha _{d,{l_{{K_{d}}}}}})^{T}}\). Topics and words are then sampled as in the traditional LDA model.
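As a minimal illustration of this projection step, the following sketch (hypothetical variable names; NumPy assumed) restricts a corpus-wide prior to a document's pre-assigned labels and draws the constrained topic distribution:

```python
import numpy as np

# A minimal sketch of L-LDA's label projection; the variable names and
# settings here are purely illustrative, not from the paper.
alpha = np.array([0.5, 0.5, 0.5, 0.5])  # topic Dirichlet prior, K_t = 4
Lambda_d = np.array([1, 0, 1, 0])       # topic presence/absence indicator
y_d = np.flatnonzero(Lambda_d)          # pre-assigned label set {0, 2}

# L_d projects the full prior onto the K_d labels present in y_d.
alpha_d = alpha[y_d]

# theta_d is then drawn only over the labels in y_d.
rng = np.random.default_rng(0)
theta_d = rng.dirichlet(alpha_d)
print(dict(zip(y_d.tolist(), theta_d.round(3))))
```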

In summary, L-LDA assumes the following generative process for a labeled corpus W:

  1. For each topic k

     (a) Choose \(\overrightarrow {{\phi _{k}}} = {\left ({{\phi _{k,1}},{\phi _{k,2}}, {\cdots } ,{\phi _{k,V}}} \right )^{T}} \sim Dirichlet\left ({\overrightarrow \beta } \right )\)

  2. For each document d in the corpus W

     (a) For each topic k

         (i) Choose \({{\Lambda }_{d,k}} \in \left \{ {0,1} \right \} \sim Bernoulli\left ({{\eta _{k}}} \right )\)

     (b) Choose \(\overrightarrow {{\alpha _{d}}} = {L_{d}} \times \overrightarrow \alpha \)

     (c) Choose \(\overrightarrow {{\theta _{d}}} = {({\theta _{d,{l_{1}}}},{\theta _{d,{l_{2}}}}, {\cdots } ,{\theta _{d,{l_{{K_{d}}}}}})^{T}} \sim Dirichlet\left ({\overrightarrow {{\alpha _{d}}} } \right )\)

     (d) For each of the \(N_{d}\) words \(w_{d,n}\)

         (i) Choose a topic \({z_{d,n}} \in \left \{ {y_{d,1}},{y_{d,2}}, {\cdots } ,{y_{d,{K_{d}}}} \right \} \sim Multinomial\left ({\overrightarrow {{\theta _{d}}} } \right )\)

         (ii) Choose a word \(w_{d,n}\) from \(p\left ({w_{d,n}}|{z_{d,n}},{\Phi }\right )\), a multinomial probability conditioned on the topic \(z_{d,n}\)

2.2 SL-LDA

In reviewing the L-LDA model, we found that it suffers from two problems in supervised tasks. First, the assumption that constrains each document to be sampled from its own label set \(y_{d}\) is too strong for corpora with small cardinality. For single-label corpora, in particular, L-LDA degenerates to the simple mixture of unigrams model. This undermines topic modeling's advantage of capturing latent semantics, and leads to worse classification performance. Second, because L-LDA defines a 1-1 correspondence between labels and topics, its number of topics equals the true number of labels in the corpus. From a generative perspective, L-LDA ignores hidden labels (i.e., lost labels), which might be neglected during manual annotation. More importantly, the L-LDA framework lacks a mechanism to cover common semantics. For example, the words "news" and "report", which express common semantics in newsgroup collections, might occur in most documents. In the context of L-LDA, such common words are forcibly assigned to a pre-defined label. Unfortunately, these words are less discriminative, so this harms classification.

To overcome these problems, we propose a novel supervised model. SL-LDA makes two fundamental assumptions beyond those of L-LDA, namely Prior 1 and Prior 2. The aim of Prior 1 is to relax the restriction on label sampling. Following the intuition that a document involves a wide range of labels, but only the main labels are tagged (e.g., given a document d with label proportion \({\left [ {0.1,0.2,0.2,0.5} \right ]}\), its label set might be manually given as \({y_{d}} = \left \{ 4 \right \}\)), Prior 1 supposes that each document d samples from all labels, and gives a dominant weight, i.e., the label threshold \(\theta^{*}\), to the labels in \(y_{d}\). Under this assumption, SL-LDA never degenerates to a simpler model. Prior 2 extends the concept of topics to handle the problem of semantic loss. It assumes that each document d is represented by K topics, consisting of \(K_{t}\) observed labels and \(K_{h}\) hidden topics. The hidden topics are used to cover potentially lost labels and common semantics, which are represented by common words. Consequently, less discriminative common words, such as "news" and "report" in newsgroup collections, contribute to the hidden topics rather than the pre-defined labels. Although Prior 2 seems straightforward, it provides significant benefits for classification tasks.

Note that in the structure of SL-LDA the observed labels and hidden topics are on the same level. Thus, in this paper, we do not differentiate between the concepts of “label” and “topic”. Prior 1 and Prior 2 are summarized as follows:

  1. Prior 1: each document d samples from all topics; the labels in \(y_{d}\) are given dominant proportions, defined by \(\theta^{*}\).

  2. Prior 2: there are a total of K topics, consisting of \(K_{t}\) observed labels and \(K_{h}\) hidden topics.

Formally, SL-LDA generates the topic-word distribution \(\overrightarrow {{\phi _{k}}}\) as in L-LDA. For each document d, it first samples the topic presence/absence indicator \(\overrightarrow {{{\Lambda }_{d}}}\) under the observed labels, and then generates the document-topic distribution \(\overrightarrow {{\theta _{d}}} {\kern 1pt}\) via the following conditional function:

$$\begin{array}{@{}rcl@{}} f(\overrightarrow \alpha ,{y_{d}},{\theta^{*}},{K_{d}}) = \left\{ \begin{array}{ll} {\theta_{d,i}} = \frac{{{\theta^{*}}}}{{{K_{d}}}}& i \in {y_{d}} \\ \\ \sim Dir^{\prime}(\overrightarrow{\alpha} ,{y_{d}})& otherwise \\ \end{array} \right. \end{array} $$
(1)

where \(\theta^{*} \le 1\). For each document d, \(\sum \nolimits _{i \in {y_{d}}} {{\theta _{d,i}}} = {\theta ^{*}}\) and \(\theta_{d,i} = \theta_{d,j}\), \(\forall i,j \in y_{d}\); \(Dir^{\prime }(\overrightarrow {\alpha } ,{y_{d}})\) is a pseudo-Dirichlet distribution with the following probability density function (PDF):

$$\begin{array}{@{}rcl@{}} Dir^{\prime}(\overrightarrow \alpha ,{y_{d}}) = \frac{{\Gamma \left( {\sum\nolimits_{k \notin {y_{d}}} {{\alpha_{k}}} } \right)}}{{{{\left( {1 - {\theta^{*}}} \right)}^{\sum\nolimits_{k \in {y_{d}}} {{\alpha_{k}}} }} \times \prod\limits_{k \notin {y_{d}}} {\Gamma \left( {{\alpha_{k}}} \right)} }}\prod\limits_{k \notin {y_{d}}} {\theta_{k}^{{\alpha_{k}} - 1}}.\\ \quad \sum\limits_{k \notin {y_{d}}} {{\theta_{k}}} = 1 - {\theta^{*}}. \end{array} $$

where \({\Gamma } \left (x \right )\) is the Gamma function.
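To make the conditional function in (1) concrete, the following sketch (hypothetical names; NumPy assumed) draws \(\overrightarrow{\theta_{d}}\): the pre-assigned labels share the fixed mass \(\theta^{*}\) equally, and the remaining topics split \(1-\theta^{*}\) via a rescaled ordinary Dirichlet draw, which realizes the pseudo-Dirichlet \(Dir^{\prime}\):

```python
import numpy as np

def draw_theta(alpha, y_d, theta_star, rng):
    """Sketch of (1): labels in y_d receive theta*/K_d each; the other
    topics share 1 - theta* via a rescaled Dirichlet draw (Dir')."""
    K = len(alpha)
    theta = np.empty(K)
    rest = np.setdiff1d(np.arange(K), y_d)
    theta[y_d] = theta_star / len(y_d)            # theta_{d,i} = theta*/K_d
    theta[rest] = (1 - theta_star) * rng.dirichlet(alpha[rest])
    return theta

rng = np.random.default_rng(1)
alpha = np.full(6, 0.5)          # K = 6 topics, e.g., 4 labels + 2 hidden
print(draw_theta(alpha, np.array([3]), theta_star=0.5, rng=rng))
```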

The generative process of SL-LDA can be summarized as follows (a brief simulation sketch is given after the list):

  1. For each topic k

     (a) Choose \(\overrightarrow {{\phi _{k}}} = {\left ({{\phi _{k,1}},{\phi _{k,2}}, {\cdots } ,{\phi _{k,V}}} \right )^{T}} \sim Dirichlet\left ({\overrightarrow \beta } \right )\)

  2. For each document d in the corpus W

     (a) For each observed label k

         (i) Choose \({{\Lambda }_{d,k}} \in \left \{ {0,1} \right \} \sim Bernoulli\left ({{\eta _{k}}} \right )\)

     (b) Choose \(\overrightarrow {{\theta _{d}}} = {({\theta _{d,1}},{\theta _{d,2}}, {\cdots } ,{\theta _{d,K}})^{T}} \sim f(\overrightarrow \alpha ,{y_{d}},{\theta ^{*}},{K_{d}})\)

     (c) For each of the \(N_{d}\) words \(w_{d,n}\)

         (i) Choose a topic \({z_{d,n}} \in \left \{ {{k_{1}},{k_{2}}, {\cdots } ,{k_{K}}} \right \} \sim Multinomial\left ({\overrightarrow {{\theta _{d}}} } \right )\)

         (ii) Choose a word \(w_{d,n}\) from \(p\left ({w_{d,n}}|{z_{d,n}},{\Phi } \right )\), a multinomial probability conditioned on the topic \(z_{d,n}\)
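The following end-to-end sketch simulates this generative process under purely illustrative settings (all sizes, priors, and names are assumptions of the sketch, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
V, K_t, K_h = 50, 4, 2       # vocabulary size, observed labels, hidden topics
K = K_t + K_h
beta = np.full(V, 0.1)
alpha = np.full(K, 0.5)
theta_star = 0.6

phi = rng.dirichlet(beta, size=K)   # step 1: topic-word distributions

def generate_doc(y_d, n_words):
    # step 2(b): theta_d from the conditional function f in (1)
    theta = np.empty(K)
    rest = np.setdiff1d(np.arange(K), y_d)
    theta[y_d] = theta_star / len(y_d)
    theta[rest] = (1 - theta_star) * rng.dirichlet(alpha[rest])
    # steps 2(c)(i)-(ii): draw a topic, then a word, for each position
    z = rng.choice(K, size=n_words, p=theta)
    w = np.array([rng.choice(V, p=phi[k]) for k in z])
    return z, w

z, w = generate_doc(y_d=np.array([1]), n_words=20)
print(z[:10], w[:10])
```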

Discussion of parameter \(\theta^{*}\)

The parameter \(\theta^{*}\) is a crucial threshold that determines the weight given to the marked labels. We suggest the tuning interval \(\theta ^{*}\in \left [ {0.5,0.8} \right ]\), following the intuition that: (1) a document is assigned a label (or labels) only if at least 50% of its content focuses on that label (or labels); (2) a document becomes over-focused on the marked label (or labels) as \(\theta ^{*}\) approaches 1.

In Sections 5.2.3 and 6.4, we study \(\theta^{*}\) experimentally. The results show that optimum performance is achieved when \(\theta^{*}\) lies within the suggested interval.

2.3 Evidence for SL-LDA

Given the parameters Φ and Θ, the joint probability of the labeled corpus W and a set of topic assignments \(\overrightarrow z \) is:

$$\begin{array}{@{}rcl@{}} P\left( {W,\overrightarrow z |{\Theta} ,{\Phi} } \right) &=& \prod\limits_{d = 1}^{D} {\prod\limits_{k = 1}^{K} {\prod\limits_{v = 1}^{V} {\theta_{d,k}^{{N_{d,k}}}\phi_{k,v}^{{N_{k,v}}}} } \quad } \; \\ &&({\theta_{d,k}} = \frac{{{\theta^{*}}}}{{{K_{d}}}},\;\;\;if\,k \in {y_{d}}) \end{array} $$
(2)

where \(N_{d,k}\) is the number of times that topic k occurs in document d, and \(N_{k,v}\) is the number of times that word v has been assigned to topic k.

SL-LDA places a Dirichlet prior over Φ:

$$ P\left( {\Phi |\overrightarrow \beta } \right) = \prod\limits_{k = 1}^{K} {\frac{{\Gamma \left( {\sum\nolimits_{v = 1}^{V} {{\beta_{v}}} } \right)}}{{\prod\nolimits_{v = 1}^{V} {\Gamma \left( {{\beta_{v}}} \right)} }}} \prod\limits_{v = 1}^{V} {\phi_{v}^{{\beta_{v}} - 1}} $$
(3)

and a pseudo-Dirichlet prior over Θ:

$$ P\left( {\Theta |\overrightarrow \alpha } \right) = \prod\limits_{d = 1}^{D} {\frac{{\Gamma \left( {\sum\nolimits_{k \notin {y_{d}}} {{\alpha_{k}}} } \right)}}{{{{\left( {1 - {\theta^{*}}} \right)}^{\sum\nolimits_{k \in {y_{d}}} {{\alpha_{k}}} }} \times \prod\nolimits_{k \notin {y_{d}}} {\Gamma \left( {{\alpha_{k}}} \right)} }}\prod\limits_{k \notin {y_{d}}} {\theta_{k}^{{\alpha_{k}} - 1}} } $$
(4)

Given the hyper-parameters \(\overrightarrow \alpha \) and \(\overrightarrow \beta \), we obtain the evidence for a labeled corpus W by combining (2), (3) and (4):

$$\begin{array}{@{}rcl@{}} P\left( {W|\overrightarrow \alpha ,\overrightarrow \beta } \right)& = & \sum\limits_{z} {\left( {{C^{*}}\prod\limits_{d = 1}^{D} {\frac{{\Gamma \left( {\sum\nolimits_{k \notin {y_{d}}} {{\alpha_{k}}} } \right)}}{{\prod\nolimits_{k \notin {y_{d}}} {\Gamma \left( {{\alpha_{k}}} \right)} }}} \frac{{\prod\nolimits_{k \notin {y_{d}}} {\Gamma \left( {{\alpha_{k}} + {N_{d,k}}} \right)} }}{{\Gamma \left({\sum\nolimits_{k \notin {y_{d}}} {\left( {{\alpha_{k}} + {N_{d,k}}} \right)} } \right)}}} \right.} \\ && \left. {\prod\limits_{k = 1}^{K} {\frac{{\Gamma \left( {\sum\nolimits_{v = 1}^{V} {{\beta_{v}}} } \right)}}{{\prod\nolimits_{v = 1}^{V} {\Gamma \left( {{\beta_{v}}} \right)} }}\frac{{\prod\nolimits_{v = 1}^{V} {\Gamma \left( {{\beta_{v}} + {N_{k,v}}} \right)} }}{{\Gamma \left( {\sum\nolimits_{v = 1}^{V} {\left( {{\beta_{v}} + {N_{k,v}}} \right)} } \right)}}} } \right) \end{array} $$
(5)

where the variable \(C^{*}\) is defined, for simplicity, as follows:

$$\begin{array}{@{}rcl@{}} {C^{*}} = \frac{{{{\left( {{\theta^{*}}} \right)}^{\sum\nolimits_{d = 1}^{D} {\sum\limits_{k \in {y_{d}}} {{N_{d,k}}} } }}}}{{\prod\nolimits_{d = 1}^{D} {K_{d}^{\sum\nolimits_{k \in {y_{d}}} {{N_{d,k}}} }} }}{\left( {1 - {\theta^{*}}} \right)^{\sum\nolimits_{d = 1}^{D} {\sum\limits_{k \notin {y_{d}}} {{N_{d,k}}} } }} \end{array} $$

3 Estimation and inference

In this section, we describe the process of parameter estimation and inference with respect to SL-LDA.

3.1 Estimation for hyper-parameters

We develop a Gibbs-EM algorithm to learn the hyper-parameters \(\overrightarrow \alpha \) and \(\overrightarrow \beta \) in SL-LDA. Gibbs sampling [2, 19] is a popular method for approximate learning in high-dimensional models. It approximates a high-dimensional probability distribution using the stationary state of a Markov chain Monte Carlo (MCMC) chain. We first use Gibbs sampling to approximate the expectation of the topic assignments \(\overrightarrow z \), i.e., \(E\left [ {P\left ({\overrightarrow z |W,\overrightarrow \alpha ,\overrightarrow \beta } \right )} \right ]\), and then maximize the likelihood of (5) to estimate the hyper-parameters. Repeating this procedure until convergence, we obtain the optimized values \({\overrightarrow \alpha ^{M}}\) and \({\overrightarrow \beta ^{M}}\). The Gibbs-EM algorithm is summarized as follows:

  1. Initialize topics \({\overrightarrow z^{0}}\) and hyper-parameters \({\overrightarrow \alpha ^{0}}\) and \({\overrightarrow \beta ^{0}}\)

  2. For \(i=1,2, \cdots\)

     (a) E-step: sample \({\overrightarrow z^{i}}\) from \(P\left ({\overrightarrow z |W,\overrightarrow \alpha ,\overrightarrow \beta } \right )\) by Gibbs sampling.

     (b) M-step: maximize \(\log P\left ({W,{{\overrightarrow z }^{i}}|\overrightarrow \alpha ,\overrightarrow \beta } \right )\) w.r.t. \({\overrightarrow \alpha ^{i}}\) and \({\overrightarrow \beta ^{i}}\) using the fixed-point method.

     Until convergence

In the E-step, we sample each topic assignment \(z_{t}\) alternately from the distribution determined by all other topic assignments. Let \(\widetilde {{z_{t}}}\) denote the set of topic assignments excluding the current variable \(z_{t}\). The conditional probability for \(z_{t}\) is then:

$$\begin{array}{@{}rcl@{}} &&{}P\left( {{z_{t}} = {k^{*}}|\widetilde{{z_{t}}},W,\overrightarrow \alpha ,\overrightarrow \beta } \right) \propto \\ &&{} \left\{ \begin{array}{cc} \frac{{{\theta^{*}}}}{{{K_{d}}}} \times \frac{{{N_{\neg {k^{*}},{w_{t}}}} + {\beta_{{w_{t}}}}}}{{{N_{\neg {k^{*}}}} + \sum\limits_{v = 1}^{V} {{\beta_{v}}} }} & \quad if\,{k^{*}} \in {y_{{d_{t}}}} \\ (1 - {\theta^{*}}) \times \frac{{{N_{\neg {d_{t}},{k^{*}}}} + {\alpha_{{k^{*}}}}}}{{{N_{\neg {d_{t}}}} + \sum\limits_{k \notin {y_{{d_{t}}}}} {{\alpha_{k}}} }} \times \frac{{{N_{\neg {k^{*}},{w_{t}}}} + {\beta_{{w_{t}}}}}}{{{N_{\neg {k^{*}}}} + \sum\limits_{v = 1}^{V} {{\beta_{v}}} }} & else \\ \end{array} \right. \end{array} $$
(6)

where \(w_{t}\) is the word corresponding to \(z_{t}\); \(d_{t}\) is the document containing \(w_{t}\); and counts with the subscript "¬" exclude the current token, e.g., \(N_{\neg k} = N_{k} - 1\).
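A compact sketch of this update is given below (NumPy assumed; array names such as N_dk are hypothetical). The document term for non-label topics is computed over the counts with \(k \notin y_{d}\), matching the pseudo-Dirichlet posterior, and all counts are assumed to already exclude the current token:

```python
import numpy as np

def sample_topic(d, w, y_d, N_dk, N_kv, N_k, alpha, beta, theta_star, rng):
    """One E-step draw from (6) for a token of word w in document d.
    N_dk, N_kv, N_k exclude the current token (the "¬" convention).
    A sketch under the stated reading, not the authors' implementation."""
    K, V = N_kv.shape
    word = (N_kv[:, w] + beta[w]) / (N_k + beta.sum())   # shared word term
    p = np.empty(K)
    rest = np.setdiff1d(np.arange(K), y_d)
    p[y_d] = (theta_star / len(y_d)) * word[y_d]         # branch: k* in y_d
    doc = (N_dk[d, rest] + alpha[rest]) / (N_dk[d, rest].sum() + alpha[rest].sum())
    p[rest] = (1 - theta_star) * doc * word[rest]        # branch: else
    return rng.choice(K, p=p / p.sum())

rng = np.random.default_rng(3)
K, V = 6, 50
N_dk = rng.integers(0, 5, size=(1, K)); N_kv = rng.integers(0, 5, size=(K, V))
print(sample_topic(0, 7, np.array([1]), N_dk, N_kv, N_kv.sum(axis=1),
                   np.full(K, 0.5), np.full(V, 0.1), 0.6, rng))
```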

The M-step uses the fixed-point method. Using the samples \(\overrightarrow z \) obtained in the E-step, we update the hyper-parameter \(\alpha _{k}^{i + 1}\) as follows:

$$\begin{array}{@{}rcl@{}} \alpha_{k}^{i + 1} &=& {\alpha_{k}^{i}} \\ &&\times \frac{{\sum\nolimits_{d = 1}^{D} {\left( {\Psi \left( {N_{d,k}^{i} + {\alpha_{k}^{i}}} \right) - {\Psi} \left( {{\alpha_{k}^{i}}} \right)} \right)} }}{{\sum\nolimits_{d = 1}^{D} {\left( {\Psi \left( {\sum\limits_{k \notin {y_{d}}} {\left( {N_{d,k}^{i} + {\alpha_{k}^{i}}} \right)} } \right) - {\Psi} \left( {\sum\limits_{k \notin {y_{d}}} {{\alpha_{k}^{i}}} } \right)} \right)} }} \\ &&{\kern85pt} \left( {N_{d,k}^{i} = 0,\;if\;k \in {y_{d}}} \right) \end{array} $$
(7)

and for \(\beta _{v}^{i + 1}\)

$$\begin{array}{@{}rcl@{}} \beta_{v}^{i + 1} &=& {\beta_{v}^{i}} \\ &&\times\frac{{\sum\nolimits_{k = 1}^{K} {\left( {\Psi \left( {N_{k,v}^{i} + {\beta_{v}^{i}}} \right) - {\Psi} \left( {{\beta_{v}^{i}}} \right)} \right)} }}{{\sum\nolimits_{k = 1}^{K} {\left( {\Psi \left( {{N_{k}^{i}} + \sum\nolimits_{v = 1}^{V} {{\beta_{v}^{i}}} } \right) - {\Psi} \left( {\sum\nolimits_{v = 1}^{V} {{\beta_{v}^{i}}} } \right)} \right)} }}\\ \end{array} $$
(8)

where Ψ(x) is the Digamma function, i.e., the logarithmic derivative of the Gamma function.
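The two fixed-point updates can be sketched as follows (NumPy and SciPy's digamma assumed; the masking of label counts reflects the condition \(N_{d,k}^{i} = 0\) if \(k \in y_{d}\) in (7)):

```python
import numpy as np
from scipy.special import digamma

def update_alpha(alpha, N_dk, label_mask):
    """One fixed-point step of (7). N_dk: per-document topic counts;
    label_mask[d, k] is True when k is in y_d (those counts are zeroed,
    and the per-document sums run only over k not in y_d)."""
    N = np.where(label_mask, 0, N_dk)
    num = (digamma(N + alpha) - digamma(alpha)).sum(axis=0)
    a_rest = np.where(label_mask, 0.0, alpha)    # alpha restricted per document
    den = (digamma((N + a_rest).sum(axis=1)) - digamma(a_rest.sum(axis=1))).sum()
    return alpha * num / den

def update_beta(beta, N_kv):
    """One fixed-point step of (8); the sums run over the K topics."""
    num = (digamma(N_kv + beta) - digamma(beta)).sum(axis=0)
    den = (digamma(N_kv.sum(axis=1) + beta.sum()) - digamma(beta.sum())).sum()
    return beta * num / den
```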

Consistency over multiple runs

Over multiple runs, the results of Gibbs sampling might be inconsistent because of the random sampling procedure [9]. This problem persists in our model, but its effect on the classification tasks described here is negligible. Note that the topics in SL-LDA are composed of observed labels and hidden topics. Given a set of training data, the observed labels are fixed, so they must be consistent. The hidden topics, because they are used to collect common words, contribute to classification regardless of whether they are consistent.

3.2 Inference for unlabeled documents

Let \(d^{\prime }\) be a document from the testing corpus \(W^{\prime }\) and \(U = \left \{ {\overrightarrow z ,\overrightarrow w } \right \}\) be a stationary MCMC state for the training corpus W.

We employ Gibbs sampling to infer the unlabeled document \(d^{\prime }\) by estimating the posterior distribution of topic assignments [7]: \(P\left ({z^{\prime }|{{\overrightarrow w }_{d^{\prime }}},U,{{\overrightarrow \alpha }^{M}},{{\overrightarrow \beta }^{M}}} \right )\), where \({{\overrightarrow w }_{d^{\prime }}}\) is the word vector of \(d^{\prime }\), and \({{\overrightarrow \alpha }^{M}},{{\overrightarrow \beta }^{M}}\) are obtained during the parameter estimation procedure. As the test document is unlabeled, the update rule is as follows:

$$\begin{array}{@{}rcl@{}} &&{} P\left( {{{z^{\prime}}_{t}} = {k^{*}}|\widetilde{{{z^{\prime}}_{t}}},{{\overrightarrow w }_{d^{\prime}}},U,{{\overrightarrow \alpha }^{M}},{{\overrightarrow \beta }^{M}}} \right)\propto \\ &&{\kern34pt} \frac{{{N_{\neg d^{\prime},{k^{*}}}} + \alpha_{{k^{*}}}^{M}}}{{{N_{\neg d^{\prime}}} + \sum\nolimits_{k = 1}^{K} {{\alpha_{k}^{M}}} }} \times \frac{{{N_{{k^{*}},{{w^{\prime}}_{t}}}} + {{N^{\prime}}_{\neg {k^{*}},{{w^{\prime}}_{t}}}} + \beta_{{{w^{\prime}}_{t}}}^{M}}}{{{N_{{k^{*}}}} + {{N^{\prime}}_{\neg {k^{*}}}} + \sum\nolimits_{v = 1}^{V} {{\beta_{v}^{M}}} }}\\ \end{array} $$
(9)

where \({w^{\prime }_{t}}\) is the t-th word in \({\overrightarrow w_{d^{\prime }}}\); \({N^{\prime }_{{k^{*}},v}}\) is the number of times that v has been generated by topic \(k^{*}\).

Finally, the topic distribution of document \(d^{\prime }\) is estimated as follows:

$$ {\theta_{d^{\prime},k}} = \frac{N_{d^{\prime},k} + \alpha_{k}^{M}}{\sum\nolimits_{i = 1}^{K} \left(N_{d^{\prime},i} + \alpha_{i}^{M} \right)} $$
(10)

where \(N_{d^{\prime },i}\) is the number of times that topic i has occurred in document \(d^{\prime }\).
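The inference loop can be sketched as follows (NumPy assumed; names such as Np_dk for the document-local counts \(N^{\prime}\) are hypothetical). Training counts stay fixed while the test document's assignments are resampled, and (10) is read off at the end:

```python
import numpy as np

def infer_doc(w_doc, N_kv, N_k, alpha_M, beta_M, rng, sweeps=100):
    """Sketch of (9)-(10): fold an unlabeled document into the trained
    state U by Gibbs sampling, then estimate theta_d'. N_kv, N_k are
    fixed training counts; Np_* are test-document counts."""
    K, V = N_kv.shape
    z = rng.integers(0, K, size=len(w_doc))    # random initial topics
    Np_dk = np.bincount(z, minlength=K)
    Np_kv = np.zeros((K, V))
    np.add.at(Np_kv, (z, w_doc), 1)
    for _ in range(sweeps):
        for t, w in enumerate(w_doc):
            k_old = z[t]                       # remove the current token
            Np_dk[k_old] -= 1; Np_kv[k_old, w] -= 1
            # update rule (9); the k-independent denominator of the
            # first factor cancels in the normalization below
            p = ((Np_dk + alpha_M) *
                 (N_kv[:, w] + Np_kv[:, w] + beta_M[w]) /
                 (N_k + Np_dk + beta_M.sum()))
            z[t] = rng.choice(K, p=p / p.sum())
            Np_dk[z[t]] += 1; Np_kv[z[t], w] += 1
    return (Np_dk + alpha_M) / (Np_dk + alpha_M).sum()   # eq. (10)

rng = np.random.default_rng(5)
K, V = 6, 50
N_kv = rng.integers(1, 20, size=(K, V)).astype(float)
w_doc = rng.integers(0, V, size=15)
print(infer_doc(w_doc, N_kv, N_kv.sum(axis=1), np.full(K, 0.5),
                np.full(V, 0.1), rng, sweeps=20))
```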

4 Related work

4.1 Topic model

A number of variants of LDA have been developed for supervised cases. Representative models for single-label corpora include sLDA [4], which captures document labels as a classification response, DiscLDA [12], where documents are associated with labels and topic mixtures, and MedLDA [23, 24], which combines maximum-margin technology with LDA. For multi-label corpora, L-LDA [15] was the first supervised topic model. The authors of [17] developed Flat-LDA, an equivalent model, and further extended it to Pr-LDA and Dep-LDA by modeling label frequency and label dependency, respectively. Other modifications for multi-label corpora include DP-MRM [11] and Partially LDA [16].

The proposed SL-LDA builds upon the L-LDA model by introducing a topic threshold \(\theta^{*}\) and extending the concept of topics. In a sense, DiscLDA also extends the topics, as it represents documents by labels and topic mixtures. However, DiscLDA appears to be more complicated than our model. Several models that consider common topics have been explored, e.g., the Cluster-based Topic Model [3] and the Multi-Grain Cluster Topic Model [20]. However, as unsupervised models, these cannot be directly applied to classification.

The state-of-the-art Pr-LDA and Dep-LDA models can also be deemed extensions of L-LDA. Pr-LDA assumes a corpus-wide distribution over label occurrence frequencies; a document's topic priors are generated from this frequency distribution, instead of all documents sharing the same prior. Building on Pr-LDA, Dep-LDA further introduces a topic level above the label level to capture label dependency. Although these two models have significantly improved multi-label document classification [17], the two problems of L-LDA remain. More importantly, these models (particularly Dep-LDA) are complex and contain many parameters, so tuning them for different corpora might be time-consuming. In contrast, our model has a simpler construction and fewer parameters, and so is straightforward and easily controlled. Another advantage of SL-LDA is that it can be applied to both single- and multi-label corpora, which leads to better scalability in practice.

4.2 Computational complexity

We now discuss the computational complexity of training various supervised topic models. In the original papers, these models were trained using different inference algorithms, and so we provide a descriptive comparison. The traditional LDA model is used as the baseline.

As SL-LDA does not change the basic construction of the traditional LDA model, its complexity will be the same as that of LDA. All single-label models perform extra computations, e.g., calculating the normalization factor in sLDA, learning the transition matrix in DiscLDA, and solving the dual problem in MedLDA. In terms of multi-label models, it is clear that L-LDA is as efficient as LDA, but Pr-LDA and Dep-LDA must compute the upper-topic distribution over the labels. In particular, proper inference using Dep-LDA is time-consuming, because it repeatedly computes the expensive Gamma function. Thus, we argue that the proposed SL-LDA model is more efficient than most of these related supervised topic models. Empirical tests are reported in the following sections.

5 Evaluation on single-label setting

For a single-label corpus (\(K_{d} = 1\)), we evaluate both the text modeling and document classification performance of SL-LDA. Experiments are run on the balanced Newsgroups collection, which consists of 19,997 documents in 20 related categories (i.e., \(K_{t} = 20\)). By convention, we remove stop words in the standard list and words that occur only once in the corpus.

In the experiments, the parameter \(\theta^{*}\) is tuned over the set \(\left \{ {0.5,0.6,0.7,0.8} \right \}\). Other parameters are set as follows: 1) the hyper-parameters \(\overrightarrow \alpha \) and \(\overrightarrow \beta \) are initialized to 50/K and 0.1, respectively; 2) the maximum number of iterations is 50, and the termination precision is 1 × 10−4; 3) Gibbs sampling uses a burn-in of 500 iterations in the E-step.

5.1 Text modeling

We conducted a simple text modeling experiment to examine the topic structures given by SL-LDA. We fitted the SL-LDA model to the Newsgroups corpus with \(\theta^{*} = 0.5\) and \(K_{h} = 5\).

Table 2 lists the 10 most frequent words for the 20 observed labels and 5 hidden topics. This result is similar to that given by DiscLDA [12]: the words in the labeled rows correspond closely to their labels, whereas the words in the hidden topics show no obvious preference for any class.

Table 2 The most frequent words for the pre-assigned labels and hidden topics

5.2 Classification performance

For a single-label corpus, SL-LDA assigns each test document d a label \(y_{d}^{*}\) via:

$$ y_{d}^{*} = \mathop {\arg \max }\limits_{k = 1,2, {\cdots} ,{K_{t}}} ({\theta_{d,k}}) $$
(11)
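As a small illustration of (11), assuming the observed labels occupy the first \(K_{t}\) topic indices (an assumption of this sketch, not notation fixed by the paper):

```python
import numpy as np

# Sketch of (11): the predicted label is the observed label with the
# largest topic weight; hidden topics (indices >= K_t) are excluded.
def predict_label(theta_d, K_t):
    return int(np.argmax(theta_d[:K_t]))

print(predict_label(np.array([0.1, 0.5, 0.1, 0.3]), K_t=3))  # -> 1
```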

Following previous studies [12, 23], we evaluated SL-LDA in terms of binary- and multi-class document classification. Several existing supervised topic models were chosen as performance baselines, i.e., LDA-SVM, sLDA, DiscLDA, and MedLDA. Average scores were obtained from 20 runs, and pairwise t-tests between SL-LDA and the baselines were conducted at the 5 % significance level. As in [22], an indicator •/∘ is used to denote whether SL-LDA was found to be statistically superior/inferior to the compared algorithm.

5.2.1 Binary classification

The binary-class (\(K_{t} = 2\)) classification experiments were performed on two subgroups of Newsgroups, i.e., alt.atheism and talk.religion.misc, following the design described in [12, 23].

For LDA-SVM, we used the Gibbs-EM algorithm to fit the upstream LDA model, and employed the popular LibSVM as the downstream classifier. We used a radial basis function kernel, and optimized its parameters via grid search. The public code of sLDA and MedLDA was employed to ensure dependable results, with parameters set according to the discussions in the corresponding publications. The DiscLDA results were taken from the original paper, as we could not obtain its original implementation.

Table 3 lists the results for different topic numbers K. We can see that SL-LDA attains a competitive level of performance, and is statistically superior to the other models in most cases. Our model obtains better scores for relatively small K, and achieves the highest score of 0.819 when K = 15. As K becomes larger, SL-LDA performs slightly worse than the state-of-the-art MedLDA, e.g., by 0.09 for K = 25 and 0.02 for K = 35. We believe this is because larger values of K introduce too many hidden topics for binary-class settings, reducing performance. In summary, our model provides higher performance with fewer topics, i.e., with less computational expense. This characteristic is quite meaningful in practice.

Table 3 The performance (averaged score± standard deviation) for binary-class cases; •/∘ means whether SL-LDA is statistically superior/inferior to the compared algorithm

5.2.2 Multi-class classification

The multi-class classification experiments considered the full Newsgroups collection. We compared SL-LDA with LDA-SVM, sLDA, and MedLDA. All three benchmark models were set up as for the binary-class experiment.

The experimental results are given in Table 4. Clearly, SL-LDA achieves the top scores for all K, and is also statistically superior to the other models in most cases. Compared with LDA-SVM and sLDA, SL-LDA yields significant improvements, scoring around 0.2 higher than LDA-SVM and 0.3 higher than sLDA. More importantly, SL-LDA outperforms the state-of-the-art MedLDA by about 0.02.

Table 4 The performance (averaged score± standard deviation) for multi-class cases; •/∘ means whether SL-LDA is statistically superior/inferior to the compared algorithm

5.2.3 Study on parameter 𝜃

In this section, we examine the label threshold \(\theta^{*}\). Figure 1 illustrates the binary-class results for different values of \(\theta^{*}\). We can clearly observe that the performance at values of 0.5 and 0.6 is better than in the other cases. The multi-class cases (Fig. 2) exhibit a similar trend, with 0.5 and 0.6 dominating the performance. These results conform to the discussion in Section 2.2.

Fig. 1 The evaluation of parameter \(\theta^{*}\) on the binary-class case

Fig. 2 The evaluation of parameter \(\theta^{*}\) on the multi-class corpus

5.3 Running time

We now examine the time efficiency of SL-LDA on a 3.1 GHz Intel Core i5-2400 CPU. To ensure a fair comparison, two public models, i.e., sLDA and MedLDA, were used as baselines.

Figure 3 shows that SL-LDA is more efficient than sLDA. As reported in [24], we found that sLDA is quite time-consuming. This is mainly because the normalization factor strongly couples the topic assignments of all the words. SL-LDA is also faster than MedLDA, which needs to solve an extra dual problem during training.

Fig. 3 The training time (seconds, log2 scale) for different numbers of topics in multi-class classification

6 Evaluation on multi-label setting

In this section, we evaluate the performance of SL-LDA for multi-label (\(K_{d} > 1\)) document classification. The parameter \(\theta^{*}\) is tuned over the set \(\left \{ {0.5,0.55,0.6,0.65,0.7,0.75,0.8} \right \}\), and the other parameters are set as in the single-label case.

6.1 Metric

The multi-label classification problem requires more metrics than the single-label case. In this experiment, we employ several popular metrics [18] to evaluate SL-LDA. Assume that the test corpus \(W^{\prime }\) consists of \(D^{\prime }\) documents. For each document d, \(y_{d}\) and \(y_{d}^{*}\) denote the true and estimated label sets, respectively, where \( {y_{d}}, y_{d}^{*} \subseteq \left \{ {1,2, {\cdots } ,{K_{t}}} \right \}\).

6.1.1 Rank-based metric

The estimated rank of label k for document d is denoted by \({r_{d}}\left (k \right )\). We introduce three rank-based metrics, for each of which smaller values imply better classification.

Ranking loss

This measures the average fraction of label pairs in which an irrelevant label is ranked above a relevant label:

$$ Rnk\text{-}Loss = \frac{1}{{D^{\prime}}}\sum\limits_{d = 1}^{D^{\prime}} {\frac{1}{{\left| {{y_{d}}} \right|\left| {\overline {{y_{d}}} } \right|}}} \left| {\left\{ {\left( {{k_{i}},{k_{j}}} \right):{r_{d}}\left( {{k_{i}}} \right) > {r_{d}}\left( {{k_{j}}} \right),\left( {{k_{i}},{k_{j}}} \right) \in {y_{d}} \times \overline {{y_{d}}} } \right\}} \right| $$
(12)

where \(\overline {{y_{d}}} \) is the complement of \(y_{d}\).

One error

This measures how many times the top-ranked label is not in the true label set:

$$\begin{array}{@{}rcl@{}} One-Err = \frac{1}{{D^{\prime}}}\sum\limits_{d = 1}^{D^{\prime}} {\delta \left( {\mathop {argmin}\limits_{k = 1,2, {\cdots} ,{K_{t}}} {r_{d}}\left( k \right)} \right)} \end{array} $$
(13)

where

$$\begin{array}{@{}rcl@{}} \delta \left( k \right) = \left\{ \begin{array}{lc} 1 & k \notin {y_{d}} \\ 0 & otherwise \\ \end{array} \right. \end{array} $$

Margin

Measures the difference in ranking between the top-ranked irrelevant label and the lowest-ranked relevant label:

$$ Margin = \frac{1}{{D^{\prime}}}\sum\limits_{d = 1}^{D^{\prime}} {\left| {\mathop {\min }\limits_{k \in \overline {{y_{d}}} } {r_{d}}\left( k \right) - \mathop {\max }\limits_{k \in {y_{d}}} {r_{d}}\left( k \right)} \right|} $$
(14)
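The three rank-based metrics can be sketched as follows (NumPy assumed; ranks are taken as 1 = best, and the min/max in (14) are read as ranks, per the textual description above):

```python
import numpy as np

def rank_metrics(r, y_sets):
    """Sketch of (12)-(14). r[d, k]: estimated rank of label k for
    document d (1 = top); y_sets[d]: the true label set of document d."""
    D, K = r.shape
    rnk = one_err = margin = 0.0
    for d in range(D):
        y = np.array(sorted(y_sets[d]))
        yb = np.setdiff1d(np.arange(K), y)          # complement of y_d
        bad = sum(r[d, i] > r[d, j] for i in y for j in yb)
        rnk += bad / (len(y) * len(yb))             # eq. (12)
        one_err += int(np.argmin(r[d]) not in y_sets[d])   # eq. (13)
        margin += abs(r[d, yb].min() - r[d, y].max())      # eq. (14)
    return rnk / D, one_err / D, margin / D

r = np.array([[2, 1, 3, 4]])          # label 1 is ranked first
print(rank_metrics(r, [{0, 2}]))      # -> (0.5, 1.0, 2.0)
```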

6.1.2 Binary prediction metric

The two binary prediction metrics used in our experiments are the Macro-F1 and Micro-F1 scores. Larger Macro-F1 and Micro-F1 scores denote better performance.

We define the Recall, Precision and F1-score for a document d as follows:

$$\begin{array}{@{}rcl@{}} &&{}Recall\left( d \right) = \frac{{\left| {{y_{d}} \cap y_{d}^{*}} \right|}}{{\left| {{y_{d}}} \right|}}\\ &&{}Precision\left( d \right) = \frac{{\left| {{y_{d}} \cap y_{d}^{*}} \right|}}{{\left| {y_{d}^{*}} \right|}} \\ &&{}F1-score\left( d \right) = \frac{{2 \times Recall\left( d \right) \times Precision\left( d \right)}}{{Recall\left( d \right) + Precision\left( d \right)}} \end{array} $$
(15)

After computing the F1-scores for all test documents, the Macro-F1 metric is obtained by averaging the individual F1-scores. Micro-F1 treats the full test corpus as one large document, and can be computed directly using (15).
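A small sketch of both scores (plain Python; representing label sets as Python sets is an implementation choice of this sketch):

```python
def macro_micro_f1(true_sets, pred_sets):
    """Sketch of (15): Macro-F1 averages per-document F1 scores, while
    Micro-F1 pools the intersections and set sizes over the whole corpus."""
    f1s, tp, n_true, n_pred = [], 0, 0, 0
    for y, yp in zip(true_sets, pred_sets):
        inter = len(y & yp)
        rec = inter / len(y)
        prec = inter / len(yp) if yp else 0.0
        f1s.append(2 * rec * prec / (rec + prec) if rec + prec else 0.0)
        tp += inter; n_true += len(y); n_pred += len(yp)
    rec, prec = tp / n_true, tp / n_pred
    micro = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return sum(f1s) / len(f1s), micro

print(macro_micro_f1([{0, 1}, {2}], [{0}, {2, 3}]))   # -> (0.666..., 0.666...)
```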

6.2 Datasets

Yahoo! Arts and Health

These two corpora are from the Yahoo! Collection. Arts consists of 7,741 documents and 19 unique labels, and Health contains 9,109 documents and 14 unique labels. The cardinality of both datasets is relatively small, i.e., 1.7 (Arts) and 1.6 (Health), and about 55% of the documents have only a single label.

Following the preprocessing steps in [10], we randomly sampled 1,000 documents from each dataset, ensuring that each label appeared at least once, to form the training data. The remaining documents were used as the test data. This process was repeated 5 times to give 5 available training/test splits.

RCV1-v2 [13]

The RCV1-v2 dataset is another popular benchmark for multi-label document classification. It consists of over 800,000 documents with 103 labels. In our experiment, we used the original training set from the LYRL2004 split [13], which contains 23,149 documents covering 101 labels. We selected 30,000 documents at random from the LYRL2004 test split to form the test data. This procedure was repeated 10 times, giving 10 training/test splits. The cardinality of RCV1-v2 is about 3.1, larger than that of the Yahoo! subdirectories.

6.3 Classification performance

Several existing approaches were compared with SL-LDA, including L-LDA [15], Pr-LDA, Dep-LDA [17], and Tuned-SVMs (T-SVMs) [13]. We implemented in-house code for the three supervised LDA models, with all parameters set according to the discussions in the original papers. We implemented T-SVMs using LibSVM and the parameters in [13]. Each approach was executed 20 times on each training/test split (i.e., a total of 100 times for Arts and Health and 200 times for RCV1-v2), and pairwise t-tests between SL-LDA and the baselines were conducted at the 5% significance level.

In early experiments, we found that performance was stable when the number of topics K was slightly larger than the true number of labels, but decreased for significantly larger values of K. We argue that this is reasonable because, intuitively, a large K will exaggerate the effect of the hidden topics. In this section, we report the results for the two Yahoo! datasets and the RCV1-v2 dataset with K = 100 and K = 240, respectively.

6.3.1 Rank-based performance

SL-LDA outputs the distribution over topics for each unlabeled document. Thus, we can directly rank each label k for document \(d^{\prime}\), i.e., compute \({r_{d}}\left (k \right )\). The experimental results are given in Table 5: our model performs well in terms of both the numerical results and the statistical tests.

Table 5 The Experiment results (averaged score± standard deviation) on Ranking Loss (top section), One Error (middle section) and Margin (bottom section); •/∘ means whether SL-LDA is statistically superior/inferior to the compared algorithm

Among the supervised LDA models, SL-LDA performs much better than L-LDA and Pr-LDA, and achieves competitive performance with the state-of-the-art Dep-LDA. The difference between L-LDA and SL-LDA is significant. The difference between Pr-LDA and SL-LDA is also large, except for the Yahoo! Arts dataset. We believe this is because of the small size and low label density of this dataset. SL-LDA also outperformed the state-of-the-art Dep-LDA on 5/6 scores across the Yahoo! datasets and on 2/3 evaluation metrics across the larger RCV1-v2 dataset. Although SL-LDA is simpler than Dep-LDA, it attains better performance in terms of rank-based metrics.

SL-LDA also achieved higher scores than T-SVMs for 8/9 metrics across all three datasets. These results demonstrate that SL-LDA is competitive with the state-of-the-art discriminative method.

6.3.2 Binary prediction performance

To compute the binary prediction metrics, we must transform the label ranking of each test document d into its estimated label set \(y_{d}^{*}\), which consists of the top N ranked labels. Here, we set N equal to the cardinality of each dataset.
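This thresholding step can be sketched as follows (NumPy assumed; as before, the sketch assumes the observed labels occupy the first \(K_{t}\) topic indices):

```python
import numpy as np

# Sketch of the top-N step: rank the observed labels by their topic
# weights and keep the N best, with N set to the dataset's cardinality.
def top_n_labels(theta_d, K_t, n):
    return set(np.argsort(-theta_d[:K_t])[:n].tolist())

theta_d = np.array([0.05, 0.4, 0.15, 0.3, 0.1])
print(top_n_labels(theta_d, K_t=5, n=2))   # -> {1, 3}
```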

The binary prediction results are listed in Table 6. We can see that SL-LDA attains better performance, and is statistically superior to the other methods. In most cases, SL-LDA outperforms the other supervised LDA models in terms of both Macro-F1 and Micro-F1 across all three datasets. Against the simpler models, SL-LDA scores about 0.09-0.11 higher than L-LDA and about 0.04-0.08 higher than Pr-LDA on the Yahoo! datasets, and achieves greater improvements on the larger RCV1-v2 dataset. In particular, SL-LDA outscores the state-of-the-art Dep-LDA by about 0.005-0.02. Regarding T-SVMs, the proposed model scores slightly lower on the RCV1-v2 dataset (similar results have been reported in [13]). However, our method outperforms T-SVMs on 3/4 metrics across the smaller Yahoo! datasets (an improvement of around 0.02 in Macro-F1 on Yahoo! Arts and 0.03 in Micro-F1 on Yahoo! Health). These results indicate that SL-LDA is competitive with state-of-the-art approaches.

Table 6 The Experiment results (averaged score± standard deviation) on Macro-F1 (top section) and Micro-F1 (bottom section); •/∘ means whether SL-LDA is statistically superior/inferior to the compared algorithm

6.4 Study on parameter \(\theta^{*}\)

We now study the effect of parameter \(\theta^{*}\) in the multi-label setting. Two datasets are used: 1) the Yahoo! Arts collection (small cardinality of 1.7); and 2) the RCV1-v2 collection (larger cardinality of about 3.1). As the different metrics exhibit similar trends, we only use the ranking loss to compare different values of \(\theta^{*}\) across the two datasets.

The experimental results for the Yahoo! Arts collection are shown in Fig. 4. It is clear that higher performance is obtained when \(\theta^{*} = 0.5\) or 0.6; larger values tend to degrade the performance. The trend for the Yahoo! Arts collection is very similar to that of the single-label corpus in Section 5. This is because Yahoo! Arts contains fewer labels, and about half of its documents are tagged with only a single label.

Fig. 4 The evaluation of parameter \(\theta^{*}\) on the Yahoo! Arts collection

The results for the RCV1-v2 collection (Fig. 5) are somewhat different, with larger values of \(\theta^{*}\) tending to improve performance. The peak score is obtained at \(\theta^{*} = 0.75\). Because documents in RCV1-v2 contain an average of around three labels, a higher weight proportion on the pre-assigned labels is needed to describe a labeled document.

Fig. 5 The evaluation of parameter \(\theta^{*}\) on the RCV1-v2 collection

These experimental results also conform to the discussion in Section 2.2. For multi-label settings, we further suggest that a small/large value of \(\theta^{*}\) be applied to datasets with small/large cardinality.

6.5 Running time

Finally, we compared the time efficiency of SL-LDA with that of two state-of-the-art approaches, i.e., Dep-LDA and T-SVMs, for multi-label classification. Pr-LDA was not chosen because it can be deemed a special case of Dep-LDA.

Figure 6 shows the running time of the different approaches on the Yahoo! Arts and RCV1-v2 collections. Clearly, SL-LDA is more efficient than Dep-LDA and T-SVMs. For the smaller Yahoo! Arts dataset, SL-LDA is slightly faster than Dep-LDA, but the efficiency gap widens on the RCV1-v2 dataset. This is because Dep-LDA adds a label layer to the model, and its computational complexity is sensitive to the number of labels. The slowness of T-SVMs is mainly due to the parameter optimization process. In contrast, SL-LDA has few parameters, and is therefore faster than T-SVMs in this step. Following our discussion on the choice of \(\theta^{*}\), we can quickly determine an appropriate value. Thus, we argue that SL-LDA is more efficient in practice than Dep-LDA and T-SVMs.

Fig. 6 The training time (seconds, log10 scale) for multi-label classification

7 Conclusion

In this paper, we have proposed a novel supervised model for document classification in both single-label and multi-label settings. Based on the L-LDA model, SL-LDA relaxes the restriction on label sampling, and extends the topic concept to capture lost labels and common semantics. These modifications significantly improve classification performance. We developed a Gibbs-EM algorithm for estimation and inference in our model. A series of evaluation experiments were conducted, and the results show that: (1) in single-label cases, SL-LDA outperforms LDA-SVM, sLDA, DiscLDA, and the state-of-the-art MedLDA in most instances; (2) in multi-label cases, SL-LDA significantly outperforms L-LDA and Pr-LDA, and, more importantly, is competitive with the state-of-the-art Dep-LDA and SVMs.

In the future, we intend to: (1) develop online approaches for large-scale multi-label corpora; (2) investigate collections that contain many labels; (3) apply our model to some other applications, e.g., summarization and filtering.