1 Introduction

Document clustering (Fritzke 1995; Lamirel 2012; Lamirel et al. 2015) and topic modeling (Hofmann 1999; Blei et al. 2003; Teh et al. 2006; Zhu and Xing 2011) are two widely studied areas with many practical applications. Document clustering aims to organize similar documents into the same groups and is a crucial building block in document organization, browsing, summarization, classification and retrieval. Topic modeling, on the other hand, develops probabilistic generative models to discover the latent semantics embedded in document collections and has shown great success in text modeling and analysis (Xie and Xing 2013). Recently, topic models have been widely used in many text mining applications, such as document retrieval, summarization, and clustering for different languages.

Probabilistic topic models such as Probabilistic Latent Semantic Analysis (PLSA) (Hofmann 1999), Latent Dirichlet Allocation (LDA) (Blei et al. 2003) and Hierarchical Dirichlet Processes (HDP) (Teh et al. 2006) were first developed to capture latent topics in large collections of textual documents. Zhu and Xing (2011) presented a non-probabilistic topic model called Sparse Topical Coding (STC). With STC, the topic-based document representations are “sparse”, i.e. each document is described by a small number of topics.

Topic modeling and document clustering are highly correlated and can have a mutually beneficial relationship. Topic models can enhance clustering by discovering the latent semantics embedded in document corpora. This semantic information can be more useful for identifying document groups (clusters) than the raw features. Using topic models, a document corpus is projected into a topic space, in which the noise in measuring similarity is reduced. Moreover, the grouping structure of the corpus can be identified more effectively (Xie and Xing 2013).

In a simple pipeline, document clustering and topic modeling can be performed separately: topic models project documents from the high-dimensional word space into a low-dimensional topic space, and a clustering algorithm such as K-means is then applied in the topic space (Xie and Xing 2013). However, performing document clustering and topic modeling separately prevents the two tasks from mutually promoting each other to achieve the best overall performance. For this reason, Wallach (2008) proposed the Cluster-based Topic Model (CTM), which integrates document clustering and topic modeling in a unified framework to jointly perform the two tasks. CTM generates a group indicator for each document and then samples the local topic distribution from the Dirichlet prior specific to the selected group. Based on CTM, Xie and Xing (2013) further introduced global topics to capture the global semantics and proposed the Multi-Grain Cluster Topic Model (MGCTM). For each word, MGCTM chooses between local and global topics and then generates the word from the chosen type of topic. Li et al. (2014) developed the Group Latent Dirichlet Allocation (GLDA), which is less complicated than MGCTM. GLDA samples the document-topic distributions from a Dirichlet prior that combines the selected group's local prior with the global prior. In the vision domain, Wang et al. (2009) proposed three hierarchical Bayesian models to simultaneously learn topic models and document clusters, namely the LDA mixture model, the HDP mixture model, and the Dual-HDP model.

As far as we know, all the research conducted so far on integrating document clustering into the topic modeling process is based on probabilistic topic models. However, a limitation of probabilistic topic models is the lack of a mechanism to directly control the sparsity of the inferred representations in the semantic space, which is desirable in text modeling applications (Xie and Xing 2013). In this paper, we develop an extension of Sparse Topical Coding that integrates document clustering and non-probabilistic topic modeling. We refer to this method as Cluster-based Sparse Topical Coding (CSTC) and employ it for text clustering. In the latent semantic space derived by STC, each document is represented by a linear combination of the basic topics. This representation of documents in the topic space can be used as the input features of a K-means clustering procedure. In the CSTC method, the latent variables of cluster membership, document-topic distribution and topics are jointly inferred, so clustering and modeling are seamlessly connected and mutually promoted. Although we propose CSTC for clustering text documents, it can also be applied to clustering non-textual data. The major contribution of this paper is a unified sparse topical coding framework that integrates document clustering and topic modeling, together with a demonstration of its advantages in text clustering applications. The remainder of the paper is organized as follows: Sect. 2 describes STC and elaborates the CSTC method for document clustering. The experimental results are presented in Sect. 3. Finally, Sect. 4 concludes the paper.

2 Proposed method

In this section, we introduce CSTC, in which the latent variables of cluster membership, document-topic distribution and topics are jointly inferred. A brief introduction to this method was presented in Ahmadi et al. (2015). The main idea of CSTC is that document clustering and topic modeling can be integrated in a unified framework so that the two tasks mutually benefit each other and achieve higher overall performance.

2.1 Notations

Suppose a collection of D documents \(\{ \mathbf{w }_{1},\ldots ,\mathbf{w }_{d},\ldots ,\mathbf{w }_{D}\}\) is given, where each document contains words from a vocabulary v (labeled by indexes) of size N. The dth document, \(\mathbf{w }_{d}\), is represented by an \(|\textit{I}_{d}|\)-dimensional vector \(\mathbf{w }_{d}=\{ w_{d,1},\ldots ,w_{d,|I_{d}|} \}\), where \(I_{d}\) is the index set of words that appear in \(\mathbf{w }_{d}\) and the nth entry, \(w_{d,n }(n\in I_{d})\), denotes the number of occurrences of the nth word in the dth document. In topic modeling, a “topic” consists of a group of words that frequently occur together, and a “dictionary” refers to the set of all topics. Let \(\varvec{\beta } =[ \varvec{\beta }_{kn}]\in {\mathbb {R}}^{K\times N}\), called the dictionary, be a matrix with K rows, where each row is assumed to be a topic. We use \(\varvec{\beta }_{kn}\) to denote the element of \(\varvec{\beta } \) in the kth row and nth column, \(\varvec{\beta }_{.n}\) to denote the nth column of \(\varvec{\beta } \) and \(\varvec{\beta }_{k.}\) to denote the kth row of \(\varvec{\beta } \). The kth row of \(\varvec{\beta } \), \(\varvec{\beta }_{k.}\), represents the kth topic of the dictionary. If P is the (N-1)-simplex, then we can regard \(\varvec{\beta } _{k.}\) as a distribution over v on this simplex; in other words, \(\varvec{\beta }_{k.}\) is a non-negative N-dimensional vector whose elements sum to one. For the dth document \(\mathbf{w }_{d}\), STC projects (maps) \(\mathbf{w }_{d}\) from the low-level space of words into a high-level semantic space spanned by a set of automatically learned topics \(\{\varvec{\beta }_{k.} \}_{k=1}^{K}\) and obtains a joint high-level representation of the entire document. STC is a hierarchical latent variable model, where \({{\varvec{\theta }}}_{d}=( {{\varvec{\theta }}} _{d,1},\ldots , {{\varvec{\theta }}}_{d,k},\ldots , {{\varvec{\theta }}}_{d,K})\in {\mathbb {R}}^{K\times 1}\) is called the document code of the dth document and \({{\varvec{s}}}_{d,n}=({{\varvec{s}}}_{d,n;1},\ldots ,{{\varvec{s}}}_{d,n;k},\ldots ,{{\varvec{s}}}_{d,n;K})\in {\mathbb {R}}^{K\times 1}\) is the word code of the nth word in the dth document. For clarity, Table 1 presents a summary of the key notations used in this paper.

Table 1 Notations of variables

2.2 Sparse topical coding

In probabilistic topic models, each document is represented by a normalized distribution over topics, and each topic is described by a normalized distribution over words. STC relaxes the normalization constraints made in probabilistic topic models. These relaxations lead to nice properties, such as direct control over the sparsity of the discovered representations, an efficient learning algorithm, and seamless integration with a convex loss function for learning predictive latent representations (Zhu and Xing 2011). In STC, each individual input feature (e.g., a word count) is reconstructed from a linear combination of topics, and the coefficient vectors (or codes) are un-normalized. In addition, the representation of a given document is derived via an aggregation strategy (e.g., truncated averaging) over the codes of all individual features extracted from that document.

STC assumes that, for the dth document, the word codes \({{\varvec{s}}}_{d,n}\) are conditionally independent given its document code \({{\varvec{\theta }}}_{d}\), and that the observed word counts are independent given their latent representations \({{\varvec{s}}}_{d,n}\). With these conditional independence assumptions, the generative procedure can be summarized as follows:

  1. Sample a dictionary \(\varvec{\beta } \) from a prior distribution \(p(\varvec{\beta } )\).

  2. For the dth document (\(d\in \{ 1,\ldots ,D\}\)):

     (a) Sample the document code \({{\varvec{\theta }}}_{d}\) from a prior distribution \(p({{\varvec{\theta }}}_{d})\).

     (b) For the nth observed word \((n\in I_{d})\):

        i. Sample the word code \({{\varvec{s}}}_{d,n}\) from a conditional distribution \(p({{\varvec{s}}}_{d,n}|{{\varvec{\theta }}}_{d})\).

        ii. Sample the observed word count \(w_{d,n}\) from a conditional distribution \(p(w_{d,n}|{{\varvec{s}}}_{d,n}, \varvec{\beta }_{.n})\).

According to the generative procedure, a joint probability distribution for \({{\varvec{\theta }}}, {{\varvec{s}}}, w, \varvec{\beta } \) can be defined as follows:

$$\begin{aligned} p({\varvec{\theta } },{{\varvec{s}}},{\mathbf {w}},{\varvec{\beta } })= & {} p({\varvec{\beta } })p({\varvec{\theta } },{{\varvec{s}}},\mathbf {w\vert }{\varvec{\beta } })\nonumber \\= & {} p({\varvec{\beta } })\prod \nolimits _{d\in D} p({\varvec{\theta } }_{d} )\prod \nolimits _{n\in I_{d} } p({{\varvec{s}}}_{d,n} \vert {\varvec{\theta } }_{d} )p(w_{d,n} \vert {{\varvec{s}}}_{d,n} ,{\varvec{\beta } }_{.n} ) \end{aligned}$$
(1)

In the STC model, the dictionary \(\varvec{\beta } \) is sampled from a uniform distribution. STC assumes that the discrete word counts in the dth document obey a Poisson distribution with \({{\varvec{s}}}_{d,n}^{T} {\varvec{\beta } }_{.n}\) as the mean parameter, i.e.,

$$\begin{aligned} p(w_{d,n} \vert {{\varvec{s}}}_{d,n} ,{\varvec{\beta } }_{.n} )=\frac{\left( {{\varvec{s}}}_{d,n}^{T} {\varvec{\beta } }_{.n}\right) ^{w_{d,n} }\exp \left( -{{\varvec{s}}}_{d,n}^{T} {\varvec{\beta } }_{.n}\right) }{w_{d,n} !} \end{aligned}$$
(2)

In order to achieve sparse representations for document codes \({{\varvec{\theta }}}\) and word codes s, STC chooses the Laplace prior and the super-Gaussian distribution (Hyvarinen 1999), as:

$$\begin{aligned}&\displaystyle p({\varvec{\theta } }_{d} )\propto \exp (-\lambda _{1} \left\| {{\varvec{\theta } }_{d} } \right\| _{1} ) \end{aligned}$$
(3)
$$\begin{aligned}&\displaystyle p({{\varvec{s}}}_{d,n} \vert {\varvec{\theta } }_{d} )\propto \exp \left( -\lambda _{2} \left\| {{{\varvec{s}}}_{d,n} } \right\| _{1} -\lambda _{3} \left\| {{{\varvec{s}}}_{d,n} -{\varvec{\theta } }_{d} } \right\| _{2}^{2} \right) \end{aligned}$$
(4)

Let \(\varvec{\Theta }=\{{\varvec{\theta } }_{d} ,{{\varvec{s}}}_{d} \}_{d=1}^{D} \) denote the codes for a collection of documents \(\left\{ {\mathrm{\mathbf{w}}_{d} } \right\} _{d=1}^{D} \). STC solves the optimization problem:

$$\begin{aligned}&\min \limits _{{\varvec{\Theta },\varvec{\beta }}} \mathop \sum \limits _{d=1}^{D} \sum \limits _{n=1}^{\left| {I_{d} } \right| } {\left( {{{\varvec{s}}}_{d,n}^{T} {\varvec{\beta } }_{.n} -w_{d,n} \ln \left( {{\varvec{s}}}_{d,n}^{T} {\varvec{\beta } }_{.n} \right) } \right) \,} +\lambda _{1} \sum \limits _{d=1}^D {\left\| {{\varvec{\theta } }_{d} } \right\| _{1} } \nonumber \\&\quad +\lambda _{2} \sum \limits _{d=1}^D {\sum \limits _{n=1}^{\left| {I_{d} } \right| } {\left\| {{{\varvec{s}}}_{d,n} } \right\| _{1} } } +\lambda _{3} \sum \limits _{d=1}^D {\sum \limits _{n=1}^{\left| {I_{d} } \right| } {\left\| {{{\varvec{s}}}_{d,n} -{\varvec{\theta } }_{d} } \right\| _{2}^{2} } } \nonumber \\&\quad \textit{s.t.}\,\,\,\,\,\,\,\,{\varvec{\theta } }_{d} \ge 0, \forall d;\,\,\,{{\varvec{s}}}_{d,n} \ge 0,\forall d,n; {\varvec{\beta } }_{k.} \in P,\forall k \end{aligned}$$
(5)

where \(\lambda _{1}, \lambda _{{2}}, \lambda _{{3}}\) are non-negative hyper-parameters set by the user. The \(\hbox {L}_{1}\)-norm biases the solution towards sparse codes \({{\varvec{\theta }}}\) and s. Minimizing the log-Poisson loss in the first part of (5) is equivalent to minimizing an un-normalized KL-divergence between the observed word counts \(w_{d,n}\) and their reconstructions \({{\varvec{s}}}_{d,n}^{T} {\varvec{\beta } }_{.n}\) (Zhu and Xing 2011).
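As a concrete illustration of how the terms in (5) combine, the following is a minimal NumPy sketch (not the authors' implementation) that evaluates the STC objective for given codes and a given dictionary; the dense array shapes and the small `eps` guard for the logarithm are assumptions made for clarity.

```python
import numpy as np

def stc_objective(W, S, theta, beta, lam1, lam2, lam3, eps=1e-12):
    """Value of the STC objective in Eq. (5) for given codes and dictionary.
    W: (D, N) word counts, S: (D, N, K) word codes, theta: (D, K) document
    codes, beta: (K, N) dictionary. eps guards the log for zero reconstructions."""
    recon = np.einsum('dnk,kn->dn', S, beta)      # s_{d,n}^T beta_{.n}
    obs = W > 0                                   # only words that occur, n in I_d
    loss = np.sum(recon[obs] - W[obs] * np.log(recon[obs] + eps))  # log-Poisson loss
    loss += lam1 * np.abs(theta).sum()            # L1 on document codes
    loss += lam2 * np.abs(S[obs]).sum()           # L1 on word codes
    loss += lam3 * np.sum((S - theta[:, None, :])[obs] ** 2)       # code coupling
    return loss
```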

2.3 Cluster-based sparse topical coding

Topic models are flexible tools for building predictive models. The motivation behind such models is that the given documents may contain unobserved groups with different topics. By incorporating this group structure into the topic model, we can obtain more accurate predictions. We are also interested in uncovering the group structure itself, which is essentially a clustering problem. For example, when modeling textual documents it is useful to understand the types of documents based on their topics.

To address the clustering problem, we extend STC to CSTC. In CSTC, closely related documents are clustered together as latent groups (clusters). Each document first selects a group and then generates its topic distribution with respect to the selected group. Finally, it samples words from the corresponding topic-word distributions. In CSTC, we assume that there is an L-dimensional distribution, p(l), that generates the group indicator \(l=1,\ldots ,L\) for the documents. Therefore, the generation process for the dth document can be formulated as follows: a group indicator \(l_{d}\) is chosen from the distribution \(p(l_{d})\); the document-topic distribution \({\varvec{\theta }}_{d}\) is then sampled with respect to the selected group \(l_{d}\); the words are then generated as in STC. As shown in Fig. 1, the generative process of the CSTC method is as follows:

  1. Sample a dictionary \(\varvec{\beta } \) from a prior distribution \(p(\varvec{\beta })\).

  2. For the dth document (\(d \in \{1,\ldots ,D\}\)):

     (a) Sample a cluster \(l_{d}\) from a prior distribution \(p(l_{d})\).

     (b) Sample the document code \({\varvec{\theta }}_{d}\) from a prior distribution \(p({\varvec{\theta }}_{d} | l_{d})\).

     (c) For the nth observed word (\(n\in I_{d}\)):

        i. Sample the word code \({{\varvec{s}}}_{d,n}\) from a conditional distribution \(p({{\varvec{s}}}_{d,n}|\varvec{\theta }_{d})\).

        ii. Sample the observed word count \(w_{d,n}\) from a conditional distribution \(p(w_{d,n}|{{\varvec{s}}}_{d,n}, \varvec{\beta }_{.n})\).

Fig. 1 A graphical representation for the CSTC

According to the generative procedure, a joint probability distribution can be defined as follows:

$$\begin{aligned} p({\varvec{\theta } },{{\varvec{s}}},{\mathbf {w}},{\varvec{\beta } ,}l)=p({\varvec{\beta } })\prod \nolimits _{d\in D} p(l_{d} )p({\varvec{\theta } }_{d} \vert l_{d} )\prod \nolimits _{n\in I_{d} } p({{\varvec{s}}}_{d,n} \vert {\varvec{\theta } }_{d} )p(w_{d,n} \vert {{\varvec{s}}}_{d,n} ,{\varvec{\beta } }_{.n} )\nonumber \\ \end{aligned}$$
(6)

In CSTC, as in STC, the dictionary \(\varvec{\beta } \) is sampled from a uniform distribution, the discrete word counts follow a Poisson distribution, and the word codes \({{\varvec{s}}}_{d,n}\) are sampled from the same conditional distribution as in (4). For the document codes \({\varvec{\theta }} \), however, CSTC chooses the super-Gaussian distribution:

$$\begin{aligned} p({\varvec{\theta } }_{d} \vert l_{d} )\propto \exp \left( -\lambda _{1} \left\| {{\varvec{\theta } }_{d} } \right\| _{1} -\lambda _{4} \left\| {{\varvec{\theta } }_{d} -\varvec{\mu }_{l_{d} } } \right\| _{2}^{2} \right) \end{aligned}$$
(7)

where \(\varvec{\mu }_{l_{d} }\) is the mean of \({\varvec{\theta }}\)’s from the cluster \(l_{d}\).

The CSTC formulation for document clustering solves the following optimization problem:

$$\begin{aligned} \min \limits _{\varvec{\Theta },\varvec{\beta }} f({{\varvec{\Theta }} };\varvec{\beta } )= & {} \min \limits _{\varvec{\Theta },{\varvec{\beta } }} \mathop \sum \limits _{d=1}^{D} \sum \limits _{n=1}^{\left| {I_{d} } \right| } {\left( w_{d,n} -{{\varvec{s}}}_{d,n}^{T} {\varvec{\beta } }_{.n}\right) ^{2}} +\lambda _{1} \sum \limits _{d=1}^D {\left\| {{\varvec{\theta } }_{d} } \right\| _{1} } \nonumber \\&+\,\lambda _{2} \sum \limits _{d=1}^D {\sum \limits _{n=1}^{\left| {I_{d} } \right| } {\left\| {{{\varvec{s}}}_{d,n} } \right\| _{1} } } +\lambda _{3} \sum \limits _{d=1}^D {\sum \limits _{n=1}^{\left| {I_{d} } \right| } {\left\| {{{\varvec{s}}}_{d,n} -{\varvec{\theta } }_{d} } \right\| _{2}^{2} } } \nonumber \\&+\,\lambda _{4} \sum \limits _{d=1}^D {\left\| {{\varvec{\theta } }_{d} -\varvec{\mu }_{l_{d} } } \right\| _{2}^{2} } \,\,\,\,\,\,\,\,\textit{s.t}.\,\,\,\,\,\,\,\,{\varvec{\theta } }_{d} \ge 0, \forall d;\nonumber \\&{{\varvec{s}}}_{d,n} \ge 0,\forall {{d}},{{n}}; \,\,\,{\varvec{\beta } }\ge 0,\,\mathop \sum \limits _{n=1}^{N} {\varvec{\beta } }_{kn} =1,\quad \forall k \end{aligned}$$
(8)

where \(f({\varvec{\Theta }} ; \varvec{\beta } )\) denotes the objective function of the optimization problem in (8) and \(\lambda _{1},\ldots ,\lambda _{{4}}\) are non-negative hyper-parameters that must be set by the user. Note that s, \({\varvec{\theta }} \) and \(\varvec{\beta } \) must also be non-negative; in other words, the elements of s, \({\varvec{\theta }} \) and \(\varvec{\beta } \) are non-negative real numbers. In the CSTC formulation, unlike STC, which minimizes a KL-divergence, we minimize the squared \(\hbox {L}_{{2}}\)-norm of the reconstruction error, which leads to simpler mathematical forms. Figure 2 shows another graphical representation of the CSTC method.

Fig. 2 A graphical representation of our proposed CSTC method for document clustering

We have:

$$\begin{aligned} \sum \limits _{d=1}^D {\left\| {{\varvec{\theta } }_{d} -\varvec{\mu }_{l_{d} } } \right\| _{2}^{2} } =\sum \limits _{l=1}^L {\sum \limits _{d\in C_{l} } {\left\| {{\varvec{\theta } }_{d} -\varvec{\mu }_{l} } \right\| _{2}^{2} } } \,\, \end{aligned}$$
(9)

where \({\varvec{\mu }}_{l }\) is the center of the lth cluster, \(C_{l}\), of the document codes \({\varvec{\theta }}\). Therefore, CSTC can be seen as STC integrated with K-means clustering in a unified framework. This integration encourages the clustering to put documents with similar sparse representations in the same cluster. The document code \({\varvec{\theta }} \) can be regarded as the representation of the corresponding document in the topic space; thus, we can treat \({\varvec{\theta }} \) as the input to a K-means clustering process. The centroids of the L clusters, \({\varvec{\mu }}_{l}, l= 1,\ldots ,L\), can be treated as hidden variables associated with the sparse document codes \({\varvec{\theta }} \). An intuitive interpretation of the K-means clustering term is that the document codes \({\varvec{\theta }}\) are pulled towards their cluster centroids \({\varvec{\mu }}_{l}\).
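To make the role of the added K-means term explicit, the following is a minimal NumPy sketch (under the same assumed dense array shapes as before, and not the authors' implementation) that evaluates the CSTC objective \(f({\varvec{\Theta }};\varvec{\beta })\) of Eq. (8) for given codes, dictionary, cluster labels and centroids.

```python
import numpy as np

def cstc_objective(W, S, theta, beta, labels, centroids, lams):
    """Value of the CSTC objective f(Theta; beta) in Eq. (8).
    W: (D, N) counts, S: (D, N, K) word codes, theta: (D, K) document codes,
    beta: (K, N) dictionary, labels: (D,) cluster indices, centroids: (L, K)."""
    lam1, lam2, lam3, lam4 = lams
    recon = np.einsum('dnk,kn->dn', S, beta)          # s_{d,n}^T beta_{.n}
    obs = W > 0                                        # observed words only
    loss = np.sum((W[obs] - recon[obs]) ** 2)          # squared reconstruction error
    loss += lam1 * np.abs(theta).sum()                 # L1 on document codes
    loss += lam2 * np.abs(S[obs]).sum()                # L1 on word codes
    loss += lam3 * np.sum((S - theta[:, None, :])[obs] ** 2)
    loss += lam4 * np.sum((theta - centroids[labels]) ** 2)   # K-means term (Eq. 9)
    return loss
```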

Fig. 3 Training phase

The objective function \(f({\varvec{\Theta }} ; \varvec{\beta } )\) is bi-convex, i.e. f is convex with respect to either \(\varvec{\Theta } \) or \(\varvec{\beta } \) when the other is fixed. A typical solution to this problem is the coordinate descent algorithm (Lee et al. 2006), which alternately optimizes \(\varvec{\Theta } \) and \(\varvec{\beta } \), as shown in Fig. 3 (Algorithm 1). The procedure of finding the sparse codes (the document codes \({\varvec{\theta }} \) and the word codes s) via the optimization over \(\varvec{\Theta }=\{{\varvec{\theta } }_{d} ,{{\varvec{s}}}_{d} \}_{d=1}^{D} \) specified in (8) is called “sparse coding”. In sparse coding, we find the sparsest representations of the documents and words in the current dictionary according to the \(\hbox {L}_{1}\)-norms of \({\varvec{\theta }} \) and s. The procedure of finding the dictionary via the same optimization over \(\varvec{\beta } \) is called “dictionary learning”. After learning the dictionary of topics \(\varvec{\beta } \) and finding the cluster centroids \(\left\{ {\varvec{\mu }_{c} } \right\} _{c=1}^{L}\) in the training phase, the cluster of a test document can be determined according to Fig. 4 (Algorithm 2).

Fig. 4 Test phase

2.3.1 Sparse coding

The sparse coding step aims to find the codes \(\varvec{\Theta } \) when \(\varvec{\beta } \) is fixed. The procedure of finding the word codes \({{\varvec{s}}}_{d,n}\) via optimization over s, as specified in (8), is called “word coding” and the procedure of finding the document codes \({\varvec{\theta }}_{d}\) via the same optimization over \({\varvec{\theta }} \) is called “document coding”.

  • Word coding

When \({\varvec{\theta }} \) is fixed, s is obtained by solving the optimization problem:

$$\begin{aligned}&\min \limits _{{{\varvec{s}}}} \mathop \sum \limits _{d=1}^{D} \sum \limits _{n=1}^{\left| {I_{d} } \right| } {\left( w_{d,n} -{{\varvec{s}}}_{d,n}^{T} {\varvec{\beta } }_{.n} \right) ^{2}} +\lambda _{2} \sum \limits _{d=1}^D {\sum \limits _{n=1}^{\left| {I_{d} } \right| } {\left\| {{{\varvec{s}}}_{d,n} } \right\| _{1} } } \nonumber \\&\quad +\lambda _{3} \sum \limits _{d=1}^D {\sum \limits _{n=1}^{\left| {I_{d} } \right| } {\left\| {{{\varvec{s}}}_{d,n} -{\varvec{\theta } }_{d} } \right\| _{2}^{2} } } \,\,\,\,\,\,\,\,\textit{s.t.} \,\,\,\,{{\varvec{s}}}_{d,n} \ge 0,\forall d,n \end{aligned}$$
(10)

Due to the conditional independence between the elements of s, we can perform this step for each document and each word separately by solving the optimization problem:

$$\begin{aligned} \min \limits _{{{\varvec{s}}}_{n} } \left( w_{n} -{{\varvec{s}}}_{n}^{T} {\varvec{\beta } }_{.n} \right) ^{2}+\lambda _{2} \left\| {{{\varvec{s}}}_{n} } \right\| _{1} +\lambda _{3} \left\| {{{\varvec{s}}}_{n} -{\varvec{\theta } }} \right\| _{2}^{2} \,\,\,\,\,\,\,\,\textit{s.t.} \,\,\,\,{{\varvec{s}}}_{n} \ge 0,\forall n \end{aligned}$$
(11)

Due to the non-negativity constraint on \({{\varvec{s}}}_{n}=({{\varvec{s}}}_{n;1},\ldots ,{{\varvec{s}}}_{n;k},\ldots ,{{\varvec{s}}}_{n;K})\), the \(\hbox {L}_{1}\)-norm can be written as:

$$\begin{aligned} \left\| {{{\varvec{s}}}_{n} } \right\| _{1} =\sum \limits _{k=1}^K {\left| {{{\varvec{s}}}_{n;k} } \right| } =\sum \limits _{k=1}^K {{{\varvec{s}}}_{n;k} } \end{aligned}$$
(12)

Substituting (12) in (11), each \({{\varvec{s}}}_{n;k}\) can be calculated iteratively via:

$$\begin{aligned} {{\varvec{s}}}_{n;k} =\mathop {\arg \min }\limits _{{{\varvec{s}}}_{n;k} } \left\{ \left( w_{n} -{{\varvec{s}}}_{n}^{T} {\varvec{\beta } }_{.n} \right) ^{2}+\lambda _{2} \sum \nolimits _{l=1}^K {{{\varvec{s}}}_{n;l} } +\lambda _{3} \left\| {{{\varvec{s}}}_{n} -{\varvec{\theta } }} \right\| _{2}^{2} \right\} \end{aligned}$$
(13)

Setting the gradient to zero yields:

$$\begin{aligned} {{\varvec{s}}}_{n;k} =\frac{w_{n} {\varvec{\beta } }_{kn} -{\varvec{\beta } }_{kn} \sum \nolimits _{l=1,l\ne k}^K {{{\varvec{s}}}_{n;l} {\varvec{\beta } }_{ln} } -0.5\lambda _{2} +\lambda _{3} {\varvec{\theta } }_{k} }{{\varvec{\beta } }_{kn}^{2} +\lambda _{3} } \end{aligned}$$
(14)

The non-negativity constraint on s is imposed according to Proposition 1 in Zhu and Xing (2011) as \({{\varvec{s}}}_{n;k} \leftarrow \max ({{\varvec{s}}}_{n;k} ,0) \). Proposition 1 provides a way to impose the non-negativity constraint in convex optimization problems: if h(x) is a strictly convex function, the optimal solution \(x^{*}\) of the constrained problem \(P_{0} :\text{ min }_{x\ge 0} h( x)\) is \(x^{*} =\text{ max }({0,x_{0} })\), where \(x_{0}\) is the solution of the unconstrained problem \(P_{1} :\text{ min }_{x} h(x)\).
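The coordinate update (14) together with the clamp of Proposition 1 can be sketched as follows; the function name `word_code_update` and the dense vector arguments are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def word_code_update(s_n, w_dn, beta_col, theta_d, lam2, lam3):
    """One coordinate-descent sweep over the entries of a word code s_{d,n},
    using the closed-form update of Eq. (14) followed by the non-negativity
    clamp of Proposition 1 (Zhu and Xing 2011). Assumes lam3 > 0.
    s_n: (K,) current word code, beta_col: (K,) column beta_{.n},
    theta_d: (K,) document code, w_dn: observed count."""
    s = s_n.copy()
    for k in range(s.shape[0]):
        rest = s @ beta_col - s[k] * beta_col[k]      # sum_{l != k} s_{n;l} beta_{ln}
        num = (w_dn * beta_col[k] - beta_col[k] * rest
               - 0.5 * lam2 + lam3 * theta_d[k])
        s[k] = max(num / (beta_col[k] ** 2 + lam3), 0.0)
    return s
```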

  • Document coding

When s is fixed, \({\varvec{\theta }} \) is obtained by solving the optimization problem:

$$\begin{aligned}&\min \limits _{{\varvec{\theta } }} \lambda _{1} \sum \limits _{d=1}^D {\left\| {{\varvec{\theta } }_{d} } \right\| _{1} } +\lambda _{3} \sum \limits _{d=1}^D {\sum \limits _{n=1}^{\left| {I_{d} } \right| } {\left\| {{{\varvec{s}}}_{d,n} -{\varvec{\theta } }_{d} } \right\| _{2}^{2} } } \nonumber \\&\quad +\lambda _{4} \sum \limits _{l=1}^L {\sum \limits _{d\in C_{l} } {\left\| {{\varvec{\theta } }_{d} -\varvec{\mu }_{l} } \right\| _{2}^{2} } } \,\,\,\,\,\,\,\,\textit{s.t}. \,\,\,\,{\varvec{\theta } }_{d} \ge 0, \forall d \end{aligned}$$
(15)

The optimization problem (15) can be further simplified for a specific document and its selected cluster l as:

$$\begin{aligned} \min \limits _{{\varvec{\theta } }} \lambda _{1} \left\| {{\varvec{\theta } }} \right\| _{1} +\lambda _{3} \sum \limits _{n=1}^{\left| I \right| } {\left\| {{{\varvec{s}}}_{n} -{\varvec{\theta } }} \right\| _{2}^{2} } +\lambda _{4} \left\| {{\varvec{\theta } }-\varvec{\mu }_{l} } \right\| _{2}^{2} \,\,\,\,\,\,\,\,\textit{s.t}. \,\,\,\,{\varvec{\theta } }\ge 0 \end{aligned}$$
(16)

Due to the non-negativity constraint on \({\varvec{\theta }} =( \varvec{\theta }_{1},\ldots , \varvec{\theta }_{k},\ldots , \varvec{\theta }_{K})\), the \(\hbox {L}_{1}\)-norm can be simplified as:

$$\begin{aligned} \left\| {{\varvec{\theta } }} \right\| _{1} =\sum \limits _{k=1}^K {\left| {{\varvec{\theta } }_{k} } \right| } =\sum \limits _{k=1}^K {{\varvec{\theta } }_{k} } \end{aligned}$$
(17)

Since different dimensions of \({\varvec{\theta }} \) are not coupled to each other, each \({\varvec{\theta }}_{k}\) can be calculated separately. Substituting (17) in (16), each \({\varvec{\theta }}_{k}\) is obtained by solving:

$$\begin{aligned} {\varvec{\theta } }_{k} =\mathop {\arg \min }\limits _{{\varvec{\theta } }_{k} } \left\{ \lambda _{1} {\varvec{\theta } }_{k} +\lambda _{3} \sum \limits _{n=1}^{\left| I \right| } {({{\varvec{s}}}_{n;k} -{\varvec{\theta } }_{k} )^{2}} +\lambda _{4} ({\varvec{\theta } }_{k} -\varvec{\mu }_{l,k} )^{2}\right\} \end{aligned}$$
(18)

Setting the gradient to zero yields:

$$\begin{aligned} {\varvec{\theta } }_{k} =\frac{\lambda _{3} \sum \nolimits _{n=1}^{|I|} {{{\varvec{s}}}_{n;k} } +\lambda _{4} \varvec{\mu }_{l,k} -0.5\lambda _{1} }{\lambda _{3} \left| I \right| +\lambda _{4} } \end{aligned}$$
(19)

The non-negativity constraint on \({\varvec{\theta }} \) is imposed according to Proposition 1 in Zhu and Xing (2011) as \({\varvec{\theta } }_{k} \leftarrow \max ({\varvec{\theta } }_{k} ,0)\).
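Analogously, a minimal sketch of the closed-form document-code update (19) with the non-negativity clamp is given below; the function name `document_code_update` and the argument layout are assumptions made for illustration.

```python
import numpy as np

def document_code_update(S_obs, mu_l, lam1, lam3, lam4):
    """Closed-form update of Eq. (19) for a document code theta_d, followed by
    the non-negativity clamp theta_k <- max(theta_k, 0).
    S_obs: (|I_d|, K) word codes of the words observed in the document;
    mu_l:  (K,) centroid of the document's current cluster."""
    n_obs = S_obs.shape[0]                       # |I_d|
    theta = (lam3 * S_obs.sum(axis=0) + lam4 * mu_l - 0.5 * lam1) \
            / (lam3 * n_obs + lam4)
    return np.maximum(theta, 0.0)
```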

2.3.2 Dictionary learning

The dictionary learning step aims to find \(\varvec{\beta } \) when \(\varvec{\Theta }\) is fixed. After finding all document codes and word codes of the collection, the dictionary \(\varvec{\beta } \) is updated by minimizing:

$$\begin{aligned} \min \limits _{{\varvec{\beta } }} \mathop \sum \limits _{d=1}^{D} \sum \limits _{n=1}^{\left| {I_{d} } \right| } {\left( w_{d,n} -{{\varvec{s}}}_{d,n}^{T} {\varvec{\beta } }_{.n} \right) ^{2}} \,\,\,\,\,\,\,\,\textit{s.t}.\,\,\,\,{\varvec{\beta } }\ge 0,\,\mathop \sum \limits _{n=1}^{N} {\varvec{\beta } }_{kn} =1,\forall k \end{aligned}$$
(20)

By assigning zero values to \(w_{d,n }\) and \({{\varvec{s}}}_{d,n}\) for \(n\notin I_{d}\), we can replace \(| I_{d}|\) with N. Then (20) can be written as:

$$\begin{aligned} \sum \limits _{n=1}^N \left( \sum \limits _{d=1}^D {\left( w_{d,n} -{{\varvec{s}}}_{d,n}^{T} {\varvec{\beta } }_{.n} \right) ^{2}} \right) \, {\textit{s.t.}}\quad \varvec{\beta }\ge 0, \sum \limits _{n=1}^N \varvec{\beta }_{kn}=1, \forall k. \end{aligned}$$
(21)

For the nth column of \(\varvec{\beta } , n=1,\ldots ,N\), we have:

$$\begin{aligned} {\varvec{\beta } }_{.n} =\mathop {\arg \min }\limits _{\underline{{\varvec{\beta } }}} \left\{ \sum \limits _{d=1}^D {\left( w_{d,n} -{{\varvec{s}}}_{d,n}^{T} \underline{{\varvec{\beta } }}\right) ^{2}} \right\} . \end{aligned}$$
(22)

Equation (22) is a convex optimization problem that can be efficiently solved by setting the gradient to zero. Thus, the nth column of \(\varvec{\beta } \) is calculated as:

$$\begin{aligned} {\varvec{\beta } }_{.n} =\left( {\sum \limits _{d=1}^D {{{\varvec{s}}}_{d,n} {{\varvec{s}}}_{d,n}^{T} } } \right) ^{-1}\left( {\sum \limits _{d=1}^D {w_{d,n} {{\varvec{s}}}_{d,n} } } \right) . \end{aligned}$$
(23)
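Putting the pieces together, the following sketch implements the column-wise dictionary update (23) and the overall alternating procedure of Algorithm 1 (Fig. 3), together with the test-phase assignment of Algorithm 2 (Fig. 4). It reuses the `word_code_update` and `document_code_update` functions sketched above; the dense array layout, the random initialization, the small ridge term, and the re-projection of the dictionary rows onto the simplex after the unconstrained update are simplifying assumptions, not details given in the paper.

```python
import numpy as np

def dictionary_column_update(S_col, w_col, ridge=1e-8):
    """Closed-form update of Eq. (23) for one dictionary column beta_{.n}.
    S_col: (D, K) word codes s_{d,n}; w_col: (D,) counts w_{d,n}.
    The ridge term is an added numerical safeguard, not part of the derivation."""
    A = S_col.T @ S_col                       # sum_d s_{d,n} s_{d,n}^T
    b = S_col.T @ w_col                       # sum_d w_{d,n} s_{d,n}
    return np.linalg.solve(A + ridge * np.eye(A.shape[0]), b)

def fit_cstc(W, K, L, lams, n_outer=20):
    """Alternating minimisation of Eq. (8) (Algorithm 1): sparse coding,
    a K-means step on the document codes, then dictionary learning.
    W: (D, N) dense count matrix; lams = (lam1, lam2, lam3, lam4)."""
    lam1, lam2, lam3, lam4 = lams
    D, N = W.shape
    rng = np.random.default_rng(0)
    beta = rng.random((K, N))
    beta /= beta.sum(axis=1, keepdims=True)           # rows on the simplex
    S = np.zeros((D, N, K))
    theta = rng.random((D, K))
    mu = theta[rng.choice(D, size=L, replace=False)].copy()
    labels = np.zeros(D, dtype=int)
    for _ in range(n_outer):
        # ---- sparse coding: word codes and document codes ----
        for d in range(D):
            obs = np.nonzero(W[d])[0]                 # index set I_d
            for n in obs:
                S[d, n] = word_code_update(S[d, n], W[d, n], beta[:, n],
                                           theta[d], lam2, lam3)
            theta[d] = document_code_update(S[d, obs], mu[labels[d]],
                                            lam1, lam3, lam4)
        # ---- K-means step: assign clusters, update centroids ----
        dists = ((theta[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for l in range(L):
            if np.any(labels == l):
                mu[l] = theta[labels == l].mean(axis=0)
        # ---- dictionary learning: column-wise closed form, Eq. (23) ----
        for n in range(N):
            beta[:, n] = np.maximum(
                dictionary_column_update(S[:, n, :], W[:, n]), 0.0)
        beta /= beta.sum(axis=1, keepdims=True) + 1e-12   # re-project rows (heuristic)
    return theta, beta, mu, labels

def predict_cluster(w_test, beta, mu, lams, n_iter=10):
    """Test phase (Algorithm 2): infer the code of an unseen document with the
    dictionary fixed, then assign it to the nearest centroid."""
    lam1, lam2, lam3, lam4 = lams
    K = beta.shape[0]
    obs = np.nonzero(w_test)[0]
    S = np.zeros((len(obs), K))
    theta = np.zeros(K)
    label = 0
    for _ in range(n_iter):
        for i, n in enumerate(obs):
            S[i] = word_code_update(S[i], w_test[n], beta[:, n],
                                    theta, lam2, lam3)
        theta = document_code_update(S, mu[label], lam1, lam3, lam4)
        label = int(((theta - mu) ** 2).sum(axis=1).argmin())
    return label, theta
```

The sketch favours readability over efficiency; for corpora of the size used in Sect. 3, a sparse representation of W and vectorized updates would be needed in practice.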

3 Experimental results

In this section, an experimental evaluation is performed on two popular text datasets, and the performance of the proposed method is compared with traditional and topic model based clustering methods.

3.1 Datasets

The 20-Newsgroups and WebKB datasets are two popular English datasets for experiments with text applications of machine learning techniques, such as text classification and text clustering. In this paper, the text clustering experiments are conducted on these datasets. The 20-Newsgroups dataset is a collection of 18,744 documents with 61,188 distinct words. It contains 11,269 (60%) documents for training and 7505 (40%) documents for testing, and is equally divided into 20 categories, each with around 1000 documents. The WebKB dataset is a collection of 4199 documents in 4 categories. In contrast to 20-Newsgroups, it is an unbalanced dataset: the largest category contains 1641 documents and the smallest only 504. It contains 2803 (about 67%) documents for training and 1396 (about 33%) documents for testing. Both datasets are pre-processed with stop-word removal and stemming. In our experiments, the input cluster number required by the clustering algorithms is set to the number of categories in the dataset ground truth, i.e., 20 for the 20-Newsgroups dataset and 4 for the WebKB dataset.

3.2 Document sparsity

Unlike probabilistic topic models such as LDA, the CSTC method is a non-probabilistic topic model like STC, which directly controls the sparsity of the inferred document representations as mixtures of topics. By imposing a sparsity bias on the document codes \({\varvec{\theta }}\), a document can be sufficiently reconstructed by only a few topics of the dictionary.

The document sparsity is defined as the sparsity ratio of the learned document codes and is calculated as the proportion of zero entries in the D document codes \({\varvec{\theta }}_{d}\in {\mathbb {R}}^{K}\), i.e.:

$$\begin{aligned} \text{ Document } \text{ sparsity }=\frac{\#\,\text{ zero } \text{ entries } \text{ of }\,\left\{ {{\varvec{\theta } }_{1} ,\ldots ,{\varvec{\theta } }_{d} ,\ldots ,{\varvec{\theta } }_{D} } \right\} }{K\times D}. \end{aligned}$$
(24)
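A direct computation of this measure could look as follows; the tolerance argument is an added convenience (an assumption, not part of the definition) for codes that are only numerically zero.

```python
import numpy as np

def document_sparsity(theta, tol=0.0):
    """Fraction of (near-)zero entries in the D x K matrix of document codes,
    as defined in Eq. (24)."""
    D, K = theta.shape
    return float(np.sum(np.abs(theta) <= tol)) / (K * D)
```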

Figure 5 shows that both the STC and CSTC methods discover sparse document codes, which means that each document belongs to only a limited number of topics. However, owing to the clustering-based regularization term in CSTC, an even sparser representation of \({\varvec{\theta }} \) is achieved.

3.3 Clustering evaluation metrics

We use two common evaluation metrics to measure the clustering performance: clustering accuracy and Normalized Mutual Information (NMI). For both metrics, a larger score represents a better performance.

Fig. 5 Sparsity comparison of document codes learned by STC and CSTC methods using the 20-Newsgroups dataset

The clustering accuracy is evaluated by comparing the cluster label assigned to each document with the label provided by the ground truth. To match cluster labels to classes, the Kuhn–Munkres algorithm (Kuhn 1955) is used; it finds the best permutation mapping of cluster labels to ground-truth class labels, i.e., the mapping under which each cluster gives a maximal match to a ground-truth class. Given a document \(\mathbf{w }_{d}\), let \(\tilde{{c}}_{d} \) and \(c_{d}\) be the extracted cluster label and the label provided by the ground truth, respectively. The clustering accuracy is defined as follows:

$$\begin{aligned} \text{ Clustering } \text{ Accuracy }=\frac{\sum \nolimits _{d=1}^D {\delta (c_{d} ,\tilde{{c}}_{d} )} }{D} \end{aligned}$$
(25)

where D is the total number of documents and \(\delta \) (x,y) is the delta function that equals one if \(x=y\) and zero otherwise.
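For reference, a possible implementation of this accuracy measure, using SciPy's Hungarian (Kuhn–Munkres) solver and assuming integer-coded labels starting at 0, is sketched below; it is not the evaluation script behind the reported numbers.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """Clustering accuracy of Eq. (25): the Kuhn-Munkres algorithm finds the
    permutation mapping of cluster labels to ground-truth classes that
    maximises the number of matched documents."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = max(true_labels.max(), cluster_labels.max()) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for c, t in zip(cluster_labels, true_labels):
        cost[c, t] += 1                           # co-occurrence counts
    rows, cols = linear_sum_assignment(-cost)     # maximise total overlap
    mapping = dict(zip(rows, cols))
    mapped = np.array([mapping[c] for c in cluster_labels])
    return float(np.mean(mapped == true_labels))
```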

NMI measures the similarity between two label sets of the same data. Let \(\tilde{{C}}\) be the set of clusters obtained by the clustering algorithm and C be the set of clusters obtained from the ground truth. Their NMI is defined as:

$$\begin{aligned} \hbox {NMI}(\tilde{{C}},C)=\frac{\hbox {MI}(\tilde{{C}},C)}{\max (\hbox {H}(\tilde{{C}}),\hbox {H}(C))}. \end{aligned}$$
(26)

In (26), MI and H denote the mutual information and the entropy, respectively, which are calculated as follows:

$$\begin{aligned} \hbox {MI}(\tilde{{C}},C)= & {} \sum \limits _{i=1}^L {\sum \limits _{j=1}^L {p(\tilde{{C}}_{i} ,C_{j} )\log } } \frac{p(\tilde{{C}}_{i} ,C_{j} )}{p(\tilde{{C}}_{i} )p(C_{j} )} \end{aligned}$$
(27)
$$\begin{aligned} \hbox {H}(\tilde{{C}})= & {} -\sum \limits _{i=1}^L {p(\tilde{{C}}_{i} )\log p(\tilde{{C}}_{i} )}\end{aligned}$$
(28)
$$\begin{aligned} \hbox {H}(C)= & {} -\sum \limits _{i=1}^L {p(C_{i} )\log p(C_{i} )} \end{aligned}$$
(29)

where L is the number of clusters, which is set to the number of categories in the dataset ground truth; \(p(\tilde{{C}}_{i})\) and \(p(C_{i})\) denote the probabilities that a document belongs to cluster \(\tilde{{C}}_{i}\) and cluster \(C_{i}\), respectively; and \(p(\tilde{{C}}_{i} ,C_{i} )\) is the joint probability that a document belongs to both cluster \(\tilde{{C}}_{i}\) and cluster \(C_{i}\) at the same time. It is easy to check that \(\hbox {NMI}(\tilde{{C}} ,C )\) ranges from 0 to 1: \(\hbox {NMI}=1\) if the two sets of clusters are identical, and \(\hbox {NMI}=0\) if the two sets are independent.
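A straightforward implementation of Eqs. (26)–(29), normalising by the larger of the two entropies, could look as follows (a sketch that assumes at least two distinct clusters so that both entropies are positive):

```python
import numpy as np

def nmi(true_labels, cluster_labels):
    """NMI of Eqs. (26)-(29), normalised by max(H(clusters), H(classes))."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    # joint distribution p(C~_i, C_j) estimated from label co-occurrence
    p_joint = np.array([[np.mean((cluster_labels == c) & (true_labels == t))
                         for t in classes] for c in clusters])
    p_c = p_joint.sum(axis=1)                     # p(C~_i)
    p_t = p_joint.sum(axis=0)                     # p(C_j)
    nz = p_joint > 0
    mi = np.sum(p_joint[nz] * np.log(p_joint[nz] / np.outer(p_c, p_t)[nz]))
    h_c = -np.sum(p_c * np.log(p_c))
    h_t = -np.sum(p_t * np.log(p_t))
    return mi / max(h_c, h_t)
```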

3.4 Hypothesis test for statistical significance evaluation of the clustering results

In order to determine whether the difference between the document clustering results of two methods is significant, we perform hypothesis testing (Papoulis and Pillai 2002). Let \(X_{1}\) and \(X_{{2}}\), where \(X_{1}>X_{{2}}\), be the numbers of documents that are correctly clustered by methods A and B, respectively, and let D be the total number of test documents (i.e., \(X_{1}/D\) and \(X_{{2}}/D\) are the clustering accuracies). Also, let \(P_{1}\) and \(P_{{2}}\) be the actual probabilities of correctly clustering a test document using methods A and B, respectively. We wish to test the assumption that \(P_{1}-P_{2}=0\) (the null hypothesis) against the assumption that \(P_{1}>P_{{2}}\) (the alternative hypothesis). That is, we wish to test whether the numbers of correctly clustered documents \(X_{{1}}\) and \(X_{{2}}\), for the given number of test documents D, support the rejection of the null hypothesis. If they do, we conclude that the difference between the clustering results is significant.

The number of correctly clustered documents X for a given method can be viewed as a random variable with a binomial distribution, where the probability of success P is the probability of correctly clustering a test document and the number of trials is the total number of test documents D. For a sufficiently large sample size D, the binomial distribution converges to the normal distribution \(N(\textit{DP,DP}(1 -P))\). Let \(X_{1}\) and \(X_{{2}}\) be two binomially distributed random variables with probabilities of success \(P_{1}\) and \(P_{{2}}\), respectively, and number of trials D. It can be shown that, under the assumption that \(P_{1}=P_{{2}}\),

$$\begin{aligned} q=\frac{X_{1} -X_{2} }{\sqrt{\frac{(X_{1} +X_{2} )(2D-X_{1} -X_{2} )}{2D}} } \end{aligned}$$
(30)

has approximately a standard normal distribution (i.e., N(0,1)). We use q as the test statistic for our hypothesis testing.

In hypothesis testing, we find a region on the real line where, under the null hypothesis, the density of the test statistic is negligible. This region is called the critical region of the test. If the observed test statistic falls in the critical region, we reject the null hypothesis. The critical region is obtained according to the chosen level of significance. For a 95% level of significance, the critical region of our test is \(q> z_{\mathrm {.95}}\), where \(z_{\mathrm {.95}}\) is the 95th percentile of the standard normal distribution and is equal to 1.64.
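The test statistic and the decision rule can be computed directly; the counts in the usage example below are hypothetical and serve only to illustrate the call.

```python
import math

def test_statistic(x1, x2, d):
    """Test statistic q of Eq. (30). x1 >= x2 are the numbers of correctly
    clustered documents for the two methods; d is the number of test documents."""
    return (x1 - x2) / math.sqrt((x1 + x2) * (2 * d - x1 - x2) / (2 * d))

# Hypothetical counts for illustration only, with D = 7505 test documents:
# the improvement is significant at the 95% level if q > 1.64.
q = test_statistic(5200, 5000, 7505)
print(q, q > 1.64)
```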

3.5 Clustering experiments

In order to study how topic modeling affects document clustering, we compare the CSTC method with two topic model based methods. The first one uses STC to learn a code vector for each document and then performs K-means on the document codes to obtain clusters; we refer to this method as “STC K-means”. The second one, similar to the method proposed in Lu et al. (2011) for PLSA and LDA, treats each topic as a cluster. The document code \({\varvec{\theta }} \) can be viewed as a mixture proportion vector over clusters and can be used directly for clustering: a document is assigned to cluster x if \(x = \hbox {argmax}_{j}\) \({\varvec{\theta }} _{j}\). We refer to this second method as “STC”.

In the CSTC method, the hyper-parameters \(\lambda _{1},\ldots ,\lambda _{4}\) should be tuned to achieve the best clustering performance. We set them to \(\lambda _{1}=\lambda _{{3}}=0.5, \lambda _{{2}}=0.2, \lambda _{4}=5\) for both the 20-Newsgroups and WebKB datasets. Details of selecting appropriate values for the hyper-parameters are discussed in Sect. 3.6. In the STC method, the number of topics (K) must be set to the number of clusters, which is 20 for the 20-Newsgroups dataset and 4 for the WebKB dataset. For STC K-means and CSTC, the number of topics can be set to the number of clusters or higher.

Figure 6 shows the clustering accuracy versus the number of topics and Fig. 7 depicts the NMI with respect to the number of topics for these methods. According to these figures, the best results, i.e. the highest clustering accuracy and NMI, are achieved with 120 topics for STC K-means and 100 topics for CSTC. In the STC K-means and CSTC methods, the number of topics defines the dimension of the input feature vectors of the K-means step and has an important impact on accuracy. Generally, accuracy increases with the number of topics within a certain range and then begins to decrease. This decline is recognized as over-fitting and is a direct result of the curse of dimensionality. The larger the number of topics at which over-fitting starts, the more latent topic features the model can handle. On the one hand, a large number of topics increases the possibility of over-fitting; on the other hand, it provides more latent features for the K-means step.

Fig. 6 Clustering accuracy (%) versus the number of topics on the 20-Newsgroups dataset

Fig. 7 NMI (%) versus the number of topics on the 20-Newsgroups dataset

Tables 2 and 3 summarize the clustering accuracy and NMI values of different clustering methods on the 20-Newsgroups and WebKB datasets, respectively. For STC K-means and CSTC, whose clustering accuracy and NMI vary with the number of topics, the best results are reported. The results of traditional clustering methods, including K-means, Normalized Cut (NC) and Non-negative Matrix Factorization (NMF), and of other topic model based methods [LDA (Lu et al. 2011); CTM (Wallach 2008); MGCTM (Xie and Xing 2013); GLDA (Li et al. 2014); LDA mixture model (Wang et al. 2009)] are also reported in Tables 2 and 3 as performance baselines. In NC, we use a Gaussian kernel as the similarity measure between documents, with the bandwidth parameter set to 10. In LDA, we use symmetric Dirichlet priors \(\alpha \) and \({\beta }\) to draw the document-topic and topic-word distributions, set to 0.1 and 0.01 respectively. For CTM, we set the number of topics to 120 for the 20-Newsgroups dataset and to 40 for the WebKB dataset. In MGCTM and GLDA, we use 10 local topics for each group and 20 global topics for the 20-Newsgroups dataset, and 32 local topics for each group and 32 global topics for the WebKB dataset. As shown in Tables 2 and 3, our proposed CSTC method achieves the highest accuracy and NMI in textual document clustering. It performs much better than the traditional approaches (i.e., K-means, NC and NMF) and better than the other topic model based clustering methods (STC, STC K-means, LDA, CTM, MGCTM, GLDA and the LDA mixture model). On the 20-Newsgroups dataset, CSTC outperforms the traditional methods by at least 26% in clustering accuracy and 33% in NMI and outperforms the topic model based clustering methods by at least 4% in clustering accuracy and about 4% in NMI. On the WebKB dataset, the improvement is at least 14% in clustering accuracy and 6% in NMI compared to the traditional methods, and at least 4% in clustering accuracy and 2% in NMI compared to the topic model based clustering methods.

Table 2 Evaluation of different clustering methods on the 20-Newsgroups dataset
Table 3 Evaluation of different clustering methods on the WebKB dataset

To determine the statistical significance of the results, the test statistic of our proposed CSTC method versus each of the other clustering methods is also shown in Tables 2 and 3. The decision to reject or accept the null hypothesis is based on the 95% level of significance, i.e., if \(q> z_{0.95}=1.64\), we reject the null hypothesis and conclude that the improvement in performance is significant; otherwise, we conclude that the results do not support the rejection of the null hypothesis (i.e., accept). As shown in Tables 2 and 3, the clustering accuracy improvement achieved by the proposed CSTC method over all other clustering methods is significant.

We also measure the execution time as the total time of training and testing. Figure 8 shows the execution time versus the number of topics on the 20-Newsgroups dataset for the STC K-means and CSTC methods. As can be seen, for both methods the execution time increases with the number of topics. Table 4 summarizes the execution times of the K-means, STC, STC K-means and CSTC methods. For STC K-means, the execution time with 120 topics is reported, and for CSTC, the execution time with 100 topics. The results indicate that K-means has the lowest execution time among these methods, which shows that the STC-based methods improve clustering performance at the cost of speed.

Fig. 8 Execution time (in seconds) versus the number of topics on the 20-Newsgroups dataset

Table 4 Execution time of different clustering methods on the 20-Newsgroups dataset

3.6 Selecting the hyper-parameter values

The hyper-parameters \(\lambda _{1}, \lambda _{{2}}, \lambda _{{3}}\) control the sparsity of the document codes \({\varvec{\theta }}\) and word codes s (Wang et al. 2014). In general, larger values of \(\lambda _{1}, \lambda _{{2}}\) and \(\lambda _{{3}}\) lead to sparser \({\varvec{\theta }} \) and s. On the other hand, a topic model with highly sparse \({\varvec{\theta }} \) and s will miss some useful topic-word relationships, which leads to poor reconstruction performance. Therefore, in practice, we try larger \(\lambda _{1}\) and \(\lambda _{{2}}\) to obtain sparser \({\varvec{\theta }} \) and s while keeping the reconstruction performance (over the training data) at an acceptable level. In Wang et al. (2014), these parameters are selected using a generic grid search based on cross-validation over a portion of the training data, and the selected values are kept fixed when evaluating the method on the test data. The search can be simplified by setting \(\lambda _{1}=\lambda _{{3}}\) (Zhu and Xing 2011). According to Wang et al. (2014), the code sparsity parameters \(\lambda _{1}, \lambda _{{2}}\) and \(\lambda _{{3}}\) are insensitive to the dataset; therefore, we empirically set \(\lambda _{1}=\lambda _{{3}}=0.5\) and \(\lambda _{{2}}=0.2\) for all datasets.

The parameter \(\lambda _{4}\) controls the effect of the K-means clustering term: as \(\lambda _{4}\) increases, this term plays a more important role in the optimization problem stated in (8). To tune it, \(\lambda _{4}\) should be set to the value that yields the highest clustering accuracy through cross-validation over the training data. A practical starting point can be found from the ratio of the corresponding terms in the objective; in this manner, we can estimate \(\lambda _{4}\) as shown in:

$$\begin{aligned} \frac{\lambda _{1} }{\lambda _{4} }\propto \frac{\sum \nolimits _{l=1}^L {\sum \nolimits _{d\in C_{l} } {\left\| {{\varvec{\theta } }_{d} -\varvec{\mu }_{l} } \right\| _{2}^{2} } } }{\sum \nolimits _{d=1}^{D}\Vert {\varvec{\theta }}_{d}\Vert _1}= \frac{\hbox {within-cluster variance}(\varvec{\theta })}{ \hbox {mean}(\varvec{\theta })} \end{aligned}$$
(31)

Since the mean value is much larger than the within-cluster variance, \(\lambda _{4}\) should be chosen much larger than \(\lambda _{1}\). For example, we set \(\lambda _{4}\) to ten times \(\lambda _{1}\); hence \(\lambda _{4}=5\) is used in our experiments.
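One possible reading of the proportionality in (31) as a concrete rule is sketched below; this is only an illustrative heuristic for initializing \(\lambda _{4}\) from current estimates of the codes, not a procedure prescribed in the paper, and the function name `estimate_lambda4` is hypothetical.

```python
import numpy as np

def estimate_lambda4(theta, labels, centroids, lam1=0.5):
    """Heuristic in the spirit of Eq. (31): scale lambda_1 by the ratio of the
    L1 term to the K-means term, evaluated on current estimates of the codes."""
    l1_term = np.abs(theta).sum()                          # sum_d ||theta_d||_1
    kmeans_term = ((theta - centroids[labels]) ** 2).sum() # within-cluster scatter
    return lam1 * l1_term / (kmeans_term + 1e-12)
```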

3.7 Discussion

According to the experimental clustering results, the topic model based clustering methods, including STC, STC K-means, CSTC, LDA, CTM, MGCTM, GLDA and the LDA mixture model, are generally better than K-means, NC and NMF. This shows that topic modeling can promote document clustering: the semantics discovered by topic models facilitate accurate similarity measurement, which helps recognize coherent clusters. In the STC K-means method, STC is first learned off-line and K-means clustering is then conducted on the document codes produced by STC. Compared with STC K-means, which performs clustering and modeling separately, the methods that jointly perform the two tasks (STC, CSTC, LDA, CTM, MGCTM, GLDA and the LDA mixture model) achieve much better results. This demonstrates that clustering and modeling can mutually promote each other: information about the categories of documents helps resolve the ambiguity of word meanings when discovering topics, and vice versa. Thus, coupling document clustering and topic modeling in a unified framework produces superior performance compared to separating them into two procedures.

Among the methods that unify clustering and modeling, our proposed CSTC method achieves the best clustering results. In the STC and LDA methods, one cluster of documents corresponds to only one topic, which may not be sufficient to capture the diverse semantics within each cluster. In contrast, CTM, MGCTM, GLDA, the LDA mixture model and CSTC assign each cluster a set of topics.

CSTC combines document modeling and clustering in a unified framework in which the two tasks are jointly accomplished. In each iteration of the inference and learning process, the cluster assignments of documents depend on the currently learned document codes, and the estimation of document codes depends on the currently inferred cluster labels. Learning the document codes is thus continually guided by intermediate clustering results, which makes them particularly suitable for the final clustering task. Compared to CTM, MGCTM, GLDA and the LDA mixture model, which are based on LDA, the STC-based CSTC achieves higher clustering accuracy.

Intuitively, instead of letting all topics contribute to the description of every document, it is reasonable to assume that each document or word has only a few salient topical meanings. Compared with LDA, STC directly achieves document sparsity by imposing sparsity-inducing regularization on the inferred document representations, which gives it an advantage over LDA. LDA is not able to discover sparse latent representations, mainly because it uses a Dirichlet distribution that prevents any zero contribution of words to topics and of topics to documents. As a result, the learned topics and document representations are extremely dense, and the document sparsity is equal to zero. In contrast, STC learns sparse latent representations of documents in which only a few topics deliberately contribute to each document. In large scale text mining applications such as document clustering and information retrieval, this sparsity is particularly valuable.

4 Conclusion

In this paper, we proposed a clustering topic model to simultaneously perform document clustering and modeling. Experiments demonstrated that topic modeling is closely related to document clustering and that the two tasks can mutually promote each other. We conducted our experiments on two popular text datasets to evaluate the proposed model. The results indicate that, through joint topic modeling and clustering, our proposed CSTC method achieves much better performance than the traditional methods (K-means, NC and NMF) and better performance than other topic model based clustering methods (LDA, CTM, MGCTM, GLDA and the LDA mixture model). According to the experimental results, CSTC significantly improves the clustering accuracy and NMI over the other methods by at least about 4% on the 20-Newsgroups dataset; on the WebKB dataset, these improvements are at least 4 and 2%, respectively.