1 Introduction

As more online service providers encourage users to comment on or leave feedback about their constantly updated content, co-occurring normal documents and short texts are generated throughout the Internet. For example, each news article on a news publishing platform may be followed by multiple reader comments, each post on a blog website may be followed by multiple reader reviews, and each product description on an e-commerce website may be followed by multiple consumer reviews. The short texts may discuss issues addressed in their corresponding normal documents, but may also raise other issues, such as personal opinions. The co-occurring structure inherent in such text corpora poses challenges to conventional topic modeling.

Topic models [2, 9] have been successfully applied to normal documents, such as news articles, blog posts and product descriptions, and have proven effective at uncovering latent semantic structure. In the basic Latent Dirichlet Allocation (LDA) model [2], documents are taken as mixtures of topics and each topic has a probability distribution over a dictionary of words. LDA has also been extended in various ways to deal with more complicated modeling tasks. For example, Liu et al. [15] propose a model that jointly models the generation of contents and friendships of authors in social networks, within which the user topics and the link formation pattern can be learned in a unified model. McCallum et al. [17] propose a model that simultaneously discovers groups among the entities and topics among the corresponding texts. Nagarajan et al. [21] propose a probabilistic model for community structures and user contents that can discover coherent communities and topics at the same time.

When faced with a corpus of short texts, LDA and its extensions suffer from a severe data sparsity problem. Specifically, 1) the small word counts in short texts restrict the ability of topic models to learn how words are related, and hence the learnt topics are less discriminative than those learnt from normal documents [10]; 2) the limited contexts in short texts make it more difficult for topic models to distinguish ambiguous words [27].

Two major heuristic strategies have been adopted to alleviate this sparsity problem. The first strategy aggregates short texts into pseudo-documents. It is widely used on social media but is highly data-dependent. For example, Weng et al. [26] aggregate tweets belonging to the same user, Hong et al. [10] aggregate tweets containing the same word, and Mehrotra et al. [18] aggregate tweets based on hashtags. Conventional topic models are then applied to the pseudo-documents to learn more prominent topics from the enriched contexts of the aggregated texts. However, auxiliary information such as authorship or hashtags is not always available in real-world applications. The second strategy extends topic models by placing strong assumptions on short texts. Zhao et al. [29] and Lakkaraju et al. [13] assume each short text is a mixture of unigrams sampled from only one topic. The biterm topic model (BTM) [5, 27] turns the whole corpus into a biterm set, where a biterm is constructed from any two distinct words in a short context. BTM then assumes that the two words in any biterm are drawn independently from a topic, where the topic is sampled from a topic mixture over the whole corpus. The self-aggregation topic model [25] assumes each piece of short text is sampled from unobserved pseudo-documents and automatically aggregates short texts. Also using the self-aggregation method, Zuo et al. [30] propose a Pseudo-document-based Topic Model (PTM) for short texts, which alleviates the overfitting problem and reduces the computational cost of [25].

Recently, word embedding models [11, 19, 22] have gained much attention for their ability to form clusters of conceptually similar words in the embedding space. [11] proposes a latent concept topic model (LCTM), which models each topic as a distribution over latent concepts, with each concept being a Gaussian distribution over the word embedding space. Since the number of concepts is often much smaller than the number of unique words, LCTM is less susceptible to data sparsity.

Methods that exploit external normal texts to improve topic learning on short texts are closely related to our work. For example, Phan et al. [23, 24] propose to train topic models on a collection of long texts in the same domain as the short texts, and then make inference on the short texts to help the learning of their topics. Jin et al. [12] learn topics on short texts by transferring knowledge from auxiliary long text data. The performance of these approaches, however, is highly data-dependent, as the quality of topic learning depends heavily on how well the external datasets are organized. Targeting summarization of short texts, Ma et al. [16] utilize the relationships between normal documents and corresponding short texts to enhance topic learning on short texts. They propose two models: the Master-Slave Topic Model (MSTM), which restricts the topics of short texts to those of their associated normal documents, and the Extended Master-Slave Topic Model (EXTM), which allows some short texts to represent topics extracted only from the short texts themselves and not correlated with normal documents. However, both MSTM and EXTM miss the situation in which short texts not only contain content from their associated normal documents but also express their own opinions.

In this paper, we fill this gap and propose a co-occurring topic model, COTM, which directly exploits the co-occurring structure of such text corpora and utilizes information from both the normal documents and the short texts for efficient topic learning. We assume that (1) each normal document has a probability distribution over a set of formal topics; (2) each short text has a probability distribution over two topics, one belonging to the formal topics, whose selection is governed by the topic probabilities of the corresponding normal document, and the other belonging to a set of informal topics shared only by short texts. Intuitively, for each short text, its formal topic is the one that appeals to its author among the topics of the corresponding normal document, and its informal topic reflects the additional discussion that its author adds to the chosen formal topic. As a result, in COTM, topic modeling of normal documents is enhanced by including words from the corresponding short texts that are relevant to the formal topics, and the informal topics are learned from words that are irrelevant to formal topics but shared across short texts.

In practice, texts are constantly generated, and scalable online inference algorithms are needed. In a co-occurring text corpus, short texts can be created at any time after the corresponding normal documents are published. We introduce an online algorithm for COTM, referred to as oCOTM, to handle dynamically generated co-occurring documents. The oCOTM algorithm incrementally adjusts the learned topics as the data stream evolves, without needing to access previously processed texts. Compared with batch COTM, oCOTM only needs to store a small fraction of the data for model updates, saving both computation and memory.

We conduct extensive experiments on two large real-world text collections: news articles together with reader comments from the NetEase news website, and blog posts together with user comments from the Sina blog website. Experiments on both batch and online algorithms show that (1) COTM learns more coherent and comprehensive topics than several state-of-the-art methods for topic modeling, such as LDA, BTM and EXTM; (2) the topic proportions obtained by COTM better support document clustering and classification, indicating that COTM offers better topic representations than its competitors. Moreover, COTM properly reveals the topical relationships between normal documents and their ensuing short texts, which can be effectively used to detect spam user comments.

This paper extends our previous conference article [28] with the following improvements: 1) we introduce an online algorithm for COTM to handle continuously generated texts, including both short texts and normal documents; 2) both batch and online COTM algorithms are empirically verified with more comprehensive experiments. The rest of this paper is organized as follows. Sections 2 and 3 present the batch and online implementations of COTM, Section 4 reports experimental results, and Section 5 concludes.

2 The COTM model

Co-occurring documents, consisting of both normal documents and short texts, are illustrated in Figure 1. Borrowing ideas from previous work, we use normal documents as auxiliary information to improve topic learning for short texts. In turn, we also enhance topic learning for normal documents by using information from the corresponding short texts.

Figure 1: Hierarchical structure of normal documents and short texts

2.1 Model description

We assume the generative process of normal documents follows the LDA model. Suppose there are K topics underlying the D normal documents, which we refer to as formal topics hereafter. Each normal document d is a mixture of the K formal topics with its own vector of topic probabilities \(\boldsymbol{\theta}_{d} = \{\theta_{d1},\theta_{d2},...,\theta_{dK}\}\). Each formal topic k has its own vector of word probabilities \(\boldsymbol{\phi}_{k} = \{\phi_{k1},\phi_{k2},...,\phi_{kV}\}\) over a dictionary of size V, which consists of all distinct words in normal documents and short texts.

Contents of short texts may discuss topics from their corresponding normal documents, but may also address additional issues, such as personal opinions. Hence the K formal topics are insufficient to cover all subjects discussed in the short text corpus. We therefore assume another set of J informal topics, which appear only in short texts, and each informal topic j has its own vector of word probabilities \(\boldsymbol{\psi}_{j} = \{\psi_{j1},\psi_{j2},...,\psi_{jV}\}\). The c-th short text following normal document d has a probability distribution \((p_{dc}, 1-p_{dc})\) over two topics, a formal topic \(x_{dc}\) and an informal topic \(y_{dc}\). Here \(p_{dc} \in [0,1]\) depicts the association probability between the short text and the corresponding normal document, with higher values indicating a more consistent relationship.

The graphical representation of normal documents and short texts is illustrated in Figure 2, and the generative process is described below.

Figure 2: Graphical representation of COTM

For each normal document \(d \in \{1,2,...,D\}\):

  1. Generate topic probabilities \(\boldsymbol{\theta}_{d}\) from a homogeneous Dirichlet distribution with parameter α: \(\boldsymbol{\theta}_{d} \sim Dir(\alpha)\);

  2. For the n-th word in normal document d, \(n \in \{1,2,...,N_{d}\}\):

    (a) Choose a topic \(z_{dn}\) from the K formal topics with probabilities given by \(\boldsymbol{\theta}_{d}\): \(z_{dn} \sim Multi(\boldsymbol{\theta}_{d})\);

    (b) Choose a word \(w_{dn}\) from the dictionary with probabilities given by \(\boldsymbol{\phi}_{z_{dn}}\): \(w_{dn} \sim Multi(\boldsymbol{\phi}_{z_{dn}})\).

Then for the c-th short text associated with normal document d, \(c \in \{1,2,...,C_{d}\}\):

  1. Choose the association probability \(p_{dc}\) from a beta distribution with parameter \(\gamma\): \(p_{dc} \sim Beta(\gamma,\gamma)\);

  2. Choose a topic \(x_{dc}\) from the K formal topics with probabilities given by \(\boldsymbol{\theta}_{d}\): \(x_{dc} \sim Multi(\boldsymbol{\theta}_{d})\);

  3. Choose a topic \(y_{dc}\) from the J informal topics with probabilities given by \(\boldsymbol{\xi}\): \(y_{dc} \sim Multi(\boldsymbol{\xi})\);

  4. For the m-th word in the short text, \(m \in \{1,2,...,M_{dc}\}\):

    (a) Generate a topic indicator \(b_{dcm}\) with probability given by \(p_{dc}\): \(b_{dcm} \sim Bernoulli(p_{dc})\);

    (b) If \(b_{dcm} = 1\), choose the word with probabilities under the formal topic: \(w_{dcm} \sim Multi(\boldsymbol{\phi}_{x_{dc}})\);

    (c) If \(b_{dcm} = 0\), choose the word with probabilities under the informal topic: \(w_{dcm} \sim Multi(\boldsymbol{\psi}_{y_{dc}})\).

To complete the specification, we assign homogeneous Dirichlet hyperpriors to \(\boldsymbol{\phi}_{k}\), \(\boldsymbol{\psi}_{j}\) and \(\boldsymbol{\xi}\), i.e., \(\boldsymbol{\phi}_{k} \sim Dir(\beta)\), \(\boldsymbol{\psi}_{j} \sim Dir(\beta)\) and \(\boldsymbol{\xi} \sim Dir(\epsilon)\).
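
As a concrete illustration, the following minimal sketch simulates the generative process above in Python with numpy; the corpus sizes and hyperparameter values are assumptions chosen for readability, not the settings used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions for this sketch)
D, K, J, V = 5, 3, 2, 50          # normal documents, formal topics, informal topics, vocabulary size
alpha, beta, gamma, eps = 0.5, 0.1, 0.5, 0.5

phi = rng.dirichlet([beta] * V, size=K)   # formal topics: word distributions over the shared vocabulary
psi = rng.dirichlet([beta] * V, size=J)   # informal topics: word distributions over the shared vocabulary
xi = rng.dirichlet([eps] * J)             # topic probabilities over informal topics, shared by all short texts

corpus = []
for d in range(D):
    theta_d = rng.dirichlet([alpha] * K)              # topic probabilities of normal document d
    z = rng.choice(K, size=80, p=theta_d)             # formal topic of each word in the normal document
    doc_words = [rng.choice(V, p=phi[k]) for k in z]
    short_texts = []
    for c in range(int(rng.integers(1, 4))):          # a few short texts following document d
        p_dc = rng.beta(gamma, gamma)                 # association probability with document d
        x_dc = rng.choice(K, p=theta_d)               # formal topic of the short text
        y_dc = rng.choice(J, p=xi)                    # informal topic of the short text
        words = []
        for m in range(10):
            b_dcm = rng.random() < p_dc               # topic indicator of each word
            words.append(rng.choice(V, p=phi[x_dc] if b_dcm else psi[y_dc]))
        short_texts.append(words)
    corpus.append((doc_words, short_texts))
```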

Here we make a few comments on the model specification. Firstly, unlike normal documents, each of which has its own topic probabilities \(\boldsymbol{\theta}_{d}\) over the formal topics, all short texts share the same topic probabilities \(\boldsymbol{\xi}\) over the informal topics. This assumption keeps the model simpler and thus easier to converge than assigning separate topic probabilities to each short text. Secondly, since short texts are very concise, we assume each short text represents only two topics, a formal one and an informal one, which further simplifies the model. Lastly, by using \(p_{dc}\) to depict the topical relationship between a short text and its normal document, we obtain an unsupervised way to detect “spams”: short texts whose \(p_{dc}\) falls below a predefined threshold can be flagged as spam.

To the best of our knowledge, models that utilize the co-occurring relationship between normal documents and short texts to enhance topic learning are still rare in the literature, apart from the MSTM and EXTM models proposed by Ma et al. [16]. Although both COTM and the models in [16] employ the co-occurring structure of the text corpus, there remain several differences between the two approaches:

  • Firstly, both MSTM and EXTM allow each short text to have only one topic, which is either derived from the topic distribution of its associated normal document or is one of the extended topics formed by all short texts. This assumption is still rigid, because the two circumstances may coexist in one short text. Following a normal document that discusses various topics, the corresponding short texts often concentrate on one specific topic (the formal one) while adding personal opinions (the informal one). Thus, in COTM, we assume each short text is composed of two topics, a formal one derived from the normal document and an informal one formed only by short texts, which is much closer to how short texts are generated in the real world. In addition, a probability distribution over the two topics indicates how strongly the short text correlates with its associated normal document. Therefore, the topics of short texts can exhibit different levels of topical consistency with normal documents, from strongly correlated, through partially correlated, to completely irrelevant. In this respect, COTM can be seen as a more general extension of EXTM.

  • Secondly, in MSTM and EXTM, the same topic meaning is represented by two sets of topics: the master topics, which use the vocabulary formed by normal documents, and the slave topics, which use the vocabulary formed by short texts. In COTM, by contrast, both formal and informal topics have word distributions over the whole vocabulary, which includes all unique words in normal documents and short texts. This more concise vocabulary assumption not only significantly reduces the parameter space and computational complexity on a large text corpus, but also integrates the generation of topics in a unified framework and makes the manual summarization of topic meanings easier. Moreover, under this assumption, the topic learning of normal documents can be enhanced by including words from the corresponding short texts.

2.2 Model inference

In this section, we introduce the Gibbs sampling algorithm for COTM. For normal documents, let \(\boldsymbol{z_{d}}=(z_{d1},z_{d2},...,z_{dN_{d}})^{\top }\) and \(\boldsymbol{z}=\{\boldsymbol{z_{1}},\boldsymbol{z_{2}},...,\boldsymbol{z_{D}}\}\). For all the \(C_{d}\) short texts associated with normal document d, let \(\boldsymbol{b_{dc}}=(b_{dc1},b_{dc2},...,b_{dcM_{dc}})^{\top }\), \(\boldsymbol{b_{d}}=\{\boldsymbol{b_{d1}},\boldsymbol{b_{d2}},...,\boldsymbol{b_{dC_{d}}}\}\), \(\boldsymbol{P_{d}}=\{p_{d1},p_{d2},...,p_{dC_{d}}\}\), \(\boldsymbol{x_{d}}=(x_{d1},x_{d2},...,x_{dC_{d}})^{\top }\) and \(\boldsymbol{y_{d}}=(y_{d1},y_{d2},...,y_{dC_{d}})^{\top }\). Then we have \(\boldsymbol{b}=\{\boldsymbol{b_{1}},\boldsymbol{b_{2}},...,\boldsymbol{b_{D}}\}\), \(\boldsymbol{P}=\{\boldsymbol{P_{1}},\boldsymbol{P_{2}},...,\boldsymbol{P_{D}}\}\), \(\boldsymbol{x}=\{\boldsymbol{x_{1}},\boldsymbol{x_{2}},...,\boldsymbol{x_{D}}\}\), and \(\boldsymbol{y}=\{\boldsymbol{y_{1}},\boldsymbol{y_{2}},...,\boldsymbol{y_{D}}\}\) for all short texts. Moreover, let \(\boldsymbol{\Theta}=\{\boldsymbol{\theta_{1}},\boldsymbol{\theta_{2}},...,\boldsymbol{\theta_{D}}\}\), \(\boldsymbol{\Phi}=\{\boldsymbol{\phi_{1}},\boldsymbol{\phi_{2}},...,\boldsymbol{\phi_{K}}\}\) and \(\boldsymbol{\Psi}=\{\boldsymbol{\psi_{1}},\boldsymbol{\psi_{2}},...,\boldsymbol{\psi_{J}}\}\). Let w represent all words in normal documents and short texts. Given w and all the hyperparameters, we can derive the full posterior distribution according to the generative process of COTM:

$$ \begin{array}{lllll} &f(\boldsymbol{z},\boldsymbol{b},\boldsymbol{P},\boldsymbol{x},\boldsymbol{y},\boldsymbol{\Theta},\boldsymbol{\Phi},\boldsymbol{\Psi},\boldsymbol{\xi} \mid \boldsymbol{w},\alpha,\beta,\gamma,\epsilon)\\ \propto &\left\{\prod \limits_{d=1}^{D}\prod \limits_{k=1}^{K}\theta_{dk}^{\alpha-1}\right\} \left\{\prod\limits_{k=1}^{K}\prod\limits_{v=1}^{V}\phi_{kv}^{\beta-1}\right\} \left\{\prod\limits_{j=1}^{J} \xi_{j}^{\epsilon-1}\right\} \left\{\prod\limits_{j=1}^{J}\prod\limits_{v=1}^{V}\psi_{jv}^{\beta-1}\right\}\\ &\left\{\prod\limits_{d=1}^{D}\prod \limits_{c=1}^{C_{d}} p_{dc}^{\gamma-1}(1-p_{dc})^{\gamma-1}\right\} \left\{\prod \limits_{d=1}^{D} \prod\limits_{n=1}^{N_{d}}\theta_{d,z_{dn}}\phi_{z_{dn},w_{dn}} \right\}\\ &\left\{\prod\limits_{d=1}^{D}\prod \limits_{c=1}^{C_{d}}\theta_{d,x_{dc}}\xi_{y_{dc}}\right\} \left\{\prod\limits_{d=1}^{D}\prod \limits_{c=1}^{C_{d}} \prod\limits_{m=1}^{M_{dc}}\left( p_{dc}\phi_{x_{dc},w_{dcm}}\right)^{b_{dcm}}\right\}\\ &\left\{\prod\limits_{d=1}^{D}\prod \limits_{c=1}^{C_{d}} \prod\limits_{m=1}^{M_{dc}}\left[(1-p_{dc})\psi_{y_{dc},w_{dcm}}\right]^{1-b_{dcm}}\right\}. \end{array} $$
(1)

Given the full posterior distribution in (1), we can easily obtain the full conditional posterior distributions for Θ, Φ, Ψ, ξ and P, which are all Dirichlet (or Beta, for P) and conjugate with their priors. Therefore, we develop a collapsed Gibbs sampling algorithm by integrating these parameters out of the posterior distribution, so that only z, x, y and b need to be updated in each iteration. Details of the derivation of the collapsed Gibbs sampling algorithm are given in the Appendix.

For the n-th word in normal document d, the full conditional distribution of \(z_{dn}\) in the collapsed Gibbs sampling algorithm is:

$$ \begin{array}{l} f(z_{dn}=k \mid \cdot) \propto \left( l_{dk;-dn}^{(1)}+g_{dk}^{(1)}+\alpha\right) \frac{l_{k,w_{dn};-dn}^{(2)}+g_{k,w_{dn}}^{(2)}+\beta}{l_{k\cdot;-dn}^{(2)}+g_{k\cdot}^{(2)}+V\beta}, \end{array} $$
(2)

where the subscript “−dn” indicates counts excluding the n-th word in normal document d; \(l_{dk}^{(1)}\) and \(g_{dk}^{(1)}\) denote the number of words in normal document d and the number of short texts following normal document d, respectively, that are associated with formal topic k; \(l_{kv}^{(2)}\) and \(g_{kv}^{(2)}\) denote the number of times word v is associated with formal topic k in all normal documents and in all short texts, respectively; and \(l_{k\cdot }^{(2)}\) and \(g_{k\cdot }^{(2)}\) are the sums of \(l_{kv}^{(2)}\) and \(g_{kv}^{(2)}\) over all words v in the vocabulary.
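
For illustration, the conditional in (2) can be sampled as in the following sketch; the count-array layout and function name are assumptions introduced here, mirroring the notation above.

```python
import numpy as np

def sample_z_dn(d, v, l1, g1, l2, g2, alpha, beta, rng):
    """Draw z_dn for a word of type v in normal document d, following Eq. (2).

    Assumed shapes: l1, g1 are (D, K) document-topic counts (words / short texts);
    l2, g2 are (K, V) topic-word counts; all counts already exclude the current word ("-dn").
    """
    V = l2.shape[1]
    left = l1[d] + g1[d] + alpha
    right = (l2[:, v] + g2[:, v] + beta) / (l2.sum(axis=1) + g2.sum(axis=1) + V * beta)
    p = left * right
    return rng.choice(len(p), p=p / p.sum())
```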

For \(x_{dc}\) and \(y_{dc}\) in the c-th short text associated with normal document d, the full conditional distributions in the collapsed Gibbs sampling algorithm are:

$$\begin{array}{@{}rcl@{}} f(x_{dc}\!=k \mid \cdot) &\propto \left( l_{dk}^{(1)}\,+\,g_{dk;-dc}^{(1)}\,+\,\alpha\right) \frac{{\prod}_{v \in {\Lambda}_{dc}}{\prod}_{m=1}^{q_{dcv}^{(1)}}\left( l_{kv}^{(2)}+g_{kv;-dc}^{(2)}\!+m\,-\,1\!+\beta\right)} {{\prod}_{m=1}^{s_{dc}^{(1)}}\left( l_{k\cdot}^{(2)}\!+g_{k\cdot;-dc}^{(2)}\!+m-\!1\,+\,V\beta\right)}, \end{array} $$
(3)
$$\begin{array}{@{}rcl@{}} f(y_{dc}=j \mid \cdot) &\propto \left( h_{j;-dc}+\epsilon\right) \frac{{\prod}_{v \in {\Lambda}_{dc}}{\prod}_{m=1}^{q_{dcv}^{(2)}}\left( g_{jv;-dc}^{(3)}+m-1+\beta\right)} {{\prod}_{m=1}^{s_{dc}^{(2)}}\left( g_{j\cdot;-dc}^{(3)}+m-1+V\beta\right)}, \end{array} $$
(4)

where the subscript “−dc” indicates counts excluding the c-th short text following normal document d; \(\Lambda_{dc}\) is the set of unique words appearing in the c-th short text following normal document d; \(q_{dcv}^{(1)}\) and \(q_{dcv}^{(2)}\) denote the number of times word v appears in the c-th short text following normal document d and is associated with a formal topic or an informal topic, respectively; \(h_{j}\) denotes the number of short texts associated with informal topic j; \(g_{jv}^{(3)}\) denotes the number of times word v is associated with informal topic j in all short texts; and \(s_{dc}^{(1)}\) and \(s_{dc}^{(2)}\) are the sums of \(q_{dcv}^{(1)}\) and \(q_{dcv}^{(2)}\) over all unique words in \(\Lambda_{dc}\).

For the m-th word in the c-th short text following normal document d, the full conditional distribution of \(b_{dcm}\) in the collapsed Gibbs sampling algorithm is:

$$\begin{array}{@{}rcl@{}} f(b_{dcm}=1 \mid \cdot) &\propto \frac{l_{x_{dc},w_{dcm}}^{(2)}+g_{x_{dc},w_{dcm};-dcm}^{(2)}+\beta} {l_{x_{dc},\cdot}^{(2)}+g_{x_{dc},\cdot;-dcm}^{(2)}+V\beta} \left( s_{dc;-dcm}^{(1)}+\gamma\right), \end{array} $$
(5)
$$\begin{array}{@{}rcl@{}} f(b_{dcm}=0 \mid \cdot) &\propto \frac{g_{y_{dc},w_{dcm};-dcm}^{(3)}+\beta}{g_{y_{dc},\cdot;-dcm}^{(3)}+V\beta} \left( s_{dc;-dcm}^{(2)}+\gamma\right), \end{array} $$
(6)

where the subscript “−dcm” indicates counts excluding the m-th word in the c-th short text following normal document d, and \(g_{y_{dc},\cdot }^{(3)}\) is the sum of \(g_{y_{dc},v}^{(3)}\) over all words in the dictionary. Equations (5) and (6) are then normalized to sum to one to obtain the full conditional posterior probabilities of \(b_{dcm}=1\) and \(b_{dcm}=0\).

After convergence of the collapsed Gibbs sampling algorithm, we compute Θ, Φ, Ψ and P from the first posterior draw of z, x, y and b:

$$\begin{array}{@{}rcl@{}} \hat{\theta}_{dk}&=\frac{l_{dk}^{(1)}+g_{dk}^{(1)}+\alpha}{l_{d\cdot}^{(1)}+g_{d\cdot}^{(1)}+K\alpha}, \end{array} $$
(7)
$$\begin{array}{@{}rcl@{}} \hat{\phi}_{kv}&=\frac{l_{kv}^{(2)}+g_{kv}^{(2)}+\beta}{l_{k\cdot}^{(2)}+g_{k\cdot}^{(2)}+V\beta}, \end{array} $$
(8)
$$\begin{array}{@{}rcl@{}} \hat{\psi}_{jv}&=\frac{g_{jv}^{(3)}+\beta}{g_{j\cdot}^{(3)}+V\beta}, \end{array} $$
(9)
$$\begin{array}{@{}rcl@{}} \hat{p}_{dc}&=\frac{s_{dc}^{(1)}+\gamma}{s_{dc}^{(1)}+s_{dc}^{(2)}+2\gamma}. \end{array} $$
(10)
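
A minimal sketch of the estimates in (7)-(10), assuming the count matrices from the final posterior draw are held as numpy arrays with the shapes implied by the notation above:

```python
import numpy as np

def estimate_parameters(l1, g1, l2, g2, g3, s1, s2, alpha, beta, gamma):
    """Point estimates (7)-(10) from one posterior draw of the count matrices.

    Assumed shapes: l1, g1: (D, K); l2, g2: (K, V); g3: (J, V);
    s1, s2: arrays of per-short-text counts of words assigned to the formal / informal topic.
    """
    K, V = l2.shape
    theta = (l1 + g1 + alpha) / (l1.sum(1, keepdims=True) + g1.sum(1, keepdims=True) + K * alpha)
    phi = (l2 + g2 + beta) / (l2.sum(1, keepdims=True) + g2.sum(1, keepdims=True) + V * beta)
    psi = (g3 + beta) / (g3.sum(1, keepdims=True) + V * beta)
    p = (s1 + gamma) / (s1 + s2 + 2 * gamma)
    return theta, phi, psi, p
```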

2.3 Model complexity

To illustrate the computational complexity of COTM, we analyse its running time and memory requirements and compare them with those of the basic LDA model. We denote by \(\bar {C}\) the average number of short texts following each normal document, \(\bar {N}\) the average length (number of words) of normal documents, \(\bar {M}\) the average length of short texts, and \(N_{iter}\) the number of iterations in Gibbs sampling. To simplify the calculations, we further assume that every normal document is followed by the same number \(\bar {C}\) of short texts, every normal document has the same length \(\bar {N}\), and every short text has the same length \(\bar {M}\). To ensure a fair comparison on the same set of texts with the same total number of topics, we compare COTM with the LDA model trained by Gibbs samplingFootnote 1 on both normal documents and short texts with K + J topics. The time complexity and number of in-memory variables of the Gibbs sampling procedures of the two models are listed in Table 1.

Table 1 Time complexity and the number of in-memory variables in LDA and COTM

For LDA, each topic assignment requires computational time of order O(K + J). LDA draws a topic for each word in the corpus, giving an overall time complexity of \(O(N_{iter}D(K+J)(\bar {N}+\bar {C}\bar {M}))\). COTM involves three sampling steps. First, it draws a topic for each word in the normal documents, requiring time of order \(O(N_{iter}DK\bar {N})\). Second, it draws a formal topic and an informal topic for each short text, requiring time of order \(O(N_{iter}D\bar {C}(K+J))\). Finally, it draws the binary topic indicator \(b_{dcm}\) for each word in the short texts, requiring time of order \(O(N_{iter}D\bar {C}2\bar {M})\). Therefore, the overall time complexity of COTM is \(O(N_{iter}D(K\bar {N}+\bar {C}(K+J+2\bar {M})))\). The difference in the order of computational complexity between LDA and COTM is \(O(N_{iter}(DJ\bar {N}+D\bar {C}\{(K+J)\bar {M}-(K+J)-2\bar {M}\}))\). Since \(K+J+2\bar {M}\ll (K+J)\bar {M}\), the time complexity of COTM is of smaller order than that of LDA.

Both models keep count matrices and topic assignments in memory. In LDA, the stored variables are: the count matrix of the number of words in each normal document or short text associated with each topic, the count matrix of the number of times each word in the dictionary is associated with each topic, and the topic assignment of each word in the corpus. Hence the overall required memory size is \(D(K+J)(1+\bar {C})+V(K+J)+D(\bar {N}+\bar {C}\bar {M})\). In COTM, the count matrices \(l_{dk}^{(1)}+g_{dk}^{(1)}\), \(l_{kv}^{(2)}+g_{kv}^{(2)}\), \(g_{jv}^{(3)}\), \(s_{dc}^{(1)}\) and \(s_{dc}^{(2)}\) need to be stored, taking up memory of size \(DK+V(K+J)+2D\bar {C}\). Moreover, the topic assignments z, x, y and b need to be stored, taking up memory of size \(D\bar {N}+2D\bar {C}+D\bar {C}\bar {M}\). Hence, the overall required memory size for COTM is \(DK+V(K+J)+D(\bar {N}+\bar {C}(4+\bar {M}))\). The difference in required memory between LDA and COTM is \(D(J+(K+J-4)\bar {C})\), which is usually very large, so COTM requires less memory than LDA.
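
As a rough numerical illustration of the two per-iteration orders derived above (the corpus statistics below are made-up values, not those of our experiments):

```python
# Illustrative, assumed values only
D, K, J = 10_000, 100, 50
N_bar, C_bar, M_bar = 300, 20, 15

lda_order = D * (K + J) * (N_bar + C_bar * M_bar)           # O(D(K+J)(N + C*M)) per iteration
cotm_order = D * (K * N_bar + C_bar * (K + J + 2 * M_bar))  # O(D(KN + C(K+J+2M))) per iteration
print(lda_order / cotm_order)  # about 2.7 under these assumed values
```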

3 Online algorithm for COTM

In real-world applications, normal documents and their co-occurring short texts are constantly generated, which requires topic modeling algorithms to handle large data streams. Batch algorithms have high computational and memory costs and are therefore inefficient in this setting. We thus introduce an online algorithm for COTM, referred to as oCOTM, for the online-learning task. Compared with batch COTM, the online algorithm only needs to store a small amount of data, and topics can be continuously updated over data streams.

The oCOTM algorithm is inspired by the online LDA algorithm proposed in [1]. It assumes documents are divided into successive time slices, e.g., each time slice being an hour or a day. The general idea of oCOTM is to fit a COTM model with K formal topics and J informal topics to the normal documents and short texts of each time slice; the counts of words in topics (i.e. \(l_{kv}^{(2)},g_{kv}^{(2)},g_{jv}^{(3)}\)) at the current time slice are then used to update the parameters of the priors on the topics' word probabilities at the next time slice.

Let \(V^{(t)}\) denote the size of the dictionary at time slice t, where the dictionary extends that of time slice t − 1 by the new words appearing in time slice t. Let \(\boldsymbol {\beta }_{k;\text {for}}^{(t)}\) and \(\boldsymbol {\beta }_{j;\inf }^{(t)}\) respectively denote the \(V^{(t)}\)-dimensional vectors used in the Dirichlet priors for formal topic k and informal topic j at time slice t, where the components corresponding to the new words equal β. The collapsed Gibbs sampling algorithm of Section 2 is carried out with β replaced by \(\beta _{kv;\text {for}}^{(t)}\) or \(\beta _{jv;\inf }^{(t)}\), and with the count matrices \(l_{dk}^{(1)}\), \(g_{dk}^{(1)}\), \(l_{kv}^{(2)}\), \(g_{kv}^{(2)}\), \(g_{jv}^{(3)}\), \(s_{dc}^{(1)}\) and \(s_{dc}^{(2)}\) computed using only the normal documents and short texts of time slice t.

After convergence of the collapsed Gibbs sampling algorithm, the first posterior draw of z, x, y and b is used to calculate the word counts \(l_{kv}^{(2)(t)}\), \(g_{kv}^{(2)(t)}\) and \(g_{jv}^{(3)(t)}\). These counts are used to adjust the prior vectors for the next time slice by setting:

$$\begin{array}{@{}rcl@{}} \beta_{kv;\text{for}}^{(t+1)}&=&\beta_{kv;\text{for}}^{(t)}+\lambda\left( l_{kv}^{(2)(t)}+g_{kv}^{(2)(t)}\right), \end{array} $$
(11)
$$\begin{array}{@{}rcl@{}} \beta_{jv;\inf}^{(t+1)}&=&\beta_{jv;\inf}^{(t)}+\lambda g_{jv}^{(3)(t)}, \end{array} $$
(12)

where λ ∈ [0,1] is a decay parameter indicating the strength of influence of historical topic information. When λ = 1, we simply accumulate the historical counts of topic assignments without any decay; when λ = 0, the COTM models trained at different time slices are independent. At the initial time slice, we set all entries of \(\boldsymbol {\beta }_{k;\text {for}}^{(1)}\) and \(\boldsymbol {\beta }_{j;{inf}}^{(1)}\) to a constant β. These prior vectors are then updated at the end of each time slice, so that historical information is incorporated into model fitting at later times.
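
A minimal sketch of the prior update in (11)-(12), assuming the per-slice count matrices and prior vectors are numpy arrays aligned to the dictionary of time slice t:

```python
import numpy as np

def update_priors(beta_for, beta_inf, l2_t, g2_t, g3_t, lam):
    """Update the Dirichlet prior vectors after time slice t, following (11)-(12).

    Assumed shapes: beta_for (K, V_t) and beta_inf (J, V_t) are the prior vectors at slice t,
    already padded with the constant beta for words new to the dictionary at slice t;
    l2_t, g2_t, g3_t are the topic-word counts from the last posterior draw at slice t.
    """
    beta_for_next = beta_for + lam * (l2_t + g2_t)   # formal-topic priors, Eq. (11)
    beta_inf_next = beta_inf + lam * g3_t            # informal-topic priors, Eq. (12)
    return beta_for_next, beta_inf_next
```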

We now compare the complexities of the batch and online COTM algorithms. Assume there are \(D^{(t)}\) normal documents in time slice t, and let \(D^{(1:t)} = D^{(1)} + \cdots + D^{(t)}\) denote the cumulative number of normal documents up to time t. For simplicity, we further assume that all short texts corresponding to a normal document published in time slice t are themselves published in time slice t. The batch COTM algorithm must process all normal documents and short texts up to time t, requiring computational time of order \(O(N_{iter}D^{(1:t)}(K\bar {N}+\bar {C}(K+J+2\bar {M})))\) and memory of size \(D^{(1:t)}K+V^{(t)}(K+J)+D^{(1:t)}(\bar {N}+\bar {C}(4+\bar {M}))\). The online COTM algorithm only processes the normal documents and short texts of time slice t, requiring computational time of order \(O(N_{iter}D^{(t)}(K\bar {N}+\bar {C}(K+J+2\bar {M})))\) and memory of size \(D^{(t)}K+V^{(t)}(K+J)+D^{(t)}(\bar {N}+\bar {C}(4+\bar {M}))\). The computational complexity and memory consumption are compared in Table 2.

Table 2 Time complexity and the number of in-memory variables of batch and online COTM algorithms in time slice t

4 Experiments

4.1 Experimental settings

Datasets. The effectiveness of our approach is evaluated over two text datasets with co-occurring structure.

  • NetEase collection includes news articles and reader comments crawled from the most popular Chinese news publishing platform.Footnote 2 All crawled texts were published between May 1st, 2015 and May 1st, 2016.

  • Sina collection includes blog posts and user comments crawled from a famous Chinese blog platform.Footnote 3 All crawled texts were published between Jan 1st, 2016 and May 1st, 2016. Each blog post is assigned to one of eight categories by its author, as illustrated in Figure 3a.

    Figure 3: a Categories of Sina blog posts; b Distribution of comment counts for NetEase news articles

All the datasets have been made public.Footnote 4 The raw texts are mainly written in Chinese, and we apply the following preprocessing procedure to obtain a clean corpus. Firstly, we remove non-Chinese characters and punctuation, and convert traditional Chinese characters to simplified ones. Secondly, we segment sentences into word sequences using the open source package NLPIR.Footnote 5 Finally, we remove stop words, low-frequency words, and normal documents followed by no short texts. Basic statistics of the two preprocessed datasets are listed in Table 3, including the numbers of normal documents and short comments, the average lengths of normal documents and short comments, and the number of unique Chinese words. Figure 3b also shows the distribution of the counts of short comments (in logarithm) following normal documents in the NetEase data. The distribution follows a power law with a heavy tail, indicating that while some news articles gain great popularity among readers, most are followed by only a few short comments.
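
The preprocessing pipeline can be sketched as follows; jieba stands in for the NLPIR segmenter purely for illustration, and the stop-word list and frequency threshold are assumptions.

```python
import re
from collections import Counter

import jieba  # stand-in segmenter; the paper uses NLPIR

def preprocess(raw_texts, stopwords, min_freq=5):
    """Keep Chinese characters, segment into words, drop stop words and rare words.
    (Traditional-to-simplified conversion, e.g. via OpenCC, is omitted here.)"""
    docs = []
    for text in raw_texts:
        text = re.sub(r"[^\u4e00-\u9fff]", " ", text)  # erase non-Chinese characters and punctuation
        docs.append([w for w in jieba.cut(text) if w.strip() and w not in stopwords])
    counts = Counter(w for doc in docs for w in doc)
    return [[w for w in doc if counts[w] >= min_freq] for doc in docs]
```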

Table 3 Basic statistics of NetEase dataset and Sina dataset

We first evaluate the batch COTM algorithm on a random sample of 10% of the normal documents and their corresponding comments from the NetEase and Sina collections, since both datasets are too large to be processed efficiently by batch algorithms. The online algorithm of COTM is then evaluated on the full NetEase and Sina collections. For the online algorithm, the time span of each dataset is equally divided into T = 20 time slices, with each slice roughly equal to 18 days for the NetEase data and 6 days for the Sina data.

Baseline methods

In COTM, the semantic meaning of normal documents is covered by the formal topics, while the semantic meaning of short texts is covered by both formal and informal topics. We compare the topics learned by COTM with those of the following state-of-the-art baselines.

  • LDA-P: the standard LDA model trained by Gibbs sampling and applied to pseudo-documents obtained by aggregating each normal document with its corresponding short texts.

  • LCTM-P: the LCTM model trained by Gibbs sampling and applied to word2vec representation [19] of the pseudo-documents obtained by aggregating each normal document with its corresponding short texts.

  • BTM-B: the standard BTM model trained by Gibbs sampling and applied to the corpus including both normal documents and short texts and treating them equally.

  • PTM-B: the standard PTM model trained by Gibbs sampling and applied to the corpus including both normal documents and short texts and treating them equally.

  • EXTM: the EXTM model trained by Gibbs sampling and applied to the corpus including both normal documents and short texts.

The online algorithm of COTM is also compared with online implementations of LDA [1] and BTM [5], since there are no online versions of the other alternatives:

  • oLDA-B: the online algorithm of LDA applied to the corpus including both normal documents and short texts and treating them equally. Here the algorithm uses the counts of words in topics at the current time slice to update parameters in priors for topics’ word probabilities (β) for the next time slice.

  • oLDA-S: the online algorithm of LDA applied to short texts.

  • oBTM-S: the online algorithm of BTMFootnote 6 applied to short texts. Here the algorithm fits a BTM model in each time slice, and uses the counts of topics in the corpus and the counts of words in topics at the current time slice to update parameters in priors for the corpus’ topic probabilities (α) and topics’ word probabilities (β) for the next time slice.

  • iBTM-S: the incremental algorithm of BTM applied to short texts. Here the algorithm updates prior parameters continuously whenever a piece of text arrives.

For the online LDA algorithm, we do not consider aggregating normal documents and their corresponding short texts into pseudo-documents, because a normal document and its corresponding short texts may not be published in the same time slice. We run online algorithms of BTM only on short texts, because BTM suffers from expensive computational cost and memory explosion when applied to normal documents.

To make fair comparisons, all the methods are implemented in C++, including the batch COTM algorithmFootnote 7 and the online COTM algorithm.Footnote 8

The hyperparameters of all baseline models and online implementations are set to their default values. For COTM, results obtained under various hyperparameter settings show little difference, and we set α = 0.5, β = 0.1, γ = 0.5 and 𝜖 = 0.5 for illustration. For all methods, Gibbs sampling is run for 1000 iterations, which is sufficient for convergence. The decay weight λ of all online methods is set to 1.

Measurements

Model performance is evaluated from two perspectives: the quality of the learned topics, and the quality of the topic representation of documents.

We use the coherence score [20] to measure the quality of topics learned by each method. Given a topic and its top L words \(\boldsymbol{V} = (v_{1},v_{2},...,v_{L})\) ordered by \(\boldsymbol{\phi}_{k}\) or \(\boldsymbol{\psi}_{j}\), the coherence score is defined as:

$$ \begin{array}{l} CS(\boldsymbol{V})={\sum}_{l=2}^{L}{\sum}_{l^{\prime}=1}^{l-1}log\frac{F(v_{l},v_{l^{\prime}})+1}{F(v_{l^{\prime}})}, \end{array} $$
(13)

where F(v) is the number of relevant documents containing word v, and \(F(v,v^{\prime })\) is the number of relevant documents containing both words v and \(v^{\prime }\). The underlying idea is that words belonging to the same topic tend to co-occur within the same document, so topics with higher coherence scores indicate a better method. Note that this definition is consistent with the basic assumption of BTM, i.e., words that co-occur more frequently are more likely to belong to the same topic, so BTM has an inherent advantage under this evaluation metric [5].
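
A minimal sketch of the coherence score in (13), assuming each relevant document is represented as a set of its unique words:

```python
import math

def coherence_score(top_words, doc_word_sets):
    """Coherence score (13) for one topic's top-L words, given the relevant documents
    as sets of words. Assumes every top word appears in at least one document."""
    def doc_freq(*words):
        return sum(1 for doc in doc_word_sets if all(w in doc for w in words))
    score = 0.0
    for l in range(1, len(top_words)):
        for lp in range(l):
            score += math.log((doc_freq(top_words[l], top_words[lp]) + 1) / doc_freq(top_words[lp]))
    return score
```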

To evaluate the quality of the topic representation of documents, we investigate how well the documents' topic probabilities discriminate documents in different clusters or classes. For LDA and LCTM, document d's topic proportions \(\boldsymbol{\theta}_{d}\) are used as features. For PTM, the topic proportions of each document are those of its associated pseudo-document. In BTM, the topic proportions of each document are derived using the topic indicators z [5, 27]; however, [14, 25] have shown that the post-inference method used to obtain documents' topic proportions is critical for downstream applications. For this reason, we do not include BTM models in the document clustering and classification comparisons. In COTM, the formal topic proportions \(\boldsymbol{\theta}_{d}\) are used as features when clustering and classifying normal documents. For each short text, a (K + J)-dimensional vector of pseudo topic proportions \(\boldsymbol {\widetilde {\theta }}_{dc}\) is created by setting the entry corresponding to the formal topic \(x_{dc}\) to \(p_{dc}\) and the entry corresponding to the informal topic \(y_{dc}\) to \(1 - p_{dc}\); these proportions are then used as features when clustering and classifying short texts, as sketched below. Similarly to COTM, for EXTM we use the proportions of master topics to classify normal documents and also convert the specific topic of each short text into a pseudo vector to classify short texts.
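
The construction of the pseudo topic proportions for a short text can be sketched as follows (the function and variable names are assumptions):

```python
import numpy as np

def short_text_features(x_dc, y_dc, p_dc, K, J):
    """(K+J)-dimensional pseudo topic proportions of one short text:
    p_dc on its formal topic, 1 - p_dc on its informal topic (offset by K), zeros elsewhere."""
    feat = np.zeros(K + J)
    feat[x_dc] = p_dc
    feat[K + y_dc] = 1.0 - p_dc
    return feat
```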

In document clustering, the K-means algorithm is run with different numbers of clusters, and the pseudo F index [4], which is the ratio of between-cluster variance to within-cluster variance, is used to evaluate clustering performance.

The pseudo F index is defined as follows. Let ||⋅|| denote the Euclidean distance, let \(\Omega_{g}\) denote the set of indices of documents in the g-th cluster, and let |Ω| denote the number of documents in cluster Ω. For now, let \(\boldsymbol{\theta}_{d}\) denote the general topic proportions of document d, including the case of pseudo topic proportions. For the g-th cluster, let \(\boldsymbol {\bar {\theta }}_{g}=\frac {1}{|{\Omega }_{g}|}{\sum }_{d \in {\Omega }_{g}}\boldsymbol {\theta }_{d}\) denote the average topic proportions of documents in this cluster, and let \(\boldsymbol {\bar {\theta }}=\frac {1}{D}{\sum }_{d=1}^{D}\boldsymbol {\theta }_{d}\) denote the average topic proportions of all documents. Then the within-group sum of squares SSW and the between-group sum of squares SSG are:

$$ \begin{array}{l} SSW={\sum}_{g=1}^{G}{\sum}_{d\in {\Omega}_{g}}||\boldsymbol{\theta}_{d}-\boldsymbol{\bar{\theta}}_{g}||^{2}, \end{array} $$
(14)
$$ \begin{array}{l} SSG={\sum}_{d=1}^{D}||\boldsymbol{\theta}_{d}-\boldsymbol{\bar{\theta}}||^{2}-SSW, \end{array} $$
(15)

where G is the number of clusters. Then the pseudo F index is calculated as

$$ \begin{array}{l} pseudo \ F=\frac{SSG/(G-1)}{SSW/(D-G)}. \end{array} $$
(16)

Larger values of the pseudo F index indicate that the clusters are better separated, implying that the topic representations of documents are of higher quality.
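
A minimal sketch of the clustering evaluation, with scikit-learn's KMeans standing in for the K-means implementation used in the experiments:

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_F(theta, n_clusters, seed=0):
    """Pseudo F index (14)-(16) for K-means clusters of the rows of theta (topic proportions)."""
    D = theta.shape[0]
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(theta)
    ssw = sum(((theta[labels == g] - theta[labels == g].mean(axis=0)) ** 2).sum()
              for g in range(n_clusters))
    ssg = ((theta - theta.mean(axis=0)) ** 2).sum() - ssw
    return (ssg / (n_clusters - 1)) / (ssw / (D - n_clusters))
```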

In document classification, we use the SVM classifier LIBLINEAR [8] with 10-fold cross validation. Methods achieving higher classification accuracy provide better topic representations of documents.
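
A minimal sketch of this classification setup, using scikit-learn's LinearSVC (which wraps LIBLINEAR) with 10-fold cross validation; the feature matrix and label vector are assumed to be given:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def classification_accuracy(features, labels):
    """Mean 10-fold cross-validation accuracy of a LIBLINEAR-backed linear SVM."""
    return cross_val_score(LinearSVC(), features, labels, cv=10, scoring="accuracy").mean()
```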

4.2 Evaluation of batch COTM

4.2.1 Evaluating formal topics

We first evaluate the quality of the learned formal topics, and then perform clustering and classification of normal documents to evaluate the quality of document representations based on formal topic proportions. Since BTM and PTM were originally proposed for short texts, here we only compare COTM with models designed for modeling normal documents, namely LDA-P, LCTM-P and EXTM.

Comparison of words under topics

In COTM, formal topics are mixtures of words from both normal documents and short texts. Since in LCTM-P topics are mixtures of concepts rather than words, we only compare the formal topics learned by COTM with those learned by LDA-P and EXTM. In this experiment, the numbers of formal (master) and informal (extended) topics for COTM (EXTM) are set to \(K_{COTM} = 100\) and \(J_{COTM} = 50\). To make a fair comparison, we set the number of topics for LDA-P to 150, since it is applied to pseudo-documents that include both normal documents and their corresponding short texts.

We randomly select three topics shared by the three methods for comparison, following the selection procedure of [3]. We first create for each method a topic word set consisting of the top five words with the highest probabilities under each topic. We then take the intersection of the three topic word sets, and finally randomly select three words from this intersection. For each selected word, we use the topics whose top five words include that word as illustrative examples.

For the NetEase data, the three selected words are “airplane”, “cellphone” and “student”. Table 4 shows the topics selected by the word “airplane” for the three methods. In the first row, which lists the top twenty words with the highest probabilities under the selected topics, we find “airplane”, “aviation” and “airport” among the top words of all three methods, indicating that all three topics discuss aviation. However, the top word set of LDA-P includes the words “legitimate” and “France”, which have little to do with aviation. The master topic of EXTM has even more irrelevant words, such as “Jackie-Chen”, “France” and “Germany”, which are names of superstars or countries. Since EXTM uses separate vocabularies for normal documents and short texts, master topics are represented only by words appearing in normal documents and are less influenced by short texts. The results for COTM are better than those for LDA-P and EXTM, since most of COTM's top words are closely related to aviation. Moreover, the formal topic in COTM is enhanced by including the words “China”, “design” and “aero-engine” from user comments, as the underdeveloped manufacturing of aero-engines in the Chinese aircraft industry has long been a hot issue among Chinese netizens.

Table 4 Topics selected by the word “airplane” in NetEase collection

Table 5 shows the topics selected by the word “cellphone” for the three methods. The top twenty words listed in the first row include “cellphone”, “Apple” and “Mi” (a Chinese mobile internet company), which indicates that all three topics discuss the mobile phone industry. Comparing LDA-P and EXTM, we find that the topic learned by LDA-P is more concentrated on attributes and brands of cellphones. This indicates that aggregating user comments with news articles enhances topic learning in LDA-P, whereas the use of separate vocabularies for normal documents and short texts leads EXTM to borrow less content information from short texts. COTM achieves results comparable to LDA-P, while also including words such as “Lenovo” (a Chinese technology company) and “Green” (a Chinese electric appliance company). Interestingly, Lenovo is a major brand in the Chinese cellphone market, and Green has recently released two unsuccessful cellphone models.

Table 5 Topics selected by the word “cellphone” in NetEase collection

From the above two comparisons, we find that the formal topics learned by LDA-P and COTM outperform those learned by EXTM. This indicates that, when most short texts are topically related to their corresponding normal documents, both LDA-P and COTM can enhance the learning of formal topics by using the whole vocabulary formed by normal documents and short texts. However, the situation is different when most short texts are topically irrelevant to the normal documents, as demonstrated below.

Table 6 shows the topics selected by the word “student” for the three methods. The top twenty words listed in the first row include “student”, “school” and “teacher”, which indicates that all three topics discuss school life. EXTM performs well, with only a few less relevant words, such as “American” and “management”. The results for LDA-P are worse than those for EXTM, with its top words including names of Chinese cities, provinces or companies, such as “Kunming”, “Suqian”, “Yunnan” and “Zhonghao”. This can be largely attributed to heavily spammed user comments following news articles about school life. In this case, using only the vocabulary formed by normal documents and modeling spam comments with extended topics leads to more prominent master topics in EXTM. COTM not only separates topically relevant content from spam content in user comments, but also enhances the learning of formal topics with relevant information from user comments. As a result, COTM achieves the best performance, with its top twenty words mostly related to school life.

Table 6 Topics selected by the word “student” in NetEase collection

For a high-quality topic, its non-top words should be as semantically related to its top words as possible. For the topics selected by “airplane”, “cellphone” and “student”, the non-top words ranked from 501 to 520 by probability are listed in the second rows of Tables 4, 5 and 6.

In Table 4, the non-top words for COTM are more related to aviation than those for LDA-P and EXTM. In Table 5, the non-top words for COTM reflect broader issues related to information technology and are more related to cellphones than those for LDA-P and EXTM. In Table 6, the superiority of COTM is even more obvious, with most of its non-top words relevant to school life.

Comparison of clustering and classification performance

To evaluate the quality of the topic representation of normal documents, we compare the performance of clustering normal documents using the topic proportions derived by LDA-P, LCTM-P, EXTM and COTM. As with LDA-P, we set the number of topics for LCTM-P to \(K_{LCTM-P} = 150\), since it is also applied to pseudo-documents that aggregate normal documents and their corresponding short texts. Figure 4a shows the pseudo F values of all these methods on the NetEase data under different numbers of clusters. LDA-P and LCTM-P achieve smaller pseudo F values than EXTM and COTM, indicating that indiscriminately including short comments via pseudo-documents weakens the topic representation of normal documents. In contrast, by separating topically relevant content from topically irrelevant content in short texts, COTM and EXTM achieve higher pseudo F values. Moreover, COTM performs better than EXTM by consistently obtaining prominent formal topics regardless of whether short texts are topically relevant to their corresponding normal documents.

Figure 4: Comparison of COTM with the LDA-P, LCTM-P and EXTM algorithms in a clustering NetEase news and b classifying Sina blog posts

Since the normal documents in the Sina dataset have class labels, we use document classification to further compare the topic representations of normal documents obtained by LDA-P, LCTM-P, EXTM and COTM. In this experiment, the number of formal (master) topics for COTM (EXTM) is set to 100, and the number of informal (extended) topics varies from 20 to 100 in steps of 20. For LDA-P and LCTM-P, the topic number is set to the sum of the numbers of formal and informal topics in COTM. Under each setting, we use the topic proportions of normal documents to classify blog posts into 8 categories. From the results shown in Figure 4b, both COTM and EXTM are superior to LDA-P and LCTM-P, and COTM consistently outperforms EXTM in all experimental settings.

4.2.2 Evaluating informal topics

To evaluate the quality of the informal topics learned by COTM, we compare against BTM-B and PTM-B, which have been shown to perform well on short texts [5, 30]. In the following evaluations, BTM-B, PTM-B and COTM are all applied to the corpus containing both news articles and short comments in the NetEase dataset. The topic numbers for COTM are \(K_{COTM} = 100\) and \(J_{COTM} = 50\), and those for BTM-B and PTM-B are set to 150.

Following the same strategy as above, two words, “judgement” and “nation”, are selected from the intersection of the topic word sets of BTM-B, PTM-B and COTM, and Table 7 shows the top twenty words of the correspondingly selected topics. In the first row of Table 7, the topics selected by the word “judgement” are related to the league matches of the Chinese Basketball Association (CBA). Comparing the informal topic of COTM with the matched topics of BTM-B and PTM-B, we find that the topic learned by COTM discovers more technical details of basketball playing. This difference can be attributed to the fact that informal topics in COTM may have higher probabilities on words that appear only in short texts. As a result, informal topics in COTM tend to reflect more flexible meanings, such as more details of the related issues or expressions of personal opinions. These characteristics are further validated in the second row of Table 7: the topics discovered by the three methods are all related to international relations, but the one extracted by COTM talks more about relations across the Taiwan strait, which have long been a hot issue among Chinese netizens.

Table 7 Topics selected by the words “judgement” and “nation” in the NetEase dataset

To further validate the characteristics of the informal topics discovered by COTM, Table 8 shows two unique topics discovered only by COTM. The first row represents a topic of rude talk, and the second row represents a topic of mutual judgements, i.e., users making judgements about each other. These two topics are commonly found in the comments on news articles, as users holding opposite viewpoints first argue with each other and the argument then evolves into mutual verbal abuse. The correlation coefficients of word probabilities (ϕ) between these two topics and the topics extracted by BTM-B and PTM-B are extremely low, indicating that these two topics have not been discovered by the other competitors.

Table 8 Unique informal topics discovered by COTM from the NetEase dataset

4.2.3 Overall evaluation of topics

After evaluating formal and informal topics separately, we use an automated metric, the coherence score, to evaluate the overall quality of the topics learned by COTM. In the experiments, the coherence scores of both formal and informal topics learned by COTM are compared with the scores of topics learned by LDA-P, BTM-B, PTM-B and EXTM. We do not report coherence scores for LCTM-P, since the topics extracted by this model are distributions over concepts, not words. The topic numbers for COTM are set to \(K_{COTM} = 100\) and \(J_{COTM} = 20, 40, 80\), separately. The numbers of master and extended topics in EXTM are the same as those for COTM. The corresponding topic numbers for LDA-P, BTM-B and PTM-B are set to 120, 140 and 180, respectively. From Table 9, we find that COTM and PTM-B achieve comparable results, with PTM-B outperforming COTM when the topic number is large (180) and COTM achieving slightly better results when the topic numbers are small (120 and 140). Besides, both PTM-B and COTM consistently outperform the other three models in all experimental settings.

Table 9 Average coherence scores of topics learned by COTM and its competitors. A larger coherence score indicates more coherent topics

To further explore the quality of the formal and informal topics learned by COTM, we calculate the average coherence scores of formal topics and informal topics separately; the results are reported as COTM-F and COTM-I in Table 9. We find that the average coherence scores of formal topics are higher than those of informal topics, and are even the highest in nearly all experimental settings. These findings indicate that COTM performs strongly in learning formal topics and less well in learning informal topics, and therefore achieves performance comparable to PTM-B when modeling both normal documents and short texts. This validates the basic assumption of COTM that borrowing external information from short texts helps improve the learning of formal topics. As for the informal topics, since they are formed only by short texts that are not strongly correlated with normal documents, the smaller amount of content information leads to their poorer performance.

4.2.4 Detection of spam short texts

In recent years, detecting spam reviews on the Web has gained great importance. The relevant text corpora often consist of co-occurring normal documents and their following short texts. For instance, products or services are often described by normal textual introductions on e-commerce websites, with each product description followed by a number of short buyer reviews. Spam detection can help filter out “untruthful reviews”, “brands reviews” and “non-reviews” of products or services [7], which is of great concern to manufacturers and retailers [6].

Exploring the topical relationships between normal documents and short texts also sheds light on spam detection. For example, EXTM uses a switch variable H for each short text to decide whether it discusses a slave topic or an extended topic; short texts classified as discussing extended topics can be regarded as spam. However, in real situations such as buyer reviews, short texts may not only discuss slave topics derived from normal product descriptions but also include additional personal opinions. In this scenario, using switch variables to simply classify short texts into two groups is too strong an assumption for modeling the topical meanings of short texts.

In a more natural way, COTM uses the association probabilities \(p_{dc}\) to describe the topical relationships between normal documents and their corresponding short texts. As shown in Figure 5, the association probabilities between NetEase news articles and their short reader comments vary widely between 0 and 1. Figure 5a shows the distribution of \(p_{dc}\) over the entire corpus, where several obvious peaks are visible. Figure 5b presents the \(p_{dc}\) distributions of 9 randomly sampled NetEase news articles and their corresponding reader comments, which show different patterns. We can therefore conclude that short texts have different topical relationships with their co-occurring normal documents, and we propose to use \(p_{dc}\) in COTM to automatically detect potential spam texts. Specifically, short texts with \(p_{dc}\) less than a certain threshold \(\widetilde {p}\) are not strongly semantically related to their corresponding normal documents, and can thus be classified as irrelevant short texts, which are often spam.

Figure 5: \(p_{dc}\) distributions for the NetEase news dataset and randomly sampled NetEase news articles, with \(K_{COTM} = 100\) and \(J_{COTM} = 20\)

To illustrate the ability of \(p_{dc}\) to detect spam, the top half of Table 10 shows a sampled news article reporting an accident that took place in Vietnam, and the bottom half shows sampled user comments following the article. All sampled comments are sorted by \(p_{dc}\) in descending order and further classified into relevant and irrelevant comments using the threshold \(\widetilde {p}=0.2\). Relevant comments discuss the original article and express preaching or feelings of sorrow, surprise or ridicule, whereas irrelevant comments tend to be random talk, verbal abuse or advertising. This example demonstrates that COTM can efficiently identify topically irrelevant comments and can potentially be used to detect spam reviews.
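
The thresholding rule described above can be sketched as follows; the threshold value 0.2 follows the example in Table 10, and the data layout is an assumption.

```python
def split_by_association(comments, p_dc, threshold=0.2):
    """Split the comments of one normal document into relevant and irrelevant (potential spam)
    groups by their association probabilities p_dc."""
    relevant = [c for c, p in zip(comments, p_dc) if p >= threshold]
    irrelevant = [c for c, p in zip(comments, p_dc) if p < threshold]
    return relevant, irrelevant
```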

Table 10 A sampled news article and its corresponding user comments

4.3 Evaluation of online COTM

4.3.1 Topic coherence

To evaluate the quality of the topics learned by the online algorithms, we compare the models' average coherence scores. We first compare oCOTM with oLDA-B, setting \(K_{oCOTM} = 100\) and \(K_{oLDA-B} = K_{oCOTM} + J_{oCOTM}\), where \(J_{oCOTM}\) is 20, 40 or 80. From the coherence scores shown in Table 11, we see that oCOTM achieves higher topic coherence scores than oLDA-B in all experimental settings on both datasets.

Table 11 Average coherence scores of topics learned by oLDA-B and oCOTM

We next use average coherence scores to compare the quality of the informal topics learned by oCOTM with the quality of the topics learned by oBTM-S and iBTM-S, setting \(K_{oBTM-S} = K_{iBTM-S} = J_{oCOTM}\). While the online BTM algorithms use biterms from the entire collection of short texts, the learning of informal topics in oCOTM only draws on a subset of the words in each short comment. This gives the BTM algorithms an additional inherent advantage, beyond the previously mentioned fact that BTM's basic assumption is consistent with the definition of the coherence score. Nonetheless, oCOTM still outperforms iBTM-S and oBTM-S in all experimental settings on both datasets, as shown in Table 12.

Table 12 Average coherence scores of topics learned by online BTM algorithms and informal topics learned by oCOTM

4.3.2 Document classification

We further evaluate the quality of topic representation of documents learned by oCOTM through document classification for the Sina dataset.

In the experiment of classifying normal documents, we set \(K_{oCOTM} = 100\), \(J_{oCOTM} = 50\) and \(K_{oLDA-B} = 150\), and use the topic proportions of normal documents as classification features. From the results shown in Figure 6a, the accuracy of oCOTM is higher than that of oLDA-B. In the experiment of classifying short texts, we set \(K_{oCOTM} = 100\) and \(J_{oCOTM} = K_{oLDA-S} = 50\). From the results shown in Figure 6b, oCOTM outperforms oLDA-S dramatically. Overall, oCOTM achieves the best performance in classifying both normal documents and short texts at all time slices.

Figure 6: Comparison of the classification performance of online algorithms on the Sina dataset

5 Conclusion

With the development of online services, co-occurring normal documents and short texts are becoming increasingly prevalent throughout the Internet. Conventional topic models designed for normal texts or short texts alone are not well suited to such corpora with co-occurring structure. In this paper, we propose a novel topic model, COTM, to handle this kind of text corpus. COTM directly exploits the co-occurring structure and uses information from both normal documents and short texts to learn topics in a mutually reinforcing way. We also introduce an online algorithm for COTM, referred to as oCOTM, to deal with large-scale datasets. Extensive experiments on the NetEase news and Sina blog datasets demonstrate that COTM outperforms several state-of-the-art models in various respects, including learning more prominent and comprehensive topics and producing better topic representations of documents. In addition, COTM can potentially be used for unsupervised detection of spam reviews.